A simple variation on the Greenwald-Khanna stream summary


December 1, 2018

1 Problem

Consider an infinite stream of (yt,wt) tuples, where xiU, some totally ordered set, and wiR+, the set of positive real numbers. The goal is scan the stream and maintain a data structure which at any point permits -approximate rank queries.

2 Naive solution

Let t=1,2... be the indices of the input stream, and assume that the data-structure has been updated to account for the Tth element. It is straightforward to maintain the quantity WT=Tt=1wt. Beyond that, the algorithm maintains a collection of tuples of the form (xi,si,Li,Ri)nti=1.

Assume that T1 pairs have been encountered, so that (yT,wT) is the current element. If yT=xi for some tuple based on xi, one adds wT to si, and for all j<i, adds wT to Rj, and for all j>i, adds wT to Lj. If yT=xi for all i, and i and i+ at the indices of the immediate smaller and and larger elements, then one inserts the new element yT,wT,Li+si,Ri++si+, and updates all other xi-tuples as before.

The inclusive range of ranks of the ith tuple is Li to WTRi; the bisected rank range is Li+si/2 to WTRisi/2. (The doubly-exclusive range of ranks need not be nonempty.)

Our goal is to estimate the rank of a given element y with error at most WT. To achieve this goal, we maintain an invariant: that the maximum possible range of absolute ranks which could fall between two elements is g(i,i+1):=WTRi+1Lisisi+1WT. 1

diagram

Then, since y either matches some element’s x value (in which case WTRiLisi<WT is implied by the gap condition), or y falls inside the gap between xi and xi+1, and the range of possible values is <WT. Note that this invariant is preserved when new elements are inserted, as either the band is split, or g(i,i+1) is preserved. To keep memory use and run-time low, we (eventually) delete all elements which can be deleted while preserving the invariant. (We delete node i when WTRi+1Li1si1si+1<WT, so that afterwards the gap condition still holds.) The nontrivial part is proving that, for weights where 0<infwt<supwt<, that this procedure obtains reasonable space bounds, and for that, see [2] Thm 1.

3 Partial-sum decomposition

This is equivalent to the naive representation; conversion between the two is straightforward; operations on the converted form have reduced cost.

Instead of storing tuples (xi,si,Li,Ri), store (xi,si,li,ri), with i=1...nT in sorted order by xi, where the properties that i1j=1sj+ij=1lj=Li and ntj=i+1sj+ntj=irj=Ri are maintained.

Inserting a tuple equates to inserting (yT,wT,0,0) in sorted position.

To delete (xi,si,li,ri), we remove it and increment its immediate neighbors to compensate, i.e, li+1+=li+si, and ri1+=ri+si.

Useful interpretations of these coordinates:

3.1 Merging estimators

We can combine m different estimators by merging their partial-sum decomposed representations together, adding corresponding si,li,ri values when multiple xi match between different sources. This preserves relative uncertainty bounds.

Performing compression operations thereafter will reduce to an -approximate quantile summary.

4 Implementations

The specific implementations affect the run-time; however, de-amortization ensures O(lognt) update time. Depending on the data structure, it may take O(lognt) to query a single approximate rank, or O(nt) to uncompress the data structure, after which queries resolve to O(lognt) binary search.

4.1 Misc

Sometimes the (x,s,l,r)i lists are produced in e.g. double precision format, but stored in single precision. If this conversion results in several adjacent entries with identical x, then their associated s,l,r values should be added to form a combined entry. (Note that there are cases in which the left r and right l values can be guaranteed to belong to x itself, e.g. when x is the least or greatest value.)

5 Analysis

Greenwald-Khanna is equivalent to maintaining an -approximate quantile summary.

This is a strict refinement to Greenwald-Khanna, in that the rank uncertainty provided by this method is bounded above by that estimate used by their algorithm. Operation run-time and space usage are thus the same.

See [1] for biased quantile estimation – instead of using the invariant that g(i,i+1)<WT, one could use some rank-dependent function on the right hand side, i.e. Li, to concentrate elements at low percentiles.

6 Generalizations

7 Significance

The problem occurs, for instance, in Monte Carlo simulations estimating a temporal distribution from statistically/color/energy/otherwise weighted samples drawn from it. This model variation is slightly more intuitive (requires one less type of approximation) than the Greenwald-Khanna algorithm. This is important since Monte Carlo estimation already involves a complicated set of underlying assumptions.2

References

[1]

Graham Cormode et al. “Effective computation of biased quantiles over data streams”. In: Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on. IEEE. 2005, pp. 20–31.

[2]

Michael Greenwald and Sanjeev Khanna. “Space-efficient Online Computation of Quantile Summaries”. In: SIGMOD Rec. 30.2 (May 2001), pp. 58–66. issn: 0163-5808. doi: 10.1145/376284.375670. url: https://doi.acm.org/10.1145/376284.375670.

1This should not be confused with the maximum width between two elements, which is a local property.

2The conceptually simpler histogram is not as memory efficient.


Return to main page | as PDF