Consider an infinite stream of tuples $(v_i, w_i)$, where $v_i \in \mathcal{V}$, some totally ordered set, and $w_i \in \mathbb{R}^{+}$, the set of positive real numbers. The goal is to scan the stream and maintain a data structure which at any point permits $\epsilon$-approximate queries for the weighted rank $r(v) = \sum_{i : v_i \le v} w_i$.
Let $1, 2, \ldots$ be the indices of the input stream, and assume that the data structure has been updated to account for the $n$th element. It is straightforward to maintain the quantity $W_n = \sum_{i=1}^{n} w_i$. Beyond that, the algorithm maintains a collection of tuples of the form $(v_j, a_j, b_j)$, kept sorted by value, where $a_j$ and $b_j$ are lower and upper bounds on the weighted rank of $v_j$.
Assume that $n$ pairs have been encountered, so that $(v, w) = (v_n, w_n)$ is the current element. If $v = v_j$ for some tuple, one adds $w$ to $a_j$ and $b_j$, and for all $k > j$, adds $w$ to $a_k$, and for all $k > j$, adds $w$ to $b_k$. If $v_j < v < v_{j+1}$, with $a_j$ and $b_{j+1}$ the bounds at the indices of the immediately smaller and larger elements, then one inserts the new tuple $(v, a_j + w, b_{j+1} + w)$ (using the pre-update bounds), and updates all other tuples as before.
The inclusive range of ranks of the $j$th tuple is $a_j$ to $b_j$; the bisected rank range of the gap between the $j$th and $(j{+}1)$th tuples is $a_j$ to $b_{j+1}$. (The doubly-exclusive range of ranks, from $b_j$ to $a_{j+1}$, need not be nonempty.)
Our goal is to estimate the rank of a given element with error at most $\epsilon W_n$. To achieve this goal, we maintain an invariant: that the maximum possible range of absolute ranks which could fall between two adjacent stored elements, $b_{j+1} - a_j$, is at most $2\epsilon W_n$.^{1}
Then any query can be answered with error at most $\epsilon W_n$: the queried value either matches some stored element's value $v_j$ (in which case $b_j - a_j \le 2\epsilon W_n$ is implied by the gap condition), or falls inside the gap between $v_j$ and $v_{j+1}$, where the range of possible ranks, $b_{j+1} - a_j$, is at most $2\epsilon W_n$; reporting the midpoint of that range then errs by at most $\epsilon W_n$. Note that this invariant is preserved when new elements are inserted, as each band is either split or preserved. To keep memory use and run-time low, we (eventually) delete all elements which can be deleted while preserving the invariant. (We delete the $j$th tuple when $b_{j+1} - a_{j-1} \le 2\epsilon W_n$, so that afterwards the gap condition still holds.) The nontrivial part is proving that this procedure obtains reasonable space bounds; for unit weights $w_i = 1$, see [2], Thm. 1.
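To make this concrete, here is a minimal Python sketch of the naive representation: sorted tuples $(v, a, b)$, the conservative insertion bound $(v, a_j + w, b_{j+1} + w)$, eager compression, and midpoint queries. All names are illustrative, not a reference implementation:

```python
import bisect

class NaiveWeightedGK:
    """Sketch of the naive representation: a list of tuples (v, a, b),
    sorted by v, where [a, b] bounds the weighted rank of v (the total
    weight of stream elements <= v)."""

    def __init__(self, eps):
        self.eps = eps
        self.W = 0.0   # total weight W_n seen so far
        self.t = []    # tuples (v, a, b), sorted by v

    def insert(self, v, w):
        self.W += w
        j = bisect.bisect_left([x[0] for x in self.t], v)
        exact = j < len(self.t) and self.t[j][0] == v
        if not exact:
            # min rank: predecessor's min rank plus the new weight;
            # max rank: successor's pre-update max rank plus the new
            # weight (a conservative, always-valid bound)
            a_new = (self.t[j - 1][1] if j > 0 else 0.0) + w
            b_new = (self.t[j][2] if j < len(self.t) else self.W - w) + w
        # every tuple at or after position j gains weight w below it
        for k in range(j, len(self.t)):
            vk, a, b = self.t[k]
            self.t[k] = (vk, a + w, b + w)
        if not exact:
            self.t.insert(j, (v, a_new, min(b_new, self.W)))

    def compress(self):
        # delete tuple j whenever the merged band [a_{j-1}, b_{j+1}]
        # still fits within the 2*eps*W gap condition
        j = 1
        while j < len(self.t) - 1:
            if self.t[j + 1][2] - self.t[j - 1][1] <= 2 * self.eps * self.W:
                del self.t[j]
            else:
                j += 1

    def rank(self, v):
        # midpoint of the possible rank range; error <= eps * W
        lo, hi = 0.0, self.W
        for (vj, a, b) in self.t:
            if vj < v:
                lo = a
            elif vj == v:
                return (a + b) / 2.0
            else:
                hi = b
                break
        return (lo + hi) / 2.0
```

The sketch runs `compress` eagerly; a real implementation might compress periodically instead, trading transiently larger memory for fewer passes.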
The following partial-sum representation is equivalent to the naive one; conversion between the two is straightforward, and operations on the converted form have reduced cost.
Instead of storing tuples $(v_j, a_j, b_j)$, store $(v_j, g_j, \Delta_j)$, in sorted order by $v_j$, where the properties that $a_j = \sum_{k \le j} g_k$ and $b_j = a_j + \Delta_j$ are maintained.
Inserting a new element $(v, w)$ between the $j$th and $(j{+}1)$th tuples equates to inserting the tuple $(v, w, g_{j+1} + \Delta_{j+1})$ in sorted position; the partial sums then shift all later absolute ranks by $w$ automatically.
To delete the $j$th tuple, we remove it and increment its immediate successor to compensate, i.e., $g_{j+1} \leftarrow g_{j+1} + g_j$, and leave $\Delta_{j+1}$ unchanged.
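In this form deletion is purely local. A small sketch (field names are mine), with a converter back to absolute bounds showing that the survivors' rank bounds are unchanged:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Entry:
    v: float   # stored value
    g: float   # a_j - a_{j-1}: weight known to lie in (v_{j-1}, v_j]
    d: float   # b_j - a_j: uncertainty in the rank of v_j

def to_absolute(entries: List[Entry]):
    """Recover the naive (v, a, b) tuples by prefix-summing g."""
    out, a = [], 0.0
    for e in entries:
        a += e.g
        out.append((e.v, a, a + e.d))
    return out

def delete(entries: List[Entry], j: int):
    """Delete entry j (0 < j < len-1), folding its g into the
    successor so that all other absolute ranks are unchanged."""
    entries[j + 1].g += entries[j].g
    del entries[j]
```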
Useful interpretations of these coordinates: $g_j = a_j - a_{j-1}$ is the total weight known to lie in the half-open interval $(v_{j-1}, v_j]$, while $\Delta_j = b_j - a_j$ is the uncertainty in the rank of $v_j$.
We can combine different estimators by merging their partial-sum decomposed representations together, adding the corresponding $g$ values when the same value appears in multiple sources. This preserves relative uncertainty bounds, since both the widths of the rank ranges and the total weights add.
Performing compression operations thereafter will reduce the merged structure to an $\epsilon$-approximate quantile summary.
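One way to realize such a merge is in the naive $(v, a, b)$ coordinates: for every value stored in either source, add its rank bounds under both summaries. This is a sketch of that idea, with helper names of my own choosing, not necessarily the exact procedure intended above:

```python
def bracket(tuples, v, W):
    """Rank bounds (lo, hi) for v under one summary, given its
    sorted (value, a, b) tuples and total weight W."""
    lo = 0.0
    for (vj, a, b) in tuples:
        if vj < v:
            lo = a
        elif vj == v:
            return a, b
        else:
            return lo, b
    return lo, W

def merge(t1, W1, t2, W2):
    """Merged naive-form summary: for every stored value, sum its
    rank bounds under both sources. Interval widths add, so two
    eps-approximate inputs yield an eps-approximate output for the
    combined stream."""
    out = []
    for v in sorted({x[0] for x in t1} | {x[0] for x in t2}):
        lo1, hi1 = bracket(t1, v, W1)
        lo2, hi2 = bracket(t2, v, W2)
        out.append((v, lo1 + lo2, hi1 + hi2))
    return out
```

Since each bracketing interval has width at most $2\epsilon W_i$, the merged widths are at most $2\epsilon (W_1 + W_2)$.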
The specific implementations affect the run-time; however, de-amortization ensures $O(\log s)$ update time, where $s$ is the number of stored tuples. Depending on the data structure, it may take $O(\log s)$ time to query a single approximate rank, or $O(s)$ time to uncompress the data structure, after which queries resolve to binary search.
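Assuming the partial-sum $(v_j, g_j, \Delta_j)$ form, the uncompress-then-binary-search query path might look like this (helper names are mine):

```python
import bisect

def uncompress(entries):
    """O(s) pass: prefix-sum g into absolute rank bounds."""
    vs, los, his, a = [], [], [], 0.0
    for (v, g, d) in entries:
        a += g
        vs.append(v)
        los.append(a)
        his.append(a + d)
    return vs, los, his

def approx_rank(vs, los, his, W, q):
    """Binary-search the uncompressed arrays; midpoint estimate."""
    i = bisect.bisect_left(vs, q)
    if i < len(vs) and vs[i] == q:
        return (los[i] + his[i]) / 2.0
    lo = los[i - 1] if i > 0 else 0.0
    hi = his[i] if i < len(vs) else W
    return (lo + hi) / 2.0
```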
New entries are inserted by locating the greatest stored $v_j$ with $v_j \le v$. If $v = v_j$, then $w$ can be added to $g_j$. Otherwise, $a_j$ and $b_{j+1}$ can be computed, in logarithmic time, so that one can check whether a newly added tuple would be immediately deleted. (For instance, with uniform quantiles, one might immediately delete when $w + g_{j+1} + \Delta_{j+1} \le 2\epsilon W_n$.) If the new tuple would not be deleted, an element should be inserted between $v_j$ and $v_{j+1}$ at the base level and (depending on skip-list rules) at some number of levels above. In levels where a new node is not inserted, the covering node's aggregate $g$ should be incremented by $w$.
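The would-it-be-deleted check can be sketched as a small decision helper, assuming the uniform-quantile deletion rule quoted above; the function name and return convention are mine:

```python
def classify_insert(v, w, succ_g, succ_d, eps, W):
    """Decide how to handle a new element (v, w) landing strictly
    between two stored values, given the successor's (g, d).
    If even with its own weight w the would-be tuple's band fits
    within 2*eps*W, it would be deleted immediately, so fold w into
    the successor's g instead; otherwise insert (v, w, g + d)."""
    if w + succ_g + succ_d <= 2 * eps * W:
        return ("fold", succ_g + w)          # new successor g
    return ("insert", (v, w, succ_g + succ_d))
```

Folding $w$ into the successor's $g$ is exactly the deletion rule applied to the would-be tuple, so the rank bounds stay valid.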
Elements below the minimum or above the maximum stored element are always inserted; the old minimum or maximum should then be checked for deletion.
Although a scheme for non-local deletions is likely possible (by tracking the minimum amounts added to the left and right which could induce a deletion), selecting a random base element and testing it for deletion at each phase likely suffices.
This method has been implemented, and is somewhat slow (e.g. for 250 items, where for contrast an equivalent first-order interpolated histogram takes ). Contact mailbox xslr for source code. Example data structure diagram (not reproduced here), produced from a mixed delta and smooth distribution, trending from left to right in time.
It's possible to amortize the batch operation by performing its work in bounded-size chunks. An additional heap can be used to queue items that arrive during the batch operation; alternatively, the existing heap can store the new items as well, e.g. by flipping the sign of the weights of incoming elements and adjusting the heap sort order to prefer the old items.
Sometimes the lists are produced in e.g. double precision format, but stored in single precision. If this conversion results in several adjacent entries with identical $v$, then their associated $g$ values should be added to form a combined entry. (Note that there are cases in which the left and right rank bounds can be guaranteed to belong to $v$ itself, e.g. when $v$ is the least or greatest value.)
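A sketch of this duplicate-collapsing pass in the $(v, g, \Delta)$ form, summing the $g$ values as described; keeping the final duplicate's $\Delta$ is my choice, made so that the combined entry's absolute upper rank bound is preserved under the prefix-sum semantics:

```python
def collapse_duplicates(entries):
    """Merge adjacent entries whose values became identical after a
    precision-narrowing conversion. The g values add (all of that
    weight is <= v), and the last duplicate's d survives, keeping
    the combined entry's upper rank bound b unchanged."""
    out = []
    for (v, g, d) in entries:
        if out and out[-1][0] == v:
            _, pg, _ = out[-1]
            out[-1] = (v, pg + g, d)
        else:
            out.append((v, g, d))
    return out
```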
Greenwald-Khanna is equivalent to maintaining an $\epsilon$-approximate quantile summary.
This is a strict refinement of Greenwald-Khanna, in that the rank uncertainty provided by this method is bounded above by the estimate used in their algorithm. Operation run-time and space usage are thus the same.
See [1] for biased quantile estimation: instead of using the invariant that $g_j + \Delta_j \le 2\epsilon W_n$, one could use some rank-dependent function on the right-hand side, e.g. $g_j + \Delta_j \le 2\epsilon\, r(v_j)$, to concentrate elements at low percentiles.
The problem occurs, for instance, in Monte Carlo simulations estimating a temporal distribution from statistically/color/energy/otherwise weighted samples drawn from it. This model variation is slightly more intuitive than the Greenwald-Khanna algorithm, as it requires one less type of approximation. This is important since Monte Carlo estimation already involves a complicated set of underlying assumptions.^{2}
[1] Graham Cormode et al. “Effective computation of biased quantiles over data streams”. In: Proceedings of the 21st International Conference on Data Engineering (ICDE 2005). IEEE, 2005, pp. 20–31.
[2] Michael Greenwald and Sanjeev Khanna. “Space-efficient Online Computation of Quantile Summaries”. In: SIGMOD Rec. 30.2 (May 2001), pp. 58–66. issn: 0163-5808. doi: 10.1145/376284.375670. url: https://doi.acm.org/10.1145/376284.375670.
^{1}This should not be confused with the maximum width between two elements, which is a local property.
^{2}The conceptually simpler histogram is not as memory efficient.