## A simple variation on the Greenwald-Khanna stream summary

December 1, 2018

### 1 Problem

Consider an infinite stream of tuples $(x, w)$, where $x \in \mathcal{X}$, some totally ordered set, and $w \in \mathbb{R}^{+}$, the set of positive real numbers. The goal is to scan the stream and maintain a data structure which at any point permits $\epsilon$-approximate rank queries: given a query value $x$, return an estimate within $\epsilon W$ of the true rank $r(x) = \sum_{i : x_i \le x} w_i$, where $W$ is the total weight seen so far.

### 2 Naive solution

Let $1, 2, \ldots$ be the indices of the input stream, and assume that the data structure has been updated to account for the $n$th element. It is straightforward to maintain the quantity $W = \sum_{i=1}^{n} w_i$. Beyond that, the algorithm maintains a collection of tuples of the form $(x_j, r^{\min}_j, r^{\max}_j)$, kept sorted by $x_j$, where $r^{\min}_j$ and $r^{\max}_j$ bound the rank (cumulative weight) of $x_j$.

Assume that $n$ pairs have been encountered, so that $(x, w)$ is the current element. If $x = x_j$ for some tuple, one adds $w$ to $r^{\min}_j$ and $r^{\max}_j$, and for all $k > j$, adds $w$ to $r^{\min}_k$, and for all $k > j$, adds $w$ to $r^{\max}_k$. If $x \ne x_j$ for all $j$, with $x_j$ and $x_{j+1}$ at the indices of the immediately smaller and immediately larger elements, then one inserts the new tuple $(x, r^{\min}_j + w, r^{\max}_{j+1})$ (taking $r^{\max}_{j+1}$ before it is updated), and updates all other tuples as before.

The inclusive range of ranks of the $j$th tuple is $r^{\min}_j$ to $r^{\max}_j$; the bisected rank estimate is the midpoint $\tfrac{1}{2}(r^{\min}_j + r^{\max}_j)$. (The doubly-exclusive range of ranks, for values strictly between two stored elements, need not be nonempty.)

Our goal is to estimate the rank of a given element $x$ with error at most $\epsilon W$. To achieve this goal, we maintain an invariant: the maximum possible range of absolute ranks which could fall between two adjacent stored elements is $2\epsilon W$, i.e. $r^{\max}_{j+1} - r^{\min}_j \le 2\epsilon W$.1 Then, since $x$ either matches some element’s value (in which case the error bound is implied by the gap condition), or falls inside the gap between $x_j$ and $x_{j+1}$, the range of possible ranks has width at most $2\epsilon W$, and reporting its midpoint gives error at most $\epsilon W$. Note that this invariant is preserved when new elements are inserted, as either the band is split, or its width is preserved while $W$ only grows. To keep memory use and run-time low, we (eventually) delete all elements which can be deleted while preserving the invariant. (We delete node $j$ when $r^{\max}_{j+1} - r^{\min}_{j-1} \le 2\epsilon W$, so that afterwards the gap condition still holds.) The nontrivial part is proving that, for unit weights, this procedure obtains reasonable space bounds; for that, see Greenwald and Khanna, Thm 1.
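As a concreteness check, here is a minimal, unoptimized Python sketch of the naive summary above. The class and method names are illustrative, not from the text; each operation scans the tuple list, so per-update cost is linear in the number of stored tuples.

```python
import bisect

class NaiveSummary:
    """Sketch of the naive (x, rmin, rmax) summary of Section 2."""

    def __init__(self, eps):
        self.eps = eps
        self.W = 0.0       # total weight seen so far
        self.tuples = []   # [x, rmin, rmax] entries, sorted by x

    def insert(self, x, w):
        self.W += w
        xs = [t[0] for t in self.tuples]
        j = bisect.bisect_left(xs, x)
        if j < len(self.tuples) and self.tuples[j][0] == x:
            # x matches a stored value: ranks at and above it shift by w
            for k in range(j, len(self.tuples)):
                self.tuples[k][1] += w
                self.tuples[k][2] += w
        else:
            # x falls in a gap: bounds come from the neighboring tuples
            rmin = (self.tuples[j - 1][1] if j > 0 else 0.0) + w
            rmax = (self.tuples[j][2] + w) if j < len(self.tuples) else self.W
            for k in range(j, len(self.tuples)):
                self.tuples[k][1] += w
                self.tuples[k][2] += w
            self.tuples.insert(j, [x, rmin, rmax])
        self.compress()

    def compress(self):
        # delete node j while the post-deletion gap condition
        # rmax_{j+1} - rmin_{j-1} <= 2*eps*W holds (min/max are kept)
        j = 1
        while j < len(self.tuples) - 1:
            if self.tuples[j + 1][2] - self.tuples[j - 1][1] <= 2 * self.eps * self.W:
                del self.tuples[j]
            else:
                j += 1

    def query_rank(self, x):
        # midpoint of the possible rank range; error is at most eps*W
        lo, hi = 0.0, self.W
        for xj, rmin, rmax in self.tuples:
            if xj <= x:
                lo = rmin
            if xj >= x:
                hi = rmax
                break
        return 0.5 * (lo + hi)
```

For example, feeding in unit-weight values $1, \ldots, 100$ with $\epsilon = 0.1$ keeps only a handful of tuples while any rank query stays within $\epsilon W = 10$ of the truth.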

### 3 Partial-sum decomposition

This is equivalent to the naive representation; conversion between the two is straightforward, and operations on the converted form have reduced cost.

Instead of storing tuples $(x_j, r^{\min}_j, r^{\max}_j)$, store $(x_j, g_j, \Delta_j)$, with tuples in sorted order by $x_j$, where the properties that $r^{\min}_j = \sum_{k \le j} g_k$ and $r^{\max}_j = r^{\min}_j + \Delta_j$ are maintained.

Inserting a tuple equates to inserting $(x, g, \Delta)$ in sorted position; the partial sums of all later tuples shift implicitly, so no other entries need to be touched.

To delete $(x_j, g_j, \Delta_j)$, we remove it and increment its immediate neighbor’s coordinates to compensate, i.e., $g_{j+1} \mathrel{+}= g_j$ and $\Delta_{j+1} \mathrel{+}= \Delta_j$.

Useful interpretations of these coordinates:

• $x_j$ is a sampled item
• $r^{\min}_j = \sum_{k \le j} g_k$ is the (minimum known) cumulative weight at item $x_j$
• $\Delta_j$ is the maximum weight which can not, based on the current set of tuples, be ruled out from having been associated with a hypothetical value immediately less than $x_j$
• $\Delta_{j+1}$ plays the same role for $x_j$, with less replaced by greater
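The conversion between the two representations and the local delete can be sketched as follows. The folding of $\Delta_j$ into $\Delta_{j+1}$ on deletion is an assumption consistent with the interpretations above: the partial sums of later tuples are preserved exactly, while the upper bounds may loosen but remain valid. All function names are illustrative.

```python
def to_decomposed(naive):
    """Convert sorted (x, rmin, rmax) tuples to (x, g, delta) form."""
    out, prev = [], 0.0
    for x, rmin, rmax in naive:
        out.append((x, rmin - prev, rmax - rmin))
        prev = rmin
    return out

def to_naive(dec):
    """Invert to_decomposed by accumulating the partial sums."""
    out, acc = [], 0.0
    for x, g, delta in dec:
        acc += g
        out.append((x, acc, acc + delta))
    return out

def delete_at(dec, j):
    """Delete tuple j (not the last), folding its coordinates into its
    successor: g_{j+1} += g_j keeps later partial sums unchanged, and
    delta_{j+1} += delta_j keeps the upper bounds valid."""
    _, g, delta = dec[j]
    xs, gs, ds = dec[j + 1]
    dec[j + 1] = (xs, gs + g, ds + delta)
    del dec[j]
```

Round-tripping through `to_decomposed` and `to_naive` is the identity; after a delete, the surviving ranks’ lower bounds are unchanged while the successor’s upper bound loosens by $\Delta_j$.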

#### 3.1 Merging estimators

We can combine different estimators by merging their partial-sum decomposed representations together, adding the corresponding coordinates when entries from different sources share the same $x$. This preserves relative uncertainty bounds.

Performing compression operations thereafter will reduce the result to an $\epsilon$-approximate quantile summary.
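A sketch of the merge, assuming both inputs are lists of $(x_j, g_j, \Delta_j)$ tuples sorted by $x$ (`merge_decomposed` is an illustrative name):

```python
import heapq

def merge_decomposed(a, b):
    """Merge two decomposed summaries (each sorted by x), summing the
    g and delta coordinates of entries that share the same x."""
    merged = []
    for x, g, d in heapq.merge(a, b):
        if merged and merged[-1][0] == x:
            _, pg, pd = merged[-1]
            merged[-1] = (x, pg + g, pd + d)
        else:
            merged.append((x, g, d))
    return merged
```

Since the $g$ coordinates simply add, the merged summary’s total weight is the sum of the inputs’ totals, as required.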

### 4 Implementations

The specific implementations affect the run-time; however, de-amortization ensures worst-case (rather than amortized) update time. Depending on the data structure, it may take $O(\log s)$ time to query a single approximate rank, or $O(s)$ time to uncompress the data structure, after which queries resolve to binary search; here $s$ is the number of stored tuples.

• A randomized skip list can be used to store the sorted list of all tuples at the base level; each higher level summarizes the level below, so that the partial sum $r^{\min}_j = \sum_{k \le j} g_k$ can be computed in $O(\log s)$ expected time, by adding the highest-level $g$ terms possible on the path from the top-left node to the $j$th base node.

New entries are inserted by locating the greatest $x_j$ where $x_j \le x$. If $x_j = x$, then $w$ can be added to $g_j$. Otherwise, the neighboring partial sums and slacks can be computed, in logarithmic time, so that one can check whether a newly added tuple would immediately be deleted. (For instance, with uniform quantiles, one might immediately delete when $w + g_{j+1} + \Delta_{j+1} \le 2\epsilon W$.) If the new tuple is not deleted, an element should be inserted between $x_j$ and $x_{j+1}$ at the base level and (depending on skip list rules) at some number of levels above. In levels where a new node is not inserted, the $g$ of the node covering the insertion point should be incremented by $w$.

Elements below the minimum or above the maximum stored element are always inserted; the old minimum or maximum should then be checked for deletion.

Although a scheme for non-local deletions is likely possible (by tracking the minimum amounts added to the left and right which could induce a deletion), selecting a random base element and testing it for deletion at each phase likely suffices.

This method has been implemented, and is somewhat slow; for contrast, at 250 items an equivalent 1st-order interpolated histogram is faster. Contact mailbox xslr for source code.

Example data structure diagram (omitted here), produced from a mixed delta and smooth distribution, trending from left to right in time.

• A batch approach stores the tuples as a sorted array, and queues new elements in a priority heap. Periodically, one merges a sorted list produced from the heap with the sorted array. The now much larger list can be scanned linearly (with $r^{\min}$ and $r^{\max}$ updated incrementally) and tuples can be marked for deletion (based on the current state of the list). Finally, one removes every second deletion-marked tuple, moving symmetrically in from the left and the right.

It’s possible to amortize the batch operation by performing the work in fixed-size chunks. An additional heap can be used to queue items that arrive during the batch operation; alternatively, the existing heap can store the new items as well, e.g. by flipping the sign of the weights of incoming elements and adjusting the heap sort order to prefer the old items.
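The linear compression pass of the batch approach can be sketched as follows; `batch_compress` is an illustrative name, the gap test mirrors the deletion condition of Section 2, and removing only every second marked tuple guarantees no two adjacent tuples are removed, so each mark remains valid when applied.

```python
def batch_compress(dec, eps, W):
    """One linear compression pass over a decomposed summary: mark
    tuples whose deletion keeps the gap condition, remove every second
    marked tuple, and fold removals into each surviving successor."""
    # prefix sums: rmin[j] is r^min of tuple j-1 (rmin[0] = 0)
    rmin = [0.0]
    for _, g, _ in dec:
        rmin.append(rmin[-1] + g)
    marked = []
    for j in range(1, len(dec) - 1):
        # width of the merged gap if tuple j were deleted:
        # rmax_{j+1} - rmin_{j-1}
        if (rmin[j + 2] + dec[j + 1][2]) - rmin[j] <= 2 * eps * W:
            marked.append(j)
    doomed = set(marked[::2])   # no two doomed indices are adjacent
    out, carry_g, carry_d = [], 0.0, 0.0
    for j, (x, g, d) in enumerate(dec):
        if j in doomed:
            carry_g, carry_d = carry_g + g, carry_d + d
        else:
            out.append((x, g + carry_g, d + carry_d))
            carry_g = carry_d = 0.0
    return out
```

Because each removed tuple’s coordinates are folded into its surviving successor, the total weight (sum of $g$) is preserved by the pass.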

#### 4.1 Misc

Sometimes the lists are produced in e.g. double precision format, but stored in single precision. If this conversion results in several adjacent entries with identical $x$, then their associated $g$ and $\Delta$ values should be added to form a combined entry. (Note that there are cases in which the left and right slack can be guaranteed to belong to $x_j$ itself, e.g. when $x_j$ is the least or greatest value.)
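A sketch of the precision-narrowing merge (names illustrative; single precision obtained via a `struct` round-trip):

```python
import struct

def to_single(x):
    """Round-trip a float through IEEE-754 single precision."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

def narrow_and_merge(dec):
    """Narrow each x to single precision; adjacent entries that collide
    onto the same x are combined by summing their g and delta values."""
    out = []
    for x, g, d in dec:
        x = to_single(x)
        if out and out[-1][0] == x:
            _, pg, pd = out[-1]
            out[-1] = (x, pg + g, pd + d)
        else:
            out.append((x, g, d))
    return out
```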

### 5 Analysis

Greenwald–Khanna is equivalent to maintaining an $\epsilon$-approximate quantile summary.

This is a strict refinement of Greenwald–Khanna, in that the rank uncertainty provided by this method is bounded above by the estimate used by their algorithm. Operation run-time and space usage are thus the same.

See Cormode et al. for biased quantile estimation – instead of using the invariant that $g_j + \Delta_j \le 2\epsilon W$, one could use some rank-dependent function on the right-hand side, i.e. $g_j + \Delta_j \le f(r^{\min}_j, W)$, to concentrate elements at low percentiles.

### 6 Generalizations

• For the circular case, one fixes the first element in the stream, and defines a total order over all $x$ using the interval excluding it. Alternatively, without such a reference, one can record the total contribution in the band between any two stored elements; this uses $O(s_t^2)$ memory, where $s_t$ is the number of stored comparison elements at time $t$. Update time is unaffected. This may be more useful for more complicated range queries.
• Higher-dimensional cases permit the same partial-sum decomposition. In the general $d$-dimensional case, $2^d$ quantities per element would be required to account for a new pair.

### 7 Significance

The problem occurs, for instance, in Monte Carlo simulations estimating a temporal distribution from statistically/color/energy/otherwise weighted samples drawn from it. This model variation is slightly more intuitive (requires one less type of approximation) than the Greenwald-Khanna algorithm. This is important since Monte Carlo estimation already involves a complicated set of underlying assumptions.2

### References



Graham Cormode et al. “Effective computation of biased quantiles over data streams”. In: Proceedings of the 21st International Conference on Data Engineering (ICDE 2005). IEEE, 2005, pp. 20–31.



Michael Greenwald and Sanjeev Khanna. “Space-efficient Online Computation of Quantile Summaries”. In: SIGMOD Rec. 30.2 (May 2001), pp. 58–66. issn: 0163-5808. doi: 10.1145/376284.375670. url: https://doi.acm.org/10.1145/376284.375670.

1This should not be confused with the maximum width between two elements, which is a local property.

2The conceptually simpler histogram is not as memory efficient.