Estimating the Number of Distinct Elements in a Stream

You can read the notes from the previous lecture of Chandra Chekuri's course on Basics of Probability, Probabilistic Counting, and Reservoir Sampling here.

1. Estimating Frequency Moments in Streams

A significant fraction of the streaming literature is on the problem of estimating frequency moments. Let $\sigma = a_1, a_2, \ldots, a_m$ be a stream of numbers where each $a_i$ is an integer between $1$ and $n$. We will try to stick to the notation of using $m$ for the length of the stream and $n$ for the range of the integers[1]. Let $f_i$ be the number of occurrences (or frequency) of integer $i$ in the stream. We let $\mathbf{f} = (f_1, f_2, \ldots, f_n)$ be the frequency vector for a given stream $\sigma$. For $k \geq 0$, $F_k(\sigma)$ is defined to be the $k$'th frequency moment of $\sigma$:
$$F_k = \sum_i f_i^k.$$
We will discuss several algorithms to estimate $F_k$ for various values of $k$. For instance, $F_0$ is simply the number of distinct elements in $\sigma$. Note that $F_1 = \sum_i f_i = m$, the length of the stream. As $k$ increases to $\infty$, $F_k$ concentrates on the most frequent element, and we can think of $F_\infty$ as finding the most frequent element.
Definition 1 Let $\mathcal{A}(\sigma)$ be the real-valued output of a randomized streaming algorithm on stream $\sigma$. We say that $\mathcal{A}$ provides an $(\alpha, \beta)$-approximation for a real-valued function $g$ if
$$\Pr\left[\left|\frac{\mathcal{A}(\sigma)}{g(\sigma)} - 1\right| > \alpha\right] \leq \beta$$
for all $\sigma$.
Our ideal goal is to obtain an $(\epsilon, \delta)$-approximation for any given $\epsilon, \delta \in (0,1)$.

2. Background on Hashing

Hashing techniques play a fundamental role in streaming, in particular for estimating frequency moments. We will briefly review hashing from a theoretical point of view and in particular $k$-universal hashing.
A hash function maps a finite universe $\mathcal{U}$ to some range $\mathcal{R}$. Typically the range is the set of integers $[0..L-1]$ for some finite $L$ (here $L$ is the number of buckets in the hash table). Sometimes it is convenient, for the sake of developing intuition, to consider hash functions that map $\mathcal{U}$ to the continuous interval $[0,1]$. We will, in general, be working with a family of hash functions $\mathcal{H}$, and $h$ will be drawn from $\mathcal{H}$ uniformly at random; the analysis of the algorithm will be based on properties of $\mathcal{H}$. We would like $\mathcal{H}$ to have two important and contradictory properties:
  • a random function from $\mathcal{H}$ should behave like a completely random function from $\mathcal{U}$ to the range.
  • $\mathcal{H}$ should have nice computational properties:
    • a uniformly random function from $\mathcal{H}$ should be easy to sample
    • any function $h \in \mathcal{H}$ should have small representation size so that it can be stored compactly
    • it should be efficient to evaluate $h$
Definition 2 A collection of random variables $X_1, X_2, \ldots, X_n$ is $k$-wise independent if the variables $X_{i_1}, X_{i_2}, \ldots, X_{i_k}$ are independent for any set of distinct indices $i_1, i_2, \ldots, i_k$.
Take three random variables $\{X_1, X_2, X_3\}$ where $X_1, X_2$ are independent uniform $\{0,1\}$ random variables and $X_3 = X_1 \oplus X_2$. It is easy to check that the three variables are pairwise independent although they are not mutually independent.
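To see this concretely, the following short Python check (not part of the original notes) enumerates the four equally likely outcomes of $(X_1, X_2)$ and verifies both claims:

```python
from itertools import product

# The four equally likely outcomes of (X1, X2), with X3 = X1 XOR X2.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

def prob(event):
    # Probability of an event under the uniform distribution on outcomes.
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

# Pairwise independence: P[Xi=a, Xj=b] = P[Xi=a] * P[Xj=b] for every pair i, j.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a, b in product([0, 1], repeat=2):
        joint = prob(lambda o: o[i] == a and o[j] == b)
        assert joint == prob(lambda o: o[i] == a) * prob(lambda o: o[j] == b)

# Not mutually independent: (1, 1, 1) can never occur, yet it would have
# probability 1/8 if the three variables were fully independent.
assert prob(lambda o: o == (1, 1, 1)) == 0
print("pairwise independent, but not mutually independent")
```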
Following the work of Carter and Wegman[2], the class of $k$-universal hash families, and in particular the case $k=2$, provides an excellent tradeoff. $\mathcal{H}$ is strongly 2-universal if the following properties hold for a random function $h$ picked from $\mathcal{H}$: (i) for every $x \in \mathcal{U}$, $h(x)$ (which is a random variable) is uniformly distributed over the range, and (ii) for every distinct pair $x, y \in \mathcal{U}$, $h(x)$ and $h(y)$ are independent. 2-universal hash families are also called pairwise independent hash families. A weakly 2-universal family satisfies the property that $\Pr[h(x) = h(y)] = 1/L$ for any distinct $x, y$. We state an important observation about pairwise independent random variables.
Lemma 1 Let $Y = \sum_{i=1}^{h} X_i$ where $X_1, X_2, \ldots, X_h$ are pairwise independent. Then
$$\mathbf{Var}[Y] = \sum_{i=1}^{h} \mathbf{Var}[X_i].$$
Moreover, if the $X_i$ are binary/indicator random variables then
$$\mathbf{Var}[Y] \leq \sum_i \mathbf{E}[X_i^2] = \sum_i \mathbf{E}[X_i] = \mathbf{E}[Y].$$
There is a simple and nice construction of pairwise independent hash functions. Let $p$ be a prime number such that $p \geq |\mathcal{U}|$. Recall that $Z_p = \{0, 1, \ldots, p-1\}$ forms a field under the standard addition and multiplication modulo $p$. For each $a, b \in [p]$ we can define a hash function $h_{a,b}$ where $h_{a,b}(x) = ax + b \bmod p$. Let $\mathcal{H} = \{h_{a,b} \mid a, b \in [p]\}$. We only need to store the two numbers $a, b$ of $\Theta(\log p)$ bits to implicitly store $h_{a,b}$, and evaluating $h_{a,b}(x)$ takes one addition and one multiplication of $\log p$-bit numbers. Moreover, sampling a random hash function from $\mathcal{H}$ requires only sampling $a, b$, which is also easy. We claim that $\mathcal{H}$ is a pairwise independent family. You can verify this from the observation that for distinct $x, y$ and any $i, j$ the two equations $ax + b = i$ and $ay + b = j$ have a unique solution $a, b$ satisfying them simultaneously. Note that if $a = 0$ the hash function is pretty useless; all elements get mapped to $b$. Nevertheless, for $\mathcal{H}$ to be pairwise independent one needs to include those hash functions, but the probability that $a = 0$ is only $1/p$ since there are $p^2$ functions in $\mathcal{H}$. If one only wants a weakly universal hash family we can pick $a$ from $[1..(p-1)]$. The range of the hash function is $[p]$. To restrict the range to $L$ we let $h'_{a,b}(x) = (ax + b \bmod p) \bmod L$.
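To make the construction concrete, here is a minimal Python sketch (not from the original notes) of the family $h'_{a,b}(x) = (ax + b \bmod p) \bmod L$; the particular prime $p$ and bucket count $L$ in the usage example are illustrative choices.

```python
import random

class PairwiseHash:
    """h_{a,b}(x) = ((a*x + b) mod p) mod L, drawn from a 2-universal family.
    p must be a prime at least as large as the universe size."""

    def __init__(self, p, L):
        self.p = p                      # prime modulus
        self.L = L                      # number of buckets in the final range
        self.a = random.randrange(p)    # storing just (a, b) represents h compactly
        self.b = random.randrange(p)

    def __call__(self, x):
        return ((self.a * x + self.b) % self.p) % self.L

# Example usage: hash integer keys into 1024 buckets; 2^61 - 1 is a Mersenne prime.
h = PairwiseHash(p=(1 << 61) - 1, L=1024)
print(h(42), h(43))
```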
More generally, we will say that $\mathcal{H}$ is $k$-universal if every element's hash value is uniformly distributed in the range and for any $k$ distinct elements $x_1, \ldots, x_k$ the random variables $h(x_1), \ldots, h(x_k)$ are independent. Assuming $\mathcal{U}$ is the set of integers $[0..|\mathcal{U}|]$, for any fixed $k$ there exist constructions of $k$-universal hash families such that every hash function $h$ in the family can be stored using $O(k \log |\mathcal{U}|)$ bits (essentially $k$ numbers) and $h$ can be evaluated using $O(k)$ arithmetic operations on $\log |\mathcal{U}|$-bit numbers. We will ignore the specific details of the implementations and refer the reader to the considerable literature on hashing for further details.

3. Estimating Number of Distinct Elements

A lower bound on exact counting by deterministic algorithms: We argue that any deterministic streaming algorithm that counts the number of distinct elements exactly needs $\Omega(n)$ bits. To see this, suppose there is an algorithm $\mathcal{A}$ that uses strictly fewer than $n$ bits. Consider the $2^n$ different streams $\sigma_S$ where $S \subseteq [n]$; $\sigma_S$ consists of the elements of $S$ in some arbitrary order. Since $\mathcal{A}$ uses $n-1$ bits or less, there must be two distinct sets $S_1, S_2$ such that the state of $\mathcal{A}$ at the end of $\sigma_{S_1}$ and $\sigma_{S_2}$ is identical. Since $S_1, S_2$ are distinct there is an element $i$ in $S_1 \setminus S_2$ or $S_2 \setminus S_1$; wlog it is the former. Then it is easy to see that $\mathcal{A}$ cannot give the right count for at least one of the two streams $\langle\sigma_{S_1}, i\rangle, \langle\sigma_{S_2}, i\rangle$.
The basic hashing idea: We now discuss a simple high-level idea for estimating the number of distinct elements in the stream. Suppose $h$ is an idealized random hash function that maps $[1..n]$ to the interval $[0,1]$. Suppose there are $d$ distinct elements in the stream $\sigma = a_1, a_2, \ldots, a_m$. If $h$ behaves like a random function then the set $\{h(a_1), \ldots, h(a_m)\}$ behaves like a collection of $d$ independent uniformly distributed random variables in $[0,1]$. Let $\theta = \min\{h(a_1), \ldots, h(a_m)\}$; the expectation of $\theta$ is $\frac{1}{d+1}$ and hence $1/\theta$ is a natural estimator for $d$. In the streaming setting we can compute $\theta$ by hashing each incoming value and keeping track of the minimum; we only need to keep one number in memory. Although simple, the algorithm assumes an idealized hash function, and the estimator is only good in expectation. Converting the idea into an implementable algorithm with proper guarantees requires work. There are several papers on this problem and we will now discuss some of the approaches.
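The following Python sketch is a toy illustration of this idea only, not one of the analyzed algorithms below; it simulates the idealized hash with a fixed cryptographic hash scaled into $[0,1)$ and uses the slightly adjusted estimator $1/\theta - 1$, matching $\mathbf{E}[\theta] = \frac{1}{d+1}$.

```python
import hashlib

def idealized_hash(x):
    """Map x to a (pseudo)uniform value in [0, 1); a stand-in for the
    idealized random hash function in the discussion above."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def estimate_distinct(stream):
    """Track the minimum hash value theta over the stream and return
    1/theta - 1, since E[theta] = 1/(d+1) for d distinct elements."""
    theta = 1.0
    for a in stream:
        theta = min(theta, idealized_hash(a))
    return 1.0 / theta - 1.0

# Example: 1000 distinct values, each repeated three times.
stream = list(range(1000)) * 3
print(estimate_distinct(stream))  # a crude, high-variance estimate of d = 1000
```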

3.1. The AMS algorithm

Here we describe an algorithm with better parameters, but it only gives a constant factor approximation. It is due to Alon, Matias and Szegedy in their famous paper[3] on estimating frequency moments. We need some notation. For an integer $t > 0$ let $\operatorname{zeros}(t)$ denote the number of zeros that the binary representation of $t$ ends in; equivalently,
$$\operatorname{zeros}(t) = \max\{i : 2^i \text{ divides } t\}.$$
AMS-DistinctElements:
$\mathcal{H}$ is a 2-universal hash family from $[n]$ to $[n]$
choose $h$ at random from $\mathcal{H}$
$z \leftarrow 0$
While (stream is not empty) do
    $a_i$ is the current item
    $z \leftarrow \max\{z, \operatorname{zeros}(h(a_i))\}$
endWhile
Output $2^{z + \frac{1}{2}}$
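A possible Python rendering of this pseudocode is sketched below; the pairwise independent family built from a fixed Mersenne prime, the reduction of the range to $[1..n]$, and the shift that avoids hashing to $0$ are implementation simplifications, not part of the analysis.

```python
import random

def zeros(t):
    """Number of trailing zeros in the binary representation of t > 0."""
    count = 0
    while t % 2 == 0:
        t //= 2
        count += 1
    return count

def ams_distinct_elements(stream, n, p=(1 << 61) - 1):
    """One run of AMS-DistinctElements.  h(x) = ((a*x + b) mod p) mod n + 1
    plays the role of a 2-universal hash into [1..n]; the +1 avoids hashing
    to 0.  n is assumed to be a power of two, as in the analysis."""
    a, b = random.randrange(p), random.randrange(p)
    z = 0
    for x in stream:
        hx = ((a * x + b) % p) % n + 1
        z = max(z, zeros(hx))
    return 2 ** (z + 0.5)

# Example usage:
stream = [random.randrange(1, 1 << 16) for _ in range(50000)]
print(ams_distinct_elements(stream, n=1 << 16))
```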
First, we note that the space and time per element are $O(\log n)$. We now analyze the quality of the approximation provided by the output. Recall that $h(a_j)$ is uniformly distributed in $[n]$. We will assume for simplicity that $n$ is a power of $2$.
Let $d$ denote the number of distinct elements in the stream and let them be $b_1, b_2, \ldots, b_d$. For a given $r$ let $X_{r,j}$ be the indicator random variable that is $1$ if $\operatorname{zeros}(h(b_j)) \geq r$. Let $Y_r = \sum_j X_{r,j}$. That is, $Y_r$ is the number of distinct elements whose hash values end in at least $r$ zeros.
Since $h(b_j)$ is uniformly distributed in $[n]$ and $n$ is a power of $2$,
$$\mathbf{E}[X_{r,j}] = \Pr[\operatorname{zeros}(h(b_j)) \geq r] = \frac{n/2^r}{n} = \frac{1}{2^r}.$$
Therefore
$$\mathbf{E}[Y_r] = \sum_j \mathbf{E}[X_{r,j}] = \frac{d}{2^r}.$$
Thus we have $\mathbf{E}[Y_{\log d}] = 1$ (assuming $d$ is a power of $2$).
Now we compute the variance of $Y_r$. Note that the variables $X_{r,j}$ and $X_{r,j'}$ are pairwise independent since $\mathcal{H}$ is 2-universal. Hence
$$\mathbf{Var}[Y_r] = \sum_j \mathbf{Var}[X_{r,j}] \leq \sum_j \mathbf{E}[X_{r,j}^2] = \sum_j \mathbf{E}[X_{r,j}] = \frac{d}{2^r}.$$
Using Markov's inequality
$$\Pr[Y_r > 0] = \Pr[Y_r \geq 1] \leq \mathbf{E}[Y_r] \leq \frac{d}{2^r}.$$
Using Chebyshev's inequality
$$\Pr[Y_r = 0] \leq \Pr\left[|Y_r - \mathbf{E}[Y_r]| \geq \frac{d}{2^r}\right] \leq \frac{\mathbf{Var}[Y_r]}{(d/2^r)^2} \leq \frac{2^r}{d}.$$
Let $z'$ be the value of $z$ at the end of the stream and let $d' = 2^{z' + \frac{1}{2}}$ be the estimate for $d$ output by the algorithm. We first claim that, except with constant probability, $d'$ is not too large compared to $d$. Let $a$ be the smallest integer such that $2^{a + \frac{1}{2}} \geq 3d$. Then
$$\Pr[d' \geq 3d] = \Pr[Y_a > 0] \leq \frac{d}{2^a} \leq \frac{\sqrt{2}}{3}.$$
Now we claim that, again except with constant probability, $d'$ is not too small compared to $d$. For this let $b$ be the largest integer such that $2^{b + \frac{1}{2}} \leq d/3$. Then,
$$\Pr[d' \leq d/3] = \Pr[Y_{b+1} = 0] \leq \frac{2^{b+1}}{d} \leq \frac{\sqrt{2}}{3}.$$
Thus, the algorithm's estimate satisfies $d/3 < d' < 3d$ except with constant probability: each of the two failure events above has probability at most $\sqrt{2}/3 \simeq 0.4714$. Since each one-sided failure probability is strictly below $1/2$, the median trick applies: running $O(\log \frac{1}{\delta})$ parallel and independent copies of the algorithm and outputting the median of their estimates gives a factor-$3$ approximation with probability at least $(1-\delta)$. The time and space will be $O(\log \frac{1}{\delta} \log n)$.
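Below is a minimal sketch of the median trick, assuming a single-run estimator such as the `ams_distinct_elements` sketch above; the constant inside $O(\log \frac{1}{\delta})$ is illustrative, and in a true one-pass setting the $k$ copies would be run in parallel over the stream rather than by re-reading it.

```python
import math
import statistics

def median_of_copies(stream, n, delta, single_run):
    """Run O(log(1/delta)) independent copies of a constant-probability
    estimator and return the median of their estimates."""
    k = max(1, math.ceil(24 * math.log(1 / delta)))  # the constant 24 is illustrative
    estimates = [single_run(stream, n) for _ in range(k)]
    return statistics.median(estimates)

# Example usage, assuming the ams_distinct_elements sketch from above:
# print(median_of_copies(stream, n=1 << 16, delta=0.01,
#                        single_run=ams_distinct_elements))
```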

3.2. A $(1-\epsilon)$-approximation in $O(\frac{1}{\epsilon^2} \log n)$ space

Bar-Yossef et al.[4] described three algorithms for distinct elements, the last of which uses $\tilde{O}(\frac{1}{\epsilon^2} + \log n)$ space and $O(\log n + \log \frac{1}{\epsilon})$ amortized time per element; the notation $\tilde{O}$ suppresses the dependence on $\log \log n$ and $\log \frac{1}{\epsilon}$. Here we describe their first algorithm, which gives an $(\epsilon, \delta_0)$-approximation for a fixed constant $\delta_0$ in space $O(\frac{1}{\epsilon^2} \log n)$ and $O(\log \frac{1}{\epsilon} \log n)$ time per update; via the median trick, with an additional $\log \frac{1}{\delta}$ factor, we can obtain an $(\epsilon, \delta)$-approximation.
The algorithm is based on the Flajolet-Martin idea of hashing to $[0,1]$ and taking the minimum, but with a small and important technical tweak. Let $t = \frac{c}{\epsilon^2}$ for some constant $c$ to be fixed later.
BJKST-DistinctElements:
$\mathcal{H}$ is a 2-universal hash family from $[n]$ to $[N = n^3]$
choose $h$ at random from $\mathcal{H}$
$t \leftarrow \frac{c}{\epsilon^2}$
While (stream is not empty) do
    $a_i$ is the current item
    Update the smallest $t$ hash values seen so far with $h(a_i)$
endWhile
Let $v$ be the $t$'th smallest hash value seen.
Output $tN/v$.
We observe that the algorithm can be implemented by keeping track of $t$ hash values, each of which takes $O(\log n)$ space, and hence the space usage is $O(t \log n) = O(\frac{1}{\epsilon^2} \log n)$. The $t$ values can be stored in a binary search tree so that when a new item is processed we can update the data structure with $O(\log t)$ comparisons, which takes total time $O(\log \frac{1}{\epsilon} \log n)$ per item.
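Here is a possible Python sketch of the algorithm, keeping the $t$ smallest hash values in a max-heap; the constant $c$, the prime $p$, and the handling of the case with fewer than $t$ distinct hash values are illustrative choices, not the ones fixed by the analysis.

```python
import heapq
import random

def bjkst_distinct_elements(stream, n, eps, c=96):
    """One run of BJKST-DistinctElements: keep the t = c/eps^2 smallest hash
    values and output t*N/v, where v is the t-th smallest."""
    N = n ** 3                              # large range makes h injective whp
    p = (1 << 61) - 1                       # prime larger than the universe
    a, b = random.randrange(p), random.randrange(p)
    t = int(c / eps ** 2)

    heap = []                               # max-heap (values negated) of the t smallest hashes
    stored = set()                          # hash values currently in the heap
    for x in stream:
        hx = ((a * x + b) % p) % N + 1      # hash value in [1..N]
        if hx in stored:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -hx)
            stored.add(hx)
        elif hx < -heap[0]:                 # smaller than the current t-th smallest
            evicted = -heapq.heappushpop(heap, -hx)
            stored.discard(evicted)
            stored.add(hx)

    if len(heap) < t:                       # fewer than t distinct hashes: exact count
        return len(heap)
    v = -heap[0]                            # the t-th smallest hash value
    return t * N / v

# Example usage:
stream = [random.randrange(1, 10**6) for _ in range(200000)]
print(bjkst_distinct_elements(stream, n=10**6, eps=0.1))
```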
The intuition of the algorithm is as follows. As before, let $d$ be the number of distinct elements and let them be $b_1, \ldots, b_d$. Think of the hash values $h(b_1), \ldots, h(b_d)$ as (after scaling by $N$) uniformly distributed in $[0,1]$. For any fixed $t$ we expect about $t$ hash values to fall in the interval $[0, t/d]$, so the $t$'th smallest hash value $v$ should be around $t/d$, which justifies $t/v$ as an estimator for $d$. Now we formally analyze the properties of this estimator.
We chose the hash family to map $[n]$ to $[n^3]$ and therefore with probability at least $(1 - 1/n)$ the random hash function $h$ is injective over $[n]$. We will assume this is indeed the case. Moreover, we will assume that $n \geq \sqrt{\epsilon/24}$, which implies in particular that $\epsilon t/(4d) \geq 1/N$. Let $d'$ be the estimate returned by the algorithm.
Lemma 2 $\Pr[d' < (1-\epsilon)d] \leq 1/6$.
Proof: The values $h(b_1), \ldots, h(b_d)$ are uniformly distributed in $[1..N]$. If $d' < (1-\epsilon)d$ then $v > \frac{tN}{(1-\epsilon)d}$; that is, fewer than $t$ hash values fell in the interval $[1, \frac{tN}{(1-\epsilon)d}]$, and since $\frac{tN}{(1-\epsilon)d} \geq (1+\epsilon)tN/d$, fewer than $t$ hash values are at most $(1+\epsilon)tN/d$. Let $X_i$ be the indicator random variable for the event that $h(b_i) \leq (1+\epsilon)tN/d$ and let $Y = \sum_{i=1}^{d} X_i$.
Since $h(b_i)$ is distributed uniformly in $[1..N]$, taking rounding errors into account we have $(1+\epsilon/2)t/d \leq \mathbf{E}[X_i] \leq (1+3\epsilon/2)t/d$, and hence $\mathbf{E}[Y] \geq t(1+\epsilon/2)$. We have $\mathbf{Var}[Y] \leq \mathbf{E}[Y] \leq t(1+3\epsilon/2)$ (due to pairwise independence, Lemma 1). We have $d' < (1-\epsilon)d$ only if $Y < t$, and by Chebyshev,
$$\Pr[Y < t] \leq \Pr[|Y - \mathbf{E}[Y]| \geq \epsilon t/2] \leq \frac{4\mathbf{Var}[Y]}{\epsilon^2 t^2} \leq \frac{12}{\epsilon^2 t} \leq 1/6,$$
where the last inequality holds for a sufficiently large choice of the constant $c$ in $t = c/\epsilon^2$.
Lemma 3 $\Pr[d' > (1+\epsilon)d] \leq 1/6$.
Proof: Suppose $d' > (1+\epsilon)d$; that is, $v < \frac{tN}{(1+\epsilon)d}$. This implies that at least $t$ hash values are less than $\frac{tN}{(1+\epsilon)d} \leq (1-\epsilon/2)tN/d$. We will show that this happens with small probability.
Let $X_i$ be the indicator random variable for the event $h(b_i) < (1-\epsilon/2)tN/d$ and let $Y = \sum_{i=1}^{d} X_i$. We have $\mathbf{E}[X_i] < (1-\epsilon/2)t/d + 1/N \leq (1-\epsilon/4)t/d$ (the $1/N$ accounts for rounding errors). Hence $\mathbf{E}[Y] \leq (1-\epsilon/4)t$. As we argued, $d' > (1+\epsilon)d$ happens only if $Y \geq t$. By Chebyshev,
$$\Pr[Y \geq t] \leq \Pr[|Y - \mathbf{E}[Y]| \geq \epsilon t/4] \leq \frac{16\mathbf{Var}[Y]}{\epsilon^2 t^2} \leq \frac{16}{\epsilon^2 t} \leq 1/6,$$
again for a sufficiently large choice of the constant $c$.
Bibliographic Notes: Our description of the AMS algorithm is taken from the notes of Amit Chakrabarti. Flajolet and Martin proposed the basic hash-function-based algorithm in[5]. Kane, Nelson and Woodruff[6] described an $(\epsilon, \delta_0)$-approximation for a fixed $\delta_0$ in space $O(\frac{1}{\epsilon^2} + \log n)$ (the time per update is also the same). This is a theoretically optimal algorithm since there are lower bounds of $\Omega(\frac{1}{\epsilon^2})$ and $\Omega(\log n)$ on the space required for an $\epsilon$-approximation. We refer the reader to[7],[8] for the analysis and empirical evaluation of an algorithm called HyperLogLog, which seems to work very well in practice.
You can read the notes from the next lecture of Chandra Chekuri's course, on estimating $F_k$ norms via AMS sampling, here.

  1. Many streaming papers flip the notation and use n n nn to denote the length of the stream and m m mm to denote the range of the integers in the stream ↩︎
  2. J Lawrence Carter and Mark N Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143-154, 1979. Preliminary version in Proc. ACM STOC, 1977. ↩︎
  3. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999. Preliminary version in Proc. of ACM STOC 1996. ↩︎
  4. Ziv Bar-Yossef, TS Jayram, Ravi Kumar, D Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Randomization and Approximation Techniques in Computer Science, pages 1-10. Springer, 2002. ↩︎ ↩︎
  5. Philippe Flajolet and G Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of computer and system sciences, 31(2):182-209, 1985. ↩︎
  6. Daniel M Kane, Jelani Nelson, and David P Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 41-52. ACM, 2010. ↩︎
  7. Philippe Flajolet, Eric Fusy, Olivier Gandouet, Frédéric Meunier, et al. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Analysis of Algorithms 2007 (AofA07), pages 127-146, 2007. ↩︎
  8. Stefan Heule, Marc Nunkesser, and Alex Hall. Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the EDBT 2013 Conference, Genoa, Italy, 2013. ↩︎
