You can read the notes from the previous lecture of Chandra Chekuri's course on Basics of Probability, Probabilistic Counting, and Reservoir Sampling here.
1. Estimating Frequency Moments in Streams
A significant fraction of the streaming literature is on the problem of estimating frequency moments. Let $\sigma = a_1, a_2, \ldots, a_m$ be a stream of numbers where each $a_i$ is an integer between $1$ and $n$. We will try to stick to the notation of using $m$ for the length of the stream and $n$ for the range of the integers[1]. Let $f_i$ be the number of occurrences (or frequency) of integer $i$ in the stream. We let $\mathbf{f} = (f_1, f_2, \ldots, f_n)$ be the frequency vector for a given stream $\sigma$. For $k \geq 0$, $F_k(\sigma)$ is defined to be the $k$'th frequency moment of $\sigma$:
$$F_k = \sum_i f_i^k.$$
We will discuss several algorithms to estimate $F_k$ for various values of $k$. For instance, $F_0$ is simply the number of distinct elements in $\sigma$. Note that $F_1 = \sum_i f_i = m$, the length of the stream. As $k$ increases to $\infty$, $F_k$ concentrates on the most frequent element, and we can think of computing $F_\infty$ as finding the frequency of the most frequent element.
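To make the definitions concrete, here is a small (non-streaming) Python sketch that computes $F_k$ exactly by storing the entire frequency vector; the stream below is an arbitrary example. The streaming algorithms discussed in these notes avoid exactly this $O(n)$ space cost.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Compute F_k = sum_i f_i^k exactly by storing the whole
    frequency vector (O(n) space, unlike the streaming estimators)."""
    freq = Counter(stream)  # f_i for each distinct element i
    return sum(f ** k for f in freq.values())

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(frequency_moment(stream, 0))  # F_0 = number of distinct elements: 7
print(frequency_moment(stream, 1))  # F_1 = length of the stream: 11
print(frequency_moment(stream, 2))  # F_2 = sum of squared frequencies: 21
```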
Definition 1. Let $\mathcal{A}(\sigma)$ be the real-valued output of a randomized streaming algorithm on stream $\sigma$. We say that $\mathcal{A}$ provides an $(\alpha, \beta)$-approximation for a real-valued function $g$ if
$$\Pr\left[\,|\mathcal{A}(\sigma) - g(\sigma)| > \alpha\, g(\sigma)\,\right] \leq \beta \quad \text{for every stream } \sigma.$$
Our ideal goal is to obtain an $(\epsilon, \delta)$-approximation for any given $\epsilon, \delta \in (0,1)$.
2. Background on Hashing
Hashing techniques play a fundamental role in streaming, in particular for estimating frequency moments. We briefly review hashing from a theoretical point of view, in particular $k$-universal hashing.
A hash function maps a finite universe $\mathcal{U}$ to some range $\mathcal{R}$. Typically the range is the set of integers $[0..L-1]$ for some finite $L$ (here $L$ is the number of buckets in the hash table). Sometimes it is convenient, for the sake of developing intuition, to consider hash functions that map $\mathcal{U}$ to the continuous interval $[0,1]$. We will, in general, be working with a family of hash functions $\mathcal{H}$, and $h$ will be drawn from $\mathcal{H}$ uniformly at random; the analysis of the algorithm will be based on properties of $\mathcal{H}$. We would like $\mathcal{H}$ to have two important and contradictory properties:
- a random function from $\mathcal{H}$ should behave like a completely random function from $\mathcal{U}$ to the range;
- $\mathcal{H}$ should have nice computational properties:
  - a uniformly random function from $\mathcal{H}$ should be easy to sample;
  - any function $h \in \mathcal{H}$ should have a small representation size so that it can be stored compactly;
  - it should be efficient to evaluate $h$.
Definition 2. A collection of random variables $X_1, X_2, \ldots, X_n$ is $k$-wise independent if the variables $X_{i_1}, X_{i_2}, \ldots, X_{i_k}$ are independent for any set of distinct indices $i_1, i_2, \ldots, i_k$.
As an example, take three random variables $\{X_1, X_2, X_3\}$ where $X_1, X_2$ are independent uniform $\{0,1\}$ random variables and $X_3 = X_1 \oplus X_2$. It is easy to check that the three variables are pairwise independent although they are not mutually independent.
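This example can be checked exhaustively; the short Python snippet below enumerates all four equally likely outcomes of $(X_1, X_2)$ and verifies both claims.

```python
from itertools import product

# All four equally likely outcomes of (X1, X2), with X3 = X1 XOR X2.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

# Pairwise independence: every pair (Xi, Xj) is uniform on {0,1}^2,
# so Pr[Xi = a, Xj = b] = 1/4 = Pr[Xi = a] * Pr[Xj = b].
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a, b in product([0, 1], repeat=2):
        matches = sum(1 for o in outcomes if o[i] == a and o[j] == b)
        assert matches / len(outcomes) == 0.25

# Not mutually independent: the outcome (1, 1, 1) is impossible,
# yet full independence would give it probability 1/8.
assert not any(o == (1, 1, 1) for o in outcomes)
```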
Following the work of Carter and Wegman[2], the class of $k$-universal hash families, in particular for $k = 2$, provides an excellent tradeoff. $\mathcal{H}$ is strongly 2-universal if the following properties hold for a random function $h$ picked from $\mathcal{H}$: (i) for every $x \in \mathcal{U}$, $h(x)$ (which is a random variable) is uniformly distributed over the range, and (ii) for every distinct pair $x, y \in \mathcal{U}$, $h(x)$ and $h(y)$ are independent. Strongly 2-universal hash families are also called pairwise independent hash families. A weakly 2-universal family satisfies the weaker property that $\Pr[h(x) = h(y)] \leq 1/L$ for any distinct $x, y$. We state an important observation about sums of pairwise independent random variables.
Lemma 1. Let $Y = \sum_{i=1}^{h} X_i$ where $X_1, X_2, \ldots, X_h$ are pairwise independent. Then
$$\mathbf{Var}[Y] = \sum_{i=1}^{h} \mathbf{Var}[X_i].$$
There is a simple and nice construction of a pairwise independent hash family. Let $p$ be a prime number with $p \geq |\mathcal{U}|$. Recall that $\mathbb{Z}_p = \{0, 1, \ldots, p-1\}$ forms a field under addition and multiplication modulo $p$. For each $a, b \in [p]$ define a hash function $h_{a,b}$ by $h_{a,b}(x) = ax + b \bmod p$, and let $\mathcal{H} = \{h_{a,b} \mid a, b \in [p]\}$. We only need to store the two numbers $a, b$ of $\Theta(\log p)$ bits each to implicitly store $h_{a,b}$, and evaluating $h_{a,b}(x)$ takes one addition and one multiplication of $\log p$ bit numbers. Moreover, sampling a random hash function from $\mathcal{H}$ requires only sampling $a, b$, which is also easy. We claim that $\mathcal{H}$ is a pairwise independent family. You can verify this via the observation that for distinct $x, y$ and any $i, j$ the two equations $ax + b = i$ and $ay + b = j$ (over $\mathbb{Z}_p$) have a unique solution $a, b$ satisfying both simultaneously. Note that if $a = 0$ the hash function is pretty useless; all elements get mapped to $b$. Nevertheless, for $\mathcal{H}$ to be pairwise independent one needs to include those hash functions, but the probability that $a = 0$ is only $1/p$ since there are $p^2$ functions in $\mathcal{H}$. If one only wants a weakly universal hash family, one can pick $a$ from $[1..(p-1)]$. The range of the hash function is $[p]$; to restrict the range to $L$ we let $h'_{a,b}(x) = (ax + b \bmod p) \bmod L$.
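A minimal Python sketch of this construction follows; the class name and parameter choices are illustrative. Here $p$ must be a prime at least $|\mathcal{U}|$, and $a, b$ are sampled uniformly from $[p]$.

```python
import random

class PairwiseHash:
    """The family h_{a,b}(x) = ((a*x + b) mod p) mod L described above.
    Storing a hash function means storing just the two numbers a and b."""

    def __init__(self, p, L, rng=random):
        self.p, self.L = p, L
        self.a = rng.randrange(p)  # a = 0 is allowed (and needed) for
        self.b = rng.randrange(p)  # strict pairwise independence

    def __call__(self, x):
        return ((self.a * x + self.b) % self.p) % self.L

# Example: universe {0, ..., 100}, p = 101 (prime), L = 16 buckets.
h = PairwiseHash(p=101, L=16)
print(h(42))  # a bucket index in [0, 16)
```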
More generally, we say that $\mathcal{H}$ is $k$-universal if every element's hash value is uniformly distributed in the range and for any $k$ distinct elements $x_1, \ldots, x_k$ the random variables $h(x_1), \ldots, h(x_k)$ are independent. Assuming $\mathcal{U}$ is the set of integers $[0..|\mathcal{U}|-1]$, for any fixed $k$ there exist constructions of $k$-universal hash families such that every hash function $h$ in the family can be stored using $O(k \log |\mathcal{U}|)$ bits (essentially $k$ numbers) and $h$ can be evaluated using $O(k)$ arithmetic operations on $\log |\mathcal{U}|$ bit numbers. We will ignore specific details of the implementations and refer the reader to the considerable literature on hashing for further details.
3. Estimating Number of Distinct Elements
A lower bound for exact counting by deterministic algorithms: We argue that any deterministic streaming algorithm that counts the number of distinct elements exactly needs $\Omega(n)$ bits. To see this, suppose there is an algorithm $\mathcal{A}$ that uses strictly fewer than $n$ bits. Consider the $2^n$ different streams $\sigma_S$ for $S \subseteq [n]$, where $\sigma_S$ consists of the elements of $S$ in some arbitrary order. Since $\mathcal{A}$ uses at most $n-1$ bits, it has at most $2^{n-1}$ distinct memory states, so by the pigeonhole principle there must be two distinct sets $S_1, S_2$ such that the state of $\mathcal{A}$ at the end of $\sigma_{S_1}$ and at the end of $\sigma_{S_2}$ is identical. Since $S_1, S_2$ are distinct there is an element $i$ in $S_1 \setminus S_2$ or $S_2 \setminus S_1$; wlog it is the former. Then it is easy to see that $\mathcal{A}$ cannot give the right count for at least one of the two streams $\langle \sigma_{S_1}, i \rangle$ and $\langle \sigma_{S_2}, i \rangle$.
The basic hashing idea: We now discuss a simple high-level idea for estimating the number of distinct elements in the stream. Suppose $h$ is an idealized random hash function that maps $[1..n]$ to the interval $[0,1]$. Suppose there are $d$ distinct elements in the stream $\sigma = a_1, a_2, \ldots, a_m$. If $h$ behaves like a random function then the set $\{h(a_1), \ldots, h(a_m)\}$ behaves like a collection of $d$ independent uniformly distributed random variables in $[0,1]$. Let $\theta = \min\{h(a_1), \ldots, h(a_m)\}$; the expectation of $\theta$ is $\frac{1}{d+1}$ and hence $1/\theta$ is a natural estimator for $d$. In the streaming setting we can compute $\theta$ by hashing each incoming value and keeping track of the minimum, so we only need to keep one number in memory. Although simple, the algorithm assumes an idealized hash function, and the estimator is only good in expectation; converting the idea into an implementable algorithm with proper guarantees requires work. There are several papers on this problem and we will now discuss some of the approaches.
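The idea can be simulated directly. The sketch below is an idealized simulation, not an implementable streaming algorithm: it fakes a truly random hash function by memoizing a fresh uniform value per distinct element, then averages $\theta$ over independent trials and inverts $\mathbf{E}[\theta] = \frac{1}{d+1}$. The trial count and stream are illustrative.

```python
import random

def estimate_distinct(stream, trials=500, seed=0):
    """Idealized min-hash estimator: theta = min over the stream of a
    truly random hash into [0,1]; E[theta] = 1/(d+1) for d distinct
    elements, so we average theta over many trials and invert."""
    rng = random.Random(seed)
    thetas = []
    for _ in range(trials):
        h = {}  # memoize one fresh uniform value per distinct element
        theta = min(h.setdefault(a, rng.random()) for a in stream)
        thetas.append(theta)
    avg_theta = sum(thetas) / trials
    return 1.0 / avg_theta - 1.0

stream = [1, 2, 3, 4, 5] * 100  # d = 5 distinct elements
print(estimate_distinct(stream))  # close to 5 with high probability
```

Note that a single trial has high variance, which is exactly why the implementable algorithms below need more machinery than "hash and take the minimum".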
3.1. The AMS algorithm
Here we describe an algorithm with better parameters, though it only gives a constant factor approximation. It is due to Alon, Matias and Szegedy in their famous paper[3] on estimating frequency moments. We need some notation. For an integer $t > 0$ let $\operatorname{zeros}(t)$ denote the number of zeros that the binary representation of $t$ ends in; equivalently, $\operatorname{zeros}(t) = \max\{i \geq 0 : 2^i \text{ divides } t\}$.
$\mathcal{H}$ is a 2-universal hash family from $[n]$ to $[n]$
choose $h$ at random from $\mathcal{H}$
$z \leftarrow 0$
while (stream is not empty) do
&nbsp;&nbsp;&nbsp;&nbsp;$a_i$ is the current item
&nbsp;&nbsp;&nbsp;&nbsp;$z \leftarrow \max\{z, \operatorname{zeros}(h(a_i))\}$
endWhile
Output $2^{z + \frac{1}{2}}$
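A Python sketch of the pseudocode above, using the $h_{a,b}$ pairwise independent family from Section 2. The fixed Mersenne prime and the shift of the hash value into $[1..n]$ (so that $\operatorname{zeros}$ is always defined) are implementation choices not specified in the pseudocode.

```python
import random

def zeros(t):
    """Number of trailing zero bits in the binary representation of t > 0."""
    return (t & -t).bit_length() - 1

def ams_distinct(stream, n, seed=None):
    """One copy of the AMS distinct-elements estimator; returns 2^(z + 1/2).
    A single copy is only a constant-factor estimate with constant
    probability, so in practice one takes a median of several copies."""
    rng = random.Random(seed)
    p = 2 ** 61 - 1                     # a Mersenne prime, assumed >= n
    a = rng.randrange(1, p)
    b = rng.randrange(p)
    z = 0
    for x in stream:
        hx = ((a * x + b) % p) % n + 1  # pairwise independent hash into [1, n]
        z = max(z, zeros(hx))
    return 2 ** (z + 0.5)
```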
First, we note that the space used and the time per element are both $O(\log n)$. We now analyze the quality of the approximation provided by the output. Recall that $h(a_j)$ is uniformly distributed in $[n]$. We will assume for simplicity that $n$ is a power of $2$.
Let $d$ denote the number of distinct elements in the stream and let them be $b_1, b_2, \ldots, b_d$. For a given $r$ let $X_{r,j}$ be the indicator random variable that is $1$ if $\operatorname{zeros}(h(b_j)) \geq r$. Let $Y_r = \sum_j X_{r,j}$. That is, $Y_r$ is the number of distinct elements whose hash values have at least $r$ trailing zeros.
Since $h(b_j)$ is uniformly distributed in $[n]$,
$$\mathbf{E}[X_{r,j}] = \Pr[\operatorname{zeros}(h(b_j)) \geq r] = \frac{1}{2^r}, \qquad \text{and hence} \qquad \mathbf{E}[Y_r] = \frac{d}{2^r}.$$
Thus we have $\mathbf{E}[Y_{\log d}] = 1$ (assuming $d$ is a power of $2$).
Now we compute the variance of $Y_r$. Note that the variables $X_{r,j}$ and $X_{r,j'}$ are pairwise independent since $\mathcal{H}$ is 2-universal. Hence, by Lemma 1,
$$\mathbf{Var}[Y_r] = \sum_j \mathbf{Var}[X_{r,j}] \leq \sum_j \mathbf{E}[X_{r,j}^2] = \sum_j \mathbf{E}[X_{r,j}] = \frac{d}{2^r}.$$
Let $z'$ be the value of $z$ at the end of the stream and let $d' = 2^{z' + \frac{1}{2}}$ be the estimate for $d$ output by the algorithm. We first claim that $d'$ is not too large compared to $d$, except with constant probability. Let $a$ be the smallest integer such that $2^{a + \frac{1}{2}} \geq 3d$. Then, since $Y_a$ is a non-negative integer, by Markov's inequality,
$$\Pr[d' \geq 3d] = \Pr[z' \geq a] = \Pr[Y_a \geq 1] \leq \mathbf{E}[Y_a] = \frac{d}{2^a} \leq \frac{\sqrt{2}}{3}.$$
Now we claim that $d'$ is not too small compared to $d$, again except with constant probability. For this let $b$ be the largest integer such that $2^{b + \frac{1}{2}} \leq d/3$. Then, by Chebyshev's inequality,
$$\Pr[d' \leq d/3] = \Pr[z' \leq b] \leq \Pr[Y_{b+1} = 0] \leq \frac{\mathbf{Var}[Y_{b+1}]}{(\mathbf{E}[Y_{b+1}])^2} \leq \frac{2^{b+1}}{d} \leq \frac{\sqrt{2}}{3}.$$
Thus, the algorithm provides a $(1/3, \sqrt{2}/3 \simeq 0.4714)$-approximation to the number of distinct elements. Using the median trick we can boost the probability of success to at least $1 - \delta$, obtaining a $(1/3, \delta)$-approximation, by running $O(\log \frac{1}{\delta})$ parallel and independent copies of the algorithm. The time and space will be $O(\log \frac{1}{\delta} \log n)$.
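The median trick itself is only a few lines. The sketch below is generic over any randomized estimator whose failure probability on each side is bounded strictly below $1/2$; the toy estimator in the demo is purely illustrative.

```python
import random
import statistics

def median_of_copies(estimator, copies):
    """Run `copies` independent copies of a randomized estimator and
    return the median; a Chernoff bound shows copies = O(log(1/delta))
    suffices to push the failure probability down to delta, provided
    each side fails with probability bounded strictly below 1/2."""
    return statistics.median(estimator() for _ in range(copies))

# Toy demo: an estimator of d = 100 that returns a wild outlier
# 20% of the time; the median suppresses the outliers.
rng = random.Random(7)
noisy = lambda: rng.choice([100] * 8 + [1, 10_000])
print(median_of_copies(noisy, copies=21))
```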
3.2. A $(1-\epsilon)$-approximation in $O(\frac{1}{\epsilon^2} \log n)$ space
Bar-Yossef et al.[4] described three algorithms for distinct elements, the last of which achieves $\tilde{O}(\frac{1}{\epsilon^2} + \log n)$ space and $O(\log n + \log \frac{1}{\epsilon})$ amortized time per element; the notation $\tilde{O}$ suppresses dependence on $\log \log n$ and $\log \frac{1}{\epsilon}$. Here we describe their first algorithm, which gives an $(\epsilon, \delta_0)$-approximation in space $O(\frac{1}{\epsilon^2} \log n)$ and $O(\log \frac{1}{\epsilon} \log n)$ time per update; via the median trick, with an additional $\log \frac{1}{\delta}$ factor, we can obtain an $(\epsilon, \delta)$-approximation.
The algorithm is based on the Flajolet-Martin idea of hashing to $[0,1]$ and taking the minimum, but with a small and important technical tweak. Let $t = \frac{c}{\epsilon^2}$ for some constant $c$ to be fixed later.
$\mathcal{H}$ is a 2-universal hash family from $[n]$ to $[N = n^3]$
choose $h$ at random from $\mathcal{H}$
$t \leftarrow \frac{c}{\epsilon^2}$
while (stream is not empty) do
&nbsp;&nbsp;&nbsp;&nbsp;$a_i$ is the current item
&nbsp;&nbsp;&nbsp;&nbsp;Update the smallest $t$ hash values seen so far with $h(a_i)$
endWhile
Let $v$ be the $t$'th smallest hash value seen.
Output $tN/v$.
We observe that the algorithm can be implemented by keeping track of $t$ hash values, each of which takes $O(\log n)$ space, and hence the space usage is $O(t \log n) = O(\frac{1}{\epsilon^2} \log n)$. The $t$ values can be stored in a binary search tree so that when a new item is processed we can update the data structure with $O(\log t)$ comparisons, which takes $O(\log \frac{1}{\epsilon} \log n)$ time per element.
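A Python sketch of the full algorithm, keeping the $t$ smallest distinct hash values in a max-heap (values are negated since `heapq` is a min-heap, which gives the same $O(\log t)$ updates as the search tree mentioned above). The constant $c$, the default $\epsilon$, and the fixed prime are illustrative choices, not the constants from the paper.

```python
import heapq
import random

def bjkst_estimate(stream, n, eps=0.25, c=48, seed=None):
    """Bar-Yossef et al. style estimator: hash into [1, N] with N = n^3,
    keep the t = c/eps^2 smallest distinct hash values, and output t*N/v
    where v is the t-th smallest. Space is O(t log n)."""
    rng = random.Random(seed)
    N = n ** 3
    p = 2 ** 61 - 1                      # prime, assumed >= N
    a, b = rng.randrange(1, p), rng.randrange(p)
    t = int(c / eps ** 2)

    heap, seen = [], set()               # max-heap (negated) + membership set
    for x in stream:
        hx = ((a * x + b) % p) % N + 1   # pairwise independent hash into [1, N]
        if hx in seen:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -hx)
            seen.add(hx)
        elif hx < -heap[0]:              # smaller than current t-th smallest
            evicted = -heapq.heappushpop(heap, -hx)
            seen.discard(evicted)
            seen.add(hx)
    if len(heap) < t:                    # fewer than t distinct hash values:
        return len(heap)                 # assuming injectivity, this is exact
    v = -heap[0]                         # the t-th smallest hash value
    return t * N / v
```

Note that a hash value evicted from the heap can never re-enter it (the heap's maximum only decreases), so the `seen` set of current heap contents suffices for deduplication.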
The intuition for the algorithm is as follows. As before let $d$ be the number of distinct elements and let them be $b_1, \ldots, b_d$. Think of the hash values $h(b_1), \ldots, h(b_d)$ as uniformly distributed in $[0,1]$. For the chosen $t$ we expect about $t$ of the hash values to fall in the interval $[0, t/d]$, so the $t$'th smallest hash value $v$ should be around $t/d$; this justifies $t/v$ as an estimator for $d$ (scaled to $tN/v$ when the range is $[1..N]$). Now we formally analyse the properties of this estimator.
We chose the hash family to map $[n]$ to $[n^3]$, and therefore with probability at least $1 - 1/n$ the random hash function $h$ is injective on $[n]$. We will assume this is indeed the case. Moreover, we will assume that $n \geq \sqrt{24/\epsilon}$, which implies in particular that $\epsilon t/(4d) \geq 1/N$. Let $d'$ be the estimate returned by the algorithm.
We first claim that $\Pr[d' < (1-\epsilon)d]$ is at most a small constant for a suitable choice of the constant $c$.

Proof: The values $h(b_1), \ldots, h(b_d)$ are uniformly distributed in $[1..N]$. If $d' < (1-\epsilon)d$ then $v > \frac{tN}{(1-\epsilon)d}$; that is, fewer than $t$ hash values fell in the interval $[1, \frac{tN}{(1-\epsilon)d}]$. Let $X_i$ be the indicator random variable for the event that $h(b_i) \leq (1+\epsilon)tN/d$ and let $Y = \sum_{i=1}^{d} X_i$. Since $\frac{tN}{(1-\epsilon)d} \geq (1+\epsilon)tN/d$, the event $d' < (1-\epsilon)d$ implies $Y < t$.

Since $h(b_i)$ is distributed uniformly in $[1..N]$, taking rounding errors into account, we have $(1+\epsilon/2)t/d \leq \mathbf{E}[X_i] \leq (1+3\epsilon/2)t/d$ and hence $\mathbf{E}[Y] \geq (1+\epsilon/2)t$. We have $\mathbf{Var}[Y] \leq \mathbf{E}[Y] \leq t(1+3\epsilon/2)$ (by pairwise independence, Lemma 1). As noted, $d' < (1-\epsilon)d$ only if $Y < t$, and by Chebyshev,
$$\Pr[Y < t] \leq \Pr\left[|Y - \mathbf{E}[Y]| \geq \epsilon t/2\right] \leq \frac{\mathbf{Var}[Y]}{(\epsilon t/2)^2} \leq \frac{4(1+3\epsilon/2)}{\epsilon^2 t} = \frac{4(1+3\epsilon/2)}{c},$$
which can be made a small constant by choosing $c$ large enough.
Next we claim that $\Pr[d' > (1+\epsilon)d]$ is also at most a small constant for a suitable choice of $c$.

Proof: Suppose $d' > (1+\epsilon)d$, that is, $v < \frac{tN}{(1+\epsilon)d}$. This implies that more than $t$ hash values are less than $\frac{tN}{(1+\epsilon)d} \leq (1-\epsilon/2)tN/d$. We will show that this happens with small probability.

Let $X_i$ be the indicator random variable for the event $h(b_i) < (1-\epsilon/2)tN/d$ and let $Y = \sum_{i=1}^{d} X_i$. We have $\mathbf{E}[X_i] < (1-\epsilon/2)t/d + 1/N \leq (1-\epsilon/4)t/d$ (the $1/N$ accounts for rounding errors). Hence $\mathbf{E}[Y] \leq (1-\epsilon/4)t$, and $\mathbf{Var}[Y] \leq \mathbf{E}[Y] \leq t$ by pairwise independence. As we argued, $d' > (1+\epsilon)d$ happens only if $Y > t$. By Chebyshev,
$$\Pr[Y > t] \leq \Pr\left[|Y - \mathbf{E}[Y]| \geq \epsilon t/4\right] \leq \frac{\mathbf{Var}[Y]}{(\epsilon t/4)^2} \leq \frac{16}{\epsilon^2 t} = \frac{16}{c},$$
which again can be made a small constant by choosing $c$ large enough.
Bibliographic Notes: Our description of the AMS algorithm is taken from the notes of Amit Chakrabarti. Flajolet and Martin proposed the basic hash-function-based algorithm in[5]. Kane, Nelson and Woodruff[6] described an $(\epsilon, \delta_0)$-approximation for a fixed $\delta_0$ in space $O(\frac{1}{\epsilon^2} + \log n)$ (the time per update is also the same). This is a theoretically optimal algorithm since there are lower bounds of $\Omega(\frac{1}{\epsilon^2})$ and $\Omega(\log n)$ on the space required for an $\epsilon$-approximation. We refer the reader to[7],[8] for analysis and empirical evaluation of an algorithm called HyperLogLog which seems to perform very well in practice.
You can read the notes from the next lecture from Chandra Chekuri's course on Estimating F_k norms via AMS sampling here.
[1] Many streaming papers flip the notation and use $n$ to denote the length of the stream and $m$ to denote the range of the integers in the stream.

[2] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143-154, 1979. Preliminary version in Proc. ACM STOC, 1977.

[3] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999. Preliminary version in Proc. ACM STOC, 1996.

[4] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Randomization and Approximation Techniques in Computer Science, pages 1-10. Springer, 2002.

[5] Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182-209, 1985.

[6] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 41-52. ACM, 2010.

[7] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Analysis of Algorithms 2007 (AofA07), pages 127-146, 2007.

[8] Stefan Heule, Marc Nunkesser, and Alex Hall. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the EDBT 2013 Conference, Genoa, Italy, 2013.