# Estimating the Number of Distinct Elements in a Stream

You can read the notes from the previous lecture of Chandra Chekuri's course on Basics of Probability, Probabilistic Counting, and Reservoir Sampling here.

## 1. Estimating Frequency Moments in Streams

A significant fraction of the streaming literature is on the problem of estimating frequency moments. Let $\sigma = a_{1}, a_{2}, \ldots, a_{m}$ be a stream of numbers where each $a_{i}$ is an integer between $1$ and $n$. We will try to stick to the notation of using $m$ for the length of the stream and $n$ for the range of the integers[1]. Let $f_{i}$ be the number of occurrences (or frequency) of integer $i$ in the stream. We let $\mathbf{f} = \left(f_{1}, f_{2}, \ldots, f_{n}\right)$ be the frequency vector for a given stream $\sigma$. For $k \geq 0$, $F_{k}(\sigma)$ is defined to be the $k$'th frequency moment of $\sigma$:

$$F_{k} = \sum_{i} f_{i}^{k}.$$

We will discuss several algorithms to estimate $F_{k}$ for various values of $k$. For instance $F_{0}$ is simply the number of distinct elements in $\sigma$. Note that $F_{1} = \sum_{i} f_{i} = m$, the length of the stream. As $k$ increases to $\infty$, $F_{k}$ will concentrate on the most frequent element and we can think of $F_{\infty}$ as finding the most frequent element.
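
To make the definition concrete, the following is a minimal sketch that computes $F_{k}$ exactly by storing every frequency. The function name `frequency_moment` and the sample stream are illustrative assumptions, not part of the notes, and the approach uses memory proportional to the number of distinct elements, which is exactly what streaming algorithms try to avoid.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Return F_k = sum_i f_i^k over the elements that occur in the stream.

    Only elements with f_i > 0 are summed, so for k = 0 the result is the
    number of distinct elements.
    """
    freqs = Counter(stream)  # f_i for every i that appears at least once
    return sum(f ** k for f in freqs.values())


stream = [3, 1, 3, 2, 3, 1]  # hypothetical stream with m = 6
print(frequency_moment(stream, 0))  # 3 distinct elements
print(frequency_moment(stream, 1))  # 6, the length of the stream
print(frequency_moment(stream, 2))  # 3^2 + 2^2 + 1^2 = 14
```
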

Definition 1 Let $\mathcal{A}(\sigma)$ be the real-valued output of a randomized streaming algorithm on stream $\sigma$. We say that $\mathcal{A}$ provides an $(\alpha, \beta)$-approximation for a real-valued function $g$ if

$$\operatorname{Pr}\left[\left|\frac{\mathcal{A}(\sigma)}{g(\sigma)}-1\right|>\alpha\right] \leq \beta$$

for all $\sigma$.

Our ideal goal is to obtain an $(\epsilon, \delta)$-approximation for any given $\epsilon, \delta \in (0,1)$.

## 2. Background on Hashing

Hashing techniques play a fundamental role in streaming, in particular for estimating frequency moments. We will briefly review hashing from a theoretical point of view and in particular $k$-universal hashing.

A hash function maps a finite universe $\mathcal{U}$ to some range $\mathcal{R}$. Typically the range is the set of integers $[0..L-1]$ for some finite $L$ (here $L$ is the number of buckets in the hash table). Sometimes it is convenient to consider, for the sake of developing intuition, hash functions that map $\mathcal{U}$ to the continuous interval $[0,1]$. We will, in general, be working with a family of hash functions $\mathcal{H}$ and $h$ will be drawn from $\mathcal{H}$ uniformly at random; the analysis of the algorithm will be based on properties of $\mathcal{H}$. We would like $\mathcal{H}$ to have two important and contradictory properties:

• a random function from $\mathcal{H}$ should behave like a completely random function from $\mathcal{U}$ to the range.
• $\mathcal{H}$ should have nice computational properties:
  • a uniformly random function from $\mathcal{H}$ should be easy to sample
  • any function $h \in \mathcal{H}$ should have small representation size so that it can be stored compactly
  • it should be efficient to evaluate $h$

Definition 2 A collection of random variables $X_{1}, X_{2}, \ldots, X_{n}$ is $k$-wise independent if the variables $X_{i_{1}}, X_{i_{2}}, \ldots, X_{i_{k}}$ are independent for any set of distinct indices $i_{1}, i_{2}, \ldots, i_{k}$.

For example, take three random variables $\left\{X_{1}, X_{2}, X_{3}\right\}$ where $X_{1}, X_{2}$ are independent uniform $\{0,1\}$ random variables and $X_{3} = X_{1} \oplus X_{2}$. It is easy to check that the three variables are pairwise independent although they are not all independent.
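
The XOR example can be checked by brute force. The sketch below enumerates the four equally likely outcomes of $(X_{1}, X_{2})$ and compares joint probabilities with products of marginals; the helper names are hypothetical and exact arithmetic is done with `fractions.Fraction`.

```python
from fractions import Fraction
from itertools import product

# The four equally likely outcomes (x1, x2, x3) with x3 = x1 XOR x2.
OUTCOMES = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]


def prob(event):
    """Probability of an event under the uniform distribution on OUTCOMES."""
    return Fraction(sum(1 for o in OUTCOMES if event(o)), len(OUTCOMES))


def is_pairwise_independent():
    """Check Pr[Xi=u, Xj=v] = Pr[Xi=u] * Pr[Xj=v] for every pair i < j."""
    for i, j in ((0, 1), (0, 2), (1, 2)):
        for u, v in product((0, 1), repeat=2):
            joint = prob(lambda o: o[i] == u and o[j] == v)
            if joint != prob(lambda o: o[i] == u) * prob(lambda o: o[j] == v):
                return False
    return True


def is_mutually_independent():
    """Check the product rule for all three variables at once."""
    for u, v, w in product((0, 1), repeat=3):
        joint = prob(lambda o: o == (u, v, w))
        marginals = (prob(lambda o: o[0] == u)
                     * prob(lambda o: o[1] == v)
                     * prob(lambda o: o[2] == w))
        if joint != marginals:
            return False
    return True


print(is_pairwise_independent())  # True
print(is_mutually_independent())  # False: e.g. (0, 0, 1) never occurs
```
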

Following the work of Carter and Wegman[2], the class of $k$-universal hash families, in particular for $k=2$, provides an excellent tradeoff. $\mathcal{H}$ is strongly 2-universal if the following properties hold for a random function $h$ picked from $\mathcal{H}$: (i) for every $x \in \mathcal{U}$, $h(x)$ (which is a random variable) is uniformly distributed over the range and (ii) for every distinct pair $x, y \in \mathcal{U}$, $h(x)$ and $h(y)$ are independent. 2-universal hash families are also called pairwise independent hash families. A weakly 2-universal family satisfies the property that $\operatorname{Pr}[h(x)=h(y)] = 1/L$ for any distinct $x, y$. We state an important observation about pairwise independent random variables.

Lemma 1 Let $Y = \sum_{i=1}^{h} X_{i}$ where $X_{1}, X_{2}, \ldots, X_{h}$ are pairwise independent. Then

$$\mathbf{Var}[Y] = \sum_{i=1}^{h} \mathbf{Var}\left[X_{i}\right].$$

Moreover if the $X_{i}$ are binary/indicator random variables then

$$\mathbf{Var}[Y] \leq \sum_{i} \mathbf{E}\left[X_{i}^{2}\right] = \sum_{i} \mathbf{E}\left[X_{i}\right] = \mathbf{E}[Y].$$
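
The notes state Lemma 1 without proof. A short derivation, using only linearity of expectation and the fact that pairwise independence gives $\mathbf{E}[X_{i} X_{j}] = \mathbf{E}[X_{i}]\,\mathbf{E}[X_{j}]$ for $i \neq j$, could go as follows:

$$
\begin{aligned}
\mathbf{Var}[Y]
&= \mathbf{E}\Big[\Big(\sum_{i} \left(X_{i} - \mathbf{E}[X_{i}]\right)\Big)^{2}\Big] \\
&= \sum_{i} \mathbf{Var}[X_{i}] + \sum_{i \neq j} \mathbf{E}\big[\left(X_{i} - \mathbf{E}[X_{i}]\right)\left(X_{j} - \mathbf{E}[X_{j}]\right)\big] \\
&= \sum_{i} \mathbf{Var}[X_{i}],
\end{aligned}
$$

since every cross term equals $\mathbf{E}[X_{i} X_{j}] - \mathbf{E}[X_{i}]\,\mathbf{E}[X_{j}] = 0$. For an indicator variable, $X_{i}^{2} = X_{i}$, so $\mathbf{Var}[X_{i}] = \mathbf{E}[X_{i}^{2}] - \mathbf{E}[X_{i}]^{2} \leq \mathbf{E}[X_{i}^{2}] = \mathbf{E}[X_{i}]$, which gives the second claim.
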

There is a simple and nice construction of pairwise independent hash functions. Let $p$ be a prime number such that $p \geq |\mathcal{U}|$. Recall that $Z_{p} = \{0, 1, \ldots, p-1\}$ forms a field under the standard addition and multiplication modulo $p$. For each $a, b \in [p]$ we can define a hash function $h_{a,b}$ where $h_{a,b}(x) = (ax + b) \bmod p$. Let $\mathcal{H} = \left\{h_{a,b} \mid a, b \in [p]\right\}$. We can see that we only need to store two numbers $a, b$ of $\Theta(\log p)$ bits to implicitly store $h_{a,b}$, and evaluation of $h_{a,b}(x)$ takes one addition and one multiplication of $\log p$ bit numbers. Moreover, sampling a random hash function from $\mathcal{H}$ requires sampling $a, b$, which is also easy. We claim that $\mathcal{H}$ is a pairwise independent family. You can verify this by the observation that for distinct $x, y$ and any $i, j$ the two equations $ax + b = i$ and $ay + b = j$ have a unique solution $a, b$ that satisfies them simultaneously. Note that if $a = 0$ the hash function is pretty useless; all elements get mapped to $b$. Nevertheless, for $\mathcal{H}$ to be pairwise independent one needs to include those hash functions, but the probability that $a = 0$ is only $1/p$ while there are $p^{2}$ functions in $\mathcal{H}$. If one only wants a weakly universal hash family we can pick $a$ from $[1..(p-1)]$. The range of the hash function is $[p]$. To restrict the range to $L$ we let $h_{a,b}^{\prime}(x) = ((ax + b) \bmod p) \bmod L$.
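
The "unique solution" argument can be verified exhaustively for a small prime. The sketch below is only illustrative (the function names are assumptions): it builds all $p^{2}$ functions $h_{a,b}$ and checks that, for every pair of distinct inputs, each pair of outputs is produced by exactly one $(a, b)$.

```python
import random
from collections import Counter
from itertools import product


def make_hash(a, b, p):
    """Return h_{a,b}(x) = (a*x + b) mod p."""
    return lambda x: (a * x + b) % p


def sample_hash(p, rng=random):
    """Sample h uniformly from H by drawing a, b uniformly from {0, ..., p-1}."""
    return make_hash(rng.randrange(p), rng.randrange(p), p)


def is_pairwise_independent(p):
    """Check that (h(x), h(y)) is uniform over Z_p x Z_p for all distinct x, y."""
    for x, y in product(range(p), repeat=2):
        if x == y:
            continue
        counts = Counter(
            ((a * x + b) % p, (a * y + b) % p)
            for a, b in product(range(p), repeat=2)
        )
        # Exactly one (a, b) must produce each of the p^2 target pairs (i, j).
        if len(counts) != p * p or set(counts.values()) != {1}:
            return False
    return True


print(is_pairwise_independent(7))  # True: 7 is prime
print(is_pairwise_independent(6))  # False: Z_6 is not a field
```

The failing case with modulus 6 shows why the construction needs $p$ to be prime: without a field, the two linear equations can have several solutions or none.
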

More generally we will say that $\mathcal{H}$ is $k$-universal if every element is uniformly distributed in the range and for any $k$ elements $x_{1}, \ldots, x_{k}$ the random variables $h\left(x_{1}\right), \ldots, h\left(x_{k}\right)$ are independent. Assuming $\mathcal{U}$ is the set of integers $[0..|\mathcal{U}|]$, for any fixed $k$ there exist constructions for $k$-universal hash families such that every hash function $h$ in the family can be stored using $O(k \log |\mathcal{U}|)$ bits (essentially $k$ numbers) and $h$ can be evaluated using $O(k)$ arithmetic operations on $\log |\mathcal{U}|$ bit numbers. We will ignore specific details of the implementations and refer the reader to the considerable literature on hashing for further details.
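
The notes do not name a specific $k$-universal construction. One standard choice, assumed here purely for illustration, is a random polynomial of degree at most $k-1$ over $Z_{p}$: it is described by $k$ coefficients and evaluated with $O(k)$ arithmetic operations via Horner's rule, matching the bounds quoted above. The sketch checks $k$-wise independence exhaustively for small parameters; all names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product


def poly_hash(coeffs, p):
    """Return h(x) = (c_0 + c_1 x + ... + c_{k-1} x^{k-1}) mod p."""
    def h(x):
        value = 0
        for c in reversed(coeffs):  # Horner's rule, highest coefficient first
            value = (value * x + c) % p
        return value
    return h


def is_k_wise_independent(p, k):
    """Check that (h(x_1), ..., h(x_k)) is uniform over Z_p^k for distinct x_i."""
    family = [poly_hash(c, p) for c in product(range(p), repeat=k)]
    for xs in combinations(range(p), k):
        counts = Counter(tuple(h(x) for x in xs) for h in family)
        # Each of the p^k output tuples must come from exactly one polynomial.
        if len(counts) != p ** k or set(counts.values()) != {1}:
            return False
    return True


print(is_k_wise_independent(5, 3))  # True
```

With $k = 2$ this reduces to the family $h_{a,b}(x) = (ax + b) \bmod p$ from the previous paragraph.
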

## 3. Estimating Number of Distinct Elements

A lower bound on exact counting deterministic algorithms: We argue that any deterministic streaming algorithm that counts the number of distinct elements exactly needs $\Omega(n)$ bits. To see this, suppose there is an algorithm $\mathcal{A}$ that uses strictly less than $n$ bits. Consider the $h = 2^{n}$ different streams $\sigma_{S}$ where $S \subseteq [n]$; $\sigma_{S}$ consists of the elements of $S$ in some arbitrary order. Since $\mathcal{A}$ uses $n-1$ bits or less, there must be two distinct sets $S_{1}, S_{2}$ such that the state of $\mathcal{A}$ at the end of $\sigma_{S_{1}}$ and $\sigma_{S_{2}}$ is identical. Since $S_{1}, S_{2}$ are distinct there is an element $i$ in $S_{1} \setminus S_{2}$ or $S_{2} \setminus S_{1}$; wlog it is the former. If $\mathcal{A}$ is correct on $\sigma_{S_{1}}$ and $\sigma_{S_{2}}$ themselves then $|S_{1}| = |S_{2}|$, because identical states give identical outputs. After appending $i$, $\mathcal{A}$ is again in the same state on both streams and outputs the same number, but the true counts are $|S_{1}|$ and $|S_{2}| + 1$. Hence $\mathcal{A}$ cannot give the right count for at least one of the two streams $\langle \sigma_{S_{1}}, i \rangle$, $\langle \sigma_{S_{2}}, i \rangle$.

The basic hashing idea: We now discuss a simple high-level idea for estimating the number of distinct elements in the stream. Suppose $h$ is an idealized random hash function that maps $[1..n]$ to the interval $[0,1]$. Suppose there are $d$ distinct elements in the stream $\sigma = a_{1}, a_{2}, \ldots, a_{m}$. If $h$ behaves like a random function then the set $\left\{h\left(a_{1}\right), \ldots, h\left(a_{m}\right)\right\}$ will behave like a collection of $d$ independent uniformly distributed random variables in $[0,1]$. Let $\theta = \min \left\{h\left(a_{1}\right), \ldots, h\left(a_{m}\right)\right\}$; the expectation of $\theta$ is $\frac{1}{d+1}$ and hence $1/\theta$ is a natural estimator for $d$. In the stream setting we can compute $\theta$ by hashing each incoming value and keeping track of the minimum. We only need to have one number in memory. Although simple, the algorithm assumes idealized hash functions, and we only control the expectation of $\theta$, with no bound yet on how far the estimate can stray. Converting the idea to an implementable algorithm with proper guarantees requires work. There are several papers on this problem and we will now discuss some of the approaches.
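
The claim $\mathbf{E}[\theta] = \frac{1}{d+1}$ can be checked empirically. In the sketch below the idealized hash function is simulated with a lookup table of fresh uniform values, which is an assumption made only for intuition: such a table needs memory proportional to $d$ and so is not itself a streaming algorithm. The function names and the sample stream are hypothetical.

```python
import random


def min_hash_value(stream, rng):
    """Return theta = min_i h(a_i) for a simulated idealized h: [1..n] -> [0, 1]."""
    table = {}    # simulated random function; NOT streaming-friendly
    theta = 1.0
    for a in stream:
        if a not in table:
            table[a] = rng.random()  # fresh uniform value per distinct element
        theta = min(theta, table[a])
    return theta


def average_theta(stream, trials, seed=0):
    """Average theta over independent draws of the simulated hash function."""
    rng = random.Random(seed)
    return sum(min_hash_value(stream, rng) for _ in range(trials)) / trials


stream = list(range(1, 10)) * 2  # hypothetical stream: d = 9, each element twice
print(average_theta(stream, trials=20000))  # should be close to 1 / (d + 1) = 0.1
```

A single value of $\theta$ fluctuates a lot around its mean, which is why the idea needs further work before it yields an $(\epsilon, \delta)$-approximation.
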

### 3.1. The AMS algorithm

Here we describe an algorithm with better parameters but it only gives a constant factor approximation. This is due to Alon, Matias and Szegedy in their famous paper[3] on estimating frequency moments. We need some notation. For an integer $t > 0$ let $\operatorname{zeros}(t)$ denote the number of zeros that the binary representation of $t$ ends in; equivalently

$$\operatorname{zeros}(t) = \max \left\{i : 2^{i} \text{ divides } t\right\}.$$

$$
\begin{array}{l}
\underline{\text{AMS-DistinctElements:}} \\
\mathcal{H} \text{ is a 2-universal hash family from } [n] \text{ to } [n] \\
\text{choose } h \text{ at random from } \mathcal{H} \\
z \leftarrow 0 \\
\text{While (stream is not empty) do} \\
\quad a_{i} \text{ is current item} \\
\quad z \leftarrow \max \left\{z, \operatorname{zeros}\left(h\left(a_{i}\right)\right)\right\} \\
\text{endWhile} \\
\text{Output } 2^{z+\frac{1}{2}}
\end{array}
$$
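
The pseudocode above translates fairly directly into code. The following is a sketch under several assumptions that the notes do not fix: the hash is $h^{\prime}_{a,b}(x) = ((ax+b) \bmod p) \bmod n$ from Section 2 with the prime $p = 2^{31}-1$, $n$ is a power of $2$, and a hash value of $0$ is treated as having $\log_2 n$ trailing zeros. Restricting the range with $\bmod\ n$ makes the family only approximately pairwise independent, so this illustrates the control flow rather than the exact analysis.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; assumed larger than every stream element


def zeros(t):
    """Number of trailing zeros in the binary representation of t > 0."""
    return (t & -t).bit_length() - 1


def ams_distinct_elements(stream, n, rng=random):
    """One run of AMS-DistinctElements; n is assumed to be a power of 2."""
    a, b = rng.randrange(P), rng.randrange(P)  # choose h at random from H
    log_n = n.bit_length() - 1
    z = 0
    for x in stream:
        hx = ((a * x + b) % P) % n             # h'(x), restricted to n values
        # Convention assumed here: a hash value of 0 counts as log2(n) zeros.
        z = max(z, zeros(hx) if hx else log_n)
    return 2 ** (z + 0.5)


stream = [5, 7, 5, 9, 7, 5, 11, 9]  # hypothetical stream with 4 distinct elements
print(ams_distinct_elements(stream, n=1024, rng=random.Random(1)))
```

The only state kept between items is `a`, `b` and `z`, which is where the $O(\log n)$ space bound comes from. Repeated items never change `z` beyond their first occurrence, so the output depends only on the set of distinct elements.
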

First, we note that the space and time per element are $O(\log n)$. We now analyze the quality of the approximation provided by the output. Recall that $h\left(a_{j}\right)$ is uniformly distributed in $[n]$. We will assume for simplicity that $n$ is a power of $2$.

Let $d$ denote the number of distinct elements in the stream and let them be $b_{1}, b_{2}, \ldots, b_{d}$. For a given $r$ let $X_{r,j}$ be the indicator random variable that is $1$ if $\operatorname{zeros}\left(h\left(b_{j}\right)\right) \geq r$. Let $Y_{r} = \sum_{j} X_{r,j}$. That is, $Y_{r}$ is the number of distinct elements whose hash values have at least $r$ trailing zeros.

Since $h\left(b_{j}\right)$ is uniformly distributed in $[n]$,

$$\mathbf{E}\left[X_{r,j}\right] = \operatorname{Pr}\left[\operatorname{zeros}\left(h\left(b_{j}\right)\right) \geq r\right] = \frac{\left(n / 2^{r}\right)}{n} = \frac{1}{2^{r}}.$$

Therefore

$$\mathbf{E}\left[Y_{r}\right] = \sum_{j} \mathbf{E}\left[X_{r,j}\right] = \frac{d}{2^{r}}.$$

Thus we have $\mathbf{E}\left[Y_{\log d}\right] = 1$ (assuming $d$ is a power of $2$).

Now we compute the variance of $Y_{r}$. Note that the variables $X_{r,j}$ and $X_{r,j^{\prime}}$ are independent for $j \neq j^{\prime}$ since $\mathcal{H}$ is 2-universal, so the $X_{r,j}$ are pairwise independent. Hence, by Lemma 1,

$$\mathbf{Var}\left[Y_{r}\right] = \sum_{j} \mathbf{Var}\left[X_{r,j}\right] \leq \sum_{j} \mathbf{E}\left[X_{r,j}^{2}\right] = \sum_{j} \mathbf{E}\left[X_{r,j}\right] = \frac{d}{2^{r}}.$$

Using Markov's inequality

$$\operatorname{Pr}\left[Y_{r} > 0\right] = \operatorname{Pr}\left[Y_{r} \geq 1\right] \leq \mathbf{E}\left[Y_{r}\right] \leq \frac{d}{2^{r}}.$$

Using Chebyshev's inequality, and noting that $Y_{r} = 0$ implies $\left|Y_{r} - \mathbf{E}\left[Y_{r}\right]\right| \geq \mathbf{E}\left[Y_{r}\right] = \frac{d}{2^{r}}$,

$$\operatorname{Pr}\left[Y_{r} = 0\right] \leq \operatorname{Pr}\left[\left|Y_{r} - \mathbf{E}\left[Y_{r}\right]\right| \geq \frac{d}{2^{r}}\right] \leq \frac{\mathbf{Var}\left[Y_{r}\right]}{\left(d / 2^{r}\right)^{2}} \leq \frac{2^{r}}{d}.$$

Let $z^{\prime}$ be the value of $z$ at the end of the stream and let $d^{\prime} = 2^{z^{\prime}+\frac{1}{2}}$ be the estimate for $d$ output by the algorithm. We claim that $d^{\prime}$ cannot be too large compared to $d$ with constant probability. Let $a$ be the smallest integer such that $2^{a+\frac{1}{2}} \geq 3d$. Then,

$$\operatorname{Pr}\left[d^{\prime} \geq 3d\right] = \operatorname{Pr}\left[Y_{a} > 0\right] \leq \frac{d}{2^{a}} \leq \frac{\sqrt{2}}{3}.$$

Now we claim that $d^{\prime}$ is not too small compared to $d$ with constant probability. For this let $b$ be the largest integer such that $2^{b+\frac{1}{2}} \leq d/3$. Then,

$$\operatorname{Pr}\left[d^{\prime} \leq d/3\right] = \operatorname{Pr}\left[Y_{b+1} = 0\right] \leq \frac{2^{b+1}}{d} \leq \frac{\sqrt{2}}{3}.$$
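
To see how the thresholds $a$ and $b$ are chosen, here is a worked instance with a hypothetical value $d = 10$ (so $3d = 30$ and $d/3 \approx 3.33$); the number is chosen only for illustration.

$$
\begin{aligned}
a = 5: &\quad 2^{4.5} \approx 22.6 < 30 \leq 2^{5.5} \approx 45.3, &\quad \frac{d}{2^{a}} = \frac{10}{32} \approx 0.31 \leq \frac{\sqrt{2}}{3}, \\
b = 1: &\quad 2^{1.5} \approx 2.83 \leq \tfrac{10}{3} < 2^{2.5} \approx 5.66, &\quad \frac{2^{b+1}}{d} = \frac{4}{10} = 0.4 \leq \frac{\sqrt{2}}{3}.
\end{aligned}
$$

So for $d = 10$ the output is too large only if $z^{\prime} \geq 5$, that is, some distinct element hashes to a value with at least five trailing zeros, and too small only if $z^{\prime} \leq 1$, that is, no distinct element hashes to a value with at least two trailing zeros.
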

Thus, the algorithm provides a $(1/3, \sqrt{2}/3 \simeq 0.4714)$-approximation to the number of distinct elements, in the sense that each of the two bad events $d^{\prime} \geq 3d$ and $d^{\prime} \leq d/3$ has probability at most $\sqrt{2}/3 < 1/2$. Using the median trick we can make the probability of success be at least $(1-\delta)$ to obtain a $(1/3, \delta)$-approximation by running $O\left(\log \frac{1}{\delta}\right)$ parallel and independent copies of the algorithm. The time and space will be $O\left(\log \frac{1}{\delta} \log n\right)$.
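
The median trick itself is independent of the estimator being boosted. The sketch below illustrates it with a toy stand-in estimator (a hypothetical `make_toy_estimator`, not the AMS algorithm) that overshoots with probability $0.3$ and undershoots with probability $0.3$; because each one-sided failure probability is below $1/2$, the median of many independent copies lands in the good range with high probability.

```python
import random
import statistics


def median_of_copies(run_once, copies):
    """Run `copies` independent estimators and return the median of their outputs."""
    return statistics.median(run_once() for _ in range(copies))


def make_toy_estimator(d, rng, p_high=0.3, p_low=0.3):
    """Hypothetical estimator for d that fails on each side with probability < 1/2."""
    def run_once():
        u = rng.random()
        if u < p_high:
            return 10 * d    # overshoots: larger than 3d
        if u < p_high + p_low:
            return d / 10    # undershoots: smaller than d/3
        return d             # within the factor-3 window
    return run_once


d = 1000
estimator = make_toy_estimator(d, random.Random(0))
print(median_of_copies(estimator, copies=201))
```

The median is too large only if more than half of the copies overshoot, and too small only if more than half undershoot; a Chernoff bound makes both events exponentially unlikely in the number of copies, which is where the $O\left(\log \frac{1}{\delta}\right)$ factor comes from.
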

### 3.2. A $(1-\epsilon)$-approximation in $O\left(\frac{1}{\epsilon^{2}} \log n\right)$ space

Bar-Yossef et al.[4] described three algorithms for distinct elements, the last of which uses $\tilde{O}\left(\frac{1}{\epsilon^{2}} + \log n\right)$ space and amortized time per element $O\left(\log n + \log \frac{1}{\epsilon}\right)$; the notation $\tilde{O}$ suppresses dependence on $\log \log n$