Basics of Probability, Probabilistic Counting (Morris's algorithm), and Reservoir Sampling

1. Streaming/One-Pass Model of Computation

In the streaming model we assume that the input data arrives as a stream. More formally, the data consists of $m$ items/tokens/objects $a_1, a_2, \ldots, a_m$. A simple case is when each token is a number in the range $[1..n]$. Another example is when each token represents an edge in a graph, given as a pair $(u, v)$ where $u$ and $v$ are integers representing node indices. The tokens arrive one by one in order. The algorithm has to process token $a_i$ before it sees the next token. If we are allowed to store all the tokens then we are in the standard model of computation where we have access to the whole input. However, we will assume that the space available to the algorithm is much less than $m$ tokens; typically sub-linear in $m$, or in the ideal scenario $\mathrm{polylog}(m)$. Our goal is to (approximately) compute/estimate various quantities of interest using such limited space. Surprisingly, one can compute various useful functions of interest. Some parameters of interest for a streaming algorithm are:
  • space used by the algorithm as a function of the stream length $m$ and the nature of the tokens
  • the worst-case time to process each token
  • the total time to process all the tokens (equivalently the amortized time to process a token)
  • the accuracy of the output
  • the probability of success in terms of obtaining a desired approximation (assuming that the algorithm is randomized)
There are several application areas that motivate the streaming model. In networking, a large number of packets transit a switch and we may want to analyze some statistics about the packets. The number of packets transiting the switch is vastly more than what can be stored for offline processing. In large databases the data is stored in disks and random access is not feasible; the size of the main memory is much smaller than the storage available on the disk. A one-pass or multi-pass streaming algorithm allows one to avoid sophisticated data structures on the disk. There are also applications where the data is distributed in several places and we need to process them separately and combine the results with low communication overhead.
In database applications and such it also makes sense to discuss the multi-pass model or the semi-streaming model where one gets to make several passes over the data. Typically we will assume that the number of passes is a small constant.
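To make the one-pass model concrete, here is a minimal sketch (our own illustration, not part of the notes) of an algorithm that sees each token exactly once and keeps only constant state:

```python
# A minimal illustration of the one-pass model: the algorithm sees each
# token exactly once, with no random access, and keeps only small state --
# here a count and a running maximum instead of the whole stream.
def one_pass(stream):
    m, maximum = 0, None
    for token in stream:
        # process the current token before the next one arrives
        m += 1
        maximum = token if maximum is None else max(maximum, token)
    return m, maximum

print(one_pass(iter([3, 1, 4, 1, 5])))  # (5, 5)
```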

2. Background on Probability and Inequalities

The course will rely heavily on probabilistic methods. We will mostly rely on discrete probability spaces. We will keep the discussion high-level where possible and use certain results in a black-box fashion.
Let $\Omega$ be a finite set. A probability measure $p$ assigns a non-negative number $p(\omega)$ to each $\omega \in \Omega$ such that $\sum_{\omega \in \Omega} p(\omega) = 1$. The tuple $(\Omega, p)$ defines a discrete probability space; an event in this space is any subset $A \subseteq \Omega$ and the probability of an event is simply $p(A) = \sum_{\omega \in A} p(\omega)$. When $\Omega$ is a continuous space such as the interval $[0,1]$ things get trickier and we need to talk about measure spaces and $\sigma$-algebras over $\Omega$; we can only assign probability to certain subsets of $\Omega$. We will not go into details since we will not need any formal machinery for what we do in this course.
An important definition is that of a random variable. We will focus only on real-valued random variables in this course. A random variable $X$ in a probability space is a function $X: \Omega \rightarrow \mathbb{R}$. In the discrete setting the expectation of $X$, denoted by $\mathbf{E}[X]$, is defined as $\sum_{\omega \in \Omega} p(\omega) X(\omega)$. For continuous spaces $\mathbf{E}[X] = \int X(\omega)\, dp(\omega)$ with an appropriate definition of the integral. The variance of $X$, denoted by $\mathbf{Var}[X]$ or $\sigma_X^2$, is defined as $\mathbf{E}\left[(X - \mathbf{E}[X])^2\right]$. The standard deviation is $\sigma_X$, the square root of the variance.
Theorem 1 (Markov's Inequality) Let $X$ be a non-negative random variable such that $\mathbf{E}[X]$ is finite. Then for any $t > 0$, $\Pr[X \geq t] \leq \mathbf{E}[X]/t$.
Proof: The proof is in some sense obvious, especially in the discrete case. Here is a sketch. Define a new random variable $Y$ where $Y(\omega) = X(\omega)$ if $X(\omega) < t$ and $Y(\omega) = t$ if $X(\omega) \geq t$. $Y$ is non-negative and $Y \leq X$ point-wise, hence $\mathbf{E}[Y] \leq \mathbf{E}[X]$. We also see that:
$$\begin{aligned} \mathbf{E}[X] \geq \mathbf{E}[Y] &= \sum_{\omega: X(\omega) < t} X(\omega)\, p(\omega) + \sum_{\omega: X(\omega) \geq t} t\, p(\omega) \\ &\geq t \sum_{\omega: X(\omega) \geq t} p(\omega) \quad \text{(since $X$ is non-negative)} \\ &= t \Pr[X \geq t]. \end{aligned}$$
The continuous case follows by replacing sums by integrals.
Markov's inequality is tight under its assumptions; you should construct an example showing this. The more information we have about a random variable, the better we can bound its deviation from the expectation.
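A tight example is easy to write down; the following sketch (our own) checks a two-point distribution that meets Markov's bound with equality:

```python
# Tightness of Markov's inequality (our example): X = t with probability
# mu/t and X = 0 otherwise, for any 0 < mu <= t. Then Pr[X >= t] = mu/t
# matches E[X]/t exactly.
mu, t = 2.0, 10.0
p_hit = mu / t                  # Pr[X = t]
expectation = t * p_hit         # E[X] = t * (mu/t) = mu
tail = p_hit                    # Pr[X >= t]
markov_bound = expectation / t  # E[X]/t
print(tail, markov_bound)  # equal: the inequality holds with equality
```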
Theorem 2 (Chebyshev's Inequality) Let $X$ be a random variable with $\mathbf{E}[X]$ and $\mathbf{Var}[X]$ finite. Then $\Pr[|X| \geq t] \leq \mathbf{E}[X^2]/t^2$ and $\Pr[|X - \mathbf{E}[X]| \geq t\sigma_X] \leq 1/t^2$.
Proof: Consider the non-negative random variable $Y = X^2$. $\Pr[|X| \geq t] = \Pr[Y \geq t^2]$ and we apply Markov's inequality to the latter. The second inequality follows similarly by considering $Y = (X - \mathbf{E}[X])^2$.
Chernoff-Hoeffding Bounds: We will use, several times, various forms of the Chernoff-Hoeffding bounds, which apply to a random variable that is a finite sum of bounded and independent random variables. There are several versions of these bounds. First we state a general bound that applies to non-negative random variables and is dimension-free in that it depends only on the expectation rather than the number of variables.
Theorem 3 (Chernoff-Hoeffding) Let $X_1, X_2, \ldots, X_n$ be independent binary random variables and let $a_1, a_2, \ldots, a_n$ be coefficients in $[0,1]$. Let $X = \sum_i a_i X_i$. Then
  • For any $\mu \geq \mathbf{E}[X]$ and any $\delta > 0$, $\Pr[X > (1+\delta)\mu] \leq \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}$.
  • For any $\mu \leq \mathbf{E}[X]$ and any $\delta > 0$, $\Pr[X < (1-\delta)\mu] \leq e^{-\mu\delta^2/2}$.
The following corollary bounds the deviation from the mean in both directions.
Corollary 4 Under the conditions of Theorem 3, the following hold:
  • If $\delta > 2e - 1$, $\Pr[X \geq (1+\delta)\mu] \leq 2^{-(1+\delta)\mu}$.
  • For any $U$ there is a constant $c(U)$ such that for $0 < \delta < U$, $\Pr[X \geq (1+\delta)\mu] \leq e^{-c(U)\delta^2\mu}$. In particular, combining with the lower tail bound,
$$\Pr[|X - \mu| \geq \delta\mu] \leq 2 e^{-c(U)\delta^2\mu}.$$
We refer the reader to the standard books on randomized algorithms [1] and [2] for the derivation of the above bounds.
If we are interested only in the upper tail we also have the following bounds, which show the dependence of $\mu$ on $n$ needed to obtain an inverse polynomial probability.
Corollary 5 Under the conditions of Theorem 3, there is a universal constant $\alpha$ such that for any $\mu \geq \max\{1, \mathbf{E}[X]\}$, sufficiently large $n$, and $c \geq 1$, $\Pr\left[X > \frac{\alpha c \ln n}{\ln \ln n} \cdot \mu\right] \leq 1/n^c$. Similarly, there is a constant $\alpha$ such that for any $\epsilon > 0$, $\Pr[X \geq (1+\epsilon)\mu + \alpha c \log n/\epsilon] \leq 1/n^c$.
Remark 6 If the $X_i$ are in the range $[0, b]$ for some $b$ not equal to $1$, one can scale them appropriately and then use the standard bounds.
Sometimes we need to deal with random variables that are in the range $[-1,1]$. Consider the setting where $X = \sum_i X_i$ where for each $i$, $X_i \in [-1,1]$ and $\mathbf{E}[X_i] = 0$, and the $X_i$ are independent. In this case $\mathbf{E}[X] = 0$ and we can no longer expect a dimension-free bound. Suppose each $X_i$ is $1$ with probability $1/2$ and $-1$ with probability $1/2$. Then $X = \sum_i X_i$ corresponds to a 1-dimensional random walk, and even though the expected value is $0$ the standard deviation of $X$ is $\Theta(\sqrt{n})$. One can show that $\Pr[|X| \geq t\sqrt{n}] \leq 2e^{-t^2/2}$. For these settings we can use the following bounds.
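As an illustration (our own sketch, not part of the notes), one can compare the empirical tail of the random walk against the bound $2e^{-t^2/2}$:

```python
import math
import random

# Empirically compare the tail of a 1-dimensional random walk of n +/-1
# steps with the stated bound Pr[|X| >= t*sqrt(n)] <= 2*exp(-t^2/2).
def walk(n, rng):
    # sum of n independent +/-1 steps, each +1 with probability 1/2
    return sum(1 if rng.random() < 0.5 else -1 for _ in range(n))

rng = random.Random(0)
n, t, trials = 400, 2.0, 2000
threshold = t * math.sqrt(n)
hits = sum(1 for _ in range(trials) if abs(walk(n, rng)) >= threshold)
empirical = hits / trials
bound = 2 * math.exp(-t * t / 2)
print(empirical, bound)  # the empirical tail sits well below the bound
```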
Theorem 7 Let $X_1, X_2, \ldots, X_n$ be independent random variables such that for each $i$, $X_i \in [a_i, b_i]$. Let $X = \sum_i X_i$ and let $\mu = \mathbf{E}[X]$. Then
$$\Pr[|X - \mu| \geq t] \leq 2 e^{-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}}.$$
In particular, if $b_i - a_i \leq 1$ for all $i$ then
$$\Pr[|X - \mu| \geq t] \leq 2 e^{-\frac{2t^2}{n}}.$$
Note that $\mathbf{Var}[X] = \sum_i \mathbf{Var}[X_i]$. One can also show a bound of the following form
$$\Pr[|X - \mu| \geq t] \leq 2 e^{-\frac{t^2}{2(\sigma_X^2 + Mt/3)}}$$
where $|X_i| \leq M$ for all $i$.
Remark 8 Compare the Chebyshev bound to the Chernoff-Hoeffding bounds for the same variance.
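To see the difference concretely, the following sketch (our own) evaluates both bounds for a sum of $n$ fair coin flips:

```python
import math

# Numeric comparison for X = sum of n fair coin flips: mu = n/2, Var[X] = n/4.
# Chebyshev decays polynomially in the deviation t, while the Hoeffding form
# of Theorem 7 (with b_i - a_i = 1) decays exponentially.
n = 10000
var = n / 4.0
bounds = {}
for t in (200, 400, 800):
    chebyshev = var / (t * t)                  # Pr[|X - mu| >= t] <= Var/t^2
    hoeffding = 2 * math.exp(-2 * t * t / n)   # Theorem 7 bound
    bounds[t] = (chebyshev, hoeffding)
    print(t, chebyshev, hoeffding)
```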
Statistical Estimators, Reducing Variance and Boosting: In streaming we will mainly work with randomized algorithms that compute a function $f$ of the data stream $x_1, \ldots, x_m$. They typically work by producing an unbiased estimator, via a random variable $X$, for the function value. That is, the algorithm will have the property that $\mathbf{E}[X]$ is the desired value. Note that the randomness is internal to the algorithm and not part of the input (we will later also discuss randomness in the input when considering random-order streams). Having an unbiased estimator alone is often not useful. We will also typically try to evaluate $\mathbf{Var}[X]$ and then we can use Chebyshev's inequality. One way to reduce the variance of the estimate is to run the algorithm in parallel (with separate random bits) to get estimators $X_1, X_2, \ldots, X_h$ and use $X = \frac{1}{h}\sum_i X_i$ as the final estimator. Note that $\mathbf{Var}(X) = \frac{1}{h^2}\sum_i \mathbf{Var}(X_i)$ since the $X_i$ are independent. Thus the variance has been reduced by a factor of $h$. A different approach is to use the median value of $X_1, X_2, \ldots, X_h$ as the final estimator. We can then use Chernoff-Hoeffding bounds to get a much better dependence on $h$. In fact both approaches can be combined, and we illustrate this via a concrete example in the next section.
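The two boosting strategies can be sketched generically as follows (the function names and the toy estimator are our own, not from the notes):

```python
import random
import statistics

# Averaging h independent copies of an unbiased estimator divides the
# variance by h; taking the median of several such averages then boosts
# the success probability ("median of means").
def mean_of(estimator, h, rng):
    return sum(estimator(rng) for _ in range(h)) / h

def median_of_means(estimator, h, ell, rng):
    return statistics.median(mean_of(estimator, h, rng) for _ in range(ell))

# Toy unbiased estimator of the value 10 with large variance: 0 or 20,
# each with probability 1/2.
noisy = lambda rng: 20.0 if rng.random() < 0.5 else 0.0

rng = random.Random(1)
est = median_of_means(noisy, h=50, ell=9, rng=rng)
print(est)  # close to the true value 10
```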

3. Probabilistic Counting and Morris's Algorithm

Suppose we have a long sequence of events that we wish to count. If we wish to count a sequence of events of length up to some $N$ we can easily do this by using a counter with $\lceil \log_2 N \rceil$ bits. In some settings of interest we would like to further reduce this. It is not hard to argue that if one wants an exact and deterministic count then we do need a counter with $\lceil \log_2 N \rceil$ bits. Surprisingly, if we allow for approximation and randomization, one can count with about $\log\log N$ bits. This was first shown in a short and sweet paper of Morris [3]; it is a nice paper to read for the historical motivation.
Morris's algorithm keeps a counter that essentially maintains an estimate of $\log N$ where $N$ is the number of events, and this requires about $\log\log N$ bits. There are several variants; here we will discuss the simplest one and point the reader to [3, 4, 5] for other schemes and a more detailed and careful analysis.
ProbabilisticCounting:
  $X \leftarrow 0$
  While (a new event arrives)
    Toss a biased coin that comes up heads with probability $1/2^X$
    If (coin turns up heads)
      $X \leftarrow X + 1$
  endWhile
  Output $2^X - 1$ as the estimate for the length of the stream.
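The pseudocode above translates directly into a few lines; the following Python sketch (our own, not from the notes) also averages many independent runs to illustrate that the estimate is unbiased:

```python
import random

# A direct sketch of ProbabilisticCounting (Morris's algorithm). The counter
# x stores roughly log2 of the count, so it needs about loglog(N) bits.
class MorrisCounter:
    def __init__(self, rng):
        self.x = 0
        self.rng = rng

    def event(self):
        # increment the counter with probability 1/2^x
        if self.rng.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        # 2^x - 1 is an unbiased estimate of the number of events
        return 2 ** self.x - 1

rng = random.Random(0)
n, trials = 1000, 2000
total = 0
for _ in range(trials):
    c = MorrisCounter(rng)
    for _ in range(n):
        c.event()
    total += c.estimate()
avg = total / trials
print(avg)  # averages to roughly n = 1000
```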
For integer $n \geq 0$, let $X_n$ be the random variable that denotes the value of the counter after $n$ events. Let $Y_n = 2^{X_n}$. The lemma below shows that the output of the algorithm is an unbiased estimator of the count that we desire.
Lemma 9 $\mathbf{E}[Y_n] = n + 1$.
Proof: By induction on $n$. The cases $n = 0, 1$ are easy to verify since in both cases $Y_n$ is deterministically equal to $n + 1$. For $n \geq 2$ we have:
$$\begin{aligned} \mathbf{E}[Y_n] = \mathbf{E}[2^{X_n}] &= \sum_{j=0}^{\infty} 2^j \Pr[X_n = j] \\ &= \sum_{j=0}^{\infty} 2^j \left( \Pr[X_{n-1} = j] \cdot \left(1 - \frac{1}{2^j}\right) + \Pr[X_{n-1} = j-1] \cdot \frac{1}{2^{j-1}} \right) \\ &= \sum_{j=0}^{\infty} 2^j \Pr[X_{n-1} = j] + \sum_{j=0}^{\infty} \left( 2 \Pr[X_{n-1} = j-1] - \Pr[X_{n-1} = j] \right) \\ &= \mathbf{E}[Y_{n-1}] + 1 \\ &= n + 1 \end{aligned}$$
where we used induction to obtain the final equality.
Since $\mathbf{E}[Y_n] = n+1$, by Jensen's inequality $\mathbf{E}[X_n] \leq \log_2(n+1)$, which implies that the expected number of bits in the counter after $n$ events is $\log\log n + O(1)$. We should also calculate the variance of $Y_n$.
Lemma 10 $\mathbf{E}[Y_n^2] = \frac{3}{2}n^2 + \frac{3}{2}n + 1$ and $\mathbf{Var}[Y_n] = n(n-1)/2$.
Proof: By induction on $n$. The base cases $n = 0, 1$ are easy to verify since $Y_n = n+1$ deterministically. For $n \geq 2$:
$$\begin{aligned} \mathbf{E}[Y_n^2] = \mathbf{E}[2^{2X_n}] &= \sum_{j \geq 0} 2^{2j} \cdot \Pr[X_n = j] \\ &= \sum_{j \geq 0} 2^{2j} \cdot \left( \Pr[X_{n-1} = j] \left(1 - \frac{1}{2^j}\right) + \Pr[X_{n-1} = j-1] \frac{1}{2^{j-1}} \right) \\ &= \sum_{j \geq 0} 2^{2j} \cdot \Pr[X_{n-1} = j] + \sum_{j \geq 0} \left( -2^j \Pr[X_{n-1} = j] + 4 \cdot 2^{j-1} \Pr[X_{n-1} = j-1] \right) \\ &= \mathbf{E}[Y_{n-1}^2] + 3\, \mathbf{E}[Y_{n-1}] \\ &= \frac{3}{2}(n-1)^2 + \frac{3}{2}(n-1) + 1 + 3n \\ &= \frac{3}{2} n^2 + \frac{3}{2} n + 1. \end{aligned}$$
We used induction and the value of $\mathbf{E}[Y_{n-1}]$ that we previously computed. $\mathbf{Var}[Y_n] = \mathbf{E}[Y_n^2] - (\mathbf{E}[Y_n])^2$ and it can be verified that this equals $n(n-1)/2$.
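As a sanity check (ours, not part of the notes), a simulation can be compared against the formulas of Lemmas 9 and 10:

```python
import random
import statistics

# Empirical check of Lemmas 9 and 10: sample Y_n = 2^{X_n} many times and
# compare the sample mean and variance with n + 1 and n(n-1)/2.
def y_n(n, rng):
    x = 0
    for _ in range(n):
        if rng.random() < 2.0 ** (-x):
            x += 1
    return 2 ** x

rng = random.Random(42)
n, trials = 200, 20000
samples = [y_n(n, rng) for _ in range(trials)]
mean = statistics.mean(samples)
var = statistics.variance(samples)
print(mean, var)  # near n + 1 = 201 and n(n-1)/2 = 19900
```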

3.1. Approximation and Success Probability

The analysis of the expectation shows that the output of the algorithm is an unbiased estimator. The analysis of the variance shows that the estimator is fairly reasonable. However, we would like to have a finer understanding. For instance, given some parameter $\epsilon > 0$, what is $\Pr[|Y_n - (n+1)| > \epsilon n]$? We could also ask a related question. Is there an algorithm that, given $\epsilon, \delta$, guarantees that the output $Y$ satisfies $\Pr[|Y - n| > \epsilon n] \leq \delta$ while still ensuring that the counter size is $O(\log\log n)$? Of course, we would expect the constant in the $O()$ notation to depend on $\epsilon, \delta$.
The algorithm can be modified to obtain a $(1+\epsilon)$-approximation with constant probability using a related scheme where the probability of incrementing the counter is $\frac{1}{a^X}$ for some parameter $a$; see [3]. The expected number of bits in the counter to achieve a $(1+\epsilon)$-approximation can be shown to be $\log\log n + O(\log 1/\epsilon)$ bits.
Here we describe two general-purpose ways to obtain better approximations by using independent estimators. Suppose we run the basic algorithm $h$ times in parallel with independent randomness to get estimators $Y_{n,1}, Y_{n,2}, \ldots, Y_{n,h}$. We then output $Z = \frac{1}{h}\sum_{i=1}^{h} Y_{n,i} - 1$ as our estimate for $n$. Note that $Z$ is also an unbiased estimator. We have that $\mathbf{Var}[Z] = \frac{1}{h}\mathbf{Var}[Y_n]$.
Claim 11 Suppose $h = 2/\epsilon^2$. Then $\Pr[|Z - n| \geq \epsilon n] < \frac{1}{4}$.
Proof: We apply Chebyshev's inequality to obtain that
$$\Pr[|Z - n| \geq \epsilon n] \leq \frac{\mathbf{Var}[Z]}{\epsilon^2 n^2} \leq \frac{1}{h} \frac{\mathbf{Var}[Y_n]}{\epsilon^2 n^2} \leq \frac{1}{h} \frac{n(n-1)}{2\epsilon^2 n^2} < \frac{1}{4}.$$
Now suppose we want a high-probability guarantee for the approximation. That is, we would like the estimator to be a $(1+\epsilon)$-approximation with probability at least $1-\delta$ for some given parameter $\delta$.
We choose $\ell = c \log\frac{1}{\delta}$ for some sufficiently large constant $c$. We independently and in parallel obtain estimators $Z_1, Z_2, \ldots, Z_\ell$ and output the median of the estimators; let $Z'$ be the random variable corresponding to the median. We will prove the following.
Claim 12 $\Pr[|Z' - n| \geq \epsilon n] \leq \delta$.
Proof: Define an indicator random variable $A_i$, $1 \leq i \leq \ell$, where $A_i$ is $1$ if $|Z_i - n| \geq \epsilon n$. From Claim 11, $\Pr[A_i = 1] < 1/4$. Let $A = \sum_{i=1}^{\ell} A_i$, hence $\mathbf{E}[A] < \ell/4$. We also observe that $|Z' - n| \geq \epsilon n$ only if $A \geq \ell/2$, that is, if at least half of the $Z_i$'s deviate by more than $\epsilon n$. We can apply the Chernoff-Hoeffding bound to upper bound $\Pr[A \geq \ell/2] = \Pr[A \geq (1+\delta')\mu]$ where $\delta' = 1$ and $\mu = \ell/4 \geq \mathbf{E}[A]$. From Theorem 3 this is at most $(e/4)^{\ell/4}$. Since $\ell = c \log\frac{1}{\delta}$, the probability is at most $\delta$ for an appropriate choice of $c$.
The total space usage for running the estimates in parallel is $O\left(\frac{1}{\epsilon^2} \cdot \log\frac{1}{\delta} \cdot \log\log n\right)$.
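Putting Claims 11 and 12 together, a sketch of the combined $(\epsilon, \delta)$ estimator (the parameter choices below are our own) looks as follows:

```python
import random
import statistics

# Combine the two boostings: average h = 2/eps^2 independent Morris
# estimates (Claim 11), repeat ell = O(log 1/delta) times, and output the
# median of the averages (Claim 12).
def morris(n, rng):
    x = 0
    for _ in range(n):
        if rng.random() < 2.0 ** (-x):
            x += 1
    return 2 ** x - 1  # unbiased estimate of n

def approx_count(n, eps, ell, rng):
    h = int(2 / eps ** 2)
    means = [sum(morris(n, rng) for _ in range(h)) / h for _ in range(ell)]
    return statistics.median(means)

rng = random.Random(7)
n = 500
est = approx_count(n, eps=0.5, ell=9, rng=rng)
print(est)  # within roughly eps*n of n with good probability
```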

4. Reservoir Sampling

Random sampling is a powerful general-purpose technique in a variety of areas. The goal is to pick a small subset $S$ of a set of items $N$ such that $S$ is representative of $N$, and often a random sample works well. The simplest sampling procedure is to pick a uniform sample of size $k$ from a set of size $m$ where $k \leq m$ (typically $k \ll m$). We can consider sampling with or without replacement. These are standard and easy procedures if the whole data set is available in a random-access manner; here we are assuming that we have access to a random number generator.
Below we describe a simple yet nice technique called reservoir sampling to obtain a uniform sample of size 1 from a stream.
UniformSample:
  $s \leftarrow \text{null}$
  $m \leftarrow 0$
  While (stream is not done)
    $m \leftarrow m + 1$
    $x_m$ is the current item
    Toss a biased coin that comes up heads with probability $1/m$
    If (coin turns up heads)
      $s \leftarrow x_m$
  endWhile
  Output $s$ as the sample
Lemma 13 Let $m$ be the length of the stream. The output $s$ of the algorithm is uniform. That is, for any $1 \leq j \leq m$, $\Pr[s = x_j] = 1/m$.
Proof: We observe that $s = x_j$ if $x_j$ is chosen when it is considered by the algorithm (which happens with probability $1/j$), and none of $x_{j+1}, \ldots, x_m$ is chosen to replace it. All the relevant events are independent and we can compute:
$$\Pr[s = x_j] = \frac{1}{j} \times \prod_{i > j} \left(1 - \frac{1}{i}\right) = \frac{1}{m}.$$
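The algorithm is a few lines of code; this sketch (ours, not from the notes) also checks uniformity empirically over a small stream:

```python
import random
from collections import Counter

# UniformSample: one pass, O(1) state; the current item replaces the
# stored sample with probability 1/m at step m.
def uniform_sample(stream, rng):
    s, m = None, 0
    for x in stream:
        m += 1
        if rng.random() < 1.0 / m:
            s = x
    return s

rng = random.Random(3)
counts = Counter(uniform_sample(range(5), rng) for _ in range(50000))
print(counts)  # each of 0..4 appears roughly 10000 times
```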
To obtain $k$ samples with replacement, the procedure for $k = 1$ can be run in parallel with independent randomness. Now we consider obtaining $k$ samples from the stream without replacement. The output will be stored in an array $S$ of size $k$.
Sample-without-Replacement($k$):
  $S[1..k] \leftarrow \text{null}$
  $m \leftarrow 0$
  While (stream is not done)
    $m \leftarrow m + 1$
    $x_m$ is the current item
    If ($m \leq k$)
      $S[m] \leftarrow x_m$
    Else
      $r \leftarrow$ uniform random number in range $[1..m]$
      If ($r \leq k$)
        $S[r] \leftarrow x_m$
  endWhile
  Output $S$
An alternative description is that when item $x_t$ arrives (for $t > k$) we choose it for inclusion in $S$ with probability $k/t$, and if it is chosen then we pick a uniformly random element of $S$ to be replaced by $x_t$.
Exercise: Prove that the algorithm outputs a random sample without replacement of size $k$ from the stream.
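A sketch (ours) of the array-based procedure above:

```python
import random

# Sample-without-Replacement(k): keep the first k items, then on seeing
# item number m replace a uniformly random slot with probability k/m.
def reservoir_sample(stream, k, rng):
    S = []
    for m, x in enumerate(stream, start=1):
        if m <= k:
            S.append(x)
        else:
            r = rng.randint(1, m)  # uniform in [1..m]
            if r <= k:
                S[r - 1] = x
    return S

rng = random.Random(5)
sample = reservoir_sample(range(100), 10, rng)
print(sorted(sample))  # 10 distinct items from 0..99
```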
Weighted sampling: Now we consider a generalization of the problem to weighted sampling. Suppose the stream consists of $m$ items $x_1, \ldots, x_m$ and each item $j$ has a weight $w_j > 0$. The goal is to choose $k$ samples in proportion to the weights. Suppose $k = 1$; then the goal is to choose an item $s$ such that $\Pr[s = x_j] = w_j/W$ where $W = \sum_i w_i$. It is easy to generalize the uniform sampling algorithm to achieve this, and $k$ samples with replacement is also easy. The more interesting case is when we want $k$ samples without replacement. The precise definition of what this means is not so obvious. Here is an algorithm: first obtain a sample $s$ from $x_1, \ldots, x_m$ in proportion to the weights. Remove $s$ from the set and obtain another sample in the residual set, again in proportion to the weights. Repeat $k$ times to obtain a set of $k$ items (assume that $k < m$). The $k$ removed items form the output $S$. We now want to obtain a random set $S$ according to this same distribution, but in a streaming fashion.
First we describe a randomized offline algorithm below.
$\underline{\text{Weighted-Sample-without-Replacement}(k):}$
For $i = 1$ to $m$ do
$\quad r_i \leftarrow$ uniform random number in interval $(0,1)$
$\quad w_i' = r_i^{1/w_i}$
endFor
Sort items in decreasing order according to $w_i'$ values
Output the first $k$ items from the sorted order
We leave it as a simple exercise to show that the above algorithm can be implemented in the stream model by keeping the heaviest $k$ modified weights seen so far. Now for the analysis.
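One way to carry out that exercise is to keep the current top-$k$ keys $w_i' = r_i^{1/w_i}$ in a min-heap: a new item enters only if its key beats the smallest kept key. A sketch, with illustrative names, assuming the stream yields (item, weight) pairs:

```python
import heapq
import random

def weighted_reservoir_sample(stream, k, rng=random):
    """One pass, O(k) space: keep the k items with the largest
    keys r_i^(1/w_i). `stream` yields (item, weight) pairs, weight > 0."""
    heap = []  # min-heap of (key, item); the root is the smallest kept key
    for item, w in stream:
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))  # evict the smallest key
    return [item for _, item in heap]
```

Each token costs $O(\log k)$ time in the worst case, and ties between keys have probability zero since the keys are continuous random variables.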
To get some intuition we make the following claim.
Claim 14 Let $r_1, r_2$ be independent uniformly distributed random variables over $[0,1]$ and let $X_1 = r_1^{1/w_1}$ and $X_2 = r_2^{1/w_2}$ where $w_1, w_2 > 0$. Then
$$\Pr[X_1 \leq X_2] = \frac{w_2}{w_1 + w_2}.$$
The above claim can be shown by basic analysis via the probability density functions. More precisely, suppose $w > 0$. Consider the random variable $Y = r^{1/w}$ where $r$ is chosen uniformly in $(0,1)$. The cumulative distribution function of $Y$ is
$$F_Y(t) = \Pr[Y \leq t] = \Pr[r^{1/w} \leq t] = \Pr[r \leq t^w] = t^w.$$
Therefore the density function $f_Y(t)$ is $w t^{w-1}$. Thus
$$\Pr[X_1 \leq X_2] = \int_0^1 F_{Y_1}(t) f_{Y_2}(t)\, dt = \int_0^1 t^{w_1} w_2 t^{w_2-1}\, dt = \frac{w_2}{w_1 + w_2}.$$
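Claim 14 is easy to sanity-check by simulation; the following sketch (names illustrative) estimates $\Pr[X_1 \leq X_2]$ by drawing the two keys repeatedly:

```python
import random

def estimate_prob_x1_le_x2(w1, w2, trials=200_000, rng=random):
    """Monte Carlo estimate of Pr[X1 <= X2] for X_i = r_i^(1/w_i),
    r_i uniform in (0,1); Claim 14 predicts w2 / (w1 + w2)."""
    hits = 0
    for _ in range(trials):
        x1 = rng.random() ** (1.0 / w1)
        x2 = rng.random() ** (1.0 / w2)
        if x1 <= x2:
            hits += 1
    return hits / trials
```

For example, with $w_1 = 1$ and $w_2 = 3$ the estimate should concentrate around $3/4$.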
We now make a much more general statement.
Lemma 15 Let $r_1, r_2, \ldots, r_n$ be independent random variables, each uniformly distributed over $[0,1]$. Let $X_i = r_i^{1/w_i}$ for $1 \leq i \leq n$. Then for any $\alpha \in [0,1]$,
$$\Pr[X_1 \leq X_2 \leq \cdots \leq X_n \leq \alpha] = \alpha^{w_1 + w_2 + \cdots + w_n} \cdot \prod_{i=1}^n \frac{w_i}{w_1 + \cdots + w_i}.$$
Proof: By induction on $n$. For $n = 1$, $\Pr[X_1 \leq \alpha] = F_{Y_1}(\alpha) = \alpha^{w_1}$. Assuming the lemma holds for all $h < n$ we prove it for $n$.
$$\begin{aligned} \Pr[X_1 \leq \cdots \leq X_n \leq \alpha] &= \int_0^\alpha \Pr[X_1 \leq \cdots \leq X_{n-1} \leq t]\, f_{Y_n}(t)\, dt \\ &= \int_0^\alpha t^{w_1 + w_2 + \cdots + w_{n-1}} \cdot \left(\prod_{i=1}^{n-1} \frac{w_i}{w_1 + \cdots + w_i}\right) w_n t^{w_n - 1}\, dt \\ &= w_n \left(\prod_{i=1}^{n-1} \frac{w_i}{w_1 + \cdots + w_i}\right) \int_0^\alpha t^{w_1 + w_2 + \cdots + w_n - 1}\, dt \\ &= \alpha^{w_1 + w_2 + \cdots + w_n} \cdot \prod_{i=1}^n \frac{w_i}{w_1 + \cdots + w_i}. \end{aligned}$$
We used the induction hypothesis in the second equality.
Now we are ready to finish the proof. Consider any fixed $j$. We claim that the probability that $X_j$ is the largest among $X_1, X_2, \ldots, X_m$ is equal to $\frac{w_j}{w_1 + \cdots + w_m}$. Do you see why? Conditioned on $X_j$ being the largest, the remaining variables are still independent and we can apply the same observation again. You should hopefully be convinced that picking the largest $k$ among the values $X_1, X_2, \ldots, X_m$ gives the desired sample.
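The key step, that the index of the largest key is distributed proportionally to the weights, can also be checked empirically. A small sketch (names illustrative):

```python
import random

def argmax_key(weights, rng=random):
    """Index of the largest key r_i^(1/w_i); by the claim above this
    equals j with probability w_j / (w_1 + ... + w_m)."""
    keys = [rng.random() ** (1.0 / w) for w in weights]
    return max(range(len(weights)), key=lambda i: keys[i])
```

With weights $(1, 2, 3)$, for instance, the last index should win about half the time, since $3/6 = 1/2$.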
Bibliographic Notes: Morris's algorithm is from [3]. See [4] for a detailed analysis and [5] for a more recent treatment which seems cleaner and easier. Weighted reservoir sampling is from [6]. See [7] for more on efficient reservoir sampling methods.
You can read the notes from the next lecture from Chandra Chekuri's course on Estimating the Number of Distinct Elements in a Stream here.

  1. Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
  2. Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
  3. Robert Morris. Counting large numbers of events in small registers. Communications of the ACM, 21(10):840-842, 1978.
  4. Philippe Flajolet. Approximate counting: a detailed analysis. BIT Numerical Mathematics, 25(1):113-134, 1985.
  5. Miklós Csürös. Approximate counting with a floating-point counter. CoRR, abs/0904.3062, 2009.
  6. Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181-185, 2006.
  7. Jeffrey S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37-57, 1985.
