In the streaming model we assume that the input data arrives as a stream. More formally, the data consists of $m$ items/tokens/objects $a_1, a_2, \ldots, a_m$. A simple case is when each token is a number in the range $[1..n]$. Another example is when each token represents an edge in a graph, given as a pair $(u, v)$ where $u$ and $v$ are integers representing node indices. The tokens arrive one by one in order. The algorithm has to process token $a_i$ before it sees the next token. If we are allowed to store all the tokens then we are in the standard model of computing where we have access to the whole input. However, we will assume that the space available to the algorithm is much less than $m$ tokens; typically sub-linear in $m$, or in the ideal scenario $\operatorname{polylog}(m)$. Our goal is to (approximately) compute/estimate various quantities of interest using such limited space. Surprisingly, one can compute various useful functions of interest. Some parameters of interest for a streaming algorithm are:
space used by the algorithm as a function of the stream length mm and the nature of the tokens
the worst-case time to process each token
the total time to process all the tokens (equivalently the amortized time to process a token)
the accuracy of the output
the probability of success in terms of obtaining a desired approximation (assuming that the algorithm is randomized)
There are several application areas that motivate the streaming model. In networking, a large number of packets transit a switch and we may want to analyze some statistics about the packets. The number of packets transiting the switch is vastly more than what can be stored for offline processing. In large databases the data is stored in disks and random access is not feasible; the size of the main memory is much smaller than the storage available on the disk. A one-pass or multi-pass streaming algorithm allows one to avoid sophisticated data structures on the disk. There are also applications where the data is distributed in several places and we need to process them separately and combine the results with low communication overhead.
In database applications and such it also makes sense to discuss the multi-pass model or the semi-streaming model where one gets to make several passes over the data. Typically we will assume that the number of passes is a small constant.
2. Background on Probability and Inequalities
The course will rely heavily on probabilistic methods. We will mostly rely on discrete probability spaces. We will keep the discussion high-level where possible and use certain results in a black-box fashion.
Let $\Omega$ be a finite set. A probability measure $p$ assigns a non-negative number $p(\omega)$ to each $\omega \in \Omega$ such that $\sum_{\omega \in \Omega} p(\omega)=1$. The tuple $(\Omega, p)$ defines a discrete probability space; an event in this space is any subset $A \subseteq \Omega$ and the probability of an event is simply $p(A)=\sum_{\omega \in A} p(\omega)$. When $\Omega$ is a continuous space such as the interval $[0,1]$ things get trickier and we need to talk about measure spaces and $\sigma$-algebras over $\Omega$; we can only assign probabilities to certain subsets of $\Omega$. We will not go into details since we will not need any formal machinery for what we do in this course.
An important definition is that of a random variable. We will focus only on real-valued random variables in this course. A random variable $X$ in a probability space is a function $X: \Omega \rightarrow \mathbb{R}$. In the discrete setting the expectation of $X$, denoted by $\mathbf{E}[X]$, is defined as $\sum_{\omega \in \Omega} p(\omega) X(\omega)$. For continuous spaces $\mathbf{E}[X]=\int X(\omega)\, dp(\omega)$ with an appropriate definition of the integral. The variance of $X$, denoted by $\mathbf{Var}[X]$ or $\sigma_X^2$, is defined as $\mathbf{E}\left[(X-\mathbf{E}[X])^{2}\right]$. The standard deviation is $\sigma_X$, the square root of the variance.
Theorem 1 (Markov's Inequality) Let $X$ be a non-negative random variable such that $\mathbf{E}[X]$ is finite. Then for any $t>0$, $\Pr[X \geq t] \leq \mathbf{E}[X]/t$.
Proof: The proof is in some sense obvious, especially in the discrete case. Here is a sketch. Define a new random variable $Y$ where $Y(\omega)=X(\omega)$ if $X(\omega)<t$ and $Y(\omega)=t$ if $X(\omega) \geq t$. $Y$ is non-negative and $Y \leq X$ point-wise and hence $\mathbf{E}[Y] \leq \mathbf{E}[X]$. We also see that:
$$\begin{aligned}
\mathbf{E}[X] \geq \mathbf{E}[Y] &=\sum_{\omega: X(\omega)<t} X(\omega) p(\omega)+\sum_{\omega: X(\omega) \geq t} t\, p(\omega) \\
& \geq t \sum_{\omega: X(\omega) \geq t} p(\omega) \quad \text{(since $X$ is non-negative)} \\
& \geq t \Pr[X \geq t].
\end{aligned}$$
The continuous case follows by replacing sums by integrals.
Markov's inequality is tight given only these assumptions; as an exercise, construct an example where it holds with equality. The more information we have about a random variable, the better we can bound its deviation from its expectation.
Theorem 2 (Chebyshev's Inequality) Let $X$ be a random variable with $\mathbf{E}[X]$ and $\mathbf{Var}[X]$ finite. Then $\Pr[|X| \geq t] \leq \mathbf{E}[X^2]/t^2$ and $\Pr[|X-\mathbf{E}[X]| \geq t\sigma_X] \leq 1/t^2$.
Proof: Consider the non-negative random variable $Y=X^2$. $\Pr[|X| \geq t]=\Pr[Y \geq t^2]$ and we apply Markov's inequality to the latter. The second inequality follows similarly by considering $Y=(X-\mathbf{E}[X])^2$.
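As a quick numeric sanity check of these two inequalities, here is a small computation on a toy distribution (a fair six-sided die; the example is mine, not from the notes). Exact rational arithmetic avoids any floating-point doubt.

```python
# Toy check of Markov's and Chebyshev's inequalities on a fair die.
# The die is a hypothetical example chosen purely for illustration.
from fractions import Fraction

outcomes = range(1, 7)
p = Fraction(1, 6)  # uniform probability of each face

EX = sum(p * x for x in outcomes)        # E[X] = 7/2
EX2 = sum(p * x * x for x in outcomes)   # E[X^2] = 91/6
var = EX2 - EX * EX                      # Var[X] = 35/12

# Markov: Pr[X >= 5] <= E[X]/5
markov_bound = EX / 5
pr_ge_5 = sum(p for x in outcomes if x >= 5)
assert pr_ge_5 <= markov_bound           # 1/3 <= 7/10

# Chebyshev (second form): Pr[|X - E[X]| >= t*sigma] <= 1/t^2, t = 3/2.
# Compare squared deviations to avoid taking square roots.
t = Fraction(3, 2)
pr_dev = sum(p for x in outcomes if (x - EX) ** 2 >= t * t * var)
assert pr_dev <= 1 / (t * t)             # here pr_dev is 0
```

Both bounds hold with room to spare, which is typical: they trade tightness for generality.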
Chernoff-Hoeffding Bounds: We will use several times various forms of the Chernoff-Hoeffding bounds, which apply to a random variable that is a finite sum of bounded and independent random variables. There are several versions of these bounds. First we state a general bound that is applicable to non-negative random variables and is dimension-free in that it depends only on the expectation rather than on the number of variables.
Theorem 3 (Chernoff-Hoeffding) Let $X_1, X_2, \ldots, X_n$ be independent binary random variables and let $a_1, a_2, \ldots, a_n$ be coefficients in $[0,1]$. Let $X=\sum_i a_i X_i$. Then
For any $\mu \geq \mathbf{E}[X]$ and any $\delta>0$, $\Pr[X>(1+\delta)\mu] \leq \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}$.
For any $\mu \leq \mathbf{E}[X]$ and any $\delta>0$, $\Pr[X<(1-\delta)\mu] \leq e^{-\mu\delta^2/2}$.
The following corollary bounds the deviation from the mean in both directions.
Corollary 4 Under the conditions of Theorem 3, the following holds: for any $U$ there is a constant $c(U)$ such that for $\mu \geq \mathbf{E}[X]$ and $0<\delta<U$, $\Pr[X \geq (1+\delta)\mu] \leq e^{-c(U)\delta^2\mu}$. In particular, combining with the lower tail bound, for $\mu = \mathbf{E}[X]$ and $0<\delta<U$ we have $\Pr[|X-\mu| \geq \delta\mu] \leq 2e^{-c(U)\delta^2\mu}$.
We refer the reader to the standard books on randomized algorithms [1] and [2] for the derivation of the above bounds.
If we are interested only in the upper tail we also have the following bounds, which show the dependence on $\mu$ and $n$ needed to obtain an inverse polynomial probability.
Corollary 5 Under the conditions of Theorem 3, there is a universal constant $\alpha$ such that for any $\mu \geq \max\{1, \mathbf{E}[X]\}$, sufficiently large $n$, and $c \geq 1$, $\Pr\left[X>\frac{\alpha c \ln n}{\ln \ln n} \cdot \mu\right] \leq 1/n^c$. Similarly, there is a constant $\alpha$ such that for any $\epsilon>0$, $\Pr[X \geq (1+\epsilon)\mu+\alpha c \log n/\epsilon] \leq 1/n^c$.
Remark 6 If the $X_i$ are in the range $[0,b]$ for some $b$ not equal to $1$, one can scale them appropriately and then use the standard bounds.
Sometimes we need to deal with random variables that are in the range $[-1,1]$. Consider the setting where $X=\sum_i X_i$ where for each $i$, $X_i \in [-1,1]$ and $\mathbf{E}[X_i]=0$, and the $X_i$ are independent. In this case $\mathbf{E}[X]=0$ and we can no longer expect a dimension-free bound. Suppose each $X_i$ is $1$ with probability $1/2$ and $-1$ with probability $1/2$. Then $X=\sum_i X_i$ corresponds to a 1-dimensional random walk, and even though the expected value is $0$ the standard deviation of $X$ is $\Theta(\sqrt{n})$. One can show that $\Pr[|X| \geq t\sqrt{n}] \leq 2e^{-t^2/2}$. For these settings we can use the following bounds.
Theorem 7 (Hoeffding) Let $X_1, X_2, \ldots, X_n$ be independent random variables such that for each $i$, $X_i \in [a_i, b_i]$. Let $X=\sum_i X_i$ and let $\mu=\mathbf{E}[X]$. Then for any $t>0$,
$$\Pr[|X-\mu| \geq t] \leq 2\exp\left(-\frac{2t^2}{\sum_i (b_i-a_i)^2}\right).$$
A related bound (Bernstein's inequality) that exploits the variance is
$$\Pr[|X-\mu| \geq t] \leq 2\exp\left(-\frac{t^2}{2(\mathbf{Var}[X]+Mt/3)}\right),$$
where $|X_i - \mathbf{E}[X_i]| \leq M$ for all $i$.
Remark 8Compare the Chebyshev bound to the Chernoff-Hoeffding bounds for the same variance.
Statistical Estimators, Reducing Variance and Boosting: In streaming we will mainly work with randomized algorithms that compute a function $f$ of the data stream $x_1, \ldots, x_m$. They typically work by producing an unbiased estimator, via a random variable $X$, for the function value. That is, the algorithm will have the property that $\mathbf{E}[X]$ is the desired value. Note that the randomness is internal to the algorithm and not part of the input (we will also later discuss randomness in the input when considering random order streams). Having an unbiased estimator is often not useful by itself. We will also typically try to evaluate $\mathbf{Var}[X]$ and then we can use Chebyshev's inequality. One way to reduce the variance of the estimate is to run the algorithm in parallel (with separate random bits) to get estimators $X_1, X_2, \ldots, X_h$ and use $X=\frac{1}{h}\sum_i X_i$ as the final estimator. Note that $\mathbf{Var}(X)=\frac{1}{h^2}\sum_i \mathbf{Var}(X_i)$ since the $X_i$ are independent; when the $X_i$ are identically distributed, the variance is thus reduced by a factor of $h$. A different approach is to use the median value of $X_1, X_2, \ldots, X_h$ as the final estimator. We can then use Chernoff-Hoeffding bounds to get a much better dependence on $h$. In fact both approaches can be combined, and we illustrate this via a concrete example in the next section.
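The averaging trick can be seen numerically with a small simulation; the noisy estimator below is a made-up stand-in (not from the notes) for any unbiased streaming estimator.

```python
# Simulation: averaging h independent copies of an unbiased estimator
# reduces the variance by a factor of h. The estimator is hypothetical.
import random

random.seed(0)

def noisy_estimate(true_value):
    # unbiased: true value plus zero-mean uniform noise (variance 1/3)
    return true_value + random.uniform(-1.0, 1.0)

def averaged_estimate(true_value, h):
    return sum(noisy_estimate(true_value) for _ in range(h)) / h

def empirical_var(samples):
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

single = [noisy_estimate(10.0) for _ in range(20000)]
avg16 = [averaged_estimate(10.0, 16) for _ in range(20000)]

v1 = empirical_var(single)   # roughly 1/3
v16 = empirical_var(avg16)   # roughly (1/3)/16
assert v16 < v1 / 8          # variance visibly reduced
```

With `h = 16` the empirical variance drops by roughly a factor of sixteen, matching $\mathbf{Var}(X)=\mathbf{Var}(X_1)/h$.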
3. Probabilistic Counting and Morris's Algorithm
Suppose we have a long sequence of events that we wish to count. If we wish to count a sequence of events of length up to some $N$ we can easily do this by using a counter with $\lceil \log_2 N \rceil$ bits. In some settings of interest we would like to further reduce this. It is not hard to argue that if one wants an exact and deterministic count then we do need a counter with $\lceil \log_2 N \rceil$ bits. Surprisingly, if we allow for approximation and randomization, one can count with about $\log \log N$ bits. This was first shown in a short and sweet paper of Morris [3]; it is a nice paper to read for the historical motivation.
Morris's algorithm keeps a counter that basically maintains an estimate of $\log N$ where $N$ is the number of events, and this requires about $\log \log N$ bits. There are several variants of this; here we will discuss the simple one and point the reader to [3], [4], [5] for other schemes and a more detailed and careful analysis.
$X \leftarrow 0$
While (a new event arrives)
    Toss a biased coin that is heads with probability $1/2^X$
    If (coin turns up heads)
        $X \leftarrow X+1$
endWhile
Output $2^X - 1$ as the estimate for the length of the stream.
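The pseudocode above translates directly into code. The following sketch (variable and function names are mine) also checks empirically that the estimate is unbiased by averaging many independent runs.

```python
# A sketch of Morris's counter: X approximates log2 of the count and
# 2^X - 1 is an unbiased estimate of the number of events.
import random

random.seed(1)

def morris_count(n):
    x = 0
    for _ in range(n):
        # increment the counter with probability 1/2^x
        if random.random() < 1.0 / (1 << x):
            x += 1
    return (1 << x) - 1  # the estimate 2^X - 1

# average many independent runs: the mean should be close to n
n, trials = 1000, 2000
mean_est = sum(morris_count(n) for _ in range(trials)) / trials
assert abs(mean_est - n) < 100  # E[2^X - 1] = n
```

Any single run can be far off (the variance is $n(n-1)/2$), but the average over independent runs concentrates around $n$.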
For integer $n \geq 0$, let $X_n$ be the random variable that denotes the value of the counter after $n$ events. Let $Y_n = 2^{X_n}$. The lemma below shows that the output of the algorithm is an unbiased estimator of the count that we desire.
Lemma 9 For all $n \geq 0$, $\mathbf{E}[Y_n] = n+1$.

Proof: By induction on $n$. The case of $n=0,1$ is easy to verify since in both cases $Y_n$ is deterministically equal to $n+1$. For $n \geq 2$ we condition on the value of the counter after $n-1$ events:
$$\begin{aligned}
\mathbf{E}[Y_n] &= \sum_{j \geq 0} \Pr[X_{n-1}=j]\,\mathbf{E}\left[2^{X_n} \mid X_{n-1}=j\right] \\
&= \sum_{j \geq 0} \Pr[X_{n-1}=j]\left(\frac{1}{2^j}\cdot 2^{j+1} + \left(1-\frac{1}{2^j}\right)2^j\right) \\
&= \sum_{j \geq 0} \Pr[X_{n-1}=j]\left(2^j + 1\right) \\
&= \mathbf{E}[Y_{n-1}] + 1 = n+1,
\end{aligned}$$
where we used induction to obtain the final equality.
Since $\mathbf{E}[Y_n]=n+1$, Jensen's inequality gives $\mathbf{E}[X_n] \leq \log_2(n+1)$, which implies that the expected number of bits in the counter after $n$ events is $\log \log n + O(1)$. We should also calculate the variance of $Y_n$.
A similar conditioning argument shows that $\mathbf{E}[Y_n^2] = \mathbf{E}[Y_{n-1}^2] + 3\,\mathbf{E}[Y_{n-1}]$, since $\mathbf{E}\left[4^{X_n} \mid X_{n-1}=j\right] = \frac{1}{2^j}\cdot 4^{j+1} + \left(1-\frac{1}{2^j}\right)4^j = 4^j + 3\cdot 2^j$. Using induction and the value of $\mathbf{E}[Y_{n-1}]$ that we previously computed, this yields $\mathbf{E}[Y_n^2] = 1 + \frac{3n(n+1)}{2}$. Finally, $\mathbf{Var}[Y_n]=\mathbf{E}[Y_n^2]-(\mathbf{E}[Y_n])^2$ and it can be verified that this equals $n(n-1)/2$.
3.1. Approximation and Success Probability
The analysis of the expectation shows that the output of the algorithm is an unbiased estimator. The analysis of the variance shows that the estimator is fairly reasonable. However, we would like to have a finer understanding. For instance, given some parameter $\epsilon>0$, what is $\Pr[|Y_n-(n+1)|>\epsilon n]$? We could also ask a related question. Is there an algorithm that, given $\epsilon, \delta$, guarantees that the output $Y$ will satisfy the property that $\Pr[|Y-n|>\epsilon n] \leq \delta$ while still ensuring that the counter size is $O(\log \log n)$? Of course we would expect that the constant in the $O()$ notation will depend on $\epsilon, \delta$.
The algorithm can be modified to obtain a $(1+\epsilon)$-approximation with constant probability using a related scheme where the probability of incrementing the counter is $\frac{1}{a^X}$ for some parameter $a$; see [3]. The expected number of bits in the counter to achieve a $(1+\epsilon)$-approximation can be shown to be $\log \log n + O(\log 1/\epsilon)$.
Here we describe two general purpose ways to obtain better approximations by using independent estimators. Suppose we run the basic algorithm $h$ times in parallel with independent randomness to get estimators $Y_{n,1}, Y_{n,2}, \ldots, Y_{n,h}$. We then output $Z=\frac{1}{h}\sum_{i=1}^h Y_{n,i} - 1$ as our estimate for $n$. Note that $Z$ is also an unbiased estimator, and $\mathbf{Var}[Z]=\frac{1}{h}\mathbf{Var}[Y_n]$. By Chebyshev's inequality, $\Pr[|Z-n| \geq \epsilon n] \leq \frac{\mathbf{Var}[Y_n]}{h\epsilon^2 n^2} \leq \frac{1}{2h\epsilon^2}$, which yields the following.

Claim 11 If $h \geq 2/\epsilon^2$ then $\Pr[|Z-n| \geq \epsilon n] \leq 1/4$.
Now suppose we want a high probability guarantee regarding the approximation. That is, we would like the estimator to be a (1+epsilon)(1+\epsilon)-approximation with probability at least (1-delta)(1-\delta) for some given parameter delta\delta.
We choose $\ell=c\log\frac{1}{\delta}$ for some sufficiently large constant $c$. We independently and in parallel obtain estimators $Z_1, Z_2, \ldots, Z_\ell$ and output the median of the estimators; let $Z'$ be the random variable corresponding to the median. We will prove the following.

Lemma 12 Suppose $h \geq 2/\epsilon^2$ and $\ell = c\log\frac{1}{\delta}$ for a sufficiently large constant $c$. Then $\Pr[|Z'-n| \geq \epsilon n] \leq \delta$.
Proof: Define an indicator random variable $A_i$, $1 \leq i \leq \ell$, where $A_i$ is $1$ if $|Z_i - n| \geq \epsilon n$. From Claim 11, $\Pr[A_i=1] < 1/4$. Let $A=\sum_{i=1}^{\ell} A_i$, and hence $\mathbf{E}[A] < \ell/4$. We also observe that $|Z'-n| \geq \epsilon n$ only if $A \geq \ell/2$, that is, only if at least half of the $Z_i$'s deviate by more than $\epsilon n$. We apply the Chernoff-Hoeffding bound of Theorem 3 with $\mu=\ell/4 \geq \mathbf{E}[A]$ and deviation parameter $1$ to upper bound $\Pr[A \geq \ell/2]=\Pr[A \geq 2\mu]$; this probability is at most $(e/4)^{\ell/4}$. Since $\ell=c\log\frac{1}{\delta}$, the probability is at most $\delta$ for an appropriate choice of $c$.
The total space usage for running the estimates in parallel is $O\left(\frac{1}{\epsilon^2} \cdot \log\frac{1}{\delta} \cdot \log\log n\right)$.
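The averaging-then-median combination can be phrased as a generic routine. Below is a minimal sketch, with a hypothetical high-variance estimator standing in for the Morris counter; the names are mine.

```python
# Median-of-means: average h independent estimates to cut variance,
# then take the median of ell such averages to boost the success
# probability. The noisy estimator below is hypothetical.
import random
from statistics import median

random.seed(2)

def median_of_means(make_estimate, h, ell):
    # ell groups; each group is the mean of h independent estimates
    group_means = [
        sum(make_estimate() for _ in range(h)) / h for _ in range(ell)
    ]
    return median(group_means)

def noisy():
    # unbiased estimator of 5.0 with large variance
    return 5.0 + random.gauss(0.0, 4.0)

est = median_of_means(noisy, h=32, ell=9)
assert abs(est - 5.0) < 2.0  # the combined estimate is close
```

The median is what buys the exponentially small failure probability: a bad output requires half of the $\ell$ groups to fail simultaneously.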
4. Reservoir Sampling
Random sampling is a powerful general purpose technique in a variety of areas. The goal is to pick a small subset $S$ of a set $N$ of items such that $S$ is representative of $N$, and often a random sample works well. The simplest sampling procedure is to pick a uniform sample of size $k$ from a set of size $m$ where $k \leq m$ (typically $k \ll m$). We can consider sampling with or without replacement. These are standard and easy procedures if the whole data set is available in a random-access manner; here we assume that we have access to a random number generator.
Below we describe a simple yet nice technique called reservoir sampling to obtain a uniform sample of size 1 from a stream.
$s \leftarrow \perp$
While (a new item $x_m$ arrives, where $m$ is the current stream length)
    Toss a biased coin that is heads with probability $1/m$
    If (coin turns up heads)
        $s \leftarrow x_m$
endWhile
Output $s$ as the sample
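In code the size-1 reservoir is only a few lines; the sketch below (names are mine) also checks uniformity empirically on a short stream.

```python
# Size-1 reservoir sampling: replace the current sample with the m-th
# item with probability 1/m.
import random

random.seed(3)

def reservoir_one(stream):
    s = None
    for m, x in enumerate(stream, start=1):
        if random.random() < 1.0 / m:
            s = x  # the first item is always taken (probability 1/1)
    return s

# empirical check: each of 5 items should appear about 1/5 of the time
trials = 50000
counts = {i: 0 for i in range(5)}
for _ in range(trials):
    counts[reservoir_one(range(5))] += 1
assert all(abs(c - trials / 5) < 600 for c in counts.values())
```

Note that only the current sample and the stream position are stored, regardless of the stream length.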
Lemma 13 Let $m$ be the length of the stream. The output $s$ of the algorithm is uniform. That is, for any $1 \leq j \leq m$, $\Pr[s=x_j]=1/m$.
Proof: We observe that $s=x_j$ iff $x_j$ is chosen when it is considered by the algorithm (which happens with probability $1/j$) and none of $x_{j+1}, \ldots, x_m$ are chosen to replace it. All the relevant events are independent and we can compute:
$$\Pr[s=x_j] = \frac{1}{j}\prod_{i=j+1}^{m}\left(1-\frac{1}{i}\right) = \frac{1}{j}\prod_{i=j+1}^{m}\frac{i-1}{i} = \frac{1}{j}\cdot\frac{j}{m} = \frac{1}{m}.$$
To obtain $k$ samples with replacement, the procedure for $k=1$ can be run $k$ times in parallel with independent randomness. Now we consider obtaining $k$ samples from the stream without replacement. The output will be stored in an array $S$ of size $k$.
An alternative description is that when item $x_t$ arrives (for $t>k$) we decide to choose it for inclusion in $S$ with probability $k/t$, and if it is chosen then we choose a uniform element of $S$ to be replaced by $x_t$.
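The alternative description translates to the following sketch (the helper name is mine):

```python
# Size-k reservoir sampling without replacement: keep the first k
# items; item x_t (t > k) is included with probability k/t, replacing
# a uniformly chosen slot of S.
import random

random.seed(4)

def reservoir_k(stream, k):
    S = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            S.append(x)
        elif random.random() < k / t:
            S[random.randrange(k)] = x
    return S

sample = reservoir_k(range(100), 10)
# distinct stream items can occupy at most one slot each
assert len(sample) == 10 and len(set(sample)) == 10
```

Only the array $S$ and the position $t$ are kept, so the space is $O(k)$ items.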
Exercise: Prove that the algorithm outputs a random sample without replacement of size kk from the stream.
Weighted sampling: Now we consider a generalization of the problem to weighted sampling. Suppose the stream consists of $m$ items $x_1, \ldots, x_m$ and each item $j$ has a weight $w_j > 0$. The goal is to choose a sample of $k$ items in proportion to the weights. Suppose $k=1$; then the goal is to choose an item $s$ such that $\Pr[s=x_j]=w_j/W$ where $W=\sum_i w_i$. It is easy to generalize the uniform sampling algorithm to achieve this, and $k$ samples with replacement is also easy. The more interesting case is when we want $k$ samples without replacement. The precise definition of what this means is not so obvious. Here is an algorithm. First obtain a sample $s$ from $x_1, \ldots, x_m$ in proportion to the weights. Remove $s$ from the set and obtain another sample from the residual set, again in proportion to the weights. Repeat $k$ times to obtain a set of $k$ items (assume that $k<m$). The $k$ removed items form the output $S$. We now want to obtain a random set $S$ according to this same distribution but in a streaming fashion.
First we describe a randomized offline algorithm below.
For each item $x_i$, pick an independent random number $r_i$ uniformly distributed in $(0,1)$ and set $w_i' = r_i^{1/w_i}$
Sort items in decreasing order according to the $w_i'$ values
Output the first kk items from the sorted order
We leave it as a simple exercise to show that the above algorithm can be implemented in the stream model by keeping the $k$ largest modified weights seen so far. Now for the analysis.
To get some intuition we make the following claim.
Claim 14 Let $r_1, r_2$ be independent uniformly distributed random variables over $[0,1]$ and let $X_1=r_1^{1/w_1}$ and $X_2=r_2^{1/w_2}$ where $w_1, w_2 > 0$. Then
$$\Pr[X_1 \leq X_2] = \frac{w_2}{w_1+w_2}.$$
The above claim can be shown by basic analysis via the probability density functions. More precisely, suppose $w>0$. Consider the random variable $Y=r^{1/w}$ where $r$ is chosen uniformly in $(0,1)$. The cumulative distribution function of $Y$ is
$$F_Y(t) = \Pr[Y \leq t] = \Pr[r \leq t^w] = t^w \quad \text{for } t \in [0,1].$$
Therefore the density function $f_Y(t)$ is $wt^{w-1}$. Thus
$$\Pr[X_1 \leq X_2]=\int_0^1 F_{Y_1}(t) f_{Y_2}(t)\, dt = \int_0^1 t^{w_1} w_2 t^{w_2-1}\, dt = \frac{w_2}{w_1+w_2}.$$
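As a quick Monte Carlo check of this computation, take the illustrative choice $w_1=1$, $w_2=3$, for which the claimed probability is $3/4$:

```python
# Monte Carlo check: Pr[r1^(1/w1) <= r2^(1/w2)] should be w2/(w1+w2).
import random

random.seed(6)
w1, w2 = 1.0, 3.0
trials = 100000
hits = sum(
    random.random() ** (1.0 / w1) <= random.random() ** (1.0 / w2)
    for _ in range(trials)
)
freq = hits / trials
assert abs(freq - w2 / (w1 + w2)) < 0.01  # expect about 0.75
```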
We now make a much more general statement.
Lemma 15 Let $r_1, r_2, \ldots, r_n$ be independent random variables, each uniformly distributed over $[0,1]$. Let $X_i=r_i^{1/w_i}$ for $1 \leq i \leq n$. Then for any $\alpha \in [0,1]$,
$$\Pr\left[X_1 \leq X_2 \leq \ldots \leq X_n \leq \alpha\right] = \alpha^{w_1+w_2+\ldots+w_n} \cdot \prod_{i=1}^{n}\frac{w_i}{w_1+\ldots+w_i}.$$
Proof: By induction on $n$. For $n=1$, $\Pr[X_1 \leq \alpha]=F_{Y_1}(\alpha)=\alpha^{w_1}$. Assuming the lemma holds for all $h<n$ we prove it for $n$.
$$\begin{aligned}
\Pr\left[X_1 \leq \ldots \leq X_n \leq \alpha\right] &=\int_0^\alpha \Pr\left[X_1 \leq \ldots \leq X_{n-1} \leq t\right] f_{Y_n}(t)\, dt \\
&=\int_0^\alpha t^{w_1+w_2+\ldots+w_{n-1}} \cdot \left(\prod_{i=1}^{n-1}\frac{w_i}{w_1+\ldots+w_i}\right) w_n t^{w_n-1}\, dt \\
&=w_n\left(\prod_{i=1}^{n-1}\frac{w_i}{w_1+\ldots+w_i}\right)\int_0^\alpha t^{w_1+w_2+\ldots+w_n-1}\, dt \\
&=\alpha^{w_1+w_2+\ldots+w_n} \cdot \prod_{i=1}^{n}\frac{w_i}{w_1+\ldots+w_i}.
\end{aligned}$$
We used the induction hypothesis in the second equality.
Now we are ready to finish the proof. Consider any fixed $j$. We claim that the probability that $X_j$ is the largest number among $X_1, X_2, \ldots, X_m$ is equal to $\frac{w_j}{w_1+\ldots+w_m}$. Do you see why? Conditioned on $X_j$ being the largest, the rest of the variables are still independent and we can apply this observation again. You should hopefully be convinced that picking the items with the $k$ largest values among $X_1, X_2, \ldots, X_m$ gives the desired sample.
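Putting the pieces together, the one-pass weighted sampler (assign each item the key $r_i^{1/w_i}$ and keep the $k$ largest keys) can be sketched with a min-heap; the names below are mine, not from the notes.

```python
# One-pass weighted sampling without replacement: each item gets the
# key r^(1/w); the k items with the largest keys form the sample.
import heapq
import random

random.seed(5)

def weighted_reservoir(stream, k):
    heap = []  # min-heap of (key, item); its minimum is evicted first
    for item, w in stream:
        key = random.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

items = [("a", 10.0), ("b", 1.0), ("c", 1.0), ("d", 1.0)]
# item "a" carries most of the weight, so it is almost always sampled
count_a = sum("a" in weighted_reservoir(items, 2) for _ in range(5000))
assert count_a > 4500  # expected frequency is about 0.96
```

The space is $O(k)$ keys and items, and each arrival costs $O(\log k)$ time for the heap update.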
Bibliographic Notes: Morris's algorithm is from [3]. See [4] for a detailed analysis and [5] for a more recent treatment which seems cleaner and easier. Weighted reservoir sampling is from [6]. See [7] for more on efficient reservoir sampling methods.
You can read the notes from the next lecture from Chandra Chekuri's course on Estimating the Number of Distinct Elements in a Stream here.
[1] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[2] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[3] Robert Morris. Counting large numbers of events in small registers. Communications of the ACM, 21(10):840-842, 1978.
[4] Philippe Flajolet. Approximate counting: a detailed analysis. BIT Numerical Mathematics, 25(1):113-134, 1985.
[5] Miklós Csürös. Approximate counting with a floating-point counter. CoRR, abs/0904.3062, 2009.
[6] Pavlos S. Efraimidis and Paul G. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181-185, 2006.
[7] Jeffrey S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37-57, 1985.