You can read the notes from the previous lecture of Chandra Chekuri's course on Estimating the Number of Distinct Elements in a Stream here.
1. AMS Sampling
We have seen reservoir sampling and the related weighted sampling technique to obtain independent samples from a stream without the algorithm knowing the length of the stream. We now discuss a technique to sample from a stream $\sigma = a_1, a_2, \ldots, a_m$, where the tokens $a_j$ are integers from $[n]$, and we wish to estimate a function of the form

$$g(\sigma) = \sum_{i \in [n]} g(f_i),$$

where $f_i$ is the frequency of $i$ and $g$ is a real-valued function such that $g(0) = 0$. A natural example is estimating the frequency moments $F_k = \sum_{i \in [n]} f_i^k$; here we have $g(x) = x^k$, a convex function for $k \geq 1$. Another example is the empirical entropy of $\sigma$, defined as $\sum_{i \in [n]} p_i \log \frac{1}{p_i}$ where $p_i = \frac{f_i}{m}$ is the empirical probability of $i$; here $g(x) = x \log x$.[1]
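As a tiny concrete instance (not from the notes), consider the stream $\sigma = (1, 2, 1, 3, 1)$ with $m = 5$ and $n = 3$: the frequencies are $f_1 = 3$ and $f_2 = f_3 = 1$, so the second frequency moment is

$$F_2 = 3^2 + 1^2 + 1^2 = 11.$$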
AMS sampling, from the famous paper of Alon, Matias, and Szegedy, gives an unbiased estimator for $g(\sigma)$. The estimator is based on a random variable $Y$ defined as follows. Let $J$ be a uniformly random sample from $[m]$. Let $R = |\{j \mid a_j = a_J,\ J \leq j \leq m\}|$. That is, $R$ is the number of tokens from position $J$ onwards (including $J$ itself) that equal $a_J$. Then let $Y$ be the estimate defined as:
$$Y = m\,(g(R) - g(R-1)).$$
The lemma below shows that $Y$ is an unbiased estimator of $g(\sigma)$.

Lemma 1. $\mathbf{E}[Y] = \sum_{i \in [n]} g(f_i) = g(\sigma)$.
Proof: The probability that $a_J = i$ is exactly $f_i/m$ since $J$ is a uniform sample. Moreover, if $a_J = i$ then $R$ is distributed as a uniform random variable over $[f_i]$. Hence

$$\mathbf{E}[Y] = \sum_{i \in [n]} \frac{f_i}{m} \sum_{r=1}^{f_i} \frac{1}{f_i} \cdot m\,\bigl(g(r) - g(r-1)\bigr) = \sum_{i \in [n]} \bigl(g(f_i) - g(0)\bigr) = \sum_{i \in [n]} g(f_i),$$

where the second equality uses the telescoping of the inner sum and the last uses $g(0) = 0$.
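To see the lemma concretely, here is a minimal Python sketch (not part of the notes; the function names are illustrative) that, for a small stream, enumerates every equally likely choice of the index $J$ and averages the resulting estimate exactly, recovering $\sum_i g(f_i)$:

```python
from collections import Counter

def exact_expectation(stream, g):
    """Compute E[Y] exactly by enumerating all m equally likely choices of J."""
    m = len(stream)
    total = 0.0
    for J in range(m):  # J is 0-indexed here; each position has probability 1/m
        # R = number of occurrences of stream[J] from position J to the end
        R = sum(1 for j in range(J, m) if stream[j] == stream[J])
        total += m * (g(R) - g(R - 1))
    return total / m

stream = [1, 2, 1, 3, 1]
g = lambda x: x ** 2                                 # g(x) = x^k with k = 2
print(exact_expectation(stream, g))                  # 11.0
print(sum(g(f) for f in Counter(stream).values()))   # F_2 = 11
```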
One can compute $Y$ in small space in the streaming setting via the reservoir sampling idea for generating a uniform sample. The algorithm is given below; the count $R$ is reset whenever a new sample is picked.
"AMSEstimate:"_\underline{\text{AMSEstimate:}}
s larr"null"s\leftarrow\text{null}
m larr0m\leftarrow0
R larr0R\leftarrow0
While (stream is not done)
quad m larr m+1\quad m\leftarrow m+1
quada_(m)\quad a_{m} is current item
quad\quad Toss a biased coin that is heads with probability 1//m1/m
quad\quad If (coin turns up heads)
quadquad s larra_(m)\quad\quad s\leftarrow a_{m}
quadquad R larr1\quad\quad R\leftarrow1
quad\quad Else If (a_(m)==s)(a_{m}==s)
quadquad R larr R+1\quad\quad R\leftarrow R+1
endWhile
Output m(g(R)-g(R-1))m(g(R)-g(R-1))
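For concreteness, a minimal Python sketch of the pseudocode above (not from the notes; it assumes the stream is any iterable of tokens and $g$ is any real-valued function with $g(0) = 0$):

```python
import random

def ams_estimate(stream, g):
    """One pass of AMSEstimate: reservoir-sample a position J and count
    occurrences of the sampled token from position J onwards."""
    s = None   # currently sampled token a_J
    m = 0      # number of tokens seen so far
    R = 0      # occurrences of s since it was sampled
    for a in stream:
        m += 1
        if random.random() < 1.0 / m:  # heads with probability 1/m
            s, R = a, 1                # resample: the sampled position J becomes m
        elif a == s:
            R += 1
    return m * (g(R) - g(R - 1))

# A single run has high variance; averaging many runs approaches F_2 = 11.
runs = [ams_estimate([1, 2, 1, 3, 1], lambda x: x ** 2) for _ in range(10000)]
print(sum(runs) / len(runs))
```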
To obtain an $(\epsilon, \delta)$-approximation via the estimator $Y$ we need to bound $\mathbf{Var}[Y]$ and apply standard tools. We do this now for frequency moments.
1.1. Application to estimating frequency moments
We now apply AMS sampling to estimate $F_k$, the $k$'th frequency moment, for $k \geq 1$. We have already seen that $Y$ is an unbiased estimator for $F_k$ when we set $g(x) = x^k$. We now estimate the variance of $Y$ in this setting.
Lemma 2. When $g(x) = x^k$ and $k \geq 1$,

$$\mathbf{Var}[Y] \leq k F_1 F_{2k-1} \leq k n^{1-\frac{1}{k}} F_k^2.$$
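A sketch of the standard calculation behind the first inequality (assuming the distribution of $(a_J, R)$ from the proof of Lemma 1):

$$\mathbf{Var}[Y] \leq \mathbf{E}[Y^2] = \sum_{i \in [n]} \frac{f_i}{m} \sum_{r=1}^{f_i} \frac{1}{f_i}\, m^2 \bigl(r^k - (r-1)^k\bigr)^2 = m \sum_{i \in [n]} \sum_{r=1}^{f_i} \bigl(r^k - (r-1)^k\bigr)^2.$$

Since $r^k - (r-1)^k \leq k r^{k-1} \leq k f_i^{k-1}$ for $1 \leq r \leq f_i$, the inner sum is at most $k f_i^{k-1} \sum_{r=1}^{f_i} (r^k - (r-1)^k) = k f_i^{2k-1}$, and therefore $\mathbf{Var}[Y] \leq k m F_{2k-1} = k F_1 F_{2k-1}$.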
Using the preceding inequality, and the inequality $\frac{1}{n}\sum_{i=1}^{n} x_i \leq \left(\frac{1}{n}\sum_{i=1}^{n} x_i^k\right)^{\frac{1}{k}}$ for all $k \geq 1$ (due to the convexity of the function $g(x) = x^k$), we obtain $F_1 \leq n^{1-1/k} F_k^{1/k}$; combined with $F_{2k-1} \leq F_k^{(2k-1)/k}$, this gives

$$F_1 F_{2k-1} \leq n^{1-\frac{1}{k}} F_k^{\frac{1}{k}} \cdot F_k^{2-\frac{1}{k}} = n^{1-\frac{1}{k}} F_k^2.$$

Thus we have $\mathbf{E}[Y] = F_k$ and $\mathbf{Var}[Y] \leq k n^{1-1/k} F_k^2$. We now apply the trick of reducing the variance by averaging and then the median trick to obtain a high-probability bound. If we take $h$ independent estimators distributed as $Y$ and take their average, the variance goes down by a factor of $h$. We let $h = \frac{c}{\epsilon^2} k n^{1-1/k}$ for some fixed constant $c$. Let $Y'$ be the resulting averaged estimator. We have $\mathbf{E}[Y'] = F_k$ and $\mathbf{Var}[Y'] \leq \mathbf{Var}[Y]/h \leq \frac{\epsilon^2}{c} F_k^2$. Now, using Chebyshev's inequality, we have

$$\Pr\left[\,|Y' - F_k| \geq \epsilon F_k\,\right] \leq \frac{\mathbf{Var}[Y']}{\epsilon^2 F_k^2} \leq \frac{1}{c}.$$
We can choose $c = 3$ to obtain an $(\epsilon, 1/3)$-approximation. By using the median trick with $\Theta(\log \frac{1}{\delta})$ independent estimators we can obtain an $(\epsilon, \delta)$-approximation. The overall number of estimators we run independently is $O(\log \frac{1}{\delta} \cdot \frac{1}{\epsilon^2} \cdot n^{1-1/k})$. Each estimator requires $O(\log n + \log m)$ space since we keep track of one index from $[m]$, one count from $[m]$, and one item from $[n]$. Thus the space usage to obtain an $(\epsilon, \delta)$-approximation is $O(\log \frac{1}{\delta} \cdot \frac{1}{\epsilon^2} \cdot n^{1-1/k} \cdot (\log m + \log n))$. The time to process each stream element is of the same order.
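As an illustration of how the pieces combine, here is a hypothetical Python driver (not from the notes) that averages $h$ independent copies of the estimator and then takes the median of $\Theta(\log \frac{1}{\delta})$ such averages. It reuses the `ams_estimate` sketch above and, for simplicity, rereads the stream once per copy; in the streaming setting all copies would instead be maintained in parallel over a single pass.

```python
import math
from statistics import median

# Assumes ams_estimate from the earlier sketch is in scope.
def estimate_Fk(stream, k, n, eps, delta, c=3):
    """Mean-of-h, median-of-t estimator for F_k built from ams_estimate."""
    g = lambda x: x ** k
    h = math.ceil((c / eps ** 2) * k * n ** (1 - 1 / k))  # copies per average
    t = math.ceil(math.log(1 / delta))                    # number of averages
    averages = [sum(ams_estimate(stream, g) for _ in range(h)) / h
                for _ in range(t)]
    return median(averages)

# Example: for this stream f = (60, 20, 20), so F_2 = 60^2 + 20^2 + 20^2 = 4400.
print(estimate_Fk([1, 2, 1, 3, 1] * 20, k=2, n=3, eps=0.5, delta=0.1))
```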
The space complexity of $\tilde{O}(n^{1-1/k})$ is not optimal for estimating $F_k$. One can achieve $\tilde{O}(n^{1-2/k})$, which is optimal for $k > 2$, and one can in fact achieve poly-logarithmic space for $1 \leq k \leq 2$. We will see these results later in the course.
Bibliographic Notes: See Chapter 1 of the draft book by McGregor and Muthukrishnan, in particular the application of AMS sampling to estimating the entropy. See Chapter 5 of Amit Chakrabarti's lecture notes for the special case of frequency moments explained in detail; in particular, he states a clean lemma that bundles the variance reduction technique and the median trick.
You can read the notes from the next lecture of Chandra Chekuri's course, on Estimating the $F_2$ Norm, Sketching, and the Johnson-Lindenstrauss Lemma, here.
In the context of entropy, by convention, $x \log x = 0$ for $x = 0$. ↩︎