Estimating $F_k$ norms via AMS sampling

You can read the notes from the previous lecture of Chandra Chekuri's course on Estimating the Number of Distinct Elements in a Stream here.

1. AMS Sampling

We have seen reservoir sampling and the related weighted sampling technique to obtain independent samples from a stream without the algorithm knowing the length of the stream. We now discuss a technique to sample from a stream $\sigma = a_1, a_2, \ldots, a_m$, where the tokens $a_j$ are integers from $[n]$, when we wish to estimate a function
$$g(\sigma) := \sum_{i \in [n]} g(f_i)$$
where $f_i$ is the frequency of $i$ and $g$ is a real-valued function with $g(0) = 0$. A natural example is estimating the frequency moments $F_k = \sum_{i \in [n]} f_i^k$; here $g(x) = x^k$, a convex function for $k \geq 1$. Another example is the empirical entropy of $\sigma$, defined as $\sum_{i \in [n]} p_i \log \frac{1}{p_i}$ where $p_i = \frac{f_i}{m}$ is the empirical probability of $i$; since $m$ is known, it suffices to estimate $\sum_{i \in [n]} f_i \log f_i$, which corresponds to $g(x) = x \log x$.[1]
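To make the setup concrete, here is a small offline sketch (not from the notes; the stream and helper names are made up for illustration) that computes $g(\sigma)$ directly from the frequency vector for the two choices of $g$ above.

```python
from collections import Counter
from math import log

def g_sum(stream, g):
    """Compute g(sigma) = sum_i g(f_i) offline, from the exact frequencies."""
    freqs = Counter(stream)                  # f_i for every i that appears (g(0) = 0 covers the rest)
    return sum(g(f) for f in freqs.values())

stream = [3, 1, 3, 2, 3, 1]                  # m = 6 tokens over [n] = {1, 2, 3}
F2 = g_sum(stream, lambda x: x ** 2)         # frequency moment F_2 = 3^2 + 2^2 + 1^2 = 14
S  = g_sum(stream, lambda x: x * log(x))     # sum_i f_i log f_i, from which the entropy follows
print(F2, S)
```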
AMS sampling, from the famous paper of Alon, Matias, and Szegedy, gives an unbiased estimator for $g(\sigma)$. The estimator is based on a random variable $Y$ defined as follows. Let $J$ be a uniformly random sample from $[m]$. Let $R = |\{j \mid a_j = a_J,\ J \leq j \leq m\}|$. That is, $R$ counts the tokens from position $J$ onward that equal $a_J$. The estimate $Y$ is then defined as:
$$Y = m\,(g(R) - g(R-1)).$$
The lemma below shows that $Y$ is an unbiased estimator of $g(\sigma)$.
Lemma 1
$$\mathbf{E}[Y] = g(\sigma) = \sum_{i \in [n]} g(f_i).$$
Proof: The probability that $a_J = i$ is exactly $f_i/m$ since $J$ is a uniform sample. Moreover, if $a_J = i$ then $R$ is distributed as a uniform random variable over $[f_i]$. Hence,
$$\begin{aligned} \mathbf{E}[Y] &= \sum_{i \in [n]} \operatorname{Pr}[a_J = i]\, \mathbf{E}[Y \mid a_J = i] \\ &= \sum_{i \in [n]} \frac{f_i}{m}\, \mathbf{E}[Y \mid a_J = i] \\ &= \sum_{i \in [n]} \frac{f_i}{m} \sum_{\ell=1}^{f_i} \frac{1}{f_i}\, m\, (g(\ell) - g(\ell-1)) \\ &= \sum_{i \in [n]} g(f_i), \end{aligned}$$

where the last equality uses the telescoping of the inner sum together with $g(0) = 0$.
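As a quick sanity check of Lemma 1 (an illustrative sketch, not part of the notes), one can enumerate all $m$ equally likely values of $J$, compute $Y$ for each, and verify that the average equals $\sum_i g(f_i)$.

```python
from collections import Counter

def exact_expectation_of_Y(stream, g):
    """E[Y] computed exactly by enumerating the m equally likely choices of J."""
    m = len(stream)
    total = 0.0
    for J in range(m):                                        # J uniform over positions
        R = sum(1 for j in range(J, m) if stream[j] == stream[J])
        total += m * (g(R) - g(R - 1))                        # the estimate Y for this J
    return total / m

stream = [3, 1, 3, 2, 3, 1]
g = lambda x: x ** 2                                          # g(x) = x^k with k = 2
truth = sum(f ** 2 for f in Counter(stream).values())         # F_2 = 14
assert abs(exact_expectation_of_Y(stream, g) - truth) < 1e-9  # E[Y] = g(sigma)
```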
One can compute $Y$ in small space in the streaming setting via the reservoir sampling idea for maintaining a uniform sample. The algorithm is given below; the count $R$ is reset whenever a new sample is picked.
$\underline{\textbf{AMSEstimate:}}$

- $s \leftarrow \text{null}$; $m \leftarrow 0$; $R \leftarrow 0$
- While (stream is not done):
    - $m \leftarrow m + 1$
    - $a_m$ is the current item
    - Toss a biased coin that is heads with probability $1/m$
    - If (coin turns up heads): $s \leftarrow a_m$; $R \leftarrow 1$
    - Else If ($a_m = s$): $R \leftarrow R + 1$
- Output $m\,(g(R) - g(R-1))$
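Below is a minimal single-pass implementation of AMSEstimate in Python, a sketch under the assumption that the stream is an arbitrary iterable; the function and variable names are mine, not from the notes.

```python
import random

def ams_estimate(stream, g):
    """One pass of AMSEstimate: returns m * (g(R) - g(R - 1)) using O(1) words of state."""
    s, m, R = None, 0, 0
    for a in stream:
        m += 1
        if random.random() < 1.0 / m:   # heads with probability 1/m: replace the sample
            s, R = a, 1
        elif a == s:                    # current item matches the sample: extend its count
            R += 1
    return m * (g(R) - g(R - 1))

# A single (high-variance) estimate of F_2 for a small stream.
print(ams_estimate([3, 1, 3, 2, 3, 1], lambda x: x ** 2))
```

A single run has large variance; the averaging and median steps discussed next address this.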
To obtain an $(\epsilon, \delta)$-approximation via the estimator $Y$ we need to bound $\mathbf{Var}[Y]$ and apply standard tools. We do this next for frequency moments.

1.1. Application to estimating frequency moments

We now apply AMS sampling to estimate $F_k$, the $k$'th frequency moment, for $k \geq 1$. We have already seen that $Y$ is an unbiased estimator for $F_k$ when we set $g(x) = x^k$. We now bound the variance of $Y$ in this setting.
Lemma 2 When $g(x) = x^k$ and $k \geq 1$,
$$\mathbf{Var}[Y] \leq k\, F_1 F_{2k-1} \leq k\, n^{1 - \frac{1}{k}} F_k^2.$$
Proof:
$$\begin{aligned} \mathbf{Var}[Y] &\leq \mathbf{E}[Y^2] \\ &= \sum_{i \in [n]} \operatorname{Pr}[a_J = i] \sum_{\ell=1}^{f_i} \frac{1}{f_i}\, m^2\, (\ell^k - (\ell-1)^k)^2 \\ &= m \sum_{i \in [n]} \sum_{\ell=1}^{f_i} (\ell^k - (\ell-1)^k)^2 \\ &\leq m \sum_{i \in [n]} \sum_{\ell=1}^{f_i} k\, \ell^{k-1} (\ell^k - (\ell-1)^k) \qquad (\text{using } x^k - (x-1)^k \leq k x^{k-1}) \\ &\leq k\, m \sum_{i \in [n]} f_i^{k-1} \sum_{\ell=1}^{f_i} (\ell^k - (\ell-1)^k) \\ &= k\, m \sum_{i \in [n]} f_i^{k-1} f_i^k \;=\; k\, m\, F_{2k-1} \;=\; k\, F_1 F_{2k-1}. \end{aligned}$$
We now use convexity of the function $x^k$ for $k \geq 1$ to prove the second inequality. Note that $\max_i f_i = F_\infty$, and hence $F_\infty^{k-1} = (F_\infty^k)^{\frac{k-1}{k}} \leq \big(\sum_i f_i^k\big)^{\frac{k-1}{k}}$. Thus,
$$F_1 F_{2k-1} = \Big(\sum_i f_i\Big)\Big(\sum_i f_i^{2k-1}\Big) \leq \Big(\sum_i f_i\Big)\, F_\infty^{k-1} \Big(\sum_i f_i^k\Big) \leq \Big(\sum_i f_i\Big)\Big(\sum_i f_i^k\Big)^{\frac{k-1}{k}}\Big(\sum_i f_i^k\Big).$$
Using the preceding inequality together with the inequality $\frac{1}{n}\sum_{i=1}^{n} x_i \leq \big(\frac{1}{n}\sum_{i=1}^{n} x_i^k\big)^{\frac{1}{k}}$ for all $k \geq 1$ (a consequence of the convexity of $g(x) = x^k$), which implies $\sum_i f_i \leq n^{1-1/k}\big(\sum_i f_i^k\big)^{1/k}$, we obtain that
$$F_1 F_{2k-1} \leq \Big(\sum_i f_i\Big)\Big(\sum_i f_i^k\Big)^{\frac{k-1}{k}}\Big(\sum_i f_i^k\Big) \leq n^{1-1/k}\Big(\sum_i f_i^k\Big)^{\frac{1}{k}}\Big(\sum_i f_i^k\Big)^{\frac{k-1}{k}}\Big(\sum_i f_i^k\Big) = n^{1-1/k}\Big(\sum_i f_i^k\Big)^2.$$
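As a numerical sanity check of Lemma 2 (again an illustrative sketch with made-up data, not from the notes), the snippet below computes $\mathbf{Var}[Y]$ exactly for a small stream by enumerating $J$ and compares it against the bound $k F_1 F_{2k-1}$.

```python
from collections import Counter

def exact_mean_and_variance_of_Y(stream, k):
    """Exact E[Y] and Var[Y] for g(x) = x^k, by enumerating the uniform index J."""
    m, g = len(stream), (lambda x: x ** k)
    ys = []
    for J in range(m):
        R = sum(1 for j in range(J, m) if stream[j] == stream[J])
        ys.append(m * (g(R) - g(R - 1)))
    mean = sum(ys) / m
    var = sum((y - mean) ** 2 for y in ys) / m
    return mean, var

stream = [3, 1, 3, 2, 3, 1]
k = 2
freqs = Counter(stream).values()
F1 = sum(freqs)                                  # F_1 = m = 6
F_2k_minus_1 = sum(f ** (2 * k - 1) for f in freqs)
mean, var = exact_mean_and_variance_of_Y(stream, k)
print(mean, var, k * F1 * F_2k_minus_1)          # here: E[Y] = 14, Var[Y] = 80, bound = 432
```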
Thus we have $\mathbf{E}[Y] = F_k$ and $\mathbf{Var}[Y] \leq k n^{1-1/k} F_k^2$. We now apply the variance reduction trick followed by the median trick to obtain a high-probability bound. If we take $h$ independent copies of the estimator $Y$ and average them, the variance goes down by a factor of $h$. We let $h = \frac{c}{\epsilon^2} k n^{1-1/k}$ for some fixed constant $c$. Let $Y'$ be the resulting averaged estimator. We have $\mathbf{E}[Y'] = F_k$ and $\mathbf{Var}[Y'] \leq \mathbf{Var}[Y]/h \leq \frac{\epsilon^2}{c} F_k^2$. Now, using Chebyshev's inequality, we have
$$\operatorname{Pr}\big[|Y' - \mathbf{E}[Y']| \geq \epsilon\, \mathbf{E}[Y']\big] \leq \frac{\mathbf{Var}[Y']}{\epsilon^2\, \mathbf{E}[Y']^2} \leq \frac{1}{c}.$$
We can choose $c = 3$ to obtain an $(\epsilon, 1/3)$-approximation. By using the median trick with $\Theta(\log \frac{1}{\delta})$ independent copies of $Y'$ we obtain an $(\epsilon, \delta)$-approximation. The overall number of basic estimators we run independently is $O(\log \frac{1}{\delta} \cdot \frac{1}{\epsilon^2} \cdot n^{1-1/k})$ (treating $k$ as a constant). Each estimator requires $O(\log n + \log m)$ bits of space since we keep track of one index from $[m]$, one count from $[m]$, and one item from $[n]$. Thus the space usage to obtain an $(\epsilon, \delta)$-approximation is $O(\log \frac{1}{\delta} \cdot \frac{1}{\epsilon^2} \cdot n^{1-1/k} \cdot (\log m + \log n))$. The time to process each stream element is of the same order.
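Putting the pieces together, here is one possible way to assemble the full $(\epsilon, \delta)$-approximation: a median of $\Theta(\log \frac{1}{\delta})$ groups, each group an average of $h$ basic estimates. This is a sketch; the constants, names, and the multi-pass structure (a true streaming implementation runs all copies in parallel during a single pass) are illustrative assumptions.

```python
import math
import random
from statistics import median

def ams_estimate(stream, g):
    """One basic AMS estimate of sum_i g(f_i) (repeated from the sketch above for self-containment)."""
    s, m, R = None, 0, 0
    for a in stream:
        m += 1
        if random.random() < 1.0 / m:
            s, R = a, 1
        elif a == s:
            R += 1
    return m * (g(R) - g(R - 1))

def fk_estimate(stream, n, k, eps, delta, c=3):
    """(eps, delta)-approximation of F_k via averaging plus the median trick."""
    g = lambda x: x ** k
    h = math.ceil(c * k * n ** (1.0 - 1.0 / k) / eps ** 2)   # copies averaged per group
    t = math.ceil(math.log(1.0 / delta)) or 1                # Theta(log 1/delta) groups
    group_averages = [
        sum(ams_estimate(stream, g) for _ in range(h)) / h
        for _ in range(t)
    ]
    return median(group_averages)

stream = [3, 1, 3, 2, 3, 1] * 50          # F_2 = 150^2 + 100^2 + 50^2 = 35000
print(fk_estimate(stream, n=3, k=2, eps=0.5, delta=0.1))
```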
The space complexity of $\tilde{O}(n^{1-1/k})$ is not optimal for estimating $F_k$. One can achieve $\tilde{O}(n^{1-2/k})$, which is optimal for $k > 2$, and one can in fact achieve poly-logarithmic space for $1 \leq k \leq 2$. We will see these results later in the course.
Bibliographic Notes: See Chapter 1 of the draft book by McGregor and Muthukrishnan, which includes the application of AMS sampling to estimating the entropy. See Chapter 5 of Amit Chakrabarti's lecture notes for the special case of frequency moments explained in detail; in particular, he states a clean lemma that bundles the variance reduction technique and the median trick.
You can read the notes from the next lecture of Chandra Chekuri's course on Estimating F_2 norm, Sketching and Johnson-Lindenstrauss Lemma here.

  1. In the context of entropy, by convention, $x \log x = 0$ for $x = 0$. ↩︎
