Count and Count-Min Sketches

The previous lecture of Chandra Chekuri's course covered estimating the $F_2$ norm, sketching, and the Johnson-Lindenstrauss Lemma.
The Misra-Gries deterministic counting algorithm guarantees that all items with frequency $> F_1/k$ can be found using $O(k)$ counters and an update time of $O(\log k)$. Setting $k = 1/\epsilon$, one can view the algorithm as providing an additive $\epsilon F_1$ approximation for each $f_i$. However, the algorithm does not provide a sketch. One advantage of linear sketching algorithms is the ability to handle deletions. We now discuss two sketching algorithms that have found a number of applications. These sketches can be used for estimating point queries: after seeing a stream $\sigma$ over items in $[n]$ we would like to estimate $f_i$, the frequency of $i \in [n]$. More generally, in the turnstile model, we would like to estimate $x_i$ for a given $i \in [n]$. We can only guarantee the estimate with an additive error.

1. CountMin Sketch

We first describe the simpler CountMin sketch. The sketch maintains several counters, best visualized as a rectangular array of width $w$ and depth $d$. With each row $i$ we have a hash function $h_i : [n] \rightarrow [w]$ that maps elements to one of $w$ buckets.
$\underline{\text{CountMin-Sketch}(w,d):}$
$h_1, h_2, \ldots, h_d$ are pair-wise independent hash functions from $[n] \rightarrow [w]$.
While (stream is not empty) do
$\quad a_t = (i_t, \Delta_t)$ is the current item
$\quad$ for $\ell = 1$ to $d$ do
$\qquad C[\ell, h_\ell(i_t)] \leftarrow C[\ell, h_\ell(i_t)] + \Delta_t$
endWhile
For $i \in [n]$ set $\tilde{x}_i = \min_{\ell=1}^{d} C[\ell, h_\ell(i)]$.
The counter $C[\ell, j]$ simply counts the sum of all $x_i$ such that $h_\ell(i) = j$. That is,
$$C[\ell, j] = \sum_{i : h_\ell(i) = j} x_i.$$
Exercise: CountMin is a linear sketch. What are the entries of the projection matrix?
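The pseudocode above is short enough to implement directly. Below is a minimal Python sketch of the structure, using the standard pairwise-independent family $h(x) = ((ax + b) \bmod p) \bmod w$ over a large prime $p$; the class and method names are illustrative, not from the notes.

```python
import random

class CountMinSketch:
    """Illustrative CountMin sketch: d rows of w counters, one hash per row."""
    PRIME = (1 << 61) - 1  # large prime for the pairwise-independent family

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        # h_l(x) = ((a*x + b) mod p) mod w is pairwise independent for random a, b
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _h(self, l, i):
        a, b = self.hashes[l]
        return ((a * i + b) % self.PRIME) % self.w

    def update(self, i, delta=1):
        # process stream item (i, delta): add delta to one counter in each row
        for l in range(self.d):
            self.C[l][self._h(l, i)] += delta

    def query(self, i):
        # point query: the minimum over the d counters i hashes to
        return min(self.C[l][self._h(l, i)] for l in range(self.d))
```

In the strict turnstile model the returned estimate never underestimates $x_i$, since every counter holding $i$ also accumulates the (non-negative) mass of colliding items.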
We will analyze the sketch in the strict turnstile model where $x_i \geq 0$ for all $i \in [n]$; note that $\Delta_t$ may be negative.
Lemma 1 Let $d = \Omega\left(\log \frac{1}{\delta}\right)$ and $w > \frac{2}{\epsilon}$. Then for any fixed $i \in [n]$, $x_i \leq \tilde{x}_i$ and
$$\Pr\left[\tilde{x}_i \geq x_i + \epsilon \|\mathbf{x}\|_1\right] \leq \delta.$$
Proof: Fix $i \in [n]$. Let $Z_\ell = C[\ell, h_\ell(i)]$ be the value of the counter in row $\ell$ to which $i$ is hashed. We have
$$\mathbf{E}[Z_\ell] = x_i + \sum_{i' \neq i} \Pr\left[h_\ell(i') = h_\ell(i)\right] x_{i'} = x_i + \sum_{i' \neq i} \frac{1}{w} x_{i'} \leq x_i + \frac{\epsilon}{2}\|\mathbf{x}\|_1.$$
Note that we used pair-wise independence of $h_\ell$ to conclude that $\Pr\left[h_\ell(i') = h_\ell(i)\right] = 1/w$.
By Markov's inequality (here we are using non-negativity of $\mathbf{x}$),
$$\Pr\left[Z_\ell > x_i + \epsilon \|\mathbf{x}\|_1\right] \leq 1/2.$$
Thus
$$\Pr\left[\min_\ell Z_\ell > x_i + \epsilon \|\mathbf{x}\|_1\right] \leq 1/2^d \leq \delta.$$
Remark: By choosing $\delta = 1/\operatorname{poly}(n)$, which corresponds to $d = \Theta(\log n)$, we can ensure with probability at least $1 - 1/\operatorname{poly}(n)$ that $\tilde{x}_i - x_i \leq \epsilon \|\mathbf{x}\|_1$ for all $i \in [n]$.
Exercise: For general turnstile streams where $\mathbf{x}$ can have negative entries we can take the median of the counters. For this estimate you should be able to prove the following.
$$\Pr\left[|\tilde{x}_i - x_i| \geq 3\epsilon \|\mathbf{x}\|_1\right] \leq \delta^{1/4}.$$

2. Count Sketch

Now we discuss the closely related Count sketch, which also maintains an array of counters parameterized by the width $w$ and depth $d$.
$\underline{\text{Count-Sketch}(w,d):}$
$h_1, h_2, \ldots, h_d$ are pair-wise independent hash functions from $[n] \rightarrow [w]$.
$g_1, g_2, \ldots, g_d$ are pair-wise independent hash functions from $[n] \rightarrow \{-1, 1\}$.
While (stream is not empty) do
$\quad a_t = (i_t, \Delta_t)$ is the current item
$\quad$ for $\ell = 1$ to $d$ do
$\qquad C[\ell, h_\ell(i_t)] \leftarrow C[\ell, h_\ell(i_t)] + g_\ell(i_t) \Delta_t$
endWhile
For $i \in [n]$ set $\tilde{x}_i = \operatorname{median}\{g_1(i) C[1, h_1(i)], g_2(i) C[2, h_2(i)], \ldots, g_d(i) C[d, h_d(i)]\}$.
Exercise: Count sketch is also a linear sketch. What are the entries of the projection matrix?
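The Count sketch is equally easy to implement. In the illustrative sketch below, the signs $g_\ell$ are derived from the low bit of a second modular hash family; this gives approximately pairwise-independent signs and is only meant to convey the structure, not to be a vetted hash construction.

```python
import random
import statistics

class CountSketch:
    """Illustrative Count sketch: signed updates, median-of-rows query."""
    PRIME = (1 << 61) - 1

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.h = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                  for _ in range(d)]
        self.g = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                  for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _bucket(self, l, i):
        a, b = self.h[l]
        return ((a * i + b) % self.PRIME) % self.w

    def _sign(self, l, i):
        a, b = self.g[l]
        return 1 if ((a * i + b) % self.PRIME) % 2 == 0 else -1

    def update(self, i, delta=1):
        # add g_l(i) * delta to the bucket i hashes to in each row
        for l in range(self.d):
            self.C[l][self._bucket(l, i)] += self._sign(l, i) * delta

    def query(self, i):
        # median over rows of g_l(i) * C[l, h_l(i)]
        return statistics.median(self._sign(l, i) * self.C[l][self._bucket(l, i)]
                                 for l in range(self.d))
```

Because updates are signed, colliding items cancel in expectation, which is what buys the $\epsilon \|\mathbf{x}\|_2$ (rather than $\epsilon \|\mathbf{x}\|_1$) error; deletions are handled for free.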
Lemma 2 Let $d \geq \log \frac{1}{\delta}$ and $w > \frac{3}{\epsilon^2}$. Then for any fixed $i \in [n]$, $\mathbf{E}[\tilde{x}_i] = x_i$ and
$$\Pr\left[|\tilde{x}_i - x_i| \geq \epsilon \|\mathbf{x}\|_2\right] \leq \delta.$$
Proof: Fix an $i \in [n]$. Let $Z_\ell = g_\ell(i) C[\ell, h_\ell(i)]$. For $i' \in [n]$ let $Y_{i'}$ be the indicator random variable that is 1 if $h_\ell(i) = h_\ell(i')$; that is, $i$ and $i'$ collide under $h_\ell$. Note that $\mathbf{E}[Y_{i'}] = \mathbf{E}[Y_{i'}^2] = 1/w$ from the pairwise independence of $h_\ell$. We have
$$Z_\ell = g_\ell(i) C[\ell, h_\ell(i)] = g_\ell(i) \sum_{i'} g_\ell(i') x_{i'} Y_{i'}.$$
Therefore,
$$\mathbf{E}[Z_\ell] = x_i + \sum_{i' \neq i} \mathbf{E}\left[g_\ell(i) g_\ell(i') Y_{i'}\right] x_{i'} = x_i,$$
because $\mathbf{E}[g_\ell(i) g_\ell(i')] = 0$ for $i \neq i'$ from pairwise independence of $g_\ell$, and $Y_{i'}$ is independent of $g_\ell(i)$ and $g_\ell(i')$. Now we upper bound the variance of $Z_\ell$.
$$\begin{aligned} \operatorname{Var}[Z_\ell] &= \mathbf{E}\left[\Big(\sum_{i' \neq i} g_\ell(i) g_\ell(i') Y_{i'} x_{i'}\Big)^2\right] \\ &= \mathbf{E}\left[\sum_{i' \neq i} x_{i'}^2 Y_{i'}^2 + \sum_{i' \neq i''} x_{i'} x_{i''} g_\ell(i') g_\ell(i'') Y_{i'} Y_{i''}\right] \\ &= \sum_{i' \neq i} x_{i'}^2 \mathbf{E}[Y_{i'}^2] \\ &\leq \|\mathbf{x}\|_2^2 / w. \end{aligned}$$
Using Chebyshev,
$$\Pr\left[|Z_\ell - x_i| \geq \epsilon \|\mathbf{x}\|_2\right] \leq \frac{\operatorname{Var}[Z_\ell]}{\epsilon^2 \|\mathbf{x}\|_2^2} \leq \frac{1}{\epsilon^2 w} \leq 1/3.$$
Now, via the Chernoff bound,
$$\Pr\left[\left|\operatorname{median}\{Z_1, \ldots, Z_d\} - x_i\right| \geq \epsilon \|\mathbf{x}\|_2\right] \leq e^{-cd} \leq \delta.$$
Thus choosing d = O ( log n ) d = O ( log n ) d=O(log n)d=O(\log n) and taking the median guarantees the desired bound with high probability.
Remark: By choosing $\delta = 1/\operatorname{poly}(n)$, which corresponds to $d = \Theta(\log n)$, we can ensure with probability at least $1 - 1/\operatorname{poly}(n)$ that $|\tilde{x}_i - x_i| \leq \epsilon \|\mathbf{x}\|_2$ for all $i \in [n]$.

3. Applications

Count and CountMin sketches have found a number of applications. Note that they have a similar structure though the guarantees are different. Consider the problem of estimating frequency moments. Count sketch outputs an estimate $\tilde{f}_i$ for $f_i$ with an additive error of $\epsilon \|\mathbf{f}\|_2$, while CountMin guarantees an additive error of $\epsilon \|\mathbf{f}\|_1$, which is always larger. CountMin provides a one-sided error when $\mathbf{x} \geq 0$, which has some benefits. CountMin uses $O\left(\frac{1}{\epsilon} \log \frac{1}{\delta}\right)$ counters while Count sketch uses $O\left(\frac{1}{\epsilon^2} \log \frac{1}{\delta}\right)$ counters. Note that the Misra-Gries algorithm uses $O(1/\epsilon)$ counters.

3.1. Heavy Hitters

We will call an index $i$ an $\alpha$-HH (for heavy hitter) if $x_i \geq \alpha \|\mathbf{x}\|_1$, where $\alpha \in (0, 1]$. We would like to find $S_\alpha$, the set of all $\alpha$-heavy hitters. We relax this requirement to outputting a set $S$ such that
$$S_\alpha \subseteq S \subseteq S_{\alpha - \epsilon}.$$
Here we will assume that $\epsilon < \alpha$, for otherwise the approximation does not make sense.
Suppose we used the CountMin sketch with $w = 2/\epsilon$ and $\delta = 1/n^c$ for sufficiently large $c$. Then, as we saw, with probability at least $1 - 1/\operatorname{poly}(n)$, for all $i \in [n]$,
$$x_i \leq \tilde{x}_i \leq x_i + \epsilon \|\mathbf{x}\|_1.$$
Once the sketch is computed we can simply go over all $i$ and add $i$ to $S$ if $\tilde{x}_i \geq \alpha \|\mathbf{x}\|_1$. It is easy to see that $S$ is the desired set.
Unfortunately the computation of $S$ is expensive. The sketch has $O\left(\frac{1}{\epsilon} \log n\right)$ counters and processing each $i$ takes time proportional to the number of counters; hence the total time is $O\left(\frac{1}{\epsilon} n \log n\right)$ to output a set $S$ of size $O\left(\frac{1}{\alpha}\right)$. It turns out that by keeping additional information in the sketch in a hierarchical fashion one can cut down the time to $O\left(\frac{1}{\alpha} \operatorname{polylog}(n)\right)$.
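To make the simple (non-hierarchical) procedure concrete, here is a self-contained illustrative sketch: build a CountMin sketch over the stream, then threshold the estimates against $\alpha \|\mathbf{x}\|_1$. The hash family and the restriction of the final scan to items actually seen (rather than all of $[n]$) are simplifications for the example.

```python
import random

def heavy_hitters(stream, alpha, eps, seed=0):
    """Return items whose CountMin estimate is at least alpha * ||x||_1."""
    w, d, p = int(2 / eps) + 1, 8, (1 << 61) - 1  # w ~ 2/eps buckets per row
    rng = random.Random(seed)
    hs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)]
    C = [[0] * w for _ in range(d)]
    total = 0
    for i, delta in stream:              # one pass: update every row
        total += delta
        for l, (a, b) in enumerate(hs):
            C[l][((a * i + b) % p) % w] += delta

    def est(i):                          # point query: min over rows
        return min(C[l][((a * i + b) % p) % w] for l, (a, b) in enumerate(hs))

    # scan candidates (here only items that appeared; in general all of [n])
    return {i for i, _ in stream if est(i) >= alpha * total}
```

Since estimates only overestimate, every true $\alpha$-heavy hitter survives the threshold, and (with high probability) only items of frequency at least $(\alpha - \epsilon)\|\mathbf{x}\|_1$ join them.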

3.2. Range Queries

In several applications the range $[n]$ corresponds to an actual total ordering of the items. For instance, $[n]$ could represent the discretization of time and $\mathbf{x}$ the corresponding signal. In databases, $[n]$ could represent ordered numerical attributes such as a person's age, height, or salary. In such settings range queries are very useful. A range query is an interval of the form $[i, j]$ where $i, j \in [n]$ and $i \leq j$. The goal is to output $\sum_{i \leq \ell \leq j} x_\ell$. Note that there are $O(n^2)$ potential queries.
There is a simple trick to solve this using the sketches we have seen. An interval $[i, j]$ is a dyadic interval/range if $j - i + 1$ is $2^k$ and $2^k$ divides $i - 1$. Assume $n$ is a power of 2. Then the dyadic intervals of length 1 are $[1,1], [2,2], \ldots, [n,n]$. Those of length 2 are $[1,2], [3,4], \ldots$ and those of length 4 are $[1,4], [5,8], \ldots$.
Claim 3 Every range $[i, j]$ can be expressed as a disjoint union of at most $2 \log n$ dyadic ranges.
Thus it suffices to maintain accurate point queries for the dyadic ranges. Note that there are at most $2n$ dyadic ranges. They fall into $O(\log n)$ groups based on length; the ranges of a given length partition the entire interval. We can keep a separate CountMin sketch for the $n/2^i$ dyadic intervals of length $2^i$ ($i = 0$ corresponds to the sketch for point queries). Using these $O(\log n)$ CountMin sketches we can answer any range query with an additive error of $\epsilon \|\mathbf{x}\|_1$. Note that a range $[i, j]$ is expressed as the sum of $2 \log n$ point queries, each of which has an additive error. So the $\epsilon'$ for the sketches has to be chosen to be $\epsilon / (2 \log n)$ to ensure an additive error of $\epsilon \|\mathbf{x}\|_1$ for the range queries.
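The decomposition of Claim 3 can be computed greedily: repeatedly take the longest dyadic interval that starts at the current left endpoint and still fits inside the range. A sketch with 1-based inclusive indices (the function name is illustrative):

```python
def dyadic_decompose(i, j):
    """Split [i, j] (1-based, inclusive) into disjoint dyadic intervals."""
    out = []
    lo = i
    while lo <= j:
        s = lo - 1                          # 0-based start of the interval
        align = s & -s if s else 1 << 61    # largest 2^k dividing s (any k if s = 0)
        length = align
        while lo + length - 1 > j:          # shrink to fit inside [lo, j]
            length //= 2
        out.append((lo, lo + length - 1))
        lo += length
    return out
```

Each returned piece $[i', j']$ satisfies the definition above: its length is a power of 2 dividing $i' - 1$, and at most two pieces occur per length, giving the $2 \log n$ bound.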
By choosing $d = O(\log n)$ the error probability over all point queries in all sketches will be at most $1/\operatorname{poly}(n)$. This will guarantee that all range queries are answered to within an additive $\epsilon \|\mathbf{x}\|_1$. The total space will be $O\left(\frac{1}{\epsilon} \log^3 n\right)$.

3.3. Sparse Recovery

Let $\mathbf{x} \in \mathbb{R}^n$ be a vector. Can we approximate $\mathbf{x}$ by a sparse vector $\mathbf{z}$? By sparse we mean that $\mathbf{z}$ has at most $k$ non-zero entries for some given $k$ (this is the same as saying $\|\mathbf{z}\|_0 \leq k$). A reasonable way to model this is to ask for computing the error
$$\operatorname{err}_p^k(\mathbf{x}) = \min_{\mathbf{z} : \|\mathbf{z}\|_0 \leq k} \|\mathbf{x} - \mathbf{z}\|_p$$
for some $p$. A typical choice is $p = 2$. It is easy to see that the optimum $\mathbf{z}$ is obtained by restricting $\mathbf{x}$ to its $k$ largest coordinates (in absolute value). The question we ask here is whether we can estimate $\operatorname{err}_2^k(\mathbf{x})$ efficiently in a streaming fashion. For this we use the Count sketch. Recall that by choosing $w = 3/\epsilon^2$ and $d = \Theta(\log n)$ the sketch ensures that with high probability,
$$\forall i \in [n], \quad |\tilde{x}_i - x_i| \leq \epsilon \|\mathbf{x}\|_2.$$
One can in fact show a generalization.
Lemma 4 Count-Sketch with $w = 3k/\epsilon$ and $d = O(\log n)$ ensures that
$$\forall i \in [n], \quad |\tilde{x}_i - x_i| \leq \frac{\epsilon}{\sqrt{k}} \operatorname{err}_2^k(\mathbf{x}).$$
Proof: Let $S = \{i_1, i_2, \ldots, i_k\}$ be the indices of the $k$ largest coordinates of $\mathbf{x}$ and let $\mathbf{x}'$ be obtained from $\mathbf{x}$ by setting the entries with indices in $S$ to zero. Note that $\operatorname{err}_2^k(\mathbf{x}) = \|\mathbf{x}'\|_2$. Fix a coordinate $i$. Consider row $\ell$ and let $Z_\ell = g_\ell(i) C[\ell, h_\ell(i)]$ as before. Let $A_\ell$ be the event that there exists an index $t \in S$ such that $h_\ell(i) = h_\ell(t)$; that is, some "big" coordinate collides with $i$ under $h_\ell$. Note that $\Pr[A_\ell] \leq \sum_{t \in S} \Pr[h_\ell(i) = h_\ell(t)] \leq |S|/w \leq \epsilon/3$ by pair-wise independence of $h_\ell$. Now we estimate
$$\begin{aligned} \Pr\left[|Z_\ell - x_i| \geq \frac{\epsilon}{\sqrt{k}} \operatorname{err}_2^k(\mathbf{x})\right] &= \Pr\left[|Z_\ell - x_i| \geq \frac{\epsilon}{\sqrt{k}} \|\mathbf{x}'\|_2\right] \\ &\leq \Pr[A_\ell] + \Pr\left[|Z_\ell - x_i| \geq \frac{\epsilon}{\sqrt{k}} \|\mathbf{x}'\|_2 \,\Big|\, \neg A_\ell\right] \\ &\leq \Pr[A_\ell] + 1/3 < 1/2. \end{aligned}$$
Now let $\tilde{\mathbf{x}}$ be the approximation to $\mathbf{x}$ obtained from the sketch. We can take the $k$ largest coordinates of $\tilde{\mathbf{x}}$ to form the vector $\mathbf{z}$ and output $\mathbf{z}$. We claim that this gives a good approximation to $\operatorname{err}_2^k(\mathbf{x})$. To see this we prove the following lemma.
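The recovery step is simple to state in code. The illustrative helpers below compute $\operatorname{err}_2^k(\mathbf{x})$ exactly (useful for checking) and form the vector $\mathbf{z}$ of the lemma below from any estimate vector $\mathbf{y}$; the function names are assumptions of this example.

```python
import math

def err2k(x, k):
    """err_2^k(x): l2 norm of x minus its best k-sparse approximation."""
    # drop the k largest |x_i|; the rest is the unavoidable error mass
    rest = sorted(abs(v) for v in x)[:-k] if k else [abs(v) for v in x]
    return math.sqrt(sum(v * v for v in rest))

def topk_recover(y, k):
    """Keep the k largest-|.| coordinates of the estimates y, zero the rest."""
    keep = set(sorted(range(len(y)), key=lambda i: -abs(y[i]))[:k])
    return [y[i] if i in keep else 0.0 for i in range(len(y))]
```

Feeding `topk_recover` the Count-sketch estimates $\tilde{\mathbf{x}}$ yields the $\mathbf{z}$ whose quality the lemma bounds.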
Lemma 5 Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ be such that
$$\|\mathbf{x} - \mathbf{y}\|_\infty \leq \frac{\epsilon}{\sqrt{k}} \operatorname{err}_2^k(\mathbf{x}).$$
Then,
$$\|\mathbf{x} - \mathbf{z}\|_2 \leq (1 + 5\epsilon) \operatorname{err}_2^k(\mathbf{x}),$$
where $\mathbf{z}$ is the vector obtained as follows: $z_i = y_i$ for $i \in T$, where $T$ is the set of the $k$ largest (in absolute value) indices of $\mathbf{y}$, and $z_i = 0$ for $i \notin T$.
Proof: Let $t = \frac{1}{\sqrt{k}} \operatorname{err}_2^k(\mathbf{x})$ to ease the notation. Let $S$ be the index set of the $k$ largest coordinates of $\mathbf{x}$. We have
$$\left(\operatorname{err}_2^k(\mathbf{x})\right)^2 = k t^2 = \sum_{i \in [n] \setminus S} x_i^2 = \sum_{i \in T \setminus S} x_i^2 + \sum_{i \in [n] \setminus (S \cup T)} x_i^2.$$
We write:
$$\begin{aligned} \|\mathbf{x} - \mathbf{z}\|_2^2 &= \sum_{i \in T} |x_i - z_i|^2 + \sum_{i \in S \setminus T} |x_i - z_i|^2 + \sum_{i \in [n] \setminus (S \cup T)} x_i^2 \\ &= \sum_{i \in T} |x_i - y_i|^2 + \sum_{i \in S \setminus T} x_i^2 + \sum_{i \in [n] \setminus (S \cup T)} x_i^2. \end{aligned}$$
We treat each term separately. The first one is easy to bound.
$$\sum_{i \in T} |x_i - y_i|^2 \leq \sum_{i \in T} \epsilon^2 t^2 \leq \epsilon^2 k t^2.$$
The third term is common to $\|\mathbf{x} - \mathbf{z}\|_2^2$ and $\left(\operatorname{err}_2^k(\mathbf{x})\right)^2$. The second term is the one to care about.
Note that $S$ is the set of the $k$ largest coordinates of $\mathbf{x}$ and $T$ is the set of the $k$ largest coordinates of $\mathbf{y}$. Thus $|S \setminus T| = |T \setminus S|$; say their common cardinality is $\ell \geq 1$ (if it is 0 the second term vanishes and there is nothing to do). Since $\mathbf{x}$ and $\mathbf{y}$ are close in the $\ell_\infty$ norm (that is, close in each coordinate), the coordinates in $S \setminus T$ and $T \setminus S$ must have roughly the same values in $\mathbf{x}$. More precisely, let $a = \max_{i \in S \setminus T} |x_i|$ and $b = \min_{i \in T \setminus S} |x_i|$. We leave it as an exercise to the reader to argue that $a \leq b + 2\epsilon t$ since $\|\mathbf{x} - \mathbf{y}\|_\infty \leq \epsilon t$.
Thus,
$$\sum_{i \in S \setminus T} x_i^2 \leq \ell a^2 \leq \ell (b + 2\epsilon t)^2 \leq \ell b^2 + 4\epsilon k t b + 4 k \epsilon^2 t^2.$$
But we have
$$\sum_{i \in T \setminus S} x_i^2 \geq \ell b^2.$$
Putting things together,
$$\begin{aligned} \|\mathbf{x} - \mathbf{z}\|_2^2 &\leq \ell b^2 + 4\epsilon k t b + \sum_{i \in [n] \setminus (S \cup T)} x_i^2 + 5 k \epsilon^2 t^2 \\ &\leq \sum_{i \in T \setminus S} x_i^2 + \sum_{i \in [n] \setminus (S \cup T)} x_i^2 + 4\epsilon \left(\operatorname{err}_2^k(\mathbf{x})\right)^2 + 5 \epsilon^2 \left(\operatorname{err}_2^k(\mathbf{x})\right)^2 \\ &\leq \left(\operatorname{err}_2^k(\mathbf{x})\right)^2 + 9\epsilon \left(\operatorname{err}_2^k(\mathbf{x})\right)^2. \end{aligned}$$
The lemma follows from the fact that for sufficiently small $\epsilon$, $\sqrt{1 + 9\epsilon} \leq 1 + 5\epsilon$.
Bibliographic Notes: Count sketch is by Charikar, Chen and Farach-Colton [1]. CountMin sketch is due to Cormode and Muthukrishnan [2]; see the papers for several applications. Cormode's survey on sketching in [3] has a nice perspective. See [4] for a comparative analysis (theoretical and experimental) of algorithms for finding frequent items. A deterministic variant of CountMin called CR-Precis is interesting; see http://polylogblog.wordpress.com/2009/09/22/bite-sized-streams-cr-precis/ for a blog post with pointers and some comments. The applications are taken from the first chapter of the draft book by McGregor and Muthukrishnan.
The next lecture of Chandra Chekuri's course covers $\ell_0$ sampling and priority sampling.

  1. Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3-15, 2004.
  2. Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58-75, 2005.
  3. Graham Cormode, Minos N. Garofalakis, Peter J. Haas, and Chris Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1-294, 2012.
  4. Graham Cormode and Marios Hadjieleftheriou. Methods for finding frequent items in data streams. VLDB J., 19(1):3-20, 2010.
