Count and Count-Min Sketches

The previous lecture of Chandra Chekuri's course covered estimating the $F_2$ norm, sketching, and the Johnson-Lindenstrauss Lemma.
The Misra-Gries deterministic counting algorithm guarantees that all items with frequency $> F_1/k$ can be found using $O(k)$ counters and an update time of $O(\log k)$. Setting $k = 1/\epsilon$, one can view the algorithm as providing an additive $\epsilon F_1$ approximation for each $f_i$. However, the algorithm does not provide a sketch. One advantage of linear sketching algorithms is the ability to handle deletions. We now discuss two sketching algorithms that have found a number of applications. These sketches can be used for estimating point queries: after seeing a stream $\sigma$ over items in $[n]$ we would like to estimate $f_i$, the frequency of $i \in [n]$. More generally, in the turnstile model, we would like to estimate $x_i$ for a given $i \in [n]$. We can only guarantee the estimate with an additive error.

1. CountMin Sketch

We first describe the simpler CountMin sketch. The sketch maintains several counters, best visualized as a rectangular array of width $w$ and depth $d$. With each row $i$ we have a hash function $h_i : [n] \rightarrow [w]$ that maps elements to one of $w$ buckets.
$\underline{\text{CountMin-Sketch}(w,d):}$
$h_1, h_2, \ldots, h_d$ are pair-wise independent hash functions from $[n] \rightarrow [w]$.
While (stream is not empty) do
$\quad a_t = (i_t, \Delta_t)$ is the current item
$\quad$ for $\ell = 1$ to $d$ do
$\qquad C[\ell, h_\ell(i_t)] \leftarrow C[\ell, h_\ell(i_t)] + \Delta_t$
endWhile
For $i \in [n]$ set $\tilde{x}_i = \min_{\ell=1}^{d} C[\ell, h_\ell(i)]$.
The counter $C[\ell, j]$ simply counts the sum of all $x_i$ such that $h_\ell(i) = j$. That is,
$$C[\ell, j] = \sum_{i : h_\ell(i) = j} x_i.$$
Exercise: CountMin is a linear sketch. What are the entries of the projection matrix?
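The pseudocode above is short enough to implement directly. Below is a minimal Python sketch of the structure, using the standard pairwise-independent family $h(x) = ((ax + b) \bmod p) \bmod w$ over a large prime $p$; the class and method names are illustrative, not from the notes.

```python
import random

class CountMinSketch:
    """Illustrative CountMin sketch: d rows of w counters, one hash per row."""
    PRIME = (1 << 61) - 1  # large prime for the pairwise-independent family

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        # h_l(x) = ((a*x + b) mod p) mod w is pairwise independent for random a, b
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _h(self, l, i):
        a, b = self.hashes[l]
        return ((a * i + b) % self.PRIME) % self.w

    def update(self, i, delta=1):
        # process stream item (i, delta): add delta to one counter in each row
        for l in range(self.d):
            self.C[l][self._h(l, i)] += delta

    def query(self, i):
        # point query: the minimum over the d counters i hashes to
        return min(self.C[l][self._h(l, i)] for l in range(self.d))
```

In the strict turnstile model the returned estimate never underestimates $x_i$, since every counter holding $i$ also accumulates the (non-negative) mass of colliding items.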
We will analyze the sketch in the strict turnstile model where $x_i \geq 0$ for all $i \in [n]$; note that $\Delta_t$ may be negative.
Lemma 1 Let $d = \Omega\left(\log \frac{1}{\delta}\right)$ and $w > \frac{2}{\epsilon}$. Then for any fixed $i \in [n]$, $x_i \leq \tilde{x}_i$ and
$$\Pr\left[\tilde{x}_i \geq x_i + \epsilon \|\mathbf{x}\|_1\right] \leq \delta.$$
Proof: Fix $i \in [n]$. Let $Z_\ell = C[\ell, h_\ell(i)]$ be the value of the counter in row $\ell$ to which $i$ is hashed. We have
$$\mathbf{E}[Z_\ell] = x_i + \sum_{i' \neq i} \Pr\left[h_\ell(i') = h_\ell(i)\right] x_{i'} = x_i + \sum_{i' \neq i} \frac{1}{w} x_{i'} \leq x_i + \frac{\epsilon}{2}\|\mathbf{x}\|_1.$$
Note that we used pair-wise independence of $h_\ell$ to conclude that $\Pr\left[h_\ell(i') = h_\ell(i)\right] = 1/w$.
By Markov's inequality (here we are using non-negativity of $\mathbf{x}$),
$$\Pr\left[Z_\ell > x_i + \epsilon \|\mathbf{x}\|_1\right] \leq 1/2.$$
Thus
$$\Pr\left[\min_\ell Z_\ell > x_i + \epsilon \|\mathbf{x}\|_1\right] \leq 1/2^d \leq \delta.$$
Remark: By choosing $\delta = 1/\operatorname{poly}(n)$, which corresponds to $d = \Theta(\log n)$, we can ensure with probability at least $1 - 1/\operatorname{poly}(n)$ that $\tilde{x}_i - x_i \leq \epsilon \|\mathbf{x}\|_1$ for all $i \in [n]$.
Exercise: For general turnstile streams where $\mathbf{x}$ can have negative entries we can take the median of the counters. For this estimate you should be able to prove the following.
$$\Pr\left[|\tilde{x}_i - x_i| \geq 3\epsilon \|\mathbf{x}\|_1\right] \leq \delta^{1/4}.$$

2. Count Sketch

Now we discuss the closely related Count sketch, which also maintains an array of counters parameterized by the width $w$ and depth $d$.
$\underline{\text{Count-Sketch}(w,d):}$
$h_1, h_2, \ldots, h_d$ are pair-wise independent hash functions from $[n] \rightarrow [w]$.
$g_1, g_2, \ldots, g_d$ are pair-wise independent hash functions from $[n] \rightarrow \{-1, 1\}$.
While (stream is not empty) do
$\quad a_t = (i_t, \Delta_t)$ is the current item
$\quad$ for $\ell = 1$ to $d$ do
$\qquad C[\ell, h_\ell(i_t)] \leftarrow C[\ell, h_\ell(i_t)] + g_\ell(i_t) \Delta_t$
endWhile
For $i \in [n]$ set $\tilde{x}_i = \operatorname{median}\{g_1(i) C[1, h_1(i)], g_2(i) C[2, h_2(i)], \ldots, g_d(i) C[d, h_d(i)]\}$.
Exercise: Count sketch is also a linear sketch. What are the entries of the projection matrix?
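The Count sketch is equally easy to implement. In the illustrative sketch below, the signs $g_\ell$ are derived from the low bit of a second modular hash family; this gives approximately pairwise-independent signs and is only meant to convey the structure, not to be a vetted hash construction.

```python
import random
import statistics

class CountSketch:
    """Illustrative Count sketch: signed updates, median-of-rows query."""
    PRIME = (1 << 61) - 1

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.h = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                  for _ in range(d)]
        self.g = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                  for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _bucket(self, l, i):
        a, b = self.h[l]
        return ((a * i + b) % self.PRIME) % self.w

    def _sign(self, l, i):
        a, b = self.g[l]
        return 1 if ((a * i + b) % self.PRIME) % 2 == 0 else -1

    def update(self, i, delta=1):
        # add g_l(i) * delta to the bucket i hashes to in each row
        for l in range(self.d):
            self.C[l][self._bucket(l, i)] += self._sign(l, i) * delta

    def query(self, i):
        # median over rows of g_l(i) * C[l, h_l(i)]
        return statistics.median(self._sign(l, i) * self.C[l][self._bucket(l, i)]
                                 for l in range(self.d))
```

Because updates are signed, colliding items cancel in expectation, which is what buys the $\epsilon \|\mathbf{x}\|_2$ (rather than $\epsilon \|\mathbf{x}\|_1$) error; deletions are handled for free.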
Lemma 2 Let $d \geq \log \frac{1}{\delta}$ and $w > \frac{3}{\epsilon^2}$. Then for any fixed $i \in [n]$, $\mathbf{E}[\tilde{x}_i] = x_i$ and
$$\Pr\left[|\tilde{x}_i - x_i| \geq \epsilon \|\mathbf{x}\|_2\right] \leq \delta.$$
Proof: Fix an $i \in [n]$. Let $Z_\ell = g_\ell(i) C[\ell, h_\ell(i)]$. For $i' \in [n]$ let $Y_{i'}$ be the indicator random variable that is 1 if $h_\ell(i) = h_\ell(i')$; that is, $i$ and $i'$ collide under $h_\ell$. Note that $\mathbf{E}[Y_{i'}] = \mathbf{E}[Y_{i'}^2] = 1/w$ from the pairwise independence of $h_\ell$. We have
$$Z_\ell = g_\ell(i) C[\ell, h_\ell(i)] = g_\ell(i) \sum_{i'} g_\ell(i') x_{i'} Y_{i'}.$$
Therefore,
$$\mathbf{E}[Z_\ell] = x_i + \sum_{i' \neq i} \mathbf{E}\left[g_\ell(i) g_\ell(i') Y_{i'}\right] x_{i'} = x_i,$$
because $\mathbf{E}[g_\ell(i) g_\ell(i')] = 0$ for $i \neq i'$ from pairwise independence of $g_\ell$, and $Y_{i'}$ is independent of $g_\ell(i)$ and $g_\ell(i')$. Now we upper bound the variance of $Z_\ell$.
$$\begin{aligned} \operatorname{Var}[Z_\ell] &= \mathbf{E}\left[\Big(\sum_{i' \neq i} g_\ell(i) g_\ell(i') Y_{i'} x_{i'}\Big)^2\right] \\ &= \mathbf{E}\left[\sum_{i' \neq i} x_{i'}^2 Y_{i'}^2 + \sum_{i' \neq i''} x_{i'} x_{i''} g_\ell(i') g_\ell(i'') Y_{i'} Y_{i''}\right] \\ &= \sum_{i' \neq i} x_{i'}^2 \mathbf{E}[Y_{i'}^2] \\ &\leq \|\mathbf{x}\|_2^2 / w. \end{aligned}$$
Using Chebyshev,
$$\Pr\left[|Z_\ell - x_i| \geq \epsilon \|\mathbf{x}\|_2\right] \leq \frac{\operatorname{Var}[Z_\ell]}{\epsilon^2 \|\mathbf{x}\|_2^2} \leq \frac{1}{\epsilon^2 w} \leq 1/3.$$
Now, via the Chernoff bound,
$$\Pr\left[\left|\operatorname{median}\{Z_1, \ldots, Z_d\} - x_i\right| \geq \epsilon \|\mathbf{x}\|_2\right] \leq e^{-cd} \leq \delta.$$
Thus choosing d = O ( log n ) d = O ( log n ) d=O(log n)d=O(\log n) and taking the median guarantees the desired bound with high probability.
Remark: By choosing $\delta = 1/\operatorname{poly}(n)$, which corresponds to $d = \Theta(\log n)$, we can ensure with probability at least $1 - 1/\operatorname{poly}(n)$ that $|\tilde{x}_i - x_i| \leq \epsilon \|\mathbf{x}\|_2$ for all $i \in [n]$.

3. Applications

Count and CountMin sketches have found a number of applications. Note that they have a similar structure though the guarantees are different. Consider the problem of estimating frequency moments. Count sketch outputs an estimate $\tilde{f}_i$ for $f_i$ with an additive error of $\epsilon \|\mathbf{f}\|_2$, while CountMin guarantees an additive error of $\epsilon \|\mathbf{f}\|_1$, which is always larger. CountMin provides a one-sided error when $\mathbf{x} \geq 0$, which has some benefits. CountMin uses $O\left(\frac{1}{\epsilon} \log \frac{1}{\delta}\right)$ counters while Count sketch uses $O\left(\frac{1}{\epsilon^2} \log \frac{1}{\delta}\right)$ counters. Note that the Misra-Gries algorithm uses $O(1/\epsilon)$ counters.

3.1. Heavy Hitters

We will call an index $i$ an $\alpha$-HH (for heavy hitter) if $x_i \geq \alpha \|\mathbf{x}\|_1$, where $\alpha \in (0, 1]$. We would like to find $S_\alpha$, the set of all $\alpha$-heavy hitters. We relax this requirement to outputting a set $S$ such that
$$S_\alpha \subseteq S \subseteq S_{\alpha - \epsilon}.$$
Here we will assume that $\epsilon < \alpha$, for otherwise the approximation does not make sense.
Suppose we used the CountMin sketch with $w = 2/\epsilon$ and $\delta = 1/n^c$ for sufficiently large $c$. Then, as we saw, with probability at least $1 - 1/\operatorname{poly}(n)$, for all $i \in [n]$,
$$x_i \leq \tilde{x}_i \leq x_i + \epsilon \|\mathbf{x}\|_1.$$
Once the sketch is computed we can simply go over all $i$ and add $i$ to $S$ if $\tilde{x}_i \geq \alpha \|\mathbf{x}\|_1$. It is easy to see that $S$ is the desired set.
Unfortunately the computation of $S$ is expensive. The sketch has $O\left(\frac{1}{\epsilon} \log n\right)$ counters and processing each $i$ takes time proportional to the number of counters; hence the total time is $O\left(\frac{1}{\epsilon} n \log n\right)$ to output a set $S$ of size $O\left(\frac{1}{\alpha}\right)$. It turns out that by keeping additional information in the sketch in a hierarchical fashion one can cut down the time to $O\left(\frac{1}{\alpha} \operatorname{polylog}(n)\right)$.
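To make the simple (non-hierarchical) procedure concrete, here is a self-contained illustrative sketch: build a CountMin sketch over the stream, then threshold the estimates against $\alpha \|\mathbf{x}\|_1$. The hash family and the restriction of the final scan to items actually seen (rather than all of $[n]$) are simplifications for the example.

```python
import random

def heavy_hitters(stream, alpha, eps, seed=0):
    """Return items whose CountMin estimate is at least alpha * ||x||_1."""
    w, d, p = int(2 / eps) + 1, 8, (1 << 61) - 1  # w ~ 2/eps buckets per row
    rng = random.Random(seed)
    hs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)]
    C = [[0] * w for _ in range(d)]
    total = 0
    for i, delta in stream:              # one pass: update every row
        total += delta
        for l, (a, b) in enumerate(hs):
            C[l][((a * i + b) % p) % w] += delta

    def est(i):                          # point query: min over rows
        return min(C[l][((a * i + b) % p) % w] for l, (a, b) in enumerate(hs))

    # scan candidates (here only items that appeared; in general all of [n])
    return {i for i, _ in stream if est(i) >= alpha * total}
```

Since estimates only overestimate, every true $\alpha$-heavy hitter survives the threshold, and (with high probability) only items of frequency at least $(\alpha - \epsilon)\|\mathbf{x}\|_1$ join them.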

3.2. Range Queries

In several applications the range $[n]$ corresponds to an actual total ordering of the items. For instance, $[n]$ could represent the discretization of time and $\mathbf{x}$ the corresponding signal. In databases, $[n]$ could represent ordered numerical attributes such as a person's age, height, or salary. In such settings range queries are very useful. A range query is an interval of the form $[i, j]$ where $i, j \in [n]$ and $i \leq j$. The goal is to output $\sum_{i \leq \ell \leq j} x_\ell$. Note that there are $O(n^2)$ potential queries.
There is a simple trick to solve this using the sketches we have seen. An interval $[i, j]$ is a dyadic interval/range if $j - i + 1$ is $2^k$ and $2^k$ divides $i - 1$. Assume $n$ is a power of 2. Then the dyadic intervals of length 1 are $[1,1], [2,2], \ldots, [n,n]$. Those of length 2 are $[1,2], [3,4], \ldots$ and those of length 4 are $[1,4], [5,8], \ldots$.
Claim 3 Every range $[i, j]$ can be expressed as a disjoint union of at most $2 \log n$ dyadic ranges.
Thus it suffices to maintain accurate point queries for the dyadic ranges. Note that there are at most $2n$ dyadic ranges. They fall into $O(\log n)$ groups based on length; the ranges of a given length partition the entire interval. We can keep a separate CountMin sketch for the $n/2^i$ dyadic intervals of length $2^i$ ($i = 0$ corresponds to the sketch for point queries). Using these $O(\log n)$ CountMin sketches we can answer any range query with an additive error of $\epsilon \|\mathbf{x}\|_1$. Note that a range $[i, j]$ is expressed as the sum of $2 \log n$ point queries, each of which has an additive error. So the $\epsilon'$ for the sketches has to be chosen to be $\epsilon / (2 \log n)$ to ensure an additive error of $\epsilon \|\mathbf{x}\|_1$ for the range queries.
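The decomposition of Claim 3 can be computed greedily: repeatedly take the longest dyadic interval that starts at the current left endpoint and still fits inside the range. A sketch with 1-based inclusive indices (the function name is illustrative):

```python
def dyadic_decompose(i, j):
    """Split [i, j] (1-based, inclusive) into disjoint dyadic intervals."""
    out = []
    lo = i
    while lo <= j:
        s = lo - 1                          # 0-based start of the interval
        align = s & -s if s else 1 << 61    # largest 2^k dividing s (any k if s = 0)
        length = align
        while lo + length - 1 > j:          # shrink to fit inside [lo, j]
            length //= 2
        out.append((lo, lo + length - 1))
        lo += length
    return out
```

Each returned piece $[i', j']$ satisfies the definition above: its length is a power of 2 dividing $i' - 1$, and at most two pieces occur per length, giving the $2 \log n$ bound.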
By choosing $d = O(\log n)$ the error probability over all point queries in all sketches will be at most $1/\operatorname{poly}(n)$. This will guarantee that all range queries are answered to within an additive $\epsilon \|\mathbf{x}\|_1$. The total space will be $O\left(\frac{1}{\epsilon} \log^3 n\right)$.

3.3. Sparse Recovery

Let $\mathbf{x} \in \mathbb{R}^n$ be a vector. Can we approximate $\mathbf{x}$ by a sparse vector $\mathbf{z}$? By sparse we mean that $\mathbf{z}$ has at most $k$ non-zero entries for some given $k$ (this is the same as saying $\|\mathbf{z}\|_0 \leq k$). A reasonable way to model this is to ask for computing the error
$$\operatorname{err}_p^k(\mathbf{x}) = \min_{\mathbf{z} : \|\mathbf{z}\|_0 \leq k} \|\mathbf{x} - \mathbf{z}\|_p$$
for some $p$. A typical choice is $p = 2$. It is easy to see that the optimum $\mathbf{z}$ is obtained by restricting $\mathbf{x}$ to its $k$ largest coordinates (in absolute value). The question we ask here is whether we can estimate $\operatorname{err}_2^k(\mathbf{x})$ efficiently in a streaming fashion. For this we use the Count sketch. Recall that by choosing $w = 3/\epsilon^2$ and $d = \Theta(\log n)$ the sketch ensures that with high probability,
$$\forall i \in [n], \quad |\tilde{x}_i - x_i| \leq \epsilon \|\mathbf{x}\|_2.$$
One can in fact show a generalization.
Lemma 4 Count-Sketch with $w = 3k/\epsilon$ and $d = O(\log n)$ ensures that
$$\forall i \in [n], \quad |\tilde{x}_i - x_i| \leq \frac{\epsilon}{\sqrt{k}} \operatorname{err}_2^k(\mathbf{x}).$$
Proof: Let $S = \{i_1, i_2, \ldots, i_k\}$ be the indices of the $k$ largest coordinates of $\mathbf{x}$ and let $\mathbf{x}'$ be obtained from $\mathbf{x}$ by setting the entries with indices in $S$ to zero. Note that $\operatorname{err}_2^k(\mathbf{x}) = \|\mathbf{x}'\|_2$. Fix a coordinate $i$. Consider row $\ell$ and let $Z_\ell = g_\ell(i) C[\ell, h_\ell(i)]$ as before. Let $A_\ell$ be the event that there exists an index $t \in S$ such that $h_\ell(i) = h_\ell(t)$; that is, some "big" coordinate collides with $i$ under $h_\ell$. Note that $\Pr[A_\ell] \leq \sum_{t \in S} \Pr[h_\ell(i) = h_\ell(t)] \leq |S|/w \leq \epsilon/3$ by pair-wise independence of $h_\ell$. Now we estimate
$$\begin{aligned} \Pr\left[|Z_\ell - x_i| \geq \frac{\epsilon}{\sqrt{k}} \operatorname{err}_2^k(\mathbf{x})\right] &= \Pr\left[|Z_\ell - x_i| \geq \frac{\epsilon}{\sqrt{k}} \|\mathbf{x}'\|_2\right] \\ &\leq \Pr[A_\ell] + \Pr\left[|Z_\ell - x_i| \geq \frac{\epsilon}{\sqrt{k}} \|\mathbf{x}'\|_2 \,\Big|\, \neg A_\ell\right] \\ &\leq \Pr[A_\ell] + 1/3 < 1/2. \end{aligned}$$
Now let $\tilde{\mathbf{x}}$ be the approximation to $\mathbf{x}$ obtained from the sketch. We can take the $k$ largest coordinates of $\tilde{\mathbf{x}}$ to form the vector $\mathbf{z}$ and output $\mathbf{z}$. We claim that this gives a good approximation to $\operatorname{err}_2^k(\mathbf{x})$. To see this we prove the following lemma.
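The recovery step is simple to state in code. The illustrative helpers below compute $\operatorname{err}_2^k(\mathbf{x})$ exactly (useful for checking) and form the vector $\mathbf{z}$ of the lemma below from any estimate vector $\mathbf{y}$; the function names are assumptions of this example.

```python
import math

def err2k(x, k):
    """err_2^k(x): l2 norm of x minus its best k-sparse approximation."""
    # drop the k largest |x_i|; the rest is the unavoidable error mass
    rest = sorted(abs(v) for v in x)[:-k] if k else [abs(v) for v in x]
    return math.sqrt(sum(v * v for v in rest))

def topk_recover(y, k):
    """Keep the k largest-|.| coordinates of the estimates y, zero the rest."""
    keep = set(sorted(range(len(y)), key=lambda i: -abs(y[i]))[:k])
    return [y[i] if i in keep else 0.0 for i in range(len(y))]
```

Feeding `topk_recover` the Count-sketch estimates $\tilde{\mathbf{x}}$ yields the $\mathbf{z}$ whose quality the lemma bounds.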
Lemma 5 Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ be such that
$$\|\mathbf{x} - \mathbf{y}\|_\infty \leq \frac{\epsilon}{\sqrt{k}} \operatorname{err}_2^k(\mathbf{x}).$$
Then,
$$\|\mathbf{x} - \mathbf{z}\|_2 \leq (1 + 5\epsilon) \operatorname{err}_2^k(\mathbf{x}),$$
where $\mathbf{z}$ is the vector obtained as follows: $z_i = y_i$ for $i \in T$, where $T$ is the set of the $k$ largest (in absolute value) indices of $\mathbf{y}$, and $z_i = 0$ for $i \notin T$.
Proof: Let $t = \frac{1}{\sqrt{k}} \operatorname{err}_2^k(\mathbf{x})$ to ease the notation. Let $S$ be the index set of the $k$ largest coordinates of $\mathbf{x}$. We have
$$\left(\operatorname{err}_2^k(\mathbf{x})\right)^2 = k t^2 = \sum_{i \in [n] \setminus S} x_i^2 = \sum_{i \in T \setminus S} x_i^2 + \sum_{i \in [n] \setminus (S \cup T)} x_i^2.$$
We write:
$$\begin{aligned} \|\mathbf{x} - \mathbf{z}\|_2^2 &= \sum_{i \in T} |x_i - z_i|^2 + \sum_{i \in S \setminus T} |x_i - z_i|^2 + \sum_{i \in [n] \setminus (S \cup T)} x_i^2 \\ &= \sum_{i \in T} |x_i - y_i|^2 + \sum_{i \in S \setminus T} x_i^2 + \sum_{i \in [n] \setminus (S \cup T)} x_i^2. \end{aligned}$$
We treat each term separately. The first one is easy to bound.
$$\sum_{i \in T} |x_i - y_i|^2 \leq \sum_{i \in T} \epsilon^2 t^2 \leq \epsilon^2 k t^2.$$
The third term is common to $\|\mathbf{x} - \mathbf{z}\|_2^2$ and $\left(\operatorname{err}_2^k(\mathbf{x})\right)^2$. The second term is the one to care about.
Note that $S$ is the set of the $k$ largest coordinates of $\mathbf{x}$ and $T$ is the set of the $k$ largest coordinates of $\mathbf{y}$. Thus $|S \setminus T| = |T \setminus S|$; say their common cardinality is $\ell \geq 1$ (if it is 0 the second term vanishes and there is nothing to do). Since $\mathbf{x}$ and $\mathbf{y}$ are close in the $\ell_\infty$ norm (that is, close in each coordinate), the coordinates in $S \setminus T$ and $T \setminus S$ must have roughly the same values in $\mathbf{x}$. More precisely, let $a = \max_{i \in S \setminus T} |x_i|$ and $b = \min_{i \in T \setminus S} |x_i|$. We leave it as an exercise to the reader to argue that $a \leq b + 2\epsilon t$ since $\|\mathbf{x} - \mathbf{y}\|_\infty \leq \epsilon t$.
Thus,
$$\sum_{i \in S \setminus T} x_i^2 \leq \ell a^2 \leq \ell (b + 2\epsilon t)^2 \leq \ell b^2 + 4\epsilon k t b + 4 k \epsilon^2 t^2.$$
But we have
$$\sum_{i \in T \setminus S} x_i^2 \geq \ell b^2.$$
Putting things together,
$$\begin{aligned} \|\mathbf{x} - \mathbf{z}\|_2^2 &\leq \ell b^2 + 4\epsilon k t b + \sum_{i \in [n] \setminus (S \cup T)} x_i^2 + 5 k \epsilon^2 t^2 \\ &\leq \sum_{i \in T \setminus S} x_i^2 + \sum_{i \in [n] \setminus (S \cup T)} x_i^2 + 4\epsilon \left(\operatorname{err}_2^k(\mathbf{x})\right)^2 + 5 \epsilon^2 \left(\operatorname{err}_2^k(\mathbf{x})\right)^2 \\ &\leq \left(\operatorname{err}_2^k(\mathbf{x})\right)^2 + 9\epsilon \left(\operatorname{err}_2^k(\mathbf{x})\right)^2. \end{aligned}$$
The lemma follows from the fact that for sufficiently small $\epsilon$, $\sqrt{1 + 9\epsilon} \leq 1 + 5\epsilon$.
Bibliographic Notes: Count sketch is by Charikar, Chen and Farach-Colton [1]. CountMin sketch is due to Cormode and Muthukrishnan [2]; see the papers for several applications. Cormode's survey on sketching in [3] has a nice perspective. See [4] for a comparative analysis (theoretical and experimental) of algorithms for finding frequent items. A deterministic variant of CountMin called CR-Precis is interesting; see http://polylogblog.wordpress.com/2009/09/22/bite-sized-streams-cr-precis/ for a blog post with pointers and some comments. The applications are taken from the first chapter of the draft book by McGregor and Muthukrishnan.
The next lecture of Chandra Chekuri's course covers $\ell_0$ sampling and priority sampling.

  1. Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3-15, 2004.
  2. Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58-75, 2005.
  3. Graham Cormode, Minos N. Garofalakis, Peter J. Haas, and Chris Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1-294, 2012.
  4. Graham Cormode and Marios Hadjieleftheriou. Methods for finding frequent items in data streams. VLDB J., 19(1):3-20, 2010.
