You can read the notes from the previous lecture of Chandra Chekuri's course on Count and Count-Min Sketches here.
1. Priority Sampling and Sum Queries
Suppose we have a stream $a_1, a_2, \ldots, a_n$ (yes, we are changing notation here from $m$ to $n$ for the length of the stream) of objects, where each $a_i$ has a non-negative weight $w_i$. We want to store a representative sample $S \subset [n]$ of the items so that we can answer subset sum queries. That is, given a query $I \subseteq [n]$ we would like to answer $\sum_{i \in I} w_i$. One way to do this is as follows. Sample each $i \in [n]$ independently with probability $p_i$, and if $i$ is chosen set a scaled weight $\hat{w}_i = w_i / p_i$. Now, given a query $I$, we output the estimate $\sum_{i \in I \cap S} \hat{w}_i$ for its weight. Note that the expectation of the estimate is equal to $w(I)$. The main disadvantage of this scheme is that we cannot control the size of the sample, which means we cannot fully utilize the memory available. Relatedly, if we do not know the length of the stream a priori, we cannot choose the sampling rates even if we are willing to be flexible about the size of the memory. An elegant scheme of Duffield, Lund and Thorup, called priority sampling, overcomes these limitations. They considered the setting where we are given a parameter $k$ for the size of the sample $S$, and the goal is to maintain a $k$-sample $S$ along with weights $\hat{w}_i$ for $i \in S$ such that we can answer subset sum queries.
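To make the naive scheme concrete, here is a minimal Python sketch (the function names and data layout are our own illustrative choices; the scheme itself is the independent sampling just described):

```python
import random

def independent_sample(weights, probs, rng=random):
    """Keep item i independently with probability probs[i]; a kept item
    gets the scaled weight w_i / p_i, so it contributes w_i in expectation."""
    return {i: w / p
            for i, (w, p) in enumerate(zip(weights, probs))
            if rng.random() < p}

def subset_sum_estimate(sample, I):
    """Estimate w(I) by summing the scaled weights of sampled items in I."""
    return sum(w_hat for i, w_hat in sample.items() if i in I)
```

With $p_i = 1$ for all $i$ the sample is the whole stream and the estimate is exact; smaller $p_i$ save space, but the sample size is only controlled in expectation, which is exactly the drawback noted above.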
Their scheme is the following, described as if $a_1, a_2, \ldots, a_n$ are available offline.
For each $i \in [n]$ set priority $q_i = w_i / u_i$, where $u_i$ is chosen uniformly at random from $[0,1]$ (independently of the other items).
$S$ is the set of items with the $k$ highest priorities.
$\tau$ is the $(k+1)$'st highest priority. If $k \geq n$ we set $\tau = 0$.
If $i \in S$, set $\hat{w}_i = \max\{w_i, \tau\}$; else set $\hat{w}_i = 0$.
We observe that the above sampling can be implemented in the streaming setting by simply maintaining the current sample $S$ and the current threshold $\tau$. We leave it as an exercise to show that this information can be updated when a new item arrives.
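As a sanity check on the exercise, one possible streaming implementation (our own sketch, with hypothetical names) keeps the current sample in a min-heap keyed by priority, and maintains $\tau$ as the largest priority ever rejected or evicted — which is exactly the $(k+1)$'st highest priority seen so far:

```python
import heapq
import random

class PrioritySampler:
    """Streaming priority sampling: maintain the k items of highest
    priority q_i = w_i / u_i and the threshold tau."""

    def __init__(self, k, seed=None):
        self.k = k
        self.rng = random.Random(seed)
        self.heap = []   # min-heap of (priority, item_id, weight), size <= k
        self.tau = 0.0   # (k+1)-st highest priority so far; 0 if < k+1 items seen

    def process(self, item_id, weight):
        u = 1.0 - self.rng.random()      # uniform in (0, 1], avoids division by 0
        q = weight / u                    # priority q_i = w_i / u_i
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (q, item_id, weight))
        elif q > self.heap[0][0]:
            # new item enters the sample; the evicted priority leaves the top k
            evicted = heapq.heapreplace(self.heap, (q, item_id, weight))
            self.tau = max(self.tau, evicted[0])
        else:
            # rejected priority is outside the top k
            self.tau = max(self.tau, q)

    def sample(self):
        """Return {item_id: w_hat} with w_hat_i = max(w_i, tau)."""
        return {i: max(w, self.tau) for (_, i, w) in self.heap}
```

Note that $\tau$ equals the maximum priority ever pushed out of (or kept out of) the heap, which is the $(k+1)$'st highest priority overall.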
We show some nice and non-obvious properties of priority sampling. We will assume for simplicity that $1 < k < n$. The first one is the basic one that we would want: the scaled weights are unbiased.
Lemma 1 For each $i \in [n]$, $\mathbf{E}[\hat{w}_i] = w_i$.
Proof: Fix $i$. Let $A(\tau')$ be the event that the $k$'th highest priority among the items $j \neq i$ is $\tau'$. Note that $i \in S$ iff $q_i = w_i / u_i \geq \tau'$, and if $i \in S$ then $\hat{w}_i = \max\{w_i, \tau'\}$, otherwise $\hat{w}_i = 0$. To evaluate $\operatorname{Pr}[i \in S \mid A(\tau')]$ we consider two cases.
Case 1: $w_i \geq \tau'$. Here we have $\operatorname{Pr}[i \in S \mid A(\tau')] = 1$ and $\hat{w}_i = w_i$.
Case 2: $w_i < \tau'$. Then $\operatorname{Pr}[i \in S \mid A(\tau')] = \frac{w_i}{\tau'}$ and, conditioned on $i \in S$, $\hat{w}_i = \tau'$.
In both cases $\mathbf{E}[\hat{w}_i \mid A(\tau')] = w_i$; since this holds for every $\tau'$, we conclude $\mathbf{E}[\hat{w}_i] = w_i$.
The previous claim shows that the estimator $\sum_{i \in I \cap S} \hat{w}_i$ has expectation equal to $w(I)$. We can also estimate the variance of $\hat{w}_i$ via the threshold $\tau$.
Lemma 2 $\mathbf{Var}[\hat{w}_i] = \mathbf{E}[\hat{v}_i]$ where
$$\hat{v}_i = \begin{cases} \tau \max\{0, \tau - w_i\} & \text{if } i \in S \\ 0 & \text{if } i \notin S \end{cases}$$
Proof: Fix $i$. We define $A(\tau')$ to be the event that $\tau'$ is the $k$'th highest priority among the items $j \neq i$. The proof is based on showing that $\mathbf{Var}[\hat{w}_i \mid A(\tau')] = \mathbf{E}[\hat{v}_i \mid A(\tau')]$ for every $\tau'$. If $w_i \geq \tau'$ then $\hat{w}_i = w_i$ deterministically, so the conditional variance is $0$, and also $\hat{v}_i = \tau' \max\{0, \tau' - w_i\} = 0$. If $w_i < \tau'$ then $\mathbf{E}[\hat{w}_i^2 \mid A(\tau')] = \frac{w_i}{\tau'} (\tau')^2 = w_i \tau'$, so the conditional variance is $w_i \tau' - w_i^2$; on the other hand $\mathbf{E}[\hat{v}_i \mid A(\tau')] = \frac{w_i}{\tau'} \cdot \tau'(\tau' - w_i) = w_i \tau' - w_i^2$. Since $\mathbf{E}[\hat{w}_i \mid A(\tau')] = w_i$ for every $\tau'$, averaging over $\tau'$ gives $\mathbf{Var}[\hat{w}_i] = \mathbf{E}[\hat{v}_i]$.
Lemma 3 For $i \neq j$, $\mathbf{E}[\hat{w}_i \hat{w}_j] = w_i w_j$; that is, the estimates $\hat{w}_i$ and $\hat{w}_j$ are uncorrelated.
In fact the previous lemma is a special case of a more general lemma below.
Lemma 4 $\mathbf{E}[\prod_{i \in I} \hat{w}_i] = \prod_{i \in I} w_i$ if $|I| \leq k$, and the product is $0$ if $|I| > k$.
Proof: It is easy to see that if $|I| > k$ the product is $0$, since at least one item of $I$ is not in the sample. We now assume $|I| \leq k$ and prove the desired claim by induction on $|I|$. In fact we need a stronger hypothesis. Let $A(\tau'')$ be the event that $\tau''$ is the $(k - |I| + 1)$'th highest priority among the items $j \notin I$. We will condition on $A(\tau'')$ and prove that $\mathbf{E}[\prod_{i \in I} \hat{w}_i \mid A(\tau'')] = \prod_{i \in I} w_i$. For the base case $|I| = 1$ we have already seen the proof.
Case 1: There is $h \in I$ such that $w_h > \tau''$. Then clearly $h \in S$ and $\hat{w}_h = w_h$. In this case
$$\mathbf{E}\Big[\prod_{i \in I} \hat{w}_i \,\Big|\, A(\tau'')\Big] = w_h \cdot \mathbf{E}\Big[\prod_{i \in I \setminus \{h\}} \hat{w}_i \,\Big|\, A(\tau'')\Big],$$
and we apply induction to $I \setminus \{h\}$. Technically, the term $\mathbf{E}[\prod_{i \in I \setminus \{h\}} \hat{w}_i \mid A(\tau'')]$ refers to $\tau''$ being the $(k - |I'| + 1)$'st highest priority, where $I' = I \setminus \{h\}$.
Case 2: For all $h \in I$, $w_h < \tau''$. Let $q$ be the minimum priority among the items in $I$. If $q < \tau''$ then $\hat{w}_j = 0$ for some $j \in I$ and the entire product is $0$; in this case there is no contribution to the expectation. Thus we consider the case $q \geq \tau''$. The probability of this event is $\prod_{i \in I} \frac{w_i}{\tau''}$. But in this case all $i \in I$ will be in $S$, and moreover $\hat{w}_i = \tau''$ for each $i \in I$. Thus $\mathbf{E}[\prod_{i \in I} \hat{w}_i \mid A(\tau'')] = \prod_{i \in I} \frac{w_i}{\tau''} \cdot (\tau'')^{|I|} = \prod_{i \in I} w_i$, as desired.
Combining Lemmas 2 and 3, the variance of the estimator $\sum_{i \in I \cap S} \hat{w}_i$ is
$$\mathbf{Var}\Big[\sum_{i \in I \cap S} \hat{w}_i\Big] = \sum_{i \in I} \mathbf{Var}[\hat{w}_i] = \sum_{i \in I} \mathbf{E}[\hat{v}_i].$$
The advantage of this is that the variance of the estimator can itself be estimated from the sample, by examining $\tau$ and the weights of the elements in $S \cap I$.
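Concretely, assuming the sample is stored as a map from item id to its original weight $w_i$ together with $\tau$ (our own hypothetical layout, not from the notes), both the subset-sum estimate and its variance estimate fall out of one pass over $S \cap I$:

```python
def estimate_with_variance(sampled_weights, tau, I):
    """Given sampled_weights = {i: w_i for i in S} and the threshold tau,
    return the subset-sum estimate sum of w_hat_i over i in I ∩ S and the
    variance estimate sum of v_hat_i = tau * max(0, tau - w_i) over the
    same set (Lemma 2)."""
    est = sum(max(w, tau)
              for i, w in sampled_weights.items() if i in I)
    var_est = sum(tau * max(0.0, tau - w)
                  for i, w in sampled_weights.items() if i in I)
    return est, var_est
```

When $\tau = 0$ (fewer than $k+1$ items seen) the estimate is exact and the variance estimate is $0$, as expected.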
2. $\ell_0$ Sampling
We have seen $\ell_2$ sampling in the streaming setting. The ideas generalize to $\ell_p$ sampling for $p \in (0, 2)$ – see [] for instance. However, $\ell_0$ sampling requires slightly different ideas. $\ell_0$ sampling means that we are sampling near-uniformly from the distinct elements in the stream. Surprisingly, we can do this even in the turnstile setting.
Recall that one of the applications we saw for the Count-Sketch is $\ell_2$-sparse recovery. In particular, we can obtain a $(1+\epsilon)$-approximation for $\mathrm{err}_2^k(\mathbf{x})$ with high probability using $O(k \log n / \epsilon)$ words. If $\mathbf{x}$ is $k$-sparse then $\mathrm{err}_2^k(\mathbf{x}) = 0$! This means that we can detect whether $\mathbf{x}$ is $k$-sparse, and in fact identify the non-zero coordinates of $\mathbf{x}$, with high probability. In fact one can prove the following stronger version.
Lemma 5 For $1 \leq k \leq n$ and $k' = O(k)$ there is a sketch $L: \mathbb{R}^n \rightarrow \mathbb{R}^{k'}$ (generated from $O(k \log n)$ random bits) and a recovery procedure that, on input $L(\mathbf{x})$, has the following features: (i) if $\mathbf{x}$ is $k$-sparse then it outputs $\mathbf{x}' = \mathbf{x}$ with probability $1$, and (ii) if $\mathbf{x}$ is not $k$-sparse the algorithm detects this with high probability.
We will use the above for $\ell_0$ sampling as follows. We first describe a high-level algorithm that is not streaming friendly, and will indicate later how it can be implemented in the streaming setting.
For $h = 1, \ldots, \lfloor \log n \rfloor$ let $I_h$ be a random subset of $[n]$ of cardinality $2^h$. Let $I_0 = [n]$.
Let $k = \lceil 4 \log(1/\delta) \rceil$. For $h = 0, \ldots, \lfloor \log n \rfloor$, run $k$-sparse-recovery on $\mathbf{x}$ restricted to the coordinates of $I_h$.
If any of the sparse recoveries succeeds, output a random coordinate from the first sparse recovery that succeeds.
The algorithm fails if none of the sparse recoveries outputs a valid vector.
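The steps above can be simulated as follows. Note that `sparse_recover` is an idealized stand-in for the sketch-based recovery of Lemma 5 (it inspects the vector directly), so this is only a sketch of the logic under our own naming, not a streaming implementation; whether $\log$ is base $2$ or natural in the choice of $k$ only changes constants.

```python
import math
import random

def sparse_recover(x_restricted, k):
    """Idealized stand-in for k-sparse recovery: succeeds (returning the
    non-zero coordinates) exactly when the restricted vector is k-sparse."""
    nonzero = {i: v for i, v in x_restricted.items() if v != 0}
    return nonzero if len(nonzero) <= k else None

def l0_sample(x, n, delta, rng):
    """High-level l0 sampler: x is a dict of non-zero coordinates of a
    length-n vector. Try I_0 = [n] and random subsets of sizes 2, 4, ...;
    output a random non-zero coordinate from the first level whose
    restriction is k-sparse, or None on failure."""
    k = math.ceil(4 * math.log(1 / delta))
    levels = [set(range(n))]                      # I_0 = [n]
    for h in range(1, int(math.log2(n)) + 1):
        levels.append(set(rng.sample(range(n), 2 ** h)))
    for I_h in levels:
        rec = sparse_recover({i: x.get(i, 0) for i in I_h}, k)
        if rec:  # recovery succeeded and found a non-zero coordinate
            return rng.choice(sorted(rec))
    return None  # all levels failed
```

When the number of non-zero coordinates is at most $k$, the level $h = 0$ always succeeds; otherwise some intermediate level is likely to isolate between $1$ and $k$ non-zero coordinates, as argued below.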
Let $J$ be the index set of the non-zero coordinates of $\mathbf{x}$. We now show that with probability $(1-\delta)$ the algorithm succeeds in outputting a uniform sample from $J$. Suppose $|J| \leq k$. Then $\mathbf{x}$ is recovered exactly for $h = 0$ and the algorithm outputs a uniform sample from $J$. Suppose $|J| > k$. We observe that $\mathbf{E}[|I_h \cap J|] = 2^h |J| / n$, and hence there is an $h^*$ such that $\mathbf{E}[|I_{h^*} \cap J|] = 2^{h^*} |J| / n$ is between $k/3$ and $2k/3$. By Chernoff bounds one can show that with probability at least $(1-\delta)$ we have $1 \leq |I_{h^*} \cap J| \leq k$. For this $h^*$ the sparse recovery will succeed and output a random coordinate of $J$. The formal claims are the following:
With probability at least $(1-\delta)$ the algorithm outputs a coordinate $i \in [n]$.
If the algorithm outputs a coordinate $i$, the only way it can fail to be a uniform random sample from $J$ is if the sparse recovery algorithm fails for some $h$; we can make this probability less than $1/n^c$ for any desired constant $c$.
Thus we in fact get a zero-error $\ell_0$ sample.
The algorithm, as described, requires one to sample and store $I_h$ for $h = 0, \ldots, \lfloor \log n \rfloor$. To avoid this we can use Nisan's pseudorandom generator for small-space computation. We skip the details of this; see [1]. The overall space requirement of the above procedure can be shown to be $O(\log^2 n \log(1/\delta))$ with an error probability bounded by $\delta + O(1/n^c)$. This is near-optimal for constant $\delta$, as shown in [1].
Bibliographic Notes: The material on priority sampling is directly from [2], which describes applications, the relationship to prior sampling techniques, and also contains an experimental evaluation. Priority sampling has been shown to be "optimal" in a strong sense; see [3].
The $\ell_0$ sampling algorithm we described is from the paper by Jowhari, Saglam and Tardos [1]. A simpler algorithm appears in the chapter on signals in the McGregor-Muthu draft book.
You can read the notes from the next lecture of Chandra Chekuri's course on Quantiles and Selection in Multiple Passes here.
Hossein Jowhari, Mert Saglam, and Gábor Tardos. Tight bounds for Lp samplers, finding duplicates in streams, and related problems. CoRR, abs/1012.4889, 2010.
Nick Duffield, Carsten Lund, and Mikkel Thorup. Priority sampling for estimation of arbitrary subset sums. Journal of the ACM (JACM), 54(6):32, 2007.
Mario Szegedy. The DLT priority sampling is essentially optimal. In Proceedings of the thirty-eighth annual ACM Symposium on Theory of Computing, pages 150-158. ACM, 2006.