Estimating $F_2$ norm, Sketching, Johnson-Lindenstrauss Lemma

You can read the notes from the previous lecture of Chandra Chekuri's course on Estimating F_2 norm, Sketching, Johnson-Lindenstrauss Lemma here.

1. Sketch for $F_p$ Estimation when $0 < p \leq 2$

We have seen a linear sketch for $F_2$ estimation that uses $O(\log n)$ space. Indyk [1] obtained a technically sophisticated and interesting sketch for $F_p$ estimation where $0 < p \leq 2$ (note that $p$ can be a real number) which uses polylog$(n)$ space. Since the details are rather technical we will only give the high-level approach and refer the reader to the paper and related notes for more details. Note that for $p > 2$ there is a lower bound of $\Omega(n^{1-2/p})$ on the space required.
To describe the sketch for $0 < p \leq 2$ we will revisit the $F_2$ estimate via the JL Lemma approach that uses properties of the normal distribution.
$\underline{F_2\text{-Estimate:}}$
Let $Y_1, Y_2, \ldots, Y_n$ be sampled independently from the $\mathcal{N}(0,1)$ distribution
$z \leftarrow 0$
While (stream is not empty) do
    $(i_j, \Delta_j)$ is current token
    $z \leftarrow z + \Delta_j \cdot Y_{i_j}$
endWhile
Output $z^2$
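As an illustration, here is a small Python sketch of this estimator. It is not part of the original notes: the update format, the explicit storage of $Y_1, \ldots, Y_n$ (which takes $O(n)$ space), and the averaging of repeated sketches to make the concentration visible are all choices made for the example.

```python
import numpy as np

def f2_estimate(stream, n, seed=0):
    """One linear sketch: z = sum_j Delta_j * Y_{i_j}; output z**2.

    `stream` is an iterable of (i, delta) updates with i in {0, ..., n-1}.
    Storing Y explicitly takes O(n) space and is only for illustration.
    """
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal(n)          # Y_1, ..., Y_n ~ N(0, 1), independent
    z = 0.0
    for i, delta in stream:
        z += delta * Y[i]
    return z ** 2

# Frequency vector x = (3, 0, 2), so F_2 = 9 + 0 + 4 = 13.  A single sketch
# is an unbiased estimate; averaging independent sketches shows concentration.
stream = [(0, 1), (2, 1), (0, 2), (2, 1)]
print(np.mean([f2_estimate(stream, n=3, seed=s) for s in range(2000)]))  # ~13
```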
Let $Z = \sum_{i \in [n]} x_i Y_i$ be the random variable that represents the value of $z$ at the end of the stream. The variable $Z$ is a sum of independent normal variables and by the properties of the normal distribution $Z \sim \sqrt{\sum_i x_i^2} \cdot \mathcal{N}(0,1)$. The normal distribution is called 2-stable for this reason. More generally, a distribution $\mathcal{D}$ is said to be $p$-stable if the following property holds: let $Z_1, Z_2, \ldots, Z_n$ be independent random variables distributed according to $\mathcal{D}$. Then $\sum_i x_i Z_i$ has the same distribution as $\|x\|_p Z$ where $Z \sim \mathcal{D}$. Note that a $p$-stable distribution is symmetric around $0$.
It is known that $p$-stable distributions exist for all $p \in (0, 2]$ and do not exist for any $p > 2$. In general, $p$-stable distributions do not have an analytical formula except in a few cases. We have already seen that the standard normal distribution is 2-stable. The 1-stable distribution is the Cauchy distribution, which is the distribution of the ratio of two independent standard normal random variables. The density function of the Cauchy distribution is $\frac{1}{\pi(1+x^2)}$; note that the Cauchy distribution does not have a finite mean or variance. We use $\mathcal{D}_p$ to denote a $p$-stable distribution.
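As a quick sanity check of 1-stability (not part of the original notes; the vector $x$ and the sample size are arbitrary choices), one can compare the empirical quantiles of $\sum_i x_i Z_i$ with those of $\|x\|_1 Z$ for independent standard Cauchy $Z_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([3.0, -1.0, 2.0])                 # arbitrary fixed vector, ||x||_1 = 6
N = 200_000

# sum_i x_i Z_i with Z_i independent standard Cauchy ...
lhs = rng.standard_cauchy((N, x.size)) @ x
# ... should match ||x||_1 * Z with Z standard Cauchy.
rhs = np.sum(np.abs(x)) * rng.standard_cauchy(N)

# Compare quantiles (means and variances are undefined for the Cauchy distribution).
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(lhs, qs))
print(np.quantile(rhs, qs))
```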
Although a general $p$-stable distribution does not have an analytical formula, it is known that one can sample from $\mathcal{D}_p$. The Chambers-Mallows-Stuck method is the following:
  • Sample $\theta$ uniformly from $[-\pi/2, \pi/2]$.
  • Sample $r$ uniformly from $[0, 1]$.
  • Output
$$\frac{\sin(p\theta)}{(\cos\theta)^{1/p}} \left(\frac{\cos((1-p)\theta)}{\ln(1/r)}\right)^{(1-p)/p}.$$
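A direct Python transcription of the Chambers-Mallows-Stuck recipe might look as follows (a sketch, not from the original notes, assuming the formula above yields a standardized symmetric $p$-stable variate; for $p = 1$ it reduces to $\tan\theta$, i.e., a standard Cauchy):

```python
import numpy as np

def sample_p_stable(p, size, rng):
    """Chambers-Mallows-Stuck sampler for a symmetric p-stable variate, 0 < p <= 2."""
    theta = rng.uniform(-np.pi / 2, np.pi / 2, size)   # theta ~ Uniform[-pi/2, pi/2]
    r = rng.uniform(0.0, 1.0, size)                    # r ~ Uniform[0, 1]
    return (np.sin(p * theta) / np.cos(theta) ** (1.0 / p)
            * (np.cos((1.0 - p) * theta) / np.log(1.0 / r)) ** ((1.0 - p) / p))

# Sanity check: for p = 1 the formula reduces to tan(theta), a standard Cauchy,
# whose quartiles are (-1, 0, 1).
rng = np.random.default_rng(1)
print(np.quantile(sample_p_stable(1.0, 100_000, rng), [0.25, 0.5, 0.75]))
```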
We need one more definition.
Definition 1 The median of a distribution $\mathcal{D}$ is $\mu$ if for $Y \sim \mathcal{D}$, $\operatorname{Pr}[Y \leq \mu] = 1/2$. If $\phi(x)$ is the probability density function of $\mathcal{D}$ then we have $\int_{-\infty}^{\mu} \phi(x)\, dx = 1/2$.
Note that a median may not be uniquely defined for a distribution. The distribution $\mathcal{D}_p$ has a unique median and so we will use the notation $\operatorname{median}(\mathcal{D}_p)$ to denote this quantity. For a distribution $\mathcal{D}$ we will use $|\mathcal{D}|$ to denote the distribution of the absolute value of a random variable drawn from $\mathcal{D}$. If $\phi(x)$ is the density function of $\mathcal{D}$ then the density function of $|\mathcal{D}|$ is given by $\psi$, where $\psi(x) = 2\phi(x)$ if $x \geq 0$ and $\psi(x) = 0$ if $x < 0$.
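As a concrete example (relevant to the $F_2$ case), the median of $|\mathcal{N}(0,1)|$ can be read off from the standard normal CDF $\Phi$:
$$\operatorname{Pr}[\,|Z| \leq \mu\,] = 2\Phi(\mu) - 1 = \frac{1}{2} \;\Longrightarrow\; \Phi(\mu) = \frac{3}{4} \;\Longrightarrow\; \operatorname{median}(|\mathcal{N}(0,1)|) = \Phi^{-1}(3/4) \approx 0.6745.$$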
$\underline{F_p\text{-Estimate:}}$
$k \leftarrow \Theta\left(\frac{1}{\epsilon^2} \log \frac{1}{\delta}\right)$
Let $M$ be a $k \times n$ matrix where each $M_{ij} \sim \mathcal{D}_p$
$\mathbf{y} \leftarrow M\mathbf{x}$
Output $Y \leftarrow \dfrac{\operatorname{median}(|y_1|, |y_2|, \ldots, |y_k|)}{\operatorname{median}(|\mathcal{D}_p|)}$
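The following Python sketch puts the pieces together. It is illustrative and not from the original notes: the matrix $M$ is stored explicitly, and $\operatorname{median}(|\mathcal{D}_p|)$ is estimated empirically since it is not known analytically in general (the overall scale of the sampler cancels in the ratio).

```python
import numpy as np

def cms(p, size, rng):
    # Chambers-Mallows-Stuck sampler (same formula as in the earlier sketch)
    th = rng.uniform(-np.pi / 2, np.pi / 2, size)
    r = rng.uniform(0.0, 1.0, size)
    return (np.sin(p * th) / np.cos(th) ** (1 / p)
            * (np.cos((1 - p) * th) / np.log(1 / r)) ** ((1 - p) / p))

def fp_estimate(stream, n, p, k, seed=0):
    """Estimate ||x||_p (F_p is its p-th power) from (i, delta) updates."""
    rng = np.random.default_rng(seed)
    M = cms(p, (k, n), rng)                  # k x n matrix with entries ~ D_p
    y = np.zeros(k)
    for i, delta in stream:                  # maintains y = Mx under updates
        y += delta * M[:, i]
    med_Dp = np.median(np.abs(cms(p, 200_000, rng)))   # empirical median(|D_p|)
    return np.median(np.abs(y)) / med_Dp

# Example: x = (3, 0, -2), so ||x||_1 = 5.
stream = [(0, 3), (2, -2)]
print(fp_estimate(stream, n=3, p=1.0, k=400))   # roughly 5
```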
By the $p$-stability property we see that each $y_i \sim \|x\|_p Y$ where $Y \sim \mathcal{D}_p$. First, consider the case that $k = 1$. Then the output $|y_1| / \operatorname{median}(|\mathcal{D}_p|)$ is distributed according to $c\,|\mathcal{D}_p|$ where $c = \|x\|_p / \operatorname{median}(|\mathcal{D}_p|)$. It is not hard to verify that the median of this distribution is $\|x\|_p$. Thus, the algorithm takes $k$ samples from this distribution and outputs the sample median as the estimator. The lemma below shows that the sample median has good concentration properties.
Lemma 1 Let $\epsilon > 0$ and let $\mathcal{D}$ be a distribution with density function $\phi$ and a unique median $\mu > 0$. Suppose $\phi$ is absolutely continuous on $[(1-\epsilon)\mu, (1+\epsilon)\mu]$ and let $\alpha = \min\{\phi(x) \mid x \in [(1-\epsilon)\mu, (1+\epsilon)\mu]\}$. Let $Y = \operatorname{median}(Y_1, Y_2, \ldots, Y_k)$ where $Y_1, \ldots, Y_k$ are independent samples from the distribution $\mathcal{D}$. Then
$$\operatorname{Pr}[|Y - \mu| \geq \epsilon\mu] \leq 2 e^{-\frac{2}{3}\epsilon^2 \mu^2 \alpha^2 k}.$$
We sketch the proof to upper bound $\operatorname{Pr}[Y \leq (1-\epsilon)\mu]$. The other direction is similar. Note that by the definition of the median, $\operatorname{Pr}[Y_j \leq \mu] = 1/2$. Hence
$$\operatorname{Pr}[Y_j \leq (1-\epsilon)\mu] = 1/2 - \int_{(1-\epsilon)\mu}^{\mu} \phi(x)\, dx.$$
Let $\gamma = \int_{(1-\epsilon)\mu}^{\mu} \phi(x)\, dx$. It is easy to see that $\gamma \geq \alpha\epsilon\mu$.
Let $I_j$ be the indicator of the event $Y_j \leq (1-\epsilon)\mu$; we have $\mathbf{E}[I_j] = \operatorname{Pr}[Y_j \leq (1-\epsilon)\mu] = 1/2 - \gamma$. Let $I = \sum_j I_j$; we have $\mathbf{E}[I] = k(1/2 - \gamma)$. Since $Y$ is the median of $Y_1, Y_2, \ldots, Y_k$, the event $Y \leq (1-\epsilon)\mu$ happens only if more than $k/2$ of the $I_j$ are true, that is, only if $I > (1+\delta)\mathbf{E}[I]$ where $1+\delta = \frac{1}{1-2\gamma}$. Now, via Chernoff bounds, this probability is at most $e^{-\gamma^2 k/3}$ for sufficiently small $\gamma$.
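For completeness (this algebra is not spelled out in the original notes), the Chernoff step is the standard multiplicative bound, valid for $\delta \leq 1$, i.e., $\gamma \leq 1/4$:
$$\operatorname{Pr}[I > (1+\delta)\mathbf{E}[I]] \leq e^{-\delta^2 \mathbf{E}[I]/3}, \qquad \text{where } \delta = \frac{2\gamma}{1-2\gamma} \text{ and } \mathbf{E}[I] = \frac{k(1-2\gamma)}{2},$$
so the exponent is $\frac{\delta^2 \mathbf{E}[I]}{3} = \frac{2\gamma^2 k}{3(1-2\gamma)} \geq \frac{2}{3}\gamma^2 k \geq \frac{2}{3}\epsilon^2\mu^2\alpha^2 k$ (using $\gamma \geq \alpha\epsilon\mu$), which matches the exponent in Lemma 1.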
We can now apply the lemma to the estimator output by the algorithm. We let $\phi$ be the density of the distribution $c|\mathcal{D}_p|$. Recall that the median of this distribution is $\|x\|_p$ and the output of the algorithm is the median of $k$ independent samples from this distribution. Thus, from the lemma,
$$\operatorname{Pr}\left[\left|Y - \|x\|_p\right| \geq \epsilon \|x\|_p\right] \leq 2 e^{-\epsilon^2 k \mu^2 \alpha^2 / 3}.$$
Let $\phi'$ be the density of $|\mathcal{D}_p|$ and let $\mu'$ be the median of $|\mathcal{D}_p|$. Then it can be seen that $\mu\alpha = \mu'\alpha'$ where $\alpha' = \min\{\phi'(x) \mid (1-\epsilon)\mu' \leq x \leq (1+\epsilon)\mu'\}$. Thus $\mu'\alpha'$ depends only on $\mathcal{D}_p$ and $\epsilon$. Letting this quantity be $c_{p,\epsilon}$ we have
$$\operatorname{Pr}\left[\left|Y - \|x\|_p\right| \geq \epsilon \|x\|_p\right] \leq 2 e^{-\epsilon^2 k c_{p,\epsilon}^2 / 3} \leq \delta,$$
provided $k = \Omega\left(\frac{1}{c_{p,\epsilon}^2} \cdot \frac{1}{\epsilon^2} \log \frac{1}{\delta}\right)$.
Technical Issues: There are several technical issues that need to be addressed to obtain a proper algorithm from the preceding description. First, the algorithm as described requires one to store the entire matrix $M$, which is too large for streaming applications. Second, setting $k$ requires knowing the constant $c_{p,\epsilon}$, which is not explicitly known since $\mathcal{D}_p$ is not well understood for general $p$. To obtain a streaming algorithm, the very high-level idea is to derandomize the algorithm via the use of pseudorandom generators for small space due to Nisan. See [1] for more details.

2. Counting Frequent Items

We have seen various algorithms for estimating the $F_p$ norms for $p \geq 0$. Note that $F_0$ corresponds to the number of distinct elements. In the limit, as $p \rightarrow \infty$, the $\ell_p$ norm of a vector $\mathbf{x}$ tends to the maximum of the absolute values of the entries of $\mathbf{x}$. Thus, we can define the $F_\infty$ norm to correspond to the maximum frequency in $\mathbf{x}$. More generally, we would like to find the frequent items in a stream, which are also called "heavy hitters". In general, it is not feasible to estimate the maximum frequency with limited space if it is too small relative to $m$.

2.1. Misra-Gries algorithm for frequent items

Suppose we have a stream $\sigma = a_1, a_2, \ldots, a_m$ where $a_j \in [n]$ (the simple setting), and we want to find all elements $i \in [n]$ such that $f_i > m/k$. Note that there can be fewer than $k$ such elements. The simplest case is $k = 2$, when we want to know whether there is a "majority" element. There is a simple deterministic algorithm for $k = 2$ that perhaps you have all seen in an algorithms class. The algorithm uses an associative array data structure of size $k$.
$\underline{\text{MisraGries}(k)}$:
$D$ is an empty associative array
While (stream is not empty) do
    $a_j$ is current item
    If ($a_j$ is in $\textit{keys}(D)$)
        $D[a_j] \leftarrow D[a_j] + 1$
    Else if ($|\textit{keys}(D)| < k-1$) then
        $D[a_j] \leftarrow 1$
    Else
        for each $\ell \in \textit{keys}(D)$ do
            $D[\ell] \leftarrow D[\ell] - 1$
        Remove elements from $D$ whose counter values are $0$
endWhile
For each $i \in \textit{keys}(D)$ set $\hat{f}_i = D[i]$
For each $i \notin \textit{keys}(D)$ set $\hat{f}_i = 0$
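A direct Python transcription of the pseudocode (illustrative; a dictionary plays the role of the associative array $D$):

```python
def misra_gries(stream, k):
    """Misra-Gries summary using at most k - 1 counters.

    With f_hat[i] = D.get(i, 0), Lemma 2 below gives f_i - m/k <= f_hat[i] <= f_i.
    """
    D = {}
    for a in stream:
        if a in D:
            D[a] += 1
        elif len(D) < k - 1:
            D[a] = 1
        else:
            for key in list(D):            # decrement every counter
                D[key] -= 1
                if D[key] == 0:            # drop counters that hit zero
                    del D[key]
    return D

# Example: m = 10 and k = 3, so any element with f_i > 10/3 must survive in D.
stream = [1, 2, 1, 3, 1, 4, 1, 5, 1, 2]    # f_1 = 5 > 10/3
print(misra_gries(stream, k=3))            # 1 is among the keys
```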
We leave the following as an exercise to the reader.
Lemma 2 For each $i \in [n]$:
$$f_i - \frac{m}{k} \leq \hat{f}_i \leq f_i.$$
The lemma implies that if $f_i > m/k$ then $i \in \textit{keys}(D)$ at the end of the algorithm. Thus one can use a second pass over the data to compute the exact $f_i$ only for the (at most $k$) items in $\textit{keys}(D)$. This gives an $O(kn)$ time two-pass algorithm for finding all items which have frequency greater than $m/k$.
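A minimal two-pass version built on the `misra_gries` sketch above (illustrative; it assumes the stream can be read a second time, e.g., it is stored or can be replayed):

```python
from collections import Counter

def heavy_hitters_two_pass(stream, k):
    """Return exactly the elements with f_i > m/k, using two passes."""
    candidates = set(misra_gries(stream, k))                 # pass 1: O(k) candidates
    exact = Counter(a for a in stream if a in candidates)    # pass 2: exact counts
    m = len(stream)
    return {i: f for i, f in exact.items() if f > m / k}

stream = [1, 2, 1, 3, 1, 4, 1, 5, 1, 2]
print(heavy_hitters_two_pass(stream, k=3))    # {1: 5}
```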
Bibliographic Notes: For more details on $F_p$ estimation when $0 < p \leq 2$ see the original paper of Indyk [1], the notes of Amit Chakrabarti (Chapter 7), and Lecture 4 of Jelani Nelson's course.
You can read the notes from the next lecture of Chandra Chekuri's course on Count and Count-Min Sketches here.

  1. Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307-323, 2006.
