Estimating $F_2$ norm, Sketching, Johnson-Lindenstrauss Lemma

You can read the notes from the previous lecture of Chandra Chekuri's course, on estimating $F_k$ norms via AMS sampling, here.

1. $F_2$ Estimation

We have seen a generic algorithm for estimating $F_k$, the $k$'th frequency moment of a stream, using $\tilde{O}\left(n^{1-1/k}\right)$ space for $k \geq 1$. Now we will see an amazingly simple algorithm for $F_2$ estimation due to [1].
AMS-$F_2$-Estimate:
$\mathcal{H}$ is a 4-universal hash family from $[n]$ to $\{-1,1\}$
choose $h$ at random from $\mathcal{H}$
$z \leftarrow 0$
While (stream is not empty) do
  $a_j$ is the current item
  $z \leftarrow z + h(a_j)$
endWhile
Output $z^2$
A conceptually equivalent way to describe the algorithm is the following.
AMS-$F_2$-Estimate:
Let $Y_1, Y_2, \ldots, Y_n$ be $\{-1,+1\}$ random variables that are 4-wise independent
$z \leftarrow 0$
While (stream is not empty) do
  $a_j$ is the current item
  $z \leftarrow z + Y_{a_j}$
endWhile
Output $z^2$
The difference between the two is that the former is streaming friendly. Instead of keeping $Y_1, \ldots, Y_n$ explicitly, we sample $h$ from a 4-wise independent hash family so that $h$ can be stored compactly in $O(\log n)$ space and we can generate $Y_i = h(i)$ in $O(\log n)$ time on the fly. We will analyze the algorithm in the second description.
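To make this concrete, here is a minimal Python sketch of a single copy of the estimator; all names are illustrative. The 4-wise independent hash is simulated by a uniformly random degree-3 polynomial over a large prime field, mapped to $\{-1,+1\}$ by parity; for a large odd prime this introduces only a negligible bias, so treat it as a simplification rather than the exact hash family from the notes.

```python
import random

# A minimal sketch of one AMS-F2 estimator copy (illustrative, not the exact
# hash family from the notes). Items are integers in [n] = {0, ..., n-1}.
_P = (1 << 61) - 1  # a large prime; assumed to exceed the universe size n


class AMSF2Sketch:
    def __init__(self):
        # A uniformly random polynomial of degree < 4 over F_P gives a 4-wise
        # independent hash; only its 4 coefficients (O(log n) bits) are stored.
        self.coeffs = [random.randrange(_P) for _ in range(4)]
        self.z = 0

    def _sign(self, i):
        # Evaluate the polynomial at i (Horner's rule) and map to {-1, +1}
        # via parity; the bias is negligible because _P is huge and odd.
        v = 0
        for c in self.coeffs:
            v = (v * i + c) % _P
        return 1 if v % 2 == 0 else -1

    def update(self, i):
        # Streaming step: z <- z + Y_i where Y_i = h(i).
        self.z += self._sign(i)

    def estimate(self):
        # The output z^2 has expectation F_2 and variance at most 2 F_2^2.
        return self.z * self.z
```

For example, feeding the stream $1, 3, 1, 2$ via repeated calls to `update` and then calling `estimate()` gives one (high-variance) estimate of $F_2 = 2^2 + 1 + 1 = 6$; the averaging and median steps discussed below reduce the error.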
Let $Z = \sum_{i \in [n]} f_i Y_i$ be the random variable that represents the value of $z$ at the end of the stream. Note that for all $i \in [n]$, $\mathbf{E}[Y_i] = 0$ and $\mathbf{E}[Y_i^2] = 1$. Moreover, since $Y_1, \ldots, Y_n$ are 4-wise independent and hence also 2-wise independent, $\mathbf{E}[Y_i Y_{i'}] = 0$ for $i \neq i'$. The expected value of the output is
$$\mathbf{E}\left[Z^2\right] = \sum_{i, i' \in [n]} f_i f_{i'} \mathbf{E}\left[Y_i Y_{i'}\right] = \sum_{i \in [n]} f_i^2 \mathbf{E}\left[Y_i^2\right] + \sum_{i \neq i'} f_i f_{i'} \mathbf{E}\left[Y_i Y_{i'}\right] = \sum_{i \in [n]} f_i^2 = F_2.$$
To bound the variance of the output we also compute $\mathbf{E}\left[Z^4\right]$.
$$\mathbf{E}\left[Z^4\right] = \sum_{i \in [n]} \sum_{j \in [n]} \sum_{k \in [n]} \sum_{\ell \in [n]} f_i f_j f_k f_\ell \, \mathbf{E}\left[Y_i Y_j Y_k Y_\ell\right].$$
Via the 4-wise independence of the $Y$'s we have that $\mathbf{E}\left[Y_i Y_j Y_k Y_\ell\right] = 0$ if some index among $i, j, k, \ell$ occurs exactly once in the multiset $\{i, j, k, \ell\}$; otherwise it is 1. If it is 1 there are two cases: all four indices are the same, or there are two distinct indices each occurring exactly twice (for a fixed pair of distinct indices there are $\binom{4}{2} = 6$ ways to place them among the four positions, which accounts for the factor 6 below). Therefore,
$$\mathbf{E}\left[Z^4\right] = \sum_{i \in [n]} \sum_{j \in [n]} \sum_{k \in [n]} \sum_{\ell \in [n]} f_i f_j f_k f_\ell \, \mathbf{E}\left[Y_i Y_j Y_k Y_\ell\right] = \sum_{i \in [n]} f_i^4 + 6 \sum_{i=1}^{n} \sum_{j=i+1}^{n} f_i^2 f_j^2.$$
Thus, using $F_2^2 = \left(\sum_i f_i^2\right)^2 = F_4 + 2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} f_i^2 f_j^2$, we have
$$\begin{aligned}
\operatorname{Var}\left[Z^2\right] &= \mathbf{E}\left[Z^4\right] - \left(\mathbf{E}\left[Z^2\right]\right)^2 \\
&= F_4 - F_2^2 + 6 \sum_{i=1}^{n} \sum_{j=i+1}^{n} f_i^2 f_j^2 \\
&= F_4 - \left(F_4 + 2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} f_i^2 f_j^2\right) + 6 \sum_{i=1}^{n} \sum_{j=i+1}^{n} f_i^2 f_j^2 \\
&= 4 \sum_{i=1}^{n} \sum_{j=i+1}^{n} f_i^2 f_j^2 \\
&\leq 2 F_2^2.
\end{aligned}$$
Let $X = Z^2$ be the output estimate. We have $\mathbf{E}[X] = F_2$ and $\operatorname{Var}[X] \leq 2 F_2^2 = 2\,\mathbf{E}[X]^2$. We now apply the standard idea of averaging $O(1/\epsilon^2)$ independent estimates to reduce the variance; applying Chebyshev's inequality to the averaged estimator shows that it is an $\epsilon$-approximation with constant probability greater than $1/2$ (say $3/4$). Then we apply the median trick with $O\left(\log \frac{1}{\delta}\right)$ independent averaged estimators to obtain an $(\epsilon, \delta)$-approximation. The overall space requirement is $O\left(\frac{1}{\epsilon^2} \log \frac{1}{\delta} \log n\right)$, and this is also the time to process each element.
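As an illustration of the averaging and median steps (without the exact constants from the analysis), one way to combine independent copies of the basic estimator is sketched below; `AMSF2Sketch` is the illustrative class from the earlier sketch, and the two group-size parameters stand in for the $O(1/\epsilon^2)$ and $O(\log(1/\delta))$ factors.

```python
import statistics


class MedianOfMeansF2:
    """Median-of-means over independent AMS-F2 copies (illustrative sketch)."""

    def __init__(self, copies_per_group, num_groups):
        # copies_per_group ~ O(1/eps^2), num_groups ~ O(log(1/delta)).
        self.groups = [[AMSF2Sketch() for _ in range(copies_per_group)]
                       for _ in range(num_groups)]

    def update(self, i):
        for group in self.groups:
            for sk in group:
                sk.update(i)

    def estimate(self):
        # Average within each group to shrink the variance, then take the
        # median across groups to boost the success probability to 1 - delta.
        means = [sum(sk.estimate() for sk in group) / len(group)
                 for group in self.groups]
        return statistics.median(means)
```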

2. (Linear) Sketching and Streaming with Updates

The $F_2$ estimation algorithm is amazingly simple and has the following interesting properties. Suppose $\sigma_1$ and $\sigma_2$ are two streams and the algorithm computes $z_1$ and $z_2$ on $\sigma_1$ and $\sigma_2$ respectively. It is easy to see that the algorithm on $\sigma = \sigma_1 \cdot \sigma_2$ (the concatenation of the two streams) computes $z_1 + z_2$. Thus the algorithm retains $z$ as a sketch of the stream $\sigma$. Note that the output of the algorithm is not $z$ but some function of $z$ (in the case of $F_2$ estimation it is $z^2$). Moreover, in this special case the sketch is a linear sketch, which we will define more formally later.
Formally, a sketch of a stream $\sigma$ is a data structure $z(\sigma)$ with the property that if $\sigma = \sigma_1 \cdot \sigma_2$, then $z(\sigma)$ can be computed by combining the sketches $z(\sigma_1)$ and $z(\sigma_2)$. Ideally the combining algorithm should take small space as well. Note that the algorithm can post-process the sketch to output the estimator.
The power of sketching algorithms is illustrated by considering more general streaming models than what we have seen so far. We have considered streams of the form $a_1, a_2, \ldots, a_m$ where each $a_i$ is a token, in particular an integer from $[n]$. Now we will consider the following model. We start with an $n$-dimensional vector/signal $\mathbf{x} = (0, 0, \ldots, 0)$ and the stream tokens consist of updates to the coordinates of $\mathbf{x}$. Thus each token is $a_t = (i_t, \Delta_t)$ where $i_t \in [n]$ and $\Delta_t$ is a number (it could be real and negative). The token $a_t$ updates the $i_t$'th coordinate of $\mathbf{x}$:
$$x_{i_t} \leftarrow x_{i_t} + \Delta_t.$$
We will let $\mathbf{x}_t$ be the value of $\mathbf{x}$ after the updates corresponding to $a_1, a_2, \ldots, a_t$.
If the $\Delta_t$'s are allowed to be negative, the model is called the turnstile stream model; note that a negative $\Delta_t$ allows items to be deleted. If $\mathbf{x}_t$ is required to always be non-negative, we have the strict turnstile stream model. A further special case, in which each $\Delta_t$ is required to be positive, is called the cash register model.
Linear sketches are particularly simple and yet very powerful. A linear sketch corresponds to a $k \times n$ projection matrix $M$, and the sketch for a vector $\mathbf{x}$ is simply $M\mathbf{x}$. Composing linear sketches corresponds to simply adding the sketches, since $M\mathbf{x} + M\mathbf{x}' = M(\mathbf{x} + \mathbf{x}')$. In the streaming setting, when we see a token $a_t = (i_t, \Delta_t)$, updating the sketch corresponds to adding $\Delta_t M e_{i_t}$ to the sketch, where $e_{i_t}$ is the vector with 1 in coordinate $i_t$ and 0 everywhere else. To implement the algorithm in small space it suffices to be able to generate the $i$'th column of $M$ efficiently on the fly rather than storing the entire matrix $M$.
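The following short sketch (illustrative names; the matrix is stored explicitly only for clarity) shows how a linear sketch is maintained under turnstile updates: processing a token $(i_t, \Delta_t)$ just adds $\Delta_t$ times column $i_t$ of $M$ to the current sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 64                            # universe size and sketch size (illustrative)
M = rng.choice([-1.0, 1.0], size=(k, n))   # a random +/-1 matrix; a space-efficient
                                           # implementation would generate column i
                                           # on the fly from a hash function

z = np.zeros(k)                            # the linear sketch M @ x, with x initially 0
for (i, delta) in [(3, 2.0), (17, -1.0), (3, 0.5)]:   # turnstile updates (i_t, Delta_t)
    z += delta * M[:, i]                   # add Delta_t * (column i_t of M)

# Combining the sketches of two streams is just vector addition:
# M x + M x' = M (x + x').
```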
$F_2$ estimation as linear sketching: It is not hard to see that the $F_2$ estimation algorithm we have seen is essentially a linear sketching algorithm. Consider the matrix $M$ with $k = O\left(\frac{1}{\epsilon^2} \log \frac{1}{\delta}\right)$ rows where each entry is in $\{-1, 1\}$. The sketch is simply $\mathbf{z} = M\mathbf{x}$. The algorithm post-processes the sketch to output its estimator.
Note that because the sketch is linear it does not matter whether the entries of $\mathbf{x}$ are negative. In fact, it is easy to see this from the analysis as well. In particular, this implies that we can estimate $\|f_\sigma - f_{\sigma'}\|_2$ where $f_\sigma$ and $f_{\sigma'}$ are the frequency vectors of $\sigma$ and $\sigma'$ respectively. Similarly, if $\mathbf{x}, \mathbf{x}'$ are two $n$-dimensional signals representing a time series, then the $\ell_2$ norm of their difference can be estimated by making one pass over the signals, even when the signals are given via a sequence of updates that may be interspersed (of course, we need to know which signal each update belongs to).
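As a rough illustration of this point (using an explicit dense $\pm 1$ matrix rather than the space-efficient hash-based version), the sketches of two streams suffice to estimate the $\ell_2$ distance between their frequency vectors, since $M f_\sigma - M f_{\sigma'} = M(f_\sigma - f_{\sigma'})$ and averaging the squared coordinates of a $\pm 1$ sketch estimates the squared norm in expectation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1000, 400
M = rng.choice([-1.0, 1.0], size=(k, n))            # illustrative +/-1 sketch matrix

f_sigma = rng.integers(0, 5, size=n).astype(float)        # toy frequency vectors
f_sigma_prime = rng.integers(0, 5, size=n).astype(float)

z1, z2 = M @ f_sigma, M @ f_sigma_prime             # sketches computed separately
dist_sq_estimate = np.mean((z1 - z2) ** 2)          # estimates ||f_sigma - f_sigma'||_2^2
dist_sq_exact = np.sum((f_sigma - f_sigma_prime) ** 2)
```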

3. Johnson-Lindenstrauss Lemma and Dimension Reduction in $\ell_2$

The AMS linear sketch for $F_2$ estimation appears magical. One way to understand this is via the dimensionality reduction for $\ell_2$ spaces given by the well-known Johnson-Lindenstrauss Lemma, which has many applications. The JL Lemma can be stated as follows.
Theorem 1 (JL Lemma) Let $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$ be any $n$ points/vectors in $\mathbb{R}^d$. For any $\epsilon \in (0, 1/2)$, there is a linear map $f: \mathbb{R}^d \rightarrow \mathbb{R}^k$ where $k \leq 8 \ln n / \epsilon^2$ such that for all $1 \leq i < j \leq n$,
$$(1-\epsilon)\left\|\mathbf{v}_i - \mathbf{v}_j\right\|_2 \leq \left\|f(\mathbf{v}_i) - f(\mathbf{v}_j)\right\|_2 \leq (1+\epsilon)\left\|\mathbf{v}_i - \mathbf{v}_j\right\|_2.$$
Moreover, $f$ can be obtained in randomized polynomial time.
The implication of the JL Lemma is that any $n$ points in $d$-dimensional Euclidean space can be projected to $O(\ln n / \epsilon^2)$ dimensions while preserving all their pairwise Euclidean distances up to a $(1 \pm \epsilon)$ factor.
The simple randomized algorithm that proves the JL Lemma is the following. Let $M$ be a $k \times d$ matrix where each entry $M_{ij}$ is picked independently from the standard normal distribution $\mathcal{N}(0,1)$. Then the map $f$ is given by $f(\mathbf{v}) = \frac{1}{\sqrt{k}} M \mathbf{v}$. We now sketch why this works.
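Before the analysis, here is a minimal Python sketch of this construction (the constant 8 is taken from the statement of the theorem; the function name and parameters are illustrative).

```python
import numpy as np


def jl_project(points, eps, rng):
    """Project the rows of `points` (n x d) to k = ceil(8 ln n / eps^2) dimensions."""
    n, d = points.shape
    k = int(np.ceil(8.0 * np.log(n) / eps ** 2))
    M = rng.standard_normal((k, d))        # entries drawn i.i.d. from N(0, 1)
    return points @ M.T / np.sqrt(k)       # each row v is mapped to (1/sqrt(k)) M v


# Usage: pairwise l2 distances of the projected points should lie within a
# (1 +/- eps) factor of the original distances with high probability.
rng = np.random.default_rng(0)
pts = rng.standard_normal((100, 5000))
proj = jl_project(pts, eps=0.25, rng=rng)
```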
Lemma 2 Let $Z_1, Z_2, \ldots, Z_k$ be independent $\mathcal{N}(0,1)$ random variables and let $Y = \sum_i Z_i^2$. Then, for $\epsilon \in (0, 1/2)$, there is a constant $c$ such that
$$\operatorname{Pr}\left[(1-\epsilon)^2 k \leq Y \leq (1+\epsilon)^2 k\right] \geq 1 - 2 e^{-c \epsilon^2 k}.$$
In other words, the sum of squares of $k$ standard normal variables is sharply concentrated around its mean, which is $k$. In fact the distribution of $Y$ has a name: the $\chi^2$ distribution with $k$ degrees of freedom. We will not prove the preceding lemma; a proof via the standard Chernoff-type argument can be found in various places.
Assuming the lemma we can prove the following.
Lemma 3 Let $\epsilon \in (0, 1/2)$ and let $\mathbf{v}$ be a unit vector in $\mathbb{R}^d$. Then $(1-\epsilon) \leq \|f(\mathbf{v})\|_2 \leq (1+\epsilon)$ with probability at least $1 - 2 e^{-c \epsilon^2 k}$.
Proof: First we observe a well-known fact about normal distributions. Let $X$ and $Y$ be independent $\mathcal{N}(0,1)$ random variables. Then $aX + bY$ is an $\mathcal{N}\left(0, \sqrt{a^2 + b^2}\right)$ random variable (the second parameter here is the standard deviation, so the variance is $a^2 + b^2$).
Let $\mathbf{u} = \sqrt{k}\, f(\mathbf{v}) = M \mathbf{v}$; note that $\mathbf{u}$ is a random vector with $u_i = \sum_{j=1}^{d} v_j M_{ij}$. Since each $M_{ij}$ is an $\mathcal{N}(0,1)$ random variable and all entries are independent, we have $u_i \sim \mathcal{N}(0,1)$, because the variance of $u_i$ is $\sum_j v_j^2 = 1$ (note that $\mathbf{v}$ is a unit vector). Thus $u_1, u_2, \ldots, u_k$ are independent $\mathcal{N}(0,1)$ random variables and $\|\mathbf{u}\|_2^2 = \sum_{i=1}^{k} u_i^2$. Applying Lemma 2, we have
$$\operatorname{Pr}\left[(1-\epsilon)^2 k \leq \|\mathbf{u}\|_2^2 \leq (1+\epsilon)^2 k\right] \geq 1 - 2 e^{-c \epsilon^2 k}.$$
Since $\|f(\mathbf{v})\|_2 = \|\mathbf{u}\|_2 / \sqrt{k}$, the lemma follows.
Unit-vectors are convenient for the proof but by scaling one obtains the following easy corollary.
Corollary 4 Let $\epsilon \in (0, 1/2)$ and let $\mathbf{v}$ be any vector in $\mathbb{R}^d$. Then $(1-\epsilon)\|\mathbf{v}\|_2 \leq \|f(\mathbf{v})\|_2 \leq (1+\epsilon)\|\mathbf{v}\|_2$ with probability at least $1 - 2 e^{-c \epsilon^2 k}$.
Now the JL Lemma follows easily via a union bound. Let $k = c' \ln n / \epsilon^2$ where $c'$ is chosen based on $c$. Consider any pair $\mathbf{v}_i, \mathbf{v}_j$:
$$\operatorname{Pr}\left[(1-\epsilon)\left\|\mathbf{v}_i - \mathbf{v}_j\right\|_2 \leq \left\|f(\mathbf{v}_i) - f(\mathbf{v}_j)\right\|_2 \leq (1+\epsilon)\left\|\mathbf{v}_i - \mathbf{v}_j\right\|_2\right] \geq 1 - 2 e^{-c \epsilon^2 k} = 1 - 2 e^{-c \epsilon^2 \cdot c' \ln n / \epsilon^2} = 1 - \frac{2}{n^{c c'}}.$$
If $c c' \geq 3$ then the probability that the distance between $\mathbf{v}_i$ and $\mathbf{v}_j$ fails to be preserved to within a relative $\epsilon$-approximation is at most $2/n^3$. Since there are only $n(n-1)/2$ pairs of distances, the probability that all of them are preserved to this error tolerance is, via the union bound, at least $1 - 1/n$.
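As a quick empirical sanity check of this conclusion (purely illustrative; it reuses the `jl_project` helper sketched earlier), one can compare all pairwise distances before and after the projection:

```python
import numpy as np

rng = np.random.default_rng(2)
pts = rng.standard_normal((50, 2000))
proj = jl_project(pts, eps=0.3, rng=rng)    # helper from the earlier sketch


def pairwise_dists(X):
    # All pairwise l2 distances between the rows of X.
    diffs = X[:, None, :] - X[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))


orig = pairwise_dists(pts)
new = pairwise_dists(proj)
iu = np.triu_indices(len(pts), k=1)
ratios = new[iu] / orig[iu]                 # should lie in [1 - eps, 1 + eps] w.h.p.
print(ratios.min(), ratios.max())
```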
Bibliographic Notes: See Chapter 6 of Amit Chakrabarti's lecture notes. A lot of work has been done in the algorithmic community to make the dimensionality reduction faster to evaluate. An interesting result is due to Achlioptas [2], who showed that the matrix $M$, whose entries were chosen from $\mathcal{N}(0,1)$, can in fact have entries chosen from $\{-1, 0, 1\}$; the discrete entries give a sparser matrix and the resulting matrix multiplication is computationally easier.

  1. Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999. Preliminary version in Proc. of ACM STOC 1996.
  2. Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.
