You can read the notes from the previous lecture of Chandra Chekuri's course, on estimating the $F_2$ norm, sketching, and the Johnson-Lindenstrauss Lemma, here.
1. Sketch for $F_p$ Estimation when $0 < p \leq 2$
We have seen a linear sketch for $F_2$ estimation that uses $O(\log n)$ space. Indyk [1] obtained a technically sophisticated and interesting sketch for $F_p$ estimation when $0 < p \leq 2$ (note that $p$ can be any real number in this range) which uses $\mathrm{polylog}(n)$ space. Since the details are rather technical we will only give the high-level approach and refer the reader to the paper and related notes for more details. Note that for $p > 2$ there is a lower bound of $\Omega(n^{1-2/p})$ on the space required.
To describe the sketch for $0 < p \leq 2$ we will revisit the $F_2$ estimate via the JL Lemma approach, which uses properties of the normal distribution.
Let $Y_1, Y_2, \ldots, Y_n$ be sampled independently from the $\mathcal{N}(0,1)$ distribution
$z \leftarrow 0$
While (stream is not empty) do
$\quad (i_j, \Delta_j)$ is the current token
$\quad z \leftarrow z + \Delta_j \cdot Y_{i_j}$
endWhile
Output $z^2$
Let $Z = \sum_{i \in [n]} x_i Y_i$ be the random variable that represents the value of $z$ at the end of the stream. The variable $Z$ is a sum of independent normal variables, and by the properties of the normal distribution $Z \sim \sqrt{\sum_i x_i^2} \cdot \mathcal{N}(0,1)$. The normal distribution is called 2-stable for this reason. More generally, a distribution $\mathcal{D}$ is said to be $p$-stable if the following property holds: let $Z_1, Z_2, \ldots, Z_n$ be independent random variables distributed according to $\mathcal{D}$; then $\sum_i x_i Z_i$ has the same distribution as $\|x\|_p Z$ where $Z \sim \mathcal{D}$. Note that a $p$-stable distribution will be symmetric around $0$.
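As an illustration, here is a minimal Python simulation of the sketch above (the update format and 0-indexed coordinates are assumptions of this sketch; a true streaming implementation would run independent copies in parallel rather than re-reading the stream):

```python
import random

def f2_sketch_estimate(updates, n, trials=200):
    """Simulate the one-counter normal sketch: by 2-stability,
    z = sum_i x_i * Y_i ~ sqrt(F_2) * N(0,1), so E[z^2] = F_2.
    Averaging z^2 over independent trials recovers F_2."""
    total = 0.0
    for _ in range(trials):
        Y = [random.gauss(0.0, 1.0) for _ in range(n)]  # Y_i ~ N(0,1)
        z = 0.0
        for i, delta in updates:  # token (i_j, Delta_j), 0-indexed
            z += delta * Y[i]
        total += z * z
    return total / trials

# x = (3, -1, 0, 2) has F_2 = 9 + 1 + 0 + 4 = 14
print(f2_sketch_estimate([(0, 3), (1, -1), (3, 2)], n=4))
```

The estimate concentrates near $F_2 = 14$ as the number of trials grows; a single trial corresponds to the one-counter sketch in the pseudocode.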
It is known that $p$-stable distributions exist for all $p \in (0,2]$ and do not exist for any $p > 2$. The $p$-stable distributions do not, in general, have an analytical formula, except in some cases. We have already seen that the standard normal distribution is 2-stable. The 1-stable distribution is the Cauchy distribution, which is the distribution of the ratio of two independent standard normal random variables. The density function of the Cauchy distribution is known to be $\frac{1}{\pi(1+x^2)}$; note that the Cauchy distribution does not have a finite mean or variance. We use $\mathcal{D}_p$ to denote a $p$-stable distribution.
Although a general $p$-stable distribution does not have an analytical formula, it is known that one can sample from $\mathcal{D}_p$. The Chambers-Mallows-Stuck method is the following:

Sample $\theta$ uniformly from $[-\pi/2, \pi/2]$.
Sample $r$ uniformly from $[0, 1]$.
Output $\dfrac{\sin(p\theta)}{(\cos\theta)^{1/p}} \cdot \left(\dfrac{\cos(\theta(1-p))}{\ln(1/r)}\right)^{\frac{1-p}{p}}$.
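A direct transcription into Python might look as follows (this is an illustrative sketch, not a numerically hardened sampler; the guard against $r = 0$ is our implementation detail):

```python
import math
import random

def sample_p_stable(p):
    """Chambers-Mallows-Stuck sampler for a symmetric p-stable
    variate, 0 < p <= 2."""
    theta = random.uniform(-math.pi / 2, math.pi / 2)
    r = 1.0 - random.random()  # uniform in (0, 1], avoids log(1/0)
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)) * \
        (math.cos(theta * (1.0 - p)) / math.log(1.0 / r)) ** ((1.0 - p) / p)
```

For $p = 1$ the second factor has exponent $0$ and the formula reduces to $\tan\theta$, which matches the Cauchy distribution described above.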
Definition 1. The median of a distribution $\mathcal{D}$ is $\mu$ if for $Y \sim \mathcal{D}$, $\Pr[Y \leq \mu] = 1/2$. If $\phi(x)$ is the probability density function of $\mathcal{D}$ then we have $\int_{-\infty}^{\mu} \phi(x)\,dx = 1/2$.
Note that a median may not be uniquely defined for a distribution. The distribution $\mathcal{D}_p$ has a unique median and so we will use the notation $\mathrm{median}(\mathcal{D}_p)$ to denote this quantity. For a distribution $\mathcal{D}$ we will write $|\mathcal{D}|$ for the distribution of the absolute value of a random variable drawn from $\mathcal{D}$. If $\phi(x)$ is the density function of $\mathcal{D}$ then the density function of $|\mathcal{D}|$ is given by $\psi$, where $\psi(x) = 2\phi(x)$ if $x \geq 0$ and $\psi(x) = 0$ if $x < 0$.
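For the 1-stable case, where the density is known explicitly, this median can be computed in closed form:

$$\int_0^{\mu} \frac{2}{\pi(1+x^2)}\,dx = \frac{2}{\pi}\arctan\mu = \frac{1}{2} \quad\Longrightarrow\quad \mu = \tan\frac{\pi}{4} = 1,$$

so $\mathrm{median}(|\mathcal{D}_1|) = 1$. No such formula is available for general $p$.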
$k \leftarrow \Theta\left(\frac{1}{\epsilon^2} \log \frac{1}{\delta}\right)$
Let $M$ be a $k \times n$ matrix where each $M_{ij} \sim \mathcal{D}_p$
$\mathbf{y} \leftarrow M\mathbf{x}$
Output $Y \leftarrow \dfrac{\mathrm{median}(|y_1|, |y_2|, \ldots, |y_k|)}{\mathrm{median}(|\mathcal{D}_p|)}$
By the $p$-stability property, each $y_i$ is distributed as $\|x\|_p Z$ where $Z \sim \mathcal{D}_p$. First, consider the case $k = 1$. Then the output $|y_1|/\mathrm{median}(|\mathcal{D}_p|)$ is distributed according to $c|\mathcal{D}_p|$ where $c = \|x\|_p/\mathrm{median}(|\mathcal{D}_p|)$. It is not hard to verify that the median of this distribution is $\|x\|_p$. Thus, the algorithm takes $k$ samples from this distribution and outputs the sample median as the estimator. The lemma below shows that the sample median has good concentration properties.
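To make the estimator concrete, here is a small Python sketch (the function name, the default parameters, and the empirical estimation of $\mathrm{median}(|\mathcal{D}_p|)$ are our choices; `sample_p_stable` is the sampler sketched above). Note that storing $M$ explicitly is exactly the space issue discussed under "Technical Issues" below:

```python
import statistics

def indyk_lp_estimate(updates, n, p, k=101, med_samples=10001):
    """Maintain y = Mx for a k x n matrix M of i.i.d. p-stable entries
    and output median(|y_1|,...,|y_k|) / median(|D_p|), which
    concentrates around ||x||_p (its p-th power then estimates F_p)."""
    M = [[sample_p_stable(p) for _ in range(n)] for _ in range(k)]
    y = [0.0] * k
    for i, delta in updates:  # linear sketch: updates simply add up
        for row in range(k):
            y[row] += delta * M[row][i]
    # median(|D_p|) has no closed form in general (it is 1 for p = 1,
    # as computed earlier), so estimate it from fresh samples.
    med_dp = statistics.median(abs(sample_p_stable(p))
                               for _ in range(med_samples))
    return statistics.median(abs(v) for v in y) / med_dp
```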
Lemma 1. Let $\epsilon > 0$ and let $\mathcal{D}$ be a distribution with density function $\phi$ and a unique median $\mu > 0$. Suppose $\phi$ is absolutely continuous on $[(1-\epsilon)\mu, (1+\epsilon)\mu]$ and let $\alpha = \min\{\phi(x) \mid x \in [(1-\epsilon)\mu, (1+\epsilon)\mu]\}$. Let $Y = \mathrm{median}(Y_1, Y_2, \ldots, Y_k)$ where $Y_1, \ldots, Y_k$ are independent samples from the distribution $\mathcal{D}$. Then

$$\Pr[|Y - \mu| \geq \epsilon\mu] \leq 2e^{-\frac{1}{3}\epsilon^2\alpha^2\mu^2 k}.$$
We sketch the proof of the upper bound on $\Pr[Y \leq (1-\epsilon)\mu]$; the other direction is similar. Note that by the definition of the median, $\Pr[Y_j \leq \mu] = 1/2$. Hence

$$\Pr[Y_j \leq (1-\epsilon)\mu] = 1/2 - \int_{(1-\epsilon)\mu}^{\mu} \phi(x)\,dx.$$

Let $\gamma = \int_{(1-\epsilon)\mu}^{\mu} \phi(x)\,dx$. It is easy to see that $\gamma \geq \alpha\epsilon\mu$.

Let $I_j$ be the indicator of the event $Y_j \leq (1-\epsilon)\mu$; we have $\mathbf{E}[I_j] = \Pr[Y_j \leq (1-\epsilon)\mu] = 1/2 - \gamma$. Let $I = \sum_j I_j$; we have $\mathbf{E}[I] = k(1/2 - \gamma)$. Since $Y$ is the median of $Y_1, Y_2, \ldots, Y_k$, we have $Y \leq (1-\epsilon)\mu$ only if more than $k/2$ of the $I_j$ are true, which is the event $I > (1+\delta)\mathbf{E}[I]$ where $1+\delta = \frac{1}{1-2\gamma}$; note that $(1+\delta)\mathbf{E}[I] = \frac{k(1/2-\gamma)}{1-2\gamma} = k/2$. Now, via Chernoff bounds, this probability is at most $e^{-\gamma^2 k/3}$ for sufficiently small $\gamma$, and since $\gamma \geq \alpha\epsilon\mu$ this gives the claimed bound.
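As a quick sanity check of the lemma, the following illustrative simulation estimates the tail probability for $|\mathcal{N}(0,1)|$, whose median is $\approx 0.6745$ (the parameter values are arbitrary choices):

```python
import random
import statistics

def median_tail(k=500, eps=0.1, trials=2000):
    """Empirical check of Lemma 1 for |N(0,1)|: the sample median of
    k draws leaves [(1-eps)mu, (1+eps)mu] with probability decaying
    exponentially in k."""
    mu = 0.6744897501960817  # median of |N(0,1)|: inverse CDF at 3/4
    bad = 0
    for _ in range(trials):
        y = statistics.median(abs(random.gauss(0, 1)) for _ in range(k))
        if abs(y - mu) > eps * mu:
            bad += 1
    return bad / trials  # drops rapidly as k grows, per the lemma
```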
We can now apply the lemma to the estimator output by the algorithm. We let $\phi$ be the density of the distribution $c|\mathcal{D}_p|$. Recall that the median of this distribution is $\|x\|_p$ and the output of the algorithm is the median of $k$ independent samples from this distribution. Thus, from the lemma,

$$\Pr\left[\,|Y - \|x\|_p| \geq \epsilon\|x\|_p\,\right] \leq 2e^{-\frac{1}{3}\epsilon^2\alpha^2\mu^2 k}, \quad \text{where } \mu = \|x\|_p.$$

Let $\phi'$ be the density of $|\mathcal{D}_p|$ and let $\mu'$ be the median of $|\mathcal{D}_p|$. Then it can be seen that $\mu\alpha = \mu'\alpha'$ where $\alpha' = \min\{\phi'(x) \mid (1-\epsilon)\mu' \leq x \leq (1+\epsilon)\mu'\}$: since $c|\mathcal{D}_p|$ is a scaling of $|\mathcal{D}_p|$, we have $\phi(x) = \phi'(x/c)/c$ and $\mu = c\mu'$, hence $\alpha = \alpha'/c$. Thus $\mu'\alpha'$ depends only on $\mathcal{D}_p$ and $\epsilon$. Letting this quantity be $c_{p,\epsilon}$ we have

$$\Pr\left[\,|Y - \|x\|_p| \geq \epsilon\|x\|_p\,\right] \leq 2e^{-\frac{1}{3}\epsilon^2 c_{p,\epsilon}^2 k} \leq \delta,$$

where the last inequality holds for $k = \Theta\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ with the hidden constant depending on $c_{p,\epsilon}$.
Technical Issues: There are several technical issues that need to be addressed to obtain a proper algorithm from the preceding description. First, the algorithm as described requires one to store the entire matrix $M$, which is too large for streaming applications. Second, the constant factor in the choice of $k$ depends on $c_{p,\epsilon}$, which is not explicitly known since $\mathcal{D}_p$ is not well understood for general $p$. To obtain a streaming algorithm, the very high-level idea is to derandomize the algorithm via the use of Nisan's pseudorandom generator for small space. See [1] for more details.
2. Counting Frequent Items
We have seen various algorithms for estimating $F_p$ norms for $p \geq 0$. Note that $F_0$ corresponds to the number of distinct elements. In the limit, as $p \rightarrow \infty$, the $\ell_p$ norm of a vector $\mathbf{x}$ tends to the maximum of the absolute values of the entries of $\mathbf{x}$. Thus, we can define the $F_\infty$ norm to correspond to finding the maximum frequency in $\mathbf{x}$. More generally, we would like to find the frequent items in a stream, which are also called "heavy hitters". In general, it is not feasible to estimate the maximum frequency with limited space if it is too small relative to $m$.
2.1. Misra-Gries algorithm for frequent items
Suppose we have a stream $\sigma = a_1, a_2, \ldots, a_m$ where $a_j \in [n]$ (the simple setting), and we want to find all elements $i \in [n]$ such that $f_i > m/k$. Note that there can be at most $k$ such elements. The simplest case is $k = 2$, when we want to know whether there is a "majority" element; there is a simple deterministic algorithm for $k = 2$ that perhaps you have all seen in an algorithms class. The algorithm uses an associative array data structure of size $k$.

$\underline{\text{MisraGries}(k)}$:
$D$ is an empty associative array
While (stream is not empty) do
$\quad a_j$ is the current item
$\quad$ If ($a_j$ is in $\textit{keys}(D)$) then $D[a_j] \leftarrow D[a_j] + 1$
$\quad$ Else if ($|\textit{keys}(D)| < k - 1$) then $D[a_j] \leftarrow 1$
$\quad$ Else for each $\ell \in \textit{keys}(D)$: $D[\ell] \leftarrow D[\ell] - 1$, removing $\ell$ from $D$ if $D[\ell]$ reaches $0$
endWhile

For $i \in [n]$, let $\hat{f}_i = D[i]$ if $i \in \textit{keys}(D)$ at the end of the stream and $\hat{f}_i = 0$ otherwise.

Lemma 2. For each $i \in [n]$, $f_i - m/k \leq \hat{f}_i \leq f_i$.

The upper bound holds since $D[i]$ is incremented only on occurrences of $i$. For the lower bound, note that each batch of decrements removes $k$ units of count (one from each of the $k-1$ keys plus the discarded current item) out of $m$ total, so there are at most $m/k$ such batches and $i$ loses at most $m/k$ from its true count.
The lemma implies that if $f_i > m/k$ then $i \in \textit{keys}(D)$ at the end of the algorithm. Thus one can use a second pass over the data to compute the exact $f_i$ only for the at most $k$ items in $\textit{keys}(D)$. This gives an $O(mk)$ time (with a naive implementation) two-pass algorithm for finding all items with frequency greater than $m/k$.
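A compact Python rendering of the one-pass summary (the function name and the dict-based representation are our choices):

```python
def misra_gries(stream, k):
    """One-pass Misra-Gries summary with at most k-1 counters.
    Every item with frequency > m/k is guaranteed to end up as a key
    of D, and each counter undercounts by at most m/k (Lemma 2)."""
    D = {}
    for a in stream:
        if a in D:
            D[a] += 1
        elif len(D) < k - 1:
            D[a] = 1
        else:
            for key in list(D):  # decrement all counters; drop zeros
                D[key] -= 1
                if D[key] == 0:
                    del D[key]
    return D
```

For example, `misra_gries([1, 2, 1, 3, 1, 1], k=2)` keeps a single counter and returns `{1: 2}`; here $f_1 = 4 > m/k = 3$, and a second pass over the stream verifies that 1 is indeed a majority element.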
Bibliographic Notes: For more details on $F_p$ estimation when $0 < p \leq 2$, see the original paper of Indyk [1], the notes of Amit Chakrabarti (Chapter 7), and Lecture 4 of Jelani Nelson's course.
You can read the notes from the next lecture of Chandra Chekuri's course, on the Count and Count-Min sketches, here.
[1] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307-323, 2006.