Introduction to Independent Components Analysis

Our next topic is Independent Components Analysis (ICA). Similar to PCA, this will find a new basis in which to represent our data. However, the goal is very different.
As a motivating example, consider the "cocktail party problem." Here, $d$ speakers are speaking simultaneously at a party, and any microphone placed in the room records only an overlapping combination of the $d$ speakers' voices. But let's say we have $d$ different microphones placed in the room, and because each microphone is a different distance from each of the speakers, it records a different combination of the speakers' voices. Using these microphone recordings, can we separate out the original $d$ speakers' speech signals?
To formalize this problem, we imagine that there is some data $s \in \mathbb{R}^{d}$ that is generated via $d$ independent sources. What we observe is
$$x = As,$$
where $A$ is an unknown square matrix called the mixing matrix. Repeated observations give us a dataset $\{x^{(i)}; i = 1, \ldots, n\}$, and our goal is to recover the sources $s^{(i)}$ that had generated our data ($x^{(i)} = A s^{(i)}$).
In our cocktail party problem, $s^{(i)}$ is a $d$-dimensional vector, and $s_j^{(i)}$ is the sound that speaker $j$ was uttering at time $i$. Also, $x^{(i)}$ is a $d$-dimensional vector, and $x_j^{(i)}$ is the acoustic reading recorded by microphone $j$ at time $i$.
Let $W = A^{-1}$ be the unmixing matrix. Our goal is to find $W$, so that given our microphone recordings $x^{(i)}$, we can recover the sources by computing $s^{(i)} = W x^{(i)}$. For notational convenience, we also let $w_i^T$ denote the $i$-th row of $W$, so that
$$W = \left[\begin{array}{c} -w_1^T- \\ \vdots \\ -w_d^T- \end{array}\right].$$
Thus, $w_i \in \mathbb{R}^{d}$, and the $j$-th source can be recovered as $s_j^{(i)} = w_j^T x^{(i)}$.
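To make this setup concrete, here is a minimal NumPy sketch (an added illustration, not part of the original notes) that generates $n$ samples of $d$ non-Gaussian sources, mixes them with a random matrix $A$, and recovers them exactly using $W = A^{-1}$. In the actual ICA problem, of course, $A$ is unknown and $W$ must be estimated from the $x^{(i)}$'s alone.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 1000                    # d speakers/microphones, n time samples
S = rng.laplace(size=(n, d))      # rows are the sources s^{(i)}; Laplace is a non-Gaussian stand-in
A = rng.normal(size=(d, d))       # unknown square mixing matrix
X = S @ A.T                       # rows are the observations x^{(i)} = A s^{(i)}

# If A were known, the sources would be recovered exactly:
W = np.linalg.inv(A)              # unmixing matrix
S_recovered = X @ W.T             # rows are s^{(i)} = W x^{(i)}
assert np.allclose(S_recovered, S)
```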

1. ICA ambiguities

To what degree can $W = A^{-1}$ be recovered? If we have no prior knowledge about the sources and the mixing matrix, it is easy to see that there are some inherent ambiguities in $A$ that are impossible to recover, given only the $x^{(i)}$'s.
Specifically, let $P$ be any $d$-by-$d$ permutation matrix. This means that each row and each column of $P$ has exactly one "1." Here are some examples of permutation matrices:
$$P = \left[\begin{array}{ccc} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{array}\right]; \quad P = \left[\begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right]; \quad P = \left[\begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array}\right].$$
If $z$ is a vector, then $Pz$ is another vector that contains a permuted version of $z$'s coordinates. Given only the $x^{(i)}$'s, there will be no way to distinguish between $W$ and $PW$. Specifically, the permutation of the original sources is ambiguous, which should be no surprise. Fortunately, this does not matter for most applications.
Further, there is no way to recover the correct scaling of the $w_i$'s. For instance, if $A$ were replaced with $2A$, and every $s^{(i)}$ were replaced with $(0.5)s^{(i)}$, then our observed $x^{(i)} = 2A \cdot (0.5)s^{(i)}$ would still be the same. More broadly, if a single column of $A$ were scaled by a factor of $\alpha$, and the corresponding source were scaled by a factor of $1/\alpha$, then there is again no way to determine that this had happened given only the $x^{(i)}$'s. Thus, we cannot recover the "correct" scaling of the sources. However, for the applications that we are concerned with – including the cocktail party problem – this ambiguity also does not matter. Specifically, scaling a speaker's speech signal $s_j^{(i)}$ by some positive factor $\alpha$ affects only the volume of that speaker's speech. Sign changes do not matter either: $s_j^{(i)}$ and $-s_j^{(i)}$ sound identical when played on a speaker. Thus, if the $w_i$ found by an algorithm is scaled by any non-zero real number, the corresponding recovered source $s_i = w_i^T x$ will be scaled by the same factor; but this usually does not matter. (These comments also apply to ICA for the brain/MEG data that we talked about in class.)
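Both ambiguities are easy to verify numerically. The sketch below (an added check on synthetic data, not from the notes) confirms that scaling a column of $A$ by $\alpha$ while scaling the corresponding source by $1/\alpha$, or permuting the sources together with the columns of $A$, leaves the observations $x^{(i)} = A s^{(i)}$ unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
S = rng.laplace(size=(500, d))        # synthetic sources, rows s^{(i)}
A = rng.normal(size=(d, d))           # mixing matrix
X = S @ A.T                           # observations, rows x^{(i)}

# Scale column j of A by alpha and source j by 1/alpha: the observations are unchanged.
alpha, j = 2.5, 1
A2 = A.copy(); A2[:, j] *= alpha
S2 = S.copy(); S2[:, j] /= alpha
assert np.allclose(X, S2 @ A2.T)

# Permute the sources and the columns of A consistently: again the observations are unchanged.
perm = rng.permutation(d)
assert np.allclose(X, S[:, perm] @ A[:, perm].T)
```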
Are these the only sources of ambiguity in ICA? It turns out that they are, so long as the sources $s_i$ are non-Gaussian. To see what the difficulty is with Gaussian data, consider an example in which $d = 2$ and $s \sim \mathcal{N}(0, I)$. Here, $I$ is the $2 \times 2$ identity matrix. Note that the contours of the density of the standard normal distribution $\mathcal{N}(0, I)$ are circles centered on the origin, and the density is rotationally symmetric.
Now, suppose we observe some $x = As$, where $A$ is our mixing matrix. Then the distribution of $x$ will be Gaussian, $x \sim \mathcal{N}(0, AA^T)$, since
$$\begin{aligned} \mathbb{E}_{s \sim \mathcal{N}(0, I)}[x] &= \mathbb{E}[As] = A\,\mathbb{E}[s] = 0, \\ \operatorname{Cov}[x] = \mathbb{E}_{s \sim \mathcal{N}(0, I)}\left[xx^T\right] &= \mathbb{E}\left[Ass^TA^T\right] = A\,\mathbb{E}\left[ss^T\right]A^T = A \cdot \operatorname{Cov}[s] \cdot A^T = AA^T. \end{aligned}$$
Now, let $R$ be an arbitrary orthogonal (less formally, a rotation/reflection) matrix, so that $RR^T = R^TR = I$, and let $A' = AR$. Then if the data had been mixed according to $A'$ instead of $A$, we would have instead observed $x' = A's$. The distribution of $x'$ is also Gaussian, $x' \sim \mathcal{N}(0, AA^T)$, since
$$\mathbb{E}_{s \sim \mathcal{N}(0, I)}\left[x'(x')^T\right] = \mathbb{E}\left[A'ss^T(A')^T\right] = \mathbb{E}\left[ARss^T(AR)^T\right] = ARR^TA^T = AA^T.$$
Hence, whether the mixing matrix is $A$ or $A'$, we would observe data from a $\mathcal{N}(0, AA^T)$ distribution. Thus, there is no way to tell whether the sources were mixed using $A$ or $A'$. There is an arbitrary rotational component in the mixing matrix that cannot be determined from the data, and we cannot recover the original sources.
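The rotational ambiguity can also be seen directly in the covariance computation above. The following sketch (an added check, not from the notes) draws a random orthogonal $R$ and confirms that $A$ and $A' = AR$ give the same covariance $AA^T$, so the two Gaussian observation distributions coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 2
A = rng.normal(size=(d, d))

# A random orthogonal matrix R (the Q factor of a QR decomposition), so R R^T = I.
R = np.linalg.qr(rng.normal(size=(d, d)))[0]
A_prime = A @ R

# N(0, A A^T) and N(0, A' A'^T) are the same distribution: the covariances agree.
assert np.allclose(A @ A.T, A_prime @ A_prime.T)
```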
Our argument above was based on the fact that the multivariate standard normal distribution is rotationally symmetric. Despite the bleak picture that this paints for ICA on Gaussian data, it turns out that, so long as the data is not Gaussian, it is possible, given enough data, to recover the $d$ independent sources.

2. Densities and linear transformations

Before moving on to derive the ICA algorithm proper, we first digress briefly to talk about the effect of linear transformations on densities.
Suppose a random variable $s$ is drawn according to some density $p_s(s)$. For simplicity, assume for now that $s \in \mathbb{R}$ is a real number. Now, let the random variable $x$ be defined according to $x = As$ (here, $x \in \mathbb{R}$, $A \in \mathbb{R}$). Let $p_x$ be the density of $x$. What is $p_x$?
Let $W = A^{-1}$. To calculate the "probability" of a particular value of $x$, it is tempting to compute $s = Wx$, then evaluate $p_s$ at that point, and conclude that "$p_x(x) = p_s(Wx)$." However, this is incorrect. For example, let $s \sim \text{Uniform}[0, 1]$, so $p_s(s) = 1\{0 \leq s \leq 1\}$. Now, let $A = 2$, so $x = 2s$. Clearly, $x$ is distributed uniformly in the interval $[0, 2]$. Thus, its density is given by $p_x(x) = (0.5)\,1\{0 \leq x \leq 2\}$. This does not equal $p_s(Wx)$, where $W = 0.5 = A^{-1}$. Instead, the correct formula is $p_x(x) = p_s(Wx)\,|W|$.
More generally, if $s$ is a vector-valued distribution with density $p_s$, and $x = As$ for a square, invertible matrix $A$, then the density of $x$ is given by
$$p_x(x) = p_s(Wx) \cdot |W|,$$
where $W = A^{-1}$.
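As a quick numerical sanity check of this formula (an added illustration reusing the one-dimensional uniform example above), a Monte Carlo estimate of the density of $x = 2s$ is flat at about $p_s(Wx)\,|W| = 0.5$ on $[0, 2]$:

```python
import numpy as np

rng = np.random.default_rng(3)

# The 1-D example from above: s ~ Uniform[0, 1], x = A s with A = 2, so W = A^{-1} = 0.5.
A, W = 2.0, 0.5
s = rng.uniform(0.0, 1.0, size=200_000)
x = A * s

# Empirical density of x on [0, 2] versus the formula p_x(x) = p_s(W x) |W| = 1 * 0.5.
hist, _ = np.histogram(x, bins=20, range=(0.0, 2.0), density=True)
print(hist.round(2))               # every bin is close to 0.5
```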
Remark. If you've seen the result that $A$ maps $[0,1]^d$ to a set of volume $|A|$, then here's another way to remember the formula for $p_x$ given above, one that also generalizes our previous 1-dimensional example. Specifically, let $A \in \mathbb{R}^{d \times d}$ be given, and let $W = A^{-1}$ as usual. Also let $C_1 = [0,1]^d$ be the $d$-dimensional hypercube, and define $C_2 = \{As : s \in C_1\} \subseteq \mathbb{R}^d$ to be the image of $C_1$ under the mapping given by $A$. Then it is a standard result in linear algebra (and, indeed, one of the ways of defining determinants) that the volume of $C_2$ is given by $|A|$. Now, suppose $s$ is uniformly distributed in $[0,1]^d$, so its density is $p_s(s) = 1\{s \in C_1\}$. Then clearly $x$ will be uniformly distributed in $C_2$. Its density is therefore $p_x(x) = 1\{x \in C_2\}/\operatorname{vol}(C_2)$ (since it must integrate to 1 over $C_2$). But using the fact that the determinant of the inverse of a matrix is just the inverse of the determinant, we have $1/\operatorname{vol}(C_2) = 1/|A| = |A^{-1}| = |W|$. Thus, $p_x(x) = 1\{x \in C_2\}\,|W| = 1\{Wx \in C_1\}\,|W| = p_s(Wx)\,|W|$.

3. ICA algorithm

We are now ready to derive an ICA algorithm. We describe an algorithm by Bell and Sejnowski, and we give an interpretation of their algorithm as a method for maximum likelihood estimation. (This is different from their original interpretation, which involved a complicated idea called the infomax principle that is no longer necessary given the modern understanding of ICA.)
We suppose that the distribution of each source $s_j$ is given by a density $p_s$, and that the joint distribution of the sources $s$ is given by
$$p(s) = \prod_{j=1}^{d} p_s(s_j).$$
Note that by modeling the joint distribution as a product of marginals, we capture the assumption that the sources are independent. Using our formulas from the previous section, this implies the following density on $x = As = W^{-1}s$:
$$p(x) = \prod_{j=1}^{d} p_s\left(w_j^T x\right) \cdot |W|.$$
All that remains is to specify a density $p_s$ for the individual sources.
Recall that, given a real-valued random variable $z$, its cumulative distribution function (cdf) $F$ is defined by $F(z_0) = P(z \leq z_0) = \int_{-\infty}^{z_0} p_z(z)\,dz$, and the density is the derivative of the cdf: $p_z(z) = F'(z)$.
Thus, to specify a density for the $s_i$'s, all we need to do is to specify some cdf for it. A cdf has to be a monotonic function that increases from zero to one. Following our previous discussion, we cannot choose the Gaussian cdf, as ICA doesn't work on Gaussian data. What we'll choose instead, as a reasonable "default" cdf that slowly increases from 0 to 1, is the sigmoid function $g(s) = 1/(1 + e^{-s})$. Hence, $p_s(s) = g'(s)$.[1]
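Concretely, the sigmoid cdf and the corresponding source density $p_s(s) = g'(s) = g(s)(1 - g(s))$ can be written as follows (an added sketch; the numerical check simply confirms that $g'$ integrates to 1 and has zero mean, which relates to the footnote's remark about zero-mean data):

```python
import numpy as np

def g(s):
    """Sigmoid cdf used as the default cdf for the sources."""
    return 1.0 / (1.0 + np.exp(-s))

def p_s(s):
    """Source density p_s(s) = g'(s) = g(s) (1 - g(s)) (the logistic density)."""
    return g(s) * (1.0 - g(s))

# Numerical check: g'(s) integrates to 1 and is symmetric about 0 (zero mean).
grid = np.linspace(-20.0, 20.0, 100_001)
dx = grid[1] - grid[0]
print(np.sum(p_s(grid)) * dx)          # ~1.0
print(np.sum(grid * p_s(grid)) * dx)   # ~0.0
```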
The square matrix $W$ is the parameter in our model. Given a training set $\{x^{(i)}; i = 1, \ldots, n\}$, the log likelihood is given by
$$\ell(W) = \sum_{i=1}^{n}\left(\sum_{j=1}^{d} \log g'\left(w_j^T x^{(i)}\right) + \log|W|\right).$$
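In code, this log likelihood could be evaluated as follows (an added sketch using the sigmoid choice above; the function name is made up, and `slogdet` is used so that $|W|$ is treated as the absolute value of the determinant):

```python
import numpy as np

def log_likelihood(W, X):
    """ICA log likelihood ell(W) for data X of shape (n, d), whose rows are x^{(i)}."""
    Z = X @ W.T                                   # Z[i, j] = w_j^T x^{(i)}
    G = 1.0 / (1.0 + np.exp(-Z))                  # g(w_j^T x^{(i)})
    log_g_prime = np.log(G) + np.log(1.0 - G)     # log g'(z) = log[g(z)(1 - g(z))]
    _, logabsdet = np.linalg.slogdet(W)           # log |W|
    return np.sum(log_g_prime) + X.shape[0] * logabsdet
```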
We would like to maximize this log likelihood in terms of $W$. By taking derivatives and using the fact (from the first set of notes) that $\nabla_W |W| = |W|(W^{-1})^T$, we easily derive a stochastic gradient ascent learning rule. For a training example $x^{(i)}$, the update rule is:
$$W := W + \alpha\left(\left[\begin{array}{c} 1 - 2g(w_1^T x^{(i)}) \\ 1 - 2g(w_2^T x^{(i)}) \\ \vdots \\ 1 - 2g(w_d^T x^{(i)}) \end{array}\right] x^{(i)T} + (W^T)^{-1}\right),$$
where $\alpha$ is the learning rate.
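To see where the $1 - 2g(\cdot)$ entries come from (a standard intermediate step not spelled out in the notes), recall that the sigmoid satisfies $g'(s) = g(s)(1 - g(s))$, so

$$\frac{d}{ds} \log g'(s) = \frac{d}{ds}\left[\log g(s) + \log(1 - g(s))\right] = (1 - g(s)) - g(s) = 1 - 2g(s).$$

Applying the chain rule with $s = w_j^T x^{(i)}$ contributes $\left(1 - 2g(w_j^T x^{(i)})\right)x^{(i)T}$ to the $j$-th row of the gradient, and $\nabla_W \log|W| = (W^T)^{-1}$ gives the last term.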
After the algorithm converges, we then compute $s^{(i)} = W x^{(i)}$ to recover the original sources.
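Putting the pieces together, here is a minimal NumPy sketch of this stochastic gradient ascent procedure (an added illustration of the update rule above, not a reference implementation; the function name, learning rate, and epoch count are arbitrary choices, and the data is assumed to have zero mean as discussed in the footnote):

```python
import numpy as np

def g(z):
    """Sigmoid function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def ica_sgd(X, alpha=0.001, n_epochs=50, seed=0):
    """Estimate the unmixing matrix W by stochastic gradient ascent on ell(W).

    X has shape (n, d); each row is one observation x^{(i)}, assumed zero mean.
    Returns W; the recovered sources are the rows of X @ W.T.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.eye(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):      # visit examples in random order (see the remark below)
            x = X[i]
            # Gradient of the log g'(w_j^T x) terms: row j is (1 - 2 g(w_j^T x)) x^T;
            # gradient of log|W| is (W^T)^{-1}.
            grad = np.outer(1.0 - 2.0 * g(W @ x), x) + np.linalg.inv(W.T)
            W += alpha * grad
    return W

# Usage on centered data; the recovered sources match the true ones only up to
# permutation and scaling, as discussed in section 1.
# W = ica_sgd(X - X.mean(axis=0))
# S_hat = (X - X.mean(axis=0)) @ W.T
```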
Remark. When writing down the likelihood of the data, we implicitly assumed that the $x^{(i)}$'s were independent of each other (for different values of $i$; note this issue is different from whether the different coordinates of $x^{(i)}$ are independent), so that the likelihood of the training set was given by $\prod_i p(x^{(i)}; W)$. This assumption is clearly incorrect for speech data and other time series where the $x^{(i)}$'s are dependent, but it can be shown that having correlated training examples will not hurt the performance of the algorithm if we have sufficient data. However, for problems where successive training examples are correlated, when implementing stochastic gradient ascent, it sometimes helps accelerate convergence if we visit training examples in a randomly permuted order. (I.e., run stochastic gradient ascent on a randomly shuffled copy of the training set.)

  1. If you have prior knowledge that the sources' densities take a certain form, then it is a good idea to substitute that in here. But in the absence of such knowledge, the sigmoid function can be thought of as a reasonable default that seems to work well for many problems. Also, the presentation here assumes that either the data $x^{(i)}$ has been preprocessed to have zero mean, or that it can naturally be expected to have zero mean (such as acoustic signals). This is necessary because our assumption that $p_s(s) = g'(s)$ implies $\mathbb{E}[s] = 0$ (the derivative of the logistic function is a symmetric function, and hence gives a density corresponding to a random variable with zero mean), which implies $\mathbb{E}[x] = \mathbb{E}[As] = 0$.
