Introduction to Factor Analysis

When we have data $x^{(i)} \in \mathbb{R}^{d}$ that comes from a mixture of several Gaussians, the EM algorithm can be applied to fit a mixture model. In this setting, we usually imagine problems where we have sufficient data to be able to discern the multiple-Gaussian structure in the data. For instance, this would be the case if our training set size $n$ was significantly larger than the dimension $d$ of the data.
Now, consider a setting in which $d \gg n$. In such a problem, it might be difficult to model the data even with a single Gaussian, much less a mixture of Gaussians. Specifically, since the $n$ data points span only a low-dimensional subspace of $\mathbb{R}^{d}$, if we model the data as Gaussian, and estimate the mean and covariance using the usual maximum likelihood estimators,
$$\begin{aligned} \mu &=\frac{1}{n} \sum_{i=1}^{n} x^{(i)} \\ \Sigma &=\frac{1}{n} \sum_{i=1}^{n}\left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^{T}, \end{aligned}$$
we would find that the matrix $\Sigma$ is singular. This means that $\Sigma^{-1}$ does not exist, and $1/|\Sigma|^{1/2} = 1/0$. But both of these terms are needed in computing the usual density of a multivariate Gaussian distribution. Another way of stating this difficulty is that maximum likelihood estimates of the parameters result in a Gaussian that places all of its probability in the affine space spanned by the data,[1] and this corresponds to a singular covariance matrix.
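To make this concrete, here is a minimal NumPy sketch (with made-up dimensions and synthetic data) showing that the maximum likelihood covariance estimate has rank at most $n-1$, and is therefore singular whenever $n \leq d$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                        # far fewer points than dimensions (d >> n)
X = rng.normal(size=(n, d))         # hypothetical training set; row i is x^(i)

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / n   # maximum likelihood estimate of the covariance

# The centered data matrix has rank at most n - 1, so Sigma is singular and |Sigma| = 0.
print(np.linalg.matrix_rank(Sigma))            # 4
print(np.isclose(np.linalg.det(Sigma), 0.0))   # True
```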
More generally, unless $n$ exceeds $d$ by some reasonable amount, the maximum likelihood estimates of the mean and covariance may be quite poor. Nonetheless, we would still like to be able to fit a reasonable Gaussian model to the data, and perhaps capture some interesting covariance structure in the data. How can we do this?
In the next section, we begin by reviewing two possible restrictions on $\Sigma$ that allow us to fit $\Sigma$ with small amounts of data, but neither of which gives a satisfactory solution to our problem. We next discuss some properties of Gaussians that will be needed later; specifically, how to find marginal and conditional distributions of Gaussians. Finally, we present the factor analysis model, and EM for it.

1. Restrictions of $\Sigma$

If we do not have sufficient data to fit a full covariance matrix, we may place some restrictions on the space of matrices $\Sigma$ that we will consider. For instance, we may choose to fit a covariance matrix $\Sigma$ that is diagonal. In this setting, the reader may easily verify that the maximum likelihood estimate of the covariance matrix is given by the diagonal matrix $\Sigma$ satisfying
$$\Sigma_{jj}=\frac{1}{n} \sum_{i=1}^{n}\left(x_{j}^{(i)}-\mu_{j}\right)^{2}.$$
Thus, $\Sigma_{jj}$ is just the empirical estimate of the variance of the $j$-th coordinate of the data.
Recall that the contours of a Gaussian density are ellipses. A diagonal $\Sigma$ corresponds to a Gaussian where the major axes of these ellipses are axis-aligned.
Sometimes, we may place a further restriction on the covariance matrix that not only must it be diagonal, but its diagonal entries must all be equal. In this setting, we have $\Sigma=\sigma^{2} I$, where $\sigma^{2}$ is the parameter under our control. The maximum likelihood estimate of $\sigma^{2}$ can be found to be:
$$\sigma^{2}=\frac{1}{nd} \sum_{j=1}^{d} \sum_{i=1}^{n}\left(x_{j}^{(i)}-\mu_{j}\right)^{2}.$$
This model corresponds to using Gaussians whose densities have contours that are circles (in 2 dimensions; or spheres/hyperspheres in higher dimensions).
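As a quick illustration, the following NumPy sketch (variable names are ours, and the toy data matches the earlier snippet) computes both restricted estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.normal(size=(n, d))        # hypothetical data; row i is x^(i)
mu = X.mean(axis=0)

# Diagonal restriction: Sigma_jj = (1/n) sum_i (x_j^(i) - mu_j)^2
Sigma_diag = np.diag(((X - mu) ** 2).mean(axis=0))

# Spherical restriction: Sigma = sigma^2 I, averaging over all coordinates as well
sigma2 = ((X - mu) ** 2).mean()
Sigma_spherical = sigma2 * np.eye(d)

# Unlike the full MLE, both are invertible as soon as n >= 2 (for non-degenerate data).
print(np.linalg.matrix_rank(Sigma_diag), np.linalg.matrix_rank(Sigma_spherical))  # 20 20
```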
If we are fitting a full, unconstrained covariance matrix $\Sigma$ to data, it is necessary that $n \geq d+1$ in order for the maximum likelihood estimate of $\Sigma$ not to be singular. Under either of the two restrictions above, we may obtain non-singular $\Sigma$ when $n \geq 2$.
However, restricting $\Sigma$ to be diagonal also means modeling the different coordinates $x_{i}, x_{j}$ of the data as being uncorrelated and independent. Often, it would be nice to capture some interesting correlation structure in the data, and either of the restrictions on $\Sigma$ described above would fail to do so. In this set of notes, we will describe the factor analysis model, which uses more parameters than the diagonal $\Sigma$ and captures some correlations in the data, but without having to fit a full covariance matrix.

2. Marginals and conditionals of Gaussians

Before describing factor analysis, we digress to talk about how to find conditional and marginal distributions of random variables with a joint multivariate Gaussian distribution.
Suppose we have a vector-valued random variable
$$x=\left[\begin{array}{l} x_{1} \\ x_{2} \end{array}\right],$$
where $x_{1} \in \mathbb{R}^{r}$, $x_{2} \in \mathbb{R}^{s}$, and $x \in \mathbb{R}^{r+s}$. Suppose $x \sim \mathcal{N}(\mu, \Sigma)$, where
$$\mu=\left[\begin{array}{l} \mu_{1} \\ \mu_{2} \end{array}\right], \quad \Sigma=\left[\begin{array}{ll} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array}\right].$$
Here, $\mu_{1} \in \mathbb{R}^{r}$, $\mu_{2} \in \mathbb{R}^{s}$, $\Sigma_{11} \in \mathbb{R}^{r \times r}$, $\Sigma_{12} \in \mathbb{R}^{r \times s}$, and so on. Note that since covariance matrices are symmetric, $\Sigma_{12}=\Sigma_{21}^{T}$.
Under our assumptions, $x_{1}$ and $x_{2}$ are jointly multivariate Gaussian. What is the marginal distribution of $x_{1}$? It is not hard to see that $\mathrm{E}[x_{1}]=\mu_{1}$, and that $\operatorname{Cov}(x_{1})=\mathrm{E}[(x_{1}-\mu_{1})(x_{1}-\mu_{1})^{T}]=\Sigma_{11}$. To see that the latter is true, note that by definition of the joint covariance of $x_{1}$ and $x_{2}$, we have that
$$\begin{aligned} \operatorname{Cov}(x) &=\Sigma \\ &=\left[\begin{array}{ll} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array}\right] \\ &=\mathrm{E}\left[(x-\mu)(x-\mu)^{T}\right] \\ &=\mathrm{E}\left[\left(\begin{array}{l} x_{1}-\mu_{1} \\ x_{2}-\mu_{2} \end{array}\right)\left(\begin{array}{l} x_{1}-\mu_{1} \\ x_{2}-\mu_{2} \end{array}\right)^{T}\right] \\ &=\mathrm{E}\left[\begin{array}{ll} \left(x_{1}-\mu_{1}\right)\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{1}-\mu_{1}\right)\left(x_{2}-\mu_{2}\right)^{T} \\ \left(x_{2}-\mu_{2}\right)\left(x_{1}-\mu_{1}\right)^{T} & \left(x_{2}-\mu_{2}\right)\left(x_{2}-\mu_{2}\right)^{T} \end{array}\right]. \end{aligned}$$
Matching the upper-left subblocks in the matrices in the second and the last lines above gives the result.
Since marginal distributions of Gaussians are themselves Gaussian, we therefore have that the marginal distribution of $x_{1}$ is given by $x_{1} \sim \mathcal{N}\left(\mu_{1}, \Sigma_{11}\right)$.
Also, we can ask, what is the conditional distribution of $x_{1}$ given $x_{2}$? By referring to the definition of the multivariate Gaussian distribution, it can be shown that $x_{1} | x_{2} \sim \mathcal{N}\left(\mu_{1|2}, \Sigma_{1|2}\right)$, where
\begin{align} \mu_{1|2} &=\mu_{1}+\Sigma_{12} \Sigma_{22}^{-1}\left(x_{2}-\mu_{2}\right) \tag{1} \\ \Sigma_{1|2} &=\Sigma_{11}-\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \tag{2} \end{align}
When we work with the factor analysis model in the next section, these formulas for finding conditional and marginal distributions of Gaussians will be very useful.
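As a sanity check on equations (1)-(2), here is a small NumPy sketch (the function name and partition argument are ours, not part of the notes) that computes the conditional parameters from a partitioned mean and covariance:

```python
import numpy as np

def gaussian_conditional(mu, Sigma, x2, r):
    """Return (mu_{1|2}, Sigma_{1|2}) for x ~ N(mu, Sigma) with x = [x1; x2], dim(x1) = r."""
    mu1, mu2 = mu[:r], mu[r:]
    S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
    S21, S22 = Sigma[r:, :r], Sigma[r:, r:]
    mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)   # equation (1)
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)     # equation (2)
    return mu_cond, Sigma_cond

# The marginal of x1 requires no computation at all: it is simply N(mu[:r], Sigma[:r, :r]).
```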

3. The factor analysis model

In the factor analysis model, we posit a joint distribution on $(x, z)$ as follows, where $z \in \mathbb{R}^{k}$ is a latent random variable:
$$\begin{aligned} z & \sim \mathcal{N}(0, I) \\ x | z & \sim \mathcal{N}(\mu+\Lambda z, \Psi). \end{aligned}$$
Here, the parameters of our model are the vector $\mu \in \mathbb{R}^{d}$, the matrix $\Lambda \in \mathbb{R}^{d \times k}$, and the diagonal matrix $\Psi \in \mathbb{R}^{d \times d}$. The value of $k$ is usually chosen to be smaller than $d$.
Thus, we imagine that each datapoint $x^{(i)}$ is generated by sampling a $k$-dimensional multivariate Gaussian $z^{(i)}$. Then, it is mapped to a $k$-dimensional affine subspace of $\mathbb{R}^{d}$ by computing $\mu+\Lambda z^{(i)}$. Lastly, $x^{(i)}$ is generated by adding noise with covariance $\Psi$ to $\mu+\Lambda z^{(i)}$.
Equivalently (convince yourself that this is the case), we can therefore also define the factor analysis model according to
$$\begin{aligned} z & \sim \mathcal{N}(0, I) \\ \epsilon & \sim \mathcal{N}(0, \Psi) \\ x &=\mu+\Lambda z+\epsilon \end{aligned}$$
where $\epsilon$ and $z$ are independent.
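This equivalent formulation is exactly a recipe for sampling. Below is a minimal NumPy sketch of the generative process; the function name and all parameter values are made up for illustration:

```python
import numpy as np

def sample_factor_analysis(mu, Lam, Psi_diag, n, rng):
    """Draw n samples of x = mu + Lam z + eps, with z ~ N(0, I) and eps ~ N(0, diag(Psi_diag))."""
    d, k = Lam.shape
    z = rng.normal(size=(n, k))                         # latent factors, one row per sample
    eps = rng.normal(size=(n, d)) * np.sqrt(Psi_diag)   # independent noise with diagonal covariance
    return mu + z @ Lam.T + eps

rng = np.random.default_rng(0)
d, k = 10, 2
Lam = rng.normal(size=(d, k))          # hypothetical factor loading matrix
Psi_diag = np.full(d, 0.1)             # diagonal entries of Psi
X = sample_factor_analysis(np.zeros(d), Psi_diag=Psi_diag, Lam=Lam, n=2000, rng=rng)

# The empirical covariance of X should be close to Lam Lam^T + diag(Psi_diag).
print(np.abs(np.cov(X, rowvar=False) - (Lam @ Lam.T + np.diag(Psi_diag))).max())
```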
Let's work out exactly what distribution our model defines. Our random variables $z$ and $x$ have a joint Gaussian distribution
$$\left[\begin{array}{l} z \\ x \end{array}\right] \sim \mathcal{N}\left(\mu_{zx}, \Sigma\right).$$
We will now find $\mu_{zx}$ and $\Sigma$.
We know that $\mathrm{E}[z]=0$, from the fact that $z \sim \mathcal{N}(0, I)$. Also, we have that
$$\begin{aligned} \mathrm{E}[x] &=\mathrm{E}[\mu+\Lambda z+\epsilon] \\ &=\mu+\Lambda \mathrm{E}[z]+\mathrm{E}[\epsilon] \\ &=\mu. \end{aligned}$$
Putting these together, we obtain
$$\mu_{zx}=\left[\begin{array}{c} \vec{0} \\ \mu \end{array}\right].$$
Next, to find $\Sigma$, we need to calculate $\Sigma_{zz}=\mathrm{E}\left[(z-\mathrm{E}[z])(z-\mathrm{E}[z])^{T}\right]$ (the upper-left block of $\Sigma$), $\Sigma_{zx}=\mathrm{E}\left[(z-\mathrm{E}[z])(x-\mathrm{E}[x])^{T}\right]$ (upper-right block), and $\Sigma_{xx}=\mathrm{E}\left[(x-\mathrm{E}[x])(x-\mathrm{E}[x])^{T}\right]$ (lower-right block).
Now, since $z \sim \mathcal{N}(0, I)$, we easily find that $\Sigma_{zz}=\operatorname{Cov}(z)=I$. Also,
$$\begin{aligned} \mathrm{E}\left[(z-\mathrm{E}[z])(x-\mathrm{E}[x])^{T}\right] &=\mathrm{E}\left[z(\mu+\Lambda z+\epsilon-\mu)^{T}\right] \\ &=\mathrm{E}\left[z z^{T}\right] \Lambda^{T}+\mathrm{E}\left[z \epsilon^{T}\right] \\ &=\Lambda^{T}. \end{aligned}$$
In the last step, we used the fact that $\mathrm{E}\left[z z^{T}\right]=\operatorname{Cov}(z)$ (since $z$ has zero mean), and $\mathrm{E}\left[z \epsilon^{T}\right]=\mathrm{E}[z] \mathrm{E}\left[\epsilon^{T}\right]=0$ (since $z$ and $\epsilon$ are independent, and hence the expectation of their product is the product of their expectations). Similarly, we can find $\Sigma_{xx}$ as follows:
$$\begin{aligned} \mathrm{E}\left[(x-\mathrm{E}[x])(x-\mathrm{E}[x])^{T}\right] &=\mathrm{E}\left[(\mu+\Lambda z+\epsilon-\mu)(\mu+\Lambda z+\epsilon-\mu)^{T}\right] \\ &=\mathrm{E}\left[\Lambda z z^{T} \Lambda^{T}+\epsilon z^{T} \Lambda^{T}+\Lambda z \epsilon^{T}+\epsilon \epsilon^{T}\right] \\ &=\Lambda \mathrm{E}\left[z z^{T}\right] \Lambda^{T}+\mathrm{E}\left[\epsilon \epsilon^{T}\right] \\ &=\Lambda \Lambda^{T}+\Psi. \end{aligned}$$
Putting everything together, we therefore have that
\begin{equation} \left[\begin{array}{l} z \\ x \end{array}\right] \sim \mathcal{N}\left(\left[\begin{array}{l} \vec{0} \\ \mu \end{array}\right],\left[\begin{array}{cc} I & \Lambda^{T} \\ \Lambda & \Lambda \Lambda^{T}+\Psi \end{array}\right]\right). \tag{3} \end{equation}
Hence, we also see that the marginal distribution of $x$ is given by $x \sim \mathcal{N}\left(\mu, \Lambda \Lambda^{T}+\Psi\right)$. Thus, given a training set $\left\{x^{(i)} ; i=1, \ldots, n\right\}$, we can write down the log likelihood of the parameters:
$$\ell(\mu, \Lambda, \Psi)=\log \prod_{i=1}^{n} \frac{1}{(2 \pi)^{d / 2}\left|\Lambda \Lambda^{T}+\Psi\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x^{(i)}-\mu\right)^{T}\left(\Lambda \Lambda^{T}+\Psi\right)^{-1}\left(x^{(i)}-\mu\right)\right).$$
To perform maximum likelihood estimation, we would like to maximize this quantity with respect to the parameters. But maximizing this formula explicitly is hard (try it yourself), and we are aware of no algorithm that does so in closed form. So, we will instead use the EM algorithm. In the next section, we derive EM for factor analysis.
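Although maximizing this log likelihood in closed form is hard, evaluating it is straightforward, and it is a useful quantity to monitor while running EM. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def fa_log_likelihood(X, mu, Lam, Psi_diag):
    """Log likelihood of the data under x ~ N(mu, Lam Lam^T + Psi), as in the formula above."""
    n, d = X.shape
    Sigma = Lam @ Lam.T + np.diag(Psi_diag)
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    # Row-wise quadratic forms (x^(i) - mu)^T Sigma^{-1} (x^(i) - mu)
    quad = np.einsum('ij,ij->i', diff, np.linalg.solve(Sigma, diff.T).T)
    return -0.5 * (n * d * np.log(2 * np.pi) + n * logdet + quad.sum())
```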

4. EM for factor analysis

The derivation for the E-step is easy. We need to compute $Q_{i}(z^{(i)})=p(z^{(i)} | x^{(i)} ; \mu, \Lambda, \Psi)$. By substituting the distribution given in Equation (3) into the formulas (1-2) used for finding the conditional distribution of a Gaussian, we find that $z^{(i)} | x^{(i)} ; \mu, \Lambda, \Psi \sim \mathcal{N}\left(\mu_{z^{(i)}|x^{(i)}}, \Sigma_{z^{(i)}|x^{(i)}}\right)$, where
$$\begin{aligned} \mu_{z^{(i)}|x^{(i)}} &=\Lambda^{T}\left(\Lambda \Lambda^{T}+\Psi\right)^{-1}\left(x^{(i)}-\mu\right), \\ \Sigma_{z^{(i)}|x^{(i)}} &=I-\Lambda^{T}\left(\Lambda \Lambda^{T}+\Psi\right)^{-1} \Lambda. \end{aligned}$$
So, using these definitions for $\mu_{z^{(i)}|x^{(i)}}$ and $\Sigma_{z^{(i)}|x^{(i)}}$, we have
$$Q_{i}\left(z^{(i)}\right)=\frac{1}{(2 \pi)^{k / 2}\left|\Sigma_{z^{(i)}|x^{(i)}}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(z^{(i)}-\mu_{z^{(i)}|x^{(i)}}\right)^{T} \Sigma_{z^{(i)}|x^{(i)}}^{-1}\left(z^{(i)}-\mu_{z^{(i)}|x^{(i)}}\right)\right).$$
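In code, the E-step reduces to a couple of linear-algebra operations; note that $\Sigma_{z^{(i)}|x^{(i)}}$ does not depend on $x^{(i)}$, so it is shared across all examples. A minimal NumPy sketch (function and variable names are ours):

```python
import numpy as np

def e_step(X, mu, Lam, Psi_diag):
    """Posterior parameters of z^(i) | x^(i): per-example means, one shared covariance."""
    d, k = Lam.shape
    Sigma_x = Lam @ Lam.T + np.diag(Psi_diag)     # Lam Lam^T + Psi
    A = Lam.T @ np.linalg.inv(Sigma_x)            # Lam^T (Lam Lam^T + Psi)^{-1}, shape (k, d)
    mu_post = (X - mu) @ A.T                      # row i is mu_{z^(i)|x^(i)}
    Sigma_post = np.eye(k) - A @ Lam              # Sigma_{z^(i)|x^(i)}, identical for every i
    return mu_post, Sigma_post
```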
Let's now work out the M-step. Here, we need to maximize
\begin{equation} \sum_{i=1}^{n} \int_{z^{(i)}} Q_{i}\left(z^{(i)}\right) \log \frac{p\left(x^{(i)}, z^{(i)} ; \mu, \Lambda, \Psi\right)}{Q_{i}\left(z^{(i)}\right)} d z^{(i)} \tag{4} \end{equation}
with respect to the parameters $\mu, \Lambda, \Psi$. We will work out only the optimization with respect to $\Lambda$, and leave the derivations of the updates for $\mu$ and $\Psi$ as an exercise to the reader.
We can simplify Equation (4) as follows:
\begin{align} & \sum_{i=1}^{n} \int_{z^{(i)}} Q_{i}\left(z^{(i)}\right)\left[\log p\left(x^{(i)} | z^{(i)} ; \mu, \Lambda, \Psi\right)+\log p\left(z^{(i)}\right)-\log Q_{i}\left(z^{(i)}\right)\right] d z^{(i)} \tag{5} \\ =& \sum_{i=1}^{n} \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[\log p\left(x^{(i)} \mid z^{(i)} ; \mu, \Lambda, \Psi\right)+\log p\left(z^{(i)}\right)-\log Q_{i}\left(z^{(i)}\right)\right] \tag{6} \end{align}
Here, the "$z^{(i)} \sim Q_{i}$" subscript indicates that the expectation is with respect to $z^{(i)}$ drawn from $Q_{i}$. In the subsequent development, we will omit this subscript when there is no risk of ambiguity. Dropping terms that do not depend on the parameters, we find that we need to maximize:
$$\begin{aligned} \sum_{i=1}^{n} & \mathrm{E}\left[\log p\left(x^{(i)} \mid z^{(i)} ; \mu, \Lambda, \Psi\right)\right] \\ =& \sum_{i=1}^{n} \mathrm{E}\left[\log \frac{1}{(2 \pi)^{d / 2}|\Psi|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x^{(i)}-\mu-\Lambda z^{(i)}\right)^{T} \Psi^{-1}\left(x^{(i)}-\mu-\Lambda z^{(i)}\right)\right)\right] \\ =& \sum_{i=1}^{n} \mathrm{E}\left[-\frac{1}{2} \log |\Psi|-\frac{d}{2} \log (2 \pi)-\frac{1}{2}\left(x^{(i)}-\mu-\Lambda z^{(i)}\right)^{T} \Psi^{-1}\left(x^{(i)}-\mu-\Lambda z^{(i)}\right)\right] \end{aligned}$$
Let's maximize this with respect to $\Lambda$. Only the last term above depends on $\Lambda$. Taking derivatives, and using the facts that $\operatorname{tr} a=a$ (for $a \in \mathbb{R}$), $\operatorname{tr} AB=\operatorname{tr} BA$, and $\nabla_{A} \operatorname{tr} ABA^{T}C=CAB+C^{T}AB^{T}$, we get:
$$\begin{aligned} & \nabla_{\Lambda} \sum_{i=1}^{n}-\mathrm{E}\left[\frac{1}{2}\left(x^{(i)}-\mu-\Lambda z^{(i)}\right)^{T} \Psi^{-1}\left(x^{(i)}-\mu-\Lambda z^{(i)}\right)\right] \\ =& \sum_{i=1}^{n} \nabla_{\Lambda} \mathrm{E}\left[-\operatorname{tr} \frac{1}{2} z^{(i)T} \Lambda^{T} \Psi^{-1} \Lambda z^{(i)}+\operatorname{tr} z^{(i)T} \Lambda^{T} \Psi^{-1}\left(x^{(i)}-\mu\right)\right] \\ =& \sum_{i=1}^{n} \nabla_{\Lambda} \mathrm{E}\left[-\operatorname{tr} \frac{1}{2} \Lambda^{T} \Psi^{-1} \Lambda z^{(i)} z^{(i)T}+\operatorname{tr} \Lambda^{T} \Psi^{-1}\left(x^{(i)}-\mu\right) z^{(i)T}\right] \\ =& \sum_{i=1}^{n} \mathrm{E}\left[-\Psi^{-1} \Lambda z^{(i)} z^{(i)T}+\Psi^{-1}\left(x^{(i)}-\mu\right) z^{(i)T}\right] \end{aligned}$$
Setting this to zero and simplifying, we get:
$$\sum_{i=1}^{n} \Lambda \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[z^{(i)} z^{(i)T}\right]=\sum_{i=1}^{n}\left(x^{(i)}-\mu\right) \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[z^{(i)T}\right].$$
Hence, solving for $\Lambda$, we obtain
\begin{equation} \Lambda=\left(\sum_{i=1}^{n}\left(x^{(i)}-\mu\right) \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[z^{(i)T}\right]\right)\left(\sum_{i=1}^{n} \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[z^{(i)} z^{(i)T}\right]\right)^{-1}. \tag{7} \end{equation}
It is interesting to note the close relationship between this equation and the normal equation that we'd derived for least squares regression,
" θ T = ( y T X ) ( X T X ) 1 . " " θ T = y T X X T X 1 . " "theta^(T)=(y^(T)X)(X^(T)X)^(-1).""\theta^{T}=\left(y^{T} X\right)\left(X^{T} X\right)^{-1}."
The analogy is that here, the $x$'s are a linear function of the $z$'s (plus noise). Given the "guesses" for $z$ that the E-step has found, we will now try to estimate the unknown linearity $\Lambda$ relating the $x$'s and $z$'s. It is therefore no surprise that we obtain something similar to the normal equation. There is, however, one important difference between this and an algorithm that performs least squares using just the "best guesses" of the $z$'s; we will see this difference shortly.
To complete our M-step update, let's work out the values of the expectations in Equation (7). From our definition of $Q_{i}$ being Gaussian with mean $\mu_{z^{(i)} \mid x^{(i)}}$ and covariance $\Sigma_{z^{(i)} \mid x^{(i)}}$, we easily find
$$\begin{aligned} \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[z^{(i)T}\right] &=\mu_{z^{(i)} \mid x^{(i)}}^{T} \\ \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[z^{(i)} z^{(i)T}\right] &=\mu_{z^{(i)} \mid x^{(i)}} \mu_{z^{(i)} \mid x^{(i)}}^{T}+\Sigma_{z^{(i)} \mid x^{(i)}}. \end{aligned}$$
The latter comes from the fact that, for a random variable $Y$, $\operatorname{Cov}(Y)=\mathrm{E}\left[Y Y^{T}\right]-\mathrm{E}[Y] \mathrm{E}[Y]^{T}$, and hence $\mathrm{E}\left[Y Y^{T}\right]=\mathrm{E}[Y] \mathrm{E}[Y]^{T}+\operatorname{Cov}(Y)$. Substituting this back into Equation (7), we get the M-step update for $\Lambda$:
\begin{equation} \Lambda=\left(\sum_{i=1}^{n}\left(x^{(i)}-\mu\right) \mu_{z^{(i)} \mid x^{(i)}}^{T}\right)\left(\sum_{i=1}^{n} \mu_{z^{(i)} \mid x^{(i)}} \mu_{z^{(i)} \mid x^{(i)}}^{T}+\Sigma_{z^{(i)} \mid x^{(i)}}\right)^{-1}. \tag{8} \end{equation}
It is important to note the presence of the $\Sigma_{z^{(i)} \mid x^{(i)}}$ on the right hand side of this equation. This is the covariance in the posterior distribution $p(z^{(i)} | x^{(i)})$ of $z^{(i)}$ given $x^{(i)}$, and the M-step must take into account this uncertainty about $z^{(i)}$ in the posterior. A common mistake in deriving EM is to assume that in the E-step, we need to calculate only the expectation $\mathrm{E}[z]$ of the latent random variable $z$, and then plug that into the optimization in the M-step everywhere $z$ occurs. While this worked for simple problems such as the mixture of Gaussians, in our derivation for factor analysis, we needed $\mathrm{E}\left[z z^{T}\right]$ as well as $\mathrm{E}[z]$; and as we saw, $\mathrm{E}\left[z z^{T}\right]$ and $\mathrm{E}[z] \mathrm{E}[z]^{T}$ differ by the quantity $\Sigma_{z \mid x}$. Thus, the M-step update must take into account the covariance of $z$ in the posterior distribution $p(z^{(i)} | x^{(i)})$.
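Using the E-step quantities, Equation (8), including the $\Sigma_{z^{(i)} \mid x^{(i)}}$ term just discussed, becomes a short matrix expression. A minimal NumPy sketch, continuing the conventions of the `e_step` sketch above (where `mu_post` stacks the posterior means row-wise and `Sigma_post` is the shared posterior covariance):

```python
import numpy as np

def m_step_Lambda(X, mu, mu_post, Sigma_post):
    """M-step update for Lambda, Equation (8)."""
    n, k = mu_post.shape
    lhs = (X - mu).T @ mu_post                    # sum_i (x^(i) - mu) mu_{z|x}^T, shape (d, k)
    rhs = mu_post.T @ mu_post + n * Sigma_post    # sum_i mu_{z|x} mu_{z|x}^T + Sigma_{z|x}, shape (k, k)
    return lhs @ np.linalg.inv(rhs)
```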
Lastly, we can also find the M-step optimizations for the parameters $\mu$ and $\Psi$. It is not hard to show that the first is given by
$$\mu=\frac{1}{n} \sum_{i=1}^{n} x^{(i)}.$$
Since this doesn't change as the parameters are varied (i.e., unlike the update for $\Lambda$, the right hand side does not depend on $Q_{i}(z^{(i)})=p(z^{(i)} | x^{(i)} ; \mu, \Lambda, \Psi)$, which in turn depends on the parameters), this can be calculated just once and need not be further updated as the algorithm is run. Similarly, the diagonal $\Psi$ can be found by calculating
$$\Phi=\frac{1}{n} \sum_{i=1}^{n}\left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^{T}-\left(x^{(i)}-\mu\right) \mu_{z^{(i)} \mid x^{(i)}}^{T} \Lambda^{T}-\Lambda \mu_{z^{(i)} \mid x^{(i)}}\left(x^{(i)}-\mu\right)^{T}+\Lambda\left(\mu_{z^{(i)} \mid x^{(i)}} \mu_{z^{(i)} \mid x^{(i)}}^{T}+\Sigma_{z^{(i)} \mid x^{(i)}}\right) \Lambda^{T},$$
and setting $\Psi_{ii}=\Phi_{ii}$ (i.e., letting $\Psi$ be the diagonal matrix containing only the diagonal entries of $\Phi$).
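Putting the pieces together, here is a minimal sketch of the remaining M-step updates, using the same conventions and helper names as the earlier sketches (`m_step_mu_Psi` and the loop outline are ours, not from the notes):

```python
import numpy as np

def m_step_mu_Psi(X, Lam, mu_post, Sigma_post):
    """M-step updates for mu and the diagonal Psi, following the formulas above."""
    n, d = X.shape
    mu = X.mean(axis=0)                     # computed once; does not depend on Q_i
    Xc = X - mu
    E_zzT_sum = mu_post.T @ mu_post + n * Sigma_post      # sum_i E[z^(i) z^(i)T]
    Phi = (Xc.T @ Xc
           - Xc.T @ mu_post @ Lam.T
           - Lam @ mu_post.T @ Xc
           + Lam @ E_zzT_sum @ Lam.T) / n
    Psi_diag = np.diag(Phi)                 # keep only the diagonal entries of Phi
    return mu, Psi_diag

# One full EM iteration would then look like:
#   mu_post, Sigma_post = e_step(X, mu, Lam, Psi_diag)
#   Lam = m_step_Lambda(X, mu, mu_post, Sigma_post)
#   mu, Psi_diag = m_step_mu_Psi(X, Lam, mu_post, Sigma_post)
# repeated until the log likelihood (see fa_log_likelihood above) stops improving.
```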

  1. This is the set of points $x$ satisfying $x=\sum_{i=1}^{n} \alpha_{i} x^{(i)}$, for some $\alpha_{i}$'s so that $\sum_{i=1}^{n} \alpha_{i}=1$. ↩︎
