The Expectation Maximization Algorithm

In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables. We begin our discussion with a very useful result called Jensen's inequality.

1. Jensen's inequality

Let $f$ be a function whose domain is the set of real numbers. Recall that $f$ is a convex function if $f''(x) \geq 0$ (for all $x \in \mathbb{R}$). In the case of $f$ taking vector-valued inputs, this is generalized to the condition that its Hessian $H$ is positive semi-definite ($H \geq 0$). If $f''(x)>0$ for all $x$, then we say $f$ is strictly convex (in the vector-valued case, the corresponding statement is that $H$ must be positive definite, written $H>0$). Jensen's inequality can then be stated as follows:
Theorem. Let $f$ be a convex function, and let $X$ be a random variable. Then:
$$\mathrm{E}[f(X)] \geq f(\mathrm{E}X).$$
Moreover, if $f$ is strictly convex, then $\mathrm{E}[f(X)]=f(\mathrm{E}X)$ holds true if and only if $X=\mathrm{E}[X]$ with probability 1 (i.e., if $X$ is a constant).
Recall our convention of occasionally dropping the parentheses when writing expectations, so in the theorem above, $f(\mathrm{E}X)=f(\mathrm{E}[X])$.
For an interpretation of the theorem, consider the figure below.
Here, $f$ is a convex function shown by the solid line. Also, $X$ is a random variable that has a 0.5 chance of taking the value $a$, and a 0.5 chance of taking the value $b$ (indicated on the $x$-axis). Thus, the expected value of $X$ is given by the midpoint between $a$ and $b$.
We also see the values $f(a)$, $f(b)$ and $f(\mathrm{E}[X])$ indicated on the $y$-axis. Moreover, the value $\mathrm{E}[f(X)]$ is now the midpoint on the $y$-axis between $f(a)$ and $f(b)$. From our example, we see that because $f$ is convex, it must be the case that $\mathrm{E}[f(X)] \geq f(\mathrm{E}X)$.
Incidentally, quite a lot of people have trouble remembering which way the inequality goes, and remembering a picture like this is a good way to quickly figure out the answer.
Remark. Recall that $f$ is [strictly] concave if and only if $-f$ is [strictly] convex (i.e., $f''(x) \leq 0$ or $H \leq 0$). Jensen's inequality also holds for concave functions $f$, but with the direction of all the inequalities reversed ($\mathrm{E}[f(X)] \leq f(\mathrm{E}X)$, etc.).
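As a quick numerical sanity check, the inequality is easy to observe by simulation. The following sketch (using NumPy, with $f(x)=e^x$ as the convex function and a standard normal $X$; both choices are ours, just for illustration) compares the two sides:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp                          # e^x is convex: f''(x) = e^x > 0
X = rng.normal(size=1_000_000)      # X ~ N(0, 1)

print(f(X).mean())   # E[f(X)] ~= e^{1/2} ~= 1.649
print(f(X.mean()))   # f(E X)  ~= e^0    =  1.0, so E[f(X)] >= f(E X)
```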

2. The EM algorithm

Suppose we have an estimation problem in which we have a training set $\{x^{(1)}, \ldots, x^{(n)}\}$ consisting of $n$ independent examples. We have a latent variable model $p(x, z ; \theta)$ with $z$ being the latent variable (which for simplicity is assumed to take a finite number of values). The density for $x$ can be obtained by marginalizing over the latent variable $z$:
\begin{equation}
p(x ; \theta)=\sum_{z} p(x, z ; \theta) \tag{1}
\end{equation}
We wish to fit the parameters $\theta$ by maximizing the log-likelihood of the data, defined by
\begin{equation}
\ell(\theta)=\sum_{i=1}^{n} \log p\left(x^{(i)} ; \theta\right) \tag{2}
\end{equation}
We can rewrite the objective in terms of the joint density $p(x, z ; \theta)$ by
\begin{align}
\ell(\theta) &=\sum_{i=1}^{n} \log p\left(x^{(i)} ; \theta\right) \tag{3} \\
&=\sum_{i=1}^{n} \log \sum_{z^{(i)}} p\left(x^{(i)}, z^{(i)} ; \theta\right). \tag{4}
\end{align}
But explicitly finding the maximum likelihood estimates of the parameters $\theta$ may be hard, since it results in a difficult non-convex optimization problem.[1] Here, the $z^{(i)}$'s are the latent random variables; and it is often the case that if the $z^{(i)}$'s were observed, then maximum likelihood estimation would be easy.
In such a setting, the EM algorithm gives an efficient method for maximum likelihood estimation. Maximizing $\ell(\theta)$ explicitly might be difficult, and our strategy will be to instead repeatedly construct a lower bound on $\ell$ (E-step), and then optimize that lower bound (M-step).[2]
It turns out that the summation $\sum_{i=1}^{n}$ is not essential here, and towards a simpler exposition of the EM algorithm, we will first consider optimizing the log-likelihood $\log p(x)$ for a single example $x$. After we derive the algorithm for optimizing $\log p(x)$, we will convert it to an algorithm that works for $n$ examples by adding back the sum to each of the relevant equations. Thus, we now aim to optimize $\log p(x ; \theta)$, which can be rewritten as
\begin{equation}
\log p(x ; \theta)=\log \sum_{z} p(x, z ; \theta) \tag{5}
\end{equation}
Let $Q$ be a distribution over the possible values of $z$. That is, $\sum_{z} Q(z)=1$ and $Q(z) \geq 0$.
Consider the following:[3]
\begin{align}
\log p(x ; \theta) &=\log \sum_{z} p(x, z ; \theta) \nonumber \\
&=\log \sum_{z} Q(z) \frac{p(x, z ; \theta)}{Q(z)} \tag{6} \\
&\geq \sum_{z} Q(z) \log \frac{p(x, z ; \theta)}{Q(z)} \tag{7}
\end{align}
The last step of this derivation used Jensen's inequality. Specifically, $f(x)=\log x$ is a concave function, since $f''(x)=-1/x^{2}<0$ over its domain $x \in \mathbb{R}^{+}$. Also, the term
$$\sum_{z} Q(z)\left[\frac{p(x, z ; \theta)}{Q(z)}\right]$$
in the summation is just an expectation of the quantity $[p(x, z ; \theta) / Q(z)]$ with respect to $z$ drawn according to the distribution given by $Q$.[4] By Jensen's inequality, we have
$$f\left(\mathrm{E}_{z \sim Q}\left[\frac{p(x, z ; \theta)}{Q(z)}\right]\right) \geq \mathrm{E}_{z \sim Q}\left[f\left(\frac{p(x, z ; \theta)}{Q(z)}\right)\right],$$
where the " z Q z Q z∼Qz \sim Q" subscripts above indicate that the expectations are with respect to z z zz drawn from Q Q QQ. This allowed us to go from Equation (6) to Equation (7).
Now, for any distribution $Q$, the formula (7) gives a lower bound on $\log p(x ; \theta)$. There are many possible choices for the $Q$'s. Which should we choose? Well, if we have some current guess $\theta$ of the parameters, it seems natural to try to make the lower bound tight at that value of $\theta$. I.e., we will make the inequality above hold with equality at our particular value of $\theta$.
To make the bound tight for a particular value of $\theta$, we need the step involving Jensen's inequality in our derivation above to hold with equality.
For this to be true, we know it is sufficient that the expectation be taken over a "constant"-valued random variable. I.e., we require that
$$\frac{p(x, z ; \theta)}{Q(z)}=c$$
for some constant $c$ that does not depend on $z$. This is easily accomplished by choosing
$$Q(z) \propto p(x, z ; \theta).$$
Actually, since we know $\sum_{z} Q(z)=1$ (because it is a distribution), this further tells us that
\begin{align}
Q(z) &=\frac{p(x, z ; \theta)}{\sum_{z} p(x, z ; \theta)} \nonumber \\
&=\frac{p(x, z ; \theta)}{p(x ; \theta)} \nonumber \\
&=p(z \mid x ; \theta) \tag{8}
\end{align}
Thus, we simply set the $Q$'s to be the posterior distribution of the $z$'s given $x$ and the setting of the parameters $\theta$.
Indeed, we can directly verify that when $Q(z)=p(z \mid x ; \theta)$, then equation (7) is an equality because
$$
\begin{aligned}
\sum_{z} Q(z) \log \frac{p(x, z ; \theta)}{Q(z)} &=\sum_{z} p(z \mid x ; \theta) \log \frac{p(x, z ; \theta)}{p(z \mid x ; \theta)} \\
&=\sum_{z} p(z \mid x ; \theta) \log \frac{p(z \mid x ; \theta) p(x ; \theta)}{p(z \mid x ; \theta)} \\
&=\sum_{z} p(z \mid x ; \theta) \log p(x ; \theta) \\
&=\log p(x ; \theta) \sum_{z} p(z \mid x ; \theta) \\
&=\log p(x ; \theta) \qquad \left(\text{because } \textstyle\sum_{z} p(z \mid x ; \theta)=1\right)
\end{aligned}
$$
For convenience, we call the expression in Equation (7) the evidence lower bound (ELBO) and we denote it by
\begin{equation}
\operatorname{ELBO}(x ; Q, \theta)=\sum_{z} Q(z) \log \frac{p(x, z ; \theta)}{Q(z)} \tag{9}
\end{equation}
With this equation, we can re-write equation (7) as
\begin{equation}
\forall\, Q, \theta, x, \quad \log p(x ; \theta) \geq \operatorname{ELBO}(x ; Q, \theta) \tag{10}
\end{equation}
Intuitively, the EM algorithm alternately updates $Q$ and $\theta$ by (a) setting $Q(z)=p(z \mid x ; \theta)$ following Equation (8) so that $\operatorname{ELBO}(x ; Q, \theta)=\log p(x ; \theta)$ for $x$ and the current $\theta$, and (b) maximizing $\operatorname{ELBO}(x ; Q, \theta)$ w.r.t. $\theta$ while fixing the choice of $Q$.
Recall that all the discussion above was under the assumption that we aim to optimize the log-likelihood $\log p(x ; \theta)$ for a single example $x$. It turns out that with multiple training examples, the basic idea is the same; we only need to take a sum over examples at the relevant places. Next, we will build the evidence lower bound for multiple training examples and make the EM algorithm formal.
Recall we have a training set $\{x^{(1)}, \ldots, x^{(n)}\}$. Note that the optimal choice of $Q$ is $p(z \mid x ; \theta)$, which depends on the particular example $x$. Therefore, we will introduce $n$ distributions $Q_{1}, \ldots, Q_{n}$, one for each example $x^{(i)}$. For each example $x^{(i)}$, we can build the evidence lower bound
$$\log p\left(x^{(i)} ; \theta\right) \geq \operatorname{ELBO}\left(x^{(i)} ; Q_{i}, \theta\right)=\sum_{z^{(i)}} Q_{i}\left(z^{(i)}\right) \log \frac{p\left(x^{(i)}, z^{(i)} ; \theta\right)}{Q_{i}\left(z^{(i)}\right)}$$
Summing over all the examples, we obtain a lower bound on the log-likelihood
\begin{align}
\ell(\theta) &\geq \sum_{i} \operatorname{ELBO}\left(x^{(i)} ; Q_{i}, \theta\right) \tag{11} \\
&=\sum_{i} \sum_{z^{(i)}} Q_{i}\left(z^{(i)}\right) \log \frac{p\left(x^{(i)}, z^{(i)} ; \theta\right)}{Q_{i}\left(z^{(i)}\right)} \nonumber
\end{align}
For any set of distributions $Q_{1}, \ldots, Q_{n}$, the formula (11) gives a lower bound on $\ell(\theta)$, and analogous to the argument around equation (8), the $Q_{i}$ that attains equality satisfies
$$Q_{i}\left(z^{(i)}\right)=p\left(z^{(i)} \mid x^{(i)} ; \theta\right)$$
Thus, we simply set the $Q_{i}$'s to be the posterior distribution of the $z^{(i)}$'s given $x^{(i)}$ with the current setting of the parameters $\theta$.
Now, for this choice of the $Q_{i}$'s, Equation (11) gives a lower bound on the log-likelihood $\ell$ that we're trying to maximize. This is the E-step. In the M-step of the algorithm, we then maximize our formula in Equation (11) with respect to the parameters to obtain a new setting of the $\theta$'s. Repeatedly carrying out these two steps gives us the EM algorithm, which is as follows:
Repeat until convergence {
(E-step) For each $i$, set
$$Q_{i}\left(z^{(i)}\right):=p\left(z^{(i)} \mid x^{(i)} ; \theta\right).$$
(M-step) Set
\begin{align}
\theta &:=\arg \max_{\theta} \sum_{i=1}^{n} \operatorname{ELBO}\left(x^{(i)} ; Q_{i}, \theta\right) \nonumber \\
&=\arg \max_{\theta} \sum_{i} \sum_{z^{(i)}} Q_{i}\left(z^{(i)}\right) \log \frac{p\left(x^{(i)}, z^{(i)} ; \theta\right)}{Q_{i}\left(z^{(i)}\right)}. \tag{12}
\end{align}
}
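To make the structure of the loop concrete, here is a minimal Python sketch of a generic EM driver. The callables `e_step`, `m_step`, and `log_likelihood` are placeholders for a model-specific implementation (they are our naming, not part of the notes):

```python
def em(theta, e_step, m_step, log_likelihood, tol=1e-6, max_iter=1000):
    """Generic EM loop (a sketch; the three callables are model-specific).

    e_step(theta)         -> Q, the posteriors p(z^(i) | x^(i); theta)
    m_step(Q)             -> argmax_theta of sum_i ELBO(x^(i); Q_i, theta)
    log_likelihood(theta) -> ell(theta), used only for the stopping test
    """
    prev = log_likelihood(theta)
    for _ in range(max_iter):
        Q = e_step(theta)            # E-step: make the bound tight at theta
        theta = m_step(Q)            # M-step: maximize the lower bound
        cur = log_likelihood(theta)
        if cur - prev < tol:         # ell increases monotonically (shown below)
            break
        prev = cur
    return theta
```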
How do we know if this algorithm will converge? Well, suppose $\theta^{(t)}$ and $\theta^{(t+1)}$ are the parameters from two successive iterations of EM. We will now prove that $\ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)})$, which shows EM always monotonically improves the log-likelihood. The key to showing this result lies in our choice of the $Q_{i}$'s. Specifically, on the iteration of EM in which the parameters had started out as $\theta^{(t)}$, we would have chosen $Q_{i}^{(t)}(z^{(i)}) := p(z^{(i)} \mid x^{(i)} ; \theta^{(t)})$. We saw earlier that this choice ensures that Jensen's inequality, as applied to get Equation (11), holds with equality, and hence
\begin{equation}
\ell\left(\theta^{(t)}\right)=\sum_{i=1}^{n} \operatorname{ELBO}\left(x^{(i)} ; Q_{i}^{(t)}, \theta^{(t)}\right) \tag{13}
\end{equation}
The parameters $\theta^{(t+1)}$ are then obtained by maximizing the right hand side of the equation above. Thus,
$$
\begin{aligned}
\ell\left(\theta^{(t+1)}\right) & \geq \sum_{i=1}^{n} \operatorname{ELBO}\left(x^{(i)} ; Q_{i}^{(t)}, \theta^{(t+1)}\right) && (\text{because inequality (11) holds for all } Q \text{ and } \theta) \\
& \geq \sum_{i=1}^{n} \operatorname{ELBO}\left(x^{(i)} ; Q_{i}^{(t)}, \theta^{(t)}\right) && \text{(see reason below)} \\
& =\ell\left(\theta^{(t)}\right) && \text{(by equation (13))}
\end{aligned}
$$
where the last inequality follows from the fact that $\theta^{(t+1)}$ is chosen explicitly to be
$$\arg \max_{\theta} \sum_{i=1}^{n} \operatorname{ELBO}\left(x^{(i)} ; Q_{i}^{(t)}, \theta\right)$$
Hence, EM causes the likelihood to converge monotonically. In our description of the EM algorithm, we said we'd run it until convergence. Given the result that we just showed, one reasonable convergence test would be to check if the increase in $\ell(\theta)$ between successive iterations is smaller than some tolerance parameter, and to declare convergence if EM is improving $\ell(\theta)$ too slowly.
Remark. If we define (by overloading $\operatorname{ELBO}(\cdot)$)
\begin{equation}
\operatorname{ELBO}(Q, \theta)=\sum_{i=1}^{n} \operatorname{ELBO}\left(x^{(i)} ; Q_{i}, \theta\right)=\sum_{i} \sum_{z^{(i)}} Q_{i}\left(z^{(i)}\right) \log \frac{p\left(x^{(i)}, z^{(i)} ; \theta\right)}{Q_{i}\left(z^{(i)}\right)} \tag{14}
\end{equation}
then we know $\ell(\theta) \geq \operatorname{ELBO}(Q, \theta)$ from our previous derivation. EM can also be viewed as an alternating maximization algorithm on $\operatorname{ELBO}(Q, \theta)$, in which the E-step maximizes it with respect to $Q$ (check this yourself), and the M-step maximizes it with respect to $\theta$.

2.1. Other interpretations of ELBO

Let $\operatorname{ELBO}(x ; Q, \theta)=\sum_{z} Q(z) \log \frac{p(x, z ; \theta)}{Q(z)}$ be defined as in equation (9). There are several other forms of the ELBO. First, we can rewrite
\begin{align}
\operatorname{ELBO}(x ; Q, \theta) &=\mathrm{E}_{z \sim Q}[\log p(x, z ; \theta)]-\mathrm{E}_{z \sim Q}[\log Q(z)] \nonumber \\
&=\mathrm{E}_{z \sim Q}[\log p(x \mid z ; \theta)]-D_{KL}\left(Q \| p_{z}\right) \tag{15}
\end{align}
where we use $p_{z}$ to denote the marginal distribution of $z$ (under the distribution $p(x, z ; \theta)$), and $D_{KL}(\cdot\|\cdot)$ denotes the KL divergence
\begin{equation}
D_{KL}\left(Q \| p_{z}\right)=\sum_{z} Q(z) \log \frac{Q(z)}{p_{z}(z)} \tag{16}
\end{equation}
In many cases, the marginal distribution of $z$ does not depend on the parameter $\theta$. In this case, we can see that maximizing the ELBO over $\theta$ is equivalent to maximizing the first term in (15). This corresponds to maximizing the conditional likelihood of $x$ conditioned on $z$, which is often a simpler problem than the original one.
Another form of $\operatorname{ELBO}(\cdot)$ is (please verify yourself)
\begin{equation}
\operatorname{ELBO}(x ; Q, \theta)=\log p(x)-D_{KL}\left(Q \| p_{z \mid x}\right) \tag{17}
\end{equation}
where $p_{z \mid x}$ is the conditional distribution of $z$ given $x$ under the parameter $\theta$. This form shows that the maximizer of $\operatorname{ELBO}(Q, \theta)$ over $Q$ is obtained when $Q=p_{z \mid x}$, as was shown in equation (8) before.
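For reference, one way to verify (17) is to write $p(x, z ; \theta)=p(z \mid x ; \theta)\, p(x ; \theta)$ inside the logarithm, in the same spirit as the computation below equation (8):

$$
\begin{aligned}
\operatorname{ELBO}(x ; Q, \theta) &=\sum_{z} Q(z) \log \frac{p(z \mid x ; \theta)\, p(x ; \theta)}{Q(z)} \\
&=\sum_{z} Q(z) \log p(x ; \theta)-\sum_{z} Q(z) \log \frac{Q(z)}{p(z \mid x ; \theta)} \\
&=\log p(x ; \theta)-D_{KL}\left(Q \| p_{z \mid x}\right).
\end{aligned}
$$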

3. Mixture of Gaussians revisited

Armed with our general definition of the EM algorithm, let's go back to our old example of fitting the parameters $\phi$, $\mu$ and $\Sigma$ in a mixture of Gaussians. For the sake of brevity, we carry out the derivations for the M-step updates only for $\phi$ and $\mu_{j}$, and leave the updates for $\Sigma_{j}$ as an exercise for the reader.
The E-step is easy. Following our algorithm derivation above, we simply calculate
$$w_{j}^{(i)}=Q_{i}\left(z^{(i)}=j\right)=P\left(z^{(i)}=j \mid x^{(i)} ; \phi, \mu, \Sigma\right).$$
Here, " Q i ( z ( i ) = j ) Q i z ( i ) = j Q_(i)(z^((i))=j)Q_{i}\left(z^{(i)}=j\right)" denotes the probability of z ( i ) z ( i ) z^((i))z^{(i)} taking the value j j jj under the distribution Q i Q i Q_(i)Q_{i}.
Next, in the M-step, we need to maximize, with respect to our parameters $\phi, \mu, \Sigma$, the quantity
$$
\begin{aligned}
\sum_{i=1}^{n} \sum_{z^{(i)}} Q_{i}\left(z^{(i)}\right) \log \frac{p\left(x^{(i)}, z^{(i)} ; \phi, \mu, \Sigma\right)}{Q_{i}\left(z^{(i)}\right)}
&=\sum_{i=1}^{n} \sum_{j=1}^{k} Q_{i}\left(z^{(i)}=j\right) \log \frac{p\left(x^{(i)} \mid z^{(i)}=j ; \mu, \Sigma\right) p\left(z^{(i)}=j ; \phi\right)}{Q_{i}\left(z^{(i)}=j\right)} \\
&=\sum_{i=1}^{n} \sum_{j=1}^{k} w_{j}^{(i)} \log \frac{\frac{1}{(2 \pi)^{d / 2}\left|\Sigma_{j}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x^{(i)}-\mu_{j}\right)^{T} \Sigma_{j}^{-1}\left(x^{(i)}-\mu_{j}\right)\right) \cdot \phi_{j}}{w_{j}^{(i)}}
\end{aligned}
$$
Let's maximize this with respect to $\mu_{l}$. If we take the derivative with respect to $\mu_{l}$, we find
$$
\begin{aligned}
\nabla_{\mu_{l}} \sum_{i=1}^{n} \sum_{j=1}^{k} w_{j}^{(i)} \log \frac{\frac{1}{(2 \pi)^{d / 2}\left|\Sigma_{j}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x^{(i)}-\mu_{j}\right)^{T} \Sigma_{j}^{-1}\left(x^{(i)}-\mu_{j}\right)\right) \cdot \phi_{j}}{w_{j}^{(i)}}
&=-\nabla_{\mu_{l}} \sum_{i=1}^{n} \sum_{j=1}^{k} w_{j}^{(i)} \frac{1}{2}\left(x^{(i)}-\mu_{j}\right)^{T} \Sigma_{j}^{-1}\left(x^{(i)}-\mu_{j}\right) \\
&=\frac{1}{2} \sum_{i=1}^{n} w_{l}^{(i)} \nabla_{\mu_{l}}\left(2 \mu_{l}^{T} \Sigma_{l}^{-1} x^{(i)}-\mu_{l}^{T} \Sigma_{l}^{-1} \mu_{l}\right) \\
&=\sum_{i=1}^{n} w_{l}^{(i)}\left(\Sigma_{l}^{-1} x^{(i)}-\Sigma_{l}^{-1} \mu_{l}\right)
\end{aligned}
$$
Setting this to zero and solving for $\mu_{l}$ therefore yields the update rule
$$\mu_{l}:=\frac{\sum_{i=1}^{n} w_{l}^{(i)} x^{(i)}}{\sum_{i=1}^{n} w_{l}^{(i)}},$$
which was what we had in the previous set of notes.
Let's do one more example, and derive the M-step update for the parameters $\phi_{j}$. Grouping together only the terms that depend on $\phi_{j}$, we find that we need to maximize
$$\sum_{i=1}^{n} \sum_{j=1}^{k} w_{j}^{(i)} \log \phi_{j}.$$
However, there is an additional constraint that the $\phi_{j}$'s sum to 1, since they represent the probabilities $\phi_{j}=p(z^{(i)}=j ; \phi)$. To deal with the constraint that $\sum_{j=1}^{k} \phi_{j}=1$, we construct the Lagrangian
$$\mathcal{L}(\phi)=\sum_{i=1}^{n} \sum_{j=1}^{k} w_{j}^{(i)} \log \phi_{j}+\beta\left(\sum_{j=1}^{k} \phi_{j}-1\right),$$
where $\beta$ is the Lagrange multiplier.[5] Taking derivatives, we find
$$\frac{\partial}{\partial \phi_{j}} \mathcal{L}(\phi)=\sum_{i=1}^{n} \frac{w_{j}^{(i)}}{\phi_{j}}+\beta$$
Setting this to zero and solving, we get
$$\phi_{j}=\frac{\sum_{i=1}^{n} w_{j}^{(i)}}{-\beta}$$
I.e., $\phi_{j} \propto \sum_{i=1}^{n} w_{j}^{(i)}$. Using the constraint that $\sum_{j} \phi_{j}=1$, we easily find that $-\beta=\sum_{i=1}^{n} \sum_{j=1}^{k} w_{j}^{(i)}=\sum_{i=1}^{n} 1=n$. (This used the fact that $w_{j}^{(i)}=Q_{i}(z^{(i)}=j)$, and since probabilities sum to 1, $\sum_{j} w_{j}^{(i)}=1$.) We therefore have our M-step updates for the parameters $\phi_{j}$:
$$\phi_{j}:=\frac{1}{n} \sum_{i=1}^{n} w_{j}^{(i)}.$$
The derivation of the M-step update for $\Sigma_{j}$ is also entirely straightforward.
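Putting the updates together, here is a matching NumPy sketch of the M-step. The $\phi_j$ and $\mu_j$ updates follow the derivations above; the $\Sigma_j$ update (whose derivation the text leaves as an exercise) is the standard weighted-covariance result from the previous set of notes, included here for completeness:

```python
import numpy as np

def m_step(X, w):
    """M-step given responsibilities w of shape (n, k) from the E-step."""
    n, d = X.shape
    k = w.shape[1]
    phi = w.mean(axis=0)                      # phi_j = (1/n) sum_i w_j^(i)
    mu = (w.T @ X) / w.sum(axis=0)[:, None]   # weighted means, as derived above
    Sigma = np.zeros((k, d, d))
    for j in range(k):
        diff = X - mu[j]                      # (n, d) residuals from mu_j
        Sigma[j] = (w[:, j, None] * diff).T @ diff / w[:, j].sum()
    return phi, mu, Sigma
```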

4. Variational inference and variational auto-encoder

Loosely speaking, the variational auto-encoder [2] generally refers to a family of algorithms that extend the EM algorithm to more complex models parameterized by neural networks. It extends the technique of variational inference with the additional "re-parameterization trick," which will be introduced below. The variational auto-encoder may not give the best performance for many datasets, but it contains several central ideas about how to extend EM algorithms to high-dimensional continuous latent variables with non-linear models. Understanding it will likely give you the language and background to understand various recent papers related to it.
As a running example, we will consider the following parameterization of $p(x, z ; \theta)$ by a neural network. Let $\theta$ be the collection of the weights of a neural network $g(z ; \theta)$ that maps $z \in \mathbb{R}^{k}$ to $\mathbb{R}^{d}$. Let
\begin{align}
z &\sim \mathcal{N}\left(0, I_{k \times k}\right) \tag{18} \\
x \mid z &\sim \mathcal{N}\left(g(z ; \theta), \sigma^{2} I_{d \times d}\right) \tag{19}
\end{align}
Here $I_{k \times k}$ denotes the identity matrix of dimension $k$ by $k$, and $\sigma$ is a scalar that we assume to be known for simplicity.
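Sampling from this generative model is straightforward; below is a small NumPy sketch in which the decoder $g$ is a hypothetical one-hidden-layer network whose randomly initialized weights stand in for trained parameters $\theta$ (all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, sigma = 2, 5, 0.1

# Hypothetical decoder g(z; theta): the weights below are placeholders for theta.
W1, b1 = rng.standard_normal((16, k)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((d, 16)), rng.standard_normal(d)

def g(z):
    return W2 @ np.tanh(W1 @ z + b1) + b2

z = rng.standard_normal(k)                  # z ~ N(0, I_{k x k}), eqn (18)
x = g(z) + sigma * rng.standard_normal(d)   # x | z ~ N(g(z), sigma^2 I), eqn (19)
```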
For the Gaussian mixture models in Section 3, the optimal choice of $Q(z)=p(z \mid x ; \theta)$ for each fixed $\theta$, that is, the posterior distribution of $z$, can be computed analytically. In many more complex models, such as the model (18)-(19), it is intractable to compute the exact posterior distribution $p(z \mid x ; \theta)$.
Recall from equation (10) that the ELBO is always a lower bound for any choice of $Q$; therefore, we can instead aim to find an approximation of the true posterior distribution. Often, one has to use some particular form of $Q$ to approximate the true posterior distribution. Let $\mathcal{Q}$ be a family of $Q$'s that we are considering, and we will aim to find a $Q$ within the family $\mathcal{Q}$ that is closest to the true posterior distribution. To formalize, recall the definition of the ELBO as a function of $Q$ and $\theta$ defined in equation (14):
$$\operatorname{ELBO}(Q, \theta)=\sum_{i=1}^{n} \operatorname{ELBO}\left(x^{(i)} ; Q_{i}, \theta\right)=\sum_{i} \sum_{z^{(i)}} Q_{i}\left(z^{(i)}\right) \log \frac{p\left(x^{(i)}, z^{(i)} ; \theta\right)}{Q_{i}\left(z^{(i)}\right)}$$
Recall that EM can be viewed as alternating maximization of $\operatorname{ELBO}(Q, \theta)$. Here instead, we optimize the ELBO over $Q \in \mathcal{Q}$:
\begin{equation}
\max_{Q \in \mathcal{Q}} \max_{\theta} \operatorname{ELBO}(Q, \theta) \tag{20}
\end{equation}
Now the next question is what form of $Q$ (or what structural assumptions to make about $Q$) allows us to efficiently maximize the objective above. When the latent variable $z$ is a high-dimensional discrete variable, one popular assumption is the mean field assumption, which assumes that $Q_{i}(z)$ gives a distribution with independent coordinates; in other words, $Q_{i}$ can be decomposed into $Q_{i}(z)=Q_{i}^{1}(z_{1}) \cdots Q_{i}^{k}(z_{k})$. There are tremendous applications of mean field assumptions to learning generative models with discrete latent variables, and we refer to [1] for a survey of these models and their impact on a wide range of applications, including computational biology, computational neuroscience, and the social sciences. We will not get into the details of the discrete latent variable case; our main focus is to deal with continuous latent variables, which requires not only mean field assumptions but additional techniques.
When $z \in \mathbb{R}^{k}$ is a continuous latent variable, there are several decisions to make towards successfully optimizing (20). First, we need a succinct representation of the distribution $Q_{i}$, because it is over an infinite number of points. A natural choice is to assume $Q_{i}$ is a Gaussian distribution with some mean and variance. We would also like a more succinct representation of the means of the $Q_{i}$'s across all the examples. Note that $Q_{i}(z^{(i)})$ is supposed to approximate $p(z^{(i)} \mid x^{(i)} ; \theta)$. It would therefore make sense to let the mean of each $Q_{i}$ be some function of $x^{(i)}$. Concretely, let $q(\cdot ; \phi)$ and $v(\cdot ; \psi)$ be two functions mapping from dimension $d$ to $k$, parameterized by $\phi$ and $\psi$. We assume that
\begin{equation}
Q_{i}=\mathcal{N}\left(q\left(x^{(i)} ; \phi\right), \operatorname{diag}\left(v\left(x^{(i)} ; \psi\right)\right)^{2}\right) \tag{21}
\end{equation}
Here $\operatorname{diag}(w)$ denotes the $k \times k$ matrix with the entries of $w \in \mathbb{R}^{k}$ on the diagonal. In other words, the distribution $Q_{i}$ is assumed to be a Gaussian distribution with independent coordinates, with the mean and standard deviations governed by $q$ and $v$. Often in the variational auto-encoder, $q$ and $v$ are chosen to be neural networks.[6] In recent deep learning literature, $q$ and $v$ are often called the encoder (in the sense of encoding the data into latent code), whereas $g(z ; \theta)$ is often referred to as the decoder.
We remark that a $Q_{i}$ of this form is in many cases very far from a good approximation of the true posterior distribution. However, some approximation is necessary for feasible optimization. In fact, the form of $Q_{i}$ needs to satisfy other requirements, which happen to be satisfied by the form (21).
Before optimizing the ELBO, let's first verify whether we can efficiently evaluate the value of the ELBO for fixed $Q$ of the form (21) and $\theta$. We rewrite the ELBO as a function of $\phi, \psi, \theta$ by
\begin{align}
\operatorname{ELBO}(\phi, \psi, \theta) &=\sum_{i=1}^{n} \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[\log \frac{p\left(x^{(i)}, z^{(i)} ; \theta\right)}{Q_{i}\left(z^{(i)}\right)}\right], \tag{22} \\
&\qquad \text{where } Q_{i}=\mathcal{N}\left(q\left(x^{(i)} ; \phi\right), \operatorname{diag}\left(v\left(x^{(i)} ; \psi\right)\right)^{2}\right) \nonumber
\end{align}
Note that to evaluate $Q_{i}(z^{(i)})$ inside the expectation, we need to be able to compute the density of $Q_{i}$. To estimate the expectation $\mathrm{E}_{z^{(i)} \sim Q_{i}}$, we need to be able to sample from the distribution $Q_{i}$ so that we can build an empirical estimator from samples. It happens that for the Gaussian distribution $Q_{i}=\mathcal{N}(q(x^{(i)} ; \phi), \operatorname{diag}(v(x^{(i)} ; \psi))^{2})$, we can do both efficiently.
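Concretely, both operations reduce to a few lines. The sketch below estimates one term of (22) by sampling, with `log_p_joint` a stand-in (our naming) for $\log p(x^{(i)}, z^{(i)} ; \theta)$ of whatever model is being fit:

```python
import numpy as np

def elbo_term(q_mean, q_std, log_p_joint, num_samples=100, rng=None):
    """Monte Carlo estimate of E_{z ~ Q_i}[log p(x, z; theta) - log Q_i(z)],
    where Q_i = N(q_mean, diag(q_std)^2) as in equation (21)."""
    rng = rng or np.random.default_rng(0)
    k, total = q_mean.shape[0], 0.0
    for _ in range(num_samples):
        z = q_mean + q_std * rng.standard_normal(k)   # sample z ~ Q_i
        # log-density of the diagonal Gaussian Q_i evaluated at z
        log_q = (-0.5 * np.sum(((z - q_mean) / q_std) ** 2)
                 - np.sum(np.log(q_std)) - 0.5 * k * np.log(2 * np.pi))
        total += log_p_joint(z) - log_q
    return total / num_samples
```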
Now let's optimize the ELBO. It turns out that we can run gradient ascent over $\phi, \psi, \theta$ instead of alternating maximization; there is no strong need to compute the maximum over each variable at a much greater cost. (For the Gaussian mixture model in Section 3, computing the maximum is analytically feasible and relatively cheap, and therefore we did alternating maximization.) Mathematically, letting $\eta$ be the learning rate, the gradient ascent step is
$$
\begin{aligned}
\theta &:=\theta+\eta \nabla_{\theta} \operatorname{ELBO}(\phi, \psi, \theta) \\
\phi &:=\phi+\eta \nabla_{\phi} \operatorname{ELBO}(\phi, \psi, \theta) \\
\psi &:=\psi+\eta \nabla_{\psi} \operatorname{ELBO}(\phi, \psi, \theta)
\end{aligned}
$$
Computing the gradient over $\theta$ is simple because
\begin{align}
\nabla_{\theta} \operatorname{ELBO}(\phi, \psi, \theta) &=\nabla_{\theta} \sum_{i=1}^{n} \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[\log \frac{p\left(x^{(i)}, z^{(i)} ; \theta\right)}{Q_{i}\left(z^{(i)}\right)}\right] \nonumber \\
&=\nabla_{\theta} \sum_{i=1}^{n} \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[\log p\left(x^{(i)}, z^{(i)} ; \theta\right)\right] \nonumber \\
&=\sum_{i=1}^{n} \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[\nabla_{\theta} \log p\left(x^{(i)}, z^{(i)} ; \theta\right)\right] \tag{23}
\end{align}
But computing the gradients over $\phi$ and $\psi$ is tricky, because the sampling distribution $Q_{i}$ depends on $\phi$ and $\psi$. (Abstractly speaking, the issue we face can be simplified to the problem of computing the gradient of $\mathrm{E}_{z \sim Q_{\phi}}[f(\phi)]$ with respect to the variable $\phi$. We know that in general, $\nabla \mathrm{E}_{z \sim Q_{\phi}}[f(\phi)] \neq \mathrm{E}_{z \sim Q_{\phi}}[\nabla f(\phi)]$, because the dependency of $Q_{\phi}$ on $\phi$ has to be taken into account as well.)
The idea that comes to the rescue is the so-called re-parameterization trick: we rewrite $z^{(i)} \sim Q_{i}=\mathcal{N}\left(q\left(x^{(i)} ; \phi\right), \operatorname{diag}\left(v\left(x^{(i)} ; \psi\right)\right)^{2}\right)$ in an equivalent way:
\begin{equation}
z^{(i)}=q\left(x^{(i)} ; \phi\right)+v\left(x^{(i)} ; \psi\right) \odot \xi^{(i)} \quad \text{where } \xi^{(i)} \sim \mathcal{N}\left(0, I_{k \times k}\right) \tag{24}
\end{equation}
Here $x \odot y$ denotes the entry-wise product of two vectors of the same dimension, and we used the fact that $x \sim \mathcal{N}(\mu, \sigma^{2})$ is equivalent to $x=\mu+\xi \sigma$ with $\xi \sim \mathcal{N}(0,1)$. We simply apply this fact in every dimension simultaneously for the random variable $z^{(i)} \sim Q_{i}$.
With this re-parameterization, we have that
\begin{align}
\mathrm{E}_{z^{(i)} \sim Q_{i}}\left[\log \frac{p\left(x^{(i)}, z^{(i)} ; \theta\right)}{Q_{i}\left(z^{(i)}\right)}\right] =\mathrm{E}_{\xi^{(i)} \sim \mathcal{N}(0,1)}\left[\log \frac{p\left(x^{(i)}, q\left(x^{(i)} ; \phi\right)+v\left(x^{(i)} ; \psi\right) \odot \xi^{(i)} ; \theta\right)}{Q_{i}\left(q\left(x^{(i)} ; \phi\right)+v\left(x^{(i)} ; \psi\right) \odot \xi^{(i)}\right)}\right] \tag{25}
\end{align}
It follows that
$$
\begin{aligned}
&\nabla_{\phi} \mathrm{E}_{z^{(i)} \sim Q_{i}}\left[\log \frac{p\left(x^{(i)}, z^{(i)} ; \theta\right)}{Q_{i}\left(z^{(i)}\right)}\right] \\
&=\nabla_{\phi} \mathrm{E}_{\xi^{(i)} \sim \mathcal{N}(0,1)}\left[\log \frac{p\left(x^{(i)}, q\left(x^{(i)} ; \phi\right)+v\left(x^{(i)} ; \psi\right) \odot \xi^{(i)} ; \theta\right)}{Q_{i}\left(q\left(x^{(i)} ; \phi\right)+v\left(x^{(i)} ; \psi\right) \odot \xi^{(i)}\right)}\right] \\
&=\mathrm{E}_{\xi^{(i)} \sim \mathcal{N}(0,1)}\left[\nabla_{\phi} \log \frac{p\left(x^{(i)}, q\left(x^{(i)} ; \phi\right)+v\left(x^{(i)} ; \psi\right) \odot \xi^{(i)} ; \theta\right)}{Q_{i}\left(q\left(x^{(i)} ; \phi\right)+v\left(x^{(i)} ; \psi\right) \odot \xi^{(i)}\right)}\right]
\end{aligned}
$$
We can now sample multiple copies of the $\xi^{(i)}$'s to estimate the expectation on the RHS of the equation above.[7] We can estimate the gradient with respect to $\psi$ similarly, and with these, we can implement the gradient ascent algorithm to optimize the ELBO over $\phi, \psi, \theta$.
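As a self-contained illustration of why the trick helps, the toy PyTorch snippet below estimates $\nabla_{\phi}\, \mathrm{E}_{z \sim \mathcal{N}(\phi, 1)}[z^{2}]$ (whose true value is $2\phi$) by differentiating through re-parameterized samples. It is a toy example of the technique, not the full variational auto-encoder:

```python
import torch

phi = torch.tensor(1.5, requires_grad=True)

# Re-parameterize z ~ N(phi, 1) as z = phi + xi with xi ~ N(0, 1): the sampling
# noise xi no longer depends on phi, so gradients flow through the samples.
xi = torch.randn(100_000)
z = phi + xi
objective = (z ** 2).mean()       # Monte Carlo estimate of E[z^2]
objective.backward()
print(phi.grad)                   # ~= 2 * phi = 3.0
```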
Not many other high-dimensional distributions with analytically computable density functions are known to be re-parameterizable. We refer to [2] for a few other choices that can replace the Gaussian distribution.

References

[1] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.
[2] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

  1. It's mostly an empirical observation that the resulting optimization problem is difficult to solve.
  2. Empirically, the E-step and M-step can often be computed more efficiently than optimizing the function $\ell(\cdot)$ directly. However, it doesn't necessarily mean that alternating the two steps always converges to the global optimum of $\ell(\cdot)$. Even for a mixture of Gaussians, the EM algorithm can either converge to a global optimum or get stuck, depending on the properties of the training data. Empirically, for real-world data, EM often converges to a solution with relatively high likelihood (if not the optimum), and the theory behind this is still largely not understood.
  3. If $z$ were continuous, then $Q$ would be a density, and the summations over $z$ in our discussion are replaced with integrals over $z$.
  4. We note that the notation $\frac{p(x, z ; \theta)}{Q(z)}$ only makes sense if $Q(z) \neq 0$ whenever $p(x, z ; \theta) \neq 0$. Here we implicitly assume that we only consider those $Q$ with such a property.
  5. We don't need to worry about the constraint that $\phi_{j} \geq 0$, because as we'll shortly see, the solution we'll find from this derivation will automatically satisfy that anyway.
  6. $q$ and $v$ can also share parameters. We sweep this level of detail under the rug in this note.
  7. Empirically, people sometimes just use one sample to estimate it, for maximum computational efficiency.
