More on Multivariate Gaussians

This is the second of a two-part overview of multivariate Gaussians from Andrew Ng's CS229 course on machine learning. Click here to read part one.
Up to this point in class, you have seen multivariate Gaussians arise in a number of applications, such as the probabilistic interpretation of linear regression, Gaussian discriminant analysis, mixture of Gaussians clustering, and most recently, factor analysis. In these lecture notes, we attempt to demystify some of the fancier properties of multivariate Gaussians that were introduced in the recent factor analysis lecture. The goal of these notes is to give you some intuition into where these properties come from, so that you can use them with confidence on your homework (hint hint!) and beyond.

1. Definition

A vector-valued random variable $x \in \mathbf{R}^{n}$ is said to have a multivariate normal (or Gaussian) distribution with mean $\mu \in \mathbf{R}^{n}$ and covariance matrix $\Sigma \in \mathbf{S}_{++}^{n}$[1] if its probability density function is given by
$$p(x ; \mu, \Sigma)=\frac{1}{(2 \pi)^{n / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right).$$
We write this as $x \sim \mathcal{N}(\mu, \Sigma)$.
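If you want to see the formula in action, here is a small NumPy/SciPy sketch (not part of the original notes; it assumes `numpy` and `scipy` are installed, and the variable names are ours) that evaluates the density from the definition above and compares it against `scipy.stats.multivariate_normal`:

```python
# A minimal sketch: evaluate p(x; mu, Sigma) from the definition and compare
# against SciPy's implementation as a sanity check.
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, computed directly from the formula above."""
    n = mu.shape[0]
    diff = x - mu
    norm_const = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite
x = np.array([0.3, -1.0])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # the two should agree
```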

2. Gaussian facts

Multivariate Gaussians turn out to be extremely handy in practice due to the following facts:
  • Fact #1: If you know the mean $\mu$ and covariance matrix $\Sigma$ of a Gaussian random variable $x$, you can write down the probability density function for $x$ directly.
  • Fact #2: The following Gaussian integrals have closed-form solutions:
$$\begin{aligned} \int_{x \in \mathbf{R}^{n}} p(x ; \mu, \Sigma)\, d x &=\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(x ; \mu, \Sigma)\, d x_{1} \ldots d x_{n}=1 \\ \int_{x \in \mathbf{R}^{n}} x_{i}\, p(x ; \mu, \Sigma)\, d x &=\mu_{i} \\ \int_{x \in \mathbf{R}^{n}}\left(x_{i}-\mu_{i}\right)\left(x_{j}-\mu_{j}\right) p(x ; \mu, \Sigma)\, d x &=\Sigma_{i j} . \end{aligned}$$
  • Fact #3: Gaussians obey a number of closure properties:
    • The sum of independent Gaussian random variables is Gaussian.
    • The marginal of a joint Gaussian distribution is Gaussian.
    • The conditional of a joint Gaussian distribution is Gaussian.
At first glance, some of these facts, in particular Facts #1 and #2, may seem either intuitively obvious or at least plausible. What is probably not so clear, however, is why these facts are so powerful. In this document, we'll provide some intuition for how these facts can be used when performing day-to-day manipulations dealing with multivariate Gaussian random variables.
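As a small illustration of Fact #2 (a sketch that is not part of the original notes; it assumes NumPy is available and uses made-up numbers), the integrals above can be approximated by Monte Carlo: sample from $\mathcal{N}(\mu, \Sigma)$ and check that the empirical mean and covariance approach $\mu$ and $\Sigma$.

```python
# Rough Monte Carlo check of Fact #2: the sample mean and sample covariance
# of draws from N(mu, Sigma) should approach mu and Sigma.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3 * np.eye(3)          # an arbitrary positive definite covariance

samples = rng.multivariate_normal(mu, Sigma, size=200_000)
print(samples.mean(axis=0))              # ~ mu     (second integral)
print(np.cov(samples, rowvar=False))     # ~ Sigma  (third integral)
```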

3. Closure properties

In this section, we'll go through each of the closure properties described earlier, and we'll either prove the property using Facts #1 and #2, or we'll at least give some type of intuition as to why the property is true.
The following is a quick roadmap of what we'll cover:
                               sums   marginals   conditionals
  why is it Gaussian?          no     yes         yes
  resulting density function   yes    yes         yes

3.1. Sum of independent Gaussians is Gaussian

The formal statement of this rule is:
Suppose that $y \sim \mathcal{N}(\mu, \Sigma)$ and $z \sim \mathcal{N}(\mu', \Sigma')$ are independent Gaussian distributed random variables, where $\mu, \mu' \in \mathbf{R}^{n}$ and $\Sigma, \Sigma' \in \mathbf{S}_{++}^{n}$. Then, their sum is also Gaussian:
$$y+z \sim \mathcal{N}(\mu+\mu', \Sigma+\Sigma').$$
Before we prove anything, here are some observations:
  1. The first thing to point out is the importance of the independence assumption in the above rule. To see why this matters, suppose that $y \sim \mathcal{N}(\mu, \Sigma)$ for some mean vector $\mu$ and covariance matrix $\Sigma$, and suppose that $z=-y$. Clearly, $z$ also has a Gaussian distribution (in fact, $z \sim \mathcal{N}(-\mu, \Sigma)$), but $y+z$ is identically zero!
  2. The second thing to point out is a point of confusion for many students: if we add together two Gaussian densities ("bumps" in multidimensional space), wouldn't we get back some bimodal (i.e., "two-humped") density? Here, the thing to realize is that the density of the random variable $y+z$ in this rule is NOT found by simply adding the densities of the individual random variables $y$ and $z$. Rather, the density of $y+z$ will actually turn out to be a convolution of the densities for $y$ and $z$.[2] To show that the convolution of two Gaussian densities gives a Gaussian density, however, is beyond the scope of this class.
Instead, let's just use the observation that the convolution does give some type of Gaussian density, along with Fact #1, to figure out what the density $p(y+z \mid \mu, \Sigma)$ would be, if we were to actually compute the convolution. How can we do this? Recall that from Fact #1, a Gaussian distribution is fully specified by its mean vector and covariance matrix. If we can determine what these are, then we're done.
But this is easy! For the mean, we have
$$E[y_{i}+z_{i}]=E[y_{i}]+E[z_{i}]=\mu_{i}+\mu_{i}'$$
from linearity of expectations. Therefore, the mean of $y+z$ is simply $\mu+\mu'$. Also, the $(i, j)$th entry of the covariance matrix is given by
$$\begin{aligned} E[(y_{i}+z_{i})(y_{j}+z_{j})] &- E[y_{i}+z_{i}]\, E[y_{j}+z_{j}] \\ &= E[y_{i} y_{j}+z_{i} y_{j}+y_{i} z_{j}+z_{i} z_{j}]-(E[y_{i}]+E[z_{i}])(E[y_{j}]+E[z_{j}]) \\ &= E[y_{i} y_{j}]+E[z_{i} y_{j}]+E[y_{i} z_{j}]+E[z_{i} z_{j}]-E[y_{i}] E[y_{j}]-E[z_{i}] E[y_{j}]-E[y_{i}] E[z_{j}]-E[z_{i}] E[z_{j}] \\ &= (E[y_{i} y_{j}]-E[y_{i}] E[y_{j}])+(E[z_{i} z_{j}]-E[z_{i}] E[z_{j}]) \\ &\quad+(E[z_{i} y_{j}]-E[z_{i}] E[y_{j}])+(E[y_{i} z_{j}]-E[y_{i}] E[z_{j}]). \end{aligned}$$
Using the fact that $y$ and $z$ are independent, we have $E[z_{i} y_{j}]=E[z_{i}] E[y_{j}]$ and $E[y_{i} z_{j}]=E[y_{i}] E[z_{j}]$. Therefore, the last two terms drop out, and we are left with
$$\begin{aligned} E[(y_{i}+z_{i})(y_{j}+z_{j})] &- E[y_{i}+z_{i}]\, E[y_{j}+z_{j}] \\ &= (E[y_{i} y_{j}]-E[y_{i}] E[y_{j}])+(E[z_{i} z_{j}]-E[z_{i}] E[z_{j}]) \\ &= \Sigma_{i j}+\Sigma_{i j}'. \end{aligned}$$
From this, we can conclude that the covariance matrix of $y+z$ is simply $\Sigma+\Sigma'$.
At this point, take a step back and think about what we have just done. Using some simple properties of expectations and independence, we have computed the mean and covariance matrix of $y+z$. Because of Fact #1, we can thus write down the density for $y+z$ immediately, without the need to perform a convolution![3]
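A quick numerical sketch of this rule (illustrative only, not part of the original notes; it assumes NumPy and uses arbitrary parameters): draw independent samples of $y$ and $z$, add them, and check that the empirical mean and covariance of the sum approach $\mu+\mu'$ and $\Sigma+\Sigma'$.

```python
# Monte Carlo check that y + z ~ N(mu + mu', Sigma + Sigma') for independent y, z.
import numpy as np

rng = np.random.default_rng(1)
mu, mu_p = np.array([0.0, 1.0]), np.array([2.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Sigma_p = np.array([[0.5, -0.1], [-0.1, 0.8]])

y = rng.multivariate_normal(mu, Sigma, size=500_000)
z = rng.multivariate_normal(mu_p, Sigma_p, size=500_000)   # drawn independently of y
s = y + z

print(s.mean(axis=0))                  # ~ mu + mu'
print(np.cov(s, rowvar=False))         # ~ Sigma + Sigma'
```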

3.2. Marginal of a joint Gaussian is Gaussian

The formal statement of this rule is:
Suppose that
$$\left[\begin{array}{l} x_{A} \\ x_{B} \end{array}\right] \sim \mathcal{N}\left(\left[\begin{array}{l} \mu_{A} \\ \mu_{B} \end{array}\right],\left[\begin{array}{cc} \Sigma_{A A} & \Sigma_{A B} \\ \Sigma_{B A} & \Sigma_{B B} \end{array}\right]\right),$$
where $x_{A} \in \mathbf{R}^{m}$, $x_{B} \in \mathbf{R}^{n}$, and the dimensions of the mean vectors and covariance matrix subblocks are chosen to match $x_{A}$ and $x_{B}$. Then, the marginal densities,
$$\begin{aligned} p\left(x_{A}\right)&=\int_{x_{B} \in \mathbf{R}^{n}} p\left(x_{A}, x_{B} ; \mu, \Sigma\right) d x_{B} \\ p\left(x_{B}\right)&=\int_{x_{A} \in \mathbf{R}^{m}} p\left(x_{A}, x_{B} ; \mu, \Sigma\right) d x_{A} \end{aligned}$$
are Gaussian:
$$\begin{aligned} x_{A} &\sim \mathcal{N}\left(\mu_{A}, \Sigma_{A A}\right) \\ x_{B} &\sim \mathcal{N}\left(\mu_{B}, \Sigma_{B B}\right). \end{aligned}$$
To justify this rule, let's just focus on the marginal distribution with respect to the variables $x_{A}$.[4]
First, note that computing the mean and covariance matrix for a marginal distribution is easy: simply take the corresponding subblocks from the mean and covariance matrix of the joint density. To make sure this is absolutely clear, let's look at the covariance between $x_{A, i}$ and $x_{A, j}$ (the $i$th component of $x_{A}$ and the $j$th component of $x_{A}$). Note that $x_{A, i}$ and $x_{A, j}$ are also the $i$th and $j$th components of
$$\left[\begin{array}{l} x_{A} \\ x_{B} \end{array}\right]$$
(since $x_{A}$ appears at the top of this vector). To find their covariance, we need to simply look at the $(i, j)$th element of the covariance matrix,
$$\left[\begin{array}{cc} \Sigma_{A A} & \Sigma_{A B} \\ \Sigma_{B A} & \Sigma_{B B} \end{array}\right].$$
The $(i, j)$th element is found in the $\Sigma_{A A}$ subblock, and in fact, is precisely $\Sigma_{A A, i j}$. Using this argument for all $i, j \in\{1, \ldots, m\}$, we see that the covariance matrix for $x_{A}$ is simply $\Sigma_{A A}$. A similar argument can be used to find that the mean of $x_{A}$ is simply $\mu_{A}$. Thus, the above argument tells us that if we knew that the marginal distribution over $x_{A}$ is Gaussian, then we could immediately write down a density function for $x_{A}$ in terms of the appropriate submatrices of the mean and covariance matrices for the joint density!
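Here is a small illustrative check of the subblock claim (a sketch, not from the notes; the particular numbers are arbitrary): sample the joint Gaussian, drop the $x_B$ coordinates, and compare the empirical mean and covariance of $x_A$ against $\mu_A$ and $\Sigma_{AA}$.

```python
# Marginalizing a joint Gaussian by sampling: keeping only the x_A coordinates
# should give a Gaussian with mean mu_A and covariance Sigma_AA.
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, 2.0, -1.0, 0.0])          # first two entries play the role of mu_A
Sigma = np.array([[2.0, 0.6, 0.3, 0.1],
                  [0.6, 1.5, 0.2, 0.4],
                  [0.3, 0.2, 1.0, 0.4],
                  [0.1, 0.4, 0.4, 1.2]])      # upper-left 2x2 block plays the role of Sigma_AA

x = rng.multivariate_normal(mu, Sigma, size=300_000)
x_A = x[:, :2]                                # "marginalize" by simply dropping x_B

print(x_A.mean(axis=0))                       # ~ mu_A
print(np.cov(x_A, rowvar=False))              # ~ Sigma_AA
```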
The above argument, though simple, is somewhat unsatisfying: how can we actually be sure that $x_{A}$ has a multivariate Gaussian distribution? The argument for this is slightly long-winded, so rather than saving the punchline for the end, here's our plan of attack up front:
  1. Write the integral form of the marginal density explicitly.
  2. Rewrite the integral by partitioning the inverse covariance matrix.
  3. Use a "completion-of-squares" argument to evaluate the integral over $x_{B}$.
  4. Argue that the resulting density is Gaussian.
Let's see each of these steps in action.

3.2.1. The marginal density in integral form

Suppose that we wanted to compute the density function of $x_{A}$ directly. Then, we would need to compute the integral,
$$\begin{aligned} p\left(x_{A}\right)&=\int_{x_{B} \in \mathbf{R}^{n}} p\left(x_{A}, x_{B} ; \mu, \Sigma\right) d x_{B} \\ &=\frac{1}{(2\pi)^{\frac{m+n}{2}}\left|\begin{array}{ll}\Sigma_{AA}&\Sigma_{AB}\\\Sigma_{BA}&\Sigma_{BB}\end{array}\right|^{1/2}}\int_{x_{B}\in\mathbf{R}^{n}}\exp\left(-\frac{1}{2}\left[\begin{array}{l}x_{A}-\mu_{A}\\x_{B}-\mu_{B}\end{array}\right]^{T}\left[\begin{array}{ll}\Sigma_{AA}&\Sigma_{AB}\\\Sigma_{BA}&\Sigma_{BB}\end{array}\right]^{-1}\left[\begin{array}{l}x_{A}-\mu_{A}\\x_{B}-\mu_{B}\end{array}\right]\right)dx_{B}. \end{aligned}$$

3.2.2. Partitioning the inverse covariance matrix

To make any sort of progress, we'll need to write the matrix product in the exponent in a slightly different form. In particular, let us define the matrix $V \in \mathbf{R}^{(m+n) \times(m+n)}$ as[5]
$$V=\left[\begin{array}{ll} V_{A A} & V_{A B} \\ V_{B A} & V_{B B} \end{array}\right]=\Sigma^{-1}.$$
It might be tempting to think that
$$V=\left[\begin{array}{ll} V_{A A} & V_{A B} \\ V_{B A} & V_{B B} \end{array}\right]=\left[\begin{array}{cc} \Sigma_{A A} & \Sigma_{A B} \\ \Sigma_{B A} & \Sigma_{B B} \end{array}\right]^{-1} \overset{?}{=} \left[\begin{array}{cc} \Sigma_{A A}^{-1} & \Sigma_{A B}^{-1} \\ \Sigma_{B A}^{-1} & \Sigma_{B B}^{-1} \end{array}\right].$$
However, the rightmost equality does not hold! We'll return to this issue in a later step; for now, though, it suffices to define $V$ as above without worrying about what the actual contents of each submatrix are.
Using this definition of $V$, the integral expands to
$$\begin{aligned} p\left(x_{A}\right)=\frac{1}{Z} \int_{x_{B} \in \mathbf{R}^{n}} \exp \bigg(- \bigg[&\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A A}\left(x_{A}-\mu_{A}\right)+\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A B}\left(x_{B}-\mu_{B}\right) \\ &+\frac{1}{2}\left(x_{B}-\mu_{B}\right)^{T} V_{B A}\left(x_{A}-\mu_{A}\right)+\frac{1}{2}\left(x_{B}-\mu_{B}\right)^{T} V_{B B}\left(x_{B}-\mu_{B}\right)\bigg]\bigg) d x_{B}, \end{aligned}$$
where $Z$ is some constant not depending on either $x_{A}$ or $x_{B}$ that we'll choose to ignore for the moment. If you haven't worked with partitioned matrices before, then the expansion above may seem a little magical to you. It is analogous to the idea that when defining a quadratic form based on some $2 \times 2$ matrix $A$, then
$$x^{T} A x=\sum_{i} \sum_{j} A_{i j} x_{i} x_{j}=x_{1} A_{11} x_{1}+x_{1} A_{12} x_{2}+x_{2} A_{21} x_{1}+x_{2} A_{22} x_{2}.$$
Take some time to convince yourself that the matrix generalization above also holds.
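One way to convince yourself numerically (a sketch under the assumption that NumPy is available; the specific matrices and names are arbitrary, not from the notes) is to compare the full quadratic form against the sum of the four block terms:

```python
# Check the block expansion of a partitioned quadratic form: the full form
# equals the sum of the four block contributions.
import numpy as np

rng = np.random.default_rng(3)
m, n = 2, 3
M = rng.standard_normal((m + n, m + n))
V = M @ M.T + (m + n) * np.eye(m + n)          # symmetric positive definite
V_AA, V_AB = V[:m, :m], V[:m, m:]
V_BA, V_BB = V[m:, :m], V[m:, m:]

a = rng.standard_normal(m)                     # plays the role of x_A - mu_A
b = rng.standard_normal(n)                     # plays the role of x_B - mu_B
full = np.concatenate([a, b]) @ V @ np.concatenate([a, b])
blocks = a @ V_AA @ a + a @ V_AB @ b + b @ V_BA @ a + b @ V_BB @ b

print(np.allclose(full, blocks))               # True
```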

3.2.3. Integrating out $x_{B}$

To evaluate the integral, we'll somehow want to integrate out $x_{B}$. In general, however, Gaussian integrals are hard to compute by hand. Is there anything we can do to save time? There are, in fact, a number of Gaussian integrals for which the answer is already known (see Fact #2). The basic idea in this section, then, will be to transform the integral we had in the last section into a form where we can apply one of the results from Fact #2 in order to perform the required integration easily.
The key to this is a mathematical trick known as "completion of squares." Consider the quadratic function $\frac{1}{2} z^{T} A z+b^{T} z+c$ where $A$ is a symmetric, nonsingular matrix. Then, one can verify directly that
$$\frac{1}{2} z^{T} A z+b^{T} z+c=\frac{1}{2}\left(z+A^{-1} b\right)^{T} A\left(z+A^{-1} b\right)+c-\frac{1}{2} b^{T} A^{-1} b.$$
This is the multivariate generalization of the "completion of squares" argument used in single variable algebra:
$$\frac{1}{2} a z^{2}+b z+c=\frac{1}{2} a\left(z+\frac{b}{a}\right)^{2}+c-\frac{b^{2}}{2 a}.$$
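If you want to verify the matrix version of the identity numerically before applying it, a short sketch like the following (illustrative values only, not part of the original notes) does the job:

```python
# Numerical check of the matrix completion-of-squares identity for a random
# symmetric nonsingular A, vector b, scalar c, and test point z.
import numpy as np

rng = np.random.default_rng(4)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                    # symmetric (and nonsingular)
b = rng.standard_normal(n)
c = 1.7
z = rng.standard_normal(n)

lhs = 0.5 * z @ A @ z + b @ z + c
shift = z + np.linalg.solve(A, b)              # z + A^{-1} b
rhs = 0.5 * shift @ A @ shift + c - 0.5 * b @ np.linalg.solve(A, b)

print(np.allclose(lhs, rhs))                   # True
```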
To apply the completion of squares in our situation above, let
$$\begin{aligned} z &=x_{B}-\mu_{B} \\ A &=V_{B B} \\ b &=V_{B A}\left(x_{A}-\mu_{A}\right) \\ c &=\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A A}\left(x_{A}-\mu_{A}\right). \end{aligned}$$
Then, it follows that the integral can be rewritten as
$$\begin{aligned} p\left(x_{A}\right)=\frac{1}{Z} \int_{x_{B} \in \mathbf{R}^{n}} \exp \bigg(- \bigg[&\frac{1}{2}\left(x_{B}-\mu_{B}+V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)\right)^{T} V_{B B}\left(x_{B}-\mu_{B}+V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)\right) \\ &+\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A A}\left(x_{A}-\mu_{A}\right)-\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A B} V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)\bigg]\bigg) d x_{B}. \end{aligned}$$
We can factor out the terms not including $x_{B}$ to obtain,
$$\begin{aligned} p\left(x_{A}\right)=& \exp \left(-\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A A}\left(x_{A}-\mu_{A}\right)+\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A B} V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)\right) \\ &\quad \cdot \frac{1}{Z} \int_{x_{B} \in \mathbf{R}^{n}} \exp \left(-\frac{1}{2}\left(x_{B}-\mu_{B}+V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)\right)^{T} V_{B B}\left(x_{B}-\mu_{B}+V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)\right)\right) d x_{B}. \end{aligned}$$
At this point, we can now apply Fact #2. In particular, we know that generically speaking, for a multivariate Gaussian distributed random variable $x$ with mean $\mu$ and covariance matrix $\Sigma$, the density function normalizes, i.e.,
$$\frac{1}{(2 \pi)^{n / 2}|\Sigma|^{1 / 2}} \int_{\mathbf{R}^{n}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right) d x=1,$$
or equivalently,
$$\int_{\mathbf{R}^{n}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right) d x=(2 \pi)^{n / 2}|\Sigma|^{1 / 2}.$$
We use this fact to get rid of the remaining integral in our expression for $p\left(x_{A}\right)$:
$$p\left(x_{A}\right)=\frac{1}{Z} \cdot(2 \pi)^{n / 2}\left|V_{B B}^{-1}\right|^{1 / 2} \cdot \exp \left(-\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T}\left(V_{A A}-V_{A B} V_{B B}^{-1} V_{B A}\right)\left(x_{A}-\mu_{A}\right)\right).$$

3.2.4. Arguing that the resulting density is Gaussian

At this point, we are almost done! Ignoring the normalization constant in front, we see that the density of $x_{A}$ is the exponential of a quadratic form in $x_{A}$. We can quickly recognize that our density is none other than a Gaussian with mean vector $\mu_{A}$ and covariance matrix $\left(V_{A A}-V_{A B} V_{B B}^{-1} V_{B A}\right)^{-1}$. Although the form of the covariance matrix may seem a bit complex, we have already achieved what we set out to show in the first place, namely, that $x_{A}$ has a marginal Gaussian distribution. Using the logic from before, we can conclude that this covariance matrix must somehow reduce to $\Sigma_{A A}$.
But, in case you are curious, it's also possible to show that our derivation is consistent with this earlier justification. To do this, we use the following result for partitioned matrices:
$$\left[\begin{array}{ll} A & B \\ C & D \end{array}\right]^{-1}=\left[\begin{array}{cc} M^{-1} & -M^{-1} B D^{-1} \\ -D^{-1} C M^{-1} & D^{-1}+D^{-1} C M^{-1} B D^{-1} \end{array}\right],$$
where $M=A-B D^{-1} C$. This formula can be thought of as the multivariable generalization of the explicit inverse for a $2 \times 2$ matrix,
$$\left[\begin{array}{ll} a & b \\ c & d \end{array}\right]^{-1}=\frac{1}{a d-b c}\left[\begin{array}{cc} d & -b \\ -c & a \end{array}\right].$$
Using the formula, it follows that
$$\begin{aligned} \left[\begin{array}{ll} \Sigma_{A A} & \Sigma_{A B} \\ \Sigma_{B A} & \Sigma_{B B} \end{array}\right] &=\left[\begin{array}{ll} V_{A A} & V_{A B} \\ V_{B A} & V_{B B} \end{array}\right]^{-1} \\ &=\left[\begin{array}{cc} \left(V_{A A}-V_{A B} V_{B B}^{-1} V_{B A}\right)^{-1} & -\left(V_{A A}-V_{A B} V_{B B}^{-1} V_{B A}\right)^{-1} V_{A B} V_{B B}^{-1} \\ -V_{B B}^{-1} V_{B A}\left(V_{A A}-V_{A B} V_{B B}^{-1} V_{B A}\right)^{-1} & \left(V_{B B}-V_{B A} V_{A A}^{-1} V_{A B}\right)^{-1} \end{array}\right]. \end{aligned}$$
We immediately see that $\left(V_{A A}-V_{A B} V_{B B}^{-1} V_{B A}\right)^{-1}=\Sigma_{A A}$, just as we expected!
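The consistency claim is also easy to check numerically. The following sketch (not part of the notes; it builds an arbitrary positive definite $\Sigma$) inverts it to get $V$, forms $V_{AA}-V_{AB}V_{BB}^{-1}V_{BA}$, and confirms that its inverse equals $\Sigma_{AA}$:

```python
# Verify that Sigma_AA = (V_AA - V_AB V_BB^{-1} V_BA)^{-1}, where V = Sigma^{-1}.
import numpy as np

rng = np.random.default_rng(5)
m, n = 2, 3
M = rng.standard_normal((m + n, m + n))
Sigma = M @ M.T + (m + n) * np.eye(m + n)      # joint covariance (positive definite)
Sigma_AA = Sigma[:m, :m]

V = np.linalg.inv(Sigma)                       # the precision matrix
V_AA, V_AB = V[:m, :m], V[:m, m:]
V_BA, V_BB = V[m:, :m], V[m:, m:]

schur = V_AA - V_AB @ np.linalg.solve(V_BB, V_BA)    # V_AA - V_AB V_BB^{-1} V_BA
print(np.allclose(np.linalg.inv(schur), Sigma_AA))   # True
```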

3.3. Conditional of a joint Gaussian is Gaussian

The formal statement of this rule is:
Suppose that
$$\left[\begin{array}{l} x_{A} \\ x_{B} \end{array}\right] \sim \mathcal{N}\left(\left[\begin{array}{l} \mu_{A} \\ \mu_{B} \end{array}\right],\left[\begin{array}{cc} \Sigma_{A A} & \Sigma_{A B} \\ \Sigma_{B A} & \Sigma_{B B} \end{array}\right]\right),$$
where $x_{A} \in \mathbf{R}^{m}$, $x_{B} \in \mathbf{R}^{n}$, and the dimensions of the mean vectors and covariance matrix subblocks are chosen to match $x_{A}$ and $x_{B}$. Then, the conditional densities
$$\begin{aligned} p\left(x_{A} \mid x_{B}\right) &=\frac{p\left(x_{A}, x_{B} ; \mu, \Sigma\right)}{\int_{x_{A} \in \mathbf{R}^{m}} p\left(x_{A}, x_{B} ; \mu, \Sigma\right) d x_{A}} \\ p\left(x_{B} \mid x_{A}\right) &=\frac{p\left(x_{A}, x_{B} ; \mu, \Sigma\right)}{\int_{x_{B} \in \mathbf{R}^{n}} p\left(x_{A}, x_{B} ; \mu, \Sigma\right) d x_{B}} \end{aligned}$$
are also Gaussian:
$$\begin{aligned} x_{A} \mid x_{B} &\sim \mathcal{N}\left(\mu_{A}+\Sigma_{A B} \Sigma_{B B}^{-1}\left(x_{B}-\mu_{B}\right),\ \Sigma_{A A}-\Sigma_{A B} \Sigma_{B B}^{-1} \Sigma_{B A}\right) \\ x_{B} \mid x_{A} &\sim \mathcal{N}\left(\mu_{B}+\Sigma_{B A} \Sigma_{A A}^{-1}\left(x_{A}-\mu_{A}\right),\ \Sigma_{B B}-\Sigma_{B A} \Sigma_{A A}^{-1} \Sigma_{A B}\right). \end{aligned}$$
As before, we'll just examine the conditional distribution $x_{B} \mid x_{A}$, and the other result will hold by symmetry. Our plan of attack will be as follows:
  1. Write the form of the conditional density explicitly.
  2. Rewrite the expression by partitioning the inverse covariance matrix.
  3. Use a "completion-of-squares" argument.
  4. Argue that the resulting density is Gaussian.
Let's see each of these steps in action.

3.3.1. The conditional density written explicitly

Suppose that we wanted to compute the density function of $x_{B}$ given $x_{A}$ directly. Then, we would need to compute
$$\begin{aligned} p\left(x_{B} \mid x_{A}\right) &=\frac{p\left(x_{A}, x_{B} ; \mu, \Sigma\right)}{\int_{x_{B} \in \mathbf{R}^{n}} p\left(x_{A}, x_{B} ; \mu, \Sigma\right) d x_{B}} \\ &=\frac{1}{Z^{\prime}} \exp \left(-\frac{1}{2}\left[\begin{array}{l} x_{A}-\mu_{A} \\ x_{B}-\mu_{B} \end{array}\right]^{T}\left[\begin{array}{cc} \Sigma_{A A} & \Sigma_{A B} \\ \Sigma_{B A} & \Sigma_{B B} \end{array}\right]^{-1}\left[\begin{array}{l} x_{A}-\mu_{A} \\ x_{B}-\mu_{B} \end{array}\right]\right), \end{aligned}$$
where $Z^{\prime}$ is a normalization constant that we used to absorb factors not depending on $x_{B}$. Note that this time, we don't even need to compute any integrals; the value of the integral does not depend on $x_{B}$, and hence the integral can be folded into the normalization constant $Z^{\prime}$.

3.3.2. Partitioning the inverse covariance matrix

As before, we reparameterize our density using the matrix $V$, to obtain
$$\begin{aligned} p\left(x_{B} \mid x_{A}\right)&=\frac{1}{Z^{\prime}} \exp \bigg(-\frac{1}{2}\left[\begin{array}{l} x_{A}-\mu_{A} \\ x_{B}-\mu_{B} \end{array}\right]^{T}\left[\begin{array}{ll} V_{A A} & V_{A B} \\ V_{B A} & V_{B B} \end{array}\right]\left[\begin{array}{l} x_{A}-\mu_{A} \\ x_{B}-\mu_{B} \end{array}\right]\bigg) \\ &=\frac{1}{Z^{\prime}} \exp \bigg(-\bigg[\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A A}\left(x_{A}-\mu_{A}\right)+\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A B}\left(x_{B}-\mu_{B}\right) \\ &\qquad\qquad+\frac{1}{2}\left(x_{B}-\mu_{B}\right)^{T} V_{B A}\left(x_{A}-\mu_{A}\right)+\frac{1}{2}\left(x_{B}-\mu_{B}\right)^{T} V_{B B}\left(x_{B}-\mu_{B}\right)\bigg]\bigg). \end{aligned}$$

3.3.3. Use a "completion of squares" argument

Recall that
$$\frac{1}{2} z^{T} A z+b^{T} z+c=\frac{1}{2}\left(z+A^{-1} b\right)^{T} A\left(z+A^{-1} b\right)+c-\frac{1}{2} b^{T} A^{-1} b,$$
provided $A$ is a symmetric, nonsingular matrix. As before, to apply the completion of squares in our situation above, let
$$\begin{aligned} z &=x_{B}-\mu_{B} \\ A &=V_{B B} \\ b &=V_{B A}\left(x_{A}-\mu_{A}\right) \\ c &=\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A A}\left(x_{A}-\mu_{A}\right). \end{aligned}$$
Then, it follows that the expression for $p\left(x_{B} \mid x_{A}\right)$ can be rewritten as
$$\begin{aligned} p\left(x_{B} \mid x_{A}\right)=\frac{1}{Z^{\prime}} \exp \bigg(-\bigg[&\frac{1}{2}\left(x_{B}-\mu_{B}+V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)\right)^{T} V_{B B}\left(x_{B}-\mu_{B}+V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)\right) \\ &+\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A A}\left(x_{A}-\mu_{A}\right)-\frac{1}{2}\left(x_{A}-\mu_{A}\right)^{T} V_{A B} V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)\bigg]\bigg). \end{aligned}$$
Absorbing the portion of the exponent which does not depend on $x_{B}$ into the normalization constant, we have
$$p\left(x_{B} \mid x_{A}\right)=\frac{1}{Z^{\prime \prime}} \exp \left(-\frac{1}{2}\left(x_{B}-\mu_{B}+V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)\right)^{T} V_{B B}\left(x_{B}-\mu_{B}+V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)\right)\right).$$

3.3.4. Arguing that the resulting density is Gaussian

Looking at the last form, $p\left(x_{B} \mid x_{A}\right)$ has the form of a Gaussian density with mean $\mu_{B}-V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)$ and covariance matrix $V_{B B}^{-1}$. As before, recall our matrix identity,
$$\left[\begin{array}{ll} \Sigma_{A A} & \Sigma_{A B} \\ \Sigma_{B A} & \Sigma_{B B} \end{array}\right]=\left[\begin{array}{cc} \left(V_{A A}-V_{A B} V_{B B}^{-1} V_{B A}\right)^{-1} & -\left(V_{A A}-V_{A B} V_{B B}^{-1} V_{B A}\right)^{-1} V_{A B} V_{B B}^{-1} \\ -V_{B B}^{-1} V_{B A}\left(V_{A A}-V_{A B} V_{B B}^{-1} V_{B A}\right)^{-1} & \left(V_{B B}-V_{B A} V_{A A}^{-1} V_{A B}\right)^{-1} \end{array}\right].$$
From this, it follows that
$$\mu_{B \mid A}=\mu_{B}-V_{B B}^{-1} V_{B A}\left(x_{A}-\mu_{A}\right)=\mu_{B}+\Sigma_{B A} \Sigma_{A A}^{-1}\left(x_{A}-\mu_{A}\right).$$
Conversely, we can also apply our matrix identity to obtain:
$$\left[\begin{array}{cc} V_{A A} & V_{A B} \\ V_{B A} & V_{B B} \end{array}\right]=\left[\begin{array}{cc} \left(\Sigma_{A A}-\Sigma_{A B} \Sigma_{B B}^{-1} \Sigma_{B A}\right)^{-1} & -\left(\Sigma_{A A}-\Sigma_{A B} \Sigma_{B B}^{-1} \Sigma_{B A}\right)^{-1} \Sigma_{A B} \Sigma_{B B}^{-1} \\ -\Sigma_{B B}^{-1} \Sigma_{B A}\left(\Sigma_{A A}-\Sigma_{A B} \Sigma_{B B}^{-1} \Sigma_{B A}\right)^{-1} & \left(\Sigma_{B B}-\Sigma_{B A} \Sigma_{A A}^{-1} \Sigma_{A B}\right)^{-1} \end{array}\right],$$
from which it follows that
$$\Sigma_{B \mid A}=V_{B B}^{-1}=\Sigma_{B B}-\Sigma_{B A} \Sigma_{A A}^{-1} \Sigma_{A B}.$$
And, we're done!
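As a final sanity check (again a sketch with made-up numbers, not part of the original notes), the two parameterizations of the conditional, in terms of $\Sigma$ and in terms of the precision matrix $V$, can be compared directly:

```python
# Compare mu_{B|A} and Sigma_{B|A} computed from Sigma with the same quantities
# computed from the precision matrix V = Sigma^{-1}.
import numpy as np

rng = np.random.default_rng(6)
m, n = 2, 2
mu_A, mu_B = np.array([0.5, -1.0]), np.array([2.0, 1.0])
M = rng.standard_normal((m + n, m + n))
Sigma = M @ M.T + (m + n) * np.eye(m + n)      # joint covariance (positive definite)
S_AA, S_AB = Sigma[:m, :m], Sigma[:m, m:]
S_BA, S_BB = Sigma[m:, :m], Sigma[m:, m:]
V = np.linalg.inv(Sigma)
V_BA, V_BB = V[m:, :m], V[m:, m:]

x_A = np.array([1.0, 0.0])                     # an arbitrary conditioning value

mu_cond_sigma = mu_B + S_BA @ np.linalg.solve(S_AA, x_A - mu_A)
mu_cond_prec = mu_B - np.linalg.solve(V_BB, V_BA @ (x_A - mu_A))
cov_cond_sigma = S_BB - S_BA @ np.linalg.solve(S_AA, S_AB)
cov_cond_prec = np.linalg.inv(V_BB)

print(np.allclose(mu_cond_sigma, mu_cond_prec))    # True
print(np.allclose(cov_cond_sigma, cov_cond_prec))  # True
```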

4. Summary

In these notes, we used a few simple properties of multivariate Gaussians (plus a couple matrix algebra tricks) in order to argue that multivariate Gaussians satisfy a number of closure properties. In general, multivariate Gaussians are exceedingly useful representations of probability distributions because the closure properties ensure that most of the types of operations we would ever want to perform using a multivariate Gaussian can be done in closed form. Analytically, integrals involving multivariate Gaussians are often nice in practice since we can rely on known Gaussian integrals to avoid having to ever perform the integration ourselves.

5. Exercise

Test your understanding! Let $A \in \mathbf{R}^{n \times n}$ be a symmetric positive definite matrix, $b \in \mathbf{R}^{n}$, and $c \in \mathbf{R}$. Prove that
$$\int_{x \in \mathbf{R}^{n}} \exp \left(-\frac{1}{2} x^{T} A x-x^{T} b-c\right) d x=\frac{(2 \pi)^{n / 2}}{|A|^{1 / 2} \exp \left(c-\frac{1}{2} b^{T} A^{-1} b\right)}.$$
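If you would like a numerical hint before attempting the proof, here is a one-dimensional sanity check (a sketch, not part of the original notes; it assumes SciPy's `quad` and treats $A$, $b$, $c$ as scalars with $A>0$ so the integral converges):

```python
# 1-D numerical check of the exercise identity, including the factor of 1/2
# on b^T A^{-1} b.
import numpy as np
from scipy.integrate import quad

A, b, c = 2.0, 0.7, 0.3
lhs, _ = quad(lambda x: np.exp(-0.5 * A * x**2 - b * x - c), -np.inf, np.inf)
rhs = (2 * np.pi) ** 0.5 / (A ** 0.5 * np.exp(c - 0.5 * b**2 / A))
print(lhs, rhs)   # the two values should agree
```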

6. References

For more information on multivariate Gaussians, see: Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.

  1. Recall from the section notes on linear algebra that $\mathbf{S}_{++}^{n}$ is the space of symmetric positive definite $n \times n$ matrices, defined as $$\mathbf{S}_{++}^{n}=\left\{A \in \mathbf{R}^{n \times n}: A=A^{T} \text { and } x^{T} A x>0 \text { for all } x \in \mathbf{R}^{n} \text { such that } x \neq 0\right\}.$$ ↩︎
  2. For example, if $y$ and $z$ were univariate Gaussians (i.e., $y \sim \mathcal{N}(\mu, \sigma^{2})$, $z \sim \mathcal{N}(\mu', \sigma'^{2})$), then the convolution of their probability densities is given by $$\begin{aligned} p\left(y+z ; \mu, \mu', \sigma^{2}, \sigma'^{2}\right) &=\int_{-\infty}^{\infty} p\left(w ; \mu, \sigma^{2}\right) p\left(y+z-w ; \mu', \sigma'^{2}\right) d w \\ &=\int_{-\infty}^{\infty} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left(-\frac{1}{2 \sigma^{2}}(w-\mu)^{2}\right) \cdot \frac{1}{\sqrt{2 \pi} \sigma'} \exp \left(-\frac{1}{2 \sigma'^{2}}\left(y+z-w-\mu'\right)^{2}\right) d w. \end{aligned}$$ ↩︎
  3. Of course, we needed to know that $y+z$ had a Gaussian distribution in the first place. ↩︎
  4. In general, for a random vector $x$ which has a Gaussian distribution, we can always permute entries of $x$ so long as we permute the entries of the mean vector and the rows/columns of the covariance matrix in the corresponding way. As a result, it suffices to look only at $x_{A}$, and the result for $x_{B}$ follows immediately. ↩︎
  5. Sometimes, $V$ is called the "precision" matrix. ↩︎
