Introduction to the Multivariate Gaussian Distribution

A vector-valued random variable $X = [X_{1} \; \cdots \; X_{n}]^{T}$ is said to have a multivariate normal (or Gaussian) distribution with mean $\mu \in \mathbf{R}^{n}$ and covariance matrix $\Sigma \in \mathbf{S}_{++}^{n}$[1] if its probability density function[2] is given by
$$p(x ; \mu, \Sigma)=\frac{1}{(2 \pi)^{n / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right).$$
We write this as $X \sim \mathcal{N}(\mu, \Sigma)$. In these notes, we describe multivariate Gaussians and some of their basic properties.
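As a concrete illustration, here is a minimal NumPy sketch that evaluates this density directly from the formula above; the helper name `gaussian_density` and the example values of $\mu$ and $\Sigma$ are arbitrary choices for illustration, not part of the original notes.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density p(x; mu, Sigma)."""
    n = mu.shape[0]
    diff = x - mu
    # Normalization constant 1 / ((2*pi)^(n/2) * |Sigma|^(1/2))
    norm_const = 1.0 / (np.power(2 * np.pi, n / 2) * np.sqrt(np.linalg.det(Sigma)))
    # Quadratic form -(1/2) (x - mu)^T Sigma^{-1} (x - mu)
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)
    return norm_const * np.exp(exponent)

mu = np.array([3.0, 2.0])
Sigma = np.array([[25.0, 0.0],
                  [0.0, 9.0]])
x = np.array([4.0, 1.0])
print(gaussian_density(x, mu, Sigma))
```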

1. Relationship to univariate Gaussians

Recall that the density function of a univariate normal (or Gaussian) distribution is given by
$$p(x ; \mu, \sigma^{2})=\frac{1}{\sqrt{2 \pi}\, \sigma} \exp \left(-\frac{1}{2 \sigma^{2}}(x-\mu)^{2}\right).$$
Here, the argument of the exponential function, $-\frac{1}{2 \sigma^{2}}(x-\mu)^{2}$, is a quadratic function of the variable $x$. Furthermore, the parabola opens downwards, since the coefficient of the quadratic term is negative. The coefficient in front, $\frac{1}{\sqrt{2 \pi}\, \sigma}$, is a constant that does not depend on $x$; hence, we can think of it as simply a "normalization factor" used to ensure that
$$\frac{1}{\sqrt{2 \pi}\, \sigma} \int_{-\infty}^{\infty} \exp \left(-\frac{1}{2 \sigma^{2}}(x-\mu)^{2}\right) dx=1 .$$
Figure 1: The figure on the left shows a univariate Gaussian density for a single variable $X$. The figure on the right shows a multivariate Gaussian density over two variables $X_{1}$ and $X_{2}$.
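As a quick sanity check (not something done in the original notes), this normalization can be verified numerically; the values of $\mu$ and $\sigma$ below are arbitrary.

```python
import numpy as np

mu, sigma = 1.5, 0.8  # arbitrary example values

# Integrate the univariate Gaussian density on a wide grid; the tails
# beyond ~10 standard deviations contribute a negligible amount.
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
density = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
print(np.trapz(density, x))  # approximately 1.0
```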
In the case of the multivariate Gaussian density, the argument of the exponential function, $-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)$, is a quadratic form in the vector variable $x$. Since $\Sigma$ is positive definite, and since the inverse of any positive definite matrix is also positive definite, it follows that $z^{T} \Sigma^{-1} z>0$ for any non-zero vector $z$. This implies that for any vector $x \neq \mu$,
$$\begin{aligned} (x-\mu)^{T} \Sigma^{-1}(x-\mu) &>0 \\ -\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu) &<0 . \end{aligned}$$
As in the univariate case, you can think of the argument of the exponential function as a downward-opening quadratic bowl. The coefficient in front (i.e., $\frac{1}{(2 \pi)^{n / 2}|\Sigma|^{1 / 2}}$) has an even more complicated form than in the univariate case. However, it still does not depend on $x$, and hence it is again simply a normalization factor used to ensure that
$$\frac{1}{(2 \pi)^{n / 2}|\Sigma|^{1 / 2}} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp \left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1}(x-\mu)\right) d x_{1}\, d x_{2} \cdots d x_{n}=1 .$$

2. The covariance matrix

The concept of the covariance matrix is vital to understanding multivariate Gaussian distributions. Recall that for a pair of random variables $X$ and $Y$, their covariance is defined as
$$\operatorname{Cov}[X, Y]=E[(X-E[X])(Y-E[Y])]=E[X Y]-E[X] E[Y].$$
When working with multiple variables, the covariance matrix provides a succinct way to summarize the covariances of all pairs of variables. In particular, the covariance matrix, which we usually denote as $\Sigma$, is the $n \times n$ matrix whose $(i, j)$th entry is $\operatorname{Cov}[X_{i}, X_{j}]$.
The following proposition (whose proof is provided in Appendix A.1) gives an alternative way to characterize the covariance matrix of a random vector $X$:
Proposition 1. For any random vector $X$ with mean $\mu$ and covariance matrix $\Sigma$,
$$\Sigma=E[(X-\mu)(X-\mu)^{T}]=E[X X^{T}]-\mu \mu^{T}. \tag{1}$$
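As an informal numerical check of Proposition 1 (not part of the original notes), one can estimate both sides of Equation (1) from samples; the particular $\mu$ and $\Sigma$ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 0.5]])

# Draw samples of X ~ N(mu, Sigma) and form Monte Carlo estimates of
# E[(X - mu)(X - mu)^T] and E[X X^T] - mu mu^T; both should approach Sigma.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
m = X.mean(axis=0)
lhs = (X - m).T @ (X - m) / X.shape[0]
rhs = X.T @ X / X.shape[0] - np.outer(m, m)
print(np.round(lhs, 2))
print(np.round(rhs, 2))
```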
In the definition of multivariate Gaussians, we required that the covariance matrix $\Sigma$ be symmetric positive definite (i.e., $\Sigma \in \mathbf{S}_{++}^{n}$). Why does this restriction exist? As seen in the following proposition, the covariance matrix of any random vector must always be symmetric positive semidefinite:
Proposition 2. Suppose that $\Sigma$ is the covariance matrix corresponding to some random vector $X$. Then $\Sigma$ is symmetric positive semidefinite.
Proof. The symmetry of $\Sigma$ follows immediately from its definition. Next, for any vector $z \in \mathbf{R}^{n}$, observe that
$$\begin{align} z^{T} \Sigma z &=\sum_{i=1}^{n} \sum_{j=1}^{n} \Sigma_{i j} z_{i} z_{j} \tag{2}\\ &=\sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{Cov}[X_{i}, X_{j}] \cdot z_{i} z_{j} \nonumber\\ &=\sum_{i=1}^{n} \sum_{j=1}^{n} E[(X_{i}-E[X_{i}])(X_{j}-E[X_{j}])] \cdot z_{i} z_{j} \nonumber\\ &=E\left[\sum_{i=1}^{n} \sum_{j=1}^{n}(X_{i}-E[X_{i}])(X_{j}-E[X_{j}]) \cdot z_{i} z_{j}\right]. \tag{3} \end{align}$$
Here, (2) follows from the formula for expanding a quadratic form (see section notes on linear algebra), and (3) follows by linearity of expectations (see probability notes).
To complete the proof, observe that the quantity inside the brackets is of the form $\sum_{i} \sum_{j} x_{i} x_{j} z_{i} z_{j}=(x^{T} z)^{2} \geq 0$ (see problem set #1). Therefore, the quantity inside the expectation is always nonnegative, and hence the expectation itself must be nonnegative. We conclude that $z^{T} \Sigma z \geq 0$.
From the above proposition it follows that $\Sigma$ must be symmetric positive semidefinite in order for it to be a valid covariance matrix. However, in order for $\Sigma^{-1}$ to exist (as required in the definition of the multivariate Gaussian density), $\Sigma$ must be invertible and hence full rank. Since any full-rank symmetric positive semidefinite matrix is necessarily symmetric positive definite, it follows that $\Sigma$ must be symmetric positive definite.

3. The diagonal covariance matrix case

To get an intuition for what a multivariate Gaussian is, consider the simple case where $n=2$, and where the covariance matrix $\Sigma$ is diagonal, i.e.,
$$x=\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix} \quad \mu=\begin{bmatrix} \mu_{1} \\ \mu_{2} \end{bmatrix} \quad \Sigma=\begin{bmatrix} \sigma_{1}^{2} & 0 \\ 0 & \sigma_{2}^{2} \end{bmatrix}$$
In this case, the multivariate Gaussian density has the form,
$$\begin{aligned} p(x ; \mu, \Sigma) &=\frac{1}{2 \pi\begin{vmatrix} \sigma_{1}^{2} & 0 \\ 0 & \sigma_{2}^{2} \end{vmatrix}^{1 / 2}} \exp \left(-\frac{1}{2}\begin{bmatrix} x_{1}-\mu_{1} \\ x_{2}-\mu_{2} \end{bmatrix}^{T}\begin{bmatrix} \sigma_{1}^{2} & 0 \\ 0 & \sigma_{2}^{2} \end{bmatrix}^{-1}\begin{bmatrix} x_{1}-\mu_{1} \\ x_{2}-\mu_{2} \end{bmatrix}\right) \\ &=\frac{1}{2 \pi\left(\sigma_{1}^{2} \cdot \sigma_{2}^{2}-0 \cdot 0\right)^{1 / 2}} \exp \left(-\frac{1}{2}\begin{bmatrix} x_{1}-\mu_{1} \\ x_{2}-\mu_{2} \end{bmatrix}^{T}\begin{bmatrix} \frac{1}{\sigma_{1}^{2}} & 0 \\ 0 & \frac{1}{\sigma_{2}^{2}} \end{bmatrix}\begin{bmatrix} x_{1}-\mu_{1} \\ x_{2}-\mu_{2} \end{bmatrix}\right), \end{aligned}$$
where we have relied on the explicit formula for the determinant of a $2 \times 2$ matrix[3], and the fact that the inverse of a diagonal matrix is simply found by taking the reciprocal of each diagonal entry. Continuing,
$$\begin{aligned} p(x ; \mu, \Sigma) &=\frac{1}{2 \pi \sigma_{1} \sigma_{2}} \exp \left(-\frac{1}{2}\begin{bmatrix} x_{1}-\mu_{1} \\ x_{2}-\mu_{2} \end{bmatrix}^{T}\begin{bmatrix} \frac{1}{\sigma_{1}^{2}}(x_{1}-\mu_{1}) \\ \frac{1}{\sigma_{2}^{2}}(x_{2}-\mu_{2}) \end{bmatrix}\right) \\ &=\frac{1}{2 \pi \sigma_{1} \sigma_{2}} \exp \left(-\frac{1}{2 \sigma_{1}^{2}}(x_{1}-\mu_{1})^{2}-\frac{1}{2 \sigma_{2}^{2}}(x_{2}-\mu_{2})^{2}\right) \\ &=\frac{1}{\sqrt{2 \pi}\, \sigma_{1}} \exp \left(-\frac{1}{2 \sigma_{1}^{2}}(x_{1}-\mu_{1})^{2}\right) \cdot \frac{1}{\sqrt{2 \pi}\, \sigma_{2}} \exp \left(-\frac{1}{2 \sigma_{2}^{2}}(x_{2}-\mu_{2})^{2}\right) . \end{aligned}$$
We recognize the last line as simply the product of two independent Gaussian densities, one with mean $\mu_{1}$ and variance $\sigma_{1}^{2}$, and the other with mean $\mu_{2}$ and variance $\sigma_{2}^{2}$.
More generally, one can show that an $n$-dimensional Gaussian with mean $\mu \in \mathbf{R}^{n}$ and diagonal covariance matrix $\Sigma=\operatorname{diag}(\sigma_{1}^{2}, \sigma_{2}^{2}, \ldots, \sigma_{n}^{2})$ is the same as a collection of $n$ independent Gaussian random variables with mean $\mu_{i}$ and variance $\sigma_{i}^{2}$, respectively.
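The factorization above is easy to confirm numerically. The following sketch (with arbitrary example values of our own choosing) evaluates the two-dimensional diagonal-covariance density and the product of the two univariate densities, and shows that they agree.

```python
import numpy as np

mu1, mu2 = 1.0, -2.0
s1, s2 = 1.5, 0.7          # standard deviations sigma_1, sigma_2
x1, x2 = 0.3, -1.1         # an arbitrary evaluation point

# Joint density with Sigma = diag(s1^2, s2^2)
Sigma = np.diag([s1**2, s2**2])
diff = np.array([x1 - mu1, x2 - mu2])
joint = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / (2 * np.pi * s1 * s2)

# Product of the two univariate Gaussian densities
p1 = np.exp(-0.5 * ((x1 - mu1) / s1) ** 2) / (np.sqrt(2 * np.pi) * s1)
p2 = np.exp(-0.5 * ((x2 - mu2) / s2) ** 2) / (np.sqrt(2 * np.pi) * s2)

print(joint, p1 * p2)      # equal up to floating-point error
```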

4. Isocontours

Another way to understand a multivariate Gaussian conceptually is to understand the shape of its isocontours. For a function $f: \mathbf{R}^{2} \rightarrow \mathbf{R}$, an isocontour is a set of the form
$$\{x \in \mathbf{R}^{2}: f(x)=c\}$$
for some $c \in \mathbf{R}$.[4]

4.1. Shape of isocontours

What do the isocontours of a multivariate Gaussian look like? As before, let's consider the case where $n=2$, and $\Sigma$ is diagonal, i.e.,
$$x=\begin{bmatrix} x_{1} \\ x_{2} \end{bmatrix} \quad \mu=\begin{bmatrix} \mu_{1} \\ \mu_{2} \end{bmatrix} \quad \Sigma=\begin{bmatrix} \sigma_{1}^{2} & 0 \\ 0 & \sigma_{2}^{2} \end{bmatrix}$$
As we showed in the last section,
$$p(x ; \mu, \Sigma)=\frac{1}{2 \pi \sigma_{1} \sigma_{2}} \exp \left(-\frac{1}{2 \sigma_{1}^{2}}(x_{1}-\mu_{1})^{2}-\frac{1}{2 \sigma_{2}^{2}}(x_{2}-\mu_{2})^{2}\right). \tag{4}$$
Now, let's consider the level set consisting of all points where $p(x ; \mu, \Sigma)=c$ for some constant $c \in \mathbf{R}$. In particular, consider the set of all $x_{1}, x_{2} \in \mathbf{R}$ such that
$$\begin{aligned} c &=\frac{1}{2 \pi \sigma_{1} \sigma_{2}} \exp \left(-\frac{1}{2 \sigma_{1}^{2}}(x_{1}-\mu_{1})^{2}-\frac{1}{2 \sigma_{2}^{2}}(x_{2}-\mu_{2})^{2}\right) \\ 2 \pi c \sigma_{1} \sigma_{2} &=\exp \left(-\frac{1}{2 \sigma_{1}^{2}}(x_{1}-\mu_{1})^{2}-\frac{1}{2 \sigma_{2}^{2}}(x_{2}-\mu_{2})^{2}\right) \\ \log \left(2 \pi c \sigma_{1} \sigma_{2}\right) &=-\frac{1}{2 \sigma_{1}^{2}}(x_{1}-\mu_{1})^{2}-\frac{1}{2 \sigma_{2}^{2}}(x_{2}-\mu_{2})^{2} \\ \log \left(\frac{1}{2 \pi c \sigma_{1} \sigma_{2}}\right) &=\frac{1}{2 \sigma_{1}^{2}}(x_{1}-\mu_{1})^{2}+\frac{1}{2 \sigma_{2}^{2}}(x_{2}-\mu_{2})^{2} \\ 1 &=\frac{(x_{1}-\mu_{1})^{2}}{2 \sigma_{1}^{2} \log \left(\frac{1}{2 \pi c \sigma_{1} \sigma_{2}}\right)}+\frac{(x_{2}-\mu_{2})^{2}}{2 \sigma_{2}^{2} \log \left(\frac{1}{2 \pi c \sigma_{1} \sigma_{2}}\right)} . \end{aligned}$$
Defining
$$r_{1}=\sqrt{2 \sigma_{1}^{2} \log \left(\frac{1}{2 \pi c \sigma_{1} \sigma_{2}}\right)} \qquad r_{2}=\sqrt{2 \sigma_{2}^{2} \log \left(\frac{1}{2 \pi c \sigma_{1} \sigma_{2}}\right)},$$
it follows that
$$1=\left(\frac{x_{1}-\mu_{1}}{r_{1}}\right)^{2}+\left(\frac{x_{2}-\mu_{2}}{r_{2}}\right)^{2} . \tag{5}$$
Equation (5) should be familiar to you from high school analytic geometry: it is the equation of an axis-aligned ellipse with center $(\mu_{1}, \mu_{2})$, where the axis along $x_{1}$ has length $2 r_{1}$ and the axis along $x_{2}$ has length $2 r_{2}$!
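As a quick check of Equation (5), the sketch below (an added illustration with arbitrary example values) parameterizes points on the ellipse with semi-axes $r_1$ and $r_2$ and confirms that the density is constant along it.

```python
import numpy as np

mu1, mu2 = 3.0, 2.0
s1, s2 = 5.0, 3.0                  # standard deviations, so Sigma = diag(25, 9)
c = 0.5 / (2 * np.pi * s1 * s2)    # any level below the peak height works

# Semi-axis lengths from the derivation above
r1 = np.sqrt(2 * s1**2 * np.log(1 / (2 * np.pi * c * s1 * s2)))
r2 = np.sqrt(2 * s2**2 * np.log(1 / (2 * np.pi * c * s1 * s2)))

# Points on the ellipse, parameterized by angle t, should all have density c
t = np.linspace(0, 2 * np.pi, 7)
x1 = mu1 + r1 * np.cos(t)
x2 = mu2 + r2 * np.sin(t)
density = np.exp(-0.5 * ((x1 - mu1) / s1) ** 2
                 - 0.5 * ((x2 - mu2) / s2) ** 2) / (2 * np.pi * s1 * s2)
print(density)  # every entry equals c
```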

4.2. Length of axes

To get a better understanding of how the shape of the level curves varies as a function of the variances of the multivariate Gaussian distribution, suppose that we are interested in the values of $r_{1}$ and $r_{2}$ at which $c$ is equal to a fraction $1/e$ of the peak height of the Gaussian density.
Figure 2: The figure on the left shows a heatmap indicating values of the density function for an axis-aligned multivariate Gaussian with mean $\mu=\begin{bmatrix}3 \\ 2\end{bmatrix}$ and diagonal covariance matrix $\Sigma=\begin{bmatrix}25 & 0 \\ 0 & 9\end{bmatrix}$. Notice that the Gaussian is centered at $(3,2)$, and that the isocontours are all elliptically shaped with major/minor axis lengths in a 5:3 ratio. The figure on the right shows a heatmap indicating values of the density function for a non-axis-aligned multivariate Gaussian with mean $\mu=\begin{bmatrix}3 \\ 2\end{bmatrix}$ and covariance matrix $\Sigma=\begin{bmatrix}10 & 5 \\ 5 & 5\end{bmatrix}$. Here, the ellipses are again centered at $(3,2)$, but now the major and minor axes have been rotated via a linear transformation.
First, observe that the maximum of Equation (4) occurs where $x_{1}=\mu_{1}$ and $x_{2}=\mu_{2}$. Substituting these values into Equation (4), we see that the peak height of the Gaussian density is $\frac{1}{2 \pi \sigma_{1} \sigma_{2}}$.
Second, we substitute $c=\frac{1}{e}\left(\frac{1}{2 \pi \sigma_{1} \sigma_{2}}\right)$ into the equations for $r_{1}$ and $r_{2}$ to obtain
$$\begin{aligned} r_{1} &=\sqrt{2 \sigma_{1}^{2} \log \left(\frac{1}{2 \pi \sigma_{1} \sigma_{2} \cdot \frac{1}{e}\left(\frac{1}{2 \pi \sigma_{1} \sigma_{2}}\right)}\right)}=\sigma_{1} \sqrt{2} \\ r_{2} &=\sqrt{2 \sigma_{2}^{2} \log \left(\frac{1}{2 \pi \sigma_{1} \sigma_{2} \cdot \frac{1}{e}\left(\frac{1}{2 \pi \sigma_{1} \sigma_{2}}\right)}\right)}=\sigma_{2} \sqrt{2}. \end{aligned}$$
From this, it follows that the axis length needed to reach a fraction $1/e$ of the peak height of the Gaussian density in the $i$th dimension grows in proportion to the standard deviation $\sigma_{i}$. Intuitively, this again makes sense: the smaller the variance of some random variable $x_{i}$, the more "tightly" peaked the Gaussian distribution is in that dimension, and hence the smaller the radius $r_{i}$.
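This calculation is easy to reproduce numerically; a minimal check, using arbitrary standard deviations:

```python
import numpy as np

s1, s2 = 5.0, 3.0
peak = 1 / (2 * np.pi * s1 * s2)   # peak height of the density
c = peak / np.e                    # the 1/e level considered above

r1 = np.sqrt(2 * s1**2 * np.log(1 / (2 * np.pi * c * s1 * s2)))
r2 = np.sqrt(2 * s2**2 * np.log(1 / (2 * np.pi * c * s1 * s2)))
print(r1, s1 * np.sqrt(2))  # identical
print(r2, s2 * np.sqrt(2))  # identical
```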

4.3. Non-diagonal case, higher dimensions

Clearly, the above derivations rely on the assumption that $\Sigma$ is a diagonal matrix. However, in the non-diagonal case, it turns out that the picture is not all that different. Instead of being axis-aligned ellipses, the isocontours turn out to be simply rotated ellipses. Furthermore, in the $n$-dimensional case, the level sets form geometrical structures known as ellipsoids in $\mathbf{R}^{n}$.
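A short sketch of this idea (our own illustration, not a derivation from the notes): for the non-diagonal covariance matrix used in Figure 2, the eigenvectors of $\Sigma$ give the rotated directions of the ellipse axes, and the eigenvalues play the role of the per-axis variances.

```python
import numpy as np

# The non-axis-aligned covariance matrix from Figure 2 (right panel)
Sigma = np.array([[10.0, 5.0],
                  [5.0, 5.0]])

# Eigenvectors give the directions of the (rotated) ellipse axes;
# eigenvalues act like per-axis variances in that rotated frame.
eigvals, eigvecs = np.linalg.eigh(Sigma)
print("axis directions (columns):\n", eigvecs)
print("variances along those axes:", eigvals)
print("major/minor axis-length ratio:", np.sqrt(eigvals[1] / eigvals[0]))
```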

5. Linear transformation interpretation

In the last few sections, we focused primarily on providing an intuition for how multivariate Gaussians with diagonal covariance matrices behave. In particular, we found that an $n$-dimensional multivariate Gaussian with diagonal covariance matrix can be viewed simply as a collection of $n$ independent Gaussian-distributed random variables with means and variances $\mu_{i}$ and $\sigma_{i}^{2}$, respectively. In this section, we dig a little deeper and provide a quantitative interpretation of multivariate Gaussians when the covariance matrix is not diagonal.
The key result of this section is the following theorem (see proof in Appendix A.2).
Theorem 1. Let $X \sim \mathcal{N}(\mu, \Sigma)$ for some $\mu \in \mathbf{R}^{n}$ and $\Sigma \in \mathbf{S}_{++}^{n}$. Then, there exists a matrix $B \in \mathbf{R}^{n \times n}$ such that if we define $Z=B^{-1}(X-\mu)$, then $Z \sim \mathcal{N}(0, I)$.
To understand the meaning of this theorem, note that if $Z \sim \mathcal{N}(0, I)$, then using the analysis from Section 3, $Z$ can be thought of as a collection of $n$ independent standard normal random variables (i.e., $Z_{i} \sim \mathcal{N}(0,1)$). Furthermore, if $Z=B^{-1}(X-\mu)$, then $X=B Z+\mu$ follows from simple algebra.
Consequently, the theorem states that any random variable $X$ with a multivariate Gaussian distribution can be interpreted as the result of applying a linear transformation ($X=B Z+\mu$) to some collection of $n$ independent standard normal random variables ($Z$).
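A minimal sampling sketch based on this interpretation (added here for illustration): factor $\Sigma = BB^{T}$, draw independent standard normals $Z$, and form $X = BZ + \mu$. We use a Cholesky factor for $B$ below, which is one valid choice; the eigendecomposition construction of Appendix A.2 would work equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([3.0, 2.0])
Sigma = np.array([[10.0, 5.0],
                  [5.0, 5.0]])

# Factor Sigma = B B^T (Cholesky here; any such B works)
B = np.linalg.cholesky(Sigma)

# Transform independent standard normals Z into X = B Z + mu
Z = rng.standard_normal(size=(2, 100_000))
X = B @ Z + mu[:, None]

print(np.round(X.mean(axis=1), 2))   # approximately mu
print(np.round(np.cov(X), 2))        # approximately Sigma
```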

6. Appendix A.1

Proof. We prove the first of the two equalities in (1); the proof of the other equality is similar.
$$\begin{align} \Sigma &=\begin{bmatrix} \operatorname{Cov}[X_{1}, X_{1}] & \cdots & \operatorname{Cov}[X_{1}, X_{n}] \\ \vdots & \ddots & \vdots \\ \operatorname{Cov}[X_{n}, X_{1}] & \cdots & \operatorname{Cov}[X_{n}, X_{n}] \end{bmatrix} \nonumber\\ &=\begin{bmatrix} E[(X_{1}-\mu_{1})^{2}] & \cdots & E[(X_{1}-\mu_{1})(X_{n}-\mu_{n})] \\ \vdots & \ddots & \vdots \\ E[(X_{n}-\mu_{n})(X_{1}-\mu_{1})] & \cdots & E[(X_{n}-\mu_{n})^{2}] \end{bmatrix} \nonumber\\ &=E\begin{bmatrix} (X_{1}-\mu_{1})^{2} & \cdots & (X_{1}-\mu_{1})(X_{n}-\mu_{n}) \\ \vdots & \ddots & \vdots \\ (X_{n}-\mu_{n})(X_{1}-\mu_{1}) & \cdots & (X_{n}-\mu_{n})^{2} \end{bmatrix} \tag{6}\\ &=E\left[\begin{bmatrix} X_{1}-\mu_{1} \\ \vdots \\ X_{n}-\mu_{n} \end{bmatrix}\begin{bmatrix} X_{1}-\mu_{1} & \cdots & X_{n}-\mu_{n} \end{bmatrix}\right] \tag{7}\\ &=E[(X-\mu)(X-\mu)^{T}]. \nonumber \end{align}$$
Here, (6) follows from the fact that the expectation of a matrix is simply the matrix found by taking the componentwise expectation of each entry. Also, (7) follows from the fact that for any vector $z \in \mathbf{R}^{n}$,
$$z z^{T}=\begin{bmatrix} z_{1} \\ z_{2} \\ \vdots \\ z_{n} \end{bmatrix}\begin{bmatrix} z_{1} & z_{2} & \cdots & z_{n} \end{bmatrix}=\begin{bmatrix} z_{1} z_{1} & z_{1} z_{2} & \cdots & z_{1} z_{n} \\ z_{2} z_{1} & z_{2} z_{2} & \cdots & z_{2} z_{n} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n} z_{1} & z_{n} z_{2} & \cdots & z_{n} z_{n} \end{bmatrix}.$$

7. Appendix A.2

We restate the theorem below:
Theorem 1. Let $X \sim \mathcal{N}(\mu, \Sigma)$ for some $\mu \in \mathbf{R}^{n}$ and $\Sigma \in \mathbf{S}_{++}^{n}$. Then, there exists a matrix $B \in \mathbf{R}^{n \times n}$ such that if we define $Z=B^{-1}(X-\mu)$, then $Z \sim \mathcal{N}(0, I)$.
The derivation of this theorem requires some advanced linear algebra and probability theory and can be skipped for the purposes of this class. Our argument will consist of two parts. First, we will show that the covariance matrix $\Sigma$ can be factorized as $\Sigma=B B^{T}$ for some invertible matrix $B$. Second, we will perform a "change of variables" from $X$ to a different vector-valued random variable $Z$ using the relation $Z=B^{-1}(X-\mu)$.
Step 1: Factorizing the covariance matrix. Recall the following two properties of symmetric matrices from the notes on linear algebra[5]:
  1. Any real symmetric matrix $A \in \mathbf{R}^{n \times n}$ can always be represented as $A=U \Lambda U^{T}$, where $U$ is a full-rank orthogonal matrix containing the eigenvectors of $A$ as its columns, and $\Lambda$ is a diagonal matrix containing $A$'s eigenvalues.
  2. If $A$ is symmetric positive definite, all its eigenvalues are positive.
Since the covariance matrix $\Sigma$ is positive definite, using the first fact, we can write $\Sigma=U \Lambda U^{T}$ for some appropriately defined matrices $U$ and $\Lambda$. Using the second fact, we can define $\Lambda^{1 / 2} \in \mathbf{R}^{n \times n}$ to be the diagonal matrix whose entries are the square roots of the corresponding entries from $\Lambda$. Since $\Lambda=\Lambda^{1 / 2}(\Lambda^{1 / 2})^{T}$, we have
$$\Sigma=U \Lambda U^{T}=U \Lambda^{1 / 2}(\Lambda^{1 / 2})^{T} U^{T}=U \Lambda^{1 / 2}(U \Lambda^{1 / 2})^{T}=B B^{T},$$
where $B=U \Lambda^{1 / 2}$.[6] In this case, $\Sigma^{-1}=B^{-T} B^{-1}$, so we can rewrite the standard formula for the density of a multivariate Gaussian as
$$p(x ; \mu, \Sigma)=\frac{1}{(2 \pi)^{n / 2}\left|B B^{T}\right|^{1 / 2}} \exp \left(-\frac{1}{2}(x-\mu)^{T} B^{-T} B^{-1}(x-\mu)\right). \tag{8}$$
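As an illustration of Step 1 (not part of the original proof), the factorization $B = U\Lambda^{1/2}$ and the identity $\Sigma^{-1} = B^{-T}B^{-1}$ can be verified numerically for a small example; the particular $\Sigma$ is taken from Figure 2.

```python
import numpy as np

Sigma = np.array([[10.0, 5.0],
                  [5.0, 5.0]])

# Eigendecomposition Sigma = U Lambda U^T
eigvals, U = np.linalg.eigh(Sigma)
Lambda_sqrt = np.diag(np.sqrt(eigvals))

# B = U Lambda^{1/2}, so that B B^T = Sigma and Sigma^{-1} = B^{-T} B^{-1}
B = U @ Lambda_sqrt
print(np.allclose(B @ B.T, Sigma))                                              # True
print(np.allclose(np.linalg.inv(Sigma), np.linalg.inv(B).T @ np.linalg.inv(B)))  # True
```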
Step 2: Change of variables. Now, define the vector-valued random variable $Z=B^{-1}(X-\mu)$. A basic formula of probability theory, which we did not introduce in the section notes on probability theory, is the "change-of-variables" formula for relating vector-valued random variables:
Suppose that $X=[X_{1} \; \cdots \; X_{n}]^{T} \in \mathbf{R}^{n}$ is a vector-valued random variable with joint density function $f_{X}: \mathbf{R}^{n} \rightarrow \mathbf{R}$. If $Z=H(X) \in \mathbf{R}^{n}$, where $H$ is a bijective, differentiable function, then $Z$ has joint density $f_{Z}: \mathbf{R}^{n} \rightarrow \mathbf{R}$, where
$$f_{Z}(z)=f_{X}(x) \cdot\left|\operatorname{det}\left(\begin{bmatrix} \frac{\partial x_{1}}{\partial z_{1}} & \cdots & \frac{\partial x_{1}}{\partial z_{n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial x_{n}}{\partial z_{1}} & \cdots & \frac{\partial x_{n}}{\partial z_{n}} \end{bmatrix}\right)\right|.$$
Using the change-of-variables formula, one can show (after some algebra, which we'll skip) that the vector variable $Z$ has the following joint density:
$$p_{Z}(z)=\frac{1}{(2 \pi)^{n / 2}} \exp \left(-\frac{1}{2} z^{T} z\right). \tag{9}$$
The claim follows immediately.
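For readers who want to see the skipped algebra at work numerically, the following sketch (our own addition, with an arbitrary $\mu$ and $\Sigma$) checks the change-of-variables identity $f_{Z}(z) = f_{X}(Bz+\mu)\cdot|\det B|$ against the standard normal density of Equation (9) at a random point.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([3.0, 2.0])
Sigma = np.array([[10.0, 5.0],
                  [5.0, 5.0]])

eigvals, U = np.linalg.eigh(Sigma)
B = U @ np.diag(np.sqrt(eigvals))          # Sigma = B B^T

def p_X(x):
    """Multivariate Gaussian density p(x; mu, Sigma) for n = 2."""
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / (
        2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

def p_Z(z):
    """Standard normal density of Equation (9) for n = 2."""
    return np.exp(-0.5 * z @ z) / (2 * np.pi)

# Change of variables with x = B z + mu, so dx/dz = B and
# f_Z(z) = f_X(B z + mu) * |det B|.
z = rng.standard_normal(2)
print(p_Z(z), p_X(B @ z + mu) * abs(np.linalg.det(B)))  # equal up to rounding
```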
This is the first of a two-part overview of multivariate Gaussians from Andrew Ng's CS229 course on machine learning.

  1. Recall from the section notes on linear algebra that $\mathbf{S}_{++}^{n}$ is the space of symmetric positive definite $n \times n$ matrices, defined as $\mathbf{S}_{++}^{n}=\{A \in \mathbf{R}^{n \times n}: A=A^{T} \text{ and } x^{T} A x>0 \text{ for all } x \in \mathbf{R}^{n} \text{ such that } x \neq 0\}$ ↩︎
  2. In these notes, we use the notation $p(\bullet)$ to denote density functions, instead of $f_{X}(\bullet)$ (as in the section notes on probability theory). ↩︎
  3. Namely, $\begin{vmatrix}a & b \\ c & d\end{vmatrix}=a d-b c.$ ↩︎
  4. Isocontours are often also known as level curves. More generally, a level set of a function $f: \mathbf{R}^{n} \rightarrow \mathbf{R}$ is a set of the form $\{x \in \mathbf{R}^{n}: f(x)=c\}$ for some $c \in \mathbf{R}$. ↩︎
  5. See section on "Eigenvalues and Eigenvectors of Symmetric Matrices." ↩︎
  6. To show that $B$ is invertible, it suffices to observe that $U$ is an invertible matrix, and right-multiplying $U$ by a diagonal matrix (with no zero diagonal entries) will rescale its columns but will not change its rank. ↩︎
