Linear algebra provides a way of compactly representing and operating on sets of linear equations. For example, consider the following system of equations:
$$4 x_{1}-5 x_{2}=-13$$
$$-2 x_{1}+3 x_{2}=9.$$
This is two equations in two variables, so as you know from high-school algebra, you can find a unique solution for $x_{1}$ and $x_{2}$ (unless the equations are somehow degenerate, for example if the second equation is simply a multiple of the first; in the case above, however, there is in fact a unique solution). In matrix notation, we can write the system more compactly as
$$A x=b, \quad \text{with } A=\left[\begin{array}{rr}4 & -5 \\ -2 & 3\end{array}\right], \quad b=\left[\begin{array}{r}-13 \\ 9\end{array}\right].$$
As we will see shortly, there are many advantages (including the obvious space savings) to analyzing linear equations in this form.
1.1. Basic Notation
We use the following notation:
•By $A \in \mathbb{R}^{m \times n}$ we denote a matrix with $m$ rows and $n$ columns, where the entries of $A$ are real numbers.
•By $x \in \mathbb{R}^{n}$, we denote a vector with $n$ entries. By convention, an $n$-dimensional vector is often thought of as a matrix with $n$ rows and 1 column, known as a column vector. If we want to explicitly represent a row vector – a matrix with 1 row and $n$ columns – we typically write $x^{T}$ (here $x^{T}$ denotes the transpose of $x$, which we will define shortly).
•The $i$th element of a vector $x$ is denoted $x_{i}$:
$$x=\left[\begin{array}{c}x_{1} \\ x_{2} \\ \vdots \\ x_{n}\end{array}\right].$$
•We use the notation $a_{ij}$ (or $A_{ij}$, $A_{i,j}$, etc.) to denote the entry of $A$ in the $i$th row and $j$th column:
$$A=\left[\begin{array}{cccc}a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn}\end{array}\right].$$
•We denote the $j$th column of $A$ by $a^{j}$ or $A_{:,j}$:
$$A=\left[\begin{array}{cccc}| & | & & | \\ a^{1} & a^{2} & \cdots & a^{n} \\ | & | & & |\end{array}\right].$$
•We denote the $i$th row of $A$ by $a_{i}^{T}$ or $A_{i,:}$:
$$A=\left[\begin{array}{ccc}- & a_{1}^{T} & - \\ - & a_{2}^{T} & - \\ & \vdots & \\ - & a_{m}^{T} & -\end{array}\right].$$
•Viewing a matrix as a collection of column or row vectors is very important and convenient in many cases. In general, it is mathematically (and conceptually) cleaner to operate at the level of vectors rather than scalars. There is no universal convention for denoting the columns or rows of a matrix, so feel free to choose your own notation as long as it is explicitly defined.
2. Matrix Multiplication
The product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ is the matrix
$$C=A B \in \mathbb{R}^{m \times p},$$
where
$$C_{ij}=\sum_{k=1}^{n} A_{ik} B_{kj}.$$
Note that in order for the matrix product to exist, the number of columns in $A$ must equal the number of rows in $B$. There are many other ways of looking at matrix multiplication that may be more convenient and insightful than the standard definition above, and we'll start by examining a few special cases.
2.1. Vector-Vector Products
Given two vectors $x, y \in \mathbb{R}^{n}$, the quantity $x^{T} y$, sometimes called the inner product or dot product of the vectors, is a real number given by
$$x^{T} y=\sum_{i=1}^{n} x_{i} y_{i}.$$
Observe that inner products are really just a special case of matrix multiplication. Note that it is always the case that $x^{T} y=y^{T} x$.
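As a quick numerical check, the following NumPy sketch (NumPy and the example values are our choice, not part of the notes) computes an inner product both as a sum of elementwise products and via matrix multiplication:

```python
import numpy as np

# Inner product x^T y: multiply elementwise, then sum.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 2.0])

ip = x @ y  # 1*4 + 2*(-1) + 3*2 = 8
assert np.isclose(ip, np.sum(x * y))
assert np.isclose(ip, y @ x)  # symmetry: x^T y = y^T x
```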
Given vectors $x \in \mathbb{R}^{m}$, $y \in \mathbb{R}^{n}$ (not necessarily of the same size), $x y^{T} \in \mathbb{R}^{m \times n}$ is called the outer product of the vectors. It is a matrix whose entries are given by $\left(x y^{T}\right)_{ij}=x_{i} y_{j}$, i.e.,
$$x y^{T}=\left[\begin{array}{cccc}x_{1} y_{1} & x_{1} y_{2} & \cdots & x_{1} y_{n} \\ x_{2} y_{1} & x_{2} y_{2} & \cdots & x_{2} y_{n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m} y_{1} & x_{m} y_{2} & \cdots & x_{m} y_{n}\end{array}\right].$$
As an example of how the outer product can be useful, let $\mathbf{1} \in \mathbb{R}^{n}$ denote an $n$-dimensional vector whose entries are all equal to $1$. Furthermore, consider the matrix $A \in \mathbb{R}^{m \times n}$ whose columns are all equal to some vector $x \in \mathbb{R}^{m}$. Using outer products, we can represent $A$ compactly as,
$$A=\left[\begin{array}{cccc}| & | & & | \\ x & x & \cdots & x \\ | & | & & |\end{array}\right]=x \mathbf{1}^{T}.$$
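A small NumPy sketch (with hypothetical values of our own) of the outer product and the $x\mathbf{1}^{T}$ construction:

```python
import numpy as np

x = np.array([2.0, 5.0, -1.0])  # x in R^3
y = np.array([1.0, 3.0])        # y in R^2

# (x y^T)_{ij} = x_i * y_j, a 3x2 matrix.
outer = np.outer(x, y)
assert outer.shape == (3, 2)
assert np.isclose(outer[1, 1], x[1] * y[1])

# A matrix whose four columns all equal x, written compactly as x 1^T.
A = np.outer(x, np.ones(4))
assert np.allclose(A[:, 2], x)
```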
2.2. Matrix-Vector Products
Given a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $x \in \mathbb{R}^{n}$, their product is a vector $y=A x \in \mathbb{R}^{m}$. There are a couple of ways of looking at matrix-vector multiplication, and we will look at each of them in turn.
If we write $A$ by rows, then we can express $Ax$ as,
$$y=A x=\left[\begin{array}{ccc}- & a_{1}^{T} & - \\ - & a_{2}^{T} & - \\ & \vdots & \\ - & a_{m}^{T} & -\end{array}\right] x=\left[\begin{array}{c}a_{1}^{T} x \\ a_{2}^{T} x \\ \vdots \\ a_{m}^{T} x\end{array}\right].$$
In other words, the $i$th entry of $y$ is equal to the inner product of the $i$th row of $A$ and $x$, $y_{i}=a_{i}^{T} x$.
Alternatively, let's write $A$ in column form. In this case we see that,
$$y=A x=\left[\begin{array}{cccc}| & | & & | \\ a^{1} & a^{2} & \cdots & a^{n} \\ | & | & & |\end{array}\right]\left[\begin{array}{c}x_{1} \\ x_{2} \\ \vdots \\ x_{n}\end{array}\right]=a^{1} x_{1}+a^{2} x_{2}+\cdots+a^{n} x_{n}.$$
In other words, $y$ is a linear combination of the columns of $A$, where the coefficients of the linear combination are given by the entries of $x$.
So far we have been multiplying on the right by a column vector, but it is also possible to multiply on the left by a row vector. This is written, $y^{T}=x^{T} A$ for $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^{m}$, and $y \in \mathbb{R}^{n}$. As before, we can express $y^{T}$ in two obvious ways, depending on whether we express $A$ in terms of its rows or columns. In the first case we express $A$ in terms of its columns, which gives
$$y^{T}=x^{T} A=x^{T}\left[\begin{array}{cccc}| & | & & | \\ a^{1} & a^{2} & \cdots & a^{n} \\ | & | & & |\end{array}\right]=\left[\begin{array}{cccc}x^{T} a^{1} & x^{T} a^{2} & \cdots & x^{T} a^{n}\end{array}\right],$$
which demonstrates that the $i$th entry of $y^{T}$ is equal to the inner product of $x$ and the $i$th column of $A$. Alternatively, expressing $A$ in terms of its rows gives
$$y^{T}=x^{T} A=\left[\begin{array}{cccc}x_{1} & x_{2} & \cdots & x_{m}\end{array}\right]\left[\begin{array}{ccc}- & a_{1}^{T} & - \\ - & a_{2}^{T} & - \\ & \vdots & \\ - & a_{m}^{T} & -\end{array}\right]=x_{1} a_{1}^{T}+x_{2} a_{2}^{T}+\cdots+x_{m} a_{m}^{T},$$
so we see that $y^{T}$ is a linear combination of the rows of $A$, where the coefficients for the linear combination are given by the entries of $x$.
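Both viewpoints of the matrix-vector product are easy to verify numerically; this NumPy sketch (example values are ours) checks the row view and the column view against `A @ x`:

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [3.0, -1.0, 4.0]])  # A in R^{2x3}
x = np.array([1.0, 2.0, 3.0])

y = A @ x

# Row view: y_i is the inner product of the i-th row of A with x.
y_rows = np.array([A[i, :] @ x for i in range(A.shape[0])])

# Column view: y is a linear combination of the columns of A,
# with coefficients given by the entries of x.
y_cols = sum(x[j] * A[:, j] for j in range(A.shape[1]))

assert np.allclose(y, y_rows)
assert np.allclose(y, y_cols)
```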
2.3. Matrix-Matrix Products
Armed with this knowledge, we can now look at four different (but, of course, equivalent) ways of viewing the matrix-matrix multiplication C=ABC=A B as defined at the beginning of this section.
First, we can view matrix-matrix multiplication as a set of vector-vector products. The most obvious viewpoint, which follows immediately from the definition, is that the $(i, j)$th entry of $C$ is equal to the inner product of the $i$th row of $A$ and the $j$th column of $B$. Symbolically, this looks like the following,
$$C=A B=\left[\begin{array}{cccc}a_{1}^{T} b^{1} & a_{1}^{T} b^{2} & \cdots & a_{1}^{T} b^{p} \\ a_{2}^{T} b^{1} & a_{2}^{T} b^{2} & \cdots & a_{2}^{T} b^{p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m}^{T} b^{1} & a_{m}^{T} b^{2} & \cdots & a_{m}^{T} b^{p}\end{array}\right].$$
Remember that since $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, $a_{i} \in \mathbb{R}^{n}$ and $b^{j} \in \mathbb{R}^{n}$, so these inner products all make sense. This is the most "natural" representation when we represent $A$ by rows and $B$ by columns. Alternatively, we can represent $A$ by columns, and $B$ by rows. This representation leads to a much trickier interpretation of $AB$ as a sum of outer products. Symbolically,
$$C=A B=\sum_{i=1}^{n} a^{i} b_{i}^{T}.$$
Put another way, $AB$ is equal to the sum, over all $i$, of the outer product of the $i$th column of $A$ and the $i$th row of $B$. Since, in this case, $a^{i} \in \mathbb{R}^{m}$ and $b_{i} \in \mathbb{R}^{p}$, the dimension of the outer product $a^{i} b_{i}^{T}$ is $m \times p$, which coincides with the dimension of $C$. Chances are, the last equality above may appear confusing to you. If so, take the time to check it for yourself!
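The sum-of-outer-products identity can be checked directly; here is a NumPy sketch with random matrices (sizes chosen arbitrarily by us):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # m x n
B = rng.standard_normal((3, 5))  # n x p

# AB as the sum over i of (i-th column of A)(i-th row of B)^T,
# each term an outer product of shape m x p.
C_outer = sum(np.outer(A[:, i], B[i, :]) for i in range(A.shape[1]))

assert np.allclose(A @ B, C_outer)
```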
Second, we can also view matrix-matrix multiplication as a set of matrix-vector products. Specifically, if we represent $B$ by columns, we can view the columns of $C$ as matrix-vector products between $A$ and the columns of $B$. Symbolically,
$$C=A B=A\left[\begin{array}{cccc}| & | & & | \\ b_{1} & b_{2} & \cdots & b_{p} \\ | & | & & |\end{array}\right]=\left[\begin{array}{cccc}| & | & & | \\ A b_{1} & A b_{2} & \cdots & A b_{p} \\ | & | & & |\end{array}\right].$$
Here the $i$th column of $C$ is given by the matrix-vector product with the vector on the right, $c_{i}=A b_{i}$. These matrix-vector products can in turn be interpreted using both viewpoints given in the previous subsection. Finally, we have the analogous viewpoint, where we represent $A$ by rows, and view the rows of $C$ as the matrix-vector products between the rows of $A$ and $B$. Symbolically,
$$C=A B=\left[\begin{array}{ccc}- & a_{1}^{T} & - \\ - & a_{2}^{T} & - \\ & \vdots & \\ - & a_{m}^{T} & -\end{array}\right] B=\left[\begin{array}{ccc}- & a_{1}^{T} B & - \\ - & a_{2}^{T} B & - \\ & \vdots & \\ - & a_{m}^{T} B & -\end{array}\right].$$
Here the $i$th row of $C$ is given by the matrix-vector product with the vector on the left, $c_{i}^{T}=a_{i}^{T} B$.
It may seem like overkill to dissect matrix multiplication to such a large degree, especially when all these viewpoints follow immediately from the initial definition we gave (in about a line of math) at the beginning of this section. The direct advantage of these various viewpoints is that they allow you to operate at the level of vectors instead of scalars. To fully understand linear algebra without getting lost in complicated index manipulations, the key is to operate with as large conceptual units as possible.[1]
Virtually all of linear algebra deals with matrix multiplications of some kind, and it is worthwhile to spend some time trying to develop an intuitive understanding of the viewpoints presented here.
In addition to this, it is useful to know a few basic properties of matrix multiplication at a higher level:
•Matrix multiplication is associative: $(A B) C=A(B C)$.
•Matrix multiplication is distributive: $A(B+C)=A B+A C$.
•Matrix multiplication is, in general, not commutative; that is, it can be the case that $A B \neq B A$. (For example, if $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times q}$, the matrix product $B A$ does not even exist if $m$ and $q$ are not equal!)
If you are not familiar with these properties, take the time to verify them for yourself. For example, to check the associativity of matrix multiplication, suppose that $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, and $C \in \mathbb{R}^{p \times q}$. Note that $A B \in \mathbb{R}^{m \times p}$, so $(A B) C \in \mathbb{R}^{m \times q}$. Similarly, $B C \in \mathbb{R}^{n \times q}$, so $A(B C) \in \mathbb{R}^{m \times q}$. Thus, the dimensions of the resulting matrices agree. To show that matrix multiplication is associative, it suffices to check that the $(i, j)$th entry of $(A B) C$ is equal to the $(i, j)$th entry of $A(B C)$. We can verify this directly using the definition of matrix multiplication:
$$\begin{aligned}((A B) C)_{ij} &=\sum_{k=1}^{p}(A B)_{ik} C_{kj}=\sum_{k=1}^{p}\left(\sum_{l=1}^{n} A_{il} B_{lk}\right) C_{kj} \\ &=\sum_{k=1}^{p} \sum_{l=1}^{n} A_{il} B_{lk} C_{kj}=\sum_{l=1}^{n} \sum_{k=1}^{p} A_{il} B_{lk} C_{kj} \\ &=\sum_{l=1}^{n} A_{il}\left(\sum_{k=1}^{p} B_{lk} C_{kj}\right)=\sum_{l=1}^{n} A_{il}(B C)_{lj}=(A(B C))_{ij}.\end{aligned}$$
Here, the first and last two equalities simply use the definition of matrix multiplication, the third and fifth equalities use the distributive property for scalar multiplication over addition, and the fourth equality uses the commutativity and associativity of scalar addition. This technique for proving matrix properties by reduction to simple scalar properties will come up often, so make sure you're familiar with it.
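These properties can also be sanity-checked numerically; a NumPy sketch with arbitrarily chosen shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
D = rng.standard_normal((3, 4))

assert np.allclose((A @ B) @ C, A @ (B @ C))    # associativity
assert np.allclose(A @ (B + D), A @ B + A @ D)  # distributivity
# Commutativity fails in general; here B @ A is not even defined (3x4 times 2x3).
```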
3. Operations and Properties
In this section we present several operations and properties of matrices and vectors. Hopefully a great deal of this will be review for you, so the notes can just serve as a reference for these topics.
3.1. The Identity Matrix and Diagonal Matrices
The identity matrix, denoted $I \in \mathbb{R}^{n \times n}$, is a square matrix with ones on the diagonal and zeros everywhere else. That is,
$$I_{ij}=\begin{cases}1 & i=j \\ 0 & i \neq j.\end{cases}$$
It has the property that for all $A \in \mathbb{R}^{m \times n}$,
$$A I=A=I A.$$
Note that in some sense, the notation for the identity matrix is ambiguous, since it does not specify the dimension of $I$. Generally, the dimensions of $I$ are inferred from context so as to make matrix multiplication possible. For example, in the equation above, the $I$ in $A I=A$ is an $n \times n$ matrix, whereas the $I$ in $A=I A$ is an $m \times m$ matrix.
A diagonal matrix is a matrix where all non-diagonal elements are $0$. This is typically denoted $D=\operatorname{diag}\left(d_{1}, d_{2}, \ldots, d_{n}\right)$, with
$$D_{ij}=\begin{cases}d_{i} & i=j \\ 0 & i \neq j.\end{cases}$$
Clearly, $I=\operatorname{diag}(1,1, \ldots, 1)$.
3.2. The Transpose
The transpose of a matrix results from "flipping" the rows and columns. Given a matrix $A \in \mathbb{R}^{m \times n}$, its transpose, written $A^{T} \in \mathbb{R}^{n \times m}$, is the $n \times m$ matrix whose entries are given by
$$\left(A^{T}\right)_{ij}=A_{ji}.$$
We have in fact already been using the transpose when describing row vectors, since the transpose of a column vector is naturally a row vector.
The following properties of transposes are easily verified:
•$\left(A^{T}\right)^{T}=A$
•$(A B)^{T}=B^{T} A^{T}$
•$(A+B)^{T}=A^{T}+B^{T}$
3.3. Symmetric Matrices
A square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A=A^{T}$. It is anti-symmetric if $A=-A^{T}$. It is easy to show that for any matrix $A \in \mathbb{R}^{n \times n}$, the matrix $A+A^{T}$ is symmetric and the matrix $A-A^{T}$ is anti-symmetric. From this it follows that any square matrix $A \in \mathbb{R}^{n \times n}$ can be represented as a sum of a symmetric matrix and an anti-symmetric matrix, since
$$A=\frac{1}{2}\left(A+A^{T}\right)+\frac{1}{2}\left(A-A^{T}\right),$$
and the first matrix on the right is symmetric, while the second is anti-symmetric. It turns out that symmetric matrices occur a great deal in practice, and they have many nice properties which we will look at shortly. It is common to denote the set of all symmetric matrices of size $n$ as $\mathbb{S}^{n}$, so that $A \in \mathbb{S}^{n}$ means that $A$ is a symmetric $n \times n$ matrix.
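The symmetric/anti-symmetric decomposition is easy to verify in NumPy (random matrix, our own example):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))

sym = 0.5 * (A + A.T)   # symmetric part
anti = 0.5 * (A - A.T)  # anti-symmetric part

assert np.allclose(sym, sym.T)     # symmetric
assert np.allclose(anti, -anti.T)  # anti-symmetric
assert np.allclose(A, sym + anti)  # A is their sum
```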
3.4. The Trace
The trace of a square matrix $A \in \mathbb{R}^{n \times n}$, denoted $\operatorname{tr}(A)$ (or just $\operatorname{tr} A$ if the parentheses are obviously implied), is the sum of diagonal elements in the matrix:
$$\operatorname{tr} A=\sum_{i=1}^{n} A_{ii}.$$
As described in the CS229 lecture notes, the trace has the following properties (included here for the sake of completeness):
•For $A \in \mathbb{R}^{n \times n}$, $\operatorname{tr} A=\operatorname{tr} A^{T}$.
•For $A, B \in \mathbb{R}^{n \times n}$, $\operatorname{tr}(A+B)=\operatorname{tr} A+\operatorname{tr} B$.
•For $A \in \mathbb{R}^{n \times n}$, $t \in \mathbb{R}$, $\operatorname{tr}(t A)=t \operatorname{tr} A$.
•For $A, B$ such that $A B$ is square, $\operatorname{tr} A B=\operatorname{tr} B A$.
•For $A, B, C$ such that $A B C$ is square, $\operatorname{tr} A B C=\operatorname{tr} B C A=\operatorname{tr} C A B$, and so on for the product of more matrices.
As an example of how these properties can be proven, we'll consider the fourth property given above. Suppose that $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$ (so that $A B \in \mathbb{R}^{m \times m}$ is a square matrix). Observe that $B A \in \mathbb{R}^{n \times n}$ is also a square matrix, so it makes sense to apply the trace operator to it. To verify that $\operatorname{tr} A B=\operatorname{tr} B A$, note that
$$\begin{aligned}\operatorname{tr} A B &=\sum_{i=1}^{m}(A B)_{ii}=\sum_{i=1}^{m}\left(\sum_{j=1}^{n} A_{ij} B_{ji}\right) \\ &=\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} B_{ji}=\sum_{j=1}^{n} \sum_{i=1}^{m} B_{ji} A_{ij} \\ &=\sum_{j=1}^{n}\left(\sum_{i=1}^{m} B_{ji} A_{ij}\right)=\sum_{j=1}^{n}(B A)_{jj}=\operatorname{tr} B A.\end{aligned}$$
Here, the first and last two equalities use the definition of the trace operator and matrix multiplication. The fourth equality, where the main work occurs, uses the commutativity of scalar multiplication in order to reverse the order of the terms in each product, and the commutativity and associativity of scalar addition in order to rearrange the order of the summation.
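The trace identities above are straightforward to confirm numerically; a NumPy sketch (shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 4))

# tr AB = tr BA even though AB is 4x4 while BA is 3x3.
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
# Cyclic property for three factors.
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
```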
3.5. Norms
A norm of a vector $\|x\|$ is informally a measure of the "length" of the vector. For example, we have the commonly-used Euclidean or $\ell_{2}$ norm,
$$\|x\|_{2}=\sqrt{\sum_{i=1}^{n} x_{i}^{2}}.$$
Note that $\|x\|_{2}^{2}=x^{T} x$. Other examples of norms are the $\ell_{1}$ norm,
$$\|x\|_{1}=\sum_{i=1}^{n}\left|x_{i}\right|,$$
and the $\ell_{\infty}$ norm,
$$\|x\|_{\infty}=\max _{i}\left|x_{i}\right|.$$
In fact, all three norms presented so far are examples of the family of $\ell_{p}$ norms, which are parameterized by a real number $p \geq 1$, and defined as
$$\|x\|_{p}=\left(\sum_{i=1}^{n}\left|x_{i}\right|^{p}\right)^{1 / p}.$$
Many other norms exist, but they are beyond the scope of this review.
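The common members of the $\ell_p$ family ($p = 1, 2, \infty$) can be computed by hand or with `np.linalg.norm`; a NumPy sketch with a hypothetical vector:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l2 = np.sqrt(np.sum(x ** 2))  # Euclidean (l2) norm = 5
l1 = np.sum(np.abs(x))        # l1 norm = 7
linf = np.max(np.abs(x))      # l-infinity norm = 4

assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(linf, np.linalg.norm(x, np.inf))
assert np.isclose(l2 ** 2, x @ x)  # ||x||_2^2 = x^T x
```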
3.6. Linear Independence and Rank
A set of vectors $\left\{x_{1}, x_{2}, \ldots, x_{n}\right\} \subset \mathbb{R}^{m}$ is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. Conversely, if one vector belonging to the set can be represented as a linear combination of the remaining vectors, then the vectors are said to be (linearly) dependent. That is, if
$$x_{n}=\sum_{i=1}^{n-1} \alpha_{i} x_{i}$$
for some scalar values $\alpha_{1}, \ldots, \alpha_{n-1} \in \mathbb{R}$, then we say that the vectors $x_{1}, \ldots, x_{n}$ are linearly dependent; otherwise, the vectors are linearly independent. For example, the vectors
$$x_{1}=\left[\begin{array}{l}1 \\ 2 \\ 3\end{array}\right] \quad x_{2}=\left[\begin{array}{l}4 \\ 1 \\ 5\end{array}\right] \quad x_{3}=\left[\begin{array}{r}2 \\ -3 \\ -1\end{array}\right]$$
are linearly dependent because $x_{3}=-2 x_{1}+x_{2}$.
The column rank of a matrix $A \in \mathbb{R}^{m \times n}$ is the size of the largest subset of columns of $A$ that constitute a linearly independent set. With some abuse of terminology, this is often referred to simply as the number of linearly independent columns of $A$. In the same way, the row rank is the largest number of rows of $A$ that constitute a linearly independent set.
For any matrix $A \in \mathbb{R}^{m \times n}$, it turns out that the column rank of $A$ is equal to the row rank of $A$ (though we will not prove this), and so both quantities are referred to collectively as the rank of $A$, denoted as $\operatorname{rank}(A)$. The following are some basic properties of the rank:
•For $A \in \mathbb{R}^{m \times n}$, $\operatorname{rank}(A) \leq \min (m, n)$. If $\operatorname{rank}(A)=\min (m, n)$, then $A$ is said to be full rank.
•For $A \in \mathbb{R}^{m \times n}$, $\operatorname{rank}(A)=\operatorname{rank}\left(A^{T}\right)$.
•For $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, $\operatorname{rank}(A B) \leq \min (\operatorname{rank}(A), \operatorname{rank}(B))$.
•For $A, B \in \mathbb{R}^{m \times n}$, $\operatorname{rank}(A+B) \leq \operatorname{rank}(A)+\operatorname{rank}(B)$.
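Ranks can be computed with `np.linalg.matrix_rank`; a sketch with vectors of our own choosing, constructed so that the third column is $-2$ times the first plus the second:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 1.0, 5.0])
x3 = -2 * x1 + x2  # linearly dependent on x1, x2

A = np.column_stack([x1, x2, x3])  # 3x3 with only 2 independent columns

assert np.linalg.matrix_rank(A) == 2
assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(A.T)  # row rank = column rank
B = np.eye(3)
assert np.linalg.matrix_rank(A @ B) <= min(np.linalg.matrix_rank(A),
                                           np.linalg.matrix_rank(B))
```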
3.7. The Inverse of a Square Matrix
The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is denoted $A^{-1}$, and is the unique matrix such that
$$A^{-1} A=I=A A^{-1}.$$
Note that not all matrices have inverses. Non-square matrices, for example, do not have inverses by definition. However, for some square matrices $A$, the inverse may still fail to exist. In particular, we say that $A$ is invertible or non-singular if $A^{-1}$ exists, and non-invertible or singular otherwise.[2]
In order for a square matrix $A$ to have an inverse $A^{-1}$, $A$ must be full rank. We will soon see that there are many alternative necessary and sufficient conditions, in addition to full rank, for invertibility.
The following are properties of the inverse; all assume that $A, B \in \mathbb{R}^{n \times n}$ are non-singular:
•$\left(A^{-1}\right)^{-1}=A$
•$(A B)^{-1}=B^{-1} A^{-1}$
•$\left(A^{-1}\right)^{T}=\left(A^{T}\right)^{-1}$. For this reason this matrix is often denoted $A^{-T}$.
As an example of how the inverse is used, consider the linear system of equations, $A x=b$ where $A \in \mathbb{R}^{n \times n}$ and $x, b \in \mathbb{R}^{n}$. If $A$ is nonsingular (i.e., invertible), then $x=A^{-1} b$.
(What if $A \in \mathbb{R}^{m \times n}$ is not a square matrix? Does this work?)
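A NumPy sketch (with a hypothetical $2 \times 2$ system of our own) of solving $Ax = b$ via the inverse; in practice `np.linalg.solve` is preferred, since it avoids forming $A^{-1}$ explicitly:

```python
import numpy as np

A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])

x = np.linalg.inv(A) @ b  # x = A^{-1} b
assert np.allclose(A @ x, b)

# Same answer without explicitly inverting A (numerically preferable).
assert np.allclose(np.linalg.solve(A, b), x)
```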
3.8. Orthogonal Matrices
Two vectors $x, y \in \mathbb{R}^{n}$ are orthogonal if $x^{T} y=0$. A vector $x \in \mathbb{R}^{n}$ is normalized if $\|x\|_{2}=1$. A square matrix $U \in \mathbb{R}^{n \times n}$ is orthogonal (note the different meanings when talking about vectors versus matrices) if all its columns are orthogonal to each other and are normalized (the columns are then referred to as being orthonormal).
It follows immediately from the definition of orthogonality and normality that
$$U^{T} U=I=U U^{T}.$$
In other words, the inverse of an orthogonal matrix is its transpose. Note that if $U$ is not square – i.e., $U \in \mathbb{R}^{m \times n}$, $n<m$ – but its columns are still orthonormal, then $U^{T} U=I$, but $U U^{T} \neq I$. We generally only use the term orthogonal to describe the previous case, where $U$ is square.
Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will not change its Euclidean norm, i.e.,
$$\|U x\|_{2}=\|x\|_{2}$$
for any $x \in \mathbb{R}^{n}$, $U \in \mathbb{R}^{n \times n}$ orthogonal.
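Rotation matrices are the classic orthogonal example; a NumPy sketch (angle chosen arbitrarily) checking $U^{T}U = UU^{T} = I$ and norm preservation:

```python
import numpy as np

theta = 0.7  # arbitrary rotation angle
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 2x2 rotation: orthogonal

assert np.allclose(U.T @ U, np.eye(2))
assert np.allclose(U @ U.T, np.eye(2))

x = np.array([3.0, -1.0])
assert np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x))  # ||Ux||_2 = ||x||_2
```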
3.9. Range and Nullspace of a Matrix
The span of a set of vectors $\left\{x_{1}, x_{2}, \ldots, x_{n}\right\}$ is the set of all vectors that can be expressed as a linear combination of $\left\{x_{1}, \ldots, x_{n}\right\}$. That is,
$$\operatorname{span}\left(\left\{x_{1}, \ldots, x_{n}\right\}\right)=\left\{v: v=\sum_{i=1}^{n} \alpha_{i} x_{i}, \; \alpha_{i} \in \mathbb{R}\right\}.$$
It can be shown that if $\left\{x_{1}, \ldots, x_{n}\right\}$ is a set of $n$ linearly independent vectors, where each $x_{i} \in \mathbb{R}^{n}$, then $\operatorname{span}\left(\left\{x_{1}, \ldots, x_{n}\right\}\right)=\mathbb{R}^{n}$. In other words, any vector $v \in \mathbb{R}^{n}$ can be written as a linear combination of $x_{1}$ through $x_{n}$. The projection of a vector $y \in \mathbb{R}^{m}$ onto the span of $\left\{x_{1}, \ldots, x_{n}\right\}$ (here we assume $x_{i} \in \mathbb{R}^{m}$) is the vector $v \in \operatorname{span}\left(\left\{x_{1}, \ldots, x_{n}\right\}\right)$ such that $v$ is as close as possible to $y$, as measured by the Euclidean norm $\|v-y\|_{2}$. We denote the projection as $\operatorname{Proj}\left(y ;\left\{x_{1}, \ldots, x_{n}\right\}\right)$ and can define it formally as,
$$\operatorname{Proj}\left(y ;\left\{x_{1}, \ldots, x_{n}\right\}\right)=\operatorname{argmin}_{v \in \operatorname{span}\left(\left\{x_{1}, \ldots, x_{n}\right\}\right)}\|y-v\|_{2}.$$
The range (sometimes also called the columnspace) of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{R}(A)$, is the span of the columns of $A$. In other words,
$$\mathcal{R}(A)=\left\{v \in \mathbb{R}^{m}: v=A x, \; x \in \mathbb{R}^{n}\right\}.$$
Making a few technical assumptions (namely that $A$ is full rank and that $n<m$), the projection of a vector $y \in \mathbb{R}^{m}$ onto the range of $A$ is given by,
$$\operatorname{Proj}(y ; A)=\operatorname{argmin}_{v \in \mathcal{R}(A)}\|v-y\|_{2}=A\left(A^{T} A\right)^{-1} A^{T} y.$$
This last equation should look extremely familiar, since it is almost the same formula we derived in class (and which we will soon derive again) for the least squares estimation of parameters. Looking at the definition for the projection, it should not be too hard to convince yourself that this is in fact the same objective that we minimized in our least squares problem (except for a squaring of the norm, which doesn't affect the optimal point), and so these problems are naturally very connected. When $A$ contains only a single column, $a \in \mathbb{R}^{m}$, this gives the special case for a projection of a vector onto a line:
$$\operatorname{Proj}(y ; a)=\frac{a a^{T}}{a^{T} a} y.$$
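The projection formula can be checked numerically: the residual $y - \operatorname{Proj}(y; A)$ should be orthogonal to every column of $A$, and projecting twice should change nothing. A NumPy sketch (random full-rank $A$ with $n < m$):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 2))  # full rank with probability 1, n < m
y = rng.standard_normal(5)

P = A @ np.linalg.inv(A.T @ A) @ A.T  # projection matrix onto R(A)
v = P @ y

assert np.allclose(A.T @ (y - v), 0.0)  # residual orthogonal to columns of A
assert np.allclose(P @ v, v)            # P is idempotent: projecting twice = once
```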
The nullspace of a matrix $A \in \mathbb{R}^{m \times n}$, denoted $\mathcal{N}(A)$, is the set of all vectors that equal $0$ when multiplied by $A$, i.e.,
$$\mathcal{N}(A)=\left\{x \in \mathbb{R}^{n}: A x=0\right\}.$$
Note that vectors in $\mathcal{R}(A)$ are of size $m$, while vectors in $\mathcal{N}(A)$ are of size $n$, so vectors in $\mathcal{R}\left(A^{T}\right)$ and $\mathcal{N}(A)$ are both in $\mathbb{R}^{n}$. In fact, we can say much more. It turns out that
$$\left\{w: w=u+v, u \in \mathcal{R}\left(A^{T}\right), v \in \mathcal{N}(A)\right\}=\mathbb{R}^{n} \quad \text{and} \quad \mathcal{R}\left(A^{T}\right) \cap \mathcal{N}(A)=\{\mathbf{0}\}.$$
In other words, $\mathcal{R}\left(A^{T}\right)$ and $\mathcal{N}(A)$ are disjoint subsets that together span the entire space of $\mathbb{R}^{n}$. Sets of this type are called orthogonal complements, and we denote this $\mathcal{R}\left(A^{T}\right)=\mathcal{N}(A)^{\perp}$.
3.10. The Determinant
The determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ is a function $\det: \mathbb{R}^{n \times n} \rightarrow \mathbb{R}$, and is denoted $|A|$ or $\det A$ (like the trace operator, we usually omit parentheses). Algebraically, one could write down an explicit formula for the determinant of $A$, but this unfortunately gives little intuition about its meaning. Instead, we'll start out by providing a geometric interpretation of the determinant and then visit some of its specific algebraic properties afterwards.
To begin, given a matrix $A \in \mathbb{R}^{n \times n}$, consider the set of points $S \subset \mathbb{R}^{n}$ formed by taking all possible linear combinations of the row vectors $a_{1}, \ldots, a_{n} \in \mathbb{R}^{n}$ of $A$, where the coefficients of the linear combination are all between 0 and 1; that is, the set $S$ is the restriction of $\operatorname{span}\left(\left\{a_{1}, \ldots, a_{n}\right\}\right)$ to only those linear combinations whose coefficients $\alpha_{1}, \ldots, \alpha_{n}$ satisfy $0 \leq \alpha_{i} \leq 1$, $i=1, \ldots, n$. Formally,
$$S=\left\{v \in \mathbb{R}^{n}: v=\sum_{i=1}^{n} \alpha_{i} a_{i} \text{ where } 0 \leq \alpha_{i} \leq 1, \; i=1, \ldots, n\right\}.$$
For example, consider the $2 \times 2$ matrix
$$A=\left[\begin{array}{ll}1 & 3 \\ 3 & 2\end{array}\right]. \tag{4}$$
The set $S$ corresponding to these rows is shown in Figure 1. For two-dimensional matrices, $S$ generally has the shape of a parallelogram. In our example, the value of the determinant is $|A|=-7$ (as can be computed using the formulas shown later in this section), so the area of the parallelogram is $7$. (Verify this for yourself!)
In three dimensions, the set $S$ corresponds to an object known as a parallelepiped (a three-dimensional box with skewed sides, such that every face has the shape of a parallelogram). The absolute value of the determinant of the $3 \times 3$ matrix whose rows define $S$ gives the three-dimensional volume of the parallelepiped. In even higher dimensions, the set $S$ is an object known as an $n$-dimensional parallelotope.
Figure 1: Illustration of the determinant for the $2 \times 2$ matrix $A$ given in (4). Here, $a_{1}$ and $a_{2}$ are vectors corresponding to the rows of $A$, and the set $S$ corresponds to the shaded region (i.e., the parallelogram). The absolute value of the determinant, $|\det A|=7$, is the area of the parallelogram.
Algebraically, the determinant satisfies the following three properties (from which all other properties follow, including the general formula):
•The determinant of the identity is 1, $|I|=1$. (Geometrically, the volume of a unit hypercube is 1.)
•Given a matrix $A \in \mathbb{R}^{n \times n}$, if we multiply a single row in $A$ by a scalar $t \in \mathbb{R}$, then the determinant of the new matrix is $t|A|$,
$$\left|\left[\begin{array}{ccc}- & t a_{1}^{T} & - \\ - & a_{2}^{T} & - \\ & \vdots & \\ - & a_{n}^{T} & -\end{array}\right]\right|=t|A|.$$
(Geometrically, multiplying one of the sides of the set $S$ by a factor $t$ causes the volume to increase by a factor $t$.)
•If we exchange any two rows $a_{i}^{T}$ and $a_{j}^{T}$ of $A$, then the determinant of the new matrix is $-|A|$, for example
$$\left|\left[\begin{array}{ccc}- & a_{2}^{T} & - \\ - & a_{1}^{T} & - \\ & \vdots & \\ - & a_{n}^{T} & -\end{array}\right]\right|=-|A|.$$
In case you are wondering, it is not immediately obvious that a function satisfying the above three properties exists. In fact, though, such a function does exist, and is unique (which we will not prove here).
Several properties that follow from the three properties above include:
•For $A \in \mathbb{R}^{n \times n}$, $|A|=\left|A^{T}\right|$.
•For $A, B \in \mathbb{R}^{n \times n}$, $|A B|=|A||B|$.
•For $A \in \mathbb{R}^{n \times n}$, $|A|=0$ if and only if $A$ is singular (i.e., non-invertible). (If $A$ is singular then it does not have full rank, and hence its columns are linearly dependent. In this case, the set $S$ corresponds to a "flat sheet" within the $n$-dimensional space and hence has zero volume.)
•For $A \in \mathbb{R}^{n \times n}$ and $A$ non-singular, $\left|A^{-1}\right|=1 /|A|$.
Before giving the general definition for the determinant, we define, for $A \in \mathbb{R}^{n \times n}$, $A_{\backslash i, \backslash j} \in \mathbb{R}^{(n-1) \times(n-1)}$ to be the matrix that results from deleting the $i$th row and $j$th column from $A$. The general (recursive) formula for the determinant is
$$\begin{aligned}|A| &=\sum_{i=1}^{n}(-1)^{i+j} a_{ij}\left|A_{\backslash i, \backslash j}\right| \quad (\text{for any } j \in 1, \ldots, n) \\ &=\sum_{j=1}^{n}(-1)^{i+j} a_{ij}\left|A_{\backslash i, \backslash j}\right| \quad (\text{for any } i \in 1, \ldots, n)\end{aligned}$$
with the initial case that $|A|=a_{11}$ for $A \in \mathbb{R}^{1 \times 1}$. If we were to expand this formula completely for $A \in \mathbb{R}^{n \times n}$, there would be a total of $n!$ ($n$ factorial) different terms. For this reason, we hardly ever explicitly write the complete equation of the determinant for matrices bigger than $3 \times 3$. However, the equations for determinants of matrices up to size $3 \times 3$ are fairly common, and it is good to know them:
$$\left|\left[a_{11}\right]\right|=a_{11}$$
$$\left|\left[\begin{array}{ll}a_{11} & a_{12} \\ a_{21} & a_{22}\end{array}\right]\right|=a_{11} a_{22}-a_{12} a_{21}$$
$$\left|\left[\begin{array}{lll}a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33}\end{array}\right]\right|=a_{11} a_{22} a_{33}+a_{12} a_{23} a_{31}+a_{13} a_{21} a_{32}-a_{11} a_{23} a_{32}-a_{12} a_{21} a_{33}-a_{13} a_{22} a_{31}$$
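The recursive formula translates directly into code. A NumPy sketch of cofactor expansion along the first row (fine for small matrices, though since it has $O(n!)$ terms it is never used for large ones):

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        # Minor: A with row 0 and column j deleted.
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)
        total += (-1) ** j * A[0, j] * det_cofactor(minor)
    return total

A = np.array([[1.0, 3.0],
              [3.0, 2.0]])
assert np.isclose(det_cofactor(A), -7.0)  # 1*2 - 3*3 = -7
assert np.isclose(det_cofactor(A), np.linalg.det(A))
```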
The classical adjoint (often just called the adjoint) of a matrix $A \in \mathbb{R}^{n \times n}$ is denoted $\operatorname{adj}(A)$, and defined as
$$\operatorname{adj}(A) \in \mathbb{R}^{n \times n}, \quad(\operatorname{adj}(A))_{ij}=(-1)^{i+j}\left|A_{\backslash j, \backslash i}\right|$$
(note the switch in the indices $A_{\backslash j, \backslash i}$). It can be shown that for any nonsingular $A \in \mathbb{R}^{n \times n}$,
$$A^{-1}=\frac{1}{|A|} \operatorname{adj}(A).$$
While this is a nice "explicit" formula for the inverse of a matrix, we should note that, numerically, there are in fact much more efficient ways of computing the inverse.
3.11. Quadratic Forms and Positive Semidefinite Matrices
Given a square matrix $A \in \mathbb{R}^{n \times n}$ and a vector $x \in \mathbb{R}^{n}$, the scalar value $x^{T} A x$ is called a quadratic form. Written explicitly, we see that
$$x^{T} A x=\sum_{i=1}^{n} x_{i}(A x)_{i}=\sum_{i=1}^{n} x_{i}\left(\sum_{j=1}^{n} A_{ij} x_{j}\right)=\sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_{i} x_{j}.$$
Note that
$$x^{T} A x=\left(x^{T} A x\right)^{T}=x^{T} A^{T} x=x^{T}\left(\frac{1}{2} A+\frac{1}{2} A^{T}\right) x,$$
where the first equality follows from the fact that the transpose of a scalar is equal to itself, and the second equality follows from the fact that we are averaging two quantities which are themselves equal. From this, we can conclude that only the symmetric part of $A$ contributes to the quadratic form. For this reason, we often implicitly assume that the matrices appearing in a quadratic form are symmetric.
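A NumPy check (random data of our own) that only the symmetric part of $A$ contributes to $x^{T}Ax$:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))  # not symmetric in general
x = rng.standard_normal(4)

sym = 0.5 * (A + A.T)   # symmetric part of A
anti = 0.5 * (A - A.T)  # anti-symmetric part

# The quadratic form sees only the symmetric part.
assert np.isclose(x @ A @ x, x @ sym @ x)
# The anti-symmetric part contributes exactly zero.
assert np.isclose(x @ anti @ x, 0.0)
```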
We give the following definitions:
•A symmetric matrix $A \in \mathbb{S}^{n}$ is positive definite (PD) if for all non-zero vectors $x \in \mathbb{R}^{n}$, $x^{T} A x>0$. This is usually denoted $A \succ 0$ (or just $A>0$), and oftentimes the set of all positive definite matrices is denoted $\mathbb{S}_{++}^{n}$.
•A symmetric matrix $A \in \mathbb{S}^{n}$ is positive semidefinite (PSD) if for all vectors $x \in \mathbb{R}^{n}$, $x^{T} A x \geq 0$. This is written $A \succeq 0$ (or just $A \geq 0$), and the set of all positive semidefinite matrices is often denoted $\mathbb{S}_{+}^{n}$.
•Likewise, a symmetric matrix $A \in \mathbb{S}^{n}$ is negative definite (ND), denoted $A \prec 0$ (or just $A<0$), if for all non-zero $x \in \mathbb{R}^{n}$, $x^{T} A x<0$.
•Similarly, a symmetric matrix $A \in \mathbb{S}^{n}$ is negative semidefinite (NSD), denoted $A \preceq 0$ (or just $A \leq 0$), if for all $x \in \mathbb{R}^{n}$, $x^{T} A x \leq 0$.
•Finally, a symmetric matrix $A \in \mathbb{S}^{n}$ is indefinite if it is neither positive semidefinite nor negative semidefinite – i.e., if there exist $x_{1}, x_{2} \in \mathbb{R}^{n}$ such that $x_{1}^{T} A x_{1}>0$ and $x_{2}^{T} A x_{2}<0$.
It should be obvious that if AA is positive definite, then -A-A is negative definite and vice versa. Likewise, if AA is positive semidefinite then -A-A is negative semidefinite and vice versa. If AA is indefinite, then so is -A-A.
One important property of positive definite and negative definite matrices is that they are always full rank, and hence, invertible. To see why this is the case, suppose that some matrix A inR^(n xx n)A \in \mathbb{R}^{n \times n} is not full rank. Then, suppose that the jjth column of AA is expressible as a linear combination of the other n-1n-1 columns:
a_{j}=\sum_{i \neq j} x_{i} a_{i},
for some scalars x_{1}, \ldots, x_{j-1}, x_{j+1}, \ldots, x_{n} \in \mathbb{R}. Setting x_{j}=-1, we then have A x=\sum_{i=1}^{n} x_{i} a_{i}=0.
But this implies x^(T)Ax=0x^{T} A x=0 for some non-zero vector xx, so AA must be neither positive definite nor negative definite. Therefore, if AA is either positive definite or negative definite, it must be full rank.
Finally, there is one type of positive definite matrix that comes up frequently, and so deserves some special mention. Given any matrix A inR^(m xx n)A \in \mathbb{R}^{m \times n} (not necessarily symmetric or even square), the matrix G=A^(T)AG=A^{T} A (sometimes called a Gram matrix) is always positive semidefinite. Further, if m >= nm \geq n (and we assume for convenience that AA is full rank), then G=A^(T)AG=A^{T} A is positive definite.
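This is easy to verify numerically. The following sketch (using numpy; the particular random matrix is just an illustration) checks that the Gram matrix is symmetric and that its quadratic form equals ||Ax||₂², which is never negative:

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary tall matrix (m >= n); any real matrix works for the PSD claim.
A = rng.standard_normal((5, 3))
G = A.T @ A  # Gram matrix

assert np.allclose(G, G.T)  # G is symmetric

# For any x, x^T G x = (Ax)^T (Ax) = ||Ax||_2^2 >= 0.
x = rng.standard_normal(3)
quad = x @ G @ x
assert np.allclose(quad, np.linalg.norm(A @ x) ** 2)
assert quad >= 0
```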
3.12. Eigenvalues and Eigenvectors
Given a square matrix A inR^(n xx n)A \in \mathbb{R}^{n \times n}, we say that lambda inC\lambda \in \mathbb{C} is an eigenvalue of AA and x inC^(n)x \in \mathbb{C}^{n} is the corresponding eigenvector[4] if
Ax=lambda x,quad x!=0.A x=\lambda x, \quad x \neq 0.
Intuitively, this definition means that multiplying AA by the vector xx results in a new vector that points in the same direction as xx, but scaled by a factor lambda\lambda. Also note that for any eigenvector x inC^(n)x \in \mathbb{C}^{n} and scalar c inC,A(cx)=cAx=c lambda x=lambda(cx)c \in \mathbb{C}, A(c x)=c A x=c \lambda x=\lambda(c x), so cxc x is also an eigenvector. For this reason, when we talk about "the" eigenvector associated with lambda\lambda, we usually assume that the eigenvector is normalized to have length 11 (this still creates some ambiguity, since xx and -x-x will both be eigenvectors, but we will have to live with this).
We can rewrite the equation above to state that (lambda,x)(\lambda, x) is an eigenvalue-eigenvector pair of AA if,
(lambda I-A)x=0,quad x!=0.(\lambda I-A) x=0, \quad x \neq 0 .
But (lambda I-A)x=0(\lambda I-A) x=0 has a non-zero solution xx if and only if (lambda I-A)(\lambda I-A) has a non-trivial nullspace, which is only the case if (lambda I-A)(\lambda I-A) is singular, i.e.,
|(lambda I-A)|=0.|(\lambda I-A)|=0 .
We can now use the previous definition of the determinant to expand this expression |(lambda I-A)||(\lambda I-A)| into a (very large) polynomial in lambda\lambda of degree nn. This polynomial is called the characteristic polynomial of the matrix AA.
We then find the nn (possibly complex) roots of this characteristic polynomial and denote them by lambda_(1),dots,lambda_(n)\lambda_{1}, \ldots, \lambda_{n}. These are all the eigenvalues of the matrix AA, but we note that they may not be distinct. To find the eigenvector corresponding to the eigenvalue lambda_(i)\lambda_{i}, we simply solve the linear equation (lambda_(i)I-A)x=0\left(\lambda_{i} I-A\right) x=0, which is guaranteed to have a non-zero solution because lambda_(i)I-A\lambda_{i} I-A is singular (though there may be infinitely many such solutions).
It should be noted that this is not the method which is actually used in practice to numerically compute the eigenvalues and eigenvectors (remember that the complete expansion of the determinant has n!n! terms); it is rather a mathematical argument.
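For completeness, here is how one might check the eigenvalue-eigenvector relation with a standard library routine (a small numpy sketch; the 2×2 matrix is an arbitrary example):

```python
import numpy as np

# An arbitrary 2x2 example; numpy uses an iterative algorithm internally,
# not the characteristic-polynomial expansion described above.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, V = np.linalg.eig(A)

# Each column V[:, i] is an eigenvector: A v = lambda_i v.
for i in range(len(lam)):
    assert np.allclose(A @ V[:, i], lam[i] * V[:, i])

# The eigenvalues are the roots of lambda^2 - 5*lambda + 5 = 0.
assert np.allclose(np.sort(lam), [(5 - np.sqrt(5)) / 2, (5 + np.sqrt(5)) / 2])
```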
The following are properties of eigenvalues and eigenvectors (in all cases assume A inR^(n xx n)A \in \mathbb{R}^{n \times n} has eigenvalues lambda_(1),dots,lambda_(n)\lambda_{1}, \ldots, \lambda_{n}):
•The trace of AA is equal to the sum of its eigenvalues,
tr A=sum_(i=1)^(n)lambda_(i).\operatorname{tr} A=\sum_{i=1}^{n} \lambda_{i} .
•The determinant of AA is equal to the product of its eigenvalues,
|A|=prod_(i=1)^(n)lambda_(i).|A|=\prod_{i=1}^{n} \lambda_{i} .
•The rank of AA is equal to the number of non-zero eigenvalues of AA (this holds whenever AA is diagonalizable; in general the rank is at least the number of non-zero eigenvalues).
•Suppose AA is non-singular with eigenvalue lambda\lambda and an associated eigenvector xx. Then 1//lambda1 / \lambda is an eigenvalue of A^(-1)A^{-1} with an associated eigenvector xx, i.e., A^(-1)x=(1//lambda)xA^{-1} x=(1 / \lambda) x. (To prove this, take the eigenvector equation, Ax=lambda xA x=\lambda x and left-multiply each side by A^(-1)A^{-1}.)
•The eigenvalues of a diagonal matrix D=diag(d_(1),dotsd_(n))D=\operatorname{diag}\left(d_{1}, \ldots d_{n}\right) are just the diagonal entries d_(1),dotsd_(n)d_{1}, \ldots d_{n}.
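A quick numerical sanity check of the trace and determinant properties (numpy sketch with an arbitrary random matrix; `eigvals` may return complex values, whose imaginary parts cancel in the sum and product):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))       # an arbitrary (non-symmetric) matrix
lam = np.linalg.eigvals(A)            # possibly complex eigenvalues

# Trace equals the sum of eigenvalues (imaginary parts cancel in conjugate pairs).
assert np.allclose(np.trace(A), lam.sum().real)

# Determinant equals the product of eigenvalues.
assert np.allclose(np.linalg.det(A), lam.prod().real)
```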
3.13. Eigenvalues and Eigenvectors of Symmetric Matrices
The structure of the eigenvalues and eigenvectors of a general square matrix can be subtle to characterize. Fortunately, in most cases in machine learning, it suffices to deal with symmetric real matrices, whose eigenvalues and eigenvectors have remarkable properties.
Throughout this section, let's assume that AA is a symmetric real matrix. We have the following properties:
All eigenvalues of AA are real numbers. We denote them by lambda_(1),dots,lambda_(n)\lambda_{1}, \ldots, \lambda_{n}.
There exists a set of eigenvectors u_(1),dots,u_(n)u_{1}, \ldots, u_{n} such that a) for all i,u_(i)i, u_{i} is an eigenvector with eigenvalue lambda_(i)\lambda_{i} and b) u_(1),dots,u_(n)u_{1}, \ldots, u_{n} are unit vectors and orthogonal to each other.[5]
Let UU be the orthonormal matrix that contains u_(i)u_{i}'s as columns:[6]
U=\left[\begin{array}{cccc}| & | & & | \\ u_{1} & u_{2} & \cdots & u_{n} \\ | & | & & |\end{array}\right].
Let Lambda=diag(lambda_(1),dots,lambda_(n))\Lambda=\operatorname{diag}\left(\lambda_{1}, \ldots, \lambda_{n}\right) be the diagonal matrix that contains lambda_(1),dots,lambda_(n)\lambda_{1}, \ldots, \lambda_{n} as entries on the diagonal. Using the view of matrix-matrix multiplication in equation (2) of Section 2.3, we can verify that
A U=\left[\begin{array}{cccc}| & | & & | \\ A u_{1} & A u_{2} & \cdots & A u_{n} \\ | & | & & |\end{array}\right]=\left[\begin{array}{cccc}| & | & & | \\ \lambda_{1} u_{1} & \lambda_{2} u_{2} & \cdots & \lambda_{n} u_{n} \\ | & | & & |\end{array}\right]=U \Lambda.
Recalling that an orthonormal matrix UU satisfies UU^(T)=IU U^{T}=I, and using the equation above, we have
\begin{equation}
A=A U U^{T}=U \Lambda U^{T} \tag{6}
\end{equation}
This representation of AA as U LambdaU^(T)U \Lambda U^{T} is often called the diagonalization of the matrix AA. The term diagonalization comes from the fact that with this representation, we can often effectively treat a symmetric matrix AA as a diagonal matrix – which is much easier to understand – w.r.t. the basis defined by the eigenvectors UU. We elaborate on this below with several examples.
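As a concrete sketch (using numpy's `eigh`, which is specialized to symmetric matrices; the random symmetric matrix is just an illustration), we can verify the diagonalization A = UΛU^T:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                     # symmetrize to get a symmetric real matrix

# eigh is specialized for symmetric matrices: real eigenvalues, orthonormal U.
lam, U = np.linalg.eigh(A)
Lam = np.diag(lam)

assert np.allclose(U @ U.T, np.eye(4))     # U U^T = I
assert np.allclose(A, U @ Lam @ U.T)       # A = U Lambda U^T
```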
Background: representing a vector w.r.t. another basis. Any orthonormal matrix U=[[|,|,,|],[u_(1),u_(2),cdots,u_(n)],[|,|,,|]]U=\left[\begin{array}{cccc}| & | & & | \\ u_{1} & u_{2} & \cdots & u_{n} \\ | & | & & |\end{array}\right] defines a new basis (coordinate system) of R^(n)\mathbb{R}^{n} in the following sense: any vector x inR^(n)x \in \mathbb{R}^{n} can be represented as a linear combination of u_(1),dots,u_(n)u_{1}, \ldots, u_{n} with coefficients hat(x)_(1),dots, hat(x)_(n)\hat{x}_{1}, \ldots, \hat{x}_{n}:
x=\hat{x}_{1} u_{1}+\cdots+\hat{x}_{n} u_{n}=U \hat{x}.
Multiplying both sides by U^{T} (and using U^{T} U=I) gives \hat{x}=U^{T} x. In other words, the vector hat(x)=U^(T)x\hat{x}=U^{T} x can serve as another representation of the vector xx w.r.t. the basis defined by UU.
"Diagonalizing" matrix-vector multiplication. With the setup above, we will see that left-multiplying matrix AA can be viewed as left-multiplying a diagonal matrix w.r.t the basic of the eigenvectors. Suppose xx is a vector and hat(x)\hat{x} is its representation w.r.t to the basis of UU. Let z=Axz=A x be the matrix-vector product. Now let's compute the representation zz w.r.t the basis of UU:
Then, again using the fact that UU^(T)=U^(T)U=IU U^{T}=U^{T} U=I and equation (6), we have that
We see that left-multiplying matrix AA in the original space is equivalent to left-multiplying the diagonal matrix Lambda\Lambda w.r.t the new basis, which is merely scaling each coordinate by the corresponding eigenvalue.
Under the new basis, multiplying a matrix multiple times becomes much simpler as well. For example, suppose q=AAAxq=A A A x. Deriving the analytical form of qq in terms of the entries of AA may be a nightmare under the original basis, but it is much easier under the new one:
\begin{equation}
\hat{q}=U^{T} q=U^{T} A A A x=U^{T} U \Lambda U^{T} U \Lambda U^{T} U \Lambda U^{T} x=\Lambda^{3} \hat{x}=\left[\begin{array}{c}
\lambda_{1}^{3} \hat{x}_{1} \\
\lambda_{2}^{3} \hat{x}_{2} \\
\vdots \\
\lambda_{n}^{3} \hat{x}_{n}
\end{array}\right] \tag{7}
\end{equation}
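A numerical sketch of this computation (numpy, with an arbitrary random symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                     # arbitrary symmetric matrix
lam, U = np.linalg.eigh(A)

x = rng.standard_normal(4)
x_hat = U.T @ x                       # representation of x in the eigenbasis

q = A @ A @ A @ x                     # three multiplications in the original basis
q_hat = lam ** 3 * x_hat              # just scale each coordinate by lambda_i^3

assert np.allclose(U.T @ q, q_hat)
```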
"Diagonalizing" quadratic form. As a directly corollary, the quadratic form x^(T)Axx^{T} A x can also be simplified under the new basis
\begin{equation}
x^{T} A x=x^{T} U \Lambda U^{T} x=\hat{x}^{T} \Lambda \hat{x}=\sum_{i=1}^{n} \lambda_{i} \hat{x}_{i}^{2} \tag{8}
\end{equation}
(Recall that with the old representation, x^(T)Ax=sum_(i=1,j=1)^(n)x_(i)x_(j)A_(ij)x^{T} A x=\sum_{i=1, j=1}^{n} x_{i} x_{j} A_{i j} involves a sum of n^(2)n^{2} terms instead of nn terms in the equation above.) With this viewpoint, we can also show that the definiteness of the matrix AA depends entirely on the sign of its eigenvalues:
If all lambda_(i) > 0\lambda_{i}>0, then the matrix AA is positive definite because x^(T)Ax=sum_(i=1)^(n)lambda_(i) hat(x)_(i)^(2) > 0x^{T} A x=\sum_{i=1}^{n} \lambda_{i} \hat{x}_{i}^{2}>0 for any hat(x)!=0\hat{x} \neq 0.[7]
If all lambda_(i) >= 0\lambda_{i} \geq 0, it is positive semidefinite because x^(T)Ax=sum_(i=1)^(n)lambda_(i) hat(x)_(i)^(2) >= 0x^{T} A x=\sum_{i=1}^{n} \lambda_{i} \hat{x}_{i}^{2} \geq 0 for all hat(x)\hat{x}.
Likewise, if all lambda_(i) < 0\lambda_{i}<0 or lambda_(i) <= 0\lambda_{i} \leq 0, then AA is negative definite or negative semidefinite respectively.
Finally, if AA has both positive and negative eigenvalues, say lambda_(i) > 0\lambda_{i}>0 and lambda_(j) < 0\lambda_{j}<0, then it is indefinite. This is because if we let hat(x)\hat{x} satisfy hat(x)_(i)=1\hat{x}_{i}=1 and hat(x)_(k)=0,AA k!=i\hat{x}_{k}=0, \forall k \neq i, then x^(T)Ax=lambda_(i) > 0x^{T} A x=\lambda_{i}>0. Similarly, if we let hat(x)\hat{x} satisfy hat(x)_(j)=1\hat{x}_{j}=1 and hat(x)_(k)=0,AA k!=j\hat{x}_{k}=0, \forall k \neq j, then x^(T)Ax=lambda_(j) < 0x^{T} A x=\lambda_{j}<0.[8]
An application where eigenvalues and eigenvectors come up frequently is in maximizing some function of a matrix. In particular, for a matrix A inS^(n)A \in \mathbb{S}^{n}, consider the following maximization problem,
\begin{equation}
\max _{x \in \mathbb{R}^{n}} x^{T} A x \quad \text{subject to }\|x\|_{2}^{2}=1 \tag{9}
\end{equation}
i.e., we want to find the vector (of norm 11) which maximizes the quadratic form. Assuming the eigenvalues are ordered as lambda_(1) >= lambda_(2) >= dots >= lambda_(n)\lambda_{1} \geq \lambda_{2} \geq \ldots \geq \lambda_{n}, the optimal value of this optimization problem is lambda_(1)\lambda_{1} and any eigenvector u_(1)u_{1} corresponding to lambda_(1)\lambda_{1} is one of the maximizers. (If lambda_(1) > lambda_(2)\lambda_{1}>\lambda_{2}, then there is a unique eigenvector corresponding to eigenvalue lambda_(1)\lambda_{1}, which is the unique maximizer of the optimization problem (9).)
We can show this by using the diagonalization technique: Note that ||x||_(2)=|| hat(x)||_(2)\|x\|_{2}=\|\hat{x}\|_{2} by equation (3), and using equation (8), we can rewrite the optimization (9) as
\max _{\hat{x} \in \mathbb{R}^{n}} \sum_{i=1}^{n} \lambda_{i} \hat{x}_{i}^{2} \quad \text{subject to }\|\hat{x}\|_{2}^{2}=1,
and for any feasible \hat{x} we have \sum_{i=1}^{n} \lambda_{i} \hat{x}_{i}^{2} \leq \lambda_{1} \sum_{i=1}^{n} \hat{x}_{i}^{2}=\lambda_{1}.
Moreover, setting \hat{x}=\left[\begin{array}{c}1 \\ 0 \\ \vdots \\ 0\end{array}\right] achieves the equality in the equation above, and this corresponds to setting x=u_(1)x=u_{1}.
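The claim that λ₁ is the optimal value can be checked numerically (a numpy sketch; the random symmetric matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                     # arbitrary symmetric matrix
lam, U = np.linalg.eigh(A)            # eigh sorts eigenvalues in ascending order

lam_max = lam[-1]
u_max = U[:, -1]                      # unit-norm eigenvector of the largest eigenvalue

# The top eigenvector attains the value lambda_1 ...
assert np.isclose(u_max @ A @ u_max, lam_max)

# ... and no unit vector exceeds it.
for _ in range(100):
    x = rng.standard_normal(5)
    x /= np.linalg.norm(x)
    assert x @ A @ x <= lam_max + 1e-9
```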
4. Matrix Calculus
While the topics in the previous sections are typically covered in a standard course on linear algebra, one topic that does not seem to be covered very often (and which we will use extensively) is the extension of calculus to the vector setting. Despite the fact that all the actual calculus we use is relatively trivial, the notation can often make things look much more difficult than they are. In this section we present some basic definitions of matrix calculus and provide a few examples.
4.1. The Gradient
Suppose that f:R^(m xx n)rarrRf: \mathbb{R}^{m \times n} \rightarrow \mathbb{R} is a function that takes as input a matrix AA of size m xx nm \times n and returns a real value. Then the gradient of ff (with respect to A inR^(m xx n)A \in \mathbb{R}^{m \times n} ) is the matrix of partial derivatives, defined as:
\nabla_{A} f(A) \in \mathbb{R}^{m \times n}=\left[\begin{array}{cccc}\frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1 n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{m 1}} & \frac{\partial f(A)}{\partial A_{m 2}} & \cdots & \frac{\partial f(A)}{\partial A_{m n}}\end{array}\right],
i.e., an m \times n matrix with \left(\nabla_{A} f(A)\right)_{i j}=\frac{\partial f(A)}{\partial A_{i j}}.
Note that the size of grad_(A)f(A)\nabla_{A} f(A) is always the same as the size of AA. So if, in particular, AA is just a vector x inR^(n)x \in \mathbb{R}^{n},
\nabla_{x} f(x)=\left[\begin{array}{c}\frac{\partial f(x)}{\partial x_{1}} \\ \frac{\partial f(x)}{\partial x_{2}} \\ \vdots \\ \frac{\partial f(x)}{\partial x_{n}}\end{array}\right].
It is very important to remember that the gradient of a function is only defined if the function is real-valued, that is, if it returns a scalar value. We can not, for example, take the gradient of Ax,A inR^(n xx n)A x, A \in \mathbb{R}^{n \times n} with respect to xx, since this quantity is vector-valued.
It follows directly from the equivalent properties of partial derivatives that:
•grad_(x)(f(x)+g(x))=grad_(x)f(x)+grad_(x)g(x)\nabla_{x}(f(x)+g(x))=\nabla_{x} f(x)+\nabla_{x} g(x).
•For t inR,grad_(x)(tf(x))=tgrad_(x)f(x)t \in \mathbb{R}, \nabla_{x}(t f(x))=t \nabla_{x} f(x).
In principle, gradients are a natural extension of partial derivatives to functions of multiple variables. In practice, however, working with gradients can sometimes be tricky for notational reasons. For example, suppose that A inR^(m xx n)A \in \mathbb{R}^{m \times n} is a matrix of fixed coefficients and suppose that b inR^(m)b \in \mathbb{R}^{m} is a vector of fixed coefficients. Let f:R^(m)rarrRf: \mathbb{R}^{m} \rightarrow \mathbb{R} be the function defined by f(z)=z^(T)zf(z)=z^{T} z, such that grad_(z)f(z)=2z\nabla_{z} f(z)=2 z. But now, consider the expression,
grad f(Ax).\nabla f(A x) .
How should this expression be interpreted? There are at least two possibilities:
In the first interpretation, recall that grad_(z)f(z)=2z\nabla_{z} f(z)=2 z. Here, we interpret grad f(Ax)\nabla f(A x) as evaluating the gradient at the point AxA x, hence,
grad f(Ax)=2(Ax)=2Ax inR^(m).\nabla f(A x)=2(A x)=2 A x \in \mathbb{R}^{m} .
In the second interpretation, we consider the quantity f(Ax)f(A x) as a function of the input variables xx. More formally, let g(x)=f(Ax)g(x)=f(A x). Then in this interpretation,
grad f(Ax)=grad_(x)g(x)inR^(n).\nabla f(A x)=\nabla_{x} g(x) \in \mathbb{R}^{n} .
Here, we can see that these two interpretations are indeed different. One interpretation yields an mm-dimensional vector as a result, while the other interpretation yields an nn-dimensional vector as a result! How can we resolve this?
Here, the key is to make explicit the variables which we are differentiating with respect to. In the first case, we are differentiating the function ff with respect to its arguments zz and then substituting the argument AxA x. In the second case, we are differentiating the composite function g(x)=f(Ax)g(x)=f(A x) with respect to xx directly. We denote the first case as grad_(z)f(Ax)\nabla_{z} f(A x) and the second case as grad_(x)f(Ax)\nabla_{x} f(A x).[9] Keeping the notation clear is extremely important (as you'll find out in your homework, in fact!).
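A small numpy sketch makes the two interpretations concrete (the matrix sizes are arbitrary; the chain-rule relation in the last line is a standard fact connecting the two):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 2))       # m = 3, n = 2 (arbitrary sizes)
x = rng.standard_normal(2)

# First interpretation: the gradient of f(z) = z^T z, evaluated at z = Ax.
grad_z = 2 * (A @ x)                  # an m-dimensional vector
assert grad_z.shape == (3,)

# Second interpretation: the gradient of g(x) = f(Ax) = x^T A^T A x w.r.t. x.
grad_x = 2 * A.T @ A @ x              # an n-dimensional vector
assert grad_x.shape == (2,)

# The chain rule relates the two: grad_x f(Ax) = A^T (grad_z f)(Ax).
assert np.allclose(grad_x, A.T @ grad_z)
```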
4.2. The Hessian
Suppose that f:R^(n)rarrRf: \mathbb{R}^{n} \rightarrow \mathbb{R} is a function that takes a vector in R^(n)\mathbb{R}^{n} and returns a real number. Then the Hessian matrix with respect to xx, written grad_(x)^(2)f(x)\nabla_{x}^{2} f(x) or simply as HH, is the n xx nn \times n matrix of partial derivatives,
\nabla_{x}^{2} f(x) \in \mathbb{R}^{n \times n}, \quad\left(\nabla_{x}^{2} f(x)\right)_{i j}=\frac{\partial^{2} f(x)}{\partial x_{i} \partial x_{j}}.
Similar to the gradient, the Hessian is defined only when f(x)f(x) is real-valued.
It is natural to think of the gradient as the analogue of the first derivative for functions of vectors, and the Hessian as the analogue of the second derivative (and the symbols we use also suggest this relation). This intuition is generally correct, but there are a few caveats to keep in mind.
First, for real-valued functions of one variable f:RrarrRf: \mathbb{R} \rightarrow \mathbb{R}, it is a basic definition that the second derivative is the derivative of the first derivative, i.e.,
\frac{\partial^{2} f(x)}{\partial x^{2}}=\frac{\partial}{\partial x} \frac{\partial f(x)}{\partial x}.
However, for functions of a vector, the gradient of the function is itself a vector, and we cannot take the gradient of a vector – i.e., the expression
\nabla_{x} \nabla_{x} f(x)
is not defined. Therefore, it is not the case that the Hessian is the gradient of the gradient. However, this is almost true, in the following sense: if we look at the iith entry of the gradient (grad_(x)f(x))_(i)=del f(x)//delx_(i)\left(\nabla_{x} f(x)\right)_{i}=\partial f(x) / \partial x_{i}, and take the gradient with respect to xx, we get
\nabla_{x} \frac{\partial f(x)}{\partial x_{i}}=\left[\frac{\partial^{2} f(x)}{\partial x_{i} \partial x_{1}}, \frac{\partial^{2} f(x)}{\partial x_{i} \partial x_{2}}, \ldots, \frac{\partial^{2} f(x)}{\partial x_{i} \partial x_{n}}\right]^{T},
which is the iith column (or row) of the Hessian.
If we don't mind being a little bit sloppy we can say that (essentially) grad_(x)^(2)f(x)=grad_(x)(grad_(x)f(x))^(T)\nabla_{x}^{2} f(x)=\nabla_{x}\left(\nabla_{x} f(x)\right)^{T}, so long as we understand that this really means taking the gradient of each entry of (grad_(x)f(x))^(T)\left(\nabla_{x} f(x)\right)^{T}, not the gradient of the whole vector.
Finally, note that while we can take the gradient with respect to a matrix A inR^(m xx n)A \in \mathbb{R}^{m \times n}, for the purposes of this class we will only consider taking the Hessian with respect to a vector x inR^(n)x \in \mathbb{R}^{n}. This is simply a matter of convenience (and the fact that none of the calculations we do require us to find the Hessian with respect to a matrix), since the Hessian with respect to a matrix would have to represent all the partial derivatives del^(2)f(A)//(delA_(ij)delA_(kℓ))\partial^{2} f(A) /\left(\partial A_{i j} \partial A_{k \ell}\right), and it is rather cumbersome to represent this as a matrix.
4.3. Gradients and Hessians of Quadratic and Linear Functions
Now let's try to determine the gradient and Hessian matrices for a few simple functions. It should be noted that all the gradients given here are special cases of the gradients given in the CS229 lecture notes.
For x inR^(n)x \in \mathbb{R}^{n}, let f(x)=b^(T)xf(x)=b^{T} x for some known vector b inR^(n)b \in \mathbb{R}^{n}. Then
f(x)=\sum_{i=1}^{n} b_{i} x_{i}, \quad \text{so} \quad \frac{\partial f(x)}{\partial x_{k}}=\frac{\partial}{\partial x_{k}} \sum_{i=1}^{n} b_{i} x_{i}=b_{k}.
From this we can easily see that grad_(x)b^(T)x=b\nabla_{x} b^{T} x=b. This should be compared to the analogous situation in single variable calculus, where del//(del x)ax=a\partial /(\partial x) a x=a.
Now consider the quadratic function f(x)=x^(T)Axf(x)=x^{T} A x for A inS^(n)A \in \mathbb{S}^{n}. Remember that
f(x)=\sum_{i=1}^{n} \sum_{j=1}^{n} A_{i j} x_{i} x_{j},
so that
\frac{\partial f(x)}{\partial x_{k}}=\frac{\partial}{\partial x_{k}} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{i j} x_{i} x_{j}=\sum_{i \neq k} A_{i k} x_{i}+\sum_{j \neq k} A_{k j} x_{j}+2 A_{k k} x_{k}=\sum_{i=1}^{n} A_{i k} x_{i}+\sum_{j=1}^{n} A_{k j} x_{j}=2 \sum_{i=1}^{n} A_{k i} x_{i},
where the last equality follows since AA is symmetric (which we can safely assume, since it is appearing in a quadratic form). Note that the kkth entry of grad_(x)f(x)\nabla_{x} f(x) is just the inner product of the kkth row of AA and xx. Therefore, grad_(x)x^(T)Ax=2Ax\nabla_{x} x^{T} A x=2 A x. Again, this should remind you of the analogous fact in single-variable calculus, that del//(del x)ax^(2)=2ax\partial /(\partial x) a x^{2}=2 a x.
Finally, let's look at the Hessian of the quadratic function f(x)=x^(T)Axf(x)=x^{T} A x (it should be obvious that the Hessian of a linear function b^(T)xb^{T} x is zero). In this case,
\frac{\partial^{2} f(x)}{\partial x_{k} \partial x_{\ell}}=\frac{\partial}{\partial x_{k}}\left[\frac{\partial f(x)}{\partial x_{\ell}}\right]=\frac{\partial}{\partial x_{k}}\left[2 \sum_{i=1}^{n} A_{\ell i} x_{i}\right]=2 A_{\ell k}=2 A_{k \ell}.
Therefore, it should be clear that grad_(x)^(2)x^(T)Ax=2A\nabla_{x}^{2} x^{T} A x=2 A, which should be entirely expected (and again analogous to the single-variable fact that {:del^(2)//(delx^(2))ax^(2)=2a)\left.\partial^{2} /\left(\partial x^{2}\right) a x^{2}=2 a\right).
To recap,
•grad_(x)b^(T)x=b\nabla_{x} b^{T} x=b
•grad_(x)x^(T)Ax=2Ax\nabla_{x} x^{T} A x=2 A x (if AA symmetric)
•grad_(x)^(2)x^(T)Ax=2A\nabla_{x}^{2} x^{T} A x=2 A (if AA symmetric)
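These identities are easy to sanity-check with finite differences (a numpy sketch; the random symmetric A, vector b, and step size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                     # symmetric, as the identities assume
b = rng.standard_normal(4)
x = rng.standard_normal(4)
eps = 1e-6

# Central finite-difference gradient of f(x) = x^T A x, compared to 2 A x.
f = lambda v: v @ A @ v
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
assert np.allclose(grad_fd, 2 * A @ x, atol=1e-4)

# Gradient of b^T x is b.
g = lambda v: b @ v
grad_fd_lin = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                        for e in np.eye(4)])
assert np.allclose(grad_fd_lin, b, atol=1e-4)
```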
4.4. Least Squares
Let's apply the equations we obtained in the last section to derive the least squares equations. Suppose we are given a matrix A inR^(m xx n)A \in \mathbb{R}^{m \times n} (for simplicity we assume AA is full rank) and a vector b inR^(m)b \in \mathbb{R}^{m} such that b!inR(A)b \notin \mathcal{R}(A). In this situation we will not be able to find a vector x inR^(n)x \in \mathbb{R}^{n} such that Ax=bA x=b, so instead we want to find a vector xx such that AxA x is as close as possible to bb, as measured by the square of the Euclidean norm ||Ax-b||_(2)^(2)\|A x-b\|_{2}^{2}.
Using the fact that ||x||_(2)^(2)=x^(T)x\|x\|_{2}^{2}=x^{T} x, we have
\begin{aligned}
\|A x-b\|_{2}^{2} &=(A x-b)^{T}(A x-b) \\
&=x^{T} A^{T} A x-2 b^{T} A x+b^{T} b
\end{aligned}
Taking the gradient with respect to xx, and using the properties we derived in the previous section, we have
\begin{aligned}
\nabla_{x}\left(x^{T} A^{T} A x-2 b^{T} A x+b^{T} b\right) &=\nabla_{x} x^{T} A^{T} A x-\nabla_{x} 2 b^{T} A x+\nabla_{x} b^{T} b \\
&=2 A^{T} A x-2 A^{T} b
\end{aligned}
Setting this last expression equal to zero gives the normal equations A^{T} A x=A^{T} b; since AA is full rank, A^(T)AA^{T} A is invertible, and solving for xx gives
x=(A^(T)A)^(-1)A^(T)bx=\left(A^{T} A\right)^{-1} A^{T} b
which is the same as what we derived in class.
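In code, one can compare the normal-equations solution against a library least-squares solver (numpy sketch; note that in practice `lstsq`, which uses an SVD, is preferred over explicitly forming A^T A):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((6, 3))       # a full-rank tall matrix (generically)
b = rng.standard_normal(6)            # b is (almost surely) not in R(A)

# Solve the normal equations A^T A x = A^T b directly ...
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# ... and compare with numpy's least-squares solver (SVD-based, more stable).
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_normal, x_lstsq)
```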
4.5. Gradients of the Determinant
Now let's consider a situation where we find the gradient of a function with respect to a matrix, namely for A inR^(n xx n)A \in \mathbb{R}^{n \times n}, we want to find grad_(A)|A|\nabla_{A}|A|. Recall from our discussion of determinants that
|A|=sum_(i=1)^(n)(-1)^(i+j)A_(ij)|A_(\\i,\\j)|quad("for any "j in1,dots,n)|A|=\sum_{i=1}^{n}(-1)^{i+j} A_{i j}\left|A_{\backslash i, \backslash j}\right| \quad(\text {for any } j \in 1, \ldots, n)
so that
\frac{\partial|A|}{\partial A_{k \ell}}=\frac{\partial}{\partial A_{k \ell}} \sum_{i=1}^{n}(-1)^{i+\ell} A_{i \ell}\left|A_{\backslash i, \backslash \ell}\right|=(-1)^{k+\ell}\left|A_{\backslash k, \backslash \ell}\right|=(\operatorname{adj}(A))_{\ell k},
and hence
\nabla_{A}|A|=(\operatorname{adj}(A))^{T}=|A| A^{-T}.
Now let's consider the function f:S_(++)^(n)rarrR,f(A)=log |A|f: \mathbb{S}_{++}^{n} \rightarrow \mathbb{R}, f(A)=\log |A|. Note that we have to restrict the domain of ff to be the positive definite matrices, since this ensures that |A| > 0|A|>0, so that the log of |A||A| is a real number. In this case we can use the chain rule (nothing fancy, just the ordinary chain rule from single-variable calculus) to see that
\frac{\partial \log |A|}{\partial A_{i j}}=\frac{\partial \log |A|}{\partial|A|} \frac{\partial|A|}{\partial A_{i j}}=\frac{1}{|A|} \frac{\partial|A|}{\partial A_{i j}},
from which it follows that
\nabla_{A} \log |A|=\frac{1}{|A|} \nabla_{A}|A|=A^{-1},
where we can drop the transpose in the last expression because AA is symmetric. Note the similarity to the single-variable case, where del//(del x)log x=1//x\partial /(\partial x) \log x=1 / x.
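One way to sanity-check this gradient without differentiating each entry separately is through directional derivatives: by the chain rule, the derivative of log|A + tV| at t = 0 equals tr(A⁻¹V). A numpy sketch (the positive definite A and direction V are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(8)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)           # an arbitrary positive definite matrix
V = rng.standard_normal((4, 4))       # an arbitrary perturbation direction

logdet = lambda M: np.linalg.slogdet(M)[1]   # numerically stable log|M|

# Directional derivative of log|A| in direction V, by central differences ...
eps = 1e-6
dderiv = (logdet(A + eps * V) - logdet(A - eps * V)) / (2 * eps)

# ... should equal <grad, V> = tr(A^{-1} V), since grad_A log|A| = A^{-T}.
assert np.allclose(dderiv, np.trace(np.linalg.inv(A) @ V), atol=1e-5)
```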
4.6. Eigenvalues as Optimization
Finally, we use matrix calculus to solve an optimization problem in a way that leads directly to eigenvalue/eigenvector analysis. Consider the following, equality constrained optimization problem:
max_(x inR^(n))x^(T)Ax quad"subject to "||x||_(2)^(2)=1\max _{x \in \mathbb{R}^{n}} x^{T} A x \quad \text {subject to }\|x\|_{2}^{2}=1
for a symmetric matrix A inS^(n)A \in \mathbb{S}^{n}. A standard way of solving optimization problems with equality constraints is by forming the Lagrangian, an objective function that includes the equality constraints.[10] The Lagrangian in this case can be given by
L(x,lambda)=x^(T)Ax-lambda(x^(T)x-1)\mathcal{L}(x, \lambda)=x^{T} A x-\lambda\left(x^{T} x-1\right)
where lambda\lambda is called the Lagrange multiplier associated with the equality constraint. It can be established that for x^(**)x^{*} to be an optimal point of the problem, the gradient of the Lagrangian has to be zero at x^(**)x^{*} (this is not the only condition, but it is required). That is,
\nabla_{x} \mathcal{L}(x, \lambda)=2 A x-2 \lambda x=0.
Notice that this is just the linear equation Ax=lambda xA x=\lambda x. This shows that the only points which can possibly maximize (or minimize) x^(T)Axx^{T} A x assuming x^(T)x=1x^{T} x=1 are the eigenvectors of AA.
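This connection is also the basis of simple numerical methods: power iteration repeatedly applies A and renormalizes, converging (generically) to the eigenvector of the largest-magnitude eigenvalue. A numpy sketch with an arbitrary PSD matrix:

```python
import numpy as np

rng = np.random.default_rng(9)
B = rng.standard_normal((5, 5))
A = B @ B.T                           # PSD, so the dominant eigenvalue is the largest

# Power iteration: repeatedly apply A and renormalize back to the unit sphere.
x = rng.standard_normal(5)
for _ in range(1000):
    x = A @ x
    x /= np.linalg.norm(x)

lam = x @ A @ x                       # Rayleigh quotient at the fixed point

assert np.allclose(A @ x, lam * x, atol=1e-6)        # x satisfies Ax = lambda x
assert np.isclose(lam, np.linalg.eigvalsh(A)[-1])    # and lambda is the largest
```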
E.g., if you could write all your math derivations with matrices or vectors, it would be better than doing them with scalar elements. ↩︎
It's easy to get confused and think that non-singular means non-invertible. But in fact, it means the opposite! Watch out! ↩︎
Admittedly, we have not actually defined what we mean by "volume" here, but hopefully the intuition should be clear enough. When n=2n=2, our notion of "volume" corresponds to the area of SS in the Cartesian plane. When n=3n=3, "volume" corresponds with our usual notion of volume for a three-dimensional object. ↩︎
Note that lambda\lambda and the entries of xx are actually in C\mathbb{C}, the set of complex numbers, not just the reals; we will see shortly why this is necessary. Don't worry about this technicality for now, you can think of complex vectors in the same way as real vectors. ↩︎
Mathematically, we have AA i,Au_(i)=lambda_(i)u_(i),||u_(i)||_(2)=1\forall i, A u_{i}=\lambda_{i} u_{i},\left\|u_{i}\right\|_{2}=1, and AA j!=i,u_(i)^(T)u_(j)=0\forall j \neq i, u_{i}^{T} u_{j}=0. Moreover, we remark that not every set of eigenvectors u_(1),dots,u_(n)u_{1}, \ldots, u_{n} satisfying a) is orthogonal, because eigenvalues can be repeated, and eigenvectors for a repeated eigenvalue need not be chosen orthogonal to each other. ↩︎
Here for notational simplicity, we deviate from the notational convention for columns of matrices in the previous sections. ↩︎
Note that hat(x)!=0<=>x!=0\hat{x} \neq 0 \Leftrightarrow x \neq 0. ↩︎
Note that x=U hat(x)x=U \hat{x} and therefore constructing hat(x)\hat{x} gives an implicit construction of xx. ↩︎
A drawback to this notation that we will have to live with is the fact that in the first case, grad_(z)f(Ax)\nabla_{z} f(A x) it appears that we are differentiating with respect to a variable that does not even appear in the expression being differentiated! For this reason, the first case is often written as grad f(Ax)\nabla f(A x), and the fact that we are differentiating with respect to the arguments of ff is understood. However, the second case is always written as grad_(x)f(Ax)\nabla_{x} f(A x). ↩︎
Don't worry if you haven't seen Lagrangians before, as we will cover them in greater detail later in CS229. ↩︎