Regularization: Some Calculations from Bias-Variance

This note contains a reprise of the eigenvalue arguments used to understand how regularization reduces variance. We also describe different ways regularization can arise, including from the algorithm or the initialization. The note typesets some additional calculations from the lecture and Piazza; they contain no new information beyond the lecture, but they supplement the notes.
Recall we have a design matrix $X \in \mathbb{R}^{n \times d}$ and labels $y \in \mathbb{R}^{n}$. We are interested in the underdetermined case $n < d$, so that $\operatorname{rank}(X) \leq n < d$. We consider the following optimization problem for least squares with a regularization parameter $\lambda \geq 0$:

$$\min_{\theta \in \mathbb{R}^{d}} \ell(\theta; \lambda) \quad \text{where} \quad \ell(\theta; \lambda) = \frac{1}{2}\|X\theta - y\|^{2} + \frac{\lambda}{2}\|\theta\|^{2}$$

Normal Equations

Computing derivatives as we did for the normal equations, we see that:

$$\nabla_{\theta} \ell(\theta; \lambda) = X^{T}(X\theta - y) + \lambda\theta = \left(X^{T}X + \lambda I\right)\theta - X^{T}y$$
By setting $\nabla_{\theta} \ell(\theta; \lambda) = 0$ we can solve for the $\hat{\theta}$ that minimizes the above problem. Explicitly, we have:
$$\hat{\theta} = \left(X^{T}X + \lambda I\right)^{-1} X^{T} y \tag{1}$$
To see that the inverse in Eq. (1) exists, we observe that $X^{T}X$ is a symmetric, real $d \times d$ matrix, so it has $d$ eigenvalues (some may be $0$). Moreover, it is positive semidefinite, and we capture this by writing $\operatorname{eig}(X^{T}X) = \{\sigma_{1}^{2}, \ldots, \sigma_{d}^{2}\}$. Now, inspired by the regularized problem, we examine:
$$\operatorname{eig}\left(X^{T}X + \lambda I\right) = \{\sigma_{1}^{2} + \lambda, \ldots, \sigma_{d}^{2} + \lambda\}$$
Since $\sigma_{i}^{2} \geq 0$ for all $i \in [d]$, if we set $\lambda > 0$ then $X^{T}X + \lambda I$ is full rank, and the inverse of $\left(X^{T}X + \lambda I\right)$ exists. In turn, this means there is a unique such $\hat{\theta}$.
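As a quick sanity check on Eq. (1) and the eigenvalue argument, here is a minimal numerical sketch (assuming NumPy; the dimensions and the value of $\lambda$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 50, 0.1                 # underdetermined: n < d
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Eq. (1): theta_hat = (X^T X + lambda I)^{-1} X^T y
A = X.T @ X + lam * np.eye(d)
theta_hat = np.linalg.solve(A, X.T @ y)

# The gradient (X^T X + lambda I) theta - X^T y vanishes at theta_hat.
print(np.linalg.norm(A @ theta_hat - X.T @ y))   # ~0 up to float error

# All eigenvalues are sigma_i^2 + lambda >= lambda > 0, so A is invertible.
print(np.linalg.eigvalsh(A).min())               # >= lambda
```

With $\lambda = 0$, $X^{T}X$ here has rank at most $n = 20 < d = 50$ and is singular; this is exactly the failure the $\lambda I$ term repairs.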

Variance

Recall that in bias-variance, we are concerned with the variance of $\hat{\theta}$ as we sample the training set. We want to argue that as the regularization parameter $\lambda$ increases, the variance in the fitted $\hat{\theta}$ decreases. We won't carry out the full formal argument, but it suffices to make one observation that is immediate from Eq. (1): the variance of $\hat{\theta}$ is proportional to the eigenvalues of $\left(X^{T}X + \lambda I\right)^{-1}$. To see this, observe that the eigenvalues of an inverse are just the inverses of the eigenvalues:
$$\operatorname{eig}\left(\left(X^{T}X + \lambda I\right)^{-1}\right) = \left\{\frac{1}{\sigma_{1}^{2} + \lambda}, \ldots, \frac{1}{\sigma_{d}^{2} + \lambda}\right\}$$
Now, condition on the points we draw, namely $X$. Then the randomness is in the label noise (recall the linear regression model $y \sim X\theta^{*} + \mathcal{N}(0, \tau^{2} I) = \mathcal{N}(X\theta^{*}, \tau^{2} I)$).
Recall a fact about the multivariate normal distribution:
$$\text{if } y \sim \mathcal{N}(\mu, \Sigma) \text{ then } Ay \sim \mathcal{N}\left(A\mu, A\Sigma A^{T}\right)$$
Using linearity, we can verify that the expectation of $\hat{\theta}$ is

$$\begin{aligned} \mathbb{E}[\hat{\theta}] &= \mathbb{E}\left[\left(X^{T}X + \lambda I\right)^{-1} X^{T} y\right] \\ &= \mathbb{E}\left[\left(X^{T}X + \lambda I\right)^{-1} X^{T}\left(X\theta^{*} + \mathcal{N}(0, \tau^{2} I)\right)\right] \\ &= \mathbb{E}\left[\left(X^{T}X + \lambda I\right)^{-1} X^{T}\left(X\theta^{*}\right)\right] \\ &= \left(X^{T}X + \lambda I\right)^{-1}\left(X^{T}X\right)\theta^{*} \quad \text{(essentially a "shrunk" } \theta^{*}\text{)} \end{aligned}$$
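To make "shrunk" precise, write the eigendecomposition $X^{T}X = \sum_{i=1}^{d} \sigma_{i}^{2} v_{i} v_{i}^{T}$, where the orthonormal eigenvectors $v_{i}$ are introduced here only for this calculation. Then

$$\mathbb{E}[\hat{\theta}] = \left(X^{T}X + \lambda I\right)^{-1}\left(X^{T}X\right)\theta^{*} = \sum_{i=1}^{d} \frac{\sigma_{i}^{2}}{\sigma_{i}^{2} + \lambda}\,\left(v_{i}^{T}\theta^{*}\right) v_{i},$$

so the component of $\theta^{*}$ along each $v_{i}$ is scaled by $\sigma_{i}^{2}/(\sigma_{i}^{2} + \lambda) < 1$, a factor that decreases toward $0$ as $\lambda$ grows.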
So the more regularization we add (the larger the $\lambda$), the more the estimated $\hat{\theta}$ is shrunk towards $0$. In other words, regularization adds bias (towards zero in this case). Though we pay the cost of higher bias, we gain by reducing the variance of $\hat{\theta}$. To see this bias-variance tradeoff concretely, observe the covariance matrix of $\hat{\theta}$:
$$C := \operatorname{Cov}[\hat{\theta}] = \left(\left(X^{T}X + \lambda I\right)^{-1} X^{T}\right)\left(\tau^{2} I\right)\left(X\left(X^{T}X + \lambda I\right)^{-1}\right)$$

and

$$\operatorname{eig}(C) = \left\{\frac{\tau^{2}\sigma_{1}^{2}}{\left(\sigma_{1}^{2} + \lambda\right)^{2}}, \ldots, \frac{\tau^{2}\sigma_{d}^{2}}{\left(\sigma_{d}^{2} + \lambda\right)^{2}}\right\}$$
Notice that the entire spectrum of the covariance is a decreasing function of $\lambda$. Summing this spectrum (i.e., taking the trace of $C$), we see that the variance $\mathbb{E}[\|\hat{\theta} - \mathbb{E}[\hat{\theta}]\|^{2}]$ is a decreasing function of $\lambda$, as desired.
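To see the variance reduction empirically, the following minimal simulation (assuming NumPy; the sizes, noise level $\tau$, $\lambda$ grid, and number of redraws are arbitrary illustrative choices) conditions on a fixed $X$, redraws the label noise many times, and reports the empirical trace of $\operatorname{Cov}[\hat{\theta}]$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, tau = 20, 50, 0.5
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)

def fit(X, y, lam):
    """Ridge estimate from Eq. (1)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.01, 0.1, 1.0, 10.0]:
    # Hold X fixed (condition on it); redraw only the label noise.
    thetas = np.stack([
        fit(X, X @ theta_star + tau * rng.standard_normal(n), lam)
        for _ in range(500)
    ])
    print(f"lambda={lam:5.2f}  empirical trace Cov = {thetas.var(axis=0).sum():.4f}")
```

The printed trace is the empirical version of $\sum_{i} \tau^{2}\sigma_{i}^{2}/(\sigma_{i}^{2} + \lambda)^{2}$ and decreases as $\lambda$ grows.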

Gradient Descent

We show that one can initialize gradient descent in a way that effectively regularizes underdetermined least squares, even with no regularization penalty ($\lambda = 0$). Our first observation is that any point $x \in \mathbb{R}^{d}$ can be decomposed into two orthogonal components $x_{0}, x_{1}$ such that
$$x = x_{0} + x_{1} \quad \text{where} \quad x_{0} \in \operatorname{Null}(X) \text{ and } x_{1} \in \operatorname{Range}\left(X^{T}\right).$$
Recall that $\operatorname{Null}(X)$ and $\operatorname{Range}\left(X^{T}\right)$ are orthogonal subspaces by the fundamental theorem of linear algebra. Writing $P_{0}$ for the projection onto the null space and $P_{1}$ for the projection onto the range, we have $x_{0} = P_{0}(x)$ and $x_{1} = P_{1}(x)$.
Now observe that, wherever gradient descent currently sits, the gradient is orthogonal to the null space. That is, if $g(\theta) = X^{T}(X\theta - y)$, then $g(\theta)^{T} P_{0}(v) = 0$ for any $v \in \mathbb{R}^{d}$. But then:
$$P_{0}\left(\theta^{(t+1)}\right) = P_{0}\left(\theta^{(t)} - \alpha g\left(\theta^{(t)}\right)\right) = P_{0}\left(\theta^{(t)}\right) - \alpha P_{0}\left(g\left(\theta^{(t)}\right)\right) = P_{0}\left(\theta^{(t)}\right)$$
That is, no learning happens in the null space: whatever component of the initialization lies in the null space stays there throughout execution.
A key property of the Moore-Penrose pseudoinverse is that if $\hat{\theta} = \left(X^{T}X\right)^{+} X^{T} y$ then $P_{0}(\hat{\theta}) = 0$. Hence, the gradient descent solution initialized at $\theta_{0}$ can be written as $\hat{\theta} + P_{0}\left(\theta_{0}\right)$. Two immediate observations:
  • Using the Moore-Penrose pseudoinverse acts as regularization, because it selects the solution $\hat{\theta}$ (the one with no null-space component, i.e., the minimum-norm solution).
  • So does gradient descent, provided that we initialize at $\theta_{0} = 0$. This is particularly interesting, as many modern machine learning techniques operate in these underdetermined regimes; see the sketch below.
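To illustrate both observations, here is a minimal sketch (assuming NumPy; the sizes, step size, and iteration count are arbitrary illustrative choices). It uses the standard identity $\left(X^{T}X\right)^{+}X^{T}y = X^{+}y$ to compute $\hat{\theta}$, runs plain gradient descent with $\lambda = 0$, and checks that the null-space component of the iterate never moves and that the final iterate is $\hat{\theta} + P_{0}(\theta_{0})$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 50                                  # underdetermined: n < d
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

Xp = np.linalg.pinv(X)                         # Moore-Penrose pseudoinverse X^+
theta_hat = Xp @ y                             # (X^T X)^+ X^T y = X^+ y; P0(theta_hat) = 0
P0 = np.eye(d) - Xp @ X                        # projection onto Null(X)

theta = rng.standard_normal(d)                 # arbitrary initialization theta_0
null_part = P0 @ theta                         # P0(theta_0), frozen during training
for _ in range(20000):
    theta -= 1e-3 * (X.T @ (X @ theta - y))    # gradient step on the lambda = 0 objective

print(np.linalg.norm(P0 @ theta - null_part))           # ~0: no learning in the null space
print(np.linalg.norm(theta - (theta_hat + null_part)))  # ~0: solution is theta_hat + P0(theta_0)
```

Initializing at $\theta_{0} = 0$ makes `null_part` zero, so gradient descent converges to $\hat{\theta}$ itself, matching the pseudoinverse solution with no explicit penalty.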
We've argued that there are many ways to find equivalent solutions, and that this lets us understand their effect on the model-fitting procedure as regularization. Many modern machine learning methods, including dropout and data augmentation, are not penalties either, but their effect is also understood as regularization. One contrast with the methods above is that how much they effectively regularize often depends on properties of the data; in some sense, they adapt to the data. A final comment is that, in the same sense, adding more data regularizes the model as well!
