Regularization: Some Calculations from Bias-Variance

This note contains a reprise of the eigenvalue arguments used to understand how regularization reduces variance. We also describe different ways regularization can arise, including from the algorithm or the initialization. The note typesets some additional calculations from the lecture and Piazza; they contain no new information beyond the lecture, but they supplement the notes.
Recall we have a design matrix $X \in \mathbb{R}^{n \times d}$ and labels $y \in \mathbb{R}^{n}$. We are interested in the underdetermined case $n < d$, so that $\operatorname{rank}(X) \leq n < d$. We consider the following optimization problem for least squares with a regularization parameter $\lambda \geq 0$:

$$\min_{\theta \in \mathbb{R}^{d}} \ell(\theta; \lambda) \quad \text{where} \quad \ell(\theta; \lambda) = \frac{1}{2}\|X\theta - y\|^{2} + \frac{\lambda}{2}\|\theta\|^{2}$$

Normal Equations

Computing derivatives as we did for the normal equations, we see that:

$$\nabla_{\theta} \ell(\theta; \lambda) = X^{T}(X\theta - y) + \lambda\theta = \left(X^{T}X + \lambda I\right)\theta - X^{T}y$$
By setting $\nabla_{\theta} \ell(\theta; \lambda) = 0$ we can solve for the $\hat{\theta}$ that minimizes the above problem. Explicitly, we have:
$$\hat{\theta} = \left(X^{T}X + \lambda I\right)^{-1} X^{T} y \tag{1}$$
To see that the inverse in Eq. (1) exists, we observe that $X^{T}X$ is a symmetric, real $d \times d$ matrix, so it has $d$ eigenvalues (some may be $0$). Moreover, it is positive semidefinite, and we capture this by writing $\operatorname{eig}(X^{T}X) = \{\sigma_{1}^{2}, \ldots, \sigma_{d}^{2}\}$. Now, inspired by the regularized problem, we examine:
$$\operatorname{eig}\left(X^{T}X + \lambda I\right) = \{\sigma_{1}^{2} + \lambda, \ldots, \sigma_{d}^{2} + \lambda\}$$
Since $\sigma_{i}^{2} \geq 0$ for all $i \in [d]$, if we set $\lambda > 0$ then $X^{T}X + \lambda I$ is full rank, and the inverse of $\left(X^{T}X + \lambda I\right)$ exists. In turn, this means there is a unique such $\hat{\theta}$.
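As a quick sanity check on Eq. (1) and the eigenvalue argument, here is a minimal numerical sketch (assuming NumPy; the dimensions and the value of $\lambda$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 50, 0.1                 # underdetermined: n < d
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Eq. (1): theta_hat = (X^T X + lambda I)^{-1} X^T y
A = X.T @ X + lam * np.eye(d)
theta_hat = np.linalg.solve(A, X.T @ y)

# The gradient (X^T X + lambda I) theta - X^T y vanishes at theta_hat.
print(np.linalg.norm(A @ theta_hat - X.T @ y))   # ~0 up to float error

# All eigenvalues are sigma_i^2 + lambda >= lambda > 0, so A is invertible.
print(np.linalg.eigvalsh(A).min())               # >= lambda
```

With $\lambda = 0$, $X^{T}X$ here has rank at most $n = 20 < d = 50$ and is singular; this is exactly the failure the $\lambda I$ term repairs.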

Variance

Recall that in bias-variance, we are concerned with the variance of $\hat{\theta}$ as we sample the training set. We want to argue that as the regularization parameter $\lambda$ increases, the variance in the fitted $\hat{\theta}$ decreases. We won't carry out the full formal argument, but it suffices to make one observation that is immediate from Eq. (1): the variance of $\hat{\theta}$ is proportional to the eigenvalues of $\left(X^{T}X + \lambda I\right)^{-1}$. To see this, observe that the eigenvalues of an inverse are just the inverses of the eigenvalues:
$$\operatorname{eig}\left(\left(X^{T}X + \lambda I\right)^{-1}\right) = \left\{\frac{1}{\sigma_{1}^{2} + \lambda}, \ldots, \frac{1}{\sigma_{d}^{2} + \lambda}\right\}$$
Now, condition on the points we draw, namely $X$. Then the randomness is in the label noise (recall the linear regression model $y \sim X\theta^{*} + \mathcal{N}(0, \tau^{2} I) = \mathcal{N}(X\theta^{*}, \tau^{2} I)$).
Recall a fact about the multivariate normal distribution:
$$\text{if } y \sim \mathcal{N}(\mu, \Sigma) \text{ then } Ay \sim \mathcal{N}\left(A\mu, A\Sigma A^{T}\right)$$
Using linearity, we can verify that the expectation of $\hat{\theta}$ is

$$\begin{aligned} \mathbb{E}[\hat{\theta}] &= \mathbb{E}\left[\left(X^{T}X + \lambda I\right)^{-1} X^{T} y\right] \\ &= \mathbb{E}\left[\left(X^{T}X + \lambda I\right)^{-1} X^{T}\left(X\theta^{*} + \mathcal{N}(0, \tau^{2} I)\right)\right] \\ &= \mathbb{E}\left[\left(X^{T}X + \lambda I\right)^{-1} X^{T}\left(X\theta^{*}\right)\right] \\ &= \left(X^{T}X + \lambda I\right)^{-1}\left(X^{T}X\right)\theta^{*} \quad \text{(essentially a "shrunk" } \theta^{*}\text{)} \end{aligned}$$
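To make "shrunk" precise, write the eigendecomposition $X^{T}X = \sum_{i=1}^{d} \sigma_{i}^{2} v_{i} v_{i}^{T}$, where the orthonormal eigenvectors $v_{i}$ are introduced here only for this calculation. Then

$$\mathbb{E}[\hat{\theta}] = \left(X^{T}X + \lambda I\right)^{-1}\left(X^{T}X\right)\theta^{*} = \sum_{i=1}^{d} \frac{\sigma_{i}^{2}}{\sigma_{i}^{2} + \lambda}\,\left(v_{i}^{T}\theta^{*}\right) v_{i},$$

so the component of $\theta^{*}$ along each $v_{i}$ is scaled by $\sigma_{i}^{2}/(\sigma_{i}^{2} + \lambda) < 1$, a factor that decreases toward $0$ as $\lambda$ grows.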
So the more regularization we add (the larger the $\lambda$), the more the estimated $\hat{\theta}$ is shrunk towards $0$. In other words, regularization adds bias (towards zero in this case). Though we pay the cost of higher bias, we gain by reducing the variance of $\hat{\theta}$. To see this bias-variance tradeoff concretely, observe the covariance matrix of $\hat{\theta}$:
$$C := \operatorname{Cov}[\hat{\theta}] = \left(\left(X^{T}X + \lambda I\right)^{-1} X^{T}\right)\left(\tau^{2} I\right)\left(X\left(X^{T}X + \lambda I\right)^{-1}\right)$$

and

$$\operatorname{eig}(C) = \left\{\frac{\tau^{2}\sigma_{1}^{2}}{\left(\sigma_{1}^{2} + \lambda\right)^{2}}, \ldots, \frac{\tau^{2}\sigma_{d}^{2}}{\left(\sigma_{d}^{2} + \lambda\right)^{2}}\right\}$$
Notice that the entire spectrum of the covariance is a decreasing function of $\lambda$. Summing this spectrum (i.e., taking the trace of $C$), we see that the variance $\mathbb{E}[\|\hat{\theta} - \mathbb{E}[\hat{\theta}]\|^{2}]$ is a decreasing function of $\lambda$, as desired.
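To see the variance reduction empirically, the following minimal simulation (assuming NumPy; the sizes, noise level $\tau$, $\lambda$ grid, and number of redraws are arbitrary illustrative choices) conditions on a fixed $X$, redraws the label noise many times, and reports the empirical trace of $\operatorname{Cov}[\hat{\theta}]$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, tau = 20, 50, 0.5
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)

def fit(X, y, lam):
    """Ridge estimate from Eq. (1)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.01, 0.1, 1.0, 10.0]:
    # Hold X fixed (condition on it); redraw only the label noise.
    thetas = np.stack([
        fit(X, X @ theta_star + tau * rng.standard_normal(n), lam)
        for _ in range(500)
    ])
    print(f"lambda={lam:5.2f}  empirical trace Cov = {thetas.var(axis=0).sum():.4f}")
```

The printed trace is the empirical version of $\sum_{i} \tau^{2}\sigma_{i}^{2}/(\sigma_{i}^{2} + \lambda)^{2}$ and decreases as $\lambda$ grows.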

Gradient Descent

We show that one can initialize gradient descent in a way that effectively regularizes underdetermined least squares, even with no regularization penalty ($\lambda = 0$). Our first observation is that any point $x \in \mathbb{R}^{d}$ can be decomposed into two orthogonal components $x_{0}, x_{1}$ such that
$$x = x_{0} + x_{1} \quad \text{where} \quad x_{0} \in \operatorname{Null}(X) \text{ and } x_{1} \in \operatorname{Range}\left(X^{T}\right).$$
Recall that $\operatorname{Null}(X)$ and $\operatorname{Range}\left(X^{T}\right)$ are orthogonal subspaces by the fundamental theorem of linear algebra. Writing $P_{0}$ for the projection onto the null space and $P_{1}$ for the projection onto the range, we have $x_{0} = P_{0}(x)$ and $x_{1} = P_{1}(x)$.
Now observe that, wherever gradient descent currently sits, the gradient is orthogonal to the null space. That is, if $g(\theta) = X^{T}(X\theta - y)$, then $g(\theta)^{T} P_{0}(v) = 0$ for any $v \in \mathbb{R}^{d}$. But then:
$$P_{0}\left(\theta^{(t+1)}\right) = P_{0}\left(\theta^{(t)} - \alpha g\left(\theta^{(t)}\right)\right) = P_{0}\left(\theta^{(t)}\right) - \alpha P_{0}\left(g\left(\theta^{(t)}\right)\right) = P_{0}\left(\theta^{(t)}\right)$$
That is, no learning happens in the null space: whatever component of the initialization lies in the null space stays there throughout execution.
A key property of the Moore-Penrose pseudoinverse is that if $\hat{\theta} = \left(X^{T}X\right)^{+} X^{T} y$ then $P_{0}(\hat{\theta}) = 0$. Hence, the gradient descent solution initialized at $\theta_{0}$ can be written as $\hat{\theta} + P_{0}\left(\theta_{0}\right)$. Two immediate observations:
  • Using the Moore-Penrose pseudoinverse acts as regularization, because it selects the solution $\hat{\theta}$ (the one with no null-space component, i.e., the minimum-norm solution).
  • So does gradient descent, provided that we initialize at $\theta_{0} = 0$. This is particularly interesting, as many modern machine learning techniques operate in these underdetermined regimes; see the sketch below.
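To illustrate both observations, here is a minimal sketch (assuming NumPy; the sizes, step size, and iteration count are arbitrary illustrative choices). It uses the standard identity $\left(X^{T}X\right)^{+}X^{T}y = X^{+}y$ to compute $\hat{\theta}$, runs plain gradient descent with $\lambda = 0$, and checks that the null-space component of the iterate never moves and that the final iterate is $\hat{\theta} + P_{0}(\theta_{0})$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 50                                  # underdetermined: n < d
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

Xp = np.linalg.pinv(X)                         # Moore-Penrose pseudoinverse X^+
theta_hat = Xp @ y                             # (X^T X)^+ X^T y = X^+ y; P0(theta_hat) = 0
P0 = np.eye(d) - Xp @ X                        # projection onto Null(X)

theta = rng.standard_normal(d)                 # arbitrary initialization theta_0
null_part = P0 @ theta                         # P0(theta_0), frozen during training
for _ in range(20000):
    theta -= 1e-3 * (X.T @ (X @ theta - y))    # gradient step on the lambda = 0 objective

print(np.linalg.norm(P0 @ theta - null_part))           # ~0: no learning in the null space
print(np.linalg.norm(theta - (theta_hat + null_part)))  # ~0: solution is theta_hat + P0(theta_0)
```

Initializing at $\theta_{0} = 0$ makes `null_part` zero, so gradient descent converges to $\hat{\theta}$ itself, matching the pseudoinverse solution with no explicit penalty.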
We've argued that there are many ways to find equivalent solutions, and that this lets us understand their effect on the model-fitting procedure as regularization. Many modern machine learning methods, including dropout and data augmentation, are not penalties either, but their effect is also understood as regularization. One contrast with the methods above is that how much they effectively regularize often depends on properties of the data; in some sense, they adapt to the data. A final comment is that, in the same sense, adding more data regularizes the model as well!
