Introduction to Deep Learning

You can read the notes from the previous lecture of Andrew Ng's CS229 course on Kernel Methods here.

We now begin our study of deep learning. In this set of notes, we give an overview of neural networks, discuss vectorization and discuss training neural networks with backpropagation.

1. Supervised Learning with Non-linear Models

In the supervised learning setting (predicting

y

from the input

x

), suppose our model/hypothesis is

h_{θ} (x)

. In the past lectures, we have considered the cases when

h_{θ} (x) = θ^{⊤} x

(in linear regression or logistic regression) or

h_{θ} (x) =

θ^{⊤} ϕ (x)

(where

ϕ (x)

is the feature map). A commonality of these two models is that they are linear in the parameters

θ

. Next we will consider learning a general family of models that are non-linear in both the parameters

θ

and the inputs

x

. The most common non-linear models are neural networks, which we will define starting from the next section. For this section, it suffices to think of

h_{θ} (x)

as an abstract non-linear model.^[1]

Suppose

{(x^{(i)}, y^{(i)})}_{i = 1}^{n}

are the training examples. For simplicity, we start with the case where

y^{(i)} \in R

and

h_{θ} (x) \in R

.

Cost/loss function. We define the least-squares cost function for the

i

-th example

(x^{(i)}, y^{(i)})

\begin{matrix} (1.1) & J^{(i)} (θ) = \frac{1}{2} {(h_{θ} (x^{(i)}) - y^{(i)})}^{2} \end{matrix}

and define the mean-square cost function for the dataset as

\begin{matrix} (1.2) & J (θ) = \frac{1}{n} \sum_{i = 1}^{n} J^{(i)} (θ) \end{matrix}

which is the same as in linear regression except that we introduce a constant

1 / n

in front of the cost function to be consistent with the convention. Note that multiplying the cost function by a positive scalar will not change the local minima or global minima of the cost function. Also note that the underlying parameterization for

h_{θ} (x)

is different from the case of linear regression, even though the form of the cost function is the same mean-squared loss. Throughout the notes, we use the words "loss" and "cost" interchangeably.
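As a small illustration of equations (1.1) and (1.2), the sketch below evaluates the per-example loss and the mean loss for a stand-in model. The model `h`, the data, and the parameter values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
# Toy illustration of equations (1.1) and (1.2).
# The model h and the data below are made up for this example.

def h(theta, x):
    """A stand-in non-linear model; any function of (theta, x) would do."""
    w, b = theta
    return max(w * x + b, 0.0)

def example_loss(theta, x, y):
    """Least-squares loss for one example, equation (1.1)."""
    return 0.5 * (h(theta, x) - y) ** 2

def dataset_loss(theta, xs, ys):
    """Mean of the per-example losses, equation (1.2)."""
    n = len(xs)
    return sum(example_loss(theta, x, y) for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 2.0]
theta = (1.0, 0.0)

# Predictions are 1, 2, 3, so only the third example contributes: 0.5 * 1**2.
print(dataset_loss(theta, xs, ys))  # 0.5 / 3
```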

Optimizers (SGD). Commonly, people use gradient descent (GD), stochastic gradient descent (SGD), or their variants to optimize the loss function

J (θ)

. GD's update rule can be written as ^[2]

\begin{matrix} (1.3) & θ := θ - α \nabla_{θ} J (θ) \end{matrix}

where

α > 0

is often referred to as the learning rate or step size. Next, we introduce a version of the SGD (Algorithm

1

), which is slightly different from that in the first lecture notes.

Algorithm 1 Stochastic Gradient Descent

1: Hyperparameters: learning rate α, number of total iterations n_{iter}.
2: Initialize θ randomly.
3: for i = 1 to n_{iter} do
4:     Sample j uniformly from {1, \dots, n}, and update θ by

\begin{matrix} (1.4) & θ := θ - α \nabla_{θ} J^{(j)} (θ) \end{matrix}
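A minimal sketch of Algorithm 1 in plain Python follows. The function name `sgd`, the callback `grad_example`, the toy data, and the hyperparameter values are assumptions made for illustration. For reproducibility the sketch starts from zeros rather than a random initialization.

```python
import random

def sgd(grad_example, theta, n, alpha, n_iter, seed=0):
    """Algorithm 1: one uniformly sampled example per update.

    grad_example(theta, j) returns the gradient of J^{(j)} at theta
    as a list with the same length as theta.
    """
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(n_iter):
        j = rng.randrange(n)                      # uniform over {0, ..., n-1}
        g = grad_example(theta, j)
        theta = [t - alpha * gi for t, gi in zip(theta, g)]   # update (1.4)
    return theta

# Hypothetical noise-free data from y = 2x + 1, fit with h(x) = w x + b.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def grad_example(theta, j):
    w, b = theta
    err = w * xs[j] + b - ys[j]                   # h(x) - y
    return [err * xs[j], err]                     # gradient of 0.5 * err**2

# Zero initialization is used here only to keep the run reproducible.
theta_hat = sgd(grad_example, [0.0, 0.0], n=len(xs), alpha=0.05, n_iter=5000)
print(theta_hat)   # close to [2.0, 1.0]
```

Because the toy data is fit exactly by a line, every per-example gradient vanishes at the solution, which is why a constant learning rate is enough here. On noisy data a constant step size would typically leave the iterates fluctuating around the minimizer.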

Oftentimes computing the gradient of

B

examples simultaneously for the parameter

θ

can be faster than computing

B

gradients separately due to hardware parallelization. Therefore, a mini-batch version of SGD is most commonly used in deep learning, as shown in Algorithm

2

. There are also other variants of the SGD or mini-batch SGD with slightly different sampling schemes.

Algorithm 2 Mini-batch Stochastic Gradient Descent

1: Hyperparameters: learning rate α, batch size B, number of iterations n_{iter}.
2: Initialize θ randomly.
3: for i = 1 to n_{iter} do
4:     Sample B examples j_{1}, \dots, j_{B} (without replacement) uniformly from {1, \dots, n}, and update θ by

\begin{matrix} (1.5) & θ := θ - \frac{α}{B} \sum_{k = 1}^{B} \nabla_{θ} J^{(j_{k})} (θ) \end{matrix}
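Algorithm 2 can be sketched in the same style. The names `minibatch_sgd` and `grad_example`, the data, and the hyperparameters are again hypothetical. This pure-Python version sums the B gradients in a loop, so it shows the update rule but not the hardware speed-up that motivates mini-batching.

```python
import random

def minibatch_sgd(grad_example, theta, n, alpha, batch_size, n_iter, seed=0):
    """Algorithm 2: average the gradients of B sampled examples per update."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(n_iter):
        batch = rng.sample(range(n), batch_size)  # B indices, without replacement
        total = [0.0] * len(theta)
        for j in batch:
            g = grad_example(theta, j)
            total = [s + gi for s, gi in zip(total, g)]
        # update (1.5): step along the averaged gradient
        theta = [t - (alpha / batch_size) * s for t, s in zip(theta, total)]
    return theta

# Hypothetical noise-free data from y = 2x + 1, fit with h(x) = w x + b.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def grad_example(theta, j):
    w, b = theta
    err = w * xs[j] + b - ys[j]
    return [err * xs[j], err]

theta_hat = minibatch_sgd(grad_example, [0.0, 0.0], n=len(xs),
                          alpha=0.05, batch_size=2, n_iter=5000)
print(theta_hat)   # close to [2.0, 1.0]
```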

With these generic algorithms, a typical deep learning model is learned with the following steps:

1. Define a neural network parametrization h_{θ} (x), which we will introduce in Section 2.
2. Write the backpropagation algorithm to compute the gradient of the loss function J^{(j)} (θ) efficiently, which will be covered in Section 3.
3. Run SGD or mini-batch SGD (or other gradient-based optimizers) with the loss function J (θ).

2. Neural Networks

Neural networks refer to a broad type of non-linear models/parametrizations

h_{θ} (x)

that involve combinations of matrix multiplications and other entrywise non-linear operations. We will start small and slowly build up a neural network, step by step.

A Neural Network with a Single Neuron. Recall the housing price prediction problem from before: given the size of the house, we want to predict the price. We will use it as a running example in this subsection.

Previously, we fit a straight line to the graph of size vs. housing price. Now, instead of fitting a straight line, we wish to prevent negative housing prices by setting the absolute minimum price as zero. This produces a "kink" in the graph as shown in Figure 1. How do we represent such a function with a single kink as

h_{θ} (x)

with unknown parameters? (After doing so, we can invoke the machinery in Section 1.)

We define a parameterized function

h_{θ} (x)

with input

x

, parameterized by

θ

, which outputs the price of the house

y

. Formally,

h_{θ} : x \to y

. Perhaps one of the simplest parametrizations would be

\begin{matrix} (2.1) & h_{θ} (x) = max (w x + b, 0), where θ = (w, b) \in R^{2} \end{matrix}

Here

h_{θ} (x)

returns a single value:

(w x + b)

or zero, whichever is greater. In the context of neural networks, the function

max {t, 0}

is called a ReLU (pronounced "ray-lu"), or rectified linear unit, and often denoted by

ReLU (t) ≜

max {t, 0}

Generally, a one-dimensional non-linear function that maps

R \to R

, such as ReLU, is often referred to as an activation function. The model

h_{θ} (x)

is said to have a single neuron partly because it has a single non-linear activation function. (We will discuss more about why a non-linear activation is called a neuron.)

When the input

x \in R^{d}

has multiple dimensions, a neural network with a single neuron can be written as

\begin{matrix} (2.2) & h_{θ} (x) = ReLU (w^{⊤} x + b), where w \in R^{d}, b \in R, and θ = (w, b) \end{matrix}

The term

b

is often referred to as the "bias", and the vector

w

is referred to as the weight vector. Such a neural network has

1

layer. (We will define what multiple layers mean in the sequel.)
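The single-neuron model of equations (2.1) and (2.2) is short enough to write out directly. The function names and the numeric values below are illustrative only.

```python
def relu(t):
    """ReLU(t) = max{t, 0}."""
    return max(t, 0.0)

def single_neuron(w, b, x):
    """h_theta(x) = ReLU(w^T x + b) for x in R^d, equation (2.2)."""
    return relu(sum(wi * xi for wi, xi in zip(w, x)) + b)

# d = 1 recovers equation (2.1).
print(single_neuron([2.0], -1.0, [3.0]))            # 2*3 - 1 = 5.0
print(single_neuron([2.0], -1.0, [0.0]))            # 2*0 - 1 < 0, so 0.0
print(single_neuron([1.0, -1.0], 0.5, [2.0, 1.0]))  # 2 - 1 + 0.5 = 1.5
```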

Stacking Neurons. A more complex neural network may take several copies of the single neuron described above and "stack" them together such that one neuron passes its output as input to the next neuron, resulting in a more complex function.

Let us now deepen the housing prediction example. In addition to the size of the house, suppose that you know the number of bedrooms, the zip code and the wealth of the neighborhood. Building neural networks is analogous to Lego bricks: you take individual bricks and stack them together to build complex structures. The same applies to neural networks: we take individual neurons and stack them together to create complex neural networks.

Given these features (size, number of bedrooms, zip code, and wealth), we might then decide that the price of the house depends on the maximum family size it can accommodate. Suppose the family size is a function of the size of the house and number of bedrooms (see Figure

2

). The zip code may provide additional information such as how walkable the neighborhood is (i.e., can you walk to the grocery store or do you need to drive everywhere). Combining the zip code with the wealth of the neighborhood may predict the quality of the local elementary school. Given these three derived features (family size, walkable, school quality), we may conclude that the price of the home ultimately depends on these three features.

Figure 1: Housing prices with a "kink" in the graph.

Figure 2: Diagram of a small neural network for predicting housing prices.

Formally, the input to a neural network is a set of input features

x_{1}, x_{2}, x_{3}, x_{4}

. We denote the intermediate variables for "family size", "walkable", and "school quality" by

a_{1}, a_{2}, a_{3}

(these

a_{i}

's are often referred to as "hidden units" or "hidden neurons"). We represent each of the

a_{i}

's as a neural network with a single neuron with a subset of

x_{1}, \dots, x_{4}

as inputs. Then as in Figure 2, we will have the parameterization:

\begin{aligned} a_{1} = ReLU (θ_{1} x_{1} + θ_{2} x_{2} + θ_{3}) \\ a_{2} = ReLU (θ_{4} x_{3} + θ_{5}) \\ a_{3} = ReLU (θ_{6} x_{3} + θ_{7} x_{4} + θ_{8}) \end{aligned}

where

(θ_{1}, \dots, θ_{8})

are parameters. Now we represent the final output

h_{θ} (x)

as another linear function with

a_{1}, a_{2}, a_{3}

as inputs, and we get^[3]

\begin{matrix} (2.3) & h_{θ} (x) = θ_{9} a_{1} + θ_{10} a_{2} + θ_{11} a_{3} + θ_{12} \end{matrix}

where

θ

contains all the parameters

(θ_{1}, \dots, θ_{12})

.

We have now represented the output as a fairly complex function of

x

with parameters

θ

. Then you can use this parametrization

h_{θ}

with the machinery of Section 1 to learn the parameters

θ

.
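The hand-designed housing network of equation (2.3) can be written as a short function. The name `house_price`, the ordering of the inputs as (size, bedrooms, zip code, wealth), and the all-ones parameter vector are assumptions for illustration; a real zip code would need a numeric encoding that the notes do not specify.

```python
def relu(t):
    return max(t, 0.0)

def house_price(theta, x):
    """theta: the 12 parameters theta_1..theta_12; x: (x_1, x_2, x_3, x_4)."""
    t = [None] + list(theta)                 # shift so that t[1] is theta_1
    x1, x2, x3, x4 = x
    a1 = relu(t[1] * x1 + t[2] * x2 + t[3])          # "family size"
    a2 = relu(t[4] * x3 + t[5])                      # "walkable"
    a3 = relu(t[6] * x3 + t[7] * x4 + t[8])          # "school quality"
    return t[9] * a1 + t[10] * a2 + t[11] * a3 + t[12]   # equation (2.3)

theta = [1.0] * 12
x = (1.0, 2.0, 3.0, 4.0)
# a1 = 1 + 2 + 1 = 4, a2 = 3 + 1 = 4, a3 = 3 + 4 + 1 = 8, output = 4 + 4 + 8 + 1
print(house_price(theta, x))   # 17.0
```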

Inspiration from Biological Neural Networks. As the name suggests, artificial neural networks were inspired by biological neural networks. The hidden units

a_{1}, \dots, a_{m}

correspond to the neurons in a biological neural network, and the parameters

θ_{i}

's correspond to the synapses. However, it's unclear how similar the modern deep artificial neural networks are to the biological ones. For example, perhaps not many neuroscientists think biological neural networks could have 1000 layers, while some modern artificial neural networks do (we will elaborate more on the notion of layers). Moreover, it's an open question whether human brains update their neural networks in a way similar to the way that computer scientists learn artificial neural networks (using backpropagation, which we will introduce in the next section).

Two-layer Fully-Connected Neural Networks. We constructed the neural network in equation (2.3) using a significant amount of prior knowledge/belief about how the "family size", "walkable", and "school quality" are determined by the inputs. We implicitly assumed that we know the family size is an important quantity to look at and that it can be determined by only the "size" and "# bedrooms". Such prior knowledge might not be available for other applications. It would be more flexible and general to have a generic parameterization. A simple way would be to write the intermediate variable

a_{1}

as a function of all

x_{1}, \dots, x_{4}

\begin{aligned} (2.4) \quad & a_{1} = ReLU (w_{1}^{⊤} x + b_{1}), where w_{1} \in R^{4} and b_{1} \in R \\ & a_{2} = ReLU (w_{2}^{⊤} x + b_{2}), where w_{2} \in R^{4} and b_{2} \in R \\ & a_{3} = ReLU (w_{3}^{⊤} x + b_{3}), where w_{3} \in R^{4} and b_{3} \in R \end{aligned}

We still define

h_{θ} (x)

using equation (2.3) with

a_{1}, a_{2}, a_{3}

being defined as above. Thus we have a so-called fully-connected neural network as visualized in the dependency graph in Figure 3 because all the intermediate variables

a_{i}

's depend on all the inputs

x_{i}

's.

For full generality, a two-layer fully-connected neural network with

m

hidden units and

d

-dimensional input

x \in R^{d}

is defined as

\begin{matrix} (2.5) & \forall j \in [1, \dots, m], z_{j} = w_{j}^{[1]^{⊤}} x + b_{j}^{[1]} where w_{j}^{[1]} \in R^{d}, b_{j}^{[1]} \in R \end{matrix}

\begin{aligned} & a_{j} & = ReLU (z_{j}), \\ & a & = {[a_{1}, \dots, a_{m}]}^{⊤} \in R^{m} \\ (2.6) & h_{θ} (x) & = w^{[2]^{⊤}} a + b^{[2]} where w^{[2]} \in R^{m}, b^{[2]} \in R. \end{aligned}

Figure 3: Diagram of a two-layer fully connected neural network. Each edge from node x_{i} to node a_{j} indicates that a_{j} depends on x_{i}. The edge from x_{i} to a_{j} is associated with the weight {(w_{j}^{[1]})}_{i}, which denotes the i-th coordinate of the vector w_{j}^{[1]}. The activation a_{j} can be computed by taking the ReLU of the weighted sum of the x_{i}'s, with the weights being those associated with the incoming edges, that is, a_{j} = ReLU (\sum_{i = 1}^{d} {(w_{j}^{[1]})}_{i} x_{i}).

Note that by default the vectors in

R^{d}

are viewed as column vectors, and in particular

a

is a column vector with components

a_{1}, a_{2}, \dots, a_{m}

. The indices

^{[1]}

and

^{[2]}

are used to distinguish two sets of parameters: the

w_{j}^{[1]}

's (each of which is a vector in

R^{d}

) and

w^{[2]}

(which is a vector in

R^{m}

). We will have more of these later.
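Equations (2.5) and (2.6) translate almost line by line into a loop over the hidden units. The sketch below uses hypothetical weights with m = 3 hidden units and d = 2 inputs; the function name `two_layer_loop` is also an assumption.

```python
def relu(t):
    return max(t, 0.0)

def two_layer_loop(w1, b1, w2, b2, x):
    """w1: list of m weight vectors w_j^{[1]} in R^d; b1: list of m biases;
    w2: vector in R^m; b2: scalar."""
    a = []
    for wj, bj in zip(w1, b1):                               # one hidden unit at a time
        zj = sum(wji * xi for wji, xi in zip(wj, x)) + bj    # equation (2.5)
        a.append(relu(zj))
    return sum(w2j * aj for w2j, aj in zip(w2, a)) + b2      # equation (2.6)

w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m = 3, d = 2
b1 = [0.0, -1.0, -5.0]
w2 = [1.0, 2.0, 3.0]
b2 = 0.5

# z = [2, 2, 0], a = [2, 2, 0], output = 2 + 4 + 0 + 0.5
print(two_layer_loop(w1, b1, w2, b2, [2.0, 3.0]))   # 6.5
```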

Vectorization. Before we introduce neural networks with more layers and more complex structures, we will simplify the expressions for neural networks using matrix and vector notation. Another important motivation for vectorization is implementation speed. In order to implement a neural network efficiently, one must be careful when using for loops. The most natural way to implement equation (2.5) in code is perhaps to use a for loop. In practice, the dimensionalities of the inputs and hidden units are high. As a result, code will run very slowly if you use for loops. Leveraging the parallelism in GPUs is/was crucial for the progress of deep learning.

This gave rise to vectorization. Instead of using for loops, vectorization takes advantage of matrix algebra and highly optimized numerical linear algebra packages (e.g., BLAS) to make neural network computations run quickly. Before the deep learning era, a for loop may have been sufficient on smaller datasets, but modern deep networks and state-of-the-art datasets will be infeasible to run with for loops.

We vectorize the two-layer fully-connected neural network as below. We define a weight matrix

W^{[1]} \in R^{m \times d}

as the concatenation of all the vectors

w_{j}^{[1]}

's in the following way:

\begin{matrix} (2.7) & W^{[1]} = [\begin{matrix} - w_{1}^{[1]^{⊤}} - \\ - w_{2}^{[1]^{⊤}} - \\ ⋮ \\ - w_{m}^{[1]^{⊤}} - \end{matrix}] \in R^{m \times d} \end{matrix}

Now by the definition of matrix-vector multiplication, we can write

z = {[z_{1}, \dots, z_{m}]}^{⊤} \in R^{m}

as

\begin{matrix} (2.8) & \underset{z \in R^{m \times 1}}{\underset{⏟}{[\begin{matrix} z_{1} \\ ⋮ \\ ⋮ \\ z_{m} \end{matrix}]}} = \underset{W^{[1]} \in R^{m \times d}}{\underset{⏟}{[\begin{matrix} - w_{1}^{[1]^{⊤}} - \\ - w_{2}^{[1]^{⊤}} - \\ ⋮ \\ - w_{m}^{[1]^{⊤}} - \end{matrix}]}} \underset{x \in R^{d \times 1}}{\underset{⏟}{[\begin{matrix} x_{1} \\ x_{2} \\ ⋮ \\ x_{d} \end{matrix}]}} + \underset{b^{[1]} \in R^{m \times 1}}{\underset{⏟}{[\begin{matrix} b_{1}^{[1]} \\ b_{2}^{[1]} \\ ⋮ \\ b_{m}^{[1]} \end{matrix}]}} \end{matrix}

Or succinctly,

\begin{matrix} (2.9) & z = W^{[1]} x + b^{[1]} \end{matrix}

We remark again that a vector in

R^{d}

in these notes, following the conventions previously established, is automatically viewed as a column vector, and can also be viewed as a

d \times 1

dimensional matrix. (Note that this is different from numpy where a vector is viewed as a row vector in broadcasting.)

Computing the activations

a \in R^{m}

from

z \in R^{m}

involves an element-wise non-linear application of the ReLU function, which can be computed in parallel efficiently. Overloading ReLU for element-wise application (meaning, for a vector

t \in R^{d}

,

ReLU (t)

is a vector such that

ReLU (t)_{i} = ReLU (t_{i})

), we have

\begin{matrix} (2.10) & a = ReLU (z) \end{matrix}

Define

W^{[2]} = [w^{[2]^{⊤}}] \in R^{1 \times m}

similarly. Then, the model in equation (2.6) can be summarized as

\begin{aligned} a & = ReLU (W^{[1]} x + b^{[1]}) \\ (2.11) & h_{θ} (x) & = W^{[2]} a + b^{[2]} \end{aligned}

Here

θ

consists of

W^{[1]}, W^{[2]}

(often referred to as the weight matrices) and

b^{[1]}, b^{[2]}

(referred to as the biases). The collection of

W^{[1]}, b^{[1]}

is referred to as the first layer, and

W^{[2]}, b^{[2]}

the second layer. The activation

a

is referred to as the hidden layer. A two-layer neural network is also called a one-hidden-layer neural network.
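The matrix form (2.11) can be sketched with matrices stored as lists of rows. The helpers `matvec`, `vec_add`, and `relu_vec` are hypothetical stand-ins written in plain Python so the example is self-contained; in practice each would be a single call into an optimized linear algebra library, which is where the speed-up comes from.

```python
def matvec(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def vec_add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

def relu_vec(z):
    """Element-wise ReLU, as in equation (2.10)."""
    return [max(zi, 0.0) for zi in z]

def two_layer(W1, b1, W2, b2, x):
    a = relu_vec(vec_add(matvec(W1, x), b1))   # a = ReLU(W1 x + b1)
    return vec_add(matvec(W2, a), b2)          # h = W2 a + b2, a length-1 vector

W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # in R^{3 x 2}; row j is w_j^{[1]T}
b1 = [0.0, -1.0, -5.0]
W2 = [[1.0, 2.0, 3.0]]                      # in R^{1 x 3}
b2 = [0.5]

# W1 x + b1 = [2, 2, 0], so a = [2, 2, 0] and h = 2 + 4 + 0 + 0.5
print(two_layer(W1, b1, W2, b2, [2.0, 3.0]))   # [6.5]
```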

Multi-layer fully-connected neural networks. With these succinct notations, we can stack more layers to get a deeper fully-connected neural network. Let

r

be the number of layers (weight matrices). Let

W^{[1]}, \dots, W^{[r]}, b^{[1]}, \dots, b^{[r]}

be the weight matrices and biases of all the layers. Then a multi-layer neural network can be written as

\begin{aligned} a^{[1]} & = ReLU (W^{[1]} x + b^{[1]}) \\ a^{[2]} & = ReLU (W^{[2]} a^{[1]} + b^{[2]}) \\ \dots \\ a^{[r - 1]} & = ReLU (W^{[r - 1]} a^{[r - 2]} + b^{[r - 1]}) \\ (2.12) & h_{θ} (x) & = W^{[r]} a^{[r - 1]} + b^{[r]} \end{aligned}

We note that the weight matrices and biases need to have compatible dimensions for the equations above to make sense. If

a^{[k]}

has dimension

m_{k}

, then the weight matrix

W^{[k]}

should be of dimension

m_{k} \times m_{k - 1}

, and the bias

b^{[k]} \in R^{m_{k}}

. Moreover,

W^{[1]} \in R^{m_{1} \times d}

and

W^{[r]} \in R^{1 \times m_{r - 1}}

. The total number of neurons in the network is

m_{1} + \dots + m_{r}

, and the total number of parameters in this network is

(d + 1) m_{1} + (m_{1} + 1) m_{2} + \dots + (m_{r - 1} + 1) m_{r}

. Sometimes for notational consistency we also write

a^{[0]} = x

, and

a^{[r]} =

h_{θ} (x)

. Then we have the simple recursion

\begin{matrix} (2.13) & a^{[k]} = ReLU (W^{[k]} a^{[k - 1]} + b^{[k]}), \forall k = 1, \dots, r - 1 \end{matrix}

Note that this would have been true for

k = r

if there were an additional ReLU in equation (2.12), but often people like to make the last layer linear (aka without a ReLU) so that negative outputs are possible and it's easier to interpret the last layer as a linear model. (More on the interpretability in the "connection to kernel method" paragraph of this section.)
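The recursion (2.13) with a linear last layer, together with the parameter count given above, can be sketched as follows. The helper names and the small three-layer example (d = 2, widths 3, 2, 1) are assumptions for illustration.

```python
def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def vec_add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

def relu_vec(z):
    return [max(zi, 0.0) for zi in z]

def forward(Ws, bs, x):
    """Ws = [W^{[1]}, ..., W^{[r]}], bs = [b^{[1]}, ..., b^{[r]}]."""
    a = x                                          # a^{[0]} = x
    for k, (W, b) in enumerate(zip(Ws, bs), start=1):
        z = vec_add(matvec(W, a), b)
        a = z if k == len(Ws) else relu_vec(z)     # the last layer stays linear
    return a

def num_parameters(d, widths):
    """widths = [m_1, ..., m_r]; counts every weight and bias."""
    dims = [d] + widths
    return sum((dims[k - 1] + 1) * dims[k] for k in range(1, len(dims)))

Ws = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # W^{[1]} in R^{3 x 2}
      [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]],    # W^{[2]} in R^{2 x 3}
      [[1.0, -1.0]]]                          # W^{[3]} in R^{1 x 2}
bs = [[0.0, 0.0, 0.0], [0.0, 0.0], [0.5]]

# a1 = [1, 2, 3]; z2 = [6, -1] so a2 = [6, 0]; output = 6 - 0 + 0.5
print(forward(Ws, bs, [1.0, 2.0]))        # [6.5]
print(num_parameters(2, [3, 2, 1]))       # 3*3 + 4*2 + 3*1 = 20
```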

Other activation functions. The activation function ReLU can be replaced by many other non-linear functions

σ (\cdot)

that map

R \to R

, such as

\begin{aligned} (2.14) & σ (z) = \frac{1}{1 + e^{- z}} (sigmoid) \\ (2.15) & σ (z) = \frac{e^{z} - e^{- z}}{e^{z} + e^{- z}} (tanh) \end{aligned}
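For reference, equations (2.14) and (2.15) can be coded directly from their definitions. This is a plain transcription rather than a numerically robust implementation: `math.exp` overflows for inputs of very large magnitude, which library versions such as `math.tanh` avoid.

```python
import math

def sigmoid(z):
    """Equation (2.14)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Equation (2.15), written out from the definition."""
    ez, enz = math.exp(z), math.exp(-z)
    return (ez - enz) / (ez + enz)

print(sigmoid(0.0))                  # 0.5
print(tanh(0.0))                     # 0.0
print(tanh(0.5), math.tanh(0.5))     # the two values agree up to rounding
```

The two functions are related by tanh(z) = 2 sigmoid(2z) - 1, so tanh can be seen as a rescaled and shifted sigmoid.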

Why do we not use the identity function for σ? That is, why not use

σ (z) = z

? Assume for the sake of argument that

b^{[1]}

and

b^{[2]}

are zeros. Suppose

σ (z) = z

, then for a two-layer neural network, we have that

\begin{aligned} (2.16) & h_{θ} (x) & = W^{[2]} a^{[1]} \\ (2.17) & = W^{[2]} σ (z^{[1]}) & by definition \\ (2.18) & = W^{[2]} z^{[1]} & since σ (z) = z \\ (2.19) & = W^{[2]} W^{[1]} x & from Equation (2.8) \\ (2.20) & = \tilde{W} x & where \tilde{W} = W^{[2]} W^{[1]} \end{aligned}

Notice how

W^{[2]} W^{[1]}

collapsed into

\tilde{W}

. This is because applying a linear function to another linear function will result in a linear function over the original input (i.e., you can construct a

\tilde{W}

such that

\tilde{W} x = W^{[2]} W^{[1]} x

). This loses much of the representational power of the neural network, as oftentimes the output we are trying to predict has a non-linear relationship with the inputs. Without non-linear activation functions, the neural network will simply perform linear regression.
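The collapse in equations (2.16) to (2.20) is easy to confirm numerically. In the sketch below the matrices and the input are arbitrary illustrative values, and `matvec` and `matmul` are small hypothetical helpers for matrices stored as lists of rows.

```python
def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # 3 x 2
W2 = [[1.0, -2.0, 0.5]]                      # 1 x 3
x = [2.0, 1.0]

# Two "layers" with identity activation and zero biases.
layered = matvec(W2, matvec(W1, x))

# A single linear map with W_tilde = W2 W1.
W_tilde = matmul(W2, W1)
collapsed = matvec(W_tilde, x)

print(W_tilde)               # [[2.5, -0.5]]
print(layered, collapsed)    # both [4.5]
```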

Connection to the Kernel Method. In the previous lectures, we covered the concept of feature maps. Recall that the main motivation for feature maps is to represent functions that are non-linear in the input

x

by

θ^{⊤} ϕ (x)

, where

θ

are the parameters and

ϕ (x)

, the feature map, is a handcrafted function non-linear in the raw input

x

. The performance of the learning algorithms can depend significantly on the choice of the feature map

ϕ (x)

. Oftentimes people use domain knowledge to design the feature map

ϕ (x)

that suits the particular applications. The process of choosing the feature maps is often referred to as feature engineering.

We can view deep learning as a way to automatically learn the right feature map (sometimes also referred to as "the representation") as follows. Suppose we denote by

β

the collection of the parameters in a fully-connected neural network (equation (2.12)) except those in the last layer. Then we can abstract

a^{[r - 1]}

as a function of the input

x

and the parameters in

β : a^{[r - 1]} = ϕ_{β} (x)

. Now we can write the model as

\begin{matrix} (2.21) & h_{θ} (x) = W^{[r]} ϕ_{β} (x) + b^{[r]} \end{matrix}

When

β

is fixed, then

ϕ_{β} (\cdot)

can be viewed as a feature map, and therefore

h_{θ} (x)

is just a linear model over the features

ϕ_{β} (x)

. However, when we train the neural network, both the parameters in

β

and the parameters

W^{[r]}, b^{[r]}

are optimized, and therefore we are not only learning a linear model in the feature space, but also learning a good feature map

ϕ_{β} (\cdot)

itself so that it's possible to predict accurately with a linear model on top of the feature map. Therefore, deep learning tends to depend less on the domain knowledge of the particular application and often requires less feature engineering. The penultimate layer

a^{[r - 1]}

is often (informally) referred to as the learned features or representations in the context of deep learning.

In the example of house price prediction, a fully-connected neural network does not need us to specify intermediate quantities such as "family size", and may automatically discover some useful features in the penultimate layer (the activation

a^{[r - 1]}

), and use them to linearly predict the housing price. Often the feature map / representation obtained from one dataset (that is, the function

ϕ_{β} (\cdot)

) can also be useful for other datasets, which indicates that it contains essential information about the data. However, oftentimes, the neural network will discover complex features which are very useful for predicting the output but may be difficult for a human to understand or interpret. This is why some people refer to neural networks as a black box, as it can be difficult to understand the features it has discovered.

3. Backpropagation

In this section, we introduce backpropagation or auto-differentiation, which computes the gradient of the loss

\nabla J^{(j)} (θ)

efficiently. We will start with an informal theorem that states that as long as a real-valued function

f

can be efficiently computed/evaluated by a differentiable network or circuit, then its gradient can be efficiently computed in a similar time. We will then show how to do this concretely for fully-connected neural networks.

Because the formality of the general theorem is not the main focus here, we will introduce the terms with informal definitions. By a differentiable circuit or a differentiable network, we mean a composition of a sequence of differentiable arithmetic operations (additions, subtractions, multiplications, divisions, etc.) and elementary differentiable functions (ReLU,

\exp, \log, \sin, \cos

, etc.). Let the size of the circuit be the total number of such operations and elementary functions. We assume that each of the operations and functions, and their derivatives or partial derivatives can be computed in

O (1)

time in the computer.

Theorem 3.1 : [backpropagation or auto-differentiation, informally stated] Suppose a differentiable circuit of size $N$ computes a real-valued function $f : R^{ℓ} \to R$ . Then, the gradient $\nabla f$ can be computed in time $O (N)$ , by a circuit of size $O (N)$ .^[4]

We note that the loss function

J^{(j)} (θ)

for the

j

-th example can indeed be computed by a sequence of operations and functions involving additions, subtractions, multiplications, and non-linear activations. Thus the theorem suggests that we should be able to compute

\nabla J^{(j)} (θ)

in a similar time to that for computing

J^{(j)} (θ)

itself. This applies not only to the fully-connected neural network introduced in Section 2, but also to many other types of neural networks.

In the rest of the section, we will showcase how to compute the gradient of the loss efficiently for fully-connected neural networks using backpropagation. Even though auto-differentiation or backpropagation is implemented in all the deep learning packages such as TensorFlow and PyTorch, understanding it is very helpful for gaining insights into the workings of deep learning.

3.1. Preliminary: chain rule

We first recall the chain rule in calculus. Suppose the variable

J

depends on the variables

θ_{1}, \dots, θ_{p}

via the intermediate variables

g_{1}, \dots, g_{k}

\begin{aligned} (3.1) & g_{j} = g_{j} (θ_{1}, \dots, θ_{p}), \forall j \in {1, \dots, k} \\ (3.2) & J = J (g_{1}, \dots, g_{k}) \end{aligned}

Here we overload the meaning of

g_{j}

's: they denote both the intermediate variables and the functions used to compute the intermediate variables. Then, by the chain rule, we have that

\forall i \in {1, \dots, p},

\begin{matrix} (3.3) & \frac{\partial J}{\partial θ_{i}} = \sum_{j = 1}^{k} \frac{\partial J}{\partial g_{j}} \frac{\partial g_{j}}{\partial θ_{i}} \end{matrix}

For the ease of invoking the chain rule in the following subsections in various ways, we will call

J

the output variable,

g_{1}, \dots, g_{k}

intermediate variables, and

θ_{1}, \dots, θ_{p}

the input variables in the chain rule.
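To make the chain rule (3.3) concrete, the sketch below evaluates it for a small hand-picked example with p = 2 inputs and k = 2 intermediate variables, and compares the result with a central finite-difference estimate. The functions `g` and `J_of_g` and the evaluation point are hypothetical choices.

```python
import math

def g(theta):
    """Intermediate variables g_1 = theta_1 * theta_2 and g_2 = sin(theta_1)."""
    t1, t2 = theta
    return [t1 * t2, math.sin(t1)]

def J_of_g(gs):
    """Output variable J = g_1^2 + 3 g_2."""
    return gs[0] ** 2 + 3.0 * gs[1]

def J(theta):
    return J_of_g(g(theta))

def grad_chain_rule(theta):
    t1, t2 = theta
    g1, _ = g(theta)
    dJ_dg = [2.0 * g1, 3.0]                       # dJ/dg_1, dJ/dg_2
    dg_dtheta = [[t2, t1],                        # dg_1/dtheta_1, dg_1/dtheta_2
                 [math.cos(t1), 0.0]]             # dg_2/dtheta_1, dg_2/dtheta_2
    # Equation (3.3): sum over the intermediate variables j.
    return [sum(dJ_dg[j] * dg_dtheta[j][i] for j in range(2)) for i in range(2)]

def grad_numeric(f, theta, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    out = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += eps
        dn = list(theta)
        dn[i] -= eps
        out.append((f(up) - f(dn)) / (2 * eps))
    return out

theta = [0.7, -1.3]
print(grad_chain_rule(theta))
print(grad_numeric(J, theta))   # agrees with the line above up to small error
```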

3.2. Backpropagation for two-layer neural networks

Now we consider the two-layer neural network defined in equation (2.11). Our general approach is to first unpack the vectorized notation to scalar form to apply the chain rule, but as soon as we finish the derivation, we will pack the scalar equations back to a vectorized form to keep the notations succinct.

Recall the following equations are used for the computation of the loss

J

\begin{aligned} z & = W^{[1]} x + b^{[1]} \\ a & = ReLU (z) \\ h_{θ} (x) ≜ o & = W^{[2]} a + b^{[2]} \\ (3.4) & J & = \frac{1}{2} (y - o)^{2} \end{aligned}

Recall that

W^{[1]} \in R^{m \times d}, W^{[2]} \in R^{1 \times m}

, and

b^{[1]}, z, a \in R^{m}

, and

o, y, b^{[2]} \in R

. Recall that a vector in

R^{d}

is automatically interpreted as a column vector (like a matrix in

R^{d \times 1}

) if need be.^[5]

Computing

\frac{\partial J}{\partial W^{[2]}}

. Suppose

W^{[2]} = [W_{1}^{[2]}, \dots, W_{m}^{[2]}]

. We start by computing

\frac{\partial J}{\partial W_{i}^{[2]}}

using the chain rule (3.3) with

o

as the intermediate variable.

\begin{aligned} \frac{\partial J}{\partial W_{i}^{[2]}} & = \frac{\partial J}{\partial o} \cdot \frac{\partial o}{\partial W_{i}^{[2]}} \\ = (o - y) \cdot \frac{\partial o}{\partial W_{i}^{[2]}} \\ = (o - y) \cdot a_{i} (because o = \sum_{i = 1}^{m} W_{i}^{[2]} a_{i} + b^{[2]}) \end{aligned}

Vectorized notation. The equation above in vectorized notation becomes

\begin{matrix} (3.5) & \frac{\partial J}{\partial W^{[2]}} = (o - y) \cdot a^{⊤} \in R^{1 \times m} \end{matrix}

Similarly, we leave the reader to verify that

\begin{matrix} (3.6) & \frac{\partial J}{\partial b^{[2]}} = (o - y) \in R \end{matrix}

Clarification for the dimensionality of the partial derivative notation. We will use the notation

\frac{\partial J}{\partial A}

frequently in the rest of the lecture notes. We note that here we only use this notation for the case when

J

is a real-valued variable,^[6] but

A

can be a vector or a matrix. Moreover,

\frac{\partial J}{\partial A}

has the same dimensionality as

A

. For example, when

A

is a matrix, the

(i, j)

-th entry of

\frac{\partial J}{\partial A}

is equal to

\frac{\partial J}{\partial A_{i j}}

. If you are familiar with the notion of total derivatives, we note that the convention for dimensionality here is different from that for total derivatives.

Computing $\frac{\partial J}{\partial W^{[1]}}$ . Next we compute

\frac{\partial J}{\partial W^{[1]}}

. We first unpack the vectorized notation: let

W_{i j}^{[1]}

denote the

(i, j)

-th entry of

W^{[1]}

, where

i \in [m]

and

j \in [d]

. We compute

\frac{\partial J}{\partial W_{i j}^{[1]}}

using chain rule (3.3) with

z_{i}

as the intermediate variable.

\begin{aligned} \frac{\partial J}{\partial W_{i j}^{[1]}} & = \frac{\partial J}{\partial z_{i}} \cdot \frac{\partial z_{i}}{\partial W_{i j}^{[1]}} \\ = \frac{\partial J}{\partial z_{i}} \cdot x_{j} (because z_{i} = \sum_{k = 1}^{d} W_{i k}^{[1]} x_{k} + b_{i}^{[1]}) \end{aligned}

Vectorized notation. The equation above can be written compactly as

\begin{matrix} (3.7) & \frac{\partial J}{\partial W^{[1]}} = \frac{\partial J}{\partial z} \cdot x^{⊤} \end{matrix}

We can verify that the dimensions match:

\frac{\partial J}{\partial W^{[1]}} \in R^{m \times d}, \frac{\partial J}{\partial z} \in R^{m \times 1}

and

x^{⊤} \in R^{1 \times d}

Abstraction: For future usage, the computations for

\frac{\partial J}{\partial W^{[1]}}

and

\frac{\partial J}{\partial W^{[2]}}

above can be abstracted into the following claim:

Claim 3.2: Suppose

J

is a real-valued output variable,

z \in R^{m}

is the intermediate variable, and

W \in R^{m \times d}, u \in R^{d}, b \in R^{m}

are the input variables, and suppose they satisfy the following:

\begin{aligned} (3.8) & z & = W u + b \\ (3.9) & J & = J (z) \end{aligned}

Then

\frac{\partial J}{\partial W}

and

\frac{\partial J}{\partial b}

satisfy:

\begin{aligned} (3.10) & \frac{\partial J}{\partial W} & = \frac{\partial J}{\partial z} \cdot u^{⊤} \\ (3.11) & \frac{\partial J}{\partial b} & = \frac{\partial J}{\partial z} \end{aligned}
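Claim 3.2 can be checked numerically on a small instance. The sketch below picks J(z) = 0.5 ||z||^2 purely for illustration, so that the partial derivative of J with respect to z is z itself, and compares equation (3.10) with finite differences. All names and values are hypothetical.

```python
def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def J_of(W, u, b):
    """Illustrative output J(z) = 0.5 * ||z||^2 with z = W u + b."""
    z = [zi + bi for zi, bi in zip(matvec(W, u), b)]
    return 0.5 * sum(zi * zi for zi in z)

def grads_claim(W, u, b):
    z = [zi + bi for zi, bi in zip(matvec(W, u), b)]
    dJ_dz = z                                            # holds for this choice of J
    dJ_dW = [[dzi * uj for uj in u] for dzi in dJ_dz]    # (3.10): outer product
    dJ_db = list(dJ_dz)                                  # (3.11)
    return dJ_dW, dJ_db

def numeric_dW(W, u, b, i, j, eps=1e-6):
    """Central finite difference with respect to the single entry W[i][j]."""
    Wp = [row[:] for row in W]
    Wp[i][j] += eps
    Wm = [row[:] for row in W]
    Wm[i][j] -= eps
    return (J_of(Wp, u, b) - J_of(Wm, u, b)) / (2 * eps)

W = [[1.0, 2.0, 0.0], [0.5, -1.0, 1.0]]   # m = 2, d = 3
u = [1.0, -1.0, 2.0]
b = [0.5, 0.0]

dJ_dW, dJ_db = grads_claim(W, u, b)
print(dJ_dW)   # [[-0.5, 0.5, -1.0], [3.5, -3.5, 7.0]], since z = [-0.5, 3.5]
print(dJ_db)   # [-0.5, 3.5]
```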

Computing $\frac{\partial J}{\partial z}$ . Equation (3.7) tells us that to compute

\frac{\partial J}{\partial W^{[1]}}

, it suffices to compute

\frac{\partial J}{\partial z}

, which is the goal of the next few derivations.

We invoke the chain rule with

J

as the output variable,

a_{i}

as the intermediate variable, and

z_{i}

as the input variable,

\begin{aligned} \frac{\partial J}{\partial z_{i}} & = \frac{\partial J}{\partial a_{i}} \frac{\partial a_{i}}{\partial z_{i}} \\ = \frac{\partial J}{\partial a_{i}} \cdot 1 {z_{i} \geq 0} \end{aligned}

Vectorization and abstraction. The computation above can be summarized into:

Claim 3.3: Suppose the real-valued output variable

J

and vectors

z, a \in R^{m}

satisfy the following:

\begin{aligned} a & = σ (z), where σ is an element-wise activation, z, a \in R^{m} \\ J & = J (a) \end{aligned}

Then, we have that

\begin{matrix} (3.12) & \frac{\partial J}{\partial z} = \frac{\partial J}{\partial a} ⊙ σ^{'} (z) \end{matrix}

where

σ^{'} (\cdot)

is the element-wise derivative of the activation function

σ

, and

⊙

denotes the element-wise product of two vectors of the same dimensionality.

Computing $\frac{\partial J}{\partial a}$ . Now it suffices to compute

\frac{\partial J}{\partial a}

. We invoke the chain rule with

J

as the output variable,

o

as the intermediate variable, and

a_{i}

as the input variable,

\begin{aligned} \frac{\partial J}{\partial a_{i}} & = \frac{\partial J}{\partial o} \frac{\partial o}{\partial a_{i}} \\ = (o - y) \cdot W_{i}^{[2]} (because o = \sum_{i = 1}^{m} W_{i}^{[2]} a_{i} + b^{[2]}) \end{aligned}

Vectorization. In vectorized notation, we have

\begin{matrix} (3.13) & \frac{\partial J}{\partial a} = W^{[2]^{⊤}} \cdot (o - y) \end{matrix}

Abstraction. We now present a more general form of the computation above.

Claim 3.4: Suppose

J

is a real-valued output variable,

v \in R^{m}

is the intermediate variable, and

W \in R^{m \times d}, u \in R^{d}, b \in R^{m}

are the input variables, and suppose they satisfy the following:

\begin{aligned} v & = W u + b \\ J & = J (v) \end{aligned}

Then,

\begin{matrix} (3.14) & \frac{\partial J}{\partial u} = W^{⊤} \frac{\partial J}{\partial v} \end{matrix}

Summary for two-layer neural networks. Now combining the equations above, we arrive at Algorithm 3 which computes the gradients for two-layer neural networks.
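The gradient computation for the two-layer network, combining equations (3.5), (3.6), (3.7), (3.12), and (3.13), can be sketched in plain Python as follows. The function names, the weights, and the data point are hypothetical, and `numeric_dW1` is a finite-difference helper included only as a sanity check. The example is chosen so that no pre-activation z_i sits near zero, where ReLU has a kink.

```python
def forward(W1, b1, W2, b2, x):
    z = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W1, b1)]
    a = [max(zi, 0.0) for zi in z]
    o = sum(w * ai for w, ai in zip(W2[0], a)) + b2
    return z, a, o

def loss(W1, b1, W2, b2, x, y):
    _, _, o = forward(W1, b1, W2, b2, x)
    return 0.5 * (y - o) ** 2                      # equation (3.4)

def backprop(W1, b1, W2, b2, x, y):
    z, a, o = forward(W1, b1, W2, b2, x)
    delta2 = o - y                                 # dJ/do
    # dJ/dz = (W2^T (o - y)) * 1{z >= 0}, element-wise
    delta1 = [w * delta2 * (1.0 if zi >= 0 else 0.0) for w, zi in zip(W2[0], z)]
    dW2 = [[delta2 * ai for ai in a]]              # (3.5)
    db2 = delta2                                   # (3.6)
    dW1 = [[d * xj for xj in x] for d in delta1]   # (3.7)
    db1 = list(delta1)
    return dW1, db1, dW2, db2

def numeric_dW1(W1, b1, W2, b2, x, y, i, j, eps=1e-6):
    Wp = [row[:] for row in W1]
    Wp[i][j] += eps
    Wm = [row[:] for row in W1]
    Wm[i][j] -= eps
    return (loss(Wp, b1, W2, b2, x, y) - loss(Wm, b1, W2, b2, x, y)) / (2 * eps)

W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]   # m = 3, d = 2
b1 = [0.1, -0.2, 0.3]
W2 = [[0.7, -1.2, 2.0]]
b2 = 0.4
x, y = [1.0, 2.0], 1.0

dW1, db1, dW2, db2 = backprop(W1, b1, W2, b2, x, y)
print(dW1)
print(numeric_dW1(W1, b1, W2, b2, x, y, 1, 0))   # close to dW1[1][0]
```

Here z is approximately [-2.9, 2.3, 0.3], so the first hidden unit is inactive and its row of the gradient with respect to W^{[1]} is zero.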

Algorithm 3 Back-propagation for two-layer neural networks

1: Compute the values of z \in R^{m}, a \in R^{m}, and o \in R.

2: Compute

\begin{aligned} δ^{[2]} & ≜ \frac{\partial J}{\partial o} = (o - y) \in R \\ δ^{[1]} & ≜ \frac{\partial J}{\partial z} = (W^{[2]^{⊤}} (o - y)) ⊙ 1 {z \geq 0} \in R^{m \times 1} & (by eqn. (3.12) and (3.13)) \end{aligned}

3: Compute

\begin{aligned} \frac{\partial J}{\partial W^{[2]}} & = δ^{[2]} a^{⊤} \in R^{1 \times m} & (by eqn. (3.5)) \\ \frac{\partial J}{\partial b^{[2]}} & = δ^{[2]} \in R & (by eqn. (3.6)) \\ \frac{\partial J}{\partial W^{[1]}} & = δ^{[1]} x^{⊤} \in R^{m \times d} & (by eqn. (3.7)) \\ \frac{\partial J}{\partial b^{[1]}} & = δ^{[1]} \in R^{m} & (as an exercise) \end{aligned}

3.3. Multi-layer neural networks

In this section, we will derive the backpropagation algorithms for the model defined in (2.12). With the notation

a^{[0]} = x

, recall that we have

\begin{aligned} a^{[1]} & = ReLU (W^{[1]} a^{[0]} + b^{[1]}) \\ a^{[2]} & = ReLU (W^{[2]} a^{[1]} + b^{[2]}) \\ \dots \\ a^{[r - 1]} & = ReLU (W^{[r - 1]} a^{[r - 2]} + b^{[r - 1]}) \\ a^{[r]} & = z^{[r]} = W^{[r]} a^{[r - 1]} + b^{[r]} \\ J & = \frac{1}{2} {(a^{[r]} - y)}^{2} \end{aligned}

Here we define both

a^{[r]}

and

z^{[r]}

as

h_{θ} (x)

for notational simplicity.
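The forward computation recalled above can be sketched as a short loop that also stores every $z^{[k]}$ and $a^{[k]}$ for later use. The list-based storage and the function name `forward_pass` are assumptions made for illustration; index 0 of each list is reserved so that `z[k]` and `a[k]` line up with the layer index $k$.

```python
import numpy as np

def forward_pass(x, y, weights, biases):
    """Forward pass of an r-layer ReLU network with a linear last layer.

    weights[k - 1] and biases[k - 1] hold W^{[k]} and b^{[k]} for k = 1..r.
    Returns lists z, a (indexed by layer) and the loss J.
    """
    r = len(weights)
    a = [x]        # a[0] = x
    z = [None]     # placeholder so that z[k] matches layer k
    for k in range(1, r + 1):
        z_k = weights[k - 1] @ a[k - 1] + biases[k - 1]
        # No ReLU on the last layer: a^{[r]} = z^{[r]}.
        a_k = z_k if k == r else np.maximum(z_k, 0.0)
        z.append(z_k)
        a.append(a_k)
    J = 0.5 * float(np.sum((a[r] - y) ** 2))
    return z, a, J

# Tiny hand-checkable example with r = 2.
W1, b1 = np.eye(2), np.zeros(2)
W2, b2 = np.array([[2.0, 3.0]]), np.array([1.0])
z, a, J = forward_pass(np.array([1.0, -2.0]), 1.0, [W1, W2], [b1, b2])
```

In this example the hidden pre-activation is $(1, -2)$, the ReLU zeroes the second entry, the output is $2 \cdot 1 + 3 \cdot 0 + 1 = 3$, and the loss is $\frac{1}{2} (3 - 1)^{2} = 2$.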

First, we note that we have the following local abstraction for

k \in \{1, \dots, r\}

\begin{aligned} z^{[k]} & = W^{[k]} a^{[k - 1]} + b^{[k]} \\ J & = J (z^{[k]}) \end{aligned}

Invoking Claim 3.2, we have that

\begin{aligned} & \frac{\partial J}{\partial W^{[k]}} & = \frac{\partial J}{\partial z^{[k]}} \cdot a^{[k - 1]^{⊤}} \\ (3.15) & \frac{\partial J}{\partial b^{[k]}} & = \frac{\partial J}{\partial z^{[k]}} \end{aligned}

Therefore, it suffices to compute

\frac{\partial J}{\partial z^{[k]}}

. For simplicity, let's define

δ^{[k]} ≜ \frac{\partial J}{\partial z^{[k]}}

. We compute

δ^{[k]}

from

k = r

to 1 inductively. First we have that

\begin{matrix} (3.16) & δ^{[r]} ≜ \frac{\partial J}{\partial z^{[r]}} = (z^{[r]} - y) \end{matrix}

Next for

k \leq r - 1

, suppose we have computed the value of

δ^{[k + 1]}

, then we will compute

δ^{[k]}

. First, using Claim 3.3, we have that

δ^{[k]} ≜ \frac{\partial J}{\partial z^{[k]}} = \frac{\partial J}{\partial a^{[k]}} ⊙ {ReLU}^{'} (z^{[k]})

Then we note that the relationship between

a^{[k]}

and

z^{[k + 1]}

can be abstractly written as

\begin{aligned} (3.17) & z^{[k + 1]} & = W^{[k + 1]} a^{[k]} + b^{[k + 1]} \\ (3.18) & J & = J (z^{[k + 1]}) \end{aligned}

Therefore by Claim

3.4

we have that

\begin{matrix} (3.19) & \frac{\partial J}{\partial a^{[k]}} = W^{[k + 1]^{⊤}} \frac{\partial J}{\partial z^{[k + 1]}} \end{matrix}

It follows that

\begin{aligned} δ^{[k]} & = (W^{[k + 1]^{⊤}} \frac{\partial J}{\partial z^{[k + 1]}}) ⊙ {ReLU}^{'} (z^{[k]}) \\ & = (W^{[k + 1]^{⊤}} δ^{[k + 1]}) ⊙ {ReLU}^{'} (z^{[k]}) \end{aligned}
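A single step of this recursion is only one line of NumPy. The sketch below assumes the ReLU derivative is taken as the indicator $1 \{z \geq 0\}$, matching the two-layer algorithm; the helper name `backward_step` and the small numbers are hypothetical.

```python
import numpy as np

def backward_step(W_next, delta_next, z_k):
    """Compute delta^{[k]} from delta^{[k+1]}.

    W_next is W^{[k+1]}, delta_next is delta^{[k+1]}, z_k is z^{[k]}.
    """
    relu_prime = (z_k >= 0).astype(float)      # assumed convention at z = 0
    return (W_next.T @ delta_next) * relu_prime

# Small hand-checkable example.
W_next = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
delta_next = np.array([1.0, -1.0])
z_k = np.array([0.5, -0.5])
delta_k = backward_step(W_next, delta_next, z_k)
```

Here $W^{[k + 1]^{⊤}} δ^{[k + 1]} = (1 - 3, 2 - 4) = (-2, -2)$, and the second entry is masked out because $z^{[k]}_{2} < 0$, so `delta_k` is $(-2, 0)$.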

Algorithm 4 Back-propagation for multi-layer neural networks.

1: Compute and store the values of the $a^{[k]}$'s and $z^{[k]}$'s for $k = 1, \dots, r$, and $J$. ⊳ This is often called the “forward pass”

2: for $k = r$ to $1$ do ⊳ This is often called the “backward pass”

3:     if $k = r$ then

4:         compute $δ^{[r]} ≜ \frac{\partial J}{\partial z^{[r]}}$

5:     else

6:         compute $δ^{[k]} ≜ \frac{\partial J}{\partial z^{[k]}} = (W^{[k + 1]^{⊤}} δ^{[k + 1]}) ⊙ {ReLU}^{'} (z^{[k]})$

7:     Compute

\begin{aligned} \frac{\partial J}{\partial W^{[k]}} & = δ^{[k]} a^{[k - 1]^{⊤}} \\ \frac{\partial J}{\partial b^{[k]}} & = δ^{[k]} \end{aligned}
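Putting the forward and backward passes together, Algorithm 4 might be sketched as below for a single example. This is an illustrative sketch, not a reference implementation: it assumes a scalar target `y`, the squared loss from this section, the ReLU derivative $1 \{z \geq 0\}$, and a hypothetical list-based layout where `weights[k - 1]` holds $W^{[k]}$.

```python
import numpy as np

def backprop(x, y, weights, biases):
    """Return J and the gradients dJ/dW^{[k]}, dJ/db^{[k]} for k = 1..r."""
    r = len(weights)

    # Forward pass: store every z^{[k]} and a^{[k]}.
    a, z = [x], [None]
    for k in range(1, r + 1):
        z_k = weights[k - 1] @ a[k - 1] + biases[k - 1]
        z.append(z_k)
        a.append(z_k if k == r else np.maximum(z_k, 0.0))
    J = 0.5 * float(np.sum((a[r] - y) ** 2))

    # Backward pass: compute delta^{[k]} for k = r, ..., 1.
    grads_W, grads_b = [None] * r, [None] * r
    delta = None
    for k in range(r, 0, -1):
        if k == r:
            delta = z[r] - y                                # equation (3.16)
        else:
            # weights[k] is W^{[k+1]} in the notation of the notes.
            delta = (weights[k].T @ delta) * (z[k] >= 0)
        grads_W[k - 1] = np.outer(delta, a[k - 1])          # delta a^T
        grads_b[k - 1] = delta
    return J, grads_W, grads_b

# Illustrative usage on a random 3-layer network (sizes are arbitrary).
rng = np.random.default_rng(4)
sizes = [3, 4, 5, 1]
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
bs = [rng.normal(size=sizes[i + 1]) for i in range(3)]
x, y = rng.normal(size=3), 0.3
J, gW, gb = backprop(x, y, Ws, bs)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick sanity check worth running before any numerical gradient check.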

4. Vectorization Over Training Examples

As we discussed in Section 1, in the implementation of neural networks, we will leverage the parallelism across multiple examples. This means that we will need to write the forward pass (the evaluation of the outputs) of the neural network and the backward pass (backpropagation) for multiple training examples in matrix notation.

The basic idea. The basic idea is simple. Suppose you have a training set with three examples

x^{(1)}, x^{(2)}, x^{(3)}

. The first-layer activations for each example are as follows:

\begin{aligned} z^{[1] (1)} = W^{[1]} x^{(1)} + b^{[1]} \\ z^{[1] (2)} = W^{[1]} x^{(2)} + b^{[1]} \\ z^{[1] (3)} = W^{[1]} x^{(3)} + b^{[1]} \end{aligned}

Note the difference between square brackets

[\cdot]

, which refer to the layer number, and parentheses

(\cdot)

, which refer to the training example number. Intuitively, one would implement this using a for loop. It turns out, we can vectorize these operations as well. First, define:

\begin{matrix} (4.1) & X = [\begin{array}{ccc} ∣ & ∣ & ∣ \\ x^{(1)} & x^{(2)} & x^{(3)} \\ ∣ & ∣ & ∣ \end{array}] \in R^{d \times 3} \end{matrix}

Note that we are stacking training examples in columns and not rows. We can then combine this into a single unified formulation:

\begin{matrix} (4.2) & Z^{[1]} = [\begin{array}{ccc} ∣ & ∣ & ∣ \\ z^{[1] (1)} & z^{[1] (2)} & z^{[1] (3)} \\ ∣ & ∣ & ∣ \end{array}] = W^{[1]} X + b^{[1]} \end{matrix}

You may notice that we are attempting to add

b^{[1]} \in R^{4 \times 1}

to

W^{[1]} X \in R^{4 \times 3}

. Strictly following the rules of linear algebra, this is not allowed. In practice however, this addition is performed using broadcasting. We create an intermediate

{\tilde{b}}^{[1]} \in R^{4 \times 3}

\begin{matrix} (4.3) & {\tilde{b}}^{[1]} = [\begin{array}{ccc} ∣ & ∣ & ∣ \\ b^{[1]} & b^{[1]} & b^{[1]} \\ ∣ & ∣ & ∣ \end{array}] \end{matrix}

We can then perform the computation:

Z^{[1]} = W^{[1]} X + {\tilde{b}}^{[1]}

. Often, it is not necessary to explicitly construct

{\tilde{b}}^{[1]}

. By inspecting the dimensions in (4.2), you can assume

b^{[1]} \in R^{4 \times 1}

is correctly broadcast to

W^{[1]} X \in R^{4 \times 3}

.

The matricization approach above generalizes easily to multiple layers, with one subtlety, discussed below.
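The column-stacked computation in (4.2) and the broadcasting of $b^{[1]}$ can be sketched in NumPy as follows. The input dimension `d = 2` and the random values are assumptions for illustration; the hidden size 4 and the three examples match the dimensions used above.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, n = 2, 4, 3
W1 = rng.normal(size=(m, d))
b1 = rng.normal(size=(m, 1))     # bias as an explicit m x 1 column
X = rng.normal(size=(d, n))      # one training example per column

# For-loop version: one matrix-vector product per example.
Z_loop = np.column_stack([W1 @ X[:, i] + b1[:, 0] for i in range(n)])

# Vectorized version: NumPy broadcasts the (m, 1) bias across the n columns.
Z_broadcast = W1 @ X + b1

# Explicit version: build the m x n matrix of repeated bias columns first.
b_tilde = np.tile(b1, (1, n))
Z_explicit = W1 @ X + b_tilde
```

All three results are the same $4 \times 3$ matrix; the broadcast version is usually preferred because it avoids materializing `b_tilde`.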

Complications/Subtlety in the Implementation. All the deep learning packages or implementations put the data points in the rows of a data matrix. (If the data point itself is a matrix or tensor, then the data are concatenated along the zero-th dimension.) However, most deep learning papers use a notation similar to these notes, where the data points are treated as column vectors.^[7] There is a simple conversion to deal with the mismatch: in the implementation, all the column vectors become row vectors, row vectors become column vectors, all the matrices are transposed, and the orders of the matrix multiplications are flipped. In the example above, using the row-major convention, the data matrix is

X \in R^{3 \times d}

, the first layer weight matrix has dimensionality

d \times m

(instead of

m \times d

as in the two-layer neural network section), and the bias vector

b^{[1]} \in R^{1 \times m}

. The computation for the hidden activation becomes

\begin{matrix} (4.4) & Z^{[1]} = X W^{[1]} + b^{[1]} \in R^{3 \times m} \end{matrix}
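The conversion between the two conventions can be checked directly: transposing every object and flipping the order of the product gives the transpose of the column-convention result. The variable names with `_col` and `_row` suffixes and the sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
d, m, n = 2, 4, 3

# Column convention used in these notes: examples are columns of X.
W1_col = rng.normal(size=(m, d))
b1_col = rng.normal(size=(m, 1))
X_col = rng.normal(size=(d, n))
Z_col = W1_col @ X_col + b1_col          # shape (m, n)

# Row convention used in implementations: examples are rows of X.
X_row = X_col.T                          # shape (n, d)
W1_row = W1_col.T                        # shape (d, m)
b1_row = b1_col.T                        # shape (1, m)
Z_row = X_row @ W1_row + b1_row          # shape (n, m), as in (4.4)
```

Here `Z_row` equals `Z_col.T`, so the two conventions describe the same computation and differ only in layout.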


^[1] If a concrete example is helpful, perhaps think about the model $h_{θ} (x) = θ_{1}^{2} x_{1}^{2} + θ_{2}^{2} x_{2}^{2} + \dots + θ_{d}^{2} x_{d}^{2}$ in this subsection, even though it's not a neural network.
^[2] Recall that, as defined in the previous lecture notes, we use the notation " $a := b$ " to denote an operation (in a computer program) in which we set the value of a variable $a$ to be equal to the value of $b$ . In other words, this operation overwrites $a$ with the value of $b$ . In contrast, we will write " $a = b$ " when we are asserting a statement of fact, that the value of $a$ is equal to the value of $b$ .
^[3] Typically, for a multi-layer neural network, we don't apply ReLU at the end, near the output, especially when the output is not necessarily a positive number.
^[4] We note that if the output of the function $f$ does not depend on some of the input coordinates, then we set by default the gradient w.r.t. those coordinates to zero. Setting to zero does not count towards the total runtime here in our accounting scheme. This is why when $N \leq ℓ$ , we can compute the gradient in $O (N)$ time, which might potentially be even less than $ℓ$ .
^[5] We also note that even though this is the convention in math, it's different from the convention in numpy, where a one-dimensional array will be automatically interpreted as a row vector.
^[6] There is an extension of this notation to vector or matrix variable $J$ . However, in practice, it's often impractical to compute the derivatives of high-dimensional outputs. Thus, we will avoid using the notation $\frac{\partial J}{\partial A}$ for $J$ that is not a real-valued variable.
^[7] The instructor suspects that this is mostly because in mathematics we naturally multiply a vector by a matrix on the left-hand side.
