Introduction to Deep Learning

You can read the notes from the previous lecture of Andrew Ng's CS229 course, on Kernel Methods, here.
We now begin our study of deep learning. In this set of notes, we give an overview of neural networks, discuss vectorization, and discuss training neural networks with backpropagation.

1. Supervised Learning with Non-linear Models

In the supervised learning setting (predicting $y$ from the input $x$), suppose our model/hypothesis is $h_{\theta}(x)$. In the past lectures, we have considered the cases when $h_{\theta}(x)=\theta^{\top} x$ (in linear regression or logistic regression) or $h_{\theta}(x)=\theta^{\top} \phi(x)$ (where $\phi(x)$ is the feature map). A commonality of these two models is that they are linear in the parameters $\theta$. Next we will consider learning a general family of models that are non-linear in both the parameters $\theta$ and the inputs $x$. The most common non-linear models are neural networks, which we will define starting from the next section. For this section, it suffices to think of $h_{\theta}(x)$ as an abstract non-linear model.[1]
Suppose $\left\{\left(x^{(i)}, y^{(i)}\right)\right\}_{i=1}^{n}$ are the training examples. For simplicity, we start with the case where $y^{(i)} \in \mathbb{R}$ and $h_{\theta}(x) \in \mathbb{R}$.
Cost/loss function. We define the least square cost function for the $i$-th example $\left(x^{(i)}, y^{(i)}\right)$ as
\begin{equation} J^{(i)}(\theta)=\frac{1}{2}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}\tag{1.1} \end{equation}
and define the mean-square cost function for the dataset as
\begin{equation} J(\theta)=\frac{1}{n} \sum_{i=1}^{n} J^{(i)}(\theta) \tag{1.2} \end{equation}
which is the same as in linear regression except that we introduce a constant $1/n$ in front of the cost function to be consistent with the convention. Note that multiplying the cost function by a positive scalar will not change the local minima or global minima of the cost function. Also note that the underlying parameterization for $h_{\theta}(x)$ is different from the case of linear regression, even though the form of the cost function is the same mean-squared loss. Throughout the notes, we use the words "loss" and "cost" interchangeably.
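As a small illustration of equations (1.1) and (1.2), the sketch below evaluates both costs for an arbitrary model passed in as a function. The names `squared_loss` and `mean_squared_cost`, the toy model, and the data are all hypothetical and chosen only for this example.

```python
def squared_loss(h, x, y):
    """Per-example cost J^(i) of equation (1.1) for a model h."""
    return 0.5 * (h(x) - y) ** 2


def mean_squared_cost(h, xs, ys):
    """Dataset cost J of equation (1.2): the average of the per-example costs."""
    return sum(squared_loss(h, x, y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy model and data, for illustration only.
h = lambda x: 2.0 * x + 1.0
xs = [0.0, 1.0, 2.0]
ys = [1.0, 2.0, 7.0]

# Per-example costs are 0, 0.5 and 2, so the mean is 2.5 / 3.
cost = mean_squared_cost(h, xs, ys)
```

Because `h` is passed in as a plain callable, the same two functions apply unchanged when $h_{\theta}$ is later replaced by a neural network.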
Optimizers (SGD). Commonly, people use gradient descent (GD), stochastic gradient descent (SGD), or their variants to optimize the loss function $J(\theta)$. GD's update rule can be written as [2]
\begin{equation} \theta:=\theta-\alpha \nabla_{\theta} J(\theta) \tag{1.3} \end{equation}
where $\alpha>0$ is often referred to as the learning rate or step size. Next, we introduce a version of SGD (Algorithm 1), which is slightly different from that in the first lecture notes.
Algorithm 1 Stochastic Gradient Descent
1: Hyperparameters: learning rate $\alpha$, total number of iterations $n_{\text{iter}}$.
2: Initialize $\theta$ randomly.
3: for $i = 1$ to $n_{\text{iter}}$ do
4: $\quad$ Sample $j$ uniformly from $\{1,\ldots,n\}$, and update $\theta$ by
\begin{equation} \theta := \theta - \alpha \nabla_{\theta}J^{(j)}(\theta) \tag{1.4} \end{equation}
Oftentimes computing the gradient of $B$ examples simultaneously for the parameter $\theta$ can be faster than computing $B$ gradients separately due to hardware parallelization. Therefore, a mini-batch version of SGD is most commonly used in deep learning, as shown in Algorithm 2. There are also other variants of SGD or mini-batch SGD with slightly different sampling schemes.
Algorithm 2 Mini-batch Stochastic Gradient Descent
1: Hyperparameters: learning rate $\alpha$, batch size $B$, number of iterations $n_{\text{iter}}$.
2: Initialize $\theta$ randomly.
3: for $i = 1$ to $n_{\text{iter}}$ do
4: $\quad$ Sample $B$ examples $j_{1},\ldots,j_{B}$ (without replacement) uniformly from $\{1,\ldots, n\}$, and update $\theta$ by
\begin{equation} \theta := \theta - \frac{\alpha}{B} \sum^{B}_{k=1}\nabla_{\theta}J^{(j_{k})}(\theta) \tag{1.5} \end{equation}
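The two algorithms above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is the toy linear model $h_{\theta}(x)=wx+b$ (chosen so the gradient of the squared loss can be written by hand, since backpropagation has not been introduced yet), and the function name `minibatch_sgd`, its default hyperparameters, and the data are hypothetical.

```python
import random


def minibatch_sgd(xs, ys, alpha=0.05, batch_size=2, n_iter=3000, seed=0):
    """Algorithm 2 for the toy model h(x) = w*x + b with the squared loss (1.1)."""
    rng = random.Random(seed)
    # Step 2: initialize theta = (w, b) randomly.
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    n = len(xs)
    for _ in range(n_iter):
        # Step 4: draw batch_size distinct indices (sampling without replacement).
        batch = rng.sample(range(n), batch_size)
        grad_w = grad_b = 0.0
        for j in batch:
            residual = (w * xs[j] + b) - ys[j]
            # Gradient of J^(j) = 0.5 * residual^2 with respect to (w, b).
            grad_w += residual * xs[j]
            grad_b += residual
        # Equation (1.5): step along the average of the per-example gradients.
        w -= alpha / batch_size * grad_w
        b -= alpha / batch_size * grad_b
    return w, b


# Noise-free toy data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = minibatch_sgd(xs, ys)                       # Algorithm 2 with B = 2
w_sgd, b_sgd = minibatch_sgd(xs, ys, batch_size=1) # B = 1 recovers Algorithm 1
```

The inner Python loop computes the $B$ gradients one at a time, so this sketch does not show the speed benefit of batching; that benefit comes from computing the whole batch with array operations on parallel hardware.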
With these generic algorithms, a typical deep learning model is learned with the following steps:
1. Define a neural network parametrization $h_{\theta}(x)$, which we will introduce in Section 2.
2. Write the backpropagation algorithm to compute the gradient of the loss function $J^{(j)}(\theta)$ efficiently, which will be covered in Section 3.
3. Run SGD or mini-batch SGD (or other gradient-based optimizers) with the loss function $J(\theta)$.

2. Neural Networks

Neural networks refer to a broad type of non-linear models/parametrizations $h_{\theta}(x)$ that involve combinations of matrix multiplications and other entrywise non-linear operations. We will start small and slowly build up a neural network, step by step.
A Neural Network with a Single Neuron. Recall the housing price prediction problem from before: given the size of the house, we want to predict the price. We will use it as a running example in this subsection.
Previously, we fit a straight line to the graph of size vs. housing price. Now, instead of fitting a straight line, we wish to prevent negative housing prices by setting the absolute minimum price as zero. This produces a "kink" in the graph as shown in Figure 1. How do we represent such a function with a single kink as $h_{\theta}(x)$ with unknown parameters? (After doing so, we can invoke the machinery in Section 1.)
We define a parameterized function $h_{\theta}(x)$ with input $x$, parameterized by $\theta$, which outputs the price of the house $y$. Formally, $h_{\theta}: x \rightarrow y$. Perhaps one of the simplest parametrizations would be
\begin{equation} h_{\theta}(x)=\max (w x+b, 0) \text {, where } \theta=(w, b) \in \mathbb{R}^{2} \tag{2.1} \end{equation}
Here $h_{\theta}(x)$ returns a single value: $(w x+b)$ or zero, whichever is greater. In the context of neural networks, the function $\max \{t, 0\}$ is called a ReLU (pronounced "ray-lu"), or rectified linear unit, and often denoted by $\operatorname{ReLU}(t) \triangleq \max \{t, 0\}$.
Generally, a one-dimensional non-linear function that maps $\mathbb{R}$ to $\mathbb{R}$ such as ReLU is often referred to as an activation function. The model $h_{\theta}(x)$ is said to have a single neuron partly because it has a single non-linear activation function. (We will discuss more about why a non-linear activation is called a neuron.)
When the input $x \in \mathbb{R}^{d}$ has multiple dimensions, a neural network with a single neuron can be written as
\begin{equation} h_{\theta}(x)=\operatorname{ReLU}\left(w^{\top} x+b\right) \text {, where } w \in \mathbb{R}^{d}, b \in \mathbb{R} \text {, and } \theta=(w, b) \tag{2.2} \end{equation}
The term $b$ is often referred to as the "bias", and the vector $w$ is referred to as the weight vector. Such a neural network has $1$ layer. (We will define what multiple layers mean in the sequel.)
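The single-neuron model of equation (2.2) is short enough to write out directly. The sketch below uses plain Python lists; the function names `relu` and `single_neuron` and the weight values are hypothetical, picked so that one input lands on each side of the kink.

```python
def relu(t):
    """ReLU(t) = max{t, 0}."""
    return max(t, 0.0)


def single_neuron(x, w, b):
    """Equation (2.2): ReLU(w^T x + b) for equal-length lists x and w."""
    return relu(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)


# Hypothetical weights for a 2-feature input.
w, b = [0.5, -1.0], 0.25

active = single_neuron([2.0, 0.5], w, b)    # 1.0 - 0.5 + 0.25 = 0.75, passed through
clipped = single_neuron([0.0, 1.0], w, b)   # -1.0 + 0.25 < 0, so the output is 0.0
```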
Stacking Neurons. A more complex neural network may take single neurons like the one described above and "stack" them together such that one neuron passes its output as input into the next neuron, resulting in a more complex function.
Let us now deepen the housing prediction example. In addition to the size of the house, suppose that you know the number of bedrooms, the zip code and the wealth of the neighborhood. Building neural networks is analogous to building with Lego bricks: you take individual bricks and stack them together to build complex structures. The same applies to neural networks: we take individual neurons and stack them together to create complex neural networks.
Given these features (size, number of bedrooms, zip code, and wealth), we might then decide that the price of the house depends on the maximum family size it can accommodate. Suppose the family size is a function of the size of the house and number of bedrooms (see Figure 2). The zip code may provide additional information such as how walkable the neighborhood is (i.e., can you walk to the grocery store or do you need to drive everywhere). Combining the zip code with the wealth of the neighborhood may predict the quality of the local elementary school. Given these three derived features (family size, walkable, school quality), we may conclude that the price of the home ultimately depends on these three features.
Figure 1: Housing prices with a "kink" in the graph.
Figure 2: Diagram of a small neural network for predicting housing prices.
Formally, the input to a neural network is a set of input features $x_{1}, x_{2}, x_{3}, x_{4}$. We denote the intermediate variables for "family size", "walkable", and "school quality" by $a_{1}, a_{2}, a_{3}$ (these $a_{i}$'s are often referred to as "hidden units" or "hidden neurons"). We represent each of the $a_{i}$'s as a neural network with a single neuron with a subset of $x_{1}, \ldots, x_{4}$ as inputs. Then as in Figure 2, we will have the parameterization:
\begin{equation*} \begin{aligned} &a_{1}=\operatorname{ReLU}\left(\theta_{1} x_{1}+\theta_{2} x_{2}+\theta_{3}\right) \\ &a_{2}=\operatorname{ReLU}\left(\theta_{4} x_{3}+\theta_{5}\right) \\ &a_{3}=\operatorname{ReLU}\left(\theta_{6} x_{3}+\theta_{7} x_{4}+\theta_{8}\right) \end{aligned} \end{equation*}
where $\left(\theta_{1}, \cdots, \theta_{8}\right)$ are parameters. Now we represent the final output $h_{\theta}(x)$ as another linear function with $a_{1}, a_{2}, a_{3}$ as inputs, and we get[3]
\begin{equation} h_{\theta}(x)=\theta_{9} a_{1}+\theta_{10} a_{2}+\theta_{11} a_{3}+\theta_{12} \tag{2.3} \end{equation}
where $\theta$ contains all the parameters $(\theta_{1},\ldots,\theta_{12})$.
Now we represent the output as a quite complex function of $x$ with parameters $\theta$. Then you can use this parametrization $h_{\theta}$ with the machinery of Section 1 to learn the parameters $\theta$.
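To make the hand-designed network concrete, here is a minimal sketch of the forward computation ending in equation (2.3). The function name `housing_price`, the 0-based storage of $\theta_{1},\ldots,\theta_{12}$ in a list, and every numeric value are assumptions made for illustration; in practice $\theta$ would be learned, not set by hand.

```python
def relu(t):
    return max(t, 0.0)


def housing_price(x, theta):
    """Forward pass of the small housing network.

    x     = (size, bedrooms, zip code, wealth), i.e. (x_1, x_2, x_3, x_4)
    theta = list of 12 numbers; theta[k] stores theta_{k+1} from the text.
    """
    x1, x2, x3, x4 = x
    a1 = relu(theta[0] * x1 + theta[1] * x2 + theta[2])   # "family size"
    a2 = relu(theta[3] * x3 + theta[4])                    # "walkable"
    a3 = relu(theta[5] * x3 + theta[6] * x4 + theta[7])    # "school quality"
    # Equation (2.3): a linear function of the three hidden units.
    return theta[8] * a1 + theta[9] * a2 + theta[10] * a3 + theta[11]


# Hypothetical parameters and input features.
theta = [1.0, 1.0, 0.0, 1.0, -5.0, 1.0, 1.0, 0.0, 2.0, 3.0, 4.0, 10.0]
x = (2.0, 3.0, 1.0, 4.0)

# a1 = 5, a2 = ReLU(-4) = 0, a3 = 5, so the output is 2*5 + 3*0 + 4*5 + 10 = 40.
price = housing_price(x, theta)
```

Note how each hidden unit reads only the subset of inputs chosen by hand, which is exactly the prior knowledge that a fully-connected parameterization avoids.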
Inspiration from Biological Neural Networks. As the name suggests, artificial neural networks were inspired by biological neural networks. The hidden units $a_{1}, \ldots, a_{m}$ correspond to the neurons in a biological neural network, and the parameters $\theta_{i}$'s correspond to the synapses. However, it's unclear how similar the modern deep artificial neural networks are to the biological ones. For example, perhaps not many neuroscientists think biological neural networks could have 1000 layers, while some modern artificial neural networks do (we will elaborate more on the notion of layers). Moreover, it's an open question whether human brains update their neural networks in a way similar to the way that computer scientists learn artificial neural networks (using backpropagation, which we will introduce in the next section).
Two-layer Fully-Connected Neural Networks. We constructed the neural network in equation (2.3) using a significant amount of prior knowledge/belief about how the "family size", "walkable", and "school quality" are determined by the inputs. We implicitly assumed that we know the family size is an important quantity to look at and that it can be determined by only the "size" and "# bedrooms". Such prior knowledge might not be available for other applications. It would be more flexible and general to have a generic parameterization. A simple way would be to write the intermediate variable $a_{1}$ as a function of all $x_{1}, \ldots, x_{4}$:
\begin{align} a_{1}&=\operatorname{ReLU}\left(w_{1}^{\top} x+b_{1}\right), \text { where } w_{1} \in \mathbb{R}^{4} \text { and } b_{1} \in \mathbb{R} \tag{2.4} \\ a_{2}&=\operatorname{ReLU}\left(w_{2}^{\top} x+b_{2}\right), \text { where } w_{2} \in \mathbb{R}^{4} \text { and } b_{2} \in \mathbb{R} \nonumber \\ a_{3}&=\operatorname{ReLU}\left(w_{3}^{\top} x+b_{3}\right), \text { where } w_{3} \in \mathbb{R}^{4} \text { and } b_{3} \in \mathbb{R} \nonumber \end{align}
We still define $h_{\theta}(x)$ using equation (2.3) with $a_{1}, a_{2}, a_{3}$ being defined as above. Thus we have a so-called fully-connected neural network as visualized in the dependency graph in Figure 3 because all the intermediate variables $a_{i}$'s depend on all the inputs $x_{i}$'s.
For full generality, a two-layer fully-connected neural network with $m$ hidden units and $d$-dimensional input $x \in \mathbb{R}^{d}$ is defined as
\begin{equation} \forall j \in[1, \ldots, m], \quad z_{j}=w_{j}^{[1]^{\top}} x+b_{j}^{[1]} \text { where } w_{j}^{[1]} \in \mathbb{R}^{d}, b_{j}^{[1]} \in \mathbb{R} \tag{2.5} \end{equation}
\begin{align} a_{j} &=\operatorname{ReLU}\left(z_{j}\right),\nonumber \\ a &=\left[a_{1}, \ldots, a_{m}\right]^{\top} \in \mathbb{R}^{m}\nonumber \\ h_{\theta}(x) &=w^{[2]^{\top}} a+b^{[2]} \text { where } w^{[2]} \in \mathbb{R}^{m}, b^{[2]} \in \mathbb{R}, \tag{2.6} \end{align}
Figure 3: Diagram of a two-layer fully connected neural network. Each edge from node $x_{i}$ to node $a_{j}$ indicates that $a_{j}$ depends on $x_{i}$. The edge from $x_{i}$ to $a_{j}$ is associated with the weight $\left(w_{j}^{[1]}\right)_{i}$ which denotes the $i$-th coordinate of the vector $w_{j}^{[1]}$. The activation $a_{j}$ can be computed by taking the ReLU of the weighted sum of $x_{i}$'s with the weights being the weights associated with the incoming edges, that is, $a_{j}=\operatorname{ReLU}\left(\sum_{i=1}^{d}\left(w_{j}^{[1]}\right)_{i} x_{i}\right)$.
Note that by default the vectors in $\mathbb{R}^{d}$ are viewed as column vectors, and in particular $a$ is a column vector with components $a_{1}, a_{2}, \ldots, a_{m}$. The indices ${ }^{[1]}$ and ${ }^{[2]}$ are used to distinguish two sets of parameters: the $w_{j}^{[1]}$'s (each of which is a vector in $\mathbb{R}^{d}$) and $w^{[2]}$ (which is a vector in $\mathbb{R}^{m}$). We will have more of these later.
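Equations (2.5) and (2.6) translate almost line by line into a loop over hidden units. The following is a deliberately unvectorized sketch in plain Python; the function name `two_layer_forward`, the list-of-lists layout for the first-layer weights, and the numbers are assumptions for illustration.

```python
def relu(t):
    return max(t, 0.0)


def two_layer_forward(x, w1, b1, w2, b2):
    """Two-layer fully-connected network, one hidden unit at a time.

    x  : list of d inputs
    w1 : list of m weight vectors w_j^{[1]}, each a list of length d
    b1 : list of m biases b_j^{[1]}
    w2 : list of m output weights (the vector w^{[2]})
    b2 : scalar output bias b^{[2]}
    """
    a = []
    for w_j, b_j in zip(w1, b1):
        z_j = sum(w_ji * x_i for w_ji, x_i in zip(w_j, x)) + b_j   # (2.5)
        a.append(relu(z_j))                                        # a_j = ReLU(z_j)
    return sum(w2_j * a_j for w2_j, a_j in zip(w2, a)) + b2        # (2.6)


# Hypothetical network with d = 2 inputs and m = 3 hidden units.
x = [1.0, 2.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [1.0, 1.0, 1.0]
b2 = 0.5

# Hidden activations are [1, 2, ReLU(-3) = 0], so the output is 1 + 2 + 0 + 0.5.
output = two_layer_forward(x, w1, b1, w2, b2)
```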
Vectorization. Before we introduce neural networks with more layers and more complex structures, we will simplify the expressions for neural networks with more matrix and vector notations. Another important motivation for vectorization is implementation speed. In order to implement a neural network efficiently, one must be careful when using for loops. The most natural way to implement equation (2.5) in code is perhaps to use a for loop. In practice, the dimensionalities of the inputs and hidden units are high. As a result, code will run very slowly if you use for loops. Leveraging the parallelism in GPUs has been crucial for the progress of deep learning.
This gave rise to vectorization. Instead of using for loops, vectorization takes advantage of matrix algebra and highly optimized numerical linear algebra packages (e.g., BLAS) to make neural network computations run quickly. Before the deep learning era, a for loop may have been sufficient on smaller datasets, but modern deep networks and state-of-the-art datasets will be infeasible to run with for loops.
We vectorize the two-layer fully-connected neural network as below. We define a weight matrix $W^{[1]}$ in $\mathbb{R}^{m \times d}$ as the concatenation of all the vectors $w_{j}^{[1]}$'s in the following way:
\begin{equation} W^{[1]}=\left[\begin{array}{c} -w_{1}^{[1]^{\top}}- \\ -w_{2}^{[1]^{\top}}- \\ \vdots \\ -w_{m}^{[1]^{\top}}- \end{array}\right] \in \mathbb{R}^{m \times d} \tag{2.7} \end{equation}
Now by the definition of matrix vector multiplication, we can write $z=\left[z_{1}, \ldots, z_{m}\right]^{\top} \in \mathbb{R}^{m}$ as
\begin{equation} \underbrace{\left[\begin{array}{c} z_{1} \\ \vdots \\ \vdots \\ z_{m} \end{array}\right]}_{z \in \mathbb{R}^{m \times 1}}=\underbrace{\left[\begin{array}{c} -w_{1}^{[1]^{\top}}- \\ -w_{2}^{[1]^{\top}}- \\ \vdots \\ -w_{m}^{[1]^{\top}}- \end{array}\right]}_{W^{[1]} \in \mathbb{R}^{m \times d}} \underbrace{\left[\begin{array}{c} x_{1} \\ x_{2} \\ \vdots \\ x_{d} \end{array}\right]}_{x \in \mathbb{R}^{d \times 1}}+\underbrace{\left[\begin{array}{c} b_{1}^{[1]} \\ b_{2}^{[1]} \\ \vdots \\ b_{m}^{[1]} \end{array}\right]}_{b^{[1]} \in \mathbb{R}^{m \times 1}} \tag{2.8} \end{equation}
Or succinctly,
\begin{equation} z=W^{[1]} x+b^{[1]} \tag{2.9} \end{equation}
We remark again that a vector in $\mathbb{R}^{d}$ in these notes, following the conventions previously established, is automatically viewed as a column vector, and can also be viewed as a $d \times 1$ dimensional matrix. (Note that this is different from numpy where a vector is viewed as a row vector in broadcasting.)
Computing the activations $a \in \mathbb{R}^{m}$ from $z \in \mathbb{R}^{m}$ involves an elementwise non-linear application of the ReLU function, which can be computed in parallel efficiently. Overloading ReLU for element-wise application of ReLU (meaning, for a vector $t \in \mathbb{R}^{d}$, $\operatorname{ReLU}(t)$ is a vector such that $\operatorname{ReLU}(t)_{i}=\operatorname{ReLU}\left(t_{i}\right)$), we have
\begin{equation} a=\operatorname{ReLU}(z) \tag{2.10} \end{equation}
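The vectorized expressions (2.9) and (2.10) can be compared against the loop form of (2.5) with a short NumPy sketch. It assumes NumPy is installed; the variable names (`W1`, `b1`, `z_loop`) and the random values are hypothetical. All vectors are stored as explicit column arrays of shape `(k, 1)` to match the column-vector convention of these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 3, 4
W1 = rng.standard_normal((m, d))   # row j holds w_j^{[1]} transposed, as in (2.7)
b1 = rng.standard_normal((m, 1))   # column vector of biases, shape (m, 1)
x = rng.standard_normal((d, 1))    # column vector of inputs, shape (d, 1)

# Loop version of (2.5): one dot product per hidden unit.
z_loop = np.zeros((m, 1))
for j in range(m):
    z_loop[j, 0] = W1[j, :] @ x[:, 0] + b1[j, 0]

# Vectorized version: a single matrix-vector product, as in (2.9).
z = W1 @ x + b1

# Elementwise ReLU, as in (2.10).
a = np.maximum(z, 0.0)
```

Keeping the shapes explicit matters in NumPy: adding a 1-D array of shape `(m,)` to an `(m, 1)` array broadcasts to an `(m, m)` result, which is almost never what is intended here.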
Define $W^{[2]}=\left[w^{[2]^{\top}}\right] \in \mathbb{R}^{1 \times m}$ similarly. Then, the model in equation (2.6) can be summarized as
\begin{align} a &=\operatorname{ReLU}\left(W^{[1]} x+b^{[1]}\right)\nonumber \\ h_{\theta}(x) &=W^{[2]} a+b^{[2]} \tag{2.11} \end{align}
Here $\theta$ consists of $W^{[1]}, W^{[2]}$ (often referred to as the weight matrices) and $b^{[1]}, b^{[2]}$ (referred to as the biases). The collection of $W^{[1]}, b^{[1]}$