You can read the notes from the previous lecture from Andrew Ng's CS229 course on Generative Learning Algorithms here.
1. Kernel Methods
1.1. Feature maps
Recall that in our discussion about linear regression, we considered the problem of predicting the price of a house (denoted by $y$) from the living area of the house (denoted by $x$), and we fit a linear function of $x$ to the training data. What if the price $y$ can be more accurately represented as a non-linear function of $x$? In this case, we need a more expressive family of models than linear models.
We start by considering fitting cubic functions $y=\theta_{3} x^{3}+\theta_{2} x^{2}+\theta_{1} x+\theta_{0}$. It turns out that we can view the cubic function as a linear function over a different set of feature variables (defined below). Concretely, let the function $\phi: \mathbb{R} \rightarrow \mathbb{R}^{4}$ be defined as
\begin{equation}
\phi(x)=\begin{bmatrix}1 \\ x \\ x^{2} \\ x^{3}\end{bmatrix} \tag{1}
\end{equation}
Let $\theta \in \mathbb{R}^{4}$ be the vector containing $\theta_{0}, \theta_{1}, \theta_{2}, \theta_{3}$ as entries. Then we can rewrite the cubic function in $x$ as:
\begin{equation}
\theta_{3} x^{3}+\theta_{2} x^{2}+\theta_{1} x+\theta_{0}=\theta^{T} \phi(x)
\end{equation}
Thus, a cubic function of the variable $x$ can be viewed as a linear function over the variables $\phi(x)$. To distinguish between these two sets of variables, in the context of kernel methods, we will call the "original" input value the input attributes of a problem (in this case, $x$, the living area). When the original input is mapped to some new set of quantities $\phi(x)$, we will call those new quantities the feature variables. (Unfortunately, different authors use different terms to describe these two things in different contexts.) We will call $\phi$ a feature map, which maps the attributes to the features.
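As a quick illustration (a sketch, not part of the original notes), the feature map above turns cubic fitting into ordinary linear least squares in the features; the coefficients and data below are made up for the example:

```python
import numpy as np

def phi(x):
    """Feature map phi: R -> R^4, mapping the attribute x to [1, x, x^2, x^3]."""
    return np.array([1.0, x, x**2, x**3])

# Hypothetical training data generated from a known cubic (coefficients are
# illustrative): y = 2 - x + 0.5 x^2 + 3 x^3.
theta_true = np.array([2.0, -1.0, 0.5, 3.0])
xs = np.linspace(-1, 1, 20)
Phi = np.stack([phi(x) for x in xs])   # design matrix of features
ys = Phi @ theta_true

# Fitting theta^T phi(x) is just linear least squares in the features.
theta_hat, *_ = np.linalg.lstsq(Phi, ys, rcond=None)
print(theta_hat)   # recovers [2, -1, 0.5, 3] up to numerical error
```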
1.2. LMS (least mean squares) with features
We will derive the gradient descent algorithm for fitting the model $\theta^{T} \phi(x)$. First recall that for the ordinary least squares problem where we were to fit $\theta^{T} x$, the batch gradient descent update is (see the first lecture note for its derivation):
\begin{equation}
\theta:=\theta+\alpha \sum_{i=1}^{n}\left(y^{(i)}-\theta^{T} x^{(i)}\right) x^{(i)} \tag{2}
\end{equation}
Let $\phi: \mathbb{R}^{d} \rightarrow \mathbb{R}^{p}$ be a feature map that maps attribute $x$ (in $\mathbb{R}^{d}$) to the features $\phi(x)$ in $\mathbb{R}^{p}$. (In the motivating example in the previous subsection, we have $d=1$ and $p=4$.) Now our goal is to fit the function $\theta^{T} \phi(x)$, with $\theta$ being a vector in $\mathbb{R}^{p}$ instead of $\mathbb{R}^{d}$. We can replace all the occurrences of $x^{(i)}$ in the algorithm above by $\phi(x^{(i)})$ to obtain the new update:
\begin{equation}
\theta:=\theta+\alpha \sum_{i=1}^{n}\left(y^{(i)}-\theta^{T} \phi(x^{(i)})\right) \phi(x^{(i)}) \tag{3}
\end{equation}
Similarly, the corresponding stochastic gradient descent update rule is
\begin{equation}
\theta:=\theta+\alpha\left(y^{(i)}-\theta^{T} \phi(x^{(i)})\right) \phi(x^{(i)}) \tag{4}
\end{equation}
1.3. LMS with the kernel trick

The gradient descent update, or stochastic gradient update above, becomes computationally expensive when the features $\phi(x)$ are high-dimensional. For example, consider the direct extension of the feature map in equation (1) to high-dimensional input $x$: suppose $x \in \mathbb{R}^{d}$, and let $\phi(x)$ be the vector that contains all the monomials of $x$ with degree $\leq 3$:
\begin{equation}
\phi(x)=\left[1, x_{1}, x_{2}, \ldots, x_{1}^{2}, x_{1} x_{2}, x_{1} x_{3}, \ldots, x_{1}^{3}, x_{1}^{2} x_{2}, \ldots\right]^{T} \tag{5}
\end{equation}
The dimension of the features $\phi(x)$ is on the order of $d^{3}$.[1] This is a prohibitively long vector for computational purposes: when $d=1000$, each update requires at least computing and storing a $1000^{3}=10^{9}$ dimensional vector, which is $10^{6}$ times slower than the update rule for ordinary least squares (2).
It may appear at first that such $d^{3}$ runtime per update and memory usage are inevitable, because the vector $\theta$ itself is of dimension $p \approx d^{3}$, and we may need to update every entry of $\theta$ and store it. However, we will introduce the kernel trick, with which we will not need to store $\theta$ explicitly, and the runtime can be significantly improved.
For simplicity, we assume that we initialize $\theta=0$, and we focus on the iterative update (3). The main observation is that at any time, $\theta$ can be represented as a linear combination of the vectors $\phi(x^{(1)}), \ldots, \phi(x^{(n)})$. Indeed, we can show this inductively as follows. At initialization, $\theta=0=\sum_{i=1}^{n} 0 \cdot \phi(x^{(i)})$. Assume at some point, $\theta$ can be represented as
\begin{equation}
\theta=\sum_{i=1}^{n} \beta_{i} \phi(x^{(i)})
\end{equation}
for some $\beta_{1}, \ldots, \beta_{n} \in \mathbb{R}$. Then we claim that in the next round, $\theta$ is still a linear combination of $\phi(x^{(1)}), \ldots, \phi(x^{(n)})$, because
\begin{align}
\theta &:=\theta+\alpha \sum_{i=1}^{n}\left(y^{(i)}-\theta^{T} \phi(x^{(i)})\right) \phi(x^{(i)})\nonumber \\
&=\sum_{i=1}^{n} \beta_{i} \phi(x^{(i)})+\alpha \sum_{i=1}^{n}\left(y^{(i)}-\theta^{T} \phi(x^{(i)})\right) \phi(x^{(i)})\nonumber \\
&=\sum_{i=1}^{n}\left(\beta_{i}+\alpha\left(y^{(i)}-\theta^{T} \phi(x^{(i)})\right)\right) \phi(x^{(i)})\nonumber
\end{align}
You may realize that our general strategy is to implicitly represent the $p$-dimensional vector $\theta$ by a set of coefficients $\beta_{1}, \ldots, \beta_{n}$. Towards doing this, we derive the update rule of the coefficients $\beta_{1}, \ldots, \beta_{n}$. Using the equation above, we see that the new $\beta_{i}$ depends on the old one via
\begin{equation}
\beta_{i}:=\beta_{i}+\alpha\left(y^{(i)}-\theta^{T} \phi(x^{(i)})\right)
\end{equation}
Here we still have the old $\theta$ on the RHS of the equation. Replacing $\theta$ by $\theta=\sum_{j=1}^{n} \beta_{j} \phi(x^{(j)})$ gives
\begin{equation}
\forall i \in\{1, \ldots, n\}, \quad \beta_{i}:=\beta_{i}+\alpha\left(y^{(i)}-\sum_{j=1}^{n} \beta_{j} \phi(x^{(j)})^{T} \phi(x^{(i)})\right)
\end{equation}
We often rewrite $\phi(x^{(j)})^{T} \phi(x^{(i)})$ as $\langle\phi(x^{(j)}), \phi(x^{(i)})\rangle$ to emphasize that it's the inner product of the two feature vectors. Viewing the $\beta_{i}$'s as the new representation of $\theta$, we have successfully translated the batch gradient descent algorithm into an algorithm that updates the values of $\beta$ iteratively. It may appear that at every iteration, we still need to compute the values of $\langle\phi(x^{(j)}), \phi(x^{(i)})\rangle$ for all pairs of $i, j$, each of which may take roughly $O(p)$ operations. However, two important properties come to the rescue:
1. We can pre-compute the pairwise inner products $\langle\phi(x^{(j)}), \phi(x^{(i)})\rangle$ for all pairs of $i, j$ before the loop starts.
2. For the feature map $\phi$ defined in (5) (or many other interesting feature maps), computing $\langle\phi(x^{(j)}), \phi(x^{(i)})\rangle$ can be efficient and does not necessarily require computing $\phi(x^{(i)})$ explicitly. This is because:
\begin{align}
\langle\phi(x), \phi(z)\rangle &=1+\sum_{i=1}^{d} x_{i} z_{i}+\sum_{i, j \in\{1, \ldots, d\}} x_{i} x_{j} z_{i} z_{j}+\sum_{i, j, k \in\{1, \ldots, d\}} x_{i} x_{j} x_{k} z_{i} z_{j} z_{k}\nonumber \\
&=1+\sum_{i=1}^{d} x_{i} z_{i}+\left(\sum_{i=1}^{d} x_{i} z_{i}\right)^{2}+\left(\sum_{i=1}^{d} x_{i} z_{i}\right)^{3}\nonumber \\
&=1+\langle x, z\rangle+\langle x, z\rangle^{2}+\langle x, z\rangle^{3} \tag{9}
\end{align}
Therefore, to compute $\langle\phi(x), \phi(z)\rangle$, we can first compute $\langle x, z\rangle$ with $O(d)$ time and then take another constant number of operations to compute $1+\langle x, z\rangle+\langle x, z\rangle^{2}+\langle x, z\rangle^{3}$.
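As a sanity check (a sketch, not part of the original notes), we can verify the identity above numerically, comparing the explicit $O(d^{3})$-dimensional inner product against the $O(d)$ shortcut:

```python
import itertools
import random

def phi(x):
    """Explicit degree-<=3 monomial features of x, with one coordinate per
    ordered index tuple (i), (i, j), (i, j, k), matching the sums above."""
    d = len(x)
    feats = [1.0]
    feats += [x[i] for i in range(d)]
    feats += [x[i] * x[j] for i, j in itertools.product(range(d), repeat=2)]
    feats += [x[i] * x[j] * x[k]
              for i, j, k in itertools.product(range(d), repeat=3)]
    return feats

def kernel(x, z):
    """The O(d) shortcut: 1 + <x,z> + <x,z>^2 + <x,z>^3."""
    s = sum(xi * zi for xi, zi in zip(x, z))
    return 1 + s + s**2 + s**3

random.seed(0)
d = 4
x = [random.uniform(-1, 1) for _ in range(d)]
z = [random.uniform(-1, 1) for _ in range(d)]

explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # O(d^3) work
fast = kernel(x, z)                                    # O(d) work
assert abs(explicit - fast) < 1e-9
```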
As you will see, the inner products between the features $\langle\phi(x), \phi(z)\rangle$ are essential here. We define the kernel corresponding to the feature map $\phi$ as a function $K: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ satisfying:[2]
\begin{equation}
K(x, z) \triangleq\langle\phi(x), \phi(z)\rangle
\end{equation}
To wrap up the discussion, we write down the final algorithm as follows:
1. Compute all the values $K(x^{(i)}, x^{(j)}) \triangleq\langle\phi(x^{(i)}), \phi(x^{(j)})\rangle$ using equation (9) for all $i, j \in\{1, \ldots, n\}$. Set $\beta:=0$.
2. Loop:
\begin{equation}
\forall i \in\{1, \ldots, n\}, \quad \beta_{i}:=\beta_{i}+\alpha\left(y^{(i)}-\sum_{j=1}^{n} \beta_{j} K\left(x^{(i)}, x^{(j)}\right)\right) \tag{11}
\end{equation}
Or in vector notation, letting $K$ be the $n \times n$ matrix with $K_{i j}=K(x^{(i)}, x^{(j)})$, we have
\begin{equation}
\beta:=\beta+\alpha(\vec{y}-K \beta)
\end{equation}
With the algorithm above, we can update the representation $\beta$ of the vector $\theta$ efficiently with $O(n)$ time per update. Finally, we need to show that the knowledge of the representation $\beta$ suffices to compute the prediction $\theta^{T} \phi(x)$. Indeed, we have
\begin{equation}
\theta^{T} \phi(x)=\sum_{i=1}^{n} \beta_{i} \phi(x^{(i)})^{T} \phi(x)=\sum_{i=1}^{n} \beta_{i} K(x^{(i)}, x) \tag{12}
\end{equation}
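The two steps above can be sketched end to end. This is an illustrative implementation of kernelized LMS for a 1-D input (the data, step size, and iteration count are assumptions chosen so the vectorized update converges, not values from the notes):

```python
import numpy as np

def kernel(x, z):
    """Kernel for the degree-<=3 monomial features: 1 + <x,z> + <x,z>^2 + <x,z>^3."""
    s = np.dot(x, z)
    return 1 + s + s**2 + s**3

# Illustrative 1-D training data from a cubic target (values are made up).
X = np.array([[-1.0], [-0.5], [0.0], [0.5], [1.0]])
y = X[:, 0]**3 - X[:, 0]

# Step 1: pre-compute the n x n kernel matrix K_ij = K(x^(i), x^(j)).
n = len(X)
K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# Step 2: iterate the vectorized update beta := beta + alpha (y - K beta).
# alpha is chosen small enough for convergence (an assumption).
beta = np.zeros(n)
alpha = 0.01
for _ in range(5000):
    beta = beta + alpha * (y - K @ beta)

def predict(x):
    """Prediction theta^T phi(x) = sum_i beta_i K(x^(i), x), without forming theta."""
    return sum(b * kernel(xi, x) for b, xi in zip(beta, X))

print(predict(np.array([0.25])))  # close to 0.25^3 - 0.25 = -0.234375
```

Note that neither training nor prediction ever materializes the feature vectors $\phi(x)$ themselves.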
You may realize that fundamentally all we need to know about the feature map $\phi(\cdot)$ is encapsulated in the corresponding kernel function $K(\cdot, \cdot)$. We will expand on this in the next section.
1.4. Properties of kernels
In the last subsection, we started with an explicitly defined feature map $\phi$, which induces the kernel function $K(x, z) \triangleq\langle\phi(x), \phi(z)\rangle$. Then we saw that the kernel function is so intrinsic that, as long as the kernel function is defined, the whole training algorithm can be written entirely in the language of the kernel without referring to the feature map $\phi$, and so can the prediction of a test example $x$ (equation (12)).
Therefore, it is tempting to define other kernel functions $K(\cdot, \cdot)$ and run the algorithm (11). Note that the algorithm (11) does not need to explicitly access the feature map $\phi$, and therefore we only need to ensure the existence of the feature map $\phi$, but do not necessarily need to be able to explicitly write $\phi$ down.
What kinds of functions $K(\cdot, \cdot)$ can correspond to some feature map $\phi$? In other words, can we tell if there is some feature map $\phi$ so that $K(x, z)=\phi(x)^{T} \phi(z)$ for all $x, z$?
If we can answer this question by giving a precise characterization of valid kernel functions, then we can completely change the interface of selecting feature maps $\phi$ to the interface of selecting kernel functions $K$. Concretely, we can pick a function $K$, verify that it satisfies the characterization (so that there exists a feature map $\phi$ that $K$ corresponds to), and then we can run the update rule (11). The benefit here is that we don't have to be able to compute $\phi$ or write it down analytically; we only need to know its existence. We will answer this question at the end of this subsection, after we go through several concrete examples of kernels.
Suppose $x, z \in \mathbb{R}^{d}$, and let's first consider the function $K(\cdot, \cdot)$ defined as:
\begin{equation}
K(x, z)=\left(x^{T} z\right)^{2}
\end{equation}
We can also write this as
\begin{align}
K(x, z) &=\left(\sum_{i=1}^{d} x_{i} z_{i}\right)\left(\sum_{j=1}^{d} x_{j} z_{j}\right)\nonumber \\
&=\sum_{i=1}^{d} \sum_{j=1}^{d} x_{i} x_{j} z_{i} z_{j}\nonumber \\
&=\sum_{i, j=1}^{d}\left(x_{i} x_{j}\right)\left(z_{i} z_{j}\right)\nonumber
\end{align}
Thus, we see that $K(x, z)=\langle\phi(x), \phi(z)\rangle$ is the kernel function that corresponds to the feature map $\phi$ given (shown here for the case of $d=3$) by
\begin{equation}
\phi(x)=\left[x_{1} x_{1}, x_{1} x_{2}, x_{1} x_{3}, x_{2} x_{1}, x_{2} x_{2}, x_{2} x_{3}, x_{3} x_{1}, x_{3} x_{2}, x_{3} x_{3}\right]^{T}
\end{equation}
Revisiting the computational efficiency perspective of kernels, note that whereas calculating the high-dimensional $\phi(x)$ requires $O(d^{2})$ time, finding $K(x, z)$ takes only $O(d)$ time, linear in the dimension of the input attributes.
For another related example, also consider $K(\cdot, \cdot)$ defined by
\begin{equation}
K(x, z)=\left(x^{T} z+c\right)^{2}=\sum_{i, j=1}^{d}\left(x_{i} x_{j}\right)\left(z_{i} z_{j}\right)+\sum_{i=1}^{d}\left(\sqrt{2 c}\, x_{i}\right)\left(\sqrt{2 c}\, z_{i}\right)+c^{2}
\end{equation}
This corresponds to a feature map whose entries are the $x_{i} x_{j}$'s, the $\sqrt{2 c}\, x_{i}$'s, and the constant $c$,
and the parameter $c$ controls the relative weighting between the $x_{i}$ (first order) and the $x_{i} x_{j}$ (second order) terms.
More broadly, the kernel $K(x, z)=\left(x^{T} z+c\right)^{k}$ corresponds to a feature mapping to a $\binom{d+k}{k}$-dimensional feature space, consisting of all monomials of the form $x_{i_{1}} x_{i_{2}} \cdots x_{i_{k}}$ that are up to order $k$. However, despite working in this $O(d^{k})$-dimensional space, computing $K(x, z)$ still takes only $O(d)$ time, and hence we never need to explicitly represent feature vectors in this very high dimensional feature space.
Kernels as similarity metrics. Now, let's talk about a slightly different view of kernels. Intuitively (and there are things wrong with this intuition, but never mind), if $\phi(x)$ and $\phi(z)$ are close together, then we might expect $K(x, z)=\phi(x)^{T} \phi(z)$ to be large. Conversely, if $\phi(x)$ and $\phi(z)$ are far apart, say, nearly orthogonal to each other, then $K(x, z)=\phi(x)^{T} \phi(z)$ will be small. So, we can think of $K(x, z)$ as some measurement of how similar $\phi(x)$ and $\phi(z)$ are, or of how similar $x$ and $z$ are.
Given this intuition, suppose that for some learning problem that you're working on, you've come up with some function $K(x, z)$ that you think might be a reasonable measure of how similar $x$ and $z$ are. For instance, perhaps you chose
\begin{equation}
K(x, z)=\exp \left(-\frac{\|x-z\|^{2}}{2 \sigma^{2}}\right)
\end{equation}
This is a reasonable measure of $x$ and $z$'s similarity, and is close to $1$ when $x$ and $z$ are close, and near $0$ when $x$ and $z$ are far apart. Does there exist a feature map $\phi$ such that the kernel $K$ defined above satisfies $K(x, z)=\phi(x)^{T} \phi(z)$? In this particular example, the answer is yes. This kernel is called the Gaussian kernel, and corresponds to an infinite-dimensional feature map $\phi$. We will give a precise characterization of what properties a function $K$ needs to satisfy so that it can be a valid kernel function that corresponds to some feature map $\phi$.
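The similarity behavior is easy to see numerically; this sketch (with made-up points, not from the notes) evaluates the Gaussian kernel on a pair of nearby points and a pair of distant points:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z)**2) / (2 * sigma**2))

x = np.array([1.0, 2.0])
close = np.array([1.1, 2.0])   # nearby point -> kernel value near 1
far = np.array([8.0, -5.0])    # distant point -> kernel value near 0

print(gaussian_kernel(x, x))      # exactly 1.0: a point is maximally similar to itself
print(gaussian_kernel(x, close))  # near 1
print(gaussian_kernel(x, far))    # near 0
```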
Necessary conditions for valid kernels. Suppose for now that $K$ is indeed a valid kernel corresponding to some feature map $\phi$, and we will first see what properties it satisfies. Now, consider some finite set of $n$ points (not necessarily the training set) $\{x^{(1)}, \ldots, x^{(n)}\}$, and let a square, $n$-by-$n$ matrix $K$ be defined so that its $(i, j)$-entry is given by $K_{i j}=K(x^{(i)}, x^{(j)})$. This matrix is called the kernel matrix. Note that we've overloaded the notation and used $K$ to denote both the kernel function $K(x, z)$ and the kernel matrix $K$, due to their obvious close relationship.
Now, if $K$ is a valid kernel, then $K_{i j}=K(x^{(i)}, x^{(j)})=\phi(x^{(i)})^{T} \phi(x^{(j)})=\phi(x^{(j)})^{T} \phi(x^{(i)})=K(x^{(j)}, x^{(i)})=K_{j i}$, and hence $K$ must be symmetric. Moreover, letting $\phi_{k}(x)$ denote the $k$-th coordinate of the vector $\phi(x)$, we find that for any vector $z$, we have
\begin{align}
z^{T} K z &=\sum_{i} \sum_{j} z_{i} K_{i j} z_{j}\nonumber \\
&=\sum_{i} \sum_{j} z_{i} \phi(x^{(i)})^{T} \phi(x^{(j)}) z_{j}\nonumber \\
&=\sum_{i} \sum_{j} z_{i} \sum_{k} \phi_{k}(x^{(i)}) \phi_{k}(x^{(j)}) z_{j}\nonumber \\
&=\sum_{k} \sum_{i} \sum_{j} z_{i} \phi_{k}(x^{(i)}) \phi_{k}(x^{(j)}) z_{j}\nonumber \\
&=\sum_{k}\left(\sum_{i} z_{i} \phi_{k}(x^{(i)})\right)^{2}\nonumber \\
& \geq 0\nonumber
\end{align}
The second-to-last step uses the fact that $\sum_{i, j} a_{i} a_{j}=\left(\sum_{i} a_{i}\right)^{2}$ for $a_{i}=z_{i} \phi_{k}(x^{(i)})$. Since $z$ was arbitrary, this shows that $K$ is positive semi-definite $(K \geq 0)$.
Hence, we've shown that if $K$ is a valid kernel (i.e., if it corresponds to some feature map $\phi$), then the corresponding kernel matrix $K \in \mathbb{R}^{n \times n}$ is symmetric positive semidefinite.
Sufficient conditions for valid kernels. More generally, the condition above turns out to be not only a necessary, but also a sufficient, condition for $K$ to be a valid kernel (also called a Mercer kernel). The following result is due to Mercer.[3]
Theorem (Mercer). Let $K: \mathbb{R}^{d} \times \mathbb{R}^{d} \mapsto \mathbb{R}$ be given. Then for $K$ to be a valid (Mercer) kernel, it is necessary and sufficient that for any $\{x^{(1)}, \ldots, x^{(n)}\}$, $(n<\infty)$, the corresponding kernel matrix is symmetric positive semi-definite.
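In practice, Mercer's condition can be spot-checked numerically on a sample of points by building the kernel matrix and inspecting its eigenvalues. A sketch (the Gaussian kernel and the sample points below are illustrative choices, not from the notes):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel, a valid (Mercer) kernel."""
    return np.exp(-np.sum((x - z)**2) / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # 10 arbitrary points in R^3

# Build the kernel matrix K_ij = K(x^(i), x^(j)).
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

# Mercer's condition: the matrix must be symmetric and positive semi-definite
# (all eigenvalues >= 0, up to floating-point tolerance).
assert np.allclose(K, K.T)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)  # True for a valid kernel
```

A check like this can only refute validity (a negative eigenvalue on some sample proves $K$ is not a kernel); passing on finitely many samples does not prove validity.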
Given a function $K$, apart from trying to find a feature map $\phi$ that corresponds to it, this theorem therefore gives another way of testing if it is a valid kernel. You'll also have a chance to play with these ideas more in problem set 2.
In class, we also briefly talked about a couple of other examples of kernels. For instance, consider the digit recognition problem, in which given an image ($16 \times 16$ pixels) of a handwritten digit ($0$-$9$), we have to figure out which digit it was. Using either a simple polynomial kernel $K(x, z)=\left(x^{T} z\right)^{k}$ or the Gaussian kernel, SVMs were able to obtain extremely good performance on this problem. This was particularly surprising since the input attributes $x$ were just 256-dimensional vectors of the image pixel intensity values, and the system had no prior knowledge about vision, or even about which pixels are adjacent to which other ones.

Another example that we briefly talked about in lecture was that if the objects $x$ that we are trying to classify are strings (say, $x$ is a list of amino acids, which strung together form a protein), then it seems hard to construct a reasonable, "small" set of features for most learning algorithms, especially if different strings have different lengths. However, consider letting $\phi(x)$ be a feature vector that counts the number of occurrences of each length-$k$ substring in $x$. If we're considering strings of English letters, then there are $26^{k}$ such strings. Hence, $\phi(x)$ is a $26^{k}$-dimensional vector; even for moderate values of $k$, this is probably too big for us to efficiently work with. (E.g., $26^{4} \approx 460000$.) However, using (dynamic programming-ish) string matching algorithms, it is possible to efficiently compute $K(x, z)=\phi(x)^{T} \phi(z)$, so that we can now implicitly work in this $26^{k}$-dimensional feature space, but without ever explicitly computing feature vectors in this space.
Application of kernel methods: We've seen the application of kernels to linear regression. In the next part, we will introduce support vector machines, to which kernels can be directly applied. In fact, the idea of kernels has significantly broader applicability than linear regression and SVMs. Specifically, if you have any learning algorithm that you can write in terms of only inner products $\langle x, z\rangle$ between input attribute vectors, then by replacing these with $K(x, z)$ where $K$ is a kernel, you can "magically" allow your algorithm to work efficiently in the high dimensional feature space corresponding to $K$. For instance, this kernel trick can be applied with the perceptron to derive a kernel perceptron algorithm. Many of the algorithms that we'll see later in this class will also be amenable to this method, which has come to be known as the "kernel trick."
Support Vector Machines
This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe are indeed the best) "off-the-shelf" supervised learning algorithms. To tell the SVM story, we'll need to first talk about margins and the idea of separating data with a large "gap." Next, we'll talk about the optimal margin classifier, which will lead us into a digression on Lagrange duality. We'll also see kernels, which give a way to apply SVMs efficiently in very high dimensional (such as infinite-dimensional) feature spaces, and finally, we'll close off the story with the SMO algorithm, which gives an efficient implementation of SVMs.
2. Margins: Intuition
We'll start our story on SVMs by talking about margins. This section will give the intuitions about margins and about the "confidence" of our predictions; these ideas will be made formal in Section 4.
Consider logistic regression, where the probability $p(y=1 \mid x ; \theta)$ is modeled by $h_{\theta}(x)=g(\theta^{T} x)$. We then predict "$1$" on an input $x$ if and only if $h_{\theta}(x) \geq 0.5$, or equivalently, if and only if $\theta^{T} x \geq 0$. Consider a positive training example $(y=1)$. The larger $\theta^{T} x$ is, the larger also is $h_{\theta}(x)=p(y=1 \mid x ; \theta)$, and thus also the higher our degree of "confidence" that the label is $1$. Thus, informally we can think of our prediction as being very confident that $y=1$ if $\theta^{T} x \gg 0$. Similarly, we think of logistic regression as confidently predicting $y=0$ if $\theta^{T} x \ll 0$. Given a training set, again informally it seems that we'd have found a good fit to the training data if we can find $\theta$ so that $\theta^{T} x^{(i)} \gg 0$ whenever $y^{(i)}=1$, and $\theta^{T} x^{(i)} \ll 0$ whenever $y^{(i)}=0$, since this would reflect a very confident (and correct) set of classifications for all the training examples. This seems to be a nice goal to aim for, and we'll soon formalize this idea using the notion of functional margins.
For a different type of intuition, consider the following figure, in which x's represent positive training examples, o's denote negative training examples, a decision boundary (this is the line given by the equation $\theta^{T} x=0$, and is also called the separating hyperplane) is also shown, and three points have also been labeled A, B, and C.
Notice that the point A is very far from the decision boundary. If we are asked to make a prediction for the value of $y$ at A, it seems we should be quite confident that $y=1$ there. Conversely, the point C is very close to the decision boundary, and while it's on the side of the decision boundary on which we would predict $y=1$, it seems likely that just a small change to the decision boundary could easily have caused our prediction to be $y=0$. Hence, we're much more confident about our prediction at A than at C. The point B lies in between these two cases, and more broadly, we see that if a point is far from the separating hyperplane, then we may be significantly more confident in our predictions. Again, informally we think it would be nice if, given a training set, we manage to find a decision boundary that allows us to make all correct and confident (meaning far from the decision boundary) predictions on the training examples. We'll formalize this later using the notion of geometric margins.
3. Notation
To make our discussion of SVMs easier, we'll first need to introduce a new notation for talking about classification. We will be considering a linear classifier for a binary classification problem with labels $y$ and features $x$. From now on, we'll use $y \in\{-1,1\}$ (instead of $\{0,1\}$) to denote the class labels. Also, rather than parameterizing our linear classifier with the vector $\theta$, we will use parameters $w, b$, and write our classifier as
\begin{equation}
h_{w, b}(x)=g\left(w^{T} x+b\right)
\end{equation}
Here, $g(z)=1$ if $z \geq 0$, and $g(z)=-1$ otherwise. This "$w, b$" notation allows us to explicitly treat the intercept term $b$ separately from the other parameters. (We also drop the convention we had previously of letting $x_{0}=1$ be an extra coordinate in the input feature vector.) Thus, $b$ takes the role of what was previously $\theta_{0}$, and $w$ takes the role of $\left[\theta_{1} \ldots \theta_{d}\right]^{T}$.
Note also that, from our definition of $g$ above, our classifier will directly predict either $1$ or $-1$ (cf. the perceptron algorithm), without first going through the intermediate step of estimating $p(y=1)$ (which is what logistic regression does).
4. Functional and geometric margins
Let's formalize the notions of the functional and geometric margins. Given a training example $(x^{(i)}, y^{(i)})$, we define the functional margin of $(w, b)$ with respect to the training example as
\begin{equation}
\hat{\gamma}^{(i)}=y^{(i)}\left(w^{T} x^{(i)}+b\right)
\end{equation}
Note that if $y^{(i)}=1$, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need $w^{T} x^{(i)}+b$ to be a large positive number. Conversely, if $y^{(i)}=-1$, then for the functional margin to be large, we need $w^{T} x^{(i)}+b$ to be a large negative number. Moreover, if $y^{(i)}(w^{T} x^{(i)}+b)>0$, then our prediction on this example is correct. (Check this yourself.) Hence, a large functional margin represents a confident and correct prediction.
For a linear classifier with the choice of $g$ given above (taking values in $\{-1,1\}$), there's one property of the functional margin that makes it not a very good measure of confidence, however. Given our choice of $g$, we note that if we replace $w$ with $2 w$ and $b$ with $2 b$, then since $g(w^{T} x+b)=g(2 w^{T} x+2 b)$, this would not change $h_{w, b}(x)$ at all. I.e., $g$, and hence also $h_{w, b}(x)$, depends only on the sign, but not on the magnitude, of $w^{T} x+b$. However, replacing $(w, b)$ with $(2 w, 2 b)$ also results in multiplying our functional margin by a factor of $2$. Thus, it seems that by exploiting our freedom to scale $w$ and $b$, we can make the functional margin arbitrarily large without really changing anything meaningful. Intuitively, it might therefore make sense to impose some sort of normalization condition such as $\|w\|_{2}=1$; i.e., we might replace $(w, b)$ with $(w /\|w\|_{2}, b /\|w\|_{2})$, and instead consider the functional margin of $(w /\|w\|_{2}, b /\|w\|_{2})$. We'll come back to this later.
Given a training set $S=\{(x^{(i)}, y^{(i)}) ; i=1, \ldots, n\}$, we also define the functional margin of $(w, b)$ with respect to $S$ as the smallest of the functional margins of the individual training examples. Denoted by $\hat{\gamma}$, this can therefore be written:
\begin{equation}
\hat{\gamma}=\min _{i=1, \ldots, n} \hat{\gamma}^{(i)}
\end{equation}
Next, let's talk about geometric margins. Consider the picture below:
The decision boundary corresponding to $(w, b)$ is shown, along with the vector $w$. Note that $w$ is orthogonal (at $90^{\circ}$) to the separating hyperplane. (You should convince yourself that this must be the case.) Consider the point at A, which represents the input $x^{(i)}$ of some training example with label $y^{(i)}=1$. Its distance to the decision boundary, $\gamma^{(i)}$, is given by the line segment AB.
How can we find the value of $\gamma^{(i)}$? Well, $w /\|w\|$ is a unit-length vector pointing in the same direction as $w$. Since A represents $x^{(i)}$, we therefore find that the point B is given by $x^{(i)}-\gamma^{(i)} \cdot w /\|w\|$. But this point lies on the decision boundary, and all points $x$ on the decision boundary satisfy the equation $w^{T} x+b=0$. Hence,
\begin{equation}
w^{T}\left(x^{(i)}-\gamma^{(i)} \frac{w}{\|w\|}\right)+b=0
\end{equation}
Solving for $\gamma^{(i)}$ yields
\begin{equation}
\gamma^{(i)}=\frac{w^{T} x^{(i)}+b}{\|w\|}=\left(\frac{w}{\|w\|}\right)^{T} x^{(i)}+\frac{b}{\|w\|}
\end{equation}
This was worked out for the case of a positive training example at A in the figure, where being on the "positive" side of the decision boundary is good. More generally, we define the geometric margin of $(w, b)$ with respect to a training example $(x^{(i)}, y^{(i)})$ to be
\begin{equation}
\gamma^{(i)}=y^{(i)}\left(\left(\frac{w}{\|w\|}\right)^{T} x^{(i)}+\frac{b}{\|w\|}\right)
\end{equation}
Note that if $\|w\|=1$, then the functional margin equals the geometric margin; this thus gives us a way of relating these two different notions of margin. Also, the geometric margin is invariant to rescaling of the parameters; i.e., if we replace $w$ with $2 w$ and $b$ with $2 b$, then the geometric margin does not change. This will in fact come in handy later. Specifically, because of this invariance to the scaling of the parameters, when trying to fit $w$ and $b$ to training data, we can impose an arbitrary scaling constraint on $w$ without changing anything important; for instance, we can demand that $\|w\|=1$, or $|w_{1}|=5$, or $|w_{1}+b|+|w_{2}|=2$, and any of these can be satisfied simply by rescaling $w$ and $b$.
Finally, given a training set $S=\{(x^{(i)}, y^{(i)}) ; i=1, \ldots, n\}$, we also define the geometric margin of $(w, b)$ with respect to $S$ to be the smallest of the geometric margins on the individual training examples:
\begin{equation}
\gamma=\min _{i=1, \ldots, n} \gamma^{(i)}
\end{equation}
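The definitions above are straightforward to compute; this sketch (with a made-up toy dataset and an arbitrary $(w, b)$, not from the notes) also illustrates the scaling behavior discussed earlier:

```python
import numpy as np

def functional_margin(w, b, x, y):
    """Functional margin of (w, b) on one example: y (w^T x + b)."""
    return y * (w @ x + b)

def geometric_margin(w, b, x, y):
    """Geometric margin: the functional margin of the normalized (w/||w||, b/||w||)."""
    norm = np.linalg.norm(w)
    return functional_margin(w / norm, b / norm, x, y)

# Illustrative toy data and parameters.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), 0.0

# Margins with respect to the set S = min over the examples.
func_margins = [functional_margin(w, b, xi, yi) for xi, yi in zip(X, y)]
geo_margins = [geometric_margin(w, b, xi, yi) for xi, yi in zip(X, y)]
print(min(func_margins))  # 2.0
print(min(geo_margins))   # 2 / sqrt(2), about 1.414

# Rescaling (w, b) -> (2w, 2b) doubles the functional margin but leaves the
# geometric margin unchanged.
print(min(functional_margin(2 * w, 2 * b, xi, yi) for xi, yi in zip(X, y)))  # 4.0
print(min(geometric_margin(2 * w, 2 * b, xi, yi) for xi, yi in zip(X, y)))   # still about 1.414
```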
5. The optimal margin classifier

Given a training set, it seems from our previous discussion that a natural desideratum is to try to find a decision boundary that maximizes the (geometric) margin, since this would reflect a very confident set of predictions on the training set and a good "fit" to the training data. Specifically, this will result in a classifier that separates the positive and the negative training examples with a "gap" (geometric margin).
For now, we will assume that we are given a training set that is linearly separable; i.e., that it is possible to separate the positive and negative examples using some separating hyperplane. How will we find the one that achieves the maximum geometric margin? We can pose the following optimization problem:
\begin{align}
\max _{\gamma, w, b} \quad & \gamma\nonumber \\
\text { s.t. } \quad & y^{(i)}\left(w^{T} x^{(i)}+b\right) \geq \gamma, \quad i=1, \ldots, n\nonumber \\
& \|w\|=1\nonumber
\end{align}
I.e., we want to maximize $\gamma$, subject to each training example having functional margin at least $\gamma$. The $\|w\|=1$ constraint moreover ensures that the functional margin equals the geometric margin, so we are also guaranteed that all the geometric margins are at least $\gamma$. Thus, solving this problem will result in $(w, b)$ with the largest possible geometric margin with respect to the training set.
If we could solve the optimization problem above, we'd be done. But the "$\|w\|=1$" constraint is a nasty (non-convex) one, and this problem certainly isn't in any format that we can plug into standard optimization software to solve. So, let's try transforming the problem into a nicer one. Consider:
\begin{align}
\max _{\hat{\gamma}, w, b} \quad & \frac{\hat{\gamma}}{\|w\|}\nonumber \\
\text { s.t. } \quad & y^{(i)}\left(w^{T} x^{(i)}+b\right) \geq \hat{\gamma}, \quad i=1, \ldots, n\nonumber
\end{align}
Here, we're going to maximize $\hat{\gamma} /\|w\|$, subject to the functional margins all being at least $\hat{\gamma}$. Since the geometric and functional margins are related by $\gamma=\hat{\gamma} /\|w\|$, this will give us the answer we want. Moreover, we've gotten rid of the constraint $\|w\|=1$ that we didn't like. The downside is that we now have a nasty (again, non-convex) objective function $\frac{\hat{\gamma}}{\|w\|}$; and we still don't have any off-the-shelf software that can solve this form of an optimization problem.
Let's keep going. Recall our earlier discussion that we can add an arbitrary scaling constraint on $w$ and $b$ without changing anything. This is the key idea we'll use now. We will introduce the scaling constraint that the functional margin of $(w, b)$ with respect to the training set must be $1$:
\begin{equation}
\hat{\gamma}=1
\end{equation}
Since multiplying $w$ and $b$ by some constant results in the functional margin being multiplied by that same constant, this is indeed a scaling constraint, and it can be satisfied by rescaling $w, b$. Plugging this into our problem above, and noting that maximizing $\hat{\gamma} /\|w\|=1 /\|w\|$ is the same thing as minimizing $\|w\|^{2}$, we now have the following optimization problem:
\begin{align}
\min _{w, b} \quad & \frac{1}{2}\|w\|^{2}\nonumber \\
\text { s.t. } \quad & y^{(i)}\left(w^{T} x^{(i)}+b\right) \geq 1, \quad i=1, \ldots, n\nonumber
\end{align}
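Since this is a convex QP with linear constraints, any general-purpose constrained solver can handle small instances. Below is a sketch using scipy's SLSQP solver on a made-up 2-D separable dataset (the data, solver choice, and tolerances are illustrative assumptions, not from the notes):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable data in R^2.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

def objective(params):
    """0.5 ||w||^2, with params = [w_1, w_2, b]."""
    w = params[:2]
    return 0.5 * np.dot(w, w)

# One inequality constraint per example: y^(i) (w^T x^(i) + b) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda p, xi=xi, yi=yi: yi * (p[:2] @ xi + p[2]) - 1}
    for xi, yi in zip(X, y)
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)                   # approximately [0.5, 0], 0 for this dataset
print(1 / np.linalg.norm(w))  # the geometric margin, approximately 2
```

The examples whose constraints hold with equality at the optimum are the support vectors of this dataset.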
We've now transformed the problem into a form that can be efficiently solved. The above is an optimization problem with a convex quadratic objective and only linear constraints. Its solution gives us the optimal margin classifier. This optimization problem can be solved using commercial quadratic programming (QP) code.[4]
While we could call the problem solved here, what we will instead do is make a digression to talk about Lagrange duality. This will lead us to our optimization problem's dual form, which will play a key role in allowing us to use kernels to get optimal margin classifiers to work efficiently in very high dimensional spaces. The dual form will also allow us to derive an efficient algorithm for solving the above optimization problem that will typically do much better than generic QP software.
6. Lagrange duality (optional reading)
Let's temporarily put aside SVMs and maximum margin classifiers, and talk about solving constrained optimization problems.
Some of you may recall how the method of Lagrange multipliers can be used to solve a constrained problem of this kind. (Don't worry if you haven't seen it before.) In this method, we define the Lagrangian to be
In this section, we will generalize this to constrained optimization problems in which we may have inequality as well as equality constraints. Due to time constraints, we won't really be able to do the theory of Lagrange duality justice in this class[5] but we will give the main ideas and results, which we will then apply to our optimal margin classifier's optimization problem.
Consider the following, which we'll call the primal optimization problem:
Here, the "$\mathcal{P}$" subscript stands for "primal." Let some $w$ be given. If $w$ violates any of the primal constraints (i.e., if either $g_i(w) > 0$ or $h_i(w) \neq 0$ for some $i$), then you should be able to verify that
Thus, $\theta_{\mathcal{P}}$ takes the same value as the objective in our problem for all values of $w$ that satisfy the primal constraints, and is positive infinity if the constraints are violated. Hence, if we consider the minimization problem
we see that it is the same problem as (i.e., has the same solutions as) our original, primal problem. For later use, we also define the optimal value of the objective to be $p^* = \min_w \theta_{\mathcal{P}}(w)$; we call this the value of the primal problem.
Now, let's look at a slightly different problem. We define
Here, the "$\mathcal{D}$" subscript stands for "dual." Note also that whereas in the definition of $\theta_{\mathcal{P}}$ we were optimizing (maximizing) with respect to $\alpha, \beta$, here we are minimizing with respect to $w$.
This is exactly the same as our primal problem shown above, except that the order of the "max" and the "min" are now exchanged. We also define the optimal value of the dual problem's objective to be $d^* = \max_{\alpha, \beta : \alpha_i \geq 0} \theta_{\mathcal{D}}(\alpha, \beta)$.
How are the primal and the dual problems related? It can easily be shown that
(You should convince yourself of this; it follows from the "max min" of a function always being less than or equal to the "min max.") However, under certain conditions, we will have
$$d^* = p^*,$$
so that we can solve the dual problem in lieu of the primal problem. Let's see what these conditions are.
Suppose $f$ and the $g_i$'s are convex,[6] and the $h_i$'s are affine.[7] Suppose further that the constraints $g_i$ are (strictly) feasible; this means that there exists some $w$ so that $g_i(w) < 0$ for all $i$.
Under our above assumptions, there must exist $w^*, \alpha^*, \beta^*$ so that $w^*$ is the solution to the primal problem, $\alpha^*, \beta^*$ are the solution to the dual problem, and moreover $p^* = d^* = \mathcal{L}(w^*, \alpha^*, \beta^*)$. Moreover, $w^*, \alpha^*$ and $\beta^*$ satisfy the Karush-Kuhn-Tucker (KKT) conditions, which are as follows:
Moreover, if some $w^*, \alpha^*, \beta^*$ satisfy the KKT conditions, then they are also a solution to the primal and dual problems.
We draw attention to Equation (17), which is called the KKT dual complementarity condition. Specifically, it implies that if $\alpha_i^* > 0$, then $g_i(w^*) = 0$. (I.e., the "$g_i(w) \leq 0$" constraint is active, meaning it holds with equality rather than with inequality.) Later on, this will be key for showing that the SVM has only a small number of "support vectors"; the KKT dual complementarity condition will also give us our convergence test when we talk about the SMO algorithm.
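The weak duality fact $d^* \leq p^*$ above is easy to check numerically. The sketch below evaluates an arbitrary two-variable function on a grid (rows index $w$, columns index the multipliers) and verifies that "max min" never exceeds "min max"; the payoff matrix is random and purely illustrative.

```python
import numpy as np

# A random payoff matrix standing in for L(w, alpha): w down the rows,
# alpha across the columns. Purely illustrative.
rng = np.random.default_rng(0)
F = rng.normal(size=(50, 60))

max_min = F.min(axis=0).max()   # d*: max over alpha of (min over w)
min_max = F.max(axis=1).min()   # p*: min over w of (max over alpha)

# Weak duality: "max min" is never larger than "min max".
print(max_min <= min_max)
```

Strong duality ($d^* = p^*$) is the special, non-generic situation that the convexity and strict-feasibility assumptions above guarantee; for a random matrix like this one, the two values will typically differ.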
7. Optimal margin classifiers
Note: The equivalence of optimization problem (20) and optimization problem (24), and the relationship between the primal and dual variables in Equation (22), are the most important take-home messages of this section.
Previously, we posed the following (primal) optimization problem for finding the optimal margin classifier:
We have one such constraint for each training example. Note that from the KKT dual complementarity condition, we will have $\alpha_i > 0$ only for the training examples that have functional margin exactly equal to one (i.e., the ones corresponding to constraints that hold with equality, $g_i(w) = 0$). Consider the figure below, in which a maximum margin separating hyperplane is shown by the solid line.
The points with the smallest margins are exactly the ones closest to the decision boundary; here, these are the three points (one negative and two positive examples) that lie on the dashed lines parallel to the decision boundary. Thus, only three of the $\alpha_i$'s, namely the ones corresponding to these three training examples, will be non-zero at the optimal solution to our optimization problem. These three points are called the support vectors in this problem. The fact that the number of support vectors can be much smaller than the size of the training set will be useful later.
Let's move on. Looking ahead, as we develop the dual form of the problem, one key idea to watch out for is that we'll try to write our algorithm in terms of only the inner product $\langle x^{(i)}, x^{(j)} \rangle$ (think of this as $(x^{(i)})^T x^{(j)}$) between points in the input feature space. The fact that we can express our algorithm in terms of these inner products will be key when we apply the kernel trick.
When we construct the Lagrangian for our optimization problem we have:
Note that there are only "$\alpha_i$" but no "$\beta_i$" Lagrange multipliers, since the problem has only inequality constraints.
Let's find the dual form of the problem. To do so, we need to first minimize $\mathcal{L}(w, b, \alpha)$ with respect to $w$ and $b$ (for fixed $\alpha$) to get $\theta_{\mathcal{D}}$, which we'll do by setting the derivatives of $\mathcal{L}$ with respect to $w$ and $b$ to zero. We have:
Recall that we got to the equation above by minimizing $\mathcal{L}$ with respect to $w$ and $b$. Putting this together with the constraints $\alpha_i \geq 0$ (which we always had) and the constraint (23), we obtain the following dual optimization problem:
You should also be able to verify that the conditions required for $p^* = d^*$ and the KKT conditions (Equations 15-19) to hold are indeed satisfied in our optimization problem. Hence, we can solve the dual in lieu of solving the primal problem. Specifically, in the dual problem above, we have a maximization problem in which the parameters are the $\alpha_i$'s. We'll talk later about the specific algorithm that we're going to use to solve the dual problem, but if we are indeed able to solve it (i.e., find the $\alpha$'s that maximize $W(\alpha)$ subject to the constraints), then we can use Equation (22) to go back and find the optimal $w$ as a function of the $\alpha$'s. Having found $w^*$, by considering the primal problem, it is also straightforward to find the optimal value for the intercept term $b$ as
Before moving on, let's also take a more careful look at Equation (22), which gives the optimal value of $w$ in terms of (the optimal value of) $\alpha$. Suppose we've fit our model's parameters to a training set, and now wish to make a prediction at a new input point $x$. We would then calculate $w^T x + b$, and predict $y = 1$ if and only if this quantity is bigger than zero. But using (22), this quantity can also be written:
Hence, if we've found the $\alpha_i$'s, in order to make a prediction, we have to calculate a quantity that depends only on the inner product between $x$ and the points in the training set. Moreover, we saw earlier that the $\alpha_i$'s will all be zero except for the support vectors. Thus, many of the terms in the sum above will be zero, and we really need to find only the inner products between $x$ and the support vectors (of which there is often only a small number) in order to calculate (27) and make our prediction.
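The inner-product form of the prediction can be sketched directly. The values below (support vectors, $\alpha_i$'s, and $b$) are hypothetical stand-ins rather than the solution of any actual dual problem; the point is only that the decision value touches the training data exclusively through inner products.

```python
import numpy as np

def predict_score(x, alpha, b, X_train, y_train):
    """Decision value w^T x + b rewritten via Eq. (22) as
    sum_i alpha_i y^(i) <x^(i), x> + b -- inner products only."""
    return float(np.sum(alpha * y_train * (X_train @ x)) + b)

# Hypothetical training points; the middle one is not a support vector,
# so its alpha is zero and it contributes nothing to the sum.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y_train = np.array([1, 1, -1])
alpha = np.array([1.0, 0.0, 1.0])   # satisfies sum_i alpha_i y^(i) = 0
b = 0.0

score = predict_score(np.array([2.0, 0.0]), alpha, b, X_train, y_train)
print(1 if score > 0 else -1)
```

Because only the support vectors have non-zero $\alpha_i$, the sum can be restricted to them, which is what makes prediction cheap even when the training set is large.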
By examining the dual form of the optimization problem, we gained significant insight into the structure of the problem, and were also able to write the entire algorithm in terms of only inner products between input feature vectors. In the next section, we will exploit this property to apply kernels to our classification problem. The resulting algorithm, the support vector machine, will be able to learn efficiently in very high dimensional spaces.
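Putting the pieces of this section together, here is a sketch of recovering the primal solution from a dual one: $w$ via Equation (22), and the intercept via the usual centering form of Equation (25), $b^* = -\frac{1}{2}\left(\max_{i: y^{(i)}=-1} w^{*T} x^{(i)} + \min_{i: y^{(i)}=1} w^{*T} x^{(i)}\right)$. The toy $\alpha$'s and data are hypothetical, chosen only to satisfy $\sum_i \alpha_i y^{(i)} = 0$.

```python
import numpy as np

def recover_primal(alpha, X, y):
    """Given a dual solution alpha, recover w via Equation (22),
    w = sum_i alpha_i y^(i) x^(i), and b by centering between the
    extreme projections of the two classes (the usual form of Eq. 25)."""
    w = (alpha * y) @ X
    b = -(np.max(X[y == -1] @ w) + np.min(X[y == 1] @ w)) / 2.0
    return w, b

# Hypothetical toy setting: one support vector per class.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1, -1])
alpha = np.array([0.5, 0.5])   # satisfies sum_i alpha_i y^(i) = 0

w, b = recover_primal(alpha, X, y)
print(w, b)
```

By symmetry of the two classes here, the recovered hyperplane passes through the origin ($b = 0$), as one would expect.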
8. Regularization and the non-separable case (optional reading)
The derivation of the SVM as presented so far assumed that the data is linearly separable. While mapping data to a high dimensional feature space via phi\phi does generally increase the likelihood that the data is separable, we can't guarantee that it always will be so. Also, in some cases it is not clear that finding a separating hyperplane is exactly what we'd want to do, since that might be susceptible to outliers. For instance, the left figure below shows an optimal margin classifier, and when a single outlier is added in the upper-left region (right figure), it causes the decision boundary to make a dramatic swing, and the resulting classifier has a much smaller margin.
To make the algorithm work for non-linearly separable datasets as well as be less sensitive to outliers, we reformulate our optimization (using $\ell_1$ regularization) as follows:
Thus, examples are now permitted to have (functional) margin less than 1, and if an example has functional margin $1 - \xi_i$ (with $\xi_i > 0$), we would pay a cost of the objective function being increased by $C\xi_i$. The parameter $C$ controls the relative weighting between the twin goals of making $\|w\|^2$ small (which we saw earlier makes the margin large) and of ensuring that most examples have functional margin at least 1.
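As a small numeric illustration of this trade-off: at the optimum the slack on example $i$ works out to $\xi_i = \max(0,\, 1 - y^{(i)}(w^T x^{(i)} + b))$, so the regularized objective can be evaluated directly for any candidate $(w, b)$. The data and candidate below are made up for illustration.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i xi_i, where the slack
    xi_i = max(0, 1 - y_i (w^T x_i + b)) measures how far
    example i falls short of functional margin 1."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * np.sum(xi)

# Hypothetical data: the middle point has functional margin 0.5 < 1,
# so it incurs slack 0.5; the others incur none.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 0.0]), 0.0

print(soft_margin_objective(w, b, X, y, C=1.0))
```

Raising $C$ makes violations more expensive and pushes the solution toward separating the data exactly; lowering it favors a larger margin at the cost of a few margin violations.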
Here, the $\alpha_i$'s and $r_i$'s are our Lagrange multipliers (constrained to be $\geq 0$). We won't go through the derivation of the dual again in detail, but after setting the derivatives with respect to $w$ and $b$ to zero as before, substituting them back in, and simplifying, we obtain the following dual form of the problem:
As before, we also have that $w$ can be expressed in terms of the $\alpha_i$'s as given in Equation (22), so that after solving the dual problem, we can continue to use Equation (27) to make our predictions. Note that, somewhat surprisingly, in adding $\ell_1$ regularization, the only change to the dual problem is that what was originally a constraint $0 \leq \alpha_i$ has now become $0 \leq \alpha_i \leq C$. The calculation for $b^*$ also has to be modified (Equation 25 is no longer valid); see the comments in the next section/Platt's paper.
Also, the KKT dual-complementarity conditions (which in the next section will be useful for testing for the convergence of the SMO algorithm) are:
Now, all that remains is to give an algorithm for actually solving the dual problem, which we will do in the next section.
9. The SMO algorithm (optional reading)
The SMO (sequential minimal optimization) algorithm, due to John Platt, gives an efficient way of solving the dual problem arising from the derivation of the SVM. Partly to motivate the SMO algorithm, and partly because it's interesting in its own right, let's first take another digression to talk about the coordinate ascent algorithm.
9.1. Coordinate ascent
Consider trying to solve the unconstrained optimization problem
Here, we think of $W$ as just some function of the parameters $\alpha_i$'s, and for now ignore any relationship between this problem and SVMs. We've already seen two optimization algorithms, gradient ascent and Newton's method. The new algorithm we're going to consider here is called coordinate ascent:
Thus, in the innermost loop of this algorithm, we hold all the variables except for some $\alpha_i$ fixed, and reoptimize $W$ with respect to just the parameter $\alpha_i$. In the version of this method presented here, the inner loop reoptimizes the variables in order $\alpha_1, \alpha_2, \ldots, \alpha_n, \alpha_1, \alpha_2, \ldots$ (A more sophisticated version might choose other orderings; for instance, we may choose the next variable to update according to which one we expect to allow us to make the largest increase in $W(\alpha)$.)
When the function $W$ happens to be of such a form that the "arg max" in the inner loop can be performed efficiently, then coordinate ascent can be a fairly efficient algorithm. Here's a picture of coordinate ascent in action:
The ellipses in the figure are the contours of a quadratic function that we want to optimize. Coordinate ascent was initialized at $(2, -2)$, and also plotted in the figure is the path that it took on its way to the global maximum. Notice that on each step, coordinate ascent takes a step that's parallel to one of the axes, since only one variable is being optimized at a time.
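To make the inner-loop "arg max" concrete, here is a minimal sketch of coordinate ascent on a hypothetical concave quadratic $W(\alpha_1, \alpha_2) = -(\alpha_1^2 + \alpha_2^2 + \alpha_1\alpha_2) + 3\alpha_1 + 4\alpha_2$. Because $W$ is quadratic, each coordinate update has a closed form obtained by setting the corresponding partial derivative to zero with the other coordinate held fixed.

```python
def W(a1, a2):
    """Hypothetical concave quadratic to maximize."""
    return -(a1**2 + a2**2 + a1 * a2) + 3 * a1 + 4 * a2

a1, a2 = 0.0, 0.0
for _ in range(50):
    # Each line is one inner-loop "arg max": solve dW/da_i = 0
    # with the other coordinate held fixed.
    a1 = (3.0 - a2) / 2.0
    a2 = (4.0 - a1) / 2.0

print(round(a1, 4), round(a2, 4), round(W(a1, a2), 4))
```

Each sweep moves parallel to one axis at a time, exactly as in the contour picture above; for this function the iterates contract geometrically toward the unique maximizer $(\alpha_1, \alpha_2) = (2/3, 5/3)$.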
9.2. SMO
We close off the discussion of SVMs by sketching the derivation of the SMO algorithm. Some details will be left to the homework, and for others you may refer to the paper excerpt handed out in class.
Here's the (dual) optimization problem that we want to solve:
Let's say we have a set of $\alpha_i$'s that satisfy the constraints (32-33). Now, suppose we want to hold $\alpha_2, \ldots, \alpha_n$ fixed, and take a coordinate ascent step and reoptimize the objective with respect to $\alpha_1$. Can we make any progress? The answer is no, because the constraint (33) ensures that
(This step used the fact that $y^{(1)} \in \{-1, 1\}$, and hence $(y^{(1)})^2 = 1$.) Hence, $\alpha_1$ is exactly determined by the other $\alpha_i$'s, and if we were to hold $\alpha_2, \ldots, \alpha_n$ fixed, then we can't make any change to $\alpha_1$ without violating the constraint (33) in the optimization problem.
Thus, if we want to update some subset of the $\alpha_i$'s, we must update at least two of them simultaneously in order to keep satisfying the constraints. This motivates the SMO algorithm, which simply does the following:
Repeat till convergence {
Select some pair $\alpha_i$ and $\alpha_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
Reoptimize $W(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$'s ($k \neq i, j$) fixed.
}
To test for convergence of this algorithm, we can check whether the KKT conditions (Equations 28-30) are satisfied to within some tolerance tol. Here, tol is the convergence tolerance parameter, and is typically set to around 0.01 to 0.001. (See the paper and pseudocode for details.)
The key reason that SMO is an efficient algorithm is that the update to $\alpha_i, \alpha_j$ can be computed very efficiently. Let's now briefly sketch the main ideas for deriving the efficient update.
Let's say we currently have some setting of the $\alpha_i$'s that satisfy the constraints (32-33), and suppose we've decided to hold $\alpha_3, \ldots, \alpha_n$ fixed, and want to reoptimize $W(\alpha_1, \alpha_2, \ldots, \alpha_n)$ with respect to $\alpha_1$ and $\alpha_2$ (subject to the constraints). From (33), we require that
Since the right hand side is fixed (as we've fixed $\alpha_3, \ldots, \alpha_n$), we can just let it be denoted by some constant $\zeta$:
We can thus picture the constraints on $\alpha_1$ and $\alpha_2$ as follows:
From the constraints (32), we know that $\alpha_1$ and $\alpha_2$ must lie within the box $[0, C] \times [0, C]$ shown. Also plotted is the line $\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = \zeta$, on which we know $\alpha_1$ and $\alpha_2$ must lie. Note also that, from these constraints, we know $L \leq \alpha_2 \leq H$; otherwise, $(\alpha_1, \alpha_2)$ can't simultaneously satisfy both the box and the straight line constraint. In this example, $L = 0$, but depending on what the line $\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = \zeta$ looks like, this won't always be the case. More generally, there will be some lower bound $L$ and some upper bound $H$ on the permissible values for $\alpha_2$ that ensure that $\alpha_1, \alpha_2$ lie within the box $[0, C] \times [0, C]$.
Using Equation (34), we can also write $\alpha_1$ as a function of $\alpha_2$:
(Check this derivation yourself; we again used the fact that $y^{(1)} \in \{-1, 1\}$ so that $(y^{(1)})^2 = 1$.) Hence, the objective $W(\alpha)$ can be written
Treating $\alpha_3, \ldots, \alpha_n$ as constants, you should be able to verify that this is just some quadratic function in $\alpha_2$. I.e., it can also be expressed in the form $a\alpha_2^2 + b\alpha_2 + c$ for some appropriate $a$, $b$, and $c$. If we ignore the "box" constraints (32) (or, equivalently, that $L \leq \alpha_2 \leq H$), then we can easily maximize this quadratic function by setting its derivative to zero and solving. We'll let $\alpha_2^{\text{new,unclipped}}$ denote the resulting value of $\alpha_2$. You should also be able to convince yourself that if we had instead wanted to maximize $W$ with respect to $\alpha_2$ subject to the box constraint, then we can find the resulting optimal value simply by taking $\alpha_2^{\text{new,unclipped}}$ and "clipping" it to lie in the $[L, H]$ interval, to get
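The clipping step can be sketched in a few lines. The interval formulas below are the standard bounds from Platt's paper (intersecting the line $\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = \zeta$ with the box $[0, C] \times [0, C]$, with separate cases for $y^{(1)} = y^{(2)}$ and $y^{(1)} \neq y^{(2)}$); treat this as a sketch rather than the notes' own derivation.

```python
def clip(alpha2_unclipped, L, H):
    """Clip the unconstrained maximizer of the 1-D quadratic into [L, H]."""
    return max(L, min(H, alpha2_unclipped))

def bounds(alpha1, alpha2, y1, y2, C):
    """Feasible interval [L, H] for alpha2 (standard SMO bounds):
    if y1 != y2 the line has slope +1, otherwise slope -1, which
    changes where it enters and leaves the [0, C] x [0, C] box."""
    if y1 != y2:
        return max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    return max(0.0, alpha1 + alpha2 - C), min(C, alpha1 + alpha2)

# Hypothetical current values: the unclipped maximizer 1.2 lies above H,
# so it gets clipped down to H.
L, H = bounds(alpha1=0.3, alpha2=0.6, y1=1, y2=-1, C=1.0)
print(L, H, clip(1.2, L, H))
```

With $L$ and $H$ in hand, the whole two-variable subproblem reduces to one closed-form quadratic maximization followed by this clip, which is why each SMO step is so cheap.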
Finally, having found $\alpha_2^{\text{new}}$, we can use Equation (34) to go back and find the optimal value of $\alpha_1^{\text{new}}$.
There are a couple more details that are quite easy but that we'll leave you to read about yourself in Platt's paper: one is the choice of the heuristics used to select the next $\alpha_i, \alpha_j$ to update; the other is how to update $b$ as the SMO algorithm is run.
You can read the notes from the next lecture from CS229 on Deep Learning here.
Here, for simplicity, we include all the monomials with repetitions (so that, e.g., $x_1 x_2 x_3$ and $x_2 x_3 x_1$ both appear in $\phi(x)$). Therefore, there are $1 + d + d^2 + d^3$ entries in total in $\phi(x)$. ↩︎
Recall that $\mathcal{X}$ is the space of the input $x$. In our running example, $\mathcal{X} = \mathbb{R}^d$. ↩︎
Many texts present Mercer's theorem in a slightly more complicated form involving $L^2$ functions, but when the input attributes take values in $\mathbb{R}^d$, the version given here is equivalent. ↩︎
You may be familiar with linear programming, which solves optimization problems that have linear objectives and linear constraints. QP software is also widely available, which allows convex quadratic objectives and linear constraints. ↩︎
Readers interested in learning more about this topic are encouraged to read, e.g., R. T. Rockafellar (1970), Convex Analysis, Princeton University Press. ↩︎
When $f$ has a Hessian, it is convex if and only if the Hessian is positive semidefinite. For instance, $f(w) = w^T w$ is convex; similarly, all linear (and affine) functions are also convex. (A function $f$ can also be convex without being differentiable, but we won't need those more general definitions of convexity here.) ↩︎
I.e., there exist $a_i$, $b_i$ so that $h_i(w) = a_i^T w + b_i$. "Affine" means the same thing as linear, except that we also allow the extra intercept term $b_i$. ↩︎