CS4487

Part 4 - Duality and SVM

Review

Let’s review what we covered last time.

Convex Set

Recall that the line segment between $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ is the set of all points, $$ \mathbf{x} = \alpha \mathbf{x}^{(1)} + (1 - \alpha) \mathbf{x}^{(2)}, \quad 0 \leq \alpha \leq 1. $$

So, a convex set contains the line segment between any two points in the set, $$ \mathbf{x}^{(1)}, \mathbf{x}^{(2)} \in \chi, 0 \leq \alpha \leq 1 \Rightarrow \alpha \mathbf{x}^{(1)} + (1 - \alpha) \mathbf{x}^{(2)} \in \chi. $$

Convex Function

$f : \mathbb{R}^N \rightarrow \mathbb{R}$ is convex if $dom(f)$ is a convex set and,

$$ f(\alpha \mathbf{x}^{(1)} + (1 - \alpha) \mathbf{x}^{(2)}) \leq \alpha f(\mathbf{x}^{(1)}) + (1 - \alpha) f(\mathbf{x}^{(2)}) $$

for all $\mathbf{x}^{(1)}, \mathbf{x}^{(2)} \in dom(f)$ and $0 \leq \alpha \leq 1$.

Also, remember that $f$ is concave if $-f$ is convex.

First Order Condition

$f$ is differentiable if the gradient, $$ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_N} \end{bmatrix}^T $$

exists at each $\mathbf{x} \in dom(f)$.

The first-order condition: a differentiable $f$ with convex domain is convex if and only if, $$ f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x}) \quad \forall \mathbf{x}, \mathbf{y} \in dom(f). $$

The first-order approximation of $f$ is a global underestimator.

Second-Order Condition

$f$ is twice differentiable if the Hessian $\nabla^2 f(\mathbf{x}) \in \mathbb{S}^N$, $$ \nabla^2 f(\mathbf{x})_{ij} = \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j}, $$

exists at each $\mathbf{x} \in dom(f)$.

The second-order condition: a twice differentiable $f$ with convex domain is convex if and only if, $$ \nabla^2 f(\mathbf{x}) \succeq 0 \quad \forall \mathbf{x} \in dom(f). $$

This means that the Hessian matrix is positive semidefinite, $$ \mathbf{A} \succeq 0 \Leftrightarrow \forall \mathbf{x}, \mathbf{x}^T \mathbf{A} \mathbf{x} \geq 0. $$
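As a quick numerical sketch (assuming NumPy; this example is not from the lecture), we can check the second-order condition for $f(\mathbf{x}) = x_1^2 + x_2^2$ by confirming that its Hessian has no negative eigenvalues:

```python
import numpy as np

# Hessian of f(x) = x1^2 + x2^2 is the constant matrix 2I.
H = np.array([[2.0, 0.0],
              [0.0, 2.0]])

# A symmetric matrix is positive semidefinite iff all of its
# eigenvalues are non-negative.
eigvals = np.linalg.eigvalsh(H)
print(eigvals)               # [2. 2.]
print(np.all(eigvals >= 0))  # True -> f is convex
```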

Convex Optimization Problems

To understand what a convex optimization problem is, we need to understand optimization problems in standard form.

Optimization Problem in Standard Form

$$ \begin{aligned} \underset{\mathbf{x}}{\text{minimize}} & \quad f_0(\mathbf{x}) \newline \text{subject to} & \quad f_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, r \newline & \quad h_i(\mathbf{x}) = 0, \quad i = 1, \ldots, s \end{aligned} $$

where,

  • $\mathbf{x} \in \mathbb{R}^N$ is the optimization variable,
  • $f_0 : \mathbb{R}^N \rightarrow \mathbb{R}$ is the objective function or cost function,
  • $f_i : \mathbb{R}^N \rightarrow \mathbb{R}$, $i = 1, \ldots, r$, are the inequality constraint functions,
  • $h_i : \mathbb{R}^N \rightarrow \mathbb{R}$, $i = 1, \ldots, s$, are the equality constraint functions.

The optimal value of the optimization problem is, $$ p^{\star} = \inf \{f_0(\mathbf{x}) \mid f_i(\mathbf{x}) \leq 0, i = 1, \ldots, r, \ h_i(\mathbf{x}) = 0, i = 1, \ldots, s\} $$

Note,

  • $p^{\star} = \infty$ if the problem is infeasible (no $\mathbf{x}$ satisfies the constraints).
  • $p^{\star} = -\infty$ if the problem is unbounded (the objective function can be made arbitrarily small).
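To make the standard form concrete, here is a tiny instance solved numerically (a sketch assuming SciPy; the problem data is made up): minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$, which has $p^{\star} = 0.5$ at $(0.5, 0.5)$.

```python
import numpy as np
from scipy.optimize import minimize

# Objective f0 and one equality constraint h1(x) = x1 + x2 - 1 = 0.
f0 = lambda x: x[0]**2 + x[1]**2
constraints = [{"type": "eq", "fun": lambda x: x[0] + x[1] - 1.0}]

res = minimize(f0, x0=np.zeros(2), constraints=constraints)
print(res.x)    # ~[0.5, 0.5]
print(res.fun)  # p* ~ 0.5
```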

Optimal and Locally Optimal Points

A point $\mathbf{x}$ is feasible if $\mathbf{x} \in dom(f_0)$ and it satisfies the constraints. A feasible $\mathbf{x}^{\star}$ is optimal if $f_0(\mathbf{x}^{\star}) = p^{\star}$.

$\mathbf{x}$ is locally optimal if there is an $R > 0$ such that $\mathbf{x}$ is optimal for, $$ \begin{aligned} \underset{\mathbf{z}}{\text{minimize}} & \quad f_0(\mathbf{z}) \newline \text{subject to} & \quad f_i(\mathbf{z}) \leq 0, \quad i = 1, \ldots, r \newline & \quad h_i(\mathbf{z}) = 0, \quad i = 1, \ldots, s \newline & \quad \Vert \mathbf{z} - \mathbf{x} \Vert_2 \leq R \end{aligned} $$

Examples
  • $f_0(x) = -\log x, dom(f_0) = \mathbb{R}_{++}$
    • $p^{\star} = -\infty$
  • $f_0(x) = x \log x, dom(f_0) = \mathbb{R}_{++}$
    • $p^{\star} = -\frac{1}{e}, x = \frac{1}{e}$ is optimal.
  • $f_0(x) = x^3 - 3x$
    • $p^{\star} = -\infty$; locally optimal at $x = 1$.

Feasibility Problem

$$ \begin{aligned} \text{find} & \quad \mathbf{x} \newline \text{subject to} & \quad f_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, r \newline & \quad h_i(\mathbf{x}) = 0, \quad i = 1, \ldots, s \end{aligned} $$

can be considered a special case of the general problem with $f_0(\mathbf{x}) = 0$, $$ \begin{aligned} \underset{\mathbf{x}}{\text{minimize}} & \quad 0 \newline \text{subject to} & \quad f_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, r \newline & \quad h_i(\mathbf{x}) = 0, \quad i = 1, \ldots, s \end{aligned} $$

$p^{\star} = 0$ if the constraints are feasible, and any feasible $\mathbf{x}$ is optimal; otherwise, $p^{\star} = \infty$.

Convex Optimization Problem in Standard Form

$$ \begin{aligned} \text{minimize} & \quad f_0(\mathbf{x}) \newline \text{subject to} & \quad f_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, r \newline & \quad \mathbf{a_i}^T \mathbf{x} = b_i, \quad i = 1, \ldots, s \end{aligned} $$

where,

  • $f_0, f_1, \ldots, f_r$ are convex functions,
  • Equality constraints are affine functions.

We often write this as, $$ \begin{aligned} \underset{\mathbf{x}}{\text{minimize}} & \quad f_0(\mathbf{x}) \newline \text{subject to} & \quad f_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, r \newline & \quad \mathbf{A}\mathbf{x} = \mathbf{b} \end{aligned}. $$

An important property to note: the feasible set of a convex optimization problem is a convex set.

Example

$$ \begin{aligned} \text{minimize} & \quad f_0(\mathbf{x}) = x_1^2 + x_2^2 \newline \text{subject to} & \quad f_1(\mathbf{x}) = \frac{x_1}{1 + x_2^2} \leq 0 \newline & \quad h_1(\mathbf{x}) = (x_1 + x_2)^2 = 0 \end{aligned} $$

Let’s take a look at each component here.

$f_0$ is convex, and the feasible set $\{(x_1, x_2) \mid x_1 = -x_2 \leq 0\}$ is convex. However, this is not a convex problem (in standard form): $f_1$ is not convex and $h_1$ is not affine.

Let’s write it as an equivalent convex problem in standard form, multiplying the first constraint by $(1 + x_2^2)$ and taking the square root of the second, $$ \begin{aligned} \text{minimize} & \quad f_0(\mathbf{x}) = x_1^2 + x_2^2 \newline \text{subject to} & \quad f_1(\mathbf{x}) = x_1 \leq 0 \newline & \quad h_1(\mathbf{x}) = x_1 + x_2 = 0 \end{aligned} $$

Local and Global Optima in Convex Optimization

Any locally optimal point of a convex optimization problem is (globally) optimal.

Proof. Suppose $\mathbf{x}$ is locally optimal and $\mathbf{y}$ is optimal with $f_0(\mathbf{y}) < f_0(\mathbf{x})$. Locally optimal $\mathbf{x}$ means there is an $R > 0$ such that, $$ \mathbf{z} \text{ feasible, } \Vert \mathbf{z} - \mathbf{x} \Vert_2 \leq R \Rightarrow f_0(\mathbf{z}) \geq f_0(\mathbf{x}). $$

Consider $\mathbf{z} = \alpha \mathbf{y} + (1 - \alpha) \mathbf{x}$, with $\alpha = \frac{R}{2 \Vert \mathbf{y} - \mathbf{x} \Vert_2}$. Since $\mathbf{y}$ is feasible with $f_0(\mathbf{y}) < f_0(\mathbf{x})$, local optimality forces $\Vert \mathbf{y} - \mathbf{x} \Vert_2 > R$, so $0 < \alpha < \frac{1}{2}$.

$\mathbf{z}$ is a convex combination of two feasible points, hence feasible. $\Vert \mathbf{z} - \mathbf{x} \Vert_2 = \frac{R}{2}$, and, $$ f_0(\mathbf{z}) \leq \alpha f_0(\mathbf{y}) + (1 - \alpha) f_0(\mathbf{x}) < f_0(\mathbf{x}), $$

which contradicts our assumption that $\mathbf{x}$ is locally optimal.

Duality

To understand duality, we need to understand the Lagrangian.

Lagrangian

Given the optimization problem in standard form (not necessarily convex), $$ \begin{aligned} \underset{\mathbf{x}}{\text{min}} & \quad f_0(\mathbf{x}) \newline \text{subject to} & \quad f_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, r \newline & \quad h_i(\mathbf{x}) = 0, \quad i = 1, \ldots, s \end{aligned} $$

We have the variable $\mathbf{x} \in \mathbb{R}^N$, domain $\chi$, and the optimal value $p^{\star}$.

Lagrangian $\mathbf{L} : \mathbb{R}^N \times \mathbb{R}^r \times \mathbb{R}^s \rightarrow \mathbb{R}$, with $dom(\mathbf{L}) = \chi \times \mathbb{R}^r \times \mathbb{R}^s$, $$ \mathbf{L}(\mathbf{x}, \mathbf{\lambda}, \mathbf{\nu}) = f_0(\mathbf{x}) + \sum_{i=1}^r \lambda_i f_i(\mathbf{x}) + \sum_{i=1}^s \nu_i h_i(\mathbf{x}). $$

This can be interpreted as a weighted sum of the objective and constraint functions.

$\lambda_i$ is the Lagrange multiplier associated with $f_i(\mathbf{x}) \leq 0$; $\nu_i$ is the Lagrange multiplier associated with $h_i(\mathbf{x}) = 0$.

Lagrange Dual Function

The Lagrange dual function $g : \mathbb{R}^r \times \mathbb{R}^s \rightarrow \mathbb{R}$ is, $$ g(\mathbf{\lambda}, \mathbf{\nu}) = \underset{\mathbf{x} \in \chi}{\text{inf}} \ \mathbf{L}(\mathbf{x}, \mathbf{\lambda}, \mathbf{\nu}) = \underset{\mathbf{x} \in \chi}{\text{inf}} \left( f_0(\mathbf{x}) + \sum_{i=1}^r \lambda_i f_i(\mathbf{x}) + \sum_{i=1}^s \nu_i h_i(\mathbf{x}) \right). $$

$g$ is concave and can be $-\infty$ for some $\mathbf{\lambda}$ and $\mathbf{\nu}$.

Lower Bound Property. If $\mathbf{\lambda} \geq 0$, then $g(\mathbf{\lambda}, \mathbf{\nu}) \leq p^{\star}$.

Proof. If $\mathbf{\tilde{x}}$ is feasible and $\mathbf{\lambda} \geq 0$, then, $$ f_0(\mathbf{\tilde{x}}) \geq \mathbf{L}(\mathbf{\tilde{x}}, \mathbf{\lambda}, \mathbf{\nu}) \geq \underset{\mathbf{x} \in \chi}{\text{inf}} \ \mathbf{L}(\mathbf{x}, \mathbf{\lambda}, \mathbf{\nu}) = g(\mathbf{\lambda}, \mathbf{\nu}). $$

Minimizing over all feasible $\mathbf{\tilde{x}}$ gives $p^{\star} \geq g(\mathbf{\lambda}, \mathbf{\nu})$.

Least-Norm Solution of Linear Equations

Given a linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$, the least-norm solution is, $$ \begin{aligned} \underset{\mathbf{x}}{\text{minimize}} & \quad \mathbf{x}^T \mathbf{x} \newline \text{subject to} & \quad \mathbf{A}\mathbf{x} = \mathbf{b} \end{aligned} $$

Firstly, let’s write this in standard form, $$ \begin{aligned} \underset{\mathbf{x}}{\text{minimize}} & \quad \mathbf{x}^T \mathbf{x} \newline \text{subject to} & \quad \mathbf{A}\mathbf{x} - \mathbf{b} = 0 \end{aligned} $$

Therefore, the Lagrangian is $\mathbf{L}(\mathbf{x}, \mathbf{\nu}) = \mathbf{x}^T \mathbf{x} + \mathbf{\nu}^T (\mathbf{A}\mathbf{x} - \mathbf{b})$.

To minimize $\mathbf{L}$ over $\mathbf{x}$, set gradient equal to zero, $$ \nabla_{\mathbf{x}} \mathbf{L}(\mathbf{x}, \mathbf{\nu}) = 2\mathbf{x} + \mathbf{A}^T \mathbf{\nu} = 0 \Rightarrow \mathbf{x} = -\frac{1}{2} \mathbf{A}^T \mathbf{\nu}. $$

Plug into $\mathbf{L}$ to obtain $g$, $$ g(\mathbf{\nu}) = \mathbf{L}\left(-\frac{1}{2} \mathbf{A}^T \mathbf{\nu}, \mathbf{\nu}\right) = -\frac{1}{4} \mathbf{\nu}^T \mathbf{A} \mathbf{A}^T \mathbf{\nu} - \mathbf{b}^T \mathbf{\nu}, $$

which is a concave function of $\mathbf{\nu}$.

Using the lower bound property, $p^{\star} \geq -\frac{1}{4} \mathbf{\nu}^T \mathbf{A} \mathbf{A}^T \mathbf{\nu} - \mathbf{b}^T \mathbf{\nu}$ for all $\mathbf{\nu}$.
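We can sanity-check the bound numerically (a sketch assuming NumPy; the instance is random). The least-norm solution is $\mathbf{x}^{\star} = \mathbf{A}^T (\mathbf{A}\mathbf{A}^T)^{-1} \mathbf{b}$, maximizing $g$ gives $\mathbf{\nu}^{\star} = -2 (\mathbf{A}\mathbf{A}^T)^{-1} \mathbf{b}$, and the two optimal values coincide, so the lower bound is tight here:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))  # underdetermined: 3 equations, 5 unknowns
b = rng.standard_normal(3)

# Primal optimum: least-norm solution x* = A^T (A A^T)^{-1} b.
x_star = A.T @ np.linalg.solve(A @ A.T, b)
p_star = x_star @ x_star

# Dual optimum: maximizing g(nu) = -1/4 nu^T A A^T nu - b^T nu
# gives nu* = -2 (A A^T)^{-1} b.
nu_star = -2.0 * np.linalg.solve(A @ A.T, b)
d_star = -0.25 * nu_star @ (A @ A.T) @ nu_star - b @ nu_star

print(np.isclose(p_star, d_star))  # True: strong duality holds here
```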

Standard Form Linear Programming

$$ \begin{aligned} \underset{\mathbf{x}}{\text{minimize}} & \quad \mathbf{c}^T \mathbf{x} \newline \text{subject to} & \quad \mathbf{A}\mathbf{x} = \mathbf{b}, \quad \mathbf{x} \succeq 0 \newline \end{aligned} $$

Let’s write this in standard form, $$ \begin{aligned} \underset{\mathbf{x}}{\text{minimize}} & \quad \mathbf{c}^T \mathbf{x} \newline \text{subject to} & \quad \mathbf{A}\mathbf{x} - \mathbf{b} = 0 \newline & \quad -\mathbf{x} \preceq 0 \end{aligned} $$

The Lagrangian is, $$ \begin{aligned} \mathbf{L}(\mathbf{x}, \mathbf{\lambda}, \mathbf{\nu}) & = \mathbf{c}^T \mathbf{x} + \mathbf{\nu}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) - \mathbf{\lambda}^T \mathbf{x} \newline & = -\mathbf{b}^T \mathbf{\nu} + (\mathbf{A}^T \mathbf{\nu} - \mathbf{\lambda} + \mathbf{c})^T \mathbf{x}. \end{aligned} $$

$\mathbf{L}$ is affine in $\mathbf{x}$, hence, $$ g(\mathbf{\lambda}, \mathbf{\nu}) = \underset{\mathbf{x}}{\text{inf}} \ \mathbf{L}(\mathbf{x}, \mathbf{\lambda}, \mathbf{\nu}) = \begin{cases} -\mathbf{b}^T \mathbf{\nu} & \mathbf{A}^T \mathbf{\nu} - \mathbf{\lambda} + \mathbf{c} = 0 \newline -\infty & \text{otherwise} \end{cases} $$

$g$ is linear on $\{(\mathbf{\lambda}, \mathbf{\nu}) \mid \mathbf{A}^T \mathbf{\nu} - \mathbf{\lambda} + \mathbf{c} = 0\}$, hence concave.

Using the lower bound property, $p^{\star} \geq -\mathbf{b}^T \mathbf{\nu}$ if $\mathbf{A}^T \mathbf{\nu} + \mathbf{c} \succeq 0$.
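As an illustrative sketch (assuming SciPy’s `linprog`; the numbers are hypothetical), we can solve a small LP and its dual and observe that the bound is tight:

```python
import numpy as np
from scipy.optimize import linprog

# Primal: minimize c^T x  subject to  Ax = b, x >= 0.
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
primal = linprog(c, A_eq=A, b_eq=b)  # x >= 0 is linprog's default bound

# Dual: maximize -b^T nu  subject to  A^T nu + c >= 0, nu free.
# As a minimization for linprog: minimize b^T nu s.t. -A^T nu <= c.
dual = linprog(b, A_ub=-A.T, b_ub=c, bounds=[(None, None)])

print(primal.fun)  # p* = 0.0 (attained at x = [0, 0, 1])
print(-dual.fun)   # d* = 0.0 -> strong duality for this LP
```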

The Dual Problem

The Lagrange dual problem is, $$ \begin{aligned} \underset{\mathbf{\lambda}, \mathbf{\nu}}{\text{max}} & \quad g(\mathbf{\lambda}, \mathbf{\nu}) \newline \text{subject to} & \quad \mathbf{\lambda} \geq 0 \end{aligned} $$

$\mathbf{\lambda}, \mathbf{\nu}$ are dual feasible if $\mathbf{\lambda} \succeq 0$ and $(\mathbf{\lambda}, \mathbf{\nu}) \in dom(g)$.

The dual problem finds the best lower bound on $p^{\star}$ that can be obtained from the Lagrange dual function.

This is a convex optimization problem; its optimal value is denoted $d^{\star}$.

  • Weak duality: $d^{\star} \leq p^{\star}$.

    • Always holds (for convex problems and non-convex problems).
    • Can be used to find non-trivial lower bounds for difficult problems.
  • Strong duality: $d^{\star} = p^{\star}$.

    • Does not hold in general.
    • (Usually) holds for convex problems, e.g., SVM.

Complementary Slackness

Assuming strong duality holds, $\mathbf{x}^{\star}$ is primal optimal, $(\mathbf{\lambda}^{\star}, \mathbf{\nu}^{\star})$ is dual optimal. $$ \begin{aligned} f_0(\mathbf{x}^{\star}) & = g(\mathbf{\lambda}^{\star}, \mathbf{\nu}^{\star}) \newline & = \underset{\mathbf{x}}{\text{inf}} \left(f_0(\mathbf{x}) + \sum_{i=1}^r \lambda_i^{\star} f_i(\mathbf{x}) + \sum_{i=1}^s \nu_i^{\star} h_i(\mathbf{x})\right) \newline & \leq f_0(\mathbf{x}^{\star}) + \sum_{i=1}^r \lambda_i^{\star} f_i(\mathbf{x}^{\star}) + \sum_{i=1}^s \nu_i^{\star} h_i(\mathbf{x}^{\star}) \newline & \leq f_0(\mathbf{x}^{\star}). \end{aligned} $$

Hence, the two inequalities hold with equality; $\mathbf{x}^{\star}$ not only minimizes $f_0(\mathbf{x})$ over the feasible set but also minimizes $\mathbf{L}(\mathbf{x}, \mathbf{\lambda}^{\star}, \mathbf{\nu}^{\star})$.

$\lambda_i^{\star} f_i(\mathbf{x}^{\star}) = 0$ for all $i = 1, \ldots, r$. (Complementary slackness) $$ \lambda_i^{\star} > 0 \Rightarrow f_i(\mathbf{x}^{\star}) = 0. \quad f_i(\mathbf{x}^{\star}) < 0 \Rightarrow \lambda_i^{\star} = 0. $$

Karush-Kuhn-Tucker (KKT) Conditions

The following four conditions are called KKT conditions (for a problem with differentiable $f_i$ and $h_i$).

  1. Primal Constraint
    • $f_i(\mathbf{x}^{\star}) \leq 0, i = 1, \ldots, r, h_i(\mathbf{x}^{\star}) = 0, i = 1, \ldots, s$.
  2. Dual Constraint
    • $\lambda_i^{\star} \geq 0, \quad i = 1, \ldots, r$.
  3. Complementary Slackness
    • $\lambda_i^{\star} f_i(\mathbf{x}^{\star}) = 0, i = 1, \ldots, r$.
  4. Gradient of Lagrangian with respect to $\mathbf{x}$ vanishes.
    • $\nabla f_0(\mathbf{x}^{\star}) + \sum_{i = 1}^r \lambda_i^{\star} \nabla f_i(\mathbf{x}^{\star}) + \sum_{i = 1}^s \nu_i^{\star} \nabla h_i(\mathbf{x}^{\star}) = 0$.

If strong duality holds and $\mathbf{x}, \mathbf{\lambda}, \mathbf{\nu}$ are optimal, then they must satisfy the KKT conditions.

Support Vector Machine (SVM)

Recall from last time that we assumed the data is linearly separable. What if the data is separable, but not linearly separable?

A smart idea is to transform the input space: map $\mathbf{x} \in \mathbb{R}^N$ to a new high-dimensional space via $\mathbf{z} = \phi(\mathbf{x}) \in \mathbb{R}^L$, where $\phi$ is the transform function.

Then, we can learn the linear classifier in this new space.

SVM with Transformed Input

Given a training set $\mathcal{D} = \{\mathbf{x}^{(i)}, y^{(i)}\}_{i=1}^M$, the original SVM, $$ \underset{\mathbf{w}, b}{\arg\min} \frac{1}{2} \mathbf{w}^T \mathbf{w} \quad \text{subject to} \quad y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b) \geq 1, \quad 1 \leq i \leq M. $$

The transformed SVM, $$ \underset{\mathbf{w}, b}{\arg\min} \frac{1}{2} \mathbf{w}^T \mathbf{w} \quad \text{subject to} \quad y^{(i)} (\mathbf{w}^T \phi(\mathbf{x}^{(i)}) + b) \geq 1, \quad 1 \leq i \leq M. $$

The hyperplane $\mathbf{w} \in \mathbb{R}^L$ is now in the high-dimensional space.

If $L$ is very large, calculating the feature vector $\phi(\mathbf{x})$ could be time-consuming. Also, optimization could be very inefficient in high-dimensional space.

SVM: From Primal to Dual

Recall the primal form of SVM (with soft margin), $$ \begin{aligned} \underset{\mathbf{w}, b, \mathbf{\xi}}{\min} & \quad \frac{1}{2} \Vert \mathbf{w} \Vert_2^2 + C \sum_{i=1}^M \xi_i \newline \text{subject to} & \quad y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b) \geq 1 - \xi_i, \quad i = 1, \ldots, M \newline & \quad \xi_i \geq 0, \quad i = 1, \ldots, M \end{aligned} $$

Firstly, rewrite in standard form, $$ \begin{aligned} \underset{\mathbf{w}, b, \xi}{\min} & \quad \frac{1}{2} \Vert \mathbf{w} \Vert_2^2 + C \sum_{i=1}^M \xi_i \newline \text{subject to} & \quad -(y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b) - 1 + \xi_i) \leq 0, \quad i = 1, \ldots, M \newline & \quad -\xi_i \leq 0, i = 1, \ldots, M \end{aligned} $$

Note, we do not have any equality constraints here, only inequality constraints.

We form the Lagrangian, $$ \mathbf{L}(\mathbf{w}, b, \mathbf{\xi}, \mathbf{\alpha}, \mathbf{\beta}) = \frac{1}{2} \Vert \mathbf{w} \Vert_2^2 + C \sum_{i=1}^M \xi_i - \sum_{i=1}^M \alpha_i [y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + b) - 1 + \xi_i] - \sum_{i=1}^M \beta_i \xi_i $$

Taking the gradient of $\mathbf{L}$ with respect to $\mathbf{w}, b, \xi_i$ and setting to zero, $$ \nabla_{\mathbf{w}} \mathbf{L} = \mathbf{w} - \sum_{i=1}^M \alpha_i y^{(i)} \mathbf{x}^{(i)} = 0 \newline \nabla_b \mathbf{L} = \sum_{i=1}^M \alpha_i y^{(i)} = 0 \newline \nabla_{\xi_i} \mathbf{L} = C - \alpha_i - \beta_i = 0, \quad i = 1, \ldots, M $$

Plugging back into $\mathbf{L}$, $$ g(\mathbf{\alpha}) = \sum_{i=1}^M \alpha_i - \frac{1}{2} \sum_{i,j=1}^M y^{(i)} y^{(j)} \alpha_i \alpha_j \mathbf{x}^{(i)T} \mathbf{x}^{(j)} $$

SVM: The Dual Problem

Put this together with the constraints and eliminate $\beta_i$, $$ \begin{aligned} \underset{\mathbf{\alpha}}{\max} & \quad \sum_{i=1}^M \alpha_i - \frac{1}{2} \sum_{i,j=1}^M y^{(i)} y^{(j)} \alpha_i \alpha_j \mathbf{x}^{(i)T} \mathbf{x}^{(j)} \newline \text{subject to} & \quad \sum_{i=1}^M \alpha_i y^{(i)} = 0 \newline & \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, M \end{aligned} $$

$\alpha_i$ corresponds to the $i$-th training sample $(\mathbf{x}^{(i)}, y^{(i)})$.

Recover $\mathbf{w}$ as a weighted sum of the training points, $\mathbf{w} = \sum_{i=1}^M \alpha_i y^{(i)} \mathbf{x}^{(i)}$.

Thus, classifying a new point $\mathbf{x}^{\star}$ is, $$ y^{\star} = \text{sign}(\mathbf{w}^T \mathbf{x}^{\star} + b) = \text{sign} \left( \sum_{i=1}^M \alpha_i y^{(i)} \mathbf{x}^{(i)T} \mathbf{x}^{\star} + b \right) $$
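The following sketch (assuming scikit-learn and NumPy; purely illustrative) checks this relationship on a linear SVC: `dual_coef_` stores $\alpha_i y^{(i)}$ for the support vectors, so $\mathbf{w}$ can be recovered from the dual variables and used for prediction.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
y = 2 * y - 1  # map labels {0, 1} -> {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only;
# alpha_i = 0 for every other training point, so they drop out of the sum.
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))  # True

# Classify a new point via the dual expansion.
x_new = np.array([0.0, 2.0])
score = (w @ x_new + clf.intercept_)[0]
print(np.sign(score))
```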

Interpretation of $\alpha_i$

Combining complementary slackness with primal feasibility and dual feasibility, we can show that, $$ \alpha_i = 0 \Rightarrow y^{(i)} ((\mathbf{x}^{(i)})^T \mathbf{w}^{\star} + b^{\star}) \geq 1 $$

I.e., the sample $\mathbf{x}^{(i)}$ is not on the margin.

$$ 0 < \alpha_i < C \Rightarrow y^{(i)} ((\mathbf{x}^{(i)})^T \mathbf{w}^{\star} + b^{\star}) = 1 $$

I.e., the sample $\mathbf{x}^{(i)}$ is on the margin.

$$ \alpha_i = C \Rightarrow y^{(i)} ((\mathbf{x}^{(i)})^T \mathbf{w}^{\star} + b^{\star}) \leq 1 $$

I.e., the sample $\mathbf{x}^{(i)}$ violates the margin.

Kernel Function

Dual SVM with transformed input, $$ \begin{aligned} \underset{\mathbf{\alpha}}{\max} & \quad \sum_{i=1}^M \alpha_i - \frac{1}{2} \sum_{i,j=1}^M y^{(i)} y^{(j)} \alpha_i \alpha_j \phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)}) \newline \text{subject to} & \quad \sum_{i=1}^M \alpha_i y^{(i)} = 0 \newline & \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, M \end{aligned} $$

Completely written in terms of inner product $\phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)})$.

Rather than explicitly calculate the high-dimensional $\phi(\mathbf{x})$, we only need to calculate the inner product between the two vectors.

Let the kernel function $\mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) = \phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)})$.

This is much less expensive to compute than explicitly calculating the high-dimensional feature vector and the inner product.

Example: Polynomial Kernel

Let the input vector $\mathbf{x} = [x_1, \ldots, x_N]^T \in \mathbb{R}^N$.

The kernel between two vectors is a $p$-th order polynomial, $$ \mathcal{K}(\mathbf{x}, \mathbf{x}^{\prime}) = (\mathbf{x}^T \mathbf{x}^{\prime})^p = \left(\sum_{i=1}^N x_i x_i^{\prime}\right)^p $$

For example, for $p = 2$, $$ \mathcal{K}(\mathbf{x}, \mathbf{x}^{\prime}) = (\mathbf{x}^T \mathbf{x}^{\prime})^2 = \sum_{i=1}^N \sum_{j=1}^N (x_i x_i^{\prime} x_j x_j^{\prime}) = \phi(\mathbf{x})^T \phi(\mathbf{x}^{\prime}) $$

The transformed space consists of the quadratic terms of the input $\mathbf{x}$, $$ \phi(\mathbf{x}) = [x_1 x_1, x_1 x_2, \ldots, x_2 x_1, x_2 x_2, \ldots, x_N x_1, \ldots, x_N x_N]^T $$

Comparison of number of multiplications,

  • Explicit calculation: $O(N^2)$
  • Kernel calculation: $O(N)$
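A quick numerical check of the equivalence (a sketch assuming NumPy): for $p = 2$, the explicit map $\phi(\mathbf{x})$ is the flattened outer product of $\mathbf{x}$ with itself, and both sides agree.

```python
import numpy as np

rng = np.random.default_rng(1)
x, xp = rng.standard_normal(4), rng.standard_normal(4)

# Explicit feature map for p = 2: all N^2 pairwise products x_i x_j.
phi = lambda v: np.outer(v, v).ravel()

lhs = (x @ xp) ** 2          # kernel evaluation: O(N) multiplications
rhs = phi(x) @ phi(xp)       # explicit inner product: O(N^2) multiplications
print(np.isclose(lhs, rhs))  # True
```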

Kernel Trick

Replacing the inner product with a kernel function in optimization is called the kernel trick.

This turns a linear classifier into a non-linear classifier. The shape of the decision boundary is determined by the kernel function.

Kernel SVM, $$ \begin{aligned} \underset{\mathbf{\alpha}}{\max} & \quad \sum_{i=1}^M \alpha_i - \frac{1}{2} \sum_{i,j=1}^M y^{(i)} y^{(j)} \alpha_i \alpha_j \mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \newline \text{subject to} & \quad \sum_{i=1}^M \alpha_i y^{(i)} = 0 \newline & \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, M \end{aligned} $$

Our prediction is, $$ y^{\star} = \text{sign} \left( \sum_{i=1}^M \alpha_i y^{(i)} \mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}^{\star}) + b \right) $$

RBF Kernel

The Radial Basis Function (RBF) kernel is a popular kernel function, $$ \mathcal{K}(\mathbf{x}, \mathbf{x}^{\prime}) = e^{-\gamma \Vert \mathbf{x} - \mathbf{x}^{\prime} \Vert^2} $$

which resembles a Gaussian.

$\gamma > 0$ is the inverse bandwidth parameter of the kernel; it controls the smoothness of the function.

Small $\gamma \rightarrow$ wide Gaussian $\rightarrow$ smooth function. Large $\gamma \rightarrow$ thin Gaussian $\rightarrow$ wiggly function.

This corresponds to a feature transform function of infinite dimension.
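A minimal sketch of the RBF kernel matrix (assuming NumPy; the data points are made up), showing how $\gamma$ controls the effective neighbourhood:

```python
import numpy as np

def rbf_kernel(X, Xp, gamma=1.0):
    # Pairwise squared Euclidean distances, then an elementwise exponential.
    sq = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(np.round(rbf_kernel(X, X, gamma=0.5), 3))
# Nearby points get values near 1, distant points near 0; increasing
# gamma ("thinner Gaussian") pushes the off-diagonal entries toward 0.
```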

Kernel SVM: Summary

  • Kernel Classifier.
    • Kernel function defines the shape of the non-linear decision boundary.
      • Implicitly transforms input features into a high-dimensional space.
      • Uses linear classifier in high-dimensional space.
      • The decision boundary is non-linear in the original space.
  • Training.
    • Maximize the margin of the training data.
      • I.e., maximizes the separation between the points and the decision boundary.
    • Use cross-validation to pick $C$ and the kernel hyperparameters.
  • Advantages.
    • Non-linear decision boundary for more complex classification problems.
    • Some intuition from the type of kernel function used.
  • Disadvantages.
    • Sensitive to the kernel function used.
    • Sensitive to the $C$ and kernel hyperparameters.
    • Computationally expensive to do cross-validation.
    • Need to calculate the kernel matrix.
      • $M^2$ terms where $M$ is the size of the training set.
      • For large $M$, uses large amount of computation and memory.

Sequential Minimal Optimization (SMO)

SMO is an algorithm for training SVMs. Let’s take a look.

Solving SVM: Coordinate Descent

Consider the unconstrained optimization problem, $$ \underset{\mathbf{\theta}}{\min} \ell(\theta_1, \theta_2, \ldots, \theta_N) $$

The coordinate descent method first selects an initial point $\mathbf{\theta}^{(0)} \in \mathbb{R}^N$ and repeats:

  • Choose an index $i$ from 1 to $N$.
  • $\theta_i^{(t+1)} = \arg \min_{\hat{\theta}_i} \ell(\theta_1^{(t)}, \ldots, \hat{\theta}_i, \ldots, \theta_N^{(t)})$

At each iteration, the algorithm determines a coordinate, and minimizes over the corresponding coordinate direction while fixing all other coordinates.

Coordinate descent is applicable in both differentiable and non-differentiable optimization contexts.
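Here is a minimal coordinate-descent sketch (assuming NumPy; the quadratic below is a made-up instance) where each coordinate update has a closed form:

```python
import numpy as np

# Minimize f(theta) = 1/2 theta^T Q theta - b^T theta, Q positive definite.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

theta = np.zeros(2)
for t in range(50):
    i = t % 2  # cycle through the coordinates
    # Setting d f / d theta_i = (Q theta)_i - b_i = 0 with all other
    # coordinates fixed gives the exact one-dimensional minimizer.
    theta[i] = (b[i] - Q[i] @ theta + Q[i, i] * theta[i]) / Q[i, i]

print(theta)                  # converges to Q^{-1} b
print(np.linalg.solve(Q, b))  # [0.2, 0.4]
```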

Solving SVM: Sequential Minimal Optimization (SMO)

SMO is an algorithm for solving the quadratic programming problem that arises during the training of SVMs, invented by John C. Platt in 1998.

  • Sequential
    • Not parallel.
    • Optimize a set of two Lagrange multipliers.
  • Minimal
    • Optimize the smallest possible sub-problem at each step.
  • Optimization
    • Satisfy the constraints for the chosen pair of Lagrange multipliers.

SMO

Recall the dual form of kernelized SVM with soft margin, $$ \begin{aligned} \underset{\mathbf{\alpha}}{\max} & \quad \sum_{i=1}^M \alpha_i - \frac{1}{2} \sum_{i,j=1}^M y^{(i)} y^{(j)} \alpha_i \alpha_j \mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}) \newline \text{subject to} & \quad \sum_{i=1}^M \alpha_i y^{(i)} = 0 \newline & \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, M \end{aligned} $$

SMO is derived by taking problem decomposition to its extreme: it optimizes a minimal subset of just two dual variables (i.e., two data points) at each iteration.

The sub-problem for two data points admits an analytical solution, eliminating the need to use an iterative quadratic programming optimizer as part of the algorithm.

The condition $\sum_{i=1}^M \alpha_i y^{(i)} = 0$ is enforced throughout the iterations, implying that the smallest number of multipliers that can be optimized at each step is two.

Whenever one multiplier is updated, at least one other multiplier needs to be adjusted in order to keep the condition true.

At each step SMO chooses two elements $\alpha_i$ and $\alpha_j$ to jointly optimize, finds the optimal values for those two parameters given that all the others are fixed.

The choice of the two points is determined by a heuristic, while the optimization of the two multipliers is performed analytically.

Despite needing more iterations to converge, each iteration uses so few operations that the algorithm exhibits an overall speed-up of some orders of magnitude.
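Below is a toy sketch of the idea (assuming NumPy, a linear kernel, and random selection of the second multiplier, in the style of the "simplified SMO" found in teaching materials; Platt's full algorithm uses more careful selection heuristics):

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-4, max_passes=20, seed=0):
    """Toy simplified-SMO trainer for a linear-kernel soft-margin SVM."""
    rng = np.random.default_rng(seed)
    M = len(y)
    K = X @ X.T  # linear kernel matrix
    alpha, b = np.zeros(M), 0.0
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(M):
            E_i = (alpha * y) @ K[:, i] + b - y[i]
            # Only update alpha_i if it violates its KKT condition.
            if (y[i] * E_i < -tol and alpha[i] < C) or \
               (y[i] * E_i > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(M) if k != i])
                E_j = (alpha * y) @ K[:, j] + b - y[j]
                a_i, a_j = alpha[i], alpha[j]
                # Box bounds that keep sum(alpha * y) = 0.
                if y[i] != y[j]:
                    L, H = max(0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    L, H = max(0, a_i + a_j - C), min(C, a_i + a_j)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                # Analytical update of alpha_j, clipped to [L, H].
                alpha[j] = np.clip(a_j - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j) < 1e-7:
                    continue
                alpha[i] = a_i + y[i] * y[j] * (a_j - alpha[j])
                # Recompute the bias from whichever point sits on the margin.
                b1 = b - E_i - y[i]*(alpha[i]-a_i)*K[i,i] - y[j]*(alpha[j]-a_j)*K[i,j]
                b2 = b - E_j - y[i]*(alpha[i]-a_i)*K[i,j] - y[j]*(alpha[j]-a_j)*K[j,j]
                b = b1 if 0 < alpha[i] < C else b2 if 0 < alpha[j] < C else (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```

A prediction then uses $\text{sign}\left(\sum_{i=1}^M \alpha_i y^{(i)} \mathcal{K}(\mathbf{x}^{(i)}, \mathbf{x}) + b\right)$, exactly the dual expansion above.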

Classification Summary

  • Classification Task
    • Observation $\mathbf{x}$: typically a real vector of feature values, $\mathbf{x} \in \mathbb{R}^N$.
    • Class $y$: from a set of possible classes, e.g., $\mathcal{Y} = \{-1, +1\}$.
    • Goal: given an observation $\mathbf{x}$, predict the class $y$.
  • K-Nearest Neighbors (KNN)
    • Type: Discriminative.
    • Classes: Multi-class.
    • Decision Function: Non-linear.
    • Training: No training needed.
  • Bayes’ Classifier
    • Type: Generative.
    • Classes: Multi-class.
    • Decision Function: (generally) Non-linear.
    • Training: Estimate class-conditional densities $p(\mathbf{x} | y)$ by maximizing likelihood of data.
  • Logistic Regression
    • Type: Discriminative.
    • Classes: Binary.
    • Decision Function: Linear.
    • Training: Maximize likelihood of data in $p(y | \mathbf{x})$.
  • Support Vector Machine (SVM)
    • Type: Discriminative.
    • Classes: Binary.
    • Decision Function: Linear.
    • Training: Maximize the margin (distance between the decision surface and the closest point).
  • Kernel SVM
    • Type: Discriminative.
    • Classes: Binary.
    • Decision Function: Non-linear (kernel function).
    • Training: Maximize the margin.

Regularization and Overfitting

Some models include terms to prevent overfitting the training data, which can improve generalization to new data.

There is a parameter to control the regularization effect; select this parameter using cross-validation on the training set.
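A sketch of that selection using grid search (assuming scikit-learn; the dataset and grid values are arbitrary):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# 5-fold cross-validation on the training set over C and the RBF gamma.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```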

Other Things

  • Multi-class classification.

    • Can use binary classifiers for multi-class classification via the 1-vs-rest formulation.
  • Feature Normalization.

    • Normalize each feature dimension so that some feature dimensions with larger ranges do not dominate the optimization process.
  • Data Imbalance

    • If one class has more data, apply weights to each class to balance the objective.
  • Class Importance

    • Mistakes on some classes are more critical.
    • Reweight classes to focus the classifier on correctly predicting one class at the expense of others.
  • Applications

    • Web document classification, spam classification.
    • Face gender recognition, face detection, digit classification.
  • Features

    • Choice of features is important!
      • Using uninformative features may confuse the classifier.
      • Use domain knowledge to pick the best features to extract from the data.

Which Classifier is Best?

“No Free Lunch” Theorem (Wolpert and Macready)

If an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.

In other words, there is no best classifier for all tasks. The best classifier depends on the particular problem.