CS4487

Part 3 - Optimization and Convexity


Generative vs. Discriminative Classifiers

The Bayes optimal classifier, Naive Bayes and Linear Discriminant Analysis are said to be generative classifiers because they all explicitly model the joint distribution $p(\mathbf{x}, y)$ of the feature vectors $\mathbf{x}$ and the labels $y$.

To build a probabilistic classifier, all we really need to model is $p(y | \mathbf{x})$. However, classifiers based on directly estimating $p(y | \mathbf{x})$ are called discriminative classifiers because they do not care about the class conditional distribution $p(\mathbf{x} | y)$.

Logistic Regression

Recall that logistic regression is a probabilistic discriminative classifier. It directly models the posterior $p(y | \mathbf{x})$ as a sigmoid of a linear function (binary-classification case), so the log-odds are linear, $$ p(y = +1 | \mathbf{x}) = \frac{1}{1 + \exp(-(\mathbf{w}^T \mathbf{x} + b))} \newline \log\left(\frac{p(y = +1 | \mathbf{x})}{p(y = -1 | \mathbf{x})}\right) = \mathbf{w}^T \mathbf{x} + b $$

This gives the classification function, $$ f_{LR}(\mathbf{x}) = \underset{c \in \{1, \ldots, C\}}{\arg\max} \ p(y = c | \mathbf{x}). $$
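To make this concrete, here is a minimal NumPy sketch of the binary decision rule; the weights $\mathbf{w}$ and bias $b$ below are made-up values for illustration, not learned parameters.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: p(y = +1 | x) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters for a 2-feature problem (not learned values).
w = np.array([1.5, -0.5])
b = 0.2

x = np.array([0.8, 1.1])
p_pos = sigmoid(w @ x + b)           # p(y = +1 | x)
y_hat = +1 if p_pos >= 0.5 else -1   # arg max over the two classes
print(p_pos, y_hat)                  # ~0.70, +1
```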

However, the one drawback of logistic regression compared to the previous models we have covered is learning the parameters. For the generative models, the MLE could be computed analytically. Learning the parameters of logistic regression cannot be performed analytically, so we have to resort to numerical optimization.

Optimization

In fact, much of machine learning can be written as an optimization problem, $$ \underset{\mathbf{\theta}}{\min} \ \ell(\mathcal{D}; \mathbf{\theta}) = \underset{\mathbf{\theta}}{\min} \frac{1}{M} \sum_{i=1}^M \ell(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{\theta}) + r(\mathbf{\theta}), $$

where $\mathbf{\theta}$ is a parameter vector to be optimized.

$\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)}), i = 1, \ldots, M\}$ is a finite training set.

$r(\cdot)$ is a regularizer to prevent overfitting to $\mathcal{D}$.

In simple terms, we want to find the parameter vector $\mathbf{\theta}$ that minimizes the loss function $\ell(\cdot)$.

In logistic regression ($\mathbf{\theta} = \{ \mathbf{w}, b \}$), $$ \ell(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{\theta}) = \log(1 + \exp(-y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b))) \newline r(\mathbf{\theta}) = \frac{1}{C} \mathbf{w}^T \mathbf{w}. $$
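As a sketch, the per-example loss and regularizer can be computed directly; `np.logaddexp(0, z)` gives a numerically stable $\log(1 + e^z)$. All values below are made up for illustration.

```python
import numpy as np

def logistic_loss(x, y, w, b):
    """Per-example loss log(1 + exp(-y (w^T x + b))), computed stably
    via log(exp(0) + exp(z)) = logaddexp(0, z)."""
    return np.logaddexp(0.0, -y * (w @ x + b))

def l2_regularizer(w, C):
    """r(theta) = (1/C) w^T w."""
    return (w @ w) / C

# Made-up values for illustration.
w, b, C = np.array([1.5, -0.5]), 0.2, 10.0
x, y = np.array([0.8, 1.1]), +1
print(logistic_loss(x, y, w, b) + l2_regularizer(w, C))
```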

Types of Optimization

When dealing with optimization, we will often encounter two types of problems: convex and non-convex optimization.

  • Convex optimization
    • The easy case.
    • Includes logistic regression, linear regression, SVM.
  • Non-Convex optimization
    • Hard in general.
    • Includes deep learning.

Before we understand how we approach optimization, we need to understand what convexity is.

Convex Set

Recall that the line segment between $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ consists of all points, $$ \mathbf{x} = \alpha \mathbf{x}^{(1)} + (1 - \alpha) \mathbf{x}^{(2)}, \quad 0 \leq \alpha \leq 1. $$

So a convex set contains the line segment between any two points in the set, $$ \mathbf{x}^{(1)}, \mathbf{x}^{(2)} \in \chi, 0 \leq \alpha \leq 1 \Rightarrow \alpha \mathbf{x}^{(1)} + (1 - \alpha) \mathbf{x}^{(2)} \in \chi. $$

Hyperplanes and Halfspaces

A hyperplane is a set of the form, $$ \{ \mathbf{x} \mid \mathbf{a}^T \mathbf{x} = b \} \quad (\mathbf{a} \neq \mathbf{0}), $$

where $\mathbf{a}$ is the normal vector to the hyperplane.

A halfspace is a set of the form, $$ \{ \mathbf{x} \mid \mathbf{a}^T \mathbf{x} \leq b \} \quad (\mathbf{a} \neq \mathbf{0}). $$

Hyperplanes and halfspaces are convex sets: if $\mathbf{a}^T \mathbf{x}^{(1)} \leq b$ and $\mathbf{a}^T \mathbf{x}^{(2)} \leq b$, then $\mathbf{a}^T (\alpha \mathbf{x}^{(1)} + (1 - \alpha) \mathbf{x}^{(2)}) = \alpha \mathbf{a}^T \mathbf{x}^{(1)} + (1 - \alpha) \mathbf{a}^T \mathbf{x}^{(2)} \leq b$ for $0 \leq \alpha \leq 1$ (and similarly with equality for hyperplanes).

Operations that Preserve Convexity

When we’re dealing with optimization problems, we often need methods for establishing convexity of a set $\chi$.

First, apply the definition of a convex set $$ \mathbf{x}^{(1)}, \mathbf{x}^{(2)} \in \chi, 0 \leq \alpha \leq 1 \Rightarrow \alpha \mathbf{x}^{(1)} + (1 - \alpha) \mathbf{x}^{(2)} \in \chi. $$

Then, show that $\chi$ is obtained from simple convex sets (hyperplanes, halfspaces, etc.) by operations that preserve convexity.

  • Intersection
  • Affine functions
  • $\ldots$

Convex Function

With all of this, let’s define what a convex function is.

$f : \mathbb{R}^N \rightarrow \mathbb{R}$ is convex if $dom(f)$ is a convex set and,

$$ f(\alpha \mathbf{x}^{(1)} + (1 - \alpha) \mathbf{x}^{(2)}) \leq \alpha f(\mathbf{x}^{(1)}) + (1 - \alpha) f(\mathbf{x}^{(2)}) $$

for all $\mathbf{x}^{(1)}, \mathbf{x}^{(2)} \in dom(f)$ and $0 \leq \alpha \leq 1$.

It is also good to remember that $f$ is concave if $-f$ is convex.

Examples on $\mathbb{R}$

  • Convex
    • Affine: $ax + b$ on $\mathbb{R}$, for any $a, b \in \mathbb{R}$.
    • Exponential: $e^{ax}$, for any $a \in \mathbb{R}$.
    • Powers: $x^a$ on $\mathbb{R}_{++}$, for $a \geq 1$ or $a \leq 0$.
    • Powers of absolute value: $|x|^a$ on $\mathbb{R}$, for $a \geq 1$.
    • Negative entropy: $x \log x$ on $\mathbb{R}_{++}$.
  • Concave
    • Affine: $ax + b$ on $\mathbb{R}$, for any $a, b \in \mathbb{R}$.
    • Powers: $x^a$ on $\mathbb{R}_{++}$, for $0 \leq a \leq 1$.
    • Logarithm: $\log x$ on $\mathbb{R}_{++}$.

Examples on $\mathbb{R}^N$ and $\mathbb{R}^{M \times N}$

  • On $\mathbb{R}^N$
    • Affine function: $\mathbf{a}^T \mathbf{x} + b$.
    • Norms: $\Vert \mathbf{x} \Vert_p = \left(\sum_{j=1}^N |x_j|^p\right)^{1/p}$, for $p \geq 1$; $\Vert \mathbf{x} \Vert_{\infty} = \max_j |x_j|$.
  • On $\mathbb{R}^{M \times N}$ ($M \times N$ matrices)
    • Affine function: $$ f(\mathbf{X}) = \sum_{i = 1}^M \sum_{j = 1}^N A_{ij} X_{ij} + b = \text{tr}(\mathbf{A}^T \mathbf{X}) + b. $$

    • Spectral norm (maximum singular value): $$ f(\mathbf{X}) = \Vert \mathbf{X} \Vert_2 = \sigma_{\text{max}}(\mathbf{X}) = (\lambda_{\text{max}}(\mathbf{X}^T \mathbf{X}))^{1/2}. $$

Example: Quadratic Function

$$ f(x) = x^2 $$

Proof of convexity: $$ \begin{align*} (\alpha x + (1 - \alpha)y)^2 & = \alpha^2 x^2 + 2 \alpha (1 - \alpha) xy + (1 - \alpha)^2 y^2 \newline & = \alpha x^2 + (1 - \alpha) y^2 - \alpha(1 - \alpha)(x^2 - 2xy + y^2) \newline & = \alpha x^2 + (1 - \alpha) y^2 - \alpha(1 - \alpha)(x - y)^2 \newline & \leq \alpha x^2 + (1 - \alpha) y^2, \end{align*} $$ since $\alpha(1 - \alpha)(x - y)^2 \geq 0$ for $0 \leq \alpha \leq 1$.
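A quick numerical sanity check of the same inequality, using random points and random $\alpha$ (a sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

for _ in range(5):
    x, y = rng.normal(size=2)
    a = rng.uniform()                      # alpha in [0, 1]
    lhs = f(a * x + (1 - a) * y)           # f(convex combination)
    rhs = a * f(x) + (1 - a) * f(y)        # convex combination of f values
    assert lhs <= rhs + 1e-12              # convexity inequality holds
```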

Properties of Convex Functions

  • Any line segment we draw between two points on the graph lies on or above the graph.

  • Every local minimum is a global minimum.

    • A simple geometric intuition is that, if the curve goes above any line segment we draw, it is not convex, by definition.
    • This is what makes convex optimization easy. It suffices to find a local minimum because it is also a global minimum.
  • Non-negative combinations of convex functions are convex

    • $h(\mathbf{x}) = af(\mathbf{x}) + bg(\mathbf{x})$
  • Affine scaling of a convex function is convex

    • $h(\mathbf{x}) = f(A\mathbf{x} + \mathbf{b})$
  • Pointwise maximum of convex functions is convex

    • $h(\mathbf{x}) = \max(f(\mathbf{x}), g(\mathbf{x}))$
  • Composition of convex functions is not generally convex

    • Neural nets have this form: $h(\mathbf{x}) = f(g(\mathbf{x}))$.

First-Order Condition

$f$ is differentiable if the gradient, $$ \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_N} \end{bmatrix}^T $$

exists at each $\mathbf{x} \in dom(f)$.

The first-order condition is that a differentiable $f$ with convex domain is convex if and only if, $$ f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x}) \quad \forall \mathbf{x}, \mathbf{y} \in dom(f). $$

The first-order approximation of $f$ is a global underestimator.

Second-Order Condition

$f$ is twice differentiable if the Hessian $\nabla^2 f(\mathbf{x}) \in \mathbb{S}^N$, $$ \nabla^2 f(\mathbf{x})_{ij} = \frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j}, $$

exists at each $\mathbf{x} \in dom(f)$.

The second-order condition is that a twice-differentiable $f$ with convex domain is convex if and only if, $$ \nabla^2 f(\mathbf{x}) \succeq 0 \quad \forall \mathbf{x} \in dom(f). $$

This means that the Hessian matrix is positive semidefinite, $$ \mathbf{A} \succeq 0 \Leftrightarrow \forall \mathbf{x}, \mathbf{x}^T \mathbf{A} \mathbf{x} \geq 0. $$
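In code, positive semidefiniteness of a concrete symmetric matrix can be checked through its eigenvalues. A small sketch; the example matrix is the Hessian of the assumed function $f(\mathbf{x}) = x_1^2 + x_1 x_2 + x_2^2$:

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """Check A >= 0 (PSD) by testing that all eigenvalues of the
    symmetric matrix A are non-negative."""
    return np.all(np.linalg.eigvalsh(A) >= -tol)

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # Hessian of f(x) = x1^2 + x1*x2 + x2^2
print(is_psd(A))             # True: eigenvalues are 1 and 3
```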

Example: Quadratic

$$ f(x) = x^2 $$

Proof of convexity: $$ f''(x) = 2 \geq 0. $$

Example: Exponential

$$ f(x) = e^{x} $$

Proof of convexity: $$ f''(x) = e^{x} \geq 0. $$

Example: Logistic Loss

$$ f(x) = \log(1 + e^{-x}) $$

Proof of convexity: $$ f'(x) = \frac{-e^{-x}}{1 + e^{-x}} = -\frac{1}{1 + e^x} \newline f''(x) = \frac{e^x}{(1 + e^x)^2} \geq 0. $$
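A finite-difference sketch to sanity-check these derivatives numerically (the step size $h$ and the test points are arbitrary choices):

```python
import numpy as np

f = lambda x: np.log1p(np.exp(-x))                # logistic loss
fp = lambda x: -1.0 / (1.0 + np.exp(x))           # f'(x)
fpp = lambda x: np.exp(x) / (1.0 + np.exp(x))**2  # f''(x) >= 0

h = 1e-5
for x in [-2.0, 0.0, 3.0]:
    num_fp = (f(x + h) - f(x - h)) / (2 * h)           # central difference
    num_fpp = (f(x + h) - 2 * f(x) + f(x - h)) / h**2  # second difference
    assert abs(num_fp - fp(x)) < 1e-6
    assert abs(num_fpp - fpp(x)) < 1e-4
    assert fpp(x) >= 0   # second-order condition holds at each test point
```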

Parameter Estimation

As we said earlier, many machine learning optimizations are convex optimization problems.

So our goal is to minimize a convex loss function with respect to parameters. For example $\ell(\mathbf{\theta}) = \mathbf{\theta}^2$.

However, a closed-form minimizer often does not exist (recall logistic regression), and non-convex problems are harder still, so we resort to numerical methods.

Gradient Descent

The most basic optimization algorithm is gradient descent.

Gradient Descent (GD) uses the following update formula, $$ \mathbf{\theta} \leftarrow \mathbf{\theta} - \alpha \nabla \ell(\mathbf{\theta}), $$

where $\alpha$ is the learning rate.

$\nabla \ell(\mathbf{\theta})$ is the gradient of $\ell(\mathbf{\theta})$ evaluated at $\mathbf{\theta}$.

So, with our example $\ell(\mathbf{\theta}) = \mathbf{\theta}^2$, $\nabla \ell(\mathbf{\theta}) = 2\mathbf{\theta}$.

We can repeat this process until convergence, so GD first chooses an initial point $\mathbf{\theta}^{(0)}$ and repeats, $$ \mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} - \alpha \nabla \ell(\mathcal{D}; \mathbf{\theta}^{(t)}) = \mathbf{\theta}^{(t)} - \alpha \frac{1}{M} \sum_{i=1}^M \nabla \ell(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{\theta}^{(t)}). $$

How do we stop? We can stop when either of the following holds, $$ \Vert \mathbf{\theta}^{(t+1)} - \mathbf{\theta}^{(t)} \Vert_2 < \epsilon \newline \Vert \nabla \ell(\mathcal{D}; \mathbf{\theta}^{(t)}) \Vert_2 < \epsilon, $$

where $\Vert \mathbf{\theta} \Vert_2 = \sqrt{\sum_{j} \mathbf{\theta}_j^2}$, is the $\ell_2$-norm of $\mathbf{\theta}$.

$\epsilon$ is a small positive constant, e.g., $10^{-6}$.
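Putting this together, here is a minimal GD loop for the running example $\ell(\theta) = \theta^2$, using the gradient-norm stopping rule; the step size and iteration cap are arbitrary choices.

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, eps=1e-6, max_iter=10000):
    """Basic GD: theta <- theta - alpha * grad(theta); stop when the
    gradient norm falls below eps."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < eps:
            break
        theta = theta - alpha * g
    return theta

# Running example: loss(theta) = theta^2, so grad(theta) = 2 * theta.
theta_star = gradient_descent(lambda th: 2 * th, theta0=[5.0])
print(theta_star)  # close to 0, the global minimum
```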

An important fact to establish is that a negative gradient step decreases the loss function (for a sufficiently small step size).

Proof. Let $\mathbf{\theta}^{(0)}$ be any initial point. Using the first-order Taylor expansion around $\mathbf{\theta}^{(0)}$ for the next point $\mathbf{\theta}^{(1)} = \mathbf{\theta}^{(0)} - \alpha \nabla \ell(\mathbf{\theta}^{(0)})$, we have $$ \begin{align*} \ell(\mathbf{\theta}^{(1)}) & \approx \ell(\mathbf{\theta}^{(0)}) + \nabla \ell(\mathbf{\theta}^{(0)})^T (\mathbf{\theta}^{(1)} - \mathbf{\theta}^{(0)}) \newline & = \ell(\mathbf{\theta}^{(0)}) + \nabla \ell(\mathbf{\theta}^{(0)})^T (-\alpha \nabla \ell(\mathbf{\theta}^{(0)})) \newline & = \ell(\mathbf{\theta}^{(0)}) - \alpha \Vert \nabla \ell(\mathbf{\theta}^{(0)}) \Vert_2^2 \end{align*} $$

Hence, if $\nabla \ell(\mathbf{\theta}^{(0)}) \neq 0$ and $\alpha$ is sufficiently small (so the approximation is accurate), we have, $$ \ell(\mathbf{\theta}^{(1)}) - \ell(\mathbf{\theta}^{(0)}) \approx -\alpha \Vert \nabla \ell(\mathbf{\theta}^{(0)}) \Vert_2^2 < 0. $$

For sufficiently small $\alpha$, $\mathbf{\theta}^{(1)}$ is an improvement over $\mathbf{\theta}^{(0)}$.

The initial point matters: the point to which GD converges can depend heavily on where we start.

To improve optimization, run GD multiple times with different initial points and choose the best one.

The learning rate matters as well: if it is too large, GD takes big steps and moves quickly, but with a high risk of overshooting so that the algorithm never converges.

If it is too small, GD moves slowly, so slowly that you might just lose patience and never reach the minimum.

Stochastic Gradient Descent

The computational cost for GD at each iteration is $\mathcal{O}(M)$, which grows linearly with $M$.

SGD reduces the computational cost of each iteration to $\mathcal{O}(1)$.

At the $t$-th iteration, randomly sample a single training example $(\mathbf{x}^{(i)}, y^{(i)})$ and compute its gradient to update $\mathbf{\theta}^{(t)}$, $$ \mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} - \alpha \nabla \ell(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{\theta}^{(t)}). $$

In expectation, the stochastic gradient equals the full gradient, so SGD converges to the same solution as GD, but with more noise, $$ \mathbb{E}[\nabla \ell(\mathbf{x}, y; \mathbf{\theta})] = \frac{1}{M} \sum_{i=1}^M \nabla \ell(\mathbf{x}^{(i)}, y^{(i)}; \mathbf{\theta}) = \nabla \ell(\mathcal{D}; \mathbf{\theta}). $$
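A minimal SGD sketch on a toy objective; the toy problem, step size, and epoch count are assumptions for illustration.

```python
import numpy as np

def sgd(grad_i, theta0, M, alpha=0.01, epochs=500, seed=0):
    """SGD: at each step, use the gradient of one randomly sampled example.
    grad_i(i, theta) returns the gradient of the i-th per-example loss."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs * M):
        i = rng.integers(M)                       # sample one example
        theta = theta - alpha * grad_i(i, theta)  # O(1) update
    return theta

# Toy problem: minimize (1/M) sum_i (theta - x_i)^2; the optimum is mean(x).
x = np.array([1.0, 2.0, 3.0, 4.0])
print(sgd(lambda i, th: 2 * (th - x[i]), theta0=0.0, M=len(x)))  # ~2.5, noisy
```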

  • Advantages
    • It is easy to fit into memory and computationally fast due to a single training sample being processed at each iteration.
    • The steps have oscillations, which may help the algorithm get out of bad local minima.
  • Disadvantages
    • The steps are noisy. The objective does not always decrease for each step, and it may take longer to achieve convergence.
    • It loses the advantage of vectorized operations as it deals with only a single example at a time.

Mini-Batch Gradient Descent: A Compromise

Unlike SGD, mini-batch gradient descent randomly samples a small subset of the training data $\mathcal{B} \subset \mathcal{D}$ at the $t$-th iteration and computes the average gradient over it to update $\mathbf{\theta}^{(t)}$, $$ \mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} - \alpha \nabla \ell(\mathcal{B}; \mathbf{\theta}^{(t)}) = \mathbf{\theta}^{(t)} - \alpha \frac{1}{|\mathcal{B}|} \sum_{(\mathbf{x}, y) \in \mathcal{B}} \nabla \ell(\mathbf{x}, y; \mathbf{\theta}^{(t)}). $$

Mini-batch gradient descent and its variants are extremely popular and extensively used in deep learning.
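A sketch of a basic mini-batch loop that shuffles once per epoch; the batch size, step size, and toy objective are assumptions (practical deep-learning variants add momentum, adaptive step sizes, etc.).

```python
import numpy as np

def minibatch_gd(grad_batch, theta0, M, batch_size=32, alpha=0.1,
                 epochs=10, seed=0):
    """Mini-batch GD: each step averages gradients over a random subset B.
    grad_batch(idx, theta) returns the average gradient over examples idx."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        perm = rng.permutation(M)                 # shuffle once per epoch
        for start in range(0, M, batch_size):
            batch = perm[start:start + batch_size]
            theta = theta - alpha * grad_batch(batch, theta)
    return theta

# Same toy objective as before; the gradient is averaged over the batch.
x = np.linspace(0.0, 5.0, 100)
g = lambda idx, th: np.mean(2 * (th - x[idx]))
print(minibatch_gd(g, theta0=0.0, M=len(x), batch_size=10))  # ~ mean(x) = 2.5
```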

Summary: Convex Optimization and Numerical Methods

  • Convex optimization
    • Convex set
      • $\mathbf{x}^{(1)}, \mathbf{x}^{(2)} \in \chi, 0 \leq \alpha \leq 1 \Rightarrow \alpha \mathbf{x}^{(1)} + (1 - \alpha) \mathbf{x}^{(2)} \in \chi$.
    • Convex function
      • $dom(f)$ is convex
      • $f(\alpha \mathbf{x} + (1 - \alpha) \mathbf{y}) \leq \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y})$.
    • Establish convexity
      • By definition
      • First-order condition $f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x})$.
      • Second-order condition $\nabla^2 f(\mathbf{x}) \succeq 0$.
  • Numerical optimization methods
    • GD - guaranteed to decrease the loss (for a sufficiently small learning rate); not scalable to large datasets.
    • SGD - superfast to compute with noisy gradients.
    • Mini-batch GD - a compromise between GD and SGD; the de facto method in deep learning.

Support Vector Machines

So far we have only seen one example of a discriminative linear classifier: logistic regression, which uses a maximum likelihood framework to learn the separating hyperplane.

The Support Vector Machine (SVM) is another example of a discriminative linear classifier. However, it takes a purely geometric approach.

Linearly Separable Data

For now, assume that the training data are linearly separable, i.e., the two classes in the training data can be separated by a line (hyperplane).

There are many “correct” solutions, but which is the “best” separating line?

Maximum Margin Principle

We can define the distance between the separating line and the closest point as the margin. Think of this space as the “amount of wiggle room” for accommodating errors in estimating $\mathbf{w}$.

Idea: the best separating line is the one that maximizes the margin. I.e. puts the most distance between the closest points and the decision boundary.

The solution is quite intuitive if you think about it geometrically.

By symmetry, there should be at least one margin point on each side of the boundary. The points on the margins are called the support vectors; they support (define) the margin.

Computing the Margin

What is the margin (i.e., distance) of $\mathbf{w}^T \mathbf{x} + b = 0$ with respect to a point $(\mathbf{x}^{(i)}, y^{(i)})$? $$ d^{(i)} = \frac{y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b)}{\Vert \mathbf{w} \Vert_2}. $$

This is called the geometric margin. The numerator $y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b)$ is called the functional margin.

What is the margin of $\mathbf{w}^T \mathbf{x} + b = 0$ with respect to a training set $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)}), i = 1, \ldots, M\}$? $$ d = \min\{d^{(i)}\}_{i=1}^M = \underset{(\mathbf{x}, y) \in \mathcal{D}} \min \frac{y(\mathbf{w}^T \mathbf{x} + b)}{\Vert \mathbf{w} \Vert_2}. $$
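Both definitions translate directly into code; a small sketch with made-up data and parameters:

```python
import numpy as np

def geometric_margin(X, y, w, b):
    """Per-point margins d_i = y_i (w^T x_i + b) / ||w||_2, and their minimum
    (the margin of the hyperplane w.r.t. the whole training set)."""
    d = y * (X @ w + b) / np.linalg.norm(w)
    return d, d.min()

# Made-up linearly separable points with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

d, d_min = geometric_margin(X, y, w, b)
print(d, d_min)   # smallest per-point margin defines the dataset margin
```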

Maximizing the Margin

The maximum margin solution is found by solving, $$ \underset{\mathbf{w}, b} \max\left(\frac{1}{\Vert \mathbf{w} \Vert_2} \underset{(\mathbf{x}, y) \in \mathcal{D}} \min y(\mathbf{w}^T \mathbf{x} + b)\right). $$

Notice that if we make the rescaling $\mathbf{w} \rightarrow \gamma \mathbf{w}$ and $b \rightarrow \gamma b$, then the objective function is unchanged.

We can use this freedom to set, $$ \underset{(\mathbf{x}, y) \in \mathcal{D}} \min y(\mathbf{w}^T \mathbf{x} + b) = 1. $$

Or equivalently, $$ y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b) \geq 1 \quad \forall i = 1, \ldots, M. $$

SVM (Primal Form, Hard-Margin)

Given a training set $\mathcal{D} = \{\mathbf{x}^{(i)}, y^{(i)}\}_{i=1}^M$, optimize, $$ \underset{\mathbf{w}, b} \min \frac{1}{2} \Vert \mathbf{w} \Vert_2^2 \newline \text{subject to } y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b) \geq 1 \quad \forall i = 1, \ldots, M. $$

The objective minimizes the inverse of the margin distance, i.e., maximizes the geometric margin.

The inequality constraints ensure that all points are either on or outside the functional margin.

With this, given a new data point $\mathbf{x}^{\star}$, we use the sign of the linear function to predict its class, $$ y^{\star} = \text{sign}(\mathbf{w}^T \mathbf{x}^{\star} + b). $$

Why is Maximizing the Margin Good?

The true $\mathbf{w}$ is uncertain. Maximizing the margin allows the most uncertainty (wiggle room) for $\mathbf{w}$, while keeping all the points correctly classified.

The data points are also uncertain. Maximizing the margin allows the most wiggle of the points, while keeping all the points correctly classified.

Is Hard-Margin Enough?

The hard-margin SVM is very sensitive to outliers: if we train an SVM on an “ideal” dataset, it may perform very poorly on a new but different dataset that perhaps reflects real-world data better.

So what is the solution to make our classifier more robust?

Allow some training samples to violate the margin, i.e., to lie inside the margin (or even be misclassified).

We define the “slack” variable $\xi_i \geq 0$, where $\xi_i = 0$ means the sample is outside of the margin area (no slack) and $\xi_i > 0$ means the sample violates the margin (slack).

SVM (Primal Form, Soft-Margin)

So now we instead optimize, $$ \underset{\mathbf{w}, b, \mathbf{\xi}} \min \frac{1}{2} \Vert \mathbf{w} \Vert_2^2 + C \sum_{i=1}^M \xi_i \newline \text{subject to } y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall i = 1, \ldots, M. $$

This introduces a parameter $C$ that penalizes violating the margin. A smaller value of $C$ means we allow more violation (less penalty); a larger value of $C$ means we allow less violation (more penalty).

Loss Function

The primal-form SVM with soft margin is equivalent to minimizing the unconstrained objective, $$ \sum_{i = 1}^M \max(0, 1 - y^{(i)}(\mathbf{w}^T \mathbf{x}^{(i)} + b)) + \frac{1}{2C} \Vert \mathbf{w} \Vert_2^2. $$

This is called the hinge loss, $$ \ell_h(z) = \max(0, 1 - z). $$
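Since this unconstrained objective is convex, it can be minimized with (sub)gradient descent. Below is a minimal sketch, not how SVM solvers typically work in practice (dedicated QP solvers are standard); the toy data, step size, and value of $C$ are made-up choices.

```python
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, alpha=0.01, iters=1000, seed=0):
    """Minimize sum_i max(0, 1 - y_i (w^T x_i + b)) + (1/(2C)) ||w||^2
    by subgradient descent on (w, b)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    w, b = rng.normal(size=N) * 0.01, 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                  # points violating the margin
        # Subgradient of the hinge term plus gradient of the regularizer.
        grad_w = -(y[active, None] * X[active]).sum(axis=0) + w / C
        grad_b = -y[active].sum()
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Toy separable data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = svm_subgradient_descent(X, y)
print(np.sign(X @ w + b))  # should match y
```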

Zero-One Loss

Both the logistic loss and the hinge loss are convex relaxations of the zero-one loss, $$ \ell_{01}(z) = \mathbb{I}[z \leq 0]. $$

The average zero-one loss over a data set is exactly the classification error rate.

This is the loss function we’d like to minimize, but this generally isn’t computationally feasible, thus the need for surrogate loss functions.

Multiclass SVM

If we have multiple classes to classify, we can use different approaches to solve the objective.

  • “1-vs-1” multiclass classification
    • Train binary classifiers on all pairs of classes.
      • 3-class example: 1 vs 2, 1 vs 3, 2 vs 3.
    • To label a sample, pick the class with the most votes among the binary classifiers.
    • Problem: 1-vs-1 classification is very slow when there are many classes.
      • If there are $C$ classes, we need to train $C(C - 1)/2$ binary classifiers.
  • “1-vs-rest” classification
    • Train 1-vs-rest binary classifiers
      • 3-class example: 1 vs {2, 3}, 2 vs {1, 3}, 3 vs {1, 2}.
    • To label a sample, pick the class with the largest geometric margin.
      • $y^{\star} = \underset{c \in \{1, \ldots, C\}}{\arg \max} (\mathbf{w}^T_c \mathbf{x}^{\star} + b_c) / \Vert \mathbf{w}_c \Vert_2$ (see the sketch below).
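A sketch of the 1-vs-rest decision rule, assuming the per-class parameters $(\mathbf{w}_c, b_c)$ have already been trained; the values below are made up.

```python
import numpy as np

def one_vs_rest_predict(W, b, x):
    """Pick the class with the largest normalized margin
    (w_c^T x + b_c) / ||w_c||_2.  W has one row per class."""
    scores = (W @ x + b) / np.linalg.norm(W, axis=1)
    return int(np.argmax(scores))

# Made-up parameters for a 3-class, 2-feature problem.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.0])
print(one_vs_rest_predict(W, b, np.array([2.0, 0.5])))  # class 0
```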

Summary: SVM

  • Classifier
    • Linear Function $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$.
    • Given a new sample $\mathbf{x}^{\star}$, predict $y^{\star} = \text{sign}(\mathbf{w}^T \mathbf{x}^{\star} + b)$.
  • Training
    • Maximize the margin of the training data
      • I.e., maximize the separation between the points and the decision boundary.
    • Allow some training samples to violate the margin.
      • Use cross-validation to pick the hyperparameter $C$.
  • Linear classifiers
    • Separate the data using a linear surface (hyperplane).
    • $y = \text{sign}(\mathbf{w}^T \mathbf{x} + b)$.
  • Two formulations
    • Logistic regression - maximize the probability of the data.
    • Support Vector Machine - maximize the margin of the hyperplane.
  • Loss functions
    • Logistic regression - push the classification boundary as far as possible from all points.
    • Support Vector Machine - ensure a margin of 1 between boundary and closest point.
  • Advantages
    • SVM works well on high-dimensional features ($N$ large), and has low generalization error.
    • LR has well-defined probabilities.
  • Disadvantages
    • Decision surface can only be linear.