CS4487

Part 5 - Regression

The Regression Task

Definition of the regression task,

Given a feature vector $\mathbf{x} \in \mathbb{R}^N$, predict its corresponding output value $y \in \mathbb{R}$.

Example: Curve Fitting

Given $M$ points sampled from some underlying curve (assuming no measurement noise), the goal is to find that curve.

But let’s first think about this problem.

  1. Would a small handful of points work well, or is it the more the better?
  2. Given that the number of points is fixed, do their locations matter?
  3. Given a set of $M$ points, is the curve fit unique?
  4. Lastly, how do we determine a “seemingly best” curve, compared to other possibilities?

I would say the answers to 1-3 are quite obvious, but the fourth one is the deal-breaker.

Answers

  1. We always like more samples, if possible.
  2. Samples need to be representative or informative.
  3. Estimating a continuous model from discrete data is an ill-posed problem; in most cases we cannot be certain that the fit is unique.
  4. You need some criterion to choose your preferred model.

The Regression Learning Problem

Definition of the regression learning problem,

Given a data set of example pairs $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)}), i = 1, \ldots, M \}$ where $\mathbf{x}^{(i)} \in \mathbb{R}^N$ is a feature vector and $y^{(i)} \in \mathbb{R}$ is the output, learn a function $f : \mathbb{R}^N \mapsto \mathbb{R}$ that accurately predicts $y$ for any feature vector $\mathbf{x}$.

Okay, so we have our goal defined. Let’s define our error.

Definition of error measure, in this case the mean squared error (MSE),

Given a data set of example pairs $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)}), i = 1, \ldots, M \}$ and a function $f : \mathbb{R}^N \mapsto \mathbb{R}$, the MSE of $f$ on $\mathcal{D}$ is defined as, $$ MSE(\mathcal{D}, f) = \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - f(\mathbf{x}^{(i)}))^2 $$
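As a quick sanity check, here is a minimal NumPy sketch of this MSE computation (the function name `mse` and the toy predictor are my own, not from the notes):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Toy example: predictor f(x) = 2x compared against slightly noisy targets.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 2.2, 3.9, 6.1])
print(mse(y, 2 * x))  # small value means f fits D well
```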

Linear Regression

In the $1$-D case, the output $y$ is a linear function of the input feature $x$. I.e., $$ y = wx + b $$ where $w$ is the slope and $b$ is the intercept.

Let’s generalize this to $N$ dimensions.

In the $N$-D case, the output $y$ is a linear combination of $N$ input features $x_1, x_2, \ldots, x_N$. I.e., $$ y = w_1x_1 + w_2x_2 + \ldots + w_Nx_N + w_0 $$

Or equivalently, $$ y = \mathbf{w}^T \mathbf{x} + w_0 = \sum_{i=1}^{N} w_i x_i + w_0 = \sum_{i=0}^{N} w_i x_i \ \bigg\rvert \ x_0 = 1 $$

$\mathbf{x} \in \mathbb{R}^{N}$ is the vector of input values.

$\mathbf{w} \in \mathbb{R}^{N}$ are the weights of the linear function and $w_0$ is the intercept (bias term).
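In code, the $N$-D linear model is just a dot product. A minimal sketch (function names are my own), including the $x_0 = 1$ trick that folds the bias into the weight vector:

```python
import numpy as np

def predict(w, b, x):
    """y = w^T x + b for a single feature vector x."""
    return np.dot(w, x) + b

def predict_with_bias_trick(w_aug, x):
    """Same prediction with w_aug = [w_0, w_1, ..., w_N] and x_0 = 1 prepended to x."""
    return np.dot(w_aug, np.concatenate(([1.0], x)))

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([3.0, 1.0])
assert np.isclose(predict(w, b, x),
                  predict_with_bias_trick(np.concatenate(([b], w)), x))
```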

Ordinary Least Squares (OLS)

The linear function has form $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$.

How do we estimate the parameters $(\mathbf{w}, b)$ from the data?

Ordinary Least Squares (OLS) selects the linear regression parameters to minimize the MSE on the training data set. $$ \mathbf{w}^{\star}, b^{\star} = \underset{\mathbf{w}, b}{\arg \min} \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} - b)^2 $$

Solving OLS For One Feature

Let’s solve this for one feature, i.e., take the derivative of the objective with respect to $w$ and set it to zero. $$ \frac{\partial}{\partial w} \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - w x^{(i)} - b)^2 = 0 \newline 2 \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - w x^{(i)} - b)(-x^{(i)}) = 0 \newline \left( \sum_{i=1}^{M} (x^{(i)})^2 \right) w + \left( \sum_{i=1}^{M} x^{(i)} \right) b = \sum_{i=1}^{M} y^{(i)} x^{(i)} $$

Now, let’s do the same for the intercept $b$. $$ \frac{\partial}{\partial b} \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - w x^{(i)} - b)^2 = 0 \newline 2 \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - w x^{(i)} - b)(-1) = 0 \newline \left( \sum_{i=1}^{M} x^{(i)} \right) w + M b = \sum_{i=1}^{M} y^{(i)} $$

We can now write the two equations in matrix form, $$ \begin{bmatrix} \sum_{i=1}^{M} (x^{(i)})^2 & \sum_{i=1}^{M} x^{(i)} \newline \sum_{i=1}^{M} x^{(i)} & M \end{bmatrix} \begin{bmatrix} w \newline b \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{M} y^{(i)} x^{(i)} \newline \sum_{i=1}^{M} y^{(i)} \end{bmatrix} $$

Solving for the parameters, $$ \begin{bmatrix} w \newline b \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{M} (x^{(i)})^2 & \sum_{i=1}^{M} x^{(i)} \newline \sum_{i=1}^{M} x^{(i)} & M \end{bmatrix}^{-1} \begin{bmatrix} \sum_{i=1}^{M} y^{(i)} x^{(i)} \newline \sum_{i=1}^{M} y^{(i)} \end{bmatrix} $$
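Translating the $2 \times 2$ system above directly into NumPy gives a short sketch (variable and function names are my own):

```python
import numpy as np

def ols_1d(x, y):
    """Solve the 2x2 normal equations for the slope w and intercept b."""
    M = len(x)
    A = np.array([[np.sum(x ** 2), np.sum(x)],
                  [np.sum(x),      M        ]])
    rhs = np.array([np.sum(x * y), np.sum(y)])
    w, b = np.linalg.solve(A, rhs)   # prefer solve() over forming the inverse
    return w, b

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0                    # exact line, so we expect w = 2, b = 1
print(ols_1d(x, y))
```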

General OLS Solution

In matrix notation, $\mathbf{X} \in \mathbb{R}^{M \times (N+1)}$ with one data case $\mathbf{x}^{(i)} \in \mathbb{R}^{N+1}$ per row.

$$ \mathbf{X} = \begin{bmatrix} — & (\mathbf{x}^{(1)})^T & — \newline — & (\mathbf{x}^{(2)})^T & — \newline & \vdots & \newline — & (\mathbf{x}^{(M)})^T & — \end{bmatrix}, $$

where $\mathbf{x}^{(i)} \in \mathbb{R}^{N+1}$ and we have defined $x_0^{(i)} = 1$.

$\mathbf{y} \in \mathbb{R}^{M}$ is a column vector containing the corresponding outputs, $$ \mathbf{y} = \begin{bmatrix} y^{(1)} \newline y^{(2)} \newline \vdots \newline y^{(M)} \end{bmatrix} $$

and $\mathbf{w} \in \mathbb{R}^{N+1}$, where $w_0 = b$.

This yields the general solution for $\mathbf{w} \in \mathbb{R}^{N+1}$, $$ \begin{align*} \mathbf{w}^{\star} & = \underset{\mathbf{w}}{\arg \min} \frac{1}{M} \sum_{i=1}^{M} (y^{(i)} - (\mathbf{x}^{(i)})^T \mathbf{w})^2 \newline & = \underset{\mathbf{w}}{\arg \min} \frac{1}{M} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) \end{align*} $$

Taking the derivative with respect to $\mathbf{w}$ and setting it to zero, $$ \begin{align*} \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) & = 0 \newline \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (\mathbf{y}^T - (\mathbf{X} \mathbf{w})^T)(\mathbf{y} - \mathbf{X} \mathbf{w}) & = 0 \newline \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (\mathbf{y}^T \mathbf{y} - \mathbf{y}^T \mathbf{X} \mathbf{w} - (\mathbf{X} \mathbf{w})^T \mathbf{y} + (\mathbf{X} \mathbf{w})^T \mathbf{X} \mathbf{w}) & = 0 \newline \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (\mathbf{y}^T \mathbf{y} - \mathbf{y}^T \mathbf{X} \mathbf{w} - \mathbf{w}^T \mathbf{X}^T \mathbf{y} + \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w}) & = 0 \ \bigg\rvert \ (AB)^T = B^T A^T \newline \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (\mathbf{y}^T \mathbf{y} - \mathbf{y}^T \mathbf{X} \mathbf{w} - \mathbf{y}^T \mathbf{X} \mathbf{w} + \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w}) & = 0 \ \bigg\rvert \ \mathbf{w}^T \mathbf{X}^T \mathbf{y} = (\mathbf{y}^T \mathbf{X} \mathbf{w})^T \ \text{is a scalar} \newline \frac{\partial}{\partial \mathbf{w}} \frac{1}{M} (\mathbf{y}^T \mathbf{y} - 2 \mathbf{y}^T \mathbf{X} \mathbf{w} + \mathbf{w}^T \mathbf{X}^T \mathbf{X} \mathbf{w}) & = 0 \newline \end{align*} $$

Now, the first term is a constant, so we can ignore it. For the second term, we can use the identity, $$ \frac{\partial}{\partial \mathbf{w}} A^T \mathbf{w} = A $$

For the third term, we can use the identity (valid when $A$ is symmetric, i.e., $A = A^T$, which holds here since $A = \mathbf{X}^T \mathbf{X}$), $$ \frac{\partial}{\partial \mathbf{w}} \mathbf{w}^T A \mathbf{w} = 2 A \mathbf{w} $$

Which yields,

$$ \begin{align*} \frac{2}{M} (-\mathbf{X}^T \mathbf{y} + \mathbf{X}^T \mathbf{X} \mathbf{w}) & = 0 \newline \mathbf{X}^T \mathbf{X} \mathbf{w} & = \mathbf{X}^T \mathbf{y} \newline \mathbf{w}^{\star} & = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \end{align*} $$
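A minimal NumPy sketch of this closed form (the bias column is prepended so that $w_0 = b$; the explicit pseudo-inverse below simply mirrors the formula, while `np.linalg.lstsq` is the numerically safer route in practice):

```python
import numpy as np

def ols(X, y):
    """w* = (X^T X)^{-1} X^T y, with a column of ones for the bias w_0."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x_0 = 1 for every case
    w = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y        # mirrors the closed form
    # In practice: w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.01 * rng.normal(size=50)
print(ols(X, y))  # approximately [3.0, 1.0, -2.0, 0.5]
```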

Connection to Probabilistic Models

The same solution can be derived as the MLE of the parameters under a conditional Gaussian model.

Assumption: the output (or observation) $y$ comes from a deterministic linear function with additive Gaussian noise. Stacking all $M$ cases, $$ \mathbf{y} = \mathbf{X} \mathbf{w} + \boldsymbol{\epsilon}, \ \text{where} \ p(\boldsymbol{\epsilon}; \sigma^2) = \mathcal{N}(\boldsymbol{\epsilon}; \mathbf{0}, \sigma^2 \mathbf{I}) $$

This implies that, $$ p(\mathbf{y} | \mathbf{X}; \mathbf{w}, \sigma^2) = p(\mathbf{y} - \mathbf{X} \mathbf{w}; \sigma^2) = \mathcal{N}(\mathbf{y} - \mathbf{X} \mathbf{w}; 0, \sigma^2 \mathbf{I}) = \mathcal{N}(\mathbf{y}; \mathbf{X} \mathbf{w}, \sigma^2 \mathbf{I}) $$

Or equivalently, $$ p(\mathbf{y} | \mathbf{X}; \mathbf{w}, \sigma^2) = \frac{1}{\sqrt{(2\pi \sigma^2)^{M}}} \exp \left( -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) \right) $$

Let’s maximize the log likelihood, $$ \begin{align*} \underset{\mathbf{w}, \sigma^2}{\arg \max} \log p(\mathbf{y} | \mathbf{X}; \mathbf{w}, \sigma^2) & = \underset{\mathbf{w}, \sigma^2}{\arg \max} -\frac{M}{2} \log(2\pi) - \frac{M}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) \newline & = \underset{\mathbf{w}, \sigma^2}{\arg \min} \frac{M}{2} \log(\sigma^2) + \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) \end{align*} $$

Note that solving for $\mathbf{w}$ is independent of $\sigma^2$. $$ \mathbf{w}^{\star} = \underset{\mathbf{w}}{\arg \min} \frac{1}{2} (\mathbf{y} - \mathbf{X} \mathbf{w})^T (\mathbf{y} - \mathbf{X} \mathbf{w}) $$
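For completeness (a standard step not spelled out above): plugging $\mathbf{w}^{\star}$ back in and setting the derivative with respect to $\sigma^2$ to zero gives the noise-variance MLE, $$ \frac{\partial}{\partial \sigma^2} \left[ \frac{M}{2} \log(\sigma^2) + \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X} \mathbf{w}^{\star})^T (\mathbf{y} - \mathbf{X} \mathbf{w}^{\star}) \right] = 0 \implies (\sigma^2)^{\star} = \frac{1}{M} (\mathbf{y} - \mathbf{X} \mathbf{w}^{\star})^T (\mathbf{y} - \mathbf{X} \mathbf{w}^{\star}), $$ i.e., the MLE of the noise variance is just the training MSE of the fitted model.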

More on Linear Regression

Take a closer look at the closed-form solution, $$ \mathbf{w}^{\star} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$

We call $\mathbf{X}^{\dagger} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$ the pseudo-inverse (a generalization of the inverse to non-square matrices). Note that for a square invertible matrix $\mathbf{X}$ we recover the ordinary inverse: $\mathbf{X}^{\dagger} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T = \mathbf{X}^{-1} (\mathbf{X}^T)^{-1} \mathbf{X}^T = \mathbf{X}^{-1}$.

What if $\mathbf{X}^T \mathbf{X} \in \mathbb{R}^{(N+1) \times (N+1)}$ is non-invertible?

  • If $M < N + 1$, add more data or enforce regularization.
  • If $M \geq N + 1$, remove redundant features or enforce regularization.

What if $N$ is too large? Then computing $(\mathbf{X}^T \mathbf{X})^{-1}$ becomes expensive, and we can use gradient descent instead, $$ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha \mathbf{X}^T (\mathbf{X} \mathbf{w}^{(t)} - \mathbf{y}) $$

What if $M$ is too large? Use mini-batch gradient descent, where $\mathcal{B} \subset \mathcal{D}$ is a randomly sampled mini-batch. $$ \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha \sum_{(\mathbf{x}, y) \in \mathcal{B}} \mathbf{x} (\mathbf{x}^T \mathbf{w}^{(t)} - y) $$
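A rough NumPy sketch of the mini-batch update (the learning rate, batch size, and epoch count are arbitrary illustrative choices; full-batch gradient descent is the special case `batch_size = M`):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, batch_size=16, epochs=200, seed=0):
    """Mini-batch gradient descent on the squared-error objective."""
    rng = np.random.default_rng(seed)
    M, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        idx = rng.permutation(M)                 # reshuffle each epoch
        for start in range(0, M, batch_size):
            B = idx[start:start + batch_size]
            grad = X[B].T @ (X[B] @ w - y[B])    # sum of per-sample gradients
            w -= alpha * grad
    return w
```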

Regularized Linear Regression

As we have discussed previously, regularization is a technique to prevent overfitting. It also yields numerically more stable solutions (e.g., when $\mathbf{X}^T\mathbf{X}$ is not invertible) and lets us encode a prior over the desired parameter space (e.g., select a subset of features that are good for prediction by forcing the weights of irrelevant features to zero).

But the price we pay for having regularization is that the regularization term will bias the least squares estimate.

Ridge Regression

Let’s add regularization to OLS, $$ \underset{\mathbf{w}}{\min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \frac{\alpha}{2} \Vert \mathbf{w} \Vert_2^2 $$

The first term is the data-fit term: the sum of squared prediction errors.

The second term is the regularization term: $\Vert \mathbf{w} \Vert_2^2 = \sum_{j=1}^{N} w_j^2$ penalizes large weights (the bias $w_0$ is usually left unpenalized).

$\alpha$ is the hyperparameter that controls the amount of “shrinkage”. Larger $\alpha$ means more shrinkage, $\alpha = 0$ is OLS.

This also has a probabilistic interpretation: ridge regression is the MAP estimate of $\mathbf{w}$ under a Gaussian prior, $$ p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}; \alpha, \sigma^2) \propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}; \sigma^2) \, p(\mathbf{w}; \alpha) $$
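Concretely (a standard derivation, filling in the prior that is left implicit above): taking a zero-mean Gaussian prior $p(\mathbf{w}; \alpha) = \mathcal{N}(\mathbf{w}; \mathbf{0}, \frac{\sigma^2}{\alpha} \mathbf{I})$, the MAP estimate is $$ \begin{align*} \mathbf{w}_{MAP} & = \underset{\mathbf{w}}{\arg \max} \ \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}; \sigma^2) + \log p(\mathbf{w}; \alpha) \newline & = \underset{\mathbf{w}}{\arg \min} \ \frac{1}{2\sigma^2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \frac{\alpha}{2\sigma^2} \Vert \mathbf{w} \Vert_2^2, \end{align*} $$ which is exactly the ridge objective up to the constant factor $1/\sigma^2$.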

But, Why?

Why should we penalize large weights in the first place?

Equivalently, why are smaller weights (i.e., weights that are close to zero) more preferred than larger weights?

  • Reason 1: Smaller weights are more robust to perturbations of input features.

    • Consider $f_1(x) = 2x + 0.1$ and $f_2(x) = 100x-98$.
    • If we add a small perturbation to $x \rightarrow x + 0.1$.
    • Then, $\Delta f_1(x) = 0.2$ and $\Delta f_2(x) = 10$.
  • Reason 2: There is a better chance of zeroing out input features that are redundant or uninformative, leading to more accurate prediction of $y$.

Ridge Regression Closed-Form Solution

The closed-form solution for Ridge regression is, $$ \mathbf{w}^{\star} = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} $$

The term “ridge regression” comes from the closed-form solution, where a “ridge” is added to the diagonal of $\mathbf{X}^T \mathbf{X}$.
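A minimal NumPy sketch of this closed form (function name my own; this simple version also penalizes the bias column, which in practice is usually left unpenalized):

```python
import numpy as np

def ridge(X, y, alpha=1.0):
    """w* = (X^T X + alpha I)^{-1} X^T y  (bias column included and penalized)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # x_0 = 1
    D = Xb.shape[1]
    # Solve the regularized normal equations; the alpha*I "ridge" keeps them invertible.
    return np.linalg.solve(Xb.T @ Xb + alpha * np.eye(D), Xb.T @ y)
```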

Lasso Regression

With ridge regression, some weights become small but remain non-zero: these features contribute little, yet they are never actually removed from the model.

To get a better shrinkage to zero, we can change the regularization term to encourage more weights to be exactly zero.

The Lasso regression is defined as, $$ \underset{\mathbf{w}}{\min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \alpha \Vert \mathbf{w} \Vert_1 $$

Least Absolute Shrinkage and Selection Operator, or LASSO.

Keeps the same data-fit term, but changes the regularization term. We take the sum of absolute weight values $\Vert \mathbf{w} \Vert_1 = \sum_{j=1}^{N} |w_j|$.

When a weight is close to zero, the regularization term will force it to be equal to zero.

This is a convex optimization problem, with no closed-form solution in general. However, it can be solved efficiently using an algorithm called least angle regression.
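If you want to try it, scikit-learn ships both a coordinate-descent solver (`Lasso`) and a least-angle-regression solver (`LassoLars`). A hedged sketch, assuming scikit-learn is installed (the synthetic data and exact coefficient values are illustrative only):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoLars

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features are informative.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

for model in (Lasso(alpha=0.1), LassoLars(alpha=0.1)):
    model.fit(X, y)
    # Expect most of the 10 coefficients to be driven exactly to zero.
    print(type(model).__name__, np.round(model.coef_, 2))
```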

Why Shrinkage?

Under the orthogonal design $(\mathbf{X}^T \mathbf{X} = \mathbf{I})$, we have,

Linear Regression: $\underset{\mathbf{w}}{\min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2$, $$ \mathbf{w}_L = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{y} $$

Ridge Regression: $\underset{\mathbf{w}}{\min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \frac{\alpha}{2} \Vert \mathbf{w} \Vert_2^2$, $$ \mathbf{w}_{RR} = (\mathbf{X}^T \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y} = \frac{\mathbf{X}^T \mathbf{y}}{1 + \alpha} $$

Which is just $$ \frac{\mathbf{w}_L}{1 + \alpha} $$

Lasso Regression: $\underset{\mathbf{w}}{\min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \alpha \Vert \mathbf{w} \Vert_1$, $$ \mathbf{w}_{LR} = \underset{\mathbf{w}}{\arg \min} \frac{1}{2} \Vert \mathbf{X} \mathbf{w} - \mathbf{y} \Vert_2^2 + \alpha \Vert \mathbf{w} \Vert_1 $$

Which is just, element-wise, the soft-thresholding rule $$ \operatorname{sign}(\mathbf{w}_L) \max(|\mathbf{w}_L| - \alpha, 0) $$
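A small sketch contrasting the two shrinkage rules element-wise (illustrative only, under the orthonormal-design assumption above; function names are my own):

```python
import numpy as np

def ridge_shrink(w_ls, alpha):
    """Ridge under orthonormal design: uniform rescaling toward zero."""
    return w_ls / (1.0 + alpha)

def lasso_shrink(w_ls, alpha):
    """Lasso under orthonormal design: soft thresholding."""
    return np.sign(w_ls) * np.maximum(np.abs(w_ls) - alpha, 0.0)

w_ls = np.array([3.0, 0.4, -0.2, -2.5])
print(ridge_shrink(w_ls, alpha=0.5))   # every weight shrunk, none exactly zero
print(lasso_shrink(w_ls, alpha=0.5))   # small weights clipped exactly to zero
```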

Linear Regression Summary

  • OLS needs $M \geq N + 1$ data cases to learn a model with an $N$-dimensional feature vector.
  • Ridge and Lasso work when $\mathbf{X}^T \mathbf{X}$ is close to singular.
    • E.g., caused by co-linear features $(x_i \approx ax_j + b)$.
  • MSE objective function for OLS, Ridge, and Lasso is sensitive to noise and outliers.
    • Regularization (Ridge and Lasso) can prevent very large weights, and reduce the possibility of overfitting to outliers.
  • All fail when the output $y$ depends non-linearly on the input features $\mathbf{x}$.