Part 7 - Linear and logistic regression

Posted on Feb 6, 2024
(Last updated: May 26, 2024)

Introduction

In this part we’ll cover linear and logistic regression.

Core data science tasks

Regression is one of the core data science tasks:

  • Regression
    • Predicting a numerical quantity
  • Classification
    • Assigning a label from a discrete set of possibilities
  • Clustering
    • Grouping items by similarity

We will cover clustering a few parts from now.

Linear regression

A regression line is useful both as a visualization and as a method for forecasting numerical values. The residual error of a regression line is the difference between the predicted and actual values.

The objective is to find the line $y = f(x)$ which minimizes the residual error; we often frame this as an optimization problem.

We want to fit a line $f(x) = \beta x + \epsilon$ to our data, selecting $\beta$ (the slope) and $\epsilon$ (the intercept) so that the total error is minimal. But how do we define the error?

Least squares linear regression

We can define the error of the line as the sum of the squared errors of the datapoints. This is a somewhat arbitrary choice, but it is easy to work with analytically, and as we have seen before, squared error is usually a good choice. We want to find the line that minimizes this sum.

$$ \begin{align*} \text{error} & = \sum_{i = 1}^n (y_i - f(x_i))^2 \\ & = \sum_{i = 1}^n (y_i - (\beta x_i + \epsilon))^2 \\ & = \sum_{i = 1}^n \left( y_i^2 - 2y_i(\beta x_i + \epsilon) + (\beta x_i + \epsilon)^2 \right) \\ & = a \beta^2 + b \beta \epsilon + c \epsilon^2 + d \beta + e \epsilon + f \end{align*} $$

Where $a, \dots, f$ are constants that depend only on the data.

So we want to minimize this with respect to $\beta$ and $\epsilon$. Setting the partial derivatives with respect to each to zero gives a linear system: $$ \begin{cases} 2 a \beta + b \epsilon + d & = 0 \\ b \beta + 2 c \epsilon + e & = 0 \end{cases} $$

Solving this system gives $\beta$ and $\epsilon$. In general, minimizing a squared error always reduces to solving a system of linear equations like this one.
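As a minimal sketch (assuming NumPy and made-up data), the same minimization can be done by solving the normal equations $X^T X w = X^T y$, which is this linear system in matrix form:

```python
import numpy as np

# Toy data: y is roughly 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# Design matrix: one column for the slope beta, a column of
# ones for the intercept epsilon.
X = np.column_stack([x, np.ones_like(x)])

# Minimizing the squared error leads to the normal equations
# (X^T X) [beta, epsilon]^T = X^T y, a 2x2 linear system.
beta, epsilon = np.linalg.solve(X.T @ X, X.T @ y)
print(f"beta = {beta:.3f}, epsilon = {epsilon:.3f}")  # near 2 and 1
```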

Coefficient of determination

The coefficient of determination, $R^2$, measures the proportion of the variance in $y$ that the model explains: $$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$

Where $SS_{res}$ is the residual sum of squares and $SS_{tot}$ is the total sum of squares: $$ SS_{res} = \sum_i (y_i - f_i)^2 = \sum_i e_i^2 \\ SS_{tot} = \sum_i (y_i - \bar{y})^2 $$
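A quick sketch of the computation (assuming NumPy; the toy values are made up):

```python
import numpy as np

def r_squared(y, y_pred):
    # SS_res: squared residuals against the model's predictions.
    ss_res = np.sum((y - y_pred) ** 2)
    # SS_tot: squared deviations from the mean, i.e. the error of
    # the baseline model that always predicts the mean.
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(r_squared(y, y_pred))  # 0.98, close to 1 for a good fit
```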

Correlation

Pearson correlation: $$ R = \frac{\sum_{i = 1}^n (x_i - \bar{x}) (y_i - \bar{y})}{\sqrt{\sum_{i = 1}^n (x_i - \bar{x})^2 \sum_{i = 1}^n (y_i - \bar{y})^2}} = \frac{1}{n - 1} \sum_{i = 1}^n \frac{x_i - \bar{x}}{s_x} \frac{y_i - \bar{y}}{s_y} $$

Where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$ respectively.

Spearman correlation: $$ \rho = 1 - \frac{6 \sum d_{i}^2}{n(n^2 - 1)} $$

Where $n$ is the number of observations and $d_i$ is the difference between the two ranks of each observation (this form assumes no tied ranks).
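A sketch of both coefficients using SciPy, with synthetic data chosen so that the relationship is monotone but not linear:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x ** 3 + rng.normal(scale=0.1, size=100)  # monotone but nonlinear

r, _ = stats.pearsonr(x, y)     # measures linear association
rho, _ = stats.spearmanr(x, y)  # measures monotone (rank) association
print(f"Pearson R = {r:.3f}, Spearman rho = {rho:.3f}")
```

Spearman stays near 1 here because it only looks at ranks, while Pearson is pulled down by the nonlinearity.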

Correlation and causation

Correlation does not imply causation: two correlated variables may both be driven by a third, confounding variable, or may simply be correlated by coincidence.

Ridge regression

Regularization is the trick of adding secondary terms to the objective function to favor models that keep coefficients small. Suppose we generalize our loss function with a second set of terms that are a function of the coefficients, not the training data: $$ J(w) = \frac{1}{2n} \sum_{i = 1}^n (y_i - f(x_i))^2 + \lambda \sum_{j = 1}^m w_{j}^2 $$ Here $\lambda \geq 0$ controls how strongly large coefficients are penalized.
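This objective still has a closed-form solution. A minimal sketch (assuming NumPy, and penalizing every coefficient for simplicity, even though in practice the intercept is usually left unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/2n) * sum((y - Xw)^2) + lam * sum(w_j^2).
    Setting the gradient to zero gives the linear system
    (X^T X + 2*n*lam*I) w = X^T y."""
    n, m = X.shape
    return np.linalg.solve(X.T @ X + 2 * n * lam * np.eye(m), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)
print(ridge_fit(X, y, lam=0.1))  # coefficients shrunk toward zero
```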

LASSO Regression

Least Absolute Shrinkage and Selection Operator. Instead of squares, LASSO penalizes the sum of the absolute values of the coefficients. This penalty is just as happy to drive the smallest coefficients all the way to zero as to shrink the big ones, so LASSO tends to produce sparse models and doubles as a feature selector.

$$ J(w, t) = \frac{1}{2n} \sum_{i = 1}^n (y_i - f(x_i))^2 \text{ subject to } \sum_{j = 1}^m |w_j| \leq t $$
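Because of the absolute values there is no closed form, so in practice we hand the problem to a solver. A sketch using scikit-learn, whose Lasso minimizes the equivalent penalized form $\frac{1}{2n} \sum_i (y_i - f(x_i))^2 + \alpha \sum_j |w_j|$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
# Only features 0 and 2 actually influence y.
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
print(lasso.coef_)  # the irrelevant coefficients are driven to exactly zero
```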

Logistic regression

Logistic regression is a powerful tool for modeling the probability of a binary outcome. It is particularly useful when the dependent variable is categorical and the relationship between the independent variables and the probability of the outcome needs to be understood.

Logistic Function

The logistic function, also known as the sigmoid function, is used in logistic regression to map input values to a probability between 0 and 1: $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$ Logistic regression passes a linear function of the features through the sigmoid, so the predicted probability is $P(y = 1 \mid x) = \sigma(w \cdot x)$.
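The function itself is a one-liner (NumPy):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # [0.0067, 0.5, 0.9933]
```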

Maximum Likelihood Estimation

In logistic regression, the model parameters are estimated using maximum likelihood estimation. The goal is to find the parameter values that maximize the likelihood of observing the data given the model. This involves optimizing the log-likelihood function, which quantifies how well the model fits the data.
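A minimal sketch of maximum likelihood estimation by gradient ascent (assuming NumPy; fit_logistic, its learning rate, and the synthetic data are illustrative choices, not a production implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Maximize the log-likelihood
    sum_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ],  p_i = sigmoid(x_i . w),
    by gradient ascent; its gradient with respect to w is X^T (y - p)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w += lr * X.T @ (y - p) / len(y)
    return w

rng = np.random.default_rng(4)
X = np.column_stack([rng.normal(size=200), np.ones(200)])  # feature + intercept
true_w = np.array([2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ true_w)).astype(float)
print(fit_logistic(X, y))  # roughly recovers true_w
```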

Model Evaluation

Once the logistic regression model is trained, it is essential to evaluate its performance. Common metrics for evaluating classification models include accuracy, precision, recall, F1 score, and the receiver operating characteristic (ROC) curve.
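A sketch of computing these metrics with scikit-learn on a synthetic dataset (the dataset and model settings here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
# ROC AUC uses predicted probabilities, not hard labels.
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```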