Part 6 - Mathematical models

Posted on Feb 2, 2024
(Last updated: May 26, 2024)

Introduction

In this part we’ll cover some central mathematical models and their applications.

Mathematical models

All models are wrong, but some models are useful - George Box

Before we start, let’s remember that no model is perfect by any means; a model is just a tool suited to a subset of tasks.

Linear vs. non-linear models

Let’s define what a linear and non-linear model is.

In mathematics, a linear polynomial has the form $f(x) = ax + b$ for some constants $a, b$, so it corresponds to a line drawn on the Cartesian plane.

In general, a linear polynomial of $k$ variables has the form: $$ f(x_1, x_2, \ldots, x_k) = b + \sum_{i = 1}^k a_ix_i $$

A linear map between vector spaces $V$ and $W$ over a field $K$ is a function, $f\ :\ V \to W$ that satisfies for all $x, y \in V$ and $c \in K$: $$ \begin{align*} f(x + y) & = f(x) + f(y) \\ f(cx) & = cf(x) \end{align*} $$

In particular, if a linear map is interpreted as a line in the plane, that line must pass through the origin.
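
To see why, set $c = 0$ in the homogeneity condition: $$ f(0) = f(0 \cdot x) = 0 \cdot f(x) = 0, $$ so a linear map always sends the origin to the origin. The polynomial form $f(x) = ax + b$ with $b \neq 0$ is, strictly speaking, affine rather than linear in this sense.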

Non-linear simply means not linear, e.g., higher-order polynomials, exponential or logarithmic functions, trigonometric functions, etc.

Pros of linear models:

  • Linear models are easily interpretable.
  • Linear models are very efficient to compute.
  • Linear models are tractable and can be manipulated mathematically.

Cons of linear models:

  • They are ill-suited to modelling non-linear effects.
  • Many relationships in the real world are non-linear.

However, sometimes it is possible to linearize models, e.g., by applying an appropriate transformation.
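
As a concrete sketch (the data and numbers here are made up for illustration): an exponential relationship $y = a e^{bx}$ becomes linear in its parameters after taking logarithms, $\ln y = \ln a + bx$, and can then be fitted with ordinary linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = a * exp(b * x) with multiplicative noise
# (illustrative values only).
rng = np.random.default_rng(seed=0)
x = np.linspace(0.0, 5.0, 50)
y = 2.0 * np.exp(0.8 * x) * rng.lognormal(sigma=0.05, size=x.size)

# Linearize: ln(y) = ln(a) + b * x, then fit a plain linear model.
model = LinearRegression().fit(x.reshape(-1, 1), np.log(y))
b_hat = model.coef_[0]             # estimate of b
a_hat = np.exp(model.intercept_)   # estimate of a (intercept is in log-space)
print(f"a ≈ {a_hat:.2f}, b ≈ {b_hat:.2f}")
```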

Black box vs. descriptive models

A model is a black box if we can observe that the outputs have predictive power, but we cannot explain why the outputs are what they are.

Descriptive models come with semantics that help human beings understand the reasoning behind the answer.

First-principle vs. data-driven models

First-principle models are based on theoretical understanding of the phenomena in question.

Data-driven models are based on observations. The actual model might have no understanding of the domain, but only make inferences based on observed probabilities.

Stochastic vs. deterministic models

Stochastic models include some kind of random component. This can be explicit randomization, as in the case of a Monte Carlo simulation. The underlying mathematical model can make use of probability, as in logistic regression, for example, yielding a probability for each potential class, explicitly modelling uncertainty.

Deterministic models simply yield the same result every time, and may not include a notion of uncertainty.
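
As a minimal sketch of the contrast (my own toy example): a Monte Carlo estimate of $\pi$ is stochastic and varies from run to run unless the random seed is fixed, whereas the closed-form value is deterministic.

```python
import math
import random

def pi_monte_carlo(n_samples: int) -> float:
    """Stochastic: estimate pi by sampling random points in the unit square."""
    inside = sum(
        1
        for _ in range(n_samples)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n_samples

print(pi_monte_carlo(100_000))  # varies from run to run (close to 3.14)
print(math.pi)                  # deterministic: always the same value
```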

Flat vs. hierarchical models

A flat model solves a single problem, without sub-problems that have a parent-child relationship.

A hierarchical model has a structure with parent-child relationships between sub-problems.

Evaluating models

It is usually not possible to know in advance which model is the best choice for a given problem and/or dataset.

Given two or more models, we must evaluate them to determine which one performs better.

Human insight should not be ignored: our domain knowledge can suggest whether the model makes any sense.

However, sometimes, we need to compute an appropriate metric for the model with respect to the data. What is appropriate depends on the problem type (classification/regression) and what we are interested in. We are often interested in generalization performance, that is, how well the model does with inputs that are not part of the data the model has seen. Understanding whether our model is any good means it needs to be measured against a baseline.

Train-test split

Generalization performance needs to be evaluated on data that was never used to train the model.

We call this the test set; never evaluate the model on the training set.

Evaluating generalization on data that the model has seen in training is worthless.

If the model has hyperparameters that we need to choose, we may additionally need to split the training set in two: the actual training set and the validation set.

  • Train the model on the training set using different choices of hyperparameters.
  • Evaluate each trained model against the validation set.
  • Choose the hyperparameters that performed best on the validation set.
  • Evaluate the resulting model against the test set.

The simplest way to construct the sets is by partitioning the dataset uniformly at random. We often like to use more data for the training set than for the test set.

Rule of thumb: if we do a simple train-test split, use 75% of data for the training set and 25% for the test set. If we need a validation set as well, split the data 50-25-25.

A train-test split is easily done using sklearn.model_selection.train_test_split.
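
For example, using toy data purely for illustration, the 75-25 and 50-25-25 splits mentioned above look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration: 100 samples, 3 features, binary labels.
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# Simple 75/25 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# For a 50-25-25 train-validation-test split, split the training part again:
# 1/3 of the remaining 75% is 25% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=1 / 3, random_state=42
)
```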

Cross-validation

What if we have very little data?

Then the training set can become too small if we split it.

Cross-validation makes better use of limited data: partition the data into $k$ disjoint parts, systematically use $k-1$ parts for training and the remaining part for testing, and summarize the metric over the $k$ rounds, e.g., as a mean and standard deviation.

The extreme case occurs when $k=n$, which is known as leave-one-out cross-validation.

Often, we would then train the final model with all data, assuming it will do as well as during cross-validation.
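
In scikit-learn, cross-validation can be run with cross_val_score; the model and data below are placeholders for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data for illustration only.
X = np.random.rand(60, 4)
y = np.random.randint(0, 2, size=60)

# 5-fold cross-validation: train on 4 folds, evaluate on the remaining one,
# repeated so that every fold is used for evaluation exactly once.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean(), scores.std())
```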

Baseline models

A baseline model is something that we compare against to see whether we get an improvement.

To understand if our model is worth anything, we should have some kind of baseline and conclusively show that our model beats the baseline.
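
For classification, a common sanity-check baseline is to always predict the most frequent class, which scikit-learn provides as DummyClassifier; the data below is made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy, imbalanced labels: 90% negatives, 10% positives.
X = np.random.rand(100, 2)
y = np.array([0] * 90 + [1] * 10)

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
# Any real model should clearly beat this score on held-out data.
print(baseline.score(X, y))  # 0.9 here, simply by always predicting 0
```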

Evaluating binary classifiers

A binary classifier classifies objects into two classes: a positive and a negative class.

Often positive instances are the more interesting or rarer ones.

The performance of a classifier can often be presented using a confusion matrix:

  • True Positives (TP) are positive instances correctly classified as positive
  • True Negatives (TN) are negative instances correctly classified as negative
  • False Positives (FP) are negative instances incorrectly classified as positive
  • False Negatives (FN) are positive instances incorrectly classified as negative
                  Predicted positive    Predicted negative
Actual positive   True Positives (TP)   False Negatives (FN)
Actual negative   False Positives (FP)  True Negatives (TN)

Accuracy

The simplest metric for classifiers is probably accuracy.

Accuracy measures the fraction of correct classifications out of all classifications: $$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$

For binary classifiers, accuracy is at its worst when it is around $0.5$.

Accuracy of $0$ would mean we get perfect accuracy by swapping the class labels.

Accuracy tells us little about what goes wrong or why.
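
A tiny worked example (with made-up labels), using sklearn.metrics.accuracy_score:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for ten instances.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1]

# TP = 2, TN = 6, FP = 1, FN = 1  ->  (2 + 6) / 10 = 0.8
print(accuracy_score(y_true, y_pred))  # 0.8
```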

Precision and recall

Precision and recall are basic metrics for binary classifiers.

Precision measures the fraction of interesting finds among the returned positives. $$ Precision = \frac{TP}{TP + FP} $$

Recall measures how large a fraction of interesting values we are able to pick up from the set of all positive instances. $$ Recall = \frac{TP}{TP + FN} $$

There is usually a tradeoff between precision and recall.
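
Using the same made-up labels as in the accuracy example, precision, recall and the full confusion matrix are available in sklearn.metrics:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2 / 3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2 / 3

# Rows are actual classes, columns are predicted classes;
# note that sklearn orders the labels as [0, 1] by default.
print(confusion_matrix(y_true, y_pred))
```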

Multi-class classification

Accuracy is easily extended to multiple classes (simply count the fraction of correct classifications).

Top-$k$ success rule gives credit if the correct class label was among the top-$k$ possibilities.

Confusion matrices can be generalized to multi-class cases. Correct predictions would be on the diagonal.

Precision and recall can also be generalized into multi-class environments.

           Predicted A   Predicted B   Predicted C
Actual A   0.97          0.03          0.00
Actual B   0.16          0.84          0.00
Actual C   0.00          0.00          1.00
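
A row-normalized confusion matrix like the example above can be produced directly with scikit-learn; the three-class labels below are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a three-class problem.
y_true = ["A", "A", "A", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "B", "C", "C", "C"]

# normalize="true" divides each row by the number of actual instances
# of that class, so the diagonal holds the per-class recall.
print(confusion_matrix(y_true, y_pred, labels=["A", "B", "C"], normalize="true"))
```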

Evaluating regression models

Regression models are usually evaluated by computing error statistics.

Suppose $\hat{y}$ is the predicted value and $y$ is the correct value.

  • Plain difference $\hat{y} - y$ is simple but signed, so it is inappropriate if the sign of the error is not significant.
  • Absolute error $|\hat{y} - y|$ is a simple statistic, but is sensitive to variations in scale.
  • Relative error $\left| \frac{\hat{y} - y}{y} \right|$ can be used even in the presence of variable scales, but behaves erratically when $y$ is near 0.
  • Squared error $\left(\hat{y} - y\right)^2$ is always positive and penalizes large deviations more.

The (signed) error statistics can be plotted as a histogram; this shows the distribution of the errors. If the histogram is not bell-shaped (approximately normal), this indicates a systematic source of error.

A common summary statistic is Mean Squared Error, the mean of squared errors: $$ MSE(\hat{y}, y) = \frac{1}{n} \sum_{i = 1}^n \left(\hat{y}_i - y_i \right)^2 $$

A related statistic is the Root Mean Squared Error (RMSE), which has the same units as the values themselves: $$ RMSE(\hat{y}, y) = \sqrt{MSE(\hat{y}, y)} $$
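
Both statistics are easy to compute, e.g. with scikit-learn and NumPy (the values below are made up):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and model predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # same units as y itself
print(mse, rmse)
```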