Part 9 - Machine Learning

Posted on Feb 20, 2024
(Last updated: May 26, 2024)

Introduction

In this part, we will discuss machine learning and neural networks. We will start with a brief introduction to machine learning.

Learning types

There are four main types of learning:

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Self-supervised learning

Supervised learning

In supervised learning, we have for each input $x$ a label $y$ that we want to predict. Models are trained to minimize the difference between their predictions and the labels.

Unsupervised learning

In unsupervised learning, we only make use of the inputs $x$. Here, patterns are found in the inputs, for example using clustering. But as we do not have labels to validate the patterns/clusters found, we can never be sure how good the model is.

Semi-supervised learning

In semi-supervised learning, we have a small amount of labeled data and a larger amount of unlabeled data. A model trained on the initially labeled data creates labels for the unlabeled data, and the final model is trained on the complete dataset.

Self-supervised learning

In self-supervised learning, no true labels exist; however, simple labels can be inferred from the data itself. For example, reconstructing an image from a degraded copy (say, one converted from RGB to black-and-white); here $x$ and $y$ are derived from the same image, and in pure reconstruction they are identical.

Support Vector Machines

Support Vector Machines (SVM) are a type of supervised learning model. They are used for classification and regression analysis. SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes.

In two dimensions, the hyperplane is a line: $$ y = mx + b $$

In three dimensions, the hyperplane is a plane: $$ y = m_1 x + m_2 z + b $$

Decision boundary

The decision boundary is the line that separates the classes. The distance between the decision boundary and the closest data point is called the margin. The best decision boundary is the one that maximizes the margin.

Soft margin

In some cases, the data is not linearly separable. In this case, we can use a soft margin, which allows some data points to be on the wrong side of the decision boundary. The margin is then maximized while keeping the number of misclassified data points small.

Non-linear classification

In some cases, the data is not linearly separable. In this case, we can use the kernel trick. The kernel trick maps the data to a higher-dimensional space where it is linearly separable. This allows us to use a linear decision boundary in the higher-dimensional space.
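
As a concrete illustration, here is a minimal sketch using scikit-learn (an assumed library, not prescribed by the text). The C parameter controls how soft the margin is, and the RBF kernel applies the kernel trick to obtain a non-linear decision boundary.

```python
# Minimal SVM sketch with scikit-learn (assumed library).
# C controls the soft margin: smaller C tolerates more misclassified points;
# kernel="rbf" applies the kernel trick for non-linear decision boundaries.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # toy non-linear data

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))
```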

Scaling

SVMs optimize distances. Variables with large scales therefore dominate: changing them quickly leads to a seemingly significant “improvement” in the result, while small-scale variables are effectively ignored.

So, all variables should be scaled to the same range. This can be done by subtracting the mean and dividing by the standard deviation. $$ x_{\text{scaled}} = \frac{x - \mu}{\sigma} $$

Alternatively, we can use the min-max scaler, which scales the data to the range $[0, 1]$: $$ x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)} $$
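
A quick NumPy sketch of both scalers (the feature values are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 20.0])  # hypothetical feature values

# Standardization: subtract the mean, divide by the standard deviation
x_standardized = (x - x.mean()) / x.std()

# Min-max scaling: map the values to the range [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_standardized)
print(x_minmax)
```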

Random Forest

Random Forest is a type of supervised learning model. It is used for classification and regression analysis. Unlike SVMs, Random Forests are not dependent on distances, or the scale of the data. They’re instead based on decision trees.

Decision trees

A decision tree is a tree-shaped diagram used to determine a course of action. It can be used for decision-making and classification.

Each node represents a question/problem. Each branch represents an answer/decision. Each leaf represents a class label.

Importance and overfitting

The importance of a feature is a measure of how much the feature contributes to the decision. But how do we know if a split is based on a real pattern or just noise/outliers? To guard against this, we can allow the creation of new branches only if there are at least $n$ data points in a group. This is called pruning.
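
In scikit-learn terms (again an assumption, not prescribed by the text), this roughly corresponds to the min_samples_split and min_samples_leaf parameters of a decision tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Only split a node if it contains at least 20 samples, and only keep
# leaves with at least 5 samples (values are illustrative).
tree = DecisionTreeClassifier(min_samples_split=20, min_samples_leaf=5)
tree.fit(X, y)
```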

Impurity

Impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. The most common impurity measures are:

  • Gini impurity
  • Entropy
  • Classification error

Gini impurity

For a node with two classes, the Gini impurity is calculated as: $$ \text{Gini impurity} = 1 - (P(\text{yes})^2 + P(\text{no})^2) $$
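
A tiny sketch of this formula as a function (the label counts are hypothetical):

```python
def gini_impurity(n_yes: int, n_no: int) -> float:
    """Gini impurity of a group with n_yes 'yes' and n_no 'no' labels."""
    total = n_yes + n_no
    p_yes, p_no = n_yes / total, n_no / total
    return 1 - (p_yes ** 2 + p_no ** 2)

print(gini_impurity(10, 10))  # 0.5 -> maximally impure group
print(gini_impurity(20, 0))   # 0.0 -> pure group
```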

Random Forest

A Random Forest is a collection of decision trees; this helps prevent overfitting. A random forest typically uses several hundred or even thousands of decision trees. But since decision trees are always built from the optimal variables, wouldn’t all trees look the same?

Bagging

To answer the last question: we use bagging (bootstrap aggregating). Each tree is trained on its own bootstrap sample, drawn with replacement from the original data. Outliers are drawn less often because they are less common in the data set.

Not only does each tree in the random forest get its own data; each tree also receives only a random subset of the variables.
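
In scikit-learn terms (again an assumption), both ideas appear as parameters of the random forest:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# bootstrap=True: each tree gets its own bootstrap sample (bagging);
# max_features="sqrt": each split only considers a random subset of the variables.
forest = RandomForestClassifier(n_estimators=500, bootstrap=True, max_features="sqrt")
forest.fit(X, y)
print(forest.score(X, y))
```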

Boosting

Boosting is an ensemble method that serves as an alternative to the bagging used in random forests.

In boosting the trees are built sequentially:

  • Build a tree
  • Evaluate performance on data
  • Reweight the samples
  • Repeat

Samples for which the earlier trees perform poorly are assigned a higher weight, giving them increased consideration when building the next tree.
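
A minimal boosting sketch using scikit-learn's AdaBoost, one of several boosting variants (the dataset and number of trees are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

# Trees are built sequentially; misclassified samples receive a higher
# weight before the next tree is fit.
booster = AdaBoostClassifier(n_estimators=100)
booster.fit(X, y)
print(booster.score(X, y))
```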

Rosenblatt perceptron

The Rosenblatt perceptron is the “grandfather” of neural networks. It is based on the idea of biological neurons. A perceptron has one or more inputs and a bias. It sums the inputs and applies an activation function to the sum.

This captures the idea of a threshold. If the sum is above the threshold, the neuron fires. If not, it does not. This is how we can determine whether some data point is “important” or not.

But in order to “learn”, something needs to change; this is why every input has a weight. The weight determines the importance of the input.
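
A minimal sketch of a Rosenblatt-style perceptron in NumPy, learning the logical AND function with the classic perceptron update rule (toy data and hyperparameters chosen purely for illustration):

```python
import numpy as np

# Toy data: learn the logical AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # one weight per input
b = 0.0           # bias
lr = 0.1          # learning rate

def step(z):
    """Threshold activation: fire (1) if the weighted sum exceeds 0."""
    return 1 if z > 0 else 0

for _ in range(10):                       # a few passes over the data
    for x_i, y_i in zip(X, y):
        y_hat = step(np.dot(w, x_i) + b)  # weighted sum + bias, then threshold
        w += lr * (y_i - y_hat) * x_i     # perceptron learning rule
        b += lr * (y_i - y_hat)

print([step(np.dot(w, x_i) + b) for x_i in X])  # [0, 0, 0, 1]
```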

Neural networks

With the Rosenblatt perceptron in mind, we can now discuss neural networks. A neural network is a collection of neurons, connected in so-called layers. The first layer is the input layer, the last layer is the output layer, and the layers in between are called hidden layers.

Let’s take the most famous dataset, the MNIST dataset as an example.

Given an image of a digit, we want to predict which digit it is. The input layer has 784 neurons, one for each pixel in the $28 \times 28$ image.

So for every neuron in the hidden layer, we perform: $$ w_1 x_1 + w_2 x_2 + \ldots + w_{784} x_{784} + b $$

If you remember linear algebra, this is a dot product between the weight vector and the input vector.

So that entire sum can be written as: $$ (x \cdot w) + b $$

But we usually have more than one neuron in the hidden layer. So each neuron has its own weight vector and bias. $$ (x \cdot w_1) + b_1 \ | \ \text{neuron 1} \\ (x \cdot w_2) + b_2 \ | \ \text{neuron 2} \\ \vdots \\ (x \cdot w_n) + b_n \ | \ \text{neuron n} $$

We can combine all the weight vectors into a matrix: $$ \mathbf{W} = \begin{bmatrix} w_{1} \\ w_{2} \\ \vdots \\ w_{n} \end{bmatrix} $$

And all the biases into a vector: $$ \mathbf{b} = \begin{bmatrix} b_{1} \\ b_{2} \\ \vdots \\ b_{n} \end{bmatrix} $$

This means we can write the summation every neuron performs as one matrix multiplication: $$ x\mathbf{W}^{T} + \mathbf{b} $$

But one component is missing from the original Rosenblatt perceptron, the activation function. After the linear transformation, all values are passed through a non-linear activation function. We can view this as a filter function.

The non-linear transformation is necessary because it allows the network to “learn” more complex, non-linear patterns. Some common activation functions are:

  • Step function
  • Sigmoid function
  • ReLU function

So from our input, the hidden layer computes: $$ a_1 = \sigma(z_1) = \sigma(x\mathbf{W_1}^{T} + \mathbf{b_1}) $$

Our output layer then receives (from the hidden layer): $$ z_2 = a_1\mathbf{W_2}^{T} + \mathbf{b_2} $$

Now at the output layer, we need to perform a non-linear transformation so that we can make a statement. For example for the MNIST dataset, we want the network to tell us which digit it is. So we use the softmax function to transform the output into probabilities for each class.
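
Putting these pieces together, here is a minimal NumPy sketch of the forward pass for a single MNIST-sized input. The weights are random and the hidden-layer size of 64 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

x = rng.random(784)                  # one flattened 28x28 image (dummy values)

W1, b1 = rng.standard_normal((64, 784)) * 0.01, np.zeros(64)   # hidden layer
W2, b2 = rng.standard_normal((10, 64)) * 0.01, np.zeros(10)    # output layer

a1 = sigmoid(x @ W1.T + b1)          # hidden activations
y_hat = softmax(a1 @ W2.T + b2)      # class probabilities for digits 0-9

print(y_hat.sum())                   # ~1.0: a probability distribution
```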

This entire process is called forward propagation. But how does the network learn the correct values for the parameters to perform a correct classification?

This is what machine learning is all about. The correct values are learned by “training” the network.

Training

After each image and prediction, we need to “tell” the network how good it did. From this information, the network can adjust its parameters to perform better next time. The loss function is a measure of how good the network did. The loss function quantifies the difference between the predicted and the true values.

So, the loss function measures the error, the goal is therefore to minimize the loss function. In our case, the loss function is the cross-entropy loss function. $$ L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{I} y_{n,i} \log(\hat{y}_{n,i}) $$

  • For each position of a prediction, we multiply the true value of the position by the log of the prediction.
  • For each prediction, the products are summed up.
  • The results for all predictions are summed up and divided by the number of all data points.
  • The sign is flipped so that the result is positive and can be minimized rather than maximized.
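
The same steps as a small NumPy sketch, with hypothetical one-hot labels and predicted probabilities:

```python
import numpy as np

# One-hot true labels (N = 2 data points, I = 3 classes) and predicted probabilities.
Y     = np.array([[1, 0, 0],
                  [0, 1, 0]])
Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])

# Multiply each true value by the log of the prediction, sum per prediction,
# average over all data points, and flip the sign.
loss = -np.mean(np.sum(Y * np.log(Y_hat), axis=1))
print(loss)  # ~0.29
```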

So far we’ve only sent one input at a time through the network. But we can also send multiple inputs at once. This is called a batch. $$ \mathbf{X} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} $$

$$ \mathbf{Z} = \mathbf{X}\mathbf{W}^{T} + \mathbf{b} $$

So training in reality is just:

  • Forward propagation
    • $A_1 = \sigma(Z_1) = \sigma(XW_1^{T} + b_1)$
    • $\hat{Y} = softmax(Z_2) = softmax(A_1W_2^{T} + b_2)$
  • Calculate the loss
    • $ J = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{I} y_{n,i} \log(\hat{y}_{n,i})$
  • Backward propagation
    • Goal: Find the matrices $\mathbf{W_1}$, $\mathbf{W_2}$, and the vectors $\mathbf{b_1}$, $\mathbf{b_2}$ that minimize the loss $J(\mathbf{W_1}, \mathbf{W_2}, \mathbf{b_1}, \mathbf{b_2})$.

In backpropagation, the error (loss) is fed back through the network to shift the weights and biases in the right direction.

Gradient descent

But how do we minimize the loss function? What happens if our loss landscape is non-convex? How do we find the global minimum?

The answer is gradient descent. Gradient descent is an optimization algorithm used to minimize the loss function. It works by iteratively moving in the direction of the negative gradient of the loss function.

Say we have: $$ L(\phi) = \frac{1}{2} \phi^2 + \phi^{-\frac{1}{2}} - 1 $$

The gradient of the loss function is: $$ \frac{dL}{d\phi} = \phi - \frac{1}{2}\phi^{-\frac{3}{2}} $$

Which means: $$ \phi_{t + 1} = \phi_t - \alpha \frac{dL}{d\phi} $$
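
A short sketch of this update rule applied to the example loss above; the starting point and learning rate are arbitrary choices:

```python
# Gradient descent on L(phi) = 0.5*phi^2 + phi^(-1/2) - 1, for phi > 0.
phi = 1.0        # arbitrary starting point
alpha = 0.1      # learning rate

for _ in range(100):
    grad = phi - 0.5 * phi ** (-1.5)   # dL/dphi
    phi = phi - alpha * grad           # step along the negative gradient

print(phi)  # ~0.76, the minimum of L for phi > 0
```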

Stochastic gradient descent

Stochastic gradient descent is a variant of gradient descent. It is used to minimize the loss function. The difference is that the gradient is calculated for a random subset of the data, rather than the entire dataset. This makes the algorithm faster, but also less accurate.

Revisiting backpropagation and the chain rule

From multivariate calculus, we know that the gradient of a function points in the direction of the steepest increase. So the negative gradient points in the direction of the steepest decrease. Following the negative gradient, we can find a local minimum of the loss function.

Since each layer depends on the previous layer, we can use the chain rule to calculate the gradient of the loss function with respect to the weights and biases. For the bias gradients, this amounts to taking the average per column over the batch.

So in the MNIST example we finally have:

  • Forward propagation
    • $A_1 = \sigma(Z_1) = \sigma(XW_1^{T} + b_1)$
    • $\hat{Y} = softmax(Z_2) = softmax(A_1W_2^{T} + b_2)$
  • Calculate the loss
    • $ J = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{I} y_{n,i} \log(\hat{y}_{n,i})$
  • Backward propagation
    • $dZ_2 = \hat{Y} - Y$
    • $dW_2 = \frac{1}{N} \, dZ_2^T A_1$
    • $db_2 = \frac{1}{N} \sum_{n = 1}^{N} dZ_{2_{n,i}}$ (the column-wise mean over the batch)
    • $dZ_1 = (dZ_2 W_2) \odot A_1 \odot (1 - A_1)$ (element-wise products)
    • $dW_1 = \frac{1}{N} \, dZ_1^T X$
    • $db_1 = \frac{1}{N} \sum_{n = 1}^{N} dZ_{1_{n,i}}$
  • In backpropagation, the error (loss) is fed back through the network to shift the weights and biases in the right direction.
    • $W_{1, t} = W_{1, t - 1} - \alpha \cdot dW_{1, t - 1}$
    • $b_{1, t} = b_{1, t - 1} - \alpha \cdot db_{1, t - 1}$
    • $W_{2, t} = W_{2, t - 1} - \alpha \cdot dW_{2, t - 1}$
    • $b_{2, t} = b_{2, t - 1} - \alpha \cdot db_{2, t - 1}$

This step can be repeated indefinitely. An “epoch” is a complete pass through the entire dataset. During training, we can determine how many epochs we want to train, i.e. how many times we want to pass through the entire dataset.
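
Below is a compact NumPy sketch of one such training loop, following the equations above. The data is random noise standing in for MNIST, and the layer sizes, learning rate, and epoch count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dummy stand-in for MNIST: N flattened 28x28 images and one-hot labels.
N, D, H, C = 256, 784, 64, 10
X = rng.random((N, D))
Y = np.eye(C)[rng.integers(0, C, size=N)]

W1, b1 = rng.standard_normal((H, D)) * 0.01, np.zeros(H)
W2, b2 = rng.standard_normal((C, H)) * 0.01, np.zeros(C)
alpha = 0.5

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(50):
    # Forward propagation
    A1 = sigmoid(X @ W1.T + b1)
    Y_hat = softmax(A1 @ W2.T + b2)

    # Cross-entropy loss
    J = -np.mean(np.sum(Y * np.log(Y_hat), axis=1))
    if epoch % 10 == 0:
        print(epoch, J)      # watch the loss decrease over the epochs

    # Backward propagation (gradients as listed above)
    dZ2 = Y_hat - Y
    dW2 = dZ2.T @ A1 / N
    db2 = dZ2.mean(axis=0)
    dZ1 = (dZ2 @ W2) * A1 * (1 - A1)
    dW1 = dZ1.T @ X / N
    db1 = dZ1.mean(axis=0)

    # Update step
    W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
    W2, b2 = W2 - alpha * dW2, b2 - alpha * db2
```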

Hyperparameters

So far we’ve seen that a neural network adjusts its weights and biases to minimize the loss function. But some parameters are not learned by the network; rather, they are set by the user. These are called hyperparameters, for example the learning rate $\alpha$, the number of hidden layers and neurons per layer, the batch size, and the number of epochs. There are many methods to find the best hyperparameters, but we won’t discuss them here.

Deep learning

Deep learning is the study of neural networks with many layers. The idea is that the network can learn more complex patterns with more layers. Since deep networks often need fewer parameters to represent the same function as a shallow network, they are more efficient.

However, wide networks come with their own downsides: they are more likely to memorize the training data, rather than generalize to new data.

Images to vectors

In the MNIST example, we used the raw pixel values as input. This works because the pictures are so small and grayscale. But in general, pictures are colored and much larger. So we need to transform the images into vectors. Doing this naively, pixel by pixel, would result in a very large vector.

However, by flattening the image, the spatial coherence is lost. The proximity of pixels is important. So instead of flattening the image, maybe we could let the network look at smaller parts of the image at a time.

So we need a method that takes the original image with its multiple dimensions, but only looks at a small part of the image at a time. Any math nerd will recognize this as a convolution, a mathematical filter.

Convolutional neural networks

A convolutional neural network (CNN) is a type of neural network that is well-suited for image recognition. CNNs use a mathematical operation called convolution to filter the input image. This allows the network to learn local patterns in the image, so even if a pattern appears at a different position in the image, or is slightly distorted, the network can still recognize it.

Kernel

The kernel is the filter that is applied to the input image. It is a small matrix that is convolved with the input image. The kernel is learned by the network, and it is used to extract features from the input image. For example, edges, corners, and textures.

Given an input image $I$ and a kernel $K$, the feature map $F$ is calculated by convolving the input image with the kernel: $$ F = I * K $$
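
A minimal NumPy sketch of this sliding-window operation on a dummy grayscale image (the kernel values are illustrative; strictly speaking the loop computes cross-correlation, which is the convention most deep-learning libraries use):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.random.rand(28, 28)          # dummy grayscale image
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])         # vertical-edge-style kernel

print(convolve2d(image, kernel).shape)  # (26, 26)
```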

RGB MNIST

Assume we had the MNIST dataset, but with colored images. The input image would then have three dimensions: height, width, and color. Therefore, our kernel is 3-dimensional as well.

So our image has shape $28 \times 28 \times 3$, and our kernel has shape $3 \times 3 \times 3$. The feature map would then have shape $26 \times 26 \times 1$.

In other words, each channel of the input image contains information about a certain color. Each filter is responsible for detecting a certain feature in the image. Each channel in the feature map contains information about a certain feature in the image.

Padding

When we convolve an image with a kernel, the feature map is smaller than the input image. This is because the kernel cannot be applied to the edges of the image. To prevent this, we can add a border of zeros around the input image. This is called padding. This can be useful to prevent the loss of information at the edges of the image.

Pooling

Pooling is a technique used to reduce the size of the feature map. The deeper in the network, the smaller the feature map. The deeper in the network, the more channels the image has. As neurons deeper in the network focus on more complex structures, each pixel is no longer as important.

So we can reduce the size of the feature map by taking the maximum value in a small region of the feature map. This is called max pooling.
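
A small NumPy sketch of 2×2 max pooling on a dummy feature map:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Max pooling: keep the maximum value of each non-overlapping size x size block."""
    h, w = feature_map.shape
    cropped = feature_map[:h - h % size, :w - w % size]  # crop to a multiple of `size`
    blocks = cropped.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.random.rand(26, 26)   # dummy feature map
print(max_pool(fm).shape)     # (13, 13)
```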

Fully connected layer

So, the power of CNNs is that they can learn local patterns in images. We can use CNNs to filter the important information from the image. If we then flatten this new, smaller, filtered image, we can feed it into a network just like before.

This is called a fully connected layer. It is the same as the hidden layer in a regular neural network.
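
Putting the whole pipeline together, here is a minimal sketch in PyTorch (an assumed framework, not prescribed by the text) for the 3 × 28 × 28 RGB-MNIST input described above: convolution, pooling, flattening, and a fully connected layer.

```python
import torch
from torch import nn

# Convolution -> activation -> pooling -> flatten -> fully connected layer.
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3),  # 3x28x28 -> 8x26x26
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                              # 8x26x26 -> 8x13x13
    nn.Flatten(),                                             # -> 8 * 13 * 13 = 1352
    nn.Linear(8 * 13 * 13, 10),                               # 10 digit classes
)

x = torch.rand(1, 3, 28, 28)   # one dummy RGB image
print(model(x).shape)          # torch.Size([1, 10])
```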