CS4487

Part 1 - Introduction


What is learning?

Many people have tried to define learning over the years. Some examples:

  • Behaviorism (Skinner, 1900-1950):

    • Learning is a long-term change in behavior due to experience.
  • Cognitivism (Gestalt School, 1920-):

    • Learning is an internal mental process that integrates new information into established mental frameworks and updates those frameworks over time.
  • Connectionism (Hebb, 1949):

    • Learning is a physical process in which neurons join together by developing synapses between them.

This also applies to the definition of machine learning:

  • Samuel (1959):

    • Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.
  • Mitchell (1997):

    • A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
  • Jordan (2015):

    • It is one of today’s rapidly growing technical fields, lying at the intersection of computer science and statistics, and at the core of artificial intelligence and data science.

Topics in Machine Learning

So we’ve seen that we can define learning in many different ways, and as such, there are several ways we can approach machine learning. Here are some of the topics we will cover:

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning
  • Learning Theory

Supervised Learning

We have some task, be it classification or regression, and we have training data containing both the inputs and their corresponding outputs.

We try to find the function(s) that best map the input to the output.

Mathematically, we can define it simply as: $$ f: X \mapsto Y $$

Where $X$ is the input space and $Y$ is the output space, and we want to find the best function $f$ that maps $X$ to $Y$.

Unsupervised Learning

When we do not have data containing both the inputs and the outputs, we can use unsupervised learning. The problem itself does not have to differ from a supervised learning problem; the data we have simply lacks the outputs.

In this case, we try to find the structure in the data, and we can do this in several ways:

  • Density estimation:
    • Construct a probability model over the input space.
  • Clustering:
    • Discover groups of similar examples in the data.
  • Dimensionality reduction:
    • Project high-dimensional data to 2 or 3 dimensions (e.g., for visualization).

Reinforcement Learning

This approach differs a bit from supervised and unsupervised learning. In this case, we have an agent that interacts with an environment, and the agent receives rewards or punishments based on its actions.

The agent takes a sequence of actions based on the current state, e.g., a robot interacting with its environment. The aim is to maximize the reward over time: rewards or punishments may arrive only after some delay, and actions may also affect future rewards!

Learning Theory

But, why does machine learning even work in the first place?

From traditional learning theory we know that we can obtain a so-called performance guarantee: a bound on the generalization error, which is the difference between the training error and the test error.

So, the question is not if it will work, it is how well will it work. So the questions we need to figure out are:

  • Which functions can be learned/represented?
  • How much data do we need?

Math Revision

Let’s quickly revise some math concepts that we will need in this course.

Linear Algebra

A real vector space is defined as follows:

The real vector space $\mathbb{R}^N$ is a set with elements $\mathbf{x} = [x_1, x_2, \ldots, x_N]^T$ where each $x_j \in \mathbb{R}$. The elements $\mathbf{x}$ are called vectors and must satisfy the following properties:

  • Addition: If $\mathbf{x} \in \mathbb{R}^N$ and $\mathbf{y} \in \mathbb{R}^N$, then $\mathbf{x} + \mathbf{y} = [x_1 + y_1, x_2 + y_2, \ldots, x_N + y_N]^T \in \mathbb{R}^N$.

  • Scalar Product: If $\mathbf{x} \in \mathbb{R}^N$ and $a \in \mathbb{R}$, then $a\mathbf{x} = [ax_1, ax_2, \ldots, ax_N]^T \in \mathbb{R}^N$.

  • Inner Product: If $\mathbf{x} \in \mathbb{R}^N$ and $\mathbf{y} \in \mathbb{R}^N$, then $\mathbf{x}^T\mathbf{y} = \sum_{j=1}^{N} x_jy_j$.
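
These operations map directly onto NumPy arrays; a minimal sketch, with arbitrary example values:

```python
import numpy as np

# Vectors in R^3 (values are arbitrary examples).
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
a = 2.0

print(x + y)    # addition: [5. 7. 9.]
print(a * x)    # scalar product: [2. 4. 6.]
print(x @ y)    # inner product: 1*4 + 2*5 + 3*6 = 32.0
```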

A matrix is defined as follows:

A matrix $\mathbf{X} \in \mathbb{R}^{M \times N}$ is a rectangular array of elements $x_{ij} \in \mathbb{R}$, $1 \leq i \leq M$, $1 \leq j \leq N$. $$ \mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1N} \newline x_{21} & x_{22} & \ldots & x_{2N} \newline \vdots & \vdots & \ddots & \vdots \newline x_{M1} & x_{M2} & \ldots & x_{MN} \end{bmatrix} $$

A matrix $\mathbf{X} \in \mathbb{R}^{M \times N}$ supports the following operations:

  • Addition: If $\mathbf{X} \in \mathbb{R}^{M \times N}$ and $\mathbf{Y} \in \mathbb{R}^{M \times N}$, then $\mathbf{Z} = \mathbf{X} + \mathbf{Y} \in \mathbb{R}^{M \times N}$ where $z_{ij} = x_{ij} + y_{ij}$.
  • Scalar Product: If $\mathbf{X} \in \mathbb{R}^{M \times N}$ and $a \in \mathbb{R}$, then $\mathbf{Y} = a\mathbf{X} \in \mathbb{R}^{M \times N}$ where $y_{ij} = ax_{ij}$.
  • Matrix Product: If $\mathbf{X} \in \mathbb{R}^{M \times N}$ and $\mathbf{Y} \in \mathbb{R}^{N \times P}$, then $\mathbf{Z} = \mathbf{X}\mathbf{Y} \in \mathbb{R}^{M \times P}$ where $z_{ij} = \sum_{k=1}^{N} x_{ik}y_{kj}$.
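
Again, these map directly onto NumPy; a small sketch with shapes matching the definitions above:

```python
import numpy as np

# X is M x N (here 2 x 3); Y is N x P (here 3 x 2) for the matrix product.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
a = 3.0

print(X + X)    # addition: z_ij = x_ij + x_ij
print(a * X)    # scalar product: y_ij = a * x_ij
print(X @ Y)    # matrix product: 2 x 2, z_ij = sum_k x_ik * y_kj
```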

Probability Theory

Let’s start with the definition of a probability distribution:

A probability distribution $p$ over a sample space $\Omega$ is a function from elements/subsets of $\Omega$ to the real numbers, that satisfies the following conditions:

  • Non-negativity: $p(\omega) \geq 0$ for all $\omega \subseteq \Omega$.
  • Normalization: $\sum_{\omega \in \Omega} p(\omega) = 1$.
  • Additivity: For all $\omega$, $\omega^\prime \subseteq \Omega$ that are disjoint sets, $p(\omega \cup \omega^\prime) = p(\omega) + p(\omega^\prime)$.

Let’s define a random variable now:

A random variable $X$ is defined by a function $f_x$ that maps each element $\omega$ of the sample space $\Omega$ to a value $x = f_x(\omega)$ in a set $\chi$ (called the range of the random variable).

For each $x \in \chi$, the event $\{X = x\}$ refers to the subset of the sample space $\{\omega | \omega \in \Omega, f_x(\omega) = x\}$.

For each $x \in \chi$, the probability $p(X = x) = p(\{\omega | \omega \in \Omega, f_x(\omega) = x\})$.

We can also specify a probability distribution for a random variable $X$ with range $\chi$ directly instead of via an underlying sample space $\Omega$.

The following conditions must hold:

  • Discrete probability mass function:
    • $p(X = x) \geq 0 \quad \forall x \in \chi$ and $\sum_{x \in \chi} p(X = x) = 1$.
  • Continuous probability density function:
    • $p(X = x) \geq 0 \quad \forall x \in \chi$ and $\int_{\chi} p(X = x)dx = 1$.
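
As a quick sketch, we can verify these conditions for a small discrete distribution (the probabilities below are made up):

```python
import numpy as np

# A discrete pmf over the range {0, 1, 2} (made-up probabilities).
p = np.array([0.2, 0.5, 0.3])

assert np.all(p >= 0)            # non-negativity
assert np.isclose(p.sum(), 1.0)  # normalization
```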

Let’s now look at classification from a mathematical perspective.

The Classification Task

A classification task is defined as follows:

Given a feature vector $\mathbf{x} \in \mathbb{R}^N$ that describes an object that belongs to one of $C$ classes from the set $\mathcal{Y}$, predict which class the object belongs to.

A classical example is iris classification, where we have 3 classes of iris flowers. We’re given two features, the petal length and the sepal width, and we want to predict the class of the iris. $$ \mathbf{x} = \begin{bmatrix} \text{petal length} \newline \text{sepal width} \end{bmatrix} = \begin{bmatrix} x_1 \newline x_2 \end{bmatrix} \in \mathbb{R}^2 $$

$$ \mathcal{Y} = \{ \text{“versicolor”}, \text{“setosa”}, \text{“virginica”} \}, \text{where } |\mathcal{Y}| = C = 3 $$

or with numbers: $$ \mathcal{Y} = \{ 0, 1, 2 \} $$

The Classifier Learning Problem

The classifier learning problem is defined as follows:

Given a data set of example pairs $\mathcal{D} = \{ (\mathbf{x}^{(i)}, y^{(i)}), i = 1, \ldots, M \}$ where $\mathbf{x}^{(i)} \in \mathbb{R}^N$ is a feature vector and $y^{(i)} \in \mathcal{Y} = \{1, \ldots, C\}$ is the class label, learn a function $f: \mathbb{R}^N \mapsto \mathcal{Y}$ that accurately predicts the class label $y$ for any feature vector $\mathbf{x}$.

Let’s define the indicator function: $$ \mathbb{I}[A] = \begin{cases} 1 & \text{if } A \text{ is true} \newline 0 & \text{otherwise} \end{cases} $$

Classification Error and Accuracy

Let’s define the classification error:

Given a data set of example pairs $\mathcal{D} = \{ (\mathbf{x}^{(i)}, y^{(i)}), i = 1, \ldots, M \}$ and a function $f: \mathbb{R}^N \mapsto \mathcal{Y}$, the classification error rate of $f$ on $\mathcal{D}$ is defined as: $$ Err(f, \mathcal{D}) = \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}[y^{(i)} \neq f(\mathbf{x}^{(i)})] $$

The classification accuracy is then defined as:

Given a data set of example pairs $\mathcal{D} = \{ (\mathbf{x}^{(i)}, y^{(i)}), i = 1, \ldots, M \}$ and a function $f: \mathbb{R}^N \mapsto \mathcal{Y}$, the classification accuracy of $f$ on $\mathcal{D}$ is defined as: $$ Acc(f, \mathcal{D}) = \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}[y^{(i)} = f(\mathbf{x}^{(i)})] = 1 - Err(f, \mathcal{D}) $$
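
Both definitions are one-liners in NumPy; a minimal sketch with made-up labels and predictions:

```python
import numpy as np

# True labels y^(i) and predictions f(x^(i)) (made-up values).
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 2, 2, 1, 1])

err = np.mean(y_true != y_pred)  # average of the indicator I[y != f(x)]
acc = np.mean(y_true == y_pred)  # average of the indicator I[y == f(x)]
print(err, acc)                  # 0.4 0.6, and acc == 1 - err
```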

K-Nearest Neighbors (KNN) Classifier

The KNN classifier is a simple classifier that labels a new example with the majority class of its $K$ nearest neighbors.

It stores the training set $\mathcal{D}$ and classifies each new instance $\mathbf{x}$ using a majority vote of its $K$ nearest neighbors. $$ \mathcal{N}_K(\mathbf{x}) \subset \{1, \ldots, M\} $$

Use of the KNN requires choosing the distance function $d : \mathbb{R}^N \times \mathbb{R}^N \mapsto \mathbb{R}$ and the number of neighbors $K$.

Max vs. Arg Max

The max operator of a function $f(\mathbf{x})$ defined over $\mathbf{x} \in \mathcal{D}$ returns the maximum value of the function: $$ \max_{\mathbf{x}} f(\mathbf{x}) = \{f(\mathbf{x}) | \mathbf{x} \in \mathcal{D} \land \forall \mathbf{y} \in \mathcal{D}, f(\mathbf{y}) \leq f(\mathbf{x})\} $$

The arg max operator of a function $f(\mathbf{x})$ defined over $\mathbf{x} \in \mathcal{D}$ gives the set of points at which $f(\mathbf{x})$ reaches its maximum value: $$ \underset{\mathbf{x}}{\arg\max} f(\mathbf{x}) = \{\mathbf{x} | \mathbf{x} \in \mathcal{D} \land \forall \mathbf{y} \in \mathcal{D}, f(\mathbf{y}) \leq f(\mathbf{x})\} $$
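
For a function evaluated on a finite domain, NumPy illustrates the difference (note that np.argmax returns only the first maximizer, not the whole set):

```python
import numpy as np

# f evaluated on a finite domain (arbitrary values).
f = np.array([3.0, 7.0, 7.0, 1.0])

print(np.max(f))                     # 7.0: the maximum value
print(np.argmax(f))                  # 1: index of the first maximizer
print(np.flatnonzero(f == f.max()))  # [1 2]: the full arg max set
```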

KNN Classification

$$ f_{KNN}(\mathbf{x}) = \underset{c \in \{1, \ldots, C\}}{\arg\max} \sum_{i \in \mathcal{N}_K(\mathbf{x})} \mathbb{I}[y^{(i)} = c] $$

In general, KNN can work with any distance function $d$ satisfying non-negativity $d(\mathbf{x}, \mathbf{x}^\prime) \geq 0$ and identity of indiscernibles $d(\mathbf{x}, \mathbf{x}) = 0$.

Generally, the more structure the distance function has (symmetry, triangle inequality, etc.), the more structure you can exploit when designing algorithms.

The Minkowski distance ($\ell_p$ norm) is defined as:

Given two data vectors $\mathbf{x}, \mathbf{x}^\prime \in \mathbb{R}^N$, the Minkowski distance with parameter $p$ (the $\ell_p$ norm) is a proper metric defined as: $$ d_p(\mathbf{x}, \mathbf{x}^\prime) = \lVert \mathbf{x} - \mathbf{x}^\prime \rVert_p \newline = \left( \sum_{j=1}^{N} |x_j - x^{\prime}_j|^p \right)^{\frac{1}{p}} $$

Special cases include Euclidean distance ($p = 2$), Manhattan distance ($p = 1$), and Chebyshev distance ($p = \infty$).
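
A small sketch of the $\ell_p$ distance, including the $p = \infty$ limit:

```python
import numpy as np

def minkowski(x, x_prime, p=2):
    """Minkowski (l_p) distance between two vectors."""
    diff = np.abs(x - x_prime)
    if np.isinf(p):
        return diff.max()                  # Chebyshev distance (p -> infinity)
    return (diff ** p).sum() ** (1.0 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, p=1))       # 7.0, Manhattan
print(minkowski(x, y, p=2))       # 5.0, Euclidean
print(minkowski(x, y, p=np.inf))  # 4.0, Chebyshev
```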

Brute Force KNN

Given any distance function $d$, brute force KNN works by computing the distance $d_i = d(\mathbf{x}^{(i)}, \mathbf{x})$ from a target point $\mathbf{x}$ to all of the training points $\mathbf{x}^{(i)}$.

Sort the distances $\{d_i, i = 1, \ldots, M\}$ and choose the data cases with the $K$ smallest distances to form the neighbor index set $\mathcal{N}_K(\mathbf{x})$.

Once the $K$ neighbors are selected, applying the classification rule is easy.
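
A minimal sketch of the whole procedure, assuming Euclidean distance and made-up training data:

```python
import numpy as np

def knn_predict(X_train, y_train, x, K=3):
    """Brute-force KNN: compute all distances, vote among the K nearest."""
    d = np.linalg.norm(X_train - x, axis=1)  # d_i to every training point
    neighbors = np.argsort(d)[:K]            # the neighbor index set N_K(x)
    votes = np.bincount(y_train[neighbors])  # count class labels among them
    return np.argmax(votes)                  # majority class

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # 0
```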

KNN Variants

Instead of giving all of the $K$ neighbors equal weight in the majority vote, a distance-weighted majority can be used: $$ f_{KNN}(\mathbf{x}) = \underset{c \in \{1, \ldots, C\}}{\arg\max} \frac{\sum_{i \in \mathcal{N}_K(\mathbf{x})} w_i \mathbb{I}[y^{(i)} = c]}{\sum_{i \in \mathcal{N}_K(\mathbf{x})} w_i} \newline w_i = \exp(-\alpha d_i) $$

Instead of a brute force nearest neighbor search, data structures like $K$-d trees can be constructed over the training data that support nearest neighbor search with lower computational complexity.
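
As a sketch of both variants, SciPy’s cKDTree handles the neighbor search and the vote is weighted with $w_i = \exp(-\alpha d_i)$ (the data and $\alpha$ below are arbitrary):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))       # made-up training features
y_train = rng.integers(0, 3, size=1000)    # made-up labels in {0, 1, 2}

tree = cKDTree(X_train)                    # build the K-d tree once
dists, idx = tree.query([0.0, 0.0], k=5)   # 5 nearest neighbors of the target

alpha = 1.0
w = np.exp(-alpha * dists)                 # distance-based weights
scores = np.bincount(y_train[idx], weights=w, minlength=3)
print(np.argmax(scores))                   # weighted-majority class
```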

KNN Trade-offs

  • Advantages:
    • No training period is involved (i.e., lazy learning), and new data can be added seamlessly without re-training the model
    • Converges to the correct decision surface as data goes to infinity
  • Disadvantages:
    • Does not work well with large datasets. Since KNN needs to store all training data, performing neighbor search requires a lot of memory and takes a lot of time
    • Does not work well with high dimensions. It becomes difficult for KNN to calculate the distance in each dimension. Moreover, everything is far from everything else in high dimensions (the so-called “curse of dimensionality”)

Probabilistic Classifiers

Probabilistic classifiers are classifiers that output a probability distribution over the classes.

For example, generative models are probabilistic classifiers.

Each object/pattern has a particular (and possibly unique) probability distribution over the classes.

Class Model

As we’ve defined before, the possible classes are $\mathcal{Y}$. In the real world, the frequency with which class $y$ occurs is given by the probability distribution $p(y)$.

$p(y)$ is called the prior distribution.

Observation Model

We measure/observe feature vector $\mathbf{x}$. The value of the features depends on the class.

The observation is drawn according to the distribution $p(\mathbf{x} | y)$.

$p(\mathbf{x} | y)$ is called the class conditional distribution. It indicates the probability of observing a particular feature vector $\mathbf{x}$ given that the class is $y$.

Gaussian Distribution

Each class is modeled as a separate Gaussian distribution of the feature value.

$$ p(x | y = c; \mu_c, \sigma_{c}^2) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp \left( -\frac{(x - \mu_c)^2}{2\sigma_c^2} \right) $$

Each class has its own mean $\mu_c$ and variance $\sigma_c^2$.
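
The density itself is straightforward to evaluate; a sketch with illustrative parameters:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Class-conditional Gaussian density p(x | y = c; mu_c, sigma_c^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(gaussian_pdf(0.0, mu=0.0, var=1.0))  # ~0.3989, the standard normal peak
```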

Learn the Parameters from Data

Maximum likelihood estimation (MLE) is used to estimate the parameters of the Gaussian distribution.

Set the parameters $(\mu_c, \sigma_c^2)$ to maximize the likelihood (probability) of the samples for class $c$.

Let $\mathcal{D}_c = \{x^{(i)} \mid y^{(i)} = c\}$, containing $M_c$ examples, be the data for class $c$.

The maximum likelihood estimate maximizes the log-likelihood of the class-$c$ data: $$ (\hat{\mu}_c, \hat{\sigma}_c^2) = \underset{\mu_c, \sigma_c^2}{\arg\max} \sum_{x^{(i)} \in \mathcal{D}_c} \log p(x^{(i)} | y^{(i)}; \mu_c, \sigma_c^2) $$

When we view the above objective as a function of the parameters $\{\mu_c, \sigma_c^2\}$, we instead call it the likelihood function of the data.

Sample mean: $$ \hat{\mu}_c = \frac{1}{M_c} \sum_{x^{(i)} \in \mathcal{D}_c} x^{(i)} $$

Sample variance: $$ \hat{\sigma}_c^2 = \frac{1}{M_c} \sum_{x^{(i)} \in \mathcal{D}_c} (x^{(i)} - \hat{\mu}_c)^2 $$
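
A quick numeric sketch of both estimators, using made-up feature values for one class:

```python
import numpy as np

# Made-up feature values x^(i) for a single class c.
x = np.array([4.9, 5.1, 5.0, 4.8, 5.2])

mu_hat = x.mean()                     # sample mean: (1/M_c) * sum of x^(i)
var_hat = np.mean((x - mu_hat) ** 2)  # sample variance (MLE, divides by M_c)
print(mu_hat, var_hat)                # 5.0 0.02
```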

Bayesian Decision Rule

The Bayesian decision rule (BDR) makes the optimal decisions on problems involving probability (uncertainty). It minimizes the probability of making a prediction error.

We call this the Bayes Optimal Classifier.

Given observation $\mathbf{x}$, pick the class $c$ with the largest posterior probability $p(y = c | \mathbf{x})$. $$ f_{B}(\mathbf{x}) = \underset{c \in \{1, \ldots, C\}}{\arg\max} p(y = c | \mathbf{x}) $$

But there is a problem, we don’t have $p(y | \mathbf{x})$, we only have $p(y)$ and $p(\mathbf{x} | y)$.

Bayes Rule

The posterior probability can be calculated using Bayes rule: $$ p(y | \mathbf{x}) = \frac{p(\mathbf{x} | y) p(y)}{p(\mathbf{x})} $$

The denominator is the probability of $\mathbf{x}$, $$ p(\mathbf{x}) = \sum_{y \in \mathcal{Y}} p(\mathbf{x}, y) = \sum_{y \in \mathcal{Y}} p(\mathbf{x} | y) p(y) $$

The denominator makes $p(y | \mathbf{x})$ sum to 1.

So, we can write it as: $$ p(y | \mathbf{x}) = \frac{p(\mathbf{x} | y) p(y)}{\sum_{y^\prime \in \mathcal{Y}} p(\mathbf{x} | y^\prime) p(y^\prime)} $$
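
A tiny numeric sketch of this computation, with made-up prior and likelihood values for three classes:

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])       # p(y = c), made-up values
likelihood = np.array([0.1, 0.4, 0.4])  # p(x | y = c) for the observed x

joint = likelihood * prior              # numerator: p(x | y) p(y)
posterior = joint / joint.sum()         # divide by p(x) so it sums to 1
print(posterior)                        # [0.2 0.48 0.32], sums to 1
```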

Bayes Classifier Summary

For training, we collect training data from each class. For each class $c$, we estimate the class conditional densities $p(x | y = c)$:

  1. Select a form of the distribution (e.g., Gaussian)
  2. Estimate its parameters with MLE
  3. Estimate the class priors $p(y)$ using MLE

For classification, given a new sample $\mathbf{x}^\star$, calculate the likelihood $p(\mathbf{x}^\star | y = c)$ for each class $c$. Pick the class $c$ with the largest posterior probability $p(y = c | \mathbf{x}^\star)$.

Equivalently, use $p(\mathbf{x}^\star | y = c)p(y = c)$ or $\log p(\mathbf{x}^\star | y = c) + \log p(y = c)$.
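
Putting it all together, a minimal end-to-end sketch with one-dimensional features; the data and function names are illustrative, not part of the notes:

```python
import numpy as np

def train(x, y, C):
    """Per-class MLE: Gaussian parameters plus class prior for each class."""
    params = []
    for c in range(C):
        xc = x[y == c]
        mu, var = xc.mean(), np.mean((xc - xc.mean()) ** 2)
        params.append((mu, var, len(xc) / len(x)))  # (mu_c, var_c, p(y=c))
    return params

def predict(x_star, params):
    """Pick the class maximizing log p(x* | y=c) + log p(y=c)."""
    scores = [
        -0.5 * np.log(2 * np.pi * var)
        - (x_star - mu) ** 2 / (2 * var)
        + np.log(prior)
        for mu, var, prior in params
    ]
    return int(np.argmax(scores))

x = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])  # made-up 1-D features
y = np.array([0, 0, 0, 1, 1, 1])              # their class labels
params = train(x, y, C=2)
print(predict(0.9, params), predict(3.1, params))  # 0 1
```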