Part 8 - Distance and network methods

Posted on Feb 13, 2024
(Last updated: May 26, 2024)

Introduction

In data science, classification stands out as a fundamental technique. It involves the art of categorizing data into distinct classes based on various features and attributes.

Different types of classifiers

Within the arsenal of classification algorithms, we encounter a diverse array of tools. From the simplicity of Logistic Regression to the complexity of Support Vector Machines, each classifier brings its unique strengths to the table.

Measuring Distances

The most common distance metric is the Euclidean distance: $$ d(p, q) = \sqrt{\sum_{i=1}^{n} |q_i - p_i|^2} $$

where $p$ and $q$ are two points in the dataset and $n$ is the number of features.

But let’s first define what a distance metric is. A distance metric must satisfy the following properties:

  • Positivity: $d(p, q) \geq 0$ for all $p$ and $q$.
  • Identity: $d(p, q) = 0$ if and only if $p = q$.
  • Symmetry: $d(p, q) = d(q, p)$ for all $p$ and $q$.
  • Triangle inequality: $d(p, q) + d(q, r) \geq d(p, r)$ for all $p$, $q$ and $r$.

The Euclidean distance is a special case of a more general family, the $L_k$ (Minkowski) distance: $$ d(p, q) = \left(\sum_{i=1}^{n} |q_i - p_i|^k\right)^{1/k} $$
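
A quick sketch of the $L_k$ distance in NumPy (the function name and the example points are arbitrary choices); setting $k = 2$ recovers the Euclidean distance:

```python
import numpy as np

def minkowski_distance(p, q, k=2):
    """L_k distance between two points; k = 2 gives the Euclidean distance."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.abs(q - p) ** k) ** (1 / k)

p, q = [0, 0], [3, 4]
print(minkowski_distance(p, q, k=2))  # Euclidean distance: 5.0
print(minkowski_distance(p, q, k=1))  # Manhattan distance: 7.0
```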

$K$-nearest neighbours

Among these classifiers we find the $k$-nearest neighbours algorithm. By taking a majority vote among the $k$ closest training points, this method assigns each new point a class label.
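
Here is a minimal from-scratch sketch of this idea, using the Euclidean distance and a simple majority vote (the function name `knn_predict` and the toy data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    dists = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Labels of the k closest training points
    nearest_labels = y_train[np.argsort(dists)[:k]]
    # Majority vote decides the class label
    return Counter(nearest_labels).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> 0
```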

Determining the optimal $k$

If there are two classes, choosing an odd value of $k$ avoids ties. If $k$ is too small, noise has a greater influence; if it is too large, the neighbourhood starts to include points from other classes. A practical approach is to compute and compare confusion matrices for different values of $k$, as sketched below.
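
One way to do this in practice, sketched here with scikit-learn (the Iris dataset and the candidate values of $k$ are only stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a k-NN classifier for each candidate k and compare confusion matrices
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k = {k}")
    print(confusion_matrix(y_test, model.predict(X_test)))
```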

Data Clustering

Beyond classification lies the realm of data clustering, where patterns and structures emerge from the data’s depths. Through clustering, we gain insights into the underlying relationships and groupings that shape our understanding of the data landscape.

Clustering methods

The $k$-means algorithm partitions the data into $k$ clusters. It iteratively assigns each data point to the nearest cluster and recalculates the cluster centroids.

Hierarchical clustering, on the other hand, builds a tree of clusters. It starts with each data point as a cluster and merges the closest clusters until only one cluster remains.

$K$-means clustering

The $k$-means algorithm is a simple yet powerful tool for clustering.

The algorithm works as follows:

  1. Choose a value of $k$.
  2. Choose the initial positions of the $k$ cluster centres.
  3. Choose a distance metric.
  4. Assign each data point to the nearest cluster centre.
  5. Recalculate the cluster centres as the mean of the data points in each cluster.
  6. Repeat steps 4 and 5 until convergence.
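
A minimal NumPy sketch of these steps (initial centres drawn at random from the data, Euclidean distance as the metric; empty clusters are not handled):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Bare-bones k-means following the steps above."""
    rng = np.random.default_rng(seed)
    # Steps 1-3: k is given, the initial centres are random data points,
    # and the metric is the Euclidean distance.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 4: assign each point to its nearest cluster centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centre as the mean of its assigned points
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centres = k_means(X, k=2)
```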

The elbow method

How do we pick the optimal value of $k$?

Let the diameter of a clustering be the longest intra-cluster distance. The elbow method involves plotting the diameter of the clustering as a function of $k$. The optimal value of $k$ is the value at the “elbow” of the curve, beyond which increasing $k$ yields only small reductions in the diameter.
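
A sketch of how such a plot might be produced, using scikit-learn's KMeans and a small helper for the clustering diameter (the synthetic three-blob data is only for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

def clustering_diameter(X, labels):
    """Longest distance between two points that share a cluster."""
    return max(pdist(X[labels == c]).max()
               for c in np.unique(labels) if np.sum(labels == c) > 1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])

ks = range(1, 8)
diameters = [clustering_diameter(X, KMeans(n_clusters=k, n_init=10,
                                           random_state=0).fit_predict(X))
             for k in ks]

plt.plot(list(ks), diameters, marker="o")
plt.xlabel("$k$")
plt.ylabel("clustering diameter")
plt.show()
```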

Limitations of $k$-means

The $k$-means algorithm is sensitive to the initial choice of cluster centres. It is also sensitive to the choice of $k$. The algorithm is not suitable for clusters of different sizes and densities.

In summary, $k$-means works best for:

  • Roughly spherical clusters.
  • Clusters of similar diameter.
  • Clusters of similar size.

Hierarchical clustering

Sometimes called agglomerative clustering, hierarchical clustering builds a tree of clusters.

The algorithm works as follows:

  1. Start with each data point as a cluster.
  2. Merge the two closest clusters.
  3. Repeat step 2 until only one cluster remains.
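
A short sketch of this procedure using SciPy's hierarchical clustering routines (the two-blob data and the choice of single linkage, i.e. the distance between the closest pair of points, are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Build the tree by repeatedly merging the two closest clusters
Z = linkage(X, method="single", metric="euclidean")

# Cut the tree to get a flat assignment into two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```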

Rand index

The Rand index is a similarity measure between two clusterings. It considers all pairs of samples and counts the pairs that are assigned to the same or to different clusters in both the predicted and the true clusterings.

$$ RI = \frac{\text{Number of agreeing pairs}}{\text{Total number of pairs}} $$
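
This definition translates almost directly into code; the helper below is a naive $O(n^2)$ version, with scikit-learn's `rand_score` shown for comparison:

```python
from itertools import combinations
from sklearn.metrics import rand_score

def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreeing = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreeing / len(pairs)

true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]  # same grouping, different label names
print(rand_index(true, pred))  # 1.0
print(rand_score(true, pred))  # 1.0 (scikit-learn's implementation)
```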