Introduction to Clustering
1. What is Clustering?
Clustering is the process of grouping similar data points together. It's a family of unsupervised learning algorithms, meaning it doesn't require any pre-existing labels. Instead, it aims to discover patterns and relationships within the data.
2. K-Means clustering algorithm
2.1. Algorithm description
- The k-means clustering algorithm aims to partition a dataset into a pre-defined number (represented by the letter k) of distinct, non-overlapping clusters.
- Each cluster contains data points that are more similar to each other than to data points in other clusters.
- Each cluster is represented by a central point called the 'centroid', calculated as the mean of all the data points within that cluster.
2.2. The algorithm (Steps)
1. Start.
2. Choose the number of required clusters (k).
3. Select k random data points from the dataset to serve as the initial set of centroids that represent the centers of each cluster.
4. Calculate the distance between each data point and each centroid using the Euclidean distance.
5. Assign each data point to its closest centroid.
6. Recalculate the centroids by taking the mean of all the data points assigned to each cluster.
7. Recalculate the distance between each data point and each centroid using the Euclidean distance.
8. Repeat steps 6 and 7 until convergence is reached, i.e., until the cluster assignments no longer change.
9. Display the final cluster label for each data point.
10. Stop.
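The following is a minimal NumPy sketch of the steps above (the function name k_means and its parameters are my own choices, and edge cases such as empty clusters are not handled):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=42):
    """Cluster the rows of X into k groups, following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 3: select k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 4-5: compute the Euclidean distance from every point to every
        # centroid, then assign each point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 6: recalculate each centroid as the mean of its assigned points
        # (this sketch does not handle the edge case of an empty cluster).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 8: convergence check - stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on random 2-D data:
X = np.random.default_rng(0).normal(size=(200, 2))
labels, centroids = k_means(X, k=3)
```

In practice, library implementations such as sklearn.cluster.KMeans add smarter initialization (k-means++) and multiple restarts on top of these same steps.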
2.3. Evaluating cluster quality
1. Inertia: Inertia measures the sum of squared distances between each data point and the centroid of the cluster it is assigned to. Lower inertia indicates tighter, more compact clusters.
2. Average Silhouette Coefficient/Silhouette Score: The silhouette score is a measure of how well the data points are clustered. Its value ranges between -1 and 1. A value closer to +1 indicates that the data points within each cluster are very similar to each other and well separated from the data points in other clusters. Values closer to 0 indicate overlapping clusters, and values closer to -1 indicate that data points may have been assigned to the wrong cluster.
To calculate the average silhouette score, we calculate the cohesion, separation, and silhouette coefficient for each data point.
- Cohesion (a): The average distance from a data point to all other data points within the same cluster. This measures how closely a data point is related to its own cluster.
- Separation (b): The average distance from a data point to all data points in the nearest neighboring cluster.
- Silhouette coefficient (s): For each data point, s = (b - a) / max(a, b); the average silhouette score is the mean of s over all data points.
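Both metrics can be computed with scikit-learn, as in this illustrative sketch (the make_blobs data is synthetic and used only for demonstration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups, purely for demonstration.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Inertia: sum of squared distances of the points to their closest centroid.
print("Inertia:", kmeans.inertia_)

# Average silhouette score over all data points (-1 to +1, higher is better).
print("Silhouette score:", silhouette_score(X, kmeans.labels_))
```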
2.4. Determining the optimal value of k for clustering
- Elbow method: Plot different k values against their respective inertia values and choose the k at which inertia stops decreasing significantly (the "elbow"), as illustrated in the sketch after this list.
- Silhouette analysis: Calculate the average silhouette coefficient for different k values and choose the k value with the highest average silhouette coefficient.
- Domain knowledge: Consider the number of underlying groups based on prior understanding of the data.
- Visualization: Reduce the data to 2-3 dimensions and visualize the clusters using a scatter plot to assess the optimal value of k.
- Interpreting the dendrogram from hierarchical clustering: Implement the hierarchical clustering algorithm and visualize the dendrogram to identify the optimal number of clusters present within the data.
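A minimal sketch of the elbow method using scikit-learn and matplotlib (the synthetic data and the range of k values are illustrative assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, purely for demonstration.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit k-means for a range of k values and record the inertia of each fit.
k_values = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in k_values]

# The "elbow" is the k after which inertia stops decreasing sharply.
plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```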
2.5. Additional things to consider
- Incorrect values of k can lead to poor results.
- A small change in the selection of the initial centroids can lead to drastically different results.
- The algorithm is highly sensitive to outliers within the dataset.
- The algorithm is not ideal for categorical/qualitative variables within the dataset.
- The algorithm doesn't perform well with high-dimensional data.
- The algorithm doesn't perform well with clusters of varying shapes or densities, or with non-convex, non-spherical structures.
- The optimal distance measure needs to be determined based on the nature of the dataset; since the Euclidean distance is sensitive to feature scale, a common preprocessing step is sketched after this list.
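Because the Euclidean distance is sensitive to feature scale, standardizing the data before clustering is a common mitigation. A minimal scikit-learn sketch, assuming numeric features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data, purely for demonstration.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Standardize every feature to zero mean and unit variance so that no single
# feature dominates the Euclidean distance calculation.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)
```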
Continue to the next section on Hierarchical clustering by clicking here.
🧑🏽‍💻 Thank you for reading this document. If you have any feedback or want to make an addition to the information, you can email me at amandeepsinghkhanna@gmail.com.