Introduction to Clustering

Table of contents

  1. What is Clustering?

  2. K-Means clustering algorithm

    2.1. Algorithm description

    2.2. The algorithm (steps)

    2.3. Evaluating cluster quality

    2.4. Determining the optimal value of k for clustering

1. What is Clustering?

Clustering is the process of grouping similar data points together. It belongs to the family of unsupervised learning algorithms, meaning it doesn’t require any pre-existing labels. Instead, it aims to discover patterns and relationships within the data.

2. K-Means clustering algorithm

2.1. Algorithm description

K-Means partitions a dataset into k non-overlapping clusters. Each cluster is represented by its centroid (the mean of the points assigned to it). The algorithm alternates between assigning each point to its nearest centroid and recomputing the centroids, which progressively reduces the within-cluster sum of squared distances until the assignments stop changing.

2.2. The algorithm (steps)

  1. Start

  2. Choose the number of required clusters (k).

  3. Select k random data points from the dataset to serve as initial set of centroids that represent the centers of each cluster.

  4. Calculate the distance between each data point and each centroid using Euclidean distance.

  5. Assign each data point to its closest centroid.

  6. Recalculate the centroids by taking the mean of all the data points assigned to each cluster.

  7. Recalculate the distance between each data point and each centroid using Euclidean distance.

  8. Repeat steps 5 to 7 until convergence is reached, i.e. the cluster assignments (or centroids) no longer change.

  9. Display the final cluster label for each data point.

  10. Stop
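The steps above can be sketched in plain NumPy. This is a minimal illustration, not a production implementation (it does not, for example, use multiple random restarts the way library implementations do); the function name and `seed` parameter are my own:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Minimal K-Means sketch following the numbered steps above."""
    rng = np.random.default_rng(seed)
    # Step 3: select k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 4-5: Euclidean distance to every centroid, assign the closest.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 6: recompute each centroid as the mean of its assigned points
        # (keeping the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 8: stop when the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On well-separated data the loop typically converges in a handful of iterations; because the initial centroids are random, different seeds can produce different (locally optimal) clusterings.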

2.3. Evaluating cluster quality

1. Inertia: Inertia is the sum of the squared distances of all the data points from the centroid of their cluster. Lower inertia means more compact clusters.

2. Average Silhouette Coefficient/Silhouette Score: The silhouette score is a measure of how well the data points are clustered. Its value ranges between -1 and 1. A value closer to +1 indicates that the data points within each cluster are very similar to one another and well separated from the data points in other clusters. Values close to 0 indicate overlapping clusters, and negative values indicate data points that may have been assigned to the wrong cluster.

To calculate the average silhouette score, for each data point we compute its cohesion a (the mean distance to the other points in its own cluster), its separation b (the mean distance to the points in the nearest neighbouring cluster), and then its silhouette coefficient:

\[silhouette\,coefficient = \dfrac{b-a}{\max(a,b)}\]
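As a quick sketch of the metric in practice, assuming scikit-learn is installed (the two-blob dataset here is synthetic, chosen so the clusters are well separated and the score lands near +1):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs, so the score should be close to +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(8, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# silhouette_score averages (b - a) / max(a, b) over all data points.
score = silhouette_score(X, labels)
```

If the blobs were moved closer together until they overlapped, the same code would report a score drifting toward 0.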

2.4. Determining the optimal value of k for clustering

  1. Elbow method: Plot the inertia for a range of k values and choose the k at the “elbow” of the curve, the point after which inertia stops decreasing significantly.

  2. Silhouette analysis: Calculate the silhouette coefficient for different k values and choose the k value with the highest average silhouette coefficient.

  3. Domain knowledge: Consider the number of underlying groups based on prior understanding of the data.

  4. Visualization: Reduce dimensions of data to 2-3 dimensions and visualize clusters using a scatter plot to assess the optimal value of k.

  5. Interpreting the dendrogram from Hierarchical clustering: Implement the Hierarchical clustering algorithm and visualize the dendrogram to identify the optimal number of clusters present within the data.
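The elbow and silhouette approaches can be combined in a few lines, again assuming scikit-learn; the three-blob dataset is synthetic, constructed so that the “true” k is 3:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups; the "true" k is 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 5, 10)])

inertias, sil_scores = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_            # sum of squared distances to centroids
    sil_scores[k] = silhouette_score(X, km.labels_)

# Silhouette analysis: pick the k with the highest average silhouette score.
best_k = max(sil_scores, key=sil_scores.get)
```

For the elbow method you would plot `inertias` against k (e.g. with matplotlib) and look for the bend in the curve; inertia alone keeps decreasing as k grows, which is why it is read visually rather than minimized.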


🧑🏽‍💻 Thank you for reading this document. If you have any feedback or want to make an addition to the information, you can email me at amandeepsinghkhanna@gmail.com.