Have you noticed how, at a conference or event, people tend to group based on their similarities? They may come from the same working environment, company, or school. For example, they may share expertise in machine learning, software development, or chemistry. You may even find people talking about a common friend. The reason such groups form is a point of similarity among the people within them. Similarly, clustering methods, as unsupervised machine learning approaches, use the similarities between data points to group them into different clusters.

**Clustering**

Clustering is an unsupervised process of identifying groups of similar data points. The similarities are determined using the feature values, i.e., the characteristics of each data point. There is no unknown property to be predicted in clustering; everything is based on known data point features. Now that we know what clustering means, the question is why we would want to use it. Clustering can help us to:

1) partition the data points into smaller groups;

2) provide insight into underlying patterns within the smaller groups;

3) understand the relationships between clusters and the features that drive their separation.

This is the kind of information we can extract from the results of any clustering method. However, depending on the clustering algorithm we use, we can obtain further insights about a dataset. Here, we review partitional and hierarchical clustering, two widely used types of clustering approaches.

**Data point similarity**

Identifying clusters of similar data points relies on assessing the similarity of pairs of data points. The distance between pairs of data points can be calculated using different distance measures, such as Euclidean distance, Manhattan distance, Minkowski distance, Chebyshev distance, cosine distance (one minus cosine similarity), or Hamming distance. Data points with lower distances are then considered more similar and are grouped together according to each clustering algorithm.
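As a quick sketch of how these distance measures behave, here is a comparison on a single pair of points using SciPy; the two example points are arbitrary assumptions chosen only for illustration:

```python
import numpy as np
from scipy.spatial import distance

# Two arbitrary example points (illustrative assumption, not from the post)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))       # sqrt((1-4)^2 + (2-0)^2 + 0^2) = sqrt(13)
print(distance.cityblock(a, b))       # Manhattan: |1-4| + |2-0| + 0 = 5
print(distance.minkowski(a, b, p=3))  # generalizes Euclidean (p=2) and Manhattan (p=1)
print(distance.chebyshev(a, b))       # largest per-coordinate difference = 3
print(distance.cosine(a, b))          # 1 - cosine similarity
```

Which measure is appropriate depends on the data: Hamming distance, for instance, is meant for categorical or binary features rather than continuous ones.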

**Partitional clustering**

Partitional clustering identifies non-overlapping groups of data points. Each group thus provides us with a smaller set of similar data points. However, intra- or inter-cluster relationships between data points will not be directly available. A famous example of such algorithms is k-means clustering, an iterative approach with the following six steps (Figure 1):

**Figure 1. Schematic representations of algorithmic steps in k-means clustering for 9 data points. This example is chosen to make sure the clusters can be identified after one iteration, which rarely happens in reality.**

- Choosing the number of clusters (k)
- Randomly selecting k data points as initial cluster centers
- Calculating the distance of every data point to the centers chosen in step 2
- Assigning each data point to the nearest center
- Calculating the new center of each cluster and repeating steps 3 and 4
- Continuing steps 3 to 5 until convergence (stopping the process)
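The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the six toy points and k = 2 are assumptions, and the initial centers are fixed rather than random (step 2) so the example is reproducible:

```python
import numpy as np

# Toy 2-D data: two well-separated blobs (illustrative assumption)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

k = 2                          # step 1: choose the number of clusters
# step 2: pick initial centers (fixed here for reproducibility; normally random)
centers = X[[0, 3]].copy()

for _ in range(100):           # step 6: repeat until convergence
    # step 3: distance of every point to each current center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # step 4: assign each point to its nearest center
    labels = dists.argmin(axis=1)
    # step 5: recompute each cluster's center as the mean of its points
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):  # centers stopped moving: converged
        break
    centers = new_centers

print(labels)  # [0 0 0 1 1 1]: the first three points form one cluster
```

In practice one would use a library implementation (e.g., scikit-learn's `KMeans`), which also handles multiple random restarts, since k-means can converge to a poor solution from an unlucky initialization.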

In step one of this process, the number of clusters (k) must be specified by the user of k-means as an input. k is a hyperparameter of k-means, which means that it is not optimized automatically during cluster identification but needs to be determined by the user. However, there are ways to choose a good value for this hyperparameter, such as the elbow method, in which k is selected to yield low distortion, i.e., a low sum of squared errors (SSE) based on the distances to the cluster centers, while not generating very small (possibly single-data-point) clusters.
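A hedged sketch of the elbow method using scikit-learn (the synthetic two-blob dataset is an illustrative assumption): we run k-means for a range of k values and record the SSE, which scikit-learn exposes as `inertia_`. The "elbow" is the k after which the SSE stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two Gaussian blobs, so the true elbow is at k = 2
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

sse = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to the nearest center

for k, v in sse.items():
    print(k, round(v, 2))
# The SSE drops sharply from k=1 to k=2 and flattens afterwards: the elbow.
```

Plotting `sse` against k makes the elbow easy to spot visually; here, the large drop between k = 1 and k = 2 followed by only marginal improvements identifies k = 2.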

**Hierarchical clustering**

Hierarchical clustering results in a set of nested clusters instead of partitioning the data into non-overlapping groups as in partitional clustering. A tree structure, usually illustrated with a dendrogram (Figure 2), shows this nested grouping. As shown in Figure 2, the data points are not separated into non-overlapping groups, but we can identify groups by cutting the tree (dendrogram) at a specific height. Agglomerative and divisive clustering are the two main types of hierarchical clustering, building the nested clusters from the bottom (individual data points) or from the top (all data points), respectively.

Agglomerative hierarchical clustering is used more often in scientific research, and the process can be summarized in the following steps (Figure 2):

- Computing the dissimilarity or similarity between every pair of data points
- Using a linkage function to group data points into a hierarchical cluster tree
  - The linkage function determines the distance between sets of data points as a function of the pairwise distances between data points in the groups (Figure 3).
  - Examples of linkage functions include single, complete, average, median, and centroid.
- Deciding where to cut the hierarchical tree into clusters

**Figure 2. Schematic representation of hierarchical clustering, starting with groupings of the closest individual data points (singletons) and continuing until all data points are covered.**

**Figure 3. Schematic representation of the single and complete linkage functions as the minimum and maximum distances between pairs of data points in two different groups, respectively.**

In the next post of this series, we will talk about neural networks and deep learning models. We also plan to introduce several other fundamental topics in machine learning, such as graph neural networks and transfer learning!

Stay tuned!

Author: Ali Madani

Editor: Andreas Windemuth