4.2 Unsupervised classification

Clustering is an unsupervised learning technique that groups data points so that points within the same group are more similar to each other than to points in other groups. A group of similar data points is called a cluster. The learning is said to be unsupervised because the groups are not labeled in advance but discovered by the algorithm; consequently, there is no guarantee that well-separated clusters actually exist in the data.

4.2.1 Hierarchical Clustering on Principal Components (HCPC)

The HCPC approach combines three standard methods used in multivariate data analysis (a minimal pipeline sketch follows the list):

  • Principal component methods such as MCA (Multiple Correspondence Analysis),
  • Agglomerative hierarchical clustering,
  • Consolidation by k-means partitioning.
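
FactoMineR's HCPC function implements this chain in R. As a rough illustration only, the three stages can be reproduced in Python on toy numeric data, with PCA standing in for MCA and scikit-learn/scipy as assumed dependencies:

```python
# A minimal sketch of the three HCPC stages on toy numeric data, with PCA
# standing in for MCA. The real HCPC lives in R's FactoMineR package.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                       # toy data set

# Stage 1: principal component method (keep the first few components).
scores = PCA(n_components=3).fit_transform(X)

# Stage 2: agglomerative hierarchical clustering with the Ward criterion,
# cutting the tree into 3 clusters.
labels = fcluster(linkage(scores, method="ward"), t=3, criterion="maxclust")

# Stage 3: consolidation: k-means initialised on the clusters' barycenters.
centers = np.vstack([scores[labels == k].mean(axis=0) for k in np.unique(labels)])
consolidated = KMeans(n_clusters=3, init=centers, n_init=1).fit_predict(scores)
```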

4.2.2 Agglomerative Hierarchical Clustering (AHC)

Algorithm

  1. Each data point is initially considered an individual cluster;
  2. Compute the proximity matrix, which contains the pairwise distances between all current clusters;
  3. Merge the two closest clusters and update the proximity matrix; repeat until only a single cluster remains (see the sketch below).
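
These three steps are what scipy.cluster.hierarchy.linkage carries out internally; a minimal sketch on toy data, assuming scipy and matplotlib are available:

```python
# Minimal AHC sketch: linkage() starts from singleton clusters, then
# repeatedly merges the two closest clusters and updates the proximities.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))          # step 1: 20 singleton clusters

prox = pdist(X, metric="euclidean")   # step 2: pairwise proximity matrix
tree = linkage(prox, method="ward")   # step 3: merges until one cluster remains

# Each row of `tree` records one merge: the two clusters joined,
# their distance, and the size of the resulting cluster.
dendrogram(tree)
plt.show()
```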

Distance between two observations

The choice of the distance metric is a critical step in clustering. It defines how the similarity of two elements \((x, y)\) is calculated, and it influences the shape of the clusters. There are several distance measures, such as Euclidean, Manhattan, \(\chi^2\), etc. In our study, the classical Euclidean distance is chosen; it is defined as follows:

\[\begin{equation} d_{euc}(x,y) = \sqrt{\sum_{i=1}^n(x_i - y_i)^2} \tag{4.3} \end{equation}\]
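
For instance, Equation (4.3) can be checked numerically on two toy vectors (scipy assumed available):

```python
# Equation (4.3) computed by hand and via scipy on two toy vectors.
import numpy as np
from scipy.spatial.distance import euclidean

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

d_manual = np.sqrt(np.sum((x - y) ** 2))   # sqrt(1 + 4 + 0) = sqrt(5)
assert np.isclose(d_manual, euclidean(x, y))
```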

Similarities between clusters - Ward method

The way the similarity between two clusters is calculated determines which clusters get merged. Here, the Ward method (Scholler 2020) is chosen: at each step it merges the two groups whose fusion increases the within inertia, i.e. degrades the homogeneity of the clusters, as little as possible. Mathematically, the Ward criterion is defined as follows:

\[\begin{equation} \Delta (A, B) = \frac{1}{n} \times \frac{n_A n_B}{n_A + n_B}d^2(g_A, g_B) \tag{4.4} \end{equation}\]

with \(g_A\) and \(g_B\) the clusters’ barycenters (their mean points), \(n_A\) and \(n_B\) the clusters’ sizes, and \(n\) the total number of observations.

The Ward criterion is to be minimized: it tends to group together clusters whose barycenters are close to each other, as well as clusters of small size.
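
As a worked instance of Equation (4.4), the criterion can be evaluated directly on two small toy clusters:

```python
# Equation (4.4) evaluated on two toy clusters A and B.
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])               # n_A = 2
B = np.array([[4.0, 0.0], [5.0, 0.0], [6.0, 0.0]])   # n_B = 3

n_A, n_B = len(A), len(B)
n = n_A + n_B                                        # all observations here
g_A, g_B = A.mean(axis=0), B.mean(axis=0)            # barycenters

delta = (1 / n) * (n_A * n_B / (n_A + n_B)) * np.sum((g_A - g_B) ** 2)
print(delta)   # increase in within inertia if A and B were merged
```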

The Ward criterion also guides the choice of the number of clusters. Let \(\Delta_k\) be the increase in within inertia when going from \(k+1\) groups to \(k\) groups. A suitable number of classes \(k^*\) is then one for which the ratio \(\frac{\Delta_{k^*}}{\Delta_{k^*-1}}\) is as small as possible: reaching \(k^*\) groups costs little additional within inertia, while merging any further would cost much more.
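
A hedged sketch of this rule: with scipy's Ward linkage, the squared merge heights are proportional to the inertia increments \(\Delta_k\), so the ratios can be read off the tree. The toy data below has three well-separated groups, so \(k^* = 3\) is expected:

```python
# Choosing k* from the ratio of successive within-inertia increments.
# Assumption: for scipy's Ward linkage, squared merge heights are
# proportional to the increments Delta_k, so their ratios are usable as-is.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, size=(30, 2)) for c in (0.0, 5.0, 10.0)])

heights = linkage(X, method="ward")[:, 2] ** 2
deltas = {k: heights[-k] for k in range(1, 6)}       # Delta_k: k+1 -> k groups
ratios = {k: deltas[k] / deltas[k - 1] for k in range(2, 6)}
k_star = min(ratios, key=ratios.get)
print(k_star)   # expected: 3
```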

4.2.3 The k-means algorithm

\(k\) data points are chosen as initial centers. The following steps are repeated until the identified clusters are homogeneous enough or until a fixed number of iterations is reached (a numpy transcription of these steps follows the list):

  1. The distances between the data points and the centers are computed;
  2. Each data point is assigned to the nearest center;
  3. The \(k\) previous centers are replaced by the barycenters of the \(k\) classes identified during the previous step.
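
A literal numpy transcription of these steps, as a sketch rather than a production implementation (sklearn.cluster.KMeans would normally be used; the helper name k_means is made up here):

```python
# Plain k-means loop mirroring steps 1-3 above (sketch: assumes no
# cluster ever becomes empty during the iterations).
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial centers
    for _ in range(n_iter):
        # Step 1: distances between every data point and every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step 2: assign each data point to the nearest center.
        labels = dists.argmin(axis=1)
        # Step 3: replace the centers by the barycenters of the new classes.
        new_centers = np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # homogeneous enough: stop
            break
        centers = new_centers
    return labels, centers
```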

References

Scholler, Julie. 2020. M1 Analyse de Données Exploratoire - Classification Non Supervisée. Université de Tours. https://juliescholler.gitlab.io/publication/m1ade-2021/M1-AnaExpl-ClassifNonSup-etu.pdf.