This tutorial shows how to use k-means clustering in Python with Scikit-Learn, installed via conda.

## 1. K-Means Clustering

#### 1.1. What is K-means

K-means is an unsupervised non-hierarchical clustering algorithm. It allows the observations of the data set to be grouped into K distinct clusters. Thus, similar data will be found in the same cluster. Furthermore, an observation can only be found in one cluster at a time. The same observation cannot, therefore, belong to two different clusters.

#### 1.2. What is clustering

Clustering is an unsupervised learning method. Thus, we do not try to learn a correlation relation between a set of features X of observation and a predicted value Y, as is the case with supervised learning. Instead, unsupervised learning will find patterns in the data. In particular, by grouping things that are alike.

The most common way to choose the number of clusters is to run K-Means with different values of K and calculate the within-cluster variance, i.e. the sum of the distances between each cluster's centroid and the observations assigned to that cluster. We then look for a value of K for which the clusters minimize the distance between their centers (centroids) and the observations they contain. This is known as minimizing the intra-cluster distance.
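This search over K can be sketched with scikit-learn, whose `inertia_` attribute is exactly the intra-cluster sum of squared distances described above. The dataset below is made up purely for illustration:

```
import numpy as np
from sklearn.cluster import KMeans

# hypothetical 2-D dataset, just for illustration
rng = np.random.default_rng(0)
data = rng.random((100, 2))

# inertia_ is the intra-cluster sum of squared distances that K-Means minimizes
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(model.inertia_)
    print(f"k={k}: inertia={model.inertia_:.3f}")
```

Plotting these inertias against K and picking the "elbow", where the curve stops dropping sharply, is a common heuristic for choosing the number of clusters.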

The fields of application of K-Means are numerous; it is used, in particular, for:

- Customer segmentation according to certain criteria (demographics, purchasing habits, etc.)
- Data mining, where clustering is used during data exploration to identify groups of similar individuals; once these populations are detected, other techniques can be applied as needed.
- Document clustering, i.e. grouping documents according to their content (think of how Google News groups articles by topic).

#### 1.3. How the K-Means algorithm works

K-Means is a widely used unsupervised machine learning algorithm for clustering. It is used to divide a set of n observations into k clusters, where k is a user-defined parameter. The algorithm works by iteratively assigning each observation to the nearest cluster center and then recalculating the cluster center based on the mean of the assigned observations.

Here is a step-by-step explanation of how the K-Means algorithm works:

- Initialization: The first step is to initialize the cluster centers, also known as centroids. There are several methods for initializing the centroids, such as randomly selecting k observations from the data, or using a statistical method such as the K-Means++ algorithm.
- Assignment step: In this step, each observation is assigned to the nearest cluster center based on the Euclidean distance between the observation and the cluster center.
- Recalculation step: In this step, the cluster centers are recalculated as the mean of all the observations assigned to the respective cluster.
- Repeat steps 2 and 3 until convergence: The algorithm repeats the assignment and recalculation steps until either a maximum number of iterations is reached or the cluster centers no longer change.
- Final clusters: Once convergence is reached, the final clusters are the result of the K-Means algorithm.
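The steps above can be sketched as a minimal from-scratch implementation in NumPy. This is illustrative only; the function name and defaults are my own, and in practice you would use a library implementation such as scikit-learn's:

```
import numpy as np

def kmeans_sketch(points, k, max_iters=100, seed=0):
    """Minimal K-Means loop mirroring the steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k observations at random as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point goes to the nearest centroid (Euclidean)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recalculation step: each centroid becomes the mean of its points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

points = np.random.rand(50, 2)
labels, centroids = kmeans_sketch(points, k=3)
```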

It’s important to note that the final results of the K-Means algorithm can depend on the initial cluster centers, and the algorithm may get stuck in a local optimum rather than finding the global optimum. To mitigate this, it’s common to run the algorithm multiple times with different initializations and choose the best result.
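In scikit-learn this multi-start strategy is built in: the `n_init` parameter reruns the algorithm with different centroid seeds and keeps the run with the lowest inertia. A minimal sketch, on a made-up dataset:

```
import numpy as np
from sklearn.cluster import KMeans

# hypothetical dataset for illustration
data = np.random.rand(60, 2)

# n_init=10 runs K-Means ten times with different initial centroids and
# keeps the solution with the lowest intra-cluster sum of squared distances
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
```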

## 2. K-Means in Python

First, we need to install Scikit-Learn, which can be done quickly with conda using the anaconda channel, as shown below:

```
$ conda install -c anaconda scikit-learn
```

Now that scikit-learn is installed, the example below generates a random dataset of size seven by two, clusters it into 3 clusters with k-means, and prints the data organized by cluster.

```
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

# define the number of clusters
NUMBER_CLUSTERS = 3

# create a random dataset of 7 observations with 2 features each
random_dataset = np.random.rand(7, 2)

# cluster the dataset into the pre-defined number of clusters
kmeans = KMeans(n_clusters=NUMBER_CLUSTERS, n_init=10).fit(random_dataset)

# group the observations by their assigned cluster label
clusters = defaultdict(list)
for i, label in enumerate(kmeans.labels_):
    clusters[label].append(random_dataset[i])

# print the results by cluster
for cluster_name in sorted(clusters):
    print("Cluster Name: {}".format(cluster_name))
    for data in clusters[cluster_name]:
        print("    Data {}".format(data))
```

When I ran the code, the variable random_dataset contained the following values:

```
[[0.25897058 0.18748314]
[0.70125152 0.06717995]
[0.455812 0.26494425]
[0.96850191 0.11605294]
[0.35323561 0.94662841]
[0.63322816 0.88710228]
[0.59380873 0.63386557]]
```

The algorithm grouped these observations into 3 clusters:

```
Cluster Name: 0
Data [0.25897058 0.18748314]
Data [0.455812 0.26494425]
Cluster Name: 1
Data [0.35323561 0.94662841]
Data [0.63322816 0.88710228]
Data [0.59380873 0.63386557]
Cluster Name: 2
Data [0.70125152 0.06717995]
Data [0.96850191 0.11605294]
```
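A fitted model can also assign previously unseen observations to the learned clusters via `predict`. The sketch below refits on a fresh random dataset as in the example above (the actual labels will differ from run to run, since both the data and the initialization are random):

```
import numpy as np
from sklearn.cluster import KMeans

# refit on a random dataset, as in the example above
random_dataset = np.random.rand(7, 2)
kmeans = KMeans(n_clusters=3, n_init=10).fit(random_dataset)

# assign new, unseen observations to the nearest learned centroid
new_points = np.array([[0.1, 0.2], [0.9, 0.9]])
predicted = kmeans.predict(new_points)
print(predicted)
```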

Last but not least, the example used in this tutorial was straightforward; however, it generalizes to any number of clusters and any dataset size.