Painless Kmeans in Python – Step-by-Step with Sklearn

This tutorial shows how to use k-means clustering in Python with Scikit-Learn, installed using conda.

1. K-Means Clustering

1.1. What is K-means

K-means is an unsupervised, non-hierarchical clustering algorithm. It groups the observations of a dataset into K distinct clusters, so that similar data end up in the same cluster. Each observation belongs to exactly one cluster at a time; the same observation can never belong to two different clusters.

1.2. What is clustering

Clustering is an unsupervised learning method. Unlike supervised learning, we do not try to learn a mapping between a set of features X and a predicted value Y. Instead, unsupervised learning finds patterns in the data, in particular by grouping things that are alike.

To group a dataset into K distinct clusters, the K-Means algorithm needs a way to compare the degree of similarity between observations. Two observations that resemble each other will have a small dissimilarity distance, while two very different observations will have a larger one.

Choosing the number of clusters K is not necessarily intuitive, especially when the dataset is large and you have no a priori knowledge or assumptions about the data. A K that is too large can lead to overly fragmented partitioning, which prevents the discovery of interesting patterns. On the other hand, a K that is too small will likely produce overly general clusters containing a lot of data, leaving no “fine” patterns to discover. For the same dataset, there is no single possible clustering. The difficulty therefore lies in choosing a number of clusters K that highlights interesting patterns in the data. Unfortunately, there is no automated process for finding the correct number of clusters.

The most common way to choose the number of clusters is to run K-Means with different values of K and calculate the variance of the resulting clusters. Here, the variance is the sum of the (squared) distances between each observation in a cluster and the centroid of that cluster. We thus look for a number of clusters K such that the selected clusters minimize the distance between their centers (centroids) and the observations in the same cluster; this is called minimizing the intra-cluster distance.
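
As a minimal sketch of this procedure (often called the elbow method), the loop below fits K-Means for several candidate values of K and prints the intra-cluster sum of squared distances, which scikit-learn exposes as the inertia_ attribute. The dataset and the range of K values are arbitrary choices for the example:

import numpy as np
from sklearn.cluster import KMeans

# arbitrary example dataset: 100 observations with 2 features
data = np.random.rand(100, 2)

# fit k-means for several candidate values of K and print the
# intra-cluster sum of squared distances (exposed as inertia_)
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10).fit(data)
    print("K = {:2d}  intra-cluster variance = {:.4f}".format(k, kmeans.inertia_))

Plotting the inertia against K and looking for the “elbow” where the curve stops dropping sharply is the usual way to pick K.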

The fields of application of K-Means are numerous; it is used in particular for:

  • Customer segmentation according to certain criteria (demographics, purchasing habits, etc.)
  • Data mining: using clustering to identify groups of similar individuals; once these populations are detected, other techniques can be applied to them as needed.
  • Document clustering: grouping documents according to their content (think of how Google News groups documents by topic), as sketched after this list.
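
To illustrate the last use case, documents can be turned into numeric vectors (here with TF-IDF) and then clustered with K-Means. The toy corpus and the choice of 2 clusters below are made up for the example:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up toy corpus for the example
documents = [
    "the stock market fell sharply today",
    "investors worry about rising interest rates",
    "the team won the championship game",
    "the striker scored twice in the final",
]

# turn each document into a TF-IDF feature vector
vectors = TfidfVectorizer().fit_transform(documents)

# group the documents into 2 clusters (a guess for this toy corpus)
kmeans = KMeans(n_clusters=2, n_init=10).fit(vectors)
print(kmeans.labels_)  # cluster index assigned to each document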

1.3. How the K-Means algorithm works

k-means is an iterative algorithm that minimizes the sum of the distances between each observation and the centroid of its cluster. The initial choice of centroids influences the final result.

Given a cloud of points, K-Means alternates between two steps: each observation is assigned to its nearest centroid, and each centroid is then recomputed as the mean of the observations assigned to it. Points are reassigned between clusters in this way until the sum of distances can no longer decrease. The result is a set of compact, well-separated clusters, provided the correct value of K is chosen for the number of clusters.
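
To make these two steps concrete, here is a minimal NumPy sketch of the iteration. This is not scikit-learn's actual implementation (which adds smarter initialization and optimizations), and empty clusters are not handled, to keep the sketch short:

import numpy as np

def simple_kmeans(data, k, iterations=100):
    # initialization: pick k distinct observations as the starting centroids
    centroids = data[np.random.choice(len(data), k, replace=False)]
    for _ in range(iterations):
        # assignment step: attach each observation to its nearest centroid
        distances = np.linalg.norm(data[:, None] - centroids, axis=2)
        labels = distances.argmin(axis=1)
        # update step: move each centroid to the mean of its cluster
        centroids_new = np.array([data[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(centroids_new, centroids):
            break  # centroids stopped moving: the sum can no longer decrease
        centroids = centroids_new
    return labels, centroids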

2. K-Means in Python

First, we need to install Scikit-Learn, which can be quickly done using conda, as shown below:

$ conda install -c anaconda scikit-learn

Now that scikit-learn is installed, the example below generates a random dataset of size seven by two, clusters the data into 3 clusters using k-means, and prints the data organized by cluster.

from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans

# define number of clusters
NUMBER_CLUSTERS = 3

# create random dataset
random_dataset = np.random.rand(7, 2)

# cluster the dataset into the pre-defined number of clusters
kmeans = KMeans(n_clusters=NUMBER_CLUSTERS).fit(random_dataset)

# read the classification and store it in a dictionary
clusters = defaultdict(list)
for i, label in enumerate(kmeans.labels_):
    clusters[label].append(random_dataset[i])

# print results by cluster
for cluster_name in sorted(clusters):
    print("Cluster Name: {}".format(cluster_name))

    for data in clusters[cluster_name]:
        print("     Data {}".format(data))

When I ran the code, the variable random_dataset contained the following values:

[[0.25897058 0.18748314]
 [0.70125152 0.06717995]
 [0.455812   0.26494425]
 [0.96850191 0.11605294]
 [0.35323561 0.94662841]
 [0.63322816 0.88710228]
 [0.59380873 0.63386557]]

These were ultimately grouped into the following 3 clusters:

Cluster Name: 0
     Data [0.25897058 0.18748314]
     Data [0.455812   0.26494425]
Cluster Name: 1
     Data [0.35323561 0.94662841]
     Data [0.63322816 0.88710228]
     Data [0.59380873 0.63386557]
Cluster Name: 2
     Data [0.70125152 0.06717995]
     Data [0.96850191 0.11605294]

Last but not least, the example used in this tutorial was straightforward, but it generalizes to any number of clusters and any dataset size.
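
For example, continuing with the kmeans object fitted in the example above, the same model can assign new observations to the learned clusters; the two new points here are invented for illustration:

# invented new observations with the same two features
new_points = np.array([[0.1, 0.2], [0.9, 0.9]])

# assign each new point to its nearest learned centroid
print(kmeans.predict(new_points))

# coordinates of the three centroids found during fitting
print(kmeans.cluster_centers_)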

More Resources