Painless Kmeans in Python – Step-by-Step with Sklearn


Machine Learning

This tutorial shows how to use k-means clustering in Python with Scikit-Learn, which can be installed with conda.

Background on Kmeans

K-means clustering is a type of unsupervised learning (it works on unlabeled data) whose goal is to find k groups, or clusters, in the data; that is why it is called k-means.
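To make the idea concrete, here is a minimal from-scratch sketch of the classic k-means loop (Lloyd's algorithm) using NumPy. The function name kmeans_sketch, the fixed iteration count, and the random initialization are illustrative choices for this tutorial, not part of Scikit-Learn:

```python
import numpy as np

def kmeans_sketch(data, k, iterations=10, seed=0):
    """Minimal k-means (Lloyd's algorithm): assign points, update centroids, repeat."""
    rng = np.random.default_rng(seed)
    # start from k randomly chosen data points as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iterations):
        # assignment step: label each point with its nearest centroid
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids

points = np.random.rand(10, 2)
labels, centroids = kmeans_sketch(points, k=3)
print(labels)
```

Scikit-Learn's implementation does the same two steps but adds smarter initialization and a convergence check; the sketch above is only meant to show the mechanics.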

If you want to learn more about it, see the More Resources section at the end of this tutorial.

Kmeans in Python

First, we need to install Scikit-Learn, which can be easily done using conda as we show below:

$ conda install -c anaconda scikit-learn

Now that scikit-learn is installed, the example below generates a random dataset of size 7 by 2, clusters the data into 3 clusters using k-means, and prints the data organized by cluster.

import numpy as np

from collections import defaultdict
from sklearn.cluster import KMeans

# define the number of clusters
NUMBER_CLUSTERS = 3

# create random dataset
random_dataset = np.random.rand(7, 2)

# cluster the dataset into the pre-defined number of clusters
kmeans = KMeans(n_clusters=NUMBER_CLUSTERS).fit(random_dataset)

# read the cluster assignment of each point and group the points in a dictionary
clusters = defaultdict(list)
for i, label in enumerate(kmeans.labels_):
    clusters[label].append(random_dataset[i])

# print results by cluster
for cluster_name in sorted(clusters):
    print("Cluster Name: {}".format(cluster_name))

    for data in clusters[cluster_name]:
        print("     Data {}".format(data))

When I ran the code, the variable random_dataset contained the following values:

[[0.25897058 0.18748314]
 [0.70125152 0.06717995]
 [0.455812   0.26494425]
 [0.96850191 0.11605294]
 [0.35323561 0.94662841]
 [0.63322816 0.88710228]
 [0.59380873 0.63386557]]

The data was ultimately clustered into 3 clusters as follows:

Cluster Name: 0
     Data [0.25897058 0.18748314]
     Data [0.455812   0.26494425]
Cluster Name: 1
     Data [0.35323561 0.94662841]
     Data [0.63322816 0.88710228]
     Data [0.59380873 0.63386557]
Cluster Name: 2
     Data [0.70125152 0.06717995]
     Data [0.96850191 0.11605294]
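
Beyond labels_, the fitted KMeans object also exposes the learned centroids and can assign new, unseen points to a cluster. A short sketch (the point [0.5, 0.1] and the random_state value are just example choices):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(7, 2)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(data)

# the learned centroids, one row per cluster
print(kmeans.cluster_centers_)

# assign a previously unseen point to its nearest cluster
new_point = np.array([[0.5, 0.1]])
print(kmeans.predict(new_point))
```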

Last but not least, the example used in this tutorial was very simple; however, it generalizes to any number of clusters and any dataset size.
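When generalizing, one common way to choose the number of clusters is to compare the model's inertia_ (the sum of squared distances from each point to its nearest centroid) across several values of k and look for the "elbow" where the improvement flattens out. A rough sketch, using a larger random dataset for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 2)

# inertia_ always decreases as k grows, so look for the point
# where adding another cluster stops helping much
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, model.inertia_)
```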

More Resources