Simple Box Plot and Swarm Plot in Python

This short tutorial shows how to create a box-and-whisker plot and overlay it with a swarm plot in Python. It also explains how to read the result.

What is a Box Plot and how to read it?

A box plot, also known as a box-and-whisker plot, summarizes the distribution of a set of continuous or discrete data in a single plot by showing its median, quartiles, and outliers.

A box plot consists of a box that represents the interquartile range (IQR) of the data, which is defined as the range between the first (lower) quartile (25th percentile) and the third (upper) quartile (75th percentile). The median of the data is shown as a line inside the box. Whiskers extend from either side of the box to the most extreme data points that are not classed as outliers. By default, matplotlib and seaborn treat points that lie more than 1.5 times the IQR beyond the edges of the box as outliers, and these are plotted as individual points outside of the whiskers.
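
To make the anatomy above concrete, here is a minimal sketch that computes the numbers a box plot draws for one sample. It assumes the common 1.5 x IQR whisker rule; the function name `box_plot_stats` and the sample values are made up for illustration only.

```python
import numpy as np


def box_plot_stats(values, whis=1.5):
    """Return the quantities a box plot displays for one sample.

    Assumption: whiskers reach the most extreme data points that lie
    within `whis` * IQR of the box edges; anything beyond is an outlier.
    """
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    is_inside = (values >= lower_fence) & (values <= upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": values[is_inside].min(),
        "whisker_high": values[is_inside].max(),
        "outliers": np.sort(values[~is_inside]),
    }


# illustrative sample with one value far away from the rest
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = box_plot_stats(sample)
print(stats)
```

For this sample the box spans 3.25 to 7.75, the upper whisker stops at 9, and 30 is reported as the only outlier.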

Box plots are commonly used in statistics, data analysis, and machine learning for several purposes, including:

  1. Visualizing the distribution of the data: A box plot shows the median, quartiles, spread, and outliers at a glance.
  2. Comparing multiple sets of data: By plotting multiple box plots on the same graph, you can quickly identify differences and similarities between two or more sets of data.
  3. Identifying outliers: Outliers are drawn as individual points, which helps to identify values that are significantly different from the rest of the data.
  4. Revealing skewness: Box plots do not remove skewness, but they make it easy to spot. A median that sits off-center in the box, or whiskers of very different lengths, indicates an asymmetric distribution.
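
As a small illustration of what the position of the median inside the box can tell you, the sketch below computes a quartile-based skewness measure (Bowley's coefficient). This is not part of the plotting script in this post; the function name and the two samples are made up for the example.

```python
import numpy as np


def quartile_skewness(values):
    """Bowley's quartile skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1).

    0 means the median is centered in the box. Positive values mean the
    median sits nearer Q1 (right skew), negative values the opposite.
    """
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    if q3 == q1:
        # the measure is undefined for a zero-width box; returning 0.0 is a choice made here
        return 0.0
    return float((q3 + q1 - 2 * median) / (q3 - q1))


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 10, 20]

print(quartile_skewness(symmetric))     # 0.0
print(quartile_skewness(right_skewed))  # 0.75
```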

In general, box plots are a quick and easy way to summarize the distribution of data, and they are particularly useful when comparing multiple sets of data or identifying outliers.

Dependencies for the script

First and foremost, the function below depends on seaborn and matplotlib (the example script also uses NumPy), so please make sure they are installed.

Both can be easily installed using pip. Installing seaborn also installs matplotlib and NumPy as dependencies. Please see the seaborn installation instructions for details.
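
If you are not sure whether the packages are already available, a quick check like the following can help. This is only a convenience sketch; the helper name `missing_packages` is made up for this example.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["numpy", "matplotlib", "seaborn"])
if missing:
    print("Missing packages, install them with: pip install " + " ".join(missing))
else:
    print("All plotting dependencies are available.")
```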

Generating the Box and Swarm Plot

Secondly, this post plots three random population distributions as a combined box and swarm plot. However, the code can easily be adapted to fewer or more populations.

Please see the code below:

import matplotlib.pyplot as plt
import numpy as np
import seaborn


def plot_box_swarm(data, y_axis_label, x_labels, plot_title, figure_name):
    """Plot a box plot overlaid with a swarm plot for a list of samples.

    Args:
        data (list of array-like): One sequence of values per group.
        y_axis_label (str): Y-axis label.
        x_labels (list of str): One x-axis label per group.
        plot_title (str): Plot title.
        figure_name (str): Path to output figure.

    """
    seaborn.set(color_codes=True)
    fig, ax = plt.subplots(figsize=(9, 6))

    # add title to plot
    ax.set_title(plot_title)

    # plot data on swarmplot and boxplot, both on the same axes
    seaborn.swarmplot(data=data, color=".25", ax=ax)
    seaborn.boxplot(data=data, ax=ax)

    # y-axis label
    ax.set(ylabel=y_axis_label)

    # write labels with number of elements in each group
    ax.set_xticklabels(["{} (n={})".format(label, len(values)) for label, values in zip(x_labels, data)], rotation=10)

    # write figure file with quality 400 dpi
    fig.savefig(figure_name, bbox_inches='tight', dpi=400)
    plt.close(fig)


# fix the seed so the np.random.normal samples, and therefore the plot, are reproducible
np.random.seed(11)

# create random distributions for 3 populations
population_a = np.random.normal(0.1, 0.5, 50)
population_b = np.random.normal(0.2, 0.7, 45)
population_c = np.random.normal(0.7, 0.3, 51)
data = [population_a, population_b, population_c]

x_labels = ["Population A", "Population B", "Population C"]
y_axis_label = "Target Metric"

plot_box_swarm(data, y_axis_label, x_labels, "Box/Swarm plot - Population A vs B vs C", "pop_A_B_C.png")

And here is what our plot looks like:

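In case you only want the swarm plot, comment out the seaborn.boxplot call in the code above. If you only want the box plot, comment out the seaborn.swarmplot call instead. In either case, make sure that ax is still defined for the labeling calls that follow.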
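
Instead of commenting lines in and out, the two layers could also be switched with function arguments. The following is only a sketch of that idea, not the function used above: the name `plot_layers`, the `show_box` and `show_swarm` flags, and the output file names are made up for this example. It also uses two populations instead of three to show that the number of groups is not fixed.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn


def plot_layers(data, figure_name, show_box=True, show_swarm=True):
    """Draw a box plot, a swarm plot, or both on the same axes and save the figure."""
    fig, ax = plt.subplots(figsize=(9, 6))
    if show_box:
        seaborn.boxplot(data=data, ax=ax)
    if show_swarm:
        seaborn.swarmplot(data=data, color=".25", ax=ax)
    fig.savefig(figure_name, bbox_inches="tight", dpi=100)
    plt.close(fig)
    return Path(figure_name)


# two illustrative populations; any number of groups can be passed in the list
rng = np.random.default_rng(11)
groups = [rng.normal(0.1, 0.5, 50), rng.normal(0.7, 0.3, 51)]

swarm_path = plot_layers(groups, "swarm_only.png", show_box=False)
box_path = plot_layers(groups, "box_only.png", show_swarm=False)
```

Because the axes object is created up front with plt.subplots, both layers are optional and neither call depends on the other.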

Conclusion

In summary, this tutorial showed you how to use seaborn to overlay a swarm plot on a box plot, so that the summary statistics and the individual data points are visible in the same figure.