Easy Outlier Detection in Python

onestop_data

Data Analysis, Python

You are probably here to learn how to detect outliers in Python. You may have read the book “Outliers” by Malcolm Gladwell, but here we will talk about detecting outliers in Python lists.

1. What Is an Outlier?

In data analysis, an outlier is an observation that is atypical compared with the other data points in a given distribution.

But how do we define an atypical data point?

A common way to define and visualize outliers is the box plot. First, find the interquartile range of the data, IQR = Q3 - Q1, and multiply it by 1.5. Then subtract that amount from Q1 to get the lower bound, and add it to Q3 to get the upper bound.

Once the bounds are calculated, any value lower than the lower bound or higher than the upper bound is considered an outlier.

Box-plot representation (Image source).
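To make the rule concrete, here is a minimal sketch that works through the arithmetic using only the standard library. The sample values are hypothetical, and the use of statistics.quantiles with method="inclusive" (which interpolates linearly between sorted values) is an assumption made for this illustration, not something the rule itself prescribes.

```python
from statistics import quantiles

# Hypothetical sample, chosen only for illustration.
values = [10, 12, 13, 15, 16, 18, 40]

# Quartile cut points; method="inclusive" interpolates linearly
# between the sorted values.
q1, _median, q3 = quantiles(values, n=4, method="inclusive")

iqr = q3 - q1                  # interquartile range
lower_bound = q1 - 1.5 * iqr   # values below this are outliers
upper_bound = q3 + 1.5 * iqr   # values above this are outliers

print(q1, q3, iqr)               # 12.5 17.0 4.5
print(lower_bound, upper_bound)  # 5.75 23.75
print([v for v in values if v < lower_bound or v > upper_bound])  # [40]
```

Only 40 falls outside the range from 5.75 to 23.75, so it is the single outlier in this sample.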


2. Detect Outliers in Python

Now that you understand the logic behind outliers, coding the detection in Python should be straightforward, right?

Given the following list in Python, it is easy to tell that the outliers are 1 and 100.

>>> data = [1, 20, 20, 20, 21, 100]

The function below, which requires NumPy to calculate Q1 and Q3, finds the outliers (if any) in a given list of values:

import numpy as np

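def detect_outlier(data):
    # find q1 and q3 values (np.percentile does not need sorted input)
    q1, q3 = np.percentile(data, [25, 75])

    # compute IQR
    iqr = q3 - q1

    # find lower and upper bounds
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)

    # strict comparisons: a value equal to a bound is not an outlier,
    # so data with an IQR of 0 is not flagged wholesale
    outliers = [x for x in data if x < lower_bound or x > upper_bound]

    return outliers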

# input data
>>> detect_outlier(data)

# returns outliers
[1, 100]
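As a possible variation, the same rule can be written with a NumPy boolean mask instead of a list comprehension. This is a sketch, not part of the original function: the name iqr_outliers, the multiplier parameter k, and the choice to return the bounds along with the outliers are all assumptions made for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return (outliers, lower_bound, upper_bound) using the k * IQR rule.

    Hypothetical helper: `k` is the multiplier applied to the IQR.
    """
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # Boolean mask: True where a value lies strictly outside the bounds.
    mask = (arr < lower_bound) | (arr > upper_bound)
    return arr[mask].tolist(), float(lower_bound), float(upper_bound)

data = [1, 20, 20, 20, 21, 100]
outliers, lower, upper = iqr_outliers(data)
print(outliers)      # [1.0, 100.0]
print(lower, upper)  # 18.875 21.875
```

Returning the bounds makes it easier to see why a value was flagged: here Q1 is 20 and Q3 is 20.75, so anything outside the range from 18.875 to 21.875 counts as an outlier. Because the array is converted to floats, the outliers come back as 1.0 and 100.0 rather than the original integers.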
