You are probably here to learn how to detect outliers in Python. You may have read the book “Outliers” by Malcolm Gladwell, but here we will be talking about detecting outliers in Python lists.
1. What is An Outlier?
In statistics, an outlier is an observation that lies an abnormal distance from other values in a random sample from a population. Outliers can be caused by a number of factors, including measurement errors, data entry errors, or the presence of extreme values in the population. Outliers can have a significant impact on the results of statistical analyses, such as measures of central tendency (e.g., mean, median), and variability (e.g., standard deviation, range), and can lead to incorrect conclusions about the underlying population.
There are different methods for identifying outliers, including graphical methods (e.g., box plots) and statistical methods (e.g., the Z-score method, the modified Z-score method, and the interquartile range method). The choice of method depends on the nature of the data and the goals of the analysis. In general, outliers should be carefully investigated and, if necessary, corrected or removed before conducting further statistical analyses.
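To make the two Z-score variants mentioned above concrete, here is a minimal sketch that uses only the standard library. The function names `zscore_outliers` and `modified_zscore_outliers` are hypothetical, the sample list is made up for illustration, and the cutoffs (3.0 and 3.5) and the 0.6745 scaling constant are commonly used conventions, not values prescribed by this article.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values are identical, so nothing stands out
    return [x for x in values if abs((x - mean) / std) > threshold]


def modified_zscore_outliers(values, threshold=3.5):
    """Return values whose absolute modified Z-score exceeds the threshold."""
    median = statistics.median(values)
    # median absolute deviation (MAD) from the median
    mad = statistics.median(abs(x - median) for x in values)
    if mad == 0:
        return []  # sketch only: the score is undefined when the MAD is zero
    return [x for x in values if abs(0.6745 * (x - median) / mad) > threshold]


sample = [1, 20, 20, 20, 21, 100]  # illustrative values
print(zscore_outliers(sample))           # []
print(modified_zscore_outliers(sample))  # [1, 100]
```

On this small sample the plain Z-score flags nothing, because the extreme values themselves inflate the standard deviation, while the median-based variant flags both 1 and 100. This is one reason the choice of method depends on the data.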
An outlier can be easily defined and visualized using a box plot. First find the interquartile range, IQR = Q3 – Q1, and multiply it by 1.5.
Subtracting that value from Q1 gives the lower bound, and adding it to Q3 gives the upper bound. Any value lower than the lower bound or higher than the upper bound is considered an outlier.
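The bounds described above can be worked out in a few lines. This is a minimal sketch using only the standard library. The helper name `iqr_bounds`, its `k` parameter, the sample values, and the choice of `method="inclusive"` for the quartiles are all assumptions made for the example. Other quartile conventions give slightly different bounds.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) box-plot bounds for a list of numbers."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median) and Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


sample = [2, 4, 4, 5, 6, 7, 8, 30]  # made-up values for illustration
lower, upper = iqr_bounds(sample)
print(lower, upper)  # -0.875 12.125
print([x for x in sample if x < lower or x > upper])  # [30]
```

Here Q1 is 4.0 and Q3 is 7.25, so the IQR is 3.25 and 1.5 times the IQR is 4.875. Only 30 falls outside the resulting bounds.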
2. Detect Outliers in Python
Now that you understand the logic behind outliers, coding the detection in Python should be straightforward, right?
Given the following list in Python, it is easy to tell that the outliers are 1 and 100.
>>> data = [1, 20, 20, 20, 21, 100]
The function below, which requires NumPy to calculate Q1 and Q3, finds the outliers (if any) in a list of values:
import numpy as np

def detect_outlier(data):
    # find the Q1 and Q3 values (np.percentile does not need sorted input)
    q1, q3 = np.percentile(data, [25, 75])
    # compute the IQR
    iqr = q3 - q1
    # find the lower and upper bounds
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    # keep only the values that fall strictly outside the bounds
    outliers = [x for x in data if x < lower_bound or x > upper_bound]
    return outliers

# input data
>>> detect_outlier(data)
# returns outliers
[1, 100]
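If the goal is to remove the outliers rather than just list them, the same IQR rule can be expressed as a NumPy boolean mask. This is a sketch of a variation, not a replacement for the function above: the name `split_outliers` and its `k` parameter are hypothetical, and values are converted to floats for simplicity.

```python
import numpy as np


def split_outliers(data, k=1.5):
    """Split data into (inliers, outliers) using the IQR rule."""
    values = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # True where a value falls strictly outside the bounds
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[~mask].tolist(), values[mask].tolist()


inliers, outliers = split_outliers([1, 20, 20, 20, 21, 100])
print(inliers)   # [20.0, 20.0, 20.0, 21.0]
print(outliers)  # [1.0, 100.0]
```

Exposing the multiplier as `k` makes it easy to try a stricter or looser rule than the conventional 1.5.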