Most real-world data sets contain outliers: values that are unusually small or large compared with the typical values of the data set. Outliers can distort a data analysis if they are not detected correctly, but they can also carry significant information. There are several methods for detecting outliers in a data set. It is important to understand the characteristics of these methods: they are quite powerful on large, normally distributed data, but applying them to non-normal data or small samples can be problematic without knowing how they behave in those circumstances. In this post, we discuss different outlier detection methods and how to apply them properly to a data set.

## Outlier

An outlier is an observation that lies an unusual distance from the other observations in a random sample from a population. Generally, the person working with the data defines what counts as an unusual distance. Outliers can arise for many reasons, e.g. data entry mistakes or mixing data from different populations. To see the impact of an outlier, take the simple data set $$1,2,3,4,5,6,7,8,9$$. Its mean is $$5$$ and its sample variance is $$7.5$$.

Now, if we change the value 9 to 99, the mean becomes $$15$$ and the sample variance becomes $$997.5$$.

As we can see, the mean and the variance have become much larger because a single value got bigger. The 95% confidence interval for the mean has also become wider.
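
The comparison above can be reproduced with a minimal sketch like the following. It assumes a t-based 95% confidence interval for the mean, which the post does not specify; the helper name `summarize` and the hard-coded critical value are illustrative choices, not part of the original analysis.

```python
import math
import statistics


def summarize(data, t_crit):
    """Return the mean, sample variance, and a t-based CI for the mean."""
    n = len(data)
    mean = statistics.mean(data)
    var = statistics.variance(data)  # sample variance (n - 1 denominator)
    half_width = t_crit * math.sqrt(var / n)
    return mean, var, (mean - half_width, mean + half_width)


# Assumption: two-sided 95% t critical value for n - 1 = 8 degrees of freedom.
T_CRIT_DF8 = 2.306

original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
modified = [1, 2, 3, 4, 5, 6, 7, 8, 99]  # 9 replaced by 99

for name, data in (("original", original), ("modified", modified)):
    mean, var, (low, high) = summarize(data, T_CRIT_DF8)
    print(f"{name}: mean={mean:.2f}, variance={var:.2f}, "
          f"95% CI=({low:.2f}, {high:.2f})")
```

A single changed value is enough to triple the mean and inflate the variance by more than a factor of one hundred, which in turn widens the interval.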

## Outlier Detection Methods

There are two types of outlier detection methods: formal tests and informal tests. Formal tests are also known as tests of discordancy, while informal tests are known as outlier labeling methods.

## Standard Deviation Method

One of the simplest classical ways of screening for outliers in a data set is the standard deviation method. We define an interval centered on the mean, with $$\bar{x} - 2SD$$ and $$\bar{x} + 2SD$$ as its two endpoints. Observations that fall outside the interval are labeled as outliers. The 3 SD variant uses three standard deviations instead of two.

\begin{aligned} \text{2 SD Method} &: \bar{x} \pm 2 SD \\ \text{3 SD Method} &: \bar{x} \pm 3 SD \end{aligned}

According to the Chebyshev inequality, if a random variable $$X$$ has mean $$\mu$$ and finite variance $$\sigma^2$$, then for any $$k>0$$,

\begin{aligned}P[|X - \mu|\ge k\sigma] &\le \frac{1}{k^2} \\ P[|X - \mu| < k\sigma] &\ge 1 - \frac{1}{k^2} \end{aligned}

So the proportion of data within $$k$$ standard deviations of the mean is at least $$1 - (1/k)^2$$, e.g. at least about $$75\%$$, $$89\%$$ and $$94\%$$ of the data lie within 2, 3 and 4 standard deviations of the mean, respectively. This bound holds for any distribution with a finite variance. Now, let's apply the method to an example data set $$X$$:

$$3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12, 13, 13.5, 14.2, 15.0, 45, 55$$

The mean of this data set is $$\bar{x} = 14.53$$ and $$SD = 14.45$$. The 2SD method gives the interval $$(-14.37, 43.43)$$, so $$45$$ and $$55$$ are detected as outliers. The 3SD method gives $$(-28.82, 57.88)$$, which does not detect any outliers.
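
As a rough sketch of the standard deviation method on the example data set, the snippet below computes the 2 SD and 3 SD intervals and lists the flagged values. It assumes the sample standard deviation (n - 1 denominator); the function name `sd_outliers` is made up for illustration.

```python
import statistics

data = [3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12,
        13, 13.5, 14.2, 15.0, 45, 55]


def sd_outliers(values, k):
    """Flag values outside mean +/- k * SD (sample standard deviation)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - k * sd, mean + k * sd
    flagged = [x for x in values if x < lower or x > upper]
    return (lower, upper), flagged


for k in (2, 3):
    (lower, upper), flagged = sd_outliers(data, k)
    chebyshev = 1 - 1 / k**2  # lower bound on the share of data within k SDs
    print(f"{k} SD: interval=({lower:.2f}, {upper:.2f}), "
          f"outliers={flagged}, Chebyshev bound={chebyshev:.1%}")
```

Because the code works with unrounded values of the mean and SD, the endpoints can differ from the hand-computed ones in the second decimal place; the flagged observations are the same.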

## Z-score

One of the most common tools in statistics is the Z-score. The Z-score is the number of standard deviations $$(\sigma)$$ that a data point lies from the mean $$\bar{x}$$:

$$\text{Z-score} = \frac{X - \bar{x}}{\sigma}$$

In the standard deviation method, we formed an interval from the mean and 2SD or 3SD. Here, we take the difference between the observation and the mean and divide it by the standard deviation, which expresses that difference in units of standard deviations. An observation is considered an outlier if the absolute value of its Z-score exceeds 3. Looked at closely, this rule is equivalent to the 3SD method. The maximum possible Z-score depends on the sample size: its absolute value is at most $$(n-1)/\sqrt{n}$$ for a sample of size $$n$$, so for $$n \le 10$$ no observation can have a Z-score above 3 in absolute value.

In case 1, with all observations included, no Z-score exceeds 3 in absolute value even though 45 and 55 are outliers (their Z-scores are about 2.11 and 2.80). In case 2, with the most extreme value, 55, excluded, 45 is detected as an outlier (its Z-score is about 3.34). This happens because multiple extreme values inflate the standard deviation, which masks the outliers.
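
The two cases can be checked with a short sketch such as the one below. It assumes the sample standard deviation in the Z-score and the threshold of 3 used in this post; `z_scores` and `z_outliers` are illustrative helper names.

```python
import math
import statistics

data = [3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12,
        13, 13.5, 14.2, 15.0, 45, 55]


def z_scores(values):
    """Z-score of each value, using the sample mean and sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]


def z_outliers(values, threshold=3.0):
    """Values whose absolute Z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


case_1 = data                           # all observations
case_2 = [x for x in data if x != 55]   # most extreme value removed

# Largest Z-score any observation can reach for this sample size.
max_z = (len(case_1) - 1) / math.sqrt(len(case_1))

print("case 1 outliers:", z_outliers(case_1))
print("case 2 outliers:", z_outliers(case_2))
print("maximum possible |Z| for n = 16:", max_z)
```

Removing 55 shrinks the standard deviation enough for 45 to cross the threshold, which illustrates how one extreme value can hide another.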

## Modified Z-score Method

In the Z-score method we used two estimators, the sample mean and the sample standard deviation, both of which can be affected by a single or multiple extreme values. To avoid this, the modified Z-score method uses the median and the Median Absolute Deviation (MAD) instead. The modified Z-score $$M_i$$ is computed as:

\begin{aligned}M_i &= \frac{0.6745(x_i - \tilde{x})}{MAD} \\ MAD &= \text{median}\{|x_i - \tilde{x}|\}\end{aligned}

where $$\tilde{x}$$ is the sample median. To see how the MAD is computed, take the set of numbers $$1,2,3,4,5$$. The median $$(\tilde{x})$$ is 3. Now we take the absolute difference between each value and the median:

\begin{aligned}|1-3| &= 2\\ |2-3| &= 1\\ |3-3| &= 0\\ |4-3|&= 1\\ |5-3| &= 2\end{aligned}

The MAD is the median of these values once sorted, $$(0, 1, 1, 2, 2)$$, so $$MAD = 1$$. We consider an observation an outlier if $$|M_i|>3.5$$.

Comparing the Z-score method and the modified Z-score method on the previous data set, we notice that although the Z-score method failed, the modified Z-score detects 45 and 55 as outliers (their modified Z-scores are about 5.84 and 7.57). This is because the modified Z-score is less sensitive to extreme values.
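
A minimal sketch of the modified Z-score, using the constant 0.6745 and the cutoff 3.5 given above, might look like this. The helper names `mad` and `modified_z_scores` are illustrative, and the code assumes the MAD is non-zero (it would divide by zero if more than half of the values were identical).

```python
import statistics

data = [3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12,
        13, 13.5, 14.2, 15.0, 45, 55]


def mad(values):
    """Median Absolute Deviation: median of |x_i - median(x)|."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])


def modified_z_scores(values):
    """Modified Z-score M_i = 0.6745 * (x_i - median) / MAD."""
    med = statistics.median(values)
    scale = mad(values)  # assumed to be non-zero
    return [0.6745 * (x - med) / scale for x in values]


def modified_z_outliers(values, threshold=3.5):
    """Values whose absolute modified Z-score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [x for x, m in zip(values, scores) if abs(m) > threshold]


print("MAD of 1..5:", mad([1, 2, 3, 4, 5]))
print("MAD of data:", mad(data))
print("outliers:", modified_z_outliers(data))
```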

## MADₑ Method

The MADₑ method is similar to the SD method, but it uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation. It is defined as follows:

\begin{aligned}\text{2 } MAD_e \text{ Method} &: \text{Median} \pm 2 MAD_e \\ \text{3 } MAD_e \text{ Method} &: \text{Median} \pm 3 MAD_e \end{aligned}

where

\begin{aligned}MAD &= \text{median}(|x_i - \text{median}(x)|), \; i = 1,2,\ldots,n \\ MAD_e &= 1.483 \times MAD\end{aligned}

Scaling the MAD by a factor of 1.483 makes it comparable to the standard deviation of a normal distribution. For our data set:

\begin{aligned} \text{Median} &= 11.25 \\ MAD &= 3.9 \\ MAD_e &= 1.483 \times 3.9 = 5.78 \\ 2 \; MAD_e &: [-0.31, 22.81] \\ 3 \; MAD_e &: [-6.09, 28.59]\end{aligned}

Both intervals flag 45 and 55 as outliers.

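
The MADₑ intervals for the example data set can be sketched as follows, using the scale factor 1.483 from the definition above. The function names are illustrative only.

```python
import statistics

data = [3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12,
        13, 13.5, 14.2, 15.0, 45, 55]


def mad_e_interval(values, k, scale=1.483):
    """Return the interval median +/- k * MAD_e, with MAD_e = scale * MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    mad_e = scale * mad
    return med - k * mad_e, med + k * mad_e


def mad_e_outliers(values, k):
    """Values that fall outside the k * MAD_e interval."""
    lower, upper = mad_e_interval(values, k)
    return [x for x in values if x < lower or x > upper]


for k in (2, 3):
    lower, upper = mad_e_interval(data, k)
    print(f"{k} MAD_e: [{lower:.2f}, {upper:.2f}], "
          f"outliers={mad_e_outliers(data, k)}")
```

The printed endpoints use the unrounded MADₑ, so they may differ slightly in the last digit from values computed with 5.78.
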

## Median Rule

The median rule was introduced by Carling (2000). It is a variant of Tukey's method in which the interval is centered on the median rather than anchored at the quartiles, and a different scale factor is used. The interval is defined as:

\begin{aligned} [c_1, c_2] &= q_2 \pm c(n)(q_3 - q_1) = q_2 \pm c(n)\,IQR \\ c(n) &= \frac{17.63n - 23.64}{7.74n - 3.71} \approx 2.3 \text{ for large } n \end{aligned}

where $$q_1$$ is the first quartile, $$q_2$$ is the median, $$q_3$$ is the third quartile, and $$n$$ is the number of samples. For our data set, using $$c(n) \approx 2.3$$:

\begin{aligned} q_2 &= 11.25 \\ IQR &= 7.75 \\ [c_1, c_2] &= [-6.575, 29.075] \end{aligned}

With the exact value $$c(16) \approx 2.15$$, the interval is slightly narrower, about $$[-5.42, 27.92]$$. Either way, 45 and 55 fall outside it.

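
The median rule can be sketched as below. The quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is an assumption on my part: it happens to reproduce the quartile values used in this post, but other quartile conventions give slightly different numbers. The function names are illustrative.

```python
import statistics

data = [3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12,
        13, 13.5, 14.2, 15.0, 45, 55]


def c(n):
    """Sample-size dependent scale factor of the median rule."""
    return (17.63 * n - 23.64) / (7.74 * n - 3.71)


def median_rule_interval(values, factor=None):
    """Return q2 +/- factor * IQR; factor defaults to c(n)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    if factor is None:
        factor = c(len(values))
    half_width = factor * (q3 - q1)
    return q2 - half_width, q2 + half_width


def median_rule_outliers(values, factor=None):
    """Values that fall outside the median-rule interval."""
    lower, upper = median_rule_interval(values, factor)
    return [x for x in values if x < lower or x > upper]


print("c(16):", round(c(16), 4))
print("interval with c(n):", median_rule_interval(data))
print("interval with 2.3: ", median_rule_interval(data, factor=2.3))
print("outliers:", median_rule_outliers(data))
```

Passing `factor=2.3` corresponds to the large-sample approximation, while the default uses the factor for the actual sample size.
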

## Tukey's Method (Boxplot)

Tukey's boxplot method is a well-known, simple graphical method for displaying the five-number summary and the outliers of univariate data. It is less sensitive to extreme values than the SD and Z-score methods because it uses quartiles rather than the sample mean and variance. From the five-number summary, we first take $$Q_1$$ and $$Q_3$$ and compute the interquartile range:

$$IQR = Q_3 - Q_1$$

Using this, we define an interval, and any value outside it is labeled an outlier:

$$\big[ Q_1 - k (Q_3 - Q_1 ) , Q_3 + k (Q_3 - Q_1 ) \big], \; \text{where } k \text{ is nonnegative}$$

John Tukey proposed $$k = 1.5$$ for the inner fences, beyond which an observation is an outlier, and $$k = 3.0$$ for the outer fences, beyond which it is far out. There is no formal statistical basis for Tukey's choice of 1.5 and 3. For our example data set:

\begin{aligned} Q_1 &= 5.925 \\ Q_2 &= 11.25 \\ Q_3 &= 13.675\\ IQR &= 7.75\end{aligned}

So the interval according to Tukey is

$$\big[ Q_1 - 1.5(Q_3 - Q_1 ) , Q_3 + 1.5 (Q_3 - Q_1 ) \big] = [-5.7, 25.3]$$

The lowest value of our data set that falls inside this interval is $$3$$ and the largest is $$15.0$$, while 45 and 55 are detected as outliers.
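
A small sketch of Tukey's fences for the example data set follows. As an assumption, the quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which reproduces the quartile values above; plotting libraries and other quartile conventions may place the fences slightly differently. The function names are illustrative.

```python
import statistics

data = [3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12,
        13, 13.5, 14.2, 15.0, 45, 55]


def tukey_fences(values, k=1.5):
    """Return (Q1 - k * IQR, Q3 + k * IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k=1.5):
    """Values that fall outside Tukey's fences for the given k."""
    lower, upper = tukey_fences(values, k)
    return [x for x in values if x < lower or x > upper]


for k in (1.5, 3.0):
    lower, upper = tukey_fences(data, k)
    print(f"k={k}: fences=[{lower:.3f}, {upper:.3f}], "
          f"outliers={tukey_outliers(data, k)}")
```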