An *outlier* is an observation that lies at an unusually large distance from the other observations in a random sample from a population. Generally, the analyst working with the data decides what counts as an unusual distance. Outliers can have many causes, e.g. data entry mistakes, sampling from different populations, etc. To see the impact of an outlier, consider the simple data set \(1,2,3,4,5,6,7,8,9\). If we look at the statistics of this data set

Mean | Median | Variance | 95% Confidence Interval |
---|---|---|---|
5 | 5 | 7.5 | [3.2, 6.8] |

Now, if we change the value from 9 to 99 we get

Mean | Median | Variance | 95% Confidence Interval |
---|---|---|---|
15 | 5 | 997.5 | [-5.6, 35.6] |

As we can see, the mean and variance have become much larger because a single value got bigger, and the 95% confidence interval has become much wider.

Outlier detection methods fall into two types: formal tests and informal tests. Formal tests are also known as tests of discordancy, while informal tests are known as outlier labeling methods.

One of the simplest and most classical ways of screening outliers in a data set is the standard deviation (SD) method. We define an interval centered at the mean, with \(\bar{x} - 2SD\) and \(\bar{x} + 2SD\) as its two endpoints. Observations that fall outside the interval are flagged as outliers. $$\begin{aligned} \text{2 SD Method} &: \bar{x} \pm 2 SD \\ \text{3 SD Method} &: \bar{x} \pm 3 SD \end{aligned}$$

According to Chebyshev's inequality, for a random variable \(X\) with mean \(\mu\) and variance \(\sigma^2\) and any \(k>0\), $$\begin{aligned}P[|X - \mu|\ge k\sigma] &\le \frac{1}{k^2} \\ P[|X - \mu| < k\sigma] &\ge 1 - \frac{1}{k^2} \end{aligned}$$ We can bound the proportion of data within \(k\) standard deviations of the mean using \(1 - 1/k^2\): at least about 75%, 89%, and 94% of the data lie within 2, 3, and 4 standard deviations of the mean, respectively. This holds for data from any distribution. Now, let's apply the SD method to an example data set \(X\): $$3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12, 13, 13.5, 14.2, 15.0, 45, 55$$ The mean for this data set is \(\bar{x} = 14.53\) and \(SD = 14.45\). Calculating the interval for the 2SD method gives \((-14.37, 43.43)\), which detects \(45\) and \(55\) as outliers. But the 3SD interval, \((-28.82, 57.88)\), doesn't detect any outliers.
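The SD method can be sketched in a few lines of Python. This is a minimal illustration on the example data set above; it uses the sample standard deviation, matching the \(SD = 14.45\) computed in the text:

```python
import statistics

data = [3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12, 13, 13.5, 14.2, 15.0, 45, 55]

def sd_outliers(xs, k=2):
    """Flag observations outside mean +/- k * SD (k = 2 or 3)."""
    m = statistics.mean(xs)
    s = statistics.stdev(xs)  # sample SD, n - 1 in the denominator
    low, high = m - k * s, m + k * s
    return [x for x in xs if x < low or x > high]

print(sd_outliers(data, 2))  # the 2SD method flags 45 and 55
print(sd_outliers(data, 3))  # the 3SD method flags nothing
```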

One of the most common tools in all of statistics is the Z-score. The Z-score can be defined as the number of standard deviations \((\sigma)\) a given data point is from the mean \(\bar{x}\). $$\text{Z-score} = \frac{X - \bar{x}}{\sigma}$$ In the standard deviation method, we formed an interval from the mean and 2SD or 3SD. Here, we take the difference between the observation and the mean and divide it by the standard deviation to find how many standard deviations the difference is equivalent to. An observation is considered an outlier if the absolute value of its Z-score exceeds 3. Looking closely, the Z-score rule with this cutoff is actually the same as the 3SD method. The maximum possible Z-score depends on the sample size and is \(\pm (n-1)/\sqrt{n}\) for a sample of size \(n\).
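As a sketch, the Z-score screening might look like this in Python. The population standard deviation (`pstdev`) is used to match the \(\sigma\) values shown in the table that follows:

```python
import statistics

data = [3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12, 13, 13.5, 14.2, 15.0, 45, 55]

def z_scores(xs):
    """Number of standard deviations each point lies from the mean."""
    m = statistics.mean(xs)
    s = statistics.pstdev(xs)  # population SD, matching sigma = 13.99
    return [(x - m) / s for x in xs]

def z_outliers(xs, threshold=3):
    return [x for x, z in zip(xs, z_scores(xs)) if abs(z) > threshold]

print(z_outliers(data))        # [] - no |z| exceeds 3 with 55 included
print(z_outliers(data[:-1]))   # [45] once the most extreme value is removed
```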

\(i\) | \(x_i\) | Z-score \(\tiny(\bar x = 14.53, \sigma = 13.99)\) | \(x_i\) | Z-score \(\tiny(\bar x = 11.83, \sigma = 9.60)\) |
---|---|---|---|---|
1 | 3 | -0.82 | 3 | -0.92 |
2 | 5.1 | -0.67 | 5.1 | -0.7 |
3 | 5.3 | -0.66 | 5.3 | -0.68 |
4 | 5.7 | -0.63 | 5.7 | -0.64 |
5 | 6 | -0.61 | 6 | -0.61 |
6 | 7.2 | -0.52 | 7.2 | -0.48 |
7 | 10 | -0.32 | 10 | -0.19 |
8 | 11 | -0.25 | 11 | -0.09 |
9 | 11.5 | -0.22 | 11.5 | -0.03 |
10 | 12 | -0.18 | 12 | 0.01 |
11 | 13 | -0.11 | 13 | 0.12 |
12 | 13.5 | -0.07 | 13.5 | 0.17 |
13 | 14.2 | -0.02 | 14.2 | 0.25 |
14 | 15 | 0.03 | 15 | 0.33 |
15 | 45 | 2.18 | 45 | 3.45 |
16 | 55 | 2.89 | – | – |

For case 1, with all observations included, no observation's Z-score exceeds 3 in absolute value even though 45 and 55 are outliers. For case 2, where we have excluded 55, the most extreme value, 45 is detected as an outlier. This is because multiple extreme values inflate the standard deviation.

In the Z-score method we used two estimators, the sample mean and standard deviation, both of which can be distorted by one or more extreme values. To avoid this, the modified Z-score method uses the median and the Median Absolute Deviation (MAD) instead. The modified Z-score \(M_i\) is computed as: $$\begin{aligned}M_i &= \frac{0.6745(x_i - \tilde{x})}{MAD}, \; \text{where } \tilde{x} \text{ is the sample median}\\ MAD &= \operatorname{median}\{|x_i - \tilde{x}|\}\end{aligned}$$ To see how the MAD is found, take the set of numbers \(1,2,3,4,5\). The median \((\tilde{x})\) is 3. We subtract the median from each value and take absolute values: $$\begin{aligned} |1-3| &= 2\\ |2-3| &= 1\\ |3-3| &= 0\\ |4-3| &= 1\\ |5-3| &= 2\end{aligned}$$ The MAD is the median of the sorted absolute deviations \((0, 1, 1, 2, 2)\), so \(MAD = 1\). We consider an observation an outlier if \(|M_i| > 3.5\).
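A minimal Python sketch of the modified Z-score method, using the 0.6745 factor and 3.5 cutoff given above:

```python
import statistics

data = [3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12, 13, 13.5, 14.2, 15.0, 45, 55]

def modified_z_scores(xs):
    """M_i = 0.6745 * (x_i - median) / MAD."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    return [0.6745 * (x - med) / mad for x in xs]

def modified_z_outliers(xs, threshold=3.5):
    return [x for x, m in zip(xs, modified_z_scores(xs)) if abs(m) > threshold]

print(modified_z_outliers(data))  # both 45 and 55 are flagged
```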

\(i\) | \(x_i\) | Z-score | Modified Z-score |
---|---|---|---|
1 | 3 | -0.82 | -1.43 |
2 | 5.1 | -0.67 | -1.06 |
3 | 5.3 | -0.66 | -1.03 |
4 | 5.7 | -0.63 | -0.96 |
5 | 6 | -0.61 | -0.91 |
6 | 7.2 | -0.52 | -0.70 |
7 | 10 | -0.32 | -0.22 |
8 | 11 | -0.25 | -0.04 |
9 | 11.5 | -0.22 | 0.04 |
10 | 12 | -0.18 | 0.13 |
11 | 13 | -0.11 | 0.30 |
12 | 13.5 | -0.07 | 0.39 |
13 | 14.2 | -0.02 | 0.51 |
14 | 15 | 0.03 | 0.65 |
15 | 45 | 2.18 | 5.84 |
16 | 55 | 2.89 | 7.57 |

Here we compare the Z-score method and the modified Z-score method on the previous data set. Notice that while the Z-score method failed, the modified Z-score method detects both 45 and 55 as outliers. This is because the modified Z-score is less sensitive to extreme values.

The MADₑ method is similar to the SD method, but it uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation. The MADₑ method is defined as follows: $$ \begin{aligned}2\, MAD_e \text{ Method} &: \text{Median} \pm 2\, MAD_e \\ 3\, MAD_e \text{ Method} &: \text{Median} \pm 3\, MAD_e \\ \text{where } MAD &= \operatorname{median}(|x_i - \operatorname{median}(x)|_{i = 1,2,\ldots,n})\\ MAD_e &= 1.483 \times MAD\end{aligned}$$ After scaling by a factor of 1.483, the MAD approximates the standard deviation of a normal distribution. For our dataset, $$\begin{aligned} \text{Median} &= 11.25 \\ MAD &= 3.9 \\ MAD_e &= 1.483 \times 3.9 = 5.78 \\ 2\, MAD_e &: [-0.31, 22.81] \\ 3\, MAD_e &: [-6.09, 28.59]\end{aligned}$$
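The MADₑ fences can be computed with the same building blocks as the modified Z-score (a sketch, using the example data set from above):

```python
import statistics

data = [3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12, 13, 13.5, 14.2, 15.0, 45, 55]

def made_outliers(xs, k=2):
    """Flag observations outside median +/- k * MADe, where MADe = 1.483 * MAD."""
    med = statistics.median(xs)
    made = 1.483 * statistics.median([abs(x - med) for x in xs])
    low, high = med - k * made, med + k * made
    return [x for x in xs if x < low or x > high]

print(made_outliers(data, 2))  # 45 and 55 fall outside the 2 MADe fences
print(made_outliers(data, 3))  # and outside the 3 MADe fences as well
```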

The median rule was introduced by Carling (2000). It is essentially a substitution method for Tukey's method in which the quartiles are replaced by the median and a different scale constant is used. The method is defined as: $$\begin{aligned} [c_1, c_2] &= q_2 \pm c(n)(q_3 - q_1) = q_2 \pm c(n)\,IQR\\ \text{where } q_1 &= \text{first quartile} \\ q_2 &= \text{median} \\ q_3 &= \text{third quartile} \\ n &= \text{number of samples} \\ c(n) &= \frac{17.63n - 23.64}{7.74n - 3.71} \approx 2.3 \text{ for large } n \\ \text{For our dataset, } q_2 &= 11.25\\ IQR &= 7.75\\ [c_1, c_2] &= [-6.575, 29.075] \end{aligned}$$
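A sketch of the median rule in Python. Note two assumptions: the exact \(c(16) \approx 2.15\) is used rather than the large-\(n\) value 2.3 used in the worked example above, so the fences come out slightly narrower than \([-6.575, 29.075]\) (both versions flag 45 and 55); and the quartiles are passed in explicitly because quartile conventions differ between packages:

```python
import statistics

data = [3, 5.1, 5.3, 5.7, 6, 7.2, 10, 11, 11.5, 12, 13, 13.5, 14.2, 15.0, 45, 55]

def carling_fences(xs, q1, q3):
    """Median rule: q2 +/- c(n) * IQR, with the exact c(n) from Carling (2000)."""
    n = len(xs)
    c = (17.63 * n - 23.64) / (7.74 * n - 3.71)  # ~2.15 for n = 16
    q2 = statistics.median(xs)
    return q2 - c * (q3 - q1), q2 + c * (q3 - q1)

# quartiles as computed in the text: q1 = 5.925, q3 = 13.675
low, high = carling_fences(data, q1=5.925, q3=13.675)
print([x for x in data if x < low or x > high])  # 45 and 55 are flagged
```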

Tukey's boxplot method is a well-known, simple graphical method for finding the five-number summary and outliers in univariate data. This method is less sensitive to extreme values than the previous methods since it uses quartiles instead of the sample mean and variance. From the five-number summary, we first get the values of \(Q1\) and \(Q3\), and from those we get the interquartile range. $$IQR = Q3 - Q1$$ Using this, we define an interval, and any value outside it is flagged as an outlier. $$\big[ Q_1 - k (Q_3 - Q_1), \, Q_3 + k (Q_3 - Q_1) \big], \; \text{where } k \text{ is nonnegative}$$ John Tukey proposed that \(k = 1.5\) indicates an outlier, while \(k = 3.0\) indicates a value that is far out. There is no deep statistical basis for Tukey's choice of 1.5 for the inner and 3 for the outer fences. For our example dataset, $$\begin{aligned} Q1 &= 5.925 \\ Q2 &= 11.25 \\ Q3 &= 13.675\\ IQR &= 7.75\end{aligned}$$ So the interval according to Tukey will be $$\big[ Q_1 - 1.5(Q_3 - Q_1), \, Q_3 + 1.5 (Q_3 - Q_1) \big] = [-5.7, 25.3]$$ The lowest value from our data set that falls in this interval is \(3\) and the largest is \(15.0\), while 45 and 55 are detected as outliers.
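Tukey's fences reduce to two lines of arithmetic once the quartiles are known (here the quartile values computed above are passed in directly):

```python
def tukey_fences(q1, q3, k=1.5):
    """Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 inner, 3.0 outer."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# quartiles of the example dataset, as computed in the text
low, high = tukey_fences(5.925, 13.675)
print(round(low, 3), round(high, 3))  # -5.7 25.3
```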

**Statistics** is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. So what is **data**? Data is simply defined as distinct pieces of information. Data can come in many forms, e.g. text, video, databases, spreadsheets, audio, images, etc. In this machine learning era, there are numerous use cases for data. From speech recognition to autonomous driving, data has influenced our lives like never before, and statistics plays a big role in it. Data can be divided into two basic types: **quantitative** and **categorical**.

**Quantitative data** contains numerical values, while **categorical data** contains labels. We can do mathematical operations on quantitative data, while with categorical data we define groups or sets of items. Quantitative data can be divided into two groups, **discrete** and **continuous**. Discrete data represents particular countable values, for example, the number of pages in a book or the number of dogs. Continuous data, on the other hand, is not restricted to defined values but can take any value within a certain range, for example, the age of a child or the amount of rain in a year. Categorical data can also be divided into two groups: **nominal** and **ordinal**. Categorical data that can be represented in a ranked order, e.g. grade results of an exam (A+, A, A-), is defined as ordinal. If there is no ranked order in the categorical data, it is defined as nominal data, e.g. different accessories. Understanding the different data types is important since it enables us to determine which types of analysis are best suited for which type of data. Here we will take a dataset containing 2015 homicide data in the USA.

Age | Number of Homicide Deaths |
---|---|
21 | 652 |
22 | 633 |
23 | 652 |
24 | 644 |
25 | 610 |
26 | 565 |

In this dataset, age is continuous and the number of homicide deaths is discrete. To describe or analyze both discrete and continuous quantitative data, we generally discuss **four** main aspects: the **measure of center**, the **measure of spread**, the **shape**, and **outliers**.

The measures of center, also known as measures of central tendency, are summary statistics that represent the central or typical value of a dataset. These measures indicate where most values of a distribution fall and are also referred to as the central location of a distribution. The three most common measures of center are:

- **Mean**
- **Mode**
- **Median**

In the US homicide dataset, the number of homicide deaths associated with age 21 is 652, 633 for age 22, 652 for age 23, and so on. If we want to know the expected number of deaths associated with any given age, we generally use the mean or average to answer the question. The mean is defined as the sum of the values divided by the count of the values in our dataset and is denoted \(\bar x\). *Suppose we have a set of real-valued data \((x_1, x_2, \cdots, x_n)\). Then the sample mean of the data is defined as:* $$\bar x = \frac{1}{n}\sum_{i = 1}^{n} x_i$$ For our dataset, $$\text{mean}(\bar x) = \frac{652 + 633 + 652 + 644 + 610 + 565}{6} = 626$$

Mean is very important and common in data science. A common preprocessing step for data analysis is to center a set of data by subtracting its mean. An example of centering the dataset is given below.
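Centering the homicide-death counts from the table above amounts to subtracting the mean from every value (a minimal sketch):

```python
deaths = [652, 633, 652, 644, 610, 565]  # homicide deaths for ages 21-26

mean = sum(deaths) / len(deaths)       # 626.0
centered = [x - mean for x in deaths]  # [26.0, 7.0, 26.0, 18.0, -16.0, -61.0]

print(sum(centered))  # 0.0 - centered data always sums to zero
```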

However, the mean is not always the best measure of center: one or two values in the dataset getting much bigger or smaller can pull the mean up or down with them. Also, the mean can give decimal values for discrete data.

The mean gives us the average of the dataset but does not give the center value of the dataset unless the dataset is symmetric. So we need a measurement that will always give us the center of the dataset. The **median** does that for us, and it has some other advantages over the mean. Extreme values (outliers) do not affect the median as strongly as they do the mean, and it is useful for comparing different sets of data since it is unique: there is only one possible value.

Median is defined as the **middle value** of the **sorted** dataset. Median divides our dataset in a way such that fifty percent of our dataset is less than the median and the other fifty percent is greater than the median. If the number of observations is an odd value then the median is simply the middle value of the observations. For example, if we take the first five values of our given dataset \(652, 633, 652, 644, 610\) then first we need to sort our dataset in ascending order to get \(610, 633, 644, 652, 652\). Then we will choose only the middle value which is \(644\) in our case.

For the even number of observations, the median is the mean of the two middle values of our dataset. For example, if we take the first six observations from our above dataset \(652, 633, 652, 644, 610, 565\) and sort in ascending order \(565, 610, 633, 644, 652, 652\). So the median will be $$\frac{633 + 644}{2} = 638.5$$
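The odd and even cases can be folded into one small function (a sketch):

```python
def median(xs):
    """Middle value of the sorted data; mean of the two middle values if even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

print(median([652, 633, 652, 644, 610]))       # 644 (odd count)
print(median([652, 633, 652, 644, 610, 565]))  # 638.5 (even count)
```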

Another popular measure of center is the **Mode**, which gives us the **most frequent** or **common value** in the dataset. In our dataset \(652, 633, 652, 644, 610, 565\) we can see that 652 appears twice and is the most frequent, so 652 is the mode. There can be **multiple** modes in a dataset. For example, if a dataset contains \(1, 2, 2, 3, 3, 4\), then the modes are \(2\) and \(3\). However, if all the observations in a dataset occur with the **same frequency**, there is **no mode**; e.g. \(1, 1, 2, 2, 3, 3, 4, 4\) has no mode since all observations occur the same number of times. We generally use the mode if the data is **categorical**, e.g. colors, fruits, etc.
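All three cases (one mode, multiple modes, no mode) can be handled with a frequency count (a sketch using `collections.Counter`):

```python
from collections import Counter

def modes(xs):
    """All most-frequent values; empty list when every value is equally frequent."""
    counts = Counter(xs)
    top = max(counts.values())
    if len(set(counts.values())) == 1:
        return []  # all observations occur equally often: no mode
    return [x for x, c in counts.items() if c == top]

print(modes([652, 633, 652, 644, 610, 565]))  # [652]
print(modes([1, 2, 2, 3, 3, 4]))              # [2, 3]
print(modes([1, 1, 2, 2, 3, 3, 4, 4]))        # []
```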

**Measure of spread** is the second aspect of analyzing quantitative data. While the measure of center tells us about the center of a dataset, measures of spread are **numerical values** that describe **how far** the observations in a dataset are from one another.

From the above images we can see that both have a similar mean, mode, and median, which is about 100. But the spread of the data is different: in the left image the data ranges from 90 to 109, while in the right image it ranges from 86 to 115, indicating more spread. To measure the spread of data we generally use:

- Range
- Interquartile Range (IQR)
- Standard Deviation
- Variance

One of the most common ways to calculate the spread of data is by looking at the **five-number summary** of data. The **five-number summary** consists of five values:

- **Maximum**
- **Third Quartile (Q3)**
- **Second Quartile/Median (Q2)**
- **First Quartile (Q1)**
- **Minimum**

**Maximum** and **minimum** values are the highest and lowest values of a dataset respectively. The **second quartile (Q2)** is the middle value or the **median** value of the dataset. The **first quartile (Q1)** value is the median value of the data that is on the **left side** of the second quartile while the **third quartile (Q3)** value is the median value of the data that is on the **right side** of the second quartile.

For example, suppose we have a dataset containing \( 5, 2, 8, 1, 3, 6, 8\). To get the five-number summary, we will first arrange our values in **ascending order** \( 1, 2, 3, 5, 6, 8, 8\). Then we will easily get the minimum and maximum value. In our case, the minimum value will be \(1\) and the maximum will be \(8\). The second quartile (Q2) or median will be \(5\). Then, the first quartile will be the median of the data that is on the left side of the second quartile, and for our case, it will be \(2\). Finally, the third quartile will be the median of data that is the right side of the second quartile and in our case, it will be \(8\).
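This procedure, with the median excluded from both halves when the count is odd (as in the example), can be sketched as follows. Note that quartile conventions differ between textbooks and software, so interpolating implementations may give slightly different Q1/Q3 values:

```python
def five_number_summary(xs):
    """(min, Q1, Q2, Q3, max), with Q1/Q3 as medians of the two halves."""
    s = sorted(xs)
    n = len(s)

    def med(a):
        m = len(a) // 2
        return a[m] if len(a) % 2 == 1 else (a[m - 1] + a[m]) / 2

    lower = s[: n // 2]        # values left of the median (median excluded)
    upper = s[(n + 1) // 2:]   # values right of the median
    return s[0], med(lower), med(s), med(upper), s[-1]

print(five_number_summary([5, 2, 8, 1, 3, 6, 8]))  # (1, 2, 5, 8, 8)
```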

The best way to visualize the five-number summary is by using a boxplot. Suppose we have a distribution of the ages of NBA players in the 2013-14 season. If we plot the dataset:

From the box plot, we can easily visualize the five-number summary.

After we get the five-number summary, getting the **range** and the **interquartile range** is simple. The range is the **difference** between the maximum and minimum values in the dataset. For our dataset, $$ \text{range} = \text{maximum} - \text{minimum} = 8 - 1 = 7$$ The **interquartile range** is the difference between the third and first quartiles. $$\text{IQR} = Q3 - Q1 = 8 - 2 = 6$$

Previously, we used the **five-number summary** to understand the spread of a distribution. But what if we want to use a single number instead of five to compare the spread of two distributions? The easiest and most common way to do that is with the **standard deviation** and **variance**. The **standard deviation** is, roughly speaking, the **average distance** of each observation from the mean. To calculate the standard deviation, we need to calculate the variance first.

Suppose we have a dataset containing the values \(1, 1, 2, 4\). Taking the mean, we get $$\bar x = \frac{1+1+2+4}{4} = 2$$ If we calculate the sum of the differences between each observation and the mean, we get $$\begin{aligned} 1 - 2 &= -1\\ 1 - 2 &= -1\\ 2 - 2 &= 0\\ 4 - 2 &= 2\\ -1 - 1 + 0 + 2 &= 0 \end{aligned}$$ The sum is \(0\), which would suggest there is **no spread** in the data. That is not true, and to prevent it we make all the distances positive by squaring them. $$\begin{aligned} (1 - 2)^2 &= 1\\ (1 - 2)^2 &= 1\\ (2 - 2)^2 &= 0\\ (4 - 2)^2 &= 4\\ 1 + 1 + 0 + 4 &= 6 \\ \text{Variance} &= \frac{6}{4} = 1.5 \\ \text{Standard Deviation} &= \sqrt{1.5} = 1.22\end{aligned}$$ The average of these squared differences is called the **variance**: the variance is the average squared difference of each observation from the mean. $$\begin{aligned}\text{Variance, } \sigma^2 &= \frac{1}{n}\sum_{i = 1}^n (x_i - \bar{x})^2 \\ \text{SD, } \sigma &= \sqrt{\frac{1}{n}\sum_{i = 1}^n (x_i - \bar{x})^2}\end{aligned}$$
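The calculation above translates directly into code (population variance, dividing by \(n\) as in the formulas):

```python
import math

def variance(xs):
    """Average squared difference of each observation from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    return math.sqrt(variance(xs))

print(variance([1, 1, 2, 4]))           # 1.5
print(round(std_dev([1, 1, 2, 4]), 2))  # 1.22
```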

The variance is thus the average of the squared values we used to make the distances positive. To get the standard deviation, we simply take the square root of the variance; the standard deviation is actually the **root mean square (r.m.s.)** deviation from the average.

The **shape** of a distribution tells us a lot about its measures of center and spread. Looking at a **histogram**, we can easily identify the shape of the data and extract valuable information from it. The shape of our data usually falls into three common categories:

- **Left Skewed**
- **Right Skewed**
- **Symmetric**

Distributions where we can draw a line down the **middle** and the right side **mirrors** the left side are called symmetric distributions. One common example is the **normal distribution**, also known as the **bell curve**. For a **symmetric distribution**, $$mean = median = mode$$ The **mode** is the **tallest bar** in the histogram.

For a **skewed distribution**, the mean is pulled toward the tail of the distribution while the median stays close to the mode. Histograms that have shorter bars on the right and taller bars on the left are called right-skewed. For a right-skewed distribution, $$mean > median$$

Histograms that have shorter bars on the left and taller bars on the right are called left-skewed. In a left-skewed histogram, $$mean < median$$

In general, outliers are points that fall very far from the rest of the data points, and they can impact our analysis. For example, if we have the dataset \(10, 11, 14, 15, 120\), then the mean will be \(34\), which can be misleading since most of our data points are much less than that. Outliers do not affect the median or mode as strongly. Outliers can also carry useful information, for example, to identify medical practitioners who under- or over-utilize specific procedures or medical equipment, such as an x-ray instrument. We can easily visualize outliers by creating a quick plot, e.g. a scatter plot, box plot, histogram, etc.

From the above plot, we can easily visualize an outlier that is far from the other data points. In the **box plot**, an **outlier** is defined as a data point that is located outside the whiskers of the **box plot**.