Statistics plays an important role in day-to-day life, with applications in areas such as predictive analytics and machine learning. A solid grasp of statistics is needed to understand these higher-level concepts. In this post, we will introduce the different parts of statistics, and in later posts, we will cover each of them in turn.

## Statistics

**Statistics** is the discipline concerned with the collection, organization, analysis, interpretation, and presentation of data. So what is **data**? Data is simply defined as distinct pieces of information. Data can take many forms, such as text, video, databases, spreadsheets, audio, and images. In this machine learning era, there are numerous use cases for data. From speech recognition to autonomous driving, data has influenced our lives in ways we never imagined, and statistics plays a big role in it. Data can be divided into two basic types: **quantitative** and **categorical**.

## Quantitative or Categorical Data

**Quantitative data** is data that holds numerical values, while **categorical data** holds labels. We can perform mathematical operations on quantitative data, whereas categorical data defines a group or set of items.

Quantitative data can be divided into two groups: **discrete** and **continuous**. Discrete data represents particular countable values, for example, the number of pages in a book or the number of dogs. Continuous data, on the other hand, is not restricted to defined values but can take any value within a certain range, for example, the age of a child or the amount of rain in a year.

Categorical data can also be divided into two groups: **ordinal** and **nominal**. Categorical data that can be placed in a ranked order, e.g. grade results of an exam (A+, A, A-), is defined as ordinal. If there is no ranked order in the categories, the data is defined as nominal, e.g. different accessories.

Understanding the different data types is important since it helps us determine which types of analysis are best suited to which type of data. Here we will take a dataset containing 2015 homicide data in the USA.

| Age | Number of Homicide Deaths |
|---|---|
| 21 | 652 |
| 22 | 633 |
| 23 | 652 |
| 24 | 644 |
| 25 | 610 |
| 26 | 565 |
In this dataset, age is continuous (although it is recorded here in whole years) and the number of homicide deaths is discrete. To describe or analyze both discrete and continuous quantitative data, we generally discuss **four** main aspects of the data: **measure of center**, **measure of spread**, **shape**, and **outliers**.

## Measures of Center

The measures of center, also known as measures of central tendency, are summary statistics that represent the central or typical value of a dataset. These measures indicate where most values of a distribution fall and are also referred to as the central location of a distribution. The three most common measures of center are:

- **Mean**
- **Median**
- **Mode**

### Mean

In the US homicide dataset, the number of homicide deaths associated with age 21 is 652, 633 for age 22, 652 for age 23, and so on. If we want to know the expected number of deaths associated with any given age, we generally use the mean, or average. The mean is defined as the sum of the values divided by the count of the values in our dataset and is denoted as \(\bar x\). *Suppose we have a set of real-valued data \((x_1, x_2, \cdots, x_n)\). Then the sample mean of the data is defined as:* $$\bar x = \frac{1}{n}\sum_{i = 1}^{n} x_i$$ For our dataset, $$\text{mean}(\bar x) = \frac{652 + 633 + 652 + 644 + 610 + 565}{6} = \frac{3756}{6} = 626$$
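The calculation above can be sketched in a few lines of Python. This is only a minimal illustration; the names `deaths`, `mean`, and `x_bar` are chosen here for the example.

```python
# Homicide deaths for ages 21-26, taken from the table above.
deaths = [652, 633, 652, 644, 610, 565]


def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


x_bar = mean(deaths)
print(x_bar)  # 626.0
```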

The mean is used widely in data science. A common preprocessing step for data analysis is to center a dataset by subtracting its mean from every value.
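As a rough sketch of centering, assuming the six homicide counts from the table are the whole dataset, we can subtract the mean from each value. After centering, the values sum to zero. The variable names are illustrative only.

```python
# Homicide deaths for ages 21-26, taken from the table above.
deaths = [652, 633, 652, 644, 610, 565]

# Compute the mean, then subtract it from every observation.
x_bar = sum(deaths) / len(deaths)
centered = [x - x_bar for x in deaths]

print(centered)       # [26.0, 7.0, 26.0, 18.0, -16.0, -61.0]
print(sum(centered))  # 0.0
```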

However, the mean is not always the best measure of center. It is sensitive to extreme values: if one or two values in the dataset become much larger or smaller, the mean shifts with them. The mean can also take a fractional value for discrete data, where such a value may not be meaningful.

### Median

The mean gives us the average of the dataset, but it does not give the center value of the dataset unless the dataset is symmetric. So, we need a measurement that will always give us the center of the dataset. The median does that for us, and it has some other advantages over the mean. Extreme values (outliers) do not affect the **median** as strongly as they do the mean, and the median is useful for comparing different sets of data since it is unique: a dataset has only one median.

Median is defined as the **middle value** of the **sorted** dataset. Median divides our dataset in a way such that fifty percent of our dataset is less than the median and the other fifty percent is greater than the median. If the number of observations is an odd value then the median is simply the middle value of the observations. For example, if we take the first five values of our given dataset \(652, 633, 652, 644, 610\) then first we need to sort our dataset in ascending order to get \(610, 633, 644, 652, 652\). Then we will choose only the middle value which is \(644\) in our case.

For an even number of observations, the median is the mean of the two middle values of the sorted dataset. For example, if we take the first six observations from our dataset above, \(652, 633, 652, 644, 610, 565\), and sort them in ascending order, we get \(565, 610, 633, 644, 652, 652\). So the median will be $$\frac{633 + 644}{2} = 638.5$$
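Both cases, odd and even, can be handled by one small function. The sketch below is a minimal illustration and assumes a non-empty list of numbers; `median` is a name chosen here, and Python's standard `statistics.median` does the same job.

```python
def median(values):
    """Return the middle value of the sorted data (mean of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: the mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([652, 633, 652, 644, 610]))       # 644
print(median([652, 633, 652, 644, 610, 565]))  # 638.5
```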

### Mode

Another popular measure of center is the **mode**, which gives us the **most frequent** or **most common value** in the dataset. In our dataset \(652, 633, 652, 644, 610, 565\), we can see that 652 appears twice and is the most frequent value. So, 652 is the mode here. There can be **multiple** modes in a dataset. For example, if a dataset contains \(1, 2, 2, 3, 3, 4\), then the modes will be \(2\) and \(3\). However, if all observations in a dataset occur with the **same frequency**, there is **no mode**, e.g. \(1, 1, 2, 2, 3, 3, 4, 4\) has no mode since every value occurs the same number of times. We generally use the mode if the data is **categorical**, e.g. colors or fruits.
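The three cases above (one mode, several modes, no mode) can be sketched as follows. The helper `modes` is a hypothetical name, and it follows this post's convention of reporting no mode when every value is equally frequent. It assumes a non-empty input.

```python
from collections import Counter


def modes(values):
    """Return the most frequent values, or [] if all values are equally frequent."""
    counts = Counter(values)
    highest = max(counts.values())
    # Convention used in this post: equal frequencies everywhere means no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([652, 633, 652, 644, 610, 565]))  # [652]
print(modes([1, 2, 2, 3, 3, 4]))              # [2, 3]
print(modes([1, 1, 2, 2, 3, 3, 4, 4]))        # []
```

Note that Python's `statistics.multimode` uses a different convention for the last case: it returns every value rather than an empty list.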

## Measure of Spread

**Measure of spread** is the second aspect of analyzing quantitative data. While the measure of center tells us about the center of a dataset, measures of spread are **numerical values** that describe **how far** the observations in a dataset are from one another.

*Figure: two distributions side by side, one with a small spread (left) and one with a large spread (right).*

From the above images, we can see that both distributions have a similar mean, median, and mode, each about 100. But the spread of the data is different. In the left image, the data ranges from 90 to 109, while in the right image it ranges from 86 to 115, indicating more spread in the data. To measure the spread of data we generally use:

- Range
- Interquartile Range (IQR)
- Standard Deviation
- Variance

### Five Number Summary

One of the most common ways to calculate the spread of data is by looking at the **five-number summary** of data. The **five-number summary** consists of five values:

- **Maximum**
- **Third Quartile (Q3)**
- **Second Quartile/Median (Q2)**
- **First Quartile (Q1)**
- **Minimum**

**Maximum** and **minimum** values are the highest and lowest values of a dataset respectively. The **second quartile (Q2)** is the middle value or the **median** value of the dataset. The **first quartile (Q1)** value is the median value of the data that is on the **left side** of the second quartile while the **third quartile (Q3)** value is the median value of the data that is on the **right side** of the second quartile.

For example, suppose we have a dataset containing \(5, 2, 8, 1, 3, 6, 8\). To get the five-number summary, we first arrange our values in **ascending order**: \(1, 2, 3, 5, 6, 8, 8\). Then we can easily read off the minimum and maximum values. In our case, the minimum value is \(1\) and the maximum is \(8\). The second quartile (Q2), or median, is \(5\). The first quartile is the median of the data on the left side of the second quartile \((1, 2, 3)\), which in our case is \(2\). Finally, the third quartile is the median of the data on the right side of the second quartile \((6, 8, 8)\), which in our case is \(8\).
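The procedure just described can be sketched in Python. This minimal version assumes the quartile convention used in this post: the median is left out of both halves when the number of observations is odd. It also assumes at least two observations. Libraries that interpolate between data points may return slightly different quartiles. The function name `five_number_summary` is chosen here for illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


print(five_number_summary([5, 2, 8, 1, 3, 6, 8]))  # (1, 2, 5, 8, 8)
```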

A box plot is one of the best ways to visualize the five-number summary. Suppose we plot the distribution of the ages of NBA players in the 2013-14 season as a box plot.

From the box plot, we can easily visualize the five-number summary.

### Range & Interquartile Range (IQR)

Once we have the five-number summary, getting the **range** and **interquartile range** is simple. The range is the **difference** between the maximum and minimum values in the dataset. For our dataset, $$\text{range} = \text{maximum} - \text{minimum} = 8 - 1 = 7$$ The **interquartile range** is the difference between the third and first quartiles. $$\text{IQR} = Q3 - Q1 = 8 - 2 = 6$$
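As a short sketch, both values can be computed for the example dataset \(5, 2, 8, 1, 3, 6, 8\). The quartiles are found with the same median-of-halves convention as in the five-number summary above, and the slicing here assumes an odd number of observations. Variable names are illustrative.

```python
from statistics import median

data = sorted([5, 2, 8, 1, 3, 6, 8])  # [1, 2, 3, 5, 6, 8, 8]
half = len(data) // 2

# Odd number of observations: the median itself is excluded from both halves.
q1 = median(data[:half])
q3 = median(data[half + 1:])

data_range = max(data) - min(data)
iqr = q3 - q1

print(data_range)  # 7
print(iqr)         # 6
```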

### Standard Deviation & Variance

Previously, we used the **five-number summary** to understand the spread of a distribution. But what if we want to use only one number instead of five to compare the spread of two distributions? The most common way to do that is by using the **standard deviation** and **variance**. The **standard deviation** can be thought of as the **average distance** of each observation from the mean. To calculate the standard deviation, we need to calculate the variance first.

Suppose we have a dataset containing the values \(1, 1, 2, 4\). Taking the mean, we get $$\bar x = \frac{1+1+2+4}{4} = 2$$ If we sum the differences between each observation and the mean, we get $$\begin{aligned} 1 - 2 &= -1\\ 1 - 2 &= -1\\ 2 - 2 &= 0\\ 4 - 2 &= 2\\ -1 -1 +0 + 2 &= 0 \end{aligned}$$ Averaging these differences gives \(0\), which would suggest there is **no spread** in the data. That is not true: the positive and negative differences simply cancel out. To prevent this, we make all the distances positive by squaring them. $$\begin{aligned} (1 - 2)^2 &= 1\\ (1 - 2)^2 &= 1\\ (2 - 2)^2 &= 0\\ (4 - 2)^2 &= 4\\ 1 +1 +0 + 4 &= 6 \end{aligned}$$ So, the mean of the squared differences will be $$\frac{6}{4} = 1.5$$ This is called the **variance**. So the variance is the average squared difference of each observation from the mean. $$\text{Variance, } \sigma^2 = \frac{1}{n}\sum_{i = 1}^n (x_i - \bar{x})^2$$

This is the average of the squared differences, which we squared to make every term positive. To get the standard deviation, we simply take the square root of the variance. The standard deviation is therefore the **root mean square (r.m.s.)** deviation from the mean. $$\text{SD, } \sigma = \sqrt{\frac{1}{n}\sum_{i = 1}^n (x_i - \bar{x})^2}$$
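The two formulas can be sketched directly in Python for the dataset \(1, 1, 2, 4\). Note that this follows the formulas above and divides by \(n\) (the population form). The function names `variance` and `std_dev` are chosen here for illustration.

```python
import math


def variance(values):
    """Average squared difference from the mean (divides by n)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / n


def std_dev(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))


data = [1, 1, 2, 4]
print(variance(data))           # 1.5
print(round(std_dev(data), 4))  # 1.2247
```

Python's standard library offers the same calculations as `statistics.pvariance` and `statistics.pstdev`. The similarly named `statistics.variance` and `statistics.stdev` divide by \(n - 1\) instead, so they return different values.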

## Shape

The **shape** of a distribution tells us a lot about its measures of center and spread. By looking at a **histogram**, we can easily identify the shape of the data and extract valuable information from it. The shape of a distribution usually falls into one of three common categories:

- **Left Skewed**
- **Right Skewed**
- **Symmetric**

Distributions where we can draw a line down the **middle** so that the right side **mirrors** the left side are called symmetric distributions. One common example is the **normal distribution**, also known as the **bell curve**. For a **symmetric distribution** with a single peak, $$\text{mean} = \text{median} = \text{mode}$$ The **mode** corresponds to the **tallest bar** in our histogram.

For a **skewed distribution**, the mean is pulled toward the tail of the distribution, while the median stays closer to the mode. Histograms that have shorter bars on the right and taller bars on the left are called right-skewed. For a right-skewed distribution, $$\text{mean} > \text{median}$$

Histograms that have shorter bars on the left and taller bars on the right are called left-skewed. For a left-skewed distribution, $$\text{mean} < \text{median}$$
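These relationships can be checked on small made-up datasets. The three lists below are hypothetical values invented for this illustration, not real data. The left-skewed list is simply the mirror image of the right-skewed one.

```python
from statistics import mean, median

# Hypothetical data, invented only to illustrate the three shapes.
symmetric = [1, 2, 3, 3, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10, 20]     # long tail on the right
left_skewed = [21 - x for x in right_skewed]     # mirrored: long tail on the left

print(mean(symmetric) == median(symmetric))      # True
print(mean(right_skewed) > median(right_skewed)) # True
print(mean(left_skewed) < median(left_skewed))   # True
```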

## Outliers

In general, outliers are points that fall very far from the rest of the data points. Our analysis can be affected by outliers. For example, if we have the dataset \(10, 11, 14, 15, 120\), then the mean will be \(34\), which can be misleading since most of our data points are much less than that. Outliers usually have little effect on the median or mode. Outliers can also carry useful information. For example, they can help identify medical practitioners who under- or over-utilize specific procedures or medical equipment, such as an x-ray instrument. We can easily visualize outliers by creating a quick plot, e.g. a scatter plot, box plot, or histogram.
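A quick sketch makes the effect concrete by comparing the mean and median of the example dataset with and without the value \(120\). The variable names are illustrative.

```python
from statistics import mean, median

with_outlier = [10, 11, 14, 15, 120]
without_outlier = [10, 11, 14, 15]

# The single extreme value drags the mean far more than the median.
print(mean(with_outlier), median(with_outlier))        # 34 14
print(mean(without_outlier), median(without_outlier))  # 12.5 12.5
```

Removing the outlier moves the mean from 34 to 12.5, while the median only moves from 14 to 12.5.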

From the above plot, we can easily visualize an outlier that is far from the other data points. In the **box plot**, an **outlier** is defined as a data point that is located outside the whiskers of the **box plot**.
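As a rough sketch of the whisker rule, we assume here the common convention that whiskers extend at most \(1.5 \times \text{IQR}\) beyond the first and third quartiles. The post itself does not state which rule its plot uses, so treat the factor `k=1.5` as an assumption. The quartiles use the median-of-halves method, and the data is hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if len(ordered) % 2 == 1 else ordered[half:]
    return median(lower), median(upper)


def find_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


# Hypothetical data invented for this illustration.
data = [10, 11, 12, 13, 14, 15, 16, 17, 120]
print(find_outliers(data))  # [120]
```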