Descriptive Statistics: A Brief Introduction

Statistics matters in our day-to-day life, with applications in areas such as predictive analytics and machine learning. A solid understanding of statistics is needed to follow these higher-level concepts. In this post, we introduce the different parts of descriptive statistics, and later posts will describe each of them.

Statistics

Statistics is the discipline concerned with the collection, organization, analysis, interpretation, and presentation of data. So what is data? Data is simply defined as distinct pieces of information. It comes in many forms, such as text, video, databases, spreadsheets, audio, and images. In this machine learning era, the use cases for data are numerous. From speech recognition to autonomous driving, data has influenced our lives in ways we never imagined, and statistics plays a big role in that. Data can be divided into two basic types: quantitative and categorical.

Quantitative or Categorical Data

Quantitative data takes numerical values, while categorical data takes labels. We can perform mathematical operations on quantitative data, whereas categorical data defines a group or set of items. Quantitative data can be divided into two groups: discrete and continuous. Discrete data takes particular countable values, for example the number of pages in a book or the number of dogs. Continuous data, on the other hand, is not restricted to defined values and can take any value within a certain range, for example the age of a child or the amount of rain in a year. Categorical data can also be divided into two groups: ordinal and nominal. Categorical data that has a ranked order, e.g. the grades of an exam (A+, A, A-), is ordinal. If the categories have no ranked order, e.g. different accessories, the data is nominal. Understanding the different data types is important because it lets us determine which types of analysis suit which type of data. Here we will take a dataset containing 2015 homicide data in the USA.

Age    Number of Homicide Deaths
21     652
22     633
23     652
24     644
25     610
26     565

In this dataset, age is continuous and the number of homicide deaths is discrete. To describe or analyze both discrete and continuous quantitative data, we generally discuss four main aspects of the data: measures of center, measures of spread, shape, and outliers.

Measures of Center

Measures of center, also known as measures of central tendency, are summary statistics that represent the central or typical value of a dataset. They indicate where most values of a distribution fall and are also referred to as the central location of a distribution. The three most common measures of center are:

  • Mean
  • Mode
  • Median

Mean

In the US homicide dataset, the number of homicide deaths associated with age 21 is 652, 633 for age 22, 652 for age 23, and so on. If we want to know the expected number of deaths associated with any given age, we generally use the mean, or average. The mean is defined as the sum of the values divided by the count of the values in our dataset and is denoted \(\bar x\). Suppose we have a set of real-valued data \((x_1, x_2, \cdots, x_n)\). Then the sample mean of the data is defined as: $$\bar x = \frac{1}{n}\sum_{i = 1}^{n} x_i$$ For our dataset, $$\text{mean}(\bar x) = \frac{652 + 633 + 652 + 644 + 610 + 565}{6} = \frac{3756}{6} = 626$$
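
As a quick check of the arithmetic, here is a minimal Python sketch that computes the mean of the six values in the table, once by hand and once with the standard library's `statistics.mean`. The variable names are our own and only illustrative.

```python
from statistics import mean

# Homicide deaths for ages 21 to 26, taken from the table above
deaths = [652, 633, 652, 644, 610, 565]

# Sample mean: sum of the values divided by the number of values
manual_mean = sum(deaths) / len(deaths)

# The standard library gives the same result
library_mean = mean(deaths)

print(manual_mean)   # 626.0
print(library_mean)
```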

Mean is very important and common in data science. A common preprocessing step for data analysis is to center a set of data by subtracting its mean. An example of centering the dataset is given below.
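
A minimal sketch of centering, assuming we simply subtract the mean of the homicide counts from every value. The names `deaths` and `centered` are illustrative and not part of any library.

```python
# Homicide deaths for ages 21 to 26, taken from the table above
deaths = [652, 633, 652, 644, 610, 565]

# Mean of the dataset
mean_deaths = sum(deaths) / len(deaths)

# Centering: subtract the mean from every observation
centered = [x - mean_deaths for x in deaths]

print(centered)       # [26.0, 7.0, 26.0, 18.0, -16.0, -61.0]
print(sum(centered))  # 0.0, the centered data always has mean zero
```

After centering, positive values are above the average and negative values are below it.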

However, the mean is not always the best measure of center, because it is sensitive to extreme values: if one or two values in the dataset become much larger or much smaller, the mean shifts with them. The mean can also give a decimal value for discrete data.

Median

The mean gives us the average of the dataset, but it does not always give the center value unless the dataset is symmetric. So we need a measure that always gives us the middle of the dataset. The median does that, and it has some other advantages over the mean. Extreme values (outliers) do not affect the median as strongly as they do the mean, and the median is useful for comparing different sets of data since it is unique: there is only one possible value.

Median is defined as the middle value of the sorted dataset. Median divides our dataset in a way such that fifty percent of our dataset is less than the median and the other fifty percent is greater than the median. If the number of observations is an odd value then the median is simply the middle value of the observations. For example, if we take the first five values of our given dataset \(652, 633, 652, 644, 610\) then first we need to sort our dataset in ascending order to get \(610, 633, 644, 652, 652\). Then we will choose only the middle value which is \(644\) in our case.

For an even number of observations, the median is the mean of the two middle values of the sorted dataset. For example, if we take the first six observations from our dataset, \(652, 633, 652, 644, 610, 565\), and sort them in ascending order, we get \(565, 610, 633, 644, 652, 652\). The median is then $$\frac{633 + 644}{2} = 638.5$$
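
The two cases (odd and even number of observations) can be sketched in a few lines of Python. This is an illustrative implementation written for this post; in practice `statistics.median` from the standard library does the same job.

```python
def median(values):
    """Return the middle value of the sorted data (mean of the two middle values if the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([652, 633, 652, 644, 610]))       # 644
print(median([652, 633, 652, 644, 610, 565]))  # 638.5
```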

Mode

Another popular measure of center is the mode, which gives us the most frequent value in the dataset. In our dataset \(652, 633, 652, 644, 610, 565\), the value 652 appears twice, more often than any other value, so 652 is the mode. There can be multiple modes in a dataset. For example, if a dataset contains \(1, 2, 2, 3, 3, 4\), then the modes are \(2\) and \(3\). However, if all observations occur the same number of times, there is no mode; e.g. \(1, 1, 2, 2, 3, 3, 4, 4\) has no mode. We generally use the mode when the data is categorical, e.g. colors or fruits.
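
Here is a small sketch of the mode that follows the convention used in this post: several values can tie for the mode, and a dataset in which every value is equally frequent has no mode. The function name `modes` is our own. Note that the standard library's `statistics.multimode` uses a different convention and returns all values in the tied case.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or an empty list if every value is equally frequent."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # Convention from the text: if all distinct values tie, there is no mode
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([652, 633, 652, 644, 610, 565]))  # [652]
print(modes([1, 2, 2, 3, 3, 4]))              # [2, 3]
print(modes([1, 1, 2, 2, 3, 3, 4, 4]))        # []
```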

Measure of Spread

Measures of spread are the second aspect of analyzing quantitative data. While a measure of center tells us about the center of a dataset, a measure of spread is a numerical value that describes how spread out the data is, that is, how far the observations in a dataset are from one another.

From the above images we can see that both distributions have a similar mean, mode, and median, all about 100. But the spread of the data is different. In the left image the data ranges from 90 to 109, while in the right image it ranges from 86 to 115, indicating more spread. To measure the spread of data we generally use:

  • Range
  • Interquartile Range (IQR)
  • Standard Deviation
  • Variance

Five Number Summary

One of the most common ways to calculate the spread of data is by looking at the five-number summary of data. The five-number summary consists of five values:

  • Maximum
  • Third Quartile (Q3)
  • Second Quartile/Median (Q2)
  • First Quartile (Q1)
  • Minimum

Maximum and minimum values are the highest and lowest values of a dataset respectively. The second quartile (Q2) is the middle value or the median value of the dataset. The first quartile (Q1) value is the median value of the data that is on the left side of the second quartile while the third quartile (Q3) value is the median value of the data that is on the right side of the second quartile.

For example, suppose we have a dataset containing \(5, 2, 8, 1, 3, 6, 8\). To get the five-number summary, we first arrange the values in ascending order: \(1, 2, 3, 5, 6, 8, 8\). The minimum and maximum are then easy to read off. In our case, the minimum is \(1\) and the maximum is \(8\). The second quartile (Q2), or median, is \(5\). The first quartile is the median of the data on the left side of the second quartile, \(1, 2, 3\), so it is \(2\). Finally, the third quartile is the median of the data on the right side of the second quartile, \(6, 8, 8\), so it is \(8\).
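
The procedure above can be sketched in Python as follows. This is an illustrative implementation of the "median of each half" method described in this post, leaving the median itself out of both halves when the count is odd. Library functions such as NumPy's percentile routines use interpolation rules that can give slightly different quartiles.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return min, Q1, median, Q3 and max using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the median itself when forming the upper half
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return {
        "min": ordered[0],
        "q1": median(lower),
        "median": median(ordered),
        "q3": median(upper),
        "max": ordered[-1],
    }

summary = five_number_summary([5, 2, 8, 1, 3, 6, 8])
print(summary)  # {'min': 1, 'q1': 2, 'median': 5, 'q3': 8, 'max': 8}
```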

The best way to visualize the five-number summary is with a box plot. Suppose we have the distribution of the ages of NBA players in the 2013-14 season. If we plot the dataset:

From the box plot, we can easily visualize the five-number summary.

Range & Interquartile Range (IQR)

Once we have the five-number summary, getting the range and the interquartile range is simple. The range is the difference between the maximum and minimum values in the dataset. For our dataset, $$\text{range} = \text{maximum} - \text{minimum} = 8 - 1 = 7$$ The interquartile range is the difference between the third and first quartiles. $$\text{IQR} = Q3 - Q1 = 8 - 2 = 6$$
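
As a cross-check, the sketch below computes the range and IQR for the same seven values using `statistics.quantiles` from the Python standard library (available from Python 3.8). Its default "exclusive" method happens to agree with the hand calculation for this dataset; that agreement is not guaranteed for other data or other quantile methods.

```python
from statistics import quantiles

data = [5, 2, 8, 1, 3, 6, 8]

# Range: difference between the largest and smallest value
data_range = max(data) - min(data)

# Quartiles: n=4 splits the data into four parts and returns Q1, Q2, Q3
q1, q2, q3 = quantiles(data, n=4)

# Interquartile range: spread of the middle half of the data
iqr = q3 - q1

print(data_range)
print(q1, q2, q3)
print(iqr)
```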

Standard Deviation & Variance

Previously, we used the five-number summary to understand the spread of a distribution. But what if we want a single number, instead of five, to compare the spread of two distributions? The easiest and most common way to do that is with the standard deviation and the variance. The standard deviation is, roughly, the average distance of each observation from the mean. To calculate the standard deviation, we need to calculate the variance first.

Suppose we have a dataset containing the values \(1, 1, 2, 4\). The mean is $$\bar x = \frac{1+1+2+4}{4} = 2$$ If we sum the differences between each observation and the mean, we get $$\begin{aligned} 1 - 2 &= -1\\ 1 - 2 &= -1\\ 2 - 2 &= 0\\ 4 - 2 &= 2\\ -1 - 1 + 0 + 2 &= 0 \end{aligned}$$ Averaging these differences gives \(0\), which would suggest there is no spread in the data. That is not true: the positive and negative differences simply cancel out. To prevent this, we make all the distances positive by squaring them. $$\begin{aligned} (1 - 2)^2 &= 1\\ (1 - 2)^2 &= 1\\ (2 - 2)^2 &= 0\\ (4 - 2)^2 &= 4\\ 1 + 1 + 0 + 4 &= 6 \end{aligned}$$ So the mean of the squared differences is $$\frac{6}{4} = 1.5$$ This is called the variance: the average squared difference of each observation from the mean. $$\text{Variance, } \sigma^2 = \frac{1}{n}\sum_{i = 1}^n (x_i - \bar{x})^2$$

The variance is an average of squared differences, which we squared only to make them positive. To get the standard deviation, we simply take the square root of the variance. The standard deviation is therefore the root mean square (r.m.s.) deviation from the average. $$\text{SD, } \sigma = \sqrt{\frac{1}{n}\sum_{i = 1}^n (x_i - \bar{x})^2}$$
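
The two formulas can be checked with a short Python sketch for the dataset \(1, 1, 2, 4\). It uses the \(1/n\) (population) form given above and compares the result with `statistics.pvariance` and `statistics.pstdev`. Note that the related functions `statistics.variance` and `statistics.stdev` divide by \(n - 1\) instead and would give different numbers.

```python
import math
from statistics import pstdev, pvariance

data = [1, 1, 2, 4]
n = len(data)

# Mean of the data
mean = sum(data) / n

# Squared difference of each observation from the mean
squared_diffs = [(x - mean) ** 2 for x in data]

# Variance: average of the squared differences (1/n form)
variance = sum(squared_diffs) / n

# Standard deviation: square root of the variance
std_dev = math.sqrt(variance)

print(squared_diffs)  # [1.0, 1.0, 0.0, 4.0]
print(variance)       # 1.5
print(std_dev)        # about 1.2247

# Cross-check against the standard library's population versions
print(pvariance(data), pstdev(data))
```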

Shape

The shape of a distribution tells us a lot about its measures of center and spread. Looking at the histogram, we can easily identify the shape of the data, and from that we can extract valuable information. The shape of our data usually falls into one of three common categories:

  • Left Skewed
  • Right Skewed
  • Symmetric

Distributions where we can draw a line down the middle and the right side mirrors the left side are called symmetric distributions. One common example is the normal distribution, also known as the bell curve. For a symmetric distribution with a single peak, $$\text{mean} = \text{median} = \text{mode}$$ The mode is the tallest bar in our histogram.

For a skewed distribution, the mean is pulled toward the tail of the distribution, while the median stays closer to the mode. Histograms that have shorter bars on the right and taller bars on the left are called right-skewed. For a right-skewed distribution, typically $$\text{mean} > \text{median}$$

Histograms that have shorter bars on the left and taller bars on the right are called left-skewed. For a left-skewed distribution, typically $$\text{mean} < \text{median}$$
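
The mean-versus-median comparison can be turned into a rough check of skew direction. The sketch below is only a rule of thumb, not a formal skewness measure, and the three small datasets are made up for illustration.

```python
from statistics import mean, median

def skew_direction(values):
    """Guess the skew direction by comparing the mean with the median (rule of thumb only)."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    # Exact equality is rare with real data; a tolerance would be more practical
    return "symmetric"

# Made-up illustrative data, not from the homicide dataset
right_tail = [1, 2, 2, 3, 3, 3, 4, 20]       # long tail on the right
left_tail = [1, 17, 18, 18, 19, 19, 19, 20]  # long tail on the left
balanced = [1, 2, 3, 4, 5]

print(skew_direction(right_tail))  # right-skewed
print(skew_direction(left_tail))   # left-skewed
print(skew_direction(balanced))    # symmetric
```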

Outliers

In general, outliers are points that fall very far from the rest of the data. Our analysis can be impacted by outliers. For example, if we have the dataset \(10, 11, 14, 15, 120\), then the mean is \(34\), which can be misleading since most of our data points are much less than that. Outliers affect the median and the mode far less. Outliers can also carry useful information in their own right, for example when identifying medical practitioners who under- or over-utilize specific procedures or medical equipment, such as an x-ray instrument. We can easily visualize outliers by creating a quick plot, e.g. a scatter plot, box plot, or histogram.
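
To see the effect numerically, this short sketch compares the mean and median of the example dataset with and without the value 120. The mean moves a lot, while the median moves only slightly.

```python
from statistics import mean, median

with_outlier = [10, 11, 14, 15, 120]
without_outlier = [10, 11, 14, 15]

# The single extreme value drags the mean far above most of the data
print(mean(with_outlier), median(with_outlier))        # 34 14
print(mean(without_outlier), median(without_outlier))  # 12.5 12.5
```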

From the above plot, we can easily see an outlier that lies far from the other data points. In a box plot, an outlier is defined as a data point located outside the whiskers.
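
The text does not say how long the whiskers are. A common convention, assumed here, is that they extend 1.5 × IQR beyond the first and third quartiles, and any point outside that span is flagged. The sketch below applies that rule using the median-of-halves quartiles from earlier in this post. The function name `iqr_outliers` and the eight-value dataset are made up for illustration.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def iqr_outliers(values, k=1.5):
    """Return the points lying more than k * IQR below Q1 or above Q3 (k=1.5 is an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in ordered if x < lower_fence or x > upper_fence]

# Made-up data: seven typical values and one extreme value
data = [10, 11, 12, 13, 14, 15, 16, 120]
print(iqr_outliers(data))  # [120]
```

With very small datasets the rule can miss obvious outliers: for \(10, 11, 14, 15, 120\) the extreme value pulls Q3 up to 67.5, the upper fence lands at 153, and nothing is flagged.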