Why the Central Limit Theorem Matters in Data Science

Today I’ll be discussing what the central limit theorem (or CLT) is and why it is important for every data science enthusiast to know.

Formal Definition

The central limit theorem states that, for a population with any distribution (provided its variance is finite), the distribution of the sample means approaches a normal distribution as the sample size grows.

Let’s Cover the Basics First

To understand this theorem more clearly, I’ll briefly discuss histograms and the standard normal distribution.

Histograms

A histogram is a simple chart that data scientists use to understand and visualise the distribution of a dataset. It groups the values into bins and shows how many observations fall into each bin.
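
To make the binning idea concrete, here is a minimal sketch that counts values per bin with NumPy. The exam scores, the seed and the choice of 10 bins are my own assumptions for illustration, not data from a real study.

```python
import numpy as np

# Hypothetical data: 1,000 exam scores drawn from a normal distribution
rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=10, size=1000)

# Split the range of the data into 10 equal-width bins and count values per bin
counts, bin_edges = np.histogram(scores, bins=10)

# Print a crude text histogram: one '#' per 10 observations
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print('{:6.1f} to {:6.1f} | {}'.format(left, right, '#' * (int(count) // 10)))
```

Passing the same data to plt.hist draws these counts as bars, which is what the simulation later in this post does.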

Standard Normal Distribution

The standard normal distribution, or bell curve, is a special case of the normal distribution: it is a normal distribution with a mean of zero and a standard deviation of one. Any normal random variable X with mean μ and standard deviation σ can be converted to this scale by computing its z-score:

                      z = (X - μ) / σ
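
As a quick illustration of the formula, the sketch below standardises a handful of made-up observations. The five values are hypothetical, and I use the population standard deviation (NumPy's default, ddof=0).

```python
import numpy as np

# Hypothetical observations (for example, heights in cm)
x = np.array([150.0, 160.0, 165.0, 170.0, 180.0])

mu = x.mean()      # mean of the data: 165.0
sigma = x.std()    # population standard deviation (ddof=0): 10.0

# Apply z = (X - mu) / sigma to every observation
z = (x - mu) / sigma   # -1.5, -0.5, 0.0, 0.5, 1.5 for this data

print(z)
print('mean of z:', z.mean())
print('std of z:', z.std())
```

After standardising, the values have a mean of zero and a standard deviation of one, whatever units the original data used.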

Assumptions Behind the Central Limit Theorem

It’s important to understand the assumptions behind this theorem:

  1. Samples should be independent of each other. One sample should not influence the other samples.
  2. Sample size should be no more than 10% of the population when sampling is done without replacement.
  3. The sample size should be sufficiently large; a common rule of thumb is n ≥ 30. The more skewed or asymmetric the population, the larger the sample needs to be. If the population is symmetric, smaller samples can suffice.
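
The third assumption can be checked with a small experiment. The sketch below is only an illustration under my own assumptions: an exponential population (strongly right-skewed), 5,000 repetitions, and two helper functions, skewness and sample_mean_skewness, that I made up for this post. It measures how lopsided the distribution of sample means is for a small and a large sample size.

```python
import numpy as np

rng = np.random.default_rng(42)

def skewness(values):
    """Sample skewness: about 0 for a symmetric distribution."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

def sample_mean_skewness(n, trials=5000):
    """Skewness of `trials` sample means, each from an exponential sample of size n."""
    samples = rng.exponential(scale=1.0, size=(trials, n))
    return skewness(samples.mean(axis=1))

skew_small = sample_mean_skewness(n=5)
skew_large = sample_mean_skewness(n=100)
print('skewness of sample means, n=5:  ', skew_small)
print('skewness of sample means, n=100:', skew_large)
```

With n = 5 the sample means should still be visibly skewed, while with n = 100 the skewness should be much closer to zero, which is why skewed populations call for larger samples.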

Theorem in Practice

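Let’s take an example. Say you work at a university and you want to understand the distribution of earnings in alumni’s first year out of school. You cannot survey every graduate, but you can survey random samples of them. By the central limit theorem, the means of those samples will be approximately normally distributed around the true average earnings, even though earnings themselves are typically skewed.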
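
Here is a rough sketch of that scenario. Everything numeric in it is an assumption of mine: a made-up population of 100,000 alumni with log-normally distributed earnings, surveys of 50 alumni each, and 1,000 repeated surveys.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population: first-year earnings of 100,000 alumni (right-skewed)
population = rng.lognormal(mean=10.8, sigma=0.5, size=100_000)

n = 50             # alumni surveyed per sample (well under 10% of the population)
num_samples = 1000

# Draw each sample without replacement and record its mean
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(num_samples)
])

print('population mean:      {:.0f}'.format(population.mean()))
print('mean of sample means: {:.0f}'.format(sample_means.mean()))
```

The mean of the sample means should land close to the population mean, and the sample means vary far less than individual earnings do.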

Code in Python

For the code, we switch to a simpler simulation. We create a random sample of women’s weights (assumed to range between 50 and 80 kg) of size n = 40. Then we repeat the sampling 1,000 times and check whether the distribution of the sample means resembles a normal distribution.

from numpy import mean
from numpy.random import randint, seed
import matplotlib.pyplot as plt

# seed the random number generator so the results are reproducible
seed(1)

# generate one sample of 40 women's weights
# (randint excludes the upper bound, so the values fall in 50..79)
weights = randint(50, 80, 40)
print(weights)
print('The average weight is {} kg'.format(mean(weights)))

# repeat the sampling 1000 times and record the mean of each sample
means = [mean(randint(50, 80, 40)) for _i in range(1000)]

# plot the distribution of sample means
plt.hist(means)
plt.show()
print('The mean of the sample means is {}'.format(mean(means)))
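
Beyond eyeballing the histogram, we can compare the simulation with what theory predicts: the sample means should centre on the population mean, with a spread of roughly σ/√n. The sketch below repeats the simulation with NumPy's newer default_rng generator instead of the seed/randint calls above, which is purely my own choice, so its numbers will not match the run above exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 40
# 1000 samples of 40 integer weights each, drawn from 50..79 (upper bound excluded)
samples = rng.integers(50, 80, size=(1000, n))
sample_means = samples.mean(axis=1)

# Mean and standard deviation of a discrete uniform distribution on 50..79
values = np.arange(50, 80)
pop_mean = values.mean()
pop_std = values.std()

print('mean of sample means:', sample_means.mean(), '(theory: {})'.format(pop_mean))
print('std of sample means: ', sample_means.std(),
      '(theory: {:.3f})'.format(pop_std / np.sqrt(n)))
```

If the two pairs of numbers are close, the simulation is behaving the way the central limit theorem says it should.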
