Why the Central Limit Theorem Matters in Data Science

Formal Definition

The central limit theorem states that, for a dataset with an unknown distribution, the distribution of the sample means will approximate a normal distribution.

In other words, as the size of each sample increases, the distribution of the mean across multiple samples approaches a Gaussian distribution. For the theorem to hold, the samples must be sufficiently large: the distribution of sample means, calculated from repeated sampling, tends toward normality as the size of those samples grows.
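In symbols, the standard i.i.d. statement of the theorem (textbook notation, not the article's) reads:

```latex
% X_1, \dots, X_n i.i.d. with mean \mu and finite variance \sigma^2
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty
```

That is, the standardized sample mean converges in distribution to the standard normal.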

Let’s cover the basics first

To understand this theorem more clearly, let’s cover the basics first. I’ll briefly discuss histograms and the standard normal distribution.


Histograms

A histogram is a simple chart type used by virtually every data scientist, mostly to understand and visualise the distribution of a dataset.

A histogram represents the number of occurrences on the y-axis for different values of a variable (say, the weight of individuals) on the x-axis.

This depiction makes it easy to visualise the underlying distribution of the dataset and to understand other properties such as skewness and kurtosis. When building a histogram, it is important to choose the number of bins carefully and to use same-width bins for ease of interpretation.
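As a quick sketch (the synthetic weight data, seed, and bin count here are illustrative, not from the article), a histogram of weights can be drawn like this:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# synthetic weights (kg) for 500 individuals, uniform between 50 and 80
weights = rng.uniform(50, 80, 500)

# 15 equal-width bins make the shape easy to read
counts, bins, _ = plt.hist(weights, bins=15, edgecolor="black")
plt.xlabel("Weight (kg)")
plt.ylabel("Number of occurrences")
plt.show()
```

`plt.hist` returns the bin counts and edges, so the same call both draws the chart and exposes the distribution numerically.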

Standard Normal Distribution

The standard normal distribution, or bell curve, is a special case of the normal distribution. It is the distribution that occurs when a normal random variable has a mean of zero and a standard deviation of one.

The normal random variable of a standard normal distribution is called a standard score or z-score. Every normal random variable X can be transformed into a z-score via the following equation:

                      z = (X − μ) / σ

where X is a normal random variable, μ is the mean, and σ is the standard deviation.
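For example, applying the formula above in Python (the numbers are illustrative):

```python
def z_score(x, mu, sigma):
    # transform a normal random variable X into a standard score
    return (x - mu) / sigma

# e.g. a 75 kg individual in a population with mean 65 kg and sd 5 kg
print(z_score(75, 65, 5))  # 2.0
```

A z-score of 2.0 means the value lies two standard deviations above the mean.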

Assumptions Behind the Central Limit Theorem

It’s important to understand the assumptions behind this theorem:

  1. The data must follow the randomization condition. It must be sampled randomly.
  2. Samples should be independent of each other. One sample should not influence the other samples.
  3. Sample size should be no more than 10% of the population when sampling is done without replacement.
  4. The sample size should be sufficiently large. When the population is skewed or asymmetric, larger samples are needed; if the population is symmetric, smaller samples can suffice.
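A small simulation (a sketch; the exponential population, seed, and sample sizes are my illustrative assumptions) makes point 4 concrete: the skewness of the sample-mean distribution shrinks as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_mean_skewness(n, reps=2000):
    # draw `reps` samples of size n from a right-skewed (exponential) population
    means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    centred = means - means.mean()
    # standardised third moment of the sample-mean distribution
    return (centred**3).mean() / (centred**2).mean() ** 1.5

# skewness shrinks roughly like 1/sqrt(n) as n grows
for n in (5, 50, 500):
    print(n, round(sample_mean_skewness(n), 3))
```

With n = 5 the sample means are still visibly right-skewed; by n = 500 they are close to symmetric, which is why skewed populations demand larger samples.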

The central limit theorem has important implications in applied machine learning. The theorem informs the solution of linear algorithms such as linear regression, but not of complex models like artificial neural networks, which are solved using numerical optimization methods. For those, we must instead run experiments to observe and record the behaviour of the algorithms, and use statistical methods to interpret the results.

Theorem in Practice

Let’s take an example. Say you work at a university and you want to understand the distribution of graduates’ earnings in their first year out of school.

The fact is you won’t be able to collect that data point for every single alumnus. Instead, you sample the population many times, obtaining an individual sample mean for each sample. Plotting these sample means as a histogram, you can see a normal distribution emerge.

The main point here is that even if the input variables are not normally distributed, the sampling distribution of the mean will approximate a normal distribution.

Code in Python

We create random samples of women’s weights (imagining they range between 50 and 80 kg), each of size n = 40. Then we run this simulation many times and observe whether the distribution of the sample means resembles a normal distribution.

from numpy.random import randint
from numpy import mean
import matplotlib.pyplot as plt

# generate one sample of 40 women's weights (kg)
weights = randint(50, 80, 40)
print('The average weight is {} kg'.format(mean(weights)))

Now we repeat this sampling simulation 1000 times and plot the distribution of the sample means:

means = [mean(randint(50, 80, 40)) for _ in range(1000)]
# plot the distribution of sample means
plt.hist(means)
print('The mean of the sample means is {}'.format(mean(means)))

The mean of the sample means is 64.547425

According to the CLT, the mean of the sample means (≈64.55) should be a good estimate of the true population mean, which is unknown.
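The CLT also pins down the spread of the sample means: their standard deviation should approach σ/√n, the standard error. A quick check on the same simulation (the seed and the discrete-uniform variance formula are my additions, not the article's):

```python
import numpy as np
from numpy.random import randint, seed

seed(1)  # make the simulation reproducible
means = [np.mean(randint(50, 80, 40)) for _ in range(1000)]

# population is discrete uniform on 50..79 (30 values): var = (30**2 - 1) / 12
sigma = np.sqrt((30**2 - 1) / 12)
predicted_se = sigma / np.sqrt(40)

print('observed sd of sample means:', np.std(means))
print('predicted standard error   :', predicted_se)
```

The observed spread of the 1000 sample means lands close to the predicted σ/√n ≈ 1.37, as the theorem says it should.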




Aditi Mittal
Machine Learning Enthusiast | Software Developer