Why The Central Limit Theorem in Data Science?
Today I’ll be discussing what the central limit theorem (CLT) is and why it is important for every data science enthusiast to know.
Formal Definition
The central limit theorem states that for a population with an unknown distribution (provided its mean and variance are finite), the distribution of the sample means will approximate a normal distribution.
In other words, if we repeatedly draw samples and compute the mean of each one, the distribution of those means will approach a Gaussian distribution as the sample size increases. For the approximation to be good, the samples must be sufficiently large.
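For readers who prefer symbols, the statement above can be written compactly. This is a sketch of the classical (Lindeberg-Levy) form, which assumes the observations X_1, ..., X_n are independent and identically distributed with finite mean μ and finite standard deviation σ. That choice of version is an assumption here, since the article does not name one.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
```

Read informally, for large n the sample mean behaves approximately like a normal variable with mean μ and standard deviation σ/√n.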
Let’s go for basics first
To understand this theorem more clearly, let’s cover the basics first. I’ll briefly discuss histograms and the standard normal distribution.
Histograms
A histogram is a very simple type of chart that every data scientist uses, mostly to understand and visualise the distribution of a given dataset.
A histogram shows the number of occurrences on the y-axis for different values of a variable (say, the weight of individuals) on the x-axis, as shown in the given figure.
This depiction makes it easy to visualise the underlying distribution of the dataset and to understand other properties such as skewness and kurtosis. When building a histogram, keep the number of bins in mind and use equal-width bins for ease of interpretation.
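To make the idea of equal-width bins concrete, here is a minimal sketch that counts values per bin by hand, using only the Python standard library. The function name `histogram_counts`, the bin range of 40 to 90 kg, and the simulated weights are all assumptions made for illustration. They are not part of the original article or real measurements.

```python
import random


def histogram_counts(values, low, high, num_bins):
    """Count how many values fall into each of num_bins equal-width bins on [low, high]."""
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        if low <= v <= high:
            # The last bin is closed on the right so that v == high is counted.
            index = min(int((v - low) / width), num_bins - 1)
            counts[index] += 1
    return counts


random.seed(0)
# Hypothetical data: 200 simulated weights in kg (not real measurements).
weights = [random.gauss(65, 8) for _ in range(200)]
counts = histogram_counts(weights, low=40, high=90, num_bins=10)

# Rough text histogram: one '#' per two observations in the bin.
for i, c in enumerate(counts):
    left = 40 + i * 5
    print(f"{left:>3}-{left + 5:<3} {'#' * (c // 2)}")
```

Changing `num_bins` shows why the bin count matters: too few bins hide the shape, and too many make it noisy.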
Standard Normal Distribution
The standard normal distribution is a special case of the normal distribution (the bell curve). It arises when a normal random variable has a mean of zero and a standard deviation of one.
The normal random variable of a standard normal distribution is called a standard score or a z score. Every normal random variable X can be transformed into a z score via the following equation:
z = (X - μ) / σ
where X is a normal random variable, μ is the mean, and σ is the standard deviation.
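The transformation above can be sketched in a few lines. The helper name `z_scores` and the example weights are hypothetical, and the sketch estimates μ and σ from the data itself (using the population standard deviation), which is an assumption. If the true μ and σ are known, they should be used instead.

```python
import statistics


def z_scores(values):
    """Standardize values: subtract the mean, then divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]


# Hypothetical weights in kg, chosen only for illustration.
weights = [52.0, 58.0, 61.0, 64.0, 67.0, 70.0, 76.0]
standardized = z_scores(weights)

print([round(z, 2) for z in standardized])
# After standardizing, the values have mean 0 and standard deviation 1
# (up to floating-point rounding).
print(abs(statistics.fmean(standardized)) < 1e-9)
print(abs(statistics.pstdev(standardized) - 1.0) < 1e-9)
```

A value equal to the mean (64 kg here) maps to a z score of 0, and values above or below the mean map to positive or negative z scores.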
Assumptions Behind the Central Limit Theorem
It’s important to understand the assumptions behind this theorem:
- The data must follow the randomization condition. It must be sampled randomly.
- Samples should be independent of each other. One sample should not influence the other samples.
- Sample size should be no more than 10% of the population when sampling is done without replacement.
- The sample size should be sufficiently large. When the population is skewed or asymmetric, the sample size should be large. If the population is symmetric, then we can draw small samples as well.
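The last assumption can be checked with a small simulation. The sketch below compares a strongly skewed population (exponential) with a symmetric one (uniform) and measures how skewed the distribution of sample means still is at several sample sizes. The two populations, the sample sizes 2, 10 and 50, and all function names are assumptions chosen for illustration.

```python
import random
import statistics


def skewness(values):
    """Skewness as the average cubed z score (about 0 for a symmetric distribution)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((x - mu) / sigma) ** 3 for x in values])


def sample_means(draw, n, repetitions, rng):
    """Draw `repetitions` independent samples of size n and return their means."""
    return [
        statistics.fmean([draw(rng) for _ in range(n)])
        for _ in range(repetitions)
    ]


def draw_skewed(rng):
    # Exponential population with mean 1: strongly right-skewed.
    return rng.expovariate(1.0)


def draw_symmetric(rng):
    # Uniform population on [0, 2]: symmetric around 1.
    return rng.uniform(0.0, 2.0)


rng = random.Random(42)
results = {}
for n in (2, 10, 50):
    skewed = skewness(sample_means(draw_skewed, n, 2000, rng))
    symmetric = skewness(sample_means(draw_symmetric, n, 2000, rng))
    results[n] = (skewed, symmetric)
    print(f"n={n:>2}  skewed population: {skewed:+.2f}   symmetric population: {symmetric:+.2f}")
```

For the symmetric population the sample means should show little skew even at small n, while for the skewed population the skew should shrink only as n grows, which is why skewed data calls for larger samples.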
The central limit theorem has important implications in applied machine learning. The theorem informs the solution to linear algorithms such as linear regression, but not to complex models like artificial neural networks, which are solved using numerical optimization methods. For those models, we must instead run experiments to observe and record the behaviour of the algorithms and use statistical methods to interpret the results.
Theorem in Practice
Let’s take an example. Say you work at a university and you want to understand the distribution of alumni earnings in their first year out of school.
In practice, you won’t be able to collect that data point for every single alumnus. Instead, you sample the population many times and compute the mean of each sample. Plotting those sample means as a histogram reveals the emergence of a normal distribution.
The main point here is that even if the underlying variable is not normally distributed, the sampling distribution of the mean will approximate a normal distribution. It becomes the standard normal distribution only after the means are converted to z scores.
Code in Python
We create a random sample of women’s weights of size n=40 (imagining whole-number weights from 50 kg up to, but not including, 80 kg, because randint excludes its upper bound). Then we will run this simulation many times and observe whether the distribution of the sample means resembles a normal distribution.
from numpy.random import seed
from numpy.random import randint
from numpy import mean
import matplotlib.pyplot as plt

# seed the random number generator
seed(1)

# generate a sample of women's weights
weights = randint(50, 80, 40)
print(weights)
print('The average weight is {} kg'.format(mean(weights)))
Now, we will repeat this sampling simulation 1000 times:
means = [mean(randint(50, 80, 40)) for _i in range(1000)]

# plot the distribution of sample means
plt.hist(means)
plt.hist(means)
plt.show()
print('The mean of the sample means is {}'.format(mean(means)))
The mean of the sample means is 64.547425
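The mean of the sample means (about 64.55) is very close to the true population mean, which in this simulation is 64.5, the average of the integers 50 through 79. In a real study the population mean is unknown, and the CLT tells us that the sample means cluster around it in an approximately normal shape, so their average is a good estimate of it.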
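The theorem also predicts how tightly the sample means cluster: their standard deviation should be close to σ/√n. The sketch below checks this for the same setup (integers from 50 to 79, samples of size 40, 1000 repetitions). It uses the standard library's `random` module instead of NumPy, which is a substitution made here so the snippet has no dependencies. Its random numbers will therefore differ from the output shown above.

```python
import math
import random
import statistics

random.seed(1)

n = 40              # size of each sample, as in the article
repetitions = 1000  # number of repeated samples

# randrange(50, 80) draws integers 50..79, the same range as numpy's randint(50, 80).
means = [
    statistics.fmean([random.randrange(50, 80) for _ in range(n)])
    for _ in range(repetitions)
]

# Exact mean and standard deviation of the population (the integers 50..79).
population = list(range(50, 80))
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

observed_mean = statistics.fmean(means)
observed_spread = statistics.pstdev(means)
predicted_spread = sigma / math.sqrt(n)

print(f"population mean             : {mu:.3f}")
print(f"mean of sample means        : {observed_mean:.3f}")
print(f"predicted spread sigma/sqrt(n): {predicted_spread:.3f}")
print(f"observed spread of the means: {observed_spread:.3f}")
```

If the two spreads agree closely, that is the quantitative side of the CLT at work: larger samples give sample means that sit closer to the population mean.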
Thank you for reading.