Why The Central Limit Theorem in Data Science?

Today I’ll be discussing what the central limit theorem (or CLT) is and why it is important for every data science enthusiast to know.

Formal Definition

The theorem states that as the sample size increases, the distribution of the sample mean, computed across many repeated samples, approaches a Gaussian (normal) distribution, whatever the shape of the underlying population, provided the population has a finite variance. For the approximation to hold, each sample must be sufficiently large.
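
For reference, the classical (Lindeberg-Lévy) form of the theorem can be written as below. This is a sketch of the standard statement, assuming X_1, ..., X_n are independent and identically distributed with finite mean μ and finite variance σ²; the symbols are chosen to match the z score formula in this post.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0, 1)
\quad \text{as } n \to \infty
```

Put informally, for large n the sample mean is approximately normal with mean μ and standard deviation σ/√n.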

Let’s cover the basics first

Histograms

A histogram shows the number of occurrences on the y-axis for different values, or ranges of values, of a variable (say, the weight of individuals) on the x-axis, as shown in the given figure.

This depiction makes it easy to visualize the underlying distribution of the dataset and to understand other properties such as skewness and kurtosis. With histograms, the number of bins matters, and using bins of equal width makes the plot easier to interpret.
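
To make the point about bins concrete, here is a minimal sketch that counts values into ten equal-width bins with `numpy.histogram`. The data are simulated (500 hypothetical weights in kg), and the mean of 65 and standard deviation of 8 are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 simulated weights in kg (assumed mean 65, sd 8)
weights = rng.normal(loc=65, scale=8, size=500)

# Passing an integer for `bins` gives that many equal-width bins
# spanning the range of the data
counts, edges = np.histogram(weights, bins=10)
widths = np.diff(edges)

# Print each bin's range and the number of values that fall into it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f} kg: {count}")
```

The same `bins` argument can be passed to `plt.hist` to draw the chart instead of printing the counts.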

Standard Normal Distribution

The standard normal distribution is the normal distribution with mean 0 and standard deviation 1. A random variable that follows it is called a standard score or a z score. Every normal random variable X can be transformed into a z score via the following equation:

                      z = (X - μ) / σ

where X is a normal random variable, μ is its mean, and σ is its standard deviation.
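
As a quick worked example of the formula, the sketch below standardizes three values. The helper name `z_scores`, the weights, and the assumed mean (65 kg) and standard deviation (10 kg) are all hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def z_scores(x, mu, sigma):
    """Standardize values with z = (x - mu) / sigma."""
    return (np.asarray(x, dtype=float) - mu) / sigma

# Hypothetical weights in kg, with an assumed mean of 65 and sd of 10
weights = [50.0, 65.0, 80.0]
z = z_scores(weights, mu=65.0, sigma=10.0)

# 50 kg is 1.5 sd below the mean, 65 kg is at the mean, 80 kg is 1.5 sd above
print(z)
```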

Assumptions Behind the Central Limit Theorem

  1. The data must satisfy the randomization condition: it must be sampled randomly.
  2. Observations should be independent of each other. One observation should not influence the others.
  3. Sample size should be no more than 10% of the population when sampling is done without replacement.
  4. The sample size should be sufficiently large. When the population is skewed or asymmetric, the sample size should be large. If the population is symmetric, then we can draw small samples as well.
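
The last condition can be illustrated with a small simulation. This is a sketch, not part of the original example: it assumes an exponential population as the skewed case and a uniform population as the symmetric case, and uses the skewness of the sample means as a rough measure of how far they are from normal (0 means symmetric). The helper names `skewness` and `sample_means` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric distribution."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

def sample_means(draw, n, repeats=20000):
    """Means of `repeats` independent samples, each of size n."""
    return draw(size=(repeats, n)).mean(axis=1)

# Skewed population (exponential): small samples vs. large samples
skew_small = skewness(sample_means(rng.exponential, n=5))
skew_large = skewness(sample_means(rng.exponential, n=100))

# Symmetric population (uniform): small samples are already close to symmetric
skew_uniform = skewness(sample_means(rng.uniform, n=5))

print(f"exponential, n=5:   skewness = {skew_small:.2f}")
print(f"exponential, n=100: skewness = {skew_large:.2f}")
print(f"uniform, n=5:       skewness = {skew_uniform:.2f}")
```

For the skewed population, the sample means are still visibly skewed at n=5 and much less so at n=100, while the symmetric population gives nearly symmetric sample means even with small samples.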

The central limit theorem has important implications in applied machine learning. It informs the solution of linear algorithms such as linear regression, but not of complex models such as artificial neural networks, which are fit using numerical optimization methods. For those models, we must instead run experiments to observe and record the behaviour of the algorithms and use statistical methods to interpret the results.

Theorem in Practice

In practice, you won’t be able to collect a data point for every single member of a population, such as every alumnus. Instead, you sample the population many times and compute the mean of each sample. Plotting these sample means as a histogram then shows a normal distribution emerging.

The main point here is that even if the input variables are not normally distributed, the sampling distribution of the mean will approximate a normal distribution. It becomes approximately standard normal only after the sample means are standardized.
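
Here is a minimal sketch of that idea with a clearly non-normal input. It assumes an exponential population with scale 2 (so its mean and standard deviation are both 2); the sample size of 50 and the 20000 repetitions are arbitrary choices for the illustration. The sample means are centred on μ with spread σ/√n, and after standardizing them roughly 95% should fall within ±1.96, as they would for a standard normal variable.

```python
import numpy as np

rng = np.random.default_rng(7)

n, repeats = 50, 20000
mu, sigma = 2.0, 2.0  # an exponential with scale 2 has mean 2 and sd 2

# One sample mean per row: `repeats` samples of size n from a skewed population
means = rng.exponential(scale=2.0, size=(repeats, n)).mean(axis=1)

# Standardize the sample means using the population mean and sigma / sqrt(n)
z = (means - mu) / (sigma / np.sqrt(n))

# Fraction of standardized means within +/- 1.96
coverage = np.mean(np.abs(z) < 1.96)

print(f"mean of sample means: {means.mean():.3f}")
print(f"sd of sample means:   {means.std():.3f} (sigma/sqrt(n) = {sigma / np.sqrt(n):.3f})")
print(f"share within 1.96:    {coverage:.3f}")
```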

Code in Python

from numpy.random import seed
from numpy.random import randint
from numpy import mean
import matplotlib.pyplot as plt
# seed the random number generator
seed(1)
# generate a sample of 40 women's weights in kg
# (integers from 50 to 79; randint's upper bound of 80 is exclusive)
weights = randint(50, 80, 40)
print(weights)
print('The average weight is {} kg'.format(mean(weights)))

Now we repeat this sampling simulation 1000 times and plot the resulting sample means:

# calculate the mean of 1000 samples of 40 weights each
means = [mean(randint(50, 80, 40)) for _ in range(1000)]
# plot the distribution of sample means
plt.hist(means)
plt.show()
print('The mean of the sample means is {}'.format(mean(means)))

The mean of the sample means is 64.547425

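According to the CLT, the mean of the sample means (about 64.55) should be a good estimate of the true population mean, which is usually unknown in practice. In this simulation we can check it: integers drawn uniformly from 50 to 79 have a mean of 64.5.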
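
The spread of the sample means can be checked in the same way. The sketch below repeats the simulation with NumPy's newer `default_rng` generator (an assumption of this example; the code above uses the legacy `seed` and `randint` functions, so the exact numbers will differ) and compares the standard deviation of the sample means with the value σ/√n that the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
low, high, n, repeats = 50, 80, 40, 1000

# rng.integers excludes `high`, just like randint, so values run from 50 to 79
means = rng.integers(low, high, size=(repeats, n)).mean(axis=1)

# Exact mean and sd of a uniform distribution over k consecutive integers
k = high - low
true_mean = (low + high - 1) / 2      # 64.5
true_sd = np.sqrt((k ** 2 - 1) / 12)  # about 8.66
expected_se = true_sd / np.sqrt(n)    # about 1.37

print(f"mean of sample means: {means.mean():.3f} (population mean {true_mean})")
print(f"sd of sample means:   {means.std():.3f} (sigma/sqrt(n) = {expected_se:.3f})")
```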
