Why The Central Limit Theorem in Data Science?

Today I’ll be discussing what the central limit theorem (CLT) is and why it is important for every data science enthusiast to know.

Formal Definition

Let’s start with the basics

Histograms

Standard Normal Distribution

                      z = (X - μ) / σ
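To make the formula concrete, here is a minimal sketch of standardizing a handful of values with NumPy. The `standardize` helper and the example weights are made up for illustration; μ and σ are taken to be the mean and standard deviation of the array itself.

```python
import numpy as np

def standardize(values):
    """Return z = (X - mu) / sigma for every element of values."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()  # population standard deviation (ddof=0)
    return (values - mu) / sigma

# hypothetical weights in kg, used only for illustration
weights = np.array([52.0, 61.0, 58.0, 70.0, 79.0])
z = standardize(weights)

print(z)
print('mean of z:', z.mean())
print('std of z:', z.std())
```

After standardizing, the values have mean 0 and standard deviation 1 (up to floating-point rounding), which is what lets us compare them against the standard normal distribution.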

Assumptions Behind the Central Limit Theorem

  1. The data must follow the randomization condition. It must be sampled randomly.
  2. Samples should be independent of each other. One sample should not influence the other samples.
  3. Sample size should be no more than 10% of the population when sampling is done without replacement.
  4. The sample size should be sufficiently large. A common rule of thumb is at least 30 observations per sample. When the population is skewed or asymmetric, the sample size should be larger. If the population is symmetric, small samples can work as well.
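The effect of sample size on a skewed population can be checked with a small simulation. The sketch below assumes an exponential population (chosen here only because it is strongly right-skewed) and compares the skewness of the sample means for a small and a large sample size. The helper names and the sizes 5 and 100 are illustrative choices, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(x):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

def sample_mean_skewness(n, repeats=20000):
    """Draw `repeats` samples of size n from an exponential population
    and return the skewness of their means."""
    samples = rng.exponential(scale=1.0, size=(repeats, n))
    return skewness(samples.mean(axis=1))

skew_small = sample_mean_skewness(5)
skew_large = sample_mean_skewness(100)

print('skewness of sample means, n=5:  ', skew_small)
print('skewness of sample means, n=100:', skew_large)
```

With the larger sample size, the skewness of the sample means should be much closer to 0, meaning their distribution is closer to the symmetric bell shape of a normal distribution.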

Theorem in Practice

Code in Python

from numpy.random import seed
from numpy.random import randint
from numpy import mean
import matplotlib.pyplot as plt
# seed the random number generator
seed(1)
# generate a sample of 40 women's weights (integers from 50 to 79 kg; randint excludes 80)
weights = randint(50, 80, 40)
print(weights)
print('The average weight is {} kg'.format(mean(weights)))
# draw 1000 more samples of size 40 and record the mean of each one
means = [mean(randint(50, 80, 40)) for _i in range(1000)]
# plot the distribution of sample means
plt.hist(means)
plt.show()
print('The mean of the sample means is {}'.format(mean(means)))
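As a follow-up check on the simulation above, the CLT also says how spread out the sample means should be: their standard deviation should be close to σ / √n, where σ is the population standard deviation and n is the sample size. The sketch below reuses the same setup (integers from 50 to 79, samples of 40, 1000 repetitions), but assumes NumPy's newer `default_rng` generator instead of `seed` and `randint`, so the numbers will not match the run above exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
low, high, n, repeats = 50, 80, 40, 1000

# the population is every integer weight from 50 to 79, each equally likely
population = np.arange(low, high)
pop_mean = population.mean()
pop_std = population.std()

# draw all samples at once: one row per sample, then average each row
sample_means = rng.integers(low, high, size=(repeats, n)).mean(axis=1)

expected_se = pop_std / np.sqrt(n)       # spread predicted by the CLT
observed_se = sample_means.std(ddof=1)   # spread seen in the simulation

print('population mean:', pop_mean)
print('mean of sample means:', sample_means.mean())
print('expected standard error:', expected_se)
print('observed standard error:', observed_se)
```

The mean of the sample means should land near the population mean of 64.5, and the two standard error values should be close to each other.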
