Statistics Cheat Sheet for Data Scientists (Part-1)
Data science is an interdisciplinary field that requires knowledge of several areas, including machine learning and statistics. If you are starting your data science journey, a solid understanding of statistics will give you an advantage along the way.
Statistics provides tools and methods for finding structure in data and extracting deeper insights from it. In this series of articles, I will cover the basic statistics topics for data science and data analytics. These articles should help you get started with statistics or refresh what you already know.
Random Variables
A random variable is a function that maps the outcomes of a random experiment to numerical values. The set of all possible outcomes of the experiment is called the sample space. The random experiment can be anything, such as flipping a coin or rolling a die.
We can describe the experiment of flipping a coin with a random variable X that takes the value 1 if the outcome is ‘heads’ and 0 if the outcome is ‘tails’. In this experiment, the sample space is {heads, tails} and X takes values in {0, 1}. Each repetition of the random experiment is called a ‘trial’, and an ‘event’ is a set of outcomes, such as ‘the coin lands heads’. The chance that an event occurs is called the ‘probability’ of that event. It is denoted by P(X = x), where x is a value that the random variable X can take. For example, when rolling a fair die, the probability of getting a 1 is P(X = 1) = 1/6.
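To make the die example concrete, the short sketch below compares the exact probability 1/6 with an estimate obtained by repeating the experiment many times. The function names, the seed, and the number of trials are illustrative choices, not part of the original discussion.

```python
import random
from fractions import Fraction


def roll_die(rng):
    """Return the outcome of one roll of a fair six-sided die."""
    return rng.randint(1, 6)


def estimate_probability(value, trials, rng):
    """Estimate P(X = value) as the fraction of trials that produced it."""
    hits = sum(1 for _ in range(trials) if roll_die(rng) == value)
    return hits / trials


rng = random.Random(42)      # fixed seed so the run is repeatable
exact = Fraction(1, 6)       # one favourable outcome out of six
estimate = estimate_probability(1, 60_000, rng)

print(f"exact P(X=1)     = {float(exact):.4f}")
print(f"estimated P(X=1) = {estimate:.4f}")
```

With more trials the estimate should move closer to 1/6, which is the usual frequency interpretation of probability.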
Population and Sample
The concepts of population and sample can be confusing because the terms are sometimes used loosely. The population is the entire group one wants statistical information about. A sample is a subset of the population, used because it is usually difficult to analyse the complete population. If the sample is drawn at random and is large enough, statistics computed on the sample can serve as estimates for the complete population.
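As a rough illustration of this idea, the sketch below builds a made-up population of heights and compares its mean with the mean of a random sample. The population parameters (170 cm, 10 cm) and the sizes are assumptions chosen only for the example.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 heights in cm (made-up parameters).
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Simple random sample of 1,000 individuals, drawn without replacement.
sample = rng.sample(population, 1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
```

The two means will typically differ slightly, and the gap tends to shrink as the sample grows.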
Mean, Median, Mode, Variance and Standard Deviation
The mean is the average of a finite set of numbers. Suppose a random variable X has N observations (data points) in the sample. The mean is then defined as:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

where the x_i are the observed values of the random variable X in the dataset. For a random variable, the mean is also referred to as the expectation or expected value.
The median is the central value of a sorted dataset, i.e. 50% of the data points have a value smaller than or equal to the median and the other 50% have a value greater than or equal to it. If the size of the dataset is odd, the median is the middle number of the sorted values, with the same number of values below and above it. If the size of the dataset is even, the median is calculated by identifying the middle pair of sorted values and taking their average.
The mode of the dataset is the most frequent value, i.e. the number that occurs the highest number of times. Finding it involves counting the occurrences of each value in the dataset and picking the value with the highest count. A dataset can have more than one mode if several values tie.
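The three definitions above can be sketched in a few lines of Python. The dataset is made up for illustration, and the hand-written functions are only there to mirror the definitions; the standard `statistics` module provides equivalent functions.

```python
import statistics
from collections import Counter

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up dataset with an even number of values


def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data, or the average of the middle pair."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def mode(values):
    """Most frequent value (the first one encountered if there is a tie)."""
    return Counter(values).most_common(1)[0][0]


print(mean(data))    # 5.0
print(median(data))  # 4.5
print(mode(data))    # 4

# The standard library gives the same answers.
print(statistics.mean(data), statistics.median(data), statistics.mode(data))
```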
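The variance measures how far the data points are spread out from the mean. It is equal to the average of the squared differences between the values and the mean:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

where μ is the mean of the N values x_i.
The standard deviation is the square root of the variance and measures the extent to which the data varies from its mean, in the same units as the data. It is defined as:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$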
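Here is a minimal sketch of both quantities on a made-up dataset. It divides by N, which matches `statistics.pvariance` and `statistics.pstdev`; note that `statistics.variance` and `statistics.stdev` divide by N - 1 instead, the convention commonly used when estimating from a sample.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up dataset


def variance(values):
    """Average squared distance from the mean (divides by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def std_dev(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values))


print(variance(data))  # 4.0
print(std_dev(data))   # 2.0

# Cross-check against the standard library's population versions.
print(statistics.pvariance(data), statistics.pstdev(data))
```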
Covariance and Correlation
The covariance is a measure of the joint variability of two random variables. It helps describe the relationship between the two variables. Mathematically, it is defined as the expected value of the product of the two random variables’ deviations from their means: Cov(X, Z) = E[(X - E[X])(Z - E[Z])]. It can be negative, positive, or 0. A positive value indicates that the two random variables tend to vary in the same direction, a negative value indicates that they tend to vary in opposite directions, and a value of 0 means there is no linear relationship between them.
The correlation measures both the strength and the direction of the linear relationship between two variables. The correlation between two random variables X and Z is equal to the covariance between them divided by the product of their standard deviations: Corr(X, Z) = Cov(X, Z) / (σ_X σ_Z). It always lies between -1 and 1.
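The two definitions can be sketched directly in Python. The data below is made up so that one pair moves perfectly together and the other perfectly in opposite directions; the functions divide by N, treating the data as the whole population.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, zs):
    """Average product of the deviations from the two means (divides by N)."""
    mx, mz = mean(xs), mean(zs)
    return sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / len(xs)


def correlation(xs, zs):
    """Covariance divided by the product of the standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))  # Cov(X, X) is the variance of X
    std_z = math.sqrt(covariance(zs, zs))
    return covariance(xs, zs) / (std_x * std_z)


x = [1, 2, 3, 4, 5]
z_same = [2, 4, 6, 8, 10]      # rises with x
z_opposite = [10, 8, 6, 4, 2]  # falls as x rises

print(covariance(x, z_same), round(correlation(x, z_same), 6))
print(covariance(x, z_opposite), round(correlation(x, z_opposite), 6))
```

The first pair should give a positive covariance and a correlation of 1, the second a negative covariance and a correlation of -1.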
Difference between Correlation and Covariance
Covariance indicates whether two variables tend to vary together and in which direction, but its magnitude depends on the units of the variables, so it is hard to interpret on its own. Correlation measures the direction as well as the strength of the linear relationship: it is the covariance rescaled by the standard deviations, so it is unitless and always lies between -1 and 1. Both are defined for a pair of variables; for a dataset with more than two variables, they are computed pair by pair and collected in a covariance or correlation matrix.
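One practical difference is sensitivity to units. The sketch below, using made-up heights and weights, expresses the same heights in metres and in centimetres: the covariance changes by a factor of 100 while the correlation stays the same. The helper functions are illustrative and divide by N.

```python
import math


def covariance(xs, zs):
    """Average product of deviations from the means (divides by N)."""
    mx = sum(xs) / len(xs)
    mz = sum(zs) / len(zs)
    return sum((x - mx) * (z - mz) for x, z in zip(xs, zs)) / len(xs)


def correlation(xs, zs):
    """Covariance rescaled by the two standard deviations."""
    return covariance(xs, zs) / math.sqrt(covariance(xs, xs) * covariance(zs, zs))


# Made-up measurements for five people.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [50, 58, 63, 72, 80]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)

print(f"covariance:  metres = {cov_m:.3f}, centimetres = {cov_cm:.3f}")
print(f"correlation: metres = {corr_m:.4f}, centimetres = {corr_cm:.4f}")
```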
Bayes Theorem
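Conditional probability is a measure of the probability of an event happening, given that another event has already occurred. Bayes' theorem relates the two conditional probabilities of the events X and Y and, provided Pr(Y) > 0, can be written as:

$$\Pr(X \mid Y) = \frac{\Pr(Y \mid X)\,\Pr(X)}{\Pr(Y)}$$

where: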
- Pr(X|Y): the probability of event X occurring given that event Y has occurred
- Pr(Y|X): the probability of event Y occurring given that event X has occurred
- Pr(X) and Pr(Y): the probabilities of events X and Y occurring on their own
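As a worked example, take X to be "a person has a condition" and Y to be "a test comes back positive". All the numbers below are hypothetical. The sketch also assumes Pr(Y) is not given directly and obtains it from the law of total probability, which is an extra step beyond the formula above.

```python
def bayes_posterior(prior_x, y_given_x, y_given_not_x):
    """Return Pr(X|Y) using Bayes' theorem.

    Pr(Y) is computed with the law of total probability:
    Pr(Y) = Pr(Y|X) Pr(X) + Pr(Y|not X) Pr(not X).
    """
    prob_y = y_given_x * prior_x + y_given_not_x * (1 - prior_x)
    return y_given_x * prior_x / prob_y


# Hypothetical numbers chosen only for illustration.
prior = 0.01           # Pr(X): 1% of people have the condition
sensitivity = 0.95     # Pr(Y|X): test is positive when the condition is present
false_positive = 0.05  # Pr(Y|not X): test is positive when it is absent

posterior = bayes_posterior(prior, sensitivity, false_positive)
print(f"Pr(X|Y) = {posterior:.3f}")  # 0.161
```

Even with a fairly accurate test, the posterior stays low here because the condition is rare, which is the kind of result Bayes' theorem is often used to expose.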
Thanks for reading part 1. Stay tuned for part 2, where we will discuss more complex concepts.