Statistics Cheat Sheet for Data Scientists (Part-2)
You can find Part 1 of this series at https://aditi-mittal.medium.com/statistics-cheat-sheet-for-data-scientists-part-1-e1be6a649f45, where I discussed random variables, the mean, median, mode, variance and standard deviation, along with measures of relationship such as correlation and covariance.
In this part, I will discuss the types of distributions, the central limit theorem and a few accuracy metrics.
Types of Distributions
A probability density function (PDF) describes the relative likelihood of a continuous random variable taking a given value; the probability that the variable falls within a range is the area under the PDF over that range. It can be used to summarize the likelihood of observations across the distribution’s sample space. Plotting it gives the distribution’s characteristic shape, such as the bell curve of the Gaussian distribution.
A cumulative distribution function (CDF) gives the probability that a random variable is less than or equal to a certain value.
Gaussian distribution, also known as the normal distribution, is a distribution defined by its mean and variance. Its graph is a bell-shaped curve. The special case with mean 0 and standard deviation 1 is called the standard normal distribution.
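To make the difference between a PDF and a CDF concrete, here is a minimal sketch using `NormalDist` from Python's standard `statistics` module. The choice of a normal distribution with mean 0 and standard deviation 1, and the variable names, are assumptions made for illustration only.

```python
from statistics import NormalDist

# Normal distribution with mean 0 and standard deviation 1 (illustrative choice).
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# PDF: the density (relative likelihood) at a single point.
density_at_zero = standard_normal.pdf(0.0)

# CDF: the probability that the variable is less than or equal to a value.
prob_at_most_zero = standard_normal.cdf(0.0)
prob_at_most_1_96 = standard_normal.cdf(1.96)

# The probability of landing in a range is a difference of two CDF values.
prob_within_one_sd = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

print(f"pdf(0)       = {density_at_zero:.4f}")
print(f"cdf(0)       = {prob_at_most_zero:.4f}")
print(f"cdf(1.96)    = {prob_at_most_1_96:.4f}")
print(f"P(-1<=X<=1)  = {prob_within_one_sd:.4f}")
```

Note that the PDF value is a density, not a probability; only areas under the curve (differences of CDF values) are probabilities.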
Uniform distribution is a distribution in which all outcomes are equally likely, that is, the probability of each outcome is the same.
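As a rough illustration, the sketch below shows a discrete uniform distribution (a fair six-sided die) and the density and cumulative probability of a continuous uniform distribution on an interval [a, b]. The die example and the helper names `uniform_pdf` and `uniform_cdf` are hypothetical, chosen only for this example.

```python
from fractions import Fraction

# Discrete case: a fair six-sided die, every face has the same probability.
faces = [1, 2, 3, 4, 5, 6]
die_pmf = {face: Fraction(1, len(faces)) for face in faces}

def uniform_pdf(x, a, b):
    """Density of a continuous uniform distribution on [a, b]."""
    if a >= b:
        raise ValueError("a must be smaller than b")
    # The density is constant inside the interval and zero outside it.
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """Probability that a uniform variable on [a, b] is <= x."""
    if a >= b:
        raise ValueError("a must be smaller than b")
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

print(die_pmf[3])               # probability of rolling a 3
print(uniform_pdf(3.0, 2.0, 6.0))
print(uniform_cdf(3.0, 2.0, 6.0))
```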
T-distribution, or Student’s t-distribution, is the distribution used when the parameters of the population are estimated from small samples. As the sample size grows, the t-distribution looks more and more like the normal distribution. It is defined by its number of degrees of freedom. The t-value can be calculated as:
t = (x - mean) / (s / sqrt(n))
where x is the sample mean, mean is the population mean, s is the sample’s standard deviation, and n is the sample size.
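The t-value formula above can be sketched in a few lines of Python. The function name `t_statistic`, the sample data and the population mean of 4 are made up for illustration; `statistics.stdev` is used because it computes the sample standard deviation (with an n-1 denominator).

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, population_mean):
    """Compute t = (x - mean) / (s / sqrt(n)) for a one-sample comparison."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    sample_mean = mean(sample)      # x in the formula
    s = stdev(sample)               # sample standard deviation
    return (sample_mean - population_mean) / (s / sqrt(n))

# Hypothetical small sample and hypothesized population mean.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
t_value = t_statistic(sample, population_mean=4)
degrees_of_freedom = len(sample) - 1

print(f"t = {t_value:.4f} with {degrees_of_freedom} degrees of freedom")
```

For a one-sample t-value like this one, the degrees of freedom are the sample size minus one.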
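Poisson distribution is a probability distribution that gives the probability of a given number of events occurring within a fixed time interval, assuming the events occur independently at a constant average rate:
P(X = k) = (lambda^k * e^(-lambda)) / k!
where k is the number of occurrences, lambda is the mean number of occurrences of the given event in the interval, and e is Euler’s number (approximately 2.718). Here the variance and the mean are both equal to lambda.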
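As a quick numerical check, the sketch below evaluates the standard Poisson probability mass function, P(X = k) = lambda^k * e^(-lambda) / k!, and confirms that the mean and the variance both come out as lambda. The value lambda = 3 and the name `poisson_pmf` are assumptions for this example, and the sums are truncated at k = 59 because the remaining tail is negligible for such a small lambda.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return (lam ** k) * exp(-lam) / factorial(k)

lam = 3.0  # hypothetical mean number of events per interval

# Probabilities for k = 0..59; the tail beyond this is negligible for lam = 3.
probabilities = [poisson_pmf(k, lam) for k in range(60)]

total = sum(probabilities)
poisson_mean = sum(k * p for k, p in enumerate(probabilities))
poisson_variance = sum((k - poisson_mean) ** 2 * p
                       for k, p in enumerate(probabilities))

print(f"sum of probabilities = {total:.6f}")
print(f"mean                 = {poisson_mean:.6f}")
print(f"variance             = {poisson_variance:.6f}")
```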
Binomial distribution describes the number of successes in a fixed number of trials, each of which has only two possible outcomes: success or failure. It is the probability distribution of the number of successes in n independent trials, where the probability of success is p and the probability of failure is (1-p). In this distribution, the mean is np and the variance is np(1-p). The probability of exactly x successes is:
P(X = x) = C(n, x) * p^x * (1-p)^(n-x)
where n is the total number of trials, x is the number of successes, C(n, x) = n! / (x! (n-x)!) is the number of ways to choose x successes out of n trials, and p is the probability of success, so (1-p) is the probability of failure.
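Here is a minimal sketch of the standard binomial probability mass function, P(X = x) = C(n, x) * p^x * (1-p)^(n-x), together with a numerical check of the mean np and variance np(1-p). The values n = 10 and p = 0.3 and the name `binomial_pmf` are hypothetical, picked only for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    # comb(n, x) counts the ways to place x successes among n trials.
    return comb(n, x) * (p ** x) * ((1 - p) ** (n - x))

n, p = 10, 0.3  # hypothetical number of trials and success probability

pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

binomial_mean = sum(x * px for x, px in enumerate(pmf))
binomial_variance = sum((x - binomial_mean) ** 2 * px
                        for x, px in enumerate(pmf))

print(f"P(X = 3) = {pmf[3]:.4f}")
print(f"mean     = {binomial_mean:.4f}  (n*p       = {n * p:.4f})")
print(f"variance = {binomial_variance:.4f}  (n*p*(1-p) = {n * p * (1 - p):.4f})")
```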
Chi-square distribution is the distribution of the chi-square statistic. It is widely used in hypothesis tests. The distribution is defined by its degrees of freedom. The higher the degrees of freedom of a chi-square distribution, the more it looks like a normal distribution. There are two types of chi-square tests:
- Goodness of fit test determines whether the sample data matches the distribution expected for the population.
- Test for independence is used to see whether two categorical variables are related.
We will discuss hypothesis tests in later articles.
The chi-square statistic can be calculated as:
chi^2_c = sum over i of (O_i - E_i)^2 / E_i
where O_i is the i-th observed value, E_i is the i-th expected value, and c is the degrees of freedom.
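For a goodness of fit setting, the chi-square statistic is the sum over all categories of (O_i - E_i)^2 / E_i, which can be sketched as below. The die-roll counts are invented for illustration, and `chi_square_statistic` is a hypothetical helper name; turning the statistic into a p-value requires the chi-square distribution itself, which is not covered here.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: 120 rolls of a die, counts for faces 1..6.
observed = [22, 17, 20, 26, 22, 13]
# A fair die would be expected to show each face 120 / 6 = 20 times.
expected = [20] * 6

chi_square = chi_square_statistic(observed, expected)
# For a goodness of fit test, degrees of freedom = number of categories - 1.
degrees_of_freedom = len(observed) - 1

print(f"chi-square = {chi_square:.2f}, degrees of freedom = {degrees_of_freedom}")
```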
Central Limit Theorem
Central Limit Theorem, or CLT, states that the distribution of sample means approximates a normal distribution as the sample size becomes larger, assuming that all samples are the same size, regardless of the shape of the population’s distribution.
It means that if you have a population with mean μ and standard deviation σ and take multiple sufficiently large random samples from it with replacement, then the sample means will be approximately normally distributed, with mean μ and standard deviation σ/sqrt(n), where n is the sample size.
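A small simulation can make the theorem more tangible. The sketch below draws repeated samples, with replacement, from a deliberately skewed population and compares the spread of the sample means with σ/sqrt(n). The exponential population, the seed, and the sizes used are all assumptions chosen for illustration, not part of the theorem.

```python
import math
import random
from statistics import mean, pstdev, stdev

random.seed(42)  # fixed seed so the run is repeatable

# A skewed, clearly non-normal population (exponential, chosen for illustration).
population = [random.expovariate(1.0) for _ in range(20_000)]
pop_mean = mean(population)
pop_sd = pstdev(population)

sample_size = 50     # n: size of each sample
num_samples = 2_000  # how many samples we draw

# random.choices samples with replacement.
sample_means = [
    mean(random.choices(population, k=sample_size))
    for _ in range(num_samples)
]

mean_of_sample_means = mean(sample_means)
sd_of_sample_means = stdev(sample_means)
predicted_sd = pop_sd / math.sqrt(sample_size)

print(f"population mean      = {pop_mean:.3f}")
print(f"mean of sample means = {mean_of_sample_means:.3f}")
print(f"sd of sample means   = {sd_of_sample_means:.3f}")
print(f"sigma / sqrt(n)      = {predicted_sd:.3f}")
```

Plotting a histogram of `sample_means` should show a roughly bell-shaped curve even though the population itself is strongly skewed.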
Accuracy Metrics
True positive: It means that the model detects the condition when the condition is present.
True negative: It means that the model does not detect the condition when the condition is not present.
False positive: It means that the model detects the condition when the condition is absent.
False negative: It means that the model does not detect the condition when the condition is present.
Sensitivity or recall: This metric measures the ability of a model to detect the condition when the condition is actually present.
Sensitivity = TP/(TP+FN)
Specificity: This metric measures the ability of a model to correctly not detect the condition when the condition is absent.
Specificity = TN/(TN+FP)
Precision: This metric measures the proportion of predicted positives that are actually positive, that is, how many of the cases flagged by the model really have the condition.
Precision = TP/(TP+FP)
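The three formulas above can be computed directly from true and predicted labels, as in the rough sketch below. The label lists are made up for illustration (1 means the condition is present, 0 means it is absent), and the helper names `confusion_counts` and `classification_metrics` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = present, 0 = absent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and precision from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),  # TP / (TP + FN)
        "specificity": tn / (tn + fp),  # TN / (TN + FP)
        "precision": tp / (tp + fp),    # TP / (TP + FP)
    }

# Hypothetical ground truth and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, tn, fp, fn)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
for name, value in metrics.items():
    print(f"{name} = {value:.3f}")
```

This sketch does not guard against division by zero, which would occur if, for example, the model never predicts the positive class.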
Thanks for reading! Stay tuned for Part 3 in the series.