# Statistics Cheat Sheet for Data Scientists (Part-2)


You can find Part 1 of this series at https://aditi-mittal.medium.com/statistics-cheat-sheet-for-data-scientists-part-1-e1be6a649f45 where I discussed random variables, mean, median, mode, variance, and standard deviation, along with measures of relationship such as correlation and covariance.

In this part, I will discuss the types of distributions, the central limit theorem, and a few accuracy metrics.

# Types of Distributions

**Probability density functions** describe how likely a continuous random variable is to take values near a given point; the probability of falling within a range is the area under the curve over that range. They can be used to summarize the likelihood of observations across the distribution’s sample space. Plotting one gives the distribution its familiar shape, such as the bell curve for the Gaussian distribution.

**Cumulative distribution functions** (often loosely called cumulative density functions) give the probability that a random variable is less than or equal to a certain value.

**Gaussian distribution**, also known as the normal distribution, is a distribution defined by its mean and variance. Its graph is a bell-shaped curve centered on the mean. The special case with mean 0 and standard deviation 1 is called the standard normal distribution.
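As a minimal sketch of the density and cumulative functions described above, the following uses Python's standard-library `statistics.NormalDist` on a normal distribution with mean 0 and standard deviation 1. The variable names are illustrative only.

```python
from statistics import NormalDist

# Normal distribution with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# PDF: height of the density curve at x (a density, not itself a probability).
density_at_zero = standard_normal.pdf(0.0)

# CDF: probability that the variable is less than or equal to x.
prob_below_zero = standard_normal.cdf(0.0)

# Area under the curve between -1 and 1, i.e. within one standard deviation.
prob_within_one_sd = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

print(density_at_zero, prob_below_zero, prob_within_one_sd)
```

The last value comes out close to 0.68, which is the usual rule of thumb that about 68% of a normal distribution lies within one standard deviation of the mean.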

**Uniform distribution** is a distribution in which all outcomes are equally likely, that is, the probability of each outcome is the same.
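A small sketch of a discrete uniform distribution, using a fair six-sided die as an assumed example (the die is not from the original text):

```python
from fractions import Fraction

# Outcomes of a fair six-sided die (an illustrative choice).
outcomes = [1, 2, 3, 4, 5, 6]

# Discrete uniform distribution: every outcome gets the same probability.
probability = {x: Fraction(1, len(outcomes)) for x in outcomes}

# The probabilities sum to 1, and the mean is the plain average of the outcomes.
total = sum(probability.values())
mean = sum(x * p for x, p in probability.items())

print(total, mean)  # prints "1 7/2"
```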

**T-distribution** or **Student’s t-distribution** is a distribution used when the parameters of the population are estimated from small samples. As the sample size grows, the t-distribution starts to look like the normal distribution. It is defined by its number of degrees of freedom. The t-value can be calculated as:

t = (x - mean) / (s / sqrt(n))

where x is the sample mean, mean is the population mean, s is the sample’s standard deviation, and n is the sample size.
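The t-value formula can be sketched in Python as follows. The function name `t_statistic` and the sample values are made up for illustration, and `statistics.stdev` is assumed to be the intended sample standard deviation (it divides by n - 1).

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, population_mean):
    """t = (sample mean - population mean) / (s / sqrt(n))."""
    n = len(sample)
    sample_mean = mean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    return (sample_mean - population_mean) / (s / sqrt(n))

# Hypothetical measurements, compared against a hypothesized population mean of 5.0.
sample = [5.1, 4.9, 5.3, 5.0, 5.2]
t_value = t_statistic(sample, population_mean=5.0)

print(t_value)
```

For a one-sample calculation like this one, the matching t-distribution has n - 1 degrees of freedom, which is 4 here.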

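**Poisson Distribution** is a probability distribution that gives the probability of a given number of events occurring within a fixed time interval, assuming the events occur independently at a constant average rate. The probability of observing x events is:

P(X = x) = (lambda^x * e^(-lambda)) / x!

where lambda is the mean number of occurrences of the given event and e is Euler’s number (approximately 2.718). Here the variance and the mean are both lambda.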
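A minimal sketch of the Poisson probability, assuming the standard mass function P(X = x) = lambda^x * e^(-lambda) / x!. The rate of 3 events per interval and the name `poisson_pmf` are illustrative. The sums are truncated at 100 terms, which is more than enough for a small lambda.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x events when the mean count is lam."""
    return (lam ** x) * exp(-lam) / factorial(x)

lam = 3.0  # assumed average number of events per interval
p_two = poisson_pmf(2, lam)  # probability of exactly 2 events

# Numerically check that the mean and the variance are both lambda.
mean_estimate = sum(x * poisson_pmf(x, lam) for x in range(100))
variance_estimate = sum((x - lam) ** 2 * poisson_pmf(x, lam) for x in range(100))

print(p_two, mean_estimate, variance_estimate)
```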

**Binomial Distribution** is the probability distribution of the number of successes in n independent trials, where each trial has only two possible outcomes: success, with probability p, or failure, with probability (1-p). In this distribution, the mean is np and the variance is np(1-p). The probability of exactly x successes is:

P(X = x) = C(n, x) * p^x * (1-p)^(n-x)

where n is the total number of trials, x is the number of successes, p is the probability of success in a single trial, so (1-p) is the probability of failure, and C(n, x) = n! / (x! (n-x)!) is the number of ways to choose x successes out of n trials.
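A short sketch of the binomial probability, assuming the standard mass function P(X = x) = C(n, x) * p^x * (1-p)^(n-x) and using `math.comb` for the binomial coefficient. The example of 10 fair coin flips and the name `binomial_pmf` are illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.5  # assumed example: 10 fair coin flips
p_three = binomial_pmf(3, n, p)  # probability of exactly 3 heads

# Check the mean np and the variance np(1-p) by summing over all outcomes.
mean_value = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
variance_value = sum((x - n * p) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))

print(p_three, mean_value, variance_value)
```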

**Chi-square distribution** is the distribution of the chi-square statistic. It is widely used in hypothesis tests. The distribution is defined by its degrees of freedom. The higher the degrees of freedom, the more it looks like a normal distribution. There are two types of chi-square tests:

**Goodness of fit test** determines whether the observed sample data match an expected (population) distribution.

**Test for independence** is used to see whether two categorical variables are related.

We will discuss hypothesis tests in later articles.

The chi-square statistic can be calculated as:

chi^2_c = sum over i of (O_i - E_i)^2 / E_i

where O_i is the i-th observed value, E_i is the i-th expected value, and c is the degrees of freedom.
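As a sketch of the chi-square statistic, assuming the usual form of summing (O_i - E_i)^2 / E_i over all categories. The die-roll counts below are made-up data for a goodness of fit check, and `chi_square_statistic` is a hypothetical helper name.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts from 120 rolls of a die, versus 20 expected per face.
observed = [18, 22, 20, 25, 15, 20]
expected = [20] * 6

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1  # categories minus one for this test

print(statistic, degrees_of_freedom)
```

The statistic would then be compared against a chi-square distribution with the given degrees of freedom to decide whether the fit is acceptable.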

# Central Limit Theorem

The Central Limit Theorem, or CLT, states that the distribution of sample means approximates a normal distribution as the sample size becomes larger, assuming that all samples are identical in size.

It means that if you have a population with mean μ and standard deviation σ and take multiple sufficiently large random samples from the population with replacement, then the sample means will be approximately normally distributed, with mean μ and standard deviation σ/sqrt(n), where n is the sample size.
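The theorem can be illustrated with a small simulation. This is only a sketch under assumed settings: the population is an exponential distribution with mean 1 and standard deviation 1 (chosen because it is clearly not bell-shaped), and the sample size and number of samples are arbitrary. Independent draws from the distribution play the role of sampling with replacement.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50    # n: observations per sample (illustrative)
NUM_SAMPLES = 2000  # number of samples drawn (illustrative)

def draw_sample_mean(n):
    """Mean of n independent draws from an exponential population (mean 1, sd 1)."""
    return mean([random.expovariate(1.0) for _ in range(n)])

sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = mean(sample_means)       # should be close to the population mean, 1
sd_of_means = stdev(sample_means)        # should be close to sigma / sqrt(n)
predicted_sd = 1.0 / sqrt(SAMPLE_SIZE)

# For a normal distribution, about 95% of values lie within 1.96 standard deviations.
within = sum(abs(m - mean_of_means) <= 1.96 * sd_of_means for m in sample_means)
fraction_within = within / NUM_SAMPLES

print(mean_of_means, sd_of_means, predicted_sd, fraction_within)
```

Even though individual draws are strongly skewed, the sample means cluster symmetrically around the population mean, which is the behaviour the theorem describes.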

# Accuracy Metrics

**True positive:** It means that the model detects the condition when the condition is present.

**True negative:** It means that the model does not detect the condition when the condition is not present.

**False positive:** It means that the model detects the condition when the condition is absent.

**False negative:** It means that the model does not detect the condition when the condition is present.
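The four outcomes above can be counted from paired labels. The following is a minimal sketch in which 1 means the condition is present (or detected) and 0 means it is absent (or not detected); the labels and the name `confusion_counts` are invented for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for paired binary labels (1 = condition present)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1  # condition present, model detects it
        elif not a and not p:
            tn += 1  # condition absent, model does not detect it
        elif not a and p:
            fp += 1  # condition absent, model detects it anyway
        else:
            fn += 1  # condition present, model misses it
    return tp, tn, fp, fn

# Hypothetical ground truth and model output.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # prints (3, 3, 1, 1)
```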

**Sensitivity or recall:** This metric measures the ability of a model to detect the condition when the condition is actually present.

Sensitivity = TP/(TP+FN)

**Specificity:** This metric measures the ability of a model to not identify the condition when the condition is absent.

Specificity = TN/(TN+FP)

**Precision:** This metric measures the proportion of the model’s positive predictions that are actually positive.

Precision = TP/(TP+FP)
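The three formulas can be put together in a short sketch. The counts passed in are hypothetical, and `classification_metrics` is an illustrative helper rather than a library function; it does not guard against zero denominators.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute sensitivity, specificity and precision from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of actual positives that were detected
        "specificity": tn / (tn + fp),  # share of actual negatives correctly left alone
        "precision": tp / (tp + fp),    # share of positive predictions that were right
    }

# Hypothetical counts: 40 true positives, 45 true negatives,
# 5 false positives, 10 false negatives.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the model finds 80% of the actual positives, correctly clears 90% of the actual negatives, and is right in roughly 89% of its positive predictions.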

*Thanks for reading! Stay tuned for Part 3 in the series.*