This is a series of articles on the basic concepts of statistics. Use it to brush up your skills or to understand a few standard ideas that data scientists and analysts rely on when making decisions.


You can find Part-1 of this series at https://medium.com/@aditi-mittal/statistics-cheat-sheet-for-data-scientists-part-1-e1be6a649f45 and Part-2 at https://medium.com/@aditi-mittal/statistics-cheat-sheet-for-data-scientists-part-2-14dc4e0d6f82.

In this part, I'll discuss linear regression in detail.

Linear Regression

Linear regression is a statistical method for estimating how a unit change in one variable affects the value of another variable, assuming the relationship between them is linear.

Independent (or regressor) variables are the variables whose values are changed to see the impact on another variable, called the dependent (or response) variable. Independent variables are plotted on the x-axis, while the dependent variable is plotted on the y-axis.

Depending on the number of independent variables, we can divide linear regression into two subcategories:

  1. Simple Linear Regression — When there is only one independent variable present then it is called simple linear regression.
  2. Multiple Linear Regression — When there are multiple independent variables present then it is known as multiple linear regression.

The main idea behind linear regression is to find the straight line that best fits the data. This line is known as the regression line. It is estimated from a set of data points of the form (X, Y).

Simple Linear Regression

The equation for simple linear regression can be written as:

$$Y = \beta_0 + \beta_1 X + \mu$$

where Y is the dependent variable, X is the independent variable, β0 is the intercept (the value of Y where the line crosses the y-axis, that is, at X = 0) and β1 is the slope coefficient of the line. μ (mu) is the error term: the error the model makes when predicting or estimating the value of Y from X using the line above.

Multiple Linear Regression

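Similarly, the equation for multiple linear regression with k independent variables X1, ..., Xk can be expressed as:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \mu$$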
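To make the multiple regression equation concrete, here is a minimal Python sketch that evaluates its deterministic part (everything except the error term) for one observation. The function name `predict_multiple` and the coefficient values are made up for illustration; they are not estimated from any data.

```python
def predict_multiple(beta0, betas, xs):
    """Return beta0 + beta1*x1 + ... + betak*xk (the model without the error term)."""
    if len(betas) != len(xs):
        raise ValueError("need exactly one coefficient per independent variable")
    return beta0 + sum(b * x for b, x in zip(betas, xs))


# Hypothetical intercept, slopes, and inputs, chosen only for illustration.
y_hat = predict_multiple(1.0, [2.0, -0.5], [3.0, 4.0])
print(y_hat)
```

With a single independent variable, the same function reduces to the simple linear regression line.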

Ordinary Least Squares (OLS)

OLS is a method used to estimate the unknown parameters β0 and β1 in the simple linear regression equation. Linear regression uses the concept of least squares to determine these parameters. Least squares minimizes the sum of the squared differences between the observed and predicted values. The residual is defined as the difference between the real (observed) value and the predicted value. So the coefficients are the values that minimize the sum of squared residuals.
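As a small illustration of residuals and the sum of squared residuals, the sketch below compares two candidate lines on a toy dataset. The data points and both candidate lines are invented for this example; the point is only that least squares prefers the line with the smaller sum.

```python
def residuals(x, y, beta0, beta1):
    """Observed value minus predicted value for every data point."""
    return [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]


def sum_squared_residuals(x, y, beta0, beta1):
    """The quantity that least squares tries to make as small as possible."""
    return sum(r ** 2 for r in residuals(x, y, beta0, beta1))


# Toy data (made up): roughly y = 1 + 2x with a little noise.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]

ssr_a = sum_squared_residuals(x, y, beta0=1.0, beta1=2.0)  # candidate line A
ssr_b = sum_squared_residuals(x, y, beta0=0.0, beta1=2.0)  # candidate line B

# Line A follows the data more closely, so its sum of squares is smaller.
print(ssr_a < ssr_b)
```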

Before deriving the values of these unknown parameters, let's first go over the assumptions made by OLS. Without these assumptions, we can't ensure that OLS regression produces the best estimates:

  1. The linear regression model is linear in parameters.
  2. For creating a sample, observations are chosen randomly.
  3. The error term has population mean of 0.
  4. All independent variables are uncorrelated with the error term.
  5. The error term has a constant variance.
  6. No independent variable is an exact linear function of other independent variables.
  7. Optional — The error term is normally distributed.

Coming back to deriving these unknown parameters for simple linear regression, we want to minimize the sum of squared residuals so that the predicted values are as close as possible to the observed values. Therefore, for n observed data points (x_i, y_i), we can define a function such as:

$$J(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

Here J is the least-squares function that we need to minimize to find β0 and β1. They can be calculated by taking the partial derivatives of J with respect to each parameter and setting them equal to 0. The mathematical derivation is as follows:

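First, differentiating J with respect to β0 and β1 and setting each derivative to 0:

$$\frac{\partial J}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$$

$$\frac{\partial J}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0$$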

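Using the first equation (the derivative with respect to β0), we can express β0 in terms of β1. Cancelling the factor of −2 and simplifying the terms, we can write it as:

$$\sum_{i=1}^{n} y_i - n\beta_0 - \beta_1 \sum_{i=1}^{n} x_i = 0 \quad\Rightarrow\quad n\beta_0 = \sum_{i=1}^{n} y_i - \beta_1 \sum_{i=1}^{n} x_i$$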

Dividing the whole equation by n and using the formula for the mean, where x̄ and ȳ are the means of the x and y values:

$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$

Now using the equation from the partial derivative with respect to β1 and cancelling the factor of −2:

$$\sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0$$

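Substituting β0 = ȳ − β1·x̄ into the above equation and simplifying it further using basic algebra and the definition of the mean:

$$\sum_{i=1}^{n} x_i (y_i - \bar{y}) - \beta_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0$$

$$\beta_1 = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

The last equality holds because the deviations from the mean sum to zero, so subtracting x̄ inside each sum does not change its value.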
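The result of this derivation is the standard closed-form OLS solution for simple linear regression: β1 is the sum of (x_i − x̄)(y_i − ȳ) divided by the sum of (x_i − x̄)², and β0 = ȳ − β1·x̄. Here is a minimal sketch of it in plain Python. The function name `fit_simple_ols` and the toy data are made up for illustration.

```python
def fit_simple_ols(x, y):
    """Return (beta0, beta1) from the closed-form least-squares formulas."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, with at least 2 points")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    beta1 = s_xy / s_xx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


# Toy data (made up): roughly y = 1 + 2x with a little noise.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]
beta0, beta1 = fit_simple_ols(x, y)
print(beta0, beta1)
```

Centering both variables around their means keeps the formula readable and makes the zero-variation case (all x values equal) easy to detect.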

Once we have found these unknown parameters, we can compute the predicted output by substituting them, along with the value of X, into the regression equation.
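As a quick sketch of that last step, the snippet below plugs a pair of coefficients into the fitted line to predict Y for new X values. The coefficient values here are hypothetical placeholders, not results from any real dataset.

```python
def predict(beta0, beta1, x_new):
    """Predicted Y on the fitted line; the error term is not part of a prediction."""
    return beta0 + beta1 * x_new


# Hypothetical fitted coefficients, used only for illustration.
beta0_hat, beta1_hat = 1.15, 1.94

new_xs = [5.0, 6.0]
predictions = [predict(beta0_hat, beta1_hat, x_new) for x_new in new_xs]
print(predictions)
```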

Thank you for reading my article. Don’t forget to give a clap and subscribe if you liked it.

Part-4 will be coming soon. Stay tuned!!!
