Introduction to linear regression

Linear regression is used to examine the linear relationship between two variables x and y, where y is the dependent variable and x is the independent variable. Such a linear relationship is expressed by the following formula:

\displaystyle{y=\beta_0+\beta_1x+\varepsilon}

\displaystyle{\varepsilon} is an error variable.

To define the relationship between x and y we need to know the values of the coefficients \beta_0 and \beta_1, which are population parameters.

Examples of such linear relations are:

  1. Relation between hours spent at part-time jobs (x) and grade point averages (y);
  2. Relation between the cost of repairs (x) and the age of machines (y);
  3. Relation between the number of hours of television watched (x) and total debt (y);
  4. Relation between the time between movies (x) and the sales of popcorn and soft drinks (y).

We distinguish two types of models.

Deterministic and probabilistic models

An equation, or set of equations, that allows us to fully determine the value of the dependent variable from the values of the independent variables is called a deterministic model.

A probabilistic model is a method used to capture the randomness that is part of a real-life process; it adds a random error term:

\displaystyle{\varepsilon\sim N(0,\sigma_\varepsilon)}

where the standard deviation \sigma_\varepsilon is constant.

As an example we will look at the relationship between scores on the popular GMAT test and MBA scores. We consider a univariate linear regression model (i.e. there is only one independent variable). Program managers of MBA programs want to improve the MBA scores of their programs (MBA scores are on a scale from 0 to 5) and thus want to select the best students.

They consider the introduction of a GMAT test (Graduate Management Admission Test; GMAT scores range from 200 to 800) as one of the admission criteria and prefer students with the highest GMAT scores.

If there were an exact linear relation between MBA scores and GMAT scores, the graph of such a relation would be a straight line with a positive slope. In this deterministic case the linear relation would be:

\displaystyle{y=-1.66+0.83x}

y: the MBA score, y\in[0, 5]
x: the GMAT score (in 100s), so x\in[2, 8]

The deterministic model suggests a positive linear relation between GMAT (x) and MBA (y) (positive slope +0.83). The MBA score would be exactly 3.32 if the GMAT score were 600. On the other hand, if the manager aims to admit only students with an MBA score of 4 or more, he should require a GMAT score of at least 682.
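
As a quick check of this arithmetic, here is a minimal Python sketch of the deterministic model; the function names mba_score and required_gmat are ours, chosen for illustration:

# Deterministic model: y = -1.66 + 0.83 * x, with x the GMAT score in 100s.

def mba_score(gmat):
    """MBA score predicted by the deterministic model for a GMAT score (200-800)."""
    return -1.66 + 0.83 * (gmat / 100)

def required_gmat(target_mba):
    """GMAT score needed to reach a target MBA score, by inverting the line."""
    return 100 * (target_mba + 1.66) / 0.83

print(mba_score(600))        # 3.32 (up to float rounding)
print(required_gmat(4.0))    # about 681.9, so a GMAT score of 682 or more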

Because there are no unexpected disturbances, we have a deterministic model: if you know the GMAT score, then you can compute the corresponding MBA score exactly.

This is not what happens in practice.
There are unknown and unforeseen disturbances (such as ‘illness’ (-), ‘luck’ (+), ‘excellent student’ (+), ‘off-day’ (-), ‘falling in love’ (-), et cetera) which make the MBA score vary randomly.

Thus, it would be better to use a probabilistic model:

\displaystyle{y=\beta_0+\beta_1x+\varepsilon}

where \varepsilon is a normally distributed random variable with \mu_\varepsilon=0 and a given constant standard deviation \sigma_\varepsilon.
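
To see what such a model produces in practice, the following Python sketch simulates observations from the probabilistic model; it reuses the line from the deterministic example, while the value of sigma_eps is an assumption chosen purely for illustration:

import random

beta0, beta1 = -1.66, 0.83   # same line as in the deterministic example
sigma_eps = 0.25             # assumed constant standard deviation (illustrative)

random.seed(1)
for gmat in (450, 550, 600, 650, 700):
    x = gmat / 100
    eps = random.gauss(0, sigma_eps)   # epsilon ~ N(0, sigma_eps)
    y = beta0 + beta1 * x + eps        # probabilistic model
    print(f"GMAT {gmat}: simulated MBA score {y:.2f}")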

A graph of data generated by such a model is a so-called scatter plot: not all points lie on the straight line.

In general the equation of the model behind the scatter plot is:

\displaystyle{y=\beta_0+\beta_1x+\varepsilon}

where \beta_0 and \beta_1 are unknown population parameters which have to be estimated from sample statistics. If b_0 and b_1 are the estimates of \beta_0 and \beta_1, respectively, then the regression line (the 'best' linear approximation) is:

\hat{y}=b_0+b_1x

What ‘best’ means, how b_0 and b_1 are determined, and other details will be explained later.

The following example clarifies some details. The annual bonuses (in $1,000s) of 6 employees with different years of experience were recorded below. We wish to determine the regression line.

Years of experience    Annual bonus ($1,000s)
1                      6
2                      1
3                      9
4                      5
5                      17
6                      12

We define the regression line \hat{y}=b_0+b_1x as follows. Find b_0 and b_1 such that the sum of the squared residuals is minimized:

\displaystyle{\min\sum_i(y_i-\hat{y}_i)^2}

or

\displaystyle{\min\sum_i(y_i-b_0-b_1x_i)^2}

This optimization problem is solved by partial differentiation. We find:

\displaystyle{b_1=\frac{s_{xy}}{s_x^2}}

\displaystyle{b_0=\overline{y}-b_1\overline{x}}

Here s_{xy} is the sample covariance of x and y, s_x^2 is the sample variance of x, and \overline{x} and \overline{y} are the sample means.
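
As a sketch of how these formulas apply to the bonus data above (pure Python; the variable names are ours):

from statistics import mean

x = [1, 2, 3, 4, 5, 6]      # years of experience
y = [6, 1, 9, 5, 17, 12]    # annual bonus ($1,000s)

n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Sample covariance s_xy and sample variance s_x^2 (both with divisor n - 1).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

b1 = s_xy / s_x2            # slope: 2.114...
b0 = y_bar - b1 * x_bar     # intercept: 0.933...

print(f"y-hat = {b0:.3f} + {b1:.3f} x")   # y-hat = 0.933 + 2.114 x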

How does linear regression work?

We illustrate this with a simple example.
Suppose we have three observations (1,2), (2,1), (3,4) and assume a straight line through the origin (this is usually not the case; it merely simplifies the example). Then the regression line should be:

\displaystyle{\hat{y}=ax}

Which a gives the best line?

So, determine the a that minimizes the following function:

\displaystyle{\sum(y_i-\hat{y}_i)^2=(2-a\times1)^2+(1-a\times2)^2+(4-a\times3)^2=14a^2-32a+21}

This function (a parabola) has its minimum at the vertex, \displaystyle{a=\frac{32}{2\times14}=\frac{32}{28}=1.143}.
So the regression line has slope 1.143.

[Figure: three candidate lines through the observations; the one in the middle, with slope 1.143, is the (optimal) regression line.]
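
The following Python sketch checks this numerically: it evaluates the sum of squared residuals at the optimal slope and at two nearby slopes (the names are ours):

data = [(1, 2), (2, 1), (3, 4)]

def sse(a):
    """Sum of squared residuals for the line y-hat = a * x."""
    return sum((y - a * x) ** 2 for x, y in data)

a_opt = 32 / 28              # vertex of the parabola 14a^2 - 32a + 21
print(a_opt, sse(a_opt))     # 1.1428..., minimal SSE (about 2.714)
print(sse(1.0), sse(1.3))    # nearby slopes give larger sums: 3.0 and 3.06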

Now the general formula for this case:

Minimize with respect to a the function:

\displaystyle{\sum(y_i-\hat{y}_i)^2=\sum(y_i-ax_i)^2}

Set the derivative with respect to a equal to zero (chain rule); after dividing by -2 this gives:

\displaystyle{\sum{(y_i-ax_i)x_i}=0}

and find:

\displaystyle{a=\frac{\sum{x_iy_i}}{\sum{{x_i}^2}}}
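
For the three observations above, this closed form gives the same slope; a one-line check in Python:

data = [(1, 2), (2, 1), (3, 4)]

a = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
print(a)   # 16/14 = 1.1428..., the same slope as 32/28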

Note. This is just a simple example showing how the method works. The method for the linear model \displaystyle{y=b_0+b_1x} is similar, now with two unknown coefficients b_0 and b_1, which requires partial differentiation.
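
As a sketch of that two-parameter case: setting both partial derivatives to zero yields two linear equations (the so-called normal equations) in b_0 and b_1. The snippet below solves them for the bonus data and should reproduce the coefficients found earlier via s_{xy}/s_x^2:

x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

# Setting the two partial derivatives to zero gives the normal equations:
#   n * b0      + sum(x)   * b1 = sum(y)
#   sum(x) * b0 + sum(x^2) * b1 = sum(x * y)
sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

det = n * sxx - sx * sx               # Cramer's rule for the 2x2 system
b0 = (sy * sxx - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

print(b0, b1)   # 0.933..., 2.114..., the same values as before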