Regression analysis

Simple linear regression models

In the introduction to regression we discussed the (univariate) linear model:

$y=\beta_0+\beta_1x+\varepsilon$

where $\beta_0$ and $\beta_1$ are unknown population parameters that have to be estimated from sample statistics. If $b_0$ and $b_1$ are the estimates of $\beta_0$ and $\beta_1$, respectively, then the regression line (the ‘best’ approximation) is:

$\hat{y}=b_0+b_1 x$

For the definition of ‘best’, the computation of $b_0$ and $b_1$, and other details, we refer to the introduction to regression.

Example
This example will use Excel.
A used-car dealer randomly selected 100 three-year-old Toyota Camrys that were sold at auction during the past month in order to determine the regression line of price against odometer reading. The dealer recorded the price (in 1,000s) and the number of miles (in thousands) on the odometer. The scatter plot (see the graph below) suggests a linear relation.
Excel provides the following output.

Important output data are the coefficient of determination $R^2$ in the Regression Statistics table, the $F$ value and its $p$ value in the ANOVA table, and the values of the coefficients and their $p$ values in the coefficients table. We will discuss them later.
Using the coefficients from the output, the estimated regression line is:

$\hat{y}=17.25-0.067x$
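To illustrate how the fitted line is used, here is a minimal Python sketch that evaluates it for one odometer reading. The function name `predicted_price` and the reading of 40 (thousand miles) are our own illustrative choices; the coefficients are the rounded values from the formula above.

```python
def predicted_price(odometer_thousands: float) -> float:
    """Predicted auction price (in 1,000s) from the fitted regression line."""
    b0 = 17.25    # intercept, rounded as in the text
    b1 = -0.067   # slope, rounded as in the text
    return b0 + b1 * odometer_thousands


# Hypothetical car with 40 (thousand) miles on the odometer
print(round(predicted_price(40), 2))
```

The slope says that each additional thousand miles lowers the predicted price by about 0.067 (thousand).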

Requirements

For the regression methods to be valid the following four conditions for the error variable $\varepsilon$ must be met:

The probability distribution of $\varepsilon$ is normal;
The mean of the distribution is $0$; that is, $E(\varepsilon)=0$;
The standard deviation $\sigma_\varepsilon$ is a constant regardless of the value of $y$;
The value of $\varepsilon$ associated with any particular value of $y$ is independent of $\varepsilon$ associated with any other value of $y$ (important only for time series).

In the graph below the frequency diagram of the residuals is depicted. The residuals seem to be normally distributed with zero mean, which is consistent with the first two requirements. A more quantitative check of normality is the Anderson-Darling test.

The diagram below shows that the variance of the residuals seems independent of the $y$ values.

Testing the slope

If no relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope equal to zero. We want to find out whether there is a linear relationship, i.e. whether the slope $\beta_1\neq 0$. Our research hypothesis is therefore:

$H_1:\beta_1\neq0$

and the null hypothesis becomes:

$H_0:\beta_1=0$

We can use the following test statistic to test our hypotheses:

$\displaystyle{t=\frac{b_1-\beta_1}{s_{b_1}}}$

Usually $\beta_1$ is taken to be 0, because usually $H_0:\beta_1=0$. The standard error of the slope, $s_{b_1}$, is:

$\displaystyle{s_{b_1}=\frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}}}$

If the error variable $\varepsilon$ is normally distributed, the test statistic has a Student’s $t$ distribution with $n-2$ degrees of freedom. The rejection region depends on whether we are doing a one-tail or a two-tail test (the two-tail test is most typical).

In the Odometer example we get:

$\displaystyle{b_1=-0.066861;\ s_{b_1}=0.004975}$

and thus the test statistic is:

$\displaystyle{t=\frac{b_1-0}{s_{b_1}}=\frac{-0.066861}{0.004975}=-13.44}$

Note. $s_{b_1}$ is the standard error of the slope and can be found in the coefficients table.

The rejection region (two-sided) is:

$\displaystyle{t<-t_{98;0.025}}$ or $\displaystyle{t>t_{98;0.025}}$

With $t_{98;0.025}\approx1.984$ this becomes:

$\displaystyle{t<-1.984}$ or $\displaystyle{t>1.984}$

The test statistic is in the rejection region and thus we reject $H_0:\beta_1=0$ .
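The arithmetic of this test can be checked with a few lines of Python. This is only a sketch: the slope and its standard error are the values quoted above, while the critical value `t_crit = 1.984` is an assumption, namely the approximate two-sided 5% value of the $t$ distribution with 98 degrees of freedom as read from a table.

```python
b1 = -0.066861     # estimated slope (from the coefficients table)
s_b1 = 0.004975    # standard error of the slope (from the coefficients table)
beta1_h0 = 0.0     # value of the slope under H0
t_crit = 1.984     # assumed: approximate t-table value for 98 df, two-sided 5%

# Test statistic t = (b1 - beta1) / s_b1
t_stat = (b1 - beta1_h0) / s_b1

# Two-sided rejection rule: reject H0 when |t| exceeds the critical value
reject_h0 = abs(t_stat) > t_crit

print(round(t_stat, 2), reject_h0)
```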

Testing the coefficient of correlation

Also in the Odometer example we investigate the following hypotheses:

$\displaystyle{H_0:\rho=0;H_1:\rho\neq0}$

First we compute the sample coefficient of correlation:

$\displaystyle{r=\frac{s_{xy}}{s_xs_y}=\frac{-2.909}{6.596\times0.5477}=-0.8052}$

The test statistic is:

$\displaystyle{t=r\sqrt{\frac{n-2}{1-r^2}}=-13.44}$

which certainly falls into the rejection region:

$\displaystyle{t<-t_{98;0.025}}$ or $\displaystyle{t>t_{98;0.025}}$

and thus there is overwhelming evidence to reject $H_0$ .
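The two computations above can be reproduced as follows. This is a minimal sketch that uses only the summary statistics quoted in the text; the variable names are our own.

```python
import math

n = 100          # sample size
s_xy = -2.909    # sample covariance of odometer reading and price
s_x = 6.596      # sample standard deviation of the odometer readings
s_y = 0.5477     # sample standard deviation of the prices

# Sample coefficient of correlation
r = s_xy / (s_x * s_y)

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

print(round(r, 4), round(t_stat, 2))
```

Up to rounding, the statistic coincides with the one obtained when testing the slope, as expected in simple linear regression, where the two tests are equivalent.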

Coefficient of determination

Thus far we tested whether a linear relationship exists; it is also useful to measure to what extent the model fits the data. This is done by calculating the coefficient of determination $R^2$ .

$\displaystyle{R^2=r^2=\frac{s_{xy}^2}{s_x^2s_y^2}}$

The coefficient of determination can also be written in terms of the sum of squared errors $SSE$:

$\displaystyle{R^2=1-\frac{SSE}{\sum(y_i-\overline{y})^2}}$

Note. If $SSE=0$ (i.e. all points lie on the regression line) then $R^2=1$ and the model fits the data perfectly.

In the Odometer example we found $R^2=0.6483$. This means that $64.83\%$ of the variation in the auction selling prices $y$ is explained by the variation in the odometer readings $x$. The remaining $35.17\%$ is unexplained, i.e. due to error.

Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions. In general, the higher the value of $R^2$ , the better the model fits the data: $R^2=1$ (perfect fit); $R^2=0$ (no fit).
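To see that the two expressions for $R^2$ agree, the sketch below fits a least-squares line to a tiny made-up data set (not the odometer sample) and evaluates both. The data and variable names are purely illustrative. Sums of squares are used instead of sample variances, because the factors $n-1$ cancel in the ratio.

```python
from statistics import mean

# Made-up illustrative data, NOT the odometer sample
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)                       # (n-1) * s_x^2
s_yy = sum((yi - y_bar) ** 2 for yi in y)                       # (n-1) * s_y^2
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) # (n-1) * s_xy

# Least-squares estimates of slope and intercept
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Sum of squared errors around the fitted line
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

r2_from_sse = 1 - sse / s_yy              # R^2 = 1 - SSE / sum (y_i - y_bar)^2
r2_from_r = s_xy ** 2 / (s_xx * s_yy)     # R^2 = r^2

print(round(r2_from_sse, 4), round(r2_from_r, 4))
```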

If we have the regression equation:

$\hat{y}=b_0+b_1x$

then for any given value $x_g$ we can use it to compute the point estimate $\hat{y}_g=b_0+b_1x_g$ and an interval around it.

There are two types of intervals for $x=x_g$ :

Prediction interval:

$\displaystyle{\hat{y}\pm{t_{n-2;\alpha/2}s_\varepsilon\sqrt{1+\frac{1}{n}+\frac{(x_g-\overline{x})^2}{(n-1)s_x^2}}}}$

Confidence interval:

$\displaystyle{\hat{y}\pm{t_{n-2;\alpha/2}s_\varepsilon\sqrt{\frac{1}{n}+\frac{(x_g-\overline{x})^2}{(n-1)s_x^2}}}}$

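What is the difference between these two intervals? The prediction interval concerns the price of an individual car with odometer reading $x_g$, while the confidence interval concerns the expected value, i.e. the mean price of all cars with that odometer reading. Because an individual observation carries its own error on top of the uncertainty in the estimated line, the prediction interval is always the wider of the two (note the extra $1$ under the square root).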
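The following sketch computes the half-widths of both intervals so that they can be compared. The function name `interval_half_widths` is our own. The values `x_bar = 36.0`, `s_eps = 0.3265`, `x_g = 40.0` and `t_crit = 1.984` are assumptions chosen for illustration and are not taken from the Excel output; only `n` and `s_x` are quoted in the text.

```python
import math


def interval_half_widths(x_g, n, x_bar, s_x, s_eps, t_crit):
    """Return the (prediction, confidence) interval half-widths at x = x_g."""
    # Term shared by both intervals: 1/n + (x_g - x_bar)^2 / ((n-1) s_x^2)
    shared = 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s_x ** 2)
    prediction = t_crit * s_eps * math.sqrt(1 + shared)
    confidence = t_crit * s_eps * math.sqrt(shared)
    return prediction, confidence


pred_hw, conf_hw = interval_half_widths(
    x_g=40.0,      # assumed odometer reading (in thousands of miles)
    n=100,         # sample size (from the text)
    x_bar=36.0,    # assumed sample mean of the odometer readings
    s_x=6.596,     # sample standard deviation of x (from the text)
    s_eps=0.3265,  # assumed standard error of estimate
    t_crit=1.984,  # assumed t-table value for 98 df, two-sided 5%
)

print(pred_hw > conf_hw)
```

Each interval is then $\hat{y}$ plus or minus the corresponding half-width. Both half-widths grow as $x_g$ moves away from $\overline{x}$.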

Homo- and heteroscedasticity

When the requirement of a constant variance is violated, we have a condition of heteroscedasticity. If it is not violated, we have a condition of homoscedasticity. We can diagnose heteroscedasticity by plotting the residuals against the predicted values of $y$.

In the Odometer example the graph shows the plot of the residuals against the predicted value of $y$ .

There doesn’t appear to be a change in the spread of the plotted points, so the errors appear to be homoscedastic. (Question: Which flights have greater variability of air time: Amsterdam-Singapore or Amsterdam-Madrid?)
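When no plot is at hand, a rough numerical stand-in is to compare the spread of the residuals for low and high predicted values. The sketch below does this on made-up data in which the noise deliberately grows with $x$. The function name `residual_spread_by_half` and the data are hypothetical, and the comparison is a crude heuristic rather than a formal test.

```python
from statistics import mean, pstdev


def residual_spread_by_half(x, y):
    """Fit a least-squares line and return the standard deviation of the
    residuals for the lower and upper half of the predicted values."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar

    # Pair each predicted value with its residual, sorted by predicted value
    pairs = sorted((b0 + b1 * xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))
    half = len(pairs) // 2
    low = [res for _, res in pairs[:half]]
    high = [res for _, res in pairs[half:]]
    return pstdev(low), pstdev(high)


# Made-up data: the scatter around the line widens as x increases
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.8, 5.9, 4.6, 8.4, 6.1]

low_spread, high_spread = residual_spread_by_half(x, y)
print(low_spread < high_spread)
```

A clearly larger spread in one half, as in this example, points towards heteroscedasticity; similar spreads are what we expect under homoscedasticity.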