1. GPA score
The data for this assignment are listed in the Excel file GPA score.
The admissions officer of a university is trying to develop a formal system of deciding which students to admit to the university. She believes that determinants of success include the standard variables: high school grades and SAT scores. However, she also believes that students who have participated in extracurricular activities are more likely to succeed than those who have not. To investigate the issue, she randomly sampled 100 fourth-year students and recorded the following variables (see the Excel file):
Univ GPA
GPA for the first 3 years at the university (range 0-12)
HS GPA
GPA from high school (range 0-12)
SAT
SAT score (range: 400-1600)
Activities
Numbers of hours on average spent per week in organized extracurricular activities in the last year of high school.
a. Determine in the following regression equation:
Univ GPA = HS GPA+ SAT+ Activities
and generate the usual statistics.
b. What is the coefficient of determination and interpret its value.
c. Test the overall validity of the model.
d. Interpret each of the coefficients of the independent variables.
e. Predict with 95% confidence the GPA for the first 3 years of university for a student whose high school GPA is 10, whose SAT score is 1200 and who worked an average of 2 hours per week on organized extracurricular activities in the last year of high school.
f. Estimate with 90% confidence the mean GPA for the first 3 years of university for all students whose high school GPA is 8, whose SAT score is 1100, and who worked an average of 10 hours per week organized extracurricular activities in the last year of high school.
Solution
a. . See table below.
b. The coefficient of determinations ; so 28.82% of the variation in university GPAs is explained by the model. The model does not fit the data very well.
c. . At least one is not equal to zero. , value . There is enough evidence to reject and thus conclude that the model is valid.
d. ; for each additional point of HS GPA Univ GPA increases on average by 0.611 provided that the other variables remain constant. Since , HS GPA is very relevant to University GPA.
; for each additional point of SAT Univ GPA increases on average by 0.00135 provided that the other variables remain constant. Since , we may conclude that the SAT score may not be relevant for University GPA.
; for each additional hour of Activities Univ GPA increases on average by 0.0462 provided that the other variables remain constant. Since , we may conclude that the number of hours of Activities may not be relevant for University GPA.
e. See table below. We predict that the student’s Univ GPA will fall between 4.45 and 12.00 (12 is the maximum).
f. See table below. The mean Univ GPA is estimated to lie between 6.90 and 8.22.
SUMMARY OUTPUT | |||||
Regression Statistics | |||||
Multiple R | 0.5369 | ||||
R Square | 0.2882 | ||||
Adjusted R Square | 0.2660 | ||||
Standard Error | 2.0302 | ||||
Observations | 100 | ||||
ANOVA | |||||
df | SS | MS | F | p | |
Regression | 3 | 160.23 | 53.41 | 12.96 | 0.0000 |
Residual | 96 | 395.70 | 4.12 | ||
Total | 99 | 555.93 | |||
Coefficients | Standard Error | t Stat | p | ||
Intercept | 0.721 | 1.87 | 0.39 | 0.7006 | |
HS GPA | 0.611 | 0.101 | 6.06 | 0.0000 | |
SAT | 0.00135 | 0.00135 | 0.00144 | 0.3485 | |
Activities | 0.0462 | 0.0641 | 0.72 | 0.4720 |
Prediction interval | ||
Univ GPA | ||
Predicted value | 8.55 | |
Prediction interval | ||
Lower limit | 4.55 | |
Upper limit | 12.65 | |
Interval Estimate of Expected Value | ||
Lower limit | 7.79 | |
Upper limit | 9.31 |
Prediction interval | ||
Univ GPA | ||
Predicted value | 7.56 | |
Prediction interval | ||
Lower limit | 4.12 | |
Upper limit | 10.99 | |
Interval Estimate of Expected Value | ||
Lower limit | 6.90 | |
Upper limit | 8.22 |
2.Delivery trucks
The data for this assignment are listed in the Excel file Delivery trucks.
Delivery trucks operating from a central warehouse regularly resupply stores. In an analysis of a chain’s operations, the general manager wanted to determine the factors that are related to how long it takes to unload delivery trucks.
A random sample of 50 deliveries to one store was observed. The times (in minutes) to unload the truck, the total number of boxes, the total weight (in hundreds of pounds) of the boxes and a code indicating the time of day (morning, early afternoon, late afternoon) were recorded.
a. First determine the multiple regression equation with the total number of boxes and the total weight as independent variables. Give the coefficients of both variables.
b. How well does the model fit the data? Is the model valid?
The general manager realized that another variable, the time of day, may affect unloading time. He recorded the following codes: 1 = morning, 2 = early afternoon, and 3 = late afternoon. These data can be found in the column Codes. Since these nominal variables are hard to interpret in a regression model, he introduced indicator or dummy variables.
c. Give the definition of the indicator variables based on Codes.
d. Determine the multiple regression equation with the total number of boxes, the total weight and the indicator variables as independent variables.
e. Compare the results of this model with those of the model in a. with respect to the fit of data and the validity. Which model would you prefer?
You need to use Excel.
Solution.
a. The multiple regression equation is: Time = Boxes + Weight.
.b.The coefficient of determination equals . So, we may conclude that the model fits the data very well. Furthermore, with a -value , so the model is a linear model and thus valid.
c. Morning = 1 if Code = 1, Morning = 0 if not. Early afternoon = 1 if Code = 2, Early afternoon = 0 if not. Late afternoon does not need to be defined, since Morning = 0 and Early afternoon = 0 implies Late afternoon.
d. Now, the multiple equation becomes: Time = Boxes + Weight - Morning + Early afternoon.
e. Now . So, we may conclude that the model fits the data even better. Furthermore, with a -value , so the model is a linear model and thus valid. This enhanced model performs much better than the model in a. and is to be preferred.
3. Longevity
The data for this assignment are listed in the Excel file Longevity.
Life insurance companies are keenly interested in predicting longevity of their customers (i.e., how long will their customers live) because their premiums and profitability depend on such numbers. An actuary for one insurance company gathered data from 100 recently deceased male customers. He recorded the age at death of the customer plus the ages at death of his mother and father, the mean ages at death of his grandmothers, and the mean ages at death of his grandfathers.
a. Perform a multiple regression analysis on these data and give the regression equation.
b. The -test concludes about the validity of the model. Formulate the null () and alternative () hypotheses for this -test. Give the results and explain why the model is or is not valid.
c. Interpret the coefficients of the regression equation.
d. Test the coefficients of the regression equation and explain the results.
e. Is the normality requirement satisfied?
f. Is the variance of the error variable constant?
g. Would you expect multicollinearity to be a problem?
Solution.
a. See the table below.
Longevity Mother Father Gmothers Gfathers
b. The null and alternative hypotheses are: ; at least one is not equal to zero. We find: and -value . There is overwhelming evidence to conclude that the model is valid.
c. ; intercept, not really useful since there are no data available around zero ages.
; for each one year increase in the mother’s age the customer’s age increases on average by 0.451 year provided the other variables are constant (which may not be possible because of multicollinearity).
; for each one year increase in the father’s age the customer’s age increases on average by 0.411 year provided the other variables are constant.
; for each one year increase in the grandmother’s mean age the customer’s age increases on average by 0.017 year provided the other variables are constant.
; for each one year increase in the grandfather’s mean age the customer’s age increases on average by 0.087 year provided the other variables are constant.
d. The null and alternative hypotheses for each of these coefficients are and
Mothers:
Fathers:
Grandmothers:
Grandfathers:
The ages of mothers and fathers are linearly related to the ages of their children. The other two variables are not.
e. The histogram for Frequency vs. Standard Residuals shows a normal distribution with mean approximately 0, so the normality requirement is satisfied.
f. The plot of Residuals vs. Predicted values shows that the variance of the error variable is constant.
g. The correlation matrix (Data Analysis) between Father, Mother, Gmothers and Gfathers coefficients shows correlations, some of which may cause problems with the -tests.
SUMMARY OUTPUT | |||||
Regression Statistics | |||||
Multiple R | 0.8608 | ||||
R Square | 0.7411 | ||||
Adjusted R Square | 0.7301 | ||||
Standard Error | 2.66 | ||||
Observations | 100 | ||||
ANOVA | |||||
df | SS | MS | F | p | |
Regression | 4 | 1930 | 482.38 | 67.97 | 0.0000 |
Residual | 95 | 674 | 7.10 | ||
Total | 99 | 2604 | |||
Coefficients | Standard Error | t Stat | p | ||
Intercept | 3.24 | 5.42 | 0.60 | 0.5512 | |
Mother | 0.451 | 0.0545 | 8.27 | 0.0000 | |
Father | 0.411 | 0.0498 | 8.26 | 0.0000 | |
Gmothers | 0.0166 | 0.0661 | 0.25 | 0.8028 | |
Gfathers | 0.0869 | 0.0657 | 1. 32 | 0.1890 |
4. Finger length
The data for this assignment are listed in the Excel file Finger length.
A computer dating service typically asks for various pieces of information such as height, weight, income and so on. One such service requested the length of index fingers. The only plausible reason for this request is to act as an indication of the height. Women have often complained that men lie about their height. If there is a strong relationship between heights and index fingers, the information can be used to “correct” false claims about heights.
To test the relationship between the two variables researchers gathered the heights and lengths of index fingers (in centimeters) of 121 male (Gender=1) and female (Gender=0) students.
a. Compute the formula of the regression line without gender. What can you conclude about how well the model fits the data and whether the model is valid? Explain your answer.
b. Is there sufficient evidence to infer that height and length of index fingers are linearly related.
c. Now also include gender in the model and see the difference in expected height.
Solution.
a. Use Excel and get the formula of the regression line: Height Finger. The model fits the data poorly . The model is valid because the -value is large with a -value which tells you that it is highly certain that there is a linear relation between Height and Finger length.
b. Excel gives as a result that the correlation between both variables is , There is a moderately strong positive correlation.
c. The gender makes a great difference. Take the same finger length and see the difference in height between males and females. If the finger length is 8.2 cm then a male is assumed to be 180 cm tall and a female only 169 cm.