Introduction


In this practical, the assumptions of the linear regression model will be discussed. You will practice with checking the different assumptions, and practice with accounting for some of the assumptions with additional steps.

We will use the following packages in this practical:

library(magrittr)
library(ggplot2) 
library(regclass)
library(MASS) 

Data set: Loading & Inspection

For the first part of this practical, a data set from a fish market is used. You can find the dataset in the Surfdrive folder. The variables in this fish data set are:

  • Species of the fish
  • Weight of the fish in grams
  • Vertical length of the fish in cm
  • Diagonal length of the fish in cm
  • Cross length of the fish in cm
  • Height of the fish in cm
  • Diagonal width of the fish in cm

Download the dataset from the Surfdrive folder, store it in the folder of your Rproject for this practical and open it in R. Also, adjust the column names according to the code below to make them a bit more intuitive.

# Read in the data set
data_fish <- read.csv("Fish.csv")

colnames(data_fish) <- c("species", "weight", "vertical_length", "diagonal_length", "cross_length", "height", "diagonal_width")

# Check the data set with the 'head' function to have a general impression.
data_fish %>%
  head()

Model assumptions

We will now discuss and check the various model assumptions of linear regression. If steps can be taken to account for a violated assumption, this will also be touched upon.


Linearity

With the assumption of linearity, it is assumed that the relation between the dependent and independent variables is (more or less) linear. You can check this by generating a scatterplot of a predictor variable against the outcome variable of the regression model.

  1. Check whether there is a linear relation between the variables vertical length and cross length.

  2. Next check the relation between weight and height.

  3. Describe both plots. What differences do you see?
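
A minimal sketch of such scatterplots, assuming the column names set above (vertical_length, cross_length, height, weight):

# Possible scatterplots to inspect linearity
data_fish %>%
  ggplot(aes(x = vertical_length, y = cross_length)) +
  geom_point()

data_fish %>%
  ggplot(aes(x = height, y = weight)) +
  geom_point()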

When a non-linear relation is present, you can either choose another type of model, or transform one of the variables before adding it to the model, for example using a log-transformation. Applying a transformation, however, will not always solve the problem, and it makes interpretation of the model less intuitive.

  1. Apply a log-transformation to the weight variable.

  2. Plot the relation between height and weight again, but now using the log-transformed weight variable.

  3. Describe if the transformation improved the linear relation.
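
As a sketch, one way to create and plot the log-transformed weight (the new column name weight_log is just an example):

# Add a log-transformed weight; note that log() of a zero weight gives -Inf,
# so such rows may need attention first
data_fish$weight_log <- log(data_fish$weight)

data_fish %>%
  ggplot(aes(x = height, y = weight_log)) +
  geom_point()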


Predictor matrix full rank

This assumption states that:
  • there need to be more observations than predictors (n > P).
  • no predictor can be a linear combination of other predictors; predictors cannot have a very high correlation (multicollinearity).

The first part of this assumption is easy to check: see whether the number of observations minus the number of predictors is a positive number. The second part can be checked either by obtaining the correlations between the predictors, or by computing the VIF (variance inflation factor). A VIF above 10 indicates high multicollinearity. To account for this, predictors can be excluded from the model, or a new variable can be constructed from the predictors with a high correlation.

To examine VIF scores, the function VIF from the regclass package can be used on a prespecified model. If this model includes a categorical variable with multiple categories, such as ‘species’ in the example data, the generalized VIF is used and we have to look at the third column (GVIF^(1/(2*Df))); squaring these values makes them comparable to ordinary VIF values.

  1. Specify a linear model with weight as the outcome variable, using all other variables in the dataset as predictors. Save this model as model_fish1. Calculate VIF values for this model.

  2. Check the VIF scores. If any VIF scores exceed 10, give a substantive explanation of why they are this high.

  3. What adjustments can be made to the model to account for multicollinearity in this case?

  4. Run a new model which includes only one of the three length variables and call it model_fish2. Describe whether there is an improvement.

  5. What happens to the regression model when there are more predictors than observations?
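
A possible way to fit both models and obtain the VIF values; which length variable you keep in model_fish2 is a choice, and vertical_length is used here only as an example:

# Full model with all predictors, and its (generalized) VIF values
model_fish1 <- lm(weight ~ species + vertical_length + diagonal_length +
                    cross_length + height + diagonal_width, data = data_fish)
VIF(model_fish1)

# Reduced model keeping only one of the three length variables (example choice)
model_fish2 <- lm(weight ~ species + vertical_length + height + diagonal_width,
                  data = data_fish)
VIF(model_fish2)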


Exogenous predictors

For this assumption, the expected value of the errors (the mean of the errors) must be 0. Furthermore, the errors must be independent of the predictors.

  1. What is the possible consequence of not meeting this assumption?

Constant, finite error variance

This assumption is also called ‘the assumption of homoscedasticity’. It states that the variance of the error terms should be constant over all levels of the predictors. This can be checked by plotting the residuals against the fitted values. This plot can be obtained by simply taking the first plot of a specified model, plot(model_x).

  1. Create a residuals vs fitted values plot for model_fish1, which is the first plot generated by the plot() function.

  2. Load the iris data, and specify a model where sepal length is predicted by all other variables; save this as model_iris1.

  3. Create a residuals vs fitted plot for this model as well.

  4. Discuss both plots and indicate whether the assumption is met.

  5. Discuss what the consequence would be if this assumption were violated.
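
A sketch of how these plots can be requested (the iris data ship with base R, so no download is needed):

# Residuals vs fitted values: the first diagnostic plot of an lm object
plot(model_fish1, which = 1)

# Model for the iris data: sepal length predicted by all other variables
model_iris1 <- lm(Sepal.Length ~ ., data = iris)
plot(model_iris1, which = 1)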


Independent errors

This assumption states that the error terms should be uncorrelated. Dependence of the errors can arise in multiple ways. First, there can be serial dependence, for example when the data contain variables that are measured over time. Another cause is a cluster structure in the data, for example students within classes within schools.

  1. How can both causes of correlated error terms be detected, and what can be done to solve the problem?

Normally distributed errors

This assumption states that the errors should be roughly normally distributed. Like the assumption of homoscedasticity, this can be checked with the model plots provided by R.

  1. Create a QQ plot for model_iris1, which is the second plot generated by the plot() function. Indicate whether the assumption is met.

  2. Create a new model using the fish data, where diagonal_width is predicted by cross_length, and store the model as model_fish3.

  3. Create a QQ plot for model_fish3.

  4. Interpret the two plots. Is the assumption met in both cases?

  5. In which cases is it problematic that the assumption is not met? And in which cases is it not a problem?
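
A possible approach, using the second diagnostic plot of an lm object:

# QQ plot of the residuals for model_iris1
plot(model_iris1, which = 2)

# New model on the fish data and its QQ plot
model_fish3 <- lm(diagonal_width ~ cross_length, data = data_fish)
plot(model_fish3, which = 2)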


Influential observations


Outliers

Outliers are observations with extreme outcomes compared to the rest of the data, or observations whose outcome values fit the model very badly. Outliers can be detected by inspecting the externally studentized residuals.

  1. Make a plot of the studentized residuals for model_fish1, using the functions rstudent and plot. What do you conclude?

  2. Make a plot of the studentized residuals for model_iris1.

  3. Load the dataset Animals from the MASS package. Define a regression model where the animals’ body weight is predicted by brain weight and store it as model_animals1.

  4. Make a plot of the studentized residuals for model_animals1.
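
A sketch of these plots; the data set name data_animals used below is just an example:

# Externally studentized residuals, plotted against the observation index
plot(rstudent(model_fish1))
plot(rstudent(model_iris1))

# Animals data from MASS (loaded above): body weight predicted by brain weight
data_animals <- MASS::Animals
model_animals1 <- lm(body ~ brain, data = data_animals)
plot(rstudent(model_animals1))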


High-leverage observations

High-leverage observations are observations with extreme predictor values. To detect these observations, we look at their leverage values. These values can be summarized in a leverage plot.

  1. For the model specified under model_animals1, create a leverage plot by plotting the hatvalues() of the model.
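
For example:

# Leverage (hat) values per observation
plot(hatvalues(model_animals1))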

Influence on the model

Outliers and observations with high leverage are not necessarily a problem on their own. Cases that are both, however, tend to be more problematic: they can influence the model heavily.

Influence measures come in two sorts: Cook’s distance checks for observations that are influential for the model as a whole, while DFBETAS check for influential, and possibly problematic, observations per regression coefficient.

  1. For model_animals1, check Cook’s distance by plotting the cooks.distance of the model.

  2. For model_animals1, check the DFBETAS by using the function dfbetas.

  3. Describe what you see in the plots for Cook’s distance and DFBETAS. What do you conclude?

  4. Delete the problematic observation that you identified in the previous questions and store the dataset under a new name.

  5. Fit the regression model where the animals’ body weight is predicted by brain weight using the adjusted dataset and store it as model_animals2.

  6. Compare the output to model_animals1 and describe the changes.

  7. Run the plots for influential observations again on this new model and see if anything changes.
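
A sketch of the influence checks and the refit; the row removed below is picked automatically as the one with the largest Cook’s distance, which is an assumption — base your own choice on the plots:

# Cook's distance and DFBETAS for model_animals1
plot(cooks.distance(model_animals1))
matplot(dfbetas(model_animals1), type = "h", ylab = "DFBETAS")

# Remove the most influential observation (largest Cook's distance) and refit
data_animals2 <- data_animals[-which.max(cooks.distance(model_animals1)), ]
model_animals2 <- lm(body ~ brain, data = data_animals2)
summary(model_animals2)
summary(model_animals1)  # compare with the original fit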


End of practical