Introduction


We will use the following packages in this practical:

library(dplyr)
library(magrittr)
library(ggplot2)
library(gridExtra)

In this practical, you will perform regression analyses using lm() and inspect the variables involved by plotting them with ggplot().

Loading the dataset

In this practical, we will use the built-in data set iris. This data set contains measurements of different iris species (flowers); you can find more information in the data set's help page (?iris).
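A first look at a built-in data set could be taken as sketched below. The choice of str() and head() is an assumption made for illustration, not something this practical prescribes.

```r
# iris ships with base R (package 'datasets'), so nothing needs to be downloaded
data(iris)

# Structure: number of rows, column names and column types
str(iris)

# First six rows of the data
head(iris)

# The help page describes what each column measures (uncomment to open it)
# ?iris
```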

  1. Load the dataset and explain what variables are measured in the first three columns of your data set.

Inspecting the dataset

A good way of eyeballing the relation between two continuous variables is to create a scatterplot.
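As a minimal sketch of such a scatterplot, the code below plots two other continuous iris variables. Sepal.Width and Petal.Length are chosen purely for illustration, and the object name p_scatter is hypothetical.

```r
library(ggplot2)

# Scatterplot of two continuous variables: one point per flower
p_scatter <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Length)) +
  geom_point()

# Draw the plot
print(p_scatter)
```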

  2. Plot the sepal length and the petal width variables in a ggplot scatter plot (geom_point()).

A loess curve can be added to the plot to get a general idea of the relation between the two variables. You can add a loess curve to a ggplot with stat_smooth(...., method = "loess").
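The following sketch shows one way of adding a loess curve to a scatterplot. The variables and the object name p_loess are illustrative assumptions; formula = y ~ x is spelled out only to make the smoothing formula explicit.

```r
library(ggplot2)

# Scatterplot with a loess curve drawn on top of the points.
# By default, stat_smooth() also draws a confidence band around the curve.
p_loess <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Length)) +
  geom_point() +
  stat_smooth(method = "loess", formula = y ~ x)

print(p_loess)
```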

  3. Add a loess curve to the plot under question 2, for further inspection.

To get a clearer idea of the general trend in the data (or of the relation), a regression line can be added to the plot. A regression line is added in the same way as a loess curve; only the method argument of the function needs to be changed to "lm".
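Switching the smoother to a straight regression line might look like the sketch below. The variables and the object name p_lm are again illustrative assumptions.

```r
library(ggplot2)

# Same kind of scatterplot, but with a linear regression line
# instead of a loess curve (method = "lm")
p_lm <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Length)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)

print(p_lm)
```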

  4. Change the loess curve of the previous plot to a regression line. Describe the relation that the line indicates.

Simple linear regression

With the lm() function, you can specify a linear regression model. You can save a model in an object and request summary statistics with the summary() function.

When a model is stored in an object, you can ask for the coefficients with coefficients().
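The workflow described above can be sketched as follows. The model Petal.Length ~ Petal.Width and the object name fit are assumptions chosen for illustration.

```r
# Fit a simple linear regression and store it in an object
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# Summary statistics: coefficient table, residual standard error, R-squared
summary(fit)

# Only the estimated intercept and slope
coefficients(fit)
```

The slope estimates the expected change in the outcome for a one-unit increase in the predictor.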

  5. Specify a regression model where Sepal length is predicted by Petal width. Store this model as model1. Supply summary statistics for this model.

  6. Based on the summary of the model, give a substantive interpretation of the regression coefficient.

  7. Relate the summary statistics and coefficients to the plots you made earlier.


Multiple linear regression

You can add additional predictors to a model. This can improve the fit and the predictions. When multiple predictors are used in a regression model, it is called multiple linear regression.
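A minimal sketch of adding a second predictor is shown below. The variables and the object names fit_one and fit_two are illustrative assumptions.

```r
# Model with one predictor
fit_one <- lm(Petal.Length ~ Petal.Width, data = iris)

# Additional predictors are added to the formula with +
fit_two <- lm(Petal.Length ~ Petal.Width + Sepal.Width, data = iris)

# Each slope is now interpreted holding the other predictor constant
summary(fit_two)

# R-squared cannot decrease when a predictor is added to the same model
c(one = summary(fit_one)$r.squared, two = summary(fit_two)$r.squared)
```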

  8. Add Petal length as a second predictor to the model specified as model1 and store this under the name model2, and supply summary statistics. Again, give a substantive interpretation of the coefficients and the model.

Categorical predictors

Up to here, we only included continuous predictors in our models. We will now include a categorical predictor in the model as well.

When a categorical predictor is added, this predictor is split into several comparisons, where each group is compared to a reference group. In our example iris data, the variable ‘Species’ is a categorical variable that indicates the species of flower. This variable can be added as an example of a categorical predictor.
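The sketch below illustrates how R splits a factor into comparisons against a reference group. The model and the object name fit_cat are assumptions for illustration. With R's default treatment coding, the first factor level serves as the reference group.

```r
# The first level of the factor is the reference group
levels(iris$Species)

# Dummy coding used by lm(): one column per non-reference group
contrasts(iris$Species)

# Model with one continuous and one categorical predictor
fit_cat <- lm(Petal.Length ~ Petal.Width + Species, data = iris)

# One coefficient per non-reference species; each is the expected difference
# from the reference group, holding Petal.Width constant
coefficients(fit_cat)
```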

  9. Add species as a predictor to the model specified as model2, store it under the name model3 and interpret the coefficients of this new model.

Model comparison

Now that you have created multiple models, you can compare how well these models function (compare the model fit). There are multiple ways of testing the model fit and comparing models. In this practical, we use the following:

  • AIC
  • BIC
  • RMSE
  • Deviance test

  10. Compare the fit of the model specified under question 5 and the model specified under question 8. Use all four fit comparison methods listed above. Interpret the fit statistics you obtain/tests you use to compare the fit.
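The four fit measures listed above could be obtained along the lines of this sketch. The two nested models are illustrative, and rmse() is a hypothetical helper defined here, not a base R function. The sketch assumes that the deviance test is carried out with anova(), which for lm objects performs an F-test on the change in residual sum of squares.

```r
# Two nested models, chosen for illustration
fit_small <- lm(Petal.Length ~ Petal.Width, data = iris)
fit_large <- lm(Petal.Length ~ Petal.Width + Sepal.Width, data = iris)

# Information criteria: lower values indicate a better trade-off
# between fit and model complexity
AIC(fit_small, fit_large)
BIC(fit_small, fit_large)

# Root mean squared error: typical size of the prediction error
# (hypothetical helper function)
rmse <- function(model) sqrt(mean(residuals(model)^2))
c(small = rmse(fit_small), large = rmse(fit_large))

# Test whether the larger model reduces the residual sum of squares
# by more than would be expected by chance
anova(fit_small, fit_large)
```

Note that AIC and BIC penalise additional parameters, whereas the in-sample RMSE can only stay the same or decrease when a predictor is added.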

Residuals: observed vs. predicted

When fitting a regression line, the predicted values have some error in comparison to the observed values. The sum of the squared values of these errors is the residual sum of squares. A regression analysis finds the line for which this sum of squares is as small as possible.
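To make the least squares idea concrete, the sketch below computes the sum of squared errors by hand and compares it with that of a slightly different line. The model and all object names are illustrative assumptions.

```r
# Illustrative model
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# Errors: observed minus predicted values
errors <- iris$Petal.Length - fitted(fit)

# Sum of squared errors, computed by hand
rss <- sum(errors^2)

# For lm objects, deviance() returns the same quantity
all.equal(rss, deviance(fit))

# A line with a slightly different slope has a larger sum of squares
b <- coefficients(fit)
other_pred <- b[["(Intercept)"]] + (b[["Petal.Width"]] + 0.1) * iris$Petal.Width
rss_other <- sum((iris$Petal.Length - other_pred)^2)
rss_other > rss
```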

The image below shows how the predicted (on the blue regression line) and observed values (black dots) differ and how the predicted values have some error (red vertical lines).
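A plot of this kind could be produced roughly as follows. This is a sketch under assumptions: the variables, the object names and the use of geom_segment() for the error lines are our own choices.

```r
library(ggplot2)

# Illustrative model
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

# Predictor, observed outcome and predicted outcome side by side
plot_data <- data.frame(
  x         = iris$Petal.Width,
  observed  = iris$Petal.Length,
  predicted = fitted(fit)
)

p_resid <- ggplot(plot_data, aes(x = x, y = observed)) +
  # Red vertical lines from each observed value to its predicted value
  geom_segment(aes(xend = x, yend = predicted), colour = "red") +
  # Blue regression line through the predicted values
  geom_line(aes(y = predicted), colour = "blue") +
  # Black dots for the observed values
  geom_point()

print(p_resid)
```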

With multiple predictors, it becomes harder or even impossible to make a plot like the one above (you would need a plot with more dimensions). You can, however, still plot the observed values against the predicted values and infer the error terms from there.
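An observed versus predicted plot for a model with two predictors might be set up as in this sketch. The model, the object names and the added reference line are illustrative assumptions.

```r
library(ggplot2)

# Illustrative model with two predictors
fit_two <- lm(Petal.Length ~ Petal.Width + Sepal.Width, data = iris)

# Observed outcome next to the fitted values stored in the model object
obs_pred <- data.frame(
  observed  = iris$Petal.Length,
  predicted = fit_two$fitted.values
)

# Points on the diagonal are predicted perfectly; the vertical distance
# of a point to the diagonal is its error
p_obs_pred <- ggplot(obs_pred, aes(x = predicted, y = observed)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, colour = "blue")

print(p_obs_pred)
```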

  11. Create a dataset of predicted values for model 1 by taking the outcome variable Sepal.Length and the fitted.values from the model.

  12. Create an observed vs. predicted plot for model 1 (the red vertical lines are not required).

  13. Create a dataset of predicted values and create a plot for model 2.

  14. Compare the two plots and discuss the fit of the models based on what you see in the plots. You can combine them in one figure using the grid.arrange() function.
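Combining two ggplot objects in one figure can be sketched as follows. The two plots and their names are placeholders chosen for illustration.

```r
library(ggplot2)
library(gridExtra)

# Two placeholder plots
p_left <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Length)) +
  geom_point() +
  ggtitle("Plot A")

p_right <- ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point() +
  ggtitle("Plot B")

# Draw both plots side by side in a single figure
grid.arrange(p_left, p_right, ncol = 2)
```

Giving both plots the same axis limits makes them easier to compare.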