# Practical 7

We will use the following packages in this practical:

- `dplyr` for manipulation
- `magrittr` for piping
- `readr` for reading data
- `ggplot2` for plotting
- `kableExtra` for tables
- `pROC`, `regclass`, and `caret` for model diagnostics

```r
library(dplyr)
library(magrittr)
library(ggplot2)
library(kableExtra)
library(readr)
library(pROC)
library(regclass)
library(caret)
```
In this practical, you will again perform logistic regression analyses using `glm()` and discuss model assumptions and diagnostics for the `titanic` data set.
1. Read in the data from the “titanic.csv” file, which we also used for the previous practical.
Fit the following two logistic regression models and save them as `fit1` and `fit2`:

- `Survived ~ Pclass`
- `Survived ~ Age + Pclass * Sex`
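As a sketch, both models can be fitted with `glm()` and `family = binomial`. Simulated data stand in here for the real data read from "titanic.csv" in step 1; the column names `Survived`, `Age`, `Pclass`, and `Sex` are taken from the formulas above.

```r
# Simulated stand-in for the titanic data; in the practical,
# use the data frame you read in from "titanic.csv" instead.
set.seed(123)
n <- 500
titanic <- data.frame(
  Age    = round(runif(n, 1, 80)),
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Sex    = factor(sample(c("male", "female"), n, replace = TRUE))
)
titanic$Survived <- rbinom(n, 1, plogis(2 - as.numeric(titanic$Pclass)))

# family = binomial gives logistic regression (logit link by default)
fit1 <- glm(Survived ~ Pclass, family = binomial, data = titanic)
fit2 <- glm(Survived ~ Age + Pclass * Sex, family = binomial, data = titanic)
summary(fit2)
```

Note that `Pclass * Sex` expands to the two main effects plus their interaction, so `fit2` estimates seven coefficients.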
The first assumption of logistic regression is that the outcome should be binary and therefore follow a binomial distribution. This is easy to check: you just need to be sure that the outcome can only take one of two responses. You can plot the responses of the outcome variable to visually check this if you want. In our case, the possible outcomes are 0 (did not survive) and 1 (survived).

Plot the responses of `Survived` using `ggplot()`.

If you are using logistic regression to make predictions/classifications, then the accuracy will be affected by imbalance in the outcome classes. Notice that in the plot you just made there are more people who did not survive than who did survive. A possible consequence is reduced accuracy in the classification of survivors.
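A sketch of such a bar chart, using a simulated 0/1 outcome in place of the real `Survived` column:

```r
library(ggplot2)

# Simulated 0/1 outcome standing in for titanic$Survived
set.seed(1)
titanic <- data.frame(Survived = rbinom(300, 1, 0.4))

# A bar chart makes the class (im)balance visible at a glance
p <- ggplot(titanic, aes(x = factor(Survived))) +
  geom_bar() +
  labs(x = "Survived", y = "Count")
p
```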
A certain amount of imbalance is expected and can be handled well by the model in most cases. The effects of this imbalance are context-dependent. Some solutions to serious class imbalance are down-sampling or weighting the outcomes to balance the importance placed on the outcomes by the model.
Sample size in logistic regression is a complex issue, but some suggest that it is ideal to have at least 10 cases per candidate predictor in your model. The minimum number of cases is \(N = \frac{10k}{p}\), where \(k\) is the number of predictors and \(p\) is the smallest of the proportions of negative or positive cases in the population.
Calculate the minimum number of cases required for `fit1`.

You learned about influential values in the linear regression practicals, but to remind you:
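The rule of thumb can be applied directly in R. The values of `k` and `p` below are illustrative assumptions, not figures taken from the real titanic data:

```r
# Minimum sample size: N = 10 * k / p
k <- 2     # e.g. the dummy-coded Pclass predictors in fit1 (assumed)
p <- 0.38  # e.g. the smallest outcome-class proportion (assumed)
N_min <- 10 * k / p
ceiling(N_min)  # round up to whole cases
```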
Influential values are extreme individual data points that can affect the fit of the logistic regression model. They can be visualised using Cook's distance and the Residuals vs Leverage plot.
Use the `plot()` function to visualise the outliers and influential points of `fit2`.

Hint: you need to specify the correct plot with the `which` argument. Check the lecture slides or search `??plot` if you are unsure.
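As a sketch (with simulated data standing in for the titanic model), `which = 4` requests the Cook's distance plot and `which = 5` the residuals vs leverage plot:

```r
# Simulated data stand in for fit2 on the titanic data
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, 1, plogis(d$x))
fit2 <- glm(y ~ x, family = binomial, data = d)

plot(fit2, which = 4)  # Cook's distance
plot(fit2, which = 5)  # residuals vs leverage
```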
Lastly, it is important to note that the assumptions of linear regression do not all map to logistic regression. In logistic regression, we do not need:

- a linear relationship between the outcome and the predictors,
- normally distributed residuals, or
- homoscedasticity (constant variance of the residuals).
However, deviance residuals are useful for determining whether individual points are not fit well by the model.
Hint: you can use some of the code from the lecture for the next few questions.
Use the resid()
function to get the deviance residuals for fit2
.
Compute the predicted logit values for the model.
Plot the deviance residuals.
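The three steps above can be sketched as follows, again using simulated data in place of the titanic model. For a `glm`, `resid()` returns deviance residuals by default, and `predict()` without a `type` argument returns the linear predictor, i.e. the predicted logits:

```r
# Simulated data stand in for fit2 on the titanic data
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))
fit2 <- glm(y ~ x, family = binomial, data = d)

dev_res    <- resid(fit2, type = "deviance")  # deviance residuals (the default)
logit_pred <- predict(fit2)                   # linear predictor = predicted logits

plot(logit_pred, dev_res,
     xlab = "Predicted logit", ylab = "Deviance residual")
abline(h = 0, lty = 2)
```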
Pearson residuals can also be useful in logistic regression. They measure deviations between the observed and fitted values. Pearson residuals are easier to plot than deviance residuals, as the `plot()` function can be used.
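A minimal sketch of the Pearson residual plot, under the same simulated stand-in setup:

```r
# Simulated data stand in for fit2 on the titanic data
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))
fit2 <- glm(y ~ x, family = binomial, data = d)

# type = "pearson" requests Pearson rather than deviance residuals
pearson_res <- resid(fit2, type = "pearson")
plot(pearson_res, ylab = "Pearson residual")
```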
In last week’s practical, you learned how to use the predict()
function to calculate predicted probabilities using the models. This week we will create predicted probabilities for the final two models from last week and compare the results using the confusion matrix.
Use the predict()
function to get model predicted probabilities for fit1
and fit2
.
Create model predicted classifications for survival, for fit1
and fit2
.
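Both steps can be sketched as follows (simulated data stand in for the titanic models; the 0.5 cut-off is an assumption, as other thresholds are possible):

```r
# Simulated data stand in for fit1 on the titanic data
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))
fit1 <- glm(y ~ x, family = binomial, data = d)

# type = "response" returns probabilities on the 0-1 scale
# (the default, type = "link", would return logits)
prob1 <- predict(fit1, type = "response")

# Classify with a 0.5 cut-off
pred1 <- ifelse(prob1 > 0.5, 1, 0)
table(pred1)
```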
You can read about the confusion matrix on this Wikipedia page. This section tells you how to get some useful metrics from the confusion matrix to evaluate model performance.
Create two confusion matrices (one for each model) using the classifications from the previous question. You can use the `table()` function, providing the observed outcome as the `true` argument and the classifications as the `pred` argument.
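A sketch of the `table()` call, with stand-in vectors in place of the observed outcome and the model classifications:

```r
# Stand-in vectors for the observed outcome and the classifications;
# in the practical, use titanic$Survived and your predicted classes.
set.seed(1)
true_y <- rbinom(200, 1, 0.4)
pred_y <- ifelse(runif(200) < 0.7, true_y, 1 - true_y)  # noisy predictions

cm <- table(true = true_y, pred = pred_y)
cm
```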
Based on the confusion matrices, which model do you think makes better predictions?
Calculate the accuracy, sensitivity, specificity, false positive rate, and positive and negative predictive values from the confusion matrix of the model that makes the best predictions.

Explain what the different metrics mean in substantive terms.
What does it mean for a model to have such low specificity, but high sensitivity?
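The metrics can be computed directly from the four cells of a 2x2 confusion matrix. The counts below are illustrative assumptions, not results from the titanic models:

```r
# Illustrative confusion-matrix cells (assumed, not from the data):
TN <- 480; FP <- 69; FN <- 97; TP <- 245

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
fpr         <- FP / (FP + TN)   # false positive rate = 1 - specificity
ppv         <- TP / (TP + FP)   # positive predictive value
npv         <- TN / (TN + FN)   # negative predictive value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, fpr = fpr, ppv = ppv, npv = npv), 3)
```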
The `confusionMatrix()` function from the `caret` package can do a lot of this for us. The function takes three arguments:

- `data` - a vector of predicted classes (in factor form)
- `reference` - a vector of true classes (in factor form)
- `positive` - a character string indicating the ‘positive’ outcome. If not specified, the confusion matrix assumes that the first specified category is the positive outcome.

You can type `??confusionMatrix` into the console to learn more.
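A sketch of the call, with stand-in factor vectors; in the practical, convert your classifications and the observed `Survived` variable with `factor()`:

```r
library(caret)

# Stand-in factors for the true outcome and the model classifications
set.seed(1)
truth_num <- rbinom(200, 1, 0.4)
pred_num  <- ifelse(runif(200) < 0.7, truth_num, 1 - truth_num)
truth <- factor(truth_num, levels = 0:1)
preds <- factor(pred_num,  levels = 0:1)

# "1" is declared the positive class (here: survived)
confusionMatrix(data = preds, reference = truth, positive = "1")
```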