# Practical 7

We will use the following packages in this practical:

- `dplyr` for manipulation
- `magrittr` for piping
- `readr` for reading data
- `ggplot2` for plotting
- `kableExtra` for tables
- `pROC`, `regclass`, and `caret` for model diagnostics

```r
library(dplyr)
library(magrittr)
library(ggplot2)
library(kableExtra)
library(readr)
library(pROC)
library(regclass)
library(caret)
```
In this practical, you will again perform logistic regression analyses using `glm()` and discuss model assumptions and diagnostics for the `titanic` data set.
1. Read in the data from the “titanic.csv” file, which we also used for the previous practical.
Fit the following two logistic regression models and save them as `fit1` and `fit2`:

- `Survived ~ Pclass`
- `Survived ~ Age + Pclass * Sex`
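As a sketch, both models can be fitted with `glm()` and `family = binomial`. Simulated data stand in here for the real data read from "titanic.csv" in step 1; the column names `Survived`, `Age`, `Pclass`, and `Sex` are taken from the formulas above.

```r
# Simulated stand-in for the titanic data; in the practical,
# use the data frame you read in from "titanic.csv" instead.
set.seed(123)
n <- 500
titanic <- data.frame(
  Age    = round(runif(n, 1, 80)),
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Sex    = factor(sample(c("male", "female"), n, replace = TRUE))
)
titanic$Survived <- rbinom(n, 1, plogis(2 - as.numeric(titanic$Pclass)))

# family = binomial gives logistic regression (logit link by default)
fit1 <- glm(Survived ~ Pclass, family = binomial, data = titanic)
fit2 <- glm(Survived ~ Age + Pclass * Sex, family = binomial, data = titanic)
summary(fit2)
```

Note that `Pclass * Sex` expands to the two main effects plus their interaction, so `fit2` estimates seven coefficients.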
The first assumption of logistic regression is that the outcome should be binary and therefore follow a binomial distribution. This is easy to check: you just need to be sure that the outcome can only take one of two responses. You can plot the responses of the outcome variable to visually check this if you want. In our case, the possible outcomes are 0 (did not survive) and 1 (survived).

Plot the responses of `Survived` using `ggplot()`.

If you are using logistic regression to make predictions/classifications, then the accuracy will be affected by imbalance in the outcome classes. Notice that in the plot you just made there are more people who did not survive than who did survive. A possible consequence is reduced accuracy in the classification of survivors.
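A sketch of such a bar chart, using a simulated 0/1 outcome in place of the real `Survived` column:

```r
library(ggplot2)

# Simulated 0/1 outcome standing in for titanic$Survived
set.seed(1)
titanic <- data.frame(Survived = rbinom(300, 1, 0.4))

# A bar chart makes the class (im)balance visible at a glance
p <- ggplot(titanic, aes(x = factor(Survived))) +
  geom_bar() +
  labs(x = "Survived", y = "Count")
p
```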
A certain amount of imbalance is expected and can be handled well by the model in most cases. The effects of this imbalance are context-dependent. Some solutions to serious class imbalance are down-sampling or weighting the outcomes to balance the importance placed on the outcomes by the model.
Sample size in logistic regression is a complex issue, but some suggest that it is ideal to have at least 10 cases per candidate predictor in your model. The minimum number of cases is \(N = \frac{10k}{p}\), where \(k\) is the number of predictors and \(p\) is the smallest of the proportions of negative or positive cases in the population.
Calculate the minimum number of cases required for `fit1`.

You learned about influential values in the linear regression practicals, but to remind you:
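The rule of thumb can be applied directly in R. The values of `k` and `p` below are illustrative assumptions, not figures taken from the real titanic data:

```r
# Minimum sample size: N = 10 * k / p
k <- 2     # e.g. the dummy-coded Pclass predictors in fit1 (assumed)
p <- 0.38  # e.g. the smallest outcome-class proportion (assumed)
N_min <- 10 * k / p
ceiling(N_min)  # round up to whole cases
```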
Influential values are extreme individual data points that can affect the fit of the logistic regression model. They can be visualised using Cook's distance and the Residuals vs Leverage plot.
Use the `plot()` function to visualise the outliers and influential points of `fit2`.

Hint: you need to specify the correct plot with the `which` argument. Check the lecture slides or search `??plot` if you are unsure.
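As a sketch (with simulated data standing in for the titanic model), `which = 4` requests the Cook's distance plot and `which = 5` the residuals vs leverage plot:

```r
# Simulated data stand in for fit2 on the titanic data
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- rbinom(100, 1, plogis(d$x))
fit2 <- glm(y ~ x, family = binomial, data = d)

plot(fit2, which = 4)  # Cook's distance
plot(fit2, which = 5)  # residuals vs leverage
```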
Lastly, it is important to note that the assumptions of linear regression do not all map to logistic regression. In logistic regression, we do not need:

- a linear relationship between the outcome and the predictors,
- normally distributed residuals, or
- homoscedasticity (constant variance of the residuals).
However, deviance residuals are useful for determining whether individual points are not fit well by the model.
Hint: you can use some of the code from the lecture for the next few questions.
Use the resid()
function to get the deviance residuals for fit2
.
Compute the predicted logit values for the model.
Plot the deviance residuals.
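The three steps above can be sketched as follows, again using simulated data in place of the titanic model. For a `glm`, `resid()` returns deviance residuals by default, and `predict()` without a `type` argument returns the linear predictor, i.e. the predicted logits:

```r
# Simulated data stand in for fit2 on the titanic data
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))
fit2 <- glm(y ~ x, family = binomial, data = d)

dev_res    <- resid(fit2, type = "deviance")  # deviance residuals (the default)
logit_pred <- predict(fit2)                   # linear predictor = predicted logits

plot(logit_pred, dev_res,
     xlab = "Predicted logit", ylab = "Deviance residual")
abline(h = 0, lty = 2)
```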
Pearson residuals can also be useful in logistic regression. They measure deviations between the observed and fitted values. Pearson residuals are easier to plot than deviance residuals, as the `plot()` function can be used.
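A minimal sketch of the Pearson residual plot, under the same simulated stand-in setup:

```r
# Simulated data stand in for fit2 on the titanic data
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))
fit2 <- glm(y ~ x, family = binomial, data = d)

# type = "pearson" requests Pearson rather than deviance residuals
pearson_res <- resid(fit2, type = "pearson")
plot(pearson_res, ylab = "Pearson residual")
```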
In last week’s practical, you learned how to use the predict()
function to calculate predicted probabilities using the models. This week we will create predicted probabilities for the final two models from last week and compare the results using the confusion matrix.
Use the predict()
function to get model predicted probabilities for fit1
and fit2
.
Create model predicted classifications for survival, for fit1
and fit2
.
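Both steps can be sketched as follows (simulated data stand in for the titanic models; the 0.5 cut-off is an assumption, as other thresholds are possible):

```r
# Simulated data stand in for fit1 on the titanic data
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))
fit1 <- glm(y ~ x, family = binomial, data = d)

# type = "response" returns probabilities on the 0-1 scale
# (the default, type = "link", would return logits)
prob1 <- predict(fit1, type = "response")

# Classify with a 0.5 cut-off
pred1 <- ifelse(prob1 > 0.5, 1, 0)
table(pred1)
```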
You can read about the confusion matrix on this Wikipedia page. This section tells you how to get some useful metrics from the confusion matrix to evaluate model performance.
Create two confusion matrices (one for each model) using the classifications from the previous question. You can use the `table()` function, providing the observed outcome as the `true` argument and the classifications as the `pred` argument.
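A sketch of the `table()` call, with stand-in vectors in place of the observed outcome and the model classifications:

```r
# Stand-in vectors for the observed outcome and the classifications;
# in the practical, use titanic$Survived and your predicted classes.
set.seed(1)
true_y <- rbinom(200, 1, 0.4)
pred_y <- ifelse(runif(200) < 0.7, true_y, 1 - true_y)  # noisy predictions

cm <- table(true = true_y, pred = pred_y)
cm
```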
Based on the confusion matrices, which model do you think makes better predictions?
Calculate the accuracy, sensitivity, specificity, false positive rate, and positive and negative predictive values from the confusion matrix of the model that makes the best predictions.

Explain what the different metrics mean in substantive terms.
What does it mean for a model to have such low specificity, but high sensitivity?
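The metrics can be computed directly from the four cells of a 2x2 confusion matrix. The counts below are illustrative assumptions, not results from the titanic models:

```r
# Illustrative confusion-matrix cells (assumed, not from the data):
TN <- 480; FP <- 69; FN <- 97; TP <- 245

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
fpr         <- FP / (FP + TN)   # false positive rate = 1 - specificity
ppv         <- TP / (TP + FP)   # positive predictive value
npv         <- TN / (TN + FN)   # negative predictive value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, fpr = fpr, ppv = ppv, npv = npv), 3)
```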
The `confusionMatrix()` function from the `caret` package can do a lot of this for us. The function takes three arguments:

- `data` - a vector of predicted classes (in factor form)
- `reference` - a vector of true classes (in factor form)
- `positive` - a character string indicating the ‘positive’ outcome. If not specified, the confusion matrix assumes that the first specified category is the positive outcome.

You can type `??confusionMatrix` into the console to learn more.
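A sketch of the call, with stand-in factor vectors; in the practical, convert your classifications and the observed `Survived` variable with `factor()`:

```r
library(caret)

# Stand-in factors for the true outcome and the model classifications
set.seed(1)
truth_num <- rbinom(200, 1, 0.4)
pred_num  <- ifelse(runif(200) < 0.7, truth_num, 1 - truth_num)
truth <- factor(truth_num, levels = 0:1)
preds <- factor(pred_num,  levels = 0:1)

# "1" is declared the positive class (here: survived)
confusionMatrix(data = preds, reference = truth, positive = "1")
```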