Quick Overview

In nine weeks, you will learn the basics of data handling with R and details about regression techniques in the context of statistical inference. We will also cover the connection between these concepts and research philosophy. During every lecture, we will cover a different theoretical topic. In addition to the lectures, there will also be a weekly computer lab exercise that connects the statistical theory to practice. You will also attend weekly workgroup meetings wherein you will work on solving motivating, real-world case studies.

Assignment and Grading

The final grade is computed as follows:

| Grade Component | Weight |
| --- | --- |
| Group assignment 1: Linear regression | 25% |
| Group assignment 2: Logistic regression | 25% |
| Written Exam | 50% |

In addition to the grade components listed above, you will also do R exercises for the first 7 weeks of the course. These exercises will develop the skills needed to successfully complete the assignments.

To pass the course:

  1. Your final exam grade must be 5.5 or higher
  2. Both of your assignment grades must be 5.5 or higher
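
As a rough illustration of the weighting and pass rules above, the sketch below computes a weighted grade and checks both pass conditions. The function names `final_grade()` and `passes_course()` are hypothetical helpers made up for this example, and the sketch applies no rounding because the text does not specify a rounding rule.

```r
# Hypothetical helper: weighted final grade from the three graded components
final_grade <- function(assignment1, assignment2, exam) {
  weights <- c(assignment1 = 0.25, assignment2 = 0.25, exam = 0.50)
  grades  <- c(assignment1 = assignment1, assignment2 = assignment2, exam = exam)
  sum(weights * grades)
}

# Hypothetical helper: the exam and both assignments must reach the cutoff
passes_course <- function(assignment1, assignment2, exam, cutoff = 5.5) {
  exam >= cutoff && assignment1 >= cutoff && assignment2 >= cutoff
}

final_grade(7, 8, 6)    # 0.25 * 7 + 0.25 * 8 + 0.50 * 6 = 6.75
passes_course(7, 8, 6)  # TRUE
passes_course(9, 9, 5)  # FALSE: the exam grade is below 5.5
```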

Attendance

During this course, you will attend 8 workgroup sessions and hand in 7 practical assignments. We expect you to attend at least 7 out of 8 workgroup sessions and to hand in at least 6 out of 7 practical assignments (before the deadline). If you do not meet these requirements, you lose the right to resit the exam.

Literature

We will use two open-source books in this course:

  1. R for Data Science (R4DS)
    • You can find solutions for the R4DS exercises here.
  2. Applied Statistics with R (ASWR)

There is no need to purchase these books. The freely available online versions are sufficient. The relevant chapters will be linked in this dashboard where the reading is assigned. We will also use several external webpages and web apps. These resources will also be linked in this dashboard.

Schedule

| Week # | Topic | R Exercise | Workgroup | Reading |
| --- | --- | --- | --- | --- |
| 1 | The basics of R | How to work with R via scripts, projects, and markdown; How to import external data into R; How to write your own functions; How to iterate repetitive tasks | Form groups; Search for a dataset for the two group assignments; Formulate research questions | R4DS: Chapter 11, Chapter 27, Chapter 19, and Chapter 21 |
| 2 | Programmatic data manipulation 1 | Data types and objects in R; Data transformation; Working with pipes | Perform data transformations on your found dataset | R4DS: Chapter 5, Chapter 10, Chapter 14 (only 14.1 and 14.2), Chapter 15, Chapter 18, and Chapter 20 |
| 3 | Programmatic data manipulation 2 | Data visualization; Data inspection; Data cleaning | Continue with data inspection and cleaning | R4DS: Chapter 3 and Chapter 7; ASWR: Chapter 4 |
| 4 | Multiple linear regression | Estimating linear models in R using the lm() function; Model fit and model comparison; Categorical predictors; Moderation | Find a best fitting model; Test your hypotheses | ASWR: Chapter 7 (only 7.1–7.4), Chapter 9 (only 9.1–9.4), Chapter 11 (only 11.1–11.3), and Chapter 16 (only 16.1, excluding 16.1.4, and 16.2) |
| 5 | Model assumptions and diagnostics | Assumptions of the linear model; Leverage, outliers, and influential cases | Check assumptions of your model and inspect for unusual observations; Make adjustments if necessary; Draw conclusions; Submit Assignment 1 | ASWR: Chapter 13 |
| 6 | Generalized linear model and logistic regression | Estimating generalized linear models using the glm() function in R; Definition, estimation, and interpretation of logistic regression models | Perform data inspection and cleaning for the second assignment; Formulate hypotheses; Find a best fitting model and test your hypotheses | ASWR: Chapter 17 (only 17.1–17.3); This webpage |
| 7 | Logistic regression assumptions and classification | Logistic regression assumptions; Classification; Confusion matrix | Check the assumptions of your model and make adjustments if necessary; Make classifications | ASWR: Chapter 17 (only 17.4); This webpage |
| 8 | Summary, catch-up, and questions | - | Interpret your final model as well as the confusion matrix; Draw conclusions; Submit Assignment 2 | - |

Course Manual

Course Content

Regression techniques are widely used to quantify the relationship between two or more variables. In data science, linear and logistic regression are common and powerful techniques for evaluating such relations. These techniques are only useful, however, once you understand when and how to apply them. In this course, students will learn how to apply linear and logistic regression with the R statistical software package.

This course will introduce students to the principles of analytical data science, linear and logistic regression, and the basics of statistical learning. Students will develop fundamental R programming skills and will gain experience with the tidyverse, visualizing data with ggplot2 and performing basic data wrangling with dplyr. This course helps prepare students for an entry-level research career (e.g., junior researcher or research assistant) or further education in research (e.g., a [research] Master program or a PhD).

Course goals

At the end of this course, students are able to:

  1. Identify key statistical concepts such as:
    • (Conditional) probability
    • Inference
    • Estimation
    • Prediction
    • Classification
    • Sampling variability
    • Statistical modeling
    • Residuals
    • Fitted values
  2. Choose an appropriate regression model for a given research scenario.
  3. Explain the differences/similarities between statistical inference and model-based prediction/classification; give examples of each type of problem.
  4. Identify the assumptions of linear and logistic regression; describe the consequences of violating these assumptions.
  5. Describe the three components of a generalized linear model and how these components are specified in logistic regression.
  6. Interpret the estimates from linear and logistic regression models, and use these estimates to answer research questions.
  7. Use the R statistical software platform to perform basic statistical programming, data manipulation, data visualization, and basic data wrangling.
  8. Use the R statistical software platform to perform, interpret, and evaluate linear and logistic regression analyses on real-world data.
  9. Interpret R output and use the results to answer research questions.
  10. Use R Markdown to document the results of a statistical analysis.

Relation between assessment and objective

In this course, skills and knowledge are evaluated with two types of assignment.

  1. The exam evaluates knowledge and understanding of statistical concepts (Learning goal 1), the ability to critically evaluate research problems and statistical methods (Learning goals 2–5), and the ability to interpret statistical results and software output and apply these interpretations (Learning goals 6 & 9).
  2. The group assignments evaluate the student’s ability to work with data, solve basic data analytic problems, execute quantitative data analyses on real-world data sets, and document the results (Learning goals 6–10).

Course structure

In eight weeks, you will learn the basics of data handling and statistical programming with R and details about regression techniques in the context of statistical inference, prediction, and classification. Each week will comprise three class activities:

  1. During the weekly lectures, we will cover the theoretical content.
  2. Weekly practical exercises connect the statistical theory to practice by applying the lecture content in the R statistical programming language.
  3. During the weekly workgroup meetings, you will work on real-world data analysis with a group of your peers.

Attendance

During this course, you will attend 8 workgroup sessions and hand in 7 practical assignments. We expect you to attend at least 7 out of 8 workgroup sessions and to hand in at least 6 out of 7 practical assignments (before the deadline). If you do not meet these requirements, you lose the right to resit the exam.

Group assignment 1: Linear regression

Type of assignment: Group (4 students)

Grading: 25% of your final grade

Deadline: Monday December 18, 17:00

What to submit: A ZIP archive containing the complete R project (dataset, RMD, HTML)

Where to submit: This Surfdrive folder

Description: For this assignment, you perform and report a multiple linear regression analysis in an R markdown document. The assignment will be graded on the following five dimensions:

  1. Preliminaries: Introduction of your research questions, description and potential processing of your data.
  2. Model estimation: Description of the model estimates, model fit, and model comparison procedure.
  3. Assumptions: Testing of model assumptions, checking for influential cases. Act upon and/or reflect on violations when needed.
  4. Interpretation: Substantive interpretation of the final model. Answering your research question.
  5. Layout: Structure of the document, efficiency of output presentation, use of custom functions (when applicable). Presentation of suitable visualizations.

Group assignment 2: Logistic regression

Type of assignment: Group (4 students)

Grading: 25% of your final grade

Deadline: Thursday January 18, 17:00

What to submit: A ZIP archive containing the complete R project (dataset, RMD, HTML)

Where to submit: This Surfdrive folder

Description: For this assignment, you perform and report a multiple logistic regression analysis in an R markdown document. The assignment will be graded on the following five dimensions:

  1. Preliminaries: Introduction of your research questions, description and potential processing of your data.
  2. Model estimation: Description of the model estimates, model fit, and model comparison procedure.
  3. Assumptions: Testing of model assumptions, checking for influential cases. Act upon and/or reflect on violations when needed.
  4. Interpretation: Substantive interpretation of the final model (including the confusion matrix). Answering your research question.
  5. Layout: Structure of the document, efficiency of output presentation, use of custom functions (when applicable). Presentation of suitable visualizations.

Preparation

This semester, you will participate in the Fundamental Techniques in Data Science with R course at Utrecht University. In this course, you will use both R and RStudio. The steps below will guide you through installing both R and RStudio. Please do so before the first meeting.

System requirements

Bring a laptop computer to the course, and make sure that you have full write access and administrator rights on the machine. We will explore programming and compiling in this course, so you will need full access to your machine. Some corporate laptops come with limited access for their users, so I advise you to bring a personal laptop to the workgroup meetings.

1. Install R

You can obtain a copy of R here. We won’t use R directly in the course. Rather, we’ll call R through RStudio. Therefore, you also need to install RStudio.

2. Install RStudio Desktop

RStudio is an Integrated Development Environment (IDE) for R. You can download RStudio as stand-alone software here. The free and open-source RStudio Desktop version is sufficient.

3. Start RStudio and install the following packages.

Open RStudio, and copy-paste the following lines of code into the console window to execute them.

  • If nothing happens after you paste the code, try hitting the “Enter/Return” key.
# Install the packages used in this course, together with their dependencies
install.packages(c("ggplot2", 
                   "tidyverse", 
                   "magrittr", 
                   "micemd", 
                   "jomo", 
                   "pan", 
                   "lme4", 
                   "knitr", 
                   "rmarkdown", 
                   "plotly", 
                   "devtools", 
                   "class", 
                   "car", 
                   "MASS", 
                   "ISLR",  
                   "mice"), 
                 dependencies = TRUE)

If you are not sure where to paste the code, use the following figure to identify the console:

[Figure: the RStudio window, showing the console pane]

When you are asked the following:

Do you want to install from sources the package which needs
compilation? (Yes/no/cancel)

Type Yes in the console, and press the “Enter/Return” key (or click the corresponding button if the question presents as a dialog box).
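
Once the installation has finished, a quick check along the following lines can show whether any package is still missing. This is only a sketch: it relies on base R's `requireNamespace()`, and the object names `pkgs`, `installed`, and `not_installed` are made up for this example.

```r
# The packages requested in the install.packages() call
pkgs <- c("ggplot2", "tidyverse", "magrittr", "micemd", "jomo", "pan",
          "lme4", "knitr", "rmarkdown", "plotly", "devtools", "class",
          "car", "MASS", "ISLR", "mice")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
not_installed <- pkgs[!installed]
print(not_installed)
```

An empty result (`character(0)`) suggests that all packages are available. Otherwise, you could try rerunning `install.packages()` for the names that are printed.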

What if the steps do not work for me?

If the suggested steps fail, or you have insufficient rights on your machine, you can use the following web-based solutions.

  1. Open a free account on posit.cloud.

    • You can run your own cloud-based RStudio environment there.
  2. Use Utrecht University’s MyWorkPlace.

    • You will have access to R and RStudio there. When you start a new MyWorkPlace session, you may need to (re)install packages.

Naturally, you will need internet access to use these services.

Get acquainted with R

To familiarize yourself with basic R usage, complete the following exercise before the first lecture. This exercise will get you started with R and RStudio. You can also have a look at the posit website for more detailed tutorials.

Suggested reading:

Basic statistical concepts

We expect you to be familiar with some basic statistical concepts such as:

  • Descriptive statistics
  • Sampling
  • Correlation
  • T-test
  • P-values

To refresh your memory, you can have a look at the material below. Note that these topics are background knowledge for this course. The course material builds on this knowledge.

From ASWR: Chapter 5 (only 5.1 and 5.2) and Chapter 7 (only 7.1 and 7.2).

Furthermore, you may benefit from exploring the following shiny apps:

Week 1

Lecture

This week, we’ll cover some fundamentals of R programming.

  • Interacting with R via scripts and R Markdown
  • Importing external data into R
  • Writing your own R functions
  • Methods for iterating repetitive operations in R scripts
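
To give a flavour of these topics, here is a minimal base-R sketch that imports a small data file, defines a function, and iterates over the columns of a data frame. The data and all object names (`csv_path`, `dat`, `cm_to_m()`, `col_means`) are invented for this illustration and are not part of the course materials.

```r
# Write a tiny CSV file to a temporary location so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, height_cm = c(170, 182, 165)),
          csv_path, row.names = FALSE)

# Import external data into R
dat <- read.csv(csv_path)

# Write your own function: convert centimetres to metres
cm_to_m <- function(x) {
  x / 100
}

# Many R functions are vectorized, so no explicit loop is needed here
dat$height_m <- cm_to_m(dat$height_cm)

# Iterate with a for loop: store the mean of each column under its name
col_means <- numeric(0)
for (v in names(dat)) {
  col_means[v] <- mean(dat[[v]])
}

# The same iteration with vapply(), which also checks the type of each result
col_means_apply <- vapply(dat, mean, numeric(1))

print(col_means)
print(col_means_apply)
```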

You can find the lecture slides here.

Required Reading

These readings are exam materials.

Workgroup

In this week’s workgroup meeting, we discuss the assignments and expectations for your work in this course. You will form groups, search for a dataset to use in the two group assignments, and decide which research questions you will answer for Assignments 1 and 2.

You can find the slides for this workgroup meeting here.

Deadline:

Email the following information to your workgroup instructor before the end of the workgroup meeting.

  • The names of your group members
  • The research questions your group will use for Assignments 1 and 2

R Practical

NOTE: Please read the Preparation page before starting with these practical exercises.

  • Download the dataset for this exercise from the Surfdrive datasets folder.
  • Complete Practical 1.
  • Submit your RMD script and the compiled HTML file it produces to the SurfDrive folder for this practical before the next lecture.
    • Name the files your_name_1.Rmd and your_name_1.html, respectively, where your_name is your full name in lower snake case and the 1 indicates Practical 1.
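
If you are unsure what lower snake case looks like, the sketch below builds the two file names from a full name. `practical_file_names()` is a hypothetical helper written for this illustration; it simply lower-cases the name and replaces each run of non-alphanumeric characters with an underscore.

```r
# Hypothetical helper: build the .Rmd and .html file names for a practical
practical_file_names <- function(full_name, practical) {
  # Lower snake case: lower-case letters, words joined by underscores
  stem <- tolower(gsub("[^[:alnum:]]+", "_", trimws(full_name)))
  paste0(stem, "_", practical, c(".Rmd", ".html"))
}

practical_file_names("Jane Doe", 1)
# "jane_doe_1.Rmd"  "jane_doe_1.html"
```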

Answers

You can find suggested answers to the practical below. We provide these answer files for two reasons:

  1. So you won’t ever get intractably stuck on a question.
  2. So you can check your answers after you attempt a problem.

Even though you have the solutions available, we strongly encourage you to seriously attempt answering each question in the exercises before checking the solutions.

Week 2

Lecture

This week, we start looking more closely at programmatic data manipulation in R.

  • R objects and data types
  • Manipulating and transforming data
  • Working with pipes
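
The sketch below illustrates these three topics with base R only. It uses the native pipe `|>` (available from R 4.1.0) as a stand-in, on the assumption that the lecture and readings work with the magrittr pipe `%>%` and dplyr verbs. The `mtcars` data ship with R, and the other object names are made up for this example.

```r
# Basic data types stored in atomic vectors
x_num <- c(1.5, 2, 3.25)                   # double
x_int <- 1:3                               # integer
x_chr <- c("a", "b", "c")                  # character
x_lgl <- x_num > 1.75                      # logical
x_fct <- factor(c("low", "high", "low"))   # factor (categorical variable)

class(x_num)
class(x_fct)

# A data frame is a list of equal-length vectors
df <- data.frame(num = x_num, chr = x_chr, fct = x_fct)
str(df)

# Transform data with a pipe: each step passes its result to the next function
result <- mtcars |>
  subset(cyl == 4) |>                  # keep the four-cylinder cars
  transform(kpl = mpg * 0.425144) |>   # miles per gallon to km per litre
  head(3)                              # keep the first three rows

print(result)
```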

You can find the lecture slides here.

Workgroup

In this week’s meeting, you refine your research questions and perform any necessary manipulations to the variables in your dataset.

You can find the slides for this workgroup meeting here.

R practical

This week’s R practical is about R objects and data types, performing basic data manipulations, and working with pipes.

  • Complete Practical 2.
  • Submit your RMD script and the compiled HTML file it produces to the SurfDrive folder for this practical before the next lecture.
    • Name the files your_name_2.Rmd and your_name_2.html, respectively, where your_name is your full name in lower snake case and the 2 indicates Practical 2.

Answers

You can find suggested answers to the practical exercises here:

Week 3

Lecture

This week, we continue with our discussion of programmatic data processing.

  • Data visualization using ggplot2
  • Data exploration and cleaning
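
As a small illustration, the sketch below inspects the built-in `airquality` data, applies one possible cleaning step, and sets up a ggplot2 scatterplot. Dropping incomplete rows is an assumption made for this example only and is not necessarily the approach taught in the lecture. The plot is built only when ggplot2 is installed.

```r
# airquality ships with R and contains missing values
str(airquality)
summary(airquality)

# Data inspection: count the missing values in each column
n_missing <- colSums(is.na(airquality))
print(n_missing)

# One possible cleaning step (an assumption for this sketch):
# keep only the complete rows and store Month as a labelled factor
aq_clean <- airquality[complete.cases(airquality), ]
aq_clean$Month <- factor(aq_clean$Month, levels = 5:9, labels = month.abb[5:9])

# Data visualization with ggplot2, if it is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(aq_clean, aes(x = Temp, y = Ozone, colour = Month)) +
    geom_point() +
    labs(x = "Temperature (degrees F)", y = "Ozone (ppb)")
  # Typing p at the console (or calling print(p)) draws the plot
}
```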

You can find the lecture slides below.

Required Reading

These readings are exam materials.

Workgroup

In today’s workgroup, you will continue with inspecting and cleaning your group’s chosen dataset.

You can find the slides for this workgroup meeting here.

R practical

This week’s R practical will cover data visualization, inspection, and cleaning as well as writing functions in R.

  • Complete Practical 3.
  • Submit your RMD script and the compiled HTML file it produces to the SurfDrive folder for this practical before the next lecture.
    • Name the files your_name_3.Rmd and your_name_3.html, respectively, where your_name is your full name in lower snake case and the 3 indicates Practical 3.

Answers

You can find suggested answers to the practical exercises here:

Week 4

Lecture

This week, we move on from R programming and begin our discussion of linear modeling.

  • Specifying linear models in R using the lm() function
  • Model fit and model comparison
  • Categorical predictors
  • Moderation
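
The sketch below shows how these ideas might look in R's formula syntax, using the built-in `mtcars` data. The choice of variables is purely illustrative.

```r
# Recode the transmission indicator as a factor (0 = automatic, 1 = manual)
cars <- transform(mtcars,
                  am = factor(am, labels = c("automatic", "manual")))

# Simple linear regression: fuel economy predicted by weight
fit_simple <- lm(mpg ~ wt, data = cars)

# Categorical predictor: R creates the dummy code for the factor automatically
fit_cat <- lm(mpg ~ wt + am, data = cars)

# Moderation: wt * am expands to wt + am + wt:am (main effects plus interaction)
fit_mod <- lm(mpg ~ wt * am, data = cars)

# Coefficient table, R-squared, and overall F-test
summary(fit_mod)

# The interaction coefficient is labelled "wt:ammanual"
coef(fit_mod)
```

In the moderated model, the `wt` coefficient is the slope for the reference group (automatic), and `wt:ammanual` is the difference in slope for cars with a manual transmission.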

You can find the lecture slides here.

Workgroup

In this week’s workgroup, you will continue to work with your group on Assignment 1. In particular, you will build your multiple linear regression model. You will build up your final model in steps:

  • Begin with a single predictor variable
  • Add additional predictors
  • Add complexity to the right hand side of your model (e.g., interaction terms to test for moderation)
  • Compare models using the various model fit criteria discussed in the lecture

Finally, you will interpret the results of your optimal model.
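
The stepwise procedure described above could be sketched as follows with the built-in `mtcars` data. The fit criteria shown here (R-squared, the F-test for nested models, AIC, and BIC) are common choices and are assumed for this example. The lecture may emphasize a different set.

```r
# Step 1: a single predictor
m1 <- lm(mpg ~ wt, data = mtcars)

# Step 2: add a second predictor
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# Step 3: add an interaction term to test for moderation
m3 <- lm(mpg ~ wt * hp, data = mtcars)

# Proportion of explained variance for each model
models <- list(m1 = m1, m2 = m2, m3 = m3)
r2 <- sapply(models, function(m) summary(m)$r.squared)
print(r2)

# F-tests for the change in fit between successive nested models
anova(m1, m2, m3)

# Information criteria: lower values indicate a better trade-off
# between fit and complexity
AIC(m1, m2, m3)
BIC(m1, m2, m3)
```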

You can find the slides for this workgroup meeting here.

R practical

This week’s R practical is about linear regression and model comparison. The practical also includes more practice with data visualization.

  • Complete Practical 4.
  • Submit your RMD script and the compiled HTML file it produces to the SurfDrive folder for this practical before the next lecture.
    • Name the files your_name_4.Rmd and your_name_4.html, respectively, where your_name is your full name in lower snake case and the 4 indicates Practical 4.

You can find an additional R code demonstration script here.

  • You don’t need to submit anything related to this demonstration script.
  • This script simply provides more information on how to implement this week’s topics in R.

Answers

You can find suggested answers to the practical here:

Week 5

Lecture

This week, we wrap up our discussion of the linear model by considering how we can check if our model results are trustworthy.

  • Assumptions of the linear model
  • Regression diagnostics
  • Outliers, high-leverage cases, and influential observations
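
As a rough sketch of what such checks can look like in R, the example below computes standardized residuals, leverage values, and Cook's distances for a model fitted to the built-in `mtcars` data. The cutoffs are common rules of thumb that are assumed here for illustration. They are not official course criteria.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)          # number of observations
p <- length(coef(fit))  # number of estimated coefficients (incl. intercept)

std_res  <- rstandard(fit)       # standardized residuals (potential outliers)
leverage <- hatvalues(fit)       # leverage: unusual predictor values
cooks_d  <- cooks.distance(fit)  # influence on the fitted coefficients

# Flag cases using rule-of-thumb cutoffs (assumptions for this sketch)
flagged <- data.frame(
  outlier       = abs(std_res) > 2,
  high_leverage = leverage > 2 * p / n,
  influential   = cooks_d > 4 / n
)

# Show only the cases flagged on at least one criterion
flagged[rowSums(flagged) > 0, ]

# Residual plots for checking linearity, constant variance, and normality:
# plot(fit, which = 1:2)
```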

You can find the lecture slides here.

Required Reading

These readings are exam materials.

Workgroup

In this week’s workgroup, you will:

  • Check the assumptions of your model
  • Check for the influence of unusual observations
  • Make adjustments to your data or model, if necessary
  • Draw your final conclusions

You can find the slides for this workgroup meeting here.

Deadlines

You must submit Assignment 1 by Monday December 18, 17:00.

  • See the Course Manual page for details.

R practical

This week’s R practical guides you through the various assumptions of the linear model as well as checks for outliers and influential cases.

  • Download the dataset for this practical from the SurfDrive datasets folder.
  • Complete Practical 5.
  • Submit your RMD script and the compiled HTML file it produces to the SurfDrive folder for this practical before the next lecture.
    • Name the files your_name_5.Rmd and your_name_5.html, respectively, where your_name is your full name in lower snake case and the 5 indicates Practical 5.

Answers

You can find suggested answers to the practical here:

Week 6

Lecture

This week’s lecture covers the generalized linear model and the basics of logistic regression.

  • Overview of the GLM
  • Probabilities, odds, and odds-ratios
  • Definition of the logistic regression model
  • Interpreting logistic regression estimates
  • Classification using logistic regression
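
The following sketch fits a logistic regression with `glm()` to the built-in `mtcars` data and converts the estimates to odds ratios and predicted probabilities. The model and the hypothetical new observation (`new_car`) are chosen only for illustration.

```r
# Logistic regression: binomial random component, logit link,
# and a linear predictor based on weight
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

# Coefficients are on the log-odds (logit) scale
b <- coef(fit)
print(b)

# Exponentiating gives odds ratios: the multiplicative change in the odds
# of a manual transmission per one-unit increase in weight
odds_ratios <- exp(b)
print(odds_ratios)

# Predictions for a hypothetical car with wt = 3 (i.e., 3000 lbs)
new_car <- data.frame(wt = 3)
eta   <- predict(fit, newdata = new_car, type = "link")      # log-odds
p_hat <- predict(fit, newdata = new_car, type = "response")  # probability

# The logistic function (plogis) maps the linear predictor to a probability
all.equal(unname(p_hat), unname(plogis(eta)))
```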

You can find the lecture slides here.

Required Reading

These readings are exam materials.

Workgroup

In this week’s workgroup, you will start working on Assignment 2.

  • You will perform final data inspection and cleaning for the dataset that you chose for Assignment 2.
  • You will formulate your research questions.
  • You will fit and interpret some logistic regression models to test your research question.

You can find the slides for this workgroup meeting here.

R practical

This week’s R practical guides you through the basics of logistic regression analyses.

  • Download the dataset for this practical from the SurfDrive datasets folder.
  • Complete Practical 6.
  • Submit your RMD script and the compiled HTML file it produces to the SurfDrive folder for this practical before the next lecture.
    • Name the files your_name_6.Rmd and your_name_6.html, respectively, where your_name is your full name in lower snake case and the 6 indicates Practical 6.

Answers

You can find suggested answers to the practical here:

Week 7

Lecture

This week, we will cover the assumptions of logistic regression and evaluating classification performance via confusion matrices.
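
As a minimal sketch, a confusion matrix for a logistic regression model could be built as follows. The 0.5 classification threshold and the choice of model are assumptions made for this illustration.

```r
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Predicted probabilities for the observed cases
p_hat <- fitted(fit)

# Classify each case as 1 when its predicted probability is at least 0.5
predicted <- factor(ifelse(p_hat >= 0.5, 1, 0), levels = c(0, 1))
observed  <- factor(mtcars$am, levels = c(0, 1))

# Confusion matrix: rows are predicted classes, columns are observed classes
cm <- table(predicted = predicted, observed = observed)
print(cm)

# Summary measures derived from the confusion matrix
accuracy    <- sum(diag(cm)) / sum(cm)         # proportion classified correctly
sensitivity <- cm["1", "1"] / sum(cm[, "1"])   # true positive rate
specificity <- cm["0", "0"] / sum(cm[, "0"])   # true negative rate

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```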

You can find the lecture slides here.

Required Reading

These readings are exam materials.

Workgroup

In this week’s workgroup, you will continue to work on Assignment 2. You will check the assumptions of your model and make adjustments, if necessary. Also, you will interpret the confusion matrix of your model.

You can find the slides for this workgroup meeting here.

R practical

This week’s R practical guides you through the process of checking the assumptions of the logistic regression model and evaluating classification performance.

  • Download the dataset for this practical from the Surfdrive datasets folder.
  • Complete Practical 7.
  • Submit your RMD script and the compiled HTML file it produces to the SurfDrive folder for this practical before the next lecture.
    • Name the files your_name_7.Rmd and your_name_7.html, respectively, where your_name is your full name in lower snake case and the 7 indicates Practical 7.

Answers

You can find suggested answers to the practical here:

Week 8

Lecture

In this week’s lecture, we will wrap up the course, and I’ll give an overview of the main points we’ve covered.

You can find the lecture slides here.

Workgroup

In this week’s workgroup, you will finalize your second group project. First, you will confirm that you have performed all steps as discussed last week. If you have time left, you can fine-tune the interpretation of your results and polish the figures and tables in your markdown document.

  • You can find the slides for this workgroup meeting here.

Deadlines

You must submit Assignment 2 by Thursday January 18, 17:00.

  • See the Course Manual page for details.

R practical

There is no R practical this week. Use your time to finish the second assignment and prepare for the exam.

Exam Material

Practice Exam

You can find the practice exam here.

What can be tested?

Anything mentioned in the lectures may appear on the exam.

  • This includes both information printed in the lecture slides and information delivered verbally during a lecture itself.

Anything covered in the required readings may appear on the exam.

  • Obviously, this is a lot of material. To prioritize, keep in mind that topics mentioned in the lectures and topics directly related to ideas covered in the lectures are most likely to appear on the exam.

Summary of Required Readings

| Week # | Topic | Reading |
| --- | --- | --- |
| 1 | The basics of R | R4DS: Chapter 11, Chapter 27, Chapter 19, and Chapter 21 |
| 2 | Programmatic data manipulation 1 | R4DS: Chapter 5, Chapter 10, Chapter 14 (only 14.1 and 14.2), Chapter 15, Chapter 18, and Chapter 20 |
| 3 | Programmatic data manipulation 2 | R4DS: Chapter 3 and Chapter 7; ASWR: Chapter 4 |
| 4 | Multiple linear regression | ASWR: Chapter 7 (only 7.1–7.4), Chapter 9 (only 9.1–9.4), Chapter 11 (only 11.1–11.3), and Chapter 16 (only 16.1, excluding 16.1.4, and 16.2) |
| 5 | Model assumptions and diagnostics | ASWR: Chapter 13 |
| 6 | Generalized linear model and logistic regression | ASWR: Chapter 17 (only 17.1–17.3); This webpage |
| 7 | Logistic regression assumptions and classification | ASWR: Chapter 17 (only 17.4); This webpage |

What about equations?

This is not a math class; we are not trying to test your ability to do calculations or manipulate equations. That being said, a certain degree of mathematical literacy is crucial to statistics and data science, so you will have to do some simple calculations on the exam. For example, you should be comfortable with the following.

  • Working with the linear regression equation, \(Y_i = \beta_0 + \beta_1 X_i +\varepsilon_i\), to:
    • Calculate predicted values given certain inputs
    • Interpret parameter estimates
    • Evaluate hypotheses/research questions
  • The differences between:
    • The full regression model: \(Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i +\hat{\varepsilon}_i\)
    • The equation for the best-fit line: \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\)
  • The definition of a residual:
    • \(\hat{\varepsilon}_i = Y_i - \hat{Y}_i\)
  • The relationship between probabilities (\(p\)) and odds (for binary outcomes):
    • \(\text{odds} = \frac{p}{1-p}\)
  • The definition of the logit function:
    • \(\ln(\text{odds}) = \ln\left(\frac{p}{1-p}\right) = \text{logit}(p)\)
  • The definition of the logistic function, its relation to the logit function, and its role in logistic regression:
    • \(\text{logistic}(\eta_i) = \text{logit}^{-1}(\eta_i) = \frac{\exp(\eta_i)}{1+\exp(\eta_i)} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 X_i)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 X_i)} = \hat{p}_i\)
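
To make these calculations concrete, the sketch below works through the formulas above in R. All numbers (the estimates `b0` and `b1`, the data `x` and `y`, and the probability `p`) are made up for this illustration.

```r
# Made-up estimates for the best-fit line
b0 <- 2
b1 <- 0.5

# Made-up observations
x <- c(1, 2, 3)
y <- c(2.4, 3.1, 3.4)

# Predicted values: Y-hat = b0 + b1 * X
y_hat <- b0 + b1 * x   # 2.5, 3.0, 3.5

# Residuals: observed minus predicted
res <- y - y_hat       # -0.1, 0.1, -0.1

# Probabilities, odds, and the logit
p       <- 0.8
odds    <- p / (1 - p)  # 4
logit_p <- log(odds)    # the natural log of 4

# The logistic function inverts the logit and returns the probability
p_back <- exp(logit_p) / (1 + exp(logit_p))

# Base R provides both functions: qlogis() is the logit, plogis() the logistic
all.equal(qlogis(p), logit_p)
all.equal(plogis(logit_p), p_back)
```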

Of course, you should also be able to do basic arithmetic operations that are too trivial to detail here (e.g., calculating the difference between the \(R^2\) statistics from two models that you are trying to compare).

Note: Although all examples above are shown in terms of simple linear regression models, you should also be able to do these calculations/interpretations using multiple linear regression models and models that include dummy codes and interactions.
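
As an illustration of such an extension, consider a model with one continuous predictor \(X_i\), one dummy code \(D_i\), and their interaction. The symbols \(D_i\), \(\beta_2\), and \(\beta_3\) are introduced here only for this example. Substituting the two values of the dummy code gives a separate best-fit line for each group:

```latex
\[
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 X_i D_i + \varepsilon_i \\
D_i = 0: \quad \hat{Y}_i &= \hat{\beta}_0 + \hat{\beta}_1 X_i \\
D_i = 1: \quad \hat{Y}_i &= (\hat{\beta}_0 + \hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_3) X_i
\end{aligned}
\]
```

Here, \(\hat{\beta}_2\) is the estimated difference in intercepts between the two groups, and \(\hat{\beta}_3\) is the estimated difference in slopes (i.e., the moderation effect).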

What if you’re still unsure?

If any of the course materials confuse you, feel free to ask about them during the final lecture meeting (even if your concerns relate to content from earlier weeks).

We will devote part (most?) of the final lecture to a dedicated Q&A session.