Quick Overview

In nine weeks, you will learn the basics of data handling with R and the details of regression techniques in the context of statistical inference. We will also cover how these concepts connect to research philosophy. Each lecture covers a different theoretical topic. In addition to the lectures, a weekly computer lab exercise connects the statistical theory to practice. You will also attend weekly workgroup meetings, in which you will work on motivating, real-world case studies.

Assignment and Grading

The final grade is computed as follows:

| Grade Component | Weight |
| --- | --- |
| Group assignment 1: Linear regression | 25% |
| Group assignment 2: Logistic regression | 25% |
| Written exam | 50% |

In addition to the grade components listed above, you will also do R exercises for the first 7 weeks of the course. These exercises will develop the skills needed to successfully complete the assignments.

To pass the course:

  1. Your final exam grade must be 5.5 or higher
  2. Both of your assignment grades must be 5.5 or higher
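
The weighting and pass rule above can be sketched in R as follows. This is a minimal illustration rather than the official grading procedure: the function name `final_grade` and the example grades are hypothetical, and any rounding rules are not specified here, so none are applied.

```r
# Hypothetical helper: combine the three graded components using the
# weights listed above (25%, 25%, 50%).
final_grade <- function(assignment1, assignment2, exam) {
  weights <- c(assignment1 = 0.25, assignment2 = 0.25, exam = 0.50)
  grades  <- c(assignment1 = assignment1, assignment2 = assignment2, exam = exam)

  # Weighted sum of the three components.
  weighted <- sum(weights * grades)

  # Pass rule as stated above: every component must be 5.5 or higher.
  passed <- all(grades >= 5.5)

  list(grade = weighted, passed = passed)
}

# Made-up example grades, for illustration only.
result <- final_grade(assignment1 = 7.0, assignment2 = 6.0, exam = 6.5)
result$grade   # 0.25 * 7.0 + 0.25 * 6.0 + 0.50 * 6.5 = 6.5
result$passed  # TRUE
```

Note that a high weighted grade does not compensate for a single component below 5.5: `final_grade(8, 5, 9)$passed` is `FALSE`.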

Attendance

During this course, you will attend 8 workgroup sessions and hand in 7 practical assignments. We expect you to attend at least 7 out of 8 workgroup sessions and to hand in at least 6 out of 7 practical assignments (before the deadline). If you do not meet these requirements, you lose the right to resit the exam.

Literature

We will use two open-source books in this course:

  1. R for Data Science (R4DS)
    • You can find solutions for the R4DS exercises here.
  2. Applied Statistics with R (ASWR)

There is no need to purchase these books. The freely available online versions are sufficient. The relevant chapters will be linked in this dashboard where the reading is assigned. We will also use several external webpages and web apps. These resources will also be linked in this dashboard.

Schedule

| Week # | Topic | R Exercise | Workgroup | Reading |
| --- | --- | --- | --- | --- |
| 1 | The basics of R | How to work with R via scripts, projects, and markdown; How to import external data into R; How to write your own functions; How to iterate over repetitive tasks | Form groups; Search for a dataset for the two group assignments; Formulate research questions | R4DS: Chapter 11, Chapter 27, Chapter 19, and Chapter 21 |
| 2 | Programmatic data manipulation 1 | Data types and objects in R; Data transformation; Working with pipes | Perform data transformations on your found dataset | R4DS: Chapter 5, Chapter 10, Chapter 14 (only 14.1 and 14.2), Chapter 15, Chapter 18, and Chapter 20 |
| 3 | Programmatic data manipulation 2 | Data visualization; Data inspection; Data cleaning | Continue with data inspection and cleaning | R4DS: Chapter 3 and Chapter 7; ASWR: Chapter 4 |
| 4 | Multiple linear regression | Estimating linear models in R using the lm() function; Model fit and model comparison; Categorical predictors; Moderation | Find a best fitting model; Test your hypotheses | ASWR: Chapter 7 (only 7.1–7.4), Chapter 9 (only 9.1–9.4), Chapter 11 (only 11.1–11.3), and Chapter 16 (only 16.1, excluding 16.1.4, and 16.2) |
| 5 | Model assumptions and diagnostics | Assumptions of the linear model; Leverage, outliers, and influential cases | Check assumptions of your model and inspect for unusual observations; Make adjustments if necessary; Draw conclusions; Submit Assignment 1 | ASWR: Chapter 13 |
| 6 | Generalized linear model and logistic regression | Estimating generalized linear models using the glm() function in R; Definition, estimation, and interpretation of logistic regression models | Perform data inspection and cleaning for the second assignment; Formulate hypothesis; Find a best fitting model and test your hypotheses | ASWR: Chapter 17 (only 17.1–17.3); This webpage |
| 7 | Logistic regression assumptions and classification | Logistic regression assumptions; Classification; Confusion matrix | Check the assumptions of your model and make adjustments if necessary; Make classifications | ASWR: Chapter 17 (only 17.4); This webpage |
| 8 | Summary, catch-up, and questions | - | Interpret your final model as well as the confusion matrix; Draw conclusions; Submit Assignment 2 | - |

Course Manual

Course Content

Regression techniques are widely used to quantify the relationship between two or more variables. In data science, linear and logistic regression are common and powerful techniques for evaluating such relations. These techniques are only useful, however, once you understand when and how to apply them. In this course, students will learn how to apply linear and logistic regression with the R statistical software package.

This course will introduce students to the principles of analytical data science, linear and logistic regression, and the basics of statistical learning. Students will develop fundamental R programming skills and will gain experience with the tidyverse: visualizing data with ggplot2 and performing basic data wrangling with dplyr. This course helps prepare students for an entry-level research career (e.g., junior researcher or research assistant) or further education in research (e.g., a (research) Master's program or a PhD).

Course goals

At the end of this course, students will be able to:

  1. Identify key statistical concepts such as:
    • (Conditional) probability
    • Inference
    • Estimation
    • Prediction
    • Classification
    • Sampling variability
    • Statistical modeling
    • Residuals
    • Fitted values
  2. Choose an appropriate regression model for a given research scenario.
  3. Explain the differences/similarities between statistical inference and model-based prediction/classification; give examples of each type of problem.
  4. Identify the assumptions of linear and logistic regression; describe the consequences of violating these assumptions.
  5. Describe the three components of a generalized linear model and how these components are specified in logistic regression.
  6. Interpret the estimates from linear and logistic regression models, and use these estimates to answer research questions.
  7. Use the R statistical software platform to perform basic statistical programming, data manipulation, data visualization, and basic data wrangling.
  8. Use the R statistical software platform to perform, interpret, and evaluate linear and logistic regression analyses on real-world data.
  9. Interpret R output and use the results to answer research questions.
  10. Use R Markdown to document the results of a statistical analysis.

Relation between assessment and objective

In this course, skills and knowledge are evaluated with two types of assessment.

  1. The exam evaluates knowledge and understanding of statistical concepts (Learning goal 1), the ability to critically evaluate research problems and statistical methods (Learning goals 2–5), and the ability to interpret statistical results and software output and apply these interpretations (Learning goals 6 & 9).
  2. The group assignments evaluate the student’s ability to work with data, solve basic data analytic problems, execute quantitative data analyses on real-world data sets, and document the results (Learning goals 6–10).

Course structure

In eight weeks, you will learn the basics of data handling and statistical programming with R and details about regression techniques in the context of statistical inference, prediction, and classification. Each week will comprise three class activities:

  1. During the weekly lectures, we will cover the theoretical content.
  2. Weekly practical exercises connect the statistical theory to practice by applying the lecture content in the R statistical programming language.
  3. During the weekly workgroup meetings, you will work on real-world data analysis with a group of your peers.

Attendance

During this course, you will attend 8 workgroup sessions and hand in 7 practical assignments. We expect you to attend at least 7 out of 8 workgroup sessions and to hand in at least 6 out of 7 practical assignments (before the deadline). If you do not meet these requirements, you lose the right to resit the exam.

Group assignment 1: Linear regression

Type of assignment: Group (4 students)

Grading: 25% of your final grade

Deadline: Monday December 18, 17:00

What to submit: A ZIP archive containing the complete R project (dataset, RMD, HTML)

Where to submit: This Surfdrive folder

Description: For this assignment, you will perform and report a multiple linear regression analysis in an R Markdown document. The assignment will be graded on the following five dimensions:

  1. Preliminaries: Introduction of your research questions, description and potential processing of your data.
  2. Model estimation: Description of the model estimates, model fit, and model comparison procedure.
  3. Assumptions: Testing of model assumptions, checking for influential cases. Act upon and/or reflect on violations when needed.
  4. Interpretation: Substantive interpretation of the final model. Answering your research question.
  5. Layout: Structure of the document, efficiency of output presentation, use of custom functions (when applicable). Presentation of suitable visualizations.
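
As a rough orientation, the estimation, comparison, diagnostic, and interpretation steps listed above might look like the sketch below in base R. It is only an illustration under stated assumptions: the built-in `mtcars` data stands in for your own dataset, and the choice of outcome (`mpg`) and predictors (`wt`, `hp`) is arbitrary, not part of the assignment.

```r
# Illustrative only: the built-in mtcars data stands in for your own dataset.
data(mtcars)

# Model estimation: two nested candidate models (arbitrary example predictors).
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp, data = mtcars)

# Model fit and comparison: R-squared and an F-test for the added predictor.
summary(fit_large)$r.squared
comparison <- anova(fit_small, fit_large)
comparison

# Influential cases: Cook's distance for every observation, largest first.
# (Calling plot(fit_large) would additionally draw the standard diagnostic plots.)
cooks <- cooks.distance(fit_large)
head(sort(cooks, decreasing = TRUE), 3)

# Interpretation: coefficient estimates with 95% confidence intervals.
coef(fit_large)
confint(fit_large)
```

In an actual report, each of these steps would be accompanied by text that explains the choices made and interprets the output in terms of the research question.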

Group assignment 2: Logistic regression

Type of assignment: Group (4 students)

Grading: 25% of your final grade

Deadline: Thursday January 18, 17:00

What to submit: A ZIP archive containing the complete R project (dataset, RMD, HTML)

Where to submit: This Surfdrive folder

Description: For this assignment, you will perform and report a multiple logistic regression analysis in an R Markdown document. The assignment will be graded on the following five dimensions:

  1. Preliminaries: Introduction of your research questions, description and potential processing of your data.
  2. Model estimation: Description of the model estimates, model fit, and model comparison procedure.
  3. Assumptions: Testing of model assumptions, checking for influential cases. Act upon and/or reflect on violations when needed.
  4. Interpretation: Substantive interpretation of the final model (including the confusion matrix). Answering your research question.
  5. Layout: Structure of the document, efficiency of output presentation, use of custom functions (when applicable). Presentation of suitable visualizations.
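
The sketch below illustrates, in base R, how a logistic regression and its confusion matrix could be produced. It rests on assumptions that are not part of the assignment: the built-in `mtcars` data replaces your own dataset, the binary outcome `am` and predictor `wt` are arbitrary, and the 0.5 classification threshold is only a common default.

```r
# Illustrative only: mtcars stands in for your own data; `am` (0 = automatic,
# 1 = manual) serves as the binary outcome.
data(mtcars)

# Logistic regression as a GLM: binomial random component with a logit link.
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Estimates on the log-odds scale, and exponentiated to odds ratios.
coef(fit)
exp(coef(fit))

# Classification: predicted probabilities, cut at an assumed threshold of 0.5.
prob <- predict(fit, type = "response")
pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))

# Confusion matrix: rows are predicted classes, columns are observed classes.
conf_mat <- table(predicted = pred,
                  observed  = factor(mtcars$am, levels = c(0, 1)))
conf_mat

# Proportion of correctly classified cases.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Fixing the factor levels to `c(0, 1)` keeps the table 2 x 2 even if the model never predicts one of the classes.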

Preparation

This semester, you will participate in the Fundamental Techniques in Data Science with R course at Utrecht University. In this course, you will use both R and RStudio. The steps below will guide you through installing both R and RStudio. Please do so before the first meeting.

System requirements

Bring a laptop computer to the course, and make sure that you have full write access and administrator rights on the machine. We will explore programming and compiling in this course, so you will need full access to your machine. Some corporate laptops restrict what their users can do, so I advise you to bring a personal laptop to the workgroup meetings.

1. Install R

You can obtain a copy of R here. We won’t use R directly in the course. Rather, we’ll call R through RStudio. Therefore, you also need to install RStudio.

2. Install RStudio Desktop

RStudio is an Integrated Development Environment (IDE) for R. You can download RStudio as stand-alone software here. The free and open-source RStudio Desktop version is sufficient.

3. Start RStudio and install the following packages.

Open RStudio, and copy-paste the following lines of code into the console window to execute them.

  • If nothing happens after you paste the code, try hitting the “Enter/Return” key.
# Install the packages used in this course (run once in the RStudio console).
install.packages(c("ggplot2",
                   "tidyverse",
                   "magrittr",
                   "micemd",
                   "jomo",
                   "pan",
                   "lme4",
                   "knitr",
                   "rmarkdown",
                   "plotly",
                   "devtools",
                   "class",
                   "car",
                   "MASS",
                   "ISLR",
                   "mice"),
                 dependencies = TRUE)
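
Once the installation has finished, a quick check like the one below can show whether any package failed to install. This is an optional sketch, not part of the official instructions: `missing_packages` is a hypothetical helper name, and the vector `course_pkgs` lists only a subset of the packages above as an example.

```r
# Hypothetical helper: return the names of packages that cannot be found.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package is installed and loadable.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Example subset of the course packages; extend with the full list as needed.
course_pkgs <- c("tidyverse", "ggplot2", "magrittr", "knitr",
                 "rmarkdown", "car", "MASS", "mice")

# An empty result, character(0), means every listed package was found.
missing_packages(course_pkgs)
```

Any names that are returned can be passed to `install.packages()` again.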

If you are not sure where to paste the code, use the following figure to identify the console: