Quick Overview

In nine weeks, you will learn the basics of data handling with R and the details of regression techniques in the context of statistical inference. We will also connect these concepts to research philosophy. Each lecture covers a different theoretical topic. Alongside the lectures, a weekly computer lab exercise connects the statistical theory to practice, and in weekly workgroup meetings you will work on motivating, real-world case studies.

Assignment and Grading

The final grade is computed as follows:

| Grade Component | Weight |
| --- | --- |
| Linear Regression Assignment | 25% |
| Logistic Regression Assignment | 25% |
| Written Exam | 50% |

In addition to the grade components listed above, you will also do R exercises for the first 7 weeks of the course. These exercises will develop the skills needed to successfully complete the assignments.

To pass the course:

  1. Your final exam grade must be 5.5 or higher.
  2. Both of your assignment grades must be 5.5 or higher.
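
The weighting and pass rules above can be sketched in R. This is only an illustration: the helper `final_grade()` and the example grades are hypothetical, the 10-point scale is an assumption, and no rounding is applied because the text does not specify a rounding rule.

```r
# Weights taken from the grade table above.
weights <- c(linear = 0.25, logistic = 0.25, exam = 0.50)

# Hypothetical helper: weighted final grade plus the two pass conditions.
final_grade <- function(linear, logistic, exam) {
  grades <- c(linear = linear, logistic = logistic, exam = exam)
  final  <- sum(weights * grades[names(weights)])
  # Pass rule: the exam and both assignments must each be 5.5 or higher.
  passed <- exam >= 5.5 && linear >= 5.5 && logistic >= 5.5
  list(final = final, passed = passed)
}

# Example with made-up grades (assumed 10-point scale).
result <- final_grade(linear = 7.0, logistic = 6.0, exam = 6.5)
result$final   # 0.25 * 7.0 + 0.25 * 6.0 + 0.50 * 6.5 = 6.5
result$passed  # TRUE
```

Note that a high weighted average does not compensate for a failing component: with these rules, `final_grade(9, 9, 5)` yields a final grade of 7 but `passed` is `FALSE`.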

Schedule

| Week # | Topic | R Exercise | Workgroup | Reading |
| --- | --- | --- | --- | --- |
| 1 | The elemental building blocks of R | Assigning values to objects; Creating vectors, matrices, data frames, and lists | Receive instructions and form groups; Locate a data set for predictive modeling | |
| 2 | Data manipulation; Least squares | Data manipulation; Using pipes to simplify workflows | Get approval for data; Beginning data processing, cleaning, and exploration; Formulate a research hypothesis | |
| 3 | Linear model 1; Data visualization | The lm() function in R; Visualizing bivariate relations | Specify a linear model; Fit your defined model | ISL 3.1 & 6.1, Blog Post 1, Blog Post 2, Lecture Notes |
| 4 | Linear model 2; Assumptions; Diagnostics | Investigating the assumptions of the linear model | Check the assumptions of your model; Use your model to test your hypotheses; Continue the project in rmarkdown | ISL 3.2 – 3.4 |
| 5 | Model building; Prediction; Cross-validation | Tying the analytic pieces together into a full regression analysis | Evaluate and, if possible, improve your model; Prepare Assignment 1; Evaluate the final linear model on your own data | ISL 5.1, Document |
| 6 | Generalized linear model; Logistic regression 1 | The glm() function in R; Logistic regression modeling; Classification | Formulate a research hypothesis and define a logistic model; Fit your defined model | ISL 4.1 – 4.3 (except 4.3.5), Webpage |
| 7 | Logistic regression 2 | Finish exercise from last week | Check the assumptions of your model; Use your model to test your hypotheses | Webpage |
| 8 | Summary, catch-up, and questions | None | Evaluate and, if possible, improve your model; Prepare Assignment 2; Evaluate the final logistic model on your own data | |

Course Manual

Course Description

Regression techniques are widely used to quantify the relationship between two or more variables, and investigating such relations is common in data science. Linear and logistic regression are well-established and powerful techniques for analyzing the relations between a set of (predictor) variables and a single (outcome) variable. However, you must understand how and when it is appropriate to apply these regression techniques before you can use them in any beneficial way. In this course, you will learn exactly that: how and when to apply linear and logistic regression with the statistical software package R.

This course gives students a new set of tools that they can apply to real-world data to explore interesting issues and problems. The course will introduce students to the principles of analytic data science, linear and logistic regression, and the basics of statistical learning. These techniques will be presented in the context of estimation, testing, and prediction. Students will learn to think carefully and critically about statistical inference, quantifying uncertainty, and measuring the accuracy of statistical estimates. Students will also develop fundamental R programming skills and will gain experience with the tidyverse: visualizing data with ggplot2 and performing basic data wrangling with dplyr. This course will prepare students for basic research tasks (e.g., as a junior researcher or research assistant) or for further education in research, such as a (research) Master's program.
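
As a first impression of the two techniques named above, the following minimal sketch fits one linear and one logistic regression in base R. The built-in `mtcars` data and the chosen variables are assumptions made purely for illustration; they are not the course's data or assignments.

```r
# mtcars ships with R; it stands in for a real project data set here.
data(mtcars)

# Linear regression: continuous outcome (miles per gallon) on car weight.
fit_lm <- lm(mpg ~ wt, data = mtcars)
coef(fit_lm)

# Logistic regression: binary outcome (am: 0 = automatic, 1 = manual) on weight.
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
coef(fit_glm)

# Predicted probability of a manual transmission for a car with wt = 3
# (wt is recorded in units of 1000 lbs).
p_manual <- predict(fit_glm, newdata = data.frame(wt = 3), type = "response")
p_manual
```

Calling `summary(fit_lm)` or `summary(fit_glm)` prints the estimates together with standard errors and test statistics, which is the kind of software output the course teaches you to interpret.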

Assignments

Students will form groups to work on two assignments. Students will need to perform calculations and write R code for these assignments. All work must be combined into an understandable and insightful R project and must be submitted to the Surfdrive file drop environment.

Each assignment will be graded on the quality of the following components:

  1. The methodological application
  2. The model evaluation and assumption checking
  3. The code and scripts

Grading

Students will be evaluated on the following aspects:

  1. Apply and interpret the basic methodological and statistical concepts underlying predictive and/or inferential research.

    a. Explain concepts from inferential statistics, such as probability, inference, and modeling; apply these concepts in practice.
    b. Make an informed choice of research designs that are suitable for regression analyses.
    c. Apply and explain the choice of techniques for investigating data problems.
    d. Apply and explain the concepts of linearity and non-linearity.
    e. Interpret statistical software output, and report software output following APA reporting guidelines.
    f. Explain and conceptualize statistical inference and its relation to statistical theory.
    g. Perform the different steps of solving basic regression analysis problems and report on these steps.

  2. Apply and interpret important techniques in linear and logistic regression analysis.

    a. Perform, interpret, and evaluate quantitative (causal) analyses on data with the statistical software platform R.
    b. Perform analyses in statistical software.

Relation between Assessment and Objective

In this course, skills and knowledge are evaluated in three separate ways:

  • The exam evaluates the knowledge of methodological and statistical concepts (learning goals 1a, 1d, 1f), as well as the application of these concepts to research scenarios (learning goals 1b and 1c). During the exam, students will need to interpret statistical software output (learning goal 1e).

  • The practical labs test if the student has sufficient skills to solve basic analysis problems and execute quantitative analyses on real-life data sets (learning goals 2a and 2b).

  • The workgroups focus on applying the newly gained knowledge and skills to solving relevant data analysis problems and reporting on the steps taken to obtain a solution (learning goal 1g).

Preparation

Hello All,

This semester, you will participate in the Fundamental Techniques in Data Science with R course at Utrecht University. In this course, you will use both R and RStudio. The steps below guide you through installing both. Please complete them before the first meeting.

Regards,
Instructor Team

System requirements

Bring a laptop computer to the course and make sure that you have full write access and administrator rights on the machine. We will explore programming and compiling in this course, so you will need full access to your machine. Some corporate laptops come with limited access for their users; we therefore advise you to bring a personal laptop to the workgroup meetings.

1. Install R

You can obtain a copy of R here. We won’t use R directly in the course. Rather, we’ll call R through RStudio. Therefore, you also need to install RStudio.

2. Install RStudio Desktop

RStudio is an Integrated Development Environment (IDE) for R. You can download RStudio as stand-alone software here. The free and open-source RStudio Desktop version is sufficient.

3. Start RStudio and install the following packages.

Open RStudio, and copy-paste the following lines of code into the console window to execute them.

  • If nothing happens after you paste the code, try hitting the “Enter/Return” key.
# Install the packages used in the course, together with their dependencies.
install.packages(c("ggplot2",
                   "tidyverse",
                   "magrittr",
                   "micemd",
                   "jomo",
                   "pan",
                   "lme4",
                   "knitr",
                   "rmarkdown",
                   "plotly",
                   "shiny",
                   "devtools",
                   "boot",
                   "class",
                   "car",
                   "MASS",
                   "ggplot2movies",
                   "ISLR",
                   "DAAG",
                   "mice"),
                 dependencies = TRUE)
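
Once the installation has finished, you may want to confirm that every package is available. The following is an optional sketch rather than an official course step; the object names `pkgs`, `installed`, and `missing_pkgs` are made up for this example.

```r
# Package names copied from the install.packages() call above.
pkgs <- c("ggplot2", "tidyverse", "magrittr", "micemd", "jomo", "pan",
          "lme4", "knitr", "rmarkdown", "plotly", "shiny", "devtools",
          "boot", "class", "car", "MASS", "ggplot2movies", "ISLR",
          "DAAG", "mice")

# requireNamespace() returns TRUE when a package can be loaded,
# without attaching it to the search path.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report any packages that still need to be installed.
missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) == 0) {
  message("All course packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

If some packages are reported as missing, re-running `install.packages()` for just those names is usually enough.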

If you are not sure where to paste the code, use the following figure to identify the console: