In nine weeks, you will learn the basics of data handling with R and details about regression techniques in the context of statistical inference. We will also cover the connection between these concepts and research philosophy. During every lecture, we will cover a different theoretical topic. In addition to the lectures, there will also be a weekly computer lab exercise that connects the statistical theory to practice. You will also attend weekly workgroup meetings wherein you will work on solving motivating, real-world case studies.

The final grade is computed as follows:

Grade Component | Weight |
---|---|
Group assignment 1: Linear regression | 25% |
Group assignment 2: Logistic regression | 25% |
Written exam | 50% |

In addition to the grade components listed above, you will also do `R` exercises for the first 7 weeks of the course. These exercises will develop the skills needed to successfully complete the assignments.

To pass the course:

- Your final exam grade must be 5.5 or higher
- Both of your assignment grades must be 5.5 or higher

During this course, you will attend 8 workgroup sessions and hand in 7 practical assignments. We expect you to attend at least 7 out of 8 workgroup sessions and to hand in at least 6 out of 7 practical assignments (before the deadline). If you do not meet these requirements, you lose the right to resit the exam.

We will use two open-source books in this course:

- *R for Data Science* (R4DS): You can find solutions for the R4DS exercises here.
- *Applied Statistics with R* (ASWR)

There is no need to purchase these books. The freely available online versions are sufficient. The relevant chapters will be linked in this dashboard where the reading is assigned. We will also use several external webpages and web apps. These resources will also be linked in this dashboard.

Week # | Topic | `R` Exercise | Workgroup | Reading |
---|---|---|---|---|
1 | The basics of `R` | How to work with `R` via scripts, projects, and markdown; How to import external data into `R`; How to write your own functions; How to iterate repetitive tasks | Form groups; Search for a dataset for the two group assignments; Formulate research questions | R4DS: Chapter 11, Chapter 27, Chapter 19, and Chapter 21 |
2 | Programmatic data manipulation 1 | Data types and objects in `R`; Data transformation; Working with pipes | Perform data transformations on your found dataset | R4DS: Chapter 5, Chapter 10, Chapter 14 (only 14.1 and 14.2), Chapter 15, Chapter 18, and Chapter 20 |
3 | Programmatic data manipulation 2 | Data visualization; Data inspection; Data cleaning | Continue with data inspection and cleaning | R4DS: Chapter 3 and Chapter 7; ASWR: Chapter 4 |
4 | Multiple linear regression | Estimating linear models in `R` using the `lm()` function; Model fit and model comparison; Categorical predictors; Moderation | Find a best fitting model; Test your hypotheses | ASWR: Chapter 7 (only 7.1–7.4), Chapter 9 (only 9.1–9.4), Chapter 11 (only 11.1–11.3), and Chapter 16 (only 16.1, excluding 16.1.4, and 16.2) |
5 | Model assumptions and diagnostics | Assumptions of the linear model; Leverage, outliers, and influential cases | Check assumptions of your model and inspect for unusual observations; Make adjustments if necessary; Draw conclusions; Submit Assignment 1 | ASWR: Chapter 13 |
6 | Generalized linear model and logistic regression | Estimating generalized linear models using the `glm()` function in `R`; Definition, estimation, and interpretation of logistic regression models | Perform data inspection and cleaning for the second assignment; Formulate hypotheses; Find a best fitting model and test your hypotheses | ASWR: Chapter 17 (only 17.1–17.3); This webpage |
7 | Logistic regression assumptions and classification | Logistic regression assumptions; Classification; Confusion matrix | Check the assumptions of your model and make adjustments if necessary; Make classifications | ASWR: Chapter 17 (only 17.4); This webpage |
8 | Summary, catch-up, and questions | - | Interpret your final model as well as the confusion matrix; Draw conclusions; Submit Assignment 2 | - |

Regression techniques are widely used to quantify the relationship
between two or more variables. In data science, linear and logistic
regression are common and powerful techniques for evaluating such
relations. These techniques are only useful, however, once you
understand when and how to apply them. In this course, students will
learn how to apply linear and logistic regression with the `R` statistical software package.

This course will introduce students to the principles of analytical
data science, linear and logistic regression, and the basics of
statistical learning. Students will develop fundamental `R` programming skills and will gain experience with the tidyverse: visualizing data with ggplot2 and performing basic data wrangling with dplyr. This course helps prepare students for an entry-level research career (e.g., junior researcher or research assistant) or further education in research (e.g., a [research] Master's program or a PhD).

At the end of this course, students are able to:

1. Identify key statistical concepts such as:
   - (Conditional) probability
   - Inference
   - Estimation
   - Prediction
   - Classification
   - Sampling variability
   - Statistical modeling
   - Residuals
   - Fitted values
2. Choose an appropriate regression model for a given research scenario.
3. Explain the differences/similarities between statistical inference and model-based prediction/classification; give examples of each type of problem.
4. Identify the assumptions of linear and logistic regression; describe the consequences of violating these assumptions.
5. Describe the three components of a generalized linear model and how these components are specified in logistic regression.
6. Interpret the estimates from linear and logistic regression models, and use these estimates to answer research questions.
7. Use the `R` statistical software platform to perform basic statistical programming, data manipulation, data visualization, and basic data wrangling.
8. Use the `R` statistical software platform to perform, interpret, and evaluate linear and logistic regression analyses on real-world data.
9. Interpret `R` output and use the results to answer research questions.
10. Use `R` Markdown to document the results of a statistical analysis.
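As a preview of the learning goal on generalized linear models, the sketch below shows one way the three components (the random component, the systematic component, and the link function) appear in a `glm()` call for logistic regression. The built-in `mtcars` data and the choice of `am` and `wt` as variables are assumptions made purely for illustration; they are not course material.

```r
# Illustrative sketch only: mtcars ships with base R
data(mtcars)

fit <- glm(
  am ~ wt,                           # systematic component: linear predictor b0 + b1 * wt
  family = binomial(link = "logit"), # random component (binomial) and link function (logit)
  data = mtcars
)

eta <- predict(fit, type = "link")      # linear predictor, on the log-odds scale
p   <- predict(fit, type = "response")  # fitted probabilities

# The inverse logit link maps the linear predictor to probabilities
all.equal(p, plogis(eta))

# Exponentiated coefficients can be read as odds ratios
exp(coef(fit))
```

The same structure carries over to other generalized linear models: changing `family` swaps the random component and link function, while the formula continues to define the systematic component.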

In this course, skills and knowledge are evaluated with two types of assessment.

- The exam evaluates knowledge and understanding of statistical concepts (Learning goal 1), the ability to critically evaluate research problems and statistical methods (Learning goals 2–5), and the ability to interpret statistical results and software output and apply these interpretations (Learning goals 6 & 9).
- The group assignments evaluate the student’s ability to work with data, solve basic data analytic problems, execute quantitative data analyses on real-world data sets, and document the results (Learning goals 6–10).

In eight weeks, you will learn the basics of data handling and statistical programming with R and details about regression techniques in the context of statistical inference, prediction, and classification. Each week will comprise three class activities:

- During the weekly lectures, we will cover the theoretical content.
- Weekly practical exercises connect the statistical theory to practice by applying the lecture content in the R statistical programming language.
- During the weekly workgroup meetings, you will work on real-world data analysis with a group of your peers.

During this course, you will attend 8 workgroup sessions and hand in 7 practical assignments. We expect you to attend at least 7 out of 8 workgroup sessions and to hand in at least 6 out of 7 practical assignments (before the deadline). If you do not meet these requirements, you lose the right to resit the exam.

**Type of assignment:** Group (4 students)

**Grading:** 25% of your final grade

**Deadline:** Monday December 18, 17:00

**What to submit:** A ZIP archive containing the
complete R project (dataset, RMD, HTML)

**Where to submit:** This
Surfdrive folder

**Description:** For this assignment, you perform and
report a multiple *linear* regression analysis in an R markdown
document. The assignment will be graded on the following five
dimensions:

- **Preliminaries:** Introduction of your research questions, description and potential processing of your data.
- **Model estimation:** Description of the model estimates, model fit, and model comparison procedure.
- **Assumptions:** Testing of model assumptions, checking for influential cases. Act upon and/or reflect on violations when needed.
- **Interpretation:** Substantive interpretation of the final model. Answering your research question.
- **Layout:** Structure of the document, efficiency of output presentation, use of custom functions (when applicable). Presentation of suitable visualizations.
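To give a rough sense of the analysis steps behind these dimensions, here is a minimal sketch using the built-in `mtcars` data. The dataset, the variables, and the object names (`dat`, `fit_main`, `fit_int`, `comparison`, `cooks`) are illustrative assumptions and not part of the assignment; your own project will use the dataset and research questions your group selected.

```r
# Illustrative sketch only: mtcars ships with base R
data(mtcars)
dat <- mtcars
dat$am <- factor(dat$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Model estimation: a main-effects model and a model with moderation (interaction)
fit_main <- lm(mpg ~ wt + am, data = dat)
fit_int  <- lm(mpg ~ wt * am, data = dat)

# Model fit and comparison: R-squared and an F-test for the nested models
summary(fit_main)$r.squared
summary(fit_int)$r.squared
comparison <- anova(fit_main, fit_int)
comparison

# Influential cases: the three largest Cook's distances
cooks <- cooks.distance(fit_int)
head(sort(cooks, decreasing = TRUE), 3)

# Interpretation: estimates, standard errors, and tests for the chosen model
coef(summary(fit_int))
```

Calling `plot(fit_int)` additionally produces the standard residual diagnostic plots, which can support the assumption checks.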

**Type of assignment:** Group (4 students)

**Grading:** 25% of your final grade

**Deadline:** Thursday January 18, 17:00

**What to submit:** A ZIP archive containing the
complete R project (dataset, RMD, HTML)

**Where to submit:** This
Surfdrive folder

**Description:** For this assignment, you perform and
report a multiple *logistic* regression analysis in an R markdown
document. The assignment will be graded on the following five
dimensions:

- **Preliminaries:** Introduction of your research questions, description and potential processing of your data.
- **Model estimation:** Description of the model estimates, model fit, and model comparison procedure.
- **Assumptions:** Testing of model assumptions, checking for influential cases. Act upon and/or reflect on violations when needed.
- **Interpretation:** Substantive interpretation of the final model (including the confusion matrix). Answering your research question.
- **Layout:** Structure of the document, efficiency of output presentation, use of custom functions (when applicable). Presentation of suitable visualizations.
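Because the interpretation dimension includes the confusion matrix, the sketch below shows one way to build one from a fitted logistic regression. The `mtcars` data, the predictors, the 0.5 classification threshold, and the object names (`fit_logit`, `conf_mat`, and so on) are all assumptions chosen for illustration; a different threshold may suit your research question better.

```r
# Illustrative sketch only: mtcars ships with base R
data(mtcars)

# Outcome: transmission type (0 = automatic, 1 = manual)
fit_logit <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))

# Predicted probabilities, classified with an assumed threshold of 0.5
prob <- predict(fit_logit, type = "response")
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))
obs  <- factor(mtcars$am, levels = c(0, 1))

# Confusion matrix: predicted classes in rows, observed classes in columns
conf_mat <- table(Predicted = pred, Observed = obs)
conf_mat

# Summary measures derived from the confusion matrix
accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat["1", "1"] / sum(conf_mat[, "1"])
specificity <- conf_mat["0", "0"] / sum(conf_mat[, "0"])
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

Fixing the factor levels for both `pred` and `obs` keeps the table 2 x 2 even if the model happens to predict only one class.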

This semester, you will participate in the **Fundamental Techniques in Data Science with R** course at Utrecht University. In this course, you will use both `R` and `RStudio`. The steps below will guide you through installing both `R` and `RStudio`. Please do so before the first meeting.

Bring a laptop computer to the course, and make sure that you have full write access and administrator rights on the machine. We will explore programming and compiling in this course, so you will need full access to your machine. Some corporate laptops come with limited access for their users, so I advise you to bring a personal laptop to the workgroup meetings.

`R`

You can obtain a copy of `R` here. We won’t use `R` directly in the course. Rather, we’ll call `R` through `RStudio`. Therefore, you also need to install `RStudio`.

`RStudio` Desktop

`RStudio` is an Integrated Development Environment (IDE) for `R`. You can download `RStudio` as stand-alone software here. The free and open-source `RStudio Desktop` version is sufficient.

Open `RStudio`, and copy-paste the following lines of code into the console window to execute them.

- If nothing happens after you paste the code, try hitting the “Enter/Return” key.

```
# Install the packages used in this course, along with their dependencies
install.packages(c("ggplot2",
                   "tidyverse",
                   "magrittr",
                   "micemd",
                   "jomo",
                   "pan",
                   "lme4",
                   "knitr",
                   "rmarkdown",
                   "plotly",
                   "devtools",
                   "class",
                   "car",
                   "MASS",
                   "ISLR",
                   "mice"),
                 dependencies = TRUE)
```
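Once the installation has finished, you may want to confirm that every package arrived. The following optional sketch, which is not part of the official setup steps, compares the requested package names with the packages found in your library; the object names `pkgs` and `missing_pkgs` are illustrative.

```r
# Packages requested in the installation step
pkgs <- c("ggplot2", "tidyverse", "magrittr", "micemd", "jomo", "pan",
          "lme4", "knitr", "rmarkdown", "plotly", "devtools", "class",
          "car", "MASS", "ISLR", "mice")

# Requested packages that are not (yet) present in your library
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (length(missing_pkgs) == 0) {
  message("All course packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

If any packages are reported as missing, rerunning `install.packages()` for just those names is usually enough.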

If you are not sure where to paste the code, use the following figure to identify the console: