1 Introduction: The practicals as part of this course

Welcome to this first practical of Fundamental Techniques of Data Science in R! Each week, the practical of that week covers (part of) the topics discussed in the lectures and the reading materials. The practicals prepare the students for the topics of the graded assignments of the workgroups.

For each practical you are expected to hand in an R Markdown file and the corresponding HTML file, showing that you followed all steps of the practical. These files won’t be graded, but handing in all practicals is part of meeting the course requirements. The deadline for each practical is the start of the Monday afternoon lecture of the next week. Students are allowed to skip one practical. However, a student who fails to submit more than one practical loses the right to take a retake exam. Instructions on submitting practicals can be found on the course website under ‘R practical’.

The practicals are discussed in the Tuesday afternoon lecture, where there is room to ask questions about the practicals or to get help with them. The answers to each practical can be found on the course website. It is strongly recommended to try the practical yourself before checking the answers. However, you can check the answers at any time, as long as you hand in your practical on time.

2 Packages for this Practical

In this practical, we will make use of the packages dplyr, readr, knitr, and kableExtra.

  1. Load the following R packages. Suppress warnings and messages in the code chunk where you load these packages.
library(dplyr)
library(readr)
library(knitr)
library(kableExtra)

# Note that `dplyr` and `readr` are included in the `tidyverse` distribution. If you have `tidyverse` installed, you can simply run `library(tidyverse)`. You can also load packages within `tidyverse` individually (as above), which is a bit quicker.
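One possible way to suppress warnings and messages is through knitr's chunk options. The sketch below sets them as defaults for all later chunks; it assumes the knitr package is installed, and setting the options globally is a design choice rather than something this practical requires. The chunk label in the comment is hypothetical.

```r
# Set default chunk options for all later chunks in the document.
# message = FALSE hides messages (e.g. package start-up notes),
# warning = FALSE hides warnings.
knitr::opts_chunk$set(message = FALSE, warning = FALSE)

# The same options can be set for a single chunk in its header, e.g.
# {r load-packages, message = FALSE, warning = FALSE}
# where "load-packages" is a hypothetical chunk label.

# Check the current defaults
knitr::opts_chunk$get("message")
knitr::opts_chunk$get("warning")
```

Setting the options per chunk keeps warnings visible elsewhere in the document, which can be helpful while debugging.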

3 Code exercises

The following exercises are some basic (mathematical) operations to illustrate what you can code with R. Please run all ten exercises in a single code chunk. Comment on half of the exercises in the code chunk (use a # after the code or on a separate line). These comments can help others (or a later version of yourself) understand what the code is supposed to do.
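As a small illustration of the two comment styles mentioned above, the sketch below uses made-up object names and values that are not part of the exercises.

```r
# A comment on its own line describes the step that follows
height_cm <- 180               # hypothetical example value

height_m <- height_cm / 100    # a comment can also follow the code
height_m
```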

  1. Create an object a with value 1

  2. Verify that 1 is stored in a

  3. Square a

  4. Assign a + a to the object b, and check if b is equal to a + a.

  5. Square b

  6. Multiply the answer of the previous question by a over b

  7. Assign the result to c

  8. Take the square root (use sqrt()) of c to the power b

  9. Multiply the answer of the previous question by a over (b to the power 6)

  10. Round the answer from the previous question to 3 decimal places (use round(), and use ?round to find out more about how to use this function).

Now that you know how to use R as a calculator and R Markdown for typesetting, we can move on to some more advanced operations.


4 Functions in R

A function in R is a piece of code that contains a set of statements organized to perform a specific task. For example, a function could be used to calculate the mean of some data, or to make a barplot of some other data. In R code, functions can be recognized by the parentheses after their name (e.g. mean() is a function).

Functions consist of:

  • Input argument(s);
  • Function actions;
  • Output / results of the function.

For example, the function mean() takes a vector of numbers as input, sums the numbers and divides the sum by the number of elements, and finishes by returning the resulting number.

The code below illustrates how a function is constructed.

# Example function

# Function name and the input arguments
function_name <- function(argument_1, argument_2, argument_3){
  
  # actions of the function
  x <- (argument_1 + argument_2) / argument_3
  
  # returning output
  return(x)
}
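As a sketch of how the mean() description above maps onto this template, the hypothetical function my_mean() below computes a mean "by hand". The function name and the example values are made up for illustration; in practice you would simply use the built-in mean().

```r
# Hypothetical function that computes a mean "by hand"
my_mean <- function(numbers){

  # actions of the function: sum the numbers, divide by how many there are
  result <- sum(numbers) / length(numbers)

  # returning output
  return(result)
}

# Calling the function with a vector as input argument
values <- c(2, 4, 6, 8)
my_mean(values)

# The built-in mean() gives the same result
mean(values)
```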

R has many functions, either built in or provided by packages, such as seq(), mean(), min(), max(), and sum(). If you want more information about such a function, you can always run ?function_name to retrieve its documentation. You can also write your own functions. This can be useful, for example, if some longer code needs to be repeated multiple times.

  1. Perform the following operations by using built-in R functions:
  • Create a sequence of numbers from 12 to 24, by using the function seq().

  • Sum the numbers from 28 to 63 by using the sum()-function.

  • Find the mean of the numbers from 25 to 82.


5 Getting data into R

There are several ways to read data into R. One option is to use the readr package that comes with the tidyverse distribution. The function read_csv() reads comma-delimited files.

Download the file “flightdata.csv” from the course page and store it in your project folder. This file contains a sample from the “flights” dataset from the nycflights13 package, which contains airline data for all flights departing from NYC in 2013. Note that you have to assign the data to an object when reading it into R.

  1. Read the flightdata.csv file into R with the readr package using the code below
flight_data <- read_csv("flightdata.csv") # Imports the data
flight_data # View the data
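To try read_csv() without the course file, the sketch below first writes a small made-up csv file to a temporary location and then reads it back. The file contents and object names are hypothetical; only read_csv() itself comes from the practical.

```r
library(readr)

# Write a small, made-up csv file to a temporary location
example_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ann,7.5", "Bob,8.0"), example_path)

# Read the file and assign the result to an object
example_data <- read_csv(example_path)

# Printing the object shows the data as a tibble
example_data
```

Note that read_csv() guesses the type of each column from the file contents, so it is worth checking the printed column types after importing.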

To get other types of data into R the tidyverse packages listed below are recommended.

  • haven reads SPSS, Stata, and SAS files
  • readxl reads Excel files (.xls and .xlsx)

6 Working with the data

6.1 Summarizing the data

Several functions can summarize data; the base R function summary() works well for a quick overview.

  1. Apply the summary() function to the data
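As an illustration of what summary() reports, the sketch below applies it to a small made-up data frame (not the flight data) with one numeric and one character column.

```r
# Made-up data frame with one numeric and one character column
example_data <- data.frame(
  score = c(6.5, 7.0, 8.5, NA),
  name  = c("Ann", "Bob", "Cas", "Dee")
)

# summary() reports quartiles and the mean for numeric columns,
# counts missing values (NA's), and shows only length/class/mode
# for character columns
example_summary <- summary(example_data)
example_summary
```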

6.2 Adjusting data

6.2.1 Creating new columns

Sometimes we need to add new columns that are functions of existing columns, and mutate() does this.

  1. Add a column that calculates speed using the distance and air_time columns using speed = distance / air_time * 60. Store the adjusted flight_data dataset under a new name flight_data2.
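A minimal sketch of mutate() on a small made-up data frame is given here; the column names and values are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Made-up data: distance in km and duration in minutes
trips <- data.frame(
  km      = c(10, 30, 45),
  minutes = c(30, 60, 45)
)

# mutate() adds a new column computed from existing columns;
# the result is stored under a new name, the original is unchanged
trips2 <- mutate(trips, km_per_hour = km / minutes * 60)
trips2
```

Storing the result under a new name keeps the original data frame available in case a step needs to be redone.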

6.2.2 Selecting data

You might get a data set with more variables than you need. In this case, it is useful to narrow it down to just the variables you will be working with. select() can be used for this.

  1. Select the columns year, month, day and speed using the select() function from flight_data2 and store it under flight_data3.
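The sketch below shows select() on a small made-up data frame. Besides naming the columns to keep, a minus sign can be used to drop columns; the data and object names are hypothetical.

```r
library(dplyr)

# Made-up data frame with four columns
people <- data.frame(
  name   = c("Ann", "Bob"),
  age    = c(31, 45),
  city   = c("Utrecht", "Leiden"),
  height = c(170, 182)
)

# Keep only the listed columns, in the order given
people_small <- select(people, name, height)

# A minus sign drops a column instead
people_no_city <- select(people, -city)

people_small
people_no_city
```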

7 Loops

Sometimes when coding, you want to repeat the same code multiple times for different pieces of data, or you want to repeat the same action multiple times. For example, you might draw 10 random numbers and calculate their mean, and then want to repeat this same action multiple times.

To repeat something multiple times (each repetition is called an iteration), you can use loops in R. Loops are pieces of code that are repeated a set number of times. In this practical, we discuss the for-loop.

7.1 for-loops

for-loops repeat the given code once for each element in a provided sequence or vector. The following code shows how we loop over the numbers 1 to 10. Running this code prints the third power of each of the numbers from 1 to 10.

# Defining the loop
for(i in seq(1, 10)){
  
  # action you want to repeat, in this case each number to the power 3.
  print(i^3)
}
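The scenario from the start of this section (drawing 10 random numbers, taking their mean, and repeating this several times) could be sketched as follows. The number of repetitions, the seed, and the use of rnorm() are assumptions made for illustration; the results are stored in a vector instead of printed.

```r
set.seed(123)  # arbitrary seed so the random draws are reproducible

n_repeats <- 5                      # hypothetical number of repetitions
sample_means <- numeric(n_repeats)  # vector of zeros to hold the results

for(i in seq_len(n_repeats)){

  # draw 10 random numbers from a standard normal distribution
  draws <- rnorm(10)

  # store the mean of this draw at position i
  sample_means[i] <- mean(draws)
}

sample_means
```

Creating the results vector before the loop and filling it by position avoids growing the vector on every iteration.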

Note that for-loops always have the form described below. When using a for-loop, pay attention to the parentheses and curly braces.

for(<NAME_FOR_ELEMENT_IN_LOOP> in <SEQUENCE_OR_VECTOR>){
  <WANTED ACTIONS>
}

  1. Create a for-loop that iterates over the numbers 1 to 12 and, for each number, takes the third power and divides the result by 13. Print the output for each number.

8 Apply statements

As an alternative to loops, apply statements can be used to apply the same function to a list or vector of elements. For example, you can compute the mean of every column in your data set by using an apply statement.

The apply statements are a family of similar functions that are useful in different situations. Some examples are apply(), sapply(), lapply(), and mapply(). To learn the exact differences between them, please read the function documentation (e.g. ?apply).
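As a small sketch of one such difference, the code below computes the mean of every column of a made-up data frame with both lapply() and sapply(). The data and object names are hypothetical.

```r
# Made-up data frame with two numeric columns
example_data <- data.frame(
  a = c(1, 2, 3),
  b = c(10, 20, 30)
)

# lapply() always returns a list, with one element per column
means_list <- lapply(example_data, mean)

# sapply() simplifies the result to a named vector when possible
means_vector <- sapply(example_data, mean)

means_list
means_vector
```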

The standard apply() function has the following input arguments:

  • A matrix, array, or data frame (a plain vector has no rows or columns to apply over)
  • The margin on which the function should be used (e.g. should we apply the function on the rows (= 1), or columns (= 2))
  • The function you want to apply
  • Any additional arguments the chosen function needs

Below, an example is provided where we want to calculate the mean for each column of a 9 by 9 matrix.

# Create a 9 by 9 cell matrix with numbers 1 to 81
data <- matrix(1:81, nrow = 9, ncol = 9)

# apply with the input the data, margin and function
apply(X = data, MARGIN = 2, FUN = mean)
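The fourth input, additional arguments for the chosen function, can be sketched as follows. Here na.rm = TRUE is passed on to mean() so that a missing value is ignored; the matrix is made up for illustration.

```r
# Small made-up matrix with one missing value
scores <- matrix(c(1, 2, NA,
                   4, 5, 6),
                 nrow = 2, byrow = TRUE)

# MARGIN = 1 applies the function to each row;
# na.rm = TRUE is passed on to mean() so missing values are ignored
row_means <- apply(X = scores, MARGIN = 1, FUN = mean, na.rm = TRUE)
row_means
```

Without na.rm = TRUE, the mean of the first row would be NA.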

  1. Now use an apply statement to calculate the variance (var()) of each row of an 8 by 8 matrix with numbers 1 to 64.

9 Saving data to a file

After making changes to data frames, or after creating output in the form of a data frame, you might want to save your data in a new file. For example, after pre-processing your data for the analysis, you might want to save a pre-processed version in addition to the version with raw data.

The write_csv() function saves a data frame as a csv file. The write_csv() function has two main arguments: the data frame to save, and the path to where you want the file to be located. Other options can also be specified, for example how to write missing values. Type ?write_csv to learn more about this function.

  1. Write the data flight_data3 to a file using the write_csv() function.
write_csv(flight_data3, "flight_data3.csv")

# As we work in an R project, the .csv file is automatically stored in the project folder
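As a sketch of the option for missing values mentioned above, the code below writes a small made-up data frame with the na argument of write_csv() set to an empty string. The data and the temporary output path are hypothetical.

```r
library(readr)

# Made-up data frame with one missing value
example_data <- data.frame(
  id    = c(1, 2, 3),
  score = c(7.5, NA, 8.0)
)

# Hypothetical output location; tempfile() avoids cluttering the project folder
out_path <- tempfile(fileext = ".csv")

# na = "" writes missing values as empty fields instead of the default "NA"
write_csv(example_data, out_path, na = "")

# Inspect the raw lines of the written file
written_lines <- readLines(out_path)
written_lines
```

Which representation to choose depends on the program that will read the file later; some tools expect empty fields, others expect an explicit marker such as NA.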

10 End of practical

This concludes the practical for this week. Don’t forget to hand in your work on this practical! Find instructions for this on the course website!