1 Intro and Overview

This document contains a set of exercises to practice your R programming skills, using the tidyverse packages to process and analyze example datasets.

You will need:

  • to have been introduced to R and tidyverse
  • R and RStudio installed
  • the tidyverse package installed

Datasets:

  • Exercises will use built-in datasets
  • built-in datasets are already loaded in R and ready to use
  • you should read help pages of the datasets you analyze
  • The titanic dataset is not built-in, but it can be loaded from a URL

Solutions to the exercises can be revealed by clicking the [Code] buttons displayed on the right-hand side of the exercises.

2 Preparation

Load the tidyverse package.

library(tidyverse)

3 Datasets

3.1 Built-in dataset: trees

This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.

  • Show the head of table trees
trees %>% head()
  • Create trees2 variable by copying trees and
    • Renaming column Girth to Diameter
    • Converting Diameter and Height to centimeters (Girth is recorded in inches, Height in feet; 1 inch = 2.54 cm, 1 foot = 30.48 cm)
    • Converting Volume to cubic meters (1 cubic foot = 0.0283168 cubic meters)
trees2 <- trees %>% 
  rename(Diameter=Girth) %>% 
  mutate(Diameter=Diameter*2.54, Height=Height*30.48) %>% 
  mutate(Volume=Volume*0.0283168)
  • Show the head of table trees2
trees2 %>% head()
  • Calculate the mean value of each column
trees2 %>% 
  summarise(
    mean.diameter=mean(Diameter),
    mean.height=mean(Height),
    mean.vol=mean(Volume)
    )
  • Save in variable trees2.plot a scatter plot of the diameter vs height
    • color points by Volume
    • add a title to the plot using ggtitle()
trees2.plot <- trees2 %>% 
  ggplot(aes(x=Diameter, y=Height, color=Volume)) +
  geom_point() + 
  ggtitle("Scatter Plot")
  • save the plot in a PNG image file on your computer
    • use ggsave(trees2.plot, filename = 'your_file.png', …) with appropriate parameters for ggsave
    • read the help of the function to create a 10x10cm plot named "trees2.plot.png"
ggsave(trees2.plot, filename = "trees2.plot.png", width = 10, height = 10, units = "cm")

3.2 Built-in dataset: PlantGrowth

Results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.

  • Show a summary of the table using summary(TABLE) (a base R function, not a tidyverse one)
summary(PlantGrowth)
     weight       group   
 Min.   :3.590   ctrl:10  
 1st Qu.:4.550   trt1:10  
 Median :5.155   trt2:10  
 Mean   :5.073            
 3rd Qu.:5.530            
 Max.   :6.310            
  • Show a density plot of the weight values divided by group in a single plot
PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density()

  • Tuning a plot is sometimes as simple as passing an extra parameter to a ggplot layer
    • redraw the same plot with the following setting in geom_density() to set the transparency: alpha=0.2
    • alpha can take values from 0 (fully transparent) to 1 (fully opaque); also test alpha=0.5 and alpha=0.8
PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density(alpha=0.2)

PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density(alpha=0.5)

PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density(alpha=0.8)

3.3 Built-in dataset: CO2

The CO2 data frame has 84 rows and 5 columns of data from an experiment on the cold tolerance of the grass species Echinochloa crus-galli.

  • read the documentation of the CO2 dataset to understand the columns
  • show a summary of the table
summary(CO2)
     Plant             Type         Treatment       conc          uptake     
 Qn1    : 7   Quebec     :42   nonchilled:42   Min.   :  95   Min.   : 7.70  
 Qn2    : 7   Mississippi:42   chilled   :42   1st Qu.: 175   1st Qu.:17.90  
 Qn3    : 7                                    Median : 350   Median :28.30  
 Qc1    : 7                                    Mean   : 435   Mean   :27.21  
 Qc3    : 7                                    3rd Qu.: 675   3rd Qu.:37.12  
 Qc2    : 7                                    Max.   :1000   Max.   :45.50  
 (Other):42                                                                  
  • Calculate the minimum and maximum uptake per geographical place of origin
CO2 %>% 
  group_by(Type) %>% 
  summarise(
    min=min(uptake),
    max=max(uptake),
  )
  • Create a line graph showing uptake by concentration for each plant
CO2 %>% 
  ggplot(aes(x=conc, y=uptake, color=Plant)) +
  geom_line() +
  geom_point()

3.4 Built-in dataset: WorldPhones

The number of telephones in various regions of the world (in thousands).

  • show the matrix WorldPhones
WorldPhones
     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076
  • Convert the matrix into a tibble named phones and show the tibble
    • adapt the following template: as_tibble(MATRIX, rownames="year")
    • Parameter rownames is needed because by default row names are not kept by as_tibble()
phones <- as_tibble(WorldPhones, rownames="year")
phones
  • Tidy up the tibble so that each observation (row) is one geographical area in one year
phones <- phones %>% 
  gather(N.Amer:Mid.Amer, key="area", value="phones")
phones
  • Create a plot to show the number of phones by year in each geographical area
    • use facets and colors for the areas
phones %>% 
  ggplot(aes(x=year, y=phones, color=area)) +
  geom_point() +
  facet_wrap(vars(area))

3.5 Built-in dataset: mtcars

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

  • read the related help page to understand the columns
  • show the head of the data frame
mtcars %>% head()
  • To compare engine types (vs), calculate mean gross horsepower and mean time per 1/4 mile
mtcars %>%
  group_by(vs) %>% 
  summarise(mean.power=mean(hp), mean.time=mean(qsec))
  • create a boxplot to compare the weight per engine type (1 plot with 2 boxes)
    • Hint: the x axis of a boxplot should not be a numeric column
mtcars %>% 
  mutate(vs=as.character(vs)) %>% 
  ggplot(aes(x=vs, y=wt)) +
  geom_boxplot()

  • fastest car per category
    • a category is defined by a particular combination of engine (vs) and transmission (am)
    • find the fastest car in each category, i.e. the one with the shortest 1/4 mile time (qsec)
      • the row names of the data frame must first be converted to a regular column using rownames_to_column("car")
mtcars %>% 
  rownames_to_column("car") %>% 
  group_by(vs, am) %>% 
  filter(qsec==min(qsec))
  • fastest car per category
    • reproduce the result above using the dplyr function slice_min()
      • adapt the following template: slice_min(COLUMN, n=1)
mtcars %>% 
  rownames_to_column("car") %>% 
  group_by(vs, am) %>% 
  slice_min(qsec, n=1)

3.6 Titanic dataset

3.6.1 Prepare the data

  • Load the titanic dataset into variable titanic.source from the following URL: http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/titanic.tsv
titanic.source <- read_tsv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/titanic.tsv")  
  • Inspect the first 20 rows using the head() function (see its help page for how to set the number of rows). Would it be possible, after some processing, to derive a numerical age value for each row?
titanic.source %>% head(n=20)
  • create variable titanic with a copy of titanic.source
    • filter the table to keep rows that contain a usable value for age
    • tidy up the table in order to have a numerical column for Age
    • remove any temporary column created during this task, if relevant
  • show the head of the new table
titanic <- titanic.source %>% 
  filter(Age!="Not Available") %>% 
  separate(Age, into = c("Age", "rest"), sep=", ") %>% 
  mutate(Age=as.numeric(Age)) %>% 
  select(-rest) 
titanic %>% head()

3.6.2 Survival status

  • Display a bar plot with the number of passengers who survived and died in each class
    • use titanic table
    • use aes() with parameter x and fill
  • Tuning plots can get complicated: add the following ggplot layer after geom_bar() to show numbers in bars:
    • stat_count(geom = "text", colour = "white", aes(label = after_stat(count)), position=position_stack(vjust=0.5))
titanic %>% 
  ggplot(aes(x=Class, fill=SurvivalStatus)) +
  geom_bar() +
  stat_count(geom = "text", colour = "white", aes(label = after_stat(count)), position=position_stack(vjust=0.5))

  • In a new table titanic.stats, calculate the number of passengers who survived and died in each class
    • use titanic table
    • hint: use summarise(n=n()) where function n() counts number of rows
titanic.stats <- titanic %>%
  group_by(Class, SurvivalStatus) %>% 
  summarise(n=n())
`summarise()` has grouped output by 'Class'. You can override using the `.groups` argument.
titanic.stats
  • Reshape the titanic.stats table so that each class is one observation (row), resulting in a new table with 3 columns (Class, died, survived)
titanic.stats <- titanic.stats %>% 
  spread(SurvivalStatus, n)
titanic.stats
  • Using the titanic.stats table, calculate the proportion of passengers who died in each class
titanic.stats %>% 
  mutate(freq=died/(died+survived))

3.6.3 Distribution of age

  • Calculate the mean age in each class for male and female passengers using summarise()
    • use titanic table
    • Note that setting na.rm=TRUE in mean() makes it ignore missing values; without it, the result is NA as soon as one value is missing. As we filtered out missing values earlier, it should not be necessary here. The same applies to the sum, min and max functions.
titanic %>% 
  group_by(Class, Sex) %>% 
  summarise(avg=mean(Age, na.rm=TRUE))
`summarise()` has grouped output by 'Class'. You can override using the `.groups` argument.
  • plot the distribution of age in each class for male and female passengers using boxplots
    • use parameters x, y and fill in aes()
titanic %>% 
  ggplot(aes(y=Age, x=Sex, fill=Class)) +
  geom_boxplot()

  • Create subplots of the previous plot by survival status
titanic %>% 
  ggplot(aes(y=Age, x=Sex, fill=Class)) +
  geom_boxplot() +
  facet_wrap(vars(SurvivalStatus))

---
title: "Data analysis with R and the tidyverse"
output: 
  html_notebook:
    theme: readable
    highlight: tango
    number_sections: true
    toc: true
    toc_depth: 2
    toc_float: true
    code_folding: hide
---

<!-- ###################################################################### -->
<!-- ###################################################################### -->
# Intro and Overview
<!-- ###################################################################### -->
<!-- ###################################################################### -->

This document contains a set of exercises to practice your R programming skills, 
using the tidyverse packages to process and analyze example datasets.

You will need:

* to have been introduced to R and tidyverse
* R and RStudio installed
* the tidyverse package installed

Datasets:

* Exercises will use built-in datasets
* built-in datasets are already loaded in R and ready to use
* you should read help pages of the datasets you analyze 
* The titanic dataset is not built-in, but it can be loaded from a URL

Solutions to the exercises can be revealed by clicking the **[Code]** buttons displayed 
on the right-hand side of the exercises. 


<!-- ###################################################################### -->
# Preparation
<!-- ###################################################################### -->

Load the tidyverse package.

```{r class.source = 'fold-show'}
library(tidyverse)
```


```{r include=FALSE}
# data()
rock
state.x77
airquality
Orange


# train.data <- read_tsv("https://www.wolframcloud.com/objects/8bbe975c-48a9-4d36-a358-1dde7c5c572a")
# train.data %>% mutate(Age = str_replace(Age, "Quantity\\[", "")) %>% 
#   mutate(Age = str_replace(Age, "Missing\\[", "")) %>%  
#   mutate(Age = str_replace(Age, "\\]", "")) %>% 
#   mutate(Age = str_replace_all(Age, '"', "")) %>% 
#   mutate(Age = str_replace(Age, "\\.,", ".0,"))  %>% 
#   write_tsv("titanic.tsv")
# titanic %>% head(n=20)
```


<!-- ###################################################################### -->
<!-- ###################################################################### -->
# Datasets
<!-- ###################################################################### -->
<!-- ###################################################################### -->

<!-- ###################################################################### -->
## Built-in dataset: trees
<!-- ###################################################################### -->

This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.

* Show the head of table **trees** 
```{r}
trees %>% head()
```

* Create **trees2** variable by copying **trees** and
    * Renaming column **Girth** to **Diameter**
    * Converting **Diameter** and **Height** to centimeters (Girth is recorded in inches, Height in feet; 1 inch = 2.54 cm, 1 foot = 30.48 cm)
    * Converting **Volume** to cubic meters (1 cubic foot = 0.0283168 cubic meters)
```{r}
trees2 <- trees %>% 
  rename(Diameter=Girth) %>% 
  mutate(Diameter=Diameter*2.54, Height=Height*30.48) %>% 
  mutate(Volume=Volume*0.0283168)
```
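
The volume factor does not have to be memorised: it follows from the length definitions. A minimal sketch, assuming only the exact definitions 1 inch = 2.54 cm and 1 foot = 12 inches (the variable names are illustrative and not part of the exercise):

```{r}
# Exact definitions: 1 inch = 2.54 cm, 1 foot = 12 inches
cm_per_inch <- 2.54
cm_per_foot <- 12 * cm_per_inch      # 30.48 cm
m_per_foot  <- cm_per_foot / 100     # 0.3048 m

# A cubic foot is a cube with edges of 1 foot, so cube the length factor
m3_per_ft3 <- m_per_foot^3
m3_per_ft3
```

The result agrees with the factor 0.0283168 quoted in the exercise, up to rounding.
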

* Show the head of table **trees2** 
```{r}
trees2 %>% head()
```

* Calculate the mean value of each column 
```{r}
trees2 %>% 
  summarise(
    mean.diameter=mean(Diameter),
    mean.height=mean(Height),
    mean.vol=mean(Volume)
    )
```
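
When every column needs the same summary, listing the columns one by one can be avoided. The sketch below uses dplyr's `across()` with `everything()`; it runs on the original **trees** table so that it is self-contained, but the same pattern should apply to **trees2**. The name `trees_means` is only illustrative.

```{r}
library(dplyr)

# Apply mean() to every column of the built-in trees data frame at once
trees_means <- trees %>%
  summarise(across(everything(), mean))
trees_means
```
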

* Save in variable **trees2.plot** a scatter plot of the diameter vs height 
    * color points by **Volume**
    * add a title to the plot using **ggtitle()**   
```{r}
trees2.plot <- trees2 %>% 
  ggplot(aes(x=Diameter, y=Height, color=Volume)) +
  geom_point() + 
  ggtitle("Scatter Plot")
```


* save the plot in a PNG image file on your computer
    * use **ggsave(trees2.plot, filename = 'your_file.png', ...)** with appropriate parameters for **ggsave**
    * read the help of the function to create a 10x10cm plot named **"trees2.plot.png"**
```{r}
ggsave(trees2.plot, filename = "trees2.plot.png", width = 10, height = 10, units = "cm")
```
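
`ggsave()` chooses the output format from the file extension, and `filename` is its first argument, with `plot` second. A small sketch of the same call with all arguments named, writing to a temporary file so that the working directory stays clean (the plot `p` and the path `out_file` are illustrative assumptions):

```{r}
library(ggplot2)

# A simple plot built from the built-in trees data
p <- ggplot(trees, aes(x = Girth, y = Height)) +
  geom_point()

# The .png extension tells ggsave() to write a PNG file
out_file <- tempfile(fileext = ".png")
ggsave(filename = out_file, plot = p, width = 10, height = 10, units = "cm")

file.exists(out_file)
```
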

<!-- ###################################################################### -->
## Built-in dataset: PlantGrowth
<!-- ###################################################################### -->

Results from an experiment to compare yields (as measured by dried weight of 
plants) obtained under a control and two different treatment conditions.

* Show a summary of the table using **summary(TABLE)** (a base R function, not a tidyverse one)
```{r}
summary(PlantGrowth)
```

* Show a density plot of the **weight** values divided by **group** in a single plot
```{r}
PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density()
```
* Tuning a plot is sometimes as simple as passing an extra parameter to a ggplot layer
    * redraw the same plot with the following setting in **geom_density()** to set the transparency: **alpha=0.2**
    * alpha can take values from 0 (fully transparent) to 1 (fully opaque); also test **alpha=0.5** and **alpha=0.8**
```{r}
PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density(alpha=0.2)
PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density(alpha=0.5)
PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density(alpha=0.8)
```

<!-- ###################################################################### -->
## Built-in dataset: CO2
<!-- ###################################################################### -->

The CO2 data frame has 84 rows and 5 columns of data from an experiment on the cold tolerance of the grass species Echinochloa crus-galli.

* read the documentation of the **CO2** dataset to understand the columns
* show a summary of the table
```{r}
summary(CO2)
```

* Calculate the minimum and maximum uptake per geographical place of origin
```{r}
CO2 %>% 
  group_by(Type) %>% 
  summarise(
    min=min(uptake),
    max=max(uptake),
  )
```
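
The same pattern extends to more than one grouping column. As a sketch, the chunk below groups by both **Type** and **Treatment** and sets the `.groups` argument of `summarise()` so that the result comes back ungrouped (the name `uptake_range` is illustrative):

```{r}
library(dplyr)

uptake_range <- CO2 %>%
  group_by(Type, Treatment) %>%
  summarise(
    min = min(uptake),
    max = max(uptake),
    .groups = "drop"   # return an ungrouped table
  )
uptake_range
```

With two origins and two treatments, the table has one row per combination.
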

* Create a line graph showing uptake by concentration for each plant
```{r}
CO2 %>% 
  ggplot(aes(x=conc, y=uptake, color=Plant)) +
  geom_line() +
  geom_point()
```


<!-- ###################################################################### -->
## Built-in dataset: WorldPhones
<!-- ###################################################################### -->

The number of telephones in various regions of the world (in thousands).

* show the matrix **WorldPhones**
```{r}
WorldPhones
```

* Convert the matrix into a tibble named **phones** and show the tibble
    * adapt the following template: **as_tibble(MATRIX, rownames="year")**
    * Parameter **rownames** is needed because by default row names are not kept by **as_tibble()**

```{r}
phones <- as_tibble(WorldPhones, rownames="year")
phones
```
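
To see what the **rownames** parameter changes, the two calls can be compared side by side. This is only an illustrative check; the names `no_year` and `with_year` are not part of the exercise.

```{r}
library(tibble)

# Default: the row names (the years) are dropped
no_year <- as_tibble(WorldPhones)
names(no_year)

# With rownames = "year" they are kept as a character column
with_year <- as_tibble(WorldPhones, rownames = "year")
class(with_year$year)
```
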

* Tidy up the tibble so that each observation (row) is one geographical area in one year 
```{r}
phones <- phones %>% 
  gather(N.Amer:Mid.Amer, key="area", value="phones")
phones
```
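
`gather()` still works, but the tidyr documentation marks it as superseded in favour of `pivot_longer()`. A sketch of the equivalent reshaping, rebuilt from **WorldPhones** so that it runs on its own (the name `phones_long` is illustrative):

```{r}
library(dplyr)
library(tidyr)

phones_long <- as_tibble(WorldPhones, rownames = "year") %>%
  pivot_longer(
    cols = N.Amer:Mid.Amer,   # columns to stack
    names_to = "area",        # former column names go here
    values_to = "phones"      # former cell values go here
  )
phones_long
```

The rows come out in a different order than with `gather()`, but the content should be the same: 7 years times 7 areas.
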

* Create a plot to show the number of phones by year in each geographical area
    * use facets and colors for the areas 
```{r fig.width=10, fig.height=6}
phones %>% 
  ggplot(aes(x=year, y=phones, color=area)) +
  geom_point() +
  facet_wrap(vars(area))
```
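
Because **year** comes from the row names, it is a character column, so the x axis is discrete and the gap between 1951 and 1956 is drawn like any other step. One possible refinement, sketched here with illustrative names, is to convert the year to an integer, which gives a continuous axis and allows the points to be connected with `geom_line()`:

```{r}
library(dplyr)
library(tidyr)
library(ggplot2)

phones_num <- as_tibble(WorldPhones, rownames = "year") %>%
  pivot_longer(N.Amer:Mid.Amer, names_to = "area", values_to = "phones") %>%
  mutate(year = as.integer(year))   # "1951" becomes 1951

phones_plot <- phones_num %>%
  ggplot(aes(x = year, y = phones, color = area)) +
  geom_line() +
  geom_point() +
  facet_wrap(vars(area))
phones_plot
```
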


<!-- ###################################################################### -->
## Built-in dataset: mtcars
<!-- ###################################################################### -->

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

* read the related help page to understand the columns
* show the head of the data frame
```{r}
mtcars %>% head()
```

* To compare engine types (vs), calculate mean gross horsepower and mean time per 1/4 mile
```{r}
mtcars %>%
  group_by(vs) %>% 
  summarise(mean.power=mean(hp), mean.time=mean(qsec))
```


* create a boxplot to compare the weight per engine type (1 plot with 2 boxes)
    * Hint: the x axis of a boxplot should not be a numeric column

```{r}
mtcars %>% 
  mutate(vs=as.character(vs)) %>% 
  ggplot(aes(x=vs, y=wt)) +
  geom_boxplot()
```
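
Converting **vs** to text is enough for the boxplot, but the axis then only shows 0 and 1. As a sketch, a factor with explicit labels makes the plot self-explanatory; the labels follow the coding given on the mtcars help page (0 = V-shaped, 1 = straight), and the name `cars_labelled` is illustrative:

```{r}
library(dplyr)
library(ggplot2)

cars_labelled <- mtcars %>%
  mutate(vs = factor(vs, levels = c(0, 1), labels = c("V-shaped", "straight")))

cars_labelled %>%
  ggplot(aes(x = vs, y = wt)) +
  geom_boxplot()
```
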
* fastest car per category
    * a category is defined by a particular combination of engine (vs) and transmission (am)
    * find the fastest car in each category, i.e. the one with the shortest 1/4 mile time (qsec)
        * the row names of the data frame must first be converted to a regular column using **rownames_to_column("car")**
        
```{r}
mtcars %>% 
  rownames_to_column("car") %>% 
  group_by(vs, am) %>% 
  filter(qsec==min(qsec))
```
* fastest car per category
    * reproduce the result above using the dplyr function **slice_min()**
        * adapt the following template: **slice_min(COLUMN, n=1)**
```{r}
mtcars %>% 
  rownames_to_column("car") %>% 
  group_by(vs, am) %>% 
  slice_min(qsec, n=1)
```
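
Filtering on an extreme value keeps every row that reaches it, and the `slice_min()` / `slice_max()` functions do the same by default. The sketch below uses a small made-up table (not course data) to show how the `with_ties` argument changes this:

```{r}
library(dplyr)

# Toy data (hypothetical): group "a" has two rows tied for the smallest x
toy <- tibble(g = c("a", "a", "a", "b"), x = c(1, 1, 5, 2))

# Default (with_ties = TRUE): all tied rows are returned
ties_kept <- toy %>% group_by(g) %>% slice_min(x, n = 1)
ties_kept

# with_ties = FALSE: exactly one row per group
ties_dropped <- toy %>% group_by(g) %>% slice_min(x, n = 1, with_ties = FALSE)
ties_dropped
```
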



<!-- ###################################################################### -->
## Titanic dataset
<!-- ###################################################################### -->

### Prepare the data

* Load the titanic dataset into variable **titanic.source** from the following URL:  `http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/titanic.tsv`
```{r message=FALSE}
titanic.source <- read_tsv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/titanic.tsv")  
```

* Inspect the first 20 rows using the **head()** function (see its help page for how 
to set the number of rows). Would it be possible, after some processing, to derive 
a numerical age value for each row?
```{r}
titanic.source %>% head(n=20)
```

* create variable **titanic** with a copy of **titanic.source**
    * filter the table to keep rows that contain a usable value for age
    * tidy up the table in order to have a numerical column for **Age**
    * remove any temporary column created during this task, if relevant
* show the head of the new table

```{r}
titanic <- titanic.source %>% 
  filter(Age!="Not Available") %>% 
  separate(Age, into = c("Age", "rest"), sep=", ") %>% 
  mutate(Age=as.numeric(Age)) %>% 
  select(-rest) 
titanic %>% head()
```
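
The cleaning steps can be tried out without downloading anything. The sketch below applies the same pipeline to three made-up strings; the exact format of the **Age** column in the real file is an assumption here, so inspect the actual values before relying on it.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical Age strings; the real file may be formatted differently
toy <- tibble(Age = c("29.0, Years", "0.9167, Years", "Not Available"))

toy_clean <- toy %>%
  filter(Age != "Not Available") %>%                       # drop rows without an age
  separate(Age, into = c("Age", "rest"), sep = ", ") %>%   # split number and unit
  mutate(Age = as.numeric(Age)) %>%                        # text to number
  select(-rest)                                            # drop the helper column
toy_clean
```
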


### Survival status

* Display a bar plot with the number of passengers who survived and died in each class
    * use **titanic** table
    * use **aes()** with parameter **x** and **fill**
* Tuning plots can get complicated: add the following ggplot layer after **geom_bar()** to show numbers in bars:
    * `stat_count(geom = "text", colour = "white", aes(label = after_stat(count)), position=position_stack(vjust=0.5))`

```{r}
titanic %>% 
  ggplot(aes(x=Class, fill=SurvivalStatus)) +
  geom_bar() +
  stat_count(geom = "text", colour = "white", aes(label = after_stat(count)), position=position_stack(vjust=0.5))
```

* In a new table **titanic.stats**, calculate the number of passengers who survived and died in each class 
    * use **titanic** table
    * hint: use **summarise(n=n())** where function **n()** counts number of rows

```{r}
titanic.stats <- titanic %>%
  group_by(Class, SurvivalStatus) %>% 
  summarise(n=n())
titanic.stats
```
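
Counting rows per group is common enough that dplyr offers `count()` as a shortcut. The sketch below compares both approaches on a small made-up table standing in for **titanic** (class labels and values are hypothetical); it also sets `.groups = "drop"` so that `summarise()` returns an ungrouped result.

```{r}
library(dplyr)

# Hypothetical passenger table standing in for titanic
toy <- tibble(
  Class = c("1st", "1st", "2nd", "2nd", "2nd"),
  SurvivalStatus = c("survived", "died", "died", "died", "survived")
)

by_summarise <- toy %>%
  group_by(Class, SurvivalStatus) %>%
  summarise(n = n(), .groups = "drop")

by_count <- toy %>%
  count(Class, SurvivalStatus)

all.equal(by_summarise, by_count)
```
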

* Reshape the **titanic.stats** table so that each class is one observation (row), resulting in a new table with 3 columns (Class, died, survived)
```{r}
titanic.stats <- titanic.stats %>% 
  spread(SurvivalStatus, n)
titanic.stats
```
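
Like `gather()`, `spread()` is marked as superseded in the tidyr documentation; its replacement is `pivot_wider()`. A sketch on made-up counts (the numbers and class labels are hypothetical, not results from the titanic data):

```{r}
library(dplyr)
library(tidyr)

# Hypothetical counts standing in for titanic.stats
toy_stats <- tibble(
  Class = c("1st", "1st", "2nd", "2nd"),
  SurvivalStatus = c("died", "survived", "died", "survived"),
  n = c(10L, 30L, 25L, 15L)
)

toy_wide <- toy_stats %>%
  pivot_wider(names_from = SurvivalStatus, values_from = n)
toy_wide
```
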

* Using the **titanic.stats** table, calculate the proportion of passengers who died in each class

```{r}
titanic.stats %>% 
  mutate(freq=died/(died+survived))
```


### Distribution of age

* Calculate the mean age in each class for male and female passengers using **summarise()**
    * use titanic table
    * Note that setting **na.rm=TRUE** in **mean()** makes it ignore missing values; without it, the result is NA as soon as one value is missing. As we filtered out missing values earlier, it should not be necessary here. The same applies to the **sum**, **min** and **max** functions.  
```{r}
titanic %>% 
  group_by(Class, Sex) %>% 
  summarise(avg=mean(Age, na.rm=TRUE))
```
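
The effect of **na.rm** is easiest to see on a tiny vector. The values below are made up for illustration:

```{r}
ages <- c(22, NA, 40)   # hypothetical values with one missing entry

mean(ages)                 # NA: the missing value propagates
mean(ages, na.rm = TRUE)   # 31: the NA is dropped before averaging
```
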

* plot the distribution of age in each class for male and female passengers using boxplots
    * use parameters **x**, **y** and **fill** in **aes()**
```{r}
titanic %>% 
  ggplot(aes(y=Age, x=Sex, fill=Class)) +
  geom_boxplot()
```
* Create subplots of the previous plot by survival status
```{r}
titanic %>% 
  ggplot(aes(y=Age, x=Sex, fill=Class)) +
  geom_boxplot() +
  facet_wrap(vars(SurvivalStatus))
```
