Intro and Overview
This document has a set of exercises for training your R programming
skills using the tidyverse packages to process and analyses example
datasets.
You will need:
- to have been introduced to R and tidyverse
- R and RStudio installed
- the tidyverse package installed
Datasets:
- Exercises will use built-in datasets
- built-in datasets are already loaded in R and ready to use
- you should read help pages of the datasets you analyze
- The titanic dataset is not built-in but it will be accessible by an
URL
Solution to exercises can be revealed by clicking on the
[Code] buttons displayed at the right-hand side of the
exercises.
Preparation
Load the tidyverse package.
Datasets
Built-in dataset:
trees
This data set provides measurements of the diameter, height and
volume of timber in 31 felled black cherry trees. Note that the diameter
(in inches) is erroneously labelled Girth in the data. It is measured at
4 ft 6 in above the ground.
- Show the head of table trees
- Create trees2 variable by copying
trees and
- Renaming column Girth to
Diameter
- Converting Diameter and Height to
centimeters (1 inch = 2,54 cm)
- Converting Volume in cubic meters (1 cibic foot = 0,0283168 cubic
meter)
trees2 <- trees %>%
rename(Diameter=Girth) %>%
mutate(Diameter=Diameter*2.54, Height=Height*2.54) %>%
mutate(Volume=Volume*0.0283168)
- Show the head of table trees2
- Calculate the mean value of each column
trees2 %>%
summarise(
mean.diameter=mean(Diameter),
mean.height=mean(Height),
mean.vol=mean(Volume)
)
- Save in variable trees2.plot a scatter plot of the
diameter vs height
- color points by Volume
- add a title to the plot using ggtitle()
trees2.plot <- trees2 %>%
ggplot(aes(x=Diameter, y=Height, color=Volume)) +
geom_point() +
ggtitle("Scatter Plot")
- save the plot in a PNG image file on your computer
- use ggsave(trees2.plot, filename = ‘your_file.png’,
…) with appropriate parameters for ggsave
- read the help of the function to create a 10x10cm plot named
“trees2.plot.png”
ggsave(trees2.plot, filename = "scatterplot.png", width = 10, height = 10, units = "cm")
Built-in dataset:
PlantGrowth
Results from an experiment to compare yields (as measured by dried
weight of plants) obtained under a control and two different treatment
conditions.
- Show a summary of the table using summary(TABLE)
(not a tidyverse’s function)
weight group
Min. :3.590 ctrl:10
1st Qu.:4.550 trt1:10
Median :5.155 trt2:10
Mean :5.073
3rd Qu.:5.530
Max. :6.310
- Show a density plot of the weight values divided by
group in a single plot
PlantGrowth %>%
ggplot(aes(x=weight, fill=group)) +
geom_density()
- Tuning the plots is sometimes as simple as using a special parameter
to a ggplot layer
- replot the same plot with the following setting in
geom_density() to set the transparency:
alpha=0.2
- alpha can take values from 0 to 1, test alpha=0.5
and alpha=0.8
PlantGrowth %>%
ggplot(aes(x=weight, fill=group)) +
geom_density(alpha=0.2)
PlantGrowth %>%
ggplot(aes(x=weight, fill=group)) +
geom_density(alpha=0.5)
PlantGrowth %>%
ggplot(aes(x=weight, fill=group)) +
geom_density(alpha=0.8)
Built-in dataset:
CO2
The CO2 data frame has 84 rows and 5 columns of data from an
experiment on the cold tolerance of the grass species Echinochloa
crus-galli.
- read the documentation of the CO2 dataset to
understand the columns
- show a summary of the table
Plant Type Treatment conc uptake
Qn1 : 7 Quebec :42 nonchilled:42 Min. : 95 Min. : 7.70
Qn2 : 7 Mississippi:42 chilled :42 1st Qu.: 175 1st Qu.:17.90
Qn3 : 7 Median : 350 Median :28.30
Qc1 : 7 Mean : 435 Mean :27.21
Qc3 : 7 3rd Qu.: 675 3rd Qu.:37.12
Qc2 : 7 Max. :1000 Max. :45.50
(Other):42
- Calculate the minimum and maximum uptake per geographical place of
origin
CO2 %>%
group_by(Type) %>%
summarise(
min=min(uptake),
max=max(uptake),
)
- Create a line graph showing uptake by concentration for each
plant
CO2 %>%
ggplot(aes(x=conc, y=uptake, color=Plant)) +
geom_line() +
geom_point()
Built-in dataset:
WorldPhones
The number of telephones in various regions of the world (in
thousands).
- show the matrix WorldPhones
N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951 45939 21574 2876 1815 1646 89 555
1956 60423 29990 4708 2568 2366 1411 733
1957 64721 32510 5230 2695 2526 1546 773
1958 68484 35218 6662 2845 2691 1663 836
1959 71799 37598 6856 3000 2868 1769 911
1960 76036 40341 8220 3145 3054 1905 1008
1961 79831 43173 9053 3338 3224 2005 1076
- Convert the matrix into a tibble named phones and
show the tibble
- adapt the following template: as_tibble(MATRIX,
rownames=“year”)
- Parameter rownames is needed because by default row
names are not kept by as_tibble()
phones <- as_tibble(WorldPhones, rownames="year")
phones
- Tidy up the tibble in order to make an observation a geographical
area in a year
phones <- phones %>%
gather(N.Amer:Mid.Amer, key="area", value="phones")
phones
- Create a plot to show the number of phones by year in each
geographical area
- use facets and colors for the areas
phones %>%
ggplot(aes(x=year, y=phones, color=area)) +
geom_point() +
facet_wrap(vars(area))
Built-in dataset:
mtcars
The data was extracted from the 1974 Motor Trend US magazine, and
comprises fuel consumption and 10 aspects of automobile design and
performance for 32 automobiles (1973–74 models).
- read the related help page to understand the columns
- show the head of the data frame
- To compare engine types (vs), calculate mean gross horsepower and
mean time per 1/4 mile
mtcars %>%
group_by(vs) %>%
summarise(mean.power=mean(hp), mean.time=mean(qsec))
- create a boxplot to compare the weight per engine type (1 plot with
2 boxes)
- Hint: the x axis of a boxplot should not be a numeric column
mtcars %>%
mutate(vs=as.character(vs)) %>%
ggplot(aes(x=vs, y=wt)) +
geom_boxplot()
- fastest car per category
- a category is defined by a particular combination of engine (vs) and
transmission (am)
- calculate the fastest car in each category
- the data frame’s rownames must be first transformed as a normal
column using rownames_to_column(“car”)
mtcars %>%
rownames_to_column("car") %>%
group_by(vs, am) %>%
filter(qsec==max(qsec))
- fastest car per category
- reproduce the result above using the slice_max
dplyr’s function
- adapt the following template: slice_max(COLUMN,
n=1)
mtcars %>%
rownames_to_column("car") %>%
group_by(vs, am) %>%
slice_max(qsec)
Titanic dataset
Prepare the
data
- Load the titanic dataset into variable
titanic.source from the following URL:
http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/titanic.tsv
titanic.source <- read_tsv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/titanic.tsv")
- observe the 20 first rows using head() function
(see help page to see how to define number of rows). Would it be
possible after some processing to derive a numerical age value for each
row?
titanic.source %>% head(n=20)
- create variable titanic with a copy of
titanic.source
- filter the table to keep rows that contains a possible value for
age
- tidy up the table in order to have a numerical column for
Age
- remove any temporary column created during this task, if
relevant
- show the head of the new table
titanic <- titanic.source %>%
filter(Age!="Not Available") %>%
separate(Age, into = c("Age", "rest"), sep=", ") %>%
mutate(Age=as.numeric(Age)) %>%
select(-rest)
titanic %>% head()
Survival
status
- Display a bar plot with numbers of survived and died passengers in
each class
- use titanic table
- use aes() with parameter x and
fill
- Tuning plots can get complicated: add the following ggplot layer
after geom_bar() to show numbers in bars:
stat_count(geom = "text", colour = "white", aes(label = ..count..), position=position_stack(vjust=0.5))
titanic %>%
ggplot(aes(x=Class, fill=SurvivalStatus)) +
geom_bar() +
stat_count(geom = "text", colour = "white", aes(label = ..count..), position=position_stack(vjust=0.5))
- In a new table titanic.stats, calculate numbers of
survived and died passengers in each class
- use titanic table
- hint: use summarise(n=n()) where function
n() counts number of rows
titanic.stats <- titanic %>%
group_by(Class, SurvivalStatus) %>%
summarise(n=n())
`summarise()` has grouped output by 'Class'. You can override using the `.groups` argument.
- tidy the titanic.stats table to make a class an
observation, resulting in a new table with 3 columns (Class, died,
survived)
titanic.stats <- titanic.stats %>%
spread(SurvivalStatus, n)
titanic.stats
- Using the titanic.stats table, calculate the
frequency for male or female passengers to die in each class
titanic.stats %>%
mutate(freq=died/(died+survived))
Distribution of
age
- Calculate the mean age in each class for male and female passengers
using summarise()
- use titanic table
- Note that the setting na.rm=TRUE for the
mean() function prevents the calculation to fail in
cases of missing values. As we filtered out missing values earlier it
should not be necessary. Also applies to sum, min and
max functions.
titanic %>%
group_by(Class, Sex) %>%
summarise(avg=mean(Age, na.rm=TRUE))
`summarise()` has grouped output by 'Class'. You can override using the `.groups` argument.
- plot the distribution of age in each class for male and female
passengers using boxplots
- use parameters x, y and
fill in aes()
titanic %>%
ggplot(aes(y=Age, x=Sex, fill=Class)) +
geom_boxplot()
- Create subplots of the previous plot by survival status
titanic %>%
ggplot(aes(y=Age, x=Sex, fill=Class)) +
geom_boxplot() +
facet_wrap(vars(SurvivalStatus))
---
title: "Data analysis with R and the tidyverse"
output: 
  html_notebook:
    theme: readable
    highlight: tango
    number_sections: true
    toc: true
    toc_depth: 2
    toc_float: true
    code_folding: hide
---

<!-- ###################################################################### -->
<!-- ###################################################################### -->
# Intro and Overview
<!-- ###################################################################### -->
<!-- ###################################################################### -->

This document has a set of exercises for training your R programming skills 
using the tidyverse packages to process and analyses example datasets.

You will need:

* to have been introduced to R and tidyverse
* R and RStudio installed
* the tidyverse package installed

Datasets:

* Exercises will use built-in datasets
* built-in datasets are already loaded in R and ready to use
* you should read help pages of the datasets you analyze 
* The titanic dataset is not built-in but it will be accessible by an URL

Solution to exercises can be revealed by clicking on the **[Code]** buttons displayed 
at the right-hand side of the exercises. 


<!-- ###################################################################### -->
# Preparation
<!-- ###################################################################### -->

Load the tidyverse package.

```{r class.source = 'fold-show'}
library(tidyverse)
```


```{r include=FALSE}
# data()
rock
state.x77
airquality
Orange


# train.data <- read_tsv("https://www.wolframcloud.com/objects/8bbe975c-48a9-4d36-a358-1dde7c5c572a")
# train.data %>% mutate(Age = str_replace(Age, "Quantity\\[", "")) %>% 
#   mutate(Age = str_replace(Age, "Missing\\[", "")) %>%  
#   mutate(Age = str_replace(Age, "\\]", "")) %>% 
#   mutate(Age = str_replace_all(Age, '"', "")) %>% 
#   mutate(Age = str_replace(Age, "\\.,", ".0,"))  %>% 
#   write_tsv("titanic.tsv")
# titanic %>% head(n=20)
```


<!-- ###################################################################### -->
<!-- ###################################################################### -->
# Datasets
<!-- ###################################################################### -->
<!-- ###################################################################### -->

<!-- ###################################################################### -->
## Built-in dataset: trees
<!-- ###################################################################### -->

This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.

* Show the head of table **trees** 
```{r}
trees %>% head()
```

* Create **trees2** variable by copying **trees** and
    * Renaming column **Girth** to **Diameter**
    * Converting **Diameter** and **Height** to centimeters (1 inch = 2,54 cm)
    * Converting Volume in cubic meters (1 cibic foot = 0,0283168 cubic meter)
```{r}
trees2 <- trees %>% 
  rename(Diameter=Girth) %>% 
  mutate(Diameter=Diameter*2.54, Height=Height*2.54) %>% 
  mutate(Volume=Volume*0.0283168)
```

* Show the head of table **trees2** 
```{r}
trees2 %>% head()
```

* Calculate the mean value of each column 
```{r}
trees2 %>% 
  summarise(
    mean.diameter=mean(Diameter),
    mean.height=mean(Height),
    mean.vol=mean(Volume)
    )
```

* Save in variable **trees2.plot** a scatter plot of the diameter vs height 
    * color points by **Volume**
    * add a title to the plot using **ggtitle()**   
```{r}
trees2.plot <- trees2 %>% 
  ggplot(aes(x=Diameter, y=Height, color=Volume)) +
  geom_point() + 
  ggtitle("Scatter Plot")
```


* save the plot in a PNG image file on your computer
    * use **ggsave(trees2.plot, filename = 'your_file.png', ...)** with appropriate parameters for **ggsave**
    * read the help of the function to create a 10x10cm plot named **"trees2.plot.png"**
```{r}
ggsave(trees2.plot, filename = "scatterplot.png", width = 10, height = 10, units = "cm")
```

<!-- ###################################################################### -->
## Built-in dataset: PlantGrowth
<!-- ###################################################################### -->

Results from an experiment to compare yields (as measured by dried weight of 
plants) obtained under a control and two different treatment conditions.

* Show a summary of the table using **summary(TABLE)** (not a tidyverse's function)
```{r}
summary(PlantGrowth)
```

* Show a density plot of the **weight** values divided by **group** in a single plot
```{r}
PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density()
```
* Tuning the plots is sometimes as simple as using a special parameter to a ggplot layer
    * replot the same plot with the following setting in **geom_density()** to set the transparency: **alpha=0.2**
    * alpha can take values from 0 to 1, test **alpha=0.5** and **alpha=0.8**
```{r}
PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density(alpha=0.2)
PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density(alpha=0.5)
PlantGrowth %>% 
  ggplot(aes(x=weight, fill=group)) +
  geom_density(alpha=0.8)
```

<!-- ###################################################################### -->
## Built-in dataset: CO2
<!-- ###################################################################### -->

The CO2 data frame has 84 rows and 5 columns of data from an experiment on the cold tolerance of the grass species Echinochloa crus-galli.

* read the documentation of the **CO2** dataset to understand the columns
* show a summary of the table
```{r}
summary(CO2)
```

* Calculate the minimum and maximum uptake per geographical place of origin
```{r}
CO2 %>% 
  group_by(Type) %>% 
  summarise(
    min=min(uptake),
    max=max(uptake),
  )
```

* Create a line graph showing uptake by concentration for each plant
```{r}
CO2 %>% 
  ggplot(aes(x=conc, y=uptake, color=Plant)) +
  geom_line() +
  geom_point()
```


<!-- ###################################################################### -->
## Built-in dataset: WorldPhones
<!-- ###################################################################### -->

The number of telephones in various regions of the world (in thousands).

* show the matrix **WorldPhones**
```{r}
WorldPhones
```

* Convert the matrix into a tibble named **phones** and show the tibble
    * adapt the following template: **as_tibble(MATRIX, rownames="year")**
    * Parameter **rownames** is needed because by default row names are not kept by **as_tibble()**

```{r}
phones <- as_tibble(WorldPhones, rownames="year")
phones
```

* Tidy up the tibble in order to make an observation a geographical area in a year 
```{r}
phones <- phones %>% 
  gather(N.Amer:Mid.Amer, key="area", value="phones")
phones
```

* Create a plot to show the number of phones by year in each geographical area
    * use facets and colors for the areas 
```{r fig.width=10, fig.height=6}
phones %>% 
  ggplot(aes(x=year, y=phones, color=area)) +
  geom_point() +
  facet_wrap(vars(area))
```


<!-- ###################################################################### -->
## Built-in dataset: mtcars
<!-- ###################################################################### -->

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

* read the related help page to understand the columns
* show the head of the data frame
```{r}
mtcars %>% head()
```

* To compare engine types (vs), calculate mean gross horsepower and mean time per 1/4 mile
```{r}
mtcars %>%
  group_by(vs) %>% 
  summarise(mean.power=mean(hp), mean.time=mean(qsec))
```


* create a boxplot to compare the weight per engine type (1 plot with 2 boxes)
    * Hint: the x axis of a boxplot should not be a numeric column

```{r}
mtcars %>% 
  mutate(vs=as.character(vs)) %>% 
  ggplot(aes(x=vs, y=wt)) +
  geom_boxplot()
```
* fastest car per category
    * a category is defined by a particular combination of engine (vs) and transmission (am)
    * calculate the fastest car in each category 
        * the data frame's rownames must be first transformed as a normal column using **rownames_to_column("car")**
        
```{r}
mtcars %>% 
  rownames_to_column("car") %>% 
  group_by(vs, am) %>% 
  filter(qsec==max(qsec))
```
* fastest car per category
    * reproduce the result above using the **slice_max** dplyr's function
        * adapt the following template: **slice_max(COLUMN, n=1)**
```{r}
mtcars %>% 
  rownames_to_column("car") %>% 
  group_by(vs, am) %>% 
  slice_max(qsec)
```



<!-- ###################################################################### -->
## Titanic dataset
<!-- ###################################################################### -->

### Prepare the data

* Load the titanic dataset into variable **titanic.source** from the following URL:  `http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/titanic.tsv`
```{r message=FALSE}
titanic.source <- read_tsv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/titanic.tsv")  
```

* observe the 20 first rows using **head()** function (see help page to see how 
to define number of rows). Would it be possible after some processing to derive 
a numerical age value for each row?
```{r}
titanic.source %>% head(n=20)
```

* create variable **titanic** with a copy of **titanic.source**
    * filter the table to keep rows that contains a possible value for age
    * tidy up the table in order to have a numerical column for **Age**
    * remove any temporary column created during this task, if relevant
* show the head of the new table

```{r}
titanic <- titanic.source %>% 
  filter(Age!="Not Available") %>% 
  separate(Age, into = c("Age", "rest"), sep=", ") %>% 
  mutate(Age=as.numeric(Age)) %>% 
  select(-rest) 
titanic %>% head()
```


### Survival status

* Display a bar plot with numbers of survived and died passengers in each class
    * use **titanic** table
    * use **aes()** with parameter **x** and **fill**
* Tuning plots can get complicated: add the following ggplot layer after **geom_bar()** to show numbers in bars:
    * `stat_count(geom = "text", colour = "white", aes(label = ..count..), position=position_stack(vjust=0.5))`

```{r}
titanic %>% 
  ggplot(aes(x=Class, fill=SurvivalStatus)) +
  geom_bar() +
  stat_count(geom = "text", colour = "white", aes(label = ..count..), position=position_stack(vjust=0.5))
```

* In a new table **titanic.stats**, calculate numbers of survived and died passengers in each class 
    * use **titanic** table
    * hint: use **summarise(n=n())** where function **n()** counts number of rows

```{r}
titanic.stats <- titanic %>%
  group_by(Class, SurvivalStatus) %>% 
  summarise(n=n())
titanic.stats
```

* tidy the **titanic.stats** table to make a class an observation, resulting in a new table with 3 columns (Class, died, survived)
```{r}
titanic.stats <- titanic.stats %>% 
  spread(SurvivalStatus, n)
titanic.stats
```

* Using the **titanic.stats** table, calculate the frequency for male or female passengers to die in each class

```{r}
titanic.stats %>% 
  mutate(freq=died/(died+survived))
```


### Distribution of age

* Calculate the mean age in each class for male and female passengers using **summarise()**
    * use titanic table
    * Note that the setting **na.rm=TRUE** for the **mean()** function prevents the calculation to fail in cases of missing values. As we filtered out missing values earlier it should not be necessary. Also applies to **sum, min and max** functions.  
```{r}
titanic %>% 
  group_by(Class, Sex) %>% 
  summarise(avg=mean(Age, na.rm=TRUE))
```

* plot the distribution of age in each class for male and female passengers using boxplots
    * use parameters **x**, **y** and **fill** in **aes()**
```{r}
titanic %>% 
  ggplot(aes(y=Age, x=Sex, fill=Class)) +
  geom_boxplot()
```
* Create subplots of the previous plot by survival status
```{r}
titanic %>% 
  ggplot(aes(y=Age, x=Sex, fill=Class)) +
  geom_boxplot() +
  facet_wrap(vars(SurvivalStatus))
```
