3 Tidy data

3.1 Foreword

The rationale behind the tidy data philosophy is to organise the data you imported to or created in R into a consistent format. The goal is to spend less time formatting your data and more time analysing it. If your data is tidy you will be able to use it seamlessly across the tidyverse.

This part will show you how to use the functions from package tidyr to tidy up your data. As a result, data transformations and visualisations will be a real piece of cake.

Packages tidyverse and readxl must be loaded.

3.2 What is tidy data?

There are three simple and interrelated rules that make a dataset tidy:

Each observation/sample must be reported in its own row.
Each variable/feature describing the sample must have its own column.
Each value must have its own cell.

Let’s import some tables from the datasets directory of your data_analysis folder and check if they fulfil the above rules.

3.2.1 Exercise 3.1

Import file datasets/who1.csv into variable who1 and show the imported data. Is the who1 dataset tidy?

3.2.2 Exercise 3.2

Import file datasets/who2.txt into variable who2 and show the imported data. Is the who2 dataset tidy?

3.2.3 Exercise 3.3

Import sheet population from file datasets/who_split.xlsx into variable who_pop and show the imported data. Is who_pop tidy?

3.2.4 Why tidy data?

Why ensure that your data is tidy? There are two main advantages:

If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. The functions in the tidyverse all work flawlessly with tidy datasets.
Placing variables in columns allows R’s vectorised nature to shine.

The principles of tidy data seem very obvious. Unfortunately, most data that you will encounter will be untidy. There are two main reasons:

Most people aren’t familiar with the principles of tidy data.
Data is often organised to facilitate some use other than analysis.

This means that for most real analyses, you’ll need to tidy your data.

3.3 Spreading and gathering

When you’re in front of a dataset, the first step is always to figure out what the observations and the features that describe them are. The second step is to solve one of two common problems:

One observation is scattered across multiple rows.
One feature is spread across multiple columns.

To fix these problems, the tidyverse provides you with functions spread() and gather().

3.3.1 Spreading

You use function spread() for Problem 1: one observation is scattered across multiple rows. That’s exactly the problem we saw in table who1:

## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_double(),
##   type = col_character(),
##   count = col_integer()
## )

who1

## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <int>
##  1 Afghanistan 1999. cases             745
##  2 Afghanistan 1999. population   19987071
##  3 Afghanistan 2000. cases            2666
##  4 Afghanistan 2000. population   20595360
##  5 Brazil      1999. cases           37737
##  6 Brazil      1999. population  172006362
##  7 Brazil      2000. cases           80488
##  8 Brazil      2000. population  174504898
##  9 China       1999. cases          212258
## 10 China       1999. population 1272915272
## 11 China       2000. cases          213766
## 12 China       2000. population 1280428583

who1 is an excerpt of the World Health Organisation Tuberculosis Report. Each observation, a country in a year in this case, should be accompanied by that year’s number of tuberculosis cases and population. However, each country-year pair appears twice in who1: one for the tuberculosis cases and one for the country’s population.

To tidy up this dataset, we need to spread column count into new columns, one for each of the keys specified in column type:

Adapted from Chapter 12 of “R for Data Science”

We can do this with function spread()as follows:

who1_tidy <- who1 %>% 
   spread(value = count, key = type)
who1_tidy

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan 1999.    745   19987071
## 2 Afghanistan 2000.   2666   20595360
## 3 Brazil      1999.  37737  172006362
## 4 Brazil      2000.  80488  174504898
## 5 China       1999. 212258 1272915272
## 6 China       2000. 213766 1280428583

Note the syntax used to tidy up who1:

Choose a variable name to store the tidy data (this could very well be who1 again).
Use the assignment operator <-.
Specify the variable that we want to tidy up.
Use the pipe operator %>% to channel the contents of who1 to function spread (more on the pipe operator later in this Part).
Invoke function spread() and specify the column containing the values that we want to spread and the column containing the new variable names (known as the key column).

3.3.2 Gathering

You use function gather() for Problem 2: one feature is spread across multiple columns. That’s exactly the problem we saw in, for example, table who_cases from the Excel file who_split.xlsx:

who_cases

## # A tibble: 3 x 3
##   country      `1999`  `2000`
##   <chr>         <dbl>   <dbl>
## 1 Afghanistan    745.   2666.
## 2 Brazil       37737.  80488.
## 3 China       212258. 213766.

The column names 1999 and 2000 are values of the variable year, which means that each row represents two observations instead of one.

To tidy this dataset, we need to gather columns 1999 and 2000 into a new pair of variables:

Adapted from Chapter 12 of “R for Data Science”

We can do this with function gather()as follows:

who_cases_tidy <- who_cases %>% 
   gather(`1999`, `2000`, key = "year", value = "cases")
who_cases_tidy

## # A tibble: 6 x 3
##   country     year    cases
##   <chr>       <chr>   <dbl>
## 1 Afghanistan 1999     745.
## 2 Brazil      1999   37737.
## 3 China       1999  212258.
## 4 Afghanistan 2000    2666.
## 5 Brazil      2000   80488.
## 6 China       2000  213766.

Note the syntax used to tidy up who_cases:

Choose a variable name to store the tidy data (this could very well be who_cases again).
Use the assignment operator <-.
Specify the variable that we want to tidy up.
Use the pipe operator %>% to channel the contents of who_cases to function gather.
Invoke function gather() and specify the columns we want to merge, the name of the column that will contain the merged keys and the name of the column that will contain the value for the number of tuberculosis cases per country.

3.3.3 Exercise 3.4

Consider the following tibble, which records heights and weights of 3 students:

(students <- tibble(name = rep(c("Jonas", "Ines", "Hanna"), each = 2),
                   type = rep(c("height", "weight"), 3),
                   measure = c(1.83, 81, 1.75, 71, 1.69, 55)))

## # A tibble: 6 x 3
##   name  type   measure
##   <chr> <chr>    <dbl>
## 1 Jonas height    1.83
## 2 Jonas weight   81.0 
## 3 Ines  height    1.75
## 4 Ines  weight   71.0 
## 5 Hanna height    1.69
## 6 Hanna weight   55.0

Use the appropriate tidyr function on variable students to make it tidy.

3.3.4 Exercise 3.5

Consider the following tibble, which records the expression of 4 genes at 3 different time points:

(genes <- tibble(symbol = c("DMD", "MYOG", "MYF5", "MYOD1"),
                d00 = c(0.697, 0.844, 1.878, 1.622),
                d02 = c(1.986, 0.051, 0.887, 1.313),
                d04 = c(0.157, 0.774, 1.507, 0.628)))

## # A tibble: 4 x 4
##   symbol   d00    d02   d04
##   <chr>  <dbl>  <dbl> <dbl>
## 1 DMD    0.697 1.99   0.157
## 2 MYOG   0.844 0.0510 0.774
## 3 MYF5   1.88  0.887  1.51 
## 4 MYOD1  1.62  1.31   0.628

Use the appropriate tidyr function on variable genes to make it tidy.

3.4 Separating and uniting

3.4.1 Separating

When one column in a dataset contains two variables, we’ll need to use function separate() to fix the issue. That’s exactly the problem we saw in table who2:

## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_double(),
##   rate = col_character()
## )

who2

## # A tibble: 6 x 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan 1999. 745/19987071     
## 2 Afghanistan 2000. 2666/20595360    
## 3 Brazil      1999. 37737/172006362  
## 4 Brazil      2000. 80488/174504898  
## 5 China       1999. 212258/1272915272
## 6 China       2000. 213766/1280428583

Column rate contains both cases and population, so we need to split it in two:

Adapted from Chapter 12 of “R for Data Science”

We can do this with function separate()as follows:

who2_tidy <- who2 %>% 
   separate(rate, into = c("cases", "population"), sep = "/")
who2_tidy

## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan 1999. 745    19987071  
## 2 Afghanistan 2000. 2666   20595360  
## 3 Brazil      1999. 37737  172006362 
## 4 Brazil      2000. 80488  174504898 
## 5 China       1999. 212258 1272915272
## 6 China       2000. 213766 1280428583

Note the syntax used to tidy up who2:

Choose a variable name to store the tidy data (this could very well be who2 again).
Use the assignment operator <-.
Specify the variable that we want to tidy up.
Use the pipe operator %>% to channel the contents of who2 to function separate().
Invoke function separate() and specify the column we want to split, the name of the new columns and the separator that is currently merging the variables.

3.4.2 Uniting

When a variable is stored in two separate columns and is more convenient to combine them, we need to use function unite(). Table who3 has this problem:

who3 <- read_tsv("datasets/who3.tsv")

## Parsed with column specification:
## cols(
##   country = col_character(),
##   century = col_integer(),
##   year = col_character(),
##   rate = col_character()
## )

who3

## # A tibble: 6 x 4
##   country     century year  rate             
##   <chr>         <int> <chr> <chr>            
## 1 Afghanistan      19 99    745/19987071     
## 2 Afghanistan      20 00    2666/20595360    
## 3 Brazil           19 99    37737/172006362  
## 4 Brazil           20 00    80488/174504898  
## 5 China            19 99    212258/1272915272
## 6 China            20 00    213766/1280428583

Column century and year can be combined into a single column called year:

Adapted from Chapter 12 of “R for Data Science”

We can do this with function unite()as follows:

who3_tidy <- who3 %>% 
   unite(century, year, col = "year", sep = "")
who3_tidy

## # A tibble: 6 x 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

Note the syntax used to tidy up who3:

Choose a variable name to store the tidy data (this could very well be who3 again).
Use the assignment operator <-.
Specify the variable that we want to tidy up.
Use the pipe operator %>% to channel the contents of who3 to function unite().
Invoke function unite() and specify the columns we want to merge, the name of the new column and the separator that we want to use to merge the column values.

3.4.3 Exercise 3.6

Consider the following tibble, which records heights and weights of 3 students:

(students <- tibble(name = c("Jonas", "Ines", "Hanna"),
                   ratio = c("81/1.83", "71/1.75", "55/1.69")))

## # A tibble: 3 x 2
##   name  ratio  
##   <chr> <chr>  
## 1 Jonas 81/1.83
## 2 Ines  71/1.75
## 3 Hanna 55/1.69

Use the appropriate tidyr function on variable students to make it tidy.

3.4.4 Exercise 3.7

Consider the following tibble, which records the price of 4 drugs:

(drugs <- tibble(name = c("penicillin", "insuline", "aspirin", "lanoxin"),
                euros = c(13, 17, 5, 25),
                cents = c(81, 20, 14, 12)))

## # A tibble: 4 x 3
##   name       euros cents
##   <chr>      <dbl> <dbl>
## 1 penicillin   13.   81.
## 2 insuline     17.   20.
## 3 aspirin       5.   14.
## 4 lanoxin      25.   12.

Use the appropriate tidyr function on variable drugs to make it tidy.

3.5 The %>% operator

As you know already, we channel the contents of a tibble to the different tidyr functions using the pipe operator: %>%.

Pipes are a powerful tool for clearly expressing the operation we want to perform on a variable. In addition, they can be used to apply a sequence of operations to a variable.

Table who3, for example, has two problems: cases and population are expressed as a rate and year is split in columns century and year:

who3

## # A tibble: 6 x 4
##   country     century year  rate             
##   <chr>         <int> <chr> <chr>            
## 1 Afghanistan      19 99    745/19987071     
## 2 Afghanistan      20 00    2666/20595360    
## 3 Brazil           19 99    37737/172006362  
## 4 Brazil           20 00    80488/174504898  
## 5 China            19 99    212258/1272915272
## 6 China            20 00    213766/1280428583

So, we clearly need two steps to tidy up this dataset:

who3_tidy <- who3 %>% 
   separate(rate, into = c("cases", "population"), sep = "/")

who3_tidy <- who3_tidy %>% 
   unite(century, year, col = "yyyy", sep = "")

who3_tidy

## # A tibble: 6 x 4
##   country     yyyy  cases  population
##   <chr>       <chr> <chr>  <chr>     
## 1 Afghanistan 1999  745    19987071  
## 2 Afghanistan 2000  2666   20595360  
## 3 Brazil      1999  37737  172006362 
## 4 Brazil      2000  80488  174504898 
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

Thanks to the pipe, we can apply these two operations to who3 in one go:

who3_tidy <- who3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/") %>% 
  unite(century, year, col = "yyyy", sep = "")

who3_tidy

## # A tibble: 6 x 4
##   country     yyyy  cases  population
##   <chr>       <chr> <chr>  <chr>     
## 1 Afghanistan 1999  745    19987071  
## 2 Afghanistan 2000  2666   20595360  
## 3 Brazil      1999  37737  172006362 
## 4 Brazil      2000  80488  174504898 
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

Note that this process reads almost like natural language:

We channel (or pipe) the contents of who3 to function separate().
separate() splits column rate into columns cases and population using “/” as separator.
The result of separate() is piped to function unite().
unite() combines columns century and year into column yyyy using the blank character as separator.

Even though writing the above code on a single line of code is still valid, it is good practice to use ENTER (i.e. a new line) after every %>% for clarity.