3 Tidy data
3.1 Foreword
The rationale behind the tidy data philosophy is to organise the data you imported to or created in R into a consistent format. The goal is to spend less time formatting your data and more time analysing it. If your data is tidy you will be able to use it seamlessly across the tidyverse
.
This part will show you how to use the functions from package tidyr
to tidy up your data. As a result, data transformations and visualisations will be a real piece of cake.
Packages tidyverse
and readxl
must be loaded.
3.2 What is tidy data?
There are three simple and interrelated rules that make a dataset tidy:
- Each observation/sample must be reported in its own row.
- Each variable/feature describing the sample must have its own column.
- Each value must have its own cell.
Let’s import some tables from the datasets
directory of your data_analysis
folder and check if they fulfil the above rules.
3.2.1 Exercise 3.1
Import file datasets/who1.csv
into variable who1
and show the imported data. Is the who1
dataset tidy?
3.2.2 Exercise 3.2
Import file datasets/who2.txt
into variable who2
and show the imported data. Is the who2
dataset tidy?
3.2.3 Exercise 3.3
Import sheet population
from file datasets/who_split.xlsx
into variable who_pop
and show the imported data. Is who_pop
tidy?
3.2.4 Why tidy data?
Why ensure that your data is tidy? There are two main advantages:
- If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. The functions in the
tidyverse
all work flawlessly with tidy datasets. - Placing variables in columns allows R’s vectorised nature to shine.
The principles of tidy data seem very obvious. Unfortunately, most data that you will encounter will be untidy. There are two main reasons:
- Most people aren’t familiar with the principles of tidy data.
- Data is often organised to facilitate some use other than analysis.
This means that for most real analyses, you’ll need to tidy your data.
3.3 Spreading and gathering
When you’re in front of a dataset, the first step is always to figure out what the observations and the features that describe them are. The second step is to solve one of two common problems:
- One observation is scattered across multiple rows.
- One feature is spread across multiple columns.
To fix these problems, the tidyverse
provides you with functions spread()
and gather()
.
3.3.1 Spreading
You use function spread()
for Problem 1: one observation is scattered across multiple rows. That’s exactly the problem we saw in table who1
:
## Parsed with column specification:
## cols(
## country = col_character(),
## year = col_double(),
## type = col_character(),
## count = col_integer()
## )
who1
## # A tibble: 12 x 4
## country year type count
## <chr> <dbl> <chr> <int>
## 1 Afghanistan 1999. cases 745
## 2 Afghanistan 1999. population 19987071
## 3 Afghanistan 2000. cases 2666
## 4 Afghanistan 2000. population 20595360
## 5 Brazil 1999. cases 37737
## 6 Brazil 1999. population 172006362
## 7 Brazil 2000. cases 80488
## 8 Brazil 2000. population 174504898
## 9 China 1999. cases 212258
## 10 China 1999. population 1272915272
## 11 China 2000. cases 213766
## 12 China 2000. population 1280428583
who1
is an excerpt of the World Health Organisation Tuberculosis Report. Each observation, a country in a year in this case, should be accompanied by that year’s number of tuberculosis cases and population. However, each country-year pair appears twice in who1
: one for the tuberculosis cases and one for the country’s population.
To tidy up this dataset, we need to spread column count
into new columns, one for each of the keys specified in column type
:
We can do this with function spread()
as follows:
who1_tidy <- who1 %>%
spread(value = count, key = type)
who1_tidy
## # A tibble: 6 x 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999. 745 19987071
## 2 Afghanistan 2000. 2666 20595360
## 3 Brazil 1999. 37737 172006362
## 4 Brazil 2000. 80488 174504898
## 5 China 1999. 212258 1272915272
## 6 China 2000. 213766 1280428583
Note the syntax used to tidy up who1
:
- Choose a variable name to store the tidy data (this could very well be
who1
again). - Use the assignment operator
<-
. - Specify the variable that we want to tidy up.
- Use the pipe operator %>% to channel the contents of
who1
to functionspread
(more on the pipe operator later in this Part). - Invoke function
spread()
and specify the column containing the values that we want to spread and the column containing the new variable names (known as the key column).
3.3.2 Gathering
You use function gather()
for Problem 2: one feature is spread across multiple columns. That’s exactly the problem we saw in, for example, table who_cases
from the Excel file who_split.xlsx
:
who_cases
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745. 2666.
## 2 Brazil 37737. 80488.
## 3 China 212258. 213766.
The column names 1999
and 2000
are values of the variable year, which means that each row represents two observations instead of one.
To tidy this dataset, we need to gather columns 1999
and 2000
into a new pair of variables:
We can do this with function gather()
as follows:
who_cases_tidy <- who_cases %>%
gather(`1999`, `2000`, key = "year", value = "cases")
who_cases_tidy
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745.
## 2 Brazil 1999 37737.
## 3 China 1999 212258.
## 4 Afghanistan 2000 2666.
## 5 Brazil 2000 80488.
## 6 China 2000 213766.
Note the syntax used to tidy up who_cases
:
- Choose a variable name to store the tidy data (this could very well be
who_cases
again). - Use the assignment operator
<-
. - Specify the variable that we want to tidy up.
- Use the pipe operator %>% to channel the contents of
who_cases
to functiongather
. - Invoke function
gather()
and specify the columns we want to merge, the name of the column that will contain the merged keys and the name of the column that will contain the value for the number of tuberculosis cases per country.
3.3.3 Exercise 3.4
Consider the following tibble
, which records heights and weights of 3 students:
(students <- tibble(name = rep(c("Jonas", "Ines", "Hanna"), each = 2),
type = rep(c("height", "weight"), 3),
measure = c(1.83, 81, 1.75, 71, 1.69, 55)))
## # A tibble: 6 x 3
## name type measure
## <chr> <chr> <dbl>
## 1 Jonas height 1.83
## 2 Jonas weight 81.0
## 3 Ines height 1.75
## 4 Ines weight 71.0
## 5 Hanna height 1.69
## 6 Hanna weight 55.0
Use the appropriate tidyr
function on variable students
to make it tidy.
3.3.4 Exercise 3.5
Consider the following tibble
, which records the expression of 4 genes at 3 different time points:
(genes <- tibble(symbol = c("DMD", "MYOG", "MYF5", "MYOD1"),
d00 = c(0.697, 0.844, 1.878, 1.622),
d02 = c(1.986, 0.051, 0.887, 1.313),
d04 = c(0.157, 0.774, 1.507, 0.628)))
## # A tibble: 4 x 4
## symbol d00 d02 d04
## <chr> <dbl> <dbl> <dbl>
## 1 DMD 0.697 1.99 0.157
## 2 MYOG 0.844 0.0510 0.774
## 3 MYF5 1.88 0.887 1.51
## 4 MYOD1 1.62 1.31 0.628
Use the appropriate tidyr
function on variable genes
to make it tidy.
3.4 Separating and uniting
3.4.1 Separating
When one column in a dataset contains two variables, we’ll need to use function separate()
to fix the issue. That’s exactly the problem we saw in table who2
:
## Parsed with column specification:
## cols(
## country = col_character(),
## year = col_double(),
## rate = col_character()
## )
who2
## # A tibble: 6 x 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999. 745/19987071
## 2 Afghanistan 2000. 2666/20595360
## 3 Brazil 1999. 37737/172006362
## 4 Brazil 2000. 80488/174504898
## 5 China 1999. 212258/1272915272
## 6 China 2000. 213766/1280428583
Column rate
contains both cases
and population
, so we need to split it in two:
We can do this with function separate()
as follows:
who2_tidy <- who2 %>%
separate(rate, into = c("cases", "population"), sep = "/")
who2_tidy
## # A tibble: 6 x 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999. 745 19987071
## 2 Afghanistan 2000. 2666 20595360
## 3 Brazil 1999. 37737 172006362
## 4 Brazil 2000. 80488 174504898
## 5 China 1999. 212258 1272915272
## 6 China 2000. 213766 1280428583
Note the syntax used to tidy up who2
:
- Choose a variable name to store the tidy data (this could very well be
who2
again). - Use the assignment operator
<-
. - Specify the variable that we want to tidy up.
- Use the pipe operator %>% to channel the contents of
who2
to functionseparate()
. - Invoke function
separate()
and specify the column we want to split, the name of the new columns and the separator that is currently merging the variables.
3.4.2 Uniting
When a variable is stored in two separate columns and is more convenient to combine them, we need to use function unite()
. Table who3
has this problem:
who3 <- read_tsv("datasets/who3.tsv")
## Parsed with column specification:
## cols(
## country = col_character(),
## century = col_integer(),
## year = col_character(),
## rate = col_character()
## )
who3
## # A tibble: 6 x 4
## country century year rate
## <chr> <int> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
Column century
and year
can be combined into a single column called year
:
We can do this with function unite()
as follows:
who3_tidy <- who3 %>%
unite(century, year, col = "year", sep = "")
who3_tidy
## # A tibble: 6 x 3
## country year rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Note the syntax used to tidy up who3
:
- Choose a variable name to store the tidy data (this could very well be
who3
again). - Use the assignment operator
<-
. - Specify the variable that we want to tidy up.
- Use the pipe operator %>% to channel the contents of
who3
to functionunite()
. - Invoke function
unite()
and specify the columns we want to merge, the name of the new column and the separator that we want to use to merge the column values.
3.4.3 Exercise 3.6
Consider the following tibble
, which records heights and weights of 3 students:
(students <- tibble(name = c("Jonas", "Ines", "Hanna"),
ratio = c("81/1.83", "71/1.75", "55/1.69")))
## # A tibble: 3 x 2
## name ratio
## <chr> <chr>
## 1 Jonas 81/1.83
## 2 Ines 71/1.75
## 3 Hanna 55/1.69
Use the appropriate tidyr
function on variable students
to make it tidy.
3.4.4 Exercise 3.7
Consider the following tibble
, which records the price of 4 drugs:
(drugs <- tibble(name = c("penicillin", "insuline", "aspirin", "lanoxin"),
euros = c(13, 17, 5, 25),
cents = c(81, 20, 14, 12)))
## # A tibble: 4 x 3
## name euros cents
## <chr> <dbl> <dbl>
## 1 penicillin 13. 81.
## 2 insuline 17. 20.
## 3 aspirin 5. 14.
## 4 lanoxin 25. 12.
Use the appropriate tidyr
function on variable drugs
to make it tidy.
3.5 The %>% operator
As you know already, we channel the contents of a tibble
to the different tidyr
functions using the pipe operator: %>%
.
Pipes are a powerful tool for clearly expressing the operation we want to perform on a variable. In addition, they can be used to apply a sequence of operations to a variable.
Table who3
, for example, has two problems: cases
and population
are expressed as a rate
and year is split in columns century
and year
:
who3
## # A tibble: 6 x 4
## country century year rate
## <chr> <int> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
So, we clearly need two steps to tidy up this dataset:
who3_tidy <- who3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
who3_tidy <- who3_tidy %>%
unite(century, year, col = "yyyy", sep = "")
who3_tidy
## # A tibble: 6 x 4
## country yyyy cases population
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Thanks to the pipe, we can apply these two operations to who3
in one go:
who3_tidy <- who3 %>%
separate(rate, into = c("cases", "population"), sep = "/") %>%
unite(century, year, col = "yyyy", sep = "")
who3_tidy
## # A tibble: 6 x 4
## country yyyy cases population
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Note that this process reads almost like natural language:
- We channel (or pipe) the contents of
who3
to functionseparate()
. separate()
splits columnrate
into columnscases
andpopulation
using “/” as separator.- The result of
separate()
is piped to functionunite()
. unite()
combines columnscentury
andyear
into columnyyyy
using the blank character as separator.
Even though writing the above code on a single line of code is still valid, it is good practice to use ENTER (i.e. a new line) after every %>%
for clarity.