4 Data transformation

4.1 Foreword

Once you have tidied your dataset of interest, you will often need to create some new variables or summaries, or to reorder the observations to make the data easier to work with.

This part will show you how to use the functions from package dplyr to transform your data and understand it better.

Throughout this part, you are going to use a tidied up excerpt from the World Health Organization Global Tuberculosis Report. We have been working with this dataset, so you should be familiar with it:

(who_ref <- read_tsv("datasets/who_ref.tsv"))

## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_double(),
##   cases = col_integer(),
##   population = col_integer()
## )

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan 1999.    745   19987071
## 2 Afghanistan 2000.   2666   20595360
## 3 Brazil      1999.  37737  172006362
## 4 Brazil      2000.  80488  174504898
## 5 China       1999. 212258 1272915272
## 6 China       2000. 213766 1280428583

The tibble reports the number of tuberculosis cases in 1999 and 2000 for three different countries, as well as their population.

You must load package tidyverse for this part.

4.2 Sorting by column(s)

To sort observations by one or more columns, dplyr offers function arrange(). This function takes a tibble and a column name to order by:

who_ref %>% 
  arrange(year)

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan 1999.    745   19987071
## 2 Brazil      1999.  37737  172006362
## 3 China       1999. 212258 1272915272
## 4 Afghanistan 2000.   2666   20595360
## 5 Brazil      2000.  80488  174504898
## 6 China       2000. 213766 1280428583

If you provide more columns, arrange() will first sort by the first one, then the second one and so on. This process breaks ties in the values of the preceding columns:

who_ref %>% 
  arrange(year, cases, population)

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan 1999.    745   19987071
## 2 Brazil      1999.  37737  172006362
## 3 China       1999. 212258 1272915272
## 4 Afghanistan 2000.   2666   20595360
## 5 Brazil      2000.  80488  174504898
## 6 China       2000. 213766 1280428583

To sort in descending order, you have to use desc():

who_ref %>% 
  arrange(desc(country), cases)

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 China       1999. 212258 1272915272
## 2 China       2000. 213766 1280428583
## 3 Brazil      1999.  37737  172006362
## 4 Brazil      2000.  80488  174504898
## 5 Afghanistan 1999.    745   19987071
## 6 Afghanistan 2000.   2666   20595360

4.3 Selecting columns

Function select() allows you to focus on variables you’re really interested in by removing unwanted columns:

who_ref %>% 
  select(country, cases)

## # A tibble: 6 x 2
##   country      cases
##   <chr>        <int>
## 1 Afghanistan    745
## 2 Afghanistan   2666
## 3 Brazil       37737
## 4 Brazil       80488
## 5 China       212258
## 6 China       213766

You can also use the : operator to select a group of columns:

who_ref %>% 
  select(year:population)

## # A tibble: 6 x 3
##    year  cases population
##   <dbl>  <int>      <int>
## 1 1999.    745   19987071
## 2 2000.   2666   20595360
## 3 1999.  37737  172006362
## 4 2000.  80488  174504898
## 5 1999. 212258 1272915272
## 6 2000. 213766 1280428583

If you put a minus sign before columns in select(), it means that you want to discard such variables:

who_ref %>% 
  select(-population)

## # A tibble: 6 x 3
##   country      year  cases
##   <chr>       <dbl>  <int>
## 1 Afghanistan 1999.    745
## 2 Afghanistan 2000.   2666
## 3 Brazil      1999.  37737
## 4 Brazil      2000.  80488
## 5 China       1999. 212258
## 6 China       2000. 213766

This, in combination with the : operator is specially useful when your dataset has hundreds or even thousands of columns and you want to focus only on a few:

who_ref %>% 
  select(-(year:population))

## # A tibble: 6 x 1
##   country    
##   <chr>      
## 1 Afghanistan
## 2 Afghanistan
## 3 Brazil     
## 4 Brazil     
## 5 China      
## 6 China

4.4 Renaming columns

If a column in your tibble has a strange or non-informative name, you can use function rename() to solve this issue:

who_ref %>% 
  rename(tuberculosis_cases = cases)

## # A tibble: 6 x 4
##   country      year tuberculosis_cases population
##   <chr>       <dbl>              <int>      <int>
## 1 Afghanistan 1999.                745   19987071
## 2 Afghanistan 2000.               2666   20595360
## 3 Brazil      1999.              37737  172006362
## 4 Brazil      2000.              80488  174504898
## 5 China       1999.             212258 1272915272
## 6 China       2000.             213766 1280428583

4.5 Filtering rows

Function filter() is one of the most useful tools in package dplyr as it allows you to subset observations based on their values. filter() takes the contents of a tibble and its arguments are logical expressions to filter it.

For example, we can focus on tuberculosis cases in Brazil as follows:

who_ref %>% 
  filter(country == "Brazil")

## # A tibble: 2 x 4
##   country  year cases population
##   <chr>   <dbl> <int>      <int>
## 1 Brazil  1999. 37737  172006362
## 2 Brazil  2000. 80488  174504898

To retrieve Chinese cases that occurred after 1999, we do:

who_ref %>% 
  filter(country == "China" & year > 1999)

## # A tibble: 1 x 4
##   country  year  cases population
##   <chr>   <dbl>  <int>      <int>
## 1 China   2000. 213766 1280428583

To subset observations from Brazil or China:

who_ref %>% 
  filter(country == "Brazil" | country == "China")

## # A tibble: 4 x 4
##   country  year  cases population
##   <chr>   <dbl>  <int>      <int>
## 1 Brazil  1999.  37737  172006362
## 2 Brazil  2000.  80488  174504898
## 3 China   1999. 212258 1272915272
## 4 China   2000. 213766 1280428583

Note that we use the comparison operators >, <, >=, <=, ==, != (not equal) to specify the column value or range of values we want to focus on. In addition, we use the logical operators & (and) and | (or) to combine multiple conditions passed on to filter().

You can negate conditions with the ! operator:

who_ref %>% 
  filter(!(country == "Brazil"))

## # A tibble: 4 x 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan 1999.    745   19987071
## 2 Afghanistan 2000.   2666   20595360
## 3 China       1999. 212258 1272915272
## 4 China       2000. 213766 1280428583

Basically, any operation that generates a logical vector can be used within filter() to subset your tibble.

4.6 Adding new variables

Besides selecting existing columns, it is often useful to create new ones that are functions of existing variables. dplyr offers function mutate() to add new columns at the end of your dataset.

For example, it might be tempting to say that Afghanistan has better programmes against tuberculosis than Brazil and China. However, Afghanistan’s population is smaller… Let’s look at the number of cases per 10 thousand individuals to get a better picture:

who_ref %>% 
  mutate(cases_per_10k = cases/(population/10000))

## # A tibble: 6 x 5
##   country      year  cases population cases_per_10k
##   <chr>       <dbl>  <int>      <int>         <dbl>
## 1 Afghanistan 1999.    745   19987071         0.373
## 2 Afghanistan 2000.   2666   20595360         1.29 
## 3 Brazil      1999.  37737  172006362         2.19 
## 4 Brazil      2000.  80488  174504898         4.61 
## 5 China       1999. 212258 1272915272         1.67 
## 6 China       2000. 213766 1280428583         1.67

mutate() can be used to create one or more variables at once:

who_ref %>% 
  mutate(cases_per_10k = cases/(population/10000), thousand_cases = cases/1000)

## # A tibble: 6 x 6
##   country      year  cases population cases_per_10k thousand_cases
##   <chr>       <dbl>  <int>      <int>         <dbl>          <dbl>
## 1 Afghanistan 1999.    745   19987071         0.373          0.745
## 2 Afghanistan 2000.   2666   20595360         1.29           2.67 
## 3 Brazil      1999.  37737  172006362         2.19          37.7  
## 4 Brazil      2000.  80488  174504898         4.61          80.5  
## 5 China       1999. 212258 1272915272         1.67         212.   
## 6 China       2000. 213766 1280428583         1.67         214.

4.7 Grouped operations

It is often the case that some records in your dataset belong to certain groups. For example, we can group data in who_ref by country or by year:

who_ref %>% 
  group_by(country)

## # A tibble: 6 x 4
## # Groups:   country [3]
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan 1999.    745   19987071
## 2 Afghanistan 2000.   2666   20595360
## 3 Brazil      1999.  37737  172006362
## 4 Brazil      2000.  80488  174504898
## 5 China       1999. 212258 1272915272
## 6 China       2000. 213766 1280428583

who_ref %>% 
  group_by(year)

## # A tibble: 6 x 4
## # Groups:   year [2]
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan 1999.    745   19987071
## 2 Afghanistan 2000.   2666   20595360
## 3 Brazil      1999.  37737  172006362
## 4 Brazil      2000.  80488  174504898
## 5 China       1999. 212258 1272915272
## 6 China       2000. 213766 1280428583

Note that grouping by country results in 3 groups, whereas grouping by year results in 2.

It is also possible to group by multiple variables:

who_ref %>% 
  group_by(country, year)

## # A tibble: 6 x 4
## # Groups:   country, year [6]
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan 1999.    745   19987071
## 2 Afghanistan 2000.   2666   20595360
## 3 Brazil      1999.  37737  172006362
## 4 Brazil      2000.  80488  174504898
## 5 China       1999. 212258 1272915272
## 6 China       2000. 213766 1280428583

The result is 6 groups. This is not very useful in this particular example, because each observation belongs to its own group.

4.7.1 Grouped summaries

If we pipe the result of a group_by() operation to the function summarise(), the unit of analysis changes from the complete tibble to each individual group. This is very useful, because we can compute useful stats and summaries on a per group basis.

The following example computes the total number of tuberculosis cases per year:

who_ref %>% 
  group_by(year) %>% 
  summarise(total_cases = sum(cases))

## # A tibble: 2 x 2
##    year total_cases
##   <dbl>       <int>
## 1 1999.      250740
## 2 2000.      296920

The following example computes the average number of cases per country and the corresponding standard deviation:

who_ref %>% 
  group_by(country) %>% 
  summarise(avg_cases = mean(cases), st_dev = sd(cases))

## # A tibble: 3 x 3
##   country     avg_cases st_dev
##   <chr>           <dbl>  <dbl>
## 1 Afghanistan     1706.  1358.
## 2 Brazil         59112. 30230.
## 3 China         213012.  1066.

Note how summarise collapses each group to a single row and drops columns not involved in the grouping process. In fact, if we use summarise() without grouping first, it sees the entire tibble as a single group and the result is a single row:

who_ref %>% 
  summarise(total_cases = sum(cases))

## # A tibble: 1 x 1
##   total_cases
##         <int>
## 1      547660

The above means that grouped summaries should be generated with functions that collapse their argument into a single value. Some useful summary functions are sum(), mean(), median(), sd(), min(), max() and n().

4.7.2 Grouped mutates

If we pipe the result of a group_by() operation to the function mutate(), we can create new variables on a per group basis.

Say, for example, that we want to compute the fraction of tuberculosis cases in a year from the total occurred in a country:

who_ref %>% 
  group_by(country) %>% 
  mutate(fraction = cases/sum(cases))

## # A tibble: 6 x 5
## # Groups:   country [3]
##   country      year  cases population fraction
##   <chr>       <dbl>  <int>      <int>    <dbl>
## 1 Afghanistan 1999.    745   19987071    0.218
## 2 Afghanistan 2000.   2666   20595360    0.782
## 3 Brazil      1999.  37737  172006362    0.319
## 4 Brazil      2000.  80488  174504898    0.681
## 5 China       1999. 212258 1272915272    0.498
## 6 China       2000. 213766 1280428583    0.502

Note that grouped mutates do not collapse the input tibble. Instead, they maintain the original number of samples and columns. As a result, functions used to create new variables in mutate() should produce the same number of members within the group.

4.7.3 Grouped filters

If we pipe the result of a group_by() operation to the function filter(), we can remove observations (i.e. rows) on a per group basis.

For example, to find the year with the minimum number of tuberculosis cases in a country, we can do:

who_ref %>% 
  group_by(country) %>% 
  filter(cases == min(cases))

## # A tibble: 3 x 4
## # Groups:   country [3]
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan 1999.    745   19987071
## 2 Brazil      1999.  37737  172006362
## 3 China       1999. 212258 1272915272

Note that in this case, using summarise() doesn’t work because we lose the information about the year:

who_ref %>%
  group_by(country) %>% 
  summarise(min_cases = min(cases))

## # A tibble: 3 x 2
##   country     min_cases
##   <chr>           <dbl>
## 1 Afghanistan      745.
## 2 Brazil         37737.
## 3 China         212258.

4.8 Case study

In this case study, you’re going to work with a dataset giving the monthly deaths from lung diseases in the UK between 1974 and 1979.

Import file datasets/uk_lung_deaths.csv into variable uk.
Tidy up the dataset. The goal is to have a tibble with three columns: year, month and deaths.
Use select() to re-accommodate the columns, such that they appear in the following order: month, year and deaths.
Data from 1974 is too old to be reliable, filter it out.
Use the %>% to perform points 1-4 in a single go and assign the result to variable uk.
Report the average number of deaths per month. Sort the result decreasingly by average deaths.
Report the average and total number of deaths per year. Sort the result by year.
Compute the percentage of deaths with which each month contributes to the total per year.
For each year, report the month with the maximum number of deaths. Sort the result by year.
Speculate about possible reasons why the months resulting from point 9 have the most deaths from lung diseases.