4 Data transformation
4.1 Foreword
Once you have tidied your dataset of interest, you will often need to create some new variables or summaries, or to reorder the observations to make the data easier to work with.
This part will show you how to use the functions from package dplyr
to transform your data and understand it better.
Throughout this part, you are going to use a tidied up excerpt from the World Health Organization Global Tuberculosis Report. We have been working with this dataset, so you should be familiar with it:
(who_ref <- read_tsv("datasets/who_ref.tsv"))
## Parsed with column specification:
## cols(
## country = col_character(),
## year = col_double(),
## cases = col_integer(),
## population = col_integer()
## )
## # A tibble: 6 x 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999. 745 19987071
## 2 Afghanistan 2000. 2666 20595360
## 3 Brazil 1999. 37737 172006362
## 4 Brazil 2000. 80488 174504898
## 5 China 1999. 212258 1272915272
## 6 China 2000. 213766 1280428583
The tibble
reports the number of tuberculosis cases in 1999 and 2000 for three different countries, as well as their population.
You must load package tidyverse
for this part.
4.2 Sorting by column(s)
To sort observations by one or more columns, dplyr
offers function arrange()
. This function takes a tibble
and a column name to order by:
who_ref %>%
arrange(year)
## # A tibble: 6 x 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999. 745 19987071
## 2 Brazil 1999. 37737 172006362
## 3 China 1999. 212258 1272915272
## 4 Afghanistan 2000. 2666 20595360
## 5 Brazil 2000. 80488 174504898
## 6 China 2000. 213766 1280428583
If you provide more columns, arrange()
will first sort by the first one, then the second one and so on. This process breaks ties in the values of the preceding columns:
who_ref %>%
arrange(year, cases, population)
## # A tibble: 6 x 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999. 745 19987071
## 2 Brazil 1999. 37737 172006362
## 3 China 1999. 212258 1272915272
## 4 Afghanistan 2000. 2666 20595360
## 5 Brazil 2000. 80488 174504898
## 6 China 2000. 213766 1280428583
To sort in descending order, you have to use desc()
:
who_ref %>%
arrange(desc(country), cases)
## # A tibble: 6 x 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 China 1999. 212258 1272915272
## 2 China 2000. 213766 1280428583
## 3 Brazil 1999. 37737 172006362
## 4 Brazil 2000. 80488 174504898
## 5 Afghanistan 1999. 745 19987071
## 6 Afghanistan 2000. 2666 20595360
4.3 Selecting columns
Function select()
allows you to focus on variables you’re really interested in by removing unwanted columns:
who_ref %>%
select(country, cases)
## # A tibble: 6 x 2
## country cases
## <chr> <int>
## 1 Afghanistan 745
## 2 Afghanistan 2666
## 3 Brazil 37737
## 4 Brazil 80488
## 5 China 212258
## 6 China 213766
You can also use the :
operator to select a group of columns:
who_ref %>%
select(year:population)
## # A tibble: 6 x 3
## year cases population
## <dbl> <int> <int>
## 1 1999. 745 19987071
## 2 2000. 2666 20595360
## 3 1999. 37737 172006362
## 4 2000. 80488 174504898
## 5 1999. 212258 1272915272
## 6 2000. 213766 1280428583
If you put a minus sign before columns in select()
, it means that you want to discard such variables:
who_ref %>%
select(-population)
## # A tibble: 6 x 3
## country year cases
## <chr> <dbl> <int>
## 1 Afghanistan 1999. 745
## 2 Afghanistan 2000. 2666
## 3 Brazil 1999. 37737
## 4 Brazil 2000. 80488
## 5 China 1999. 212258
## 6 China 2000. 213766
This, in combination with the :
operator is specially useful when your dataset has hundreds or even thousands of columns and you want to focus only on a few:
who_ref %>%
select(-(year:population))
## # A tibble: 6 x 1
## country
## <chr>
## 1 Afghanistan
## 2 Afghanistan
## 3 Brazil
## 4 Brazil
## 5 China
## 6 China
4.4 Renaming columns
If a column in your tibble
has a strange or non-informative name, you can use function rename()
to solve this issue:
who_ref %>%
rename(tuberculosis_cases = cases)
## # A tibble: 6 x 4
## country year tuberculosis_cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999. 745 19987071
## 2 Afghanistan 2000. 2666 20595360
## 3 Brazil 1999. 37737 172006362
## 4 Brazil 2000. 80488 174504898
## 5 China 1999. 212258 1272915272
## 6 China 2000. 213766 1280428583
4.5 Filtering rows
Function filter()
is one of the most useful tools in package dplyr
as it allows you to subset observations based on their values. filter()
takes the contents of a tibble
and its arguments are logical expressions to filter it.
For example, we can focus on tuberculosis cases in Brazil as follows:
who_ref %>%
filter(country == "Brazil")
## # A tibble: 2 x 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Brazil 1999. 37737 172006362
## 2 Brazil 2000. 80488 174504898
To retrieve Chinese cases that occurred after 1999, we do:
who_ref %>%
filter(country == "China" & year > 1999)
## # A tibble: 1 x 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 China 2000. 213766 1280428583
To subset observations from Brazil or China:
who_ref %>%
filter(country == "Brazil" | country == "China")
## # A tibble: 4 x 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Brazil 1999. 37737 172006362
## 2 Brazil 2000. 80488 174504898
## 3 China 1999. 212258 1272915272
## 4 China 2000. 213766 1280428583
Note that we use the comparison operators >
, <
, >=
, <=
, ==
, !=
(not equal) to specify the column value or range of values we want to focus on. In addition, we use the logical operators &
(and) and |
(or) to combine multiple conditions passed on to filter()
.
You can negate conditions with the !
operator:
who_ref %>%
filter(!(country == "Brazil"))
## # A tibble: 4 x 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999. 745 19987071
## 2 Afghanistan 2000. 2666 20595360
## 3 China 1999. 212258 1272915272
## 4 China 2000. 213766 1280428583
Basically, any operation that generates a logical vector can be used within filter()
to subset your tibble
.
4.6 Adding new variables
Besides selecting existing columns, it is often useful to create new ones that are functions of existing variables. dplyr
offers function mutate()
to add new columns at the end of your dataset.
For example, it might be tempting to say that Afghanistan has better programmes against tuberculosis than Brazil and China. However, Afghanistan’s population is smaller… Let’s look at the number of cases per 10 thousand individuals to get a better picture:
who_ref %>%
mutate(cases_per_10k = cases/(population/10000))
## # A tibble: 6 x 5
## country year cases population cases_per_10k
## <chr> <dbl> <int> <int> <dbl>
## 1 Afghanistan 1999. 745 19987071 0.373
## 2 Afghanistan 2000. 2666 20595360 1.29
## 3 Brazil 1999. 37737 172006362 2.19
## 4 Brazil 2000. 80488 174504898 4.61
## 5 China 1999. 212258 1272915272 1.67
## 6 China 2000. 213766 1280428583 1.67
mutate()
can be used to create one or more variables at once:
who_ref %>%
mutate(cases_per_10k = cases/(population/10000), thousand_cases = cases/1000)
## # A tibble: 6 x 6
## country year cases population cases_per_10k thousand_cases
## <chr> <dbl> <int> <int> <dbl> <dbl>
## 1 Afghanistan 1999. 745 19987071 0.373 0.745
## 2 Afghanistan 2000. 2666 20595360 1.29 2.67
## 3 Brazil 1999. 37737 172006362 2.19 37.7
## 4 Brazil 2000. 80488 174504898 4.61 80.5
## 5 China 1999. 212258 1272915272 1.67 212.
## 6 China 2000. 213766 1280428583 1.67 214.
4.7 Grouped operations
It is often the case that some records in your dataset belong to certain groups. For example, we can group data in who_ref
by country
or by year
:
who_ref %>%
group_by(country)
## # A tibble: 6 x 4
## # Groups: country [3]
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999. 745 19987071
## 2 Afghanistan 2000. 2666 20595360
## 3 Brazil 1999. 37737 172006362
## 4 Brazil 2000. 80488 174504898
## 5 China 1999. 212258 1272915272
## 6 China 2000. 213766 1280428583
who_ref %>%
group_by(year)
## # A tibble: 6 x 4
## # Groups: year [2]
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999. 745 19987071
## 2 Afghanistan 2000. 2666 20595360
## 3 Brazil 1999. 37737 172006362
## 4 Brazil 2000. 80488 174504898
## 5 China 1999. 212258 1272915272
## 6 China 2000. 213766 1280428583
Note that grouping by country
results in 3 groups, whereas grouping by year
results in 2.
It is also possible to group by multiple variables:
who_ref %>%
group_by(country, year)
## # A tibble: 6 x 4
## # Groups: country, year [6]
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999. 745 19987071
## 2 Afghanistan 2000. 2666 20595360
## 3 Brazil 1999. 37737 172006362
## 4 Brazil 2000. 80488 174504898
## 5 China 1999. 212258 1272915272
## 6 China 2000. 213766 1280428583
The result is 6 groups. This is not very useful in this particular example, because each observation belongs to its own group.
4.7.1 Grouped summaries
If we pipe the result of a group_by()
operation to the function summarise()
, the unit of analysis changes from the complete tibble
to each individual group. This is very useful, because we can compute useful stats and summaries on a per group basis.
The following example computes the total number of tuberculosis cases per year:
who_ref %>%
group_by(year) %>%
summarise(total_cases = sum(cases))
## # A tibble: 2 x 2
## year total_cases
## <dbl> <int>
## 1 1999. 250740
## 2 2000. 296920
The following example computes the average number of cases per country and the corresponding standard deviation:
who_ref %>%
group_by(country) %>%
summarise(avg_cases = mean(cases), st_dev = sd(cases))
## # A tibble: 3 x 3
## country avg_cases st_dev
## <chr> <dbl> <dbl>
## 1 Afghanistan 1706. 1358.
## 2 Brazil 59112. 30230.
## 3 China 213012. 1066.
Note how summarise collapses each group to a single row and drops columns not involved in the grouping process. In fact, if we use summarise()
without grouping first, it sees the entire tibble
as a single group and the result is a single row:
who_ref %>%
summarise(total_cases = sum(cases))
## # A tibble: 1 x 1
## total_cases
## <int>
## 1 547660
The above means that grouped summaries should be generated with functions that collapse their argument into a single value. Some useful summary functions are sum()
, mean()
, median()
, sd()
, min()
, max()
and n()
.
4.7.2 Grouped mutates
If we pipe the result of a group_by()
operation to the function mutate()
, we can create new variables on a per group basis.
Say, for example, that we want to compute the fraction of tuberculosis cases in a year from the total occurred in a country:
who_ref %>%
group_by(country) %>%
mutate(fraction = cases/sum(cases))
## # A tibble: 6 x 5
## # Groups: country [3]
## country year cases population fraction
## <chr> <dbl> <int> <int> <dbl>
## 1 Afghanistan 1999. 745 19987071 0.218
## 2 Afghanistan 2000. 2666 20595360 0.782
## 3 Brazil 1999. 37737 172006362 0.319
## 4 Brazil 2000. 80488 174504898 0.681
## 5 China 1999. 212258 1272915272 0.498
## 6 China 2000. 213766 1280428583 0.502
Note that grouped mutates do not collapse the input tibble
. Instead, they maintain the original number of samples and columns. As a result, functions used to create new variables in mutate()
should produce the same number of members within the group.
4.7.3 Grouped filters
If we pipe the result of a group_by()
operation to the function filter()
, we can remove observations (i.e. rows) on a per group basis.
For example, to find the year with the minimum number of tuberculosis cases in a country, we can do:
who_ref %>%
group_by(country) %>%
filter(cases == min(cases))
## # A tibble: 3 x 4
## # Groups: country [3]
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999. 745 19987071
## 2 Brazil 1999. 37737 172006362
## 3 China 1999. 212258 1272915272
Note that in this case, using summarise()
doesn’t work because we lose the information about the year
:
who_ref %>%
group_by(country) %>%
summarise(min_cases = min(cases))
## # A tibble: 3 x 2
## country min_cases
## <chr> <dbl>
## 1 Afghanistan 745.
## 2 Brazil 37737.
## 3 China 212258.
4.8 Case study
In this case study, you’re going to work with a dataset giving the monthly deaths from lung diseases in the UK between 1974 and 1979.
- Import file
datasets/uk_lung_deaths.csv
into variableuk
. - Tidy up the dataset. The goal is to have a
tibble
with three columns:year
,month
anddeaths
. - Use
select()
to re-accommodate the columns, such that they appear in the following order:month
,year
anddeaths
. - Data from 1974 is too old to be reliable, filter it out.
- Use the
%>%
to perform points 1-4 in a single go and assign the result to variableuk
. - Report the average number of deaths per month. Sort the result decreasingly by average deaths.
- Report the average and total number of deaths per year. Sort the result by year.
- Compute the percentage of deaths with which each month contributes to the total per year.
- For each year, report the month with the maximum number of deaths. Sort the result by year.
- Speculate about possible reasons why the months resulting from point 9 have the most deaths from lung diseases.