Make sure that R and RStudio are installed on your computer. You can create a new R script in RStudio to copy example source code from this tutorial and write code for the exercises. Save the script in your home directory (or a subdirectory of your choice). To run the code on the current line (or the selected lines), click the Run button or press CTRL+ENTER. Results will be displayed either in RStudio’s console or the Viewer panel (graphics). To read the documentation about a function (Help panel), place the cursor on the function’s name in the script and press F1.
This tutorial is based on a powerful set of packages for data manipulation and visualization known as the tidyverse. We will also need the readxl package to enable reading data from Microsoft Excel files. So, make sure that those 2 packages (tidyverse and readxl) have been installed in your R environment (see RStudio’s menu Tools/Install Packages…).
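If you prefer the console over the menu, the same installation can be done with install.packages. A minimal sketch (one-time setup; the package names are the two used in this tutorial):
# Install the two packages used throughout this tutorial (one-time setup)
install.packages(c("tidyverse", "readxl"))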
Because of the ever-growing number of publicly available and large datasets in life sciences, computational and statistical skills are becoming a key part of the life scientist’s curriculum.
The goal of this tutorial is to give you a foundation in the most important tools for data analysis in R. You can then complement and expand your knowledge on your own in order to perform more complex analyses.
This tutorial is an adaptation of Dr. Alanis-Lobato’s tutorial that borrows materials and concepts from the book R for Data Science by Garrett Grolemund and Hadley Wickham.
Most data analysis projects follow the process shown below. The tidyverse set of packages includes functions to tackle each part of the data analysis process.
The first step in data science is to import your data into R in order to manipulate it with R functions. This means that you will take data stored in a file, database or website and load it into a variable in R. These datasets can be the output of a genomics project, the result of a survey, measurements done in the lab, etc. In this tutorial, you will use functions from the readr and readxl packages to load data tables into R.
Second, you will have to tidy up your data. Briefly, in a tidy dataset each row is an observation or a sample and each column is a variable. If your data is tidy and has a consistent structure, you will be able to use it seamlessly throughout the tidyverse functions for analysis, manipulation and visualization. In this tutorial, you will use functions from the tidyr package to tidy up your data.
Once your data is tidy, often times you need to transform it. Common transformations include normalization, the creation of new variables from existing ones (e.g. convert from inches to centimeters) or the computation of summary statistics (e.g. means and counts). In this tutorial, you will use functions from the dplyr package (the d comes from data-frame and plyr from “pliers”, the pincers for gripping small objects or bending wire) to perform data transformations.
Armed with tidy data that has the variables you need, you are ready for knowledge generation. Good data graphics will show you things that you did not know about the data; they will also show you things you did not expect, raise new questions or tell you that you are asking the wrong questions. In this tutorial, you will generate plots to gain a better understanding of your data. This will be done with functions from the ggplot2 package.
Models are complementary to visualizations. Models are mathematical representations of a system that are often derived from precise data analyses and observations. Models are key to making predictions that can later be tested in the lab.
Even though there are powerful tools in R for data modeling, in this tutorial you will only propose simple models based on visualizations.
It is important to stress that visualization and modeling will often give rise to new questions and hypotheses that bring you back to the data transformation step. With new variables and statistics, you can visualize again and fine-tune your model.
Surrounding all the above steps is programming, a critical part of any data analysis project. You do not need to be an expert programmer to analyze data, but programming skills allow you to automate common tasks and solve new problems more easily. There are hundreds of programming languages out there, but R and its rich collection of packages, functions and data are particularly useful when it comes to doing statistics and data analysis.
Cool graphics and precise models are of no use if you cannot communicate your results to others. Communication of methods and results can be done via oral or poster presentations, theses, scientific articles, reports, etc. There are tons of tools for the communication of data analyses, from PowerPoint or Word to web pages and R Markdown. These interactive notes have been written with the latter.
To motivate your interest in data analysis and give you a glimpse of how easy it is to do in the tidyverse, the following lines will guide you through an example project aimed at automating the classification of iris flowers.
The example will go through all the steps covered above and highlight some of the functions that you will learn and use in this tutorial. Do not freak out about the code or the data analysis itself. Everything will be explained in more detail in the following Parts.
The iris flower dataset was collected by Edgar Anderson, an American botanist, in the 1920s. This data was used by the statistician Ronald Fisher to demonstrate statistical methods of classification. You may know it already, as it is still quite popular among statistics and programming teachers.
The original dataset contains 150 samples from three iris species. You are going to work with a reduced version with only 100 samples from two species: Iris setosa and Iris virginica.
For each sample, Anderson recorded the sepal length and width, as well as the petal length and width:
Given an iris flower, is it possible to classify it as setosa or virginica based on these measurements?
Before starting our analysis, we need to load the tidyverse and readxl packages:
library(tidyverse)
library(readxl)
As shown below, calling some functions such as library may display warnings and technical messages on the console. Although they can provide useful information, we will focus on the ‘normal’ output in this tutorial (i.e. tables, figures and vectors).
# -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
# v ggplot2 3.2.1 v purrr 0.3.3
# v tibble 2.1.3 v dplyr 0.8.3
# v tidyr 1.0.2 v stringr 1.4.0
# v readr 1.3.1 v forcats 0.4.0
# -- Conflicts ------------------------------------------ tidyverse_conflicts() --
# x dplyr::filter() masks stats::filter()
# x dplyr::lag() masks stats::lag()
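If you find these startup messages distracting, they can be silenced. A minimal sketch using base R’s suppressPackageStartupMessages:
# Load the tidyverse without printing its startup messages
suppressPackageStartupMessages(library(tidyverse))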
We will import the iris.tsv data file. The file extension tells us that this is a “tab-separated values” file (.tsv) that stores a table of data. The readr package from the tidyverse provides function read_tsv to import such files into R. If the file is on the local computer, the function needs the file path (e.g. “datasets/iris.tsv”). If it is on the Internet, we provide the function with the URL.
# example if on local computer
# ir <- read_tsv("datasets/iris.tsv")
# Using a URL
ir <- read_tsv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/iris.tsv")
# Parsed with column specification:
# cols(
# ratio_sl_sw = col_character(),
# ratio_pl_pw = col_character(),
# species = col_character()
# )
Functions that load data usually display messages in the R console, like the ones shown above, to report which data type has been assigned to each column of the loaded table.
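If readr’s guess is not what you want, you can declare the column types yourself via the col_types parameter. A minimal sketch, reusing the iris.tsv URL from above and forcing all three columns to character:
# Explicitly declare each column's type instead of letting readr guess
ir <- read_tsv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/iris.tsv",
               col_types = cols(ratio_sl_sw = col_character(),
                                ratio_pl_pw = col_character(),
                                species = col_character()))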
Variable ir now contains the iris flower data table.
ir
Note that in the R console, only a text version of the table will be displayed.
Remember that tidy datasets report each sample in its own row and each variable in its own column. The iris dataset reports sepal length and width and petal length and width as ratios in columns ratio_sl_sw and ratio_pl_pw, respectively. We can tidy up this table with function separate from package tidyr:
ir <- ir %>%
  separate(col = ratio_sl_sw,
           into = c("sepal_length", "sepal_width"),
           sep = "/") %>%
  separate(col = ratio_pl_pw,
           into = c("petal_length", "petal_width"),
           sep = "/")
ir
Note that the width and length columns in the tidied iris dataset are of type character (chr) instead of numeric (dbl or int). If we want to compute sums or other statistics on this table, we need to transform these columns into the appropriate type. We can do this with function mutate from package dplyr:
ir <- ir %>%
  mutate(sepal_length = as.numeric(sepal_length),
         sepal_width = as.numeric(sepal_width),
         petal_length = as.numeric(petal_length),
         petal_width = as.numeric(petal_width))
Let us now plot histograms for each flower feature to see if there is one that can be used to classify the two iris species:
# sepal length
ir %>%
  ggplot(aes(x = sepal_length, fill = species)) +
  geom_histogram(binwidth = 0.15)
# sepal width
ir %>%
  ggplot(aes(x = sepal_width, fill = species)) +
  geom_histogram(binwidth = 0.15)
# petal length
ir %>%
  ggplot(aes(x = petal_length, fill = species)) +
  geom_histogram(binwidth = 0.15)
# petal width
ir %>%
  ggplot(aes(x = petal_width, fill = species)) +
  geom_histogram(binwidth = 0.15)
Thanks to the above plots, we know that we can use petal length and width to differentiate between a flower from the species Iris setosa and the species Iris virginica. Under the assumption that we are only dealing with flowers from these two species, let us use our observations to propose a simple model for iris flower classification:
# the model is implemented with a function that takes 2 parameters
which_species <- function(petal_length, petal_width){
  if (petal_length <= 3 && petal_width <= 1){
    print("Iris setosa")
  } else {
    print("Iris virginica")
  }
}
# We can call the function with 2 specific values to get an answer
which_species(petal_length=2, petal_width=0.5)
[1] "Iris setosa"
which_species(petal_length=10, petal_width=0.5)
[1] "Iris virginica"
In Part 1, you saw that the first step in any data analysis pipeline is importing your tables into R. You do this to take advantage of R’s functions, which simplify the manipulation of your data.
There are two packages in the tidyverse that make data import into R a matter of a single line of code.
Package readr provides functions to import data tables stored in plain-text files, such as “comma-separated value” (.csv) or “tab-separated value” (.tsv) files.
Package readxl provides functions to import data stored in Microsoft Excel files. These functions also let you import from specific sheets or ranges within the Excel file.
To demarcate the content of each cell in a file, people use special characters like commas, tabs, pipes, etc. The extension of plain-text files is often a good hint of what delimiter has been used. For example, csv files should use “commas” as the delimiter, whereas tsv files should use “tabs”. Nevertheless, you can find csv or tsv files that are delimited with different characters.
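If you are unsure which delimiter a file uses, a quick way to find out is to print its first few lines. A minimal sketch using base R’s readLines (the file path is just an illustration):
# Peek at the first three lines of a file to spot the delimiter
readLines("./datasets/who2.txt", n = 3)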
The first step in data import is locating the file in your computer or in a remote location (e.g. by URL). The most important parameter for readr’s functions is the path to the file of interest.
In this part, you are going to import files located on the Internet. Packages tidyverse and readxl must be loaded.
Let’s start by learning how to import a tsv file:
who_ref <- read_tsv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/who_ref.tsv")
who_ref
Importing a csv file is very similar:
who1 <- read_csv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/who1.csv")
who1
Note how simple the syntax for data import is: you choose the name of a variable to store your table, use the <- operator (written with 2 characters: < and -), call the appropriate readr function and specify the location of your file.
So far, we have used URLs to access the example datasets. However, you can also download them and work with the files locally. There is a folder available for download via this link: “http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/datasets.zip”. Download it, unzip it and save the files under “datasets”. You should also use an R script (or several scripts) to write the code for the upcoming examples. Put these scripts in an appropriate folder together with the “datasets” folder. If you then set your “Working Directory” to the location of your R scripts, you can access the example files with a direct path like “./datasets/andrade_lab.csv” (see the next small exercise).
To change your “Working Directory”, go to “Session” -> “Set Working Directory” -> “To source file location”
Read file ./datasets/andrade_lab.csv into variable al and show the content of the variable.
If the column delimiter of a file is not obvious from its extension and the person in charge of the file didn’t provide this information, you must first figure out which delimiter was used. For example, open the following file: ./datasets/who2.txt.
Since the delimiter is a -, we need to call read_delim to import:
who2 <- read_delim("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/who2.txt", delim = "-")
who2
Note that in this special case, you have to specify the delimiter using parameter delim.
From its extension, file ./datasets/andrade_lab.tsv is supposed to be tab-separated. Read the file with function read_tsv into variable al and show its contents. What character is the delimiter in this file? Use the appropriate readr function to import this table into R.
The syntax to import from Excel files is very similar to the one used above. By default, function read_excel imports from the first available sheet in the Excel workbook.
Load an Excel sheet via the following path in your local folder. Open the file with Excel as well to see its content.
./datasets/who_split.xlsx
who_cases <- read_excel("datasets/who_split.xlsx") # will load the first sheet, named 'cases'
who_cases
If you want data from a different sheet, you must use parameter sheet:
who_pop <- read_excel("datasets/who_split.xlsx",
                      sheet = "population")
who_pop
You can also specify a range to import from:
who_pop_99 <- read_excel("datasets/who_split.xlsx",
                         sheet = "population",
                         range = "A1:B4")
who_pop_99
Copy the following file to the datasets subfolder: ./datasets/andrade_lab.xlsx
Import the list of PhD students in Andrade Lab from the file. Choose a variable name to store the file contents.
Reading files with readr or readxl functions results in the creation of a variable of type tbl_df instead of R’s traditional data frame (type data.frame). This is because one of the unifying features of the tidyverse is the tibble (type tbl_df).
# Using base R function to import CSV files
who1_df <- read.csv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/who1.csv",
                    stringsAsFactors = TRUE)
class(who1_df)
[1] "data.frame"
# Using readr function to import CSV files
who1_tb <- read_csv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/who1.csv")
class(who1_tb)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Tibbles are data frames, but they tweak some old behaviors to make our life easier when analyzing data.
To start with, printing the content of a tibble shows a rich output that gives information about the data type of each column. By contrast, data frames just show the records as is:
who1_tb
Also, tibbles never change the type of the inputs. While column country is of type character in who1_tb, it may be imported as a factor in who1_df, depending on the R version and settings. For the sake of this demonstration, we used option stringsAsFactors=TRUE when importing the data frame with the base R function read.csv. Unless factors are required, we recommend setting this option to FALSE (stringsAsFactors=FALSE) when importing data with base R functions such as read.csv.
class(who1_tb$country)
[1] "character"
class(who1_df$country)
[1] "factor"
Factors can be problematic for some users (especially beginners): in the data frame, country cannot take values other than the ones already imported. If we want to append information about a new country to our table, the row will be added, but the country will be recorded as NA (Not Available):
# Adding a new row
tmp <- rbind(who1_df, c("Mexico", 1999, "cases", 100))
Warning: invalid factor level, NA generated
# Printing the bottom of the table (country is NA in the last row)
tmp %>% tail()
Since packages outside the tidyverse use data frames, you might want to coerce them to tibbles with function as_tibble:
who1_df <- as_tibble(who1_df)
In addition, it’s possible for a tibble to have column names that are not valid R variable names (valid names must follow specific rules). For example, they might not start with a letter, or they might contain unusual characters:
unusual <- read_tsv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/unusual.tsv")
unusual
To refer to these column names, you have to use back-ticks:
unusual$`:)`
[1] "smile"
Finally, you can create your own tibbles with function tibble:
stocks <- tibble(
  year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr = c(1, 2, 3, 4, 2, 3, 4),
  revenue = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks
The rationale behind the tidy data philosophy is to organize the data you imported to or created in R into a consistent format. The goal is to spend less time formatting your data and more time analyzing it. If your data is tidy you will be able to use it seamlessly across the tidyverse.
This part will show you how to use the functions from package tidyr to tidy up your data. As a result, data transformations and visualizations will be a real piece of cake.
There are three simple and interrelated rules that make a dataset tidy:
1. Each variable has its own column.
2. Each observation has its own row.
3. Each value has its own cell.
We will import some tables and check if they fulfill the above rules.
Import file ./datasets/who1.csv into variable who1 and show the imported data. Is the who1 dataset tidy?
Import file ./datasets/who2.txt into variable who2 and show the imported data. Is the who2 dataset tidy?
Import sheet population from file datasets/who_split.xlsx into variable who_pop and show the imported data. Is who_pop tidy?
Why ensure that your data is tidy? There are two main advantages:
1. Storing data in a consistent structure means you only have to learn the tools that work with that structure once, and you can reuse them across datasets.
2. Placing variables in columns allows R’s vectorized functions to shine, since most R functions work on vectors of values.
The principles of tidy data seem very obvious. Unfortunately, most data that you will encounter will be untidy. There are two main reasons:
1. Most people aren’t familiar with the principles of tidy data.
2. Data is often organized to facilitate some use other than analysis, e.g. to make data entry as easy as possible.
This means that for most real analyses, you will need to tidy your data.
When you are in front of a dataset, the first step is always to understand what the observations and the features that describe them are. The second step is to solve one of two common problems:
1. One observation is scattered across multiple rows.
2. One feature is spread across multiple columns.
To fix these problems, the tidyverse provides you with functions spread and gather.
You use function spread for Problem 1: one observation is scattered across multiple rows. That’s exactly the problem we saw in table who1:
who1
who1 is an excerpt of the World Health Organization Tuberculosis Report. Each observation, a country in a year in this case, should be accompanied by that year’s number of tuberculosis cases and population. However, each country-year pair appears twice in who1: once for the tuberculosis cases and once for the country’s population.
To tidy up this dataset, we need to spread column count into new columns, one for each of the keys specified in column type:
We can do this with function spread as follows:
who1_tidy <- who1 %>%
  spread(value = count, key = type)
who1_tidy
Note the syntax used to tidy up who1: key points to the column that contains the names of the new variables (type), while value points to the column that contains their values (count).
You use function gather for Problem 2: one feature is spread across multiple columns. That is exactly the problem we saw in, for example, table who_cases from the Excel file who_split.xlsx:
who_cases
The column names 1999 and 2000 are actually values of the variable year, which means that each row represents two observations instead of one.
To tidy this dataset, we need to gather columns 1999 and 2000 into a new pair of variables:
We can do this with function gather as follows:
who_cases_tidy <- who_cases %>%
  gather(`1999`, `2000`, key = "year", value = "cases")
who_cases_tidy
Note the syntax used to tidy up who_cases: the columns to gather (`1999` and `2000`, back-ticked because they are not valid R names) are listed first, key names the new column of variable names (“year”) and value names the new column of values (“cases”).
Consider the following tibble, which records heights and weights of 3 students:
students <- tibble(name = rep(c("Jonas", "Ines", "Hanna"), each = 2),
                   type = rep(c("height", "weight"), 3),
                   measure = c(1.83, 81, 1.75, 71, 1.69, 55))
students
Use the appropriate tidyr function on variable students to make it tidy.
Consider the following tibble, which records the expression of 4 genes at 3 different time points (d00, d02 and d04):
genes <- tibble(symbol = c("DMD", "MYOG", "MYF5", "MYOD1"),
                d00 = c(0.697, 0.844, 1.878, 1.622),
                d02 = c(1.986, 0.051, 0.887, 1.313),
                d04 = c(0.157, 0.774, 1.507, 0.628))
genes
Here, an observation should be a gene at a given time point. Use the appropriate tidyr function on variable genes to make it tidy.
When one column in a dataset contains two variables, we’ll need to use function separate to fix the issue. That’s exactly the problem we saw in table who2:
who2
Column rate contains both cases and population, so we need to split it in two:
We can do this with function separate as follows:
who2_tidy <- who2 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
who2_tidy
Note the syntax used to tidy up who2: the first argument is the column to split (rate), into gives the names of the new columns and sep indicates the character at which to split.
When a variable is stored in two separate columns and it is more convenient to combine them, we need to use function unite. Table who3 has this problem:
who3 <- read_tsv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/who3.tsv")
who3
Columns century and year can be combined into a single column called year:
We can do this with function unite as follows:
who3_tidy <- who3 %>%
  unite(century, year, col = "year", sep = "")
who3_tidy
Note the syntax used to tidy up who3: the columns to combine (century and year) are listed first, col names the resulting column and sep specifies the separator placed between the combined values (here, none).
Consider the following tibble, which records heights and weights of 3 students:
students <- tibble(name = c("Jonas", "Ines", "Hanna"),
                   ratio = c("81/1.83", "71/1.75", "55/1.69"))
students
Use the appropriate tidyr function on variable students to make it tidy.
Consider the following tibble, which records the price of 4 drugs:
drugs <- tibble(name = c("penicillin", "insuline", "aspirin", "lanoxin"),
                euros = c(13, 17, 5, 25),
                cents = c(81, 20, 14, 12))
drugs
Use the appropriate tidyr function on variable drugs to make it tidy.
As you know already, we channel the contents of a tibble to the different tidyr functions using the pipe operator: %>%.
Pipes are a powerful tool for clearly expressing the operation we want to perform on a variable. In addition, they can be used to apply a sequence of operations to a variable.
Table who3, for example, has two problems: cases and population are expressed as a rate and year is split in columns century and year:
who3
So, we clearly need two steps to tidy up this dataset:
who3_tidy <- who3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
who3_tidy <- who3_tidy %>%
  unite(century, year, col = "yyyy", sep = "")
who3_tidy
Thanks to the pipe, we can apply these two operations to who3 in one go:
who3_tidy <- who3 %>%
  separate(rate, into = c("cases", "population"), sep = "/") %>%
  unite(century, year, col = "yyyy", sep = "")
who3_tidy
Note that this process reads almost like natural language: take who3, then separate column rate into cases and population, then unite columns century and year into yyyy.
Even though writing the above pipeline on a single line is still valid, it is good practice to start a new line after every %>% for clarity.
Once you have tidied your dataset of interest, you will often need to create some new variables or summaries, or to reorder the observations to make the data easier to work with.
This part will show you how to use the functions from package dplyr to transform your data and understand it better.
Throughout this part, you are going to use a tidied up excerpt from the World Health Organization Global Tuberculosis Report. We have been working with this dataset, so you should be familiar with it:
who_ref <- read_tsv("http://cbdm-01.zdv.uni-mainz.de/~stalbrec/RcourseData/who_ref.tsv")
who_ref
The tibble reports the number of tuberculosis cases in 1999 and 2000 for three different countries, as well as their population.
To sort observations by one or more columns, dplyr offers function arrange. This function takes a tibble and a column name to order by:
who_ref %>%
  arrange(year)
If you provide more columns, arrange will first sort by the first one, then the second one and so on. This process breaks ties in the values of the preceding columns:
who_ref %>%
  arrange(year, cases, population)
To sort in descending order, you have to use desc:
who_ref %>%
  arrange(desc(country), cases)
Function select allows you to focus on variables you are really interested in by removing unwanted columns:
who_ref %>%
  select(country, cases)
You can also use the : operator to select a group of columns:
who_ref %>%
  select(year:population)
If you put a minus sign before columns in select, it means that you want to discard such variables:
who_ref %>%
  select(-population)
This, in combination with the : operator, is especially useful when your dataset has hundreds or even thousands of columns and you want to focus only on a few:
who_ref %>%
  select(-(year:population))
If a column in your tibble has a strange or non-informative name, you can use function rename to solve this issue:
who_ref %>%
  rename(tuberculosis_cases = cases)
Function filter is one of the most useful tools in package dplyr as it allows you to subset observations based on their values. filter takes the contents of a tibble and its arguments are logical expressions to filter it.
For example, we can focus on tuberculosis cases in Brazil as follows:
who_ref %>%
  filter(country == "Brazil")
To retrieve Chinese cases that occurred after 1999, we do:
who_ref %>%
  filter(country == "China" & year > 1999)
To subset observations from Brazil or China:
who_ref %>%
  filter(country == "Brazil" | country == "China")
Note that we use the comparison operators >, <, >=, <=, ==, != (not equal) to specify the column value or range of values we want to focus on. In addition, we use the logical operators & (and) and | (or) to combine multiple conditions passed on to filter.
You can negate conditions with the ! operator:
who_ref %>%
  filter(!(country == "Brazil"))
Basically, any operation that generates a logical vector can be used within filter to subset your tibble.
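As an illustration (this particular filter is our own example, not part of the original tutorial), the logical vector can even be computed from the data itself:
# Keep only the rows whose number of cases is above the overall mean
who_ref %>%
  filter(cases > mean(cases))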
Besides selecting existing columns, it is often useful to create new ones that are functions of existing variables. dplyr offers function mutate to add new columns at the end of your dataset.
For example, it might be tempting to say that Afghanistan has better programmes against tuberculosis than Brazil and China. However, Afghanistan’s population is smaller. Let’s look at the number of cases per 10 thousand individuals to get a better picture:
who_ref %>%
  mutate(cases_per_10k = cases/(population/10000))
mutate can be used to create one or more variables at once:
who_ref %>%
  mutate(cases_per_10k = cases/(population/10000),
         thousand_cases = cases/1000)
It is often the case that some records in your dataset belong to certain groups. For example, we can group data in who_ref by country or by year:
who_ref %>%
  group_by(country)
who_ref %>%
  group_by(year)
Note that grouping by country results in 3 groups, whereas grouping by year results in 2.
It is also possible to group by multiple variables:
who_ref %>%
  group_by(country, year)
The result is 6 groups. This is not very useful in this particular example, because each observation belongs to its own group.
If we pipe the result of a group_by operation to the function summarise, the unit of analysis changes from the complete tibble to each individual group. This is very useful, because we can compute useful statistics and summaries on a per group basis.
The following example computes the total number of tuberculosis cases per year:
who_ref %>%
  group_by(year) %>%
  summarise(total_cases = sum(cases))
The following example computes the average number of cases per country and the corresponding standard deviation:
who_ref %>%
  group_by(country) %>%
  summarise(avg_cases = mean(cases), st_dev = sd(cases))
Note how summarise collapses each group to a single row and drops columns not involved in the grouping process. In fact, if we use summarise without grouping first, it sees the entire tibble as a single group and the result is a single row:
who_ref %>%
  summarise(total_cases = sum(cases))
The above means that grouped summaries should be generated with functions that collapse their argument into a single value. Some useful summary functions are sum, mean, median, sd, min, max and n.
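As a short sketch of the n function (our own example), we can count how many records each country has:
# Count the number of rows (country-year records) per country
who_ref %>%
  group_by(country) %>%
  summarise(n_records = n())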
If we pipe the result of a group_by operation to the function mutate, we can create new variables on a per group basis.
Say, for example, that we want to compute the fraction of tuberculosis cases in a year from the total occurred in a country over all the years:
who_ref %>%
  group_by(country) %>%
  mutate(fraction = cases/sum(cases))
Note that grouped mutates do not collapse the input tibble. They maintain the original number of samples and columns. As a result, functions used to create new variables in mutate should produce the same number of members within the group.
If we pipe the result of a group_by operation to the function filter, we can remove observations (i.e. rows) on a per group basis.
For example, to find the year with the minimum number of tuberculosis cases in a country, we can do:
who_ref %>%
  group_by(country) %>%
  filter(cases == min(cases))
As shown below, using summarise doesn’t work in this case because we lose the information about the year:
who_ref %>%
  group_by(country) %>%
  summarise(min_cases = min(cases))
In this case study, you’re going to work with a dataset giving the monthly deaths from lung diseases in the UK between 1974 and 1979. Read file ./datasets/uk_lung_deaths.csv into variable uk.
The best way to show you the power of visualization for knowledge generation is by means of an example. In this part, you will use a dataset comprised of subjects who were part of a study aimed at understanding the prevalence of diabetes in central Virginia, USA. The variables describing the subjects are:
After tidying and transforming the “Diabetes Dataset”, you will visualize relationships between variables and feature distributions with functions from package ggplot2 in order to propose a simple model for susceptibility to diabetes.
You must load the tidyverse package (ggplot2 is part of it).
Read file ./datasets/diabetes.tsv into variable diab. The result should look like the following table:
In order to plot with package ggplot2, we need the %>% operator to channel information into function ggplot. This creates a blank canvas on top of which we add layers of information and polish our plot. Note that the ggplot functions are chained with the + operator!
diab %>%
  ggplot(aes(x = age, y = chol)) +
  geom_point() +
  xlab("Subject's age") +
  ylab("Cholesterol Level")
The above example feeds function ggplot with the diab table, indicates that we want to compare age with cholesterol levels (aes stands for aesthetics), uses geom_point to generate a scatter plot with the data and polishes the plot by providing custom axis labels (xlab and ylab).
You will see that following this syntax allows you to generate many different types of plots. It is just a matter of changing the layer function or adding more layers to your plot:
diab %>%
  ggplot(aes(x = age, y = chol)) +
  geom_point(shape = 23, color = 'red') +
  geom_smooth() +
  xlab("Subject's age") +
  ylab("Cholesterol Level")
See how the above example adds a geom_smooth layer on top of geom_point. Also, note that it is possible to further polish the plot via parameters that are specific to each type of layer. For example, we can change the color and shape of points through parameters color and shape of function geom_point.
Bar plots are useful when we want to graphically show counts of a discrete or categorical variable. For example, to see the number of male and female subjects in diab we do:
diab %>%
  ggplot(aes(x = gender)) +
  geom_bar(width = 0.5) +
  xlab("Gender") +
  ylab("Number of subjects")
Note how the width parameter of geom_bar has been used to plot narrow bars.
We can also see the number of diabetic and healthy subjects:
diab %>%
  ggplot(aes(x = diagnosis)) +
  geom_bar(width = 0.5) +
  xlab("Diagnosis") +
  ylab("Number of subjects")
Finally, we can also use the y-axis of bar plots to show information other than counts. For example, we can contrast the average cholesterol levels of men and women:
diab %>%
  group_by(gender) %>%
  summarise(avg_chol = mean(chol)) %>%
  ggplot(aes(x = gender, y = avg_chol)) +
  geom_col(width = 0.5) +
  xlab("Gender") +
  ylab("Avg. cholesterol")
Note how data transformations can be applied on the fly and piped to ggplot.
Scatter plots allow us to study the relationships between two numerical variables and regressions try to find the curve that best describes how x is related to y.
Let’s see how weight and waist measurements are related:
diab %>%
  ggplot(aes(x = weight, y = waist)) +
  geom_point() +
  geom_smooth() +
  xlab("Subject's weight") +
  ylab("Waist measurement")
In the above example, geom_point is in charge of the scatter plot and geom_smooth finds a curve that follows the points and summarizes the observed trend. By default, a standard error grey band is shown to indicate how much we can trust the blue regression curve.
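If the grey band is not wanted, geom_smooth can omit it via its se parameter. A minimal sketch of the same plot without the band:
# Same scatter plot, but with the standard error band switched off
diab %>%
  ggplot(aes(x = weight, y = waist)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  xlab("Subject's weight") +
  ylab("Waist measurement")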
If we want to have more control over the type of curve that is fitted to the points, we can use the method parameter. For example, to fit a straight line to the above plot, we have to use a linear regression model (“lm”):
diab %>%
  ggplot(aes(x = weight, y = waist)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("Subject's weight") +
  ylab("Waist measurement")
Even though a scatter plot is intrinsically in 2 dimensions, we can add more layers of information to learn more about our data. For example, we can color our points according to the subject’s diagnosis and adjust their size by glyhb:
diab %>%
  ggplot(aes(x = weight, y = waist, color = diagnosis, size = glyhb)) +
  geom_point() +
  xlab("Subject's weight") +
  ylab("Waist measurement")
Line plots are similar to scatter plots, but the variable in the x-axis is sorted and points are connected in order with a line:
diab %>%
  ggplot(aes(x = waist, y = hip)) +
  geom_line() +
  xlab("Waist measurement") +
  ylab("Hip measurement")
It is very common to add a point layer on top of line plots:
diab %>%
  ggplot(aes(x = waist, y = hip)) +
  geom_line() +
  geom_point() +
  xlab("Waist measurement") +
  ylab("Hip measurement")
Box plots are very powerful statistical tools that summarize the most important aspects of a variable’s distribution: the median (the line inside the box), the first and third quartiles (the bounds of the box), the whiskers (which extend to the most extreme values within 1.5 times the interquartile range) and any outliers beyond them (plotted as individual points).
The x-axis in a box plot is usually a categorical variable and the y-axis is normally a numeric variable whose distribution we want to explore. For example, let’s visualize the distribution of High Density Lipoprotein values for males and females in diab:
diab %>%
  ggplot(aes(x = gender, y = hdl)) +
  geom_boxplot() +
  xlab("gender") +
  ylab("HDL level")
Histograms work on a single variable and show how it is distributed by dividing its range into a certain number of bins. Thus, the x-axis reports the values that the variable can take (binned) and the y-axis reports the number or fraction of observations that take such values.
Let’s look at the distribution of Stabilized glucose values, specifying the width of the histogram bins with parameter binwidth:
diab %>%
  ggplot(aes(x = stab.glu)) +
  geom_histogram(binwidth = 15) +
  xlab("Stabilised glucose") +
  ylab("Counts")
We can also put two histograms in the same plot:
diab %>%
  group_by(diagnosis) %>%
  ggplot(aes(x = stab.glu, fill = diagnosis)) +
  geom_histogram(binwidth = 15) +
  xlab("Stabilised glucose") +
  ylab("Counts")
Density plots are like regressions but for histograms. Density plots try to find the curve that best summarizes the distribution of a variable and report probability densities instead of counts. To control the smoothing of the density plot, we use parameter adjust:
diab %>%
  ggplot(aes(x = stab.glu)) +
  geom_density(adjust = 0.7, fill = 'grey') +
  xlab("Stabilised glucose") +
  ylab("Density")
ggplot2’s facets are a powerful feature that subdivides a plot into subplots according to given variables (categories). There are 2 main functions to create the subplots: facet_wrap and facet_grid.
For example, the density plot can be subdivided by gender (it also works for other types of plots):
diab %>%
  ggplot(aes(x = stab.glu)) +
  geom_density(adjust = 0.7, fill = 'grey') +
  facet_wrap(vars(gender)) +
  xlab("Stabilised glucose") +
  ylab("Density")
You can provide several columns to facet_wrap, such as gender and location:
diab %>%
  ggplot(aes(x = stab.glu)) +
  geom_density(adjust = 0.7, fill = 'grey') +
  facet_wrap(vars(gender, location)) +
  xlab("Stabilised glucose") +
  ylab("Density")
You can combine facets with the fill parameter of aes (no longer using fill in geom_density) to create subplots by gender and location, and to compare the diagnoses in each subplot:
diab %>%
  ggplot(aes(x = stab.glu, fill = diagnosis)) +
  geom_density(adjust = 0.7) +
  facet_wrap(vars(gender, location)) +
  xlab("Stabilised glucose") +
  ylab("Density")
Finally, facets can be displayed as a grid using facet_grid. Using the example below, one can compare the distributions of male and female HDL levels in relation to the diagnosis and location:
diab %>%
  ggplot(aes(x = gender, y = hdl, fill = gender)) +
  geom_boxplot() +
  facet_grid(rows = vars(location), cols = vars(diagnosis)) +
  xlab("gender") +
  ylab("HDL level")
In Exercise 5.1, you used Glycosylated haemoglobin levels above 7.0 to diagnose diabetes. The histograms from the previous section showed that diabetics have higher levels of Stabilized glucose compared to healthy subjects.
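As a closing sketch, this is one way such a diagnosis column could be derived with dplyr (the 7.0 threshold comes from Exercise 5.1; treat this as an illustration, not the tutorial’s official solution):
# Label each subject according to the glycosylated haemoglobin threshold
diab %>%
  mutate(diagnosis = if_else(glyhb > 7.0, "diabetic", "healthy"))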