This are the exercises part for the lecture Analysis of ChIP-seq data. We will not cover the raw read data analysis (quality control, read mapping, peak-calling) and rather start directly with some basic analysis on the level of already identified ChIP-seq peaks for two transcription factors. We will use a ChIP-seq data set from the estrogen receptor (ER) and FoxA1 in MCF-7 breast cancer cell lines (Hurtado et al. 2011).

1 Requirements

If not done already, installed recent versions of R or RStudio on your computers. Install additional R packages according to this guide: install_packages.html.

Additionally, please install the following package ChIPpeakAnno by executing the following lines.

# install the package "ChIPpeakAnno" from Bioconductor:
source("http://bioconductor.org/biocLite.R")
biocLite("ChIPpeakAnno")

2 Download and parse input data

The authors of (Hurtado et al. 2011) provide ChIP-seq peak data on their website http://www.carroll-lab.org.uk/data. The peaks for FOXA1 in MCF-7 cell are in this file http://www.carroll-lab.org.uk/FreshFiles/Data/Data_Sheet_3/MCF7_FOXA1%20binding.bed. This file is actually not a standard BED file but an output of the MACS peak caller. Bellow you see the first 25 lines of this file.

# This file is generated by MACS
# ARGUMENTS LIST:
# name = FoxA1_FM_MCF7
# format = BED
# ChIP-seq file = jc193_FOXA1_ChIP_MCF7_full_CRI01.bed
# control file = Input_Full_MCF7_CRI01.bed
# effective genome size = 2.70e+09
# tag size = 25
# band width = 300
# model fold = 32
# pvalue cutoff = 1.00e-05
# Ranges for calculating regional lambda are : peak_region,1000,5000,10000
# unique tags in treatment: 19195674
# total tags in treatment: 26475802
# unique tags in control: 7466901
# total tags in control: 7683557
# d = 130
chr start   end length  summit  tags    -10*log10(pvalue)   fold_enrichment FDR(%)
chr1    8250    8671    422 286 46  145.84  11.68   0.51
chr1    36382   36984   603 405 46  315.23  27.05   0.24
chr1    70082   70475   394 273 14  65.40   9.74    1.13
chr1    389664  390025  362 118 13  62.30   7.57    1.19
chr1    414083  414588  506 279 41  234.60  21.95   0.33
chr1    449089  450054  966 665 90  368.81  20.10   0.19
chr1    450217  450908  692 245 77  349.57  18.81   0.20

The first lines starting with # indicate parameter and input data for the peak calling. the rows bellow are peak coordinates. Each peak is in one row with the chromosome name, start and end coordinate, peak length, summit coordinate, number of reads (tags), a \(-\log_{10}()\) transformation of the p-value, fold enrichment over input, and the FDR.

2.1 Downlaod the data from within R

We will use the links on the website to directly download the peaks data into R

# read the input data directly from the URL
FOXA1.df <- read.table("http://www.carroll-lab.org.uk/FreshFiles/Data/Data_Sheet_3/MCF7_FOXA1%20binding.bed", header=TRUE)

# Alternativley, use the copy on our webserver:
#FOXA1.df <- read.table("http://cbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/MCF7_FOXA1_binding.bed", header=TRUE)

# show the first six lines of the data.frame
head(FOXA1.df)
##    chr  start    end length summit tags X.10.log10.pvalue. fold_enrichment FDR...
## 1 chr1   8250   8671    422    286   46             145.84           11.68   0.51
## 2 chr1  36382  36984    603    405   46             315.23           27.05   0.24
## 3 chr1  70082  70475    394    273   14              65.40            9.74   1.13
## 4 chr1 389664 390025    362    118   13              62.30            7.57   1.19
## 5 chr1 414083 414588    506    279   41             234.60           21.95   0.33
## 6 chr1 449089 450054    966    665   90             368.81           20.10   0.19

The function read.table reads an input file and transformed it into a data.frame, the basic data structure in R. Lines starting with # are ignored by default and the option header=TRUE indicates that the first line contains the column names.

2.2 Convert peaks into a GRanges object

A GRanges object form the GenomicRanges packages is a container for genomic locations and their associated annotations. We will use this data structure for the ChIP-seq peak coordinates.

require(GenomicRanges) # load the GenomicRanges pakcage

FOXA1 <- GRanges(
  FOXA1.df$chr,
  IRanges(FOXA1.df$star, FOXA1.df$end),
  strand="*"
)

# we can add more data to each peak subsequently
names(FOXA1) <- paste("FOXA1_peak", 1:nrow(FOXA1.df), sep="_")
score(FOXA1) <- FOXA1.df[,"X.10.log10.pvalue."]  
  
# show the first and last lines of the GRanges object
FOXA1
## GRanges object with 95314 ranges and 1 metadata column:
##                    seqnames               ranges strand   |     score
##                       <Rle>            <IRanges>  <Rle>   | <numeric>
##       FOXA1_peak_1     chr1     [  8250,   8671]      *   |    145.84
##       FOXA1_peak_2     chr1     [ 36382,  36984]      *   |    315.23
##       FOXA1_peak_3     chr1     [ 70082,  70475]      *   |      65.4
##       FOXA1_peak_4     chr1     [389664, 390025]      *   |      62.3
##       FOXA1_peak_5     chr1     [414083, 414588]      *   |     234.6
##                ...      ...                  ...    ... ...       ...
##   FOXA1_peak_95310     chrY [27226584, 27227719]      *   |     67.62
##   FOXA1_peak_95311     chrY [57228665, 57327117]      *   |      3100
##   FOXA1_peak_95312     chrY [57428755, 57429933]      *   |   1646.77
##   FOXA1_peak_95313     chrY [57436997, 57437537]      *   |     54.18
##   FOXA1_peak_95314     chrY [57551525, 57551838]      *   |     51.13
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

3 Basic number and statistics of the FOXA1 peaks

3.1 What is the number of peaks?

Use the function length().

length(FOXA1)
## [1] 95314

3.2 What is the mean, median, and max size of the peaks?

The function width() returns the sizes of all ranges in a GRanges object as vector.

FOXA1.size <- width(FOXA1)
summary(FOXA1.size)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   142.0   426.0   538.0   594.9   682.0 98450.0

3.3 Remove all peaks that are larger than 2kb

A GRanges object can be indexed using the [] operator similar than vectors.

FOXA1 <- FOXA1[width(FOXA1) <= 2000]

3.4 What is the distribution of peak sizes?

Plot the distribution of sizes using the hist() function.

hist(width(FOXA1), xlab="FOXA1 peak size", col="gray")

3.5 What is the distribution of FOXA1 ChIP-seq peak p-values?

We plot the histogram of \(-log_{10}\) transformed p-values. Remember that we stored them in the score column of the GRanges object. We can access them using the function score().

# get the -log_10 transformed p-values from the score column
mlPvals <- score(FOXA1)

hist(mlPvals, xlab="-log_10(p-value)", col="gray")

4 Compare the peaks of ER and FOXA1

4.1 Load the estrogen receptor peaks from BED file

The peaks for estrogen receptor (ER) are in BED file format. We will use the function import.bed from the rtracklayer package to directly import the peaks as GRanges object.

require(rtracklayer) # load the rtracklayer package
ER <- import.bed("http://www.carroll-lab.org.uk/FreshFiles/Data/Data_Sheet_3/MCF7_ER_binding.bed")

# Alternatively, use a copy from our webserver:
#ER <- import.bed("http://cbdm-01.zdv.uni-mainz.de/~jibnsale/teaching/MCF7_ER_binding.bed")

ER
## UCSC track 'MCF7'
## UCSCData object with 12999 ranges and 1 metadata column:
##           seqnames               ranges strand   |               name
##              <Rle>            <IRanges>  <Rle>   |        <character>
##       [1]     chr1   [ 846289,  846544]      *   |  0.114243427848363
##       [2]     chr1   [ 920943,  921363]      *   | 0.0839698748765652
##       [3]     chr1   [ 998507,  999838]      *   |  0.404335575562571
##       [4]     chr1   [1004633, 1005377]      *   |   0.19387301910821
##       [5]     chr1   [1049927, 1050719]      *   | 0.0508101787980254
##       ...      ...                  ...    ... ...                ...
##   [12995]     chrY [11948846, 11948929]      *   |  0.264330981850441
##   [12996]     chrY [57394536, 57394602]      *   |   0.22452241426891
##   [12997]     chrY [57395445, 57395564]      *   |  0.335700586987822
##   [12998]     chrY [57402653, 57402756]      *   |  0.260627565261645
##   [12999]     chrY [57405865, 57406005]      *   |  0.183393351786731
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

We fix the name and score columns.

# assign scores by converting the 'name' feald to type numeric
score(ER) <- as.numeric(ER$name)

# overwrite the name column
ER$name <- paste("ER_peaks", 1:length(ER), sep="_")

# use the names() function 
names(ER) <- ER$name 
ER
## UCSC track 'MCF7'
## UCSCData object with 12999 ranges and 2 metadata columns:
##                  seqnames               ranges strand   |           name      score
##                     <Rle>            <IRanges>  <Rle>   |    <character>  <numeric>
##       ER_peaks_1     chr1   [ 846289,  846544]      *   |     ER_peaks_1 0.11424343
##       ER_peaks_2     chr1   [ 920943,  921363]      *   |     ER_peaks_2 0.08396987
##       ER_peaks_3     chr1   [ 998507,  999838]      *   |     ER_peaks_3 0.40433558
##       ER_peaks_4     chr1   [1004633, 1005377]      *   |     ER_peaks_4 0.19387302
##       ER_peaks_5     chr1   [1049927, 1050719]      *   |     ER_peaks_5 0.05081018
##              ...      ...                  ...    ... ...            ...        ...
##   ER_peaks_12995     chrY [11948846, 11948929]      *   | ER_peaks_12995  0.2643310
##   ER_peaks_12996     chrY [57394536, 57394602]      *   | ER_peaks_12996  0.2245224
##   ER_peaks_12997     chrY [57395445, 57395564]      *   | ER_peaks_12997  0.3357006
##   ER_peaks_12998     chrY [57402653, 57402756]      *   | ER_peaks_12998  0.2606276
##   ER_peaks_12999     chrY [57405865, 57406005]      *   | ER_peaks_12999  0.1833934
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

4.2 Make a barplot of the number of peaks

bp <- barplot(c(length(ER), length(FOXA1)), names=c("ER", "FOXA1"))
# add actual values as text lables to the plot
text(bp, c(length(ER), length(FOXA1)), labels=c(length(ER), length(FOXA1)), pos=1)

4.3 How many ER peaks overlap FOXA1 peaks?

We can calculate the overlaps between two sets of regions with the findOverlaps function from the GenomicRanges package. To get the subset of ER peaks that overlap with the FOXA1 peaks we can use the ‘subsetByOverlaps’ function.

# compute overlaps
ovlHits <- findOverlaps(ER, FOXA1)

# show the resulting Hits object
ovlHits
## Hits object with 8188 hits and 0 metadata columns:
##          queryHits subjectHits
##          <integer>   <integer>
##      [1]         3          18
##      [2]         5          19
##      [3]         6          20
##      [4]        15          26
##      [5]        16          28
##      ...       ...         ...
##   [8184]     12962       94809
##   [8185]     12963       94809
##   [8186]     12964       94810
##   [8187]     12970       94817
##   [8188]     12973       94830
##   -------
##   queryLength: 12999
##   subjectLength: 95000
# get subsets of binding sites
ovl <- subsetByOverlaps(ER, FOXA1)

# number of overlaps
length(ovl)
## [1] 8176
# as percent
length(ovl) / length(ER) * 100
## [1] 62.89715

4.4 Make a Venn-diagram showing the overlap of of ER and FOXA1 peaks.

We first take the subset of peaks that are unique to ER and FOXA1 using the function setdiff.

# get subsets of binding sites
ER.uniq <- setdiff(ER, FOXA1)
FOXA1.uniq <- setdiff(FOXA1, ER)

Then we use the package Vennerable to build a Venn object with the number of peaks in each subset. This object can than be passed to the ‘plot’ function.

# plot overlap of binding sites as Venn diagram
require(Vennerable)

# build objects with the numbers of sites in the subsets
venn <- Venn(SetNames=c("ER", "FOXA1"), 
    Weight=c(
        '10'=length(ER.uniq), 
        '11'=length(ovl), 
        '01'=length(FOXA1.uniq)
    )
)

# plot Venn Diagram
plot(venn)

5 Functional annotation of ChIP-seq peaks

To understand the function of a transcription factor we want to know to which genes and genomic features it binds.

We will use a TxDb objects. Such an object is an R interface to prefabricated databases contained by specific annotation packages. The package TxDb.Hsapiens.UCSC.hg18.knownGene includes all human genes and transcripts from UCSC with coordinates for the hg18 genome assembly.

# load gene annotation from UCSC for human genome build hg18
require(TxDb.Hsapiens.UCSC.hg18.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg18.knownGene # just a shortcut

Now we want to assign each peak to a chromosomal region by using the feature data base.

5.1 Calculate overlap with genomic features like introns, exons, 5’UTR.

We will use the ChIPpeakAnno package. It has the function assignChromosomeRegion for calculating these overlaps by using a annotation database.

require(ChIPpeakAnno)   # laod package to annotate peaks with genes

# calucluate the overlap with features
ER.features <- assignChromosomeRegion(ER, TxDb=txdb, nucleotideLevel=FALSE)
# show the results
ER.features
## $percentage
## subjectHits
##               Exons             Introns            fiveUTRs           threeUTRs           Promoters 
##           10.970075           49.957689            3.823371            2.046311            6.062005 
## immediateDownstream   Intergenic.Region 
##            2.484807           44.618817 
## 
## $jaccard
##               Exons             Introns            fiveUTRs           threeUTRs           Promoters 
##         0.005124685         0.026022737         0.006642698         0.006018236         0.014054613 
## immediateDownstream   Intergenic.Region 
##         0.006266977         0.566050316

5.2 Plot the percentages of peaks overlapping each feature as pie-chart or barplot.

We can access the percent values using the $ operator. There are function pie() and barplot().

# make pie-chart
pie(ER.features$percentage)

# make barplot
bp <- barplot(ER.features$percentage, ylab="%")
text(bp, ER.features$percentage, signif(ER.features$percentage, 4), pos=1)

5.3 Get the subset of ER peaks that overlap a promoter region.

We can get all genes as GRanges object form the annotation database with the function genes(). The function promoters() returns ranges around the transcription start site, defined as start(x). We first calculate the promoters as GRanges object:

# get all genes from the data base as GRanges object
genes <- genes(txdb)

# take the the region arround the gene start as promoter
prom <- promoters(genes, upstream=2000, downstream=200)
prom
## GRanges object with 19742 ranges and 1 metadata column:
##         seqnames                 ranges strand   |     gene_id
##            <Rle>              <IRanges>  <Rle>   | <character>
##       1    chr19 [ 63565733,  63567932]      -   |           1
##      10     chr8 [ 18291035,  18293234]      +   |          10
##     100    chr20 [ 42713591,  42715790]      -   |         100
##    1000    chr18 [ 24010990,  24013189]      -   |        1000
##   10000     chr1 [242073008, 242075207]      -   |       10000
##     ...      ...                    ...    ... ...         ...
##    9991     chr9 [114135534, 114137733]      -   |        9991
##    9992    chr21 [ 34656193,  34658392]      +   |        9992
##    9993    chr22 [ 17489768,  17491967]      -   |        9993
##    9994     chr6 [ 90594334,  90596533]      +   |        9994
##    9997    chr22 [ 49310701,  49312900]      -   |        9997
##   -------
##   seqinfo: 49 sequences (1 circular) from hg18 genome
Now lets calculate the subset of ER peaks that overlap a promoter region.

# get only those ER peaks that overlap a promoter region
ERatProm <- subsetByOverlaps(ER, prom)

# subset size
length(ERatProm)
## [1] 684
# percent of all ER peaks
length(ERatProm) / length(ER) * 100
## [1] 5.261943

5.4 Get the subset of genes that have a peak in their promoter regions and write their gene IDs into an output file.

We can use again the function findOverlaps() to associate each peak to a promoter. In the subject column of the resulting hits object, we can find the indexes of genes that have an overlap with a promoter.

# We search for overlap between ER peaks and promoters
ERatProm.Hits = findOverlaps(ER, prom)

ERprom = genes[subjectHits(ERatProm.Hits)]

# take only unique ids
gene.ids <- unique(names(ERprom))

# write names to an output file
write.table(gene.ids, file="ER_regulated_genes.txt", quote=FALSE, row.names=FALSE, col.names=FALSE)

5.5 What is the average distances between peaks to the nearest TSS?

First we need to restrict the genes to only the start coordinate (TSS). Than we can use the function distanceToNearest from the GenomicRanges package to calculate the distances from peaks to TSS.

# get TSS as only the start coordinate of genes.
tss <- resize(genes, width=1, fix="start")

# calculate the distances from peaks to tss
d <- distanceToNearest(ER, tss)

# show object d
d
## Hits object with 12999 hits and 1 metadata column:
##           queryHits subjectHits   |  distance
##           <integer>   <integer>   | <integer>
##       [1]         1        3008   |      3848
##       [2]         2       13362   |      3969
##       [3]         3       11799   |     41760
##       [4]         4       11799   |     36221
##       [5]         5       11799   |      8327
##       ...       ...         ... ...       ...
##   [12995]     12995       14399   |   1334762
##   [12996]     12996        1500   |  31110510
##   [12997]     12997        1500   |  31111419
##   [12998]     12998        1500   |  31118627
##   [12999]     12999        1500   |  31121839
##   -------
##   queryLength: 12999
##   subjectLength: 19742

The result is a Hitsobject associating each peak (query) to a gene (subject) with a metadata column distance. We can access this column using the function mcols() which returns a DataFrame containing the distance column.

# show the metacolumns as DataFrame object
mcols(d)
## DataFrame with 12999 rows and 1 column
##        distance
##       <integer>
## 1          3848
## 2          3969
## 3         41760
## 4         36221
## 5          8327
## ...         ...
## 12995   1334762
## 12996  31110510
## 12997  31111419
## 12998  31118627
## 12999  31121839
# extract the distance column as vector
dist <- mcols(d)[,1]

# get the average distance in kb
mean(dist) * 10^-3
## [1] 108.0779

5.6 Associate each peak to the nearest gene within a range of 10 kb.

We can use the Hits object resulting from the distanceToNearest() function and use the distances to subset the peaks to those that are less than 10 Kb away from the TSS. Er can than use the function subjectHits() to get the genes associated to each peak.

# subset hits object by distance
close.Hits <- d[dist <= 10000,]

# show the subset 
close.Hits
## Hits object with 3004 hits and 1 metadata column:
##          queryHits subjectHits   |  distance
##          <integer>   <integer>   | <integer>
##      [1]         1        3008   |      3848
##      [2]         2       13362   |      3969
##      [3]         5       11799   |      8327
##      [4]         7        2111   |      4745
##      [5]         8        2111   |      5851
##      ...       ...         ... ...       ...
##   [3000]     12968       17054   |      1672
##   [3001]     12969       17058   |       151
##   [3002]     12970       17949   |         0
##   [3003]     12971       17949   |       235
##   [3004]     12972        7639   |      3080
##   -------
##   queryLength: 12999
##   subjectLength: 19742
# get the indeces of genes
ER.genes <- genes[subjectHits(close.Hits)]

5.7 Write the gene IDs to an output file.

# extract the vector of names from the GRanges object
gene.ids <- names(ER.genes)

# take only unique ids
gene.ids <- unique(gene.ids)

# write names to an output file
write.table(gene.ids, file="ER_regulated_genes.txt", quote=FALSE, row.names=FALSE, col.names=FALSE)

5.8 Write the ER and FOXA1 peak coordinates as BED file.

Use the function export.bed from the rtracklayer package.

export.bed(ER, "ER_peaks.bed")
export.bed(FOXA1, "FOXA1_peaks.bed")

6 OPTIONAL: Aditional Exercises

  1. Repeat the Functional annotation of ChIP-seq peaks with the FOXA1 peaks. Where do you find differences to ER?

  2. Does ER bind to the promoter of the p53 gene?
    • Find the Entrez Gene ID for the p53 cancer suppressor gene.
    • Check if this gene is an element of the set of genes with ER close to the promoter.
  3. Visuallize the ER and FOXA1 ChIP-seq peaks in the UCSC Genome Browser.


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.


References

Hurtado, Antoni, Kelly a Holmes, Caryn S Ross-Innes, Dominic Schmidt, and Jason S Carroll. 2011. “FOXA1 Is a Key Determinant of Estrogen Receptor Function and Endocrine Response.” Nature Genetics 43 (1). Nature Publishing Group: 27–33. doi:10.1038/ng.730.