This is the second part for the lecture Analysis of ChIP-seq data. We will cover basic sequence motif analysis using some R Bioconductor packages. We will use again the ChIP-seq data set from the estrogen receptor (ER) and FoxA1 in MCF-7 breast cancer cell lines (Hurtado et al. 2011).

1 Get the sequences of ChIP-seq peak regions

1.1 Parse the ChIP-seq peaks for ER and FOXA1 as GRanges object.

We will the peaks form the previous exercise and parse the BED files as GRanges object.

require(rtracklayer)
ER <- import.bed("ER_peaks.bed")
FOXA1 <- import.bed("FOXA1_peaks.bed")

1.2 Get the sequences of the 100 most significant peaks.

We can order the peaks by their scores and than take the subset of the first 100 peaks.

# take only the subset of 100 peaks with the highest scores
ER.top <- ER[order(ER$score, decreasing = TRUE)][1:100]
FOXA1.top <- FOXA1[order(FOXA1$score, decreasing = TRUE)][1:100]

Now we will use the package BSgenome.Hsapiens.UCSC.hg18 which contains the reference sequence of the human genome assembly hg18. Using the function getSeq() we can retrieve sequences for each region an a GRanges object as a DNAStringSet object.

require(BSgenome.Hsapiens.UCSC.hg18)    # human reference genome

# get DNA sequence of peaks (this might take some time)
ER.seq <- getSeq(Hsapiens, ER.top)
ER.seq
##   A DNAStringSet instance of length 100
##       width seq
##   [1]   716 GGTTTGAAAGTTGTATTTTGTTTATTAGGAAGGAAAGAAGTGG...GGAAAGATAAAATAATTGCTTTGGGTTGCTCTTCGGAAGCCT
##   [2]   431 TTTTAGCATCTGGGAATATCCGAGAGCTGGTGGCTCCTCCCTA...TTTCTTTTTTTTTTTTTTTTGGGATGGAGTTTCGCTCTTGTT
##   [3]   433 GTTTTTCTTGACTTCCTAGGAAAATGATTCCATGCCTTGCTTT...AGTTCCTTCCACAAAAATTGGGCAGCTTCTAGAATTTAACTC
##   [4]   707 CTCATGTAAGGCTAGACAGAAGAATTCCCAGTAACTTCCTTGT...GGAAACGGGATTTCTTCATATTCTGCTAGACAGAAGAATTCN
##   [5]   615 CAACAGTACCTAAAGTTTGCATAATAAATTATGATATTCACCG...GTGGAACATAAAAAGCACAGATAAGCAACATGTATCTGCAAT
##   ...   ... ...
##  [96]   515 GGAGTGGGTTGACTGCTAAGACAAAGGTGACACCAGTCAGGGA...GGGGAGAGGCGGGGGGGGGTCCTTCCACTATCTTGGGATTCT
##  [97]   749 ATTGCCCAGGCTGGTCTTGAATTCCTGGACTTAAGCGATCCTC...AATCAGGAAGCCCTGATTCTTGGAGTAAGAACTCTGTGGAAA
##  [98]   608 ACATTTAAAAGGTTACATTTATGATAATCACATCATACAAAGA...CATTTCACACAGTATCATTATGCCTTAGCTAGTCTTCATTTA
##  [99]   476 CTCCCAAAGTGCTGGGATTATAGGCATGAGCCACGATGTCGCG...TGTTTCTTATCTATGTACACTAACAACATGGGGCACACTGAT
## [100]   566 CATCTCTTGACCCTGAGTTTCTCTCTCAACCTGTATCTAGATG...AAACAGTGTAAAGTGATTCTTAGTAGAGCAAACAGATCCTAA
Do the same for the FOXA1 peaks.

FOXA1.seq <- getSeq(Hsapiens, FOXA1.top)

1.3 Write the sequences into into FASTA files.

The function writeXStringSet writes a DNAStringSet object into a FASTA output file.

# write sequences to fasta file
writeXStringSet(ER.seq, "./ER_seq.fasta")
writeXStringSet(FOXA1.seq, "./FOXA1_seq.fasta")

2 De novo motif discovery

2.1 Find enriched sequence motifs in the ER peak sequences using the RSAT webserver.

  1. Go to the RSAT(Medina-Rivera et al. 2015; M. Thomas-Chollier et al. 2012) website http://teaching.rsat.eu/.
  2. Select NGS - ChIP-seq on the left site and choose the peak-motifs tool.
  3. Enter a title of your analysis (e.g. ER motif)
  4. Upload the peak sequences using the button Durchsuchen….
  5. Under motif discovery paramters select only Discover over-represented words.
  6. Under output select dispaly and click GO.
  7. After the calculation is finished (this might take some minutes) click on Small summary to see predicted motifs and associated TFs.

2.2 Which motifs could you identify?

2.3 Repeat the motif discovery for FOXA1. Which motif to you find?

3 Motif matching

We will now use motif matrix models of known TF motifs to search for motif hits in the ChIP-seq peak sequences.

3.1 Is there a knwon motif model for the estorgen receptor (ESR1) in the JASPAR database?

  1. Go to JASPAR(Mathelier et al. 2015) data base http://jaspar.genereg.net/ and select JASPAR CORE Vertebrata.
  2. In the search field for Name write ESR1 (this is the gene name of estrogen receptor) and click SEARCH.
  3. Click on the Logo plot to see a larger window.

Which typical features of TF recognition motifs do you observe?

3.2 Retrive the ER motif model from within R.

We will use the packages TFBSTools and JASPAR2014 to use the ER motif matrix.

# get ER motif from JASPAR
require(JASPAR2014) # load teh JASPAR2014 packabe
require(TFBSTools)

# search for "ESR1" by using species human (9606) and take the first result
pfm <- getMatrixSet(JASPAR2014, list(species=9606, name="ESR1"))[[1]]
pfm
## An object of class PFMatrix
## ID: MA0112.2
## Name: ESR1
## Matrix Class: Zinc-coordinating
## strand: +
## Tags: 
## $alias
## [1] "NR3A1,Era"
## 
## $comment
## [1] "-"
## 
## $description
## [1] "estrogen receptor 1"
## 
## $family
## [1] "Hormone-nuclear Receptor"
## 
## $medline
## [1] "19339991"
## 
## $pazar_tf_id
## [1] "TF0000189"
## 
## $symbol
## [1] "ESR1"
## 
## $tax_group
## [1] "vertebrates"
## 
## $tfbs_shape_id
## [1] "108"
## 
## $tfe_id
## [1] "138"
## 
## $type
## [1] "ChiP-seq"
## 
## $collection
## [1] "CORE"
## 
## $species
##           9606 
## "Homo sapiens" 
## 
## $acc
## [1] "P03372"
## 
## Background: 
##    A    C    G    T 
## 0.25 0.25 0.25 0.25 
## Matrix: 
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
## A  122  107   64   83  134  308   36   19   33     4   398    58    63    64    32    21   305
## C  120   80  173  229  232   28    8   18   41   394    13   250   276   258    19    22   106
## G  154  164  149   65   47   89  387  420   91    53    27   107    53    97     8   426    31
## T   71  117   82   93   57   48   43   18  310    24    37    60    83    56   416     6    33
##   [,18] [,19] [,20]
## A    10    59    26
## C   436   353   165
## G    12     2    22
## T    17    61   262

3.3 Plot the sequence logo plot for the ER matrix model.

We first need to convert the matrix to a information content matrix (function toICM()), than we can plot is using the function seqLogo().

# we convert the matrix to the information content using the toICM function.
icm <- toICM(pfm)

# plot as sequence logo plot
seqLogo(icm)

3.4 Find motif hits in the ER peak-sequences.

We can take one the ER sequences, lets say the third one, and run the motif search on it using the faction matchPWM().

matchPWM(as.matrix(pfm), ER.seq[[3]])
##   Views on a 433-letter DNAString subject
## subject: GTTTTTCTTGACTTCCTAGGAAAATGATTCCATGCCTTGCTTTG...TAAGTTCCTTCCACAAAAATTGGGCAGCTTCTAGAATTTAACTC
## views:
##     start end width
## [1]   241 260    20 [AACCCGGGTCAGGGAGACCT]

The result is a Views object on this DNA String that shows the coordinates and the sequence where the motif hit was found. We can search on the entire subset of ER ChIP-seq peak sequences using the lapply() function.

hitsER <- lapply(ER.seq, function(s) matchPWM(as.matrix(pfm), s, min.score="75%") )
hitsFOXA1 <- lapply(FOXA1.seq, function(s) matchPWM(as.matrix(pfm), s, min.score="75%") )

3.5 How many ER peaks have one or more motif hits

Using sapply we apply the function length() on all hits objects to get the number of motif hits for each peak.

countsER <- sapply(hitsER, length)
countsFOXA1 <- sapply(hitsFOXA1, length)

sum(countsER >= 1)
## [1] 58

3.6 Plot the number of ER motif hits per peak for ER and FOXA1 peak sequences.

# create a data.frame with the number of motif hits and the corresponding peak source group as columns.
plotDF <- data.frame(
    motif_hits = c(countsER, countsFOXA1),
    group = rep(c("ER peaks", "FOXA1 peaks"), c(length(countsER), length(countsFOXA1))) 
)

# load the ggplot2 package
require(ggplot2)
qplot(x=motif_hits, fill=group, data=plotDF, geom="bar", facets=group~.)

4 OPTIONAL: Aditionall Exercises

  1. Repeat the motif matching part with the FOXA1 motif model from JASPAR.

  2. Use different score threshold for motif scanning in the funciton matchPWM().

  3. Plot the distance of motif hits to the peak summit for FOXA1 peaks.


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.


5 References

Hurtado, Antoni, Kelly a Holmes, Caryn S Ross-Innes, Dominic Schmidt, and Jason S Carroll. 2011. “FOXA1 Is a Key Determinant of Estrogen Receptor Function and Endocrine Response.” Nature Genetics 43 (1). Nature Publishing Group: 27–33. doi:10.1038/ng.730.

Mathelier, Anthony, Oriol Fornes, David J. Arenillas, Chih-yu Chen, Grégoire Denay, Jessica Lee, Wenqiang Shi, et al. 2015. “JASPAR 2016: A Major Expansion and Update of the Open-Access Database of Transcription Factor Binding Profiles.” Nucleic Acids Research 44 (November 2015): gkv1176. doi:10.1093/nar/gkv1176.

Medina-Rivera, a., M. Defrance, O. Sand, C. Herrmann, J. a. Castro-Mondragon, J. Delerce, S. Jaeger, et al. 2015. “RSAT 2015: Regulatory Sequence Analysis Tools.” Nucleic Acids Research, 1–7. doi:10.1093/nar/gkv362.

Thomas-Chollier, Morgane, Carl Herrmann, Matthieu Defrance, Olivier Sand, Denis Thieffry, and Jacques van Helden. 2012. “RSAT Peak-Motifs: Motif Analysis in Full-Size ChIP-Seq Datasets.” Nucleic Acids Research 40 (4): e31. doi:10.1093/nar/gkr1104.