This is the second part for the lecture Analysis of ChIP-seq data. We will cover basic sequence motif analysis using some R Bioconductor packages. We will use again the ChIP-seq data set from the estrogen receptor (ER) and FoxA1 in MCF-7 breast cancer cell lines (Hurtado et al. 2011).

# 1 Get the sequences of ChIP-seq peak regions

## 1.1 Parse the ChIP-seq peaks for ER and FOXA1 as GRanges object.

We will the peaks form the previous exercise and parse the BED files as GRanges object.

require(rtracklayer)
ER <- import.bed("ER_peaks.bed")
FOXA1 <- import.bed("FOXA1_peaks.bed")

## 1.2 Get the sequences of the 100 most significant peaks.

We can order the peaks by their scores and than take the subset of the first 100 peaks.

# take only the subset of 100 peaks with the highest scores
ER.top <- ER[order(ER$score, decreasing = TRUE)][1:100] FOXA1.top <- FOXA1[order(FOXA1$score, decreasing = TRUE)][1:100]

Now we will use the package BSgenome.Hsapiens.UCSC.hg18 which contains the reference sequence of the human genome assembly hg18. Using the function getSeq() we can retrieve sequences for each region an a GRanges object as a DNAStringSet object.

require(BSgenome.Hsapiens.UCSC.hg18)    # human reference genome

# get DNA sequence of peaks (this might take some time)
ER.seq <- getSeq(Hsapiens, ER.top)
ER.seq
##   A DNAStringSet instance of length 100
##       width seq
##   [1]   716 GGTTTGAAAGTTGTATTTTGTTTATTAGGAAGGAAAGAAGTGG...GGAAAGATAAAATAATTGCTTTGGGTTGCTCTTCGGAAGCCT
##   [2]   431 TTTTAGCATCTGGGAATATCCGAGAGCTGGTGGCTCCTCCCTA...TTTCTTTTTTTTTTTTTTTTGGGATGGAGTTTCGCTCTTGTT
##   [3]   433 GTTTTTCTTGACTTCCTAGGAAAATGATTCCATGCCTTGCTTT...AGTTCCTTCCACAAAAATTGGGCAGCTTCTAGAATTTAACTC
##   [4]   707 CTCATGTAAGGCTAGACAGAAGAATTCCCAGTAACTTCCTTGT...GGAAACGGGATTTCTTCATATTCTGCTAGACAGAAGAATTCN
##   [5]   615 CAACAGTACCTAAAGTTTGCATAATAAATTATGATATTCACCG...GTGGAACATAAAAAGCACAGATAAGCAACATGTATCTGCAAT
##   ...   ... ...
##  [96]   515 GGAGTGGGTTGACTGCTAAGACAAAGGTGACACCAGTCAGGGA...GGGGAGAGGCGGGGGGGGGTCCTTCCACTATCTTGGGATTCT
##  [97]   749 ATTGCCCAGGCTGGTCTTGAATTCCTGGACTTAAGCGATCCTC...AATCAGGAAGCCCTGATTCTTGGAGTAAGAACTCTGTGGAAA
##  [98]   608 ACATTTAAAAGGTTACATTTATGATAATCACATCATACAAAGA...CATTTCACACAGTATCATTATGCCTTAGCTAGTCTTCATTTA
##  [99]   476 CTCCCAAAGTGCTGGGATTATAGGCATGAGCCACGATGTCGCG...TGTTTCTTATCTATGTACACTAACAACATGGGGCACACTGAT
## [100]   566 CATCTCTTGACCCTGAGTTTCTCTCTCAACCTGTATCTAGATG...AAACAGTGTAAAGTGATTCTTAGTAGAGCAAACAGATCCTAA
Do the same for the FOXA1 peaks.

FOXA1.seq <- getSeq(Hsapiens, FOXA1.top)

## 1.3 Write the sequences into into FASTA files.

The function writeXStringSet writes a DNAStringSet object into a FASTA output file.

# write sequences to fasta file
writeXStringSet(ER.seq, "./ER_seq.fasta")
writeXStringSet(FOXA1.seq, "./FOXA1_seq.fasta")

# 2 De novo motif discovery

## 2.1 Find enriched sequence motifs in the ER peak sequences using the RSAT webserver.

1. Go to the RSAT(Medina-Rivera et al. 2015; M. Thomas-Chollier et al. 2012) website http://teaching.rsat.eu/.
2. Select NGS - ChIP-seq on the left site and choose the peak-motifs tool.
3. Enter a title of your analysis (e.g. ER motif)
4. Upload the peak sequences using the button Durchsuchen….
5. Under motif discovery paramters select only Discover over-represented words.
6. Under output select dispaly and click GO.
7. After the calculation is finished (this might take some minutes) click on Small summary to see predicted motifs and associated TFs.

# 3 Motif matching

We will now use motif matrix models of known TF motifs to search for motif hits in the ChIP-seq peak sequences.

## 3.1 Is there a knwon motif model for the estorgen receptor (ESR1) in the JASPAR database?

1. Go to JASPAR(Mathelier et al. 2015) data base http://jaspar.genereg.net/ and select JASPAR CORE Vertebrata.
2. In the search field for Name write ESR1 (this is the gene name of estrogen receptor) and click SEARCH.
3. Click on the Logo plot to see a larger window.

Which typical features of TF recognition motifs do you observe?

## 3.2 Retrive the ER motif model from within R.

We will use the packages TFBSTools and JASPAR2014 to use the ER motif matrix.

# get ER motif from JASPAR
require(JASPAR2014) # load teh JASPAR2014 packabe
require(TFBSTools)

# search for "ESR1" by using species human (9606) and take the first result
pfm <- getMatrixSet(JASPAR2014, list(species=9606, name="ESR1"))[[1]]
pfm
## An object of class PFMatrix
## ID: MA0112.2
## Name: ESR1
## Matrix Class: Zinc-coordinating
## strand: +
## Tags:
## $alias ## [1] "NR3A1,Era" ## ##$comment
## [1] "-"
##
## $description ## [1] "estrogen receptor 1" ## ##$family
## [1] "Hormone-nuclear Receptor"
##
## $medline ## [1] "19339991" ## ##$pazar_tf_id
## [1] "TF0000189"
##
## $symbol ## [1] "ESR1" ## ##$tax_group
## [1] "vertebrates"
##
## $tfbs_shape_id ## [1] "108" ## ##$tfe_id
## [1] "138"
##
## $type ## [1] "ChiP-seq" ## ##$collection
## [1] "CORE"
##
## $species ## 9606 ## "Homo sapiens" ## ##$acc
## [1] "P03372"
##
## Background:
##    A    C    G    T
## 0.25 0.25 0.25 0.25
## Matrix:
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
## A  122  107   64   83  134  308   36   19   33     4   398    58    63    64    32    21   305
## C  120   80  173  229  232   28    8   18   41   394    13   250   276   258    19    22   106
## G  154  164  149   65   47   89  387  420   91    53    27   107    53    97     8   426    31
## T   71  117   82   93   57   48   43   18  310    24    37    60    83    56   416     6    33
##   [,18] [,19] [,20]
## A    10    59    26
## C   436   353   165
## G    12     2    22
## T    17    61   262

## 3.3 Plot the sequence logo plot for the ER matrix model.

We first need to convert the matrix to a information content matrix (function toICM()), than we can plot is using the function seqLogo().

# we convert the matrix to the information content using the toICM function.
icm <- toICM(pfm)

# plot as sequence logo plot
seqLogo(icm)

## 3.4 Find motif hits in the ER peak-sequences.

We can take one the ER sequences, lets say the third one, and run the motif search on it using the faction matchPWM().

matchPWM(as.matrix(pfm), ER.seq[[3]])
##   Views on a 433-letter DNAString subject
## subject: GTTTTTCTTGACTTCCTAGGAAAATGATTCCATGCCTTGCTTTG...TAAGTTCCTTCCACAAAAATTGGGCAGCTTCTAGAATTTAACTC
## views:
##     start end width
## [1]   241 260    20 [AACCCGGGTCAGGGAGACCT]

The result is a Views object on this DNA String that shows the coordinates and the sequence where the motif hit was found. We can search on the entire subset of ER ChIP-seq peak sequences using the lapply() function.

hitsER <- lapply(ER.seq, function(s) matchPWM(as.matrix(pfm), s, min.score="75%") )
hitsFOXA1 <- lapply(FOXA1.seq, function(s) matchPWM(as.matrix(pfm), s, min.score="75%") )

## 3.5 How many ER peaks have one or more motif hits

Using sapply we apply the function length() on all hits objects to get the number of motif hits for each peak.

countsER <- sapply(hitsER, length)
countsFOXA1 <- sapply(hitsFOXA1, length)

sum(countsER >= 1)
## [1] 58

## 3.6 Plot the number of ER motif hits per peak for ER and FOXA1 peak sequences.

# create a data.frame with the number of motif hits and the corresponding peak source group as columns.
plotDF <- data.frame(
motif_hits = c(countsER, countsFOXA1),
group = rep(c("ER peaks", "FOXA1 peaks"), c(length(countsER), length(countsFOXA1)))
)

require(ggplot2)
qplot(x=motif_hits, fill=group, data=plotDF, geom="bar", facets=group~.)

1. Repeat the motif matching part with the FOXA1 motif model from JASPAR.

2. Use different score threshold for motif scanning in the funciton matchPWM().

3. Plot the distance of motif hits to the peak summit for FOXA1 peaks.