CoCalc Public Files25_Motifs-III / 25_Motif-III.ipynb
Views : 35
Description: Jupyter notebook 25_Motifs-III/25_Motif-III.ipynb
Compute Environment: Ubuntu 18.04 (Deprecated)

## Motif discovery analysis - III

#### Motif analysis for ChIP-Seq data

In the previous labs, we have learned how to analyze ChIP-Seq data such as read alignment, peak identification and visualization. In this lab, we are going to learn another important downstream analysis for the ChIP-Seq data, which is to find enriched motifs in ChIP-Seq peaks.

Why we want to do motif analysis for ChIP-Seq data? There are several reasons:

(1). The motif analysis can be used to validate ChIP-Seq experiment. If you are doing a ChIP-Seq experiment for a transcription factor with known binding motifs, you would expect to identify those motifs enriched in the ChIP-Seq peaks. For example, it is known that transcription factor Foxa2 binds to motif "GTAAACA". Then motif analysis of Foxa2 ChIP-Seq experiment should identify "GTAAACA" as one of the enriched motif in the peaks. Otherwise, the quality of the ChIP-Seq experiment is problematic and probably need further investigation. Therefore, researchers can use the motif analysis results to validate the ChIP-Seq experiment. If you are interested, See the following reference for an example. Xu, Chenhuan, et al. "Genome-wide roles of Foxa2 in directing liver specification." Journal of molecular cell biology (2012)

(2). The motif analysis can be used to identify novel binding motifs for transcription factors. If you are studing a transcription factor that the binding motif is unknown, you can use motif analysis to identify novel binding motifs. Those novel binding motifs can give useful information about the function of the transcription factor. For example, Bing Ren's group identified a novel binding motif for insulator protein CTCF. By analyzing the new motif, they identified some new functions of this CTCF protein. For reference, you can read this paper, Kim, Tae Hoon, et al. "Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome." Cell 128.6 (2007): 1231-1245.

(3). The motif analysis can be used to identify cofactors from the ChIP-Seq experiment. It is common to identify multiple different motifs from the ChIP-Seq experiment. Some of the motifs may belong to transcription factors that are not studied in the ChIP-Seq experiment and those transcription factors are potentially cofactors. Here is a reference for this kind of analysis. Ding, Jun, et al. "Systematic discovery of cofactor motifs from ChIP-seq data by SIOMICS." Methods 79 (2015): 47-51.

In this lab, we are going to learn how to use HOMER for motif analysis based on ChIP-Seq data. HOMER is a toolkit for motif discovery based on sequencing data and it is freely available at http://homer.salk.edu/homer/ngs/peakMotifs.html. HOMER contains several perl scripts (perl is a programming language similar to python). We already installed HOMER on sagemathcloud so you can use it directly for the following analysis.

The data we are going to use is from a publised Foxa2 ChIP-Seq experiment. The winged helix protein FOXA2 is a highly conserved, regionally-expressed transcription factor that regulate networks of genes controlling complex metabolic functions. The raw reads were aligned to reference genome and Foxa2 binding peaks were identified. You can find the ChIP-Seq peak data (GSE25836_Human_Liver_FOXA2_GLITR_1p5_FDR.bed) in BED format in the folder "data_for_motif_analysis". We are going to use this file for the following motif analysis.

We will use the findMotifsGenome.pl script in HOMER to find enriched motifs in Foxa2 ChIP-Seq peaks. The basic syntax is as follows:

In [ ]:
findMotifsGenome.pl <peak/BED file> <genome> <output directory> -size # [options]

1.<peak/BED file> is the input ChIP-Seq peaks in BED file format.
2.<genome> is the reference genome. We will use human reference genome hg18 for the analysis.
3.<output directory> is the output folder.
4.-size Selecting the size of the region for motif finding.  If you wish to find motifs using your peaks using their exact sizes, use the option "-size given").  However, for Transcription Factor peaks, most of the motifs are found +/- 50-75 bp from the peak center, making it better to use a fixed size rather than depend on your peak size.
5.[options] are some other options.


Here is an example using findMotifsGenome.pl to identify motifs in peaks.bed.

In [ ]:
findMotifsGenome.pl peaks.bed hg18 MotifOutput/ -size 200 -mask -preparsedDir parsed_genome -len 8


"-mask" means we want to mask out the repeat sequences in the genome.
"-preparsedDir parsed_genome" specifies the output folder for parsed sequences.
"-len 8" specifies the motif length to be 8.

Based on the above example, now run the motif analysis on the Foxa2 ChIP-Seq peaks. Write down your code.

In [ ]:
findMotifsGenome.pl GSE25836_Human_Liver_FOXA2_GLITR_1p5_FDR.bed  hg18 MotifOutput/ -size 200 -mask -preparsedDir parsed_genome -len 8


Note that the above analysis may take 5-10 mins to run. While we are waiting, let's do something else. Most of the bioinformaticians would agree that they spend most of their time reading documents, manuals or things from google, rather than coding. Next, try to be a real bioinformatician and read the section "How findMotifsGenome.pl works" at http://homer.salk.edu/homer/ngs/peakMotifs.html. Try to get a general idea about how HOMER (findMotifsGenome.pl) works and write down the steps the program goes through.

1) verify peak/BED file- make sure peaks are valid

2) Extract sequences from the genome corresponding to the regions in the input file, filtering sequences that are >70% "N"

3) Calculate GC/CpG content of peak sequences - since CpG sequences is big source of content bias

4) preparse genomic sequences of selected sizes to serve as background sequences

5) randomly select background regions for motif discovery

6) autonormalization of sequence bias

7) check enrichment of known motifs

8) de novo motif finding


When your motif analysis is done, look at the output folder on sagemath. You will find two 'html' files, "knownResults.html" and "homerResults.html". Let's open the "knownResults.html" first (click "Enable Preview" after you open the html file). You will see a series of known motifs enriched in the ChIP-Seq peaks. The motifs are ranked by their significance (q-value column).

What are top 5 known motifs enriched in our Foxa2 ChIP-Seq peaks? Write down their names.

1)	Foxa2(Forkhead)/Liver-Foxa2-ChIP-Seq(GSE25694)


Considering that we are analyzing Foxa2 ChIP-Seq data, do the top 5 motifs make sense? Why?

The top 5 motifs do make sense because they are all related to the Fox family, wheter it is Foxa2 or Foxa1


Look at the known motif list, do you find any other motifs that do not belong to Fox famliy? Find three and write down their protein names.

CEBP:AP1
HNF4a(NR)
Atf4(bZIP)



Why do we find those non-Fox motifs in Foxa2 ChIP-Seq peaks?

We find non-Fox motifs as well because those are probably motifs that are associated with cofactors or other proteins tha tmight interact with the Fox family.


Let's open "homerResults.html" (click "Enable Preview" after you open the html file). Those motifs are de novo motifs identified by HOMER. Since we have specified "-len 8", all the de novo motifs are 8mers. HOMER also matches the de novo motifs to known motifs and gives the best match in the output.

What is the top de novo motif identified in our Foxa2 ChIP-Seq peaks? Write down the best match protein name.

The best match appears to be FOXO4, with the following motif: TGTAAATA


For the top motif, what percentage of ChIP-Seq peaks have this motif? What percentage of background sequences have this motif?

It looks like 59.36% have this motif, while 8.76% of the background sequences have this motif.


Are there any other non-Fox de novo motifs in the list? What biological questions you can ask based on these results?

There are many non-Fox de novo motifs, such as HNF6, BATF, CEBPE. Based on these results, one can ask if these proteins interact with or serve as cofactors for Fox family proteins, or if it is possible that Fox might play a role in other processes related to those proteins.


### Homework (10 points)###

Let's use another ChIP-Seq data for homework. The nuclear receptor PPARg is a transcription factor that regulates networks of genes controlling complex metabolic functions. We will use a dataset from a PPARg ChIP-Seq experiment which was performed on a human adipocyte cell-line (SGBS).

Question 1: Perform the motif analysis on the PPARg ChIP-Seq peaks (GSE25836_Human_SGBS_Ads_PPARg_GLITR_1p5_FDR.bed) with HOMER. Write down your code. (1 point)

findMotifsGenome.pl GSE25836_Human_SGBS_Ads_PPARg_GLITR_1p5_FDR.bed hg18 MotifOutput2/ -size 200 -mask -preparsedDir parsed_genome -len 8


Question 2: In the list of known motifs that enriched in the ChIP-Seq peaks, can you find the PPARg motif? What can you conclude about the quality of this ChIP-Seq experiment? (1 point)

I cannot find the PPARG motif (though there is a PPARE motif) which suggests that the quality of this ChIP-Seq experiment is subpar.


Question 3: What other non-PPARg motif can you find in the list of known motifs? Find three and write down their names. Then, qualitatively compare their motifs to the motif of the top one. What kind of similarities are there, and why were all these different motifs found to be significantly enriched? (3 points)

Three other non-PPARg motifs include RXR, CEBP, and Erra. The motifs that were identified look very different than the top motif that was found, but look much more similar to each other. When looking at the percent target simliarity, all of these motifs range from about 20-30% (as opposed to >50% as we saw in the first in class example). These different motifs are likely enriched because these are all proteins that interact with PPARg in some way.


Question 4: What are the top three de novo motifs? Write down the names of their best match proteins. Similarly to the last question, compare these three de novo motifs. What kind of similarities are there? Are they more or less similar to each other than the known motifs were? Do any of them seem similar to the known motifs that were identified? (3 points)

The top three de novo motifs are:
CEBPA: TTGTGCAA
Nur77: TGACCTTT
Hnf4a: GGTCAAAG

At a glance, the top 3 de novo motifs look a less similar to each other than the known motifs, but it does appear that these are all proteins that could be interacting with our main protein of interest, PPARg. These do seem similar to the known motifs that were identified, there is a tendency to see a lot of the same protein names such as CEBPA or Hnf4a.



Question 5: Compare the p-value of the top known motif with the p-value of the top de novo motif. Why might they differ so much? (1 point)

The top known motif as a p-value of 1e-1920 while the top de novo motif has a p-value of 1e-1088. They might differ so much because there is a lot more confidence in the known motifs because several algorithims and experiments have validated it, whereas there is not as much support for the de novo motif. Another factor to consider is that de novo motif is fixed to 8 basepairs which may not be the optimal length.


Question 6: Find a de novo motif that is present in a higher percentage of target sequences than the top one. Why does this motif have a larger p-value than the top one? (1 point)

Another motif that is present in a higher percentage of target sequences is SCL(bHLH), with a target sequence similarity of 44.73%. Though it has a high sequence similarity with the target, it also has a high sequence similarity with background (44.24%) which means it is not unique and therefore is not a significant sequence.