In the previous labs, we learned how to analyze ChIP-Seq data, including read alignment, peak identification, and visualization. In this lab, we are going to learn another important downstream analysis for ChIP-Seq data: finding enriched motifs in ChIP-Seq peaks.
Why do we want to do motif analysis for ChIP-Seq data? There are several reasons:
(1). Motif analysis can be used to validate a ChIP-Seq experiment. If you are doing a ChIP-Seq experiment for a transcription factor with known binding motifs, you would expect those motifs to be enriched in the ChIP-Seq peaks. For example, the transcription factor Foxa2 is known to bind the motif "GTAAACA", so motif analysis of a Foxa2 ChIP-Seq experiment should identify "GTAAACA" as one of the enriched motifs in the peaks. If it does not, the quality of the ChIP-Seq experiment is questionable and probably needs further investigation. If you are interested, see the following reference for an example: Xu, Chenhuan, et al. "Genome-wide roles of Foxa2 in directing liver specification." Journal of molecular cell biology (2012)
(2). Motif analysis can be used to identify novel binding motifs for transcription factors. If you are studying a transcription factor whose binding motif is unknown, you can use motif analysis to discover it. Novel binding motifs can give useful information about the function of the transcription factor. For example, Bing Ren's group identified a novel binding motif for the insulator protein CTCF, and by analyzing the new motif they identified some new functions of the CTCF protein. For reference, you can read this paper: Kim, Tae Hoon, et al. "Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome." Cell 128.6 (2007): 1231-1245.
(3). Motif analysis can be used to identify cofactors from a ChIP-Seq experiment. It is common to find several different motifs in a single ChIP-Seq experiment. Some of these motifs may belong to transcription factors other than the one targeted in the experiment, and those transcription factors are potential cofactors. Here is a reference for this kind of analysis: Ding, Jun, et al. "Systematic discovery of cofactor motifs from ChIP-seq data by SIOMICS." Methods 79 (2015): 47-51.
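To make reason (1) concrete, here is a minimal Python sketch of what it means for a motif to "occur in a peak". It checks whether the Foxa2 motif "GTAAACA" appears on either strand of each sequence. The sequences are invented toy examples, not real peaks, and the function names are hypothetical. A real enrichment analysis such as HOMER's also compares against background sequences and applies statistics, so this only illustrates the basic idea.

```python
# Toy illustration: which sequences contain a known motif on either strand?
# The sequences below are invented for this example; they are not real peaks.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def contains_motif(seq, motif):
    """True if the motif occurs on the forward or reverse strand of seq."""
    seq = seq.upper()
    return motif in seq or reverse_complement(motif) in seq


def fraction_with_motif(seqs, motif):
    """Fraction of sequences that contain the motif at least once."""
    hits = sum(contains_motif(s, motif) for s in seqs)
    return hits / len(seqs)


FOXA2_MOTIF = "GTAAACA"  # motif named in the text

toy_peaks = [
    "CCGTAAACATTG",  # motif on the forward strand
    "AATGTTTACGGC",  # motif on the reverse strand (appears as TGTTTAC)
    "GGGCCCGGGCCC",  # no motif
    "ACGTACGTACGT",  # no motif
]

print(fraction_with_motif(toy_peaks, FOXA2_MOTIF))
```

Checking both strands matters because a transcription factor can bind in either orientation, while a peak file only reports the forward-strand coordinates.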
In this lab, we are going to learn how to use HOMER for motif analysis of ChIP-Seq data. HOMER is a toolkit for motif discovery based on sequencing data, and it is freely available at http://homer.salk.edu/homer/ngs/peakMotifs.html. HOMER contains several Perl scripts (Perl is a programming language similar to Python). HOMER is already installed on SageMathCloud, so you can use it directly for the following analysis.
The data we are going to use is from a published Foxa2 ChIP-Seq experiment. The winged helix protein FOXA2 is a highly conserved, regionally expressed transcription factor that regulates networks of genes controlling complex metabolic functions. The raw reads were aligned to the reference genome, and Foxa2 binding peaks were identified. You can find the ChIP-Seq peak data (GSE25836_Human_Liver_FOXA2_GLITR_1p5_FDR.bed) in BED format in the folder "data_for_motif_analysis". We are going to use this file for the following motif analysis.
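Before running a motif search, it can help to look at the peak file. In a BED file, the first three columns are the chromosome, the start position (0-based), and the end position (exclusive). The sketch below summarizes peak widths under the assumption that the file follows this layout. The three records are invented stand-ins for the real data, and `parse_bed` is a hypothetical helper, not part of HOMER.

```python
import io
from statistics import median


def parse_bed(handle):
    """Yield (chrom, start, end) from a BED-formatted text stream."""
    for line in handle:
        line = line.strip()
        # Skip blank lines and the header lines some BED files carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        yield fields[0], int(fields[1]), int(fields[2])


# Invented records standing in for the real peak file.
toy_bed = io.StringIO(
    "chr1\t1000\t1200\n"
    "chr2\t5000\t5350\n"
    "chrX\t300\t400\n"
)

# BED ends are exclusive, so the width is simply end - start.
widths = [end - start for _, start, end in parse_bed(toy_bed)]
print(len(widths), "peaks; median width", median(widths), "bp")
```

To try this on the lab data, replace `toy_bed` with an opened file handle for the BED file in "data_for_motif_analysis"; the exact path depends on your working directory.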
We will use the findMotifsGenome.pl script in HOMER to find enriched motifs in Foxa2 ChIP-Seq peaks. The basic syntax is as follows:
findMotifsGenome.pl <peak/BED file> <genome> <output directory> -size # [options]
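As a concrete instance of this syntax, the sketch below assembles one possible command line in Python and prints it instead of running it. The genome name `hg19`, the output directory `foxa2_motifs`, the region size of 200 bp, and the relative path to the peak file are all assumptions chosen for illustration; use the genome build the peaks were actually called against. `build_homer_command` is a hypothetical helper, not part of HOMER.

```python
import shlex


def build_homer_command(peak_file, genome, out_dir, size):
    """Return the findMotifsGenome.pl call as a list of arguments."""
    return [
        "findMotifsGenome.pl",
        peak_file,           # <peak/BED file>
        genome,              # <genome>
        out_dir,             # <output directory>
        "-size", str(size),  # -size #
    ]


cmd = build_homer_command(
    "data_for_motif_analysis/GSE25836_Human_Liver_FOXA2_GLITR_1p5_FDR.bed",
    "hg19",          # assumed genome build
    "foxa2_motifs",  # assumed output directory name
    200,             # assumed region size in bp
)

# Print the shell command rather than executing it (needs Python 3.8+).
print(shlex.join(cmd))
```

The printed line can be pasted into a terminal. Passing the same list to `subprocess.run(cmd, check=True)` would execute it from Python, provided HOMER is on the PATH.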