CoCalc Shared Files24_Motifs-II-graded / 24_motif_2_inclass.ipynbOpen in CoCalc with one click!
Author: Joyce Lee
Views : 5
Description: Jupyter notebook 24_Motifs-II-graded/24_motif_2_inclass.ipynb

Motif discovery and regulatory analysis - II

Table of Contents

  1. Sequence motif databases
  2. Gene list promoter motif analysis

1. Sequence motif databases

As we saw in the prelab lecture, there are several databases containing known motifs for various transcription factors. Two very commonly used tools that we saw in the lecture are JASPAR, a motif database, and TOMTOM, a tool for matching motifs to databases, and here we will have some practice using these tools. Recall the PWM from questions 1-3 from the last homework:

NucleotidePos. 1Pos. 2Pos. 3Pos. 4Pos. 5Pos. 6Pos. 7Pos. 8

Let's try submitting this PWM to JASPAR. Navigate to and scroll to the very bottom, to the box labelled 'Align to a custom matrix or IUPAC string.' We can't directly use this matrix in the table format provided here, so copy and paste the table in the following format:

A [ 0.01 0.1 0.97 0.95 0.5 0.05 0.8 0.4 ] C [ 0.03 0.05 0.01 0.01 0.1 0.6 0.1 0.08 ] G [ 0.95 0.05 0.01 0.03 0.1 0.05 0.05 0.02 ] T [ 0.01 0.8 0.01 0.01 0.3 0.3 0.05 0.5 ]

Then, press the 'Align' button underneath the table entry form. You will be taken to a result page showing all the matching transcription factors including their name, species, class, family, score, percent score, and canonical sequence logo. If you click on the sequence logo for any match, you can find more information including the latin name of the species that the motif comes from. What is the best matching transcription factor, and what species does it come from?

NCU00019 from the species Neurospora crassa; class: Fork head / winged helix factors; score: 15.499; canonical sequence:

What is the second best matching transcription factor, and what species does it come from?

FOXD1 from homo sapiens

What do you make of the fact that the top two matches come from very different species?

there are regions of conserved Fork head factors across evolution

A related tool is TOMTOM, which is part of the MEME suite. To practice with this, we are going to use the MEME results we generated in the second half of the homework from last class. If you lost the link, you can re-run the job the same way you did last time. Once you have these results, go to the top discovered motif and click the rightward facing arrow. This time, select the 'Submit motif' tab. TOMTOM is in the same suite of tools as the MEME motif discovery tool, and so they are synced up. This means that we can easily submit this top motif to the TOMTOM tool to find its matching TFs. Select 'TOMTOM' under 'Submit to program' and then click submit. You will be taken to the TOMTOM submission page, with 'submitted motifs' already chosen for you under the 'Input query motifs' section. Select 'JASPAR DNA' under the 'Select target motifs' section, and make sure the box is checked under 'run immediately', then click the 'start search' button. This should take you to the results page fairly quickly.

How many matches did TOMTOM return?


What is the name and p-value of the top match?

MEF2B p-value: 6.01e-06

Do you notice any sort of pattern in the top 5 or so matches?

they all have 10 nucleotides that match up

Yes, and this is because many of them are part of the same MEF2 family of TFs

For this homework, we are going to do a similar analysis using MEME and TOMTOM, but this time using the file called 'promoter_seqs.fasta' in the inclass_data/ folder of this assignment. This file contains promoter sequences from several co-expressed genes, so these different promoters may contain different TF motifs. You should submit this file now so that it will be done processing by the time you get to the actual homework questions.

Submit this file the same way you did before (make sure to provide an email if you think you might lose the link because this analysis will take a while), but make sure to select 'any number of repetitions' under the 'Select the site distribution' section, and lower the maximum motif width to 20 under the advanced options.

2. Gene list promoter motif analysis

Another analysis we might want to do is to find the over-represented motifs for a list of co-expressed or otherwise related genes. To practice with this, we will be using a set of genes from an RNA sequencing study of a mouse model of ALS, where a disease-related protein called TDP-43 was mutated to localize to the cytosol instead of the nucleus ( In the inclass_data/ folder, there are two files you will need for this analysis: tdp_downreg_genes.txt, which contains the 88 genes that were significantly downregulated by at least 2-fold when the protein was mutated, and all_tdp_genes.txt, which contains the 10,601 genes that met the minimum read count threshold in this experiment. You will probably want to download these files to your own computer so that you can upload them easily without having to copy and paste 10,000 genes. We would like to determine whether there is any significant motif enrichment in these genes, so we will use the oPOSSUM tool, which was described in the prelab lecture.

Go to the site at, and select Human under 'Single site analysis (SSA)', in order to detect over-represented conserved transcription factor binding sites in these sequences. Upload the 'tdp_downreg_genes.txt' file as the target genes and the 'all_tdp_genes.txt' file as the background gene list, and then run the analysis using the default settings (using JASPAR CORE profiles). This may take a few minutes to finish running.

Once you have the results, click the column marked 'Fisher score' in order to sort by that column. This column contains the negative natural logarithm of the p-value for the Fisher's exact test-based enrichment. In other words, the higher, the better.

What is the transcription factor with the most significant overrepresentation based on the Fisher score?


What is its Fisher score, and what p-value does this correspond to (you can use exp(-X) in R, where X is the Fisher score)?

4.422 ; 0.0120

How many of the target genes is this TF predicted to interact with? (Use the 'Target gene hits' column)


Now sort the list by Z-score. What are the TFs with the highest and lowest Z-scores?

lowest Z-score: Zfx highest Z-score: Foxd3

Navigate to the help page at Using the statistical analysis section, can you explain the differences between the Z-score test and the Fisher score test?

Fisher score is one tailed and compares the transcription factor binding site with the background gene set of interest. The Z-score is binomial and only compares the transcription factor binding site with the general background.

Sort of; the main and fundamental difference is that the Fisher score doesn't care how many times a given motif is found in a gene, just whether it is present or absent, whereas the Z-score takes the number of occurrences into account

How might these differences explain the difference between the TFs prioritized by the two different scores?

The Fisher prioritizes based on the gene set we are interested in whereas the Z-score tells us how likely this TF site is present so we end up with slightly different TFs as the top hit.

Homework problems: matching motifs to TFs

For the first section of this homework, we are going to use the results of the MEME analysis of the promoter_seqs.fasta file.

Question 1: What is the consensus sequence for the best motif found by MEME? What is its E-value? (1 point)

GCCCATATGTGGGT, e-value: 5.0e-001

Question 2: Now, follow the same procedure as what we did for the in-class assignment to submit this top motif to TOMTOM, using the JASPAR database. What's the name of the closest matching motif? (1 point)


Question 3: What is the consensus sequence for this motif (either degenerate or exact)? How does this compare to the motif found by MEME? (1 point)

TGACCATATATGGTCA; looks pretty similar in the middle portion

Question 4: Now let's run the top motif through JASPAR ourselves, for comparison. Use the top motif from MEME, but instead of submitting it directly to TOMTOM, download the probability matrix. (Under the 'download motif' tab). To submit this to JASPAR, you will have to transpose this matrix: the format MEME provides has the rows as the positions and the columns as each nucleotide, while JASPAR requires the rows to be nucleotides and the columns to be positions. There are many ways to accomplish this: one easy way is to download the PWM, read it into R, and use the t() command to transpose it. Then, you can write that result out as a table and copy and paste it to get it into JASPAR. If you do this, make sure to include the following arguments to write.table: "quote=F, row.names=F, col.names=F". Paste the table you are using here:

0 0 0 0 1 0 0.785714 0 0.642857 0.285714 0 0 0.428571 0.142857 0.214286 0.857143 0.714286 0.928571 0 0 0 0 0.142857 0 0.142857 0.285714 0 0.285714 0.785714 0.142857 0.214286 0 0 0 0.214286 0.142857 0.214286 0.071429 0.857143 0.642857 0.5 0 0 0 0.071429 0.071429 0 1 0 0.857143 0 0.642857 0 0.071429 0.071429 0.571429

Now upload this PWM to JASPAR. What is the name of the top matched TF for this motif? How does it compare to the one we found using TOMTOM? (1 point)

Same as the one found on TOMTOM - SRF is the top matched TF

Question 5: Now compare the other matches in JASPAR and TOMTOM. List the second and third best matches in each database. Can you interpret why these aren't the same? (1 point)

JASPAR: AGL3, SEP1 TOMTOM: SEP3, Bhlha15 I think they are different because TOMTOM ranks by p-value and JASPAR scores by percent

True, but they also match PWMs to motif databses using different algorithms, so some of the hits in one may not even be found by the other!

For the rest of the homework, we will perform another oPOSSUM analysis, but using another dataset based on microarray results from a colon cancer study. In the inclass_data folder, the file 'cancer_gene_list.txt' contains a list of genes that were upregulated in colon cancer. Upload this gene list to the oPOSSUM tool, this time just using the 'all 24,752 genes in the oPOSSUM database' as the background, and run a similar analysis to what we did for the in-class data.

Question 6: What is the most significantly overrepresented transcription factor by Fisher score? (1 point)


Question 7: What is the Fisher score and corresponding p-value for this TF? (1 point)

Fisher score: 12.44 , p-value: exp(-X) = 0.00000395709

Question 8: How many genes in the list is this TF predicted to interact with? (1 point)


Question 9: Click on the entry in the JASPAR ID column to go to the JASPAR page for this TF. Looking at the sequence motif, which are the most informative positions of this motif and what bases do they contain? (1 point)

positions 4-7 have the most definitive bases, which are CGTG

Question 10 (short answer): The next most enriched TF by Fisher score interacts with the same number of target genes. However, if you compare which genes they interact with, they are not completely overlapping. Can you look at its JASPAR motif and the class of the two TFs to explain why these are different? (1 point)

Looks like you missed this one! The two proteins are HIF1A::ARNT and CEBPA, which belong to the helix-loop-helix and leucine zipper families, respectively, so they are expected to bind to different sets of genes (-1 point). Great job on this assignment!

In [ ]: