Kernel: Python 2 (SageMath)

Motif discovery and regulatory analysis - I

Consensus sequences
Probability and positional weight matrices
Information content / entropy
Motif finding approaches

1. Consensus sequences

As you saw in the prelab lecture, there are many ways to represent motifs. In this assignment, we are going to have some more practice with these different representations and the kinds of interesting information contained in each one.

One simple way to represent motifs which is easy for people to actually look at is the exact consensus sequence representation. In this representation, a motif is encoded as the most common base at each position. Say you have the following examples of a given motif:

ACAGGAA
TGCGGAA
TGAGGAT
AGTGGAA
AACGGAA
ACAGGAT

By finding the most common base at each position, what is the exact consensus sequence for this motif?

AGAGGAA

Although there is a single most common letter at each position in this example, you probably noticed that many of these positions seem to be somewhat flexible, where there is another nucleotide that comes up almost as frequently as the most common base. It is quite common for motifs such as transcription factor binding motifs to include some level of flexibility or degeneracy, and so we also have a human-readable way to encode this, called the degenerate consensus sequence representation.

There are two common ways to encode this. One is related to the concept of regular expressions that we have seen a few times now, where the set of symbols that are possible at each position is contained in brackets, i.e. [AT] means that position can contain either an A or a T. Using this representation, what is the degenerate consensus sequence for this motif?

[AT][CGA][ACT]GGA[AT]

In this case, we have two positions that seem to be able to contain three different nucleotides. For the sake of clarity, a common convention is to only include a base as a degenerate possibility if more than 25% of the input sequences include that base. In this example, that means that a base that is only present in one of the sequences should not be counted. Rewrite your degenerate representation using this convention:

[AT][CG][AC]GGA[AT]

The other way to represent degenerate consensus sequences is to use specific characters (defined by IUPAC) to represent these sets of possibilities:

Symbol	Description	Bases represented	Number of bases represented
A	Adenine	A	1
C	Cytosine	C	1
G	Guanine	G	1
T	Thymine	T	1
U	Uracil	U	1
W	Weak hydrogen bonding	A,T	2
S	Strong hydrogen bonding	G,C	2
M	aMino	A,C	2
K	Keto	G,T	2
R	puRine	A,G	2
Y	pYrimidine	C,T	2
B	not A (B comes after A)	C,G,T	3
D	not C (D comes after C)	A,G,T	3
H	not G (H comes after G)	A,C,T	3
V	not T (V comes after T)	A,C,G	3
N or -	any Nucleotide (not a gap)	A,C,G,T	4

Using this approach, write the representation of the motif with all the possible degenerate positions (don't filter out bases that only appear once in a position):

WVHGGAW

Now write the representation of the motif with the cleaner definition of degenerate positions (do filter out bases that appear only once in a position):

WSMGGAW

2. Probability and positional weight matrices

So far in this lab, we have seen motif representations that are meant to be easily human-readable and interpretable. However, one issue with these representations is that they throw away quantitative information about the probability of each base at each position, and so we cannot use them for any more mathematical approaches to motif interpretation. One very common alternative representation that retains this information is the probability weight matrix (PWM), which is a matrix with 4 rows, one for each nucleotide, and a number of columns corresponding to the length of the motif. For example, the PWM representation of the six motifs from above (ACAGGAA, TGCGGAA, TGAGGAT, AGTGGAA, AACGGAA, ACAGGAT) is:

Nucleotide	Pos. 1 Probability (Observed Counts)	Pos. 2 Probability (Observed Counts)	Pos. 3 Probability (Observed Counts)	Pos. 4 Probability (Observed Counts)	Pos. 5 Probability (Observed Counts)	Pos. 6 Probability (Observed Counts)	Pos. 7 Probability (Observed Counts)
A	0.66 (4)	0.166 (1)	0.5 (3)	0.0 (0)	0.0 (0)	1.0 (6)	0.66 (4)
C	0.0 (0)	0.33 (2)	0.33 (2)	0.0 (0)	0.0 (0)	0.0 (0)	0.0 (0)
G	0.0 (0)	0.5 (3)	0.0 (0)	1.0 (6)	1.0 (6)	0.0 (0)	0.0 (0)
T	0.33 (2)	0.0 (0)	0.166 (1)	0.0 (0)	0.0 (0)	0.0 (0)	0.33 (2)

Using this table, we can use a simple approach of finding how well a given putative motif sequence matches what we think the real motif is by just comparing it to this table and multiplying the probability at each base. For example, if we want to quantify how well the motif 'AGAGGAA' (which was our exact consensus sequence) matches, we just go through and multiply 0.66 * 0.5 * 0.5 * 1.0 * 1.0 * 1.0 * 0.66 = .1089. One major issue with using this approach is the fact that some of these cells contain '0.0' as their probability. Consider the motif 'CGAGGAA', which only differs from our exact consensus sequence by a single base pair. If we try to use the same quantification approach, we will compute 0.0 * 0.5 * 0.5 * 1.0 * 1.0 * 1.0 * 0.66 = 0.0. In other words, the fact that we had one position containing a nucleotide that was not observed in our reference set means that the probability of that motif, under this PWM, is 0. To avoid this issue, we can add a 'pseudocount' of 1 at every position for every nucleotide, yielding the following PWM:

Nucleotide	Pos. 1 Probability (Obs + Pseudocounts)	Pos. 2 Probability (Obs + Pseudocounts)	Pos. 3 Probability (Obs + Pseudocounts)	Pos. 4 Probability (Obs + Pseudocounts)	Pos. 5 Probability (Obs + Pseudocounts)	Pos. 6 Probability (Obs + Pseudocounts)	Pos. 7 Probability (Obs + Pseudocounts)
A	0.5 (5)	0.2 (2)	0.4 (4)	0.1 (1)	0.1 (1)	0.7 (7)	0.5 (5)
C	0.1 (1)	0.3 (3)	0.3 (3)	0.1 (1)	0.1 (1)	0.1 (1)	0.1 (1)
G	0.1 (1)	0.4 (4)	0.1 (1)	0.7 (7)	0.7 (7)	0.1 (1)	0.1 (1)
T	0.3 (3)	0.1 (1)	0.2 (2)	0.1 (1)	0.1 (1)	0.1 (1)	0.3 (3)

Now if we try to compute the probability of observing 'CGAGGAA', we get 0.1 * 0.4 * 0.4 * 0.7 * 0.7 * 0.7 * 0.5 = 0.0027.

What is the probability of observing a motif very unlike what we have seen, say 'CTCTTTG'?

=0.1*0.1*0.3*0.1*0.1*0.1*0.1=3x10^-7

Generating positional weight matrices

A further refinement to this idea is to correct these probabilities for the background distribution of bases in the genome you are interested in. Doing this, we can define positional weight matrices. To do this, after we have obtained the matrix of probabilities including pseudocounts (i.e. the table directly above this one), we divide each entry in each row by the background probability of observing the nucleotide corresponding to that row. In the naive case, we just use p(i) = 0.25 for each nucleotide i. This assumes an equal probability of observing any given nucleotide. Finally, a common transformation is to take the natural logarithm (ln, or log base e) of each of these background-corrected quantities (note that these are no longer probabilities). This is done so that in order to compute the score for a given sequence, the entries in each row can be added instead of multiplied together. In our example above, applying these transformations using the naive nucleotide distribution yields the following table:

Nucleotide	Pos. 1 Log-odds	Pos. 2 Log-odds	Pos. 3 Log-odds	Pos. 4 Log-odds	Pos. 5 Log-odds	Pos. 6 Log-odds	Pos. 7 Log-odds
A	0.693	-0.223	0.470	-0.916	-0.916	1.030	0.693
C	-0.916	0.182	0.182	-0.916	-0.916	-0.916	-0.916
G	-0.916	0.470	-0.916	1.030	1.030	-0.916	-0.916
T	0.182	-0.916	-0.223	-0.916	-0.916	-0.916	0.182

Now, the corrected probability of any given sequence can be computed by simply adding the entries corresponding to that sequence. If the score is greater than 0, the sequence is more likely to be a functional than a 'random' sequence, and if the score is less than 0, the reverse is true. This is why the column titles refer to the 'log-odds': this model represents the 'odds' or likelihood that a given sequence matches the motif. Compute the score for the exact consensus sequence 'AGAGGAA':

0.693+0.470+0.470-0.916+1.030-0.916-0.916= -0.085

It is worth noting that the human genome does not follow the naive distribution of an equal probability of observing each nucleotide. Instead, the distribution is roughly p(A) = 0.3, p(C) = 0.2, p(G) = 0.2, and p(T) = 0.3. Using this, we can recompute our positional weight matrix:

Nucleotide	Pos. 1 Log-odds	Pos. 2 Log-odds	Pos. 3 Log-odds	Pos. 4 Log-odds	Pos. 5 Log-odds	Pos. 6 Log-odds	Pos. 7 Log-odds
A	0.510	-0.405	0.288	-1.099	-1.099	0.847	0.511
C	-0.693	0.405	-0.693	-0.693	-0.693	-0.693	-0.693
G	-0.693	0.693	-0.693	1.253	1.253	-0.693	-0.693
T	0.000	-1.099	-0.405	-1.099	-1.099	-1.099	0.000

Now what is the score for the exact consensus sequence 'AGAGGAA'?

0.510+0.693+0.288+1.253+1.253+0.847+0.511=5.355

3. Information content and entropy

One aspect of these PWMs that we have not yet addressed is the concept of how well they actually capture the motif, or how informative they actually are. In other words, we want to know how well a motif, as represented by a PWM, can discriminate between a real signal and background noise. To do so, we can take advantage of a very useful and powerful concept called the information content (IC) of a motif. This is a way of directly quantifying how informative a signal is, and applications of this concept can be found in a wide range of fields from computer encryption to machine learning to physics. In this case, we define the information content of each column $j$ in the PWM (i.e. each position in the motif) as $IC_j = 2 + \sum_{x=A,C,G,T} p_x log_2(p_x)$ , where $p_x$ is the entry for nucleotide $x$ in that column. This means that a value of 2.0 is the most informative and a value of 0 is the least informative. Consider the following simple PWM:

Nucleotide	Pos. 1 Probability	Pos. 2 Probability	Pos. 3 Probability
A	1.00	0.25	0.4
C	0.00	0.25	0.4
G	0.00	0.25	0.1
T	0.00	0.25	0.1

The IC for each column can be calculated:

$IC_1 = 2 + 1.0 * log_2(1.0) + 0.0 + 0.0 + 0.0 = 2$

$IC_2 = 2 + 0.25 * log_2(0.25) + 0.25 * log_2(0.25) + 0.25 * log_2(0.25) + 0.25 * log_2(0.25) = 2 + 0.25 * (-2) + 0.25 * (-2) + 0.25 * (-2) + 0.25 * (-2) = 0$

$IC_3 = 2 + 0.4 * log_2(0.4) + 0.4 * log_2(0.4) + 0.1 * log_2(0.1) + 0.1 * log_2(0.1) = 2 + 0.4 * (-1.32) + 0.4 * (-1.32) + 0.1 * (-3.32) + 0.1 * (-3.32) = 0.27$

So we see that the first position is maximally informative (intuitively, we know that it will always be an A), while the second position is minimally informative (each base has an exactly equal chance of occuring), and the third position is weakly informative (it is more likely to be an A or a C than a G or a T).

Then, the IC for a motif can be calculated as the sum of the information contents of each column, so this motif would have an IC of 2.27.

Similarly to how we wanted to generate positional weight matrices to correct for the background nucleotide distributions, we may also want to account for the background nucleotide probabilities when we look at the information content in a motif. There is a related concept called relative entropy that allows us to do this. Entropy measures the 'randomness' of a signal, and in that sense is the opposite of information. Relative entropy measures this 'randomness' or 'disorderedness' of a given motif relative to the background distribution. In other words, relative entropy measures how different your motif is from what you would expect given the background distribution; thus, if a motif is very informative, it will have a high relative entropy.

The equation for relative entropy is given as $RE = \sum_{x=A,C,G,T} p_x log_2(p_x/Q_x)$ , where $Q_x$ is the background probability of the nucleotide $x$ . Thus, if your PWM exactly matches the background probability Q, the relative entropy of your PWM will be 0 (because $p_x / Q_x = 1$ and $log_2(1) = 0$ ); otherwise, this quantity can be arbitrarily high or low.

Aside: creating motif logos

A useful way of representing motifs is using what are known as sequence logos, which we saw in the prelab lecture. These logos scale each nucleotide at each position to represent their information content. An easy way to create these logos is to use the website http://weblogo.berkeley.edu/logo.cgi. We will practice this with the set of 6 sequences we were looking at earlier. The general approach is to upload a set of sequences, either by copy and pasting or by uploading the file. These sequences can be provided in fasta format, as we have done here, or as a plain text list, where each line is the same length, as we have in question 4 on the homework. Here, we will just copy and paste the 6 sequences from this box:

>seq1
ACAGGAA
>seq2
TGCGGAA
>seq3
TGAGGAT
>seq4
AGTGGAA
>seq5
AACGGAA
>seq6
ACAGGAT

Then, navigate to the website and paste those sequences into the box marked 'multiple sequence aligment'. Then, simply press the 'create logo' button, and you should get a sequence logo! Save this file and upload it into the images/ folder of this assignment.

4. Motif finding approaches

As we saw in the lecture, there are several different computational approaches that can be used to identify enriched motifs in a given set of sequences, including exact counting, iterative approaches like Gibbs' sampling and expectation maximization, and differential enrichment approaches. For this section of the lab, we will just have some practice using the most common tool for motif enrichment in relatively small datasets, MEME, which is based on expectation maximization.

We will analyze the file called 'selex_seqs.fasta', in the inclass_data/ folder. This fasta-formatted file contains sequences from a SELEX-like experiment, where sequences were pulled down based on their affinity with some transcription factor. We will use the online MEME tool to do this. You can either download this file to your computer (recommended) or copy and paste it to upload it to MEME, but make sure you get the full file if you do this. Navigate to http://meme-suite.org/tools/meme, and under the input the primary sequences header, select whichever approach you are using to upload the sequences.

Under select the site distribution, choose 'one occurrence per sequence', because this file comes from a SELEX-like experiment and so each sequence was experimentally found to bind to some transcription factor. Leave the value of 3 for how many motifs MEME should find, and under advanced options, change the maximum width of the motifs to 20bp to speed up the computation. This will take some time to finish running, so make sure to save the link, or you can provide an email address that they will mail the link to. Make sure to submit this job before starting the homework as some of the questions will be about these results!

Homework problems: motif practice

Question 1: Consider the following probability weight matrix:

Nucleotide	Pos. 1	Pos. 2	Pos. 3	Pos. 4	Pos. 5	Pos. 6	Pos. 7	Pos. 8
A	0.01	0.1	0.97	0.95	0.5	0.05	0.8	0.4
C	0.03	0.05	0.01	0.01	0.1	0.6	0.1	0.08
G	0.95	0.05	0.01	0.03	0.1	0.05	0.05	0.02
T	0.01	0.8	0.01	0.01	0.3	0.3	0.05	0.5

What is the information content of positions 3 and 5 in this matrix? (1 point)

IC for position 3= 1.758059
IC For position 5=1.215356

Question 2: Using the PWM given above, what is the exact consensus sequence and the degenerate consensus sequence (using either the regular expression or IUPAC characters)? For the degenerate sequence, only count a nucleotide as a degenerate possibility if it has a probability of more than 0.25. (1 point)

GTAAACAA-exact consensus sequence
GTAA[AT][CT]A[AT]-degenerate sequence

Question 3 (short answer): Based on this consensus sequence, do you expect the relative entropy of this probability matrix to be higher when compared to the naive nucleotide distribution (equal probability of any nucleotide) or to the human genome background probability (A and T are more common than G and C)? (1 point)

Relative Entropy will be higher when compared to the naive nucleotide distribution, as the consensus sequence contains more As than other nucleotides which are higher in the human genome background probability

For the next two questions, we will be using the following set of sequences:

TGGGAA
TGGGAA
TGGGAA
TGGGAA
TGGGAA 
TGAGAA
TGGGAA
TGGGAA
TGGGAA
TGGGAG
TGAGAA
TGAGAA
TGTGAA
TGGGAA
TGGGAG
TGGGAG
CGGGAA
TGGGAT

Question 4: From these sequences, create a positional weight matrix corrected for the human genome background probabilities (p(A) = 0.3, p(C) = 0.2, p(G) = 0.2, and p(T) = 0.3). Recall that this involves counting the nucleotide occurrences at each position, adding a psuedocount of 1, normalizing this into probabilities of each base, and finding the log odds by taking the natural log of each probability divided by the frequency of that nucleotide in the genome. (1 point)

             1	          2	            3	           4	         5	         6
A	-1.840549633	1.103889346	-0.454255272	-1.840549633	1.103889346	0.867500568
T	1.049822124	-1.840549633	-1.147402453	-1.840549633	-1.840549633	-1.147402453
C	-1.435084525	-1.435084525	-1.435084525	-1.435084525	-1.435084525	-1.435084525
G	-1.435084525	-1.435084525	1.272965676	1.509354454	-1.435084525	-0.048790164

Question 5: Using these sequences, make a logo using the webLogo tool and upload this logo to the 'images/' directory in this assignment (1 point)

Hopefully your MEME results are now ready, because the rest of the questions will deal with their analysis. On the MEME output page, follow the link to the 'MEME HTML output.' You should see ‘Discovered Motifs’, ‘Motif Locations’, and ‘Program information’ on this result page. ‘Program information’ contains some basic information about what version of MEME you used, your input data, and the parameters. Clicking the “?” links will give you more information on what each output column means, and may help you answer the questions below.

We are going to look at motif 1, which is the 'best' motif identified in the data by MEME. If you press the down arrow under the 'More' column, you can see more information about this motif.

Question 6: Based on the sequence logo, what is the consensus sequence for this motif (either exact or degenerate)? (1 point)

CTATTTATAG

Question 7: Now click on the right-facing arrow, which leads you to the 'submit or download motif' page. Select the 'download motif' tab, where you can download the motif in count matrix format, probability matrix format (useful for finding the degenerate consensus sequence), minimal MEME format, FASTA, and raw formats. Select the minimal MEME format. As you scroll through the window, you should see two familiar matrices, the log-odds matrix and the letter probability matrix. What are some key differences between these two matrices? (1 point)

The letter probability matrix does not have a pseudocount correction for the nucleotide probabilities. Additionally, the letter probability matrix is not corrected for background probabilities as is the log-odds matrix. The log-odds matrix is in logatharmic form which results inn negative values compared to the probability matrix. matrix values are multipled in the letter probability matrix to determine probability of a given sequence whereas matrix values are added in the log-odds matrix.

Question 8: What is the E-value of this motif? What does this value represent? Should this be considered to be a significant hit? (Hint: the question mark box next to the E-value contains valuable information.. Hint #2: in the E-value column, a positive number after 'e' means move the decimal point to the right.) (1 point)

E= 1.5e+002. The E-value represents the expected number of motifs with the given log likelihood ratio (or higher) one would find in a similarly sized set of random sequences. This is not considered significant as it is not below 0.05

Question 9: How many sites contain this motif? What is the information content of this motif? (1 point)

25 sites contain this motif and the IC content of the motif is 10.6

Question 10 (short answer): Of the 3 motifs identified, which, if any, do you think are the true motif? What are you basing this on? Also, if there is only 1 true motif, why do we identify 3? (1 point)

The first motif is liekly the true motif. This motif has the highest Information Content and Relative Entropy compared to the other motifs identified in the MEME search. Three motifs were identified due to the search parameters (Expectation Maximazation converging on local maxima rather than a global maxima, resulting in three motifs rather than one.

In [ ]: