Author: William A. Stein
MECCAH: Computation of Stanley Sawyer

# How can one tell in which direction evolution is going?

Stanley Sawyer

Virtually all biologists believe that, in the long run, the most important changes in organisms are due to the replacement of genes by new genes that do a better job for the organism.

However, many biologists believe that in large, established populations, most evolutionary change is, in contrast, due to the replacement of genes by slightly deleterious variants. The reason for this is that most mutations are harmful rather than helpful, and that mildly harmful mutations can become established in a population (and replace the former, better variant) due to the chance effects of who mates with whom and who happens to survive. This process would take a long time for a large population, but most evolutionary change takes place on a long time scale. These chance effects could not establish a severely damaged gene that was important to its host, but could replace a good gene by a gene that was only slightly worse.
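The claim that chance can establish a slightly worse gene, but not a severely damaged one, can be made quantitative with Kimura's standard diffusion approximation for the fixation probability of a new mutation. The sketch below is an illustration and not part of the original analysis: it assumes a diploid population of constant size `N` whose effective size equals `N`, additive (genic) selection with coefficient `s`, and a single initial copy of the mutant. The function name is hypothetical.

```python
import math


def fixation_probability(s, N):
    """Kimura's diffusion approximation for a new mutant at frequency 1/(2N).

    Assumes a diploid population of constant size N (effective size = N)
    and additive selection with coefficient s (negative = deleterious).
    """
    if s == 0:
        # Neutral limit: every one of the 2N gene copies is equally likely to fix.
        return 1.0 / (2 * N)
    exponent = -4.0 * N * s
    if exponent > 700:
        # exp() would overflow; the fixation probability is effectively zero.
        return 0.0
    # (1 - exp(-2s)) / (1 - exp(-4Ns)), written with expm1 for accuracy near zero.
    return math.expm1(-2.0 * s) / math.expm1(exponent)


if __name__ == "__main__":
    N = 1000
    for s in (0.001, 0.0, -0.001, -0.1):
        print(f"s = {s:+.3f}: P(fix) = {fixation_probability(s, N):.3e}")
```

Under these assumptions a mutation with `N * s = -1` still fixes at a small but appreciable fraction of the neutral rate, while a mutation with `N * s = -100` essentially never does, which is the distinction the paragraph above draws.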

In this view, most evolution in large populations is downhill. Any improvement in the population is due to one of two mechanisms. Either (i) very rare mutations to significantly better genes arise and spread through the population very quickly, at which point the population begins to move downhill again from a higher plateau; or (ii) the entire large population is replaced by the descendants of an isolated small population. In the second scenario, several new favorable mutations, or a whole family of favorable mutations, become established in the small population due to inbreeding and the chance effects of mating and survival in a small population. The prevailing view among biologists since the 1920s has been that (ii) is more likely than (i), at least for major changes.

An argument for either (i) or (ii) is that, in the fossil record, creatures appear not to change for long periods and then suddenly a noticeably different creature appears. The new creature is presumably doing a better job in the same habitat than the creature it replaced, but might conceivably be no better than the first creature before it started going downhill. For shorter time periods (millions of years rather than tens or hundreds of millions of years), there is not enough fossil evidence to tell whether evolutionary change is continuous or comes in bursts.

The most reliable and easiest to analyze biological information is from DNA. Unfortunately, DNA older than around 10,000 years is very rare. Thus we are led to try to answer questions about historical trends on the basis of the distribution of DNA in contemporary populations. This can be done: The distribution of a set of mutations within a population is different if the mutations are advantageous, deleterious, or selectively the same as the original variants. For example, advantageous or deleterious mutant genes that are present in a sample will be less common in the sample than if they had no significantly different effect on their host. This is because it is more difficult for deleterious genes to become common, and advantageous genes will tend to become established or nearly established as soon as they become common. There are also subtle differences in the distributions of advantageous as opposed to deleterious mutant changes. One can also use the number of established differences between two related species to gain additional information.
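To make the dependence of sample frequencies on selection concrete, the following sketch computes an expected site frequency spectrum under a commonly used form of the Poisson random field density, in which the density of mutant frequencies `x` is proportional to `(1 - exp(-2*gamma*(1 - x))) / ((1 - exp(-2*gamma)) * x * (1 - x))`. This is an illustration under stated assumptions, not the project's code: `gamma` is a scaled selection coefficient, `theta` is a scaled mutation rate normalized here so that the neutral density is `theta / x` (papers differ on this scaling), and the function names and the simple midpoint integration are choices made for this example.

```python
import math


def frequency_density(x, gamma, theta=1.0):
    """Assumed density of mutant frequencies x in (0, 1) for scaled selection gamma."""
    if abs(gamma) < 1e-8:
        return theta / x  # neutral limit
    ratio = math.expm1(-2.0 * gamma * (1.0 - x)) / math.expm1(-2.0 * gamma)
    return theta * ratio / (x * (1.0 - x))


def expected_spectrum(n, gamma, theta=1.0, steps=20000):
    """Expected number of sites at which the mutant appears k times in a
    sample of n sequences, for k = 1 .. n-1 (midpoint-rule integration)."""
    spectrum = []
    for k in range(1, n):
        total = 0.0
        for j in range(steps):
            x = (j + 0.5) / steps
            binom = math.comb(n, k) * x**k * (1.0 - x) ** (n - k)
            total += frequency_density(x, gamma, theta) * binom
        spectrum.append(total / steps)
    return spectrum


if __name__ == "__main__":
    for gamma in (-5.0, 0.0, 5.0):
        spec = expected_spectrum(8, gamma)
        share = [round(v / sum(spec), 3) for v in spec]
        print(f"gamma = {gamma:+.1f}: share of sites by mutant count = {share}")
```

Comparing the shape of the spectrum across values of `gamma` shows the kind of signal a statistical model can exploit: the same number of segregating sites is spread differently across frequency classes depending on the sign and strength of selection.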

The basic data that I am using consist of a sample of DNA sequences from one gene in one species (say, m sequences) and n DNA sequences from the same gene in a closely related species. The data currently being analyzed are from two species of the fruit fly Drosophila and from two species of a common weed, Arabidopsis.

One also needs a statistical model for the sample frequencies of DNA changes as a function of mutation rates and amounts of selection. One then applies the statistical model to the sequence data and estimates parameters along with measures of statistical confidence of the parameter estimates.

Unfortunately, even a large number of sequences from a single genetic locus or type of gene does not have sufficient statistical power, so one needs sequence data from the two species from many different genetic loci or types of genes (for example, 34 loci or 54 loci). Classical statistical methods ("maximum likelihood") are not well behaved for this much data of this type. A newer statistical method called Markov Chain Monte Carlo (MCMC) is effective and does produce results. The disadvantage is that MCMC methods can take many hours of computer time on a fast computer, as opposed to milliseconds for classical statistical methods, but the classical methods do not work in this case.
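For readers unfamiliar with MCMC, here is a minimal random-walk Metropolis sampler for a deliberately simple toy problem. It is a sketch of the general technique only, not the model or code used in this project: the data are made-up counts treated as independent Poisson observations with a common rate `theta`, the prior on `theta` is flat, and all names and tuning values are assumptions chosen for the example.

```python
import math
import random


def log_posterior(theta, counts):
    """Log posterior (up to a constant) for a Poisson rate with a flat prior."""
    if theta <= 0:
        return -math.inf  # rate must be positive
    return sum(k * math.log(theta) - theta for k in counts)


def metropolis(counts, n_samples=20000, step=1.0, start=1.0, seed=0):
    """Random-walk Metropolis: propose theta + Normal(0, step) and accept
    with probability min(1, posterior ratio)."""
    rng = random.Random(seed)
    theta = start
    current = log_posterior(theta, counts)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, counts)
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            theta, current = proposal, candidate
        samples.append(theta)  # a rejected proposal repeats the current value
    return samples


if __name__ == "__main__":
    toy_counts = [3, 5, 4, 6, 2]  # hypothetical data, for illustration only
    draws = metropolis(toy_counts)[2000:]  # discard burn-in
    print("posterior mean estimate:", sum(draws) / len(draws))
```

The appeal of the method is that it needs only the ability to evaluate the posterior up to a constant, which is why it remains usable for models with many loci and many parameters; the cost is that a large number of iterations may be needed before the draws are representative.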

Some references are

1. Sawyer, S. A. and D. L. Hartl (1992) Population genetics of polymorphism and divergence. Genetics 132, 1161–1176.

(This derives the basic statistical model. Generally speaking, biology journals do not like mathematical derivations, but this one allowed us to put a mathematical proof in an Appendix.)

2. Hartl, D. L., E. N. Moriyama, and S. A. Sawyer (1994) Selection intensity for codon bias. Genetics 138, 227–234.

(This applies the same theory to a slightly different problem, namely the tendency for different DNA variants to show the effects of selection in some cases even though they produce exactly the same gene product.)

3. Bustamante, Carlos, Rasmus Nielsen, Stanley A. Sawyer, Kenneth M. Olsen, Michael D. Purugganan, and Daniel L. Hartl (2002) The cost of inbreeding in Arabidopsis. Nature 416, 531–534.

(This applies the MCMC theory to a simple model of selection in which all new mutations of genes of a particular type (for example, for a particular enzyme) are either (i) immediately lethal or nearly lethal, and so can be ignored, or else (ii) have exactly the same selective advantage or disadvantage. This model is not realistic, but the single estimated selection coefficient for a particular gene might be an average selection coefficient of some kind. The conclusion was that two Drosophila species appeared to be positively evolving but that two weedy species (Arabidopsis) were going downhill.)

Technically, MECCAH is being used to investigate more realistic models of the selective effect of mutations, for example models in which mutations within a given locus can have a range of selective effects. In particular, mutations that separate two species should have different average selective effects than mutations that are polymorphic within one or both species. Mutations in the first class are more likely to be advantageous than mutations that are present in two forms in a sample.

A model in which the selective advantage of new mutations within a given locus is normally distributed as opposed to being constant has led to the nicest results so far: Mutations that are polymorphic turn out to be chosen from the upper half of the normal distribution within a given locus, while mutations that become established in one of the two species come from the extreme upper tail of the normal distribution at that locus.
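The intuition that established mutations come from the upper tail can be illustrated with a small simulation. This is a sketch under assumptions that are not taken from the project: scaled selection coefficients `gamma` of new mutations are drawn from a normal distribution with hypothetical mean `mu` and standard deviation `sigma`, and each mutation is weighted by the standard relative fixation rate `2*gamma / (1 - exp(-2*gamma))`, which equals 1 for a neutral mutation. It does not model the polymorphic class.

```python
import math
import random


def relative_fixation_rate(gamma):
    """Fixation rate relative to a neutral mutation: 2*gamma / (1 - exp(-2*gamma))."""
    if abs(gamma) < 1e-8:
        return 1.0  # neutral limit
    return -2.0 * gamma / math.expm1(-2.0 * gamma)


def mean_effect_of_fixed(mu, sigma, n_draws=100000, seed=0):
    """Return (mean gamma of new mutations, mean gamma of fixed mutations)."""
    rng = random.Random(seed)
    draws = [rng.gauss(mu, sigma) for _ in range(n_draws)]
    weights = [relative_fixation_rate(g) for g in draws]
    mean_new = sum(draws) / n_draws
    # Weighting each draw by its fixation rate gives the mean among fixed mutations.
    mean_fixed = sum(g * w for g, w in zip(draws, weights)) / sum(weights)
    return mean_new, mean_fixed


if __name__ == "__main__":
    mean_new, mean_fixed = mean_effect_of_fixed(mu=-1.0, sigma=2.0)
    print("mean gamma of new mutations:  ", round(mean_new, 3))
    print("mean gamma of fixed mutations:", round(mean_fixed, 3))
```

Because the fixation rate increases with `gamma`, the mutations that fix are always shifted toward the upper end of the distribution of new mutations, even when the average new mutation is deleterious.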

The robustness of the conclusions with respect to other changes in the statistical models to make them more biologically reasonable is also being investigated.