(Previous | Contents)

Using virus data in LAMARC

Data derived from viruses can be analyzed effectively in LAMARC, but experience shows that some unusual issues arise. This article outlines some possible issues, as well as special opportunities offered by virus data.

Two other coalescent MCMC programs which are often useful with virus data are:

BEAST (Drummond and Rambaut)

IM (Nielsen, Hey and Wakeley)

We recommend BEAST particularly when data from multiple time points are available, and IM particularly when divergence of populations from a common ancestor is relatively recent. Neither program can handle recombination, however, and IM cannot handle population growth. If those forces are significant in your data, estimates from these programs may be biased.

High growth rates

Sufficiently rapid growth will make the underlying genealogy of your data resemble a starburst, the so-called "star phylogeny." There is much less information in a star phylogeny than in an ordinary one, because all of the coalescences happen at practically the same time, and that single time is the only available piece of information. The practical result is that if you try to estimate Theta and growth on a star, you must fail; you are trying to infer two parameters from one piece of data. The likelihood surface will resemble an infinite ridge running in the direction of high Theta and high growth. LAMARC will attempt to find a maximum on this ridge, and either return an enormous value for Theta and growth, or give up and complain.

You can spot a star phylogeny by making an estimate of your phylogeny with branch lengths (using PAUP*, PHYLIP, or other tools). A star will have many near-zero branches near the bottom. If your data look like this, you will not be able to co-estimate Theta and growth. (Multiple unlinked star-like regions give you some chance of success, but a single one is hopeless.) You can estimate one parameter if you are willing to assume the other, however. If, for example, you feel you know your organism's current Theta, you can hold that value fixed and successfully estimate growth rate. Clearly, if you fix Theta at an incorrect value you will receive an incorrect growth rate in return.

Multiple time point data are more powerful in separating Theta and growth than single time point data. Such data can appropriately be analyzed by BEAST.

Data from multiple time points

Data from multiple time points far enough apart that measurable evolution has happened between the first and last are not handled correctly by LAMARC. We hope to add this capability in the near future. In the meantime, BEAST makes good use of such data.

You can analyze data from any one time point correctly in LAMARC, but you cannot consolidate estimates from multiple time points. They are highly non-independent and cannot be treated as multiple regions, nor is it a good idea to average them. Mixing data from different time points in a single LAMARC analysis will bias the estimate of Theta upwards, possibly severely. It is probably best to use only data from the time point with the largest sample size.

Unavailability of multiple unlinked genomic regions

LAMARC is much more powerful, for all parameters except recombination rate, if it is provided with multiple unlinked genomic regions. (For recombination, a single long region is best.) Many viruses simply do not have multiple unlinked regions, frustrating the researcher's desire for a precise estimate.

If the virus has recombination, using long sequences and performing an analysis which allows recombination will provide some of the advantages of multiple regions, because distant parts of the viral genome will have different trees.

In some cases, different patients or geographic area can be regarded as replicates of the evolutionary process, and combined as if they were independent genomic regions. For example, if you were trying to estimate the growth rate of HIV within a patient, but found that a single patient did not provide enough information, you could try treating virus sequences from a second patient as if they were, not additional copies of the same gene, but copies of a new gene. This is done by entering them into the file conversion utility as a separate region, and giving them a separate name. For example, you could analyze "env from patient 1" and "env from patient 2" as if they were completely separate genes.

The danger of this approach is that if population parameters differ between your two patients you will introduce errors by combining them together. Some hint of this can be found by comparing the single-region estimates. If patient 1's confidence intervals on any of the parameters reject patient 2's values, it is unwise to combine them. You will also want to avoid combining patients with known differences in their likely population parameters, such as patients with and without drug treatment.

If multiple time point samples are available, they may partially compensate for lack of additional regions. Such data can currently best be analyzed by BEAST. We hope to add this capability to LAMARC in the future.

Identifying specific recombinant strains

LAMARC estimates the recombination rate; it does not identify individual sequences as recombinants. If you are actually interested in finding recombinants, you will want a different tool, such as a bootscanning program. Knowing which sequences are probably recombinants is not actually helpful to LAMARC, so there is no way to give bootscanning results to LAMARC.

We recommend against the strategy of using bootscanning or eyeballing to identify recombinant sequences, removing them, and then doing a no-recombination LAMARC analysis (or BEAST or IM analysis). Many recombinants are inconspicuous because the two partner sequences are closely related. Any attempt to spot recombinants will therefore miss many of them (such as within-subtype recombinations in HIV) and those cryptic recombinations will distort the estimates of other parameters. It is better to leave all your data in the analysis and include recombination as a parameter. The only exception is if the putative recombinants are believed to be PCR artifacts, rather than biological recombinants. In this case it is correct to discard them.

Recombination in bursts

Some viruses seldom co-infect, so opportunities for recombination are rare, but when co-infection does occur multiple recombinations may immediately result. LAMARC's estimate of the recombination rate will be some unpredictable composite of the co-infection rate and recombination rate in these cases, though it is still valid to ask whether the confidence interval for the recombination rate includes or excludes zero.

Recent divergence

If you give LAMARC two populations which have recently diverged from a common ancestor, LAMARC will tend to estimate high migration between them even if there is none. LAMARC is detecting the shared lineages that came from the common ancestor, and interpreting them as migration. The IM program is a better tool if you suspect that divergence is recent. "Recent" here means divergence within approximately the last 2N generations for a haploid, 4N generations for a diploid.

Think of "migration" broadly

Migration rate estimation is often used when organisms are found in more than one geographical location. However, it is more generally applicable to any situation in which there is a long-term division of the population into two or more niches, with limited gene flow between niches. LAMARC has been successfully used to estimate gene flow between different tissue compartments within an HIV patient. It may also be able to estimate gene flow among different risk groups (e.g. sexual transmission versus needle-sharing transmission).

One thing you must not do, however, is sort your viruses into categories based on their genetic sequences and then try to infer the population sizes or migration rates of these genetically defined subgroups. For example, you cannot use LAMARC to estimate the effective population size of an HIV serotype. Using genetic data to define the subgroups destroys the evidence you would need to estimate population parameters. (For example, presence of a divergent sequence in a population is evidence of migration; but if population membership is defined by genetic sequence, divergent sequences will never be found.) To the best of our knowledge there is no way to analyze genetically defined subgroups in LAMARC, although we hope to add such capabilities in the future.

Mutational saturation

Error-prone virus replication can lead to data sets with immense amounts of polymorphism. It is important to choose an appropriate mutational model (the ModelTest plug-in to PAUP* can help here) but as long as the data are alignable, LAMARC can still handle them. Unlike algorithms based on the assumptions of the infinite-sites mutational model, it is not confused by multiple hits. There is an upper limit on mutations beyond which the information in the data will be lost, but generally alignment becomes impossible long before this limit is reached.

Areas of the sequence where reliable alignment is not possible should be replaced with "unknown data" characters. This may lead to a slight downwards bias in Theta (as the most variable areas are most likely to be unalignable) but is better than the large upwards bias produced by including wrongly aligned sequences.

Per-subtype data collection

Unfortunately, if you collect your data by identifying serological or sequence subtypes, and then sequencing one individual per subtype (or any similar strategy) you will not be able to analyze it correctly in LAMARC, BEAST or IM. All three programs assume a random population sample, and violating this assumption will lead to huge upwards biases in your parameter estimates. Similarly, you must not omit identical sequences, boring though they may appear. We know of no way to rescue such data for a coalescent analysis.

(Previous | Contents)