asymptoticMK: Asymptotic McDonald-Kreitman Test

See below for background and usage information. If you use this service, please cite our paper:

B.C. Haller, P.W. Messer. (2017). asymptoticMK: A web-based tool for the asymptotic McDonald–Kreitman test. G3: Genes, Genomes, Genetics 7(5), 1569–1575. doi:10.1534/g3.117.039693

The web-based service that used to live on this page has been discontinued. It was constantly breaking due to web server upgrades, so maintaining it was an ongoing hassle. If someone else has a FastRWeb server running, and they would like to host this service, please contact us.

An R script suitable for running this test on your local machine is still available. It can be downloaded from the asymptoticMK GitHub repository. An example SLiM model for producing binned polymorphism data suitable for testing asymptoticMK is also available in that repository.

Please let us know of any issues with asymptoticMK at philipp {dot} messer <at> gmail [dot] com. Thanks!

Background & usage:

This page describes an R-based implementation of the asymptotic McDonald–Kreitman test (Messer & Petrov 2013). The test estimates α (alpha), the fraction of substitutions in a genomic test region that were driven to fixation by positive selection. To do this, it uses the supplied data to calculate empirical values of a function α(x):

α(x) = 1 − (d₀ / d) (p(x) / p₀(x))

where

x   =   derived allele frequency

d₀   =   substitution rate in the neutral reference region

d   =   substitution rate in the test region

p₀(x)   =   polymorphism level in the neutral reference region for frequency class x

p(x)   =   polymorphism level in the test region for frequency class x.
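
As an illustration of the formula above, here is a minimal Python sketch that computes the empirical α(x) values for a set of frequency classes. It is not the R script from the repository; the function name `empirical_alpha` and the input numbers are made up for this example.

```python
def empirical_alpha(d0, d, p, p0):
    """Return alpha(x) = 1 - (d0 / d) * (p(x) / p0(x)) for each frequency class.

    d0, d : substitution rates in the neutral reference and test regions
    p, p0 : polymorphism levels per frequency class (test, neutral reference)
    """
    if len(p) != len(p0):
        raise ValueError("p and p0 must have one value per frequency class")
    return [1.0 - (d0 / d) * (p_x / p0_x) for p_x, p0_x in zip(p, p0)]


# Made-up illustrative inputs, not real data.
alpha_values = empirical_alpha(d0=100, d=50, p=[10, 5], p0=[10, 10])
```

Each element of `alpha_values` corresponds to one frequency class x, in the same order as the `p` and `p0` inputs.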

It then fits an exponential function to this data, of the form:

α_fit(x) = a + b exp(−cx)

The value of this function extrapolated to x = 1 provides an estimated value for α:

α_asymptotic = α_fit(x = 1)
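
To make the extrapolation step concrete, the following Python sketch evaluates the exponential form at x = 1 for a given set of fitted parameters. The parameter values are hypothetical, and obtaining them (the nonlinear fit itself) is not shown here.

```python
import math


def alpha_fit_exponential(x, a, b, c):
    """Evaluate the exponential model a + b * exp(-c * x)."""
    return a + b * math.exp(-c * x)


def alpha_asymptotic(a, b, c):
    """Extrapolate the fitted exponential to x = 1."""
    return alpha_fit_exponential(1.0, a, b, c)


# Hypothetical fitted parameters, chosen only for illustration.
estimate = alpha_asymptotic(a=0.4, b=-0.5, c=10.0)
```

When c is large, the exponential term is close to zero at x = 1, so the estimate is close to the parameter a.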

Although the exponential function is generally expected to provide the best fit, a linear function is also fit to the data, of the form:

α_fit(x) = a + bx
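
The linear alternative can be sketched with ordinary least squares in closed form. This Python example is an assumption about how such a fit could be done, not a copy of the R script; the helper name `fit_linear` and the data points are invented for illustration.

```python
def fit_linear(x, alpha):
    """Ordinary least-squares fit of alpha = a + b * x; returns (a, b)."""
    n = len(x)
    if n < 2 or n != len(alpha):
        raise ValueError("need at least two (x, alpha) pairs of equal length")
    mean_x = sum(x) / n
    mean_alpha = sum(alpha) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (ai - mean_alpha) for xi, ai in zip(x, alpha))
    b = sxy / sxx                 # slope
    a = mean_alpha - b * mean_x   # intercept
    return a, b


# Invented data lying exactly on alpha = 0.1 + 0.4 * x.
xs = [0.1, 0.3, 0.5, 0.7, 0.9]
alphas = [0.1 + 0.4 * xi for xi in xs]
a, b = fit_linear(xs, alphas)
alpha_at_one = a + b * 1.0  # extrapolate the line to x = 1
```

As with the exponential form, the estimate of α is the fitted function evaluated at x = 1.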

If the exponential fit fails to converge (which can happen if the data does not fit an exponential pattern), or if the linear fit is superior according to AIC, then the linear fit is reported; otherwise, the exponential fit is reported. (There are also pathological cases in which the exponential fit is superior according to AIC, but the confidence interval of its estimate of α is very wide; in that case, the linear fit is also preferred.)
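
The selection rule described above might be sketched as follows. This Python example uses a common least-squares form of AIC, n ln(RSS/n) + 2k, which is an assumption and may differ from the exact computation in the R script. The function names are hypothetical, ties are resolved in favor of the linear fit as an arbitrary choice, and the confidence-interval check is omitted because the text gives no threshold for "very wide".

```python
import math


def gaussian_aic(rss, n, n_coefficients):
    """AIC for a least-squares fit, up to a constant shared by both models.

    k counts the model coefficients plus one for the residual variance.
    """
    k = n_coefficients + 1
    return n * math.log(rss / n) + 2 * k


def choose_model(n, rss_linear, rss_exponential=None):
    """Return 'linear' or 'exponential'.

    rss_exponential is None when the exponential fit failed to converge.
    """
    if rss_exponential is None:
        return "linear"
    aic_linear = gaussian_aic(rss_linear, n, n_coefficients=2)
    aic_exponential = gaussian_aic(rss_exponential, n, n_coefficients=3)
    # Lower AIC is better; a tie goes to the simpler linear model (assumption).
    return "linear" if aic_linear <= aic_exponential else "exponential"


# Hypothetical residual sums of squares for 20 frequency classes.
choice = choose_model(n=20, rss_linear=0.5, rss_exponential=0.05)
```

Because the exponential model has one more parameter, it has to reduce the residual sum of squares enough to offset the extra penalty before it is chosen.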

The individual steps required to execute the test are:

1. determine your test region and an appropriate neutral reference region,
2. determine the overall substitution rates for those regions,
3. subdivide your SNP data for these regions into derived allele frequency classes (the number of classes depending upon how much data you have, such that there are no "empty" frequency classes),
4. determine the polymorphism level within each of those frequency classes, and
5. crank this information through the provided R script to obtain plots and analysis.
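
The binning steps above could look something like the Python sketch below. It assumes equal-width frequency classes and uses the SNP count per class as the polymorphism level; both are assumptions made for illustration, and the function name and frequencies are invented.

```python
def bin_frequencies(derived_freqs, n_classes):
    """Count SNPs in n_classes equal-width derived allele frequency classes."""
    counts = [0] * n_classes
    for f in derived_freqs:
        if not 0.0 < f < 1.0:
            raise ValueError("segregating sites must have 0 < frequency < 1")
        # Map the frequency to a class index, guarding the upper edge.
        index = min(int(f * n_classes), n_classes - 1)
        counts[index] += 1
    return counts


def has_empty_class(counts):
    """True if any class has no SNPs, suggesting that fewer classes are needed."""
    return any(c == 0 for c in counts)


# Invented derived allele frequencies for a handful of SNPs.
freqs = [0.05, 0.15, 0.25, 0.45, 0.65, 0.85, 0.95]
counts = bin_frequencies(freqs, n_classes=5)
too_fine = has_empty_class(bin_frequencies(freqs, n_classes=20))
```

If `has_empty_class` returns true, reducing the number of classes and binning again is one way to satisfy the "no empty classes" condition.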

Note that it is often advisable to trim the polymorphism data, removing the lowest and highest frequency classes. This is recommended because low-frequency polymorphisms can have a high error rate due to sequencing error, whereas high-frequency polymorphisms can be vulnerable to polarization error. This service therefore allows specification of the interval of x that will be used to fit the exponential and linear models.
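
Trimming can be sketched as a simple filter on x. In this Python example, the row layout (x, p, p₀), the inclusive bounds, and the cutoff values are all assumptions chosen for illustration.

```python
def trim_classes(rows, x_low, x_high):
    """Keep only the frequency classes with x_low <= x <= x_high.

    rows is a list of (x, p, p0) tuples.
    """
    return [row for row in rows if x_low <= row[0] <= x_high]


# Invented binned data: (x, p, p0) per frequency class.
rows = [
    (0.05, 900.0, 1500.0),
    (0.25, 200.0, 400.0),
    (0.50, 90.0, 200.0),
    (0.75, 50.0, 120.0),
    (0.95, 60.0, 100.0),
]
trimmed = trim_classes(rows, x_low=0.1, x_high=0.9)
```

Only the rows in `trimmed` would then be passed to the exponential and linear fits.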

Values for the polymorphism rates should be supplied in a tab-separated file of row data, with columns for x, p, and p₀. The file's first row should contain text labels for the columns; the specific text used to label the columns is unimportant, but their order (x, p, p₀) is important. A sample file can be seen here.
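
A file in this layout could be read as shown in the following Python sketch. The column labels and numbers in the sample text are invented, and the reader relies only on column order, skipping the header row whatever its labels are.

```python
import csv
import io


def read_polymorphism_table(handle):
    """Read tab-separated rows of (x, p, p0), skipping the header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # header labels are ignored; only column order matters
    rows = []
    for record in reader:
        if not record:
            continue  # tolerate blank lines
        x, p, p0 = (float(value) for value in record[:3])
        rows.append((x, p, p0))
    return rows


# Invented sample content standing in for a real input file.
sample_text = "x\tp\tp0\n0.1\t120\t300\n0.5\t40\t110\n"
table = read_polymorphism_table(io.StringIO(sample_text))
```

For a file on disk, an open file object (for example from `open(path, newline="")`) can be passed in place of the `io.StringIO` wrapper.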

Source & acknowledgements:

The asymptoticMK web service (now defunct) was implemented using FastRWeb, a package for building R-based web services; thanks to S. Urbanek for FastRWeb, and for considerable help with getting it up and running. It's a cool technology; my ISP just seemed determined to break it every month or two. Thanks also to G. Grothendieck, J. Horner, and S. Urbanek for other R packages used in asymptoticMK. Thanks to A.-N. Spiess for the R code used to obtain a confidence interval for the exponential fit. Development of this service was supported by funds from the College of Agriculture and Life Sciences at Cornell University to PWM.
