## asymptoticMK: Asymptotic McDonald–Kreitman Test

By Benjamin C. Haller & Philipp W. Messer. Copyright © 2017 Philipp Messer.

See below for background and usage information. If you use this service, please cite our paper:

B.C. Haller, P.W. Messer. (2017). asymptoticMK: A web-based tool for the asymptotic McDonald–Kreitman test. G3: Genes, Genomes, Genetics 7(5), 1569–1575. doi:10.1534/g3.117.039693

Form inputs: d, d0, input file (tab-delimited, with named columns for x, p, and p0; a sample is available), and the x interval to fit.

Please let us know of any issues with this service at philipp {dot} messer <at> gmail [dot] com. Thanks!

### Background & usage

This page provides an R-based implementation of the asymptotic McDonald–Kreitman test (Messer & Petrov 2013) as a web-based service; it can also be run at the command line using curl, or as a local R script (see below). The test estimates α (alpha), the fraction of substitutions in a genomic test region that were driven to fixation by positive selection. To do this, it uses the supplied data to calculate empirical values of a function α(x):

α(x) = 1 − (d0 / d) (p(x) / p0(x))

where

- x = derived allele frequency
- d0 = substitution rate in the neutral reference region
- d = substitution rate in the test region
- p0(x) = polymorphism level in the neutral reference region for frequency class x
- p(x) = polymorphism level in the test region for frequency class x
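
As a rough illustration of the formula above, the following Python sketch computes α(x) for each frequency class. It is not the service's R code; the function name `alpha_values` and the numbers in the example call are hypothetical and chosen only to show the arithmetic.

```python
def alpha_values(d, d0, p, p0):
    """Return the empirical alpha(x) for each frequency class.

    d, d0 : substitution rates in the test and neutral reference regions
    p, p0 : per-class polymorphism levels (same length, same class order)
    """
    if len(p) != len(p0):
        raise ValueError("p and p0 must have one entry per frequency class")
    # alpha(x) = 1 - (d0 / d) * (p(x) / p0(x)).
    # p0(x) must be non-zero, so empty reference classes cannot be used.
    return [1.0 - (d0 / d) * (pi / p0i) for pi, p0i in zip(p, p0)]


# Hypothetical values, for illustration only.
alphas = alpha_values(d=593, d0=930, p=[120, 60, 30], p0=[400, 150, 60])
```

Each entry of `alphas` corresponds to one row of the input file, in the same order.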

It then fits an exponential function to this data, of the form:

α_fit(x) = a + b exp(−cx)

The value of this function extrapolated to x = 1 provides an estimated value for α:

α_asymptotic = α_fit(x = 1)
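
To make the fit-and-extrapolate step concrete, here is a minimal Python sketch. It does not reproduce the fitting routine used by the R implementation; it swaps in a simple technique instead: for a fixed c the model a + b exp(−cx) is linear in a and b, so the sketch scans c over a log-spaced grid and solves an ordinary least-squares problem at each grid point. All function names, the grid bounds, and the grid resolution are assumptions made for this example.

```python
import math


def fit_linear_in(z, y):
    """Ordinary least squares for y = a + b*z; returns (a, b, rss)."""
    n = len(z)
    mz, my = sum(z) / n, sum(y) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    szy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    b = szy / szz if szz > 0 else 0.0  # degenerate regressor: flat fit
    a = my - b * mz
    rss = sum((yi - (a + b * zi)) ** 2 for zi, yi in zip(z, y))
    return a, b, rss


def fit_exponential(x, y, c_min=0.01, c_max=100.0, steps=2000):
    """Fit y = a + b*exp(-c*x) by scanning c on a log-spaced grid.

    Returns (a, b, c, rss) for the grid point with the smallest
    residual sum of squares.
    """
    best = None
    for i in range(steps + 1):
        c = c_min * (c_max / c_min) ** (i / steps)
        z = [math.exp(-c * xi) for xi in x]
        a, b, rss = fit_linear_in(z, y)
        if best is None or rss < best[3]:
            best = (a, b, c, rss)
    return best


def alpha_asymptotic(a, b, c):
    """Extrapolate the fitted curve to x = 1."""
    return a + b * math.exp(-c)
```

A grid scan is slower and coarser than a proper nonlinear least-squares solver, but it cannot fail to converge, which makes it convenient for a sketch. It also provides no confidence interval for the estimate.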

Although the exponential function is generally expected to provide the best fit, a linear function is also fit to the data, of the form:

α_fit(x) = a + bx

If the exponential fit fails to converge (which can happen if the data does not fit an exponential pattern), or if the linear fit is superior according to AIC, then the linear fit is reported; otherwise, the exponential fit is reported. (There are also pathological cases in which the exponential fit is superior according to AIC, but the confidence interval of its estimate of α is very wide; in that case, the linear fit is also preferred.)
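
The selection rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the service's code: AIC is computed here from the residual sum of squares under Gaussian errors (additive constants that are identical for both models are dropped), the parameter counts include one extra parameter for the error variance, and the "very wide confidence interval" condition is passed in as a boolean because the source does not give a threshold. The names `aic_least_squares` and `choose_model` are hypothetical.

```python
import math


def aic_least_squares(rss, n, k):
    """AIC (up to a model-independent constant) for a least-squares fit.

    rss : residual sum of squares (must be > 0)
    n   : number of data points
    k   : number of estimated parameters, including the error variance
    """
    return n * math.log(rss / n) + 2 * k


def choose_model(rss_exp, rss_lin, n, exp_converged=True, exp_ci_too_wide=False):
    """Return "linear" or "exponential" following the rule in the text."""
    if not exp_converged or exp_ci_too_wide:
        return "linear"
    aic_exp = aic_least_squares(rss_exp, n, k=4)  # a, b, c + variance
    aic_lin = aic_least_squares(rss_lin, n, k=3)  # a, b + variance
    # The linear fit is reported only if it is strictly better by AIC.
    return "linear" if aic_lin < aic_exp else "exponential"
```

With equal residuals the linear model wins because it has fewer parameters; the exponential model must reduce the residuals enough to offset its extra parameter.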

The individual steps required to execute the test are:

1. determine your test region and an appropriate neutral reference region,
2. determine the overall substitution rates for those regions,
3. subdivide your SNP data for these regions into derived allele frequency classes (the number of classes depending upon how much data you have, such that there are no "empty" frequency classes),
4. determine the polymorphism level within each of those frequency classes, and
5. submit this information using the form above to obtain plots and analysis.
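
Steps 3 and 4 might look like the following Python sketch. It assumes equal-width frequency classes, uses each class's midpoint as its x value, and uses the raw SNP count per class as the polymorphism level; none of these choices is specified by the text above, and the function names are hypothetical.

```python
def bin_frequencies(freqs, n_classes):
    """Count SNPs per derived-allele-frequency class.

    Returns (centers, counts) for n_classes equal-width classes on (0, 1).
    """
    counts = [0] * n_classes
    for f in freqs:
        if not 0.0 < f < 1.0:
            continue  # frequency 0 or 1 is not a segregating polymorphism
        idx = min(int(f * n_classes), n_classes - 1)
        counts[idx] += 1
    centers = [(i + 0.5) / n_classes for i in range(n_classes)]
    return centers, counts


def has_empty_class(*count_lists):
    """True if any class in any of the given count lists is empty."""
    return any(c == 0 for counts in count_lists for c in counts)


# Hypothetical derived allele frequencies for the test region.
x, p = bin_frequencies([0.05, 0.12, 0.18, 0.55, 0.95], n_classes=5)
```

If `has_empty_class` returns true for the test or reference counts, reducing `n_classes` is one way to satisfy the "no empty frequency classes" requirement.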

Note that it is often advisable to trim the polymorphism data, removing the lowest and highest frequency classes. This is recommended because low-frequency polymorphisms can have a high error rate due to sequencing error, whereas high-frequency polymorphisms can be vulnerable to polarization error. This service therefore allows specification of the interval of x that will be used to fit the exponential and linear models.
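
For readers running a similar analysis locally, trimming can be sketched in Python as a simple filter on x. Whether the service treats the interval bounds as inclusive is not stated above; this sketch assumes they are, and the name `trim_to_interval` is hypothetical.

```python
def trim_to_interval(rows, xlow, xhigh):
    """Keep only rows (x, p, p0) whose x lies in [xlow, xhigh]."""
    return [row for row in rows if xlow <= row[0] <= xhigh]


# Hypothetical rows of (x, p, p0), for illustration only.
rows = [(0.05, 120, 400), (0.5, 60, 150), (0.95, 30, 60)]
trimmed = trim_to_interval(rows, xlow=0.1, xhigh=0.9)
```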

Values for the polymorphism rates should be supplied in a tab-separated file with one row per frequency class and columns for x, p, and p0. The file's first row should contain text labels for the columns; the specific labels are unimportant, but the column order (x, p, p0) matters. A sample file can be seen here.
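
A file in this layout can be produced with Python's standard `csv` module, as in the sketch below. The helper names are hypothetical, and the header labels `x`, `p`, `p0` are just one possible choice, since only the column order matters.

```python
import csv
import io


def write_polymorphism_table(handle, x, p, p0):
    """Write a header row, then one tab-separated row per frequency class."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["x", "p", "p0"])
    for row in zip(x, p, p0):
        writer.writerow(row)


def read_polymorphism_table(handle):
    """Read rows back by column position, skipping the header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # header labels are ignored; only column order matters
    return [(float(r[0]), float(r[1]), float(r[2])) for r in reader if r]


# Hypothetical data written to an in-memory buffer instead of a file.
buf = io.StringIO()
write_polymorphism_table(buf, [0.1, 0.3], [10, 5], [20, 8])
```

To write to disk instead, pass a handle opened with `open("polymorphisms.txt", "w", newline="")`.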

### Command-line usage

This service can also be used in an automated fashion at the command line using the Un*x command curl. Depending upon your operating system, you may need to install curl first. To download the full HTML response for a query, use a command like:

```
curl -F"d=593" -F"d0=930" -F"xlow=0.1" -F"xhigh=0.9" -F"datafile=@polymorphisms.txt" -o "MK_full.html" http://benhaller.com/cgi-bin/R/asymptoticMK_run.html
```

That should all be entered as a single line at the Un*x terminal prompt. The values for d, d0, and the x cutoff interval are supplied with -F options to the curl command as shown here. The file to upload with binned values for x, p, and p0 is given with another -F option, with an @ preceding the filename. The filename for output is given with -o, and the URL for submission to this web service is supplied last. In this example, the file to upload and the output file are both in the current directory, but supplying Un*x paths should also work. The result of this command is an HTML file with the full response, including embedded plots.
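
When many regions need to be submitted, it can help to assemble the command programmatically. The Python sketch below builds the same curl invocation as an argument list, using only the options and form field names shown in the example above; the function name `build_curl_command` is hypothetical, and the sketch does not contact the service unless the final commented-out line is enabled.

```python
import shlex
import subprocess


def build_curl_command(d, d0, xlow, xhigh, datafile, outfile, url):
    """Return the curl argument list for one submission."""
    return [
        "curl",
        "-F", f"d={d}",
        "-F", f"d0={d0}",
        "-F", f"xlow={xlow}",
        "-F", f"xhigh={xhigh}",
        "-F", f"datafile=@{datafile}",  # "@" tells curl to upload the file
        "-o", outfile,
        url,
    ]


cmd = build_curl_command(
    593, 930, 0.1, 0.9,
    "polymorphisms.txt", "MK_full.html",
    "http://benhaller.com/cgi-bin/R/asymptoticMK_run.html",
)
print(shlex.join(cmd))  # shows the equivalent single-line shell command
# subprocess.run(cmd, check=True)  # uncomment to actually submit
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.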

Often, however, for automation of a workflow one needs the results in a more machine-readable format, without plots, explanatory text, or HTML markup. To get that, use a command like:

```curl -F"d=593" -F"d0=930" -F"xlow=0.1" -F"xhigh=0.9" -F"datafile=@polymorphisms.txt"