ASMC
-
Phased haplotypes in Oxford haps/sample format (*.hap/hap.gz, samples)
Sum of pairwise posterior coalescence probabilities ``*.sumOverPairs.gz` <#sum-of-pairwise-posterior-coalescence-probabilities-sumoverpairsgz>`__
This page describes compiling and running the ASMC program described in this paper:
P. Palamara, J. Terhorst, Y. Song, A. Price. High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability. Nature Genetics, 2018.
A page with data and annotations from the paper can be found here.
Summary (TL;DR)
The Ascertained Sequentially Markovian Coalescent is a method to efficiently estimate pairwise coalescence times along the genome. It can be run using SNP array or whole-genome sequencing (WGS) data.
To run ASMC, download it here,
and follow the quickstart guide. Compile the
ASMC_exe target to build the ASMC executable. To compute pairwise
coalescence times for the following files containing SNP array data for
150 phased diploid samples:
ASMC_data/examples/asmc/exampleFile.n300.array.hap.gzASMC_data/examples/asmc/exampleFile.n300.array.samplesASMC_data/examples/asmc/exampleFile.n300.array.map.gz
(see here for file formats), you can run the following ASMC command:
./build/ASMC_exe \
--decodingQuantFile ASMC_data/decoding_quantities/30-100-2000_CEU.decodingQuantities.gz \
--inFileRoot ASMC_data/examples/asmc/exampleFile.n300.array \
--posteriorSums
This will generate
ASMC_data/examples/asmc/exampleFile.n300.array.1-1.sumOverPairs.gz,
which contains a matrix of size SxD, where S is the number of sites in
the data, and D is the number of discrete coalescence time intervals
defined in ASMC_data/discretizations/30-100-2000_CEU.disc. The
[i,j]-th entry of this matrix contains the sum of posterior
coalescence probabilities for all samples at SNP i and time j.
Running ASMC
Once you have computed or downloaded the decoding quantities
corresponding to your demographic model, time discretization, and SNP
allele frequencies, you can analyze SNP or WGS data using the ASMC
program:
./build/ASMC_exe \
--decodingQuantFile ASMC_data/decoding_quantities/30-100-2000_CEU.decodingQuantities.gz \
--inFileRoot ASMC_data/examples/asmc/exampleFile.n300.array \
--majorMinorPosteriorSums
For WGS data, add the --mode sequence option:
./build/ASMC_exe \
--decodingQuantFile ASMC_data/decoding_quantities/30-100-2000_CEU.decodingQuantities.gz \
--inFileRoot ASMC_data/examples/asmc/exampleFile.n300 \
--majorMinorPosteriorSums \
--mode sequence
Using the --majorMinorPosteriorSums flag, ASMC will output the sum
of posterior coalescence probabilities for all analyzed pairs of
individuals. These will be written in
ASMC_data/examples/asmc/exampleFile.n300.array.{00,01,11}.sumOverPairs.gz
for the SNP array example, and
ASMC_data/examples/asmc/exampleFile.n300.{00,01,11}.sumOverPairs.gz
for the WGS example.
If you are decoding a large number of samples, you can break down the
computation in several jobs using the --jobs int and
--jobInd int flags, which take the total number of jobs to be
performed and the current job index as arguments. If you don’t specify a
name for the output files, a job index will be automatically added to
the default output path.
The full set of command line options for ASMC is as follows:
Mandatory:
--inFileRoot arg Prefix of hap|haps|hap.gz|haps.gz, sample|samples, map|map.gz file
--decodingQuantFile arg Decoding quantities file
Choose one of:
--posteriorSums Output file for sum of posterior distribution
over pairs.
--majorMinorPosteriorSums Output file for sum of posterior distribution
over pairs, partitioned by major/minor alleles.
Optional:
--outFileRoot arg Output file for sum of posterior distribution
over pairs (default: --hapsFileRoot argument)
--mapFile arg Optional override for map|map.gz file, if not in inFileRoot
--jobs int (=0) Number of jobs being done in parallel
--jobInd int (=0) Job index (1..jobs)
--mode string (=array) Decoding mode. Choose from {sequence, array}.
--compress (=false) Compress emission to binary (no CSFS)
--useAncestral (=false) Assume ancestral alleles are coded as 1 in
input (will assume 1 = minor otherwise)
--skipCSFSdistance float (=0.0) Genetic distance (in cM) between two CSFS emissions
In addition to the arguments described above, ASMC options include:
--compressis a shorthand for--skipCSFSdistance Infinity(see below).--useAncestralcan be used to specify that a1in the data specifies an ancestral allele. This will cause the CSFS to be used without folding. This is mostly not needed.--skipCSFSdistance float, which takes a floating point argument, specifies the minimum distance for a CSFS emission to be used. The default is0.0(always use CSFS). Setting--skipCSFSdistance Infinity(which is the same as--compress) leads to never using the CSFS (i.e. the classic PSMC emission if decoding WGS data, or a binary emission which controls for ascertainment if decoding SNP array data).
Input/output file formats
You may want to look at files in ASMC_data/ for examples of the file
formats described below.
Phased haplotypes in Oxford haps/sample format (*.hap/hap.gz, samples)
These files are provided in input to ASMC and optionally
ASMCprepareDecoding. The file format explained
here. These
files are output by phasing programs like Eagle and Shapeit.
Genetic map (*.map/map.gz)
The genetic map provided in input to ASMC is in Plink
map format, in
which each line has four columns with format “Chromosome SNPName
GeneticPosition PhysicalPosition”. Genetic positions are in
centimorgans, physical positions are in bp. The map can be optionally
compressed using gzip.
Demographic history (*.demo)
The demographic history provided in input to ASMCprepareDecoding
represents a piece-wise constant history of past effective population
sizes, with format
TimeStart PopulationSize
Where TimeStart is the first generation where the population has size
PopulationSize. Note that population size is haploid, and that the
demographic model is usually built assuming a specific mutation rate,
which is passed as an argument to the ASMCprepareDecoding program.
The first line should contain generation 0. You can obtain this
model using e.g. PSMC/MSMC/SMC++. If your model is not piecewise
constant, you will need to approximate it as piecewise constant. The
last provided interval is assumed to last until time=Infinity (and is
usually remote enough to have negligible effects on the results).
The demographic models used with ASMC can be found here and were inferred using smc++ in the following paper:
Spence, J.P. and Song, Y.S. Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations. Science Advances, Vol. 5, No. 10, eaaw9206 (2019), [doi].
They correspond to these population sizes, but rescaled to assume mutation rate of 1.65e-8.
Time discretization (*.disc)
The list of discrete time intervals provided in input to ASMC
contains a single number per line, representing time measured in
(continuous) generations, and starting at generation 0.0. For
instance, the list ASMC_data/discretizations/30-100-2000_CEU.disc
contains time intervals:
0.0
30.0
60.0
90.0
120.0
150.0
180.0
210.0
240.0
270.0
300.0
330.0
360.0
... <lines omitted>
79855.6
96263.0
124311.7
The intervals defined in this file are:
{0.0-30.0, 30.0-60.0, ..., 96263.0-124311.7, 124311.7-Infinity}.
Decoding quantities (*.decodingQuantities.gz)
The *.decodingQuantities.gz file is generated by
ASMCprepareDecoding and input into ASMC. It is used to perform
efficient inference of pairwise coalescence times. There is no need to
understand the content of this file.
Note: the CEU.demo demographic model and the decoding quantities for CEU+UKBB previously provided in this repository and this repository were mistakenly encoded as diploid rather than haploid. The file CEU.demo and CEU+UKBB decoding quantities here have now been fixed. They were generated using v2.2.1 of the PrepareDecoding tool, which also provides a simpler interface for computing decoding quantities as well as support for additional demographic models. Using these new decoding quantities with v1.2 of ASMC will tend to produce more recent estimates for TMRCAs compared to the decoding quantities distributed with v1.0 and v1.1. This should not have a substantial impact on most downstream analyses.
Time discretization intervals (*.intervalsInfo)
The *.intervalsInfo file is generated by the ASMCprepareDecoding
and input into ASMC. It contains some useful information about the
time discretization and the demographic model. It contains a number of
lines corresponding to the number of discrete time intervals used in the
analysis. Each line has format:
IntervalStart ExpectedCoalescenceTime IntervalEnd
IntervalStart and IntervalEnd represent the start/end of each discrete time interval, ExpectedCoalescenceTime is the expected coalescence time for a pair of individuals who have been inferred to coalesce within this time interval, and depends on the demographic model.
Sum of pairwise posterior coalescence probabilities *.sumOverPairs.gz
The output of the ASMC analysis is written in
*.{00,01,11}.sumOverPairs.gz files. Each file contains a matrix of
size SxD, where S is the number of sites in the data, and D is the
number of discrete time intervals used in the analysis. The
[i,j]-th entry of each matrix contains the sum of posterior
coalescence probabilities for all samples at SNP i and discrete
coalescence time j. The output breaks down coalescence events of
samples carrying different alleles at each site, using the
{00,01,11} suffixes. Specifically:
The i-th row of the matrix in
*.00.sumOverPairs.gzcontains the sum of posterior probabilities for all pairs of samples that are homozygous0at site i.The i-th row of the matrix in
*.01.sumOverPairs.gzcontains the sum of posterior probabilities for all pairs of samples that heterozygous at site i.The i-th row of the matrix in
*.11.sumOverPairs.gzcontains the sum of posterior probabilities for all pairs of samples that are homozygous1at site i.
Tools, scripts, and analyses
These are some useful tools and scripts to be used with ASMC. They have not yet been relocated to this repository, but are standalone tools that are available at the indicated locations.
Tool to merge output of parallel ASMC jobs
The folder
`MERGE_POSTERIORS <https://github.com/PalamaraLab/ASMC_legacy/tree/master/TOOLS/MERGE_POSTERIORS>`__
contains the ASMCmergePosteriorSums.jar program, which may be used
to merge the output of different ASMC jobs. You can type
java -jar TOOLS/MERGE_POSTERIORS/ASMCmergePosteriorSums.jar -h
for a list of command line options. Also see the merge.sh file. This
tool assumes the decoding has been done using
--majorMinorPosteriorSums. You may normalize the output so that the
posterior sums to 1 for each site using the --norm flag. If you
used the --posteriorSums flag and you want to simply normalize the
output, you can run:
zcat path/to/output/exampleFile.n100.array.merged.sumOverPairs.gz | \
awk '{ c=0; for (i=1; i<=NF; i++) c+=$i; for (i=1; i<=NF; i++) $i/=c; print; }' | \
gzip -c -v - > path/to/output/exampleFile.n100.array.merged.norm.sumOverPairs.gz
Script to visualize average coalescence density in a region
The folder
`PLOT_POSTERIORS <https://github.com/PalamaraLab/ASMC_legacy/tree/master/TOOLS/PLOT_POSTERIORS>`__
contains the plotPosteriorHeatMap.py program, which can be used to
visualize the coalescence posterior density in specific regions (e.g.
figures 3.b, 3.c, and S7 in the ASMC paper). For an example, see the
data page, where
the
UKBB_posteriors
files can be downloaded to generate the figures from the paper.
Density of recent coalescence (DRC) statistic
The DRC statistic can be obtained by summing entries of the
sumOverPairs matrices that correspond to the desired time interval,
and averaging across SNPs within a window. Please refer to the paper for
details.
A few notes on analyzing the DRC statistic: it is important to remember that using a Gamma distribution as a null model for the DRC statistic is an approximation (see paper), and further work is needed to derive an exact null model. In particular the DRC cannot be larger than 1, so approximate p-values obtained under a Gamma model for very large DRC values will be artificially small. We found the Gamma approximation to be reasonable under some conditions, such as a large recent effective population size, which makes the DRC small in neutral regions, but this approximation will not work well in all populations (e.g. isolated populations) or for large time intervals. If you decide to use this approach to detect candidate regions for selection, we recommend carefully checking that the Gamma approximation is reasonable. Because particularly large DRC values may lead to artificially small p-values, you may also want to use the Gamma approximation only to determine an approximate significance threshold, then report raw DRC values larger than this threshold, rather than reporting approximate p-values.
Finally, we also suggest that you check that candidate regions detected this way do not overlap problematic regions of the genome, such as regions of very high/low recombination rate or LD, as well as large structural variants, such as inversions, which may affect recombination.
Precomputed decoding quantities
This repository contains various data files related to ASMC that users might find helpful. This directory contains two sets of decoding quantities that have been precomputed using the PrepareDecoding tool.
They are built using the European demographic model
CEU.demo,
SNP allele frequencies from the UK Biobank in
UKBB.frq,
and the time discretizations that can be found in this
directory.
The discretizations contain several short time intervals in the recent
generations, followed by intervals calculated from the mutation age
distribution from generation 2,000 on. This enables getting more
fine-grained information for recent generation, though note that smaller
time intervals will contain fewer coalescent events on average.