Ancestry
Introduction
Ancestry infers the ancestry of individuals from sequence reads by placing each sample in Principal Component (PC) space. It is suited to targeted/exome sequencing and whole genome sequencing experiments. Ancestry is implemented in C++ for fast computation.
Download
To get a copy of the software, please contact zhanxw@umich.edu.
For CSG users, the binary executable is located at: /net/fantasia/home/zhanxw/spa/cpp/executable/ancestrySeq
We plan to open source this program shortly.
Command Line Options
Input sequence data (--inSeq)
Sequence data (BAM files) must be preprocessed into .seq format. This procedure is described here.
A seq file is generated from pileup files. It contains the sequencing information, organized in a LASER-readable format. The first two columns are the population ID and the individual ID. Subsequent columns come in pairs, one pair per marker: the total read depth followed by the reference base count. For example, columns 3 and 4 are 0 and 0 in the line below, which means that at the first marker the read depth is 0 and therefore no reads carry the reference base. Markers must be separated by tabs, and the read depth and reference base count within a marker must be separated by a space. One line of a seq file looks like this:
NA12878.chrom22 NA12878.chrom22 0 0 0 0 0 0 0 0 0
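The layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the Ancestry distribution: the function name parse_seq_line is hypothetical, and it assumes that the two ID columns are also tab-separated from each other and from the marker columns, which the description above does not state explicitly.

```python
def parse_seq_line(line):
    """Parse one .seq line into (pop_id, indiv_id, markers).

    markers is a list of (depth, ref_count) tuples, one per marker.
    Assumption: IDs and markers are tab-separated; within a marker the
    depth and the reference base count are separated by one space.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a population ID and an individual ID")
    pop_id, indiv_id = fields[0], fields[1]
    markers = []
    for index, field in enumerate(fields[2:], start=1):
        parts = field.split(" ")
        if len(parts) != 2:
            raise ValueError(f"marker {index}: expected 'depth refCount', got {field!r}")
        depth, ref_count = int(parts[0]), int(parts[1])
        # The reference base count can never exceed the total read depth.
        if not 0 <= ref_count <= depth:
            raise ValueError(f"marker {index}: ref count {ref_count} exceeds depth {depth}")
        markers.append((depth, ref_count))
    return pop_id, indiv_id, markers
```

A marker with depth 0 parses to (0, 0), matching the interpretation given above that no reads carry the reference base.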
Input pileup sites (--inSite)
The site file is analogous to a BED file and is used here to list the marker positions.
The preprocessing procedure is described here.
An example site file looks like below:
CHR POS ID REF ALT
1 752566 rs3094315 G A
1 768448 rs12562034 G A
1 1005806 rs3934834 C T
1 1018704 rs9442372 A G
1 1021415 rs3737728 A G
The site file has a header line and contains five columns: chromosome, position (1-based), ID (usually the marker name), REF (reference allele) and ALT (alternative allele).
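As an illustration of the site file layout, the following sketch reads the header and the marker rows into named records. The names Site and read_sites are hypothetical, and the code assumes whitespace-delimited columns in the order shown in the example; it is not taken from the Ancestry source.

```python
from collections import namedtuple

# One record per marker; field names are illustrative only.
Site = namedtuple("Site", ["chrom", "pos", "id", "ref", "alt"])

EXPECTED_HEADER = ["CHR", "POS", "ID", "REF", "ALT"]


def read_sites(lines):
    """Read a site file from an iterable of lines (for example an open file)."""
    rows = iter(lines)
    header = next(rows).split()
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    sites = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom, pos, marker_id, ref, alt = fields
        # Positions are 1-based, as stated in the description above.
        sites.append(Site(chrom, int(pos), marker_id, ref, alt))
    return sites
```

Because read_sites accepts any iterable of lines, it can be used with an open file handle or with a list of strings when testing.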
Input parameter (--inModel)
This parameter specifies the SNP gradients and offsets. It is the output of the program spa.
Output prefix (--out)
This parameter specifies the output prefix. The main results are stored in the PREFIX.loc file.
An example output file looks like below:
PopId IndvId Loc1 Loc2
MPaS3287 MPaS3287 13.5669 176.051
MPaS3287 MPaS3287.ConfInt95 6.43922,20.724 169.103,182.928
Note: when the --ci option is used, the output contains rows in which ".ConfInt95" is appended to the IndvId value. This suffix indicates that the row reports a 95% confidence interval, given as a lower and an upper bound for each coordinate.
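The output layout shown above can be loaded for downstream analysis with a short script. The sketch below is an assumption-laden illustration: the function name read_loc is hypothetical, columns are assumed to be whitespace-delimited, and confidence interval values are assumed to be written as "lower,upper" pairs as in the example. It covers only the layout shown here, not the output produced with --bootstrap.

```python
CI_SUFFIX = ".ConfInt95"


def read_loc(lines):
    """Read a PREFIX.loc file into {indiv_id: {"loc": [...], "ci95": [...]}}.

    "loc" holds one float per coordinate; "ci95" holds one (lower, upper)
    tuple per coordinate and is present only for rows written with --ci.
    """
    rows = iter(lines)
    next(rows)  # skip the header line (PopId IndvId Loc1 Loc2 ...)
    results = {}
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        indiv_id, values = fields[1], fields[2:]
        if indiv_id.endswith(CI_SUFFIX):
            base_id = indiv_id[: -len(CI_SUFFIX)]
            intervals = [tuple(float(x) for x in value.split(",")) for value in values]
            results.setdefault(base_id, {})["ci95"] = intervals
        else:
            results.setdefault(indiv_id, {})["loc"] = [float(x) for x in values]
    return results
```

Keying the result by the individual ID without the suffix keeps each point estimate next to its interval.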
Inference option (--bootstrap)
When this option is specified, the program infers ancestral locations using a bootstrap procedure. Essentially, it resamples the input sequence reads and recalculates the ancestral locations after each resampling. The output includes the ancestral locations from every resampling.
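To make the idea of resampling reads concrete, here is one possible scheme, sketched in Python. It is an assumption, not a description of what ancestrySeq does internally: at each marker, the observed reads are redrawn with replacement, so the depth stays fixed and the new reference base count follows a binomial distribution with success probability refCount/depth. The function name resample_markers is hypothetical.

```python
import random


def resample_markers(markers, rng):
    """Return one bootstrap replicate of a list of (depth, ref_count) pairs.

    Assumed scheme: reads are resampled with replacement within each marker,
    so depth is unchanged and each redrawn read is a reference read with
    probability ref_count / depth.
    """
    replicate = []
    for depth, ref_count in markers:
        if depth == 0:
            replicate.append((0, 0))  # nothing to resample
            continue
        p_ref = ref_count / depth
        new_ref = sum(1 for _ in range(depth) if rng.random() < p_ref)
        replicate.append((depth, new_ref))
    return replicate


# Usage: a seeded generator makes the replicates reproducible.
rng = random.Random(2024)
observed = [(0, 0), (5, 3), (4, 4), (6, 0)]
replicates = [resample_markers(observed, rng) for _ in range(3)]
```

Each replicate would then be passed through the same location inference as the original data, and the spread of the resulting locations indicates the uncertainty due to sampling of reads.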
Inference option (--ci)
When this option is specified, the program infers ancestral locations based on likelihood calculations. In a two-dimensional space, this option infers the top, bottom, left and right boundaries of an elliptical region such that the probability that the true ancestral location falls within this region is 95%.
Example
A basic command looks like:
ancestrySeq --inSeq 1108.amd.to.hgdp.seq --inSite HGDP_938.site --inModel spa.model.out --out test
The result file, test.loc, contains the inferred location of each sample listed in 1108.amd.to.hgdp.seq.
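When running many samples from a script, the command can be assembled programmatically. The sketch below only builds the argument list from the options documented on this page; the helper name build_command is hypothetical, and it assumes that --ci and --bootstrap are plain switches that take no value, which this page does not state explicitly.

```python
def build_command(in_seq, in_site, in_model, out_prefix,
                  ci=False, bootstrap=False, executable="ancestrySeq"):
    """Build the argument list for one ancestrySeq run.

    Assumption: --ci and --bootstrap are switches without arguments.
    """
    command = [
        executable,
        "--inSeq", in_seq,
        "--inSite", in_site,
        "--inModel", in_model,
        "--out", out_prefix,
    ]
    if ci:
        command.append("--ci")
    if bootstrap:
        command.append("--bootstrap")
    return command


# Reproduces the basic command shown above as a list of arguments.
basic = build_command("1108.amd.to.hgdp.seq", "HGDP_938.site",
                      "spa.model.out", "test")
# To execute it: import subprocess; subprocess.run(basic, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.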
Resources
LASER is a related project that started earlier than Ancestry. Both programs can perform ancestry inference, but Ancestry has computational advantages and does not sacrifice accuracy.
Contact
Comments on this wiki page or questions related to preparing input files for LASER can be sent to Xiaowei Zhan. Chaolong Wang contributed to this project, which was directed by Gonçalo Abecasis.