Ancestry

From Genome Analysis Wiki

Introduction

Ancestry infers individual ancestry from sequence reads in principal component (PC) space. It is suited to targeted/exome sequencing and whole-genome sequencing experiments. Ancestry is implemented in C++ for fast computation.

Download

To get a copy of the software, please contact: zhanxw@umich.edu.

For CSG users, the binary executable is located at: /net/fantasia/home/zhanxw/spa/cpp/executable/ancestrySeq

We plan to open source this program shortly.

Command Line Options

Input sequence data (--inSeq)

Sequence data (BAM files) need to be preprocessed into .seq format. This procedure is described here.

A seq file is generated from pileup files. It stores the sequencing information in a LASER-readable format. The first two columns are the population ID and the individual ID. The remaining columns alternate between total read depth and reference base count, one pair per marker. In the example below, columns 3 and 4 are 0 and 0, meaning that at the first marker the read depth is 0 and therefore no read carries the reference base. Markers must be separated by tabs, and the read depth and reference base count within a marker must be separated by a space. One line of a seq file looks like this:

NA12878.chrom22	NA12878.chrom22	0 0	0 0	0 0	0 0	0
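
To make the layout concrete, here is a minimal Python sketch that splits one seq line into its IDs and per-marker pairs. It is not part of Ancestry; the function name `parse_seq_line` and the sample line are made up for illustration, and the only format rules assumed are the ones stated above (tabs between markers, a space between depth and reference count).

```python
def parse_seq_line(line):
    """Split one .seq line into (pop_id, indiv_id, [(depth, ref_count), ...])."""
    fields = line.rstrip("\n").split("\t")
    pop_id, indiv_id = fields[0], fields[1]
    markers = []
    for field in fields[2:]:
        # Each marker field holds "depth ref_count" separated by a space.
        depth, ref_count = (int(value) for value in field.split())
        if ref_count > depth:
            raise ValueError("reference count exceeds read depth: %r" % field)
        markers.append((depth, ref_count))
    return pop_id, indiv_id, markers


# Hypothetical line: three markers with depths 0, 3 and 1.
example_line = "POP1\tSAMPLE1\t0 0\t3 2\t1 1"
print(parse_seq_line(example_line))
```

A field that does not contain exactly two integers raises a `ValueError`, which is a convenient way to spot truncated lines.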


Input pileup sites (--inSite)

The site file serves the same purpose as a BED file: it lists the marker positions.

The preprocessing procedure is described here.

An example site file looks like this:

CHR  POS      ID          REF  ALT
1    752566   rs3094315   G    A
1    768448   rs12562034  G    A
1    1005806  rs3934834   C    T
1    1018704  rs9442372   A    G
1    1021415  rs3737728   A    G

The site file has a header line and contains five columns: chromosome, position (1-based), ID (usually the marker name), REF (reference allele) and ALT (alternative allele).
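
As an illustration of the column layout, the following Python sketch reads a site file into a list of tuples. It is not part of Ancestry; `read_sites` is a hypothetical helper, and it assumes columns are separated by whitespace (spaces or tabs) as in the example above.

```python
import io


def read_sites(handle):
    """Read a site file and return a list of (chrom, pos, id, ref, alt) tuples."""
    header = handle.readline().split()
    if header != ["CHR", "POS", "ID", "REF", "ALT"]:
        raise ValueError("unexpected header: %r" % header)
    sites = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        chrom, pos, marker_id, ref, alt = line.split()
        # Positions are 1-based, so they are stored unchanged.
        sites.append((chrom, int(pos), marker_id, ref, alt))
    return sites


example = io.StringIO(
    "CHR  POS      ID          REF  ALT\n"
    "1    752566   rs3094315   G    A\n"
    "1    768448   rs12562034  G    A\n"
)
sites = read_sites(example)
print(len(sites), sites[0])
```

The chromosome is kept as a string so that names such as X or Y would not need special handling.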

Input parameter (--inModel)

This parameter specifies the model file containing the SNP gradients and offsets. It is the output of the program spa.


Output prefix (--out)

This parameter specifies the output prefix. The main results are stored in the file PREFIX.loc.

An example output file looks like this:

 PopId   IndvId  Loc1    Loc2
 MPaS3287        MPaS3287        13.5669 176.051
 MPaS3287        MPaS3287.ConfInt95      6.43922,20.724  169.103,182.928

Note: when the --ci option is used, ".ConfInt95" is appended to the value in the IndvId column to indicate that the row reports a 95% confidence interval.
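
For downstream analysis, a .loc file can be separated into point estimates and confidence intervals. The Python sketch below is not part of Ancestry; `parse_loc` is a hypothetical helper, and it assumes that columns are whitespace-separated and that each interval is written as "lower,upper", as in the example output above.

```python
SUFFIX = ".ConfInt95"


def parse_loc(text):
    """Return (points, intervals) dictionaries keyed by individual ID."""
    lines = [line for line in text.splitlines() if line.strip()]
    points, intervals = {}, {}
    for line in lines[1:]:  # lines[0] is the header row
        pop_id, indiv_id, *locs = line.split()
        if indiv_id.endswith(SUFFIX):
            name = indiv_id[: -len(SUFFIX)]
            # Each coordinate is a "lower,upper" pair.
            intervals[name] = [
                tuple(float(v) for v in loc.split(",")) for loc in locs
            ]
        else:
            points[indiv_id] = [float(v) for v in locs]
    return points, intervals


example = (
    " PopId   IndvId  Loc1    Loc2\n"
    " MPaS3287        MPaS3287        13.5669 176.051\n"
    " MPaS3287        MPaS3287.ConfInt95      6.43922,20.724  169.103,182.928\n"
)
points, intervals = parse_loc(example)
print(points)
print(intervals)
```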


Inference option (--bootstrap)

When this option is specified, the program infers ancestral locations using a bootstrap procedure: it resamples the input sequence reads and recalculates the ancestral location after each resampling. The output includes the ancestral locations from every resampling.
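
This page does not say exactly how the reads are resampled. One plausible reading, shown here only as a sketch and not as Ancestry's actual implementation, is to redraw the reads at each marker with replacement while keeping the read depth fixed. The function names and the input list are hypothetical.

```python
import random


def resample_marker(depth, ref_count, rng):
    """Draw `depth` reads with replacement from the reads seen at one marker."""
    if depth == 0:
        return (0, 0)
    # Each redrawn read is a reference read with probability ref_count / depth.
    new_ref = sum(1 for _ in range(depth) if rng.random() < ref_count / depth)
    return (depth, new_ref)


def bootstrap_replicates(markers, n_replicates, seed=0):
    """Return n_replicates resampled copies of a list of (depth, ref_count)."""
    rng = random.Random(seed)
    return [
        [resample_marker(depth, ref, rng) for depth, ref in markers]
        for _ in range(n_replicates)
    ]


# Hypothetical (depth, ref_count) pairs for four markers.
markers = [(0, 0), (4, 4), (6, 0), (10, 3)]
replicates = bootstrap_replicates(markers, n_replicates=5)
for replicate in replicates:
    print(replicate)
```

Under this assumption, markers with no reads, or with reads of only one allele, are unchanged in every replicate; only markers with mixed reads contribute variation. Each replicate would then be passed through the location inference to obtain one bootstrap location.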

Inference option (--ci)

When this option is specified, the program infers ancestral locations based on likelihood calculations. In a two-dimensional space, this option infers the top, bottom, left and right boundaries of an elliptical region that contains the true ancestral location with 95% probability.

Example

A basic command looks like:

 ancestrySeq --inSeq 1108.amd.to.hgdp.seq --inSite HGDP_938.site --inModel spa.model.out --out test

The result file, test.loc, includes the inferred location for each sample listed in 1108.amd.to.hgdp.seq.
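
When the program is run from a script, the command can be assembled programmatically. The Python sketch below only builds and prints the command line; `build_command` is a hypothetical helper, and it assumes that --ci and --bootstrap are bare switches that take no argument, which this page implies but does not state.

```python
import shlex


def build_command(in_seq, in_site, in_model, out_prefix,
                  ci=False, bootstrap=False):
    """Assemble an ancestrySeq command line as a list of arguments."""
    cmd = [
        "ancestrySeq",
        "--inSeq", in_seq,
        "--inSite", in_site,
        "--inModel", in_model,
        "--out", out_prefix,
    ]
    # Assumption: both inference options are flags without values.
    if ci:
        cmd.append("--ci")
    if bootstrap:
        cmd.append("--bootstrap")
    return cmd


cmd = build_command("1108.amd.to.hgdp.seq", "HGDP_938.site",
                    "spa.model.out", "test")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where the ancestrySeq executable is on the PATH.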

Resources

LASER is a related project that started earlier than Ancestry. Both programs can perform ancestry inference, but Ancestry has computational advantages and does not sacrifice accuracy.

Contact

Comments on this wiki page or questions related to preparing input files for LASER can be sent to Xiaowei Zhan. This project was helped by Chaolong Wang and was directed by Gonçalo Abecasis.