LASER

From Genome Analysis Wiki
Revision as of 10:58, 11 September 2012 by Zhanxw (talk | contribs)
Jump to: navigation, search

LASER Manual


LASER is xxxx. To use LASER, please read part1 and part2. Also example will guide you through a complete procedure of how to use laser to infer ancestry. If you have feedbacks, please see contact.

Part 1: Prepare data

Generate pileup data from BAM file

To prepare data for LASER, you will need BAM files, reference SNP list and region list (in BED format). We first use samtools to generate pileup file. Those files will summarize sequenced bases at reference SNP sites. We use samtools 0.1.18 (r982:295) as an example. A typical samtools command is listed below:

samtools mpileup -q 30 -Q 20 -f hs37d5.fa -l referenceSNP.bed A.bam > A.pileup

We recommend using '-q 30' to remove all mapped read with mapping quality less than 30, and using '-Q 20' to remove bases with base qualities less than 20. Here we use reference 'hs37d5.fa' which is NCBI build 37. Make sure your reference SNP sites also in NCBI build 37. samtools also requires referenceSNP.bed, which is in BED format and defines locations of reference SNPs. Please note BED format used 0-based index as starting position. If you have a SNP, rs3094315, located in chromosome 1 position 752566 (1-based), make sure your reference SNP file has this line:

1       752565  752566  rs3094315

After this step, you will have A.pileup file for sample A. Repeat the same pileup procedure for sample B, sample C, ....

Generate seq, cov files from pileup data

In this step, you need to use pileup2laser.py script. For CSG users, this file is located in /net/t2dgenes/zhanxw/amd/analyze/chaolong/pileup2laser.py

You will need target definition file, MAPA file and pileup data from previous step.

Target definition should be in BED format.

MAPA file is similar to MAP file used in PLINK, and use reference allele and alternative allele as two additional columns. Example:

1       rs3094315       0       752566  G       A

Example pileup2laser.py command:

python pileup2laser.py -b target.bed -i amd.idFile -m HGDP_938.mapa -o pca A.pileup B.pileup C.pileup

You will obtain pca.seq and pca.cov file. These two files are needed in following LASER analysis. pca.seq looks like below:

A        A        0       0       9       9       2 1     0 0     0 0
B        B        0       0       9       9       4 3     0 0     0 0
C        C        0       0       9       9       2 2     0 0     0 0

The first 6 columns follows normal PED file convention. The 7, 8 column listed the total coverage and number of reference alleles at the first marker; the 9, 10 column listed the total coverage and number of reference alleles at the second marker; and so on.


pca.map looks like below:

1       rs3094315       0       752566  G

It's content is the typical MAP in the first 4 columns and use reference allele as the 5th column.


It is sometimes more convenient to use alternative set of sample names in pca.seq files. So instead using A, B, C... from pileup file names, we can conveniently define a conversion table to map pileup file name to specified, meaningful, sample names. For example, you can create an idFile:

A EUR1
B EUR2
C AFR1

Then use

python pileup2laser.py -b target.bed -i idFile -m HGDP_938.mapa -o pca A.pileup B.pileup C.pileup

In the generated pca.seq file, you will have these outputs:

EUR1        EUR1        0       0       9       9       2 1     0 0     0 0
EUR2        EUR2        0       0       9       9       4 3     0 0     0 0
AFR1        AFR1        0       0       9       9       2 2     0 0     0 0

NOTE: you genotype should be forward strand, mapping quality filter and base quality filter should be appropriately set up.

(Optionally) Summarize Coverage

To evaluate the coverage on off-target reference markers, you can generate per-sample and per-marker coverage file using script /net/t2dgenes/zhanxw/amd/analyze/chaolong/calculateCoverage.py

python calculateCoverage.py -o pca.pca.seq pca.map

The output files are pca.cov.sample and pca.cov.marker:

pca.cov.sample file:

PersonId        length  min     q1      median  mean    q3      max     sd
ERU1        632958.000      0.000   0.000   0.000   0.027   0.000   47.000  0.257
EUR2        632958.000      0.000   0.000   0.000   0.031   0.000   69.000  0.264
AFR1        632958.000      0.000   0.000   0.000   0.026   0.000   35.000  0.244

pca.cov.marker file:

chrom   markerName      cm      pos     ref     length  min     q1      median  mean    q3      max     sd
1       rs3094315       0       752566  G       3159.000        0.000   0.000   0.000   0.797   1.000   8.000   1.202
1       rs12562034      0       768448  G       3159.000        0.000   0.000   0.000   0.000   0.000   0.000   0.000
1       rs3934834       0       1005806 C       3159.000        0.000   0.000   0.000   0.000   0.000   0.000   0.000

Those files help you evaluate off-target coverage.


Part 2: LASER

We plan to address (maybe three) tasks here: simulation, PCA, and Procrustes You will need to download software from Chaonglong...

How to interpret your sults

You results will be output as xxx. An example of outputs are like below

XXX

Example

We provide a toy example (downloadble from: xxx) The workflow of LASER analysis are explained step by step.

1. generate pileup

2. prepare laser input

3. run laser


Contact

Xiaowei Zhan zhanxw_AT_gmail.com (for data preparation)

Chaolong Wang aaron.wcl_AT_gmail.com (for using Laser and interpreting results)