Changes

LASER (view source)

Revision as of 10:58, 11 September 2012

4,417 bytes added , 10:58, 11 September 2012

no edit summary

Line 6: Line 6:

= Part 1: Prepare data =

−

~~Use~~ samtools pileup

+

== Generate pileup data from BAM file ==

+

To prepare data for LASER, you will need BAM files, reference SNP list and region list (in BED format).

+

We first use samtools to generate pileup file. Those files will summarize sequenced bases at reference SNP sites.

+

We use samtools 0.1.18 (r982:295) as an example. A typical samtools command is listed below:

−

~~Use~~ pileup2laser.py script

+

samtools mpileup -q 30 -Q 20 -f hs37d5.fa -l referenceSNP.bed A.bam > A.pileup

+

We recommend using '-q 30' to remove all mapped read with mapping quality less than 30, and using '-Q 20' to remove bases with base qualities less than 20.

+

Here we use reference 'hs37d5.fa' which is NCBI build 37. Make sure your reference SNP sites also in NCBI build 37.

+

samtools also requires referenceSNP.bed, which is in BED format and defines locations of reference SNPs.

+

Please note BED format used 0-based index as starting position. If you have a SNP, rs3094315, located in chromosome 1 position 752566 (1-based), make sure your reference SNP file has this line:

+

1 752565 752566 rs3094315

+

After this step, you will have A.pileup file for sample A. Repeat the same pileup procedure for sample B, sample C, ....

+

== Generate seq, cov files from pileup data ==

+

In this step, you need to use pileup2laser.py script. For CSG users, this file is located in /net/t2dgenes/zhanxw/amd/analyze/chaolong/pileup2laser.py

+

You will need target definition file, MAPA file and pileup data from previous step.

+

Target definition should be in BED format.

+

MAPA file is similar to MAP file used in PLINK, and use reference allele and alternative allele as two additional columns. Example:

+

1 rs3094315 0 752566 G A

+

Example pileup2laser.py command:

+

python pileup2laser.py -b target.bed -i amd.idFile -m HGDP_938.mapa -o pca A.pileup B.pileup C.pileup

+

You will obtain pca.seq and pca.cov file. These two files are needed in following LASER analysis.

+

pca.seq looks like below:

+

A A 0 0 9 9 2 1 0 0 0 0

+

B B 0 0 9 9 4 3 0 0 0 0

+

C C 0 0 9 9 2 2 0 0 0 0

+

The first 6 columns follows normal PED file convention.

+

The 7, 8 column listed the total coverage and number of reference alleles at the first marker;

+

the 9, 10 column listed the total coverage and number of reference alleles at the second marker; and so on.

+

pca.map looks like below:

+

1 rs3094315 0 752566 G

+

It's content is the typical MAP in the first 4 columns and use reference allele as the 5th column.

+

It is sometimes more convenient to use alternative set of sample names in pca.seq files. So instead using A, B, C... from pileup file names,

+

we can conveniently define a conversion table to map pileup file name to specified, meaningful, sample names. For example, you can create an idFile:

+

A EUR1

+

B EUR2

+

C AFR1

+

Then use

+

python pileup2laser.py -b target.bed -i idFile -m HGDP_938.mapa -o pca A.pileup B.pileup C.pileup

+

In the generated pca.seq file, you will have these outputs:

+

EUR1 EUR1 0 0 9 9 2 1 0 0 0 0

+

EUR2 EUR2 0 0 9 9 4 3 0 0 0 0

+

AFR1 AFR1 0 0 9 9 2 2 0 0 0 0

NOTE: you genotype should be forward strand, mapping quality filter and base quality filter should be appropriately set up.

−

~~Tip: You~~ can generate ~~summary of~~ coverage files ~~using scripts~~.

+

== (Optionally) Summarize Coverage ==

+

To evaluate the coverage on off-target reference markers, you can generate per-sample and per-marker coverage file using script /net/t2dgenes/zhanxw/amd/analyze/chaolong/calculateCoverage.py

+

python calculateCoverage.py -o pca.pca.seq pca.map

+

The output files are pca.cov.sample and pca.cov.marker:

+

pca.cov.sample file:

+

PersonId length min q1 median mean q3 max sd

+

ERU1 632958.000 0.000 0.000 0.000 0.027 0.000 47.000 0.257

+

EUR2 632958.000 0.000 0.000 0.000 0.031 0.000 69.000 0.264

+

AFR1 632958.000 0.000 0.000 0.000 0.026 0.000 35.000 0.244

+

pca.cov.marker file:

+

chrom markerName cm pos ref length min q1 median mean q3 max sd

+

1 rs3094315 0 752566 G 3159.000 0.000 0.000 0.000 0.797 1.000 8.000 1.202

+

1 rs12562034 0 768448 G 3159.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

+

1 rs3934834 0 1005806 C 3159.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000

+

Those files help you evaluate off-target coverage.

+

= Part 2: LASER =

Zhanxw

255

edits

Changes

LASER (view source)

Revision as of 10:58, 11 September 2012

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools