Changes

From Genome Analysis Wiki
Jump to navigationJump to search
4,417 bytes added ,  10:58, 11 September 2012
no edit summary
Line 6: Line 6:  
= Part 1: Prepare data =
 
= Part 1: Prepare data =
   −
Use samtools pileup
+
== Generate pileup data from BAM file ==
 +
To prepare data for LASER, you will need BAM files, reference SNP list and region list (in BED format).
 +
We first use samtools to generate pileup file. Those files will summarize sequenced bases at reference SNP sites.
 +
We use samtools 0.1.18 (r982:295) as an example. A typical samtools command is listed below:
   −
Use pileup2laser.py script
+
samtools mpileup -q 30 -Q 20 -f hs37d5.fa -l referenceSNP.bed A.bam > A.pileup
 +
 
 +
We recommend using '-q 30' to remove all mapped read with mapping quality less than 30, and using '-Q 20' to remove bases with base qualities less than 20.
 +
Here we use reference 'hs37d5.fa' which is NCBI build 37. Make sure your reference SNP sites also in NCBI build 37.
 +
samtools also requires referenceSNP.bed, which is in BED format and defines locations of reference SNPs.
 +
Please note BED format used 0-based index as starting position. If you have a SNP, rs3094315, located in chromosome 1 position 752566 (1-based), make sure your reference SNP file has this line:
 +
1      752565  752566  rs3094315
 +
 
 +
After this step, you will have A.pileup file for sample A. Repeat the same pileup procedure for sample B, sample C, ....
 +
 
 +
== Generate seq, cov files from pileup data ==
 +
 
 +
In this step, you need to use pileup2laser.py script. For CSG users, this file is located in /net/t2dgenes/zhanxw/amd/analyze/chaolong/pileup2laser.py
 +
 
 +
You will need target definition file, MAPA file and pileup data from previous step.
 +
 
 +
Target definition should be in BED format.
 +
 
 +
MAPA file is similar to MAP file used in PLINK, and use reference allele and alternative allele as two additional columns. Example:
 +
1      rs3094315      0      752566  G      A
 +
 
 +
Example pileup2laser.py command:
 +
python pileup2laser.py -b target.bed -i amd.idFile -m HGDP_938.mapa -o pca A.pileup B.pileup C.pileup
 +
 
 +
You will obtain pca.seq and pca.cov file. These two files are needed in following LASER analysis.
 +
pca.seq looks like below:
 +
A        A        0      0      9      9      2 1    0 0    0 0
 +
B        B        0      0      9      9      4 3    0 0    0 0
 +
C        C        0      0      9      9      2 2    0 0    0 0
 +
The first 6 columns follows normal PED file convention.
 +
The 7, 8 column listed the total coverage and number of reference alleles at the first marker;
 +
the 9, 10 column listed the total coverage and number of reference alleles at the second marker; and so on.
 +
 
 +
 
 +
pca.map looks like below:
 +
1      rs3094315      0      752566  G
 +
It's content is the typical MAP in the first 4 columns and use reference allele as the 5th column.
 +
 
 +
 
 +
It is sometimes more convenient to use alternative set of sample names in pca.seq files. So instead using A, B, C... from pileup file names,
 +
we can conveniently define a conversion table to map pileup file name to specified, meaningful, sample names. For example, you can create an idFile:
 +
A EUR1
 +
B EUR2
 +
C AFR1
 +
 
 +
Then use
 +
python pileup2laser.py -b target.bed -i idFile -m HGDP_938.mapa -o pca A.pileup B.pileup C.pileup
 +
 
 +
In the generated pca.seq file, you will have these outputs:
 +
EUR1        EUR1        0      0      9      9      2 1    0 0    0 0
 +
EUR2        EUR2        0      0      9      9      4 3    0 0    0 0
 +
AFR1        AFR1        0      0      9      9      2 2    0 0    0 0
    
NOTE: you genotype should be forward strand, mapping quality filter and base quality filter should be appropriately set up.
 
NOTE: you genotype should be forward strand, mapping quality filter and base quality filter should be appropriately set up.
   −
Tip: You can generate summary of coverage files using scripts.
+
== (Optionally) Summarize Coverage ==
 +
 
 +
To evaluate the coverage on off-target reference markers, you can generate per-sample and per-marker coverage file using script /net/t2dgenes/zhanxw/amd/analyze/chaolong/calculateCoverage.py
 +
python calculateCoverage.py -o pca.pca.seq pca.map
 +
 
 +
The output files are pca.cov.sample and pca.cov.marker:
 +
 
 +
pca.cov.sample file:
 +
 
 +
PersonId        length  min    q1      median  mean    q3      max    sd
 +
ERU1        632958.000      0.000  0.000  0.000  0.027  0.000  47.000  0.257
 +
EUR2        632958.000      0.000  0.000  0.000  0.031  0.000  69.000  0.264
 +
AFR1        632958.000      0.000  0.000  0.000  0.026  0.000  35.000  0.244
 +
 
 +
pca.cov.marker file:
 +
chrom  markerName      cm      pos    ref    length  min    q1      median  mean    q3      max    sd
 +
1      rs3094315      0      752566  G      3159.000        0.000  0.000  0.000  0.797  1.000  8.000  1.202
 +
1      rs12562034      0      768448  G      3159.000        0.000  0.000  0.000  0.000  0.000  0.000  0.000
 +
1      rs3934834      0      1005806 C      3159.000        0.000  0.000  0.000  0.000  0.000  0.000  0.000
 +
 
 +
Those files help you evaluate off-target coverage.
 +
 
    
= Part 2: LASER =
 
= Part 2: LASER =
255

edits

Navigation menu