Line 6: |
Line 6: |
| = Part 1: Prepare data = | | = Part 1: Prepare data = |
| | | |
− | Use samtools pileup
| + | == Generate pileup data from BAM file == |
| + | To prepare data for LASER, you will need BAM files, reference SNP list and region list (in BED format). |
| + | We first use samtools to generate pileup file. Those files will summarize sequenced bases at reference SNP sites. |
| + | We use samtools 0.1.18 (r982:295) as an example. A typical samtools command is listed below: |
| | | |
− | Use pileup2laser.py script
| + | samtools mpileup -q 30 -Q 20 -f hs37d5.fa -l referenceSNP.bed A.bam > A.pileup |
| + | |
| + | We recommend using '-q 30' to remove all mapped read with mapping quality less than 30, and using '-Q 20' to remove bases with base qualities less than 20. |
| + | Here we use reference 'hs37d5.fa' which is NCBI build 37. Make sure your reference SNP sites also in NCBI build 37. |
| + | samtools also requires referenceSNP.bed, which is in BED format and defines locations of reference SNPs. |
| + | Please note BED format used 0-based index as starting position. If you have a SNP, rs3094315, located in chromosome 1 position 752566 (1-based), make sure your reference SNP file has this line: |
| + | 1 752565 752566 rs3094315 |
| + | |
| + | After this step, you will have A.pileup file for sample A. Repeat the same pileup procedure for sample B, sample C, .... |
| + | |
| + | == Generate seq, cov files from pileup data == |
| + | |
| + | In this step, you need to use pileup2laser.py script. For CSG users, this file is located in /net/t2dgenes/zhanxw/amd/analyze/chaolong/pileup2laser.py |
| + | |
| + | You will need target definition file, MAPA file and pileup data from previous step. |
| + | |
| + | Target definition should be in BED format. |
| + | |
| + | MAPA file is similar to MAP file used in PLINK, and use reference allele and alternative allele as two additional columns. Example: |
| + | 1 rs3094315 0 752566 G A |
| + | |
| + | Example pileup2laser.py command: |
| + | python pileup2laser.py -b target.bed -i amd.idFile -m HGDP_938.mapa -o pca A.pileup B.pileup C.pileup |
| + | |
| + | You will obtain pca.seq and pca.cov file. These two files are needed in following LASER analysis. |
| + | pca.seq looks like below: |
| + | A A 0 0 9 9 2 1 0 0 0 0 |
| + | B B 0 0 9 9 4 3 0 0 0 0 |
| + | C C 0 0 9 9 2 2 0 0 0 0 |
| + | The first 6 columns follows normal PED file convention. |
| + | The 7, 8 column listed the total coverage and number of reference alleles at the first marker; |
| + | the 9, 10 column listed the total coverage and number of reference alleles at the second marker; and so on. |
| + | |
| + | |
| + | pca.map looks like below: |
| + | 1 rs3094315 0 752566 G |
| + | It's content is the typical MAP in the first 4 columns and use reference allele as the 5th column. |
| + | |
| + | |
| + | It is sometimes more convenient to use alternative set of sample names in pca.seq files. So instead using A, B, C... from pileup file names, |
| + | we can conveniently define a conversion table to map pileup file name to specified, meaningful, sample names. For example, you can create an idFile: |
| + | A EUR1 |
| + | B EUR2 |
| + | C AFR1 |
| + | |
| + | Then use |
| + | python pileup2laser.py -b target.bed -i idFile -m HGDP_938.mapa -o pca A.pileup B.pileup C.pileup |
| + | |
| + | In the generated pca.seq file, you will have these outputs: |
| + | EUR1 EUR1 0 0 9 9 2 1 0 0 0 0 |
| + | EUR2 EUR2 0 0 9 9 4 3 0 0 0 0 |
| + | AFR1 AFR1 0 0 9 9 2 2 0 0 0 0 |
| | | |
| NOTE: you genotype should be forward strand, mapping quality filter and base quality filter should be appropriately set up. | | NOTE: you genotype should be forward strand, mapping quality filter and base quality filter should be appropriately set up. |
| | | |
− | Tip: You can generate summary of coverage files using scripts.
| + | == (Optionally) Summarize Coverage == |
| + | |
| + | To evaluate the coverage on off-target reference markers, you can generate per-sample and per-marker coverage file using script /net/t2dgenes/zhanxw/amd/analyze/chaolong/calculateCoverage.py |
| + | python calculateCoverage.py -o pca.pca.seq pca.map |
| + | |
| + | The output files are pca.cov.sample and pca.cov.marker: |
| + | |
| + | pca.cov.sample file: |
| + | |
| + | PersonId length min q1 median mean q3 max sd |
| + | ERU1 632958.000 0.000 0.000 0.000 0.027 0.000 47.000 0.257 |
| + | EUR2 632958.000 0.000 0.000 0.000 0.031 0.000 69.000 0.264 |
| + | AFR1 632958.000 0.000 0.000 0.000 0.026 0.000 35.000 0.244 |
| + | |
| + | pca.cov.marker file: |
| + | chrom markerName cm pos ref length min q1 median mean q3 max sd |
| + | 1 rs3094315 0 752566 G 3159.000 0.000 0.000 0.000 0.797 1.000 8.000 1.202 |
| + | 1 rs12562034 0 768448 G 3159.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 |
| + | 1 rs3934834 0 1005806 C 3159.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 |
| + | |
| + | Those files help you evaluate off-target coverage. |
| + | |
| | | |
| = Part 2: LASER = | | = Part 2: LASER = |