Changes

From Genome Analysis Wiki
Jump to navigationJump to search
Created page with 'Michigan Pilot One SNP calling work flow: (1) Create GLF files from BAM files (2) Split GLF files by chromosome (3) Build a list of individuals within each population (4)…'
Michigan Pilot One SNP calling work flow:

(1) Create GLF files from BAM files

(2) Split GLF files by chromosome

(3) Build a list of individuals within each population

(4) Link files and tabulate # of files per population, per platform

(5) Check total depth for each population, each platform

(6) Filter sites with total depth at the extremes, within each population, each platform

(7) Obtain one merged GLF for each individual by merging GLFs across platforms for the same individual

(8) Promote a set of sites for each population

(9) Merge with genotype data

(10) Run thunder

(11) Ligate thunder results for larger chromosomes

(12) Extract QC+ sites

(13) Generate other information for VCF format

(14) Generate VCF

(15) Quality check

<br>

<br> <br>(1) Create GLF files from BAM files

/home5/2009.08.GLF-1/

update-glf.csh<br>sam2glf.csh

key command: samtools pileup<br>Source codes: Check with Paul for the latest version

output GLF format: bgzf

(file size) BAM (binary SAM) file: up to 43Gb (average 10s Gb) for each individual<br>(file size) GLF (binary) file: up to 19Gb (average 10s Gb) for each individual

<br>back

<br>(2) Split GLF files by chromosome

/home1/ylwtx/2009.08.GLF-split/

update-glf.csh<br>splitGLF.csh

key command: glfSplit<br>Source codes: ~goncalo/code/glfSplit/

Input GLF format: gz or bgzf<br>Output GLF format: gz

Tom suggested combing the first two steps using the following samtools command:<br>samtools -view -u *.bam 22 | samtools pileup –g - &gt; *.glf

back<br> <br>(3) Build a list of individuals within each population

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all

STEP 0 in s1-5.csh

Note: Check to make sure that all the individuals with GLF are included in the list “NA.number.by.popn”

<br>back

<br>(4) Link files and tabulate # of files per population, per platform

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all

STEP 1 in s1-5.csh

key command: ln -s

<br>back

<br>

<br> <br>(5) Check total (aggregated over all individual samples) depth for each population, each platform

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 2 in s1-5.csh

key command&nbsp;: DepthDistr<br>Source codes&nbsp;: /home/ylwtx/codes/cpp/DepthDistr/<br>Input&nbsp;: GLF files<br>Output&nbsp;: PMF and CDF of total depth distribution

Other: total depth for each site (for vcf) (run after filter)<br> IMPORTANT/total_depth_per_site.r3.csh<br> key command&nbsp;: DepthPerSite<br> Source codes&nbsp;: /home/ylwtx/codes/cpp/ DepthPerSite/<br> Input&nbsp;: GLF files<br> Output&nbsp;: total depth for each site

back

<br> <br>(6) Filter sites with total depth at the extremes, within each population, each platform

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 3 in s1-5.csh

key command&nbsp;: filterGLF or filterGLF_notBGZF<br>Source codes&nbsp;: /home/ylwtx/codes/cpp/filterGLF<br>Input&nbsp;: GLF files<br>Output&nbsp;: GLF files

Parameters:<br> minMapQ = 30

back<br> <br>(7) Obtain one merged GLF for each individual by merging GLFs across platforms for the same individual

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 4 in s1-5.csh

key command&nbsp;: glfMerge or glfMerge_noBGZF<br>Source codes&nbsp;: /home/ylwtx/codes/cpp/glfMerge<br> Originally ~goncalo/code/glfMerge<br>Input&nbsp;: GLF files for one individual<br>Output&nbsp;: GLF file

back<br> <br>(8) Promote a set of sites for each population

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 5 in s1-5.csh

key command&nbsp;: GPT_Freq<br>Source codes&nbsp;: /home/ylwtx/codes/cpp/glfmultipes_prethunder/GPT_Freq_old<br> Originally ~goncalo/code/glfMultiples

Parameters set:<br>(1) Posterior probability (for being a polymorphism) threshold: 0.999 (0.9 for genomewide but need test)<br>(2) minMapQ = 30<br>(3) –allhet default is ON<br>Input: GLF<br>Output: simplified GLF with three likelihoods

back<br> <br>(9) Merge with genotype data

(8.1) prepare genotype data in a unified format

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/genotypes_all_2

See: readme.procedure.txt*

key command&nbsp;: ped_prethunder_freq<br>Source codes&nbsp;: /home/ylwtx/codes/cpp/ped_prethunder

*special thanks to Wei Chen for preparing the genotype files

(8.2) merge genotype data with sequence data at promoted sites

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 6 in s1-5.csh

key command&nbsp;: merge_prethunder_freq<br>Source codes&nbsp;: /home/ylwtx/codes/cpp/merge_prethunder

Notes: <br> Sites with more than two alleles (not including REF_ALLELE) will be discarded

<br>back<br> <br>[[|]](10) Run thunder (hidden Markov model)

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 7 in s1-5.csh <br>Or prep_and_get_thunder_jobs.chr.*csh

key command&nbsp;: thunder_glf_freq<br>Source codes&nbsp;: /home/ylwtx/codes/cpp/mach-1.0.16

Notes:<br>(1) Cleaned monomorphic sites before feeding to thunder (no need, b/c thunder 005 handles AL1/-)<br>(2) All sites are bi-allelic with one of the alleles being the reference allele (sites with more than 2 alleles including the reference allele are discarded at the beginning of thunder run: initially because of a prior dependent on the reference allele. In the current setting, where Freq1 is used for the prior, we can choose to ignore the reference allele information.) <br>a. Codes changed on 2009-11-02<br>(3) Split:

<br>

Total 150 jobs (50 jobs for each population)

back<br> <br>(11) Ligate thunder results for larger chromosomes

<br>/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all

ligate.all. 2009-10-27.csh

<br>back

<br>(12) Extract QC+ sites

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

share.QC+hap. csh (moved to share/ but works under the parent dir, with and without QC)<br>share/share.csh (with and without QC)<br>share.QC+dose.csh (moved to share/ but works under the parent dir, with and without QC)<br>without QC: for VCF<br>share/calcLD.csh <br>share/runBatches.LD.csh<br>share/ batch.LD.pl<br>share/noSingleton.CEU.csh<br>share/noSingleton.CHB+JPT.csh<br>share/noSingleton.YRI.csh

(1) Check for individuals who are genotyped only:<br>For example, before 2009-12, in CHB+JPT, the following 2 individuals are sequenced:<br>NA18631<br>NA18634<br>(2) Check for trios whose sequencing information were not used: e.g., daughter and father of the trio

back

<br>

<br>

<br> <br>(13) Generate other information for VCF format (no longer needed, already generated)

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/<br>site.depth.csh (no longer needed, generated in GLF step)

<br>

back<br> <br>(14) Generate VCF

/home/ylwtx/1000Genomes/UoM_2009_12<br>cmd.csh<br> $pop.sh<br> vcf.py

<br>

<br>back

<br>(15) Quality check

~ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all/chr20/tout/CEU

(A) Accuracy of genotype calls<br>eval.Rsq.csh<br>or eval.r2_hat.csh<br>(B) Accuracy of haplotype calls<br>comparehaplotypes.csh

<br>

<br>back

<br><br>
212

edits

Navigation menu