Difference between revisions of "1000 Genomes Project Pilot 1 SNP Calling"

From Genome Analysis Wiki
Jump to: navigation, search
(Created page with 'Michigan Pilot One SNP calling work flow: (1) Create GLF files from BAM files (2) Split GLF files by chromosome (3) Build a list of individuals within each population (4)…')
 
((3) Build a list of individuals within each population)
 
(8 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
Michigan Pilot One SNP calling work flow:  
 
Michigan Pilot One SNP calling work flow:  
  
(1) Create GLF files from BAM files  
+
== (1) Create GLF files from BAM files ==
  
(2) Split GLF files by chromosome
+
/home5/2009.08.GLF-1/  
 
+
update-glf.csh<br>sam2glf.csh  
(3) Build a list of individuals within each population
 
 
 
(4) Link files and tabulate # of files per population, per platform
 
 
 
(5) Check total depth for each population, each platform
 
 
 
(6) Filter sites with total depth at the extremes, within each population, each platform
 
 
 
(7) Obtain one merged GLF for each individual by merging GLFs across platforms for the same individual
 
 
 
(8) Promote a set of sites for each population
 
 
 
(9) Merge with genotype data
 
 
 
(10) Run thunder
 
 
 
(11) Ligate thunder results for larger chromosomes
 
 
 
(12) Extract QC+ sites
 
 
 
(13) Generate other information for VCF format
 
 
 
(14) Generate VCF
 
 
 
(15) Quality check
 
 
 
<br>
 
 
 
<br> <br>(1) Create GLF files from BAM files
 
 
 
/home5/2009.08.GLF-1/  
 
 
 
update-glf.csh<br>sam2glf.csh  
 
  
 
key command: samtools pileup<br>Source codes: Check with Paul for the latest version  
 
key command: samtools pileup<br>Source codes: Check with Paul for the latest version  
Line 45: Line 12:
 
(file size) BAM (binary SAM) file: up to 43Gb (average 10s Gb) for each individual<br>(file size) GLF (binary) file: up to 19Gb (average 10s Gb) for each individual  
 
(file size) BAM (binary SAM) file: up to 43Gb (average 10s Gb) for each individual<br>(file size) GLF (binary) file: up to 19Gb (average 10s Gb) for each individual  
  
<br>back
+
== (2) Split GLF files by chromosome ==
 
 
<br>(2) Split GLF files by chromosome  
 
  
/home1/ylwtx/2009.08.GLF-split/  
+
/home1/ylwtx/2009.08.GLF-split/  
  
update-glf.csh<br>splitGLF.csh  
+
update-glf.csh<br>splitGLF.csh  
  
 
key command: glfSplit<br>Source codes: ~goncalo/code/glfSplit/  
 
key command: glfSplit<br>Source codes: ~goncalo/code/glfSplit/  
Line 57: Line 22:
 
Input GLF format: gz or bgzf<br>Output GLF format: gz  
 
Input GLF format: gz or bgzf<br>Output GLF format: gz  
  
Tom suggested combing the first two steps using the following samtools command:<br>samtools -view -u *.bam 22 | samtools pileup –g - &gt; *.glf  
+
Tom suggested combing the first two steps using the following samtools command:<br>samtools -view -u *.bam 22 | samtools pileup –g - &gt; *.glf
  
back<br> <br>(3) Build a list of individuals within each population  
+
== (3) Build a list of individuals within each population ==
  
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all  
+
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all  
  
 
STEP 0 in s1-5.csh  
 
STEP 0 in s1-5.csh  
  
Note: Check to make sure that all the individuals with GLF are included in the list “NA.number.by.popn”  
+
Note: Check to make sure that all the individuals with GLF are included in the list “NA.number.by.popn”
  
<br>back
+
== (4) Link files and tabulate # of files per population, per platform ==
 
 
<br>(4) Link files and tabulate # of files per population, per platform  
 
  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all  
Line 77: Line 40:
 
key command: ln -s  
 
key command: ln -s  
  
<br>back
+
== (5) Check total (aggregated over all individual samples) depth for each population, each platform ==
 
 
<br>
 
 
 
<br> <br>(5) Check total (aggregated over all individual samples) depth for each population, each platform  
 
  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
Line 91: Line 50:
 
Other: total depth for each site (for vcf) (run after filter)<br> IMPORTANT/total_depth_per_site.r3.csh<br> key command&nbsp;: DepthPerSite<br> Source codes&nbsp;: /home/ylwtx/codes/cpp/ DepthPerSite/<br> Input&nbsp;: GLF files<br> Output&nbsp;: total depth for each site  
 
Other: total depth for each site (for vcf) (run after filter)<br> IMPORTANT/total_depth_per_site.r3.csh<br> key command&nbsp;: DepthPerSite<br> Source codes&nbsp;: /home/ylwtx/codes/cpp/ DepthPerSite/<br> Input&nbsp;: GLF files<br> Output&nbsp;: total depth for each site  
  
back
+
== (6) Filter sites with total depth at the extremes, within each population, each platform ==
 
 
<br> <br>(6) Filter sites with total depth at the extremes, within each population, each platform  
 
  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
Line 103: Line 60:
 
Parameters:<br> minMapQ = 30  
 
Parameters:<br> minMapQ = 30  
  
back<br> <br>(7) Obtain one merged GLF for each individual by merging GLFs across platforms for the same individual  
+
== (7) Obtain one merged GLF for each individual by merging GLFs across platforms for the same individual ==
  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
Line 111: Line 68:
 
key command&nbsp;: glfMerge or glfMerge_noBGZF<br>Source codes&nbsp;: /home/ylwtx/codes/cpp/glfMerge<br> Originally ~goncalo/code/glfMerge<br>Input&nbsp;: GLF files for one individual<br>Output&nbsp;: GLF file  
 
key command&nbsp;: glfMerge or glfMerge_noBGZF<br>Source codes&nbsp;: /home/ylwtx/codes/cpp/glfMerge<br> Originally ~goncalo/code/glfMerge<br>Input&nbsp;: GLF files for one individual<br>Output&nbsp;: GLF file  
  
back<br> <br>(8) Promote a set of sites for each population  
+
== (8) Promote a set of sites for each population ==
  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
Line 121: Line 78:
 
Parameters set:<br>(1) Posterior probability (for being a polymorphism) threshold: 0.999 (0.9 for genomewide but need test)<br>(2) minMapQ = 30<br>(3) –allhet default is ON<br>Input: GLF<br>Output: simplified GLF with three likelihoods  
 
Parameters set:<br>(1) Posterior probability (for being a polymorphism) threshold: 0.999 (0.9 for genomewide but need test)<br>(2) minMapQ = 30<br>(3) –allhet default is ON<br>Input: GLF<br>Output: simplified GLF with three likelihoods  
  
back<br> <br>(9) Merge with genotype data  
+
== (9) Merge with genotype data ==
  
(8.1) prepare genotype data in a unified format  
+
=== (9.1) prepare genotype data in a unified format ===
  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/genotypes_all_2  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/genotypes_all_2  
Line 133: Line 90:
 
*special thanks to Wei Chen for preparing the genotype files
 
*special thanks to Wei Chen for preparing the genotype files
  
(8.2) merge genotype data with sequence data at promoted sites  
+
=== (9.2) merge genotype data with sequence data at promoted sites ===
  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
Line 143: Line 100:
 
Notes: <br> Sites with more than two alleles (not including REF_ALLELE) will be discarded  
 
Notes: <br> Sites with more than two alleles (not including REF_ALLELE) will be discarded  
  
<br>back<br> <br>[[|]](10) Run thunder (hidden Markov model)  
+
To-Do: <br> ** Include genotypes only for individuals that have sequence data
 +
 
 +
== (10) Run thunder (hidden Markov model) ==
  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
Line 153: Line 112:
 
Notes:<br>(1) Cleaned monomorphic sites before feeding to thunder (no need, b/c thunder 005 handles AL1/-)<br>(2) All sites are bi-allelic with one of the alleles being the reference allele (sites with more than 2 alleles including the reference allele are discarded at the beginning of thunder run: initially because of a prior dependent on the reference allele. In the current setting, where Freq1 is used for the prior, we can choose to ignore the reference allele information.) <br>a. Codes changed on 2009-11-02<br>(3) Split:  
 
Notes:<br>(1) Cleaned monomorphic sites before feeding to thunder (no need, b/c thunder 005 handles AL1/-)<br>(2) All sites are bi-allelic with one of the alleles being the reference allele (sites with more than 2 alleles including the reference allele are discarded at the beginning of thunder run: initially because of a prior dependent on the reference allele. In the current setting, where Freq1 is used for the prior, we can choose to ignore the reference allele information.) <br>a. Codes changed on 2009-11-02<br>(3) Split:  
  
<br>  
+
<br>
  
Total 150 jobs (50 jobs for each population)  
+
{| cellspacing="1" cellpadding="1" border="1" style="width: 634px; height: 251px;"
 +
|-
 +
| chromosome
 +
| #parts<br>
 +
| length per part in Mb (last segment)<br>
 +
| start<br>
 +
| end<br>
 +
|-
 +
| 1-2<br>
 +
| 4<br>
 +
| 70 (63-67)<br>
 +
| 0,60,120,180<br>
 +
| 70,130,190,243-247<br>
 +
|-
 +
| 3-4<br>
 +
| 4<br>
 +
| 60 (41-49)<br>
 +
| 0,50,100,150<br>
 +
| 60,110,160,191-200<br>
 +
|-
 +
| <br>
 +
| <br>
 +
| <br>
 +
| <br>
 +
| <br>
 +
|-
 +
| 5-6<br>
 +
| 3<br>
 +
| 70 (51-61)<br>
 +
| 0,60,120<br>
 +
| 70,130,171-181<br>
 +
|-
 +
| 7-8<br>
 +
| 3<br>
 +
| 60 (40-59)<br>
 +
| 0,50,100<br>
 +
| 60,110,140-159<br>
 +
|-
 +
| 9*<br>
 +
| 3<br>
 +
| 75, 45, 40<br>
 +
| 0,'''65''',100<br>
 +
| '''75''',110,140<br>
 +
|-
 +
| <br>
 +
| <br>
 +
| <br>
 +
| <br>
 +
| <br>
 +
|-
 +
| 10-12<br>
 +
| 2<br>
 +
| 70 (72-75)<br>
 +
| 0,60<br>
 +
| 70,132-135<br>
 +
|-
 +
| 13-15<br>
 +
| 2<br>
 +
| 60 (50-64)<br>
 +
| 0,50<br>
 +
| 60,100-114<br>
 +
|-
 +
| <br>
 +
| <br>
 +
| <br>
 +
| <br>
 +
| <br>
 +
|-
 +
| 16-22<br>
 +
| 1<br>
 +
| 47-89<br>
 +
| <br>
 +
| <br>
 +
|}
  
back<br> <br>(11) Ligate thunder results for larger chromosomes  
+
&nbsp;*GAP btw 47-65Mb
 +
 
 +
 
 +
 
 +
Total 150 jobs (50 jobs for each population)
 +
 
 +
== (11) Ligate thunder results for larger chromosomes ==
  
 
<br>/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all  
 
<br>/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all  
Line 163: Line 201:
 
ligate.all. 2009-10-27.csh  
 
ligate.all. 2009-10-27.csh  
  
<br>back
+
== (12) Extract QC+ sites ==
 
 
<br>(12) Extract QC+ sites  
 
  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/  
Line 173: Line 209:
 
(1) Check for individuals who are genotyped only:<br>For example, before 2009-12, in CHB+JPT, the following 2 individuals are sequenced:<br>NA18631<br>NA18634<br>(2) Check for trios whose sequencing information were not used: e.g., daughter and father of the trio  
 
(1) Check for individuals who are genotyped only:<br>For example, before 2009-12, in CHB+JPT, the following 2 individuals are sequenced:<br>NA18631<br>NA18634<br>(2) Check for trios whose sequencing information were not used: e.g., daughter and father of the trio  
  
back
+
== (13) Generate other information for VCF format (no longer needed, already generated) ==
 
 
<br>
 
 
 
<br>
 
 
 
<br> <br>(13) Generate other information for VCF format (no longer needed, already generated)  
 
  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/<br>site.depth.csh (no longer needed, generated in GLF step)  
 
/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/<br>site.depth.csh (no longer needed, generated in GLF step)  
  
<br>
+
== (14) Generate VCF ==
 
 
back<br> <br>(14) Generate VCF  
 
  
 
/home/ylwtx/1000Genomes/UoM_2009_12<br>cmd.csh<br> $pop.sh<br> vcf.py  
 
/home/ylwtx/1000Genomes/UoM_2009_12<br>cmd.csh<br> $pop.sh<br> vcf.py  
  
<br>
+
== (15) Quality check ==
 
 
<br>back
 
 
 
<br>(15) Quality check  
 
  
 
~ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all/chr20/tout/CEU  
 
~ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all/chr20/tout/CEU  
  
(A) Accuracy of genotype calls<br>eval.Rsq.csh<br>or eval.r2_hat.csh<br>(B) Accuracy of haplotype calls<br>comparehaplotypes.csh  
+
(A) Accuracy of genotype calls<br>eval.Rsq.csh<br>or eval.r2_hat.csh<br>(B) Accuracy of haplotype calls<br>comparehaplotypes.csh
 
 
<br>
 
 
 
<br>back
 
 
 
<br><br>
 

Latest revision as of 13:35, 10 March 2010

Michigan Pilot One SNP calling work flow:

(1) Create GLF files from BAM files

/home5/2009.08.GLF-1/ 
update-glf.csh
sam2glf.csh

key command: samtools pileup
Source codes: Check with Paul for the latest version

output GLF format: bgzf

(file size) BAM (binary SAM) file: up to 43Gb (average 10s Gb) for each individual
(file size) GLF (binary) file: up to 19Gb (average 10s Gb) for each individual

(2) Split GLF files by chromosome

/home1/ylwtx/2009.08.GLF-split/ 
update-glf.csh
splitGLF.csh

key command: glfSplit
Source codes: ~goncalo/code/glfSplit/

Input GLF format: gz or bgzf
Output GLF format: gz

Tom suggested combing the first two steps using the following samtools command:
samtools -view -u *.bam 22 | samtools pileup –g - > *.glf

(3) Build a list of individuals within each population

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all 

STEP 0 in s1-5.csh

Note: Check to make sure that all the individuals with GLF are included in the list “NA.number.by.popn”

(4) Link files and tabulate # of files per population, per platform

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all

STEP 1 in s1-5.csh

key command: ln -s

(5) Check total (aggregated over all individual samples) depth for each population, each platform

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 2 in s1-5.csh

key command : DepthDistr
Source codes : /home/ylwtx/codes/cpp/DepthDistr/
Input : GLF files
Output : PMF and CDF of total depth distribution

Other: total depth for each site (for vcf) (run after filter)
IMPORTANT/total_depth_per_site.r3.csh
key command : DepthPerSite
Source codes : /home/ylwtx/codes/cpp/ DepthPerSite/
Input : GLF files
Output : total depth for each site

(6) Filter sites with total depth at the extremes, within each population, each platform

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 3 in s1-5.csh

key command : filterGLF or filterGLF_notBGZF
Source codes : /home/ylwtx/codes/cpp/filterGLF
Input : GLF files
Output : GLF files

Parameters:
minMapQ = 30

(7) Obtain one merged GLF for each individual by merging GLFs across platforms for the same individual

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 4 in s1-5.csh

key command : glfMerge or glfMerge_noBGZF
Source codes : /home/ylwtx/codes/cpp/glfMerge
Originally ~goncalo/code/glfMerge
Input : GLF files for one individual
Output : GLF file

(8) Promote a set of sites for each population

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 5 in s1-5.csh

key command : GPT_Freq
Source codes : /home/ylwtx/codes/cpp/glfmultipes_prethunder/GPT_Freq_old
Originally ~goncalo/code/glfMultiples

Parameters set:
(1) Posterior probability (for being a polymorphism) threshold: 0.999 (0.9 for genomewide but need test)
(2) minMapQ = 30
(3) –allhet default is ON
Input: GLF
Output: simplified GLF with three likelihoods

(9) Merge with genotype data

(9.1) prepare genotype data in a unified format

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/genotypes_all_2

See: readme.procedure.txt*

key command : ped_prethunder_freq
Source codes : /home/ylwtx/codes/cpp/ped_prethunder

  • special thanks to Wei Chen for preparing the genotype files

(9.2) merge genotype data with sequence data at promoted sites

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 6 in s1-5.csh

key command : merge_prethunder_freq
Source codes : /home/ylwtx/codes/cpp/merge_prethunder

Notes:
Sites with more than two alleles (not including REF_ALLELE) will be discarded

To-Do:
** Include genotypes only for individuals that have sequence data

(10) Run thunder (hidden Markov model)

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

STEP 7 in s1-5.csh
Or prep_and_get_thunder_jobs.chr.*csh

key command : thunder_glf_freq
Source codes : /home/ylwtx/codes/cpp/mach-1.0.16

Notes:
(1) Cleaned monomorphic sites before feeding to thunder (no need, b/c thunder 005 handles AL1/-)
(2) All sites are bi-allelic with one of the alleles being the reference allele (sites with more than 2 alleles including the reference allele are discarded at the beginning of thunder run: initially because of a prior dependent on the reference allele. In the current setting, where Freq1 is used for the prior, we can choose to ignore the reference allele information.)
a. Codes changed on 2009-11-02
(3) Split:


chromosome #parts
length per part in Mb (last segment)
start
end
1-2
4
70 (63-67)
0,60,120,180
70,130,190,243-247
3-4
4
60 (41-49)
0,50,100,150
60,110,160,191-200





5-6
3
70 (51-61)
0,60,120
70,130,171-181
7-8
3
60 (40-59)
0,50,100
60,110,140-159
9*
3
75, 45, 40
0,65,100
75,110,140





10-12
2
70 (72-75)
0,60
70,132-135
13-15
2
60 (50-64)
0,50
60,100-114





16-22
1
47-89


 *GAP btw 47-65Mb


Total 150 jobs (50 jobs for each population)

(11) Ligate thunder results for larger chromosomes


/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all

ligate.all. 2009-10-27.csh

(12) Extract QC+ sites

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/

share.QC+hap. csh (moved to share/ but works under the parent dir, with and without QC)
share/share.csh (with and without QC)
share.QC+dose.csh (moved to share/ but works under the parent dir, with and without QC)
without QC: for VCF
share/calcLD.csh
share/runBatches.LD.csh
share/ batch.LD.pl
share/noSingleton.CEU.csh
share/noSingleton.CHB+JPT.csh
share/noSingleton.YRI.csh

(1) Check for individuals who are genotyped only:
For example, before 2009-12, in CHB+JPT, the following 2 individuals are sequenced:
NA18631
NA18634
(2) Check for trios whose sequencing information were not used: e.g., daughter and father of the trio

(13) Generate other information for VCF format (no longer needed, already generated)

/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/
site.depth.csh (no longer needed, generated in GLF step)

(14) Generate VCF

/home/ylwtx/1000Genomes/UoM_2009_12
cmd.csh
$pop.sh
vcf.py

(15) Quality check

~ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all/chr20/tout/CEU

(A) Accuracy of genotype calls
eval.Rsq.csh
or eval.r2_hat.csh
(B) Accuracy of haplotype calls
comparehaplotypes.csh