Line 44: |
Line 44: |
| (file size) BAM (binary SAM) file: up to 43Gb (average 10s Gb) for each individual<br>(file size) GLF (binary) file: up to 19Gb (average 10s Gb) for each individual | | (file size) BAM (binary SAM) file: up to 43Gb (average 10s Gb) for each individual<br>(file size) GLF (binary) file: up to 19Gb (average 10s Gb) for each individual |
| | | |
− | <br>back
| + | == (2) Split GLF files by chromosome == |
− | | |
− | <br>(2) Split GLF files by chromosome
| |
| | | |
| /home1/ylwtx/2009.08.GLF-split/ | | /home1/ylwtx/2009.08.GLF-split/ |
Line 58: |
Line 56: |
| Tom suggested combining the first two steps using the following samtools command:<br>samtools view -u *.bam 22 | samtools pileup -g - > *.glf | | Tom suggested combining the first two steps using the following samtools command:<br>samtools view -u *.bam 22 | samtools pileup -g - > *.glf |
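A minimal sketch of how the combined command could be generated per chromosome. It assumes the legacy samtools 0.1.x `pileup` subcommand is available; the function name `glf_command` and the file names are hypothetical, used only for illustration.

```python
import shlex


def glf_command(bam_path, chrom, glf_path):
    """Build a shell pipeline that extracts one chromosome from a BAM file
    and writes genotype likelihoods to a GLF file."""
    view = "samtools view -u {} {}".format(
        shlex.quote(bam_path), shlex.quote(str(chrom))
    )
    pileup = "samtools pileup -g - > {}".format(shlex.quote(glf_path))
    return view + " | " + pileup


# Hypothetical file names; the real pipeline uses its own naming scheme.
for chrom in (20, 21, 22):
    print(glf_command("sample.bam", chrom, "sample.chr{}.glf".format(chrom)))
```

The strings are only printed here; they could be written to a script and submitted as jobs.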
| | | |
− | back<br> <br>(3) Build a list of individuals within each population
| + | == (3) Build a list of individuals within each population == |
| | | |
| /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all | | /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all |
Line 66: |
Line 64: |
| Note: Check to make sure that all the individuals with GLF are included in the list “NA.number.by.popn” | | Note: Check to make sure that all the individuals with GLF are included in the list “NA.number.by.popn” |
| | | |
− | <br>back
| + | == (4) Link files and tabulate # of files per population, per platform == |
− | | |
− | <br>(4) Link files and tabulate # of files per population, per platform
| |
| | | |
| /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all | | /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all |
Line 76: |
Line 72: |
| key command: ln -s | | key command: ln -s |
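A rough sketch of the link-and-tabulate step, assuming each GLF file is already annotated with its population and platform. The function name `link_and_tabulate` and the tuple layout are assumptions for illustration, not the original script.

```python
import os
from collections import Counter


def link_and_tabulate(glf_files, link_dir):
    """Symlink each GLF file into link_dir and count files per
    (population, platform) pair.

    glf_files: iterable of (source_path, population, platform) tuples.
    """
    os.makedirs(link_dir, exist_ok=True)
    counts = Counter()
    for source, population, platform in glf_files:
        target = os.path.join(link_dir, os.path.basename(source))
        if not os.path.lexists(target):
            # Same effect as `ln -s source target`.
            os.symlink(os.path.abspath(source), target)
        counts[(population, platform)] += 1
    return counts
```

The returned counter can be printed as a small table to check that no population or platform is missing files.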
| | | |
− | <br>back
| + | == (5) Check total (aggregated over all individual samples) depth for each population, each platform == |
− | | |
− | <br>
| |
− | | |
− | <br> <br>(5) Check total (aggregated over all individual samples) depth for each population, each platform
| |
| | | |
| /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ | | /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ |
Line 90: |
Line 82: |
| Other: total depth for each site (for vcf) (run after filter)<br> IMPORTANT/total_depth_per_site.r3.csh<br> key command : DepthPerSite<br> Source codes : /home/ylwtx/codes/cpp/DepthPerSite/<br> Input : GLF files<br> Output : total depth for each site | | Other: total depth for each site (for vcf) (run after filter)<br> IMPORTANT/total_depth_per_site.r3.csh<br> key command : DepthPerSite<br> Source codes : /home/ylwtx/codes/cpp/DepthPerSite/<br> Input : GLF files<br> Output : total depth for each site |
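To illustrate what "total depth for each site" means, here is a small sketch that sums per-individual depths. It does not parse GLF files; it assumes the depths have already been read into dictionaries keyed by site, and the name `total_depth_per_site` is hypothetical rather than the DepthPerSite implementation.

```python
from collections import Counter


def total_depth_per_site(per_individual_depths):
    """Sum read depth over individuals.

    per_individual_depths: iterable of dicts mapping a site key, for
    example (chromosome, position), to that individual's depth there.
    """
    totals = Counter()
    for depths in per_individual_depths:
        for site, depth in depths.items():
            totals[site] += depth
    return dict(totals)
```

Sites missing from an individual simply contribute zero to the total.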
| | | |
− | back
| + | == (6) Filter sites with total depth at the extremes, within each population, each platform == |
− | | |
− | <br> <br>(6) Filter sites with total depth at the extremes, within each population, each platform
| |
| | | |
| /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ | | /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ |
Line 110: |
Line 100: |
| key command : glfMerge or glfMerge_noBGZF<br>Source codes : /home/ylwtx/codes/cpp/glfMerge<br> Originally ~goncalo/code/glfMerge<br>Input : GLF files for one individual<br>Output : GLF file | | key command : glfMerge or glfMerge_noBGZF<br>Source codes : /home/ylwtx/codes/cpp/glfMerge<br> Originally ~goncalo/code/glfMerge<br>Input : GLF files for one individual<br>Output : GLF file |
| | | |
− | back<br> <br>(8) Promote a set of sites for each population
| + | == (8) Promote a set of sites for each population == |
| | | |
| /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ | | /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ |
Line 120: |
Line 110: |
| Parameters set:<br>(1) Posterior probability (for being a polymorphism) threshold: 0.999 (0.9 for genomewide, but this needs testing)<br>(2) minMapQ = 30<br>(3) –allhet default is ON<br>Input: GLF<br>Output: simplified GLF with three likelihoods | | Parameters set:<br>(1) Posterior probability (for being a polymorphism) threshold: 0.999 (0.9 for genomewide, but this needs testing)<br>(2) minMapQ = 30<br>(3) –allhet default is ON<br>Input: GLF<br>Output: simplified GLF with three likelihoods |
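A minimal sketch of the thresholding idea behind site promotion, assuming the posterior probability of being polymorphic has already been computed for each site. The function name `promote_sites` is hypothetical, and whether the real program uses `>=` or `>` at the threshold is an assumption.

```python
def promote_sites(posteriors, threshold=0.999):
    """Return the sorted list of sites whose posterior probability of
    being polymorphic reaches the threshold.

    posteriors: dict mapping a site key to its posterior probability.
    """
    return [site for site, p in sorted(posteriors.items()) if p >= threshold]
```

Lowering `threshold` to 0.9, as the notes suggest for genome-wide runs, would promote more sites at the cost of more false positives.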
| | | |
− | back<br> <br>(9) Merge with genotype data
| + | == (9) Merge with genotype data == |
| | | |
− | (8.1) prepare genotype data in a unified format | + | === (9.1) prepare genotype data in a unified format === |
| | | |
| /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/genotypes_all_2 | | /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/genotypes_all_2 |
Line 132: |
Line 122: |
| *special thanks to Wei Chen for preparing the genotype files | | *special thanks to Wei Chen for preparing the genotype files |
| | | |
− | (8.2) merge genotype data with sequence data at promoted sites | + | === (9.2) merge genotype data with sequence data at promoted sites === |
| | | |
| /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ | | /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ |
Line 142: |
Line 132: |
| Notes: <br> Sites with more than two alleles (not including REF_ALLELE) will be discarded | | Notes: <br> Sites with more than two alleles (not including REF_ALLELE) will be discarded |
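The discard rule in the note can be sketched as follows. This reads the note literally, counting distinct alleles other than the reference allele; the function name `keep_site` and its arguments are hypothetical, and the actual merge code may count alleles differently.

```python
def keep_site(observed_alleles, ref_allele, max_alleles=2):
    """Return True if the site has at most max_alleles distinct alleles,
    not counting the reference allele."""
    non_ref = set(observed_alleles) - {ref_allele}
    return len(non_ref) <= max_alleles
```

For example, a site with reference A and observed alleles A, C, G is kept, while one with A, C, G, T is discarded.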
| | | |
− | <br>back<br> <br>[[|]](10) Run thunder (hidden Markov model)
| + | == (10) Run thunder (hidden Markov model) == |
| | | |
| /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ | | /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ |
Line 151: |
Line 141: |
| | | |
| Notes:<br>(1) Cleaned monomorphic sites before feeding to thunder (no need, b/c thunder 005 handles AL1/-)<br>(2) All sites are bi-allelic with one of the alleles being the reference allele (sites with more than 2 alleles including the reference allele are discarded at the beginning of the thunder run: initially because of a prior dependent on the reference allele. In the current setting, where Freq1 is used for the prior, we can choose to ignore the reference allele information.) <br>a. Code changed on 2009-11-02<br>(3) Split: | | Notes:<br>(1) Cleaned monomorphic sites before feeding to thunder (no need, b/c thunder 005 handles AL1/-)<br>(2) All sites are bi-allelic with one of the alleles being the reference allele (sites with more than 2 alleles including the reference allele are discarded at the beginning of the thunder run: initially because of a prior dependent on the reference allele. In the current setting, where Freq1 is used for the prior, we can choose to ignore the reference allele information.) <br>a. Code changed on 2009-11-02<br>(3) Split: |
− |
| |
− | <br>
| |
| | | |
| Total 150 jobs (50 jobs for each population) | | Total 150 jobs (50 jobs for each population) |
| | | |
− | back<br> <br>(11) Ligate thunder results for larger chromosomes
| + | == (11) Ligate thunder results for larger chromosomes == |
| | | |
| <br>/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all | | <br>/home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all |
Line 162: |
Line 150: |
| ligate.all. 2009-10-27.csh | | ligate.all. 2009-10-27.csh |
| | | |
− | <br>back
| + | == (12) Extract QC+ sites == |
− | | |
− | <br>(12) Extract QC+ sites
| |
| | | |
| /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ | | /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/ |
Line 172: |
Line 158: |
| (1) Check for individuals who are genotyped only:<br>For example, before 2009-12, in CHB+JPT, the following 2 individuals are sequenced:<br>NA18631<br>NA18634<br>(2) Check for trios whose sequencing information was not used: e.g., daughter and father of the trio | | (1) Check for individuals who are genotyped only:<br>For example, before 2009-12, in CHB+JPT, the following 2 individuals are sequenced:<br>NA18631<br>NA18634<br>(2) Check for trios whose sequencing information was not used: e.g., daughter and father of the trio |
| | | |
− | back
| + | == (13) Generate other information for VCF format (no longer needed, already generated) == |
− | | |
− | <br>
| |
− | | |
− | <br>
| |
− | | |
− | <br> <br>(13) Generate other information for VCF format (no longer needed, already generated)
| |
| | | |
| /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/<br>site.depth.csh (no longer needed, generated in GLF step) | | /home/ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.09.all/<br>site.depth.csh (no longer needed, generated in GLF step) |
| | | |
− | <br>
| + | == (14) Generate VCF == |
− | | |
− | back<br> <br>(14) Generate VCF
| |
| | | |
| /home/ylwtx/1000Genomes/UoM_2009_12<br>cmd.csh<br> $pop.sh<br> vcf.py | | /home/ylwtx/1000Genomes/UoM_2009_12<br>cmd.csh<br> $pop.sh<br> vcf.py |
| | | |
− | <br>
| + | == (15) Quality check == |
− | | |
− | <br>back
| |
− | | |
− | <br>(15) Quality check
| |
| | | |
| ~ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all/chr20/tout/CEU | | ~ylwtx/codes/cpp/mach-1.0.16/test_thunder/2009.11.all/chr20/tout/CEU |
| | | |
− | (A) Accuracy of genotype calls<br>eval.Rsq.csh<br>or eval.r2_hat.csh<br>(B) Accuracy of haplotype calls<br>comparehaplotypes.csh | + | (A) Accuracy of genotype calls<br>eval.Rsq.csh<br>or eval.r2_hat.csh<br>(B) Accuracy of haplotype calls<br>comparehaplotypes.csh |
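As an illustration of the genotype accuracy check, the sketch below computes a squared Pearson correlation between estimated allele dosages and experimental genotypes at one site. This is an assumption about what an Rsq-style evaluation measures; it is not taken from eval.Rsq.csh, and the function name `squared_correlation` is hypothetical.

```python
def squared_correlation(estimated, truth):
    """Squared Pearson correlation between estimated dosages and true
    genotypes (both coded as 0, 1, 2 copies of an allele).

    Returns NaN when either vector has no variation."""
    n = len(estimated)
    if n == 0 or n != len(truth):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_e = sum(estimated) / n
    mean_t = sum(truth) / n
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(estimated, truth))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_e == 0 or var_t == 0:
        return float("nan")
    return cov * cov / (var_e * var_t)
```

Values near 1 indicate that the called dosages track the experimental genotypes closely; monomorphic sites give NaN and are best reported separately.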
− | | |
− | <br>
| |
− | | |
− | <br>back
| |
− | | |
− | <br><br>
| |