Line 1: |
Line 1: |
| + | ==Introduction== |
| + | Main Workshop wiki page: [[SeqShop: December 2014]] |
| + | |
| + | See the [[Media:Dec2014 SeqShop - GotCloud indel.pdf|introductory slides]] for an intro to this tutorial. |
| + | |
| == Goals of This Session == | | == Goals of This Session == |
| * What we want to learn | | * What we want to learn |
Line 4: |
Line 9: |
| ** How to examine the variants at particular genomic positions | | ** How to examine the variants at particular genomic positions |
| ** How to evaluate the quality of INDEL calls | | ** How to evaluate the quality of INDEL calls |
− |
| |
− | [[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]]
| |
| | | |
| == Setup in person at the SeqShop Workshop == | | == Setup in person at the SeqShop Workshop == |
| ''This section is specifically for the SeqShop Workshop computers.'' | | ''This section is specifically for the SeqShop Workshop computers.'' |
− | <div class="mw-collapsible" style="width:600px"> | + | <div class="mw-collapsible mw-collapsed" style="width:600px"> |
| ''If you are not running during the SeqShop Workshop, please skip this section.'' | | ''If you are not running during the SeqShop Workshop, please skip this section.'' |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
Line 41: |
Line 44: |
| == Setup when running on your own outside of the SeqShop Workshop == | | == Setup when running on your own outside of the SeqShop Workshop == |
| ''This section is specifically for running on your own outside of the SeqShop Workshop.'' | | ''This section is specifically for running on your own outside of the SeqShop Workshop.'' |
− | <div class="mw-collapsible mw-collapsed" style="width:600px"> | + | <div class="mw-collapsible" style="width:600px"> |
| ''If you are running during the SeqShop Workshop, please skip this section.'' | | ''If you are running during the SeqShop Workshop, please skip this section.'' |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
Line 61: |
Line 64: |
| | | |
| == Running GotCloud Indel == | | == Running GotCloud Indel == |
− | ${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT} | + | ${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 6 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT} |
| * <code>${GC}/gotcloud</code> runs GotCloud | | * <code>${GC}/gotcloud</code> runs GotCloud |
| * <code>indel</code> tells GotCloud you want to run the indel calling pipeline. | | * <code>indel</code> tells GotCloud you want to run the indel calling pipeline. |
Line 73: |
Line 76: |
| ** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS} | | ** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS} |
| ** Alternatively, gotcloud.conf could be updated to specify the full paths | | ** Alternatively, gotcloud.conf could be updated to specify the full paths |
− | * <code>--out_dir</code> tells GotCloud where to write the output. | + | * <code>--outdir</code> tells GotCloud where to write the output. |
| ** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line | | ** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line |
| | | |
Line 82: |
Line 85: |
| </div> | | </div> |
| </div> | | </div> |
− | This should take about 4-5 minutes to run. | + | This should take about 2-4 minutes to run. |
− | * It should end with a line like: <code>Commands finished in 289 secs with no errors reported</code> | + | * It should end with a line like: <code>Commands finished in 125 secs with no errors reported</code> |
| | | |
| If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off. | | If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off. |
− |
| |
| | | |
| == Examining GotCloud indel Ouptut == | | == Examining GotCloud indel Ouptut == |
Line 216: |
Line 218: |
| The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name. | | The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name. |
| | | |
− | 22 36662041 . AATAATT A 756 PASS AC=2;AN=114;AF=0.0175439;GC=55,2,0;GN=57; | + | 22 36662041 . AATAATT A 756 PASS AC=2;AN=114;AF=0.0175439;GC=55,2,0;GN=57; |
− | GF=0.964912,0.0350877,0;NS=57;
| + | GF=0.964912,0.0350877,0;NS=57; |
− | HWEAF=0.019571;HWEGF=0.961242,0.038376,0.000383024;
| + | HWEAF=0.019571;HWEGF=0.961242,0.038376,0.000383024; |
− | MLEAF=0.0196187;MLEGF=0.960762,0.0392374,2.11537e-15;
| + | MLEAF=0.0196187;MLEGF=0.960762,0.0392374,2.11537e-15; |
− | HWE_LLR=-0.0222464;HWE_LPVAL=-0.182794;HWE_DF=1;
| + | HWE_LLR=-0.0222464;HWE_LPVAL=-0.182794;HWE_DF=1; |
− | FIC=-0.00372601;AB=0.384578
| + | FIC=-0.00372601;AB=0.384578 |
− | GT:PL:DP:AD:GQ 0/0:0,6,145:3:2,0,1:7 0/0:0,12,192:4:4,0,0:12
| + | GT:PL:DP:AD:GQ 0/0:0,6,145:3:2,0,1:7 0/0:0,12,192:4:4,0,0:12 |
| | | |
| Here is a description of the record's fields. | | Here is a description of the record's fields. |
| | | |
− | 22 : chromosome | + | 22 : chromosome |
− | 36662041 : genome position | + | 36662041 : genome position |
− | . : this is the ID field that is left blank. | + | . : this is the ID field that is left blank. |
− | AATAATT : the reference sequence that is replaced by the alternative sequence below. | + | AATAATT : the reference sequence that is replaced by the alternative sequence below. |
− | A : so this is basically a deletion of GT | + | A : so this is basically a deletion of ATAATT. |
− | 756 : QUAL field denoting validity of this variant, higher the better. | + | 756 : QUAL field denoting validity of this variant, higher the better. |
− | PASS : a passed variant. | + | PASS : a passed variant. |
− | INFO : fields containing information about the variant. | + | AC=2... : fields containing information about the variant. |
− | FORMAT : format field labels for the genotype columns. | + | GT:PL:DP:AD:GQ : format field labels for the genotype columns. |
− | 0/0:0,6,145:3:2,0,1:7 : genotype information. | + | 0/0:0,6,145:3:2,0,1:7 : genotype information. |
| | | |
| You can obtain the same output by using the following command | | You can obtain the same output by using the following command |
− | ${GC}/bin/vt view -i 22:36990878-36990879 ${OUT}/indel/final/all.genotypes.vcf.gz | + | ${GC}/bin/vt view -i 22:36662041-36662041 ${OUT}/indel/final/all.genotypes.vcf.gz |
| * -i specifies the region | | * -i specifies the region |
| * You can leave it out and look at all the records | | * You can leave it out and look at all the records |
Line 250: |
Line 252: |
| AF=0.017 : allele frequency based on AC/AN | | AF=0.017 : allele frequency based on AC/AN |
| GC=55,2,0 : genotype counts for 0/0, 0/1, 1/1 | | GC=55,2,0 : genotype counts for 0/0, 0/1, 1/1 |
− | GF=0.55,0.34,0.10 : genotype frequencies based on GC | + | GF=0.96,0.04,0 : genotype frequencies based on GC |
| NS=57 : no. of samples | | NS=57 : no. of samples |
− | HWEAF=0.28 : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium | + | HWEAF=0.020 : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium |
− | HWEGF=0.52,0.40,0.08 : genotype frequency derived from HWEAF | + | HWEGF=0.96,0.04,0.00 : genotype frequency derived from HWEAF |
− | HWE_LPVAL=-1.08 : log p value of HWE test | + | HWE_LPVAL=-0.18 : log p value of HWE test |
− | FIC=-0.07 : genotype likelihood based inbreeding coefficient | + | FIC=-0.003 : genotype likelihood based inbreeding coefficient, ranges -1 to 1. <0 denotes excess of heterozygotes and >0 means excess of homozygotes assuming HWE. |
− | AB=0.61 : genotype likelihood based allele balance | + | AB=0.38 : genotype likelihood based allele balance, ranges 0 to 1 with 0.5 for balance, >0.5 meaning reference bias and <0.5 denoting alternate allele bias. |
| | | |
| =====GENOTYPE field===== | | =====GENOTYPE field===== |
Line 263: |
Line 265: |
| | | |
| 0/0 : homozygous reference chosen based on PL | | 0/0 : homozygous reference chosen based on PL |
− | 0,9,158 : PHRED scaled genotype likelihoods | + | 0,6,145 : PHRED scaled genotype likelihoods |
| 3 : no. of reads covering this variant | | 3 : no. of reads covering this variant |
− | 3,0,0 : allele depth | + | 2,0,1 : allele depth |
| counts of reads supporting the reference allele, | | counts of reads supporting the reference allele, |
| the alternative allele and neither alleles respectively. | | the alternative allele and neither alleles respectively. |
Line 271: |
Line 273: |
| coverage of the read over the locus | | coverage of the read over the locus |
| or simply an allele that is not accounted for. | | or simply an allele that is not accounted for. |
− | 10 : genotype quality | + | 7 : genotype quality |
| | | |
| == INDEL Analysis == | | == INDEL Analysis == |
Line 281: |
Line 283: |
| First you want to know what is in the vcf file. | | First you want to know what is in the vcf file. |
| | | |
− | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz | + | ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz |
| | | |
| stats: no. of samples : 62 | | stats: no. of samples : 62 |
Line 295: |
Line 297: |
| We can count the number of variants with different filters with the following commands. | | We can count the number of variants with different filters with the following commands. |
| | | |
− | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS" | + | ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS" |
| | | |
| stats: no. of samples : 62 | | stats: no. of samples : 62 |
| no. of chromosomes : 1 <br> | | no. of chromosomes : 1 <br> |
− | no. Indels : 584 | + | no. Indels : 583 |
− | 2 alleles (ins/del) : 584 (0.69) [239/345] | + | 2 alleles (ins/del) : 583 (0.69) [239/344] |
| >=3 alleles (ins/del) : 0 (-nan) [0/0] | | >=3 alleles (ins/del) : 0 (-nan) [0/0] |
| | | |
− | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.overlap" | + | ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.overlap" |
| | | |
| stats: no. of samples : 62 | | stats: no. of samples : 62 |
| no. of chromosomes : 1 <br> | | no. of chromosomes : 1 <br> |
− | no. Indels : 136 | + | no. Indels : 137 |
− | 2 alleles (ins/del) : 136 (1.89) [89/47] #notice the difference in insertion deletion ratios | + | 2 alleles (ins/del) : 137 (1.85) [89/48] #notice the difference in insertion deletion ratios |
| >=3 alleles (ins/del) : 0 (-nan) [0/0] | | >=3 alleles (ins/del) : 0 (-nan) [0/0] |
| | | |
| #passed singletons only | | #passed singletons only |
− | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&INFO.AC==1" | + | ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&INFO.AC==1" |
| | | |
| #passed indels of length 1 only | | #passed indels of length 1 only |
− | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN==1" | + | ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN==1" |
| | | |
| #passed indels of length >4 | | #passed indels of length >4 |
− | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4" | + | ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4" |
| | | |
| #passed singletons of length 4 or insertions of length 3 | | #passed singletons of length 4 or insertions of length 3 |
− | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)" | + | ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)" |
| | | |
| === Comparison with other data sets === | | === Comparison with other data sets === |
| | | |
| It is usually useful to examine the call sets against known data sets for the passed variants. | | It is usually useful to examine the call sets against known data sets for the passed variants. |
− | | + | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
− | In order to do this, you need to update indel.reference.txt to point to the reference files.
| + | ''Command to use at SeqShop Workshop:'' |
| + | <div class="mw-collapsible-content"> |
| + | ${GC}/bin/vt profile_indels -g ${SS}/ref22/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS" |
| + | </div> |
| + | </div> |
| + | <div class="mw-collapsible" style="width:500px"> |
| + | ''Commands outside of SeqShop Workshop:'' |
| + | <div class="mw-collapsible-content"> |
| + | Outside the workshop, you need to update indel.reference.txt to point to the reference files. |
| cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt | | cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt |
| | | |
| Edit indel.reference.txt and specify the correct path to ${SS} | | Edit indel.reference.txt and specify the correct path to ${SS} |
| nedit ${OUT}/indel.reference.txt | | nedit ${OUT}/indel.reference.txt |
− | *'''Replace all occurrences of <code>username</code> with your username (or the correct path to your seqshop example directory).''' | + | *'''Replace the path to the reference files with the path to your seqshop example directory.''' |
| *:[[File:IndelRef.png]] | | *:[[File:IndelRef.png]] |
| | | |
− | ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS" | + | ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS" |
| + | </div> |
| + | </div> |
| | | |
| | | |
Line 355: |
Line 367: |
| Sensitivity 69.6% <br> | | Sensitivity 69.6% <br> |
| dbsnp #Indels from dbSNP | | dbsnp #Indels from dbSNP |
− | A-B 405 [0.68] | + | A-B 404 [0.68] |
− | A&B 208 [0.79] | + | A&B 209 [0.79] |
− | B-A 494 [2.03] | + | B-A 493 [2.04] |
− | Precision 33.9% | + | Precision 34.1% |
− | Sensitivity 29.6% | + | Sensitivity 29.8% |
| | | |
| Ins/Del ratios: Reference alignment based methods tend to be biased towards the detection of deletions. This provides a useful measure for discovery Indel sets to show the varying degree of biasness. It also appears that as coverage increases, the ins/del ratio tends to 1. | | Ins/Del ratios: Reference alignment based methods tend to be biased towards the detection of deletions. This provides a useful measure for discovery Indel sets to show the varying degree of biasness. It also appears that as coverage increases, the ins/del ratio tends to 1. |
Line 375: |
Line 387: |
| We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity. | | We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity. |
| | | |
− | ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS" | + | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
| + | ''Command to use at SeqShop Workshop:'' |
| + | <div class="mw-collapsible-content"> |
| + | ${GC}/bin/vt profile_indels -g ${SS}/ref22/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS" |
| + | </div> |
| + | </div> |
| + | <div class="mw-collapsible" style="width:500px"> |
| + | ''Command outside of SeqShop Workshop:'' |
| + | <div class="mw-collapsible-content"> |
| + | ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS" |
| + | </div> |
| + | </div> |
| + | |
| | | |
| data set | | data set |
Line 537: |
Line 561: |
| | | |
| UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO. | | UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO. |
| + | |
| + | |
| + | == Return to Workshop Wiki Page == |
| + | Return to main workshop wiki page: [[SeqShop: December 2014]] |