Line 1: |
Line 1: |
| + | ==Introduction== |
| + | Main Workshop wiki page: [[SeqShop: December 2014]] |
| + | |
| + | See the [[Media:Dec2014 SeqShop - GotCloud indel.pdf|introductory slides]] for an intro to this tutorial. |
| + | |
| == Goals of This Session == | | == Goals of This Session == |
| * What we want to learn | | * What we want to learn |
Line 4: |
Line 9: |
| ** How to examine the variants at particular genomic positions | | ** How to examine the variants at particular genomic positions |
| ** How to evaluate the quality of INDEL calls | | ** How to evaluate the quality of INDEL calls |
− |
| |
− | [[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]]
| |
| | | |
| == Setup in person at the SeqShop Workshop == | | == Setup in person at the SeqShop Workshop == |
| ''This section is specifically for the SeqShop Workshop computers.'' | | ''This section is specifically for the SeqShop Workshop computers.'' |
− | <div class="mw-collapsible" style="width:600px"> | + | <div class="mw-collapsible mw-collapsed" style="width:600px"> |
| ''If you are not running during the SeqShop Workshop, please skip this section.'' | | ''If you are not running during the SeqShop Workshop, please skip this section.'' |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
Line 41: |
Line 44: |
| == Setup when running on your own outside of the SeqShop Workshop == | | == Setup when running on your own outside of the SeqShop Workshop == |
| ''This section is specifically for running on your own outside of the SeqShop Workshop.'' | | ''This section is specifically for running on your own outside of the SeqShop Workshop.'' |
− | <div class="mw-collapsible mw-collapsed" style="width:600px"> | + | <div class="mw-collapsible" style="width:600px"> |
| ''If you are running during the SeqShop Workshop, please skip this section.'' | | ''If you are running during the SeqShop Workshop, please skip this section.'' |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
Line 61: |
Line 64: |
| | | |
| == Running GotCloud Indel == | | == Running GotCloud Indel == |
− | ${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT} | + | ${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 6 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT} |
| * <code>${GC}/gotcloud</code> runs GotCloud | | * <code>${GC}/gotcloud</code> runs GotCloud |
| * <code>indel</code> tells GotCloud you want to run the indel calling pipeline. | | * <code>indel</code> tells GotCloud you want to run the indel calling pipeline. |
Line 73: |
Line 76: |
| ** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS} | | ** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS} |
| ** Alternatively, gotcloud.conf could be updated to specify the full paths | | ** Alternatively, gotcloud.conf could be updated to specify the full paths |
− | * <code>--out_dir</code> tells GotCloud where to write the output. | + | * <code>--outdir</code> tells GotCloud where to write the output. |
| ** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line | | ** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line |
| | | |
Line 82: |
Line 85: |
| </div> | | </div> |
| </div> | | </div> |
− | This should take about 4-5 minutes to run. | + | This should take about 2-4 minutes to run. |
− | * It should end with a line like: <code>Commands finished in 289 secs with no errors reported</code> | + | * It should end with a line like: <code>Commands finished in 125 secs with no errors reported</code> |
| | | |
| If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off. | | If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off. |
− |
| |
| | | |
| == Examining GotCloud indel Ouptut == | | == Examining GotCloud indel Ouptut == |
Line 216: |
Line 218: |
| The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name. | | The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name. |
| | | |
− | 22 36662041 . AATAATT A 756 PASS AC=2;AN=114;AF=0.0175439;GC=55,2,0;GN=57; | + | 22 36662041 . AATAATT A 756 PASS AC=2;AN=114;AF=0.0175439;GC=55,2,0;GN=57; |
− | GF=0.964912,0.0350877,0;NS=57;
| + | GF=0.964912,0.0350877,0;NS=57; |
− | HWEAF=0.019571;HWEGF=0.961242,0.038376,0.000383024;
| + | HWEAF=0.019571;HWEGF=0.961242,0.038376,0.000383024; |
− | MLEAF=0.0196187;MLEGF=0.960762,0.0392374,2.11537e-15;
| + | MLEAF=0.0196187;MLEGF=0.960762,0.0392374,2.11537e-15; |
− | HWE_LLR=-0.0222464;HWE_LPVAL=-0.182794;HWE_DF=1;
| + | HWE_LLR=-0.0222464;HWE_LPVAL=-0.182794;HWE_DF=1; |
− | FIC=-0.00372601;AB=0.384578
| + | FIC=-0.00372601;AB=0.384578 |
− | GT:PL:DP:AD:GQ 0/0:0,6,145:3:2,0,1:7 0/0:0,12,192:4:4,0,0:12
| + | GT:PL:DP:AD:GQ 0/0:0,6,145:3:2,0,1:7 0/0:0,12,192:4:4,0,0:12 |
| | | |
| Here is a description of the record's fields. | | Here is a description of the record's fields. |
| | | |
− | 22 : chromosome | + | 22 : chromosome |
− | 36662041 : genome position | + | 36662041 : genome position |
− | . : this is the ID field that is left blank. | + | . : this is the ID field that is left blank. |
− | AATAATT : the reference sequence that is replaced by the alternative sequence below. | + | AATAATT : the reference sequence that is replaced by the alternative sequence below. |
− | A : so this is basically a deletion of GT | + | A : so this is basically a deletion of ATAATT. |
− | 756 : QUAL field denoting validity of this variant, higher the better. | + | 756 : QUAL field denoting validity of this variant, higher the better. |
− | PASS : a passed variant. | + | PASS : a passed variant. |
− | INFO : fields containing information about the variant. | + | AC=2... : fields containing information about the variant. |
− | FORMAT : format field labels for the genotype columns. | + | GT:PL:DP:AD:GQ : format field labels for the genotype columns. |
− | 0/0:0,6,145:3:2,0,1:7 : genotype information. | + | 0/0:0,6,145:3:2,0,1:7 : genotype information. |
| | | |
| You can obtain the same output by using the following command | | You can obtain the same output by using the following command |
Line 252: |
Line 254: |
| GF=0.96,0.04,0 : genotype frequencies based on GC | | GF=0.96,0.04,0 : genotype frequencies based on GC |
| NS=57 : no. of samples | | NS=57 : no. of samples |
− | HWEAF=0.020 : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium | + | HWEAF=0.020 : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium |
| HWEGF=0.96,0.04,0.00 : genotype frequency derived from HWEAF | | HWEGF=0.96,0.04,0.00 : genotype frequency derived from HWEAF |
| HWE_LPVAL=-0.18 : log p value of HWE test | | HWE_LPVAL=-0.18 : log p value of HWE test |
− | FIC=-0.003 : genotype likelihood based inbreeding coefficient | + | FIC=-0.003 : genotype likelihood based inbreeding coefficient, ranges -1 to 1. <0 denotes excess of heterozygotes and >0 means excess of homozygotes assuming HWE. |
− | AB=0.38 : genotype likelihood based allele balance | + | AB=0.38 : genotype likelihood based allele balance, ranges 0 to 1 with 0.5 for balance, >0.5 meaning reference bias and <0.5 denoting alternate allele bias. |
| | | |
| =====GENOTYPE field===== | | =====GENOTYPE field===== |
Line 326: |
Line 328: |
| | | |
| It is usually useful to examine the call sets against known data sets for the passed variants. | | It is usually useful to examine the call sets against known data sets for the passed variants. |
− |
| |
| <div class="mw-collapsible mw-collapsed" style="width:500px"> | | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
− | If you are running outside the workshop, you will need to update indel.reference.txt to point to the reference files.
| + | ''Command to use at SeqShop Workshop:'' |
| + | <div class="mw-collapsible-content"> |
| + | ${GC}/bin/vt profile_indels -g ${SS}/ref22/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS" |
| + | </div> |
| + | </div> |
| + | <div class="mw-collapsible" style="width:500px"> |
| + | ''Commands outside of SeqShop Workshop:'' |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
| + | Outside the workshop, you need to update indel.reference.txt to point to the reference files. |
| cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt | | cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt |
| | | |
Line 336: |
Line 344: |
| *'''Replace the path to the reference files with the path to your seqshop example directory.''' | | *'''Replace the path to the reference files with the path to your seqshop example directory.''' |
| *:[[File:IndelRef.png]] | | *:[[File:IndelRef.png]] |
| + | |
| + | ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS" |
| </div> | | </div> |
| </div> | | </div> |
| | | |
− | ${GC}/bin/vt profile_indels -g ${SS}/ref22/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
| |
− | * When running outside the workshop, use ${OUT}/indel.reference.txt instead of ${SS}/ref22/indel.reference.txt
| |
| | | |
| data set | | data set |
Line 359: |
Line 367: |
| Sensitivity 69.6% <br> | | Sensitivity 69.6% <br> |
| dbsnp #Indels from dbSNP | | dbsnp #Indels from dbSNP |
− | A-B 405 [0.68] | + | A-B 404 [0.68] |
− | A&B 208 [0.79] | + | A&B 209 [0.79] |
− | B-A 494 [2.03] | + | B-A 493 [2.04] |
− | Precision 33.9% | + | Precision 34.1% |
− | Sensitivity 29.6% | + | Sensitivity 29.8% |
| | | |
| Ins/Del ratios: Reference alignment based methods tend to be biased towards the detection of deletions. This provides a useful measure for discovery Indel sets to show the varying degree of biasness. It also appears that as coverage increases, the ins/del ratio tends to 1. | | Ins/Del ratios: Reference alignment based methods tend to be biased towards the detection of deletions. This provides a useful measure for discovery Indel sets to show the varying degree of biasness. It also appears that as coverage increases, the ins/del ratio tends to 1. |
Line 379: |
Line 387: |
| We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity. | | We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity. |
| | | |
− | ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS" | + | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
| + | ''Command to use at SeqShop Workshop:'' |
| + | <div class="mw-collapsible-content"> |
| + | ${GC}/bin/vt profile_indels -g ${SS}/ref22/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS" |
| + | </div> |
| + | </div> |
| + | <div class="mw-collapsible" style="width:500px"> |
| + | ''Command outside of SeqShop Workshop:'' |
| + | <div class="mw-collapsible-content"> |
| + | ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS" |
| + | </div> |
| + | </div> |
| + | |
| | | |
| data set | | data set |
Line 541: |
Line 561: |
| | | |
| UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO. | | UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO. |
| + | |
| + | |
| + | == Return to Workshop Wiki Page == |
| + | Return to main workshop wiki page: [[SeqShop: December 2014]] |