Changes

From Genome Analysis Wiki
Jump to navigationJump to search
Line 1: Line 1:  +
==Introduction==
 +
Main Workshop wiki page: [[SeqShop: December 2014]]
 +
 +
See the [[Media:Dec2014 SeqShop - GotCloud indel.pdf|introductory slides]] for an intro to this tutorial.
 +
 
== Goals of This Session ==
 
== Goals of This Session ==
 
* What we want to learn  
 
* What we want to learn  
Line 4: Line 9:  
** How to examine the variants at particular genomic positions
 
** How to examine the variants at particular genomic positions
 
** How to evaluate the quality of INDEL calls
 
** How to evaluate the quality of INDEL calls
  −
[[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]]
      
== Setup in person at the SeqShop Workshop ==
 
== Setup in person at the SeqShop Workshop ==
 
''This section is specifically for the SeqShop Workshop computers.''
 
''This section is specifically for the SeqShop Workshop computers.''
<div class="mw-collapsible" style="width:600px">
+
<div class="mw-collapsible mw-collapsed" style="width:600px">
 
''If you are not running during the SeqShop Workshop, please skip this section.''
 
''If you are not running during the SeqShop Workshop, please skip this section.''
 
<div class="mw-collapsible-content">
 
<div class="mw-collapsible-content">
Line 41: Line 44:  
== Setup when running on your own outside of the SeqShop Workshop ==
 
== Setup when running on your own outside of the SeqShop Workshop ==
 
''This section is specifically for running on your own outside of the SeqShop Workshop.''
 
''This section is specifically for running on your own outside of the SeqShop Workshop.''
<div class="mw-collapsible mw-collapsed" style="width:600px">
+
<div class="mw-collapsible" style="width:600px">
 
''If you are running during the SeqShop Workshop, please skip this section.''
 
''If you are running during the SeqShop Workshop, please skip this section.''
 
<div class="mw-collapsible-content">
 
<div class="mw-collapsible-content">
Line 61: Line 64:     
== Running GotCloud Indel ==
 
== Running GotCloud Indel ==
  ${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}
+
  ${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 6 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}
 
* <code>${GC}/gotcloud</code> runs GotCloud
 
* <code>${GC}/gotcloud</code> runs GotCloud
 
* <code>indel</code> tells GotCloud you want to run the indel calling pipeline.
 
* <code>indel</code> tells GotCloud you want to run the indel calling pipeline.
Line 73: Line 76:  
** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}
 
** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}
 
** Alternatively, gotcloud.conf could be updated to specify the full paths
 
** Alternatively, gotcloud.conf could be updated to specify the full paths
* <code>--out_dir</code> tells GotCloud where to write the output.
+
* <code>--outdir</code> tells GotCloud where to write the output.
 
** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line
 
** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line
   Line 82: Line 85:  
</div>
 
</div>
 
</div>
 
</div>
This should take about 4-5 minutes to run.
+
This should take about 2-4 minutes to run.
* It should end with a line like: <code>Commands finished in 289 secs with no errors reported</code>
+
* It should end with a line like: <code>Commands finished in 125 secs with no errors reported</code>
    
If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off.
 
If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off.
      
== Examining GotCloud indel Ouptut ==
 
== Examining GotCloud indel Ouptut ==
Line 216: Line 218:  
The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.
 
The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.
   −
  22 36662041 . AATAATT A 756 PASS AC=2;AN=114;AF=0.0175439;GC=55,2,0;GN=57;
+
  22 36662041   .   AATAATT   A 756     PASS     AC=2;AN=114;AF=0.0175439;GC=55,2,0;GN=57;
                          GF=0.964912,0.0350877,0;NS=57;
+
                        GF=0.964912,0.0350877,0;NS=57;
                          HWEAF=0.019571;HWEGF=0.961242,0.038376,0.000383024;
+
                        HWEAF=0.019571;HWEGF=0.961242,0.038376,0.000383024;
                          MLEAF=0.0196187;MLEGF=0.960762,0.0392374,2.11537e-15;
+
                        MLEAF=0.0196187;MLEGF=0.960762,0.0392374,2.11537e-15;
                          HWE_LLR=-0.0222464;HWE_LPVAL=-0.182794;HWE_DF=1;
+
                        HWE_LLR=-0.0222464;HWE_LPVAL=-0.182794;HWE_DF=1;
                          FIC=-0.00372601;AB=0.384578
+
                        FIC=-0.00372601;AB=0.384578
                      GT:PL:DP:AD:GQ 0/0:0,6,145:3:2,0,1:7 0/0:0,12,192:4:4,0,0:12
+
                        GT:PL:DP:AD:GQ 0/0:0,6,145:3:2,0,1:7 0/0:0,12,192:4:4,0,0:12
    
Here is a description of the record's fields.
 
Here is a description of the record's fields.
   −
   22             : chromosome
+
   22                   : chromosome
   36662041       : genome position
+
   36662041             : genome position
   .             : this is the ID field that is left blank.
+
   .                     : this is the ID field that is left blank.
   AATAATT       : the reference sequence that is replaced by the alternative sequence below.
+
   AATAATT               : the reference sequence that is replaced by the alternative sequence below.
   A             : so this is basically a deletion of GT
+
   A                     : so this is basically a deletion of ATAATT.
   756           : QUAL field denoting validity of this variant, higher the better.
+
   756                   : QUAL field denoting validity of this variant, higher the better.
   PASS           : a passed variant.
+
   PASS                 : a passed variant.
   INFO          : fields containing information about the variant.
+
   AC=2...              : fields containing information about the variant.
   FORMAT        : format field labels for the genotype columns.
+
   GT:PL:DP:AD:GQ        : format field labels for the genotype columns.
   0/0:0,6,145:3:2,0,1:7 : genotype information.
+
   0/0:0,6,145:3:2,0,1:7 : genotype information.
    
You can obtain the same output by using the following command
 
You can obtain the same output by using the following command
   ${GC}/bin/vt view -i 22:36990878-36990879 ${OUT}/indel/final/all.genotypes.vcf.gz
+
   ${GC}/bin/vt view -i 22:36662041-36662041 ${OUT}/indel/final/all.genotypes.vcf.gz
 
* -i specifies the region
 
* -i specifies the region
 
* You can leave it out and look at all the records
 
* You can leave it out and look at all the records
Line 250: Line 252:  
   AF=0.017            : allele frequency based on AC/AN
 
   AF=0.017            : allele frequency based on AC/AN
 
   GC=55,2,0            : genotype counts for 0/0, 0/1, 1/1
 
   GC=55,2,0            : genotype counts for 0/0, 0/1, 1/1
   GF=0.55,0.34,0.10    : genotype frequencies based on GC
+
   GF=0.96,0.04,0       : genotype frequencies based on GC
 
   NS=57                : no. of samples
 
   NS=57                : no. of samples
   HWEAF=0.28          : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium
+
   HWEAF=0.020          : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium
   HWEGF=0.52,0.40,0.08 : genotype frequency derived from HWEAF
+
   HWEGF=0.96,0.04,0.00 : genotype frequency derived from HWEAF
   HWE_LPVAL=-1.08     : log p value of HWE test
+
   HWE_LPVAL=-0.18     : log p value of HWE test
   FIC=-0.07            : genotype likelihood based inbreeding coefficient
+
   FIC=-0.003          : genotype likelihood based inbreeding coefficient, ranges -1 to 1.  <0 denotes excess of heterozygotes and >0 means excess of homozygotes assuming HWE.
   AB=0.61             : genotype likelihood based allele balance
+
   AB=0.38             : genotype likelihood based allele balance, ranges 0 to 1 with 0.5 for balance, >0.5 meaning reference bias and <0.5 denoting alternate allele bias.
    
=====GENOTYPE field=====
 
=====GENOTYPE field=====
Line 263: Line 265:     
   0/0      : homozygous reference chosen based on PL
 
   0/0      : homozygous reference chosen based on PL
   0,9,158 : PHRED scaled genotype likelihoods
+
   0,6,145 : PHRED scaled genotype likelihoods
 
   3        : no. of reads covering this variant
 
   3        : no. of reads covering this variant
   3,0,0   : allele depth
+
   2,0,1   : allele depth
 
             counts of reads supporting the reference allele,  
 
             counts of reads supporting the reference allele,  
 
             the alternative allele and neither alleles respectively.
 
             the alternative allele and neither alleles respectively.
Line 271: Line 273:  
             coverage of the read over the locus  
 
             coverage of the read over the locus  
 
             or simply an allele that is not accounted for.
 
             or simply an allele that is not accounted for.
   10      : genotype quality
+
   7        : genotype quality
    
== INDEL Analysis ==
 
== INDEL Analysis ==
Line 281: Line 283:  
First you want to know what is in the vcf file.
 
First you want to know what is in the vcf file.
   −
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz
+
   ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz
    
   stats: no. of samples                    :        62
 
   stats: no. of samples                    :        62
Line 295: Line 297:  
We can count the number of variants with different filters with the following commands.
 
We can count the number of variants with different filters with the following commands.
   −
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS"
+
   ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS"
    
   stats: no. of samples                    :        62
 
   stats: no. of samples                    :        62
 
         no. of chromosomes                :          1 <br>
 
         no. of chromosomes                :          1 <br>
         no. Indels                        :        584
+
         no. Indels                        :        583
             2 alleles (ins/del)            :            584 (0.69) [239/345]
+
             2 alleles (ins/del)            :            583 (0.69) [239/344]
 
             >=3 alleles (ins/del)          :              0 (-nan) [0/0]
 
             >=3 alleles (ins/del)          :              0 (-nan) [0/0]
   −
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.overlap"
+
   ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.overlap"
    
   stats: no. of samples                    :        62
 
   stats: no. of samples                    :        62
 
         no. of chromosomes                :          1 <br>
 
         no. of chromosomes                :          1 <br>
         no. Indels                        :        136
+
         no. Indels                        :        137
             2 alleles (ins/del)            :            136 (1.89) [89/47]  #notice the difference in insertion deletion ratios
+
             2 alleles (ins/del)            :            137 (1.85) [89/48]  #notice the difference in insertion deletion ratios
 
             >=3 alleles (ins/del)          :              0 (-nan) [0/0]
 
             >=3 alleles (ins/del)          :              0 (-nan) [0/0]
    
   #passed singletons only
 
   #passed singletons only
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&INFO.AC==1"
+
   ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&INFO.AC==1"
 
   
 
   
 
   #passed indels of length 1 only
 
   #passed indels of length 1 only
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN==1"
+
   ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN==1"
 
   
 
   
 
   #passed indels of length >4  
 
   #passed indels of length >4  
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"
+
   ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"
 
    
 
    
 
   #passed singletons of length 4 or insertions of length 3
 
   #passed singletons of length 4 or insertions of length 3
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"
+
   ${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"
    
=== Comparison with other data sets ===
 
=== Comparison with other data sets ===
    
It is usually useful to examine the call sets against known data sets for the passed variants.
 
It is usually useful to examine the call sets against known data sets for the passed variants.
 
+
<div class="mw-collapsible mw-collapsed" style="width:500px">
In order to do this, you need to update indel.reference.txt to point to the reference files.
+
''Command to use at SeqShop Workshop:''
 +
<div class="mw-collapsible-content">
 +
${GC}/bin/vt profile_indels -g ${SS}/ref22/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
 +
</div>
 +
</div>
 +
<div class="mw-collapsible" style="width:500px">
 +
''Commands outside of SeqShop Workshop:''
 +
<div class="mw-collapsible-content">
 +
Outside the workshop, you need to update indel.reference.txt to point to the reference files.
 
  cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt
 
  cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt
    
Edit indel.reference.txt and specify the correct path to ${SS}
 
Edit indel.reference.txt and specify the correct path to ${SS}
 
  nedit ${OUT}/indel.reference.txt
 
  nedit ${OUT}/indel.reference.txt
*'''Replace all occurrences of <code>username</code> with your username  (or the correct path to your seqshop example directory).'''
+
*'''Replace the path to the reference files with the path to your seqshop example directory.'''
 
*:[[File:IndelRef.png]]
 
*:[[File:IndelRef.png]]
   −
  ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
+
  ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
 +
</div>
 +
</div>
      Line 355: Line 367:  
     Sensitivity  69.6%  <br>
 
     Sensitivity  69.6%  <br>
 
   dbsnp  #Indels from dbSNP
 
   dbsnp  #Indels from dbSNP
     A-B        405 [0.68]
+
     A-B        404 [0.68]
     A&B        208 [0.79]
+
     A&B        209 [0.79]
     B-A        494 [2.03]
+
     B-A        493 [2.04]
     Precision    33.9%
+
     Precision    34.1%
     Sensitivity  29.6%
+
     Sensitivity  29.8%
    
Ins/Del ratios:  Reference alignment based methods tend to be biased towards the detection of deletions.  This provides a useful measure for discovery Indel sets to show the varying degree of biasness.  It also appears that as coverage increases, the ins/del ratio tends to 1.
 
Ins/Del ratios:  Reference alignment based methods tend to be biased towards the detection of deletions.  This provides a useful measure for discovery Indel sets to show the varying degree of biasness.  It also appears that as coverage increases, the ins/del ratio tends to 1.
Line 375: Line 387:  
We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity.
 
We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity.
   −
   ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f  "~PASS"
+
<div class="mw-collapsible mw-collapsed" style="width:500px">
 +
''Command to use at SeqShop Workshop:''
 +
<div class="mw-collapsible-content">
 +
  ${GC}/bin/vt profile_indels -g ${SS}/ref22/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f  "~PASS"
 +
</div>
 +
</div>
 +
<div class="mw-collapsible" style="width:500px">
 +
''Command outside of SeqShop Workshop:''
 +
<div class="mw-collapsible-content">
 +
   ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f  "~PASS"
 +
</div>
 +
</div>
 +
 
    
   data set
 
   data set
Line 537: Line 561:     
UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.
 
UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.
 +
 +
 +
== Return to Workshop Wiki Page ==
 +
Return to main workshop wiki page: [[SeqShop: December 2014]]

Navigation menu