Changes

From Genome Analysis Wiki
Jump to navigationJump to search
no edit summary
Line 1: Line 1:  +
'''Note:''' the latest version of this practical is available at: [[SeqShop: Variant Calling and Filtering for INDELs Practical]]
 +
* The ones here is the original one from the June workshop (updated to be run from elsewhere)
 +
 +
 
== Goals of This Session ==
 
== Goals of This Session ==
 
* What we want to learn  
 
* What we want to learn  
Line 5: Line 9:  
** How to evaluate the quality of INDEL calls
 
** How to evaluate the quality of INDEL calls
   −
[[Media:Variant Calling and Filtering for INDELs.pdf|Intro Slides]]
+
[[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]]
    
== Setup in person at the SeqShop Workshop ==
 
== Setup in person at the SeqShop Workshop ==
Line 44: Line 48:  
''If you are running during the SeqShop Workshop, please skip this section.''
 
''If you are running during the SeqShop Workshop, please skip this section.''
 
<div class="mw-collapsible-content">
 
<div class="mw-collapsible-content">
=== Download the example data ===
     −
=== Setup your run environment ===
+
This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014|Alignment Tutorial]]
   −
Environment variables will be used throughout the tutorial.
+
It also uses the bam.index file created in the SnpCall Tutorial.  If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#GotCloud_BAM_Index_File|GotCloud BAM Index File]]
   −
We recommend that you setup these variables so you won't have to modify every command in the tutorial.
  −
  −
<div class="mw-collapsible mw-collapsed" style="width:500px">
  −
I'm using bash (replace the paths below with the appropriate paths):
  −
<div class="mw-collapsible-content">
  −
* Point to where you installed GotCloud
  −
*:<pre>export GC=/home/username/gotcloud</pre>
  −
* Point to where you installed the seqshop files
  −
*:<pre>export SS=/home/username/seqshop/</pre>
  −
* Point to where you want the output to go
  −
*:<pre>export OUT=/home/username/seqshop_output/</pre>
  −
</div>
  −
</div>
  −
  −
<div class="mw-collapsible mw-collapsed" style="width:500px">
  −
I'm using tcsh (replace the paths below with the appropriate paths):
  −
<div class="mw-collapsible-content">
  −
* Point to where you installed GotCloud
  −
*:<pre>setenv GC /home/username/gotcloud</pre>
  −
* Point to where you installed the seqshop files
  −
*:<pre>setenv SS /home/username/seqshop/</pre>
  −
* Point to where you want the output to go
  −
*:<pre>setenv OUT /home/username/seqshop_output/</pre>
  −
</div>
  −
</div>
      +
{{SeqShopRemoteEnv}}
 
</div>
 
</div>
 
</div>
 
</div>
Line 83: Line 62:  
* BAMs->INDELs rather than BAMs->SNPs
 
* BAMs->INDELs rather than BAMs->SNPs
   −
If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]
+
If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]
    
== Running GotCloud Indel ==
 
== Running GotCloud Indel ==
Line 289: Line 268:  
The following section details some simple analyses we can perform.
 
The following section details some simple analyses we can perform.
   −
== Summary ==
+
===Summary===
    
First you want to know what is in the vcf file.
 
First you want to know what is in the vcf file.
Line 330: Line 309:  
   
 
   
 
   #passed indels of length >4  
 
   #passed indels of length >4  
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>1"
+
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"
 
    
 
    
 
   #passed singletons of length 4 or insertions of length 3
 
   #passed singletons of length 4 or insertions of length 3
 
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"
 
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"
   −
== Comparison with other data sets ==
+
=== Comparison with other data sets ===
    
It is usually useful to examine the call sets against known data sets for the passed variants.
 
It is usually useful to examine the call sets against known data sets for the passed variants.
Line 344: Line 323:  
Edit indel.reference.txt and specify the correct path to ${SS}
 
Edit indel.reference.txt and specify the correct path to ${SS}
 
  nedit ${OUT}/indel.reference.txt
 
  nedit ${OUT}/indel.reference.txt
 +
*'''Replace all occurrences of <code>username</code> with your username  (or the correct path to your seqshop example directory).'''
 +
*:[[File:IndelRef.png]]
    
  ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
 
  ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
Line 413: Line 394:  
This analysis supports filters too.
 
This analysis supports filters too.
   −
==Normalization==
+
===Normalization===
    
A slight digression here, when analyzing indels, it is important to normalize it.  While it is a simple concept,
 
A slight digression here, when analyzing indels, it is important to normalize it.  While it is a simple concept,
Line 499: Line 480:  
- about a loss of 11% of variant thought distinct.
 
- about a loss of 11% of variant thought distinct.
   −
To normalize and remove duplicate variants:
+
For chromosome 22, there are 121 variants, but only 106 distinct Indels after normalization
 +
- about a loss of 12% of variant thought distinct.
   −
  ${GC}/bin/vt normalize ${VTREF}/mills_indels_hg19.sites.bcf -r ${VTREF}/hs37d5.fa  | ${GC}/bin/vt mergedups - -o ${OUT}/mills.normalized.genotypes.bcf
+
To normalize and remove duplicate variants for chromosome 22:
   −
and you will observe that 3994 variants had to be left aligned and 1092 variants were removed.
+
  ${GC}/bin/vt normalize  ${SS}/ref22/mills_indels_hg19.22.sites.bcf -r ${SS}/ref22/human.g1k.v37.chr22.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.22.normalized.genotypes.bcf
   −
  stats: biallelic
+
and you will observe that 43 variants had to be left aligned and 15 variants were removed.
 +
 
 +
<pre>
 +
stats: biallelic
 
           no. left trimmed                      : 0
 
           no. left trimmed                      : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and right trimmed    : 0
 
           no. left trimmed and right trimmed    : 0
           no. left aligned                      : 3994
+
           no. left aligned                      : 43
           no. right trimmed                    : 0 <br>
+
           no. right trimmed                    : 0
        multiallelic
+
 
 +
      total no. biallelic normalized          : 43
 +
 
 +
      multiallelic
 
           no. left trimmed                      : 0
 
           no. left trimmed                      : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and right trimmed    : 0
 
           no. left trimmed and right trimmed    : 0
 
           no. left aligned                      : 0
 
           no. left aligned                      : 0
           no. right trimmed                    : 0 <br>
+
           no. right trimmed                    : 0
       no. variants observed                   : 9996 <br>
+
 
  <br>
+
      total no. multiallelic normalized        : 0
  stats: Total number of observed variants  9996
+
 
        Total number of unique variants    8904 <br>
+
      total no. variants normalized            : 43
 +
       total no. variants observed             : 121
 +
 
 +
 
 +
stats: Total number of observed variants  121
 +
      Total number of unique variants    106
 +
</pre>
   −
Let's look for a variant that was normalized.
+
Let's look at the last two variants that were normalized.
  ${GC}/bin/vt view ${OUT}/mills.normalized.genotypes.bcf | grep OLD_VARIANT |head -1
+
  ${GC}/bin/vt view ${OUT}/mills.22.normalized.genotypes.bcf | grep OLD_VARIANT |tail -2
    
Results:
 
Results:
*The position has changed - it was:
+
*The positions have changed - the were:
** 18293100 (as seen after OLD_VARIANT)
+
** 48831918 & 50012616 (as seen after OLD_VARIANT)
*Now it is
+
*Now they are:
** 18293097
+
** 48831915 & 50012614
 
[[File:indelNormalize.png|800px]]
 
[[File:indelNormalize.png|800px]]
       
UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.
 
UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.

Navigation menu