Changes

From Genome Analysis Wiki
Jump to navigationJump to search
no edit summary
Line 1: Line 1:  +
'''Note:''' the latest version of this practical is available at: [[SeqShop: Variant Calling and Filtering for INDELs Practical]]
 +
* The ones here is the original one from the June workshop (updated to be run from elsewhere)
 +
 +
 
== Goals of This Session ==
 
== Goals of This Session ==
 
* What we want to learn  
 
* What we want to learn  
Line 5: Line 9:  
** How to evaluate the quality of INDEL calls
 
** How to evaluate the quality of INDEL calls
    +
[[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]]
 +
 +
== Setup in person at the SeqShop Workshop ==
 +
''This section is specifically for the SeqShop Workshop computers.''
 +
<div class="mw-collapsible mw-collapsed" style="width:600px">
 +
''If you are not running during the SeqShop Workshop, please skip this section.''
 +
<div class="mw-collapsible-content">
    
{{SeqShopLogin}}
 
{{SeqShopLogin}}
   −
== Setup your run environment==
+
=== Setup your run environment===
 +
This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.
   −
This is the same setup you did for the previous tutorial, but you need to redo it each time you log in. This will setup some environment variables to point you to:
+
This will setup some environment variables to point you to
* GotCloud program
+
* [[GotCloud]] program
 
* Tutorial input files
 
* Tutorial input files
 
* Setup an output directory
 
* Setup an output directory
Line 18: Line 30:  
* You won't see any output after running <code>source</code>
 
* You won't see any output after running <code>source</code>
 
** It silently sets up your environment
 
** It silently sets up your environment
 +
** If you want to view the detail of the setup, type
 +
less /home/mktrost/seqshop/setup.txt
 +
and press 'q' to finish.
 +
 
<div class="mw-collapsible mw-collapsed" style="width:200px">
 
<div class="mw-collapsible mw-collapsed" style="width:200px">
 
View setup.txt
 
View setup.txt
Line 24: Line 40:  
</div>
 
</div>
 
</div>
 
</div>
 +
</div>
 +
</div>
 +
 +
== Setup when running on your own outside of the SeqShop Workshop ==
 +
''This section is specifically for running on your own outside of the SeqShop Workshop.''
 +
<div class="mw-collapsible" style="width:600px">
 +
''If you are running during the SeqShop Workshop, please skip this section.''
 +
<div class="mw-collapsible-content">
 +
 +
This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014|Alignment Tutorial]]
 +
 +
It also uses the bam.index file created in the SnpCall Tutorial.  If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#GotCloud_BAM_Index_File|GotCloud BAM Index File]]
 +
    +
{{SeqShopRemoteEnv}}
 +
</div>
 +
</div>
    
== Examining GotCloud Indel Input files ==
 
== Examining GotCloud Indel Input files ==
Line 30: Line 62:  
* BAMs->INDELs rather than BAMs->SNPs
 
* BAMs->INDELs rather than BAMs->SNPs
   −
If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]
+
If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]
    
== Running GotCloud Indel ==
 
== Running GotCloud Indel ==
  ${GC}/gotcloud indel --conf ${IN}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000
+
  ${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}
 +
* <code>${GC}/gotcloud</code> runs GotCloud
 +
* <code>indel</code> tells GotCloud you want to run the indel calling pipeline.
 +
* <code>--conf</code> tells GotCloud the name of the configuration file to use.
 +
** The configuration for this test was downloaded with the seqshop input files.
 
* --numjobs tells GotCloud how many jobs to run in parallel
 
* --numjobs tells GotCloud how many jobs to run in parallel
 
** Depends on your system
 
** Depends on your system
 
* --region 22:36000000-37000000
 
* --region 22:36000000-37000000
 
** The sample files are just a small region of chromosome 22, so to save time, we tell GotCloud to ignore the other regions
 
** The sample files are just a small region of chromosome 22, so to save time, we tell GotCloud to ignore the other regions
 +
* <code>--base_prefix</code> tells GotCloud the prefix to append to relative paths.
 +
** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}
 +
** Alternatively, gotcloud.conf could be updated to specify the full paths
 +
* <code>--out_dir</code> tells GotCloud where to write the output.
 +
** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line
    
<div class="mw-collapsible mw-collapsed" style="width:500px">
 
<div class="mw-collapsible mw-collapsed" style="width:500px">
Line 65: Line 106:  
Let's look at the <code>final</code> directory:
 
Let's look at the <code>final</code> directory:
 
  ls ${OUT}/final
 
  ls ${OUT}/final
Just a <code>chr22</code> directory, so look inside of there:
+
 
ls ${OUT}/vcfs/chr22
   
;Can you identify the final indel VCF?
 
;Can you identify the final indel VCF?
 
<div class="mw-collapsible mw-collapsed" style="width:350px">
 
<div class="mw-collapsible mw-collapsed" style="width:350px">
Line 96: Line 136:  
Let's see if we found the indel  
 
Let's see if we found the indel  
   −
  $GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041 | head -1  
+
  ${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041 | head -1  
    
Did you see a variant at the position?
 
Did you see a variant at the position?
Line 102: Line 142:  
Let's check the sequence data to confirm that the variant really exists
 
Let's check the sequence data to confirm that the variant really exists
   −
  $GC/bin/samtools tview $IN/bams/HG01101.recal.bam $REF/human.g1k.v37.chr22.fa
+
  ${GC}/bin/samtools tview ${SS}/bams/HG01101.recal.bam ${SS}/ref22/human.g1k.v37.chr22.fa
    
* Type 'g' to go to a specific position
 
* Type 'g' to go to a specific position
Line 124: Line 164:  
==== Header ====
 
==== Header ====
 
First, let's look at the header:
 
First, let's look at the header:
  $GC/bin/tabix -H $OUT/final/all.genotypes.vcf.gz
+
  ${GC}/bin/tabix -H ${OUT}/final/all.genotypes.vcf.gz
    
The header is as follows:
 
The header is as follows:
Line 163: Line 203:  
To view a specific region of records (such as APOL1 g2 allele)
 
To view a specific region of records (such as APOL1 g2 allele)
   −
  $GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041-36662041
+
  ${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041-36662041
    
The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.
 
The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.
Line 228: Line 268:  
The following section details some simple analyses we can perform.
 
The following section details some simple analyses we can perform.
   −
== Summary ==
+
===Summary===
    
First you want to know what is in the vcf file.
 
First you want to know what is in the vcf file.
Line 269: Line 309:  
   
 
   
 
   #passed indels of length >4  
 
   #passed indels of length >4  
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>1"
+
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"
 
    
 
    
 
   #passed singletons of length 4 or insertions of length 3
 
   #passed singletons of length 4 or insertions of length 3
 
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"
 
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"
   −
== Comparison with other data sets ==
+
=== Comparison with other data sets ===
    
It is usually useful to examine the call sets against known data sets for the passed variants.
 
It is usually useful to examine the call sets against known data sets for the passed variants.
   −
  ${GC}/bin/vt profile_indels -g ${VTREF}/indel.reference.txt  -r ${REF}/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
+
In order to do this, you need to update indel.reference.txt to point to the reference files.
 +
cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt
 +
 
 +
Edit indel.reference.txt and specify the correct path to ${SS}
 +
nedit ${OUT}/indel.reference.txt
 +
*'''Replace all occurrences of <code>username</code> with your username  (or the correct path to your seqshop example directory).'''
 +
*:[[File:IndelRef.png]]
 +
 
 +
  ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
      Line 318: Line 366:  
We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity.
 
We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity.
   −
   ${GC}/bin/vt profile_indels -g ${VTREF}/indel.reference.txt  -r ${REF}/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f  "~PASS"
+
   ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f  "~PASS"
    
   data set
 
   data set
Line 346: Line 394:  
This analysis supports filters too.
 
This analysis supports filters too.
   −
==Normalization==
+
===Normalization===
    
A slight digression here, when analyzing indels, it is important to normalize it.  While it is a simple concept,
 
A slight digression here, when analyzing indels, it is important to normalize it.  While it is a simple concept,
Line 432: Line 480:  
- about a loss of 11% of variant thought distinct.
 
- about a loss of 11% of variant thought distinct.
   −
To normalize and remove duplicate variants:
+
For chromosome 22, there are 121 variants, but only 106 distinct Indels after normalization
 +
- about a loss of 12% of variant thought distinct.
 +
 
 +
To normalize and remove duplicate variants for chromosome 22:
   −
   ${GC}/bin/vt normalize  ${VTREF}/mills_indels_hg19.sites.bcf -r ${VTREF}/hs37d5.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.normalized.genotypes.bcf  
+
   ${GC}/bin/vt normalize  ${SS}/ref22/mills_indels_hg19.22.sites.bcf -r ${SS}/ref22/human.g1k.v37.chr22.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.22.normalized.genotypes.bcf  
   −
and you will observe that 3994 variants had to be left aligned and 1092 variants were removed.
+
and you will observe that 43 variants had to be left aligned and 15 variants were removed.
   −
  stats: biallelic
+
<pre>
 +
stats: biallelic
 
           no. left trimmed                      : 0
 
           no. left trimmed                      : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and right trimmed    : 0
 
           no. left trimmed and right trimmed    : 0
           no. left aligned                      : 3994
+
           no. left aligned                      : 43
           no. right trimmed                    : 0 <br>
+
           no. right trimmed                    : 0
        multiallelic
+
 
 +
      total no. biallelic normalized          : 43
 +
 
 +
      multiallelic
 
           no. left trimmed                      : 0
 
           no. left trimmed                      : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and right trimmed    : 0
 
           no. left trimmed and right trimmed    : 0
 
           no. left aligned                      : 0
 
           no. left aligned                      : 0
           no. right trimmed                    : 0 <br>
+
           no. right trimmed                    : 0
       no. variants observed                   : 9996 <br>
+
 
  <br>
+
      total no. multiallelic normalized        : 0
  stats: Total number of observed variants  9996
+
 
        Total number of unique variants    8904 <br>
+
      total no. variants normalized            : 43
 +
       total no. variants observed             : 121
 +
 
 +
 
 +
stats: Total number of observed variants  121
 +
      Total number of unique variants    106
 +
</pre>
   −
Let's look for a variant that was normalized.
+
Let's look at the last two variants that were normalized.
  ${GC}/bin/vt view ${OUT}/mills.normalized.genotypes.bcf | grep OLD_VARIANT |head -1
+
  ${GC}/bin/vt view ${OUT}/mills.22.normalized.genotypes.bcf | grep OLD_VARIANT |tail -2
    
Results:
 
Results:
*The position has changed - it was:
+
*The positions have changed - the were:
** 18293100 (as seen after OLD_VARIANT)
+
** 48831918 & 50012616 (as seen after OLD_VARIANT)
*Now it is
+
*Now they are:
** 18293097
+
** 48831915 & 50012614
 
[[File:indelNormalize.png|800px]]
 
[[File:indelNormalize.png|800px]]
       
UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.
 
UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.

Navigation menu