Changes

From Genome Analysis Wiki
Jump to navigationJump to search
no edit summary
Line 1: Line 1:  +
'''Note:''' the latest version of this practical is available at: [[SeqShop: Variant Calling and Filtering for INDELs Practical]]
 +
* The ones here is the original one from the June workshop (updated to be run from elsewhere)
 +
 +
 
== Goals of This Session ==
 
== Goals of This Session ==
 
* What we want to learn  
 
* What we want to learn  
Line 5: Line 9:  
** How to evaluate the quality of INDEL calls
 
** How to evaluate the quality of INDEL calls
    +
[[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]]
 +
 +
== Setup in person at the SeqShop Workshop ==
 +
''This section is specifically for the SeqShop Workshop computers.''
 +
<div class="mw-collapsible mw-collapsed" style="width:600px">
 +
''If you are not running during the SeqShop Workshop, please skip this section.''
 +
<div class="mw-collapsible-content">
    
{{SeqShopLogin}}
 
{{SeqShopLogin}}
   −
== Setup your run environment==
+
=== Setup your run environment===
 +
This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.
   −
This is the same setup you did for the previous tutorial, but you need to redo it each time you log in. This will setup some environment variables to point you to:
+
This will setup some environment variables to point you to
* GotCloud program
+
* [[GotCloud]] program
 
* Tutorial input files
 
* Tutorial input files
 
* Setup an output directory
 
* Setup an output directory
Line 18: Line 30:  
* You won't see any output after running <code>source</code>
 
* You won't see any output after running <code>source</code>
 
** It silently sets up your environment
 
** It silently sets up your environment
 +
** If you want to view the detail of the setup, type
 +
less /home/mktrost/seqshop/setup.txt
 +
and press 'q' to finish.
 +
 
<div class="mw-collapsible mw-collapsed" style="width:200px">
 
<div class="mw-collapsible mw-collapsed" style="width:200px">
 
View setup.txt
 
View setup.txt
 
<div class="mw-collapsible-content">
 
<div class="mw-collapsible-content">
 
[[File:setup.png|500px]]
 
[[File:setup.png|500px]]
 +
</div>
 +
</div>
 
</div>
 
</div>
 
</div>
 
</div>
    +
== Setup when running on your own outside of the SeqShop Workshop ==
 +
''This section is specifically for running on your own outside of the SeqShop Workshop.''
 +
<div class="mw-collapsible" style="width:600px">
 +
''If you are running during the SeqShop Workshop, please skip this section.''
 +
<div class="mw-collapsible-content">
 +
 +
This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014|Alignment Tutorial]]
 +
 +
It also uses the bam.index file created in the SnpCall Tutorial.  If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#GotCloud_BAM_Index_File|GotCloud BAM Index File]]
 +
 +
 +
{{SeqShopRemoteEnv}}
 +
</div>
 +
</div>
    
== Examining GotCloud Indel Input files ==
 
== Examining GotCloud Indel Input files ==
Line 30: Line 62:  
* BAMs->INDELs rather than BAMs->SNPs
 
* BAMs->INDELs rather than BAMs->SNPs
   −
If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]
+
If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]
    
== Running GotCloud Indel ==
 
== Running GotCloud Indel ==
  ${GC}/gotcloud indel --conf ${IN}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000
+
  ${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}
 +
* <code>${GC}/gotcloud</code> runs GotCloud
 +
* <code>indel</code> tells GotCloud you want to run the indel calling pipeline.
 +
* <code>--conf</code> tells GotCloud the name of the configuration file to use.
 +
** The configuration for this test was downloaded with the seqshop input files.
 
* --numjobs tells GotCloud how many jobs to run in parallel
 
* --numjobs tells GotCloud how many jobs to run in parallel
 
** Depends on your system
 
** Depends on your system
 
* --region 22:36000000-37000000
 
* --region 22:36000000-37000000
 
** The sample files are just a small region of chromosome 22, so to save time, we tell GotCloud to ignore the other regions
 
** The sample files are just a small region of chromosome 22, so to save time, we tell GotCloud to ignore the other regions
 +
* <code>--base_prefix</code> tells GotCloud the prefix to append to relative paths.
 +
** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}
 +
** Alternatively, gotcloud.conf could be updated to specify the full paths
 +
* <code>--out_dir</code> tells GotCloud where to write the output.
 +
** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line
    
<div class="mw-collapsible mw-collapsed" style="width:500px">
 
<div class="mw-collapsible mw-collapsed" style="width:500px">
Line 65: Line 106:  
Let's look at the <code>final</code> directory:
 
Let's look at the <code>final</code> directory:
 
  ls ${OUT}/final
 
  ls ${OUT}/final
Just a <code>chr22</code> directory, so look inside of there:
+
 
ls ${OUT}/vcfs/chr22
   
;Can you identify the final indel VCF?
 
;Can you identify the final indel VCF?
 
<div class="mw-collapsible mw-collapsed" style="width:350px">
 
<div class="mw-collapsible mw-collapsed" style="width:350px">
Line 96: Line 136:  
Let's see if we found the indel  
 
Let's see if we found the indel  
   −
  $GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041 | head -1  
+
  ${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041 | head -1  
    
Did you see a variant at the position?
 
Did you see a variant at the position?
Line 102: Line 142:  
Let's check the sequence data to confirm that the variant really exists
 
Let's check the sequence data to confirm that the variant really exists
   −
  $GC/bin/samtools tview $IN/bams/HG01101.recal.bam $REF/human.g1k.v37.chr22.fa
+
  ${GC}/bin/samtools tview ${SS}/bams/HG01101.recal.bam ${SS}/ref22/human.g1k.v37.chr22.fa
    
* Type 'g' to go to a specific position
 
* Type 'g' to go to a specific position
Line 113: Line 153:  
View Screenshot
 
View Screenshot
 
<div class="mw-collapsible-content">
 
<div class="mw-collapsible-content">
[[File:Samtoolstviewindel.png|600px]]
+
[[File:Samtoolsviewindel.png|600px]]
 
</div>
 
</div>
 
</div>
 
</div>
Line 124: Line 164:  
==== Header ====
 
==== Header ====
 
First, let's look at the header:
 
First, let's look at the header:
  $GC/bin/tabix -H $OUT/final/all.genotypes.vcf.gz
+
  ${GC}/bin/tabix -H ${OUT}/final/all.genotypes.vcf.gz
    
The header is as follows:
 
The header is as follows:
Line 153: Line 193:  
   ##FILTER=<ID=PASS,Description="Temporary pass">
 
   ##FILTER=<ID=PASS,Description="Temporary pass">
 
   ##FILTER=<ID=overlap,Description="Overlapping variant">
 
   ##FILTER=<ID=overlap,Description="Overlapping variant">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00551 HG00553 HG00554 HG00637 HG00638 HG00640 HG00641 HG00734 HG00736 HG00737 HG00739 HG00740 HG01047 HG01049 HG01051 HG01052 HG01054 HG01055 HG01060 HG01061 HG01066 HG01067 HG01069 HG01070 HG01072 HG01073 HG01075 HG01079 HG01080 HG01082 HG01083 HG01094 HG01097 HG01098 HG01101 HG01102 HG01107 HG01108 HG01110 HG01111 HG01167 HG01168 HG01170 HG01171 HG01173 HG01174 HG01176 HG01177 HG01182 HG01183 HG01187 HG01188 HG01190 HG01191 HG01197 HG01198 HG01204 HG01205 HG01241 HG01242 HG01247 HG01248
+
  #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00551 HG00553 HG00554 HG00637 HG00638 HG00640 HG00641 HG00734 HG00736 HG00737 HG00739 HG00740 HG01047 HG01049 HG01051 HG01052 HG01054 HG01055 HG01060 HG01061 HG01066 HG01067 HG01069 HG01070 HG01072 HG01073 HG01075 HG01079 HG01080 HG01082 HG01083 HG01094 HG01097 HG01098 HG01101 HG01102 HG01107 HG01108 HG01110 HG01111 HG01167 HG01168 HG01170 HG01171 HG01173 HG01174 HG01176 HG01177 HG01182 HG01183 HG01187 HG01188 HG01190 HG01191 HG01197 HG01198 HG01204 HG01205 HG01241 HG01242 HG01247 HG01248
    
Using [[Vt]], we can see the same output
 
Using [[Vt]], we can see the same output
Line 163: Line 203:  
To view a specific region of records (such as APOL1 g2 allele)
 
To view a specific region of records (such as APOL1 g2 allele)
   −
  $GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041-36662041
+
  ${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041-36662041
    
The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.
 
The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.
Line 180: Line 220:  
   36662041      : genome position
 
   36662041      : genome position
 
   .              : this is the ID field that is left blank.
 
   .              : this is the ID field that is left blank.
   AATAATT           : the reference sequence that is replaced by the alternative sequence below.
+
   AATAATT       : the reference sequence that is replaced by the alternative sequence below.
 
   A              : so this is basically a deletion of GT
 
   A              : so this is basically a deletion of GT
 
   756            : QUAL field denoting validity of this variant, higher the better.
 
   756            : QUAL field denoting validity of this variant, higher the better.
Line 197: Line 237:  
The key information fields are as follows:
 
The key information fields are as follows:
   −
   AC=2               : alternate allele count
+
   AC=2                 : alternate allele count
 
   AN=114              : total number of alleles
 
   AN=114              : total number of alleles
   AF=0.0175439  : allele frequency based on AC/AN
+
   AF=0.017            : allele frequency based on AC/AN
   GC=55,2,0         : genotype counts for 0/0, 0/1, 1/1
+
   GC=55,2,0           : genotype counts for 0/0, 0/1, 1/1
 
   GF=0.55,0.34,0.10    : genotype frequencies based on GC
 
   GF=0.55,0.34,0.10    : genotype frequencies based on GC
 
   NS=57                : no. of samples
 
   NS=57                : no. of samples
 
   HWEAF=0.28          : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium
 
   HWEAF=0.28          : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium
 
   HWEGF=0.52,0.40,0.08 : genotype frequency derived from HWEAF
 
   HWEGF=0.52,0.40,0.08 : genotype frequency derived from HWEAF
   HWE_LPVAL=-0.182794    : log p value of HWE test
+
   HWE_LPVAL=-1.08      : log p value of HWE test
FIC=-0.00372601;AB=0.384578
   
   FIC=-0.07            : genotype likelihood based inbreeding coefficient
 
   FIC=-0.07            : genotype likelihood based inbreeding coefficient
 
   AB=0.61              : genotype likelihood based allele balance
 
   AB=0.61              : genotype likelihood based allele balance
Line 229: Line 268:  
The following section details some simple analyses we can perform.
 
The following section details some simple analyses we can perform.
   −
== Summary ==
+
===Summary===
    
First you want to know what is in the vcf file.
 
First you want to know what is in the vcf file.
Line 241: Line 280:  
             >=3 alleles (ins/del)          :              0 (-nan) [0/0] <br>
 
             >=3 alleles (ins/del)          :              0 (-nan) [0/0] <br>
 
         no. of observed variants          :        720
 
         no. of observed variants          :        720
* Our VCF only contains INDELs so there are lots of <code>0 (-nan) [0/0]</code> values in the output.  <code>vt</code> can be used to analyze VCFs with more variants than just INDELs.
+
* Our VCF only contains only biallelic INDELs so there are lots of <code>0 (-nan) [0/0]</code> values in the output.  <code>vt</code> can be used to analyze VCFs with more variants than just INDELs.
   −
The variants have filter labels. PASS meaning a temporary pass and overlap, meaning that the variants are overlapping with another variant, implying multiallelicity.
+
The variants have filter labels. PASS meaning a passed variant and overlap, meaning that the variants are overlapping with another variant, implying multiallelicity.
    
We can count the number of variants with different filters with the following commands.
 
We can count the number of variants with different filters with the following commands.
Line 270: Line 309:  
   
 
   
 
   #passed indels of length >4  
 
   #passed indels of length >4  
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>1"
+
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"
 
    
 
    
 
   #passed singletons of length 4 or insertions of length 3
 
   #passed singletons of length 4 or insertions of length 3
 
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"
 
   ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"
   −
== Comparison with other data sets ==
+
=== Comparison with other data sets ===
    
It is usually useful to examine the call sets against known data sets for the passed variants.
 
It is usually useful to examine the call sets against known data sets for the passed variants.
   −
  ${GC}/bin/vt profile_indels -g indel.reference.txt  -r hs37d5.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
+
In order to do this, you need to update indel.reference.txt to point to the reference files.
 +
cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt
 +
 
 +
Edit indel.reference.txt and specify the correct path to ${SS}
 +
nedit ${OUT}/indel.reference.txt
 +
*'''Replace all occurrences of <code>username</code> with your username  (or the correct path to your seqshop example directory).'''
 +
*:[[File:IndelRef.png]]
 +
 
 +
${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
 +
 
    
   data set
 
   data set
Line 318: Line 366:  
We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity.
 
We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity.
   −
   ${GC}/bin/vt profile_indels -g indel.reference.txt  -r hs37d5.fa all.genotypes.bcf -i 22:36000000-37000000 -f "~PASS"
+
   ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt  -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"
    
   data set
 
   data set
Line 346: Line 394:  
This analysis supports filters too.
 
This analysis supports filters too.
   −
==Normalization==
+
===Normalization===
    
A slight digression here, when analyzing indels, it is important to normalize it.  While it is a simple concept,
 
A slight digression here, when analyzing indels, it is important to normalize it.  While it is a simple concept,
Line 366: Line 414:  
|-
 
|-
 
! scope="col"| Dataset
 
! scope="col"| Dataset
! scope="col"| Freebayes
   
! scope="col"| Haplotyecaller
 
! scope="col"| Haplotyecaller
 
! scope="col"| PINDEL
 
! scope="col"| PINDEL
Line 377: Line 424:  
|-
 
|-
 
| Left trim
 
| Left trim
| 27069
   
| 1
 
| 1
 
| 0
 
| 0
Line 386: Line 432:  
|-
 
|-
 
| Left aligned
 
| Left aligned
| 3
   
| 1
 
| 1
 
| 1
 
| 1
Line 397: Line 442:  
|-
 
|-
 
| Left trim
 
| Left trim
| 40782
   
| 0
 
| 0
 
| 0
 
| 0
Line 406: Line 450:  
|-
 
|-
 
| Left aligned
 
| Left aligned
| 1892
   
| 0
 
| 0
 
| 0
 
| 0
Line 415: Line 458:  
|-
 
|-
 
| Right trimmed
 
| Right trimmed
| 0
   
| 0
 
| 0
 
| 0
 
| 0
Line 426: Line 468:  
|-
 
|-
 
| Duplicate variants
 
| Duplicate variants
| 0
   
| 1
 
| 1
 
| 155
 
| 155
Line 439: Line 480:  
- about a loss of 11% of variant thought distinct.
 
- about a loss of 11% of variant thought distinct.
   −
To normalize and remove duplicate variants:
+
For chromosome 22, there are 121 variants, but only 106 distinct Indels after normalization
 +
- about a loss of 12% of variant thought distinct.
 +
 
 +
To normalize and remove duplicate variants for chromosome 22:
   −
   ${GC}/bin/vt normalize  mills.genotypes.bcf -r hs37d5.fa | ${GC}/bin/vt mergedups - -o mills.normalized.genotypes.bcf  
+
   ${GC}/bin/vt normalize  ${SS}/ref22/mills_indels_hg19.22.sites.bcf -r ${SS}/ref22/human.g1k.v37.chr22.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.22.normalized.genotypes.bcf  
   −
and you will observe that 3994 variants had to be left aligned and 1092 variants were removed.
+
and you will observe that 43 variants had to be left aligned and 15 variants were removed.
   −
  stats: biallelic
+
<pre>
 +
stats: biallelic
 
           no. left trimmed                      : 0
 
           no. left trimmed                      : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and right trimmed    : 0
 
           no. left trimmed and right trimmed    : 0
           no. left aligned                      : 3994
+
           no. left aligned                      : 43
           no. right trimmed                    : 0 <br>
+
           no. right trimmed                    : 0
        multiallelic
+
 
 +
      total no. biallelic normalized          : 43
 +
 
 +
      multiallelic
 
           no. left trimmed                      : 0
 
           no. left trimmed                      : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and left aligned    : 0
 
           no. left trimmed and right trimmed    : 0
 
           no. left trimmed and right trimmed    : 0
 
           no. left aligned                      : 0
 
           no. left aligned                      : 0
           no. right trimmed                    : 0 <br>
+
           no. right trimmed                    : 0
       no. variants observed                   : 9996 <br>
+
 
  <br>
+
      total no. multiallelic normalized        : 0
  stats: Total number of observed variants  9996
+
 
        Total number of unique variants    8904 <br>
+
      total no. variants normalized            : 43
 +
       total no. variants observed             : 121
 +
 
 +
 
 +
stats: Total number of observed variants  121
 +
      Total number of unique variants    106
 +
</pre>
 +
 
 +
Let's look at the last two variants that were normalized.
 +
${GC}/bin/vt view ${OUT}/mills.22.normalized.genotypes.bcf | grep OLD_VARIANT |tail -2
 +
 
 +
Results:
 +
*The positions have changed - the were:
 +
** 48831918 & 50012616 (as seen after OLD_VARIANT)
 +
*Now they are:
 +
** 48831915 & 50012614
 +
[[File:indelNormalize.png|800px]]
       
UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.
 
UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.

Navigation menu