Line 1: |
Line 1: |
| + | '''Note:''' the latest version of this practical is available at: [[SeqShop: Variant Calling and Filtering for INDELs Practical]] |
| + | * The ones here is the original one from the June workshop (updated to be run from elsewhere) |
| + | |
| + | |
| == Goals of This Session == | | == Goals of This Session == |
| * What we want to learn | | * What we want to learn |
Line 5: |
Line 9: |
| ** How to evaluate the quality of INDEL calls | | ** How to evaluate the quality of INDEL calls |
| | | |
− | [[Media:Variant Calling and Filtering for INDELs.pdf|Intro Slides]] | + | [[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]] |
| | | |
| == Setup in person at the SeqShop Workshop == | | == Setup in person at the SeqShop Workshop == |
Line 44: |
Line 48: |
| ''If you are running during the SeqShop Workshop, please skip this section.'' | | ''If you are running during the SeqShop Workshop, please skip this section.'' |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
− | === Download the example data ===
| |
| | | |
− | === Setup your run environment ===
| + | This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014|Alignment Tutorial]] |
| | | |
− | Environment variables will be used throughout the tutorial.
| + | It also uses the bam.index file created in the SnpCall Tutorial. If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#GotCloud_BAM_Index_File|GotCloud BAM Index File]] |
| | | |
− | We recommend that you setup these variables so you won't have to modify every command in the tutorial.
| |
− |
| |
− | <div class="mw-collapsible mw-collapsed" style="width:500px">
| |
− | I'm using bash (replace the paths below with the appropriate paths):
| |
− | <div class="mw-collapsible-content">
| |
− | * Point to where you installed GotCloud
| |
− | *:<pre>export GC=/home/username/gotcloud</pre>
| |
− | * Point to where you installed the seqshop files
| |
− | *:<pre>export SS=/home/username/seqshop/</pre>
| |
− | * Point to where you want the output to go
| |
− | *:<pre>export OUT=/home/username/seqshop_output/</pre>
| |
− | </div>
| |
− | </div>
| |
− |
| |
− | <div class="mw-collapsible mw-collapsed" style="width:500px">
| |
− | I'm using tcsh (replace the paths below with the appropriate paths):
| |
− | <div class="mw-collapsible-content">
| |
− | * Point to where you installed GotCloud
| |
− | *:<pre>setenv GC /home/username/gotcloud</pre>
| |
− | * Point to where you installed the seqshop files
| |
− | *:<pre>setenv SS /home/username/seqshop/</pre>
| |
− | * Point to where you want the output to go
| |
− | *:<pre>setenv OUT /home/username/seqshop_output/</pre>
| |
− | </div>
| |
− | </div>
| |
| | | |
| + | {{SeqShopRemoteEnv}} |
| </div> | | </div> |
| </div> | | </div> |
Line 83: |
Line 62: |
| * BAMs->INDELs rather than BAMs->SNPs | | * BAMs->INDELs rather than BAMs->SNPs |
| | | |
− | If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]] | + | If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]] |
| | | |
| == Running GotCloud Indel == | | == Running GotCloud Indel == |
Line 289: |
Line 268: |
| The following section details some simple analyses we can perform. | | The following section details some simple analyses we can perform. |
| | | |
− | == Summary == | + | ===Summary=== |
| | | |
| First you want to know what is in the vcf file. | | First you want to know what is in the vcf file. |
Line 330: |
Line 309: |
| | | |
| #passed indels of length >4 | | #passed indels of length >4 |
− | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>1" | + | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4" |
| | | |
| #passed singletons of length 4 or insertions of length 3 | | #passed singletons of length 4 or insertions of length 3 |
| ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)" | | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)" |
| | | |
− | == Comparison with other data sets == | + | === Comparison with other data sets === |
| | | |
| It is usually useful to examine the call sets against known data sets for the passed variants. | | It is usually useful to examine the call sets against known data sets for the passed variants. |
| | | |
− | ${GC}/bin/vt profile_indels -g ${VTREF}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS" | + | In order to do this, you need to update indel.reference.txt to point to the reference files. |
| + | cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt |
| + | |
| + | Edit indel.reference.txt and specify the correct path to ${SS} |
| + | nedit ${OUT}/indel.reference.txt |
| + | *'''Replace all occurrences of <code>username</code> with your username (or the correct path to your seqshop example directory).''' |
| + | *:[[File:IndelRef.png]] |
| + | |
| + | ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS" |
| | | |
| | | |
Line 379: |
Line 366: |
| We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity. | | We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity. |
| | | |
− | ${GC}/bin/vt profile_indels -g ${VTREF}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS" | + | ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS" |
| | | |
| data set | | data set |
Line 407: |
Line 394: |
| This analysis supports filters too. | | This analysis supports filters too. |
| | | |
− | ==Normalization== | + | ===Normalization=== |
| | | |
| A slight digression here, when analyzing indels, it is important to normalize it. While it is a simple concept, | | A slight digression here, when analyzing indels, it is important to normalize it. While it is a simple concept, |
Line 493: |
Line 480: |
| - about a loss of 11% of variant thought distinct. | | - about a loss of 11% of variant thought distinct. |
| | | |
− | To normalize and remove duplicate variants:
| + | For chromosome 22, there are 121 variants, but only 106 distinct Indels after normalization |
| + | - about a loss of 12% of variant thought distinct. |
| | | |
− | ${GC}/bin/vt normalize ${VTREF}/mills_indels_hg19.sites.bcf -r ${VTREF}/hs37d5.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.normalized.genotypes.bcf
| + | To normalize and remove duplicate variants for chromosome 22: |
| | | |
− | and you will observe that 3994 variants had to be left aligned and 1092 variants were removed.
| + | ${GC}/bin/vt normalize ${SS}/ref22/mills_indels_hg19.22.sites.bcf -r ${SS}/ref22/human.g1k.v37.chr22.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.22.normalized.genotypes.bcf |
| | | |
− | stats: biallelic
| + | and you will observe that 43 variants had to be left aligned and 15 variants were removed. |
| + | |
| + | <pre> |
| + | stats: biallelic |
| no. left trimmed : 0 | | no. left trimmed : 0 |
| no. left trimmed and left aligned : 0 | | no. left trimmed and left aligned : 0 |
| no. left trimmed and right trimmed : 0 | | no. left trimmed and right trimmed : 0 |
− | no. left aligned : 3994 | + | no. left aligned : 43 |
− | no. right trimmed : 0 <br> | + | no. right trimmed : 0 |
− | multiallelic
| + | |
| + | total no. biallelic normalized : 43 |
| + | |
| + | multiallelic |
| no. left trimmed : 0 | | no. left trimmed : 0 |
| no. left trimmed and left aligned : 0 | | no. left trimmed and left aligned : 0 |
| no. left trimmed and right trimmed : 0 | | no. left trimmed and right trimmed : 0 |
| no. left aligned : 0 | | no. left aligned : 0 |
− | no. right trimmed : 0 <br> | + | no. right trimmed : 0 |
− | no. variants observed : 9996 <br> | + | |
− | <br>
| + | total no. multiallelic normalized : 0 |
− | stats: Total number of observed variants 9996
| + | |
− | Total number of unique variants 8904 <br>
| + | total no. variants normalized : 43 |
| + | total no. variants observed : 121 |
| + | |
| + | |
| + | stats: Total number of observed variants 121 |
| + | Total number of unique variants 106 |
| + | </pre> |
| | | |
− | Let's look for a variant that was normalized. | + | Let's look at the last two variants that were normalized. |
− | ${GC}/bin/vt view ${OUT}/mills.normalized.genotypes.bcf | grep OLD_VARIANT |head -1 | + | ${GC}/bin/vt view ${OUT}/mills.22.normalized.genotypes.bcf | grep OLD_VARIANT |tail -2 |
| | | |
| Results: | | Results: |
− | *The position has changed - it was: | + | *The positions have changed - the were: |
− | ** 18293100 (as seen after OLD_VARIANT) | + | ** 48831918 & 50012616 (as seen after OLD_VARIANT) |
− | *Now it is | + | *Now they are: |
− | ** 18293097 | + | ** 48831915 & 50012614 |
| [[File:indelNormalize.png|800px]] | | [[File:indelNormalize.png|800px]] |
| | | |
| | | |
| UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO. | | UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO. |