Line 1: |
Line 1: |
| + | '''Note:''' the latest version of this practical is available at: [[SeqShop: Variant Calling and Filtering for INDELs Practical]] |
| + | * The ones here is the original one from the June workshop (updated to be run from elsewhere) |
| + | |
| + | |
| == Goals of This Session == | | == Goals of This Session == |
| * What we want to learn | | * What we want to learn |
Line 5: |
Line 9: |
| ** How to evaluate the quality of INDEL calls | | ** How to evaluate the quality of INDEL calls |
| | | |
| + | [[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]] |
| + | |
| + | == Setup in person at the SeqShop Workshop == |
| + | ''This section is specifically for the SeqShop Workshop computers.'' |
| + | <div class="mw-collapsible mw-collapsed" style="width:600px"> |
| + | ''If you are not running during the SeqShop Workshop, please skip this section.'' |
| + | <div class="mw-collapsible-content"> |
| | | |
| {{SeqShopLogin}} | | {{SeqShopLogin}} |
| | | |
− | == Setup your run environment== | + | === Setup your run environment=== |
| + | This is the same setup you did for the previous tutorial, but you need to redo it each time you log in. |
| | | |
− | This is the same setup you did for the previous tutorial, but you need to redo it each time you log in. This will setup some environment variables to point you to:
| + | This will setup some environment variables to point you to |
− | * GotCloud program | + | * [[GotCloud]] program |
| * Tutorial input files | | * Tutorial input files |
| * Setup an output directory | | * Setup an output directory |
Line 18: |
Line 30: |
| * You won't see any output after running <code>source</code> | | * You won't see any output after running <code>source</code> |
| ** It silently sets up your environment | | ** It silently sets up your environment |
| + | ** If you want to view the detail of the setup, type |
| + | less /home/mktrost/seqshop/setup.txt |
| + | and press 'q' to finish. |
| + | |
| <div class="mw-collapsible mw-collapsed" style="width:200px"> | | <div class="mw-collapsible mw-collapsed" style="width:200px"> |
| View setup.txt | | View setup.txt |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
| [[File:setup.png|500px]] | | [[File:setup.png|500px]] |
| + | </div> |
| + | </div> |
| </div> | | </div> |
| </div> | | </div> |
| | | |
| + | == Setup when running on your own outside of the SeqShop Workshop == |
| + | ''This section is specifically for running on your own outside of the SeqShop Workshop.'' |
| + | <div class="mw-collapsible" style="width:600px"> |
| + | ''If you are running during the SeqShop Workshop, please skip this section.'' |
| + | <div class="mw-collapsible-content"> |
| + | |
| + | This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014|Alignment Tutorial]] |
| + | |
| + | It also uses the bam.index file created in the SnpCall Tutorial. If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#GotCloud_BAM_Index_File|GotCloud BAM Index File]] |
| + | |
| + | |
| + | {{SeqShopRemoteEnv}} |
| + | </div> |
| + | </div> |
| | | |
| == Examining GotCloud Indel Input files == | | == Examining GotCloud Indel Input files == |
Line 30: |
Line 62: |
| * BAMs->INDELs rather than BAMs->SNPs | | * BAMs->INDELs rather than BAMs->SNPs |
| | | |
− | If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]] | + | If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]] |
| | | |
| == Running GotCloud Indel == | | == Running GotCloud Indel == |
− | ${GC}/gotcloud indel --conf ${IN}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 | + | ${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT} |
| + | * <code>${GC}/gotcloud</code> runs GotCloud |
| + | * <code>indel</code> tells GotCloud you want to run the indel calling pipeline. |
| + | * <code>--conf</code> tells GotCloud the name of the configuration file to use. |
| + | ** The configuration for this test was downloaded with the seqshop input files. |
| * --numjobs tells GotCloud how many jobs to run in parallel | | * --numjobs tells GotCloud how many jobs to run in parallel |
| ** Depends on your system | | ** Depends on your system |
| * --region 22:36000000-37000000 | | * --region 22:36000000-37000000 |
| ** The sample files are just a small region of chromosome 22, so to save time, we tell GotCloud to ignore the other regions | | ** The sample files are just a small region of chromosome 22, so to save time, we tell GotCloud to ignore the other regions |
| + | * <code>--base_prefix</code> tells GotCloud the prefix to append to relative paths. |
| + | ** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS} |
| + | ** Alternatively, gotcloud.conf could be updated to specify the full paths |
| + | * <code>--out_dir</code> tells GotCloud where to write the output. |
| + | ** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line |
| | | |
| <div class="mw-collapsible mw-collapsed" style="width:500px"> | | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
Line 65: |
Line 106: |
| Let's look at the <code>final</code> directory: | | Let's look at the <code>final</code> directory: |
| ls ${OUT}/final | | ls ${OUT}/final |
− | Just a <code>chr22</code> directory, so look inside of there:
| + | |
− | ls ${OUT}/vcfs/chr22
| |
| ;Can you identify the final indel VCF? | | ;Can you identify the final indel VCF? |
| <div class="mw-collapsible mw-collapsed" style="width:350px"> | | <div class="mw-collapsible mw-collapsed" style="width:350px"> |
Line 96: |
Line 136: |
| Let's see if we found the indel | | Let's see if we found the indel |
| | | |
− | $GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041 | head -1 | + | ${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041 | head -1 |
| | | |
| Did you see a variant at the position? | | Did you see a variant at the position? |
Line 102: |
Line 142: |
| Let's check the sequence data to confirm that the variant really exists | | Let's check the sequence data to confirm that the variant really exists |
| | | |
− | $GC/bin/samtools tview $IN/bams/HG01101.recal.bam $REF/human.g1k.v37.chr22.fa | + | ${GC}/bin/samtools tview ${SS}/bams/HG01101.recal.bam ${SS}/ref22/human.g1k.v37.chr22.fa |
| | | |
| * Type 'g' to go to a specific position | | * Type 'g' to go to a specific position |
Line 113: |
Line 153: |
| View Screenshot | | View Screenshot |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
− | [[File:Samtoolstviewindel.png|600px]] | + | [[File:Samtoolsviewindel.png|600px]] |
| </div> | | </div> |
| </div> | | </div> |
Line 124: |
Line 164: |
| ==== Header ==== | | ==== Header ==== |
| First, let's look at the header: | | First, let's look at the header: |
− | $GC/bin/tabix -H $OUT/final/all.genotypes.vcf.gz | + | ${GC}/bin/tabix -H ${OUT}/final/all.genotypes.vcf.gz |
| | | |
| The header is as follows: | | The header is as follows: |
Line 153: |
Line 193: |
| ##FILTER=<ID=PASS,Description="Temporary pass"> | | ##FILTER=<ID=PASS,Description="Temporary pass"> |
| ##FILTER=<ID=overlap,Description="Overlapping variant"> | | ##FILTER=<ID=overlap,Description="Overlapping variant"> |
− | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00551 HG00553 HG00554 HG00637 HG00638 HG00640 HG00641 HG00734 HG00736 HG00737 HG00739 HG00740 HG01047 HG01049 HG01051 HG01052 HG01054 HG01055 HG01060 HG01061 HG01066 HG01067 HG01069 HG01070 HG01072 HG01073 HG01075 HG01079 HG01080 HG01082 HG01083 HG01094 HG01097 HG01098 HG01101 HG01102 HG01107 HG01108 HG01110 HG01111 HG01167 HG01168 HG01170 HG01171 HG01173 HG01174 HG01176 HG01177 HG01182 HG01183 HG01187 HG01188 HG01190 HG01191 HG01197 HG01198 HG01204 HG01205 HG01241 HG01242 HG01247 HG01248
| + | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00551 HG00553 HG00554 HG00637 HG00638 HG00640 HG00641 HG00734 HG00736 HG00737 HG00739 HG00740 HG01047 HG01049 HG01051 HG01052 HG01054 HG01055 HG01060 HG01061 HG01066 HG01067 HG01069 HG01070 HG01072 HG01073 HG01075 HG01079 HG01080 HG01082 HG01083 HG01094 HG01097 HG01098 HG01101 HG01102 HG01107 HG01108 HG01110 HG01111 HG01167 HG01168 HG01170 HG01171 HG01173 HG01174 HG01176 HG01177 HG01182 HG01183 HG01187 HG01188 HG01190 HG01191 HG01197 HG01198 HG01204 HG01205 HG01241 HG01242 HG01247 HG01248 |
| | | |
| Using [[Vt]], we can see the same output | | Using [[Vt]], we can see the same output |
Line 163: |
Line 203: |
| To view a specific region of records (such as APOL1 g2 allele) | | To view a specific region of records (such as APOL1 g2 allele) |
| | | |
− | $GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041-36662041 | + | ${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041-36662041 |
| | | |
| The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name. | | The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name. |
Line 180: |
Line 220: |
| 36662041 : genome position | | 36662041 : genome position |
| . : this is the ID field that is left blank. | | . : this is the ID field that is left blank. |
− | AATAATT : the reference sequence that is replaced by the alternative sequence below. | + | AATAATT : the reference sequence that is replaced by the alternative sequence below. |
| A : so this is basically a deletion of GT | | A : so this is basically a deletion of GT |
| 756 : QUAL field denoting validity of this variant, higher the better. | | 756 : QUAL field denoting validity of this variant, higher the better. |
Line 197: |
Line 237: |
| The key information fields are as follows: | | The key information fields are as follows: |
| | | |
− | AC=2 : alternate allele count | + | AC=2 : alternate allele count |
| AN=114 : total number of alleles | | AN=114 : total number of alleles |
− | AF=0.0175439 : allele frequency based on AC/AN | + | AF=0.017 : allele frequency based on AC/AN |
− | GC=55,2,0 : genotype counts for 0/0, 0/1, 1/1 | + | GC=55,2,0 : genotype counts for 0/0, 0/1, 1/1 |
| GF=0.55,0.34,0.10 : genotype frequencies based on GC | | GF=0.55,0.34,0.10 : genotype frequencies based on GC |
| NS=57 : no. of samples | | NS=57 : no. of samples |
| HWEAF=0.28 : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium | | HWEAF=0.28 : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium |
| HWEGF=0.52,0.40,0.08 : genotype frequency derived from HWEAF | | HWEGF=0.52,0.40,0.08 : genotype frequency derived from HWEAF |
− | HWE_LPVAL=-0.182794 : log p value of HWE test | + | HWE_LPVAL=-1.08 : log p value of HWE test |
− | FIC=-0.00372601;AB=0.384578
| |
| FIC=-0.07 : genotype likelihood based inbreeding coefficient | | FIC=-0.07 : genotype likelihood based inbreeding coefficient |
| AB=0.61 : genotype likelihood based allele balance | | AB=0.61 : genotype likelihood based allele balance |
Line 229: |
Line 268: |
| The following section details some simple analyses we can perform. | | The following section details some simple analyses we can perform. |
| | | |
− | == Summary == | + | ===Summary=== |
| | | |
| First you want to know what is in the vcf file. | | First you want to know what is in the vcf file. |
Line 241: |
Line 280: |
| >=3 alleles (ins/del) : 0 (-nan) [0/0] <br> | | >=3 alleles (ins/del) : 0 (-nan) [0/0] <br> |
| no. of observed variants : 720 | | no. of observed variants : 720 |
− | * Our VCF only contains INDELs so there are lots of <code>0 (-nan) [0/0]</code> values in the output. <code>vt</code> can be used to analyze VCFs with more variants than just INDELs. | + | * Our VCF only contains only biallelic INDELs so there are lots of <code>0 (-nan) [0/0]</code> values in the output. <code>vt</code> can be used to analyze VCFs with more variants than just INDELs. |
| | | |
− | The variants have filter labels. PASS meaning a temporary pass and overlap, meaning that the variants are overlapping with another variant, implying multiallelicity. | + | The variants have filter labels. PASS meaning a passed variant and overlap, meaning that the variants are overlapping with another variant, implying multiallelicity. |
| | | |
| We can count the number of variants with different filters with the following commands. | | We can count the number of variants with different filters with the following commands. |
Line 270: |
Line 309: |
| | | |
| #passed indels of length >4 | | #passed indels of length >4 |
− | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>1" | + | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4" |
| | | |
| #passed singletons of length 4 or insertions of length 3 | | #passed singletons of length 4 or insertions of length 3 |
| ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)" | | ${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)" |
| | | |
− | == Comparison with other data sets == | + | === Comparison with other data sets === |
| | | |
| It is usually useful to examine the call sets against known data sets for the passed variants. | | It is usually useful to examine the call sets against known data sets for the passed variants. |
| | | |
− | ${GC}/bin/vt profile_indels -g indel.reference.txt -r hs37d5.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"
| + | In order to do this, you need to update indel.reference.txt to point to the reference files. |
| + | cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt |
| + | |
| + | Edit indel.reference.txt and specify the correct path to ${SS} |
| + | nedit ${OUT}/indel.reference.txt |
| + | *'''Replace all occurrences of <code>username</code> with your username (or the correct path to your seqshop example directory).''' |
| + | *:[[File:IndelRef.png]] |
| + | |
| + | ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS" |
| + | |
| | | |
| data set | | data set |
Line 318: |
Line 366: |
| We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity. | | We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity. |
| | | |
− | ${GC}/bin/vt profile_indels -g indel.reference.txt -r hs37d5.fa all.genotypes.bcf -i 22:36000000-37000000 -f "~PASS" | + | ${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS" |
| | | |
| data set | | data set |
Line 346: |
Line 394: |
| This analysis supports filters too. | | This analysis supports filters too. |
| | | |
− | ==Normalization== | + | ===Normalization=== |
| | | |
| A slight digression here, when analyzing indels, it is important to normalize it. While it is a simple concept, | | A slight digression here, when analyzing indels, it is important to normalize it. While it is a simple concept, |
Line 366: |
Line 414: |
| |- | | |- |
| ! scope="col"| Dataset | | ! scope="col"| Dataset |
− | ! scope="col"| Freebayes
| |
| ! scope="col"| Haplotyecaller | | ! scope="col"| Haplotyecaller |
| ! scope="col"| PINDEL | | ! scope="col"| PINDEL |
Line 377: |
Line 424: |
| |- | | |- |
| | Left trim | | | Left trim |
− | | 27069
| |
| | 1 | | | 1 |
| | 0 | | | 0 |
Line 386: |
Line 432: |
| |- | | |- |
| | Left aligned | | | Left aligned |
− | | 3
| |
| | 1 | | | 1 |
| | 1 | | | 1 |
Line 397: |
Line 442: |
| |- | | |- |
| | Left trim | | | Left trim |
− | | 40782
| |
| | 0 | | | 0 |
| | 0 | | | 0 |
Line 406: |
Line 450: |
| |- | | |- |
| | Left aligned | | | Left aligned |
− | | 1892
| |
| | 0 | | | 0 |
| | 0 | | | 0 |
Line 415: |
Line 458: |
| |- | | |- |
| | Right trimmed | | | Right trimmed |
− | | 0
| |
| | 0 | | | 0 |
| | 0 | | | 0 |
Line 426: |
Line 468: |
| |- | | |- |
| | Duplicate variants | | | Duplicate variants |
− | | 0
| |
| | 1 | | | 1 |
| | 155 | | | 155 |
Line 439: |
Line 480: |
| - about a loss of 11% of variant thought distinct. | | - about a loss of 11% of variant thought distinct. |
| | | |
− | To normalize and remove duplicate variants: | + | For chromosome 22, there are 121 variants, but only 106 distinct Indels after normalization |
| + | - about a loss of 12% of variant thought distinct. |
| + | |
| + | To normalize and remove duplicate variants for chromosome 22: |
| | | |
− | ${GC}/bin/vt normalize mills.genotypes.bcf -r hs37d5.fa | ${GC}/bin/vt mergedups - -o mills.normalized.genotypes.bcf | + | ${GC}/bin/vt normalize ${SS}/ref22/mills_indels_hg19.22.sites.bcf -r ${SS}/ref22/human.g1k.v37.chr22.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.22.normalized.genotypes.bcf |
| | | |
− | and you will observe that 3994 variants had to be left aligned and 1092 variants were removed. | + | and you will observe that 43 variants had to be left aligned and 15 variants were removed. |
| | | |
− | stats: biallelic
| + | <pre> |
| + | stats: biallelic |
| no. left trimmed : 0 | | no. left trimmed : 0 |
| no. left trimmed and left aligned : 0 | | no. left trimmed and left aligned : 0 |
| no. left trimmed and right trimmed : 0 | | no. left trimmed and right trimmed : 0 |
− | no. left aligned : 3994 | + | no. left aligned : 43 |
− | no. right trimmed : 0 <br> | + | no. right trimmed : 0 |
− | multiallelic
| + | |
| + | total no. biallelic normalized : 43 |
| + | |
| + | multiallelic |
| no. left trimmed : 0 | | no. left trimmed : 0 |
| no. left trimmed and left aligned : 0 | | no. left trimmed and left aligned : 0 |
| no. left trimmed and right trimmed : 0 | | no. left trimmed and right trimmed : 0 |
| no. left aligned : 0 | | no. left aligned : 0 |
− | no. right trimmed : 0 <br> | + | no. right trimmed : 0 |
− | no. variants observed : 9996 <br> | + | |
− | <br>
| + | total no. multiallelic normalized : 0 |
− | stats: Total number of observed variants 9996
| + | |
− | Total number of unique variants 8904 <br>
| + | total no. variants normalized : 43 |
| + | total no. variants observed : 121 |
| + | |
| + | |
| + | stats: Total number of observed variants 121 |
| + | Total number of unique variants 106 |
| + | </pre> |
| + | |
| + | Let's look at the last two variants that were normalized. |
| + | ${GC}/bin/vt view ${OUT}/mills.22.normalized.genotypes.bcf | grep OLD_VARIANT |tail -2 |
| + | |
| + | Results: |
| + | *The positions have changed - the were: |
| + | ** 48831918 & 50012616 (as seen after OLD_VARIANT) |
| + | *Now they are: |
| + | ** 48831915 & 50012614 |
| + | [[File:indelNormalize.png|800px]] |
| | | |
| | | |
| UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO. | | UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO. |