Changes

321 bytes added , 14:36, 13 November 2014

no edit summary

Line 1: Line 1: +

'''Note:''' the latest version of this practical is available at: [[SeqShop: Variant Calling and Filtering for INDELs Practical]]

+

* The ones here is the original one from the June workshop (updated to be run from elsewhere)

+

== Goals of This Session ==

* What we want to learn

Line 5: Line 9:

** How to evaluate the quality of INDEL calls

−

[[Media:Variant Calling and Filtering for INDELs.pdf|~~Intro~~ Slides]]

+

[[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]]

== Setup in person at the SeqShop Workshop ==

Line 44: Line 48:

''If you are running during the SeqShop Workshop, please skip this section.''

−

~~=== Download the example data ===~~

−

~~=== Setup your~~ run ~~environment ===~~

+

This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014|Alignment Tutorial]]

−

~~Environment variables will be used throughout~~ the tutorial.

+

It also uses the bam.index file created in the SnpCall Tutorial. If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#GotCloud_BAM_Index_File|GotCloud BAM Index File]]

−

~~We recommend that you setup these variables so you won't have to modify every command in the tutorial.~~

−

~~<div class="mw-collapsible mw-collapsed" style="width:500px">~~

−

~~I'm using bash (replace the paths below with the appropriate paths):~~

−

~~<div class="mw-collapsible-content">~~

−

* Point to where you installed GotCloud

−

*:<pre>export GC=/home/username/gotcloud</pre>

−

* Point to where you installed the seqshop files

−

*:<pre>export SS=/home/username/seqshop/</pre>

−

* Point to where you want the output to go

−

*:<pre>export OUT=/home/username/seqshop_output/</pre>

−

~~</div>~~

−

~~</div>~~

−

~~<div class="mw-collapsible mw-collapsed" style="width:500px">~~

−

~~I'm using tcsh (replace the paths below with the appropriate paths):~~

−

~~<div class="mw-collapsible-content">~~

−

* Point to where you installed GotCloud

−

*:<pre>setenv GC /home/username/gotcloud</pre>

−

* Point to where you installed the seqshop files

−

*:<pre>setenv SS /home/username/seqshop/</pre>

−

* Point to where you want the output to go

−

*:<pre>setenv OUT /home/username/seqshop_output/</pre>

−

~~</div>~~

−

~~</div>~~

+

</div>

Line 83: Line 62:

* BAMs->INDELs rather than BAMs->SNPs

−

If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]

+

If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]

== Running GotCloud Indel ==

Line 289: Line 268:

The following section details some simple analyses we can perform.

−

== Summary ==

+

===Summary===

First you want to know what is in the vcf file.

Line 330: Line 309:

#passed indels of length >4

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>1"

+

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"

#passed singletons of length 4 or insertions of length 3

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"

−

== Comparison with other data sets ==

+

=== Comparison with other data sets ===

It is usually useful to examine the call sets against known data sets for the passed variants.

−

${GC}/bin/vt profile_indels -g ${~~VTREF~~}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

+

In order to do this, you need to update indel.reference.txt to point to the reference files.

+

cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt

+

Edit indel.reference.txt and specify the correct path to ${SS}

+

nedit ${OUT}/indel.reference.txt

+

*'''Replace all occurrences of <code>username</code> with your username (or the correct path to your seqshop example directory).'''

+

*:[[File:IndelRef.png]]

+

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

Line 379: Line 366:

We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity.

−

${GC}/bin/vt profile_indels -g ${~~VTREF~~}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"

+

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"

data set

Line 407: Line 394:

This analysis supports filters too.

−

==Normalization==

+

===Normalization===

A slight digression here, when analyzing indels, it is important to normalize it. While it is a simple concept,

Line 493: Line 480:

- about a loss of 11% of variant thought distinct.

−

~~To normalize and remove duplicate~~ variants:

+

For chromosome 22, there are 121 variants, but only 106 distinct Indels after normalization

+

- about a loss of 12% of variant thought distinct.

−

~~${GC}/bin/vt~~ normalize ~~${VTREF}/mills_indels_hg19.sites.bcf -r ${VTREF}/hs37d5.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.normalized.genotypes.bcf~~

+

To normalize and remove duplicate variants for chromosome 22:

−

~~and you will observe that 3994 variants had to be left aligned and 1092 variants were removed~~.

+

${GC}/bin/vt normalize ${SS}/ref22/mills_indels_hg19.22.sites.bcf -r ${SS}/ref22/human.g1k.v37.chr22.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.22.normalized.genotypes.bcf

−

stats: biallelic

+

and you will observe that 43 variants had to be left aligned and 15 variants were removed.

+

<pre>

+

stats: biallelic

no. left trimmed : 0

no. left trimmed and left aligned : 0

no. left trimmed and right trimmed : 0

−

no. left aligned : ~~3994~~

+

no. left aligned : 43

−

no. right trimmed : 0 ~~<br>~~

+

no. right trimmed : 0

−

multiallelic

+

total no. biallelic normalized : 43

+

multiallelic

no. left trimmed : 0

no. left trimmed and left aligned : 0

no. left trimmed and right trimmed : 0

no. left aligned : 0

−

no. right trimmed : 0 ~~<br>~~

+

no. right trimmed : 0

−

no. variants observed : ~~9996 <br>~~

+

−

~~<br>~~

+

total no. multiallelic normalized : 0

−

stats: Total number of observed variants ~~9996~~

+

−

Total number of unique variants ~~8904~~ <br>

+

total no. variants normalized : 43

+

total no. variants observed : 121

+

stats: Total number of observed variants 121

+

Total number of unique variants 106

+

</pre>

−

Let's look ~~for a variant~~ that ~~was~~ normalized.

+

Let's look at the last two variants that were normalized.

−

${GC}/bin/vt view ${OUT}/mills.normalized.genotypes.bcf | grep OLD_VARIANT |~~head~~ -1

+

${GC}/bin/vt view ${OUT}/mills.22.normalized.genotypes.bcf | grep OLD_VARIANT |tail -2

Results:

−

*The ~~position has~~ changed - ~~it was~~:

+

*The positions have changed - the were:

−

** ~~18293100~~ (as seen after OLD_VARIANT)

+

** 48831918 & 50012616 (as seen after OLD_VARIANT)

−

*Now ~~it is~~

+

*Now they are:

−

** ~~18293097~~

+

** 48831915 & 50012614

[[File:indelNormalize.png|800px]]

UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.

Mktrost

Administrators

3,045

edits

Changes

SeqShop: Variant Calling and Filtering for INDELs Practical, June 2014 (view source)

Revision as of 14:36, 13 November 2014

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools