Changes

3,090 bytes added , 14:36, 13 November 2014

no edit summary

Line 1: Line 1: +

'''Note:''' the latest version of this practical is available at: [[SeqShop: Variant Calling and Filtering for INDELs Practical]]

+

* The ones here is the original one from the June workshop (updated to be run from elsewhere)

+

== Goals of This Session ==

* What we want to learn

Line 5: Line 9:

** How to evaluate the quality of INDEL calls

+

[[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]]

+

== Setup in person at the SeqShop Workshop ==

+

''This section is specifically for the SeqShop Workshop computers.''

+

+

''If you are not running during the SeqShop Workshop, please skip this section.''

+

−

== Setup your run environment==

+

=== Setup your run environment===

+

This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.

−

~~This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.~~ This will setup some environment variables to point you to:

+

This will setup some environment variables to point you to

−

* GotCloud program

+

* [[GotCloud]] program

* Tutorial input files

* Setup an output directory

Line 18: Line 30:

* You won't see any output after running <code>source</code>

** It silently sets up your environment

+

** If you want to view the detail of the setup, type

+

less /home/mktrost/seqshop/setup.txt

+

and press 'q' to finish.

+

View setup.txt

Line 24: Line 40:

</div>

+

</div>

+

</div>

+

== Setup when running on your own outside of the SeqShop Workshop ==

+

''This section is specifically for running on your own outside of the SeqShop Workshop.''

+

+

''If you are running during the SeqShop Workshop, please skip this section.''

+

+

This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014|Alignment Tutorial]]

+

It also uses the bam.index file created in the SnpCall Tutorial. If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#GotCloud_BAM_Index_File|GotCloud BAM Index File]]

+

</div>

+

</div>

== Examining GotCloud Indel Input files ==

Line 30: Line 62:

* BAMs->INDELs rather than BAMs->SNPs

−

If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]

+

If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]

== Running GotCloud Indel ==

−

${GC}/gotcloud indel --conf ${IN}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000

+

${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}

+

* <code>${GC}/gotcloud</code> runs GotCloud

+

* <code>indel</code> tells GotCloud you want to run the indel calling pipeline.

+

* <code>--conf</code> tells GotCloud the name of the configuration file to use.

+

** The configuration for this test was downloaded with the seqshop input files.

* --numjobs tells GotCloud how many jobs to run in parallel

** Depends on your system

* --region 22:36000000-37000000

** The sample files are just a small region of chromosome 22, so to save time, we tell GotCloud to ignore the other regions

+

* <code>--base_prefix</code> tells GotCloud the prefix to append to relative paths.

+

** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}

+

** Alternatively, gotcloud.conf could be updated to specify the full paths

+

* <code>--out_dir</code> tells GotCloud where to write the output.

+

** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line

Line 65: Line 106:

Let's look at the <code>final</code> directory:

ls ${OUT}/final

−

~~Just a <code>chr22</code> directory, so look inside of there:~~

+

−

~~ls ${OUT}/vcfs/chr22~~

;Can you identify the final indel VCF?

Line 96: Line 136:

Let's see if we found the indel

−

$GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041 | head -1

+

${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041 | head -1

Did you see a variant at the position?

Line 102: Line 142:

Let's check the sequence data to confirm that the variant really exists

−

$GC/bin/samtools tview $IN/bams/HG01101.recal.bam $~~REF~~/human.g1k.v37.chr22.fa

+

${GC}/bin/samtools tview ${SS}/bams/HG01101.recal.bam ${SS}/ref22/human.g1k.v37.chr22.fa

* Type 'g' to go to a specific position

Line 124: Line 164:

==== Header ====

First, let's look at the header:

−

$GC/bin/tabix -H $OUT/final/all.genotypes.vcf.gz

+

${GC}/bin/tabix -H ${OUT}/final/all.genotypes.vcf.gz

The header is as follows:

Line 163: Line 203:

To view a specific region of records (such as APOL1 g2 allele)

−

$GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041-36662041

+

${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041-36662041

The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.

Line 228: Line 268:

The following section details some simple analyses we can perform.

−

== Summary ==

+

===Summary===

First you want to know what is in the vcf file.

Line 269: Line 309:

#passed indels of length >4

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>1"

+

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"

#passed singletons of length 4 or insertions of length 3

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"

−

== Comparison with other data sets ==

+

=== Comparison with other data sets ===

It is usually useful to examine the call sets against known data sets for the passed variants.

−

${GC}/bin/vt profile_indels -g ${~~VTREF~~}/indel.reference.txt -r ${~~REF~~}/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

+

In order to do this, you need to update indel.reference.txt to point to the reference files.

+

cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt

+

Edit indel.reference.txt and specify the correct path to ${SS}

+

nedit ${OUT}/indel.reference.txt

+

*'''Replace all occurrences of <code>username</code> with your username (or the correct path to your seqshop example directory).'''

+

*:[[File:IndelRef.png]]

+

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

Line 318: Line 366:

We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity.

−

${GC}/bin/vt profile_indels -g ${~~VTREF~~}/indel.reference.txt -r ${~~REF~~}/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"

+

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"

data set

Line 346: Line 394:

This analysis supports filters too.

−

==Normalization==

+

===Normalization===

A slight digression here, when analyzing indels, it is important to normalize it. While it is a simple concept,

Line 432: Line 480:

- about a loss of 11% of variant thought distinct.

−

To normalize and remove duplicate variants:

+

For chromosome 22, there are 121 variants, but only 106 distinct Indels after normalization

+

- about a loss of 12% of variant thought distinct.

+

To normalize and remove duplicate variants for chromosome 22:

−

${GC}/bin/vt normalize ${~~VTREF~~}/mills_indels_hg19.sites.bcf -r ${~~VTREF~~}/~~hs37d5~~.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.normalized.genotypes.bcf

+

${GC}/bin/vt normalize ${SS}/ref22/mills_indels_hg19.22.sites.bcf -r ${SS}/ref22/human.g1k.v37.chr22.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.22.normalized.genotypes.bcf

−

and you will observe that ~~3994~~ variants had to be left aligned and ~~1092~~ variants were removed.

+

and you will observe that 43 variants had to be left aligned and 15 variants were removed.

−

stats: biallelic

+

<pre>

+

stats: biallelic

no. left trimmed : 0

no. left trimmed and left aligned : 0

no. left trimmed and right trimmed : 0

−

no. left aligned : ~~3994~~

+

no. left aligned : 43

−

no. right trimmed : 0 ~~<br>~~

+

no. right trimmed : 0

−

multiallelic

+

total no. biallelic normalized : 43

+

multiallelic

no. left trimmed : 0

no. left trimmed and left aligned : 0

no. left trimmed and right trimmed : 0

no. left aligned : 0

−

no. right trimmed : 0 ~~<br>~~

+

no. right trimmed : 0

−

no. variants observed : ~~9996 <br>~~

+

−

~~<br>~~

+

total no. multiallelic normalized : 0

−

stats: Total number of observed variants ~~9996~~

+

−

Total number of unique variants ~~8904~~ <br>

+

total no. variants normalized : 43

+

total no. variants observed : 121

+

stats: Total number of observed variants 121

+

Total number of unique variants 106

+

</pre>

−

Let's look ~~for a variant~~ that ~~was~~ normalized.

+

Let's look at the last two variants that were normalized.

−

${GC}/bin/vt view ${OUT}/mills.normalized.genotypes.bcf | grep OLD_VARIANT |~~head~~ -1

+

${GC}/bin/vt view ${OUT}/mills.22.normalized.genotypes.bcf | grep OLD_VARIANT |tail -2

Results:

−

*The ~~position has~~ changed - ~~it was~~:

+

*The positions have changed - the were:

−

** ~~18293100~~ (as seen after OLD_VARIANT)

+

** 48831918 & 50012616 (as seen after OLD_VARIANT)

−

*Now ~~it is~~

+

*Now they are:

−

** ~~18293097~~

+

** 48831915 & 50012614

[[File:indelNormalize.png|800px]]

UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.

Mktrost

Administrators

3,045

edits

Changes

SeqShop: Variant Calling and Filtering for INDELs Practical, June 2014 (view source)

Revision as of 14:36, 13 November 2014

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools