Changes

3,403 bytes added , 14:36, 13 November 2014

no edit summary

Line 1: Line 1: +

'''Note:''' the latest version of this practical is available at: [[SeqShop: Variant Calling and Filtering for INDELs Practical]]

+

* The ones here is the original one from the June workshop (updated to be run from elsewhere)

+

== Goals of This Session ==

* What we want to learn

Line 5: Line 9:

** How to evaluate the quality of INDEL calls

+

[[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]]

+

== Setup in person at the SeqShop Workshop ==

+

''This section is specifically for the SeqShop Workshop computers.''

+

+

''If you are not running during the SeqShop Workshop, please skip this section.''

+

−

== Setup your run environment==

+

=== Setup your run environment===

+

This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.

−

~~This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.~~ This will setup some environment variables to point you to:

+

This will setup some environment variables to point you to

−

* GotCloud program

+

* [[GotCloud]] program

* Tutorial input files

* Setup an output directory

Line 18: Line 30:

* You won't see any output after running <code>source</code>

** It silently sets up your environment

+

** If you want to view the detail of the setup, type

+

less /home/mktrost/seqshop/setup.txt

+

and press 'q' to finish.

+

View setup.txt

[[File:setup.png|500px]]

+

</div>

+

</div>

+

== Setup when running on your own outside of the SeqShop Workshop ==

+

''This section is specifically for running on your own outside of the SeqShop Workshop.''

+

+

''If you are running during the SeqShop Workshop, please skip this section.''

+

+

This tutorial builds on the alignment tutorial, if you have not already, please first run that tutorial: [[SeqShop:_Sequence_Mapping_and_Assembly_Practical, June 2014|Alignment Tutorial]]

+

It also uses the bam.index file created in the SnpCall Tutorial. If you have not yet run that tutorial, please follow the directions at: [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#GotCloud_BAM_Index_File|GotCloud BAM Index File]]

+

</div>

+

</div>

== Examining GotCloud Indel Input files ==

Line 30: Line 62:

* BAMs->INDELs rather than BAMs->SNPs

−

If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]

+

If you want a reminder, of what they look like, here is a link to the previous tutorial : [[SeqShop:_Variant_Calling_and_Filtering_for_SNPs_Practical, June 2014#Examining_GotCloud_SnpCall_Input_files|GotCloud SnpCall Input Files]]

== Running GotCloud Indel ==

−

${GC}/gotcloud indel --conf ${IN}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000

+

${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}

+

* <code>${GC}/gotcloud</code> runs GotCloud

+

* <code>indel</code> tells GotCloud you want to run the indel calling pipeline.

+

* <code>--conf</code> tells GotCloud the name of the configuration file to use.

+

** The configuration for this test was downloaded with the seqshop input files.

* --numjobs tells GotCloud how many jobs to run in parallel

** Depends on your system

* --region 22:36000000-37000000

** The sample files are just a small region of chromosome 22, so to save time, we tell GotCloud to ignore the other regions

+

* <code>--base_prefix</code> tells GotCloud the prefix to append to relative paths.

+

** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}

+

** Alternatively, gotcloud.conf could be updated to specify the full paths

+

* <code>--out_dir</code> tells GotCloud where to write the output.

+

** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line

Line 65: Line 106:

Let's look at the <code>final</code> directory:

ls ${OUT}/final

−

~~Just a <code>chr22</code> directory, so look inside of there:~~

+

−

~~ls ${OUT}/vcfs/chr22~~

;Can you identify the final indel VCF?

Line 96: Line 136:

Let's see if we found the indel

−

$GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041 | head -1

+

${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041 | head -1

Did you see a variant at the position?

Line 102: Line 142:

Let's check the sequence data to confirm that the variant really exists

−

$GC/bin/samtools tview $IN/bams/HG01101.recal.bam $~~REF~~/human.g1k.v37.chr22.fa

+

${GC}/bin/samtools tview ${SS}/bams/HG01101.recal.bam ${SS}/ref22/human.g1k.v37.chr22.fa

* Type 'g' to go to a specific position

Line 113: Line 153:

View Screenshot

−

[[File:~~Samtoolstviewindel~~.png|600px]]

+

[[File:Samtoolsviewindel.png|600px]]

</div>

Line 124: Line 164:

==== Header ====

First, let's look at the header:

−

$GC/bin/tabix -H $OUT/final/all.genotypes.vcf.gz

+

${GC}/bin/tabix -H ${OUT}/final/all.genotypes.vcf.gz

The header is as follows:

Line 153: Line 193:

##FILTER=<ID=PASS,Description="Temporary pass">

##FILTER=<ID=overlap,Description="Overlapping variant">

−

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00551 HG00553 HG00554 HG00637 HG00638 HG00640 HG00641 HG00734 HG00736 HG00737 HG00739 HG00740 HG01047 HG01049 HG01051 HG01052 HG01054 HG01055 HG01060 HG01061 HG01066 HG01067 HG01069 HG01070 HG01072 HG01073 HG01075 HG01079 HG01080 HG01082 HG01083 HG01094 HG01097 HG01098 HG01101 HG01102 HG01107 HG01108 HG01110 HG01111 HG01167 HG01168 HG01170 HG01171 HG01173 HG01174 HG01176 HG01177 HG01182 HG01183 HG01187 HG01188 HG01190 HG01191 HG01197 HG01198 HG01204 HG01205 HG01241 HG01242 HG01247 HG01248

+

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00551 HG00553 HG00554 HG00637 HG00638 HG00640 HG00641 HG00734 HG00736 HG00737 HG00739 HG00740 HG01047 HG01049 HG01051 HG01052 HG01054 HG01055 HG01060 HG01061 HG01066 HG01067 HG01069 HG01070 HG01072 HG01073 HG01075 HG01079 HG01080 HG01082 HG01083 HG01094 HG01097 HG01098 HG01101 HG01102 HG01107 HG01108 HG01110 HG01111 HG01167 HG01168 HG01170 HG01171 HG01173 HG01174 HG01176 HG01177 HG01182 HG01183 HG01187 HG01188 HG01190 HG01191 HG01197 HG01198 HG01204 HG01205 HG01241 HG01242 HG01247 HG01248

Using [[Vt]], we can see the same output

Line 163: Line 203:

To view a specific region of records (such as APOL1 g2 allele)

−

$GC/bin/tabix $OUT/final/all.genotypes.vcf.gz 22:36662041-36662041

+

${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041-36662041

The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.

Line 180: Line 220:

36662041 : genome position

. : this is the ID field that is left blank.

−

AATAATT : the reference sequence that is replaced by the alternative sequence below.

+

AATAATT : the reference sequence that is replaced by the alternative sequence below.

A : so this is basically a deletion of GT

756 : QUAL field denoting validity of this variant, higher the better.

Line 197: Line 237:

The key information fields are as follows:

−

AC=2 : alternate allele count

+

AC=2 : alternate allele count

AN=114 : total number of alleles

−

AF=0.~~0175439~~ : allele frequency based on AC/AN

+

AF=0.017 : allele frequency based on AC/AN

−

GC=55,2,0 : genotype counts for 0/0, 0/1, 1/1

+

GC=55,2,0 : genotype counts for 0/0, 0/1, 1/1

GF=0.55,0.34,0.10 : genotype frequencies based on GC

NS=57 : no. of samples

HWEAF=0.28 : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium

HWEGF=0.52,0.40,0.08 : genotype frequency derived from HWEAF

−

HWE_LPVAL=-0.~~182794~~ : log p value of HWE test

+

HWE_LPVAL=-1.08 : log p value of HWE test

−

~~FIC=-0.00372601;AB=0.384578~~

FIC=-0.07 : genotype likelihood based inbreeding coefficient

AB=0.61 : genotype likelihood based allele balance

Line 229: Line 268:

The following section details some simple analyses we can perform.

−

== Summary ==

+

===Summary===

First you want to know what is in the vcf file.

Line 241: Line 280:

>=3 alleles (ins/del) : 0 (-nan) [0/0] <br>

no. of observed variants : 720

−

* Our VCF only contains INDELs so there are lots of <code>0 (-nan) [0/0]</code> values in the output. <code>vt</code> can be used to analyze VCFs with more variants than just INDELs.

+

* Our VCF only contains only biallelic INDELs so there are lots of <code>0 (-nan) [0/0]</code> values in the output. <code>vt</code> can be used to analyze VCFs with more variants than just INDELs.

−

The variants have filter labels. PASS meaning a ~~temporary pass~~ and overlap, meaning that the variants are overlapping with another variant, implying multiallelicity.

+

The variants have filter labels. PASS meaning a passed variant and overlap, meaning that the variants are overlapping with another variant, implying multiallelicity.

We can count the number of variants with different filters with the following commands.

Line 270: Line 309:

#passed indels of length >4

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>1"

+

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"

#passed singletons of length 4 or insertions of length 3

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"

−

== Comparison with other data sets ==

+

=== Comparison with other data sets ===

It is usually useful to examine the call sets against known data sets for the passed variants.

−

${GC}/bin/vt profile_indels -g indel.reference.txt -r ~~hs37d5~~.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

+

In order to do this, you need to update indel.reference.txt to point to the reference files.

+

cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt

+

Edit indel.reference.txt and specify the correct path to ${SS}

+

nedit ${OUT}/indel.reference.txt

+

*'''Replace all occurrences of <code>username</code> with your username (or the correct path to your seqshop example directory).'''

+

*:[[File:IndelRef.png]]

+

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

+

data set

Line 318: Line 366:

We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity.

−

${GC}/bin/vt profile_indels -g indel.reference.txt -r ~~hs37d5~~.fa all.genotypes.~~bcf~~ -i 22:36000000-37000000 -f "~PASS"

+

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"

data set

Line 346: Line 394:

This analysis supports filters too.

−

==Normalization==

+

===Normalization===

A slight digression here, when analyzing indels, it is important to normalize it. While it is a simple concept,

Line 366: Line 414:

|-

! scope="col"| Dataset

−

~~! scope="col"| Freebayes~~

! scope="col"| Haplotyecaller

! scope="col"| PINDEL

Line 377: Line 424:

|-

| Left trim

−

~~| 27069~~

| 1

| 0

Line 386: Line 432:

|-

| Left aligned

−

~~| 3~~

| 1

Line 397: Line 442:

|-

| Left trim

−

~~| 40782~~

| 0

Line 406: Line 450:

|-

| Left aligned

−

~~| 1892~~

| 0

Line 415: Line 458:

|-

| Right trimmed

−

~~| 0~~

| 0

Line 426: Line 468:

|-

| Duplicate variants

−

~~| 0~~

| 1

| 155

Line 439: Line 480:

- about a loss of 11% of variant thought distinct.

−

To normalize and remove duplicate variants:

+

For chromosome 22, there are 121 variants, but only 106 distinct Indels after normalization

+

- about a loss of 12% of variant thought distinct.

+

To normalize and remove duplicate variants for chromosome 22:

−

${GC}/bin/vt normalize ~~mills~~.~~genotypes~~.bcf -r ~~hs37d5~~.fa | ${GC}/bin/vt mergedups - -o mills.normalized.genotypes.bcf

+

${GC}/bin/vt normalize ${SS}/ref22/mills_indels_hg19.22.sites.bcf -r ${SS}/ref22/human.g1k.v37.chr22.fa | ${GC}/bin/vt mergedups - -o ${OUT}/mills.22.normalized.genotypes.bcf

−

and you will observe that ~~3994~~ variants had to be left aligned and ~~1092~~ variants were removed.

+

and you will observe that 43 variants had to be left aligned and 15 variants were removed.

−

stats: biallelic

+

<pre>

+

stats: biallelic

no. left trimmed : 0

no. left trimmed and left aligned : 0

no. left trimmed and right trimmed : 0

−

no. left aligned : ~~3994~~

+

no. left aligned : 43

−

no. right trimmed : 0 ~~<br>~~

+

no. right trimmed : 0

−

multiallelic

+

total no. biallelic normalized : 43

+

multiallelic

no. left trimmed : 0

no. left trimmed and left aligned : 0

no. left trimmed and right trimmed : 0

no. left aligned : 0

−

no. right trimmed : 0 ~~<br>~~

+

no. right trimmed : 0

−

no. variants observed : ~~9996 <br>~~

+

−

~~<br>~~

+

total no. multiallelic normalized : 0

−

stats: Total number of observed variants ~~9996~~

+

−

Total number of unique variants ~~8904~~ <br>

+

total no. variants normalized : 43

+

total no. variants observed : 121

+

stats: Total number of observed variants 121

+

Total number of unique variants 106

+

</pre>

+

Let's look at the last two variants that were normalized.

+

${GC}/bin/vt view ${OUT}/mills.22.normalized.genotypes.bcf | grep OLD_VARIANT |tail -2

+

Results:

+

*The positions have changed - the were:

+

** 48831918 & 50012616 (as seen after OLD_VARIANT)

+

*Now they are:

+

** 48831915 & 50012614

+

[[File:indelNormalize.png|800px]]

UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.

Mktrost

Administrators

3,045

edits

Changes

SeqShop: Variant Calling and Filtering for INDELs Practical, June 2014 (view source)

Revision as of 14:36, 13 November 2014

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools