Changes

1,714 bytes added , 17:09, 17 December 2014

→‎Normalization

Line 1: Line 1: +

==Introduction==

+

Main Workshop wiki page: [[SeqShop: December 2014]]

+

See the [[Media:Dec2014 SeqShop - GotCloud indel.pdf|introductory slides]] for an intro to this tutorial.

+

== Goals of This Session ==

* What we want to learn

Line 4: Line 9:

** How to examine the variants at particular genomic positions

** How to evaluate the quality of INDEL calls

−

~~[[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]]~~

== Setup in person at the SeqShop Workshop ==

''This section is specifically for the SeqShop Workshop computers.''

−

+

''If you are not running during the SeqShop Workshop, please skip this section.''

Line 41: Line 44:

== Setup when running on your own outside of the SeqShop Workshop ==

''This section is specifically for running on your own outside of the SeqShop Workshop.''

−

+

''If you are running during the SeqShop Workshop, please skip this section.''

Line 61: Line 64:

== Running GotCloud Indel ==

−

${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}

+

${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 6 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}

* <code>${GC}/gotcloud</code> runs GotCloud

* <code>indel</code> tells GotCloud you want to run the indel calling pipeline.

Line 73: Line 76:

** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}

** Alternatively, gotcloud.conf could be updated to specify the full paths

−

* <code>--~~out_dir~~</code> tells GotCloud where to write the output.

+

* <code>--outdir</code> tells GotCloud where to write the output.

** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line

Line 82: Line 85:

</div>

−

This should take about 4-5 minutes to run.

+

This should take about 2-4 minutes to run.

−

* It should end with a line like: <code>Commands finished in ~~289~~ secs with no errors reported</code>

+

* It should end with a line like: <code>Commands finished in 125 secs with no errors reported</code>

If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off.

−

== Examining GotCloud indel Ouptut ==

Line 102: Line 104:

Let's look at the <code>indel</code> directory:

ls ${OUT}/indel

+

* You should see 3 directories: <code>aux final indelvcf</code>

+

** aux & indelvcf are intermediate directories

+

** final contains the final indel output

Let's look at the <code>indel/final</code> directory:

Line 135: Line 140:

Let's see if we found the indel

−

${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041 | head -1

+

${GC}/bin/tabix ${OUT}/indel/final/all.genotypes.vcf.gz 22:36662041 | head -1

+

Did you see a variant at the position?

+

+

* Yes, Chr: 22; Pos: 36662041; Ref: AATAATT; Alt: A

+

</div>

+

</div>

+

Let's check the sequence data to confirm that the variant really exists

Line 163: Line 174:

==== Header ====

First, let's look at the header:

−

${GC}/bin/tabix -H ${OUT}/final/all.genotypes.vcf.gz

+

${GC}/bin/tabix -H ${OUT}/indel/final/all.genotypes.vcf.gz

The header is as follows:

−

##fileformat=VCFv4.2

+

<pre>

−

##FILTER=<ID=PASS,Description="All filters passed">

+

##fileformat=VCFv4.2

−

##contig=<ID=22,length=51304566>

+

##FILTER=<ID=PASS,Description="All filters passed">

−

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

+

##contig=<ID=22,length=51304566>

−

##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes">

+

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

−

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Depth">

+

##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes">

−

##FORMAT=<ID=AD,Number=3,Type=Integer,Description="Allele Depth">

+

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Depth">

−

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

+

##FORMAT=<ID=AD,Number=3,Type=Integer,Description="Allele Depth">

−

##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate Allele Counts">

+

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

−

##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Number Allele Counts">

+

##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate Allele Counts">

−

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

+

##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Number Allele Counts">

−

##INFO=<ID=AF,Number=A,Type=Float,Description="Alternate Allele Frequency">

+

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

−

##INFO=<ID=GC,Number=G,Type=Integer,Description="Genotype Counts">

+

##INFO=<ID=AF,Number=A,Type=Float,Description="Alternate Allele Frequency">

−

##INFO=<ID=GN,Number=1,Type=Integer,Description="Total Number of Genotypes Counts">

+

##INFO=<ID=GC,Number=G,Type=Integer,Description="Genotype Counts">

−

##INFO=<ID=GF,Number=G,Type=Float,Description="Genotype Frequency">

+

##INFO=<ID=GN,Number=1,Type=Integer,Description="Total Number of Genotypes Counts">

−

##INFO=<ID=HWEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency assuming HWE">

+

##INFO=<ID=GF,Number=G,Type=Float,Description="Genotype Frequency">

−

##INFO=<ID=HWEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency assuming HWE">

+

##INFO=<ID=HWEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency assuming HWE">

−

##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency">

+

##INFO=<ID=HWEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency assuming HWE">

−

##INFO=<ID=MLEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency">

+

##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency">

−

##INFO=<ID=HWE_LLR,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg ln(Likelihood Ratio)">

+

##INFO=<ID=MLEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency">

−

##INFO=<ID=HWE_LPVAL,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic ln(p-value)">

+

##INFO=<ID=HWE_LLR,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg ln(Likelihood Ratio)">

−

##INFO=<ID=HWE_DF,Number=1,Type=Integer,Description="Degrees of freedom for Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic">

+

##INFO=<ID=HWE_LPVAL,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic ln(p-value)">

−

##INFO=<ID=FIC,Number=1,Type=Float,Description="Genotype likelihood based Inbreeding Coefficient">

+

##INFO=<ID=HWE_DF,Number=1,Type=Integer,Description="Degrees of freedom for Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic">

−

##INFO=<ID=AB,Number=1,Type=Float,Description="Genotype likelihood based Allele Balance">

+

##INFO=<ID=FIC,Number=1,Type=Float,Description="Genotype likelihood based Inbreeding Coefficient">

−

~~##FILTER=<ID=PASS,Description="Temporary pass">~~

+

##INFO=<ID=AB,Number=1,Type=Float,Description="Genotype likelihood based Allele Balance">

−

##FILTER=<ID=overlap,Description="Overlapping variant">

+

##FILTER=<ID=overlap,Description="Overlapping variant">

−

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00551 HG00553 HG00554 HG00637 HG00638 ~~HG00640 HG00641~~ HG00734 HG00736 HG00737 HG00739 HG00740 HG01047 HG01049 HG01051 HG01052 HG01054 HG01055 HG01060 HG01061 HG01066 HG01067 HG01069 HG01070 HG01072 HG01073 HG01075 HG01079 HG01080 HG01082 HG01083 HG01094 HG01097 HG01098 HG01101 HG01102 HG01107 HG01108 HG01110 HG01111 HG01167 HG01168 HG01170 HG01171 HG01173 HG01174 HG01176 HG01177 HG01182 HG01183 HG01187 HG01188 HG01190 HG01191 HG01197 HG01198 HG01204 HG01205 HG01241 HG01242 HG01247 HG01248

+

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00641 HG00640 HG00551 HG00553 HG00554 HG00637 HG00638 HG00734 HG00736 HG00737 HG00739 HG00740 HG01047 HG01049 HG01051 HG01052 HG01054 HG01055 HG01060 HG01061 HG01066 HG01067 HG01069 HG01070 HG01072 HG01073 HG01075 HG01079 HG01080 HG01082 HG01083 HG01094 HG01097 HG01098 HG01101 HG01102 HG01107 HG01108 HG01110 HG01111 HG01167 HG01168 HG01170 HG01171 HG01173 HG01174 HG01176 HG01177 HG01182 HG01183 HG01187 HG01188 HG01190 HG01191 HG01197 HG01198 HG01204 HG01205 HG01241 HG01242 HG01247 HG01248

+

</pre>

Using [[Vt]], we can see the same output

−

${GC}/bin/vt view -H ${OUT}/final/all.genotypes.vcf.gz

+

${GC}/bin/vt view -H ${OUT}/indel/final/all.genotypes.vcf.gz

====Records====

Line 202: Line 214:

To view a specific region of records (such as APOL1 g2 allele)

−

${GC}/bin/tabix ${OUT}/final/all.genotypes.vcf.gz 22:36662041-36662041

+

${GC}/bin/tabix ${OUT}/indel/final/all.genotypes.vcf.gz 22:36662041-36662041

The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.

−

22 36662041 . AATAATT A 756 PASS AC=2;AN=114;AF=0.0175439;GC=55,2,0;GN=57;

+

22 36662041 . AATAATT A 756 PASS AC=2;AN=114;AF=0.0175439;GC=55,2,0;GN=57;

−

GF=0.964912,0.0350877,0;NS=57;

+

GF=0.964912,0.0350877,0;NS=57;

−

HWEAF=0.019571;HWEGF=0.961242,0.038376,0.000383024;

+

HWEAF=0.019571;HWEGF=0.961242,0.038376,0.000383024;

−

MLEAF=0.0196187;MLEGF=0.960762,0.0392374,2.11537e-15;

+

MLEAF=0.0196187;MLEGF=0.960762,0.0392374,2.11537e-15;

−

HWE_LLR=-0.0222464;HWE_LPVAL=-0.182794;HWE_DF=1;

+

HWE_LLR=-0.0222464;HWE_LPVAL=-0.182794;HWE_DF=1;

−

FIC=-0.00372601;AB=0.384578

+

FIC=-0.00372601;AB=0.384578

−

GT:PL:DP:AD:GQ 0/0:0,9,~~158~~:3:3,0,0:10 0/0:0,18,~~281~~:6:6,0,0:18

+

GT:PL:DP:AD:GQ 0/0:0,6,145:3:2,0,1:7 0/0:0,12,192:4:4,0,0:12

Here is a description of the record's fields.

−

22 : chromosome

+

22 : chromosome

−

36662041 : genome position

+

36662041 : genome position

−

. : this is the ID field that is left blank.

+

. : this is the ID field that is left blank.

−

AATAATT : the reference sequence that is replaced by the alternative sequence below.

+

AATAATT : the reference sequence that is replaced by the alternative sequence below.

−

A : so this is basically a deletion of GT

+

A : so this is basically a deletion of ATAATT.

−

756 : QUAL field denoting validity of this variant, higher the better.

+

756 : QUAL field denoting validity of this variant, higher the better.

−

PASS : a passed variant.

+

PASS : a passed variant.

−

~~INFO~~ : fields containing information about the variant.

+

AC=2... : fields containing information about the variant.

−

~~FORMAT~~ : format field labels for the genotype columns.

+

GT:PL:DP:AD:GQ : format field labels for the genotype columns.

−

0/0:0,9,~~158~~:3:3,0,0:10 : genotype information.

+

0/0:0,6,145:3:2,0,1:7 : genotype information.

You can obtain the same output by using the following command

−

${GC}/bin/vt view -i 22:~~36990878~~-~~36990879~~ ${OUT}/final/all.genotypes.vcf.gz

+

${GC}/bin/vt view -i 22:36662041-36662041 ${OUT}/indel/final/all.genotypes.vcf.gz

* -i specifies the region

* You can leave it out and look at all the records

Line 240: Line 252:

AF=0.017 : allele frequency based on AC/AN

GC=55,2,0 : genotype counts for 0/0, 0/1, 1/1

−

GF=0.55,0.34,0~~.10~~ : genotype frequencies based on GC

+

GF=0.96,0.04,0 : genotype frequencies based on GC

NS=57 : no. of samples

−

HWEAF=0.28 : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium

+

HWEAF=0.020 : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium

−

HWEGF=0.52,0.40,0.08 : genotype frequency derived from HWEAF

+

HWEGF=0.96,0.04,0.00 : genotype frequency derived from HWEAF

−

HWE_LPVAL=-1.08 : log p value of HWE test

+

HWE_LPVAL=-0.18 : log p value of HWE test

−

FIC=-0.07 : genotype likelihood based inbreeding coefficient

+

FIC=-0.003 : genotype likelihood based inbreeding coefficient, ranges -1 to 1. <0 denotes excess of heterozygotes and >0 means excess of homozygotes assuming HWE.

−

AB=0.61 : genotype likelihood based allele balance

+

AB=0.38 : genotype likelihood based allele balance, ranges 0 to 1 with 0.5 for balance, >0.5 meaning reference bias and <0.5 denoting alternate allele bias.

=====GENOTYPE field=====

Line 253: Line 265:

0/0 : homozygous reference chosen based on PL

−

0,9,~~158~~ : PHRED scaled genotype likelihoods

+

0,6,145 : PHRED scaled genotype likelihoods

3 : no. of reads covering this variant

−

3,0,0 : allele depth

+

2,0,1 : allele depth

counts of reads supporting the reference allele,

the alternative allele and neither alleles respectively.

Line 261: Line 273:

coverage of the read over the locus

or simply an allele that is not accounted for.

−

10 : genotype quality

+

7 : genotype quality

== INDEL Analysis ==

Line 271: Line 283:

First you want to know what is in the vcf file.

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz

stats: no. of samples : 62

Line 285: Line 297:

We can count the number of variants with different filters with the following commands.

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS"

stats: no. of samples : 62

no. of chromosomes : 1 <br>

−

no. Indels : ~~584~~

+

no. Indels : 583

−

2 alleles (ins/del) : ~~584~~ (0.69) [239/~~345~~]

+

2 alleles (ins/del) : 583 (0.69) [239/344]

>=3 alleles (ins/del) : 0 (-nan) [0/0]

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.overlap"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.overlap"

stats: no. of samples : 62

no. of chromosomes : 1 <br>

−

no. Indels : ~~136~~

+

no. Indels : 137

−

2 alleles (ins/del) : ~~136~~ (1.89) [89/47] #notice the difference in insertion deletion ratios

+

2 alleles (ins/del) : 137 (1.85) [89/48] #notice the difference in insertion deletion ratios

>=3 alleles (ins/del) : 0 (-nan) [0/0]

#passed singletons only

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&INFO.AC==1"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&INFO.AC==1"

#passed indels of length 1 only

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN==1"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN==1"

#passed indels of length >4

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"

#passed singletons of length 4 or insertions of length 3

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"

=== Comparison with other data sets ===

It is usually useful to examine the call sets against known data sets for the passed variants.

−

+

−

~~In order~~ to ~~do this~~, you need to update indel.reference.txt to point to the reference files.

+

''Command to use at SeqShop Workshop:''

+

+

${GC}/bin/vt profile_indels -g ${SS}/ref22/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

+

</div>

+

</div>

+

+

''Commands outside of SeqShop Workshop:''

+

+

Outside the workshop, you need to update indel.reference.txt to point to the reference files.

cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt

Edit indel.reference.txt and specify the correct path to ${SS}

nedit ${OUT}/indel.reference.txt

−

*'''Replace ~~all occurrences of <code>username</code>~~ with ~~your username (or~~ the ~~correct~~ path to your seqshop example directory).'''

+

*'''Replace the path to the reference files with the path to your seqshop example directory.'''

*:[[File:IndelRef.png]]

−

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

+

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

+

</div>

+

</div>

Line 345: Line 367:

Sensitivity 69.6% <br>

dbsnp #Indels from dbSNP

−

A-B ~~405~~ [0.68]

+

A-B 404 [0.68]

−

A&B ~~208~~ [0.79]

+

A&B 209 [0.79]

−

B-A ~~494~~ [2.03]

+

B-A 493 [2.04]

−

Precision 33.9%

+

Precision 34.1%

−

Sensitivity 29.6%

+

Sensitivity 29.8%

Ins/Del ratios: Reference alignment based methods tend to be biased towards the detection of deletions. This provides a useful measure for discovery Indel sets to show the varying degree of biasness. It also appears that as coverage increases, the ins/del ratio tends to 1.

Line 365: Line 387:

We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity.

−

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"

+

+

''Command to use at SeqShop Workshop:''

+

+

${GC}/bin/vt profile_indels -g ${SS}/ref22/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"

+

</div>

+

</div>

+

+

''Command outside of SeqShop Workshop:''

+

+

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"

+

</div>

+

</div>

+

data set

Line 527: Line 561:

UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.

+

== Return to Workshop Wiki Page ==

+

Return to main workshop wiki page: [[SeqShop: December 2014]]

Mktrost

Administrators

3,045

edits

Changes

SeqShop: Variant Calling and Filtering for INDELs Practical, December 2014 (view source)

Revision as of 17:09, 17 December 2014

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools