Changes

1,457 bytes added , 17:09, 17 December 2014

→‎Normalization

Line 1: Line 1: +

==Introduction==

+

Main Workshop wiki page: [[SeqShop: December 2014]]

+

See the [[Media:Dec2014 SeqShop - GotCloud indel.pdf|introductory slides]] for an intro to this tutorial.

+

== Goals of This Session ==

* What we want to learn

Line 4: Line 9:

** How to examine the variants at particular genomic positions

** How to evaluate the quality of INDEL calls

−

~~[[Media:Variant Calling and Filtering for INDELs.pdf|Lecture Slides]]~~

== Setup in person at the SeqShop Workshop ==

''This section is specifically for the SeqShop Workshop computers.''

−

+

''If you are not running during the SeqShop Workshop, please skip this section.''

Line 41: Line 44:

== Setup when running on your own outside of the SeqShop Workshop ==

''This section is specifically for running on your own outside of the SeqShop Workshop.''

−

+

''If you are running during the SeqShop Workshop, please skip this section.''

Line 61: Line 64:

== Running GotCloud Indel ==

−

${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 2 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}

+

${GC}/gotcloud indel --conf ${SS}/gotcloud.conf --numjobs 6 --region 22:36000000-37000000 --base_prefix ${SS} --outdir ${OUT}

* <code>${GC}/gotcloud</code> runs GotCloud

* <code>indel</code> tells GotCloud you want to run the indel calling pipeline.

Line 73: Line 76:

** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}

** Alternatively, gotcloud.conf could be updated to specify the full paths

−

* <code>--~~out_dir~~</code> tells GotCloud where to write the output.

+

* <code>--outdir</code> tells GotCloud where to write the output.

** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line

Line 82: Line 85:

</div>

−

This should take about 4-5 minutes to run.

+

This should take about 2-4 minutes to run.

−

* It should end with a line like: <code>Commands finished in ~~289~~ secs with no errors reported</code>

+

* It should end with a line like: <code>Commands finished in 125 secs with no errors reported</code>

If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off.

−

== Examining GotCloud indel Ouptut ==

Line 216: Line 218:

The columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, Genotype fields denoted by the sample name.

−

22 36662041 . AATAATT A 756 PASS AC=2;AN=114;AF=0.0175439;GC=55,2,0;GN=57;

+

22 36662041 . AATAATT A 756 PASS AC=2;AN=114;AF=0.0175439;GC=55,2,0;GN=57;

−

GF=0.964912,0.0350877,0;NS=57;

+

GF=0.964912,0.0350877,0;NS=57;

−

HWEAF=0.019571;HWEGF=0.961242,0.038376,0.000383024;

+

HWEAF=0.019571;HWEGF=0.961242,0.038376,0.000383024;

−

MLEAF=0.0196187;MLEGF=0.960762,0.0392374,2.11537e-15;

+

MLEAF=0.0196187;MLEGF=0.960762,0.0392374,2.11537e-15;

−

HWE_LLR=-0.0222464;HWE_LPVAL=-0.182794;HWE_DF=1;

+

HWE_LLR=-0.0222464;HWE_LPVAL=-0.182794;HWE_DF=1;

−

FIC=-0.00372601;AB=0.384578

+

FIC=-0.00372601;AB=0.384578

−

GT:PL:DP:AD:GQ 0/0:0,6,145:3:2,0,1:7 0/0:0,12,192:4:4,0,0:12

+

GT:PL:DP:AD:GQ 0/0:0,6,145:3:2,0,1:7 0/0:0,12,192:4:4,0,0:12

Here is a description of the record's fields.

−

22 : chromosome

+

22 : chromosome

−

36662041 : genome position

+

36662041 : genome position

−

. : this is the ID field that is left blank.

+

. : this is the ID field that is left blank.

−

AATAATT : the reference sequence that is replaced by the alternative sequence below.

+

AATAATT : the reference sequence that is replaced by the alternative sequence below.

−

A : so this is basically a deletion of GT

+

A : so this is basically a deletion of ATAATT.

−

756 : QUAL field denoting validity of this variant, higher the better.

+

756 : QUAL field denoting validity of this variant, higher the better.

−

PASS : a passed variant.

+

PASS : a passed variant.

−

~~INFO~~ : fields containing information about the variant.

+

AC=2... : fields containing information about the variant.

−

~~FORMAT~~ : format field labels for the genotype columns.

+

GT:PL:DP:AD:GQ : format field labels for the genotype columns.

−

0/0:0,6,145:3:2,0,1:7 : genotype information.

+

0/0:0,6,145:3:2,0,1:7 : genotype information.

You can obtain the same output by using the following command

−

${GC}/bin/vt view -i 22:~~36990878~~-~~36990879~~ ${OUT}/indel/final/all.genotypes.vcf.gz

+

${GC}/bin/vt view -i 22:36662041-36662041 ${OUT}/indel/final/all.genotypes.vcf.gz

* -i specifies the region

* You can leave it out and look at all the records

Line 250: Line 252:

AF=0.017 : allele frequency based on AC/AN

GC=55,2,0 : genotype counts for 0/0, 0/1, 1/1

−

GF=0.55,0.34,0~~.10~~ : genotype frequencies based on GC

+

GF=0.96,0.04,0 : genotype frequencies based on GC

NS=57 : no. of samples

−

HWEAF=0.28 : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium

+

HWEAF=0.020 : genotype likelihood based estimation of the allele frequency assuming Hardy Weinberg equilibrium

−

HWEGF=0.52,0.40,0.08 : genotype frequency derived from HWEAF

+

HWEGF=0.96,0.04,0.00 : genotype frequency derived from HWEAF

−

HWE_LPVAL=-1.08 : log p value of HWE test

+

HWE_LPVAL=-0.18 : log p value of HWE test

−

FIC=-0.07 : genotype likelihood based inbreeding coefficient

+

FIC=-0.003 : genotype likelihood based inbreeding coefficient, ranges -1 to 1. <0 denotes excess of heterozygotes and >0 means excess of homozygotes assuming HWE.

−

AB=0.61 : genotype likelihood based allele balance

+

AB=0.38 : genotype likelihood based allele balance, ranges 0 to 1 with 0.5 for balance, >0.5 meaning reference bias and <0.5 denoting alternate allele bias.

=====GENOTYPE field=====

Line 263: Line 265:

0/0 : homozygous reference chosen based on PL

−

0,9,~~158~~ : PHRED scaled genotype likelihoods

+

0,6,145 : PHRED scaled genotype likelihoods

3 : no. of reads covering this variant

−

3,0,0 : allele depth

+

2,0,1 : allele depth

counts of reads supporting the reference allele,

the alternative allele and neither alleles respectively.

Line 271: Line 273:

coverage of the read over the locus

or simply an allele that is not accounted for.

−

10 : genotype quality

+

7 : genotype quality

== INDEL Analysis ==

Line 281: Line 283:

First you want to know what is in the vcf file.

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz

stats: no. of samples : 62

Line 295: Line 297:

We can count the number of variants with different filters with the following commands.

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS"

stats: no. of samples : 62

no. of chromosomes : 1 <br>

−

no. Indels : ~~584~~

+

no. Indels : 583

−

2 alleles (ins/del) : ~~584~~ (0.69) [239/~~345~~]

+

2 alleles (ins/del) : 583 (0.69) [239/344]

>=3 alleles (ins/del) : 0 (-nan) [0/0]

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.overlap"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.overlap"

stats: no. of samples : 62

no. of chromosomes : 1 <br>

−

no. Indels : ~~136~~

+

no. Indels : 137

−

2 alleles (ins/del) : ~~136~~ (1.89) [89/47] #notice the difference in insertion deletion ratios

+

2 alleles (ins/del) : 137 (1.85) [89/48] #notice the difference in insertion deletion ratios

>=3 alleles (ins/del) : 0 (-nan) [0/0]

#passed singletons only

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&INFO.AC==1"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&INFO.AC==1"

#passed indels of length 1 only

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN==1"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN==1"

#passed indels of length >4

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&LEN>4"

#passed singletons of length 4 or insertions of length 3

−

${GC}/bin/vt peek ${OUT}/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"

+

${GC}/bin/vt peek ${OUT}/indel/final/all.genotypes.vcf.gz -f "FILTER.PASS&&(LEN==4||DLEN==3)"

=== Comparison with other data sets ===

It is usually useful to examine the call sets against known data sets for the passed variants.

−

+

−

~~In order~~ to ~~do this~~, you need to update indel.reference.txt to point to the reference files.

+

''Command to use at SeqShop Workshop:''

+

+

${GC}/bin/vt profile_indels -g ${SS}/ref22/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

+

</div>

+

</div>

+

+

''Commands outside of SeqShop Workshop:''

+

+

Outside the workshop, you need to update indel.reference.txt to point to the reference files.

cp ${SS}/ref22/indel.reference.txt ${OUT}/indel.reference.txt

Edit indel.reference.txt and specify the correct path to ${SS}

nedit ${OUT}/indel.reference.txt

−

*'''Replace ~~all occurrences of <code>username</code>~~ with ~~your username (or~~ the ~~correct~~ path to your seqshop example directory).'''

+

*'''Replace the path to the reference files with the path to your seqshop example directory.'''

*:[[File:IndelRef.png]]

−

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

+

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "PASS"

+

</div>

+

</div>

Line 355: Line 367:

Sensitivity 69.6% <br>

dbsnp #Indels from dbSNP

−

A-B ~~405~~ [0.68]

+

A-B 404 [0.68]

−

A&B ~~208~~ [0.79]

+

A&B 209 [0.79]

−

B-A ~~494~~ [2.03]

+

B-A 493 [2.04]

−

Precision 33.9%

+

Precision 34.1%

−

Sensitivity 29.6%

+

Sensitivity 29.8%

Ins/Del ratios: Reference alignment based methods tend to be biased towards the detection of deletions. This provides a useful measure for discovery Indel sets to show the varying degree of biasness. It also appears that as coverage increases, the ins/del ratio tends to 1.

Line 375: Line 387:

We perform the same analysis for the failed variants again, the relatively low overlap with known data sets imply a reasonable tradeoff in sensitivity and specificity.

−

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"

+

+

''Command to use at SeqShop Workshop:''

+

+

${GC}/bin/vt profile_indels -g ${SS}/ref22/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"

+

</div>

+

</div>

+

+

''Command outside of SeqShop Workshop:''

+

+

${GC}/bin/vt profile_indels -g ${OUT}/indel.reference.txt -r ${SS}/ref22/human.g1k.v37.chr22.fa ${OUT}/indel/final/all.genotypes.vcf.gz -i 22:36000000-37000000 -f "~PASS"

+

</div>

+

</div>

+

data set

Line 537: Line 561:

UMICH's algorithm for normalization has been adopted by Petr Danecek in bcftools and is also used in GKNO.

+

== Return to Workshop Wiki Page ==

+

Return to main workshop wiki page: [[SeqShop: December 2014]]

Mktrost

Administrators

3,045

edits

Changes

SeqShop: Variant Calling and Filtering for INDELs Practical, December 2014 (view source)

Revision as of 17:09, 17 December 2014

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools