Changes

80,829 bytes added , 04:36, 4 May 2021

Line 1: Line 1: −

=== Introduction ===

+

= Introduction =

−

vt is a variant tool set that discovers short variants from Next Generation Sequencing data~~. The features are being rolled out to github as major rewriting is being undertaken~~.

+

vt is a variant tool set that discovers short variants from Next Generation Sequencing data.

−

=~~== Location ==~~=

+

= Installation =

−

~~The source files are housed in github.~~

+

== General ==

−

git clone https://github.com/atks/vt.git

+

The source files are housed in github. [https://github.com/samtools/htslib htslib] is

+

used and a copy of a developmental freeze is stored as part of the vt repository to

+

ensure compatibility.

+

To install, perform the following steps:

+

#this will create a directory named vt in the directory you cloned the repository

+

1. git clone https://github.com/atks/vt.git <br>

+

#change directory to vt

+

2. cd vt <br>

+

#update submodules

+

3. git submodule update --init --recursive <br>

+

#run make, note that compilers need to support the c++0x standard

+

4. make <br>

+

#you can test the build

+

5. make test

+

+

An expected output when all is well for the tests is shown here. (click expand =>)

+

+

user@server:~/vt$ make test

+

test/test.sh

+

++++++++++++++++++++++

+

Tests for vt normalize

+

++++++++++++++++++++++

+

testing normalize

+

output VCF file : ok

+

output logs : ok

+

+++++++++++++++++++++++++++++++

+

Tests for vt decompose_blocksub

+

+++++++++++++++++++++++++++++++

+

testing decompose_blocksub of even-length blocks

+

output VCF file : ok

+

output logs : ok

+

testing decompose_blocksub with alignment

+

output VCF file : ok

+

output logs : ok

+

testing decompose_blocksub of phased even-length blocks

+

output VCF file : ok

+

output logs : ok

+

++++++++++++++++++++++

+

Tests for vt decompose

+

++++++++++++++++++++++

+

testing decompose for a triallelic variant

+

output VCF file : ok

+

output logs : ok <br>

+

Passed tests : 5 / 5

+

</div>

+

</div>

+

=== Mac ===

+

You may install vt via homebrew.

+

brew tap brewsci/bio

+

brew tap brewsci/science

+

brew install brewsci/bio/vt

+

Building has been tested on Linux and Mac systems on gcc 4.8.1 and clang 3.4. <br>

+

Some features of C++11 are used, thus there is a need for newer versions of gcc and clang.

+

= Updating =

+

vt is currently under heavy development, you will probably need to update often.

+

#remove all object files

+

#you need to do this as source files as the static libraries might have changed and need to be removed.

+

1. make clean <br>

+

#update source files

+

2. git pull <br>

+

#compile and link, the -j option tells Makefile to run up to 40 independent commands in parallel

+

3. make -j 40

+

= General Features and notes =

== Common options ==

Line 15: Line 89:

-I multiple intervals in <seq>:<start>-<end> format listed in a text file line by line.

−

-o defines the out file which and has the STDOUT set as the default.

+

-o defines the out file which and has the STDOUT set as the default.

−

You may modify the STDOUT to output the binary version of the format.

+

vt recognizes the appropriate output by file extension.

+

<name>.vcf - uncompressed VCF

+

<name>.vcf.gz - compressed VCF

+

<name>.bcf - BCF

+

You may modify the STDOUT to output the binary version of the format. Uncompressed

+

VCF and BCF streams are indicated by - and + respectively.

+

-f filter expression

+

-s sequential region selection as opposed to random access of regions specified by the i option.

+

This is useful when you want to select many close-by regions, while the -i option works,

+

it is less efficient and also selects a variant multiple times if it overlaps 2 regions. This

+

option iterates through the variants in the file sequentially and checks for overlap with the

+

bed file given.

+

== Uncompressed BCF streams ==

+

htslib is designed with BCF as the underlying data structure and it has incorporated

+

awareness of uncompressed BCF streams in the i/o API. One may use this feature to

+

stream uncompressed BCF records to save on computational time spent on (de)compression.

+

#using textual VCF streams indicated by -

+

cat mills.vcf | vt normalize - -r hs37d5.fa | vt uniq - -o out.bcf

+

#using uncompressed BCF streams indicated by +

+

cat mills.vcf | vt normalize - -r hs37d5.fa -o + | vt uniq + -o out.bcf

+

In this example, the former took 0.84s while the latter took 0.64s to process. (24% speed up!)

+

== Filters ==

+

For some programs. you may define a filter via the -f option.

+

This allows you to only analyse biallelic indels that are passed on chromosome 20.

+

vt profile_na12878 vt.bcf -g na12878.reference.txt -r genome.fa -f "N_ALLELE==2&&VTYPE==INDEL&&PASS" -i 20

+

This allows you to extract biallelic indels that are passed on chromosome 20.

+

vt view vt.bcf -f "N_ALLELE==2&&VTYPE==INDEL&&PASS" -i 20

−

~~== Programs ==~~

+

Other examples of filters

−

=== ~~Normalization~~ ===

+

#all variants with a SNP in them

+

VTYPE&SNP

+

#Simple insertions of length 1

+

VTYPE==INDEL&&DLEN==1

+

#Indels of length 1

+

VTYPE==INDEL&&LEN==1

+

Variant characteristics

+

VTYPE,N_ALLELE,DLEN,LEN,VARIANT_CONTAINS_N

+

Variant value types

+

SNP,MNP,INDEL,CLUMPED

+

Biallelic SNPs only : VTYPE==SNP&&N_ALLELE==2

+

Biallelic Indels with embedded SNP : VTYPE==(SNP|INDEL)&&N_ALLELE==2

+

Biallelic variants involving insertions : VTYPE&INDEL&&DLEN>0&&N_ALLELE==2

+

Biallelic variants involving 1bp variants : LEN==1&&N_ALLELE==2

+

Variants with explicit sequences with no Ns : ~VARIANT_CONTAINS_N

+

REF field

+

REF

+

ALT field

+

ALT

+

QUAL field

+

QUAL

+

FILTER fields

+

PASS, FILTER.<tag>

+

INFO fields

+

INFO.<tag>

+

A/C SNPs : REF=='A' && ALT=='C'

+

AC type of STRs : REF=~'^.(AC)+$' || ALT=~'^.(AC)+$'

+

Passed biallelic SNPs only : PASS&&VTYPE==SNP&&N_ALLELE==2

+

Passed Common biallelic SNPs only : PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005

+

Passed Common biallelic SNPs or rare indels : (PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005)||(VTYPE&INDEL&&INFO.AF<=0.005)

+

Passed Common biallelic SNPs or rare indels : ((PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005)||(VTYPE&INDEL&&INFO.AF<=0.005))&&QUAL>100

+

with quality greater than 100

+

Failed rare variants : ~PASS&&(INFO.AC/INFO.AN<0.005)

+

[http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC1 Regular expression] matching PERL style (implemented with pcre2)

+

Sometimes, an info field will contain several values in a string with functional annotation, to match what you want,

+

just use INFO.ANNO=~'<perl regular expression>'

+

Passed variants in intergenic regions or UTR : PASS&&INFO.ANNO=~'Intergenic|UTR'

+

Passed variants in intergenic regions or UTR ignoring case : PASS&&INFO.ANNO=~'(?i)Intergenic|UTR' <br>

+

pcre2's '(?i)Intergenic|UTR' is equivalent to PERL's '/intergenic|UTR/i'

+

Operations

+

== : equivalence for strings and numbers

+

!= : not equal

+

=~ : regular expression match for strings only

+

~~ : not of =~. Is equivalent to PERL's !~, this notation is used as BASH keeps interpreting ! for recalling commands from the history

+

~ : logical not

+

&& : logical and

+

|| : logical or

+

& : bitwise and

+

| : bitwise or

+

+ : add

+

- : subtract

+

* : multiply

+

/ : divide

+

The following programs support filter expressions.

+

* view

+

* peek

+

* profile_snps

+

* profile_indels

+

* profile_na12878

+

* profile_mendelian

+

* profile_len

+

* profile_chrom

+

* profile_afs

+

* profile_hwe

+

* concordance

+

* partition

+

== Alternate headers ==

+

As BCF is a restrictive format of VCF where all meta data must be present in the header,

+

vt provides a mechanism to read an alternative header for VCF files that do not have a

+

well formed header. Simply provide a header file stub named as <vcf-file>.hdr and vt

+

will automatically read it instead of the original header in <vcf-file>.

+

For more information about VCF/BCF : http://samtools.github.io/hts-specs/VCFv4.2.pdf

+

<span style="color:#FF0000">This mechanism is available only if one is reading VCF or compressed VCF files. It is

+

disabled for BCF files as this might corrupt the BCF file because the encoding of the

+

fields in BCF records is based on the order of the meta info lines in the header.</span>

+

<span style="color:#0000FF">Note: BCF2.2 introduces the IDX field in meta information lines that indicates the

+

dictionary encoding. This feature might be enabled for BCF files in the future.</span>

+

== General cases of Ploidy and Alleles ==

+

I am trying to make vt handle [http://genome.sph.umich.edu/wiki/Relationship_between_Ploidy,_Alleles_and_Genotypes general cases of ploidy and alleles].

+

Please let me know if that is lacking in a tool that you are using.

+

== BCF Compression Levels vs Compression Time ==

+

The zlib deflation algorithm (a variant of LZ77) has 10 levels - 0 to 9. 0 has no compression but instead wraps <br>

+

up the file in zlib or bgzf blocks. It may be useful to have 0 compression as it is indexable with the same mechanism <br>

+

used for compressed files. Levels 1-9 denote an increasing compression level in exchange for longer times for <br>

+

compression.

+

In general, zlib compression does not have significant differences in compression for BCF files between the 9 compression <br>

+

levels as shown in the following table:

+

{| class="wikitable"

+

|-

+

! scope="col"| Compression Level

+

! scope="col"| Size

+

! scope="col"| Time

+

|-

+

|0<br>

+

1<br>

+

2<br>

+

3<br>

+

4<br>

+

5<br>

+

6 (default) <br>

+

7<br>

+

8<br>

+

9

+

|153GB <br>

+

98.4GB <br>

+

98.0GB <br>

+

97.5GB <br>

+

95.5GB <br>

+

95.2GB <br>

+

94.9GB <br>

+

94.8GB <br>

+

94.76GB <br>

+

94.75GB

+

|45m<br>

+

2h3m <br>

+

2h7m <br>

+

2h12m <br>

+

2h26m <br>

+

2h54m <br>

+

3h19m <br>

+

3h41m <br>

+

4h5m <br>

+

4h25m

+

|}

+

<span style="color:#0000FF">So, it might be a good idea to compress at lower levels when dealing with large temporary <br>

+

files in a pipeline to save compute time. This can be achieved with the -c option in vt view</span>

+

= VCF Manipulation =

+

=== View ===

+

Views a VCF or VCF.GZ or BCF file.

+

+

#views mills.bcf and outputs to standard out

+

vt view -h mills.bcf

+

#views mills.bcf and locally sorts it in a 10000bp window and outputs to sorted-millsbcf

+

vt view -h -w 10000 mills.bcf -o sorted-mills.bcf

+

#views mills.bcf and outputs to c1-mills.bcf with a compression level of 1. By default,

+

#the compression level is 6 where lower levels compress the file less but are faster.

+

#The difference in compression for BCF files between level 1 to level 9 is about 5% of

+

#of a level 1 compression file. The difference in time taken is about an additional 50%

+

#of a level 1 compression. The levels range from 0 to 9 where 0 means no compression

+

#but the file is encapsulated in bgzf blocks that allows the file to be indexed. A special

+

#level -1 denotes an uncompressed BCF file that is not encapsulated in bgzf blocks and

+

#are thus not indexable but are highly suitable for streaming between vt commands.

+

vt view -h mills.bcf -c 1 -o c1-mills.bcf

+

#views mills.bcf and selects variants that overlap with the regions found in dust.bed from chromosome 20

+

#the -t option selects variants by checking if each variant overlaps with the regions in the bed file, this is

+

#as opposed to random accessing the variants via the index through the intervals defined in -i and -I options.

+

#this is useful when selecting variants from the target regions from an exome sequencing experiment.

+

vt view 10000 mills.bcf -t dust.bed -i 20

+

+

usage : vt view [options] <in.vcf>

+

options : -o output VCF/VCF.GZ/BCF file [-]

+

-f filter expression []

+

-w local sorting window size [0]

+

-s print site information only without genotypes [false]

+

-H print header only, this option is honored only for STDOUT [false]

+

-h omit header, this option is honored only for STDOUT [false]

+

-p print options and summary []

+

-r right window size for overlap []

+

-l left window size for overlap []

+

-c compression level 0-9, 0 and -1 denotes uncompressed with the former being wrapped in bgzf. [6]

+

-t bed file for variant selection via streaming []

+

-I file containing list of intervals []

+

-i intervals []

+

-? displays help

+

</div>

+

</div>

+

=== Index ===

+

Indexes a VCF.GZ or BCF file.

+

+

#indexes mills.bcf

+

vt index mills.bcf

+

#indexes mills.vcf.gz

+

vt index mills.vcf.gz

+

+

usage : vt index [options] <in.vcf>

+

options : -p print options and summary []

+

-- ignores the rest of the labeled arguments following this flag

+

-h displays help

+

</div>

+

</div>

+

=== Sorting ===

+

Sorting may be done in 3 approaches.

+

Locally:<br>

+

Performs sorting within a local window. The window size may be set by the -w option. The default window size <br>

+

is 1000bp and if a record is detected to be potentially out of order due to a small window size, it wil be reported.<br>

+

Use this when your VCF records are grouped by chromosome but not ordered in short stretches.<br><br>

+

By chromosome: <br>

+

Your VCF file is not ordered by the chromosomes in the header but is fully ordered within each chromosome.<br>

+

The VCF file should be indexed and vt will output the records in the order of chromosomes given in the header. <br> <br>

+

Full sort [default option]: <br>

+

No assumptions are made about the VCF file. Records will be ordered by the order of contigs in the header. <br>

+

Smaller temporary ordered files are created and their names are <output_vcf>.<no>.bcf and after generating <br>

+

these files, they are merged and output into <output_vcf>.<br>

−

~~[http://genome~~.~~sph~~.~~umich~~.~~edu/wiki/Variant_Normalization Normalize] variants~~ in a ~~VCF file~~.

+

#sorts mills.bcf and outputs to standard out in a 1000bp window.

+

vt sort -m local mills.bcf

+

#sorts mills.bcf and locally sorts it in a 10000bp window and outputs to out.bcf

+

vt sort -m local -w 10000 mills.bcf -o out.bcf

+

#sorts an indexed mills.bcf with chromosomes not sorted in the contig order in the header

+

vt sort -m chrom mills.bcf -o out.bcf

+

#sorts mills.bcf with no assumption

+

vt sort mills.bcf -o out.bcf

+

−

~~Normalized variants~~ may ~~have their positions changed;~~ in ~~such cases, the normalized variants~~

+

usage : vt sort [options] <in.vcf>

−

~~are reordered and~~ output ~~in an ordered fashion~~. ~~The~~ local ~~reordering takes place over a~~ window

+

−

~~of 10000 base pairs~~.

+

options : -m sorting modes. [full]

+

local : locally sort within a 1000bp window. Window size may be set by -w.

+

chrom : sort chromosomes based on order of contigs in header.

+

input must be indexed.

+

full : full sort with no assumptions.

+

-o output VCF/VCF.GZ/BCF file. [-]

+

-w local sorting window size, set by default to 1000 under local mode. [0]

+

-p print options and summary. []

+

-? displays help

+

</div>

+

=== Normalization ===

+

<div>

+

[http://genome.sph.umich.edu/wiki/Variant_Normalization Normalize] variants in a [http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42 VCF] file [http://bioinformatics.oxfordjournals.org/content/31/13/2202 (Tan et al. 2015)] . Normalized variants may have their positions changed; in such cases, the normalized variants

+

are reordered and output in an ordered fashion. The local reordering takes place over a window of 10000 base pairs which may be changed via the -w option. There is an underlying assumption that the REF

+

field is consistent with the reference sequence use, vt will check for this and will fail if reference inconsistency is encountered; this may be relaexd with the -n option.

</div>

−

vt normalize ~~mills~~.vcf -r seq.fa -o ~~- mills~~.normalized.vcf

+

#normalize variants and write out to dbsnp.normalized.vcf

+

vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf

#normalize variants, send to standard out and remove duplicates.

−

vt normalize ~~mills~~.vcf -r seq.fa ~~-o -~~ | vt ~~merge_duplicate_variants~~ - -o ~~mills~~.~~normalized_merged~~.vcf

+

vt normalize dbsnp.vcf -r seq.fa | vt uniq - -o dbsnp.normalized.uniq.vcf

+

#read in variants that do not contain N in the explicit alleles, normalize variants, send to standard out.

+

vt normalize dbsnp.vcf -r seq.fa -f "~VARIANT_CONTAINS_N"

+

#variants that are normalized will be annotated with an OLD_VARIANT info tag.

+

#CHROM POS ID REF ALT QUAL FILTER INFO

+

19 29238772 . C G . PASS VT=SNP;OLD_VARIANT=19:29238771:TC/TG

+

20 60674709 . GCCCAGCCCCAC G . PASS VT=INDEL;OLD_VARIANT=20:60674718:CACCCCAGCCCC/C

+

#this shows a sample output with the normalization operations that were used

+

#categorized into 5 categories each for biallelic and multiallelic variants. <br>

+

stats: biallelic

+

no. left trimmed : 156908

+

no. right trimmed : 323

+

no. left and right trimmed : 33

+

no. right trimmed and left aligned : 7

+

no. left aligned : 12360 <br>

+

total no. biallelic normalized : 169631 <br> <br>

+

multiallelic

+

no. left trimmed : 627189

+

no. right trimmed : 2509

+

no. left and right trimmed : 1498

+

no. right trimmed and left aligned : 212

+

no. left aligned : 1783 <br>

+

total no. multiallelic normalized : 633191 <br>

+

total no. variants normalized : 802822

+

total no. variants observed : 88052639

+

usage : vt normalize [options] <in.vcf>

+

options : -o output VCF file [-]

+

-d debug [false]

+

-q do not print options and summary [false]

+

-m warns but does not exit when REF is inconsistent

+

with masked reference sequence for non SNPs.

+

This overides the -n option [false]

+

-n warns but does not exit when REF is inconsistent

+

with reference sequence for non SNPs [false]

+

-w window size for local sorting of variants [10000]

+

-I file containing list of intervals []

+

-i intervals []

+

-r reference sequence fasta file []

+

-? displays help

+

</div>

+

</div>

+

=== Decompose biallelic block substitutions ===

+

<div>

+

Decomposes biallelic block substitutions into its constituent SNPs. <br>

+

There is now an additional option -a which decomposes non block substitutions into its constituent SNPs and indels. (kindly added by [[https://github.com/holtgrewe holtgrewe@github]]) <br>

+

There is no exact solution and this decomposition is based on the best guess outcome using a Needleman-Wunsch algorithm. <br>

+

You might also want to check out [https://github.com/vcflib/vcflib#vcfallelicprimitives vcfallelicprimitives]. <br>

+

<br>

+

There is now an additional option -m and -d which ensures that some MNVs are not decomposed. (kindly added by [[https://github.com/jaudoux jaudoux@github]]) <br>

+

The motivation is from<br>

+

*Exome-wide assessment of the functional impact and pathogenicity of multi-nucleotide mutations https://www.biorxiv.org/content/10.1101/258723v2.full<br>

+

*Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes https://www.biorxiv.org/content/10.1101/573378v2.full<br>

+

</div>

+

+

#decomposes biallelic block substitutions and write out to decomposed_blocksub.vcf

+

vt decompose_blocksub gatk.vcf -o decomposed_blocksub.vcf <br>

+

#before decomposition

+

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1

+

20 763837 . CA TG 50340.1 PASS AC=1;AN=2 GT 0|1 <br>

+

#after decomposition

+

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1

+

20 763837 . C T 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG GT 0|1

+

20 763838 . A G 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG GT 0|1

+

#decomposes biallelic clumped variant and write out to decomposed_blocksub.vcf

+

vt decompose_blocksub -a gatk.vcf -o decomposed_blocksub.vcf <br>

+

#before decomposition

+

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1

+

20 763837 . CG TGA 50340.1 PASS AC=1;AN=2 GT 0|1 <br>

+

#after decomposition

+

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1

+

20 763837 . C T 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1

+

20 763838 . G GA 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1

+

#decomposes biallelic clumped variant and write out to decomposed_blocksub.vcf and add phase set information in the genotype fields

+

vt decompose_blocksub -p gatk.vcf -o decomposed_blocksub.vcf <br>

+

#before decomposition

+

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tumor normal

+

1 159030 . TAACCTTTC TGACCTTTT 0.04 . AF=0.5 GT 0/0 1/1 <br>

+

#after decomposition

+

1 159031 . A G 0.04 . AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT GT:PS 0|0:159031 1|1:159031

+

1 159038 . C T 0.04 . AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT GT:PS 0|0:159031 1|1:159031

+

+

description : decomposes biallelic block substitutions into its constituent SNPs. <br>

+

usage : vt decompose_blocksub [options] <in.vcf> <br>

+

options : -m keep MNVs (multi-nucleotide variants) [false]

+

-a enable aggressive/alignment mode [false]

+

-d MNVs max distance (when -m option is used) [2]

+

-o output VCF file [-]

+

-I file containing list of intervals []

+

-i intervals []

+

-? displays help-a enable aggressive/alignment mode

+

</div>

+

</div>

+

=== Decompose===

+

<div>

+

Decompose multiallelic variants in a [http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42 VCF] file. If the VCF file has genotype fields GT,PL, GL or DP, they are

+

modified to reflect the change in alleles. All other genotype fields are removed. The -s option will retain the fields and decompose fields of counts R and A accordingingly.

+

Decomposition and combining variants is a complex operation where the correctness is dependent on [[https://github.com/tfarrah tfarrah@github]]:

+

*whether the observed variants are seen in the same sample,

+

*if same sample, whether they are homozygous or heterozygous,

+

*if both heterozygous, whether they are in the same haplotype or not (if known).

+

and one should be aware of the issues in handling variants resulting from such operations. <br>

+

The original purpose of this tool is to allow for allelic comparisons between call sets.

+

[[https://github.com/atks/vt/issues/16 example of a problem caused in combining separate variant records]]

+

</div>

+

+

#decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf

+

vt decompose gatk.vcf -o gatk.decomposed.vcf <br>

+

#before decomposition

+

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2

+

1 3759889 . TA TAA,TAAA,T . PASS AF=0.342,0.173,0.037 GT:DP:PL 1/2:81:281,5,9,58,0,115,338,46,116,809 0/0:86:0,30,323,31,365,483,38,291,325,567 <br>

+

#after decomposition

+

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2

+

1 3759889 . TA TAA . PASS OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL 1/.:281,5,9 0/0:0,30,323

+

1 3759889 . TA TAAA . . OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./1:281,58,115 0/0:0,31,483

+

1 3759889 . TA T . . OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./.:281,338,809 0/0:0,38,567 <br>

+

One might want to post process the partial genotypes like 1/. to the best guess genotype based on the PL values.

+

#decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf with the -s option.

+

#-s option splits up INFO and GENOTYPE fields that have number counts of R and A [[https://samtools.github.io/hts-specs/VCFv4.2.pdf VCFv4.2 section 1.2.2]] appropriately.

+

vt decompose -s gatk.vcf -o gatk.decomposed.vcf <br>

+

#before decomposition

+

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2

+

1 3759889 . TA TAA,TAAA,T . PASS AF=0.342,0.173,0.037 GT:DP:PL 1/2:81:281,5,9,58,0,115,338,46,116,809 0/0:86:0,30,323,31,365,483,38,291,325,567 <br>

+

#after decomposition

+

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2

+

1 3759889 . TA TAA . PASS AF=0.342;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL 1/.:281,5,9 0/0:0,30,323

+

1 3759889 . TA TAAA . . AF=0.173;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./1:281,58,115 0/0:0,31,483

+

1 3759889 . TA T . . AF=0.037;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./.:281,338,809 0/0:0,38,567 <br>

+

In general, you should recompute fields that involves alleles after decomposition. Information is generally lost after vertically decomposing a variant, so care should be taken

+

in interpreting the resultant values.

+

+

description : decomposes multiallelic variants into biallelic in a VCF file. <br>

+

usage : vt decompose [options] <in.vcf> <br>

+

options : -s smart decomposition [false]

+

-o output VCF file [-]

+

-I file containing list of intervals []

+

-i intervals []

+

-? displays help

+

</div>

+

</div>

+

=== Drop duplicate variants ===

+

Drops duplicate variants that appear later in the file. VCF file must be ordered. <br>

+

If there are OLD_VARIANT tags in the INFO field, the variants in these tags are aggregated in the unique record retained.

+

+

#drop duplicate variants and save output in mills.uniq.vcf

+

vt uniq mills.vcf -o mills.uniq.vcf

+

+

usage : vt uniq [options] <in.vcf>

+

options : -o output VCF file [-]

+

-I file containing list of intervals []

+

-i intervals []

+

-? displays help

+

</div>

+

</div>

+

=== Paste ===

+

Pastes VCF files like the unix paste functions.

+

Input requirements and assumptions:

+

1. Same variants are represented in the same order for each file (required)

+

2. Genotype field order are the same for corresponding records (required)

+

3. Sample names are different in all the files (warning will be given if not)

+

4. Headers are the same for all the files (assumption, not checked, will fail if output is BCF)

+

Outputs:

+

1. INFO fields output will be that of the first file

+

2. Genotype fields are the same for corresponding records

+

+

#paste together genotypes from the CEU trio into one file.

+

vt paste NA12878.mills.bcf NA12891.mills.bcf NA12892.mills.bcf -o ceu_trio.bcf

+

+

usage : vt paste [options] <in1.vcf>...

+

options : -L file containing list of input VCF files

+

-o output VCF file [-]

+

-p print options and summary []

+

-? displays help

+

</div>

+

</div>

+

=== Concatenate ===

+

Concatenates VCF files. Assumes individuals are in the same order and files share the same header.

+

+

#concatenates chr1.mills.bcf and chr2.mills.bcf

+

vt cat chr1.mills.bcf chr2.mills.bcf -o mills.bcf

+

#concatenates chr1.mills.bcf and chr2.mills.bcf with the naive option.

+

#The naive option assumes that the headers are all the same and skips

+

#merging headers and translating encodings between BCF files. This is

+

#a much faster option if you know the nature of your BCF files in advance.

+

vt cat -n chr1.mills.bcf chr2.mills.bcf -o mills.bcf

+

+

usage : vt cat [options] <in1.vcf>...

+

options : -s print site information only without genotypes [false]

+

-p print options and summary [false]

+

-n naive, assumes that headers are the same. [false]

+

-w local sorting window size [0]

+

-f filter expression []

+

-L file containing list of input VCF files

+

-o output VCF file [-]

+

-I file containing list of intervals []

+

-i intervals

+

-? displays help

+

</div>

+

</div>

+

=== Remove info tags ===

+

Removes INFO tags from a VCF file

+

+

#removes the INFO tags OLD_VARIANT, ENTROPY, PSCORE and COMP

+

vt rminfo exact.del.bcf -t OLD_VARIANT,ENTROPY,PSCORE,COMP -o rm.bcf

+

+

usage : vt rminfo [options] <in.vcf>

+

options : -o output VCF file [-]

+

-q do not print options and summary [false]

+

-t list of info tags to be removed []

+

-I file containing list of intervals []

+

-i intervals []

+

-? displays help

+

</div>

+

</div>

+

=== Filter ===

+

Filters variants in a VCF file

+

+

#adds a filter tag "refA" for variants where the REF column is a A sequence.

+

vt filter in.bcf -f "REF=='A'" -d "refA"

+

+

usage : vt filter [options] <in.vcf> <br>

+

options : -x clear filter [false]

+

-f filter expression []

+

-d filter tag description []

+

-t filter tag []

+

-o output VCF file [-]

+

-I file containing list of intervals []

+

-i intervals

+

-? displays help </div>

+

</div>

+

=== Filter overlap ===

+

Tags overlapping variants in a VCF file with the FILTER flag overlap.

+

+

#adds a filter tag "overlap" for overlapping variants within a window size of 1 based on the REF sequence.

+

vt filter_overlap in.bcf -w 1 out.bcf

+

todo: option for considering END info tag for detecting overlaps.

+

+

usage : vt filter_overlap [options] <in.vcf>

+

options : -o output VCF file [-]

+

-w window overlap for variants [0]

-I file containing list of intervals []

-i intervals []

−

-r ~~reference sequence fasta file []~~

+

-? displays help

−

~~-- ignores the rest of the labeled arguments following this flag~~

+

</div>

−

-h ~~displays help~~

</div>

+

=== Validate ===

+

Checks the following properties of a VCF file:

+

#order

+

#reference sequence consistency

+

+

#validates lobstr.bcf

+

vt validate lobstr.bcf

+

+

usage : vt validate [options] <in.vcf>

+

options : -q do not print invalid records [false]

+

-I file containing list of intervals []

+

-i intervals []

+

-r reference sequence fasta file []

+

-? displays help

+

</div>

−

=== Merge ~~duplicate~~ variants ===

+

=== Extract INFO fields to a tab delimited file ===

+

Converts a VCF file and its shared information in the INFO field to a tab delimited file for further analysis.

+

+

#converts in.bcf to tab format with selected INFO and FILTER fields

+

vt info2tab in.bcf -u PASS -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT

+

+

INPUT

+

=====

+

20 17548608 . A AC . PASS CENTERS=vbi;NCENTERS=1;OLD_MULTIALLELIC=20:17548598:GAAAAAAAAAAAAA/GAAAAAAAAAAAA/GAAAAAAAAAAAAAA/GAAAAAAAAAA/GAAAAAAAAAAA/GAAAAAAAAAACAAA;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAACAAAG;EX_MOTIF=C;EX_MLEN=1;EX_RU=C;EX_BASIS=C;EX_BLEN=1;EX_REPEAT_TRACT=17548608,17548609;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=2;EX_RL=2;EX_LL=3;EX_RU_COUNTS=0,2;EX_SCORE=0;EX_TRF_SCORE=-14;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=14;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[A]AAAGAAGGAA;MDUST;LOBSTR

+

20 17548608 . AAAAG A . PASS CENTERS=ox1;NCENTERS=1;EX_MOTIF=AAAG;EX_MLEN=4;EX_RU=AAAG;EX_BASIS=AG;EX_BLEN=2;EX_REPEAT_TRACT=17548609,17548612;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=0.75;EX_RL=4;EX_LL=4;EX_RU_COUNTS=0,1;EX_SCORE=0.75;EX_TRF_SCORE=-1;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=13;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[AAAAG]AAGGAACTAC;MDUST;LOBSTR;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAA

+

</div>

+

OUTPUT

+

======

+

CHROM POS REF ALT N_ALLELE PASS EX_RL FZ_RL MDUST LOBSTR VNTRSEEK RMSK EX_REPEAT_TRACT_1 EX_REPEAT_TRACT_2

+

20 17548608 A AC 2 1 2 13 1 1 0 0 17548608 17548608

+

20 17548608 AAAAG A 2 1 4 13 1 1 0 0 17548609 17548609

+

+

usage : vt info2tab [options] <in.vcf>

+

options : -d debug [false]

+

-f filter expression []

+

-u list of filter tags to be extracted []-t list of info tags to be extracted []

+

-o output tab delimited file [-]

+

-I file containing list of intervals []

+

-i intervals []

+

-? displays help

+

</div>

+

</div>

+

= VCF Inspection and Evaluation =

+

=== Peek ===

+

Summarizes the variants in a VCF file

+

+

#summarizes the variants found in mills.vcf

+

vt peek mills.vcf

+

+

usage : vt peek [options] <in.vcf>

+

options : -o output VCF file [-]

+

-I file containing list of intervals []

+

-i intervals []

+

-r reference sequence fasta file []

+

-- ignores the rest of the labeled arguments following this flag

+

-h displays help

+

</div>

+

</div>

+

For a more detailed guide on [http://genome.sph.umich.edu/wiki/Variant_classification variant classification].

+

#This is a sample output of a peek command which summarizes the variants found in a VCF file.

+

stats: no. of samples : 0

+

no. of chromosomes : 22<br>

+

========== Micro variants ==========<br>

+

no. of SNPs : 77228885

+

2 alleles (ts/tv) : 77011302 (2.11) [52287790/24723512]

+

3 alleles (ts/tv) : 216560 (0.75) [185520/247600]

+

4 alleles (ts/tv) : 1023 (0.50) [1023/2046]<br>

+

no. of MNPs : 0

+

2 alleles (ts/tv) : 0 (-nan) [0/0]

+

>=3 alleles (ts/tv) : 0 (-nan) [0/0]<br>

+

no. Indels : 2147564

+

2 alleles (ins/del) : 2124842 (0.47) [683250/1441592]

+

>=3 alleles (ins/del) : 22722 (2.12) [32411/15286]<br>

+

no. SNP/MNP : 0

+

3 alleles (ts/tv) : 0 (-nan) [0/0]

+

>=4 alleles (ts/tv) : 0 (-nan) [0/0] <br>

+

no. SNP/Indels : 12913

+

2 alleles (ts/tv) (ins/del) : 412 (0.41) [120/292] (3.68) [324/88]

+

>=3 alleles (ts/tv) (ins/del) : 12501 (0.43) [7670/17649] (18.64) [12434/667]<br>

+

no. MNP/Indels : 153

+

2 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0]

+

>=3 alleles (ts/tv) (ins/del) : 153 (0.30) [138/465] (0.27) [67/248]<br>

+

no. SNP/MNP/Indels : 2

+

3 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0]

+

4 alleles (ts/tv) (ins/del) : 2 (0.00) [3/5] (1.00) [3/3]

+

>=5 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0]<br>

+

no. of clumped variants : 19025

+

2 alleles : 0 (-nan) [0/0] (-nan) [0/0]

+

3 alleles : 18508 (0.16) [12152/75366] (0.00) [93/18653]

+

4 alleles : 451 (0.15) [369/2390] (0.33) [201/609]

+

>=5 alleles : 66 (0.09) [37/414] (1.19) [107/90]<br>

+

====== Other useful categories =====<br>

+

no. complex variants : 32093

+

2 alleles (ts/tv) (ins/del) : 412 (0.41) [120/292] (3.68) [324/88]

+

>=3 alleles (ts/tv) (ins/del) : 31681 (0.21) [20369/96289] (0.64) [12905/20270]<br>

+

======= Structural variants ========<br>

+

no. of structural variants : 41217

+

2 alleles : 38079

+

deletion : 13135

+

insertion : 16451

+

mobile element : 16253

+

ALU : 12513

+

LINE1 : 2911

+

SVA : 829

+

numt : 198

+

duplication : 664

+

inversion : 100

+

copy number variation : 7729

+

>=3 alleles : 3138

+

copy number variation : 3138 <br>

+

========= General summary ========== <br>

+

no. of reference : 0 <br>

+

no. of observed variants : 79449759

+

no. of unclassified variants : 0

+

=== Partition ===

+

Partition variants from two data sets.

+

<span style="color:#FF0000">Please note that this only works if the contigs in the headers of both data sets are the same.</span>

+

+

#partitions all variants in bi1.bcf and bi2.bcf

+

vt partition bi1.bcf bi2.bcf

+

Options: input VCF file a bi1.bcf

+

input VCF file b bi2.bcf <br>

+

A: 504676 variants

+

B: 1389333 variants <br>

+

ts/tv ins/del

+

A-B 37564 [0.19] [1.34]

+

A&B 467112 [1.55] [0.72]

+

B-A 922221 [1.20] [0.58]

+

of A 92.6%

+

of B 33.6%

+

#partitions only passed variants in bi1.bcf and bi2.bcf

+

vt partition bi1.bcf bi2.bcf -f PASS

+

Options: input VCF file a bi1.bcf

+

input VCF file b bi2.bcf

+

[f] filter PASS <br>

+

A: 466148 variants

+

B: 986056 variants <br>

+

ts/tv ins/del

+

A-B 47261 [0.44] [1.36]

+

A&B 418887 [1.80] [0.68]

+

B-A 567169 [1.43] [0.72]

+

of A 89.9%

+

of B 42.5%

+

+

partition v0.5

+

description : partition variants. check the overlap of variants between 2 data sets.

+

usage : vt partition [options] <in1.vcf><in2.vcf>

+

options : -w write partitioned variants to file

+

-f filter

+

-I file containing list of intervals []

+

-i intervals []

+

-? displays help

+

</div>

+

</div>

+

=== Multi Partition ===

+

Partitions variants found in VCF files. <br>

+

In comparison to the simple 2 way partition, this does not support writing out of partitions to file and

+

reporting proportion of shared variants for each VCF.

+

+

#partitions variants n-ways

+

vt multi_partition hc.genotypes.bcf pl.genotypes.bcf st.genotypes.bcf

+

Options: input VCF file a hc.genotypes.bcf

+

input VCF file b pl.genotypes.bcf

+

input VCF file c st.genotypes.bcf <br>

+

A: 97274 variants

+

B: 95458 variants

+

C: 98943 variants <br>

+

no [ts/tv] [ins/del]

+

A-- 3887 [1.10] [0.86]

+

-B- 7890 [1.45] [0.98]

+

AB- 4360 [0.99] [1.32]

+

--C 8277 [1.75] [2.21]

+

A-C 7458 [1.78] [0.49]

+

-BC 1639 [1.63] [1.03]

+

ABC 81569 [2.28] [1.08] <br>

+

Unique variants : 115080

+

Overall concordance : 70.88% (#intersection/#union)

+

+

usage : vt multi_partition [options] <in1.vcf><in2.vcf>...

+

options : -f filter

+

-I file containing list of intervals []

+

-i intervals []

+

-? displays help

+

</div>

+

</div>

+

=== Annotate Regions ===

+

Annotates regions in a VCF file. The BED file should be bgzipped and indexed with tabix.

+

+

#annotates the variants that overlap with coding regions.

+

vt annotate_regions mills.vcf -b coding.bed.gz -t CDS -d "Coding region"

+

#annotates the variants that overlap with low complexity regions.

+

vt annotate_regions mills.vcf -b mdust.bed.gz -t DUST -d "DUST Low Complexity Region"

+

+

usage : vt annotate_regions [options] <in.vcf>

+

options : -d regions tag description []

+

-t regions tag []

+

-b regions BED file []

+

-o output VCF file [-]

+

-I file containing list of intervals []

+

-i intervals

+

-? displays help

+

</div>

+

</div>

+

=== Annotate Variants ===

+

Annotates variants in a VCF file. The GENCODE annotation file should be bgzipped and indexed with tabix.

+

This is available in the [[Vt#Resource_Bundle|vt resource bundle]].

+

+

#annotates the variants found in mills.vcf

+

vt annotate_variants mills.vcf -r hs37d5.fa -g gencode.v19.annotation.gtf.gz

+

#annotates variants with the following fields

+

##INFO=<ID=VT,Number=1,Type=String,Description="Variant Type - SNP, MNP, INDEL, CLUMPED">

+

##INFO=<ID=GENCODE_FS,Number=0,Type=Flag,Description="Frameshift INDEL">

+

##INFO=<ID=GENCODE_NFS,Number=0,Type=Flag,Description="Non Frameshift INDEL">

+

+

usage : vt annotate_variants [options] <in.vcf>

+

options : -g GENCODE annotations GTF file []

+

-r reference sequence fasta file []

+

-o output VCF file [-]

+

-I file containing list of intervals []

+

-i intervals

+

-? displays help

+

</div>

+

</div>

+

=== Compute Features ===

+

Compute features in a VCF file. Example of statistics are Allele counts, [[Genotype_Likelihood_based_Inbreeding_Coefficient|Genotype Likelihood based Inbreeding Coefficient]].

+

[[Genotype_Likelihood_based_Allele_Frequency|Hardy-Weinberg Genotype Likelihood based Allele Frequencies]] <br>

+

For more customizable feature computation - look at [http://genome.sph.umich.edu/wiki/Vt#Estimate estimate]

+

+

#compute features for the variants found in vt.vcf

+

#requires GT, PL and DP

+

vt compute_features vt.vcf

+

#annotates variants with the following fields

+

##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate Allele Counts">

+

##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Number Allele Counts">

+

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

+

##INFO=<ID=AF,Number=A,Type=Float,Description="Alternate Allele Frequency">

+

##INFO=<ID=GC,Number=G,Type=Integer,Description="Genotype Counts">

+

##INFO=<ID=GN,Number=1,Type=Integer,Description="Total Number of Genotypes Counts">

+

##INFO=<ID=GF,Number=G,Type=Float,Description="Genotype Frequency">

+

##INFO=<ID=HWEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency assuming HWE">

+

##INFO=<ID=HWEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency assuming HWE">

+

##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency">

+

##INFO=<ID=MLEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency">

+

##INFO=<ID=HWE_LLR,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg ln(Likelihood Ratio)">

+

##INFO=<ID=HWE_LPVAL,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic ln(p-value)">

+

##INFO=<ID=HWE_DF,Number=1,Type=Integer,Description="Degrees of freedom for Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic">

+

##INFO=<ID=FIC,Number=1,Type=Float,Description="Genotype likelihood based Inbreeding Coefficient">

+

##INFO=<ID=AB,Number=1,Type=Float,Description="Genotype likelihood based Allele Balance">

+

+

usage : vt compute_features for variants [options] <in.vcf>

+

options : -s print site information only without genotypes [false]

+

-o output VCF/VCF.GZ/BCF file [-]

+

-f filter expression []

+

-I File containing list of intervals

+

-i Intervals

+

-? displays help

+

</div>

+

</div>

+

=== Estimate ===

+

Compute variant based estimates.

+

Example of statistics are:

+

* Allele counts

+

* [[Genotype_Likelihood_based_Allele_Frequency|Hardy-Weinberg Genotype Likelihood based Allele Frequencies]]

+

* [[Genotype_Likelihood_based_Inbreeding_Coefficient|Genotype Likelihood based Inbreeding Coefficient]]

+

* [[HWEP|Genotype Likelihood based Hardy-Weinberg test]]

+

* [[Genotype_Likelihood_Based_Allele_Balance|Genotype Likelihood based Allele Balance]]

+

+

#compute features for the variants found in vt.vcf

+

#requires GT and PL

+

vt estimate -e AF,MLEAF vt.vcf

+

AF Genotype (GT) based allele frequencies

+

If genotypes are unavailable, best guess

+

genotypes are inferred based on genotype

+

likelihoods (GL or PL)

+

AC : Alternate Allele counts

+

AN : Total allele counts

+

NS : No. of samples.

+

AF : Alternate allele frequencies.

+

MLEAF GL based allele frequencies estimates

+

MLEAF : Alternate allele frequency derived from MLEGF

+

MLEGF : Genotype frequencies.

+

HWEAF GL based allele frequencies estimates assuming HWE

+

HWEAF : Alternate allele frequencies

+

HWEGF : Genotype frequencies derived from HWEAF.

+

HWE GL based Hardy-Weinberg statistics.

+

HWE_LLR : log likelihood ratio

+

HWE_LPVAL : log p-value

+

HWE_DF : degrees of freedom

+

AB GL based Allele Balance.

+

FIC GL based Inbreeding Coefficient

+

+

usage : vt estimate [options] <in.vcf>

+

options : -s print site information only without genotypes [false]

+

-o output VCF/VCF.GZ/BCF file [-]

+

-e comma separated estimates to be computed []

+

-f filter expression []

+

-I File containing list of intervals

+

-i Intervals

+

-? displays help

+

</div>

+

</div>

+

=== Profile Mendelian Errors ===

+

Profile Mendelian errors

+

+

#profile mendelian errors found in vt.genotypes.bcf, generate [[media:mendel.pdf|tables]] in the directory mendel, requires pdflatex.

+

vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel

+

pedigree file format is described in [[Vt#Pedigree File|here]].

+

#this is a sample output for mendelian error profiling.

+

#R and A stand for reference and alternate allele respectively.

+

#Error% - mendelian error (confounded with de novo mutation)

+

#HomHet - Homozygous-Heterozygous genotype ratios

+

#Het% - proportion of hets

+

Mendelian Errors <br>

+

Father Mother R/R R/A A/A Error(%) HomHet Het(%)

+

R/R R/R 14889 210 38 1.64 nan nan

+

R/R R/A 3403 3497 74 1.06 0.97 50.68

+

R/R A/A 176 1482 155 18.26 nan nan

+

R/A R/R 3665 3652 68 0.92 1.00 49.91

+

R/A R/A 1015 3151 990 0.00 0.64 61.11

+

R/A A/A 43 1300 1401 1.57 1.08 48.13

+

A/A R/R 172 1365 147 18.94 nan nan

+

A/A R/A 47 1164 1183 1.96 1.02 49.60

+

A/A A/A 20 78 5637 1.71 nan nan <br>

+

Parental R/R R/A A/A Error(%) HomHet Het(%)

+

R/R R/R 14889 210 38 1.64 nan nan

+

R/R R/A 7068 7149 142 0.99 0.99 50.28

+

R/R A/A 348 2847 302 18.59 nan nan

+

R/A R/A 1015 3151 990 0.00 0.64 61.11

+

R/A A/A 90 2464 2584 1.75 1.05 48.81

+

A/A A/A 20 78 5637 1.71 nan nan <br>

+

Parental R/R R/A A/A Error(%) HomHet Het(%)

+

HOM HOM 14909 288 5675 1.66 nan nan

+

HOM HET 7158 9613 2726 1.19 1.00 49.90

+

HET HET 1015 3151 990 0.00 0.64 61.11

+

HOMREF HOMALT 348 2847 302 18.59 nan nan <br>

+

total mendelian error : 2.505%

+

no. of trios : 2

+

no. of variants : 25346

+

+

profile_mendelian v0.5

+

usage : vt profile_mendelian [options] <in.vcf>

+

options : -q minimum genotype quality

+

-d minimum depth

+

-r reference sequence fasta file []

+

-x output latex directory []

+

-p pedigree file

+

-I file containing list of intervals []

+

-i intervals

+

-? displays help

+

</div>

+

</div>

+

=== Profile SNPs ===

+

Profile SNPs. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].

+

+

#profile snps found in 20.sites.vcf

+

vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa -i 20

+

#this is a sample output for indel profiling.

+

# square brackets contain the ts/tv ratio.

+

# The numbers in curved bracket are the counts of ts and tv SNPs respectively.

+

# Low complexity shows what percent of the SNPs are in low complexity regions.

+

data set

+

No. SNPs : 508603 [2.09]

+

Low complexity : 0.08 (39837/508603) <br>

+

1000g

+

A-B 109970 [1.39]

+

A&B 398633 [2.37]

+

B-A 1340682 [2.26]

+

Precision 78.4%

+

Sensitivity 22.9% <br>

+

dbsnp

+

A-B 324063 [1.99]

+

A&B 184540 [2.29]

+

B-A 103893 [2.60]

+

Precision 36.3%

+

Sensitivity 64.0%

+

# This file contains information on how to process reference data sets.

+

#

+

# dataset - name of data set, this label will be printed.

+

# type - True Positives (TP) and False Positives (FP)

+

# overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively

+

# - annotation

+

# file is used for GENCODE annotation of frame shift and non frame shift Indels

+

# filter - filter applied to variants for this particular data set

+

# path - path of indexed BCF file

+

#dataset type filter path

+

1000g TP N_ALLELE==2&&VTYPE==SNP /net/fantasia/home/atks/ref/vt/grch37/1000G.v5.snps.indels.complex.svs.sites.bcf

+

dbsnp TP N_ALLELE==2&&VTYPE==SNP /net/fantasia/home/atks/ref/vt/grch37/dbSNP138.snps.indels.complex.sites.bcf

+

GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz

+

DUST cplx_annotation . /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz

+

+

usage : vt profile_snps [options] <in.vcf>

+

options : -f filter expression []

+

-g file containing list of reference datasets []

+

-I file containing list of intervals []

+

-i intervals []

+

-r reference sequence fasta file []

+

-? displays help

+

</div>

+

</div>

+

=== Profile Indels ===

+

Profile Indels. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].

+

+

#profile indels found in mills.vcf

+

vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa -i 20

+

#this is a sample output for indel profiling.

+

# square brackets contain the ins/del ratio.

+

# for the FS/NFS field, that is the proportion of coding indels that are frame shifted.

+

# The numbers in curved bracket are the counts of frame shift and non frame shift indels respectively.

+

data set

+

No Indels : 46974 [0.89]

+

FS/NFS : 0.26 (8/23) <br>

+

dbsnp

+

A-B 30704 [0.92]

+

A&B 16270 [0.83]

+

B-A 2049488 [1.52]

+

Precision 34.6%

+

Sensitivity 0.8% <br>

+

mills

+

A-B 43234 [0.88]

+

A&B 3740 [1.00]

+

B-A 203278 [0.98]

+

Precision 8.0%

+

Sensitivity 1.8% <br>

+

mills.chip

+

A-B 46847 [0.89]

+

A&B 127 [0.90]

+

B-A 8777 [0.93]

+

Precision 0.3%

+

Sensitivity 1.4% <br>

+

affy.exome.chip

+

A-B 46911 [0.89]

+

A&B 63 [0.43]

+

B-A 33997 [0.47]

+

Precision 0.1%

+

Sensitivity 0.2% <br>

+

# This file contains information on how to process reference data sets.

+

# dataset - name of data set, this label will be printed.

+

# type - True Positives (TP) and False Positives (FP).

+

# overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively.

+

# - annotation.

+

# file is used for GENCODE annotation of frame shift and non frame shift Indels.

+

# filter - filter applied to variants for this particular data set.

+

# path - path of indexed BCF file.

+

#dataset type filter path

+

1000g TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/1000G.snps_indels.sites.bcf

+

mills TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/mills.208620indels.sites.bcf

+

dbsnp TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/dbsnp.13147541variants.sites.bcf

+

GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.cds.bed.gz

+

DUST cplx_annotation . /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz

+

+

usage : vt profile_indels [options] <in.vcf>

+

options : -g file containing list of reference datasets []

+

-I file containing list of intervals []

+

-i intervals []

+

-r reference sequence fasta file []

+

-? displays help

+

</div>

+

</div>

+

=== Profile VNTRs ===

+

Profile VNTRs. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].

+

+

#profiles a set of VNTRs

+

vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt

+

profile_vntrs v0.5

+

no VNTRs 5660874 #number of VNTRs in vntrs.sites.bcf

+

no low complexity 2686460 (47.46%) #number of VNTRs in low complexity region determined by MDUST

+

no coding 17911 (0.32%) #number of VNTRs in coding regions determined by GENCODE v7

+

no redundant 1312209 (23.18%) #number of VNTRs involved in overlapping with one another<br>

+

trf_lobstr (1638516) #TRF based reference set used in lobSTR, motif lengths 1 to 6.

+

A-B 3269285 #TRs specific to vntrs.sites.bcf

+

A-B~ 1666185 #TRs in vntrs.sites.bcf that overlap partially with at least one TR in TRF(lobSTR) but does not overlap exactly with another TR.

+

A&B1 725404 #TRs in vntrs.sites.bcf that overlap exactly with at least one TR in TRF(lobSTR)

+

A&B2 723195 #TRs in TRF(lobSTR) that overlap exactly with at least one TR in vntrs.sites.bcf

+

B-A~ 710075 #TRs in TRF(lobSTR) that overlap partially with at least one TR in vntrs.sites.bcf but does not overlap exactly with another TR.

+

B-A 205246 #TRs specific to TRF(lobSTR)

+

#note that the first 3 rows should sum up to the number of TRs in vntrs.sites.bcf

+

#and the 4th to 6th rows should sum up to the number of TRs in TRF( lobSTR)

+

#This basically allows us to see the m to n overlapping in overlapping TRs<br>

+

trf_repeatseq (1624553) #TRF based reference set used in repeatseq, motif lengths 1 to 6.

+

A-B 3291652

+

A-B~ 1650190

+

A&B1 719032

+

A&B2 716838

+

B-A~ 703948

+

B-A 203767 <br>

+

trf_vntrseek (230306) #TRF based reference set used in vntrseek, motif lengths 7 to 2000.

+

A-B 5384453

+

A-B~ 271302

+

A&B1 5119

+

A&B2 4973

+

B-A~ 92496

+

B-A 132837 <br>

+

codis+ (15) #CODIS STRs + 2 STRs from PROMEGA

+

A-B 5660794

+

A-B~ 79

+

A&B1 1

+

A&B2 1

+

B-A~ 14

+

B-A 0

+

# This file contains information on how to process reference data sets.

+

# dataset - name of data set, this label will be printed.

+

# type - True Positives (TP) and False Positives (FP).

+

# overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively.

+

# - annotation.

+

# file is used for GENCODE annotation of coding VNTRs.

+

# filter - filter applied to variants for this particular data set.

+

# path - path of indexed BCF file.

+

#dataset type filter path

+

trf_lobstr TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.lobstr.sites.bcf

+

trf_repeatseq TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.repeatseq.sites.bcf

+

trf_vntrseek TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.vntrseek.sites.bcf

+

codis+ TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/codis.strs.sites.bcf

+

GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz

+

DUST cplx_annotation .

+

+

usage : vt profile_vntrs [options] <in.vcf>

+

options : -g file containing list of reference datasets []

+

-I file containing list of intervals []

+

-i intervals []

+

-r reference sequence fasta file []

+

-? displays help

+

</div>

+

</div>

+

=== Profile NA12878 ===

+

Profile Mendelian errors

+

+

#profile NA12878 overlap with broad knowledgebase and illumina platinum genomes for the file vt.genotypes.bcf for chromosome 20.

+

vt profile_na12878 vt.genotypes.bcf -g na12878.reference.txt -r hs37d5.fa -i 20

+

#this is a sample output for mendelian error profiling.

+

#R and A stand for reference and alternate allele respectively.

+

#Error% - mendelian error (confounded with de novo mutation)

+

#HomHet - Homozygous-Heterozygous genotype ratios

+

#Het% - proportion of hets

+

data set

+

No Indels : 27770 [0.94]

+

FS/NFS : 0.26 (8/23) <br>

+

broad.kb

+

A-B 13071 [1.19]

+

A&B 14699 [0.76]

+

B-A 21546 [0.62]

+

Precision 52.9%

+

Sensitivity 40.6% <br>

+

illumina.platinum

+

A-B 17952 [0.88]

+

A&B 9818 [1.07]

+

B-A 2418 [0.88]

+

Precision 35.4%

+

Sensitivity 80.2% <br>

+

broad.kb

+

R/R R/A A/A ./.

+

R/R 346 145 3 5473

+

R/A 3 4133 9 758

+

A/A 2 136 2186 956

+

./. 2 139 86 322 <br>

+

Total genotype pairs : 6963

+

Concordance : 95.72% (6665)

+

Discordance : 4.28% (298) <br>

+

illumina.platinum

+

R/R R/A A/A ./.

+

R/R 1768 85 2 0

+

R/A 10 4479 14 0

+

A/A 13 180 3028 0

+

./. 71 98 70 0<br>

+

Total genotype pairs : 9579

+

Concordance : 96.83% (9275)

+

Discordance : 3.17% (304)

+

# This file contains information on how to process reference data sets.

+

#

+

# dataset - name of data set, this label will be printed.

+

# type - True Positives (TP) and False Positives (FP)

+

# overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively

+

# - annotation

+

# file is used for GENCODE annotation of frame shift and non frame shift Indels

+

# filter - filter applied to variants for this particular data set

+

# path - path of indexed BCF file

+

#dataset type filter path

+

broad.kb TP PASS /net/fantasia/home/atks/dev/vt/bundle/public/grch37/broad.kb.241365variants.genotypes.bcf

+

illumina.platinum TP PASS /net/fantasia/home/atks/dev/vt/bundle/public/grch37/NA12878.illumina.platinum.5284448variants.genotypes.bcf

+

#gencode.v19 annotation . /net/fantasia/home/atks/dev/vt/bundle/public/grch37/gencode.v19.annotation.gtf.gz

+

+

profile_na12878 v0.5

+

usage : vt profile_na12878 [options] <in.vcf>

+

options : -g file containing list of reference datasets []

+

-I file containing list of intervals []

+

-i intervals []

+

-r reference sequence fasta file []

+

-? displays help

+

</div>

+

</div>

+

= Variant Calling =

+

=== Discover ===

+

Discovers variants from reads in a BAM/CRAM file.

+

+

#discover variants from NA12878.bam and write to stdout

+

vt discover -b NA12878.bam -s NA12878 -r hs37d5.fa -i 20

+

+

usage : vt discover2 [options]

+

options : -b input BAM/CRAM file

+

-y soft clipped unique sequences cutoff [0]

+

-x soft clipped mean quality cutoff [0]

+

-w insertion desired type II error [0.0]

+

-c insertion desired type I error [0.0]

+

-h insertion fractional evidence cutoff [0]

+

-g insertion count cutoff [1]

+

-n deletion desired type II error [0.0]

+

-m deletion desired type I error [0.0]

+

-v deletion fractional evidence cutoff [0]

+

-u deletion count cutoff [1]

+

-k snp desired type II error [0.0]

+

-j snp desired type I error [0.0]

+

-f snp fractional evidence cutoff [0]

+

-e snp evidence count cutoff [1]

+

-q base quality cutoff for bases [0]

+

-C likelihood ratio cutoff [0]

+

-B reference bias [0]

+

-a read exclude flag [0x0704]

+

-l ignore overlapping reads [false]

+

-t MAPQ cutoff for alignments [0]

+

-p ploidy [2]

+

-s sample ID

+

-r reference sequence fasta file []

+

-o output VCF file [-]

+

-z ignore MD tags [0]

+

-d debug [0]

+

-I file containing list of intervals []

+

-i intervals []

+

-? displays help

+

</div>

+

</div>

+

=== Merge candidate variants ===

+

Merge candidate variants across samples. Each VCF file is required to have the FORMAT flags E and N and should have exactly one sample.

+

+

#merge candidate variants from VCFs in candidate.txt and output in candidate.sites.vcf

+

vt merge_candidate_variants candidates.txt -o candidate.sites.vcf

+

+

usage : vt merge_candidate_variants [options]

+

options : -L file containing list of input VCF files

+

-o output VCF file [-]

+

-I file containing list of intervals []

+

-i intervals

+

-- ignores the rest of the labeled arguments following this flag

+

-h displays help

+

</div>

+

</div>

+

=== Filter overlap ===

+

Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap.

+

+

#annotates variants that are overlapping

+

vt filter_overlap in.vcf -r hs37d5.fa -o overlapped.tagged..vcf

+

+

usage : vt filter_overlap [options] <in.vcf>

+

options : -o output VCF file [-]

+

-w window overlap for variants [0]

+

-I file containing list of intervals []

+

-i intervals []

+

-? displays help

+

</div>

+

</div>

+

+

#Use Remove overlap instead for versions older than Jan 12, 2017

+

vt remove_overlap in.vcf -r hs37d5.fa -o overlapped.tagged..vcf

+

+

usage: vt remove_overlap [options] <in.vcf>

+

The old version has the same options except that it lacks the -w option

+

The change occurred in the following commit:

+

https://github.com/atks/vt/commit/ab5cf7e91b3baa5349f439e6fe92491ae19da1a6

+

</div>

+

</div>

+

=== Annotate Indels ===

+

Annotates indels with VNTR information and adds a VNTR record. Facilitates the simultaneous calling of VNTR together with Indels and SNPs.

+

+

#annotates indels from VCFs with VNTR information.

+

vt annotate_indels in.vcf -r hs37d5.fa -o annotated.sites.vcf

+

+

CHROM POS ID REF ALT QUAL FILTER INFO

+

20 82079 . G A 1255.98 . NSAMPLES=1;E=43;N=51;ESUM=43;NSUM=51;FLANKSEQ=GGAGCACGCC[G/A]CCATGCCCGG

+

20 82217 . G A 1632.77 . NSAMPLES=1;E=56;N=61;ESUM=56;NSUM=61;FLANKSEQ=GAGCCACCGC[G/A]CCCGGCCCAG

+

20 83250 . CTGTGTGTG C . . NSAMPLES=1;E=18;N=35;ESUM=18;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT

+

20 83250 . CTGTGTGTGTG C . . NSAMPLES=1;E=3;N=35;ESUM=3;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT

+

20 83251 . TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG <VNTR> . . MOTIF=GT;RU=TG;FZ_CONCORDANCE=1;FZ_RL=52;FZ_LL=0;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FZ_RU_COUNTS=26,26;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG]TTTAGTATTT

+

20 83252 . G C 359.204 . NSAMPLES=1;E=13;N=14;ESUM=13;NSUM=14;FLANKSEQ=CTCTCTCTCT[G/C]TGTGTGTGTG

+

20 83260 . G C 500.163 . NSAMPLES=1;E=18;N=34;ESUM=18;NSUM=34;FLANKSEQ=CTGTGTGTGT[G/C]TGTGTGTGTG

+

20 83267 . T C 247.043 . NSAMPLES=1;E=11;N=43;ESUM=11;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT

+

20 83275 . T C 609.669 . NSAMPLES=1;E=24;N=43;ESUM=24;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT

+

20 90008 . C A 1546.88 . NSAMPLES=1;E=52;N=60;ESUM=52;NSUM=60;FLANKSEQ=AACAGAAAAC[C/A]AAATACTGTA

+

20 91088 . C T 1766.04 . NSAMPLES=1;E=58;N=66;ESUM=58;NSUM=66;FLANKSEQ=CCCAGCATAC[C/T]ATGGTTGTGC

+

20 91508 . G A 1266.93 . NSAMPLES=1;E=44;N=53;ESUM=44;NSUM=53;FLANKSEQ=AATTAGTAAG[G/A]CTTACGTAAG

+

20 91707 . C T 888.134 . NSAMPLES=1;E=30;N=53;ESUM=30;NSUM=53;FLANKSEQ=TGATTTTCTA[C/T]AGCAGGACCT

+

20 92527 . A G 828.593 . NSAMPLES=1;E=34;N=40;ESUM=34;NSUM=40;FLANKSEQ=ATTAATTGCC[A/G]TTCTCTCTTT

+

20 93440 . A G 688.144 . NSAMPLES=1;E=24;N=58;ESUM=24;NSUM=58;FLANKSEQ=TTGGATGCAT[A/G]GTCTGTAAAT

+

20 93636 . TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT <VNTR> . . MOTIF=T;RU=T;FZ_CONCORDANCE=0.939394;FZ_RL=35;FZ_LL=0;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FZ_RU_COUNTS=31,33;FLANKSEQ=TCTAGGATTC[TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT

+

20 93646 . C CT . . NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FLANKSEQ=TTTTTCTTTC[TTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT;GMOTIF=T;TR=20:93636:TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT:<VNTR>:T

+

20 93717 . A T 31.7622 . NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKSEQ=CAGTGGCGTG[A/T]TCTTAGATCA

+

20 93931 . G A 628.149 . NSAMPLES=1;E=22;N=53;ESUM=22;NSUM=53;FLANKSEQ=GATTACAGGT[G/A]TGAGCCGCTG

+

20 100699 . C T 809.09 . NSAMPLES=1;E=28;N=61;ESUM=28;NSUM=61;FLANKSEQ=GGTGAAAAAT[C/T]ACCTGTCAGT

+

20 101362 . G A 1087.13 . NSAMPLES=1;E=36;N=67;ESUM=36;NSUM=67;FLANKSEQ=TAATACTGAA[G/A]TTTACTTCTC

+

</div>

+

The following shows the trace of how the algorithm works

+

============================================

+

ANNOTATING INDEL FUZZILY

+

********************************************

+

EXTRACTIING REGION BY EXACT LEFT AND RIGHT ALIGNMENT

+

20:131948:C/CCA

+

EXACT REGION 131948-131965 (18)

+

CCACACACACACACACAA

+

FINAL EXACT REGION 131948-131965 (18)

+

CCACACACACACACACAA

+

********************************************

+

PICK CANDIDATE MOTIFS

+

Longest Allele : C[CA]CACACACACACACACAA

+

detecting motifs for an str

+

seq: CCACACACACACACACACAA

+

len : 20

+

cmax_len : 10

+

candidate motifs: 25

+

AC : 0.894737 2 0

+

AAC : 0.5 3 0.0555556

+

ACC : 0.5 3 0.0555556

+

AAAC : 0.0588235 4 0.125 (< 2 copies)

+

ACCC : 0.0588235 4 0.125 (< 2 copies)

+

AACAC : 0.5 5 0.02

+

ACACC : 0.5 5 0.02

+

AAACAC : 0.0666667 6 0.0555556 (< 2 copies)

+

ACACCC : 0.0666667 6 0.0555556 (< 2 copies)

+

AACACAC : 0.5 7 0.0102041

+

ACACACC : 0.5 7 0.0102041

+

AAACACAC : 0.0769231 8 0.03125 (< 2 copies)

+

ACACACCC : 0.0769231 8 0.03125 (< 2 copies)

+

AACACACAC : 0.5 9 0.00617284 (< 2 copies)

+

ACACACACC : 0.5 9 0.00617284 (< 2 copies)

+

AAACACACAC : 0.0909091 10 0.02 (< 2 copies)

+

ACACACACCC : 0.0909091 10 0.02 (< 2 copies)

+

********************************************

+

PICKING NEXT BEST MOTIF

+

selected: AC 0.89 0.00

+

********************************************

+

DETECTING REPEAT TRACT FUZZILY

+

++++++++++++++++++++++++++++++++++++++++++++

+

Exact left/right alignment

+

repeat_tract : CACACACACACACACA

+

position : [131949,131964]

+

motif_concordance : 1

+

repeat units : 8

+

exact repeat units : 8

+

total no. of repeat units : 8

+

++++++++++++++++++++++++++++++++++++++++++++

+

Fuzzy right alignment

+

repeat motif : CA

+

rflank : AACTC

+

mlen : 2

+

rflen : 5

+

plen : 111

+

read : AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACACCACACACACACACACAAACTC

+

rlen : 106

+

optimal score: 50.5073

+

optimal state: MR

+

optimal track: MR|r|0|5

+

optimal probe len: 25

+

optimal path length : 107

+

max j: 106

+

probe: (1~82) [1~10] (1~5)

+

read : (1~82) [83~101] (102~106)

+

motif # : 10 [83,101]

+

motif concordance : 95% (9/10)

+

motif discordance : 0|1|0|0|0|0|0|0|0|0

+

Model: ----------------------------------------------------------------------------------CACACACACACACACACACAAACTC

+

SYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYMMMDMMMMMMMMMMMMMMMMMMMMME

+

oo++oo++oo++oo++oo++RRRRR

+

Read: AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACAC-CACACACACACACACAAACTC

+

++++++++++++++++++++++++++++++++++++++++++++

+

Fuzzy left alignment

+

lflank : ATCTTA

+

repeat motif : CA

+

lflen : 6

+

mlen : 2

+

plen : 111

+

read : ATCTTACACCACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT

+

rlen : 105

+

optimal score: 50.5858

+

optimal state: Z

+

optimal track: Z|m|10|2

+

optimal probe len: 26

+

optimal path length : 106

+

max j: 105

+

mismatch penalty: 3

+

model: (1~6) [1~10]

+

read : (1~6) [7~25][26~106]

+

motif # : 10 [7,25]

+

motif concordance : 95% (9/10)

+

motif discordance : 0|1|0|0|0|0|0|0|0|0

+

Model: ATCTTACACACACACACACACACACA--------------------------------------------------------------------------------

+

SMMMMMMMMMDMMMMMMMMMMMMMMMMZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZE

+

LLLLLLoo++oo++oo++oo++oo++

+

Read: ATCTTACAC-CACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT

+

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

+

VNTR Summary

+

rid : 19

+

motif : AC

+

ru : CA

+

Exact

+

repeat_tract : CACACACACACACACA

+

position : [131949,131964]

+

reference repeat unit length : 8

+

motif_concordance : 1

+

repeat units : 8

+

exact repeat units : 8

+

total no. of repeat units : 8

+

Fuzzy

+

repeat_tract : CACCACACACACACACACA

+

position : [131946,131964]

+

reference repeat unit length : 19

+

motif_concordance : 0.95

+

repeat units : 19

+

exact repeat units : 9

+

total no. of repeat units : 10

+

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

+

+

usage : vt annotate_indels [options] <in.vcf>

+

options : -v add vntr record [false]

+

-x override tags [false]

+

-f filter expression []

+

-d debug [false]

+

-m mode [f]

+

e : by exact alignment f : by fuzzy alignment

+

-c classification schemas of tandem repeat [6]

+

1 : lai2003

+

2 : kelkar2008

+

3 : fondon2012

+

4 : ananda2013

+

5 : willems2014

+

6 : tan_kang2015

+

-a annotation type [v]

+

v : a. output VNTR variant (defined by classification).

+

RU repeat unit on reference sequence (CA)

+

MOTIF canonical representation (AC)

+

RL repeat tract length in bases (11)

+

FLANKS flanking positions of repeat tract determined by exact alignment

+

RU_COUNTS number of exact repeat units and total number of repeat units in

+

repeat tract determined by exact alignment

+

FZ_RL fuzzy repeat tract length in bases (11)

+

FZ_FLANKS flanking positions of repeat tract determined by fuzzy alignment

+

FZ_RU_COUNTS number of exact repeat units and total number of repeat units in

+

repeat tract determined by fuzzy alignment

+

FLANKSEQ flanking sequence of indel

+

LARGE_REPEAT_REGION repeat region exceeding 2000bp

+

b. mark indels with overlapping VNTR.

+

FLANKS flanking positions of repeat tract determined by exact alignment

+

FZ_FLANKS flanking positions of repeat tract determined by fuzzy alignment

+

GMOTIF generating motif used in fuzzy alignment

+

TR position and alleles of VNTR (20:23413:CACACACACAC:<VNTR>)

+

a : annotate each indel with RU, RL, MOTIF, REF.

+

-r reference sequence fasta file []

+

-o output VCF file [-]

+

-I file containing list of intervals []

+

-i intervals

+

-? displays help

+

</div>

+

</div>

+

=== Construct Probes ===

+

Construct probes for genotyping a variant.

+

+

#construct probes from candidate.sites.bcf and output to standard out

+

vt construct_probes candidates.sites.bcf -r ref.fa

+

+

usage : vt construct_probes [options] <in.vcf>

+

options : -o output VCF file [-]

+

-f minimum flank length [20]

+

-r reference sequence fasta file []

+

-I file containing list of intervals []

+

-i intervals []

+

-- ignores the rest of the labeled arguments following this flag

+

-h displays help

+

</div>

+

</div>

+

=== Genotype ===

+

Genotypes variants for each sample.

+

+

#genotypes variants found in candidate.sites.vcf from sample.bam

+

vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf

+

+

usage : vt genotype [options]

+

options : -r reference sequence fasta file []

+

-s sample ID []

+

-o output VCF file [-]

+

-b input BAM file []

+

-i input candidate VCF file []

+

-- ignores the rest of the labeled arguments following this flag

+

-h displays help

+

</div>

+

</div>

+

= Pedigree File =

+

vt understands an augmented version introduced by [mailto:hmkang@umich.edu Hyun] of the PED described by [http://zzz.bwh.harvard.edu/plink/data.shtml#ped plink].

+

The pedigree file format is as follows with the following mandatory fields:

+

{| class="wikitable"

+

|-

+

! scope="col"| Field

+

! scope="col"| Description

+

! scope="col"| Valid Values

+

! scope="col"| Missing Values

+

|-

+

|Family ID<br>

+

Individual ID<br>

+

Paternal ID<br>

+

Maternal ID<br>

+

Sex<br>

+

Phenotype

+

|ID of this family <br>

+

ID(s) of this individual (comma separated) <br>

+

ID of the father <br>

+

ID of the mother <br>

+

Sex of the individual<br>

+

Phenotype

+

|[A-Za-z0-9_]+<br>

+

[A-Za-z0-9_]+(,[A-Za-z0-9_]+)* <br>

+

[A-Za-z0-9_]+ <br>

+

[A-Za-z0-9_]+<br>

+

1=male, 2=female, other, male, female<br>

+

[A-Za-z0-9_]+

+

| 0 <br>

+

cannot be missing <br>

+

0 <br>

+

0 <br>

+

other<br>

+

-9

+

|}

+

Examples:

+

ceu NA12878 NA12891 NA12892 female -9

+

yri NA19240 NA19239 NA19238 female -9

+

ceu NA12878 NA12891 NA12892 2 -9

+

yri NA19240 NA19239 NA19238 2 -9

+

#allows tools like profile_mendelian to detect duplicates and check for concordance

+

ceu NA12878,NA12878A NA12891 NA12892 female case

+

yri NA19240 NA19239 NA19238 female control

+

#allows tools like profile_mendelian to detect duplicates and check for concordance

+

ceu NA12412 0 0 female case

+

yri NA19650 0 0 female control

+

= Resource Bundle =

+

== GRCh37 ==

+

Files are based on hs37d5.fa made by Heng Li.

+

* External : [ftp://share.sph.umich.edu/vt/grch37 GRCh37 resource bundle]

+

* Internal : /net/fantasia/home/atks/ref/vt/grch37

+

Read here for [ftp://share.sph.umich.edu/vt/grch37/readme.txt contents].

+

== GRCh38 ==

+

Files are based on [https://github.com/lh3/bwa/blob/master/README-alt.md hs38DH.fa] made by Heng Li.

+

Note that many of the references are simply lifted over from GRCh37 using Picard's liftover tool with the default options.

+

* External : [ftp://share.sph.umich.edu/vt/grch38 GRCh38 resource bundle]

+

* Internal : /net/fantasia/home/atks/ref/vt/grch38

+

Read here for [ftp://share.sph.umich.edu/vt/grch38/readme.txt contents].

+

= FAQ =

+

==1. vt cannot retrieve sequences from my reference sequence file ==

+

It is common to use reference files based on the UCSC browser's database and from the Genome Reference Consortium.

+

For example, HG19 vs Grch37. The key difference is that chromosome 1 is represented as chr1 and 1 respectively in the

+

FASTA files from these 2 sources. Just use the appropriate FASTA file that was used to generate your VCF file originally.

−

~~Merges duplicate variants by position with~~ the ~~option~~ of ~~considering alleles~~. ~~(This just discards~~ the ~~duplicate variant that appears later in~~ the ~~VCF~~ file)

+

Another common issue is due to the corruption of the index file of the reference sequence; say for a reference file named

+

hs37d5.fa or hs37d5.fa.gz, simply delete the index file denoted by hs37d5.fa.fai or hs37d5.fa.gz.fai and run the vt command

+

again. A new index file will be generated automatically.

−

~~Options:~~

+

= How to cite vt? =

−

~~-i, --input-vcf <string> : Input VCF file~~

−

~~-o, --output-vcf <string> : Output VCF file [-]~~

−

~~-p, --merge-by-position : Merge by position [false]~~

−

~~Example~~:

+

If you use normalize: <br>

−

e.g. ~~vt merge_duplicate_variants -i 8904indels~~.~~dups~~.~~genotypes~~.~~vcf -o out.vcf~~

+

[http://bioinformatics.oxfordjournals.org/content/31/13/2202 Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. Unified Representation of Genetic Variants. Bioinformatics (2015) 31(13): 2202-2204]

−

~~e.g. vt merge_duplicate_variants -p~~ -~~i 8904indels.dups.genotypes.vcf -o out.vcf~~

= Maintained by =

This page is maintained by [mailto:atks@umich.edu Adrian]

Atks

1,102

edits

Changes

Vt (view source)

Revision as of 04:36, 4 May 2021

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools