Revision as of 11:02, 9 June 2016

Introduction

vt is a variant tool set that discovers short variants from Next Generation Sequencing data.

Installation

The source files are hosted on GitHub. vt uses htslib, and a frozen development copy of it is stored in the vt repository to ensure compatibility.

To install, perform the following steps:

 #this will create a directory named vt in the directory where you cloned the repository
 1. git clone https://github.com/atks/vt.git  

 #change directory to vt
 2. cd vt 

 #run make, note that compilers need to support the c++0x standard 
 3. make 

 #you can test the build
 4. make test

Building has been tested on Linux and Mac systems with gcc 4.8.1 and clang 3.4.
Some features of C++11 are used, so recent versions of gcc and clang are required.

Updating

vt is currently under heavy development, so you will probably need to update often.

 #remove all object files
 #you need to do this as the source files and the static libraries might have changed, and stale object files need to be removed.
 1. make clean 

 #update source files
 2. git pull 

 #compile and link, the -j option tells make to run up to 40 independent commands in parallel
 3. make -j 40

General Features and notes

Common options

   -i   multiple intervals in <seq>:<start>-<end> format delimited by commas.

   -I   multiple intervals in <seq>:<start>-<end> format listed in a text file line by line.

   -o   defines the output file, with STDOUT as the default.
        vt recognizes the appropriate output by file extension.
        <name>.vcf     - uncompressed VCF
        <name>.vcf.gz  - compressed VCF
        <name>.bcf     - BCF
        You may modify the STDOUT to output the binary version of the format.  Uncompressed
        VCF and BCF streams are indicated by - and + respectively.

   -f  filter expression

   -s  sequential region selection as opposed to random access of regions specified by the -i option.
       This is useful when you want to select many close-by regions.  While the -i option works,
       it is less efficient and also selects a variant multiple times if it overlaps 2 regions.  This
       option iterates through the variants in the file sequentially and checks for overlap with the
       bed file given.
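
The difference between random access by interval and sequential selection can be illustrated with a minimal Python sketch. This is not vt's implementation: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and only the full <seq>:<start>-<end> form is parsed.

```python
# Illustrative sketch only; not vt's implementation.
# Assumption: intervals are 1-based and inclusive, written as <seq>:<start>-<end>.

def parse_intervals(text):
    """Parse a comma-delimited list such as '20:100-200,20:201-300'."""
    intervals = []
    for item in text.split(","):
        seq, span = item.rsplit(":", 1)
        start, end = span.split("-")
        intervals.append((seq, int(start), int(end)))
    return intervals

def overlaps(variant, interval):
    """variant and interval are (seq, start, end); True if they share a base."""
    return (variant[0] == interval[0]
            and variant[1] <= interval[2]
            and interval[1] <= variant[2])

def select_by_random_access(variants, intervals):
    """One query per interval: a variant spanning 2 intervals is returned twice."""
    return [v for iv in intervals for v in variants if overlaps(v, iv)]

def select_by_streaming(variants, intervals):
    """One pass over the variants: each variant is returned at most once."""
    return [v for v in variants if any(overlaps(v, iv) for iv in intervals)]

intervals = parse_intervals("20:100-200,20:201-300")
variants = [("20", 50, 50), ("20", 198, 203), ("20", 250, 250)]
print(select_by_random_access(variants, intervals))
print(select_by_streaming(variants, intervals))
```

In this toy input the variant at 20:198-203 straddles both intervals, so the per-interval query reports it twice while the single pass reports it once.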

Uncompressed BCF streams

htslib is designed with BCF as the underlying data structure and it has incorporated awareness of uncompressed BCF streams in the i/o API. One may use this feature to stream uncompressed BCF records to save on computational time spent on (de)compression.

 #using textual VCF streams indicated by -
 cat mills.vcf | vt normalize - -r hs37d5.fa | vt uniq - -o out.bcf

 #using uncompressed BCF streams indicated by +
 cat mills.vcf | vt normalize - -r hs37d5.fa -o + | vt uniq + -o out.bcf

In this example, the former took 0.84s while the latter took 0.64s to process. (24% speed up!)

Filters

For some programs, you may define a filter via the -f option.

 #this analyses only passed biallelic indels on chromosome 20.
 vt profile_na12878 vt.bcf -g na12878.reference.txt -r genome.fa -f "N_ALLELE==2&&VTYPE==INDEL&&PASS"  -i 20

Other examples of filters

 #all variants with a SNP in them
 VTYPE&SNP
 #Simple insertions of length 1
 VTYPE==INDEL&&DLEN==1
 #Indels of length 1
 VTYPE==INDEL&&LEN==1

 Variant characteristics
   VTYPE,N_ALLELE,DLEN,LEN,VARIANT_CONTAINS_N

 Variant value types
   SNP,MNP,INDEL,CLUMPED

 Biallelic SNPs only                         : VTYPE==SNP&&N_ALLELE==2
 Biallelic Indels with embedded SNP          : VTYPE==(SNP|INDEL)&&N_ALLELE==2
 Biallelic variants involving insertions     : VTYPE&INDEL&&DLEN>0&&N_ALLELE==2
 Biallelic variants involving 1bp variants   : LEN==1&&N_ALLELE==2
 Variants with explicit sequences with no Ns : ~VARIANT_CONTAINS_N

 QUAL field
   QUAL

 FILTER fields
   PASS, FILTER.<tag>

 INFO fields
   INFO.<tag>

 Passed biallelic SNPs only                  : PASS&&VTYPE==SNP&&N_ALLELE==2
 Passed Common biallelic SNPs only           : PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005
 Passed Common biallelic SNPs or rare indels : (PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005)||(VTYPE&INDEL&&INFO.AF<=0.005)
 Passed Common biallelic SNPs or rare indels with quality greater than 100 :
   ((PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005)||(VTYPE&INDEL&&INFO.AF<=0.005))&&QUAL>100
 Failed rare variants : ~PASS&&(INFO.AC/INFO.AN<0.005)

 Regular expression matching, PERL style (implemented with pcre2)
 Sometimes an INFO field contains several values in a string with functional annotation.  To match what you want,
 just use INFO.ANNO=~'<perl regular expression>'

 Passed variants in intergenic regions or UTR                    : PASS&&INFO.ANNO=~'Intergenic|UTR'
 Passed variants in intergenic regions or UTR ignoring case      : PASS&&INFO.ANNO=~'(?i)Intergenic|UTR' 

 pcre2's '(?i)Intergenic|UTR' is equivalent to PERL's '/intergenic|UTR/i'

 Operations
 == : equivalence for strings and numbers
 != : not equal
 =~ : regular expression match for strings only
 ~~ : negation of =~, equivalent to PERL's !~.  This notation is used because BASH interprets ! as a history expansion
 ~  : logical not
 && : logical and
 || : logical or
 &  : bitwise and
 |  : bitwise or
 +  : add
 -  : subtract
 *  : multiply
 /  : divide
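
The difference between VTYPE==INDEL and VTYPE&INDEL in the examples above follows from treating the variant type as a set of bit flags. The Python sketch below illustrates the idea; the flag values and function names are hypothetical and are not taken from the vt source code.

```python
# Illustrative sketch only. The flag values below are hypothetical;
# they are not taken from the vt source code.
SNP, MNP, INDEL, CLUMPED = 1, 2, 4, 8

def is_exactly(vtype, flags):
    """Mirrors VTYPE==<flags>: the variant has exactly these types."""
    return vtype == flags

def contains(vtype, flag):
    """Mirrors VTYPE&<flag>: the variant has this type, possibly among others."""
    return (vtype & flag) != 0

# A variant that is both a SNP and an indel, e.g. an indel with an embedded SNP.
snp_and_indel = SNP | INDEL

print(is_exactly(snp_and_indel, INDEL))        # False
print(contains(snp_and_indel, INDEL))          # True
print(is_exactly(snp_and_indel, SNP | INDEL))  # True
```

Under this reading, == selects variants of one exact type combination, while & selects every variant that involves the given type.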

The following programs support filter expressions.

view
peek
profile_snps
profile_indels
profile_na12878
profile_mendelian
profile_len
profile_chrom
profile_afs
profile_hwe
concordance
partition

Alternate headers

 As BCF is a restrictive format of VCF where all meta data must be present in the header, 
 vt provides a mechanism to read an alternative header for VCF files that do not have a 
 well formed header.  Simply provide a header file stub named as <vcf-file>.hdr and vt
 will automatically read it instead of the original header in <vcf-file>.

 For more information about VCF/BCF : http://samtools.github.io/hts-specs/VCFv4.2.pdf

 This mechanism is available only if one is reading VCF or compressed VCF files.  It is
 disabled for BCF files as this might corrupt the BCF file because the encoding of the 
 fields in BCF records is based on the order of the meta info lines in the header.

 Note: BCF2.2 introduces the IDX field in meta information lines that indicates the 
 dictionary encoding. This feature might be enabled for BCF files in the future.

General cases of Ploidy and Alleles

 I am trying to make vt handle general cases of ploidy and alleles.  
 Please let me know if that is lacking in a tool that you are using.

BCF Compression Levels vs Compression Time

The zlib deflation algorithm (a variant of LZ77) has 10 levels - 0 to 9. 0 has no compression but instead wraps
up the file in zlib or bgzf blocks. It may be useful to have 0 compression as it is indexable with the same mechanism
used for compressed files. Levels 1-9 denote an increasing compression level in exchange for longer times for
compression.

In general, zlib compression of BCF files gains little in file size beyond level 1, while compression time keeps
increasing, as shown in the following table:

Compression Level	Size	Time
0	153GB	45m
1	98.4GB	2h3m
2	98.0GB	2h7m
3	97.5GB	2h12m
4	95.5GB	2h26m
5	95.2GB	2h54m
6 (default)	94.9GB	3h19m
7	94.8GB	3h41m
8	94.76GB	4h5m
9	94.75GB	4h25m

So, it might be a good idea to compress at lower levels when dealing with large temporary
files in a pipeline to save compute time. This can be achieved with the -c option in vt view.
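
The trade-off can be quantified directly from the table values. The following Python sketch is only an illustration; the helper names are hypothetical and the numbers are copied from the table above.

```python
# Sizes (GB) and times copied from the table above, indexed by compression level.
sizes_gb = [153, 98.4, 98.0, 97.5, 95.5, 95.2, 94.9, 94.8, 94.76, 94.75]
times = ["45m", "2h3m", "2h7m", "2h12m", "2h26m",
         "2h54m", "3h19m", "3h41m", "4h5m", "4h25m"]

def to_minutes(text):
    """Convert strings such as '2h3m' or '45m' to minutes."""
    hours, _, rest = text.rpartition("h")
    return int(hours or 0) * 60 + int(rest.rstrip("m"))

def compare(level_a, level_b):
    """Return (relative size change, relative time change) going from a to b."""
    size = (sizes_gb[level_b] - sizes_gb[level_a]) / sizes_gb[level_a]
    time_a, time_b = to_minutes(times[level_a]), to_minutes(times[level_b])
    return size, (time_b - time_a) / time_a

size_change, time_change = compare(6, 1)
print(f"level 6 -> 1: size {size_change:+.1%}, time {time_change:+.1%}")
```

For this data set, dropping from the default level 6 to level 1 grows the file by under 4% while cutting compression time by more than a third.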

VCF Manipulation

View

Views a VCF or VCF.GZ or BCF file.

  #views mills.bcf and outputs to standard out
  vt view -h mills.bcf

  #views mills.bcf and locally sorts it in a 10000bp window and outputs to sorted-mills.bcf
  vt view -h -w 10000 mills.bcf -o sorted-mills.bcf

  #views mills.bcf and outputs to c1-mills.bcf with a compression level of 1.  By default, 
  #the compression level is 6 where lower levels compress the file less but are faster. 
  #The difference in compression for BCF files between level 1 and level 9 is about 5% of 
  #a level 1 compression file.  The difference in time taken is about an additional 50%
  #of a level 1 compression.  The levels range from 0 to 9 where 0 means no compression 
  #but the file is encapsulated in bgzf blocks that allow the file to be indexed.  A special 
  #level -1 denotes an uncompressed BCF file that is not encapsulated in bgzf blocks and 
  #is thus not indexable but is highly suitable for streaming between vt commands.
  vt view -h mills.bcf -c 1 -o c1-mills.bcf

  #views mills.bcf and selects variants that overlap with the regions found in dust.bed from chromosome 20
  #the -t option selects variants by checking if each variant overlaps with the regions in the bed file, this is
  #as opposed to random accessing the variants via the index through the intervals defined in -i and -I options.
  #this is useful when selecting variants from the target regions from an exome sequencing experiment.
  vt view mills.bcf -t dust.bed -i 20

 usage : vt view [options] <in.vcf>

 options : -o  output VCF/VCF.GZ/BCF file [-]
           -f  filter expression []
           -w  local sorting window size [0]
           -s  print site information only without genotypes [false]
           -H  print header only, this option is honored only for STDOUT [false]
           -h  omit header, this option is honored only for STDOUT [false]
           -p  print options and summary []
           -r  right window size for overlap []
           -l  left window size for overlap []
           -c  compression level 0-9, 0 and -1 denotes uncompressed with the former being wrapped in bgzf. [6]
           -t  bed file for variant selection via streaming []
           -I  file containing list of intervals []
           -i  intervals []
           -?  displays help

Index

Indexes a VCF.GZ or BCF file.

  #indexes mills.bcf
  vt index mills.bcf 
  #indexes mills.vcf.gz
  vt index mills.vcf.gz

 usage : vt index [options] <in.vcf>

 options : -p  print options and summary []
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help

Sorting

Sorting may be done in 3 ways.

Locally:
Performs sorting within a local window. The window size may be set by the -w option. The default window size
is 1000bp; if a record is detected to be potentially out of order because the window is too small, it will be reported.
Use this when your VCF records are grouped by chromosome but not ordered in short stretches.

By chromosome:
Your VCF file is not ordered by the chromosomes in the header but is fully ordered within each chromosome.
The VCF file should be indexed and vt will output the records in the order of chromosomes given in the header.

Full sort [default option]:
No assumptions are made about the VCF file. Records will be ordered by the order of contigs in the header.
Smaller temporary ordered files are created and their names are <output_vcf>.<no>.bcf and after generating
these files, they are merged and output into <output_vcf>.
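
The local mode described above can be sketched with a priority queue. This Python sketch is not vt's implementation; the function name is hypothetical and it handles records from a single chromosome only.

```python
import heapq

def local_sort(records, window=1000):
    """Sort (pos, payload) records from one chromosome within a sliding window.

    Returns the reordered records and the number of records that arrived too
    late to be placed correctly, which suggests the window is too small.
    """
    heap, ordered = [], []
    max_seen = 0      # furthest position read so far
    last_emitted = 0  # furthest position already written out
    late = 0
    for pos, payload in records:
        if pos < last_emitted:
            late += 1  # something beyond this position was already emitted
        heapq.heappush(heap, (pos, payload))
        max_seen = max(max_seen, pos)
        # Records more than `window` bp behind the furthest position seen
        # are assumed to be safe to emit.
        while heap and heap[0][0] < max_seen - window:
            record = heapq.heappop(heap)
            last_emitted = max(last_emitted, record[0])
            ordered.append(record)
    while heap:
        ordered.append(heapq.heappop(heap))
    return ordered, late

records = [(100, "a"), (50, "b"), (3000, "c"), (2500, "d"), (60, "e")]
print(local_sort(records, window=1000))
print(local_sort(records, window=10000))
```

With the 1000bp window the record at position 60 arrives after position 100 has already been written, so it is counted as late; the larger window holds everything back long enough to order it correctly.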

  #sorts mills.bcf and outputs to standard out in a 1000bp window.
  vt sort -m local mills.bcf 
  #sorts mills.bcf and locally sorts it in a 10000bp window and outputs to out.bcf
  vt sort -m local -w 10000 mills.bcf -o out.bcf 
  #sorts an indexed mills.bcf  with chromosomes not sorted in the contig order in the header 
  vt sort -m chrom  mills.bcf -o out.bcf 
  #sorts mills.bcf with no assumption
  vt sort mills.bcf -o out.bcf

 usage : vt sort [options] <in.vcf>

 options : -m  sorting modes. [full]
               local : locally sort within a 1000bp window.  Window size may be set by -w.
               chrom : sort chromosomes based on order of contigs in header.
                       input must be indexed.
               full  : full sort with no assumptions.
           -o  output VCF/VCF.GZ/BCF file. [-]
           -w  local sorting window size, set by default to 1000 under local mode. [0]
           -p  print options and summary. []
           -?  displays help

Normalization

Normalizes variants in a VCF file (Tan et al. 2015). Normalized variants may have their positions changed; in such cases, the normalized variants are reordered and output in sorted order. The local reordering takes place over a window of 10000 base pairs, which may be changed via the -w option. vt assumes that the REF field is consistent with the reference sequence used; it checks for this and fails if a reference inconsistency is encountered. This check may be relaxed with the -n option.
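
A minimal Python sketch of left alignment and trimming, written in the spirit of the algorithm described by Tan et al. 2015, is shown here. It is not vt's code: the function name is hypothetical, the reference is a plain string for a single chromosome, and the alleles are assumed to differ from each other.

```python
def normalize(pos, alleles, reference):
    """Left-align and trim a variant.

    pos       : 1-based position of the first REF base
    alleles   : [REF, ALT1, ...] as explicit sequences
    reference : sequence of the chromosome as a string
    """
    alleles = list(alleles)
    if reference[pos - 1:pos - 1 + len(alleles[0])] != alleles[0]:
        raise ValueError("REF is inconsistent with the reference sequence")
    changed = True
    while changed:
        changed = False
        # Right trim: drop the last base while all alleles end with the same base.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, extend all alleles one base to the left.
        if any(len(a) == 0 for a in alleles):
            if pos == 1:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            alleles = [reference[pos - 1] + a for a in alleles]
            changed = True
    # Left trim: drop shared leading bases while every allele keeps >= 1 base.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles

# Toy reference built so that the deletion below can shift 9 bases to the left.
toy = "ATGCCCAGCCCCACCCCAGCCCCTT"
print(normalize(12, ["CACCCCAGCCCC", "C"], toy))
print(normalize(2, ["TC", "TG"], "ATCG"))
```

The toy inputs mirror the two annotated records in the sample output of this section: a deletion that moves 9 bases to the left, and a SNP written with a redundant leading base that is left trimmed.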

  #normalize variants and write out to dbsnp.normalized.vcf
  vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf

  #normalize variants, send to standard out and remove duplicates.
  vt normalize dbsnp.vcf -r seq.fa | vt uniq - -o dbsnp.normalized.uniq.vcf

  #read in variants that do not contain N in the explicit alleles, normalize variants, send to standard out.
  vt normalize dbsnp.vcf -r seq.fa -f "~VARIANT_CONTAINS_N"

  #variants that are normalized will be annotated with an OLD_VARIANT info tag.
  #CHROM  POS      ID   REF           ALT  QUAL  FILTER  INFO
  19	  29238772 .	C             G    .     PASS	 VT=SNP;OLD_VARIANT=19:29238771:TC/TG
  20	  60674709 .	GCCCAGCCCCAC  G    .     PASS	 VT=INDEL;OLD_VARIANT=20:60674718:CACCCCAGCCCC/C

  #this shows a sample output with the normalization operations that were used 
  #categorized into 5 categories each for biallelic and multiallelic variants. 

  stats: biallelic
         no. left trimmed                      : 156908
         no. right trimmed                     : 323
         no. left and right trimmed            : 33
         no. right trimmed and left aligned    : 7
         no. left aligned                      : 12360 

      total no. biallelic normalized           : 169631 
 

      multiallelic
         no. left trimmed                      : 627189
         no. right trimmed                     : 2509
         no. left and right trimmed            : 1498
         no. right trimmed and left aligned    : 212
         no. left aligned                      : 1783 

      total no. multiallelic normalized        : 633191 

      total no. variants normalized            : 802822
      total no. variants observed              : 88052639

  usage : vt normalize [options] <in.vcf>

 options : -o  output VCF file [-]
           -d  debug [false]
           -q  do not print options and summary [false]
           -m  warns but does not exit when REF is inconsistent
               with masked reference sequence for non SNPs.
               This overides the -n option [false]
           -n  warns but does not exit when REF is inconsistent
               with reference sequence for non SNPs [false]
           -w  window size for local sorting of variants [10000]
           -I  file containing list of intervals []
           -i  intervals []
           -r  reference sequence fasta file []
           -?  displays help

Decompose biallelic block substitutions

Decomposes biallelic block substitutions into their constituent SNPs.
There is now an additional option -a which decomposes non block substitutions into their constituent SNPs and indels. (kindly added by [holtgrewe@github])
In that case there is no exact solution, and the decomposition is a best guess based on a Needleman-Wunsch alignment.
You might also want to check out vcfallelicprimitives.
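
For the default mode, where REF and ALT have the same length, the decomposition amounts to emitting one SNP per mismatching position. The Python sketch below illustrates this; the function name is hypothetical and the alignment-based -a mode is not covered.

```python
def decompose_blocksub(pos, ref, alt):
    """Split an equal-length substitution into SNPs at the mismatching positions.

    Returns a list of (pos, ref_base, alt_base) tuples with 1-based positions.
    """
    if len(ref) != len(alt):
        raise ValueError("not a block substitution: REF and ALT differ in length")
    return [(pos + i, r, a)
            for i, (r, a) in enumerate(zip(ref, alt))
            if r != a]

print(decompose_blocksub(763837, "CA", "TG"))
print(decompose_blocksub(159030, "TAACCTTTC", "TGACCTTTT"))
```

The two calls use the block substitutions from the examples in this section; matching bases inside the block produce no record.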

  #decomposes biallelic block substitutions and write out to decomposed_blocksub.vcf
  vt decompose_blocksub gatk.vcf -o decomposed_blocksub.vcf 

  #before decomposition
  #CHROM  POS     ID    REF     ALT     QUAL    FILTER  INFO            FORMAT  S1                                                                          
  20	  763837  .	CA	TG	50340.1	PASS	AC=1;AN=2	GT	0|1	

  #after decomposition
  #CHROM  POS     ID    REF     ALT     QUAL    FILTER  INFO                                    FORMAT  S1         
  20	  763837  .	C	T	50340.1	PASS	AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG	GT	0|1
  20	  763838  .	A	G	50340.1	PASS	AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG	GT	0|1

  #decomposes biallelic clumped variant and write out to decomposed_blocksub.vcf
  vt decompose_blocksub -a gatk.vcf -o decomposed_blocksub.vcf 

  #before decomposition
  #CHROM  POS     ID    REF     ALT     QUAL    FILTER  INFO            FORMAT  S1                                                                          
  20	  763837  .	CG	TGA	50340.1	PASS	AC=1;AN=2	GT	0|1	

  #after decomposition
  #CHROM  POS     ID    REF     ALT     QUAL    FILTER  INFO                                    FORMAT  S1         
  20	  763837  .	C	T	50340.1	PASS	AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA	GT	0|1
  20	  763838  .	G	GA	50340.1	PASS	AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA	GT	0|1

  #decomposes biallelic clumped variant and write out to decomposed_blocksub.vcf and add phase set information in the genotype fields
  vt decompose_blocksub -p gatk.vcf -o decomposed_blocksub.vcf 

  #before decomposition
  #CHROM  POS	    ID	  REF        ALT        QUAL	FILTER	INFO	                                        FORMAT	tumor	     normal
  1	  159030    .	  TAACCTTTC  TGACCTTTT  0.04	.	AF=0.5	                                        GT      0/0          1/1   
                                                               
  #after decomposition
  1	  159031    .	  A          G	        0.04	.	AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT	GT:PS	0|0:159031   1|1:159031
  1	  159038    .	  C          T	        0.04	.	AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT	GT:PS	0|0:159031   1|1:159031

  description : decomposes biallelic block substitutions into its constituent SNPs. 

  usage : vt decompose_blocksub [options] <in.vcf> 

  options : -a  enable aggressive/alignment mode
            -o  output VCF file [-]
            -I  file containing list of intervals []
            -i  intervals []
            -?  displays help

Decompose

Decomposes multiallelic variants in a VCF file. If the VCF file has genotype fields GT, PL, GL or DP, they are modified to reflect the change in alleles. All other genotype fields are removed. The -s option will retain the fields and decompose fields with counts of R and A accordingly.

Decomposition and combining variants is a complex operation where the correctness is dependent on [tfarrah@github]:

whether the observed variants are seen in the same sample,
if same sample, whether they are homozygous or heterozygous,
if both heterozygous, whether they are in the same haplotype or not (if known).

and one should be aware of the issues in handling variants resulting from such operations.
The original purpose of this tool is to allow for allelic comparisons between call sets. [example of a problem caused in combining separate variant records]

  #decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf
  vt decompose gatk.vcf -o gatk.decomposed.vcf 

  #before decomposition
  #CHROM  POS     ID   REF     ALT         QUAL   FILTER  INFO                  FORMAT    S1                                     S2                                                                          
  1       3759889 .    TA      TAA,TAAA,T  .      PASS    AF=0.342,0.173,0.037	GT:DP:PL	  1/2:81:281,5,9,58,0,115,338,46,116,809	 0/0:86:0,30,323,31,365,483,38,291,325,567	

  #after decomposition
  #CHROM  POS     ID   REF     ALT         QUAL   FILTER  INFO                                        FORMAT   S1               S2             
  1	  3759889 .    TA      TAA	   .	  PASS    OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    1/.:281,5,9      0/0:0,30,323	
  1	  3759889 .    TA      TAAA        .      .       OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./1:281,58,115   0/0:0,31,483	
  1	  3759889 .    TA      T           .      .       OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./.:281,338,809  0/0:0,38,567	

  One might want to post process the partial genotypes like 1/. to the best guess genotype based on the PL values.
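
The PL values in the decomposed records above are consistent with picking, for each alternate allele, the three entries of the original PL vector that involve only the reference and that allele. The Python sketch below illustrates this using the diploid genotype ordering from the VCF specification, together with one possible best guess post-processing step. The function names are hypothetical, and this is not claimed to be vt's implementation.

```python
def pl_index(j, k):
    """Index of the diploid genotype j/k in a PL or GL vector (VCF ordering)."""
    j, k = min(j, k), max(j, k)
    return k * (k + 1) // 2 + j

def biallelic_pl(pl, allele):
    """Extract PLs for 0/0, 0/allele and allele/allele from a multiallelic PL."""
    return [pl[pl_index(0, 0)],
            pl[pl_index(0, allele)],
            pl[pl_index(allele, allele)]]

def best_guess_genotype(pl3):
    """Pick the genotype with the smallest PL among 0/0, 0/1 and 1/1."""
    return ["0/0", "0/1", "1/1"][pl3.index(min(pl3))]

# PL of sample S1 in the multiallelic record above (alleles TA, TAA, TAAA, T).
s1 = [281, 5, 9, 58, 0, 115, 338, 46, 116, 809]
for allele in (1, 2, 3):
    pl3 = biallelic_pl(s1, allele)
    print(allele, pl3, best_guess_genotype(pl3))
```

Note that the best guess is made within each biallelic record on its own, so it cannot recover the information that the sample originally carried two different alternate alleles.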

  #decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf with the -s option.
  #-s option splits up INFO and GENOTYPE fields that have number counts of R and A [VCFv4.2 section 1.2.2] appropriately.
  vt decompose -s gatk.vcf -o gatk.decomposed.vcf 

  #before decomposition
  #CHROM  POS     ID   REF     ALT         QUAL   FILTER  INFO                  FORMAT    S1                                     S2                                                                          
  1       3759889 .    TA      TAA,TAAA,T  .      PASS    AF=0.342,0.173,0.037	GT:DP:PL	  1/2:81:281,5,9,58,0,115,338,46,116,809	 0/0:86:0,30,323,31,365,483,38,291,325,567	

  #after decomposition
  #CHROM  POS     ID   REF     ALT         QUAL   FILTER  INFO                                                 FORMAT   S1               S2           
  1	  3759889 .    TA      TAA	   .	  PASS    AF=0.342;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    1/.:281,5,9      0/0:0,30,323	
  1	  3759889 .    TA      TAAA        .      .       AF=0.173;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./1:281,58,115   0/0:0,31,483	
  1	  3759889 .    TA      T           .      .       AF=0.037;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./.:281,338,809  0/0:0,38,567	

  In general, you should recompute fields that involve alleles after decomposition.  Information is generally lost after vertically decomposing a variant, so care should be taken
  in interpreting the resultant values.

  description : decomposes multiallelic variants into biallelic in a VCF file. 

  usage : vt decompose [options] <in.vcf> 

  options : -s  smart decomposition [false]
            -o  output VCF file [-]
            -I  file containing list of intervals []
            -i  intervals []
            -?  displays help

Drop duplicate variants

Drops duplicate variants that appear later in the file.
If there are OLD_VARIANT tags in the INFO field, the variants in these tags are aggregated in the unique record retained.
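
The behaviour described above can be sketched in Python as follows. This is an illustration only: records are modelled as dictionaries with hypothetical keys, and everything is held in memory, which is not necessarily how vt processes a file.

```python
def uniq(records):
    """Keep the first record for each (chrom, pos, ref, alt) and merge the
    OLD_VARIANT values of later duplicates into the record that is kept."""
    kept = {}
    for rec in records:
        key = (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])
        olds = rec.get("old_variant", [])
        if key not in kept:
            kept[key] = dict(rec, old_variant=list(olds))
        else:
            for old in olds:
                if old not in kept[key]["old_variant"]:
                    kept[key]["old_variant"].append(old)
    return list(kept.values())

records = [
    {"chrom": "19", "pos": 29238772, "ref": "C", "alt": "G",
     "old_variant": ["19:29238771:TC/TG"]},
    {"chrom": "19", "pos": 29238772, "ref": "C", "alt": "G"},
    {"chrom": "19", "pos": 29238772, "ref": "C", "alt": "T"},
]
for rec in uniq(records):
    print(rec)
```

Records are treated as duplicates only when chromosome, position, REF and ALT all match, so the third record with a different ALT is retained.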

  #drop duplicate variants and save output in mills.uniq.vcf
  vt uniq mills.vcf -o mills.uniq.vcf

  usage : vt uniq [options] <in.vcf>

  options : -o  output VCF file [-]
            -I  file containing list of intervals []
            -i  intervals []
            -?  displays help

Paste

Pastes VCF files like the Unix paste command.

 Input requirements and assumptions:
     1. Same variants are represented in the same order for each file (required)
     2. Genotype field order are the same for corresponding records (required)
     3. Sample names are different in all the files (warning will be given if not)
     4. Headers are the same for all the files (assumption, not checked, will fail if output is BCF)
 Outputs:
     1. INFO fields output will be that of the first file
     2. Genotype fields are the same for corresponding records

  #paste together genotypes from the CEU trio into one file.
  vt paste NA12878.mills.bcf NA12891.mills.bcf NA12892.mills.bcf -o ceu_trio.bcf

 usage : vt paste [options] <in1.vcf>...
 
 options : -L  file containing list of input VCF files
           -o  output VCF file [-]
           -p  print options and summary []
           -?  displays help

Concatenate

Concatenates VCF files. Assumes individuals are in the same order and files share the same header.

  #concatenates chr1.mills.bcf and chr2.mills.bcf
  vt cat chr1.mills.bcf chr2.mills.bcf -o mills.bcf

  #concatenates chr1.mills.bcf and chr2.mills.bcf with the naive option.
  #The naive option assumes that the headers are all the same and skips 
  #merging headers and translating encodings between BCF files.   This is 
  #a much faster option if you know the nature of your BCF files in advance.
  vt cat -n chr1.mills.bcf chr2.mills.bcf -o mills.bcf

 usage : vt cat [options] <in1.vcf>...

 options : -s  print site information only without genotypes [false]
           -p  print options and summary [false]
           -n  naive, assumes that headers are the same. [false]
           -w  local sorting window size [0]
           -f  filter expression []
           -L  file containing list of input VCF files
           -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals
           -?  displays help

Remove info tags

Removes INFO tags from a VCF file

  #removes the INFO tags OLD_VARIANT, ENTROPY, PSCORE and COMP 
  vt rminfo exact.del.bcf -t OLD_VARIANT,ENTROPY,PSCORE,COMP -o rm.bcf

 usage : vt rminfo [options] <in.vcf>

 options : -o  output VCF file [-]
           -q  do not print options and summary [false]
           -t  list of info tags to be removed []
           -I  file containing list of intervals []
           -i  intervals []
           -?  displays help

Validate

Checks the following properties of a VCF file:

order
reference sequence consistency

  #validates lobstr.bcf
  vt validate lobstr.bcf

 usage : vt validate [options] <in.vcf>
 
 options : -q  do not print invalid records [false]
           -I  file containing list of intervals []
           -i  intervals []
           -r  reference sequence fasta file []
           -?  displays help

VCF Inspection and Evaluation

Peek

Summarizes the variants in a VCF file

  #summarizes the variants found in mills.vcf
  vt peek mills.vcf

 usage : vt peek [options] <in.vcf>

 options : -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals []
           -r  reference sequence fasta file []
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help

For a more detailed guide on variant classification.

#This is a sample output of a peek command which summarizes the variants found in a VCF file.
  stats: no. of samples                     :          0
         no. of chromosomes                 :         22

         ========== Micro variants ==========

         no. of SNPs                        :   77228885
             2 alleles (ts/tv)              :        77011302 (2.11) [52287790/24723512]
             3 alleles (ts/tv)              :          216560 (0.75) [185520/247600]
             4 alleles (ts/tv)              :            1023 (0.50) [1023/2046]

         no. of MNPs                        :          0
             2 alleles (ts/tv)              :               0 (-nan) [0/0]
             >=3 alleles (ts/tv)            :               0 (-nan) [0/0]

         no. Indels                         :    2147564
             2 alleles (ins/del)            :         2124842 (0.47) [683250/1441592]
             >=3 alleles (ins/del)          :           22722 (2.12) [32411/15286]

         no. SNP/MNP                        :          0
             3 alleles (ts/tv)              :               0 (-nan) [0/0] 
             >=4 alleles (ts/tv)            :               0 (-nan) [0/0] 

         no. SNP/Indels                     :      12913
             2 alleles (ts/tv) (ins/del)    :             412 (0.41) [120/292] (3.68) [324/88]
             >=3 alleles (ts/tv) (ins/del)  :           12501 (0.43) [7670/17649] (18.64) [12434/667]

         no. MNP/Indels                     :        153
             2 alleles (ts/tv) (ins/del)    :               0 (-nan) [0/0] (-nan) [0/0]
             >=3 alleles (ts/tv) (ins/del)  :             153 (0.30) [138/465] (0.27) [67/248]

         no. SNP/MNP/Indels                 :          2
             3 alleles (ts/tv) (ins/del)    :               0 (-nan) [0/0] (-nan) [0/0]
             4 alleles (ts/tv) (ins/del)    :               2 (0.00) [3/5] (1.00) [3/3]
             >=5 alleles (ts/tv) (ins/del)  :               0 (-nan) [0/0] (-nan) [0/0]

         no. of clumped variants            :      19025
             2 alleles                      :               0 (-nan) [0/0] (-nan) [0/0]
             3 alleles                      :           18508 (0.16) [12152/75366] (0.00) [93/18653]
             4 alleles                      :             451 (0.15) [369/2390] (0.33) [201/609]
             >=5 alleles                    :              66 (0.09) [37/414] (1.19) [107/90]

         ====== Other useful categories =====

         no. complex variants               :      32093
             2 alleles (ts/tv) (ins/del)    :             412 (0.41) [120/292] (3.68) [324/88]
             >=3 alleles (ts/tv) (ins/del)  :           31681 (0.21) [20369/96289] (0.64) [12905/20270]

         ======= Structural variants ========

         no. of structural variants         :      41217
             2 alleles                      :           38079
                 deletion                   :                13135
                 insertion                  :                16451
                    mobile element          :                    16253
                       ALU                  :                        12513
                       LINE1                :                         2911
                       SVA                  :                          829
                    numt                    :                      198
                 duplication                :                  664
                 inversion                  :                  100
                 copy number variation      :                 7729
             >=3 alleles                    :            3138
                 copy number variation      :                 3138 

         ========= General summary ========== 

         no. of reference                   :          0 

         no. of observed variants           :   79449759
         no. of unclassified variants       :          0

Partition

Partition variants from two data sets.

 Please note that this only works if the contigs in the headers of both data sets are the same.

  #partitions all variants in bi1.bcf  and bi2.bcf
  vt partition bi1.bcf bi2.bcf

 Options:     input VCF file a   bi1.bcf
              input VCF file b   bi2.bcf 

   A:      504676 variants
   B:     1389333 variants 

                  ts/tv  ins/del
   A-B      37564 [0.19] [1.34]
   A&B     467112 [1.55] [0.72]
   B-A     922221 [1.20] [0.58]
   of A     92.6%
   of B     33.6%

  #partitions only passed variants in bi1.bcf and bi2.bcf
  vt partition bi1.bcf bi2.bcf -f PASS

 Options:     input VCF file a   bi1.bcf
              input VCF file b   bi2.bcf 
              [f] filter             PASS 

   A:      466148 variants
   B:      986056 variants 

                  ts/tv  ins/del
   A-B      47261 [0.44] [1.36]
   A&B     418887 [1.80] [0.68]
   B-A     567169 [1.43] [0.72]
   of A     89.9%
   of B     42.5%
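
The counts and percentages in these reports follow from simple set operations, where "of A" and "of B" give the shared variants as a fraction of each input. A minimal Python sketch, with a hypothetical function name and variants represented as hashable keys, is:

```python
def partition(a, b):
    """Partition two collections of variant keys and report the overlap."""
    a, b = set(a), set(b)
    shared = a & b
    return {
        "A-B": len(a - b),
        "A&B": len(shared),
        "B-A": len(b - a),
        "of A": len(shared) / len(a),
        "of B": len(shared) / len(b),
    }

# Toy variant keys of the form (chrom, pos, ref, alt).
a = [("20", 100, "A", "T"), ("20", 200, "C", "G"), ("20", 300, "G", "GA")]
b = [("20", 200, "C", "G"), ("20", 300, "G", "GA"), ("20", 400, "T", "C"),
     ("20", 500, "AT", "A")]
print(partition(a, b))

# The percentages in the first report above can be rechecked from its counts.
print(f"of A {467112 / 504676:.1%}, of B {467112 / 1389333:.1%}")
```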

partition v0.5

description : partition variants. check the overlap of variants between 2 data sets.

 usage : vt partition [options] <in1.vcf><in2.vcf>

 options : -w  write partitioned variants to file
           -f  filter
           -I  file containing list of intervals []
           -i  intervals []
           -?  displays help

Multi Partition

Partitions variants found in VCF files.
Unlike the simple 2-way partition, this does not support writing the partitions to file or reporting the proportion of shared variants for each VCF.

  #partitions variants n-ways
  vt multi_partition hc.genotypes.bcf pl.genotypes.bcf st.genotypes.bcf

 Options:     input VCF file a   hc.genotypes.bcf
              input VCF file b   pl.genotypes.bcf
              input VCF file c   st.genotypes.bcf 

     A:       97274 variants
     B:       95458 variants
     C:       98943 variants 

                 no  [ts/tv] [ins/del]
     A--       3887  [1.10]  [0.86]
     -B-       7890  [1.45]  [0.98]
     AB-       4360  [0.99]  [1.32]
     --C       8277  [1.75]  [2.21]
     A-C       7458  [1.78]  [0.49]
     -BC       1639  [1.63]  [1.03]
     ABC      81569  [2.28]  [1.08] 

     Unique variants     :     115080
     Overall concordance :      70.88% (#intersection/#union)
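
The membership patterns and the overall concordance (#intersection/#union) can be reproduced with a few set operations. The Python sketch below is illustrative only; the function name is hypothetical and variants are represented as hashable keys.

```python
from collections import Counter

def multi_partition(*call_sets):
    """Count variants per membership pattern and compute overall concordance."""
    sets = [set(s) for s in call_sets]
    union = set().union(*sets)
    intersection = set.intersection(*sets)
    labels = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    counts = Counter()
    for variant in union:
        pattern = "".join(labels[i] if variant in s else "-"
                          for i, s in enumerate(sets))
        counts[pattern] += 1
    concordance = len(intersection) / len(union) if union else 0.0
    return counts, len(union), concordance

counts, unique, concordance = multi_partition({1, 2, 3}, {2, 3, 4}, {3, 4, 5})
print(dict(counts), unique, f"{concordance:.2%}")

# Rechecking the report above: ABC count divided by the number of unique variants.
print(f"{81569 / 115080:.2%}")
```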

 usage : vt multi_partition [options] <in1.vcf><in2.vcf>...
 options : -f  filter
           -I  file containing list of intervals []
           -i  intervals []
           -?  displays help
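
The summary lines of the sample output follow directly from the partition counts. The sketch below is illustrative only (the function name is hypothetical): it recomputes the per-file totals, the number of unique variants and the overall concordance (#intersection/#union) from the seven rows.

```python
def summarize(partitions):
    """partitions maps a membership string such as 'A-C' to a variant count."""
    n_files = len(next(iter(partitions)))
    # Every variant falls in exactly one partition, so the union is the sum.
    unique = sum(partitions.values())
    # Per-file totals: add up every partition in which the file takes part.
    totals = [
        sum(n for key, n in partitions.items() if key[i] != "-")
        for i in range(n_files)
    ]
    # The intersection is the partition whose key has no '-'.
    intersection = sum(n for key, n in partitions.items() if "-" not in key)
    concordance = 100.0 * intersection / unique
    return totals, unique, concordance


# Counts copied from the sample output above.
counts = {
    "A--": 3887, "-B-": 7890, "AB-": 4360, "--C": 8277,
    "A-C": 7458, "-BC": 1639, "ABC": 81569,
}
totals, unique, concordance = summarize(counts)
print(totals, unique, f"{concordance:.2f}%")
```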

Annotate Regions

Annotates regions in a VCF file. The BED file should be bgzipped and indexed with tabix.

  #annotates the variants that overlap with coding regions.
  vt annotate_regions mills.vcf -b coding.bed.gz -t CDS -d "Coding region"

  #annotates the variants that overlap with low complexity regions.
  vt annotate_regions mills.vcf -b mdust.bed.gz -t DUST -d "DUST Low Complexity Region"

 usage : vt annotate_regions [options] <in.vcf>

 options : -d  regions tag description []
           -t  regions tag []
           -b  regions BED file []
           -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals
           -?  displays help

Annotate Variants

Annotates variants in a VCF file. The GENCODE annotation file should be bgzipped and indexed with tabix. This is available in the vt resource bundle.

  #annotates the variants found in mills.vcf
  vt annotate_variants mills.vcf -r hs37d5.fa -g gencode.v19.annotation.gtf.gz

 #annotates variants with the following fields
 ##INFO=<ID=VT,Number=1,Type=String,Description="Variant Type - SNP, MNP, INDEL, CLUMPED"> 
 ##INFO=<ID=GENCODE_FS,Number=0,Type=Flag,Description="Frameshift INDEL">
 ##INFO=<ID=GENCODE_NFS,Number=0,Type=Flag,Description="Non Frameshift INDEL">

 usage : vt annotate_variants [options] <in.vcf>

 options : -g  GENCODE annotations GTF file []
           -r  reference sequence fasta file []
           -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals
           -?  displays help
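
The frameshift flags can be understood with the usual length rule: an indel inside coding sequence shifts the reading frame when the length difference between REF and ALT is not a multiple of 3. The sketch below assumes the indel is already known to lie in a coding region; how vt establishes that from the GENCODE file is not covered here, and the function names are hypothetical.

```python
def is_frameshift(ref, alt):
    """True if the indel changes the sequence length by a non-multiple of 3."""
    return abs(len(ref) - len(alt)) % 3 != 0


def classify_coding_indel(ref, alt):
    """Return the matching flag name, or None if the alleles have equal length."""
    if len(ref) == len(alt):
        return None
    return "GENCODE_FS" if is_frameshift(ref, alt) else "GENCODE_NFS"


print(classify_coding_indel("C", "CA"))    # 1 bp insertion
print(classify_coding_indel("CAGT", "C"))  # 3 bp deletion
```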

Compute Features

Computes features in a VCF file. Examples of statistics are allele counts, the genotype likelihood based inbreeding coefficient, and Hardy-Weinberg genotype likelihood based allele frequencies.

  #compute features for the variants found in vt.vcf
  #requires GT, PL and DP
  vt compute_features vt.vcf

 #annotates variants with the following fields
 ##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate Allele Counts">
 ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Number Allele Counts">
 ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
 ##INFO=<ID=AF,Number=A,Type=Float,Description="Alternate Allele Frequency">
 ##INFO=<ID=GC,Number=G,Type=Integer,Description="Genotype Counts">
 ##INFO=<ID=GN,Number=1,Type=Integer,Description="Total Number of Genotypes Counts">
 ##INFO=<ID=GF,Number=G,Type=Float,Description="Genotype Frequency">
 ##INFO=<ID=HWEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency assuming HWE">
 ##INFO=<ID=HWEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency assuming HWE">
 ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency">
 ##INFO=<ID=MLEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency">
 ##INFO=<ID=HWE_LLR,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg ln(Likelihood Ratio)">
 ##INFO=<ID=HWE_LPVAL,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic ln(p-value)">
 ##INFO=<ID=HWE_DF,Number=1,Type=Integer,Description="Degrees of freedom for Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic">
 ##INFO=<ID=FIC,Number=1,Type=Float,Description="Genotype likelihood based Inbreeding Coefficient">
 ##INFO=<ID=AB,Number=1,Type=Float,Description="Genotype likelihood based Allele Balance">

 usage : vt compute_features [options] <in.vcf>

 options : -s  print site information only without genotypes [false]
           -o  output VCF/VCF.GZ/BCF file [-]
           -f  filter expression []
           -I  File containing list of intervals
           -i  Intervals
           -?  displays help
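
The genotype based fields (AC, AN, NS, AF) are simple counts over the GT values. The following is a minimal sketch of that counting, assuming GT strings such as 0/1, 1|1 or ./.; it is not vt's code and the function name is hypothetical.

```python
import re


def site_counts(genotypes, n_alt=1):
    """Compute AC, AN, NS and AF from a list of GT strings."""
    ac = [0] * n_alt   # alternate allele counts, one per ALT allele
    an = 0             # total number of called alleles
    ns = 0             # number of samples with data
    for gt in genotypes:
        alleles = [a for a in re.split(r"[/|]", gt) if a != "."]
        if not alleles:
            continue   # fully missing genotype
        ns += 1
        an += len(alleles)
        for allele in alleles:
            index = int(allele)
            if index > 0:
                ac[index - 1] += 1
    af = [count / an for count in ac] if an else [float("nan")] * n_alt
    return {"AC": ac, "AN": an, "NS": ns, "AF": af}


print(site_counts(["0/0", "0/1", "1|1", "./."]))
```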

Estimate

 Computes variant based estimates.

 Examples of statistics are:
 * Allele counts
 * Hardy-Weinberg Genotype Likelihood based Allele Frequencies
 * Genotype Likelihood based Inbreeding Coefficient
 * Genotype Likelihood based Hardy-Weinberg test
 * Genotype Likelihood based Allele Balance

  #compute the AF and MLEAF estimates for the variants found in vt.vcf
  #requires GT and PL
  vt estimate -e AF,MLEAF vt.vcf

  AF         Genotype (GT) based allele frequencies
             If genotypes are unavailable, best guess
             genotypes are inferred based on genotype
             likelihoods (GL or PL)
             AC        : Alternate Allele counts
             AN        : Total allele counts
             NS        : No. of samples.
             AF        : Alternate allele frequencies.
  MLEAF      GL based allele frequency estimates
             MLEAF     : Alternate allele frequency derived from MLEGF
             MLEGF     : Genotype frequencies.
  HWEAF      GL based allele frequency estimates assuming HWE
             HWEAF     : Alternate allele frequencies
             HWEGF     : Genotype frequencies derived from HWEAF.
  HWE        GL based Hardy-Weinberg statistics.
             HWE_LLR   : log likelihood ratio
             HWE_LPVAL : log p-value
             HWE_DF    : degrees of freedom
  AB         GL based Allele Balance.
  FIC        GL based Inbreeding Coefficient

 usage : vt estimate [options] <in.vcf>

 options : -s  print site information only without genotypes [false]
           -o  output VCF/VCF.GZ/BCF file [-]
           -e  comma separated estimates to be computed []
           -f  filter expression []
           -I  File containing list of intervals
           -i  Intervals
           -?  displays help
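
The two "derived from" relationships in the list above can be written out for a biallelic variant. This is a sketch of the standard formulas only, assuming genotype frequencies are ordered R/R, R/A, A/A; the maximum likelihood estimation itself is not shown and the function names are hypothetical.

```python
import math


def hwe_genotype_freqs(alt_af):
    """Genotype frequencies (R/R, R/A, A/A) implied by an alternate allele
    frequency under Hardy-Weinberg equilibrium (HWEGF from HWEAF)."""
    p, q = 1.0 - alt_af, alt_af
    return [p * p, 2.0 * p * q, q * q]


def alt_af_from_genotype_freqs(gf):
    """Alternate allele frequency implied by genotype frequencies
    (MLEAF from MLEGF): half of the hets plus all of the hom alts."""
    _, het, hom_alt = gf
    return 0.5 * het + hom_alt


gf = hwe_genotype_freqs(0.2)
print(gf, alt_af_from_genotype_freqs(gf))
```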

Profile SNPs

Profiles SNPs. The reference data sets can be obtained from the vt resource bundle.

  #profile snps found in 20.sites.vcf
  vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa  -i 20

 #this is a sample output for SNP profiling.
 # square brackets contain the ts/tv ratio.
 # Low complexity shows the proportion of the SNPs that are in low complexity regions;
 # the numbers in round brackets are the counts behind that proportion.
  data set
    No. SNPs          :     508603 [2.09]
       Low complexity :       0.08 (39837/508603) 

 1000g
   A-B     109970 [1.39]
   A&B     398633 [2.37]
   B-A    1340682 [2.26]
   Precision    78.4%
   Sensitivity  22.9% 

 dbsnp
   A-B     324063 [1.99]
   A&B     184540 [2.29]
   B-A     103893 [2.60]
   Precision    36.3%
   Sensitivity  64.0%

 # This file contains information on how to process reference data sets.
 #
 # dataset - name of data set, this label will be printed.
 # type    - True Positives (TP) and False Positives (FP)
 #           overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively
 #         - annotation
 #           file is used for GENCODE annotation of frame shift and non frame shift Indels
 # filter  - filter applied to variants for this particular data set 
 # path    - path of indexed BCF file
 #dataset               type             filter                                 path
 1000g                  TP               N_ALLELE==2&&VTYPE==SNP                /net/fantasia/home/atks/ref/vt/grch37/1000G.v5.snps.indels.complex.svs.sites.bcf
 dbsnp                  TP               N_ALLELE==2&&VTYPE==SNP                /net/fantasia/home/atks/ref/vt/grch37/dbSNP138.snps.indels.complex.sites.bcf
 GENCODE_V19            cds_annotation   .                                      /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz
 DUST                   cplx_annotation  .                                      /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz
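
The reference file is a whitespace delimited table with comment lines starting with #. A minimal reader might look like the sketch below. The class and function names are hypothetical, the paths in the example are placeholders, and treating a missing path column as None is an assumption.

```python
from typing import NamedTuple, Optional


class ReferenceDataset(NamedTuple):
    name: str
    type: str
    filter: str
    path: Optional[str]


def parse_reference_file(text):
    """Parse the dataset/type/filter/path table, skipping comments and blanks."""
    datasets = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 columns: {line!r}")
        path = fields[3] if len(fields) > 3 else None
        datasets.append(ReferenceDataset(fields[0], fields[1], fields[2], path))
    return datasets


# Placeholder paths; substitute the locations of your own copies of the bundle.
example = """\
#dataset      type             filter                    path
1000g         TP               N_ALLELE==2&&VTYPE==SNP   /path/to/1000g.sites.bcf
DUST          cplx_annotation  .                         /path/to/mdust.bed.gz
"""
for dataset in parse_reference_file(example):
    print(dataset.name, dataset.type, dataset.filter, dataset.path)
```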

 usage : vt profile_snps [options] <in.vcf>

 options : -f  filter expression []
           -g  file containing list of reference datasets []
           -I  file containing list of intervals []
           -i  intervals []
           -r  reference sequence fasta file []
           -?  displays help
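
The ts/tv ratio in square brackets is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other substitutions). A small sketch of that calculation, with a hypothetical function name and made-up SNPs:

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """snps is a list of (ref, alt) single base pairs."""
    ts = sum(1 for ref, alt in snps if (ref.upper(), alt.upper()) in TRANSITIONS)
    tv = len(snps) - ts
    return ts / tv if tv else float("nan")


# Made-up SNPs: two transitions and one transversion.
print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```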

Profile Indels

Profiles indels. The reference data sets can be obtained from the vt resource bundle.

  #profile indels found in mills.vcf
  vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa  -i 20

 #this is a sample output for indel profiling.
 # square brackets contain the ins/del ratio.  
 # the FS/NFS field is the proportion of coding indels that are frame shift indels.
 # The numbers in round brackets are the counts of frame shift and non frame shift indels respectively.
 data set
   No Indels :      46974 [0.89]
      FS/NFS :       0.26 (8/23) 

 dbsnp
   A-B      30704 [0.92]
   A&B      16270 [0.83]
   B-A    2049488 [1.52]
   Precision    34.6%
   Sensitivity   0.8% 

 mills
   A-B      43234 [0.88]
   A&B       3740 [1.00]
   B-A     203278 [0.98]
   Precision     8.0%
   Sensitivity   1.8% 

 mills.chip
   A-B      46847 [0.89]
   A&B        127 [0.90]
   B-A       8777 [0.93]
   Precision     0.3%
   Sensitivity   1.4% 

 affy.exome.chip
   A-B      46911 [0.89]
   A&B         63 [0.43]
   B-A      33997 [0.47]
   Precision     0.1%
   Sensitivity   0.2%

 # This file contains information on how to process reference data sets.
 # dataset - name of data set, this label will be printed.
 # type    - True Positives (TP) and False Positives (FP).
 #           overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively.
 #         - annotation.
 #           file is used for GENCODE annotation of frame shift and non frame shift Indels.
 # filter  - filter applied to variants for this particular data set.
 # path    - path of indexed BCF file.
 #dataset     type            filter                       path
 1000g        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/1000G.snps_indels.sites.bcf
 mills        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/mills.208620indels.sites.bcf
 dbsnp        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/dbsnp.13147541variants.sites.bcf
 GENCODE_V19  cds_annotation  .                            /net/fantasia/home/atks/ref/vt/grch37/gencode.cds.bed.gz
 DUST         cplx_annotation .                            /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz

 usage : vt profile_indels [options] <in.vcf>

 options : -g  file containing list of reference datasets []
           -I  file containing list of intervals []
           -i  intervals []
           -r  reference sequence fasta file []
           -?  displays help
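
The two ratios in the "data set" block can be checked by hand. The sketch below assumes an insertion is an indel whose ALT is longer than its REF, and uses the FS/NFS definition given in the comments above (proportion of coding indels that are frame shift indels); the function names are hypothetical.

```python
def ins_del_ratio(indels):
    """indels is a list of (ref, alt) pairs; returns insertions / deletions."""
    ins = sum(1 for ref, alt in indels if len(alt) > len(ref))
    dels = sum(1 for ref, alt in indels if len(alt) < len(ref))
    return ins / dels if dels else float("nan")


def frameshift_proportion(n_fs, n_nfs):
    """Proportion of coding indels that are frame shift indels."""
    return n_fs / (n_fs + n_nfs)


# Made-up indels: one insertion and two deletions.
print(ins_del_ratio([("C", "CA"), ("CTG", "C"), ("GA", "G")]))
# Counts copied from the sample output above: (8/23).
print(f"{frameshift_proportion(8, 23):.2f}")
```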

Profile VNTRs

Profiles VNTRs. The reference data sets can be obtained from the vt resource bundle.

 #profiles a set of VNTRs
 vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt

 profile_vntrs v0.5
 
   no VNTRs           5660874           #number of VNTRs in vntrs.sites.bcf
   no low complexity  2686460 (47.46%)  #number of VNTRs in low complexity region determined by MDUST
   no coding          17911 (0.32%)     #number of VNTRs in coding regions determined by GENCODE v7
   no redundant       1312209 (23.18%)  #number of VNTRs involved in overlapping with one another

 trf_lobstr (1638516)  #TRF based reference set used in lobSTR, motif lengths 1 to 6.
   A-B     3269285     #TRs specific to vntrs.sites.bcf
   A-B~    1666185     #TRs in vntrs.sites.bcf that overlap partially with at least one TR in TRF(lobSTR) but do not overlap exactly with any TR.
   A&B1     725404     #TRs in vntrs.sites.bcf that overlap exactly with at least one TR in TRF(lobSTR)
   A&B2     723195     #TRs in TRF(lobSTR) that overlap exactly with at least one TR in vntrs.sites.bcf
   B-A~     710075     #TRs in TRF(lobSTR) that overlap partially with at least one TR in vntrs.sites.bcf but do not overlap exactly with any TR.
   B-A      205246     #TRs specific to TRF(lobSTR)
 #note that the first 3 rows should sum up to the number of TRs in vntrs.sites.bcf
 #and the 4th to 6th rows should sum up to the number of TRs in TRF(lobSTR).
 #This allows us to see the m to n relationships between overlapping TRs.

 trf_repeatseq (1624553) #TRF based reference set used in repeatseq, motif lengths 1 to 6.
   A-B     3291652 
   A-B~    1650190 
   A&B1     719032 
   A&B2     716838 
   B-A~     703948 
   B-A      203767  

 trf_vntrseek (230306)   #TRF based reference set used in vntrseek, motif lengths 7 to 2000.
   A-B     5384453 
   A-B~     271302 
   A&B1       5119 
   A&B2       4973 
   B-A~      92496 
   B-A      132837  

 codis+ (15)             #CODIS STRs + 2 STRs from PROMEGA
   A-B     5660794 
   A-B~         79 
   A&B1          1 
   A&B2          1 
   B-A~         14 
   B-A           0

 # This file contains information on how to process reference data sets.
 # dataset - name of data set, this label will be printed.
 # type    - True Positives (TP) and False Positives (FP).
 #           overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively.
 #         - annotation.
 #           file is used for GENCODE annotation of coding VNTRs.
 # filter  - filter applied to variants for this particular data set.
 # path    - path of indexed BCF file.
 #dataset      type            filter                       path
 trf_lobstr    TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.lobstr.sites.bcf
 trf_repeatseq TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.repeatseq.sites.bcf
 trf_vntrseek  TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.vntrseek.sites.bcf
 codis+        TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/codis.strs.sites.bcf
 GENCODE_V19   cds_annotation  .                            /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz
 DUST          cplx_annotation .

 usage : vt profile_vntrs [options] <in.vcf>

 options : -g  file containing list of reference datasets []
           -I  file containing list of intervals []
           -i  intervals []
           -r  reference sequence fasta file []
           -?  displays help
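
The row-sum note in the sample output can be used as a sanity check when reading a profile. A small sketch, with a hypothetical function name and the counts copied from the trf_lobstr block above:

```python
def check_overlap_rows(rows, n_a, n_b):
    """Check that the A-side rows sum to the number of TRs in the input file
    and the B-side rows sum to the number of TRs in the reference set."""
    a_total = rows["A-B"] + rows["A-B~"] + rows["A&B1"]
    b_total = rows["A&B2"] + rows["B-A~"] + rows["B-A"]
    return a_total == n_a, b_total == n_b


trf_lobstr = {
    "A-B": 3269285, "A-B~": 1666185, "A&B1": 725404,
    "A&B2": 723195, "B-A~": 710075, "B-A": 205246,
}
print(check_overlap_rows(trf_lobstr, n_a=5660874, n_b=1638516))
```

A&B1 and A&B2 differ because one TR in one set can overlap exactly with more than one TR in the other.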

Profile Mendelian Errors

Profiles Mendelian errors.

  #profile mendelian errors found in vt.genotypes.bcf, generate tables in the directory mendel, requires pdflatex.
  vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel

  pedigree file format is described in here

  #this is a sample output for mendelian error profiling.
  #R and A stand for reference and alternate allele respectively.
  #Error% - mendelian error (confounded with de novo mutation)
  #HomHet - Homozygous-Heterozygous genotype ratios
  #Het% - proportion of hets
  Mendelian Errors 

  Father Mother       R/R          R/A          A/A    Error(%) HomHet    Het(%)
  R/R    R/R        14889          210           38     1.64       nan    nan
  R/R    R/A         3403         3497           74     1.06      0.97  50.68
  R/R    A/A          176         1482          155    18.26       nan    nan
  R/A    R/R         3665         3652           68     0.92      1.00  49.91
  R/A    R/A         1015         3151          990     0.00      0.64  61.11
  R/A    A/A           43         1300         1401     1.57      1.08  48.13
  A/A    R/R          172         1365          147    18.94       nan    nan
  A/A    R/A           47         1164         1183     1.96      1.02  49.60
  A/A    A/A           20           78         5637     1.71       nan    nan 

  Parental            R/R          R/A          A/A    Error(%) HomHet    Het(%)
  R/R    R/R        14889          210           38     1.64       nan    nan
  R/R    R/A         7068         7149          142     0.99      0.99  50.28
  R/R    A/A          348         2847          302    18.59       nan    nan
  R/A    R/A         1015         3151          990     0.00      0.64  61.11
  R/A    A/A           90         2464         2584     1.75      1.05  48.81
  A/A    A/A           20           78         5637     1.71       nan    nan  

  Parental            R/R          R/A          A/A    Error(%) HomHet    Het(%)
  HOM    HOM        14909          288         5675     1.66       nan    nan
  HOM    HET         7158         9613         2726     1.19      1.00  49.90
  HET    HET         1015         3151          990     0.00      0.64  61.11
  HOMREF HOMALT       348         2847          302    18.59       nan    nan  

  total mendelian error :   2.505% 
  no. of trios     : 2
  no. of variants  : 25346

profile_mendelian v0.5

 usage : vt profile_mendelian [options] <in.vcf>

 options : -q  minimum genotype quality
           -d  minimum depth
           -r  reference sequence fasta file []
           -x  output latex directory []
           -p  pedigree file
           -I  file containing list of intervals []
           -i  intervals
           -?  displays help
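
The Error(%), HomHet and Het(%) columns can be recomputed from the child genotype counts of each row. The sketch below is inferred from the sample table rather than taken from vt's source: a child genotype counts as an error when it cannot be formed from one allele of each parent, and the two ratios are computed over the consistent genotypes only. All names are hypothetical.

```python
import math

# Genotypes are coded by their number of alternate alleles:
# 0 = R/R, 1 = R/A, 2 = A/A.
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}


def consistent_child_genotypes(father, mother):
    """Child genotypes obtainable by taking one allele from each parent."""
    return {f + m for f in TRANSMITTABLE[father] for m in TRANSMITTABLE[mother]}


def row_stats(father, mother, child_counts):
    """Return (Error%, HomHet, Het%) for one row of the table.

    child_counts holds the number of R/R, R/A and A/A children in that order.
    """
    ok = consistent_child_genotypes(father, mother)
    total = sum(child_counts)
    errors = sum(n for g, n in enumerate(child_counts) if g not in ok)
    error_pct = 100.0 * errors / total if total else math.nan
    hom = sum(child_counts[g] for g in ok if g != 1)
    het = child_counts[1] if 1 in ok else 0
    # The ratios are only reported when both homozygous and heterozygous
    # children are expected from the parental genotypes.
    if 1 in ok and len(ok) > 1 and het > 0:
        return error_pct, hom / het, 100.0 * het / (hom + het)
    return error_pct, math.nan, math.nan


def total_error_pct(rows):
    """rows is a list of (father, mother, child_counts) tuples."""
    errors = total = 0
    for father, mother, counts in rows:
        ok = consistent_child_genotypes(father, mother)
        errors += sum(n for g, n in enumerate(counts) if g not in ok)
        total += sum(counts)
    return 100.0 * errors / total


# Second row of the Father/Mother table above: R/R x R/A.
error_pct, hom_het, het_pct = row_stats(0, 1, (3403, 3497, 74))
print(f"{error_pct:.2f} {hom_het:.2f} {het_pct:.2f}")
```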

Profile NA12878

Profiles the overlap and genotype concordance of NA12878 calls with reference data sets.

  #profile NA12878 overlap with broad knowledgebase and illumina platinum genomes for the file vt.genotypes.bcf for chromosome 20.
  vt profile_na12878  vt.genotypes.bcf -g na12878.reference.txt -r hs37d5.fa -i 20

  #this is a sample output for NA12878 profiling.
  #square brackets contain the ins/del ratio.
  #R and A stand for reference and alternate allele respectively, ./. is a missing genotype.
    data set
   No Indels :      27770 [0.94]
      FS/NFS :       0.26 (8/23) 

 broad.kb
   A-B      13071 [1.19]
   A&B      14699 [0.76]
   B-A      21546 [0.62]
   Precision    52.9%
   Sensitivity  40.6% 

 illumina.platinum
   A-B      17952 [0.88]
   A&B       9818 [1.07]
   B-A       2418 [0.88]
   Precision    35.4%
   Sensitivity  80.2% 

 broad.kb
               R/R       R/A       A/A       ./.
   R/R         346       145         3      5473
   R/A           3      4133         9       758
   A/A           2       136      2186       956
   ./.           2       139        86       322 

   Total genotype pairs :      6963
   Concordance          :  95.72% (6665)
   Discordance          :   4.28% (298) 

 illumina.platinum
               R/R       R/A       A/A       ./.
   R/R        1768        85         2         0
   R/A          10      4479        14         0
   A/A          13       180      3028         0
   ./.          71        98        70         0

   Total genotype pairs :      9579
   Concordance          :  96.83% (9275)
   Discordance          :   3.17% (304)

  # This file contains information on how to process reference data sets.
  #
  # dataset - name of data set, this label will be printed.
  # type    - True Positives (TP) and False Positives (FP)
  #           overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively
  #         - annotation
  #           file is used for GENCODE annotation of frame shift and non frame shift Indels
  # filter  - filter applied to variants for this particular data set
  # path    - path of indexed BCF file
  #dataset              type         filter    path
  broad.kb              TP           PASS      /net/fantasia/home/atks/dev/vt/bundle/public/grch37/broad.kb.241365variants.genotypes.bcf
  illumina.platinum     TP           PASS      /net/fantasia/home/atks/dev/vt/bundle/public/grch37/NA12878.illumina.platinum.5284448variants.genotypes.bcf
  #gencode.v19           annotation   .         /net/fantasia/home/atks/dev/vt/bundle/public/grch37/gencode.v19.annotation.gtf.gz

profile_na12878 v0.5

 usage : vt profile_na12878 [options] <in.vcf>

 options : -g  file containing list of reference datasets []
           -I  file containing list of intervals []
           -i  intervals []
           -r  reference sequence fasta file []
           -?  displays help
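
The concordance figures in the sample output are computed over genotype pairs in which both genotypes are called; pairs involving ./. are left out. The sketch below reproduces the broad.kb numbers from its table. Which axis is the input file and which is the reference is not stated on this page and does not matter for the totals; the function name is hypothetical.

```python
def genotype_concordance(matrix, labels=("R/R", "R/A", "A/A", "./.")):
    """Return (total pairs, concordant pairs, concordance %) ignoring ./."""
    called = [i for i, label in enumerate(labels) if label != "./."]
    total = sum(matrix[i][j] for i in called for j in called)
    concordant = sum(matrix[i][i] for i in called)
    return total, concordant, 100.0 * concordant / total


# Genotype table copied from the broad.kb block above.
broad_kb = [
    [346, 145, 3, 5473],
    [3, 4133, 9, 758],
    [2, 136, 2186, 956],
    [2, 139, 86, 322],
]
total, concordant, pct = genotype_concordance(broad_kb)
print(total, concordant, f"{pct:.2f}%")
```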

Variant Calling

Discover

Discovers variants from reads in a BAM/CRAM file.

  #discover variants from NA12878.bam and write to stdout
  vt discover -b NA12878.bam -s NA12878 -r hs37d5.fa -i 20

 usage : vt discover2 [options]

 options : -b  input BAM/CRAM file
         -y  soft clipped unique sequences cutoff [0]
         -x  soft clipped mean quality cutoff [0]
         -w  insertion desired type II error [0.0]
         -c  insertion desired type I error [0.0]
         -h  insertion fractional evidence cutoff [0]
         -g  insertion count cutoff [1]
         -n  deletion desired type II error [0.0]
         -m  deletion desired type I error [0.0]
         -v  deletion fractional evidence cutoff [0]
         -u  deletion count cutoff [1]
         -k  snp desired type II error [0.0]
         -j  snp desired type I error [0.0]
         -f  snp fractional evidence cutoff [0]
         -e  snp evidence count cutoff [1]
         -q  base quality cutoff for bases [0]
         -C  likelihood ratio cutoff [0]
         -B  reference bias [0]
         -a  read exclude flag [0x0704]
         -l  ignore overlapping reads [false]
         -t  MAPQ cutoff for alignments [0]
         -p  ploidy [2]
         -s  sample ID
         -r  reference sequence fasta file []
         -o  output VCF file [-]
         -z  ignore MD tags [0]
         -d  debug [0]
         -I  file containing list of intervals []
         -i  intervals []
         -?  displays help
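
The default read exclude flag 0x0704 is a combination of SAM FLAG bits. The sketch below decodes it using the bit definitions from the SAM specification; the short names in the table are informal labels, and the assumption that a read is excluded when it has any of the masked bits set is the conventional reading, not something stated on this page.

```python
SAM_FLAGS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse", 0x40: "read1", 0x80: "read2",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def is_excluded(read_flag, exclude_mask=0x0704):
    """True if the read has any bit of the exclude mask set (assumed rule)."""
    return (read_flag & exclude_mask) != 0


print(decode_flag(0x0704))
print(is_excluded(99), is_excluded(99 | 0x400))
```

So the default mask targets unmapped reads, secondary alignments, reads failing quality checks and duplicates.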

Merge candidate variants

Merges candidate variants across samples. Each VCF file is required to have the FORMAT fields E and N and should have exactly one sample.

  #merge candidate variants from VCFs in candidates.txt and output in candidate.sites.vcf
  vt merge_candidate_variants candidates.txt -o candidate.sites.vcf

 usage : vt merge_candidate_variants [options]

 options : -L  file containing list of input VCF files
           -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help

Remove overlap

Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap.

  #annotates variants that are overlapping  
  vt remove_overlap in.vcf -r hs37d5.fa -o overlapped.tagged.vcf

 usage : vt remove_overlap [options] <in.vcf>

 options : -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals []
           -?  displays help

Annotate Indels

Annotates indels with VNTR information and adds a VNTR record. This facilitates the simultaneous calling of VNTRs together with indels and SNPs.

  #annotates indels from VCFs with VNTR information.
  vt annotate_indels in.vcf -r hs37d5.fa -o annotated.sites.vcf

 CHROM   POS     ID      REF     ALT     QUAL    FILTER  INFO
 20      82079   .       G       A       1255.98 .       NSAMPLES=1;E=43;N=51;ESUM=43;NSUM=51;FLANKSEQ=GGAGCACGCC[G/A]CCATGCCCGG
 20      82217   .       G       A       1632.77 .       NSAMPLES=1;E=56;N=61;ESUM=56;NSUM=61;FLANKSEQ=GAGCCACCGC[G/A]CCCGGCCCAG
 20      83250   .       CTGTGTGTG       C       .       .       NSAMPLES=1;E=18;N=35;ESUM=18;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT
 20      83250   .       CTGTGTGTGTG     C       .       .       NSAMPLES=1;E=3;N=35;ESUM=3;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT
 20      83251   .       TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG    <VNTR>  .       .       MOTIF=GT;RU=TG;FZ_CONCORDANCE=1;FZ_RL=52;FZ_LL=0;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FZ_RU_COUNTS=26,26;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG]TTTAGTATTT
 20      83252   .       G       C       359.204 .       NSAMPLES=1;E=13;N=14;ESUM=13;NSUM=14;FLANKSEQ=CTCTCTCTCT[G/C]TGTGTGTGTG
 20      83260   .       G       C       500.163 .       NSAMPLES=1;E=18;N=34;ESUM=18;NSUM=34;FLANKSEQ=CTGTGTGTGT[G/C]TGTGTGTGTG
 20      83267   .       T       C       247.043 .       NSAMPLES=1;E=11;N=43;ESUM=11;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT
 20      83275   .       T       C       609.669 .       NSAMPLES=1;E=24;N=43;ESUM=24;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT
 20      90008   .       C       A       1546.88 .       NSAMPLES=1;E=52;N=60;ESUM=52;NSUM=60;FLANKSEQ=AACAGAAAAC[C/A]AAATACTGTA
 20      91088   .       C       T       1766.04 .       NSAMPLES=1;E=58;N=66;ESUM=58;NSUM=66;FLANKSEQ=CCCAGCATAC[C/T]ATGGTTGTGC
 20      91508   .       G       A       1266.93 .       NSAMPLES=1;E=44;N=53;ESUM=44;NSUM=53;FLANKSEQ=AATTAGTAAG[G/A]CTTACGTAAG
 20      91707   .       C       T       888.134 .       NSAMPLES=1;E=30;N=53;ESUM=30;NSUM=53;FLANKSEQ=TGATTTTCTA[C/T]AGCAGGACCT
 20      92527   .       A       G       828.593 .       NSAMPLES=1;E=34;N=40;ESUM=34;NSUM=40;FLANKSEQ=ATTAATTGCC[A/G]TTCTCTCTTT
 20      93440   .       A       G       688.144 .       NSAMPLES=1;E=24;N=58;ESUM=24;NSUM=58;FLANKSEQ=TTGGATGCAT[A/G]GTCTGTAAAT
 20      93636   .       TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT     <VNTR>  .       .       MOTIF=T;RU=T;FZ_CONCORDANCE=0.939394;FZ_RL=35;FZ_LL=0;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FZ_RU_COUNTS=31,33;FLANKSEQ=TCTAGGATTC[TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT
 20      93646   .       C       CT      .       .       NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FLANKSEQ=TTTTTCTTTC[TTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT;GMOTIF=T;TR=20:93636:TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT:<VNTR>:T
 20      93717   .       A       T       31.7622 .       NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKSEQ=CAGTGGCGTG[A/T]TCTTAGATCA
 20      93931   .       G       A       628.149 .       NSAMPLES=1;E=22;N=53;ESUM=22;NSUM=53;FLANKSEQ=GATTACAGGT[G/A]TGAGCCGCTG
 20      100699  .       C       T       809.09  .       NSAMPLES=1;E=28;N=61;ESUM=28;NSUM=61;FLANKSEQ=GGTGAAAAAT[C/T]ACCTGTCAGT
 20      101362  .       G       A       1087.13 .       NSAMPLES=1;E=36;N=67;ESUM=36;NSUM=67;FLANKSEQ=TAATACTGAA[G/A]TTTACTTCTC

 The following trace shows how the algorithm works:

   ============================================
   ANNOTATING INDEL FUZZILY
   ********************************************
   EXTRACTIING REGION BY EXACT LEFT AND RIGHT ALIGNMENT
   
   20:131948:C/CCA
   EXACT REGION 131948-131965 (18) 
                CCACACACACACACACAA
   FINAL EXACT REGION 131948-131965 (18) 
                      CCACACACACACACACAA
   ********************************************
   PICK CANDIDATE MOTIFS
   
   Longest Allele : C[CA]CACACACACACACACAA
   detecting motifs for an str
   seq: CCACACACACACACACACAA
   len : 20
   cmax_len : 10
   candidate motifs: 25
   AC : 0.894737 2 0
   AAC : 0.5 3 0.0555556
   ACC : 0.5 3 0.0555556
   AAAC : 0.0588235 4 0.125 (< 2 copies)
   ACCC : 0.0588235 4 0.125 (< 2 copies)
   AACAC : 0.5 5 0.02
   ACACC : 0.5 5 0.02
   AAACAC : 0.0666667 6 0.0555556 (< 2 copies)
   ACACCC : 0.0666667 6 0.0555556 (< 2 copies)
   AACACAC : 0.5 7 0.0102041
   ACACACC : 0.5 7 0.0102041
   AAACACAC : 0.0769231 8 0.03125 (< 2 copies)
   ACACACCC : 0.0769231 8 0.03125 (< 2 copies)
   AACACACAC : 0.5 9 0.00617284 (< 2 copies)
   ACACACACC : 0.5 9 0.00617284 (< 2 copies)
   AAACACACAC : 0.0909091 10 0.02 (< 2 copies)
   ACACACACCC : 0.0909091 10 0.02 (< 2 copies)
   ********************************************
   PICKING NEXT BEST MOTIF
   
   selected:         AC 0.89 0.00
   ********************************************
   DETECTING REPEAT TRACT FUZZILY
   ++++++++++++++++++++++++++++++++++++++++++++
   Exact left/right alignment
   
   repeat_tract              : CACACACACACACACA
   position                  : [131949,131964]
   motif_concordance         : 1
   repeat units              : 8
   exact repeat units        : 8
   total no. of repeat units : 8
   
   ++++++++++++++++++++++++++++++++++++++++++++
   Fuzzy right alignment
   
   repeat motif : CA
   rflank       : AACTC
   mlen         : 2
   rflen        : 5
   plen         : 111
   
   read         : AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACACCACACACACACACACAAACTC
   rlen         : 106
   
   optimal score: 50.5073
   optimal state: MR
   optimal track: MR|r|0|5
   optimal probe len: 25
   optimal path length : 107
   max j: 106
   probe: (1~82) [1~10] (1~5)
   read : (1~82) [83~101] (102~106)
   
   motif #           : 10 [83,101]
   motif concordance : 95% (9/10)
   motif discordance : 0|1|0|0|0|0|0|0|0|0
   
   Model:  ----------------------------------------------------------------------------------CACACACACACACACACACAAACTC 
          SYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYMMMDMMMMMMMMMMMMMMMMMMMMME
                                                                                             oo++oo++oo++oo++oo++RRRRR 
   Read:   AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACAC-CACACACACACACACAAACTC 
   
   ++++++++++++++++++++++++++++++++++++++++++++
   Fuzzy left alignment
   
   lflank       : ATCTTA
   repeat motif : CA
   lflen        : 6
   mlen         : 2
   plen         : 111
   
   read         : ATCTTACACCACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT
   rlen         : 105
   
   optimal score: 50.5858
   optimal state: Z
   optimal track: Z|m|10|2
   optimal probe len: 26
   optimal path length : 106
   max j: 105
   mismatch penalty: 3
   
   model: (1~6) [1~10]
   read : (1~6) [7~25][26~106]
   
   motif #           : 10 [7,25]
   motif concordance : 95% (9/10)
   motif discordance : 0|1|0|0|0|0|0|0|0|0
   
   Model:  ATCTTACACACACACACACACACACA-------------------------------------------------------------------------------- 
          SMMMMMMMMMDMMMMMMMMMMMMMMMMZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZE
           LLLLLLoo++oo++oo++oo++oo++                                                                                 
   Read:   ATCTTACAC-CACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT 
   
   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
   VNTR Summary
   rid          : 19
   motif        : AC
   ru           : CA
   
   Exact
   repeat_tract                    : CACACACACACACACA
   position                        : [131949,131964]
   reference repeat unit length    : 8
   motif_concordance               : 1
   repeat units                    : 8
   exact repeat units              : 8
   total no. of repeat units       : 8
   
   Fuzzy
   repeat_tract                    : CACCACACACACACACACA
   position                        : [131946,131964]
   reference repeat unit length    : 19
   motif_concordance               : 0.95
   repeat units                    : 19
   exact repeat units              : 9
   total no. of repeat units       : 10
   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

 usage : vt annotate_indels [options] <in.vcf>

 options : -v  add vntr record [false]
           -x  override tags [false]
           -f  filter expression []
           -d  debug [false]
           -m  mode [f]
               e : by exact alignment
               f : by fuzzy alignment
           -c  classification schemas of tandem repeat [6]
               1 : lai2003     
               2 : kelkar2008  
               3 : fondon2012  
               4 : ananda2013  
               5 : willems2014 
               6 : tan_kang2015
           -a  annotation type [v]
               v : a. output VNTR variant (defined by classification).
                      RU                    repeat unit on reference sequence (CA)
                      MOTIF                 canonical representation (AC)
                      RL                    repeat tract length in bases (11)
                      FLANKS                flanking positions of repeat tract determined by exact alignment
                      RU_COUNTS             number of exact repeat units and total number of repeat units in
                                            repeat tract determined by exact alignment
                      FZ_RL                 fuzzy repeat tract length in bases (11)
                      FZ_FLANKS             flanking positions of repeat tract determined by fuzzy alignment
                      FZ_RU_COUNTS          number of exact repeat units and total number of repeat units in
                                            repeat tract determined by fuzzy alignment
                      FLANKSEQ              flanking sequence of indel
                      LARGE_REPEAT_REGION   repeat region exceeding 2000bp
                   b. mark indels with overlapping VNTR.
                      FLANKS       flanking positions of repeat tract determined by exact alignment
                      FZ_FLANKS    flanking positions of repeat tract determined by fuzzy alignment
                      GMOTIF       generating motif used in fuzzy alignment
                      TR    position and alleles of VNTR (20:23413:CACACACACAC:<VNTR>)
               a : annotate each indel with RU, RL, MOTIF, REF.
           -r  reference sequence fasta file []
           -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals
           -?  displays help

Construct Probes

Construct probes for genotyping a variant.

  #construct probes from candidates.sites.bcf and write them to standard output
  vt construct_probes candidates.sites.bcf -r ref.fa

 usage : vt construct_probes [options] <in.vcf>

 options : -o  output VCF file [-]
           -f  minimum flank length [20]
           -r  reference sequence fasta file []
           -I  file containing list of intervals []
           -i  intervals []
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help

Genotype

Genotypes variants for each sample.

  #genotype the variants listed in candidates.sites.vcf using the reads in sample.bam
  vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf

 usage : vt genotype [options]

 options : -r  reference sequence fasta file []
           -s  sample ID []
           -o  output VCF file [-]
           -b  input BAM file []
           -i  input candidate VCF file []
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help
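
To genotype several samples against the same candidate sites, the command above can be wrapped in a loop. The following is a minimal sketch that uses only the options listed in the usage above. The function name genotype_cmds, the bams/ directory, and the choice of the BAM base name as both the sample ID and the output prefix are assumptions made for illustration. It prints the commands instead of running them so that they can be inspected first.

```shell
# Print one "vt genotype" command per BAM file (dry run, nothing is executed).
# Assumptions: sample ID = BAM base name; output is written to <sample>.sites.vcf.
genotype_cmds() {
    ref=$1; sites=$2; shift 2
    for bam in "$@"; do
        sample=$(basename "$bam" .bam)
        echo "vt genotype -r $ref -b $bam -s $sample -i $sites -o $sample.sites.vcf"
    done
}

# Example: list the commands for two hypothetical BAM files.
genotype_cmds seq.fa candidates.sites.vcf bams/NA12878.bam bams/NA12891.bam
```

Once the printed commands look right, the output can be piped to sh to run them.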

Resource Bundle

External : resource bundle
Internal : /net/fantasia/home/atks/ref/vt/grch37

GRCh37 set : Files are based on hs37d5.fa made by Heng Li.

data set	samples	snps/indels/complex/sv	description
1000G.v5	0	81316694/3296894/66806/59426	1000G v5. [1000G 2015?]
dbsnp138	0	10588965/2488793/69749/0	derived from GATK's resource bundle that excludes 1000G variants.
1000G.omni.chip	2141	2432554/5/0/0	1000G individuals typed on the omni chip [1000G 2015?]
mills	0	0/208753/0/0	indels from [Mills 2006]
mills.chip	158	0/8904/0/0	indels from [Mills 2011]
affy.exome.chip	2122	281875/34389/0/0	1000G individuals and others typed on the affymetrix exome chip [1000G 2015?]
NA12878.broad.kb	1	281345/87389/152/0	from GATK's NA12878 knowledgebase.
NA12878.v7.illumina.platinum	1	3702969/650764/13751/0	Illumina's platinum genomes version 7
mdust.bed.gz	NA	NA	regions of low complexity annotated using mdust [Morgulis 2006]
gencode.cds.bed.gz	NA	NA	coding sequence regions based on GENCODE v19 annotations [Harrow 2012]
trf.bed.gz	NA	NA	tandem repeat finder STRs from lobSTR's resource bundle [Gymrek 2012]

Note: Please let me know if I did not cite a resource properly.

FAQ

1. vt cannot retrieve sequences from my reference sequence file

 Reference files commonly come from either the UCSC browser's database or the Genome Reference Consortium,
 for example hg19 versus GRCh37.  The key difference is that chromosome 1 is named chr1 in the UCSC FASTA files and 1 in
 the Genome Reference Consortium FASTA files.  Use the FASTA file that was originally used to generate your VCF file.
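
A quick way to check for this naming mismatch is to compare the sequence names in the VCF file with those in the FASTA index. The following is a minimal sketch, assuming an uncompressed VCF file and an existing .fai index; the function name missing_contigs is hypothetical and is not part of vt.

```shell
# Print sequence names that occur in a VCF file but not in a FASTA index (.fai).
# Assumptions: the VCF is uncompressed and the .fai file already exists.
missing_contigs() {
    awk -F'\t' '
        FILENAME == ARGV[1] { known[$1] = 1; next }   # .fai: column 1 is the sequence name
        /^#/                { next }                  # skip VCF header lines
        !($1 in known) && !seen[$1]++ { print $1 }    # report each unknown CHROM once
    ' "$2" "$1"
}

# usage: missing_contigs in.vcf hs37d5.fa.fai
```

If the function prints names such as chr1 while the index lists 1, the VCF file and the reference come from different sources.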

 Another common cause is a corrupted index file for the reference sequence.  For a reference file named
 hs37d5.fa or hs37d5.fa.gz, delete the index file hs37d5.fa.fai or hs37d5.fa.gz.fai and run the vt command
 again.  A new index file will be generated automatically.
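
The index removal can be scripted as below. This is a minimal sketch; the function name remove_fai is hypothetical, and it relies only on the naming shown above, where the index is the reference file name with .fai appended.

```shell
# Remove the index of a reference FASTA so that it is rebuilt on the next vt run.
# The reference file itself is left untouched.
remove_fai() {
    ref=$1
    if [ -f "$ref.fai" ]; then
        rm -- "$ref.fai" && echo "removed $ref.fai"
    else
        echo "no index found for $ref"
    fi
}

# usage: remove_fai hs37d5.fa
```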

How to cite vt?

If you use normalize:
Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. Unified Representation of Genetic Variants. Bioinformatics (2015) 31(13): 2202-2204

Maintained by

This page is maintained by Adrian
