Introduction

vt is a variant tool set that discovers short variants from Next Generation Sequencing data. Features are being rolled out to GitHub while a major rewrite is under way.

Installation

The source files are hosted on GitHub. vt uses htslib, and a copy of a development freeze of htslib is stored as part of the vt repository to ensure compatibility.

To install, perform the following steps:

 #this will create a directory named vt in the directory where you ran the clone
 1. git clone https://github.com/atks/vt.git 

 #change directory to vt
 2. cd vt

 #run make, note that compilers need to support the c++0x standard 
 3. make

Building has been tested on Linux and Mac systems on gcc 4.8.1 and clang 3.4.

Updating

vt is currently under heavy development, so you will probably need to update often.

 #remove all object files
 #you need to do this because source files for the static libraries might have changed, so the stale object files need to be removed and rebuilt.
 1. make clean 

 #update source files
 2. git pull 

 #compile and link, the -j option tells make to run up to 40 independent commands in parallel
 3. make -j 40

General Features

Common options

   -i   multiple intervals in <seq>:<start>-<end> format delimited by commas.

   -I   multiple intervals in <seq>:<start>-<end> format listed in a text file line by line.

   -o   defines the output file; the default is STDOUT.
        You may set STDOUT to output the binary version of the format.  Uncompressed
        VCF and BCF streams are indicated by - and + respectively.

   -f  filter expression

   -s  sequential region selection as opposed to random access of regions specified by the -i option.
       This is useful when you want to select many close-by regions. While the -i option works,
       it is less efficient and also selects a variant multiple times if it overlaps 2 regions.  This 
       option iterates through the variants in the file sequentially and checks for overlap with the 
       bed file given.
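
The interval syntax and the double-selection issue described for -s can be illustrated with a small Python sketch. This is not vt code; the names Interval, parse_intervals and overlap_count are hypothetical, and the handling of a bare sequence name (as in -i 20) is an assumption based on the examples on this page.

```python
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    seq: str
    start: Optional[int]  # None means the whole sequence is selected
    end: Optional[int]


def parse_intervals(text: str) -> List[Interval]:
    """Parse a comma-delimited list such as '20:100-200,20:150-300,X'."""
    intervals = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        if ":" in token:
            seq, span = token.rsplit(":", 1)
            start, end = span.split("-", 1)
            intervals.append(Interval(seq, int(start), int(end)))
        else:
            # bare sequence name, e.g. "20"
            intervals.append(Interval(token, None, None))
    return intervals


def overlap_count(seq: str, beg: int, end: int, intervals: List[Interval]) -> int:
    """Number of intervals overlapped by a variant spanning beg..end (inclusive)."""
    count = 0
    for iv in intervals:
        if iv.seq != seq:
            continue
        if iv.start is None or (beg <= iv.end and end >= iv.start):
            count += 1
    return count


intervals = parse_intervals("20:100-200,20:150-300,X")
# A variant at 20:160-170 lies in two regions, so fetching region by region
# would return it twice; a single sequential pass reports it once.
print(overlap_count("20", 160, 170, intervals))
```
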

Uncompressed BCF streams

htslib is designed with BCF as the underlying data structure and it has incorporated awareness of uncompressed BCF streams in the i/o API. One may use this feature to stream uncompressed BCF records to save on computational time spent on (de)compression.

 #using textual VCF streams indicated by -
 cat mills.vcf | vt normalize - -r hs37d5.fa | vt mergedups - -o out.bcf

 #using uncompressed BCF streams indicated by +
 cat mills.vcf | vt normalize - -r hs37d5.fa -o + | vt mergedups + -o out.bcf

In this example, the former took 0.84s while the latter took 0.64s to process, a reduction of about 24% in run time.

Filters

For some programs, you may define a filter via the -f option.

 #this allows you to analyse only passed biallelic indels on chromosome 20
 vt profile_na12878 vt.bcf -g na12878.reference.txt -r genome.fa -f "N_ALLELE==2&&VTYPE==INDEL&&PASS"  -i 20

Other examples of filters

 #all variants with a SNP in them
 VTYPE&SNP
 #Simple insertions of length 1
 VTYPE==INDEL&&DLEN==1
 #Indels of length 1
 VTYPE==INDEL&&LEN==1

 Variant characteristics
   VTYPE,N_ALLELE,DLEN,LEN

 Variant value types
   SNP,MNP,INDEL,CLUMPED

 Biallelic SNPs only                       : VTYPE==SNP&&N_ALLELE==2
 Biallelic Indels with embedded SNP        : VTYPE==(SNP|INDEL)&&N_ALLELE==2
 Biallelic variants involving insertions   : VTYPE&INDEL&&DLEN>0&&N_ALLELE==2
 Biallelic variants involving 1bp variants : LEN==1&&N_ALLELE==2

 FILTER fields
   PASS, FILTER.<tag>

 INFO fields
   INFO.<tag>

 Passed biallelic SNPs only                  : PASS&&VTYPE==SNP&&N_ALLELE==2
 Passed Common biallelic SNPs only           : PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005
 Passed Common biallelic SNPs or rare indels : (PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005)||(VTYPE&INDEL&&INFO.AF<=0.005)

 Operations
   ==,~,&&,||,&,|,+,-,*,/

 Failed rare variants : ~PASS&&(INFO.AC/INFO.AN<0.005)
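
The difference between VTYPE&SNP and VTYPE==SNP in the examples above is easiest to see if the variant type is read as a set of bit flags. The sketch below is an assumed reading of the filter examples, not vt's implementation; the flag values and the dictionary layout are hypothetical.

```python
from enum import IntFlag


class VType(IntFlag):
    # Hypothetical flag values; only the bit-flag idea matters here.
    SNP = 1
    MNP = 2
    INDEL = 4
    CLUMPED = 8


# An indel with an embedded SNP carries both flags.
variant = {"VTYPE": VType.SNP | VType.INDEL, "N_ALLELE": 2, "DLEN": 1}

# VTYPE&SNP : true for any variant with a SNP in it
print(bool(variant["VTYPE"] & VType.SNP))

# VTYPE==SNP : true only for pure SNPs
print(variant["VTYPE"] == VType.SNP)

# VTYPE==(SNP|INDEL)&&N_ALLELE==2 : biallelic indels with embedded SNP
print(variant["VTYPE"] == (VType.SNP | VType.INDEL) and variant["N_ALLELE"] == 2)


def involves_insertion(v: dict) -> bool:
    """Python equivalent of VTYPE&INDEL&&DLEN>0&&N_ALLELE==2."""
    return bool(v["VTYPE"] & VType.INDEL) and v["DLEN"] > 0 and v["N_ALLELE"] == 2


print(involves_insertion(variant))
```
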

The motivation for filters is that Indels are a heterogeneous set of variants, so we usually examine them by many different characteristics. The following programs support filters.

peek
view
concat
profile_snps
profile_indels
profile_na12878
profile_mendelian
profile_len
profile_chrom
profile_afs
profile_hwe
concordance
partition

Alternate headers

 As BCF is a restrictive format of VCF where all meta data must be present in the header, 
 vt provides a mechanism to read an alternative header for VCF files that do not have a 
 complete header.  Simply provide a header file stub named as <vcf-file>.hdr and vt
 will automatically read it instead of the original header in <vcf-file>.
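
The lookup rule described above can be sketched in Python as follows. This only illustrates the naming convention <vcf-file>.hdr; the function name header_source is hypothetical and vt's actual lookup may differ in detail.

```python
import os


def header_source(vcf_path: str) -> str:
    """Return the file whose header should be read for vcf_path.

    If a stub named <vcf-file>.hdr sits next to the VCF file, it is used
    instead of the header inside the VCF file itself.
    """
    stub = vcf_path + ".hdr"
    return stub if os.path.exists(stub) else vcf_path
```

For example, header_source("mills.vcf") returns "mills.vcf.hdr" when that stub exists and "mills.vcf" otherwise.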

VCF Manipulation

View

Views a VCF or VCF.GZ or BCF file.

  #views mills.bcf and outputs to standard out
  vt view -h mills.bcf 
  #views mills.bcf and locally sorts it in a 10000bp window and outputs to out.bcf
  vt view -h -w 10000 mills.bcf -o out.bcf

 usage : vt view [options] <in.vcf>

 options : -o  output VCF/VCF.GZ/BCF file [-]
           -w  local sorting window size [0]
           -s  print site information only without genotypes [false]
           -h  print header [false]
           -p  print options and summary []
           -I  file containing list of intervals []
           -i  intervals []
           -?  displays help

Index

Indexes a VCF.GZ or BCF file.

  #indexes mills.bcf
  vt index mills.bcf

 usage : vt index [options] <in.vcf>

 options : -p  print options and summary []
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help

Sorting

Local sorting can be done with view by setting the -w option to a non-zero value.

You will want to locally sort a VCF file if you are aware that the records are not in order in short stretches. This will not work if the records are not in chromosomal order. The window dictates the size of the region to buffer the records while sorting.
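
The idea of buffering records in a window can be sketched with a min-heap in Python. This is an illustration under stated assumptions, not vt's code: records are reduced to (chromosome, position) pairs, chromosomes arrive in order, and no record lies more than the window size before an earlier record on the same chromosome. The name local_sort is hypothetical.

```python
import heapq
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, int]  # (chromosome, position)


def local_sort(records: Iterable[Record], window: int) -> Iterator[Record]:
    """Yield records sorted by position within each chromosome."""
    heap: List[int] = []
    current: Optional[str] = None
    for chrom, pos in records:
        if chrom != current:
            # new chromosome: flush everything buffered for the previous one
            while heap:
                yield (current, heapq.heappop(heap))
            current = chrom
        heapq.heappush(heap, pos)
        # records further than `window` behind the incoming one are safe to emit
        while heap and heap[0] < pos - window:
            yield (current, heapq.heappop(heap))
    while heap:
        yield (current, heapq.heappop(heap))


unsorted = [("20", 100), ("20", 90), ("20", 20000), ("20", 19995), ("21", 5)]
for record in local_sort(unsorted, 10000):
    print(record)
```

The window trades memory for tolerance: a larger window buffers more records but copes with records that are further out of place.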

Normalization

Normalize variants in a VCF file. Normalized variants may have their positions changed; in such cases, the normalized variants are reordered and output in an ordered fashion. The local reordering takes place over a window of 10000 base pairs.

  #normalize variants and write out to dbsnp.normalized.vcf
  vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf

  #normalize variants, send to standard out and remove duplicates.
  vt normalize dbsnp.vcf -r seq.fa | vt mergedups - -o dbsnp.normalized.merged.vcf

  #variants that are normalized will be annotated with an OLD_VARIANT info tag.
  #CHROM  POS      ID   REF           ALT  QUAL  FILTER  INFO
  19	  29238772 .	C             G    .     PASS	 VT=SNP;OLD_VARIANT=19:29238771:TC/TG
  20	  60674709 .	GCCCAGCCCCAC  G    .     PASS	 VT=INDEL;OLD_VARIANT=20:60674718:CACCCCAGCCCC/C

  #this shows a sample output with the normalization operations that were used 
  #categorized into 5 categories each for biallelic and multiallelic variants. 

  stats: biallelic
         no. left trimmed                      : 156908
         no. right trimmed                     : 323
         no. left and right trimmed            : 33
         no. right trimmed and left aligned    : 7
         no. left aligned                      : 12360 

      total no. biallelic normalized           : 169631 
 

      multiallelic
         no. left trimmed                      : 627189
         no. right trimmed                     : 2509
         no. left and right trimmed            : 1498
         no. right trimmed and left aligned    : 212
         no. left aligned                      : 1783 

      total no. multiallelic normalized        : 633191 

      total no. variants normalized            : 802822
      total no. variants observed              : 88052639

  usage : vt normalize [options] <in.vcf>

  options : -o  output VCF file [-]
            -I  file containing list of intervals []
            -i  intervals []
            -r  reference sequence fasta file []
            --  ignores the rest of the labeled arguments following this flag
            -h  displays help
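
The trimming and left-aligning operations counted in the statistics above can be sketched in Python. This follows the general trim and left-align procedure for variant normalization and is not taken from the vt source; the function name normalize and the ref_base callback (which returns the reference base at a 1-based position) are hypothetical.

```python
from typing import Callable, List, Tuple


def normalize(pos: int, alleles: List[str],
              ref_base: Callable[[int], str]) -> Tuple[int, List[str]]:
    """Return the normalized (position, alleles); alleles[0] is REF."""
    if len(set(alleles)) < 2 or any(a == "" for a in alleles):
        raise ValueError("need at least two distinct, non-empty alleles")
    alleles = list(alleles)
    changed = True
    while changed:
        changed = False
        # right trim: drop the last base while all alleles share it
        if len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # left align: if an allele became empty, extend all alleles leftwards
        if any(a == "" for a in alleles):
            pos -= 1
            base = ref_base(pos)
            alleles = [base + a for a in alleles]
            changed = True
    # left trim: drop the first base while all alleles share it and keep >= 1 base
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# The SNP from the sample output: 19:29238771:TC/TG is left trimmed to C/G.
print(normalize(29238771, ["TC", "TG"], lambda p: "N"))

# A toy reference to show left alignment of a deletion in a CA repeat.
toy = "GGGCACACACAGGG"
print(normalize(8, ["CAC", "C"], lambda p: toy[p - 1]))
```
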

Decompose blocksub

Decomposes biallelic block substitutions into its constituent SNPs.

  #decomposes biallelic block substitutions into SNPs and write out to gatk.decomposed.vcf
  vt decompose_blocksub gatk.vcf -o gatk.decomposed.vcf

  #before decomposition
  #CHROM  POS      ID   REF     ALT             QUAL    FILTER  INFO                    FORMAT    S1                                                                          
  20	763837	.	CA	TG	50340.1	PASS	AC=1;AN=2	GT	0|1

  #after decomposition
  #CHROM  POS      ID   REF     ALT     QUAL    FILTER  INFO                                                            FORMAT  S1         
 20	763837	.	C	T	50340.1	PASS	AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG	GT	0|1
 20	763838	.	A	G	50340.1	PASS	AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG	GT	0|1
 

  description : decomposes biallelic block substitutions into their constituent SNPs in a VCF file. 

  usage : vt decompose_blocksub [options] <in.vcf> 

  options : -o  output VCF file [-]
            -I  file containing list of intervals []
            -i  intervals []
            -?  displays help
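
The before and after records above can be reproduced with a few lines of Python. This is a sketch of the idea only; the function name is hypothetical, and skipping positions where REF and ALT carry the same base is an assumption.

```python
from typing import List, Tuple


def decompose_blocksub(chrom: str, pos: int, ref: str, alt: str) -> List[Tuple[str, int, str, str]]:
    """Split an equal-length block substitution into per-base SNPs."""
    if len(ref) != len(alt):
        raise ValueError("block substitution alleles must have equal length")
    return [(chrom, pos + i, r, a)
            for i, (r, a) in enumerate(zip(ref, alt))
            if r != a]  # assumption: identical bases do not produce a record


for snp in decompose_blocksub("20", 763837, "CA", "TG"):
    print(snp)
```
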

Decompose

Decompose multiallelic variants in a VCF file. If the VCF file has genotype fields GT,PL or GL, they are modified to reflect the change in alleles. All other genotype fields are removed.

  #decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf
  vt decompose gatk.vcf -o gatk.decomposed.vcf

  #before decomposition
  #CHROM  POS      ID   REF     ALT             QUAL    FILTER  INFO                    FORMAT    S1                                     S2                                                                          
  1	3759889	.	TA	TAA,TAAA,T	.	PASS	AF=0.342,0.173,0.037	GT:DP:PL	  1/2:81:281,5,9,58,0,115,338,46,116,809	 0/0:86:0,30,323,31,365,483,38,291,325,567

  #after decomposition
  #CHROM  POS      ID   REF     ALT     QUAL    FILTER  INFO                                                            FORMAT  S1              S2             
  1	3759889	.	TA	TAA	.	PASS	AF=0.342,0.173,0.037;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T	GT:PL	1/.:281,5,9	0/0:0,30,323	
  1	3759889	.	TA	TAAA	.	.	OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T	                        GT:PL	./1:281,58,115	0/0:0,31,483	
  1	3759889	.	TA	T	.	.	OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T	                        GT:PL	./.:281,338,809	0/0:0,38,567

  One might want to post process the partial genotypes like 1/. to the best guess genotype based on the PL values.

  description : decomposes multiallelic variants into biallelic in a VCF file. 

  usage : vt decompose [options] <in.vcf> 

  options : -o  output VCF file [-]
            -I  file containing list of intervals []
            -i  intervals []
            -?  displays help
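
The GT and PL values in the example above can be derived with the standard VCF genotype ordering, in which the likelihood for genotype j/k (j <= k) is stored at index k(k+1)/2 + j. The Python sketch below reproduces the sample values; the function names are hypothetical and this is not the vt implementation.

```python
from typing import List


def pl_index(j: int, k: int) -> int:
    """Index of genotype j/k (j <= k) in a diploid PL or GL array."""
    return k * (k + 1) // 2 + j


def biallelic_pl(pl: List[int], alt: int) -> List[int]:
    """Keep the PL entries for 0/0, 0/alt and alt/alt."""
    return [pl[pl_index(0, 0)], pl[pl_index(0, alt)], pl[pl_index(alt, alt)]]


def biallelic_gt(gt: str, alt: int) -> str:
    """Rewrite an unphased genotype for the record that keeps only `alt`."""
    out = []
    for allele in gt.split("/"):
        if allele == "0":
            out.append("0")
        elif allele == str(alt):
            out.append("1")
        else:
            out.append(".")  # allele belongs to another decomposed record
    return "/".join(out)


s1_pl = [281, 5, 9, 58, 0, 115, 338, 46, 116, 809]
for alt in (1, 2, 3):
    print(biallelic_gt("1/2", alt), biallelic_pl(s1_pl, alt))
```
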

Merge duplicate variants

Merges duplicate variants by position with the option of considering alleles. (This just discards the duplicate variant that appears later in the VCF file)

  #merge duplicate variants and save output in mills.merged.vcf
  vt mergedups mills.vcf -o mills.merged.vcf

  usage : vt mergedups [options] <in.vcf>

  options : -o  output VCF file [-]
            -p  merge by position [false]

Concatenate

Concatenate VCF files. Assumes individuals are in the same order and files share the same header.

  #concatenate chr1.mills.bcf and chr2.mills.bcf
  vt concat chr1.mills.bcf chr2.mills.bcf -o mills.bcf

 usage : vt concat [options] <in1.vcf>...

 options : -s  print site information only without genotypes [false]
           -p  print options and summary []
           -L  file containing list of input VCF files
           -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals
           -?  displays help

VCF Inspection and Evaluation

Peek

Summarizes the variants in a VCF file.

  #summarizes the variants found in mills.vcf
  vt peek mills.vcf

 usage : vt peek [options] <in.vcf>

 options : -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals []
           -r  reference sequence fasta file []
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help

See the guide on variant classification for more detail.

#This is a sample output of a peek command which summarizes the variants found in a VCF file.
  stats: no. of samples                     :          0
         no. of chromosomes                 :         22

         ========== Micro variants ==========

         no. of SNPs                        :   77228885
             2 alleles (ts/tv)              :        77011302 (2.11) [52287790/24723512]
             3 alleles (ts/tv)              :          216560 (0.75) [185520/247600]
             4 alleles (ts/tv)              :            1023 (0.50) [1023/2046]

         no. of MNPs                        :          0
             2 alleles (ts/tv)              :               0 (-nan) [0/0]
             >=3 alleles (ts/tv)            :               0 (-nan) [0/0]

         no. Indels                         :    2147564
             2 alleles (ins/del)            :         2124842 (0.47) [683250/1441592]
             >=3 alleles (ins/del)          :           22722 (2.12) [32411/15286]

         no. SNP/MNP                        :          0
             3 alleles (ts/tv)              :               0 (-nan) [0/0] 
             >=4 alleles (ts/tv)            :               0 (-nan) [0/0] 

         no. SNP/Indels                     :      12913
             2 alleles (ts/tv) (ins/del)    :             412 (0.41) [120/292] (3.68) [324/88]
             >=3 alleles (ts/tv) (ins/del)  :           12501 (0.43) [7670/17649] (18.64) [12434/667]

         no. MNP/Indels                     :        153
             2 alleles (ts/tv) (ins/del)    :               0 (-nan) [0/0] (-nan) [0/0]
             >=3 alleles (ts/tv) (ins/del)  :             153 (0.30) [138/465] (0.27) [67/248]

         no. SNP/MNP/Indels                 :          2
             3 alleles (ts/tv) (ins/del)    :               0 (-nan) [0/0] (-nan) [0/0]
             4 alleles (ts/tv) (ins/del)    :               2 (0.00) [3/5] (1.00) [3/3]
             >=5 alleles (ts/tv) (ins/del)  :               0 (-nan) [0/0] (-nan) [0/0]

         no. of clumped variants            :      19025
             2 alleles                      :               0 (-nan) [0/0] (-nan) [0/0]
             3 alleles                      :           18508 (0.16) [12152/75366] (0.00) [93/18653]
             4 alleles                      :             451 (0.15) [369/2390] (0.33) [201/609]
             >=5 alleles                    :              66 (0.09) [37/414] (1.19) [107/90]

         ====== Other useful categories =====

         no. complex variants               :      32093
             2 alleles (ts/tv) (ins/del)    :             412 (0.41) [120/292] (3.68) [324/88]
             >=3 alleles (ts/tv) (ins/del)  :           31681 (0.21) [20369/96289] (0.64) [12905/20270]

         ======= Structural variants ========

         no. of structural variants         :      41217
             2 alleles                      :           38079
                 deletion                   :                13135
                 insertion                  :                16451
                    mobile element          :                    16253
                       ALU                  :                        12513
                       LINE1                :                         2911
                       SVA                  :                          829
                    numt                    :                      198
                 duplication                :                  664
                 inversion                  :                  100
                 copy number variation      :                 7729
             >=3 alleles                    :            3138
                 copy number variation      :                 3138 

         ========= General summary ========== 

         no. of reference                   :          0 

         no. of observed variants           :   79449759
         no. of unclassified variants       :          0
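
The ratios in parentheses in the output above are the quotients of the two counts in square brackets. A small Python sketch, with hypothetical function names, shows how a substitution is classed as a transition or transversion and how the ratio is formed, including the nan case for empty categories.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition stays within purines (A<->G) or within pyrimidines (C<->T)."""
    return ref != alt and ({ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES)


def ratio(numerator: int, denominator: int) -> float:
    """Quotient of two counts; nan when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


# Biallelic SNPs from the sample output: [52287790/24723512]
print(f"{ratio(52287790, 24723512):.2f}")
# Biallelic indels, ins/del: [683250/1441592]
print(f"{ratio(683250, 1441592):.2f}")
```
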

Partition

Partitions variants between 2 indexed VCF files.

  #partitions all variants in bi1.bcf and bi2.bcf
  vt partition bi1.bcf bi2.bcf

 Options:     input VCF file a   bi1.bcf
              input VCF file b   bi2.bcf 

   A:      504676 variants
   B:     1389333 variants 

                  ts/tv  ins/del
   A-B      37564 [0.19] [1.34]
   A&B     467112 [1.55] [0.72]
   B-A     922221 [1.20] [0.58]
   of A     92.6%
   of B     33.6%

  #partitions only passed variants in bi1.bcf and bi2.bcf
  vt partition bi1.bcf bi2.bcf -f PASS

 Options:     input VCF file a   bi1.bcf
              input VCF file b   bi2.bcf 
              [f] filter             PASS 

   A:      466148 variants
   B:      986056 variants 

                  ts/tv  ins/del
   A-B      47261 [0.44] [1.36]
   A&B     418887 [1.80] [0.68]
   B-A     567169 [1.43] [0.72]
   of A     89.9%
   of B     42.5%

partition v0.5

description : partition variants. check the overlap of variants between 2 data sets.

 usage : vt partition [options] <in1.vcf> <in2.vcf>

 options : -f  filter
           -I  file containing list of intervals []
           -i  intervals []
           -?  displays help
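
The A-B, A&B and B-A counts and the "of A" and "of B" percentages in the sample output are plain set arithmetic. The Python sketch below shows this on toy data and then checks the percentages against the first sample output; the function name and the variant keys are hypothetical.

```python
from typing import Dict, Set, Tuple

Key = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def partition(a: Set[Key], b: Set[Key]) -> Dict[str, float]:
    both = len(a & b)
    return {
        "A-B": len(a - b),
        "A&B": both,
        "B-A": len(b - a),
        "of A": 100.0 * both / len(a),  # share of A that is also in B
        "of B": 100.0 * both / len(b),  # share of B that is also in A
    }


a = {("20", 1, "A", "G"), ("20", 2, "C", "T"), ("20", 3, "G", "GA")}
b = {("20", 2, "C", "T"), ("20", 3, "G", "GA"), ("20", 9, "T", "C"), ("20", 12, "AT", "A")}
print(partition(a, b))

# Percentages from the first sample output: A&B = 467112, A = 504676, B = 1389333
print(round(100.0 * 467112 / 504676, 1), round(100.0 * 467112 / 1389333, 1))
```
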

Annotate Variants

Annotates variants in a VCF file. The GENCODE annotation file should be bgzipped and indexed with tabix. This is available in the vt resource bundle.

  #annotates the variants found in mills.vcf
  vt annotate_variants mills.vcf -r hs37d5.fa -g gencode.v19.annotation.gtf.gz

 #annotates variants with the following fields
 ##INFO=<ID=VT,Number=1,Type=String,Description="Variant Type - SNP, MNP, INDEL, CLUMPED"> 
 ##INFO=<ID=GENCODE_FS,Number=0,Type=Flag,Description="Frameshift INDEL">
 ##INFO=<ID=GENCODE_NFS,Number=0,Type=Flag,Description="Non Frameshift INDEL">

 usage : vt annotate_variants [options] <in.vcf>

 options : -g  GENCODE annotations GTF file []
           -r  reference sequence fasta file []
           -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals
           -?  displays help

Compute Features

Compute features in a VCF file. Examples of statistics are allele counts, the genotype likelihood based inbreeding coefficient, and Hardy-Weinberg genotype likelihood based allele frequencies.

  #compute features for the variants found in vt.vcf
  #requires GT, PL and DP
  vt compute_features vt.vcf

 #annotates variants with the following fields
 ##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate Allele Counts">
 ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Number Allele Counts">
 ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
 ##INFO=<ID=AF,Number=A,Type=Float,Description="Alternate Allele Frequency">
 ##INFO=<ID=GC,Number=G,Type=Integer,Description="Genotype Counts">
 ##INFO=<ID=GN,Number=1,Type=Integer,Description="Total Number of Genotypes Counts">
 ##INFO=<ID=GF,Number=G,Type=Float,Description="Genotype Frequency">
 ##INFO=<ID=HWEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency assuming HWE">
 ##INFO=<ID=HWEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency assuming HWE">
 ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency">
 ##INFO=<ID=MLEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency">
 ##INFO=<ID=HWE_LLR,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg ln(Likelihood Ratio)">
 ##INFO=<ID=HWE_LPVAL,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic ln(p-value)">
 ##INFO=<ID=HWE_DF,Number=1,Type=Integer,Description="Degrees of freedom for Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic">
 ##INFO=<ID=FIC,Number=1,Type=Float,Description="Genotype likelihood based Inbreeding Coefficient">
 ##INFO=<ID=AB,Number=1,Type=Float,Description="Genotype likelihood based Allele Balance">

 usage : vt compute_features [options] <in.vcf>

 options : -s  print site information only without genotypes [false]
           -o  output VCF/VCF.GZ/BCF file [-]
           -f  filter expression []
           -I  File containing list of intervals
           -i  Intervals
           -?  displays help

Profile Indels

Profile Indels. The reference data sets can be obtained from the vt resource bundle.

  #profile indels found in mills.vcf
  vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa  -i 20

 #this is a sample output for indel profiling.
 # square brackets contain the ins/del ratio.  
 # the FS/NFS field is the proportion of coding indels that are frame shifted.  
 # The numbers in parentheses are the counts of frame shift and non frame shift indels respectively.
 data set
   No Indels :      46974 [0.89]
      FS/NFS :       0.26 (8/23) 

 dbsnp
   A-B      30704 [0.92]
   A&B      16270 [0.83]
   B-A    2049488 [1.52]
   Precision    34.6%
   Sensitivity   0.8% 

 mills
   A-B      43234 [0.88]
   A&B       3740 [1.00]
   B-A     203278 [0.98]
   Precision     8.0%
   Sensitivity   1.8% 

 mills.chip
   A-B      46847 [0.89]
   A&B        127 [0.90]
   B-A       8777 [0.93]
   Precision     0.3%
   Sensitivity   1.4% 

 affy.exome.chip
   A-B      46911 [0.89]
   A&B         63 [0.43]
   B-A      33997 [0.47]
   Precision     0.1%
   Sensitivity   0.2%

 # This file contains information on how to process reference data sets.
 #
 # dataset - name of data set, this label will be printed.
 # type    - True Positives (TP) and False Positives (FP)
 #           overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively
 #         - annotation
 #           file is used for GENCODE annotation of frame shift and non frame shift Indels
 # filter  - filter applied to variants for this particular data set
 # path    - path of indexed BCF file
 #dataset              type         filter           path
 dbsnp                 TP           INDEL            dbsnp.xxsnps_indels.sites.bcf
 mills                 TP           INDEL            mills.208620indels.sites.bcf
 mills.chip            TP           INDEL            mills.chip.158samples.8904indels.sites.bcf
 #mills.chip.common     TP           INDEL&&AF>0.005  grch37/mills.chip.158samples.8904indels.sites.bcf
 affy.exome.chip       TP           INDEL            affy.exome.chip.1249samples.316520variants.sites.bcf
 #affy.exome.chip.poly  TP           INDEL&&AC!=0     grch37/affy.exome.chip.1249samples.316520variants.sites.bcf
 #affy.exome.chip.mono  FP           INDEL&&AC=0      grch37/affy.exome.chip.1249samples.316520variants.sites.bcf
 gencode.v19           annotation   .                gencode.v19.annotation.gtf.gz

 usage : vt profile_indels [options] <in.vcf>

 options : -g  file containing list of reference datasets []
           -I  file containing list of intervals []
           -i  intervals []
           -r  reference sequence fasta file []
           -?  displays help
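
The Precision, Sensitivity and FS/NFS figures in the sample output can be recomputed from the counts, treating A as the input data set and B as the reference data set. The Python sketch below uses hypothetical function names and checks the dbsnp block and the FS/NFS line.

```python
def precision(a_and_b: int, a_only: int) -> float:
    """Share of the input data set (A) that is also in the reference (B), in percent."""
    return 100.0 * a_and_b / (a_and_b + a_only)


def sensitivity(a_and_b: int, b_only: int) -> float:
    """Share of the reference data set (B) recovered by the input (A), in percent."""
    return 100.0 * a_and_b / (a_and_b + b_only)


def frameshift_proportion(fs: int, nfs: int) -> float:
    """Proportion of coding indels that are frame shifted."""
    return fs / (fs + nfs)


# dbsnp block: A-B = 30704, A&B = 16270, B-A = 2049488
print(f"{precision(16270, 30704):.1f}%")
print(f"{sensitivity(16270, 2049488):.1f}%")
# FS/NFS line: (8/23)
print(f"{frameshift_proportion(8, 23):.2f}")
```
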

Profile Mendelian Errors

Profile Mendelian errors

  #profile mendelian errors found in vt.genotypes.bcf, generate tables in the directory mendel, requires pdflatex.
  vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel

  #this is a sample output for mendelian error profiling.
  #R and A stand for reference and alternate allele respectively.
  #Error% - mendelian error (confounded with de novo mutation)
  #HomHet - Homozygous-Heterozygous genotype ratios
  #Het% - proportion of hets
  Mendelian Errors 

  Father Mother       R/R          R/A          A/A    Error(%) HomHet    Het(%)
  R/R    R/R        14889          210           38     1.64       nan    nan
  R/R    R/A         3403         3497           74     1.06      0.97  50.68
  R/R    A/A          176         1482          155    18.26       nan    nan
  R/A    R/R         3665         3652           68     0.92      1.00  49.91
  R/A    R/A         1015         3151          990     0.00      0.64  61.11
  R/A    A/A           43         1300         1401     1.57      1.08  48.13
  A/A    R/R          172         1365          147    18.94       nan    nan
  A/A    R/A           47         1164         1183     1.96      1.02  49.60
  A/A    A/A           20           78         5637     1.71       nan    nan 

  Parental            R/R          R/A          A/A    Error(%) HomHet    Het(%)
  R/R    R/R        14889          210           38     1.64       nan    nan
  R/R    R/A         7068         7149          142     0.99      0.99  50.28
  R/R    A/A          348         2847          302    18.59       nan    nan
  R/A    R/A         1015         3151          990     0.00      0.64  61.11
  R/A    A/A           90         2464         2584     1.75      1.05  48.81
  A/A    A/A           20           78         5637     1.71       nan    nan  

  Parental            R/R          R/A          A/A    Error(%) HomHet    Het(%)
  HOM    HOM        14909          288         5675     1.66       nan    nan
  HOM    HET         7158         9613         2726     1.19      1.00  49.90
  HET    HET         1015         3151          990     0.00      0.64  61.11
  HOMREF HOMALT       348         2847          302    18.59       nan    nan  

  total mendelian error :   2.505% 
  no. of trios     : 2
  no. of variants  : 25346

profile_mendelian v0.5

 usage : vt profile_mendelian [options] <in.vcf>

 options : -q  minimum genotype quality
           -d  minimum depth
           -r  reference sequence fasta file []
           -x  output latex directory []
           -p  pedigree file
           -I  file containing list of intervals []
           -i  intervals
           -?  displays help
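
The Error(%) column in the sample tables can be recomputed from the child genotype counts in each row. The Python sketch below codes biallelic genotypes as alternate allele counts (R/R = 0, R/A = 1, A/A = 2) and counts child genotypes that cannot be formed from one allele of each parent. The function names are hypothetical and this is not the vt implementation.

```python
from typing import Sequence, Set


def transmissible(parent: int) -> Set[int]:
    """Alternate allele counts (0 or 1) a parent can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[parent]


def consistent(father: int, mother: int, child: int) -> bool:
    """True if the child genotype is compatible with Mendelian inheritance."""
    return any(f + m == child
               for f in transmissible(father)
               for m in transmissible(mother))


def error_percent(father: int, mother: int, child_counts: Sequence[int]) -> float:
    """child_counts holds the number of R/R, R/A and A/A children."""
    errors = sum(n for g, n in enumerate(child_counts)
                 if not consistent(father, mother, g))
    return 100.0 * errors / sum(child_counts)


# Rows from the first sample table
print(f"{error_percent(0, 1, (3403, 3497, 74)):.2f}")    # R/R x R/A
print(f"{error_percent(0, 2, (176, 1482, 155)):.2f}")    # R/R x A/A
print(f"{error_percent(1, 1, (1015, 3151, 990)):.2f}")   # R/A x R/A
```

Because every child genotype is possible when both parents are heterozygous, errors in that row cannot be detected, which is why its error rate is always 0.00.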

Profile NA12878

Profile NA12878 against the Broad knowledgebase and Illumina Platinum Genomes reference data sets.

  #profile NA12878 overlap with broad knowledgebase and illumina platinum genomes for the file vt.genotypes.bcf for chromosome 20.
  vt profile_na12878  vt.genotypes.bcf -g na12878.reference.txt -r hs37d5.fa -i 20

  #this is a sample output for NA12878 profiling.
  #square brackets contain the ins/del ratio.
  #R and A stand for reference and alternate allele respectively.
    data set
   No Indels :      27770 [0.94]
      FS/NFS :       0.26 (8/23) 

 broad.kb
   A-B      13071 [1.19]
   A&B      14699 [0.76]
   B-A      21546 [0.62]
   Precision    52.9%
   Sensitivity  40.6% 

 illumina.platinum
   A-B      17952 [0.88]
   A&B       9818 [1.07]
   B-A       2418 [0.88]
   Precision    35.4%
   Sensitivity  80.2% 

 broad.kb
               R/R       R/A       A/A       ./.
   R/R         346       145         3      5473
   R/A           3      4133         9       758
   A/A           2       136      2186       956
   ./.           2       139        86       322 

   Total genotype pairs :      6963
   Concordance          :  95.72% (6665)
   Discordance          :   4.28% (298) 

 illumina.platinum
               R/R       R/A       A/A       ./.
   R/R        1768        85         2         0
   R/A          10      4479        14         0
   A/A          13       180      3028         0
   ./.          71        98        70         0

   Total genotype pairs :      9579
   Concordance          :  96.83% (9275)
   Discordance          :   3.17% (304)

  # This file contains information on how to process reference data sets.
  #
  # dataset - name of data set, this label will be printed.
  # type    - True Positives (TP) and False Positives (FP)
  #           overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively
  #         - annotation
  #           file is used for GENCODE annotation of frame shift and non frame shift Indels
  # filter  - filter applied to variants for this particular data set
  # path    - path of indexed BCF file
  #dataset              type         filter    path
  broad.kb              TP           PASS      /net/fantasia/home/atks/dev/vt/bundle/public/grch37/broad.kb.241365variants.genotypes.bcf
  illumina.platinum     TP           PASS      /net/fantasia/home/atks/dev/vt/bundle/public/grch37/NA12878.illumina.platinum.5284448variants.genotypes.bcf
  #gencode.v19           annotation   .         /net/fantasia/home/atks/dev/vt/bundle/public/grch37/gencode.v19.annotation.gtf.gz

profile_na12878 v0.5

 usage : vt profile_na12878 [options] <in.vcf>

 options : -g  file containing list of reference datasets []
           -I  file containing list of intervals []
           -i  intervals []
           -r  reference sequence fasta file []
           -?  displays help
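
The concordance figures in the sample output can be recomputed from the genotype tables once the rows and columns with missing genotypes (./.) are set aside. The Python sketch below uses a hypothetical function name and the broad.kb table from the sample output.

```python
from typing import List, Tuple


def concordance(matrix: List[List[int]]) -> Tuple[int, int, float]:
    """Return (total pairs, concordant pairs, concordance in percent).

    matrix[i][j] counts genotype pairs over R/R, R/A and A/A only.
    """
    total = sum(sum(row) for row in matrix)
    concordant = sum(matrix[i][i] for i in range(len(matrix)))
    return total, concordant, 100.0 * concordant / total


broad_kb = [
    [346, 145, 3],
    [3, 4133, 9],
    [2, 136, 2186],
]
total, agree, pct = concordance(broad_kb)
print(total, agree, f"{pct:.2f}%", f"{100.0 - pct:.2f}%")
```
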

Variant Calling

Discover

Discovers variants from reads in a BAM file.

  #discover variants from NA12878.bam and write to stdout
  vt discover -b NA12878.bam -s NA12878 -r hs37d5.fa -i 20 -v snps,indels,mnps

 usage : vt discover [options]

 options : -b  input BAM file
           -v  variant types [snps,mnps,indels]
           -f  fractional evidence cutoff for candidate allele [0.1]
           -e  evidence count cutoff for candidate allele [2]
           -q  base quality cutoff for bases [13]
           -m  MAPQ cutoff for alignments [20]
           -s  sample ID
           -r  reference sequence fasta file []
           -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals []
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help

Merge candidate variants

Merge candidate variants across samples. Each VCF file is required to have the FORMAT flags E and N and should have exactly one sample.

  #merge candidate variants from VCFs listed in candidates.txt and output in candidate.sites.vcf
  vt merge_candidate_variants -L candidates.txt -o candidate.sites.vcf

 usage : vt merge_candidate_variants [options]

 options : -L  file containing list of input VCF files
           -o  output VCF file [-]
           -I  file containing list of intervals []
           -i  intervals
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help

Construct Probes

Construct probes for genotyping a variant.

  #construct probes from candidates.sites.bcf and output to standard out
  vt construct_probes candidates.sites.bcf -r ref.fa

 usage : vt construct_probes [options] <in.vcf>

 options : -o  output VCF file [-]
           -f  minimum flank length [20]
           -r  reference sequence fasta file []
           -I  file containing list of intervals []
           -i  intervals []
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help

Genotype

Genotypes variants for each sample.

  #genotypes variants found in candidates.sites.vcf from sample.bam
  vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf

 usage : vt genotype [options]

 options : -r  reference sequence fasta file []
           -s  sample ID []
           -o  output VCF file [-]
           -b  input BAM file []
           -i  input candidate VCF file []
           --  ignores the rest of the labeled arguments following this flag
           -h  displays help

Resource Bundle

External : resource bundle
Internal : /net/fantasia/home/atks/ref/vt/grch37

Maintained by

This page is maintained by Adrian.
