Changes

From Genome Analysis Wiki
Jump to navigationJump to search
80,829 bytes added ,  04:36, 4 May 2021
Line 1: Line 1: −
=== Introduction ===
+
= Introduction =
   −
vt is a variant tool set that discovers short variants from Next Generation Sequencing data.  The features are being rolled out to github as major rewriting is being undertaken.
+
vt is a variant tool set that discovers short variants from Next Generation Sequencing data.
   −
=== Location ===
+
= Installation =
   −
The source files are housed in github.
+
== General ==
   −
   git clone https://github.com/atks/vt.git
+
The source files are housed in github.  [https://github.com/samtools/htslib htslib] is
 +
used and a copy of a developmental freeze is stored as part of the vt repository to
 +
ensure compatibility.
 +
 
 +
To install, perform the following steps:
 +
 
 +
  #this will create a directory named vt in the directory you cloned the repository
 +
   1. git clone https://github.com/atks/vt.git <br>
 +
  #change directory to vt
 +
  2. cd vt <br>
 +
  #update submodules
 +
  3. git submodule update --init --recursive <br>
 +
  #run make, note that compilers need to support the c++0x standard
 +
  4. make <br>
 +
  #you can test the build
 +
  5. make test
 +
  <div class=" mw-collapsible mw-collapsed">
 +
  An expected output when all is well for the tests is shown here. (click expand =>)
 +
  <div class="mw-collapsible-content">
 +
  user@server:~/vt$ make test
 +
  test/test.sh
 +
  ++++++++++++++++++++++
 +
  Tests for vt normalize
 +
  ++++++++++++++++++++++
 +
  testing normalize
 +
              output VCF file : ok
 +
              output logs    : ok
 +
  +++++++++++++++++++++++++++++++
 +
  Tests for vt decompose_blocksub
 +
  +++++++++++++++++++++++++++++++
 +
  testing decompose_blocksub of even-length blocks
 +
              output VCF file : ok
 +
              output logs    : ok
 +
  testing decompose_blocksub with alignment
 +
              output VCF file : ok
 +
              output logs    : ok
 +
  testing decompose_blocksub of phased even-length blocks
 +
              output VCF file : ok
 +
              output logs    : ok
 +
  ++++++++++++++++++++++
 +
  Tests for vt decompose
 +
  ++++++++++++++++++++++
 +
  testing decompose for a triallelic variant
 +
              output VCF file : ok
 +
              output logs    : ok <br>
 +
  Passed tests : 5 / 5
 +
  </div>
 +
  </div>
 +
 
 +
=== Mac ===
 +
 
 +
You may install vt via homebrew.
 +
 
 +
  brew tap brewsci/bio
 +
  brew tap brewsci/science
 +
 
 +
  brew install brewsci/bio/vt
 +
 
 +
 
 +
Building has been tested on Linux and Mac systems on gcc 4.8.1 and clang 3.4. <br>
 +
Some features of C++11 are used, thus there is a need for newer versions of gcc and clang.
 +
 
 +
= Updating =
 +
 
 +
vt is currently under heavy development, you will probably need to update often.
 +
 
 +
  #remove all object files
 +
  #you need to do this as source files as the static libraries might have changed and need to be removed.
 +
  1. make clean <br>
 +
  #update source files
 +
  2. git pull <br>
 +
  #compile and link, the -j option tells Makefile to run up to 40 independent commands in parallel
 +
  3. make -j 40
 +
 
 +
= General Features and notes =
    
== Common options ==
 
== Common options ==
Line 15: Line 89:  
     -I  multiple intervals in <seq>:<start>-<end> format listed in a text file line by line.
 
     -I  multiple intervals in <seq>:<start>-<end> format listed in a text file line by line.
   −
     -o defines the out file which and has the STDOUT set as the default.
+
     -o   defines the out file which and has the STDOUT set as the default.
          You may modify the STDOUT to output the binary version of the format.
+
        vt recognizes the appropriate output by file extension.
 +
        <name>.vcf    - uncompressed VCF
 +
        <name>.vcf.gz  - compressed VCF
 +
        <name>.bcf    - BCF
 +
        You may modify the STDOUT to output the binary version of the format. Uncompressed
 +
        VCF and BCF streams are indicated by - and + respectively. 
 +
 
 +
    -f  filter expression
 +
 
 +
    -s  sequential region selection as opposed to random access of regions specified by the i option.
 +
        This is useful when you want to select many close-by regions, while the -i option works,
 +
        it is less efficient and also selects a variant multiple times if it overlaps 2 regions.  This
 +
        option iterates through the variants in the file sequentially and checks for overlap with the
 +
        bed file given.
 +
 
 +
== Uncompressed BCF streams ==
 +
 
 +
htslib is designed with BCF as the underlying data structure  and it has incorporated
 +
awareness of uncompressed BCF streams in the i/o API.  One may use this feature to
 +
stream uncompressed BCF records to save on computational time spent on (de)compression.
 +
 
 +
  #using textual VCF streams indicated by -
 +
  cat mills.vcf | vt normalize - -r hs37d5.fa | vt uniq - -o out.bcf
 +
 
 +
  #using uncompressed BCF streams indicated by +
 +
  cat mills.vcf | vt normalize - -r hs37d5.fa -o + | vt uniq + -o out.bcf
 +
 
 +
In this example, the former took 0.84s while the latter took 0.64s to process. (24% speed up!)
 +
 
 +
== Filters ==
 +
 
 +
For some programs. you may define a filter via the -f option.
 +
 
 +
  This allows you to only analyse biallelic indels that are passed on chromosome 20.
 +
  vt profile_na12878 vt.bcf -g na12878.reference.txt -r genome.fa -f "N_ALLELE==2&&VTYPE==INDEL&&PASS"  -i 20
 +
 
 +
  This allows you to extract biallelic indels that are passed on chromosome 20.
 +
  vt view vt.bcf -f "N_ALLELE==2&&VTYPE==INDEL&&PASS"  -i 20
   −
== Programs ==
+
Other examples of filters
   −
=== Normalization ===
+
  #all variants with a SNP in them
 +
  VTYPE&SNP
 +
  #Simple insertions of length 1
 +
  VTYPE==INDEL&&DLEN==1
 +
  #Indels of length 1
 +
  VTYPE==INDEL&&LEN==1
 +
 
 +
  Variant characteristics
 +
    VTYPE,N_ALLELE,DLEN,LEN,VARIANT_CONTAINS_N
 +
 
 +
  Variant value types
 +
    SNP,MNP,INDEL,CLUMPED
 +
 
 +
  Biallelic SNPs only                        : VTYPE==SNP&&N_ALLELE==2
 +
  Biallelic Indels with embedded SNP          : VTYPE==(SNP|INDEL)&&N_ALLELE==2
 +
  Biallelic variants involving insertions    : VTYPE&INDEL&&DLEN>0&&N_ALLELE==2
 +
  Biallelic variants involving 1bp variants  : LEN==1&&N_ALLELE==2
 +
  Variants with explicit sequences with no Ns : ~VARIANT_CONTAINS_N
 +
 
 +
  REF field
 +
    REF
 +
 
 +
  ALT field
 +
    ALT
 +
 
 +
  QUAL field
 +
    QUAL
 +
 
 +
  FILTER fields
 +
    PASS, FILTER.<tag>
 +
 
 +
  INFO fields
 +
    INFO.<tag>
 +
 
 +
  A/C SNPs                                    : REF=='A' && ALT=='C'
 +
  AC type of STRs                            : REF=~'^.(AC)+$' || ALT=~'^.(AC)+$'
 +
  Passed biallelic SNPs only                  : PASS&&VTYPE==SNP&&N_ALLELE==2
 +
  Passed Common biallelic SNPs only          : PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005
 +
  Passed Common biallelic SNPs or rare indels : (PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005)||(VTYPE&INDEL&&INFO.AF<=0.005)
 +
  Passed Common biallelic SNPs or rare indels : ((PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005)||(VTYPE&INDEL&&INFO.AF<=0.005))&&QUAL>100
 +
  with quality greater than 100
 +
  Failed rare variants : ~PASS&&(INFO.AC/INFO.AN<0.005)
 +
 
 +
  [http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC1 Regular expression] matching PERL style (implemented with pcre2) 
 +
  Sometimes, an info field will contain several values in a string with functional annotation, to match what you want,
 +
  just use INFO.ANNO=~'<perl regular expression>'
 +
 
 +
  Passed variants in intergenic regions or UTR                    : PASS&&INFO.ANNO=~'Intergenic|UTR'
 +
  Passed variants in intergenic regions or UTR ignoring case      : PASS&&INFO.ANNO=~'(?i)Intergenic|UTR' <br>
 +
  pcre2's '(?i)Intergenic|UTR' is equivalent to PERL's '/intergenic|UTR/i'
 +
 
 +
  Operations
 +
  == : equivalence for strings and numbers
 +
  != : not equal
 +
  =~ : regular expression match for strings only
 +
  ~~ : not of =~.  Is equivalent to PERL's !~, this notation is used as BASH keeps interpreting ! for recalling commands from the history
 +
  ~  : logical not
 +
  && : logical and
 +
  || : logical or
 +
  &  : bitwise and
 +
  |  : bitwise or
 +
  +  : add
 +
  -  : subtract
 +
  *  : multiply
 +
  /  : divide
 +
 
 +
The following programs support filter expressions.
 +
 
 +
* view
 +
* peek
 +
* profile_snps
 +
* profile_indels
 +
* profile_na12878
 +
* profile_mendelian
 +
* profile_len
 +
* profile_chrom
 +
* profile_afs
 +
* profile_hwe
 +
* concordance
 +
* partition
 +
 
 +
== Alternate headers ==
 +
 
 +
  As BCF is a restrictive format of VCF where all meta data must be present in the header,
 +
  vt provides a mechanism to read an alternative header for VCF files that do not have a
 +
  well formed header.  Simply provide a header file stub named as <vcf-file>.hdr and vt
 +
  will automatically read it instead of the original header in <vcf-file>.
 +
 
 +
  For more information about VCF/BCF : http://samtools.github.io/hts-specs/VCFv4.2.pdf
 +
 
 +
  <span style="color:#FF0000">This mechanism is available only if one is reading VCF or compressed VCF files.  It is
 +
  disabled for BCF files as this might corrupt the BCF file because the encoding of the
 +
  fields in BCF records is based on the order of the meta info lines in the header.</span>
 +
 
 +
  <span style="color:#0000FF">Note: BCF2.2 introduces the IDX field in meta information lines that indicates the
 +
  dictionary encoding. This feature might be enabled for BCF files in the future.</span>
 +
 
 +
== General cases of Ploidy and Alleles ==
 +
 
 +
  I am trying to make vt handle [http://genome.sph.umich.edu/wiki/Relationship_between_Ploidy,_Alleles_and_Genotypes general cases of ploidy and alleles]. 
 +
  Please let me know if that is lacking in a tool that you are using.
 +
 
 +
== BCF Compression Levels vs Compression Time ==
 +
 
 +
The zlib deflation algorithm (a variant of LZ77) has 10 levels - 0 to 9.  0 has no compression but instead wraps <br>
 +
up the file in zlib or bgzf blocks.  It may be useful to have 0 compression as it is indexable with the same mechanism <br>
 +
used for compressed files.  Levels 1-9 denote an increasing compression level in exchange for longer times for <br>
 +
compression.
 +
 
 +
In general, zlib compression does not have significant differences in compression for BCF files between the 9 compression <br>
 +
levels as shown in the following table:
 +
 
 +
{| class="wikitable"
 +
|-
 +
! scope="col"| Compression Level
 +
! scope="col"| Size
 +
! scope="col"| Time
 +
|-
 +
|0<br>
 +
1<br>
 +
2<br>
 +
3<br>
 +
4<br>
 +
5<br>
 +
6 (default) <br>
 +
7<br>
 +
8<br>
 +
9
 +
|153GB <br>
 +
98.4GB <br>
 +
98.0GB <br>
 +
97.5GB <br>
 +
95.5GB <br>
 +
95.2GB <br>
 +
94.9GB <br>
 +
94.8GB <br>
 +
94.76GB <br>
 +
94.75GB
 +
|45m<br>
 +
2h3m <br>
 +
2h7m <br>
 +
2h12m <br>
 +
2h26m <br>
 +
2h54m <br>
 +
3h19m <br>
 +
3h41m <br>
 +
4h5m <br>
 +
4h25m
 +
|}
 +
 
 +
<span style="color:#0000FF">So, it might be a good idea to compress at lower levels when dealing with large temporary <br>
 +
files in a pipeline to save compute time.  This can be achieved with the -c option in vt view</span>
 +
 
 +
= VCF Manipulation =
 +
 
 +
=== View ===
 +
 
 +
 
 +
Views a VCF or VCF.GZ or BCF file.
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #views mills.bcf and outputs to standard out
 +
  vt view -h mills.bcf
 +
 
 +
  #views mills.bcf and locally sorts it in a 10000bp window and outputs to sorted-millsbcf
 +
  vt view -h -w 10000 mills.bcf -o sorted-mills.bcf
 +
 
 +
  #views mills.bcf and outputs to c1-mills.bcf with a compression level of 1.  By default,
 +
  #the compression level is 6 where lower levels compress the file less but are faster.
 +
  #The difference in compression for BCF files between level 1 to level 9 is about 5% of
 +
  #of a level 1 compression file.  The difference in time taken is about an additional 50%
 +
  #of a level 1 compression.  The levels range from 0 to 9 where 0 means no compression
 +
  #but the file is encapsulated in bgzf blocks that allows the file to be indexed.  A special
 +
  #level -1 denotes an uncompressed BCF file that is not encapsulated in bgzf blocks and
 +
  #are thus not indexable but are highly suitable for streaming between vt commands.
 +
  vt view -h mills.bcf -c 1 -o c1-mills.bcf
 +
 
 +
  #views mills.bcf and selects variants that overlap with the regions found in dust.bed from chromosome 20
 +
  #the -t option selects variants by checking if each variant overlaps with the regions in the bed file, this is
 +
  #as opposed to random accessing the variants via the index through the intervals defined in -i and -I options.
 +
  #this is useful when selecting variants from the target regions from an exome sequencing experiment.
 +
  vt view 10000 mills.bcf -t dust.bed -i 20
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt view [options] <in.vcf>
 +
 
 +
  options : -o  output VCF/VCF.GZ/BCF file [-]
 +
            -f  filter expression []
 +
            -w  local sorting window size [0]
 +
            -s  print site information only without genotypes [false]
 +
            -H  print header only, this option is honored only for STDOUT [false]
 +
            -h  omit header, this option is honored only for STDOUT [false]
 +
            -p  print options and summary []
 +
            -r  right window size for overlap []
 +
            -l  left window size for overlap []
 +
            -c  compression level 0-9, 0 and -1 denotes uncompressed with the former being wrapped in bgzf. [6]
 +
            -t  bed file for variant selection via streaming []
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Index ===
 +
 
 +
Indexes a VCF.GZ or BCF file.
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #indexes mills.bcf
 +
  vt index mills.bcf
 +
  #indexes mills.vcf.gz
 +
  vt index mills.vcf.gz
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt index [options] <in.vcf>
 +
 
 +
  options : -p  print options and summary []
 +
            --  ignores the rest of the labeled arguments following this flag
 +
            -h  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Sorting ===
 +
 
 +
Sorting may be done in 3 approaches.
 +
 
 +
Locally:<br>
 +
Performs sorting within a local window.  The window size may be set by the -w option. The default window size <br>
 +
is 1000bp and if a record is detected to be potentially out of order due to a small window size, it wil be reported.<br>
 +
Use this when your VCF records are grouped by chromosome but not ordered in short stretches.<br><br>
 +
 
 +
By chromosome: <br>
 +
Your VCF file is not ordered by the chromosomes in the header but is fully ordered within each chromosome.<br>
 +
The VCF file should be indexed and vt will output the records in the order of chromosomes given in the header. <br> <br>
 +
 
 +
Full sort [default option]: <br>
 +
No assumptions are made about the VCF file.  Records will be ordered by the order of contigs in the header.  <br>
 +
Smaller temporary ordered files are created and their names are <output_vcf>.<no>.bcf and after generating <br>
 +
these files, they are merged and output into <output_vcf>.<br>
    
<div class=" mw-collapsible mw-collapsed">
 
<div class=" mw-collapsible mw-collapsed">
[http://genome.sph.umich.edu/wiki/Variant_Normalization Normalize] variants in a VCF file.   
+
  #sorts mills.bcf and outputs to standard out in a 1000bp window.
 +
  vt sort -m local mills.bcf
 +
  #sorts mills.bcf and locally sorts it in a 10000bp window and outputs to out.bcf
 +
  vt sort -m local -w 10000 mills.bcf -o out.bcf
 +
  #sorts an indexed mills.bcf  with chromosomes not sorted in the contig order in the header
 +
  vt sort -m chrom mills.bcf -o out.bcf
 +
  #sorts mills.bcf with no assumption
 +
  vt sort mills.bcf -o out.bcf
 +
 
 
<div class="mw-collapsible-content">
 
<div class="mw-collapsible-content">
Normalized variants may have their positions changed; in such cases, the normalized variants
+
  usage : vt sort [options] <in.vcf>
are reordered and output in an ordered fashionThe local reordering takes place over a window
+
 
of 10000 base pairs.
+
  options : -m  sorting modes. [full]
 +
                local : locally sort within a 1000bp window.  Window size may be set by -w.
 +
                chrom : sort chromosomes based on order of contigs in header.
 +
                        input must be indexed.
 +
                full  : full sort with no assumptions.
 +
            -o  output VCF/VCF.GZ/BCF file. [-]
 +
            -w local sorting window size, set by default to 1000 under local mode. [0]
 +
            -p  print options and summary. []
 +
            -?  displays help
 +
 
 +
</div>
 
</div>
 
</div>
 +
 +
=== Normalization ===
 +
 +
<div>
 +
[http://genome.sph.umich.edu/wiki/Variant_Normalization Normalize] variants in a [http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42 VCF]  file [http://bioinformatics.oxfordjournals.org/content/31/13/2202 (Tan et al. 2015)] .  Normalized variants may have their positions changed; in such cases, the normalized variants
 +
are reordered and output in an ordered fashion.  The local reordering takes place over a window of 10000 base pairs which may be changed via the -w option.  There is an underlying assumption that the REF
 +
field is consistent with the reference sequence use, vt will check for this and will fail if reference inconsistency is encountered; this may be relaexd with the -n option.
 
</div>
 
</div>
    
<div class=" mw-collapsible mw-collapsed">
 
<div class=" mw-collapsible mw-collapsed">
   vt normalize mills.vcf -r seq.fa -o - mills.normalized.vcf
+
  #normalize variants and write out to dbsnp.normalized.vcf
 +
   vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf
    
   #normalize variants, send to standard out and remove duplicates.
 
   #normalize variants, send to standard out and remove duplicates.
   vt normalize mills.vcf -r seq.fa -o - | vt merge_duplicate_variants - -o mills.normalized_merged.vcf
+
   vt normalize dbsnp.vcf -r seq.fa | vt uniq - -o dbsnp.normalized.uniq.vcf
 +
 
 +
  #read in variants that do not contain N in the explicit alleles, normalize variants, send to standard out.
 +
  vt normalize dbsnp.vcf -r seq.fa -f "~VARIANT_CONTAINS_N"
 +
 
 +
  #variants that are normalized will be annotated with an OLD_VARIANT info tag.
 +
  #CHROM  POS      ID  REF          ALT  QUAL  FILTER  INFO
 +
  19   29238772 . C            G    .    PASS VT=SNP;OLD_VARIANT=19:29238771:TC/TG
 +
  20   60674709 . GCCCAGCCCCAC  G    .    PASS VT=INDEL;OLD_VARIANT=20:60674718:CACCCCAGCCCC/C
 +
 
 +
  #this shows a sample output with the normalization operations that were used
 +
  #categorized into 5 categories each for biallelic and multiallelic variants. <br>
 +
  stats: biallelic
 +
          no. left trimmed                      : 156908
 +
          no. right trimmed                    : 323
 +
          no. left and right trimmed            : 33
 +
          no. right trimmed and left aligned    : 7
 +
          no. left aligned                      : 12360 <br>
 +
      total no. biallelic normalized          : 169631 <br> <br>
 +
      multiallelic
 +
          no. left trimmed                      : 627189
 +
          no. right trimmed                    : 2509
 +
          no. left and right trimmed            : 1498
 +
          no. right trimmed and left aligned    : 212
 +
          no. left aligned                      : 1783 <br>
 +
      total no. multiallelic normalized        : 633191 <br>
 +
      total no. variants normalized            : 802822
 +
      total no. variants observed              : 88052639
 +
 
 
<div class="mw-collapsible-content">
 
<div class="mw-collapsible-content">
 
   usage : vt normalize [options] <in.vcf>
 
   usage : vt normalize [options] <in.vcf>
    +
  options : -o  output VCF file [-]
 +
            -d  debug [false]
 +
            -q  do not print options and summary [false]
 +
            -m  warns but does not exit when REF is inconsistent
 +
                with masked reference sequence for non SNPs.
 +
                This overides the -n option [false]
 +
            -n  warns but does not exit when REF is inconsistent
 +
                with reference sequence for non SNPs [false]
 +
            -w  window size for local sorting of variants [10000]
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -r  reference sequence fasta file []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 +
=== Decompose biallelic block substitutions ===
 +
 +
<div>
 +
Decomposes biallelic block substitutions into its constituent SNPs.  <br>
 +
There is now an additional option -a which decomposes non block substitutions into its constituent SNPs and indels. (kindly added by [[https://github.com/holtgrewe holtgrewe@github]]) <br>
 +
There is no exact solution and this decomposition is based on the best guess outcome using a Needleman-Wunsch algorithm. <br>
 +
You might also want to check out [https://github.com/vcflib/vcflib#vcfallelicprimitives vcfallelicprimitives]. <br>
 +
<br>
 +
There is now an additional option -m and -d which ensures that some MNVs are not decomposed. (kindly added by [[https://github.com/jaudoux jaudoux@github]]) <br>
 +
The motivation is from<br>
 +
*Exome-wide assessment of the functional impact and pathogenicity of multi-nucleotide mutations https://www.biorxiv.org/content/10.1101/258723v2.full<br>
 +
*Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes https://www.biorxiv.org/content/10.1101/573378v2.full<br>
 +
</div>
 +
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #decomposes biallelic block substitutions and write out to decomposed_blocksub.vcf
 +
  vt decompose_blocksub gatk.vcf -o decomposed_blocksub.vcf <br>
 +
  #before decomposition
 +
  #CHROM  POS    ID    REF    ALT    QUAL    FILTER  INFO            FORMAT  S1                                                                         
 +
  20   763837  . CA TG 50340.1 PASS AC=1;AN=2 GT 0|1 <br>
 +
  #after decomposition
 +
  #CHROM  POS    ID    REF    ALT    QUAL    FILTER  INFO                                    FORMAT  S1       
 +
  20   763837  . C T 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG GT 0|1
 +
  20   763838  . A G 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG GT 0|1
 +
 +
  #decomposes biallelic clumped variant and write out to decomposed_blocksub.vcf
 +
  vt decompose_blocksub -a gatk.vcf -o decomposed_blocksub.vcf <br>
 +
  #before decomposition
 +
  #CHROM  POS    ID    REF    ALT    QUAL    FILTER  INFO            FORMAT  S1                                                                         
 +
  20   763837  . CG TGA 50340.1 PASS AC=1;AN=2 GT 0|1 <br>
 +
  #after decomposition
 +
  #CHROM  POS    ID    REF    ALT    QUAL    FILTER  INFO                                    FORMAT  S1       
 +
  20   763837  . C T 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1
 +
  20   763838  . G GA 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1
 +
 +
  #decomposes biallelic clumped variant and write out to decomposed_blocksub.vcf and add phase set information in the genotype fields
 +
  vt decompose_blocksub -p gatk.vcf -o decomposed_blocksub.vcf <br>
 +
  #before decomposition
 +
  #CHROM  POS     ID   REF        ALT        QUAL FILTER INFO                                         FORMAT tumor     normal
 +
  1   159030    .   TAACCTTTC  TGACCTTTT  0.04 . AF=0.5                                         GT      0/0          1/1  <br>                                                             
 +
  #after decomposition
 +
  1   159031    .   A          G         0.04 . AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT GT:PS 0|0:159031  1|1:159031
 +
  1   159038    .   C          T         0.04 . AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT GT:PS 0|0:159031  1|1:159031
 +
   
 +
<div class="mw-collapsible-content">
 +
  description : decomposes biallelic block substitutions into its constituent SNPs. <br>
 +
  usage : vt decompose_blocksub [options] <in.vcf> <br>
 +
  options : -m  keep MNVs (multi-nucleotide variants) [false]
 +
            -a  enable aggressive/alignment mode [false]
 +
            -d  MNVs max distance (when -m option is used) [2]
 +
            -o  output VCF file [-]
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -?  displays help-a  enable aggressive/alignment mode
 +
 +
</div>
 +
</div>
 +
 +
=== Decompose===
 +
 +
<div>
 +
Decompose multiallelic variants in a [http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42 VCF]  file.  If the VCF file has genotype fields GT,PL, GL or DP, they are
 +
modified to reflect the change in alleles.  All other genotype fields are removed.  The -s option will retain the fields and decompose fields of counts R and A accordingingly.
 +
 +
Decomposition and combining variants is a complex operation where the correctness is dependent on [[https://github.com/tfarrah tfarrah@github]]:
 +
 +
*whether the observed variants are seen in the same sample,
 +
*if same sample, whether they are homozygous or heterozygous,
 +
*if both heterozygous, whether they are in the same haplotype or not (if known).
 +
 +
and one should be aware of the issues in handling variants resulting from such operations. <br>
 +
The original purpose of this tool is to allow for allelic comparisons between call sets.
 +
[[https://github.com/atks/vt/issues/16  example of a problem caused in combining separate variant records]]
 +
</div>
 +
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf
 +
  vt decompose gatk.vcf -o gatk.decomposed.vcf <br>
 +
  #before decomposition
 +
  #CHROM  POS    ID  REF    ALT        QUAL  FILTER  INFO                  FORMAT    S1                                    S2                                                                         
 +
  1      3759889 .    TA      TAA,TAAA,T  .      PASS    AF=0.342,0.173,0.037 GT:DP:PL   1/2:81:281,5,9,58,0,115,338,46,116,809 0/0:86:0,30,323,31,365,483,38,291,325,567 <br>
 +
  #after decomposition
 +
  #CHROM  POS    ID  REF    ALT        QUAL  FILTER  INFO                                        FORMAT  S1              S2           
 +
  1   3759889 .    TA      TAA   .   PASS    OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    1/.:281,5,9      0/0:0,30,323
 +
  1   3759889 .    TA      TAAA        .      .      OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./1:281,58,115  0/0:0,31,483
 +
  1   3759889 .    TA      T          .      .      OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./.:281,338,809  0/0:0,38,567 <br>
 +
  One might want to post process the partial genotypes like 1/. to the best guess genotype based on the PL values.
 +
 +
  #decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf with the -s option.
 +
  #-s option splits up INFO and GENOTYPE fields that have number counts of R and A [[https://samtools.github.io/hts-specs/VCFv4.2.pdf VCFv4.2 section 1.2.2]] appropriately.
 +
  vt decompose -s gatk.vcf -o gatk.decomposed.vcf <br>
 +
  #before decomposition
 +
  #CHROM  POS    ID  REF    ALT        QUAL  FILTER  INFO                  FORMAT    S1                                    S2                                                                         
 +
  1      3759889 .    TA      TAA,TAAA,T  .      PASS    AF=0.342,0.173,0.037 GT:DP:PL   1/2:81:281,5,9,58,0,115,338,46,116,809 0/0:86:0,30,323,31,365,483,38,291,325,567 <br>
 +
  #after decomposition
 +
  #CHROM  POS    ID  REF    ALT        QUAL  FILTER  INFO                                                FORMAT  S1              S2         
 +
  1   3759889 .    TA      TAA   .   PASS    AF=0.342;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    1/.:281,5,9      0/0:0,30,323
 +
  1   3759889 .    TA      TAAA        .      .      AF=0.173;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./1:281,58,115  0/0:0,31,483
 +
  1   3759889 .    TA      T          .      .      AF=0.037;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T    GT:PL    ./.:281,338,809  0/0:0,38,567 <br>
 +
  In general, you should recompute fields that involves alleles after decomposition.  Information is generally lost after vertically decomposing a variant, so care should be taken
 +
  in interpreting the resultant values. 
 +
 +
<div class="mw-collapsible-content">
 +
  description : decomposes multiallelic variants into biallelic in a VCF file. <br>
 +
  usage : vt decompose [options] <in.vcf> <br>
 +
  options : -s  smart decomposition [false]
 +
            -o  output VCF file [-]
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 +
=== Drop duplicate variants ===
 +
 +
Drops duplicate variants that appear later in the file.  VCF file must be ordered. <br>
 +
If there are OLD_VARIANT tags in the INFO field, the variants in these tags are aggregated in the unique record retained.
 +
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #drop duplicate variants and save output in mills.uniq.vcf
 +
  vt uniq mills.vcf -o mills.uniq.vcf
 +
<div class="mw-collapsible-content">
 +
  usage : vt uniq [options] <in.vcf>
 +
 +
  options : -o  output VCF file [-]
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 +
=== Paste ===
 +
 +
Pastes VCF files like the unix paste functions.
 +
 +
  Input requirements and assumptions:
 +
      1. Same variants are represented in the same order for each file (required)
 +
      2. Genotype field order are the same for corresponding records (required)
 +
      3. Sample names are different in all the files (warning will be given if not)
 +
      4. Headers are the same for all the files (assumption, not checked, will fail if output is BCF)
 +
  Outputs:
 +
      1. INFO fields output will be that of the first file
 +
      2. Genotype fields are the same for corresponding records
 +
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #paste together genotypes from the CEU trio into one file.
 +
  vt paste NA12878.mills.bcf NA12891.mills.bcf NA12892.mills.bcf -o ceu_trio.bcf
 +
 +
<div class="mw-collapsible-content">
 +
  usage : vt paste [options] <in1.vcf>...
 +
 
 +
  options : -L  file containing list of input VCF files
 +
            -o  output VCF file [-]
 +
            -p  print options and summary []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 +
=== Concatenate ===
 +
 +
Concatenates VCF files.  Assumes individuals are in the same order and files share the same header.
 +
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #concatenates chr1.mills.bcf and chr2.mills.bcf
 +
  vt cat chr1.mills.bcf chr2.mills.bcf -o mills.bcf
 +
 +
  #concatenates chr1.mills.bcf and chr2.mills.bcf with the naive option.
 +
  #The naive option assumes that the headers are all the same and skips
 +
  #merging headers and translating encodings between BCF files.  This is
 +
  #a much faster option if you know the nature of your BCF files in advance.
 +
  vt cat -n chr1.mills.bcf chr2.mills.bcf -o mills.bcf
 +
 +
<div class="mw-collapsible-content">
 +
  usage : vt cat [options] <in1.vcf>...
 +
 +
  options : -s  print site information only without genotypes [false]
 +
            -p  print options and summary [false]
 +
            -n  naive, assumes that headers are the same. [false]
 +
            -w  local sorting window size [0]
 +
            -f  filter expression []
 +
            -L  file containing list of input VCF files
 +
            -o  output VCF file [-]
 +
            -I  file containing list of intervals []
 +
            -i  intervals
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 +
=== Remove info tags ===
 +
 +
Removes INFO tags from a VCF file
 +
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #removes the INFO tags OLD_VARIANT, ENTROPY, PSCORE and COMP
 +
  vt rminfo exact.del.bcf -t OLD_VARIANT,ENTROPY,PSCORE,COMP -o rm.bcf
 +
 +
<div class="mw-collapsible-content">
 +
  usage : vt rminfo [options] <in.vcf>
 +
 +
  options : -o  output VCF file [-]
 +
            -q  do not print options and summary [false]
 +
            -t  list of info tags to be removed []
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 +
=== Filter ===
 +
 +
Filters variants in a VCF file
 +
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #adds a filter tag "refA" for variants where the REF column is a A sequence.
 +
  vt filter in.bcf -f "REF=='A'" -d "refA"
 +
 +
<div class="mw-collapsible-content">
 +
  usage : vt filter [options] <in.vcf> <br>
 +
  options : -x  clear filter [false]
 +
            -f  filter expression []
 +
            -d  filter tag description []
 +
            -t  filter tag []
 +
            -o  output VCF file [-]
 +
            -I  file containing list of intervals []
 +
            -i  intervals
 +
            -?  displays help </div>
 +
</div>
 +
 +
=== Filter overlap ===
 +
 +
Tags overlapping variants in a VCF file with the FILTER flag overlap.
 +
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #adds a filter tag "overlap" for overlapping variants within a window size of 1 based on the REF sequence.
 +
  vt filter_overlap in.bcf -w 1 out.bcf
 +
 +
  todo: option for considering END info tag for detecting overlaps.
 +
 +
<div class="mw-collapsible-content">
 +
  usage : vt filter_overlap [options] <in.vcf>
 +
 
 
   options : -o  output VCF file [-]
 
   options : -o  output VCF file [-]
 +
            -w  window overlap for variants [0]
 
             -I  file containing list of intervals []
 
             -I  file containing list of intervals []
 
             -i  intervals []
 
             -i  intervals []
             -r reference sequence fasta file []
+
             -? displays help
            --  ignores the rest of the labeled arguments following this flag
+
  </div>
            -h displays help
   
</div>
 
</div>
 +
 +
=== Validate ===
 +
 +
Checks the following properties of a VCF file:
 +
#order
 +
#reference sequence consistency
 +
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #validates lobstr.bcf
 +
  vt validate lobstr.bcf
 +
 +
<div class="mw-collapsible-content">
 +
  usage : vt validate [options] <in.vcf>
 +
 
 +
  options : -q  do not print invalid records [false]
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -r  reference sequence fasta file []
 +
            -?  displays help
 +
</div>
 
</div>
 
</div>
   −
=== Merge duplicate variants ===
+
=== Extract INFO fields to a tab delimited file ===
 +
 
 +
Converts a VCF file and its shared information in the INFO field to a tab delimited file for further analysis.
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #converts in.bcf to tab format with selected INFO and FILTER fields
 +
  vt info2tab in.bcf -u PASS -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT
 +
  <div style="height:6em; overflow:auto; border: 2px solid #FFF">
 +
  INPUT
 +
  =====
 +
  20 17548608 . A AC . PASS CENTERS=vbi;NCENTERS=1;OLD_MULTIALLELIC=20:17548598:GAAAAAAAAAAAAA/GAAAAAAAAAAAA/GAAAAAAAAAAAAAA/GAAAAAAAAAA/GAAAAAAAAAAA/GAAAAAAAAAACAAA;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAACAAAG;EX_MOTIF=C;EX_MLEN=1;EX_RU=C;EX_BASIS=C;EX_BLEN=1;EX_REPEAT_TRACT=17548608,17548609;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=2;EX_RL=2;EX_LL=3;EX_RU_COUNTS=0,2;EX_SCORE=0;EX_TRF_SCORE=-14;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=14;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[A]AAAGAAGGAA;MDUST;LOBSTR
 +
  20 17548608 . AAAAG A . PASS CENTERS=ox1;NCENTERS=1;EX_MOTIF=AAAG;EX_MLEN=4;EX_RU=AAAG;EX_BASIS=AG;EX_BLEN=2;EX_REPEAT_TRACT=17548609,17548612;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=0.75;EX_RL=4;EX_LL=4;EX_RU_COUNTS=0,1;EX_SCORE=0.75;EX_TRF_SCORE=-1;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=13;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[AAAAG]AAGGAACTAC;MDUST;LOBSTR;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAA
 +
  </div>
 +
  OUTPUT
 +
  ======
 +
  CHROM POS   REF   ALT N_ALLELE  PASS  EX_RL  FZ_RL MDUST LOBSTR VNTRSEEK  RMSK EX_REPEAT_TRACT_1 EX_REPEAT_TRACT_2
 +
  20 17548608  A   AC 2        1    2      13 1 1 0   0    17548608                17548608
 +
  20 17548608  AAAAG  A 2        1    4      13      1      1      0        0    17548609                17548609
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt info2tab [options] <in.vcf>
 +
 
 +
  options : -d  debug [false]
 +
            -f  filter expression []
 +
            -u  list of filter tags to be extracted []-t  list of info tags to be extracted []
 +
            -o  output tab delimited file [-]
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
= VCF Inspection and Evaluation =
 +
 
 +
=== Peek ===
 +
 
 +
Summarizes the variants in a VCF file
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #summarizes the variants found in mills.vcf
 +
  vt peek mills.vcf
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt peek [options] <in.vcf>
 +
 
 +
  options : -o  output VCF file [-]
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -r  reference sequence fasta file []
 +
            --  ignores the rest of the labeled arguments following this flag
 +
            -h  displays help
 +
</div>
 +
</div>
 +
 
 +
For a more detailed guide on [http://genome.sph.umich.edu/wiki/Variant_classification variant classification].
 +
 
 +
#This is a sample output of a peek command which summarizes the variants found in a VCF file.
 +
  stats: no. of samples                    :          0
 +
          no. of chromosomes                :        22<br>
 +
          ========== Micro variants ==========<br>
 +
          no. of SNPs                        :  77228885
 +
              2 alleles (ts/tv)              :        77011302 (2.11) [52287790/24723512]
 +
              3 alleles (ts/tv)              :          216560 (0.75) [185520/247600]
 +
              4 alleles (ts/tv)              :            1023 (0.50) [1023/2046]<br>
 +
          no. of MNPs                        :          0
 +
              2 alleles (ts/tv)              :              0 (-nan) [0/0]
 +
              >=3 alleles (ts/tv)            :              0 (-nan) [0/0]<br>
 +
          no. Indels                        :    2147564
 +
              2 alleles (ins/del)            :        2124842 (0.47) [683250/1441592]
 +
              >=3 alleles (ins/del)          :          22722 (2.12) [32411/15286]<br>
 +
          no. SNP/MNP                        :          0
 +
              3 alleles (ts/tv)              :              0 (-nan) [0/0]
 +
              >=4 alleles (ts/tv)            :              0 (-nan) [0/0] <br>
 +
          no. SNP/Indels                    :      12913
 +
              2 alleles (ts/tv) (ins/del)    :            412 (0.41) [120/292] (3.68) [324/88]
 +
              >=3 alleles (ts/tv) (ins/del)  :          12501 (0.43) [7670/17649] (18.64) [12434/667]<br>
 +
          no. MNP/Indels                    :        153
 +
              2 alleles (ts/tv) (ins/del)    :              0 (-nan) [0/0] (-nan) [0/0]
 +
              >=3 alleles (ts/tv) (ins/del)  :            153 (0.30) [138/465] (0.27) [67/248]<br>
 +
          no. SNP/MNP/Indels                :          2
 +
              3 alleles (ts/tv) (ins/del)    :              0 (-nan) [0/0] (-nan) [0/0]
 +
              4 alleles (ts/tv) (ins/del)    :              2 (0.00) [3/5] (1.00) [3/3]
 +
              >=5 alleles (ts/tv) (ins/del)  :              0 (-nan) [0/0] (-nan) [0/0]<br>
 +
          no. of clumped variants            :      19025
 +
              2 alleles                      :              0 (-nan) [0/0] (-nan) [0/0]
 +
              3 alleles                      :          18508 (0.16) [12152/75366] (0.00) [93/18653]
 +
              4 alleles                      :            451 (0.15) [369/2390] (0.33) [201/609]
 +
              >=5 alleles                    :              66 (0.09) [37/414] (1.19) [107/90]<br>
 +
          ====== Other useful categories =====<br>
 +
          no. complex variants              :      32093
 +
              2 alleles (ts/tv) (ins/del)    :            412 (0.41) [120/292] (3.68) [324/88]
 +
              >=3 alleles (ts/tv) (ins/del)  :          31681 (0.21) [20369/96289] (0.64) [12905/20270]<br>
 +
          ======= Structural variants ========<br>
 +
          no. of structural variants        :      41217
 +
              2 alleles                      :          38079
 +
                  deletion                  :                13135
 +
                  insertion                  :                16451
 +
                    mobile element          :                    16253
 +
                        ALU                  :                        12513
 +
                        LINE1                :                        2911
 +
                        SVA                  :                          829
 +
                    numt                    :                      198
 +
                  duplication                :                  664
 +
                  inversion                  :                  100
 +
                  copy number variation      :                7729
 +
              >=3 alleles                    :            3138
 +
                  copy number variation      :                3138 <br>
 +
          ========= General summary ========== <br>
 +
          no. of reference                  :          0 <br>
 +
          no. of observed variants          :  79449759
 +
          no. of unclassified variants      :          0
 +
 
 +
=== Partition ===
 +
 
 +
Partition variants from two data sets.
 +
 
 +
  <span style="color:#FF0000">Please note that this only works if the contigs in the headers of both data sets are the same.</span>
 +
 
 +
<div class=" mw-collapsible mw-collapsed"> 
 +
  #partitions all variants in bi1.bcf  and bi2.bcf
 +
  vt partition bi1.bcf bi2.bcf
 +
 
 +
  Options:    input VCF file a  bi1.bcf
 +
              input VCF file b  bi2.bcf <br>
 +
    A:      504676 variants
 +
    B:    1389333 variants <br>
 +
                  ts/tv  ins/del
 +
    A-B      37564 [0.19] [1.34]
 +
    A&B    467112 [1.55] [0.72]
 +
    B-A    922221 [1.20] [0.58]
 +
    of A    92.6%
 +
    of B    33.6%
 +
 
 +
  #partitions only passed variants in bi1.bcf and bi2.bcf
 +
  vt partition bi1.bcf bi2.bcf -f PASS
 +
 
 +
  Options:    input VCF file a  bi1.bcf
 +
              input VCF file b  bi2.bcf
 +
              [f] filter            PASS <br>
 +
    A:      466148 variants
 +
    B:      986056 variants <br>
 +
                  ts/tv  ins/del
 +
    A-B      47261 [0.44] [1.36]
 +
    A&B    418887 [1.80] [0.68]
 +
    B-A    567169 [1.43] [0.72]
 +
    of A    89.9%
 +
    of B    42.5%
 +
 
 +
<div class="mw-collapsible-content">
 +
 
 +
partition v0.5
 +
 
 +
description : partition variants. check the overlap of variants between 2 data sets.
 +
 
 +
  usage : vt partition [options] <in1.vcf><in2.vcf>
 +
 
 +
  options : -w  write partitioned variants to file
 +
            -f  filter
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -?  displays help
 +
  </div>
 +
</div>
 +
 
 +
=== Multi Partition ===
 +
 
 +
Partitions variants found in VCF files. <br>
 +
In comparison to the simple 2 way partition, this does not support writing out of partitions to file and
 +
reporting proportion of shared variants for each VCF.
 +
 
 +
<div class=" mw-collapsible mw-collapsed"> 
 +
  #partitions variants n-ways
 +
  vt multi_partition hc.genotypes.bcf pl.genotypes.bcf st.genotypes.bcf
 +
 
 +
  Options:    input VCF file a  hc.genotypes.bcf
 +
              input VCF file b  pl.genotypes.bcf
 +
              input VCF file c  st.genotypes.bcf <br>
 +
      A:      97274 variants
 +
      B:      95458 variants
 +
      C:      98943 variants <br>
 +
                  no  [ts/tv] [ins/del]
 +
      A--      3887  [1.10]  [0.86]
 +
      -B-      7890  [1.45]  [0.98]
 +
      AB-      4360  [0.99]  [1.32]
 +
      --C      8277  [1.75]  [2.21]
 +
      A-C      7458  [1.78]  [0.49]
 +
      -BC      1639  [1.63]  [1.03]
 +
      ABC      81569  [2.28]  [1.08] <br>
 +
      Unique variants    :    115080
 +
      Overall concordance :      70.88% (#intersection/#union)
 +
 
 +
<div class="mw-collapsible-content">
 +
 
 +
  usage : vt multi_partition [options] <in1.vcf><in2.vcf>...
 +
  options : -f  filter
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -?  displays help
 +
  </div>
 +
</div>
 +
 
 +
=== Annotate Regions ===
 +
 
 +
Annotates regions in a VCF file.  The BED file should be bgzipped and indexed with tabix. 
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #annotates the variants that overlap with coding regions.
 +
  vt annotate_regions mills.vcf -b coding.bed.gz -t CDS -d "Coding region"
 +
 
 +
  #annotates the variants that overlap with low complexity regions.
 +
  vt annotate_regions mills.vcf -b mdust.bed.gz -t DUST -d "DUST Low Complexity Region"
 +
 
 +
<div class="mw-collapsible-content">
 +
 
 +
  usage : vt annotate_regions [options] <in.vcf>
 +
 
 +
  options : -d  regions tag description []
 +
            -t  regions tag []
 +
            -b  regions BED file []
 +
            -o  output VCF file [-]
 +
            -I  file containing list of intervals []
 +
            -i  intervals
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Annotate Variants ===
 +
 
 +
Annotates variants in a VCF file.  The GENCODE annotation file should be bgzipped and indexed with tabix. 
 +
This is available in the [[Vt#Resource_Bundle|vt resource bundle]].
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #annotates the variants found in mills.vcf
 +
  vt annotate_variants mills.vcf -r hs37d5.fa -g gencode.v19.annotation.gtf.gz
 +
 
 +
  #annotates variants with the following fields
 +
  ##INFO=<ID=VT,Number=1,Type=String,Description="Variant Type - SNP, MNP, INDEL, CLUMPED">
 +
  ##INFO=<ID=GENCODE_FS,Number=0,Type=Flag,Description="Frameshift INDEL">
 +
  ##INFO=<ID=GENCODE_NFS,Number=0,Type=Flag,Description="Non Frameshift INDEL">
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt annotate_variants [options] <in.vcf>
 +
 
 +
  options : -g  GENCODE annotations GTF file []
 +
            -r  reference sequence fasta file []
 +
            -o  output VCF file [-]
 +
            -I  file containing list of intervals []
 +
            -i  intervals
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Compute Features ===
 +
 
 +
Compute features in a VCF file.  Example of statistics are Allele counts, [[Genotype_Likelihood_based_Inbreeding_Coefficient|Genotype Likelihood based Inbreeding Coefficient]].
 +
[[Genotype_Likelihood_based_Allele_Frequency|Hardy-Weinberg Genotype Likelihood based Allele Frequencies]] <br>
 +
For more customizable feature computation - look at [http://genome.sph.umich.edu/wiki/Vt#Estimate estimate]
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #compute features for the variants found in vt.vcf
 +
  #requires GT, PL and DP
 +
  vt compute_features vt.vcf
 +
 
 +
  #annotates variants with the following fields
 +
  ##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate Allele Counts">
 +
  ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Number Allele Counts">
 +
  ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
 +
  ##INFO=<ID=AF,Number=A,Type=Float,Description="Alternate Allele Frequency">
 +
  ##INFO=<ID=GC,Number=G,Type=Integer,Description="Genotype Counts">
 +
  ##INFO=<ID=GN,Number=1,Type=Integer,Description="Total Number of Genotypes Counts">
 +
  ##INFO=<ID=GF,Number=G,Type=Float,Description="Genotype Frequency">
 +
  ##INFO=<ID=HWEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency assuming HWE">
 +
  ##INFO=<ID=HWEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency assuming HWE">
 +
  ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency">
 +
  ##INFO=<ID=MLEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency">
 +
  ##INFO=<ID=HWE_LLR,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg ln(Likelihood Ratio)">
 +
  ##INFO=<ID=HWE_LPVAL,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic ln(p-value)">
 +
  ##INFO=<ID=HWE_DF,Number=1,Type=Integer,Description="Degrees of freedom for Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic">
 +
  ##INFO=<ID=FIC,Number=1,Type=Float,Description="Genotype likelihood based Inbreeding Coefficient">
 +
  ##INFO=<ID=AB,Number=1,Type=Float,Description="Genotype likelihood based Allele Balance">
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt compute_features for variants [options] <in.vcf>
 +
 +
  options : -s  print site information only without genotypes [false]
 +
            -o  output VCF/VCF.GZ/BCF file [-]
 +
            -f  filter expression []
 +
            -I  File containing list of intervals
 +
            -i  Intervals
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Estimate ===
 +
 
 +
  Compute variant based estimates. 
 +
 
 +
  Example of statistics are:
 +
  * Allele counts
 +
  * [[Genotype_Likelihood_based_Allele_Frequency|Hardy-Weinberg Genotype Likelihood based Allele Frequencies]]
 +
  * [[Genotype_Likelihood_based_Inbreeding_Coefficient|Genotype Likelihood based Inbreeding Coefficient]]
 +
  * [[HWEP|Genotype Likelihood based Hardy-Weinberg test]]
 +
  * [[Genotype_Likelihood_Based_Allele_Balance|Genotype Likelihood based Allele Balance]]
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #compute features for the variants found in vt.vcf
 +
  #requires GT and PL
 +
  vt estimate -e AF,MLEAF vt.vcf
 +
 
 +
  AF        Genotype (GT) based allele frequencies
 +
              If genotypes are unavailable, best guess
 +
              genotypes are inferred based on genotype
 +
              likelihoods (GL or PL)
 +
              AC        : Alternate Allele counts
 +
              AN        : Total allele counts
 +
              NS        : No. of samples.
 +
              AF        : Alternate allele frequencies.
 +
  MLEAF      GL based allele frequencies estimates
 +
              MLEAF    : Alternate allele frequency derived from MLEGF
 +
              MLEGF    : Genotype frequencies.
 +
  HWEAF      GL based allele frequencies estimates assuming HWE
 +
              HWEAF    : Alternate allele frequencies
 +
              HWEGF    : Genotype frequencies derived from HWEAF.
 +
  HWE        GL based Hardy-Weinberg statistics.
 +
              HWE_LLR  : log likelihood ratio
 +
              HWE_LPVAL : log p-value
 +
              HWE_DF    : degrees of freedom
 +
  AB        GL based Allele Balance.
 +
  FIC        GL based Inbreeding Coefficient
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt estimate [options] <in.vcf>
 +
 +
  options : -s  print site information only without genotypes [false]
 +
            -o  output VCF/VCF.GZ/BCF file [-]
 +
            -e  comma separated estimates to be computed []
 +
            -f  filter expression []
 +
            -I  File containing list of intervals
 +
            -i  Intervals
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Profile Mendelian Errors ===
 +
 
 +
Profile Mendelian errors
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #profile mendelian errors found in vt.genotypes.bcf, generate [[media:mendel.pdf|tables]] in the directory mendel, requires pdflatex.
 +
  vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel
 +
 
 +
  pedigree file format is described in [[Vt#Pedigree File|here]].
 +
 
 +
  #this is a sample output for mendelian error profiling.
 +
  #R and A stand for reference and alternate allele respectively.
 +
  #Error% - mendelian error (confounded with de novo mutation)
 +
  #HomHet - Homozygous-Heterozygous genotype ratios
 +
  #Het% - proportion of hets
 +
  Mendelian Errors <br>
 +
  Father Mother      R/R          R/A          A/A    Error(%) HomHet    Het(%)
 +
  R/R    R/R        14889          210          38    1.64      nan    nan
 +
  R/R    R/A        3403        3497          74    1.06      0.97  50.68
 +
  R/R    A/A          176        1482          155    18.26      nan    nan
 +
  R/A    R/R        3665        3652          68    0.92      1.00  49.91
 +
  R/A    R/A        1015        3151          990    0.00      0.64  61.11
 +
  R/A    A/A          43        1300        1401    1.57      1.08  48.13
 +
  A/A    R/R          172        1365          147    18.94      nan    nan
 +
  A/A    R/A          47        1164        1183    1.96      1.02  49.60
 +
  A/A    A/A          20          78        5637    1.71      nan    nan <br>
 +
  Parental            R/R          R/A          A/A    Error(%) HomHet    Het(%)
 +
  R/R    R/R        14889          210          38    1.64      nan    nan
 +
  R/R    R/A        7068        7149          142    0.99      0.99  50.28
 +
  R/R    A/A          348        2847          302    18.59      nan    nan
 +
  R/A    R/A        1015        3151          990    0.00      0.64  61.11
 +
  R/A    A/A          90        2464        2584    1.75      1.05  48.81
 +
  A/A    A/A          20          78        5637    1.71      nan    nan  <br>
 +
  Parental            R/R          R/A          A/A    Error(%) HomHet    Het(%)
 +
  HOM    HOM        14909          288        5675    1.66      nan    nan
 +
  HOM    HET        7158        9613        2726    1.19      1.00  49.90
 +
  HET    HET        1015        3151          990    0.00      0.64  61.11
 +
  HOMREF HOMALT      348        2847          302    18.59      nan    nan  <br>
 +
  total mendelian error :  2.505%
 +
  no. of trios    : 2
 +
  no. of variants  : 25346
 +
 
 +
<div class="mw-collapsible-content">
 +
profile_mendelian v0.5
 +
 
 +
  usage : vt profile_mendelian [options] <in.vcf>
 +
 
 +
  options : -q  minimum genotype quality
 +
            -d  minimum depth
 +
            -r  reference sequence fasta file []
 +
            -x  output latex directory []
 +
            -p  pedigree file
 +
            -I  file containing list of intervals []
 +
            -i  intervals
 +
          -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Profile SNPs ===
 +
 
 +
Profile SNPs.  The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #profile snps found in 20.sites.vcf
 +
  vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa  -i 20
 +
 
 +
  #this is a sample output for indel profiling.
 +
  # square brackets contain the ts/tv ratio. 
 +
  # The numbers in curved bracket are the counts of ts and tv SNPs respectively.
 +
  # Low complexity shows what percent of the SNPs are in low complexity regions.
 +
  data set
 +
    No. SNPs          :    508603 [2.09]
 +
        Low complexity :      0.08 (39837/508603) <br>
 +
  1000g
 +
    A-B    109970 [1.39]
 +
    A&B    398633 [2.37]
 +
    B-A    1340682 [2.26]
 +
    Precision    78.4%
 +
    Sensitivity  22.9% <br>
 +
  dbsnp
 +
    A-B    324063 [1.99]
 +
    A&B    184540 [2.29]
 +
    B-A    103893 [2.60]
 +
    Precision    36.3%
 +
    Sensitivity  64.0%
 +
 
 +
  # This file contains information on how to process reference data sets.
 +
  #
 +
  # dataset - name of data set, this label will be printed.
 +
  # type    - True Positives (TP) and False Positives (FP)
 +
  #          overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively
 +
  #        - annotation
 +
  #          file is used for GENCODE annotation of frame shift and non frame shift Indels
 +
  # filter  - filter applied to variants for this particular data set
 +
  # path    - path of indexed BCF file
 +
  #dataset              type            filter                                path
 +
  1000g                  TP              N_ALLELE==2&&VTYPE==SNP                /net/fantasia/home/atks/ref/vt/grch37/1000G.v5.snps.indels.complex.svs.sites.bcf
 +
  dbsnp                  TP              N_ALLELE==2&&VTYPE==SNP                /net/fantasia/home/atks/ref/vt/grch37/dbSNP138.snps.indels.complex.sites.bcf
 +
  GENCODE_V19            cds_annotation  .                                      /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz
 +
  DUST                  cplx_annotation  .                                      /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt profile_snps [options] <in.vcf>
 +
 
 +
  options : -f  filter expression []
 +
            -g  file containing list of reference datasets []
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -r  reference sequence fasta file []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Profile Indels ===
 +
 
 +
Profile Indels.  The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #profile indels found in mills.vcf
 +
  vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa  -i 20
 +
 
 +
  #this is a sample output for indel profiling.
 +
  # square brackets contain the ins/del ratio. 
 +
  # for the FS/NFS field, that is the proportion of coding indels that are frame shifted. 
 +
  # The numbers in curved bracket are the counts of frame shift and non frame shift indels respectively.
 +
  data set
 +
    No Indels :      46974 [0.89]
 +
      FS/NFS :      0.26 (8/23) <br>
 +
  dbsnp
 +
    A-B      30704 [0.92]
 +
    A&B      16270 [0.83]
 +
    B-A    2049488 [1.52]
 +
    Precision    34.6%
 +
    Sensitivity  0.8% <br>
 +
  mills
 +
    A-B      43234 [0.88]
 +
    A&B      3740 [1.00]
 +
    B-A    203278 [0.98]
 +
    Precision    8.0%
 +
    Sensitivity  1.8% <br>
 +
  mills.chip
 +
    A-B      46847 [0.89]
 +
    A&B        127 [0.90]
 +
    B-A      8777 [0.93]
 +
    Precision    0.3%
 +
    Sensitivity  1.4% <br>
 +
  affy.exome.chip
 +
    A-B      46911 [0.89]
 +
    A&B        63 [0.43]
 +
    B-A      33997 [0.47]
 +
    Precision    0.1%
 +
    Sensitivity  0.2% <br>
 +
 
 +
  # This file contains information on how to process reference data sets.
 +
  # dataset - name of data set, this label will be printed.
 +
  # type    - True Positives (TP) and False Positives (FP).
 +
  #          overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively.
 +
  #        - annotation.
 +
  #          file is used for GENCODE annotation of frame shift and non frame shift Indels.
 +
  # filter  - filter applied to variants for this particular data set.
 +
  # path    - path of indexed BCF file.
 +
  #dataset    type            filter                      path
 +
  1000g        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/1000G.snps_indels.sites.bcf
 +
  mills        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/mills.208620indels.sites.bcf
 +
  dbsnp        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/dbsnp.13147541variants.sites.bcf
 +
  GENCODE_V19  cds_annotation  .                            /net/fantasia/home/atks/ref/vt/grch37/gencode.cds.bed.gz
 +
  DUST        cplx_annotation .                            /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt profile_indels [options] <in.vcf>
 +
 
 +
  options : -g  file containing list of reference datasets []
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -r  reference sequence fasta file []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Profile VNTRs ===
 +
 
 +
Profile VNTRs.  The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
 
 +
  #profiles a set of VNTRs
 +
  vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt
 +
 
 +
 
 +
  profile_vntrs v0.5
 +
 
 +
    no VNTRs          5660874          #number of VNTRs in vntrs.sites.bcf
 +
    no low complexity  2686460 (47.46%)  #number of VNTRs in low complexity region determined by MDUST
 +
    no coding          17911 (0.32%)    #number of VNTRs in coding regions determined by GENCODE v7
 +
    no redundant      1312209 (23.18%)  #number of VNTRs involved in overlapping with one another<br>
 +
  trf_lobstr (1638516)  #TRF based reference set used in lobSTR, motif lengths 1 to 6.
 +
    A-B    3269285    #TRs specific to vntrs.sites.bcf
 +
    A-B~    1666185    #TRs in vntrs.sites.bcf that overlap partially with at least one TR in TRF(lobSTR) but does not overlap exactly with another TR.
 +
    A&B1    725404    #TRs in vntrs.sites.bcf that overlap exactly with at least one TR in TRF(lobSTR)
 +
    A&B2    723195    #TRs in TRF(lobSTR) that overlap exactly with at least one TR in vntrs.sites.bcf
 +
    B-A~    710075    #TRs in TRF(lobSTR) that overlap partially with at least one TR in vntrs.sites.bcf but does not overlap exactly with another TR.
 +
    B-A      205246    #TRs specific to TRF(lobSTR)
 +
  #note that the first 3 rows should sum up to the number of TRs in vntrs.sites.bcf
 +
  #and the 4th to 6th rows should sum up to the number of TRs in TRF( lobSTR)
 +
  #This basically allows us to see the m to n overlapping in overlapping TRs<br>
 +
  trf_repeatseq (1624553) #TRF based reference set used in repeatseq, motif lengths 1 to 6.
 +
    A-B    3291652
 +
    A-B~    1650190
 +
    A&B1    719032
 +
    A&B2    716838
 +
    B-A~    703948
 +
    B-A      203767  <br>
 +
  trf_vntrseek (230306)  #TRF based reference set used in vntrseek, motif lengths 7 to 2000.
 +
    A-B    5384453
 +
    A-B~    271302
 +
    A&B1      5119
 +
    A&B2      4973
 +
    B-A~      92496
 +
    B-A      132837  <br>
 +
  codis+ (15)            #CODIS STRs + 2 STRs from PROMEGA
 +
    A-B    5660794
 +
    A-B~        79
 +
    A&B1          1
 +
    A&B2          1
 +
    B-A~        14
 +
    B-A          0
 +
 
 +
  # This file contains information on how to process reference data sets.
 +
  # dataset - name of data set, this label will be printed.
 +
  # type    - True Positives (TP) and False Positives (FP).
 +
  #          overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively.
 +
  #        - annotation.
 +
  #          file is used for GENCODE annotation of coding VNTRs.
 +
  # filter  - filter applied to variants for this particular data set.
 +
  # path    - path of indexed BCF file.
 +
  #dataset      type            filter                      path
 +
  trf_lobstr    TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.lobstr.sites.bcf
 +
  trf_repeatseq TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.repeatseq.sites.bcf
 +
  trf_vntrseek  TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.vntrseek.sites.bcf
 +
  codis+        TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/codis.strs.sites.bcf
 +
  GENCODE_V19  cds_annotation  .                            /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz
 +
  DUST          cplx_annotation .                             
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt profile_vntrs [options] <in.vcf>
 +
 
 +
  options : -g  file containing list of reference datasets []
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -r  reference sequence fasta file []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Profile NA12878 ===
 +
 
 +
Profile Mendelian errors
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #profile NA12878 overlap with broad knowledgebase and illumina platinum genomes for the file vt.genotypes.bcf for chromosome 20.
 +
  vt profile_na12878  vt.genotypes.bcf -g na12878.reference.txt -r hs37d5.fa -i 20
 +
 
 +
  #this is a sample output for mendelian error profiling.
 +
  #R and A stand for reference and alternate allele respectively.
 +
  #Error% - mendelian error (confounded with de novo mutation)
 +
  #HomHet - Homozygous-Heterozygous genotype ratios
 +
  #Het% - proportion of hets
 +
    data set
 +
    No Indels :      27770 [0.94]
 +
      FS/NFS :      0.26 (8/23) <br>
 +
  broad.kb
 +
    A-B      13071 [1.19]
 +
    A&B      14699 [0.76]
 +
    B-A      21546 [0.62]
 +
    Precision    52.9%
 +
    Sensitivity  40.6% <br>
 +
  illumina.platinum
 +
    A-B      17952 [0.88]
 +
    A&B      9818 [1.07]
 +
    B-A      2418 [0.88]
 +
    Precision    35.4%
 +
    Sensitivity  80.2% <br>
 +
  broad.kb
 +
                R/R      R/A      A/A      ./.
 +
    R/R        346      145        3      5473
 +
    R/A          3      4133        9      758
 +
    A/A          2      136      2186      956
 +
    ./.          2      139        86      322 <br>
 +
    Total genotype pairs :      6963
 +
    Concordance          :  95.72% (6665)
 +
    Discordance          :  4.28% (298) <br>
 +
  illumina.platinum
 +
                R/R      R/A      A/A      ./.
 +
    R/R        1768        85        2        0
 +
    R/A          10      4479        14        0
 +
    A/A          13      180      3028        0
 +
    ./.          71        98        70        0<br>
 +
    Total genotype pairs :      9579
 +
    Concordance          :  96.83% (9275)
 +
    Discordance          :  3.17% (304)
 +
 
 +
  # This file contains information on how to process reference data sets.
 +
  #
 +
  # dataset - name of data set, this label will be printed.
 +
  # type    - True Positives (TP) and False Positives (FP)
 +
  #          overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively
 +
  #        - annotation
 +
  #          file is used for GENCODE annotation of frame shift and non frame shift Indels
 +
  # filter  - filter applied to variants for this particular data set
 +
  # path    - path of indexed BCF file
 +
  #dataset              type        filter    path
 +
  broad.kb              TP          PASS      /net/fantasia/home/atks/dev/vt/bundle/public/grch37/broad.kb.241365variants.genotypes.bcf
 +
  illumina.platinum    TP          PASS      /net/fantasia/home/atks/dev/vt/bundle/public/grch37/NA12878.illumina.platinum.5284448variants.genotypes.bcf
 +
  #gencode.v19          annotation  .        /net/fantasia/home/atks/dev/vt/bundle/public/grch37/gencode.v19.annotation.gtf.gz
 +
<div class="mw-collapsible-content">
 +
profile_na12878 v0.5
 +
 
 +
  usage : vt profile_na12878 [options] <in.vcf>
 +
 
 +
  options : -g  file containing list of reference datasets []
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -r  reference sequence fasta file []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
= Variant Calling =
 +
 
 +
 
 +
=== Discover ===
 +
 
 +
Discovers variants from reads in a BAM/CRAM file.
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #discover variants from NA12878.bam and write to stdout
 +
  vt discover -b NA12878.bam -s NA12878 -r hs37d5.fa -i 20
 +
<div class="mw-collapsible-content">
 +
  usage : vt discover2 [options]
 +
 
 +
  options : -b  input BAM/CRAM file
 +
          -y  soft clipped unique sequences cutoff [0]
 +
          -x  soft clipped mean quality cutoff [0]
 +
          -w  insertion desired type II error [0.0]
 +
          -c  insertion desired type I error [0.0]
 +
          -h  insertion fractional evidence cutoff [0]
 +
          -g  insertion count cutoff [1]
 +
          -n  deletion desired type II error [0.0]
 +
          -m  deletion desired type I error [0.0]
 +
          -v  deletion fractional evidence cutoff [0]
 +
          -u  deletion count cutoff [1]
 +
          -k  snp desired type II error [0.0]
 +
          -j  snp desired type I error [0.0]
 +
          -f  snp fractional evidence cutoff [0]
 +
          -e  snp evidence count cutoff [1]
 +
          -q  base quality cutoff for bases [0]
 +
          -C  likelihood ratio cutoff [0]
 +
          -B  reference bias [0]
 +
          -a  read exclude flag [0x0704]
 +
          -l  ignore overlapping reads [false]
 +
          -t  MAPQ cutoff for alignments [0]
 +
          -p  ploidy [2]
 +
          -s  sample ID
 +
          -r  reference sequence fasta file []
 +
          -o  output VCF file [-]
 +
          -z  ignore MD tags [0]
 +
          -d  debug [0]
 +
          -I  file containing list of intervals []
 +
          -i  intervals []
 +
          -?  displays help
 +
 
 +
</div>
 +
</div>
 +
 
 +
=== Merge candidate variants ===
 +
 
 +
 
 +
Merge candidate variants across samples.  Each VCF file is required to have the FORMAT flags E and N and should have exactly one sample.
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #merge candidate variants from VCFs in candidate.txt and output in candidate.sites.vcf
 +
  vt merge_candidate_variants candidates.txt -o candidate.sites.vcf
 +
<div class="mw-collapsible-content">
 +
  usage : vt merge_candidate_variants [options]
 +
 
 +
  options : -L  file containing list of input VCF files
 +
            -o  output VCF file [-]
 +
            -I  file containing list of intervals []
 +
            -i  intervals
 +
            --  ignores the rest of the labeled arguments following this flag
 +
            -h  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Filter overlap ===
 +
 
 +
Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap.
 +
 
 +
<div class="mw-collapsible mw-collapsed">
 +
  #annotates variants that are overlapping 
 +
  vt filter_overlap in.vcf -r hs37d5.fa -o overlapped.tagged..vcf
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt filter_overlap [options] <in.vcf>
 +
 
 +
  options : -o  output VCF file [-]
 +
            -w  window overlap for variants [0]
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
<div class="mw-collapsible mw-collapsed">
 +
  #Use Remove overlap instead for versions older than Jan 12, 2017
 +
  vt remove_overlap in.vcf -r hs37d5.fa -o overlapped.tagged..vcf
 +
 
 +
<div class="mw-collapsible-content">
 +
    usage: vt remove_overlap [options] <in.vcf>
 +
    The old version has the same options except that it lacks the -w option
 +
    The change occurred in the following commit:
 +
    https://github.com/atks/vt/commit/ab5cf7e91b3baa5349f439e6fe92491ae19da1a6
 +
</div>
 +
</div>
 +
 
 +
=== Annotate Indels ===
 +
 
 +
Annotates indels with VNTR information and adds a VNTR record.  Facilitates the simultaneous calling of VNTR together with Indels and SNPs.
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #annotates indels from VCFs with VNTR information.
 +
  vt annotate_indels in.vcf -r hs37d5.fa -o annotated.sites.vcf
 +
 
 +
<div style="height:20em; overflow:auto; border: 2px solid #FFF">
 +
  CHROM  POS    ID      REF    ALT    QUAL    FILTER  INFO
 +
  20      82079  .      G      A      1255.98 .      NSAMPLES=1;E=43;N=51;ESUM=43;NSUM=51;FLANKSEQ=GGAGCACGCC[G/A]CCATGCCCGG
 +
  20      82217  .      G      A      1632.77 .      NSAMPLES=1;E=56;N=61;ESUM=56;NSUM=61;FLANKSEQ=GAGCCACCGC[G/A]CCCGGCCCAG
 +
  20      83250  .      CTGTGTGTG      C      .      .      NSAMPLES=1;E=18;N=35;ESUM=18;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT
 +
  20      83250  .      CTGTGTGTGTG    C      .      .      NSAMPLES=1;E=3;N=35;ESUM=3;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT
 +
  20      83251  .      TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG    <VNTR>  .      .      MOTIF=GT;RU=TG;FZ_CONCORDANCE=1;FZ_RL=52;FZ_LL=0;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FZ_RU_COUNTS=26,26;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG]TTTAGTATTT
 +
  20      83252  .      G      C      359.204 .      NSAMPLES=1;E=13;N=14;ESUM=13;NSUM=14;FLANKSEQ=CTCTCTCTCT[G/C]TGTGTGTGTG
 +
  20      83260  .      G      C      500.163 .      NSAMPLES=1;E=18;N=34;ESUM=18;NSUM=34;FLANKSEQ=CTGTGTGTGT[G/C]TGTGTGTGTG
 +
  20      83267  .      T      C      247.043 .      NSAMPLES=1;E=11;N=43;ESUM=11;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT
 +
  20      83275  .      T      C      609.669 .      NSAMPLES=1;E=24;N=43;ESUM=24;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT
 +
  20      90008  .      C      A      1546.88 .      NSAMPLES=1;E=52;N=60;ESUM=52;NSUM=60;FLANKSEQ=AACAGAAAAC[C/A]AAATACTGTA
 +
  20      91088  .      C      T      1766.04 .      NSAMPLES=1;E=58;N=66;ESUM=58;NSUM=66;FLANKSEQ=CCCAGCATAC[C/T]ATGGTTGTGC
 +
  20      91508  .      G      A      1266.93 .      NSAMPLES=1;E=44;N=53;ESUM=44;NSUM=53;FLANKSEQ=AATTAGTAAG[G/A]CTTACGTAAG
 +
  20      91707  .      C      T      888.134 .      NSAMPLES=1;E=30;N=53;ESUM=30;NSUM=53;FLANKSEQ=TGATTTTCTA[C/T]AGCAGGACCT
 +
  20      92527  .      A      G      828.593 .      NSAMPLES=1;E=34;N=40;ESUM=34;NSUM=40;FLANKSEQ=ATTAATTGCC[A/G]TTCTCTCTTT
 +
  20      93440  .      A      G      688.144 .      NSAMPLES=1;E=24;N=58;ESUM=24;NSUM=58;FLANKSEQ=TTGGATGCAT[A/G]GTCTGTAAAT
 +
  20      93636  .      TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT    <VNTR>  .      .      MOTIF=T;RU=T;FZ_CONCORDANCE=0.939394;FZ_RL=35;FZ_LL=0;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FZ_RU_COUNTS=31,33;FLANKSEQ=TCTAGGATTC[TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT
 +
  20      93646  .      C      CT      .      .      NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FLANKSEQ=TTTTTCTTTC[TTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT;GMOTIF=T;TR=20:93636:TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT:<VNTR>:T
 +
  20      93717  .      A      T      31.7622 .      NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKSEQ=CAGTGGCGTG[A/T]TCTTAGATCA
 +
  20      93931  .      G      A      628.149 .      NSAMPLES=1;E=22;N=53;ESUM=22;NSUM=53;FLANKSEQ=GATTACAGGT[G/A]TGAGCCGCTG
 +
  20      100699  .      C      T      809.09  .      NSAMPLES=1;E=28;N=61;ESUM=28;NSUM=61;FLANKSEQ=GGTGAAAAAT[C/T]ACCTGTCAGT
 +
  20      101362  .      G      A      1087.13 .      NSAMPLES=1;E=36;N=67;ESUM=36;NSUM=67;FLANKSEQ=TAATACTGAA[G/A]TTTACTTCTC
 +
 
 +
</div>
 +
 
 +
  The following shows the trace of how the algorithm works
 +
 
 +
    ============================================
 +
    ANNOTATING INDEL FUZZILY
 +
    ********************************************
 +
    EXTRACTIING REGION BY EXACT LEFT AND RIGHT ALIGNMENT
 +
   
 +
    20:131948:C/CCA
 +
    EXACT REGION 131948-131965 (18)
 +
                CCACACACACACACACAA
 +
    FINAL EXACT REGION 131948-131965 (18)
 +
                      CCACACACACACACACAA
 +
    ********************************************
 +
    PICK CANDIDATE MOTIFS
 +
   
 +
    Longest Allele : C[CA]CACACACACACACACAA
 +
    detecting motifs for an str
 +
    seq: CCACACACACACACACACAA
 +
    len : 20
 +
    cmax_len : 10
 +
    candidate motifs: 25
 +
    AC : 0.894737 2 0
 +
    AAC : 0.5 3 0.0555556
 +
    ACC : 0.5 3 0.0555556
 +
    AAAC : 0.0588235 4 0.125 (< 2 copies)
 +
    ACCC : 0.0588235 4 0.125 (< 2 copies)
 +
    AACAC : 0.5 5 0.02
 +
    ACACC : 0.5 5 0.02
 +
    AAACAC : 0.0666667 6 0.0555556 (< 2 copies)
 +
    ACACCC : 0.0666667 6 0.0555556 (< 2 copies)
 +
    AACACAC : 0.5 7 0.0102041
 +
    ACACACC : 0.5 7 0.0102041
 +
    AAACACAC : 0.0769231 8 0.03125 (< 2 copies)
 +
    ACACACCC : 0.0769231 8 0.03125 (< 2 copies)
 +
    AACACACAC : 0.5 9 0.00617284 (< 2 copies)
 +
    ACACACACC : 0.5 9 0.00617284 (< 2 copies)
 +
    AAACACACAC : 0.0909091 10 0.02 (< 2 copies)
 +
    ACACACACCC : 0.0909091 10 0.02 (< 2 copies)
 +
    ********************************************
 +
    PICKING NEXT BEST MOTIF
 +
   
 +
    selected:        AC 0.89 0.00
 +
    ********************************************
 +
    DETECTING REPEAT TRACT FUZZILY
 +
    ++++++++++++++++++++++++++++++++++++++++++++
 +
    Exact left/right alignment
 +
   
 +
    repeat_tract              : CACACACACACACACA
 +
    position                  : [131949,131964]
 +
    motif_concordance        : 1
 +
    repeat units              : 8
 +
    exact repeat units        : 8
 +
    total no. of repeat units : 8
 +
   
 +
    ++++++++++++++++++++++++++++++++++++++++++++
 +
    Fuzzy right alignment
 +
   
 +
    repeat motif : CA
 +
    rflank      : AACTC
 +
    mlen        : 2
 +
    rflen        : 5
 +
    plen        : 111
 +
   
 +
    read        : AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACACCACACACACACACACAAACTC
 +
    rlen        : 106
 +
   
 +
    optimal score: 50.5073
 +
    optimal state: MR
 +
    optimal track: MR|r|0|5
 +
    optimal probe len: 25
 +
    optimal path length : 107
 +
    max j: 106
 +
    probe: (1~82) [1~10] (1~5)
 +
    read : (1~82) [83~101] (102~106)
 +
   
 +
    motif #          : 10 [83,101]
 +
    motif concordance : 95% (9/10)
 +
    motif discordance : 0|1|0|0|0|0|0|0|0|0
 +
   
 +
    Model:  ----------------------------------------------------------------------------------CACACACACACACACACACAAACTC
 +
          SYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYMMMDMMMMMMMMMMMMMMMMMMMMME
 +
                                                                                              oo++oo++oo++oo++oo++RRRRR
 +
    Read:  AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACAC-CACACACACACACACAAACTC
 +
   
 +
    ++++++++++++++++++++++++++++++++++++++++++++
 +
    Fuzzy left alignment
 +
   
 +
    lflank      : ATCTTA
 +
    repeat motif : CA
 +
    lflen        : 6
 +
    mlen        : 2
 +
    plen        : 111
 +
   
 +
    read        : ATCTTACACCACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT
 +
    rlen        : 105
 +
   
 +
    optimal score: 50.5858
 +
    optimal state: Z
 +
    optimal track: Z|m|10|2
 +
    optimal probe len: 26
 +
    optimal path length : 106
 +
    max j: 105
 +
    mismatch penalty: 3
 +
   
 +
    model: (1~6) [1~10]
 +
    read : (1~6) [7~25][26~106]
 +
   
 +
    motif #          : 10 [7,25]
 +
    motif concordance : 95% (9/10)
 +
    motif discordance : 0|1|0|0|0|0|0|0|0|0
 +
   
 +
    Model:  ATCTTACACACACACACACACACACA--------------------------------------------------------------------------------
 +
          SMMMMMMMMMDMMMMMMMMMMMMMMMMZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZE
 +
            LLLLLLoo++oo++oo++oo++oo++                                                                               
 +
    Read:  ATCTTACAC-CACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT
 +
   
 +
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 +
    VNTR Summary
 +
    rid          : 19
 +
    motif        : AC
 +
    ru          : CA
 +
   
 +
    Exact
 +
    repeat_tract                    : CACACACACACACACA
 +
    position                        : [131949,131964]
 +
    reference repeat unit length    : 8
 +
    motif_concordance              : 1
 +
    repeat units                    : 8
 +
    exact repeat units              : 8
 +
    total no. of repeat units      : 8
 +
   
 +
    Fuzzy
 +
    repeat_tract                    : CACCACACACACACACACA
 +
    position                        : [131946,131964]
 +
    reference repeat unit length    : 19
 +
    motif_concordance              : 0.95
 +
    repeat units                    : 19
 +
    exact repeat units              : 9
 +
    total no. of repeat units      : 10
 +
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt annotate_indels [options] <in.vcf>
 +
 
 +
  options : -v  add vntr record [false]
 +
            -x  override tags [false]
 +
            -f  filter expression []
 +
            -d  debug [false]
 +
            -m  mode [f]
 +
                e : by exact alignment              f : by fuzzy alignment
 +
            -c  classification schemas of tandem repeat [6]
 +
                1 : lai2003   
 +
                2 : kelkar2008 
 +
                3 : fondon2012 
 +
                4 : ananda2013 
 +
                5 : willems2014
 +
                6 : tan_kang2015
 +
            -a  annotation type [v]
 +
                v : a. output VNTR variant (defined by classification).
 +
                      RU                    repeat unit on reference sequence (CA)
 +
                      MOTIF                canonical representation (AC)
 +
                      RL                    repeat tract length in bases (11)
 +
                      FLANKS                flanking positions of repeat tract determined by exact alignment
 +
                      RU_COUNTS            number of exact repeat units and total number of repeat units in
 +
                                            repeat tract determined by exact alignment
 +
                      FZ_RL                fuzzy repeat tract length in bases (11)
 +
                      FZ_FLANKS            flanking positions of repeat tract determined by fuzzy alignment
 +
                      FZ_RU_COUNTS          number of exact repeat units and total number of repeat units in
 +
                                            repeat tract determined by fuzzy alignment
 +
                      FLANKSEQ              flanking sequence of indel
 +
                      LARGE_REPEAT_REGION  repeat region exceeding 2000bp
 +
                    b. mark indels with overlapping VNTR.
 +
                      FLANKS      flanking positions of repeat tract determined by exact alignment
 +
                      FZ_FLANKS    flanking positions of repeat tract determined by fuzzy alignment
 +
                      GMOTIF      generating motif used in fuzzy alignment
 +
                      TR    position and alleles of VNTR (20:23413:CACACACACAC:<VNTR>)
 +
                a : annotate each indel with RU, RL, MOTIF, REF.
 +
            -r  reference sequence fasta file []
 +
            -o  output VCF file [-]
 +
            -I  file containing list of intervals []
 +
            -i  intervals
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Construct Probes ===
 +
 
 +
 
 +
Construct probes for genotyping a variant.
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #construct probes from candidate.sites.bcf and output to standard out
 +
  vt construct_probes candidates.sites.bcf -r ref.fa
 +
<div class="mw-collapsible-content">
 +
  usage : vt construct_probes [options] <in.vcf>
 +
 
 +
  options : -o  output VCF file [-]
 +
            -f  minimum flank length [20]
 +
            -r  reference sequence fasta file []
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            --  ignores the rest of the labeled arguments following this flag
 +
            -h  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Genotype ===
 +
 
 +
Genotypes variants for each sample.
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #genotypes variants found in candidate.sites.vcf from sample.bam
 +
  vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf
 +
<div class="mw-collapsible-content">
 +
  usage : vt genotype [options]
 +
 
 +
  options : -r  reference sequence fasta file []
 +
            -s  sample ID []
 +
            -o  output VCF file [-]
 +
            -b  input BAM file []
 +
            -i  input candidate VCF file []
 +
            --  ignores the rest of the labeled arguments following this flag
 +
            -h  displays help
 +
</div>
 +
</div>
 +
 
 +
= Pedigree File =
 +
 
 +
  vt understands an augmented version introduced by [mailto:hmkang@umich.edu Hyun] of the PED described by [http://zzz.bwh.harvard.edu/plink/data.shtml#ped plink].
 +
  The pedigree file format is as follows with the following mandatory fields:
 +
       
 +
{| class="wikitable"
 +
|-
 +
! scope="col"| Field
 +
! scope="col"| Description
 +
! scope="col"| Valid Values
 +
! scope="col"| Missing Values
 +
|-
 +
|Family ID<br>
 +
Individual ID<br>
 +
Paternal ID<br>
 +
Maternal ID<br>
 +
Sex<br>
 +
Phenotype
 +
|ID of this family <br>
 +
ID(s) of this individual (comma separated) <br>
 +
ID of the father <br>
 +
ID of the mother <br>
 +
Sex of the individual<br>
 +
Phenotype
 +
|[A-Za-z0-9_]+<br>
 +
[A-Za-z0-9_]+(,[A-Za-z0-9_]+)* <br>
 +
[A-Za-z0-9_]+ <br>
 +
[A-Za-z0-9_]+<br>
 +
1=male, 2=female, other, male, female<br>
 +
[A-Za-z0-9_]+
 +
|  0 <br>
 +
cannot be missing <br>
 +
0 <br>
 +
0 <br>
 +
other<br>
 +
-9
 +
|}
 +
 
 +
  Examples:   
 +
 
 +
    ceu      NA12878    NA12891    NA12892    female    -9
 +
    yri      NA19240    NA19239    NA19238    female    -9
 +
 
 +
    ceu      NA12878    NA12891    NA12892    2    -9
 +
    yri      NA19240    NA19239    NA19238    2    -9
 +
 
 +
    #allows tools like profile_mendelian to detect duplicates and check for concordance
 +
    ceu      NA12878,NA12878A    NA12891    NA12892    female  case
 +
    yri      NA19240            NA19239    NA19238    female  control
 +
 
 +
    #allows tools like profile_mendelian to detect duplicates and check for concordance
 +
    ceu      NA12412    0  0    female  case
 +
    yri      NA19650    0  0    female  control
 +
 
 +
= Resource Bundle =
 +
 
 +
== GRCh37 ==
 +
 
 +
Files are based on hs37d5.fa made by Heng Li.
 +
 
 +
* External : [ftp://share.sph.umich.edu/vt/grch37 GRCh37 resource bundle]
 +
* Internal : /net/fantasia/home/atks/ref/vt/grch37
 +
 
 +
Read here for [ftp://share.sph.umich.edu/vt/grch37/readme.txt contents].
 +
 
 +
== GRCh38 ==
 +
 
 +
Files are based on [https://github.com/lh3/bwa/blob/master/README-alt.md hs38DH.fa] made by Heng Li.
 +
Note that many of the references are simply lifted over from GRCh37 using Picard's liftover tool with the default options.
 +
 
 +
* External : [ftp://share.sph.umich.edu/vt/grch38 GRCh38 resource bundle]
 +
* Internal : /net/fantasia/home/atks/ref/vt/grch38
 +
 
 +
Read here for [ftp://share.sph.umich.edu/vt/grch38/readme.txt contents].
 +
 
 +
= FAQ =
 +
 
 +
==1. vt cannot retrieve sequences from my reference sequence file ==
 +
 
 +
  It is common to use reference files based on the UCSC browser's database and from the Genome Reference Consortium.
 +
  For example, HG19 vs Grch37.  The key difference is that chromosome 1 is represented as chr1 and 1 respectively in the
 +
  FASTA files from these 2 sources.  Just use the appropriate FASTA file that was used to generate your VCF file originally.
   −
Merges duplicate variants by position with the option of considering alleles. (This just discards the duplicate variant that appears later in the VCF file)
+
  Another common issue is due to the corruption of the index file of the reference sequence; say for a reference file named
 +
  hs37d5.fa or hs37d5.fa.gz, simply delete the index file denoted by hs37d5.fa.fai or hs37d5.fa.gz.fai and run the vt command
 +
  again.  A new index file will be generated automatically.
   −
  Options:
+
= How to cite vt? =
  -i,  --input-vcf <string>  : Input VCF file
  −
  -o,  --output-vcf <string> : Output VCF file [-]
  −
  -p,  --merge-by-position  : Merge by position [false]
     −
  Example:
+
If you use normalize: <br>
  e.g. vt merge_duplicate_variants -i 8904indels.dups.genotypes.vcf -o out.vcf
+
[http://bioinformatics.oxfordjournals.org/content/31/13/2202 Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. Unified Representation of Genetic Variants. Bioinformatics (2015) 31(13): 2202-2204]
  e.g. vt merge_duplicate_variants -p -i 8904indels.dups.genotypes.vcf -o out.vcf
      
= Maintained by =
 
= Maintained by =
    
This page is maintained by  [mailto:atks@umich.edu Adrian]
 
This page is maintained by  [mailto:atks@umich.edu Adrian]
1,102

edits

Navigation menu