Difference between revisions of "Vt"
(129 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
= Introduction = | = Introduction = | ||
− | vt is a variant tool set that discovers short variants from Next Generation Sequencing data | + | vt is a variant tool set that discovers short variants from Next Generation Sequencing data. |
= Installation = | = Installation = | ||
+ | |||
+ | == General == | ||
The source files are housed in github. [https://github.com/samtools/htslib htslib] is | The source files are housed in github. [https://github.com/samtools/htslib htslib] is | ||
Line 15: | Line 17: | ||
#change directory to vt | #change directory to vt | ||
2. cd vt <br> | 2. cd vt <br> | ||
+ | #update submodules | ||
+ | 3. git submodule update --init --recursive <br> | ||
#run make, note that compilers need to support the c++0x standard | #run make, note that compilers need to support the c++0x standard | ||
− | + | 4. make <br> | |
− | #you can test the build ( | + | #you can test the build |
− | + | 5. make test | |
+ | <div class=" mw-collapsible mw-collapsed"> | ||
+ | An expected output when all is well for the tests is shown here. (click expand =>) | ||
+ | <div class="mw-collapsible-content"> | ||
+ | user@server:~/vt$ make test | ||
+ | test/test.sh | ||
+ | ++++++++++++++++++++++ | ||
+ | Tests for vt normalize | ||
+ | ++++++++++++++++++++++ | ||
+ | testing normalize | ||
+ | output VCF file : ok | ||
+ | output logs : ok | ||
+ | +++++++++++++++++++++++++++++++ | ||
+ | Tests for vt decompose_blocksub | ||
+ | +++++++++++++++++++++++++++++++ | ||
+ | testing decompose_blocksub of even-length blocks | ||
+ | output VCF file : ok | ||
+ | output logs : ok | ||
+ | testing decompose_blocksub with alignment | ||
+ | output VCF file : ok | ||
+ | output logs : ok | ||
+ | testing decompose_blocksub of phased even-length blocks | ||
+ | output VCF file : ok | ||
+ | output logs : ok | ||
+ | ++++++++++++++++++++++ | ||
+ | Tests for vt decompose | ||
+ | ++++++++++++++++++++++ | ||
+ | testing decompose for a triallelic variant | ||
+ | output VCF file : ok | ||
+ | output logs : ok <br> | ||
+ | Passed tests : 5 / 5 | ||
+ | </div> | ||
+ | </div> | ||
− | Building has been tested on Linux and Mac systems on gcc 4.8.1 and clang 3.4. | + | === Mac === |
+ | |||
+ | You may install vt via homebrew. | ||
+ | |||
+ | brew tap brewsci/bio | ||
+ | brew tap brewsci/science | ||
+ | |||
+ | brew install brewsci/bio/vt | ||
+ | |||
+ | |||
+ | Building has been tested on Linux and Mac systems on gcc 4.8.1 and clang 3.4. <br> | ||
+ | Some features of C++11 are used, thus there is a need for newer versions of gcc and clang. | ||
= Updating = | = Updating = | ||
Line 34: | Line 81: | ||
3. make -j 40 | 3. make -j 40 | ||
− | = General Features = | + | = General Features and notes = |
== Common options == | == Common options == | ||
Line 43: | Line 90: | ||
-o defines the out file which and has the STDOUT set as the default. | -o defines the out file which and has the STDOUT set as the default. | ||
+ | vt recognizes the appropriate output by file extension. | ||
+ | <name>.vcf - uncompressed VCF | ||
+ | <name>.vcf.gz - compressed VCF | ||
+ | <name>.bcf - BCF | ||
You may modify the STDOUT to output the binary version of the format. Uncompressed | You may modify the STDOUT to output the binary version of the format. Uncompressed | ||
− | VCF and BCF streams are indicated by - and + respectively. | + | VCF and BCF streams are indicated by - and + respectively. |
-f filter expression | -f filter expression | ||
Line 74: | Line 125: | ||
This allows you to only analyse biallelic indels that are passed on chromosome 20. | This allows you to only analyse biallelic indels that are passed on chromosome 20. | ||
vt profile_na12878 vt.bcf -g na12878.reference.txt -r genome.fa -f "N_ALLELE==2&&VTYPE==INDEL&&PASS" -i 20 | vt profile_na12878 vt.bcf -g na12878.reference.txt -r genome.fa -f "N_ALLELE==2&&VTYPE==INDEL&&PASS" -i 20 | ||
+ | |||
+ | This allows you to extract biallelic indels that are passed on chromosome 20. | ||
+ | vt view vt.bcf -f "N_ALLELE==2&&VTYPE==INDEL&&PASS" -i 20 | ||
Other examples of filters | Other examples of filters | ||
Line 85: | Line 139: | ||
Variant characteristics | Variant characteristics | ||
− | VTYPE,N_ALLELE,DLEN,LEN | + | VTYPE,N_ALLELE,DLEN,LEN,VARIANT_CONTAINS_N |
Variant value types | Variant value types | ||
SNP,MNP,INDEL,CLUMPED | SNP,MNP,INDEL,CLUMPED | ||
− | Biallelic SNPs only | + | Biallelic SNPs only : VTYPE==SNP&&N_ALLELE==2 |
− | Biallelic Indels with embedded SNP | + | Biallelic Indels with embedded SNP : VTYPE==(SNP|INDEL)&&N_ALLELE==2 |
− | Biallelic variants involving insertions | + | Biallelic variants involving insertions : VTYPE&INDEL&&DLEN>0&&N_ALLELE==2 |
− | Biallelic variants involving 1bp variants : LEN==1&&N_ALLELE==2 | + | Biallelic variants involving 1bp variants : LEN==1&&N_ALLELE==2 |
+ | Variants with explicit sequences with no Ns : ~VARIANT_CONTAINS_N | ||
+ | |||
+ | REF field | ||
+ | REF | ||
+ | |||
+ | ALT field | ||
+ | ALT | ||
QUAL field | QUAL field | ||
Line 103: | Line 164: | ||
INFO fields | INFO fields | ||
INFO.<tag> | INFO.<tag> | ||
− | + | ||
+ | A/C SNPs : REF=='A' && ALT=='C' | ||
+ | AC type of STRs : REF=~'^.(AC)+$' || ALT=~'^.(AC)+$' | ||
Passed biallelic SNPs only : PASS&&VTYPE==SNP&&N_ALLELE==2 | Passed biallelic SNPs only : PASS&&VTYPE==SNP&&N_ALLELE==2 | ||
Passed Common biallelic SNPs only : PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005 | Passed Common biallelic SNPs only : PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005 | ||
Line 169: | Line 232: | ||
I am trying to make vt handle [http://genome.sph.umich.edu/wiki/Relationship_between_Ploidy,_Alleles_and_Genotypes general cases of ploidy and alleles]. | I am trying to make vt handle [http://genome.sph.umich.edu/wiki/Relationship_between_Ploidy,_Alleles_and_Genotypes general cases of ploidy and alleles]. | ||
Please let me know if that is lacking in a tool that you are using. | Please let me know if that is lacking in a tool that you are using. | ||
+ | |||
+ | == BCF Compression Levels vs Compression Time == | ||
+ | |||
+ | The zlib deflation algorithm (a variant of LZ77) has 10 levels - 0 to 9. 0 has no compression but instead wraps <br> | ||
+ | up the file in zlib or bgzf blocks. It may be useful to have 0 compression as it is indexable with the same mechanism <br> | ||
+ | used for compressed files. Levels 1-9 denote an increasing compression level in exchange for longer times for <br> | ||
+ | compression. | ||
+ | |||
+ | In general, zlib compression does not have significant differences in compression for BCF files between the 9 compression <br> | ||
+ | levels as shown in the following table: | ||
+ | |||
+ | {| class="wikitable" | ||
+ | |- | ||
+ | ! scope="col"| Compression Level | ||
+ | ! scope="col"| Size | ||
+ | ! scope="col"| Time | ||
+ | |- | ||
+ | |0<br> | ||
+ | 1<br> | ||
+ | 2<br> | ||
+ | 3<br> | ||
+ | 4<br> | ||
+ | 5<br> | ||
+ | 6 (default) <br> | ||
+ | 7<br> | ||
+ | 8<br> | ||
+ | 9 | ||
+ | |153GB <br> | ||
+ | 98.4GB <br> | ||
+ | 98.0GB <br> | ||
+ | 97.5GB <br> | ||
+ | 95.5GB <br> | ||
+ | 95.2GB <br> | ||
+ | 94.9GB <br> | ||
+ | 94.8GB <br> | ||
+ | 94.76GB <br> | ||
+ | 94.75GB | ||
+ | |45m<br> | ||
+ | 2h3m <br> | ||
+ | 2h7m <br> | ||
+ | 2h12m <br> | ||
+ | 2h26m <br> | ||
+ | 2h54m <br> | ||
+ | 3h19m <br> | ||
+ | 3h41m <br> | ||
+ | 4h5m <br> | ||
+ | 4h25m | ||
+ | |} | ||
+ | |||
+ | <span style="color:#0000FF">So, it might be a good idea to compress at lower levels when dealing with large temporary <br> | ||
+ | files in a pipeline to save compute time. This can be achieved with the -c option in vt view</span> | ||
= VCF Manipulation = | = VCF Manipulation = | ||
Line 180: | Line 294: | ||
#views mills.bcf and outputs to standard out | #views mills.bcf and outputs to standard out | ||
vt view -h mills.bcf | vt view -h mills.bcf | ||
− | #views mills.bcf and locally sorts it in a 10000bp window and outputs to | + | |
− | vt view -h - | + | #views mills.bcf and locally sorts it in a 10000bp window and outputs to sorted-millsbcf |
+ | vt view -h -w 10000 mills.bcf -o sorted-mills.bcf | ||
+ | |||
+ | #views mills.bcf and outputs to c1-mills.bcf with a compression level of 1. By default, | ||
+ | #the compression level is 6 where lower levels compress the file less but are faster. | ||
+ | #The difference in compression for BCF files between level 1 to level 9 is about 5% of | ||
+ | #of a level 1 compression file. The difference in time taken is about an additional 50% | ||
+ | #of a level 1 compression. The levels range from 0 to 9 where 0 means no compression | ||
+ | #but the file is encapsulated in bgzf blocks that allows the file to be indexed. A special | ||
+ | #level -1 denotes an uncompressed BCF file that is not encapsulated in bgzf blocks and | ||
+ | #are thus not indexable but are highly suitable for streaming between vt commands. | ||
+ | vt view -h mills.bcf -c 1 -o c1-mills.bcf | ||
+ | |||
+ | #views mills.bcf and selects variants that overlap with the regions found in dust.bed from chromosome 20 | ||
+ | #the -t option selects variants by checking if each variant overlaps with the regions in the bed file, this is | ||
+ | #as opposed to random accessing the variants via the index through the intervals defined in -i and -I options. | ||
+ | #this is useful when selecting variants from the target regions from an exome sequencing experiment. | ||
+ | vt view 10000 mills.bcf -t dust.bed -i 20 | ||
<div class="mw-collapsible-content"> | <div class="mw-collapsible-content"> | ||
Line 187: | Line 318: | ||
options : -o output VCF/VCF.GZ/BCF file [-] | options : -o output VCF/VCF.GZ/BCF file [-] | ||
+ | -f filter expression [] | ||
-w local sorting window size [0] | -w local sorting window size [0] | ||
-s print site information only without genotypes [false] | -s print site information only without genotypes [false] | ||
− | -h | + | -H print header only, this option is honored only for STDOUT [false] |
+ | -h omit header, this option is honored only for STDOUT [false] | ||
-p print options and summary [] | -p print options and summary [] | ||
+ | -r right window size for overlap [] | ||
+ | -l left window size for overlap [] | ||
+ | -c compression level 0-9, 0 and -1 denotes uncompressed with the former being wrapped in bgzf. [6] | ||
+ | -t bed file for variant selection via streaming [] | ||
-I file containing list of intervals [] | -I file containing list of intervals [] | ||
-i intervals [] | -i intervals [] | ||
Line 198: | Line 335: | ||
=== Index === | === Index === | ||
− | |||
Indexes a VCF.GZ or BCF file. | Indexes a VCF.GZ or BCF file. | ||
Line 275: | Line 411: | ||
#normalize variants, send to standard out and remove duplicates. | #normalize variants, send to standard out and remove duplicates. | ||
vt normalize dbsnp.vcf -r seq.fa | vt uniq - -o dbsnp.normalized.uniq.vcf | vt normalize dbsnp.vcf -r seq.fa | vt uniq - -o dbsnp.normalized.uniq.vcf | ||
+ | |||
+ | #read in variants that do not contain N in the explicit alleles, normalize variants, send to standard out. | ||
+ | vt normalize dbsnp.vcf -r seq.fa -f "~VARIANT_CONTAINS_N" | ||
#variants that are normalized will be annotated with an OLD_VARIANT info tag. | #variants that are normalized will be annotated with an OLD_VARIANT info tag. | ||
Line 306: | Line 445: | ||
-d debug [false] | -d debug [false] | ||
-q do not print options and summary [false] | -q do not print options and summary [false] | ||
− | -n | + | -m warns but does not exit when REF is inconsistent |
+ | with masked reference sequence for non SNPs. | ||
+ | This overides the -n option [false] | ||
+ | -n warns but does not exit when REF is inconsistent | ||
+ | with reference sequence for non SNPs [false] | ||
-w window size for local sorting of variants [10000] | -w window size for local sorting of variants [10000] | ||
-I file containing list of intervals [] | -I file containing list of intervals [] | ||
Line 321: | Line 464: | ||
There is now an additional option -a which decomposes non block substitutions into its constituent SNPs and indels. (kindly added by [[https://github.com/holtgrewe holtgrewe@github]]) <br> | There is now an additional option -a which decomposes non block substitutions into its constituent SNPs and indels. (kindly added by [[https://github.com/holtgrewe holtgrewe@github]]) <br> | ||
There is no exact solution and this decomposition is based on the best guess outcome using a Needleman-Wunsch algorithm. <br> | There is no exact solution and this decomposition is based on the best guess outcome using a Needleman-Wunsch algorithm. <br> | ||
− | + | You might also want to check out [https://github.com/vcflib/vcflib#vcfallelicprimitives vcfallelicprimitives]. <br> | |
+ | <br> | ||
+ | There is now an additional option -m and -d which ensures that some MNVs are not decomposed. (kindly added by [[https://github.com/jaudoux jaudoux@github]]) <br> | ||
+ | The motivation is from<br> | ||
+ | *Exome-wide assessment of the functional impact and pathogenicity of multi-nucleotide mutations https://www.biorxiv.org/content/10.1101/258723v2.full<br> | ||
+ | *Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes https://www.biorxiv.org/content/10.1101/573378v2.full<br> | ||
</div> | </div> | ||
Line 344: | Line 492: | ||
20 763837 . C T 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1 | 20 763837 . C T 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1 | ||
20 763838 . G GA 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1 | 20 763838 . G GA 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1 | ||
+ | |||
+ | #decomposes biallelic clumped variant and write out to decomposed_blocksub.vcf and add phase set information in the genotype fields | ||
+ | vt decompose_blocksub -p gatk.vcf -o decomposed_blocksub.vcf <br> | ||
+ | #before decomposition | ||
+ | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tumor normal | ||
+ | 1 159030 . TAACCTTTC TGACCTTTT 0.04 . AF=0.5 GT 0/0 1/1 <br> | ||
+ | #after decomposition | ||
+ | 1 159031 . A G 0.04 . AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT GT:PS 0|0:159031 1|1:159031 | ||
+ | 1 159038 . C T 0.04 . AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT GT:PS 0|0:159031 1|1:159031 | ||
<div class="mw-collapsible-content"> | <div class="mw-collapsible-content"> | ||
description : decomposes biallelic block substitutions into its constituent SNPs. <br> | description : decomposes biallelic block substitutions into its constituent SNPs. <br> | ||
usage : vt decompose_blocksub [options] <in.vcf> <br> | usage : vt decompose_blocksub [options] <in.vcf> <br> | ||
− | options : -a enable aggressive/alignment mode | + | options : -m keep MNVs (multi-nucleotide variants) [false] |
+ | -a enable aggressive/alignment mode [false] | ||
+ | -d MNVs max distance (when -m option is used) [2] | ||
-o output VCF file [-] | -o output VCF file [-] | ||
-I file containing list of intervals [] | -I file containing list of intervals [] | ||
-i intervals [] | -i intervals [] | ||
− | -? displays help | + | -? displays help-a enable aggressive/alignment mode |
+ | |||
</div> | </div> | ||
</div> | </div> | ||
Line 413: | Line 573: | ||
=== Drop duplicate variants === | === Drop duplicate variants === | ||
− | Drops duplicate variants that appear later in the file. <br> | + | Drops duplicate variants that appear later in the file. VCF file must be ordered. <br> |
If there are OLD_VARIANT tags in the INFO field, the variants in these tags are aggregated in the unique record retained. | If there are OLD_VARIANT tags in the INFO field, the variants in these tags are aggregated in the unique record retained. | ||
Line 458: | Line 618: | ||
=== Concatenate === | === Concatenate === | ||
− | + | Concatenates VCF files. Assumes individuals are in the same order and files share the same header. | |
<div class=" mw-collapsible mw-collapsed"> | <div class=" mw-collapsible mw-collapsed"> | ||
− | # | + | #concatenates chr1.mills.bcf and chr2.mills.bcf |
vt cat chr1.mills.bcf chr2.mills.bcf -o mills.bcf | vt cat chr1.mills.bcf chr2.mills.bcf -o mills.bcf | ||
+ | |||
+ | #concatenates chr1.mills.bcf and chr2.mills.bcf with the naive option. | ||
+ | #The naive option assumes that the headers are all the same and skips | ||
+ | #merging headers and translating encodings between BCF files. This is | ||
+ | #a much faster option if you know the nature of your BCF files in advance. | ||
+ | vt cat -n chr1.mills.bcf chr2.mills.bcf -o mills.bcf | ||
<div class="mw-collapsible-content"> | <div class="mw-collapsible-content"> | ||
Line 468: | Line 634: | ||
options : -s print site information only without genotypes [false] | options : -s print site information only without genotypes [false] | ||
− | -p print options and summary [] | + | -p print options and summary [false] |
+ | -n naive, assumes that headers are the same. [false] | ||
+ | -w local sorting window size [0] | ||
+ | -f filter expression [] | ||
-L file containing list of input VCF files | -L file containing list of input VCF files | ||
-o output VCF file [-] | -o output VCF file [-] | ||
Line 494: | Line 663: | ||
-i intervals [] | -i intervals [] | ||
-? displays help | -? displays help | ||
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | === Filter === | ||
+ | |||
+ | Filters variants in a VCF file | ||
+ | |||
+ | <div class=" mw-collapsible mw-collapsed"> | ||
+ | #adds a filter tag "refA" for variants where the REF column is a A sequence. | ||
+ | vt filter in.bcf -f "REF=='A'" -d "refA" | ||
+ | |||
+ | <div class="mw-collapsible-content"> | ||
+ | usage : vt filter [options] <in.vcf> <br> | ||
+ | options : -x clear filter [false] | ||
+ | -f filter expression [] | ||
+ | -d filter tag description [] | ||
+ | -t filter tag [] | ||
+ | -o output VCF file [-] | ||
+ | -I file containing list of intervals [] | ||
+ | -i intervals | ||
+ | -? displays help </div> | ||
+ | </div> | ||
+ | |||
+ | === Filter overlap === | ||
+ | |||
+ | Tags overlapping variants in a VCF file with the FILTER flag overlap. | ||
+ | |||
+ | <div class=" mw-collapsible mw-collapsed"> | ||
+ | #adds a filter tag "overlap" for overlapping variants within a window size of 1 based on the REF sequence. | ||
+ | vt filter_overlap in.bcf -w 1 out.bcf | ||
+ | |||
+ | todo: option for considering END info tag for detecting overlaps. | ||
+ | |||
+ | <div class="mw-collapsible-content"> | ||
+ | usage : vt filter_overlap [options] <in.vcf> | ||
+ | |||
+ | options : -o output VCF file [-] | ||
+ | -w window overlap for variants [0] | ||
+ | -I file containing list of intervals [] | ||
+ | -i intervals [] | ||
+ | -? displays help | ||
</div> | </div> | ||
</div> | </div> | ||
Line 514: | Line 724: | ||
-i intervals [] | -i intervals [] | ||
-r reference sequence fasta file [] | -r reference sequence fasta file [] | ||
+ | -? displays help | ||
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | === Extract INFO fields to a tab delimited file === | ||
+ | |||
+ | Converts a VCF file and its shared information in the INFO field to a tab delimited file for further analysis. | ||
+ | |||
+ | <div class=" mw-collapsible mw-collapsed"> | ||
+ | #converts in.bcf to tab format with selected INFO and FILTER fields | ||
+ | vt info2tab in.bcf -u PASS -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT | ||
+ | <div style="height:6em; overflow:auto; border: 2px solid #FFF"> | ||
+ | INPUT | ||
+ | ===== | ||
+ | 20 17548608 . A AC . PASS CENTERS=vbi;NCENTERS=1;OLD_MULTIALLELIC=20:17548598:GAAAAAAAAAAAAA/GAAAAAAAAAAAA/GAAAAAAAAAAAAAA/GAAAAAAAAAA/GAAAAAAAAAAA/GAAAAAAAAAACAAA;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAACAAAG;EX_MOTIF=C;EX_MLEN=1;EX_RU=C;EX_BASIS=C;EX_BLEN=1;EX_REPEAT_TRACT=17548608,17548609;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=2;EX_RL=2;EX_LL=3;EX_RU_COUNTS=0,2;EX_SCORE=0;EX_TRF_SCORE=-14;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=14;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[A]AAAGAAGGAA;MDUST;LOBSTR | ||
+ | 20 17548608 . AAAAG A . PASS CENTERS=ox1;NCENTERS=1;EX_MOTIF=AAAG;EX_MLEN=4;EX_RU=AAAG;EX_BASIS=AG;EX_BLEN=2;EX_REPEAT_TRACT=17548609,17548612;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=0.75;EX_RL=4;EX_LL=4;EX_RU_COUNTS=0,1;EX_SCORE=0.75;EX_TRF_SCORE=-1;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=13;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[AAAAG]AAGGAACTAC;MDUST;LOBSTR;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAA | ||
+ | </div> | ||
+ | OUTPUT | ||
+ | ====== | ||
+ | CHROM POS REF ALT N_ALLELE PASS EX_RL FZ_RL MDUST LOBSTR VNTRSEEK RMSK EX_REPEAT_TRACT_1 EX_REPEAT_TRACT_2 | ||
+ | 20 17548608 A AC 2 1 2 13 1 1 0 0 17548608 17548608 | ||
+ | 20 17548608 AAAAG A 2 1 4 13 1 1 0 0 17548609 17548609 | ||
+ | |||
+ | <div class="mw-collapsible-content"> | ||
+ | usage : vt info2tab [options] <in.vcf> | ||
+ | |||
+ | options : -d debug [false] | ||
+ | -f filter expression [] | ||
+ | -u list of filter tags to be extracted []-t list of info tags to be extracted [] | ||
+ | -o output tab delimited file [-] | ||
+ | -I file containing list of intervals [] | ||
+ | -i intervals [] | ||
-? displays help | -? displays help | ||
</div> | </div> | ||
Line 741: | Line 983: | ||
Compute features in a VCF file. Example of statistics are Allele counts, [[Genotype_Likelihood_based_Inbreeding_Coefficient|Genotype Likelihood based Inbreeding Coefficient]]. | Compute features in a VCF file. Example of statistics are Allele counts, [[Genotype_Likelihood_based_Inbreeding_Coefficient|Genotype Likelihood based Inbreeding Coefficient]]. | ||
− | [[Genotype_Likelihood_based_Allele_Frequency|Hardy-Weinberg Genotype Likelihood based Allele Frequencies]] | + | [[Genotype_Likelihood_based_Allele_Frequency|Hardy-Weinberg Genotype Likelihood based Allele Frequencies]] <br> |
+ | For more customizable feature computation - look at [http://genome.sph.umich.edu/wiki/Vt#Estimate estimate] | ||
<div class=" mw-collapsible mw-collapsed"> | <div class=" mw-collapsible mw-collapsed"> | ||
Line 824: | Line 1,067: | ||
-i Intervals | -i Intervals | ||
-? displays help | -? displays help | ||
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | === Profile Mendelian Errors === | ||
+ | |||
+ | Profile Mendelian errors | ||
+ | |||
+ | <div class=" mw-collapsible mw-collapsed"> | ||
+ | #profile mendelian errors found in vt.genotypes.bcf, generate [[media:mendel.pdf|tables]] in the directory mendel, requires pdflatex. | ||
+ | vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel | ||
+ | |||
+ | pedigree file format is described in [[Vt#Pedigree File|here]]. | ||
+ | |||
+ | #this is a sample output for mendelian error profiling. | ||
+ | #R and A stand for reference and alternate allele respectively. | ||
+ | #Error% - mendelian error (confounded with de novo mutation) | ||
+ | #HomHet - Homozygous-Heterozygous genotype ratios | ||
+ | #Het% - proportion of hets | ||
+ | Mendelian Errors <br> | ||
+ | Father Mother R/R R/A A/A Error(%) HomHet Het(%) | ||
+ | R/R R/R 14889 210 38 1.64 nan nan | ||
+ | R/R R/A 3403 3497 74 1.06 0.97 50.68 | ||
+ | R/R A/A 176 1482 155 18.26 nan nan | ||
+ | R/A R/R 3665 3652 68 0.92 1.00 49.91 | ||
+ | R/A R/A 1015 3151 990 0.00 0.64 61.11 | ||
+ | R/A A/A 43 1300 1401 1.57 1.08 48.13 | ||
+ | A/A R/R 172 1365 147 18.94 nan nan | ||
+ | A/A R/A 47 1164 1183 1.96 1.02 49.60 | ||
+ | A/A A/A 20 78 5637 1.71 nan nan <br> | ||
+ | Parental R/R R/A A/A Error(%) HomHet Het(%) | ||
+ | R/R R/R 14889 210 38 1.64 nan nan | ||
+ | R/R R/A 7068 7149 142 0.99 0.99 50.28 | ||
+ | R/R A/A 348 2847 302 18.59 nan nan | ||
+ | R/A R/A 1015 3151 990 0.00 0.64 61.11 | ||
+ | R/A A/A 90 2464 2584 1.75 1.05 48.81 | ||
+ | A/A A/A 20 78 5637 1.71 nan nan <br> | ||
+ | Parental R/R R/A A/A Error(%) HomHet Het(%) | ||
+ | HOM HOM 14909 288 5675 1.66 nan nan | ||
+ | HOM HET 7158 9613 2726 1.19 1.00 49.90 | ||
+ | HET HET 1015 3151 990 0.00 0.64 61.11 | ||
+ | HOMREF HOMALT 348 2847 302 18.59 nan nan <br> | ||
+ | total mendelian error : 2.505% | ||
+ | no. of trios : 2 | ||
+ | no. of variants : 25346 | ||
+ | |||
+ | <div class="mw-collapsible-content"> | ||
+ | profile_mendelian v0.5 | ||
+ | |||
+ | usage : vt profile_mendelian [options] <in.vcf> | ||
+ | |||
+ | options : -q minimum genotype quality | ||
+ | -d minimum depth | ||
+ | -r reference sequence fasta file [] | ||
+ | -x output latex directory [] | ||
+ | -p pedigree file | ||
+ | -I file containing list of intervals [] | ||
+ | -i intervals | ||
+ | -? displays help | ||
</div> | </div> | ||
</div> | </div> | ||
Line 948: | Line 1,249: | ||
</div> | </div> | ||
− | === Profile | + | === Profile VNTRs === |
− | Profile | + | Profile VNTRs. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]]. |
<div class=" mw-collapsible mw-collapsed"> | <div class=" mw-collapsible mw-collapsed"> | ||
− | |||
− | |||
− | + | #profiles a set of VNTRs | |
− | + | vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt | |
− | + | ||
− | + | ||
− | + | profile_vntrs v0.5 | |
− | + | ||
− | + | no VNTRs 5660874 #number of VNTRs in vntrs.sites.bcf | |
− | + | no low complexity 2686460 (47.46%) #number of VNTRs in low complexity region determined by MDUST | |
− | + | no coding 17911 (0.32%) #number of VNTRs in coding regions determined by GENCODE v7 | |
− | + | no redundant 1312209 (23.18%) #number of VNTRs involved in overlapping with one another<br> | |
− | + | trf_lobstr (1638516) #TRF based reference set used in lobSTR, motif lengths 1 to 6. | |
− | + | A-B 3269285 #TRs specific to vntrs.sites.bcf | |
− | + | A-B~ 1666185 #TRs in vntrs.sites.bcf that overlap partially with at least one TR in TRF(lobSTR) but does not overlap exactly with another TR. | |
− | + | A&B1 725404 #TRs in vntrs.sites.bcf that overlap exactly with at least one TR in TRF(lobSTR) | |
− | + | A&B2 723195 #TRs in TRF(lobSTR) that overlap exactly with at least one TR in vntrs.sites.bcf | |
− | + | B-A~ 710075 #TRs in TRF(lobSTR) that overlap partially with at least one TR in vntrs.sites.bcf but does not overlap exactly with another TR. | |
− | + | B-A 205246 #TRs specific to TRF(lobSTR) | |
− | + | #note that the first 3 rows should sum up to the number of TRs in vntrs.sites.bcf | |
− | + | #and the 4th to 6th rows should sum up to the number of TRs in TRF( lobSTR) | |
− | + | #This basically allows us to see the m to n overlapping in overlapping TRs<br> | |
− | + | trf_repeatseq (1624553) #TRF based reference set used in repeatseq, motif lengths 1 to 6. | |
− | + | A-B 3291652 | |
− | + | A-B~ 1650190 | |
− | + | A&B1 719032 | |
− | + | A&B2 716838 | |
− | + | B-A~ 703948 | |
− | + | B-A 203767 <br> | |
− | + | trf_vntrseek (230306) #TRF based reference set used in vntrseek, motif lengths 7 to 2000. | |
− | + | A-B 5384453 | |
− | + | A-B~ 271302 | |
− | + | A&B1 5119 | |
+ | A&B2 4973 | ||
+ | B-A~ 92496 | ||
+ | B-A 132837 <br> | ||
+ | codis+ (15) #CODIS STRs + 2 STRs from PROMEGA | ||
+ | A-B 5660794 | ||
+ | A-B~ 79 | ||
+ | A&B1 1 | ||
+ | A&B2 1 | ||
+ | B-A~ 14 | ||
+ | B-A 0 | ||
+ | |||
+ | # This file contains information on how to process reference data sets. | ||
+ | # dataset - name of data set, this label will be printed. | ||
+ | # type - True Positives (TP) and False Positives (FP). | ||
+ | # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively. | ||
+ | # - annotation. | ||
+ | # file is used for GENCODE annotation of coding VNTRs. | ||
+ | # filter - filter applied to variants for this particular data set. | ||
+ | # path - path of indexed BCF file. | ||
+ | #dataset type filter path | ||
+ | trf_lobstr TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.lobstr.sites.bcf | ||
+ | trf_repeatseq TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.repeatseq.sites.bcf | ||
+ | trf_vntrseek TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.vntrseek.sites.bcf | ||
+ | codis+ TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/codis.strs.sites.bcf | ||
+ | GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz | ||
+ | DUST cplx_annotation . | ||
<div class="mw-collapsible-content"> | <div class="mw-collapsible-content"> | ||
− | + | usage : vt profile_vntrs [options] <in.vcf> | |
− | + | options : -g file containing list of reference datasets [] | |
− | + | -I file containing list of intervals [] | |
− | + | -i intervals [] | |
− | - | ||
-r reference sequence fasta file [] | -r reference sequence fasta file [] | ||
− | + | -? displays help | |
− | |||
− | |||
− | |||
− | |||
</div> | </div> | ||
</div> | </div> | ||
Line 1,081: | Line 1,401: | ||
=== Discover === | === Discover === | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
Discovers variants from reads in a BAM/CRAM file. | Discovers variants from reads in a BAM/CRAM file. | ||
Line 1,114: | Line 1,406: | ||
<div class=" mw-collapsible mw-collapsed"> | <div class=" mw-collapsible mw-collapsed"> | ||
#discover variants from NA12878.bam and write to stdout | #discover variants from NA12878.bam and write to stdout | ||
− | vt | + | vt discover -b NA12878.bam -s NA12878 -r hs37d5.fa -i 20 |
<div class="mw-collapsible-content"> | <div class="mw-collapsible-content"> | ||
usage : vt discover2 [options] | usage : vt discover2 [options] | ||
Line 1,169: | Line 1,461: | ||
-- ignores the rest of the labeled arguments following this flag | -- ignores the rest of the labeled arguments following this flag | ||
-h displays help | -h displays help | ||
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | === Filter overlap === | ||
+ | |||
+ | Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap. | ||
+ | |||
+ | <div class="mw-collapsible mw-collapsed"> | ||
+ | #annotates variants that are overlapping | ||
+ | vt filter_overlap in.vcf -r hs37d5.fa -o overlapped.tagged..vcf | ||
+ | |||
+ | <div class="mw-collapsible-content"> | ||
+ | usage : vt filter_overlap [options] <in.vcf> | ||
+ | |||
+ | options : -o output VCF file [-] | ||
+ | -w window overlap for variants [0] | ||
+ | -I file containing list of intervals [] | ||
+ | -i intervals [] | ||
+ | -? displays help | ||
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | <div class="mw-collapsible mw-collapsed"> | ||
+ | #Use Remove overlap instead for versions older than Jan 12, 2017 | ||
+ | vt remove_overlap in.vcf -r hs37d5.fa -o overlapped.tagged..vcf | ||
+ | |||
+ | <div class="mw-collapsible-content"> | ||
+ | usage: vt remove_overlap [options] <in.vcf> | ||
+ | The old version has the same options except that it lacks the -w option | ||
+ | The change occurred in the following commit: | ||
+ | https://github.com/atks/vt/commit/ab5cf7e91b3baa5349f439e6fe92491ae19da1a6 | ||
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | === Annotate Indels === | ||
+ | |||
+ | Annotates indels with VNTR information and adds a VNTR record. Facilitates the simultaneous calling of VNTR together with Indels and SNPs. | ||
+ | |||
+ | <div class=" mw-collapsible mw-collapsed"> | ||
+ | #annotates indels from VCFs with VNTR information. | ||
+ | vt annotate_indels in.vcf -r hs37d5.fa -o annotated.sites.vcf | ||
+ | |||
+ | <div style="height:20em; overflow:auto; border: 2px solid #FFF"> | ||
+ | CHROM POS ID REF ALT QUAL FILTER INFO | ||
+ | 20 82079 . G A 1255.98 . NSAMPLES=1;E=43;N=51;ESUM=43;NSUM=51;FLANKSEQ=GGAGCACGCC[G/A]CCATGCCCGG | ||
+ | 20 82217 . G A 1632.77 . NSAMPLES=1;E=56;N=61;ESUM=56;NSUM=61;FLANKSEQ=GAGCCACCGC[G/A]CCCGGCCCAG | ||
+ | 20 83250 . CTGTGTGTG C . . NSAMPLES=1;E=18;N=35;ESUM=18;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT | ||
+ | 20 83250 . CTGTGTGTGTG C . . NSAMPLES=1;E=3;N=35;ESUM=3;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT | ||
+ | 20 83251 . TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG <VNTR> . . MOTIF=GT;RU=TG;FZ_CONCORDANCE=1;FZ_RL=52;FZ_LL=0;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FZ_RU_COUNTS=26,26;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG]TTTAGTATTT | ||
+ | 20 83252 . G C 359.204 . NSAMPLES=1;E=13;N=14;ESUM=13;NSUM=14;FLANKSEQ=CTCTCTCTCT[G/C]TGTGTGTGTG | ||
+ | 20 83260 . G C 500.163 . NSAMPLES=1;E=18;N=34;ESUM=18;NSUM=34;FLANKSEQ=CTGTGTGTGT[G/C]TGTGTGTGTG | ||
+ | 20 83267 . T C 247.043 . NSAMPLES=1;E=11;N=43;ESUM=11;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT | ||
+ | 20 83275 . T C 609.669 . NSAMPLES=1;E=24;N=43;ESUM=24;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT | ||
+ | 20 90008 . C A 1546.88 . NSAMPLES=1;E=52;N=60;ESUM=52;NSUM=60;FLANKSEQ=AACAGAAAAC[C/A]AAATACTGTA | ||
+ | 20 91088 . C T 1766.04 . NSAMPLES=1;E=58;N=66;ESUM=58;NSUM=66;FLANKSEQ=CCCAGCATAC[C/T]ATGGTTGTGC | ||
+ | 20 91508 . G A 1266.93 . NSAMPLES=1;E=44;N=53;ESUM=44;NSUM=53;FLANKSEQ=AATTAGTAAG[G/A]CTTACGTAAG | ||
+ | 20 91707 . C T 888.134 . NSAMPLES=1;E=30;N=53;ESUM=30;NSUM=53;FLANKSEQ=TGATTTTCTA[C/T]AGCAGGACCT | ||
+ | 20 92527 . A G 828.593 . NSAMPLES=1;E=34;N=40;ESUM=34;NSUM=40;FLANKSEQ=ATTAATTGCC[A/G]TTCTCTCTTT | ||
+ | 20 93440 . A G 688.144 . NSAMPLES=1;E=24;N=58;ESUM=24;NSUM=58;FLANKSEQ=TTGGATGCAT[A/G]GTCTGTAAAT | ||
+ | 20 93636 . TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT <VNTR> . . MOTIF=T;RU=T;FZ_CONCORDANCE=0.939394;FZ_RL=35;FZ_LL=0;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FZ_RU_COUNTS=31,33;FLANKSEQ=TCTAGGATTC[TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT | ||
+ | 20 93646 . C CT . . NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FLANKSEQ=TTTTTCTTTC[TTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT;GMOTIF=T;TR=20:93636:TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT:<VNTR>:T | ||
+ | 20 93717 . A T 31.7622 . NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKSEQ=CAGTGGCGTG[A/T]TCTTAGATCA | ||
+ | 20 93931 . G A 628.149 . NSAMPLES=1;E=22;N=53;ESUM=22;NSUM=53;FLANKSEQ=GATTACAGGT[G/A]TGAGCCGCTG | ||
+ | 20 100699 . C T 809.09 . NSAMPLES=1;E=28;N=61;ESUM=28;NSUM=61;FLANKSEQ=GGTGAAAAAT[C/T]ACCTGTCAGT | ||
+ | 20 101362 . G A 1087.13 . NSAMPLES=1;E=36;N=67;ESUM=36;NSUM=67;FLANKSEQ=TAATACTGAA[G/A]TTTACTTCTC | ||
+ | |||
+ | </div> | ||
+ | |||
+ | The following shows the trace of how the algorithm works | ||
+ | |||
+ | ============================================ | ||
+ | ANNOTATING INDEL FUZZILY | ||
+ | ******************************************** | ||
+ | EXTRACTIING REGION BY EXACT LEFT AND RIGHT ALIGNMENT | ||
+ | |||
+ | 20:131948:C/CCA | ||
+ | EXACT REGION 131948-131965 (18) | ||
+ | CCACACACACACACACAA | ||
+ | FINAL EXACT REGION 131948-131965 (18) | ||
+ | CCACACACACACACACAA | ||
+ | ******************************************** | ||
+ | PICK CANDIDATE MOTIFS | ||
+ | |||
+ | Longest Allele : C[CA]CACACACACACACACAA | ||
+ | detecting motifs for an str | ||
+ | seq: CCACACACACACACACACAA | ||
+ | len : 20 | ||
+ | cmax_len : 10 | ||
+ | candidate motifs: 25 | ||
+ | AC : 0.894737 2 0 | ||
+ | AAC : 0.5 3 0.0555556 | ||
+ | ACC : 0.5 3 0.0555556 | ||
+ | AAAC : 0.0588235 4 0.125 (< 2 copies) | ||
+ | ACCC : 0.0588235 4 0.125 (< 2 copies) | ||
+ | AACAC : 0.5 5 0.02 | ||
+ | ACACC : 0.5 5 0.02 | ||
+ | AAACAC : 0.0666667 6 0.0555556 (< 2 copies) | ||
+ | ACACCC : 0.0666667 6 0.0555556 (< 2 copies) | ||
+ | AACACAC : 0.5 7 0.0102041 | ||
+ | ACACACC : 0.5 7 0.0102041 | ||
+ | AAACACAC : 0.0769231 8 0.03125 (< 2 copies) | ||
+ | ACACACCC : 0.0769231 8 0.03125 (< 2 copies) | ||
+ | AACACACAC : 0.5 9 0.00617284 (< 2 copies) | ||
+ | ACACACACC : 0.5 9 0.00617284 (< 2 copies) | ||
+ | AAACACACAC : 0.0909091 10 0.02 (< 2 copies) | ||
+ | ACACACACCC : 0.0909091 10 0.02 (< 2 copies) | ||
+ | ******************************************** | ||
+ | PICKING NEXT BEST MOTIF | ||
+ | |||
+ | selected: AC 0.89 0.00 | ||
+ | ******************************************** | ||
+ | DETECTING REPEAT TRACT FUZZILY | ||
+ | ++++++++++++++++++++++++++++++++++++++++++++ | ||
+ | Exact left/right alignment | ||
+ | |||
+ | repeat_tract : CACACACACACACACA | ||
+ | position : [131949,131964] | ||
+ | motif_concordance : 1 | ||
+ | repeat units : 8 | ||
+ | exact repeat units : 8 | ||
+ | total no. of repeat units : 8 | ||
+ | |||
+ | ++++++++++++++++++++++++++++++++++++++++++++ | ||
+ | Fuzzy right alignment | ||
+ | |||
+ | repeat motif : CA | ||
+ | rflank : AACTC | ||
+ | mlen : 2 | ||
+ | rflen : 5 | ||
+ | plen : 111 | ||
+ | |||
+ | read : AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACACCACACACACACACACAAACTC | ||
+ | rlen : 106 | ||
+ | |||
+ | optimal score: 50.5073 | ||
+ | optimal state: MR | ||
+ | optimal track: MR|r|0|5 | ||
+ | optimal probe len: 25 | ||
+ | optimal path length : 107 | ||
+ | max j: 106 | ||
+ | probe: (1~82) [1~10] (1~5) | ||
+ | read : (1~82) [83~101] (102~106) | ||
+ | |||
+ | motif # : 10 [83,101] | ||
+ | motif concordance : 95% (9/10) | ||
+ | motif discordance : 0|1|0|0|0|0|0|0|0|0 | ||
+ | |||
+ | Model: ----------------------------------------------------------------------------------CACACACACACACACACACAAACTC | ||
+ | SYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYMMMDMMMMMMMMMMMMMMMMMMMMME | ||
+ | oo++oo++oo++oo++oo++RRRRR | ||
+ | Read: AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACAC-CACACACACACACACAAACTC | ||
+ | |||
+ | ++++++++++++++++++++++++++++++++++++++++++++ | ||
+ | Fuzzy left alignment | ||
+ | |||
+ | lflank : ATCTTA | ||
+ | repeat motif : CA | ||
+ | lflen : 6 | ||
+ | mlen : 2 | ||
+ | plen : 111 | ||
+ | |||
+ | read : ATCTTACACCACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT | ||
+ | rlen : 105 | ||
+ | |||
+ | optimal score: 50.5858 | ||
+ | optimal state: Z | ||
+ | optimal track: Z|m|10|2 | ||
+ | optimal probe len: 26 | ||
+ | optimal path length : 106 | ||
+ | max j: 105 | ||
+ | mismatch penalty: 3 | ||
+ | |||
+ | model: (1~6) [1~10] | ||
+ | read : (1~6) [7~25][26~106] | ||
+ | |||
+ | motif # : 10 [7,25] | ||
+ | motif concordance : 95% (9/10) | ||
+ | motif discordance : 0|1|0|0|0|0|0|0|0|0 | ||
+ | |||
+ | Model: ATCTTACACACACACACACACACACA-------------------------------------------------------------------------------- | ||
+ | SMMMMMMMMMDMMMMMMMMMMMMMMMMZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZE | ||
+ | LLLLLLoo++oo++oo++oo++oo++ | ||
+ | Read: ATCTTACAC-CACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT | ||
+ | |||
+ | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | ||
+ | VNTR Summary | ||
+ | rid : 19 | ||
+ | motif : AC | ||
+ | ru : CA | ||
+ | |||
+ | Exact | ||
+ | repeat_tract : CACACACACACACACA | ||
+ | position : [131949,131964] | ||
+ | reference repeat unit length : 8 | ||
+ | motif_concordance : 1 | ||
+ | repeat units : 8 | ||
+ | exact repeat units : 8 | ||
+ | total no. of repeat units : 8 | ||
+ | |||
+ | Fuzzy | ||
+ | repeat_tract : CACCACACACACACACACA | ||
+ | position : [131946,131964] | ||
+ | reference repeat unit length : 19 | ||
+ | motif_concordance : 0.95 | ||
+ | repeat units : 19 | ||
+ | exact repeat units : 9 | ||
+ | total no. of repeat units : 10 | ||
+ | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | ||
+ | |||
+ | <div class="mw-collapsible-content"> | ||
+ | usage : vt annotate_indels [options] <in.vcf> | ||
+ | |||
+ | options : -v add vntr record [false] | ||
+ | -x override tags [false] | ||
+ | -f filter expression [] | ||
+ | -d debug [false] | ||
+ | -m mode [f] | ||
+ | e : by exact alignment f : by fuzzy alignment | ||
+ | -c classification schemas of tandem repeat [6] | ||
+ | 1 : lai2003 | ||
+ | 2 : kelkar2008 | ||
+ | 3 : fondon2012 | ||
+ | 4 : ananda2013 | ||
+ | 5 : willems2014 | ||
+ | 6 : tan_kang2015 | ||
+ | -a annotation type [v] | ||
+ | v : a. output VNTR variant (defined by classification). | ||
+ | RU repeat unit on reference sequence (CA) | ||
+ | MOTIF canonical representation (AC) | ||
+ | RL repeat tract length in bases (11) | ||
+ | FLANKS flanking positions of repeat tract determined by exact alignment | ||
+ | RU_COUNTS number of exact repeat units and total number of repeat units in | ||
+ | repeat tract determined by exact alignment | ||
+ | FZ_RL fuzzy repeat tract length in bases (11) | ||
+ | FZ_FLANKS flanking positions of repeat tract determined by fuzzy alignment | ||
+ | FZ_RU_COUNTS number of exact repeat units and total number of repeat units in | ||
+ | repeat tract determined by fuzzy alignment | ||
+ | FLANKSEQ flanking sequence of indel | ||
+ | LARGE_REPEAT_REGION repeat region exceeding 2000bp | ||
+ | b. mark indels with overlapping VNTR. | ||
+ | FLANKS flanking positions of repeat tract determined by exact alignment | ||
+ | FZ_FLANKS flanking positions of repeat tract determined by fuzzy alignment | ||
+ | GMOTIF generating motif used in fuzzy alignment | ||
+ | TR position and alleles of VNTR (20:23413:CACACACACAC:<VNTR>) | ||
+ | a : annotate each indel with RU, RL, MOTIF, REF. | ||
+ | -r reference sequence fasta file [] | ||
+ | -o output VCF file [-] | ||
+ | -I file containing list of intervals [] | ||
+ | -i intervals | ||
+ | -? displays help | ||
</div> | </div> | ||
</div> | </div> | ||
Line 1,212: | Line 1,754: | ||
</div> | </div> | ||
</div> | </div> | ||
+ | |||
+ | = Pedigree File = | ||
+ | |||
+ | vt understands an augmented version introduced by [mailto:hmkang@umich.edu Hyun] of the PED described by [http://zzz.bwh.harvard.edu/plink/data.shtml#ped plink]. | ||
+ | The pedigree file format is as follows with the following mandatory fields: | ||
+ | |||
+ | {| class="wikitable" | ||
+ | |- | ||
+ | ! scope="col"| Field | ||
+ | ! scope="col"| Description | ||
+ | ! scope="col"| Valid Values | ||
+ | ! scope="col"| Missing Values | ||
+ | |- | ||
+ | |Family ID<br> | ||
+ | Individual ID<br> | ||
+ | Paternal ID<br> | ||
+ | Maternal ID<br> | ||
+ | Sex<br> | ||
+ | Phenotype | ||
+ | |ID of this family <br> | ||
+ | ID(s) of this individual (comma separated) <br> | ||
+ | ID of the father <br> | ||
+ | ID of the mother <br> | ||
+ | Sex of the individual<br> | ||
+ | Phenotype | ||
+ | |[A-Za-z0-9_]+<br> | ||
+ | [A-Za-z0-9_]+(,[A-Za-z0-9_]+)* <br> | ||
+ | [A-Za-z0-9_]+ <br> | ||
+ | [A-Za-z0-9_]+<br> | ||
+ | 1=male, 2=female, other, male, female<br> | ||
+ | [A-Za-z0-9_]+ | ||
+ | | 0 <br> | ||
+ | cannot be missing <br> | ||
+ | 0 <br> | ||
+ | 0 <br> | ||
+ | other<br> | ||
+ | -9 | ||
+ | |} | ||
+ | |||
+ | Examples: | ||
+ | |||
+ | ceu NA12878 NA12891 NA12892 female -9 | ||
+ | yri NA19240 NA19239 NA19238 female -9 | ||
+ | |||
+ | ceu NA12878 NA12891 NA12892 2 -9 | ||
+ | yri NA19240 NA19239 NA19238 2 -9 | ||
+ | |||
+ | #allows tools like profile_mendelian to detect duplicates and check for concordance | ||
+ | ceu NA12878,NA12878A NA12891 NA12892 female case | ||
+ | yri NA19240 NA19239 NA19238 female control | ||
+ | |||
+ | #allows tools like profile_mendelian to detect duplicates and check for concordance | ||
+ | ceu NA12412 0 0 female case | ||
+ | yri NA19650 0 0 female control | ||
= Resource Bundle = | = Resource Bundle = | ||
− | * External : [ftp://share.sph.umich.edu/vt resource bundle] | + | == GRCh37 == |
+ | |||
+ | Files are based on hs37d5.fa made by Heng Li. | ||
+ | |||
+ | * External : [ftp://share.sph.umich.edu/vt/grch37 GRCh37 resource bundle] | ||
* Internal : /net/fantasia/home/atks/ref/vt/grch37 | * Internal : /net/fantasia/home/atks/ref/vt/grch37 | ||
+ | |||
+ | Read here for [ftp://share.sph.umich.edu/vt/grch37/readme.txt contents]. | ||
+ | |||
+ | == GRCh38 == | ||
+ | |||
+ | Files are based on [https://github.com/lh3/bwa/blob/master/README-alt.md hs38DH.fa] made by Heng Li. | ||
+ | Note that many of the references are simply lifted over from GRCh37 using Picard's liftover tool with the default options. | ||
+ | |||
+ | * External : [ftp://share.sph.umich.edu/vt/grch38 GRCh38 resource bundle] | ||
+ | * Internal : /net/fantasia/home/atks/ref/vt/grch38 | ||
+ | |||
+ | Read here for [ftp://share.sph.umich.edu/vt/grch38/readme.txt contents]. | ||
= FAQ = | = FAQ = | ||
Line 1,229: | Line 1,841: | ||
hs37d5.fa or hs37d5.fa.gz, simply delete the index file denoted by hs37d5.fa.fai or hs37d5.fa.gz.fai and run the vt command | hs37d5.fa or hs37d5.fa.gz, simply delete the index file denoted by hs37d5.fa.fai or hs37d5.fa.gz.fai and run the vt command | ||
again. A new index file will be generated automatically. | again. A new index file will be generated automatically. | ||
+ | |||
+ | = How to cite vt? = | ||
+ | |||
+ | If you use normalize: <br> | ||
+ | [http://bioinformatics.oxfordjournals.org/content/31/13/2202 Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. Unified Representation of Genetic Variants. Bioinformatics (2015) 31(13): 2202-2204] | ||
= Maintained by = | = Maintained by = | ||
This page is maintained by [mailto:atks@umich.edu Adrian] | This page is maintained by [mailto:atks@umich.edu Adrian] |
Latest revision as of 03:36, 4 May 2021
Introduction
vt is a variant tool set that discovers short variants from Next Generation Sequencing data.
Installation
General
The source files are housed in github. htslib is used and a copy of a developmental freeze is stored as part of the vt repository to ensure compatibility.
To install, perform the following steps:
#this will create a directory named vt in the directory you cloned the repository 1. git clone https://github.com/atks/vt.git
#change directory to vt 2. cd vt
#update submodules 3. git submodule update --init --recursive
#run make, note that compilers need to support the c++0x standard 4. make
#you can test the build 5. make test
An expected output when all is well for the tests is shown here. (click expand =>)
user@server:~/vt$ make test test/test.sh ++++++++++++++++++++++ Tests for vt normalize ++++++++++++++++++++++ testing normalize output VCF file : ok output logs : ok +++++++++++++++++++++++++++++++ Tests for vt decompose_blocksub +++++++++++++++++++++++++++++++ testing decompose_blocksub of even-length blocks output VCF file : ok output logs : ok testing decompose_blocksub with alignment output VCF file : ok output logs : ok testing decompose_blocksub of phased even-length blocks output VCF file : ok output logs : ok ++++++++++++++++++++++ Tests for vt decompose ++++++++++++++++++++++ testing decompose for a triallelic variant output VCF file : ok output logs : ok
Passed tests : 5 / 5
Mac
You may install vt via homebrew.
brew tap brewsci/bio brew tap brewsci/science brew install brewsci/bio/vt
Building has been tested on Linux and Mac systems on gcc 4.8.1 and clang 3.4.
Some features of C++11 are used, thus there is a need for newer versions of gcc and clang.
Updating
vt is currently under heavy development, you will probably need to update often.
#remove all object files #you need to do this as source files as the static libraries might have changed and need to be removed. 1. make clean
#update source files 2. git pull
#compile and link, the -j option tells Makefile to run up to 40 independent commands in parallel 3. make -j 40
General Features and notes
Common options
-i multiple intervals in <seq>:<start>-<end> format delimited by commas.
-I multiple intervals in <seq>:<start>-<end> format listed in a text file line by line.
-o defines the out file which and has the STDOUT set as the default. vt recognizes the appropriate output by file extension. <name>.vcf - uncompressed VCF <name>.vcf.gz - compressed VCF <name>.bcf - BCF You may modify the STDOUT to output the binary version of the format. Uncompressed VCF and BCF streams are indicated by - and + respectively.
-f filter expression
-s sequential region selection as opposed to random access of regions specified by the i option. This is useful when you want to select many close-by regions, while the -i option works, it is less efficient and also selects a variant multiple times if it overlaps 2 regions. This option iterates through the variants in the file sequentially and checks for overlap with the bed file given.
Uncompressed BCF streams
htslib is designed with BCF as the underlying data structure and it has incorporated awareness of uncompressed BCF streams in the i/o API. One may use this feature to stream uncompressed BCF records to save on computational time spent on (de)compression.
#using textual VCF streams indicated by - cat mills.vcf | vt normalize - -r hs37d5.fa | vt uniq - -o out.bcf
#using uncompressed BCF streams indicated by + cat mills.vcf | vt normalize - -r hs37d5.fa -o + | vt uniq + -o out.bcf
In this example, the former took 0.84s while the latter took 0.64s to process. (24% speed up!)
Filters
For some programs. you may define a filter via the -f option.
This allows you to only analyse biallelic indels that are passed on chromosome 20. vt profile_na12878 vt.bcf -g na12878.reference.txt -r genome.fa -f "N_ALLELE==2&&VTYPE==INDEL&&PASS" -i 20
This allows you to extract biallelic indels that are passed on chromosome 20. vt view vt.bcf -f "N_ALLELE==2&&VTYPE==INDEL&&PASS" -i 20
Other examples of filters
#all variants with a SNP in them VTYPE&SNP #Simple insertions of length 1 VTYPE==INDEL&&DLEN==1 #Indels of length 1 VTYPE==INDEL&&LEN==1
Variant characteristics VTYPE,N_ALLELE,DLEN,LEN,VARIANT_CONTAINS_N
Variant value types SNP,MNP,INDEL,CLUMPED
Biallelic SNPs only : VTYPE==SNP&&N_ALLELE==2 Biallelic Indels with embedded SNP : VTYPE==(SNP|INDEL)&&N_ALLELE==2 Biallelic variants involving insertions : VTYPE&INDEL&&DLEN>0&&N_ALLELE==2 Biallelic variants involving 1bp variants : LEN==1&&N_ALLELE==2 Variants with explicit sequences with no Ns : ~VARIANT_CONTAINS_N
REF field REF
ALT field ALT
QUAL field QUAL
FILTER fields PASS, FILTER.<tag>
INFO fields INFO.<tag> A/C SNPs : REF=='A' && ALT=='C' AC type of STRs : REF=~'^.(AC)+$' || ALT=~'^.(AC)+$' Passed biallelic SNPs only : PASS&&VTYPE==SNP&&N_ALLELE==2 Passed Common biallelic SNPs only : PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005 Passed Common biallelic SNPs or rare indels : (PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005)||(VTYPE&INDEL&&INFO.AF<=0.005) Passed Common biallelic SNPs or rare indels : ((PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005)||(VTYPE&INDEL&&INFO.AF<=0.005))&&QUAL>100 with quality greater than 100 Failed rare variants : ~PASS&&(INFO.AC/INFO.AN<0.005)
Regular expression matching PERL style (implemented with pcre2) Sometimes, an info field will contain several values in a string with functional annotation, to match what you want, just use INFO.ANNO=~'<perl regular expression>'
Passed variants in intergenic regions or UTR : PASS&&INFO.ANNO=~'Intergenic|UTR' Passed variants in intergenic regions or UTR ignoring case : PASS&&INFO.ANNO=~'(?i)Intergenic|UTR'
pcre2's '(?i)Intergenic|UTR' is equivalent to PERL's '/intergenic|UTR/i'
Operations == : equivalence for strings and numbers != : not equal =~ : regular expression match for strings only ~~ : not of =~. Is equivalent to PERL's !~, this notation is used as BASH keeps interpreting ! for recalling commands from the history ~ : logical not && : logical and || : logical or & : bitwise and | : bitwise or + : add - : subtract * : multiply / : divide
The following programs support filter expressions.
- view
- peek
- profile_snps
- profile_indels
- profile_na12878
- profile_mendelian
- profile_len
- profile_chrom
- profile_afs
- profile_hwe
- concordance
- partition
Alternate headers
As BCF is a restrictive format of VCF where all meta data must be present in the header, vt provides a mechanism to read an alternative header for VCF files that do not have a well formed header. Simply provide a header file stub named as <vcf-file>.hdr and vt will automatically read it instead of the original header in <vcf-file>.
For more information about VCF/BCF : http://samtools.github.io/hts-specs/VCFv4.2.pdf
This mechanism is available only if one is reading VCF or compressed VCF files. It is
disabled for BCF files as this might corrupt the BCF file because the encoding of the
fields in BCF records is based on the order of the meta info lines in the header.
Note: BCF2.2 introduces the IDX field in meta information lines that indicates the
dictionary encoding. This feature might be enabled for BCF files in the future.
General cases of Ploidy and Alleles
I am trying to make vt handle general cases of ploidy and alleles. Please let me know if that is lacking in a tool that you are using.
BCF Compression Levels vs Compression Time
The zlib deflation algorithm (a variant of LZ77) has 10 levels - 0 to 9. 0 has no compression but instead wraps
up the file in zlib or bgzf blocks. It may be useful to have 0 compression as it is indexable with the same mechanism
used for compressed files. Levels 1-9 denote an increasing compression level in exchange for longer times for
compression.
In general, zlib compression does not have significant differences in compression for BCF files between the 9 compression
levels as shown in the following table:
Compression Level | Size | Time |
---|---|---|
0 1 |
153GB 98.4GB |
45m 2h3m |
So, it might be a good idea to compress at lower levels when dealing with large temporary
files in a pipeline to save compute time. This can be achieved with the -c option in vt view
VCF Manipulation
View
Views a VCF or VCF.GZ or BCF file.
#views mills.bcf and outputs to standard out vt view -h mills.bcf
#views mills.bcf and locally sorts it in a 10000bp window and outputs to sorted-millsbcf vt view -h -w 10000 mills.bcf -o sorted-mills.bcf
#views mills.bcf and outputs to c1-mills.bcf with a compression level of 1. By default, #the compression level is 6 where lower levels compress the file less but are faster. #The difference in compression for BCF files between level 1 to level 9 is about 5% of #of a level 1 compression file. The difference in time taken is about an additional 50% #of a level 1 compression. The levels range from 0 to 9 where 0 means no compression #but the file is encapsulated in bgzf blocks that allows the file to be indexed. A special #level -1 denotes an uncompressed BCF file that is not encapsulated in bgzf blocks and #are thus not indexable but are highly suitable for streaming between vt commands. vt view -h mills.bcf -c 1 -o c1-mills.bcf
#views mills.bcf and selects variants that overlap with the regions found in dust.bed from chromosome 20 #the -t option selects variants by checking if each variant overlaps with the regions in the bed file, this is #as opposed to random accessing the variants via the index through the intervals defined in -i and -I options. #this is useful when selecting variants from the target regions from an exome sequencing experiment. vt view 10000 mills.bcf -t dust.bed -i 20
usage : vt view [options] <in.vcf>
options : -o output VCF/VCF.GZ/BCF file [-] -f filter expression [] -w local sorting window size [0] -s print site information only without genotypes [false] -H print header only, this option is honored only for STDOUT [false] -h omit header, this option is honored only for STDOUT [false] -p print options and summary [] -r right window size for overlap [] -l left window size for overlap [] -c compression level 0-9, 0 and -1 denotes uncompressed with the former being wrapped in bgzf. [6] -t bed file for variant selection via streaming [] -I file containing list of intervals [] -i intervals [] -? displays help
Index
Indexes a VCF.GZ or BCF file.
#indexes mills.bcf vt index mills.bcf #indexes mills.vcf.gz vt index mills.vcf.gz
usage : vt index [options] <in.vcf>
options : -p print options and summary [] -- ignores the rest of the labeled arguments following this flag -h displays help
Sorting
Sorting may be done in 3 approaches.
Locally:
Performs sorting within a local window. The window size may be set by the -w option. The default window size
is 1000bp and if a record is detected to be potentially out of order due to a small window size, it wil be reported.
Use this when your VCF records are grouped by chromosome but not ordered in short stretches.
By chromosome:
Your VCF file is not ordered by the chromosomes in the header but is fully ordered within each chromosome.
The VCF file should be indexed and vt will output the records in the order of chromosomes given in the header.
Full sort [default option]:
No assumptions are made about the VCF file. Records will be ordered by the order of contigs in the header.
Smaller temporary ordered files are created and their names are <output_vcf>.<no>.bcf and after generating
these files, they are merged and output into <output_vcf>.
#sorts mills.bcf and outputs to standard out in a 1000bp window. vt sort -m local mills.bcf #sorts mills.bcf and locally sorts it in a 10000bp window and outputs to out.bcf vt sort -m local -w 10000 mills.bcf -o out.bcf #sorts an indexed mills.bcf with chromosomes not sorted in the contig order in the header vt sort -m chrom mills.bcf -o out.bcf #sorts mills.bcf with no assumption vt sort mills.bcf -o out.bcf
usage : vt sort [options] <in.vcf>
options : -m sorting modes. [full] local : locally sort within a 1000bp window. Window size may be set by -w. chrom : sort chromosomes based on order of contigs in header. input must be indexed. full : full sort with no assumptions. -o output VCF/VCF.GZ/BCF file. [-] -w local sorting window size, set by default to 1000 under local mode. [0] -p print options and summary. [] -? displays help
Normalization
Normalize variants in a VCF file (Tan et al. 2015) . Normalized variants may have their positions changed; in such cases, the normalized variants are reordered and output in an ordered fashion. The local reordering takes place over a window of 10000 base pairs which may be changed via the -w option. There is an underlying assumption that the REF field is consistent with the reference sequence use, vt will check for this and will fail if reference inconsistency is encountered; this may be relaexd with the -n option.
#normalize variants and write out to dbsnp.normalized.vcf vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf
#normalize variants, send to standard out and remove duplicates. vt normalize dbsnp.vcf -r seq.fa | vt uniq - -o dbsnp.normalized.uniq.vcf
#read in variants that do not contain N in the explicit alleles, normalize variants, send to standard out. vt normalize dbsnp.vcf -r seq.fa -f "~VARIANT_CONTAINS_N"
#variants that are normalized will be annotated with an OLD_VARIANT info tag. #CHROM POS ID REF ALT QUAL FILTER INFO 19 29238772 . C G . PASS VT=SNP;OLD_VARIANT=19:29238771:TC/TG 20 60674709 . GCCCAGCCCCAC G . PASS VT=INDEL;OLD_VARIANT=20:60674718:CACCCCAGCCCC/C
#this shows a sample output with the normalization operations that were used #categorized into 5 categories each for biallelic and multiallelic variants.
stats: biallelic no. left trimmed : 156908 no. right trimmed : 323 no. left and right trimmed : 33 no. right trimmed and left aligned : 7 no. left aligned : 12360
total no. biallelic normalized : 169631
multiallelic no. left trimmed : 627189 no. right trimmed : 2509 no. left and right trimmed : 1498 no. right trimmed and left aligned : 212 no. left aligned : 1783
total no. multiallelic normalized : 633191
total no. variants normalized : 802822 total no. variants observed : 88052639
usage : vt normalize [options] <in.vcf>
options : -o output VCF file [-] -d debug [false] -q do not print options and summary [false] -m warns but does not exit when REF is inconsistent with masked reference sequence for non SNPs. This overides the -n option [false] -n warns but does not exit when REF is inconsistent with reference sequence for non SNPs [false] -w window size for local sorting of variants [10000] -I file containing list of intervals [] -i intervals [] -r reference sequence fasta file [] -? displays help
Decompose biallelic block substitutions
Decomposes biallelic block substitutions into its constituent SNPs.
There is now an additional option -a which decomposes non block substitutions into its constituent SNPs and indels. (kindly added by [holtgrewe@github])
There is no exact solution and this decomposition is based on the best guess outcome using a Needleman-Wunsch algorithm.
You might also want to check out vcfallelicprimitives.
There is now an additional option -m and -d which ensures that some MNVs are not decomposed. (kindly added by [jaudoux@github])
The motivation is from
- Exome-wide assessment of the functional impact and pathogenicity of multi-nucleotide mutations https://www.biorxiv.org/content/10.1101/258723v2.full
- Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes https://www.biorxiv.org/content/10.1101/573378v2.full
#decomposes biallelic block substitutions and write out to decomposed_blocksub.vcf vt decompose_blocksub gatk.vcf -o decomposed_blocksub.vcf
#before decomposition #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 20 763837 . CA TG 50340.1 PASS AC=1;AN=2 GT 0|1
#after decomposition #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 20 763837 . C T 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG GT 0|1 20 763838 . A G 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG GT 0|1
#decomposes biallelic clumped variant and write out to decomposed_blocksub.vcf vt decompose_blocksub -a gatk.vcf -o decomposed_blocksub.vcf
#before decomposition #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 20 763837 . CG TGA 50340.1 PASS AC=1;AN=2 GT 0|1
#after decomposition #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 20 763837 . C T 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1 20 763838 . G GA 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1
#decomposes biallelic clumped variant and write out to decomposed_blocksub.vcf and add phase set information in the genotype fields vt decompose_blocksub -p gatk.vcf -o decomposed_blocksub.vcf
#before decomposition #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tumor normal 1 159030 . TAACCTTTC TGACCTTTT 0.04 . AF=0.5 GT 0/0 1/1
#after decomposition 1 159031 . A G 0.04 . AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT GT:PS 0|0:159031 1|1:159031 1 159038 . C T 0.04 . AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT GT:PS 0|0:159031 1|1:159031
description : decomposes biallelic block substitutions into its constituent SNPs.
usage : vt decompose_blocksub [options] <in.vcf>
options : -m keep MNVs (multi-nucleotide variants) [false] -a enable aggressive/alignment mode [false] -d MNVs max distance (when -m option is used) [2] -o output VCF file [-] -I file containing list of intervals [] -i intervals [] -? displays help-a enable aggressive/alignment mode
Decompose
Decompose multiallelic variants in a VCF file. If the VCF file has genotype fields GT,PL, GL or DP, they are modified to reflect the change in alleles. All other genotype fields are removed. The -s option will retain the fields and decompose fields of counts R and A accordingingly.
Decomposition and combining variants is a complex operation where the correctness is dependent on [tfarrah@github]:
- whether the observed variants are seen in the same sample,
- if same sample, whether they are homozygous or heterozygous,
- if both heterozygous, whether they are in the same haplotype or not (if known).
and one should be aware of the issues in handling variants resulting from such operations.
The original purpose of this tool is to allow for allelic comparisons between call sets.
[example of a problem caused in combining separate variant records]
#decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf vt decompose gatk.vcf -o gatk.decomposed.vcf
#before decomposition #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 1 3759889 . TA TAA,TAAA,T . PASS AF=0.342,0.173,0.037 GT:DP:PL 1/2:81:281,5,9,58,0,115,338,46,116,809 0/0:86:0,30,323,31,365,483,38,291,325,567
#after decomposition #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 1 3759889 . TA TAA . PASS OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL 1/.:281,5,9 0/0:0,30,323 1 3759889 . TA TAAA . . OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./1:281,58,115 0/0:0,31,483 1 3759889 . TA T . . OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./.:281,338,809 0/0:0,38,567
One might want to post process the partial genotypes like 1/. to the best guess genotype based on the PL values.
#decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf with the -s option. #-s option splits up INFO and GENOTYPE fields that have number counts of R and A [VCFv4.2 section 1.2.2] appropriately. vt decompose -s gatk.vcf -o gatk.decomposed.vcf
#before decomposition #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 1 3759889 . TA TAA,TAAA,T . PASS AF=0.342,0.173,0.037 GT:DP:PL 1/2:81:281,5,9,58,0,115,338,46,116,809 0/0:86:0,30,323,31,365,483,38,291,325,567
#after decomposition #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 1 3759889 . TA TAA . PASS AF=0.342;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL 1/.:281,5,9 0/0:0,30,323 1 3759889 . TA TAAA . . AF=0.173;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./1:281,58,115 0/0:0,31,483 1 3759889 . TA T . . AF=0.037;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./.:281,338,809 0/0:0,38,567
In general, you should recompute fields that involves alleles after decomposition. Information is generally lost after vertically decomposing a variant, so care should be taken in interpreting the resultant values.
description : decomposes multiallelic variants into biallelic in a VCF file.
usage : vt decompose [options] <in.vcf>
options : -s smart decomposition [false] -o output VCF file [-] -I file containing list of intervals [] -i intervals [] -? displays help
Drop duplicate variants
Drops duplicate variants that appear later in the file. VCF file must be ordered.
If there are OLD_VARIANT tags in the INFO field, the variants in these tags are aggregated in the unique record retained.
#drop duplicate variants and save output in mills.uniq.vcf vt uniq mills.vcf -o mills.uniq.vcf
usage : vt uniq [options] <in.vcf>
options : -o output VCF file [-] -I file containing list of intervals [] -i intervals [] -? displays help
Paste
Pastes VCF files like the unix paste functions.
Input requirements and assumptions: 1. Same variants are represented in the same order for each file (required) 2. Genotype field order are the same for corresponding records (required) 3. Sample names are different in all the files (warning will be given if not) 4. Headers are the same for all the files (assumption, not checked, will fail if output is BCF) Outputs: 1. INFO fields output will be that of the first file 2. Genotype fields are the same for corresponding records
#paste together genotypes from the CEU trio into one file. vt paste NA12878.mills.bcf NA12891.mills.bcf NA12892.mills.bcf -o ceu_trio.bcf
usage : vt paste [options] <in1.vcf>... options : -L file containing list of input VCF files -o output VCF file [-] -p print options and summary [] -? displays help
Concatenate
Concatenates VCF files. Assumes individuals are in the same order and files share the same header.
#concatenates chr1.mills.bcf and chr2.mills.bcf vt cat chr1.mills.bcf chr2.mills.bcf -o mills.bcf
#concatenates chr1.mills.bcf and chr2.mills.bcf with the naive option. #The naive option assumes that the headers are all the same and skips #merging headers and translating encodings between BCF files. This is #a much faster option if you know the nature of your BCF files in advance. vt cat -n chr1.mills.bcf chr2.mills.bcf -o mills.bcf
usage : vt cat [options] <in1.vcf>...
options : -s print site information only without genotypes [false] -p print options and summary [false] -n naive, assumes that headers are the same. [false] -w local sorting window size [0] -f filter expression [] -L file containing list of input VCF files -o output VCF file [-] -I file containing list of intervals [] -i intervals -? displays help
Remove info tags
Removes INFO tags from a VCF file
#removes the INFO tags OLD_VARIANT, ENTROPY, PSCORE and COMP vt rminfo exact.del.bcf -t OLD_VARIANT,ENTROPY,PSCORE,COMP -o rm.bcf
usage : vt rminfo [options] <in.vcf>
options : -o output VCF file [-] -q do not print options and summary [false] -t list of info tags to be removed [] -I file containing list of intervals [] -i intervals [] -? displays help
Filter
Filters variants in a VCF file
#adds a filter tag "refA" for variants where the REF column is a A sequence. vt filter in.bcf -f "REF=='A'" -d "refA"
usage : vt filter [options] <in.vcf>-? displays help
options : -x clear filter [false] -f filter expression [] -d filter tag description [] -t filter tag [] -o output VCF file [-] -I file containing list of intervals [] -i intervals
Filter overlap
Tags overlapping variants in a VCF file with the FILTER flag overlap.
#adds a filter tag "overlap" for overlapping variants within a window size of 1 based on the REF sequence. vt filter_overlap in.bcf -w 1 out.bcf
todo: option for considering END info tag for detecting overlaps.
usage : vt filter_overlap [options] <in.vcf> options : -o output VCF file [-] -w window overlap for variants [0] -I file containing list of intervals [] -i intervals [] -? displays help
Validate
Checks the following properties of a VCF file:
- order
- reference sequence consistency
#validates lobstr.bcf vt validate lobstr.bcf
usage : vt validate [options] <in.vcf> options : -q do not print invalid records [false] -I file containing list of intervals [] -i intervals [] -r reference sequence fasta file [] -? displays help
Extract INFO fields to a tab delimited file
Converts a VCF file and its shared information in the INFO field to a tab delimited file for further analysis.
#converts in.bcf to tab format with selected INFO and FILTER fields vt info2tab in.bcf -u PASS -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT
INPUT ===== 20 17548608 . A AC . PASS CENTERS=vbi;NCENTERS=1;OLD_MULTIALLELIC=20:17548598:GAAAAAAAAAAAAA/GAAAAAAAAAAAA/GAAAAAAAAAAAAAA/GAAAAAAAAAA/GAAAAAAAAAAA/GAAAAAAAAAACAAA;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAACAAAG;EX_MOTIF=C;EX_MLEN=1;EX_RU=C;EX_BASIS=C;EX_BLEN=1;EX_REPEAT_TRACT=17548608,17548609;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=2;EX_RL=2;EX_LL=3;EX_RU_COUNTS=0,2;EX_SCORE=0;EX_TRF_SCORE=-14;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=14;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[A]AAAGAAGGAA;MDUST;LOBSTR 20 17548608 . AAAAG A . PASS CENTERS=ox1;NCENTERS=1;EX_MOTIF=AAAG;EX_MLEN=4;EX_RU=AAAG;EX_BASIS=AG;EX_BLEN=2;EX_REPEAT_TRACT=17548609,17548612;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=0.75;EX_RL=4;EX_LL=4;EX_RU_COUNTS=0,1;EX_SCORE=0.75;EX_TRF_SCORE=-1;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=13;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[AAAAG]AAGGAACTAC;MDUST;LOBSTR;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAA
OUTPUT ====== CHROM POS REF ALT N_ALLELE PASS EX_RL FZ_RL MDUST LOBSTR VNTRSEEK RMSK EX_REPEAT_TRACT_1 EX_REPEAT_TRACT_2 20 17548608 A AC 2 1 2 13 1 1 0 0 17548608 17548608 20 17548608 AAAAG A 2 1 4 13 1 1 0 0 17548609 17548609
usage : vt info2tab [options] <in.vcf> options : -d debug [false] -f filter expression [] -u list of filter tags to be extracted []-t list of info tags to be extracted [] -o output tab delimited file [-] -I file containing list of intervals [] -i intervals [] -? displays help
VCF Inspection and Evaluation
Peek
Summarizes the variants in a VCF file
#summarizes the variants found in mills.vcf vt peek mills.vcf
usage : vt peek [options] <in.vcf>
options : -o output VCF file [-] -I file containing list of intervals [] -i intervals [] -r reference sequence fasta file [] -- ignores the rest of the labeled arguments following this flag -h displays help
For a more detailed guide on variant classification.
#This is a sample output of a peek command which summarizes the variants found in a VCF file. stats: no. of samples : 0 no. of chromosomes : 22
========== Micro variants ==========
no. of SNPs : 77228885 2 alleles (ts/tv) : 77011302 (2.11) [52287790/24723512] 3 alleles (ts/tv) : 216560 (0.75) [185520/247600] 4 alleles (ts/tv) : 1023 (0.50) [1023/2046]
no. of MNPs : 0 2 alleles (ts/tv) : 0 (-nan) [0/0] >=3 alleles (ts/tv) : 0 (-nan) [0/0]
no. Indels : 2147564 2 alleles (ins/del) : 2124842 (0.47) [683250/1441592] >=3 alleles (ins/del) : 22722 (2.12) [32411/15286]
no. SNP/MNP : 0 3 alleles (ts/tv) : 0 (-nan) [0/0] >=4 alleles (ts/tv) : 0 (-nan) [0/0]
no. SNP/Indels : 12913 2 alleles (ts/tv) (ins/del) : 412 (0.41) [120/292] (3.68) [324/88] >=3 alleles (ts/tv) (ins/del) : 12501 (0.43) [7670/17649] (18.64) [12434/667]
no. MNP/Indels : 153 2 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0] >=3 alleles (ts/tv) (ins/del) : 153 (0.30) [138/465] (0.27) [67/248]
no. SNP/MNP/Indels : 2 3 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0] 4 alleles (ts/tv) (ins/del) : 2 (0.00) [3/5] (1.00) [3/3] >=5 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0]
no. of clumped variants : 19025 2 alleles : 0 (-nan) [0/0] (-nan) [0/0] 3 alleles : 18508 (0.16) [12152/75366] (0.00) [93/18653] 4 alleles : 451 (0.15) [369/2390] (0.33) [201/609] >=5 alleles : 66 (0.09) [37/414] (1.19) [107/90]
====== Other useful categories =====
no. complex variants : 32093 2 alleles (ts/tv) (ins/del) : 412 (0.41) [120/292] (3.68) [324/88] >=3 alleles (ts/tv) (ins/del) : 31681 (0.21) [20369/96289] (0.64) [12905/20270]
======= Structural variants ========
no. of structural variants : 41217 2 alleles : 38079 deletion : 13135 insertion : 16451 mobile element : 16253 ALU : 12513 LINE1 : 2911 SVA : 829 numt : 198 duplication : 664 inversion : 100 copy number variation : 7729 >=3 alleles : 3138 copy number variation : 3138
========= General summary ==========
no. of reference : 0
no. of observed variants : 79449759 no. of unclassified variants : 0
Partition
Partition variants from two data sets.
Please note that this only works if the contigs in the headers of both data sets are the same.
#partitions all variants in bi1.bcf and bi2.bcf vt partition bi1.bcf bi2.bcf
Options: input VCF file a bi1.bcf input VCF file b bi2.bcf
A: 504676 variants B: 1389333 variants
ts/tv ins/del A-B 37564 [0.19] [1.34] A&B 467112 [1.55] [0.72] B-A 922221 [1.20] [0.58] of A 92.6% of B 33.6%
#partitions only passed variants in bi1.bcf and bi2.bcf vt partition bi1.bcf bi2.bcf -f PASS
Options: input VCF file a bi1.bcf input VCF file b bi2.bcf [f] filter PASS
A: 466148 variants B: 986056 variants
ts/tv ins/del A-B 47261 [0.44] [1.36] A&B 418887 [1.80] [0.68] B-A 567169 [1.43] [0.72] of A 89.9% of B 42.5%
partition v0.5
description : partition variants. check the overlap of variants between 2 data sets.
usage : vt partition [options] <in1.vcf><in2.vcf>
options : -w write partitioned variants to file -f filter -I file containing list of intervals [] -i intervals [] -? displays help
Multi Partition
Partitions variants found in VCF files.
In comparison to the simple 2 way partition, this does not support writing out of partitions to file and
reporting proportion of shared variants for each VCF.
#partitions variants n-ways vt multi_partition hc.genotypes.bcf pl.genotypes.bcf st.genotypes.bcf
Options: input VCF file a hc.genotypes.bcf input VCF file b pl.genotypes.bcf input VCF file c st.genotypes.bcf
A: 97274 variants B: 95458 variants C: 98943 variants
no [ts/tv] [ins/del] A-- 3887 [1.10] [0.86] -B- 7890 [1.45] [0.98] AB- 4360 [0.99] [1.32] --C 8277 [1.75] [2.21] A-C 7458 [1.78] [0.49] -BC 1639 [1.63] [1.03] ABC 81569 [2.28] [1.08]
Unique variants : 115080 Overall concordance : 70.88% (#intersection/#union)
usage : vt multi_partition [options] <in1.vcf><in2.vcf>... options : -f filter -I file containing list of intervals [] -i intervals [] -? displays help
Annotate Regions
Annotates regions in a VCF file. The BED file should be bgzipped and indexed with tabix.
#annotates the variants that overlap with coding regions. vt annotate_regions mills.vcf -b coding.bed.gz -t CDS -d "Coding region"
#annotates the variants that overlap with low complexity regions. vt annotate_regions mills.vcf -b mdust.bed.gz -t DUST -d "DUST Low Complexity Region"
usage : vt annotate_regions [options] <in.vcf>
options : -d regions tag description [] -t regions tag [] -b regions BED file [] -o output VCF file [-] -I file containing list of intervals [] -i intervals -? displays help
Annotate Variants
Annotates variants in a VCF file. The GENCODE annotation file should be bgzipped and indexed with tabix. This is available in the vt resource bundle.
#annotates the variants found in mills.vcf vt annotate_variants mills.vcf -r hs37d5.fa -g gencode.v19.annotation.gtf.gz
#annotates variants with the following fields ##INFO=<ID=VT,Number=1,Type=String,Description="Variant Type - SNP, MNP, INDEL, CLUMPED"> ##INFO=<ID=GENCODE_FS,Number=0,Type=Flag,Description="Frameshift INDEL"> ##INFO=<ID=GENCODE_NFS,Number=0,Type=Flag,Description="Non Frameshift INDEL">
usage : vt annotate_variants [options] <in.vcf>
options : -g GENCODE annotations GTF file [] -r reference sequence fasta file [] -o output VCF file [-] -I file containing list of intervals [] -i intervals -? displays help
Compute Features
Compute features in a VCF file. Example of statistics are Allele counts, Genotype Likelihood based Inbreeding Coefficient.
Hardy-Weinberg Genotype Likelihood based Allele Frequencies
For more customizable feature computation - look at estimate
#compute features for the variants found in vt.vcf #requires GT, PL and DP vt compute_features vt.vcf
#annotates variants with the following fields ##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate Allele Counts"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Number Allele Counts"> ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Alternate Allele Frequency"> ##INFO=<ID=GC,Number=G,Type=Integer,Description="Genotype Counts"> ##INFO=<ID=GN,Number=1,Type=Integer,Description="Total Number of Genotypes Counts"> ##INFO=<ID=GF,Number=G,Type=Float,Description="Genotype Frequency"> ##INFO=<ID=HWEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency assuming HWE"> ##INFO=<ID=HWEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency assuming HWE"> ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency"> ##INFO=<ID=MLEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency"> ##INFO=<ID=HWE_LLR,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg ln(Likelihood Ratio)"> ##INFO=<ID=HWE_LPVAL,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic ln(p-value)"> ##INFO=<ID=HWE_DF,Number=1,Type=Integer,Description="Degrees of freedom for Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic"> ##INFO=<ID=FIC,Number=1,Type=Float,Description="Genotype likelihood based Inbreeding Coefficient"> ##INFO=<ID=AB,Number=1,Type=Float,Description="Genotype likelihood based Allele Balance">
usage : vt compute_features for variants [options] <in.vcf> options : -s print site information only without genotypes [false] -o output VCF/VCF.GZ/BCF file [-] -f filter expression [] -I File containing list of intervals -i Intervals -? displays help
Estimate
Compute variant based estimates.
Example of statistics are: * Allele counts * Hardy-Weinberg Genotype Likelihood based Allele Frequencies * Genotype Likelihood based Inbreeding Coefficient * Genotype Likelihood based Hardy-Weinberg test * Genotype Likelihood based Allele Balance
#compute features for the variants found in vt.vcf #requires GT and PL vt estimate -e AF,MLEAF vt.vcf
AF Genotype (GT) based allele frequencies If genotypes are unavailable, best guess genotypes are inferred based on genotype likelihoods (GL or PL) AC : Alternate Allele counts AN : Total allele counts NS : No. of samples. AF : Alternate allele frequencies. MLEAF GL based allele frequencies estimates MLEAF : Alternate allele frequency derived from MLEGF MLEGF : Genotype frequencies. HWEAF GL based allele frequencies estimates assuming HWE HWEAF : Alternate allele frequencies HWEGF : Genotype frequencies derived from HWEAF. HWE GL based Hardy-Weinberg statistics. HWE_LLR : log likelihood ratio HWE_LPVAL : log p-value HWE_DF : degrees of freedom AB GL based Allele Balance. FIC GL based Inbreeding Coefficient
usage : vt estimate [options] <in.vcf> options : -s print site information only without genotypes [false] -o output VCF/VCF.GZ/BCF file [-] -e comma separated estimates to be computed [] -f filter expression [] -I File containing list of intervals -i Intervals -? displays help
Profile Mendelian Errors
Profile Mendelian errors
#profile mendelian errors found in vt.genotypes.bcf, generate tables in the directory mendel, requires pdflatex. vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel
pedigree file format is described in here.
#this is a sample output for mendelian error profiling. #R and A stand for reference and alternate allele respectively. #Error% - mendelian error (confounded with de novo mutation) #HomHet - Homozygous-Heterozygous genotype ratios #Het% - proportion of hets Mendelian Errors
Father Mother R/R R/A A/A Error(%) HomHet Het(%) R/R R/R 14889 210 38 1.64 nan nan R/R R/A 3403 3497 74 1.06 0.97 50.68 R/R A/A 176 1482 155 18.26 nan nan R/A R/R 3665 3652 68 0.92 1.00 49.91 R/A R/A 1015 3151 990 0.00 0.64 61.11 R/A A/A 43 1300 1401 1.57 1.08 48.13 A/A R/R 172 1365 147 18.94 nan nan A/A R/A 47 1164 1183 1.96 1.02 49.60 A/A A/A 20 78 5637 1.71 nan nan
Parental R/R R/A A/A Error(%) HomHet Het(%) R/R R/R 14889 210 38 1.64 nan nan R/R R/A 7068 7149 142 0.99 0.99 50.28 R/R A/A 348 2847 302 18.59 nan nan R/A R/A 1015 3151 990 0.00 0.64 61.11 R/A A/A 90 2464 2584 1.75 1.05 48.81 A/A A/A 20 78 5637 1.71 nan nan
Parental R/R R/A A/A Error(%) HomHet Het(%) HOM HOM 14909 288 5675 1.66 nan nan HOM HET 7158 9613 2726 1.19 1.00 49.90 HET HET 1015 3151 990 0.00 0.64 61.11 HOMREF HOMALT 348 2847 302 18.59 nan nan
total mendelian error : 2.505% no. of trios : 2 no. of variants : 25346
profile_mendelian v0.5
usage : vt profile_mendelian [options] <in.vcf>
options : -q minimum genotype quality -d minimum depth -r reference sequence fasta file [] -x output latex directory [] -p pedigree file -I file containing list of intervals [] -i intervals -? displays help
Profile SNPs
Profile SNPs. The reference data sets can be obtained from vt resource bundle.
#profile snps found in 20.sites.vcf vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa -i 20
#this is a sample output for indel profiling. # square brackets contain the ts/tv ratio. # The numbers in curved bracket are the counts of ts and tv SNPs respectively. # Low complexity shows what percent of the SNPs are in low complexity regions. data set No. SNPs : 508603 [2.09] Low complexity : 0.08 (39837/508603)
1000g A-B 109970 [1.39] A&B 398633 [2.37] B-A 1340682 [2.26] Precision 78.4% Sensitivity 22.9%
dbsnp A-B 324063 [1.99] A&B 184540 [2.29] B-A 103893 [2.60] Precision 36.3% Sensitivity 64.0%
# This file contains information on how to process reference data sets. # # dataset - name of data set, this label will be printed. # type - True Positives (TP) and False Positives (FP) # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively # - annotation # file is used for GENCODE annotation of frame shift and non frame shift Indels # filter - filter applied to variants for this particular data set # path - path of indexed BCF file #dataset type filter path 1000g TP N_ALLELE==2&&VTYPE==SNP /net/fantasia/home/atks/ref/vt/grch37/1000G.v5.snps.indels.complex.svs.sites.bcf dbsnp TP N_ALLELE==2&&VTYPE==SNP /net/fantasia/home/atks/ref/vt/grch37/dbSNP138.snps.indels.complex.sites.bcf GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz DUST cplx_annotation . /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz
usage : vt profile_snps [options] <in.vcf>
options : -f filter expression [] -g file containing list of reference datasets [] -I file containing list of intervals [] -i intervals [] -r reference sequence fasta file [] -? displays help
Profile Indels
Profile Indels. The reference data sets can be obtained from vt resource bundle.
#profile indels found in mills.vcf vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa -i 20
#this is a sample output for indel profiling. # square brackets contain the ins/del ratio. # for the FS/NFS field, that is the proportion of coding indels that are frame shifted. # The numbers in curved bracket are the counts of frame shift and non frame shift indels respectively. data set No Indels : 46974 [0.89] FS/NFS : 0.26 (8/23)
dbsnp A-B 30704 [0.92] A&B 16270 [0.83] B-A 2049488 [1.52] Precision 34.6% Sensitivity 0.8%
mills A-B 43234 [0.88] A&B 3740 [1.00] B-A 203278 [0.98] Precision 8.0% Sensitivity 1.8%
mills.chip A-B 46847 [0.89] A&B 127 [0.90] B-A 8777 [0.93] Precision 0.3% Sensitivity 1.4%
affy.exome.chip A-B 46911 [0.89] A&B 63 [0.43] B-A 33997 [0.47] Precision 0.1% Sensitivity 0.2%
# This file contains information on how to process reference data sets. # dataset - name of data set, this label will be printed. # type - True Positives (TP) and False Positives (FP). # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively. # - annotation. # file is used for GENCODE annotation of frame shift and non frame shift Indels. # filter - filter applied to variants for this particular data set. # path - path of indexed BCF file. #dataset type filter path 1000g TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/1000G.snps_indels.sites.bcf mills TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/mills.208620indels.sites.bcf dbsnp TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/dbsnp.13147541variants.sites.bcf GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.cds.bed.gz DUST cplx_annotation . /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz
usage : vt profile_indels [options] <in.vcf>
options : -g file containing list of reference datasets [] -I file containing list of intervals [] -i intervals [] -r reference sequence fasta file [] -? displays help
Profile VNTRs
Profile VNTRs. The reference data sets can be obtained from vt resource bundle.
#profiles a set of VNTRs vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt
profile_vntrs v0.5 no VNTRs 5660874 #number of VNTRs in vntrs.sites.bcf no low complexity 2686460 (47.46%) #number of VNTRs in low complexity region determined by MDUST no coding 17911 (0.32%) #number of VNTRs in coding regions determined by GENCODE v7 no redundant 1312209 (23.18%) #number of VNTRs involved in overlapping with one another
trf_lobstr (1638516) #TRF based reference set used in lobSTR, motif lengths 1 to 6. A-B 3269285 #TRs specific to vntrs.sites.bcf A-B~ 1666185 #TRs in vntrs.sites.bcf that overlap partially with at least one TR in TRF(lobSTR) but does not overlap exactly with another TR. A&B1 725404 #TRs in vntrs.sites.bcf that overlap exactly with at least one TR in TRF(lobSTR) A&B2 723195 #TRs in TRF(lobSTR) that overlap exactly with at least one TR in vntrs.sites.bcf B-A~ 710075 #TRs in TRF(lobSTR) that overlap partially with at least one TR in vntrs.sites.bcf but does not overlap exactly with another TR. B-A 205246 #TRs specific to TRF(lobSTR) #note that the first 3 rows should sum up to the number of TRs in vntrs.sites.bcf #and the 4th to 6th rows should sum up to the number of TRs in TRF( lobSTR) #This basically allows us to see the m to n overlapping in overlapping TRs
trf_repeatseq (1624553) #TRF based reference set used in repeatseq, motif lengths 1 to 6. A-B 3291652 A-B~ 1650190 A&B1 719032 A&B2 716838 B-A~ 703948 B-A 203767
trf_vntrseek (230306) #TRF based reference set used in vntrseek, motif lengths 7 to 2000. A-B 5384453 A-B~ 271302 A&B1 5119 A&B2 4973 B-A~ 92496 B-A 132837
codis+ (15) #CODIS STRs + 2 STRs from PROMEGA A-B 5660794 A-B~ 79 A&B1 1 A&B2 1 B-A~ 14 B-A 0
# This file contains information on how to process reference data sets. # dataset - name of data set, this label will be printed. # type - True Positives (TP) and False Positives (FP). # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively. # - annotation. # file is used for GENCODE annotation of coding VNTRs. # filter - filter applied to variants for this particular data set. # path - path of indexed BCF file. #dataset type filter path trf_lobstr TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.lobstr.sites.bcf trf_repeatseq TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.repeatseq.sites.bcf trf_vntrseek TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.vntrseek.sites.bcf codis+ TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/codis.strs.sites.bcf GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz DUST cplx_annotation .
usage : vt profile_vntrs [options] <in.vcf>
options : -g file containing list of reference datasets [] -I file containing list of intervals [] -i intervals [] -r reference sequence fasta file [] -? displays help
Profile NA12878
Profile Mendelian errors
#profile NA12878 overlap with broad knowledgebase and illumina platinum genomes for the file vt.genotypes.bcf for chromosome 20. vt profile_na12878 vt.genotypes.bcf -g na12878.reference.txt -r hs37d5.fa -i 20
#this is a sample output for mendelian error profiling. #R and A stand for reference and alternate allele respectively. #Error% - mendelian error (confounded with de novo mutation) #HomHet - Homozygous-Heterozygous genotype ratios #Het% - proportion of hets data set No Indels : 27770 [0.94] FS/NFS : 0.26 (8/23)
broad.kb A-B 13071 [1.19] A&B 14699 [0.76] B-A 21546 [0.62] Precision 52.9% Sensitivity 40.6%
illumina.platinum A-B 17952 [0.88] A&B 9818 [1.07] B-A 2418 [0.88] Precision 35.4% Sensitivity 80.2%
broad.kb R/R R/A A/A ./. R/R 346 145 3 5473 R/A 3 4133 9 758 A/A 2 136 2186 956 ./. 2 139 86 322
Total genotype pairs : 6963 Concordance : 95.72% (6665) Discordance : 4.28% (298)
illumina.platinum R/R R/A A/A ./. R/R 1768 85 2 0 R/A 10 4479 14 0 A/A 13 180 3028 0 ./. 71 98 70 0
Total genotype pairs : 9579 Concordance : 96.83% (9275) Discordance : 3.17% (304)
# This file contains information on how to process reference data sets. # # dataset - name of data set, this label will be printed. # type - True Positives (TP) and False Positives (FP) # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively # - annotation # file is used for GENCODE annotation of frame shift and non frame shift Indels # filter - filter applied to variants for this particular data set # path - path of indexed BCF file #dataset type filter path broad.kb TP PASS /net/fantasia/home/atks/dev/vt/bundle/public/grch37/broad.kb.241365variants.genotypes.bcf illumina.platinum TP PASS /net/fantasia/home/atks/dev/vt/bundle/public/grch37/NA12878.illumina.platinum.5284448variants.genotypes.bcf #gencode.v19 annotation . /net/fantasia/home/atks/dev/vt/bundle/public/grch37/gencode.v19.annotation.gtf.gz
profile_na12878 v0.5
usage : vt profile_na12878 [options] <in.vcf>
options : -g file containing list of reference datasets [] -I file containing list of intervals [] -i intervals [] -r reference sequence fasta file [] -? displays help
Variant Calling
Discover
Discovers variants from reads in a BAM/CRAM file.
#discover variants from NA12878.bam and write to stdout vt discover -b NA12878.bam -s NA12878 -r hs37d5.fa -i 20
usage : vt discover2 [options]
options : -b input BAM/CRAM file -y soft clipped unique sequences cutoff [0] -x soft clipped mean quality cutoff [0] -w insertion desired type II error [0.0] -c insertion desired type I error [0.0] -h insertion fractional evidence cutoff [0] -g insertion count cutoff [1] -n deletion desired type II error [0.0] -m deletion desired type I error [0.0] -v deletion fractional evidence cutoff [0] -u deletion count cutoff [1] -k snp desired type II error [0.0] -j snp desired type I error [0.0] -f snp fractional evidence cutoff [0] -e snp evidence count cutoff [1] -q base quality cutoff for bases [0] -C likelihood ratio cutoff [0] -B reference bias [0] -a read exclude flag [0x0704] -l ignore overlapping reads [false] -t MAPQ cutoff for alignments [0] -p ploidy [2] -s sample ID -r reference sequence fasta file [] -o output VCF file [-] -z ignore MD tags [0] -d debug [0] -I file containing list of intervals [] -i intervals [] -? displays help
Merge candidate variants
Merge candidate variants across samples. Each VCF file is required to have the FORMAT flags E and N and should have exactly one sample.
#merge candidate variants from VCFs in candidate.txt and output in candidate.sites.vcf vt merge_candidate_variants candidates.txt -o candidate.sites.vcf
usage : vt merge_candidate_variants [options]
options : -L file containing list of input VCF files -o output VCF file [-] -I file containing list of intervals [] -i intervals -- ignores the rest of the labeled arguments following this flag -h displays help
Filter overlap
Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap.
#annotates variants that are overlapping vt filter_overlap in.vcf -r hs37d5.fa -o overlapped.tagged..vcf
usage : vt filter_overlap [options] <in.vcf>
options : -o output VCF file [-] -w window overlap for variants [0] -I file containing list of intervals [] -i intervals [] -? displays help
#Use Remove overlap instead for versions older than Jan 12, 2017 vt remove_overlap in.vcf -r hs37d5.fa -o overlapped.tagged..vcf
usage: vt remove_overlap [options] <in.vcf> The old version has the same options except that it lacks the -w option The change occurred in the following commit: https://github.com/atks/vt/commit/ab5cf7e91b3baa5349f439e6fe92491ae19da1a6
Annotate Indels
Annotates indels with VNTR information and adds a VNTR record. Facilitates the simultaneous calling of VNTR together with Indels and SNPs.
#annotates indels from VCFs with VNTR information. vt annotate_indels in.vcf -r hs37d5.fa -o annotated.sites.vcf
CHROM POS ID REF ALT QUAL FILTER INFO 20 82079 . G A 1255.98 . NSAMPLES=1;E=43;N=51;ESUM=43;NSUM=51;FLANKSEQ=GGAGCACGCC[G/A]CCATGCCCGG 20 82217 . G A 1632.77 . NSAMPLES=1;E=56;N=61;ESUM=56;NSUM=61;FLANKSEQ=GAGCCACCGC[G/A]CCCGGCCCAG 20 83250 . CTGTGTGTG C . . NSAMPLES=1;E=18;N=35;ESUM=18;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT 20 83250 . CTGTGTGTGTG C . . NSAMPLES=1;E=3;N=35;ESUM=3;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT 20 83251 . TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG <VNTR> . . MOTIF=GT;RU=TG;FZ_CONCORDANCE=1;FZ_RL=52;FZ_LL=0;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FZ_RU_COUNTS=26,26;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG]TTTAGTATTT 20 83252 . G C 359.204 . NSAMPLES=1;E=13;N=14;ESUM=13;NSUM=14;FLANKSEQ=CTCTCTCTCT[G/C]TGTGTGTGTG 20 83260 . G C 500.163 . NSAMPLES=1;E=18;N=34;ESUM=18;NSUM=34;FLANKSEQ=CTGTGTGTGT[G/C]TGTGTGTGTG 20 83267 . T C 247.043 . NSAMPLES=1;E=11;N=43;ESUM=11;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT 20 83275 . T C 609.669 . NSAMPLES=1;E=24;N=43;ESUM=24;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT 20 90008 . C A 1546.88 . NSAMPLES=1;E=52;N=60;ESUM=52;NSUM=60;FLANKSEQ=AACAGAAAAC[C/A]AAATACTGTA 20 91088 . C T 1766.04 . NSAMPLES=1;E=58;N=66;ESUM=58;NSUM=66;FLANKSEQ=CCCAGCATAC[C/T]ATGGTTGTGC 20 91508 . G A 1266.93 . NSAMPLES=1;E=44;N=53;ESUM=44;NSUM=53;FLANKSEQ=AATTAGTAAG[G/A]CTTACGTAAG 20 91707 . C T 888.134 . NSAMPLES=1;E=30;N=53;ESUM=30;NSUM=53;FLANKSEQ=TGATTTTCTA[C/T]AGCAGGACCT 20 92527 . A G 828.593 . NSAMPLES=1;E=34;N=40;ESUM=34;NSUM=40;FLANKSEQ=ATTAATTGCC[A/G]TTCTCTCTTT 20 93440 . A G 688.144 . NSAMPLES=1;E=24;N=58;ESUM=24;NSUM=58;FLANKSEQ=TTGGATGCAT[A/G]GTCTGTAAAT 20 93636 . TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT <VNTR> . . MOTIF=T;RU=T;FZ_CONCORDANCE=0.939394;FZ_RL=35;FZ_LL=0;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FZ_RU_COUNTS=31,33;FLANKSEQ=TCTAGGATTC[TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT 20 93646 . C CT . . NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FLANKSEQ=TTTTTCTTTC[TTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT;GMOTIF=T;TR=20:93636:TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT:<VNTR>:T 20 93717 . A T 31.7622 . NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKSEQ=CAGTGGCGTG[A/T]TCTTAGATCA 20 93931 . G A 628.149 . NSAMPLES=1;E=22;N=53;ESUM=22;NSUM=53;FLANKSEQ=GATTACAGGT[G/A]TGAGCCGCTG 20 100699 . C T 809.09 . NSAMPLES=1;E=28;N=61;ESUM=28;NSUM=61;FLANKSEQ=GGTGAAAAAT[C/T]ACCTGTCAGT 20 101362 . G A 1087.13 . NSAMPLES=1;E=36;N=67;ESUM=36;NSUM=67;FLANKSEQ=TAATACTGAA[G/A]TTTACTTCTC
The following shows the trace of how the algorithm works
============================================ ANNOTATING INDEL FUZZILY ******************************************** EXTRACTIING REGION BY EXACT LEFT AND RIGHT ALIGNMENT 20:131948:C/CCA EXACT REGION 131948-131965 (18) CCACACACACACACACAA FINAL EXACT REGION 131948-131965 (18) CCACACACACACACACAA ******************************************** PICK CANDIDATE MOTIFS Longest Allele : C[CA]CACACACACACACACAA detecting motifs for an str seq: CCACACACACACACACACAA len : 20 cmax_len : 10 candidate motifs: 25 AC : 0.894737 2 0 AAC : 0.5 3 0.0555556 ACC : 0.5 3 0.0555556 AAAC : 0.0588235 4 0.125 (< 2 copies) ACCC : 0.0588235 4 0.125 (< 2 copies) AACAC : 0.5 5 0.02 ACACC : 0.5 5 0.02 AAACAC : 0.0666667 6 0.0555556 (< 2 copies) ACACCC : 0.0666667 6 0.0555556 (< 2 copies) AACACAC : 0.5 7 0.0102041 ACACACC : 0.5 7 0.0102041 AAACACAC : 0.0769231 8 0.03125 (< 2 copies) ACACACCC : 0.0769231 8 0.03125 (< 2 copies) AACACACAC : 0.5 9 0.00617284 (< 2 copies) ACACACACC : 0.5 9 0.00617284 (< 2 copies) AAACACACAC : 0.0909091 10 0.02 (< 2 copies) ACACACACCC : 0.0909091 10 0.02 (< 2 copies) ******************************************** PICKING NEXT BEST MOTIF selected: AC 0.89 0.00 ******************************************** DETECTING REPEAT TRACT FUZZILY ++++++++++++++++++++++++++++++++++++++++++++ Exact left/right alignment repeat_tract : CACACACACACACACA position : [131949,131964] motif_concordance : 1 repeat units : 8 exact repeat units : 8 total no. of repeat units : 8 ++++++++++++++++++++++++++++++++++++++++++++ Fuzzy right alignment repeat motif : CA rflank : AACTC mlen : 2 rflen : 5 plen : 111 read : AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACACCACACACACACACACAAACTC rlen : 106 optimal score: 50.5073 optimal state: MR optimal track: MR|r|0|5 optimal probe len: 25 optimal path length : 107 max j: 106 probe: (1~82) [1~10] (1~5) read : (1~82) [83~101] (102~106) motif # : 10 [83,101] motif concordance : 95% (9/10) motif discordance : 0|1|0|0|0|0|0|0|0|0 Model: ----------------------------------------------------------------------------------CACACACACACACACACACAAACTC SYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYMMMDMMMMMMMMMMMMMMMMMMMMME oo++oo++oo++oo++oo++RRRRR Read: AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACAC-CACACACACACACACAAACTC ++++++++++++++++++++++++++++++++++++++++++++ Fuzzy left alignment lflank : ATCTTA repeat motif : CA lflen : 6 mlen : 2 plen : 111 read : ATCTTACACCACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT rlen : 105 optimal score: 50.5858 optimal state: Z optimal track: Z|m|10|2 optimal probe len: 26 optimal path length : 106 max j: 105 mismatch penalty: 3 model: (1~6) [1~10] read : (1~6) [7~25][26~106] motif # : 10 [7,25] motif concordance : 95% (9/10) motif discordance : 0|1|0|0|0|0|0|0|0|0 Model: ATCTTACACACACACACACACACACA-------------------------------------------------------------------------------- SMMMMMMMMMDMMMMMMMMMMMMMMMMZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZE LLLLLLoo++oo++oo++oo++oo++ Read: ATCTTACAC-CACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx VNTR Summary rid : 19 motif : AC ru : CA Exact repeat_tract : CACACACACACACACA position : [131949,131964] reference repeat unit length : 8 motif_concordance : 1 repeat units : 8 exact repeat units : 8 total no. of repeat units : 8 Fuzzy repeat_tract : CACCACACACACACACACA position : [131946,131964] reference repeat unit length : 19 motif_concordance : 0.95 repeat units : 19 exact repeat units : 9 total no. of repeat units : 10 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
usage : vt annotate_indels [options] <in.vcf>
options : -v add vntr record [false] -x override tags [false] -f filter expression [] -d debug [false] -m mode [f] e : by exact alignment f : by fuzzy alignment -c classification schemas of tandem repeat [6] 1 : lai2003 2 : kelkar2008 3 : fondon2012 4 : ananda2013 5 : willems2014 6 : tan_kang2015 -a annotation type [v] v : a. output VNTR variant (defined by classification). RU repeat unit on reference sequence (CA) MOTIF canonical representation (AC) RL repeat tract length in bases (11) FLANKS flanking positions of repeat tract determined by exact alignment RU_COUNTS number of exact repeat units and total number of repeat units in repeat tract determined by exact alignment FZ_RL fuzzy repeat tract length in bases (11) FZ_FLANKS flanking positions of repeat tract determined by fuzzy alignment FZ_RU_COUNTS number of exact repeat units and total number of repeat units in repeat tract determined by fuzzy alignment FLANKSEQ flanking sequence of indel LARGE_REPEAT_REGION repeat region exceeding 2000bp b. mark indels with overlapping VNTR. FLANKS flanking positions of repeat tract determined by exact alignment FZ_FLANKS flanking positions of repeat tract determined by fuzzy alignment GMOTIF generating motif used in fuzzy alignment TR position and alleles of VNTR (20:23413:CACACACACAC:<VNTR>) a : annotate each indel with RU, RL, MOTIF, REF. -r reference sequence fasta file [] -o output VCF file [-] -I file containing list of intervals [] -i intervals -? displays help
Construct Probes
Construct probes for genotyping a variant.
#construct probes from candidate.sites.bcf and output to standard out vt construct_probes candidates.sites.bcf -r ref.fa
usage : vt construct_probes [options] <in.vcf>
options : -o output VCF file [-] -f minimum flank length [20] -r reference sequence fasta file [] -I file containing list of intervals [] -i intervals [] -- ignores the rest of the labeled arguments following this flag -h displays help
Genotype
Genotypes variants for each sample.
#genotypes variants found in candidate.sites.vcf from sample.bam vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf
usage : vt genotype [options]
options : -r reference sequence fasta file [] -s sample ID [] -o output VCF file [-] -b input BAM file [] -i input candidate VCF file [] -- ignores the rest of the labeled arguments following this flag -h displays help
Pedigree File
vt understands an augmented version introduced by Hyun of the PED described by plink. The pedigree file format is as follows with the following mandatory fields:
Field | Description | Valid Values | Missing Values |
---|---|---|---|
Family ID Individual ID |
ID of this family ID(s) of this individual (comma separated) |
[A-Za-z0-9_]+ [A-Za-z0-9_]+(,[A-Za-z0-9_]+)* |
0 cannot be missing |
Examples:
ceu NA12878 NA12891 NA12892 female -9 yri NA19240 NA19239 NA19238 female -9
ceu NA12878 NA12891 NA12892 2 -9 yri NA19240 NA19239 NA19238 2 -9
#allows tools like profile_mendelian to detect duplicates and check for concordance ceu NA12878,NA12878A NA12891 NA12892 female case yri NA19240 NA19239 NA19238 female control
#allows tools like profile_mendelian to detect duplicates and check for concordance ceu NA12412 0 0 female case yri NA19650 0 0 female control
Resource Bundle
GRCh37
Files are based on hs37d5.fa made by Heng Li.
- External : GRCh37 resource bundle
- Internal : /net/fantasia/home/atks/ref/vt/grch37
Read here for contents.
GRCh38
Files are based on hs38DH.fa made by Heng Li. Note that many of the references are simply lifted over from GRCh37 using Picard's liftover tool with the default options.
- External : GRCh38 resource bundle
- Internal : /net/fantasia/home/atks/ref/vt/grch38
Read here for contents.
FAQ
1. vt cannot retrieve sequences from my reference sequence file
It is common to use reference files based on the UCSC browser's database and from the Genome Reference Consortium. For example, HG19 vs Grch37. The key difference is that chromosome 1 is represented as chr1 and 1 respectively in the FASTA files from these 2 sources. Just use the appropriate FASTA file that was used to generate your VCF file originally.
Another common issue is due to the corruption of the index file of the reference sequence; say for a reference file named hs37d5.fa or hs37d5.fa.gz, simply delete the index file denoted by hs37d5.fa.fai or hs37d5.fa.gz.fai and run the vt command again. A new index file will be generated automatically.
How to cite vt?
If you use normalize:
Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. Unified Representation of Genetic Variants. Bioinformatics (2015) 31(13): 2202-2204
Maintained by
This page is maintained by Adrian