Vt
Introduction
vt is a variant tool set that discovers short variants from Next Generation Sequencing data. Features are being released on GitHub incrementally while a major rewrite is underway.
Installation
The source files are hosted on GitHub. vt uses htslib, and a copy of a developmental freeze of htslib is stored as part of the vt repository to ensure compatibility.
To install, perform the following steps:
  #this will create a directory named vt in the directory you cloned the repository
  1. git clone https://github.com/atks/vt.git

  #change directory to vt
  2. cd vt

  #run make, note that compilers need to support the c++0x standard and that the
  #default compiler is clang++, simply change the CXX variable in the Makefile to
  #g++ if you do not have clang++.
  3. make
Building has been tested on Linux and Mac systems with gcc 4.3 and above and with clang 3.4.
General Features
Common options
-i multiple intervals in <seq>:<start>-<end> format delimited by commas.
-I multiple intervals in <seq>:<start>-<end> format listed in a text file line by line.
-o defines the output file; the default is STDOUT. Output sent to STDOUT can also be switched to the binary version of the format: uncompressed VCF and BCF streams are indicated by - and + respectively.
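As an illustration of the interval syntax accepted by -i, the following Python sketch parses a comma-delimited list of <seq>:<start>-<end> intervals. The function name parse_intervals is hypothetical, and treating a bare sequence name (such as the 20 passed to -i in later examples) as a whole-sequence interval is an assumption; this is not vt's own parser.

```python
def parse_intervals(text):
    """Parse 'seq:start-end,seq:start-end,...' into (seq, start, end) tuples.

    A bare sequence name such as '20' is returned with start and end set to
    None, which this sketch takes to mean the whole sequence (an assumption).
    """
    intervals = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        if ":" not in item:
            intervals.append((item, None, None))
            continue
        # rpartition tolerates sequence names that themselves contain ':'
        seq, _, span = item.rpartition(":")
        start, _, end = span.partition("-")
        intervals.append((seq, int(start), int(end)))
    return intervals


print(parse_intervals("20:1000-2000,X:1-500"))
```

The same function could be applied line by line to a file passed with -I.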
Uncompressed BCF streams
htslib is designed with BCF as the underlying data structure and it has incorporated awareness of uncompressed BCF streams in the i/o API. One may use this feature to stream uncompressed BCF records to save on computational time spent on (de)compression.
  #using textual VCF streams indicated by -
  cat mills.vcf | vt normalize - -r hs37d5.fa | vt mergedups - -o out.bcf

  #using uncompressed BCF streams indicated by +
  cat mills.vcf | vt normalize - -r hs37d5.fa -o + | vt mergedups + -o out.bcf
In this example, the former took 0.84s while the latter took 0.64s to process. (24% speed up!)
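The quoted speed up is the run time saved relative to the textual pipeline. A short Python check of that arithmetic, using only the two timings given above (the function name is hypothetical):

```python
def relative_saving(before_s, after_s):
    """Fraction of the original run time that was saved."""
    return (before_s - after_s) / before_s


saving = relative_saving(0.84, 0.64)
print(f"{saving:.1%}")  # 23.8%, which rounds to the 24% quoted above
```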
Alternate headers
As BCF is a restrictive format of VCF where all meta data must be present in the header, vt provides a mechanism to read an alternative header for VCF files that do not have a complete header. Simply provide a header file stub named as <vcf-file>.hdr and vt will automatically read it instead of the original header in <vcf-file>.
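The lookup rule described above can be sketched in Python as follows. This only illustrates the naming convention (the file name plus a .hdr suffix); the function name is hypothetical and vt's actual implementation lives in its C++ sources.

```python
import os


def header_source(vcf_path):
    """Return the file the header would be read from under the rule above:
    the '<vcf-file>.hdr' stub if it exists, otherwise the VCF file itself."""
    stub = vcf_path + ".hdr"
    return stub if os.path.isfile(stub) else vcf_path


print(header_source("mills.vcf"))
```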
VCF Manipulation
View
Views a VCF or VCF.GZ or BCF file.
  #views mills.bcf and outputs to standard out
  vt view -h mills.bcf

  #views mills.bcf and locally sorts it in a 10000bp window and outputs to out.bcf
  vt view -h -w 10000 mills.bcf -o out.bcf
usage : vt view [options] <in.vcf>
options : -o  output VCF/VCF.GZ/BCF file [-]
          -w  local sorting window size [0]
          -s  print site information only without genotypes [false]
          -h  print header [false]
          -p  print options and summary []
          -I  file containing list of intervals []
          -i  intervals []
          -?  displays help
Concatenate
Concatenate VCF files. Assumes individuals are in the same order and files share the same header.
  #concatenates chr1.mills.vcf and chr2.mills.vcf and outputs to mills.vcf
  vt concat chr1.mills.vcf chr2.mills.vcf -o mills.vcf
usage : vt concat [options] <in1.vcf>...
options : -s  print site information only without genotypes [false]
          -p  print options and summary []
          -L  file containing list of input VCF files
          -o  output VCF file [-]
          -I  file containing list of intervals []
          -i  intervals
          -?  displays help
Index
Indexes a VCF.GZ or BCF file.
  #indexes mills.bcf
  vt index mills.bcf
usage : vt index [options] <in.vcf>
options : -p  print options and summary []
          --  ignores the rest of the labeled arguments following this flag
          -h  displays help
Sorting
Local sorting can be done with view by setting the -w option to a nonzero value.
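One common way to sort records that are only locally out of order is to hold them in a bounded buffer and release a record once nothing later can precede it. The Python sketch below shows this idea for positions on a single chromosome. It is an illustration under that assumption, not vt's implementation, and the function name is hypothetical.

```python
import heapq


def local_sort(positions, window):
    """Yield positions in sorted order using a bounded look-ahead buffer.

    The output is fully sorted as long as no record appears after another
    record whose position exceeds it by more than `window` bp.
    """
    heap = []
    for pos in positions:
        heapq.heappush(heap, pos)
        # Anything more than `window` bp behind the newest record can no
        # longer be overtaken, so it is safe to emit.
        while heap and heap[0] < pos - window:
            yield heapq.heappop(heap)
    # Flush whatever is still buffered at the end of the input.
    while heap:
        yield heapq.heappop(heap)


print(list(local_sort([100, 90, 300, 250, 20000, 19995, 40000], 10000)))
```

The buffer never holds more than the records that fall inside one window, which is why this is cheaper than a full sort on large files.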
Normalization
Normalize variants in a VCF file. Normalized variants may have their positions changed; in such cases, the normalized variants are reordered and output in an ordered fashion. The local reordering takes place over a window of 10000 base pairs.
  #normalize variants and write out to mills.normalized.vcf
  vt normalize mills.vcf -r seq.fa -o mills.normalized.vcf

  #normalize variants, send to standard out and remove duplicates.
  vt normalize mills.vcf -r seq.fa | vt mergedups - -o mills.normalized.merged.vcf

  #variants that are normalized will be annotated with an OLD_VARIANT info tag.
  #CHROM  POS       ID  REF           ALT  QUAL  FILTER  INFO
  19      29238772  .   C             G    .     PASS    VT=SNP;OLD_VARIANT=19:29238771:TC,TG
  20      60674709  .   GCCCAGCCCCAC  G    .     PASS    VT=INDEL;OLD_VARIANT=20:60674718:CACCCCAGCCCC,C
  #this shows a sample output with the normalization operations that were used
  #categorized into 5 categories each for biallelic and multiallelic variants.

  stats: biallelic
            no. left trimmed                      : 2
            no. left trimmed and left aligned     : 0
            no. left trimmed and right trimmed    : 0
            no. left aligned                      : 63118
            no. right trimmed                     : 4

         multiallelic
            no. left trimmed                      : 0
            no. left trimmed and left aligned     : 0
            no. left trimmed and right trimmed    : 0
            no. left aligned                      : 0
            no. right trimmed                     : 0

         no. variants observed                    : 644696
usage : vt normalize [options] <in.vcf>
options : -o  output VCF file [-]
          -I  file containing list of intervals []
          -i  intervals []
          -r  reference sequence fasta file []
          --  ignores the rest of the labeled arguments following this flag
          -h  displays help
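The trimming and left-alignment operations counted in the statistics above can be sketched in Python as follows. This is a minimal version of the usual procedure (drop shared trailing bases, extend to the left whenever an allele becomes empty, then drop shared leading bases). It assumes 1-based positions, a reference held as a plain string, and at least two distinct alleles; the function name is hypothetical and vt's own implementation may differ in detail.

```python
def normalize(reference, pos, alleles):
    """Return (pos, alleles) trimmed and left-aligned against `reference`.

    reference: sequence of one chromosome as a string
    pos:       1-based position of the first base of the REF allele
    alleles:   [REF, ALT1, ...] with at least two distinct alleles
    """
    alleles = list(alleles)
    changed = True
    while changed:
        changed = False
        same_last_base = all(alleles) and len({a[-1] for a in alleles}) == 1
        can_shrink = min(len(a) for a in alleles) > 1 or pos > 1
        if same_last_base and can_shrink:
            # right trim: drop the shared rightmost base
            alleles = [a[:-1] for a in alleles]
            changed = True
        if not all(alleles):
            # an allele became empty: extend every allele one base to the left
            pos -= 1
            alleles = [reference[pos - 1] + a for a in alleles]
            changed = True
    # left trim: drop shared leading bases while every allele keeps one base
    while min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# A deletion of one A written at the right end of an AAA run moves to the left end.
print(normalize("GCAAAT", 4, ["AA", "A"]))  # (2, ['CA', 'C'])
```

The first record in the sample VCF above (TC,TG at 29238771 becoming C,G at 29238772) is an instance of the final left-trimming step.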
Merge duplicate variants
Merges duplicate variants by position, with the option of considering alleles. Duplicates are not combined: the duplicate variant that appears later in the VCF file is simply discarded.
  #merge duplicate variants and save output in mills.merged.vcf
  vt mergedups mills.vcf -o mills.merged.vcf
usage : vt mergedups [options] <in.vcf>
options : -o  output VCF file [-]
          -p  merge by position [false]
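The behaviour described for this command (keep the first record, discard later duplicates, optionally compare by position only) can be sketched in Python. The record layout and function name are hypothetical, and a set spanning the whole input is used for simplicity; this is not how vt itself is written.

```python
def merge_duplicates(records, by_position=False):
    """Keep the first occurrence of each variant and discard later duplicates.

    records: iterable of (chrom, pos, ref, alt) tuples in file order
    by_position: compare on (chrom, pos) only, mirroring the -p option;
                 otherwise alleles are part of the comparison
    """
    seen = set()
    kept = []
    for chrom, pos, ref, alt in records:
        key = (chrom, pos) if by_position else (chrom, pos, ref, alt)
        if key in seen:
            continue  # a duplicate that appears later in the file is dropped
        seen.add(key)
        kept.append((chrom, pos, ref, alt))
    return kept


records = [("20", 100, "A", "G"), ("20", 100, "A", "T"), ("20", 100, "A", "G")]
print(merge_duplicates(records))
print(merge_duplicates(records, by_position=True))
```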
Peek
Summarizes the variants in a VCF file.
  #summarizes the variants found in mills.vcf
  vt peek mills.vcf

  #This is a sample output of a peek command which summarizes the variants found in a VCF file.
  stats: No. of samples                     : 0
         No. of chromosomes                 : 24

         No. of SNPs                        : 80171904
             biallelic (ts/tv)              : 79548537 (1.96) [52632749/26915788]
             3 alleles                      : 618207
             4 alleles                      : 5160

         No. of MNPs                        : 273710
             biallelic (ts/tv)              : 272415 (0.84) [258918/308014]
             multiallelic                   : 1295

         No. Indels                         : 5179595
             biallelic (ins/del)            : 4769442 (0.59)
                 insertions                 : 1769372
                 deletions                  : 3000070
             multiallelic                   : 410153

         No. SNP/Indels                     : 659557
             biallelic (ins/del)            : 87867 (0.48)
                 insertions                 : 28649
                 deletions                  : 59218
             multiallelic                   : 571690

         No. MNP/Indels                     : 55552
             biallelic (ins/del)            : 34161 (0.75)
                 insertions                 : 14624
                 deletions                  : 19537
             multiallelic                   : 21391

         No. SNP/MNP/Indels                 : 15857 (multiallelic)

         No. SNP/MNP                        : 121965 (multiallelic)

         No. of clumped variants            : 1574499
             biallelic                      : 295257
             multiallelic                   : 1279242

         No. of reference                   : 0

         No. of observed variants           : 88052639
         No. of unclassified variants       : 0
usage : vt peek [options] <in.vcf>
options : -o  output VCF file [-]
          -I  file containing list of intervals []
          -i  intervals []
          -r  reference sequence fasta file []
          --  ignores the rest of the labeled arguments following this flag
          -h  displays help
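The ts/tv figure in the sample output is the ratio of transitions (purine to purine or pyrimidine to pyrimidine substitutions) to transversions. A small Python sketch of that classification, with hypothetical function names, followed by a check against the biallelic SNP counts shown above:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs.

    Assumes at least one transversion is present.
    """
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv


# Counts in square brackets from the biallelic SNP line of the sample output.
print(round(52632749 / 26915788, 2))  # 1.96
```

The ins/del figures are plain count ratios in the same way; for example 1769372 / 3000070 rounds to the 0.59 reported for biallelic Indels.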
Annotate Variants
Annotates variants in a VCF file. The GENCODE annotation file should be bgzipped and indexed with tabix. This is available in the resource bundle.
  #annotates the variants found in mills.vcf
  vt annotate_variants mills.vcf -r hs37d5.fa -g gencode.gtf.gz

  #annotates variants with the following fields
  ##INFO=<ID=VT,Number=1,Type=String,Description="Variant Type - SNP, MNP, INDEL, CLUMPED">
  ##INFO=<ID=RU,Number=1,Type=String,Description="Repeat unit in a STR or Homopolymer">
  ##INFO=<ID=RL,Number=1,Type=Integer,Description="Repeat Length">
  ##INFO=<ID=FS,Number=0,Type=Flag,Description="Frameshift INDEL">
  ##INFO=<ID=NFS,Number=0,Type=Flag,Description="Non Frameshift INDEL">
usage : vt annotate_variants [options] <in.vcf>
options : -g  GENCODE annotations GTF file []
          -r  reference sequence fasta file []
          -o  output VCF file [-]
          -I  file containing list of intervals []
          -i  intervals
          -?  displays help
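The FS and NFS flags distinguish coding Indels by whether they shift the reading frame, which depends only on whether the length change is a multiple of 3. A minimal Python sketch of that test follows. It assumes the Indel is already known to lie in a coding region (the part that needs the GENCODE annotation), and the function name is hypothetical.

```python
def is_frameshift(ref, alt):
    """True if the indel changes the sequence length by a non-multiple of 3."""
    return (len(alt) - len(ref)) % 3 != 0


print(is_frameshift("GCCCAGCCCCAC", "G"))  # 11 bp deletion
print(is_frameshift("GCCC", "G"))          # 3 bp deletion
```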
Profile Indels
Profiles the Indels in a VCF file against the reference data sets listed in the file given by -g.
  #profile indels found in mills.vcf
  vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa -i 20

  #this is a sample output for indel profiling.
  # square brackets contain the ins/del ratio.
  # for the FS/NFS field, that is the proportion of coding indels that are frame shifted.
  # The numbers in parentheses are the counts of frame shift and non frame shift indels respectively.

  data set
    No Indels :      46974 [0.89]
       FS/NFS :       0.26 (8/23)

  dbsnp
    A-B           30704 [0.92]
    A&B           16270 [0.83]
    B-A         2049488 [1.52]
    Precision     34.6%
    Sensitivity    0.8%

  mills
    A-B           43234 [0.88]
    A&B            3740 [1.00]
    B-A          203278 [0.98]
    Precision      8.0%
    Sensitivity    1.8%

  mills.chip
    A-B           46847 [0.89]
    A&B             127 [0.90]
    B-A            8777 [0.93]
    Precision      0.3%
    Sensitivity    1.4%

  affy.exome.chip
    A-B           46911 [0.89]
    A&B              63 [0.43]
    B-A           33997 [0.47]
    Precision      0.1%
    Sensitivity    0.2%
  # This file contains information on how to process reference data sets.
  #
  # dataset - name of data set, this label will be printed.
  # type    - True Positives (TP) and False Positives (FP)
  #           overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively
  #         - annotation
  #           file is used for GENCODE annotation of frame shift and non frame shift Indels
  # filter  - filter applied to variants for this particular data set
  # path    - path of indexed BCF file
  #dataset              type        filter             path
  dbsnp                 TP          INDEL              dbsnp.xxsnps_indels.sites.bcf
  mills                 TP          INDEL              mills.208620indels.sites.bcf
  mills.chip            TP          INDEL              mills.chip.158samples.8904indels.sites.bcf
  #mills.chip.common    TP          INDEL&&AF>0.005    grch37/mills.chip.158samples.8904indels.sites.bcf
  affy.exome.chip       TP          INDEL              affy.exome.chip.1249samples.316520variants.sites.bcf
  #affy.exome.chip.poly TP          INDEL&&AC!=0       grch37/affy.exome.chip.1249samples.316520variants.sites.bcf
  #affy.exome.chip.mono FP          INDEL&&AC=0        grch37/affy.exome.chip.1249samples.316520variants.sites.bcf
  gencode.v19           annotation  .                  gencode.v19.annotation.gtf.gz
usage : vt profile_indels [options] <in.vcf>
options : -g  file containing list of reference datasets []
          -I  file containing list of intervals []
          -i  intervals []
          -r  reference sequence fasta file []
          -?  displays help
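The Precision and Sensitivity figures in the sample output are consistent with treating A as the input data set and B as the reference data set: precision is the share of A that is also in B, and sensitivity is the share of B that is also in A. The Python sketch below reproduces the dbsnp row under that reading (the function name is hypothetical).

```python
def overlap_stats(a_only, both, b_only):
    """Return (precision, sensitivity) from the A-B, A&B and B-A counts."""
    precision = both / (a_only + both)    # fraction of A found in B
    sensitivity = both / (both + b_only)  # fraction of B found in A
    return precision, sensitivity


precision, sensitivity = overlap_stats(30704, 16270, 2049488)  # dbsnp row
print(f"Precision {precision:.1%} Sensitivity {sensitivity:.1%}")
# Precision 34.6% Sensitivity 0.8%
```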
Profile Mendelian Errors
Profiles Mendelian errors among the trios described in a pedigree file.
  #profile mendelian errors found in vt.genotypes.bcf, generate tables in the directory mendel, requires pdflatex.
  vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel

  #this is a sample output for mendelian error profiling.
  #R and A stand for reference and alternate allele respectively.
  #Error% - mendelian error (confounded with de novo mutation)
  #HomHet - Homozygous-Heterozygous genotype ratios
  #Het% - proportion of hets

  Mendelian Errors

  Father  Mother      R/R      R/A      A/A  Error(%)  HomHet  Het(%)
  R/R     R/R       14889      210       38      1.64     nan     nan
  R/R     R/A        3403     3497       74      1.06    0.97   50.68
  R/R     A/A         176     1482      155     18.26     nan     nan
  R/A     R/R        3665     3652       68      0.92    1.00   49.91
  R/A     R/A        1015     3151      990      0.00    0.64   61.11
  R/A     A/A          43     1300     1401      1.57    1.08   48.13
  A/A     R/R         172     1365      147     18.94     nan     nan
  A/A     R/A          47     1164     1183      1.96    1.02   49.60
  A/A     A/A          20       78     5637      1.71     nan     nan

  Parental            R/R      R/A      A/A  Error(%)  HomHet  Het(%)
  R/R     R/R       14889      210       38      1.64     nan     nan
  R/R     R/A        7068     7149      142      0.99    0.99   50.28
  R/R     A/A         348     2847      302     18.59     nan     nan
  R/A     R/A        1015     3151      990      0.00    0.64   61.11
  R/A     A/A          90     2464     2584      1.75    1.05   48.81
  A/A     A/A          20       78     5637      1.71     nan     nan

  Parental            R/R      R/A      A/A  Error(%)  HomHet  Het(%)
  HOM     HOM       14909      288     5675      1.66     nan     nan
  HOM     HET        7158     9613     2726      1.19    1.00   49.90
  HET     HET        1015     3151      990      0.00    0.64   61.11
  HOMREF  HOMALT      348     2847      302     18.59     nan     nan

  total mendelian error :  2.505%
  no. of trios          :  2
  no. of variants       :  25346
profile_mendelian v0.5
usage : vt profile_mendelian [options] <in.vcf>
options : -q  minimum genotype quality
          -d  minimum depth
          -r  reference sequence fasta file []
          -x  output latex directory []
          -p  pedigree file
          -I  file containing list of intervals []
          -i  intervals
          -?  displays help
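The Error(%) column and the total in the sample output are consistent with counting a child genotype as an error when it cannot be formed from one allele of each parent. A minimal Python sketch of that rule, applied to the Father/Mother table above, is given below. Genotypes are encoded as alternate allele counts (0 = R/R, 1 = R/A, 2 = A/A); the function names are hypothetical and this is not vt's code.

```python
def transmissible(genotype):
    """Alleles (0 = R, 1 = A) a parent with this alt-allele count can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_consistent(father, mother, child):
    """True if the child genotype can be built from one allele per parent."""
    return any(f + m == child
               for f in transmissible(father)
               for m in transmissible(mother))


def error_rate(table):
    """table maps (father, mother) to child counts [R/R, R/A, A/A]."""
    errors = total = 0
    for (father, mother), counts in table.items():
        for child, n in enumerate(counts):
            total += n
            if not is_consistent(father, mother, child):
                errors += n
    return errors / total


# Father/Mother table from the sample output above.
TABLE = {
    (0, 0): [14889, 210, 38], (0, 1): [3403, 3497, 74], (0, 2): [176, 1482, 155],
    (1, 0): [3665, 3652, 68], (1, 1): [1015, 3151, 990], (1, 2): [43, 1300, 1401],
    (2, 0): [172, 1365, 147], (2, 1): [47, 1164, 1183], (2, 2): [20, 78, 5637],
}
print(f"{error_rate(TABLE):.3%}")  # 2.505%
```

Note that every child genotype is consistent with two heterozygous parents, which is why the R/A x R/A row reports an error of 0.00.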
Variant Calling
Discover
Discovers variants from reads in a BAM file.
  #discover variants from NA12878.bam and write to stdout
  vt discover -b NA12878.bam -s NA12878 -r hs37d5.fa -i 20 -v snps,indels,mnps
usage : vt discover [options]
options : -b  input BAM file
          -v  variant types [snps,mnps,indels]
          -f  fractional evidence cutoff for candidate allele [0.1]
          -e  evidence count cutoff for candidate allele [2]
          -q  base quality cutoff for bases [13]
          -m  MAPQ cutoff for alignments [20]
          -s  sample ID
          -r  reference sequence fasta file []
          -o  output VCF file [-]
          -I  file containing list of intervals []
          -i  intervals []
          --  ignores the rest of the labeled arguments following this flag
          -h  displays help
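To make the roles of the -e and -f cutoffs concrete, here is a hedged Python sketch of how a count cutoff and a fractional cutoff are typically combined when picking candidate alleles. It assumes that both cutoffs are inclusive and that the fraction is taken over the read depth at the site; neither detail is stated above, and the function name is hypothetical.

```python
def passes_cutoffs(evidence, depth, min_count=2, min_fraction=0.1):
    """Decide whether an allele has enough read support to be a candidate.

    evidence:     number of reads supporting the allele
    depth:        number of usable reads covering the site
    min_count:    evidence count cutoff (default mirrors -e)
    min_fraction: fractional evidence cutoff (default mirrors -f)
    """
    if depth == 0:
        return False
    return evidence >= min_count and evidence / depth >= min_fraction


print(passes_cutoffs(3, 20))   # enough reads and 15% of the depth
print(passes_cutoffs(3, 100))  # enough reads but only 3% of the depth
```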
Merge candidate variants
Merge candidate variants across samples. Each VCF file is required to have the FORMAT flags E and N and should have exactly one sample.
  #merge candidate variants from VCFs in candidates.txt and output in candidates.sites.vcf
  vt merge_candidate_variants candidates.txt -o candidates.sites.vcf
usage : vt merge_candidate_variants [options]
options : -L  file containing list of input VCF files
          -o  output VCF file [-]
          -I  file containing list of intervals []
          -i  intervals
          --  ignores the rest of the labeled arguments following this flag
          -h  displays help
Construct Probes
Construct probes for genotyping a variant.
  #construct probes from candidates.sites.bcf and output to standard out
  vt construct_probes candidates.sites.bcf -r ref.fa
usage : vt construct_probes [options] <in.vcf>
options : -o  output VCF file [-]
          -f  minimum flank length [20]
          -r  reference sequence fasta file []
          -I  file containing list of intervals []
          -i  intervals []
          --  ignores the rest of the labeled arguments following this flag
          -h  displays help
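As a rough picture of what a genotyping probe looks like, the Python sketch below builds one probe per allele by placing the allele between the same left and right reference flanks. This is an assumption about the general technique rather than a description of vt's algorithm, which may for instance lengthen the flanks beyond the minimum; the function name and arguments are hypothetical.

```python
def construct_probes(reference, pos, ref, alt, flank=20):
    """Return (ref_probe, alt_probe) for a variant.

    reference: sequence of one chromosome as a string
    pos:       1-based position of the first base of the REF allele
    flank:     number of reference bases taken on each side
    """
    start = pos - 1          # 0-based start of the REF allele
    end = start + len(ref)   # 0-based end (exclusive) of the REF allele
    left = reference[max(0, start - flank):start]
    right = reference[end:end + flank]
    return left + ref + right, left + alt + right


print(construct_probes("ACGTACGTAC", 5, "A", "G", flank=2))
# ('GTACG', 'GTGCG')
```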
Genotype
Genotypes variants for each sample.
  #genotypes variants found in candidates.sites.vcf from sample.bam
  vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf
usage : vt genotype [options]
options : -r  reference sequence fasta file []
          -s  sample ID []
          -o  output VCF file [-]
          -b  input BAM file []
          -i  input candidate VCF file []
          --  ignores the rest of the labeled arguments following this flag
          -h  displays help
Resource Bundle
Maintained by
This page is maintained by Adrian