Line 1: |
Line 1: |
| + | = Introduction = |
| | | |
| + | vt is a variant tool set that discovers short variants from Next Generation Sequencing data. |
| + | |
| + | = Installation = |
| + | |
| + | == General == |
| + | |
| + | The source files are hosted on GitHub. [https://github.com/samtools/htslib htslib] is |
| + | used, and a copy of a development freeze is stored as part of the vt repository to |
| + | ensure compatibility. |
| + | |
| + | To install, perform the following steps: |
| + | |
| + | #this will create a directory named vt in the directory where you run the clone command |
| + | 1. git clone https://github.com/atks/vt.git <br> |
| + | #change directory to vt |
| + | 2. cd vt <br> |
| + | #update submodules |
| + | 3. git submodule update --init --recursive <br> |
| + | #run make, note that compilers need to support the c++0x standard |
| + | 4. make <br> |
| + | #you can test the build |
| + | 5. make test |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | An expected output when all is well for the tests is shown here. (click expand =>) |
| + | <div class="mw-collapsible-content"> |
| + | user@server:~/vt$ make test |
| + | test/test.sh |
| + | ++++++++++++++++++++++ |
| + | Tests for vt normalize |
| + | ++++++++++++++++++++++ |
| + | testing normalize |
| + | output VCF file : ok |
| + | output logs : ok |
| + | +++++++++++++++++++++++++++++++ |
| + | Tests for vt decompose_blocksub |
| + | +++++++++++++++++++++++++++++++ |
| + | testing decompose_blocksub of even-length blocks |
| + | output VCF file : ok |
| + | output logs : ok |
| + | testing decompose_blocksub with alignment |
| + | output VCF file : ok |
| + | output logs : ok |
| + | testing decompose_blocksub of phased even-length blocks |
| + | output VCF file : ok |
| + | output logs : ok |
| + | ++++++++++++++++++++++ |
| + | Tests for vt decompose |
| + | ++++++++++++++++++++++ |
| + | testing decompose for a triallelic variant |
| + | output VCF file : ok |
| + | output logs : ok <br> |
| + | Passed tests : 5 / 5 |
| + | </div> |
| + | </div> |
| + | |
| + | === Mac === |
| + | |
| + | You may install vt via homebrew. |
| + | |
| + | brew tap brewsci/bio |
| + | brew tap brewsci/science |
| + | |
| + | brew install brewsci/bio/vt |
| + | |
| + | |
| + | Building has been tested on Linux and Mac systems on gcc 4.8.1 and clang 3.4. <br> |
| + | Some features of C++11 are used, thus there is a need for newer versions of gcc and clang. |
| + | |
| + | = Updating = |
| + | |
| + | vt is currently under heavy development, so you will probably need to update often. |
| + | |
| + | #remove all object files |
| + | #you need to do this as the source files and the static libraries might have changed, so stale object files need to be removed. |
| + | 1. make clean <br> |
| + | #update source files |
| + | 2. git pull <br> |
| + | #compile and link, the -j option tells make to run up to 40 independent commands in parallel |
| + | 3. make -j 40 |
| + | |
| + | = General Features and notes = |
| + | |
| + | == Common options == |
| + | |
| + | -i multiple intervals in <seq>:<start>-<end> format delimited by commas. |
| + | |
| + | -I multiple intervals in <seq>:<start>-<end> format listed in a text file line by line. |
| + | |
| + | -o defines the output file, with STDOUT as the default. |
| + | vt recognizes the appropriate output by file extension. |
| + | <name>.vcf - uncompressed VCF |
| + | <name>.vcf.gz - compressed VCF |
| + | <name>.bcf - BCF |
| + | You may modify the STDOUT to output the binary version of the format. Uncompressed |
| + | VCF and BCF streams are indicated by - and + respectively. |
| + | |
| + | -f filter expression |
| + | |
| + | -s sequential region selection as opposed to random access of regions specified by the -i option. |
| + | This is useful when you want to select many close-by regions. While the -i option works, |
| + | it is less efficient and also selects a variant multiple times if it overlaps 2 regions. This |
| + | option iterates through the variants in the file sequentially and checks for overlap with the |
| + | bed file given. |
| + | |
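The interval syntax described above can be illustrated with a short Python sketch. This is not vt's own parser: the function names `parse_interval` and `parse_intervals` are hypothetical, and the sketch assumes that contig names contain no colon and that a bare contig name (as in `-i 20`) selects the whole sequence.

```python
def parse_interval(text):
    """Parse '<seq>:<start>-<end>' or a bare '<seq>' into a tuple.

    Hypothetical helper for illustration; assumes no ':' in contig names.
    """
    seq, sep, span = text.partition(":")
    if not sep:
        # Bare contig name: no coordinates were given.
        return (seq, None, None)
    start, _, end = span.partition("-")
    return (seq, int(start), int(end))


def parse_intervals(arg):
    """Parse a comma-delimited list of intervals, as passed to -i."""
    return [parse_interval(token) for token in arg.split(",") if token]


print(parse_intervals("20:1-1000,X:5-10"))
print(parse_intervals("20"))
```

A file given to -I would be handled the same way, one interval per line instead of one per comma-separated token.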
| + | == Uncompressed BCF streams == |
| + | |
| + | htslib is designed with BCF as the underlying data structure and it has incorporated |
| + | awareness of uncompressed BCF streams in the i/o API. One may use this feature to |
| + | stream uncompressed BCF records to save on computational time spent on (de)compression. |
| + | |
| + | #using textual VCF streams indicated by - |
| + | cat mills.vcf | vt normalize - -r hs37d5.fa | vt uniq - -o out.bcf |
| + | |
| + | #using uncompressed BCF streams indicated by + |
| + | cat mills.vcf | vt normalize - -r hs37d5.fa -o + | vt uniq + -o out.bcf |
| + | |
| + | In this example, the former took 0.84s while the latter took 0.64s to process. (24% speed up!) |
| + | |
| + | == Filters == |
| + | |
| + | For some programs, you may define a filter via the -f option. |
| + | |
| + | This allows you to only analyse biallelic indels that are passed on chromosome 20. |
| + | vt profile_na12878 vt.bcf -g na12878.reference.txt -r genome.fa -f "N_ALLELE==2&&VTYPE==INDEL&&PASS" -i 20 |
| + | |
| + | This allows you to extract biallelic indels that are passed on chromosome 20. |
| + | vt view vt.bcf -f "N_ALLELE==2&&VTYPE==INDEL&&PASS" -i 20 |
| + | |
| + | Other examples of filters |
| + | |
| + | #all variants with a SNP in them |
| + | VTYPE&SNP |
| + | #Simple insertions of length 1 |
| + | VTYPE==INDEL&&DLEN==1 |
| + | #Indels of length 1 |
| + | VTYPE==INDEL&&LEN==1 |
| + | |
| + | Variant characteristics |
| + | VTYPE,N_ALLELE,DLEN,LEN,VARIANT_CONTAINS_N |
| + | |
| + | Variant value types |
| + | SNP,MNP,INDEL,CLUMPED |
| + | |
| + | Biallelic SNPs only : VTYPE==SNP&&N_ALLELE==2 |
| + | Biallelic Indels with embedded SNP : VTYPE==(SNP|INDEL)&&N_ALLELE==2 |
| + | Biallelic variants involving insertions : VTYPE&INDEL&&DLEN>0&&N_ALLELE==2 |
| + | Biallelic variants involving 1bp variants : LEN==1&&N_ALLELE==2 |
| + | Variants with explicit sequences with no Ns : ~VARIANT_CONTAINS_N |
| + | |
| + | REF field |
| + | REF |
| + | |
| + | ALT field |
| + | ALT |
| + | |
| + | QUAL field |
| + | QUAL |
| + | |
| + | FILTER fields |
| + | PASS, FILTER.<tag> |
| + | |
| + | INFO fields |
| + | INFO.<tag> |
| + | |
| + | A/C SNPs : REF=='A' && ALT=='C' |
| + | AC type of STRs : REF=~'^.(AC)+$' || ALT=~'^.(AC)+$' |
| + | Passed biallelic SNPs only : PASS&&VTYPE==SNP&&N_ALLELE==2 |
| + | Passed Common biallelic SNPs only : PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005 |
| + | Passed Common biallelic SNPs or rare indels : (PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005)||(VTYPE&INDEL&&INFO.AF<=0.005) |
| + | Passed Common biallelic SNPs or rare indels : ((PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005)||(VTYPE&INDEL&&INFO.AF<=0.005))&&QUAL>100 |
| + | with quality greater than 100 |
| + | Failed rare variants : ~PASS&&(INFO.AC/INFO.AN<0.005) |
| + | |
| + | [http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC1 Regular expression] matching PERL style (implemented with pcre2) |
| + | Sometimes, an INFO field will contain several values in a string with functional annotation. To match what you want, |
| + | just use INFO.ANNO=~'<perl regular expression>' |
| + | |
| + | Passed variants in intergenic regions or UTR : PASS&&INFO.ANNO=~'Intergenic|UTR' |
| + | Passed variants in intergenic regions or UTR ignoring case : PASS&&INFO.ANNO=~'(?i)Intergenic|UTR' <br> |
| + | pcre2's '(?i)Intergenic|UTR' is equivalent to PERL's '/intergenic|UTR/i' |
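The regular expressions above can be tried out in Python before putting them in a filter. This is only an approximation: vt uses pcre2, while the sketch uses Python's `re` module, which behaves the same way for these particular patterns. The helper names `regex_match` and `regex_no_match` are hypothetical stand-ins for the `=~` and `~~` operators.

```python
import re


def regex_match(pattern, value):
    """Stand-in for =~ : True if the pattern matches anywhere in value."""
    return re.search(pattern, value) is not None


def regex_no_match(pattern, value):
    """Stand-in for ~~ : the negation of =~."""
    return not regex_match(pattern, value)


# AC type of STRs: one anchor base followed by one or more AC units.
print(regex_match(r"^.(AC)+$", "GACACAC"))   # True
print(regex_match(r"^.(AC)+$", "GACACA"))    # False, trailing A breaks the repeat

# Case-sensitive versus case-insensitive annotation matching.
print(regex_match(r"Intergenic|UTR", "intergenic_region"))      # False
print(regex_match(r"(?i)Intergenic|UTR", "intergenic_region"))  # True
```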
| + | |
| + | Operations |
| + | == : equivalence for strings and numbers |
| + | != : not equal |
| + | =~ : regular expression match for strings only |
| + | ~~ : negation of =~, equivalent to PERL's !~. This notation is used because BASH interprets ! as a history expansion |
| + | ~ : logical not |
| + | && : logical and |
| + | || : logical or |
| + | & : bitwise and |
| + | | : bitwise or |
| + | + : add |
| + | - : subtract |
| + | * : multiply |
| + | / : divide |
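The difference between `VTYPE==INDEL` and `VTYPE&INDEL` is easiest to see if the variant type is treated as a set of bit flags. The sketch below assumes flag values of 1, 2, 4 and 8 for SNP, MNP, INDEL and CLUMPED; these numbers are chosen for illustration and are not taken from the vt source.

```python
# Assumed bit flag values, for illustration only.
SNP, MNP, INDEL, CLUMPED = 1, 2, 4, 8


def vtype_equals(vtype, flags):
    """Stand-in for VTYPE==<flags>: the type must be exactly these flags."""
    return vtype == flags


def vtype_contains(vtype, flag):
    """Stand-in for VTYPE&<flag>: the type must include this flag."""
    return (vtype & flag) != 0


# A variant that is an indel with an embedded SNP carries both flags.
indel_with_snp = SNP | INDEL

print(vtype_equals(indel_with_snp, INDEL))        # False, it is not a pure indel
print(vtype_contains(indel_with_snp, INDEL))      # True, an indel is involved
print(vtype_equals(indel_with_snp, SNP | INDEL))  # True, matches VTYPE==(SNP|INDEL)
```

This is why the examples above use `==` for "only" filters and `&` for "involving" filters.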
| + | |
| + | The following programs support filter expressions. |
| + | |
| + | * view |
| + | * peek |
| + | * profile_snps |
| + | * profile_indels |
| + | * profile_na12878 |
| + | * profile_mendelian |
| + | * profile_len |
| + | * profile_chrom |
| + | * profile_afs |
| + | * profile_hwe |
| + | * concordance |
| + | * partition |
| + | |
| + | == Alternate headers == |
| + | |
| + | As BCF is a restrictive format of VCF where all meta data must be present in the header, |
| + | vt provides a mechanism to read an alternative header for VCF files that do not have a |
| + | well formed header. Simply provide a header file stub named as <vcf-file>.hdr and vt |
| + | will automatically read it instead of the original header in <vcf-file>. |
| + | |
| + | For more information about VCF/BCF : http://samtools.github.io/hts-specs/VCFv4.2.pdf |
| + | |
| + | <span style="color:#FF0000">This mechanism is available only if one is reading VCF or compressed VCF files. It is |
| + | disabled for BCF files as this might corrupt the BCF file because the encoding of the |
| + | fields in BCF records is based on the order of the meta info lines in the header.</span> |
| + | |
| + | <span style="color:#0000FF">Note: BCF2.2 introduces the IDX field in meta information lines that indicates the |
| + | dictionary encoding. This feature might be enabled for BCF files in the future.</span> |
| + | |
| + | == General cases of Ploidy and Alleles == |
| + | |
| + | I am trying to make vt handle [http://genome.sph.umich.edu/wiki/Relationship_between_Ploidy,_Alleles_and_Genotypes general cases of ploidy and alleles]. |
| + | Please let me know if that is lacking in a tool that you are using. |
| + | |
| + | == BCF Compression Levels vs Compression Time == |
| + | |
| + | The zlib deflation algorithm (a variant of LZ77) has 10 levels - 0 to 9. 0 has no compression but instead wraps <br> |
| + | up the file in zlib or bgzf blocks. It may be useful to have 0 compression as it is indexable with the same mechanism <br> |
| + | used for compressed files. Levels 1-9 denote an increasing compression level in exchange for longer times for <br> |
| + | compression. |
| + | |
| + | In general, zlib compression does not have significant differences in compression for BCF files between the 9 compression <br> |
| + | levels as shown in the following table: |
| + | |
| + | {| class="wikitable" |
| + | |- |
| + | ! scope="col"| Compression Level |
| + | ! scope="col"| Size |
| + | ! scope="col"| Time |
| + | |- |
| + | | 0 || 153GB || 45m |
| + | |- |
| + | | 1 || 98.4GB || 2h3m |
| + | |- |
| + | | 2 || 98.0GB || 2h7m |
| + | |- |
| + | | 3 || 97.5GB || 2h12m |
| + | |- |
| + | | 4 || 95.5GB || 2h26m |
| + | |- |
| + | | 5 || 95.2GB || 2h54m |
| + | |- |
| + | | 6 (default) || 94.9GB || 3h19m |
| + | |- |
| + | | 7 || 94.8GB || 3h41m |
| + | |- |
| + | | 8 || 94.76GB || 4h5m |
| + | |- |
| + | | 9 || 94.75GB || 4h25m |
| + | |} |
| + | |
| + | <span style="color:#0000FF">So, it might be a good idea to compress at lower levels when dealing with large temporary <br> |
| + | files in a pipeline to save compute time. This can be achieved with the -c option in vt view</span> |
| + | |
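The trade-off between level and size can be explored on a small scale with Python's `zlib` module. This sketch uses plain zlib streams rather than BGZF blocks and a made-up, highly repetitive payload, so the absolute numbers will not resemble the table above; it only shows how level 0 stores the data with a small overhead while higher levels shrink it.

```python
import zlib


def compressed_sizes(data):
    """Return {level: compressed size in bytes} for zlib levels 0 to 9."""
    return {level: len(zlib.compress(data, level)) for level in range(10)}


# Made-up payload: one VCF-like record repeated many times.
record = b"20\t763837\t.\tCA\tTG\t50340.1\tPASS\tAC=1;AN=2\tGT\t0|1\n"
data = record * 2000

sizes = compressed_sizes(data)
print("input bytes:", len(data))
for level, size in sizes.items():
    print("level", level, "->", size, "bytes")
```

Timing each call with `time.perf_counter()` on your own data gives a rough idea of where the extra compression stops being worth the compute time.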
| + | = VCF Manipulation = |
| + | |
| + | === View === |
| + | |
| + | |
| + | Views a VCF or VCF.GZ or BCF file. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #views mills.bcf and outputs to standard out |
| + | vt view -h mills.bcf |
| + | |
| + | #views mills.bcf and locally sorts it in a 10000bp window and outputs to sorted-mills.bcf |
| + | vt view -h -w 10000 mills.bcf -o sorted-mills.bcf |
| + | |
| + | #views mills.bcf and outputs to c1-mills.bcf with a compression level of 1. By default, |
| + | #the compression level is 6 where lower levels compress the file less but are faster. |
| + | #The difference in compression for BCF files between level 1 to level 9 is about 5% |
| + | #of a level 1 compression file. The difference in time taken is about an additional 50% |
| + | #of a level 1 compression. The levels range from 0 to 9 where 0 means no compression |
| + | #but the file is encapsulated in bgzf blocks that allow the file to be indexed. A special |
| + | #level -1 denotes an uncompressed BCF file that is not encapsulated in bgzf blocks and |
| + | #is thus not indexable but is highly suitable for streaming between vt commands. |
| + | vt view -h mills.bcf -c 1 -o c1-mills.bcf |
| + | |
| + | #views mills.bcf and selects variants that overlap with the regions found in dust.bed from chromosome 20 |
| + | #the -t option selects variants by checking if each variant overlaps with the regions in the bed file, this is |
| + | #as opposed to random accessing the variants via the index through the intervals defined in -i and -I options. |
| + | #this is useful when selecting variants from the target regions from an exome sequencing experiment. |
| + | vt view mills.bcf -t dust.bed -i 20 |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt view [options] <in.vcf> |
| + | |
| + | options : -o output VCF/VCF.GZ/BCF file [-] |
| + | -f filter expression [] |
| + | -w local sorting window size [0] |
| + | -s print site information only without genotypes [false] |
| + | -H print header only, this option is honored only for STDOUT [false] |
| + | -h omit header, this option is honored only for STDOUT [false] |
| + | -p print options and summary [] |
| + | -r right window size for overlap [] |
| + | -l left window size for overlap [] |
| + | -c compression level 0-9, 0 and -1 denotes uncompressed with the former being wrapped in bgzf. [6] |
| + | -t bed file for variant selection via streaming [] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Index === |
| + | |
| + | Indexes a VCF.GZ or BCF file. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #indexes mills.bcf |
| + | vt index mills.bcf |
| + | #indexes mills.vcf.gz |
| + | vt index mills.vcf.gz |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt index [options] <in.vcf> |
| + | |
| + | options : -p print options and summary [] |
| + | -- ignores the rest of the labeled arguments following this flag |
| + | -h displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Sorting === |
| + | |
| + | Sorting may be done using 3 approaches. |
| + | |
| + | Locally:<br> |
| + | Performs sorting within a local window. The window size may be set by the -w option. The default window size <br> |
| + | is 1000bp and if a record is detected to be potentially out of order due to a small window size, it will be reported.<br> |
| + | Use this when your VCF records are grouped by chromosome but not ordered in short stretches.<br><br> |
| + | |
| + | By chromosome: <br> |
| + | Your VCF file is not ordered by the chromosomes in the header but is fully ordered within each chromosome.<br> |
| + | The VCF file should be indexed and vt will output the records in the order of chromosomes given in the header. <br> <br> |
| + | |
| + | Full sort [default option]: <br> |
| + | No assumptions are made about the VCF file. Records will be ordered by the order of contigs in the header. <br> |
| + | Smaller temporary ordered files are created and their names are <output_vcf>.<no>.bcf and after generating <br> |
| + | these files, they are merged and output into <output_vcf>.<br> |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #sorts mills.bcf and outputs to standard out in a 1000bp window. |
| + | vt sort -m local mills.bcf |
| + | #sorts mills.bcf and locally sorts it in a 10000bp window and outputs to out.bcf |
| + | vt sort -m local -w 10000 mills.bcf -o out.bcf |
| + | #sorts an indexed mills.bcf with chromosomes not sorted in the contig order in the header |
| + | vt sort -m chrom mills.bcf -o out.bcf |
| + | #sorts mills.bcf with no assumption |
| + | vt sort mills.bcf -o out.bcf |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt sort [options] <in.vcf> |
| + | |
| + | options : -m sorting modes. [full] |
| + | local : locally sort within a 1000bp window. Window size may be set by -w. |
| + | chrom : sort chromosomes based on order of contigs in header. |
| + | input must be indexed. |
| + | full : full sort with no assumptions. |
| + | -o output VCF/VCF.GZ/BCF file. [-] |
| + | -w local sorting window size, set by default to 1000 under local mode. [0] |
| + | -p print options and summary. [] |
| + | -? displays help |
| + | |
| + | </div> |
| + | </div> |
| + | |
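The idea behind local sorting can be sketched in Python with a heap that holds only the records inside the current window. This is an illustration of the technique and not vt's implementation: `local_sort` is a hypothetical name, records are reduced to `(chrom, pos)` pairs, and the input is assumed to be grouped by chromosome.

```python
import heapq


def local_sort(records, window):
    """Yield (chrom, pos) records sorted within a sliding window of `window` bp."""
    heap = []
    current_chrom = None
    for chrom, pos in records:
        if chrom != current_chrom:
            # New chromosome: flush everything buffered for the previous one.
            while heap:
                yield (current_chrom, heapq.heappop(heap))
            current_chrom = chrom
        heapq.heappush(heap, pos)
        # Positions more than `window` bp behind the newest record are emitted.
        while heap and heap[0] < pos - window:
            yield (chrom, heapq.heappop(heap))
    # End of input: flush the remaining buffer.
    while heap:
        yield (current_chrom, heapq.heappop(heap))


unsorted = [("1", 100), ("1", 90), ("1", 300), ("1", 250), ("2", 5)]
print(list(local_sort(unsorted, 100)))
```

A record that arrives more than one window late would already have been overtaken in the output, which corresponds to the out-of-order case that the local mode reports.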
| + | === Normalization === |
| + | |
| + | <div> |
| + | [http://genome.sph.umich.edu/wiki/Variant_Normalization Normalize] variants in a [http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42 VCF] file [http://bioinformatics.oxfordjournals.org/content/31/13/2202 (Tan et al. 2015)] . Normalized variants may have their positions changed; in such cases, the normalized variants |
| + | are reordered and output in an ordered fashion. The local reordering takes place over a window of 10000 base pairs which may be changed via the -w option. There is an underlying assumption that the REF |
| + | field is consistent with the reference sequence used; vt will check for this and will fail if reference inconsistency is encountered; this may be relaxed with the -n option. |
| + | </div> |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #normalize variants and write out to dbsnp.normalized.vcf |
| + | vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf |
| + | |
| + | #normalize variants, send to standard out and remove duplicates. |
| + | vt normalize dbsnp.vcf -r seq.fa | vt uniq - -o dbsnp.normalized.uniq.vcf |
| + | |
| + | #read in variants that do not contain N in the explicit alleles, normalize variants, send to standard out. |
| + | vt normalize dbsnp.vcf -r seq.fa -f "~VARIANT_CONTAINS_N" |
| + | |
| + | #variants that are normalized will be annotated with an OLD_VARIANT info tag. |
| + | #CHROM POS ID REF ALT QUAL FILTER INFO |
| + | 19 29238772 . C G . PASS VT=SNP;OLD_VARIANT=19:29238771:TC/TG |
| + | 20 60674709 . GCCCAGCCCCAC G . PASS VT=INDEL;OLD_VARIANT=20:60674718:CACCCCAGCCCC/C |
| + | |
| + | #this shows a sample output with the normalization operations that were used |
| + | #categorized into 5 categories each for biallelic and multiallelic variants. <br> |
| + | stats: biallelic |
| + | no. left trimmed : 156908 |
| + | no. right trimmed : 323 |
| + | no. left and right trimmed : 33 |
| + | no. right trimmed and left aligned : 7 |
| + | no. left aligned : 12360 <br> |
| + | total no. biallelic normalized : 169631 <br> <br> |
| + | multiallelic |
| + | no. left trimmed : 627189 |
| + | no. right trimmed : 2509 |
| + | no. left and right trimmed : 1498 |
| + | no. right trimmed and left aligned : 212 |
| + | no. left aligned : 1783 <br> |
| + | total no. multiallelic normalized : 633191 <br> |
| + | total no. variants normalized : 802822 |
| + | total no. variants observed : 88052639 |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt normalize [options] <in.vcf> |
| + | |
| + | options : -o output VCF file [-] |
| + | -d debug [false] |
| + | -q do not print options and summary [false] |
| + | -m warns but does not exit when REF is inconsistent |
| + | with masked reference sequence for non SNPs. |
| + | This overrides the -n option [false] |
| + | -n warns but does not exit when REF is inconsistent |
| + | with reference sequence for non SNPs [false] |
| + | -w window size for local sorting of variants [10000] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -r reference sequence fasta file [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
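The trimming and left alignment steps counted in the summary above follow the procedure described in Tan et al. 2015, which can be sketched in Python as follows. This is a simplified illustration and not vt's code: `normalize` is a hypothetical function, the reference is passed as an in-memory string for a single contig, and positions are 1-based.

```python
def normalize(pos, alleles, ref_seq):
    """Return (pos, alleles) in left-aligned, parsimonious form.

    alleles[0] is REF; ref_seq is the contig sequence as a string.
    """
    alleles = [a.upper() for a in alleles]
    if len(set(alleles)) < 2:
        raise ValueError("need at least two distinct alleles")
    changed = True
    while changed:
        changed = False
        # Right trim: drop the last base if every allele ends with the same one.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # Left extend: if an allele became empty, pull in one reference base.
        if any(len(a) == 0 for a in alleles):
            if pos == 1:
                raise ValueError("cannot extend past the start of the contig")
            pos -= 1
            base = ref_seq[pos - 1].upper()
            alleles = [base + a for a in alleles]
            changed = True
    # Left trim: drop shared leading bases while every allele keeps one base.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# A CA deletion inside a CA repeat, reported at its rightmost position.
toy_reference = "GGCACACAT"
print(normalize(6, ["ACA", "A"], toy_reference))
# A SNP written with a redundant leading base.
print(normalize(1, ["TC", "TG"], "TC"))
```

The second call mirrors the TC/TG record shown above, where left trimming moves the position forward by one and leaves C/G.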
| + | === Decompose biallelic block substitutions === |
| + | |
| + | <div> |
| + | Decomposes biallelic block substitutions into their constituent SNPs. <br> |
| + | There is now an additional option -a which decomposes non block substitutions into their constituent SNPs and indels. (kindly added by [[https://github.com/holtgrewe holtgrewe@github]]) <br> |
| + | There is no exact solution and this decomposition is based on the best guess outcome using a Needleman-Wunsch algorithm. <br> |
| + | You might also want to check out [https://github.com/vcflib/vcflib#vcfallelicprimitives vcfallelicprimitives]. <br> |
| + | <br> |
| + | There are now additional options -m and -d which ensure that some MNVs are not decomposed. (kindly added by [[https://github.com/jaudoux jaudoux@github]]) <br> |
| + | The motivation is from<br> |
| + | *Exome-wide assessment of the functional impact and pathogenicity of multi-nucleotide mutations https://www.biorxiv.org/content/10.1101/258723v2.full<br> |
| + | *Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes https://www.biorxiv.org/content/10.1101/573378v2.full<br> |
| + | </div> |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #decomposes biallelic block substitutions and write out to decomposed_blocksub.vcf |
| + | vt decompose_blocksub gatk.vcf -o decomposed_blocksub.vcf <br> |
| + | #before decomposition |
| + | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 |
| + | 20 763837 . CA TG 50340.1 PASS AC=1;AN=2 GT 0|1 <br> |
| + | #after decomposition |
| + | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 |
| + | 20 763837 . C T 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG GT 0|1 |
| + | 20 763838 . A G 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CA/TG GT 0|1 |
| + | |
| + | #decomposes biallelic clumped variant and write out to decomposed_blocksub.vcf |
| + | vt decompose_blocksub -a gatk.vcf -o decomposed_blocksub.vcf <br> |
| + | #before decomposition |
| + | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 |
| + | 20 763837 . CG TGA 50340.1 PASS AC=1;AN=2 GT 0|1 <br> |
| + | #after decomposition |
| + | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 |
| + | 20 763837 . C T 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1 |
| + | 20 763838 . G GA 50340.1 PASS AC=1;AN=2;OLD_CLUMPED=20:763837:CG/TGA GT 0|1 |
| + | |
| + | #decomposes biallelic clumped variant and write out to decomposed_blocksub.vcf and add phase set information in the genotype fields |
| + | vt decompose_blocksub -p gatk.vcf -o decomposed_blocksub.vcf <br> |
| + | #before decomposition |
| + | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tumor normal |
| + | 1 159030 . TAACCTTTC TGACCTTTT 0.04 . AF=0.5 GT 0/0 1/1 <br> |
| + | #after decomposition |
| + | 1 159031 . A G 0.04 . AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT GT:PS 0|0:159031 1|1:159031 |
| + | 1 159038 . C T 0.04 . AF=0.5;OLD_CLUMPED=1:159030:TAACCTTTC/TGACCTTTT GT:PS 0|0:159031 1|1:159031 |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | description : decomposes biallelic block substitutions into its constituent SNPs. <br> |
| + | usage : vt decompose_blocksub [options] <in.vcf> <br> |
| + | options : -m keep MNVs (multi-nucleotide variants) [false] |
| + | -a enable aggressive/alignment mode [false] |
| + | -d MNVs max distance (when -m option is used) [2] |
| + | -o output VCF file [-] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -? displays help |
| + | |
| + | </div> |
| + | </div> |
| + | |
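For the simple case where REF and ALT have the same length, the decomposition shown in the examples above amounts to a position-by-position comparison. The Python sketch below covers only that case; the alignment mode enabled by -a and the MNV handling of -m and -d are not modelled, and `decompose_blocksub` is a hypothetical function name.

```python
def decompose_blocksub(pos, ref, alt):
    """Split an equal-length block substitution into (pos, ref, alt) SNPs."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length REF and ALT")
    return [
        (pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a  # matching bases are not variants
    ]


print(decompose_blocksub(763837, "CA", "TG"))
print(decompose_blocksub(159030, "TAACCTTTC", "TGACCTTTT"))
```

Both calls reproduce the positions and alleles of the "after decomposition" records above.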
| + | === Decompose === |
| + | |
| + | <div> |
| + | Decompose multiallelic variants in a [http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42 VCF] file. If the VCF file has genotype fields GT, PL, GL or DP, they are |
| + | modified to reflect the change in alleles. All other genotype fields are removed. The -s option will retain the fields and decompose fields of counts R and A accordingly. |
| + | |
| + | Decomposition and combining variants is a complex operation where the correctness is dependent on [[https://github.com/tfarrah tfarrah@github]]: |
| + | |
| + | *whether the observed variants are seen in the same sample, |
| + | *if same sample, whether they are homozygous or heterozygous, |
| + | *if both heterozygous, whether they are in the same haplotype or not (if known). |
| + | |
| + | and one should be aware of the issues in handling variants resulting from such operations. <br> |
| + | The original purpose of this tool is to allow for allelic comparisons between call sets. |
| + | [[https://github.com/atks/vt/issues/16 example of a problem caused in combining separate variant records]] |
| + | </div> |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf |
| + | vt decompose gatk.vcf -o gatk.decomposed.vcf <br> |
| + | #before decomposition |
| + | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 |
| + | 1 3759889 . TA TAA,TAAA,T . PASS AF=0.342,0.173,0.037 GT:DP:PL 1/2:81:281,5,9,58,0,115,338,46,116,809 0/0:86:0,30,323,31,365,483,38,291,325,567 <br> |
| + | #after decomposition |
| + | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 |
| + | 1 3759889 . TA TAA . PASS OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL 1/.:281,5,9 0/0:0,30,323 |
| + | 1 3759889 . TA TAAA . . OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./1:281,58,115 0/0:0,31,483 |
| + | 1 3759889 . TA T . . OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./.:281,338,809 0/0:0,38,567 <br> |
| + | One might want to post process the partial genotypes like 1/. to the best guess genotype based on the PL values. |
| + | |
| + | #decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf with the -s option. |
| + | #-s option splits up INFO and GENOTYPE fields that have number counts of R and A [[https://samtools.github.io/hts-specs/VCFv4.2.pdf VCFv4.2 section 1.2.2]] appropriately. |
| + | vt decompose -s gatk.vcf -o gatk.decomposed.vcf <br> |
| + | #before decomposition |
| + | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 |
| + | 1 3759889 . TA TAA,TAAA,T . PASS AF=0.342,0.173,0.037 GT:DP:PL 1/2:81:281,5,9,58,0,115,338,46,116,809 0/0:86:0,30,323,31,365,483,38,291,325,567 <br> |
| + | #after decomposition |
| + | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 |
| + | 1 3759889 . TA TAA . PASS AF=0.342;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL 1/.:281,5,9 0/0:0,30,323 |
| + | 1 3759889 . TA TAAA . . AF=0.173;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./1:281,58,115 0/0:0,31,483 |
| + | 1 3759889 . TA T . . AF=0.037;OLD_MULTIALLELIC=1:3759889:TA/TAA/TAAA/T GT:PL ./.:281,338,809 0/0:0,38,567 <br> |
| + | In general, you should recompute fields that involve alleles after decomposition. Information is generally lost after vertically decomposing a variant, so care should be taken |
| + | in interpreting the resultant values. |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | description : decomposes multiallelic variants into biallelic in a VCF file. <br> |
| + | usage : vt decompose [options] <in.vcf> <br> |
| + | options : -s smart decomposition [false] |
| + | -o output VCF file [-] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
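The GT and PL values in the decomposed records above can be reproduced with a short Python sketch. It assumes diploid genotypes and the standard VCF ordering of genotype likelihoods, where genotype j/k with j <= k sits at index k(k+1)/2 + j. The function names are hypothetical and the sketch ignores INFO fields and all other FORMAT fields.

```python
def pl_index(j, k):
    """Index of diploid genotype j/k (j <= k) in a VCF PL or GL array."""
    return k * (k + 1) // 2 + j


def decompose_sample(gt, pl, alt_index):
    """Recode one sample's GT and PL for a single ALT allele (1-based index)."""
    def recode(allele):
        if allele == ".":
            return "."
        if int(allele) == 0:
            return "0"
        # The chosen ALT becomes allele 1; any other ALT becomes missing.
        return "1" if int(allele) == alt_index else "."

    sep = "|" if "|" in gt else "/"
    new_gt = sep.join(recode(a) for a in gt.split(sep))
    new_pl = [
        pl[pl_index(0, 0)],
        pl[pl_index(0, alt_index)],
        pl[pl_index(alt_index, alt_index)],
    ]
    return new_gt, new_pl


def decompose(ref, alts, samples):
    """Return one (ref, alt, samples) row per ALT allele."""
    return [
        (ref, alt, [decompose_sample(gt, pl, i) for gt, pl in samples])
        for i, alt in enumerate(alts, start=1)
    ]


samples = [
    ("1/2", [281, 5, 9, 58, 0, 115, 338, 46, 116, 809]),
    ("0/0", [0, 30, 323, 31, 365, 483, 38, 291, 325, 567]),
]
rows = decompose("TA", ["TAA", "TAAA", "T"], samples)
for row in rows:
    print(row)
```

The partial genotypes such as 1/. come straight out of the recoding step, which is why a best guess from the PL values may be wanted afterwards.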
| + | === Drop duplicate variants === |
| + | |
| + | Drops duplicate variants that appear later in the file. VCF file must be ordered. <br> |
| + | If there are OLD_VARIANT tags in the INFO field, the variants in these tags are aggregated in the unique record retained. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #drop duplicate variants and save output in mills.uniq.vcf |
| + | vt uniq mills.vcf -o mills.uniq.vcf |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt uniq [options] <in.vcf> |
| + | |
| + | options : -o output VCF file [-] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
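Because the input is ordered, duplicates can be dropped in a single pass that only remembers the alleles seen at the current position. The Python sketch below illustrates this under simplifying assumptions: records are reduced to `(chrom, pos, ref, alt)` tuples, `drop_duplicates` is a hypothetical name, and the aggregation of OLD_VARIANT tags is left out.

```python
def drop_duplicates(records):
    """Yield each (chrom, pos, ref, alt) once, keeping the first occurrence.

    Assumes records are sorted so that equal positions are adjacent.
    """
    current_site = None
    seen_alleles = set()
    for chrom, pos, ref, alt in records:
        if (chrom, pos) != current_site:
            # Moved to a new position: earlier alleles can be forgotten.
            current_site = (chrom, pos)
            seen_alleles = set()
        if (ref, alt) in seen_alleles:
            continue
        seen_alleles.add((ref, alt))
        yield (chrom, pos, ref, alt)


records = [
    ("20", 100, "A", "C"),
    ("20", 100, "A", "G"),
    ("20", 100, "A", "C"),
    ("20", 200, "A", "C"),
]
print(list(drop_duplicates(records)))
```

Memory use stays proportional to the number of distinct alleles at one position, which is what makes the ordering requirement useful.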
| + | === Paste === |
| + | |
| + | Pastes VCF files like the unix paste command. |
| + | |
| + | Input requirements and assumptions: |
| + | 1. Same variants are represented in the same order for each file (required) |
| + | 2. Genotype field order is the same for corresponding records (required) |
| + | 3. Sample names are different in all the files (warning will be given if not) |
| + | 4. Headers are the same for all the files (assumption, not checked, will fail if output is BCF) |
| + | Outputs: |
| + | 1. INFO fields output will be that of the first file |
| + | 2. Genotype fields are the same for corresponding records |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #paste together genotypes from the CEU trio into one file. |
| + | vt paste NA12878.mills.bcf NA12891.mills.bcf NA12892.mills.bcf -o ceu_trio.bcf |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt paste [options] <in1.vcf>... |
| + | |
| + | options : -L file containing list of input VCF files |
| + | -o output VCF file [-] |
| + | -p print options and summary [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Concatenate === |
| + | |
| + | Concatenates VCF files. Assumes individuals are in the same order and files share the same header. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #concatenates chr1.mills.bcf and chr2.mills.bcf |
| + | vt cat chr1.mills.bcf chr2.mills.bcf -o mills.bcf |
| + | |
| + | #concatenates chr1.mills.bcf and chr2.mills.bcf with the naive option. |
| + | #The naive option assumes that the headers are all the same and skips |
| + | #merging headers and translating encodings between BCF files. This is |
| + | #a much faster option if you know the nature of your BCF files in advance. |
| + | vt cat -n chr1.mills.bcf chr2.mills.bcf -o mills.bcf |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt cat [options] <in1.vcf>... |
| + | |
| + | options : -s print site information only without genotypes [false] |
| + | -p print options and summary [false] |
| + | -n naive, assumes that headers are the same. [false] |
| + | -w local sorting window size [0] |
| + | -f filter expression [] |
| + | -L file containing list of input VCF files |
| + | -o output VCF file [-] |
| + | -I file containing list of intervals [] |
| + | -i intervals |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Remove info tags === |
| + | |
| + | Removes INFO tags from a VCF file |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #removes the INFO tags OLD_VARIANT, ENTROPY, PSCORE and COMP |
| + | vt rminfo exact.del.bcf -t OLD_VARIANT,ENTROPY,PSCORE,COMP -o rm.bcf |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt rminfo [options] <in.vcf> |
| + | |
| + | options : -o output VCF file [-] |
| + | -q do not print options and summary [false] |
| + | -t list of info tags to be removed [] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Filter === |
| + | |
| + | Filters variants in a VCF file |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #adds a filter tag "refA" for variants where the REF column is the sequence A. |
| + | vt filter in.bcf -f "REF=='A'" -d "refA" |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt filter [options] <in.vcf> <br> |
| + | options : -x clear filter [false] |
| + | -f filter expression [] |
| + | -d filter tag description [] |
| + | -t filter tag [] |
| + | -o output VCF file [-] |
| + | -I file containing list of intervals [] |
| + | -i intervals |
| + | -? displays help </div> |
| + | </div> |
| + | |
| + | === Filter overlap === |
| + | |
| + | Tags overlapping variants in a VCF file with the FILTER flag overlap. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #adds a filter tag "overlap" for overlapping variants within a window size of 1 based on the REF sequence. |
| + | vt filter_overlap in.bcf -w 1 -o out.bcf |
| + | |
| + | todo: option for considering END info tag for detecting overlaps. |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt filter_overlap [options] <in.vcf> |
| + | |
| + | options : -o output VCF file [-] |
| + | -w window overlap for variants [0] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Validate === |
| + | |
| + | Checks the following properties of a VCF file: |
| + | #order |
| + | #reference sequence consistency |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #validates lobstr.bcf |
| + | vt validate lobstr.bcf |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt validate [options] <in.vcf> |
| + | |
| + | options : -q do not print invalid records [false] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -r reference sequence fasta file [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Extract INFO fields to a tab delimited file === |
| + | |
| + | Converts a VCF file and its shared information in the INFO field to a tab delimited file for further analysis. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #converts in.bcf to tab format with selected INFO and FILTER fields |
| + | vt info2tab in.bcf -u PASS -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT |
| + | <div style="height:6em; overflow:auto; border: 2px solid #FFF"> |
| + | INPUT |
| + | ===== |
| + | 20 17548608 . A AC . PASS CENTERS=vbi;NCENTERS=1;OLD_MULTIALLELIC=20:17548598:GAAAAAAAAAAAAA/GAAAAAAAAAAAA/GAAAAAAAAAAAAAA/GAAAAAAAAAA/GAAAAAAAAAAA/GAAAAAAAAAACAAA;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAACAAAG;EX_MOTIF=C;EX_MLEN=1;EX_RU=C;EX_BASIS=C;EX_BLEN=1;EX_REPEAT_TRACT=17548608,17548609;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=2;EX_RL=2;EX_LL=3;EX_RU_COUNTS=0,2;EX_SCORE=0;EX_TRF_SCORE=-14;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=14;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[A]AAAGAAGGAA;MDUST;LOBSTR |
| + | 20 17548608 . AAAAG A . PASS CENTERS=ox1;NCENTERS=1;EX_MOTIF=AAAG;EX_MLEN=4;EX_RU=AAAG;EX_BASIS=AG;EX_BLEN=2;EX_REPEAT_TRACT=17548609,17548612;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=0.75;EX_RL=4;EX_LL=4;EX_RU_COUNTS=0,1;EX_SCORE=0.75;EX_TRF_SCORE=-1;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=13;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[AAAAG]AAGGAACTAC;MDUST;LOBSTR;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAA |
| + | </div> |
| + | OUTPUT |
| + | ====== |
| + | CHROM POS REF ALT N_ALLELE PASS EX_RL FZ_RL MDUST LOBSTR VNTRSEEK RMSK EX_REPEAT_TRACT_1 EX_REPEAT_TRACT_2 |
| + | 20 17548608 A AC 2 1 2 13 1 1 0 0 17548608 17548608 |
| + | 20 17548608 AAAAG A 2 1 4 13 1 1 0 0 17548609 17548609 |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt info2tab [options] <in.vcf> |
| + | |
| + | options : -d debug [false] |
| + | -f filter expression [] |
| + | -u list of filter tags to be extracted [] |
| + | -t list of info tags to be extracted [] |
| + | -o output tab delimited file [-] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
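
The flattening shown in the example above can be sketched in a few lines of Python. This is an illustrative reconstruction, not vt's implementation: the function name `info_to_row`, the hand-supplied value count per tag (0 marks a flag), and the use of `.` for a missing value are all assumptions made for this sketch.

```python
def info_to_row(info, tags):
    """Flatten a VCF INFO string into ordered columns.

    tags is a list of (name, count) pairs; count 0 marks a flag that is
    reported as 1 or 0, and count > 1 yields columns name_1..name_count.
    """
    present = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        present[key] = value
    row = {}
    for tag, count in tags:
        if count == 0:
            # Flags carry no value; record presence only.
            row[tag] = 1 if tag in present else 0
            continue
        values = present[tag].split(",") if tag in present else []
        for i in range(count):
            column = tag if count == 1 else f"{tag}_{i + 1}"
            # "." for a missing value is an assumption of this sketch.
            row[column] = values[i] if i < len(values) else "."
    return row


info = "EX_RL=2;FZ_RL=13;EX_REPEAT_TRACT=100,105;MDUST;LOBSTR"
tags = [("EX_RL", 1), ("FZ_RL", 1), ("MDUST", 0), ("RMSK", 0),
        ("EX_REPEAT_TRACT", 2)]
row = info_to_row(info, tags)
print("\t".join(row))
print("\t".join(str(v) for v in row.values()))
```

Flags such as MDUST become 0/1 columns and multi-valued tags are split into numbered columns, which mirrors the layout of the OUTPUT table above.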
| + | |
| + | = VCF Inspection and Evaluation = |
| + | |
| + | === Peek === |
| + | |
| + | Summarizes the variants in a VCF file |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #summarizes the variants found in mills.vcf |
| + | vt peek mills.vcf |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt peek [options] <in.vcf> |
| + | |
| + | options : -o output VCF file [-] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -r reference sequence fasta file [] |
| + | -- ignores the rest of the labeled arguments following this flag |
| + | -h displays help |
| + | </div> |
| + | </div> |
| + | |
| + | For a more detailed guide, see [http://genome.sph.umich.edu/wiki/Variant_classification variant classification]. |
| + | |
| + | #This is a sample output of a peek command which summarizes the variants found in a VCF file. |
| + | stats: no. of samples : 0 |
| + | no. of chromosomes : 22<br> |
| + | ========== Micro variants ==========<br> |
| + | no. of SNPs : 77228885 |
| + | 2 alleles (ts/tv) : 77011302 (2.11) [52287790/24723512] |
| + | 3 alleles (ts/tv) : 216560 (0.75) [185520/247600] |
| + | 4 alleles (ts/tv) : 1023 (0.50) [1023/2046]<br> |
| + | no. of MNPs : 0 |
| + | 2 alleles (ts/tv) : 0 (-nan) [0/0] |
| + | >=3 alleles (ts/tv) : 0 (-nan) [0/0]<br> |
| + | no. Indels : 2147564 |
| + | 2 alleles (ins/del) : 2124842 (0.47) [683250/1441592] |
| + | >=3 alleles (ins/del) : 22722 (2.12) [32411/15286]<br> |
| + | no. SNP/MNP : 0 |
| + | 3 alleles (ts/tv) : 0 (-nan) [0/0] |
| + | >=4 alleles (ts/tv) : 0 (-nan) [0/0] <br> |
| + | no. SNP/Indels : 12913 |
| + | 2 alleles (ts/tv) (ins/del) : 412 (0.41) [120/292] (3.68) [324/88] |
| + | >=3 alleles (ts/tv) (ins/del) : 12501 (0.43) [7670/17649] (18.64) [12434/667]<br> |
| + | no. MNP/Indels : 153 |
| + | 2 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0] |
| + | >=3 alleles (ts/tv) (ins/del) : 153 (0.30) [138/465] (0.27) [67/248]<br> |
| + | no. SNP/MNP/Indels : 2 |
| + | 3 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0] |
| + | 4 alleles (ts/tv) (ins/del) : 2 (0.00) [3/5] (1.00) [3/3] |
| + | >=5 alleles (ts/tv) (ins/del) : 0 (-nan) [0/0] (-nan) [0/0]<br> |
| + | no. of clumped variants : 19025 |
| + | 2 alleles : 0 (-nan) [0/0] (-nan) [0/0] |
| + | 3 alleles : 18508 (0.16) [12152/75366] (0.00) [93/18653] |
| + | 4 alleles : 451 (0.15) [369/2390] (0.33) [201/609] |
| + | >=5 alleles : 66 (0.09) [37/414] (1.19) [107/90]<br> |
| + | ====== Other useful categories =====<br> |
| + | no. complex variants : 32093 |
| + | 2 alleles (ts/tv) (ins/del) : 412 (0.41) [120/292] (3.68) [324/88] |
| + | >=3 alleles (ts/tv) (ins/del) : 31681 (0.21) [20369/96289] (0.64) [12905/20270]<br> |
| + | ======= Structural variants ========<br> |
| + | no. of structural variants : 41217 |
| + | 2 alleles : 38079 |
| + | deletion : 13135 |
| + | insertion : 16451 |
| + | mobile element : 16253 |
| + | ALU : 12513 |
| + | LINE1 : 2911 |
| + | SVA : 829 |
| + | numt : 198 |
| + | duplication : 664 |
| + | inversion : 100 |
| + | copy number variation : 7729 |
| + | >=3 alleles : 3138 |
| + | copy number variation : 3138 <br> |
| + | ========= General summary ========== <br> |
| + | no. of reference : 0 <br> |
| + | no. of observed variants : 79449759 |
| + | no. of unclassified variants : 0 |
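
The ratios in round brackets in the sample output above follow from the counts in square brackets. The short Python sketch below recomputes two of them and shows why empty categories print as nan; the helper name `ratio` is hypothetical and this is not vt's code.

```python
import math


def ratio(numerator, denominator):
    """Return numerator/denominator, or nan when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


ts_tv = ratio(52287790, 24723512)   # biallelic SNPs, [ts/tv] counts
ins_del = ratio(683250, 1441592)    # biallelic indels, [ins/del] counts
empty = ratio(0, 0)                 # a category with no variants

print(f"{ts_tv:.2f}")
print(f"{ins_del:.2f}")
print(empty)
```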
| + | |
| + | === Partition === |
| + | |
| + | Partition variants from two data sets. |
| + | |
| + | <span style="color:#FF0000">Please note that this only works if the contigs in the headers of both data sets are the same.</span> |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #partitions all variants in bi1.bcf and bi2.bcf |
| + | vt partition bi1.bcf bi2.bcf |
| + | |
| + | Options: input VCF file a bi1.bcf |
| + | input VCF file b bi2.bcf <br> |
| + | A: 504676 variants |
| + | B: 1389333 variants <br> |
| + | ts/tv ins/del |
| + | A-B 37564 [0.19] [1.34] |
| + | A&B 467112 [1.55] [0.72] |
| + | B-A 922221 [1.20] [0.58] |
| + | of A 92.6% |
| + | of B 33.6% |
| + | |
| + | #partitions only passed variants in bi1.bcf and bi2.bcf |
| + | vt partition bi1.bcf bi2.bcf -f PASS |
| + | |
| + | Options: input VCF file a bi1.bcf |
| + | input VCF file b bi2.bcf |
| + | [f] filter PASS <br> |
| + | A: 466148 variants |
| + | B: 986056 variants <br> |
| + | ts/tv ins/del |
| + | A-B 47261 [0.44] [1.36] |
| + | A&B 418887 [1.80] [0.68] |
| + | B-A 567169 [1.43] [0.72] |
| + | of A 89.9% |
| + | of B 42.5% |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | |
| + | partition v0.5 |
| + | |
| + | description : partition variants. check the overlap of variants between 2 data sets. |
| + | |
| + | usage : vt partition [options] <in1.vcf> <in2.vcf> |
| + | |
| + | options : -w write partitioned variants to file |
| + | -f filter |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
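
The totals and percentages in the partition report can be derived from the three partition counts alone. The Python sketch below reproduces the figures of the first example; the function name `partition_summary` is made up for this illustration and the code is not taken from vt.

```python
def partition_summary(a_only, shared, b_only):
    """Derive per-file totals and shared percentages from partition counts."""
    total_a = a_only + shared
    total_b = b_only + shared
    return {
        "A": total_a,
        "B": total_b,
        "of A": 100.0 * shared / total_a,  # share of A also found in B
        "of B": 100.0 * shared / total_b,  # share of B also found in A
    }


summary = partition_summary(a_only=37564, shared=467112, b_only=922221)
print(f"A: {summary['A']} variants")
print(f"B: {summary['B']} variants")
print(f"of A {summary['of A']:.1f}%")
print(f"of B {summary['of B']:.1f}%")
```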
| + | |
| + | === Multi Partition === |
| + | |
| + | Partitions variants found in VCF files. <br> |
| + | Unlike the simple 2-way partition, this does not support writing the partitions to file or |
| + | reporting the proportion of shared variants for each VCF file. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #partitions variants n-ways |
| + | vt multi_partition hc.genotypes.bcf pl.genotypes.bcf st.genotypes.bcf |
| + | |
| + | Options: input VCF file a hc.genotypes.bcf |
| + | input VCF file b pl.genotypes.bcf |
| + | input VCF file c st.genotypes.bcf <br> |
| + | A: 97274 variants |
| + | B: 95458 variants |
| + | C: 98943 variants <br> |
| + | no [ts/tv] [ins/del] |
| + | A-- 3887 [1.10] [0.86] |
| + | -B- 7890 [1.45] [0.98] |
| + | AB- 4360 [0.99] [1.32] |
| + | --C 8277 [1.75] [2.21] |
| + | A-C 7458 [1.78] [0.49] |
| + | -BC 1639 [1.63] [1.03] |
| + | ABC 81569 [2.28] [1.08] <br> |
| + | Unique variants : 115080 |
| + | Overall concordance : 70.88% (#intersection/#union) |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | |
| + | usage : vt multi_partition [options] <in1.vcf> <in2.vcf>... |
| + | options : -f filter |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
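
As a check on how the n-way report fits together, the Python sketch below recomputes the per-file totals, the number of unique variants and the overall concordance from the seven partition counts in the example. It only restates the arithmetic of the sample output and is not vt's implementation.

```python
# Partition counts copied from the sample output; a letter in the key
# means the variant is present in that file.
counts = {
    "A--": 3887, "-B-": 7890, "AB-": 4360, "--C": 8277,
    "A-C": 7458, "-BC": 1639, "ABC": 81569,
}

union = sum(counts.values())          # unique variants across all files
intersection = counts["ABC"]          # variants present in every file
per_file = {
    name: sum(n for key, n in counts.items() if name in key)
    for name in "ABC"
}
concordance = 100.0 * intersection / union

print(per_file)
print(f"Unique variants : {union}")
print(f"Overall concordance : {concordance:.2f}%")
```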
| + | |
| + | === Annotate Regions === |
| + | |
| + | Annotates regions in a VCF file. The BED file should be bgzipped and indexed with tabix. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #annotates the variants that overlap with coding regions. |
| + | vt annotate_regions mills.vcf -b coding.bed.gz -t CDS -d "Coding region" |
| + | |
| + | #annotates the variants that overlap with low complexity regions. |
| + | vt annotate_regions mills.vcf -b mdust.bed.gz -t DUST -d "DUST Low Complexity Region" |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | |
| + | usage : vt annotate_regions [options] <in.vcf> |
| + | |
| + | options : -d regions tag description [] |
| + | -t regions tag [] |
| + | -b regions BED file [] |
| + | -o output VCF file [-] |
| + | -I file containing list of intervals [] |
| + | -i intervals |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Annotate Variants === |
| + | |
| + | Annotates variants in a VCF file. The GENCODE annotation file should be bgzipped and indexed with tabix. |
| + | This is available in the [[Vt#Resource_Bundle|vt resource bundle]]. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #annotates the variants found in mills.vcf |
| + | vt annotate_variants mills.vcf -r hs37d5.fa -g gencode.v19.annotation.gtf.gz |
| + | |
| + | #annotates variants with the following fields |
| + | ##INFO=<ID=VT,Number=1,Type=String,Description="Variant Type - SNP, MNP, INDEL, CLUMPED"> |
| + | ##INFO=<ID=GENCODE_FS,Number=0,Type=Flag,Description="Frameshift INDEL"> |
| + | ##INFO=<ID=GENCODE_NFS,Number=0,Type=Flag,Description="Non Frameshift INDEL"> |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt annotate_variants [options] <in.vcf> |
| + | |
| + | options : -g GENCODE annotations GTF file [] |
| + | -r reference sequence fasta file [] |
| + | -o output VCF file [-] |
| + | -I file containing list of intervals [] |
| + | -i intervals |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Compute Features === |
| + | |
| + | Computes features for the variants in a VCF file. Examples of the statistics computed are allele counts, the [[Genotype_Likelihood_based_Inbreeding_Coefficient|Genotype Likelihood based Inbreeding Coefficient]] and |
| + | [[Genotype_Likelihood_based_Allele_Frequency|Hardy-Weinberg Genotype Likelihood based Allele Frequencies]]. <br> |
| + | For more customizable feature computation, see [http://genome.sph.umich.edu/wiki/Vt#Estimate estimate]. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #compute features for the variants found in vt.vcf |
| + | #requires GT, PL and DP |
| + | vt compute_features vt.vcf |
| + | |
| + | #annotates variants with the following fields |
| + | ##INFO=<ID=AC,Number=A,Type=Integer,Description="Alternate Allele Counts"> |
| + | ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Number Allele Counts"> |
| + | ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> |
| + | ##INFO=<ID=AF,Number=A,Type=Float,Description="Alternate Allele Frequency"> |
| + | ##INFO=<ID=GC,Number=G,Type=Integer,Description="Genotype Counts"> |
| + | ##INFO=<ID=GN,Number=1,Type=Integer,Description="Total Number of Genotypes Counts"> |
| + | ##INFO=<ID=GF,Number=G,Type=Float,Description="Genotype Frequency"> |
| + | ##INFO=<ID=HWEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency assuming HWE"> |
| + | ##INFO=<ID=HWEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency assuming HWE"> |
| + | ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Genotype likelihood based MLE Allele Frequency"> |
| + | ##INFO=<ID=MLEGF,Number=G,Type=Float,Description="Genotype likelihood based MLE Genotype Frequency"> |
| + | ##INFO=<ID=HWE_LLR,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg ln(Likelihood Ratio)"> |
| + | ##INFO=<ID=HWE_LPVAL,Number=1,Type=Float,Description="Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic ln(p-value)"> |
| + | ##INFO=<ID=HWE_DF,Number=1,Type=Integer,Description="Degrees of freedom for Genotype likelihood based Hardy Weinberg Likelihood Ratio Test Statistic"> |
| + | ##INFO=<ID=FIC,Number=1,Type=Float,Description="Genotype likelihood based Inbreeding Coefficient"> |
| + | ##INFO=<ID=AB,Number=1,Type=Float,Description="Genotype likelihood based Allele Balance"> |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt compute_features for variants [options] <in.vcf> |
| + | |
| + | options : -s print site information only without genotypes [false] |
| + | -o output VCF/VCF.GZ/BCF file [-] |
| + | -f filter expression [] |
| + | -I File containing list of intervals |
| + | -i Intervals |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Estimate === |
| + | |
| + | Computes variant based estimates. |
| + | |
| + | Examples of statistics are: |
| + | * Allele counts |
| + | * [[Genotype_Likelihood_based_Allele_Frequency|Hardy-Weinberg Genotype Likelihood based Allele Frequencies]] |
| + | * [[Genotype_Likelihood_based_Inbreeding_Coefficient|Genotype Likelihood based Inbreeding Coefficient]] |
| + | * [[HWEP|Genotype Likelihood based Hardy-Weinberg test]] |
| + | * [[Genotype_Likelihood_Based_Allele_Balance|Genotype Likelihood based Allele Balance]] |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #compute features for the variants found in vt.vcf |
| + | #requires GT and PL |
| + | vt estimate -e AF,MLEAF vt.vcf |
| + | |
| + | AF Genotype (GT) based allele frequencies |
| + | If genotypes are unavailable, best guess |
| + | genotypes are inferred based on genotype |
| + | likelihoods (GL or PL) |
| + | AC : Alternate Allele counts |
| + | AN : Total allele counts |
| + | NS : No. of samples. |
| + | AF : Alternate allele frequencies. |
| + | MLEAF GL based allele frequencies estimates |
| + | MLEAF : Alternate allele frequency derived from MLEGF |
| + | MLEGF : Genotype frequencies. |
| + | HWEAF GL based allele frequencies estimates assuming HWE |
| + | HWEAF : Alternate allele frequencies |
| + | HWEGF : Genotype frequencies derived from HWEAF. |
| + | HWE GL based Hardy-Weinberg statistics. |
| + | HWE_LLR : log likelihood ratio |
| + | HWE_LPVAL : log p-value |
| + | HWE_DF : degrees of freedom |
| + | AB GL based Allele Balance. |
| + | FIC GL based Inbreeding Coefficient |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt estimate [options] <in.vcf> |
| + | |
| + | options : -s print site information only without genotypes [false] |
| + | -o output VCF/VCF.GZ/BCF file [-] |
| + | -e comma separated estimates to be computed [] |
| + | -f filter expression [] |
| + | -I File containing list of intervals |
| + | -i Intervals |
| + | -? displays help |
| + | </div> |
| + | </div> |
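
The genotype based estimates in the AF group (AC, AN, NS and AF) can be illustrated with a small Python sketch. It follows the usual VCF definitions of these fields rather than vt's source code; the function name `allele_counts` is hypothetical, and counting a partially missing genotype as a sample with data is an assumption of this sketch.

```python
import re


def allele_counts(genotypes, n_alt=1):
    """Compute AC, AN, NS and AF from a list of GT strings."""
    ac = [0] * n_alt   # alternate allele counts, one per ALT allele
    an = 0             # total number of called alleles
    ns = 0             # number of samples with at least one called allele
    for gt in genotypes:
        alleles = [a for a in re.split(r"[/|]", gt) if a != "."]
        if alleles:
            ns += 1
        for allele in alleles:
            an += 1
            if int(allele) > 0:
                ac[int(allele) - 1] += 1
    af = [count / an if an else float("nan") for count in ac]
    return ac, an, ns, af


ac, an, ns, af = allele_counts(["0/0", "0/1", "1|1", "./."])
print(ac, an, ns, af)
```

The likelihood based estimates (MLEAF, HWEAF and the others) need the PL or GL values and are not covered by this sketch.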
| + | |
| + | === Profile Mendelian Errors === |
| + | |
| + | Profile Mendelian errors |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #profile mendelian errors found in vt.genotypes.bcf, generate [[media:mendel.pdf|tables]] in the directory mendel, requires pdflatex. |
| + | vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel |
| + | |
| + | The pedigree file format is described [[Vt#Pedigree File|here]]. |
| + | |
| + | #this is a sample output for mendelian error profiling. |
| + | #R and A stand for reference and alternate allele respectively. |
| + | #Error% - mendelian error (confounded with de novo mutation) |
| + | #HomHet - Homozygous-Heterozygous genotype ratios |
| + | #Het% - proportion of hets |
| + | Mendelian Errors <br> |
| + | Father Mother R/R R/A A/A Error(%) HomHet Het(%) |
| + | R/R R/R 14889 210 38 1.64 nan nan |
| + | R/R R/A 3403 3497 74 1.06 0.97 50.68 |
| + | R/R A/A 176 1482 155 18.26 nan nan |
| + | R/A R/R 3665 3652 68 0.92 1.00 49.91 |
| + | R/A R/A 1015 3151 990 0.00 0.64 61.11 |
| + | R/A A/A 43 1300 1401 1.57 1.08 48.13 |
| + | A/A R/R 172 1365 147 18.94 nan nan |
| + | A/A R/A 47 1164 1183 1.96 1.02 49.60 |
| + | A/A A/A 20 78 5637 1.71 nan nan <br> |
| + | Parental R/R R/A A/A Error(%) HomHet Het(%) |
| + | R/R R/R 14889 210 38 1.64 nan nan |
| + | R/R R/A 7068 7149 142 0.99 0.99 50.28 |
| + | R/R A/A 348 2847 302 18.59 nan nan |
| + | R/A R/A 1015 3151 990 0.00 0.64 61.11 |
| + | R/A A/A 90 2464 2584 1.75 1.05 48.81 |
| + | A/A A/A 20 78 5637 1.71 nan nan <br> |
| + | Parental R/R R/A A/A Error(%) HomHet Het(%) |
| + | HOM HOM 14909 288 5675 1.66 nan nan |
| + | HOM HET 7158 9613 2726 1.19 1.00 49.90 |
| + | HET HET 1015 3151 990 0.00 0.64 61.11 |
| + | HOMREF HOMALT 348 2847 302 18.59 nan nan <br> |
| + | total mendelian error : 2.505% |
| + | no. of trios : 2 |
| + | no. of variants : 25346 |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | profile_mendelian v0.5 |
| + | |
| + | usage : vt profile_mendelian [options] <in.vcf> |
| + | |
| + | options : -q minimum genotype quality |
| + | -d minimum depth |
| + | -r reference sequence fasta file [] |
| + | -x output latex directory [] |
| + | -p pedigree file |
| + | -I file containing list of intervals [] |
| + | -i intervals |
| + | -? displays help |
| + | </div> |
| + | </div> |
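
The Error(%) column of the sample output can be reproduced by counting child genotypes that cannot be formed from one allele of each parent. The Python sketch below is a reconstruction that matches the sample rows; it is not vt's code, and the names `possible_children` and `mendelian_error_percent` are made up for this illustration.

```python
from itertools import product

GENOTYPES = ("R/R", "R/A", "A/A")  # column order of the child counts


def possible_children(father, mother):
    """Child genotypes obtainable from one allele of each parent."""
    return {
        "/".join(sorted(pair, reverse=True))  # puts R before A
        for pair in product(father.split("/"), mother.split("/"))
    }


def mendelian_error_percent(father, mother, child_counts):
    """Percentage of children whose genotype is inconsistent with the parents."""
    allowed = possible_children(father, mother)
    errors = sum(n for gt, n in zip(GENOTYPES, child_counts)
                 if gt not in allowed)
    return 100.0 * errors / sum(child_counts)


print(f"{mendelian_error_percent('R/R', 'R/R', (14889, 210, 38)):.2f}")
print(f"{mendelian_error_percent('R/R', 'A/A', (176, 1482, 155)):.2f}")
print(f"{mendelian_error_percent('R/A', 'R/A', (1015, 3151, 990)):.2f}")
```

A cross of two heterozygous parents can produce every genotype, which is why its error rate is always reported as 0.00.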
| + | |
| + | === Profile SNPs === |
| + | |
| + | Profile SNPs. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]]. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #profile snps found in 20.sites.vcf |
| + | vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa -i 20 |
| + | |
| + | #this is a sample output for SNP profiling. |
| + | # square brackets contain the ts/tv ratio. |
| + | # The numbers in round brackets are the count of SNPs in low complexity regions and the total number of SNPs respectively. |
| + | # Low complexity shows what proportion of the SNPs are in low complexity regions. |
| + | data set |
| + | No. SNPs : 508603 [2.09] |
| + | Low complexity : 0.08 (39837/508603) <br> |
| + | 1000g |
| + | A-B 109970 [1.39] |
| + | A&B 398633 [2.37] |
| + | B-A 1340682 [2.26] |
| + | Precision 78.4% |
| + | Sensitivity 22.9% <br> |
| + | dbsnp |
| + | A-B 324063 [1.99] |
| + | A&B 184540 [2.29] |
| + | B-A 103893 [2.60] |
| + | Precision 36.3% |
| + | Sensitivity 64.0% |
| + | |
| + | # This file contains information on how to process reference data sets. |
| + | # |
| + | # dataset - name of data set, this label will be printed. |
| + | # type - True Positives (TP) and False Positives (FP) |
| + | # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively |
| + | # - annotation |
| + | # file is used for GENCODE annotation of frame shift and non frame shift Indels |
| + | # filter - filter applied to variants for this particular data set |
| + | # path - path of indexed BCF file |
| + | #dataset type filter path |
| + | 1000g TP N_ALLELE==2&&VTYPE==SNP /net/fantasia/home/atks/ref/vt/grch37/1000G.v5.snps.indels.complex.svs.sites.bcf |
| + | dbsnp TP N_ALLELE==2&&VTYPE==SNP /net/fantasia/home/atks/ref/vt/grch37/dbSNP138.snps.indels.complex.sites.bcf |
| + | GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz |
| + | DUST cplx_annotation . /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt profile_snps [options] <in.vcf> |
| + | |
| + | options : -f filter expression [] |
| + | -g file containing list of reference datasets [] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -r reference sequence fasta file [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
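
The reference data set file shown above is a plain text table with comment lines. The Python sketch below reads such a file into a list of records. It assumes that the four columns are separated by whitespace and that filter expressions contain no spaces, neither of which is stated in this section. The function name `parse_reference_list` and the example paths are placeholders.

```python
def parse_reference_list(text):
    """Parse the dataset/type/filter/path table, skipping # comment lines."""
    datasets = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes exactly four whitespace separated columns per record.
        name, kind, filt, path = line.split()
        datasets.append(
            {"dataset": name, "type": kind, "filter": filt, "path": path}
        )
    return datasets


example = """\
#dataset type filter path
1000g TP N_ALLELE==2&&VTYPE==SNP /path/to/1000g.sites.bcf
DUST cplx_annotation . /path/to/mdust.bed.gz
"""
for record in parse_reference_list(example):
    print(record["dataset"], record["type"], record["filter"], record["path"])
```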
| + | |
| + | === Profile Indels === |
| + | |
| + | Profile Indels. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]]. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #profile indels found in mills.vcf |
| + | vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa -i 20 |
| + | |
| + | #this is a sample output for indel profiling. |
| + | # square brackets contain the ins/del ratio. |
| + | # the FS/NFS field is the proportion of coding indels that are frame shifted. |
| + | # The numbers in round brackets are the counts of frame shift and non frame shift indels respectively. |
| + | data set |
| + | No Indels : 46974 [0.89] |
| + | FS/NFS : 0.26 (8/23) <br> |
| + | dbsnp |
| + | A-B 30704 [0.92] |
| + | A&B 16270 [0.83] |
| + | B-A 2049488 [1.52] |
| + | Precision 34.6% |
| + | Sensitivity 0.8% <br> |
| + | mills |
| + | A-B 43234 [0.88] |
| + | A&B 3740 [1.00] |
| + | B-A 203278 [0.98] |
| + | Precision 8.0% |
| + | Sensitivity 1.8% <br> |
| + | mills.chip |
| + | A-B 46847 [0.89] |
| + | A&B 127 [0.90] |
| + | B-A 8777 [0.93] |
| + | Precision 0.3% |
| + | Sensitivity 1.4% <br> |
| + | affy.exome.chip |
| + | A-B 46911 [0.89] |
| + | A&B 63 [0.43] |
| + | B-A 33997 [0.47] |
| + | Precision 0.1% |
| + | Sensitivity 0.2% <br> |
| + | |
| + | # This file contains information on how to process reference data sets. |
| + | # dataset - name of data set, this label will be printed. |
| + | # type - True Positives (TP) and False Positives (FP). |
| + | # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively. |
| + | # - annotation. |
| + | # file is used for GENCODE annotation of frame shift and non frame shift Indels. |
| + | # filter - filter applied to variants for this particular data set. |
| + | # path - path of indexed BCF file. |
| + | #dataset type filter path |
| + | 1000g TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/1000G.snps_indels.sites.bcf |
| + | mills TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/mills.208620indels.sites.bcf |
| + | dbsnp TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/dbsnp.13147541variants.sites.bcf |
| + | GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.cds.bed.gz |
| + | DUST cplx_annotation . /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt profile_indels [options] <in.vcf> |
| + | |
| + | options : -g file containing list of reference datasets [] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -r reference sequence fasta file [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Profile VNTRs === |
| + | |
| + | Profile VNTRs. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]]. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | |
| + | #profiles a set of VNTRs |
| + | vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt |
| + | |
| + | |
| + | profile_vntrs v0.5 |
| + | |
| + | no VNTRs 5660874 #number of VNTRs in vntrs.sites.bcf |
| + | no low complexity 2686460 (47.46%) #number of VNTRs in low complexity region determined by MDUST |
| + | no coding 17911 (0.32%) #number of VNTRs in coding regions determined by GENCODE v7 |
| + | no redundant 1312209 (23.18%) #number of VNTRs involved in overlapping with one another<br> |
| + | trf_lobstr (1638516) #TRF based reference set used in lobSTR, motif lengths 1 to 6. |
| + | A-B 3269285 #TRs specific to vntrs.sites.bcf |
| + | A-B~ 1666185 #TRs in vntrs.sites.bcf that overlap partially with at least one TR in TRF(lobSTR) but do not overlap exactly with another TR. |
| + | A&B1 725404 #TRs in vntrs.sites.bcf that overlap exactly with at least one TR in TRF(lobSTR) |
| + | A&B2 723195 #TRs in TRF(lobSTR) that overlap exactly with at least one TR in vntrs.sites.bcf |
| + | B-A~ 710075 #TRs in TRF(lobSTR) that overlap partially with at least one TR in vntrs.sites.bcf but do not overlap exactly with another TR. |
| + | B-A 205246 #TRs specific to TRF(lobSTR) |
| + | #note that the first 3 rows should sum up to the number of TRs in vntrs.sites.bcf |
| + | #and the 4th to 6th rows should sum up to the number of TRs in TRF(lobSTR) |
| + | #This allows us to see the m to n relationships among overlapping TRs<br> |
| + | trf_repeatseq (1624553) #TRF based reference set used in repeatseq, motif lengths 1 to 6. |
| + | A-B 3291652 |
| + | A-B~ 1650190 |
| + | A&B1 719032 |
| + | A&B2 716838 |
| + | B-A~ 703948 |
| + | B-A 203767 <br> |
| + | trf_vntrseek (230306) #TRF based reference set used in vntrseek, motif lengths 7 to 2000. |
| + | A-B 5384453 |
| + | A-B~ 271302 |
| + | A&B1 5119 |
| + | A&B2 4973 |
| + | B-A~ 92496 |
| + | B-A 132837 <br> |
| + | codis+ (15) #CODIS STRs + 2 STRs from PROMEGA |
| + | A-B 5660794 |
| + | A-B~ 79 |
| + | A&B1 1 |
| + | A&B2 1 |
| + | B-A~ 14 |
| + | B-A 0 |
| + | |
| + | # This file contains information on how to process reference data sets. |
| + | # dataset - name of data set, this label will be printed. |
| + | # type - True Positives (TP) and False Positives (FP). |
| + | # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively. |
| + | # - annotation. |
| + | # file is used for GENCODE annotation of coding VNTRs. |
| + | # filter - filter applied to variants for this particular data set. |
| + | # path - path of indexed BCF file. |
| + | #dataset type filter path |
| + | trf_lobstr TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.lobstr.sites.bcf |
| + | trf_repeatseq TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.repeatseq.sites.bcf |
| + | trf_vntrseek TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.vntrseek.sites.bcf |
| + | codis+ TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/codis.strs.sites.bcf |
| + | GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz |
| + | DUST cplx_annotation . |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt profile_vntrs [options] <in.vcf> |
| + | |
| + | options : -g file containing list of reference datasets [] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -r reference sequence fasta file [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Profile NA12878 === |
| + | |
| + | Profiles the overlap and genotype concordance of NA12878 calls against reference data sets. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #profile NA12878 overlap with broad knowledgebase and illumina platinum genomes for the file vt.genotypes.bcf for chromosome 20. |
| + | vt profile_na12878 vt.genotypes.bcf -g na12878.reference.txt -r hs37d5.fa -i 20 |
| + | |
| + | #this is a sample output for NA12878 profiling. |
| + | #R and A stand for reference and alternate allele respectively. |
| + | #./. denotes a missing genotype. |
| + | #Concordance and discordance are computed over pairs where both genotypes are called. |
| + | data set |
| + | No Indels : 27770 [0.94] |
| + | FS/NFS : 0.26 (8/23) <br> |
| + | broad.kb |
| + | A-B 13071 [1.19] |
| + | A&B 14699 [0.76] |
| + | B-A 21546 [0.62] |
| + | Precision 52.9% |
| + | Sensitivity 40.6% <br> |
| + | illumina.platinum |
| + | A-B 17952 [0.88] |
| + | A&B 9818 [1.07] |
| + | B-A 2418 [0.88] |
| + | Precision 35.4% |
| + | Sensitivity 80.2% <br> |
| + | broad.kb |
| + | R/R R/A A/A ./. |
| + | R/R 346 145 3 5473 |
| + | R/A 3 4133 9 758 |
| + | A/A 2 136 2186 956 |
| + | ./. 2 139 86 322 <br> |
| + | Total genotype pairs : 6963 |
| + | Concordance : 95.72% (6665) |
| + | Discordance : 4.28% (298) <br> |
| + | illumina.platinum |
| + | R/R R/A A/A ./. |
| + | R/R 1768 85 2 0 |
| + | R/A 10 4479 14 0 |
| + | A/A 13 180 3028 0 |
| + | ./. 71 98 70 0<br> |
| + | Total genotype pairs : 9579 |
| + | Concordance : 96.83% (9275) |
| + | Discordance : 3.17% (304) |
| + | |
| + | # This file contains information on how to process reference data sets. |
| + | # |
| + | # dataset - name of data set, this label will be printed. |
| + | # type - True Positives (TP) and False Positives (FP) |
| + | # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively |
| + | # - annotation |
| + | # file is used for GENCODE annotation of frame shift and non frame shift Indels |
| + | # filter - filter applied to variants for this particular data set |
| + | # path - path of indexed BCF file |
| + | #dataset type filter path |
| + | broad.kb TP PASS /net/fantasia/home/atks/dev/vt/bundle/public/grch37/broad.kb.241365variants.genotypes.bcf |
| + | illumina.platinum TP PASS /net/fantasia/home/atks/dev/vt/bundle/public/grch37/NA12878.illumina.platinum.5284448variants.genotypes.bcf |
| + | #gencode.v19 annotation . /net/fantasia/home/atks/dev/vt/bundle/public/grch37/gencode.v19.annotation.gtf.gz |
| + | <div class="mw-collapsible-content"> |
| + | profile_na12878 v0.5 |
| + | |
| + | usage : vt profile_na12878 [options] <in.vcf> |
| + | |
| + | options : -g file containing list of reference datasets [] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -r reference sequence fasta file [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
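
The concordance figures in the sample output follow from the genotype tables: only the 3 by 3 block of called genotypes is counted, and the diagonal holds the agreeing pairs. The Python sketch below reproduces the broad.kb and illumina.platinum numbers. The function name `genotype_concordance` is hypothetical and the code is a reading of the sample output, not vt's implementation.

```python
def genotype_concordance(matrix):
    """Return (total pairs, concordant pairs, concordance %) for a 4x4 table.

    Rows and columns are ordered R/R, R/A, A/A, ./. and pairs with a
    missing genotype on either side are ignored.
    """
    called = [row[:3] for row in matrix[:3]]
    total = sum(sum(row) for row in called)
    agree = sum(called[i][i] for i in range(3))
    return total, agree, 100.0 * agree / total


broad_kb = [
    [346, 145, 3, 5473],
    [3, 4133, 9, 758],
    [2, 136, 2186, 956],
    [2, 139, 86, 322],
]
illumina_platinum = [
    [1768, 85, 2, 0],
    [10, 4479, 14, 0],
    [13, 180, 3028, 0],
    [71, 98, 70, 0],
]

for name, table in (("broad.kb", broad_kb),
                    ("illumina.platinum", illumina_platinum)):
    total, agree, percent = genotype_concordance(table)
    print(f"{name}: {total} pairs, {percent:.2f}% concordant ({agree})")
```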
| + | |
| + | = Variant Calling = |
| + | |
| + | |
| + | === Discover === |
| + | |
| + | Discovers variants from reads in a BAM/CRAM file. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #discover variants from NA12878.bam and write to stdout |
| + | vt discover -b NA12878.bam -s NA12878 -r hs37d5.fa -i 20 |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt discover2 [options] |
| + | |
| + | options : -b input BAM/CRAM file |
| + | -y soft clipped unique sequences cutoff [0] |
| + | -x soft clipped mean quality cutoff [0] |
| + | -w insertion desired type II error [0.0] |
| + | -c insertion desired type I error [0.0] |
| + | -h insertion fractional evidence cutoff [0] |
| + | -g insertion count cutoff [1] |
| + | -n deletion desired type II error [0.0] |
| + | -m deletion desired type I error [0.0] |
| + | -v deletion fractional evidence cutoff [0] |
| + | -u deletion count cutoff [1] |
| + | -k snp desired type II error [0.0] |
| + | -j snp desired type I error [0.0] |
| + | -f snp fractional evidence cutoff [0] |
| + | -e snp evidence count cutoff [1] |
| + | -q base quality cutoff for bases [0] |
| + | -C likelihood ratio cutoff [0] |
| + | -B reference bias [0] |
| + | -a read exclude flag [0x0704] |
| + | -l ignore overlapping reads [false] |
| + | -t MAPQ cutoff for alignments [0] |
| + | -p ploidy [2] |
| + | -s sample ID |
| + | -r reference sequence fasta file [] |
| + | -o output VCF file [-] |
| + | -z ignore MD tags [0] |
| + | -d debug [0] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -? displays help |
| + | |
| + | </div> |
| + | </div> |
| + | |
| + | === Merge candidate variants === |
| + | |
| + | |
| + | Merges candidate variants across samples. Each input VCF file must contain exactly one sample and must have the FORMAT flags E and N. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #merge candidate variants from the VCF files listed in candidates.txt and write to candidate.sites.vcf |
| + | vt merge_candidate_variants -L candidates.txt -o candidate.sites.vcf |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt merge_candidate_variants [options] |
| + | |
| + | options : -L file containing list of input VCF files |
| + | -o output VCF file [-] |
| + | -I file containing list of intervals [] |
| + | -i intervals |
| + | -- ignores the rest of the labeled arguments following this flag |
| + | -h displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Filter overlap === |
| + | |
| + | Tags overlapping variants in a VCF file with the FILTER flag overlap so that they can be removed downstream. |
| + | |
| + | <div class="mw-collapsible mw-collapsed"> |
| + | #annotates variants that are overlapping |
| + |  vt filter_overlap in.vcf -r hs37d5.fa -o overlapped.tagged.vcf |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt filter_overlap [options] <in.vcf> |
| + | |
| + | options : -o output VCF file [-] |
| + | -w window overlap for variants [0] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | <div class="mw-collapsible mw-collapsed"> |
| + | #use remove_overlap instead for versions older than Jan 12, 2017 |
| + | vt remove_overlap in.vcf -r hs37d5.fa -o overlapped.tagged.vcf |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage: vt remove_overlap [options] <in.vcf> |
| + | The old version has the same options except that it lacks the -w option |
| + | The change occurred in the following commit: |
| + | https://github.com/atks/vt/commit/ab5cf7e91b3baa5349f439e6fe92491ae19da1a6 |
| + | </div> |
| + | </div> |
| + | |
| + | === Annotate Indels === |
| + | |
| + | Annotates indels with VNTR information and adds a VNTR record. Facilitates the simultaneous calling of VNTR together with Indels and SNPs. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #annotates indels from VCFs with VNTR information. |
| + | vt annotate_indels in.vcf -r hs37d5.fa -o annotated.sites.vcf |
| + | |
| + | <div style="height:20em; overflow:auto; border: 2px solid #FFF"> |
| + | CHROM POS ID REF ALT QUAL FILTER INFO |
| + | 20 82079 . G A 1255.98 . NSAMPLES=1;E=43;N=51;ESUM=43;NSUM=51;FLANKSEQ=GGAGCACGCC[G/A]CCATGCCCGG |
| + | 20 82217 . G A 1632.77 . NSAMPLES=1;E=56;N=61;ESUM=56;NSUM=61;FLANKSEQ=GAGCCACCGC[G/A]CCCGGCCCAG |
| + | 20 83250 . CTGTGTGTG C . . NSAMPLES=1;E=18;N=35;ESUM=18;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT |
| + | 20 83250 . CTGTGTGTGTG C . . NSAMPLES=1;E=3;N=35;ESUM=3;NSUM=35;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT]TTAGTATTTG;GMOTIF=GT;TR=20:83251:TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG:<VNTR>:GT |
| + | 20 83251 . TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG <VNTR> . . MOTIF=GT;RU=TG;FZ_CONCORDANCE=1;FZ_RL=52;FZ_LL=0;FLANKS=83250,83304;FZ_FLANKS=83250,83303;FZ_RU_COUNTS=26,26;FLANKSEQ=TCTCTCTCTC[TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG]TTTAGTATTT |
| + | 20 83252 . G C 359.204 . NSAMPLES=1;E=13;N=14;ESUM=13;NSUM=14;FLANKSEQ=CTCTCTCTCT[G/C]TGTGTGTGTG |
| + | 20 83260 . G C 500.163 . NSAMPLES=1;E=18;N=34;ESUM=18;NSUM=34;FLANKSEQ=CTGTGTGTGT[G/C]TGTGTGTGTG |
| + | 20 83267 . T C 247.043 . NSAMPLES=1;E=11;N=43;ESUM=11;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT |
| + | 20 83275 . T C 609.669 . NSAMPLES=1;E=24;N=43;ESUM=24;NSUM=43;FLANKSEQ=TGTGTGTGTG[T/C]GTGTGTGTGT |
| + | 20 90008 . C A 1546.88 . NSAMPLES=1;E=52;N=60;ESUM=52;NSUM=60;FLANKSEQ=AACAGAAAAC[C/A]AAATACTGTA |
| + | 20 91088 . C T 1766.04 . NSAMPLES=1;E=58;N=66;ESUM=58;NSUM=66;FLANKSEQ=CCCAGCATAC[C/T]ATGGTTGTGC |
| + | 20 91508 . G A 1266.93 . NSAMPLES=1;E=44;N=53;ESUM=44;NSUM=53;FLANKSEQ=AATTAGTAAG[G/A]CTTACGTAAG |
| + | 20 91707 . C T 888.134 . NSAMPLES=1;E=30;N=53;ESUM=30;NSUM=53;FLANKSEQ=TGATTTTCTA[C/T]AGCAGGACCT |
| + | 20 92527 . A G 828.593 . NSAMPLES=1;E=34;N=40;ESUM=34;NSUM=40;FLANKSEQ=ATTAATTGCC[A/G]TTCTCTCTTT |
| + | 20 93440 . A G 688.144 . NSAMPLES=1;E=24;N=58;ESUM=24;NSUM=58;FLANKSEQ=TTGGATGCAT[A/G]GTCTGTAAAT |
| + | 20 93636 . TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT <VNTR> . . MOTIF=T;RU=T;FZ_CONCORDANCE=0.939394;FZ_RL=35;FZ_LL=0;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FZ_RU_COUNTS=31,33;FLANKSEQ=TCTAGGATTC[TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT |
| + | 20 93646 . C CT . . NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKS=93646,93671;FZ_FLANKS=93635,93671;FLANKSEQ=TTTTTCTTTC[TTTTTTTTTTTTTTTTTTTTTTTT]GAGATGGAGT;GMOTIF=T;TR=20:93636:TTTTTTCTTTCTTTTTTTTTTTTTTTTTTTTTTTT:<VNTR>:T |
| + | 20 93717 . A T 31.7622 . NSAMPLES=1;E=2;N=29;ESUM=2;NSUM=29;FLANKSEQ=CAGTGGCGTG[A/T]TCTTAGATCA |
| + | 20 93931 . G A 628.149 . NSAMPLES=1;E=22;N=53;ESUM=22;NSUM=53;FLANKSEQ=GATTACAGGT[G/A]TGAGCCGCTG |
| + | 20 100699 . C T 809.09 . NSAMPLES=1;E=28;N=61;ESUM=28;NSUM=61;FLANKSEQ=GGTGAAAAAT[C/T]ACCTGTCAGT |
| + | 20 101362 . G A 1087.13 . NSAMPLES=1;E=36;N=67;ESUM=36;NSUM=67;FLANKSEQ=TAATACTGAA[G/A]TTTACTTCTC |
| + | |
| + | </div> |
| + | |
| + | The following trace shows how the algorithm annotates the indel 20:131948:C/CCA: |
| + | |
| + | ============================================ |
| + | ANNOTATING INDEL FUZZILY |
| + | ******************************************** |
| + | EXTRACTIING REGION BY EXACT LEFT AND RIGHT ALIGNMENT |
| + | |
| + | 20:131948:C/CCA |
| + | EXACT REGION 131948-131965 (18) |
| + | CCACACACACACACACAA |
| + | FINAL EXACT REGION 131948-131965 (18) |
| + | CCACACACACACACACAA |
| + | ******************************************** |
| + | PICK CANDIDATE MOTIFS |
| + | |
| + | Longest Allele : C[CA]CACACACACACACACAA |
| + | detecting motifs for an str |
| + | seq: CCACACACACACACACACAA |
| + | len : 20 |
| + | cmax_len : 10 |
| + | candidate motifs: 25 |
| + | AC : 0.894737 2 0 |
| + | AAC : 0.5 3 0.0555556 |
| + | ACC : 0.5 3 0.0555556 |
| + | AAAC : 0.0588235 4 0.125 (< 2 copies) |
| + | ACCC : 0.0588235 4 0.125 (< 2 copies) |
| + | AACAC : 0.5 5 0.02 |
| + | ACACC : 0.5 5 0.02 |
| + | AAACAC : 0.0666667 6 0.0555556 (< 2 copies) |
| + | ACACCC : 0.0666667 6 0.0555556 (< 2 copies) |
| + | AACACAC : 0.5 7 0.0102041 |
| + | ACACACC : 0.5 7 0.0102041 |
| + | AAACACAC : 0.0769231 8 0.03125 (< 2 copies) |
| + | ACACACCC : 0.0769231 8 0.03125 (< 2 copies) |
| + | AACACACAC : 0.5 9 0.00617284 (< 2 copies) |
| + | ACACACACC : 0.5 9 0.00617284 (< 2 copies) |
| + | AAACACACAC : 0.0909091 10 0.02 (< 2 copies) |
| + | ACACACACCC : 0.0909091 10 0.02 (< 2 copies) |
| + | ******************************************** |
| + | PICKING NEXT BEST MOTIF |
| + | |
| + | selected: AC 0.89 0.00 |
| + | ******************************************** |
| + | DETECTING REPEAT TRACT FUZZILY |
| + | ++++++++++++++++++++++++++++++++++++++++++++ |
| + | Exact left/right alignment |
| + | |
| + | repeat_tract : CACACACACACACACA |
| + | position : [131949,131964] |
| + | motif_concordance : 1 |
| + | repeat units : 8 |
| + | exact repeat units : 8 |
| + | total no. of repeat units : 8 |
| + | |
| + | ++++++++++++++++++++++++++++++++++++++++++++ |
| + | Fuzzy right alignment |
| + | |
| + | repeat motif : CA |
| + | rflank : AACTC |
| + | mlen : 2 |
| + | rflen : 5 |
| + | plen : 111 |
| + | |
| + | read : AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACACCACACACACACACACAAACTC |
| + | rlen : 106 |
| + | |
| + | optimal score: 50.5073 |
| + | optimal state: MR |
| + | optimal track: MR|r|0|5 |
| + | optimal probe len: 25 |
| + | optimal path length : 107 |
| + | max j: 106 |
| + | probe: (1~82) [1~10] (1~5) |
| + | read : (1~82) [83~101] (102~106) |
| + | |
| + | motif # : 10 [83,101] |
| + | motif concordance : 95% (9/10) |
| + | motif discordance : 0|1|0|0|0|0|0|0|0|0 |
| + | |
| + | Model: ----------------------------------------------------------------------------------CACACACACACACACACACAAACTC |
| + | SYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYMMMDMMMMMMMMMMMMMMMMMMMMME |
| + | oo++oo++oo++oo++oo++RRRRR |
| + | Read: AGAAATGATAGTCACTTCAACAGATGGTGTTGGGAAAACTGGATTTCCACAGGCAGAACAAATGAAATGGATCCTTATCTTACAC-CACACACACACACACAAACTC |
| + | |
| + | ++++++++++++++++++++++++++++++++++++++++++++ |
| + | Fuzzy left alignment |
| + | |
| + | lflank : ATCTTA |
| + | repeat motif : CA |
| + | lflen : 6 |
| + | mlen : 2 |
| + | plen : 111 |
| + | |
| + | read : ATCTTACACCACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT |
| + | rlen : 105 |
| + | |
| + | optimal score: 50.5858 |
| + | optimal state: Z |
| + | optimal track: Z|m|10|2 |
| + | optimal probe len: 26 |
| + | optimal path length : 106 |
| + | max j: 105 |
| + | mismatch penalty: 3 |
| + | |
| + | model: (1~6) [1~10] |
| + | read : (1~6) [7~25][26~106] |
| + | |
| + | motif # : 10 [7,25] |
| + | motif concordance : 95% (9/10) |
| + | motif discordance : 0|1|0|0|0|0|0|0|0|0 |
| + | |
| + | Model: ATCTTACACACACACACACACACACA-------------------------------------------------------------------------------- |
| + | SMMMMMMMMMDMMMMMMMMMMMMMMMMZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZE |
| + | LLLLLLoo++oo++oo++oo++oo++ |
| + | Read: ATCTTACAC-CACACACACACACACAAACTCAAAATGGATTTAAAGACTTAAATGTGAGCCTGGCAAACTTAAAACTCCTAAAATAAAACAGAAGGGAATATCTTT |
| + | |
| + | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
| + | VNTR Summary |
| + | rid : 19 |
| + | motif : AC |
| + | ru : CA |
| + | |
| + | Exact |
| + | repeat_tract : CACACACACACACACA |
| + | position : [131949,131964] |
| + | reference repeat unit length : 8 |
| + | motif_concordance : 1 |
| + | repeat units : 8 |
| + | exact repeat units : 8 |
| + | total no. of repeat units : 8 |
| + | |
| + | Fuzzy |
| + | repeat_tract : CACCACACACACACACACA |
| + | position : [131946,131964] |
| + | reference repeat unit length : 19 |
| + | motif_concordance : 0.95 |
| + | repeat units : 19 |
| + | exact repeat units : 9 |
| + | total no. of repeat units : 10 |
| + | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt annotate_indels [options] <in.vcf> |
| + | |
| + | options : -v add vntr record [false] |
| + | -x override tags [false] |
| + | -f filter expression [] |
| + | -d debug [false] |
| + | -m mode [f] |
| + | e : by exact alignment f : by fuzzy alignment |
| + | -c classification schemas of tandem repeat [6] |
| + | 1 : lai2003 |
| + | 2 : kelkar2008 |
| + | 3 : fondon2012 |
| + | 4 : ananda2013 |
| + | 5 : willems2014 |
| + | 6 : tan_kang2015 |
| + | -a annotation type [v] |
| + | v : a. output VNTR variant (defined by classification). |
| + | RU repeat unit on reference sequence (CA) |
| + | MOTIF canonical representation (AC) |
| + | RL repeat tract length in bases (11) |
| + | FLANKS flanking positions of repeat tract determined by exact alignment |
| + | RU_COUNTS number of exact repeat units and total number of repeat units in |
| + | repeat tract determined by exact alignment |
| + | FZ_RL fuzzy repeat tract length in bases (11) |
| + | FZ_FLANKS flanking positions of repeat tract determined by fuzzy alignment |
| + | FZ_RU_COUNTS number of exact repeat units and total number of repeat units in |
| + | repeat tract determined by fuzzy alignment |
| + | FLANKSEQ flanking sequence of indel |
| + | LARGE_REPEAT_REGION repeat region exceeding 2000bp |
| + | b. mark indels with overlapping VNTR. |
| + | FLANKS flanking positions of repeat tract determined by exact alignment |
| + | FZ_FLANKS flanking positions of repeat tract determined by fuzzy alignment |
| + | GMOTIF generating motif used in fuzzy alignment |
| + | TR position and alleles of VNTR (20:23413:CACACACACAC:<VNTR>) |
| + | a : annotate each indel with RU, RL, MOTIF, REF. |
| + | -r reference sequence fasta file [] |
| + | -o output VCF file [-] |
| + | -I file containing list of intervals [] |
| + | -i intervals |
| + | -? displays help |
| + | </div> |
| + | </div> |
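
The help text above distinguishes RU (the repeat unit as it appears on the reference, e.g. CA) from MOTIF (its canonical representation, e.g. AC). The sample records shown earlier (RU=TG with MOTIF=GT, and ru CA with motif AC in the trace) are consistent with taking the lexicographically smallest rotation of the repeat unit. The sketch below illustrates that reading only; the function name `canonical_motif` is hypothetical and this is not taken from the vt source code.

```python
def canonical_motif(repeat_unit: str) -> str:
    """Return the lexicographically smallest rotation of a repeat unit.

    Assumption: the canonical MOTIF is the smallest rotation of RU.
    This matches the examples on this page but is not vt's own code.
    """
    unit = repeat_unit.upper()
    if not unit:
        raise ValueError("repeat unit must not be empty")
    # Enumerate every cyclic shift of the unit and keep the smallest one.
    rotations = [unit[i:] + unit[:i] for i in range(len(unit))]
    return min(rotations)


if __name__ == "__main__":
    for ru in ("CA", "TG", "T"):
        print(ru, "->", canonical_motif(ru))
```

Under this assumption the reverse complement is not considered, which is why TG maps to GT rather than AC.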
| + | |
| + | === Construct Probes === |
| + | |
| + | |
| + | Constructs probes for genotyping a variant. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #construct probes from candidates.sites.bcf and write to standard output |
| + | vt construct_probes candidates.sites.bcf -r ref.fa |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt construct_probes [options] <in.vcf> |
| + | |
| + | options : -o output VCF file [-] |
| + | -f minimum flank length [20] |
| + | -r reference sequence fasta file [] |
| + | -I file containing list of intervals [] |
| + | -i intervals [] |
| + | -- ignores the rest of the labeled arguments following this flag |
| + | -h displays help |
| + | </div> |
| + | </div> |
| + | |
| + | === Genotype === |
| + | |
| + | Genotypes variants for each sample. |
| + | |
| + | <div class=" mw-collapsible mw-collapsed"> |
| + | #genotypes variants found in candidates.sites.vcf from sample.bam |
| + | vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf |
| + | <div class="mw-collapsible-content"> |
| + | usage : vt genotype [options] |
| + | |
| + | options : -r reference sequence fasta file [] |
| + | -s sample ID [] |
| + | -o output VCF file [-] |
| + | -b input BAM file [] |
| + | -i input candidate VCF file [] |
| + | -- ignores the rest of the labeled arguments following this flag |
| + | -h displays help |
| + | </div> |
| + | </div> |
| + | |
| + | = Pedigree File = |
| + | |
| + | vt understands an augmented version of the PED format described by [http://zzz.bwh.harvard.edu/plink/data.shtml#ped plink]; the augmentation was introduced by [mailto:hmkang@umich.edu Hyun]. |
| + | The pedigree file has the following mandatory fields: |
| + | |
| + | {| class="wikitable" |
| + | |- |
| + | ! scope="col"| Field |
| + | ! scope="col"| Description |
| + | ! scope="col"| Valid Values |
| + | ! scope="col"| Missing Values |
| + | |- |
| + | |Family ID<br> |
| + | Individual ID<br> |
| + | Paternal ID<br> |
| + | Maternal ID<br> |
| + | Sex<br> |
| + | Phenotype |
| + | |ID of this family <br> |
| + | ID(s) of this individual (comma separated) <br> |
| + | ID of the father <br> |
| + | ID of the mother <br> |
| + | Sex of the individual<br> |
| + | Phenotype |
| + | |[A-Za-z0-9_]+<br> |
| + | [A-Za-z0-9_]+(,[A-Za-z0-9_]+)* <br> |
| + | [A-Za-z0-9_]+ <br> |
| + | [A-Za-z0-9_]+<br> |
| + | 1=male, 2=female, other, male, female<br> |
| + | [A-Za-z0-9_]+ |
| + | | 0 <br> |
| + | cannot be missing <br> |
| + | 0 <br> |
| + | 0 <br> |
| + | other<br> |
| + | -9 |
| + | |} |
| + | |
| + | Examples: |
| + | |
| + | ceu NA12878 NA12891 NA12892 female -9 |
| + | yri NA19240 NA19239 NA19238 female -9 |
| + | |
| + | ceu NA12878 NA12891 NA12892 2 -9 |
| + | yri NA19240 NA19239 NA19238 2 -9 |
| + | |
| + | #allows tools like profile_mendelian to detect duplicates and check for concordance |
| + | ceu NA12878,NA12878A NA12891 NA12892 female case |
| + | yri NA19240 NA19239 NA19238 female control |
| + | |
| + | #individuals with unknown parents have the paternal and maternal IDs set to 0 |
| + | ceu NA12412 0 0 female case |
| + | yri NA19650 0 0 female control |
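
The field rules in the table above can be checked with a few lines of code. The following is a minimal sketch, not vt's own parser: the names `PedRecord` and `parse_ped_line` are hypothetical, and it assumes that fields are separated by whitespace and that lines starting with # are comments, as in the examples.

```python
import re
from typing import List, NamedTuple, Optional

ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")
# Accepted sex values from the table, normalised to a label.
SEX_VALUES = {"1": "male", "2": "female", "male": "male",
              "female": "female", "other": "other"}


class PedRecord(NamedTuple):
    family_id: Optional[str]      # None when given as 0
    individual_ids: List[str]     # one or more comma separated IDs
    paternal_id: Optional[str]    # None when given as 0
    maternal_id: Optional[str]    # None when given as 0
    sex: str
    phenotype: Optional[str]      # None when given as -9


def _check_id(value: str, field: str) -> str:
    if not ID_PATTERN.fullmatch(value):
        raise ValueError(f"invalid {field}: {value!r}")
    return value


def _optional(value: str, missing: str, field: str) -> Optional[str]:
    # The missing-value code is tested first because -9 is not a valid ID.
    return None if value == missing else _check_id(value, field)


def parse_ped_line(line: str) -> Optional[PedRecord]:
    """Parse one pedigree line; return None for blank or comment lines."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    fields = text.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}")
    family, individual, father, mother, sex, phenotype = fields
    if sex not in SEX_VALUES:
        raise ValueError(f"invalid sex: {sex!r}")
    return PedRecord(
        _optional(family, "0", "family ID"),
        [_check_id(i, "individual ID") for i in individual.split(",")],
        _optional(father, "0", "paternal ID"),
        _optional(mother, "0", "maternal ID"),
        SEX_VALUES[sex],
        _optional(phenotype, "-9", "phenotype"),
    )


if __name__ == "__main__":
    print(parse_ped_line("ceu NA12878,NA12878A NA12891 NA12892 female case"))
    print(parse_ped_line("yri NA19650 0 0 2 -9"))
```

Returning the individual IDs as a list keeps the duplicate samples of one individual together, which is what a concordance check across duplicates would need.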
| + | |
| + | = Resource Bundle = |
| + | |
| + | == GRCh37 == |
| + | |
| + | Files are based on hs37d5.fa made by Heng Li. |
| + | |
| + | * External : [ftp://share.sph.umich.edu/vt/grch37 GRCh37 resource bundle] |
| + | * Internal : /net/fantasia/home/atks/ref/vt/grch37 |
| + | |
| + | Read here for [ftp://share.sph.umich.edu/vt/grch37/readme.txt contents]. |
| + | |
| + | == GRCh38 == |
| + | |
| + | Files are based on [https://github.com/lh3/bwa/blob/master/README-alt.md hs38DH.fa] made by Heng Li. |
| + | Note that many of the references are simply lifted over from GRCh37 using Picard's liftover tool with the default options. |
| + | |
| + | * External : [ftp://share.sph.umich.edu/vt/grch38 GRCh38 resource bundle] |
| + | * Internal : /net/fantasia/home/atks/ref/vt/grch38 |
| + | |
| + | Read here for [ftp://share.sph.umich.edu/vt/grch38/readme.txt contents]. |
| + | |
| + | = FAQ = |
| + | |
| + | ==1. vt cannot retrieve sequences from my reference sequence file == |
| + | |
| + | Reference files commonly come from either the UCSC Genome Browser database or the Genome Reference Consortium, |
| + | for example hg19 versus GRCh37. The key difference is that chromosome 1 is named chr1 in the UCSC FASTA files and 1 in the |
| + | Genome Reference Consortium FASTA files. Use the same FASTA file that was used to generate your VCF file. |
| + | |
| + | Another common cause is a corrupted index file for the reference sequence. For a reference file named |
| + | hs37d5.fa or hs37d5.fa.gz, delete the index file (hs37d5.fa.fai or hs37d5.fa.gz.fai) and run the vt command |
| + | again. A new index file will be generated automatically. |
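
A quick way to spot the naming mismatch is to compare the chromosome names in the VCF file with the sequence names in the FASTA index. The sketch below is not part of vt; the function names are hypothetical. It relies only on two format facts: the first tab-separated column of a .fai line is the sequence name, and the first column of a VCF record is CHROM. It reads plain text lines, so a compressed VCF or a BCF file would need to be decoded first.

```python
from typing import Iterable, List, Set


def first_column(line: str) -> str:
    """Return the first tab-separated field of a line."""
    return line.rstrip("\n").split("\t", 1)[0]


def fai_contigs(fai_lines: Iterable[str]) -> Set[str]:
    """Sequence names listed in a FASTA index (.fai)."""
    return {first_column(line) for line in fai_lines if line.strip()}


def vcf_contigs(vcf_lines: Iterable[str]) -> Set[str]:
    """CHROM values used by the records of an uncompressed VCF."""
    return {first_column(line) for line in vcf_lines
            if line.strip() and not line.startswith("#")}


def missing_contigs(vcf_lines: Iterable[str],
                    fai_lines: Iterable[str]) -> List[str]:
    """Chromosome names in the VCF that the reference index lacks."""
    return sorted(vcf_contigs(vcf_lines) - fai_contigs(fai_lines))


if __name__ == "__main__":
    # Toy data with made-up lengths and offsets, for illustration only.
    fai = ["chr1\t1000\t6\t60\t61\n", "chr2\t800\t1030\t60\t61\n"]
    vcf = ["##fileformat=VCFv4.2\n",
           "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
           "1\t100\t.\tA\tG\t.\t.\t.\n"]
    print(missing_contigs(vcf, fai))
```

With real files, the same functions accept open file handles, for example `missing_contigs(open("in.vcf"), open("hs37d5.fa.fai"))`. A non-empty result means the VCF and the reference use different naming schemes.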
| + | |
| + | = How to cite vt? = |
| + | |
| + | If you use normalize: <br> |
| + | [http://bioinformatics.oxfordjournals.org/content/31/13/2202 Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. Unified Representation of Genetic Variants. Bioinformatics (2015) 31(13): 2202-2204] |
| + | |
| + | = Maintained by = |
| + | |
| + | This page is maintained by [mailto:atks@umich.edu Adrian] |