Line 17: |
Line 17: |
| #change directory to vt | | #change directory to vt |
| 2. cd vt <br> | | 2. cd vt <br> |
| + | #update submodules |
| + | 3. git submodule update --init --recursive <br> |
| #run make, note that compilers need to support the c++0x standard | | #run make, note that compilers need to support the c++0x standard |
− | 3. make <br> | + | 4. make <br> |
| #you can test the build | | #you can test the build |
− | 4. make test | + | 5. make test |
| <div class=" mw-collapsible mw-collapsed"> | | <div class=" mw-collapsible mw-collapsed"> |
| An expected output when all is well for the tests is shown here. (click expand =>) | | An expected output when all is well for the tests is shown here. (click expand =>) |
Line 54: |
Line 56: |
| </div> | | </div> |
| | | |
− | Building has been tested on Linux and Mac systems on gcc 4.8.1 and clang 3.4. <br>
| + | === Mac === |
− | Some features of C++11 is used, thus there is a need for newer versions of gcc and clang.
| + | |
| + | You may install vt via homebrew. |
| | | |
− | == Mac ==
| + | brew tap brewsci/bio |
| + | brew tap brewsci/science |
| + | |
| + | brew install brewsci/bio/vt |
| | | |
− | You may also install vt on mac via homebrew.
| |
| | | |
− | brew install homebrew/science/vt
| + | Building has been tested on Linux and Mac systems on gcc 4.8.1 and clang 3.4. <br> |
| + | Some features of C++11 are used, thus there is a need for newer versions of gcc and clang. |
| | | |
| = Updating = | | = Updating = |
Line 458: |
Line 464: |
| There is now an additional option -a which decomposes non-block substitutions into their constituent SNPs and indels. (kindly added by [[https://github.com/holtgrewe holtgrewe@github]]) <br> | | There is now an additional option -a which decomposes non-block substitutions into their constituent SNPs and indels. (kindly added by [[https://github.com/holtgrewe holtgrewe@github]]) <br> |
| There is no exact solution and this decomposition is based on the best guess outcome using a Needleman-Wunsch algorithm. <br> | | There is no exact solution and this decomposition is based on the best guess outcome using a Needleman-Wunsch algorithm. <br> |
− | You might also want to check out [https://github.com/vcflib/vcflib#vcfallelicprimitives vcfallelicprimitives]. | + | You might also want to check out [https://github.com/vcflib/vcflib#vcfallelicprimitives vcfallelicprimitives]. <br> |
| + | <br> |
| + | There are now additional options -m and -d, which ensure that some MNVs are not decomposed. (kindly added by [[https://github.com/jaudoux jaudoux@github]]) <br> |
| + | The motivation comes from:<br> |
| + | *Exome-wide assessment of the functional impact and pathogenicity of multi-nucleotide mutations https://www.biorxiv.org/content/10.1101/258723v2.full<br> |
| + | *Landscape of multi-nucleotide variants in 125,748 human exomes and 15,708 genomes https://www.biorxiv.org/content/10.1101/573378v2.full<br> |
| </div> | | </div> |
| | | |
Line 494: |
Line 505: |
| description : decomposes biallelic block substitutions into its constituent SNPs. <br> | | description : decomposes biallelic block substitutions into its constituent SNPs. <br> |
| usage : vt decompose_blocksub [options] <in.vcf> <br> | | usage : vt decompose_blocksub [options] <in.vcf> <br> |
− | options : -a enable aggressive/alignment mode | + | options : -m keep MNVs (multi-nucleotide variants) [false] |
| + | -a enable aggressive/alignment mode [false] |
| + | -d MNVs max distance (when -m option is used) [2] |
| -o output VCF file [-] | | -o output VCF file [-] |
| -I file containing list of intervals [] | | -I file containing list of intervals [] |
| -i intervals [] | | -i intervals [] |
− | -? displays help | + | -? displays help |
| + | |
| </div> | | </div> |
| </div> | | </div> |
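The basic (non-aggressive) decomposition described above can be illustrated with a minimal Python sketch. It is not vt's implementation: the function name `decompose_blocksub` and the tuple layout are hypothetical, and the sketch assumes a biallelic block substitution whose REF and ALT have equal length, emitting one SNP for each position where the two alleles differ. The -a, -m and -d behaviours are not modelled.

```python
def decompose_blocksub(pos, ref, alt):
    """Split an equal-length block substitution into its constituent SNPs.

    pos is the 1-based position of the first REF base.
    Returns a list of (position, ref_base, alt_base) tuples.
    Hypothetical helper for illustration only.
    """
    if len(ref) != len(alt):
        raise ValueError("not a block substitution: REF and ALT lengths differ")
    # Keep only the positions where the alleles actually differ.
    return [
        (pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]


if __name__ == "__main__":
    # ACG -> GCT at position 100 differs at the first and third base.
    for snp in decompose_blocksub(100, "ACG", "GCT"):
        print(snp)
```

Matching bases in the middle of the block (the C above) produce no record, which is why a three-base substitution can yield only two SNPs.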
Line 559: |
Line 573: |
| === Drop duplicate variants === | | === Drop duplicate variants === |
| | | |
− | Drops duplicate variants that appear later in the file. <br> | + | Drops duplicate variants that appear later in the file. The VCF file must be ordered. <br> |
| If there are OLD_VARIANT tags in the INFO field, the variants in these tags are aggregated in the unique record retained. | | If there are OLD_VARIANT tags in the INFO field, the variants in these tags are aggregated in the unique record retained. |
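A rough Python sketch of this behaviour follows; it also suggests why ordered input matters, since only the variants at the current position need to be remembered. This is an illustration under stated assumptions, not vt's code: the record layout (dicts with `chrom`, `pos`, `ref`, `alt` and an `old_variant` list) and the function name `drop_duplicates` are hypothetical, and the way OLD_VARIANT values are merged (appended without repeats) is a guess.

```python
def drop_duplicates(records):
    """Keep the first occurrence of each variant in position-ordered input.

    records: iterable of dicts with keys chrom, pos, ref, alt and an
    optional old_variant list (hypothetical layout for illustration).
    OLD_VARIANT values of dropped duplicates are merged into the kept record.
    """
    kept_records = []
    current_site = None
    seen_at_site = {}
    for rec in records:
        site = (rec["chrom"], rec["pos"])
        if site != current_site:
            # Ordered input: earlier positions can never reappear,
            # so the per-site memory can be reset.
            current_site = site
            seen_at_site = {}
        key = (rec["ref"], rec["alt"])
        if key in seen_at_site:
            kept = seen_at_site[key]
            for old in rec.get("old_variant", []):
                if old not in kept["old_variant"]:
                    kept["old_variant"].append(old)
        else:
            kept = dict(rec, old_variant=list(rec.get("old_variant", [])))
            seen_at_site[key] = kept
            kept_records.append(kept)
    return kept_records
```

With unordered input the same variant could reappear after the per-site memory has been reset, and the duplicate would be retained.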
| | | |
Line 719: |
Line 733: |
| | | |
| <div class=" mw-collapsible mw-collapsed"> | | <div class=" mw-collapsible mw-collapsed"> |
− | #converts in.bcf to tab format with selected INFO fields | + | #converts in.bcf to tab format with selected INFO and FILTER fields |
− | vt info2tab in.bcf -v -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT | + | vt info2tab in.bcf -u PASS -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT |
− | | |
| <div style="height:6em; overflow:auto; border: 2px solid #FFF"> | | <div style="height:6em; overflow:auto; border: 2px solid #FFF"> |
| + | INPUT |
| + | ===== |
| 20 17548608 . A AC . PASS CENTERS=vbi;NCENTERS=1;OLD_MULTIALLELIC=20:17548598:GAAAAAAAAAAAAA/GAAAAAAAAAAAA/GAAAAAAAAAAAAAA/GAAAAAAAAAA/GAAAAAAAAAAA/GAAAAAAAAAACAAA;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAACAAAG;EX_MOTIF=C;EX_MLEN=1;EX_RU=C;EX_BASIS=C;EX_BLEN=1;EX_REPEAT_TRACT=17548608,17548609;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=2;EX_RL=2;EX_LL=3;EX_RU_COUNTS=0,2;EX_SCORE=0;EX_TRF_SCORE=-14;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=14;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[A]AAAGAAGGAA;MDUST;LOBSTR | | 20 17548608 . A AC . PASS CENTERS=vbi;NCENTERS=1;OLD_MULTIALLELIC=20:17548598:GAAAAAAAAAAAAA/GAAAAAAAAAAAA/GAAAAAAAAAAAAAA/GAAAAAAAAAA/GAAAAAAAAAAA/GAAAAAAAAAACAAA;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAACAAAG;EX_MOTIF=C;EX_MLEN=1;EX_RU=C;EX_BASIS=C;EX_BLEN=1;EX_REPEAT_TRACT=17548608,17548609;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=2;EX_RL=2;EX_LL=3;EX_RU_COUNTS=0,2;EX_SCORE=0;EX_TRF_SCORE=-14;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=14;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[A]AAAGAAGGAA;MDUST;LOBSTR |
| 20 17548608 . AAAAG A . PASS CENTERS=ox1;NCENTERS=1;EX_MOTIF=AAAG;EX_MLEN=4;EX_RU=AAAG;EX_BASIS=AG;EX_BLEN=2;EX_REPEAT_TRACT=17548609,17548612;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=0.75;EX_RL=4;EX_LL=4;EX_RU_COUNTS=0,1;EX_SCORE=0.75;EX_TRF_SCORE=-1;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=13;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[AAAAG]AAGGAACTAC;MDUST;LOBSTR;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAA | | 20 17548608 . AAAAG A . PASS CENTERS=ox1;NCENTERS=1;EX_MOTIF=AAAG;EX_MLEN=4;EX_RU=AAAG;EX_BASIS=AG;EX_BLEN=2;EX_REPEAT_TRACT=17548609,17548612;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=0.75;EX_RL=4;EX_LL=4;EX_RU_COUNTS=0,1;EX_SCORE=0.75;EX_TRF_SCORE=-1;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=13;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[AAAAG]AAGGAACTAC;MDUST;LOBSTR;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAA |
− |
| |
| </div> | | </div> |
− | | + | OUTPUT |
− | CHROM POS REF ALT N_ALLELE EX_RL FZ_RL MDUST LOBSTR VNTRSEEK RMSK EX_REPEAT_TRACT_1 EX_REPEAT_TRACT_2 | + | ====== |
− | 20 17548608 A AC 2 2 13 1 1 0 0 17548608 17548608 | + | CHROM POS REF ALT N_ALLELE PASS EX_RL FZ_RL MDUST LOBSTR VNTRSEEK RMSK EX_REPEAT_TRACT_1 EX_REPEAT_TRACT_2 |
− | 20 17548608 AAAAG A 2 4 13 1 1 0 0 17548609 17548609 | + | 20 17548608 A AC 2 1 2 13 1 1 0 0 17548608 17548608 |
| + | 20 17548608 AAAAG A 2 1 4 13 1 1 0 0 17548609 17548609 |
| | | |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
| usage : vt info2tab [options] <in.vcf> | | usage : vt info2tab [options] <in.vcf> |
| | | |
− | options : -v print variant CHROM,POS,REF,ALT,N_ALLELE [false] | + | options : -d debug [false] |
− | -d debug [false]
| |
| -f filter expression [] | | -f filter expression [] |
− | -t list of info tags to be extracted [] | + | -u list of filter tags to be extracted [] |
| + | -t list of info tags to be extracted [] |
| -o output tab delimited file [-] | | -o output tab delimited file [-] |
| -I file containing list of intervals [] | | -I file containing list of intervals [] |
Line 1,056: |
Line 1,070: |
| </div> | | </div> |
| | | |
− | === Profile SNPs === | + | === Profile Mendelian Errors === |
| | | |
− | Profile SNPs. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]]. | + | Profile Mendelian errors |
| | | |
| <div class=" mw-collapsible mw-collapsed"> | | <div class=" mw-collapsible mw-collapsed"> |
− | #profile snps found in 20.sites.vcf | + | #profile mendelian errors found in vt.genotypes.bcf, generate [[media:mendel.pdf|tables]] in the directory mendel, requires pdflatex. |
− | vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa -i 20 | + | vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel |
| + | |
| + | The pedigree file format is described [[Vt#Pedigree File|here]]. |
| | | |
− | #this is a sample output for indel profiling.
| + | #this is a sample output for mendelian error profiling. |
− | # square brackets contain the ts/tv ratio.
| + | #R and A stand for reference and alternate allele respectively. |
− | # The numbers in curved bracket are the counts of ts and tv SNPs respectively.
| + | #Error% - mendelian error (confounded with de novo mutation) |
− | # Low complexity shows what percent of the SNPs are in low complexity regions.
| + | #HomHet - Homozygous-Heterozygous genotype ratios |
− | data set | + | #Het% - proportion of hets |
− | No. SNPs : 508603 [2.09]
| + | Mendelian Errors <br> |
− | Low complexity : 0.08 (39837/508603) <br> | + | Father Mother R/R R/A A/A Error(%) HomHet Het(%) |
− | 1000g
| + | R/R R/R 14889 210 38 1.64 nan nan |
− | A-B 109970 [1.39]
| + | R/R R/A 3403 3497 74 1.06 0.97 50.68 |
− | A&B 398633 [2.37] | + | R/R A/A 176 1482 155 18.26 nan nan |
− | B-A 1340682 [2.26]
| + | R/A R/R 3665 3652 68 0.92 1.00 49.91 |
− | Precision 78.4% | + | R/A R/A 1015 3151 990 0.00 0.64 61.11 |
− | Sensitivity 22.9% <br> | + | R/A A/A 43 1300 1401 1.57 1.08 48.13 |
− | dbsnp
| + | A/A R/R 172 1365 147 18.94 nan nan |
− | A-B 324063 [1.99] | + | A/A R/A 47 1164 1183 1.96 1.02 49.60 |
− | A&B 184540 [2.29] | + | A/A A/A 20 78 5637 1.71 nan nan <br> |
− | B-A 103893 [2.60] | + | Parental R/R R/A A/A Error(%) HomHet Het(%) |
− | Precision 36.3% | + | R/R R/R 14889 210 38 1.64 nan nan |
− | Sensitivity 64.0% | + | R/R R/A 7068 7149 142 0.99 0.99 50.28 |
| + | R/R A/A 348 2847 302 18.59 nan nan |
| + | R/A R/A 1015 3151 990 0.00 0.64 61.11 |
| + | R/A A/A 90 2464 2584 1.75 1.05 48.81 |
| + | A/A A/A 20 78 5637 1.71 nan nan <br> |
| + | Parental R/R R/A A/A Error(%) HomHet Het(%) |
| + | HOM HOM 14909 288 5675 1.66 nan nan |
| + | HOM HET 7158 9613 2726 1.19 1.00 49.90 |
| + | HET HET 1015 3151 990 0.00 0.64 61.11 |
| + | HOMREF HOMALT 348 2847 302 18.59 nan nan <br> |
| + | total mendelian error : 2.505% |
| + | no. of trios : 2 |
| + | no. of variants : 25346 |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | profile_mendelian v0.5 |
| + | |
| + | usage : vt profile_mendelian [options] <in.vcf> |
| | | |
− | # This file contains information on how to process reference data sets. | + | options : -q minimum genotype quality |
− | #
| + | -d minimum depth |
− | # dataset - name of data set, this label will be printed.
| + | -r reference sequence fasta file [] |
− | # type - True Positives (TP) and False Positives (FP)
| + | -x output latex directory [] |
− | # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively
| + | -p pedigree file |
− | # - annotation
| |
− | # file is used for GENCODE annotation of frame shift and non frame shift Indels
| |
− | # filter - filter applied to variants for this particular data set
| |
− | # path - path of indexed BCF file
| |
− | #dataset type filter path
| |
− | 1000g TP N_ALLELE==2&&VTYPE==SNP /net/fantasia/home/atks/ref/vt/grch37/1000G.v5.snps.indels.complex.svs.sites.bcf
| |
− | dbsnp TP N_ALLELE==2&&VTYPE==SNP /net/fantasia/home/atks/ref/vt/grch37/dbSNP138.snps.indels.complex.sites.bcf
| |
− | GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz
| |
− | DUST cplx_annotation . /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz
| |
− | | |
− | <div class="mw-collapsible-content">
| |
− | usage : vt profile_snps [options] <in.vcf>
| |
− | | |
− | options : -f filter expression []
| |
− | -g file containing list of reference datasets [] | |
| -I file containing list of intervals [] | | -I file containing list of intervals [] |
− | -i intervals [] | + | -i intervals |
− | -r reference sequence fasta file []
| + | -? displays help |
− | -? displays help
| |
| </div> | | </div> |
| </div> | | </div> |
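The Error(%), HomHet and Het(%) columns of the sample Mendelian error table can be reproduced from the offspring genotype counts. The Python sketch below shows one way to do so. The formulas are inferred from the sample numbers rather than taken from vt's source, and the function name `row_stats` is hypothetical. It assumes that an offspring genotype counts as an error when it cannot be formed by taking one allele from each parent, and that HomHet and Het(%) are computed over the Mendelian-consistent offspring only, when at least one parent is heterozygous.

```python
import math
from itertools import product


def _genotype(allele1, allele2):
    # Order alleles so that R comes before A, e.g. ("A", "R") -> "R/A".
    return "/".join(sorted((allele1, allele2), key="RA".index))


def row_stats(father, mother, counts):
    """Return (error_pct, hom_het, het_pct) for one parental genotype pair.

    father, mother: "R/R", "R/A" or "A/A".
    counts: dict mapping offspring genotype to observed count.
    """
    # Offspring genotypes obtainable by drawing one allele from each parent.
    possible = {
        _genotype(a, b)
        for a, b in product(father.split("/"), mother.split("/"))
    }
    total = sum(counts.values())
    errors = sum(n for g, n in counts.items() if g not in possible)
    error_pct = 100.0 * errors / total

    if "R/A" in (father, mother):
        het = counts["R/A"]
        hom = sum(n for g, n in counts.items() if g in possible and g != "R/A")
        hom_het = hom / het
        het_pct = 100.0 * het / (hom + het)
    else:
        # Both parents homozygous: no hom/het segregation to summarise.
        hom_het = het_pct = math.nan
    return error_pct, hom_het, het_pct


if __name__ == "__main__":
    # Father R/R, mother R/A row of the sample table.
    stats = row_stats("R/R", "R/A", {"R/R": 3403, "R/A": 3497, "A/A": 74})
    print(["%.2f" % value for value in stats])
```

For the R/R by R/A row this gives 74/6974 for the error rate, 3403/3497 for HomHet and 3497/6900 for the het proportion, which round to the 1.06, 0.97 and 50.68 shown in the sample table.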
| | | |
− | === Profile Indels === | + | === Profile SNPs === |
| | | |
− | Profile Indels. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]]. | + | Profile SNPs. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]]. |
| | | |
| <div class=" mw-collapsible mw-collapsed"> | | <div class=" mw-collapsible mw-collapsed"> |
− | #profile indels found in mills.vcf | + | #profile snps found in 20.sites.vcf |
− | vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa -i 20 | + | vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa -i 20 |
| | | |
| #this is a sample output for indel profiling. | | #this is a sample output for indel profiling. |
− | # square brackets contain the ins/del ratio. | + | # square brackets contain the ts/tv ratio. |
− | # for the FS/NFS field, that is the proportion of coding indels that are frame shifted.
| + | # The numbers in curved bracket are the counts of ts and tv SNPs respectively. |
− | # The numbers in curved bracket are the counts of frame shift and non frame shift indels respectively. | + | # Low complexity shows what percent of the SNPs are in low complexity regions. |
− | data set | + | data set |
− | No Indels : 46974 [0.89]
| + | No. SNPs : 508603 [2.09] |
− | FS/NFS : 0.26 (8/23) <br>
| + | Low complexity : 0.08 (39837/508603) <br> |
| + | 1000g |
| + | A-B 109970 [1.39] |
| + | A&B 398633 [2.37] |
| + | B-A 1340682 [2.26] |
| + | Precision 78.4% |
| + | Sensitivity 22.9% <br> |
| dbsnp | | dbsnp |
− | A-B 30704 [0.92] | + | A-B 324063 [1.99] |
− | A&B 16270 [0.83] | + | A&B 184540 [2.29] |
− | B-A 2049488 [1.52]
| + | B-A 103893 [2.60] |
− | Precision 34.6%
| + | Precision 36.3% |
− | Sensitivity 0.8% <br>
| + | Sensitivity 64.0% |
− | mills
| |
− | A-B 43234 [0.88] | |
− | A&B 3740 [1.00] | |
− | B-A 203278 [0.98] | |
− | Precision 8.0% | |
− | Sensitivity 1.8% <br>
| |
− | mills.chip
| |
− | A-B 46847 [0.89]
| |
− | A&B 127 [0.90]
| |
− | B-A 8777 [0.93]
| |
− | Precision 0.3%
| |
− | Sensitivity 1.4% <br> | |
− | affy.exome.chip
| |
− | A-B 46911 [0.89]
| |
− | A&B 63 [0.43]
| |
− | B-A 33997 [0.47]
| |
− | Precision 0.1%
| |
− | Sensitivity 0.2% <br>
| |
| | | |
| # This file contains information on how to process reference data sets. | | # This file contains information on how to process reference data sets. |
| + | # |
| # dataset - name of data set, this label will be printed. | | # dataset - name of data set, this label will be printed. |
− | # type - True Positives (TP) and False Positives (FP). | + | # type - True Positives (TP) and False Positives (FP) |
− | # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively. | + | # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively |
− | # - annotation. | + | # - annotation |
− | # file is used for GENCODE annotation of frame shift and non frame shift Indels. | + | # file is used for GENCODE annotation of frame shift and non frame shift Indels |
− | # filter - filter applied to variants for this particular data set. | + | # filter - filter applied to variants for this particular data set |
− | # path - path of indexed BCF file. | + | # path - path of indexed BCF file |
− | #dataset type filter path | + | #dataset type filter path |
− | 1000g TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/1000G.snps_indels.sites.bcf | + | 1000g TP N_ALLELE==2&&VTYPE==SNP /net/fantasia/home/atks/ref/vt/grch37/1000G.v5.snps.indels.complex.svs.sites.bcf |
− | mills TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/mills.208620indels.sites.bcf
| + | dbsnp TP N_ALLELE==2&&VTYPE==SNP /net/fantasia/home/atks/ref/vt/grch37/dbSNP138.snps.indels.complex.sites.bcf |
− | dbsnp TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/dbsnp.13147541variants.sites.bcf | + | GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz |
− | GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.cds.bed.gz | + | DUST cplx_annotation . /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz |
− | DUST cplx_annotation . /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz | |
| | | |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
− | usage : vt profile_indels [options] <in.vcf> | + | usage : vt profile_snps [options] <in.vcf> |
| | | |
− | options : -g file containing list of reference datasets [] | + | options : -f filter expression [] |
| + | -g file containing list of reference datasets [] |
| -I file containing list of intervals [] | | -I file containing list of intervals [] |
| -i intervals [] | | -i intervals [] |
Line 1,177: |
Line 1,183: |
| </div> | | </div> |
| | | |
− | === Profile VNTRs === | + | === Profile Indels === |
| | | |
− | Profile VNTRs. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]]. | + | Profile Indels. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]]. |
| | | |
| <div class=" mw-collapsible mw-collapsed"> | | <div class=" mw-collapsible mw-collapsed"> |
| + | #profile indels found in mills.vcf |
| + | vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa -i 20 |
| | | |
− | #profiles a set of VNTRs | + | #this is a sample output for indel profiling. |
− | vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt | + | # square brackets contain the ins/del ratio. |
− | | + | # for the FS/NFS field, that is the proportion of coding indels that are frame shifted. |
− | | + | # The numbers in curved bracket are the counts of frame shift and non frame shift indels respectively. |
− | profile_vntrs v0.5
| + | data set |
− | | + | No Indels : 46974 [0.89] |
− | no VNTRs 5660874 #number of VNTRs in vntrs.sites.bcf
| + | FS/NFS : 0.26 (8/23) <br> |
− | no low complexity 2686460 (47.46%) #number of VNTRs in low complexity region determined by MDUST
| + | dbsnp |
− | no coding 17911 (0.32%) #number of VNTRs in coding regions determined by GENCODE v7 | + | A-B 30704 [0.92] |
− | no redundant 1312209 (23.18%) #number of VNTRs involved in overlapping with one another<br>
| + | A&B 16270 [0.83] |
− | trf_lobstr (1638516) #TRF based reference set used in lobSTR, motif lengths 1 to 6. | + | B-A 2049488 [1.52] |
− | A-B 3269285 #TRs specific to vntrs.sites.bcf | + | Precision 34.6% |
− | A-B~ 1666185 #TRs in vntrs.sites.bcf that overlap partially with at least one TR in TRF(lobSTR) but does not overlap exactly with another TR. | + | Sensitivity 0.8% <br> |
− | A&B1 725404 #TRs in vntrs.sites.bcf that overlap exactly with at least one TR in TRF(lobSTR) | + | mills |
− | A&B2 723195 #TRs in TRF(lobSTR) that overlap exactly with at least one TR in vntrs.sites.bcf | + | A-B 43234 [0.88] |
− | B-A~ 710075 #TRs in TRF(lobSTR) that overlap partially with at least one TR in vntrs.sites.bcf but does not overlap exactly with another TR. | + | A&B 3740 [1.00] |
− | B-A 205246 #TRs specific to TRF(lobSTR)
| + | B-A 203278 [0.98] |
− | #note that the first 3 rows should sum up to the number of TRs in vntrs.sites.bcf | + | Precision 8.0% |
− | #and the 4th to 6th rows should sum up to the number of TRs in TRF( lobSTR)
| + | Sensitivity 1.8% <br> |
− | #This basically allows us to see the m to n overlapping in overlapping TRs<br>
| + | mills.chip |
− | trf_repeatseq (1624553) #TRF based reference set used in repeatseq, motif lengths 1 to 6. | + | A-B 46847 [0.89] |
− | A-B 3291652 | + | A&B 127 [0.90] |
− | A-B~ 1650190 | + | B-A 8777 [0.93] |
− | A&B1 719032 | + | Precision 0.3% |
− | A&B2 716838 | + | Sensitivity 1.4% <br> |
− | B-A~ 703948 | + | affy.exome.chip |
− | B-A 203767 <br>
| + | A-B 46911 [0.89] |
− | trf_vntrseek (230306) #TRF based reference set used in vntrseek, motif lengths 7 to 2000. | + | A&B 63 [0.43] |
− | A-B 5384453 | + | B-A 33997 [0.47] |
− | A-B~ 271302 | + | Precision 0.1% |
− | A&B1 5119 | + | Sensitivity 0.2% <br> |
− | A&B2 4973 | |
− | B-A~ 92496 | |
− | B-A 132837 <br> | |
− | codis+ (15) #CODIS STRs + 2 STRs from PROMEGA | |
− | A-B 5660794 | |
− | A-B~ 79 | |
− | A&B1 1 | |
− | A&B2 1 | |
− | B-A~ 14 | |
− | B-A 0
| |
| | | |
| # This file contains information on how to process reference data sets. | | # This file contains information on how to process reference data sets. |
Line 1,230: |
Line 1,228: |
| # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively. | | # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively. |
| # - annotation. | | # - annotation. |
− | # file is used for GENCODE annotation of coding VNTRs. | + | # file is used for GENCODE annotation of frame shift and non frame shift Indels. |
| # filter - filter applied to variants for this particular data set. | | # filter - filter applied to variants for this particular data set. |
| # path - path of indexed BCF file. | | # path - path of indexed BCF file. |
− | #dataset type filter path | + | #dataset type filter path |
− | trf_lobstr TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.lobstr.sites.bcf | + | 1000g TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/1000G.snps_indels.sites.bcf |
− | trf_repeatseq TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.repeatseq.sites.bcf | + | mills TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/mills.208620indels.sites.bcf |
− | trf_vntrseek TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.vntrseek.sites.bcf | + | dbsnp TP N_ALLELE==2&&VTYPE==INDEL /net/fantasia/home/atks/ref/vt/grch37/dbsnp.13147541variants.sites.bcf |
− | codis+ TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/codis.strs.sites.bcf | + | GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.cds.bed.gz |
− | GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz | + | DUST cplx_annotation . /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz |
− | DUST cplx_annotation .
| |
| | | |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
− | usage : vt profile_vntrs [options] <in.vcf> | + | usage : vt profile_indels [options] <in.vcf> |
| | | |
| options : -g file containing list of reference datasets [] | | options : -g file containing list of reference datasets [] |
Line 1,252: |
Line 1,249: |
| </div> | | </div> |
| | | |
− | === Profile Mendelian Errors === | + | === Profile VNTRs === |
| | | |
− | Profile Mendelian errors | + | Profile VNTRs. The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]]. |
| | | |
| <div class=" mw-collapsible mw-collapsed"> | | <div class=" mw-collapsible mw-collapsed"> |
− | #profile mendelian errors found in vt.genotypes.bcf, generate [[media:mendel.pdf|tables]] in the directory mendel, requires pdflatex.
| |
− | vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel
| |
| | | |
− | pedigree file format is described in [http://csg.sph.umich.edu//abecasis/merlin/tour/input_files.html here]
| + | #profiles a set of VNTRs |
| + | vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt |
| + | |
| | | |
− | #this is a sample output for mendelian error profiling.
| + | profile_vntrs v0.5 |
− | #R and A stand for reference and alternate allele respectively.
| + | |
− | #Error% - mendelian error (confounded with de novo mutation)
| + | no VNTRs 5660874 #number of VNTRs in vntrs.sites.bcf |
− | #HomHet - Homozygous-Heterozygous genotype ratios
| + | no low complexity 2686460 (47.46%) #number of VNTRs in low complexity region determined by MDUST |
− | #Het% - proportion of hets
| + | no coding 17911 (0.32%) #number of VNTRs in coding regions determined by GENCODE v7 |
− | Mendelian Errors <br>
| + | no redundant 1312209 (23.18%) #number of VNTRs involved in overlapping with one another<br> |
− | Father Mother R/R R/A A/A Error(%) HomHet Het(%)
| + | trf_lobstr (1638516) #TRF based reference set used in lobSTR, motif lengths 1 to 6. |
− | R/R R/R 14889 210 38 1.64 nan nan
| + | A-B 3269285 #TRs specific to vntrs.sites.bcf |
− | R/R R/A 3403 3497 74 1.06 0.97 50.68
| + | A-B~ 1666185 #TRs in vntrs.sites.bcf that overlap partially with at least one TR in TRF(lobSTR) but does not overlap exactly with another TR. |
− | R/R A/A 176 1482 155 18.26 nan nan
| + | A&B1 725404 #TRs in vntrs.sites.bcf that overlap exactly with at least one TR in TRF(lobSTR) |
− | R/A R/R 3665 3652 68 0.92 1.00 49.91
| + | A&B2 723195 #TRs in TRF(lobSTR) that overlap exactly with at least one TR in vntrs.sites.bcf |
− | R/A R/A 1015 3151 990 0.00 0.64 61.11
| + | B-A~ 710075 #TRs in TRF(lobSTR) that overlap partially with at least one TR in vntrs.sites.bcf but does not overlap exactly with another TR. |
− | R/A A/A 43 1300 1401 1.57 1.08 48.13
| + | B-A 205246 #TRs specific to TRF(lobSTR) |
− | A/A R/R 172 1365 147 18.94 nan nan
| + | #note that the first 3 rows should sum up to the number of TRs in vntrs.sites.bcf |
− | A/A R/A 47 1164 1183 1.96 1.02 49.60
| + | #and the 4th to 6th rows should sum up to the number of TRs in TRF( lobSTR) |
− | A/A A/A 20 78 5637 1.71 nan nan <br>
| + | #This basically allows us to see the m to n overlapping in overlapping TRs<br> |
− | Parental R/R R/A A/A Error(%) HomHet Het(%)
| + | trf_repeatseq (1624553) #TRF based reference set used in repeatseq, motif lengths 1 to 6. |
− | R/R R/R 14889 210 38 1.64 nan nan
| + | A-B 3291652 |
− | R/R R/A 7068 7149 142 0.99 0.99 50.28
| + | A-B~ 1650190 |
− | R/R A/A 348 2847 302 18.59 nan nan
| + | A&B1 719032 |
− | R/A R/A 1015 3151 990 0.00 0.64 61.11
| + | A&B2 716838 |
− | R/A A/A 90 2464 2584 1.75 1.05 48.81
| + | B-A~ 703948 |
− | A/A A/A 20 78 5637 1.71 nan nan <br>
| + | B-A 203767 <br> |
− | Parental R/R R/A A/A Error(%) HomHet Het(%)
| + | trf_vntrseek (230306) #TRF based reference set used in vntrseek, motif lengths 7 to 2000. |
− | HOM HOM 14909 288 5675 1.66 nan nan
| + | A-B 5384453 |
− | HOM HET 7158 9613 2726 1.19 1.00 49.90
| + | A-B~ 271302 |
− | HET HET 1015 3151 990 0.00 0.64 61.11
| + | A&B1 5119 |
− | HOMREF HOMALT 348 2847 302 18.59 nan nan <br>
| + | A&B2 4973 |
− | total mendelian error : 2.505%
| + | B-A~ 92496 |
− | no. of trios : 2
| + | B-A 132837 <br> |
− | no. of variants : 25346
| + | codis+ (15) #CODIS STRs + 2 STRs from PROMEGA |
| + | A-B 5660794 |
| + | A-B~ 79 |
| + | A&B1 1 |
| + | A&B2 1 |
| + | B-A~ 14 |
| + | B-A 0 |
| | | |
− | <div class="mw-collapsible-content">
| + | # This file contains information on how to process reference data sets. |
− | profile_mendelian v0.5
| + | # dataset - name of data set, this label will be printed. |
| + | # type - True Positives (TP) and False Positives (FP). |
| + | # overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively. |
| + | # - annotation. |
| + | # file is used for GENCODE annotation of coding VNTRs. |
| + | # filter - filter applied to variants for this particular data set. |
| + | # path - path of indexed BCF file. |
| + | #dataset type filter path |
| + | trf_lobstr TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.lobstr.sites.bcf |
| + | trf_repeatseq TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.repeatseq.sites.bcf |
| + | trf_vntrseek TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/trf.vntrseek.sites.bcf |
| + | codis+ TP VTYPE==VNTR /net/fantasia/home/atks/ref/vt/grch37/codis.strs.sites.bcf |
| + | GENCODE_V19 cds_annotation . /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz |
| + | DUST cplx_annotation . |
| | | |
− | usage : vt profile_mendelian [options] <in.vcf> | + | <div class="mw-collapsible-content"> |
| + | usage : vt profile_vntrs [options] <in.vcf> |
| | | |
− | options : -q minimum genotype quality | + | options : -g file containing list of reference datasets [] |
− | -d minimum depth | + | -I file containing list of intervals [] |
| + | -i intervals [] |
| -r reference sequence fasta file [] | | -r reference sequence fasta file [] |
− | -x output latex directory [] | + | -? displays help |
− | -p pedigree file
| |
− | -I file containing list of intervals []
| |
− | -i intervals
| |
− | -? displays help
| |
| </div> | | </div> |
| </div> | | </div> |
Line 1,450: |
Line 1,464: |
| </div> | | </div> |
| | | |
− | === Remove overlap === | + | === Filter overlap === |
| | | |
| Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap. | | Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap. |
| | | |
− | <div class=" mw-collapsible mw-collapsed"> | + | <div class="mw-collapsible mw-collapsed"> |
| #annotates variants that are overlapping | | #annotates variants that are overlapping |
− | vt remove_overlap in.vcf -r hs37d5.fa -o overlapped.tagged..vcf | + | vt filter_overlap in.vcf -r hs37d5.fa -o overlapped.tagged..vcf |
| | | |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
− | usage : vt remove_overlap [options] <in.vcf> | + | usage : vt filter_overlap [options] <in.vcf> |
| | | |
| options : -o output VCF file [-] | | options : -o output VCF file [-] |
| + | -w window overlap for variants [0] |
| -I file containing list of intervals [] | | -I file containing list of intervals [] |
| -i intervals [] | | -i intervals [] |
| -? displays help | | -? displays help |
| + | </div> |
| + | </div> |
| + | |
| + | <div class="mw-collapsible mw-collapsed"> |
| + | #Use Remove overlap instead for versions older than Jan 12, 2017 |
| + | vt remove_overlap in.vcf -r hs37d5.fa -o overlapped.tagged..vcf |
| + | |
| + | <div class="mw-collapsible-content"> |
| + | usage: vt remove_overlap [options] <in.vcf> |
| + | The old version has the same options except that it lacks the -w option |
| + | The change occurred in the following commit: |
| + | https://github.com/atks/vt/commit/ab5cf7e91b3baa5349f439e6fe92491ae19da1a6 |
| </div> | | </div> |
| </div> | | </div> |
Line 1,705: |
Line 1,732: |
| -- ignores the rest of the labeled arguments following this flag | | -- ignores the rest of the labeled arguments following this flag |
| -h displays help | | -h displays help |
− | </div> | + | </div> |
− | </div> | + | </div> |
− | | + | |
− | === Genotype === | + | === Genotype === |
− | | + | |
− | Genotypes variants for each sample. | + | Genotypes variants for each sample. |
− | | + | |
− | <div class=" mw-collapsible mw-collapsed"> | + | <div class=" mw-collapsible mw-collapsed"> |
− | #genotypes variants found in candidates.sites.vcf from sample.bam | + | #genotypes variants found in candidates.sites.vcf from sample.bam |
− | vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf | + | vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf |
− | <div class="mw-collapsible-content"> | + | <div class="mw-collapsible-content"> |
− | usage : vt genotype [options] | + | usage : vt genotype [options] |
| + | |
| + | options : -r reference sequence fasta file [] |
| + | -s sample ID [] |
| + | -o output VCF file [-] |
| + | -b input BAM file [] |
| + | -i input candidate VCF file [] |
| + | -- ignores the rest of the labeled arguments following this flag |
| + | -h displays help |
| + | </div> |
| + | </div> |
| + | |
| + | = Pedigree File = |
| + | |
| + | vt understands an augmented version of the PED format described by [http://zzz.bwh.harvard.edu/plink/data.shtml#ped plink]; the augmentation was introduced by [mailto:hmkang@umich.edu Hyun]. |
| + | The pedigree file has the following mandatory fields: |
| + | |
| + | {| class="wikitable" |
| + | |- |
| + | ! scope="col"| Field |
| + | ! scope="col"| Description |
| + | ! scope="col"| Valid Values |
| + | ! scope="col"| Missing Values |
| + | |- |
| + | |Family ID<br> |
| + | Individual ID<br> |
| + | Paternal ID<br> |
| + | Maternal ID<br> |
| + | Sex<br> |
| + | Phenotype |
| + | |ID of this family <br> |
| + | ID(s) of this individual (comma separated) <br> |
| + | ID of the father <br> |
| + | ID of the mother <br> |
| + | Sex of the individual<br> |
| + | Phenotype |
| + | |[A-Za-z0-9_]+<br> |
| + | [A-Za-z0-9_]+(,[A-Za-z0-9_]+)* <br> |
| + | [A-Za-z0-9_]+ <br> |
| + | [A-Za-z0-9_]+<br> |
| + | 1=male, 2=female, other, male, female<br> |
| + | [A-Za-z0-9_]+ |
| + | | 0 <br> |
| + | cannot be missing <br> |
| + | 0 <br> |
| + | 0 <br> |
| + | other<br> |
| + | -9 |
| + | |} |
| + | |
| + | Examples: |
| + | |
| + | ceu NA12878 NA12891 NA12892 female -9 |
| + | yri NA19240 NA19239 NA19238 female -9 |
| + | |
| + | ceu NA12878 NA12891 NA12892 2 -9 |
| + | yri NA19240 NA19239 NA19238 2 -9 |
| + | |
| + | #allows tools like profile_mendelian to detect duplicates and check for concordance |
| + | ceu NA12878,NA12878A NA12891 NA12892 female case |
| + | yri NA19240 NA19239 NA19238 female control |
| + | |
| + | #individuals with unknown parents use the missing value 0 for the paternal and maternal IDs |
| + | ceu NA12412 0 0 female case |
| + | yri NA19650 0 0 female control |
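The field rules in the table above can be checked mechanically. The following is a minimal Python sketch, not part of vt: the function name `parse_ped_line` and the returned dictionary layout are hypothetical, and it assumes whitespace-separated fields and accepts the documented missing values (0, other, -9).

```python
import re

# Patterns taken from the "Valid Values" column of the table above.
ID = re.compile(r"[A-Za-z0-9_]+")
ID_LIST = re.compile(r"[A-Za-z0-9_]+(,[A-Za-z0-9_]+)*")
SEXES = {"1", "2", "male", "female", "other"}

FIELDS = ("family", "individual", "father", "mother", "sex", "phenotype")


def parse_ped_line(line):
    """Split one pedigree line into its six mandatory fields and validate them.

    Hypothetical helper for illustration; vt's own parser may behave differently.
    """
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    rec = dict(zip(FIELDS, parts))

    # Family, paternal and maternal IDs share one pattern; the missing value 0 matches it.
    for key in ("family", "father", "mother"):
        if not ID.fullmatch(rec[key]):
            raise ValueError(f"invalid {key} ID: {rec[key]!r}")

    # The individual field may hold several comma-separated IDs (duplicates of one person).
    if not ID_LIST.fullmatch(rec["individual"]):
        raise ValueError(f"invalid individual ID(s): {rec['individual']!r}")

    if rec["sex"] not in SEXES:
        raise ValueError(f"invalid sex: {rec['sex']!r}")

    # -9 is the documented missing phenotype and does not match the ID pattern.
    if rec["phenotype"] != "-9" and not ID.fullmatch(rec["phenotype"]):
        raise ValueError(f"invalid phenotype: {rec['phenotype']!r}")

    rec["individual"] = rec["individual"].split(",")
    return rec
```

Returning the individual IDs as a list keeps the duplicate-sample case (for example `NA12878,NA12878A`) explicit for any downstream concordance check.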
| + | |
| + | = Resource Bundle = |
| + | |
| + | == GRCh37 == |
| + | |
| + | Files are based on hs37d5.fa made by Heng Li. |
| + | |
| + | * External : [ftp://share.sph.umich.edu/vt/grch37 GRCh37 resource bundle] |
| + | * Internal : /net/fantasia/home/atks/ref/vt/grch37 |
| | | |
− | options : -r reference sequence fasta file []
| + | See [ftp://share.sph.umich.edu/vt/grch37/readme.txt readme.txt] for the contents of the bundle. |
− | -s sample ID []
| |
− | -o output VCF file [-]
| |
− | -b input BAM file []
| |
− | -i input candidate VCF file []
| |
− | -- ignores the rest of the labeled arguments following this flag
| |
− | -h displays help
| |
− | </div>
| |
− | </div>
| |
| | | |
− | = Resource Bundle = | + | == GRCh38 == |
| | | |
− | * External : [ftp://share.sph.umich.edu/vt resource bundle]
| + | Files are based on [https://github.com/lh3/bwa/blob/master/README-alt.md hs38DH.fa] made by Heng Li. |
− | * Internal : /net/fantasia/home/atks/ref/vt/grch37
| + | Note that many of the references are simply lifted over from GRCh37 using Picard's liftover tool with the default options. |
| | | |
− | GRCH37 set : Files are based on hs37d5.fa made by Heng Li.
| + | * External : [ftp://share.sph.umich.edu/vt/grch38 GRCh38 resource bundle] |
| + | * Internal : /net/fantasia/home/atks/ref/vt/grch38 |
| | | |
− | {| class="wikitable"
| + | See [ftp://share.sph.umich.edu/vt/grch38/readme.txt readme.txt] for the contents of the bundle. |
− | |-
| |
− | ! scope="col"| data set
| |
− | ! scope="col"| samples
| |
− | ! scope="col"| snps/indels/complex/sv
| |
− | ! scope="col"| description
| |
− | |-
| |
− | |1000G.v5 <br>
| |
− | dbsnp138 <br>
| |
− | 1000G.omni.chip <br>
| |
− | mills <br>
| |
− | mills.chip <br>
| |
− | affy.exome.chip <br>
| |
− | NA12878.broad.kb <br>
| |
− | NA12878.v7.illumina.platinum <br>
| |
− | mdust.bed.gz <br>
| |
− | gencode.cds.bed.gz <br>
| |
− | trf.bed.gz
| |
− | | 0<br>
| |
− | 0<br>
| |
− | 2141<br>
| |
− | 0<br>
| |
− | 158<br>
| |
− | 2122<br>
| |
− | 1<br>
| |
− | 1 <br>
| |
− | NA <br>
| |
− | NA <br>
| |
− | NA
| |
− | | 81316694/3296894/66806/59426 <br>
| |
− | 10588965/2488793/69749/0 <br>
| |
− | 2432554/5/0/0 <br>
| |
− | 0/208753/0/0 <br>
| |
− | 0/8904/0/0 <br>
| |
− | 281875/34389/0/0 <br>
| |
− | 281345/87389/152/0 <br>
| |
− | 3702969/650764/13751/0 <br>
| |
− | NA<br>
| |
− | NA<br>
| |
− | NA
| |
− | |1000G v5. [1000G 2015?]<br>
| |
− | derived from GATK's resource bundle that excludes 1000G variants.<br>
| |
− | 1000G individuals typed on the omni chip [1000G 2015?]<br>
| |
− | indels from [Mills 2006]<br>
| |
− | indels from [Mills 2011]<br>
| |
− | 1000G individuals and others typed on the affymetrix exome chip [1000G 2015?]<br>
| |
− | from GATK's NA12878 knowledgebase.<br>
| |
− | Illumina's platinum genomes version 7<br>
| |
− | regions of low complexity annotated using mdust [Morgulis 2006]<br>
| |
− | coding sequence regions based on GENCODE v19 annotations [Harrow 2012]<br>
| |
− | tandem repeat finder STRs from lobSTR's resource bundle [Gymrek 2012]
| |
− | |}
| |
− |
| |
− | Note: Please let me know if I did not cite a resource properly.
| |
| | | |
| = FAQ = | | = FAQ = |