Changes

2,632 bytes added, 17:31, 13 November 2017
Line 4: Line 4:     
= Installation =
 
= Installation =
 +
 +
== General ==
    
The source files are housed in github.  [https://github.com/samtools/htslib htslib] is  
 
The source files are housed in github.  [https://github.com/samtools/htslib htslib] is  
Line 54: Line 56:  
Building has been tested on Linux and Mac systems on gcc 4.8.1 and clang 3.4. <br>
 
Building has been tested on Linux and Mac systems on gcc 4.8.1 and clang 3.4. <br>
 
Some features of C++11 are used, so newer versions of gcc and clang are required.
 
Some features of C++11 are used, so newer versions of gcc and clang are required.
 +
 +
== Mac ==
 +
 +
You may also install vt on Mac via Homebrew.
 +
 +
  brew install homebrew/science/vt
    
= Updating =
 
= Updating =
Line 111: Line 119:  
   This allows you to analyse only biallelic indels on chromosome 20 that have a PASS filter status.
 
   This allows you to analyse only biallelic indels on chromosome 20 that have a PASS filter status.
 
   vt profile_na12878 vt.bcf -g na12878.reference.txt -r genome.fa -f "N_ALLELE==2&&VTYPE==INDEL&&PASS"  -i 20
 
   vt profile_na12878 vt.bcf -g na12878.reference.txt -r genome.fa -f "N_ALLELE==2&&VTYPE==INDEL&&PASS"  -i 20
 +
 +
  This allows you to extract biallelic indels on chromosome 20 that have a PASS filter status.
 +
  vt view vt.bcf -f "N_ALLELE==2&&VTYPE==INDEL&&PASS"  -i 20
    
Other examples of filters
 
Other examples of filters
Line 132: Line 143:  
   Biallelic variants involving 1bp variants  : LEN==1&&N_ALLELE==2
 
   Biallelic variants involving 1bp variants  : LEN==1&&N_ALLELE==2
 
   Variants with explicit sequences with no Ns : ~VARIANT_CONTAINS_N  
 
   Variants with explicit sequences with no Ns : ~VARIANT_CONTAINS_N  
 +
 +
  REF field
 +
    REF
 +
 +
  ALT field
 +
    ALT
    
   QUAL field
 
   QUAL field
Line 141: Line 158:  
   INFO fields
 
   INFO fields
 
     INFO.<tag>
 
     INFO.<tag>
 
+
 
 +
  A/C SNPs                                    : REF=='A' && ALT=='C'
 +
  AC type of STRs                            : REF=~'^.(AC)+$' || ALT=~'^.(AC)+$'
 
   Passed biallelic SNPs only                  : PASS&&VTYPE==SNP&&N_ALLELE==2
 
   Passed biallelic SNPs only                  : PASS&&VTYPE==SNP&&N_ALLELE==2
 
   Passed Common biallelic SNPs only          : PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005
 
   Passed Common biallelic SNPs only          : PASS&&VTYPE==SNP&&N_ALLELE==2&&INFO.AF>0.005
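The REF and ALT filter examples above can be tried outside vt with a short Python sketch. It assumes that vt's <code>=~</code> operator behaves like an ordinary regular expression search for this pattern, which the page does not state; the helper names <code>is_ac_str</code> and <code>is_a_to_c_snp</code> are hypothetical and exist only for illustration.

```python
import re

# Pattern copied from the filter example: one padding base followed by
# one or more AC repeat units, anchored at both ends of the allele.
AC_STR = re.compile(r"^.(AC)+$")


def is_ac_str(ref: str, alt: str) -> bool:
    """Mimic REF=~'^.(AC)+$' || ALT=~'^.(AC)+$' for a single record."""
    return bool(AC_STR.search(ref) or AC_STR.search(alt))


def is_a_to_c_snp(ref: str, alt: str) -> bool:
    """Mimic REF=='A' && ALT=='C' for a single record."""
    return ref == "A" and alt == "C"


if __name__ == "__main__":
    # Illustrative alleles only, not taken from any real call set.
    examples = [("G", "GACAC"), ("TACAC", "T"), ("A", "C"), ("GAT", "G")]
    for ref, alt in examples:
        print(ref, alt, is_ac_str(ref, alt), is_a_to_c_snp(ref, alt))
```

The leading <code>.</code> in the pattern accounts for the padding base that indel alleles carry in a VCF record, so an insertion written as G to GACAC is recognised through its ALT allele.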
Line 540: Line 559:  
=== Drop duplicate variants ===
 
=== Drop duplicate variants ===
   −
Drops duplicate variants that appear later in the file.  <br>
+
Drops duplicate variants that appear later in the file.  The VCF file must be ordered. <br>
 
If there are OLD_VARIANT tags in the INFO field, the variants in these tags are aggregated in the unique record retained.
 
If there are OLD_VARIANT tags in the INFO field, the variants in these tags are aggregated in the unique record retained.
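The ordering requirement can be illustrated with a minimal Python sketch of streaming de-duplication. This is not vt's implementation: it assumes that two records are duplicates when CHROM, POS, REF and ALT are identical, it leaves out the OLD_VARIANT aggregation, and the function name <code>drop_later_duplicates</code> is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def drop_later_duplicates(records: Iterable[Record]) -> Iterator[Record]:
    """Yield each variant once, assuming records are sorted by CHROM then POS."""
    current_site = None
    seen_at_site = set()
    for chrom, pos, ref, alt in records:
        if (chrom, pos) != current_site:
            # New position: alleles seen at earlier positions can be forgotten.
            current_site = (chrom, pos)
            seen_at_site = set()
        if (ref, alt) in seen_at_site:
            continue  # a later copy of a variant already emitted
        seen_at_site.add((ref, alt))
        yield (chrom, pos, ref, alt)


if __name__ == "__main__":
    # Illustrative records only.
    records = [
        ("20", 100, "A", "C"),
        ("20", 100, "A", "T"),
        ("20", 100, "A", "C"),
        ("20", 101, "A", "C"),
    ]
    for record in drop_later_duplicates(records):
        print(*record)
```

Because the input is ordered, only the alleles at the current position need to be remembered. On an unordered file the same variant could reappear after the position has changed and would not be detected by this approach.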
   Line 633: Line 652:  
</div>
 
</div>
   −
=== Validate ===
+
=== Filter ===
   −
Checks the following properties of a VCF file:
+
Filters variants in a VCF file
#order
  −
#reference sequence consistency
      
<div class=" mw-collapsible mw-collapsed">
 
<div class=" mw-collapsible mw-collapsed">
   #validates lobstr.bcf
+
   #adds a filter tag "refA" to variants whose REF column is the sequence A.
   vt validate lobstr.bcf
+
   vt filter in.bcf -f "REF=='A'" -d "refA"
   −
<div class="mw-collapsible-content">
+
<div class="mw-collapsible-content">
  usage : vt validate [options] <in.vcf>
+
  usage : vt filter [options] <in.vcf> <br>
 
+
  options : -x clear filter [false]
  options : -q do not print invalid records [false]
+
            -f  filter expression []
            -I  file containing list of intervals []
+
            -d  filter tag description []
            -i  intervals []
+
            -t  filter tag []
            -r  reference sequence fasta file []
+
            -o  output VCF file [-]
            -?  displays help
+
            -I  file containing list of intervals []
</div>
+
            -i  intervals
 +
            -?  displays help </div>
 
</div>
 
</div>
   −
=== Extract INFO fields to a tab delimited file ===
+
=== Filter overlap ===
   −
Converts a VCF file and its shared information in the INFO field to a tab delimited file for further analysis.
+
Tags overlapping variants in a VCF file with the FILTER flag overlap.
    
<div class=" mw-collapsible mw-collapsed">
 
<div class=" mw-collapsible mw-collapsed">
   #converts in.bcf to tab format with selected INFO fields
+
   #adds a filter tag "overlap" for overlapping variants within a window size of 1 based on the REF sequence.
   vt info2tab in.bcf -v -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT
+
   vt filter_overlap in.bcf -w 1 out.bcf
   −
  <div style="height:6em; overflow:auto; border: 2px solid #FFF">
+
  todo: option for considering END info tag for detecting overlaps.
  20 17548608 . A AC . PASS CENTERS=vbi;NCENTERS=1;OLD_MULTIALLELIC=20:17548598:GAAAAAAAAAAAAA/GAAAAAAAAAAAA/GAAAAAAAAAAAAAA/GAAAAAAAAAA/GAAAAAAAAAAA/GAAAAAAAAAACAAA;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAACAAAG;EX_MOTIF=C;EX_MLEN=1;EX_RU=C;EX_BASIS=C;EX_BLEN=1;EX_REPEAT_TRACT=17548608,17548609;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=2;EX_RL=2;EX_LL=3;EX_RU_COUNTS=0,2;EX_SCORE=0;EX_TRF_SCORE=-14;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=14;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[A]AAAGAAGGAA;MDUST;LOBSTR
+
 
  20 17548608 . AAAAG A . PASS CENTERS=ox1;NCENTERS=1;EX_MOTIF=AAAG;EX_MLEN=4;EX_RU=AAAG;EX_BASIS=AG;EX_BLEN=2;EX_REPEAT_TRACT=17548609,17548612;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=0.75;EX_RL=4;EX_LL=4;EX_RU_COUNTS=0,1;EX_SCORE=0.75;EX_TRF_SCORE=-1;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=13;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[AAAAG]AAGGAACTAC;MDUST;LOBSTR;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAA
+
<div class="mw-collapsible-content">
 +
  usage : vt filter_overlap [options] <in.vcf>
 +
 
 +
  options : -o  output VCF file [-]
 +
            -w  window overlap for variants [0]
 +
            -I  file containing list of intervals []
 +
            -i  intervals []
 +
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Validate ===
   −
  </div>
+
Checks the following properties of a VCF file:
 +
#order
 +
#reference sequence consistency
   −
  CHROM POS   REF   ALT N_ALLELE  EX_RL  FZ_RL MDUST LOBSTR VNTRSEEK  RMSK EX_REPEAT_TRACT_1 EX_REPEAT_TRACT_2
+
<div class=" mw-collapsible mw-collapsed">
  20 17548608  A   AC 2        2 13 1 1 0   0    17548608                17548608
+
   #validates lobstr.bcf
  20 17548608  AAAAG  A 2        4      13    1 1      0        0    17548609                17548609
+
  vt validate lobstr.bcf
    
<div class="mw-collapsible-content">
 
<div class="mw-collapsible-content">
   usage : vt info2tab [options] <in.vcf>
+
   usage : vt validate [options] <in.vcf>
 
    
 
    
   options : -v print variant CHROM,POS,REF,ALT,N_ALLELE [false]
+
   options : -q do not print invalid records [false]
            -d  debug [false]
+
            -I  file containing list of intervals []
             -f  filter expression []
+
            -i  intervals []
             -t  list of info tags to be extracted []
+
            -r  reference sequence fasta file []
             -o  output tab delimited file [-]
+
            -?  displays help
 +
</div>
 +
</div>
 +
 
 +
=== Extract INFO fields to a tab delimited file ===
 +
 
 +
Converts a VCF file and its shared information in the INFO field to a tab delimited file for further analysis.
 +
 
 +
<div class=" mw-collapsible mw-collapsed">
 +
  #converts in.bcf to tab format with selected INFO and FILTER fields
 +
  vt info2tab in.bcf -u PASS -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT
 +
  <div style="height:6em; overflow:auto; border: 2px solid #FFF">
 +
  INPUT
 +
  =====
 +
  20 17548608 . A AC . PASS CENTERS=vbi;NCENTERS=1;OLD_MULTIALLELIC=20:17548598:GAAAAAAAAAAAAA/GAAAAAAAAAAAA/GAAAAAAAAAAAAAA/GAAAAAAAAAA/GAAAAAAAAAAA/GAAAAAAAAAACAAA;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAACAAAG;EX_MOTIF=C;EX_MLEN=1;EX_RU=C;EX_BASIS=C;EX_BLEN=1;EX_REPEAT_TRACT=17548608,17548609;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=2;EX_RL=2;EX_LL=3;EX_RU_COUNTS=0,2;EX_SCORE=0;EX_TRF_SCORE=-14;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=14;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[A]AAAGAAGGAA;MDUST;LOBSTR
 +
  20 17548608 . AAAAG A . PASS CENTERS=ox1;NCENTERS=1;EX_MOTIF=AAAG;EX_MLEN=4;EX_RU=AAAG;EX_BASIS=AG;EX_BLEN=2;EX_REPEAT_TRACT=17548609,17548612;EX_COMP=100,0,0,0;EX_ENTROPY=0;EX_ENTROPY2=0;EX_KL_DIVERGENCE=2;EX_KL_DIVERGENCE2=4;EX_REF=0.75;EX_RL=4;EX_LL=4;EX_RU_COUNTS=0,1;EX_SCORE=0.75;EX_TRF_SCORE=-1;FZ_MOTIF=A;FZ_MLEN=1;FZ_RU=A;FZ_BASIS=A;FZ_BLEN=1;FZ_REPEAT_TRACT=17548599,17548611;FZ_COMP=100,0,0,0;FZ_ENTROPY=0;FZ_ENTROPY2=0;FZ_KL_DIVERGENCE=2;FZ_KL_DIVERGENCE2=4;FZ_REF=13;FZ_RL=13;FZ_LL=13;FZ_RU_COUNTS=13,13;FZ_SCORE=1;FZ_TRF_SCORE=26;FLANKSEQ=GAAAAAAAAA[AAAAG]AAGGAACTAC;MDUST;LOBSTR;OLD_VARIANT=20:17548598:GAAAAAAAAAAAAAG/GAAAAAAAAAA
 +
  </div>
 +
  OUTPUT
 +
  ======
 +
  CHROM POS   REF   ALT N_ALLELE PASS  EX_RL  FZ_RL MDUST LOBSTR VNTRSEEK  RMSK EX_REPEAT_TRACT_1 EX_REPEAT_TRACT_2
 +
  20 17548608  A   AC 2        1    2      13 1 1 0   0    17548608                17548608
 +
  20 17548608  AAAAG  A 2        1    4      13      1      1      0        0    17548609                17548609
 +
 
 +
<div class="mw-collapsible-content">
 +
  usage : vt info2tab [options] <in.vcf>
 +
 
 +
  options : -d  debug [false]
 +
             -f  filter expression []
 +
             -u  list of filter tags to be extracted []
 +
             -t  list of info tags to be extracted []
 +
             -o  output tab delimited file [-]
 
             -I  file containing list of intervals []
 
             -I  file containing list of intervals []
 
             -i  intervals []
 
             -i  intervals []
Line 909: Line 969:     
Compute features in a VCF file.  Examples of statistics are allele counts, [[Genotype_Likelihood_based_Inbreeding_Coefficient|Genotype Likelihood based Inbreeding Coefficient]].
 
Compute features in a VCF file.  Examples of statistics are allele counts, [[Genotype_Likelihood_based_Inbreeding_Coefficient|Genotype Likelihood based Inbreeding Coefficient]].
[[Genotype_Likelihood_based_Allele_Frequency|Hardy-Weinberg Genotype Likelihood based Allele Frequencies]]
+
[[Genotype_Likelihood_based_Allele_Frequency|Hardy-Weinberg Genotype Likelihood based Allele Frequencies]] <br>
 +
For more customizable feature computation, see [http://genome.sph.umich.edu/wiki/Vt#Estimate estimate]
    
<div class=" mw-collapsible mw-collapsed">
 
<div class=" mw-collapsible mw-collapsed">
Line 995: Line 1,056:  
</div>
 
</div>
   −
=== Profile SNPs ===
+
=== Profile Mendelian Errors ===
   −
Profile SNPs.  The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].
+
Profile Mendelian errors
    
<div class=" mw-collapsible mw-collapsed">
 
<div class=" mw-collapsible mw-collapsed">
   #profile snps found in 20.sites.vcf
+
   #profile mendelian errors found in vt.genotypes.bcf, generate [[media:mendel.pdf|tables]] in the directory mendel, requires pdflatex.
   vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa  -i 20
+
   vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel
   −
  #this is a sample output for indel profiling.
+
  pedigree file format is described in [[Vt#Pedigree File|here]].
  # square brackets contain the ts/tv ratio. 
  −
  # The numbers in curved bracket are the counts of ts and tv SNPs respectively.
  −
  # Low complexity shows what percent of the SNPs are in low complexity regions.
  −
  data set
  −
    No. SNPs          :    508603 [2.09]
  −
        Low complexity :      0.08 (39837/508603) <br>
  −
  1000g
  −
    A-B    109970 [1.39]
  −
    A&B    398633 [2.37]
  −
    B-A    1340682 [2.26]
  −
    Precision    78.4%
  −
    Sensitivity  22.9% <br>
  −
  dbsnp
  −
    A-B    324063 [1.99]
  −
    A&B    184540 [2.29]
  −
    B-A    103893 [2.60]
  −
    Precision    36.3%
  −
    Sensitivity  64.0%
     −
  # This file contains information on how to process reference data sets.
+
  #this is a sample output for mendelian error profiling.
  #
+
  #R and A stand for reference and alternate allele respectively.
  # dataset - name of data set, this label will be printed.
+
  #Error% - mendelian error (confounded with de novo mutation)
  # type    - True Positives (TP) and False Positives (FP)
+
  #HomHet - Homozygous-Heterozygous genotype ratios
  #          overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively
+
  #Het% - proportion of hets
  #         - annotation
+
  Mendelian Errors <br>
  #          file is used for GENCODE annotation of frame shift and non frame shift Indels
+
  Father Mother      R/R          R/A          A/A    Error(%) HomHet    Het(%)
  # filter  - filter applied to variants for this particular data set
+
  R/R    R/R        14889          210          38    1.64      nan    nan
  # path    - path of indexed BCF file
+
  R/R    R/A        3403        3497          74    1.06      0.97  50.68
  #dataset              type            filter                                path
+
  R/R    A/A          176        1482          155    18.26      nan    nan
  1000g                  TP              N_ALLELE==2&&VTYPE==SNP                /net/fantasia/home/atks/ref/vt/grch37/1000G.v5.snps.indels.complex.svs.sites.bcf
+
  R/A    R/R        3665        3652          68    0.92      1.00  49.91
  dbsnp                  TP              N_ALLELE==2&&VTYPE==SNP                /net/fantasia/home/atks/ref/vt/grch37/dbSNP138.snps.indels.complex.sites.bcf
+
  R/A    R/A        1015        3151          990    0.00      0.64  61.11
  GENCODE_V19           cds_annotation  .                                      /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz
+
  R/A    A/A          43        1300        1401    1.57      1.08  48.13
  DUST                  cplx_annotation .                                     /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz
+
  A/A    R/R          172        1365          147    18.94      nan    nan
 
+
  A/A    R/A          47        1164        1183    1.96      1.02  49.60
<div class="mw-collapsible-content">
+
  A/A    A/A          20          78        5637    1.71      nan    nan <br>
   usage : vt profile_snps [options] <in.vcf>
+
  Parental           R/R          R/A          A/A    Error(%) HomHet    Het(%)
 
+
  R/R    R/R        14889          210          38    1.64      nan    nan
  options : -f  filter expression []
+
  R/R    R/A        7068        7149          142    0.99      0.99  50.28
            -g  file containing list of reference datasets []
+
  R/R    A/A          348        2847          302    18.59      nan    nan
            -I file containing list of intervals []
+
  R/A    R/A        1015        3151          990    0.00      0.64 61.11
            -i  intervals []
+
  R/A    A/A          90        2464        2584    1.75      1.05  48.81
            -r  reference sequence fasta file []
+
  A/A    A/A          20          78        5637    1.71      nan    nan  <br>
            -?  displays help
+
  Parental            R/R          R/A          A/A    Error(%) HomHet    Het(%)
</div>
+
  HOM    HOM        14909          288        5675    1.66      nan    nan
</div>
+
  HOM    HET        7158        9613        2726    1.19      1.00  49.90
 +
  HET    HET        1015        3151          990    0.00      0.64  61.11
 +
  HOMREF HOMALT      348        2847          302    18.59      nan    nan  <br>
 +
  total mendelian error :   2.505%
 +
  no. of trios    : 2
 +
  no. of variants : 25346
 +
 
 +
<div class="mw-collapsible-content">
 +
profile_mendelian v0.5
   −
=== Profile Indels ===
+
  usage : vt profile_mendelian [options] <in.vcf>
   −
Profile Indels.  The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].
+
  options : -q  minimum genotype quality
 +
            -d  minimum depth
 +
            -r  reference sequence fasta file []
 +
            -x  output latex directory []
 +
            -p  pedigree file
 +
            -I  file containing list of intervals []
 +
            -i  intervals
 +
          -?  displays help
 +
</div>
 +
</div>
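The Error(%) column in the sample output can be reproduced with a short Python sketch. It assumes that a child genotype counts as an error when it cannot be formed from one allele of each parent, which matches the figures in the table but is inferred from them rather than taken from the vt source. Genotypes are encoded as alternate allele counts (R/R = 0, R/A = 1, A/A = 2) and the function names are hypothetical.

```python
from itertools import product


def transmissible(alt_count: int) -> set:
    """Alternate-allele counts (0 or 1) that a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[alt_count]


def is_consistent(father: int, mother: int, child: int) -> bool:
    """True if the child genotype can be built from one allele of each parent."""
    pairs = product(transmissible(father), transmissible(mother))
    return any(a + b == child for a, b in pairs)


def error_percent(father: int, mother: int, child_counts) -> float:
    """child_counts = (n R/R, n R/A, n A/A) for one parental genotype pair."""
    total = sum(child_counts)
    errors = sum(
        n for child, n in enumerate(child_counts)
        if not is_consistent(father, mother, child)
    )
    return 100.0 * errors / total


if __name__ == "__main__":
    # Rows taken from the "Father Mother" table in the sample output.
    print(round(error_percent(0, 0, (14889, 210, 38)), 2))   # 1.64
    print(round(error_percent(0, 2, (176, 1482, 155)), 2))   # 18.26
    print(round(error_percent(1, 1, (1015, 3151, 990)), 2))  # 0.0
```

A pair of heterozygous parents can produce any child genotype, which is why the R/A by R/A row always reports an error of 0.00 regardless of data quality.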
 +
 
 +
=== Profile SNPs ===
 +
 
 +
Profile SNPs.  The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].
    
<div class=" mw-collapsible mw-collapsed">
 
<div class=" mw-collapsible mw-collapsed">
   #profile indels found in mills.vcf
+
   #profile snps found in 20.sites.vcf
   vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa  -i 20
+
   vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa  -i 20
    
   #this is a sample output for indel profiling.
 
   #this is a sample output for indel profiling.
   # square brackets contain the ins/del ratio
+
   # square brackets contain the ts/tv ratio.   
  # for the FS/NFS field, that is the proportion of coding indels that are frame shifted.   
+
   # The numbers in curved bracket are the counts of ts and tv SNPs respectively.
   # The numbers in curved bracket are the counts of frame shift and non frame shift indels respectively.
+
   # Low complexity shows what percent of the SNPs are in low complexity regions.
   data set
+
  data set
    No Indels :     46974 [0.89]
+
    No. SNPs          :     508603 [2.09]
      FS/NFS :      0.26 (8/23) <br>
+
        Low complexity :      0.08 (39837/508603) <br>
 +
  1000g
 +
    A-B    109970 [1.39]
 +
    A&B    398633 [2.37]
 +
    B-A    1340682 [2.26]
 +
    Precision    78.4%
 +
    Sensitivity  22.9% <br>
 
   dbsnp
 
   dbsnp
     A-B     30704 [0.92]
+
     A-B     324063 [1.99]
     A&B     16270 [0.83]
+
     A&B     184540 [2.29]
     B-A   2049488 [1.52]
+
     B-A     103893 [2.60]
     Precision    34.6%
+
     Precision    36.3%
     Sensitivity  0.8% <br>
+
     Sensitivity 64.0%
   mills
+
 
    A-B      43234 [0.88]
+
   # This file contains information on how to process reference data sets.
    A&B      3740 [1.00]
+
   #
    B-A    203278 [0.98]
+
  # dataset - name of data set, this label will be printed.
    Precision     8.0%
+
  # type    - True Positives (TP) and False Positives (FP)
    Sensitivity   1.8% <br>
+
  #          overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively
   mills.chip
+
  #        - annotation
    A-B      46847 [0.89]
+
   #          file is used for GENCODE annotation of frame shift and non frame shift Indels
    A&B        127 [0.90]
+
   # filter  - filter applied to variants for this particular data set
    B-A      8777 [0.93]
+
  # path    - path of indexed BCF file
    Precision    0.3%
+
  #dataset              type            filter                                path
    Sensitivity   1.4% <br>
+
  1000g                  TP              N_ALLELE==2&&VTYPE==SNP                /net/fantasia/home/atks/ref/vt/grch37/1000G.v5.snps.indels.complex.svs.sites.bcf
   affy.exome.chip
+
  dbsnp                  TP              N_ALLELE==2&&VTYPE==SNP                /net/fantasia/home/atks/ref/vt/grch37/dbSNP138.snps.indels.complex.sites.bcf
    A-B      46911 [0.89]
+
   GENCODE_V19            cds_annotation  .                                      /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz
    A&B        63 [0.43]
+
   DUST                  cplx_annotation  .                                     /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz
    B-A      33997 [0.47]
+
 
    Precision    0.1%
+
<div class="mw-collapsible-content">
    Sensitivity  0.2% <br>
+
  usage : vt profile_snps [options] <in.vcf>
   −
   # This file contains information on how to process reference data sets.
+
   options : -f filter expression []
  # dataset - name of data set, this label will be printed.
+
            -g  file containing list of reference datasets []
  # type    - True Positives (TP) and False Positives (FP).
  −
  #          overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively.
  −
  #        - annotation.
  −
  #          file is used for GENCODE annotation of frame shift and non frame shift Indels.
  −
  # filter - filter applied to variants for this particular data set.
  −
  # path    - path of indexed BCF file.
  −
  #dataset    type            filter                       path
  −
  1000g        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/1000G.snps_indels.sites.bcf
  −
  mills        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/mills.208620indels.sites.bcf
  −
  dbsnp        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/dbsnp.13147541variants.sites.bcf
  −
  GENCODE_V19  cds_annotation  .                            /net/fantasia/home/atks/ref/vt/grch37/gencode.cds.bed.gz
  −
  DUST        cplx_annotation .                            /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz
  −
 
  −
<div class="mw-collapsible-content">
  −
  usage : vt profile_indels [options] <in.vcf>
  −
 
  −
  options : -g  file containing list of reference datasets []
   
             -I  file containing list of intervals []
 
             -I  file containing list of intervals []
 
             -i  intervals []
 
             -i  intervals []
Line 1,116: Line 1,169:  
</div>
 
</div>
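The Precision and Sensitivity figures in the sample SNP output follow from the A-B, A&B and B-A counts. The Python sketch below assumes that A is the profiled call set and B is the reference data set, and that precision is the share of A found in B while sensitivity is the share of B found in A. These definitions are inferred from the numbers in the sample output, not from the vt source, and the function name is hypothetical.

```python
def overlap_stats(a_minus_b: int, a_and_b: int, b_minus_a: int):
    """Return (precision %, sensitivity %) from the three overlap counts."""
    precision = 100.0 * a_and_b / (a_and_b + a_minus_b)
    sensitivity = 100.0 * a_and_b / (a_and_b + b_minus_a)
    return precision, sensitivity


if __name__ == "__main__":
    # Counts taken from the 1000g and dbsnp blocks of the sample SNP output.
    for name, counts in {
        "1000g": (109970, 398633, 1340682),
        "dbsnp": (324063, 184540, 103893),
    }.items():
        precision, sensitivity = overlap_stats(*counts)
        print(f"{name}: precision {precision:.1f}%, sensitivity {sensitivity:.1f}%")
```

With these counts the sketch gives 78.4% and 22.9% for 1000g and 36.3% and 64.0% for dbsnp, in agreement with the sample output.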
   −
=== Profile VNTRs ===
+
=== Profile Indels ===
   −
Profile VNTRs.  The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].
+
Profile Indels.  The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].
    
<div class=" mw-collapsible mw-collapsed">
 
<div class=" mw-collapsible mw-collapsed">
 +
  #profile indels found in mills.vcf
 +
  vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa  -i 20
   −
   #profiles a set of VNTRs
+
   #this is a sample output for indel profiling.
   vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt
+
   # square brackets contain the ins/del ratio.
    
+
   # for the FS/NFS field, that is the proportion of coding indels that are frame shifted.
 
+
   # The numbers in curved bracket are the counts of frame shift and non frame shift indels respectively.
  profile_vntrs v0.5
+
  data set
    
+
     No Indels :      46974 [0.89]
    no VNTRs          5660874          #number of VNTRs in vntrs.sites.bcf
+
      FS/NFS :       0.26 (8/23) <br>
    no low complexity  2686460 (47.46%)  #number of VNTRs in low complexity region determined by MDUST
+
   dbsnp
     no coding          17911 (0.32%)    #number of VNTRs in coding regions determined by GENCODE v7
+
     A-B     30704 [0.92]
    no redundant       1312209 (23.18%) #number of VNTRs involved in overlapping with one another<br>
+
     A&B     16270 [0.83]
   trf_lobstr (1638516)  #TRF based reference set used in lobSTR, motif lengths 1 to 6.
+
     B-A   2049488 [1.52]
     A-B     3269285    #TRs specific to vntrs.sites.bcf
+
     Precision    34.6%
     A-B~    1666185    #TRs in vntrs.sites.bcf that overlap partially with at least one TR in TRF(lobSTR) but do not overlap exactly with another TR.
+
     Sensitivity   0.8% <br>
     A&B1    725404    #TRs in vntrs.sites.bcf that overlap exactly with at least one TR in TRF(lobSTR)
+
   mills
     A&B2    723195    #TRs in TRF(lobSTR) that overlap exactly with at least one TR in vntrs.sites.bcf
+
     A-B     43234 [0.88]
     B-A~    710075    #TRs in TRF(lobSTR) that overlap partially with at least one TR in vntrs.sites.bcf but do not overlap exactly with another TR.
+
     A&B       3740 [1.00]
    B-A      205246    #TRs specific to TRF(lobSTR)
+
     B-A    203278 [0.98]
   #note that the first 3 rows should sum up to the number of TRs in vntrs.sites.bcf
+
     Precision     8.0%
  #and the 4th to 6th rows should sum up to the number of TRs in TRF( lobSTR)
+
     Sensitivity  1.8% <br>
  #This basically allows us to see the m to n overlapping in overlapping TRs<br>
+
   mills.chip
   trf_repeatseq (1624553) #TRF based reference set used in repeatseq, motif lengths 1 to 6.
+
     A-B     46847 [0.89]
     A-B     3291652
+
     A&B       127 [0.90]
     A-B~    1650190
+
     B-A      8777 [0.93]
     A&B1     719032
+
     Precision     0.3%
     A&B2     716838
+
     Sensitivity  1.4% <br>
     B-A~    703948
+
   affy.exome.chip
    B-A      203767  <br>
+
     A-B     46911 [0.89]
   trf_vntrseek (230306)  #TRF based reference set used in vntrseek, motif lengths 7 to 2000.
+
     A&B        63 [0.43]
     A-B     5384453
+
     B-A     33997 [0.47]
     A-B~    271302
+
     Precision    0.1%
     A&B1       5119
+
     Sensitivity  0.2% <br>
     A&B2      4973
  −
     B-A~      92496
  −
     B-A      132837  <br>
  −
   codis+ (15)            #CODIS STRs + 2 STRs from PROMEGA
  −
     A-B     5660794
  −
     A-B~         79
  −
     A&B1          1
  −
     A&B2          1  
  −
     B-A~        14
  −
    B-A          0  
      
   # This file contains information on how to process reference data sets.
 
   # This file contains information on how to process reference data sets.
Line 1,169: Line 1,214:  
   #          overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively.
 
   #          overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively.
 
   #        - annotation.
 
   #        - annotation.
   #          file is used for GENCODE annotation of coding VNTRs.
+
   #          file is used for GENCODE annotation of frame shift and non frame shift Indels.
 
   # filter  - filter applied to variants for this particular data set.
 
   # filter  - filter applied to variants for this particular data set.
 
   # path    - path of indexed BCF file.
 
   # path    - path of indexed BCF file.
   #dataset     type            filter                      path
+
   #dataset     type            filter                      path
   trf_lobstr    TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.lobstr.sites.bcf
+
   1000g        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/1000G.snps_indels.sites.bcf
   trf_repeatseq TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.repeatseq.sites.bcf
+
   mills        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/mills.208620indels.sites.bcf
   trf_vntrseek  TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.vntrseek.sites.bcf
+
   dbsnp        TP              N_ALLELE==2&&VTYPE==INDEL    /net/fantasia/home/atks/ref/vt/grch37/dbsnp.13147541variants.sites.bcf
   codis+        TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/codis.strs.sites.bcf
+
   GENCODE_V19  cds_annotation  .                            /net/fantasia/home/atks/ref/vt/grch37/gencode.cds.bed.gz
   GENCODE_V19  cds_annotation  .                            /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz
+
   DUST        cplx_annotation .                            /net/fantasia/home/atks/ref/vt/grch37/mdust.bed.gz
  DUST          cplx_annotation .                             
      
<div class="mw-collapsible-content">
 
<div class="mw-collapsible-content">
   usage : vt profile_vntrs [options] <in.vcf>
+
   usage : vt profile_indels [options] <in.vcf>
    
   options : -g  file containing list of reference datasets []
 
   options : -g  file containing list of reference datasets []
Line 1,191: Line 1,235:  
</div>
 
</div>
   −
=== Profile Mendelian Errors ===
+
=== Profile VNTRs ===
   −
Profile Mendelian errors
+
Profile VNTRs.  The reference data sets can be obtained from [[Vt#Resource_Bundle|vt resource bundle]].
    
<div class=" mw-collapsible mw-collapsed">
 
<div class=" mw-collapsible mw-collapsed">
  #profile mendelian errors found in vt.genotypes.bcf, generate [[media:mendel.pdf|tables]] in the directory mendel, requires pdflatex.
  −
  vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel
     −
  pedigree file format is described in [http://csg.sph.umich.edu//abecasis/merlin/tour/input_files.html here]
+
  #profiles a set of VNTRs
 +
  vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt
 +
 
   −
  #this is a sample output for mendelian error profiling.
+
  profile_vntrs v0.5
  #R and A stand for reference and alternate allele respectively.
+
 
  #Error% - mendelian error (confounded with de novo mutation)
+
    no VNTRs          5660874          #number of VNTRs in vntrs.sites.bcf
  #HomHet - Homozygous-Heterozygous genotype ratios
+
    no low complexity  2686460 (47.46%)  #number of VNTRs in low complexity region determined by MDUST
  #Het% - proportion of hets
+
    no coding          17911 (0.32%)     #number of VNTRs in coding regions determined by GENCODE v7
  Mendelian Errors <br>
+
    no redundant      1312209 (23.18%)  #number of VNTRs involved in overlapping with one another<br>
  Father Mother      R/R          R/A          A/A    Error(%) HomHet    Het(%)
+
  trf_lobstr (1638516) #TRF based reference set used in lobSTR, motif lengths 1 to 6.
  R/R    R/R        14889          210          38    1.64      nan    nan
+
    A-B    3269285     #TRs specific to vntrs.sites.bcf
  R/R    R/A         3403        3497          74     1.06      0.97  50.68
+
    A-B~   1666185     #TRs in vntrs.sites.bcf that overlap partially with at least one TR in TRF(lobSTR) but do not overlap exactly with another TR.
  R/R    A/A          176        1482          155    18.26      nan    nan
+
    A&B1    725404     #TRs in vntrs.sites.bcf that overlap exactly with at least one TR in TRF(lobSTR)
  R/A    R/R        3665        3652          68     0.92      1.00  49.91
+
    A&B2    723195     #TRs in TRF(lobSTR) that overlap exactly with at least one TR in vntrs.sites.bcf
  R/A   R/A        1015        3151          990     0.00      0.64  61.11
+
    B-A~    710075     #TRs in TRF(lobSTR) that overlap partially with at least one TR in vntrs.sites.bcf but do not overlap exactly with another TR.
  R/A   A/A          43        1300        1401     1.57      1.08  48.13
+
    B-A     205246     #TRs specific to TRF(lobSTR)
  A/A    R/R          172        1365          147    18.94      nan    nan
+
  #note that the first 3 rows should sum up to the number of TRs in vntrs.sites.bcf
  A/A    R/A          47        1164        1183     1.96      1.02  49.60
+
  #and the 4th to 6th rows should sum up to the number of TRs in TRF( lobSTR)
  A/A    A/A          20          78        5637     1.71      nan    nan <br>
+
  #This basically allows us to see the m to n overlapping in overlapping TRs<br>
  Parental            R/R          R/A          A/A    Error(%) HomHet    Het(%)
+
  trf_repeatseq (1624553) #TRF based reference set used in repeatseq, motif lengths 1 to 6.
  R/R    R/R        14889          210          38    1.64      nan    nan
+
    A-B     3291652
  R/R    R/A         7068        7149          142     0.99      0.99  50.28
+
    A-B~   1650190
  R/R    A/A          348        2847          302    18.59      nan   nan
+
    A&B1    719032
  R/A   R/A         1015        3151          990     0.00      0.64  61.11
+
    A&B2     716838
  R/A    A/A           90        2464        2584     1.75      1.05  48.81
+
    B-A~     703948
  A/A    A/A          20          78        5637    1.71      nan    nan <br>
+
    B-A     203767 <br>
  Parental            R/R          R/A         A/A   Error(%) HomHet    Het(%)
+
  trf_vntrseek (230306)  #TRF based reference set used in vntrseek, motif lengths 7 to 2000.
  HOM    HOM        14909          288        5675     1.66       nan    nan
+
    A-B    5384453
  HOM    HET        7158        9613        2726     1.19     1.00  49.90
+
    A-B~    271302
  HET    HET        1015        3151          990     0.00     0.64  61.11
+
    A&B1      5119
  HOMREF HOMALT      348        2847          302    18.59      nan    nan <br>
+
     A&B2       4973
  total mendelian error :   2.505%
+
     B-A~     92496
  no. of trios     : 2
+
     B-A     132837 <br>
  no. of variants  : 25346
+
   codis+ (15)            #CODIS STRs + 2 STRs from PROMEGA
 +
    A-B    5660794
 +
    A-B~        79
 +
    A&B1          1
 +
    A&B2          1
 +
     B-A~        14
 +
    B-A          0
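The row-sum notes above can be checked directly against the counts in this output. The following Python sketch is not part of vt; it copies the numbers shown above and confirms that rows 4 to 6 add up to the size of each reference set, and that rows 1 to 3 give the same total for vntrs.sites.bcf in every comparison. That total is not printed in the output above, so it is derived here from the rows themselves.

```python
# Counts copied from the profile_vntrs output above.
# Each entry: (size of reference set, A-B, A-B~, A&B1, A&B2, B-A~, B-A)
overlap = {
    "trf_lobstr":    (1638516, 3269285, 1666185, 725404, 723195, 710075, 205246),
    "trf_repeatseq": (1624553, 3291652, 1650190, 719032, 716838, 703948, 203767),
    "trf_vntrseek":  (230306, 5384453, 271302, 5119, 4973, 92496, 132837),
    "codis+":        (15, 5660794, 79, 1, 1, 14, 0),
}


def check_partitions(table):
    """Check the B-side sums and return the set of A-side totals."""
    a_totals = set()
    for name, (n_ref, a_only, a_part, a_exact, b_exact, b_part, b_only) in table.items():
        # Rows 4 to 6 partition the reference set B.
        assert b_exact + b_part + b_only == n_ref, name
        # Rows 1 to 3 partition the query set A (vntrs.sites.bcf).
        a_totals.add(a_only + a_part + a_exact)
    return a_totals


a_totals = check_partitions(overlap)
print(a_totals)  # {5660874}: the same A-side total for all four reference sets
```

A single value in the resulting set means that every comparison accounts for the same number of TRs in vntrs.sites.bcf.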
   −
<div class="mw-collapsible-content">
+
  # This file contains information on how to process reference data sets.
profile_mendelian v0.5
+
  # dataset - name of data set, this label will be printed.
 
+
  # type    - True Positives (TP) and False Positives (FP).
   usage : vt profile_mendelian [options] <in.vcf>
+
  #          overlap percentages labeled as (Precision, Sensitivity) and (False Discovery Rate, Type I Error) respectively.
 +
  #        - annotation.
 +
  #          file is used for GENCODE annotation of coding VNTRs.
 +
  # filter  - filter applied to variants for this particular data set.
 +
  # path    - path of indexed BCF file.
 +
  #dataset      type            filter                      path
 +
  trf_lobstr    TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.lobstr.sites.bcf
 +
  trf_repeatseq TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.repeatseq.sites.bcf
 +
  trf_vntrseek  TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/trf.vntrseek.sites.bcf
 +
  codis+        TP              VTYPE==VNTR                  /net/fantasia/home/atks/ref/vt/grch37/codis.strs.sites.bcf
 +
  GENCODE_V19  cds_annotation  .                            /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz
 +
  DUST          cplx_annotation .                             
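To illustrate the layout of the reference list described above, here is a minimal Python sketch that reads such a file into records. It is not vt's own parser. It assumes the columns are separated by whitespace, that a filter expression contains no spaces, and that the path column may be absent (as in the DUST row); the function name `parse_reference_list` is made up for this example.

```python
def parse_reference_list(text):
    """Parse a whitespace-delimited reference list into dictionaries."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # dataset, type and filter are expected on every row;
        # the path is treated as optional (assumption).
        dataset, kind, flt = fields[:3]
        path = fields[3] if len(fields) > 3 else None
        records.append(
            {"dataset": dataset, "type": kind, "filter": flt, "path": path}
        )
    return records


example = """\
#dataset      type            filter        path
trf_lobstr    TP              VTYPE==VNTR   /net/fantasia/home/atks/ref/vt/grch37/trf.lobstr.sites.bcf
GENCODE_V19   cds_annotation  .             /net/fantasia/home/atks/ref/vt/grch37/gencode.v19.cds.bed.gz
DUST          cplx_annotation .
"""

records = parse_reference_list(example)
for record in records:
    print(record["dataset"], record["type"], record["filter"], record["path"])
```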
 +
 
 +
<div class="mw-collapsible-content">
 +
   usage : vt profile_vntrs [options] <in.vcf>
   −
   options : -q minimum genotype quality
+
   options : -g file containing list of reference datasets []
             -d minimum depth
+
             -I file containing list of intervals []
 +
            -i  intervals []
 
             -r  reference sequence fasta file []
 
             -r  reference sequence fasta file []
             -x  output latex directory []
+
             -?  displays help
            -p  pedigree file
  −
            -I  file containing list of intervals []
  −
            -i  intervals
  −
          -?  displays help
   
  </div>
 
  </div>
 
</div>
 
</div>
Line 1,644: Line 1,705:  
             --  ignores the rest of the labeled arguments following this flag
 
             --  ignores the rest of the labeled arguments following this flag
 
             -h  displays help
 
             -h  displays help
  </div>
+
  </div>
</div>
+
</div>
 
+
 
=== Genotype ===
+
=== Genotype ===
 
+
 
Genotypes variants for each sample.
+
Genotypes variants for each sample.
 
+
 
<div class=" mw-collapsible mw-collapsed">
+
<div class=" mw-collapsible mw-collapsed">
   #genotypes variants found in candidates.sites.vcf from sample.bam
+
   #genotypes variants found in candidates.sites.vcf from sample.bam
   vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf
+
   vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf
<div class="mw-collapsible-content">
+
<div class="mw-collapsible-content">
   usage : vt genotype [options]  
+
   usage : vt genotype [options]  
 +
 
 +
  options : -r  reference sequence fasta file []
 +
            -s  sample ID []
 +
            -o  output VCF file [-]
 +
            -b  input BAM file []
 +
            -i  input candidate VCF file []
 +
            --  ignores the rest of the labeled arguments following this flag
 +
            -h  displays help
 +
</div>
 +
</div>
 +
 
 +
= Pedigree File =
 +
 
 +
  vt understands an augmented version of the PED format described by [http://zzz.bwh.harvard.edu/plink/data.shtml#ped plink]; the augmentation was introduced by [mailto:hmkang@umich.edu Hyun].
 +
  The pedigree file has the following mandatory fields:
 +
       
 +
{| class="wikitable"
 +
|-
 +
! scope="col"| Field
 +
! scope="col"| Description
 +
! scope="col"| Valid Values
 +
! scope="col"| Missing Values
 +
|-
 +
|Family ID<br>
 +
Individual ID<br>
 +
Paternal ID<br>
 +
Maternal ID<br>
 +
Sex<br>
 +
Phenotype
 +
|ID of this family <br>
 +
ID(s) of this individual (comma separated) <br>
 +
ID of the father <br>
 +
ID of the mother <br>
 +
Sex of the individual<br>
 +
Phenotype
 +
|[A-Za-z0-9_]+<br>
 +
[A-Za-z0-9_]+(,[A-Za-z0-9_]+)* <br>
 +
[A-Za-z0-9_]+ <br>
 +
[A-Za-z0-9_]+<br>
 +
1=male, 2=female, other, male, female<br>
 +
[A-Za-z0-9_]+
 +
|  0 <br>
 +
cannot be missing <br>
 +
0 <br>
 +
0 <br>
 +
other<br>
 +
-9
 +
|}
 +
 
 +
  Examples:   
 +
 
 +
    ceu      NA12878    NA12891    NA12892    female    -9
 +
    yri      NA19240    NA19239    NA19238    female    -9
 +
 
 +
    ceu      NA12878    NA12891    NA12892    2    -9
 +
    yri      NA19240    NA19239    NA19238    2    -9
 +
 
 +
    #allows tools like profile_mendelian to detect duplicates and check for concordance
 +
    ceu      NA12878,NA12878A    NA12891    NA12892    female  case
 +
    yri      NA19240            NA19239    NA19238    female  control
 +
 
 +
    #individuals whose parents are unknown have their paternal and maternal IDs set to 0
 +
    ceu      NA12412    0  0    female  case
 +
    yri      NA19650    0  0    female  control
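The field rules in the table above can be expressed as a small validator. The Python sketch below is only an illustration and does not reproduce how vt reads pedigree files: it assumes fields are separated by whitespace, takes the patterns and missing-value codes from the table, and uses a made-up function name, `parse_ped_line`.

```python
import re

ID = r"[A-Za-z0-9_]+"

# One pattern per mandatory field, in file order (taken from the table above).
FIELD_PATTERNS = [
    ("family_id", re.compile(ID)),
    ("individual_id", re.compile(ID + r"(," + ID + r")*")),
    ("paternal_id", re.compile(ID)),
    ("maternal_id", re.compile(ID)),
    ("sex", re.compile(r"1|2|other|male|female")),
    ("phenotype", re.compile(r"-9|" + ID)),  # -9 is the missing value
]


def parse_ped_line(line):
    """Validate one pedigree line and return its fields as a dictionary."""
    fields = line.split()
    if len(fields) != len(FIELD_PATTERNS):
        raise ValueError(f"expected {len(FIELD_PATTERNS)} fields, got {len(fields)}")
    record = {}
    for (name, pattern), value in zip(FIELD_PATTERNS, fields):
        if not pattern.fullmatch(value):
            raise ValueError(f"invalid {name}: {value!r}")
        record[name] = value
    # Comma-separated individual IDs denote duplicates of the same individual.
    record["individual_id"] = record["individual_id"].split(",")
    # A parental ID of 0 means the parent is missing.
    for parent in ("paternal_id", "maternal_id"):
        if record[parent] == "0":
            record[parent] = None
    return record


print(parse_ped_line("ceu NA12878,NA12878A NA12891 NA12892 female case"))
print(parse_ped_line("ceu NA12412 0 0 female case"))
```

Splitting the individual ID on commas keeps duplicate sample names together, which is what a tool such as profile_mendelian would need in order to compare them for concordance.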
 +
 
 +
= Resource Bundle =
 +
 
 +
== GRCh37 ==
 +
 
 +
Files are based on hs37d5.fa made by Heng Li.
 +
 
 +
* External : [ftp://share.sph.umich.edu/vt/grch37 GRCh37 resource bundle]
 +
* Internal : /net/fantasia/home/atks/ref/vt/grch37
   −
  options : -r  reference sequence fasta file []
+
See the [ftp://share.sph.umich.edu/vt/grch37/readme.txt readme] for the contents.
            -s  sample ID []
  −
            -o  output VCF file [-]
  −
            -b  input BAM file []
  −
            -i  input candidate VCF file []
  −
            --  ignores the rest of the labeled arguments following this flag
  −
            -h  displays help
  −
</div>
  −
</div>
     −
= Resource Bundle =
+
== GRCh38 ==
   −
* External : [ftp://share.sph.umich.edu/vt resource bundle]
+
Files are based on [https://github.com/lh3/bwa/blob/master/README-alt.md hs38DH.fa] made by Heng Li.
* Internal : /net/fantasia/home/atks/ref/vt/grch37
+
Note that many of the references are simply lifted over from GRCh37 using Picard's liftover tool with the default options.
   −
GRCH37 set : Files are based on hs37d5.fa made by Heng Li.
+
* External : [ftp://share.sph.umich.edu/vt/grch38 GRCh38 resource bundle]
 +
* Internal : /net/fantasia/home/atks/ref/vt/grch38
   −
{| class="wikitable"
+
See the [ftp://share.sph.umich.edu/vt/grch38/readme.txt readme] for the contents.
|-
  −
! scope="col"| data set
  −
! scope="col"| samples
  −
! scope="col"| snps/indels/complex/sv
  −
! scope="col"| description
  −
|-
  −
|1000G.v5  <br>
  −
dbsnp138 <br>
  −
1000G.omni.chip <br>
  −
mills  <br>
  −
mills.chip  <br>
  −
affy.exome.chip <br>
  −
NA12878.broad.kb <br>
  −
NA12878.v7.illumina.platinum <br>
  −
mdust.bed.gz  <br>
  −
gencode.cds.bed.gz <br>
  −
trf.bed.gz
  −
| 0<br>
  −
0<br>
  −
2141<br>
  −
0<br>
  −
158<br>
  −
2122<br>
  −
1<br>
  −
1 <br>
  −
NA <br>
  −
NA <br>
  −
NA
  −
| 81316694/3296894/66806/59426 <br>
  −
10588965/2488793/69749/0 <br>
  −
2432554/5/0/0  <br>
  −
0/208753/0/0 <br>
  −
0/8904/0/0 <br>
  −
281875/34389/0/0  <br>
  −
281345/87389/152/0 <br>
  −
3702969/650764/13751/0 <br>
  −
NA<br>
  −
NA<br>
  −
NA
  −
|1000G v5. [1000G 2015?]<br>
  −
derived from GATK's resource bundle that excludes 1000G variants.<br>
  −
1000G individuals typed on the omni chip [1000G 2015?]<br>
  −
indels from [Mills 2006]<br>
  −
indels from [Mills 2011]<br>
  −
1000G individuals and others typed on the affymetrix exome chip [1000G 2015?]<br>
  −
from GATK's NA12878 knowledgebase.<br>
  −
Illumina's platinum genomes version 7<br>
  −
regions of low complexity annotated using mdust [Morgulis 2006]<br>
  −
coding sequence regions based on GENCODE v19 annotations [Harrow 2012]<br>
  −
tandem repeat finder STRs from lobSTR's resource bundle [Gymrek 2012]
  −
|}
  −
     
  −
Note:  Please let me know if I did not cite a resource properly.
      
= FAQ =
 
= FAQ =