Difference between revisions of "Vt"

From Genome Analysis Wiki
Line 1: Line 1:
 
=== Introduction ===
 
  
vt is a tool set that calls, genotypes and filters short variants.  It provides profiling of variants to aid in QC.
+
vt is a variant tool set that discovers short variants from Next Generation Sequencing data. The features are being rolled out to GitHub while a major rewrite is under way.
 
 
  
 
=== Location ===
 
  
Internal usage
+
You may pull it from github:
 
 
  binaries
 
  /net/fantasia/home/atks/programs/vt
 
 
 
  test data
 
  /net/fantasia/home/atks/programs/vt/test
 
 
 
  scripts
 
  /net/fantasia/home/atks/programs/vt/scripts
 
 
 
External usage
 
  
   download from sourceforge/github
+
   git clone https://github.com/atks/vt.git
  
== Common options patterns ==
+
== Common options ==
  
     -i defines the input file and, by default, is a required parameter;
+
     -i multiple intervals in <seq>:start-end format
      however, you may set it as '-' to accept STDIN which by default is
 
      assumed to be a non compressed format
 
  
 
     -o defines the output file; STDOUT is the default.
       You may modify the STDOUT to output the binary version of the format,
+
       You may modify the STDOUT to output the binary version of the format.
      e.g. BCF. with the option -c
 
 
 
== Major Workflows ==
 
 
 
=== Discovery ===
 
 
 
Discovery is performed per sample; the evidence site lists for each sample are then merged and site discovery statistics are computed.
 
The user then chooses cutoffs to create an initial candidate site list.
 
 
 
Generates site list with info fields E and N.
 
 
 
vt discover -i NA12878.bam -o NA12878.sites.vcf -g hs37d5.fa
 
 
 
Normalize (including left aligning) variants. This is required because left alignment of insertions and/or deletions within a read is sometimes insufficient to ensure complete left alignment.
 
 
 
vt normalize -i NA12878.sites.vcf -o NA12878.normalized.sites.vcf -g hs37d5.fa
 
 
 
Evidence site lists are combined across samples and split by sites to allow for parallelization.
 
 
 
vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -l 5000
 
 
 
Discovery statistics are computed. These statistics allow you to choose a cutoff for creating a suitable candidate site list.
 
 
 
vt compute_discovery_stats -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf
 
 
 
Merge site lists.
 
 
 
vt merge -i 1000000.sites.vcf,2000000.sites.vcf,3000000.sites.vcf -o candidate.sites.vcf
 
 
 
Plot charts to help with candidate list selection criteria.
 
 
 
vt plot_discovery -i candidate.sites.vcf
 
 
 
 
 
A calling pipeline implemented in a make file is available here.
 
 
 
=== Genotyping ===
 
 
 
Each individual is genotyped at a set of sites.
 
 
 
vt genotype -i NA12878.bam -o NA12878.sites.vcf -g hs37d5.fa
 
 
 
Genotype sample VCFs are combined across samples and split by sites.
 
 
 
vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -o 1-1000000.sites.vcf
 
 
 
Features are computed.
 
 
 
vt compute_features -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf
 
 
 
A genotyping pipeline implemented in a make file is available here.
 
 
 
=== Filtering ===
 
 
 
Requires a set of features AND an installed copy of SVMLight.
 
 
 
vt filter NA12878.bam -i NA12878.sites.vcf -o NA12878.svm.sites.vcf --pos positive.sites.vcf --neg negative.sites.vcf
 
 
 
A filtering pipeline implemented in a make file is available here.
 
 
 
== Generation ==
 
 
 
=== Discovery ===
 
 
 
Discovers variants from bams.
 
 
 
  Options:
 
  -b,  --input-bam-file    : Input BAM file
 
  -o,  --output-vcf-file  : Output VCF file
 
  -v,  --variant-type      : Variant Types, takes on any combinations of
 
                              the values snps,mnps,indels comma delimited
 
                              [snps,mnps,indels]
 
  -q,  --q-cutoff          : BASE Cutoff, only bases with
 
                              QUAL/BAQ >= baseq are considered [13]
 
  -m,  --mapq-cutoff      : MAPQ Cutoff, only alignments with
 
                              map quality >= mapq are considered [20]
 
  -g,  --genome-fa-file    : Genome FASTA file
 
  -s,  --sample-id        : Sample ID
 
 
 
  Example:
 
  e.g. vt discover -b in.bam -o - -g ref.fa -v snps,indels -s HG0001
 
  e.g. bam mergeBam --in a.bam --in b.bam -o - |
 
        vt discover -b - -o - -g ref.fa -v all -s HG0001 |
 
        vt left_align -i - | vt merge_duplicate_variants
 
 
 
=== Genotyping ===
 
 
 
Genotypes variants for each sample.
 
 
 
  Options:
 
  -b,  --input-bam-file        : Input BAM file
 
  -i,  --input-candidate-vcf  : Input Candidate VCF file
 
  -o,  --output-vcf-file      : Output VCF file
 
  -v,  --variant-type          : Variant Types, takes on any combinations
 
                                  of the values snps,mnps,indels comma
 
                                  delimited [snps,mnps,indels]
 
  -g,  --genome-fa-file        : Genome FASTA file
 
  -s,  --sample-id            : Sample ID
 
 
 
  Example:
 
  e.g. vt genotype -b in.bam -i candidate.sites.vcf -o - -g ref.fa -s HG0001
 
 
 
== Annotation ==
 
 
 
=== Make Probes ===
 
 
 
Populates the info field with REFPROBE, ALTPROBE and PLEN tags for genotyping.
 
 
 
  Options:
 
  -i,  --input-vcf <string>      : Input VCF file
 
  -o,  --output-vcf <string>    : Output VCF file [-]
 
  -g,  --genome-fa              : Genome FASTA file [/net/fantasia/home/atks/ref/genome/human.g1k.v37.fa]
 
  -f,  --flank-length <integer>  : Minimum Flank Length [20]
 
 
 
  Example:
 
  e.g. vt make_probes -i 8904indels.dups.genotypes.vcf -o probes.sites.vcf -g ref.fa
 
 
 
=== Compute Feature ===
 
 
 
Compute feature of variant.
 
 
 
vt compute_feature -i mills.vcf
 
 
 
=== Compute Allele balance ===
 
 
 
Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_Based_Allele_Balance allele balance].  Outputs allele balance, allele frequency, genotype frequency.
 
 
 
vt compute_ab -i mills.vcf
 
 
 
=== Compute Allele Frequency ===
 
 
 
Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Allele_Frequency allele frequency].  Outputs  allele frequency and genotype frequency.
 
 
 
vt compute_af -i mills.vcf
 
 
 
=== Compute Inbreeding Coefficient ===
 
 
 
Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Inbreeding_Coefficient inbreeding coefficient].  Outputs inbreeding coefficient based on genotype likelihoods.
 
 
 
vt compute_fic -i mills.vcf
 
 
 
=== Compute HWE ===
 
 
 
Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Hardy-Weinberg_Test Hardy-Weinberg equilibrium statistic].  Outputs  PHRED scaled HWE Test p-values for biallelic as well as multiallelic variants.
 
 
 
vt compute_hwe -i mills.vcf
 
 
 
=== Compute Mendelian Error ===
 
 
 
Compute Mendelian error statistics. Outputs allele frequency and genotype frequency.
 
 
 
vt compute_mendel -i mills.vcf
 
 
 
=== Compute features ===
 
 
 
vt compute_<feature1>_<feature2>_ ... _<feature n> -i mills.vcf
 
 
 
== Modification ==
 
 
 
=== Left Alignment ===
 
 
 
[http://genome.sph.umich.edu/wiki/Variant_Normalization Left aligns] indel type variants in a VCF file.  This differs from normalization in that it only left aligns and left trims a variant.  This affects Indels only.
 
 
 
vt left_align -i mills.vcf -o mills.leftaligned.vcf
 
  
 
=== Normalization ===
 
Line 198: Line 20:
 
[http://genome.sph.umich.edu/wiki/Variant_Normalization Normalize] variants in a VCF file.
 
  
  vt normalize mills.vcf -r seq.fa -o mills.normalized.vcf
+
  vt normalize -i mills.vcf -o mills.normalized.vcf
  
 
=== Merge duplicate variants ===
 
Line 213: Line 35:
 
   e.g. vt merge_duplicate_variants -p -i 8904indels.dups.genotypes.vcf -o out.vcf
 
  
== Profiling ==
 
 
A standard procedure is as follows:
 
 
  zcat dataset.vcf.gz | vt normalize -i - | vt merge_duplicate_variants -i - > dataset.normalized.vcf
 
 
  cut -f1-8 dataset.normalized.vcf > dataset.normalized.sites.vcf
 
 
  cat dataset.normalized.sites.vcf | vt profile_snps -i - > snps.summary.log
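
The `cut -f1-8` step keeps only the eight fixed VCF columns (CHROM to INFO) and drops FORMAT and the per-sample genotype columns. A minimal sketch on a made-up one-record VCF, using only standard tools; the file content and variable names are illustrative assumptions:

```shell
# Made-up VCF with one sample column (NA12878); not real data.
vcf=$(printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\n20\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n')

# cut passes lines without a tab (the ## meta lines) through unchanged and
# truncates the column header and every record to the first eight fields.
sites=$(printf '%s\n' "$vcf" | cut -f1-8)

printf '%s\n' "$sites"
```

The meta-information lines survive because `cut`, without `-s`, prints any line that contains no delimiter as it is.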
 
 
 
 
=== Profile SNPs ===
 
 
Profile SNPs.
 
 
* ts/tv ratio
 
* overlap analyses
 
 
vt profile_snps -i mills.snps.sites.vcf
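
For orientation, the ts/tv ratio is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base changes). The following is a rough awk sketch of that definition on made-up records; the `tstv` helper is hypothetical and is not how `vt profile_snps` is implemented. It assumes uncompressed, tab-delimited, biallelic records.

```shell
# Hypothetical helper, not part of vt: ts/tv over biallelic SNP records.
# Records whose REF or ALT is not a single A/C/G/T base are skipped.
tstv() {
    awk -F'\t' '
        /^#/ { next }
        $4 ~ /^[ACGTacgt]$/ && $5 ~ /^[ACGTacgt]$/ {
            pair = toupper($4 $5)
            if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC")
                ts++
            else
                tv++
        }
        END { if (tv > 0) printf "%.2f\n", ts / tv; else print "NA" }'
}

# Made-up input: two transitions, one transversion, one deletion (ignored).
ratio=$(printf '1\t10\t.\tA\tG\n1\t20\t.\tC\tT\n1\t30\t.\tA\tC\n1\t40\t.\tAT\tA\n' | tstv)
echo "$ratio"
```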
 
 
=== Profile Indels ===
 
 
Profile indels.
 
 
* Overlap analyses with known data sets
 
* FS/NFS annotation
 
 
vt profile_indels -i mills.indels.sites.vcf
 
 
=== Profile MNPs ===
 
 
Profile MNPs.
 
 
vt profile_mnps -i mills.mnps.sites.vcf
 
 
=== Summarize Variants ===
 
 
Summarizes variants present in VCF file.
 
 
vt peek -i mills.vcf
 
 
== Plotting ==
 
 
=== Allele Frequency Spectrum ===
 
 
Plots the Allele Frequency Spectrum of variants found in a VCF file.
 
 
vt plot_afs -i mills.xml
 
 
=== Genotype Likelihood Concordance ===
 
 
Plots Genotype Likelihood Concordance graph.
 
 
vt plot_gl -i mills.xml
 
 
=== Allele Balance Spectrum ===
 
 
Plots Allele Balance graph of variants in the VCF file.
 
 
vt plot_ab -i mills.xml
 
 
= VCF File Manipulation =
 
 
=== Sort ===
 
 
Sort variants according to contig lists in header.
 
 
vt sort -i mills.sites.vcf
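
Sorting "according to contig lists in header" means records follow the order of the `##contig` lines rather than alphabetical or numeric contig order. A rough sketch of that idea with standard tools follows; the `sort_by_header` helper is hypothetical, is not how `vt sort` is implemented, and assumes an uncompressed VCF file.

```shell
# Hypothetical helper, not part of vt: order records by the position of their
# contig among the ##contig header lines, then by POS. Records whose contig is
# missing from the header get an empty rank and sort first.
# Usage: sort_by_header in.vcf > sorted.vcf
sort_by_header() {
    tab=$(printf '\t')
    grep '^#' "$1"                          # header lines first, unchanged
    awk -F'\t' '
        /^##contig=/ {
            id = $0
            sub(/^##contig=<ID=/, "", id)   # strip the leading tag
            sub(/[,>].*$/, "", id)          # strip everything after the ID
            rank[id] = ++n
        }
        /^#/ { next }
        { print rank[$1] "\t" $0 }          # prefix each record with its rank
    ' "$1" | sort -t "$tab" -k1,1n -k3,3n | cut -f2-
}

# Made-up input: the header lists contig 2 before contig 1.
tmp=$(mktemp)
printf '##contig=<ID=2,length=100>\n##contig=<ID=1,length=100>\n#CHROM\tPOS\n1\t5\n2\t30\n2\t10\n' > "$tmp"
sorted=$(sort_by_header "$tmp" | grep -v '^#' | tr '\t' ':' | paste -sd ' ' -)
rm -f "$tmp"
echo "$sorted"
```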
 
 
=== Split by variant ===
 
 
Split VCF files by variant type.
 
 
vt split_by_variant -i mills.sites.vcf
 
 
= Resource Files =
 
 
dbSNP
 
OMNI 1000G
 
Mills
 
HAPMAP
 
  
 
= Maintained by =
 
  
 
This page is maintained by  [mailto:atks@umich.edu Adrian]
 

Revision as of 16:25, 21 October 2013

Introduction

vt is a variant tool set that discovers short variants from Next Generation Sequencing data. The features are being rolled out to GitHub while a major rewrite is under way.

Location

You may pull it from github:

 git clone https://github.com/atks/vt.git

Common options

   -i  multiple intervals in <seq>:start-end format
   -o defines the output file; STDOUT is the default.
      You may set STDOUT to emit the binary version of the format.

Normalization

Normalize variants in a VCF file.

vt normalize -i mills.vcf -o mills.normalized.vcf
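
To give a feel for one part of normalization, the sketch below trims bases shared by REF and ALT so that a variant is written in its shortest form. It is a simplified illustration only: the `trim_alleles` helper is hypothetical, it handles biallelic records, and it does not consult a reference genome, so it cannot left-align indels the way a full normalizer does.

```shell
# Hypothetical helper, not part of vt: trim bases shared by REF and ALT.
# Right-trim first, then left-trim (advancing POS), always leaving at least
# one base in each allele. Biallelic records only; no reference lookup, so
# this does NOT left-align indels in repeats.
trim_alleles() {
    awk 'BEGIN { FS = OFS = "\t" }
        /^#/ { print; next }
        {
            ref = $4; alt = $5
            while (length(ref) > 1 && length(alt) > 1 &&
                   substr(ref, length(ref)) == substr(alt, length(alt))) {
                ref = substr(ref, 1, length(ref) - 1)
                alt = substr(alt, 1, length(alt) - 1)
            }
            while (length(ref) > 1 && length(alt) > 1 &&
                   substr(ref, 1, 1) == substr(alt, 1, 1)) {
                ref = substr(ref, 2); alt = substr(alt, 2); $2++
            }
            $4 = ref; $5 = alt
            print
        }'
}

# Made-up record: CAGT -> CAT at position 100 is a deletion of G.
trimmed=$(printf '1\t100\t.\tCAGT\tCAT\n' | trim_alleles)
echo "$trimmed"
```

In this example the shared trailing T and leading C are removed, leaving AG -> A at position 101.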

Merge duplicate variants

Merges duplicate variants by position, with the option of also considering alleles. (This simply discards the duplicate variant that appears later in the VCF file.)

  Options:
  -i,  --input-vcf <string>  : Input VCF file
  -o,  --output-vcf <string> : Output VCF file [-]
  -p,  --merge-by-position   : Merge by position [false]
  Example:
  e.g. vt merge_duplicate_variants -i 8904indels.dups.genotypes.vcf -o out.vcf
  e.g. vt merge_duplicate_variants -p -i 8904indels.dups.genotypes.vcf -o out.vcf
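
The discard-the-later-duplicate behaviour described above can be approximated with standard tools. The following is a minimal awk sketch, not vt's implementation: the `dedup_vcf` helper and its `by_pos` switch are hypothetical and only mirror the spirit of `-p`, on the assumption that the default comparison includes alleles. It expects an uncompressed, tab-delimited VCF.

```shell
# Hypothetical helper, not part of vt: keep the first record for each variant
# and drop later duplicates. Header lines (starting with '#') pass through.
# Usage: dedup_vcf [by_pos] < in.vcf > out.vcf
dedup_vcf() {
    awk -F'\t' -v by_pos="${1:-}" '
        /^#/ { print; next }
        {
            # Key on CHROM and POS only, or also on REF and ALT.
            key = (by_pos == "by_pos") ? ($1 FS $2) : ($1 FS $2 FS $4 FS $5)
            if (!(key in seen)) { seen[key] = 1; print }
        }'
}

# Made-up input: three records at one position, two of them identical.
input=$(printf '#CHROM\tPOS\tID\tREF\tALT\n1\t100\t.\tAT\tA\n1\t100\t.\tAT\tA\n1\t100\t.\tA\tG\n')

by_alleles=$(printf '%s\n' "$input" | dedup_vcf | grep -vc '^#')
by_position=$(printf '%s\n' "$input" | dedup_vcf by_pos | grep -vc '^#')
echo "$by_alleles $by_position"
```

With alleles considered, the two distinct variants at position 100 are both kept; keyed on position alone, only the first record survives.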


Maintained by

This page is maintained by Adrian