Difference between revisions of "Vt"

From Genome Analysis Wiki

Revision as of 16:17, 21 October 2013

Introduction

vt is a tool set that calls, genotypes and filters short variants. It provides profiling of variants to aid in QC.


Location

Internal usage

 binaries
 /net/fantasia/home/atks/programs/vt
 test data
 /net/fantasia/home/atks/programs/vt/test
 scripts
 /net/fantasia/home/atks/programs/vt/scripts

External usage

 Download from SourceForge/GitHub.

Common option patterns

   -i defines the input file and is a required parameter by default.
      You may set it to '-' to read from STDIN, which is assumed
      to be in an uncompressed format by default.
   -o defines the output file and defaults to STDOUT.
      Use the option -c to write the binary version of the format,
      e.g. BCF, to STDOUT.
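
The '-' convention for STDIN can be illustrated with a minimal Python sketch. This is not vt's implementation; the helper names open_input and count_records are hypothetical and exist only for this example.

```python
import sys
from contextlib import contextmanager


@contextmanager
def open_input(path):
    """Yield a text stream: STDIN when path is '-', otherwise the named file."""
    if path == "-":
        # STDIN is not owned by this function, so it is not closed here.
        yield sys.stdin
    else:
        with open(path, "r") as handle:
            yield handle


def count_records(path):
    """Count VCF data lines, i.e. non-empty lines that do not start with '#'."""
    with open_input(path) as stream:
        return sum(1 for line in stream if line.strip() and not line.startswith("#"))
```

A caller would pass either a file name or '-' and need not care which one it received.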

Major Workflows

Discovery

Discovery is performed at the per-sample level. The evidence site lists for each sample are then merged and site discovery statistics are computed. The user then decides on the cutoffs used to create an initial candidate site list.

Generate a site list with INFO fields E and N.

vt discover -b NA12878.bam -o NA12878.sites.vcf -g hs37d5.fa

Normalize (including left aligning) variants. This is required because left alignment of insertions and/or deletions within a read is sometimes insufficient to ensure complete left alignment.

vt normalize -i NA12878.sites.vcf -o NA12878.normalized.sites.vcf -g hs37d5.fa

Evidence site lists are combined across samples and split by sites to allow for parallelization.

vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -l 5000

Discovery statistics are computed. These statistics help you choose a suitable cutoff for creating the candidate site list.

vt compute_discovery_stats -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf 

Merge site lists.

vt merge -i 1000000.sites.vcf,2000000.sites.vcf,3000000.sites.vcf -o candidate.sites.vcf 

Plot charts to help with candidate list selection criteria.

vt plot_discovery -i candidate.sites.vcf 


A calling pipeline implemented in a makefile is available here.

Genotyping

Each individual is genotyped at a set of sites.

vt genotype -b NA12878.bam -i candidate.sites.vcf -o NA12878.sites.vcf -g hs37d5.fa

Genotype sample VCFs are combined across samples and split by sites.

vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -o 1-1000000.sites.vcf

Features are computed.

vt compute_features -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf 

A genotyping pipeline implemented in a makefile is available here.

Filtering

Requires a set of features AND an installed copy of SVMLight.

vt filter NA12878.bam -i NA12878.sites.vcf -o NA12878.svm.sites.vcf --pos positive.sites.vcf --neg negative.sites.vcf

A filtering pipeline implemented in a makefile is available here.

Generation

Discovery

Discovers variants from BAM files.

  Options:
  -b,  --input-bam-file    : Input BAM file
  -o,  --output-vcf-file   : Output VCF file
  -v,  --variant-type      : Variant Types, takes on any combinations of 
                             the values snps,mnps,indels comma delimited 
                             [snps,mnps,indels]
  -q,  --q-cutoff          : BASE Cutoff, only bases with 
                             QUAL/BAQ >= baseq are considered [13]
  -m,  --mapq-cutoff       : MAPQ Cutoff, only alignments with 
                             map quality >= mapq are considered [20]
  -g,  --genome-fa-file    : Genome FASTA file
  -s,  --sample-id         : Sample ID
  
  Example:
  e.g. vt discover -b in.bam -o - -g ref.fa -v snps,indels -s HG0001
  e.g. bam mergeBam --in a.bam --in b.bam -o - |
       vt discover -b - -o - -g ref.fa -v all -s HG0001 |
       vt left_align -i - | vt merge_duplicate_variants -i - -o out.sites.vcf

Genotyping

Genotypes variants for each sample.

  Options:
  -b,  --input-bam-file        : Input BAM file
  -i,  --input-candidate-vcf   : Input Candidate VCF file
  -o,  --output-vcf-file       : Output VCF file
  -v,  --variant-type          : Variant Types, takes on any combinations 
                                 of the values snps,mnps,indels comma 
                                 delimited [snps,mnps,indels]
  -g,  --genome-fa-file        : Genome FASTA file
  -s,  --sample-id             : Sample ID
  
  Example:
  e.g. vt genotype -b in.bam -i candidate.sites.vcf -o - -g ref.fa -s HG0001

Annotation

Make Probes

Populates the info field with REFPROBE, ALTPROBE and PLEN tags for genotyping.

  Options:
  -i,  --input-vcf <string>      : Input VCF file
  -o,  --output-vcf <string>     : Output VCF file [-]
  -g,  --genome-fa               : Genome FASTA file [/net/fantasia/home/atks/ref/genome/human.g1k.v37.fa]
  -f,  --flank-length <integer>  : Minimum Flank Length [20]
  Example:
  e.g. vt make_probes -i 8904indels.dups.genotypes.vcf -o probes.sites.vcf -g ref.fa

Compute Feature

Computes a feature of each variant.

vt compute_feature -i mills.vcf

Compute Allele balance

Compute allele balance (see http://genome.sph.umich.edu/wiki/Genotype_Likelihood_Based_Allele_Balance). Outputs allele balance, allele frequency and genotype frequency.

vt compute_ab -i mills.vcf

Compute Allele Frequency

Compute allele frequency (see http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Allele_Frequency). Outputs allele frequency and genotype frequency.

vt compute_af -i mills.vcf
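
As a rough illustration of what allele frequency and genotype frequency mean, the sketch below counts them from hard genotype calls. This is a simplification and an assumption made for the example: the page describes a genotype likelihood based estimate, which this code does not implement. The function name is hypothetical.

```python
from collections import Counter


def allele_and_genotype_freqs(genotypes):
    """Return (allele frequencies, genotype frequencies) from GT strings.

    genotypes: list of GT strings such as '0/1' or '1|0'.
    Calls containing a missing allele ('.') are skipped.
    Genotype keys are unphased, e.g. '1|0' is counted as '0/1'.
    """
    allele_counts = Counter()
    genotype_counts = Counter()
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue
        genotype_counts["/".join(sorted(alleles))] += 1
        allele_counts.update(alleles)
    n_alleles = sum(allele_counts.values())
    n_genotypes = sum(genotype_counts.values())
    af = {a: c / n_alleles for a, c in allele_counts.items()}
    gf = {g: c / n_genotypes for g, c in genotype_counts.items()}
    return af, gf
```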

Compute Inbreeding Coefficient

Compute inbreeding coefficient (see http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Inbreeding_Coefficient). Outputs inbreeding coefficient based on genotype likelihoods.

vt compute_fic -i mills.vcf
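
For intuition, the textbook inbreeding coefficient for a biallelic site is F = 1 - H_obs / H_exp, where H_obs is the observed heterozygote fraction and H_exp = 2p(1-p) is the fraction expected under Hardy-Weinberg equilibrium. The sketch below uses hard genotype counts, which is an assumption for illustration only; the estimate described on this page is based on genotype likelihoods. The function name is hypothetical.

```python
def inbreeding_coefficient(n_hom_ref, n_het, n_hom_alt):
    """Return F = 1 - H_obs / H_exp for a biallelic site, or None if undefined."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return None
    # Frequency of the reference allele among the 2n observed alleles.
    p = (2 * n_hom_ref + n_het) / (2 * n)
    h_exp = 2 * p * (1 - p)
    if h_exp == 0:
        # Monomorphic site: no heterozygotes are expected, so F is undefined.
        return None
    h_obs = n_het / n
    return 1 - h_obs / h_exp
```

A value near 0 indicates genotype counts close to Hardy-Weinberg proportions, while a value near 1 indicates a deficit of heterozygotes.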

Compute HWE

Compute Hardy-Weinberg equilibrium statistic (see http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Hardy-Weinberg_Test). Outputs PHRED scaled HWE test p-values for biallelic as well as multiallelic variants.

vt compute_hwe -i mills.vcf
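
PHRED scaling maps a p-value p to -10 * log10(p), so smaller p-values give larger scores. A minimal sketch follows; the cap parameter and its default of 255 are assumptions made here to handle p = 0, not something this page specifies.

```python
import math


def phred_scale(p_value, cap=255.0):
    """Convert a p-value in [0, 1] to a PHRED-scaled score, capped at `cap`."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value == 0.0:
        # log10(0) is undefined, so return the (assumed) cap instead.
        return cap
    return min(cap, -10.0 * math.log10(p_value))
```

For example, a p-value of 0.001 corresponds to a PHRED-scaled value of about 30.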

Compute Mendelian Error

Compute Mendelian error statistics. Outputs allele frequency and genotype frequency.

vt compute_mendel -i mills.vcf
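
A Mendelian error occurs when a child's genotype cannot be formed by taking one allele from each parent. The sketch below checks a single trio at a single site, assuming diploid autosomal genotypes; the function name is hypothetical and this is not vt's implementation.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """Return True/False for a trio of GT strings, or None if any call is missing."""

    def alleles(gt):
        return gt.replace("|", "/").split("/")

    c, f, m = alleles(child), alleles(father), alleles(mother)
    if "." in c + f + m:
        return None
    # Every genotype obtainable from one paternal and one maternal allele.
    possible = {tuple(sorted(pair)) for pair in product(f, m)}
    return tuple(sorted(c)) in possible
```

Counting the sites where this returns False across all trios gives a simple Mendelian error rate.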

Compute features

vt compute_<feature1>_<feature2>_ ... _<feature n> -i mills.vcf

Modification

Left Alignment

Left aligns indel type variants in a VCF file (see http://genome.sph.umich.edu/wiki/Variant_Normalization). This differs from normalization in that it only left aligns and left trims a variant. This affects indels only.

vt left_align -i mills.vcf -o mills.leftaligned.vcf
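
To make left alignment and trimming concrete, here is a minimal Python sketch of the commonly described procedure: repeatedly drop a shared trailing base, extend both alleles one base to the left whenever one of them becomes empty, then trim shared leading bases. It is an illustration under assumptions (a single alternate allele, REF different from ALT, the whole contig held in memory as a string), not vt's code, and the function name is hypothetical.

```python
def left_align_and_trim(ref_seq, pos, ref, alt):
    """Return (pos, ref, alt) after left alignment and trimming.

    ref_seq: reference contig sequence as a string.
    pos:     1-based position of the first base of `ref`.
    """
    ref, alt = ref.upper(), alt.upper()
    changed = True
    while changed:
        changed = False
        # Drop a trailing base shared by both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele is now empty, pad both with the preceding reference base.
        if (not ref or not alt) and pos > 1:
            pos -= 1
            base = ref_seq[pos - 1].upper()
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases while keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

With the reference GGGCACACACAGGG, a CA deletion written as position 8 CAC to C, or as position 6 CAC to C, comes out as position 3 GCA to G in both cases, so different representations of the same deletion converge to one.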

Normalization

Normalize variants in a VCF file (see http://genome.sph.umich.edu/wiki/Variant_Normalization).

vt normalize -i mills.vcf -o mills.normalized.vcf

Merge duplicate variants

Merges duplicate variants by position, with the option of also considering alleles. This simply discards any duplicate variant that appears later in the VCF file.

  Options:
  -i,  --input-vcf <string>  : Input VCF file
  -o,  --output-vcf <string> : Output VCF file [-]
  -p,  --merge-by-position   : Merge by position [false]
  Example:
  e.g. vt merge_duplicate_variants -i 8904indels.dups.genotypes.vcf -o out.vcf
  e.g. vt merge_duplicate_variants -p -i 8904indels.dups.genotypes.vcf -o out.vcf
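
The behaviour described above (keep the first record, discard later duplicates) can be sketched in a few lines of Python. The tuple layout and the function name are assumptions made for illustration; this is not vt's implementation.

```python
def drop_duplicates(records, by_position_only=False):
    """Keep the first occurrence of each variant, in input order.

    records: iterable of (chrom, pos, ref, alt) tuples.
    by_position_only: if True, two records at the same chrom/pos are
        duplicates regardless of their alleles.
    """
    seen = set()
    kept = []
    for chrom, pos, ref, alt in records:
        key = (chrom, pos) if by_position_only else (chrom, pos, ref, alt)
        if key in seen:
            continue  # A later duplicate is discarded.
        seen.add(key)
        kept.append((chrom, pos, ref, alt))
    return kept
```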

Profiling

A standard procedure is as follows:

 zcat dataset.vcf.gz | vt normalize -i - | vt merge_duplicate_variants -i - > dataset.normalized.vcf
 cut -f1-8 dataset.normalized.vcf > dataset.normalized.sites.vcf
 cat dataset.normalized.sites.vcf | vt profile_snps -i - > snps.summary.log
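
The cut -f1-8 step keeps only the first eight tab-separated columns (CHROM through INFO), which drops the FORMAT and per-sample genotype columns and leaves a sites-only VCF. An equivalent Python sketch, with a hypothetical function name, is:

```python
def sites_only(lines):
    """Yield each line truncated to its first eight tab-separated fields.

    Lines without tabs, such as '##' meta-information lines, pass through
    unchanged, mirroring what `cut -f1-8` does.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[:8])
```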


Profile SNPs

Profile SNPs.

  • ts/tv ratio
  • overlap analyses
vt profile_snps -i mills.snps.sites.vcf
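
The ts/tv ratio is the number of transitions (purine to purine, or pyrimidine to pyrimidine) divided by the number of transversions (purine to pyrimidine or the reverse). A minimal sketch of the computation, independent of vt and with a hypothetical function name:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def ts_tv_ratio(snps):
    """Return transitions / transversions for (ref, alt) pairs, or None if tv is 0.

    Pairs that are not single-base substitutions between A, C, G and T are ignored.
    """
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or ref not in BASES or alt not in BASES:
            continue
        if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else None
```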

Profile Indels

Profile indels.

  • Overlap analyses with known data sets
  • FS/NFS annotation
vt profile_indels -i mills.indels.sites.vcf
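
FS/NFS presumably stands for frameshift versus non-frameshift. Under that assumption, an indel is a frameshift when the length difference between REF and ALT is not a multiple of three. The sketch below applies only that length rule and assumes the indel lies inside coding sequence; it does not consult any gene annotation, and the function name is hypothetical.

```python
def frameshift_class(ref, alt):
    """Return 'FS', 'NFS', or None when REF and ALT have the same length."""
    length_change = abs(len(ref) - len(alt))
    if length_change == 0:
        return None  # Not an indel.
    return "NFS" if length_change % 3 == 0 else "FS"
```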

Profile MNPs

Profile MNPs.

vt profile_mnps -i mills.mnps.sites.vcf

Summarize Variants

Summarizes variants present in VCF file.

vt peek -i mills.vcf

Plotting

Allele Frequency Spectrum

Plots the allele frequency spectrum of variants found in a VCF file.

vt plot_afs -i mills.xml

Genotype Likelihood Concordance

Plots Genotype Likelihood Concordance graph.

vt plot_gl -i mills.xml

Allele Balance Spectrum

Plots Allele Balance graph of variants in the VCF file.

vt plot_ab -i mills.xml

VCF File Manipulation

Sort

Sort variants according to the order of the contigs listed in the header.

vt sort -i mills.sites.vcf
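
Sorting by header contig order means that records are ordered first by the position of their contig among the ##contig lines and then by coordinate, rather than alphabetically by contig name. A minimal sketch under that reading, with hypothetical function names and records reduced to (chrom, pos) tuples:

```python
def contig_order(header_lines):
    """Map each contig ID to its rank among the ##contig header lines."""
    order = {}
    for line in header_lines:
        if line.startswith("##contig=<"):
            body = line[len("##contig=<"):].rstrip().rstrip(">")
            for field in body.split(","):
                if field.startswith("ID="):
                    order.setdefault(field[3:], len(order))
    return order


def sort_records(header_lines, records):
    """Sort (chrom, pos, ...) tuples by header contig order, then by position.

    Contigs missing from the header are placed after all listed contigs.
    """
    order = contig_order(header_lines)
    return sorted(records, key=lambda r: (order.get(r[0], len(order)), r[1]))
```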

Split by variant

Split VCF files by variant type.

vt split_by_variant -i mills.sites.vcf
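
One simple way to assign a variant type, shown here as an assumption rather than vt's actual rule, is to compare allele lengths: equal lengths of one base give a SNP, equal lengths above one give an MNP, and unequal lengths give an indel. The function names are hypothetical, and the sketch assumes a single alternate allele that differs from the reference.

```python
from collections import defaultdict


def classify(ref, alt):
    """Return 'snp', 'mnp' or 'indel' based on allele lengths."""
    if len(ref) == len(alt):
        return "snp" if len(ref) == 1 else "mnp"
    return "indel"


def split_by_type(records):
    """Group (chrom, pos, ref, alt) tuples into lists keyed by variant type."""
    groups = defaultdict(list)
    for record in records:
        groups[classify(record[2], record[3])].append(record)
    return dict(groups)
```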

Resource Files

  • dbSNP
  • OMNI 1000G
  • Mills
  • HAPMAP

Maintained by

This page is maintained by Adrian (atks@umich.edu).