Changes

From Genome Analysis Wiki
8,814 bytes added ,  16:17, 21 October 2013
=== Introduction ===

vt is a tool set that calls, genotypes and filters short variants.  It also profiles variants to aid quality control (QC).

=== Location ===

Internal usage

  binaries
  /net/fantasia/home/atks/programs/vt

  test data
  /net/fantasia/home/atks/programs/vt/test

  scripts
  /net/fantasia/home/atks/programs/vt/scripts

External usage

  download from sourceforge/github

== Common options patterns ==

    -i defines the input file and is a required parameter by default.
      You may set it to '-' to read from STDIN, which is assumed by
      default to be in an uncompressed format.

    -o defines the output file and defaults to STDOUT.
      You may make STDOUT emit the binary version of the format,
      e.g. BCF, with the option -c.

== Major Workflows ==

=== Discovery ===

Discovery is performed per sample.  The evidence site lists for each sample are then merged, and site discovery statistics are computed.
The user then chooses cutoffs to create an initial candidate site list.

Generates a site list with INFO fields E and N.

  vt discover -i NA12878.bam -o NA12878.sites.vcf -g hs37d5.fa

Normalize (including left aligning) variants.  This is required because left alignment of insertions and/or deletions within a read is sometimes insufficient to ensure complete left alignment.

  vt normalize -i NA12878.sites.vcf -o NA12878.normalized.sites.vcf -g hs37d5.fa

Evidence site lists are combined across samples and split by sites to allow for parallelization.

  vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -l 5000

Discovery statistics are computed.  These statistics allow you to choose a suitable cutoff for creating a candidate site list.

  vt compute_discovery_stats -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf

Merge site lists.

  vt merge -i 1000000.sites.vcf,2000000.sites.vcf,3000000.sites.vcf -o candidate.sites.vcf

Plot charts to help with candidate list selection criteria.

  vt plot_discovery -i candidate.sites.vcf

A calling pipeline implemented in a makefile is available here.

=== Genotyping ===

Each individual is genotyped at a set of sites.

  vt genotype -i NA12878.bam -o NA12878.sites.vcf -g hs37d5.fa

Genotyped sample VCFs are combined across samples and split by sites.

  vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -o 1-1000000.sites.vcf

Features are computed.

  vt compute_features -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf

A genotyping pipeline implemented in a makefile is available here.

=== Filtering ===

Requires a set of features and an installed copy of SVMLight.

  vt filter NA12878.bam -i NA12878.sites.vcf -o NA12878.svm.sites.vcf --pos positive.sites.vcf --neg negative.sites.vcf

A filtering pipeline implemented in a makefile is available here.

== Generation ==

=== Discovery ===

Discovers variants from BAM files.

  Options:
  -b,  --input-bam-file    : Input BAM file
  -o,  --output-vcf-file   : Output VCF file
  -v,  --variant-type      : Variant types, takes on any combination of
                             the values snps,mnps,indels comma delimited
                             [snps,mnps,indels]
  -q,  --q-cutoff          : Base quality cutoff, only bases with
                             QUAL/BAQ >= baseq are considered [13]
  -m,  --mapq-cutoff       : MAPQ cutoff, only alignments with
                             map quality >= mapq are considered [20]
  -g,  --genome-fa-file    : Genome FASTA file
  -s,  --sample-id         : Sample ID

  Example:
  e.g. vt discover -b in.bam -o - -g ref.fa -v snps,indels -s HG0001
  e.g. bam mergeBam --in a.bam --in b.bam -o - |
        vt discover -b - -o - -g ref.fa -v all -s HG0001 |
        vt left_align -i - | vt merge_duplicate_variants -i -

=== Genotyping ===

Genotypes variants for each sample.

  Options:
  -b,  --input-bam-file       : Input BAM file
  -i,  --input-candidate-vcf  : Input candidate VCF file
  -o,  --output-vcf-file      : Output VCF file
  -v,  --variant-type         : Variant types, takes on any combination
                                of the values snps,mnps,indels comma
                                delimited [snps,mnps,indels]
  -g,  --genome-fa-file       : Genome FASTA file
  -s,  --sample-id            : Sample ID

  Example:
  e.g. vt genotype -b in.bam -i candidate.sites.vcf -o - -g ref.fa -s HG0001

== Annotation ==

=== Make Probes ===

Populates the INFO field with REFPROBE, ALTPROBE and PLEN tags for genotyping.

  Options:
  -i,  --input-vcf <string>      : Input VCF file
  -o,  --output-vcf <string>     : Output VCF file [-]
  -g,  --genome-fa               : Genome FASTA file [/net/fantasia/home/atks/ref/genome/human.g1k.v37.fa]
  -f,  --flank-length <integer>  : Minimum flank length [20]

  Example:
  e.g. vt make_probes -i 8904indels.dups.genotypes.vcf -o probes.sites.vcf -g ref.fa

=== Compute Feature ===

Computes a feature of a variant.

  vt compute_feature -i mills.vcf

=== Compute Allele Balance ===

Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_Based_Allele_Balance allele balance].  Outputs allele balance, allele frequency and genotype frequency.

  vt compute_ab -i mills.vcf

=== Compute Allele Frequency ===

Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Allele_Frequency allele frequency].  Outputs allele frequency and genotype frequency.

  vt compute_af -i mills.vcf

=== Compute Inbreeding Coefficient ===

Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Inbreeding_Coefficient inbreeding coefficient].  Outputs the inbreeding coefficient based on genotype likelihoods.

  vt compute_fic -i mills.vcf

=== Compute HWE ===

Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Hardy-Weinberg_Test Hardy-Weinberg equilibrium statistic].  Outputs PHRED-scaled HWE test p-values for biallelic as well as multiallelic variants.

  vt compute_hwe -i mills.vcf

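
The page does not define the scaling, but "PHRED scaled" conventionally means Q = -10 * log10(p). Assuming vt follows that convention, the sketch below converts a p-value into a PHRED-scaled score. The helper name `phred` is hypothetical and is not part of vt.

```shell
# Conventional PHRED scaling: Q = -10 * log10(p).
# Assumption: vt uses the usual convention; this helper is not part of vt.
phred() {
    awk -v p="$1" 'BEGIN { printf "%.1f\n", -10 * log(p) / log(10) }'
}

phred 0.001   # a p-value of 0.001 corresponds to a score of 30.0
phred 0.05    # a p-value of 0.05 corresponds to a score of 13.0
```

Under this convention, larger scores indicate smaller p-values, i.e. stronger evidence of departure from Hardy-Weinberg equilibrium.
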
=== Compute Mendelian Error ===

Compute Mendelian error statistics.  Outputs allele frequency and genotype frequency.

  vt compute_mendel -i mills.vcf

=== Compute features ===

  vt compute_<feature1>_<feature2>_ ... _<feature n> -i mills.vcf

== Modification ==

=== Left Alignment ===

[http://genome.sph.umich.edu/wiki/Variant_Normalization Left aligns] indel type variants in a VCF file.  This differs from normalization in that it only left aligns and left trims a variant.  This affects indels only.

  vt left_align -i mills.vcf -o mills.leftaligned.vcf

=== Normalization ===

[http://genome.sph.umich.edu/wiki/Variant_Normalization Normalize] variants in a VCF file.

  vt normalize -i mills.vcf -o mills.normalized.vcf

=== Merge duplicate variants ===

Merges duplicate variants by position, with the option of considering alleles.  (This simply discards the duplicate variant that appears later in the VCF file.)

  Options:
  -i,  --input-vcf <string>   : Input VCF file
  -o,  --output-vcf <string>  : Output VCF file [-]
  -p,  --merge-by-position    : Merge by position [false]

  Example:
  e.g. vt merge_duplicate_variants -i 8904indels.dups.genotypes.vcf -o out.vcf
  e.g. vt merge_duplicate_variants -p -i 8904indels.dups.genotypes.vcf -o out.vcf

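
As a rough illustration of the "discard the later duplicate" behaviour described for merging duplicate variants, the shell sketch below drops repeated records from a tiny in-line VCF body using awk. It is not vt's implementation: the records are invented, and the helper names `vcf_records`, `dedup_by_allele` and `dedup_by_position` are hypothetical.

```shell
# Illustrative sketch only: keep the first record, drop later duplicates.
# This is NOT vt's code; the records below are invented for the example.

# Tiny tab-delimited VCF body (CHROM POS ID REF ALT).
vcf_records() {
    printf '1\t100\t.\tA\tG\n'
    printf '1\t100\t.\tA\tG\n'     # exact duplicate of the first record
    printf '1\t100\t.\tA\tT\n'     # same position, different alternate allele
    printf '1\t200\t.\tCT\tC\n'
}

# Duplicates defined by CHROM, POS, REF and ALT (alleles considered).
dedup_by_allele() {
    awk -F '\t' '/^#/ { print; next } !seen[$1, $2, $4, $5]++'
}

# Duplicates defined by CHROM and POS only (analogous to merging by position).
dedup_by_position() {
    awk -F '\t' '/^#/ { print; next } !seen[$1, $2]++'
}

vcf_records | dedup_by_allele
vcf_records | dedup_by_position
```

With alleles considered, the A/T record at position 100 survives because its alternate allele differs; when only the position is compared, it is treated as a duplicate and dropped.
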
== Profiling ==

A standard procedure is as follows:

  zcat dataset.vcf.gz | vt normalize -i - | vt merge_duplicate_variants -i - > dataset.normalized.vcf

  cut -f1-8 dataset.normalized.vcf > dataset.normalized.sites.vcf

  cat dataset.normalized.sites.vcf | vt profile_snps -i - > snps.summary.log

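
The `cut -f1-8` step in the procedure above works because the first eight tab-delimited columns of a VCF record (CHROM through INFO) describe the site, while FORMAT and the per-sample genotype columns follow. A minimal sketch on an invented two-sample record (the helper name `make_vcf` is hypothetical):

```shell
# Invented meta line, header line and one record with two samples (S1, S2).
make_vcf() {
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
    printf '1\t100\t.\tA\tG\t50\tPASS\tAC=1\tGT\t0/1\t0/0\n'
}

# Keep only the eight site columns; FORMAT and genotype columns are dropped.
# Lines without tabs (the ## meta lines) pass through cut unchanged.
make_vcf | cut -f1-8
```

Meta-information lines that describe FORMAT fields remain in the header after this step, since `cut` only trims columns.
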
=== Profile SNPs ===

Profile SNPs.

* ts/tv ratio
* overlap analyses

  vt profile_snps -i mills.snps.sites.vcf

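
For readers unfamiliar with the ts/tv ratio listed above: it is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base changes). The following awk sketch computes it for biallelic SNPs in a VCF stream. It only illustrates the statistic and is not how vt computes it; the function name `tstv` and the records are made up.

```shell
# Sketch: ts/tv ratio for biallelic SNPs in a VCF stream (not vt code).
tstv() {
    awk -F '\t' '
        /^#/ { next }                                          # skip header lines
        $4 !~ /^[ACGTacgt]$/ || $5 !~ /^[ACGTacgt]$/ { next }  # biallelic SNPs only
        {
            pair = toupper($4 $5)
            if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC")
                ts++        # transition
            else
                tv++        # transversion
        }
        END { if (tv > 0) printf "%.2f\n", ts / tv; else print "NA" }
    '
}

# Invented records: 4 transitions, 2 transversions and 1 indel (ignored).
printf '1\t10\t.\tA\tG\n1\t20\t.\tC\tT\n1\t30\t.\tG\tA\n1\t40\t.\tT\tC\n1\t50\t.\tA\tC\n1\t60\t.\tG\tT\n1\t70\t.\tAT\tA\n' | tstv
```

For the seven invented records the sketch prints 2.00, i.e. four transitions over two transversions.
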
=== Profile Indels ===

Profile indels.

* Overlap analyses with known data sets
* FS/NFS (frameshift/non-frameshift) annotation

  vt profile_indels -i mills.indels.sites.vcf

=== Profile MNPs ===

Profile MNPs.

  vt profile_mnps -i mills.mnps.sites.vcf

=== Summarize Variants ===

Summarizes variants present in a VCF file.

  vt peek -i mills.vcf

== Plotting ==

=== Allele Frequency Spectrum ===

Plots the allele frequency spectrum of variants found in a VCF file.

  vt plot_afs -i mills.xml

=== Genotype Likelihood Concordance ===

Plots the genotype likelihood concordance graph.

  vt plot_gl -i mills.xml

=== Allele Balance Spectrum ===

Plots the allele balance graph of variants in the VCF file.

  vt plot_ab -i mills.xml

= VCF File Manipulation =

=== Sort ===

Sorts variants according to the contig list in the header.

  vt sort -i mills.sites.vcf

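
To make "sorted according to the contig list in the header" concrete, here is a rough sketch using standard tools only. It ranks each record by where its contig appears among the `##contig` header lines and then sorts by that rank and by position. This is an assumption about the intended ordering, not vt's implementation, and the function name `sort_vcf` is hypothetical.

```shell
# Sketch: order VCF records by header contig order, then by position.
# Not vt code; contigs missing from the header get an empty rank and sort first.
sort_vcf() {
    tab=$(printf '\t')
    grep '^#' "$1"                       # emit header lines unchanged
    awk -F '\t' '
        /^##contig=<ID=/ {
            id = $0
            sub(/^##contig=<ID=/, "", id)    # strip the prefix
            sub(/[,>].*$/, "", id)           # strip everything after the ID
            rank[id] = ++n
            next
        }
        /^#/ { next }
        { print rank[$1] "\t" $0 }           # prefix each record with its rank
    ' "$1" | sort -t "$tab" -k1,1n -k3,3n | cut -f2-
}

# Invented example: the header lists contig 2 before contig 1.
tmp=$(mktemp)
{
    printf '##fileformat=VCFv4.1\n'
    printf '##contig=<ID=2,length=243199373>\n'
    printf '##contig=<ID=1,length=249250621>\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\n'
    printf '1\t50\t.\tA\tG\n'
    printf '2\t300\t.\tC\tT\n'
    printf '2\t100\t.\tG\tA\n'
} > "$tmp"
sort_vcf "$tmp"
rm -f "$tmp"
```

Because the invented header lists contig 2 first, the records come out as 2:100, 2:300 and then 1:50, which differs from a plain numeric sort on the chromosome column.
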
=== Split by variant ===

Splits VCF files by variant type.

  vt split_by_variant -i mills.sites.vcf

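
Splitting by variant type presupposes a rule for assigning each record a type. The sketch below labels biallelic records as snp, mnp or indel purely from the lengths of REF and ALT. This simple length rule is an assumption made for illustration, and vt may classify variants differently; the function name `classify_variants` is hypothetical.

```shell
# Sketch: label each biallelic record as snp, mnp or indel from allele lengths.
# Assumption: a simple length rule for illustration; not taken from vt.
classify_variants() {
    awk -F '\t' '
        /^#/ { next }                       # skip header lines
        {
            r = length($4); a = length($5)
            if (r == 1 && a == 1) type = "snp"
            else if (r == a)      type = "mnp"
            else                  type = "indel"
            print type "\t" $1 "\t" $2 "\t" $4 "\t" $5
        }
    '
}

# Invented records: one SNP, one MNP and one deletion.
printf '1\t10\t.\tA\tG\n1\t20\t.\tAC\tGT\n1\t30\t.\tAT\tA\n' | classify_variants
```

The first output column carries the label, so the stream could then be routed into one file per type, for example with a further awk step that writes each record to a file named after its label.
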
= Resource Files =

* dbSNP
* OMNI 1000G
* Mills
* HAPMAP

= Maintained by =

This page is maintained by [mailto:atks@umich.edu Adrian]