Changes

Vt (view source)

Revision as of 16:17, 21 October 2013

8,814 bytes added , 16:17, 21 October 2013

no edit summary

Line 1: Line 1: +

=== Introduction ===

+

vt is a tool set that calls, genotypes and filters short variants. It provides profiling of variants to aid in QC.

+

=== Location ===

+

Internal usage

+

binaries

+

/net/fantasia/home/atks/programs/vt

+

test data

+

/net/fantasia/home/atks/programs/vt/test

+

scripts

+

/net/fantasia/home/atks/programs/vt/scripts

+

External usage

+

Download from SourceForge/GitHub.

+

== Common option patterns ==

+

-i defines the input file and is a required parameter by default.

+

You may set it to '-' to read from STDIN, which is assumed to be uncompressed by default.

+

-o defines the output file and defaults to STDOUT.

+

To write the binary version of the format (e.g. BCF) to STDOUT, use the option -c.

+

== Major Workflows ==

+

=== Discovery ===

+

Discovery is performed at the per-sample level. The evidence site lists for each sample are then merged and site discovery statistics are computed.

+

The user then chooses cutoffs to create an initial candidate site list.

+

Generates site list with info fields E and N.

+

vt discover -b NA12878.bam -o NA12878.sites.vcf -g hs37d5.fa

+

Normalize (including left aligning) variants. This is required because left alignment of insertions and/or deletions within a read is sometimes insufficient to ensure complete left alignment.

+

vt normalize -i NA12878.sites.vcf -o NA12878.normalized.sites.vcf -g hs37d5.fa

+

Evidence site lists are combined across samples and split by sites to allow for parallelization.

+

vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -l 5000

+

Discovery statistics are computed. These statistics help you choose a suitable cutoff for creating a candidate site list.

+

vt compute_discovery_stats -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf

+

Merge site lists.

+

vt merge -i 1000000.sites.vcf,2000000.sites.vcf,3000000.sites.vcf -o candidate.sites.vcf

+

Plot charts to help with candidate list selection criteria.

+

vt plot_discovery -i candidate.sites.vcf

+

A calling pipeline implemented in a make file is available here.
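
The per-sample part of the steps above can be strung together in a small driver script. The following is only a sketch under stated assumptions: the sample names and hs37d5.fa are taken from the examples, the -b flag for BAM input follows the Options list for discover, feeding the normalized site lists into the merge step is an assumption, and the helpers run, discover_all and site_list are hypothetical names. By default it only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Sketch of the per-sample discovery steps (not the maintained make file).
# Assumed: sample names, reference name, and -b for BAM input to discover.
REF=hs37d5.fa
SAMPLES="NA12878 NA12879 NA12880"

# Print the command unless RUN=1, so the sketch can be inspected safely.
run() {
    if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Discover and normalize each sample independently.
discover_all() {
    for s in $SAMPLES; do
        run vt discover -b "$s.bam" -o "$s.sites.vcf" -g "$REF"
        run vt normalize -i "$s.sites.vcf" -o "$s.normalized.sites.vcf" -g "$REF"
    done
}

# Build the comma-separated list expected by -i of the merge step.
site_list() {
    list=
    for s in $SAMPLES; do
        list="${list:+$list,}$s.normalized.sites.vcf"
    done
    printf '%s\n' "$list"
}

discover_all
run vt merge_and_split_sample_vcf -i "$(site_list)" -l 5000
```

Because each sample is processed independently, the loop body is the natural unit to hand to a job scheduler or make rule.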

+

=== Genotyping ===

+

Each individual is genotyped at a set of sites.

+

vt genotype -b NA12878.bam -i candidate.sites.vcf -o NA12878.sites.vcf -g hs37d5.fa

+

Genotype sample VCFs are combined across samples and split by sites.

+

vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -o 1-1000000.sites.vcf

+

Features are computed.

+

vt compute_features -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf

+

A genotyping pipeline implemented in a make file is available here.

+

=== Filtering ===

+

Requires a set of features AND an installed copy of SVMLight.

+

vt filter -i NA12878.sites.vcf -o NA12878.svm.sites.vcf --pos positive.sites.vcf --neg negative.sites.vcf

+

A filtering pipeline implemented in a make file is available here.

+

== Generation ==

+

=== Discovery ===

+

Discovers variants from BAM files.

+

Options:

+

-b, --input-bam-file : Input BAM file

+

-o, --output-vcf-file : Output VCF file

+

-v, --variant-type : Variant types, any comma-delimited combination of the values snps,mnps,indels [snps,mnps,indels]

+

-q, --q-cutoff : BASE cutoff, only bases with QUAL/BAQ >= baseq are considered [13]

+

-m, --mapq-cutoff : MAPQ cutoff, only alignments with map quality >= mapq are considered [20]

+

-g, --genome-fa-file : Genome FASTA file

+

-s, --sample-id : Sample ID

+

Example:

+

e.g. vt discover -b in.bam -o - -g ref.fa -v snps,indels -s HG0001

+

e.g. bam mergeBam --in a.bam --in b.bam -o - |

+

vt discover -b - -o - -g ref.fa -v snps,mnps,indels -s HG0001 |

+

vt left_align -i - | vt merge_duplicate_variants -i - -o out.sites.vcf

+

=== Genotyping ===

+

Genotypes variants for each sample.

+

Options:

+

-b, --input-bam-file : Input BAM file

+

-i, --input-candidate-vcf : Input Candidate VCF file

+

-o, --output-vcf-file : Output VCF file

+

-v, --variant-type : Variant types, any comma-delimited combination of the values snps,mnps,indels [snps,mnps,indels]

+

-g, --genome-fa-file : Genome FASTA file

+

-s, --sample-id : Sample ID

+

Example:

+

e.g. vt genotype -b in.bam -i candidate.sites.vcf -o - -g ref.fa -s HG0001

+

== Annotation ==

+

=== Make Probes ===

+

Populates the INFO field with REFPROBE, ALTPROBE and PLEN tags for genotyping.

+

Options:

+

-i, --input-vcf <string> : Input VCF file

+

-o, --output-vcf <string> : Output VCF file [-]

+

-g, --genome-fa : Genome FASTA file [/net/fantasia/home/atks/ref/genome/human.g1k.v37.fa]

+

-f, --flank-length <integer> : Minimum Flank Length [20]

+

Example:

+

e.g. vt make_probes -i 8904indels.dups.genotypes.vcf -o probes.sites.vcf -g ref.fa

+

=== Compute Feature ===

+

Compute feature of variant.

+

vt compute_feature -i mills.vcf

+

=== Compute Allele Balance ===

+

Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_Based_Allele_Balance allele balance]. Outputs allele balance, allele frequency, genotype frequency.

+

vt compute_ab -i mills.vcf

+

=== Compute Allele Frequency ===

+

Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Allele_Frequency allele frequency]. Outputs allele frequency and genotype frequency.

+

vt compute_af -i mills.vcf

+

=== Compute Inbreeding Coefficient ===

+

Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Inbreeding_Coefficient inbreeding coefficient]. Outputs inbreeding coefficient based on genotype likelihoods.

+

vt compute_fic -i mills.vcf

+

=== Compute HWE ===

+

Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Hardy-Weinberg_Test Hardy-Weinberg equilibrium statistic]. Outputs PHRED scaled HWE Test p-values for biallelic as well as multiallelic variants.

+

vt compute_hwe -i mills.vcf

+

=== Compute Mendelian Error ===

+

Compute Mendelian error statistics. Outputs allele frequency and genotype frequency.

+

vt compute_mendel -i mills.vcf

+

=== Compute features ===

+

vt compute_<feature1>_<feature2>_ ... _<feature n> -i mills.vcf

+

== Modification ==

+

=== Left Alignment ===

+

[http://genome.sph.umich.edu/wiki/Variant_Normalization Left aligns] indel-type variants in a VCF file. Unlike normalization, it only left aligns and left trims a variant. Only indels are affected.

+

vt left_align -i mills.vcf -o mills.leftaligned.vcf

+

=== Normalization ===

+

[http://genome.sph.umich.edu/wiki/Variant_Normalization Normalize] variants in a VCF file.

+

vt normalize -i mills.vcf -o mills.normalized.vcf

+

=== Merge duplicate variants ===

+

Merges duplicate variants by position, with the option of considering alleles. (This simply discards the duplicate variant that appears later in the VCF file.)

+

Options:

+

-i, --input-vcf <string> : Input VCF file

+

-o, --output-vcf <string> : Output VCF file [-]

+

-p, --merge-by-position : Merge by position [false]

+

Example:

+

e.g. vt merge_duplicate_variants -i 8904indels.dups.genotypes.vcf -o out.vcf

+

e.g. vt merge_duplicate_variants -p -i 8904indels.dups.genotypes.vcf -o out.vcf
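
The "keep the first record, discard later duplicates" behaviour described above can be illustrated with plain awk. This is a rough sketch of the semantics only, not vt's implementation: dedup_vcf is a hypothetical helper, and it assumes a tab-delimited VCF with CHROM, POS, REF and ALT in columns 1, 2, 4 and 5.

```shell
# dedup_vcf MODE FILE
#   MODE=position : records are duplicates if CHROM and POS match (like -p)
#   any other MODE: REF and ALT must match as well
# Header lines are passed through; the first record per key is kept.
dedup_vcf() {
    mode=$1
    file=$2
    awk -F'\t' -v mode="$mode" '
        /^#/ { print; next }
        {
            key = $1 SUBSEP $2
            if (mode != "position") key = key SUBSEP $4 SUBSEP $5
            if (!(key in seen)) { seen[key] = 1; print }
        }' "$file"
}
```

For example, `dedup_vcf position in.vcf` keeps one record per site, while `dedup_vcf alleles in.vcf` keeps distinct alleles at the same site.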

+

== Profiling ==

+

A standard procedure is as follows:

+

zcat dataset.vcf.gz | vt normalize -i - | vt merge_duplicate_variants -i - > dataset.normalized.vcf

+

cut -f1-8 dataset.normalized.vcf > dataset.normalized.sites.vcf

+

cat dataset.normalized.sites.vcf | vt profile_snps -i - > snps.summary.log
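
The cut step in the procedure above drops the FORMAT and per-sample genotype columns, leaving the eight fixed VCF columns (CHROM through INFO). A minimal sketch, with sites_only as a hypothetical wrapper name:

```shell
# Keep only the eight fixed VCF columns. cut passes lines that contain no
# tab (the "##" meta-information lines) through unchanged, so the header
# survives; only the #CHROM line and the records are truncated.
sites_only() {
    cut -f1-8 "$@"
}
```

It reads a file argument or STDIN, so it can sit in the middle of a pipeline, e.g. `zcat dataset.vcf.gz | sites_only`.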

+

=== Profile SNPs ===

+

Profile SNPs.

+

* ts/tv ratio

+

* overlap analyses

+

vt profile_snps -i mills.snps.sites.vcf
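
For orientation, the ts/tv ratio is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base changes). The awk sketch below computes it for biallelic SNPs in a sites VCF; it is an illustration of the statistic, not vt's own code, and tstv is a hypothetical name.

```shell
# tstv [FILE]: print the transition/transversion ratio of biallelic SNPs.
# Records whose REF or ALT is longer than one base are skipped.
tstv() {
    awk -F'\t' '
        /^#/ { next }
        length($4) == 1 && length($5) == 1 {
            r = toupper($4); a = toupper($5); pair = r a
            if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC")
                ts++
            else if (r != a && pair ~ /^[ACGT][ACGT]$/)
                tv++
        }
        END { if (tv > 0) printf "%.2f\n", ts / tv; else print "NA" }
    ' "$@"
}
```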

+

=== Profile Indels ===

+

Profile indels.

+

* Overlap analyses with known data sets

+

* FS/NFS annotation

+

vt profile_indels -i mills.indels.sites.vcf
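
As a rough illustration of the FS/NFS distinction, an indel is frameshift (FS) when the length difference between REF and ALT is not a multiple of 3, and non-frameshift (NFS) otherwise. The sketch below applies only that length rule: it assumes biallelic records and ignores whether the indel actually lies in a coding region, which a real annotation would need to check. fs_nfs is a hypothetical name.

```shell
# fs_nfs [FILE]: print CHROM, POS, REF, ALT and FS/NFS for each indel.
# Assumes biallelic, tab-delimited VCF records; non-indels are skipped.
fs_nfs() {
    awk -F'\t' '
        /^#/ { next }
        {
            d = length($5) - length($4)
            if (d < 0) d = -d
            if (d == 0) next                 # same length: not an indel
            label = (d % 3 == 0) ? "NFS" : "FS"
            print $1 "\t" $2 "\t" $4 "\t" $5 "\t" label
        }' "$@"
}
```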

+

=== Profile MNPs ===

+

Profile MNPs.

+

vt profile_mnps -i mills.mnps.sites.vcf

+

=== Summarize Variants ===

+

Summarizes variants present in VCF file.

+

vt peek -i mills.vcf

+

== Plotting ==

+

=== Allele Frequency Spectrum ===

+

Plots the Allele Frequency Spectrum of variants found in a VCF file.

+

vt plot_afs -i mills.xml

+

=== Genotype Likelihood Concordance ===

+

Plots Genotype Likelihood Concordance graph.

+

vt plot_gl -i mills.xml

+

=== Allele Balance Spectrum ===

+

Plots Allele Balance graph of variants in the VCF file.

+

vt plot_ab -i mills.xml

+

== VCF File Manipulation ==

+

=== Sort ===

+

Sort variants according to contig lists in header.

+

vt sort -i mills.sites.vcf
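
Sorting "according to contig lists in header" means the contig order comes from the ##contig lines rather than from alphabetical or numeric order of the names. A sketch of that idea with standard tools follows; it is not vt's implementation, sort_vcf is a hypothetical name, and contigs missing from the header are simply placed last.

```shell
# sort_vcf FILE: emit the header, then records ordered by the rank of
# their contig in the ##contig header lines, then by position.
sort_vcf() {
    vcf=$1
    grep '^#' "$vcf"
    awk -F'\t' '
        /^##contig=<ID=/ {
            id = $0
            sub(/^##contig=<ID=/, "", id)   # strip the prefix
            sub(/[,>].*/, "", id)           # strip everything after the ID
            rank[id] = ++n
            next
        }
        /^#/ { next }
        {
            r = ($1 in rank) ? rank[$1] : n + 1
            print r "\t" $0                 # prepend the rank as a sort key
        }' "$vcf" |
        sort -t "$(printf '\t')" -k1,1n -k3,3n |
        cut -f2-                            # drop the temporary rank column
}
```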

+

=== Split by variant ===

+

Split VCF files by variant type.

+

vt split_by_variant -i mills.sites.vcf
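
One simple way to think about the split is by allele lengths: REF and ALT both one base is a SNP, equal lengths greater than one is an MNP, and unequal lengths is an indel. The sketch below applies that rule to biallelic records. The output naming (PREFIX.snps.vcf and so on) and the helper name split_vcf_by_type are assumptions for illustration, not vt's behaviour.

```shell
# split_vcf_by_type FILE PREFIX
# Writes PREFIX.snps.vcf, PREFIX.mnps.vcf and PREFIX.indels.vcf, each with
# a copy of the header. Only files with at least one record are created.
split_vcf_by_type() {
    vcf=$1
    prefix=$2
    awk -F'\t' -v p="$prefix" '
        /^#/ { hdr = hdr $0 "\n"; next }
        {
            r = length($4); a = length($5)
            if (r == 1 && a == 1) t = "snps"
            else if (r == a)      t = "mnps"
            else                  t = "indels"
            out = p "." t ".vcf"
            if (!(out in started)) { printf "%s", hdr > out; started[out] = 1 }
            print > out
        }' "$vcf"
}
```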

+

= Resource Files =

+

dbSNP

+

OMNI 1000G

+

Mills

+

HAPMAP

+

= Maintained by =

+

This page is maintained by [mailto:atks@umich.edu Adrian]
