Difference between revisions of "Vt"
Revision as of 16:25, 21 October 2013
Introduction
vt is a tool set that calls, genotypes and filters short variants. It provides profiling of variants to aid in QC.
Location
Internal usage
binaries /net/fantasia/home/atks/programs/vt
test data /net/fantasia/home/atks/programs/vt/test
scripts /net/fantasia/home/atks/programs/vt/scripts
External usage
download from sourceforge/github
Common options patterns
-i defines the input file. It is a required parameter by default; set it to '-' to read from STDIN, which is assumed to be uncompressed.
-o defines the output file and defaults to STDOUT. Use the -c option to make STDOUT emit the binary version of the format, e.g. BCF.
Major Workflows
Discovery
Discovery is performed at the per-sample level. The evidence site lists for each sample are then merged and site discovery statistics are computed. The user then decides on cut-offs to create an initial candidate site list.
Generates site list with info fields E and N.
vt discover -i NA12878.bam -o NA12878.sites.vcf -g hs37d5.fa
Normalize (including left aligning) variants. This is required because left alignment of insertions and/or deletions within a read is sometimes insufficient to ensure complete left alignment.
vt normalize -i NA12878.sites.vcf -o NA12878.normalized.sites.vcf -g hs37d5.fa
Evidence site lists are combined across samples and split by sites to allow for parallelization.
vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -l 5000
Discovery statistics are computed. These statistics help you choose a suitable cut-off for creating the candidate site list.
vt compute_discovery_stats -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf
Merge site lists.
vt merge -i 1000000.sites.vcf,2000000.sites.vcf,3000000.sites.vcf -o candidate.sites.vcf
Plot charts to help with candidate list selection criteria.
vt plot_discovery -i candidate.sites.vcf
A calling pipeline implemented in a make file is available here.
Genotyping
Each individual is genotyped at a set of sites.
vt genotype -i NA12878.bam -o NA12878.sites.vcf -g hs37d5.fa
Genotype sample VCFs are combined across samples and split by sites.
vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -o 1-1000000.sites.vcf
Features are computed.
vt compute_features -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf
A genotyping pipeline implemented in a make file is available here.
Filtering
Requires a set of features AND an installed copy of SVMLight.
vt filter NA12878.bam -i NA12878.sites.vcf -o NA12878.svm.sites.vcf --pos positive.sites.vcf --neg negative.sites.vcf
A filtering pipeline implemented in a make file is available here.
Generation
Discovery
Discovers variants from bams.
Options:
  -b, --input-bam-file : Input BAM file
  -o, --output-vcf-file : Output VCF file
  -v, --variant-type : Variant Types, takes on any combinations of the values snps,mnps,indels comma delimited [snps,mnps,indels]
  -q, --q-cutoff : BASE Cutoff, only bases with QUAL/BAQ >= baseq are considered [13]
  -m, --mapq-cutoff : MAPQ Cutoff, only alignments with map quality >= mapq are considered [20]
  -g, --genome-fa-file : Genome FASTA file
  -s, --sample-id : Sample ID
Examples:
  vt discover -b in.bam -o - -g ref.fa -v snps,indels -s HG0001
  bam mergeBam --in a.bam --in b.bam -o - | vt discover -b - -o out.sites.vcf -g ref.fa -v all -s HG0001 | vt left_align -i - | vt merge_duplicate_variants
Genotyping
Genotypes variants for each sample.
Options:
  -b, --input-bam-file : Input BAM file
  -i, --input-candidate-vcf : Input Candidate VCF file
  -o, --output-vcf-file : Output VCF file
  -v, --variant-type : Variant Types, takes on any combinations of the values snps,mnps,indels comma delimited [snps,mnps,indels]
  -g, --genome-fa-file : Genome FASTA file
  -s, --sample-id : Sample ID
Example:
  vt genotype -b in.bam -i candidate.sites.vcf -o - -g ref.fa -s HG0001
Annotation
Make Probes
Populates the info field with REFPROBE, ALTPROBE and PLEN tags for genotyping.
Options:
  -i, --input-vcf <string> : Input VCF file
  -o, --output-vcf <string> : Output VCF file [-]
  -g, --genome-fa : Genome FASTA file [/net/fantasia/home/atks/ref/genome/human.g1k.v37.fa]
  -f, --flank-length <integer> : Minimum Flank Length [20]
Example:
  vt make_probes -i 8904indels.dups.genotypes.vcf -o probes.sites.vcf -g ref.fa
Compute Feature
Compute feature of variant.
vt compute_feature -i mills.vcf
Compute Allele balance
Compute allele balance. Outputs allele balance, allele frequency, genotype frequency.
vt compute_ab -i mills.vcf
Compute Allele Frequency
Compute allele frequency. Outputs allele frequency and genotype frequency.
vt compute_af -i mills.vcf
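To make the quantity concrete, the following is a minimal sketch of an allele frequency computed from hard genotype calls with awk. It is not vt's implementation, and vt's estimator may differ (for example by using genotype likelihoods). The record and its four samples are made up, and GT is assumed to be the first FORMAT field.

```shell
# One made-up site with four diploid samples; sample columns start at column 10.
line=$(printf '20\t100\t.\tA\tG\t.\t.\t.\tGT\t0/0\t0/1\t1/1\t./.')

af=$(printf '%s\n' "$line" | awk -F'\t' '
    !/^#/ {
        alt = 0; called = 0
        for (i = 10; i <= NF; i++) {
            split($i, f, ":")             # assumption: GT is the first FORMAT field
            n = split(f[1], a, "[/|]")    # alleles are separated by "/" or "|"
            for (j = 1; j <= n; j++) {
                if (a[j] == ".") continue # skip missing alleles
                called++
                if (a[j] != "0") alt++    # count every non-reference allele
            }
        }
        if (called > 0) printf "%.3f", alt / called; else print "NA"
    }')
echo "AF = $af"
```

Missing genotypes are left out of the denominator, so the frequency is taken over called alleles only.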
Compute Inbreeding Coefficient
Compute inbreeding coefficient. Outputs inbreeding coefficient based on genotype likelihoods.
vt compute_fic -i mills.vcf
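As a rough illustration of what an inbreeding coefficient measures, the sketch below uses the textbook estimate F = 1 - H_obs / H_exp from hard genotype counts at one biallelic site. This is a simplification: the command above is described as working from genotype likelihoods, which this sketch does not attempt. The counts are made up.

```shell
# Made-up genotype counts at one biallelic site: hom-ref, het, hom-alt.
n_rr=50; n_ra=20; n_aa=30

fic=$(awk -v rr="$n_rr" -v ra="$n_ra" -v aa="$n_aa" 'BEGIN {
    n = rr + ra + aa
    if (n == 0) { print "NA"; exit }
    p = (2 * rr + ra) / (2 * n)   # reference allele frequency
    h_obs = ra / n                # observed heterozygosity
    h_exp = 2 * p * (1 - p)       # heterozygosity expected under Hardy-Weinberg
    if (h_exp > 0) printf "%.3f", 1 - h_obs / h_exp; else print "NA"
}')
echo "F = $fic"
```

A positive value indicates fewer heterozygotes than expected, a negative value indicates an excess.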
Compute HWE
Compute Hardy-Weinberg equilibrium statistic. Outputs PHRED scaled HWE Test p-values for biallelic as well as multiallelic variants.
vt compute_hwe -i mills.vcf
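"PHRED scaled" means a p-value p is reported as Q = -10 * log10(p), so smaller p-values give larger scores. A small sketch of the conversion follows; the helper name phred is hypothetical and not part of vt.

```shell
# Convert a p-value to the PHRED scale: Q = -10 * log10(p).
# awk only provides the natural log, so divide by log(10).
phred() {
    awk -v p="$1" 'BEGIN { printf "%.1f\n", -10 * log(p) / log(10) }'
}

q=$(phred 0.001)
echo "$q"   # prints 30.0
```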
Compute Mendelian Error
Compute Mendelian error statistics. Outputs allele frequency and genotype frequency.
vt compute_mendel -i mills.vcf
Compute features
vt compute_<feature1>_<feature2>_ ... _<feature n> -i mills.vcf
Modification
Left Alignment
Left aligns indel type variants in a VCF file. This differs from normalization in that it only left aligns and left trims a variant. This affects Indels only.
vt left_align -i mills.vcf -o mills.leftaligned.vcf
Normalization
Normalize variants in a VCF file.
vt normalize mills.vcf -r seq.fa -o mills.normalized.vcf
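The idea behind normalization can be sketched for a single biallelic variant: trim bases shared at the right end of the alleles, extend to the left with reference bases whenever an allele becomes empty, and finally trim shared leading bases while both alleles keep at least one base. The function below is an illustrative sketch under those assumptions, not vt's code; the name normalize_variant and the toy reference sequence are hypothetical.

```shell
# usage: normalize_variant REFERENCE_SEQUENCE POS REF ALT   (POS is 1-based)
normalize_variant() {
    awk -v seq="$1" -v pos="$2" -v ref="$3" -v alt="$4" 'BEGIN {
        do {
            changed = 0
            # Drop the last base while both alleles end with the same base.
            if (length(ref) > 0 && length(alt) > 0 &&
                substr(ref, length(ref)) == substr(alt, length(alt))) {
                ref = substr(ref, 1, length(ref) - 1)
                alt = substr(alt, 1, length(alt) - 1)
                changed = 1
            }
            # If an allele became empty, extend both one base to the left.
            if ((length(ref) == 0 || length(alt) == 0) && pos > 1) {
                pos--
                ref = substr(seq, pos, 1) ref
                alt = substr(seq, pos, 1) alt
                changed = 1
            }
        } while (changed)
        # Left trim: drop shared leading bases, keeping at least one base each.
        while (length(ref) > 1 && length(alt) > 1 &&
               substr(ref, 1, 1) == substr(alt, 1, 1)) {
            ref = substr(ref, 2); alt = substr(alt, 2); pos++
        }
        print pos, ref, alt
    }'
}

# A CA deletion inside a CA repeat, written at a non-leftmost position.
result=$(normalize_variant GGGCACACACAGGG 8 CAC C)
echo "$result"   # prints: 3 GCA G
```

The deletion slides left through the repeat until the preceding G anchors it, so every equivalent representation of this deletion ends up as the same record.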
Merge duplicate variants
Merges duplicate variants by position with the option of considering alleles. (This just discards the duplicate variant that appears later in the VCF file)
Options:
  -i, --input-vcf <string> : Input VCF file
  -o, --output-vcf <string> : Output VCF file [-]
  -p, --merge-by-position : Merge by position [false]
Examples:
  vt merge_duplicate_variants -i 8904indels.dups.genotypes.vcf -o out.vcf
  vt merge_duplicate_variants -p -i 8904indels.dups.genotypes.vcf -o out.vcf
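The "keep the first, discard later duplicates" behaviour described above can be mimicked with awk for illustration. This sketch assumes that the default treats records with the same CHROM, POS, REF and ALT as duplicates and that -p compares CHROM and POS only; the records are made up.

```shell
# Made-up records: CHROM POS ID REF ALT. Records 2 and 3 share a position;
# record 4 is an exact duplicate of record 1.
vcf=$(printf '20\t100\t.\tA\tG\n20\t200\t.\tC\tT\n20\t200\t.\tC\tA\n20\t100\t.\tA\tG\n')

# Alleles considered: only the first record per CHROM, POS, REF, ALT is kept.
by_allele=$(printf '%s\n' "$vcf" | awk -F'\t' '/^#/ || !seen[$1, $2, $4, $5]++')

# Position only: the first record at each CHROM, POS is kept.
by_pos=$(printf '%s\n' "$vcf" | awk -F'\t' '/^#/ || !seen[$1, $2]++')

n_allele=$(printf '%s\n' "$by_allele" | wc -l | tr -d ' ')
n_pos=$(printf '%s\n' "$by_pos" | wc -l | tr -d ' ')
echo "kept by allele: $n_allele, kept by position: $n_pos"
```

With alleles considered, the C>A record at position 200 survives because its ALT differs; by position only, it is dropped.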
Profiling
A standard procedure is as follows:
zcat dataset.vcf.gz | vt normalize -i - | vt merge_duplicate_variants -i - > dataset.normalized.vcf
cut -f1-8 dataset.normalized.vcf > dataset.normalized.sites.vcf
cat dataset.normalized.sites.vcf | vt profile_snps -i - > snps.summary.log
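The cut -f1-8 step works because the first eight tab-separated VCF columns (CHROM through INFO) describe the site, while FORMAT and the per-sample genotype columns follow. A small sketch on a made-up two-sample record:

```shell
# A made-up VCF record with two samples (columns are tab separated).
record=$(printf '20\t1000\t.\tA\tG\t50\tPASS\tAC=1\tGT\t0/1\t0/0')

# Keep only the site-level columns: CHROM POS ID REF ALT QUAL FILTER INFO.
site=$(printf '%s\n' "$record" | cut -f1-8)

# Count the remaining columns.
ncols=$(printf '%s\n' "$site" | awk -F'\t' '{ print NF }')
printf '%s\n' "$site"
echo "columns: $ncols"
```

Lines that contain no tab at all are passed through unchanged by cut.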
Profile SNPs
Profile SNPs.
- ts/tv ratio
- overlap analyses
vt profile_snps -i mills.snps.sites.vcf
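For reference, the ts/tv ratio is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other base substitutions). The awk sketch below computes it over made-up biallelic SNP sites; it only illustrates the definition and is not how vt computes or reports it.

```shell
# Made-up biallelic SNP sites: CHROM POS ID REF ALT (tab separated).
sites=$(printf '20\t100\t.\tA\tG\n20\t200\t.\tC\tT\n20\t300\t.\tA\tC\n20\t400\t.\tG\tA\n20\t500\t.\tT\tG\n')

tstv=$(printf '%s\n' "$sites" | awk -F'\t' '
    # Skip header lines and anything that is not a single-base substitution.
    /^#/ || $4 !~ /^[ACGTacgt]$/ || $5 !~ /^[ACGTacgt]$/ { next }
    {
        pair = toupper($4 $5)
        # Transitions stay within purines (A, G) or within pyrimidines (C, T).
        if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ts++
        else tv++
    }
    END { if (tv > 0) printf "%.2f", ts / tv; else print "NA" }')
echo "ts/tv = $tstv"
```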
Profile Indels
Profile indels.
- Overlap analyses with known data sets
- FS/NFS annotation
vt profile_indels mills.indels.sites.vcf
Profile MNPs
Profile MNPs.
vt profile_mnps -i mills.mnps.sites.vcf
Summarize Variants
Summarizes variants present in VCF file.
vt peek -i mills.vcf
Plotting
Allele Frequency Spectrum
Plots Allele Frequency Spectrum of variants found in VCF file
vt plot_afs -i mills.xml
Genotype Likelihood Concordance
Plots Genotype Likelihood Concordance graph.
vt plot_gl -i mills.xml
Allele Balance Spectrum
Plots Allele Balance graph of variants in the VCF file.
vt plot_ab -i mills.xml
VCF File Manipulation
Sort
Sort variants according to contig lists in header.
vt sort -i mills.sites.vcf
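Sorting "according to contig lists in header" means contigs follow the order of the ##contig lines rather than alphabetical or numeric order. A sketch of that ordering with awk and sort follows, assuming every contig used by a record is declared in the header; the VCF content is made up and this is not vt's implementation.

```shell
tab=$(printf '\t')

# Made-up VCF whose header declares the contig order 20, X, 1.
vcf=$(printf '##fileformat=VCFv4.1\n##contig=<ID=20>\n##contig=<ID=X>\n##contig=<ID=1>\n#CHROM\tPOS\tID\tREF\tALT\nX\t50\t.\tA\tG\n1\t10\t.\tC\tT\n20\t300\t.\tG\tA\n20\t100\t.\tT\tC\n')

# Header lines keep their original order.
header=$(printf '%s\n' "$vcf" | grep '^#')

# Prefix each record with the rank of its contig, sort by rank then POS,
# and drop the helper column again.
body=$(printf '%s\n' "$vcf" | awk -F'\t' '
    /^##contig=<ID=/ {
        id = $0
        sub(/^##contig=<ID=/, "", id)   # strip the leading tag
        sub(/[,>].*$/, "", id)          # strip everything after the ID
        rank[id] = ++n
    }
    /^#/ { next }
    { print rank[$1] "\t" $0 }
' | sort -t "$tab" -k1,1n -k3,3n | cut -f2-)

printf '%s\n%s\n' "$header" "$body"
```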
Split by variant
Split VCF files by variant type.
vt split_by_variant -i mills.sites.vcf
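One simple way to think about variant types for biallelic records is by allele length: single bases on both sides give a SNP, equal lengths greater than one give an MNP, and unequal lengths give an indel. The sketch below labels made-up records this way; vt's own classification rules may differ, particularly for multiallelic or complex variants.

```shell
# Made-up biallelic records: CHROM POS ID REF ALT (tab separated).
sites=$(printf '20\t100\t.\tA\tG\n20\t200\t.\tAC\tGT\n20\t300\t.\tA\tAT\n20\t400\t.\tCAT\tC\n')

labelled=$(printf '%s\n' "$sites" | awk -F'\t' '
    /^#/ { next }
    {
        if (length($4) == 1 && length($5) == 1) type = "snp"
        else if (length($4) == length($5))      type = "mnp"
        else                                    type = "indel"
        print type "\t" $0   # the label could instead select an output file
    }')
printf '%s\n' "$labelled"
```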
Resource Files
dbSNP, OMNI, 1000G, Mills, HAPMAP
Maintained by
This page is maintained by Adrian