Difference between revisions of "Vt"
Revision as of 16:17, 21 October 2013
Introduction
vt is a tool set that calls, genotypes and filters short variants. It provides profiling of variants to aid in QC.
Location
Internal usage
binaries /net/fantasia/home/atks/programs/vt
test data /net/fantasia/home/atks/programs/vt/test
scripts /net/fantasia/home/atks/programs/vt/scripts
External usage
download from sourceforge/github
Common option patterns
-i defines the input file and is a required parameter by default. You may set it to '-' to read from STDIN, which is assumed by default to be uncompressed.
-o defines the output file and defaults to STDOUT. Use the -c option to write the binary version of the format, e.g. BCF, to STDOUT.
Major Workflows
Discovery
Discovery is performed at the per-sample level. The evidence site lists for each sample are then merged and site discovery statistics are computed. The user then chooses cutoffs to create an initial candidate site list.
Generate a site list with INFO fields E and N.
vt discover -b NA12878.bam -o NA12878.sites.vcf -g hs37d5.fa
Normalize (including left aligning) variants. This is required because left alignment of insertions and/or deletions within a read is sometimes insufficient to ensure complete left alignment.
vt normalize -i NA12878.sites.vcf -o NA12878.normalized.sites.vcf -g hs37d5.fa
Evidence site lists are combined across samples and split by sites to allow for parallelization.
vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -l 5000
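With many samples, the comma-separated -i list is easier to build than to type. The following is a minimal sketch, assuming each sample's site list is named <sample>.sites.vcf as in the examples above; the helper functions join_by_comma and site_list_for are hypothetical and not part of vt. The merge command is only printed, so it can be checked before it is run.

```shell
# Join arguments with commas: a b c -> a,b,c
join_by_comma() {
    joined=""
    for item in "$@"; do
        if [ -z "$joined" ]; then
            joined="$item"
        else
            joined="$joined,$item"
        fi
    done
    printf '%s\n' "$joined"
}

# Build one sites file name per sample ID (assumed naming scheme:
# <sample>.sites.vcf) and join the names with commas.
site_list_for() {
    names=""
    for sample in "$@"; do
        names="$names $sample.sites.vcf"
    done
    # Word splitting on $names is intended here.
    join_by_comma $names
}

# Print the merge command instead of running it.
echo "vt merge_and_split_sample_vcf -i $(site_list_for NA12878 NA12879 NA12880) -l 5000"
```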
Discovery statistics are computed. These statistics allow you to choose a suitable cutoff for creating the candidate site list.
vt compute_discovery_stats -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf
Merge site lists.
vt merge -i 1000000.sites.vcf,2000000.sites.vcf,3000000.sites.vcf -o candidate.sites.vcf
Plot charts to help with candidate list selection criteria.
vt plot_discovery -i candidate.sites.vcf
A calling pipeline implemented in a make file is available here.
Genotyping
Each individual is genotyped at a set of sites.
vt genotype -b NA12878.bam -i candidate.sites.vcf -o NA12878.sites.vcf -g hs37d5.fa
Genotype sample VCFs are combined across samples and split by sites.
vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -o 1-1000000.sites.vcf
Features are computed.
vt compute_features -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf
A genotyping pipeline implemented in a make file is available here.
Filtering
Requires a set of features AND an installed copy of SVMLight.
vt filter NA12878.bam -i NA12878.sites.vcf -o NA12878.svm.sites.vcf --pos positive.sites.vcf --neg negative.sites.vcf
A filtering pipeline implemented in a make file is available here.
Generation
Discovery
Discovers variants from bams.
Options:
  -b, --input-bam-file  : Input BAM file
  -o, --output-vcf-file : Output VCF file
  -v, --variant-type    : Variant types, takes any comma-delimited combination of the values snps,mnps,indels [snps,mnps,indels]
  -q, --q-cutoff        : BASE cutoff, only bases with QUAL/BAQ >= baseq are considered [13]
  -m, --mapq-cutoff     : MAPQ cutoff, only alignments with map quality >= mapq are considered [20]
  -g, --genome-fa-file  : Genome FASTA file
  -s, --sample-id       : Sample ID
Example:
  vt discover -b in.bam -o - -g ref.fa -v snps,indels -s HG0001
  bam mergeBam --in a.bam --in b.bam -o - |
    vt discover -b - -o - -g ref.fa -v all -s HG0001 |
    vt left_align -i - | vt merge_duplicate_variants -i - -o out.sites.vcf
Genotyping
Genotypes variants for each sample.
Options:
  -b, --input-bam-file      : Input BAM file
  -i, --input-candidate-vcf : Input candidate VCF file
  -o, --output-vcf-file     : Output VCF file
  -v, --variant-type        : Variant types, takes any comma-delimited combination of the values snps,mnps,indels [snps,mnps,indels]
  -g, --genome-fa-file      : Genome FASTA file
  -s, --sample-id           : Sample ID
Example:
  vt genotype -b in.bam -i candidate.sites.vcf -o - -g ref.fa -s HG0001
Annotation
Make Probes
Populates the INFO field with REFPROBE, ALTPROBE and PLEN tags for genotyping.
Options:
  -i, --input-vcf <string>     : Input VCF file
  -o, --output-vcf <string>    : Output VCF file [-]
  -g, --genome-fa              : Genome FASTA file [/net/fantasia/home/atks/ref/genome/human.g1k.v37.fa]
  -f, --flank-length <integer> : Minimum flank length [20]
Example:
  vt make_probes -i 8904indels.dups.genotypes.vcf -o probes.sites.vcf -g ref.fa
Compute Feature
Compute feature of variant.
vt compute_feature -i mills.vcf
Compute Allele balance
Compute allele balance (http://genome.sph.umich.edu/wiki/Genotype_Likelihood_Based_Allele_Balance). Outputs allele balance, allele frequency and genotype frequency.
vt compute_ab -i mills.vcf
Compute Allele Frequency
Compute allele frequency (http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Allele_Frequency). Outputs allele frequency and genotype frequency.
vt compute_af -i mills.vcf
Compute Inbreeding Coefficient
Compute inbreeding coefficient (http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Inbreeding_Coefficient). Outputs inbreeding coefficient based on genotype likelihoods.
vt compute_fic -i mills.vcf
Compute HWE
Compute Hardy-Weinberg equilibrium statistic (http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Hardy-Weinberg_Test). Outputs PHRED-scaled HWE test p-values for biallelic as well as multiallelic variants.
vt compute_hwe -i mills.vcf
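For readers unfamiliar with PHRED scaling, a p-value p is conventionally reported as -10 * log10(p), so smaller p-values give larger scores. The sketch below only illustrates that conversion; the function name phred_scale is hypothetical and this is not how vt computes or formats its output.

```shell
# Convert a p-value to the PHRED scale: -10 * log10(p).
phred_scale() {
    awk -v p="$1" 'BEGIN {
        # The scale is only defined for 0 < p <= 1.
        if (p <= 0 || p > 1) { print "NA"; exit }
        # awk only has the natural log, so divide by log(10).
        # Adding 0 turns a negative zero (p = 1) into plain zero.
        printf "%.1f\n", -10 * log(p) / log(10) + 0
    }'
}

phred_scale 0.001
phred_scale 0.05
```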
Compute Mendelian Error
Compute Mendelian error statistics. Outputs allele frequency and genotype frequency.
vt compute_mendel -i mills.vcf
Compute features
vt compute_<feature1>_<feature2>_ ... _<feature n> -i mills.vcf
Modification
Left Alignment
Left aligns (http://genome.sph.umich.edu/wiki/Variant_Normalization) indel type variants in a VCF file. This differs from normalization in that it only left aligns and left trims a variant. This affects indels only.
vt left_align -i mills.vcf -o mills.leftaligned.vcf
Normalization
Normalize (http://genome.sph.umich.edu/wiki/Variant_Normalization) variants in a VCF file.
vt normalize -i mills.vcf -o mills.normalized.vcf
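To make the idea concrete, here is a minimal sketch of the usual trim-and-extend procedure for a single biallelic variant: drop a shared rightmost base, extend both alleles one base to the left whenever one of them becomes empty, then trim shared leading bases. It is an illustration under stated assumptions, not vt's implementation; the function normalize_variant and the toy reference sequence are hypothetical, and POS is 1-based within that sequence.

```shell
# usage: normalize_variant REFSEQ POS REF ALT   (POS is 1-based in REFSEQ)
normalize_variant() {
    awk -v seq="$1" -v pos="$2" -v ref="$3" -v alt="$4" 'BEGIN {
        changed = 1
        while (changed) {
            changed = 0
            # Drop the rightmost base when both alleles end with the same base.
            if (length(ref) > 0 && length(alt) > 0 &&
                substr(ref, length(ref)) == substr(alt, length(alt))) {
                ref = substr(ref, 1, length(ref) - 1)
                alt = substr(alt, 1, length(alt) - 1)
                changed = 1
            }
            # If an allele became empty, extend both one base to the left.
            if ((length(ref) == 0 || length(alt) == 0) && pos > 1) {
                pos--
                base = substr(seq, pos, 1)
                ref = base ref
                alt = base alt
                changed = 1
            }
        }
        # Trim shared leading bases while both alleles keep at least two bases.
        while (length(ref) > 1 && length(alt) > 1 &&
               substr(ref, 1, 1) == substr(alt, 1, 1)) {
            ref = substr(ref, 2)
            alt = substr(alt, 2)
            pos++
        }
        print pos, ref, alt
    }'
}

# A CA deletion written in the middle of a CA repeat (toy reference GCACACAT).
normalize_variant GCACACAT 5 ACA A
```

In the toy reference the same deletion can be written at several positions inside the repeat; the procedure moves it to the leftmost one, anchored on the base before the repeat.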
Merge duplicate variants
Merges duplicate variants by position with the option of considering alleles. (This just discards the duplicate variant that appears later in the VCF file)
Options:
  -i, --input-vcf <string>  : Input VCF file
  -o, --output-vcf <string> : Output VCF file [-]
  -p, --merge-by-position   : Merge by position [false]
Example:
  vt merge_duplicate_variants -i 8904indels.dups.genotypes.vcf -o out.vcf
  vt merge_duplicate_variants -p -i 8904indels.dups.genotypes.vcf -o out.vcf
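The "discard the later duplicate" behaviour can be pictured with a short awk filter. This is only a sketch of the idea, assuming that duplicates are keyed on CHROM, POS, REF and ALT by default and on CHROM and POS alone when merging by position; the function drop_later_duplicates and the demo records are hypothetical, and vt's own matching rules may differ.

```shell
# Keep the first record for each key; header lines (starting with #) pass through.
# Argument 1 set to 1 keys on CHROM+POS only; otherwise REF and ALT are included.
drop_later_duplicates() {
    awk -F'\t' -v by_position="${1:-0}" '
        /^#/ { print; next }
        {
            key = $1 SUBSEP $2
            if (by_position != 1) key = key SUBSEP $4 SUBSEP $5
            if (!(key in seen)) { seen[key] = 1; print }
        }'
}

# Demo input: one header line and three records at the same position.
demo_records() {
    printf '#CHROM\tPOS\tID\tREF\tALT\n'
    printf '1\t100\t.\tA\tG\n'
    printf '1\t100\t.\tA\tT\n'
    printf '1\t100\t.\tA\tG\n'
}

demo_records | drop_later_duplicates      # keeps A>G and A>T
demo_records | drop_later_duplicates 1    # keeps only the first record
```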
Profiling
A standard procedure is as follows:
zcat dataset.vcf.gz | vt normalize -i - | vt merge_duplicate_variants -i - > dataset.normalized.vcf
cut -f1-8 dataset.normalized.vcf > dataset.normalized.sites.vcf
cat dataset.normalized.sites.vcf | vt profile_snps -i - > snps.summary.log
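The cut step keeps the eight fixed VCF columns (CHROM through INFO) and drops FORMAT and the per-sample genotype columns, which gives a sites-only file. A small sketch on an inline two-sample VCF shows the effect; the helper make_demo_vcf and its records are made up for illustration.

```shell
# Emit a tiny two-sample VCF (made-up records).
make_demo_vcf() {
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
    printf '20\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n'
}

# Lines without a tab (the ## meta lines) pass through cut unchanged;
# the column header and the records are reduced to their first 8 columns.
make_demo_vcf | cut -f1-8
```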
Profile SNPs
Profile SNPs.
- ts/tv ratio
- overlap analyses
vt profile_snps -i mills.snps.sites.vcf
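As a reminder of what the ts/tv ratio measures: transitions are A<->G and C<->T substitutions, and every other single-base substitution is a transversion. The awk sketch below counts both over biallelic SNP records and prints their ratio. It is an illustration only, with a hypothetical function name and demo data, and does not reproduce the output of vt profile_snps.

```shell
# Print the transition/transversion ratio of the biallelic SNPs on stdin.
tstv() {
    awk -F'\t' '
        /^#/ { next }
        {
            ref = toupper($4); alt = toupper($5)
            # Skip anything that is not a single-base substitution.
            if (ref !~ /^[ACGT]$/ || alt !~ /^[ACGT]$/ || ref == alt) next
            pair = ref alt
            if (pair == "AG" || pair == "GA" || pair == "CT" || pair == "TC") ts++
            else tv++
        }
        END {
            if (tv > 0) printf "%.2f\n", ts / tv
            else print "NA"
        }'
}

# Demo input: two transitions, one transversion and one indel (ignored).
demo_snps() {
    printf '#CHROM\tPOS\tID\tREF\tALT\n'
    printf '1\t10\t.\tA\tG\n'
    printf '1\t20\t.\tC\tT\n'
    printf '1\t30\t.\tA\tC\n'
    printf '1\t40\t.\tA\tAT\n'
}

demo_snps | tstv
```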
Profile Indels
Profile indels.
- Overlap analyses with known data sets
- FS/NFS annotation
vt profile_indels -i mills.indels.sites.vcf
Profile MNPs
Profile MNPs.
vt profile_mnps -i mills.mnps.sites.vcf
Summarize Variants
Summarizes variants present in VCF file.
vt peek -i mills.vcf
Plotting
Allele Frequency Spectrum
Plots the allele frequency spectrum of variants found in a VCF file.
vt plot_afs -i mills.xml
Genotype Likelihood Concordance
Plots Genotype Likelihood Concordance graph.
vt plot_gl -i mills.xml
Allele Balance Spectrum
Plots Allele Balance graph of variants in the VCF file.
vt plot_ab -i mills.xml
VCF File Manipulation
Sort
Sort variants according to contig lists in header.
vt sort -i mills.sites.vcf
Split by variant
Split VCF files by variant type.
vt split_by_variant -i mills.sites.vcf
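A rough picture of what splitting by variant type involves: each record is assigned a type from its REF and ALT alleles. The sketch below uses a deliberately simple length-based rule for biallelic records (both alleles one base: SNP; equal length above one: MNP; different lengths: indel). The rule and the function classify_variants are assumptions for illustration; vt's own classification may be more detailed.

```shell
# Print CHROM, POS and a simple length-based variant type for each record.
classify_variants() {
    awk -F'\t' '
        /^#/ { next }
        {
            if (length($4) == 1 && length($5) == 1) type = "snp"
            else if (length($4) == length($5)) type = "mnp"
            else type = "indel"
            print $1 "\t" $2 "\t" type
        }'
}

# Demo input: one SNP, one MNP and one insertion (made-up records).
demo_variants() {
    printf '#CHROM\tPOS\tID\tREF\tALT\n'
    printf '1\t10\t.\tA\tG\n'
    printf '1\t20\t.\tAT\tGC\n'
    printf '1\t30\t.\tA\tAT\n'
}

demo_variants | classify_variants
```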
Resource Files
dbSNP
OMNI 1000G
Mills
HAPMAP
Maintained by
This page is maintained by Adrian (atks@umich.edu)