|
|
Line 1: |
Line 1: |
| === Introduction === | | === Introduction === |
| | | |
− | vt is a tool set that calls, genotypes and filters short variants. It provides profiling of variants to aid in QC. | + | vt is a variant tool set that discovers short variants from Next Generation Sequencing data. Features are being rolled out to GitHub while a major rewrite is under way. |
− | | |
| | | |
| === Location === | | === Location === |
| | | |
− | Internal usage
+ | You may pull it from GitHub: |
− | | |
− | binaries
| |
− | /net/fantasia/home/atks/programs/vt
| |
− | | |
− | test data
| |
− | /net/fantasia/home/atks/programs/vt/test
| |
− | | |
− | scripts
| |
− | /net/fantasia/home/atks/programs/vt/scripts
| |
− | | |
− | External usage
| |
| | | |
− | download from sourceforge/github | + | git clone https://github.com/atks/vt.git |
| | | |
− | == Common options patterns == | + | == Common options == |
| | | |
− | -i defines the input file and by default, this is a required parameter, | + | -i multiple intervals in <seq>:start-end format |
− | however, you may set it as '-' to accept STDIN which by default is
| |
− | assumed to be a non compressed format.
| |
| | | |
| -o defines the output file; STDOUT is the default. | | -o defines the output file; STDOUT is the default. |
− | You may modify the STDOUT to output the binary version of the format, | + | You may modify the STDOUT to output the binary version of the format. |
− | e.g. BCF. with the option -c
| |
− | | |
− | == Major Workflows ==
| |
− | | |
− | === Discovery ===
| |
− | | |
− | Discovery is performed at the per-sample level; the evidence site lists for each sample are then merged and site discovery statistics are computed.
| |
− | The user then decides on cutoffs to create an initial candidate site list.
| |
− | | |
− | Generates site list with info fields E and N.
| |
− | | |
− | vt discover -i NA12878.bam -o NA12878.sites.vcf -g hs37d5.fa
| |
− | | |
− | Normalize (including left aligning) variants. This is required as left alignment of insertions and/or deletions within a read is sometimes insufficient to ensure complete left alignment.
| |
− | | |
− | vt normalize -i NA12878.bam -o NA12878.normalized.sites.vcf -g hs37d5.fa
| |
− | | |
− | Evidence site lists are combined across samples and split by sites to allow for parallelization.
| |
− | | |
− | vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -l 5000
| |
− | | |
− | Discovery statistics are computed. These statistics let you choose a suitable cutoff for creating a candidate site list.
| |
− | | |
− | vt compute_discovery_stats -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf
| |
− | | |
− | Merge site lists.
| |
− | | |
− | vt merge -i 1000000.sites.vcf,2000000.sites.vcf,3000000.sites.vcf -o candidate.sites.vcf
| |
− | | |
− | Plot charts to help with candidate list selection criteria.
| |
− | | |
− | vt plot_discovery -i candidate.sites.vcf
| |
− | | |
− | | |
− | A calling pipeline implemented in a make file is available here.
| |
− | | |
− | === Genotyping ===
| |
− | | |
− | Each individual is genotyped at a set of sites.
| |
− | | |
− | vt genotype -i NA12878.bam -o NA12878.sites.vcf -g hs37d5.fa
| |
− | | |
− | Genotype sample VCFs are combined across samples and split by sites.
| |
− | | |
− | vt merge_and_split_sample_vcf -i NA12878.sites.vcf,NA12879.sites.vcf,NA12880.sites.vcf -o 1-1000000.sites.vcf
| |
− | | |
− | Features are computed.
| |
− | | |
− | vt compute_features -i 1-1000000.sites.vcf -o 1-1000000.annotated.sites.vcf
| |
− | | |
− | A genotyping pipeline implemented in a make file is available here.
| |
− | | |
− | === Filtering ===
| |
− | | |
− | Requires a set of features AND an installed copy of SVMLight.
| |
− | | |
− | vt filter NA12878.bam -i NA12878.sites.vcf -o NA12878.svm.sites.vcf --pos positive.sites.vcf --neg negative.sites.vcf
| |
− | | |
− | A filtering pipeline implemented in a make file is available here.
| |
− | | |
− | == Generation ==
| |
− | | |
− | === Discovery ===
| |
− | | |
− | Discovers variants from bams.
| |
− | | |
− | Options:
| |
− | -b, --input-bam-file : Input BAM file
| |
− | -o, --output-vcf-file : Output VCF file
| |
− | -v, --variant-type : Variant Types, takes on any combinations of
| |
− | the values snps,mnps,indels comma delimited
| |
− | [snps,mnps,indels]
| |
− | -q, --q-cutoff : BASE Cutoff, only bases with
| |
− | QUAL/BAQ >= baseq are considered [13]
| |
− | -m, --mapq-cutoff : MAPQ Cutoff, only alignments with
| |
− | map quality >= mapq are considered [20]
| |
− | -g, --genome-fa-file : Genome FASTA file
| |
− | -s, --sample-id : Sample ID
| |
− |
| |
− | Example:
| |
− | e.g. vt discover -b in.bam -o - -g ref.fa -v snps,indels -s HG0001
| |
− | e.g. bam mergeBam --in a.bam --in b.bam -o - |
| |
− | vt discover -b - -o out.sites.vcf -g ref.fa -v all -s HG0001 |
| |
− | vt left_align -i - | vt merge_duplicate_variants
| |
− | | |
− | === Genotyping ===
| |
− | | |
− | Genotypes variants for each sample.
| |
− | | |
− | Options:
| |
− | -b, --input-bam-file : Input BAM file
| |
− | -i, --input-candidate-vcf : Input Candidate VCF file
| |
− | -o, --output-vcf-file : Output VCF file
| |
− | -v, --variant-type : Variant Types, takes on any combinations
| |
− | of the values snps,mnps,indels comma
| |
− | delimited [snps,mnps,indels]
| |
− | -g, --genome-fa-file : Genome FASTA file
| |
− | -s, --sample-id : Sample ID
| |
− |
| |
− | Example:
| |
− | e.g. vt genotype -b in.bam -i candidate.sites.vcf -o - -g ref.fa -s HG0001
| |
− | | |
− | == Annotation ==
| |
− | | |
− | === Make Probes ===
| |
− | | |
− | Populates the info field with REFPROBE, ALTPROBE and PLEN tags for genotyping.
| |
− | | |
− | Options:
| |
− | -i, --input-vcf <string> : Input VCF file
| |
− | -o, --output-vcf <string> : Output VCF file [-]
| |
− | -g, --genome-fa : Genome FASTA file [/net/fantasia/home/atks/ref/genome/human.g1k.v37.fa]
| |
− | -f, --flank-length <integer> : Minimum Flank Length [20]
| |
− | | |
− | Example:
| |
− | e.g. vt make_probes -i 8904indels.dups.genotypes.vcf -o probes.sites.vcf -g ref.fa
| |
− | | |
− | === Compute Feature ===
| |
− | | |
− | Compute feature of variant.
| |
− | | |
− | vt compute_feature -i mills.vcf
| |
− | | |
− | === Compute Allele balance ===
| |
− | | |
− | Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_Based_Allele_Balance allele balance]. Outputs allele balance, allele frequency, genotype frequency.
| |
− | | |
− | vt compute_ab -i mills.vcf
| |
− | | |
− | === Compute Allele Frequency ===
| |
− | | |
− | Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Allele_Frequency allele frequency]. Outputs allele frequency and genotype frequency.
| |
− | | |
− | vt compute_af -i mills.vcf
| |
− | | |
− | === Compute Inbreeding Coefficient ===
| |
− | | |
− | Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Inbreeding_Coefficient inbreeding coefficient]. Outputs inbreeding coefficient based on genotype likelihoods.
| |
− | | |
− | vt compute_fic -i mills.vcf
| |
− | | |
− | === Compute HWE ===
| |
− | | |
− | Compute [http://genome.sph.umich.edu/wiki/Genotype_Likelihood_based_Hardy-Weinberg_Test Hardy-Weinberg equilibrium statistic]. Outputs PHRED scaled HWE Test p-values for biallelic as well as multiallelic variants.
| |
− | | |
− | vt compute_hwe -i mills.vcf
| |
− | | |
− | === Compute Mendelian Error ===
| |
− | | |
− | Compute Mendelian error statistics. Outputs allele frequency and genotype frequency.
| |
− | | |
− | vt compute_mendel -i mills.vcf
| |
− | | |
− | === Compute features ===
| |
− | | |
− | vt compute_<feature1>_<feature2>_ ... _<feature n> -i mills.vcf
| |
− | | |
− | == Modification ==
| |
− | | |
− | === Left Alignment ===
| |
− | | |
− | [http://genome.sph.umich.edu/wiki/Variant_Normalization Left aligns] indel type variants in a VCF file. This differs from normalization in that it only left aligns and left trims a variant. This affects Indels only.
| |
− | | |
− | vt left_align -i mills.vcf -o mills.leftaligned.vcf
| |
| | | |
| === Normalization === | | === Normalization === |
Line 198: |
Line 20: |
| [http://genome.sph.umich.edu/wiki/Variant_Normalization Normalize] variants in a VCF file. | | [http://genome.sph.umich.edu/wiki/Variant_Normalization Normalize] variants in a VCF file. |
| | | |
− | vt normalize mills.vcf -r seq.fa -o mills.normalized.vcf | + | vt normalize -i mills.vcf -o mills.normalized.vcf |
| | | |
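As an illustration of what normalization means for a single variant, here is a minimal Python sketch of the commonly described procedure: trim shared trailing bases, extend to the left with reference bases whenever an allele becomes empty, then trim shared leading bases. This is not vt's implementation; the function name normalize_variant and its arguments are hypothetical, the reference is assumed to be an in-memory string, and the alternate alleles are assumed to differ from the reference allele.

```python
def normalize_variant(ref_seq, pos, ref, alts):
    """Left-align and trim one variant (illustrative sketch, not vt's code).

    ref_seq : reference sequence of the contig, as a plain string (assumption)
    pos     : 1-based position of the first base of `ref` on `ref_seq`
    ref     : reference allele
    alts    : list of alternate alleles, each different from `ref`
    """
    alleles = [ref] + list(alts)
    if any(a == ref for a in alts):
        raise ValueError("alternate alleles must differ from the reference allele")

    changed = True
    while changed:
        changed = False
        # Drop the last base while every allele is non-empty and ends identically.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, prepend the preceding reference base to all.
        if any(len(a) == 0 for a in alleles):
            if pos == 1:
                raise ValueError("cannot extend left of the first reference base")
            pos -= 1
            base = ref_seq[pos - 1]
            alleles = [base + a for a in alleles]
            changed = True

    # Drop shared leading bases while every allele keeps at least one base.
    while min(len(a) for a in alleles) >= 2 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1

    return pos, alleles[0], alleles[1:]
```

For example, a CA deletion written at the right end of a CA repeat, normalize_variant("GGGCACACACAGGG", 9, "ACA", ["A"]), is shifted to the start of the repeat and returned as position 3 with alleles GCA and G.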
| === Merge duplicate variants === | | === Merge duplicate variants === |
Line 213: |
Line 35: |
| e.g. vt merge_duplicate_variants -p -i 8904indels.dups.genotypes.vcf -o out.vcf | | e.g. vt merge_duplicate_variants -p -i 8904indels.dups.genotypes.vcf -o out.vcf |
| | | |
− | == Profiling ==
| |
− |
| |
− | A standard procedure is as follows:
| |
− |
| |
− | zcat dataset.vcf.gz | vt normalize -i - | vt merge_duplicate_variants -i - > dataset.normalized.vcf
| |
− |
| |
− | cut -f1-8 dataset.normalized.vcf > dataset.sites.vcf
| |
− |
| |
− | cat dataset.sites.vcf | vt profile_snps -i - > snps.summary.log
| |
− |
| |
− |
| |
− |
| |
− | === Profile SNPs ===
| |
− |
| |
− | Profile SNPs.
| |
− |
| |
− | * ts/tv ratio
| |
− | * overlap analyses
| |
− |
| |
− | vt profile_snps -i mills.snps.sites.vcf
| |
− |
| |
− | === Profile Indels ===
| |
− |
| |
− | Profile indels.
| |
− |
| |
− | * Overlap analyses with known data sets
| |
− | * FS/NFS annotation
| |
− |
| |
− | vt profile_indels -i mills.indels.sites.vcf
| |
− |
| |
− | === Profile MNPs ===
| |
− |
| |
− | Profile MNPs.
| |
− |
| |
− | vt profile_mnps -i mills.mnps.sites.vcf
| |
− |
| |
− | === Summarize Variants ===
| |
− |
| |
− | Summarizes variants present in VCF file.
| |
− |
| |
− | vt peek -i mills.vcf
| |
− |
| |
− | == Plotting ==
| |
− |
| |
− | === Allele Frequency Spectrum ===
| |
− |
| |
− | Plots Allele Frequency Spectrum of variants found in VCF file
| |
− |
| |
− | vt plot_afs -i mills.xml
| |
− |
| |
− | === Genotype Likelihood Concordance ===
| |
− |
| |
− | Plots Genotype Likelihood Concordance graph.
| |
− |
| |
− | vt plot_gl -i mills.xml
| |
− |
| |
− | === Allele Balance Spectrum ===
| |
− |
| |
− | Plots Allele Balance graph of variants in the VCF file.
| |
− |
| |
− | vt plot_ab -i mills.xml
| |
− |
| |
− | = VCF File Manipulation =
| |
− |
| |
− | === Sort ===
| |
− |
| |
− | Sort variants according to contig lists in header.
| |
− |
| |
− | vt sort -i mills.sites.vcf
| |
− |
| |
− | === Split by variant ===
| |
− |
| |
− | Split VCF files by variant type.
| |
− |
| |
− | vt split_by_variant -i mills.sites.vcf
| |
− |
| |
− | = Resource Files =
| |
− |
| |
− | dbSNP
| |
− | OMNI 1000G
| |
− | Mills
| |
− | HAPMAP
| |
| | | |
| = Maintained by = | | = Maintained by = |
| | | |
| This page is maintained by [mailto:atks@umich.edu Adrian] | | This page is maintained by [mailto:atks@umich.edu Adrian] |