Vmatch
From Genome Analysis Wiki
vmatch is a variant matching program for MNPs, INDELs and precise SVs in VCF files.
Basic Usage Example
vmatch <vcf-file-1> <vcf-file-2> -g <genome-file> -w <int> -d
Here is an example of how vmatch works:
vmatch 1000g.vcf got2d.vcf -g hg18.fa -w 10 -d
Command Line Options
vcf-file-1   VCF file (can be gzipped or bgzipped)
vcf-file-2   VCF file (can be gzipped or bgzipped)
genome-file  Memory Mapped Sequence file (note that if genome.fa is specified, the actual file looked for is genome-bs.umfa)
w            window size to detect overlaps between variants
d            debug option to generate a match.log file that gives all the matches made
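The file-name rule for the genome file can be illustrated with a short Python sketch. It only encodes the rule stated above (genome.fa is looked up as genome-bs.umfa); the helper name umfa_path is hypothetical and not part of vmatch, and the assumption is that the rule amounts to "drop the final extension, then append -bs.umfa".

```python
from pathlib import Path


def umfa_path(genome_file: str) -> str:
    """Return the memory-mapped file name expected for a FASTA path.

    Assumption: the final extension is dropped and "-bs.umfa" is appended,
    as in genome.fa -> genome-bs.umfa.
    """
    p = Path(genome_file)
    return str(p.with_name(p.stem + "-bs.umfa"))


print(umfa_path("genome.fa"))
print(umfa_path("human.g1k.v37.fa"))
```

A check like this can be useful before a run to confirm that the .umfa file sits next to the FASTA file.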
Output
user@host:~$ vmatch gatk.vcf samtools.vcf -w 10 -d
VCF file A   : gatk.vcf
VCF file B   : samtools.vcf
Genome file  : human.g1k.v37.fa
Window Size  : 10

SRSA   : 8578
SRSAN  : 34522
SRDA   : 2363
SRDNA  : 888
DRDA   : 2322
DRDNA  : 439

#A Records : 73976
#B Records : 71994

Match %tage for VCF file A
Level 1 (SRSA, SRSAN)                          : 58.2621
Level 2 (SRSA, SRSAN, SRDA, SRDNA)             : 62.6568
Level 3 (SRSA, SRSAN, SRDA, SRDNA, DRDA, DRDNA): 66.3891

Match %tage for VCF file B
Level 1 (SRSA, SRSAN)                          : 59.8661
Level 2 (SRSA, SRSAN, SRDA, SRDNA)             : 64.3818
Level 3 (SRSA, SRSAN, SRDA, SRDNA, DRDA, DRDNA): 68.2168

Matched variants written to match.txt
Match logs written to match.log

#matching classifications after extension and normalizaiton
#SRSA  : Same REF Same ALT without normalization
#SRSAN : Same REF Same ALT only after normalization
#SRDA  : Same REF Different ALT
#SRDNA : Same REF Different Number ALT
#DRDA  : Different REF Different ALT
#DRDNA : Different REF Different Number ALT
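The match percentages in this output appear to be the cumulative match counts divided by the record count of each file. The Python sketch below recomputes them from the counts shown above as a sanity check; the variable and function names are illustrative, not part of vmatch.

```python
# Counts copied from the example output above.
counts = {"SRSA": 8578, "SRSAN": 34522, "SRDA": 2363,
          "SRDNA": 888, "DRDA": 2322, "DRDNA": 439}
records = {"A": 73976, "B": 71994}

# Each level adds two more match types to the previous level.
levels = {
    1: ["SRSA", "SRSAN"],
    2: ["SRSA", "SRSAN", "SRDA", "SRDNA"],
    3: ["SRSA", "SRSAN", "SRDA", "SRDNA", "DRDA", "DRDNA"],
}


def match_percentage(level: int, total_records: int) -> float:
    """Percentage of records matched at the given level or stricter."""
    matched = sum(counts[t] for t in levels[level])
    return 100.0 * matched / total_records


for name, total in records.items():
    for level in levels:
        print(f"file {name} level {level}: {match_percentage(level, total):.4f}")
```

For file A at Level 1, for example, this is 100 * (8578 + 34522) / 73976, which agrees with the 58.2621 reported by vmatch.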
user@host:~$ head match.txt
#id1 gives the nth row variants of file A
#id2 gives the nth row variants of file B
#match_type denotes the match classification per match pair
#extended_no_of_bases is the number of bases extended in the extension step
#normalized is a binary stating whether normalization resulted in a more
#parsimonious representation of the alleles
#these fields allows one to look for certain signatures in matches that one might
#be interested in within the match.log file that contains all the matches made.
id1   id2   match_type   extended_no_bases   normalized
A4    B1    SRSAN        0                   1
A5    B2    SRSAN        0                   1
A6    B4    SRSA         0                   0
A7    B5    SRSAN        0                   1
A8    B6    SRSA         0                   0
A9    B7    SRDA         0                   1
A10   B8    SRSAN        0                   1
A11   B9    SRSAN        0                   1
A12   B10   SRSAN        0                   1
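A file in this layout is straightforward to load for further filtering. The following Python sketch assumes that match.txt is whitespace-delimited, that comment lines start with '#', and that the header row starts with id1, as in the listing above; the Match type and parse_matches function are hypothetical helpers, not part of vmatch.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Match(NamedTuple):
    id1: str
    id2: str
    match_type: str
    extended_no_bases: int
    normalized: bool


def parse_matches(lines: Iterable[str]) -> list[Match]:
    """Parse match.txt-style lines into Match records."""
    matches = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, '#' comment lines and the column header row.
        if not fields or fields[0].startswith("#") or fields[0] == "id1":
            continue
        id1, id2, match_type, extended, normalized = fields
        matches.append(Match(id1, id2, match_type, int(extended), normalized == "1"))
    return matches


# A few rows taken from the listing above.
sample = """\
#id1 gives the nth row variants of file A
id1 id2 match_type extended_no_bases normalized
A4 B1 SRSAN 0 1
A6 B4 SRSA 0 0
A9 B7 SRDA 0 1
"""
matches = parse_matches(sample.splitlines())
print(Counter(m.match_type for m in matches))
```

To read a real file, pass an open file handle instead of the sample lines, for example parse_matches(open("match.txt")).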
#an output fragment from vmatch
A6802 B681435
Original
11 85576116 A6802   T    TT
11 85576113 B681435 C    CT
After Extending
11 85576113 A6802   CTTT CTTTT
11 85576113 B681435 C    CT
After Normalizing
11 85576113 A6802   C    CT
11 85576113 B681435 C    CT
No of extended bases = 3
Normalized = yes
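The extension and normalization steps in this log fragment can be reproduced with a small Python sketch. This is not vmatch's actual implementation: it uses the common approach of trimming shared trailing and leading bases, and the function names are hypothetical. The left flank "CTT" is inferred from the log itself (the three bases added to A6802 when it is extended from 85576116 to 85576113).

```python
def extend_left(pos: int, ref: str, alt: str, left_flank: str) -> tuple[int, str, str]:
    """Prepend reference bases that lie immediately before pos to both alleles."""
    return pos - len(left_flank), left_flank + ref, left_flank + alt


def normalize(pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Trim shared bases so the alleles are as short as possible."""
    # Trim shared trailing bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases, shifting the position to the right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Variant A6802 from the log: T -> TT at 85576116, extended by 3 bases.
extended = extend_left(85576116, "T", "TT", "CTT")
print(extended)
print(normalize(*extended))
# Variant B681435 is already in its shortest form.
print(normalize(85576113, "C", "CT"))
```

After both steps the two records have the same position, REF and ALT, which is why the pair can be reported as a match even though the original records differ.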
Description
vmatch outputs 2 files:
match.txt : gives the matched pairs, with the columns 1) id1 2) id2 3) match type 4) extended no of bases 5) normalized
match.log : details of the extension and normalization process for all compared pairs
vmatch matches the variants in 2 VCF files by choosing the best match for every possible variant pair. The percentage of matches is reported at 3 levels, relative to the total number of records in each VCF file.
The 3 match levels (in order of decreasing strictness) are given as:
Level 1) SRSA  - Same Position, same REF and ALT
Level 1) SRSAN - Same Position, same REF and ALT after normalization
Level 2) SRDA  - Same Position, same REF and different ALT
Level 2) SRDNA - Same Position, same REF and different number of ALT
Level 3) DRDA  - Same Position, different REF and different ALT
Level 3) DRDNA - Same Position, different REF and different number of ALT
Level 1 represents matches in position and alleles
Level 2 represents matches in position and reference alleles but different alternate alleles
Level 3 represents matches only in position
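One possible reading of these definitions is sketched below in Python. It is an interpretation, not vmatch's code: it assumes both records already share a position after extension, that "different number of ALT" means the two records list different numbers of ALT alleles, and that the caller knows whether REF and ALT were already identical before normalization. The names classify and LEVEL are hypothetical.

```python
# Match level for each classification, as listed above.
LEVEL = {"SRSA": 1, "SRSAN": 1, "SRDA": 2, "SRDNA": 2, "DRDA": 3, "DRDNA": 3}


def classify(ref_a: str, alts_a: list[str], ref_b: str, alts_b: list[str],
             same_before_normalization: bool) -> str:
    """Classify a pair of same-position records (alleles given after normalization)."""
    same_ref = ref_a == ref_b
    if same_ref and sorted(alts_a) == sorted(alts_b):
        # Identical alleles: distinguish whether normalization was needed.
        return "SRSA" if same_before_normalization else "SRSAN"
    same_count = len(alts_a) == len(alts_b)
    if same_ref:
        return "SRDA" if same_count else "SRDNA"
    return "DRDA" if same_count else "DRDNA"


pair = classify("C", ["CT"], "C", ["CT"], same_before_normalization=False)
print(pair, LEVEL[pair])
```

Under this reading, a pair such as A6802 and B681435, which only agree after normalization, would be classified as SRSAN and counted at Level 1.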
Download
For vmatch 0.5, we provide binaries for Linux machines: vmatch 0.5
You will also need a copy of the human genome assembly FASTA file: human.g1k.v37.fa. Please gunzip it before use. arf will generate a memory-mapped file named human.g1k.v37-bs.umfa from the FASTA file.
This page is maintained by Adrian.