Vmatch

From Genome Analysis Wiki

vmatch is a variant matching program for MNPs, INDELs and precise SVs in VCF files.

Basic Usage Example

 vmatch <vcf-file-1> <vcf-file-2> -g <genome-file> -w <int> -d

For example, to compare two VCF files against the hg18 reference with a window size of 10 and debug logging enabled:

  vmatch 1000g.vcf got2d.vcf -g hg18.fa -w 10 -d

Command Line Options

   vcf-file-1        VCF file (can be gzipped or bgzipped)
   vcf-file-2        VCF file (can be gzipped or bgzipped)
   -g <genome-file>  memory-mapped sequence file
                     (note that if genome.fa is specified, the actual file looked for is genome-bs.umfa)
   -w <int>          window size used to detect overlaps between variants
   -d                debug option; generates a match.log file that lists all the matches made

Output

   user@host:~$ vmatch gatk.vcf samtools.vcf -w 10 -d  
   
   VCF file A  : gatk.vcf
   VCF file B  : samtools.vcf   
   Genome file : human.g1k.v37.fa
   Window Size : 10
   SRSA  : 8578
   SRSAN : 34522
   SRDA  : 2363
   SRDNA : 888
   DRDA  : 2322
   DRDNA : 439    
   #A Records : 73976
   #B Records : 71994
   Match %tage for VCF file A
   Level 1 (SRSA, SRSAN)                          : 58.2621
   Level 2 (SRSA, SRSAN, SRDA, SRDNA)             : 62.6568
   Level 3 (SRSA, SRSAN, SRDA, SRDNA, DRDA, DRDNA): 66.3891
   Match %tage for VCF file B
   Level 1 (SRSA, SRSAN)                          : 59.8661
   Level 2 (SRSA, SRSAN, SRDA, SRDNA)             : 64.3818
   Level 3 (SRSA, SRSAN, SRDA, SRDNA, DRDA, DRDNA): 68.2168
   Matched variants written to match.txt
   Match logs written to match.log
   
   #matching classifications after extension and normalization
   #SRSA  : Same REF Same ALT without normalization
   #SRSAN : Same REF Same ALT only after normalization
   #SRDA  : Same REF Different ALT
   #SRDNA : Same REF Different Number ALT
   #DRDA  : Different REF Different ALT
   #DRDNA : Different REF Different Number ALT
   user@host:~$ head match.txt 
   #id1 gives the nth row variants of file A
   #id2 gives the nth row variants of file B
   #match_type denotes the match classification per match pair
   #extended_no_bases is the number of bases extended in the extension step
   #normalized is a binary flag stating whether normalization resulted in a more 
   #parsimonious representation of the alleles
   #these fields allow one to look up matches with particular signatures of interest
   #in the match.log file, which contains all the matches made.
   id1 id2 match_type  extended_no_bases   normalized
   A4  B1  SRSAN   0   1
   A5  B2  SRSAN   0   1
   A6  B4  SRSA    0   0
   A7  B5  SRSAN   0   1
   A8  B6  SRSA    0   0
   A9  B7  SRDA    0   1
   A10 B8  SRSAN   0   1
   A11 B9  SRSAN   0   1
   A12 B10 SRSAN   0   1
   #a fragment of match.log for one matched pair
   A6802   B681435
   Original
   11      85576116        A6802   T       TT
   11      85576113        B681435 C       CT
   After Extending
   11      85576113        A6802   CTTT    CTTTT
   11      85576113        B681435 C       CT
   After Normalizing
   11      85576113        A6802   C       CT
   11      85576113        B681435 C       CT
   No of extended bases = 3
   Normalized = yes

Description

   Outputs 2 files
     match.txt : gives the matched pairs
                 1)id1
                 2)id2
                 3)match type
                 4)extended no of bases
                 5)normalized
     match.log : Details of the extension and normalization process for all compared pairs
   vmatch matches the variants in 2 VCF files by comparing every possible variant
   pair and choosing the best match. The percentage of matched variants is reported
   at 3 levels, relative to the total number of records in each VCF file.
   The 3 match levels (in order of decreasing strictness) are given as:
      Level 1) SRSA    - Same Position, same REF and ALT
      Level 1) SRSAN   - Same Position, same REF and ALT after normalization
      Level 2) SRDA    - Same Position, same REF and different ALT
      Level 2) SRDNA   - Same Position, same REF and different number of ALT
      Level 3) DRDA    - Same Position, different REF and different ALT
      Level 3) DRDNA   - Same Position, different REF and different number of ALT
 
      Level 1 represents matches in position and alleles
      Level 2 represents matches in position and reference alleles but different alternate alleles
      Level 3 represents matches only in position
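
The level percentages can be recomputed from the per-class counts. The sketch below assumes each level is cumulative (the sum of the counts of all classes up to that level, divided by the number of records in the file); the figures in the example summary in the Output section are consistent with this. The names `LEVELS` and `level_percentages` are hypothetical and are not part of vmatch.

```python
# Match counts and record totals copied from the example summary in the Output section.
counts = {
    "SRSA": 8578, "SRSAN": 34522,
    "SRDA": 2363, "SRDNA": 888,
    "DRDA": 2322, "DRDNA": 439,
}
records = {"A": 73976, "B": 71994}

# Assumption: each level includes the classes of all stricter levels.
LEVELS = {
    1: ["SRSA", "SRSAN"],
    2: ["SRSA", "SRSAN", "SRDA", "SRDNA"],
    3: ["SRSA", "SRSAN", "SRDA", "SRDNA", "DRDA", "DRDNA"],
}


def level_percentages(counts, total):
    """Return {level: percentage of `total` records matched at that level}."""
    return {
        level: 100.0 * sum(counts[c] for c in classes) / total
        for level, classes in LEVELS.items()
    }


for name, total in records.items():
    for level, pct in level_percentages(counts, total).items():
        print(f"VCF file {name}, level {level}: {pct:.4f}")
```

Because the same match counts are divided by different record totals, the two files report different percentages for the same set of matches.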

Download

For vmatch 0.5, we provide binaries for Linux machines: vmatch 0.5

You will also need a copy of the human genome assembly FASTA file, human.g1k.v37.fa. Please gunzip it before use. arf will generate a memory-mapped file named human.g1k.v37-bs.umfa from the FASTA file.

This page is maintained by Adrian.