Vmatch

From Genome Analysis Wiki

vmatch is a variant matching program for MNPs, INDELs and precise SVs in VCF files.

Basic Usage Example

 vmatch <vcf-file-1> <vcf-file-2> -g <genome-file> -w <int> -d

For example, to compare two VCF files against hg18 with a window size of 10 and debug logging enabled:

  vmatch 1000g.vcf got2d.vcf -g hg18.fa -w 10 -d

Command Line Options

   vcf-file-1     VCF file (can be gzipped or bgzipped)
   vcf-file-2     VCF file (can be gzipped or bgzipped)
   genome-file    memory-mapped sequence file, given with -g
                  (note that if genome.fa is specified, the file actually looked for is genome-bs.umfa)
   -w             window size used to detect overlaps between variants
   -d             debug option; generates a match.log file that lists all the matches made
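The file-name rule in the note on genome-file can be sketched in Python. This is only an illustration: the helper name memory_mapped_name is hypothetical, and it assumes the rule is simply "replace the trailing .fa with -bs.umfa", which is all this page states.

```python
from pathlib import Path


def memory_mapped_name(genome_file: str) -> str:
    """Return the memory-mapped file name derived from a FASTA file name.

    Assumption: a trailing ".fa" is replaced by "-bs.umfa"
    (genome.fa -> genome-bs.umfa). Other extensions are not covered here.
    """
    path = Path(genome_file)
    if path.suffix != ".fa":
        raise ValueError(f"expected a .fa file, got {genome_file!r}")
    # Keep the directory part, swap only the file name.
    return str(path.with_name(path.stem + "-bs.umfa"))


print(memory_mapped_name("genome.fa"))
print(memory_mapped_name("human.g1k.v37.fa"))
```

A check like this can be useful before launching vmatch, to confirm that the derived .umfa file exists next to the FASTA file.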

Output

   user@host:~$ vmatch gatk.vcf samtools.vcf -w 10 -d  
   
   VCF file A  : gatk.vcf
   VCF file B  : samtools.vcf   
   Genome file : human.g1k.v37.fa
   Window Size : 10
   SRSA  : 8578
   SRSAN : 34522
   SRDA  : 2363
   SRDNA : 888
   DRDA  : 2322
   DRDNA : 439    
   #A Records : 73976
   #B Records : 71994
   Match %tage for VCF file A
   Level 1 (SRSA, SRSAN)                          : 58.2621
   Level 2 (SRSA, SRSAN, SRDA, SRDNA)             : 62.6568
   Level 3 (SRSA, SRSAN, SRDA, SRDNA, DRDA, DRDNA): 66.3891
   Match %tage for VCF file B
   Level 1 (SRSA, SRSAN)                          : 59.8661
   Level 2 (SRSA, SRSAN, SRDA, SRDNA)             : 64.3818
   Level 3 (SRSA, SRSAN, SRDA, SRDNA, DRDA, DRDNA): 68.2168
   Matched variants written to match.txt
   Match logs written to match.log
   
   #matching classifications after extension and normalization
   #SRSA  : Same REF Same ALT without normalization
   #SRSAN : Same REF Same ALT only after normalization
   #SRDA  : Same REF Different ALT
   #SRDNA : Same REF Different Number ALT
   #DRDA  : Different REF Different ALT
   #DRDNA : Different REF Different Number ALT
   user@host:~$ head match.txt 
   #id1 gives the nth row variants of file A
   #id2 gives the nth row variants of file B
   #match_type denotes the match classification per match pair
   #extended_no_bases is the number of bases extended in the extension step
   #normalized is a binary flag stating whether normalization resulted in a more 
   #parsimonious representation of the alleles
   #these fields allow one to look for certain signatures in matches that one might 
   #be interested in within the match.log file that contains all the matches made.
   id1 id2 match_type  extended_no_bases   normalized
   A4  B1  SRSAN   0   1
   A5  B2  SRSAN   0   1
   A6  B4  SRSA    0   0
   A7  B5  SRSAN   0   1
   A8  B6  SRSA    0   0
   A9  B7  SRDA    0   1
   A10 B8  SRSAN   0   1
   A11 B9  SRSAN   0   1
   A12 B10 SRSAN   0   1
   #an output fragment from vmatch
   A6802   B681435
   Original
   11      85576116        A6802   T       TT
   11      85576113        B681435 C       CT
   After Extending
   11      85576113        A6802   CTTT    CTTTT
   11      85576113        B681435 C       CT
   After Normalizing
   11      85576113        A6802   C       CT
   11      85576113        B681435 C       CT
   No of extended bases = 3
   Normalized = yes
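The log fragment above can be reproduced with a short Python sketch of the two steps. This is not vmatch's implementation: the function names extend_left and normalize are hypothetical, normalization is modelled here as trimming shared trailing bases, and the reference bases "CTT" at 85576113..85576115 are inferred from the extended REF allele CTTT shown in the fragment.

```python
def extend_left(pos, ref, alt, new_pos, ref_bases):
    """Prepend reference bases so that the variant starts at new_pos.

    ref_bases must hold the reference sequence for new_pos..pos-1.
    Returns (new_pos, ref, alt, number_of_extended_bases).
    """
    n = pos - new_pos
    if n < 0 or len(ref_bases) != n:
        raise ValueError("ref_bases must cover new_pos..pos-1")
    return new_pos, ref_bases + ref, ref_bases + alt, n


def normalize(ref, alt):
    """Trim shared trailing bases while both alleles keep at least one base.

    Returns (ref, alt, changed), where changed says whether anything was trimmed.
    """
    changed = False
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
        changed = True
    return ref, alt, changed


# Variant A6802 from the fragment: 11:85576116 T -> TT.
# Assumed reference bases at 85576113..85576115: "CTT".
pos, ref, alt, extended = extend_left(85576116, "T", "TT", 85576113, "CTT")
print(pos, ref, alt, "extended bases =", extended)

ref, alt, changed = normalize(ref, alt)
print(pos, ref, alt, "normalized =", changed)
```

After both steps A6802 has the same position, REF and ALT as B681435, which is the situation the SRSAN class describes.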

Description

   vmatch outputs 2 files:
     match.txt : gives the matched pairs
                 1)id1
                 2)id2
                 3)match type
                 4)extended no of bases
                 5)normalized
     match.log : Details of the extension and normalization process for all compared pairs
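The five columns of match.txt listed above can be read with a few lines of Python. This is a minimal sketch under assumptions: columns are taken to be whitespace-separated, lines starting with "#" are treated as comments, and the header row is recognised by its first field id1; the function name read_matches is hypothetical.

```python
from collections import Counter


def read_matches(lines):
    """Parse match.txt-style lines into dicts, skipping comments and the header."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#") or fields[0] == "id1":
            continue
        id1, id2, match_type, extended, normalized = fields
        records.append({
            "id1": id1,
            "id2": id2,
            "match_type": match_type,
            "extended_no_bases": int(extended),
            "normalized": normalized == "1",
        })
    return records


# A few rows copied from the sample output on this page.
sample = """\
id1 id2 match_type  extended_no_bases   normalized
A4  B1  SRSAN   0   1
A6  B4  SRSA    0   0
A9  B7  SRDA    0   1
""".splitlines()

records = read_matches(sample)
print(Counter(r["match_type"] for r in records))
```

For a real run, the same function could be applied to the lines of an opened match.txt file.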
   vmatch matches the variants in 2 VCF files by choosing the best match among the
   possible variant pairs.  The percentage of matched variants is reported at 3
   cumulative levels, separately for each VCF file and relative to that file's
   total number of records.
   The 3 match levels (in order of decreasing strictness) are given as:
      Level 1) SRSA    - Same Position, same REF and ALT
      Level 1) SRSAN   - Same Position, same REF and ALT after normalization
      Level 2) SRDA    - Same Position, same REF and different ALT
      Level 2) SRDNA   - Same Position, same REF and different number of ALT
      Level 3) DRDA    - Same Position, different REF and different ALT
      Level 3) DRDNA   - Same Position, different REF and different number of ALT
 
      Level 1 represents matches in position and alleles
      Level 2 represents matches in position and reference alleles but different alternate alleles
      Level 3 represents matches only in position
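The level percentages can be recomputed from the class counts. The sketch below assumes each level's percentage is the cumulative count of its match classes divided by the number of records in the file; this reading is inferred from the sample output on this page rather than from the vmatch source. The names LEVELS and match_percentages are hypothetical.

```python
# Match classes included at each (cumulative) level.
LEVELS = {
    1: ("SRSA", "SRSAN"),
    2: ("SRSA", "SRSAN", "SRDA", "SRDNA"),
    3: ("SRSA", "SRSAN", "SRDA", "SRDNA", "DRDA", "DRDNA"),
}


def match_percentages(counts, n_records):
    """Return {level: percentage of records matched at that level}."""
    return {
        level: 100.0 * sum(counts[t] for t in types) / n_records
        for level, types in LEVELS.items()
    }


# Counts and record totals taken from the sample output on this page.
counts = {"SRSA": 8578, "SRSAN": 34522, "SRDA": 2363,
          "SRDNA": 888, "DRDA": 2322, "DRDNA": 439}

for name, n_records in (("A", 73976), ("B", 71994)):
    for level, pct in match_percentages(counts, n_records).items():
        print(f"VCF file {name} level {level}: {pct:.4f}")
```

Because the levels are cumulative, the percentage can only stay the same or grow from Level 1 to Level 3.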

Download

For vmatch 0.5, we provide binaries for Linux machines: vmatch 0.5

You will also need a copy of the human genome assembly FASTA file, human.g1k.v37.fa. Please gunzip it before use. arf will generate a memory-mapped file named human.g1k.v37-bs.umfa from the FASTA file.

This page is maintained by Adrian.