VcfCodingSnps
vcfCodingSnps[1] is a SNP annotation tool that annotates coding variants in a VCF format input file. It takes a VCF as input and generates an annotated VCF file as output.
Basic Usage Example
Here is an example of how vcfCodingSnps
works:
vcfCodingSnps -s chrom22-CHB.vcf -g genelist.txt -o annotated-chrom22-CHB.vcf
Command Line Options
-s SNP file Specifies the name of the input VCF-format SNP file -g genefile Specifies the name of the input gene file, by default use gene list file in ASCII format generated by UCSC genome browser -o output file Specifies the name of the output VCF-format SNP file
Input File Infomation
1. Example headlines of input VCF-format SNP file:
##format=VCFv3.2 ##NA12891=../depthFilter/filtered.NA12891.chrom22.SLX.maq.SRP000032.2009_07.glf ##NA12892=../depthFilter/filtered.NA12892.chrom22.SLX.maq.SRP000032.2009_07.glf ##NA12878=../merged/NA12878.chrom22.merged.glf ##minTotalDepth=0 ##maxTotalDepth=1000 ##minMapQuality=30 ##minPosterior=0.9990 ##program=glfTrio ##versionDate=Tue Dec 1 00:42:24 2009 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878 22 14439753 . a t 100 mapQ=0 depth=68;duples=homs;mac=2 GT:GQ:DP 1|1:100:40 0|0:81:28 1|0:84:0 22 14441250 . t c 59 mapQ=0 depth=40 GT:GQ:DP 1|1:56:25 1|1:31:15 1|1:32:0 22 14443154 . t g 45 mapQ=9 depth=92;duples=homs;mac=2 GT:GQ:DP 1|1:49:21 0|0:60:20 1|0:100:51 ... ...
2. Input gene file should be a plain text file generated by ucsc genome browser. A sample pathway of generating an input gene file is
Go to http://genome.ucsc.edu/ ►► Click "table" ►► Specify the fields required (clade: mammal, genome:human etc.) ►► get output gene file 1. A detailed instruction on using the table browser could be found at genome.ucsc.edu/cgi-bin/hgTables. 2. One can specify the region to be the whole genome or any particular gene position (e.g. chr21:33031597-33041570).
Here is an example of input gene file headlines:
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID uc001aaa.3 chr1 + 11873 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, uc001aaa.3 uc010nxq.1 chr1 + 11873 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1 uc010nxr.1 chr1 + 11873 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, uc010nxr.1 uc009vis.2 chr1 - 14362 16765 14362 14362 4 14362,14969,15795,16606, 14829,15038,15942,16765, uc009vis.2 uc009vjc.1 chr1 - 16857 17751 16857 16857 2 16857,17232, 17055,17751, uc009vjc.1 uc009vjd.2 chr1 - 15795 18061 15795 15795 5 15795,16606,16857,17232,17605, 15947,16765,17055,17368,18061, uc009vjd.2
Output File
Some possible annotating results for a single SNP with the meanings of their output format are listed below:
5'UTR=A26C2[-] means the SNP is in the 5'UTR region of gene A26C2 with a minus strand. INTRONIC=POTEG[-] means the SNP is in the intronic region of gene POTEG with a minus strand. SYNONYMOUS_CODING=BARD1(uc002veu.2):His506His[-] means the SNP is synonymous coding at the 506th codon in gene BARD1 with a minus strand and it keeps amino-acid His unchanged. NON_SYNONYMOUS_CODING=BARD1(uc002veu.2):Arg658Cys[-] means the SNP is non_synonymous coding at the 658th codon in gene BARD1 (ucsc gene name uc002veu.2)with a minus strand and it changes amino-acid from Arg to Cys. SPLICE_SITE=FARP2(uc002wbi.1)[+] means the SNP is in the SPLICE_SITE (5 bp within exon start or end positions in the coding region) of gene FARP2 (ucsc gene name uc002wbi.1) with a plus strand. STOP_GAINED=C2orf83(uc002vph.1):Trp141stop[-] means the SNP is the 141th codon in gene MAPK12 (ucsc gene name uc002vph.1) with a minus strand and it changes amino-acid Trp to a stop codon. STOP_LOST=OR2M3(uc001ieb.1):stop313Arg[+] means the SNP is the 313th codon in gene OR2M3 (ucsc gene name uc001ieb.1) with a plus strand and it changes a stop codon to amino-acid Arg.
The annotating result will be added to the entry "INFO" of the input VCF SNP file and outputted together with other information. If a SNP is annotated differently with respect to different genes (or different isoforms of the same gene), all the annotated results will be added into the entry "INFO". If the SNP is NOT in any gene coding region, then the original "INFO" will be outputted. Here is an example of input and output VCF file headlines:
Input VCF headlines:
##format=VCFv3.2 ##NA12891=../GLF/NA12891.chrom8.SLX.SRP000032.2009_07.glf ##NA12892=../GLF/NA12892.chrom8.SLX.SRP000032.2009_07.glf ##NA12878=../merged/NA12878.chrom8.merged.glf ##minTotalDepth=0 ##maxTotalDepth=1000 ##minMapQuality=40 ##minPosterior=0.9990 ##program=glfTrio ##versionDate=Thu Aug 27 18:23:18 2009 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878 8 146284 . c a 54 . depth=29;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 1/0:31:12 1/0:32:3 0/0:28:14 8 146703 . c t 92 . depth=41;mac=1;tdt=0/1 GT:GQ:GD 1/1:42:14 0/1:54:9 1/1:24:18 8 151532 . t c 100 . depth=131 GT:GQ:GD 0/0:8:37 1/0:100:26 1/0:100:68 8 151573 . g t 72 . depth=113;mac=1;tdt=1/1 GT:GQ:GD 0/1:48:35 0/0:39:26 0/1:100:52 8 151638 . a c 100 . depth=124;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 0/1:100:55 0/1:100:58 0/1:87:11 8 151651 . c g 100 . depth=124;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 0/1:87:56 0/1:100:56 0/1:24:12 8 151763 . t a 100 . depth=127;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 1/0:100:49 1/0:100:54 1/0:100:24 8 151936 . a g 32 . depth=105;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 0/1:42:44 0/1:23:47 0/0:39:14 8 152578 . c t 87 . depth=108 GT:GQ:GD 1/1:95:31 1/1:89:30 1/1:100:47
Output VCF headlines:
##format=VCFv3.2 ##NA12891=../GLF/NA12891.chrom8.SLX.SRP000032.2009_07.glf ##NA12892=../GLF/NA12892.chrom8.SLX.SRP000032.2009_07.glf ##NA12878=../merged/NA12878.chrom8.merged.glf ##minTotalDepth=0 ##maxTotalDepth=1000 ##minMapQuality=40 ##minPosterior=0.9990 ##program=glfTrio ##versionDate=Thu Aug 27 18:23:18 2009 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878 8 146284 . c a 54 . depth=29;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 1/0:31:12 1/0:32:3 0/0:28:14 8 146703 . c t 92 . depth=41;mac=1;tdt=0/1 GT:GQ:GD 1/1:42:14 0/1:54:9 1/1:24:18 8 151532 . t c 100 . depth=131;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/0:8:37 1/0:100:26 1/0:100:68 8 151573 . g t 72 . depth=113;mac=1;tdt=1/1;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:48:35 0/0:39:26 0/1:100:52 8 151638 . a c 100 . depth=124;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:100:55 0/1:100:58 0/1:87:11 8 151651 . c g 100 . depth=124;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:87:56 0/1:100:56 0/1:24:12 8 151763 . t a 100 . depth=127;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 1/0:100:49 1/0:100:54 1/0:100:24 8 151936 . a g 32 . depth=105;duples=hets;mac=2;tdt=0/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:42:44 0/1:23:47 0/0:39:14 8 152578 . c t 87 . depth=108;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 1/1:95:31 1/1:89:30 1/1:100:47