Difference between revisions of "VcfCodingSnps"
m (moved AnnoVcf to VcfCodingSnps: Better name) |
|||
(72 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
− | ''' | + | '''vcfCodingSnps'''[http://csg.sph.umich.edu//liyanmin/vcfCodingSnps/index.shtml] is a SNP annotation tool that annotates coding variants in a [[VCF]] format input file. It takes a VCF as input and generates an annotated VCF file as output. The tool is currently under development by Yanming Li, a doctoral student at the University of Michigan Center for Statistical Genetics. For any issues with the program, please contact [mailto:liyanmin@umich.edu Yanming]. A detailed tutorial and download page can be found at [http://csg.sph.umich.edu//liyanmin/vcfCodingSnps/index.shtml] |
+ | |||
+ | == Basic Usage Example == | ||
+ | |||
+ | Here is an example of how <code>vcfCodingSnps</code> works: | ||
+ | |||
+ | vcfCodingSnps -s chrom22-CHB.vcf -g genelist.txt -o annotated-chrom22-CHB.vcf | ||
+ | |||
+ | == Command Line Options == | ||
+ | |||
+ | -s SNP file Specifies the name of input VCF-format SNP file | ||
+ | -g genefile Specifies the name of input gene file, by default use a NCBI36 version gene list file (genelist.txt) in UCSC known gene format generated by UCSC genome browser | ||
+ | -o output file Specifies the name of output VCF-format SNP file, by default will be named vcfCodingSNP.out.vcf | ||
+ | -r reference genome Specifies the name of reference genome file, by default will use NCBI build 36 reference genome | ||
+ | -l log file Specifies the name of log file, log file gives more detailed information for each annotated SNP, by default will be named vcfCodingSNP.log | ||
+ | --n1 parameter user defined number of bps into intron for splice site, by default will be set to 8 | ||
+ | --n2 parameter user defined number of bps into extron for splice site, by default will be set to 3 | ||
+ | --ns parameter user defined number of kbps for the range of upstream or downstream of a gene, by default will be set t0 5 | ||
+ | |||
+ | == Library Compiling Guideline == | ||
+ | |||
+ | To Compile the source code, please first re-compile the .c functions in the library folder on your local machine: | ||
+ | 1. Get into folder "libcsg". Type syntax "gcc -c -O2 *.cpp -D_FILE_OFFSET_BITS=32" to re-compile the c files in the library if use a 32-bit local machine | ||
+ | (Type "gcc -c -O2 *.cpp -D_FILE_OFFSET_BITS=64" to re-compile the .c files in the library if use a 64-bit local machine) | ||
+ | 2. In the same folder, type "ar -rc libcsg.a *.o" | ||
+ | 3. Go to the root folder, type "make clean" and "make". (Don't need to change anything in Makefile in this step) | ||
+ | |||
+ | == Input File Information == | ||
+ | |||
+ | 1. Example headlines of input VCF-format SNP file: | ||
+ | |||
+ | ##format=VCFv3.2 | ||
+ | ##NA12891=../depthFilter/filtered.NA12891.chrom22.SLX.maq.SRP000032.2009_07.glf | ||
+ | ##NA12892=../depthFilter/filtered.NA12892.chrom22.SLX.maq.SRP000032.2009_07.glf | ||
+ | ##NA12878=../merged/NA12878.chrom22.merged.glf | ||
+ | ##minTotalDepth=0 | ||
+ | ##maxTotalDepth=1000 | ||
+ | ##minMapQuality=30 | ||
+ | ##minPosterior=0.9990 | ||
+ | ##program=glfTrio | ||
+ | ##versionDate=Tue Dec 1 00:42:24 2009 | ||
+ | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878 | ||
+ | 22 14439753 . a t 100 mapQ=0 depth=68;duples=homs;mac=2 GT:GQ:DP 1|1:100:40 0|0:81:28 1|0:84:0 | ||
+ | 22 14441250 . t c 59 mapQ=0 depth=40 GT:GQ:DP 1|1:56:25 1|1:31:15 1|1:32:0 | ||
+ | 22 14443154 . t g 45 mapQ=9 depth=92;duples=homs;mac=2 GT:GQ:DP 1|1:49:21 0|0:60:20 1|0:100:51 | ||
+ | ... ... | ||
+ | |||
+ | 2. The gene list and the reference genome that user provided can be of various gene tracks and assemblies. The latest version takes gene list tracks such as UCSC known genes, RefSeq genes, Genecode genes, CCDS genes and Emsembl genes, and the assembly of the gene list and the reference genome can be of either hg16, hg17, hg18 or hg19. One can explore UCSC genome browser for a better understanding of different tracks and assemblies. By default vcfColdingSnps uses a hg18 UCSC known gene list and the hg18 reference genome. It also provides versions of other tracks and assemblies at the user's conveinience so that they don't need to download those themselves. Input gene file should be a plain text file generated by [http://genome.ucsc.edu/ ucsc genome browser]. A sample pathway of generating an input gene file is | ||
+ | |||
+ | Go to http://genome.ucsc.edu/ ►► Click "table" ►► Specify the fields required (clade: mammal, genome:human etc.) ►► In "track" filed, select "UCSC gene" ►► get output gene file | ||
+ | |||
+ | 1. Gene file used should be of [http://genome.ucsc.edu/FAQ/FAQformat#format9 GenePred table format]. The following 11 tab delimited fields are required and must be of the same order as shown below: | ||
+ | string name; "Name of gene" | ||
+ | string chrom; "Chromosome name" | ||
+ | char[1] strand; "+ or - for strand" | ||
+ | uint txStart; "Transcription start position" | ||
+ | uint txEnd; "Transcription end position" | ||
+ | uint cdsStart; "Coding region start" | ||
+ | uint cdsEnd; "Coding region end" | ||
+ | uint exonCount; "Number of exons" | ||
+ | uint[exonCount] exonStarts; "Exon start positions" | ||
+ | uint[exonCount] exonEnds; "Exon end positions" | ||
+ | string symbol; "Standard gene symbol" | ||
+ | |||
+ | Note: the 11th field is a mandatory field for running vcfCodingSnps. In the genelists provided with the package, this field gives the standard gene symbols such as "APOE", "LDL-R" etc. | ||
+ | If a genelist downloaded by you own that does not contain such a field, you can simply make the 11th field equal to the first field which is the gene name in a specific track by a syntax like | ||
+ | |||
+ | awk `{FS="\t"; print $0"\t"$1 }` yourGenelist > yourNewGenelist | ||
+ | |||
+ | 2. If gene file assumes an [http://genome.ucsc.edu/FAQ/FAQformat#format9 extended GenePred format], there will be an exctra "exonframe" field. Please refer to [https://lists.soe.ucsc.edu/pipermail/genome/2006-November/012218.html here] for the definition of "exonframe". For some genes, due to translational frame shifts or other | ||
+ | reasons, the exonframe might not match what one would compute using mod 3 in counting codons. In such cases, the program will report a warning massage that "number of base pairs between code start and code end is | ||
+ | not a multiple of three". While we will use the usual mod 3 method for counting codons. | ||
+ | 3. A detailed instruction on using the table browser could be found at [http://genome.ucsc.edu/cgi-bin/hgTables?command=start#Help genome.ucsc.edu/cgi-bin/hgTables]. | ||
+ | 4. One can specify the region to be the whole genome or any particular gene position (e.g. chr21:33031597-33041570). | ||
+ | |||
+ | Here is an example of input gene file headlines: | ||
+ | |||
+ | #name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID | ||
+ | uc001aaa.3 chr1 + 11873 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, uc001aaa.3 | ||
+ | uc010nxq.1 chr1 + 11873 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1 | ||
+ | uc010nxr.1 chr1 + 11873 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, uc010nxr.1 | ||
+ | uc009vis.2 chr1 - 14362 16765 14362 14362 4 14362,14969,15795,16606, 14829,15038,15942,16765, uc009vis.2 | ||
+ | uc009vjc.1 chr1 - 16857 17751 16857 16857 2 16857,17232, 17055,17751, uc009vjc.1 | ||
+ | uc009vjd.2 chr1 - 15795 18061 15795 15795 5 15795,16606,16857,17232,17605, 15947,16765,17055,17368,18061, uc009vjd.2 | ||
+ | |||
+ | == Output File == | ||
+ | |||
+ | Some possible annotating results for a single SNP with the meanings of their output format are listed below: | ||
+ | |||
+ | 5'UTR=A26C2[-] means the SNP is in the 5'UTR region of gene A26C2 with a minus strand. | ||
+ | INTRONIC=POTEG[-] means the SNP is in the intronic region of gene POTEG with a minus strand. | ||
+ | SYNONYMOUS_CODING=BARD1(uc002veu.2):His506His[-] means the SNP is synonymous coding at the 506th codon in gene BARD1 with a minus strand and it keeps amino-acid His unchanged. | ||
+ | NON_SYNONYMOUS_CODING=BARD1(uc002veu.2):Arg658Cys[-] means the SNP is non_synonymous coding at the 658th codon in gene BARD1 (ucsc gene name uc002veu.2)with a minus strand and it changes amino-acid from Arg to Cys. | ||
+ | SPLICE_SITE=FARP2(uc002wbi.1)[+] means the SNP is in the SPLICE_SITE (5 bp within exon start or end positions in the coding region) of gene FARP2 (ucsc gene name uc002wbi.1) with a plus strand. | ||
+ | STOP_GAINED=C2orf83(uc002vph.1):Trp141stop[-] means the SNP is the 141th codon in gene MAPK12 (ucsc gene name uc002vph.1) with a minus strand and it changes amino-acid Trp to a stop codon. | ||
+ | STOP_LOST=OR2M3(uc001ieb.1):stop313Arg[+] means the SNP is the 313th codon in gene OR2M3 (ucsc gene name uc001ieb.1) with a plus strand and it changes a stop codon to amino-acid Arg. | ||
+ | |||
+ | The annotating result will be added to the entry "INFO" of the input VCF SNP file and outputted together with other information. If a SNP is annotated differently with respect to different genes (or different isoforms of the same gene), all the annotated results will be added into the entry "INFO". If the SNP is NOT in any gene coding region, then the original "INFO" will be outputted. Here is an example of input and output VCF file headlines: | ||
+ | |||
+ | Input VCF headlines: | ||
+ | |||
+ | ##format=VCFv3.2 | ||
+ | ##NA12891=../GLF/NA12891.chrom8.SLX.SRP000032.2009_07.glf | ||
+ | ##NA12892=../GLF/NA12892.chrom8.SLX.SRP000032.2009_07.glf | ||
+ | ##NA12878=../merged/NA12878.chrom8.merged.glf | ||
+ | ##minTotalDepth=0 | ||
+ | ##maxTotalDepth=1000 | ||
+ | ##minMapQuality=40 | ||
+ | ##minPosterior=0.9990 | ||
+ | ##program=glfTrio | ||
+ | ##versionDate=Thu Aug 27 18:23:18 2009 | ||
+ | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878 | ||
+ | 8 146284 . c a 54 . depth=29;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 1/0:31:12 1/0:32:3 0/0:28:14 | ||
+ | 8 146703 . c t 92 . depth=41;mac=1;tdt=0/1 GT:GQ:GD 1/1:42:14 0/1:54:9 1/1:24:18 | ||
+ | 8 151532 . t c 100 . depth=131 GT:GQ:GD 0/0:8:37 1/0:100:26 1/0:100:68 | ||
+ | 8 151573 . g t 72 . depth=113;mac=1;tdt=1/1 GT:GQ:GD 0/1:48:35 0/0:39:26 0/1:100:52 | ||
+ | 8 151638 . a c 100 . depth=124;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 0/1:100:55 0/1:100:58 0/1:87:11 | ||
+ | 8 151651 . c g 100 . depth=124;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 0/1:87:56 0/1:100:56 0/1:24:12 | ||
+ | 8 151763 . t a 100 . depth=127;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 1/0:100:49 1/0:100:54 1/0:100:24 | ||
+ | 8 151936 . a g 32 . depth=105;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 0/1:42:44 0/1:23:47 0/0:39:14 | ||
+ | 8 152578 . c t 87 . depth=108 GT:GQ:GD 1/1:95:31 1/1:89:30 1/1:100:47 | ||
+ | |||
+ | Output VCF headlines: | ||
+ | |||
+ | ##format=VCFv3.2 | ||
+ | ##NA12891=../GLF/NA12891.chrom8.SLX.SRP000032.2009_07.glf | ||
+ | ##NA12892=../GLF/NA12892.chrom8.SLX.SRP000032.2009_07.glf | ||
+ | ##NA12878=../merged/NA12878.chrom8.merged.glf | ||
+ | ##minTotalDepth=0 | ||
+ | ##maxTotalDepth=1000 | ||
+ | ##minMapQuality=40 | ||
+ | ##minPosterior=0.9990 | ||
+ | ##program=glfTrio | ||
+ | ##versionDate=Thu Aug 27 18:23:18 2009 | ||
+ | #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878 | ||
+ | 8 146284 . c a 54 . depth=29;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 1/0:31:12 1/0:32:3 0/0:28:14 | ||
+ | 8 146703 . c t 92 . depth=41;mac=1;tdt=0/1 GT:GQ:GD 1/1:42:14 0/1:54:9 1/1:24:18 | ||
+ | 8 151532 . t c 100 . depth=131;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/0:8:37 1/0:100:26 1/0:100:68 | ||
+ | 8 151573 . g t 72 . depth=113;mac=1;tdt=1/1;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:48:35 0/0:39:26 0/1:100:52 | ||
+ | 8 151638 . a c 100 . depth=124;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:100:55 0/1:100:58 0/1:87:11 | ||
+ | 8 151651 . c g 100 . depth=124;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:87:56 0/1:100:56 0/1:24:12 | ||
+ | 8 151763 . t a 100 . depth=127;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 1/0:100:49 1/0:100:54 1/0:100:24 | ||
+ | 8 151936 . a g 32 . depth=105;duples=hets;mac=2;tdt=0/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:42:44 0/1:23:47 0/0:39:14 | ||
+ | 8 152578 . c t 87 . depth=108;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 1/1:95:31 1/1:89:30 1/1:100:47 | ||
+ | |||
+ | Output log file headlines: | ||
+ | |||
+ | ##chr pos ref alt ucsc_name genestrend genestart geneend ref_codon ref_AA alt_codon alt_AA codon_start codon_end genesymbol codonCount type | ||
+ | chr2 214811129 T c uc010fuz.1 + 213857360 214814327 CTA Leu CCA Pro 214811128 214811130 SPAG16 433 NON_SYNONYMOUS_CODING | ||
+ | chr2 214811129 T c uc002veq.1 + 213857360 214983470 . . . . . . SPAG16 . INTRONIC | ||
+ | chr2 214811129 T c uc002ver.1 + 213857360 214983470 . . . . . . SPAG16 . INTRONIC | ||
+ | chr2 214811174 T a uc010fuz.1 + 213857360 214814327 . . . . . . SPAG16 . 3'UTR | ||
+ | chr2 214811174 T a uc002veq.1 + 213857360 214983470 . . . . . . SPAG16 . INTRONIC |
Latest revision as of 11:58, 2 February 2017
vcfCodingSnps[1] is a SNP annotation tool that annotates coding variants in a VCF format input file. It takes a VCF as input and generates an annotated VCF file as output. The tool is currently under development by Yanming Li, a doctoral student at the University of Michigan Center for Statistical Genetics. For any issues with the program, please contact Yanming. A detailed tutorial and download page can be found at [2]
Basic Usage Example
Here is an example of how vcfCodingSnps
works:
vcfCodingSnps -s chrom22-CHB.vcf -g genelist.txt -o annotated-chrom22-CHB.vcf
Command Line Options
-s SNP file Specifies the name of input VCF-format SNP file -g genefile Specifies the name of input gene file, by default use a NCBI36 version gene list file (genelist.txt) in UCSC known gene format generated by UCSC genome browser -o output file Specifies the name of output VCF-format SNP file, by default will be named vcfCodingSNP.out.vcf -r reference genome Specifies the name of reference genome file, by default will use NCBI build 36 reference genome -l log file Specifies the name of log file, log file gives more detailed information for each annotated SNP, by default will be named vcfCodingSNP.log --n1 parameter user defined number of bps into intron for splice site, by default will be set to 8 --n2 parameter user defined number of bps into extron for splice site, by default will be set to 3 --ns parameter user defined number of kbps for the range of upstream or downstream of a gene, by default will be set t0 5
Library Compiling Guideline
To Compile the source code, please first re-compile the .c functions in the library folder on your local machine: 1. Get into folder "libcsg". Type syntax "gcc -c -O2 *.cpp -D_FILE_OFFSET_BITS=32" to re-compile the c files in the library if use a 32-bit local machine (Type "gcc -c -O2 *.cpp -D_FILE_OFFSET_BITS=64" to re-compile the .c files in the library if use a 64-bit local machine) 2. In the same folder, type "ar -rc libcsg.a *.o" 3. Go to the root folder, type "make clean" and "make". (Don't need to change anything in Makefile in this step)
Input File Information
1. Example headlines of input VCF-format SNP file:
##format=VCFv3.2 ##NA12891=../depthFilter/filtered.NA12891.chrom22.SLX.maq.SRP000032.2009_07.glf ##NA12892=../depthFilter/filtered.NA12892.chrom22.SLX.maq.SRP000032.2009_07.glf ##NA12878=../merged/NA12878.chrom22.merged.glf ##minTotalDepth=0 ##maxTotalDepth=1000 ##minMapQuality=30 ##minPosterior=0.9990 ##program=glfTrio ##versionDate=Tue Dec 1 00:42:24 2009 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878 22 14439753 . a t 100 mapQ=0 depth=68;duples=homs;mac=2 GT:GQ:DP 1|1:100:40 0|0:81:28 1|0:84:0 22 14441250 . t c 59 mapQ=0 depth=40 GT:GQ:DP 1|1:56:25 1|1:31:15 1|1:32:0 22 14443154 . t g 45 mapQ=9 depth=92;duples=homs;mac=2 GT:GQ:DP 1|1:49:21 0|0:60:20 1|0:100:51 ... ...
2. The gene list and the reference genome that user provided can be of various gene tracks and assemblies. The latest version takes gene list tracks such as UCSC known genes, RefSeq genes, Genecode genes, CCDS genes and Emsembl genes, and the assembly of the gene list and the reference genome can be of either hg16, hg17, hg18 or hg19. One can explore UCSC genome browser for a better understanding of different tracks and assemblies. By default vcfColdingSnps uses a hg18 UCSC known gene list and the hg18 reference genome. It also provides versions of other tracks and assemblies at the user's conveinience so that they don't need to download those themselves. Input gene file should be a plain text file generated by ucsc genome browser. A sample pathway of generating an input gene file is
Go to http://genome.ucsc.edu/ ►► Click "table" ►► Specify the fields required (clade: mammal, genome:human etc.) ►► In "track" filed, select "UCSC gene" ►► get output gene file 1. Gene file used should be of GenePred table format. The following 11 tab delimited fields are required and must be of the same order as shown below: string name; "Name of gene" string chrom; "Chromosome name" char[1] strand; "+ or - for strand" uint txStart; "Transcription start position" uint txEnd; "Transcription end position" uint cdsStart; "Coding region start" uint cdsEnd; "Coding region end" uint exonCount; "Number of exons" uint[exonCount] exonStarts; "Exon start positions" uint[exonCount] exonEnds; "Exon end positions" string symbol; "Standard gene symbol" Note: the 11th field is a mandatory field for running vcfCodingSnps. In the genelists provided with the package, this field gives the standard gene symbols such as "APOE", "LDL-R" etc. If a genelist downloaded by you own that does not contain such a field, you can simply make the 11th field equal to the first field which is the gene name in a specific track by a syntax like awk `{FS="\t"; print $0"\t"$1 }` yourGenelist > yourNewGenelist 2. If gene file assumes an extended GenePred format, there will be an exctra "exonframe" field. Please refer to here for the definition of "exonframe". For some genes, due to translational frame shifts or other reasons, the exonframe might not match what one would compute using mod 3 in counting codons. In such cases, the program will report a warning massage that "number of base pairs between code start and code end is not a multiple of three". While we will use the usual mod 3 method for counting codons. 3. A detailed instruction on using the table browser could be found at genome.ucsc.edu/cgi-bin/hgTables. 4. One can specify the region to be the whole genome or any particular gene position (e.g. chr21:33031597-33041570).
Here is an example of input gene file headlines:
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID uc001aaa.3 chr1 + 11873 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, uc001aaa.3 uc010nxq.1 chr1 + 11873 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1 uc010nxr.1 chr1 + 11873 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, uc010nxr.1 uc009vis.2 chr1 - 14362 16765 14362 14362 4 14362,14969,15795,16606, 14829,15038,15942,16765, uc009vis.2 uc009vjc.1 chr1 - 16857 17751 16857 16857 2 16857,17232, 17055,17751, uc009vjc.1 uc009vjd.2 chr1 - 15795 18061 15795 15795 5 15795,16606,16857,17232,17605, 15947,16765,17055,17368,18061, uc009vjd.2
Output File
Some possible annotating results for a single SNP with the meanings of their output format are listed below:
5'UTR=A26C2[-] means the SNP is in the 5'UTR region of gene A26C2 with a minus strand. INTRONIC=POTEG[-] means the SNP is in the intronic region of gene POTEG with a minus strand. SYNONYMOUS_CODING=BARD1(uc002veu.2):His506His[-] means the SNP is synonymous coding at the 506th codon in gene BARD1 with a minus strand and it keeps amino-acid His unchanged. NON_SYNONYMOUS_CODING=BARD1(uc002veu.2):Arg658Cys[-] means the SNP is non_synonymous coding at the 658th codon in gene BARD1 (ucsc gene name uc002veu.2)with a minus strand and it changes amino-acid from Arg to Cys. SPLICE_SITE=FARP2(uc002wbi.1)[+] means the SNP is in the SPLICE_SITE (5 bp within exon start or end positions in the coding region) of gene FARP2 (ucsc gene name uc002wbi.1) with a plus strand. STOP_GAINED=C2orf83(uc002vph.1):Trp141stop[-] means the SNP is the 141th codon in gene MAPK12 (ucsc gene name uc002vph.1) with a minus strand and it changes amino-acid Trp to a stop codon. STOP_LOST=OR2M3(uc001ieb.1):stop313Arg[+] means the SNP is the 313th codon in gene OR2M3 (ucsc gene name uc001ieb.1) with a plus strand and it changes a stop codon to amino-acid Arg.
The annotating result will be added to the entry "INFO" of the input VCF SNP file and outputted together with other information. If a SNP is annotated differently with respect to different genes (or different isoforms of the same gene), all the annotated results will be added into the entry "INFO". If the SNP is NOT in any gene coding region, then the original "INFO" will be outputted. Here is an example of input and output VCF file headlines:
Input VCF headlines:
##format=VCFv3.2 ##NA12891=../GLF/NA12891.chrom8.SLX.SRP000032.2009_07.glf ##NA12892=../GLF/NA12892.chrom8.SLX.SRP000032.2009_07.glf ##NA12878=../merged/NA12878.chrom8.merged.glf ##minTotalDepth=0 ##maxTotalDepth=1000 ##minMapQuality=40 ##minPosterior=0.9990 ##program=glfTrio ##versionDate=Thu Aug 27 18:23:18 2009 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878 8 146284 . c a 54 . depth=29;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 1/0:31:12 1/0:32:3 0/0:28:14 8 146703 . c t 92 . depth=41;mac=1;tdt=0/1 GT:GQ:GD 1/1:42:14 0/1:54:9 1/1:24:18 8 151532 . t c 100 . depth=131 GT:GQ:GD 0/0:8:37 1/0:100:26 1/0:100:68 8 151573 . g t 72 . depth=113;mac=1;tdt=1/1 GT:GQ:GD 0/1:48:35 0/0:39:26 0/1:100:52 8 151638 . a c 100 . depth=124;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 0/1:100:55 0/1:100:58 0/1:87:11 8 151651 . c g 100 . depth=124;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 0/1:87:56 0/1:100:56 0/1:24:12 8 151763 . t a 100 . depth=127;duples=hets;mac=2;tdt=1/2 GT:GQ:GD 1/0:100:49 1/0:100:54 1/0:100:24 8 151936 . a g 32 . depth=105;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 0/1:42:44 0/1:23:47 0/0:39:14 8 152578 . c t 87 . depth=108 GT:GQ:GD 1/1:95:31 1/1:89:30 1/1:100:47
Output VCF headlines:
##format=VCFv3.2 ##NA12891=../GLF/NA12891.chrom8.SLX.SRP000032.2009_07.glf ##NA12892=../GLF/NA12892.chrom8.SLX.SRP000032.2009_07.glf ##NA12878=../merged/NA12878.chrom8.merged.glf ##minTotalDepth=0 ##maxTotalDepth=1000 ##minMapQuality=40 ##minPosterior=0.9990 ##program=glfTrio ##versionDate=Thu Aug 27 18:23:18 2009 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878 8 146284 . c a 54 . depth=29;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 1/0:31:12 1/0:32:3 0/0:28:14 8 146703 . c t 92 . depth=41;mac=1;tdt=0/1 GT:GQ:GD 1/1:42:14 0/1:54:9 1/1:24:18 8 151532 . t c 100 . depth=131;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/0:8:37 1/0:100:26 1/0:100:68 8 151573 . g t 72 . depth=113;mac=1;tdt=1/1;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:48:35 0/0:39:26 0/1:100:52 8 151638 . a c 100 . depth=124;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:100:55 0/1:100:58 0/1:87:11 8 151651 . c g 100 . depth=124;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:87:56 0/1:100:56 0/1:24:12 8 151763 . t a 100 . depth=127;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 1/0:100:49 1/0:100:54 1/0:100:24 8 151936 . a g 32 . depth=105;duples=hets;mac=2;tdt=0/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:42:44 0/1:23:47 0/0:39:14 8 152578 . c t 87 . depth=108;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 1/1:95:31 1/1:89:30 1/1:100:47
Output log file headlines:
##chr pos ref alt ucsc_name genestrend genestart geneend ref_codon ref_AA alt_codon alt_AA codon_start codon_end genesymbol codonCount type chr2 214811129 T c uc010fuz.1 + 213857360 214814327 CTA Leu CCA Pro 214811128 214811130 SPAG16 433 NON_SYNONYMOUS_CODING chr2 214811129 T c uc002veq.1 + 213857360 214983470 . . . . . . SPAG16 . INTRONIC chr2 214811129 T c uc002ver.1 + 213857360 214983470 . . . . . . SPAG16 . INTRONIC chr2 214811174 T a uc010fuz.1 + 213857360 214814327 . . . . . . SPAG16 . 3'UTR chr2 214811174 T a uc002veq.1 + 213857360 214983470 . . . . . . SPAG16 . INTRONIC