Open main menu

Genome Analysis Wiki β

VcfCodingSnps

Revision as of 02:39, 20 September 2010 by Liyanmin (talk | contribs)

vcfCodingSnps[1] is a SNP annotation tool that annotates coding variants in a VCF format input file. It takes a VCF as input and generates an annotated VCF file as output. The tool is currently under development by Yanming Li, a doctoral student at the University of Michigan Center for Statistical Genetics. For any issues with the program, please contact Yanming. A detailed tutorial and download page can be found at [2]

Contents

Basic Usage Example

Here is an example of how vcfCodingSnps works:

  vcfCodingSnps -s chrom22-CHB.vcf -g genelist.txt -o annotated-chrom22-CHB.vcf

Command Line Options

 -s SNP file  	                     Specifies the name of input VCF-format SNP file
 -g genefile 	                     Specifies the name of input gene file, by default use a NCBI36 version gene list file (genelist.txt) in UCSC known gene format generated by UCSC genome browser
 -o output file 	             Specifies the name of output VCF-format SNP file, by default will be named vcfCodingSNP.out.vcf
 -r reference genome                Specifies the name of reference genome file, by default will use NCBI build 36 reference genome
 -l log file 	                     Specifies the name of log file, log file gives more detailed information for each annotated SNP, by default will be named vcfCodingSNP.log
 --n1 parameter 	             user defined number of bps into intron for splice site, by default will be set to 8
 --n2 parameter  	             user defined number of bps into extron for splice site, by default will be set to 3
 --ns parameter 	             user defined number of kbps for the range of upstream or downstream of a gene, by default will be set t0 5

Input File Infomation

1. Example headlines of input VCF-format SNP file:

  ##format=VCFv3.2
  ##NA12891=../depthFilter/filtered.NA12891.chrom22.SLX.maq.SRP000032.2009_07.glf
  ##NA12892=../depthFilter/filtered.NA12892.chrom22.SLX.maq.SRP000032.2009_07.glf
  ##NA12878=../merged/NA12878.chrom22.merged.glf
  ##minTotalDepth=0
  ##maxTotalDepth=1000
  ##minMapQuality=30
  ##minPosterior=0.9990
  ##program=glfTrio
  ##versionDate=Tue Dec  1 00:42:24 2009
  #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12891 NA12892 NA12878
  22      14439753        .       a       t       100     mapQ=0  depth=68;duples=homs;mac=2      GT:GQ:DP        1|1:100:40      0|0:81:28       1|0:84:0
  22      14441250        .       t       c       59      mapQ=0  depth=40        GT:GQ:DP        1|1:56:25       1|1:31:15       1|1:32:0
  22      14443154        .       t       g       45      mapQ=9  depth=92;duples=homs;mac=2      GT:GQ:DP        1|1:49:21       0|0:60:20       1|0:100:51
  ... ...

2. Input gene file should be a plain text file generated by ucsc genome browser. A sample pathway of generating an input gene file is

  Go to http://genome.ucsc.edu/ ►► Click "table" ►► Specify the fields required (clade: mammal, genome:human etc.) ►► In "track" filed, select "UCSC gene" ►► get output gene file

1. Gene file used should be of GenePred table format. The following 10 fields are required and must be of the same order as shown below:
   string  name;               "Name of gene"
   string  chrom;              "Chromosome name"
   char[1] strand;             "+ or - for strand"
   uint    txStart;            "Transcription start position"
   uint    txEnd;              "Transcription end position"
   uint    cdsStart;           "Coding region start"
   uint    cdsEnd;             "Coding region end"
   uint    exonCount;          "Number of exons"
   uint[exonCount] exonStarts; "Exon start positions"
   uint[exonCount] exonEnds;   "Exon end positions"
2. If gene file assumes an extended GenePred format, there will be an exctra "exonframe" field. Please refer to here for the definition of "exonframe". For some genes, due to translational frame shifts or other 
   reasons, the exonframe might not match what one would compute using mod 3 in counting codons. In such cases, the program will report a warning massage that "number of base pairs between code start and code end is
   not a multiple of three". While we will use the usual mod 3 method for counting codons. 
3. A detailed instruction on using the table browser could be found at genome.ucsc.edu/cgi-bin/hgTables.
4. One can specify the region to be the whole genome or any particular gene position (e.g. chr21:33031597-33041570).

Here is an example of input gene file headlines:

  #name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds	proteinID	alignID
  uc001aaa.3	chr1	+	11873	14409	11873	11873	3	11873,12612,13220,	12227,12721,14409,		uc001aaa.3
  uc010nxq.1	chr1	+	11873	14409	12189	13639	3	11873,12594,13402,	12227,12721,14409,	B7ZGX9	uc010nxq.1
  uc010nxr.1	chr1	+	11873	14409	11873	11873	3	11873,12645,13220,	12227,12697,14409,		uc010nxr.1
  uc009vis.2	chr1	-	14362	16765	14362	14362	4	14362,14969,15795,16606,	14829,15038,15942,16765,		uc009vis.2
  uc009vjc.1	chr1	-	16857	17751	16857	16857	2	16857,17232,	17055,17751,		uc009vjc.1
  uc009vjd.2	chr1	-	15795	18061	15795	15795	5	15795,16606,16857,17232,17605,	15947,16765,17055,17368,18061,		uc009vjd.2

Output File

Some possible annotating results for a single SNP with the meanings of their output format are listed below:

  5'UTR=A26C2[-]                                              means    the SNP is in the 5'UTR region of gene A26C2 with a minus strand.
  INTRONIC=POTEG[-]                                           means    the SNP is in the intronic region of gene POTEG with a minus strand.
  SYNONYMOUS_CODING=BARD1(uc002veu.2):His506His[-]            means    the SNP is synonymous coding at the 506th codon in gene BARD1 with a minus strand and it keeps amino-acid His unchanged.
  NON_SYNONYMOUS_CODING=BARD1(uc002veu.2):Arg658Cys[-]        means    the SNP is non_synonymous coding at the 658th codon in gene BARD1 (ucsc gene name uc002veu.2)with a minus strand and it changes amino-acid from Arg to Cys.
  SPLICE_SITE=FARP2(uc002wbi.1)[+]                            means    the SNP is in the SPLICE_SITE (5 bp within exon start or end positions in the coding region) of gene FARP2 (ucsc gene name uc002wbi.1) with a plus strand.
  STOP_GAINED=C2orf83(uc002vph.1):Trp141stop[-]               means    the SNP is the 141th codon in gene MAPK12 (ucsc gene name uc002vph.1) with a minus strand and it changes amino-acid Trp to a stop codon.
  STOP_LOST=OR2M3(uc001ieb.1):stop313Arg[+]                   means    the SNP is the 313th codon in gene OR2M3 (ucsc gene name uc001ieb.1) with a plus strand and it changes a stop codon to amino-acid Arg.

The annotating result will be added to the entry "INFO" of the input VCF SNP file and outputted together with other information. If a SNP is annotated differently with respect to different genes (or different isoforms of the same gene), all the annotated results will be added into the entry "INFO". If the SNP is NOT in any gene coding region, then the original "INFO" will be outputted. Here is an example of input and output VCF file headlines:

Input VCF headlines:

  ##format=VCFv3.2
  ##NA12891=../GLF/NA12891.chrom8.SLX.SRP000032.2009_07.glf
  ##NA12892=../GLF/NA12892.chrom8.SLX.SRP000032.2009_07.glf
  ##NA12878=../merged/NA12878.chrom8.merged.glf
  ##minTotalDepth=0
  ##maxTotalDepth=1000
  ##minMapQuality=40
  ##minPosterior=0.9990
  ##program=glfTrio
  ##versionDate=Thu Aug 27 18:23:18 2009
  #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12891 NA12892 NA12878
  8       146284  .       c       a       54      .       depth=29;duples=hets;mac=2;tdt=0/2      GT:GQ:GD        1/0:31:12       1/0:32:3        0/0:28:14
  8       146703  .       c       t       92      .       depth=41;mac=1;tdt=0/1  GT:GQ:GD        1/1:42:14       0/1:54:9        1/1:24:18
  8       151532  .       t       c       100     .       depth=131       GT:GQ:GD        0/0:8:37        1/0:100:26      1/0:100:68
  8       151573  .       g       t       72      .       depth=113;mac=1;tdt=1/1 GT:GQ:GD        0/1:48:35       0/0:39:26       0/1:100:52
  8       151638  .       a       c       100     .       depth=124;duples=hets;mac=2;tdt=1/2     GT:GQ:GD        0/1:100:55      0/1:100:58      0/1:87:11
  8       151651  .       c       g       100     .       depth=124;duples=hets;mac=2;tdt=1/2     GT:GQ:GD        0/1:87:56       0/1:100:56      0/1:24:12
  8       151763  .       t       a       100     .       depth=127;duples=hets;mac=2;tdt=1/2     GT:GQ:GD        1/0:100:49      1/0:100:54      1/0:100:24
  8       151936  .       a       g       32      .       depth=105;duples=hets;mac=2;tdt=0/2     GT:GQ:GD        0/1:42:44       0/1:23:47       0/0:39:14
  8       152578  .       c       t       87      .       depth=108       GT:GQ:GD        1/1:95:31       1/1:89:30       1/1:100:47

Output VCF headlines:

  ##format=VCFv3.2 
  ##NA12891=../GLF/NA12891.chrom8.SLX.SRP000032.2009_07.glf 
  ##NA12892=../GLF/NA12892.chrom8.SLX.SRP000032.2009_07.glf 
  ##NA12878=../merged/NA12878.chrom8.merged.glf 
  ##minTotalDepth=0 
  ##maxTotalDepth=1000 
  ##minMapQuality=40 
  ##minPosterior=0.9990 
  ##program=glfTrio 
  ##versionDate=Thu Aug 27 18:23:18 2009 
  #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12891 NA12892 NA12878
  8 146284 . c a 54 . depth=29;duples=hets;mac=2;tdt=0/2 GT:GQ:GD 1/0:31:12 1/0:32:3 0/0:28:14 
  8 146703 . c t 92 . depth=41;mac=1;tdt=0/1 GT:GQ:GD 1/1:42:14 0/1:54:9 1/1:24:18 
  8 151532 . t c 100 . depth=131;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/0:8:37 1/0:100:26 1/0:100:68 
  8 151573 . g t 72 . depth=113;mac=1;tdt=1/1;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:48:35 0/0:39:26 0/1:100:52 
  8 151638 . a c 100 . depth=124;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:100:55 0/1:100:58 0/1:87:11 
  8 151651 . c g 100 . depth=124;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:87:56 0/1:100:56 0/1:24:12 
  8 151763 . t a 100 . depth=127;duples=hets;mac=2;tdt=1/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 1/0:100:49 1/0:100:54 1/0:100:24 
  8 151936 . a g 32 . depth=105;duples=hets;mac=2;tdt=0/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:42:44 0/1:23:47 0/0:39:14 
  8 152578 . c t 87 . depth=108;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 1/1:95:31 1/1:89:30 1/1:100:47