VcfCodingSnps

From Genome Analysis Wiki
Revision as of 18:39, 12 December 2009 by Liyanmin (talk | contribs)

vcfCodingSnps is a SNP annotation tool that annotates coding variants in a VCF-format input file. It takes a VCF file as input and writes an annotated VCF file as output.

Basic Usage Example

Here is an example of how vcfCodingSnps works:

  vcfCodingSnps -s chrom22-CHB.vcf -g genelist.txt -o annotated-chrom22-CHB.vcf

Command Line Options

 -s SNP file                    Specifies the name of the input VCF-format SNP file
 -g genefile                    Specifies the name of the input gene file; the expected format is the plain-text (ASCII) gene list generated by the UCSC Genome Browser
 -o output file                 Specifies the name of the output VCF-format SNP file
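
When the tool is run over many files, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only the three options listed above; the helper name build_command and its executable parameter are hypothetical and not part of vcfCodingSnps, and the sketch assumes the binary is available on the PATH under the name vcfCodingSnps.

```python
import shlex


def build_command(snp_file, gene_file, output_file, executable="vcfCodingSnps"):
    """Assemble the argument list for one vcfCodingSnps run.

    Only the documented options -s, -g and -o are used.
    `executable` is an assumption about where the binary lives.
    """
    return [executable, "-s", snp_file, "-g", gene_file, "-o", output_file]


cmd = build_command("chrom22-CHB.vcf", "genelist.txt", "annotated-chrom22-CHB.vcf")

# Show the command as it would be typed in a shell.
print(shlex.join(cmd))

# To actually run it (requires the vcfCodingSnps binary to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when file names contain spaces.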

Input File Information

1. Example header lines of an input VCF-format SNP file:

  ##format=VCFv3.2
  ##NA12891=../depthFilter/filtered.NA12891.chrom22.SLX.maq.SRP000032.2009_07.glf
  ##NA12892=../depthFilter/filtered.NA12892.chrom22.SLX.maq.SRP000032.2009_07.glf
  ##NA12878=../merged/NA12878.chrom22.merged.glf
  ##minTotalDepth=0
  ##maxTotalDepth=1000
  ##minMapQuality=30
  ##minPosterior=0.9990
  ##program=glfTrio
  ##versionDate=Tue Dec  1 00:42:24 2009
  #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12891 NA12892 NA12878
  22      14439753        .       a       t       100     mapQ=0  depth=68;duples=homs;mac=2      GT:GQ:DP        1|1:100:40      0|0:81:28       1|0:84:0
  22      14441250        .       t       c       59      mapQ=0  depth=40        GT:GQ:DP        1|1:56:25       1|1:31:15       1|1:32:0
  22      14443154        .       t       g       45      mapQ=9  depth=92;duples=homs;mac=2      GT:GQ:DP        1|1:49:21       0|0:60:20       1|0:100:51
  ... ...

2. The input gene file should be a plain-text file generated by the UCSC Genome Browser. One way to generate an input gene file is:

  Go to http://genome.ucsc.edu/ ►► Click "table" ►► Specify the fields required (clade: mammal, genome:human etc.) ►► get output gene file

1. Detailed instructions on using the Table Browser can be found at genome.ucsc.edu/cgi-bin/hgTables.
2. The region can be set to the whole genome or to any particular gene position (e.g. chr21:33031597-33041570).

Here is an example of the first lines of an input gene file:

  #name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds	proteinID	alignID
  uc001aaa.3	chr1	+	11873	14409	11873	11873	3	11873,12612,13220,	12227,12721,14409,		uc001aaa.3
  uc010nxq.1	chr1	+	11873	14409	12189	13639	3	11873,12594,13402,	12227,12721,14409,	B7ZGX9	uc010nxq.1
  uc010nxr.1	chr1	+	11873	14409	11873	11873	3	11873,12645,13220,	12227,12697,14409,		uc010nxr.1
  uc009vis.2	chr1	-	14362	16765	14362	14362	4	14362,14969,15795,16606,	14829,15038,15942,16765,		uc009vis.2
  uc009vjc.1	chr1	-	16857	17751	16857	16857	2	16857,17232,	17055,17751,		uc009vjc.1
  uc009vjd.2	chr1	-	15795	18061	15795	15795	5	15795,16606,16857,17232,17605,	15947,16765,17055,17368,18061,		uc009vjd.2
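
To make the column layout above concrete, here is a minimal sketch of reading one tab-separated gene record and checking whether a position falls in an exon, an intron, or outside the transcript. It is an illustration only, not the algorithm vcfCodingSnps uses. The names Transcript, parse_gene_line and locate are hypothetical, and the sketch assumes the usual UCSC table convention of 0-based start coordinates with exclusive ends, together with 1-based SNP positions as in VCF.

```python
from typing import List, NamedTuple, Tuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exons: List[Tuple[int, int]]  # (start, end) pairs, 0-based, end exclusive


def parse_gene_line(line: str) -> Transcript:
    """Parse one data line of the gene file (columns as in the header above)."""
    f = line.rstrip("\n").split("\t")
    # exonStarts and exonEnds are comma-separated lists with a trailing comma.
    starts = [int(x) for x in f[8].split(",") if x]
    ends = [int(x) for x in f[9].split(",") if x]
    return Transcript(f[0], f[1], f[2], int(f[3]), int(f[4]),
                      int(f[5]), int(f[6]), list(zip(starts, ends)))


def locate(tx: Transcript, pos: int) -> str:
    """Classify a 1-based position as 'exon', 'intron' or 'outside'."""
    p = pos - 1  # convert to the 0-based convention assumed for the gene file
    if not (tx.tx_start <= p < tx.tx_end):
        return "outside"
    for start, end in tx.exons:
        if start <= p < end:
            return "exon"
    return "intron"


# Second record from the example above, written with explicit tabs.
line = "\t".join([
    "uc010nxq.1", "chr1", "+", "11873", "14409", "12189", "13639", "3",
    "11873,12594,13402,", "12227,12721,14409,", "B7ZGX9", "uc010nxq.1",
])
tx = parse_gene_line(line)
print(tx.exons)
print(locate(tx, 12000), locate(tx, 12300), locate(tx, 20000))
```

Records with cdsStart equal to cdsEnd, such as uc001aaa.3 above, have an empty coding interval, so a check like tx.cds_start < tx.cds_end can be used to separate them from transcripts with a coding region.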

Output File

Some possible annotation results for a single SNP, with the meaning of each output format, are listed below:

  5'UTR=A26C2[-] means that the SNP is in the 5'UTR region of gene A26C2 on the minus strand.
  INTRONIC=POTEG[-] means that the SNP is in an intronic region of gene POTEG on the minus strand.
  SYNONYMOUS_CODING=GAB4:Ala15826157Ala[-] means that the SNP is a synonymous coding variant at position 15826157 in gene GAB4 on the minus strand and that it leaves the amino acid Ala unchanged.
  NON_SYNONYMOUS_CODING=GAB4:Leu15830952Pro[-] means that the SNP is a non-synonymous coding variant at position 15830952 in gene GAB4 on the minus strand and that it changes the amino acid Leu to Pro.
  SPLICE_SITE=NCAPH2[+] means that the SNP is in a splice site (within 5 bp of an exon start or end position in the coding region) of gene NCAPH2 on the plus strand.
  STOP_GAINED=MAPK12:Trp49035685stop[-] means that the SNP is at position 49035685 in gene MAPK12 on the minus strand and that it changes the amino acid Trp to a stop codon.
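
For downstream scripts, the annotation strings listed above can be split into their parts with a regular expression. The sketch below is inferred only from the six examples shown here; the name parse_annotation and the dictionary keys are hypothetical, and other annotation types produced by the tool may need a different pattern.

```python
import re

# TYPE=GENE[strand], optionally with :RefPositionAlt between gene and strand.
ANNOTATION_RE = re.compile(
    r"^(?P<kind>[^=]+)="
    r"(?P<gene>[^:\[]+)"
    r"(?::(?P<ref_aa>[A-Za-z]+?)(?P<pos>\d+)(?P<alt_aa>[A-Za-z]+))?"
    r"\[(?P<strand>[+-])\]$"
)


def parse_annotation(text):
    """Split one annotation string into a dict of its components.

    ref_aa, pos and alt_aa are None for annotations that carry no
    amino-acid change (e.g. INTRONIC or 5'UTR).
    """
    m = ANNOTATION_RE.match(text.strip())
    if m is None:
        raise ValueError("unrecognized annotation: %r" % text)
    fields = m.groupdict()
    if fields["pos"] is not None:
        fields["pos"] = int(fields["pos"])
    return fields


for example in (
    "5'UTR=A26C2[-]",
    "NON_SYNONYMOUS_CODING=GAB4:Leu15830952Pro[-]",
    "STOP_GAINED=MAPK12:Trp49035685stop[-]",
):
    print(parse_annotation(example))
```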