Genome Analysis Wiki β

VcfCodingSnps

Revision as of 13:59, 13 December 2009 by Liyanmin (talk | contribs)

vcfCodingSnps is a SNP annotation tool that annotates coding variants in a VCF format input file. It takes a VCF as input and generates an annotated VCF file as output.

Basic Usage Example

Here is an example of how vcfCodingSnps works:

  vcfCodingSnps -s chrom22-CHB.vcf -g genelist.txt -o annotated-chrom22-CHB.vcf

Command Line Options

 -s SNP file                    Specifies the name of the input VCF-format SNP file
 -g genefile                    Specifies the name of the input gene file (by default, a plain-text gene list file generated by the UCSC Genome Browser)
 -o output file                 Specifies the name of the output VCF-format SNP file

Input File Information

1. Example header and first records of an input VCF-format SNP file:

  ##format=VCFv3.2
  ##NA12891=../depthFilter/filtered.NA12891.chrom22.SLX.maq.SRP000032.2009_07.glf
  ##NA12892=../depthFilter/filtered.NA12892.chrom22.SLX.maq.SRP000032.2009_07.glf
  ##NA12878=../merged/NA12878.chrom22.merged.glf
  ##minTotalDepth=0
  ##maxTotalDepth=1000
  ##minMapQuality=30
  ##minPosterior=0.9990
  ##program=glfTrio
  ##versionDate=Tue Dec  1 00:42:24 2009
  #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12891 NA12892 NA12878
  22      14439753        .       a       t       100     mapQ=0  depth=68;duples=homs;mac=2      GT:GQ:DP        1|1:100:40      0|0:81:28       1|0:84:0
  22      14441250        .       t       c       59      mapQ=0  depth=40        GT:GQ:DP        1|1:56:25       1|1:31:15       1|1:32:0
  22      14443154        .       t       g       45      mapQ=9  depth=92;duples=homs;mac=2      GT:GQ:DP        1|1:49:21       0|0:60:20       1|0:100:51
  ... ...
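To make the column layout above concrete, here is a minimal Python sketch that splits one data line into its fixed columns, the INFO entries, and the per-sample fields named by FORMAT. The function name `parse_vcf_line` and the returned dictionary layout are illustrative assumptions, not part of vcfCodingSnps.

```python
def parse_vcf_line(line):
    """Split one VCF data line into fixed columns, INFO entries and samples."""
    fields = line.split()  # columns are separated by tabs or spaces
    if len(fields) < 9:
        raise ValueError("expected at least 9 columns in a VCF data line")
    chrom, pos, snp_id, ref, alt, qual, filt, info, fmt = fields[:9]

    # INFO is a ';'-separated list of key=value pairs (or bare flags).
    info_items = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_items[key] = value if sep else True

    # Each sample column follows the layout named in FORMAT, e.g. GT:GQ:DP.
    format_keys = fmt.split(":")
    samples = [dict(zip(format_keys, column.split(":"))) for column in fields[9:]]

    return {
        "chrom": chrom, "pos": int(pos), "id": snp_id,
        "ref": ref, "alt": alt, "qual": qual, "filter": filt,
        "info": info_items, "samples": samples,
    }


# First data line of the example above.
line = ("22\t14439753\t.\ta\tt\t100\tmapQ=0\tdepth=68;duples=homs;mac=2"
        "\tGT:GQ:DP\t1|1:100:40\t0|0:81:28\t1|0:84:0")
record = parse_vcf_line(line)
print(record["pos"], record["info"], record["samples"][0])
```

Lines beginning with `#` are header lines and would need to be skipped before calling this function.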

2. The input gene file should be a plain-text file generated by the UCSC Genome Browser. One way to generate such a file is:

  Go to http://genome.ucsc.edu/ ►► Click "table" ►► Specify the fields required (clade: mammal, genome:human etc.) ►► get output gene file

1. Detailed instructions for using the Table Browser can be found at genome.ucsc.edu/cgi-bin/hgTables.
2. The region can be the whole genome or any particular gene position (e.g. chr21:33031597-33041570).

Here is an example of the header and first lines of an input gene file:

  #name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds	proteinID	alignID
  uc001aaa.3	chr1	+	11873	14409	11873	11873	3	11873,12612,13220,	12227,12721,14409,		uc001aaa.3
  uc010nxq.1	chr1	+	11873	14409	12189	13639	3	11873,12594,13402,	12227,12721,14409,	B7ZGX9	uc010nxq.1
  uc010nxr.1	chr1	+	11873	14409	11873	11873	3	11873,12645,13220,	12227,12697,14409,		uc010nxr.1
  uc009vis.2	chr1	-	14362	16765	14362	14362	4	14362,14969,15795,16606,	14829,15038,15942,16765,		uc009vis.2
  uc009vjc.1	chr1	-	16857	17751	16857	16857	2	16857,17232,	17055,17751,		uc009vjc.1
  uc009vjd.2	chr1	-	15795	18061	15795	15795	5	15795,16606,16857,17232,17605,	15947,16765,17055,17368,18061,		uc009vjd.2
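The gene file columns shown above can be read with a few lines of Python. The sketch below assumes the file is tab-separated (so that an empty proteinID column is preserved) and that exonStarts and exonEnds are comma-separated lists with a trailing comma, as in the sample rows. The name `parse_gene_line` and the dictionary keys are hypothetical and are not taken from vcfCodingSnps itself.

```python
def parse_gene_line(line):
    """Parse one tab-separated row of a UCSC gene list into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("expected 12 tab-separated columns")

    exon_count = int(cols[7])
    # exonStarts/exonEnds end with a trailing comma, so drop empty pieces.
    starts = [int(x) for x in cols[8].split(",") if x]
    ends = [int(x) for x in cols[9].split(",") if x]
    if not (len(starts) == len(ends) == exon_count):
        raise ValueError("exon lists do not match exonCount")

    return {
        "name": cols[0], "chrom": cols[1], "strand": cols[2],
        "tx_start": int(cols[3]), "tx_end": int(cols[4]),
        "cds_start": int(cols[5]), "cds_end": int(cols[6]),
        "exons": list(zip(starts, ends)),
        "protein_id": cols[10], "align_id": cols[11],
    }


# Second data row of the example above.
row = ("uc010nxq.1\tchr1\t+\t11873\t14409\t12189\t13639\t3\t"
       "11873,12594,13402,\t12227,12721,14409,\tB7ZGX9\tuc010nxq.1")
gene = parse_gene_line(row)
print(gene["name"], gene["strand"], gene["exons"])
```

The header row starting with `#name` should be skipped before parsing. In the sample rows, transcripts with an empty proteinID also have cdsStart equal to cdsEnd, which appears to mark them as having no coding region; treat that reading as an assumption.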

Output File

Some possible annotations for a single SNP, together with the meaning of each output format, are listed below:

  5'UTR=A26C2[-]    means    that the SNP is in the 5'UTR region of gene A26C2 on the minus strand.
  INTRONIC=POTEG[-]    means    that the SNP is in an intronic region of gene POTEG on the minus strand.
  SYNONYMOUS_CODING=GAB4:Ala15826157Ala[-]    means    that the SNP is a synonymous coding variant at position 15826157 in gene GAB4 on the minus strand and leaves the amino acid Ala unchanged.
  NON_SYNONYMOUS_CODING=GAB4:Leu15830952Pro[-]    means    that the SNP is a non-synonymous coding variant at position 15830952 in gene GAB4 on the minus strand and changes the amino acid Leu to Pro.
  SPLICE_SITE=NCAPH2[+]    means    that the SNP is in a splice site (within 5 bp of an exon start or end position in the coding region) of gene NCAPH2 on the plus strand.
  STOP_GAINED=MAPK12:Trp49035685stop[-]    means    that the SNP is at position 49035685 in gene MAPK12 on the minus strand and changes the amino acid Trp to a stop codon.
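The annotation strings listed above follow a regular pattern: a type, a gene name, an optional amino-acid change with a position, and a strand in square brackets. The following Python sketch decodes them with a regular expression. The pattern is inferred only from the examples on this page, and the name `parse_annotation` is hypothetical; other annotation forms emitted by the tool may not match.

```python
import re

# TYPE=GENE[strand] or TYPE=GENE:RefPOSAlt[strand], as in the examples above.
ANNOTATION_RE = re.compile(
    r"^(?P<type>[^=]+)="
    r"(?P<gene>[^:\[]+)"
    r"(?::(?P<ref_aa>[A-Za-z]+)(?P<pos>\d+)(?P<alt_aa>[A-Za-z]+))?"
    r"\[(?P<strand>[+-])\]$"
)


def parse_annotation(text):
    """Decode one annotation string; return None if it does not match."""
    match = ANNOTATION_RE.match(text)
    if match is None:
        return None
    result = match.groupdict()
    # The position is only present for coding changes.
    if result["pos"] is not None:
        result["pos"] = int(result["pos"])
    return result


for example in ("5'UTR=A26C2[-]",
                "NON_SYNONYMOUS_CODING=GAB4:Leu15830952Pro[-]",
                "STOP_GAINED=MAPK12:Trp49035685stop[-]"):
    print(parse_annotation(example))
```

Annotations without an amino-acid change, such as `5'UTR=A26C2[-]`, yield `None` for the `ref_aa`, `pos` and `alt_aa` entries.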

The annotation is appended to the "INFO" field of each record, and all other information from the input VCF SNP file is written to the output unchanged. Here is an example of the header and first records of an output VCF file:

  ##format=VCFv3.2
  ##NA12891=../GLF/NA12891.chrom22.SLX.SRP000032.2009_07.glf
  ##NA12892=../GLF/NA12892.chrom22.SLX.SRP000032.2009_07.glf
  ##NA12878=../merged/NA12878.chrom22.merged.glf
  ##minTotalDepth=0
  ##maxTotalDepth=1000
  ##minMapQuality=40
  ##minPosterior=0.9990
  ##program=glfTrio
  ##versionDate=Thu Aug 27 18:23:18 2009
  #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12891 NA12892 NA12878
  22      15464609        .       a       g       100     .       depth=109;mac=1;tdt=0/1;3'UTR=psiTPTE22[+]      GT:GQ:GD        0/1:100:44      1/1:81:28       1/1:100:37
  22      15464609        .       a       g       100     .       depth=109;mac=1;tdt=0/1;3'UTR=psiTPTE22[+]      GT:GQ:GD        0/1:100:44      1/1:81:28       1/1:100:37
  22      15464609        .       a       g       100     .       depth=109;mac=1;tdt=0/1;3'UTR=psiTPTE22[+]      GT:GQ:GD        0/1:100:44      1/1:81:28       1/1:100:37
  22      15464609        .       a       g       100     .       depth=109;mac=1;tdt=0/1;3'UTR=psiTPTE22[+]      GT:GQ:GD        0/1:100:44      1/1:81:28       1/1:100:37
  22      15482433        .       a       g       38      .       depth=21;3'UTR=psiTPTE22[+]     GT:GQ:GD        1/1:34:11       1/1:14:3        1/1:35:7
  22      15644565        .       g       t       77      .       depth=140;NON_SYNONYMOUS_CODING=XKR3:His15644565Asn[-]  GT:GQ:GD        1/1:100:49
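Because the annotation is stored as one more entry in the INFO column, it can be recovered by splitting that column on `;` and keeping the entries whose key is an annotation type. The sketch below does this in Python. The set `ANNOTATION_TYPES` contains only the types that appear on this page and is an assumption; the tool may emit others. The function name `extract_annotations` is likewise illustrative.

```python
# Annotation types seen in the examples on this page (possibly incomplete).
ANNOTATION_TYPES = {
    "5'UTR", "3'UTR", "INTRONIC", "SPLICE_SITE",
    "SYNONYMOUS_CODING", "NON_SYNONYMOUS_CODING", "STOP_GAINED",
}


def extract_annotations(vcf_line):
    """Return (type, value) pairs for annotations found in the INFO column."""
    fields = vcf_line.split()
    if len(fields) < 8:
        raise ValueError("expected at least 8 columns in a VCF data line")
    info = fields[7]  # INFO is the eighth column
    found = []
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep and key in ANNOTATION_TYPES:
            found.append((key, value))
    return found


line = ("22\t15482433\t.\ta\tg\t38\t.\tdepth=21;3'UTR=psiTPTE22[+]"
        "\tGT:GQ:GD\t1/1:34:11\t1/1:14:3\t1/1:35:7")
print(extract_annotations(line))
```

Entries such as `depth=21` are left out because their key is not in the assumed set of annotation types.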