Changes

From Genome Analysis Wiki
2,922 bytes added ,  11:58, 2 February 2017
no edit summary
Line 1: Line 1: −
'''vcfCodingSnps'''[http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps/index.shtml] is a SNP annotation tool that annotates coding variants in a [[VCF]] format input file. It takes a VCF as input and generates an annotated VCF file as output. The tool is currently under development by Yanming Li, a doctoral student at the University of Michigan Center for Statistical Genetics. For any issues with the program, please contact [mailto:liyanmin@umich.edu Yanming]. A detailed tutorial and download page can be found at [http://www.sph.umich.edu/csg/liyanmin/vcfCodingSnps/index.shtml]  
+
'''vcfCodingSnps'''[http://csg.sph.umich.edu//liyanmin/vcfCodingSnps/index.shtml] is a SNP annotation tool that annotates coding variants in a [[VCF]] format input file. It takes a VCF as input and generates an annotated VCF file as output. The tool is currently under development by Yanming Li, a doctoral student at the University of Michigan Center for Statistical Genetics. For any issues with the program, please contact [mailto:liyanmin@umich.edu Yanming]. A detailed tutorial and download page can be found at [http://csg.sph.umich.edu//liyanmin/vcfCodingSnps/index.shtml]  
    
== Basic Usage Example  ==
 
Line 18: Line 18:  
   --ns parameter             user-defined size, in kb, of the upstream and downstream range of a gene; set to 5 by default
   −
== Input File Infomation ==
+
== Library Compiling Guideline  ==
 +
 
 +
  To compile the source code, first recompile the library sources in the library folder on your local machine:
 +
  1. Change into the "libcsg" folder. On a 32-bit machine, type "gcc -c -O2 *.cpp -D_FILE_OFFSET_BITS=32" to recompile the library source files.
 +
    (On a 64-bit machine, type "gcc -c -O2 *.cpp -D_FILE_OFFSET_BITS=64" instead.)
 +
  2. In the same folder, type "ar -rc libcsg.a *.o".
 +
  3. Go to the root folder and type "make clean", then "make". (Nothing in the Makefile needs to change for this step.)
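
  The three steps above can be wrapped in a small script. This is only a sketch: the function names offset_bits and build_vcfcodingsnps are hypothetical, and using "getconf LONG_BIT" to choose between the 32-bit and 64-bit commands is an assumption that is not part of the original instructions.

```shell
#!/bin/sh
# Sketch only: helper names and the getconf-based detection are assumptions.

# Print 32 on a 32-bit machine, 64 otherwise.
offset_bits() {
    bits=$(getconf LONG_BIT 2>/dev/null || echo 64)
    case "$bits" in
        32) echo 32 ;;
        *)  echo 64 ;;
    esac
}

# Rebuild libcsg.a, then the tool. $1 is the source root (defaults to ".").
build_vcfcodingsnps() {
    root=${1:-.}
    (
        # Steps 1 and 2: recompile the library sources and archive them.
        cd "$root/libcsg" || exit 1
        gcc -c -O2 *.cpp -D_FILE_OFFSET_BITS="$(offset_bits)" || exit 1
        ar -rc libcsg.a *.o
    ) || return 1
    # Step 3: build from the root folder with the unmodified Makefile.
    ( cd "$root" && make clean && make )
}
```

  Calling "build_vcfcodingsnps /path/to/source" runs the steps in order and stops at the first one that fails.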
 +
 
 +
== Input File Information ==
    
1. Example header lines of the input VCF-format SNP file:  
Line 38: Line 46:  
   ... ...
 
   −
2. Input gene file should be a plain text file generated by [http://genome.ucsc.edu/ ucsc genome browser]. A sample pathway of generating an input gene file is  
+
2. The gene list and the reference genome provided by the user can come from various gene tracks and assemblies. The latest version accepts gene list tracks such as UCSC known genes, RefSeq genes, GENCODE genes, CCDS genes and Ensembl genes, and the gene list and the reference genome can use any of the hg16, hg17, hg18 or hg19 assemblies. Explore the UCSC genome browser for a better understanding of the different tracks and assemblies. By default vcfCodingSnps uses an hg18 UCSC known gene list and the hg18 reference genome. Versions for other tracks and assemblies are also provided for convenience, so users do not need to download them themselves. The input gene file should be a plain text file generated by the [http://genome.ucsc.edu/ UCSC genome browser]. A sample pathway for generating an input gene file is  
    
   Go to http://genome.ucsc.edu/ ►► Click "table" ►► Specify the required fields (clade: mammal, genome: human, etc.) ►► In the "track" field, select "UCSC gene" ►► get the output gene file
 
   
 
   
  1. Gene file used should be of [http://genome.ucsc.edu/FAQ/FAQformat#format9 GenePred table format]. The following 10 fields are required and must be of the same order as shown below:
+
  1. The gene file should be in [http://genome.ucsc.edu/FAQ/FAQformat#format9 GenePred table format]. The following 11 tab-delimited fields are required and must appear in the order shown below:
 
     string  name;              "Name of gene"
 
 
     string  chrom;              "Chromosome name"
 
Line 53: Line 61:  
     uint[exonCount] exonStarts; "Exon start positions"
 
 
     uint[exonCount] exonEnds;  "Exon end positions"
 
  2. If gene file assumes an [http://genome.ucsc.edu/FAQ/FAQformat#format9 extended GenePred format], there will be an exctra "exonframe" field. Please refer to [https://lists.soe.ucsc.edu/pipermail/genome/2006-November/012218.html here] for the definition of "exonframe". For some genes, due to translational frame shifts or other reasons, the exonframe might not
+
    string  symbol;            "Standard gene symbol"
match what one would compute using mod 3 in counting codons. In such cases, the program will report a warning massage that "number of base pairs between code start and code end is not a multiple of three". While we will use the usual mod 3 method for counting codons.  
+
   
 +
    Note: the 11th field is mandatory for running vcfCodingSnps. In the gene lists provided with the package, this field gives the standard gene symbol, such as "APOE" or "LDL-R".
 +
  If a gene list you downloaded yourself does not contain such a field, you can simply set the 11th field equal to the first field, which is the gene name in that track, with a command like
 +
   
 +
    awk 'BEGIN { FS = OFS = "\t" } { print $0, $1 }' yourGenelist > yourNewGenelist
 +
   
 +
  2. If the gene file uses the [http://genome.ucsc.edu/FAQ/FAQformat#format9 extended GenePred format], there will be an extra "exonframe" field. Please refer to [https://lists.soe.ucsc.edu/pipermail/genome/2006-November/012218.html here] for the definition of "exonframe". For some genes, due to translational frame shifts or other  
 +
    reasons, the exonframe might not match what one would compute using mod 3 when counting codons. In such cases, the program reports a warning message that "number of base pairs between code start and code end is
 +
    not a multiple of three", and still uses the usual mod 3 method for counting codons.
 
  3. Detailed instructions on using the table browser can be found at [http://genome.ucsc.edu/cgi-bin/hgTables?command=start#Help genome.ucsc.edu/cgi-bin/hgTables].
 
 
  4. One can specify the region to be the whole genome or any particular gene position (e.g. chr21:33031597-33041570).
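
  As a rough illustration of the gene file requirements above, the sketch below checks that every row has at least 11 tab-delimited fields and, if the symbol column is missing, appends the first field as the 11th. The function names check_genelist and add_symbol_field are hypothetical helpers, not part of vcfCodingSnps.

```shell
#!/bin/sh
# Sketch only: both helpers are illustrative, not shipped with the tool.

# Report rows with fewer than 11 tab-delimited fields; non-zero exit if any.
check_genelist() {
    awk -F'\t' '
        NF < 11 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        END     { exit (bad ? 1 : 0) }
    ' "$1"
}

# Copy the first field (track-specific gene name) into a new last column.
add_symbol_field() {
    awk 'BEGIN { FS = OFS = "\t" } { print $0, $1 }' "$1"
}
```

  For example, "check_genelist yourGenelist || add_symbol_field yourGenelist > yourNewGenelist" writes a patched copy only when the check fails.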
Line 127: Line 143:  
   8 151936 . a g 32 . depth=105;duples=hets;mac=2;tdt=0/2;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 0/1:42:44 0/1:23:47 0/0:39:14  
 
 
   8 152578 . c t 87 . depth=108;5'UTR=RPL23A_20_869(uc010lra.1)[-];5'UTR=RPL23A_20_869(uc003woq.2)[-];5'UTR=RPL23A_20_869(uc010lrb.1)[-] GT:GQ:GD 1/1:95:31 1/1:89:30 1/1:100:47
 
 +
 +
Output log file header lines:
 +
 +
  ##chr    pos    ref    alt    ucsc_name      genestrend      genestart    geneend ref_codon      ref_AA  alt_codon      alt_AA codon_start    codon_end      genesymbol      codonCount      type
 +
  chr2    214811129      T      c      uc010fuz.1      +      213857360      214814327      CTA    Leu    CCA    Pro  214811128      214811130      SPAG16  433    NON_SYNONYMOUS_CODING
 +
  chr2    214811129      T      c      uc002veq.1      +      213857360      214983470      .      .      .      .      .      .      SPAG16  .      INTRONIC
 +
  chr2    214811129      T      c      uc002ver.1      +      213857360      214983470      .      .      .      .      .      .      SPAG16  .      INTRONIC
 +
  chr2    214811174      T      a      uc010fuz.1      +      213857360      214814327      .      .      .      .      .      .      SPAG16  .      3'UTR
 +
  chr2    214811174      T      a      uc002veq.1      +      213857360      214983470      .      .      .      .      .      .      SPAG16  .      INTRONIC
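
  A log in this layout can be summarized by annotation type. The sketch below assumes the columns are whitespace-separated, that "type" is always the last column, and that header lines start with "#"; the function name summarize_log is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes "type" is the last whitespace-separated column.

# Print one "type<TAB>count" line per annotation type found in the log.
summarize_log() {
    awk '
        !/^#/ && NF { n[$NF]++ }                      # skip header and blank lines
        END { for (t in n) print t "\t" n[t] }
    ' "$1" | sort
}
```

  Run it as "summarize_log yourLogFile", where yourLogFile stands for the log written by the tool.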