Changes

From Genome Analysis Wiki
70 bytes added ,  21:05, 19 November 2009
m
Line 20: Line 20:  
= Build Binary Reference Genome and Word Index<br> =
 
= Build Binary Reference Genome and Word Index<br> =
   −
&nbsp; First, we need to build binary reference genome (option: --createReference)<br> &nbsp; (To let KARMA map nucleotide space reads, you need to use ''--createIndex'' to create the word index file.)<br>
+
First, build a binary version of the genome reference sequence as nucleotides (option: --createReference). Suppose that NCBI36.fa is a FASTA file which contains the nucleotide sequences for all chromosomes.<br>
 
+
The command to invoke is:<br>
&nbsp; in nucleotide space. Assume NCBI36.fa is a FASTA file contains sequences of all chromosomes.<br> &nbsp; The command to invoke is:<br>
      
   karma --createReference --reference NCBI36.fa
 
   karma --createReference --reference NCBI36.fa
   
<br>
 
<br>
 +
(To let KARMA map nucleotide space reads, use ''--createIndex'' instead, which creates both the binary sequence and the word index files.)<br>
   −
&nbsp; Second, we need to build binary reference genome (option: --createReference) and word index (option: --createIndex)<br>&nbsp; in color space. The same FASTA file is needed. However, to avoid naming conflicts, we suggest using word "CS" <br>&nbsp; appending to the base file name for clarity. The command to invoke is:<br>
+
Second, we also need to build a binary version of the genome reference sequence (option: --createReference) and the word index files (option: --createIndex) in color space. The same nucleotide FASTA file is needed. However, to avoid naming conflicts among the resulting binary files, we suggest appending "CS" to the base file name for clarity. The command to invoke is:<br>
    
   ln -s NCBI36.fa NCBI36CS.fa
 
   ln -s NCBI36.fa NCBI36CS.fa
 
   karma --colorSpace --createReference --createIndex --reference NCBI36CS.fa
 
   karma --colorSpace --createReference --createIndex --reference NCBI36CS.fa
 
+
<br>
&nbsp; An important parameter is the size of words for indexing.<br> &nbsp; We recommand 15 (default value) for human reference genome.<br> &nbsp; Specifiy ``--wordSize N`` if you like to use ''N'' as word size.<br> &nbsp; Typically you will observe performance change (see [[#Choose_an_appropriate_size_for_word_index|Choose an appropriate size for word index]] for more discussion).<br> &nbsp;<br> &nbsp;<br> &nbsp; Note, multiple chromosomes are supported.<br> &nbsp; In current version, KARMA can take one FASTA file which contains sequences of all chromosomes.<br>
+
An important parameter is the word length used for indexing. We recommend N = 15 (the default value) for the human genome on a machine with at least 20 GB of RAM. Shorter words decrease the memory footprint at the cost of increased run time. However, the word length must not be longer than half the length of the color space reads you intend to map, minus 1. See [[#Choose_an_appropriate_size_for_word_index|Choose an appropriate size for word index]] for more discussion. Specify ``--wordSize N`` to use ''N'' as the word size.<br>
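The word length constraint above can be checked before building the index. The following is a minimal sketch, not part of KARMA: the helper name max_word_size and the variables READ_LENGTH and WORD_SIZE are made up for illustration, and it assumes that "half the read length, minus 1" uses integer division for odd read lengths. It also assumes that ``--wordSize`` can be combined with the other options on one command line. The script only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of KARMA): largest word size allowed for a
# given color space read length, read as floor(read_length / 2) - 1.
# Rounding down for odd read lengths is an assumption.
max_word_size() {
    echo $(( $1 / 2 - 1 ))
}

READ_LENGTH=35   # example read length, chosen for illustration
WORD_SIZE=15     # default word size recommended in the text

LIMIT=$(max_word_size "$READ_LENGTH")

if [ "$WORD_SIZE" -gt "$LIMIT" ]; then
    echo "word size $WORD_SIZE exceeds the limit $LIMIT for reads of length $READ_LENGTH" >&2
else
    # Print the command instead of running it; remove 'echo' to invoke KARMA.
    echo karma --colorSpace --createReference --createIndex --wordSize "$WORD_SIZE" --reference NCBI36CS.fa
fi
```

Under this reading, 35-base reads allow a word size of at most 16, so the default of 15 fits, while reads shorter than 32 bases would need a smaller ``--wordSize``.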
    
<br>
 
<br>