Changes

From Genome Analysis Wiki
214 bytes added, 23:54, 27 January 2010
no edit summary
Line 5: Line 5:  
We summarize the input data requirements as follows:
   −
*  A binary conversion of the genome reference sequence as nucleotides (see [[#Build_Binary_Reference_Genome_and_Word_Index|Build Binary Reference Genome and Word Index]])  
+
* A binary conversion of the genome reference sequence as nucleotides (see [[#Build_Binary_Reference_Genome_and_Word_Index|Build Binary Reference Genome and Word Index]])  
*  A binary conversion of the genome reference sequence as colors plus word indices in color space (see [[#Build_Binary_Reference_Genome_and_Word_Index|Build Binary Reference Genome and Word Index]])  
+
 
*  Color space reads in color space FASTQ format (see [[#Input_file_requirement|Input file requirement]] for a description)  
+
* A binary conversion of the genome reference sequence as colors plus word indices in color space (see [[#Build_Binary_Reference_Genome_and_Word_Index|Build Binary Reference Genome and Word Index]])  
*  Color space reads longer than a minimum length requirement. (see [[#Minimum_read_length_requirement|Minimum read length requirement]])  
+
 
*&nbsp; Specify color space parameter when starting KARMA (see [[#Map_Color_Space_Reads|Map Color Space Reads]])<br>
+
* Color space reads in color space FASTQ format (see [[#Input_file_requirement|Input file requirement]] for a description)  
 +
 
 +
* Color space reads longer than a minimum length requirement. (see [[#Minimum_read_length_requirement|Minimum read length requirement]])  
 +
 
 +
* Specify color space parameter when starting KARMA (see [[#Map_Color_Space_Reads|Map Color Space Reads]])  
    
Please note the hardware requirements for KARMA are:
Line 18: Line 22:  
<br> We show a complete example demonstrating the whole procedure from building the word index to mapping color space reads in [[#A_Complete_Example|A Complete Example]].
   −
= Build Binary Reference Genome and Word Index<br> =
+
= Build Binary Reference Genome and Word Index =
   −
First, build a binary version of the genome reference sequence as nucleotides (option: --createReference).&nbsp; Suppose that &nbsp; NCBI36.fa &nbsp; is a FASTA file which contains nucleotide sequences for all chromosomes.<br>
+
First, build a binary version of the genome reference sequence as nucleotides (option: --createReference).  Suppose that NCBI36.fa is a FASTA file which contains the nucleotide sequences for all chromosomes.
The command to invoke is:<br>
+
 
 +
The command to invoke is:
    
   karma --createReference --reference NCBI36.fa
 
<br>
(To let KARMA map nucleotide space reads, one would use instead ''--createIndex''&nbsp; to create both a binary sequence and the word index files.)<br>
+
(To let KARMA map nucleotide space reads, one would instead use ''--createIndex''&nbsp; to create both a packed binary sequence file and the word index files.)<br>
    
Second, one also needs to build color space versions of both the genome reference sequence (option: --createReference) and the word index files (option: --createIndex).&nbsp;  The same nucleotide FASTA file is used.&nbsp;  However, to avoid naming conflicts among the resulting binary files, we suggest appending "CS" to the base file name for clarity.&nbsp;  The command to invoke is:<br>
Line 38: Line 43:  
= Map Color Space Reads =
   −
KARMA expects valid color space FASTQ files as input.&nbsp; We often use the suffix .csfastq to distinguish these from nucleotide space reads.&nbsp; For a .csfastq&nbsp; file of single end color space reads named &nbsp; single.csfastq, &nbsp; invoke the command:<br>
+
KARMA expects valid color space FASTQ files as input.&nbsp; We often use the suffix .csfastq to distinguish these from nucleotide space reads.&nbsp; With a .csfastq &nbsp; file of single end color space reads named &nbsp; single.csfastq, &nbsp; invoke the command:<br>
    
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace single.csfastq
   −
This command line specifies both the nucleotide and color space reference sequences (and the word indexes, invisibly).&nbsp; The output will be written to a file in .sam&nbsp; format named&nbsp; "single.sam".<br>
+
This command line specifies both the nucleotide and color space reference sequences (and the word indexes, invisibly).&nbsp; The output will be written to a file in .sam format named &nbsp; "single.sam"&nbsp; derived from the .fastq&nbsp; file name.<br>
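The naming rule described above can be sketched as a small shell helper. This is only an illustration: the function name <code>sam_name_for</code> is hypothetical, and we assume that KARMA swaps the final extension of the read file for ".sam", which matches the two file names given on this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of KARMA): predict the .sam file name
# for a given read file.
# Assumption: the final extension of the input is replaced by ".sam".
sam_name_for() {
    input=$1
    # ${input%.*} strips the shortest trailing ".something"
    printf '%s.sam\n' "${input%.*}"
}

sam_name_for single.csfastq    # prints single.sam
sam_name_for pair.1.csfastq    # prints pair.1.sam
```

A helper like this can be handy for checking in advance whether a run would overwrite an existing .sam file.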
 
&nbsp;<br>
   −
Multiple input files are also acceptable, e.g.<br>
+
Multiple input files are also acceptable and will produce multiple .sam output files, e.g.<br>
    
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
   single.1.csfastq single.2.csfastq single.3.csfastq
+
   single.1.csfastq single.2.csfastq single.3.csfastq
    
For paired end color space reads, use the option "--pairedReads".&nbsp; Suppose the paired end reads are stored in two files,&nbsp; pair.1.csfastq&nbsp; and&nbsp; pair.2.csfastq.&nbsp; The command to invoke is:<br>
    
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
   --pairedReads pair.1.csfastq pair.2.csfastq
+
   --pairedReads pair.1.csfastq pair.2.csfastq
   −
The mapping results will be stored in a .sam&nbsp; file named&nbsp; "pair.sam", which contains reads from both files.&nbsp; If multiple paired end read files are specified on the command line, KARMA will pair the 1st and 2nd files, 3rd and 4th files and etc.<br>
+
The mapping results will be stored in a .sam&nbsp; file named&nbsp; "pair.1.sam", which contains reads from both files.&nbsp; If multiple paired end read files are specified on the command line, KARMA will pair the 1st and 2nd files, 3rd and 4th files, etc. and write output files&nbsp; "pair.1.sam", "pair.3.sam", etc.<br>
    
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
   --pairedReads pair.1.csfastq pair.2.csfastq pair.3.csfastq pair.4.csfastq
+
   --pairedReads pair.1.csfastq pair.2.csfastq pair.3.csfastq pair.4.csfastq
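To make the pairing rule concrete, the following dry run lists which files would be paired and which output file each pair would produce. It does not call KARMA. The function name <code>list_pairs</code> is hypothetical, and the output naming assumes that each .sam file takes its name from the first file of the pair, as in the example above.

```shell
#!/bin/sh
# Hypothetical dry run (not part of KARMA): show how a list of
# paired-end read files is consumed two at a time.
# Assumption: each output is named after the first file of its pair.
list_pairs() {
    while [ "$#" -ge 2 ]; do
        printf '%s + %s -> %s.sam\n' "$1" "$2" "${1%.*}"
        shift 2
    done
    # An odd number of files leaves one file without a mate.
    if [ "$#" -ne 0 ]; then
        printf 'unpaired file left over: %s\n' "$1" >&2
        return 1
    fi
}

list_pairs pair.1.csfastq pair.2.csfastq pair.3.csfastq pair.4.csfastq
# prints:
#   pair.1.csfastq + pair.2.csfastq -> pair.1.sam
#   pair.3.csfastq + pair.4.csfastq -> pair.3.sam
```

Running such a check first is a cheap way to catch a missing mate file or a wrong file order on the command line.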
   −
= <br> Additional Information<br> =
+
= Additional Information =
    
== Input file requirement ==
Line 73: Line 78:  
== Minimum read length requirement ==
   −
Keep in mind that KARMA requires color space reads that are at least twice as long as the index word size plus two (including the leading primer base).&nbsp; (For nucleotide space, the minimum read length is twice the word size.)&nbsp; For example, KARMA uses an index word size of 15 by default, so it will only map color space reads that are 32 base pairs or longer.<br>
+
Keep in mind that KARMA requires color space reads whose length is at least twice the index word size, plus two (that is, 2 × word size + 2, counting the leading primer base).&nbsp; (For nucleotide space, the minimum read length is twice the word size.)&nbsp; For example, KARMA uses an index word size of 15 by default, so it will only map color space reads that are 32 colors or longer (including the primer base).<br>
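The arithmetic behind this rule can be written out as a short shell sketch. The function names are hypothetical and exist only for illustration; the formulas are the ones stated above.

```shell
#!/bin/sh
# Minimum read lengths implied by the rule above.
# The argument is the index word size (15 by default, per the text).

# Color space: 2 * word size + 2, primer base included.
min_cs_read_length() {
    echo $(( 2 * $1 + 2 ))
}

# Nucleotide space: 2 * word size.
min_nt_read_length() {
    echo $(( 2 * $1 ))
}

min_cs_read_length 15    # prints 32
min_nt_read_length 15    # prints 30
```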
    
== Auxiliary tools ==
   −
The ABI SOLiD platform generates separate FASTA and quality files, usually named&nbsp; XXX.csfasta&nbsp; and&nbsp; XXX\_QV.qual.&nbsp; We provide a script ''solid2csfastq.py'' which converts these into a single color space FASTQ file named&nbsp; XXX.csfastq.&nbsp; We believe that a single color space FASTQ file simplifies post processing.<br>
+
The ABI SOLiD platform generates separate FASTA and quality files named&nbsp; XXX.csfasta&nbsp; and&nbsp; XXX_QV.qual.&nbsp; We provide a script&nbsp; ''solid2csfastq.py''&nbsp; which converts these into a single color space FASTQ file named&nbsp; XXX.csfastq.&nbsp; We believe that a single color space FASTQ file simplifies post processing.<br>
    
== Choose an appropriate size for word index ==
   −
Size for word index is sensitive to mapping performance. A small size of word index will increase the number of calculation cycles for a single read and duplications of a single word. On the other side, a big size will require much larger memory. Please also keep in mind that appropriate size is related to your hardware architecture. For practically purpose, we found size of 15 is optimal.
+
The length of the index words influences mapping performance.&nbsp; Using short index words increases the number of calculation cycles for a single read and the number of duplicate occurrences of each word.&nbsp; On the other hand, long index words require much more memory.&nbsp; Please also keep in mind that the appropriate size depends on your hardware architecture.&nbsp; For practical purposes, with at least 20 GB of RAM, we find that a size of 15 is optimal.
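One way to see why longer words need more memory is to count the distinct words of a given length. With four possible symbols per position (A/C/G/T in nucleotide space, colors 0 to 3 in color space), there are 4^k words of length k, so each added position multiplies the count by four. The sketch below only computes that count; how KARMA actually lays out its index in memory is not described here, so this is not a memory estimate.

```shell
#!/bin/sh
# Count the distinct words of length k over a 4-symbol alphabet:
# 4^k = 2^(2k), computed with a bit shift.
possible_words() {
    echo $(( 1 << (2 * $1) ))
}

possible_words 12    # prints 16777216
possible_words 15    # prints 1073741824
```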
    
= A Complete Example =
   −
&nbsp; A wrap-up message for quick start mapping color space reads.<br>
+
A quick-start summary for mapping color space reads.<br>
    
Building binary genome reference and word index:<br>
Line 92: Line 97:  
   ln -s NCBI36.fa NCBI36CS.fa
 
   karma --colorSpace --createReference --createIndex --reference NCBI36CS.fa
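For a reference with a different name, the build steps can be generated with a small dry-run script. This is a sketch under stated assumptions: the helper names are hypothetical, the FASTA file name is assumed to have an extension, and the script only prints the commands, using the options shown on this page, rather than running them.

```shell
#!/bin/sh
# Hypothetical helper (not part of KARMA): derive the color space
# reference name by inserting "CS" before the final extension,
# e.g. NCBI36.fa -> NCBI36CS.fa. Assumes the name has an extension.
cs_reference_name() {
    fasta=$1
    printf '%sCS.%s\n' "${fasta%.*}" "${fasta##*.}"
}

# Print (do not run) the build commands for a nucleotide FASTA file.
print_build_commands() {
    fasta=$1
    cs=$(cs_reference_name "$fasta")
    echo "karma --createReference --reference $fasta"
    echo "ln -s $fasta $cs"
    echo "karma --colorSpace --createReference --createIndex --reference $cs"
}

print_build_commands NCBI36.fa
```

Printing the commands first lets one review the derived file names before committing to a long index build.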
  −
<br>
      
Mapping color space reads:<br>
Line 99: Line 102:  
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace single.csfastq
 
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
   --pairedReads pair1.csfastq pair2.csfastq
+
   --pairedReads pair.1.csfastq pair.2.csfastq
   −
The output files are ''single.sam'' and ''pair1.sam'' and they conform SAM specification.
+
The output files are&nbsp; ''single.sam''&nbsp; and&nbsp; ''pair.1.sam''&nbsp; and they conform to the .sam format specification.
    
<br>