Changes

123 bytes added ,  23:54, 27 January 2010
no edit summary
Line 5: Line 5:  
We summarize the input data requirements as follows:
 
We summarize the input data requirements as follows:
   −
*  A binary conversion of the genome reference sequence as nucleotides (see [[#Build_Binary_Reference_Genome_and_Word_Index|Build Binary Reference Genome and Word Index]]})  
+
* A binary conversion of the genome reference sequence as nucleotides (see [[#Build_Binary_Reference_Genome_and_Word_Index|Build Binary Reference Genome and Word Index]])  
*  A binary conversion of the genome reference sequence as colors plus word indices in color space (see [[#Build_Binary_Reference_Genome_and_Word_Index|Build Binary Reference Genome and Word Index]])  
+
 
*  Color space reads in color space FASTQ format (see [[#Input_file_requirement|Input file requirement]] for a description)  
+
* A binary conversion of the genome reference sequence as colors plus word indices in color space (see [[#Build_Binary_Reference_Genome_and_Word_Index|Build Binary Reference Genome and Word Index]])  
*  Color space reads longer than a minimum length requirement. (see [[#Minimum_read_length_requirement|Minimum read length requirement]])  
+
 
*&nbsp; Specify color space parameter when starting KARMA (see [[#Map_Color_Space_Reads|Map Color Space Reads]])<br>
+
* Color space reads in color space FASTQ format (see [[#Input_file_requirement|Input file requirement]] for a description)  
 +
 
 +
* Color space reads longer than a minimum length requirement. (see [[#Minimum_read_length_requirement|Minimum read length requirement]])  
 +
 
 +
* Specify color space parameter when starting KARMA (see [[#Map_Color_Space_Reads|Map Color Space Reads]])  
    
Please note the hardware requirements for KARMA are:
 
Please note the hardware requirements for KARMA are:
Line 18: Line 22:  
<br> We show a complete example demonstrating the whole procedure from building the word index to mapping color space reads in [[#A_Complete_Example|A Complete Example]].
 
<br> We show a complete example demonstrating the whole procedure from building the word index to mapping color space reads in [[#A_Complete_Example|A Complete Example]].
   −
= Build Binary Reference Genome and Word Index<br> =
+
= Build Binary Reference Genome and Word Index =
 +
 
 +
First, build a binary version of the genome reference sequence as nucleotides (option: --createReference).  Suppose that NCBI36.fa is a FASTA file which contains the nucleotide sequences for all chromosomes.
   −
First, build a binary version of the genome reference sequence as nucleotides (option: --createReference).&nbsp;  Suppose that &nbsp; NCBI36.fa &nbsp; is a FASTA file which contains nucleotide sequences for all chromosomes.<br>
+
The command to invoke is:
The command to invoke is:<br>
      
   karma --createReference --reference NCBI36.fa
 
   karma --createReference --reference NCBI36.fa
 
<br>
 
<br>
(To let KARMA map nucleotide space reads, one would use instead ''--createIndex''&nbsp; to create both a binary sequence and the word index files.)<br>
+
(To let KARMA map nucleotide space reads, one would use instead ''--createIndex''&nbsp; to create both a packed binary sequence file and the word index files.)<br>
    
Second, one also needs to build color space versions of both the genome reference sequence (option: --createReference) and the word index files (option: --createIndex).&nbsp;  The same nucleotide FASTA file is used.&nbsp;  However, to avoid naming conflicts among the resulting binary files, we suggest appending "CS" to the base file name for clarity.&nbsp;  The command to invoke is:<br>
 
Second, one also needs to build color space versions of both the genome reference sequence (option: --createReference) and the word index files (option: --createIndex).&nbsp;  The same nucleotide FASTA file is used.&nbsp;  However, to avoid naming conflicts among the resulting binary files, we suggest appending "CS" to the base file name for clarity.&nbsp;  The command to invoke is:<br>
Line 38: Line 43:  
= Map Color Space Reads =
 
= Map Color Space Reads =
   −
KARMA expects valid color space FASTQ files as input.&nbsp; We often use the suffix .csfastq to distinguish these from nucleotide space reads.&nbsp; For a .csfastq&nbsp; file of single end color space reads named &nbsp; single.csfastq, &nbsp; invoke the command:<br>
+
KARMA expects valid color space FASTQ files as input.&nbsp; We often use the suffix .csfastq to distinguish these from nucleotide space reads.&nbsp; With a .csfastq &nbsp; file of single end color space reads named &nbsp; single.csfastq, &nbsp; invoke the command:<br>
    
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace single.csfastq
 
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace single.csfastq
   −
This command line specifies both the nucleotide and color space reference sequences (and the word indexes, invisibly).&nbsp; The output will be written to a file in .sam&nbsp; format named&nbsp; "single.sam".<br>
+
This command line specifies both the nucleotide and color space reference sequences (and the word indexes, invisibly).&nbsp; The output will be written to a file in .sam format named &nbsp; "single.sam"&nbsp; derived from the .fastq&nbsp; file name.<br>
 
&nbsp;<br>
 
&nbsp;<br>
   −
Multiple input files are also acceptable, e.g.<br>
+
Multiple input files are also acceptable and will produce multiple .sam output files, e.g.<br>
    
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
 
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
   single.1.csfastq single.2.csfastq single.3.csfastq
+
   single.1.csfastq single.2.csfastq single.3.csfastq
    
For paired end color space reads, use the option "--pairedReads".&nbsp; Suppose the paired end reads are stored in two files,&nbsp; pair.1.csfastq&nbsp; and&nbsp; pair.2.csfastq.&nbsp; The command to invoke is:<br>
 
For paired end color space reads, use the option "--pairedReads".&nbsp; Suppose the paired end reads are stored in two files,&nbsp; pair.1.csfastq&nbsp; and&nbsp; pair.2.csfastq.&nbsp; The command to invoke is:<br>
    
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
 
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
   --pairedReads pair.1.csfastq pair.2.csfastq
+
   --pairedReads pair.1.csfastq pair.2.csfastq
   −
The mapping results will be stored in a .sam&nbsp; file named&nbsp; "pair.1.sam", which contains reads from both files.&nbsp; If multiple paired end read files are specified on the command line, KARMA will pair the 1st and 2nd files, 3rd and 4th files and etc.<br>
+
The mapping results will be stored in a .sam&nbsp; file named&nbsp; "pair.1.sam", which contains reads from both files.&nbsp; If multiple paired end read files are specified on the command line, KARMA will pair the 1st and 2nd files, the 3rd and 4th files, and so on, and write one output file per pair:&nbsp; "pair.1.sam", "pair.3.sam", etc.<br>
    
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
 
   karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
   --pairedReads pair.1.csfastq pair.2.csfastq pair.3.csfastq pair.4.csfastq
+
   --pairedReads pair.1.csfastq pair.2.csfastq pair.3.csfastq pair.4.csfastq
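
The output naming described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not part of KARMA: the function name <code>expected_sam_outputs</code> is hypothetical, and raising an error for an odd number of paired files is an assumption of the sketch (the text does not say how KARMA handles that case).

```python
from pathlib import Path


def expected_sam_outputs(read_files, paired=False):
    """Return the .sam names implied by the naming rule in the text.

    Hypothetical helper for illustration only; it does not call KARMA.
    """
    if not paired:
        # Single end: one .sam per input file, named after that file.
        return [str(Path(f).with_suffix(".sam")) for f in read_files]
    if len(read_files) % 2 != 0:
        # Assumption of this sketch: an unpaired trailing file is an error.
        raise ValueError("paired mode needs an even number of read files")
    # Paired end: files are paired (1st, 2nd), (3rd, 4th), ...
    # and each pair's output is named after its first file.
    return [str(Path(first).with_suffix(".sam")) for first in read_files[0::2]]


print(expected_sam_outputs(
    ["pair.1.csfastq", "pair.2.csfastq", "pair.3.csfastq", "pair.4.csfastq"],
    paired=True,
))
```

A check like this can be handy in a pipeline script for predicting which .sam files to expect before launching a long mapping run.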
   −
= <br> Additional Information<br> =
+
= Additional Information =
    
== Input file requirement ==
 
== Input file requirement ==
Line 73: Line 78:  
== Minimum read length requirement ==
 
== Minimum read length requirement ==
   −
Keep in mind that KARMA requires color space reads that are at least twice as long as the index word size plus two (including the leading primer base).&nbsp; (For nucleotide space, the minimum read length is twice the word size.)&nbsp; For example, KARMA uses an index word size of 15 by default, so it will only map color space reads that are 32 base pairs or longer.<br>
+
Keep in mind that KARMA requires color space reads whose length is at least twice the index word size, plus two (including the leading primer base).&nbsp; (For nucleotide space, the minimum read length is twice the word size.)&nbsp; For example, KARMA uses an index word size of 15 by default, so it will only map color space reads that are 2 &times; 15 + 2 = 32 colors or longer (including the primer base).<br>
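
The arithmetic behind this requirement can be written out as a small Python sketch. The function names are hypothetical and exist only to make the formula explicit; the default word size of 15 is taken from the text.

```python
def min_color_space_read_length(word_size=15):
    """Minimum color space read length, counting the leading primer base."""
    return 2 * word_size + 2


def min_nucleotide_read_length(word_size=15):
    """Minimum nucleotide space read length."""
    return 2 * word_size


print(min_color_space_read_length())   # 32 with the default word size of 15
print(min_nucleotide_read_length())    # 30 with the default word size of 15
```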
    
== Auxiliary tools ==
 
== Auxiliary tools ==
   −
The ABI SOLiD platform generates separate FASTA and quality files named&nbsp; XXX.csfasta&nbsp; and&nbsp; XXX\_QV.qual.&nbsp; We provide a script&nbsp; ''solid2csfastq.py''&nbsp; which converts these into a single color space FASTQ file named&nbsp; XXX.csfastq.&nbsp; We believe that a single color space FASTQ file simplifies post processing.<br>
+
The ABI SOLiD platform generates separate FASTA and quality files named&nbsp; XXX.csfasta&nbsp; and&nbsp; XXX_QV.qual.&nbsp; We provide a script&nbsp; ''solid2csfastq.py''&nbsp; which converts these into a single color space FASTQ file named&nbsp; XXX.csfastq.&nbsp; We believe that a single color space FASTQ file simplifies post processing.<br>
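
To illustrate what such a conversion involves, here is a minimal Python sketch that merges sequence and quality records into FASTQ-style four-line records. It is not the ''solid2csfastq.py'' script, and its output may differ from what that script produces. The assumptions are: each record is a ">name" line followed by a single data line, lines starting with "#" are comments, quality values are space-separated integers encoded as Phred+33, and negative values are clamped to 0. The sketch does not pad a quality value for the primer base; see [[#Input_file_requirement|Input file requirement]] for the format KARMA actually expects.

```python
def read_records(lines):
    """Yield (name, data) pairs from '>name' / data line pairs."""
    name = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            yield name, line
            name = None


def merge_csfasta_qual(csfasta_lines, qual_lines):
    """Yield FASTQ-style records built from parallel csfasta and qual lines."""
    pairs = zip(read_records(csfasta_lines), read_records(qual_lines))
    for (name, colors), (qual_name, quals) in pairs:
        if name != qual_name:
            raise ValueError(f"record mismatch: {name} vs {qual_name}")
        # Assumption: Phred+33 encoding, values clamped to the range 0..93.
        scores = [min(max(int(q), 0), 93) for q in quals.split()]
        quality = "".join(chr(s + 33) for s in scores)
        yield f"@{name}\n{colors}\n+\n{quality}\n"


csfasta = [">read1", "T0123"]
qual = [">read1", "30 20 10 -1"]
print("".join(merge_csfasta_qual(csfasta, qual)), end="")
```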
    
== Choose an appropriate size for word index ==
 
== Choose an appropriate size for word index ==