Difference between revisions of "Karma-colorspace"

From Genome Analysis Wiki
Latest revision as of 23:54, 27 January 2010

Overview

KARMA (K-tuple Alignment with Rapid Matching Algorithm) is able to map 35 bp single end color space reads at a speed of approximately 1.2-2.0 × 10^9 reads per hour using an Intel Xeon X760 2.66 GHz and 128 GB of memory.

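We summarize the input data requirements as follows:

  • A binary conversion of the genome reference sequence as nucleotides (see Build Binary Reference Genome and Word Index)
  • A binary conversion of the genome reference sequence as colors, plus word indices in color space (see Build Binary Reference Genome and Word Index)
  • Color space reads in color space FASTQ format (see Input file requirement for a description)
  • Color space reads longer than a minimum length requirement (see Minimum read length requirement)
  • The color space parameter specified when starting KARMA (see Map Color Space Reads)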

Please note the hardware requirements for KARMA are:

  • 20G memory. By using shared memory for the word index tables, multiple instances of KARMA can run on one machine without using more memory than running a single instance.
  • 30G disk space


We show a complete example demonstrating the whole procedure from building the word index to mapping color space reads in A Complete Example.

Build Binary Reference Genome and Word Index

First, build a binary version of the genome reference sequence as nucleotides (option: --createReference). Suppose that NCBI36.fa is a FASTA file which contains the nucleotide sequences for all chromosomes.

The command to invoke is:

  karma --createReference --reference NCBI36.fa


(To let KARMA map nucleotide space reads, one would instead use --createIndex to create both a packed binary sequence file and the word index files.)

Second, one also needs to build color space versions of both the genome reference sequence (option: --createReference) and the word index files (option: --createIndex).  The same nucleotide FASTA file is used.  However, to avoid naming conflicts among the resulting binary files, we suggest appending "CS" to the base file name for clarity.  The command to invoke is:

  ln -s NCBI36.fa NCBI36CS.fa
  karma --colorSpace --createReference --createIndex --reference NCBI36CS.fa


When building the index files, one can set the word length for indexing. We recommend N = 15 (the default value) for the human genome on a machine with at least 20 GB of RAM. Shorter index words will decrease the memory footprint at the cost of increased run time. However, the word length must not exceed half the length of the color space reads you intend to map, minus 1. (See Choose an appropriate size for word index for more discussion.) Specify --wordSize N in order to use N as the word size.
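The word length limit above can be worked out before building an index. The following is a minimal sketch, assuming the read length counts the leading primer base; max_word_size is a hypothetical helper name and is not part of KARMA.

```shell
# Hypothetical helper (not part of KARMA): largest index word size allowed
# for color space reads of a given length (primer base included),
# i.e. half the read length (rounded down) minus 1.
max_word_size() {
    read_len=$1
    echo $(( read_len / 2 - 1 ))
}

max_word_size 32   # prints 15, the default word size
max_word_size 35   # prints 16
```

If the value printed for your shortest reads is below 15, a smaller --wordSize would be needed for those reads to be mapped.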


Map Color Space Reads

KARMA expects valid color space FASTQ files as input. We often use the suffix .csfastq to distinguish these from nucleotide space reads. With a file of single end color space reads named single.csfastq, invoke the command:

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace single.csfastq

This command line specifies both the nucleotide and color space reference sequences (and, implicitly, the word indexes). The output will be written in .sam format to a file named "single.sam", a name derived from the .csfastq file name.
 

Multiple input files are also acceptable and will produce multiple .sam output files, e.g.

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  single.1.csfastq  single.2.csfastq  single.3.csfastq

For paired end color space reads, use the option "--pairedReads". Suppose the paired end reads are stored in two files, pair.1.csfastq and pair.2.csfastq. The command to invoke is:

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  --pairedReads  pair.1.csfastq  pair.2.csfastq

The mapping results will be stored in a .sam file named "pair.1.sam", which contains reads from both files. If multiple paired end read files are specified on the command line, KARMA will pair the 1st and 2nd files, the 3rd and 4th files, and so on, and write the output files "pair.1.sam", "pair.3.sam", etc. For example:

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  --pairedReads  pair.1.csfastq  pair.2.csfastq  pair.3.csfastq  pair.4.csfastq
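The output names described above follow a simple pattern, which can be sketched in shell for use in downstream scripts. This is only an illustration of the naming rule as stated in this section; sam_name and paired_sam_names are hypothetical helpers, not KARMA commands.

```shell
# Hypothetical helper (not part of KARMA): predict the .sam file name
# from an input file name by swapping the .csfastq suffix for .sam.
sam_name() {
    echo "${1%.csfastq}.sam"
}

# For paired end input, files are consumed two at a time and only the
# first file of each pair (1st, 3rd, ...) gives the output name.
paired_sam_names() {
    while [ "$#" -ge 2 ]; do
        sam_name "$1"
        shift 2
    done
}

sam_name single.csfastq
paired_sam_names pair.1.csfastq pair.2.csfastq pair.3.csfastq pair.4.csfastq
```

Run as written, this prints single.sam, pair.1.sam and pair.3.sam, one per line.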

Additional Information

Input file requirement

KARMA requires input files in color space FASTQ format. The length of each read (which includes the leading primer base) should equal the length of its quality string. An example of a valid color space FASTQ file follows:

 @Chromosome_20_048435095_Genome_2757096147
 A02232200222021320012102212311002212
 +
 !!1111111111111111111111111111111111
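The length requirement can be checked before mapping. The sketch below assumes every record occupies exactly four lines (name, read, "+", quality), as in the example above; check_csfastq is a hypothetical helper name and is not shipped with KARMA.

```shell
# Hypothetical check (not part of KARMA): report records whose read string
# and quality string differ in length. Assumes four-line records.
check_csfastq() {
    awk '
        NR % 4 == 1 { name = $0 }               # record name line
        NR % 4 == 2 { read_len = length($0) }   # read, primer base included
        NR % 4 == 0 && length($0) != read_len {
            printf "%s: read length %d, quality length %d\n", name, read_len, length($0)
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example use: check_csfastq single.csfastq || echo "fix the input first"
```

The function prints one line per mismatched record and returns a non-zero status if any were found, so it can guard a mapping command in a script.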

Minimum read length requirement

Keep in mind that KARMA requires color space reads whose length, including the leading primer base, is at least twice the index word size plus two, that is, 2 × (word size) + 2. (For nucleotide space, the minimum read length is twice the word size.) For example, KARMA uses an index word size of 15 by default, so it will only map color space reads that are 32 characters or longer, counting the primer base.
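To see how many reads in a file fall below this minimum, something like the following could be used. It is a sketch under the same four-line-record assumption as the FASTQ example; count_short_reads is a hypothetical helper and its arguments (file name, word size) are illustrative.

```shell
# Hypothetical helper (not part of KARMA): count reads in a color space
# FASTQ file that are shorter than 2 * word size + 2 characters
# (primer base included). Assumes four-line records.
count_short_reads() {
    awk -v min_len=$(( 2 * $2 + 2 )) '
        NR % 4 == 2 && length($0) < min_len { short++ }
        END { print short + 0 }
    ' "$1"
}

# Example use with the default word size:
#   count_short_reads single.csfastq 15
```

A non-zero count suggests either filtering those reads out or rebuilding the index with a smaller word size.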

Auxiliary tools

The ABI SOLiD platform generates separate FASTA and quality files named XXX.csfasta and XXX_QV.qual. We provide a script, solid2csfastq.py, which converts these into a single color space FASTQ file named XXX.csfastq. We believe that a single color space FASTQ file simplifies post processing.

Choose an appropriate size for word index

The length of the index words influences mapping performance. Short index words increase both the number of calculation cycles needed for a single read and the number of duplicate occurrences of each word. On the other hand, long index words require much more memory. Please also keep in mind that the appropriate size depends on your hardware architecture. For practical purposes, with at least 20 GB of RAM, we find that a size of 15 is optimal.

A Complete Example

A quick-start summary for mapping color space reads.

Building binary genome reference and word index:

  karma --createReference --reference NCBI36.fa
  ln -s NCBI36.fa NCBI36CS.fa
  karma --colorSpace --createReference --createIndex --reference NCBI36CS.fa

Mapping color space reads:

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace single.csfastq
  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  --pairedReads pair.1.csfastq pair.2.csfastq

The output files are single.sam and pair.1.sam, and they conform to the .sam format specification.