Difference between revisions of "Karma-colorspace"

From Genome Analysis Wiki

Revision as of 06:35, 15 November 2009

Overview

KARMA (K-tuple Alignment with Rapid Matching Algorithm) maps 35 bp single end color space reads at approximately 1.2 to 2.0 × 10^9 reads per hour on an Intel Xeon X760 2.66 GHz machine with 128 GB of memory.

The software requirements are as follows:

  • Binary reference genome in nucleotide space (see "Build Binary Reference Genome and Word Index")
  • Binary reference genome and word index in color space (see "Build Binary Reference Genome and Word Index")
  • Color space reads in valid color space FASTQ format (see "Input file requirement" for the file specification)
  • Color space reads that meet the minimum length requirement (see "Minimum read length requirement")
  • The color space parameter specified when starting KARMA (see "Map Color Space Reads")

The hardware requirements for KARMA are:

  • 20 GB of memory. Because KARMA processes share memory, running multiple KARMA processes on the same machine consumes about the same amount of memory as running one process.
  • 30 GB of disk space


The section "A Complete Example" walks through the whole procedure, from building the word index to mapping color space reads.

Build Binary Reference Genome and Word Index

  First, build the binary reference genome in nucleotide space (option: --createReference).
  Note: to let KARMA map nucleotide space reads, you also need to use --createIndex to create the word index file.

  Assume NCBI36.fa is a FASTA file that contains the sequences of all chromosomes.
  The command to invoke is:

  karma --createReference --reference NCBI36.fa


  Second, build the binary reference genome (option: --createReference) and the word index (option: --createIndex)
  in color space. The same FASTA file is needed. However, to avoid naming conflicts, we suggest appending "CS"
  to the base file name for clarity. The commands to invoke are:

  ln -s NCBI36.fa NCBI36CS.fa
  karma --colorSpace --createReference --createIndex --reference NCBI36CS.fa

  An important parameter is the word size used for indexing.
  We recommend 15 (the default value) for the human reference genome.
  Specify --wordSize N if you would like to use N as the word size.
  Changing the word size typically changes mapping performance (see "Choose an appropriate size for word index" for more discussion).

  Note that multiple chromosomes are supported.
  In the current version, KARMA takes one FASTA file that contains the sequences of all chromosomes.


Map Color Space Reads

  KARMA takes valid color space FASTQ files as input.
  We usually use the suffix .csfastq to distinguish them from nucleotide space reads.
  For single end color space reads, invoke the command:

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace single.csfastq

  Mapping results are stored in a SAM file named "single.sam".

  Multiple input files are also accepted, e.g.

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  single1.csfastq single2.csfastq single3.csfastq


  For paired end color space reads, the option "--pairedReads" is required.
  Suppose the paired end reads are stored in the files pair1.csfastq and pair2.csfastq.
  The command to invoke is:

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  --pairedReads pair1.csfastq pair2.csfastq

  Mapping results are stored in a SAM file named "pair1.sam", which contains reads from both files.

  Similarly, multiple paired end read files can be specified on the command line, and KARMA will pair the 1st and 2nd files, the 3rd and 4th files, and so on.

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  --pairedReads pair1.csfastq pair2.csfastq pair3.csfastq pair4.csfastq
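
The pairing order described above can be illustrated with a small sketch. This is not KARMA code: the function name `pair_read_files` is hypothetical, and rejecting an odd number of files is an assumption, since the text does not say how KARMA handles that case.

```python
def pair_read_files(files):
    """Group input files as (1st, 2nd), (3rd, 4th), ... in command-line order.

    Hypothetical helper for illustration only; raising on an odd number of
    files is an assumption, not documented KARMA behavior.
    """
    if len(files) % 2 != 0:
        raise ValueError("paired end mapping needs an even number of files")
    return [(files[i], files[i + 1]) for i in range(0, len(files), 2)]


if __name__ == "__main__":
    inputs = ["pair1.csfastq", "pair2.csfastq", "pair3.csfastq", "pair4.csfastq"]
    for first, second in pair_read_files(inputs):
        print(first, "<->", second)
```

Listing the files in mate order on the command line therefore matters: swapping pair2.csfastq and pair3.csfastq would pair the wrong files.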


Additional Information

Input file requirement

  KARMA requires input files in valid color space FASTQ format.
  The length of each read (including the leading primer) must equal the length of its quality string.

  A valid example of a color space FASTQ record:

  @Chromosome_20_048435095_Genome_2757096147
  A02232200222021320012102212311002212
  +
  !!1111111111111111111111111111111111
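
The length rule above can be checked before mapping with a short script. The following is a minimal sketch, not part of KARMA: the function name `check_csfastq_record` is hypothetical, and the read alphabet it accepts (a primer base A, C, G, or T followed by colors 0 to 3, with '.' for a missing call) is an assumption about color space reads that this page does not spell out.

```python
def check_csfastq_record(header, read, separator, quality):
    """Return a list of problems found in one 4-line color space FASTQ record.

    Hypothetical helper for illustration; an empty list means no problem found.
    """
    problems = []
    if not header.startswith("@"):
        problems.append("header line must start with '@'")
    if not separator.startswith("+"):
        problems.append("third line must start with '+'")
    # Rule stated in the text: read (with leading primer) and quality
    # string must have the same length.
    if len(read) != len(quality):
        problems.append(
            "read length %d differs from quality length %d" % (len(read), len(quality))
        )
    # Assumed alphabet: primer base, then colors 0-3 or '.' for a missing call.
    if not read or read[0] not in "ACGT":
        problems.append("read must start with a primer base (A, C, G or T)")
    elif any(color not in "0123." for color in read[1:]):
        problems.append("colors after the primer must be 0, 1, 2, 3 or '.'")
    return problems


if __name__ == "__main__":
    print(check_csfastq_record("@read1", "A0123", "+", "!!111"))
    print(check_csfastq_record("@read1", "A0123", "+", "!!11"))
```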

Minimum read length requirement

  The minimum color space read length for KARMA is twice the word size plus two (including the leading primer).
  For nucleotide space, the minimum length requirement is twice the word size.
  For example, KARMA uses a word size of 15 by default, so the minimum color space read length
  is 2 × 15 + 2 = 32 (including the leading primer).
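
The rule above is simple arithmetic; the sketch below writes it out for both read types. The function name `min_read_length` is hypothetical and only restates the formula given in the text.

```python
def min_read_length(word_size=15, color_space=True):
    """Minimum read length from the rule in the text.

    Color space: 2 * word_size + 2 (the leading primer is counted).
    Nucleotide space: 2 * word_size.
    """
    return 2 * word_size + (2 if color_space else 0)


if __name__ == "__main__":
    print("color space, word size 15:", min_read_length(15))
    print("nucleotide space, word size 15:", min_read_length(15, color_space=False))
```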

Auxiliary tools

  The ABI SOLiD platform generates a FASTA file (e.g. XXX.csfasta) and a quality file (e.g. XXX_QV.qual) separately. We wrote a script, solid2csfastq.py, to convert them into a single color space FASTQ file (e.g. XXX.csfastq). We believe a single color space FASTQ file simplifies post-processing.
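
To make the conversion concrete, here is a minimal sketch of how the two SOLiD files could be merged. It is not the actual solid2csfastq.py. It assumes that each record occupies one header line and one data line, that lines starting with '#' are comments, that qualities are encoded as Phred+33 characters with negative values clamped to 0, and that a placeholder '!' is prepended for the leading primer so the quality string is as long as the read. The names `parse_records` and `to_csfastq` are hypothetical.

```python
def parse_records(lines):
    """Yield (name, body) pairs from '>name' / body line pairs, skipping '#' comments."""
    name = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            yield name, line
            name = None


def to_csfastq(csfasta_lines, qual_lines):
    """Yield one 4-line color space FASTQ record (as a string) per read.

    Assumptions (not taken from solid2csfastq.py): Phred+33 encoding,
    negative qualities clamped to 0, and a '!' placeholder for the primer.
    """
    reads = parse_records(csfasta_lines)
    quals = parse_records(qual_lines)
    for (name, read), (qual_name, values) in zip(reads, quals):
        if name != qual_name:
            raise ValueError("read %r does not match quality entry %r" % (name, qual_name))
        scores = [min(max(int(v), 0), 93) for v in values.split()]
        quality = "!" + "".join(chr(33 + s) for s in scores)
        yield "@%s\n%s\n+\n%s" % (name, read, quality)


if __name__ == "__main__":
    csfasta = ["# comment line", ">read1", "T0123"]
    qual = ["# comment line", ">read1", "10 20 -1 30"]
    for record in to_csfastq(csfasta, qual):
        print(record)
```

Keeping the read and its qualities in one record means a later step cannot accidentally pair a read with the wrong quality line.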

Choose an appropriate size for word index

Mapping performance is sensitive to the word size used for the index. A small word size increases the number of calculation cycles per read and the number of times a single word occurs in the genome. On the other hand, a large word size requires much more memory. Keep in mind that the appropriate size also depends on your hardware architecture. For practical purposes, we found a word size of 15 to be optimal.
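
A back-of-the-envelope calculation shows why the trade-off is so steep. The sketch below assumes a four-letter alphabet (four bases or four colors), so there are 4^N possible words of size N, and it uses a round genome length of 3 × 10^9 positions for human. Both the helper names and that genome length are illustrative assumptions. KARMA's actual index layout and memory use are not described here.

```python
def word_count(word_size):
    """Number of distinct words of the given size over a 4-letter alphabet."""
    return 4 ** word_size


def mean_occurrences(genome_length, word_size):
    """Average number of genome positions per possible word (uniform assumption)."""
    return genome_length / word_count(word_size)


if __name__ == "__main__":
    genome_length = 3 * 10 ** 9  # rough human genome size, an assumption
    for n in (12, 15, 18):
        print(n, word_count(n), round(mean_occurrences(genome_length, n), 2))
```

Each extra base multiplies the number of possible words by four, while the average number of occurrences per word drops by the same factor. That matches the trade-off described above between repeated words at small sizes and memory at large sizes.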

A Complete Example


  A quick-start summary for mapping color space reads.

Building binary genome reference and word index:

  karma --createReference --reference NCBI36.fa
  ln -s NCBI36.fa NCBI36CS.fa
  karma --colorSpace --createReference --createIndex --reference NCBI36CS.fa


Mapping color space reads:

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace single.csfastq
  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  --pairedReads pair1.csfastq pair2.csfastq

The output files are single.sam and pair1.sam, and they conform to the SAM specification.