Revision as of 21:43, 19 November 2009

Overview

KARMA (K-tuple Alignment with Rapid Matching Algorithm) can map 35 bp single end color space reads at a speed of approximately $1.2-2.0\times 10^{9}$ reads per hour on an Intel Xeon X760 2.66GHz machine with 128 GB of memory.

We summarize the input data requirements as follows:

A binary conversion of the genome reference sequence as nucleotides (see Build Binary Reference Genome and Word Index)
A binary conversion of the genome reference sequence as colors, plus word indices in color space (see Build Binary Reference Genome and Word Index)
Color space reads in color space FASTQ format (see Input file requirement for a description)
Color space reads that meet the minimum length requirement (see Minimum read length requirement)
The color space option (--colorSpace) specified when starting KARMA (see Map Color Space Reads)

Please note the hardware requirements for KARMA are:

20 GB of memory. By using shared memory for the word index tables, multiple instances of KARMA can run on one machine without using more memory than a single instance.
30 GB of disk space

We show a complete example demonstrating the whole procedure from building the word index to mapping color space reads in A Complete Example.

Build Binary Reference Genome and Word Index

First, build a binary version of the genome reference sequence as nucleotides (option: --createReference). Suppose that NCBI36.fa is a FASTA file which contains nucleotide sequences for all chromosomes.
The command to invoke is:

  karma --createReference --reference NCBI36.fa

(To let KARMA map nucleotide space reads, one would use instead --createIndex to create both a binary sequence and the word index files.)

Second, one also needs to build color space versions of both the genome reference sequence (option: --createReference) and the word index files (option: --createIndex). The same nucleotide FASTA file is used. However, to avoid naming conflicts among the resulting binary files, we suggest appending "CS" to the base file name. The commands to invoke are:

  ln -s NCBI36.fa NCBI36CS.fa
  karma --colorSpace --createReference --createIndex --reference NCBI36CS.fa
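For readers new to color space, the following sketch illustrates the standard SOLiD di-base encoding on which a color space reference is based: each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). It is only an illustration and not KARMA's internal implementation; the function names are hypothetical.

```python
# Illustrative sketch of SOLiD di-base (color space) encoding.
# Not KARMA's code; all names here are hypothetical.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colors(bases):
    """Return the color string for a nucleotide sequence (one color per adjacent pair)."""
    codes = [BASE_CODE[b] for b in bases.upper()]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))


def decode_colors(read):
    """Decode a color space read (leading primer base + colors) into nucleotides.

    The primer base itself is not part of the returned sequence.
    """
    current = BASE_CODE[read[0].upper()]
    out = []
    for color in read[1:]:
        current ^= int(color)  # each color tells how to step from one base to the next
        out.append(CODE_BASE[current])
    return "".join(out)
```

For instance, encode_colors("ATGGC") gives "3103", and decode_colors("A3103") gives "TGGC". Because every base is decoded relative to the previous one, a single color error changes all downstream bases, which is one reason to map in color space against a color space reference.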

When building the index files, one can set the word length for indexing. We recommend N = 15 (the default value) for the human genome on a machine with at least 20 GB of RAM. Shorter index words will decrease the memory footprint at the cost of increased run time. However, the word length must not exceed half the length of the color space reads you intend to map, minus 1. (See Choose an appropriate size for word index for more discussion.) Specify --wordSize N in order to use N as the word size.
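The upper bound on the word length can be worked out with a few lines of Python. This is a minimal sketch, assuming that the read length counts the leading primer base and that fractional halves are rounded down; the function name is hypothetical, and the nucleotide space branch uses the rule quoted under Minimum read length requirement.

```python
def max_word_size(read_length, color_space=True):
    """Largest word size allowed for reads of the given length.

    For color space reads, read_length is assumed to include the primer base.
    """
    if color_space:
        return read_length // 2 - 1  # half the read length, minus 1
    return read_length // 2          # nucleotide space: half the read length


# A 36-character color space read (primer + 35 colors) allows words up to 17,
# so the default word size of 15 is acceptable.
print(max_word_size(36))
```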

Map Color Space Reads

KARMA expects valid color space FASTQ files as input. We often use the suffix .csfastq to distinguish these from nucleotide space reads. For a .csfastq file of single end color space reads named single.csfastq, invoke the command:

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace single.csfastq

This command line specifies both the nucleotide and color space reference sequences (and, implicitly, the word indexes). The output will be written to a file in .sam format named "single.sam".

Multiple input files are also acceptable, e.g.

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  single.1.csfastq single.2.csfastq single.3.csfastq

For paired end color space reads, use the option "--pairedReads". Suppose the paired end reads are stored in two files, pair.1.csfastq and pair.2.csfastq. The command to invoke is:

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  --pairedReads pair.1.csfastq pair.2.csfastq

The mapping results will be stored in a .sam file named "pair.sam", which contains reads from both files. If multiple paired end read files are specified on the command line, KARMA will pair the 1st and 2nd files, the 3rd and 4th files, and so on.

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  --pairedReads pair.1.csfastq pair.2.csfastq pair.3.csfastq pair.4.csfastq
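The pairing rule can be sketched in Python as follows. This is a hypothetical helper for scripting KARMA runs, not part of KARMA itself; it only builds the argument list from the options shown above, and the check for an even number of files is our assumption about sensible usage rather than documented KARMA behavior.

```python
def pair_read_files(paths):
    """Group an ordered list of read files into (first, second) pairs."""
    if len(paths) % 2 != 0:
        # Assumption: an odd number of files cannot be paired sensibly.
        raise ValueError("paired end mapping needs an even number of files")
    return list(zip(paths[0::2], paths[1::2]))


def paired_command(paths, reference="NCBI36.fa", cs_reference="NCBI36CS.fa"):
    """Build the argument list for a paired end color space run (does not run it)."""
    pair_read_files(paths)  # validate the file count before building the command
    return ["karma", "--reference", reference, "--csReference", cs_reference,
            "--colorSpace", "--pairedReads", *paths]
```

The list returned by paired_command can be passed to subprocess.run once the file names have been checked.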

Additional Information

Input file requirement

KARMA requires input files in color space FASTQ format. The length of each read (which includes the leading primer base) should equal the length of its quality string. An example of a valid color space FASTQ file follows:

 @Chromosome_20_048435095_Genome_2757096147
 A02232200222021320012102212311002212
 +
 !!1111111111111111111111111111111111
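A quick sanity check of a record before mapping might look like the sketch below. It assumes the primer base is one of A, C, G or T and that colors are the digits 0-3, with '.' for a missing call; these character sets and the function name are our assumptions, not a statement of what KARMA itself accepts.

```python
import re

# Assumption: primer base A/C/G/T followed by colors 0-3 ('.' marks a missing call).
_READ_RE = re.compile(r"^[ACGT][0-3.]+$")


def check_csfastq_record(header, read, separator, quality):
    """Return a list of problems found in one four-line record (empty if none)."""
    problems = []
    if not header.startswith("@"):
        problems.append("header line must start with '@'")
    if not _READ_RE.match(read):
        problems.append("read must be a primer base followed by colors")
    if not separator.startswith("+"):
        problems.append("separator line must start with '+'")
    if len(read) != len(quality):
        problems.append("read and quality string lengths differ")
    return problems
```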

Minimum read length requirement

Keep in mind that the minimum color space read length for KARMA is twice the word size plus two (including the leading primer base).
(For nucleotide space, the minimum length is twice the word size.)
For example, KARMA uses a word size of 15 by default, so the minimum color space read length is 32 (2 × 15 + 2), including the leading primer base.
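To see in advance which reads satisfy this rule, one could filter them with a sketch like the following. The names are hypothetical, and the records are assumed to be already parsed into (header, read, quality) tuples in which the read includes the primer base.

```python
def min_read_length(word_size=15, color_space=True):
    """Minimum read length: twice the word size, plus two in color space."""
    return 2 * word_size + (2 if color_space else 0)


def long_enough(records, word_size=15):
    """Yield the (header, read, quality) records whose read meets the minimum length."""
    threshold = min_read_length(word_size)
    for header, read, quality in records:
        if len(read) >= threshold:
            yield header, read, quality
```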

Auxiliary tools

The ABI SOLiD platform generates the color space FASTA file (e.g. XXX.csfasta) and the quality file (e.g. XXX_QV.qual) separately. We wrote a script, solid2csfastq.py, to convert them into a single color space FASTQ file (e.g. XXX.csfastq). We believe a single color space FASTQ file will simplify post processing.
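The idea behind such a conversion can be sketched as follows. This is not the solid2csfastq.py script itself but a simplified illustration under several assumptions: each record occupies one header line and one data line, both files list the reads in the same order, qualities are written with the Phred+33 encoding, negative quality values (missing calls) are clamped to 0, and a '!' placeholder is prepended for the primer base so that the quality string is as long as the read.

```python
def _records(lines):
    """Yield (name, body) pairs from SOLiD-style FASTA-like lines, skipping '#' comments."""
    name = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            yield name, line
            name = None


def to_csfastq(csfasta_lines, qual_lines):
    """Yield four-line color space FASTQ records as strings."""
    for (name, read), (qname, quals) in zip(_records(csfasta_lines), _records(qual_lines)):
        if name != qname:
            raise ValueError(f"record order differs: {name} vs {qname}")
        scores = [max(0, int(q)) for q in quals.split()]  # clamp missing calls to 0
        # Assumed encoding: Phred+33, capped at 93, with '!' standing in for the primer.
        quality = "!" + "".join(chr(min(q, 93) + 33) for q in scores)
        yield f"@{name}\n{read}\n+\n{quality}\n"
```

Passing two open file handles (for the .csfasta and the _QV.qual file) and writing the yielded strings to a third file produces one combined file.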

Choose an appropriate size for word index

Mapping performance is sensitive to the word index size. A small word size increases both the number of calculation cycles needed for a single read and the number of times a single word is duplicated in the genome. On the other hand, a large word size requires much more memory. Please also keep in mind that the appropriate size depends on your hardware architecture. In practice, we found a size of 15 to be optimal.
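A rough calculation illustrates the trade-off. There are 4^N possible words of length N, so the index grows quickly with N, while the average number of genome positions sharing one word shrinks. The sketch below assumes a genome of about 3 billion positions and a uniform distribution of words, both simplifications; it says nothing about KARMA's actual memory layout, and the function name is hypothetical.

```python
def word_index_stats(word_size, genome_length=3_000_000_000):
    """Return (number of distinct words, mean genome positions per word)."""
    distinct_words = 4 ** word_size
    mean_occurrences = genome_length / distinct_words
    return distinct_words, mean_occurrences


for n in (11, 13, 15):
    words, mean = word_index_stats(n)
    print(f"N={n}: {words} distinct words, about {mean:.1f} positions per word")
```

Under these assumptions, N = 15 gives 1,073,741,824 distinct words with about 2.8 positions each, whereas N = 11 gives 4,194,304 words with about 715 positions each, so every word lookup yields far more candidate positions to check.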

A Complete Example

The following commands summarize the whole procedure as a quick start for mapping color space reads.

Building binary genome reference and word index:

  karma --createReference --reference NCBI36.fa
  ln -s NCBI36.fa NCBI36CS.fa
  karma --colorSpace --createReference --createIndex --reference NCBI36CS.fa

Mapping color space reads:

  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace single.csfastq
  karma --reference NCBI36.fa --csReference NCBI36CS.fa --colorSpace \
  --pairedReads pair1.csfastq pair2.csfastq

The output files are single.sam and pair1.sam, and they conform to the SAM specification.
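Because the output follows the SAM specification, it can be inspected with a few lines of Python. The sketch below counts mapped and unmapped records using the FLAG field (second column), where bit 0x4 marks an unmapped read; the function name is hypothetical and the code relies only on the SAM format, not on anything specific to KARMA.

```python
def count_mapped(sam_lines):
    """Return (mapped, unmapped) counts of alignment records."""
    mapped = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & 0x4:  # SAM flag bit 0x4: segment unmapped
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped
```

For example, calling count_mapped on an open handle to single.sam returns the two counts for the single end run.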


Difference between revisions of "Karma-colorspace"
