Difference between revisions of "Karma"

From Genome Analysis Wiki
Jump to navigationJump to search
Line 1: Line 1:
[[Category:Software]]
+
<br> '''K-tuple Alignment with Rapid Matching Algorithm'''  
 
 
'''K-tuple Alignment with Rapid Matching Algorithm'''  
 
  
Karma uses an existing reference to align short reads, such as those generated by Illumina sequencers.
+
Karma uses an existing reference to align short reads, such as those generated by Illumina sequencers.  
  
Primary features:
+
Primary features:  
  
# High performance, high sensitivity
+
#High performance, high sensitivity  
# Large and small gap detection by default
+
#Large and small gap detection by default  
# Multiple gaps per read by default
+
#Multiple gaps per read by default  
# Single or paired end reads
+
#Single or paired end reads  
# No read length limit
+
#No read length limit  
# Quality scores are used to assess quality of maps
+
#Quality scores are used to assess quality of maps  
# All potential locations are examined exhaustively, none are omitted
+
#All potential locations are examined exhaustively, none are omitted  
# Reasonable memory per CPU ratio on high core count machines
+
#Reasonable memory per CPU ratio on high core count machines
  
The current version, 0.9.0, is optimized to rapidly map base space reads from Illumina sequencers.
+
The current version, 0.9.0, is optimized to rapidly map base space reads from Illumina sequencers.  
  
Color space and LS454 sequence alignments are not currently supported. These features will return in Karma 0.9.1.
+
Color space and LS454 sequence alignments are not currently supported. These features will return in Karma 0.9.1.  
  
= Download Karma =
+
= Download Karma =
  
 
To get a copy go to [http://www.sph.umich.edu/csg/pha/karma/download/ Karma Download]  
 
To get a copy go to [http://www.sph.umich.edu/csg/pha/karma/download/ Karma Download]  
  
= Build Karma =
+
= Build Karma =
  
Karma is designed to be reasonably portable.
+
Karma is designed to be reasonably portable.  
  
However, since development occurs only on Ubuntu 9.10 x86 and x64 platforms, there are likely other portability issues.
+
However, since development occurs only on Ubuntu 9.10 x86 and x64 platforms, there are likely other portability issues.  
  
We support Karma only on Ubuntu 9.10 on 64-bit processors.
+
We support Karma only on Ubuntu 9.10 on 64-bit processors.  
  
== Dependencies ==
+
== Dependencies ==
  
Karma requires that the following debian packages be installed on the host Linux machine:
+
Karma requires that the following debian packages be installed on the host Linux machine:  
  
# libssl-dev
+
#libssl-dev  
# zlib1g-dev
+
#zlib1g-dev
  
Without these installed, Karma will not build.
+
Without these installed, Karma will not build.  
  
== Building ==
+
== Building ==
  
Assuming the karma tar file is named karma.tgz, do the following
+
Assuming the karma tar file is named karma.tgz, do the following  
  
 
  tar xvzf karma.tgz
 
  tar xvzf karma.tgz
Line 51: Line 49:
 
  cp karma/karma ~/bin
 
  cp karma/karma ~/bin
  
Alternatively, if you want to share the karma binary install it in /usr/local/bin/karma.
+
Alternatively, if you want to share the karma binary install it in /usr/local/bin/karma.  
  
== Testing the build ==
+
== Testing the build ==
  
To test karma, go to the build tree subdirectory named ''karma'', and type the command:
+
To test karma, go to the build tree subdirectory named ''karma'', and type the command:  
  
 
  make test
 
  make test
  
The test script builds a reference for the small phiX genome, then runs single end as well as paired end alignments. It compares the results of that with known results. Differences are printed to the console, and currently look something like this:
+
The test script builds a reference for the small phiX genome, then runs single end as well as paired end alignments. It compares the results of that with known results. Differences are printed to the console, and currently look something like this:  
 
+
<pre>diff phiX.sam.good phiX.sam  
<pre>
 
diff phiX.sam.good phiX.sam  
 
 
3c3
 
3c3
< @RG DT:2010-04-08T17:29Z ID:boingboing SM:NA12345
+
&lt; @RG DT:2010-04-08T17:29Z ID:boingboing SM:NA12345
 
---
 
---
> @RG DT:2010-04-08T18:13Z ID:boingboing SM:NA12345
+
&gt; @RG DT:2010-04-08T18:13Z ID:boingboing SM:NA12345
</pre>
+
</pre>  
 +
Any differences greater than that are an error and need to be fixed by the author.
  
Any differences greater than that are an error and need to be fixed by the author.
+
= Normal Workflow  =
  
= Normal Workflow =
+
Karma works using a set of index and hash files created from an existing reference. Once created, this set of reference index and hash files must always be specified in the command line when aligning reads.
  
Karma works using a set of index and hash files created from an existing reference. Once created, this set of reference index and hash files must always be specified in the command line when aligning reads.
+
In concept, the simplest workflow is to first create a reference index using ''karma create'', then align reads using ''karma map''. You only have to build the index and hash once.  
  
In concept, the simplest workflow is to first create a reference index using ''karma create'', then align reads using ''karma map''.  You only have to build the index and hash once.
+
Because the reference can be large, and because Karma will share the reference among many running instances of Karma, it is useful to put well known references in a common location readily accessible to you and your collaborators.  
 
 
Because the reference can be large, and because Karma will share the reference among many running instances of Karma, it is useful to put well known references in a common location readily accessible to you and your collaborators.
 
  
 
= Build reference index and hash  =
 
= Build reference index and hash  =
  
Building a reference index and hash with Karma is straightforward, but because it is time consuming for longer genomes, you typically save the reference index between runs.
+
Building a reference index and hash with Karma is straightforward, but because it is time consuming for longer genomes, you typically save the reference index between runs.  
  
The simplest example for creating a reference and index using a wordsize of 11-mer words is:
+
The simplest example for creating a reference and index using a wordsize of 11-mer words is:  
  
 
  karma create -i -w 11 phiX.fa
 
  karma create -i -w 11 phiX.fa
  
More generally, three primary parameters are necessary for building a Karma reference index:
+
More generally, three primary parameters are necessary for building a Karma reference index:  
  
# a boolean flag indicating base or color space
+
#a boolean flag indicating base or color space  
# the index table word occurrence cutoff value
+
#the index table word occurrence cutoff value  
# the word size
+
#the word size
  
Although the input reference is always expected to be base space and in [http://en.wikipedia.org/wiki/FASTA_format FASTA] format, the binary version of the reference, and the corresponding index and hash files, can be in either color space (ABI SOLiD) or base space (Illumina or LS454). For a given reference [http://en.wikipedia.org/wiki/FASTA_format FASTA] file, you may have either a color or base space binary reference, as well as either color or base space index/hash files, any in varying word sizes or occurrence cutoffs.
+
Although the input reference is always expected to be base space and in [http://en.wikipedia.org/wiki/FASTA_format FASTA] format, the binary version of the reference, and the corresponding index and hash files, can be in either color space (ABI SOLiD) or base space (Illumina or LS454). For a given reference [http://en.wikipedia.org/wiki/FASTA_format FASTA] file, you may have either a color or base space binary reference, as well as either color or base space index/hash files, any in varying word sizes or occurrence cutoffs.  
  
Because the index and hash files are dependent on the occurrence cutoff parameter and the word size, the output files created by karma have those values in the file name. This allows you to create a variety of index/hash tables, depending on your expected use (ABI SOLiD, in particular, is sensitive to read length).
+
Because the index and hash files are dependent on the occurrence cutoff parameter and the word size, the output files created by karma have those values in the file name. This allows you to create a variety of index/hash tables, depending on your expected use (ABI SOLiD, in particular, is sensitive to read length).  
  
== Options for building reference index and hash ==
+
== Options for building reference index and hash ==
  
 
  -r ''reference''          Reference file in [http://en.wikipedia.org/wiki/FASTA_format FASTA] format
 
  -r ''reference''          Reference file in [http://en.wikipedia.org/wiki/FASTA_format FASTA] format
Line 105: Line 100:
 
  -i                        Create the index and hash as well as the binary reference
 
  -i                        Create the index and hash as well as the binary reference
  
 +
<br>
  
= Aligning Reads =
+
= Aligning Reads =
  
Aligning reads to the reference is easy:
+
Aligning reads to the reference is easy:  
  
 
  karma map -r phiX.fa -w 11 phiX.fastq
 
  karma map -r phiX.fa -w 11 phiX.fastq
  
or for paired reads:
+
or for paired reads:  
  
 
  karma map -r phiX.fa -w 11 phiX-mate1.fastq phiX-mate2.fastq
 
  karma map -r phiX.fa -w 11 phiX-mate1.fastq phiX-mate2.fastq
  
In both of the above examples, the -r option names the reference originally used to build the index/hash, and the -w 11 specifies that we are using the index/hash built for 11-mer words. Although you can use the default word size of 15 for phiX, the index is 4^15 * 4 = 4GBytes, so a shorter word size is prudent.
+
In both of the above examples, the -r option names the reference originally used to build the index/hash, and the -w 11 specifies that we are using the index/hash built for 11-mer words. Although you can use the default word size of 15 for phiX, the index is 4^15 * 4 = 4GBytes, so a shorter word size is prudent.  
  
Since Karma uses the word size and occurrence cutoff to help construct the actual index and hash filenames, you must specify them the same way you did when you created the reference index and hash.
+
Since Karma uses the word size and occurrence cutoff to help construct the actual index and hash filenames, you must specify them the same way you did when you created the reference index and hash.  
  
== Options for aligning reads ==
+
== Options for aligning reads ==
  
   -a [int]    -> maximum insert size
+
   -a [int]    -&gt; maximum insert size
  -B [int]    -> max number of bases in millions
+
  -B [int]    -&gt; max number of bases in millions
  -E          -> show reference bases (default off)
+
  -E          -&gt; show reference bases (default off)
  -H          -> set SAM header line values (e.g. -H RG:SM:NA12345)
+
  -H          -&gt; set SAM header line values (e.g. -H RG:SM:NA12345)
  -o          -> required output sam/bam file
+
  -o          -&gt; required output sam/bam file
  -O [int]    -> occurrence cutoff (default 5000)
+
  -O [int]    -&gt; occurrence cutoff (default 5000)
  -q          -> quiet mode (no output except errors)
+
  -q          -&gt; quiet mode (no output except errors)
  -r [name]    -> required genome reference
+
  -r [name]    -&gt; required genome reference
  -R [int]    -> max number of reads
+
  -R [int]    -&gt; max number of reads
  -w [int]    -> index word size (default 15)
+
  -w [int]    -&gt; index word size (default 15)
  
== Aligning Reads (Illumina) ==
+
== Aligning Reads (Illumina) ==
  
Karma is set up so that the default options work well for mapping Illumina reads to the Human genome.
+
Karma is set up so that the default options work well for mapping Illumina reads to the Human genome.  
  
== Aligning Reads (ABI SOLiD) ==
+
== Aligning Reads (ABI SOLiD) ==
  
Karma has been designed to align color space reads. However, in Karma 0.9.0, this functionality is not working.
+
Karma has been designed to align color space reads. However, in Karma 0.9.0, this functionality is not working.  
  
== Aligning Reads (LS 454) ==
+
== Aligning Reads (LS 454) ==
  
Karma has been designed to align LS 454 reads. However, in Karma 0.9.0, this functionality is not working.
+
Karma has been designed to align LS 454 reads. However, in Karma 0.9.0, this functionality is not working.  
  
= Karma Performance Tuning =
+
= Karma Performance Tuning =
  
There are four components to the Karma index and hash. A pure index array, based on an N-mer word index. This is used as a pointer into a word positions table, which is an ordered list of genome positions in which that N-mer word appears. There is a cap called the ''occurrence cutoff'', which once exceeded, causes that index word to be marked as a high repeat pattern. Once marked as high repeat, the N-mer word is instead combined with both the N-mer word preceding it, as well as the N-mer word succeeding it to create a 2 * N-mer word hash key. Two hash tables are populated, a left and a right hash. These are then used when that pattern is found in a read.
+
There are four components to the Karma index and hash. A pure index array, based on an N-mer word index. This is used as a pointer into a word positions table, which is an ordered list of genome positions in which that N-mer word appears. There is a cap called the ''occurrence cutoff'', which once exceeded, causes that index word to be marked as a high repeat pattern. Once marked as high repeat, the N-mer word is instead combined with both the N-mer word preceding it, as well as the N-mer word succeeding it to create a 2 * N-mer word hash key. Two hash tables are populated, a left and a right hash. These are then used when that pattern is found in a read.  
  
== Index Word Size ==
+
== Index Word Size ==
  
Choosing an appropriate word size for larger genome is critical to performance. The easiest case is for Illumina base space reads with the human genome (3Gbases), where the default 15-mer word size is fine.
+
Choosing an appropriate word size for larger genome is critical to performance. The easiest case is for Illumina base space reads with the human genome (3Gbases), where the default 15-mer word size is fine.  
  
For smaller genomes, consider using a smaller word size. Genomes smaller than a few million bases should be perfectly fine with a word size of 11 or 12.
+
For smaller genomes, consider using a smaller word size. Genomes smaller than a few million bases should be perfectly fine with a word size of 11 or 12.  
  
Since the primary index table into the word positions table is 2^(wordsize) * 4 bytes, it can grow large rapidly. All else being equal, a smaller word size leads to longer sets of word positions for each index value. Each increment of word size approximately quadruples storage requirements, and halves runtime. Similarly, each decrement of word size reduces the index table size by 75%, and doubles runtime. These approximations are old, but serve a useful rule of thumb.
+
Since the primary index table into the word positions table is 2^(wordsize) * 4 bytes, it can grow large rapidly. All else being equal, a smaller word size leads to longer sets of word positions for each index value. Each increment of word size approximately quadruples storage requirements, and halves runtime. Similarly, each decrement of word size reduces the index table size by 75%, and doubles runtime. These approximations are old, but serve a useful rule of thumb.  
  
For ABI SOLiD reads, the word size is critical, due to the shorter length of reads as compared to Illumina or LS 454.
+
For ABI SOLiD reads, the word size is critical, due to the shorter length of reads as compared to Illumina or LS 454.  
  
The optimal minimum word size is chosen such that it is 1/4 the minimum expected average read length. It also must be chosen to be 1/2 the minimum expected read length, since at least 2 full words must exist in the read.
+
The optimal minimum word size is chosen such that it is 1/4 the minimum expected average read length. It also must be chosen to be 1/2 the minimum expected read length, since at least 2 full words must exist in the read.  
  
So for 48-mer reads, a reasonable value of word size is 12. Although the base space default of 15 is fine, too, Karma is able to take advantage of a higher number of index words per read, yielding substantial speedups even with the shorter read. Similarly, 52-mer reads would map better with a 13-mer word size, and 56-mer reads would map best with a 14-mer word size.
+
So for 48-mer reads, a reasonable value of word size is 12. Although the base space default of 15 is fine, too, Karma is able to take advantage of a higher number of index words per read, yielding substantial speedups even with the shorter read. Similarly, 52-mer reads would map better with a 13-mer word size, and 56-mer reads would map best with a 14-mer word size.  
  
== Occurrence Cutoff ==
+
== Occurrence Cutoff ==
  
The occurrence cutoff value determines how quickly an N-mer pattern is declared to be ''high repeat'' and left out of the index in favor of a hash. The default value of 5000 seems adequate for Illumina reads with the human genome. If ultimate performance is necessary, some experimentation is called for with this value.
+
The occurrence cutoff value determines how quickly an N-mer pattern is declared to be ''high repeat'' and left out of the index in favor of a hash. The default value of 5000 seems adequate for Illumina reads with the human genome. If ultimate performance is necessary, some experimentation is called for with this value.  
  
== Shared Memory ==
+
== Shared Memory ==
  
Karma uses memory mapped files to share the potentially large reference index and hash data structures.
+
Karma uses memory mapped files to share the potentially large reference index and hash data structures.  
  
Karma uses this to great effect on our 8 processors with hyperthreading enabled. 16 copies of karma can share one reference index and hash, yielding a very acceptable memory per CPU ratio of around 1GB/CPU.
+
Karma uses this to great effect on our 8 processors with hyperthreading enabled. 16 copies of karma can share one reference index and hash, yielding a very acceptable memory per CPU ratio of around 1GB/CPU.  
  
A problem with large reference index and hash data structures is that they are more prone to being paged out. On a shared machine that is being used extensively even just simple disk I/O, memory pages are being reclaimed such that Karma will become swapped out.
+
A problem with large reference index and hash data structures is that they are more prone to being paged out. On a shared machine that is being used extensively even just simple disk I/O, memory pages are being reclaimed such that Karma will become swapped out.  
  
While Karma can recover on its own, it is best to either run in a production manner on dedicated machines, or to run a program such as the utility ''mapfile'' found in the utilities sub-folder. This program continually touches each page of the data structures in sequential order, forcing them to the head of the disk buffer pool, so they don't get aged out of the queue.
+
While Karma can recover on its own, it is best to either run in a production manner on dedicated machines, or to run a program such as the utility ''mapfile'' found in the utilities sub-folder. This program continually touches each page of the data structures in sequential order, forcing them to the head of the disk buffer pool, so they don't get aged out of the queue.  
  
= Modifying the Reference Header =
+
= Modifying the Reference Header =
  
''NB: This feature is not yet complete''
+
''NB: This feature is not yet complete''  
  
To facilitate SAM RG values being set automatically in a production environment, we keep a header in the binary version of the reference. The header can be viewed and edited using the header subcommands here.
+
To facilitate SAM RG values being set automatically in a production environment, we keep a header in the binary version of the reference. The header can be viewed and edited using the header subcommands here.  
  
To view the header:
+
To view the header:  
  
 
  karma header -r phiX.fa
 
  karma header -r phiX.fa
  
To view and edit the header:
+
To view and edit the header:  
  
 
  karma header -r phiX.fa -e
 
  karma header -r phiX.fa -e
  
= Other test and check capabilities =
+
= Optional flags  =
 +
 
 +
{| width="200" border="1" cellpadding="1" cellspacing="1"
 +
|-
 +
| XA
 +
| Alignment PathTag
 +
|-
 +
| RG
 +
| Sample GroupID
 +
|-
 +
| HA
 +
| numMatchContributors (possible hits checked by karma)
 +
|-
 +
| UQ
 +
| phred quality of this read, assuming it is mapped correctly
 +
|-
 +
| NB
 +
| Number of equally best hits
 +
|-
 +
| ER
 +
| Different kind of errors when mapping
 +
|-
 +
|
 +
|
 +
|}
 +
 
 +
<br>
  
Due to the size and complexity of Karma input, output and index files, various checks and tests are useful, so we include some diagnostics capabilities:
 
  
Tests for external files:
+
 
 +
= Other test and check capabilities  =
 +
 
 +
Due to the size and complexity of Karma input, output and index files, various checks and tests are useful, so we include some diagnostics capabilities:
 +
 
 +
Tests for external files:  
  
 
  karma check [options...] file.bam file.fastq file.sam file.fa file.umfa
 
  karma check [options...] file.bam file.fastq file.sam file.fa file.umfa
  
Tests internal to Karma:
+
Tests internal to Karma:  
 +
 
 
  karma test [options...]
 
  karma test [options...]
  -d -> debug
+
  -d -&gt; debug
  -s [int] -> set random number seed [12345]
+
  -s [int] -&gt; set random number seed [12345]
  
 
= Karma File structure  =
 
= Karma File structure  =
Line 259: Line 286:
 
<br>  
 
<br>  
  
= Karma TODO List =
+
= Karma TODO List =
  
# command line help is muddled up - UserOptions.h needs work
+
#command line help is muddled up - UserOptions.h needs work  
# color space read handling needs to be re-integrated and tested
+
#color space read handling needs to be re-integrated and tested  
# LS 454 needs to be tested, and the code adapted
+
#LS 454 needs to be tested, and the code adapted  
# pre-process some number of records to establish an appropriate max insert size
+
#pre-process some number of records to establish an appropriate max insert size  
# finish reference header view/edit code
+
#finish reference header view/edit code  
# investigate and document maximum memory use during ''create'' sub-command
+
#investigate and document maximum memory use during ''create'' sub-command  
# finish and improve check and test commands
+
#finish and improve check and test commands
  
= Karma CHANGELOG =
+
= Karma CHANGELOG =
 +
 
 +
*Karma 0.9.0
  
* Karma 0.9.0
 
 
  * reference may now contain an arbitrary number of chromosomes  
 
  * reference may now contain an arbitrary number of chromosomes  
 
  * local re-alignment is drastically improved (handle small and large gaps better)
 
  * local re-alignment is drastically improved (handle small and large gaps better)
 
  * command line is re-vamped - now easier to use
 
  * command line is re-vamped - now easier to use
 
  * create/naming/using reference index and hashes is easier
 
  * create/naming/using reference index and hashes is easier
* Karma 0.8.8S
+
 
 +
*Karma 0.8.8S
 +
 
 
  * add first version of local re-alignment
 
  * add first version of local re-alignment
 
  * bump max number of chromosomes to 200
 
  * bump max number of chromosomes to 200
 
  * we no longer do Smith-Waterman on each candidate location
 
  * we no longer do Smith-Waterman on each candidate location
* Karma 0.8.8
 
* Karma 0.8.6
 
  
= Other useful links =
+
*Karma 0.8.8
 +
*Karma 0.8.6
 +
 
 +
= Other useful links =
  
 
[http://lh3lh3.users.sourceforge.net/bioinfo.shtml Heng Li's thoughts about aligners]  
 
[http://lh3lh3.users.sourceforge.net/bioinfo.shtml Heng Li's thoughts about aligners]  
  
[http://lh3lh3.users.sourceforge.net/udb.shtml Benchmark of Dictionary Structures]
+
[http://lh3lh3.users.sourceforge.net/udb.shtml Benchmark of Dictionary Structures]
 +
 
 +
[[Category:Software]]

Revision as of 22:14, 13 April 2010


K-tuple Alignment with Rapid Matching Algorithm

Karma uses an existing reference to align short reads, such as those generated by Illumina sequencers.

Primary features:

  1. High performance, high sensitivity
  2. Large and small gap detection by default
  3. Multiple gaps per read by default
  4. Single or paired end reads
  5. No read length limit
  6. Quality scores are used to assess quality of maps
  7. All potential locations are examined exhaustively, none are omitted
  8. Reasonable memory per CPU ratio on high core count machines

The current version, 0.9.0, is optimized to rapidly map base space reads from Illumina sequencers.

Color space and LS454 sequence alignments are not currently supported. These features will return in Karma 0.9.1.

Download Karma

To get a copy go to Karma Download

Build Karma

Karma is designed to be reasonably portable.

However, since development occurs only on Ubuntu 9.10 x86 and x64 platforms, there are likely other portability issues.

We support Karma only on Ubuntu 9.10 on 64-bit processors.

Dependencies

Karma requires that the following debian packages be installed on the host Linux machine:

  1. libssl-dev
  2. zlib1g-dev

Without these installed, Karma will not build.

Building

Assuming the karma tar file is named karma.tgz, do the following

tar xvzf karma.tgz
cd karma-0.9
make
mkdir ~/bin
cp karma/karma ~/bin

Alternatively, if you want to share the karma binary install it in /usr/local/bin/karma.

Testing the build

To test karma, go to the build tree subdirectory named karma, and type the command:

make test

The test script builds a reference for the small phiX genome, then runs single end as well as paired end alignments. It compares the results of that with known results. Differences are printed to the console, and currently look something like this:

diff phiX.sam.good phiX.sam 
3c3
< @RG	DT:2010-04-08T17:29Z	ID:boingboing	SM:NA12345
---
> @RG	DT:2010-04-08T18:13Z	ID:boingboing	SM:NA12345

Any differences greater than that are an error and need to be fixed by the author.

Normal Workflow

Karma works using a set of index and hash files created from an existing reference. Once created, this set of reference index and hash files must always be specified in the command line when aligning reads.

In concept, the simplest workflow is to first create a reference index using karma create, then align reads using karma map. You only have to build the index and hash once.

Because the reference can be large, and because Karma will share the reference among many running instances of Karma, it is useful to put well known references in a common location readily accessible to you and your collaborators.

Build reference index and hash

Building a reference index and hash with Karma is straightforward, but because it is time consuming for longer genomes, you typically save the reference index between runs.

The simplest example for creating a reference and index using a wordsize of 11-mer words is:

karma create -i -w 11 phiX.fa

More generally, three primary parameters are necessary for building a Karma reference index:

  1. a boolean flag indicating base or color space
  2. the index table word occurrence cutoff value
  3. the word size

Although the input reference is always expected to be base space and in FASTA format, the binary version of the reference, and the corresponding index and hash files, can be in either color space (ABI SOLiD) or base space (Illumina or LS454). For a given reference FASTA file, you may have either a color or base space binary reference, as well as either color or base space index/hash files, any in varying word sizes or occurrence cutoffs.

Because the index and hash files are dependent on the occurrence cutoff parameter and the word size, the output files created by karma have those values in the file name. This allows you to create a variety of index/hash tables, depending on your expected use (ABI SOLiD, in particular, is sensitive to read length).

Options for building reference index and hash

-r reference          Reference file in FASTA format
-w word size          Word size for index and hash (default 15, typically 10-16)
-O occurrence cutoff  Upper count of number of word positions to store in word positions table (default 5000)
-c                        Creates a color space reference and index/hash
-i                        Create the index and hash as well as the binary reference


Aligning Reads

Aligning reads to the reference is easy:

karma map -r phiX.fa -w 11 phiX.fastq

or for paired reads:

karma map -r phiX.fa -w 11 phiX-mate1.fastq phiX-mate2.fastq

In both of the above examples, the -r option names the reference originally used to build the index/hash, and the -w 11 specifies that we are using the index/hash built for 11-mer words. Although you can use the default word size of 15 for phiX, the index is 4^15 * 4 = 4GBytes, so a shorter word size is prudent.

Since Karma uses the word size and occurrence cutoff to help construct the actual index and hash filenames, you must specify them the same way you did when you created the reference index and hash.

Options for aligning reads

  -a [int]     -> maximum insert size
 -B [int]     -> max number of bases in millions
 -E           -> show reference bases (default off)
 -H           -> set SAM header line values (e.g. -H RG:SM:NA12345)
 -o           -> required output sam/bam file
 -O [int]     -> occurrence cutoff (default 5000)
 -q           -> quiet mode (no output except errors)
 -r [name]    -> required genome reference
 -R [int]     -> max number of reads
 -w [int]     -> index word size (default 15)

Aligning Reads (Illumina)

Karma is set up so that the default options work well for mapping Illumina reads to the Human genome.

Aligning Reads (ABI SOLiD)

Karma has been designed to align color space reads. However, in Karma 0.9.0, this functionality is not working.

Aligning Reads (LS 454)

Karma has been designed to align LS 454 reads. However, in Karma 0.9.0, this functionality is not working.

Karma Performance Tuning

There are four components to the Karma index and hash. A pure index array, based on an N-mer word index. This is used as a pointer into a word positions table, which is an ordered list of genome positions in which that N-mer word appears. There is a cap called the occurrence cutoff, which once exceeded, causes that index word to be marked as a high repeat pattern. Once marked as high repeat, the N-mer word is instead combined with both the N-mer word preceding it, as well as the N-mer word succeeding it to create a 2 * N-mer word hash key. Two hash tables are populated, a left and a right hash. These are then used when that pattern is found in a read.

Index Word Size

Choosing an appropriate word size for larger genome is critical to performance. The easiest case is for Illumina base space reads with the human genome (3Gbases), where the default 15-mer word size is fine.

For smaller genomes, consider using a smaller word size. Genomes smaller than a few million bases should be perfectly fine with a word size of 11 or 12.

Since the primary index table into the word positions table is 2^(wordsize) * 4 bytes, it can grow large rapidly. All else being equal, a smaller word size leads to longer sets of word positions for each index value. Each increment of word size approximately quadruples storage requirements, and halves runtime. Similarly, each decrement of word size reduces the index table size by 75%, and doubles runtime. These approximations are old, but serve a useful rule of thumb.

For ABI SOLiD reads, the word size is critical, due to the shorter length of reads as compared to Illumina or LS 454.

The optimal minimum word size is chosen such that it is 1/4 the minimum expected average read length. It also must be chosen to be 1/2 the minimum expected read length, since at least 2 full words must exist in the read.

So for 48-mer reads, a reasonable value of word size is 12. Although the base space default of 15 is fine, too, Karma is able to take advantage of a higher number of index words per read, yielding substantial speedups even with the shorter read. Similarly, 52-mer reads would map better with a 13-mer word size, and 56-mer reads would map best with a 14-mer word size.

Occurrence Cutoff

The occurrence cutoff value determines how quickly an N-mer pattern is declared to be high repeat and left out of the index in favor of a hash. The default value of 5000 seems adequate for Illumina reads with the human genome. If ultimate performance is necessary, some experimentation is called for with this value.

Shared Memory

Karma uses memory mapped files to share the potentially large reference index and hash data structures.

Karma uses this to great effect on our 8 processors with hyperthreading enabled. 16 copies of karma can share one reference index and hash, yielding a very acceptable memory per CPU ratio of around 1GB/CPU.

A problem with large reference index and hash data structures is that they are more prone to being paged out. On a shared machine that is being used extensively even just simple disk I/O, memory pages are being reclaimed such that Karma will become swapped out.

While Karma can recover on its own, it is best to either run in a production manner on dedicated machines, or to run a program such as the utility mapfile found in the utilities sub-folder. This program continually touches each page of the data structures in sequential order, forcing them to the head of the disk buffer pool, so they don't get aged out of the queue.

Modifying the Reference Header

NB: This feature is not yet complete

To facilitate SAM RG values being set automatically in a production environment, we keep a header in the binary version of the reference. The header can be viewed and edited using the header subcommands here.

To view the header:

karma header -r phiX.fa

To view and edit the header:

karma header -r phiX.fa -e

Optional flags

XA Alignment PathTag
RG Sample GroupID
HA numMatchContributors (possible hits checked by karma)
UQ phred quality of this read, assuming it is mapped correctly
NB Number of equally best hits
ER Different kind of errors when mapping



Other test and check capabilities

Due to the size and complexity of Karma input, output and index files, various checks and tests are useful, so we include some diagnostics capabilities:

Tests for external files:

karma check [options...] file.bam file.fastq file.sam file.fa file.umfa

Tests internal to Karma:

karma test [options...]
-d -> debug
-s [int] -> set random number seed [12345]

Karma File structure

Upon successfully building references, you will obtain a list of reference files like below:

Base Space

Color Space

Reference genome

NCBI37-bs.umfa

NCBI37-cs.umfa

Word Index

NCBI37-bs.15.5000.umwiwp

NCBI37-bs.15.5000.umwihi

NCBI37-cs.15.5000.umwiwp

NCBI37-cs.15.5000.umwihi

Word Hash (Left)

NCBI37-bs.15.5000.umwhl

NCBI37-cs.15.5000.umwhl

Word Hash (Right)

NCBI37-bs.15.5000.umwhr

NCBI37-cs.15.5000.umwhr



Karma TODO List

  1. command line help is muddled up - UserOptions.h needs work
  2. color space read handling needs to be re-integrated and tested
  3. LS 454 needs to be tested, and the code adapted
  4. pre-process some number of records to establish an appropriate max insert size
  5. finish reference header view/edit code
  6. investigate and document maximum memory use during create sub-command
  7. finish and improve check and test commands

Karma CHANGELOG

  • Karma 0.9.0
* reference may now contain an arbitrary number of chromosomes 
* local re-alignment is drastically improved (handle small and large gaps better)
* command line is re-vamped - now easier to use
* create/naming/using reference index and hashes is easier
  • Karma 0.8.8S
* add first version of local re-alignment
* bump max number of chromosomes to 200
* we no longer do Smith-Waterman on each candidate location
  • Karma 0.8.8
  • Karma 0.8.6

Other useful links

Heng Li's thoughts about aligners

Benchmark of Dictionary Structures