Revision as of 22:14, 13 April 2010

K-tuple Alignment with Rapid Matching Algorithm

Karma uses an existing reference to align short reads, such as those generated by Illumina sequencers.

Primary features:

High performance, high sensitivity
Large and small gap detection by default
Multiple gaps per read by default
Single or paired end reads
No read length limit
Quality scores are used to assess quality of maps
All potential locations are examined exhaustively, none are omitted
Reasonable memory per CPU ratio on high core count machines

The current version, 0.9.0, is optimized to rapidly map base space reads from Illumina sequencers.

Color space and LS454 sequence alignments are not currently supported. These features will return in Karma 0.9.1.

Download Karma

To get a copy go to Karma Download

Build Karma

Karma is designed to be reasonably portable.

However, since development occurs only on Ubuntu 9.10 x86 and x64 platforms, there are likely other portability issues.

We support Karma only on Ubuntu 9.10 on 64-bit processors.

Dependencies

Karma requires that the following debian packages be installed on the host Linux machine:

libssl-dev
zlib1g-dev

Without these installed, Karma will not build.

Building

Assuming the karma tar file is named karma.tgz, do the following

tar xvzf karma.tgz
cd karma-0.9
make
mkdir ~/bin
cp karma/karma ~/bin

Alternatively, if you want to share the karma binary install it in /usr/local/bin/karma.

Testing the build

To test karma, go to the build tree subdirectory named karma, and type the command:

make test

The test script builds a reference for the small phiX genome, then runs single end as well as paired end alignments. It compares the results of that with known results. Differences are printed to the console, and currently look something like this:

diff phiX.sam.good phiX.sam 
3c3
< @RG	DT:2010-04-08T17:29Z	ID:boingboing	SM:NA12345
---
> @RG	DT:2010-04-08T18:13Z	ID:boingboing	SM:NA12345

Any differences greater than that are an error and need to be fixed by the author.

Normal Workflow

Karma works using a set of index and hash files created from an existing reference. Once created, this set of reference index and hash files must always be specified in the command line when aligning reads.

In concept, the simplest workflow is to first create a reference index using karma create, then align reads using karma map. You only have to build the index and hash once.

Because the reference can be large, and because Karma will share the reference among many running instances of Karma, it is useful to put well known references in a common location readily accessible to you and your collaborators.

Build reference index and hash

Building a reference index and hash with Karma is straightforward, but because it is time consuming for longer genomes, you typically save the reference index between runs.

The simplest example for creating a reference and index using a wordsize of 11-mer words is:

karma create -i -w 11 phiX.fa

More generally, three primary parameters are necessary for building a Karma reference index:

a boolean flag indicating base or color space
the index table word occurrence cutoff value
the word size

Although the input reference is always expected to be base space and in FASTA format, the binary version of the reference, and the corresponding index and hash files, can be in either color space (ABI SOLiD) or base space (Illumina or LS454). For a given reference FASTA file, you may have either a color or base space binary reference, as well as either color or base space index/hash files, any in varying word sizes or occurrence cutoffs.

Because the index and hash files are dependent on the occurrence cutoff parameter and the word size, the output files created by karma have those values in the file name. This allows you to create a variety of index/hash tables, depending on your expected use (ABI SOLiD, in particular, is sensitive to read length).

Options for building reference index and hash

-r reference          Reference file in FASTA format
-w word size          Word size for index and hash (default 15, typically 10-16)
-O occurrence cutoff  Upper count of number of word positions to store in word positions table (default 5000)
-c                        Creates a color space reference and index/hash
-i                        Create the index and hash as well as the binary reference

Aligning Reads

Aligning reads to the reference is easy:

karma map -r phiX.fa -w 11 phiX.fastq

or for paired reads:

karma map -r phiX.fa -w 11 phiX-mate1.fastq phiX-mate2.fastq

In both of the above examples, the -r option names the reference originally used to build the index/hash, and the -w 11 specifies that we are using the index/hash built for 11-mer words. Although you can use the default word size of 15 for phiX, the index is 4^15 * 4 = 4GBytes, so a shorter word size is prudent.

Since Karma uses the word size and occurrence cutoff to help construct the actual index and hash filenames, you must specify them the same way you did when you created the reference index and hash.

Options for aligning reads

  -a [int]     -> maximum insert size
 -B [int]     -> max number of bases in millions
 -E           -> show reference bases (default off)
 -H           -> set SAM header line values (e.g. -H RG:SM:NA12345)
 -o           -> required output sam/bam file
 -O [int]     -> occurrence cutoff (default 5000)
 -q           -> quiet mode (no output except errors)
 -r [name]    -> required genome reference
 -R [int]     -> max number of reads
 -w [int]     -> index word size (default 15)

Aligning Reads (Illumina)

Karma is set up so that the default options work well for mapping Illumina reads to the Human genome.

Aligning Reads (ABI SOLiD)

Karma has been designed to align color space reads. However, in Karma 0.9.0, this functionality is not working.

Aligning Reads (LS 454)

Karma has been designed to align LS 454 reads. However, in Karma 0.9.0, this functionality is not working.

Karma Performance Tuning

There are four components to the Karma index and hash. A pure index array, based on an N-mer word index. This is used as a pointer into a word positions table, which is an ordered list of genome positions in which that N-mer word appears. There is a cap called the occurrence cutoff, which once exceeded, causes that index word to be marked as a high repeat pattern. Once marked as high repeat, the N-mer word is instead combined with both the N-mer word preceding it, as well as the N-mer word succeeding it to create a 2 * N-mer word hash key. Two hash tables are populated, a left and a right hash. These are then used when that pattern is found in a read.

Index Word Size

Choosing an appropriate word size for larger genome is critical to performance. The easiest case is for Illumina base space reads with the human genome (3Gbases), where the default 15-mer word size is fine.

For smaller genomes, consider using a smaller word size. Genomes smaller than a few million bases should be perfectly fine with a word size of 11 or 12.

Since the primary index table into the word positions table is 2^(wordsize) * 4 bytes, it can grow large rapidly. All else being equal, a smaller word size leads to longer sets of word positions for each index value. Each increment of word size approximately quadruples storage requirements, and halves runtime. Similarly, each decrement of word size reduces the index table size by 75%, and doubles runtime. These approximations are old, but serve a useful rule of thumb.

For ABI SOLiD reads, the word size is critical, due to the shorter length of reads as compared to Illumina or LS 454.

The optimal minimum word size is chosen such that it is 1/4 the minimum expected average read length. It also must be chosen to be 1/2 the minimum expected read length, since at least 2 full words must exist in the read.

So for 48-mer reads, a reasonable value of word size is 12. Although the base space default of 15 is fine, too, Karma is able to take advantage of a higher number of index words per read, yielding substantial speedups even with the shorter read. Similarly, 52-mer reads would map better with a 13-mer word size, and 56-mer reads would map best with a 14-mer word size.

Occurrence Cutoff

The occurrence cutoff value determines how quickly an N-mer pattern is declared to be high repeat and left out of the index in favor of a hash. The default value of 5000 seems adequate for Illumina reads with the human genome. If ultimate performance is necessary, some experimentation is called for with this value.

Shared Memory

Karma uses memory mapped files to share the potentially large reference index and hash data structures.

Karma uses this to great effect on our 8 processors with hyperthreading enabled. 16 copies of karma can share one reference index and hash, yielding a very acceptable memory per CPU ratio of around 1GB/CPU.

A problem with large reference index and hash data structures is that they are more prone to being paged out. On a shared machine that is being used extensively even just simple disk I/O, memory pages are being reclaimed such that Karma will become swapped out.

While Karma can recover on its own, it is best to either run in a production manner on dedicated machines, or to run a program such as the utility mapfile found in the utilities sub-folder. This program continually touches each page of the data structures in sequential order, forcing them to the head of the disk buffer pool, so they don't get aged out of the queue.

Modifying the Reference Header

NB: This feature is not yet complete

To facilitate SAM RG values being set automatically in a production environment, we keep a header in the binary version of the reference. The header can be viewed and edited using the header subcommands here.

To view the header:

karma header -r phiX.fa

To view and edit the header:

karma header -r phiX.fa -e

Optional flags

XA	Alignment PathTag
RG	Sample GroupID
HA	numMatchContributors (possible hits checked by karma)
UQ	phred quality of this read, assuming it is mapped correctly
NB	Number of equally best hits
ER	Different kind of errors when mapping

Other test and check capabilities

Due to the size and complexity of Karma input, output and index files, various checks and tests are useful, so we include some diagnostics capabilities:

Tests for external files:

karma check [options...] file.bam file.fastq file.sam file.fa file.umfa

Tests internal to Karma:

karma test [options...]
-d -> debug
-s [int] -> set random number seed [12345]

Karma File structure

Upon successfully building references, you will obtain a list of reference files like below:

	Base Space	Color Space
Reference genome	NCBI37-bs.umfa	NCBI37-cs.umfa
Word Index	NCBI37-bs.15.5000.umwiwp NCBI37-bs.15.5000.umwihi	NCBI37-cs.15.5000.umwiwp NCBI37-cs.15.5000.umwihi
Word Hash (Left)	NCBI37-bs.15.5000.umwhl	NCBI37-cs.15.5000.umwhl
Word Hash (Right)	NCBI37-bs.15.5000.umwhr	NCBI37-cs.15.5000.umwhr

Karma TODO List

command line help is muddled up - UserOptions.h needs work
color space read handling needs to be re-integrated and tested
LS 454 needs to be tested, and the code adapted
pre-process some number of records to establish an appropriate max insert size
finish reference header view/edit code
investigate and document maximum memory use during create sub-command
finish and improve check and test commands

Karma CHANGELOG

Karma 0.9.0

* reference may now contain an arbitrary number of chromosomes 
* local re-alignment is drastically improved (handle small and large gaps better)
* command line is re-vamped - now easier to use
* create/naming/using reference index and hashes is easier

Karma 0.8.8S

* add first version of local re-alignment
* bump max number of chromosomes to 200
* we no longer do Smith-Waterman on each candidate location

Karma 0.8.8
Karma 0.8.6

Difference between revisions of "Karma"