Difference between revisions of "Karma"
Revision as of 15:05, 8 April 2010
K-tuple Alignment with Rapid Matching Algorithm
Karma uses an existing reference to align short reads, such as those generated by Illumina sequencers.
The current version, 0.9.0, is optimized to rapidly map base space reads from Illumina sequencers.
Color space and LS454 sequence alignments are not working. These features will return in Karma 0.9.1.
- 1 Download Karma
- 2 Build Karma
- 3 Normal Workflow
- 4 Build reference index and hash
- 5 Aligning Reads
- 6 Karma Performance Tuning
- 7 Modifying the Reference Header
- 8 Other test and check capabilities
- 9 Karma File structure
- 10 Karma TODO List
- 11 Karma CHANGELOG
- 12 Other useful links
To get a copy go to Karma Download
Karma requires that the following Debian packages be installed on the host Linux machine:
Without these installed, Karma will not build.
Testing the build
To test karma, go to the build tree subdirectory named karma, and type the command:
The test script builds a reference for the small phiX genome, then runs single end as well as paired end alignments. It compares the results of that with known results. Differences are printed to the console, and currently look something like this:
diff phiX.sam.good phiX.sam
3c3
< @RG DT:2010-04-08T17:29Z ID:boingboing SM:NA12345
---
> @RG DT:2010-04-08T18:13Z ID:boingboing SM:NA12345
Only the @RG timestamp (DT) is expected to differ between runs. Any other difference is an error and needs to be fixed by the author.
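The comparison described above can be sketched in a few lines of Python. This is an illustration only, not part of the Karma test script: the function names are hypothetical, and it assumes that the DT tag on the @RG header line is the only field allowed to vary between runs.

```python
import re

def normalize_sam_line(line: str) -> str:
    """Mask the DT (date) tag on @RG header lines; leave every other line untouched."""
    if line.startswith("@RG"):
        return re.sub(r"\bDT:\S+", "DT:*", line)
    return line

def sam_differences(expected_lines, actual_lines):
    """Return (line_number, expected, actual) for each line that differs after masking DT."""
    diffs = []
    for n, (exp, act) in enumerate(zip(expected_lines, actual_lines), start=1):
        if normalize_sam_line(exp) != normalize_sam_line(act):
            diffs.append((n, exp, act))
    # A length mismatch is also a difference (hypothetical reporting convention).
    if len(expected_lines) != len(actual_lines):
        shorter = min(len(expected_lines), len(actual_lines))
        diffs.append((shorter + 1, "<length mismatch>", "<length mismatch>"))
    return diffs

good = ["@RG DT:2010-04-08T17:29Z ID:boingboing SM:NA12345"]
new = ["@RG DT:2010-04-08T18:13Z ID:boingboing SM:NA12345"]
print(sam_differences(good, new))
```

With the two header lines from the sample output, the list of differences is empty, so the run would count as passing.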
Karma works using a set of index and hash files created from an existing reference. Once created, this set of reference index and hash files must always be specified in the command line when aligning reads.
In concept, the simplest workflow is to first create a reference index using karma create, then align reads using karma map. You only have to build the index and hash once.
Because the reference can be large, and because Karma will share the reference among many running instances of Karma, it is useful to put well known references in a common location readily accessible to you and your collaborators.
Build reference index and hash
Building a reference index and hash with Karma is straightforward, but because it is time consuming for longer genomes, you typically save the reference index between runs.
The simplest example, creating a reference and index with a word size of 11, is:
karma create -i -w 11 phiX.fa
More generally, three primary parameters are necessary for building a Karma reference index:
- a boolean flag indicating base or color space
- the index table word occurrence cutoff value
- the word size
Although the input reference is always expected to be base space and in FASTA format, the binary version of the reference, and the corresponding index and hash files, can be in either color space (ABI SOLiD) or base space (Illumina or LS454). For a given reference FASTA file, you may have either a color or base space binary reference, as well as either color or base space index/hash files, any in varying word sizes or occurrence cutoffs.
Because the index and hash files are dependent on the occurrence cutoff parameter and the word size, the output files created by karma have those values in the file name. This allows you to create a variety of index/hash tables, depending on your expected use (ABI SOLiD, in particular, is sensitive to read length).
Options for building reference index and hash
-r reference            Reference file in FASTA format
-w word size            Word size for index and hash (default 15, typically 10-16)
-O occurrence cutoff    Upper count of number of word positions to store in word positions table (default 5000)
-c                      Creates a color space reference and index/hash
-i                      Create the index and hash as well as the binary reference
Aligning reads to the reference is easy:
karma map -r phiX.fa -w 11 phiX.fastq
or for paired reads:
karma map -r phiX.fa -w 11 phiX-mate1.fastq phiX-mate2.fastq
In both of the above examples, the -r option names the reference originally used to build the index/hash, and the -w 11 specifies that we are using the index/hash built for 11-mer words. Although you can use the default word size of 15 for phiX, the index is 4^15 * 4 = 4GBytes, so a shorter word size is prudent.
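The index size arithmetic above can be checked with a short Python sketch. The function name is hypothetical, and the 4 bytes per entry is taken from the formula in the text rather than from the Karma source.

```python
def index_table_bytes(word_size: int, bytes_per_entry: int = 4) -> int:
    """Size of a table holding one entry per possible word over a 4-letter alphabet."""
    return (4 ** word_size) * bytes_per_entry

# Compare the word size used in the phiX examples with the default.
for w in (11, 15):
    print(w, index_table_bytes(w) // 2**20, "MiB")
```

A word size of 11 needs 16 MiB for this table, against 4096 MiB (4 GBytes) for the default of 15.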
Since Karma uses the word size and occurrence cutoff to help construct the actual index and hash filenames, you must specify them the same way you did when you created the reference index and hash.
Aligning Reads (Illumina)
Karma is set up so that the default options work well for mapping Illumina reads to the Human genome.
Aligning Reads (ABI SOLiD)
Karma has been designed to align color space reads. However, in Karma 0.9.0, this functionality is not working.
Aligning Reads (LS 454)
Karma has been designed to align LS 454 reads. However, in Karma 0.9.0, this functionality is not working.
Karma Performance Tuning
The Karma index and hash have four components. The first is a pure index keyed by N-mer word. Each index entry is a pointer into the second component, the word positions table, which is an ordered list of the genome positions at which that N-mer word appears. A cap called the occurrence cutoff limits how many positions are kept per word; once a word crosses it, that index word is marked as a high repeat pattern. A high repeat N-mer word is then combined with the N-mer word preceding it, as well as with the N-mer word succeeding it, to create 2 * N-mer hash keys. These keys populate the remaining two components, a left hash and a right hash.
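A toy Python model may help make these four components concrete. It is a sketch under stated assumptions, not Karma's implementation: it uses dictionaries instead of packed tables, the function name is hypothetical, and it assumes the left hash key is the preceding word followed by the word, and the right hash key is the word followed by the succeeding word.

```python
from collections import defaultdict

def build_toy_index(genome: str, word_size: int, cutoff: int):
    """Return (positions, high_repeat, left_hash, right_hash) for a small genome string."""
    # Index plus word positions table: word -> ordered list of genome positions.
    positions = defaultdict(list)
    for pos in range(len(genome) - word_size + 1):
        positions[genome[pos:pos + word_size]].append(pos)

    # Words whose occurrence count crosses the cutoff are marked high repeat.
    high_repeat = {w for w, p in positions.items() if len(p) > cutoff}

    left_hash = defaultdict(list)
    right_hash = defaultdict(list)
    for word in high_repeat:
        # High repeat words leave the index and move into the two hashes.
        for pos in positions.pop(word):
            if pos - word_size >= 0:  # a full preceding word exists
                left_hash[genome[pos - word_size:pos] + word].append(pos)
            if pos + 2 * word_size <= len(genome):  # a full succeeding word exists
                right_hash[word + genome[pos + word_size:pos + 2 * word_size]].append(pos)
    return dict(positions), high_repeat, dict(left_hash), dict(right_hash)

index, repeats, left, right = build_toy_index("ACACACACGT", word_size=2, cutoff=3)
print(repeats, left, right)
```

In this example the word AC occurs four times, crosses the cutoff of 3, and is looked up through the 4-mer keys instead of the plain index.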
Index Word Size
Choosing an appropriate word size for larger genomes is critical to performance. The easiest case is Illumina base space reads with the human genome (3 Gbases), where the default 15-mer word size is fine.
For smaller genomes, consider using a smaller word size. Genomes smaller than a few million bases should be perfectly fine with a word size of 11 or 12.
Since the primary index table into the word positions table is 4^(wordsize) * 4 bytes, it grows rapidly with word size. All else being equal, a smaller word size leads to longer sets of word positions for each index value. Each increment of word size approximately quadruples storage requirements and halves runtime. Similarly, each decrement of word size reduces the index table size by 75% and doubles runtime. These approximations are old, but serve as a useful rule of thumb.
For ABI SOLiD reads, the word size is critical, due to the shorter length of reads as compared to Illumina or LS 454.
A good word size is about 1/4 of the minimum expected read length. It must also be no more than 1/2 of the minimum expected read length, since at least 2 full words must exist in the read.
So for 48-mer reads, a reasonable word size is 12. The base space default of 15 also works, but with a smaller word size Karma can take advantage of a higher number of index words per read, yielding substantial speedups even with the shorter read. Similarly, 52-mer reads would map better with a 13-mer word size, and 56-mer reads would map best with a 14-mer word size.
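The rule of thumb above can be written as a small Python helper. The function name is hypothetical, and clamping to the 10-16 range is an assumption based on the "typically 10-16" note in the build options, not a documented Karma behavior.

```python
def suggest_word_size(min_read_length: int, lo: int = 10, hi: int = 16) -> int:
    """Suggest a word size of about 1/4 of the read length, never more than 1/2 of it."""
    if min_read_length < 2:
        raise ValueError("read must be long enough to hold two words")
    quarter = min_read_length // 4
    half = min_read_length // 2  # at least 2 full words must fit in the read
    return max(1, min(max(quarter, lo), hi, half))

for read_length in (48, 52, 56):
    print(read_length, suggest_word_size(read_length))
```

For 48-mer, 52-mer and 56-mer reads this gives 12, 13 and 14, matching the values in the text.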
The occurrence cutoff value determines how quickly an N-mer pattern is declared to be high repeat and left out of the index in favor of a hash. The default value of 5000 seems adequate for Illumina reads with the human genome. If ultimate performance is necessary, some experimentation is called for with this value.
Karma uses memory mapped files to share the potentially large reference index and hash data structures.
Karma uses this to great effect on our machines with 8 processors and hyperthreading enabled: 16 copies of karma can share one reference index and hash, yielding a very acceptable memory per CPU ratio of around 1GB/CPU.
A problem with large reference index and hash data structures is that they are more prone to being paged out. On a shared machine under heavy use, even simple disk I/O causes memory pages to be reclaimed, so that Karma's mapped data gets swapped out.
While Karma can recover on its own, it is best either to run production jobs on dedicated machines, or to run a program such as the utility mapfile found in the utilities sub-folder. This program continually touches each page of the data structures in sequential order, forcing them to the head of the disk buffer pool so they don't get aged out of the queue.
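The page-touching idea can be sketched in Python with the standard mmap module. This is not the mapfile utility itself, only a minimal illustration of the technique: the function name is hypothetical, it makes a single pass, and it assumes a non-empty file.

```python
import mmap

def touch_pages(path: str) -> int:
    """Read one byte from every page of a file through a read-only memory mapping."""
    touched = 0
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mapping an empty file raises an error.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for offset in range(0, len(m), mmap.PAGESIZE):
                _ = m[offset]  # the read faults the page in if it is not resident
                touched += 1
    return touched
```

To keep pages resident over time, a caller would repeat touch_pages on each index and hash file in a loop, with a short sleep between passes.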
Modifying the Reference Header
NB: This feature is not yet complete
To facilitate SAM RG values being set automatically in a production environment, we keep a header in the binary version of the reference. The header can be viewed and edited using the header subcommands here.
To view the header:
karma header -r phiX.fa
To view and edit the header:
karma header -r phiX.fa -e
Other test and check capabilities
Due to the size and complexity of Karma input, output and index files, various checks and tests are useful, so we include some diagnostics capabilities:
Tests for external files:
karma check [options...] file.bam file.fastq file.sam file.fa file.umfa
Tests internal to Karma:
karma test [options...]
-d -> debug
-s [int] -> set random number seed
Karma File structure
Upon successfully building references, you will obtain a list of reference files like below:
Word Hash (Left)
Word Hash (Right)