'''K-tuple Alignment with Rapid Matching Algorithm'''
Karma uses an existing reference to align short reads, such as those generated by Illumina sequencers.
= Build Karma =
Karma requires that the following Debian packages be installed on the host Linux machine:
Assuming the karma tar file is named karma.tgz, do the following:
tar xvzf karma.tgz
cp karma/karma ~/bin
Alternatively, if you want to share the karma binary, install it in /usr/local/bin/karma.
== Testing the build ==
To test karma, go to the build tree subdirectory named ''karma'', and type the command:
The test script builds a reference for the small phiX genome, then runs single end as well as paired end alignments.
It compares the results with known good results. Differences are printed to the console, and currently look something like this:
<pre>diff phiX.sam.good phiX.sam</pre>
= Normal Workflow =
Karma works using a set of index and hash files created from an existing reference.
Once created, this set of reference index and hash files must always be specified on the command line when aligning reads.
In concept, the simplest workflow is to first create a reference index using ''karma create'', then align reads using ''karma map''.
You only have to build the index and hash once.
Because the reference can be large, and because Karma shares the reference among many running instances of Karma, it is useful to put well-known references in a common location readily accessible to you and your collaborators.
= Build reference index and hash =
Building a reference index and hash with Karma is straightforward, but because it is time consuming for longer genomes, you typically save the reference index between runs.
The simplest example for creating a reference and index using a word size of 11 is:
karma create -i -w 11 phiX.fa
More generally, three primary parameters are necessary for building a Karma reference index:
# a boolean flag indicating base or color space
# the index table word occurrence cutoff value
# the word size
Although the input reference is always expected to be base space and in [http://en.wikipedia.org/wiki/FASTA_format FASTA] format, the binary version of the reference, and the corresponding index and hash files, can be in either color space (ABI SOLiD) or base space (Illumina or LS 454).
For a given reference [http://en.wikipedia.org/wiki/FASTA_format FASTA] file, you may have either a color or base space binary reference, as well as either color or base space index/hash files, each with its own word size and occurrence cutoff.
Because the index and hash files are dependent on the occurrence cutoff parameter and the word size, the output files created by karma have those values in the file name.
This allows you to create a variety of index/hash tables, depending on your expected use (ABI SOLiD, in particular, is sensitive to read length).
== Options for building reference index and hash ==
-r ''reference'' Reference file in [http://en.wikipedia.org/wiki/FASTA_format FASTA] format
-i Create the index and hash as well as the binary reference
= Aligning Reads =
Aligning reads to the reference is easy:
karma map -r phiX.fa -w 11 phiX.fastq
or for paired reads:
karma map -r phiX.fa -w 11 phiX-mate1.fastq phiX-mate2.fastq
In both of the above examples, the -r option names the reference originally used to build the index/hash, and the -w 11 specifies that we are using the index/hash built for 11-mer words.
Although you can use the default word size of 15 for phiX, the index table is then 4^15 entries * 4 bytes = 4 GBytes, so a shorter word size is prudent.
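The arithmetic behind that figure can be checked with a short sketch. It assumes, as the 4^15 * 4 figure implies, one 4-byte entry per possible N-mer over a four-letter alphabet; the function name ''index_table_bytes'' is hypothetical and is not part of Karma.

```python
def index_table_bytes(word_size: int, bytes_per_entry: int = 4) -> int:
    """Size of a direct-addressed index table with one entry per possible N-mer.

    Assumption: four bases, so there are 4**word_size distinct words,
    and each table entry occupies bytes_per_entry bytes.
    """
    return 4 ** word_size * bytes_per_entry


for w in (11, 12, 15):
    size = index_table_bytes(w)
    print(f"word size {w}: {size:,} bytes ({size / 2**20:,.0f} MiB)")

# word size 11: 16,777,216 bytes (16 MiB)
# word size 12: 67,108,864 bytes (64 MiB)
# word size 15: 4,294,967,296 bytes (4,096 MiB)
```

Under these assumptions, an 11-mer index table for phiX needs about 16 MiB rather than 4 GBytes. This counts only the index table itself, not the word positions table or the hashes.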
== Aligning Reads (Illumina) ==
Karma is set up so that the default options work well for mapping Illumina reads to the human genome.
== Aligning Reads (ABI SOLiD) ==
Karma has been designed to align color space reads.
However, in Karma 0.9.0, this functionality is not working.
== Aligning Reads (LS 454) ==
Karma has been designed to align LS 454 reads.
However, in Karma 0.9.0, this functionality is not working.
= Karma Performance Tuning =
There are four components to the Karma index and hash.
A pure index array, based on an N-mer word index. This is used as a pointer into a word positions table, which is an ordered list of genome positions in which that N-mer word appears. There is a cap called the ''occurrence cutoff'', which once exceeded, causes that index word to be marked as a high repeat pattern. Once marked as high repeat, the N-mer word is instead combined with both the N-mer word preceding it, as well as the N-mer word succeeding it to create a 2 * N-mer word hash key. Two hash tables are populated, a left and a right hash. These are then used when that pattern is found in a read.
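The description above can be illustrated with a toy sketch. This is not Karma's implementation: it uses Python dictionaries instead of packed arrays, and it assumes that the left hash is keyed by the preceding word plus the repeated word, and the right hash by the repeated word plus the succeeding word. The function name ''build_tables'' and its parameters are hypothetical.

```python
from collections import defaultdict


def build_tables(genome: str, word_size: int, cutoff: int):
    """Toy index / word-positions / left-right hash construction.

    Assumption: a word is "high repeat" when it occurs more than
    `cutoff` times; such words are left out of the index and are
    looked up through 2 * word_size keys instead.
    """
    # Word positions: ordered list of genome positions for every N-mer.
    positions = defaultdict(list)
    for pos in range(len(genome) - word_size + 1):
        positions[genome[pos:pos + word_size]].append(pos)

    index = {}
    left_hash = defaultdict(list)
    right_hash = defaultdict(list)
    for word, hits in positions.items():
        if len(hits) <= cutoff:
            index[word] = hits  # ordinary word: index points at its positions
            continue
        # High repeat word: extend it by a neighbouring word on each side.
        for pos in hits:
            if pos - word_size >= 0:
                left_key = genome[pos - word_size:pos + word_size]
                left_hash[left_key].append(pos)
            if pos + 2 * word_size <= len(genome):
                right_key = genome[pos:pos + 2 * word_size]
                right_hash[right_key].append(pos)
    return index, dict(left_hash), dict(right_hash)


index, left_hash, right_hash = build_tables("ACACACACGT", word_size=2, cutoff=2)
print(index)               # {'CG': [7], 'GT': [8]}
print(right_hash["ACGT"])  # [6]
```

In this toy genome, ''AC'' occurs four times and exceeds the cutoff of 2, so it is absent from the index. The longer key ''ACGT'' occurs only once, which is what makes the hash lookup more selective than the repeated word on its own.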
== Index Word Size ==
Choosing an appropriate word size for larger genomes is critical to performance.
The easiest case is for Illumina base space reads with the human genome (3 Gbases), where the default 15-mer word size is fine.
For smaller genomes, consider using a smaller word size.
Genomes smaller than a few million bases should be perfectly fine with a word size of 11 or 12.
Since the primary index table into the word positions table is 4^(wordsize) * 4 bytes, it can grow large rapidly.
All else being equal, a smaller word size leads to longer sets of word positions for each index value. Each increment of word size approximately quadruples storage requirements, and halves runtime. Similarly, each decrement of word size reduces the index table size by 75%, and doubles runtime. These approximations are old, but serve as a useful rule of thumb.
For ABI SOLiD reads, the word size is critical, due to the shorter length of reads as compared to Illumina or LS 454.
The optimal word size is about 1/4 of the minimum expected read length.
It must also be no more than 1/2 of the minimum expected read length, since at least 2 full words must fit in the read.
So for 48-mer reads, a reasonable value of word size is 12.
Although the base space default of 15 also works, a smaller word size lets Karma take advantage of a higher number of index words per read, yielding substantial speedups even with the shorter reads. Similarly, 52-mer reads would map better with a 13-mer word size, and 56-mer reads would map best with a 14-mer word size.
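The read length examples above follow from simple integer arithmetic, sketched here. The helper names are hypothetical, and counting "index words per read" as non-overlapping full words is an assumption made for illustration only.

```python
def suggested_word_size(read_length: int) -> int:
    """Rule of thumb from the text: about 1/4 of the read length."""
    return read_length // 4


def full_words_per_read(read_length: int, word_size: int) -> int:
    """Number of non-overlapping full words that fit in one read."""
    return read_length // word_size


for read_length in (48, 52, 56):
    w = suggested_word_size(read_length)
    print(
        f"{read_length}-mer reads: word size {w} gives "
        f"{full_words_per_read(read_length, w)} words per read, "
        f"default 15 gives {full_words_per_read(read_length, 15)}"
    )

# 48-mer reads: word size 12 gives 4 words per read, default 15 gives 3
# 52-mer reads: word size 13 gives 4 words per read, default 15 gives 3
# 56-mer reads: word size 14 gives 4 words per read, default 15 gives 3
```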
== Occurrence Cutoff ==
The occurrence cutoff value determines how quickly an N-mer pattern is declared to be ''high repeat'' and left out of the index in favor of a hash.
The default value of 5000 seems adequate for Illumina reads with the human genome. If ultimate performance is necessary, some experimentation is called for with this value.
== Shared Memory ==
Karma uses memory mapped files to share the potentially large reference index and hash data structures.
Karma uses this to great effect on our machines with 8 processors and hyperthreading enabled.
16 copies of karma can share one reference index and hash, yielding a very acceptable memory per CPU ratio of around 1GB/CPU.
A problem with large reference index and hash data structures is that they are more prone to being paged out.
On a shared machine that is being used extensively, even for just simple disk I/O, memory pages are reclaimed and Karma's data structures can be paged out.
While Karma can recover on its own, it is best to either run in a production manner on dedicated machines, or to run a program such as the utility ''mapfile'' found in the utilities sub-folder.
This program continually touches each page of the data structures in sequential order, forcing them to the head of the disk buffer pool, so they don't get aged out of the queue.
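The page-touching idea can be sketched in a few lines. This is not the source of the ''mapfile'' utility, only an illustration of the technique using Python's standard ''mmap'' module; the function name ''touch_pages'' is hypothetical.

```python
import mmap


def touch_pages(path: str) -> int:
    """Read one byte from every page of a file, in sequential order.

    Returns the number of pages touched. The file must not be empty,
    because an empty file cannot be memory mapped.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            touched = 0
            for offset in range(0, len(mapped), mmap.PAGESIZE):
                _ = mapped[offset]  # reading a byte forces the page to be resident
                touched += 1
            return touched
```

To approximate the continual behaviour described above, such a function would be called in a loop with a short pause between passes, once for each index and hash file that should stay resident.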
= Modifying the Reference Header =
''NB: This feature is not yet complete''
To facilitate SAM RG values being set automatically in a production environment, we keep a header in the binary version of the reference.
The header can be viewed and edited using the ''header'' subcommand.
To view the header:
karma header -r phiX.fa
To view and edit the header:
karma header -r phiX.fa -e
= Other test and check capabilities =
Tests for external files:
karma check [options...] file.bam file.fastq file.sam file.fa file.umfa
Tests internal to Karma:
karma test [options...]
> debug
-s [int] -> set random number seed
= File structure =
Upon successfully building references, you will obtain a list of reference files like below:
= Karma CHANGELOG =
bioinfo.shtml Heng Li's thoughts about aligner]
[http://lh3lh3.users.sourceforge.net/udb.shtml Benchmark of Dictionary Structures]