Karma

From Genome Analysis Wiki
Revision as of 14:04, 8 April 2010 by Pha (talk | contribs) (lots of changes for karma 0.9)


K-tuple Alignment with Rapid Matching Algorithm

Karma uses an existing reference to align short reads, such as those generated by Illumina sequencers.

The current version, 0.9.0, is optimized to rapidly map base space reads from Illumina sequencers. This version does not map color space reads, nor does it reliably map LS454 reads. Both of those features will return in Karma 0.9.1.

Download Karma

To get a copy, go to Karma Download.

Build Karma

Dependencies

Building

Testing the build

Normal Workflow

Karma works using a set of index and hash files created from an existing reference. Once created, this set of reference index and hash files must always be specified in the command line when aligning reads.

In concept, the simplest workflow is to first create a reference index using 'karma create', then align reads using 'karma map'. You only have to build the index and hash once.

Because the reference can be large, and because Karma will share the reference among many running instances of Karma, it is useful to put well known references in a common location readily accessible to you and your collaborators.
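The two-step workflow described above can be sketched as a pair of shell functions. This is a minimal sketch, not an official wrapper: the function names build_reference and map_reads and the KARMA variable are hypothetical, and only flags documented on this page (-i, -r, -o) are used.

```shell
#!/bin/sh
# KARMA (hypothetical variable) names the karma executable; override it
# if karma is not on your PATH.
KARMA=${KARMA:-karma}

# Step 1, done once per reference: build the binary reference plus index/hash.
# Argument: reference FASTA file.
build_reference() {
    "$KARMA" create -i "$1"
}

# Step 2, repeated per sample: align reads against the prebuilt reference.
# Arguments: reference FASTA, output SAM, mate 1, optional mate 2.
map_reads() {
    ref=$1
    out=$2
    shift 2
    "$KARMA" map -r "$ref" -o "$out" "$@"
}
```

For example, build_reference phiX.fa followed by map_reads phiX.fa phiX.sam mate1.fastq.gz mate2.fastq.gz issues the same commands as the simple phiX examples on this page.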

Build Reference

Building a reference with Karma is straightforward, but because it is time-consuming for longer genomes, you typically save the reference index between runs.

The simplest example of creating a reference and index with a word size of 11 is:

karma create -i -w 11 phiX.fa

More generally, three primary parameters are necessary for building a Karma reference index:

  1. a boolean flag indicating base or color space
  2. the index table word occurrence cutoff value
  3. the word size

Although the input reference is always expected to be base space and in FASTA format, the binary version of the reference, and the corresponding index and hash files, can be in either color space (ABI SOLiD) or base space (Illumina or LS454). For a given reference FASTA file, you may have either a color or base space binary reference, as well as either color or base space index/hash files.

Because the index and hash files are dependent on the occurrence cutoff parameter and the word size, the output files created by karma have those values in the file name. This allows you to create a variety of index/hash tables, depending on your expected use (ABI SOLiD, in particular, is sensitive to read length).
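The naming scheme can be illustrated with a small shell helper. This is a sketch only: the pattern (prefix, -bs or -cs, word size, cutoff, extension) is inferred from the example file names in the File structure section of this page rather than from a specification, and karma_reference_files is a hypothetical name.

```shell
# Print the file names expected for one reference build.
# Arguments: prefix (e.g. NCBI37), space (bs or cs), word size, occurrence cutoff.
karma_reference_files() {
    prefix=$1
    space=$2
    word=$3
    cutoff=$4
    # In the example listing, the binary reference name carries only the space.
    printf '%s\n' "${prefix}-${space}.umfa"
    # Index (umwiwp, umwihi) and hash (umwhl, umwhr) names embed both values.
    for ext in umwiwp umwihi umwhl umwhr; do
        printf '%s\n' "${prefix}-${space}.${word}.${cutoff}.${ext}"
    done
}

# Example: names for a base space build with the default word size and cutoff.
karma_reference_files NCBI37 bs 15 5000
```

Because the word size and cutoff appear in every index and hash name, builds with different settings can sit side by side in the same directory.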

Options for building reference

-w word size          Word size for index and hash (default 15, typically 10-16)
-O occurrence cutoff  Maximum number of word positions to store in the word positions table (default 5000)
-c                    Create a color space reference and index/hash
-i                    Create the index and hash as well as the binary reference

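As a sketch of how these options combine, the helper below builds an index for a chosen word size and occurrence cutoff. The function name build_index and the KARMA variable are hypothetical; the flags are the ones listed above, and the fallback values mirror the documented defaults (15 and 5000).

```shell
# KARMA (hypothetical variable) names the karma executable.
KARMA=${KARMA:-karma}

# Build the binary reference plus index/hash with explicit settings.
# Arguments: reference FASTA, word size (default 15), occurrence cutoff (default 5000).
build_index() {
    ref=$1
    word=${2:-15}
    cutoff=${3:-5000}
    "$KARMA" create -i -w "$word" -O "$cutoff" "$ref"
}
```

A loop such as `for w in 11 13 15; do build_index phiX.fa "$w"; done` would then request one index/hash set per word size.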

Options

Command line

Usage:

Karma expects the subcommand to be the first argument on the command line. The available subcommands are: map, create, header, check and test.

To align reads, you first create an index:

karma create [options...] somereference.fa

A simple example is:

karma create -i phiX.fa

To actually align reads, use the map command:

karma map [options...] mate1.fastq.gz [mate2.fastq.gz]

A simple example is:

karma map -r phiX.fa -o phiX.sam mate1.fastq.gz mate2.fastq.gz

To facilitate SAM RG values being set automatically in a production environment, we keep a header in the reference. The header can be viewed and edited using the header subcommand:

karma header -r phiX.fa

Due to the size and complexity of Karma input, output and index files, various checks and tests are useful, so we include some diagnostic capabilities:

Tests for external files:

karma check [options...] file.bam file.fastq file.sam file.fa file.umfa

Tests internal to Karma:

karma test [options...]
-d -> debug
-s [int] -> set random number seed [12345]

File structure

After successfully building references, you will have a set of reference files like the following:

                   Base Space                Color Space
Reference genome   NCBI37-bs.umfa            NCBI37-cs.umfa
Word Index         NCBI37-bs.15.5000.umwiwp  NCBI37-cs.15.5000.umwiwp
                   NCBI37-bs.15.5000.umwihi  NCBI37-cs.15.5000.umwihi
Word Hash (Left)   NCBI37-bs.15.5000.umwhl   NCBI37-cs.15.5000.umwhl
Word Hash (Right)  NCBI37-bs.15.5000.umwhr   NCBI37-cs.15.5000.umwhr
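Before launching a long mapping run, it can help to confirm that a complete file set is present. The sketch below prints any expected file that is missing; it assumes the naming pattern shown in the listing above, and the function name missing_reference_files is hypothetical.

```shell
# Print each expected reference file that is absent from a directory.
# Arguments: directory, prefix (e.g. NCBI37), space (bs or cs), word size, cutoff.
missing_reference_files() {
    dir=$1
    base="$2-$3"
    for name in "$base.umfa" \
                "$base.$4.$5.umwiwp" "$base.$4.$5.umwihi" \
                "$base.$4.$5.umwhl" "$base.$4.$5.umwhr"; do
        # Report the name only when no regular file exists at that path.
        [ -f "$dir/$name" ] || printf '%s\n' "$name"
    done
}
```

An empty result from missing_reference_files /path/to/refs NCBI37 bs 15 5000 would indicate that all five files for that build are in place; the directory path here is a placeholder.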



Align Illumina Reads

Command line:

karma map -r reference.fa -o output.sam read1.fastq read2.fastq

Align ABI SOLiD Reads

Command line:

karma map -r reference.fa -c -o output.sam read1.fastq read2.fastq

Other useful links

Introduction of BWA usage

Heng Li's thoughts about aligner

Benchmark of Dictionary Structures