Difference between revisions of "Karma"

From Genome Analysis Wiki
m (Testing the build: more detail)
(populate alignment section)

Revision as of 14:25, 8 April 2010


K-tuple Alignment with Rapid Matching Algorithm

Karma uses an existing reference to align short reads, such as those generated by Illumina sequencers.

The current version, 0.9.0, is optimized to rapidly map base space reads from Illumina sequencers.

Color space and LS 454 alignments do not work in this version. These features will return in Karma 0.9.1.

Download Karma

To get a copy, go to Karma Download.

Build Karma

Dependencies

Building

Testing the build

To test Karma, go to the subdirectory named karma and type the command:

make test

The test script builds a reference for the small phiX genome, then runs single-end and paired-end alignments and compares the output with known-good results. Differences are printed to the console, and currently look something like this:

diff phiX.sam.good phiX.sam 
3c3
< @RG	DT:2010-04-08T17:29Z	ID:boingboing	SM:NA12345
---
> @RG	DT:2010-04-08T18:13Z	ID:boingboing	SM:NA12345

Only the date (DT) field of the @RG header line differs here, and that is expected from one run to the next. Any other difference is an error that the author needs to fix.
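
Because the timestamp changes from run to run, the comparison can be scripted so that it ignores it. The following is a minimal sketch and not part of Karma: the helper names mask_rg_date and sam_same_except_date are hypothetical, and it assumes the @RG fields are tab-separated, as in the output above.

```shell
# Hypothetical helper: replace the DT (date) field of @RG header lines
# with a fixed placeholder so that two runs can be compared.
mask_rg_date() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^@RG/ { for (i = 1; i <= NF; i++) if ($i ~ /^DT:/) $i = "DT:*" }
         { print }' "$1"
}

# Hypothetical helper: succeed only if the two SAM files are identical
# once the @RG dates have been masked.
sam_same_except_date() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    mask_rg_date "$1" > "$tmp_a"
    mask_rg_date "$2" > "$tmp_b"
    diff "$tmp_a" "$tmp_b" > /dev/null
    rc=$?
    rm -f "$tmp_a" "$tmp_b"
    return $rc
}
```

With the file names from the test output above, it could be called as: sam_same_except_date phiX.sam.good phiX.sam && echo "only the date differs"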

Normal Workflow

Karma works using a set of index and hash files created from an existing reference. Once created, this set of reference index and hash files must always be specified on the command line when aligning reads.

In concept, the simplest workflow is to first create a reference index using karma create, then align reads using karma map. You only have to build the index and hash once.

Because the reference can be large, and because Karma will share the reference among many running instances of Karma, it is useful to put well-known references in a common location readily accessible to you and your collaborators.
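
The "build once, then reuse" idea can be sketched as a small wrapper. This is an illustration under stated assumptions, not a Karma feature: the function name build_index_once and the KARMA variable are hypothetical, and because the exact output file names are not assumed here, the caller passes the path of any one index file to use as a marker.

```shell
# Command to run; override with KARMA=echo for a dry run (assumption of this sketch).
KARMA=${KARMA:-karma}

# Hypothetical helper: build the index only if the marker file is missing.
#   $1 = reference FASTA, $2 = word size, $3 = path of one expected index file
build_index_once() {
    ref=$1
    word=$2
    marker=$3
    if [ -e "$marker" ]; then
        echo "index present, skipping build"
    else
        "$KARMA" create -i -w "$word" "$ref"
    fi
}
```

Keeping the marker path as an argument avoids hard-coding a file name and lets the same helper serve a shared reference directory.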

Build reference index and hash

Building a reference index and hash with Karma is straightforward, but because it is time-consuming for longer genomes, you typically save the reference index between runs.

The simplest example, creating a reference and index with a word size of 11, is:

karma create -i -w 11 phiX.fa

More generally, three primary parameters are necessary for building a Karma reference index:

  1. a boolean flag indicating base or color space
  2. the index table word occurrence cutoff value
  3. the word size

Although the input reference is always expected to be base space and in FASTA format, the binary version of the reference, and the corresponding index and hash files, can be in either color space (ABI SOLiD) or base space (Illumina or LS454). For a given reference FASTA file, you may have either a color or base space binary reference, as well as either color or base space index/hash files.

Because the index and hash files are dependent on the occurrence cutoff parameter and the word size, the output files created by karma have those values in the file name. This allows you to create a variety of index/hash tables, depending on your expected use (ABI SOLiD, in particular, is sensitive to read length).
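
As an illustration of this naming scheme, the sketch below prints the index and hash file names expected for a given setting. The pattern (prefix, "bs" or "cs", word size, cutoff, extension) is inferred from the example file names listed under File structure on this page and is not confirmed elsewhere; the function name karma_index_files is hypothetical.

```shell
# Hypothetical helper: list the index/hash file names for one configuration.
#   $1 = reference prefix, $2 = bs (base space) or cs (color space),
#   $3 = word size, $4 = occurrence cutoff
karma_index_files() {
    prefix=$1
    space=$2
    word=$3
    cutoff=$4
    # Extensions taken from the File structure listing: two word index
    # files, then the left and right word hash files.
    for ext in umwiwp umwihi umwhl umwhr; do
        printf '%s-%s.%s.%s.%s\n' "$prefix" "$space" "$word" "$cutoff" "$ext"
    done
}
```

For example, karma_index_files NCBI37 bs 15 5000 prints four names, the first being NCBI37-bs.15.5000.umwiwp.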

Options for building reference

-w word size          Word size for index and hash (default 15, typically 10-16)
-O occurrence cutoff  Maximum number of word positions to store in the word positions table (default 5000)
-c                    Create a color space reference and index/hash
-i                    Create the index and hash as well as the binary reference


Aligning Reads

Aligning reads to the reference is easy:

karma map -r phiX.fa -w 11 phiX.fastq

or for paired reads:

karma map -r phiX.fa -w 11 phiX-mate1.fastq phiX-mate2.fastq

In both of the above examples, the -r option names the reference originally used to build the index/hash, and -w 11 specifies that we are using the index/hash built for 11-mer words. Although you can use the default word size of 15 for phiX, the index then takes 4^15 * 4 bytes = 4 GBytes regardless of genome size, so a shorter word size is prudent for such a small genome.
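
The size arithmetic can be checked for other word sizes with a short sketch. It assumes, based only on the formula above, that the index holds 4^w entries of 4 bytes each; the function name index_bytes is hypothetical.

```shell
# Hypothetical helper: index size in bytes for word size $1,
# assuming 4^w entries of 4 bytes each (read off the formula above).
index_bytes() {
    word=$1
    entries=1
    i=0
    # Compute 4^word by repeated multiplication (POSIX sh has no ** operator).
    while [ "$i" -lt "$word" ]; do
        entries=$((entries * 4))
        i=$((i + 1))
    done
    echo $((entries * 4))
}

index_bytes 15   # 4294967296 bytes, i.e. 4 GBytes
index_bytes 11   # 16777216 bytes, i.e. 16 MBytes
```

Under this assumption, dropping the word size from 15 to 11 shrinks the index by a factor of 4^4 = 256.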

Aligning Reads (Illumina)

Karma is set up so that the default options work well for mapping Illumina reads to the Human genome.

Aligning Reads (ABI SOLiD)

Karma has been designed to align color space reads. However, in Karma 0.9.0, this functionality is not working.

Aligning Reads (LS 454)

Karma has been designed to align LS 454 reads. However, in Karma 0.9.0, this functionality is not working.

Options

Command line

Usage:

Karma expects the subcommand to be the first argument on the command line. The available subcommands are currently: map, create, header, check and test.

To align reads, you first create an index:

karma create [options...] somereference.fa

A simple example is:

karma create -i phiX.fa

To actually align reads, use the map command:

karma map [options...] mate1.fastq.gz [mate2.fastq.gz]

A simple example is:

karma map -r phiX.fa -o phiX.sam mate1.fastq.gz mate2.fastq.gz
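
When many samples are aligned, the choice between single-end and paired-end invocation can be wrapped in a helper. The sketch below is an illustration only: the function name map_reads, the KARMA variable, and the convention that mates are named NAME-mate1.fastq and NAME-mate2.fastq (borrowed from the paired example under Aligning Reads) are assumptions, and only the -r and -o options shown above are used.

```shell
# Command to run; override with KARMA=echo for a dry run (assumption of this sketch).
KARMA=${KARMA:-karma}

# Hypothetical helper: align one FASTQ file, adding its mate if one exists.
#   $1 = reference FASTA, $2 = FASTQ file (single-end, or mate 1 of a pair)
map_reads() {
    ref=$1
    reads=$2
    case $reads in
        *-mate1.fastq)
            base=${reads%-mate1.fastq}
            mate2=$base-mate2.fastq
            ;;
        *)
            base=${reads%.fastq}
            mate2=
            ;;
    esac
    # Use paired-end mode only when the second mate file is actually present.
    if [ -n "$mate2" ] && [ -e "$mate2" ]; then
        "$KARMA" map -r "$ref" -o "$base.sam" "$reads" "$mate2"
    else
        "$KARMA" map -r "$ref" -o "$base.sam" "$reads"
    fi
}
```

Running it with KARMA=echo prints the command that would be executed, which is a convenient way to check file pairing before starting a long alignment.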

To facilitate SAM RG values being set automatically in a production environment, we keep a header in the reference. The header can be viewed and edited using the header subcommand:

karma header -r phiX.fa

Due to the size and complexity of Karma input, output and index files, various checks and tests are useful, so we include some diagnostic capabilities:

Tests for external files:

karma check [options...] file.bam file.fastq file.sam file.fa file.umfa

Tests internal to Karma:

karma test [options...]
-d -> debug
-s [int] -> set random number seed [12345]

File structure

After the references have been built successfully, you will have a set of reference files like the following:

                     Base Space                  Color Space
Reference genome     NCBI37-bs.umfa              NCBI37-cs.umfa
Word Index           NCBI37-bs.15.5000.umwiwp    NCBI37-cs.15.5000.umwiwp
                     NCBI37-bs.15.5000.umwihi    NCBI37-cs.15.5000.umwihi
Word Hash (Left)     NCBI37-bs.15.5000.umwhl     NCBI37-cs.15.5000.umwhl
Word Hash (Right)    NCBI37-bs.15.5000.umwhr     NCBI37-cs.15.5000.umwhr



Align Illumina Reads

Command line:

karma map -r reference.fa -o output.sam read1.fastq read2.fastq

Align ABI SOLiD Reads

Command line:

karma map -r reference.fa -c -o output.sam read1.fastq read2.fastq

Other useful links

Introduction of BWA usage

Heng Li's thoughts about aligner

Benchmark of Dictionary Structures