Changes

From Genome Analysis Wiki
Jump to navigationJump to search
12,612 bytes added ,  11:10, 2 February 2017
Line 1: Line 1: −
Karma is top secret. Shh!
+
<!--        BANNER ACROSS TOP OF PAGE        -->
 +
{| style="width:100%; background:#ffb6c1; margin-top:1.2em; border:1px solid #ccc;" |
 +
| style="width:100%; text-align:center; white-space:nowrap; color:#000;" |
 +
<div style="font-size:162%; border:none; margin:0; padding:.1em; color:#000;">KARMA is obsolete and not maintained</div>
 +
|}
   −
* Download
     −
To get a bootleg copy go to [http://www.sph.umich.edu/csg/pha/karma/download/ Karma Download]
+
'''K-tuple Alignment with Rapid Matching Algorithm'''
   −
* Command line
+
Karma uses an existing reference to align short reads, such as those generated by Illumina sequencers.
   −
<pre>
+
Primary features:  
Usage:
  −
  karma88 create [options...]
  −
  karma88 map [options...] file1.fastq.gz [file2.fastq.gz]
  −
Diagnostics:
  −
  karma88 check [options...]
  −
  karma88 test [options...]
  −
  -d -> debug
  −
  -s [int] -> set random number seed [12345]
     −
Defaults:
+
#High performance, high sensitivity
 +
#Large and small gap detection by default
 +
#Multiple gaps per read by default
 +
#Single or paired end reads
 +
#No read length limit
 +
#Quality scores are used to assess quality of maps
 +
#All potential locations are examined exhaustively, none are omitted
 +
#Reasonable memory per CPU ratio on high core count machines
   −
debug off (default off)
+
The current version, 0.9.0, is optimized to rapidly map base space reads from Illumina sequencers.
seed 12345 (default 12345)
     −
</pre>
+
Color space and LS454 sequence alignments are not currently supported. These features will return in Karma 0.9.1.
   −
* File strucutre
+
= Download Karma  =
   −
Color space mapping needs 6 files, and base space mapping needs 5 files (only need to use 1 reference genome).
+
To get a copy go to [http://csg.sph.umich.edu//pha/karma/download/ Karma Download]
   −
Ue color space as an example, and say we use NCBI36.fa as reference genome:
+
= Build Karma  =
   −
basespace.umfa    => NCBI36.bs.umfa
+
Karma is designed to be reasonably portable.  
colorspace.umfa    => NCBI36.cs.umfa
  −
colorspace.umwhr    => NCBI36.cs.12.umwhr
  −
colorspace.umwhl    => NCBI36.cs.12.umwhl
  −
colorspace.umwihi    => NCBI36.cs.12.umwihi
  −
colorspace.umwiwp    => NCBI36.cs.12.umwiwp
     −
* Other useful links:
+
However, since development occurs only on Ubuntu 9.10 x86 and x64 platforms, there are likely other portability issues.
   −
[http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_li.pdf Introduction of BWA usage]
+
We support Karma only on Ubuntu 9.10 on 64-bit processors.  
   −
[http://lh3lh3.users.sourceforge.net/bioinfo.shtml Heng Li's thoughts about aligner]
+
== Dependencies  ==
   −
[http://lh3lh3.users.sourceforge.net/udb.shtml Benchmark of Dictionary Structures]
+
Karma requires that the following debian packages be installed on the host Linux machine:
 +
 
 +
#libssl-dev
 +
#zlib1g-dev
 +
 
 +
Without these installed, Karma will not build.
 +
 
 +
== Building  ==
 +
 
 +
Assuming the karma tar file is named karma.tgz, do the following
 +
 
 +
tar xvzf karma.tgz
 +
cd karma-0.9
 +
make
 +
mkdir ~/bin
 +
cp karma/karma ~/bin
 +
 
 +
Alternatively, if you want to share the karma binary install it in /usr/local/bin/karma.
 +
 
 +
== Testing the build  ==
 +
 
 +
To test karma, go to the build tree subdirectory named ''karma'', and type the command:
 +
 
 +
make test
 +
 
 +
The test script builds a reference for the small phiX genome, then runs single end as well as paired end alignments. It compares the results of that with known results. Differences are printed to the console, and currently look something like this:
 +
<pre>diff phiX.sam.good phiX.sam
 +
3c3
 +
&lt; @RG DT:2010-04-08T17:29Z ID:boingboing SM:NA12345
 +
---
 +
&gt; @RG DT:2010-04-08T18:13Z ID:boingboing SM:NA12345
 +
</pre>
 +
Any differences greater than that are an error and need to be fixed by the author.
 +
 
 +
= Normal Workflow  =
 +
 
 +
Karma works using a set of index and hash files created from an existing reference. Once created, this set of reference index and hash files must always be specified in the command line when aligning reads.
 +
 
 +
In concept, the simplest workflow is to first create a reference index using ''karma create'', then align reads using ''karma map''. You only have to build the index and hash once.
 +
 
 +
Because the reference can be large, and because Karma will share the reference among many running instances of Karma, it is useful to put well known references in a common location readily accessible to you and your collaborators.
 +
 
 +
= Build reference index and hash  =
 +
 
 +
Building a reference index and hash with Karma is straightforward, but because it is time consuming for longer genomes, you typically save the reference index between runs.
 +
 
 +
The simplest example for creating a reference and index using a wordsize of 11-mer words is:
 +
 
 +
karma create -i -w 11 phiX.fa
 +
 
 +
More generally, three primary parameters are necessary for building a Karma reference index:
 +
 
 +
#a boolean flag indicating base or color space
 +
#the index table word occurrence cutoff value
 +
#the word size
 +
 
 +
Although the input reference is always expected to be base space and in [http://en.wikipedia.org/wiki/FASTA_format FASTA] format, the binary version of the reference, and the corresponding index and hash files, can be in either color space (ABI SOLiD) or base space (Illumina or LS454). For a given reference [http://en.wikipedia.org/wiki/FASTA_format FASTA] file, you may have either a color or base space binary reference, as well as either color or base space index/hash files, any in varying word sizes or occurrence cutoffs.
 +
 
 +
Because the index and hash files are dependent on the occurrence cutoff parameter and the word size, the output files created by karma have those values in the file name. This allows you to create a variety of index/hash tables, depending on your expected use (ABI SOLiD, in particular, is sensitive to read length).
 +
 
 +
== Options for building reference index and hash  ==
 +
 
 +
-r ''reference''          Reference file in [http://en.wikipedia.org/wiki/FASTA_format FASTA] format
 +
-w ''word size''          Word size for index and hash (default 15, typically 10-16)
 +
-O ''occurrence cutoff''  Upper count of number of word positions to store in word positions table (default 5000)
 +
-c                        Creates a color space reference and index/hash
 +
-i                        Create the index and hash as well as the binary reference
 +
 
 +
<br>
 +
 
 +
= Aligning Reads  =
 +
 
 +
Aligning reads to the reference is easy:
 +
 
 +
karma map -r phiX.fa -w 11 phiX.fastq
 +
 
 +
or for paired reads:
 +
 
 +
karma map -r phiX.fa -w 11 phiX-mate1.fastq phiX-mate2.fastq
 +
 
 +
In both of the above examples, the -r option names the reference originally used to build the index/hash, and the -w 11 specifies that we are using the index/hash built for 11-mer words. Although you can use the default word size of 15 for phiX, the index is 4^15 * 4 = 4GBytes, so a shorter word size is prudent.
 +
 
 +
Since Karma uses the word size and occurrence cutoff to help construct the actual index and hash filenames, you must specify them the same way you did when you created the reference index and hash.
 +
 
 +
== Options for aligning reads  ==
 +
 
 +
  -a [int]    -&gt; maximum insert size
 +
  -B [int]    -&gt; max number of bases in millions
 +
  -E          -&gt; show reference bases (default off)
 +
  -H          -&gt; set SAM header line values (e.g. -H RG:SM:NA12345)
 +
  -o          -&gt; required output sam/bam file
 +
  -O [int]    -&gt; occurrence cutoff (default 5000)
 +
  -q          -&gt; quiet mode (no output except errors)
 +
  -r [name]    -&gt; required genome reference
 +
  -R [int]    -&gt; max number of reads
 +
  -w [int]    -&gt; index word size (default 15)
 +
 
 +
== Aligning Reads (Illumina)  ==
 +
 
 +
Karma is set up so that the default options work well for mapping Illumina reads to the Human genome.
 +
 
 +
== Aligning Reads (ABI SOLiD)  ==
 +
 
 +
Karma has been designed to align color space reads. However, in Karma 0.9.0, this functionality is not working.
 +
 
 +
== Aligning Reads (LS 454)  ==
 +
 
 +
Karma has been designed to align LS 454 reads. However, in Karma 0.9.0, this functionality is not working.
 +
 
 +
= Karma Performance Tuning  =
 +
 
 +
There are four components to the Karma index and hash. A pure index array, based on an N-mer word index. This is used as a pointer into a word positions table, which is an ordered list of genome positions in which that N-mer word appears. There is a cap called the ''occurrence cutoff'', which once exceeded, causes that index word to be marked as a high repeat pattern. Once marked as high repeat, the N-mer word is instead combined with both the N-mer word preceding it, as well as the N-mer word succeeding it to create a 2 * N-mer word hash key. Two hash tables are populated, a left and a right hash. These are then used when that pattern is found in a read.
 +
 
 +
== Index Word Size  ==
 +
 
 +
Choosing an appropriate word size for larger genome is critical to performance. The easiest case is for Illumina base space reads with the human genome (3Gbases), where the default 15-mer word size is fine.
 +
 
 +
For smaller genomes, consider using a smaller word size. Genomes smaller than a few million bases should be perfectly fine with a word size of 11 or 12.
 +
 
 +
Since the primary index table into the word positions table is 2^(wordsize) * 4 bytes, it can grow large rapidly. All else being equal, a smaller word size leads to longer sets of word positions for each index value. Each increment of word size approximately quadruples storage requirements, and halves runtime. Similarly, each decrement of word size reduces the index table size by 75%, and doubles runtime. These approximations are old, but serve a useful rule of thumb.
 +
 
 +
For ABI SOLiD reads, the word size is critical, due to the shorter length of reads as compared to Illumina or LS 454.
 +
 
 +
The optimal minimum word size is chosen such that it is 1/4 the minimum expected average read length. It also must be chosen to be 1/2 the minimum expected read length, since at least 2 full words must exist in the read.
 +
 
 +
So for 48-mer reads, a reasonable value of word size is 12. Although the base space default of 15 is fine, too, Karma is able to take advantage of a higher number of index words per read, yielding substantial speedups even with the shorter read. Similarly, 52-mer reads would map better with a 13-mer word size, and 56-mer reads would map best with a 14-mer word size.
 +
 
 +
== Occurrence Cutoff  ==
 +
 
 +
The occurrence cutoff value determines how quickly an N-mer pattern is declared to be ''high repeat'' and left out of the index in favor of a hash. The default value of 5000 seems adequate for Illumina reads with the human genome. If ultimate performance is necessary, some experimentation is called for with this value.
 +
 
 +
== Shared Memory  ==
 +
 
 +
Karma uses memory mapped files to share the potentially large reference index and hash data structures.
 +
 
 +
Karma uses this to great effect on our 8 processors with hyperthreading enabled. 16 copies of karma can share one reference index and hash, yielding a very acceptable memory per CPU ratio of around 1GB/CPU.
 +
 
 +
A problem with large reference index and hash data structures is that they are more prone to being paged out. On a shared machine that is being used extensively even just simple disk I/O, memory pages are being reclaimed such that Karma will become swapped out.
 +
 
 +
While Karma can recover on its own, it is best to either run in a production manner on dedicated machines, or to run a program such as the utility ''mapfile'' found in the utilities sub-folder. This program continually touches each page of the data structures in sequential order, forcing them to the head of the disk buffer pool, so they don't get aged out of the queue.
 +
 
 +
= Modifying the Reference Header  =
 +
 
 +
''NB: This feature is not yet complete''
 +
 
 +
To facilitate SAM RG values being set automatically in a production environment, we keep a header in the binary version of the reference. The header can be viewed and edited using the header subcommands here.
 +
 
 +
To view the header:
 +
 
 +
karma header -r phiX.fa
 +
 
 +
To view and edit the header:
 +
 
 +
karma header -r phiX.fa -e
 +
 
 +
= Optional flags  =
 +
 
 +
Besides conforming to SAM specification, Karma developed its own optional tags to&nbsp; help evaluate mapping.
 +
 
 +
{| cellspacing="1" cellpadding="1" border="1" style="width: 1034px; height: 291px;"
 +
|-
 +
| XA
 +
| Alignment PathTag
 +
|-
 +
| RG
 +
| Sample GroupID
 +
|-
 +
| HA
 +
| numMatchContributors (possible hits checked by karma)
 +
|-
 +
| UQ
 +
| phred quality of this read, assuming it is mapped correctly
 +
|-
 +
| NB
 +
| Number of equally best hits
 +
|-
 +
| ER
 +
|
 +
Different kind of errors when mapping:
 +
 
 +
ER:Z:no_match -&gt; UNSET_QUALITY<br>ER:Z:invalid_bases -&gt; INVALID_DATA (not used yet)<br>ER:Z:duplicates -&gt; EARLYSTOP_DUPLICATES ( early stop due to reaching max number of best matches, with all bases matched)<br>ER:Z:quality -&gt; EARLYSTOP_QUALITY (early stop due to reaching max posterior quality)<br>ER:Z:repeats -&gt; REPEAT_QUALITY (not used yet)<br>
 +
 
 +
|}
 +
 
 +
<br>
 +
 
 +
<br>
 +
 
 +
= Other test and check capabilities  =
 +
 
 +
Due to the size and complexity of Karma input, output and index files, various checks and tests are useful, so we include some diagnostics capabilities:
 +
 
 +
Tests for external files:
 +
 
 +
karma check [options...] file.bam file.fastq file.sam file.fa file.umfa
 +
 
 +
Tests internal to Karma:
 +
 
 +
karma test [options...]
 +
-d -&gt; debug
 +
-s [int] -&gt; set random number seed [12345]
 +
 
 +
= Karma file structure  =
 +
 
 +
Upon successfully building references, you will obtain a list of reference files like below:
 +
 
 +
{| cellspacing="1" cellpadding="1" border="1" width="571" style="width: 571px; height: 288px;"
 +
|-
 +
|
 +
|
 +
Base Space
 +
 
 +
| Color Space
 +
|-
 +
|
 +
Reference genome
 +
 
 +
|
 +
NCBI37-bs.umfa
 +
 
 +
| NCBI37-cs.umfa
 +
|-
 +
|
 +
Word Index
 +
 
 +
|
 +
NCBI37-bs.15.5000.umwiwp
 +
 
 +
NCBI37-bs.15.5000.umwihi
 +
 
 +
|
 +
NCBI37-cs.15.5000.umwiwp
 +
 
 +
NCBI37-cs.15.5000.umwihi
 +
 
 +
|-
 +
|
 +
Word Hash (Left)
 +
 
 +
|
 +
NCBI37-bs.15.5000.umwhl
 +
 
 +
| NCBI37-cs.15.5000.umwhl
 +
|-
 +
|
 +
Word Hash (Right)
 +
 
 +
|
 +
NCBI37-bs.15.5000.umwhr
 +
 
 +
| NCBI37-cs.15.5000.umwhr
 +
|}
 +
 
 +
<br>
 +
 
 +
<br>
 +
 
 +
= Karma TODO List  =
 +
 
 +
#command line help is muddled up - UserOptions.h needs work
 +
#color space read handling needs to be re-integrated and tested
 +
#LS 454 needs to be tested, and the code adapted
 +
#pre-process some number of records to establish an appropriate max insert size
 +
#finish reference header view/edit code
 +
#investigate and document maximum memory use during ''create'' sub-command
 +
#finish and improve check and test commands
 +
 
 +
= Karma CHANGELOG  =
 +
 
 +
*Karma 0.9.0
 +
 
 +
* reference may now contain an arbitrary number of chromosomes
 +
* local re-alignment is drastically improved (handle small and large gaps better)
 +
* command line is re-vamped - now easier to use
 +
* create/naming/using reference index and hashes is easier
 +
 
 +
*Karma 0.8.8S
 +
 
 +
* add first version of local re-alignment
 +
* bump max number of chromosomes to 200
 +
* we no longer do Smith-Waterman on each candidate location
 +
 
 +
*Karma 0.8.8
 +
*Karma 0.8.6
 +
 
 +
= Other useful links  =
 +
 
 +
[http://lh3lh3.users.sourceforge.net/bioinfo.shtml Heng Li's thoughts about aligners]
 +
 
 +
[http://lh3lh3.users.sourceforge.net/udb.shtml Benchmark of Dictionary Structures]
 +
 
 +
[[Category:Software]]
96

edits

Navigation menu