Changes

From Genome Analysis Wiki
3,588 bytes added ,  15:03, 8 April 2010
fill in performance details
Line 21: Line 21:  
== Testing the build ==
 
== Testing the build ==
   −
To test karma, go to the subdirectory named karma, and type the command:
+
To test karma, go to the build tree subdirectory named ''karma'', and type the command:
    
  make test
 
  make test
Line 97: Line 97:     
Karma has been designed to align LS 454 reads.  However, in Karma 0.9.0, this functionality is not working.
 
Karma has been designed to align LS 454 reads.  However, in Karma 0.9.0, this functionality is not working.
 +
 +
= Karma Performance Tuning =
 +
 +
The Karma index and hash have four components.  The first is a pure index keyed on N-mer words.  Each index entry is a pointer into the second component, the word positions table, which is an ordered list of the genome positions at which that N-mer word appears.  There is a cap called the ''occurrence cutoff'' which, once crossed, causes that index word to be marked as a high repeat pattern.  Once marked as high repeat, the N-mer word is combined with the N-mer word preceding it and with the N-mer word succeeding it to create 2 * N-mer word hash keys.  These keys populate the remaining two components, a left hash and a right hash.
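The relationship between the index, the word positions table and the two hashes can be illustrated with a toy model.  This is only a sketch, not Karma's implementation: the function name ''build_index'', the use of Python dictionaries, and the exact way the preceding and succeeding words are joined into keys are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(genome, word_size, occurrence_cutoff):
    """Toy model (hypothetical): index, high repeat words, left and right hashes."""
    # Index plus word positions table: N-mer word -> ordered genome positions.
    positions = defaultdict(list)
    for i in range(len(genome) - word_size + 1):
        positions[genome[i:i + word_size]].append(i)

    # A word whose position count crosses the cutoff is marked high repeat.
    high_repeat = {w for w, p in positions.items() if len(p) > occurrence_cutoff}

    # Assumption: the left hash key is (preceding word + word) and the
    # right hash key is (word + succeeding word), each 2 * N bases long.
    left_hash = defaultdict(list)
    right_hash = defaultdict(list)
    for word in high_repeat:
        for pos in positions[word]:
            if pos - word_size >= 0:
                left_hash[genome[pos - word_size:pos + word_size]].append(pos)
            if pos + 2 * word_size <= len(genome):
                right_hash[genome[pos:pos + 2 * word_size]].append(pos)

    # High repeat words are left out of the plain index.
    index = {w: p for w, p in positions.items() if w not in high_repeat}
    return index, high_repeat, dict(left_hash), dict(right_hash)
```

With a short genome such as "AAAAAAGT", a word size of 2 and a cutoff of 3, the word "AA" is the only one that moves out of the index and into the hashes.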
 +
 +
== Index Word Size ==
 +
 +
Choosing an appropriate word size for larger genomes is critical to performance.  The easiest case is Illumina base space reads with the human genome (3 Gbases), where the default 15-mer word size is fine.
 +
 +
For smaller genomes, consider using a smaller word size.  Genomes smaller than a few million bases should be perfectly fine with a word size of 11 or 12.
 +
 +
Since the primary index table into the word positions table is 4^(wordsize) * 4 bytes (one 4-byte entry per possible word over the four bases), it can grow large rapidly.  All else being equal, a smaller word size leads to longer sets of word positions for each index value.  Each increment of word size approximately quadruples storage requirements and halves runtime.  Similarly, each decrement of word size reduces the index table size by 75% and doubles runtime.  These approximations are old, but serve as a useful rule of thumb.
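The storage rule of thumb above can be checked with a few lines of arithmetic.  This sketch assumes one 4-byte entry for every possible word over the four-letter DNA alphabet, i.e. 4^(wordsize) entries, which is the assumption that reproduces the quadrupling per increment described above.  The function name ''index_table_bytes'' is hypothetical, and the word positions table and hashes are not counted.

```python
def index_table_bytes(word_size, entry_bytes=4, alphabet_size=4):
    """Size of the primary index table only, under the stated assumptions."""
    return alphabet_size ** word_size * entry_bytes


if __name__ == "__main__":
    for word_size in (11, 12, 13, 14, 15):
        mib = index_table_bytes(word_size) / 2 ** 20
        print(f"word size {word_size}: {mib:.0f} MiB")
```

Under these assumptions the default 15-mer index table comes to 4 GiB, while an 11-mer table is only 16 MiB.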
 +
 +
For ABI SOLiD reads, the word size is critical, due to the shorter length of reads as compared to Illumina or LS 454.
 +
 +
The optimal minimum word size is about 1/4 of the minimum expected read length.  It must also be no larger than 1/2 of the minimum expected read length, since at least 2 full words must fit in the read.
 +
 +
So for 48-mer reads, a reasonable word size is 12.  The base space default of 15 works too, but with the smaller word size Karma is able to take advantage of a higher number of index words per read, yielding substantial speedups even with the shorter read.  Similarly, 52-mer reads would map better with a 13-mer word size, and 56-mer reads would map best with a 14-mer word size.
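The guideline of one quarter of the read length can be written as a small helper.  This is a sketch only: the function name ''suggested_word_size'' is hypothetical, and capping the result at the default of 15 for longer reads is an assumption, not documented Karma behaviour.

```python
def suggested_word_size(min_read_length, default=15):
    """Quarter of the shortest expected read, capped at the default word size."""
    quarter = min_read_length // 4
    # A quarter is never more than a half, so at least 2 full words
    # always fit in the read.
    return max(1, min(quarter, default))


if __name__ == "__main__":
    for read_length in (48, 52, 56):
        print(read_length, suggested_word_size(read_length))
```

For the read lengths discussed above this gives 12, 13 and 14 respectively.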
 +
 +
== Occurrence Cutoff ==
 +
 +
The occurrence cutoff value determines how quickly an N-mer pattern is declared to be ''high repeat'' and left out of the index in favor of a hash.  The default value of 5000 seems adequate for Illumina reads with the human genome.  If maximum performance is needed, some experimentation with this value is called for.
 +
 +
== Shared Memory ==
 +
 +
Karma uses memory mapped files to share the potentially large reference index and hash data structures.
 +
 +
Karma uses this to great effect on our 8 processors with hyperthreading enabled.  16 copies of karma can share one reference index and hash, yielding a very acceptable memory per CPU ratio of around 1GB/CPU.
 +
 +
A problem with large reference index and hash data structures is that they are more prone to being paged out.  On a shared machine that is under heavy use, even if only for simple disk I/O, memory pages are reclaimed and Karma's data structures can be paged out.
 +
 +
While Karma can recover on its own, it is best either to run production jobs on dedicated machines, or to run a program such as the utility ''mapfile'' found in the utilities sub-folder.  This program continually touches each page of the data structures in sequential order, forcing them to the head of the disk buffer pool so they don't get aged out of the queue.
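The page touching idea can be sketched in a few lines.  This is not the ''mapfile'' utility itself, whose options and implementation are not described here; the function name ''touch_pages'' is hypothetical, and the sketch assumes a non-empty file that the operating system can memory map.

```python
import mmap


def touch_pages(path):
    """Read one byte from every page of a memory mapped file.

    Returns the number of pages touched.  The file must not be empty.
    """
    touched = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Pages are visited in sequential order, one read per page.
            for offset in range(0, len(mapped), mmap.PAGESIZE):
                mapped[offset]  # reading the byte brings the page into memory
                touched += 1
    return touched
```

To keep pages resident over time, such a function would have to be called repeatedly, for example from a loop with a short sleep between passes.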
    
= Modifying the Reference Header =
 
= Modifying the Reference Header =
    +
''NB: This feature is not yet complete''
    
To facilitate SAM RG values being set automatically in a production environment, we keep a header in the binary version of the reference.  The header can be viewed and edited using the header subcommands here.
 
To facilitate SAM RG values being set automatically in a production environment, we keep a header in the binary version of the reference.  The header can be viewed and edited using the header subcommands here.