Changes

717 bytes added ,  16:38, 18 September 2012
no edit summary
Line 5: Line 5:  
= Overview of the <code>recab</code> function of <code>[[bamUtil]]</code> =
 
 
The <code>recab</code> option of [[bamUtil]] recalibrates a SAM/BAM file.  
 
 +
 +
Recalibration can also be run as an option of [[bamUtil: dedup]]. This performs the recalibration and the deduping in the same set of steps, which speeds up processing.
    
==Handling Recalibration/Implementation Notes==
 
  −
Reads Not Recalibrated:
  −
* Duplicates
  −
* Unmapped
  −
* Mapping Quality = 0
  −
* Mapping Quality = 255
      
Recalibration is a 2-step process that loops through the file twice:
 
Line 18: Line 14:  
# Apply Recalibration Table
 
   −
Recalibration is done by grouping bases based on a set of covariates:
+
The Recalibration Table groups bases based on a set of covariates:
 
* Read Group
 
* Cycle
+
* Quality (either from the quality string or from a tag)
 +
* Cycle (reverse complement for reverse strands)
 
* 1st/2nd read in pair
 
* Previous Cycle's Base
+
* Previous Cycle's Base (reverse complement for reverse strands)
* This Cycle's Base
+
* This Cycle's Base (reverse complement for reverse strands)
 +
 
 +
The Recalibration Table tracks the number of matches/mismatches for each set of covariates.
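
As a rough illustration of the covariate grouping described here, the following Python sketch keys a table of match/mismatch counts on the six covariates. It is not bamUtil's implementation: the names <code>Covariates</code>, <code>RecabTable</code>, and <code>covariates_at</code> are hypothetical, the 1-based cycle numbering is inferred from the "cycle != 1" wording in these notes, and using "N" as the previous base at the first cycle is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass

# Complement table used for reads on the reverse strand.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


@dataclass(frozen=True)
class Covariates:
    """Hypothetical key: one entry per distinct set of covariates."""
    read_group: str
    quality: int
    cycle: int            # 1-based (assumption)
    first_in_pair: bool
    prev_base: str
    cur_base: str


def covariates_at(seq, quals, i, is_reverse, read_group, first_in_pair):
    """Build the covariate key for the base stored at index i of a read.

    For reverse-strand reads the cycle and both bases are taken from the
    reverse complement of the stored sequence.
    """
    n = len(seq)
    if is_reverse:
        cycle = n - i
        cur = seq[i].translate(_COMPLEMENT)
        # The previous cycle is the next stored base; "N" at cycle 1 (assumption).
        prev = seq[i + 1].translate(_COMPLEMENT) if i + 1 < n else "N"
    else:
        cycle = i + 1
        cur = seq[i]
        prev = seq[i - 1] if i > 0 else "N"
    return Covariates(read_group, quals[i], cycle, first_in_pair, prev, cur)


class RecabTable:
    """Counts matches and mismatches for each set of covariates."""

    def __init__(self):
        self._counts = defaultdict(lambda: [0, 0])  # key -> [matches, mismatches]

    def add(self, key, is_match):
        self._counts[key][0 if is_match else 1] += 1

    def counts(self, key):
        matches, mismatches = self._counts.get(key, (0, 0))
        return matches, mismatches
```

A frozen dataclass is used so the covariate set can serve directly as a dictionary key; a real implementation would more likely pack the covariates into an integer index.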
 +
 
 +
Only bases meeting all of the following criteria are used to Build the Recalibration Table:
 +
* Read criteria
 +
** not a duplicate
 +
** mapped
 +
** mapping quality != 0
 +
** mapping quality != 255
 +
* Base criteria
 +
** match/mismatch (not an insertion/deletion/skip/clip)
 +
** not a dbSNP position
 +
** base quality > minBaseQual (5 by default)
 +
* Additional criteria for cycle != 1 (can be turned off via flags)
 +
** previous base is a CIGAR Match/Mismatch
 +
** previous base position is not a dbSNP position
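
The criteria above can be read as a single predicate over a read and one of its bases. The sketch below is illustrative only: <code>ReadInfo</code>, <code>base_usable_for_table</code>, and the <code>check_previous</code> parameter are hypothetical names (the notes say the previous-base checks can be turned off via flags but do not name them here), and dbSNP positions are modelled as a plain set of positions.

```python
from dataclasses import dataclass

# CIGAR operations that count as a match/mismatch (not insertion/deletion/skip/clip).
ALIGNED_OPS = {"M", "=", "X"}


@dataclass
class ReadInfo:
    """Hypothetical minimal view of the read-level fields the criteria need."""
    is_duplicate: bool
    is_unmapped: bool
    mapq: int


def read_usable(read):
    """Read criteria: not a duplicate, mapped, mapping quality not 0 or 255."""
    return (not read.is_duplicate
            and not read.is_unmapped
            and read.mapq not in (0, 255))


def base_usable_for_table(read, op, pos, qual, cycle, prev_op, prev_pos,
                          dbsnp_positions, min_base_qual=5, check_previous=True):
    """Return True if this base may be counted when building the table."""
    if not read_usable(read):
        return False
    # Base criteria.
    if op not in ALIGNED_OPS or pos in dbsnp_positions:
        return False
    if qual <= min_base_qual:
        return False
    # Additional criteria for cycle != 1 (optional, per the notes).
    if check_previous and cycle != 1:
        if prev_op not in ALIGNED_OPS or prev_pos in dbsnp_positions:
            return False
    return True
```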
 +
 
 +
The Recalibration Table is applied to all bases meeting all of the following criteria:
 +
* base quality > minBaseQual (5 by default)
 +
 
 +
The Recalibrated Quality is calculated using: <math>-10 \cdot \log_{10} \frac{mismatches + 1}{mismatches + matches + 1}</math>
 +
 
 +
If the Recalibration Table has no matches and no mismatches for a set of covariates, the original base quality is kept.
   −
For Reverse Strands, the reverse complement of the SAM/BAM is used for the cycle, previous cycle's base, and current cycle's base.
+
If the Recalibrated Quality is greater than maxBaseQual, the updated quality is set to maxBaseQual.
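
A minimal Python sketch of applying the table to one base is shown here. It assumes the logarithm is base 10 (the usual Phred convention) and that the result is rounded to the nearest integer; neither detail is stated in these notes. The function name <code>recalibrated_quality</code> is hypothetical, and <code>max_base_qual</code> is a required argument because no default for maxBaseQual is given here.

```python
import math


def recalibrated_quality(orig_qual, matches, mismatches, max_base_qual,
                         min_base_qual=5):
    """Return the updated quality for one base.

    matches/mismatches are the table counts for the base's set of covariates.
    """
    # The table is applied only to bases with quality > minBaseQual.
    if orig_qual <= min_base_qual:
        return orig_qual
    # No observations for this set of covariates: keep the original quality.
    if matches == 0 and mismatches == 0:
        return orig_qual
    error_rate = (mismatches + 1) / (mismatches + matches + 1)
    quality = round(-10.0 * math.log10(error_rate))  # base 10 and rounding are assumptions
    # Cap the result at maxBaseQual.
    return min(quality, max_base_qual)
```

For example, a set of covariates with 999 matches and 0 mismatches gives an error rate of 1/1000, so a base with original quality 20 would be updated to 30.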
   −
Not all bases are used for building the Recalibration table. Only bases meeting the following criteria are used:
+
Optionally, the previous quality can be stored in a tag.
* Base is a CIGAR Match/Mismatch
  −
* Previous base is a CIGAR Match/Mismatch or it is the first cycle
  −
* Base position is not a dbSNP position
  −
* Previous base position is not a dbSNP position (if not first cycle)
  −
* Base quality > 5 (or the configurable minimum)
     −
The Recalibration Table is applied on all bases in the read sequence (ignoring the alignment/CIGAR) unless the base quality is < 5 (or the configurable minimum)
+
The current recalibration logic was designed for recalibrating ILLUMINA data.
   −
This recalibration logic was designed for recalibrating ILLUMINA data.
      
== How to use it ==
 
