Line 5: |
Line 5: |
| = Overview of the <code>recab</code> function of <code>[[bamUtil]]</code> = | | = Overview of the <code>recab</code> function of <code>[[bamUtil]]</code> = |
| The <code>recab</code> option of [[bamUtil]] recalibrates a SAM/BAM file. | | The <code>recab</code> option of [[bamUtil]] recalibrates a SAM/BAM file. |
| + | |
| + | Recalibration can also be called as an option of [[bamUtil: dedup]]. This will perform the recalibration and the deduping in the same set of steps, increasing processing speed. |
| | | |
| ==Handling Recalibration/Implementation Notes== | | ==Handling Recalibration/Implementation Notes== |
− |
| |
− | Reads Not Recalibrated:
| |
− | * Duplicates
| |
− | * Unmapped
| |
− | * Mapping Quality = 0
| |
− | * Mapping Quality = 255
| |
| | | |
| Recalibration is a 2-step process that loops through the file twice: | | Recalibration is a 2-step process that loops through the file twice: |
Line 18: |
Line 14: |
| # Apply Recalibration Table | | # Apply Recalibration Table |
| | | |
− | Recalibration is done by grouping bases based on a set of covariates: | + | The Recalibration Table groups bases based on a set of covariates: |
| * Read Group | | * Read Group |
− | * Cycle | + | * Quality (either from the quality string or from a tag) |
| + | * Cycle (reverse complement for reverse strands) |
| * 1st/2nd read in pair | | * 1st/2nd read in pair |
− | * Previous Cycle's Base | + | * Previous Cycle's Base (reverse complement for reverse strands) |
− | * This Cycle's Base | + | * This Cycle's Base (reverse complement for reverse strands) |
| + | |
| + | The Recalibration Table tracks the number of matches/mismatches for each set of covariates. |
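A minimal sketch of how such a table might be keyed and tallied. This is not bamUtil's implementation: the names `Covariates`, `covariates_for_read`, and `tally` are hypothetical. The handling of reverse strands (undoing the reverse complement so that cycle 1 is the first base sequenced) and the `"N"` placeholder for the first cycle's previous base are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple

# Hypothetical covariate key; field names are illustrative only.
class Covariates(NamedTuple):
    read_group: str
    quality: int
    cycle: int
    first_in_pair: bool
    prev_base: str
    cur_base: str

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def covariates_for_read(seq, quals, read_group, first_in_pair, is_reverse):
    """Return one Covariates key per base, in sequencing (cycle) order."""
    if is_reverse:
        # Assumption: SAM/BAM stores reverse-strand reads reverse complemented,
        # so undo that to recover the order in which bases were sequenced.
        seq = "".join(COMPLEMENT[b] for b in reversed(seq))
        quals = list(reversed(quals))
    keys = []
    for i, (base, qual) in enumerate(zip(seq, quals)):
        # Assumption: "N" stands in for the missing previous base at cycle 1.
        prev_base = seq[i - 1] if i > 0 else "N"
        keys.append(Covariates(read_group, qual, i + 1, first_in_pair, prev_base, base))
    return keys

def tally(table, key, is_match):
    """Increment the match or mismatch count for one covariate set."""
    table[key][0 if is_match else 1] += 1

# Each entry holds [matches, mismatches].
table = defaultdict(lambda: [0, 0])
keys = covariates_for_read("ACG", [10, 20, 30], "rg1", True, is_reverse=True)
tally(table, keys[0], is_match=True)
tally(table, keys[0], is_match=False)
```

Using a tuple of covariates as the dictionary key keeps the grouping explicit: two bases share counts only if every covariate agrees.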
| + | |
| + | Only bases meeting all of the following criteria are used to Build the Recalibration Table: |
| + | * Read criteria |
| + | ** not a duplicate |
| + | ** mapped |
| + | ** mapping quality != 0 |
| + | ** mapping quality != 255 |
| + | * Base criteria |
| + | ** match/mismatch (not an insertion/deletion/skip/clip) |
| + | ** not a dbSNP position |
| + | ** base quality > minBaseQual (5 by default) |
| + | * Additional criteria for cycle != 1 (can be turned off via flags) |
| + | ** previous base is a CIGAR Match/Mismatch |
| + | ** previous base position is not a dbSNP position |
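The build criteria above can be sketched as a single predicate. The `BaseObs` record and the `check_prev_cigar` / `check_prev_dbsnp` switches are hypothetical stand-ins (the actual flag names are not given here); only the default of 5 for minBaseQual comes from the text.

```python
from dataclasses import dataclass, replace

# Hypothetical per-base record; bamUtil's internal representation differs.
@dataclass
class BaseObs:
    is_duplicate: bool
    is_mapped: bool
    mapq: int
    is_aligned: bool       # CIGAR match/mismatch (not insertion/deletion/skip/clip)
    at_dbsnp: bool
    base_qual: int
    cycle: int
    prev_is_aligned: bool  # previous base is a CIGAR match/mismatch
    prev_at_dbsnp: bool

def use_for_build(b, min_base_qual=5, check_prev_cigar=True, check_prev_dbsnp=True):
    """Return True if this base should contribute to the Recalibration Table."""
    # Read criteria
    if b.is_duplicate or not b.is_mapped or b.mapq in (0, 255):
        return False
    # Base criteria
    if not b.is_aligned or b.at_dbsnp or b.base_qual <= min_base_qual:
        return False
    # Additional criteria, only for cycles after the first
    if b.cycle != 1:
        if check_prev_cigar and not b.prev_is_aligned:
            return False
        if check_prev_dbsnp and b.prev_at_dbsnp:
            return False
    return True

good = BaseObs(False, True, 60, True, False, 30, 2, True, False)
print(use_for_build(good))
```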
| + | |
| + | The Recalibration Table is applied to all bases meeting all of the following criteria: |
| + | * base quality > minBaseQual (5 by default) |
| + | |
| + | The Recalibrated Quality is calculated using: <math>-10 \cdot \log_{10} \frac{mismatches + 1}{mismatches + matches + 1}</math> |
| + | |
| + | If the Recalibration Table has no matches and no mismatches for a set of covariates, the original base quality is kept. |
| | | |
− | For Reverse Strands, the reverse complement of the SAM/BAM is used for the cycle, previous cycle's base, and current cycle's base.
| + | If the Recalibrated Quality is greater than maxBaseQual, the updated quality is set to maxBaseQual. |
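The quality update described above can be sketched as follows. The function name and signature are hypothetical, the logarithm is assumed to be base 10 (the Phred scale), and bases at or below minBaseQual are assumed to keep their original quality. How the result is rounded to an integer is not specified here, so the sketch returns a float.

```python
import math

def recalibrated_quality(orig_qual, matches, mismatches, max_base_qual, min_base_qual=5):
    """Return the updated quality for one base (illustrative sketch only)."""
    # The table is applied only to bases with quality > minBaseQual.
    if orig_qual <= min_base_qual:
        return float(orig_qual)
    # No observations for this covariate set: keep the original quality.
    if matches == 0 and mismatches == 0:
        return float(orig_qual)
    # Adding 1 to numerator and denominator keeps the ratio above zero.
    error_rate = (mismatches + 1) / (mismatches + matches + 1)
    qual = -10.0 * math.log10(error_rate)
    # Cap the recalibrated quality at maxBaseQual.
    return min(qual, float(max_base_qual))

# 999 matches and 0 mismatches give an error rate of 1/1000.
print(recalibrated_quality(20, 999, 0, max_base_qual=40))
```

With 999 matches and no mismatches the estimated error rate is 1/1000, which corresponds to a quality of 30 on the Phred scale.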
| | | |
− | Not all bases are used for building the Recalibration table. Only bases meeting the following criteria are used:
| + | Optionally, the previous quality can be stored in a tag. |
− | * Base is a CIGAR Match/Mismatch
| |
− | * Previous base is a CIGAR Match/Mismatch or it is the first cycle
| |
− | * Base position is not a dbSNP position
| |
− | * Previous base position is not a dbSNP position (if not first cycle)
| |
− | * Base quality > 5 (or the configurable minimum)
| |
| | | |
− | The Recalibration Table is applied on all bases in the read sequence (ignoring the alignment/CIGAR) unless the base quality is < 5 (or the configurable minimum) | + | The current recalibration logic was designed for recalibrating ILLUMINA data. |
| | | |
− | This recalibration logic was designed for recalibrating ILLUMINA data.
| |
| | | |
| == How to use it == | | == How to use it == |