Line 5: |
Line 5: |
| = Overview of the <code>recab</code> function of <code>[[bamUtil]]</code> = | | = Overview of the <code>recab</code> function of <code>[[bamUtil]]</code> = |
| The <code>recab</code> option of [[bamUtil]] recalibrates a SAM/BAM file. | | The <code>recab</code> option of [[bamUtil]] recalibrates a SAM/BAM file. |
| + | |
| + | Recalibration can also be called as an option of [[bamUtil: dedup]]. This will perform the recalibration and the deduping in the same set of steps, increasing processing speed. |
| | | |
| ==Handling Recalibration/Implementation Notes== | | ==Handling Recalibration/Implementation Notes== |
| | | |
− | Reads Not Recalibrated:
| + | Recalibration is a 2-step process that loops through the file twice (stdin is not support as input): |
− | * Duplicates
| |
− | * Unmapped
| |
− | * Mapping Quality = 0
| |
− | * Mapping Quality = 255
| |
− | | |
− | Recalibration is a 2-step process that loops through the file twice: | |
| # Build Recalibration Table | | # Build Recalibration Table |
| # Apply Recalibration Table | | # Apply Recalibration Table |
| | | |
− | Recalibration is done by grouping bases based on a set of covariates: | + | |
| + | The Recalibration Table groups bases based on a set of covariates: |
| * Read Group | | * Read Group |
− | * Cycle | + | * Quality (either from the quality string or [[#Read the quality from a tag (--qualField)|from a tag]]) |
| + | * Cycle (reverse complement for reverse strands) |
| * 1st/2nd read in pair | | * 1st/2nd read in pair |
− | * Previous Cycle's Base | + | * Previous Cycle's Base (reverse complement for reverse strands) |
− | * This Cycle's Base | + | * This Cycle's Base (reverse complement for reverse strands) |
| + | |
| + | The Recalibration Table tracks the number of matches/mismatches for each set of covariates. |
| + | |
| + | |
| + | Only bases meeting all of the following criteria are used to Build the Recalibration Table: |
| + | * Read criteria |
| + | ** not a duplicate |
| + | ** mapped |
| + | ** mapping quality != 0 |
| + | ** mapping quality != 255 |
| + | * Base criteria |
| + | ** match/mismatch (not an insertion/deletion/skip/clip) |
| + | ** not a [[#DBSNP File (--dbsnp)|dbSNP position]] |
| + | ** base quality > [[#Minimum Recalibration Base Quality (--minBaseQual)|minBaseQual (5 by default)]] |
| + | * Additional criteria for cycle != 1 (can be turned off via flags) |
| + | ** previous base is a CIGAR Match/Mismatch (Use [[#Allow Previous Base Non-Match/Mismatch (--keepPrevNonAdjacent)|<code>--keepPrevNonAdjacent</code>]] to disable) |
| + | ** previous base position is not a [[#DBSNP File (--dbsnp)|dbSNP position]] (Use [[#Allow Previous Base DBSNP (--keepPrevDbsnp)|<code>--keepPrevDbsnp</code>]] to disable) |
| + | |
| + | |
| + | The Recalibration Table is applied to all bases meeting all of the following criteria (even if they were not used for creating the table): |
| + | * base quality > [[#Minimum Recalibration Base Quality (--minBaseQual)|minBaseQual (5 by default)]] |
| + | * at least 1 match or mismatch for the set of covariates |
| + | |
| + | |
| + | Recalibrated Quality is: <math>-10 * \log \frac{mismatches + 1}{mismatches + matches + 1}</math> |
| + | |
| + | Alternatively, [[#Logistic Regression (--useLogReg)|logistic regression]] can be used for calculating the new quality. |
| + | |
| + | If the Recalibrated Quality is greater than [[#Maximum Recalibration Base Quality (--maxBaseQual)|maxBaseQual]], the updated quality is set to maxBaseQual. |
| | | |
− | For Reverse Strands, the reverse complement of the SAM/BAM is used for the cycle, previous cycle's base, and current cycle's base.
| |
| | | |
− | Not all bases are used for building the Recalibration table. Only bases meeting the following criteria are used:
| + | Optionally, the previous quality can be [[#Store the original quality (--storeQualTag)|stored in a tag]]. |
− | * Base is a CIGAR Match/Mismatch
| |
− | * Previous base is a CIGAR Match/Mismatch or it is the first cycle
| |
− | * Base position is not a dbSNP position
| |
− | * Previous base position is not a dbSNP position (if not first cycle)
| |
− | * Base quality > 5 (or the configurable minimum)
| |
| | | |
− | The Recalibration Table is applied on all bases in the read sequence (ignoring the alignment/CIGAR) unless the base quality is < 5 (or the configurable minimum)
| |
| | | |
− | This recalibration logic was designed for recalibrating ILLUMINA data.
| + | The current recalibration logic was designed for recalibrating ILLUMINA data. |
| + | |
| + | NOTE: GATK ignores/skips adapters, but our logic does not. |
| | | |
| == How to use it == | | == How to use it == |
Line 43: |
Line 65: |
| | | |
| The input SAM/BAM file ([[#input File (--in)|--in]]), the output SAM/BAM file ([[#output File (--out)|--out]]), and the reference file ([[#Reference File (--refFile)|--refFile]]) are required inputs. | | The input SAM/BAM file ([[#input File (--in)|--in]]), the output SAM/BAM file ([[#output File (--out)|--out]]), and the reference file ([[#Reference File (--refFile)|--refFile]]) are required inputs. |
| + | |
| + | |
| + | Recommended usage with Deduper: |
| + | |
| + | /usr/cluster/bin/bam dedup --recab --in ${INPUT}.bam --out ${OUTPUT}.bam --force --refFile ${REF} --dbsnp ${DBSNP} --oneChrom --storeQualTag OQ --maxBaseQual 40 |
| + | |
| + | |
| + | Recommended usage without Deduper: |
| + | |
| + | /usr/cluster/bin/bam recab --in ${INPUT}.bam --out ${OUTPUT}.bam --refFile ${REF} --dbsnp ${DBSNP} --storeQualTag OQ --maxBaseQual 40 |
| | | |
| = Usage = | | = Usage = |
− | ./bam recab (options) --in <InputBamFile> --out <OutputFile> [--log <logFile>] [--verbose] [--noeof] [--params] --refFile <ReferenceFile> [--dbsnp <dbsnpFile>] [--minBaseQual <minBaseQual>] [--maxBaseQual <maxBaseQual>] [--blended <weight>] [--recabLogReg] | + | ./bam recab (options) --in <InputBamFile> --out <OutputFile> [--log <logFile>] [--verbose] [--noeof] [--params] --refFile <ReferenceFile> [--dbsnp <dbsnpFile>] [--minBaseQual <minBaseQual>] [--maxBaseQual <maxBaseQual>] [--blended <weight>] [--fitModel] [--fast] [--keepPrevDbsnp] [--keepPrevNonAdjacent] [--useLogReg] [--qualField <tag>] [--storeQualTag <tag>] [--buildExcludeFlags <flag>] [--applyExcludeFlags <flag>] |
| | | |
| = Parameters = | | = Parameters = |
| <pre> | | <pre> |
| Required General Parameters : | | Required General Parameters : |
− | --in <infile> : input BAM file name
| + | --in <infile> : input BAM file name |
− | --out <outfile> : output recalibration file name
| + | --out <outfile> : output recalibration file name |
− | Optional General Parameters : | + | Optional General Parameters : |
− | --log <logfile> : log and summary statistics (default: [outfile].log)
| + | --log <logfile> : log and summary statistics (default: [outfile].log) |
− | --verbose : Turn on verbose mode
| + | --verbose : Turn on verbose mode |
− | --noeof : do not expect an EOF block on a bam file.
| + | --noeof : do not expect an EOF block on a bam file. |
− | --params : print the parameter settings
| + | --params : print the parameter settings |
| | | |
| Recab Specific Required Parameters | | Recab Specific Required Parameters |
− | --refFile <reference file> : reference file name
| + | --refFile <reference file> : reference file name |
− | Recab Specific Optional Parameters : | + | Recab Specific Optional Parameters : |
− | --dbsnp <known variance file> : dbsnp file of positions
| + | --dbsnp <known variance file> : dbsnp file of positions |
− | --minBaseQual <minBaseQual> : minimum base quality of bases to recalibrate (default: 5)
| + | --minBaseQual <minBaseQual> : minimum base quality of bases to recalibrate (default: 5) |
− | --maxBaseQual <maxBaseQual> : maximum recalibrated base quality (default: 50)
| + | --maxBaseQual <maxBaseQual> : maximum recalibrated base quality (default: 50) |
− | --blended <weight> : blended model weight
| + | qualities over this value will be set to this value. |
− | --recabLogReg : use logistic regression for calculating the new quality
| + | This setting is applied after binning (if applicable). |
− | --qualField <quality tag> : tag to get the starting base quality (default is to get it from the Quality field
| + | --blended <weight> : blended model weight |
− | --storeQualTag <quality tag> : tag to store the previous quality into
| + | --fitModel : check if the logistic regression model fits the data |
| + | overriden by fast, but automatically applied by useLogReg |
| + | --fast : use a compact representation that only allows: |
| + | * at most 256 Read Groups |
| + | * maximum quality 63 |
| + | * at most 127 cycles |
| + | overrides fitModel, but is overridden by useLogReg |
| + | uses up to about 2.25G more memory than running without --fast. |
| + | --keepPrevDbsnp : do not exclude entries where the previous base is in dbsnp when |
| + | building the recalibration table |
| + | By default they are excluded from the table. |
| + | --keepPrevNonAdjacent : do not exclude entries where the previous base is not adjacent |
| + | (not a Cigar M/X/=) when building the recalibration table |
| + | By default they are excluded from the table (except the first cycle). |
| + | --useLogReg : use logistic regression calculated quality for the new quality |
| + | automatically applies fitModel and overrides fast. |
| + | --qualField <quality tag> : tag to get the starting base quality |
| + | (default is to get it from the Quality field) |
| + | --storeQualTag <quality tag> : tag to store the previous quality into |
| + | --buildExcludeFlags <flag> : exclude reads with any of these flags set when building the |
| + | recalibration table. Default is 0xF04 |
| + | --applyExcludeFlags <flag> : do not apply the recalibration table to any reads with any of these flags set |
| + | Quality Binning Parameters (optional): |
| + | Bin qualities by phred score, into the ranges specified by binQualS or binQualF (both cannot be used) |
| + | Ranges are specified by comma separated minimum phred score for the bin, example: 1,17,20,30,40,50,70 |
| + | The first bin always starts at 0, so does not need to be specified. |
| + | By default, the bin value is the low end of the range. |
| + | --binQualS : Bin the Qualities as specified (phred): minQualOfBin2, minQualofBin3... |
| + | --binQualF : Bin the Qualities based on the specified file |
| + | --binMid : Use the mid point of the quality bin range for the quality value of the bin. |
| + | --binHigh : Use the high end of the quality bin range for the quality value of the bin. |
| + | |
| </pre> | | </pre> |
| + | {{PhoneHomeParamDesc}} |
| | | |
− | {{inBAMInputFile}} | + | == Required Generic Parameters == |
| + | {{inBAMInputFile|noStdin=1}} |
| {{outBAMOutputFile}} | | {{outBAMOutputFile}} |
| | | |
− | == Output log & Summary Statistics FileName (<code>--log</code>) == | + | == Optional Generic Parameters == |
| + | === Output log & Summary Statistics FileName (<code>--log</code>) === |
| | | |
| Output file name for writing logs & summary statistics. | | Output file name for writing logs & summary statistics. |
Line 79: |
Line 145: |
| If this parameter is not specified, it will write to the output file specified in <code>--out</code> + ".log". Or if the output bam is written to stdout (<code>--out</code> starts with '-'), the logs will be written to stderr. If the filename after --log starts with '-' it will write to stderr. | | If this parameter is not specified, it will write to the output file specified in <code>--out</code> + ".log". Or if the output bam is written to stdout (<code>--out</code> starts with '-'), the logs will be written to stderr. If the filename after --log starts with '-' it will write to stderr. |
| | | |
− | == Turn on Verbose Mode (<code>--verbose</code>) == | + | === Turn on Verbose Mode (<code>--verbose</code>) === |
| | | |
| Turn on verbose logging to get more log messages in the log and to stderr. | | Turn on verbose logging to get more log messages in the log and to stderr. |
Line 86: |
Line 152: |
| {{paramsParameter}} | | {{paramsParameter}} |
| | | |
− | == Reference File (<code>--refFile</code>) ==
| + | {{PhoneHomeParameters}} |
| | | |
− | The reference file to use for comparing read bases to the reference.
| + | == Required Recalibration Parameters == |
| + | === Reference File (<code>--refFile</code>) === |
| | | |
− | == DBSNP File (<code>--dbsnp</code>) ==
| + | The reference file is a required parameter used for comparing read bases to the reference. |
| | | |
− | The dbsnp file that specifies positions to skip recalibrating. Tab delimited file with the chromosome in the first column and the 1-based position in the 2nd column.
| + | == Optional Recalibration Parameters == |
| | | |
− | == Blended Model Weight (<code>--blended</code>) == | + | === DBSNP File (<code>--dbsnp</code>) === |
| | | |
− | <span style="color:red">TBD - this parameter is not yet implemented.</span>
| + | The dbsnp file that specifies positions to skip recalibrating. Tab delimited file with the chromosome in the first column and the 1-based position in the 2nd column. |
| | | |
− | == Minimum Recalibration Base Quality (<code>--minBaseQual</code>) == | + | === Minimum Recalibration Base Quality (<code>--minBaseQual</code>) === |
| | | |
| When recalibrating reads, only positions with a base quality greater than this minimum phred quality will be recalibrated. If <code>--minBaseQual</code> is not specified, it is defaulted to 5. | | When recalibrating reads, only positions with a base quality greater than this minimum phred quality will be recalibrated. If <code>--minBaseQual</code> is not specified, it is defaulted to 5. |
Line 104: |
Line 171: |
| The ILLUMINA specs indicate that any quality below 5 can be used as an error indicator so we do not want to recalibrate those. | | The ILLUMINA specs indicate that any quality below 5 can be used as an error indicator so we do not want to recalibrate those. |
| | | |
− | == Maximum Recalibration Base Quality (<code>--maxBaseQual</code>) == | + | === Maximum Recalibration Base Quality (<code>--maxBaseQual</code>) === |
| | | |
| This value sets the maximum phred base quality assigned to a base after recalibrating. Any qualities above this value will be set to this value. It is defaulted to 50. | | This value sets the maximum phred base quality assigned to a base after recalibrating. Any qualities above this value will be set to this value. It is defaulted to 50. |
| | | |
− | == Logistic Regression == (<code>--recabLogReg</code>) == | + | === Blended Model Weight (<code>--blended</code>) === |
| + | |
| + | <span style="color:red">TBD - this parameter is not yet implemented.</span> |
| + | |
| + | === Fit Model (<code>--fitModel</code>) === |
| + | |
| + | Check if the logistic regression model fits the data. |
| + | |
| + | This option does NOT set the new qualities to the logistic regression calculated qualities, it only checks the fit. To apply the logistic regression qualities, see [[#Logistic Regression (--useLogReg)|<code>--useLogReg</code>]]. <code>--fitModel</code> is automatically applied when <code>--useLogReg</code> is specified. |
| + | |
| + | This option cannot be used in conjunction with [[#Fast Recalibration (--fast)|<code>--fast</code>]] and is overriden by <code>--fast</code>, but automatically applied by useLogReg |
| + | |
| + | === Fast Recalibration (<code>--fast</code>) === |
| + | |
| + | Use a compact representation of the Recalibration Table that only allows: |
| + | * at most 256 Read Groups |
| + | * maximum quality 63 |
| + | * at most 127 cycles |
| + | |
| + | This option will run faster than the default recalibration, but uses up to about 2.25G more memory than running without --fast. |
| + | |
| + | This option cannot be used in conjunction with [[#Fit Model (--fitModel)|<code>--fitModel</code>]], or [[#Logistic Regression (--useLogReg)|<code>--useLogReg</code>]] and overrides [[#Fit Model (--fitModel)|<code>--fitModel</code>]], but is overridden by [[#Logistic Regression (--useLogReg)|<code>--useLogReg</code>]]. |
| + | |
| + | === Allow Previous Base DBSNP (<code>--keepPrevDbsnp</code>) === |
| + | |
| + | By default bases where the previous base is in DBSNP are excluded from the Recalibration Table. |
| + | |
| + | This option includes these bases in the building of the Recalibration Table. |
| + | |
| + | === Allow Previous Base Non-Match/Mismatch (<code>--keepPrevNonAdjacent</code>) === |
| + | |
| + | By default bases where the previous base is not a CIGAR Match/Mismatch are excluded from the Recalibration Table. |
| | | |
− | Use the logistic regression empirical qualities for setting the new base qualities instead of the default formula: -10 * log10((#mismatches+1)/(#total+1))
| + | This option includes these bases in the building of the Recalibration Table. |
| | | |
− | == Read the quality from a tag (<code>--qualField</code>) == | + | |
| + | === Logistic Regression (<code>--useLogReg</code>) === |
| + | |
| + | Use the logistic regression empirical qualities for setting the new base qualities instead of the default formula. |
| + | |
| + | This option automatically enables [[#Fit Model (--fitModel)|<code>--fitModel</code>]] and disables [[#Fast Recalibration (--fast)|<code>--fast</code>]]. |
| + | |
| + | === Read the quality from a tag (<code>--qualField</code>) === |
| | | |
| If this parameter is set, then read the quality string from the specified tag name. If the tag is not found, the quality is read from the quality field. | | If this parameter is set, then read the quality string from the specified tag name. If the tag is not found, the quality is read from the quality field. |
| | | |
− | == Store the original quality (<code>--storeQualTag</code>) == | + | === Store the original quality (<code>--storeQualTag</code>) === |
| | | |
| If this parameter is set, the original quality will be stored as a string in the specified tag. | | If this parameter is set, the original quality will be stored as a string in the specified tag. |
| + | |
| + | === Skip Records with any of the Specified Flags (<code>--buildExcludeFlags</code>, <code>--applyExcludeFlags</code>) === |
| + | Use <code>--buildExcludeFlags</code> to skip records with any of the specified flags set when building the recalibration table, default 0xF04. |
| + | |
| + | By default, when building the recalibration table reads with any of the following flags set are skipped: |
| + | * unmapped |
| + | * secondary alignment |
| + | * fails QC checks |
| + | * duplicate |
| + | * supplementary alignment |
| + | |
| + | Use <code>--applyExcludeFlags</code> to skip records with any of the specified flags set when applying the recalibration table. The default value is 0x000, do not skip any reads. |
| | | |
| = Return Value = | | = Return Value = |
Line 124: |
Line 241: |
| Returns -1 if input parameters are invalid. | | Returns -1 if input parameters are invalid. |
| | | |
− | Returns the SamStatus for the reads/writes (0 on success). | + | Returns the SamStatus for the reads/writes (0 on success, non-0 on failure). |