Difference between revisions of "BamUtil: recab"

From Genome Analysis Wiki
Jump to: navigation, search
Line 5: Line 5:
 
= Overview of the <code>recab</code> function of <code>[[bamUtil]]</code> =
 
= Overview of the <code>recab</code> function of <code>[[bamUtil]]</code> =
 
The <code>recab</code> option of [[bamUtil]] recalibrates a SAM/BAM file.  
 
The <code>recab</code> option of [[bamUtil]] recalibrates a SAM/BAM file.  
 +
 +
Recalibration can also be called as an option of [[bamUtil: dedup]].  This will perform the recalibration and  the deduping in the same set of steps, increasing processing speed.
  
 
==Handling Recalibration/Implementation Notes==
 
==Handling Recalibration/Implementation Notes==
 
Reads Not Recalibrated:
 
* Duplicates
 
* Unmapped
 
* Mapping Quality = 0
 
* Mapping Quality = 255
 
  
 
Recalibration is a 2-step process that loops through the file twice:
 
Recalibration is a 2-step process that loops through the file twice:
Line 18: Line 14:
 
# Apply Recalibration Table
 
# Apply Recalibration Table
  
Recalibration is done by grouping bases based on a set of covariates:
+
The Recalibration Table groups bases based on a set of covariates:
 
* Read Group
 
* Read Group
* Cycle
+
* Quality (either from the quality string or from a tag)
 +
* Cycle (reverse complement for reverse strands)
 
* 1st/2nd read in pair
 
* 1st/2nd read in pair
* Previous Cycle's Base
+
* Previous Cycle's Base (reverse complement for reverse strands)
* This Cycle's Base
+
* This Cycle's Base (reverse complement for reverse strands)
 +
 
 +
The Recalibration Table tracks the number of matches/mismatches for each set of covariates.
 +
 
 +
Only bases meeting all of the following criteria are used to Build the Recalibration Table:
 +
* Read criteria
 +
** not a duplicate
 +
** mapped
 +
** mapping quality != 0
 +
** mapping quality != 255
 +
* Base criteria
 +
** match/mismatch (not an insertion/deletion/skip/clip)
 +
** not a dbSNP position
 +
** base quality > minBaseQual (5 by default)
 +
* Additional criteria for cycle != 1 (can be turned off via flags)
 +
** previous base is a CIGAR Match/Mismatch
 +
** previous base position is not a dbSNP position
 +
 
 +
The Recalibration Table is applied to all bases meeting all of the following criteria:
 +
* base quality > minBaseQual (5 by default)
 +
 
 +
The Recalibrated Quality is calculated using: <math>-10 * \log \frac{mismatches + 1}{mismatches + matches + 1}</math>
 +
 
 +
If the Recalibration Table has no matches & no mismatches for a set of covariates, the original base quality is kept.
  
For Reverse Strands, the reverse complement of the SAM/BAM is used for the cycle, previous cycle's base, and current cycle's base.
+
If the Recalibrated Quality is greater than maxBaseQual, the updated quality is set to maxBaseQual.
  
Not all bases are used for building the Recalibration table. Only bases meeting the following criteria are used:
+
Optionally, the previous quality can be stored in a tag.
* Base is a q Match/Mismatch
 
* Previous base is a CIGAR Match/Mismatch or it is the first cycle
 
* Base position is not a dbSNP position
 
* Previous base position is not a dbSNP position (if not first cycle)
 
* Base quality > 5 (or the configurable minimum)
 
  
The Recalibration Table is applied on all bases in the read sequence (ignoring the alignment/CIGAR) unless the base quality is < 5 (or the configurable minimum)
+
The current recalibration logic was designed for recalibrating ILLUMINA data.
  
This recalibration logic was designed for recalibrating ILLUMINA data.
 
  
 
== How to use it ==
 
== How to use it ==

Revision as of 16:38, 18 September 2012


Overview of the recab function of bamUtil

The recab option of bamUtil recalibrates a SAM/BAM file.

Recalibration can also be called as an option of bamUtil: dedup. This will perform the recalibration and the deduping in the same set of steps, increasing processing speed.

Handling Recalibration/Implementation Notes

Recalibration is a 2-step process that loops through the file twice:

  1. Build Recalibration Table
  2. Apply Recalibration Table

The Recalibration Table groups bases based on a set of covariates:

  • Read Group
  • Quality (either from the quality string or from a tag)
  • Cycle (reverse complement for reverse strands)
  • 1st/2nd read in pair
  • Previous Cycle's Base (reverse complement for reverse strands)
  • This Cycle's Base (reverse complement for reverse strands)

The Recalibration Table tracks the number of matches/mismatches for each set of covariates.

Only bases meeting all of the following criteria are used to Build the Recalibration Table:

  • Read criteria
    • not a duplicate
    • mapped
    • mapping quality != 0
    • mapping quality != 255
  • Base criteria
    • match/mismatch (not an insertion/deletion/skip/clip)
    • not a dbSNP position
    • base quality > minBaseQual (5 by default)
  • Additional criteria for cycle != 1 (can be turned off via flags)
    • previous base is a CIGAR Match/Mismatch
    • previous base position is not a dbSNP position

The Recalibration Table is applied to all bases meeting all of the following criteria:

  • base quality > minBaseQual (5 by default)

The Recalibrated Quality is calculated using: -10 * \log \frac{mismatches + 1}{mismatches + matches + 1}

If the Recalibration Table has no matches & no mismatches for a set of covariates, the original base quality is kept.

If the Recalibrated Quality is greater than maxBaseQual, the updated quality is set to maxBaseQual.

Optionally, the previous quality can be stored in a tag.

The current recalibration logic was designed for recalibrating ILLUMINA data.


How to use it

When recab is invoked without any arguments the usage information is displayed as described below under Usage.

The input SAM/BAM file (--in), the output SAM/BAM file (--out), and the reference file (--refFile) are required inputs.

Usage

./bam recab (options) --in <InputBamFile> --out <OutputFile> [--log <logFile>] [--verbose] [--noeof] [--params] --refFile <ReferenceFile> [--dbsnp <dbsnpFile>] [--minBaseQual <minBaseQual>] [--maxBaseQual <maxBaseQual>] [--blended <weight>] [--skipFitModel] [--fast] [--keepPrevDbsnp] [--keepPrevNonAdjacent] [--useLogReg]

Parameters

Required General Parameters :
	--in <infile>   : input BAM file name
	--out <outfile> : output recalibration file name
Optional General Parameters : 
	--log <logfile> : log and summary statistics (default: [outfile].log)
	--verbose       : Turn on verbose mode
	--noeof         : do not expect an EOF block on a bam file.
	--params        : print the parameter settings

Recab Specific Required Parameters
	--refFile <reference file>    : reference file name
Recab Specific Optional Parameters : 
	--dbsnp <known variance file> : dbsnp file of positions
	--minBaseQual <minBaseQual>   : minimum base quality of bases to recalibrate (default: 5)
	--maxBaseQual <maxBaseQual>   : maximum recalibrated base quality (default: 50)
	--blended <weight>            : blended model weight
	--skipFitModel                : do not check if the logistic regression model fits the data
	--fast                        : use a compact representation that only allows:
	                                   * at most 256 Read Groups
	                                   * maximum quality 63
	                                   * at most 127 cycles
	                                automatically enables skipFitModel, but is overridden by useLogReg
	                                uses up to about 2.25G more memory than running without --fast.
	--keepPrevDbsnp               : do not exclude entries where the previous base is in dbsnp when
	                                building the recalibration table
	                                By default they are excluded from the table.
	--keepPrevNonAdjacent         : do not exclude entries where the previous base is not adjacent
	                                (not a Cigar M/X/=) when building the recalibration table
	                                By default they are excluded from the table (except the first cycle).
	--useLogReg                   : use logistic regression calculated quality for the new quality
	                                ignores setting of skipFitModel and fast.
	--qualField <quality tag>     : tag to get the starting base quality
	                                (default is to get it from the Quality field)
	--storeQualTag <quality tag>  : tag to store the previous quality into

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.

A - is used to indicate to read from stdin and the extension is used to determine the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file --in yourFileName
SAM from stdin --in -
BAM from stdin --in -.bam
Uncompressed BAM from stdin --in -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Output File (--out)

Use --out followed by your file name to specify the SAM/BAM output file.

The file extension is used to determine whether to write SAM/BAM/uncompressed BAM. A - is used to indicate stdout and the extension for file type (no extension is SAM).

SAM to file --out yourFileName.sam
BAM to file --out yourFileName.bam
Uncompressed BAM to file --out yourFileName.ubam
SAM to stdout --out -
BAM to stdout --out -.bam
Uncompressed BAM to stdout --out -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Output log & Summary Statistics FileName (--log)

Output file name for writing logs & summary statistics.

If this parameter is not specified, it will write to the output file specified in --out + ".log". Or if the output bam is written to stdout (--out starts with '-'), the logs will be written to stderr. If the filename after --log starts with '-' it will write to stderr.

Turn on Verbose Mode (--verbose)

Turn on verbose logging to get more log messages in the log and to stderr.

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing eof block in your bgzf file.

By default, the trailing empty block is expected and checked for.

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

Reference File (--refFile)

The reference file to use for comparing read bases to the reference.

DBSNP File (--dbsnp)

The dbsnp file that specifies positions to skip recalibrating. Tab delimited file with the chromosome in the first column and the 1-based position in the 2nd column.

Blended Model Weight (--blended)

TBD - this parameter is not yet implemented.

Minimum Recalibration Base Quality (--minBaseQual)

When recalibrating reads, only positions with a base quality greater than this minimum phred quality will be recalibrated. If --minBaseQual is not specified, it is defaulted to 5.

The ILLUMINA specs indicate that any quality below 5 can be used as an error indicator so we do not want to recalibrate those.

Maximum Recalibration Base Quality (--maxBaseQual)

This value sets the maximum phred base quality assigned to a base after recalibrating. Any qualities above this value will be set to this value. It is defaulted to 50.

Logistic Regression (--useLogReg)

Use the logistic regression empirical qualities for setting the new base qualities instead of the default formula: -10 * log10((#mismatches+1)/(#total+1))

Read the quality from a tag (--qualField)

If this parameter is set, then read the quality string from the specified tag name. If the tag is not found, the quality is read from the quality field.

Store the original quality (--storeQualTag)

If this parameter is set, the original quality will be stored as a string in the specified tag.

Return Value

Returns -1 if input parameters are invalid.

Returns the SamStatus for the reads/writes (0 on success).