BamUtil: recab

From Genome Analysis Wiki

Latest revision as of 23:01, 19 April 2019


Overview of the recab function of bamUtil

The recab option of bamUtil recalibrates a SAM/BAM file.

Recalibration can also be called as an option of bamUtil: dedup. This performs the recalibration and the deduping in the same set of steps, increasing processing speed.

Handling Recalibration/Implementation Notes

Recalibration is a 2-step process that loops through the file twice (stdin is not supported as input):

  1. Build Recalibration Table
  2. Apply Recalibration Table


The Recalibration Table groups bases based on a set of covariates:

  • Read Group
  • Quality (either from the quality string or from a tag)
  • Cycle (reverse complement for reverse strands)
  • 1st/2nd read in pair
  • Previous Cycle's Base (reverse complement for reverse strands)
  • This Cycle's Base (reverse complement for reverse strands)

The Recalibration Table tracks the number of matches/mismatches for each set of covariates.


Only bases meeting all of the following criteria are used to Build the Recalibration Table:

  • Read criteria
      ◦ not a duplicate
      ◦ mapped
      ◦ mapping quality != 0
      ◦ mapping quality != 255
  • Base criteria
      ◦ match/mismatch (not an insertion/deletion/skip/clip)
      ◦ not a dbSNP position
      ◦ base quality > minBaseQual (5 by default)
  • Additional criteria for cycle != 1 (can be turned off via flags)
      ◦ previous base is a CIGAR Match/Mismatch (use --keepPrevNonAdjacent to disable)
      ◦ previous base position is not a dbSNP position (use --keepPrevDbsnp to disable)


The Recalibration Table is applied to all bases meeting all of the following criteria (even if they were not used for creating the table):

  • base quality > minBaseQual (5 by default)
  • at least 1 match or mismatch for the set of covariates


Recalibrated Quality is: $-10 \cdot \log_{10} \frac{mismatches + 1}{mismatches + matches + 1}$

Alternatively, logistic regression can be used for calculating the new quality.

If the Recalibrated Quality is greater than maxBaseQual, the updated quality is set to maxBaseQual.
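
As a worked illustration, the default calculation is -10 * log((mismatches + 1) / (mismatches + matches + 1)), capped at maxBaseQual. The sketch below assumes the logarithm is base 10 (the phred scale) and leaves the result unrounded, since rounding is not specified here. The function name and the error raised for an empty covariate set are illustrative choices, not bamUtil code.

```python
import math


def recalibrated_quality(matches, mismatches, max_base_qual=50):
    """Empirical phred quality for one covariate set, capped at max_base_qual."""
    # The table is only applied when the covariate set has at least
    # 1 match or mismatch; raising here is a choice made for this sketch.
    if matches + mismatches < 1:
        raise ValueError("no observations for this covariate set")
    error_rate = (mismatches + 1) / (mismatches + matches + 1)
    quality = -10 * math.log10(error_rate)
    # Qualities above the maximum are set to the maximum.
    return min(quality, max_base_qual)
```

For example, 999 matches and 0 mismatches give an error rate of 1/1000, which is a quality of about 30. With 99999 matches and 0 mismatches the raw value is about 50, so running with --maxBaseQual 40 would cap it at 40.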


Optionally, the previous quality can be stored in a tag.


The current recalibration logic was designed for recalibrating ILLUMINA data.

NOTE: GATK ignores/skips adapters, but our logic does not.
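
The table-building step described in this section can be sketched roughly as follows. This is a minimal illustration and not the bamUtil implementation: the class, field, and function names are hypothetical. It keeps a match/mismatch counter per covariate set, and applies the read-level criteria for building the table (not a duplicate, mapped, mapping quality neither 0 nor 255).

```python
from collections import defaultdict, namedtuple

# One field per covariate in the Recalibration Table; the names are illustrative.
Covariates = namedtuple(
    "Covariates",
    ["read_group", "quality", "cycle", "first_in_pair", "prev_base", "cur_base"],
)


class RecalTable:
    """Counts matches and mismatches for each set of covariates."""

    def __init__(self):
        # covariate set -> [matches, mismatches]
        self.counts = defaultdict(lambda: [0, 0])

    def add(self, key, is_match):
        self.counts[key][0 if is_match else 1] += 1

    def has_data(self, key):
        # The table is applied only to covariate sets with
        # at least 1 match or mismatch.
        return key in self.counts and sum(self.counts[key]) >= 1


def read_is_eligible(is_duplicate, is_unmapped, mapq):
    """Read-level criteria for using a read to build the table."""
    return not is_duplicate and not is_unmapped and mapq not in (0, 255)
```

A second pass over the file would then look up each base's covariate set and, if has_data is true, replace the quality with the recalibrated value.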

How to use it

When recab is invoked without any arguments the usage information is displayed as described below under Usage.

The input SAM/BAM file (--in), the output SAM/BAM file (--out), and the reference file (--refFile) are required inputs.


Recommended usage with Deduper:

/usr/cluster/bin/bam dedup --recab --in ${INPUT}.bam --out ${OUTPUT}.bam --force --refFile ${REF} --dbsnp ${DBSNP} --oneChrom --storeQualTag OQ --maxBaseQual 40


Recommended usage without Deduper:

/usr/cluster/bin/bam recab --in ${INPUT}.bam --out ${OUTPUT}.bam --refFile ${REF} --dbsnp ${DBSNP} --storeQualTag OQ --maxBaseQual 40
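
When the command is launched from a script, it can help to assemble the arguments as a list. The helper below is a hypothetical sketch: the function name and the default executable name "bam" are assumptions, and only options documented on this page are used.

```python
def recab_command(in_bam, out_bam, ref, dbsnp=None, store_qual_tag=None,
                  max_base_qual=None, bam_exe="bam"):
    """Build the argument list for a recab run (nothing is executed here)."""
    # --in, --out and --refFile are the required inputs.
    cmd = [bam_exe, "recab", "--in", in_bam, "--out", out_bam, "--refFile", ref]
    if dbsnp is not None:
        cmd += ["--dbsnp", dbsnp]
    if store_qual_tag is not None:
        cmd += ["--storeQualTag", store_qual_tag]
    if max_base_qual is not None:
        cmd += ["--maxBaseQual", str(max_base_qual)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems with file names.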

Usage

./bam recab (options) --in <InputBamFile> --out <OutputFile> [--log <logFile>] [--verbose] [--noeof] [--params] --refFile <ReferenceFile> [--dbsnp <dbsnpFile>] [--minBaseQual <minBaseQual>] [--maxBaseQual <maxBaseQual>] [--blended <weight>] [--fitModel] [--fast] [--keepPrevDbsnp] [--keepPrevNonAdjacent] [--useLogReg] [--qualField <tag>] [--storeQualTag <tag>] [--buildExcludeFlags <flag>] [--applyExcludeFlags <flag>]

Parameters

Required General Parameters :
        --in <infile>   : input BAM file name
        --out <outfile> : output recalibration file name
Optional General Parameters :
        --log <logfile> : log and summary statistics (default: [outfile].log)
        --verbose       : Turn on verbose mode
        --noeof         : do not expect an EOF block on a bam file.
        --params        : print the parameter settings

Recab Specific Required Parameters
        --refFile <reference file>    : reference file name
Recab Specific Optional Parameters :
        --dbsnp <known variance file> : dbsnp file of positions
        --minBaseQual <minBaseQual>   : minimum base quality of bases to recalibrate (default: 5)
        --maxBaseQual <maxBaseQual>   : maximum recalibrated base quality (default: 50)
                                        qualities over this value will be set to this value.
                                        This setting is applied after binning (if applicable).
        --blended <weight>            : blended model weight
        --fitModel                    : check if the logistic regression model fits the data
                                        overriden by fast, but automatically applied by useLogReg
        --fast                        : use a compact representation that only allows:
                                           * at most 256 Read Groups
                                           * maximum quality 63
                                           * at most 127 cycles
                                        overrides fitModel, but is overridden by useLogReg
                                        uses up to about 2.25G more memory than running without --fast.
        --keepPrevDbsnp               : do not exclude entries where the previous base is in dbsnp when
                                        building the recalibration table
                                        By default they are excluded from the table.
        --keepPrevNonAdjacent         : do not exclude entries where the previous base is not adjacent
                                        (not a Cigar M/X/=) when building the recalibration table
                                        By default they are excluded from the table (except the first cycle).
        --useLogReg                   : use logistic regression calculated quality for the new quality
                                        automatically applies fitModel and overrides fast.
        --qualField <quality tag>     : tag to get the starting base quality
                                        (default is to get it from the Quality field)
        --storeQualTag <quality tag>  : tag to store the previous quality into
        --buildExcludeFlags <flag>    : exclude reads with any of these flags set when building the
                                        recalibration table.  Default is 0xF04
        --applyExcludeFlags <flag>    : do not apply the recalibration table to any reads with any of these flags set
        Quality Binning Parameters (optional):
          Bin qualities by phred score, into the ranges specified by binQualS or binQualF (both cannot be used)
          Ranges are specified by comma separated minimum phred score for the bin, example: 1,17,20,30,40,50,70
          The first bin always starts at 0, so does not need to be specified.
          By default, the bin value is the low end of the range.
                --binQualS   : Bin the Qualities as specified (phred): minQualOfBin2, minQualofBin3...
                --binQualF   : Bin the Qualities based on the specified file
                --binMid     : Use the mid point of the quality bin range for the quality value of the bin.
                --binHigh    : Use the high end of the quality bin range for the quality value of the bin.

	PhoneHome:
		--noPhoneHome       : disable PhoneHome (default enabled)
		--phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)
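
The quality binning options in the listing above can be illustrated with a small sketch. It makes several assumptions that this page does not confirm: each bin ends one below the start of the next bin, the last bin ends at a hypothetical max_quality of 93, and the mid point is rounded down. The function name and the mode argument are illustrative only.

```python
import bisect


def bin_quality(quality, bin_starts, mode="low", max_quality=93):
    """Map a phred quality (>= 0) to the value of its bin.

    bin_starts is the comma separated list of minimum phred scores,
    e.g. "1,17,20,30,40,50,70"; the first bin always starts at 0.
    """
    starts = [0] + [int(s) for s in bin_starts.split(",")]
    i = bisect.bisect_right(starts, quality) - 1
    low = starts[i]
    # Assumption: a bin ends just below the start of the next bin.
    high = starts[i + 1] - 1 if i + 1 < len(starts) else max_quality
    if mode == "low":      # default: low end of the range
        return low
    if mode == "high":     # --binHigh
        return high
    if mode == "mid":      # --binMid (rounded down in this sketch)
        return (low + high) // 2
    raise ValueError("mode must be 'low', 'mid' or 'high'")
```

With the example ranges 1,17,20,30,40,50,70, a quality of 25 falls in the bin that starts at 20. Under the assumptions above, that bin runs from 20 to 29.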

Required Generic Parameters

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user.

Note: This tool does not support input from stdin.

SAM/BAM/Uncompressed BAM from file    --in yourFileName


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Output File (--out)

Use --out followed by your file name to specify the SAM/BAM output file.

The file extension determines whether SAM, BAM, or uncompressed BAM is written. Use - as the file name to write to stdout, followed by the extension for the file type (no extension is SAM).

SAM to file                      --out yourFileName.sam
BAM to file                      --out yourFileName.bam
Uncompressed BAM to file         --out yourFileName.ubam
SAM to stdout                    --out -
BAM to stdout                    --out -.bam
Uncompressed BAM to stdout       --out -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
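
The naming rules for --out can be summarized in a short sketch. The function name is hypothetical. One assumption goes beyond this page: a file name with no recognized extension is treated as SAM, which is stated here only for stdout.

```python
def output_target(out_name):
    """Return (destination, format) implied by an --out value."""
    # A leading '-' indicates stdout.
    destination = "stdout" if out_name.startswith("-") else "file"
    if out_name.endswith(".ubam"):
        file_format = "uncompressed BAM"
    elif out_name.endswith(".bam"):
        file_format = "BAM"
    else:
        # ".sam", or no extension at all
        file_format = "SAM"
    return destination, file_format
```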

Optional Generic Parameters

Output log & Summary Statistics FileName (--log)

Output file name for writing logs & summary statistics.

If this parameter is not specified, logs are written to the file named by --out with ".log" appended. If the output BAM is written to stdout (--out starts with '-'), logs are written to stderr. If the file name given to --log starts with '-', logs are also written to stderr.
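
The rules for where the log is written can be expressed as a small decision function. This is only a sketch of the behavior described in this section; the function name is hypothetical.

```python
def log_destination(out_name, log_name=None):
    """Return where logs go for the given --out and optional --log values."""
    if log_name is not None:
        # A --log value starting with '-' means stderr.
        return "stderr" if log_name.startswith("-") else log_name
    if out_name.startswith("-"):
        # Output goes to stdout, so logs go to stderr.
        return "stderr"
    return out_name + ".log"
```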

Turn on Verbose Mode (--verbose)

Turn on verbose logging to get more log messages in the log and to stderr.

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing EOF block in your BGZF file.

By default, the trailing empty block is expected and checked for.

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

PhoneHome Parameters

See PhoneHome for more information on how PhoneHome works and what it does.

Turn off PhoneHome (--noPhoneHome)

Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.

Adjust the Frequency of PhoneHome (--phoneHomeThinning)

Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).

  • By default, --phoneHomeThinning is set to 50, running 50% of the time.
  • PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
  • N/A if --noPhoneHome is set.
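
The thinning rule above amounts to a single comparison. In the sketch below the function name is hypothetical, and how the run's random number is generated is not covered on this page, so it is passed in as an argument.

```python
def phone_home_runs(random_number, thinning=50, no_phone_home=False):
    """Decide whether PhoneHome would run for this invocation."""
    if no_phone_home:
        # --noPhoneHome disables PhoneHome completely.
        return False
    # PhoneHome occurs only if the run's random number modulo 100
    # is less than the thinning value.
    return random_number % 100 < thinning
```

With the default of 50, values 0 to 49 of the remainder trigger PhoneHome. A thinning value of 0 never triggers it and a value of 100 always does.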

Required Recalibration Parameters

Reference File (--refFile)

The reference file is a required parameter used for comparing read bases to the reference.

Optional Recalibration Parameters

DBSNP File (--dbsnp)

The dbSNP file specifies positions to skip when recalibrating. It is a tab-delimited file with the chromosome in the first column and the 1-based position in the second column.
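
A file in this two-column layout can be loaded into a set for fast position lookups. The sketch below is illustrative only: the function name is hypothetical, blank lines are skipped as a convenience, and any columns after the second are ignored.

```python
import io


def load_dbsnp_positions(handle):
    """Read (chromosome, 1-based position) pairs from a tab-delimited file."""
    positions = set()
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        chrom, pos = line.split("\t")[:2]
        positions.add((chrom, int(pos)))
    return positions


# Small in-memory example in place of a real file.
example = io.StringIO("1\t100\n1\t250\nX\t5\n")
known = load_dbsnp_positions(example)
```

A base aligned to chromosome 1, position 250 would then be found with ("1", 250) in known and left out of the table.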

Minimum Recalibration Base Quality (--minBaseQual)

When recalibrating reads, only positions with a base quality greater than this minimum phred quality are recalibrated. If --minBaseQual is not specified, it defaults to 5.

The ILLUMINA specs indicate that any quality below 5 can be used as an error indicator so we do not want to recalibrate those.

Maximum Recalibration Base Quality (--maxBaseQual)

This value sets the maximum phred base quality assigned to a base after recalibration. Any quality above this value is set to this value. The default is 50.

Blended Model Weight (--blended)

TBD - this parameter is not yet implemented.

Fit Model (--fitModel)

Check if the logistic regression model fits the data.

This option does NOT set the new qualities to the logistic regression calculated qualities, it only checks the fit. To apply the logistic regression qualities, see --useLogReg. --fitModel is automatically applied when --useLogReg is specified.

This option cannot be combined with --fast: when both are given, --fast overrides --fitModel. It is applied automatically when --useLogReg is specified.

Fast Recalibration (--fast)

Use a compact representation of the Recalibration Table that only allows:

  • at most 256 Read Groups
  • maximum quality 63
  • at most 127 cycles

This option will run faster than the default recalibration, but uses up to about 2.25G more memory than running without --fast.

This option cannot be combined with --fitModel or --useLogReg: --fast overrides --fitModel, but is itself overridden by --useLogReg.
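
The precedence between --fitModel, --fast, and --useLogReg can be summarized in a short sketch. The function name and the returned dictionary are hypothetical; only the override rules come from this page.

```python
def effective_modes(fit_model=False, fast=False, use_log_reg=False):
    """Resolve which of the three options are in effect after overrides."""
    if use_log_reg:
        # --useLogReg automatically applies --fitModel and overrides --fast.
        return {"fitModel": True, "fast": False, "useLogReg": True}
    if fast:
        # --fast overrides --fitModel.
        return {"fitModel": False, "fast": True, "useLogReg": False}
    return {"fitModel": fit_model, "fast": False, "useLogReg": False}
```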

Allow Previous Base DBSNP (--keepPrevDbsnp)

By default bases where the previous base is in DBSNP are excluded from the Recalibration Table.

This option includes these bases in the building of the Recalibration Table.

Allow Previous Base Non-Match/Mismatch (--keepPrevNonAdjacent)

By default bases where the previous base is not a CIGAR Match/Mismatch are excluded from the Recalibration Table.

This option includes these bases in the building of the Recalibration Table.


Logistic Regression (--useLogReg)

Use the logistic regression empirical qualities for setting the new base qualities instead of the default formula.

This option automatically enables --fitModel and disables --fast.

Read the quality from a tag (--qualField)

If this parameter is set, then read the quality string from the specified tag name. If the tag is not found, the quality is read from the quality field.
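
The fallback described above can be sketched as follows. The record layout (a dictionary with a "qual" string and a "tags" dictionary) is a hypothetical stand-in for a SAM record and is not a bamUtil data structure.

```python
def starting_quality(record, qual_field=None):
    """Pick the quality string used as the starting point for recalibration."""
    if qual_field is not None:
        tag_value = record["tags"].get(qual_field)
        if tag_value is not None:
            return tag_value
    # No tag requested, or the tag is missing: use the quality field.
    return record["qual"]
```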

Store the original quality (--storeQualTag)

If this parameter is set, the original quality will be stored as a string in the specified tag.

Skip Records with any of the Specified Flags (--buildExcludeFlags, --applyExcludeFlags)

Use --buildExcludeFlags to skip records with any of the specified flags set when building the recalibration table, default 0xF04.

By default, when building the recalibration table reads with any of the following flags set are skipped:

  • unmapped
  • secondary alignment
  • fails QC checks
  • duplicate
  • supplementary alignment

Use --applyExcludeFlags to skip records with any of the specified flags set when applying the recalibration table. The default value is 0x000 (do not skip any reads).
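
The default build mask 0xF04 is the combination of the five flags listed above, using the standard SAM flag bit values. The sketch below decodes a mask and tests a read's flag against it; the function names are illustrative.

```python
# Standard SAM flag bits for the five flags in the default mask.
SAM_FLAGS = {
    0x004: "unmapped",
    0x100: "secondary alignment",
    0x200: "fails QC checks",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def describe_mask(mask):
    """List the names of the flags from SAM_FLAGS that are set in mask."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if mask & bit]


def is_excluded(read_flag, exclude_mask):
    """A read is skipped if any flag in the mask is set on it."""
    return (read_flag & exclude_mask) != 0
```

A read flagged as duplicate (0x400) is excluded under the 0xF04 build default, but not under the 0x000 apply default.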

Return Value

Returns -1 if input parameters are invalid.

Returns the SamStatus for the reads/writes (0 on success, non-0 on failure).