Changes

From Genome Analysis Wiki
Jump to navigationJump to search
5,297 bytes added ,  23:01, 19 April 2019
Line 10: Line 10:  
==Handling Recalibration/Implementation Notes==
 
==Handling Recalibration/Implementation Notes==
   −
Recalibration is a 2-step process that loops through the file twice:
+
Recalibration is a 2-step process that loops through the file twice (stdin is not support as input):
 
# Build Recalibration Table
 
# Build Recalibration Table
 
# Apply Recalibration Table
 
# Apply Recalibration Table
Line 37: Line 37:  
** base quality > [[#Minimum Recalibration Base Quality (--minBaseQual)|minBaseQual (5 by default)]]
 
** base quality > [[#Minimum Recalibration Base Quality (--minBaseQual)|minBaseQual (5 by default)]]
 
* Additional criteria for cycle != 1 (can be turned off via flags)
 
* Additional criteria for cycle != 1 (can be turned off via flags)
** previous base is a CIGAR Match/Mismatch
+
** previous base is a CIGAR Match/Mismatch (Use [[#Allow Previous Base Non-Match/Mismatch (--keepPrevNonAdjacent)|<code>--keepPrevNonAdjacent</code>]] to disable)
** previous base position is not a [[#DBSNP File (--dbsnp)|dbSNP position]]
+
** previous base position is not a [[#DBSNP File (--dbsnp)|dbSNP position]] (Use [[#Allow Previous Base DBSNP (--keepPrevDbsnp)|<code>--keepPrevDbsnp</code>]] to disable)
      Line 48: Line 48:  
Recalibrated Quality is: <math>-10 * \log \frac{mismatches + 1}{mismatches + matches + 1}</math>
 
Recalibrated Quality is: <math>-10 * \log \frac{mismatches + 1}{mismatches + matches + 1}</math>
    +
Alternatively, [[#Logistic Regression (--useLogReg)|logistic regression]] can be used for calculating the new quality.
    
If the Recalibrated Quality is greater than [[#Maximum Recalibration Base Quality (--maxBaseQual)|maxBaseQual]], the updated quality is set to maxBaseQual.
 
If the Recalibrated Quality is greater than [[#Maximum Recalibration Base Quality (--maxBaseQual)|maxBaseQual]], the updated quality is set to maxBaseQual.
Line 57: Line 58:  
The current recalibration logic was designed for recalibrating ILLUMINA data.
 
The current recalibration logic was designed for recalibrating ILLUMINA data.
    +
NOTE: GATK ignores/skips adapters, but our logic does not.
    
== How to use it ==
 
== How to use it ==
Line 63: Line 65:     
The input SAM/BAM file ([[#input File (--in)|--in]]), the output SAM/BAM file ([[#output File (--out)|--out]]), and the reference file ([[#Reference File (--refFile)|--refFile]]) are required inputs.
 
The input SAM/BAM file ([[#input File (--in)|--in]]), the output SAM/BAM file ([[#output File (--out)|--out]]), and the reference file ([[#Reference File (--refFile)|--refFile]]) are required inputs.
 +
 +
 +
Recommended usage with Deduper:
 +
 +
/usr/cluster/bin/bam dedup --recab --in ${INPUT}.bam --out ${OUTPUT}.bam --force --refFile ${REF} --dbsnp ${DBSNP} --oneChrom --storeQualTag OQ --maxBaseQual 40
 +
 +
 +
Recommended usage without Deduper:
 +
 +
/usr/cluster/bin/bam recab --in ${INPUT}.bam --out ${OUTPUT}.bam --refFile ${REF} --dbsnp ${DBSNP} --storeQualTag OQ --maxBaseQual 40
    
= Usage =
 
= Usage =
  ./bam recab (options) --in <InputBamFile> --out <OutputFile> [--log <logFile>] [--verbose] [--noeof] [--params] --refFile <ReferenceFile> [--dbsnp <dbsnpFile>] [--minBaseQual <minBaseQual>] [--maxBaseQual <maxBaseQual>] [--blended <weight>] [--skipFitModel] [--fast] [--keepPrevDbsnp] [--keepPrevNonAdjacent] [--useLogReg]
+
  ./bam recab (options) --in <InputBamFile> --out <OutputFile> [--log <logFile>] [--verbose] [--noeof] [--params] --refFile <ReferenceFile> [--dbsnp <dbsnpFile>] [--minBaseQual <minBaseQual>] [--maxBaseQual <maxBaseQual>] [--blended <weight>] [--fitModel] [--fast] [--keepPrevDbsnp] [--keepPrevNonAdjacent] [--useLogReg] [--qualField <tag>] [--storeQualTag <tag>] [--buildExcludeFlags <flag>] [--applyExcludeFlags <flag>]
    
= Parameters =
 
= Parameters =
 
<pre>
 
<pre>
 
Required General Parameters :
 
Required General Parameters :
--in <infile>  : input BAM file name
+
        --in <infile>  : input BAM file name
--out <outfile> : output recalibration file name
+
        --out <outfile> : output recalibration file name
Optional General Parameters :  
+
Optional General Parameters :
--log <logfile> : log and summary statistics (default: [outfile].log)
+
        --log <logfile> : log and summary statistics (default: [outfile].log)
--verbose      : Turn on verbose mode
+
        --verbose      : Turn on verbose mode
--noeof        : do not expect an EOF block on a bam file.
+
        --noeof        : do not expect an EOF block on a bam file.
--params        : print the parameter settings
+
        --params        : print the parameter settings
    
Recab Specific Required Parameters
 
Recab Specific Required Parameters
--refFile <reference file>    : reference file name
+
        --refFile <reference file>    : reference file name
Recab Specific Optional Parameters :  
+
Recab Specific Optional Parameters :
--dbsnp <known variance file> : dbsnp file of positions
+
        --dbsnp <known variance file> : dbsnp file of positions
--minBaseQual <minBaseQual>  : minimum base quality of bases to recalibrate (default: 5)
+
        --minBaseQual <minBaseQual>  : minimum base quality of bases to recalibrate (default: 5)
--maxBaseQual <maxBaseQual>  : maximum recalibrated base quality (default: 50)
+
        --maxBaseQual <maxBaseQual>  : maximum recalibrated base quality (default: 50)
--blended <weight>            : blended model weight
+
                                        qualities over this value will be set to this value.
--skipFitModel                : do not check if the logistic regression model fits the data
+
                                        This setting is applied after binning (if applicable).
--fast                        : use a compact representation that only allows:
+
        --blended <weight>            : blended model weight
                                  * at most 256 Read Groups
+
        --fitModel                    : check if the logistic regression model fits the data
                                  * maximum quality 63
+
                                        overriden by fast, but automatically applied by useLogReg
                                  * at most 127 cycles
+
        --fast                        : use a compact representation that only allows:
                                automatically enables skipFitModel, but is overridden by useLogReg
+
                                          * at most 256 Read Groups
                                uses up to about 2.25G more memory than running without --fast.
+
                                          * maximum quality 63
--keepPrevDbsnp              : do not exclude entries where the previous base is in dbsnp when
+
                                          * at most 127 cycles
                                building the recalibration table
+
                                        overrides fitModel, but is overridden by useLogReg
                                By default they are excluded from the table.
+
                                        uses up to about 2.25G more memory than running without --fast.
--keepPrevNonAdjacent        : do not exclude entries where the previous base is not adjacent
+
        --keepPrevDbsnp              : do not exclude entries where the previous base is in dbsnp when
                                (not a Cigar M/X/=) when building the recalibration table
+
                                        building the recalibration table
                                By default they are excluded from the table (except the first cycle).
+
                                        By default they are excluded from the table.
--useLogReg                  : use logistic regression calculated quality for the new quality
+
        --keepPrevNonAdjacent        : do not exclude entries where the previous base is not adjacent
                                ignores setting of skipFitModel and fast.
+
                                        (not a Cigar M/X/=) when building the recalibration table
--qualField <quality tag>    : tag to get the starting base quality
+
                                        By default they are excluded from the table (except the first cycle).
                                (default is to get it from the Quality field)
+
        --useLogReg                  : use logistic regression calculated quality for the new quality
--storeQualTag <quality tag>  : tag to store the previous quality into
+
                                        automatically applies fitModel and overrides fast.
 +
        --qualField <quality tag>    : tag to get the starting base quality
 +
                                        (default is to get it from the Quality field)
 +
        --storeQualTag <quality tag>  : tag to store the previous quality into
 +
        --buildExcludeFlags <flag>    : exclude reads with any of these flags set when building the
 +
                                        recalibration table.  Default is 0xF04
 +
        --applyExcludeFlags <flag>    : do not apply the recalibration table to any reads with any of these flags set
 +
        Quality Binning Parameters (optional):
 +
          Bin qualities by phred score, into the ranges specified by binQualS or binQualF (both cannot be used)
 +
          Ranges are specified by comma separated minimum phred score for the bin, example: 1,17,20,30,40,50,70
 +
          The first bin always starts at 0, so does not need to be specified.
 +
          By default, the bin value is the low end of the range.
 +
                --binQualS  : Bin the Qualities as specified (phred): minQualOfBin2, minQualofBin3...
 +
                --binQualF  : Bin the Qualities based on the specified file
 +
                --binMid    : Use the mid point of the quality bin range for the quality value of the bin.
 +
                --binHigh    : Use the high end of the quality bin range for the quality value of the bin.
 +
 
 
</pre>
 
</pre>
 +
{{PhoneHomeParamDesc}}
   −
{{inBAMInputFile}}
+
== Required Generic Parameters ==
 +
{{inBAMInputFile|noStdin=1}}
 
{{outBAMOutputFile}}
 
{{outBAMOutputFile}}
   −
== Output log & Summary Statistics FileName (<code>--log</code>) ==
+
== Optional Generic Parameters ==
 +
=== Output log & Summary Statistics FileName (<code>--log</code>) ===
    
Output file name for writing logs & summary statistics.
 
Output file name for writing logs & summary statistics.
Line 114: Line 145:  
If this parameter is not specified, it will write to the output file specified in <code>--out</code> + ".log".  Or if the output bam is written to stdout (<code>--out</code> starts with '-'), the logs will be written to stderr.  If the filename after --log starts with '-' it will write to stderr.
 
If this parameter is not specified, it will write to the output file specified in <code>--out</code> + ".log".  Or if the output bam is written to stdout (<code>--out</code> starts with '-'), the logs will be written to stderr.  If the filename after --log starts with '-' it will write to stderr.
   −
== Turn on Verbose Mode (<code>--verbose</code>) ==
+
=== Turn on Verbose Mode (<code>--verbose</code>) ===
    
Turn on verbose logging to get more log messages in the log and to stderr.
 
Turn on verbose logging to get more log messages in the log and to stderr.
Line 121: Line 152:  
{{paramsParameter}}
 
{{paramsParameter}}
   −
== Reference File (<code>--refFile</code>) ==
+
{{PhoneHomeParameters}}
   −
The reference file to use for comparing read bases to the reference.
+
== Required Recalibration Parameters ==
 +
=== Reference File (<code>--refFile</code>) ===
   −
== DBSNP File (<code>--dbsnp</code>) ==
+
The reference file is a required parameter used for comparing read bases to the reference.
   −
The dbsnp file that specifies positions to skip recalibrating.  Tab delimited file with the chromosome in the first column and the 1-based position in the 2nd column.
+
== Optional Recalibration Parameters ==
   −
== Blended Model Weight (<code>--blended</code>) ==
+
=== DBSNP File (<code>--dbsnp</code>) ===
   −
<span style="color:red">TBD - this parameter is not yet implemented.</span>
+
The dbsnp file that specifies positions to skip recalibrating.  Tab delimited file with the chromosome in the first column and the 1-based position in the 2nd column.
   −
== Minimum Recalibration Base Quality (<code>--minBaseQual</code>) ==
+
=== Minimum Recalibration Base Quality (<code>--minBaseQual</code>) ===
    
When recalibrating reads, only positions with a base quality greater than this minimum phred quality will be recalibrated.  If <code>--minBaseQual</code> is not specified, it is defaulted to 5.  
 
When recalibrating reads, only positions with a base quality greater than this minimum phred quality will be recalibrated.  If <code>--minBaseQual</code> is not specified, it is defaulted to 5.  
Line 139: Line 171:  
The ILLUMINA specs indicate that any quality below 5 can be used as an error indicator so we do not want to recalibrate those.
 
The ILLUMINA specs indicate that any quality below 5 can be used as an error indicator so we do not want to recalibrate those.
   −
== Maximum Recalibration Base Quality (<code>--maxBaseQual</code>) ==
+
=== Maximum Recalibration Base Quality (<code>--maxBaseQual</code>) ===
    
This value sets the maximum phred base quality assigned to a base after recalibrating. Any qualities above this value will be set to this value.  It is defaulted to 50.  
 
This value sets the maximum phred base quality assigned to a base after recalibrating. Any qualities above this value will be set to this value.  It is defaulted to 50.  
   −
== Logistic Regression (<code>--useLogReg</code>) ==
+
=== Blended Model Weight (<code>--blended</code>) ===
 +
 
 +
<span style="color:red">TBD - this parameter is not yet implemented.</span>
 +
 
 +
=== Fit Model (<code>--fitModel</code>) ===
 +
 
 +
Check if the logistic regression model fits the data.
 +
 
 +
This option does NOT set the new qualities to the logistic regression calculated qualities, it only checks the fit.  To apply the logistic regression qualities, see [[#Logistic Regression (--useLogReg)|<code>--useLogReg</code>]].  <code>--fitModel</code> is automatically applied when <code>--useLogReg</code> is specified.
 +
                             
 +
This option cannot be used in conjunction with [[#Fast Recalibration (--fast)|<code>--fast</code>]] and is overriden by <code>--fast</code>, but automatically applied by useLogReg
   −
Use the logistic regression empirical qualities for setting the new base qualities instead of the default formula:  -10 * log10((#mismatches+1)/(#total+1))
+
=== Fast Recalibration (<code>--fast</code>) ===
   −
== Read the quality from a tag (<code>--qualField</code>) ==
+
Use a compact representation of the Recalibration Table that only allows:
 +
* at most 256 Read Groups
 +
* maximum quality 63
 +
* at most 127 cycles
 +
 
 +
This option will run faster than the default recalibration, but uses up to about 2.25G more memory than running without --fast.
 +
 
 +
This option cannot be used in conjunction with [[#Fit Model (--fitModel)|<code>--fitModel</code>]], or [[#Logistic Regression (--useLogReg)|<code>--useLogReg</code>]] and overrides [[#Fit Model (--fitModel)|<code>--fitModel</code>]], but is overridden by [[#Logistic Regression (--useLogReg)|<code>--useLogReg</code>]].
 +
 
 +
=== Allow Previous Base DBSNP (<code>--keepPrevDbsnp</code>) ===
 +
 
 +
By default bases where the previous base is in DBSNP are excluded from the Recalibration Table.
 +
 
 +
This option includes these bases in the building of the Recalibration Table.
 +
 
 +
=== Allow Previous Base Non-Match/Mismatch (<code>--keepPrevNonAdjacent</code>) ===
 +
 
 +
By default bases where the previous base is not a CIGAR Match/Mismatch are excluded from the Recalibration Table.
 +
 
 +
This option includes these bases in the building of the Recalibration Table.
 +
 
 +
 
 +
=== Logistic Regression (<code>--useLogReg</code>) ===
 +
 
 +
Use the logistic regression empirical qualities for setting the new base qualities instead of the default formula.
 +
 
 +
This option automatically enables [[#Fit Model (--fitModel)|<code>--fitModel</code>]] and disables [[#Fast Recalibration (--fast)|<code>--fast</code>]].
 +
 
 +
=== Read the quality from a tag (<code>--qualField</code>) ===
    
If this parameter is set, then read the quality string from the specified tag name.  If the tag is not found, the quality is read from the quality field.
 
If this parameter is set, then read the quality string from the specified tag name.  If the tag is not found, the quality is read from the quality field.
   −
== Store the original quality (<code>--storeQualTag</code>) ==
+
=== Store the original quality (<code>--storeQualTag</code>) ===
    
If this parameter is set, the original quality will be stored as a string in the specified tag.
 
If this parameter is set, the original quality will be stored as a string in the specified tag.
 +
 +
=== Skip Records with any of the Specified Flags (<code>--buildExcludeFlags</code>, <code>--applyExcludeFlags</code>) ===
 +
Use <code>--buildExcludeFlags</code> to skip records with any of the specified flags set when building the recalibration table, default 0xF04.
 +
 +
By default, when building the recalibration table reads with any of the following flags set are skipped:
 +
* unmapped
 +
* secondary alignment
 +
* fails QC checks
 +
* duplicate
 +
* supplementary alignment
 +
 +
Use <code>--applyExcludeFlags</code> to skip records with any of the specified flags set when applying the recalibration table.  The default value is 0x000, do not skip any reads.
    
= Return Value =
 
= Return Value =
Line 159: Line 241:  
Returns -1 if input parameters are invalid.
 
Returns -1 if input parameters are invalid.
   −
Returns the SamStatus for the reads/writes (0 on success).
+
Returns the SamStatus for the reads/writes (0 on success, non-0 on failure).

Navigation menu