Changes

From Genome Analysis Wiki
Jump to navigationJump to search
8,645 bytes added ,  23:01, 19 April 2019
Line 2: Line 2:  
[[Category:BAM Software]]
 
[[Category:BAM Software]]
 
[[Category:Software]]
 
[[Category:Software]]
  −
='''COMING SOON, June, 2012'''=
      
= Overview of the <code>recab</code> function of <code>[[bamUtil]]</code> =
 
= Overview of the <code>recab</code> function of <code>[[bamUtil]]</code> =
 
The <code>recab</code> option of [[bamUtil]] recalibrates a SAM/BAM file.  
 
The <code>recab</code> option of [[bamUtil]] recalibrates a SAM/BAM file.  
   −
==Handling Recalibration==
+
Recalibration can also be called as an option of [[bamUtil: dedup]].  This will perform the recalibration and  the deduping in the same set of steps, increasing processing speed.
 +
 
 +
==Handling Recalibration/Implementation Notes==
 +
 
 +
Recalibration is a 2-step process that loops through the file twice (stdin is not support as input):
 +
# Build Recalibration Table
 +
# Apply Recalibration Table
 +
 
 +
 
 +
The Recalibration Table groups bases based on a set of covariates:
 +
* Read Group
 +
* Quality (either from the quality string or [[#Read the quality from a tag (--qualField)|from a tag]])
 +
* Cycle (reverse complement for reverse strands)
 +
* 1st/2nd read in pair
 +
* Previous Cycle's Base (reverse complement for reverse strands)
 +
* This Cycle's Base (reverse complement for reverse strands)
 +
 
 +
The Recalibration Table tracks the number of matches/mismatches for each set of covariates.
 +
 
   −
Reads Not Recalibrated:
+
Only bases meeting all of the following criteria are used to Build the Recalibration Table:
* Duplicates
+
* Read criteria
* Unmapped
+
** not a duplicate
* Mapping Quality = 0
+
** mapped
* Mapping Quality = 255
+
** mapping quality != 0
 +
** mapping quality != 255
 +
* Base criteria
 +
** match/mismatch (not an insertion/deletion/skip/clip)
 +
** not a [[#DBSNP File (--dbsnp)|dbSNP position]]
 +
** base quality > [[#Minimum Recalibration Base Quality (--minBaseQual)|minBaseQual (5 by default)]]
 +
* Additional criteria for cycle != 1 (can be turned off via flags)
 +
** previous base is a CIGAR Match/Mismatch (Use [[#Allow Previous Base Non-Match/Mismatch (--keepPrevNonAdjacent)|<code>--keepPrevNonAdjacent</code>]] to disable)
 +
** previous base position is not a [[#DBSNP File (--dbsnp)|dbSNP position]] (Use [[#Allow Previous Base DBSNP (--keepPrevDbsnp)|<code>--keepPrevDbsnp</code>]] to disable)
      −
=== Covariates Notes ===
+
The Recalibration Table is applied to all bases meeting all of the following criteria (even if they were not used for creating the table):
Duplicates are determined by checking for matching keys. 
+
* base quality > [[#Minimum Recalibration Base Quality (--minBaseQual)|minBaseQual (5 by default)]]
 +
* at least 1 match or mismatch for the set of covariates
   −
The Key is comprised of:
  −
# Chromosome
  −
# Orientation (forward/reverse)
  −
# Unclipped Start(forward)/End(reverse)
  −
# Library
     −
Rules:
+
Recalibrated Quality is: <math>-10 * \log \frac{mismatches + 1}{mismatches + matches + 1}</math>
* Skip Unmapped Reads, they are not marked as duplicate
+
 
* Mark a Single-End Read Duplicate (or remove it if configured to do so) if:
+
Alternatively, [[#Logistic Regression (--useLogReg)|logistic regression]] can be used for calculating the new quality.
*# A paired-end record has the same key (even if the pair is not proper/the mate is unmapped/the mate is not found)<br/>-OR-
+
 
*# A single-end record has the same key and a higher base quality sum (sum of all base qualities in the record)
+
If the Recalibrated Quality is greater than [[#Maximum Recalibration Base Quality (--maxBaseQual)|maxBaseQual]], the updated quality is set to maxBaseQual.
* Mark both Paired-End Reads Duplicate if:
+
 
# Another paired-end pair has the same set of keys and has a higher base quality sum.
+
 
+
Optionally, the previous quality can be [[#Store the original quality (--storeQualTag)|stored in a tag]].
This code assumes that at most 1000 bases are clipped at the start of a read.
+
 
 +
 
 +
The current recalibration logic was designed for recalibrating ILLUMINA data.
 +
 
 +
NOTE: GATK ignores/skips adapters, but our logic does not.
    
== How to use it ==
 
== How to use it ==
   −
When <code>dedup</code> is invoked without any arguments the usage information is displayed as described below under [[#Usage|Usage]].
+
When <code>recab</code> is invoked without any arguments the usage information is displayed as described below under [[#Usage|Usage]].
   −
The input SAM/BAM file is required, [[#input File (--in)|input File (--in)]], and must be sorted by coordinate.
+
The input SAM/BAM file ([[#input File (--in)|--in]]), the output SAM/BAM file ([[#output File (--out)|--out]]), and the reference file ([[#Reference File (--refFile)|--refFile]]) are required inputs.
   −
The output SAM/BAM file is also required, [[#output File (--out)|output File (--out)]].
+
 
 +
Recommended usage with Deduper:
 +
 
 +
/usr/cluster/bin/bam dedup --recab --in ${INPUT}.bam --out ${OUTPUT}.bam --force --refFile ${REF} --dbsnp ${DBSNP} --oneChrom --storeQualTag OQ --maxBaseQual 40
 +
 
 +
 
 +
Recommended usage without Deduper:
 +
 
 +
/usr/cluster/bin/bam recab --in ${INPUT}.bam --out ${OUTPUT}.bam --refFile ${REF} --dbsnp ${DBSNP} --storeQualTag OQ --maxBaseQual 40
    
= Usage =
 
= Usage =
  ./bam recab --in <InputBamFile> --out <OutputFile> [--log <logFile>] [--verbose] [--noeof] [--params] --refFile <ReferenceFile> [--dbsnp <dbsnpFile>] [--blended <weight>]  
+
  ./bam recab (options) --in <InputBamFile> --out <OutputFile> [--log <logFile>] [--verbose] [--noeof] [--params] --refFile <ReferenceFile> [--dbsnp <dbsnpFile>] [--minBaseQual <minBaseQual>] [--maxBaseQual <maxBaseQual>] [--blended <weight>] [--fitModel] [--fast] [--keepPrevDbsnp] [--keepPrevNonAdjacent] [--useLogReg] [--qualField <tag>] [--storeQualTag <tag>] [--buildExcludeFlags <flag>] [--applyExcludeFlags <flag>]
    
= Parameters =
 
= Parameters =
 
<pre>
 
<pre>
 
Required General Parameters :
 
Required General Parameters :
--in <infile>  : input BAM file name
+
        --in <infile>  : input BAM file name
--out <outfile> : output recalibration file name
+
        --out <outfile> : output recalibration file name
Optional General Parameters :  
+
Optional General Parameters :
--log <logfile> : log and summary statistics (default: [outfile].log)
+
        --log <logfile> : log and summary statistics (default: [outfile].log)
--verbose      : Turn on verbose mode
+
        --verbose      : Turn on verbose mode
--noeof        : do not expect an EOF block on a bam file.
+
        --noeof        : do not expect an EOF block on a bam file.
--params        : print the parameter settings
+
        --params        : print the parameter settings
    
Recab Specific Required Parameters
 
Recab Specific Required Parameters
--refFile <reference file>    : reference file name
+
        --refFile <reference file>    : reference file name
Recab Specific Optional Parameters :  
+
Recab Specific Optional Parameters :
--dbsnp <known variance file> : dbsnp file of positions
+
        --dbsnp <known variance file> : dbsnp file of positions
--blended <weight>            : blended model weight
+
        --minBaseQual <minBaseQual>  : minimum base quality of bases to recalibrate (default: 5)
 +
        --maxBaseQual <maxBaseQual>  : maximum recalibrated base quality (default: 50)
 +
                                        qualities over this value will be set to this value.
 +
                                        This setting is applied after binning (if applicable).
 +
        --blended <weight>            : blended model weight
 +
        --fitModel                    : check if the logistic regression model fits the data
 +
                                        overriden by fast, but automatically applied by useLogReg
 +
        --fast                        : use a compact representation that only allows:
 +
                                          * at most 256 Read Groups
 +
                                          * maximum quality 63
 +
                                          * at most 127 cycles
 +
                                        overrides fitModel, but is overridden by useLogReg
 +
                                        uses up to about 2.25G more memory than running without --fast.
 +
        --keepPrevDbsnp              : do not exclude entries where the previous base is in dbsnp when
 +
                                        building the recalibration table
 +
                                        By default they are excluded from the table.
 +
        --keepPrevNonAdjacent        : do not exclude entries where the previous base is not adjacent
 +
                                        (not a Cigar M/X/=) when building the recalibration table
 +
                                        By default they are excluded from the table (except the first cycle).
 +
        --useLogReg                  : use logistic regression calculated quality for the new quality
 +
                                        automatically applies fitModel and overrides fast.
 +
        --qualField <quality tag>    : tag to get the starting base quality
 +
                                        (default is to get it from the Quality field)
 +
        --storeQualTag <quality tag>  : tag to store the previous quality into
 +
        --buildExcludeFlags <flag>    : exclude reads with any of these flags set when building the
 +
                                        recalibration table.  Default is 0xF04
 +
        --applyExcludeFlags <flag>    : do not apply the recalibration table to any reads with any of these flags set
 +
        Quality Binning Parameters (optional):
 +
          Bin qualities by phred score, into the ranges specified by binQualS or binQualF (both cannot be used)
 +
          Ranges are specified by comma separated minimum phred score for the bin, example: 1,17,20,30,40,50,70
 +
          The first bin always starts at 0, so does not need to be specified.
 +
          By default, the bin value is the low end of the range.
 +
                --binQualS  : Bin the Qualities as specified (phred): minQualOfBin2, minQualofBin3...
 +
                --binQualF  : Bin the Qualities based on the specified file
 +
                --binMid    : Use the mid point of the quality bin range for the quality value of the bin.
 +
                --binHigh    : Use the high end of the quality bin range for the quality value of the bin.
 +
 
 
</pre>
 
</pre>
 +
{{PhoneHomeParamDesc}}
   −
{{inBAMInputFile}}
+
== Required Generic Parameters ==
 +
{{inBAMInputFile|noStdin=1}}
 
{{outBAMOutputFile}}
 
{{outBAMOutputFile}}
   −
== BAM File Is Sorted By Read Name (<code>--minRecabQual</code>) ==
+
== Optional Generic Parameters ==
 +
=== Output log & Summary Statistics FileName (<code>--log</code>) ===
 +
 
 +
Output file name for writing logs & summary statistics.
 +
 
 +
If this parameter is not specified, it will write to the output file specified in <code>--out</code> + ".log".  Or if the output bam is written to stdout (<code>--out</code> starts with '-'), the logs will be written to stderr.  If the filename after --log starts with '-' it will write to stderr.
   −
When recalibrating reads, only positions with a base quality greater than this minimum will be recalibrated.  If <code>--minQual</code> is not specified, it is defaulted to <span style="color:red">TBD</span>.
+
=== Turn on Verbose Mode (<code>--verbose</code>) ===
   −
== Output log & Summary Statistics FileName (<code>--log</code>) ==
+
Turn on verbose logging to get more log messages in the log and to stderr.
   −
Output file name for writing logs & summary statistics.
+
{{noeofBGZFParameter}}
 +
{{paramsParameter}}
 +
 
 +
{{PhoneHomeParameters}}
 +
 
 +
== Required Recalibration Parameters ==
 +
=== Reference File (<code>--refFile</code>) ===
 +
 
 +
The reference file is a required parameter used for comparing read bases to the reference.
 +
 
 +
== Optional Recalibration Parameters ==
 +
 
 +
=== DBSNP File (<code>--dbsnp</code>) ===
 +
 
 +
The dbsnp file that specifies positions to skip recalibrating.  Tab delimited file with the chromosome in the first column and the 1-based position in the 2nd column.
 +
 
 +
=== Minimum Recalibration Base Quality (<code>--minBaseQual</code>) ===
 +
 
 +
When recalibrating reads, only positions with a base quality greater than this minimum phred quality will be recalibrated.  If <code>--minBaseQual</code> is not specified, it is defaulted to 5.
 +
 
 +
The ILLUMINA specs indicate that any quality below 5 can be used as an error indicator so we do not want to recalibrate those.
 +
 
 +
=== Maximum Recalibration Base Quality (<code>--maxBaseQual</code>) ===
 +
 
 +
This value sets the maximum phred base quality assigned to a base after recalibrating. Any qualities above this value will be set to this value.  It is defaulted to 50.
 +
 
 +
=== Blended Model Weight (<code>--blended</code>) ===
 +
 
 +
<span style="color:red">TBD - this parameter is not yet implemented.</span>
 +
 
 +
=== Fit Model (<code>--fitModel</code>) ===
 +
 
 +
Check if the logistic regression model fits the data.
 +
 
 +
This option does NOT set the new qualities to the logistic regression calculated qualities, it only checks the fit.  To apply the logistic regression qualities, see [[#Logistic Regression (--useLogReg)|<code>--useLogReg</code>]].  <code>--fitModel</code> is automatically applied when <code>--useLogReg</code> is specified.
 +
                             
 +
This option cannot be used in conjunction with [[#Fast Recalibration (--fast)|<code>--fast</code>]] and is overriden by <code>--fast</code>, but automatically applied by useLogReg
 +
 
 +
=== Fast Recalibration (<code>--fast</code>) ===
 +
 
 +
Use a compact representation of the Recalibration Table that only allows:
 +
* at most 256 Read Groups
 +
* maximum quality 63
 +
* at most 127 cycles
 +
 
 +
This option will run faster than the default recalibration, but uses up to about 2.25G more memory than running without --fast.
 +
 
 +
This option cannot be used in conjunction with [[#Fit Model (--fitModel)|<code>--fitModel</code>]], or [[#Logistic Regression (--useLogReg)|<code>--useLogReg</code>]] and overrides [[#Fit Model (--fitModel)|<code>--fitModel</code>]], but is overridden by [[#Logistic Regression (--useLogReg)|<code>--useLogReg</code>]].
 +
 
 +
=== Allow Previous Base DBSNP (<code>--keepPrevDbsnp</code>) ===
 +
 
 +
By default bases where the previous base is in DBSNP are excluded from the Recalibration Table.
 +
 
 +
This option includes these bases in the building of the Recalibration Table.
 +
 
 +
=== Allow Previous Base Non-Match/Mismatch (<code>--keepPrevNonAdjacent</code>) ===
 +
 
 +
By default bases where the previous base is not a CIGAR Match/Mismatch are excluded from the Recalibration Table.
   −
If this parameter is not specified, it will write to the output file specified in <code>--out</code> + ".log".  Or if the output bam is written to stdout (<code>--out</code> starts with '-'), the logs will be written to stderr.  If the filename after --log starts with '-' it will write to stderr.
+
This option includes these bases in the building of the Recalibration Table.
   −
== Treat Reads with Mates On Different Chromosomes As Single-Ended (<code>--oneChrom</code>) ==
     −
If a read's mate is not found it will not be used for duplicate marking.  If you are running on a single chromosome, all read's whose mates are on different chromosomes will not be used for duplicate marking.  The <code>--oneChrom</code> option will treat reads with mates on a different chromosome as single-ended.
+
=== Logistic Regression (<code>--useLogReg</code>) ===
   −
== Recalibrate (<code>--recab</code>) ==
+
Use the logistic regression empirical qualities for setting the new base qualities instead of the default formula.
   −
This option will recalibrate the input file in addition to deduping.
+
This option automatically enables [[#Fit Model (--fitModel)|<code>--fitModel</code>]] and disables [[#Fast Recalibration (--fast)|<code>--fast</code>]].
   −
== Remove Duplicates (<code>--rmDups</code>) ==
+
=== Read the quality from a tag (<code>--qualField</code>) ===
   −
Instead of marking a read as duplicate in the flag, the <code>--rmDups</code> option will remove it from the output BAM file.
+
If this parameter is set, then read the quality string from the specified tag name.  If the tag is not found, the quality is read from the quality field.
   −
== Ignore Previous Duplicate Marking (<code>--force</code>) ==
+
=== Store the original quality (<code>--storeQualTag</code>) ===
   −
By default the deduper will throw an error and stop if a read is already marked as duplicate.  The <code>--force</code> option will removes any previous duplicate marking and marks the reads from scratch.  The resulting output file will only have reads determined by the deduper marked as duplicates.
+
If this parameter is set, the original quality will be stored as a string in the specified tag.
   −
== Turn on Verbose Mode (<code>--verbose</code>) ==
+
=== Skip Records with any of the Specified Flags (<code>--buildExcludeFlags</code>, <code>--applyExcludeFlags</code>) ===
 +
Use <code>--buildExcludeFlags</code> to skip records with any of the specified flags set when building the recalibration table, default 0xF04.
   −
Turn on verbose logging to get more log messages in the log and to stderr.
+
By default, when building the recalibration table reads with any of the following flags set are skipped:
 +
* unmapped
 +
* secondary alignment
 +
* fails QC checks
 +
* duplicate
 +
* supplementary alignment
   −
{{noeofBGZFParameter}}
+
Use <code>--applyExcludeFlags</code> to skip records with any of the specified flags set when applying the recalibration table.  The default value is 0x000, do not skip any reads.
{{paramsParameter}}
      
= Return Value =
 
= Return Value =
Line 105: Line 241:  
Returns -1 if input parameters are invalid.
 
Returns -1 if input parameters are invalid.
   −
Returns the SamStatus for the reads/writes (0 on success).
+
Returns the SamStatus for the reads/writes (0 on success, non-0 on failure).

Navigation menu