Difference between revisions of "BamUtil: recab"

From Genome Analysis Wiki
 

Revision as of 11:02, 15 June 2012


COMING SOON, June, 2012

Overview of the recab function of bamUtil

The recab option of bamUtil recalibrates a SAM/BAM file.

Handling Recalibration

Reads Not Recalibrated:

  • Duplicates
  • Unmapped
  • Mapping Quality = 0
  • Mapping Quality = 255
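The skip rules above can be sketched as a small predicate. This is only an illustration of the listed conditions, not the bamUtil implementation; the function name `is_recalibrated` is hypothetical, and the flag bit values are the ones defined by the SAM format specification.

```python
# SAM FLAG bits as defined in the SAM format specification.
FLAG_UNMAPPED = 0x4
FLAG_DUPLICATE = 0x400


def is_recalibrated(flag: int, mapq: int) -> bool:
    """Return True if a read with this FLAG and MAPQ would be recalibrated.

    Hypothetical helper: mirrors the "Reads Not Recalibrated" list only.
    """
    if flag & FLAG_DUPLICATE:   # duplicates are skipped
        return False
    if flag & FLAG_UNMAPPED:    # unmapped reads are skipped
        return False
    if mapq in (0, 255):        # mapping quality 0 or 255 is skipped
        return False
    return True


if __name__ == "__main__":
    # A mapped, non-duplicate read with mapping quality 60.
    print(is_recalibrated(flag=99, mapq=60))
    # The same read marked as a duplicate.
    print(is_recalibrated(flag=99 | FLAG_DUPLICATE, mapq=60))
```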


Covariates Notes

How to use it

When recab is invoked without any arguments, the usage information is displayed as described below under Usage.

The input SAM/BAM file (--in), the output SAM/BAM file (--out), and the reference file (--refFile) are required inputs.

Usage

./bam recab --in <InputBamFile> --out <OutputFile> [--log <logFile>] [--verbose] [--noeof] [--params] --refFile <ReferenceFile> [--dbsnp <dbsnpFile>] [--blended <weight>] 

Parameters

Required General Parameters :
	--in <infile>   : input BAM file name
	--out <outfile> : output recalibration file name
Optional General Parameters : 
	--log <logfile> : log and summary statistics (default: [outfile].log)
	--verbose       : Turn on verbose mode
	--noeof         : do not expect an EOF block on a bam file.
	--params        : print the parameter settings

Recab Specific Required Parameters
	--refFile <reference file>    : reference file name
Recab Specific Optional Parameters : 
	--dbsnp <known variance file> : dbsnp file of positions
	--blended <weight>            : blended model weight
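As a rough sketch, the command line described above could be assembled from a script as follows. The helper name `build_recab_command` and all file names are hypothetical; only flags listed in the parameter summary are used, and --blended is left out because it is documented as not yet implemented.

```python
from typing import List, Optional


def build_recab_command(
    in_file: str,
    out_file: str,
    ref_file: str,
    log_file: Optional[str] = None,
    dbsnp_file: Optional[str] = None,
    verbose: bool = False,
    noeof: bool = False,
    params: bool = False,
    bam_executable: str = "./bam",  # assumed location of the bam binary
) -> List[str]:
    """Build the argument list for a 'bam recab' call (hypothetical helper)."""
    # Required parameters: --in, --out and --refFile.
    cmd = [bam_executable, "recab",
           "--in", in_file, "--out", out_file, "--refFile", ref_file]
    # Optional parameters are appended only when requested.
    if log_file is not None:
        cmd += ["--log", log_file]
    if dbsnp_file is not None:
        cmd += ["--dbsnp", dbsnp_file]
    if verbose:
        cmd.append("--verbose")
    if noeof:
        cmd.append("--noeof")
    if params:
        cmd.append("--params")
    return cmd


if __name__ == "__main__":
    # File names below are placeholders, not files shipped with bamUtil.
    print(" ".join(build_recab_command("sample.bam", "sample.recab.bam",
                                       "reference.fa", verbose=True)))
```

The resulting list can be passed to `subprocess.run` once the bam executable is available on the system.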

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.

A - indicates that input is read from stdin; in that case the extension after it determines the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file    --in yourFileName
SAM from stdin                        --in -
BAM from stdin                        --in -.bam
Uncompressed BAM from stdin           --in -.ubam


Note: Uncompressed BAM is written with compression level 0 (so it is not an entirely uncompressed file). This matches the samtools implementation, so pipes between our tools and samtools are supported.

Output File (--out)

Use --out followed by your file name to specify the SAM/BAM output file.

The file extension determines whether SAM, BAM, or uncompressed BAM is written. A - indicates stdout, and the extension after it gives the file type (no extension is SAM).

SAM to file                   --out yourFileName.sam
BAM to file                   --out yourFileName.bam
Uncompressed BAM to file      --out yourFileName.ubam
SAM to stdout                 --out -
BAM to stdout                 --out -.bam
Uncompressed BAM to stdout    --out -.ubam
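The naming rules for --out can be illustrated with a short sketch. The function `describe_output` is hypothetical and only encodes the cases listed above; raising an error for any other extension is an assumption made here, since the page does not say how bamUtil treats unlisted extensions.

```python
import os
from typing import Tuple

# Extensions documented for --out; an empty extension means SAM.
_FORMATS = {"": "SAM", ".sam": "SAM", ".bam": "BAM", ".ubam": "uncompressed BAM"}


def describe_output(out_name: str) -> Tuple[str, str]:
    """Return (destination, format) for an --out value (hypothetical helper)."""
    # A leading '-' selects stdout; anything else is treated as a file name.
    destination = "stdout" if out_name.startswith("-") else "file"
    _, ext = os.path.splitext(out_name)
    if ext not in _FORMATS:
        # Assumption: unlisted extensions are rejected in this sketch.
        raise ValueError("extension not covered by the documented cases: " + ext)
    return destination, _FORMATS[ext]


if __name__ == "__main__":
    for name in ("yourFileName.sam", "yourFileName.ubam", "-", "-.bam"):
        print(name, describe_output(name))
```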


Note: Uncompressed BAM is written with compression level 0 (so it is not an entirely uncompressed file). This matches the samtools implementation, so pipes between our tools and samtools are supported.

Output log & Summary Statistics FileName (--log)

Output file name for writing logs & summary statistics.

If this parameter is not specified, logs are written to the file named by --out with ".log" appended. If the output BAM is written to stdout (--out starts with '-'), logs are written to stderr. If the filename given to --log starts with '-', logs are also written to stderr.
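The choice of log destination can be summarized in a few lines. This is a sketch of the rules stated above, with a hypothetical function name; it is not taken from the bamUtil source.

```python
from typing import Optional


def resolve_log_target(out_name: str, log_name: Optional[str] = None) -> str:
    """Return the log file name, or 'stderr' (hypothetical helper)."""
    if log_name is not None:
        # An explicit --log value starting with '-' means stderr.
        return "stderr" if log_name.startswith("-") else log_name
    if out_name.startswith("-"):
        # Output goes to stdout, so logs go to stderr.
        return "stderr"
    # Default: the --out name with ".log" appended.
    return out_name + ".log"


if __name__ == "__main__":
    print(resolve_log_target("sample.recab.bam"))
    print(resolve_log_target("-.bam"))
    print(resolve_log_target("sample.recab.bam", log_name="-"))
```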

Turn on Verbose Mode (--verbose)

Turn on verbose logging to get more log messages in the log and to stderr.

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing EOF block in your BGZF file.

By default, the trailing empty block is expected and checked for.

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

Reference File (--refFile)

The reference file to use for comparing read bases to the reference.

DBSNP File (--dbsnp)

The dbSNP file specifies positions to skip when recalibrating. It is a tab-delimited file with the chromosome in the first column and the 1-based position in the second column.
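A file in this two-column layout can be read as shown in the sketch below. The function name and the example positions are made up for illustration; the sketch skips blank lines and does not handle header or comment lines, since the page does not describe any.

```python
import io
from typing import Iterable, Set, Tuple


def read_skip_positions(lines: Iterable[str]) -> Set[Tuple[str, int]]:
    """Parse 'chromosome<TAB>position' lines into a set (hypothetical helper)."""
    positions = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        # Column 1 is the chromosome, column 2 the 1-based position.
        positions.add((fields[0], int(fields[1])))
    return positions


if __name__ == "__main__":
    # Illustrative content only; these are not real dbSNP entries.
    example = io.StringIO("1\t1000\n1\t2500\nX\t42\n")
    skip = read_skip_positions(example)
    print(("1", 2500) in skip)
    print(("2", 2500) in skip)
```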

Blended Model Weight (--blended)

TBD - this parameter is not yet implemented.

Minimum Base Quality to Recalibrate (--minRecabQual)

When recalibrating reads, only positions with a base quality greater than this minimum will be recalibrated. If --minRecabQual is not specified, the default is TBD; this parameter is not yet implemented.

Return Value

Returns -1 if input parameters are invalid.

Returns the SamStatus for the reads/writes (0 on success).
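A calling script might interpret the return value along these lines. This is a sketch under two assumptions: that the return value is seen as the process exit status, and that the script runs on a POSIX system, where an exit status of -1 is reported to the parent as 255. The function name is hypothetical, and non-zero SamStatus values are not decoded here.

```python
def interpret_exit_status(status: int) -> str:
    """Describe a 'bam recab' exit status (hypothetical helper)."""
    if status == 0:
        return "success"
    # Assumption: -1 from the program shows up as 255 on POSIX systems.
    if status in (-1, 255):
        return "invalid input parameters"
    return "read/write error, SamStatus code %d" % status


if __name__ == "__main__":
    for code in (0, 255, 3):
        print(code, interpret_exit_status(code))
```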