Revision as of 17:48, 22 May 2012

COMING SOON, May/June, 2012

Overview of the `dedup` function of `bamUtil`

The dedup option of bamUtil finds duplicates in a coordinate-sorted SAM/BAM file and either marks or removes the lower-quality duplicates.

This tool also contains the option to perform recalibration.

NOTE: This tool does not work properly on templates that have more than 2 segments. It does not match reads correctly when more than 2 reads have the same read name.

Potential future features:

Soft clip overlapping reads (for now, use BamUtil: clipOverlap)

Handling Duplicates

The deduper reads all the alignments in a coordinate-sorted SAM/BAM looking for duplicates.

The deduper assumes that duplicates in the input BAM file are not marked.

When the deduper detects a marked duplicate in the input BAM file, it will throw an error and stop. To override this behavior, use the --force option; in this mode, alignments that are marked as duplicates in the input file are unmarked before the deduper begins its detection algorithm. The result is that only duplicates detected by the deduper will be marked in or removed from the output file.
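
The unmarking step performed under --force can be pictured as clearing the duplicate bit (0x400 in the SAM FLAG field) on every record before detection starts. The following is a minimal Python sketch of that idea, not bamUtil's own code; the function names are hypothetical.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def is_marked_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def unmark_duplicate(flag: int) -> int:
    """Return the SAM FLAG value with the duplicate bit cleared."""
    return flag & ~DUPLICATE_FLAG
```

For example, a FLAG of 1123 (99 plus the duplicate bit) becomes 99 after unmarking, while a FLAG that never had the bit set is returned unchanged.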

The handling of paired-end reads assumes that the mate information in the SAM/BAM records is accurate. If a mate is not found at the expected position, an error message is printed (once per file) indicating this error. Paired-end reads whose mate cannot be found are not marked duplicate and are not used for duplicate marking of other paired-end reads. Single-end reads with the same key as paired-end reads whose mate cannot be found are still marked as duplicate. If this error is encountered, you may want to fix the mate information and reprocess the file through the deduper. Use the --oneChrom option to treat reads with a mate on a different chromosome as single-ended. This option is useful if you are running the deduper on just a single chromosome.

Implementation Notes

Duplicates are determined by checking for matching keys.

The key consists of:

Chromosome
Orientation (forward/reverse)
unclipped start(forward)/end(reverse)
Library
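
The four key components listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the deduper's implementation: the names DedupKey and make_key are hypothetical, positions are 1-based as in SAM, the library is passed in as a plain string (in SAM it normally comes from the read group's LB field), and counting hard clips together with soft clips is an assumption.

```python
import re
from typing import NamedTuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference
CLIPS = set("SH")             # soft and hard clips (counting H is an assumption)


class DedupKey(NamedTuple):
    chromosome: str
    reverse: bool        # orientation: False = forward, True = reverse
    unclipped_pos: int   # unclipped start (forward) or unclipped end (reverse)
    library: str


def make_key(chromosome, pos, cigar, reverse, library):
    """Build a duplicate-detection key from one aligned record."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not reverse:
        # Forward strand: move the start left by the leading clipped bases.
        clipped = 0
        for n, op in ops:
            if op not in CLIPS:
                break
            clipped += n
        return DedupKey(chromosome, False, pos - clipped, library)
    # Reverse strand: aligned end plus the trailing clipped bases.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    clipped = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        clipped += n
    return DedupKey(chromosome, True, pos + ref_len - 1 + clipped, library)
```

With this sketch, a forward read at position 100 with CIGAR 5S45M and a forward read at position 95 with CIGAR 50M get the same key, which is why clipping has to be undone before positions are compared.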

Rules:

Skip unmapped reads; they are not marked as duplicates.
Mark a single-end read as a duplicate (or remove it if configured to do so) if:
1. A paired-end record has the same key (even if the pair is not proper, the mate is unmapped, or the mate is not found)
  -OR-
2. A single-end record has the same key and a higher base quality sum (the sum of all base qualities in the record)
Mark both paired-end reads as duplicates if:
1. Another paired-end pair has the same set of keys and a higher base quality sum.
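
The single-end rules above can be sketched in Python. This is a simplified illustration, not the deduper's code: the function names are hypothetical, keys are any hashable values, QUAL strings are assumed to use the standard SAM Phred+33 encoding, read names are assumed to be unique among the single-end reads, and keeping the first-seen read when quality sums tie is an assumption.

```python
from collections import defaultdict


def base_quality_sum(qual: str) -> int:
    """Sum of the Phred scores in a SAM QUAL string (Phred+33 encoding)."""
    return sum(ord(c) - 33 for c in qual)


def single_end_duplicates(single_reads, paired_keys):
    """Return the names of single-end reads to mark as duplicates.

    single_reads: iterable of (name, key, qual) tuples for single-end records.
    paired_keys:  set of keys seen on paired-end records.
    """
    by_key = defaultdict(list)
    for name, key, qual in single_reads:
        by_key[key].append((base_quality_sum(qual), name))

    duplicates = set()
    for key, reads in by_key.items():
        if key in paired_keys:
            # Rule 1: a paired-end record shares the key, so every
            # single-end read with that key is a duplicate.
            duplicates.update(name for _, name in reads)
            continue
        # Rule 2: keep the read with the highest base quality sum
        # (first seen on ties) and mark the rest.
        best = max(reads, key=lambda r: r[0])
        duplicates.update(name for _, name in reads if name != best[1])
    return duplicates
```

Paired-end handling follows the same pattern, with the pair of keys as the grouping value and the quality sum taken over the pair.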

This code assumes that at most 1000 bases are clipped at the start of a read.

How to use it

When dedup is invoked without any arguments the usage information is displayed as described below under Usage.

The input SAM/BAM file is required (see Input File (--in)) and must be sorted by coordinate.

The output SAM/BAM file is also required (see Output File (--out)).

Usage

./bam dedup (options) --in <InputBamFile> --out <OutputBamFile> [--log <logFile>] [--oneChrom] [--rmDups] [--force] [--verbose] [--noeof] [--params] [--recab] --refFile <ReferenceFile> [--dbsnp <dbsnpFile>] [--blended <weight>]
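
When dedup is driven from a script, the command line can be assembled as an argument list. The Python sketch below uses only options shown in the usage line above; the helper name, its parameters, and the file names are hypothetical, and the sketch only builds the list without running anything.

```python
def build_dedup_command(in_file, out_file, log_file=None, one_chrom=False,
                        rm_dups=False, force=False, bam_executable="./bam"):
    """Assemble an argument list for `bam dedup` (does not execute it)."""
    cmd = [bam_executable, "dedup", "--in", in_file, "--out", out_file]
    if log_file is not None:
        cmd += ["--log", log_file]
    if one_chrom:
        cmd.append("--oneChrom")
    if rm_dups:
        cmd.append("--rmDups")
    if force:
        cmd.append("--force")
    return cmd


# Hypothetical file names, for illustration only.
command = build_dedup_command("sorted.bam", "deduped.bam",
                              log_file="dedup.log", rm_dups=True)
```

The resulting list can be passed to subprocess.run(command, check=True), which raises an error if the program exits with a non-zero status.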

Parameters

Required parameters :
	--in <infile>   : input BAM file name (must be sorted)
	--out <outfile> : output BAM file name (same order with original file)
Optional parameters : (see SAM format specification for details)
	--log <logfile> : log and summary statistics (default: [outfile].log)
	--oneChrom      : Treat reads with mates on different chromosomes as single-ended.
	--rmDups        : Remove duplicates (default is to mark duplicates)
	--force         : Allow mark-duplicated BAM file and force unmarking the duplicates
                    Default is to throw errors when trying to run a mark-duplicated BAM
	--verbose       : Turn on verbose mode
	--noeof         : do not expect an EOF block on a bam file.
	--params        : print the parameter settings
	--recab         : Recalibrate in addition to deduping

Recab Specific Required Parameters
	--refFile <reference file>    : reference file name
Recab Specific Optional Parameters : 
	--dbsnp <known variance file> : dbsnp file of positions
	--blended <weight>            : blended model weight

Input File (`--in`)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines whether your input file is SAM, BAM, or uncompressed BAM; you only need to supply the file name, unless the input is stdin.

Use - to read from stdin. The extension that follows it determines the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file	`--in yourFileName`
SAM from stdin	`--in -`
BAM from stdin	`--in -.bam`
Uncompressed BAM from stdin	`--in -.ubam`

Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Note: The input file must be sorted by coordinate.

Output File (`--out`)

Use --out followed by your file name to specify the SAM/BAM output file.

The file extension determines whether SAM, BAM, or uncompressed BAM is written. Use - to write to stdout; the extension that follows it determines the file type (no extension indicates SAM).

SAM to file	`--out yourFileName.sam`
BAM to file	`--out yourFileName.bam`
Uncompressed BAM to file	`--out yourFileName.ubam`
SAM to stdout	`--out -`
BAM to stdout	`--out -.bam`
Uncompressed BAM to stdout	`--out -.ubam`
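
The mapping in the table above can be expressed as a small Python function. This only illustrates the naming convention and is not how bamUtil parses the argument; the function name is hypothetical, and treating any extension other than .bam or .ubam as SAM is an assumption.

```python
def describe_output(spec: str):
    """Map an --out argument to a (destination, format) pair."""
    destination = "stdout" if spec == "-" or spec.startswith("-.") else "file"
    if spec.endswith(".ubam"):
        file_format = "uncompressed BAM"
    elif spec.endswith(".bam"):
        file_format = "BAM"
    else:
        # No extension means SAM; other extensions are assumed to be SAM too.
        file_format = "SAM"
    return destination, file_format
```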

Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Do not require BGZF EOF block (`--noeof`)

Use --noeof if you do not expect a trailing EOF block in your BGZF file.

By default, the trailing empty block is expected and checked for.
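
To see whether a BAM file ends with the EOF block before deciding on --noeof, the last 28 bytes can be compared against the empty BGZF block defined in the SAM specification. The Python sketch below is independent of bamUtil; the function name is hypothetical.

```python
# The 28-byte empty BGZF block that marks end-of-file (from the SAM specification).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "42430200"
    "1b00" "0300" "00000000" "00000000"
)


def has_bgzf_eof(path: str) -> bool:
    """Return True if the file at `path` ends with the BGZF EOF block."""
    with open(path, "rb") as handle:
        handle.seek(0, 2)                 # jump to the end to learn the size
        size = handle.tell()
        if size < len(BGZF_EOF):
            return False
        handle.seek(size - len(BGZF_EOF))
        return handle.read() == BGZF_EOF
```

A file that fails this check was either truncated or written by a tool that omits the block; only the second case is a reason to use --noeof.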

Print the Program Parameters (`--params`)

Use --params to print the parameters for your program to stderr.

Return Value

Returns -1 if input parameters are invalid.

Returns the SamStatus for the reads/writes (0 on success).


Difference between revisions of "BamUtil: dedup"
