BamUtil: dedup

From Genome Analysis Wiki
Revision as of 11:53, 30 April 2012 by Mktrost (talk | contribs)


SuperDeDuper

This program will be re-released soon (April, 2012).

This program reads a BAM file, identifies duplicate alignments, and either marks or removes the lower-quality duplicates. It may also modify paired-end reads whose ends overlap, by soft clipping the end that has the lower-quality bases in the overlapping region. A few additional features are coming soon.

SuperDeDuper is a standalone application that is run as follows. The records in the input BAM file are assumed to be sorted by coordinate. The options are explained below.

Usage: SuperDeDuper (options) --in=<InputBamFile> --out=<OutputBamFile>
Required parameters :
  -i/--in [infile]:   input BAM file name (must be sorted)
  -o/--out [outfile]: output BAM file name (same order as the input file)
Optional parameters:  (see SAM format specification for details)
  -l/--log [logfile]: log and summary statistics (default: [outfile].log)
  -r/--rm:            Remove duplicates (default is to mark duplicates)
  -f/--force-unmark:  Accept an input BAM file with marked duplicates and unmark them first
                      Default is to throw an error when the input contains marked duplicates
  -c/--clip:          Soft clip lower quality segment of overlapping paired end reads
  -s/--swap:          Soft clip lower quality segment of overlapping paired end reads
                      Higher quality bases are swapped into the overlap of the unclipped end
  -v/--verbose:       Turn on verbose mode
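
As a rough illustration of how the options above combine, the following Python sketch builds an argument list that could be passed to subprocess.run. The helper build_command is hypothetical and not part of SuperDeDuper. The executable name is passed as a parameter and should match the installed binary, and the --log=<file> form is an assumption modeled on the --in=/--out= form shown in the usage line.

```python
from typing import List, Optional


def build_command(
    in_bam: str,
    out_bam: str,
    executable: str = "SuperDeDuper",  # assumption: adjust to the installed binary name
    remove: bool = False,
    force_unmark: bool = False,
    log: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Build an argument list using only the long options listed in the usage text."""
    cmd = [executable, f"--in={in_bam}", f"--out={out_bam}"]
    if log is not None:
        # Assumption: --log takes its value in the same --name=value form as --in/--out.
        cmd.append(f"--log={log}")
    if remove:
        cmd.append("--rm")            # remove duplicates instead of marking them
    if force_unmark:
        cmd.append("--force-unmark")  # accept input that already has marked duplicates
    if verbose:
        cmd.append("--verbose")
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it, since the binary may not be installed.
    print(" ".join(build_command("sorted.bam", "dedup.bam", remove=True)))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the executable is available on the PATH.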

Handling Duplicates

SuperDeDuper reads all the alignments in a coordinate-sorted BAM file looking for duplicates. Two single-end reads are considered to be duplicates if they share the same referenceID (chromosome), orientation, library, and unclipped coordinate (left-most for forward strands and right-most for reverse strands). Two paired-end reads are considered to be duplicates if corresponding ends in the two reads are duplicates when viewed as single-end reads.
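
The duplicate criteria above can be sketched as a key function: two single-end reads with equal keys would be treated as duplicates. This is a minimal Python illustration, not SuperDeDuper's actual code. The Read type and the function names are hypothetical, positions are taken to be 0-based, and counting hard clips together with soft clips when computing the unclipped coordinate is an assumption.

```python
import re
from typing import NamedTuple, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference
CLIP_OPS = set("SH")          # soft and hard clips (including hard clips is an assumption)


class Read(NamedTuple):
    ref_id: int    # reference (chromosome) index
    pos: int       # 0-based leftmost aligned position
    cigar: str     # CIGAR string, e.g. "5S45M"
    reverse: bool  # True if aligned to the reverse strand
    library: str   # library name, e.g. taken from the read group


def unclipped_coordinate(read: Read) -> int:
    """Left-most unclipped position for forward reads, right-most for reverse reads."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(read.cigar)]
    if not read.reverse:
        leading = 0
        for n, op in ops:
            if op not in CLIP_OPS:
                break
            leading += n
        return read.pos - leading
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trailing = 0
    for n, op in reversed(ops):
        if op not in CLIP_OPS:
            break
        trailing += n
    return read.pos + ref_len - 1 + trailing


def duplicate_key(read: Read) -> Tuple[int, bool, str, int]:
    """Reads with equal keys share reference, orientation, library, and unclipped coordinate."""
    return (read.ref_id, read.reverse, read.library, unclipped_coordinate(read))


def pair_key(first: Read, second: Read) -> tuple:
    """Paired-end reads are duplicates when the keys of corresponding ends match."""
    return (duplicate_key(first), duplicate_key(second))


if __name__ == "__main__":
    clipped = Read(0, 105, "5S45M", False, "libA")
    unclipped = Read(0, 100, "50M", False, "libA")
    print(duplicate_key(clipped) == duplicate_key(unclipped))
```

In the example at the bottom, the two reads start at different aligned positions, but the 5-base soft clip places both unclipped starts at position 100, so their keys match.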

When duplicates are detected, SuperDeDuper keeps the read with the highest base quality. By default, the other reads are marked as duplicates in the output file; if the -r option is used, they are removed from the output file instead. (Duplicates are marked by setting the duplicate bit, 0x400, in the alignment's flag.)
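
The mark-or-remove behavior can be sketched in Python as follows. This is an illustration under stated assumptions rather than SuperDeDuper's implementation: the Alignment type and dedup function are hypothetical, each record is assumed to carry a precomputed duplicate key, and "highest base quality" is approximated here as the sum of Phred base qualities, which the text above does not specify.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, List

DUPLICATE_FLAG = 0x400  # SAM flag bit for a PCR or optical duplicate


@dataclass
class Alignment:
    name: str
    key: Hashable     # precomputed duplicate key (hypothetical field)
    quals: List[int]  # Phred base qualities
    flag: int = 0


def quality_score(aln: Alignment) -> int:
    # Assumption: rank reads by the sum of their base qualities.
    return sum(aln.quals)


def dedup(alignments: List[Alignment], remove: bool = False) -> List[Alignment]:
    """Keep the best read per key; mark the rest (default) or drop them (remove=True)."""
    best: Dict[Hashable, Alignment] = {}
    for aln in alignments:
        current = best.get(aln.key)
        if current is None or quality_score(aln) > quality_score(current):
            best[aln.key] = aln

    output = []
    for aln in alignments:  # second pass preserves the input order
        if best[aln.key] is aln:
            output.append(aln)
        elif not remove:
            aln.flag |= DUPLICATE_FLAG  # mark in place, keep in the output
            output.append(aln)
    return output


if __name__ == "__main__":
    reads = [
        Alignment("r1", "k1", [30, 30]),
        Alignment("r2", "k1", [40, 40]),
        Alignment("r3", "k2", [10]),
    ]
    for aln in dedup(reads):
        print(aln.name, hex(aln.flag))
```

Note that this sketch holds all records in memory and breaks ties in favor of the first read seen; a real tool working on a coordinate-sorted file would likely process reads in a streaming fashion.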

SuperDeDuper assumes that duplicates in the input BAM file are not marked. When SuperDeDuper detects a marked duplicate in the input BAM file, it will throw an error and stop. To override this behavior, use the -f option; in this mode, alignments that are marked as duplicates in the input file are unmarked before SuperDeDuper begins its detection algorithm. The result is that only duplicates detected by SuperDeDuper will be marked in or removed from the output file.
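
The handling of already-marked input can be sketched as a check on each record's flag before detection begins. The Python below is a minimal illustration: the function prepare_flags and the exception AlreadyMarkedError are hypothetical names, and the flags are modeled as plain integers rather than full BAM records.

```python
from typing import Iterable, List

DUPLICATE_FLAG = 0x400  # SAM flag bit for a PCR or optical duplicate


class AlreadyMarkedError(Exception):
    """Raised when the input already contains a marked duplicate (hypothetical name)."""


def prepare_flags(flags: Iterable[int], force_unmark: bool = False) -> List[int]:
    """Return the flags to use for detection.

    Without force_unmark, a marked duplicate in the input is an error.
    With force_unmark (the -f behavior), the duplicate bit is cleared first,
    so only duplicates found by the detection step end up marked.
    """
    cleaned = []
    for index, flag in enumerate(flags):
        if flag & DUPLICATE_FLAG:
            if not force_unmark:
                raise AlreadyMarkedError(
                    f"record {index} is already marked as a duplicate"
                )
            flag &= ~DUPLICATE_FLAG  # clear only the duplicate bit
        cleaned.append(flag)
    return cleaned


if __name__ == "__main__":
    # 0x410 = reverse strand (0x10) plus duplicate (0x400)
    print([hex(f) for f in prepare_flags([0x0, 0x410], force_unmark=True)])
```

Clearing only the 0x400 bit leaves the other flag bits, such as strand and pairing information, untouched.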

Handling Overlaps

Use Clip Overlap instead.