Difference between revisions of "BamUtil: dedup"

From Genome Analysis Wiki

Revision as of 15:05, 3 April 2012


SuperDeDuper

This program will be re-released soon (April, 2012).

This program will read a BAM file, determine duplicate alignments, and either mark or remove the lower quality duplicates. In addition, it may modify paired-end reads where the ends overlap by soft clipping the end with the lower quality bases in the region of overlap. A few additional features are coming soon.

SuperDeDuper is a standalone application that may be used in the following way. Please note that the records in the input BAM file are assumed to be sorted by coordinate. The various options are explained below.

Usage: SuperDeDuper (options) --in=<InputBamFile> --out=<OutputBamFile>
Required parameters :
  -i/--in [infile]:   input BAM file name (must be sorted)
  -o/--out [outfile]: output BAM file name (same order as the input file)
Optional parameters:  (see SAM format specification for details)
  -l/--log [logfile]: log and summary statistics (default: [outfile].log)
  -r/--rm:            Remove duplicates (default is to mark duplicates)
  -f/--force-unmark:  Accept an input BAM file with duplicates already marked and unmark them first
                      (default is to stop with an error when the input contains marked duplicates)
  -c/--clip:          Soft clip lower quality segment of overlapping paired end reads
  -s/--swap:          Soft clip as with --clip, and also swap higher quality bases
                      from the clipped end into the overlap of the unclipped end
  -v/--verbose:       Turn on verbose mode

Handling Duplicates

SuperDeDuper reads all the alignments in a coordinate-sorted BAM file looking for duplicates. Two single-end reads are considered to be duplicates if they share the same referenceID (chromosome), orientation, library, and unclipped coordinate (left-most for forward strands and right-most for reverse strands). Two paired-end reads are considered to be duplicates if corresponding ends in the two reads are duplicates when viewed as single-end reads.
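
The duplicate criteria for single-end reads can be illustrated with a small Python sketch. This is not SuperDeDuper's own code: the names `DupKey`, `unclipped_coordinate`, and `dup_key` are hypothetical, positions are taken to be 0-based, and extending the coordinate across both soft and hard clips is an assumption about what "unclipped" means here. The library name would typically come from the read group of the alignment.

```python
import re
from typing import NamedTuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference
CLIPS = set("SH")             # soft and hard clips


class DupKey(NamedTuple):
    ref_id: int
    reverse: bool
    library: str
    unclipped_pos: int


def unclipped_coordinate(pos: int, cigar: str, reverse: bool) -> int:
    """Left-most unclipped position for forward reads, right-most for reverse reads.

    `pos` is the 0-based left-most aligned reference position (an assumption).
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not reverse:
        leading = 0
        for n, op in ops:
            if op not in CLIPS:
                break
            leading += n
        return pos - leading
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trailing = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        trailing += n
    return pos + ref_len - 1 + trailing


def dup_key(ref_id: int, pos: int, cigar: str, reverse: bool, library: str) -> DupKey:
    """Two single-end reads with equal keys would be treated as duplicates."""
    return DupKey(ref_id, reverse, library, unclipped_coordinate(pos, cigar, reverse))
```

Under these assumptions, a forward read at position 100 with CIGAR 5S20M and a forward read at position 95 with CIGAR 25M from the same library produce the same key, because both have an unclipped start of 95.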

When duplicates are detected, the read with the highest base quality is kept, and the others are either marked as duplicates in the output file (the default behavior) or removed from the output file (if the -r option is used). Duplicates are marked by setting the appropriate bit in the alignment's flag.
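
As a rough sketch of the mark-or-remove step, the Python below processes one group of reads that share a duplicate key. The record layout (dictionaries with `flag` and `quals`) and the function name `mark_or_remove` are hypothetical. Scoring a read by the sum of its base qualities is an assumption, since the text does not say how "highest base quality" is computed. The bit 0x400 is the duplicate bit defined by the SAM format specification.

```python
FLAG_DUPLICATE = 0x400  # SAM flag bit for "PCR or optical duplicate"


def mark_or_remove(group, remove=False):
    """Keep the best read of a duplicate group; mark or drop the rest.

    `group` is a list of dicts with an integer 'flag' and a list of Phred
    scores 'quals'. The sum of qualities is an assumed scoring rule.
    """
    best = max(group, key=lambda rec: sum(rec["quals"]))
    output = []
    for rec in group:
        if rec is best:
            output.append(rec)
        elif not remove:
            # Copy the record so the caller's input is left unchanged.
            output.append({**rec, "flag": rec["flag"] | FLAG_DUPLICATE})
    return output
```

With `remove=False` every record is returned in its original order and only the flags of the lower quality reads change, which mirrors the default behavior; with `remove=True` only the best read is returned, as with the -r option.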

SuperDeDuper assumes that duplicates in the input BAM file are not marked. When SuperDeDuper detects a marked duplicate in the input BAM file, it will throw an error and stop. To override this behavior, use the -f option; in this mode, alignments that are marked as duplicates in the input file are unmarked before SuperDeDuper begins its detection algorithm. The result is that only duplicates detected by SuperDeDuper will be marked in or removed from the output file.
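
The handling of already-marked input can be sketched as a check on each alignment's flag. The function name `prepare_flag` and the error message are hypothetical; only the behavior (stop with an error by default, clear the duplicate bit when -f is given) comes from the description above.

```python
FLAG_DUPLICATE = 0x400  # SAM flag bit for "PCR or optical duplicate"


def prepare_flag(flag: int, force_unmark: bool = False) -> int:
    """Return the flag to use for duplicate detection.

    Raises ValueError for an already-marked alignment unless force_unmark
    is set, in which case the duplicate bit is cleared.
    """
    if flag & FLAG_DUPLICATE:
        if not force_unmark:
            raise ValueError("input contains marked duplicates; use -f to unmark them")
        return flag & ~FLAG_DUPLICATE
    return flag
```

All other flag bits pass through untouched, so a reverse-strand alignment that was marked as a duplicate remains a reverse-strand alignment after unmarking.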

Handling Overlaps

If the -c option is used, SuperDeDuper looks for paired-end reads in which the two ends overlap, as shown below.

Read1:  A C T G A A C C T T G G A A A C T G C C
Read2:                C T T G G A A A C T G C C G G G G A C T

For each end, the average base quality is computed over the bases in the region of overlap. (The two ends may have a different number of bases there due to insertions and deletions.) The end with the lower average base quality is then soft clipped in the region of overlap. For example, suppose that the CIGAR strings for the two reads above are both 20M and that Read1 has the lower average base quality in the overlap. The overlap covers the last 13 bases of Read1, so the CIGAR string for Read1 is replaced by 7M13S.
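
The 7M13S example can be reproduced with a short Python sketch. It makes several simplifying assumptions that are not from the tool itself: both ends are fully matched (CIGAR strings of the form nM, no indels), Read1 starts strictly to the left of Read2 and Read2 extends past the end of Read1, and a tie in average quality clips Read2 (an arbitrary choice, since the text does not specify tie handling). The function name `clip_overlap` is hypothetical.

```python
def clip_overlap(start1, len1, quals1, start2, len2, quals2):
    """Return the (cigar1, cigar2) pair after soft clipping the weaker end.

    Assumes fully matched reads with a partial overlap: start1 < start2 and
    Read2 ends to the right of Read1. quals1 and quals2 are lists of Phred scores.
    """
    overlap = start1 + len1 - start2
    if overlap <= 0:
        return f"{len1}M", f"{len2}M"  # the ends do not overlap

    mean1 = sum(quals1[len1 - overlap:]) / overlap  # tail of Read1
    mean2 = sum(quals2[:overlap]) / overlap         # head of Read2

    if mean1 < mean2:
        return f"{len1 - overlap}M{overlap}S", f"{len2}M"
    # Clipping the head of Read2 would also require shifting its start
    # position by `overlap` in a real BAM record; that is not modeled here.
    return f"{len1}M", f"{overlap}S{len2 - overlap}M"
```

For the reads shown above, Read1 starts at offset 0 and Read2 at offset 7, so the overlap is 0 + 20 - 7 = 13 bases. If Read1 has the lower average quality there, the sketch returns 7M13S for Read1 and leaves Read2 at 20M.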

If the -s option is used, the same clipping is performed, with one additional step. Although one end has a lower average base quality in the region of overlap, it may contain individual bases with a higher quality than the corresponding bases in the other end. In this case, those higher quality bases are swapped into the end with the higher average base quality in the region of overlap. This potentially modifies the sequence, the CIGAR string, and the base quality string for the end with the higher average base quality.
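
The swap step can be sketched per base as follows. This is only an illustration under strong assumptions: the two inputs cover exactly the region of overlap and are aligned base for base (no insertions or deletions), so the CIGAR string never needs to change. The function name `swap_into_kept` is hypothetical.

```python
def swap_into_kept(kept_seq, kept_quals, clipped_seq, clipped_quals):
    """Swap higher quality bases from the clipped end into the kept end.

    All four inputs describe only the region of overlap and must have the
    same length. Returns the new (sequence, qualities) for the kept end.
    """
    seq, quals = list(kept_seq), list(kept_quals)
    for i, (base, qual) in enumerate(zip(clipped_seq, clipped_quals)):
        if qual > quals[i]:
            # The clipped end is more confident at this position.
            seq[i], quals[i] = base, qual
    return "".join(seq), quals
```

For instance, a kept overlap of ACGT with qualities 30, 30, 5, 30 and a clipped overlap of ACTT with qualities 10, 10, 20, 10 differ in confidence only at the third base, so the sketch yields ACTT with qualities 30, 30, 20, 30.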