BamUtil: dedup

From Genome Analysis Wiki
Jump to navigationJump to search


Overview of the dedup function of bamUtil

The dedup option of bamUtil determines duplicates in a coordinate sorted SAM/BAM file. It either marks or removes the lower quality duplicates.

This tool also contains the option to perform recalibration.

NOTE: This tool does not properly work on templates that have more than 2 segments. It does not properly match reads when more than 2 reads have the same read name.

NOTE: Dedup cannot read from stdin since it reads the input file twice.

Potential future features:

Handling Duplicates

The deduper reads all the alignments in a coordinate-sorted SAM/BAM looking for duplicates, failing if the file is not coordinate-sorted.

The deduper assumes that duplicates in the input BAM file are not marked. When the deduper detects a marked duplicate in the input BAM file, it will throw an error and stop. To override this behavior, use the --force option; in this mode, alignments that are marked as duplicates in the input file are unmarked before the deduper begins its detection algorithm. The result is that only duplicates detected by the deduper will be marked in or removed from the output file.

The handling of paired-end reads assumes that the mate information in the SAM/BAM records is accurate. If a mate is not found at the expected position, an error message is printed (once per file) indicating this error. Paired-end reads whose mate cannot be found are not marked duplicate and are not used for duplicate marking of other paired-end reads. Single-end reads with the same key as paired-end reads whose mate cannot be found are still marked as duplicate. If this error is encountered, you may want to fix the mate information and reprocess the file through the deduper.

With the default settings this tool should produce similar results as Picard.

Use the --oneChrom option to treat reads with a mate on a different chromosome as single-ended. This option is useful if you are running the deduper on just a single chromosome. The code will use less memory with this option if mates are found on different chromosomes. (Picard does not specially handle mates on different chromosomes, so the --oneChrom option may produce different results than Picard.)


Implementation Notes

Duplicates are determined by checking for matching keys.

The Key is comprised of:

  1. Chromosome
  2. Orientation (forward/reverse)
  3. Unclipped Start(forward)/End(reverse)
  4. Library

Rules:

  • Skip Unmapped Reads, they are not marked as duplicate
  • Reads whose mate is unmapped are treated as single-end
  • Mark a Single-End Read Duplicate (or remove it if configured to do so) if:
    1. A paired-end record has the same key (even if the pair is not proper/the mate is not found)
      -OR-
    2. A single-end record has the same key and a higher base quality sum (sum of all base qualities in the record above --minBaseQual)
  • Mark both Paired-End Reads Duplicate if:
  1. Another paired-end pair has the same set of keys and has a higher base quality sum (sum of all base qualities in the record above --minBaseQual)

This code assumes that at most 1000 bases are clipped at the start of a read.


Deduping requires two passes through the file, so cannot read from stdin.

Handling Recalibration

See BamUtil: recab for recalibration details.

Recalibration parameters can be applied to deduping when --recab is specified.

How to use it

When dedup is invoked without any arguments the usage information is displayed as described below under Usage.

The input SAM/BAM file is required, input File (--in), and must be sorted by coordinate.

The output SAM/BAM file is also required, output File (--out).


Recommended usage with Recalibrator:

/usr/cluster/bin/bam dedup --recab --in ${INPUT}.bam --out ${OUTPUT}.bam --force --refFile ${REF} --dbsnp ${DBSNP} --oneChrom --storeQualTag OQ --maxBaseQual 40


Recommended usage without Recalibrator:

/usr/cluster/bin/bam dedup --in ${INPUT}.bam --out ${OUTPUT}.bam --force --oneChrom

Usage

./bam dedup --in <InputBamFile> --out <OutputBamFile> [--minQual <minPhred>] [--log <logFile>] [--oneChrom] [--rmDups] [--force] [--excludeFlags <flag>] [--verbose] [--noeof] [--params] [--recab]

Additional Recalibration Usage is documented at BamUtil: recab -> Usage

Parameters

Required parameters :
	--in <infile>   : Input BAM file name (must be sorted)
	--out <outfile> : Output BAM file name (same order with original file)
Optional parameters : 
	--minQual <int> : Only add scores over this phred quality when determining a read's quality (default: 15)
	--log <logfile> : Log and summary statistics (default: [outfile].log, or stderr if --out starts with '-')
	--oneChrom      : Treat reads with mates on different chromosomes as single-ended.
	--rmDups        : Remove duplicates (default is to mark duplicates)
	--force         : Allow an already mark-duplicated BAM file, unmarking any previously marked 
	                  duplicates and apply this duplicate marking logic.  Default is to throw errors
	                  and exit when trying to run on an already mark-duplicated BAM
	--excludeFlags <flag>    : exclude reads with any of these flags set when determining or marking duplicates
	                           by default (0xB04): exclude unmapped, secondary reads, QC failures, and supplementary reads
	--verbose       : Turn on verbose mode
	--noeof         : Do not expect an EOF block on a bam file.
	--params        : Print the parameter settings
	--recab         : Recalibrate in addition to deduping
	PhoneHome:
		--noPhoneHome       : disable PhoneHome (default enabled)
		--phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)

Additional Recalibration Parameters are documented at BamUtil: recab -> Parameters

Required Parameters

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user.

Note: This tool does not support input from stdin.

SAM/BAM/Uncompressed BAM from file --in yourFileName


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Note: The input file must be sorted by coordinate.

Output File (--out)

Use --out followed by your file name to specify the SAM/BAM output file.

The file extension is used to determine whether to write SAM/BAM/uncompressed BAM. A - is used to indicate stdout and the extension for file type (no extension is SAM).

SAM to file --out yourFileName.sam
BAM to file --out yourFileName.bam
Uncompressed BAM to file --out yourFileName.ubam
SAM to stdout --out -
BAM to stdout --out -.bam
Uncompressed BAM to stdout --out -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Optional Parameters

Minimum Quality for Quality Calculations (--minQual)

When duplicate reads are encountered, the read with the highest quality is kept.

To determine the quality of a read, all of the phred base quality scores above the --minQual value are added together. If --minQual is not specified, it is defaulted to 15.

Output log & Summary Statistics FileName (--log)

Output file name for writing logs & summary statistics.

If this parameter is not specified, it will write to the output file specified in --out + ".log". Or if the output bam is written to stdout (--out starts with '-'), the logs will be written to stderr. If the filename after --log starts with '-' it will write to stderr.

Treat Reads with Mates On Different Chromosomes As Single-Ended (--oneChrom)

If a read's mate is not found it will not be used for duplicate marking. If you are running on a single chromosome, all read's whose mates are on different chromosomes will not be used for duplicate marking. The --oneChrom option will treat reads with mates on a different chromosome as single-ended.

Remove Duplicates (--rmDups)

Instead of marking a read as duplicate in the flag, the --rmDups option will remove it from the output BAM file.

Ignore Previous Duplicate Marking (--force)

By default the deduper will throw an error and stop if a read is already marked as duplicate. The --force option will removes any previous duplicate marking and marks the reads from scratch. The resulting output file will only have reads determined by the deduper marked as duplicates.

Skip Records with any of the Specified Flags (--excludeFlags)

Skip records with any of the specified flags set, default 0xB04

By default skips reads with any of the following flags set:

  • unmapped
  • secondary alignment
  • fails QC checks
  • supplementary reads

Secondary (0x100) and Supplementary (0x800) reads currently must be excluded.

Turn on Verbose Mode (--verbose)

Turn on verbose logging to get more log messages in the log and to stderr.

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing eof block in your bgzf file.

By default, the trailing empty block is expected and checked for.

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

Recalibrate (--recab)

This option will recalibrate the input file in addition to deduping.

See BamUtil: recab for recalibration details.

PhoneHome Parameters

See PhoneHome for more information on how PhoneHome works and what it does.

Turn off PhoneHome (--noPhoneHome)

Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.

Adjust the Frequency of PhoneHome (--phoneHomeThinning)

Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).

  • By default, --phoneHomeThinning is set to 50, running 50% of the time.
  • PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
  • N/A if --noPhoneHome is set.

Return Value

Returns -1 if input parameters are invalid.

Returns the SamStatus for the reads/writes (0 on success, non-0 on failure).