BamUtil: recab

From Genome Analysis Wiki
Revision as of 10:34, 15 June 2012 by Mktrost


COMING SOON, June, 2012

Overview of the recab function of bamUtil

The recab option of bamUtil recalibrates a SAM/BAM file.

Handling Recalibration

Reads Not Recalibrated:

  • Duplicates
  • Unmapped
  • Mapping Quality = 0
  • Mapping Quality = 255
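
The exclusion list above can be sketched as a small predicate. This is only an illustration, not bamUtil's code: the function name is hypothetical, and the flag bits are the standard SAM values for "unmapped" (0x4) and "duplicate" (0x400).

```python
# SAM flag bits, as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_DUPLICATE = 0x400


def is_recalibrated(flag: int, mapq: int) -> bool:
    """Return True if a read with this flag and mapping quality would be recalibrated.

    Hypothetical helper that mirrors the exclusion list above.
    """
    if flag & FLAG_DUPLICATE:
        return False  # duplicates are skipped
    if flag & FLAG_UNMAPPED:
        return False  # unmapped reads are skipped
    if mapq in (0, 255):
        return False  # mapping quality 0 or 255 is skipped
    return True
```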


Covariates Notes

Duplicates are determined by checking for matching keys.

The key consists of:

  1. Chromosome
  2. Orientation (forward/reverse)
  3. Unclipped Start(forward)/End(reverse)
  4. Library

Rules:

  • Skip unmapped reads; they are not marked as duplicates.
  • Mark a single-end read as duplicate (or remove it if configured to do so) if:
    1. A paired-end record has the same key (even if the pair is not proper, the mate is unmapped, or the mate is not found)
      -OR-
    2. A single-end record has the same key and a higher base quality sum (sum of all base qualities in the record)
  • Mark both paired-end reads as duplicates if:
    1. Another paired-end pair has the same set of keys and a higher base quality sum.

This code assumes that at most 1000 bases are clipped at the start of a read.
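
As a rough illustration of the key and the single-end rules above, the following sketch works on a simplified record type. The `Read` tuple, its field names, and the tie-breaking choice (the first read seen wins when sums are equal) are assumptions made for the example and are not taken from bamUtil. It expects a list of mapped reads, since unmapped reads are skipped before this step.

```python
from collections import namedtuple

# Hypothetical record type for illustration; the field names are assumptions.
Read = namedtuple("Read", "name chrom reverse unclipped_pos library quals paired")


def dedup_key(read):
    """Build the four-part key: chromosome, orientation, unclipped position, library."""
    return (read.chrom, read.reverse, read.unclipped_pos, read.library)


def base_quality_sum(read):
    """Sum of all base qualities in the record."""
    return sum(read.quals)


def single_end_duplicates(reads):
    """Return the names of single-end reads that would be marked duplicate.

    `reads` must be a list, because it is scanned twice.
    """
    best_single = {}     # key -> single-end read with the highest quality sum
    paired_keys = set()  # keys seen on any paired-end record
    for r in reads:
        k = dedup_key(r)
        if r.paired:
            paired_keys.add(k)
        elif k not in best_single or base_quality_sum(r) > base_quality_sum(best_single[k]):
            best_single[k] = r

    dups = set()
    for r in reads:
        if r.paired:
            continue
        k = dedup_key(r)
        # Rule 1: a paired-end record shares the key.
        # Rule 2: another single-end record with the same key was kept instead.
        if k in paired_keys or best_single[k] is not r:
            dups.add(r.name)
    return dups
```

The paired-end rule would follow the same pattern, with the pair's combined keys and combined base quality sum used for the comparison.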

How to use it

When recab is invoked without any arguments, the usage information is displayed as described below under Usage.

The input SAM/BAM file (--in) is required and must be sorted by coordinate.

The output SAM/BAM file (--out) is also required.

Usage

./bam recab --in <InputBamFile> --out <OutputFile> [--log <logFile>] [--verbose] [--noeof] [--params] --refFile <ReferenceFile> [--dbsnp <dbsnpFile>] [--blended <weight>] 
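
For scripted runs, the usage line above can be assembled as an argument list. The helper below is a hypothetical convenience wrapper, not part of bamUtil, and it uses only options that appear in the usage line.

```python
def recab_command(in_file, out_file, ref_file, dbsnp=None, log=None, verbose=False):
    """Build the argument list for a `bam recab` run (hypothetical helper)."""
    cmd = ["./bam", "recab", "--in", in_file, "--out", out_file, "--refFile", ref_file]
    if dbsnp is not None:
        cmd += ["--dbsnp", dbsnp]
    if log is not None:
        cmd += ["--log", log]
    if verbose:
        cmd.append("--verbose")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, assuming the `bam` executable is in the current directory as in the usage line.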

Parameters

Required General Parameters :
	--in <infile>   : input BAM file name
	--out <outfile> : output recalibration file name
Optional General Parameters : 
	--log <logfile> : log and summary statistics (default: [outfile].log)
	--verbose       : Turn on verbose mode
	--noeof         : do not expect an EOF block on a bam file.
	--params        : print the parameter settings

Recab Specific Required Parameters
	--refFile <reference file>    : reference file name
Recab Specific Optional Parameters : 
	--dbsnp <known variance file> : dbsnp file of positions
	--blended <weight>            : blended model weight

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.

A - indicates that input is read from stdin; the extension following it determines the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file : --in yourFileName
SAM from stdin                     : --in -
BAM from stdin                     : --in -.bam
Uncompressed BAM from stdin        : --in -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Output File (--out)

Use --out followed by your file name to specify the SAM/BAM output file.

The file extension determines whether SAM, BAM, or uncompressed BAM is written. A - indicates stdout, and the extension following it determines the file type (no extension means SAM).

SAM to file                : --out yourFileName.sam
BAM to file                : --out yourFileName.bam
Uncompressed BAM to file   : --out yourFileName.ubam
SAM to stdout              : --out -
BAM to stdout              : --out -.bam
Uncompressed BAM to stdout : --out -.ubam
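
The mapping from an --out value to a file type can be sketched as follows. The function names are hypothetical, and treating every extension other than .bam and .ubam as SAM is an assumption; the page only states that no extension means SAM.

```python
def output_format(out_name: str) -> str:
    """Map an --out value to 'SAM', 'BAM', or 'UBAM' based on its extension."""
    if out_name.endswith(".ubam"):
        return "UBAM"
    if out_name.endswith(".bam"):
        return "BAM"
    # Assumption: anything else, including no extension, is written as SAM.
    return "SAM"


def writes_to_stdout(out_name: str) -> bool:
    """A leading '-' selects stdout instead of a file."""
    return out_name.startswith("-")
```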


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Minimum Base Quality for Recalibration (--minRecabQual)

When recalibrating reads, only positions with a base quality greater than this minimum will be recalibrated. If --minRecabQual is not specified, the default is TBD.
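
A minimal sketch of the threshold described above, assuming the comparison is strictly "greater than" as stated. The function name is hypothetical, and the minimum must be passed explicitly because the default is still TBD.

```python
def positions_to_recalibrate(quals, min_recab_qual):
    """Return the indices of bases whose quality is strictly above the minimum."""
    return [i for i, q in enumerate(quals) if q > min_recab_qual]
```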

Output log & Summary Statistics FileName (--log)

Output file name for writing logs & summary statistics.

If this parameter is not specified, logs are written to the file name given in --out with ".log" appended. If the output BAM is written to stdout (--out starts with '-'), the logs are written to stderr instead. If the file name after --log starts with '-', the logs are also written to stderr.
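
The log destination rules above can be summarized in a few lines. This is a sketch of the stated behavior only; the function name and the "stderr" return string are illustrative assumptions.

```python
def log_destination(out_name, log_name=None):
    """Return the log file name, or the string 'stderr', following the rules above."""
    if log_name is not None:
        # An explicit --log value starting with '-' sends logs to stderr.
        return "stderr" if log_name.startswith("-") else log_name
    if out_name.startswith("-"):
        # Output goes to stdout, so logs go to stderr.
        return "stderr"
    return out_name + ".log"
```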

Treat Reads with Mates On Different Chromosomes As Single-Ended (--oneChrom)

If a read's mate is not found, the read will not be used for duplicate marking. If you are running on a single chromosome, all reads whose mates are on different chromosomes will not be used for duplicate marking. The --oneChrom option treats reads with mates on a different chromosome as single-ended.

Recalibrate (--recab)

This option will recalibrate the input file in addition to deduping.

Remove Duplicates (--rmDups)

Instead of marking a read as duplicate in the flag, the --rmDups option will remove it from the output BAM file.

Ignore Previous Duplicate Marking (--force)

By default, the deduper throws an error and stops if a read is already marked as duplicate. The --force option removes any previous duplicate marking and marks the reads from scratch. The resulting output file will have only the reads determined by the deduper marked as duplicates.

Turn on Verbose Mode (--verbose)

Turn on verbose logging to get more log messages in the log and to stderr.

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing EOF block in your BGZF file.

By default, the trailing empty block is expected and checked for.
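
To check in advance whether a file ends with the trailing empty block, the last 28 bytes can be compared against the BGZF EOF marker defined in the SAM/BAM specification. The helper below is a hypothetical stand-alone check, not the one bamUtil performs internally.

```python
import os

# The 28-byte empty BGZF block that marks end-of-file (from the SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff0600 42430200 1b000300 00000000 00000000"
)


def has_bgzf_eof(path):
    """Return True if the file at `path` ends with the BGZF EOF block."""
    size = os.path.getsize(path)
    if size < len(BGZF_EOF):
        return False
    with open(path, "rb") as f:
        f.seek(-len(BGZF_EOF), os.SEEK_END)
        return f.read() == BGZF_EOF
```

If this check returns False for a file that is otherwise complete, --noeof is the option to use.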

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

Return Value

Returns -1 if input parameters are invalid.

Returns the SamStatus for the reads/writes (0 on success).