Difference between revisions of "BamUtil: clipOverlap"

From Genome Analysis Wiki
Jump to navigationJump to search
(Update output)
(Update default poolSize)
Line 96: Line 96:
 
Clipping By Coordinate Optional Parameters:
 
Clipping By Coordinate Optional Parameters:
 
--poolSize    : Maximum number of records the program is allowed to allocate
 
--poolSize    : Maximum number of records the program is allowed to allocate
                for clipping on Coordinate sorted files. (Default: 5000)
+
                for clipping on Coordinate sorted files. (Default: 1000000)
 
--poolSkipClip : Skip clipping reads to free of usable records when the
 
--poolSkipClip : Skip clipping reads to free of usable records when the
 
                poolSize is hit. The default action is to just clip the
 
                poolSize is hit. The default action is to just clip the
Line 122: Line 122:
 
== Set the SAM/BAMs record buffer size (<code>--poolSize</code>) ==
 
== Set the SAM/BAMs record buffer size (<code>--poolSize</code>) ==
  
To handle coordinate sorted files, SAM/BAM records are buffered until it is known that all following records will have a later start position.  To prevent the program from running away with memory, a limit is set to the number of records that can be buffered (defaults to 5000).
+
To handle coordinate sorted files, SAM/BAM records are buffered until it is known that all following records will have a later start position.  To prevent the program from running away with memory, a limit is set to the number of records that can be buffered (defaults to 1000000).
  
 
If the poolSize is exhausted, the code will write the earliest record awaiting its overlapping mate and any previous records that are being buffered.
 
If the poolSize is exhausted, the code will write the earliest record awaiting its overlapping mate and any previous records that are being buffered.

Revision as of 09:57, 21 November 2011


Overview of the clipOverlap function of bamUtil

The clipOverlap option on the bamUtil executable clips overlapping read pairs.

The input file and resulting output file is sorted by coordinate (or readName is specified in the options).

When a read is clipped from the front:

  • the read start position is updated to reflect the clipping
  • the mate's mate start position is updated to reflect the record's new position.
  • the record is placed in the output file in the correct location based on the updated position.

To handle coordinate sorted files, SAM/BAM records are buffered up until it is known that all following records will have a later start position. To prevent the program from running away with memory, a limit is set to the number of records that can be buffered, see --poolSize for more information.

When two mates overlap, this tool will clip the record's whose clipped region would has the lowest average quality.

It also checks strand. If a forward strand extends past the end of a reverse strand, that will be clipped. Similarly, if a reverse strand starts before the forward strand, the region prior to the forward strand will be clipped. If the reverse strand occurs entirely before the forward strand, both strands will be entirely clipped.


ASSUMPTIONS/RESTRICTIONS

  • Assumes the file is sorted by Coordinate (or ReadName if using --readName option)
  • Assumes only 2 reads have matching ReadNames
    • It matches in pairs, so if there are 3, the first 2 will be matched and compared, but the 3rd won't. If there are 4, the first 2 will be matched and the last 2 will be matched and compared.
  • Only mapped reads will be clipped
  • Mate information in records are accurate

Rules for Clipping

Clipping from the front

The first operation after the softclip will be a Match/Mismatch, meaning that any trailing pads, deletions, insertions, or skips will also be soft clipped.

Clip Location How it is handled
If the clip position falls in a skip/deletion Removes the entire skip/deletion
If the position immediately after the clip is a skip/deletion Also removes the skip/deletion
If the position immediately after the clip is an Insert Softclips the insert
If the position immediately after the clip is a Pad Removes the pad
Clip occurs at the last match/mismatch position of the read (the entire read is clipped) Entire read is soft clipped, 0-based position is left as the original (not modified)
Clip occurs after the read ends Entire read is soft clipped, 0-based position is left as the original (not modified)
Clip occurs before the read starts Nothing is clipped. The read is not changed.

Clipping from the back

Clip Location How it is handled
If the clip position falls in a skip/deletion Removes the entire skip/deletion
If the position immediately before the clip is a deletion/skip/pad Remove the deletion/skip/pad
If the position immediately before the clip is an insertion Leave the insertion, even if it results in a 70M3I27S
Clip occurs at the first position of the read (the entire read is clipped) Entire read is soft clipped, preceding insertions remain, 0-based position is left as the original (not modified)
Clip occurs before the read starts Entire read is soft clipped, 0-based position is left as the original (not modified)
Clip occurs after the read ends Nothing is clipped. The read is not changed.

Usage

./bam clipOverlap --in <inputFile> --out <outputFile> [--storeOrig <tag>] [--readName] [--poolSize <numRecords allowed to allocate>] [--noeof] [--params]

Parameters

	Required Parameters:
		--in : the SAM/BAM file to clip overlaping read pairs for
		--out        : the SAM/BAM file to be written
	Optional Parameters:
		--storeOrig   : Store the original cigar in the specified tag.
		--readName    : Original file is sorted by Read Name instead of coordinate.
		--noeof       : Do not expect an EOF block on a bam file.
		--params      : Print the parameter settings
	Clipping By Coordinate Optional Parameters:
		--poolSize    : Maximum number of records the program is allowed to allocate
		                for clipping on Coordinate sorted files. (Default: 1000000)
		--poolSkipClip : Skip clipping reads to free of usable records when the
		                 poolSize is hit. The default action is to just clip the
		                 first read in a pair to free up the record.


Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.

A - is used to indicate to read from stdin and the extension is used to determine the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file --in yourFileName
SAM from stdin --in -
BAM from stdin --in -.bam
Uncompressed BAM from stdin --in -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Output File (--out)

Use --out followed by your file name to specify the SAM/BAM output file.

The file extension is used to determine whether to write SAM/BAM/uncompressed BAM. A - is used to indicate stdout and the extension for file type (no extension is SAM).

SAM to file --out yourFileName.sam
BAM to file --out yourFileName.bam
Uncompressed BAM to file --out yourFileName.ubam
SAM to stdout --out -
BAM to stdout --out -.bam
Uncompressed BAM to stdout --out -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing eof block in your bgzf file.

By default, the trailing empty block is expected and checked for.

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

Store the original cigar string in a tag (--storeOrig)

Use --storeOrig followed by the two character TAG to store the original CIGAR.

It will be stored with the specified tag as a "Z" tag type.


Work on SAM/BAMs sorted by Read Name instead of by coordinate (--readName)

If your file is sorted by read name rather than by coordinate, specify --readName. The resulting file will still be sorted by read name.


Set the SAM/BAMs record buffer size (--poolSize)

To handle coordinate sorted files, SAM/BAM records are buffered until it is known that all following records will have a later start position. To prevent the program from running away with memory, a limit is set to the number of records that can be buffered (defaults to 1000000).

If the poolSize is exhausted, the code will write the earliest record awaiting its overlapping mate and any previous records that are being buffered.

Depending on whether or not --poolSkipClip is set, it will either, clip the end of the read at the position where the mate is supposed to start or it will not clip either read. An error message is written to stderr to indicate that one of these has happened and an unsuccessful return value is returned (2: NO_MORE_RECS).

The resulting file will still be sorted by coordinate.


Skip Clipping Coordinate Sorted Files When Out of Records (--poolSkipClip)

When clipping coordinate sorted SAM/BAM files, we can run out of buffers available in the pool (--poolSize).

By default when we run out of pooled records, we can no longer read in new records, so instead we release some of the stored records. We do this by dropping the first record that is being held awaiting its mate.

This record can either be:

  • Clipped starting at its mate's start position until the end of the read (DEFAULT)
  • Left as is with no clipping, leaving the mates mates overlapping (specify --poolSkipClip)

With either option, the resulting file will still be sorted by coordinate.


Return Value

Returns -1 if input parameters are invalid.

Returns the SamStatus for the reads/writes.

Returns SamStatus::NO_MORE_RECS, 2, if it was clipping files sorted by coordinate and it ran out of records in the pool so had to clip based on the --poolSkipClip setting.


Output

All status messages are written to stderr.

Situation Sorted Type Output Message
Everything ran successfully ReadName/Coordinate
Completed ClipOverlap Successfully.
Failed to allocate any records ReadName/Coordinate
Failed to allocate any records.
Failed to complete ClipOverlap.
Error writing a record ReadName
Failed to complete ClipOverlap.
Expected pair to overlap, but 2nd read was not found in the specified position (may be combined with output for running out of pooled records) Coordinate
Failed to find expected overlapping mates for XX records.
Completed ClipOverlap Successfully.
Ran out of pooled Records with --poolSkipClip setting Coordinate
Due to hitting the max record poolSize, had to skip clipping XX records.
Completed ClipOverlap.
Ran out of pooled Records without --poolSkipClip setting Coordinate
Due to hitting the max record poolSize, had to default clip XX records.
Completed ClipOverlap.
Any other error Coordinate
Failed to complete ClipOverlap.


Example Output

Failed to find expected overlapping mates for 2 records.
Due to hitting the max record poolSize, had to default clip 9 records.
Completed ClipOverlap.