BamUtil: clipOverlap
Revision as of 12:04, 15 November 2011
Overview of the clipOverlap function of bamUtil
The clipOverlap option on the bamUtil executable clips overlapping read pairs.
The input file and the resulting output file are sorted by coordinate (or by read name if --readName is specified in the options).
When a read is clipped from the front:
- the read start position is updated to reflect the clipping
- the mate's mate start position is updated to reflect the record's new position.
- the record is placed in the output file in the correct location based on the updated position.
To handle coordinate sorted files, SAM/BAM records are buffered until it is known that all following records will have a later start position. To prevent the program from running away with memory, a limit is set on the number of records that can be buffered; see --poolSize for more information.
ASSUMPTIONS/RESTRICTIONS
- Assumes the file is sorted by Coordinate (or ReadName if using the --readName option)
- Assumes only 2 reads have matching ReadNames
- It matches in pairs, so if there are 3, the first 2 will be matched and compared, but the 3rd won't. If there are 4, the first 2 will be matched and compared, and the last 2 will be matched and compared.
- Only mapped reads will be clipped
- Mate information in records is accurate
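The pairing assumption can be illustrated with a minimal Python sketch. It assumes read-name sorted input, represented here only by a list of read names; the function name pair_by_name is hypothetical and is not part of bamUtil.

```python
from itertools import groupby


def pair_by_name(read_names):
    """Pair up consecutive records that share a read name.

    Illustrative model only (not bamUtil code): returns (pairs, unpaired),
    where pairs holds (index, index) tuples into read_names and unpaired
    holds the indices of records left without a partner.
    """
    pairs, unpaired = [], []
    for _, group in groupby(enumerate(read_names), key=lambda item: item[1]):
        idx = [i for i, _ in group]
        # Walk the run of equal names two at a time.
        for k in range(0, len(idx) - 1, 2):
            pairs.append((idx[k], idx[k + 1]))
        # An odd record at the end of the run has no partner.
        if len(idx) % 2 == 1:
            unpaired.append(idx[-1])
    return pairs, unpaired


pairs, unpaired = pair_by_name(["a", "a", "a", "b", "b", "b", "b"])
print(pairs)     # [(0, 1), (3, 4), (5, 6)]
print(unpaired)  # [2]
```

With three records named "a", only the first two are paired; with four records named "b", the first two and the last two form separate pairs.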
Rules for Clipping
Clipping from the front
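The first operation after the softclip will be a Match/Mismatch, meaning that any pads, deletions, insertions, or skips that immediately follow the clipped region are also clipped: insertions are soft clipped, while pads, deletions, and skips are removed.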
Clip Location | How it is handled |
---|---|
If the clip position falls in a skip/deletion | Removes the entire skip/deletion |
If the position immediately after the clip is a skip/deletion | Also removes the skip/deletion |
If the position immediately after the clip is an Insert | Softclips the insert |
If the position immediately after the clip is a Pad | Removes the pad |
Clip occurs at the last match/mismatch position of the read (the entire read is clipped) | Entire read is soft clipped, 0-based position is left as the original (not modified) |
Clip occurs after the read ends | Entire read is soft clipped, 0-based position is left as the original (not modified) |
Clip occurs before the read starts | Nothing is clipped. The read is not changed. |
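The front-clipping rules in the table above can be sketched in Python as follows. This is an illustrative model written from the table, not the bamUtil implementation: the function name clip_front is hypothetical, clip_pos is assumed to be the last 0-based reference position to clip (inclusive), and the read is assumed to be mapped with a valid CIGAR.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def format_cigar(ops):
    return "".join(f"{n}{op}" for n, op in ops)


def clip_front(pos, cigar, clip_pos):
    """Soft clip a read from the front through clip_pos (inclusive).

    Returns (new_pos, new_cigar). Sketch only; assumes a mapped read.
    """
    if clip_pos < pos:
        return pos, cigar  # clip ends before the read starts: unchanged
    ops = parse_cigar(cigar)
    hard, clipped, ref = [], 0, pos
    for i, (n, op) in enumerate(ops):
        if op in "M=X" and ref + n - 1 > clip_pos:
            # First match/mismatch that extends past the clip: it survives.
            cut = max(0, clip_pos - ref + 1)
            new_ops = list(hard)
            if clipped + cut > 0:
                new_ops.append((clipped + cut, "S"))
            new_ops.append((n - cut, op))
            new_ops.extend(ops[i + 1:])
            return ref + cut, format_cigar(new_ops)
        if op in "M=X":      # fully clipped match/mismatch
            clipped += n
            ref += n
        elif op in "IS":     # insertions and existing soft clips are soft clipped
            clipped += n
        elif op in "DN":     # deletions and skips are removed
            ref += n
        elif op == "H":      # leading hard clip stays in front
            hard.append((n, op))
        # pads ("P") are simply dropped
    # No match/mismatch survives: soft clip the whole read, keep the position.
    query_len = sum(n for n, op in ops if op in "MIS=X")
    lead = [ops[0]] if ops and ops[0][1] == "H" else []
    trail = [ops[-1]] if len(ops) > 1 and ops[-1][1] == "H" else []
    return pos, format_cigar(lead + [(query_len, "S")] + trail)


print(clip_front(100, "5M3D5M", 105))  # (108, '5S5M')
print(clip_front(100, "5M2I5M", 104))  # (105, '7S5M')
print(clip_front(100, "10M", 109))     # (100, '10S')
```

The returned position is the updated read start, which is what the mate's mate start position would then be set to.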
Clipping from the back
Clip Location | How it is handled |
---|---|
If the clip position falls in a skip/deletion | Removes the entire skip/deletion |
If the position immediately before the clip is a deletion/skip/pad | Remove the deletion/skip/pad |
If the position immediately before the clip is an insertion | Leave the insertion, even if it results in a 70M3I27S |
Clip occurs at the first position of the read (the entire read is clipped) | Entire read is soft clipped, preceding insertions remain, 0-based position is left as the original (not modified) |
Clip occurs before the read starts | Entire read is soft clipped, 0-based position is left as the original (not modified) |
Clip occurs after the read ends | Nothing is clipped. The read is not changed. |
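The back-clipping rules in the table above can be sketched in Python in the same spirit. This is an illustrative model written from the table, not the bamUtil implementation: the function name clip_back is hypothetical, clip_pos is assumed to be the first 0-based reference position to clip (inclusive), and the read is assumed to be mapped with a valid CIGAR.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def format_cigar(ops):
    return "".join(f"{n}{op}" for n, op in ops)


def clip_back(pos, cigar, clip_pos):
    """Soft clip a read from clip_pos (inclusive) through its end.

    Returns (pos, new_cigar); the start position never changes.
    """
    ops = parse_cigar(cigar)
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    if clip_pos >= pos + ref_len:
        return pos, cigar  # clip starts after the read ends: unchanged
    kept, trailing_hard = [], []
    clipped, ref, clipping = 0, pos, False
    for n, op in ops:
        if clipping:
            if op in "MIS=X":
                clipped += n               # query bases become soft clip
            elif op == "H":
                trailing_hard.append((n, op))
            continue                       # D, N, and P are dropped
        if op in "M=X" and ref + n > clip_pos:
            keep = max(0, clip_pos - ref)  # bases before the clip survive
            if keep:
                kept.append((keep, op))
            clipped += n - keep
            clipping = True
        elif op in "DN" and ref + n > clip_pos:
            clipping = True                # clip falls in a skip/deletion
        else:
            kept.append((n, op))
            if op in "M=XDN":
                ref += n
    # Remove a deletion/skip/pad left immediately before the clip;
    # an insertion in that position is kept.
    while kept and kept[-1][1] in "DNP":
        kept.pop()
    # Merge with a leading soft clip when the whole read was clipped.
    if kept and kept[-1][1] == "S":
        clipped += kept.pop()[0]
    soft = [(clipped, "S")] if clipped else []
    return pos, format_cigar(kept + soft + trailing_hard)


print(clip_back(100, "70M3I27M", 170))  # (100, '70M3I27S')
print(clip_back(100, "70M3D27M", 173))  # (100, '70M27S')
```

The first call reproduces the 70M3I27S case from the table: the insertion immediately before the clip is left in place.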
Usage
./bam clipOverlap --in <inputFile> --out <outputFile> [--storeOrig <tag>] [--readName] [--poolSize <numRecords allowed to allocate>] [--noeof] [--params]
Parameters
Required Parameters:
    --in        : the SAM/BAM file to clip overlapping read pairs for
    --out       : the SAM/BAM file to be written
Optional Parameters:
    --storeOrig : Store the original cigar in the specified tag.
    --readName  : Original file is sorted by Read Name instead of coordinate.
    --poolSize  : Maximum number of records the program is allowed to allocate for clipping on Coordinate sorted files. (Default: 500)
    --noeof     : Do not expect an EOF block on a bam file.
    --params    : Print the parameter settings
Input File (--in)
Use --in followed by your file name to specify the SAM/BAM input file.
The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.
A - indicates that input is read from stdin, and the extension determines the file type (no extension indicates SAM).
SAM/BAM/Uncompressed BAM from file | --in yourFileName |
SAM from stdin | --in - |
BAM from stdin | --in -.bam |
Uncompressed BAM from stdin | --in -.ubam |
Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
Output File (--out)
Use --out followed by your file name to specify the SAM/BAM output file.
The file extension is used to determine whether to write SAM/BAM/uncompressed BAM. A - indicates stdout, and the extension determines the file type (no extension indicates SAM).
SAM to file | --out yourFileName.sam |
BAM to file | --out yourFileName.bam |
Uncompressed BAM to file | --out yourFileName.ubam |
SAM to stdout | --out - |
BAM to stdout | --out -.bam |
Uncompressed BAM to stdout | --out -.ubam |
Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
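The extension convention for --out can be sketched as a small Python lookup. This is only an illustration of the naming rules listed above, not bamUtil code: the function name output_type is hypothetical, treating a file name without any extension as SAM is an assumption carried over from the stdout case, and rejecting unrecognized extensions is this sketch's own choice.

```python
import os

# Extensions named in the tables above; "" covers "-" (SAM to stdout).
TYPES = {"": "SAM", ".sam": "SAM", ".bam": "BAM", ".ubam": "uncompressed BAM"}


def output_type(name):
    """Return the file type implied by an --out argument (sketch only)."""
    _, ext = os.path.splitext(name)
    if ext not in TYPES:
        # Assumption: the source does not say what happens here.
        raise ValueError(f"unrecognized extension: {ext!r}")
    return TYPES[ext]


for name in ["yourFileName.sam", "yourFileName.bam", "-", "-.bam", "-.ubam"]:
    print(name, "->", output_type(name))
```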
Do not require BGZF EOF block (--noeof)
Use --noeof if you do not expect a trailing EOF block in your BGZF file.
By default, the trailing empty block is expected and checked for.
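To see what the trailing empty block looks like, the following Python sketch checks whether a stream ends with the 28-byte BGZF EOF marker defined in the SAM/BAM specification. It only illustrates the idea of the check; the function name has_bgzf_eof is hypothetical and this is not how bamUtil itself performs it.

```python
import gzip
import io

# The 28-byte empty BGZF block from the SAM/BAM specification.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)


def has_bgzf_eof(stream):
    """Return True if a seekable binary stream ends with the EOF block."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    if size < len(BGZF_EOF):
        return False
    stream.seek(size - len(BGZF_EOF))
    return stream.read(len(BGZF_EOF)) == BGZF_EOF


# The marker is a valid gzip member that decompresses to nothing.
print(gzip.decompress(BGZF_EOF))                       # b''
print(has_bgzf_eof(io.BytesIO(b"data" + BGZF_EOF)))    # True
print(has_bgzf_eof(io.BytesIO(b"data")))               # False
```

For a file on disk, the same function can be called on a handle opened with open(path, "rb").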
Print the Program Parameters (--params)
Use --params to print the parameters for your program to stderr.
Store the original cigar string in a tag (--storeOrig)
Use --storeOrig followed by the two character TAG to store the original CIGAR.
It will be stored with the specified tag as a "Z" tag type.
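As an illustration of what a "Z" type tag looks like in a SAM record, the Python sketch below formats the optional field in the TAG:Z:VALUE layout defined by the SAM specification. The function name original_cigar_field and the tag XC are hypothetical examples; the tag you pass to --storeOrig is your own choice.

```python
import re

# SAM optional-field tags are two characters: a letter, then a letter or digit.
TAG_RE = re.compile(r"^[A-Za-z][A-Za-z0-9]$")


def original_cigar_field(tag, cigar):
    """Format the original CIGAR as a SAM optional field of type Z (string)."""
    if not TAG_RE.match(tag):
        raise ValueError(f"not a valid two character SAM tag: {tag!r}")
    return f"{tag}:Z:{cigar}"


# "XC" is an arbitrary example tag, not a bamUtil default.
print(original_cigar_field("XC", "100M"))  # XC:Z:100M
```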
Work on SAM/BAMs sorted by Read Name instead of by coordinate (--readName)
If your file is sorted by read name rather than by coordinate, specify --readName. The resulting file will still be sorted by read name.
Set the SAM/BAM record buffer size (--poolSize)
To handle coordinate sorted files, SAM/BAM records are buffered until it is known that all following records will have a later start position. To prevent the program from running away with memory, a limit is set to the number of records that can be buffered (defaults to 500).
If the poolSize is exhausted, the code will write the earliest record awaiting its overlapping mate and any previous records that are being buffered. This record and its mate will NOT be clipped since it cannot be held onto any longer. An error message is written to stderr to indicate that this happened.
The resulting file will still be sorted by coordinate, but not all overlapping mates will have been clipped.
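The pool behavior can be pictured with a simplified Python model. This is a sketch under stated assumptions, not the bamUtil algorithm: records are (name, pos, mate_pos) tuples sorted by pos, a record waits in the buffer until its mate is read, clipping is reduced to a "paired" flag, and position changes from clipping are ignored. The function name run_pool is hypothetical.

```python
from collections import deque


def run_pool(records, pool_size):
    """Model of buffering records until their mates arrive.

    Returns (output, gave_up): output is a list of (name, pos, paired)
    in write order; gave_up counts records released unpaired because
    the pool was exhausted.
    """
    buffer = deque()   # buffered records, in input order
    waiting = {}       # read name -> buffered record awaiting its mate
    output, gave_up = [], 0

    def flush():
        # Write records from the front until one is still awaiting a mate.
        while buffer and buffer[0]["state"] != "waiting":
            rec = buffer.popleft()
            output.append((rec["name"], rec["pos"], rec["state"] == "paired"))

    for name, pos, mate_pos in records:
        rec = {"name": name, "pos": pos, "state": "single"}
        first = waiting.pop(name, None)
        if first is not None:
            # Mate found: this is where the overlap would be clipped.
            first["state"] = rec["state"] = "paired"
        elif mate_pos >= pos:
            rec["state"] = "waiting"
            waiting[name] = rec
        buffer.append(rec)
        flush()
        if len(buffer) > pool_size:
            # Pool exhausted: release the earliest waiting record unclipped.
            oldest = buffer[0]
            del waiting[oldest["name"]]
            oldest["state"] = "single"
            gave_up += 1
            flush()

    # End of input: remaining records never saw their mates.
    for rec in buffer:
        rec["state"] = "single" if rec["state"] == "waiting" else rec["state"]
    flush()
    return output, gave_up


reads = [("a", 1, 5), ("b", 2, 3), ("b", 3, 2), ("c", 4, 9), ("a", 5, 1), ("c", 9, 4)]
print(run_pool(reads, pool_size=10)[1])  # 0
print(run_pool(reads, pool_size=2)[1])   # 1
```

With a pool of 2, read "a" at position 1 is released before its mate arrives, so neither "a" record is paired, while the output order still follows the coordinates.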
Return Value
Returns -1 if input parameters are invalid.
Returns the SamStatus for the reads/writes.
Output
The number of records that are expected to overlap with a mate (based on the mate information), but could not be matched up with the mate (based on mate positions & read names) is printed to stderr after the run has completed.
When processing has been completed, "Completed ClipOverlap." is printed to stderr.
Example Output
Failed to find expected overlapping mates for 2 records.
Completed ClipOverlap.