Difference between revisions of "BamUtil: clipOverlap"
(12 intermediate revisions by 2 users not shown) | |||
Line 6: | Line 6: | ||
The <code>clipOverlap</code> option on the [[bamUtil]] executable clips overlapping read pairs. | The <code>clipOverlap</code> option on the [[bamUtil]] executable clips overlapping read pairs. | ||
− | The input file and resulting output file | + | The input file and resulting output file are sorted by coordinate (or readName if specified in the options). |
When a read is clipped from the front: | When a read is clipped from the front: | ||
− | * the read start position is updated to reflect the clipping | + | * the read start position is updated to reflect the clipping. |
* the mate's mate start position is updated to reflect the record's new position. | * the mate's mate start position is updated to reflect the record's new position. | ||
* the record is placed in the output file in the correct location based on the updated position. | * the record is placed in the output file in the correct location based on the updated position. | ||
− | To handle coordinate sorted files, SAM/BAM records are buffered up until it is known that all following records will have a later start position. To prevent the program from running away with memory, a limit is set to the number of records that can be buffered, see [[#Set the SAM/BAMs record buffer size (--poolSize)|<code>--poolSize</code>]] for more information. | + | To handle coordinate-sorted files, SAM/BAM records are buffered up until it is known that all following records will have a later start position. To prevent the program from running away with memory, a limit is set to the number of records that can be buffered, see [[#Set the SAM/BAMs record buffer size (--poolSize)|<code>--poolSize</code>]] for more information. |
− | When two mates overlap, this tool will clip the record's whose clipped region would | + | When two mates overlap, this tool will clip the record's whose clipped region would have the lowest average quality. |
− | It also checks strand. If a forward strand extends past the end of a reverse strand, that will be clipped. Similarly, if a reverse strand starts before the forward strand, the region prior to the forward strand will be clipped. If the reverse strand occurs entirely before the forward strand, both strands will be entirely clipped. | + | It also checks strand. If a forward strand extends past the end of a reverse strand, that will be clipped. Similarly, if a reverse strand starts before the forward strand, the region prior to the forward strand will be clipped. If the reverse strand occurs entirely before the forward strand, both strands will be entirely clipped. If the [[#Mark entirely clipped reads as unmapped (--unmapped)|<code>--unmapped</code>]] option is specified, then rather than clipping an entire read, it will be marked as unmapped. |
+ | |||
+ | The qualities on the two strands remain unchanged even with clipping. | ||
Line 23: | Line 25: | ||
*Assumes the file is sorted by Coordinate (or ReadName if using <code>--readName</code> option) | *Assumes the file is sorted by Coordinate (or ReadName if using <code>--readName</code> option) | ||
− | *Assumes only 2 reads have matching ReadNames | + | *Assumes only 2 reads have matching ReadNames (Supplementary and Secondary reads are ignored/skipped by default so will not cause a problem) |
**It matches in pairs, so if there are 3, the first 2 will be matched and compared, but the 3rd won't. If there are 4, the first 2 will be matched and the last 2 will be matched and compared. | **It matches in pairs, so if there are 3, the first 2 will be matched and compared, but the 3rd won't. If there are 4, the first 2 will be matched and the last 2 will be matched and compared. | ||
*Only mapped reads will be clipped | *Only mapped reads will be clipped | ||
− | * | + | *Assumes that mate information in records are accurate |
= Rules for Clipping = | = Rules for Clipping = | ||
Line 82: | Line 84: | ||
= Usage = | = Usage = | ||
− | ./bam clipOverlap --in <inputFile> --out <outputFile> [--storeOrig <tag>] [--readName] [--poolSize <numRecords allowed to allocate>] [--noeof] [--params] | + | ./bam clipOverlap --in <inputFile> --out <outputFile> [--storeOrig <tag>] [--readName] [--stats] [--overlapsOnly] [--excludeFlags <flag>] [--poolSize <numRecords allowed to allocate>] [--poolSkipOverlap] [--noeof] [--params] |
+ | |||
= Parameters = | = Parameters = | ||
<pre> | <pre> | ||
Required Parameters: | Required Parameters: | ||
− | --in : the SAM/BAM file to clip overlaping read pairs for | + | --in : the SAM/BAM file to clip overlaping read pairs for |
− | --out | + | --out : the SAM/BAM file to be written |
Optional Parameters: | Optional Parameters: | ||
− | --storeOrig | + | --storeOrig : Store the original cigar in the specified tag. |
− | --readName | + | --readName : Original file is sorted by Read Name instead of coordinate. |
− | --stats | + | --stats : Print some statistics on the overlaps. |
− | --noeof | + | --overlapsOnly : Only output overlapping read pairs |
− | --params | + | --excludeFlags : Skip records with any of the specified flags set, default 0xF0C |
+ | --unmapped : Mark records that would be completely clipped as unmapped | ||
+ | --noeof : Do not expect an EOF block on a bam file. | ||
+ | --params : Print the parameter settings to stderr | ||
Clipping By Coordinate Optional Parameters: | Clipping By Coordinate Optional Parameters: | ||
− | --poolSize | + | --poolSize : Maximum number of records the program is allowed to allocate |
− | + | for clipping on Coordinate sorted files. (Default: 1000000) | |
--poolSkipClip : Skip clipping reads to free of usable records when the | --poolSkipClip : Skip clipping reads to free of usable records when the | ||
poolSize is hit. The default action is to just clip the | poolSize is hit. The default action is to just clip the | ||
first read in a pair to free up the record. | first read in a pair to free up the record. | ||
</pre> | </pre> | ||
+ | {{PhoneHomeParamDesc}} | ||
− | + | == Required Parameters== | |
{{inBAMInputFile}} | {{inBAMInputFile}} | ||
{{outBAMOutputFile}} | {{outBAMOutputFile}} | ||
− | |||
− | |||
− | == Store the original cigar string in a tag (<code>--storeOrig</code>) == | + | == Optional Parameters == |
+ | === Store the original cigar string in a tag (<code>--storeOrig</code>) === | ||
Use <code>--storeOrig</code> followed by the two character TAG to store the original CIGAR. | Use <code>--storeOrig</code> followed by the two character TAG to store the original CIGAR. | ||
Line 116: | Line 122: | ||
− | == Work on SAM/BAMs sorted by Read Name instead of by coordinate (<code>--readName</code>) == | + | === Work on SAM/BAMs sorted by Read Name instead of by coordinate (<code>--readName</code>) === |
If your file is sorted by read name rather than by coordinate, specify <code>--readName</code>. The resulting file will still be sorted by read name. | If your file is sorted by read name rather than by coordinate, specify <code>--readName</code>. The resulting file will still be sorted by read name. | ||
− | == Print Overlap Statistics (<code>--stats</code>)== | + | === Print Overlap Statistics (<code>--stats</code>)=== |
Print some basic overlap statistics to stderr. | Print some basic overlap statistics to stderr. | ||
Line 133: | Line 139: | ||
** reads that are only clipped due to orientation are not counted in the other stats | ** reads that are only clipped due to orientation are not counted in the other stats | ||
− | + | ==== Example Output ==== | |
− | === Example Output === | ||
<pre> | <pre> | ||
Overlap Statistics: | Overlap Statistics: | ||
Line 145: | Line 150: | ||
</pre> | </pre> | ||
+ | === Print Only Overlaping Reads (<code>--overlapsOnly</code>)=== | ||
+ | Only output Read Pairs that overlap. Drop all other records. | ||
+ | |||
+ | === Skip Records with any of the Specified Flags (<code>--excludeFlags</code>)=== | ||
+ | Skip records with any of the specified flags set, default 0xF0C | ||
+ | |||
+ | By default skips reads with any of the following flags set: | ||
+ | * unmapped | ||
+ | * mate unmapped | ||
+ | * secondary alignment | ||
+ | * fails QC checks | ||
+ | * duplicate | ||
+ | * supplementary | ||
+ | |||
+ | === Mark entirely clipped reads as unmapped (<code>--unmapped</code>)=== | ||
+ | Specify this option if instead of marking reads as entirely clipped, mark them as unmapped. | ||
− | == Set the SAM/BAMs record buffer size (<code>--poolSize</code>) == | + | When marking a read as unmapped, it will: |
+ | * Set CIGAR to 0 | ||
+ | * Set MapQ to 0 | ||
+ | * Clear N/A flag fields: | ||
+ | ** Proper pair | ||
+ | ** Secondary Alignment | ||
+ | ** Supplementary Alignment | ||
+ | * Update the Mate's flag to indicate: | ||
+ | ** Mate Unmapped | ||
+ | ** Not proper pair | ||
+ | |||
+ | {{noeofBGZFParameter}} | ||
+ | {{paramsParameter}} | ||
+ | |||
+ | ==Clipping By Coordinate Optional Parameters== | ||
+ | === Set the SAM/BAMs record buffer size (<code>--poolSize</code>) === | ||
To handle coordinate sorted files, SAM/BAM records are buffered until it is known that all following records will have a later start position. To prevent the program from running away with memory, a limit is set to the number of records that can be buffered (defaults to 1000000). | To handle coordinate sorted files, SAM/BAM records are buffered until it is known that all following records will have a later start position. To prevent the program from running away with memory, a limit is set to the number of records that can be buffered (defaults to 1000000). | ||
Line 157: | Line 193: | ||
− | == Skip Clipping Coordinate Sorted Files When Out of Records (<code>--poolSkipClip</code>) == | + | === Skip Clipping Coordinate Sorted Files When Out of Records (<code>--poolSkipClip</code>) === |
When clipping coordinate sorted SAM/BAM files, we can run out of buffers available in the pool (<code>--poolSize</code>). | When clipping coordinate sorted SAM/BAM files, we can run out of buffers available in the pool (<code>--poolSize</code>). | ||
Line 169: | Line 205: | ||
With either option, the resulting file will still be sorted by coordinate. | With either option, the resulting file will still be sorted by coordinate. | ||
+ | {{PhoneHomeParameters}} | ||
= Return Value = | = Return Value = | ||
Line 174: | Line 211: | ||
Returns -1 if input parameters are invalid. | Returns -1 if input parameters are invalid. | ||
− | Returns the SamStatus for the reads/writes. | + | Returns the SamStatus for the reads/writes (0 for success, non-0 for failure). |
Returns SamStatus::NO_MORE_RECS, 2, if it was clipping files sorted by coordinate and it ran out of records in the pool so had to clip based on the <code>--poolSkipClip</code> setting. | Returns SamStatus::NO_MORE_RECS, 2, if it was clipping files sorted by coordinate and it ran out of records in the pool so had to clip based on the <code>--poolSkipClip</code> setting. | ||
− | |||
= Output = | = Output = |
Latest revision as of 00:01, 6 March 2016
Overview of the clipOverlap
function of bamUtil
The clipOverlap
option on the bamUtil executable clips overlapping read pairs.
The input file and resulting output file are sorted by coordinate (or readName if specified in the options).
When a read is clipped from the front:
- the read start position is updated to reflect the clipping.
- the mate's mate start position is updated to reflect the record's new position.
- the record is placed in the output file in the correct location based on the updated position.
To handle coordinate-sorted files, SAM/BAM records are buffered up until it is known that all following records will have a later start position. To prevent the program from running away with memory, a limit is set to the number of records that can be buffered, see --poolSize
for more information.
When two mates overlap, this tool will clip the record's whose clipped region would have the lowest average quality.
It also checks strand. If a forward strand extends past the end of a reverse strand, that will be clipped. Similarly, if a reverse strand starts before the forward strand, the region prior to the forward strand will be clipped. If the reverse strand occurs entirely before the forward strand, both strands will be entirely clipped. If the --unmapped
option is specified, then rather than clipping an entire read, it will be marked as unmapped.
The qualities on the two strands remain unchanged even with clipping.
ASSUMPTIONS/RESTRICTIONS
- Assumes the file is sorted by Coordinate (or ReadName if using
--readName
option) - Assumes only 2 reads have matching ReadNames (Supplementary and Secondary reads are ignored/skipped by default so will not cause a problem)
- It matches in pairs, so if there are 3, the first 2 will be matched and compared, but the 3rd won't. If there are 4, the first 2 will be matched and the last 2 will be matched and compared.
- Only mapped reads will be clipped
- Assumes that mate information in records are accurate
Rules for Clipping
Clipping from the front
The first operation after the softclip will be a Match/Mismatch, meaning that any trailing pads, deletions, insertions, or skips will also be soft clipped.
Clip Location | How it is handled |
---|---|
If the clip position falls in a skip/deletion | Removes the entire skip/deletion |
If the position immediately after the clip is a skip/deletion | Also removes the skip/deletion |
If the position immediately after the clip is an Insert | Softclips the insert |
If the position immediately after the clip is a Pad | Removes the pad |
Clip occurs at the last match/mismatch position of the read (the entire read is clipped) | Entire read is soft clipped, 0-based position is left as the original (not modified) |
Clip occurs after the read ends | Entire read is soft clipped, 0-based position is left as the original (not modified) |
Clip occurs before the read starts | Nothing is clipped. The read is not changed. |
Clipping from the back
Clip Location | How it is handled |
---|---|
If the clip position falls in a skip/deletion | Removes the entire skip/deletion |
If the position immediately before the clip is a deletion/skip/pad | Remove the deletion/skip/pad |
If the position immediately before the clip is an insertion | Leave the insertion, even if it results in a 70M3I27S |
Clip occurs at the first position of the read (the entire read is clipped) | Entire read is soft clipped, preceding insertions remain, 0-based position is left as the original (not modified) |
Clip occurs before the read starts | Entire read is soft clipped, 0-based position is left as the original (not modified) |
Clip occurs after the read ends | Nothing is clipped. The read is not changed. |
Usage
./bam clipOverlap --in <inputFile> --out <outputFile> [--storeOrig <tag>] [--readName] [--stats] [--overlapsOnly] [--excludeFlags <flag>] [--poolSize <numRecords allowed to allocate>] [--poolSkipOverlap] [--noeof] [--params]
Parameters
Required Parameters: --in : the SAM/BAM file to clip overlaping read pairs for --out : the SAM/BAM file to be written Optional Parameters: --storeOrig : Store the original cigar in the specified tag. --readName : Original file is sorted by Read Name instead of coordinate. --stats : Print some statistics on the overlaps. --overlapsOnly : Only output overlapping read pairs --excludeFlags : Skip records with any of the specified flags set, default 0xF0C --unmapped : Mark records that would be completely clipped as unmapped --noeof : Do not expect an EOF block on a bam file. --params : Print the parameter settings to stderr Clipping By Coordinate Optional Parameters: --poolSize : Maximum number of records the program is allowed to allocate for clipping on Coordinate sorted files. (Default: 1000000) --poolSkipClip : Skip clipping reads to free of usable records when the poolSize is hit. The default action is to just clip the first read in a pair to free up the record.
PhoneHome: --noPhoneHome : disable PhoneHome (default enabled) --phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)
Required Parameters
Input File (--in
)
Use --in
followed by your file name to specify the SAM/BAM input file.
The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.
A -
is used to indicate to read from stdin and the extension is used to determine the file type (no extension indicates SAM).
SAM/BAM/Uncompressed BAM from file | --in yourFileName
|
SAM from stdin | --in - |
BAM from stdin | --in -.bam |
Uncompressed BAM from stdin | --in -.ubam |
Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools
implementation so pipes between our tools and samtools
are supported.
Output File (--out
)
Use --out
followed by your file name to specify the SAM/BAM output file.
The file extension is used to determine whether to write SAM/BAM/uncompressed BAM. A -
is used to indicate stdout and the extension for file type (no extension is SAM).
SAM to file | --out yourFileName.sam
|
BAM to file | --out yourFileName.bam
|
Uncompressed BAM to file | --out yourFileName.ubam
|
SAM to stdout | --out -
|
BAM to stdout | --out -.bam
|
Uncompressed BAM to stdout | --out -.ubam
|
Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools
implementation so pipes between our tools and samtools
are supported.
Optional Parameters
Store the original cigar string in a tag (--storeOrig
)
Use --storeOrig
followed by the two character TAG to store the original CIGAR.
It will be stored with the specified tag as a "Z" tag type.
Work on SAM/BAMs sorted by Read Name instead of by coordinate (--readName
)
If your file is sorted by read name rather than by coordinate, specify --readName
. The resulting file will still be sorted by read name.
Print Overlap Statistics (--stats
)
Print some basic overlap statistics to stderr.
Output values
- count of the number of overlapping pairs that are clipped
- average of the number of overlapping reference bases that are clipped
- variance of the number of overlapping reference bases that are clipped
- number of times the forward strand is clipped when read pairs overlap
- number of times the reverse strand is clipped when read pairs overlap
- number of times the orientation causes clipping/additional clipping
- reads that are only clipped due to orientation are not counted in the other stats
Example Output
Overlap Statistics: Number of overlapping pairs: 14 Average # Reference Bases Overlapped: 18.3571 Variance of Reference Bases overlapped: 39.1703 Number of times the forward strand was clipped: 6 Number of times the reverse strand was clipped: 8 Number of times orientation causes additional clipping: 4
Print Only Overlaping Reads (--overlapsOnly
)
Only output Read Pairs that overlap. Drop all other records.
Skip Records with any of the Specified Flags (--excludeFlags
)
Skip records with any of the specified flags set, default 0xF0C
By default skips reads with any of the following flags set:
- unmapped
- mate unmapped
- secondary alignment
- fails QC checks
- duplicate
- supplementary
Mark entirely clipped reads as unmapped (--unmapped
)
Specify this option if instead of marking reads as entirely clipped, mark them as unmapped.
When marking a read as unmapped, it will:
- Set CIGAR to 0
- Set MapQ to 0
- Clear N/A flag fields:
- Proper pair
- Secondary Alignment
- Supplementary Alignment
- Update the Mate's flag to indicate:
- Mate Unmapped
- Not proper pair
Do not require BGZF EOF block (--noeof
)
Use --noeof
if you do not expect a trailing eof block in your bgzf file.
By default, the trailing empty block is expected and checked for.
Print the Program Parameters (--params
)
Use --params
to print the parameters for your program to stderr.
Clipping By Coordinate Optional Parameters
Set the SAM/BAMs record buffer size (--poolSize
)
To handle coordinate sorted files, SAM/BAM records are buffered until it is known that all following records will have a later start position. To prevent the program from running away with memory, a limit is set to the number of records that can be buffered (defaults to 1000000).
If the poolSize is exhausted, the code will write the earliest record awaiting its overlapping mate and any previous records that are being buffered.
Depending on whether or not --poolSkipClip
is set, it will either, clip the end of the read at the position where the mate is supposed to start or it will not clip either read. An error message is written to stderr to indicate that one of these has happened and an unsuccessful return value is returned (2: NO_MORE_RECS).
The resulting file will still be sorted by coordinate.
Skip Clipping Coordinate Sorted Files When Out of Records (--poolSkipClip
)
When clipping coordinate sorted SAM/BAM files, we can run out of buffers available in the pool (--poolSize
).
By default when we run out of pooled records, we can no longer read in new records, so instead we release some of the stored records. We do this by dropping the first record that is being held awaiting its mate.
This record can either be:
- Clipped starting at its mate's start position until the end of the read (DEFAULT)
- Left as is with no clipping, leaving the mates mates overlapping (specify
--poolSkipClip
)
With either option, the resulting file will still be sorted by coordinate.
PhoneHome Parameters
See PhoneHome for more information on how PhoneHome works and what it does.
Turn off PhoneHome (--noPhoneHome
)
Use the --noPhoneHome
option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.
Adjust the Frequency of PhoneHome (--phoneHomeThinning
)
Use --phoneHomeThinning
to modify the percentage of the time that PhoneHome will run (0-100).
- By default,
--phoneHomeThinning
is set to 50, running 50% of the time. - PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
- N/A if
--noPhoneHome
is set.
Return Value
Returns -1 if input parameters are invalid.
Returns the SamStatus for the reads/writes (0 for success, non-0 for failure).
Returns SamStatus::NO_MORE_RECS, 2, if it was clipping files sorted by coordinate and it ran out of records in the pool so had to clip based on the --poolSkipClip
setting.
Output
All status messages are written to stderr.
Situation | Sorted Type | Output Message |
---|---|---|
Everything ran successfully | ReadName/Coordinate |
Completed ClipOverlap Successfully. |
Failed to allocate any records | ReadName/Coordinate |
Failed to allocate any records. Failed to complete ClipOverlap. |
Error writing a record | ReadName |
Failed to complete ClipOverlap. |
Expected pair to overlap, but 2nd read was not found in the specified position (may be combined with output for running out of pooled records) | Coordinate |
Failed to find expected overlapping mates for XX records. Completed ClipOverlap Successfully. |
Ran out of pooled Records with --poolSkipClip setting
|
Coordinate |
Due to hitting the max record poolSize, had to skip clipping XX records. Completed ClipOverlap. |
Ran out of pooled Records without --poolSkipClip setting
|
Coordinate |
Due to hitting the max record poolSize, had to default clip XX records. Completed ClipOverlap. |
Any other error | Coordinate |
Failed to complete ClipOverlap. |
Example Output
Failed to find expected overlapping mates for 2 records. Due to hitting the max record poolSize, had to default clip 9 records. Completed ClipOverlap.