Changes

From Genome Analysis Wiki
2,748 bytes added, 00:01, 6 March 2016
Line 6: Line 6:  
The <code>clipOverlap</code> option on the [[bamUtil]] executable clips overlapping read pairs.
 
The <code>clipOverlap</code> option on the [[bamUtil]] executable clips overlapping read pairs.
   −
The input file and resulting output file is sorted by coordinate (or readName is specified in the options).
+
The input file and resulting output file are sorted by coordinate (or readName if specified in the options).
    
When a read is clipped from the front:
 
When a read is clipped from the front:
* the read start position is updated to reflect the clipping
+
* the read start position is updated to reflect the clipping.
 
* the mate's mate start position is updated to reflect the record's new position.
 
* the mate's mate start position is updated to reflect the record's new position.
 
* the record is placed in the output file in the correct location based on the updated position.
 
* the record is placed in the output file in the correct location based on the updated position.
   −
To handle coordinate sorted files, SAM/BAM records are buffered up until it is known that all following records will have a later start position.  To prevent the program from running away with memory, a limit is set to the number of records that can be buffered, see [[#Set the SAM/BAMs record buffer size (--poolSize)|<code>--poolSize</code>]] for more information.
+
To handle coordinate-sorted files, SAM/BAM records are buffered until it is known that all following records will have a later start position.  To prevent the program from running away with memory, a limit is set on the number of records that can be buffered; see [[#Set the SAM/BAMs record buffer size (--poolSize)|<code>--poolSize</code>]] for more information.
   −
When two mates overlap, this tool will clip the record's whose clipped region would has the lowest average quality.
+
When two mates overlap, this tool will clip the record whose clipped region would have the lowest average quality.
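The quality rule above can be illustrated with a minimal Python sketch. It is not bamUtil's implementation: the function name is hypothetical, the overlap is assumed to be a simple count of read bases (no insertions or deletions inside the overlap), and the tie-breaking choice is arbitrary.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of base qualities."""
    return sum(values) / len(values)


def choose_read_to_clip(forward_quals, reverse_quals, overlap_len):
    """Return "forward" or "reverse": the mate whose overlapping bases
    have the lower average quality (hypothetical helper)."""
    if overlap_len <= 0:
        raise ValueError("the mates do not overlap")
    # The overlap sits at the end of the forward read and at the
    # start of the reverse read.
    forward_region = forward_quals[-overlap_len:]
    reverse_region = reverse_quals[:overlap_len]
    if mean(forward_region) < mean(reverse_region):
        return "forward"
    # Assumption: ties go to the reverse read; the source does not say.
    return "reverse"
```

For example, with forward qualities <code>[30, 30, 10, 10]</code>, reverse qualities <code>[40, 40, 30, 30]</code> and an overlap of 2 bases, the forward read would be the one clipped.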
   −
It also checks strand. If a forward strand extends past the end of a reverse strand, that will be clipped.  Similarly, if a reverse strand starts before the forward strand, the region prior to the forward strand will be clipped. If the reverse strand occurs entirely before the forward strand, both strands will be entirely clipped.
+
It also checks strand. If a forward strand extends past the end of a reverse strand, the portion that extends past the end will be clipped.  Similarly, if a reverse strand starts before the forward strand, the region prior to the forward strand will be clipped. If the reverse strand occurs entirely before the forward strand, both strands will be entirely clipped.  If the [[#Mark entirely clipped reads as unmapped (--unmapped)|<code>--unmapped</code>]] option is specified, a read that would be entirely clipped is marked as unmapped instead.
 +
 
 +
The qualities on the two strands remain unchanged even with clipping.
      Line 23: Line 25:     
*Assumes the file is sorted by Coordinate (or ReadName if using <code>--readName</code> option)
 
*Assumes the file is sorted by Coordinate (or ReadName if using <code>--readName</code> option)
*Assumes only 2 reads have matching ReadNames
+
*Assumes only 2 reads have matching ReadNames (supplementary and secondary reads are skipped by default, so they will not cause a problem)
 
**It matches in pairs, so if there are 3, the first 2 will be matched and compared, but the 3rd won't.  If there are 4, the first 2 will be matched and the last 2 will be matched and compared.
 
**It matches in pairs, so if there are 3, the first 2 will be matched and compared, but the 3rd won't.  If there are 4, the first 2 will be matched and the last 2 will be matched and compared.
 
*Only mapped reads will be clipped
 
*Only mapped reads will be clipped
*Mate information in records are accurate
+
*Assumes that mate information in records is accurate
    
= Rules for Clipping =
 
= Rules for Clipping =
Line 82: Line 84:     
= Usage =
 
= Usage =
  ./bam clipOverlap --in <inputFile> --out <outputFile> [--storeOrig <tag>] [--readName] [--poolSize <numRecords allowed to allocate>] [--noeof] [--params]
+
  ./bam clipOverlap --in <inputFile> --out <outputFile> [--storeOrig <tag>] [--readName] [--stats] [--overlapsOnly] [--excludeFlags <flag>] [--unmapped] [--poolSize <numRecords allowed to allocate>] [--poolSkipOverlap] [--noeof] [--params]
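When the tool is driven from a script, the argument list can be assembled programmatically. The following Python sketch only builds the argument list and does not run anything. The function name and keyword names are hypothetical, the flags are those named on this page, and the <code>--excludeFlags</code> value is passed through verbatim because the accepted number format is not stated here. The pool-skip flag is left out because its spelling differs between the usage line and the parameter list.

```python
def build_clip_overlap_command(in_file, out_file, store_orig=None,
                               read_name=False, stats=False,
                               overlaps_only=False, exclude_flags=None,
                               unmapped=False, pool_size=None,
                               executable="./bam"):
    """Build an argv list for `bam clipOverlap` (hypothetical helper)."""
    cmd = [executable, "clipOverlap", "--in", in_file, "--out", out_file]
    if store_orig is not None:
        cmd += ["--storeOrig", store_orig]        # two-character tag
    if read_name:
        cmd.append("--readName")                  # input sorted by read name
    if stats:
        cmd.append("--stats")
    if overlaps_only:
        cmd.append("--overlapsOnly")
    if exclude_flags is not None:
        cmd += ["--excludeFlags", str(exclude_flags)]  # passed as given
    if unmapped:
        cmd.append("--unmapped")
    if pool_size is not None:
        cmd += ["--poolSize", str(pool_size)]
    return cmd
```

The resulting list can be handed to <code>subprocess.run</code> without going through a shell, which avoids quoting problems with file names.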
 +
 
    
= Parameters =
 
= Parameters =
 
<pre>
 
<pre>
 
Required Parameters:
 
Required Parameters:
--in : the SAM/BAM file to clip overlaping read pairs for
+
--in           : the SAM/BAM file to clip overlapping read pairs for
--out       : the SAM/BAM file to be written
+
--out         : the SAM/BAM file to be written
 
Optional Parameters:
 
Optional Parameters:
--storeOrig   : Store the original cigar in the specified tag.
+
--storeOrig   : Store the original cigar in the specified tag.
--readName   : Original file is sorted by Read Name instead of coordinate.
+
--readName     : Original file is sorted by Read Name instead of coordinate.
--noeof       : Do not expect an EOF block on a bam file.
+
--stats        : Print some statistics on the overlaps.
--params     : Print the parameter settings
+
--overlapsOnly : Only output overlapping read pairs
 +
--excludeFlags : Skip records with any of the specified flags set, default 0xF0C
 +
--unmapped     : Mark records that would be completely clipped as unmapped
 +
--noeof       : Do not expect an EOF block on a bam file.
 +
--params       : Print the parameter settings to stderr
 
Clipping By Coordinate Optional Parameters:
 
Clipping By Coordinate Optional Parameters:
--poolSize   : Maximum number of records the program is allowed to allocate
+
--poolSize     : Maximum number of records the program is allowed to allocate
                for clipping on Coordinate sorted files. (Default: 5000)
+
                for clipping on Coordinate sorted files. (Default: 1000000)
 
--poolSkipClip : Skip clipping reads to free up usable records when the
 
--poolSkipClip : Skip clipping reads to free up usable records when the
 
                poolSize is hit. The default action is to just clip the
 
                poolSize is hit. The default action is to just clip the
 
                first read in a pair to free up the record.
 
                first read in a pair to free up the record.
 
</pre>
 
</pre>
 +
{{PhoneHomeParamDesc}}
   −
 
+
== Required Parameters==
 
{{inBAMInputFile}}
 
{{inBAMInputFile}}
 
{{outBAMOutputFile}}
 
{{outBAMOutputFile}}
{{noeofBGZFParameter}}
  −
{{paramsParameter}}
     −
== Store the original cigar string in a tag (<code>--storeOrig</code>) ==
+
== Optional Parameters ==
 +
=== Store the original cigar string in a tag (<code>--storeOrig</code>) ===
    
Use <code>--storeOrig</code> followed by the two character TAG to store the original CIGAR.
 
Use <code>--storeOrig</code> followed by the two character TAG to store the original CIGAR.
Line 115: Line 122:       −
== Work on SAM/BAMs sorted by Read Name instead of by coordinate (<code>--readName</code>) ==
+
=== Work on SAM/BAMs sorted by Read Name instead of by coordinate (<code>--readName</code>) ===
    
If your file is sorted by read name rather than by coordinate, specify <code>--readName</code>.  The resulting file will still be sorted by read name.
 
If your file is sorted by read name rather than by coordinate, specify <code>--readName</code>.  The resulting file will still be sorted by read name.
      −
== Set the SAM/BAMs record buffer size (<code>--poolSize</code>) ==
+
=== Print Overlap Statistics (<code>--stats</code>)===
 +
Print some basic overlap statistics to stderr.
   −
To handle coordinate sorted files, SAM/BAM records are buffered until it is known that all following records will have a later start position.  To prevent the program from running away with memory, a limit is set to the number of records that can be buffered (defaults to 5000).
+
Output values
 +
* count of the number of overlapping pairs that are clipped
 +
* average of the number of overlapping reference bases that are clipped
 +
* variance of the number of overlapping reference bases that are clipped
 +
* number of times the forward strand is clipped when read pairs overlap
 +
* number of times the reverse strand is clipped when read pairs overlap
 +
* number of times the orientation causes clipping/additional clipping
 +
** reads that are only clipped due to orientation are not counted in the other stats
 +
 
 +
==== Example Output ====
 +
<pre>
 +
Overlap Statistics:
 +
Number of overlapping pairs: 14
 +
Average # Reference Bases Overlapped: 18.3571
 +
Variance of Reference Bases overlapped: 39.1703
 +
Number of times the forward strand was clipped: 6
 +
Number of times the reverse strand was clipped: 8
 +
Number of times orientation causes additional clipping: 4
 +
</pre>
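If the statistics need to be consumed by a script, the block above can be parsed line by line. This is a minimal Python sketch that assumes the stderr text has exactly the "label: number" layout of the example output. The function name is hypothetical, and the labels are copied from the example rather than from any format guarantee.

```python
EXAMPLE_STATS = """Overlap Statistics:
Number of overlapping pairs: 14
Average # Reference Bases Overlapped: 18.3571
Variance of Reference Bases overlapped: 39.1703
Number of times the forward strand was clipped: 6
Number of times the reverse strand was clipped: 8
Number of times orientation causes additional clipping: 4
"""


def parse_overlap_stats(text):
    """Turn 'label: number' lines into a dict (hypothetical helper)."""
    stats = {}
    for line in text.splitlines():
        label, sep, value = line.rpartition(":")
        value = value.strip()
        if not sep or not value:
            continue  # header line ("Overlap Statistics:") or blank line
        try:
            number = float(value) if "." in value else int(value)
        except ValueError:
            continue  # not a numeric line; ignore it
        stats[label.strip()] = number
    return stats


stats = parse_overlap_stats(EXAMPLE_STATS)
```

In the example, the forward and reverse clip counts (6 and 8) add up to the 14 overlapping pairs.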
 +
 
 +
=== Print Only Overlapping Reads (<code>--overlapsOnly</code>)===
 +
Only output Read Pairs that overlap.  Drop all other records.
 +
 
 +
=== Skip Records with any of the Specified Flags (<code>--excludeFlags</code>)===
 +
Skip records that have any of the specified flags set (default: 0xF0C).
 +
 
 +
By default skips reads with any of the following flags set:
 +
* unmapped
 +
* mate unmapped
 +
* secondary alignment
 +
* fails QC checks
 +
* duplicate
 +
* supplementary
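The default value 0xF0C is the bitwise OR of the six flags listed above, using the flag bits defined in the SAM specification. A short Python sketch of the check, with a hypothetical function name:

```python
# Flag bits as defined in the SAM specification.
SAM_FLAGS = {
    "unmapped": 0x4,
    "mate unmapped": 0x8,
    "secondary alignment": 0x100,
    "fails QC checks": 0x200,
    "duplicate": 0x400,
    "supplementary": 0x800,
}

DEFAULT_EXCLUDE_FLAGS = 0xF0C


def is_skipped(flag, exclude_flags=DEFAULT_EXCLUDE_FLAGS):
    """True if the record has any of the excluded flag bits set."""
    return (flag & exclude_flags) != 0


# The six bits together give the documented default.
combined = 0
for bit in SAM_FLAGS.values():
    combined |= bit
```

A record with flag 99 (paired, proper pair, mate reverse strand, first in pair) has none of these bits set and is processed, while the same record marked as a duplicate (flag 1123) is skipped.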
 +
 
 +
=== Mark entirely clipped reads as unmapped (<code>--unmapped</code>)===
 +
Specify this option to mark reads that would be entirely clipped as unmapped instead of clipping them.
 +
 
 +
When marking a read as unmapped, it will:
 +
* Set CIGAR to 0
 +
* Set MapQ to 0
 +
* Clear flag fields that no longer apply:
 +
** Proper pair
 +
** Secondary Alignment
 +
** Supplementary Alignment
 +
* Update the Mate's flag to indicate:
 +
** Mate Unmapped
 +
** Not proper pair
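The flag and MapQ updates listed above can be sketched in Python using the flag bits from the SAM specification. Records are modeled as plain dictionaries, which is purely illustrative. Setting the unmapped bit (0x4) on the record itself is an assumption implied by "marking as unmapped", and the CIGAR update is not modeled.

```python
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def mark_unmapped(record, mate):
    """Apply the listed updates to a record and its mate (illustrative)."""
    # Assumption: the record's own unmapped bit is set.
    record["flag"] |= UNMAPPED
    # Clear the flag fields that no longer apply.
    record["flag"] &= ~(PROPER_PAIR | SECONDARY | SUPPLEMENTARY)
    record["mapq"] = 0
    # The mate now has an unmapped mate and is no longer in a proper pair.
    mate["flag"] |= MATE_UNMAPPED
    mate["flag"] &= ~PROPER_PAIR


record = {"flag": 99, "mapq": 60}   # paired, proper pair, first in pair
mate = {"flag": 147, "mapq": 60}    # paired, proper pair, second in pair
mark_unmapped(record, mate)
```

In this example the record's flag changes from 99 to 101 and the mate's flag from 147 to 153.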
 +
 
 +
{{noeofBGZFParameter}}
 +
{{paramsParameter}}
 +
 
 +
==Clipping By Coordinate Optional Parameters==
 +
=== Set the SAM/BAMs record buffer size (<code>--poolSize</code>) ===
 +
 
 +
To handle coordinate-sorted files, SAM/BAM records are buffered until it is known that all following records will have a later start position.  To prevent the program from running away with memory, a limit is set on the number of records that can be buffered (defaults to 1000000).
    
If the poolSize is exhausted, the code will write the earliest record awaiting its overlapping mate and any previous records that are being buffered.
 
If the poolSize is exhausted, the code will write the earliest record awaiting its overlapping mate and any previous records that are being buffered.
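The buffering idea can be sketched in Python. This is a simplified illustration, not bamUtil's implementation: the function name and tuple layout are hypothetical, each record's post-clipping start is assumed to be known up front, and neither the pool limit nor mate matching is modeled. It relies only on the fact that clipping from the front can move a start position later, never earlier.

```python
import heapq
from itertools import count


def resort_after_clipping(records):
    """Yield record names in order of their updated start position.

    `records` is an iterable of (original_start, updated_start, name)
    tuples sorted by original_start, with updated_start >= original_start.
    """
    heap = []
    tie_breaker = count()  # keeps heap entries comparable and order stable
    for original_start, updated_start, name in records:
        # Every later input record starts at or after original_start, so
        # buffered records at or before that position are safe to write.
        while heap and heap[0][0] <= original_start:
            yield heapq.heappop(heap)[2]
        heapq.heappush(heap, (updated_start, next(tie_breaker), name))
    # End of input: flush whatever is still buffered, in order.
    while heap:
        yield heapq.heappop(heap)[2]


example = [(100, 100, "a"), (105, 130, "b"), (110, 110, "c"), (140, 140, "d")]
ordered = list(resort_after_clipping(example))
```

Here record <code>b</code> is clipped from the front so that its start moves from 105 to 130, and it is therefore written after <code>c</code>.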
Line 131: Line 193:       −
== Skip Clipping Coordinate Sorted Files When Out of Records (<code>--poolSkipClip</code>) ==
+
=== Skip Clipping Coordinate Sorted Files When Out of Records (<code>--poolSkipClip</code>) ===
    
When clipping coordinate sorted SAM/BAM files, we can run out of buffers available in the pool (<code>--poolSize</code>).
 
When clipping coordinate sorted SAM/BAM files, we can run out of buffers available in the pool (<code>--poolSize</code>).
Line 143: Line 205:  
With either option, the resulting file will still be sorted by coordinate.
 
With either option, the resulting file will still be sorted by coordinate.
    +
{{PhoneHomeParameters}}
    
= Return Value =
 
= Return Value =
Line 148: Line 211:  
Returns -1 if input parameters are invalid.
 
Returns -1 if input parameters are invalid.
   −
Returns the SamStatus for the reads/writes.
+
Returns the SamStatus for the reads/writes (0 for success, non-0 for failure).
    
Returns SamStatus::NO_MORE_RECS, 2, if it was clipping files sorted by coordinate and it ran out of records in the pool so had to clip based on the <code>--poolSkipClip</code> setting.
 
Returns SamStatus::NO_MORE_RECS, 2, if it was clipping files sorted by coordinate and it ran out of records in the pool so had to clip based on the <code>--poolSkipClip</code> setting.
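A calling script can branch on these values. The Python sketch below uses a hypothetical function name and assumes a POSIX system, where a return value of -1 is seen by the caller as exit status 255. Only the values stated above are interpreted.

```python
def describe_exit_status(status):
    """Map a clipOverlap exit status to a short description (illustrative)."""
    if status == 0:
        return "success"
    if status == 2:
        # SamStatus::NO_MORE_RECS, as described above.
        return "pool exhausted while clipping a coordinate-sorted file"
    if status == 255:
        # Assumption: POSIX truncates a return value of -1 to 255.
        return "invalid input parameters"
    return "read/write failure (SamStatus {})".format(status)
```

Status 2 does not necessarily mean the output is unusable: according to the description above, it indicates that the pool ran out and records were handled according to the <code>--poolSkipClip</code> setting.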
      
= Output =
 
= Output =
