Changes

From Genome Analysis Wiki
201 bytes added ,  11:16, 19 September 2012
no edit summary
Line 16: Line 16:  
==Handling Duplicates==
 
==Handling Duplicates==
   −
The deduper reads all the alignments in a coordinate-sorted SAM/BAM looking for duplicates.
+
The deduper reads all the alignments in a coordinate-sorted SAM/BAM looking for duplicates, failing if the file is not coordinate-sorted.
   −
The deduper assumes that duplicates in the input BAM file are not marked.   
+
The deduper assumes that duplicates in the input BAM file are not marked.  When the deduper detects a marked duplicate in the input BAM file, it will throw an error and stop.  To override this behavior, use the [[#Ignore Previous Duplicate Marking (--force)|<code>--force</code>]] option;  in this mode, alignments that are marked as duplicates in the input file are unmarked before the deduper begins its detection algorithm.  The result is that only duplicates detected by the deduper will be marked in or removed from the output file.
 
  −
When the deduper detects a marked duplicate in the input BAM file, it will throw an error and stop.  To override this behavior, use the [[#Ignore Previous Duplicate Marking (--force)|<code>--force</code>]] option;  in this mode, alignments that are marked as duplicates in the input file are unmarked before the deduper begins its detection algorithm.  The result is that only duplicates detected by the deduper will be marked in or removed from the output file.
      
The handling of paired-end reads assumes that the mate information in the SAM/BAM records is accurate.  If a mate is not found at the expected position, an error message is printed once per file.  Paired-end reads whose mate cannot be found are not marked as duplicates and are not used for duplicate marking of other paired-end reads.  Single-end reads with the same key as paired-end reads whose mate cannot be found are still marked as duplicates.  If this error is encountered, you may want to fix the mate information and reprocess the file through the deduper.  Use the [[#Treat Reads with Mates On Different Chromosomes As Single-Ended (--oneChrom)|<code>--oneChrom</code>]] option to treat reads with a mate on a different chromosome as single-ended.  This option is useful if you are running the deduper on just a single chromosome.
 
The handling of paired-end reads assumes that the mate information in the SAM/BAM records is accurate.  If a mate is not found at the expected position, an error message is printed once per file.  Paired-end reads whose mate cannot be found are not marked as duplicates and are not used for duplicate marking of other paired-end reads.  Single-end reads with the same key as paired-end reads whose mate cannot be found are still marked as duplicates.  If this error is encountered, you may want to fix the mate information and reprocess the file through the deduper.  Use the [[#Treat Reads with Mates On Different Chromosomes As Single-Ended (--oneChrom)|<code>--oneChrom</code>]] option to treat reads with a mate on a different chromosome as single-ended.  This option is useful if you are running the deduper on just a single chromosome.
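
The <code>--force</code> behavior described in this section amounts to clearing the duplicate bit in each record's FLAG field before detection starts. The following is a minimal sketch in Python, assuming the standard SAM duplicate flag value (0x400). The helper names <code>unmark_duplicate</code> and <code>is_duplicate</code> are hypothetical and are not part of the deduper.

```python
# SAM FLAG bit for "PCR or optical duplicate" (0x400 in the SAM specification).
DUPLICATE_FLAG = 0x400


def is_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def unmark_duplicate(flag: int) -> int:
    """Return the FLAG value with the duplicate bit cleared.

    Hypothetical helper: illustrates what "unmarking" a previously
    marked duplicate means; all other FLAG bits are left unchanged.
    """
    return flag & ~DUPLICATE_FLAG
```

For example, a record with FLAG 1123 (1024 + 99) would be treated as if its FLAG were 99 before duplicate detection begins.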
Line 38: Line 36:  
* Mark a Single-End Read Duplicate (or remove it if configured to do so) if:
 
* Mark a Single-End Read Duplicate (or remove it if configured to do so) if:
 
*# A paired-end record has the same key (even if the pair is not proper/the mate is unmapped/the mate is not found)<br/>-OR-
 
*# A paired-end record has the same key (even if the pair is not proper/the mate is unmapped/the mate is not found)<br/>-OR-
*# A single-end record has the same key and a higher base quality sum (sum of all base qualities in the record)
+
*# A single-end record has the same key and a higher base quality sum (sum of the base qualities in the record that are above [[#Minimum Quality for Quality Calculations (--minQual)|<code>--minQual</code>]])
 
* Mark both Paired-End Reads Duplicate if:
 
* Mark both Paired-End Reads Duplicate if:
 
# Another paired-end pair has the same set of keys and has a higher base quality sum.
 
# Another paired-end pair has the same set of keys and has a higher base quality sum.
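
The single-end rules listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the deduper's implementation: the <code>Record</code> class, its fields, the contents of <code>key</code>, and the function name are all hypothetical, and ties in quality sum are resolved here by keeping the first record seen, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    key: tuple        # hypothetical duplicate key, e.g. (chromosome, position, strand)
    paired: bool      # True for a paired-end record
    quality_sum: int  # base quality sum of the record


def single_end_duplicates(records):
    """Return the names of single-end records that would be marked duplicate."""
    # Keys seen on any paired-end record (proper pair or not).
    paired_keys = {r.key for r in records if r.paired}

    # Best single-end record per key; on ties the first one seen is kept (assumption).
    best = {}
    for r in records:
        if not r.paired:
            if r.key not in best or r.quality_sum > best[r.key].quality_sum:
                best[r.key] = r

    duplicates = set()
    for r in records:
        if r.paired:
            continue
        # Rule 1: a paired-end record has the same key.
        # Rule 2: another single-end record with the same key has a higher sum.
        if r.key in paired_keys or best[r.key] is not r:
            duplicates.add(r.name)
    return duplicates
```

A single-end record that shares a key with a paired-end record is marked regardless of its own quality sum, which is why the paired-end check comes first.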
Line 66: Line 64:  
<pre>
 
<pre>
 
Required parameters :
 
Required parameters :
--in <infile>  : input BAM file name (must be sorted)
+
--in <infile>  : Input BAM file name (must be sorted)
--out <outfile> : output BAM file name (same order with original file)
+
--out <outfile> : Output BAM file name (same order with original file)
Optional parameters : (see SAM format specification for details)
+
Optional parameters :  
--minQual <int> : only add scores over this phred quality when determining a read's quality (default: 15)
+
--minQual <int> : Only add scores over this phred quality when determining a read's quality (default: 15)
--log <logfile> : log and summary statistics (default: [outfile].log, or stderr if --out starts with '-')
+
--log <logfile> : Log and summary statistics (default: [outfile].log, or stderr if --out starts with '-')
 
--oneChrom      : Treat reads with mates on different chromosomes as single-ended.
 
--oneChrom      : Treat reads with mates on different chromosomes as single-ended.
 
--rmDups        : Remove duplicates (default is to mark duplicates)
 
--rmDups        : Remove duplicates (default is to mark duplicates)
--force        : Allow mark-duplicated BAM file and force unmarking the duplicates
+
--force        : Allow an already mark-duplicated BAM file, unmarking any previously marked
                    Default is to throw errors when trying to run a mark-duplicated BAM
+
                  duplicates and apply this duplicate marking logic.  Default is to throw errors
 +
                  and exit when trying to run on an already mark-duplicated BAM
 
--verbose      : Turn on verbose mode
 
--verbose      : Turn on verbose mode
--noeof        : do not expect an EOF block on a bam file.
+
--noeof        : Do not expect an EOF block on a bam file.
--params        : print the parameter settings
+
--params        : Print the parameter settings
 
--recab        : Recalibrate in addition to deduping
 
--recab        : Recalibrate in addition to deduping
 
</pre>
 
</pre>
Line 88: Line 87:  
{{outBAMOutputFile}}
 
{{outBAMOutputFile}}
   −
== BAM File Is Sorted By Read Name (<code>--minQual</code>) ==
+
== Minimum Quality for Quality Calculations (<code>--minQual</code>) ==
    
When duplicate reads are encountered, the read with the highest quality is kept.
 
When duplicate reads are encountered, the read with the highest quality is kept.
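
As an illustration of how a read's quality could be computed, here is a minimal Python sketch. It assumes the standard SAM Phred+33 quality encoding and follows the usage text in counting only scores strictly over the threshold; whether a score exactly equal to the threshold is counted is an assumption. The function name <code>base_quality_sum</code> is hypothetical.

```python
def base_quality_sum(qualities: str, min_qual: int = 15) -> int:
    """Sum the Phred scores in a SAM QUAL string that are over min_qual.

    Assumes Phred+33 encoding; a QUAL of "*" (qualities not stored)
    contributes a sum of 0.
    """
    if qualities == "*":
        return 0
    total = 0
    for ch in qualities:
        q = ord(ch) - 33      # decode Phred+33
        if q > min_qual:      # strictly over the threshold (assumption)
            total += q
    return total
```

With the default threshold of 15, the QUAL string <code>I5+0</code> decodes to scores 40, 20, 10, and 15, of which only 40 and 20 are counted.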
Line 103: Line 102:     
If a read's mate is not found, it will not be used for duplicate marking.  If you are running on a single chromosome, all reads whose mates are on different chromosomes will not be used for duplicate marking.  The <code>--oneChrom</code> option will treat reads with mates on a different chromosome as single-ended.
 
If a read's mate is not found, it will not be used for duplicate marking.  If you are running on a single chromosome, all reads whose mates are on different chromosomes will not be used for duplicate marking.  The <code>--oneChrom</code> option will treat reads with mates on a different chromosome as single-ended.
  −
== Recalibrate (<code>--recab</code>) ==
  −
  −
This option will recalibrate the input file in addition to deduping.
  −
  −
See [[BamUtil: recab]] for recalibration details.
      
== Remove Duplicates (<code>--rmDups</code>) ==
 
== Remove Duplicates (<code>--rmDups</code>) ==
Line 124: Line 117:  
{{noeofBGZFParameter}}
 
{{noeofBGZFParameter}}
 
{{paramsParameter}}
 
{{paramsParameter}}
 +
 +
== Recalibrate (<code>--recab</code>) ==
 +
 +
This option will recalibrate the input file in addition to deduping.
 +
 +
See [[BamUtil: recab]] for recalibration details.
    
= Return Value =
 
= Return Value =
