Changes

From Genome Analysis Wiki
8,618 bytes added, 15:59, 24 August 2017
Line 3: Line 3:  
= Overview of the <code>stats</code> function of <code>bamUtil</code>  =
 
   −
The <code>stats</code> option on the [[BamUtil]] executable generates the specified statistics on a SAM/BAM file.  
+
The <code>stats</code> option on the [[BamUtil]] executable generates the specified statistics on a SAM/BAM file.
 +
 
 +
== Troubleshooting ==
 +
See [[BamUtil:_FAQ#BamUtil:_stats|BamUtil: FAQ -> BamUtil: stats]] for troubleshooting help.
 +
 
 +
= Usage =
 +
<pre>
 +
./bam stats --in <inputFile> [--basic] [--qual] [--phred] [--pBaseQC <outputFileName>] [--cBaseQC <outputFileName>] [--maxNumReads <maxNum>][--unmapped] [--bamIndex <bamIndexFile>] [--regionList <regFileName>] [--requiredFlags <integerRequiredFlags>] [--excludeFlags <integerExcludeFlags>] [--noeof] [--params] [--withinRegion] [--baseSum] [--bufferSize <buffSize>] [--minMapQual <minMapQ>] [--dbsnp <dbsnpFile>]
 +
</pre>
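The usage line above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named <code>build_stats_command</code> (it is not part of bamUtil) and a default executable path of <code>./bam</code>. It only uses options listed in the usage line and rejects the combination of <code>--pBaseQC</code> and <code>--cBaseQC</code>.

```python
def build_stats_command(in_file, basic=False, phred=False, qual=False,
                        p_base_qc=None, c_base_qc=None, region_list=None,
                        exclude_flags=None, required_flags=None,
                        bam_path="./bam"):
    """Return an argv list for `bam stats`.

    Hypothetical helper for illustration only; the function name and
    keyword names are assumptions, the option names come from the usage line.
    """
    # pBaseQC and cBaseQC cannot both be specified.
    if p_base_qc is not None and c_base_qc is not None:
        raise ValueError("--pBaseQC and --cBaseQC cannot both be specified")

    argv = [bam_path, "stats", "--in", in_file]

    # Options that take no value.
    for option, enabled in (("--basic", basic), ("--phred", phred), ("--qual", qual)):
        if enabled:
            argv.append(option)

    # Options that are followed by a value.
    for option, value in (("--pBaseQC", p_base_qc),
                          ("--cBaseQC", c_base_qc),
                          ("--regionList", region_list),
                          ("--excludeFlags", exclude_flags),
                          ("--requiredFlags", required_flags)):
        if value is not None:
            argv.extend([option, str(value)])
    return argv


if __name__ == "__main__":
    print(build_stats_command("sample.bam", basic=True, exclude_flags=4))
```

The resulting list can be handed to <code>subprocess.run(argv, check=True)</code>, which raises an exception if the program exits with a non-zero status.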
    
= Parameters  =
 
<pre> Required Parameters:
+
<pre>
--in&nbsp;: the SAM/BAM file to calculate stats for
+
        Required Parameters:
Types of Statistics that can be generated:
+
                --in : the SAM/BAM file to calculate stats for
--basic     &nbsp;: Turn on basic statistic generation
+
        Types of Statistics that can be generated:
--qual       &nbsp;: Generate a count for each quality (displayed as non-phred quality)
+
                --basic         : Turn on basic statistic generation
--phred     &nbsp;: Generate a count for each quality (displayed as phred quality)
+
                --qual         : Generate a count for each quality (displayed as non-phred quality)
--baseQC    &nbsp;: Write per base statistics to the specified file.
+
                --phred         : Generate a count for each quality (displayed as phred quality)
Optional Parameters:
+
                --pBaseQC      : Write per base statistics as Percentages to the specified file. (use - for stdout)
--maxNumReads&nbsp;: Maximum number of reads to process
+
                                  pBaseQC & cBaseQC cannot both be specified.
                Defaults to -1 to indicate all reads.
+
                --cBaseQC      : Write per base statistics as Counts to the specified file. (use - for stdout)
--unmapped   &nbsp;: Only process unmapped reads (requires a bamIndex file)
+
                                  pBaseQC & cBaseQC cannot both be specified.
--bamIndex   &nbsp;: The path/name of the bam index file
+
        Optional Parameters:
                (if required and not specified, uses the --in value + ".bai")
+
                --maxNumReads   : Maximum number of reads to process
--regionList &nbsp;: File containing the region list chr&lt;tab&gt;start_pos&lt;tab&gt;end&lt;pos&gt;.
+
                                  Defaults to -1 to indicate all reads.
                Positions are 0 based and the end_pos is not included in the region.
+
                --unmapped     : Only process unmapped reads (requires a bamIndex file)
                Uses bamIndex.
+
                --bamIndex     : The path/name of the bam index file
--minMapQual &nbsp;: The minimum mapping quality for filtering reads in the baseQC stats.
+
                                  (if required and not specified, uses the --in value + ".bai")
--dbsnp      &nbsp;: The dbSnp file of positions to exclude from baseQC analysis.
+
                --regionList   : File containing the regions to be processed chr<tab>start_pos<tab>end_pos.
--noeof     &nbsp;: Do not expect an EOF block on a bam file.
+
                                  Positions are 0 based and the end_pos is not included in the region.
--params     &nbsp;: Print the parameter settings
+
                                  Uses bamIndex.
 +
                --excludeFlags  : Skip any records with any of the specified flags set
 +
                                  (specify an integer representation of the flags)
 +
                --requiredFlags : Only process records with all of the specified flags set
 +
                                  (specify an integer representation of the flags)
 +
                --noeof         : Do not expect an EOF block on a bam file.
 +
                --params       : Print the parameter settings.
 +
        Optional phred/qual Only Parameters:
 +
                --withinRegion  : Only count qualities if they fall within regions specified.
 +
                                  Only applicable if regionList is also specified.
 +
        Optional BaseQC Only Parameters:
 +
                --baseSum      : Print an overall summary of the baseQC for the file to stderr.
 +
                --bufferSize    : Size of the pileup buffer for calculating the BaseQC parameters.
 +
                                  Default: 1024
 +
                --minMapQual    : The minimum mapping quality for filtering reads in the baseQC stats.
 +
                --dbsnp        : The dbSnp file of positions to exclude from baseQC analysis.
 
</pre>  
 
</pre>  
For all types of statistics, the bam file used is specified by <code>--in</code>.  
+
{{PhoneHomeParamDesc}}
 +
 
 +
== Required Parameters ==
 +
 
 +
{{inBAMInputFile}}
 +
 
 +
== Optional Parameters ==
 +
=== Maximum Number of Reads to Process (<code>--maxNumReads</code>) ===
 +
Use <code>--maxNumReads</code> followed by a number to indicate the maximum number of reads to process before exiting.  By default, it is set to -1 to indicate all reads should be processed.
 +
 
 +
=== Only Process Unmapped Reads (<code>--unmapped</code>) ===
 +
Use <code>--unmapped</code> to process only unmapped reads.
 +
 
 +
This parameter requires [[#Bam Index File (--bamIndex)|<code>--bamIndex</code>]].
 +
 
 +
{{BamIndex}}
 +
 
 +
=== Only Process Certain Regions (<code>--regionList</code>) ===
 +
Use <code>--regionList</code> followed by the filename to process only the regions specified in the file.
 +
 
 +
The positions in the file are specified one per line with the following format: <nowiki>chr<tab>start_pos<tab>end_pos.</nowiki>
 +
 
 +
Positions are 0 based and the end_pos is not included in the region.
 +
 
 +
This parameter requires [[#Bam Index File (--bamIndex)|<code>--bamIndex</code>]].
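The region list format described above (0-based start, end position excluded) can be sketched as follows. This is an illustration only, not bamUtil's own parser; the function names <code>parse_region_list</code> and <code>in_region</code> are hypothetical.

```python
def parse_region_list(lines):
    """Parse lines of the form chr<tab>start_pos<tab>end_pos.

    Returns a list of (chrom, start, end) tuples with integer positions.
    Hypothetical helper; blank lines are skipped as an assumption.
    """
    regions = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def in_region(region, chrom, pos):
    """Return True if the 0-based position falls inside the region.

    The start is included and the end is excluded (half-open interval).
    """
    region_chrom, start, end = region
    return chrom == region_chrom and start <= pos < end


if __name__ == "__main__":
    regions = parse_region_list(["1\t10000\t10025\n"])
    print(in_region(regions[0], "1", 10024))  # last position inside the region
    print(in_region(regions[0], "1", 10025))  # end_pos itself is excluded
```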
   −
The optional parameters are also used for all types of statistics.  
+
=== Exclude Flags (<code>--excludeFlags</code>) ===
 +
Use <code>--excludeFlags</code> followed by an integer representation of the flags to skip any records with any of the specified flags set.
   −
Usage:
+
=== Required Flags (<code>--requiredFlags</code>) ===
<pre> ./bam stats --in &lt;inputFile&gt; [--basic] [--qual] [--phred] [--baseQC &lt;outputFileName&gt;] [--maxNumReads &lt;maxNum&gt;] [--unmapped] [--bamIndex &lt;bamIndexFile&gt;] [--regionList &lt;regFileName&gt;] [--minMapQual &lt;minMapQ&gt;] [--dbsnp &lt;dbsnpFile&gt;] [--noeof] [--params]
+
Use <code>--requiredFlags</code> followed by an integer representation of the flags to only process records with all of the specified flags set.
</pre>  
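The flag filters (<code>--excludeFlags</code> and <code>--requiredFlags</code>) take the integer form of the SAM flag bits. The sketch below shows the filtering rule as the option descriptions state it: skip a record if any excluded bit is set, and keep it only if all required bits are set. The function name <code>passes_flag_filters</code> is hypothetical, and the bit values are the standard SAM flag definitions rather than anything specific to bamUtil.

```python
# Standard SAM flag bits (from the SAM specification).
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
QC_FAIL = 0x200
DUPLICATE = 0x400


def passes_flag_filters(flag, required_flags=0, exclude_flags=0):
    """Return True if a record with this flag would be processed.

    Hypothetical helper mirroring the described behaviour:
    - excluded: ANY of the exclude bits set means the record is skipped
    - required: ALL of the required bits must be set
    """
    if flag & exclude_flags:
        return False
    return (flag & required_flags) == required_flags


if __name__ == "__main__":
    # Integer to pass on the command line to skip duplicates and QC failures.
    print(DUPLICATE | QC_FAIL)
    # An unmapped record is skipped when the unmapped bit (4) is excluded.
    print(passes_flag_filters(UNMAPPED, exclude_flags=4))
```

Because the options expect one integer, several bits are combined by adding (or OR-ing) their values before passing them on the command line.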
  −
<br>  
     −
= Types of Statistics =
+
== Types of Statistics ==
   −
== Basic ==
+
=== Basic (<code>--basic</code>) ===
    
Prints summary statistics for the file:  
 
Line 56: Line 102:  
*BasesInMappedReads - # of bases in reads marked mapped in the flag
 
   −
== Qual/Phred ==
+
=== Qual/Phred (<code>--phred</code> and <code>--qual</code>) ===
   −
Prints a count of the number of times each quality value appears in the file.  
+
Prints a count of the number of times each quality value appears in the file to stderr.  
    
*<code>phred</code> Displays Quality as phred integers [0-93]  
 
 
*<code>qual</code> Displays Quality as non-phred integers (phred + 33) [33-126]
 
   −
<br>
+
By default, these counts include all qualities in the BAM file.
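The two displays differ only by the offset of 33 noted above. As a rough sketch, assuming the qualities come from the ASCII quality string of each record, a count per quality value could be built like this. The function name <code>count_qualities</code> is hypothetical and this is not bamUtil's implementation.

```python
from collections import Counter


def count_qualities(quality_strings, as_phred=True):
    """Count how often each quality value appears.

    quality_strings: iterable of ASCII quality strings, one per read.
    as_phred=True  reports phred integers (0-93).
    as_phred=False reports non-phred integers (phred + 33, i.e. 33-126).
    """
    counts = Counter()
    for qual in quality_strings:
        for ch in qual:
            value = ord(ch)      # non-phred display value
            if as_phred:
                value -= 33      # convert to the phred display value
            counts[value] += 1
    return counts


if __name__ == "__main__":
    reads = ["II5", "!I"]
    print(sorted(count_qualities(reads, as_phred=True).items()))
    print(sorted(count_qualities(reads, as_phred=False).items()))
```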
 +
 
 +
To exclude unmapped reads and soft clips, use <code>--excludeFlags 4</code>.
   −
== BaseQC ==
+
To only include records that overlap a set of regions, use <code>--regionList</code> and specify a BED file with the regions. If a read overlaps a region, all of its qualities are counted, even for bases that do not fall in the region. To count only qualities that fall within the region, also specify <code>--withinRegion</code>. Unless unmapped reads are excluded, the counts include soft clips that overlap the region.
   −
The <code>baseQC</code> option generates the following statistics:
+
==== Optional Phred/Qual Only Parameters ====
 +
===== Within Region (<code>--withinRegion</code>) =====
 +
Use <code>--withinRegion</code> with [[#Qual/Phred (--phred and --qual)|<code>--phred</code> or <code>--qual</code>]] options to only count qualities if they fall within the regions specified using [[#Only Process Certain Regions (--regionList)|<code>--regionList</code>]] (only applicable if [[#Only Process Certain Regions (--regionList)|<code>--regionList</code>]]  is also specified).
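The difference between counting with and without <code>--withinRegion</code> can be illustrated with a simplified sketch. It assumes an ungapped read (every base aligned, no clips, insertions or deletions), a single region given as a 0-based half-open interval, and a hypothetical function name <code>count_read_qualities</code>. It is not bamUtil's implementation.

```python
from collections import Counter


def count_read_qualities(read_start, qual, region, within_region=False):
    """Count phred qualities for one read against one region.

    read_start: 0-based reference position of the first base.
    qual: ASCII quality string (phred + 33); ungapped alignment assumed.
    region: (start, end) with start included and end excluded.
    within_region: if True, only count bases whose position is in the region.
    """
    region_start, region_end = region
    read_end = read_start + len(qual)  # exclusive end of the read
    counts = Counter()

    # A read that does not overlap the region contributes nothing.
    if read_start >= region_end or read_end <= region_start:
        return counts

    for offset, ch in enumerate(qual):
        pos = read_start + offset
        if within_region and not (region_start <= pos < region_end):
            continue  # base lies outside the region
        counts[ord(ch) - 33] += 1
    return counts


if __name__ == "__main__":
    # Read covers positions 8-12; the region starts at 10.
    print(dict(count_read_qualities(8, "IIII5", (10, 20))))
    print(dict(count_read_qualities(8, "IIII5", (10, 20), within_region=True)))
```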
   −
A read spans a position if the read starts at or before the position, ends at or after the position and the position is not a clip.  CIGAR operations allowed for the position are M/X/=/D/N.  If the CIGAR is '*', only numbers for the specified reference position are incremented.
+
=== BaseQC (<code>--pBaseQC</code> and <code>--cBaseQC</code> and <code>--baseSum</code>) ===
   −
Currently there is no special logic to exclude positions/reads where the reference base is 'N' or the read base is 'N'.  
+
The <code>pBaseQC</code> and <code>cBaseQC</code> options generate per base statistics.  Only one of these two options can be specified.  They write statistics generated for each position to the file specified after the option (use <code>-</code> to write to stdout).  They use the same logic for calculating statistics, but <code>pBaseQC</code> writes the statistics as percentages, and <code>cBaseQC</code> writes them as counts.  The order of the statistics is also different.
   −
<br>  
+
The <code>baseSum</code> option can be used with either <code>pBaseQC</code> or <code>cBaseQC</code>, or on its own.  <code>baseSum</code> generates a summary of the per position statistics and writes it to stderr.  It calculates the per position base statistics even if they will not be written anywhere (that is, when neither <code>pBaseQC</code> nor <code>cBaseQC</code> is specified).
   −
=== BaseQC Output  ===
     −
There are two output options for BaseQC.  
+
All three options use the same logic for calculating the statistics:
 +
* A read spans a position if the read starts at or before the position, ends at or after the position and the position is not a clip.  CIGAR operations allowed for the position are M/X/=/D/N.  If the CIGAR is '*', only numbers for the specified reference position are incremented.
 +
*Currently there is no special logic to exclude positions/reads where the reference base is 'N' or the read base is 'N'.  
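The spanning rule above can be sketched by walking the CIGAR string. This is a minimal illustration, assuming a 0-based start position and standard SAM CIGAR semantics; the function name <code>spanned_positions</code> is hypothetical, and the handling of <code>*</code> is one reading of the statement that only the specified reference position is incremented.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations the text lists as allowed for a spanned position.
SPANNING_OPS = set("MX=DN")


def spanned_positions(read_start, cigar):
    """Return the 0-based reference positions a read spans.

    read_start: 0-based reference position of the first aligned base.
    cigar: CIGAR string, or '*' when it is unavailable.
    """
    if cigar == "*":
        # Assumption: only the record's own reference position counts.
        return [read_start]

    positions = []
    ref_pos = read_start
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in SPANNING_OPS:
            positions.extend(range(ref_pos, ref_pos + length))
            ref_pos += length
        # I, S, H and P do not advance along the reference, so clips and
        # insertions add no spanned positions.
    return positions


if __name__ == "__main__":
    print(spanned_positions(100, "2S3M1D2M"))
```

In this sketch a deleted or skipped reference position (D or N) still counts as spanned, while soft and hard clips do not.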
   −
#[[#Percentage-Based Output Format|Percentage-Based Output Format]]
+
<br>
#[[#Count-Based Output Format|Count-Based Output Format]]
     −
==== Percentage-Based Output Format ====
+
==== Percentage-Based Output Format (<code>--pBaseQC</code>) ====
    
Order/Descriptions:  
 
Line 95: Line 145:  
| chromEnd || 0-based end position (always 1 greater than start and not included in this region)
 
 
|-
 
| Depth || # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures || align="center"|X || align="center"|X || || align="center"|X || align="center"|X
+
| Depth || # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
 
|-
 
| Q20Bases || # of bases at this position with a base quality (from the read) of Q20 or higher || align="center"|X || align="center"|X || || align="center"|X || align="center"|X
+
| Q20Bases || # of bases at this position with a base quality (from the read) of Q20 or higher || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
 
|-
 
| Q20BasesPct(%) || Q20Bases / Depth || align="center"|X || align="center"|X || || align="center"|X || align="center"|X
+
| Q20BasesPct(%) || Q20Bases / Depth || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
 
|-
 
 
| TotalReads || # of reads that span this position || || || || ||
 
Line 128: Line 178:     
This output does not include a MapQual255 count.  
 
 +
    
===== Sample Output  =====
 
Line 145: Line 196:  
1 10024 10025 14 12 85.714 39 30 76.923 51.282 25.641 51.282 84.615 38.462 15.385 15.385 11.000 21
 
 
</pre>  
 
</pre>  
==== Count-Based Output Format ====
+
 
 +
 
 +
==== Count-Based Output Format (<code>--cBaseQC</code>) ====
 
Order/Descriptions:  
 
 
{|border=1  
 
{|border=1  
Line 180: Line 233:  
| AverageMapQualCount || # of mapping qualities in AverageMapQuality || align="center"|X || align="center"|X || align="center"|X ||
 
 
|- ||
 
| Depth || # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures || align="center"|X || align="center"|X || || align="center"|X || align="center"|X
+
| Depth || # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
 
|-
 
| Q20Bases || # of bases at this position with a base quality (from the read) of Q20 or higher || align="center"|X || align="center"|X || || align="center"|X || align="center"|X
+
| Q20Bases || # of bases at this position with a base quality (from the read) of Q20 or higher || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
 
|-
 
 
|}
 
|}
    +
==== Summary of per Position Statistics (<code>--baseSum</code>) ====
 +
Use <code>--baseSum</code> to print an overall summary of the baseQC for the file to stderr.
 +
 +
This option can be used with or without <code>--pBaseQC</code> and <code>--cBaseQC</code>.
 +
 +
The values are tab delimited.  First there is a header line describing the summary.  The next line has the Means, and the last line has the Standard Deviations.
 +
 +
{|border=1
 +
! Field !! Description !!style="width: 80px"| Excludes Duplicates, QC Failures !!style="width: 80px"| Excludes Unmapped !!style="width: 80px"|  Excludes MapQual = 255 !!style="width: 80px"| Excludes Below Min MapQual !!style="width: 80px"| Excludes CIGAR Deletions, Skips
 +
|-
 +
| TotalReads || # of reads that span this position || || || || ||
 +
|-
 +
| Dups || # of reads marked duplicate in the flag || || || || ||
 +
|-
 +
| QCFail || # of reads marked QC failure in the flag || || || || ||
 +
|-
 +
| Mapped || # of reads marked mapped in the flag || align="center"|X || align="center"|X || || ||
 +
|-
 +
| Paired || # of reads marked paired in the flag || align="center"|X || align="center"|X || || ||
 +
|-
 +
| ProperPaired || # of reads marked paired AND proper paired in the flag || align="center"|X || align="center"|X || || ||
 +
|-
 +
| ZeroMapQual || # of reads that have a Mapping Quality of 0 || align="center"|X || align="center"|X || || ||
 +
|-
 +
| MapQual&lt;10(%) || # of reads that have a Mapping Quality &lt; 10 || align="center"|X || align="center"|X || || ||
 +
|-
 +
| MapQual255 || # of reads that have a Mapping Quality = 255 || align="center"|X || align="center"|X || || ||
 +
|-
 +
| PassMapQual || # of reads that have a Mapping Quality &gt;= a minimum Mapping Quality || align="center"|X || align="center"|X || || ||
 +
|-
 +
| AverageMapQuality || sum of included mapping qualities / AverageMapQualCount || align="center"|X || align="center"|X || align="center"|X || ||
 +
|-
 +
| AverageMapQualCount || # of mapping qualities in AverageMapQuality || align="center"|X || align="center"|X || align="center"|X ||
 +
|- ||
 +
| Depth || # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
 +
|-
 +
| Q20Bases || # of bases at this position with a base quality (from the read) of Q20 or higher || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
 +
|-
 +
|}
 +
 +
 +
===== Sample Output =====
 +
<pre>
 +
Summary of Pileup Stats (1st Row is Mean, 2nd Row is Standard Deviation)
 +
TotalReads Dups QCFail Mapped Paired ProperPaired ZeroMapQual MapQual<10 MapQual255 PassMapQual AverageMapQuality AverageMapQualCount
 +
Depth Q20Bases
 +
14.307692 1.846154 1.846154 8.769231 7.846154 0.923077 2.923077 5.846154 0.000000 2.923077 11.000000 8.769231 2.076923 1.153846
 +
17.670053 2.882307 2.882307 9.038380 7.603137 1.441153 3.012793 6.025586 0.000000 3.012793 0.000000 9.038380 2.841993 1.993579
 +
</pre>
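A summary like the sample above can be read back into named values. The sketch below assumes the layout shown in the sample: a title line, one or more lines of field names, a line of means, and a final line of standard deviations. The function name <code>parse_base_sum</code> is hypothetical. Splitting on any whitespace handles tab-delimited values as well as the wrapped field names.

```python
def parse_base_sum(text):
    """Parse a baseSum summary into {field: (mean, standard_deviation)}.

    Assumed layout (taken from the sample output):
      line 1      : title
      middle lines: field names (may wrap onto more than one line)
      second last : means
      last        : standard deviations
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 4:
        raise ValueError("expected title, field names, means and std devs")

    stddevs = [float(x) for x in lines[-1].split()]
    means = [float(x) for x in lines[-2].split()]
    names = " ".join(lines[1:-2]).split()

    if not (len(names) == len(means) == len(stddevs)):
        raise ValueError("field names and values do not line up")
    return {name: (m, s) for name, m, s in zip(names, means, stddevs)}


SAMPLE = """Summary of Pileup Stats (1st Row is Mean, 2nd Row is Standard Deviation)
TotalReads Dups QCFail Mapped Paired ProperPaired ZeroMapQual MapQual<10 MapQual255 PassMapQual AverageMapQuality AverageMapQualCount
Depth Q20Bases
14.307692 1.846154 1.846154 8.769231 7.846154 0.923077 2.923077 5.846154 0.000000 2.923077 11.000000 8.769231 2.076923 1.153846
17.670053 2.882307 2.882307 9.038380 7.603137 1.441153 3.012793 6.025586 0.000000 3.012793 0.000000 9.038380 2.841993 1.993579
"""

if __name__ == "__main__":
    summary = parse_base_sum(SAMPLE)
    mean_depth, sd_depth = summary["Depth"]
    print(mean_depth, sd_depth)
```

Since the summary is written to stderr, a caller would need to capture that stream rather than stdout before parsing.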
 +
 +
==== Optional BaseQC Only Parameters ====
 +
===== Pileup Buffer Size (<code>--bufferSize</code>) =====
 +
Use the <code>--bufferSize</code> option followed by the size of the pileup buffer to use for [[#BaseQC (--pBaseQC and --cBaseQC and --baseSum)|baseQC]] stats.
 +
 +
===== Minimum Mapping Quality (<code>--minMapQual</code>) =====
 +
Use the <code>--minMapQual</code> option followed by the minimum mapping quality for filtering reads in the [[#BaseQC (--pBaseQC and --cBaseQC and --baseSum)|baseQC]] stats.
 +
 +
===== DBSNP File (<code>--dbsnp</code>) =====
 +
Use the <code>--dbsnp</code> option followed by the name of the dbSNP file to specify the positions to exclude from [[#BaseQC (--pBaseQC and --cBaseQC and --baseSum)|baseQC]] analysis.
 +
 +
{{PhoneHomeParameters}}
 +
 +
= Return Value =
 +
0 on Success, non-0 on failure
       
[[Category:BamUtil|stats]] [[Category:BAM_Software]] [[Category:Software]]
 
