Difference between revisions of "BamUtil: stats"

From Genome Analysis Wiki
= Overview of the <code>stats</code> function of <code>bamUtil</code> =
The <code>stats</code> option on the [[BamUtil]] executable generates the specified statistics on a SAM/BAM file.

== Troubleshooting ==
See [[BamUtil:_FAQ#BamUtil:_stats|BamUtil: FAQ -> BamUtil: stats]] for troubleshooting help.

= Usage =
<pre>
./bam stats --in <inputFile> [--basic] [--qual] [--phred] [--pBaseQC <outputFileName>] [--cBaseQC <outputFileName>] [--maxNumReads <maxNum>] [--unmapped] [--bamIndex <bamIndexFile>] [--regionList <regFileName>] [--requiredFlags <integerRequiredFlags>] [--excludeFlags <integerExcludeFlags>] [--noeof] [--params] [--withinRegion] [--baseSum] [--bufferSize <buffSize>] [--minMapQual <minMapQ>] [--dbsnp <dbsnpFile>]
</pre>

= Parameters =
<pre>
        Required Parameters:
                --in : the SAM/BAM file to calculate stats for
        Types of Statistics that can be generated:
                --basic         : Turn on basic statistic generation
                --qual          : Generate a count for each quality (displayed as non-phred quality)
                --phred         : Generate a count for each quality (displayed as phred quality)
                --pBaseQC       : Write per base statistics as Percentages to the specified file. (use - for stdout)
                                  pBaseQC & cBaseQC cannot both be specified.
                --cBaseQC       : Write per base statistics as Counts to the specified file. (use - for stdout)
                                  pBaseQC & cBaseQC cannot both be specified.
        Optional Parameters:
                --maxNumReads   : Maximum number of reads to process
                                  Defaults to -1 to indicate all reads.
                --unmapped      : Only process unmapped reads (requires a bamIndex file)
                --bamIndex      : The path/name of the bam index file
                                  (if required and not specified, uses the --in value + ".bai")
                --regionList    : File containing the regions to be processed chr<tab>start_pos<tab>end_pos.
                                  Positions are 0 based and the end_pos is not included in the region.
                                  Uses bamIndex.
                --excludeFlags  : Skip any records with any of the specified flags set
                                  (specify an integer representation of the flags)
                --requiredFlags : Only process records with all of the specified flags set
                                  (specify an integer representation of the flags)
                --noeof         : Do not expect an EOF block on a bam file.
                --params        : Print the parameter settings.
        Optional phred/qual Only Parameters:
                --withinRegion  : Only count qualities if they fall within regions specified.
                                  Only applicable if regionList is also specified.
        Optional BaseQC Only Parameters:
                --baseSum       : Print an overall summary of the baseQC for the file to stderr.
                --bufferSize    : Size of the pileup buffer for calculating the BaseQC parameters.
                                  Default: 1024
                --minMapQual    : The minimum mapping quality for filtering reads in the baseQC stats.
                --dbsnp         : The dbSnp file of positions to exclude from baseQC analysis.
</pre>
{{PhoneHomeParamDesc}}

== Required Parameters ==

{{inBAMInputFile}}

== Optional Parameters ==
=== Maximum number of reads to process (<code>--maxNumReads</code>) ===
Use <code>--maxNumReads</code> followed by a number to indicate the maximum number of reads to process before exiting.  By default, it is set to -1 to indicate all reads should be processed.

=== Only Process Unmapped Reads (<code>--unmapped</code>) ===
Use <code>--unmapped</code> to process only unmapped reads.

This parameter requires [[#Bam Index File (--bamIndex)|<code>--bamIndex</code>]].

{{BamIndex}}

=== Only Process Certain Regions (<code>--regionList</code>) ===
Use <code>--regionList</code> followed by the filename to process only the regions specified in the file.

The positions in the file are specified one per line with the following format: <nowiki>chr<tab>start_pos<tab>end_pos.</nowiki>

Positions are 0 based and the end_pos is not included in the region.

This parameter requires [[#Bam Index File (--bamIndex)|<code>--bamIndex</code>]].

=== Exclude Flags (<code>--excludeFlags</code>) ===
Use <code>--excludeFlags</code> followed by an integer representation of the flags to skip any records with any of the specified flags set.

=== Required Flags (<code>--requiredFlags</code>) ===
Use <code>--requiredFlags</code> followed by an integer representation of the flags to only process records with all of the specified flags set.

== Types of Statistics ==

=== Basic (<code>--basic</code>) ===
Prints summary statistics for the file:
*TotalReads - # of reads that are in the file
*MappedReads - # of reads marked mapped in the flag
*PairedReads - # of reads marked paired in the flag
*ProperPair - # of reads marked paired AND proper paired in the flag
*DuplicateReads - # of reads marked duplicate in the flag
*QCFailureReads - # of reads marked QC failure in the flag
*MappingRate(%) - # of reads marked mapped in the flag / TotalReads
*PairedReads(%) - # of reads marked paired in the flag / TotalReads
*ProperPair(%) - # of reads marked paired AND proper paired in the flag / TotalReads
*DupRate(%) - # of reads marked duplicate in the flag / TotalReads
*QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
*TotalBases - # of bases in all reads
*BasesInMappedReads - # of bases in reads marked mapped in the flag

=== Qual/Phred (<code>--phred</code> and <code>--qual</code>) ===
Prints a count of the number of times each quality value appears in the file to stderr.
*<code>phred</code> Displays Quality as phred integers [0-93]
*<code>qual</code> Displays Quality as non-phred integers (phred + 33) [33-126]

By default, these counts include all qualities in the BAM file.

To exclude unmapped reads and soft clips, use --excludeFlags 4.

To only include records that overlap a set of regions, use --regionList and specify a bed file with the regions.  If a read overlaps the region, all qualities will be counted even if those bases do not fall in the region.  If you only want to count qualities that fall within the region, also specify --withinRegion.  Without excluding unmapped reads, it will include soft clips that overlap the region.

==== Optional Phred/Qual Only Parameters ====
===== Within Region (<code>--withinRegion</code>) =====
Use <code>--withinRegion</code> with [[#Qual/Phred (--phred and --qual)|<code>--phred</code> or <code>--qual</code>]] options to only count qualities if they fall within the regions specified using [[#Only Process Certain Regions (--regionList)|<code>--regionList</code>]] (only applicable if [[#Only Process Certain Regions (--regionList)|<code>--regionList</code>]] is also specified).

=== BaseQC (<code>--pBaseQC</code> and <code>--cBaseQC</code> and <code>--baseSum</code>) ===
The <code>pBaseQC</code> and <code>cBaseQC</code> options generate per base statistics.  Only one of these two options can be specified.  They write statistics generated for each position to the file specified after the option (use <code>-</code> to write to STDOUT).  They use the same logic for calculating statistics, but <code>pBaseQC</code> writes the statistics as percentages, and <code>cBaseQC</code> writes them as counts.  The order of the statistics is also different.

The <code>baseSum</code> option can be used with either <code>pBaseQC</code> or <code>cBaseQC</code> or on its own.  <code>baseSum</code> generates a summary of the per position statistics and writes it to stderr.  It calculates the per position base statistics even if they will not be written anywhere (neither <code>pBaseQC</code> nor <code>cBaseQC</code> is specified).

All three options use the same logic for calculating the statistics:
* A read spans a position if the read starts at or before the position, ends at or after the position and the position is not a clip.  CIGAR operations allowed for the position are M/X/=/D/N.  If the CIGAR is '*', only numbers for the specified reference position are incremented.
* Currently there is no special logic to exclude positions/reads where the reference base is 'N' or the read base is 'N'.
<br>
==== Percentage-Based Output Format (<code>--pBaseQC</code>) ====
Order/Descriptions:
{|border=1
! Field !! Description !!style="width: 80px"| Excludes Duplicates, QC Failures !!style="width: 80px"| Excludes Unmapped !!style="width: 80px"|  Excludes MapQual = 255 !!style="width: 80px"| Excludes Below Min MapQual !!style="width: 80px"| Excludes CIGAR Deletions, Skips
|-
| chrom || Chromosome/reference name string from the SAM/BAM
|-
| chromStart || 0-based start position
|-
| chromEnd || 0-based end position (always 1 greater than start and not included in this region)
|-
| Depth || # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
|-
| Q20Bases || # of bases at this position with a base quality (from the read) of Q20 or higher || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
|-
| Q20BasesPct(%) || Q20Bases / Depth || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
|-
| TotalReads || # of reads that span this position || || || || ||
|-
| MappedBases || # of reads marked mapped in the flag || align="center"|X || align="center"|X || || ||
|-
| MappingRate(%) || MappedBases / TotalReads || align="center"|X || align="center"|X || || ||
|-
| MapRate_MQPass(%) || # of reads that have a Mapping Quality &gt;= a minimum Mapping Quality / TotalReads || align="center"|X || align="center"|X || || ||
|-
| ZeroMapQual(%) || # of reads that have a Mapping Quality of 0 / TotalReads || align="center"|X || align="center"|X || || ||
|-
| MapQual&lt;10(%) || # of reads that have a Mapping Quality &lt; 10 / TotalReads || align="center"|X || align="center"|X || || ||
|-
| PairedReads(%) || # of reads marked paired in the flag / TotalReads || align="center"|X || align="center"|X || || ||
|-
| ProperPaired(%) || # of reads marked paired AND proper paired in the flag / TotalReads || align="center"|X || align="center"|X || || ||
|-
| DupRate(%) || # of reads marked duplicate in the flag / TotalReads || || || || ||
|-
| QCFailRate(%) || # of reads marked QC failure in the flag / TotalReads || || || || ||
|-
| AverageMapQuality || sum of included mapping qualities / AverageMapQualCount || align="center"|X || align="center"|X || align="center"|X || ||
|-
| AverageMapQualCount || # of mapping qualities in AverageMapQuality || align="center"|X || align="center"|X || align="center"|X || ||
|-
|}

This output does not include a MapQual255 count.

===== Sample Output =====
<pre>chrom chromStart chromEnd Depth Q20Bases Q20BasesPct(%) TotalReads MappedBases MappingRate(%) MapRate_MQPass(%) ZeroMapQual(%) MapQual&lt;10(%) PairedReads(%) ProperPaired(%) DupRate(%) QCFailRate(%) AverageMapQuality AverageMapQualCount
1 100 101 2 2 100.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 11.000 3
1 101 102 2 0 0.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 11.000 3
1 10023 10024 0 0 0.000 39 30 76.923 51.282 25.641 51.282 84.615 38.462 15.385 15.385 0.000 0
1 10024 10025 14 12 85.714 39 30 76.923 51.282 25.641 51.282 84.615 38.462 15.385 15.385 11.000 21
</pre>
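
The percentage columns can be checked against the count columns of the same row. The following Python sketch is only an illustration: it assumes the fields are whitespace-delimited as displayed above, and `parse_row` and `pct` are hypothetical helper names, not part of bamUtil.

```python
# Header and last row copied from the sample output above.
HEADER = ("chrom chromStart chromEnd Depth Q20Bases Q20BasesPct(%) TotalReads "
          "MappedBases MappingRate(%) MapRate_MQPass(%) ZeroMapQual(%) "
          "MapQual<10(%) PairedReads(%) ProperPaired(%) DupRate(%) QCFailRate(%) "
          "AverageMapQuality AverageMapQualCount")
ROW = ("1 10024 10025 14 12 85.714 39 30 76.923 51.282 25.641 51.282 "
       "84.615 38.462 15.385 15.385 11.000 21")


def parse_row(header, row):
    """Pair each header field with the value in the same column."""
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError("field count mismatch")
    return dict(zip(names, values))


def pct(numerator, denominator):
    """Percentage rounded to 3 decimals; 0.0 for a zero denominator (assumption)."""
    return round(100.0 * numerator / denominator, 3) if denominator else 0.0


rec = parse_row(HEADER, ROW)
q20_pct = pct(int(rec["Q20Bases"]), int(rec["Depth"]))            # Q20Bases / Depth
map_rate = pct(int(rec["MappedBases"]), int(rec["TotalReads"]))   # MappedBases / TotalReads
print(q20_pct, rec["Q20BasesPct(%)"])
print(map_rate, rec["MappingRate(%)"])
```

Both recomputed values agree with the columns reported in the sample row.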

==== Count-Based Output Format (<code>--cBaseQC</code>) ====
Order/Descriptions:
{|border=1
! Field !! Description !!style="width: 80px"| Excludes Duplicates, QC Failures !!style="width: 80px"| Excludes Unmapped !!style="width: 80px"|  Excludes MapQual = 255 !!style="width: 80px"| Excludes Below Min MapQual !!style="width: 80px"| Excludes CIGAR Deletions, Skips
|-
| chrom || Chromosome/reference name string from the SAM/BAM
|-
| chromStart || 0-based start position
|-
| chromEnd || 0-based end position (always 1 greater than start and not included in this region)
|-
| TotalReads || # of reads that span this position || || || || ||
|-
| Dups || # of reads marked duplicate in the flag || || || || ||
|-
| QCFail || # of reads marked QC failure in the flag || || || || ||
|-
| Mapped || # of reads marked mapped in the flag || align="center"|X || align="center"|X || || ||
|-
| Paired || # of reads marked paired in the flag || align="center"|X || align="center"|X || || ||
|-
| ProperPaired || # of reads marked paired AND proper paired in the flag || align="center"|X || align="center"|X || || ||
|-
| ZeroMapQual || # of reads that have a Mapping Quality of 0 || align="center"|X || align="center"|X || || ||
|-
| MapQual&lt;10 || # of reads that have a Mapping Quality &lt; 10 || align="center"|X || align="center"|X || || ||
|-
| MapQual255 || # of reads that have a Mapping Quality = 255 || align="center"|X || align="center"|X || || ||
|-
| PassMapQual || # of reads that have a Mapping Quality &gt;= a minimum Mapping Quality || align="center"|X || align="center"|X || || ||
|-
| AverageMapQuality || sum of included mapping qualities / AverageMapQualCount || align="center"|X || align="center"|X || align="center"|X || ||
|-
| AverageMapQualCount || # of mapping qualities in AverageMapQuality || align="center"|X || align="center"|X || align="center"|X || ||
|-
| Depth || # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
|-
| Q20Bases || # of bases at this position with a base quality (from the read) of Q20 or higher || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
|-
|}

==== Summary of per Position Statistics (<code>--baseSum</code>) ====
Use <code>--baseSum</code> to print an overall summary of the baseQC for the file to stderr.

This option can be used with or without <code>--pBaseQC</code> and <code>--cBaseQC</code>.

The values are tab delimited.  First there is a header line describing the summary.  The next line has the Means, and the last line has the Standard Deviations.

{|border=1
! Field !! Description !!style="width: 80px"| Excludes Duplicates, QC Failures !!style="width: 80px"| Excludes Unmapped !!style="width: 80px"|  Excludes MapQual = 255 !!style="width: 80px"| Excludes Below Min MapQual !!style="width: 80px"| Excludes CIGAR Deletions, Skips
|-
| TotalReads || # of reads that span this position || || || || ||
|-
| Dups || # of reads marked duplicate in the flag || || || || ||
|-
| QCFail || # of reads marked QC failure in the flag || || || || ||
|-
| Mapped || # of reads marked mapped in the flag || align="center"|X || align="center"|X || || ||
|-
| Paired || # of reads marked paired in the flag || align="center"|X || align="center"|X || || ||
|-
| ProperPaired || # of reads marked paired AND proper paired in the flag || align="center"|X || align="center"|X || || ||
|-
| ZeroMapQual || # of reads that have a Mapping Quality of 0 || align="center"|X || align="center"|X || || ||
|-
| MapQual&lt;10 || # of reads that have a Mapping Quality &lt; 10 || align="center"|X || align="center"|X || || ||
|-
| MapQual255 || # of reads that have a Mapping Quality = 255 || align="center"|X || align="center"|X || || ||
|-
| PassMapQual || # of reads that have a Mapping Quality &gt;= a minimum Mapping Quality || align="center"|X || align="center"|X || || ||
|-
| AverageMapQuality || sum of included mapping qualities / AverageMapQualCount || align="center"|X || align="center"|X || align="center"|X || ||
|-
| AverageMapQualCount || # of mapping qualities in AverageMapQuality || align="center"|X || align="center"|X || align="center"|X || ||
|-
| Depth || # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
|-
| Q20Bases || # of bases at this position with a base quality (from the read) of Q20 or higher || align="center"|X || align="center"|X || align="center"|X || align="center"|X || align="center"|X
|-
|}

===== Sample Output =====
<pre>
Summary of Pileup Stats (1st Row is Mean, 2nd Row is Standard Deviation)
TotalReads Dups QCFail Mapped Paired ProperPaired ZeroMapQual MapQual<10 MapQual255 PassMapQual AverageMapQuality AverageMapQualCount
Depth Q20Bases
14.307692 1.846154 1.846154 8.769231 7.846154 0.923077 2.923077 5.846154 0.000000 2.923077 11.000000 8.769231 2.076923 1.153846
17.670053 2.882307 2.882307 9.038380 7.603137 1.441153 3.012793 6.025586 0.000000 3.012793 0.000000 9.038380 2.841993 1.993579
</pre>
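
A summary like the one above can be read back into per-field (mean, standard deviation) pairs. This Python sketch is a minimal illustration under stated assumptions: `parse_base_sum` is a hypothetical helper, the last two non-empty lines are taken as the means and standard deviations, and every line between the title and the means is treated as field names, since the sample shows the names on two lines.

```python
# Sample text copied from the summary above (whitespace-delimited here).
SUMMARY = """Summary of Pileup Stats (1st Row is Mean, 2nd Row is Standard Deviation)
TotalReads Dups QCFail Mapped Paired ProperPaired ZeroMapQual MapQual<10 MapQual255 PassMapQual AverageMapQuality AverageMapQualCount
Depth Q20Bases
14.307692 1.846154 1.846154 8.769231 7.846154 0.923077 2.923077 5.846154 0.000000 2.923077 11.000000 8.769231 2.076923 1.153846
17.670053 2.882307 2.882307 9.038380 7.603137 1.441153 3.012793 6.025586 0.000000 3.012793 0.000000 9.038380 2.841993 1.993579
"""


def parse_base_sum(text):
    """Return {field: (mean, standard_deviation)} from a baseSum summary."""
    lines = [line for line in text.splitlines() if line.strip()]
    means = [float(v) for v in lines[-2].split()]
    std_devs = [float(v) for v in lines[-1].split()]
    # Skip the title line; join any remaining lines as the field names.
    names = " ".join(lines[1:-2]).split()
    if not (len(names) == len(means) == len(std_devs)):
        raise ValueError("field count mismatch")
    return {n: (m, s) for n, m, s in zip(names, means, std_devs)}


summary = parse_base_sum(SUMMARY)
print(summary["TotalReads"])
print(summary["Depth"])
```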

==== Optional BaseQC Only Parameters ====
===== Pileup Buffer Size (<code>--bufferSize</code>) =====
Use the <code>--bufferSize</code> option followed by the size of the pileup buffer to use for [[#BaseQC (--pBaseQC and --cBaseQC and --baseSum)|baseQC]] stats.  The default is 1024.

===== Minimum Mapping Quality (<code>--minMapQual</code>) =====
Use the <code>--minMapQual</code> option followed by the minimum mapping quality for filtering reads in the [[#BaseQC (--pBaseQC and --cBaseQC and --baseSum)|baseQC]] stats.

===== DBSNP File (<code>--dbsnp</code>) =====
Use the <code>--dbsnp</code> option followed by the name of the dbsnp file to specify the positions to exclude from [[#BaseQC (--pBaseQC and --cBaseQC and --baseSum)|baseQC]] analysis.

{{PhoneHomeParameters}}

= Return Value =
0 on Success, non-0 on failure

[[Category:BamUtil|stats]] [[Category:BAM_Software]] [[Category:Software]]

Latest revision as of 15:59, 24 August 2017


Overview of the stats function of bamUtil

The stats option on the BamUtil executable generates the specified statistics on a SAM/BAM file.

Troubleshooting

See BamUtil: FAQ -> BamUtil: stats for troubleshooting help.

Usage

 ./bam stats --in <inputFile> [--basic] [--qual] [--phred] [--pBaseQC <outputFileName>] [--cBaseQC <outputFileName>] [--maxNumReads <maxNum>] [--unmapped] [--bamIndex <bamIndexFile>] [--regionList <regFileName>] [--requiredFlags <integerRequiredFlags>] [--excludeFlags <integerExcludeFlags>] [--noeof] [--params] [--withinRegion] [--baseSum] [--bufferSize <buffSize>] [--minMapQual <minMapQ>] [--dbsnp <dbsnpFile>]

Parameters

        Required Parameters:
                --in : the SAM/BAM file to calculate stats for
        Types of Statistics that can be generated:
                --basic         : Turn on basic statistic generation
                --qual          : Generate a count for each quality (displayed as non-phred quality)
                --phred         : Generate a count for each quality (displayed as phred quality)
                --pBaseQC       : Write per base statistics as Percentages to the specified file. (use - for stdout)
                                  pBaseQC & cBaseQC cannot both be specified.
                --cBaseQC       : Write per base statistics as Counts to the specified file. (use - for stdout)
                                  pBaseQC & cBaseQC cannot both be specified.
        Optional Parameters:
                --maxNumReads   : Maximum number of reads to process
                                  Defaults to -1 to indicate all reads.
                --unmapped      : Only process unmapped reads (requires a bamIndex file)
                --bamIndex      : The path/name of the bam index file
                                  (if required and not specified, uses the --in value + ".bai")
                --regionList    : File containing the regions to be processed chr<tab>start_pos<tab>end_pos.
                                  Positions are 0 based and the end_pos is not included in the region.
                                  Uses bamIndex.
                --excludeFlags  : Skip any records with any of the specified flags set
                                  (specify an integer representation of the flags)
                --requiredFlags : Only process records with all of the specified flags set
                                  (specify an integer representation of the flags)
                --noeof         : Do not expect an EOF block on a bam file.
                --params        : Print the parameter settings.
        Optional phred/qual Only Parameters:
                --withinRegion  : Only count qualities if they fall within regions specified.
                                  Only applicable if regionList is also specified.
        Optional BaseQC Only Parameters:
                --baseSum       : Print an overall summary of the baseQC for the file to stderr.
                --bufferSize    : Size of the pileup buffer for calculating the BaseQC parameters.
                                  Default: 1024
                --minMapQual    : The minimum mapping quality for filtering reads in the baseQC stats.
                --dbsnp         : The dbSnp file of positions to exclude from baseQC analysis.
	PhoneHome:
		--noPhoneHome       : disable PhoneHome (default enabled)
		--phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)

Required Parameters

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.

Use - to read from stdin; the extension that follows it determines the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file --in yourFileName
SAM from stdin --in -
BAM from stdin --in -.bam
Uncompressed BAM from stdin --in -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
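
The stdin naming rules above can be written as a small lookup. This Python sketch is only illustrative: `stdin_input_type` is a hypothetical helper, not part of bamUtil, and it covers only the three stdin forms listed on this page.

```python
def stdin_input_type(in_value):
    """Return the file type implied by a stdin-style --in value.

    Hypothetical helper that mirrors the list above; it is not bamUtil code.
    """
    if not in_value.startswith("-"):
        raise ValueError("not a stdin designation: " + in_value)
    extension = in_value[1:]  # text after the leading "-"
    types = {
        "": "SAM",                    # no extension indicates SAM
        ".bam": "BAM",
        ".ubam": "Uncompressed BAM",
    }
    if extension not in types:
        raise ValueError("extension not covered by this sketch: " + extension)
    return types[extension]


print(stdin_input_type("-"))       # SAM
print(stdin_input_type("-.bam"))   # BAM
print(stdin_input_type("-.ubam"))  # Uncompressed BAM
```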

Optional Parameters

Maximum number of reads to process (--maxNumReads)

Use --maxNumReads followed by a number to indicate the maximum number of reads to process before exiting. By default, it is set to -1 to indicate all reads should be processed.

Only Process Unmapped Reads (--unmapped)

Use --unmapped to process only unmapped reads.

This parameter requires --bamIndex.

Bam Index File (--bamIndex)

Use --bamIndex followed by your file name to specify the BAM index file to use for reading the BAM file.

If this file is required but not specified, it will use the input file name + ".bai".

Only Process Certain Regions (--regionList)

Use --regionList followed by the filename to process only the regions specified in the file.

The positions in the file are specified one per line with the following format: chr<tab>start_pos<tab>end_pos.

Positions are 0 based and the end_pos is not included in the region.

This parameter requires --bamIndex.
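
Because the region file uses 0-based starts with an excluded end, coordinates quoted as 1-based inclusive need converting before they are written. The Python sketch below illustrates the arithmetic; `region_line` and `in_region` are hypothetical helper names, and the chromosome and positions are made-up values.

```python
def region_line(chrom, first_base_1based, last_base_1based):
    """Build one chr<tab>start_pos<tab>end_pos line from 1-based inclusive coordinates."""
    start_pos = first_base_1based - 1  # shift to 0-based
    end_pos = last_base_1based         # excluded end equals the 1-based last base
    return f"{chrom}\t{start_pos}\t{end_pos}"


def in_region(pos_0based, start_pos, end_pos):
    """True if a 0-based position falls inside [start_pos, end_pos)."""
    return start_pos <= pos_0based < end_pos


print(repr(region_line("1", 101, 200)))  # '1\t100\t200'
print(in_region(100, 100, 200))          # True: start_pos is included
print(in_region(200, 100, 200))          # False: end_pos is not included
```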

Exclude Flags (--excludeFlags)

Use --excludeFlags followed by an integer representation of the flags to skip any records with any of the specified flags set.

Required Flags (--requiredFlags)

Use --requiredFlags followed by an integer representation of the flags to only process records with all of the specified flags set.
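
The integer passed to either option is a bitwise OR of SAM flag bits. The Python sketch below shows how such an integer can be built and how the "any" versus "all" rules differ. The bit values come from the SAM format specification rather than from this page, and `flag_value` and `is_processed` are hypothetical helpers that only imitate the filtering described above.

```python
# Flag bits as defined by the SAM format specification.
SAM_FLAGS = {
    "paired": 0x1,
    "proper_pair": 0x2,
    "unmapped": 0x4,
    "qc_fail": 0x200,
    "duplicate": 0x400,
}


def flag_value(*names):
    """Combine named flags into the integer form the options expect."""
    value = 0
    for name in names:
        value |= SAM_FLAGS[name]
    return value


def is_processed(record_flag, required_flags=0, exclude_flags=0):
    """Skip if ANY excluded bit is set; keep only if ALL required bits are set."""
    if record_flag & exclude_flags:
        return False
    return (record_flag & required_flags) == required_flags


exclude = flag_value("unmapped", "duplicate")   # 4 + 1024
require = flag_value("paired", "proper_pair")   # 1 + 2
print(exclude, require)                         # 1028 3
print(is_processed(99, require, exclude))       # True: paired, proper pair, no excluded bit
print(is_processed(1123, require, exclude))     # False: duplicate bit is set
print(is_processed(65, require, exclude))       # False: paired but not proper pair
```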

Types of Statistics

Basic (--basic)

Prints summary statistics for the file:

  • TotalReads - # of reads that are in the file
  • MappedReads - # of reads marked mapped in the flag
  • PairedReads - # of reads marked paired in the flag
  • ProperPair - # of reads marked paired AND proper paired in the flag
  • DuplicateReads - # of reads marked duplicate in the flag
  • QCFailureReads - # of reads marked QC failure in the flag
  • MappingRate(%) - # of reads marked mapped in the flag / TotalReads
  • PairedReads(%) - # of reads marked paired in the flag / TotalReads
  • ProperPair(%) - # of reads marked paired AND proper paired in the flag / TotalReads
  • DupRate(%) - # of reads marked duplicate in the flag / TotalReads
  • QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
  • TotalBases - # of bases in all reads
  • BasesInMappedReads - # of bases in reads marked mapped in the flag
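As a rough illustration of how these fields relate to the SAM flag, the following Python sketch recomputes several of them from (flag, read length) pairs. The flag bit values come from the SAM specification; the function, its input format, and the scaling of the rates to percentages are assumptions made for this example rather than bamUtil's implementation.

```python
# Illustrative recomputation of some --basic fields from (flag, read_length)
# pairs. Flag bit values come from the SAM specification; the function and
# its input format are assumptions made for this sketch, not bamUtil code.

PAIRED, PROPER_PAIR, UNMAPPED = 0x1, 0x2, 0x4
QC_FAIL, DUPLICATE = 0x200, 0x400


def basic_stats(reads):
    stats = {"TotalReads": 0, "MappedReads": 0, "PairedReads": 0,
             "ProperPair": 0, "DuplicateReads": 0, "QCFailureReads": 0,
             "TotalBases": 0, "BasesInMappedReads": 0}
    for flag, length in reads:
        stats["TotalReads"] += 1
        stats["TotalBases"] += length
        if not flag & UNMAPPED:
            stats["MappedReads"] += 1
            stats["BasesInMappedReads"] += length
        if flag & PAIRED:
            stats["PairedReads"] += 1
            if flag & PROPER_PAIR:  # paired AND proper paired
                stats["ProperPair"] += 1
        if flag & DUPLICATE:
            stats["DuplicateReads"] += 1
        if flag & QC_FAIL:
            stats["QCFailureReads"] += 1
    total = stats["TotalReads"]
    if total:
        # Rates are assumed to be reported as percentages of TotalReads.
        stats["MappingRate(%)"] = 100.0 * stats["MappedReads"] / total
        stats["DupRate(%)"] = 100.0 * stats["DuplicateReads"] / total
    return stats


example = basic_stats([(0x1 | 0x2, 50), (0x4, 50), (0x1 | 0x400, 50), (0, 50)])
print(example["MappedReads"], example["MappingRate(%)"])  # 3 75.0
```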

Qual/Phred (--phred and --qual)

Prints to stderr a count of the number of times each quality value appears in the file.

  • --phred - Displays quality as phred integers [0-93]
  • --qual - Displays quality as non-phred integers (phred + 33) [33-126]
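The two displays differ only by the offset of 33. A small Python sketch of the relationship, with hypothetical function names, counts the characters of some quality strings and re-keys the counts as phred scores:

```python
from collections import Counter

# Sketch of the relationship between the two displays: the SAM quality
# string stores each phred score as the character with code phred + 33.
# Function names are hypothetical, for illustration only.

def count_qualities(quality_strings):
    """Count every quality character, keyed by its non-phred value (33-126)."""
    counts = Counter()
    for qual in quality_strings:
        counts.update(ord(ch) for ch in qual)
    return counts


def to_phred(counts):
    """Re-key non-phred counts (33-126) as phred scores (0-93)."""
    return Counter({value - 33: n for value, n in counts.items()})


qual_counts = count_qualities(["II5!", "I5"])
print(sorted(qual_counts.items()))            # [(33, 1), (53, 2), (73, 3)]
print(sorted(to_phred(qual_counts).items()))  # [(0, 1), (20, 2), (40, 3)]
```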

By default, these counts include all qualities in the BAM file.

To exclude unmapped reads and soft clips, use --excludeFlags 4.

To only include records that overlap a set of regions, use --regionList and specify a bed file with the regions. If a read overlaps the region, all qualities will be counted even if those bases do not fall in the region. If you only want to count qualities that fall within the region, also specify --withinRegion. Without excluding unmapped reads, it will include soft clips that overlap the region.

Optional Phred/Qual Only Parameters

Within Region (--withinRegion)

Use --withinRegion with --phred or --qual options to only count qualities if they fall within the regions specified using --regionList (only applicable if --regionList is also specified).
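The difference between the default behaviour and --withinRegion can be sketched in Python as follows. To stay short, the sketch assumes an ungapped read (every base aligned, no clips or indels) and a single region; the function name and arguments are hypothetical.

```python
from collections import Counter

# Sketch of quality counting with and without --withinRegion, assuming an
# ungapped read and one 0-based, end-exclusive region. Hypothetical helper,
# not bamUtil code.

def count_read_qualities(start, quals, region_start, region_end,
                         within_region=False):
    """quals: phred scores of a read whose first base is at 0-based `start`."""
    counts = Counter()
    end = start + len(quals)
    if start >= region_end or end <= region_start:
        return counts  # the read does not overlap the region at all
    for offset, qual in enumerate(quals):
        pos = start + offset
        if within_region and not (region_start <= pos < region_end):
            continue  # skip bases outside the region
        counts[qual] += 1
    return counts


# The read covers positions 98-101; the region is [100, 105).
print(sum(count_read_qualities(98, [10, 20, 30, 40], 100, 105).values()))  # 4
print(sorted(count_read_qualities(98, [10, 20, 30, 40], 100, 105,
                                  within_region=True)))  # [30, 40]
```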

BaseQC (--pBaseQC and --cBaseQC and --baseSum)

The pBaseQC and cBaseQC options generate per base statistics. Only one of these two options can be specified. They write statistics generated for each position to the file specified after the option (use - to write to STDOUT). They use the same logic for calculating statistics, but pBaseQC writes the statistics as percentages, and cBaseQC writes them as counts. The order of the statistics is also different.

The baseSum option can be used with either pBaseQC or cBaseQC or on its own. baseSum generates a summary of the per position statistics and writes it to stderr. It calculates the per position base statistics even if they will not be written anywhere (neither pBaseQC nor cBaseQC is specified).


All three options use the same logic for calculating the statistics:

  • A read spans a position if the read starts at or before the position, ends at or after the position and the position is not a clip. CIGAR operations allowed for the position are M/X/=/D/N. If the CIGAR is '*', only numbers for the specified reference position are incremented.
  • Currently there is no special logic to exclude positions/reads where the reference base is 'N' or the read base is 'N'.
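The spanning rule can be sketched as a walk over the CIGAR string. This Python example is an illustration of the rule as stated above, not bamUtil's pileup code; the function name is hypothetical.

```python
import re

# Sketch of the "read spans a position" rule: walk the CIGAR and collect the
# 0-based reference positions covered by M/X/=/D/N operations. Clips (S/H)
# and insertions (I) do not add reference positions. Illustrative only.

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
SPANNING_OPS = set("MX=DN")


def spanned_positions(start, cigar):
    """Return the reference positions a read spans, given its 0-based start."""
    if cigar == "*":
        return [start]  # no CIGAR: only the specified reference position
    positions = []
    ref_pos = start
    for length, op in CIGAR_OP.findall(cigar):
        if op in SPANNING_OPS:
            positions.extend(range(ref_pos, ref_pos + int(length)))
            ref_pos += int(length)
    return positions


print(spanned_positions(100, "2S3M1I2D2M"))
# [100, 101, 102, 103, 104, 105, 106]
```

Note that the deleted positions 103 and 104 are spanned by the read, even though they are excluded from some of the statistics.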


Percentage-Based Output Format (--pBaseQC)

Order/Descriptions:

Field Description Excludes Duplicates, QC Failures Excludes Unmapped Excludes MapQual = 255 Excludes Below Min MapQual Excludes CIGAR Deletions, Skips
chrom Chromosome/reference name string from the SAM/BAM
chromStart 0-based start position
chromEnd 0-based end position (always 1 greater than start and not included in this region)
Depth # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures X X X X X
Q20Bases # of bases at this position with a base quality (from the read) of Q20 or higher X X X X X
Q20BasesPct(%) Q20Bases / Depth X X X X X
TotalReads # of reads that span this position
MappedBases # of reads marked mapped in the flag X X
MappingRate(%) MappedBases / TotalReads X X
MapRate_MQPass(%) # of reads that have a Mapping Quality >= a minimum Mapping Quality / TotalReads X X
ZeroMapQual(%) # of reads that have a Mapping Quality of 0 / TotalReads X X
MapQual<10(%) # of reads that have a Mapping Quality < 10 / TotalReads X X
PairedReads(%) # of reads marked paired in the flag / TotalReads X X
ProperPaired(%) # of reads marked paired AND proper paired in the flag / TotalReads X X
DupRate(%) # of reads marked duplicate in the flag / TotalReads
QCFailRate(%) # of reads marked QC failure in the flag / TotalReads
AverageMapQuality sum of included mapping qualities / AverageMapQualCount X X X
AverageMapQualCount # of mapping qualities in AverageMapQuality X X X

This output does not include a MapQual255 count.
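Depth is the most heavily filtered field, since all five exclusion columns apply to it. The following Python sketch spells out those exclusions for a single read at a single position. The function name and arguments are hypothetical, and the flag bit values are taken from the SAM specification.

```python
# Sketch of which reads count toward Depth at a position, following the five
# exclusion columns marked for Depth. Names are hypothetical; flag bit values
# come from the SAM specification.

UNMAPPED, QC_FAIL, DUPLICATE = 0x4, 0x200, 0x400


def counts_toward_depth(flag, map_qual, cigar_op, min_map_qual):
    """cigar_op is the CIGAR operation covering the position (e.g. 'M')."""
    if flag & (DUPLICATE | QC_FAIL):
        return False  # duplicates and QC failures
    if flag & UNMAPPED:
        return False  # unmapped reads
    if map_qual == 255:
        return False  # MapQual = 255
    if map_qual < min_map_qual:
        return False  # below the minimum mapping quality (--minMapQual)
    if cigar_op in ("D", "N"):
        return False  # CIGAR deletions and skips
    return True


print(counts_toward_depth(0, 30, "M", 10))      # True
print(counts_toward_depth(0x400, 30, "M", 10))  # False
print(counts_toward_depth(0, 30, "D", 10))      # False
```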


Sample Output
chrom	chromStart	chromEnd	Depth	Q20Bases	Q20BasesPct(%)	TotalReads	MappedBases	MappingRate(%)	MapRate_MQPass(%)	ZeroMapQual(%)	MapQual<10(%)	PairedReads(%)	ProperPaired(%)	DupRate(%)	QCFailRate(%)	AverageMapQuality	AverageMapQualCount
1	100	101	2	2	100.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	101	102	2	0	0.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	102	103	0	0	0.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	0.000	0
1	103	104	0	0	0.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	0.000	0
1	104	105	2	0	0.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	105	106	2	2	100.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	110	111	0	0	0.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	0.000	0
1	111	112	2	2	100.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	112	113	2	2	100.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	10012	10013	14	0	0.000	42	33	78.571	52.381	26.190	52.381	85.714	35.714	14.286	14.286	11.000	21
1	10013	10014	14	10	71.429	39	30	76.923	51.282	25.641	51.282	84.615	38.462	15.385	15.385	11.000	21
1	10023	10024	0	0	0.000	39	30	76.923	51.282	25.641	51.282	84.615	38.462	15.385	15.385	0.000	0
1	10024	10025	14	12	85.714	39	30	76.923	51.282	25.641	51.282	84.615	38.462	15.385	15.385	11.000	21
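Because the output is tab delimited with a header line, it can be read with a standard CSV parser. The Python sketch below recomputes two percentage columns from the counts on the same row, using two rows from the sample output truncated to their first nine columns for brevity. Returning 0 for a zero denominator is an assumption, chosen because the sample rows with Depth 0 report 0.000.

```python
import csv
import io

# Sketch: read tab-delimited --pBaseQC output and recompute two of the
# percentage columns from the counts on the same row. The two data rows are
# taken from the sample output above, truncated to the first nine columns.

SAMPLE = (
    "chrom\tchromStart\tchromEnd\tDepth\tQ20Bases\tQ20BasesPct(%)\t"
    "TotalReads\tMappedBases\tMappingRate(%)\n"
    "1\t10012\t10013\t14\t0\t0.000\t42\t33\t78.571\n"
    "1\t10013\t10014\t14\t10\t71.429\t39\t30\t76.923\n"
)


def percent(numerator, denominator):
    """Percentage with a zero-denominator guard (assumed to yield 0)."""
    return 100.0 * numerator / denominator if denominator else 0.0


for row in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"):
    q20_pct = percent(int(row["Q20Bases"]), int(row["Depth"]))
    map_rate = percent(int(row["MappedBases"]), int(row["TotalReads"]))
    print(row["chromStart"], f"{q20_pct:.3f}", f"{map_rate:.3f}")
# 10012 0.000 78.571
# 10013 71.429 76.923
```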


Count-Based Output Format (--cBaseQC)

Order/Descriptions:

Field Description Excludes Duplicates, QC Failures Excludes Unmapped Excludes MapQual = 255 Excludes Below Min MapQual Excludes CIGAR Deletions, Skips
chrom Chromosome/reference name string from the SAM/BAM
chromStart 0-based start position
chromEnd 0-based end position (always 1 greater than start and not included in this region)
TotalReads # of reads that span this position
Dups # of reads marked duplicate in the flag
QCFail # of reads marked QC failure in the flag
Mapped # of reads marked mapped in the flag X X
Paired # of reads marked paired in the flag X X
ProperPaired # of reads marked paired AND proper paired in the flag X X
ZeroMapQual # of reads that have a Mapping Quality of 0 X X
MapQual<10 # of reads that have a Mapping Quality < 10 X X
MapQual255 # of reads that have a Mapping Quality = 255 X X
PassMapQual # of reads that have a Mapping Quality >= a minimum Mapping Quality X X
AverageMapQuality sum of included mapping qualities / AverageMapQualCount X X X
AverageMapQualCount # of mapping qualities in AverageMapQuality X X X
Depth # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures X X X X X
Q20Bases # of bases at this position with a base quality (from the read) of Q20 or higher X X X X X

Summary of per Position Statistics (--baseSum)

Use --baseSum to print an overall summary of the baseQC for the file to stderr.

This option can be used with or without --pBaseQC and --cBaseQC.

The values are tab delimited. The output begins with a title line describing the summary, followed by a header line of field names, a line with the means, and a line with the standard deviations.

Field Description Excludes Duplicates, QC Failures Excludes Unmapped Excludes MapQual = 255 Excludes Below Min MapQual Excludes CIGAR Deletions, Skips
TotalReads # of reads that span this position
Dups # of reads marked duplicate in the flag
QCFail # of reads marked QC failure in the flag
Mapped # of reads marked mapped in the flag X X
Paired # of reads marked paired in the flag X X
ProperPaired # of reads marked paired AND proper paired in the flag X X
ZeroMapQual # of reads that have a Mapping Quality of 0 X X
MapQual<10 # of reads that have a Mapping Quality < 10 X X
MapQual255 # of reads that have a Mapping Quality = 255 X X
PassMapQual # of reads that have a Mapping Quality >= a minimum Mapping Quality X X
AverageMapQuality sum of included mapping qualities / AverageMapQualCount X X X
AverageMapQualCount # of mapping qualities in AverageMapQuality X X X
Depth # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures X X X X X
Q20Bases # of bases at this position with a base quality (from the read) of Q20 or higher X X X X X


Sample Output
Summary of Pileup Stats (1st Row is Mean, 2nd Row is Standard Deviation)
TotalReads	Dups	QCFail	Mapped	Paired	ProperPaired	ZeroMapQual	MapQual<10	MapQual255	PassMapQual	AverageMapQuality	AverageMapQualCount	Depth	Q20Bases
14.307692	1.846154	1.846154	8.769231	7.846154	0.923077	2.923077	5.846154	0.000000	2.923077	11.000000	8.769231	2.076923	1.153846
17.670053	2.882307	2.882307	9.038380	7.603137	1.441153	3.012793	6.025586	0.000000	3.012793	0.000000	9.038380	2.841993	1.993579
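For downstream use, the summary can be turned into a mapping from field name to (mean, standard deviation). The Python sketch below assumes the title line is followed by a single tab-delimited header line, then the means, then the standard deviations; the function name is hypothetical and the example input is shortened to the first three fields of the sample output.

```python
# Sketch: pair each --baseSum field with its mean and standard deviation.
# Assumes a title line, one tab-delimited header line, a line of means, and
# a line of standard deviations. Hypothetical helper, not part of bamUtil.

def parse_base_sum(text):
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[1].split("\t")
    means = [float(value) for value in lines[2].split("\t")]
    sds = [float(value) for value in lines[3].split("\t")]
    return {name: (mean, sd) for name, mean, sd in zip(header, means, sds)}


summary = (
    "Summary of Pileup Stats (1st Row is Mean, 2nd Row is Standard Deviation)\n"
    "TotalReads\tDups\tQCFail\n"
    "14.307692\t1.846154\t1.846154\n"
    "17.670053\t2.882307\t2.882307\n"
)
print(parse_base_sum(summary)["Dups"])  # (1.846154, 2.882307)
```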

Optional BaseQC Only Parameters

Pileup Buffer Size (--bufferSize)

Use the --bufferSize option followed by the size of the pileup buffer to use for baseQC stats.

Minimum Mapping Quality (--minMapQual)

Use the --minMapQual option followed by the minimum mapping quality for filtering reads in the baseQC stats.

DBSNP File (--dbsnp)

Use the --dbsnp option followed by the name of the dbsnp file to specify the positions to exclude from baseQC analysis.

PhoneHome Parameters

See PhoneHome for more information on how PhoneHome works and what it does.

Turn off PhoneHome (--noPhoneHome)

Use the --noPhoneHome option to completely disable PhoneHome. By default, PhoneHome is enabled and runs at the frequency set by the thinning parameter.

Adjust the Frequency of PhoneHome (--phoneHomeThinning)

Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).

  • By default, --phoneHomeThinning is set to 50, running 50% of the time.
  • PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
  • N/A if --noPhoneHome is set.
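The thinning rule above amounts to a single comparison. In this Python sketch, the `random` module stands in for whatever random number source bamUtil actually uses, which is not described here; the function name and arguments are hypothetical.

```python
import random

# Sketch of the thinning rule described above; bamUtil's actual random
# number source is not documented here, so Python's `random` stands in.

def should_phone_home(thinning=50, no_phone_home=False, rng=random):
    """Return True if PhoneHome would run for this invocation."""
    if no_phone_home:
        return False  # --noPhoneHome disables PhoneHome entirely
    return rng.randrange(2**31) % 100 < thinning


print(should_phone_home(thinning=100))  # True: every value modulo 100 is < 100
print(should_phone_home(thinning=0))    # False: no value modulo 100 is < 0
```

With the default of 50, roughly half of all runs satisfy the condition.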

Return Value

Returns 0 on success and non-0 on failure.
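When stats is called from a script, the return value is the natural way to detect failure. The Python sketch below builds a command line from a few of the options documented on this page and treats any non-zero exit status as a failure. The `./bam` path, the helper names, and the choice of options are assumptions for illustration.

```python
import subprocess

# Sketch: build a `bam stats` command line from options documented above and
# treat any non-zero exit status as a failure. The `./bam` path and the
# helper names are assumptions for illustration.

def build_stats_command(in_file, basic=False, max_num_reads=None,
                        exclude_flags=None, executable="./bam"):
    cmd = [executable, "stats", "--in", in_file]
    if basic:
        cmd.append("--basic")
    if max_num_reads is not None:
        cmd += ["--maxNumReads", str(max_num_reads)]
    if exclude_flags is not None:
        cmd += ["--excludeFlags", str(exclude_flags)]
    return cmd


def run_stats(cmd):
    """Run the command; return True on exit status 0, False otherwise."""
    return subprocess.run(cmd).returncode == 0


print(build_stats_command("sample.bam", basic=True, exclude_flags=4))
# ['./bam', 'stats', '--in', 'sample.bam', '--basic', '--excludeFlags', '4']
```

Calling run_stats on the resulting list requires a built bam executable at the given path.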