BamUtil: stats
Overview of the stats function of bamUtil
The stats option on the BamUtil executable generates the specified statistics on a SAM/BAM file.
Usage
./bam stats --in <inputFile> [--basic] [--qual] [--phred] [--pBaseQC <outputFileName>] [--cBaseQC <outputFileName>] [--baseSum] [--maxNumReads <maxNum>] [--unmapped] [--bamIndex <bamIndexFile>] [--regionList <regFileName>] [--minMapQual <minMapQ>] [--dbsnp <dbsnpFile>] [--noeof] [--params]
Parameters
Required Parameters:
    --in : the SAM/BAM file to calculate stats for
Types of Statistics that can be generated:
    --basic : Turn on basic statistic generation
    --qual : Generate a count for each quality (displayed as non-phred quality)
    --phred : Generate a count for each quality (displayed as phred quality)
    --pBaseQC : Write per base statistics as Percentages to the specified file. pBaseQC & cBaseQC cannot both be specified.
    --cBaseQC : Write per base statistics as Counts to the specified file. pBaseQC & cBaseQC cannot both be specified.
Optional Parameters:
    --maxNumReads : Maximum number of reads to process. Defaults to -1 to indicate all reads.
    --unmapped : Only process unmapped reads (requires a bamIndex file)
    --bamIndex : The path/name of the bam index file (if required and not specified, uses the --in value + ".bai")
    --regionList : File containing the regions to be processed: chr<tab>start_pos<tab>end_pos. Positions are 0 based and the end_pos is not included in the region. Uses bamIndex.
    --minMapQual : The minimum mapping quality for filtering reads in the baseQC stats.
    --dbsnp : The dbSnp file of positions to exclude from baseQC analysis.
    --noeof : Do not expect an EOF block on a bam file.
    --params : Print the parameter settings.
Optional BaseQC Only Parameters:
    --baseSum : Print an overall summary of the baseQC for the file to stderr.
For all types of statistics, the bam file used is specified by --in.
The optional parameters are used for all types of statistics.
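The --regionList description above uses 0-based start positions with the end position excluded. As a rough illustration, the following Python sketch converts 1-based, inclusive intervals into lines of that form. The helper name `to_region_line` and the example coordinates are hypothetical and are not part of bamUtil.

```python
def to_region_line(chrom: str, start_1based: int, end_1based_inclusive: int) -> str:
    """Format one region as chr<tab>start_pos<tab>end_pos (0-based, end excluded)."""
    if start_1based < 1 or end_1based_inclusive < start_1based:
        raise ValueError("expected 1 <= start <= end")
    start0 = start_1based - 1               # shift the start to 0-based
    end0_exclusive = end_1based_inclusive   # inclusive 1-based end equals exclusive 0-based end
    return f"{chrom}\t{start0}\t{end0_exclusive}"


# Hypothetical regions, given as 1-based inclusive coordinates.
regions = [("1", 101, 113), ("1", 10013, 10025)]
region_lines = [to_region_line(*region) for region in regions]
print("\n".join(region_lines))
```

The printed lines could be saved to a file and passed to --regionList; since that option uses the BAM index, an index file would also need to be available.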
Input File (--in)
Use --in followed by your file name to specify the SAM/BAM input file.
The program automatically determines whether your input file is SAM, BAM, or uncompressed BAM without any input from the user other than the file name, unless your input file is stdin.
Use - to read from stdin; in that case the extension determines the file type (no extension indicates SAM).
SAM/BAM/Uncompressed BAM from file | --in yourFileName |
SAM from stdin | --in - |
BAM from stdin | --in -.bam |
Uncompressed BAM from stdin | --in -.ubam |
Note: Uncompressed BAM is compressed using compression level 0 (so it is not an entirely uncompressed file). This matches the samtools implementation, so pipes between our tools and samtools are supported.
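The stdin conventions in the table above can be summarized as a small lookup. This Python sketch only mirrors that table; the function name `stdin_file_type` is hypothetical, and raising an error for any other extension is this sketch's own choice, since the behavior for unlisted extensions is not described here.

```python
# Mapping taken from the table above: the text after "-" selects the type.
STDIN_TYPES = {
    "-": "SAM",
    "-.bam": "BAM",
    "-.ubam": "Uncompressed BAM",
}


def stdin_file_type(in_value: str) -> str:
    """Return the file type implied by a stdin-style --in value."""
    try:
        return STDIN_TYPES[in_value]
    except KeyError:
        # Assumption of this sketch: anything else is rejected.
        raise ValueError(f"not a documented stdin value: {in_value!r}") from None


for value in ("-", "-.bam", "-.ubam"):
    print(value, "->", stdin_file_type(value))
```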
Types of Statistics
Basic (--basic)
Prints summary statistics for the file:
- TotalReads - # of reads that are in the file
- MappedReads - # of reads marked mapped in the flag
- PairedReads - # of reads marked paired in the flag
- ProperPair - # of reads marked paired AND proper paired in the flag
- DuplicateReads - # of reads marked duplicate in the flag
- QCFailureReads - # of reads marked QC failure in the flag
- MappingRate(%) - # of reads marked mapped in the flag / TotalReads
- PairedReads(%) - # of reads marked paired in the flag / TotalReads
- ProperPair(%) - # of reads marked paired AND proper paired in the flag / TotalReads
- DupRate(%) - # of reads marked duplicate in the flag / TotalReads
- QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
- TotalBases - # of bases in all reads
- BasesInMappedReads - # of bases in reads marked mapped in the flag
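To make the definitions above concrete, here is a minimal Python sketch that derives the same fields from a list of (flag, read length) pairs. It is not bamUtil's implementation. The flag bit values come from the SAM specification; the function name `basic_stats`, the input shape, and the multiplication by 100 for the (%) fields are assumptions of this sketch.

```python
# SAM flag bits (from the SAM specification).
PAIRED, PROPER_PAIR, UNMAPPED, QC_FAIL, DUPLICATE = 0x1, 0x2, 0x4, 0x200, 0x400


def basic_stats(reads):
    """reads: iterable of (flag, read_length) tuples. Returns a dict of statistics."""
    names = ["TotalReads", "MappedReads", "PairedReads", "ProperPair",
             "DuplicateReads", "QCFailureReads", "TotalBases", "BasesInMappedReads"]
    stats = dict.fromkeys(names, 0)
    for flag, length in reads:
        stats["TotalReads"] += 1
        stats["TotalBases"] += length
        if not flag & UNMAPPED:
            stats["MappedReads"] += 1
            stats["BasesInMappedReads"] += length
        if flag & PAIRED:
            stats["PairedReads"] += 1
            if flag & PROPER_PAIR:          # paired AND proper paired
                stats["ProperPair"] += 1
        if flag & DUPLICATE:
            stats["DuplicateReads"] += 1
        if flag & QC_FAIL:
            stats["QCFailureReads"] += 1
    total = stats["TotalReads"]
    rates = {"MappingRate(%)": "MappedReads", "PairedReads(%)": "PairedReads",
             "ProperPair(%)": "ProperPair", "DupRate(%)": "DuplicateReads",
             "QCFailRate(%)": "QCFailureReads"}
    for rate_name, count_name in rates.items():
        # Assumption: percentages are count / TotalReads * 100.
        stats[rate_name] = 100.0 * stats[count_name] / total if total else 0.0
    return stats


# Made-up reads: proper pair, unmapped, paired duplicate, QC failure.
example = basic_stats([(0x3, 100), (0x4, 50), (0x401, 100), (0x200, 100)])
print(example)
```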
Qual/Phred (--phred and --qual)
Prints a count of the number of times each quality value appears in the file.
- phred: Displays Quality as phred integers [0-93]
- qual: Displays Quality as non-phred integers (phred + 33) [33-126]
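The relationship between the two displays (non-phred = phred + 33) can be sketched in Python as follows. The function name `quality_counts` and the example quality string are hypothetical; the sketch assumes the input is an actual quality string of ASCII characters in the range 33 to 126.

```python
from collections import Counter


def quality_counts(quality_string: str, as_phred: bool = True) -> Counter:
    """Count each quality value in a SAM quality string.

    as_phred=True  reports phred integers (ASCII code - 33), like --phred.
    as_phred=False reports the non-phred integers (ASCII code), like --qual.
    """
    offset = 33 if as_phred else 0
    return Counter(ord(ch) - offset for ch in quality_string)


phred = quality_counts("II5!!", as_phred=True)
qual = quality_counts("II5!!", as_phred=False)
print(sorted(phred.items()))
print(sorted(qual.items()))
```

In this example 'I' is ASCII 73 (phred 40), '5' is ASCII 53 (phred 20), and '!' is ASCII 33 (phred 0).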
BaseQC (--pBaseQC, --cBaseQC, and --baseSum)
The pBaseQC and cBaseQC options generate per base statistics. Only one of these two options can be specified. Each writes the statistics generated for each position to the file specified after the option. They use the same logic for calculating statistics, but pBaseQC writes the statistics as percentages, and cBaseQC writes them as counts. The order of the statistics is also different.
The baseSum option can be used with either pBaseQC or cBaseQC, or on its own. baseSum generates a summary of the per position statistics and writes it to stderr. It calculates the per position base statistics even if they will not be written anywhere (neither pBaseQC nor cBaseQC is specified).
All three options use the same logic for calculating the statistics:
- A read spans a position if the read starts at or before the position, ends at or after the position and the position is not a clip. CIGAR operations allowed for the position are M/X/=/D/N. If the CIGAR is '*', only numbers for the specified reference position are incremented.
- Currently there is no special logic to exclude positions/reads where the reference base is 'N' or the read base is 'N'.
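The spanning rule above can be sketched as a walk over the CIGAR string. This is an illustrative Python sketch, not bamUtil's code: the function name `spanned_positions` is hypothetical, and returning only the start position for a '*' CIGAR is this sketch's reading of the rule stated above.

```python
import re

# CIGAR operations that count as spanning a reference position (from the rule above).
SPANNING_OPS = set("MX=DN")


def spanned_positions(start0: int, cigar: str) -> list:
    """Return the 0-based reference positions a read spans.

    Clips (S/H), insertions (I) and padding (P) do not advance along the
    reference, so they contribute no positions.
    """
    if cigar == "*":
        return [start0]  # assumption: only the given reference position is counted
    positions = []
    pos = start0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in SPANNING_OPS:
            positions.extend(range(pos, pos + int(length)))
            pos += int(length)
    return positions


# Hypothetical read starting at 0-based position 100.
print(spanned_positions(100, "2S3M1I2D2M"))
```

Here the soft clip and the insertion add no positions, while the two deleted bases still count as spanned. Per the tables that follow, positions reached through a deletion or skip are left out of Depth and Q20Bases.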
Percentage-Based Output Format (--pBaseQC)
Order/Descriptions:
Field | Description | Excludes Duplicates, QC Failures | Excludes Unmapped | Excludes MapQual = 255 | Excludes Below Min MapQual | Excludes CIGAR Deletions, Skips |
---|---|---|---|---|---|---|
chrom | Chromosome/reference name string from the SAM/BAM | |||||
chromStart | 0-based start position | |||||
chromEnd | 0-based end position (always 1 greater than start and not included in this region) | |||||
Depth | # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures | X | X | X | X | X |
Q20Bases | # of bases at this position with a base quality (from the read) of Q20 or higher | X | X | X | X | X |
Q20BasesPct(%) | Q20Bases / Depth | X | X | X | X | X |
TotalReads | # of reads that span this position | |||||
MappedBases | # of reads marked mapped in the flag | X | X | |||
MappingRate(%) | MappedBases / TotalReads | X | X | |||
MapRate_MQPass(%) | # of reads that have a Mapping Quality >= a minimum Mapping Quality / TotalReads | X | X | |||
ZeroMapQual(%) | # of reads that have a Mapping Quality of 0 / TotalReads | X | X | |||
MapQual<10(%) | # of reads that have a Mapping Quality < 10 / TotalReads | X | X | |||
PairedReads(%) | # of reads marked paired in the flag / TotalReads | X | X | |||
ProperPaired(%) | # of reads marked paired AND proper paired in the flag / TotalReads | X | X | |||
DupRate(%) | # of reads marked duplicate in the flag / TotalReads | |||||
QCFailRate(%) | # of reads marked QC failure in the flag / TotalReads | |||||
AverageMapQuality | sum of included mapping qualities / AverageMapQualCount | X | X | X | ||
AverageMapQualCount | # of mapping qualities in AverageMapQuality | X | X | X | ||
This output does not include a MapQual255 count.
Sample Output
chrom chromStart chromEnd Depth Q20Bases Q20BasesPct(%) TotalReads MappedBases MappingRate(%) MapRate_MQPass(%) ZeroMapQual(%) MapQual<10(%) PairedReads(%) ProperPaired(%) DupRate(%) QCFailRate(%) AverageMapQuality AverageMapQualCount
1 100 101 2 2 100.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 11.000 3
1 101 102 2 0 0.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 11.000 3
1 102 103 0 0 0.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 0.000 0
1 103 104 0 0 0.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 0.000 0
1 104 105 2 0 0.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 11.000 3
1 105 106 2 2 100.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 11.000 3
1 110 111 0 0 0.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 0.000 0
1 111 112 2 2 100.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 11.000 3
1 112 113 2 2 100.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 11.000 3
1 10012 10013 14 0 0.000 42 33 78.571 52.381 26.190 52.381 85.714 35.714 14.286 14.286 11.000 21
1 10013 10014 14 10 71.429 39 30 76.923 51.282 25.641 51.282 84.615 38.462 15.385 15.385 11.000 21
1 10023 10024 0 0 0.000 39 30 76.923 51.282 25.641 51.282 84.615 38.462 15.385 15.385 0.000 0
1 10024 10025 14 12 85.714 39 30 76.923 51.282 25.641 51.282 84.615 38.462 15.385 15.385 11.000 21
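For downstream processing, a record in this format can be split on whitespace and matched to the field order given in the table above. The following Python sketch is one possible way to do that; the names `PBASEQC_FIELDS` and `parse_pbaseqc_line` are hypothetical. It parses one record from the sample output and recomputes two of its percentages from the counts in the same record.

```python
# Field order as listed for --pBaseQC.
PBASEQC_FIELDS = [
    "chrom", "chromStart", "chromEnd", "Depth", "Q20Bases", "Q20BasesPct(%)",
    "TotalReads", "MappedBases", "MappingRate(%)", "MapRate_MQPass(%)",
    "ZeroMapQual(%)", "MapQual<10(%)", "PairedReads(%)", "ProperPaired(%)",
    "DupRate(%)", "QCFailRate(%)", "AverageMapQuality", "AverageMapQualCount",
]


def parse_pbaseqc_line(line: str) -> dict:
    """Split one record on whitespace and convert numeric fields."""
    values = line.split()
    if len(values) != len(PBASEQC_FIELDS):
        raise ValueError(f"expected {len(PBASEQC_FIELDS)} fields, got {len(values)}")
    record = dict(zip(PBASEQC_FIELDS, values))
    for name in PBASEQC_FIELDS[1:]:  # chrom stays a string
        text = record[name]
        record[name] = float(text) if "." in text else int(text)
    return record


row = parse_pbaseqc_line(
    "1 10013 10014 14 10 71.429 39 30 76.923 51.282 25.641 51.282 "
    "84.615 38.462 15.385 15.385 11.000 21"
)
q20_pct = round(100.0 * row["Q20Bases"] / row["Depth"], 3)
mapping_rate = round(100.0 * row["MappedBases"] / row["TotalReads"], 3)
print(q20_pct, mapping_rate)
```

For this record the recomputed values agree with the Q20BasesPct(%) and MappingRate(%) columns. Records with a Depth of 0 would need a guard before dividing.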
Count-Based Output Format (--cBaseQC)
Order/Descriptions:
Field | Description | Excludes Duplicates, QC Failures | Excludes Unmapped | Excludes MapQual = 255 | Excludes Below Min MapQual | Excludes CIGAR Deletions, Skips |
---|---|---|---|---|---|---|
chrom | Chromosome/reference name string from the SAM/BAM | |||||
chromStart | 0-based start position | |||||
chromEnd | 0-based end position (always 1 greater than start and not included in this region) | |||||
TotalReads | # of reads that span this position | |||||
Dups | # of reads marked duplicate in the flag | |||||
QCFail | # of reads marked QC failure in the flag | |||||
Mapped | # of reads marked mapped in the flag | X | X | |||
Paired | # of reads marked paired in the flag | X | X | |||
ProperPaired | # of reads marked paired AND proper paired in the flag | X | X | |||
ZeroMapQual | # of reads that have a Mapping Quality of 0 | X | X | |||
MapQual<10 | # of reads that have a Mapping Quality < 10 | X | X | |||
MapQual255 | # of reads that have a Mapping Quality = 255 | X | X | |||
PassMapQual | # of reads that have a Mapping Quality >= a minimum Mapping Quality | X | X | |||
AverageMapQuality | sum of included mapping qualities / AverageMapQualCount | X | X | X | ||
AverageMapQualCount | # of mapping qualities in AverageMapQuality | X | X | X | ||
Depth | # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures | X | X | X | X | X |
Q20Bases | # of bases at this position with a base quality (from the read) of Q20 or higher | X | X | X | X | X |
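Because both formats are built from the same per-position counts, the percentage fields can in principle be derived from a count-based record using the definitions given for --pBaseQC. The Python sketch below illustrates that relationship only. The function name `counts_to_percentages`, the dictionary keys, and the example counts are hypothetical, and reporting 0 when a denominator is 0 is an assumption based on the sample percentage output.

```python
def counts_to_percentages(counts: dict) -> dict:
    """Derive pBaseQC-style percentages from cBaseQC-style counts for one position."""
    total = counts["TotalReads"]
    depth = counts["Depth"]

    def pct(numerator, denominator):
        # Assumption: a zero denominator is reported as 0.0.
        return 100.0 * numerator / denominator if denominator else 0.0

    return {
        "Q20BasesPct(%)": pct(counts["Q20Bases"], depth),
        "MappingRate(%)": pct(counts["Mapped"], total),
        "MapRate_MQPass(%)": pct(counts["PassMapQual"], total),
        "ZeroMapQual(%)": pct(counts["ZeroMapQual"], total),
        "MapQual<10(%)": pct(counts["MapQual<10"], total),
        "PairedReads(%)": pct(counts["Paired"], total),
        "ProperPaired(%)": pct(counts["ProperPaired"], total),
        "DupRate(%)": pct(counts["Dups"], total),
        "QCFailRate(%)": pct(counts["QCFail"], total),
    }


# Illustrative counts for a single position (not taken from a real cBaseQC file).
example_counts = {
    "TotalReads": 42, "Dups": 6, "QCFail": 6, "Mapped": 33, "Paired": 36,
    "ProperPaired": 15, "ZeroMapQual": 11, "MapQual<10": 22, "PassMapQual": 22,
    "Depth": 14, "Q20Bases": 0,
}
percentages = counts_to_percentages(example_counts)
for name, value in percentages.items():
    print(f"{name}\t{value:.3f}")
```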
Summary of per Position Statistics (--baseSum)
Use --baseSum to print an overall summary of the baseQC for the file to stderr.
This option can be used with or without --pBaseQC and --cBaseQC.
The values are tab delimited. First there is a header line describing the summary, followed by a line of field names. Then there is a line with the Mean values, then a line with the Standard Deviations.
Field | Description | Excludes Duplicates, QC Failures | Excludes Unmapped | Excludes MapQual = 255 | Excludes Below Min MapQual | Excludes CIGAR Deletions, Skips |
---|---|---|---|---|---|---|
TotalReads | # of reads that span this position | |||||
Dups | # of reads marked duplicate in the flag | |||||
QCFail | # of reads marked QC failure in the flag | |||||
Mapped | # of reads marked mapped in the flag | X | X | |||
Paired | # of reads marked paired in the flag | X | X | |||
ProperPaired | # of reads marked paired AND proper paired in the flag | X | X | |||
ZeroMapQual | # of reads that have a Mapping Quality of 0 | X | X | |||
MapQual<10 | # of reads that have a Mapping Quality < 10 | X | X | |||
MapQual255 | # of reads that have a Mapping Quality = 255 | X | X | |||
PassMapQual | # of reads that have a Mapping Quality >= a minimum Mapping Quality | X | X | |||
AverageMapQuality | sum of included mapping qualities / AverageMapQualCount | X | X | X | ||
AverageMapQualCount | # of mapping qualities in AverageMapQuality | X | X | X | ||
Depth | # of reads that are mapped with acceptable Mapping Quality, and are not duplicates or QC failures | X | X | X | X | X |
Q20Bases | # of bases at this position with a base quality (from the read) of Q20 or higher | X | X | X | X | X |
Sample Output
Summary of Pileup Stats (1st Mean, 2nd Standard Deviation)
TotalReads Dups QCFail Mapped Paired ProperPaired ZeroMapQual MapQual<10 MapQual255 PassMapQual AverageMapQuality AverageMapQualCount Depth Q20Bases
14.307692 1.846154 1.846154 8.769231 7.846154 0.923077 2.923077 5.846154 0.000000 2.923077 11.000000 8.769231 2.076923 1.153846
17.670053 2.882307 2.882307 9.038380 7.603137 1.441153 3.012793 6.025586 0.000000 3.012793 0.000000 9.038380 2.841993 1.993579
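As a sanity check on how the summary relates to the per position values, the Python sketch below takes the TotalReads column from the 13 records of the percentage-based sample output shown earlier and computes its mean and sample standard deviation (n - 1 denominator). For that column the results agree with the first values of the two summary lines above. That bamUtil uses the n - 1 form in general is only inferred from this one example, and the variable names are this sketch's own.

```python
import statistics

# TotalReads values from the 13 records of the pBaseQC sample output.
total_reads = [3] * 9 + [42, 39, 39, 39]

mean_total_reads = statistics.mean(total_reads)
# statistics.stdev is the sample standard deviation (divides by n - 1).
sd_total_reads = statistics.stdev(total_reads)

print(f"{mean_total_reads:.6f}\t{sd_total_reads:.6f}")
```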