Difference between revisions of "BamUtil: stats"

From Genome Analysis Wiki

Revision as of 11:57, 7 October 2011


Overview of the stats function of bamUtil

The stats option on the bamUtil executable generates the specified statistics on a SAM/BAM file.

Parameters

	Required Parameters:
		--in : the SAM/BAM file to calculate stats for
	Types of Statistics that can be generated:
		--basic       : Turn on basic statistic generation
		--qual        : Generate a count for each quality (displayed as non-phred quality)
		--phred       : Generate a count for each quality (displayed as phred quality)
		--baseQC      : Write per base statistics to the specified file.
	Optional Parameters:
		--maxNumReads : Maximum number of reads to process
		                Defaults to -1 to indicate all reads.
		--unmapped    : Only process unmapped reads (requires a bamIndex file)
		--bamIndex    : The path/name of the bam index file
		                (if required and not specified, uses the --in value + ".bai")
		--regionList  : File containing the region list chr<tab>start_pos<tab>end_pos.
		                Positions are 0 based and the end_pos is not included in the region.
		                Uses bamIndex.
		--minMapQual  : The minimum mapping quality for filtering reads in the baseQC stats.
		--dbsnp       : The dbSnp file of positions to exclude from baseQC analysis.
		--noeof       : Do not expect an EOF block on a bam file.
		--params      : Print the parameter settings

For all types of statistics, the bam file used is specified by --in.

The optional parameters are also used for all types of statistics.
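
As an illustration of the --regionList format described above (tab-separated, 0-based positions, end position excluded), here is a minimal Python sketch of reading such a file. The function names `parse_region_list` and `region_length` are hypothetical and are not part of bamUtil; skipping blank lines is an assumption.

```python
def parse_region_list(lines):
    """Parse 'chr<tab>start_pos<tab>end_pos' lines into (chrom, start, end) tuples.

    Positions are 0-based and end is exclusive, as described for --regionList.
    """
    regions = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        chrom, start, end = line.split("\t")[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def region_length(region):
    """Number of reference positions covered by a half-open region."""
    _, start, end = region
    return end - start


regions = parse_region_list(["1\t100\t106\n", "1\t110\t113\n"])
print(regions)
print([region_length(r) for r in regions])
```

Because the end position is excluded, a region from 100 to 106 covers the six positions 100 through 105.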

Usage:

	./bam stats --in <inputFile> [--basic] [--qual] [--phred] [--baseQC <outputFileName>] [--maxNumReads <maxNum>] [--unmapped] [--bamIndex <bamIndexFile>] [--regionList <regFileName>] [--minMapQual <minMapQ>] [--dbsnp <dbsnpFile>] [--noeof] [--params]


Types of Statistics

Basic

Prints summary statistics for the file:

  • TotalReads - # of reads that are in the file
  • MappedReads - # of reads marked mapped in the flag
  • PairedReads - # of reads marked paired in the flag
  • ProperPair - # of reads marked paired AND proper paired in the flag
  • DuplicateReads - # of reads marked duplicate in the flag
  • QCFailureReads - # of reads marked QC failure in the flag
  • MappingRate(%) - # of reads marked mapped in the flag / TotalReads
  • PairedReads(%) - # of reads marked paired in the flag / TotalReads
  • ProperPair(%) - # of reads marked paired AND proper paired in the flag / TotalReads
  • DupRate(%) - # of reads marked duplicate in the flag / TotalReads
  • QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
  • TotalBases - # of bases in all reads
  • BasesInMappedReads - # of bases in reads marked mapped in the flag
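
The summary values above can be derived from each read's flag and sequence length. The following Python sketch shows one way to do that, using the flag bits defined in the SAM specification. It is not bamUtil's implementation; `basic_stats` and `rate` are hypothetical names, and returning 0.0 for an empty file is an assumption.

```python
# Flag bits from the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
QC_FAIL = 0x200
DUPLICATE = 0x400


def basic_stats(reads):
    """Summarize an iterable of (flag, sequence_length) pairs."""
    stats = dict.fromkeys(
        ["TotalReads", "MappedReads", "PairedReads", "ProperPair",
         "DuplicateReads", "QCFailureReads", "TotalBases", "BasesInMappedReads"],
        0,
    )
    for flag, seq_len in reads:
        stats["TotalReads"] += 1
        stats["TotalBases"] += seq_len
        if not flag & UNMAPPED:
            stats["MappedReads"] += 1
            stats["BasesInMappedReads"] += seq_len
        if flag & PAIRED:
            stats["PairedReads"] += 1
            # ProperPair requires both the paired and proper-pair bits.
            if flag & PROPER_PAIR:
                stats["ProperPair"] += 1
        if flag & DUPLICATE:
            stats["DuplicateReads"] += 1
        if flag & QC_FAIL:
            stats["QCFailureReads"] += 1
    return stats


def rate(count, total):
    """Percentage of total; 0.0 when total is 0 (an assumption)."""
    return 100.0 * count / total if total else 0.0


example = [(PAIRED | PROPER_PAIR, 100), (UNMAPPED, 50), (DUPLICATE, 100), (PROPER_PAIR, 100)]
stats = basic_stats(example)
print(stats)
print("MappingRate(%)", rate(stats["MappedReads"], stats["TotalReads"]))
```

The last example read has the proper-pair bit set without the paired bit, so it is not counted in ProperPair.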

Qual/Phred

Prints a count of the number of times each quality value appears in the file.

  • phred Displays Quality as phred integers [0-93]
  • qual Displays Quality as non-phred integers (phred + 33) [33-126]
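
The two displays differ only by the offset of 33 between a quality character's integer code and its phred value. A minimal Python sketch of the counting, with the hypothetical function name `count_qualities` (not part of bamUtil):

```python
from collections import Counter


def count_qualities(quality_strings, phred=True):
    """Count how often each quality value occurs.

    With phred=True the keys are phred integers (character code - 33);
    otherwise the keys are the non-phred integers (the character codes).
    """
    counts = Counter()
    for qual in quality_strings:
        for ch in qual:
            code = ord(ch)
            counts[code - 33 if phred else code] += 1
    return counts


quals = ["II5!", "I5"]
print(sorted(count_qualities(quals, phred=True).items()))
print(sorted(count_qualities(quals, phred=False).items()))
```

Here the character 'I' has code 73 and phred value 40, '5' has code 53 and phred value 20, and '!' has code 33 and phred value 0.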


BaseQC

The baseQC option generates the following statistics:

For each position, the following counts are incremented when:

  1. a read spans the reference position (starts at or before this reference position and ends at or after it)
  2. regardless of duplicate, QC failure, unmapped status, or mapping quality
  3. the CIGAR operation at this position is M/X/=/D/N (any CIGAR operation other than a clip or insert)

  • TotalReads - # of reads that span this position.
  • Dups - # of reads marked duplicate in the flag
  • QCFail - # of reads marked QC failure in the flag

No further stats are incremented if the read is a duplicate, QC failure, or unmapped.
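
To make the CIGAR condition above concrete, the Python sketch below lists the reference positions at which a read would be counted: those covered by an M, X, =, D, or N operation. Clips and insertions do not advance along the reference, so they contribute no positions. `counted_positions` is a hypothetical helper, not a bamUtil function, and it assumes a 0-based alignment start.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume reference positions per the SAM specification.
REFERENCE_OPS = set("MX=DN")


def counted_positions(start, cigar):
    """Return the 0-based reference positions covered by M/X/=/D/N operations."""
    positions = []
    pos = start
    for length, op in CIGAR_RE.findall(cigar):
        if op in REFERENCE_OPS:
            positions.extend(range(pos, pos + int(length)))
            pos += int(length)
        # I, S, H and P do not consume the reference, so pos is unchanged.
    return positions


print(counted_positions(100, "2S2M1I2D2M"))
```

In this example the soft clip and the insertion add no positions, while the two deleted bases (102 and 103) are still counted as spanned.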

Additional counts incremented ONLY for mapped, non-duplicate, non-QC failure reads:

  • Mapped - # of reads marked mapped in the flag
  • Paired - # of reads marked paired in the flag
  • ProperPaired - # of reads marked paired AND proper paired in the flag
  • ZeroMapQual - # of reads that have a Mapping Quality of 0
  • MapQual<10 - # of reads that have a Mapping Quality < 10
  • MapQual255 - # of reads that have a Mapping Quality = 255
  • PassMapQual - # of reads that have a Mapping Quality >= the minimum Mapping Quality (in version 1.0, this includes mapping quality 255 reads).

Additional values ONLY for mapped, mapping quality != 255, non-duplicate, non-QC failure reads:

  • AverageMapQuality - average calculated by summing all mapping qualities that are included (as defined above) and dividing by the number of mapping qualities added.
  • AverageMapQualCount - # of mapping qualities used to calculate AverageMapQuality.

Additional values ONLY incremented for mapped, mapping quality >= min mapping quality, non-duplicate, non-QC failure reads (in version 1.0, this includes mapping quality 255 reads):

  • Depth - # of reads.
  • Q20Bases - # of bases at this position with a base quality (from the read) of Q20 or higher.

Currently there is no special logic to exclude positions where the reference is 'N'.

Currently there is no special logic to exclude reads from the counts when the base is 'N'.
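
The tiers of counting described above can be sketched in Python as follows. This is an illustration of the rules as written on this page, not bamUtil's source code. The names `new_counts`, `update_position`, `average_map_quality`, and the helper key `MapQualSum` are hypothetical. The sketch assumes the caller has already determined that the read spans the position with a CIGAR operation of M/X/=/D/N, and that a base quality is available for the read at that position.

```python
# Flag bits from the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
QC_FAIL = 0x200
DUPLICATE = 0x400

COUNT_NAMES = [
    "TotalReads", "Dups", "QCFail", "Mapped", "Paired", "ProperPaired",
    "ZeroMapQual", "MapQual<10", "MapQual255", "PassMapQual",
    "MapQualSum", "AverageMapQualCount", "Depth", "Q20Bases",
]


def new_counts():
    """Fresh set of zeroed counters for one reference position."""
    return dict.fromkeys(COUNT_NAMES, 0)


def update_position(counts, flag, mapq, base_qual, min_map_qual):
    """Apply one spanning read to the counters of one position."""
    # Tier 1: every spanning read.
    counts["TotalReads"] += 1
    if flag & DUPLICATE:
        counts["Dups"] += 1
    if flag & QC_FAIL:
        counts["QCFail"] += 1
    if flag & (DUPLICATE | QC_FAIL | UNMAPPED):
        return  # no further stats for duplicate, QC failure, or unmapped reads

    # Tier 2: mapped, non-duplicate, non-QC-failure reads.
    counts["Mapped"] += 1
    if flag & PAIRED:
        counts["Paired"] += 1
        if flag & PROPER_PAIR:
            counts["ProperPaired"] += 1
    if mapq == 0:
        counts["ZeroMapQual"] += 1
    if mapq < 10:
        counts["MapQual<10"] += 1
    if mapq == 255:
        counts["MapQual255"] += 1
    else:
        # Tier 3: mapping quality 255 is left out of the average.
        counts["MapQualSum"] += mapq
        counts["AverageMapQualCount"] += 1

    # Tier 4: mapping quality at or above the minimum (255 passes, as in version 1.0).
    if mapq >= min_map_qual:
        counts["PassMapQual"] += 1
        counts["Depth"] += 1
        if base_qual >= 20:
            counts["Q20Bases"] += 1


def average_map_quality(counts):
    n = counts["AverageMapQualCount"]
    return counts["MapQualSum"] / n if n else 0.0


counts = new_counts()
example_reads = [
    (PAIRED | PROPER_PAIR, 30, 25),  # counted in every tier
    (DUPLICATE, 30, 30),             # only TotalReads and Dups
    (0, 0, 30),                      # mapped, but fails a minimum of 10
    (0, 255, 10),                    # passes the minimum, excluded from the average
    (UNMAPPED, 0, 30),               # only TotalReads
]
for flag, mapq, base_qual in example_reads:
    update_position(counts, flag, mapq, base_qual, min_map_qual=10)
print(counts)
print(average_map_quality(counts))
```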


BaseQC Output

There are two output options for BaseQC.

  1. Percentages
  2. Straight Counts

Percentage-Based Output Format

Order (with calculations based on the values described above):

  • chrom - Chromosome/reference name string from the SAM/BAM
  • chromStart - 0-based start position
  • chromEnd - 0-based end position (always 1 greater than start and not included in this region)
  • Depth - Depth
  • Q20Bases - Q20Bases
  • Q20BasesPct(%) - Q20Bases / Depth
  • TotalReads - TotalReads
  • MappedBases - Mapped
  • MappingRate(%) - Mapped / TotalReads
  • MapRate_MQPass(%) - PassMapQual / TotalReads
  • ZeroMapQual(%) - ZeroMapQual / TotalReads
  • MapQual<10(%) - MapQual<10 / TotalReads
  • PairedReads(%) - Paired / TotalReads
  • ProperPaired(%) - ProperPaired / TotalReads
  • DupRate(%) - Dups / TotalReads
  • QCFailRate(%) - QCFail / TotalReads
  • AverageMapQuality - AverageMapQuality
  • AverageMapQualCount - AverageMapQualCount

This output does not include a MapQual255 count in version 1.0.
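
The percentage columns listed above are simple ratios of the raw counts. A minimal Python sketch of the arithmetic follows; `pct` and `percentage_values` are hypothetical names, the example counts are made up for illustration, and reporting 0 when the denominator is 0 is an assumption (it is consistent with the 0.000 shown in the sample output for positions with a Depth of 0).

```python
def pct(numerator, denominator):
    """Percentage, or 0.0 when the denominator is 0 (an assumption)."""
    return 100.0 * numerator / denominator if denominator else 0.0


def percentage_values(c):
    """Compute the percentage columns from a dict of raw counts."""
    total = c["TotalReads"]
    return {
        "Q20BasesPct(%)": pct(c["Q20Bases"], c["Depth"]),
        "MappingRate(%)": pct(c["Mapped"], total),
        "MapRate_MQPass(%)": pct(c["PassMapQual"], total),
        "ZeroMapQual(%)": pct(c["ZeroMapQual"], total),
        "MapQual<10(%)": pct(c["MapQual<10"], total),
        "PairedReads(%)": pct(c["Paired"], total),
        "ProperPaired(%)": pct(c["ProperPaired"], total),
        "DupRate(%)": pct(c["Dups"], total),
        "QCFailRate(%)": pct(c["QCFail"], total),
    }


# Made-up counts for a single position.
example_counts = {
    "Depth": 14, "Q20Bases": 10, "TotalReads": 40, "Mapped": 30,
    "PassMapQual": 20, "ZeroMapQual": 10, "MapQual<10": 12,
    "Paired": 24, "ProperPaired": 18, "Dups": 6, "QCFail": 4,
}
for name, value in percentage_values(example_counts).items():
    print(f"{name}\t{value:.3f}")
```

Note that only Q20BasesPct(%) uses Depth as its denominator; every other percentage is relative to TotalReads.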


Count-Based Output Format

Order (of values described above):

  • chrom - Chromosome/reference name string from the SAM/BAM
  • chromStart - 0-based start position
  • chromEnd - 0-based end position (always 1 greater than start and not included in this region)
  • TotalReads
  • Dups
  • QCFail
  • Mapped
  • Paired
  • ProperPaired
  • ZeroMapQual
  • MapQual<10
  • MapQual255
  • PassMapQual
  • AverageMapQuality
  • AverageMapQualCount
  • Depth
  • Q20Bases


Sample Output

chrom	chromStart	chromEnd	Depth	Q20Bases	Q20BasesPct(%)	TotalReads	MappedBases	MappingRate(%)	MapRate_MQPass(%)	ZeroMapQual(%)	MapQual<10(%)	PairedReads(%)	ProperPaired(%)	DupRate(%)	QCFailRate(%)	AverageMapQuality	AverageMapQualCount
1	100	101	2	2	100.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	101	102	2	0	0.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	102	103	0	0	0.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	0.000	0
1	103	104	0	0	0.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	0.000	0
1	104	105	2	0	0.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	105	106	2	2	100.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	110	111	0	0	0.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	0.000	0
1	111	112	2	2	100.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	112	113	2	2	100.000	3	3	100.000	66.667	33.333	66.667	100.000	0.000	0.000	0.000	11.000	3
1	10012	10013	14	0	0.000	42	33	78.571	52.381	26.190	52.381	85.714	35.714	14.286	14.286	11.000	21
1	10013	10014	14	10	71.429	39	30	76.923	51.282	25.641	51.282	84.615	38.462	15.385	15.385	11.000	21
1	10023	10024	0	0	0.000	39	30	76.923	51.282	25.641	51.282	84.615	38.462	15.385	15.385	0.000	0
1	10024	10025	14	12	85.714	39	30	76.923	51.282	25.641	51.282	84.615	38.462	15.385	15.385	11.000	21
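
Because the output is tab-delimited with a header line, it can be loaded with standard tools. The Python sketch below reads percentage-based output into dictionaries keyed by column name and lists the positions with a Depth of 0. The function names are hypothetical, and the embedded text reuses the header and two rows from the sample above.

```python
import csv
import io


def read_baseqc(handle):
    """Read tab-delimited BaseQC output into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def zero_depth_positions(rows):
    """Return (chrom, chromStart) for every position whose Depth is 0."""
    return [(r["chrom"], int(r["chromStart"])) for r in rows if int(r["Depth"]) == 0]


# Header and two rows from the sample output, rebuilt with tab separators.
SAMPLE = "\n".join("\t".join(line.split()) for line in [
    "chrom chromStart chromEnd Depth Q20Bases Q20BasesPct(%) TotalReads MappedBases "
    "MappingRate(%) MapRate_MQPass(%) ZeroMapQual(%) MapQual<10(%) PairedReads(%) "
    "ProperPaired(%) DupRate(%) QCFailRate(%) AverageMapQuality AverageMapQualCount",
    "1 100 101 2 2 100.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 11.000 3",
    "1 102 103 0 0 0.000 3 3 100.000 66.667 33.333 66.667 100.000 0.000 0.000 0.000 0.000 0",
])

rows = read_baseqc(io.StringIO(SAMPLE))
print(len(rows))
print(zero_depth_positions(rows))
```

All values are read as strings, so numeric columns need an explicit `int` or `float` conversion before use.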