Difference between revisions of "BamUtil: stats"

From Genome Analysis Wiki
Revision as of 13:17, 22 September 2011


Overview of the stats function of bamUtil

The stats option on the bamUtil executable generates the specified statistics on a SAM/BAM file.

Parameters

	Required Parameters:
		--in : the SAM/BAM file to calculate stats for
	Types of Statistics that can be generated:
		--basic       : Turn on basic statistic generation
		--qual        : Generate a count for each quality (displayed as non-phred quality)
		--phred       : Generate a count for each quality (displayed as phred quality)
		--baseQC      : Write per base statistics to the specified file.
	Optional Parameters:
		--maxNumReads : Maximum number of reads to process
		                Defaults to -1 to indicate all reads.
		--unmapped    : Only process unmapped reads (requires a bamIndex file)
		--bamIndex    : The path/name of the bam index file
		                (if required and not specified, uses the --in value + ".bai")
		--regionList  : File containing the region list chr<tab>start_pos<tab>end_pos.
		                Positions are 0 based and the end_pos is not included in the region.
		                Uses bamIndex.
		--minMapQual  : The minimum mapping quality for filtering reads in the baseQC stats.
		--dbsnp       : The dbSnp file of positions to exclude from baseQC analysis.
		--noeof       : Do not expect an EOF block on a bam file.
		--params      : Print the parameter settings

For all types of statistics, the bam file used is specified by --in.

The optional parameters are also used for all types of statistics.
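
The --regionList coordinates in the parameter list are 0-based and exclude end_pos, so each region is a half-open interval. The following is a minimal Python sketch of that interpretation, not bamUtil's own code; the helper names parse_region_line and in_region are hypothetical and used only for illustration.

```python
def parse_region_line(line):
    """Parse one 'chr<tab>start_pos<tab>end_pos' line into (chrom, start, end)."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def in_region(region, chrom, pos0):
    """True if the 0-based position pos0 on chrom lies in the half-open region."""
    r_chrom, start, end = region
    # start_pos is included, end_pos is not
    return chrom == r_chrom and start <= pos0 < end


region = parse_region_line("1\t100\t200\n")
print(in_region(region, "1", 100))  # start_pos itself is inside the region
print(in_region(region, "1", 200))  # end_pos itself is outside the region
```

Under this reading, a region line "1<tab>100<tab>200" covers 100 positions, 100 through 199.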

Usage:

	./bam stats --in <inputFile> [--basic] [--qual] [--phred] [--baseQC <outputFileName>] [--maxNumReads <maxNum>] [--unmapped] [--bamIndex <bamIndexFile>] [--regionList <regFileName>] [--minMapQual <minMapQ>] [--dbsnp <dbsnpFile>] [--noeof] [--params]


Types of Statistics

Basic

Prints summary statistics for the file:

  • TotalReads - # of reads that are in the file
  • MappedReads - # of reads marked mapped in the flag
  • PairedReads - # of reads marked paired in the flag
  • ProperPair - # of reads marked paired AND proper paired in the flag
  • DuplicateReads - # of reads marked duplicate in the flag
  • QCFailureReads - # of reads marked QC failure in the flag
  • MappingRate(%) - # of reads marked mapped in the flag / TotalReads
  • PairedReads(%) - # of reads marked paired in the flag / TotalReads
  • ProperPair(%) - # of reads marked paired AND proper paired in the flag / TotalReads
  • DupRate(%) - # of reads marked duplicate in the flag / TotalReads
  • QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
  • TotalBases - # of bases in all reads
  • BasesInMappedReads - # of bases in reads marked mapped in the flag
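
As an illustration of how these counts relate to the SAM flag, here is a minimal Python sketch. It is not bamUtil's implementation: the function name basic_stats and the (flag, read_length) input format are assumptions, and the rates are multiplied by 100 on the assumption that the "(%)" fields are percentages. The flag bit values are those defined in the SAM specification.

```python
# SAM flag bits (from the SAM specification)
FLAG_PAIRED = 0x1
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400


def basic_stats(reads):
    """Summarize an iterable of (flag, read_length) tuples (hypothetical input format)."""
    names = ["TotalReads", "MappedReads", "PairedReads", "ProperPair",
             "DuplicateReads", "QCFailureReads", "TotalBases", "BasesInMappedReads"]
    s = dict.fromkeys(names, 0)
    for flag, length in reads:
        s["TotalReads"] += 1
        s["TotalBases"] += length
        if not flag & FLAG_UNMAPPED:
            s["MappedReads"] += 1
            s["BasesInMappedReads"] += length
        if flag & FLAG_PAIRED:
            s["PairedReads"] += 1
            # ProperPair requires paired AND proper paired
            if flag & FLAG_PROPER_PAIR:
                s["ProperPair"] += 1
        if flag & FLAG_DUPLICATE:
            s["DuplicateReads"] += 1
        if flag & FLAG_QC_FAIL:
            s["QCFailureReads"] += 1
    total = s["TotalReads"]
    if total:
        # Assumption: the "(%)" fields are the ratio to TotalReads times 100
        s["MappingRate(%)"] = 100.0 * s["MappedReads"] / total
        s["PairedReads(%)"] = 100.0 * s["PairedReads"] / total
        s["ProperPair(%)"] = 100.0 * s["ProperPair"] / total
        s["DupRate(%)"] = 100.0 * s["DuplicateReads"] / total
        s["QCFailRate(%)"] = 100.0 * s["QCFailureReads"] / total
    return s


stats = basic_stats([(0x63, 100), (0x4, 50), (0x400, 100), (0x201, 100)])
for name, value in stats.items():
    print(name, value)
```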


Qual/Phred

Prints a count of the number of times each quality value appears in the file.

  • phred Displays Quality as phred integers [0-93]
  • qual Displays Quality as non-phred integers (phred + 33) [33-126]
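
The relationship between the two displays can be sketched in Python as follows. This is an illustration only, assuming qualities are read from the SAM quality string, where each character's ASCII code is phred + 33; the function name quality_counts is hypothetical.

```python
from collections import Counter


def quality_counts(quality_strings, as_phred=True):
    """Count quality values across SAM quality strings.

    as_phred=True  counts phred integers (character code - 33), like --phred.
    as_phred=False counts the raw character codes (phred + 33), like --qual.
    """
    counts = Counter()
    for qual in quality_strings:
        for ch in qual:
            value = ord(ch)
            counts[value - 33 if as_phred else value] += 1
    return counts


print(sorted(quality_counts(["II!"]).items()))                  # phred values
print(sorted(quality_counts(["II!"], as_phred=False).items()))  # phred + 33 values
```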


BaseQC

The baseQC option generates the following statistics:

For each position, the following counts are incremented for every read that meets these criteria:

  1. the read spans the reference position (starts before or at this reference position and ends at or after this position)
  2. the read is counted regardless of duplicate/QC failure/unmapped status and mapping quality
  3. the read is counted regardless of the CIGAR for this position (deletions and skips are counted, but clips at the beginning/end are not)

  • TotalReads(e6) - # of reads that span this position.
  • DupRate(%) - # of reads marked duplicate in the flag / TotalReads
  • QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
  • PairedReads(%) - # of reads marked paired in the flag / TotalReads
  • ProperPaired(%) - # of reads marked paired AND proper paired in the flag / TotalReads
  • MappedBases(e9) - # of reads marked mapped in the flag
  • MappingRate(%) - # of reads marked mapped in the flag / TotalReads
  • ZeroMapQual(%) - # of reads marked mapped in the flag AND have a Mapping Quality of 0 / TotalReads
  • MapQual<10(%) - # of reads marked mapped in the flag AND have a Mapping Quality < 10 / TotalReads
  • MapRate_MQpass(%) - # of reads marked mapped in the flag AND have a Mapping Quality >= a minimum Mapping Quality / TotalReads
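
The "spans the reference position" test above can be sketched in Python. This is a hedged illustration, not bamUtil's code: it assumes the read's 0-based alignment start already excludes leading clips (as the SAM POS field does) and derives the end from the reference-consuming CIGAR operations (M, D, N, =, X), so deletions and skips count toward the span while clips do not. The function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def reference_span(start0, cigar):
    """Return the inclusive 0-based (first, last) reference positions of a read."""
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return start0, start0 + ref_len - 1


def spans(start0, cigar, pos0):
    """True if the read starts at or before pos0 and ends at or after pos0."""
    first, last = reference_span(start0, cigar)
    return first <= pos0 <= last


cigar = "5S10M2D10M3S"
print(reference_span(100, cigar))
print(spans(100, cigar, 110))  # a deleted position still counts as spanned
print(spans(100, cigar, 99))   # the leading clip is not part of the span
```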


For each position, the following counts are incremented if:

  1. a read spans the reference position (starts before or at this reference position and ends at or after this position)
  2. the read is NOT a duplicate, QC failure, unmapped, or mapped with a mapping quality less than the minimum (--minMapQual)
  3. the CIGAR for this position is a M/=/X (match/mismatch)

  • Depth - # of reads.
  • Q20Bases(e9) - # of bases at this position with a base quality (from the read) of Q20 or higher.
  • Q20BasesPct(%) - Q20Bases / Depth
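
The three filtering rules above can be sketched in Python for a single position. This is an illustration under stated assumptions, not bamUtil's implementation: reads are represented as dicts with hypothetical keys (flag, start0, cigar, mapq, quals), base qualities are already phred integers, and the function names are invented for this example.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# unmapped (0x4), QC failure (0x200), duplicate (0x400)
EXCLUDE_FLAGS = 0x4 | 0x200 | 0x400


def base_quality_at(start0, cigar, quals, pos0):
    """Return the quality of the base aligned to pos0 by M/=/X, else None."""
    ref, query = start0, 0
    for n, op in CIGAR_OP.findall(cigar):
        n = int(n)
        if op in "M=X":            # consumes both reference and read
            if ref <= pos0 < ref + n:
                return quals[query + (pos0 - ref)]
            ref += n
            query += n
        elif op in "IS":           # consumes read only
            query += n
        elif op in "DN":           # consumes reference only
            ref += n
    return None


def depth_and_q20(reads, pos0, min_map_qual):
    """Return (Depth, Q20Bases) at the 0-based position pos0."""
    depth = q20 = 0
    for r in reads:
        if r["flag"] & EXCLUDE_FLAGS or r["mapq"] < min_map_qual:
            continue
        q = base_quality_at(r["start0"], r["cigar"], r["quals"], pos0)
        if q is None:              # not spanned, or not an M/=/X position
            continue
        depth += 1
        if q >= 20:
            q20 += 1
    return depth, q20


reads = [
    {"flag": 0, "start0": 100, "cigar": "4M", "mapq": 30, "quals": [30, 30, 10, 30]},
    {"flag": 0x400, "start0": 100, "cigar": "4M", "mapq": 30, "quals": [30, 30, 30, 30]},
    {"flag": 0, "start0": 100, "cigar": "2M2D2M", "mapq": 30, "quals": [30, 30, 30, 30]},
    {"flag": 0, "start0": 101, "cigar": "2S3M", "mapq": 30, "quals": [5, 5, 25, 25, 25]},
]
print(depth_and_q20(reads, 102, min_map_qual=10))
```

Q20BasesPct(%) would then follow as Q20Bases / Depth for positions where Depth is nonzero.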


Currently there is no special logic to exclude positions where the reference is 'N'.

Currently there is no special logic to exclude reads from the counts when the base is 'N'.