StatsTools

From Genome Analysis Wiki

statsTools Overview

statsTools contains a set of tools for operating on the statistics files that we generate.

Currently it only works on baseQC statistics files produced by BamUtil: stats.

mergeBaseQCSumStats

Merges stats files in the Count-Based Output Format.

The merge can be done in multiple iterations, merging smaller groups of files at a time.
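
As an illustration of merging in iterations, the following Python sketch plans a series of mergeBaseQCSumStats commands that combine a long list of files a few at a time. The function name plan_batched_merge, the batch_size parameter, and the intermediate file names (roundN_partM.stats.gz) are hypothetical and not part of statsTools. The sketch only builds the command lines; it does not run them.

```python
from typing import List


def plan_batched_merge(inputs: List[str], final_out: str,
                       batch_size: int = 2) -> List[List[str]]:
    """Return mergeBaseQCSumStats command lines that merge `inputs` in rounds.

    Intermediate file names are made up for this sketch.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    if not inputs:
        raise ValueError("at least one input stats file is required")

    commands: List[List[str]] = []
    current = list(inputs)
    round_no = 0
    # Keep merging groups until the remaining files fit in one final merge.
    while len(current) > batch_size:
        next_round: List[str] = []
        for i in range(0, len(current), batch_size):
            group = current[i:i + batch_size]
            if len(group) == 1:
                # A leftover file is carried into the next round unchanged.
                next_round.append(group[0])
                continue
            tmp = f"round{round_no}_part{i // batch_size}.stats.gz"
            commands.append(["mergeBaseQCSumStats", "--out", tmp, *group])
            next_round.append(tmp)
        current = next_round
        round_no += 1
    # The last command writes the file the caller actually asked for.
    commands.append(["mergeBaseQCSumStats", "--out", final_out, *current])
    return commands
```

Each returned list can be passed to subprocess.run in order. The commands within one round are independent of each other, so they could also be run in parallel.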

Usage

 mergeBaseQCSumStats --out <outputStatsFile> <inputStatsFiles>

Parameters

 --out output merged stats file
 inputStatsFiles space separated list of files to merge.

output File (--out)

Use --out followed by your file name to specify the merged stats output file.

The file extension is used to determine whether or not to compress the output file. A - is used to indicate stdout.

 uncompressed to file     --out yourFileName.stats
 compressed to file       --out yourFileName.stats.gz
 uncompressed to stdout   --out -
 compressed to stdout     --out -.gz
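
The four cases above can be summarized in a small Python sketch. It assumes the rule is simply "a trailing .gz means compressed, and - names stdout", which is inferred from the examples and not confirmed against the statsTools source. The function name describe_out is hypothetical.

```python
from typing import Tuple


def describe_out(out_arg: str) -> Tuple[bool, bool]:
    """Return (to_stdout, compressed) for an output file argument.

    Assumption: compression is selected by a trailing ".gz" and "-" means
    stdout, as the documented examples suggest.
    """
    compressed = out_arg.endswith(".gz")
    # Strip the ".gz" suffix before checking for the stdout marker.
    target = out_arg[:-3] if compressed else out_arg
    to_stdout = target == "-"
    return to_stdout, compressed
```

This is only a way to reason about which argument to pass; the tool itself decides how the output is written.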

input Stats Files

The input stats files to merge do not take a flag. List them at the end of the command line, after --out and its file name.

The software can read either compressed or uncompressed stats files, but they must be in the Count-Based Format.

Return Value

The software returns 0 on completion, or -1 if the parameters could not be read or there was a problem reading an input file.

Output

A status message is written to cerr on failures, and upon completion, "Done writing to " followed by the output file name is written to cerr.
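
When the tool is driven from a script, the return value and the cerr messages can be checked together. The Python sketch below is one possible wrapper; run_tool is a hypothetical helper, not part of statsTools. Note that on POSIX systems a return value of -1 is normally reported to the caller as 255, so the sketch tests only for zero versus non-zero.

```python
import subprocess
from typing import Sequence, Tuple


def run_tool(argv: Sequence[str]) -> Tuple[bool, str]:
    """Run a command and return (succeeded, text written to stderr).

    stdout is left untouched so that "--out -" output still reaches the
    caller's stdout.
    """
    result = subprocess.run(list(argv), stderr=subprocess.PIPE, text=True)
    # 0 means success; any other value is treated as a failure.
    return result.returncode == 0, result.stderr
```

For example, run_tool(["mergeBaseQCSumStats", "--out", "all.stats.gz", "a.stats", "b.stats"]) would return True and a message containing "Done writing to" if the merge succeeded, assuming the executable is on the PATH and the input files exist.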


subsetBaseQCStats

Reduces a BAM BaseQC stats file to only the positions in the specified regions.

Usage

 subsetBaseQCStats --inStats <originalStatsFile> --regionList <subset of regions> --outStats <outputStatsFile>

Parameters

 --inStats    : stats file to narrow down to just a subset of positions
 --regionList : File containing the subset of regions to keep (assumed to be sorted)
                Formatted as chr<tab>start_pos<tab>end_pos.
                Positions are 0 based and the end_pos is not included in the region.
 --outStats   : stats file to write the subset of stats into

input File (--inStats)

The input stats file that needs to be narrowed down to just a subset of regions.

The software can read either compressed or uncompressed stats files, but they must be in a BAM BaseQC format.

region List File (--regionList)

The file containing the list of regions to keep from the input stats file.

The regions should be specified, one region per line.

Each column is separated by tabs.

 Column #   Description
 1          Chromosome as written in the stats file.
 2          0-based region start position (included in the output file).
 3          0-based region end position (not included in the output file).
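
Because the start is 0-based and the end is excluded, a region given as 1-based inclusive coordinates needs its start shifted by one while its end stays the same. The Python sketch below shows the conversion and the resulting membership test. The function names region_line and in_region are hypothetical and are not part of statsTools.

```python
def region_line(chrom: str, first_1based: int, last_1based: int) -> str:
    """Build one regionList line from 1-based, inclusive coordinates."""
    start = first_1based - 1  # 0-based start, included in the region
    end = last_1based         # 0-based end, not included in the region
    if start >= end:
        raise ValueError(f"start must be < end, but {start} >= {end}")
    return f"{chrom}\t{start}\t{end}"


def in_region(pos0: int, start: int, end: int) -> bool:
    """True if the 0-based position lies in the half-open range [start, end)."""
    return start <= pos0 < end
```

For example, 1-based positions 101 through 200 on chromosome 1 become the line 1<tab>100<tab>200, which covers 0-based positions 100 through 199.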

output File (--outStats)

Use --outStats followed by your file name to specify the output file for the subset of stats.

The file extension is used to determine whether or not to compress the output file. A - is used to indicate stdout.

 uncompressed to file     --outStats yourFileName.stats
 compressed to file       --outStats yourFileName.stats.gz
 uncompressed to stdout   --outStats -
 compressed to stdout     --outStats -.gz


Return Value

The software returns 0 on successful completion, or non-zero on a failure.

Output

Status/error messages are written to cerr.

If an invalid region is found in the regionList, "NonOverlapRegionPos::add: Invalid Range, start must be < end, but ", followed by the start position, followed by " >= ", followed by the end position, is written to cerr.

Upon completion, "Done subsetBaseQCStats." is written to cerr.

Example Output:

NonOverlapRegionPos::add: Invalid Range, start must be < end, but 112 >= 111
Done subsetBaseQCStats.
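
To catch invalid ranges before running the tool, the region list can be checked in advance. The Python sketch below applies the same start < end rule to each line. find_bad_regions is a hypothetical helper, and its messages are its own rather than the tool's. Skipping blank lines is an assumption of the sketch, and it does not check that the regions are sorted.

```python
from typing import Iterable, List


def find_bad_regions(lines: Iterable[str]) -> List[str]:
    """Return a description of every line that is not a valid region."""
    problems: List[str] = []
    for line_no, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text:
            continue  # assumption: blank lines are ignored
        fields = text.split("\t")
        if len(fields) != 3:
            problems.append(
                f"line {line_no}: expected 3 tab-separated columns, "
                f"found {len(fields)}")
            continue
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            problems.append(f"line {line_no}: start and end must be integers")
            continue
        if start >= end:
            problems.append(
                f"line {line_no}: start must be < end, but {start} >= {end}")
    return problems
```

An open file can be passed directly, for example find_bad_regions(open("regions.txt")), where regions.txt stands for your own region list file. An empty result means no problems were found by this check.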