StatsTools
Overview
statsTools contains a set of tools for operating on the statistics files that we generate.
Currently it only works on baseQC statistics files produced by BamUtil: stats.
- mergeBaseQCSumStats - merge stats files in the Count-Based Output Format (sumStats)
- subsetBaseQCStats - reduce the stats file to just the positions in the specified regions.
mergeBaseQCSumStats
Merges stats files in the Count-Based Output Format.
The merge can be done in multiple iterations, merging smaller groups of files at a time.
Usage
mergeBaseQCSumStats --out <outputStatsFile> <inputStatsFiles>
Parameters
--out : output merged stats file
inputStatsFiles : space separated list of files to merge
Output File (--out)
Use --out followed by your file name to specify the merged stats output file.
The file extension determines whether the output file is compressed. A - is used to indicate stdout.
Output destination | Option |
---|---|
uncompressed to file | --out yourFileName.stats |
compressed to file | --out yourFileName.stats.gz |
uncompressed to stdout | --out - |
compressed to stdout | --out -.gz |
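The naming rule in the table above can be sketched in a few lines of Python. This is only an illustration of how the four cases read, not the tool's own code; the function name `describe_output` is hypothetical, and the sketch assumes that a trailing `.gz` is what selects compression, as the table suggests.

```python
def describe_output(out_arg):
    """Return (destination, compressed) for a value passed to --out.

    Assumption: a trailing ".gz" selects compression, and a base name
    of "-" selects stdout, mirroring the four rows of the table.
    """
    compressed = out_arg.endswith(".gz")
    # Strip the ".gz" suffix before checking for the stdout marker "-".
    base = out_arg[:-3] if compressed else out_arg
    destination = "stdout" if base == "-" else "file"
    return destination, compressed


for arg in ["yourFileName.stats", "yourFileName.stats.gz", "-", "-.gz"]:
    print(arg, describe_output(arg))
```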
Input Stats Files
The input stats files to merge do not have a flag; they are specified at the end of the command line (after the --out argument).
The software can read either compressed or uncompressed stats files, but they must be in the Count-Based Format.
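Because the merge can be done in multiple iterations, a large set of files can be merged in small groups first and the partial results merged afterwards. The following Python sketch only builds the command lines for such a plan, using the usage form shown above. The function name `build_merge_commands`, the intermediate file names, and the assumption that the tool is invoked exactly as `mergeBaseQCSumStats` are all hypothetical; adjust them to your installation.

```python
def build_merge_commands(inputs, group_size, final_out,
                         tool="mergeBaseQCSumStats"):
    """Return a list of argv lists that merge `inputs` in groups.

    Each round merges up to `group_size` files into one partial file;
    the last command merges whatever remains into `final_out`.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if not inputs:
        raise ValueError("at least one input stats file is required")

    commands = []
    current = list(inputs)
    round_no = 0
    while len(current) > group_size:
        next_round = []
        for i in range(0, len(current), group_size):
            group = current[i:i + group_size]
            if len(group) == 1:
                # A lone leftover file needs no merge; carry it forward.
                next_round.append(group[0])
                continue
            # Hypothetical intermediate name; .gz keeps partial files compressed.
            out = f"partial.r{round_no}.g{i // group_size}.stats.gz"
            commands.append([tool, "--out", out] + group)
            next_round.append(out)
        current = next_round
        round_no += 1

    commands.append([tool, "--out", final_out] + current)
    return commands


files = ["a.stats", "b.stats", "c.stats", "d.stats", "e.stats"]
for cmd in build_merge_commands(files, 2, "all.stats.gz"):
    print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run` in order, since later commands read the partial files written by earlier ones.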
Return Value
The software returns 0 on completion, or -1 if the parameters could not be read or there was a problem reading an input file.
Output
A status message is written to cerr on failures, and upon completion, "Done writing to " followed by the output file name is written to cerr.
subsetBaseQCStats
Reduces a BAM BaseQC stats file to only the positions in the specified regions.
Usage
subsetBaseQCStats --inStats <originalStatsFile> --regionList <subset of regions> --outStats <outputStatsFile>
Parameters
--inStats : stats file to narrow down to just a subset of positions
--regionList : file containing the subset of regions to keep (assumed to be sorted), formatted as chr<tab>start_pos<tab>end_pos. Positions are 0-based and the end_pos is not included in the region.
--outStats : stats file to write the subset of stats into
Input File (--inStats)
The input stats file that needs to be narrowed down to just a subset of regions.
The software can read either compressed or uncompressed stats files, but they must be in a BAM BaseQC format.
Region List File (--regionList)
The file containing the list of regions to keep from the input stats file.
The regions should be specified, one region per line.
Each column is separated by tabs.
Column # | Description |
---|---|
1 | Chromosome as written in the stats file. |
2 | 0-based region start position (included in the output file). |
3 | 0-based region end position (not included in the output file). |
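The half-open convention in the table (start included, end excluded) is easy to get wrong by one, so here is a minimal Python sketch of it. The helper names `parse_region` and `in_region` are hypothetical and this is not the tool's implementation. It assumes the position being tested is also 0-based, and it rejects lines whose start is not less than the end.

```python
def parse_region(line):
    """Parse one 'chr<tab>start_pos<tab>end_pos' line into a tuple."""
    chrom, start, end = line.rstrip("\n").split("\t")
    start, end = int(start), int(end)
    if start >= end:
        # A region must contain at least one position.
        raise ValueError(f"invalid range: {start} >= {end}")
    return chrom, start, end


def in_region(region, chrom, pos):
    """True if the 0-based position `pos` on `chrom` lies in [start, end)."""
    r_chrom, start, end = region
    return chrom == r_chrom and start <= pos < end


region = parse_region("1\t100\t200\n")
print(in_region(region, "1", 100))  # start position is included
print(in_region(region, "1", 199))  # last position kept
print(in_region(region, "1", 200))  # end position is excluded
```

With this convention, the region `1<tab>100<tab>200` covers exactly 100 positions: 100 through 199.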
Output File (--outStats)
Use --outStats followed by your file name to specify the output file for the subset of stats.
The file extension determines whether the output file is compressed. A - is used to indicate stdout.
Output destination | Option |
---|---|
uncompressed to file | --outStats yourFileName.stats |
compressed to file | --outStats yourFileName.stats.gz |
uncompressed to stdout | --outStats - |
compressed to stdout | --outStats -.gz |
Return Value
The software returns 0 on successful completion, or non-zero on a failure.
Output
Status/error messages are written to cerr.
If an invalid region is found in the regionList, "NonOverlapRegionPos::add: Invalid Range, start must be < end, but ", followed by the start position, followed by " >= ", followed by the end position, is written to cerr.
Upon completion, "Done subsetBaseQCStats." is written to cerr.
Example Output:
NonOverlapRegionPos::add: Invalid Range, start must be < end, but 112 >= 111
Done subsetBaseQCStats.
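When the tool is driven from a script, both the return value and the messages on cerr are worth checking. The example output above shows the invalid-range message together with the completion message, so a caller may want to scan stderr for it rather than rely on the completion message alone. The Python sketch below is one way to do that. The helper names are hypothetical, and the argument list passed to `run_and_report` is whatever command line you would normally type.

```python
import subprocess

# Prefix of the message documented above for an invalid region.
INVALID_RANGE_PREFIX = "NonOverlapRegionPos::add: Invalid Range"


def run_and_report(argv):
    """Run a command and return (succeeded, stderr_text).

    Only stderr is captured, so stats written to stdout
    (for example with --outStats -) still pass through.
    """
    result = subprocess.run(argv, stderr=subprocess.PIPE, text=True)
    return result.returncode == 0, result.stderr


def has_invalid_range(stderr_text):
    """True if any stderr line reports an invalid region."""
    return any(line.startswith(INVALID_RANGE_PREFIX)
               for line in stderr_text.splitlines())
```

A call might look like `run_and_report(["subsetBaseQCStats", "--inStats", "in.stats", "--regionList", "regions.txt", "--outStats", "out.stats.gz"])`, where the file names are placeholders and the executable name follows the Usage line.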