Difference between revisions of "BamUtil: validate"

From Genome Analysis Wiki

Revision as of 13:24, 2 September 2011


Status

The initial version of a SAM/BAM Validator is complete, but does not yet validate all fields or produce all desired statistics. Future releases will add more validation and more statistics.

Download

See http://genome.sph.umich.edu/wiki/Software#Download for download instructions. The BAM Validator is found in statgen/src/bam, and the executable is called bam (statgen/src/bin/bam).

Purpose

The BAM Validator processes the specified SAM/BAM file:

  1. to determine if it has any syntactic or format violations.
  2. to generate basic statistics.

The user can then decide if they want to use the file for future processing based on whether it passed syntactic/format validation and based on the statistics that were reported.


Valid SAM/BAM File Requirements

A valid SAM/BAM file meets the validation criteria specified in SAM Validation Criteria.

Statistic Generation

Statistics are generated by the BAM Validator unless the --disableStatistics option is set. A description of the generated statistics is found at: Sam File Statistics

How to Use the Bam Validator Executable

Parameters

    Required Parameters:
        --in : the SAM/BAM file to be validated
    Optional Parameters:
        --noeof             : do not expect an EOF block on a bam file.
        --so_flag           : validate the file is sorted based on the header's @HD SO flag.
        --so_coord          : validate the file is sorted based on the coordinate.
        --so_query          : validate the file is sorted based on the query name.
        --maxErrors         : Number of records with errors/invalids to allow before quitting.
                              -1 (default) indicates to not quit until the entire file is validated.
                              0 indicates not to read/validate anything.
        --verbose           : Print specific error details rather than just a summary
        --printableErrors   : Maximum number of records with errors to print the details of
                              before suppressing them when in verbose (defaults to 100)
        --disableStatistics : Turn off statistic generation
        --params            : Print the parameter settings

Usage

	./bam validate --in <inputFile> [--noeof] [--so_flag|--so_coord|--so_query] [--maxErrors <numErrors>] [--verbose] [--printableErrors <numReportedErrors>] [--disableStatistics] [--params]
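When the validator is driven from a script, the usage line above can be assembled programmatically. The following is a minimal sketch, not part of the tool: the helper name build_validate_command, its keyword arguments, and the default ./bam location are assumptions made for illustration. Only the flags listed under Parameters are used.

```python
from typing import List, Optional


def build_validate_command(
    in_file: str,
    bam_path: str = "./bam",           # assumed location of the executable
    noeof: bool = False,
    sort_check: Optional[str] = None,  # "flag", "coord", or "query"
    max_errors: Optional[int] = None,
    verbose: bool = False,
    printable_errors: Optional[int] = None,
    disable_statistics: bool = False,
    params: bool = False,
) -> List[str]:
    """Return the argument list for `bam validate` (hypothetical helper)."""
    if sort_check not in (None, "flag", "coord", "query"):
        raise ValueError("sort_check must be 'flag', 'coord', 'query', or None")
    cmd = [bam_path, "validate", "--in", in_file]
    if noeof:
        cmd.append("--noeof")
    if sort_check is not None:
        # The usage line allows at most one of the three sorting flags.
        cmd.append("--so_" + sort_check)
    if max_errors is not None:
        cmd += ["--maxErrors", str(max_errors)]
    if verbose:
        cmd.append("--verbose")
    if printable_errors is not None:
        cmd += ["--printableErrors", str(printable_errors)]
    if disable_statistics:
        cmd.append("--disableStatistics")
    if params:
        cmd.append("--params")
    return cmd


# Show the command line that would be run for a coordinate-sort check.
print(" ".join(build_validate_command("sample.bam", sort_check="coord", verbose=True)))
```

Returning a list rather than a single string means the result can be handed to subprocess.run without shell quoting concerns.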

Recommended Usage

If you don't want the file statistics, use --disableStatistics.

If you want to validate that the file is sorted, use the appropriate sorting flag. If you trust the @HD SO flag, use --so_flag. Otherwise, use --so_coord to check that it is sorted by coordinate, or --so_query to check that it is sorted by query name.

If you want to see the error details, use --verbose, but if you want to limit the number of errors displayed, use --printableErrors.

If you just want to know whether the file is validly formatted, use --maxErrors 1.

The following will give the most information (without validating that the file is sorted):

./bam validate --in <inputFile> --verbose

Output

The error details (--verbose) and the statistics are printed to stderr. If you want that to go to a file you need to redirect stderr.

For a bash shell, redirect stderr to a file by doing:

./bam validate --in <inputFile> --verbose 2> outputFile.txt
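The same capture can be done from a script instead of a shell redirect. The sketch below uses Python's standard subprocess module. The function name run_and_capture_stderr is an assumption, and the demo runs a stand-in child process that writes to stderr and exits with status 7, so that the example works without the bam executable. In real use the command list would be something like ["./bam", "validate", "--in", "<inputFile>", "--verbose"].

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_capture_stderr(cmd: List[str]) -> Tuple[int, str]:
    """Run cmd and return (exit status, everything it wrote to stderr)."""
    result = subprocess.run(cmd, stderr=subprocess.PIPE, text=True)
    return result.returncode, result.stderr


# Stand-in for the validator: a child process that writes one line to
# stderr and exits with a non-zero status.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('details and statistics go here\\n'); sys.exit(7)",
]
status, report = run_and_capture_stderr(demo_cmd)
print("exit status:", status)
print("captured stderr:", report.strip())
```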


Return Value

  • 0: all records are successfully read, are valid, and are properly sorted.
  • non-0: at least one record was not successfully read, not valid, or not properly sorted.
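Because the return value distinguishes only zero from non-zero, a script can sort a batch of files into those that passed and those that did not. This is a rough sketch under assumptions: the function names, the ./bam location, and the file names in the demo are hypothetical, and the demo substitutes a lookup table for the real executable so that it runs anywhere.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def bam_validate_status(path: str, bam_path: str = "./bam") -> int:
    """Exit status of a quick run that stops at the first bad record."""
    cmd = [bam_path, "validate", "--in", path, "--maxErrors", "1", "--disableStatistics"]
    return subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode


def split_by_validity(
    paths: Iterable[str],
    status_of: Callable[[str], int] = bam_validate_status,
) -> Dict[str, List[str]]:
    """Group paths by whether the validator returned 0 or non-0."""
    result: Dict[str, List[str]] = {"passed": [], "failed": []}
    for path in paths:
        key = "passed" if status_of(path) == 0 else "failed"
        result[key].append(path)
    return result


# Demo with made-up statuses in place of real validator runs.
fake_statuses = {"good.bam": 0, "bad.sam": 7}
print(split_by_validity(fake_statuses, status_of=lambda p: fake_statuses[p]))
```

Passing the status function as a parameter keeps the grouping logic separate from the process call, so it can be checked without any SAM/BAM files on disk.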

Example Outputs

Valid File

./bam validate --in ~/data/bamExample/37mer_alt.bwa.bam

Number of records read = 18900000
Number of valid records = 18900000

TotalReads(e6)	18.90
MappedReads(e6)	14.77
PairedReads(e6)	18.90
ProperPair(e6)	11.28
DuplicateReads(e6)	0.00
QCFailureReads(e6)	0.00

MappingRate(%)	78.17
PairedReads(%)	100.00
ProperPair(%)	59.68
DupRate(%)	0.00
QCFailRate(%)	0.00

TotalBases(e6)	699.30
BasesInMappedReads(e6)	546.67
Returning: 0 (SUCCESS)
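For downstream filtering it can be handy to turn the statistics block into a dictionary. The sketch below assumes that each statistic is printed as a name, a tab, and a number, as in the example above. The function name parse_statistics is an assumption, and the sample text is copied from the valid-file output.

```python
from typing import Dict

# A few lines copied from the valid-file example; fields are tab separated.
SAMPLE = """\
TotalReads(e6)\t18.90
MappedReads(e6)\t14.77
PairedReads(e6)\t18.90
ProperPair(e6)\t11.28

MappingRate(%)\t78.17
PairedReads(%)\t100.00
ProperPair(%)\t59.68

TotalBases(e6)\t699.30
BasesInMappedReads(e6)\t546.67
Returning: 0 (SUCCESS)
"""


def parse_statistics(text: str) -> Dict[str, float]:
    """Collect 'name<TAB>number' lines; every other line is ignored."""
    stats: Dict[str, float] = {}
    for line in text.splitlines():
        name, sep, value = line.partition("\t")
        if not sep or not name.strip():
            continue  # no tab, or an indented line such as an error count
        try:
            stats[name.strip()] = float(value)
        except ValueError:
            continue  # second field is not a number
    return stats


stats = parse_statistics(SAMPLE)
print(stats["MappingRate(%)"])
print(stats["TotalBases(e6)"])
```

A caller could then apply its own threshold, for example rejecting files whose MappingRate(%) falls below a chosen cutoff.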

Invalid File

./bam validate --in test/testFiles/testInvalid.sam 

Number of records read = 32
Number of valid records = 2

Error Counts:
	FAIL_PARSE: 17
	INVALID: 1
	INVALID_QNAME: 3
	INVALID_RNAME: 8
	INVALID_POS: 2
	INVALID_CIGAR: 2
	INVALID_QUAL: 2

TotalReads	14.00
MappedReads	14.00
PairedReads	6.00
ProperPair	0.00
DuplicateReads	0.00
QCFailureReads	0.00

MappingRate(%)	100.00
PairedReads(%)	42.86
ProperPair(%)	0.00
DupRate(%)	0.00
QCFailRate(%)	0.00

TotalBases	47.00
BasesInMappedReads	47.00
Returning: 7 (INVALID)
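The summary above can also be read by a script. The sketch below extracts the record totals and the Error Counts section from text laid out as in this example. The function names are assumptions, and the layout (an "Error Counts:" header followed by indented "NAME: count" lines) is taken from the output shown here rather than from any specification.

```python
import re
from typing import Dict, Tuple

# Copied from the invalid-file example; count lines start with a tab.
SAMPLE = """\
Number of records read = 32
Number of valid records = 2

Error Counts:
\tFAIL_PARSE: 17
\tINVALID: 1
\tINVALID_QNAME: 3
\tINVALID_RNAME: 8
\tINVALID_POS: 2
\tINVALID_CIGAR: 2
\tINVALID_QUAL: 2

TotalReads\t14.00
Returning: 7 (INVALID)
"""


def parse_record_totals(text: str) -> Tuple[int, int]:
    """Return (records read, valid records)."""
    read = re.search(r"Number of records read = (\d+)", text)
    valid = re.search(r"Number of valid records = (\d+)", text)
    if read is None or valid is None:
        raise ValueError("record totals not found")
    return int(read.group(1)), int(valid.group(1))


def parse_error_counts(text: str) -> Dict[str, int]:
    """Return {error name: count} from the 'Error Counts:' section."""
    counts: Dict[str, int] = {}
    in_section = False
    for line in text.splitlines():
        if line.strip() == "Error Counts:":
            in_section = True
            continue
        if in_section:
            match = re.fullmatch(r"\s+(\w+):\s*(\d+)", line)
            if match is None:
                in_section = False  # first non-matching line ends the section
                continue
            counts[match.group(1)] = int(match.group(2))
    return counts


records_read, records_valid = parse_record_totals(SAMPLE)
counts = parse_error_counts(SAMPLE)
print("invalid records:", records_read - records_valid)
print("total errors:", sum(counts.values()))
```

In this example the error counts add up to 35 while only 30 records are invalid, which suggests that a single record can contribute more than one error.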

Invalid File with Verbose

In this example, --printableErrors is specified to keep the output short. Without it, the details of all the errors would be printed.

./bam validate --in test/testFiles/testInvalid.sam --verbose --printableErrors 5

Record 1
INVALID_QNAME (ERROR) : Invalid Query Name - the string length (256) does not match the specified query name length (0).
INVALID_QNAME (WARNING) : Invalid Query Name (QNAME) length: 256.  Length with the terminating null must be between 2 & 255.

Record 2
INVALID: 0 length Query Name.

Record 3
INVALID_QNAME (WARNING) : Invalid character in the Query Name (QNAME): ' ' at position 2.

Record 4
FAIL_PARSE: flag, 29M5I3M:F:295, is not an integer.
FAIL_PARSE: Invalid Tag Format: *, should be cc:c:x*.

Record 5
FAIL_PARSE: Too few columns (1) in the Record, expected at least 11.


Number of records read = 32
Number of valid records = 2

Error Counts:
	FAIL_PARSE: 17
	INVALID: 1
	INVALID_QNAME: 3
	INVALID_RNAME: 8
	INVALID_POS: 2
	INVALID_CIGAR: 2
	INVALID_QUAL: 2

TotalReads	14.00
MappedReads	14.00
PairedReads	6.00
ProperPair	0.00
DuplicateReads	0.00
QCFailureReads	0.00

MappingRate(%)	100.00
PairedReads(%)	42.86
ProperPair(%)	0.00
DupRate(%)	0.00
QCFailRate(%)	0.00

TotalBases	47.00
BasesInMappedReads	47.00
Returning: 7 (INVALID)
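The per-record details printed by --verbose can be grouped by record number for further inspection. This is a minimal sketch that assumes the layout shown above: a "Record N" line followed by one message per line, with a blank line closing the group. The function name group_by_record is an assumption, and the sample lines are copied from the verbose example.

```python
import re
from typing import Dict, List

# Three of the records from the verbose example above.
SAMPLE = """\
Record 2
INVALID: 0 length Query Name.

Record 4
FAIL_PARSE: flag, 29M5I3M:F:295, is not an integer.
FAIL_PARSE: Invalid Tag Format: *, should be cc:c:x*.

Record 5
FAIL_PARSE: Too few columns (1) in the Record, expected at least 11.


Number of records read = 32
Number of valid records = 2
"""


def group_by_record(text: str) -> Dict[int, List[str]]:
    """Map each record number to the messages printed under it."""
    records: Dict[int, List[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"Record (\d+)", line)
        if header:
            current = int(header.group(1))
            records[current] = []
        elif not line:
            current = None  # a blank line closes the current group
        elif current is not None:
            records[current].append(line)
    return records


records = group_by_record(SAMPLE)
for number, messages in records.items():
    # The error name is the leading run of capitals and underscores.
    names = [re.match(r"[A-Z_]+", message).group(0) for message in messages]
    print(number, names)
```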