BamUtil: validate
Status
The initial version of a SAM/BAM Validator is complete, but does not yet validate all fields or produce all desired statistics. Future releases will add more validation and more statistics.
Download
http://genome.sph.umich.edu/wiki/BamUtil
After compiling, the BAM Validator is found in bamUtil/bin/bam and is the "validate" subprogram (bamUtil/bin/bam validate).
Purpose
The BamValidator processes the specified SAM/BAM file:
- to determine if it has any syntactic or format violations.
- to generate basic statistics.
The user can then decide if they want to use the file for future processing based on whether it passed syntactic/format validation and based on the statistics that were reported.
Valid SAM/BAM File Requirements
A valid SAM/BAM file meets the validation criteria specified in SAM Validation Criteria.
Statistic Generation
Statistics are generated by the BAM Validator unless the --disableStatistics option is set. A description of the generated statistics is found at: Sam File Statistics
Usage
./bam validate --in <inputFile> [--noeof] [--so_flag|--so_coord|--so_query] [--maxErrors <numErrors>] [--verbose] [--printableErrors <numReportedErrors>] [--disableStatistics] [--params]
Recommended Usage
If you don't want the file statistics, use --disableStatistics.
If you want to validate that the file is sorted, use the appropriate sorting flag. If you trust the @HD SO flag, use --so_flag; otherwise, use --so_coord to check that it is sorted by coordinate, or --so_query to check that it is sorted by query name.
If you want to see the error details, use --verbose, but if you want to limit the number of errors displayed, use --printableErrors.
If you just want to know whether the file is validly formatted, use --maxErrors 1.
The following will give the most information (without validating that the file is sorted):
./bam validate --in <inputFile> --verbose
Parameters
Required Parameters:
    --in : the SAM/BAM file to be validated
Optional Parameters:
    --noeof : do not expect an EOF block on a bam file.
    --refFile : the reference file
    --so_flag : validate the file is sorted based on the header's @HD SO flag.
    --so_coord : validate the file is sorted based on the coordinate.
    --so_query : validate the file is sorted based on the query name.
    --maxErrors : Number of records with errors/invalids to allow before quiting.
                  -1 (default) indicates to not quit until the entire file is validated.
                  0 indicates not to read/validate anything.
    --verbose : Print specific error details rather than just a summary
    --printableErrors : Maximum number of records with errors to print the details of before suppressing them when in verbose (defaults to 100)
    --disableStatistics : Turn off statistic generation
    --params : Print the parameter settings
PhoneHome:
    --noPhoneHome : disable PhoneHome (default enabled)
    --phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)
Required Parameters
Input File (--in)
Use --in followed by your file name to specify the SAM/BAM input file.
The program automatically determines whether your input file is SAM, BAM, or uncompressed BAM without the user supplying anything other than the file name, unless the input is stdin.
Use - to read from stdin. In that case the extension determines the file type (no extension indicates SAM).
SAM/BAM/Uncompressed BAM from file | --in yourFileName |
SAM from stdin | --in - |
BAM from stdin | --in -.bam |
Uncompressed BAM from stdin | --in -.ubam |
Note: Uncompressed BAM is compressed using compression level 0 (so it is not an entirely uncompressed file). This matches the samtools implementation, so pipes between our tools and samtools are supported.
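The stdin conventions above can be captured in a small shell helper. This is only a sketch: the function name `stdin_in_arg` is hypothetical and not part of bamUtil, and the commented pipeline assumes that both samtools and a compiled ./bam are available.

```shell
# Hypothetical helper (not part of bamUtil): map a stream type to the
# value that --in expects when the data arrives on stdin.
stdin_in_arg() {
    case "$1" in
        sam)  printf '%s\n' '-' ;;       # no extension indicates SAM
        bam)  printf '%s\n' '-.bam' ;;   # compressed BAM
        ubam) printf '%s\n' '-.ubam' ;;  # level-0 ("uncompressed") BAM
        *)    echo "unknown stream type: $1" >&2; return 1 ;;
    esac
}

# Example pipeline (assumes samtools is installed; 'samtools view -u'
# writes uncompressed BAM to stdout):
#   samtools view -u input.bam | ./bam validate --in "$(stdin_in_arg ubam)"
```

Keeping the mapping in one place avoids mismatching the stream type and the --in value when a pipeline changes.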
Optional Parameters
Do not require BGZF EOF block (--noeof)
Use --noeof if you do not expect a trailing EOF block in your BGZF file.
By default, the trailing empty block is expected and checked for.
Reference File (--refFile)
Use --refFile followed by the reference file name to specify the reference sequence file.
Validate Sort Order (--so_flag, --so_coord, --so_query)
Validate the sort order of the file:
- --so_flag : based on the flag in the header
- --so_coord : based on the coordinates/positions
- --so_query : based on the query/read names
Maximum Number of Errors (--maxErrors)
Use --maxErrors followed by a number to specify the maximum number of records with errors/invalids to process before quitting.
-1 (default) indicates to not quit until the entire file is validated.
0 indicates not to read/validate anything.
Print Specific Errors (--verbose)
Use --verbose to print specific error details rather than just a summary.
Maximum Number of Record Error Details to Print (--printableErrors)
Use --printableErrors followed by a number to specify the maximum number of records with errors whose details are printed before further details are suppressed. This parameter is only valid when --verbose is also specified.
The default is 100.
Disable Statistic Generation (--disableStatistics)
Use --disableStatistics to turn off statistic generation (statistics are generated by default).
Print the Program Parameters (--params)
Use --params to print the parameters for your program to stderr.
PhoneHome Parameters
See PhoneHome for more information on how PhoneHome works and what it does.
Turn off PhoneHome (--noPhoneHome)
Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.
Adjust the Frequency of PhoneHome (--phoneHomeThinning)
Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).
- By default, --phoneHomeThinning is set to 50, running 50% of the time.
- PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
- N/A if --noPhoneHome is set.
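The thinning rule can be illustrated with a few lines of shell. This is a sketch of the rule as described above, not bamUtil's actual code. The function name `would_phone_home` is hypothetical, and how the program draws its random number is not specified here.

```shell
# Hypothetical illustration of the thinning rule:
# PhoneHome runs only when (random number mod 100) < thinning value.
#   $1 = the run's random number (non-negative integer)
#   $2 = the --phoneHomeThinning value (0-100)
would_phone_home() {
    [ $(( $1 % 100 )) -lt "$2" ]
}

# Example: with the default thinning of 50, a random number of 149
# gives 149 % 100 = 49, which is less than 50, so PhoneHome would run.
#   would_phone_home 149 50 && echo "would run"
```

Under this rule a thinning value of 0 never runs PhoneHome and a value of 100 always does, which matches the 0-100 percentage interpretation.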
Output
The error details (--verbose) and the statistics are printed to stderr. To save them to a file, redirect stderr.
In a bash shell, redirect stderr to a file with:
./bam validate --in <inputFile> --verbose 2> outputFile.txt
Return Value
- 0: all records are successfully read, are valid, and are properly sorted.
- non-0: at least one record was not successfully read, not valid, or not properly sorted.
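Because the return value distinguishes a clean file from a problematic one, it can drive a script directly. The following is a minimal sketch: `check_file` and the `BAM_BIN` variable are hypothetical names introduced here, and the flag combination follows the recommended usage for a pass/fail check.

```shell
# BAM_BIN is an assumed variable naming the compiled bamUtil binary.
BAM_BIN=${BAM_BIN:-./bam}

# Hypothetical helper: print a one-line verdict for a file and return
# 0 when validate returned 0, or 1 when it returned anything else.
check_file() {
    # --maxErrors 1 stops at the first bad record; --disableStatistics
    # skips statistic generation. All program output is discarded.
    if "$BAM_BIN" validate --in "$1" --maxErrors 1 --disableStatistics >/dev/null 2>&1; then
        echo "OK: $1"
    else
        echo "FAILED: $1"
        return 1
    fi
}

# Example:
#   check_file sample.bam && echo "safe to continue"
```

Add a sort-order flag such as --so_coord to the command if the check should also fail on unsorted files.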
Example Outputs
Valid File
./bam validate --in ~/data/bamExample/37mer_alt.bwa.bam
Number of records read = 18900000
Number of valid records = 18900000
TotalReads(e6) 18.90
MappedReads(e6) 14.77
PairedReads(e6) 18.90
ProperPair(e6) 11.28
DuplicateReads(e6) 0.00
QCFailureReads(e6) 0.00
MappingRate(%) 78.17
PairedReads(%) 100.00
ProperPair(%) 59.68
DupRate(%) 0.00
QCFailRate(%) 0.00
TotalBases(e6) 699.30
BasesInMappedReads(e6) 546.67
Returning: 0 (SUCCESS)
Invalid File
./bam validate --in test/testFiles/testInvalid.sam
Number of records read = 32
Number of valid records = 2
Error Counts:
    FAIL_PARSE: 17
    INVALID: 1
    INVALID_QNAME: 3
    INVALID_RNAME: 8
    INVALID_POS: 2
    INVALID_CIGAR: 2
    INVALID_QUAL: 2
TotalReads 14.00
MappedReads 14.00
PairedReads 6.00
ProperPair 0.00
DuplicateReads 0.00
QCFailureReads 0.00
MappingRate(%) 100.00
PairedReads(%) 42.86
ProperPair(%) 0.00
DupRate(%) 0.00
QCFailRate(%) 0.00
TotalBases 47.00
BasesInMappedReads 47.00
Returning: 7 (INVALID)
Invalid File with Verbose
--printableErrors is specified here to keep the example short, since printing the details of every error would take up much more space.
./bam validate --in test/testFiles/testInvalid.sam --verbose --printableErrors 5
Record 1
INVALID_QNAME (ERROR) : Invalid Query Name - the string length (256) does not match the specified query name length (0).
INVALID_QNAME (WARNING) : Invalid Query Name (QNAME) length: 256. Length with the terminating null must be between 2 & 255.
Record 2
INVALID: 0 length Query Name.
Record 3
INVALID_QNAME (WARNING) : Invalid character in the Query Name (QNAME): ' ' at position 2.
Record 4
FAIL_PARSE: flag, 29M5I3M:F:295, is not an integer.
FAIL_PARSE: Invalid Tag Format: *, should be cc:c:x*.
Record 5
FAIL_PARSE: Too few columns (1) in the Record, expected at least 11.
Number of records read = 32
Number of valid records = 2
Error Counts:
    FAIL_PARSE: 17
    INVALID: 1
    INVALID_QNAME: 3
    INVALID_RNAME: 8
    INVALID_POS: 2
    INVALID_CIGAR: 2
    INVALID_QUAL: 2
TotalReads 14.00
MappedReads 14.00
PairedReads 6.00
ProperPair 0.00
DuplicateReads 0.00
QCFailureReads 0.00
MappingRate(%) 100.00
PairedReads(%) 42.86
ProperPair(%) 0.00
DupRate(%) 0.00
QCFailRate(%) 0.00
TotalBases 47.00
BasesInMappedReads 47.00
Returning: 7 (INVALID)
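For scripted use, the two record counts in the example outputs can be extracted with awk. This is a sketch under assumptions: `summarize_counts` is a hypothetical helper, and it assumes each "Number of ..." item is printed on its own line in exactly the wording shown in the examples.

```shell
# Hypothetical helper: read validator output on stdin and print
# "<records read> <valid records> <records that were not valid>".
summarize_counts() {
    awk -F' = ' '
        /^Number of records read/  { read_count = $2 }
        /^Number of valid records/ { valid_count = $2 }
        END { print read_count, valid_count, read_count - valid_count }
    '
}

# Example (2>&1 merges stderr into the pipe, so the counts are found
# whichever stream they are written to):
#   ./bam validate --in test/testFiles/testInvalid.sam 2>&1 | summarize_counts
```

Note that a pipeline reports the exit status of its last command, so check the validator's return value separately if it matters.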