BamUtil: validate
Latest revision as of 14:05, 6 January 2014
Status
The initial version of a SAM/BAM Validator is complete, but does not yet validate all fields or produce all desired statistics. Future releases will add more validation and more statistics.
Download
http://genome.sph.umich.edu/wiki/BamUtil
After compiling, the BAM Validator is found in bamUtil/bin/bam and is the "validate" subprogram (bamUtil/bin/bam validate).
Purpose
The BamValidator processes the specified SAM/BAM file:
- to determine if it has any syntactic or format violations.
- to generate basic statistics.
The user can then decide if they want to use the file for future processing based on whether it passed syntactic/format validation and based on the statistics that were reported.
Valid SAM/BAM File Requirements
A valid SAM/BAM file meets the validation criteria specified in SAM Validation Criteria.
Statistic Generation
Statistics are generated by the BAM Validator unless the --disableStatistics option is set. A description of the generated statistics is found at: Sam File Statistics
Usage
./bam validate --in <inputFile> [--noeof] [--so_flag|--so_coord|--so_query] [--maxErrors <numErrors>] [--verbose] [--printableErrors <numReportedErrors>] [--disableStatistics] [--params]
Recommended Usage
If you don't want the file statistics, use --disableStatistics.
If you want to validate that the file is sorted, use the appropriate sorting flag. If you trust the @HD SO flag, use --so_flag; otherwise, if you want to check that it is sorted by coordinate, use --so_coord.
If you want to see the error details, use --verbose, but if you want to limit the number of errors displayed, use --printableErrors.
If you just want to know whether the file is validly formatted, use --maxErrors 1.
The following will give the most information (without validating that the file is sorted):
./bam validate --in <inputFile> --verbose
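The recommendations above can be captured in a small helper that assembles the command line. This is only a sketch in Python: the function name `build_validate_command`, its keyword arguments, and the default `./bam` path are illustrative assumptions, not part of BamUtil. Only the flags themselves come from this page.

```python
from typing import List, Optional


def build_validate_command(
    input_file: str,
    bam_path: str = "./bam",           # assumed location of the compiled binary
    sort_check: Optional[str] = None,  # "flag", "coord", or "query"
    max_errors: Optional[int] = None,
    verbose: bool = False,
    printable_errors: Optional[int] = None,
    disable_statistics: bool = False,
) -> List[str]:
    """Assemble an argument list for `bam validate` from the documented options."""
    if sort_check not in (None, "flag", "coord", "query"):
        raise ValueError("sort_check must be 'flag', 'coord', 'query', or None")
    if printable_errors is not None and not verbose:
        # This page states that --printableErrors only applies together with --verbose.
        raise ValueError("printable_errors requires verbose=True")

    cmd = [bam_path, "validate", "--in", input_file]
    if sort_check is not None:
        cmd.append("--so_" + sort_check)
    if max_errors is not None:
        cmd += ["--maxErrors", str(max_errors)]
    if verbose:
        cmd.append("--verbose")
    if printable_errors is not None:
        cmd += ["--printableErrors", str(printable_errors)]
    if disable_statistics:
        cmd.append("--disableStatistics")
    return cmd


# Quick "is this file valid?" check: stop at the first bad record, skip statistics.
print(build_validate_command("sample.bam", max_errors=1, disable_statistics=True))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with unusual file names.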
Parameters
Required Parameters:
    --in : the SAM/BAM file to be validated
Optional Parameters:
    --noeof : do not expect an EOF block on a bam file.
    --refFile : the reference file
    --so_flag : validate the file is sorted based on the header's @HD SO flag.
    --so_coord : validate the file is sorted based on the coordinate.
    --so_query : validate the file is sorted based on the query name.
    --maxErrors : Number of records with errors/invalids to allow before quiting.
                  -1 (default) indicates to not quit until the entire file is validated.
                  0 indicates not to read/validate anything.
    --verbose : Print specific error details rather than just a summary
    --printableErrors : Maximum number of records with errors to print the details of
                        before suppressing them when in verbose (defaults to 100)
    --disableStatistics : Turn off statistic generation
    --params : Print the parameter settings
PhoneHome:
    --noPhoneHome : disable PhoneHome (default enabled)
    --phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)
Required Parameters
Input File (--in)
Use --in followed by your file name to specify the SAM/BAM input file.
The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.
A - is used to indicate to read from stdin and the extension is used to determine the file type (no extension indicates SAM).
SAM/BAM/Uncompressed BAM from file | --in yourFileName
SAM from stdin | --in -
BAM from stdin | --in -.bam
Uncompressed BAM from stdin | --in -.ubam
Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
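The stdin naming rule described above can be illustrated with a short Python sketch. The function `stdin_file_type` is hypothetical and only mirrors the mapping given on this page; it does not reproduce how the tool detects the type of a regular file.

```python
def stdin_file_type(in_value: str) -> str:
    """Return the file type implied by a stdin-style --in value such as '-' or '-.bam'."""
    if not in_value.startswith("-"):
        # Regular file names are handled by the program's own automatic detection.
        raise ValueError("not a stdin specifier: " + repr(in_value))
    extension = in_value[1:]
    # Mapping taken from the documentation: no extension means SAM.
    types = {"": "SAM", ".bam": "BAM", ".ubam": "uncompressed BAM"}
    if extension not in types:
        raise ValueError("unrecognized stdin extension: " + repr(extension))
    return types[extension]


for value in ("-", "-.bam", "-.ubam"):
    print(value, "->", stdin_file_type(value))
```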
Optional Parameters
Do not require BGZF EOF block (--noeof)
Use --noeof if you do not expect a trailing eof block in your bgzf file.
By default, the trailing empty block is expected and checked for.
Reference File (--refFile)
Use --refFile followed by the reference file name to specify the reference sequence file.
Validate Sort Order (--so_flag, --so_coord, --so_query)
Validate the sort order of the file:
- --so_flag - based on the flag in the header
- --so_coord - based on the coordinates/positions
- --so_query - based on the query/read names
Maximum Number of Errors (--maxErrors)
Use --maxErrors followed by a number to specify the maximum number of records with errors/invalids to process before quitting.
-1 (default) indicates to not quit until the entire file is validated.
0 indicates not to read/validate anything.
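One way to read these three cases is as a stopping rule on the count of invalid records. The Python sketch below models that reading only; `count_until_max_errors` is a hypothetical function, and the exact point at which the real validator stops is an assumption inferred from the description (for example, that `--maxErrors 1` stops at the first invalid record).

```python
from typing import Iterable, Tuple


def count_until_max_errors(
    record_is_valid: Iterable[bool], max_errors: int = -1
) -> Tuple[int, int]:
    """Return (records_read, invalid_records) under the assumed stopping rule."""
    records_read = 0
    invalid = 0
    if max_errors == 0:
        # 0 means nothing is read or validated.
        return records_read, invalid
    for is_valid in record_is_valid:
        records_read += 1
        if not is_valid:
            invalid += 1
            # -1 disables the limit; otherwise stop once the limit is reached.
            if max_errors != -1 and invalid >= max_errors:
                break
    return records_read, invalid


flags = [True, False, True, False, True]   # validity of five made-up records
print(count_until_max_errors(flags))       # whole file is processed
print(count_until_max_errors(flags, 1))    # stops at the first invalid record
```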
Print Specific Errors (--verbose)
Use --verbose to print specific error details rather than just a summary.
Maximum Number of Record Error Details to Print (--printableErrors)
Use --printableErrors followed by a number to specify the maximum number of records with errors whose details are printed before further details are suppressed. This parameter is only valid when --verbose is also specified.
The default is 100.
Disable Statistic Generation (--disableStatistics)
Use --disableStatistics to turn off statistic generation (statistics are generated by default).
Print the Program Parameters (--params)
Use --params to print the parameters for your program to stderr.
PhoneHome Parameters
See PhoneHome for more information on how PhoneHome works and what it does.
Turn off PhoneHome (--noPhoneHome)
Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.
Adjust the Frequency of PhoneHome (--phoneHomeThinning)
Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).
- By default, --phoneHomeThinning is set to 50, running 50% of the time.
- PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
- N/A if --noPhoneHome is set.
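The thinning rule in the list above amounts to a single modulo comparison. The Python sketch below illustrates it; `should_phone_home` and its arguments are hypothetical names, and how the real program draws its random number is not described here, so `random.randrange` is only a stand-in.

```python
import random
from typing import Optional


def should_phone_home(
    thinning: int = 50,
    no_phone_home: bool = False,
    run_number: Optional[int] = None,
) -> bool:
    """Apply the documented rule: run when (random number mod 100) < thinning."""
    if no_phone_home:
        # --noPhoneHome disables PhoneHome regardless of the thinning value.
        return False
    if not 0 <= thinning <= 100:
        raise ValueError("thinning must be between 0 and 100")
    if run_number is None:
        run_number = random.randrange(2**31)  # stand-in for the run's random number
    return run_number % 100 < thinning


print(should_phone_home(50, run_number=149))  # 149 % 100 = 49, which is below 50
print(should_phone_home(50, run_number=150))  # 150 % 100 = 50, which is not below 50
```

Under this rule a thinning value of 0 never triggers PhoneHome and a value of 100 always does.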
Output
The error details (--verbose) and the statistics are printed to stderr. To save them to a file, redirect stderr.
In a bash shell, redirect stderr with:
./bam validate --in <inputFile> --verbose 2> outputFile.txt
Return Value
- 0: all records are successfully read, are valid, and are properly sorted.
- non-0: at least one record was not successfully read, not valid, or not properly sorted.
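A script can combine the return value with the stderr report. The following Python sketch uses only the standard `subprocess` module; `run_and_check` is a hypothetical helper, and the demonstration runs a stand-in Python command that mimics a failing validation (exit status 7 with a message on stderr) so that it works without a compiled `bam` binary.

```python
import subprocess
import sys
from typing import List, Tuple


def run_and_check(cmd: List[str]) -> Tuple[bool, int, str]:
    """Run a command and return (succeeded, return_code, captured_stderr)."""
    completed = subprocess.run(cmd, stderr=subprocess.PIPE, text=True)
    # A return value of 0 means every record was read, valid, and properly sorted.
    return completed.returncode == 0, completed.returncode, completed.stderr


# Stand-in for something like ["./bam", "validate", "--in", "input.sam", "--verbose"].
stand_in = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('Returning: 7 (INVALID)\\n'); sys.exit(7)",
]
ok, code, report = run_and_check(stand_in)
print(ok, code, report.strip())
```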
Example Outputs
Valid File
./bam validate --in ~/data/bamExample/37mer_alt.bwa.bam
Number of records read = 18900000
Number of valid records = 18900000
TotalReads(e6) 18.90
MappedReads(e6) 14.77
PairedReads(e6) 18.90
ProperPair(e6) 11.28
DuplicateReads(e6) 0.00
QCFailureReads(e6) 0.00
MappingRate(%) 78.17
PairedReads(%) 100.00
ProperPair(%) 59.68
DupRate(%) 0.00
QCFailRate(%) 0.00
TotalBases(e6) 699.30
BasesInMappedReads(e6) 546.67
Returning: 0 (SUCCESS)
Invalid File
./bam validate --in test/testFiles/testInvalid.sam
Number of records read = 32
Number of valid records = 2
Error Counts:
FAIL_PARSE: 17
INVALID: 1
INVALID_QNAME: 3
INVALID_RNAME: 8
INVALID_POS: 2
INVALID_CIGAR: 2
INVALID_QUAL: 2
TotalReads 14.00
MappedReads 14.00
PairedReads 6.00
ProperPair 0.00
DuplicateReads 0.00
QCFailureReads 0.00
MappingRate(%) 100.00
PairedReads(%) 42.86
ProperPair(%) 0.00
DupRate(%) 0.00
QCFailRate(%) 0.00
TotalBases 47.00
BasesInMappedReads 47.00
Returning: 7 (INVALID)
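Summary text like the examples above can be turned into a dictionary for downstream checks. The Python sketch below is a hypothetical parser, not a BamUtil feature. It assumes the layout shown in these examples (record counts, optional error counts, statistics, then the "Returning" line) and may need adjusting if other versions print different text.

```python
import re
from typing import Dict, Optional


def parse_validate_summary(text: str) -> Dict[str, object]:
    """Extract record counts, error counts, statistics, and the return code."""
    def first_int(pattern: str, source: str) -> Optional[int]:
        match = re.search(pattern, source)
        return int(match.group(1)) if match else None

    # Only look for error counts and statistics after the record counts, so that
    # verbose per-record messages printed earlier are not mistaken for counts.
    marker = re.search(r"Number of valid records = \d+", text)
    tail = text[marker.end():] if marker else text

    error_counts = {
        name: int(count) for name, count in re.findall(r"([A-Z_]+): (\d+)", tail)
    }
    statistics = {
        name: float(value)
        for name, value in re.findall(
            r"([A-Za-z]+(?:\([^)]*\))?)\s+(-?\d+\.\d+)", tail
        )
    }
    return {
        "records_read": first_int(r"Number of records read = (\d+)", text),
        "valid_records": first_int(r"Number of valid records = (\d+)", text),
        "error_counts": error_counts,
        "statistics": statistics,
        "return_code": first_int(r"Returning: (-?\d+)", text),
    }


sample = """Number of records read = 32
Number of valid records = 2
Error Counts:
FAIL_PARSE: 17
INVALID: 1
MappingRate(%) 100.00
PairedReads(%) 42.86
Returning: 7 (INVALID)
"""
print(parse_validate_summary(sample))
```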
Invalid File with Verbose
--printableErrors is specified here to keep the example short; printing all of the errors would take up more space.
./bam validate --in test/testFiles/testInvalid.sam --verbose --printableErrors 5
Record 1
INVALID_QNAME (ERROR) : Invalid Query Name - the string length (256) does not match the specified query name length (0).
INVALID_QNAME (WARNING) : Invalid Query Name (QNAME) length: 256. Length with the terminating null must be between 2 & 255.
Record 2
INVALID: 0 length Query Name.
Record 3
INVALID_QNAME (WARNING) : Invalid character in the Query Name (QNAME): ' ' at position 2.
Record 4
FAIL_PARSE: flag, 29M5I3M:F:295, is not an integer.
FAIL_PARSE: Invalid Tag Format: *, should be cc:c:x*.
Record 5
FAIL_PARSE: Too few columns (1) in the Record, expected at least 11.
Number of records read = 32
Number of valid records = 2
Error Counts:
FAIL_PARSE: 17
INVALID: 1
INVALID_QNAME: 3
INVALID_RNAME: 8
INVALID_POS: 2
INVALID_CIGAR: 2
INVALID_QUAL: 2
TotalReads 14.00
MappedReads 14.00
PairedReads 6.00
ProperPair 0.00
DuplicateReads 0.00
QCFailureReads 0.00
MappingRate(%) 100.00
PairedReads(%) 42.86
ProperPair(%) 0.00
DupRate(%) 0.00
QCFailRate(%) 0.00
TotalBases 47.00
BasesInMappedReads 47.00
Returning: 7 (INVALID)