Difference between revisions of "BamUtil: validate"

From Genome Analysis Wiki

Latest revision as of 13:05, 6 January 2014


Status

The initial version of a SAM/BAM Validator is complete, but does not yet validate all fields or produce all desired statistics. Future releases will add more validation and more statistics.

Download

Download BamUtil from: http://genome.sph.umich.edu/wiki/BamUtil

After compiling, the BAM Validator is found in bamUtil/bin/bam and is run as the "validate" subprogram (bamUtil/bin/bam validate).

Purpose

The BamValidator processes the specified SAM/BAM file:

  1. to determine if it has any syntactic or format violations.
  2. to generate basic statistics.

The user can then decide if they want to use the file for future processing based on whether it passed syntactic/format validation and based on the statistics that were reported.


Valid SAM/BAM File Requirements

A valid SAM/BAM file meets the validation criteria specified in SAM Validation Criteria.

Statistic Generation

Statistics are generated by the BAM Validator unless the --disableStatistics option is set. A description of the generated statistics is found at: Sam File Statistics

Usage

	./bam validate --in <inputFile> [--noeof] [--so_flag|--so_coord|--so_query] [--maxErrors <numErrors>] [--verbose] [--printableErrors <numReportedErrors>] [--disableStatistics] [--params]

Recommended Usage

If you don't want the file statistics, use --disableStatistics.

If you want to validate that the file is sorted, use the appropriate sorting flag. If you trust the @HD SO flag, use --so_flag; otherwise, if you want to check that it is sorted by coordinate, use --so_coord.

If you want to see the error details, use --verbose, but if you want to limit the number of errors displayed, use --printableErrors.

If you just want to know whether the file is validly formatted, use --maxErrors 1.

The following will give the most information (without validating that the file is sorted):

./bam validate --in <inputFile> --verbose
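
The recommendations above can be collected into a small helper that assembles the command line. This is only a sketch: the function name `build_validate_cmd` and its keyword arguments are hypothetical, and only flags listed on this page are emitted.

```python
from typing import List, Optional


def build_validate_cmd(
    in_file: str,
    bam_exe: str = "./bam",            # path to the bam executable (assumption)
    sort_check: Optional[str] = None,  # "flag", "coord", or "query"
    max_errors: Optional[int] = None,
    verbose: bool = False,
    printable_errors: Optional[int] = None,
    disable_statistics: bool = False,
) -> List[str]:
    """Assemble an argument list for 'bam validate' from documented flags."""
    cmd = [bam_exe, "validate", "--in", in_file]
    if sort_check is not None:
        if sort_check not in ("flag", "coord", "query"):
            raise ValueError("sort_check must be 'flag', 'coord', or 'query'")
        cmd.append("--so_" + sort_check)
    if max_errors is not None:
        cmd += ["--maxErrors", str(max_errors)]
    if verbose:
        cmd.append("--verbose")
    if printable_errors is not None:
        # The page states --printableErrors is only valid with --verbose.
        if not verbose:
            raise ValueError("printable_errors requires verbose=True")
        cmd += ["--printableErrors", str(printable_errors)]
    if disable_statistics:
        cmd.append("--disableStatistics")
    return cmd
```

For example, `build_validate_cmd("in.bam", max_errors=1, disable_statistics=True)` yields the arguments for a quick "is it valid or not" check without statistics.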

Parameters

	Required Parameters:
		--in : the SAM/BAM file to be validated
	Optional Parameters:
		--noeof             : do not expect an EOF block on a bam file.
		--refFile           : the reference file
		--so_flag           : validate the file is sorted based on the header's @HD SO flag.
		--so_coord          : validate the file is sorted based on the coordinate.
		--so_query          : validate the file is sorted based on the query name.
		--maxErrors         : Number of records with errors/invalids to allow before quiting.
		                      -1 (default) indicates to not quit until the entire file is validated.
		                      0 indicates not to read/validate anything.
		--verbose           : Print specific error details rather than just a summary
		--printableErrors   : Maximum number of records with errors to print the details of
		                      before suppressing them when in verbose (defaults to 100)
		--disableStatistics : Turn off statistic generation
		--params            : Print the parameter settings
	PhoneHome:
		--noPhoneHome       : disable PhoneHome (default enabled)
		--phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)

Required Parameters

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.

Use - as the file name to read from stdin. In that case the extension following the - determines the file type (no extension indicates SAM).

Input source                          Argument
SAM/BAM/Uncompressed BAM from file    --in yourFileName
SAM from stdin                        --in -
BAM from stdin                        --in -.bam
Uncompressed BAM from stdin           --in -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
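
The stdin naming convention described above can be expressed as a simple lookup. The helper name `stdin_in_arg` and the type labels "sam", "bam", and "ubam" are hypothetical; only the resulting --in values come from this page.

```python
# Value to pass after --in when reading from stdin, keyed by input type.
STDIN_ARGS = {
    "sam": "-",        # no extension indicates SAM
    "bam": "-.bam",    # BGZF-compressed BAM
    "ubam": "-.ubam",  # "uncompressed" (compression level 0) BAM
}


def stdin_in_arg(file_type: str) -> str:
    """Return the --in value for reading the given file type from stdin."""
    try:
        return STDIN_ARGS[file_type.lower()]
    except KeyError:
        raise ValueError("file_type must be one of: " + ", ".join(sorted(STDIN_ARGS)))
```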

Optional Parameters

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing EOF block in your BGZF file.

By default, the trailing empty block is expected and checked for.

Reference File (--refFile)

Use --refFile followed by the reference file name to specify the reference sequence file.

Validate Sort Order (--so_flag, --so_coord, --so_query)

Validate the sort order of the file:

  • --so_flag - based on the flag in the header
  • --so_coord - based on the coordinates/positions
  • --so_query - based on the query/read names

Maximum Number of Errors (--maxErrors)

Use --maxErrors followed by a number to specify the maximum number of records with errors/invalids to process before quitting.

-1 (default) indicates to not quit until the entire file is validated.

0 indicates not to read/validate anything.

Print Specific Errors (--verbose)

Use --verbose to print specific error details rather than just a summary.

Maximum Number of Record Error Details to Print (--printableErrors)

Use --printableErrors followed by a number to specify the maximum number of records with errors to print the details of before suppressing them. This parameter is only valid when --verbose is also specified.

The default is 100.

Disable Statistic Generation (--disableStatistics)

Use --disableStatistics to turn off statistic generation (statistics are generated by default).

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

PhoneHome Parameters

See PhoneHome for more information on how PhoneHome works and what it does.

Turn off PhoneHome (--noPhoneHome)

Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.

Adjust the Frequency of PhoneHome (--phoneHomeThinning)

Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).

  • By default, --phoneHomeThinning is set to 50, running 50% of the time.
  • PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
  • N/A if --noPhoneHome is set.
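
The thinning rule in the list above amounts to a single comparison. The sketch below only restates that rule; the function name `phone_home_runs` is hypothetical, and how the program actually draws its random number is not described on this page.

```python
def phone_home_runs(run_random_number: int,
                    thinning: int = 50,
                    no_phone_home: bool = False) -> bool:
    """Decide whether PhoneHome would run, following the rule stated above."""
    if no_phone_home:
        # --noPhoneHome disables PhoneHome entirely; thinning is not applicable.
        return False
    # PhoneHome occurs only if (random number mod 100) < thinning value.
    return run_random_number % 100 < thinning
```

With this rule a thinning value of 0 never triggers PhoneHome and a value of 100 always does.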

Output

The error details (--verbose) and the statistics are printed to stderr. If you want that to go to a file you need to redirect stderr.

For a bash shell, redirect stderr to a file by doing:

./bam validate --in <inputFile> --verbose 2> outputFile.txt


Return Value

  • 0: all records are successfully read, are valid, and are properly sorted.
  • non-0: at least one record was not successfully read, not valid, or not properly sorted.
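
When the validator is called from a script rather than a shell, the same two pieces of information (the stderr report and the return value) can be collected with Python's standard `subprocess` module. This is a minimal sketch; `run_validator` is a hypothetical name, and the demo at the bottom uses a stand-in command so that it runs without bamUtil installed.

```python
import subprocess
import sys
from typing import List, Tuple


def run_validator(cmd: List[str]) -> Tuple[bool, str]:
    """Run a command and return (passed, stderr_text).

    'passed' is True only for return value 0, which this page documents as
    all records read, valid, and properly sorted.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          universal_newlines=True)
    return proc.returncode == 0, proc.stderr


if __name__ == "__main__":
    # Stand-in for ["./bam", "validate", "--in", "<inputFile>", "--verbose"].
    fake = [sys.executable, "-c",
            "import sys; sys.stderr.write('Returning: 7 (INVALID)\\n'); sys.exit(7)"]
    passed, report = run_validator(fake)
    print(passed, report.strip())
```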

Example Outputs

Valid File

./bam validate --in ~/data/bamExample/37mer_alt.bwa.bam

Number of records read = 18900000
Number of valid records = 18900000

TotalReads(e6)	18.90
MappedReads(e6)	14.77
PairedReads(e6)	18.90
ProperPair(e6)	11.28
DuplicateReads(e6)	0.00
QCFailureReads(e6)	0.00

MappingRate(%)	78.17
PairedReads(%)	100.00
ProperPair(%)	59.68
DupRate(%)	0.00
QCFailRate(%)	0.00

TotalBases(e6)	699.30
BasesInMappedReads(e6)	546.67
Returning: 0 (SUCCESS)

Invalid File

./bam validate --in test/testFiles/testInvalid.sam 

Number of records read = 32
Number of valid records = 2

Error Counts:
	FAIL_PARSE: 17
	INVALID: 1
	INVALID_QNAME: 3
	INVALID_RNAME: 8
	INVALID_POS: 2
	INVALID_CIGAR: 2
	INVALID_QUAL: 2

TotalReads	14.00
MappedReads	14.00
PairedReads	6.00
ProperPair	0.00
DuplicateReads	0.00
QCFailureReads	0.00

MappingRate(%)	100.00
PairedReads(%)	42.86
ProperPair(%)	0.00
DupRate(%)	0.00
QCFailRate(%)	0.00

TotalBases	47.00
BasesInMappedReads	47.00
Returning: 7 (INVALID)
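
The statistics block in outputs like the ones above is easy to read back into a program. The sketch below rests on assumptions drawn only from these samples: each statistic is a name followed by whitespace and a decimal number, a name ending in (e6) holds a count in millions, and each percentage is the matching count divided by TotalReads. The names `parse_statistics` and `percent` are hypothetical.

```python
import re
from typing import Dict

# A statistic line: a name without spaces, whitespace, then a decimal number.
STAT_LINE = re.compile(r"^(\S+)\s+(-?\d+\.\d+)$")


def parse_statistics(stderr_text: str) -> Dict[str, float]:
    """Collect name/value statistic lines from captured validator output."""
    stats = {}
    for line in stderr_text.splitlines():
        match = STAT_LINE.match(line.strip())
        if not match:
            continue  # summary, error-count, and "Returning:" lines are skipped
        name, value = match.group(1), float(match.group(2))
        if name.endswith("(e6)"):
            # Assumption: the (e6) suffix marks counts reported in millions.
            name, value = name[:-4], value * 1e6
        stats[name] = value
    return stats


def percent(stats: Dict[str, float], count_name: str) -> float:
    """Recompute a rate as 100 * count / TotalReads (assumed definition)."""
    total = stats.get("TotalReads", 0.0)
    return 100.0 * stats[count_name] / total if total else 0.0
```

For the invalid-file sample, `percent(stats, "PairedReads")` gives 100 * 6 / 14, which rounds to the reported 42.86. Rates recomputed from (e6) counts can differ slightly from the reported ones because those counts are already rounded.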

Invalid File with Verbose

--printableErrors is specified here to keep the example short; without it, the details of every error would be printed.

./bam validate --in test/testFiles/testInvalid.sam --verbose --printableErrors 5

Record 1
INVALID_QNAME (ERROR) : Invalid Query Name - the string length (256) does not match the specified query name length (0).
INVALID_QNAME (WARNING) : Invalid Query Name (QNAME) length: 256.  Length with the terminating null must be between 2 & 255.

Record 2
INVALID: 0 length Query Name.

Record 3
INVALID_QNAME (WARNING) : Invalid character in the Query Name (QNAME): ' ' at position 2.

Record 4
FAIL_PARSE: flag, 29M5I3M:F:295, is not an integer.
FAIL_PARSE: Invalid Tag Format: *, should be cc:c:x*.

Record 5
FAIL_PARSE: Too few columns (1) in the Record, expected at least 11.


Number of records read = 32
Number of valid records = 2

Error Counts:
	FAIL_PARSE: 17
	INVALID: 1
	INVALID_QNAME: 3
	INVALID_RNAME: 8
	INVALID_POS: 2
	INVALID_CIGAR: 2
	INVALID_QUAL: 2

TotalReads	14.00
MappedReads	14.00
PairedReads	6.00
ProperPair	0.00
DuplicateReads	0.00
QCFailureReads	0.00

MappingRate(%)	100.00
PairedReads(%)	42.86
ProperPair(%)	0.00
DupRate(%)	0.00
QCFailRate(%)	0.00

TotalBases	47.00
BasesInMappedReads	47.00
Returning: 7 (INVALID)
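
The summary lines and the "Error Counts:" block shown in the examples above can be extracted in the same spirit. This sketch assumes the layout seen in these samples (an "Error Counts:" header followed by indented NAME: number lines); the function names are hypothetical.

```python
import re
from typing import Dict

SUMMARY_LINE = re.compile(r"^Number of (records read|valid records) = (\d+)$")
COUNT_LINE = re.compile(r"^\s+([A-Z_]+):\s+(\d+)\s*$")


def parse_summary(text: str) -> Dict[str, int]:
    """Return the 'records read' and 'valid records' totals."""
    summary = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            summary[match.group(1)] = int(match.group(2))
    return summary


def parse_error_counts(text: str) -> Dict[str, int]:
    """Return the per-error-type counts listed under 'Error Counts:'."""
    counts = {}
    in_block = False
    for line in text.splitlines():
        if line.strip() == "Error Counts:":
            in_block = True
            continue
        if in_block:
            match = COUNT_LINE.match(line)
            if match:
                counts[match.group(1)] = int(match.group(2))
            else:
                in_block = False  # the block ends at the first other line
    return counts
```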