
From Genome Analysis Wiki
8,720 bytes added ,  18:26, 21 July 2010
[[Category:Software|BamValidator]]
== Status ==

The initial version of a SAM/BAM Validator is complete.
It can be found at: http://www.sph.umich.edu/csg/mktrost/bam/

== Purpose ==

The BamValidator processes the specified SAM/BAM file:
# to determine if it has any [[BamValidator#Valid SAM/BAM File Requirements|syntactic or format violations]].
# to [[BamValidator#Statistic Generation|generate basic statistics]].

The user can then decide whether to use the file for further processing, based on whether it passed syntactic/format validation and on the reported statistics.


=== Valid SAM/BAM File Requirements ===

A valid SAM/BAM file meets the validation criteria specified in [[SAM Validation Criteria]].

=== Statistic Generation ===

The statistics reflect only the alignments that were successfully read from the SAM/BAM file. Alignments that failed to parse are excluded, but alignments that parsed and are invalid for other reasons may still be counted.

The following statistics are generated by the BamValidator unless the <code>--disableStatistics</code> option is set:

{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
|-style="background: #f2f2f2; text-align: center;"
|+ '''Read Counts'''
! Statistic !! Description
|-
|TotalReads
| Total number of alignments that were successfully read from the file.
|-
|MappedReads
| Total number of alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).
|-
|PairedReads
| Total number of alignments that were successfully read from the file with FLAG bit 0x001 set to 1 (paired).
|-
|ProperPair
| Total number of alignments that were successfully read from the file with FLAG bits 0x001 (paired) and 0x002 (proper pair) both set to 1.
|-
|DuplicateReads
| Total number of alignments that were successfully read from the file with FLAG bit 0x400 set to 1 (PCR or optical duplicate).
|-
|QCFailureReads
| Total number of alignments that were successfully read from the file with FLAG bit 0x200 set to 1 (failed quality checks).
|}
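
The Read Counts definitions above can be illustrated with a minimal Python sketch. This is not the BamValidator's own code (the tool is built on the C++ libraries listed under Libraries); the function name <code>count_reads</code> and the constant names are hypothetical, and the input is assumed to be the integer FLAG values of the alignments that were successfully read.

```python
# FLAG bits referenced in the Read Counts table.
FLAG_PAIRED = 0x001
FLAG_PROPER_PAIR = 0x002
FLAG_UNMAPPED = 0x004
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400


def count_reads(flags):
    """Tally the Read Counts statistics from an iterable of integer FLAG values.

    Hypothetical helper: it only mirrors the table definitions and is not
    taken from the BamValidator source.
    """
    counts = dict.fromkeys(
        ["TotalReads", "MappedReads", "PairedReads",
         "ProperPair", "DuplicateReads", "QCFailureReads"], 0)
    for flag in flags:
        counts["TotalReads"] += 1
        if not flag & FLAG_UNMAPPED:      # bit 0x004 is 0: not unmapped
            counts["MappedReads"] += 1
        if flag & FLAG_PAIRED:            # bit 0x001 is 1: paired
            counts["PairedReads"] += 1
            if flag & FLAG_PROPER_PAIR:   # both 0x001 and 0x002 are 1
                counts["ProperPair"] += 1
        if flag & FLAG_DUPLICATE:         # bit 0x400 is 1
            counts["DuplicateReads"] += 1
        if flag & FLAG_QC_FAIL:           # bit 0x200 is 1
            counts["QCFailureReads"] += 1
    return counts
```

Note that in this sketch a record with 0x002 set but 0x001 clear is not counted as ProperPair, following the table's requirement that both bits be set.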

{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
|-style="background: #f2f2f2; text-align: center;"
|+ '''Read Percentages'''
! Statistic !! Description
|-
|MappingRate(%)
| 100 * MappedReads/TotalReads
|-
|PairedReads(%)
| 100 * PairedReads/TotalReads
|-
|ProperPair(%)
| 100 * ProperPair/TotalReads
|-
|DupRate(%)
| 100 * DuplicateReads/TotalReads
|-
|QCFailRate(%)
| 100 * QCFailureReads/TotalReads
|}
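
As a worked example of the formulas above, the following Python sketch derives the percentages from a dictionary of read counts. The function name <code>read_percentages</code> is hypothetical, and returning 0 for an empty file is an assumption of this sketch; how the BamValidator itself handles TotalReads = 0 is not stated here.

```python
def read_percentages(counts):
    """Compute the Read Percentages from a dict of Read Counts.

    Hypothetical helper that applies 100 * X / TotalReads to each count.
    """
    total = counts["TotalReads"]

    def pct(key):
        # Assumption of this sketch: report 0.0 when no reads were counted.
        return 100.0 * counts[key] / total if total else 0.0

    return {
        "MappingRate(%)": pct("MappedReads"),
        "PairedReads(%)": pct("PairedReads"),
        "ProperPair(%)": pct("ProperPair"),
        "DupRate(%)": pct("DuplicateReads"),
        "QCFailRate(%)": pct("QCFailureReads"),
    }
```

With TotalReads = 14 and PairedReads = 6, the values reported in the invalid-file example on this page, PairedReads(%) is 100 * 6 / 14, which rounds to 42.86.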

{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
|-style="background: #f2f2f2; text-align: center;"
|+ '''Base Counts'''
! Statistic !! Description
|-
|TotalBases
| Sum of the SEQ lengths for all alignments that were successfully read from the file.
|-
|BasesInMappedReads
| Sum of the SEQ lengths for all alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).
|}

NOTE: If TotalReads is greater than 10^6, the Read Counts and Base Counts are reported as the total counts divided by 10^6. This is indicated in the output by <code>(e6)</code> appended to the field name.
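
The scaling rule in the note can be sketched in Python as follows. The function name <code>format_count</code> is hypothetical, and the two-decimal formatting is inferred from the example outputs on this page rather than from the tool's source.

```python
def format_count(name, value, total_reads):
    """Return (label, text) for a Read Count or Base Count field.

    Hypothetical helper: the decision to scale depends on TotalReads,
    not on the value being printed, as described in the note.
    """
    if total_reads > 10**6:
        # Report in millions and mark the field name with (e6).
        return f"{name}(e6)", f"{value / 1e6:.2f}"
    return name, f"{value:.2f}"
```

For example, <code>format_count("TotalReads", 18900000, 18900000)</code> gives the label <code>TotalReads(e6)</code> with the text <code>18.90</code>, while <code>format_count("TotalReads", 14, 14)</code> gives <code>TotalReads</code> with <code>14.00</code>.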



== How to Use the Bam Validator Executable ==
=== Parameters ===
<pre>
Required Parameters:
--in : the SAM/BAM file to be validated
Optional Parameters:
--noeof : do not expect an EOF block on a bam file.
--so_flag : validate the file is sorted based on the header's @HD SO flag.
--so_coord : validate the file is sorted based on the coordinate.
--so_query : validate the file is sorted based on the query name.
--maxErrors : Number of records with errors/invalids to allow before quitting.
-1 (default) indicates to not quit until the entire file is validated.
0 indicates not to read/validate anything.
--verbose : Print specific error details rather than just a summary
--printableErrors : Maximum number of records with errors to print the details of
before suppressing them when in verbose (defaults to 100)
--disableStatistics : Turn off statistic generation
</pre>

=== Usage ===

<pre>
./bam validate --in <inputFile> [--noeof] [--so_flag|--so_coord|--so_query] [--maxErrors <numErrors>] [--verbose] [--printableErrors <numReportedErrors>] [--disableStatistics]
</pre>

==== Recommended Usage ====
If you don't want the file statistics, use <code>--disableStatistics</code>.

To validate that the file is sorted, use the appropriate sorting option: <code>--so_flag</code> if you trust the header's @HD SO flag, <code>--so_coord</code> to check that the file is sorted by coordinate, or <code>--so_query</code> to check that it is sorted by query name.

To see the error details, use <code>--verbose</code>. To limit the number of records whose errors are displayed, also use <code>--printableErrors</code>.

If you only want to know whether the file is validly formatted, use <code>--maxErrors 1</code>, so that validation quits once a record with an error is found.

The following will give the most information (without validating that the file is sorted):
<pre>
./bam validate --in <inputFile> --verbose
</pre>


=== Return Value ===
* 0: all records are successfully read, are valid, and are properly sorted.
* non-0: at least one record was not successfully read, not valid, or not properly sorted.
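
The return value makes the validator easy to call from a script. The following Python sketch builds the command line from the options documented under Parameters and treats a return value of 0 as success. The helper names <code>build_validate_command</code> and <code>is_valid</code>, and the default executable path <code>./bam</code>, are assumptions for illustration.

```python
import subprocess

SORT_CHECKS = ("so_flag", "so_coord", "so_query")


def build_validate_command(in_file, sort_check=None, max_errors=None,
                           verbose=False, printable_errors=None,
                           disable_statistics=False, noeof=False,
                           bam_path="./bam"):
    """Build the argument list for 'bam validate' (hypothetical helper)."""
    cmd = [bam_path, "validate", "--in", in_file]
    if noeof:
        cmd.append("--noeof")
    if sort_check is not None:
        if sort_check not in SORT_CHECKS:
            raise ValueError(f"unknown sort check: {sort_check}")
        cmd.append("--" + sort_check)
    if max_errors is not None:
        cmd += ["--maxErrors", str(max_errors)]
    if verbose:
        cmd.append("--verbose")
    if printable_errors is not None:
        cmd += ["--printableErrors", str(printable_errors)]
    if disable_statistics:
        cmd.append("--disableStatistics")
    return cmd


def is_valid(in_file, **options):
    """Run the validator; True only when it returns 0."""
    result = subprocess.run(build_validate_command(in_file, **options))
    return result.returncode == 0
```

For a quick pass/fail check, <code>is_valid("sample.bam", max_errors=1, disable_statistics=True)</code> would run <code>./bam validate --in sample.bam --maxErrors 1 --disableStatistics</code>, where <code>sample.bam</code> is a placeholder file name.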

=== Example Outputs ===

==== Valid File ====
<pre>
./bam validate --in ~/data/bamExample/37mer_alt.bwa.bam

The following parameters are available. Ones with "[]" are in effect:

Input Parameters
--in [/home/mktrost/data/bamExample/37mer_alt.bwa.bam], --noeof,
--maxErrors [-1], --verbose, --printableErrors [100],
--disableStatistics
SortOrder : --so_flag, --so_coord, --so_query

Number of records read = 18900000
Number of valid records = 18900000

TotalReads(e6) 18.90
MappedReads(e6) 14.77
PairedReads(e6) 18.90
ProperPair(e6) 11.28
DuplicateReads(e6) 0.00
QCFailureReads(e6) 0.00

MappingRate(%) 78.17
PairedReads(%) 100.00
ProperPair(%) 59.68
DupRate(%) 0.00
QCFailRate(%) 0.00

TotalBases(e6) 699.30
BasesInMappedReads(e6) 546.67
Returning: 0 (SUCCESS)
</pre>

==== Invalid File ====
<pre>
./bam validate --in test/testFiles/testInvalid.sam

The following parameters are available. Ones with "[]" are in effect:

Input Parameters
--in [test/testFiles/testInvalid.sam], --noeof, --maxErrors [-1], --verbose,
--printableErrors [100], --disableStatistics
SortOrder : --so_flag, --so_coord, --so_query


Number of records read = 32
Number of valid records = 2

Error Counts:
FAIL_PARSE: 17
INVALID: 1
INVALID_QNAME: 3
INVALID_RNAME: 8
INVALID_POS: 2
INVALID_CIGAR: 2
INVALID_QUAL: 2

TotalReads 14.00
MappedReads 14.00
PairedReads 6.00
ProperPair 0.00
DuplicateReads 0.00
QCFailureReads 0.00

MappingRate(%) 100.00
PairedReads(%) 42.86
ProperPair(%) 0.00
DupRate(%) 0.00
QCFailRate(%) 0.00

TotalBases 47.00
BasesInMappedReads 47.00
Returning: 7 (INVALID)
</pre>
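
If the summary needs to be processed further, the Error Counts block can be extracted from the captured output. The Python sketch below assumes the output layout shown in the example above (an <code>Error Counts:</code> line followed by <code>NAME: count</code> lines, ended by a blank line); the function name <code>parse_error_counts</code> is hypothetical, and the layout may differ in other versions of the tool.

```python
import re


def parse_error_counts(output):
    """Extract the 'Error Counts:' block from validator output as a dict.

    Hypothetical helper that relies on the layout shown in the example
    output on this page.
    """
    counts = {}
    in_block = False
    for line in output.splitlines():
        stripped = line.strip()
        if stripped == "Error Counts:":
            in_block = True
            continue
        if in_block:
            match = re.fullmatch(r"([A-Z_]+):\s*(\d+)", stripped)
            if match is None:
                break  # a blank line or the next section ends the block
            counts[match.group(1)] = int(match.group(2))
    return counts
```

A file with no errors has no <code>Error Counts:</code> block in the examples on this page, in which case this sketch returns an empty dictionary.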

==== Invalid File with Verbose ====
<code>--printableErrors 5</code> is specified here to keep the example short: error details are printed for only the first 5 records with errors instead of for all of them.

<pre>
./bam validate --in test/testFiles/testInvalid.sam --verbose --printableErrors 5

The following parameters are available. Ones with "[]" are in effect:

Input Parameters
--in [test/testFiles/testInvalid.sam], --noeof, --maxErrors [-1],
--verbose [ON], --printableErrors [5], --disableStatistics
SortOrder : --so_flag, --so_coord, --so_query

Record 1
INVALID_QNAME (ERROR) : Invalid Query Name - the string length (256) does not match the specified query name length (0).
INVALID_QNAME (WARNING) : Invalid Query Name (QNAME) length: 256. Length with the terminating null must be between 2 & 255.

Record 2
INVALID: 0 length Query Name.

Record 3
INVALID_QNAME (WARNING) : Invalid character in the Query Name (QNAME): ' ' at position 2.

Record 4
FAIL_PARSE: flag, 29M5I3M:F:295, is not an integer.
FAIL_PARSE: Invalid Tag Format: *, should be cc:c:x*.

Record 5
FAIL_PARSE: Too few columns (1) in the Record, expected at least 11.


Number of records read = 32
Number of valid records = 2

Error Counts:
FAIL_PARSE: 17
INVALID: 1
INVALID_QNAME: 3
INVALID_RNAME: 8
INVALID_POS: 2
INVALID_CIGAR: 2
INVALID_QUAL: 2

TotalReads 14.00
MappedReads 14.00
PairedReads 6.00
ProperPair 0.00
DuplicateReads 0.00
QCFailureReads 0.00

MappingRate(%) 100.00
PairedReads(%) 42.86
ProperPair(%) 0.00
DupRate(%) 0.00
QCFailRate(%) 0.00

TotalBases 47.00
BasesInMappedReads 47.00
Returning: 7 (INVALID)
</pre>


== Libraries ==
*[[C++ Library: libbam|libbam.a]]
*[[C++ Library: libcsg|libcsg.a]]
