Difference between revisions of "FastQValidator"
From Genome Analysis Wiki
Revision as of 10:55, 13 January 2010
Status
The FastQ Validator is on our Todo List.
Valid FastQ File Requirements
A valid FastQ file should meet the following requirements:
- A base sequence should have non-zero length.
- A quality string should be present for every base sequence.
- Paired quality and base sequences should be of the same length.
- All quality values should have ASCII codes greater than 32.
- Valid bases are A, C, T, G or N, unless ambiguous bases are explicitly allowed by the application consuming the file. Lower case characters are allowed.
- Every entry in the file should have a unique identifier.
- Reads should be of a minimum length; many mappers will get into trouble with very short reads.
- Base composition should be reported and tracked by position.
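The requirements above could be checked per record roughly as follows. This is only a minimal sketch, not the validator's actual design: the function names `validate_record` and `base_composition`, the `min_length` parameter and the wording of the problem messages are all assumptions made for illustration.

```python
from collections import Counter

# Accepted bases; lower case input is handled by upper-casing before the check.
VALID_BASES = set("ACGTN")


def validate_record(name, bases, quals, seen_names, min_length=1):
    """Return a list of problems found in one FastQ record (empty if valid).

    `seen_names` is a set shared across calls, used to detect duplicate
    identifiers. `min_length` is a hypothetical, caller-chosen threshold.
    """
    problems = []
    if len(bases) == 0:
        problems.append("empty base sequence")
    elif len(bases) < min_length:
        problems.append("read shorter than minimum length")
    if len(quals) == 0:
        problems.append("missing quality string")
    if len(bases) != len(quals):
        problems.append("base and quality lengths differ")
    if any(ord(q) <= 32 for q in quals):
        problems.append("quality value with ASCII code <= 32")
    if any(b not in VALID_BASES for b in bases.upper()):
        problems.append("invalid base character")
    if name in seen_names:
        problems.append("duplicate identifier")
    seen_names.add(name)
    return problems


def base_composition(sequences):
    """Count how often each base occurs at each read position."""
    counts = []
    for seq in sequences:
        for position, base in enumerate(seq.upper()):
            if position == len(counts):
                counts.append(Counter())
            counts[position][base] += 1
    return counts
```

A caller would keep one `seen_names` set for the whole file and pass it to every `validate_record` call. Holding every full name in memory this way is exactly the cost that a key-based approach would aim to reduce.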
Additional Wishlist
There is a series of optional capabilities a FastQ Validator should implement, among them:
- Consume gzipped and uncompressed text files transparently.
- To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory.
- Support color space files, where valid base sequences include the characters 0, 1, 2, 3, 4 instead of A, C, T, G and N.
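The wishlist items above might be approached as in the following sketch. It rests on several assumptions that the page does not state: records occupy exactly four lines each, the "key" is an 8-byte BLAKE2b hash of the name, and the second pass compares full names only for keys that occurred more than once in the first pass. All function names here are hypothetical.

```python
import gzip
import hashlib

COLOR_SPACE_CHARS = set("01234")  # color space alphabet as listed above
BASE_SPACE_CHARS = set("ACGTN")


def open_fastq(path):
    """Open a FastQ file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def name_key(name):
    """Reduce a sequence name to a fixed-size 8-byte key (assumed scheme)."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()


def read_names(path):
    """Yield each record's identifier, assuming four lines per record."""
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                yield line.rstrip("\n")[1:]  # drop the leading '@'


def find_duplicate_names(path):
    """Two-pass duplicate detection that stores keys rather than full names."""
    # Pass 1: keep only short keys; note keys that occur more than once.
    seen, suspects = set(), set()
    for name in read_names(path):
        key = name_key(name)
        if key in seen:
            suspects.add(key)
        seen.add(key)
    # Pass 2: store full names only for suspect keys, which rules out
    # false positives caused by hash collisions.
    full_names, duplicates = set(), set()
    for name in read_names(path):
        if name_key(name) in suspects:
            if name in full_names:
                duplicates.add(name)
            full_names.add(name)
    return duplicates


def is_valid_sequence(sequence, color_space=False):
    """Check a sequence against the base space or color space alphabet."""
    allowed = COLOR_SPACE_CHARS if color_space else BASE_SPACE_CHARS
    return all(ch in allowed for ch in sequence.upper())
```

Detecting compression from the two-byte gzip magic number rather than the file extension lets the same code path handle misnamed files. The second pass costs an extra read of the input, which is the trade made for the lower memory footprint.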