FastQValidator
Status
The FastQ Validator is on our Todo List.
Valid FastQ File Requirements
A valid FastQ file should meet the following requirements:
- A base sequence should have non-zero length.
- A quality string should be present for every base sequence.
- Paired quality and base sequences should be of the same length.
- Valid quality values should all have ASCII codes > 32.
- Valid bases should be ACTG or N, unless ambiguous bases are explicitly allowed by the application consuming the file. Lower case characters are allowed.
- Every entry in the file should have a unique identifier.
- Reads should meet a minimum length, since many mappers have trouble with very short reads.
- Base composition should be tracked and reported by position.
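The requirements above could be checked along the following lines. This is only a minimal sketch, not the planned validator: the function name `validate_fastq`, the `min_read_length` parameter and the error messages are hypothetical. It also assumes that every record occupies exactly four lines (header starting with `@`, bases, a line starting with `+`, qualities) and that the identifier is the first whitespace-delimited token of the header.

```python
from collections import Counter

# Assumption: only A, C, G, T and N are accepted; lower case is allowed.
VALID_BASES = set("ACGTNacgtn")


def validate_fastq(lines, min_read_length=1):
    """Check 4-line FASTQ records.

    Returns (errors, composition), where composition[i] is a Counter of
    the upper-cased bases observed at read position i.
    """
    errors = []
    seen_ids = set()
    composition = []
    text = [line.rstrip("\r\n") for line in lines]
    if len(text) % 4 != 0:
        errors.append("truncated file: line count is not a multiple of 4")
    for start in range(0, len(text) - 3, 4):
        header, bases, plus, quals = text[start:start + 4]
        where = f"record at line {start + 1}"
        if not header.startswith("@"):
            errors.append(f"{where}: header does not start with '@'")
        fields = header[1:].split()
        read_id = fields[0] if fields else ""
        if read_id in seen_ids:
            errors.append(f"{where}: duplicate identifier '{read_id}'")
        seen_ids.add(read_id)
        if not plus.startswith("+"):
            errors.append(f"{where}: separator line does not start with '+'")
        if len(bases) == 0:
            errors.append(f"{where}: empty base sequence")
        elif len(bases) < min_read_length:
            errors.append(f"{where}: read shorter than {min_read_length}")
        if len(quals) != len(bases):
            errors.append(f"{where}: quality and base lengths differ")
        if any(ord(q) <= 32 for q in quals):
            errors.append(f"{where}: quality value with ASCII code <= 32")
        if any(b not in VALID_BASES for b in bases):
            errors.append(f"{where}: invalid base character")
        # Track base composition by position, ignoring case.
        for i, base in enumerate(bases.upper()):
            if i == len(composition):
                composition.append(Counter())
            composition[i][base] += 1
    return errors, composition
```

Calling `validate_fastq(open("reads.fastq"), min_read_length=20)` on a hypothetical file would return the list of problems found together with the per-position base counts. Note that this sketch keeps every identifier in memory, which is the cost the two-pass idea in the wishlist aims to reduce.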
Additional Wishlist
A FastQ Validator should also implement several optional capabilities, including:
- Consume gzipped and uncompressed text files transparently.
- To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory.
- Support color space files, where valid base sequences include the characters 0, 1, 2, 3, 4 instead of A, C, T, G and N.