An initial version of the FastQValidator has been completed.
Valid FastQ File Requirements
A valid FastQ file should meet the following requirements:
- A base sequence should have non-zero length.
- A quality string should be present for every base sequence.
- Paired quality and base sequences should be of the same length.
- Valid quality values should all have ASCII codes > 32.
- Valid bases should be ACTG or N, unless ambiguous bases are explicitly allowed by the application consuming the file. Lower case characters are allowed.
- Every entry in the file should have a unique identifier.
- Reads should be of a minimum length; many mappers will get into trouble with very short reads.
- Base composition should be reported and tracked by position.
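The per-record requirements above can be sketched in a few lines of Python. This is only an illustrative sketch, not the FastQValidator implementation: the function names `check_record` and `base_composition`, the problem messages, and the default `min_length` of 10 are all assumptions made for the example, and ambiguous bases are treated as invalid.

```python
from collections import Counter

# Bases accepted when ambiguous bases are NOT allowed (lower case permitted).
VALID_BASES = set("ACGTNacgtn")


def check_record(name, seq, qual, seen_names, min_length=10):
    """Return a list of problem descriptions for one FastQ record.

    seen_names is a set of identifiers seen so far; it is updated in place.
    min_length is a hypothetical default, not a value from the requirements.
    """
    problems = []
    if len(seq) == 0:
        problems.append("empty base sequence")
    if len(qual) == 0:
        problems.append("missing quality string")
    elif len(qual) != len(seq):
        problems.append("sequence and quality lengths differ")
    if any(ord(c) <= 32 for c in qual):
        problems.append("quality character with ASCII code <= 32")
    if any(c not in VALID_BASES for c in seq):
        problems.append("invalid base character")
    if name in seen_names:
        problems.append("duplicate identifier")
    seen_names.add(name)
    if 0 < len(seq) < min_length:
        problems.append("read shorter than minimum length")
    return problems


def base_composition(seqs):
    """Count bases per read position: result[i] is a Counter for position i."""
    counts = []
    for seq in seqs:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    return counts
```

A caller would keep one `seen_names` set for the whole file and pass every record through `check_record`, then report the counters from `base_composition` once the file has been read.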
There is also a series of optional capabilities a FastQ Validator should implement, including:
- Consume gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
- To reduce memory usage, implement a two-pass algorithm that keeps only a key for each sequence name in memory, rather than the complete name. A pair of options is suggested: -1 for one pass with high memory use (the default) and -2 for two passes with lower memory use.
- Support color space files, where valid base sequences include the characters 0, 1, 2, 3, '.' (period) in addition to A, C, T, G and N (some csfastq sequence lines start with a primer base).
- For color space, there is no formal specification, so the following vary between files:
  - The read and quality string may be the same length or differ by 1 (depending on whether the primer base has a corresponding quality value).
  - Missing values are usually represented by "." or sometimes left as a blank " ".
- Tag names for paired-end reads may be the same (e.g. MAQ enforces this) and may appear in the same file (e.g. BFAST requires paired reads in the same file).
- It may be useful to report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
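As an illustration of the first two optional capabilities, the Python sketch below opens gzipped and plain files through one function and finds duplicate sequence names in two passes while holding only short keys in memory. It does not use libcsg/InputFile.h. The helper names (`open_text`, `name_key`, `read_names`, `duplicate_names_two_pass`), the 8-byte digest size, and the assumption of strict four-line records are all choices made for this example. Using the second pass to confirm suspects against full names is one possible reading of the two-pass idea, not necessarily the intended design.

```python
import gzip
import hashlib


def open_text(path):
    """Open a plain or gzip-compressed text file by checking the gzip magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def name_key(name, nbytes=8):
    """Fixed-size digest that stands in for a full sequence name."""
    return hashlib.blake2b(name.encode(), digest_size=nbytes).digest()


def read_names(path):
    """Yield sequence names, assuming strict four-line FastQ records."""
    with open_text(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:
                yield line.rstrip("\n")[1:]  # drop the leading '@'


def duplicate_names_two_pass(path):
    """Return the set of sequence names that occur more than once."""
    # Pass 1: keep only short digests and note which digests repeat.
    seen, suspect = set(), set()
    for name in read_names(path):
        key = name_key(name)
        if key in seen:
            suspect.add(key)
        seen.add(key)
    del seen
    # Pass 2: keep full names only for suspect digests, so that a hash
    # collision between different names is not reported as a duplicate.
    full, dups = set(), set()
    for name in read_names(path):
        if name_key(name) in suspect:
            if name in full:
                dups.add(name)
            full.add(name)
    return dups
```

A one-pass variant would simply store every full name in a set, which is simpler but uses more memory for long names.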
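The color space notes above can likewise be sketched as a small check. Because there is no formal specification, the rules here are assumptions made for the example: a letter is accepted only as a leading primer base, the remaining characters must be 0, 1, 2, 3 or ".", and the quality string may match the read length or be one shorter when a primer base is present. Blank missing values are not handled. The function name `check_colorspace` is hypothetical.

```python
COLOR_CHARS = set("0123.")
PRIMER_BASES = set("ACGTNacgtn")


def check_colorspace(seq, qual):
    """Return a list of problem descriptions for one color space read."""
    problems = []
    # Some csfastq sequence lines start with a primer base.
    has_primer = bool(seq) and seq[0] in PRIMER_BASES
    colors = seq[1:] if has_primer else seq
    if not colors:
        problems.append("no color calls")
    if any(c not in COLOR_CHARS for c in colors):
        problems.append("invalid color character")
    # The primer base may or may not have its own quality value.
    allowed = {len(seq)}
    if has_primer:
        allowed.add(len(seq) - 1)
    if len(qual) not in allowed:
        problems.append("quality length does not match read length")
    return problems
```

A validator could report the length mismatch as a WARNING rather than an ERROR, given how loosely the format is defined.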