FastQValidator

From Genome Analysis Wiki
Revision as of 18:12, 14 January 2010 by Zhanxw (talk | contribs)

Status

The FastQ Validator is on our Todo List.

Valid FastQ File Requirements

A valid FastQ file should meet the following requirements:

  • A base sequence should have non-zero length.
  • A quality string should be present for every base sequence.
  • Paired quality and base sequences should be of the same length.
  • Valid quality values should all have ASCII codes > 32.
  • Valid bases should be ACTG or N, unless ambiguous bases are explicitly allowed by the application consuming the file. Lower case characters are allowed.
  • Every entry in the file should have a unique identifier.
  • Reads should be of a minimum length; many mappers will get into trouble with very short reads.
  • Base composition should be reported and tracked by position.
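
The requirements above can be sketched as a single pass over parsed records. This is a minimal illustration, not the validator's actual implementation: the function name `validate_records`, the message strings, and the `min_length` default are all hypothetical, and the input is assumed to have already been parsed into (identifier, sequence, quality) tuples.

```python
from collections import Counter, defaultdict

# Bases accepted by default; lower case is allowed per the requirements.
VALID_BASES = set("ACGTNacgtn")


def validate_records(records, min_length=20):
    """Check (identifier, sequence, quality) tuples against the requirements.

    min_length is an arbitrary placeholder; the page does not specify a value.
    Returns (problems, composition), where problems is a list of
    (record_number, message) and composition maps position -> Counter of bases.
    """
    problems = []
    seen_ids = set()
    composition = defaultdict(Counter)

    for number, (identifier, sequence, quality) in enumerate(records, start=1):
        # Non-zero length, then minimum length.
        if len(sequence) == 0:
            problems.append((number, "empty base sequence"))
        elif len(sequence) < min_length:
            problems.append((number, "read shorter than minimum length"))

        # A quality string must be present and match the sequence length.
        if len(quality) == 0:
            problems.append((number, "missing quality string"))
        elif len(quality) != len(sequence):
            problems.append((number, "sequence and quality lengths differ"))

        # Quality characters must have ASCII codes > 32.
        if any(ord(ch) <= 32 for ch in quality):
            problems.append((number, "quality value with ASCII code <= 32"))

        # Bases must be A, C, G, T or N (either case).
        if any(ch not in VALID_BASES for ch in sequence):
            problems.append((number, "invalid base character"))

        # Identifiers must be unique (full names kept in memory here).
        if identifier in seen_ids:
            problems.append((number, "duplicate identifier"))
        seen_ids.add(identifier)

        # Track base composition by position, folding case.
        for position, base in enumerate(sequence):
            composition[position][base.upper()] += 1

    return problems, composition
```

Note that this sketch keeps every full identifier in a set, which is the memory cost the two-pass idea in the wishlist is meant to avoid.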

Additional Wishlist

There is a series of optional capabilities that a FastQ Validator should implement, including:

  • Consume gzipped and uncompressed text files transparently.
  • To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory.
  • Support color space files, where valid base sequences include the characters 0, 1, 2, 3, 4 instead of A, C, T, G and N.
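
The first two wishlist items might look roughly like the sketch below. It is one possible reading, not a specification: `open_fastq`, `name_key` and `find_duplicate_names` are hypothetical names, gzip detection relies on the standard two-byte gzip magic number, and the choice of an 8-byte BLAKE2b digest as the per-name key is an assumption. The first pass stores only keys and notes which keys repeat; the second pass compares full names only for those keys, so hash collisions cannot produce false duplicates.

```python
import gzip
import hashlib
from collections import Counter


def open_fastq(path):
    """Open a FastQ file as text, whether or not it is gzipped."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def name_key(name, digest_size=8):
    """Return a short fixed-size key for a sequence name (assumed scheme)."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=digest_size).digest()


def find_duplicate_names(read_names):
    """Two-pass duplicate detection.

    read_names is a zero-argument callable that returns a fresh iterator of
    sequence names each time it is called, so the input can be read twice.
    """
    # Pass 1: keep only keys, and remember keys seen more than once.
    seen_keys = set()
    candidate_keys = set()
    for name in read_names():
        key = name_key(name)
        if key in seen_keys:
            candidate_keys.add(key)
        else:
            seen_keys.add(key)

    # Pass 2: count full names, but only for the candidate keys.
    counts = Counter()
    for name in read_names():
        if name_key(name) in candidate_keys:
            counts[name] += 1
    return {name for name, count in counts.items() if count > 1}
```

With 8-byte keys, memory use per read is fixed regardless of how long the sequence names are; a larger `digest_size` reduces the number of collision candidates carried into the second pass at the cost of more memory.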

Discussion

  • For color space, there is no specification covering the following points:
  1. The read and its quality string may be the same length, or their lengths may differ by 1.
  2. Missing values are usually represented by "." but are sometimes left as a blank " ".
  3. Tag names for paired-end reads may be identical (MAQ actually enforces this), and both reads of a pair may be in the same file (BFAST requires paired reads in the same file).
  • It may be useful to report two types of information to the user: ERROR (critical failures) and WARNING (tolerable problems).
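
One way the ERROR/WARNING split could be represented is sketched below, using the color-space length ambiguity as the example. The policy shown (a length difference of exactly 1 in color space is a WARNING, any other mismatch is an ERROR) is an assumption made for illustration only, as are the names `Severity`, `Finding`, `check_lengths` and `summarize`.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    ERROR = "ERROR"      # critical failure
    WARNING = "WARNING"  # tolerable problem


@dataclass(frozen=True)
class Finding:
    severity: Severity
    record: int
    message: str


def check_lengths(record, sequence, quality, color_space=False):
    """Return a Finding for a length mismatch, or None if lengths agree.

    Assumed policy: in color space a difference of exactly 1 is tolerated
    with a WARNING; every other mismatch is an ERROR.
    """
    difference = abs(len(sequence) - len(quality))
    if difference == 0:
        return None
    if color_space and difference == 1:
        return Finding(Severity.WARNING, record,
                       "read and quality lengths differ by 1")
    return Finding(Severity.ERROR, record,
                   "read and quality lengths differ")


def summarize(findings):
    """Return (is_valid, error_count, warning_count); only errors fail a file."""
    errors = sum(1 for f in findings if f.severity is Severity.ERROR)
    warnings = sum(1 for f in findings if f.severity is Severity.WARNING)
    return errors == 0, errors, warnings
```

Keeping the severity on each finding lets the caller decide whether warnings should be printed, counted, or promoted to errors for stricter downstream tools.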