FastQValidator

From Genome Analysis Wiki
Revision as of 12:43, 19 February 2010 by Mktrost (talk | contribs) (Status)

Status

The FastQ Validator is on our Todo List.

An initial version of FastQFile, which includes validation methods, has been completed.

Valid FastQ File Requirements

A valid FastQ file should meet the following requirements:

  • A base sequence should have non-zero length.
  • A quality string should be present for every base sequence.
  • Paired quality and base sequences should be of the same length.
  • Valid quality values should all have ASCII codes > 32.
  • Valid bases should be A, C, T, G or N, unless ambiguous bases are explicitly allowed by the application consuming the file. Lowercase characters are allowed.
  • Every entry in the file should have a unique identifier.
  • Reads should meet a minimum length; many mappers have trouble with very short reads.
  • Base composition should be reported and tracked by position.
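The requirements above can be sketched as a per-record check. This is a minimal illustration in Python, not the FastQFile implementation: the function names, the message strings and the default minimum length of 10 are all assumptions made for the example (the page does not state a minimum length).

```python
from collections import Counter, defaultdict

# A, C, T, G and N in either case; ambiguous bases are not allowed here.
VALID_BASES = set("ACGTNacgtn")


def validate_record(name, bases, quality, seen_names, min_length=10):
    """Return a list of problems found in one record (hypothetical helper).

    min_length=10 is an arbitrary placeholder, not a value from this page.
    """
    problems = []
    if len(bases) == 0:
        problems.append("empty base sequence")
    elif len(bases) < min_length:
        problems.append(f"read shorter than {min_length} bases")
    if len(quality) == 0:
        problems.append("missing quality string")
    elif len(quality) != len(bases):
        problems.append("quality and base sequence lengths differ")
    if any(ord(q) <= 32 for q in quality):
        problems.append("quality value with ASCII code <= 32")
    if any(b not in VALID_BASES for b in bases):
        problems.append("invalid base character")
    if name in seen_names:
        problems.append("duplicate identifier")
    seen_names.add(name)
    return problems


def base_composition(reads):
    """Count bases by position, folding lowercase into uppercase."""
    counts = defaultdict(Counter)
    for bases in reads:
        for position, base in enumerate(bases):
            counts[position][base.upper()] += 1
    return counts
```

A caller would keep one seen_names set per file and pass it to every call, so that the identifier check covers the whole file.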

Additional Wishlist

A FastQ Validator should also implement a series of optional capabilities, including:

  • Consume gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
  • To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than the complete sequence name) in memory. A suggested pair of options: -1 for one pass with high memory use, -2 for two passes with lower memory use; the default is -1.
  • Support color space files, where valid base sequences include the characters 0, 1, 2, 3, '.' (period) in addition to A, C, T, G and N (some csfastq sequence lines start with a primer base).
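The first two wishlist items might look like the sketch below. It is written in Python for brevity and does not use libcsg/InputFile.h. Several things are assumptions for illustration: the function names, the strict four-line record layout, the use of an 8-byte BLAKE2b digest as the per-name key, and the idea that the second pass re-reads the file to compare full names only where keys collided.

```python
import gzip
import hashlib
from collections import Counter


def open_text(path):
    """Open a plain or gzipped text file, detected by the gzip magic bytes."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_names(path):
    """Yield each record's identifier, assuming strict four-line records."""
    with open_text(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                yield line[1:].strip()  # drop the leading '@'


def name_key(name):
    """Compact 8-byte key for a name; distinct names may share a key."""
    return hashlib.blake2b(name.encode(), digest_size=8).digest()


def duplicates_one_pass(path):
    """One pass: keep every full name in memory."""
    counts = Counter(read_names(path))
    return sorted(name for name, count in counts.items() if count > 1)


def duplicates_two_pass(path):
    """Two passes: keep keys first, then full names only for repeated keys."""
    key_counts = Counter(name_key(name) for name in read_names(path))
    suspect = {key for key, count in key_counts.items() if count > 1}
    full_counts = Counter(
        name for name in read_names(path) if name_key(name) in suspect
    )
    return sorted(name for name, count in full_counts.items() if count > 1)
```

Both functions return the same list of duplicated names. The two-pass variant trades a second read of the file for holding fixed-size keys instead of complete names during the first pass.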

Discussion

  • Color space has no formal specification, so files differ in several ways:
  1. The read and its quality string may be the same length or differ in length by 1 (depending on whether the primer base has a corresponding quality value).
  2. Missing values are usually represented by "." but are sometimes left as a blank " ".
  3. Tag names for paired-end reads may be the same (e.g. MAQ enforces that), and both reads of a pair may be in the same file (e.g. BFAST requires paired reads in the same file).
  • It may be useful to report two types of information to the user: ERROR (critical failure) and WARNING (tolerable problem).
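As an illustration of the ERROR/WARNING split applied to the color space points above, here is a small Python sketch. Which condition maps to which severity is purely an assumption made for the example, as are the names Severity and check_color_space and the rule that only the first character may be a primer base.

```python
from enum import Enum


class Severity(Enum):
    ERROR = "ERROR"      # critical failure
    WARNING = "WARNING"  # tolerable problem


COLOR_CHARS = set("0123.")
PRIMER_BASES = set("ACGTNacgtn")


def check_color_space(sequence, quality):
    """Return (severity, message) pairs for one color space read."""
    findings = []
    has_primer = bool(sequence) and sequence[0] in PRIMER_BASES
    colors = sequence[1:] if has_primer else sequence

    if not colors:
        findings.append((Severity.ERROR, "empty color sequence"))
    if " " in colors:
        # Assumption: a blank used for a missing value is tolerated.
        findings.append((Severity.WARNING, "blank used for a missing value"))
    if any(c not in COLOR_CHARS and c != " " for c in colors):
        findings.append((Severity.ERROR, "invalid color space character"))

    same_length = len(quality) == len(sequence)
    # The primer base may or may not carry its own quality value.
    primer_without_quality = has_primer and len(quality) == len(sequence) - 1
    if not (same_length or primer_without_quality):
        findings.append((Severity.ERROR, "quality and read lengths differ"))
    return findings
```

For example, check_color_space("T01 3", "IIII") yields a single WARNING, while a read whose quality string is two or more characters short yields an ERROR.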