Difference between revisions of "FastQValidator"

From Genome Analysis Wiki

Revision as of 10:55, 13 January 2010

Status

The FastQ Validator is on our Todo List.

Valid FastQ File Requirements

A valid FastQ file should meet the following requirements:

  • A base sequence should have non-zero length.
  • A quality string should be present for every base sequence.
  • Paired quality and base sequences should be of the same length.
  • Valid quality values should all have ASCII codes > 32.
  • Valid bases should be A, C, T, G or N, unless ambiguous bases are explicitly allowed by the application consuming the file. Lowercase characters are also allowed.
  • Every entry in the file should have a unique identifier.
  • Reads should meet a minimum length; many read mappers fail or behave poorly on very short reads.
  • Base composition should be reported and tracked by position.
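
The requirements above can be sketched in a few lines of Python. This is only an illustrative sketch, not the validator's actual implementation: the function names `check_record` and `validate`, the `min_length` parameter, and the `(name, seq, qual)` tuple layout are all assumptions made for the example. It assumes the entries have already been parsed from the file.

```python
from collections import Counter

VALID_BASES = set("ACGTN")  # lowercase input is accepted by upper-casing first


def check_record(name, seq, qual, min_length=1):
    """Return a list of problems found in one FastQ entry (empty if none)."""
    problems = []
    if len(seq) == 0:
        problems.append("empty base sequence")
    if len(qual) == 0:
        problems.append("missing quality string")
    elif len(qual) != len(seq):
        problems.append("sequence and quality lengths differ")
    if any(ord(q) <= 32 for q in qual):
        problems.append("quality value with ASCII code <= 32")
    if any(b not in VALID_BASES for b in seq.upper()):
        problems.append("invalid base character")
    if 0 < len(seq) < min_length:
        problems.append("read shorter than minimum length")
    return problems


def validate(records, min_length=1):
    """Check (name, seq, qual) tuples; return (errors, per-position base counts)."""
    seen = set()
    composition = []  # composition[i] counts the bases seen at position i
    errors = []
    for name, seq, qual in records:
        for problem in check_record(name, seq, qual, min_length):
            errors.append((name, problem))
        if name in seen:
            errors.append((name, "duplicate identifier"))
        seen.add(name)
        for i, base in enumerate(seq.upper()):
            if i == len(composition):
                composition.append(Counter())
            composition[i][base] += 1
    return errors, composition
```

Calling `validate(records, min_length=20)` on an iterable of parsed entries returns the list of `(name, problem)` pairs together with one `Counter` per read position, which can be printed as a base composition table. Note that this sketch keeps every full sequence name in memory.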

Additional Wishlist

A FastQ Validator should also implement several optional capabilities, including:

  • Consume gzipped and uncompressed text files transparently.
  • To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory.
  • Support color space files, where valid base sequences include the characters 0, 1, 2, 3, 4 instead of A, C, T, G and N.
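
One possible reading of the wishlist items above, sketched in Python. Everything named here is an assumption for illustration: the helper names, the use of a CRC-32 as the per-name key, and the simplification that each entry occupies exactly four lines. The page does not say how the two passes should work; this sketch assumes the first pass records only keys and the second pass compares full names solely for entries whose key was seen more than once.

```python
import gzip
import zlib

COLOR_SPACE_BASES = set("01234")  # per the list above, used instead of A, C, T, G, N


def open_text(path):
    """Open a plain or gzipped text file by checking the gzip magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


def read_names(path):
    """Yield the identifier of each entry, assuming four lines per entry."""
    with open_text(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                yield line.rstrip("\r\n")[1:]  # drop the leading '@'


def name_key(name):
    """Hypothetical compact key: a 32-bit CRC of the sequence name."""
    return zlib.crc32(name.encode("utf-8"))


def find_duplicate_names(path):
    """Two passes: find repeated keys, then confirm on full names."""
    seen_keys, repeated_keys = set(), set()
    for name in read_names(path):  # pass 1: keys only
        key = name_key(name)
        if key in seen_keys:
            repeated_keys.add(key)
        seen_keys.add(key)
    seen_names, duplicates = set(), set()
    for name in read_names(path):  # pass 2: full names for suspect keys only
        if name_key(name) in repeated_keys:
            if name in seen_names:
                duplicates.add(name)
            seen_names.add(name)
    return duplicates
```

Because distinct names can share a key, the second pass is what separates true duplicates from key collisions. In Python the memory saved by storing integers rather than strings is modest; the sketch is meant to show the shape of the idea, which pays off more in a language with fixed-width integer sets or bit arrays.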