Difference between revisions of "FastQValidator"

From Genome Analysis Wiki

Revision as of 18:09, 14 January 2010

Status

The FastQ Validator is on our Todo List.

Valid FastQ File Requirements

A valid FastQ file should meet the following requirements:

  • A base sequence should have non-zero length.
  • A quality string should be present for every base sequence.
  • Paired quality and base sequences should be of the same length.
  • Valid quality values should all have ASCII codes > 32.
  • Valid bases should be ACTG or N, unless ambiguous bases are explicitly allowed by the application consuming the file. Lower case characters are allowed.
  • Every entry in the file should have a unique identifier.
  • Reads should be of a minimum length; many mappers will get into trouble with very short reads.
  • Base composition should be reported and tracked by position.
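
The requirements above could be checked record by record. The following is a minimal sketch, not the validator's actual implementation: the function names, the `min_length` parameter, and the choice to reject ambiguity codes are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

VALID_BASES = set("ACGTN")  # assumption: ambiguous bases are not allowed


def validate_record(name, bases, quals, min_length=1, seen_names=None):
    """Return a list of error strings for one FastQ record (empty if valid)."""
    errors = []
    if len(bases) == 0:
        errors.append("base sequence has zero length")
    elif len(bases) < min_length:
        errors.append(f"read shorter than minimum length {min_length}")
    if len(quals) == 0:
        errors.append("quality string is missing")
    elif len(quals) != len(bases):
        errors.append("quality and base sequences differ in length")
    # Quality characters must have ASCII codes greater than 32.
    if any(ord(q) <= 32 for q in quals):
        errors.append("quality value with ASCII code <= 32")
    # Lower case bases are accepted by upper-casing before the lookup.
    if any(b.upper() not in VALID_BASES for b in bases):
        errors.append("invalid base character")
    # Uniqueness is only checked when the caller supplies a set of names.
    if seen_names is not None:
        if name in seen_names:
            errors.append("duplicate identifier")
        seen_names.add(name)
    return errors


def update_base_composition(composition, bases):
    """Tally upper-cased bases by position; composition maps position -> Counter."""
    for position, base in enumerate(bases):
        composition[position][base.upper()] += 1
```

A caller would pass one shared `set()` as `seen_names` and one shared `defaultdict(Counter)` as `composition` for the whole file, then report the per-position counts at the end.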

Additional Wishlist

There is a series of optional capabilities that a FastQ Validator should implement, including:

  • Consume gzipped and uncompressed text files transparently.
  • To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory.
  • Support color space files, where valid base sequences include the characters 0, 1, 2, 3, 4 instead of A, C, T, G and N.
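
The first two wishlist items might look like the sketch below. It is one possible reading, under stated assumptions: the helper names are hypothetical, records are assumed to span exactly four lines, the whole header line after `@` is treated as the sequence name, and the "key" is taken to be a short BLAKE2b digest. The second pass exists only to rule out hash collisions among keys that repeated in the first pass.

```python
import gzip
import hashlib


def open_fastq(path):
    """Open a FastQ file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "rt")


def name_key(name, digest_size=8):
    """Return a fixed-size key for a sequence name (hypothetical helper)."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=digest_size).digest()


def read_names(path):
    """Yield the name from each record (assumes four lines per record)."""
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:
                yield line.rstrip("\n")[1:]  # drop the leading '@'


def find_duplicate_names(path):
    """Return the sorted list of sequence names that occur more than once."""
    # Pass 1: remember only compact keys and note which keys repeat.
    seen, suspect = set(), set()
    for name in read_names(path):
        key = name_key(name)
        (suspect if key in seen else seen).add(key)
    # Pass 2: keep full names only for suspect keys, to rule out collisions.
    counts = {}
    for name in read_names(path):
        if name_key(name) in suspect:
            counts[name] = counts.get(name, 0) + 1
    return sorted(n for n, c in counts.items() if c > 1)
```

With this layout, memory grows with the number of records times the digest size rather than with the length of the names; full names are held only for the few keys that repeat.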

Discussion

  • For color space, there is no specification covering the following points:
  1. The lengths of the read and the quality string may be the same or differ by 1.
  2. Missing values are usually represented by "." or sometimes left blank.
  3. Tag names for paired-end reads may be the same (e.g. MAQ actually enforces this) and may appear in the same file (e.g. BFAST requires paired reads to be in the same file).
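
Because there is no specification, a color space check has to be tolerant. The sketch below is one way to accommodate the first two points; it is an assumption-laden illustration, not agreed behaviour. In particular, it assumes that a read may start with a single primer base (A, C, G or T) ahead of the color characters, that "." marks a missing value, and that a length difference of 1 in either direction is acceptable. The function name is hypothetical.

```python
COLOR_CHARS = set("01234.")  # assumption: '.' is accepted as a missing value


def check_color_space_record(read, quals):
    """Return a list of problems for one color space record (empty if none)."""
    problems = []
    # Assumption: an optional leading primer base precedes the color characters.
    colors = read[1:] if read[:1].upper() in set("ACGT") else read
    if any(c not in COLOR_CHARS for c in colors):
        problems.append("invalid color character")
    # The read and quality string may be the same length or differ by 1.
    if abs(len(read) - len(quals)) > 1:
        problems.append("read and quality lengths differ by more than 1")
    return problems
```

Blank missing values and shared tag names for paired-end reads are not handled here; both would need a decision on the expected file layout first.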