LibStatGen: FASTQ

From Genome Analysis Wiki
Revision as of 17:44, 3 February 2010

Validation Criteria

Sequence Identifier Line

  • Every entry in the file should have a unique identifier.
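
The uniqueness check can be sketched as follows. This is a minimal illustration in Python, not the validator's own code; the function name find_duplicate_ids is hypothetical, and it assumes strict four-line records with the identifier taken as the text between the leading '@' and the first whitespace.

```python
def find_duplicate_ids(lines):
    """Return (identifier, first_line, repeat_line) for every repeated identifier."""
    seen = {}          # identifier -> line number of its first occurrence
    duplicates = []
    for line_no, line in enumerate(lines, start=1):
        # Assumption: strict four-line records, so identifier lines are 1, 5, 9, ...
        if line_no % 4 != 1:
            continue
        # Assumption: the identifier ends at the first whitespace after '@'.
        fields = line[1:].split()
        ident = fields[0] if fields else ""
        if ident in seen:
            duplicates.append((ident, seen[ident], line_no))
        else:
            seen[ident] = line_no
    return duplicates
```

Holding every identifier in a dictionary is simple but memory-hungry for large files, which is the cost the wishlist item further down aims to reduce.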

Raw Sequence Line

  • A base sequence should have non-zero length.
  • Each base sequence is validated against the character set selected by configuration (the -b option):
    • Base Only: A C T G N a c t g n
    • Color Space Only: 0 1 2 3 .(period)
    • Base or Color Space: A C T G N a c t g n 0 1 2 3 .(period)
  • Reads should meet a minimum length (set with the -l option); many mappers have trouble with very short reads.
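
As a rough illustration of the raw sequence rules above, the sketch below checks one sequence in Python. The names ALLOWED and check_raw_sequence are hypothetical; the type codes B, C and BC and the default minimum length of 10 mirror the -b and -l options described later on this page. The wording of the error messages is invented for the example.

```python
# Allowed characters per raw sequence type, taken from the lists above.
ALLOWED = {
    "B": set("ACTGNactgn"),          # base only
    "C": set("0123."),               # color space only
    "BC": set("ACTGNactgn0123."),    # base or color space
}

def check_raw_sequence(seq, seq_type="B", min_len=10):
    """Return a list of error messages for one raw sequence line (empty if valid)."""
    errors = []
    if len(seq) == 0:
        errors.append("sequence is empty")
    elif len(seq) < min_len:
        errors.append(f"sequence length {len(seq)} is below minimum {min_len}")
    # Report each disallowed character once, in sorted order.
    bad = sorted(set(seq) - ALLOWED[seq_type])
    if bad:
        errors.append("invalid characters: " + "".join(bad))
    return errors
```

For example, check_raw_sequence("ACGT0123..", "B") reports the color space characters as invalid, while the same sequence passes with seq_type="BC".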

Plus Line

Quality String Line

  • A quality string should be present for every base sequence.
  • Paired quality and base sequences should be of the same length.
  • Every character in a quality string should have an ASCII code greater than 32.
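
The three quality string rules above could be checked along these lines. This is a Python sketch under stated assumptions: check_quality is a hypothetical name, the messages are invented, and positions are reported 1-based.

```python
def check_quality(seq, qual):
    """Return a list of error messages for one quality string (empty if valid)."""
    errors = []
    if len(qual) == 0:
        errors.append("quality string is missing")
    elif len(qual) != len(seq):
        errors.append(
            f"quality length {len(qual)} does not match sequence length {len(seq)}"
        )
    # ASCII 32 is the space character, so spaces and control characters are rejected.
    bad_positions = [i for i, ch in enumerate(qual, start=1) if ord(ch) <= 32]
    if bad_positions:
        errors.append(f"invalid quality characters at positions {bad_positions}")
    return errors
```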

Additional Features

  • Base composition is tracked and reported by position.
  • Consumes gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
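
Both features can be illustrated with a short Python sketch. It does not reproduce libcsg/InputFile.h; open_text and base_composition are hypothetical names, and detecting gzip input from its two magic bytes is one common approach, assumed here for illustration.

```python
import gzip
from collections import Counter

def open_text(path):
    """Open a plain or gzipped text file, detected from the gzip magic bytes."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")

def base_composition(sequences):
    """Return a list of Counters; entry i counts the bases seen at position i."""
    counts = []
    for seq in sequences:
        for i, base in enumerate(seq):
            if i == len(counts):
                counts.append(Counter())   # first read long enough to reach position i
            counts[i][base] += 1
    return counts
```

The caller is responsible for closing the handle returned by open_text, for instance by using it in a with statement.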

Additional Wishlist - Not Implemented

  • To reduce memory usage, implement a two-pass algorithm that keeps only a key for each sequence name in memory, rather than the complete name. Suggested options: -1 for one pass with high memory use (the default), -2 for two passes with lower memory use.
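
One possible reading of this wishlist item is sketched below in Python. Nothing here is implemented in the validator: name_key and find_duplicates_two_pass are hypothetical names, and the 8-byte BLAKE2 digest used as the key is an assumption chosen for the example. The first pass stores only keys and marks any key seen twice as suspect; the second pass keeps full names only for suspect keys, so a key collision cannot produce a false duplicate.

```python
import hashlib

def name_key(name):
    """Compact 8-byte key for a sequence name (the hash choice is an assumption)."""
    return hashlib.blake2b(name.encode(), digest_size=8).digest()

def find_duplicates_two_pass(read_names):
    """read_names is a zero-argument callable returning a fresh iterator of names."""
    # Pass 1: keep only keys; a key seen twice marks a possible duplicate.
    seen, suspect = set(), set()
    for name in read_names():
        key = name_key(name)
        if key in seen:
            suspect.add(key)
        else:
            seen.add(key)
    del seen  # the full key set is no longer needed

    # Pass 2: store complete names only for suspect keys and confirm real repeats.
    full_names, duplicates = set(), []
    for name in read_names():
        if name_key(name) in suspect:
            if name in full_names:
                duplicates.append(name)
            else:
                full_names.add(name)
    return duplicates
```

The input is a callable rather than an iterator because the data has to be read twice; for a file, the callable would reopen it and yield each sequence name.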


Assumptions

How to Use the fastQValidator Executable

Required Parameters:

       -f  :  FastQ filename with path to be processed.

Optional Parameters:

       -l  :  Minimum allowed read length (Defaults to 10).
       -e  :  Maximum number of errors to display before suppressing them (Defaults to 20).
       -b  :  Raw sequence type:  B - ACTGN only (Default)
                                  C - 0123. only
                                 BC - ACTGN or 0123.

Testing only Parameters:

       -t  :  If "ReadOnly" is specified, the fastq will be read but not processed.  This may be used for determining read time.

Usage:

       ./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>

Examples:

       ../fastQValidator -f testFile.txt
       ../fastQValidator -f testFile.txt -l 10 -b BC -e 100
       ./fastQValidator -f test/testFile.txt -l 10 -b BC -e 100
       time ./fastQValidator -f test/testFile.txt -t ReadOnly

FastQ Validator Output

Coming Soon