LibStatGen: FASTQ

From Genome Analysis Wiki

Validation Criteria

Sequence Identifier Line

Validation Criteria: Every entry in the file should have a unique identifier.
Error Message: ERROR on Line <current line #>: Repeated Sequence Identifier: <identifier> at Lines <previous line #> <current line #>
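The uniqueness check can be illustrated with a minimal Python sketch. This is not the LibStatGen implementation; the function name `find_repeated_identifiers` is hypothetical, and the sketch assumes strict four-line records and that the identifier is the first space-delimited token after the leading `@`.

```python
def find_repeated_identifiers(lines):
    """Return error messages for sequence identifiers seen more than once.

    `lines` is an iterable of FASTQ lines; line numbers are 1-based.
    Assumption: records are exactly four lines long, so identifier
    lines are lines 1, 5, 9, ...
    """
    first_seen = {}  # identifier -> line number of its first occurrence
    errors = []
    for line_no, line in enumerate(lines, start=1):
        if (line_no - 1) % 4 != 0:
            continue  # not a sequence identifier line
        # Assumption: drop the leading '@' and keep the first token only.
        identifier = line.rstrip("\n")[1:].split(" ", 1)[0]
        if identifier in first_seen:
            errors.append(
                f"ERROR on Line {line_no}: Repeated Sequence Identifier: "
                f"{identifier} at Lines {first_seen[identifier]} {line_no}"
            )
        else:
            first_seen[identifier] = line_no
    return errors
```

Because every identifier is held in a dictionary, memory use grows with the number of reads in the file.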

Raw Sequence Line

  • A base sequence should have non-zero length.
  • Validates the base sequences against the characters allowed via configuration.
    • Base Only: A C T G N a c t g n
    • Color Space Only: 0 1 2 3 .(period)
    • Base or Color Space: A C T G N a c t g n 0 1 2 3 .(period)
  • Reads should meet a minimum length (set with -l, default 10); many mappers have trouble with very short reads.
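The raw sequence checks above can be sketched as follows. This is an illustrative Python sketch, not the validator's own code: the names `ALLOWED` and `check_raw_sequence` and the wording of the problem messages are assumptions. The character sets and the default minimum length of 10 come from this page.

```python
# Allowed characters per raw sequence type, keyed by the -b option values.
ALLOWED = {
    "B": set("ACTGNactgn"),          # base only
    "C": set("0123."),               # color space only
    "BC": set("ACTGNactgn0123."),    # base or color space
}

def check_raw_sequence(sequence, seq_type="B", min_read_len=10):
    """Return a list of problems found in one raw sequence line."""
    problems = []
    if len(sequence) == 0:
        problems.append("zero-length sequence")
    elif len(sequence) < min_read_len:
        problems.append(
            f"read length {len(sequence)} is below minimum {min_read_len}"
        )
    # Collect every distinct character outside the configured set.
    bad = sorted(set(sequence) - ALLOWED[seq_type])
    if bad:
        problems.append("invalid characters: " + "".join(bad))
    return problems
```

An empty list means the sequence passed all three checks for the chosen type.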

Plus Line

Quality String Line

  • A quality string should be present for every base sequence.
  • Paired quality and base sequences should be of the same length.
  • Valid quality values should all have ASCII codes > 32.
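A minimal Python sketch of the three quality string checks follows. The function name `check_quality` and the message wording are hypothetical; only the rules themselves (presence, equal length, ASCII codes above 32) are taken from the list above.

```python
def check_quality(sequence, quality):
    """Return a list of problems for one sequence/quality pair."""
    problems = []
    if not quality:
        # Without a quality string the remaining checks are meaningless.
        problems.append("missing quality string")
        return problems
    if len(quality) != len(sequence):
        problems.append(
            f"quality length {len(quality)} != sequence length {len(sequence)}"
        )
    # 1-based positions of characters whose ASCII code is not > 32.
    bad_positions = [
        i for i, ch in enumerate(quality, start=1) if ord(ch) <= 32
    ]
    if bad_positions:
        problems.append(
            f"quality characters with ASCII code <= 32 at positions {bad_positions}"
        )
    return problems
```

ASCII code 32 is the space character, so the last check rejects spaces and control characters inside a quality string.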

Additional Features

  • Base composition is tracked and reported by position.
  • Consumes gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
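Both features can be illustrated with a short Python sketch. It does not reflect how libcsg/InputFile.h works internally. The helper names `open_text` and `base_composition` are hypothetical, gzip detection by magic bytes is one possible approach, and folding lowercase bases into uppercase is an assumption made here for compact counts.

```python
import gzip
from collections import Counter

def open_text(path):
    """Open a plain or gzipped text file, detected by the gzip magic bytes."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")

def base_composition(sequences):
    """Return one Counter per read position (index 0 is the first base)."""
    per_position = []
    for seq in sequences:
        for i, base in enumerate(seq):
            if i == len(per_position):
                per_position.append(Counter())  # first read this long
            # Assumption: count 'a' and 'A' as the same base.
            per_position[i][base.upper()] += 1
    return per_position
```

Reads of different lengths are handled by growing the list as longer reads appear, so later positions simply have smaller totals.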

Additional Wishlist - Not Implemented

  • To reduce memory usage, implement a two-pass algorithm that keeps only a key for each sequence name in memory, rather than the complete name. Suggested options: -1 for one pass with high memory use (the default) and -2 for two passes with lower memory use.
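One way the two-pass idea might work is sketched below in Python. Since the feature is not implemented, everything here is an assumption: the function names, the use of an 8-byte BLAKE2b digest as the key, and the choice to resolve possible hash collisions by comparing full names in the second pass.

```python
import hashlib
from collections import Counter

def name_key(name):
    """Return a fixed-size 8-byte key for a sequence name."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()

def repeated_names_two_pass(read_names):
    """Return repeated names, in order of each repeat occurrence.

    `read_names` is a zero-argument callable that returns a fresh
    iterator over the sequence names (for example by re-opening the
    file), so the data can be scanned twice.
    """
    # Pass 1: store only the fixed-size keys, never the full names.
    key_counts = Counter(name_key(n) for n in read_names())
    suspect = {k for k, count in key_counts.items() if count > 1}

    # Pass 2: keep full names only for suspect keys; comparing the
    # full names rules out false positives from hash collisions.
    seen, repeated = set(), []
    for name in read_names():
        if name_key(name) not in suspect:
            continue
        if name in seen:
            repeated.append(name)
        else:
            seen.add(name)
    return repeated
```

The trade-off is the one the wishlist item describes: the file is read twice, but full names are held in memory only for the small set of candidates flagged in the first pass.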


Assumptions

How to Use the fastQValidator Executable

Required Parameters:

       -f  :  FastQ filename with path to be processed.

Optional Parameters:

       -l  :  Minimum allowed read length (Defaults to 10).
       -e  :  Maximum number of errors to display before suppressing them (Defaults to 20).
       -b  :  Raw sequence type:  B - ACTGN only (Default)
                                  C - 0123. only
                                 BC - ACTGN or 0123.

Testing only Parameters:

       -t  :  If "ReadOnly" is specified, the fastq will be read but not processed.  This may be used for determining read time.

Usage:

       ./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>

Examples:

       ../fastQValidator -f testFile.txt
       ../fastQValidator -f testFile.txt -l 10 -b BC -e 100
       ./fastQValidator -f test/testFile.txt -l 10 -b BC -e 100
       time ./fastQValidator -f test/testFile.txt -t ReadOnly
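For scripted runs, the command lines above can be assembled programmatically. The following Python sketch uses only the flags documented on this page; the helper name `build_command` and the default executable path are assumptions.

```python
def build_command(fastq_path, min_read_len=None, raw_seq_type=None,
                  max_errors=None, executable="./fastQValidator"):
    """Build the argument list for a fastQValidator run.

    Only -f is required; an option left as None is omitted so the
    validator's own default applies.
    """
    cmd = [executable, "-f", fastq_path]
    if min_read_len is not None:
        cmd += ["-l", str(min_read_len)]
    if raw_seq_type is not None:
        if raw_seq_type not in ("B", "C", "BC"):
            raise ValueError("raw_seq_type must be 'B', 'C', or 'BC'")
        cmd += ["-b", raw_seq_type]
    if max_errors is not None:
        cmd += ["-e", str(max_errors)]
    return cmd
```

The returned list can be passed to `subprocess.run` from the Python standard library. For example, `build_command("test/testFile.txt", 10, "BC", 100)` reproduces the third example above.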

FastQ Validator Output

Coming Soon