Difference between revisions of "FastQValidator"

From Genome Analysis Wiki

Revision as of 10:56, 25 March 2010

Status

The initial version of a FastQ Validator is complete. It was built using the FastQFile class.


Valid FastQ File Requirements

A valid fastQ file meets the validation criteria specified in FastQ File Validation.


How to Use the fastQValidator Executable

Required Parameters:

       --file  :  FastQ filename with path to be processed.

Optional Parameters:

       --minReadLen        : Minimum allowed read length (Defaults to 10).
       --maxReportedErrors : Maximum number of errors to display before suppressing them (Defaults to 20).
       --ignoreAllErrors   : Ignore all errors (same as --maxReportedErrors 0); overrides the maxReportedErrors option.
       --printBaseComp     : Turns on the printing of Base Composition Statistics.
       --disableAllMessages : Turns off all printed messages, including errors and summary statistics (messages are enabled by default). Does not turn off Base Composition printing if printBaseComp is set.

Optional Space Options for Raw Sequence (Last one specified is used):

       --autoDetect : Determine baseSpace/colorSpace from the Raw Sequence in the file (Default).
       --baseSpace  : ACTGN only
       --colorSpace : 0123. only
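
To illustrate the decision that --autoDetect has to make, here is a minimal Python sketch that classifies one raw sequence by its characters. The function name detect_space and the handling of a leading primer base are assumptions made for illustration; this is not the validator's actual logic.

```python
BASE_CHARS = set("ACTGN")    # characters allowed by --baseSpace
COLOR_CHARS = set("0123.")   # characters allowed by --colorSpace


def detect_space(raw_sequence):
    """Return 'base', 'color', or 'unknown' for one raw sequence line."""
    seq = raw_sequence.strip()
    if not seq:
        return "unknown"
    if set(seq) <= BASE_CHARS:
        return "base"
    # Assumption: a color space read may start with a single primer base.
    body = seq[1:] if seq[0] in BASE_CHARS and len(seq) > 1 else seq
    if set(body) <= COLOR_CHARS:
        return "color"
    return "unknown"
```

A sequence that mixes both alphabets, such as AC01, is reported as unknown rather than guessed.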

Usage:

       ./fastQValidator --file <fileName> [--minReadLen <minReadLen>] [--maxReportedErrors <maxReportedErrors>|--ignoreAllErrors] [--printBaseComp] [--disableAllMessages] [--baseSpace|--colorSpace|--autoDetect]

Examples:

       ../fastQValidator --file testFile.txt
       ../fastQValidator --file testFile.txt --minReadLen 10 --baseSpace --maxReportedErrors 100
       ./fastQValidator --file test/testFile.txt --minReadLen 10 --colorSpace --ignoreAllErrors


FastQ Validator Output

When the fastQValidator executable runs, its output starts with a summary of the parameters:

The following parameters are in effect:
Input Parameters
--file [testFile.txt], --printBaseComp [ON], --disableAllMessages, --minReadLen [10]
  Space Type : --baseSpace [ON], --colorSpace, --autoDetect
      Errors : --ignoreAllErrors, --maxReportedErrors [100]

The validator executable outputs error messages for invalid sequences based on the Validation Criteria. For example:

ERROR on Line 25: The sequence identifier line was too short.
ERROR on Line 29: First line of a sequence does not begin with @
ERROR on Line 33: No Sequence Identifier specified before the comment.
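
The relationship between error messages like these and --maxReportedErrors can be sketched in Python as follows. The helper names and the "too short" rule (an identifier line holding only "@") are assumptions for illustration; the real checks are those defined in FastQ File Validation.

```python
def check_identifier_line(line_number, line):
    """Return an error message for a sequence identifier line, or None."""
    if not line.startswith("@"):
        return (f"ERROR on Line {line_number}: "
                "First line of a sequence does not begin with @")
    # Assumption: "too short" means nothing follows the leading '@'.
    if len(line) < 2:
        return (f"ERROR on Line {line_number}: "
                "The sequence identifier line was too short.")
    return None


def report(messages, max_reported_errors=20):
    """Print at most max_reported_errors messages; return the total count."""
    total = 0
    for message in messages:
        total += 1
        if total <= max_reported_errors:
            print(message)
    return total
```

Every error is still counted after the limit is reached, so the final total stays accurate even when messages are suppressed.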

Base Composition Percentages by Index:

Base Composition Statistics:
Read Index	%A	%C	%G	%T	%N	Total Reads At Index
        0   100.00    0.00    0.00    0.00    0.00	20
        1     5.00   95.00    0.00    0.00    0.00	20
        2     5.00    0.00    5.00   90.00    0.00	20
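
A table like the one above can be produced by tallying bases per read index. The following Python sketch shows one way to do that; the function name base_composition and its return shape are assumptions for illustration, not the FastQFile class API.

```python
from collections import Counter


def base_composition(sequences, bases="ACGTN"):
    """Return one row per read index: (index, {base: percent}, total reads)."""
    counts = []  # counts[i] tallies the characters seen at read index i
    for seq in sequences:
        for i, base in enumerate(seq):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    rows = []
    for i, counter in enumerate(counts):
        total = sum(counter.values())
        percents = {b: 100.0 * counter[b] / total for b in bases}
        rows.append((i, percents, total))
    return rows
```

Reads of different lengths are handled naturally: the total for an index counts only the reads long enough to reach it.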


Summary of the number of lines, sequences, and errors:

Finished processing testFile.txt with 92 lines containing 20 sequences.
There were a total of 17 errors.

The fastQValidator returns 0 on success and non-zero on failure.
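
Because the exit status signals success or failure, the validator can be driven from a script. Below is a minimal Python sketch that uses only options documented on this page; the helper names and the default executable path ./fastQValidator are assumptions for illustration.

```python
import subprocess


def build_command(file_name, min_read_len=None, max_reported_errors=None,
                  executable="./fastQValidator"):
    """Build an argument list using only options documented on this page."""
    command = [executable, "--file", file_name]
    if min_read_len is not None:
        command += ["--minReadLen", str(min_read_len)]
    if max_reported_errors is not None:
        command += ["--maxReportedErrors", str(max_reported_errors)]
    return command


def is_valid_fastq(file_name, **options):
    """Run the validator and return True when it exits with status 0."""
    result = subprocess.run(build_command(file_name, **options))
    return result.returncode == 0
```

For example, is_valid_fastq("testFile.txt", min_read_len=10) would run the executable and report whether it returned 0.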


Libraries & Classes

  • libfqf.a
  • FastQValidator.cpp - Main method for the Executable.
  • libcsg.a
    • ParameterList - Class for reading in Parameters.


Additional Features

  • Base composition reported and tracked by position.
  • Supports base space and color space files.
  • Consumes gzipped and uncompressed text files transparently.
  • Prints error messages for errors up to the configurable maximum number of reportable errors.
  • Prints a summary of the total number of errors.
  • Prints the total number of lines processed as well as the total number of sequences processed.
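
Transparent handling of gzipped and uncompressed input, as listed above, is commonly done by inspecting the first two bytes of the file. The Python sketch below shows that idea; it is an illustration under that assumption and does not describe how libfqf.a actually opens files.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_text(path):
    """Open a plain or gzipped text file for reading, chosen by magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "r")
```

Checking the content rather than the file extension means a gzipped file named testFile.txt is still read correctly.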


Additional Wishlist - Not Implemented

There are a series of optional capabilities a FastQ Validator could implement. Among those:

  • To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than the complete sequence name) in memory. A suggested pair of options: -1 for one pass with high memory use (the default) and -2 for two passes with lower memory use.
  • Report average read quality score.
  • Auto-detect the quality score offset: 64 (Illumina) versus 33 (standard).
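
The two quality-related wishlist items could look roughly like the Python sketch below. None of this is implemented in the validator: the function names are hypothetical, and the offset guess is a simple heuristic that treats any quality character below ASCII 59 as proof of offset 33, since such characters cannot occur with offset 64, and otherwise assumes offset 64.

```python
def guess_quality_offset(quality_strings):
    """Guess the ASCII offset (33 or 64); return None if there is no data."""
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        return None
    # Heuristic: characters below ASCII 59 only occur with offset 33.
    return 33 if min(codes) < 59 else 64


def average_quality(quality_string, offset=33):
    """Return the mean quality score of one non-empty quality string."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)
```

A file with offset 33 and uniformly high qualities would be misclassified by this heuristic, so a real implementation would need a more careful rule or a user override.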


Discussion

  • For color space, the following are not specified:
  1. Whether the read and quality strings have the same length or differ by 1 (depending on whether the primer base has a corresponding quality value).
  2. How missing values are represented: usually by "." but sometimes by a blank " ".
  3. Whether tag names for paired-end reads may be the same (MAQ actually enforces this) and whether both reads may be in the same file (BFAST requires paired reads in the same file).
  • It may be useful to report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
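
One possible shape for the ERROR/WARNING split, applied to the color space length question from item 1, is sketched below in Python. The function name, the severity assigned to each case, and the message wording are all assumptions for discussion, not agreed behavior.

```python
def check_color_space_lengths(read, quality):
    """Return None if lengths agree, else a (severity, message) tuple."""
    difference = len(read) - len(quality)
    if difference == 0:
        return None
    if difference == 1:
        # Assumption: tolerated, since the primer base may lack a quality value.
        return ("WARNING", "quality string is one shorter than the read; "
                           "the primer base may have no quality value")
    return ("ERROR", "read and quality string lengths do not match")
```

Under this scheme only the ERROR case would need to count toward the failure return code, while WARNING messages could be printed without failing validation.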