Difference between revisions of "FastQValidator"

From Genome Analysis Wiki

Revision as of 14:33, 17 November 2010

Status

The initial version of a FASTQ Validator is complete. It was built using the FastQFile class which is part of the StatGen Library.

This command line tool can be downloaded as part of the library: http://genome.sph.umich.edu/wiki/Software#Download

Note: Since the FastQValidator checks for unique sequence names, it may use a large amount of memory. This check can be disabled by specifying the --disableSeqIDCheck option.

Valid FastQ File Requirements

A valid fastQ file meets the validation criteria specified in FastQ Validation Criteria.


How to Use the fastQValidator Executable

Required Parameters

       --file  :  FastQ filename with path to be processed.

Optional Parameters

       --minReadLen         : Minimum allowed read length (Defaults to 10).
       --maxErrors          : Number of errors to allow before quitting
                              reading/validating the file.
                              -1 (default) indicates to not quit until
                              the entire file is read.
                              0 indicates not to read/validate anything
       --printableErrors    : Maximum number of errors to print before
                              suppressing them (Defaults to 20).
                              Different than maxErrors since 
                              printableErrors will continue reading and
                              validating the file until the end, but
                              just doesn't print the errors.
       --ignoreErrors       : Ignore all errors (same as printableErrors = 0)
                              overwrites the printableErrors option.
       --baseComposition    : Print the Base Composition Statistics.
       --disableSeqIDCheck  : Disable the unique sequence identifier check.
                              Use this option to save memory since the sequence id
                              check uses a lot of memory.
                              Does not affect the printing of Base Composition Statistics.
       --quiet              : Suppresses the display of errors and summary statistics.
                              Does not affect the printing of Base Composition Statistics.
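
The difference between --maxErrors and --printableErrors described above can be illustrated with a minimal Python sketch. It is not the fastQValidator implementation: the function name `validate_records` and the idea of feeding it one result per record are hypothetical, and the exact point at which reading stops once --maxErrors is reached is an assumption.

```python
from typing import Iterable, List, Optional, Tuple


def validate_records(
    record_errors: Iterable[Optional[str]],
    max_errors: int = -1,
    printable_errors: int = 20,
) -> Tuple[int, List[str]]:
    """Return (error_count, printed_messages) for a stream of per-record results.

    Each item is None for a valid record or an error message for an invalid one.
    """
    printed: List[str] = []
    error_count = 0
    if max_errors == 0:
        # 0 means: do not read or validate anything.
        return error_count, printed
    for error in record_errors:
        if error is None:
            continue
        error_count += 1
        # printable_errors only limits what is shown; validation continues.
        if error_count <= printable_errors:
            printed.append(error)
        # max_errors stops reading altogether (-1 means never stop early).
        # Assumption: reading stops as soon as the count reaches max_errors.
        if max_errors != -1 and error_count >= max_errors:
            break
    return error_count, printed
```

With `printable_errors=1` every error is still counted but only the first is kept for display, whereas `max_errors=2` stops processing after the second error.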

Optional Space Options for Raw Sequence (Last one specified is used)

       --auto       : Determine baseSpace/colorSpace from the Raw Sequence in the file (Default).
       --baseSpace  : ACTGN only
       --colorSpace : 0123. only (with 1 character primer base)
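
As an illustration of what --auto has to decide, the following Python sketch classifies a raw sequence using only the character sets listed above. It is a hedged sketch, not the library's detection code: the function name `detect_space_type` is hypothetical, and it assumes that the color space primer base is one of the base space characters and that input is upper case.

```python
BASE_SPACE_CHARS = set("ACTGN")
COLOR_SPACE_CHARS = set("0123.")


def detect_space_type(raw_sequence: str) -> str:
    """Return "base", "color", or "unknown" for one raw sequence line."""
    seq = raw_sequence.strip()
    if not seq:
        return "unknown"
    # Base space: every character is one of ACTGN.
    if set(seq) <= BASE_SPACE_CHARS:
        return "base"
    # Color space: one primer base followed by 0123. only.
    # Assumption: the primer base is a base space character.
    primer, colors = seq[0], seq[1:]
    if primer in BASE_SPACE_CHARS and colors and set(colors) <= COLOR_SPACE_CHARS:
        return "color"
    return "unknown"
```

A sequence that mixes the two alphabets after the first character, such as `ACGT0`, is reported as "unknown" rather than guessed.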

Usage

./fastQValidator --file <fileName> [--minReadLen <minReadLen>] [--maxErrors <numErrors>] [--printableErrors <printableErrors>|--ignoreErrors] [--baseComposition] [--disableSeqIDCheck] [--quiet] [--baseSpace|--colorSpace|--auto] [--params]
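
When the validator is driven from a script, the usage line above can be turned into an argument list. The helper below is a small Python sketch: `build_command` and its keyword names are hypothetical, while the flags themselves are the ones documented on this page. Treating --printableErrors and --ignoreErrors as mutually exclusive follows the usage line and is a choice made for this sketch.

```python
from typing import List, Optional

SPACE_OPTIONS = ("baseSpace", "colorSpace", "auto")


def build_command(
    file_name: str,
    min_read_len: Optional[int] = None,
    max_errors: Optional[int] = None,
    printable_errors: Optional[int] = None,
    ignore_errors: bool = False,
    base_composition: bool = False,
    disable_seq_id_check: bool = False,
    quiet: bool = False,
    space: Optional[str] = None,
    params: bool = False,
    executable: str = "./fastQValidator",
) -> List[str]:
    """Build an argument list in the order given by the usage line."""
    if ignore_errors and printable_errors is not None:
        raise ValueError("--printableErrors and --ignoreErrors are alternatives")
    if space is not None and space not in SPACE_OPTIONS:
        raise ValueError(f"space must be one of {SPACE_OPTIONS}")
    command = [executable, "--file", file_name]
    if min_read_len is not None:
        command += ["--minReadLen", str(min_read_len)]
    if max_errors is not None:
        command += ["--maxErrors", str(max_errors)]
    if printable_errors is not None:
        command += ["--printableErrors", str(printable_errors)]
    if ignore_errors:
        command.append("--ignoreErrors")
    if base_composition:
        command.append("--baseComposition")
    if disable_seq_id_check:
        command.append("--disableSeqIDCheck")
    if quiet:
        command.append("--quiet")
    if space is not None:
        command.append("--" + space)
    if params:
        command.append("--params")
    return command
```

The resulting list can be passed to `subprocess.run` without shell quoting.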

Examples

       ./fastQValidator --file testFile.txt
       ./fastQValidator --file testFile.txt --minReadLen 10 --baseSpace --printableErrors 100
       ./fastQValidator --file test/testFile.txt --minReadLen 10 --colorSpace --ignoreErrors

Return Value

  • 0 - the fastq file is valid.
  • < 0 - invalid options specified.
  • > 0 - the fastq file did not validate successfully. One of the FastQStatus failure values is returned.


FastQ Validator Output

When running the fastQValidator Executable, if the --params option is specified, the output starts with a summary of the parameters:

The following parameters are available.  Ones with "[]" are in effect:
Input Parameters
 --file [../fastqValidator/test/testFile.txt], --baseComposition,
               --disableSeqIDCheck, --quiet, --params [ON], --minReadLen [10],
               --maxErrors [-1]
  Space Type : --baseSpace, --colorSpace, --auto [ON]
      Errors : --ignoreErrors, --printableErrors [20]

The fastQValidator executable outputs error messages for invalid sequences based on the Validation Criteria. For example:

ERROR on Line 25: The sequence identifier line was too short.
ERROR on Line 29: First line of a sequence does not begin with @
ERROR on Line 33: No Sequence Identifier specified before the comment.
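
Error lines in this shape are easy to collect from captured output. The Python sketch below assumes that every error line follows the `ERROR on Line <number>: <message>` pattern shown in the example; the function name `parse_error_lines` is hypothetical and the pattern is inferred from this page only.

```python
import re
from typing import List, Tuple

# Pattern inferred from the example messages on this page.
ERROR_LINE = re.compile(r"^ERROR on Line (\d+): (.*)$")


def parse_error_lines(output: str) -> List[Tuple[int, str]]:
    """Return (line_number, message) pairs for each error line in the output."""
    errors: List[Tuple[int, str]] = []
    for line in output.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            errors.append((int(match.group(1)), match.group(2)))
    return errors
```

Lines that do not match, such as the summary or the base composition table, are ignored.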

Base composition percentages by index are printed if the --baseComposition option is specified. For example:

Base Composition Statistics:
Read Index	%A	%C	%G	%T	%N	Total Reads At Index
        0   100.00    0.00    0.00    0.00    0.00	20
        1     5.00   95.00    0.00    0.00    0.00	20
        2     5.00    0.00    5.00   90.00    0.00	20
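
The table above can be reproduced conceptually by counting bases per read index. The following Python sketch shows one way to do that; it is not the FastQFile code, the name `base_composition` is hypothetical, and it assumes that only A, C, G, T, and N contribute to the total at each index.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

BASES = "ACGTN"


def base_composition(reads: Iterable[str]) -> List[Tuple[Dict[str, float], int]]:
    """Return one (percent_by_base, total_reads_at_index) pair per read index."""
    counts: List[Counter] = []
    for read in reads:
        for index, base in enumerate(read.upper()):
            if index == len(counts):
                # First read long enough to reach this index.
                counts.append(Counter())
            counts[index][base] += 1
    table: List[Tuple[Dict[str, float], int]] = []
    for counter in counts:
        # Assumption: characters outside ACGTN are not counted.
        total = sum(counter[base] for base in BASES)
        percents = {
            base: (100.0 * counter[base] / total) if total else 0.0
            for base in BASES
        }
        table.append((percents, total))
    return table
```

Because reads may differ in length, the total at later indexes can be smaller than the number of reads, which is why the table carries a "Total Reads At Index" column.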


Summary of the number of lines, sequences, and errors:

Finished processing testFile.txt with 92 lines containing 20 sequences.
There were a total of 17 errors.


Libraries & Classes

  • C++ Library: libStatGen
    • ParameterList - Class for reading in Parameters.
  • FastQValidator.cpp - Main method for the Executable.


Additional Features

  • Base composition reported and tracked by position.
  • Supports base space and color space files.
  • Consumes gzipped and uncompressed text files transparently.
  • Prints error messages for errors up to the configurable maximum number of reportable errors.
  • Prints a summary of the total number of errors.
  • Prints the total number of lines processed as well as the total number of sequences processed.


Additional Wishlist - Not Implemented

There are a series of optional capabilities a FastQ Validator could implement. Among those:

  • To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than the complete sequence names) in memory. Suggested options: -1 for one pass with high memory use (the default) and -2 for two passes with lower memory use.
  • Report average read quality score.
  • Auto-detect the quality score offset: 64 (Illumina) versus 33 (standard).
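
The two-pass idea in the first wishlist item could look roughly like the Python sketch below. It is only one possible reading of that item, not an implemented feature: the names `name_key` and `find_duplicate_names` are hypothetical, the 8-byte BLAKE2b digest is an arbitrary choice of key, and the caller is assumed to be able to iterate over the sequence names twice.

```python
import hashlib
from collections import Counter
from typing import Callable, Iterable, Set


def name_key(name: str) -> bytes:
    """Return a short fixed-size key for a sequence name (8-byte digest)."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()


def find_duplicate_names(open_names: Callable[[], Iterable[str]]) -> Set[str]:
    """Find repeated sequence names while keeping only keys for most names.

    open_names is called once per pass and must yield the names in the file.
    """
    # Pass 1: store only a small key per name and count how often each occurs.
    key_counts = Counter(name_key(name) for name in open_names())
    suspect_keys = {key for key, count in key_counts.items() if count > 1}

    # Pass 2: keep full names only for keys seen more than once, so that
    # key collisions between different names are not reported as duplicates.
    seen: Set[str] = set()
    duplicates: Set[str] = set()
    for name in open_names():
        if name_key(name) in suspect_keys:
            if name in seen:
                duplicates.add(name)
            else:
                seen.add(name)
    return duplicates
```

Memory use in the first pass depends on the key size rather than on the length of the sequence names, and the second pass stores full names only for the suspect keys.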


Discussion

  • For color space, there is no specification for the following:
  1. Whether the read and quality strings have the same length or differ by 1 (depending on whether the primer base has a corresponding quality value).
    • Decided to require a quality score for the primer base.
  2. How missing values are represented: usually by "." but sometimes by a blank " ".
  3. Whether tag names for paired-end reads may be the same (MAQ actually enforces this) and whether paired reads may be in the same file (BFAST requires paired reads in the same file).
  • It may be useful to report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).