Difference between revisions of "FastQValidator"

From Genome Analysis Wiki
Revision as of 13:52, 22 February 2010

Status

The initial version of a FastQ Validator is complete.


Valid FastQ File Requirements

A valid fastQ file meets the validation criteria specified in FastQFile.


Additional Features

  • Base composition reported and tracked by position.
  • Supports base space and color space files.
  • Consumes gzipped and uncompressed text files transparently.
  • Prints error messages for errors up to the configurable maximum number of reportable errors.
  • Prints a summary of the total number of errors.
  • Prints the total number of lines processed as well as the total number of sequences processed.
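
As an illustration of reading gzipped and uncompressed files transparently, the following is a minimal Python sketch. It is not the validator's own code; the function name `open_fastq` is hypothetical, and it assumes that checking the two-byte gzip magic number is an acceptable way to detect compression.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def open_fastq(path):
    """Return a text-mode handle for a plain or gzipped file (hypothetical helper)."""
    # Peek at the first two bytes instead of trusting the file extension.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "rt", encoding="ascii")
```

The caller reads lines from the returned handle in the same way for both kinds of file.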


Additional Wishlist - Not Implemented

A FastQ Validator could implement a number of optional capabilities, including:

  • To reduce memory usage, implement a two-pass algorithm that keeps only a key for each sequence name in memory, rather than the complete name. Suggested options: -1 for one pass with high memory use (the default), -2 for two passes with lower memory use.
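
One possible shape for the two-pass idea is sketched below in Python. This is only an assumption about how such a feature might work, since it is not implemented: the names `read_names` and `duplicate_names_two_pass` are hypothetical, records are assumed to span exactly four lines, and an 8-byte hash stands in for the "key".

```python
import hashlib
from collections import Counter


def _key(name):
    # Fixed-size 8-byte digest used in place of the full sequence name.
    return hashlib.blake2b(name.encode(), digest_size=8).digest()


def read_names(lines):
    """Yield the identifier from every 4th line (assumes 4-line records)."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            fields = line[1:].split()
            yield fields[0] if fields else ""


def duplicate_names_two_pass(make_lines):
    """make_lines() must return a fresh iterator over the file's lines."""
    # Pass 1: count keys only; no full names are kept.
    counts = Counter(_key(n) for n in read_names(make_lines()))
    suspects = {k for k, c in counts.items() if c > 1}
    # Pass 2: keep full names only for keys that occurred more than once,
    # so hash collisions between different names are not reported.
    seen, dups = set(), set()
    for name in read_names(make_lines()):
        if _key(name) in suspects:
            if name in seen:
                dups.add(name)
            seen.add(name)
    return dups
```

Full names are held in memory only for keys seen more than once, which is where the saving would come from when most names are unique.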


Discussion

  • For color space, the following points are not covered by any specification:
  1. The read and the quality string may have the same length or differ by 1 (depending on whether the primer base has a corresponding quality value).
  2. Missing values are usually represented by "." but are sometimes left as a blank " ".
  3. Tag names for paired-end reads may be the same (e.g. MAQ enforces this), and paired reads may be in the same file (e.g. BFAST requires this).
  • It may be useful to report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors).
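
To make the two discussion points concrete, here is a hedged Python sketch that combines them. It is a suggestion only: the function `check_lengths` is hypothetical, and treating the off-by-one color space case as a WARNING rather than an ERROR is an assumption, not documented validator behavior.

```python
def check_lengths(raw_seq, quality, color_space):
    """Return (severity, message), or None when the lengths are acceptable."""
    diff = len(raw_seq) - len(quality)
    if diff == 0:
        return None
    if color_space and diff == 1:
        # Assumed tolerable: the primer base has no quality value of its own.
        return ("WARNING", "quality string is 1 shorter than the read")
    return ("ERROR",
            f"read length {len(raw_seq)} != quality length {len(quality)}")
```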


How to Use the fastQValidator Executable

Required Parameters:

       -f  :  FastQ filename with path to be processed.

Optional Parameters:

       -l  :  Minimum allowed read length (Defaults to 10).
       -e  :  Maximum number of errors to display before suppressing them (Defaults to 20).
       -b  :  Raw sequence type: "A"/"C"/"G"/"T"/"N"  - Bases only;
                                 "0"/"1"/"2"/"3"/"."  - Color space only;
                                 ""                   - Base the decision on the first raw sequence character (Default);
                                 All other characters - Bases & Color space.
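
The -b rules above can be read as a small lookup. The Python sketch below is one interpretation, not the executable's code: `sequence_mode` is a hypothetical name, matching is done exactly as listed (upper case only), and falling back to "both" when auto-detection sees an unlisted character is an assumption.

```python
BASE_CHARS = set("ACGTN")
COLOR_CHARS = set("0123.")


def sequence_mode(b_option, first_raw_char=""):
    """Map a -b value to 'base', 'color', or 'both'."""
    # An empty -b value means: decide from the first raw sequence character.
    key = b_option if b_option else first_raw_char
    if key in BASE_CHARS:
        return "base"
    if key in COLOR_CHARS:
        return "color"
    return "both"
```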

Testing only Parameters:

       -t  :  If "ReadOnly" is specified, the fastq will be read but not processed.  This may be used for determining read time.

Usage:

       ./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>

Examples:

       ../fastQValidator -f testFile.txt
       ../fastQValidator -f testFile.txt -l 10 -b A -e 100
       ./fastQValidator -f test/testFile.txt -l 10 -b Z -e 100
       time ./fastQValidator -f test/testFile.txt -t ReadOnly
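
When the validator is driven from a script, the command lines above can be assembled programmatically. The Python helper below is a hypothetical convenience, not part of the tool; it uses only the -f, -l, -e and -b flags documented on this page and assumes the executable lives at `./fastQValidator`.

```python
def build_command(file_name, min_read_len=None, max_errors=None,
                  raw_seq_type=None, exe="./fastQValidator"):
    """Build an argument list for the fastQValidator executable."""
    cmd = [exe, "-f", file_name]
    # Optional flags are added only when given, so the tool's defaults apply.
    if min_read_len is not None:
        cmd += ["-l", str(min_read_len)]
    if max_errors is not None:
        cmd += ["-e", str(max_errors)]
    if raw_seq_type is not None:
        cmd += ["-b", raw_seq_type]
    return cmd
```

The resulting list can be passed to `subprocess.run` from the standard library.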


FastQ Validator Output

When running the fastQValidator Executable, the output starts with a summary of the parameters:

The following parameters are in effect:
              FastQ File Name :    testFile.txt (-fname)
              Min Read Length :              10 (-l9999)
          Max Reported Errors :             100 (-e9999)
                     BaseType :               A (-bname)
                     TestMode :                 (-tname)

Both the executable and the library output the following:

  • Error messages, up to the configurable maximum number of reported errors:
ERROR on Line 25: The sequence identifier line was too short.
ERROR on Line 29: First line of a sequence does not begin wtih @
ERROR on Line 33: No Sequence Identifier specified before the comment.
  • Base Composition Percentages by Index:
Base Composition Statistics:
Read Index	%A	%C	%G	%T	%N	Total Reads At Index
        0   100.00    0.00    0.00    0.00    0.00	20
        1     5.00   95.00    0.00    0.00    0.00	20
        2     5.00    0.00    5.00   90.00    0.00	20
  • Summary of the number of lines, sequences, and errors:
Finished processing testFile.txt with 92 lines containing 20 sequences.
There were a total of 17 errors.
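
For readers who want to see how the base composition table could be derived, the following Python sketch computes per-index percentages from a list of reads. It is an illustration under assumptions, not the validator's implementation: `base_composition` is a hypothetical name, bases are upper-cased before counting, and characters other than A, C, G, T and N count toward the total but are not reported.

```python
from collections import Counter


def base_composition(reads):
    """Return one row per read index: (index, {base: percent}, total_reads)."""
    counts = []  # counts[i] tallies the bases seen at read index i
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    rows = []
    for i, c in enumerate(counts):
        total = sum(c.values())
        percents = {b: 100.0 * c[b] / total for b in "ACGTN"}
        rows.append((i, percents, total))
    return rows
```

Because reads can differ in length, the total is tracked separately for each index, matching the "Total Reads At Index" column.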