Line 1: |
Line 1: |
| == Status == | | == Status == |
| | | |
− | The [http://en.wikipedia.org/wiki/FASTQ_format FastQ] Validator is on our [[Todo List]]. | + | The initial version of a [http://en.wikipedia.org/wiki/FASTQ_format FastQ] Validator is complete. |
| | | |
− | An initial version of the [[FastQFile]] has been completed which includes validation methods.
| |
| | | |
| == Valid FastQ File Requirements == | | == Valid FastQ File Requirements == |
| | | |
− | A valid fastQ file should meet the following requirements: | + | A valid FastQ file meets the validation criteria specified in [[FastQFile]]. |
| | | |
− | *A base sequence should have non-zero length.
| |
| | | |
− | *A quality string should be present for every base sequence.
| + | == Additional Features == |
| | | |
− | *Paired quality and base sequences should be of the same length. | + | *Base composition reported and tracked by position. |
| + | *Supports base space and color space files. |
| + | *Consumes gzipped and uncompressed text files transparently. |
| + | *Prints error messages for errors up to the configurable maximum number of reportable errors. |
| + | *Prints a summary of the total number of errors. |
| + | *Prints the total number of lines processed as well as the total number of sequences processed. |
| | | |
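The transparent handling of gzipped and uncompressed input listed above can be illustrated with a minimal Python sketch. This is not the validator's own implementation; the helper name `open_maybe_gzipped` is hypothetical, and the sketch assumes compression is detected from the two-byte gzip magic number rather than from the file extension.

```python
import gzip

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzipped(path):
    """Open `path` for text reading, decompressing when it is gzipped.

    Hypothetical helper: detection is based on the gzip magic bytes,
    so the file name does not need to end in ".gz".
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding="ascii", errors="replace")
    return open(path, "r", encoding="ascii", errors="replace")
```

A caller can then iterate over lines in the same way for both kinds of file, for example `with open_maybe_gzipped("testFile.txt") as handle: ...`.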
− | *Valid quality values should all have ASCII codes > 32.
| |
| | | |
− | *Valid bases should be ACTG or N, unless ambiguous bases are explicitly allowed by the application consuming the file. Lower case characters are allowed.
| + | == Additional Wishlist - Not Implemented == |
| | | |
− | *Every entry in the file should have a unique identifier.
| + | A FastQ Validator could implement a number of optional capabilities, including: |
− | | |
− | *Reads should be of a minimum length; many mappers will get into trouble with very short reads.
| |
− | | |
− | *Base composition should be reported and tracked by position.
| |
− | | |
− | == Additional Wishlist ==
| |
− | | |
− | There are a series of optional capabilities a FastQ Validator should implement. Among those: | |
− | | |
− | *Consume gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
| |
| | | |
| *To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1). | | *To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1). |
| | | |
− | *Support color space files, where valid base sequences include the characters 0, 1, 2, 3, '.' (period) in addition to A, C, T, G and N (some csfastq sequence lines start with a primer base).
| |
| | | |
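The two-pass idea in the wishlist item above (keep only a key per sequence name, then confirm on a second pass) could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the planned implementation: the function names are hypothetical, the key is assumed to be an 8-byte BLAKE2b digest, and `read_names` is assumed to be a callable that re-reads the file and yields sequence names.

```python
import hashlib
from collections import Counter


def name_key(name):
    """Compact fixed-size key for a sequence name (assumed: 8-byte BLAKE2b)."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()


def find_duplicate_names(read_names):
    """Return sorted sequence names that occur more than once.

    `read_names` is a zero-argument callable returning a fresh iterator
    over sequence names, so the input can be scanned twice.
    """
    # Pass 1: store only keys; remember keys that were seen more than once.
    seen, suspect = set(), set()
    for name in read_names():
        key = name_key(name)
        if key in seen:
            suspect.add(key)
        else:
            seen.add(key)
    del seen  # the full key set is no longer needed

    # Pass 2: store full names only for suspect keys, which also
    # filters out any key collisions between different names.
    counts = Counter(n for n in read_names() if name_key(n) in suspect)
    return sorted(n for n, c in counts.items() if c > 1)
```

For example, `find_duplicate_names(lambda: iter(["r1", "r2", "r1"]))` scans the list twice and reports only the repeated name. The trade-off matches the suggested `-1`/`-2` options: the second scan costs time but full names are held in memory only for suspected duplicates.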
| == Discussion == | | == Discussion == |
Line 44: |
Line 35: |
| | | |
| * It may be useful to report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors). | | * It may be useful to report 2 types of information to the user: ERROR (critical failure) and WARNING (tolerable errors). |
| + | |
| + | |
| + | |
| + | == How to Use the fastQValidator Executable == |
| + | '''Required Parameters:''' |
| + | -f : FastQ filename with path to be processed. |
| + | |
| + | '''Optional Parameters:''' |
| + | -l : Minimum allowed read length (Defaults to 10). |
| + | -e : Maximum number of errors to display before suppressing them (Defaults to 20). |
| + | -b : Raw sequence type: "A"/"C"/"G"/"T"/"N" - Bases only; |
| + | "0"/"1"/"2"/"3"/"." - Color space only; |
| + | "" - Decide based on the first raw sequence character (Default) |
| + | All other characters - Bases & Color space |
| + | |
| + | '''Testing only Parameters:''' |
| + | -t : If "ReadOnly" is specified, the FastQ file will be read but not processed. This may be used for determining read time. |
| + | '''Usage:''' |
| + | ./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType> |
| + | |
| + | '''Examples:''' |
| + | ../fastQValidator -f testFile.txt |
| + | ../fastQValidator -f testFile.txt -l 10 -b A -e 100 |
| + | ./fastQValidator -f test/testFile.txt -l 10 -b Z -e 100 |
| + | time ./fastQValidator -f test/testFile.txt -t ReadOnly |
| + | |
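The mapping from the `-b` value to a raw sequence type described in the parameter list can be summarised with a small Python sketch. It only restates the table above and is not taken from the fastQValidator source; the function name and the returned labels are hypothetical, and it assumes that only the first character of the option value matters and that matching is case-sensitive.

```python
BASE_CHARS = set("ACGTN")    # values that select bases only
COLOR_CHARS = set("0123.")   # values that select color space only


def raw_seq_type(option):
    """Map a -b option value to a label (labels are illustrative).

    Assumptions: only the first character is inspected, and the
    comparison is case-sensitive, following the parameter list literally.
    """
    if option == "":
        return "auto"    # decide from the first raw sequence character
    first = option[0]
    if first in BASE_CHARS:
        return "base"
    if first in COLOR_CHARS:
        return "color"
    return "both"        # any other character: bases and color space
```

Under these assumptions, the `-b A` example above selects bases only, while `-b Z` allows both bases and color space.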
| + | |
| + | == FastQ Validator Output == |
| + | When running the fastQValidator Executable, the output starts with a summary of the parameters: |
| + | The following parameters are in effect: |
| + | FastQ File Name : testFile.txt (-fname) |
| + | Min Read Length : 10 (-l9999) |
| + | Max Reported Errors : 100 (-e9999) |
| + | BaseType : A (-bname) |
| + | TestMode : (-tname) |
| + | |
| + | Both the executable and the library output the following: |
| + | *Error messages for errors, up to the configurable maximum number of reported errors: |
| + | ERROR on Line 25: The sequence identifier line was too short. |
| + | ERROR on Line 29: First line of a sequence does not begin wtih @ |
| + | ERROR on Line 33: No Sequence Identifier specified before the comment. |
| + | *Base Composition Percentages by Index: |
| + | |
| + | Base Composition Statistics: |
| + | Read Index %A %C %G %T %N Total Reads At Index |
| + | 0 100.00 0.00 0.00 0.00 0.00 20 |
| + | 1 5.00 95.00 0.00 0.00 0.00 20 |
| + | 2 5.00 0.00 5.00 90.00 0.00 20 |
| + | *Summary of the number of lines, sequences, and errors: |
| + | Finished processing testFile.txt with 92 lines containing 20 sequences. |
| + | There were a total of 17 errors. |
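To show how a per-index table like the Base Composition Statistics above can be derived, here is a minimal Python sketch. It is an assumption-laden illustration rather than the validator's code: the function name is hypothetical, lower-case bases are folded to upper case, and the total at each index is assumed to count every read that is long enough to have a character there.

```python
from collections import Counter


def base_composition(sequences):
    """Return one (index, percentages, total) tuple per read position.

    `percentages` maps each of A, C, G, T, N to its share (0-100) of
    the reads that have a character at that index.
    """
    counts = []  # counts[i] tallies the characters seen at position i
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1

    rows = []
    for i, tally in enumerate(counts):
        total = sum(tally.values())
        percentages = {b: 100.0 * tally[b] / total for b in "ACGTN"}
        rows.append((i, percentages, total))
    return rows
```

Because reads can differ in length, the total can shrink at higher indexes; each percentage is relative to the reads that reach that index, not to all reads in the file.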