Difference between revisions of "LibStatGen: FASTQ"
From Genome Analysis Wiki
Revision as of 14:18, 4 February 2010
Validation Criteria
Sequence Identifier Line
Validation Criteria | Error Message |
---|---|
Every entry in the file should have a unique identifier. | ERROR on Line <current line #>: Repeated Sequence Identifier: <identifier> at Lines <previous line #> <current line #> |
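The uniqueness check above can be sketched as follows. This is a minimal illustration in Python, not the library's own code: the function name `find_repeated_identifiers` and the `(line_number, identifier)` input shape are assumptions made for the example, and the message text simply follows the template in the table.

```python
def find_repeated_identifiers(records):
    """Report identifiers that occur more than once.

    records: iterable of (line_number, identifier) pairs in file order.
    This input shape is an assumption made for the sketch.
    """
    first_seen = {}  # identifier -> line number of its first occurrence
    errors = []
    for line_no, identifier in records:
        previous = first_seen.get(identifier)
        if previous is None:
            first_seen[identifier] = line_no
        else:
            # Message text modeled on the template in the table above.
            errors.append(
                f"ERROR on Line {line_no}: Repeated Sequence Identifier: "
                f"{identifier} at Lines {previous} {line_no}"
            )
    return errors
```

Keeping every identifier in a dictionary is the simple one-pass approach; its memory use grows with the number of reads.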
Raw Sequence Line
Validation Criteria | Error Message |
---|---|
A base sequence should have non-zero length. | ERROR on Line <current line #>: |
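All characters in the base sequence must be in the allowable set specified via configuration. Base Only: A C T G N a c t g n; Color Space Only: 0 1 2 3 . (period); Base or Color Space: A C T G N a c t g n 0 1 2 3 . (period) | ERROR on Line <current line #>: |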
Reads should be of a minimum length; many mappers will get into trouble with very short reads. | ERROR on Line <current line #>: |
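The three raw sequence checks could look roughly like this. It is a hedged Python sketch rather than the validator's implementation: `check_raw_sequence` and the short problem labels it returns are hypothetical, since the full error message text is not shown in the table. The type codes B, C and BC and the default minimum length of 10 are taken from the command-line options described later on this page.

```python
# Allowed characters per raw sequence type (codes as used by the -b option).
ALLOWED = {
    "B": set("ACTGNactgn"),        # base only
    "C": set("0123."),             # color space only
    "BC": set("ACTGNactgn0123."),  # base or color space
}


def check_raw_sequence(sequence, seq_type="B", min_read_len=10):
    """Return a list of problem labels (hypothetical wording) for one read."""
    problems = []
    if len(sequence) == 0:
        # An empty sequence fails the first criterion; nothing more to check.
        problems.append("empty")
        return problems
    allowed = ALLOWED[seq_type]
    bad = sorted({ch for ch in sequence if ch not in allowed})
    if bad:
        problems.append("invalid characters: " + "".join(bad))
    if len(sequence) < min_read_len:
        problems.append("too short")
    return problems
```

For example, `check_raw_sequence("ACGT")` flags the read as too short under the default minimum length, while the same read passes with `min_read_len=4`.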
Plus Line
Quality String Line
Validation Criteria | Error Message |
---|---|
A quality string should be present for every base sequence. | ERROR on Line <current line #>: |
Paired quality and base sequences should be of the same length. | ERROR on Line <current line #>: |
Valid quality values should all have ASCII codes > 32. | ERROR on Line <current line #>: |
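A minimal Python sketch of the quality string checks, assuming the sequence and quality string of one record are already available as strings. The name `check_quality_string` and the problem labels are hypothetical, because the table does not show the full error messages.

```python
def check_quality_string(sequence, quality):
    """Return a list of problem labels (hypothetical wording) for one record."""
    problems = []
    if not quality:
        # No quality string at all for this base sequence.
        problems.append("missing quality string")
        return problems
    if len(quality) != len(sequence):
        problems.append(
            f"length mismatch: {len(sequence)} bases, {len(quality)} qualities"
        )
    # Valid quality characters have ASCII codes strictly greater than 32,
    # so a space (code 32) and control characters are rejected.
    bad_positions = [i for i, ch in enumerate(quality) if ord(ch) <= 32]
    if bad_positions:
        problems.append(f"invalid quality characters at positions {bad_positions}")
    return problems
```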
Additional Features
- Base composition is tracked and reported by position.
- Consumes gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
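The two features above can be illustrated together with a short Python sketch. It does not use libcsg/InputFile.h; instead it picks the reader by looking at the gzip magic bytes, which is one common way to handle both kinds of file. The function names are hypothetical, and the sketch assumes each record occupies exactly four lines with the raw sequence on the second.

```python
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed file as text, chosen by magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip files start with these two bytes
        return gzip.open(path, "rt")
    return open(path, "r")


def base_composition(path):
    """Return a list of Counters, one per read position."""
    counts = []
    with open_text(path) as handle:
        for index, line in enumerate(handle):
            # Assumption: 4 lines per record, raw sequence on the 2nd line.
            if index % 4 != 1:
                continue
            sequence = line.strip()
            while len(counts) < len(sequence):
                counts.append(Counter())
            for position, base in enumerate(sequence):
                counts[position][base] += 1
    return counts
```

Calling `base_composition` on a plain file and on a gzipped copy of it should give the same per-position counts.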
Additional Wishlist - Not Implemented
- To reduce memory usage, implement a two-pass algorithm that keeps only a key for each sequence name in memory, rather than the complete name. Suggested options: -1 for one pass with high memory use, -2 for two passes with lower memory use; the default would be -1.
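One possible reading of this wishlist item is sketched below in Python. Since the feature is not implemented, everything here is an assumption: the key is taken to be a fixed-size hash of the name, the first pass records only keys, and the second pass stores full names only for keys seen more than once, so that hash collisions are not reported as duplicates.

```python
import hashlib


def name_key(name):
    """Fixed-size 8-byte key for a sequence name (hash choice is an assumption)."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()


def find_duplicates_two_pass(read_names):
    """read_names: zero-argument callable returning a fresh iterator of names.

    Returns the repeated names in the order their repeats are encountered.
    """
    # Pass 1: keep only keys and note which keys occur more than once.
    seen, suspect = set(), set()
    for name in read_names():
        key = name_key(name)
        if key in seen:
            suspect.add(key)
        else:
            seen.add(key)
    del seen  # the full key set is no longer needed

    # Pass 2: store full names only for suspect keys, to rule out collisions.
    names_seen, duplicates = set(), []
    for name in read_names():
        if name_key(name) in suspect:
            if name in names_seen:
                duplicates.append(name)
            else:
                names_seen.add(name)
    return duplicates
```

The trade-off is that the input has to be read twice, which is why a callable that can restart the iteration is passed in rather than a single iterator.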
Assumptions
How to Use the fastQValidator Executable
Required Parameters:
-f : FastQ filename with path to be processed.
Optional Parameters:
-l : Minimum allowed read length (Defaults to 10).
-e : Maximum number of errors to display before suppressing them (Defaults to 20).
-b : Raw sequence type:
     B - ACTGN only (Default)
     C - 0123. only
     BC - ACTGN or 0123.
Testing only Parameters:
-t : If "ReadOnly" is specified, the fastq will be read but not processed. This may be used for determining read time.
Usage:
./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>
Examples:
../fastQValidator -f testFile.txt
../fastQValidator -f testFile.txt -l 10 -b BC -e 100
./fastQValidator -f test/testFile.txt -l 10 -b BC -e 100
time ./fastQValidator -f test/testFile.txt -t ReadOnly
FastQ Validator Output
Coming Soon