Line 1: |
− | == Validation Criteria ==
| + | [[Category:C++]] |
− | === Sequence Identifier Line ===
| + | [[Category:libStatGen]] |
− | {| class="wikitable" border="1"
| + | [[Category:libStatGen FASTQ]] |
− | |-
| |
− | ! Validation Criteria
| |
− | ! Error Message
| |
− | |-
| |
− | | Every entry in the file should have a unique identifier.
| |
− | | ERROR on Line <current line #>: Repeated Sequence Identifier: <identifier> at Lines <previous line #> <current line #>
| |
− | |}
| |
| | | |
− | === Raw Sequence Line === | + | == Where to find the fastqFile Library and the FastQValidator == |
− | *A base sequence should have non-zero length.
| |
− | *Validates the base sequences against the characters allowed via configuration.
| |
− | ** Base Only: A C T G N a c t g n
| |
− | ** Color Space Only: 0 1 2 3 .(period)
| |
− | ** Base or Color Space: A C T G N a c t g n 0 1 2 3 .(period)
| |
− | *Reads should be of a minimum length; many mappers will get into trouble with very short reads.
| |
| | | |
− | === Plus Line ===
| + | The fastQ Library is now a part of [[C++ Library: libStatGen]]. |
| | | |
− | === Quality String Line ===
| + | The FastQValidator is documented at [[FastQValidator]]. |
− | *A quality string should be present for every base sequence.
| |
− | *Paired quality and base sequences should be of the same length.
| |
− | *Valid quality values should all have ASCII codes > 32.
| |
| | | |
− | == Additional Features == | + | == FASTQ Library Component for Reading and Validating FastQFiles == |
− | *Base composition is reported and tracked by position.
| + | The software reads and validates fastq files in both compressed and uncompressed formats. |
− | *Consumes gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
| |
| | | |
− | == Additional Wishlist - Not Implemented ==
| + | The FASTQ component of the library is found in libStatGen/fastq/. |
− | *To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1).
| |
| | | |
| + | See https://github.com/statgen/libStatGen/commits/master/fastq for a list of the most recent updates to the development version of the FASTQ portion of the library. |
| | | |
| + | For the old change log, see: [[C++ Library: FASTQ Change Log]] |
| | | |
− | == Assumptions == | + | === Classes in the FASTQ Portion of Library === |
− | | + | {| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" |
− | == How to Use the fastQValidator Executable == | + | |-style="background: #f2f2f2; text-align: center;" |
− | '''Required Parameters:'''
| + | ! Class Name !! Description |
− | -f : FastQ filename with path to be processed.
| + | |- |
− | | + | | <code>[[C++ Class: FastQFile|FastQFile]]</code> |
− | '''Optional Parameters:'''
| + | | Class used for reading/validating a fastq file. |
− | -l : Minimum allowed read length (Defaults to 10).
| + | |- |
− | -e : Maximum number of errors to display before suppressing them (Defaults to 20).
| + | | <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classBaseCount.html BaseCount]</code> |
− | -b : Raw sequence type: B - ACTGN only (Default)
| + | | Wrapper around an array that has one index per base and an extra index for a total count of all bases. This class is used to keep a count of the number of times each index has occurred. It can print a percentage of the occurrence of each base against the total number of bases. |
− | C - 0123. only
| + | |- |
− | BC - ACTGN or 0123.
| + | | <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classBaseComposition.html BaseComposition]</code> |
− | | + | | Class that tracks the composition of base by read location. |
− | '''Testing only Parameters:'''
| + | |- |
− | -t : If "ReadOnly" is specified, the fastq will be read but not processed. This may be used for determining read time.
| + | | <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classFastQStatus.html FastQStatus]</code> |
− | '''Usage:'''
| + | | Status for FastQ operations. |
− | ./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>
| + | |} |
| | | |
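The BaseCount row in the class table above describes an array with one slot per base plus an extra slot for the total, which can report each base as a percentage of that total. The following is a minimal sketch of that idea only. It is not the libStatGen implementation: the names <code>BaseCountSketch</code>, <code>kNumBases</code>, <code>kTotalIndex</code>, <code>increment</code>, <code>count</code>, <code>percent</code> and <code>indexOf</code>, and the A/C/G/T/N slot order, are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>

// Slots 0-4 count A, C, G, T and N; the extra last slot counts all bases.
// (Slot order and names are assumptions for this sketch.)
constexpr std::size_t kNumBases = 5;
constexpr std::size_t kTotalIndex = kNumBases;

// Hypothetical stand-in for the idea behind BaseCount; not the libStatGen class.
class BaseCountSketch {
public:
    // Count one base; returns false (and counts nothing) for an unknown character.
    bool increment(char base) {
        const std::size_t index = indexOf(base);
        if (index >= kNumBases) {
            return false;
        }
        ++counts_[index];
        ++counts_[kTotalIndex];
        return true;
    }

    // Raw count stored in a slot (pass kTotalIndex for the overall total).
    std::size_t count(std::size_t index) const { return counts_.at(index); }

    // Percentage of the total held by one slot; 0 when nothing was counted.
    double percent(std::size_t index) const {
        const std::size_t total = counts_[kTotalIndex];
        if (total == 0) {
            return 0.0;
        }
        return 100.0 * static_cast<double>(counts_.at(index)) /
               static_cast<double>(total);
    }

    // Map a base character to its slot; unknown characters map to kNumBases.
    static std::size_t indexOf(char base) {
        switch (base) {
            case 'A': case 'a': return 0;
            case 'C': case 'c': return 1;
            case 'G': case 'g': return 2;
            case 'T': case 't': return 3;
            case 'N': case 'n': return 4;
            default:            return kNumBases;
        }
    }

private:
    std::array<std::size_t, kNumBases + 1> counts_{};  // zero-initialized
};
```

Keeping the total in the same array as the per-base counts means a percentage needs only two array reads. A per-position composition tracker in the spirit of BaseComposition could then hold one such object per read position.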
− | '''Examples:'''
| + | == FASTQ Output == |
− | ../fastQValidator -f testFile.txt
| + | When a sequence is read, error messages for the first maxReportedErrors are output for failed [[C++ Class: FastQFile#Validation Criteria Used For Reading a Sequence|Validation Criteria]]. |
− | ../fastQValidator -f testFile.txt -l 10 -b BC -e 100
| + | For example:
− | ./fastQValidator -f test/testFile.txt -l 10 -b BC -e 100
| + | ERROR on Line 25: The sequence identifier line was too short. |
− | time ./fastQValidator -f test/testFile.txt -t ReadOnly
| + | ERROR on Line 29: First line of a sequence does not begin with @
| + | ERROR on Line 33: No Sequence Identifier specified before the comment. |
| | | |
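To illustrate how messages in the "ERROR on Line N: ..." form shown above might be produced and capped at maxReportedErrors, here is a minimal standalone sketch. It does not use the FastQFile API. The type <code>FastqRecordSketch</code> and the functions <code>formatError</code> and <code>validateRecord</code> are hypothetical. The three identifier-line messages are modeled on the example output, while the messages for the raw sequence and quality string are invented for illustration. The cap here applies per call, which is a simplification.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type: the four lines of one FASTQ entry.
struct FastqRecordSketch {
    std::string identifierLine;
    std::string rawSequence;
    std::string plusLine;
    std::string qualityString;
};

// Build a message in the "ERROR on Line N: ..." style.
std::string formatError(std::size_t lineNumber, const std::string& message) {
    return "ERROR on Line " + std::to_string(lineNumber) + ": " + message;
}

// Check one record whose identifier line sits at firstLine. At most
// maxReportedErrors messages are kept; later ones are suppressed.
std::vector<std::string> validateRecord(const FastqRecordSketch& rec,
                                        std::size_t firstLine,
                                        std::size_t maxReportedErrors) {
    std::vector<std::string> errors;
    auto report = [&](std::size_t line, const std::string& message) {
        if (errors.size() < maxReportedErrors) {
            errors.push_back(formatError(line, message));
        }
    };

    // Identifier line: messages modeled on the example output.
    const std::string& id = rec.identifierLine;
    if (id.size() < 2) {
        report(firstLine, "The sequence identifier line was too short.");
    } else if (id[0] != '@') {
        report(firstLine, "First line of a sequence does not begin with @");
    } else if (id[1] == ' ') {
        report(firstLine, "No Sequence Identifier specified before the comment.");
    }

    // Raw sequence and quality string: message wording below is hypothetical.
    if (rec.rawSequence.empty()) {
        report(firstLine + 1, "Raw sequence has zero length.");
    }
    if (rec.qualityString.size() != rec.rawSequence.size()) {
        report(firstLine + 3, "Quality string and raw sequence lengths differ.");
    }
    for (char c : rec.qualityString) {
        if (static_cast<unsigned char>(c) <= 32) {
            report(firstLine + 3, "Quality string has a character with ASCII code <= 32.");
            break;  // one message per record is enough for this check
        }
    }
    return errors;
}
```

Calling <code>validateRecord</code> with a record whose identifier line is just <code>@</code>, starting at line 25, yields a single message matching the first example line above. Passing 0 as the cap suppresses all messages.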
− | == FastQ Validator Output == | + | == FastQValidator == |
− | '''Coming Soon'''
| + | The [[FastQValidator]] was built using the FastQFile class. More details on that program are at the supplied link. |