Difference between revisions of "LibStatGen: FASTQ"

From Genome Analysis Wiki
(47 intermediate revisions by 2 users not shown)
Previous revision (Line 1):

== Validation Criteria ==
=== Sequence Identifier Line ===
{| class="wikitable" border="1"
|-
!  Validation Criteria
!  Error Message
|-
|  Every entry in the file should have a unique identifier.
|  ERROR on Line <current line #>: Repeated Sequence Identifier: <identifier> at Lines <previous line #> <current line #>
|}

=== Raw Sequence Line ===
*A base sequence should have non-zero length.
*Validates the base sequences against the characters allowed via configuration.
** Base Only: A C T G N a c t g n
** Color Space Only: 0 1 2 3 .(period)
** Base or Color Space: A C T G N a c t g n 0 1 2 3 .(period)
*Reads should be of a minimum length; many mappers will get into trouble with very short reads.

=== Plus Line ===

=== Quality String Line ===
*A quality string should be present for every base sequence.
*Paired quality and base sequences should be of the same length.
*Valid quality values should all have ASCII codes > 32.

== Additional Features ==
*Base composition is reported and tracked by position.
*Consumes gzipped and uncompressed text files transparently (see libcsg/InputFile.h).

== Additional Wishlist - Not Implemented ==
*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options: -1 -> one pass, high memory use; -2 -> two pass, lower memory use; default is -1).

== Assumptions ==

== How to Use the fastQValidator Executable ==
'''Required Parameters:'''
        -f : FastQ filename with path to be processed.

'''Optional Parameters:'''
        -l : Minimum allowed read length (Defaults to 10).
        -e : Maximum number of errors to display before suppressing them (Defaults to 20).
        -b : Raw sequence type:  B - ACTGN only (Default)
                                  C - 0123. only
                                  BC - ACTGN or 0123.

'''Testing only Parameters:'''
        -t :  If "ReadOnly" is specified, the fastq will be read but not processed. This may be used for determining read time.

'''Usage:'''
        ./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>

'''Examples:'''
        ../fastQValidator -f testFile.txt
        ../fastQValidator -f testFile.txt -l 10 -b BC -e 100
        ./fastQValidator -f test/testFile.txt -l 10 -b BC -e 100
        time ./fastQValidator -f test/testFile.txt -t ReadOnly

== FastQ Validator Output ==
'''Coming Soon'''

Latest revision (Line 1):

[[Category:C++]]
[[Category:libStatGen]]
[[Category:libStatGen FASTQ]]

== Where to find the fastqFile Library and the FastQValidator ==
The fastQ Library is now a part of [[C++ Library: libStatGen]].

The FastQValidator is documented at [[FastQValidator]].

== FASTQ Library Component for Reading and Validating FastQFiles ==
The software reads and validates fastq files in both compressed and uncompressed formats.

The FASTQ component of the library is found in libStatGen/fastq/.

See https://github.com/statgen/libStatGen/commits/master/fastq for a list of the most recent updates to the development version of the FASTQ portion of the library.

For the old change log, see: [[C++ Library: FASTQ Change Log]]

=== Classes in the FASTQ Portion of Library ===
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
|-style="background: #f2f2f2; text-align: center;"
! Class Name !! Description
|-
| <code>[[C++ Class: FastQFile|FastQFile]]</code>
| Class used for reading/validating a fastq file.
|-
| <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classBaseCount.html BaseCount]</code>
| Wrapper around an array that has one index per base and an extra index for a total count of all bases. This class is used to keep a count of the number of times each index has occurred. It can print a percentage of the occurrence of each base against the total number of bases.
|-
| <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classBaseComposition.html BaseComposition]</code>
| Class that tracks the composition of base by read location.
|-
| <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classFastQStatus.html FastQStatus]</code>
| Status for FastQ operations.
|}

== FASTQ Output ==
When a sequence is read, error messages for the first maxReportedErrors are output for failed [[C++ Class: FastQFile#Validation Criteria Used For Reading a Sequence|Validation Criteria]].
For Example:
 ERROR on Line 25: The sequence identifier line was too short.
 ERROR on Line 29: First line of a sequence does not begin wtih @
 ERROR on Line 33: No Sequence Identifier specified before the comment.

== FastQValidator ==
The [[FastQValidator]] was built using the FastQFile class.  More details on that program are at the supplied link.

Latest revision as of 10:49, 2 February 2017


Where to find the fastqFile Library and the FastQValidator

The fastQ Library is now a part of C++ Library: libStatGen.

The FastQValidator is documented at FastQValidator.

FASTQ Library Component for Reading and Validating FastQFiles

The software reads and validates fastq files in both compressed and uncompressed formats.
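
One common way to accept both compressed and uncompressed input is to peek at the first two bytes of the file and look for the gzip magic number (0x1f 0x8b). The sketch below illustrates that idea only; the helper names <code>isGzipMagic</code> and <code>fileLooksGzipped</code> are hypothetical, and nothing here is taken from libStatGen's own input handling.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper, not a libStatGen function: true when the two bytes
// match the gzip magic number (0x1f, 0x8b).
constexpr bool isGzipMagic(unsigned char first, unsigned char second)
{
    return first == 0x1f && second == 0x8b;
}

// Peeks at the first two bytes of a file to decide which reader to use.
bool fileLooksGzipped(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    char bytes[2] = {0, 0};
    if (!in.read(bytes, 2))
    {
        return false;  // Missing or shorter than two bytes: treat as plain text.
    }
    return isGzipMagic(static_cast<unsigned char>(bytes[0]),
                       static_cast<unsigned char>(bytes[1]));
}
```

A caller could use the result to choose between a decompressing reader and a plain text reader before parsing any FASTQ records.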

The FASTQ component of the library is found in libStatGen/fastq/.

See https://github.com/statgen/libStatGen/commits/master/fastq for a list of the most recent updates to the development version of the FASTQ portion of the library.

For the old change log, see: C++ Library: FASTQ Change Log

Classes in the FASTQ Portion of Library

Class Name | Description
FastQFile | Class used for reading/validating a fastq file.
BaseCount | Wrapper around an array that has one index per base and an extra index for a total count of all bases. This class is used to keep a count of the number of times each index has occurred. It can print a percentage of the occurrence of each base against the total number of bases.
BaseComposition | Class that tracks the composition of base by read location.
FastQStatus | Status for FastQ operations.
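
To make the BaseCount and BaseComposition descriptions concrete, here is a minimal, self-contained sketch of the two ideas: an array with one slot per base plus an extra slot for the total, and a per-position collection of such arrays. The class names <code>BaseTally</code> and <code>PositionComposition</code>, the slot order, and every method shown are assumptions made for illustration; they are not the libStatGen classes or their interfaces.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Illustrative stand-ins only; names and layout are assumptions.
// Slots 0-4 count A, C, G, T and N; the extra slot 5 holds the total.
const std::size_t kTotalSlot = 5;

class BaseTally
{
public:
    // Counts one base; returns false for characters outside A C G T N.
    bool add(char base)
    {
        const std::size_t slot = slotOf(base);
        if (slot == kTotalSlot)
        {
            return false;
        }
        ++counts_[slot];
        ++counts_[kTotalSlot];
        return true;
    }

    std::size_t total() const { return counts_[kTotalSlot]; }

    // Percentage of this base among all counted bases (0 when empty).
    double percent(char base) const
    {
        const std::size_t slot = slotOf(base);
        if (slot == kTotalSlot || total() == 0)
        {
            return 0.0;
        }
        return 100.0 * static_cast<double>(counts_[slot]) /
               static_cast<double>(total());
    }

private:
    // Maps a base (either case) to its slot; kTotalSlot means "not a base".
    static std::size_t slotOf(char base)
    {
        switch (std::toupper(static_cast<unsigned char>(base)))
        {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            case 'N': return 4;
            default:  return kTotalSlot;
        }
    }

    std::array<std::size_t, 6> counts_{};
};

// Keeps one tally per read position, growing as longer reads arrive.
class PositionComposition
{
public:
    void addRead(const std::string& read)
    {
        if (read.size() > perPosition_.size())
        {
            perPosition_.resize(read.size());
        }
        for (std::size_t i = 0; i < read.size(); ++i)
        {
            perPosition_[i].add(read[i]);
        }
    }

    std::size_t length() const { return perPosition_.size(); }

    const BaseTally& at(std::size_t position) const
    {
        assert(position < perPosition_.size());
        return perPosition_[position];
    }

private:
    std::vector<BaseTally> perPosition_;
};
```

Storing the total in the same array keeps the percentage calculation to a single division per base, which is the property the BaseCount description highlights.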

FASTQ Output

When a sequence is read, error messages are output for the first maxReportedErrors failures of the Validation Criteria. For example:

ERROR on Line 25: The sequence identifier line was too short.
ERROR on Line 29: First line of a sequence does not begin wtih @
ERROR on Line 33: No Sequence Identifier specified before the comment.
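
The behaviour described above, checking each record against the criteria and printing only the first maxReportedErrors messages, can be sketched as follows. This is a simplified illustration under stated assumptions: <code>FastqRecord</code>, <code>ErrorLog</code>, <code>isBaseChar</code> and <code>validate</code> are hypothetical names, the check covers base space only, and the message texts are invented for the example rather than copied from FastQFile. Only the "ERROR on Line N:" prefix follows the output shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for this sketch; not a libStatGen type.
struct FastqRecord
{
    std::size_t firstLine;   // 1-based line number of the identifier line
    std::string identifier;  // e.g. "@read1"
    std::string sequence;    // raw sequence line
    std::string quality;     // quality string line
};

// Keeps only the first maxReported messages but counts every error.
class ErrorLog
{
public:
    explicit ErrorLog(std::size_t maxReported) : maxReported_(maxReported) {}

    void report(std::size_t line, const std::string& text)
    {
        assert(line > 0);  // FASTQ line numbers are 1-based.
        ++total_;
        if (messages_.size() < maxReported_)
        {
            std::ostringstream out;
            out << "ERROR on Line " << line << ": " << text;
            messages_.push_back(out.str());
        }
    }

    std::size_t total() const { return total_; }
    const std::vector<std::string>& messages() const { return messages_; }

private:
    std::size_t maxReported_;
    std::size_t total_ = 0;
    std::vector<std::string> messages_;
};

// Base-space alphabet only: A C T G N in either case.
bool isBaseChar(char c)
{
    return std::string("ACTGNactgn").find(c) != std::string::npos;
}

// Applies a subset of the criteria; message wording is illustrative.
void validate(const FastqRecord& rec, std::size_t minReadLength, ErrorLog& log)
{
    if (rec.identifier.empty() || rec.identifier[0] != '@')
    {
        log.report(rec.firstLine, "Identifier line does not start with @.");
    }
    if (rec.sequence.size() < minReadLength)
    {
        log.report(rec.firstLine + 1, "Read is shorter than the minimum length.");
    }
    for (char c : rec.sequence)
    {
        if (!isBaseChar(c))
        {
            log.report(rec.firstLine + 1, "Invalid character in raw sequence.");
            break;  // One message per line is enough for this sketch.
        }
    }
    if (rec.quality.size() != rec.sequence.size())
    {
        log.report(rec.firstLine + 3, "Quality string and raw sequence differ in length.");
    }
    for (char c : rec.quality)
    {
        if (static_cast<unsigned char>(c) <= 32)
        {
            log.report(rec.firstLine + 3, "Quality value has ASCII code of 32 or less.");
            break;
        }
    }
}
```

Counting suppressed errors separately from the stored messages lets a caller still report the overall number of failures after the message limit is reached.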

FastQValidator

The FastQValidator was built using the FastQFile class. More details on that program are at the supplied link.