Changes

From Genome Analysis Wiki
54 bytes added, 10:49, 2 February 2017
Line 1: Line 1: −
== Validation Criteria ==
+
[[Category:C++]]
=== Sequence Identifier Line ===
+
[[Category:libStatGen]]
*Every entry in the file should have a unique identifier.
+
[[Category:libStatGen FASTQ]]
   −
=== Raw Sequence Line ===
+
== Where to find the fastqFile Library and the FastQValidator ==
*A base sequence should have non-zero length.
  −
*Validates the base sequences against the characters allowed via configuration.
  −
** Base Only: A C T G N a c t g n
  −
** Color Space Only: 0 1 2 3 .(period)
  −
** Base or Color Space: A C T G N a c t g n 0 1 2 3 .(period)
  −
*Reads should be of a minimum length; many mappers will get into trouble with very short reads.
     −
=== Plus Line ===
+
The fastQ Library is now a part of [[C++ Library: libStatGen]].
   −
=== Quality String Line ===
+
The FastQValidator is documented at [[FastQValidator]].
*A quality string should be present for every base sequence.
  −
*Paired quality and base sequences should be of the same length.
  −
*Valid quality values should all have ASCII codes > 32.
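
The per-record criteria listed above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: <code>FastqRecord</code>, <code>validateRecord</code>, and the returned messages are hypothetical names invented here and are not the libStatGen <code>FastQFile</code> API. The check for unique identifiers across the whole file is left out because it needs state beyond a single record.

```cpp
#include <cstddef>
#include <string>

// Hypothetical record type for illustration; not the libStatGen FastQFile API.
struct FastqRecord {
    std::string identifier;  // sequence identifier line, expected to start with '@'
    std::string bases;       // raw sequence line
    std::string qualities;   // quality string line
};

// Base-space characters listed in the criteria: A C T G N in either case.
bool isBaseChar(char c) {
    return std::string("ACTGNactgn").find(c) != std::string::npos;
}

// Color-space characters listed in the criteria: 0 1 2 3 and period.
bool isColorChar(char c) {
    return std::string("0123.").find(c) != std::string::npos;
}

// Returns an empty string when the record passes, otherwise a short reason.
// The reason strings are made up for this sketch.
std::string validateRecord(const FastqRecord& rec, std::size_t minReadLength,
                           bool allowBase, bool allowColor) {
    if (rec.identifier.size() < 2 || rec.identifier[0] != '@')
        return "bad sequence identifier";
    if (rec.bases.empty())
        return "empty base sequence";
    if (rec.bases.size() < minReadLength)
        return "read shorter than minimum length";
    for (char c : rec.bases) {
        bool ok = (allowBase && isBaseChar(c)) || (allowColor && isColorChar(c));
        if (!ok) return "invalid character in base sequence";
    }
    if (rec.qualities.size() != rec.bases.size())
        return "quality and base lengths differ";
    for (char q : rec.qualities) {
        // Valid quality values must have ASCII codes greater than 32.
        if (static_cast<unsigned char>(q) <= 32)
            return "quality value with ASCII code <= 32";
    }
    return "";
}
```

Passing <code>allowBase</code> and <code>allowColor</code> separately mirrors the three configurations in the list (base only, color space only, or both).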
     −
== Additional Features ==
+
== FASTQ Library Component for Reading and Validating FastQFiles ==
*Base composition is reported and tracked by position.
+
The software reads and validates fastq files in both compressed and uncompressed formats.
*Consumes gzipped and uncompressed text files transparently (see libcsg/InputFile.h).
     −
== Additional Wishlist - Not Implemented ==
+
The FASTQ component of the library is found in libStatGen/fastq/.
*To reduce memory usage, implement a two-pass algorithm that stores only a key for each sequence name (rather than complete sequence names) in memory (suggest a pair of options -1 -> one pass, high memory use, -2 -> two pass lower memory use, default is -1).
      +
See https://github.com/statgen/libStatGen/commits/master/fastq for a list of the most recent updates to the development version of the FASTQ portion of the library.
    +
For the old change log, see: [[C++ Library: FASTQ Change Log]]
   −
== Assumptions ==
+
=== Classes in the FASTQ Portion of Library ===
 +
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
|-style="background: #f2f2f2; text-align: center;"
 +
! Class Name !!  Description
 +
|-
 +
| <code>[[C++ Class: FastQFile|FastQFile]]</code>
 +
| Class used for reading/validating a fastq file.
 +
|-
 +
| <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classBaseCount.html BaseCount]</code>
 +
| Wrapper around an array that has one index per base and an extra index for a total count of all bases.  This class is used to keep a count of the number of times each index has occurred.  It can print a percentage of the occurrence of each base against the total number of bases.
 +
|-
 +
| <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classBaseComposition.html BaseComposition]</code>
 +
| Class that tracks the composition of bases by read position.
 +
|-
 +
| <code>[http://csg.sph.umich.edu//mktrost/doxygen/current/classFastQStatus.html FastQStatus]</code>
 +
| Status for FastQ operations.
 +
|}
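
As a rough illustration of the idea behind the <code>BaseCount</code> and <code>BaseComposition</code> descriptions in the table (one slot per base plus an extra slot for the total, kept separately for each read position), here is a minimal sketch. <code>PositionalBaseCounts</code> and its methods are hypothetical names made up for this example; the actual libStatGen classes may be organized differently.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for illustration; not the libStatGen BaseComposition class.
class PositionalBaseCounts {
public:
    // Record one read: position i of the read updates the counts for position i.
    void addRead(const std::string& bases) {
        if (bases.size() > counts_.size())
            counts_.resize(bases.size());  // new entries are value-initialized to zero
        for (std::size_t i = 0; i < bases.size(); ++i) {
            ++counts_[i][indexOf(bases[i])];
            ++counts_[i][kTotal];
        }
    }

    // Percentage of reads covering `position` that carry `base` there.
    double percent(std::size_t position, char base) const {
        if (position >= counts_.size() || counts_[position][kTotal] == 0)
            return 0.0;
        return 100.0 * counts_[position][indexOf(base)] / counts_[position][kTotal];
    }

private:
    static constexpr std::size_t kTotal = 5;  // slots 0-4 hold A, C, G, T, N

    static std::size_t indexOf(char base) {
        switch (base) {
            case 'A': case 'a': return 0;
            case 'C': case 'c': return 1;
            case 'G': case 'g': return 2;
            case 'T': case 't': return 3;
            default:            return 4;  // N and anything unrecognized
        }
    }

    // One array per read position: five base slots plus one total slot.
    std::vector<std::array<unsigned long, 6>> counts_;
};
```

Keeping the total in the same array as the per-base counts means a percentage can be computed from a single position entry without summing the other slots.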
   −
== How to Use the fastQValidator Executable ==
+
== FASTQ Output ==  
'''Required Parameters:'''
+
When a sequence is read, error messages are output for the first maxReportedErrors failures of the [[C++ Class: FastQFile#Validation Criteria Used For Reading a Sequence|Validation Criteria]].
        -f FastQ filename with path to be processed.
+
For Example:
 +
ERROR on Line 25: The sequence identifier line was too short.
 +
  ERROR on Line 29: First line of a sequence does not begin wtih @
 +
  ERROR on Line 33: No Sequence Identifier specified before the comment.
   −
'''Optional Parameters:'''
+
== FastQValidator ==
        -l  :  Minimum allowed read length (Defaults to 10).
+
The [[FastQValidator]] was built using the FastQFile class. More details on that program are at the supplied link.
        -e  :  Maximum number of errors to display before suppressing them (Defaults to 20).
  −
        -b  :  Raw sequence type:  B - ACTGN only (Default)
  −
                                  C - 0123. only
  −
                                  BC - ACTGN or 0123.
  −
 
  −
'''Testing only Parameters:'''
  −
        -t  : If "ReadOnly" is specified, the fastq will be read but not processed.  This may be used for determining read time.
  −
'''Usage:'''
  −
        ./fastQValidator -f <fileName> -l <minReadLen> -e <maxReportedErrors> -b <rawSeqType>
  −
 
  −
'''Examples:'''
  −
        ../fastQValidator -f testFile.txt
  −
        ../fastQValidator -f testFile.txt -l 10 -b BC -e 100
  −
        ./fastQValidator -f test/testFile.txt -l 10 -b BC -e 100
  −
        time ./fastQValidator -f test/testFile.txt -t ReadOnly
  −
 
  −
== FastQ Validator Output ==
  −
'''Coming Soon'''
 