Difference between revisions of "LibStatGen: ASP"

From Genome Analysis Wiki
Latest revision as of 11:45, 30 April 2012


Asymmetric Pileup (ASP)

Asymmetric Pileup (ASP) is a new pileup file format that we created to replace GLF.

It is currently in an initial test phase and is not yet available for public use.


ASP File Format

ASP files are binary files consisting of a header and 4 types of records.

The header contains a list of all the chromosome names in the file in the order in which they appear (see ASP Header).

Every record starts with a 1-byte field. The lower 4 bits indicate the type and for some types, the upper four bits indicate the reference base.

{| border="1"
! Type Value !! Record Type !! Description
|-
| 0 || Empty || Indicates there are no bases for this position.
|-
| 1 || Position Only || Specifies the chromosome ID (see ASP Header) and 0-based position of the next record (other records do not contain a position).
|-
| 2 || Reference Only || Indicates that all bases at this position match the reference and provides the number of bases, GLH, and GLA.
|-
| 3 || Detailed || Indicates that not all bases at this position match the reference and provides the number of bases, the bases, the qualities, the cycles, the strands, and the MQs.
|}
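
The split of the first byte into a type and a reference base can be sketched as follows. This is only an illustration of the bit layout described above; the enum and the two helper names are hypothetical and are not part of libStatGen.

```cpp
#include <cstdint>

// Record type codes, taken from the table of record types.
// (Hypothetical enum for illustration; not a libStatGen type.)
enum class AspRecordType : uint8_t {
    Empty = 0,
    PositionOnly = 1,
    ReferenceOnly = 2,
    Detailed = 3
};

// Lower 4 bits of the first byte: the record type.
inline uint8_t aspTypeBits(uint8_t firstByte) {
    return static_cast<uint8_t>(firstByte & 0x0F);
}

// Upper 4 bits of the first byte: the reference base code.
// Only meaningful for Reference Only and Detailed records.
inline uint8_t aspRefBaseBits(uint8_t firstByte) {
    return static_cast<uint8_t>((firstByte >> 4) & 0x0F);
}
```

For the first byte 0x23, the type bits are 3 (Detailed) and the reference base bits are 2 (G).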

Only position records contain a chromosome id/position. The record after a position record has the chromosome id & position specified in the position record. All other records are assumed to just increment one position from the previous record.

The first record in a file must be a Position Only Record.
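
The position rules above can be modelled with a small tracker. This is a minimal sketch under the stated rules only; the class `AspPositionTracker` is hypothetical and does not describe how libStatGen tracks positions internally.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not part of libStatGen): derives the chromosome ID and
// 0-based position of each record from the sequence of records read so far.
class AspPositionTracker {
public:
    // Call when a Position Only record is read.
    // The NEXT record takes exactly this chromosome ID and position.
    void onPositionRecord(int32_t chromID, int32_t pos) {
        chromID_ = chromID;
        nextPos_ = pos;
        havePosition_ = true;
    }

    // Call for each Empty, Reference Only, or Detailed record.
    // Returns the position of that record and moves on by one.
    int32_t onDataRecord() {
        if (!havePosition_) {
            throw std::runtime_error("the first record must be a Position Only record");
        }
        return nextPos_++;
    }

    int32_t chromID() const { return chromID_; }

private:
    int32_t chromID_ = -1;
    int32_t nextPos_ = -1;
    bool havePosition_ = false;
};
```

After a Position Only record for chromosome 0, position 100, the next three records are at positions 100, 101, and 102.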

ASP Header

The chromosome ID found in the Position Only records is the 0-based index into this list, ranging from 0 to numChroms - 1.

{| border="1"
! colspan="2" | Field !! Description !! Type
|-
| colspan="2" | numChroms || The number of chromosomes in the following list || uint32_t
|-
| colspan="4" | ''List of chromosome information (numChroms sets of entries)''
|-
| || chromNameLen || Length of the chromosome name + 1 (including NULL) || uint32_t
|-
| || chromName || Chromosome name, NULL terminated || char[chromNameLen]
|}
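
A minimal sketch of reading this header from an in-memory buffer follows. Two things are assumptions because this page does not state them: that multi-byte integers are stored least-significant byte first, and that the header begins at the offset passed in. The function names `readU32LE` and `parseAspHeader` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Reads a uint32_t stored least-significant byte first and advances 'off'.
// ASSUMPTION: little-endian byte order; the format description does not say.
static uint32_t readU32LE(const std::vector<uint8_t>& buf, std::size_t& off) {
    if (off > buf.size() || buf.size() - off < 4) {
        throw std::runtime_error("truncated header");
    }
    uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<uint32_t>(buf[off + i]) << (8 * i);
    }
    off += 4;
    return v;
}

// Parses numChroms followed by numChroms (chromNameLen, chromName) entries.
// The index of a name in the returned vector is its chromosome ID.
std::vector<std::string> parseAspHeader(const std::vector<uint8_t>& buf,
                                        std::size_t& off) {
    std::vector<std::string> names;
    const uint32_t numChroms = readU32LE(buf, off);
    for (uint32_t c = 0; c < numChroms; ++c) {
        const uint32_t len = readU32LE(buf, off);  // includes the trailing NULL
        if (len == 0 || buf.size() - off < len) {
            throw std::runtime_error("truncated chromosome name");
        }
        if (buf[off + len - 1] != 0) {
            throw std::runtime_error("chromosome name is not NULL terminated");
        }
        names.emplace_back(reinterpret_cast<const char*>(&buf[off]), len - 1);
        off += len;
    }
    return names;
}
```

With a header listing "1" and then "X", a Position Only record with chromID 1 refers to "X".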


Record Type: Empty

An empty record is only 1 byte and just contains the type field.

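Because records other than Position Only records do not store a position, the position of an Empty record is one more than the position of the previous record, unless the previous record is a Position Only record, in which case it is the position given there.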

When positions have no bases, there are two ways to deal with them.

  1. Write a new position record for the next position that has bases
  2. Write empty records to indicate those positions have no bases.

If positions that have bases are not far apart, it is preferable to write empty records rather than a new position record since Empty records should compress well and are only 1 byte.
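
As a rough guide to "not far apart", the uncompressed sizes can be compared directly. The sketch below assumes a Position Only record occupies 9 bytes (a 1-byte type plus two int32_t fields, with no padding), which follows from the Position Only field table but is not stated as a total on this page. It ignores compression, so the real break-even gap may well be larger. The names are hypothetical.

```cpp
#include <cstdint>

// Uncompressed record sizes in bytes.
constexpr uint64_t kEmptyRecordBytes = 1;             // type field only
constexpr uint64_t kPositionRecordBytes = 1 + 4 + 4;  // type + chromID + pos (assumed unpadded)

// Returns true if filling a gap of 'gapPositions' positions with Empty records
// costs no more uncompressed bytes than writing one Position Only record.
constexpr bool preferEmptyRecords(uint64_t gapPositions) {
    return gapPositions * kEmptyRecordBytes <= kPositionRecordBytes;
}
```

Under these assumptions a gap of up to 9 positions is at least as cheap to fill with Empty records, before compression is taken into account.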


{| border="1"
! Field !! Description !! Type !! Value
|-
| type || Empty Record Type || uint8_t || 0
|}


Record Type: Position Only

Position Only records are used to specify the chromosome ID & 0-based position of the following record. The first record in the file must be a Position Only record.

{| border="1"
! Field !! Description !! Type !! Value
|-
| type || Position Only Record Type || uint8_t || 1
|-
| chromID || Chromosome ID of the next record || int32_t ||
|-
| pos || 0-based position of the next record || int32_t ||
|}


Record Type: Reference Only

The position associated with a Reference Only record is 1 greater than the position of the previous record unless the previous record is a position record.

The upper 4 bits of the first byte contain the reference base.

{| border="1"
! Field !! Description !! Type !! Value/Range
|-
| refBase || Reference Base || 4 bits || 0=A, 1=C, 2=G, 3=T, 4=N
|-
| type || Reference Only Record Type || 4 bits || 2
|-
| numBases || Number of bases at this position || uint8_t || 1-255
|-
| GLH || Genotype Likelihood H || uint8_t || 0-255
|-
| GLA || Genotype Likelihood Alternate || uint8_t || 0-255
|}

If a position has more than 255 bases, only the first 255 are used for calculating the GLH and GLA.

Phred Qualities that are unknown or less than 13 are not used in the Genotype Likelihood calculation, but are counted in the numBases if there is a base at the position.
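
A Reference Only record is therefore 4 bytes long. The sketch below decodes one from raw bytes and applies the quality rule above. The struct and function names are hypothetical, and treating a stored quality of 255 as "unknown" is taken from the Detailed record table rather than from anything stated for this record type.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical plain struct holding the fields of a Reference Only record.
struct RefOnlyFields {
    uint8_t refBase;   // 0=A, 1=C, 2=G, 3=T, 4=N
    uint8_t numBases;  // 1-255
    uint8_t glh;       // 0-255
    uint8_t gla;       // 0-255
};

// Decodes the 4 bytes of a Reference Only record.
RefOnlyFields parseRefOnly(const uint8_t* data, std::size_t size) {
    if (size < 4) {
        throw std::runtime_error("truncated Reference Only record");
    }
    if ((data[0] & 0x0F) != 2) {
        throw std::runtime_error("not a Reference Only record");
    }
    return RefOnlyFields{static_cast<uint8_t>(data[0] >> 4), data[1], data[2], data[3]};
}

// Returns true if a base with this stored Phred quality would be used in the
// genotype likelihood calculation: it must be known (not 255) and at least 13.
bool usedInLikelihood(uint8_t storedQual) {
    return storedQual != 255 && storedQual >= 13;
}
```

A base that fails `usedInLikelihood` still counts toward numBases; it is only excluded from the likelihood calculation.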


Record Type: Detailed

The position associated with a Detailed record is 1 greater than the position of the previous record unless the previous record is a position record.

The upper 4 bits of the first byte contain the reference base.

When a deletion occurs at a specified position, it counts as a mismatch to the reference.

{| border="1"
! Field !! Description !! Type !! Value/Range
|-
| refBase || Reference Base || 4 bits || 0=A, 1=C, 2=G, 3=T, 4=N
|-
| type || Detailed Record Type || 4 bits || 3
|-
| numBases || Number of bases at this position || uint8_t || 1-255
|-
| allBases || 8-bit encoded bases/deletions for this position || uint8_t[numBases] || 0=A, 1=C, 2=G, 3=T, 4=N, 5=D (deletion)
|-
| allQuals || All Phred Quals for this position || uint8_t[numBases] || 0-254; 255 = unknown quality or deletion
|-
| allCycles || 0-based position in the reads for all bases at this position || uint8_t[numBases] || 0-254; 255 for a deletion
|-
| allStrands || Strand for all bases at this position || uint8_t[numBases] || 0 = forward, 1 = reverse
|-
| allMQs || Mapping Qualities for all bases at this position || uint8_t[numBases] || 0-255
|}


If a position has more than 255 bases, only the first 255 are used in this record.


Example

Hex dump of a record: 0x230202031d1d020100012c22

{| border="1"
! Bits !! Hex !! Binary !! Meaning
|-
| 0-7 || 23 || 0010 0011 || RefBase = 2 = G (upper 4 bits), Type = 3 = DETAILED (lower 4 bits)
|-
| 8-15 || 02 || 0000 0010 || NumBases = 2
|-
| 16-23 || 02 || 0000 0010 || Base1 = 2 = G
|-
| 24-31 || 03 || 0000 0011 || Base2 = 3 = T
|-
| 32-39 || 1D || 0001 1101 || Qual1 = 0x1D = 29 = '>'
|-
| 40-47 || 1D || 0001 1101 || Qual2 = 0x1D = 29 = '>'
|-
| 48-55 || 02 || 0000 0010 || Cycle1 = 2
|-
| 56-63 || 01 || 0000 0001 || Cycle2 = 1
|-
| 64-71 || 00 || 0000 0000 || Strand1 = 0 = forward
|-
| 72-79 || 01 || 0000 0001 || Strand2 = 1 = reverse
|-
| 80-87 || 2C || 0010 1100 || MapQual1 = 0x2C = 44
|-
| 88-95 || 22 || 0010 0010 || MapQual2 = 0x22 = 34
|}
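
The same decoding can be written as a short parser. This is a sketch that follows the field order of the Detailed record table (numBases, then all bases, all qualities, all cycles, all strands, all MQs); the struct `DetailedFields` and the function `parseDetailed` are hypothetical and are not the libStatGen implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical plain struct holding the fields of a Detailed record.
struct DetailedFields {
    uint8_t refBase = 0;
    std::vector<uint8_t> bases, quals, cycles, strands, mqs;
};

// Decodes one Detailed record: 2 leading bytes followed by five arrays of
// numBases bytes each.
DetailedFields parseDetailed(const std::vector<uint8_t>& rec) {
    if (rec.size() < 2 || (rec[0] & 0x0F) != 3) {
        throw std::runtime_error("not a Detailed record");
    }
    const std::size_t n = rec[1];  // numBases
    if (rec.size() < 2 + 5 * n) {
        throw std::runtime_error("truncated Detailed record");
    }
    // Copies the k-th array of n bytes that follows the 2 leading bytes.
    auto slice = [&](std::size_t k) {
        const uint8_t* first = rec.data() + 2 + k * n;
        return std::vector<uint8_t>(first, first + n);
    };
    DetailedFields d;
    d.refBase = static_cast<uint8_t>(rec[0] >> 4);
    d.bases = slice(0);
    d.quals = slice(1);
    d.cycles = slice(2);
    d.strands = slice(3);
    d.mqs = slice(4);
    return d;
}
```

Applied to the 12 bytes of the hex dump above, this yields refBase 2, bases {2, 3}, quals {29, 29}, cycles {2, 1}, strands {0, 1}, and MQs {44, 34}.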

API for Reading ASP Files

You will use both an AspFileReader and an AspRecord for reading ASP files.

AspFileReader

An instance of the AspFileReader class is used to read ASP files.

include file

AspFileReader is declared in AspFile.h, so be sure to include that file.

#include "AspFile.h"

Opening the ASP File

open opens the specified file and throws an exception if the file could not be opened.

    // Open the ASP file for reading.
    AspFileReader asp;
    asp.open("aspFileName.asp");

Reading the ASP Record

There are several ways to read records in an ASP file. You can read the next record, or you can advance to a specific position.

See Reading the next record or Advancing to a specific position below.

Reading the next record

Once the file is open, there are two methods to get the next record, getNextRecord and getNextDataRecord.

Both methods return true if a record was successfully found and false on EOF or an error.

Both methods take a reference to an AspRecord as a parameter. When true is returned, the AspRecord is updated with the next record.

The AspRecord set by getNextRecord can be any type of record: Reference Only, Detailed, Empty, or Position Only.

The AspRecord set by getNextDataRecord will only be a Reference Only Record or a Detailed Record. It consumes any Empty Records and Position Records it finds until a Reference Only or Detailed Record is found.

If you only process one record at a time (that is, you are done with a record before reading the next one), you can loop until getNextRecord or getNextDataRecord returns false, reusing the same record for each call.

        AspRecord record;  // Reused for every call.
        while(asp.getNextRecord(record))
        {
            // Your record specific processing here.
        }
        // Done reading the file.

-OR-

        while(asp.getNextDataRecord(record))
        {
            // Your record specific processing here.
        }
        // Done reading the file.


Advancing to the next chromosome

To advance to the next chromosome, use advanceToNextChromosome. It returns true if there was another chromosome, false if not. The chromosome name is returned in the nextChrom string parameter.

The next record read will be on that chromosome.

        std::string nextChrom = "";
        if(asp.advanceToNextChromosome(nextChrom))
        {
            std::cerr << "The next chromosome is " << nextChrom << "\n";
        }
        else
        {
            std::cerr << "No more chromosomes\n";
        }


Advancing to a specific position

Rather than just reading the next record, you can advance to a specific position.

If the position has already been passed, the file is not advanced and a method-specific default value is returned.

If the position is not found in the file, the file is advanced to the first position after the specified position and a method-specific default value is returned.

  • getRecord(const char* chromName, int32_t pos0Based) returns a reference to the record at the specified position. The default if that position is not found or has already been passed is to return a reference to an EMPTY record. The returned record is only valid until the next time a new position is read.
  • getRefOnlyRecord(const char* chromName, int32_t pos0Based) returns a reference to the reference-only record at the specified position. The default if that position is not found, has already been passed, or is not a reference-only record is to return a reference to an EMPTY record. The returned record is only valid until the next time a new position is read.
  • getDetailedRecord(const char* chromName, int32_t pos0Based) returns a reference to the detailed record at the specified position. The default if that position is not found, has already been passed, or is not a detailed record is to return a reference to an EMPTY record. The returned record is only valid until the next time a new position is read.
  • getLikelihood(const char* chromName, int32_t pos0Based, char base1, char base2) returns the likelihood at the specified position of the genotype with the 2 specified bases. The default if that position is not found, has already been passed, or is not a data record is to return a 0.
  • getNumBases(const char* chromName, int32_t pos0Based) returns the number of bases at the specified position. The default if that position is not found, has already been passed, or is not a data record is to return a 0.
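
The forward-only behaviour described above (never moving backwards, and stopping at the first position after a missing one) can be modelled on a sorted list of positions. This is a toy model for a single chromosome, not the AspFileReader implementation; the class `ForwardCursor` and its methods are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical forward-only cursor over the positions present on one
// chromosome, given in increasing order.
class ForwardCursor {
public:
    explicit ForwardCursor(std::vector<int32_t> sortedPositions)
        : positions_(std::move(sortedPositions)) {}

    // Moves forward until the current position is at or beyond 'target'.
    // Returns true only if 'target' itself is present. If 'target' has already
    // been passed, the loop body never runs, so the cursor does not move.
    bool advanceTo(int32_t target) {
        while (index_ < positions_.size() && positions_[index_] < target) {
            ++index_;
        }
        return index_ < positions_.size() && positions_[index_] == target;
    }

    bool atEnd() const { return index_ >= positions_.size(); }

    // Current position, or -1 once the end has been reached.
    int32_t current() const { return atEnd() ? -1 : positions_[index_]; }

private:
    std::vector<int32_t> positions_;
    std::size_t index_ = 0;
};
```

With positions 10, 11, and 15, advancing to 11 succeeds, a later request for 10 fails without moving the cursor, and a request for 13 fails and leaves the cursor at 15.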

Determining if EOF reached

isEof() returns true if the end of file has been reached, false if not.

    if(asp.isEof())
    {
        // Your end of file processing goes here.
    }

Closing the ASP File

close closes the opened ASP file.

    asp.close();

AspRecord

ASP records are read from ASP files using getNextRecord and getNextDataRecord.

Once you have the record, use methods from the AspRecord to extract the desired information.

Determining the type of record

To determine which type of record is in your AspRecord object, use isEmptyType(), isPosType(), isRefOnlyType(), and isDetailedType(). These methods return true if the record is of that type and false if it is a different type.

Retrieve the Chromosome/Position

Regardless of the record type, getChromID() returns the chromosome ID associated with this record.

Regardless of the record type, getPosition() returns the 0-based position associated with this record.

Reference Only Record Methods

Reference Only records contain the number of bases at this position and the GLH & GLA.

  • getNumBases() returns the number of bases at this position
  • getRefBase() returns the reference base of this position
  • getGLH() returns the GLH of this position
  • getGLA() returns the GLA of this position
  • getLikelihood(char base1, char base2) returns the likelihood of this position for the specified genotype.
    • When both bases match the reference base, 0 is returned.
    • If only one of the bases matches the reference base, the GLH is returned.
    • If neither base matches the reference base, the GLA is returned.
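
The three rules above amount to a small decision function. The sketch below is only an illustration of those rules; the function name is hypothetical, and the uint8_t return type is an assumption based on the stored type of GLH and GLA, since the return type of getLikelihood is not given on this page.

```cpp
#include <cstdint>

// Returns 0, GLH, or GLA for the genotype (base1, base2) at a position whose
// reference base is 'refBase', following the three rules listed above.
uint8_t refOnlyLikelihood(char refBase, uint8_t glh, uint8_t gla,
                          char base1, char base2) {
    const bool match1 = (base1 == refBase);
    const bool match2 = (base2 == refBase);
    if (match1 && match2) {
        return 0;    // both bases match the reference
    }
    if (match1 || match2) {
        return glh;  // exactly one base matches the reference
    }
    return gla;      // neither base matches the reference
}
```

For a reference base of G with GLH 40 and GLA 80, the genotype G/G gives 0, G/T or T/G gives 40, and A/T gives 80.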

Detailed Record Methods

Detailed records contain the number of bases at this position and all of the bases, qualities, cycles, strands, and MQs.

  • getNumBases() returns the number of bases at this position
  • getRefBase() returns the reference base of this position
  • getLikelihood(char base1, char base2) returns the likelihood of this position for the specified genotype.
  • getBaseChar(int index) returns the base as a character (A,C,T,G,D,N) in the record at the specified index. The index starts at 0 and goes to numBases - 1. An out of range index returns 'N'. 'D' represents a deletion at this position.
  • getPhredQual(int index) returns the phred quality at the specified index. The index starts at 0 and goes to numBases - 1. An out of range index returns -1. An unknown quality and the quality for a deletion is -1.
  • getCharQual(int index) returns the quality represented as a character (phred+33) at the specified index. The index starts at 0 and goes to numBases - 1. An out of range index returns ' '. An unknown quality and the quality for a deletion is ' '.
  • getCycle(int index) returns the cycle of the base at this index in its original read. The index starts at 0 and goes to numBases - 1. An out of range index returns -2. The cycle for a deletion is -1.
  • getStrand(int index) returns the strand of the read at this index. The index starts at 0 and goes to numBases - 1. An out of range index returns false.
  • getMQ(int index) returns the Mapping Quality of the read at this index. The index starts at 0 and goes to numBases - 1. An out of range index returns -1.
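
The relationship between the stored bytes of a Detailed record and the values these accessors return can be sketched as below. The function names are hypothetical and only mirror the conversions described above (stored 255 becoming -1 or ' ', and phred+33 for the character quality); out-of-range index handling is left out because it depends on the record object.

```cpp
#include <cstdint>

// Stored base code (0=A, 1=C, 2=G, 3=T, 4=N, 5=D) to a character.
// Any other code is mapped to 'N' here as a defensive choice.
char baseCodeToChar(uint8_t code) {
    static const char kBases[] = {'A', 'C', 'G', 'T', 'N', 'D'};
    return code < 6 ? kBases[code] : 'N';
}

// Stored quality to Phred quality: 255 (unknown or deletion) becomes -1.
int storedQualToPhred(uint8_t storedQual) {
    return storedQual == 255 ? -1 : storedQual;
}

// Stored quality to its character form (phred+33): 255 becomes ' '.
// Values above 93 fall outside printable ASCII; how the library treats them
// is not described on this page.
char storedQualToChar(uint8_t storedQual) {
    return storedQual == 255 ? ' ' : static_cast<char>(storedQual + 33);
}

// Stored cycle to cycle: 255 (deletion) becomes -1.
int storedCycleToCycle(uint8_t storedCycle) {
    return storedCycle == 255 ? -1 : storedCycle;
}
```

For the example record above, the stored quality 0x1D (29) maps to the character '>'.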