Difference between revisions of "LibStatGen: ASP"

From Genome Analysis Wiki

Revision as of 18:28, 25 January 2012


Asymmetric Pileup (ASP)

Asymmetric Pileup (ASP) is a new pileup file format that we created to replace GLF.

ASP File Format

ASP files are binary files consisting of 4 types of records. Every record starts with a 1-byte field that indicates the type.

Type Value | Record Type    | Description
0          | Empty          | indicates there are no bases for this position.
1          | Position Only  | specifies the chromosome id/0-based position of the next record (other records do not contain a position).
2          | Reference Only | indicates that all bases at this position match the reference and provides the number of bases, GLH, and GLA.
3          | Detailed       | indicates that not all bases at this position match the reference and provides the number of bases, the bases, the qualities, the cycles, the strands, and the MQs.

Only position records contain a chromosome id/position. The record after a position record has the chromosome id & position specified in the position record. All other records are assumed to just increment one position from the previous record.

The first record in a file must be a Position Only Record.
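
The position rules above can be sketched as a small generator. This is only an illustration, not the libStatGen implementation: the tuple layout used for `records` and the name `assign_positions` are hypothetical and chosen for the example.

```python
# Record type codes from the table above.
EMPTY, POSITION_ONLY, REFERENCE_ONLY, DETAILED = 0, 1, 2, 3


def assign_positions(records):
    """Yield (chrom_id, pos, record_type) for every non-position record.

    `records` is a hypothetical iterable of tuples:
    (POSITION_ONLY, chrom_id, pos) for position records and
    (record_type,) for all other records.
    """
    chrom_id = None
    next_pos = None
    for rec in records:
        rec_type = rec[0]
        if rec_type == POSITION_ONLY:
            # A position record sets the location of the record that follows it.
            _, chrom_id, next_pos = rec
            continue
        if next_pos is None:
            raise ValueError("the first record must be a Position Only record")
        yield chrom_id, next_pos, rec_type
        next_pos += 1  # every other record advances by one position


records = [
    (POSITION_ONLY, 0, 100),
    (REFERENCE_ONLY,),
    (EMPTY,),
    (DETAILED,),
    (POSITION_ONLY, 1, 5),
    (DETAILED,),
]
located = list(assign_positions(records))
```

Here the three records after the first position record land on positions 100, 101 and 102 of chromosome id 0, and the last record lands on position 5 of chromosome id 1.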


Record Type: Empty

An empty record is only 1 byte and just contains the type field.

Since records other than Position Only records do not carry a position of their own, their position is determined by adding one to the position of the previous record.

When positions have no bases, there are two ways to deal with them.

  1. Write a new position record for the next position that has bases.
  2. Write empty records to indicate those positions have no bases.

If positions that have bases are not far apart, it is preferable to write empty records rather than a new position record since Empty records should compress well and are only 1 byte.
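
A minimal sketch of how a writer might choose between the two options. The function name `encode_gap` and the `max_empty_run` threshold are hypothetical; the default of 9 is simply the uncompressed size of a Position Only record (1-byte type plus two int32_t fields), and a larger threshold may make sense once compression is taken into account. Little-endian byte order is an assumption, since this page does not state one.

```python
import struct

EMPTY, POSITION_ONLY = 0, 1
POSITION_RECORD_SIZE = 1 + 4 + 4  # type + chromID + pos, in bytes


def encode_gap(chrom_id, next_pos, gap, max_empty_run=POSITION_RECORD_SIZE):
    """Return the bytes that skip `gap` positions without bases.

    `next_pos` is the 0-based position of the next record that has bases.
    """
    if gap <= max_empty_run:
        # Short gap: one 1-byte Empty record per skipped position.
        return bytes([EMPTY]) * gap
    # Long gap: a single Position Only record (byte order is assumed).
    return struct.pack("<Bii", POSITION_ONLY, chrom_id, next_pos)


short_gap = encode_gap(chrom_id=0, next_pos=103, gap=3)
long_gap = encode_gap(chrom_id=0, next_pos=1000, gap=500)
```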


Field | Description       | Type    | Value
type  | Empty Record Type | uint8_t | 0


Record Type: Position Only

Position Only records are used to specify the chromosome ID & 0-based position of the following record. The first record in the file must be a Position Only record.

Field   | Description                         | Type    | Value
type    | Position Only Record Type           | uint8_t | 1
chromID | Chromosome ID of the next record    | int32_t |
pos     | 0-based position of the next record | int32_t |
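
As an illustration of the layout above, a Position Only record could be read with Python's `struct` module as follows. The helper name `read_position_record` is hypothetical, and little-endian byte order with no padding is an assumption, since this page does not state a byte order.

```python
import struct

POSITION_ONLY = 1
# "<" means little-endian with no padding (assumed, not stated on this page).
# B = uint8_t type, i = int32_t chromID, i = int32_t pos.
POSITION_FORMAT = struct.Struct("<Bii")


def read_position_record(buf, offset=0):
    """Return (chrom_id, pos, next_offset) for the record at `offset`."""
    rec_type, chrom_id, pos = POSITION_FORMAT.unpack_from(buf, offset)
    if rec_type != POSITION_ONLY:
        raise ValueError(f"expected record type 1, found {rec_type}")
    return chrom_id, pos, offset + POSITION_FORMAT.size


# type = 1, chromID = 20, pos = 1000 (0x000003E8)
raw = bytes([0x01, 0x14, 0x00, 0x00, 0x00, 0xE8, 0x03, 0x00, 0x00])
chrom_id, pos, next_offset = read_position_record(raw)
```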


Record Type: Reference Only

The position associated with a Reference Only record is 1 greater than the position of the previous record, unless the previous record is a Position Only record, in which case it is the position given in that record.

Field    | Description                      | Type    | Value/Range
type     | Reference Only Record Type       | uint8_t | 2
numBases | Number of bases at this position | uint8_t | 1-255
GLH      | Genotype Likelihood H            | uint8_t | 0-255
GLA      | Genotype Likelihood Alternate    | uint8_t | 0-255

If a position has more than 255 bases, only the first 255 are used for calculating GLH and GLA.

Phred Qualities that are unknown or less than 13 are not used in the Genotype Likelihood calculation, but are counted in the numBases if there is a base at the position.
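
The two rules above can be sketched as follows. The function names are hypothetical, and treating 255 as the "unknown" quality is borrowed from the allQuals field of the Detailed record, which is an assumption in this context. How GLH and GLA are computed is not described on this page, so they are taken as inputs.

```python
REFERENCE_ONLY = 2
MAX_BASES = 255
UNKNOWN_QUAL = 255  # assumed marker, taken from the Detailed record's allQuals
MIN_GL_QUAL = 13


def quals_for_likelihood(quals):
    """Return the qualities that take part in the GLH/GLA calculation."""
    used = quals[:MAX_BASES]  # only the first 255 bases are considered
    return [q for q in used if q != UNKNOWN_QUAL and q >= MIN_GL_QUAL]


def reference_only_record(quals, glh, gla):
    """Build the 4-byte record; glh and gla are computed elsewhere."""
    if not quals:
        raise ValueError("no bases: an Empty record would be written instead")
    # Low-quality and unknown-quality bases still count towards numBases.
    num_bases = min(len(quals), MAX_BASES)
    return bytes([REFERENCE_ONLY, num_bases, glh, gla])


quals = [30, 12, 255, 13]
usable = quals_for_likelihood(quals)
record = reference_only_record(quals, glh=7, gla=90)
```

In this example only the qualities 30 and 13 would feed the likelihood calculation, while numBases is still 4.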


Record Type: Detailed

The position associated with a Detailed record is 1 greater than the position of the previous record, unless the previous record is a Position Only record, in which case it is the position given in that record.


Field      | Description | Type | Value/Range
type       | Detailed Record Type | uint8_t | 3
numBases   | Number of bases at this position | uint8_t | 1-255
allBases   | 4-bit encoded bases/deletions for this position; the 1st base is in the upper bits of the 1st byte; if there is an odd number of bases, the lower bits of the last byte are 0 | uint8_t[(numBases+1)/2] | 0=A, 1=C, 2=G, 3=T, 4=N, 5=D (deletion)
allQuals   | All Phred Quals for this position | uint8_t[numBases] | 0-254; 255 = unknown quality
allCycles  | 0-based position in the reads for all bases at this position | uint8_t[numBases] | 0-255
allStrands | Strand for all bases at this position; the strand of the 1st base is in the uppermost bit of the first byte; if numBases is not a multiple of 8, the extra lower bits are set to 0 | uint8_t[(numBases+7)/8] | 0 = forward, 1 = reverse
allMQs     | Mapping Qualities for all bases at this position | uint8_t[numBases] | 0-255


If a position has more than 255 bases, only the first 255 are used in this record.
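
The bit packing used by the bases and strands fields can be sketched as below. The helper names `pack_bases` and `pack_strands` are hypothetical. The sketch assumes that strands after the first fill successively lower bits of each byte, which matches the worked example on this page.

```python
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4, "D": 5}


def pack_bases(bases):
    """Pack base letters two per byte, first base in the upper 4 bits."""
    out = bytearray((len(bases) + 1) // 2)
    for i, base in enumerate(bases):
        shift = 4 if i % 2 == 0 else 0
        out[i // 2] |= BASE_CODES[base] << shift
    return bytes(out)  # unused lower bits of the last byte stay 0


def pack_strands(reverse_flags):
    """Pack strands eight per byte, first strand in the uppermost bit."""
    out = bytearray((len(reverse_flags) + 7) // 8)
    for i, is_reverse in enumerate(reverse_flags):
        if is_reverse:  # 1 = reverse, 0 = forward
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)  # unused lower bits stay 0


packed_bases = pack_bases("GT")
packed_odd = pack_bases("GTA")
packed_strands = pack_strands([False, True])
```

With these inputs, `pack_bases("GT")` gives the single byte 0x23 and `pack_strands([False, True])` gives 0x40.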


Example

Hex dump of a record: 0x0302231d1d0201402c22

Byte | Bits  | Hex | Binary   | Field
0    | 0-7   | 03  | 00000011 | Type = DETAILED
1    | 8-15  | 02  | 00000010 | NumBases = 2
2    | 16-23 | 23  | 00100011 | Base1 = 2 = G, Base2 = 3 = T
3    | 24-31 | 1D  | 00011101 | Qual1 = 0x1D = 29 = '>'
4    | 32-39 | 1D  | 00011101 | Qual2 = 0x1D = 29 = '>'
5    | 40-47 | 02  | 00000010 | Cycle1 = 2
6    | 48-55 | 01  | 00000001 | Cycle2 = 1
7    | 56-63 | 40  | 01000000 | Strand1 (bit 56) = 0 = forward, Strand2 (bit 57) = 1 = reverse, extra bits are dummy bits = 0
8    | 64-71 | 2C  | 00101100 | MapQual1 = 0x2C = 44
9    | 72-79 | 22  | 00100010 | MapQual2 = 0x22 = 34
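
The example record can be decoded with a short sketch like the one below. The function `decode_detailed` and the dictionary keys it returns are hypothetical names, not part of libStatGen, and the sketch does no validation beyond the type byte (a base code above 5 would raise an IndexError).

```python
BASES = "ACGTND"  # index = 4-bit base code
DETAILED = 3


def decode_detailed(buf):
    """Decode one Detailed record that starts at the beginning of `buf`."""
    if buf[0] != DETAILED:
        raise ValueError("not a Detailed record")
    n = buf[1]
    offset = 2
    base_bytes = buf[offset:offset + (n + 1) // 2]
    offset += (n + 1) // 2
    quals = list(buf[offset:offset + n])
    offset += n
    cycles = list(buf[offset:offset + n])
    offset += n
    strand_bytes = buf[offset:offset + (n + 7) // 8]
    offset += (n + 7) // 8
    mapping_quals = list(buf[offset:offset + n])
    offset += n
    # Even-indexed bases sit in the upper 4 bits, odd-indexed in the lower 4.
    bases = "".join(
        BASES[(base_bytes[i // 2] >> (4 if i % 2 == 0 else 0)) & 0x0F]
        for i in range(n)
    )
    # The first strand is the uppermost bit of the first strand byte.
    strands = [
        "reverse" if strand_bytes[i // 8] & (0x80 >> (i % 8)) else "forward"
        for i in range(n)
    ]
    return {
        "bases": bases,
        "quals": quals,
        "cycles": cycles,
        "strands": strands,
        "mapping_quals": mapping_quals,
        "size": offset,  # number of bytes consumed by this record
    }


decoded = decode_detailed(bytes.fromhex("0302231d1d0201402c22"))
```

For the hex dump above this yields bases "GT", qualities 29 and 29, cycles 2 and 1, strands forward and reverse, and mapping qualities 44 and 34, consuming all 10 bytes.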