Difference between revisions of "SAM Validation Criteria"

From Genome Analysis Wiki
 
(29 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
'''NOTE: Not all validation criteria have been listed here, and not all of those listed have been implemented. (Implemented checks are marked green.)'''
 +
 
=== SAM Header Validation Rules ===
 
=== SAM Header Validation Rules ===
 
TODO
 
TODO
 +
{| class="wikitable" style="width:100%" border="1"
 +
|+ style="font-size:150%"|'''SAM Header'''
 +
! rowspan='2' width="60%"|Validation Criteria
 +
! colspan="2" width="20%"|Implemented
 +
! colspan="2" width="20%"|Tested
 +
|-
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
|-
 +
| All Required Fields are set
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| If HD line is there, VN is also there.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| HD/VN is not in valid format /^[0-9]+\.[0-9]+$/
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| HD/SO is a valid value (unsorted, queryname, coordinate)
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| SQ/SN all SQ lines have a unique SN field
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| SQ/LN is in the range [1, (2^29) -1]
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| SQ/LN is not a number
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| RG/ID all RG lines have a unique ID field
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| RG/PL is a valid value (ILLUMINA, SOLID, LS454, HELICOS, PACBIO)
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| Header has X lines or fewer (or a maximum number of SQ lines); a file with an excessive number of header lines once caused a problem
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|}
  
 
=== SAM Alignment Validation ===
 
=== SAM Alignment Validation ===
 
{| class="wikitable" style="width:100%" border="1"
 
{| class="wikitable" style="width:100%" border="1"
 
|+ style="font-size:150%"|'''SAM Alignment Record'''
 
|+ style="font-size:150%"|'''SAM Alignment Record'''
! width="70%"|Validation Criteria
+
! rowspan='2' width="60%"|Validation Criteria
! width="15%"|Implemented
+
! colspan="2" width="20%"|Implemented
! width="15%"|Tested
+
! colspan="2" width="20%"|Tested
 +
|-
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 
|-
 
|-
 
| QNAME.Length() > 0 and <= 254
 
| QNAME.Length() > 0 and <= 254
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
|style="background-color:red;"| 
 
 
|-
 
|-
| QNAME does not contain [ \t\n\r]
+
| QNAME is valid: [!-?A-~] (printable characters minus space and '@') '''This is a new regular expression'''
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| FLAG is an integer [0-9]+
 
| FLAG is an integer [0-9]+
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|-
 +
| FLAG is [0, (2^16)-1]
 +
|style="background-color:green;"| Parse Error since it will be written into a 16 bit field.
 +
|style="background-color:grey;"| N/A: only a 16 bit field
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: only a 16 bit field
 +
|-
 +
| RNAME does not contain [ \t\n\r@=]
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|-
 +
| RNAME is found in an SQ header record if there are any SQs in the header.
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| FLAG < 2048 (I think) or [0, (2^16)-1]
+
| Reference Name length does not match specified length.
 +
|style="background-color:grey;"| N/A: reference name length is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:grey;"| N/A: reference name length is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| RNAME does not contain [ \t\n\r@=]
+
| Reference ID is in range of the number of references
 +
|style="background-color:grey;"| N/A: rID is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:grey;"| N/A: rID is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| POS is an integer [0-9]+
 
| POS is an integer [0-9]+
|style="background-color:red;"|
+
|style="background-color:green;"|
|style="background-color:red;"|
+
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 
|-
 
|-
 
| POS is [0, (2^29)-1]
 
| POS is [0, (2^29)-1]
 +
|style="background-color:green;"| Parse Error if it can't fit in the 32 bit field, other out of range is a validation error.
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| MAPQ is an integer [0-9]+
 
| MAPQ is an integer [0-9]+
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|-
 +
| MAPQ is [0, (2^8)-1]
 +
|style="background-color:green;"| Parse Error since it will be written into an 8 bit field.
 +
|style="background-color:grey;"| N/A: only an 8 bit field
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: only an 8 bit field
 +
|-
 +
| <nowiki>CIGAR ([0-9]+[MIDNSHP])+|\*</nowiki>
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
|-
 
| MAPQ is [0, (2^8)-1]
 
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| <nowiki>CIGAR ([0-9]+[MIDNSHP])+|\*</nowiki>
+
| CIGAR string matches the length of SEQ if both are not "*"
|style="background-color:red;"|
+
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME)
 
| MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME)
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| If SQ is in the header RNAME & MRNM (if not “=”) must be in SQ.
 
| If SQ is in the header RNAME & MRNM (if not “=”) must be in SQ.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| MPOS is an integer [0-9]+
 
| MPOS is an integer [0-9]+
 +
|style="background-color:green;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| MPOS is [0, (2^29)-1]
 
| MPOS is [0, (2^29)-1]
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| ISIZE is an integer -?[0-9]+
 
| ISIZE is an integer -?[0-9]+
 +
|style="background-color:green;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| ISIZE is [-(2^29), 2^29]
 
| ISIZE is [-(2^29), 2^29]
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| <nowiki>SEQ is [acgtnACGTN.=]+|\*</nowiki>
 
| <nowiki>SEQ is [acgtnACGTN.=]+|\*</nowiki>
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| If SEQ is * then QUAL is *
 
| If SEQ is * then QUAL is *
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| <nowiki>QUAL is [!-~]+|* → dec 33 – 126 or dec 42 (which is in 33-126) (for BAM, it is between [0,93])</nowiki>
 
| <nowiki>QUAL is [!-~]+|* → dec 33 – 126 or dec 42 (which is in 33-126) (for BAM, it is between [0,93])</nowiki>
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| If QUAL is not “*” it is the same length as SEQ.
+
| If QUAL and SEQ are not “*” they are the same length.
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| TAG is [A-Z][A-Z0-9]
 
| TAG is [A-Z][A-Z0-9]
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| A TAG only appears once per alignment
 
| A TAG only appears once per alignment
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| VTYPE is [AifZH] for SAM and [AcCsSiIfZH]
+
| VTYPE is [AifZH] for SAM and [AcCsSiIfZH] for BAM
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| VALUE does NOT contain [\t\n\r]
 
| VALUE does NOT contain [\t\n\r]
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| For VTYPE = “A”, VALUE is a printable character
 
| For VTYPE = “A”, VALUE is a printable character
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| For VTYPE = “i”, VALUE is a signed 32-bit integer.
 
| For VTYPE = “i”, VALUE is a signed 32-bit integer.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| For VTYPE = “f”, VALUE is a single-precision float.
 
| For VTYPE = “f”, VALUE is a single-precision float.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| For VTYPE = “Z”, VALUE is a printable string.
 
| For VTYPE = “Z”, VALUE is a printable string.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
Line 124: Line 282:
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| For TAG = E2, length should be the same as the Read Length
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| For TAG = E2, each base should be different than the read Base (unless 'N')
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| For TAG = U2, length should be the same as the Read Length
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 
|}
 
|}
  
 
NOTE: There are other TAG Validations that can be done.  They will come later.
 
NOTE: There are other TAG Validations that can be done.  They will come later.
  
NOTE: There are other BAM Validations that can be done.  They will come later.
+
NOTE: There may be other BAM Validations that can be done.  They will come later.
 +
 
 +
Consider validating the CIGAR string against the read length.
 +
 
 +
== Other Read Validation ==
 +
 
 +
{| class="wikitable" style="width:100%" border="1"
 +
|+ style="font-size:150%"|'''SAM Alignment Record'''
 +
! rowspan='2' width="60%"|Validation Criteria
 +
! colspan="2" width="20%"|Implemented
 +
! colspan="2" width="20%"|Tested
 +
|-
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
|-
 +
| If requested, records follow the sort order (taken from the HD/SO field, or specified by the user as coordinate or query name).
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|}
 +
 
  
 
===SAM Questions===
 
===SAM Questions===
Line 134: Line 336:
 
**Same question for MRNM = “*” and MPOS & ISIZE = 0
 
**Same question for MRNM = “*” and MPOS & ISIZE = 0
 
*Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.”  - Is there anything here that needs to be validated???
 
*Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.”  - Is there anything here that needs to be validated???
 +
 +
== BamFile Classes ==
 +
[[C++ Library: libbam|BamFile]]

Latest revision as of 15:51, 27 August 2010

NOTE: Not all validation criteria have been listed here, and not all of those listed have been implemented. (Implemented checks are marked green.)

SAM Header Validation Rules

TODO

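{| class="wikitable" style="width:100%" border="1"
|+ style="font-size:150%"|'''SAM Header'''
! rowspan='2' width="60%"|Validation Criteria
! colspan="2" width="20%"|Implemented
! colspan="2" width="20%"|Tested
|-
! width="10%"|SAM
! width="10%"|BAM
! width="10%"|SAM
! width="10%"|BAM
|-
| All Required Fields are set
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| If HD line is there, VN is also there.
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| HD/VN is not in valid format /^[0-9]+\.[0-9]+$/
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| HD/SO is a valid value (unsorted, queryname, coordinate)
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| SQ/SN all SQ lines have a unique SN field
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| SQ/LN is in the range [1, (2^29) -1]
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| SQ/LN is not a number
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| RG/ID all RG lines have a unique ID field
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| RG/PL is a valid value (ILLUMINA, SOLID, LS454, HELICOS, PACBIO)
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| Header has X lines or fewer (or a maximum number of SQ lines); a file with an excessive number of header lines once caused a problem
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|}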
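As an illustration only, several of the header criteria above could be checked along the following lines. This is a minimal Python sketch, not the library's implementation: the function names `parse_header_line` and `validate_header` and the wording of the messages are hypothetical, and the header is assumed to arrive as a list of tab-separated `@XX` lines.

```python
import re

VN_RE = re.compile(r"[0-9]+\.[0-9]+")
VALID_SO = {"unsorted", "queryname", "coordinate"}
VALID_PL = {"ILLUMINA", "SOLID", "LS454", "HELICOS", "PACBIO"}
MAX_LN = 2**29 - 1


def parse_header_line(line):
    """Split '@SQ<TAB>SN:chr1<TAB>LN:100' into ('SQ', {'SN': 'chr1', 'LN': '100'})."""
    parts = line.rstrip("\r\n").split("\t")
    fields = dict(p.split(":", 1) for p in parts[1:] if ":" in p)
    return parts[0].lstrip("@"), fields


def validate_header(lines):
    """Return a list of messages, one per violated criterion (hypothetical helper)."""
    problems = []
    seen_sn, seen_rg = set(), set()
    for line in lines:
        rtype, f = parse_header_line(line)
        if rtype == "HD":
            if "VN" not in f:
                problems.append("HD line has no VN")
            elif not VN_RE.fullmatch(f["VN"]):
                problems.append("HD/VN has an invalid format: " + f["VN"])
            if "SO" in f and f["SO"] not in VALID_SO:
                problems.append("HD/SO is not a valid value: " + f["SO"])
        elif rtype == "SQ":
            sn = f.get("SN")
            if sn in seen_sn:
                problems.append("duplicate SQ/SN: %s" % sn)
            seen_sn.add(sn)
            ln = f.get("LN", "")
            if not re.fullmatch(r"[0-9]+", ln):
                problems.append("SQ/LN is not a number: %r" % ln)
            elif not 1 <= int(ln) <= MAX_LN:
                problems.append("SQ/LN is out of range: " + ln)
        elif rtype == "RG":
            rg_id = f.get("ID")
            if rg_id in seen_rg:
                problems.append("duplicate RG/ID: %s" % rg_id)
            seen_rg.add(rg_id)
            if "PL" in f and f["PL"] not in VALID_PL:
                problems.append("RG/PL is not a valid value: " + f["PL"])
    return problems
```

Returning a list of messages rather than stopping at the first failure is a design choice of this sketch; it lets a caller report every violated criterion for a header in one pass.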

SAM Alignment Validation

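{| class="wikitable" style="width:100%" border="1"
|+ style="font-size:150%"|'''SAM Alignment Record'''
! rowspan='2' width="60%"|Validation Criteria
! colspan="2" width="20%"|Implemented
! colspan="2" width="20%"|Tested
|-
! width="10%"|SAM
! width="10%"|BAM
! width="10%"|SAM
! width="10%"|BAM
|-
| QNAME.Length() > 0 and <= 254
|style="background-color:green;"|
|style="background-color:green;"|
|style="background-color:green;"|
|style="background-color:red;"|
|-
| QNAME is valid: [!-?A-~] (printable characters minus space and '@') '''This is a new regular expression'''
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| FLAG is an integer [0-9]+
|style="background-color:green;"|
|style="background-color:grey;"| N/A: just interpret the bits as an int.
|style="background-color:green;"|
|style="background-color:grey;"| N/A: just interpret the bits as an int.
|-
| FLAG is [0, (2^16)-1]
|style="background-color:green;"| Parse Error since it will be written into a 16 bit field.
|style="background-color:grey;"| N/A: only a 16 bit field
|style="background-color:green;"|
|style="background-color:grey;"| N/A: only a 16 bit field
|-
| RNAME does not contain [ \t\n\r@=]
|style="background-color:green;"|
|style="background-color:green;"|
|style="background-color:green;"|
|style="background-color:red;"|
|-
| RNAME is found in an SQ header record if there are any SQs in the header.
|style="background-color:green;"|
|style="background-color:green;"|
|style="background-color:green;"|
|style="background-color:red;"|
|-
| Reference Name length does not match specified length.
|style="background-color:grey;"| N/A: reference name length is in BAM format only
|style="background-color:red;"|
|style="background-color:grey;"| N/A: reference name length is in BAM format only
|style="background-color:red;"|
|-
| Reference ID is in range of the number of references
|style="background-color:grey;"| N/A: rID is in BAM format only
|style="background-color:red;"|
|style="background-color:grey;"| N/A: rID is in BAM format only
|style="background-color:red;"|
|-
| POS is an integer [0-9]+
|style="background-color:green;"|
|style="background-color:grey;"| N/A: just interpret the bits as an int.
|style="background-color:green;"|
|style="background-color:grey;"| N/A: just interpret the bits as an int.
|-
| POS is [0, (2^29)-1]
|style="background-color:green;"| Parse Error if it can't fit in the 32 bit field, other out of range is a validation error.
|style="background-color:red;"|
|style="background-color:green;"|
|style="background-color:red;"|
|-
| MAPQ is an integer [0-9]+
|style="background-color:green;"|
|style="background-color:grey;"| N/A: just interpret the bits as an int.
|style="background-color:green;"|
|style="background-color:grey;"| N/A: just interpret the bits as an int.
|-
| MAPQ is [0, (2^8)-1]
|style="background-color:green;"| Parse Error since it will be written into an 8 bit field.
|style="background-color:grey;"| N/A: only an 8 bit field
|style="background-color:green;"|
|style="background-color:grey;"| N/A: only an 8 bit field
|-
| <nowiki>CIGAR ([0-9]+[MIDNSHP])+|\*</nowiki>
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| CIGAR string matches the length of SEQ if both are not "*"
|style="background-color:green;"|
|style="background-color:green;"|
|style="background-color:green;"|
|style="background-color:red;"|
|-
| MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME)
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| If SQ is in the header RNAME & MRNM (if not “=”) must be in SQ.
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| MPOS is an integer [0-9]+
|style="background-color:green;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| MPOS is [0, (2^29)-1]
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| ISIZE is an integer -?[0-9]+
|style="background-color:green;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| ISIZE is [-(2^29), 2^29]
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| <nowiki>SEQ is [acgtnACGTN.=]+|\*</nowiki>
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| If SEQ is * then QUAL is *
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| <nowiki>QUAL is [!-~]+|* → dec 33 – 126 or dec 42 (which is in 33-126) (for BAM, it is between [0,93])</nowiki>
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| If QUAL and SEQ are not “*” they are the same length.
|style="background-color:green;"|
|style="background-color:red;"|
|style="background-color:green;"|
|style="background-color:red;"|
|-
| TAG is [A-Z][A-Z0-9]
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| A TAG only appears once per alignment
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| VTYPE is [AifZH] for SAM and [AcCsSiIfZH] for BAM
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| VALUE does NOT contain [\t\n\r]
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| For VTYPE = “A”, VALUE is a printable character
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| For VTYPE = “i”, VALUE is a signed 32-bit integer.
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| For VTYPE = “f”, VALUE is a single-precision float.
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| For VTYPE = “Z”, VALUE is a printable string.
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| For VTYPE = “H”, VALUE is a Hex string.
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| For TAG = E2, length should be the same as the Read Length
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| For TAG = E2, each base should be different than the read Base (unless 'N')
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|-
| For TAG = U2, length should be the same as the Read Length
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|style="background-color:red;"|
|}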
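To make the criteria for the eleven mandatory fields concrete, here is a hedged Python sketch that applies the regular expressions and ranges listed above to one tab-separated SAM line. The names `int_in_range`, `name_ok` and `validate_record` are hypothetical. Two behaviours are assumptions of the sketch and not stated on this page: RNAME or MRNM equal to "*" is accepted without an SQ lookup, and the result is a list of the field names that failed.

```python
import re

QNAME_RE = re.compile(r"[!-?A-~]{1,254}")      # printable, no space, no '@'
CIGAR_RE = re.compile(r"([0-9]+[MIDNSHP])+|\*")
SEQ_RE = re.compile(r"[acgtnACGTN.=]+|\*")
QUAL_RE = re.compile(r"[!-~]+")                # '*' (dec 42) is inside this range


def int_in_range(text, lo, hi, signed=False):
    """True if text is a decimal integer within [lo, hi]."""
    pattern = r"-?[0-9]+" if signed else r"[0-9]+"
    return re.fullmatch(pattern, text) is not None and lo <= int(text) <= hi


def name_ok(name, forbidden, sq_names):
    """Check a reference name against forbidden characters and the SQ names."""
    if not name or any(c in forbidden for c in name):
        return False
    if name in ("*", "="):  # assumption: placeholders skip the SQ lookup
        return True
    return not sq_names or name in sq_names


def validate_record(line, sq_names=frozenset()):
    """Return the names of the mandatory fields that fail a criterion."""
    f = line.rstrip("\r\n").split("\t")
    if len(f) < 11:
        return ["FIELD_COUNT"]
    qname, flag, rname, pos, mapq, cigar, mrnm, mpos, isize, seq, qual = f[:11]
    qual_ok = (QUAL_RE.fullmatch(qual) is not None
               and (seq != "*" or qual == "*")
               and (seq == "*" or qual == "*" or len(seq) == len(qual)))
    checks = [
        ("QNAME", QNAME_RE.fullmatch(qname) is not None),
        ("FLAG", int_in_range(flag, 0, 2**16 - 1)),
        ("RNAME", name_ok(rname, " \t\n\r@=", sq_names)),
        ("POS", int_in_range(pos, 0, 2**29 - 1)),
        ("MAPQ", int_in_range(mapq, 0, 2**8 - 1)),
        ("CIGAR", CIGAR_RE.fullmatch(cigar) is not None),
        ("MRNM", name_ok(mrnm, " \t\n\r@", sq_names)),
        ("MPOS", int_in_range(mpos, 0, 2**29 - 1)),
        ("ISIZE", int_in_range(isize, -(2**29), 2**29, signed=True)),
        ("SEQ", SEQ_RE.fullmatch(seq) is not None),
        ("QUAL", qual_ok),
    ]
    return [name for name, ok in checks if not ok]
```

The sketch does not distinguish parse errors from validation errors, which the table above does for FLAG, POS and MAPQ; a fuller version would need to report the two separately.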

NOTE: There are other TAG Validations that can be done. They will come later.
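The TAG, VTYPE and VALUE criteria from the table could be sketched as follows for SAM text fields of the form `TAG:VTYPE:VALUE`. The helper names `value_ok` and `validate_tags` are hypothetical. The sketch makes three assumptions: only the SAM type set [AifZH] is checked, a "Z" value may contain spaces, and the E2 comparison is case-insensitive and skipped wherever the E2 base is 'N'.

```python
import re
import struct

TAG_RE = re.compile(r"[A-Z][A-Z0-9]")
VALUE_RES = {
    "A": re.compile(r"[!-~]"),           # one printable character
    "i": re.compile(r"[-+]?[0-9]+"),     # integer, range checked below
    "Z": re.compile(r"[ -~]+"),          # printable string (space allowed)
    "H": re.compile(r"[0-9A-Fa-f]+"),    # hex string
}


def value_ok(vtype, value):
    """Check VALUE against the rule for its VTYPE (SAM types only)."""
    if re.search(r"[\t\n\r]", value):
        return False
    if vtype == "f":
        try:
            # struct.pack raises OverflowError when a finite value does not
            # fit in a 32-bit float.
            struct.pack("f", float(value))
            return True
        except (ValueError, OverflowError):
            return False
    if VALUE_RES[vtype].fullmatch(value) is None:
        return False
    return vtype != "i" or -2**31 <= int(value) <= 2**31 - 1


def validate_tags(fields, seq):
    """Return messages for the optional fields of one alignment record."""
    problems, seen = [], set()
    for field in fields:
        parts = field.split(":", 2)
        if len(parts) != 3:
            problems.append("malformed optional field: " + field)
            continue
        tag, vtype, value = parts
        if not TAG_RE.fullmatch(tag):
            problems.append("invalid TAG: " + tag)
        if tag in seen:
            problems.append("TAG appears more than once: " + tag)
        seen.add(tag)
        if vtype not in VALUE_RES and vtype != "f":
            problems.append("invalid VTYPE for SAM: " + vtype)
            continue
        if not value_ok(vtype, value):
            problems.append("invalid VALUE for %s:%s" % (tag, vtype))
        if tag in ("E2", "U2") and seq != "*":
            if len(value) != len(seq):
                problems.append(tag + " length differs from the read length")
            elif tag == "E2" and any(
                a == b and a != "N" for a, b in zip(value.upper(), seq.upper())
            ):
                problems.append("E2 base equals the read base")
    return problems
```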

NOTE: There may be other BAM Validations that can be done. They will come later.

Consider validating the CIGAR string against the read length.
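One way to compare a CIGAR string with the read length is to sum the lengths of the operations that consume read bases. In the SAM format these are M, I and S; D, N, H and P do not contribute to SEQ. The Python sketch below uses hypothetical function names and assumes the CIGAR string has already passed the format check in the table.

```python
import re


def cigar_query_length(cigar):
    """Sum the lengths of the CIGAR operations that consume read bases."""
    ops = re.findall(r"([0-9]+)([MIDNSHP])", cigar)
    return sum(int(count) for count, op in ops if op in "MIS")


def cigar_matches_seq(cigar, seq):
    """True if the lengths agree, or if either field is '*'."""
    if cigar == "*" or seq == "*":
        return True
    return cigar_query_length(cigar) == len(seq)
```

For example, `3S10M2I5D4M1H` describes a read of 3 + 10 + 2 + 4 = 19 bases, because the deletion and the hard clip are not present in SEQ.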

Other Read Validation

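{| class="wikitable" style="width:100%" border="1"
|+ style="font-size:150%"|'''SAM Alignment Record'''
! rowspan='2' width="60%"|Validation Criteria
! colspan="2" width="20%"|Implemented
! colspan="2" width="20%"|Tested
|-
! width="10%"|SAM
! width="10%"|BAM
! width="10%"|SAM
! width="10%"|BAM
|-
| If requested, records follow the sort order (taken from the HD/SO field, or specified by the user as coordinate or query name).
|style="background-color:green;"|
|style="background-color:green;"|
|style="background-color:green;"|
|style="background-color:green;"|
|}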
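A sort-order check can be sketched as a single pass that compares each record's key with the previous one. The Python below is illustrative only: `sort_key` and `first_out_of_order` are hypothetical names, records are assumed to be `(qname, rname, pos)` tuples, query names are compared as plain strings, and for coordinate order the references follow their order in the header, with unknown or unmapped references sorting last.

```python
def sort_key(record, order, ref_order):
    """Build the comparison key for one (qname, rname, pos) record."""
    qname, rname, pos = record
    if order == "queryname":
        return (qname,)
    # Coordinate order: reference index from the header, then position.
    return (ref_order.get(rname, len(ref_order)), pos)


def first_out_of_order(records, order, ref_order=None):
    """Return the index of the first record that breaks the order, or None."""
    if order == "unsorted":
        return None
    ref_order = ref_order or {}
    previous = None
    for index, record in enumerate(records):
        key = sort_key(record, order, ref_order)
        if previous is not None and key < previous:
            return index
        previous = key
    return None
```

Here `order` would come either from the HD/SO header field or from the user, matching the two sources named in the criterion.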


SAM Questions

  • Comment says: “If the mapping position of the query is not available, RNAME and CIGAR are set as “*”, and POS and MAPQ as 0.” Is it all or nothing? Can some be set to “*”/0 but not all?
    • Same question for MRNM = “*” and MPOS & ISIZE = 0
  • Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.” - Is there anything here that needs to be validated???

BamFile Classes

BamFile