Changes

From Genome Analysis Wiki
7,161 bytes added, 15:51, 27 August 2010
no edit summary
Line 1: Line 1:  +
'''NOTE: Not all validation criteria are listed here, and not all of those listed have been implemented. (Implemented checks are marked green.)'''
 +
 
=== SAM Header Validation Rules ===
 
=== SAM Header Validation Rules ===
 
TODO
 
TODO
 +
{| class="wikitable" style="width:100%" border="1"
 +
|+ style="font-size:150%"|'''SAM Header'''
 +
! rowspan='2' width="60%"|Validation Criteria
 +
! colspan="2" width="20%"|Implemented
 +
! colspan="2" width="20%"|Tested
 +
|-
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
|-
 +
| All Required Fields are set
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| If an HD line is present, VN is also present.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| HD/VN is in the valid format /^[0-9]+\.[0-9]+$/
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| HD/SO is a valid value (unsorted, queryname, coordinate)
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| SQ/SN all SQ lines have a unique SN field
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| SQ/LN is in the range [1, (2^29) -1]
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| SQ/LN is not a number
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| RG/ID all RG lines have a unique ID field
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| RG/PL is a valid value (ILLUMINA, SOLID, LS454, HELICOS, PACBIO)
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| Header has X lines or fewer (or a maximum number of SQ lines); a file with an extremely large number of header lines once caused a problem.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|}
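
Several of the header criteria above can be sketched in a few lines. The following is a minimal Python illustration, not the library's implementation; the function names <code>parse_header_line</code> and <code>validate_header</code> and the problem messages are hypothetical, and the sketch assumes header lines are tab-separated with <code>KEY:VALUE</code> fields.

```python
import re

VN_RE = re.compile(r"[0-9]+\.[0-9]+")
NUMBER_RE = re.compile(r"[0-9]+")
SORT_ORDERS = {"unsorted", "queryname", "coordinate"}
PLATFORMS = {"ILLUMINA", "SOLID", "LS454", "HELICOS", "PACBIO"}
MAX_SQ_LN = 2**29 - 1


def parse_header_line(line):
    """Split '@SQ<TAB>SN:chr1<TAB>LN:100' into ('SQ', {'SN': 'chr1', 'LN': '100'})."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[1:]:
        key, _, value = field.partition(":")
        tags[key] = value
    return fields[0].lstrip("@"), tags


def validate_header(lines):
    """Return a list of problems found in the given header lines (hypothetical helper)."""
    problems = []
    seen_sn, seen_rg_id = set(), set()
    for line in lines:
        record_type, tags = parse_header_line(line)
        if record_type == "HD":
            if "VN" not in tags:
                problems.append("HD line has no VN")
            elif not VN_RE.fullmatch(tags["VN"]):
                problems.append("HD/VN is not in a valid format")
            if "SO" in tags and tags["SO"] not in SORT_ORDERS:
                problems.append("HD/SO is not a valid value")
        elif record_type == "SQ":
            name = tags.get("SN")
            if name in seen_sn:
                problems.append("SQ/SN is not unique")
            seen_sn.add(name)
            length = tags.get("LN", "")
            if not NUMBER_RE.fullmatch(length):
                problems.append("SQ/LN is not a number")
            elif not 1 <= int(length) <= MAX_SQ_LN:
                problems.append("SQ/LN is out of range")
        elif record_type == "RG":
            group_id = tags.get("ID")
            if group_id in seen_rg_id:
                problems.append("RG/ID is not unique")
            seen_rg_id.add(group_id)
            if "PL" in tags and tags["PL"] not in PLATFORMS:
                problems.append("RG/PL is not a valid value")
    return problems
```

Returning a list of problems rather than stopping at the first one lets a caller report every failed criterion for a header in a single pass.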
    
=== SAM Alignment Validation ===
 
=== SAM Alignment Validation ===
 
{| class="wikitable" style="width:100%" border="1"
 
{| class="wikitable" style="width:100%" border="1"
 
|+ style="font-size:150%"|'''SAM Alignment Record'''
 
|+ style="font-size:150%"|'''SAM Alignment Record'''
! width="70%"|Validation Criteria
+
! rowspan='2' width="60%"|Validation Criteria
! width="15%"|Implemented
+
! colspan="2" width="20%"|Implemented
! width="15%"|Tested
+
! colspan="2" width="20%"|Tested
 +
|-
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 
|-
 
|-
 
| QNAME.Length() > 0 and <= 254
 
| QNAME.Length() > 0 and <= 254
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
|style="background-color:red;"| 
   
|-
 
|-
| QNAME does not contain [ \t\n\r]
+
| QNAME is valid: [!-?A-~] (printable characters minus space and '@') '''This is a new regular expression'''
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| FLAG is an integer [0-9]+
 
| FLAG is an integer [0-9]+
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|-
 +
| FLAG is [0, (2^16)-1]
 +
|style="background-color:green;"| Parse Error since it will be written into a 16 bit field.
 +
|style="background-color:grey;"| N/A: only a 16 bit field
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: only a 16 bit field
 +
|-
 +
| RNAME does not contain [ \t\n\r@=]
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|-
 +
| RNAME is found in an SQ header record if there are any SQs in the header.
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| FLAG < 2048 (I think) or [0, (2^16)-1]
+
| Reference Name length does not match specified length.
 +
|style="background-color:grey;"| N/A: reference name length is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:grey;"| N/A: reference name length is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| RNAME does not contain [ \t\n\r@=]
+
| Reference ID is in range of the number of references
 +
|style="background-color:grey;"| N/A: rID is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:grey;"| N/A: rID is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| POS is an integer [0-9]+
 
| POS is an integer [0-9]+
|style="background-color:red;"|
+
|style="background-color:green;"|
|style="background-color:red;"|
+
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 
|-
 
|-
 
| POS is [0, (2^29)-1]
 
| POS is [0, (2^29)-1]
 +
|style="background-color:green;"| Parse error if it does not fit in the 32-bit field; any other out-of-range value is a validation error.
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| MAPQ is an integer [0-9]+
 
| MAPQ is an integer [0-9]+
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|-
 +
| MAPQ is [0, (2^8)-1]
 +
|style="background-color:green;"| Parse Error since it will be written into an 8 bit field.
 +
|style="background-color:grey;"| N/A: only an 8 bit field
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: only an 8 bit field
 +
|-
 +
| <nowiki>CIGAR ([0-9]+[MIDNSHP])+|\*</nowiki>
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
|-
  −
| MAPQ is [0, (2^8)-1]
   
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| <nowiki>CIGAR ([0-9]+[MIDNSHP])+|\*</nowiki>
+
| CIGAR string matches the length of SEQ if both are not "*"
|style="background-color:red;"|
+
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME)
 
| MRNM does not contain [ \t\n\r@] ('=' means it is the same as RNAME)
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| If SQ is in the header, RNAME and MRNM (if not “=”) must be in SQ.
 
| If SQ is in the header, RNAME and MRNM (if not “=”) must be in SQ.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| MPOS is an integer [0-9]+
 
| MPOS is an integer [0-9]+
 +
|style="background-color:green;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| MPOS is [0, (2^29)-1]
 
| MPOS is [0, (2^29)-1]
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| ISIZE is an integer -?[0-9]+
 
| ISIZE is an integer -?[0-9]+
 +
|style="background-color:green;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| ISIZE is [-(2^29), 2^29]
 
| ISIZE is [-(2^29), 2^29]
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| <nowiki>SEQ is [acgtnACGTN.=]+|\*</nowiki>
 
| <nowiki>SEQ is [acgtnACGTN.=]+|\*</nowiki>
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| If SEQ is * then QUAL is *
 
| If SEQ is * then QUAL is *
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| <nowiki>QUAL is [!-~]+|* → dec 33 – 126 or dec 42 (which is in 33-126) (for BAM, it is between [0,93])</nowiki>
 
| <nowiki>QUAL is [!-~]+|* → dec 33 – 126 or dec 42 (which is in 33-126) (for BAM, it is between [0,93])</nowiki>
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| If QUAL is not “*” it is the same length as SEQ.
+
| If QUAL and SEQ are not “*” they are the same length.
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| TAG is [A-Z][A-Z0-9]
 
| TAG is [A-Z][A-Z0-9]
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| A TAG only appears once per alignment
 
| A TAG only appears once per alignment
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| VTYPE is [AifZH] for SAM and [AcCsSiIfZH]
+
| VTYPE is [AifZH] for SAM and [AcCsSiIfZH] for BAM
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| VALUE does NOT contain [\t\n\r]
 
| VALUE does NOT contain [\t\n\r]
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| For VTYPE = “A”, VALUE is a printable character
 
| For VTYPE = “A”, VALUE is a printable character
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| For VTYPE = “i”, VALUE is a signed 32-bit integer.
 
| For VTYPE = “i”, VALUE is a signed 32-bit integer.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| For VTYPE = “f”, VALUE is a single-precision float.
 
| For VTYPE = “f”, VALUE is a single-precision float.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
 
| For VTYPE = “Z”, VALUE is a printable string.
 
| For VTYPE = “Z”, VALUE is a printable string.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
Line 124: Line 282:  
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| For TAG = E2, length should be the same as the Read Length
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| For TAG = E2, each base should be different from the read base (unless 'N')
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| For TAG = U2, length should be the same as the Read Length
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 
|}
 
|}
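
The per-field patterns and ranges for the mandatory SAM fields can be combined into one routine. The sketch below is only an illustration under stated assumptions: <code>validate_alignment</code> and <code>int_in_range</code> are hypothetical names, the input is one tab-separated SAM text line, and the header lookup of RNAME/MRNM against SQ is left out.

```python
import re

QNAME_RE = re.compile(r"[!-?A-~]{1,254}")      # printable, no space, no '@'
RNAME_BAD_RE = re.compile(r"[ \t\n\r@=]")
MRNM_BAD_RE = re.compile(r"[ \t\n\r@]")        # '=' is allowed for MRNM
CIGAR_RE = re.compile(r"([0-9]+[MIDNSHP])+|\*")
SEQ_RE = re.compile(r"[acgtnACGTN.=]+|\*")
QUAL_RE = re.compile(r"[!-~]+|\*")


def int_in_range(text, low, high, signed=False):
    """True if text is a decimal integer within [low, high]."""
    pattern = r"-?[0-9]+" if signed else r"[0-9]+"
    return re.fullmatch(pattern, text) is not None and low <= int(text) <= high


def validate_alignment(line):
    """Return the names of the checks that fail for one SAM text record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["fewer than 11 mandatory fields"]
    qname, flag, rname, pos, mapq, cigar, mrnm, mpos, isize, seq, qual = fields[:11]
    problems = []
    if not QNAME_RE.fullmatch(qname):
        problems.append("QNAME")
    if not int_in_range(flag, 0, 2**16 - 1):
        problems.append("FLAG")
    if RNAME_BAD_RE.search(rname):
        problems.append("RNAME")
    if not int_in_range(pos, 0, 2**29 - 1):
        problems.append("POS")
    if not int_in_range(mapq, 0, 2**8 - 1):
        problems.append("MAPQ")
    if not CIGAR_RE.fullmatch(cigar):
        problems.append("CIGAR")
    if MRNM_BAD_RE.search(mrnm):
        problems.append("MRNM")
    if not int_in_range(mpos, 0, 2**29 - 1):
        problems.append("MPOS")
    if not int_in_range(isize, -(2**29), 2**29, signed=True):
        problems.append("ISIZE")
    if not SEQ_RE.fullmatch(seq):
        problems.append("SEQ")
    if not QUAL_RE.fullmatch(qual):
        problems.append("QUAL")
    if seq == "*" and qual != "*":
        problems.append("QUAL must be * when SEQ is *")
    if seq != "*" and qual != "*" and len(seq) != len(qual):
        problems.append("SEQ/QUAL length")
    return problems
```

For BAM input the integer-format checks do not apply, since the values are read directly from fixed-width binary fields, as the N/A cells in the table indicate.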
    
NOTE: There are other TAG Validations that can be done.  They will come later.
 
NOTE: There are other TAG Validations that can be done.  They will come later.
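
The generic TAG:VTYPE:VALUE criteria from the table can be sketched as follows for SAM text records. This is a hedged illustration: <code>validate_optional_fields</code> is a hypothetical name, "printable" is assumed to mean ASCII 33 to 126 for type A and ASCII 32 to 126 for type Z, the float pattern is an assumption, and values of type H are not checked.

```python
import re

TAG_RE = re.compile(r"[A-Z][A-Z0-9]")
SAM_VTYPES = set("AifZH")
# Assumed textual float format; the page does not define one.
FLOAT_RE = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")


def validate_optional_fields(fields):
    """Check a list of 'TAG:VTYPE:VALUE' strings from one alignment record."""
    problems = []
    seen = set()
    for field in fields:
        parts = field.split(":", 2)
        if len(parts) != 3:
            problems.append(field + ": not TAG:VTYPE:VALUE")
            continue
        tag, vtype, value = parts
        if not TAG_RE.fullmatch(tag):
            problems.append(tag + ": invalid TAG")
        if tag in seen:
            problems.append(tag + ": TAG appears more than once")
        seen.add(tag)
        if vtype not in SAM_VTYPES:
            problems.append(tag + ": invalid VTYPE " + vtype)
            continue
        if re.search(r"[\t\n\r]", value):
            problems.append(tag + ": VALUE contains tab or newline")
        elif vtype == "A" and not re.fullmatch(r"[!-~]", value):
            problems.append(tag + ": VALUE is not a printable character")
        elif vtype == "i" and not (
            re.fullmatch(r"-?[0-9]+", value) and -(2**31) <= int(value) <= 2**31 - 1
        ):
            problems.append(tag + ": VALUE is not a signed 32-bit integer")
        elif vtype == "f" and not FLOAT_RE.fullmatch(value):
            problems.append(tag + ": VALUE is not a float")
        elif vtype == "Z" and not re.fullmatch(r"[ -~]+", value):
            problems.append(tag + ": VALUE is not a printable string")
    return problems
```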
   −
NOTE: There are other BAM Validations that can be done.  They will come later.
+
NOTE: There may be other BAM Validations that can be done.  They will come later.
 +
 
 +
Consider validating the CIGAR string against the read length.
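
One possible way to compare a CIGAR string with the read length is sketched below. It assumes the operator set listed in the table (MIDNSHP) and that only M, I and S consume read bases; the function names are hypothetical.

```python
import re

CIGAR_OP_RE = re.compile(r"([0-9]+)([MIDNSHP])")
# Assumption: only these operations consume bases of the read (SEQ).
QUERY_CONSUMING = set("MIS")


def cigar_query_length(cigar):
    """Sum the lengths of the CIGAR operations that consume read bases."""
    return sum(
        int(count)
        for count, op in CIGAR_OP_RE.findall(cigar)
        if op in QUERY_CONSUMING
    )


def cigar_matches_seq(cigar, seq):
    """True if the lengths agree, or if either field is '*' (nothing to compare)."""
    if cigar == "*" or seq == "*":
        return True
    return cigar_query_length(cigar) == len(seq)
```

For example, <code>3S10M2I5D4H</code> describes a read of 3 + 10 + 2 = 15 bases, because D and H do not consume bases of SEQ.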
 +
 
 +
== Other Read Validation ==
 +
 
 +
{| class="wikitable" style="width:100%" border="1"
 +
|+ style="font-size:150%"|'''SAM Alignment Record'''
 +
! rowspan='2' width="60%"|Validation Criteria
 +
! colspan="2" width="20%"|Implemented
 +
! colspan="2" width="20%"|Tested
 +
|-
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
|-
 +
| If a sort-order check is requested (based either on the SO flag or on a user-specified order, coordinate or query name), records appear in that order.
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|}
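
A sort-order check can be done in a single pass by comparing each record's key with the previous one. The sketch below is an assumption-laden illustration: records are simplified to <code>(qname, rname, pos)</code> tuples, query-name order is taken to be plain string comparison, coordinate order follows the reference order given by the caller with unmapped (<code>*</code>) reads last, and both function names are hypothetical.

```python
def sort_key(record, order, ref_index):
    """Build a comparable key for one (qname, rname, pos) record."""
    qname, rname, pos = record
    if order == "queryname":
        return (qname,)
    # coordinate: reference order first, then position; unknown or '*' references sort last
    return (ref_index.get(rname, len(ref_index)), pos)


def find_first_out_of_order(records, order, reference_names):
    """Return the 1-based number of the first out-of-order record, or None if sorted."""
    ref_index = {name: i for i, name in enumerate(reference_names)}
    previous = None
    for number, record in enumerate(records, start=1):
        key = sort_key(record, order, ref_index)
        if previous is not None and key < previous:
            return number
        previous = key
    return None
```

Keeping only the previous key means the check needs constant memory, so it can run while a file is being streamed.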
 +
 
    
===SAM Questions===
 
===SAM Questions===
Line 134: Line 336:  
**Same question for MRNM = “*” and MPOS & ISIZE = 0
 
**Same question for MRNM = “*” and MPOS & ISIZE = 0
 
*Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.” Is there anything here that needs to be validated?
 
*Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.” Is there anything here that needs to be validated?
 +
 +
== BamFile Classes ==
 +
[[C++ Library: libbam|BamFile]]
