Changes

From Genome Analysis Wiki
5,196 bytes added, 15:51, 27 August 2010
no edit summary
Line 1: Line 1:  +
'''NOTE: Not all validation criteria have been listed here, and not all of those listed have been implemented. (Implemented checks are marked green.)'''
 +
 
=== SAM Header Validation Rules ===
 
=== SAM Header Validation Rules ===
 
TODO
 
TODO
  −
=== SAM Alignment Validation ===
   
{| class="wikitable" style="width:100%" border="1"
 
{| class="wikitable" style="width:100%" border="1"
|+ style="font-size:150%"|'''SAM Alignment Record'''
+
|+ style="font-size:150%"|'''SAM Header'''
 
! rowspan='2' width="60%"|Validation Criteria
 
! rowspan='2' width="60%"|Validation Criteria
 
! colspan="2" width="20%"|Implemented
 
! colspan="2" width="20%"|Implemented
Line 14: Line 14:  
! width="10%"|BAM
 
! width="10%"|BAM
 
|-
 
|-
| QNAME.Length() > 0 and <= 254
+
| All Required Fields are set
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| If HD line is there, VN is also there.
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| HD/VN is in a valid format /^[0-9]+\.[0-9]+$/
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| HD/SO is a valid value (unsorted, queryname, coordinate)
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| SQ/SN all SQ lines have a unique SN field
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
Line 20: Line 44:  
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| QNAME does not contain [ \t\n\r]
+
| SQ/LN is in the range [1, (2^29) -1]
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
Line 26: Line 50:  
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| FLAG is an integer [0-9]+
+
| SQ/LN is a number
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
Line 32: Line 56:  
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| FLAG < 2048 (I think) or [0, (2^16)-1]
+
| RG/ID all RG lines have a unique ID field
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
Line 38: Line 62:  
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| RNAME does not contain [ \t\n\r@=]
+
| RG/PL is a valid value (ILLUMINA, SOLID, LS454, HELICOS, PACBIO)
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
Line 44: Line 68:  
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| POS is an integer [0-9]+
+
| Header has X lines or fewer (or at most some maximum number of SQ lines); a file with an excessive number of header lines was once a problem
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|}
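
As a rough illustration of the header criteria in the table above, the following Python sketch checks a list of header lines. It is not the libbam implementation: `parse_header_line` and `validate_header` are hypothetical names, the constants are copied from the table, and the "X lines or fewer" limit is left out because no value is given for it.

```python
import re

# Patterns and value sets copied from the criteria table.
VN_PATTERN = re.compile(r"[0-9]+\.[0-9]+")
NUMBER = re.compile(r"[0-9]+")
VALID_SORT_ORDERS = {"unsorted", "queryname", "coordinate"}
VALID_PLATFORMS = {"ILLUMINA", "SOLID", "LS454", "HELICOS", "PACBIO"}
MAX_SQ_LENGTH = 2**29 - 1


def parse_header_line(line):
    """Split '@SQ<TAB>SN:chr1<TAB>LN:100' into ('SQ', {'SN': 'chr1', 'LN': '100'})."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0].lstrip("@")
    tags = dict(field.split(":", 1) for field in fields[1:] if ":" in field)
    return record_type, tags


def validate_header(lines):
    """Return a list of error messages; an empty list means no check failed."""
    errors = []
    unique_tag = {"SQ": "SN", "RG": "ID"}  # tag that must be unique per record type
    seen = {"SQ": set(), "RG": set()}
    for line in lines:
        record_type, tags = parse_header_line(line)
        if record_type == "HD":
            vn = tags.get("VN")
            if vn is None:
                errors.append("HD line has no VN")
            elif not VN_PATTERN.fullmatch(vn):
                errors.append(f"HD/VN '{vn}' is not in a valid format")
            so = tags.get("SO")
            if so is not None and so not in VALID_SORT_ORDERS:
                errors.append(f"HD/SO '{so}' is not a valid value")
        if record_type in unique_tag:
            name = unique_tag[record_type]
            value = tags.get(name)
            if value is None:
                errors.append(f"{record_type} line has no {name}")
            elif value in seen[record_type]:
                errors.append(f"{record_type}/{name} '{value}' is not unique")
            else:
                seen[record_type].add(value)
        if record_type == "SQ":
            ln = tags.get("LN")
            if ln is None:
                errors.append("SQ line has no LN")
            elif not NUMBER.fullmatch(ln):
                errors.append(f"SQ/LN '{ln}' is not a number")
            elif not 1 <= int(ln) <= MAX_SQ_LENGTH:
                errors.append(f"SQ/LN {ln} is out of range")
        if record_type == "RG":
            pl = tags.get("PL")
            if pl is not None and pl not in VALID_PLATFORMS:
                errors.append(f"RG/PL '{pl}' is not a valid value")
    return errors
```

Collecting messages instead of stopping at the first failure lets one pass report every failed criterion for a header.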
 +
 +
=== SAM Alignment Validation ===
 +
{| class="wikitable" style="width:100%" border="1"
 +
|+ style="font-size:150%"|'''SAM Alignment Record'''
 +
! rowspan='2' width="60%"|Validation Criteria
 +
! colspan="2" width="20%"|Implemented
 +
! colspan="2" width="20%"|Tested
 
|-
 
|-
| POS is [0, (2^29)-1]
+
! width="10%"|SAM
 +
! width="10%"|BAM
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
|-
 +
| QNAME.Length() > 0 and <= 254
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:red;"|
 +
|-
 +
| QNAME is valid: [!-?A-~]  (printable characters minus space and '@') '''This is a new regular expression'''
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
Line 56: Line 99:  
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| MAPQ is an integer [0-9]+
+
| FLAG is an integer [0-9]+
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|-
 +
| FLAG is [0, (2^16)-1]
 +
|style="background-color:green;"| Parse Error since it will be written into a 16 bit field.
 +
|style="background-color:grey;"| N/A: only a 16 bit field
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: only a 16 bit field
 +
|-
 +
| RNAME does not contain [ \t\n\r@=]
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|-
 +
| RNAME is found in an SQ header record if there are any SQs in the header.
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|-
 +
| Reference name length matches the specified length.
 +
|style="background-color:grey;"| N/A: reference name length is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:grey;"| N/A: reference name length is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| MAPQ is [0, (2^8)-1]
+
| Reference ID is in range of the number of references
 +
|style="background-color:grey;"| N/A: rID is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:grey;"| N/A: rID is in BAM format only
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|-
 +
| POS is an integer [0-9]+
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|-
 +
| POS is [0, (2^29)-1]
 +
|style="background-color:green;"| Parse Error if it can't fit in the 32 bit field; any other out-of-range value is a validation error.
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|-
 +
| MAPQ is an integer [0-9]+
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: just interpret the bits as an int.
 +
|-
 +
| MAPQ is [0, (2^8)-1]
 +
|style="background-color:green;"| Parse Error since it will be written into an 8 bit field.
 +
|style="background-color:grey;"| N/A: only an 8 bit field
 +
|style="background-color:green;"|
 +
|style="background-color:grey;"| N/A: only an 8 bit field
 
|-
 
|-
 
| <nowiki>CIGAR ([0-9]+[MIDNSHP])+|\*</nowiki>
 
| <nowiki>CIGAR ([0-9]+[MIDNSHP])+|\*</nowiki>
Line 72: Line 163:  
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| CIGAR string matches the length of SEQ if both are not "*"
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
Line 87: Line 184:  
|-
 
|-
 
| MPOS is an integer [0-9]+
 
| MPOS is an integer [0-9]+
|style="background-color:red;"|
+
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
Line 99: Line 196:  
|-
 
|-
 
| ISIZE is an integer -?[0-9]+
 
| ISIZE is an integer -?[0-9]+
|style="background-color:red;"|
+
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
Line 128: Line 225:  
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| If QUAL is not “*” it is the same length as SEQ.
+
| If QUAL and SEQ are not “*” they are the same length.
|style="background-color:red;"|
+
|style="background-color:green;"|
|style="background-color:red;"|
   
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|style="background-color:green;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
Line 146: Line 243:  
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|-
 
|-
| VTYPE is [AifZH] for SAM and [AcCsSiIfZH]
+
| VTYPE is [AifZH] for SAM and [AcCsSiIfZH] for BAM
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
Line 187: Line 284:  
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 
|style="background-color:red;"|
 +
|-
 +
| For TAG = E2, length should be the same as the Read Length
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| For TAG = E2, each base should be different from the read base (unless 'N')
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 +
| For TAG = U2, length should be the same as the Read Length
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|style="background-color:red;"|
 +
|-
 
|}
 
|}
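
The per-field criteria for a SAM text record (QNAME through CIGAR) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the library's code: `validate_record` and `in_range` are hypothetical names, the patterns and ranges are copied from the table, and treating RNAME "*" as exempt from the SQ lookup is an assumption made here for unmapped reads.

```python
import re

# Patterns copied from the criteria table.
QNAME_PATTERN = re.compile(r"[!-?A-~]{1,254}")   # printable, no space, no '@', length 1..254
RNAME_FORBIDDEN = re.compile(r"[ \t\n\r@=]")
INTEGER = re.compile(r"[0-9]+")
CIGAR_PATTERN = re.compile(r"([0-9]+[MIDNSHP])+|\*")


def in_range(text, low, high):
    """True if text is an unsigned decimal integer within [low, high]."""
    return INTEGER.fullmatch(text) is not None and low <= int(text) <= high


def validate_record(line, reference_names=()):
    """Return error messages for one tab-separated SAM record.

    reference_names holds the SN values of the header's SQ lines;
    when it is empty, the RNAME lookup is skipped.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["record has fewer than the 11 mandatory fields"]
    qname, flag, rname, pos, mapq, cigar = fields[:6]
    errors = []
    if not QNAME_PATTERN.fullmatch(qname):
        errors.append("QNAME is empty, too long, or contains an invalid character")
    if not in_range(flag, 0, 2**16 - 1):
        errors.append("FLAG is not an integer in [0, (2^16)-1]")
    if RNAME_FORBIDDEN.search(rname):
        errors.append("RNAME contains a forbidden character")
    elif reference_names and rname != "*" and rname not in reference_names:
        # Assumption: "*" (no reference) is not looked up in the SQ records.
        errors.append("RNAME is not found in an SQ header record")
    if not in_range(pos, 0, 2**29 - 1):
        errors.append("POS is not an integer in [0, (2^29)-1]")
    if not in_range(mapq, 0, 2**8 - 1):
        errors.append("MAPQ is not an integer in [0, (2^8)-1]")
    if not CIGAR_PATTERN.fullmatch(cigar):
        errors.append("CIGAR does not match the expected pattern")
    return errors
```

For BAM input the integer-format checks do not apply, as the table notes, because the values are read directly from fixed-width binary fields.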
   Line 192: Line 308:     
NOTE: There may be other BAM Validations that can be done.  They will come later.
 
NOTE: There may be other BAM Validations that can be done.  They will come later.
 +
 +
Consider validating the CIGAR string against the read length.
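
A minimal Python sketch of the length-consistency checks follows (CIGAR against SEQ, QUAL against SEQ, and the E2/U2 tags against the read length). The function names are hypothetical. The sketch assumes that the read length implied by a CIGAR string is the sum of its M, I and S operations, which are the operations that consume bases of the read.

```python
import re

CIGAR_OP = re.compile(r"([0-9]+)([MIDNSHP])")
# Operations that consume bases of the read (SEQ): match, insertion, soft clip.
QUERY_CONSUMING = {"M", "I", "S"}


def cigar_query_length(cigar):
    """Number of read bases described by a CIGAR string."""
    return sum(int(count) for count, op in CIGAR_OP.findall(cigar)
               if op in QUERY_CONSUMING)


def check_lengths(seq, qual, cigar, tags=None):
    """Return error messages for length mismatches; tags maps TAG to its string value."""
    tags = tags or {}
    errors = []
    if seq == "*":
        return errors  # nothing to compare against
    if cigar != "*" and cigar_query_length(cigar) != len(seq):
        errors.append("CIGAR length does not match SEQ")
    if qual != "*" and len(qual) != len(seq):
        errors.append("QUAL length does not match SEQ")
    for tag in ("E2", "U2"):
        value = tags.get(tag)
        if value is not None and len(value) != len(seq):
            errors.append(f"{tag} length does not match SEQ")
    e2 = tags.get("E2")
    if e2 is not None and len(e2) == len(seq):
        # Each E2 base should differ from the read base, unless the base is 'N'.
        for position, (base, second) in enumerate(zip(seq.upper(), e2.upper()), start=1):
            if base == second and base != "N":
                errors.append(f"E2 base {position} equals the read base")
    return errors
```

For example, `check_lengths("ACGT", "IIII", "2S2M")` reports no errors, while a CIGAR of `3M1D` for the same read is flagged because the deletion does not consume a read base.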
 +
 +
== Other Read Validation ==
 +
 +
{| class="wikitable" style="width:100%" border="1"
 +
|+ style="font-size:150%"|'''SAM Alignment Record'''
 +
! rowspan='2' width="60%"|Validation Criteria
 +
! colspan="2" width="20%"|Implemented
 +
! colspan="2" width="20%"|Tested
 +
|-
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
! width="10%"|SAM
 +
! width="10%"|BAM
 +
|-
 +
| If sort-order checking is requested (based either on the SO flag or on a user-specified coordinate or query name order), the records are in that order.
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|style="background-color:green;"|
 +
|}
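
The sort-order check in the table above could look roughly like the Python sketch below. It is an assumption-laden illustration rather than the implemented check: `first_out_of_order` is a hypothetical name, records are reduced to (QNAME, RNAME, POS) tuples, coordinate order is taken to mean the order of the header's SQ lines followed by POS with unlisted references (such as "*") last, and query name order is taken to be plain string comparison.

```python
def first_out_of_order(records, order, reference_names=()):
    """Return the 1-based number of the first record that breaks the order, or None.

    records: iterable of (qname, rname, pos) tuples.
    order: "unsorted", "queryname" or "coordinate".
    reference_names: SN values in header order (used for coordinate order).
    """
    if order == "unsorted":
        return None  # nothing to check
    if order not in ("queryname", "coordinate"):
        raise ValueError(f"unknown sort order: {order}")
    # References not listed in the header sort after all listed ones.
    ref_index = {name: index for index, name in enumerate(reference_names)}
    previous = None
    for number, (qname, rname, pos) in enumerate(records, start=1):
        if order == "queryname":
            key = (qname,)
        else:
            key = (ref_index.get(rname, len(ref_index)), pos)
        if previous is not None and key < previous:
            return number
        previous = key
    return None
```

Returning the record number, instead of only a true/false result, makes it easier to report where a file stops being sorted.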
 +
    
===SAM Questions===
 
===SAM Questions===
Line 197: Line 336:  
**Same question for MRNM = “*” and MPOS & ISIZE = 0
 
**Same question for MRNM = “*” and MPOS & ISIZE = 0
 
*Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.”  - Is there anything here that needs to be validated???
 
*Comment says: “The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits.”  - Is there anything here that needs to be validated???
 +
 +
== BamFile Classes ==
 +
[[C++ Library: libbam|BamFile]]
