Difference between revisions of "C++ Class: SamFile"

From Genome Analysis Wiki

Revision as of 17:19, 23 August 2010


Reading/Writing SAM/BAM Files In Your Program

The SamFile class allows a user to easily read/write a SAM/BAM file.

The SamFile class also allows a user to read specific sections of sorted and indexed BAM files. To take advantage of this capability, the index file must be read before the read section is set. Reading by section avoids having to read the entire file by using the seeking capability of BGZF files.
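The call order described above (open the file, read the index, then set the read section) can be sketched as follows. This is only a model of the ordering rules: `SamFileModel` and `prepareSection` are hypothetical names, the class is not the real SamFile and does no file I/O, and the file names are placeholders.

```cpp
#include <cstdint>

// Hypothetical stand-in that models only the documented call-order rules.
// It is NOT the real SamFile class and performs no file I/O.
class SamFileModel
{
public:
    bool OpenForRead(const char* /*filename*/)  { myOpen = true; return true; }
    bool ReadBamIndex(const char* /*filename*/) { myIndexRead = true; return true; }

    // Fails unless a file is open for reading and the index has been read.
    bool SetReadSection(std::int32_t /*refID*/) { return myOpen && myIndexRead; }

private:
    bool myOpen = false;
    bool myIndexRead = false;
};

// Performs the three steps in the required order and reports overall success.
bool prepareSection(SamFileModel& file, std::int32_t refID)
{
    return file.OpenForRead("sample.bam")        // placeholder file name
        && file.ReadBamIndex("sample.bam.bai")   // placeholder index name
        && file.SetReadSection(refID);
}
```

With the real class, the same sequence would be followed by a loop that calls ReadRecord until it returns false.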

Future Enhancements: Add the ability to read alignments that match a given start, end position for a specific reference sequence.

This class is part of libbam.

Class Methods

Method Name Description
SamFile::SamFile() Default Constructor - initializes the variables, but does not open any files.
SamFile::SamFile(const char* filename, OpenType mode = READ) Constructor - initializes the variables, and opens the specified file for READ/WRITE based on the passed in mode. If the mode is not specified, it defaults to READ.

Aborts if the specified file could not be opened.

bool SamFile::IsEOF() bool: true if the end of file has been reached, false if not.

Be careful using this method when you are reading only a specific section, since you may reach the end of your section without hitting the end of the file.

bool SamFile::OpenForRead(const char* filename) Opens the specified file for reading.

Determines if it is a BAM/SAM file by reading the beginning of the file. Returns true if successfully opened for reading, false if not.

bool SamFile::OpenForWrite(const char * filename) bool: true if successfully opened, false if not.

Opens as BAM file if the specified filename ends in .bam. Otherwise it is opened as a SAM file. Returns true if successfully opened for writing, false if not.

bool SamFile::ReadBamIndex(const char * filename) bool: true if the bam index file was successfully read, false if not.

Reads the specified bam index file. It must be read prior to setting a read section, for seeking and reading portions of a bam file.

void SamFile::Close() Close the file if there is one open.
bool SamFile::ReadHeader(SamFileHeader& header) Reads the header section from the file and stores it in the passed in header.

Returns true if successfully read, false if not.

bool SamFile::WriteHeader(SamFileHeader& header) Writes the specified header into the file.

Returns true if successfully written, false if not.

bool SamFile::ReadRecord(SamFileHeader& header, SamRecord& record) Reads the next record from the file and stores it in the passed in record.

If it is an indexed BAM file and SetReadSection was called, only alignments in the section specified by SetReadSection are read. If they all have already been read, this method returns false.

Validates that the record is sorted according to the value set by setSortedValidation. No sorting validation is done if specified to be unsorted, or setSortedValidation was never called.

Returns false if the record was not successfully read or not properly sorted. Returns true if successfully read and properly sorted.

bool SamFile::WriteRecord(SamFileHeader& header, SamRecord& record) Writes the specified record into the file.

Validates that the record is sorted according to the value set by setSortedValidation. No sorting validation is done if specified to be unsorted, or setSortedValidation was never called. Returns false and does not write the record if the record was not properly sorted.

Returns false if the record was not properly sorted or not successfully written. Returns true if properly sorted and successfully written.

void SamFile::setSortedValidation(SortedType sortType) Set the flag to validate that the file is sorted as it is read/written. Must be called after the file has been opened.

sortType specifies the type of sort to be checked for.

uint32_t SamFile::GetCurrentRecordCount() Return the number of records that have been read/written so far.
SamStatus::Status SamFile::GetStatus() Get the status result of the last status reporting method call.
const char* SamFile::GetStatusMessage() Get the Status Message of the last call that sets status.
DEPRECATED: SamStatus::Status SamFile::GetFailure() Get the type of failure that occurred on a method failure.
bool SamFile::SetReadSection(int32_t refID) Tell the class which reference ID should be read from the BAM file. This is the index into the BAM Index list of reference information: 0 - #references. The records for that reference id will be retrieved on each ReadRecord call. When all records have been retrieved for the specified reference id, ReadRecord will return false until a new read section is set.

Pass in -1 in order to read the section of the bam file not associated with any reference ID.

Must be called after OpenForRead.

Returns true if the read section was successfully set, false if not. False is returned if the BAM Index File has not yet been read or if a BAM file is not open for reading.

bool SamFile::SetReadSection(const char* refName) Tell the class which reference name should be read from the BAM file. The specified name will be mapped to the index into the BAM Index list of reference information: 0 - #references. The records for that reference name will be retrieved on each ReadRecord call. When all records have been retrieved for the specified reference name, ReadRecord will return false until a new read section is set.

Pass in "" or "*" in order to read the section of the bam file not associated with any reference name.

Must be called after OpenForRead.

Returns true if the read section was successfully set, false if not. False is returned if the BAM Index File has not yet been read or if a BAM file is not open for reading.

bool SamFile::SetReadSection(int32_t refID, int32_t start, int32_t end) Sets what part of the BAM file should be read. Fails if this is not a BAM file and/or the index file has not yet been read. This version restricts reading to a specific reference id and a 0-based region from start (inclusive) to end (exclusive). The records for this section will be retrieved on each ReadRecord call. When all records have been retrieved for the specified section, ReadRecord will return failure until a new read section is set.

Pass in -1 in order to read the section of the bam file not associated with any reference ID.

Must be called after OpenForRead.

Return Value: true = success; false = failure.

bool SamFile::SetReadSection(const char* refName, int32_t start, int32_t end) Sets what part of the BAM file should be read. Fails if this is not a BAM file and/or the index file has not yet been read. This version restricts reading to a specific reference name and a 0-based region from start (inclusive) to end (exclusive). The records for this section will be retrieved on each ReadRecord call. When all records have been retrieved for the specified section, ReadRecord will return failure until a new read section is set.

Pass in "" or "*" in order to read the section of the bam file not associated with any reference name.

Must be called after OpenForRead.

Return Value: true = success; false = failure.

uint32_t SamFile::GetNumOverlaps(SamRecord& samRecord) Returns the number of bases in the passed in read that overlap the region that is currently set for this file. Overlapping means that the bases occur in both the read and the reference as either matches or mismatches. This does not count insertions, deletions, clips, pads, or skips.
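The overlap rule described for GetNumOverlaps can be illustrated with a small sketch that works directly on a CIGAR string. It is not the library's implementation: `countOverlaps` and its parameters are hypothetical, the region is taken to be 0-based with an inclusive start and exclusive end as in SetReadSection, and only M, = and X operations are counted.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical sketch, not the library's GetNumOverlaps.
// Counts bases in M/=/X CIGAR operations whose reference positions fall in
// the 0-based half-open region [regionStart, regionEnd).
// alignStart is the 0-based reference position of the first aligned base.
std::uint32_t countOverlaps(const std::string& cigar, std::int32_t alignStart,
                            std::int32_t regionStart, std::int32_t regionEnd)
{
    std::uint32_t overlaps = 0;
    std::int32_t refPos = alignStart;
    std::int32_t len = 0;
    for (char c : cigar)
    {
        if (std::isdigit(static_cast<unsigned char>(c)))
        {
            len = len * 10 + (c - '0');   // accumulate the operation length
            continue;
        }
        switch (c)
        {
            case 'M': case '=': case 'X':
            {
                // Match/mismatch: intersect [refPos, refPos + len) with the region.
                const std::int32_t lo = std::max(refPos, regionStart);
                const std::int32_t hi = std::min(refPos + len, regionEnd);
                if (hi > lo) overlaps += static_cast<std::uint32_t>(hi - lo);
                refPos += len;
                break;
            }
            case 'D': case 'N':
                refPos += len;            // deletion/skip: advances the reference only
                break;
            default:
                break;                    // I, S, H, P: no reference bases, not counted
        }
        len = 0;
    }
    return overlaps;
}
```

For example, a read with CIGAR 5M2I3M2D4M starting at position 100 has 2 + 3 + 2 = 7 counted bases inside the region [103, 112).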

Class Enums

enum OpenType
Enum Value Description
READ Open the file for read.
WRITE Open the file for write.
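When a file is opened for writing, the output format is chosen from the filename, as described under OpenForWrite: a name ending in .bam gives a BAM file, anything else a SAM file. A minimal sketch of such a check is shown below. `endsWithBam` is a hypothetical helper, and the case-sensitive comparison is an assumption rather than documented behavior.

```cpp
#include <string>

// Hypothetical helper, not part of SamFile: true if the filename ends in
// ".bam". The comparison is case-sensitive, which is an assumption.
bool endsWithBam(const std::string& filename)
{
    const std::string ext = ".bam";
    return filename.size() >= ext.size() &&
           filename.compare(filename.size() - ext.size(), ext.size(), ext) == 0;
}
```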
enum SortedType
Enum Value Description
UNSORTED Do not do any sorting validation.
FLAG Validate that the file is sorted by the type specified in the SO Tag of the file's header.
COORDINATE Validate that the file is sorted by Coordinate.
QUERY_NAME Validate that the file is sorted by Query Name.
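The kind of check that sorting validation implies can be sketched by comparing each record with the previous one. Everything here is hypothetical (`SortCheck`, `RecordKey`, `SortValidator`) and is not the library's code. Two assumptions are made: records with reference ID -1 sort after all references under coordinate order, and query names are compared as plain strings. The FLAG mode, which takes the sort type from the header, is not modeled.

```cpp
#include <cstdint>
#include <string>

// Hypothetical types; the names loosely mirror SortedType but are not library code.
enum class SortCheck { Unsorted, Coordinate, QueryName };

struct RecordKey
{
    std::int32_t refID;     // -1 means no reference
    std::int32_t pos;       // 0-based position
    std::string readName;
};

class SortValidator
{
public:
    explicit SortValidator(SortCheck type) : myType(type) {}

    // Returns true if rec is in order relative to the previously seen record.
    bool inOrder(const RecordKey& rec)
    {
        bool ok = true;
        if (myHavePrev)
        {
            if (myType == SortCheck::Coordinate)
            {
                // Casting to unsigned makes refID -1 compare greater than any
                // real reference index (assumption: such records come last).
                const std::uint32_t prevRef = static_cast<std::uint32_t>(myPrev.refID);
                const std::uint32_t curRef = static_cast<std::uint32_t>(rec.refID);
                ok = (prevRef < curRef) ||
                     (prevRef == curRef && myPrev.pos <= rec.pos);
            }
            else if (myType == SortCheck::QueryName)
            {
                ok = (myPrev.readName <= rec.readName);
            }
        }
        myPrev = rec;
        myHavePrev = true;
        return ok;
    }

private:
    SortCheck myType;
    RecordKey myPrev{};
    bool myHavePrev = false;
};
```

A reader or writer following the documented behavior would treat a false result as a failed ReadRecord or WriteRecord call.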


Child Classes

SamFileReader

Additional Child Class Methods
Method Name Description
SamFileReader::SamFileReader(const char* filename) Constructor - initializes the variables, and opens the specified file for reading.

Aborts if the specified file could not be opened.

SamFileWriter

Additional Child Class Methods
Method Name Description
SamFileWriter::SamFileWriter(const char* filename) Constructor - initializes the variables, and opens the specified file for writing.

Aborts if the specified file could not be opened.


Statistics

Statistic Generation

The following statistics can optionally be recorded when reading a SamFile by calling SamFile::GenerateStatistics(), and displayed with SamFile::PrintStatistics().

The statistics only reflect alignments that were successfully read from the BAM file. Alignments that failed to parse from the file are not reflected in the statistics, but alignments that are invalid for other reasons may show up in the statistics.

Read Counts
Statistic Description
TotalReads Total number of alignments that were successfully read from the file.
MappedReads Total number of alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).
PairedReads Total number of alignments that were successfully read from the file with FLAG bit 0x001 set to 1 (paired).
ProperPair Total number of alignments that were successfully read from the file with FLAG bits 0x001 (paired) and 0x002 (proper pair) both set to 1.
DuplicateReads Total number of alignments that were successfully read from the file with FLAG bit 0x400 set to 1 (PCR or optical duplicate).
QCFailureReads Total number of alignments that were successfully read from the file with FLAG bit 0x200 set to 1 (failed quality checks).
Statistic Description
MappingRate(%) 100 * MappedReads/TotalReads
PairedReads(%) 100 * PairedReads/TotalReads
ProperPair(%) 100 * ProperPair/TotalReads
DupRate(%) 100 * DuplicateReads/TotalReads
QCFailRate(%) 100 * QCFailureReads/TotalReads
Base Counts
Statistic Description
TotalBases Sum of the SEQ lengths for all alignments that were successfully read from the file.
BasesInMappedReads Sum of the SEQ lengths for all alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).
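The definitions in the tables above translate into simple FLAG bit tests. The accumulator below is a sketch of that bookkeeping, not the library's implementation: `ReadStats`, `add`, and `percent` are hypothetical names, and the caller is assumed to supply each alignment's FLAG value and SEQ length.

```cpp
#include <cstdint>

// Hypothetical accumulator mirroring the statistic names above; not library code.
struct ReadStats
{
    std::uint64_t totalReads = 0;
    std::uint64_t mappedReads = 0;
    std::uint64_t pairedReads = 0;
    std::uint64_t properPair = 0;
    std::uint64_t duplicateReads = 0;
    std::uint64_t qcFailureReads = 0;
    std::uint64_t totalBases = 0;
    std::uint64_t basesInMappedReads = 0;

    // Updates the counters for one successfully read alignment.
    void add(std::uint16_t flag, std::uint32_t seqLength)
    {
        ++totalReads;
        totalBases += seqLength;
        if ((flag & 0x004) == 0)          // 0x004 clear: not unmapped
        {
            ++mappedReads;
            basesInMappedReads += seqLength;
        }
        if (flag & 0x001)                 // paired
        {
            ++pairedReads;
            if (flag & 0x002) ++properPair;   // paired AND proper pair
        }
        if (flag & 0x400) ++duplicateReads;   // PCR or optical duplicate
        if (flag & 0x200) ++qcFailureReads;   // failed quality checks
    }

    // 100 * count / total, with 0 returned when no reads were counted.
    static double percent(std::uint64_t count, std::uint64_t total)
    {
        return total == 0 ? 0.0 : 100.0 * count / total;
    }
};
```

Each rate in the percentage table is then `percent(count, totalReads)` for the corresponding counter.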

NOTE: If the TotalReads is greater than 10^6, then the Read Counts and Base Counts specify the total counts divided by 10^6. This is indicated in the output with a (e6) appended to the field name.
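The scaling rule in the note can be sketched as a small formatting helper. `formatCount` is hypothetical. The tab separator, the two decimal places for scaled values, and the plain integer output for unscaled counts are assumptions and are not confirmed by the library.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical formatter, not library code. The decision to scale is taken
// from totalReads and applied to whichever count field is being printed.
std::string formatCount(const std::string& name, std::uint64_t count,
                        std::uint64_t totalReads)
{
    char buf[128];
    if (totalReads > 1000000)
    {
        // Report the count in millions and mark the field name with (e6).
        std::snprintf(buf, sizeof(buf), "%s(e6)\t%.2f", name.c_str(), count / 1e6);
    }
    else
    {
        // Assumption: unscaled counts are printed as plain integers.
        std::snprintf(buf, sizeof(buf), "%s\t%llu", name.c_str(),
                      static_cast<unsigned long long>(count));
    }
    return buf;
}
```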

Example Statistics Output

TotalReads(e6)	18.90
MappedReads(e6)	14.77
PairedReads(e6)	18.90
ProperPair(e6)	11.28
DuplicateReads(e6)	0.00
QCFailureReads(e6)	0.00

MappingRate(%)	78.17
PairedReads(%)	100.00
ProperPair(%)	59.68
DupRate(%)	0.00
QCFailRate(%)	0.00

TotalBases(e6)	699.30
BasesInMappedReads(e6)	546.67

Usage Examples

Sam Library Usage Examples