Difference between revisions of "C++ Class: SamFile"

From Genome Analysis Wiki
Jump to navigationJump to search
 
(19 intermediate revisions by 2 users not shown)
Line 1: Line 1:
[[Category:libbam]]
+
[[Category:C++]]
 +
[[Category:libStatGen]]
 +
[[Category:libStatGen BAM]]
  
 
== Reading/Writing SAM/BAM Files In Your Program ==
 
== Reading/Writing SAM/BAM Files In Your Program ==
Line 8: Line 10:
 
'''Future Enhancements:''' Add the ability to read alignments that match a given start, end position for a specific reference sequence.  
 
'''Future Enhancements:''' Add the ability to read alignments that match a given start, end position for a specific reference sequence.  
  
This class is part of [[C++ Library: libbam|libbam]].
+
This class is part of [[C++ Library: libStatGen|C++ Library: libStatGen]].
  
=== Class Methods ===
+
=== Class Documentation ===
  
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
+
See: http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFile.html
|-style="background: #f2f2f2; text-align: center;"  '''SamFile Class Methods'''
+
 
! Method Name !!  Description
+
== Child Classes ==
|-
+
=== SamFileReader ===
| <code>bool SamFile::IsEOF()</code>
+
http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFileReader.html
| bool: true if the end of file has been reached, false if not.
 
Be careful using this method when you are only reading a specific section since you may reach the end of your section without hitting the end of the file
 
  
|-
+
=== SamFileWriter ===
| <code>bool SamFile::OpenForRead(const char* filename)</code>
+
http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFileWriter.html
| Opens the specified file for reading.
 
Determines if it is a BAM/SAM file by reading the beginning of the file.
 
Returns true if successfully opened reading, false if not.
 
|-
 
| <code>bool SamFile::OpenForWrite(const char * filename)</code>
 
| bool: true if successfully opened, false if not.
 
Opens as BAM file if the specified filename ends in .bam. Otherwise it is opened as a SAM file.
 
Returns true if successfully opened for writing, false if not.
 
|-
 
| <code>bool SamFile::ReadBamIndex(const char * filename)</code>
 
| bool: true if the bam index file was successfully read, false if not.
 
Reads the specified bam index file.  It must be read prior to setting a read section, for seeking and reading portions of a bam file.
 
|-
 
| <code>void SamFile::Close()</code>
 
| Close the file if there is one open.
 
|-
 
| <code>bool SamFile::ReadHeader(SamFileHeader& header)</code>
 
| Reads the header section from the file and stores it in the passed in header.
 
Returns true if successfully read, false if not.
 
|-
 
| <code>bool SamFile::WriteHeader(SamFileHeader& header)</code>
 
| Writes the specified header into the file.
 
Returns true if successfully written, false if not.
 
|- 
 
| <code>bool SamFile::ReadRecord(SamFileHeader& header, SamRecord& record)</code>
 
| Reads the next record from the file and stores it in the passed in record.
 
  
If it is an indexed BAM file and SetReadSection was called, only alignments in the section specified by <code>SetReadSection</code> are read.  If they all have already been read, this method returns false.
+
== Statistics ==
 +
=== Statistic Generation ===
  
Validates that the record is sorted according to the value set by <code>setSortedValidation</code>.  No sorting validation is done if specified to be unsorted, or <code>setSortedValidation</code> was never called.
+
The following statistics can be optionally recorded when reading a SamFile by specifying <code>SamFile::GenerateStatistics()</code> and displayed with <code>SamFile::PrintStatistics()</code>
  
Returns false if the record was not successfully read or not properly sorted.  Returns true if successfully read and properly sorted.
+
The statistics only reflect alignments that were successfully read from the BAM file. Alignments that failed to parse from the file are not reflected in the statistics, but alignments that are invalid for other reasons may show up in the statistics.
|-
 
| <code>bool SamFile::WriteRecord(SamFileHeader& header, SamRecord& record)</code>
 
| Writes the specified record into the file.
 
Validates that the record is sorted according to the value set by <code>setSortedValidation</code>.  No sorting validation is done if specified to be unsorted, or <code>setSortedValidation</code> was never called.  Returns false and does not write the record if the record was not properly sorted.
 
  
Returns false if the record was not properly sorted or not successfully written.  Returns true if properly sorted and successfully written.
+
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
|-style="background: #f2f2f2; text-align: center;"
 +
|+ '''Read Counts'''
 +
! Statistic !! Description
 
|-
 
|-
| <code>void SamFile::setSortedValidation(SortedType sortType)<\code>
+
|TotalReads
| Set the flag to validate that the file is sorted as it is read/written.  Must be called after the file has been opened.
+
| Total number of alignments that were successfully read from the file.
sortType specifies the type of sort to be checked for.
 
 
|-
 
|-
| <code>uint32_t SamFile::GetCurrentRecordCount()</code>
+
|MappedReads
| Return the number of records that have been read/written so far.
+
| Total number of alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).
 
|-
 
|-
| <code>SamStatus::Status SamFile::GetStatus()</code>
+
|PairedReads
| Get the status result of the last status reporting method call.
+
| Total number of alignments that were successfully read from the file with FLAG bit 0x001 set to 1 (paired).
 
|-
 
|-
| <code>const char* SamFile::GetStatusMessage()</code>
+
|ProperPair
| Get the Status Message of the last call that sets status.
+
| Total number of alignments that were successfully read from the file with FLAG bits 0x001 set to 1 (paired) AND 0x002 (proper pair).
 
|-
 
|-
| '''DEPRECATED: '''<code>SamStatus::Status SamFile::GetFailure()</code>
+
|DuplicateReads
| Get the type of failure that occurred on a method failure.
+
| Total number of alignments that were successfully read from the file with FLAG bit 0x400 set to 1 (PCR or optical duplicate).
 
|-
 
|-
| <code>bool SamFile::SetReadSection(int32_t refID)</code>
+
|QCFailureReads
| Tell the class which reference ID should be read from the BAM file.  This is the index into the BAM Index list of reference information: 0 - #references.  The records for that reference id will be retrieved on each <code>ReadRecord</code> call.  When all records have been retrieved for the specified reference id, <code>ReadRecord</code> will return false until a new read section is set.
+
| Total number of alignments that were successfully read from the file with FLAG bit 0x200 set to 1 (failed quality checks).
Pass in -1 in order to read the section of the bam file not associated with any reference ID.
 
Returns true if the read section was successfully set, false if not.  False is returned if the BAM Index File has not yet been read or if a BAM file is not open for reading.
 
 
|}
 
|}
  
=== Class Enums ===
 
 
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 
|-style="background: #f2f2f2; text-align: center;"
 
|-style="background: #f2f2f2; text-align: center;"
! colspan="2"| enum SortedType
+
|- '''Read Percentages'''
 +
! Statistic !! Description
 
|-
 
|-
! Enum Value !!  Description
+
|MappingRate(%)
 +
| 100 * MappedReads/TotalReads
 
|-
 
|-
| UNSORTED
+
|PairedReads(%)
| Do not do any sorting validation.
+
| 100 * PairedReads/TotalReads
 
|-
 
|-
| FLAG
+
|ProperPair(%)
| Validate that the file is sorted by the type specified in the SO Tag of the file's header.
+
| 100 * ProperPair/TotalReads
 
|-
 
|-
| COORDINATE
+
|DupRate(%)
| Validate that the file is sorted by Coordinate.
+
| 100 * DuplicateReads/TotalReads
 
|-
 
|-
| QUERY_NAME
+
|QCFailRate(%)
| Validate that the file is sorted by Query Name.
+
| 100 * QCFailureReads/TotalReads
 
|}
 
|}
  
=== Usage Examples ===
+
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
==== Simple Read/Write SAM/BAM files ====
+
|-style="background: #f2f2f2; text-align: center;"
The following example reads in a sam/bam file and writes it out as a sam/bam file.  The file format of the input sam/bam is determined by the SamFile class based on reading the type from the file.  The file format of the output sam/bam file is determined by the '''SamFile''' class based on the extension of the output file.  A ".bam" extension indicates a BAM file.  All other extensions indicate SAM files.
+
|- '''Base Counts'''
 
+
! Statistic !! Description
<source lang="cpp">
+
|-
#include "SamFile.h"
+
|TotalBases
#include "SamValidation.h"
+
| Sum of the SEQ lengths for all alignments that were successfully read from the file.
 
+
NOTE: Includes bases that are 'N'.
int main(int argc, char ** argv)
+
|-
{
+
|BasesInMappedReads
  // Check for the appropriate number of arguments.
+
| Sum of the SEQ lengths for all alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).
  if(argc != 3)
+
NOTE: Includes bases that are 'N'.
  {
+
|}
      printf("./bam <inputFile> <outputFile.sam/bam>\n");
 
      exit(-1);
 
  }
 
 
 
  // Open the input file for reading.
 
  SamFile samIn;
 
  if(!samIn.OpenForRead(argv[1]))
 
  {
 
      fprintf(stderr, "%s\n", samIn.GetStatusMessage());
 
      return(samIn.GetStatus());
 
  }
 
 
 
  // Open the output file for writing.
 
  SamFile samOut;
 
  if(!samOut.OpenForWrite(argv[2]))
 
  {
 
      fprintf(stderr, "%s\n", samOut.GetStatusMessage());
 
      return(samOut.GetStatus());
 
  }
 
 
 
  // Read the sam header.
 
  SamFileHeader samHeader;
 
  if(!samIn.ReadHeader(samHeader))
 
  {
 
      fprintf(stderr, "%s\n", samIn.GetStatusMessage());
 
      return(samIn.GetStatus());
 
  }
 
 
 
  // Write the sam header.
 
  if(!samOut.WriteHeader(samHeader))
 
  {
 
      fprintf(stderr, "%s\n", samOut.GetStatusMessage());
 
      return(samOut.GetStatus());   
 
  }
 
 
 
  SamRecord samRecord;
 
  SamValidationErrors samInvalidErrors;
 
 
 
  // Set writeStatus to success.  It will be changed
 
  // to the failure reason if any of the writes fail.
 
  SamStatus::Status writeStatus = SamStatus::SUCCESS;
 
 
 
  // Keep reading records until ReadRecord returns false.
 
  while(samIn.ReadRecord(samHeader, samRecord))
 
  {
 
      // Successfully read a record from the file, so write it.
 
      if(!samOut.WriteRecord(samHeader, samRecord))
 
      {
 
        // Failed to write a record.
 
        fprintf(stderr, "%s\n", samOut.GetStatusMessage());
 
        writeStatus = samOut.GetStatus();
 
      }
 
  }
 
 
 
  if(samIn.GetStatus() != SamStatus::NO_MORE_RECS)
 
  {
 
      // Failed to read a record.
 
      fprintf(stderr, "%s\n", samIn.GetStatusMessage());
 
  }
 
 
 
  std::cout << std::endl << "Number of records read = " <<
 
      samIn.GetCurrentRecordCount() << std::endl;
 
  std::cout << "Number of records written = " <<
 
      samOut.GetCurrentRecordCount() << std::endl;
 
 
 
  if(samIn.GetStatus() != SamStatus::NO_MORE_RECS)
 
  {
 
      // Failed reading a record.
 
      return(samIn.GetStatus());
 
  }
 
 
 
  // Since the reads were successful, return the status based
 
  // on the status of the writes.  If any failed, return
 
  // their failure status.
 
  return(writeStatus);
 
}
 
</source>
 
 
 
 
 
==== Simple Read/Write Indexed BAM Files ====
 
This example reads in the inputFilename bam file and writes it back out section by section to the specified outputFilename, starting with section -1.  It also prints a count of the number of records in each section.
 
<source lang="cpp">
 
#include "SamFile.h"
 
#include "SamValidation.h"
 
 
 
int ReadIndexedBam(const char* inputFilename,
 
                  const char* outputFilename,
 
                  const char* indexFilename)
 
{
 
  // Open the input file for reading.
 
  SamFile samIn;
 
  if(!samIn.OpenForRead(inputFilename))
 
  {
 
      fprintf(stderr, "%s\n", samIn.GetStatusMessage());
 
      return(samIn.GetStatus());
 
  }
 
 
 
  // Open the bam index file for reading.
 
  if(!samIn.ReadBamIndex(indexFilename))
 
  {
 
      fprintf(stderr, "%s\n", samIn.GetStatusMessage());
 
      return(samIn.GetStatus());
 
  }
 
 
 
  // Open the output file for writing.
 
  SamFile samOut;
 
  if(!samOut.OpenForWrite(outputFilename))
 
  {
 
      fprintf(stderr, "%s\n", samOut.GetStatusMessage());
 
      return(samOut.GetStatus());
 
  }
 
 
 
  // Read the sam header.
 
  SamFileHeader samHeader;
 
  if(!samIn.ReadHeader(samHeader))
 
  {
 
      fprintf(stderr, "%s\n", samIn.GetStatusMessage());
 
      return(samIn.GetStatus());
 
  }
 
 
 
  // Write the sam header.
 
  if(!samOut.WriteHeader(samHeader))
 
  {
 
      fprintf(stderr, "%s\n", samOut.GetStatusMessage());
 
      return(samOut.GetStatus());   
 
  }
 
 
 
  SamRecord samRecord;
 
  SamValidationErrors samInvalidErrors;
 
 
 
  // Loop through each Reference.
 
  for(int i = -1; i < 23; i++)
 
  {
 
      int numSectionRecords = 0;
 
      samIn.SetReadSection(i);
 
      // Keep reading records until they aren't more.
 
      while(samIn.ReadRecord(samHeader, samRecord))
 
      {
 
        numSectionRecords++;
 
        // Successfully read a record from the file, so write it.
 
        if(!samOut.WriteRecord(samHeader, samRecord))
 
        {
 
            // Failed to write a record.
 
            fprintf(stderr, "%s\n", samOut.GetStatusMessage());
 
        }
 
      }
 
 
 
      if(samIn.GetStatus() != SamStatus::NO_MORE_RECS)
 
      {
 
        // Failed to read a record.
 
        fprintf(stderr, "%s\n", samIn.GetStatusMessage());
 
      }
 
 
 
      std::cout << "Reference ID " << i << " has " << numSectionRecords
 
                << " records" << std::endl;
 
  }
 
 
 
  std::cout << "Number of records = " << samIn.GetCurrentRecordCount() << std::endl;
 
 
 
  return(0);
 
}
 
</source>
 
 
 
==== Validate SAM/BAM Files Continuing Processing After Allowed Errors ====
 
This example demonstrates calling validation on a SAM/BAM file.  It also demonstrates a more complicated read loop that continues until a configurable number of errors are hit.
 
<source lang="cpp">
 
#include "SamFile.h"
 
#include "SamValidation.h"
 
 
 
int Validate(String inFile, int quitAfterErrorNum, int maxReportedErrors, SamFile::SortedType sortType)
 
{
 
  // Check to see if the in file was specified.  If it wasn't this is an error.
 
  if(inFile == "")
 
  {
 
      // In file was not specified but it is mandatory.
 
      std::cerr << "--in is a mandatory argument for validate, "
 
                << "but was not specified" << std::endl;
 
      return(-1);
 
  }
 
 
 
  SamFile samIn;
 
  // Open the file for reading. 
 
  if(!samIn.OpenForRead(inFile))
 
  {
 
      fprintf(stderr, "%s\n", samIn.GetStatusMessage());
 
      return(samIn.GetStatus());
 
  }
 
 
 
  // Set the sorting validation type.
 
  samIn.setSortedValidation(sortType);
 
 
 
  // Read the sam header.
 
  SamFileHeader samHeader;
 
  if(!samIn.ReadHeader(samHeader))
 
  {
 
      fprintf(stderr, "%s\n", samIn.GetStatusMessage());
 
      return(samIn.GetStatus());
 
  }
 
 
 
  // Read the sam records.
 
  SamRecord samRecord;
 
 
 
  // Track the status.
 
  SamStatus::Status status = SamStatus::SUCCESS;
 
  
  // Keep reading records until the end of the file is reached.
+
NOTE: If the TotalReads is greater than 10^6, then the Read Counts and Base Counts specify the total counts divided by 10^6.  This is indicated in the output with a (e6) appended to the field name.
  int numValidRecords = 0;
 
  int numInvalidRecords = 0;
 
  int numErrorRecords = 0;
 
  int numRecords = 0;
 
  int numReportedErrors = 0;
 
  int totalErrorRecords = 0;
 
  
  SamValidationErrors invalidSamErrors;
+
==== Example Statistics Output ====
 +
<pre>
 +
TotalReads(e6) 18.90
 +
MappedReads(e6) 14.77
 +
PairedReads(e6) 18.90
 +
ProperPair(e6) 11.28
 +
DuplicateReads(e6) 0.00
 +
QCFailureReads(e6) 0.00
  
  // Keep reading records from the file until SamFile::ReadRecord
+
MappingRate(%) 78.17
  // indicates to stop (returns false).
+
PairedReads(%) 100.00
  while( (samIn.ReadRecord(samHeader, samRecord))||
+
ProperPair(%) 59.68
          (SamStatus::isContinuableStatus(samIn.GetStatus()) &&
+
DupRate(%) 0.00
          ( (quitAfterErrorNum < 0) ||
+
QCFailRate(%) 0.00
            (numErrorRecords < quitAfterErrorNum) )) )
 
  {
 
      ++numRecords;
 
      if(samIn.GetStatus() == SamStatus::SUCCESS)
 
      {
 
        // Successfully set the record, so check to see if it is valid.
 
        // Clear any errors in the list.
 
        invalidSamErrors.clear();
 
        if(!SamValidator::isValid(samHeader, samRecord, invalidSamErrors))
 
        {
 
            // The record is not valid.
 
            ++numInvalidRecords;
 
            ++totalErrorRecords;
 
            if(numReportedErrors < maxReportedErrors)
 
            {
 
              std::cout << "Record " << numRecords << std::endl
 
                        << invalidSamErrors << std::endl;
 
              numReportedErrors += invalidSamErrors.numErrors();
 
              status = SamStatus::INVALID;
 
            }
 
        }
 
        else
 
        {
 
            // Valid record, so increment the counter.
 
            ++numValidRecords;
 
        }
 
      }
 
      else
 
      {
 
        // Error reading the record.
 
        ++numErrorRecords;
 
        if(numReportedErrors < maxReportedErrors)
 
        {
 
            // report error.
 
            std::cout << "Record " << numRecords << std::endl
 
                      << samIn.GetStatusMessage() << std::endl
 
                      << std::endl;
 
            ++numReportedErrors;
 
            ++totalErrorRecords;
 
            status = samIn.GetStatus();
 
        }
 
      }
 
  }
 
  
  if(samIn.GetStatus() != SamStatus::NO_MORE_RECS)
+
TotalBases(e6) 699.30
  {
+
BasesInMappedReads(e6) 546.67
      ++numErrorRecords;
+
</pre>
      // The last read call had a failure, so report it.
 
      if(numReportedErrors < maxReportedErrors)
 
      {
 
        std::cout << "Record " << numRecords << ": ";
 
        std::cout << std::endl << samIn.GetStatusMessage() << std::endl;
 
      }
 
      status = samIn.GetStatus();
 
  }
 
  
  printf("\nNumber of records read = %d\n", numRecords);
+
== Usage Examples ==
  printf("Number of valid records = %d\n", numValidRecords);
+
[[Sam Library Usage Examples]]
  printf("Returning: %d (%s)\n", status, SamStatus::getStatusString(status));
 
  return(status);
 
}
 
</source>
 

Latest revision as of 11:03, 2 February 2017


Reading/Writing SAM/BAM Files In Your Program

The SamFile class allows a user to easily read/write a SAM/BAM file.

The SamFile class contains additional functionality that allows a user to read specific sections of sorted & indexed BAM files. In order take advantage of this capability, the index file must be read prior to setting the read section. This logic saves the time of having to read the entire file and takes advantage of the seeking capability of BGZF files.

Future Enhancements: Add the ability to read alignments that match a given start, end position for a specific reference sequence.

This class is part of C++ Library: libStatGen.

Class Documentation

See: http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFile.html

Child Classes

SamFileReader

http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFileReader.html

SamFileWriter

http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFileWriter.html

Statistics

Statistic Generation

The following statistics can be optionally recorded when reading a SamFile by specifying SamFile::GenerateStatistics() and displayed with SamFile::PrintStatistics()

The statistics only reflect alignments that were successfully read from the BAM file. Alignments that failed to parse from the file are not reflected in the statistics, but alignments that are invalid for other reasons may show up in the statistics.

Read Counts
Statistic Description
TotalReads Total number of alignments that were successfully read from the file.
MappedReads Total number of alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).
PairedReads Total number of alignments that were successfully read from the file with FLAG bit 0x001 set to 1 (paired).
ProperPair Total number of alignments that were successfully read from the file with FLAG bits 0x001 set to 1 (paired) AND 0x002 (proper pair).
DuplicateReads Total number of alignments that were successfully read from the file with FLAG bit 0x400 set to 1 (PCR or optical duplicate).
QCFailureReads Total number of alignments that were successfully read from the file with FLAG bit 0x200 set to 1 (failed quality checks).
Statistic Description
MappingRate(%) 100 * MappedReads/TotalReads
PairedReads(%) 100 * PairedReads/TotalReads
ProperPair(%) 100 * ProperPair/TotalReads
DupRate(%) 100 * DuplicateReads/TotalReads
QCFailRate(%) 100 * QCFailureReads/TotalReads
Statistic Description
TotalBases Sum of the SEQ lengths for all alignments that were successfully read from the file.

NOTE: Includes bases that are 'N'.

BasesInMappedReads Sum of the SEQ lengths for all alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).

NOTE: Includes bases that are 'N'.

NOTE: If the TotalReads is greater than 10^6, then the Read Counts and Base Counts specify the total counts divided by 10^6. This is indicated in the output with a (e6) appended to the field name.

Example Statistics Output

TotalReads(e6)	18.90
MappedReads(e6)	14.77
PairedReads(e6)	18.90
ProperPair(e6)	11.28
DuplicateReads(e6)	0.00
QCFailureReads(e6)	0.00

MappingRate(%)	78.17
PairedReads(%)	100.00
ProperPair(%)	59.68
DupRate(%)	0.00
QCFailRate(%)	0.00

TotalBases(e6)	699.30
BasesInMappedReads(e6)	546.67

Usage Examples

Sam Library Usage Examples