Difference between revisions of "C++ Class: SamFile"

From Genome Analysis Wiki
Jump to navigationJump to search
 
(30 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 +
[[Category:C++]]
 +
[[Category:libStatGen]]
 +
[[Category:libStatGen BAM]]
 +
 
== Reading/Writing SAM/BAM Files In Your Program ==
 
== Reading/Writing SAM/BAM Files In Your Program ==
 
The '''SamFile''' class allows a user to easily read/write a SAM/BAM file.
 
The '''SamFile''' class allows a user to easily read/write a SAM/BAM file.
  
=== Class Methods ===
+
The '''SamFile''' class contains additional functionality that allows a user to read specific sections of sorted & indexed BAM filesIn order take advantage of this capability, the index file must be read prior to setting the read sectionThis logic saves the time of having to read the entire file and takes advantage of the seeking capability of BGZF files.  
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 
|-style="background: #f2f2f2; text-align: center;"  '''SamFile Class Methods'''
 
! Method Name !!  Description
 
|-
 
| <code>bool SamFile::IsEOF()</code>
 
| bool: true if the end of file has been reached, false if not.
 
|-
 
| <code>bool SamFile::OpenForRead(const char* filename)</code>
 
| Opens the specified file for reading.
 
Determines if it is a BAM/SAM file by reading the beginning of the file.
 
Returns true if successfully opened reading, false if not.
 
|-
 
| <code>bool SamFile::OpenForWrite(const char * filename)</code>
 
| bool: true if successfully opened, false if not.
 
Opens as BAM file if the specified filename ends in .bamOtherwise it is opened as a SAM file.
 
Returns true if successfully opened for writing, false if not.
 
|-
 
| <code>void SamFile::Close()</code>
 
| Close the file if there is one open.
 
|-
 
| <code>bool SamFile::ReadHeader(SamFileHeader& header)</code>
 
| Reads the header section from the file and stores it in the passed in header.
 
Returns true if successfully read, false if not.
 
|-
 
| <code>bool SamFile::WriteHeader(SamFileHeader& header)</code>
 
| Writes the specified header into the file.
 
Returns true if successfully written, false if not.
 
|- 
 
| <code>bool SamFile::ReadRecord(SamFileHeader& header, SamRecord& record)</code>
 
| Reads the next record from the file and stores it in the passed in record.
 
Returns true if successfully read, false if not.
 
|-
 
| <code>bool SamFile::WriteRecord(SamFileHeader& header, SamRecord& record)</code>
 
| Writes the specified record into the file.
 
Returns true if successfully written, false if not.
 
|-
 
| <code>void setSortedValidation(SortedType sortType)<\code>
 
| Set the flag to validate that the file is sorted as it is read/writtenMust be called after the file has been opened.
 
sortType specifies the type of sort to be checked for.
 
|-
 
| <code>uint32_t GetCurrentRecordCount()</code>
 
| Return the number of records that have been read/written so far.
 
|-
 
| <code>SamStatus::Status GetFailure()</code>
 
| Get the type of failure that occurred on a method failure.
 
|}
 
  
 +
'''Future Enhancements:''' Add the ability to read alignments that match a given start, end position for a specific reference sequence.
  
=== Usage Example ===
+
This class is part of [[C++ Library: libStatGen|C++ Library: libStatGen]].
The following example reads in a sam/bam file and writes it out as a sam/bam file.  The file format of the input sam/bam is determined by the SamFile class based on reading the type from the file.  The file format of the output sam/bam file is determined by the '''SamFile''' class based on the extension of the output file.  A ".bam" extension indicates a BAM file.  All other extensions indicate SAM files.
 
<source lang="cpp">
 
int main(int argc, char ** argv)
 
{
 
  if(argc != 3)
 
  {
 
      printf("./bam <inputFile> <outputFile.sam/bam>\n");
 
      exit(-1);
 
  }
 
  
 +
=== Class Documentation ===
  
  SamFile samIn;
+
See: http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFile.html
     
 
  samIn.OpenForRead(argv[1]);
 
  
  SamFile samOut;
+
== Child Classes ==
 +
=== SamFileReader ===
 +
http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFileReader.html
  
  samOut.OpenForWrite(argv[2]);
+
=== SamFileWriter ===
 +
http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFileWriter.html
  
  // Read the sam header.
+
== Statistics ==
  SamFileHeader samHeader;
+
=== Statistic Generation ===
  samIn.ReadHeader(samHeader);
 
  
  samOut.WriteHeader(samHeader);
+
The following statistics can be optionally recorded when reading a SamFile by specifying <code>SamFile::GenerateStatistics()</code> and displayed with <code>SamFile::PrintStatistics()</code>
  
  // Read the first sam record.
+
The statistics only reflect alignments that were successfully read from the BAM file.  Alignments that failed to parse from the file are not reflected in the statistics, but alignments that are invalid for other reasons may show up in the statistics.
  SamRecord samRecord;
 
  
  // Track if any of the records are invalid.  Initialize to 0, it will
+
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
  // be set to -1 on an invalid record.
+
|-style="background: #f2f2f2; text-align: center;"
  int status = 0;
+
|+ '''Read Counts'''
 
+
! Statistic !! Description
  // Keep reading records until the end of the file is reached.
+
|-
  int numValidRecords = 0;
+
|TotalReads
  while(!samIn.IsEOF())
+
| Total number of alignments that were successfully read from the file.
  {
+
|-
      if(samIn.ReadRecord(samHeader, samRecord) == true)
+
|MappedReads
      {
+
| Total number of alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).
        // Successfully read a record from the file, so check to see
+
|-
        // if it is valid.
+
|PairedReads
        if(samRecord.isValid(samHeader))
+
| Total number of alignments that were successfully read from the file with FLAG bit 0x001 set to 1 (paired).
        {
+
|-
            //  It is valid, so write it.
+
|ProperPair
            numValidRecords++;
+
| Total number of alignments that were successfully read from the file with FLAG bits 0x001 set to 1 (paired) AND 0x002 (proper pair).
            samOut.WriteRecord(samHeader, samRecord);
+
|-
        }
+
|DuplicateReads
        else
+
| Total number of alignments that were successfully read from the file with FLAG bit 0x400 set to 1 (PCR or optical duplicate).
        {
+
|-
            // There is at least one invalid record, return -1.
+
|QCFailureReads
            status = -1;
+
| Total number of alignments that were successfully read from the file with FLAG bit 0x200 set to 1 (failed quality checks).
        }
+
|}
      }
 
      else
 
      {
 
        // Failed to read the record.  Check to see if it failed due to
 
        // there being no more records, which is not a failure.  Any
 
        // other failure reason counts as a failure.
 
        if(samIn.GetFailure() != SamStatus::NO_MORE_RECS)
 
        {
 
            // Read failed, invalid.
 
            status = -1;
 
        }
 
      }
 
  }
 
  std::cout << "Number of records read = " <<
 
      samIn.GetCurrentRecordCount() << std::endl;
 
  std::cout << "Number of records written = " <<
 
      samOut.GetCurrentRecordCount() << std::endl;
 
  std::cout << "Number of valid records = " << numValidRecords << std::endl;
 
 
 
  return(status);
 
}
 
</source>
 
 
 
== Reading Indexed (and Sorted) BAM Files ==
 
The '''IndexedBamReader''' class allows a user to easily read BAM files that are sorted and indexed.
 
This class allows a user to read only alignments for specific reference sequence.  This saves the time of having to read the entire file.
 
It takes advantage of the seeking capability of BGZF files, using the BAM Index file to determine where in the BAM file to seek to.
 
 
 
'''Future Enhancements''': Add the ability to read alignments that match a given start, end position for a specific reference sequence.
 
  
=== Class Methods ===
 
 
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
|-style="background: #f2f2f2; text-align: center;" '''SamFile Class Methods'''
+
|-style="background: #f2f2f2; text-align: center;"
! Method Name !! Description
+
|- '''Read Percentages'''
|- 
+
! Statistic !! Description
| <code>bool OpenForRead(const char* bamFilename, const char* bamIndexFilename)</code>
 
| Opens the bam file for reading and reads in the corresponding index file.
 
Returns true if successfully opened, false if not.
 
 
|-
 
|-
| <code>bool SetReadSection(int32_t refID)</code>
+
|MappingRate(%)
| Tell the class which reference ID should be read from the BAM file.  This is the index into the BAM Index list of reference information: 0 - #references.  The records for that reference id will be retrieved on each ReadRecord call.  When all records have been retrieved for the specified reference id, ReadRecord will return false until a new read section is set.
+
| 100 * MappedReads/TotalReads
Pass in -1 in order to read the section of the bam file not associated with any reference ID.
 
Returns true if the read section was successfully set, false if not.
 
 
|-
 
|-
| <code>bool ReadRecord(SamFileHeader& header, SamRecord& record)</code>
+
|PairedReads(%)
| Reads the next record from the file and stores it in the passed in record.  Only alignments in the section specified by SetReadSection are read.  If they have all already been read, this method returns false.  If SetReadSection has not been called, then the entire file is read.
+
| 100 * PairedReads/TotalReads
Returns true if successfully read, false if not.
 
 
|-
 
|-
| <code>bool IsEOF()</code>
+
|ProperPair(%)
| While this is available, if you are only reading a specific reference sequence, you may never hit the end of the file, so be careful using this method.
+
| 100 * ProperPair/TotalReads
bool: true if the end of file has been reached, false if not.
 
 
|-
 
|-
| <code>bool OpenForRead(const char* filename)</code>
+
|DupRate(%)
| This method exists, but does not do anything and just returns false.  You cannot open a file without an associated index file.
+
| 100 * DuplicateReads/TotalReads
Returns false.
 
 
|-
 
|-
| <code>bool OpenForWrite(const char * filename)</code>
+
|QCFailRate(%)
| This method exists, but does not do anything and just returns false.  This class is only for reading.
+
| 100 * QCFailureReads/TotalReads
Returns false.
+
|}
 +
 
 +
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
|-style="background: #f2f2f2; text-align: center;"
 +
|- '''Base Counts'''
 +
! Statistic !! Description
 
|-
 
|-
| <code>void Close()</code>
+
|TotalBases
| Close the file if there is one open.
+
| Sum of the SEQ lengths for all alignments that were successfully read from the file.
 +
NOTE: Includes bases that are 'N'.
 
|-
 
|-
| <code>bool ReadHeader(SamFileHeader& header)</code>
+
|BasesInMappedReads
| Reads the header section from the file and stores it in the passed in header.
+
| Sum of the SEQ lengths for all alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).
Returns true if successfully read, false if not.
+
NOTE: Includes bases that are 'N'.
|-
 
| <code>bool WriteHeader(SamFileHeader& header)</code>
 
| This method exists, but does not do anything and just returns false.  This class is only for reading.
 
Returns false.
 
|-
 
| <code>bool WriteRecord(SamFileHeader& header, SamRecord& record)</code>
 
| This method exists, but does not do anything and just returns false.  This class is only for reading.
 
Returns false.
 
 
|}
 
|}
  
 +
NOTE: If the TotalReads is greater than 10^6, then the Read Counts and Base Counts specify the total counts divided by 10^6.  This is indicated in the output with a (e6) appended to the field name.
  
=== Usage Example ===
+
==== Example Statistics Output ====
This example reads in the inputFilename bam file and writes it back out section by section to the specified outputFilename, starting with section -1.  It also prints a count of the number of records in each section.
+
<pre>
<source lang="cpp">
+
TotalReads(e6) 18.90
int ReadIndexedBam(const char* inputFilename,
+
MappedReads(e6) 14.77
                  const char* outputFilename,
+
PairedReads(e6) 18.90
                  const char* indexFilename)
+
ProperPair(e6) 11.28
{
+
DuplicateReads(e6) 0.00
  IndexedBamReader samIn;
+
QCFailureReads(e6) 0.00
     
 
  samIn.OpenForRead(inputFilename, indexFilename);
 
 
 
  SamFile samOut;
 
 
 
  samOut.OpenForWrite(outputFilename);
 
 
 
  // Read the sam header.
 
  SamFileHeader samHeader;
 
  samIn.ReadHeader(samHeader);
 
  // Write the sam header.
 
  samOut.WriteHeader(samHeader);
 
  
  SamRecord samRecord;
+
MappingRate(%) 78.17
 
+
PairedReads(%) 100.00
  int numValidRecords = 0;
+
ProperPair(%) 59.68
  int numRecords = 0;
+
DupRate(%) 0.00
 +
QCFailRate(%) 0.00
  
  // Loop through each Reference.
+
TotalBases(e6) 699.30
  for(int i = -1; i < 23; i++)
+
BasesInMappedReads(e6) 546.67
  {
+
</pre>
      int numSectionRecords = 0;
 
      samIn.SetReadSection(i);
 
      // Keep reading records until they aren't read successfully.
 
      while(samIn.ReadRecord(samHeader, samRecord) == true)
 
      {
 
        numSectionRecords++;
 
        numRecords++;
 
        // Successfully read a record from the file, so check to see
 
        // if it is valid.
 
        if(samRecord.isValid(samHeader))
 
        {
 
            // It is valid, so write it.
 
            numValidRecords++;
 
            samOut.WriteRecord(samHeader, samRecord);
 
        }
 
      }
 
      std::cout << "Reference ID " << i << " has " << numSectionRecords
 
                << " records" << std::endl;
 
  }
 
     
 
  std::cout << "Number of records = " << numRecords << std::endl;
 
  std::cout << "Number of valid records = " << numValidRecords << std::endl;
 
  
  return(0);
+
== Usage Examples ==
}
+
[[Sam Library Usage Examples]]
</source>
 

Latest revision as of 11:03, 2 February 2017


Reading/Writing SAM/BAM Files In Your Program

The SamFile class allows a user to easily read/write a SAM/BAM file.

The SamFile class contains additional functionality that allows a user to read specific sections of sorted & indexed BAM files. In order take advantage of this capability, the index file must be read prior to setting the read section. This logic saves the time of having to read the entire file and takes advantage of the seeking capability of BGZF files.

Future Enhancements: Add the ability to read alignments that match a given start, end position for a specific reference sequence.

This class is part of C++ Library: libStatGen.

Class Documentation

See: http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFile.html

Child Classes

SamFileReader

http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFileReader.html

SamFileWriter

http://csg.sph.umich.edu//mktrost/doxygen/current/classSamFileWriter.html

Statistics

Statistic Generation

The following statistics can be optionally recorded when reading a SamFile by specifying SamFile::GenerateStatistics() and displayed with SamFile::PrintStatistics()

The statistics only reflect alignments that were successfully read from the BAM file. Alignments that failed to parse from the file are not reflected in the statistics, but alignments that are invalid for other reasons may show up in the statistics.

Read Counts
Statistic Description
TotalReads Total number of alignments that were successfully read from the file.
MappedReads Total number of alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).
PairedReads Total number of alignments that were successfully read from the file with FLAG bit 0x001 set to 1 (paired).
ProperPair Total number of alignments that were successfully read from the file with FLAG bits 0x001 set to 1 (paired) AND 0x002 (proper pair).
DuplicateReads Total number of alignments that were successfully read from the file with FLAG bit 0x400 set to 1 (PCR or optical duplicate).
QCFailureReads Total number of alignments that were successfully read from the file with FLAG bit 0x200 set to 1 (failed quality checks).
Statistic Description
MappingRate(%) 100 * MappedReads/TotalReads
PairedReads(%) 100 * PairedReads/TotalReads
ProperPair(%) 100 * ProperPair/TotalReads
DupRate(%) 100 * DuplicateReads/TotalReads
QCFailRate(%) 100 * QCFailureReads/TotalReads
Statistic Description
TotalBases Sum of the SEQ lengths for all alignments that were successfully read from the file.

NOTE: Includes bases that are 'N'.

BasesInMappedReads Sum of the SEQ lengths for all alignments that were successfully read from the file with FLAG bit 0x004 set to 0 (not unmapped).

NOTE: Includes bases that are 'N'.

NOTE: If the TotalReads is greater than 10^6, then the Read Counts and Base Counts specify the total counts divided by 10^6. This is indicated in the output with a (e6) appended to the field name.

Example Statistics Output

TotalReads(e6)	18.90
MappedReads(e6)	14.77
PairedReads(e6)	18.90
ProperPair(e6)	11.28
DuplicateReads(e6)	0.00
QCFailureReads(e6)	0.00

MappingRate(%)	78.17
PairedReads(%)	100.00
ProperPair(%)	59.68
DupRate(%)	0.00
QCFailRate(%)	0.00

TotalBases(e6)	699.30
BasesInMappedReads(e6)	546.67

Usage Examples

Sam Library Usage Examples