C++ Class: SamFile

From Genome Analysis Wiki
Revision as of 09:18, 31 March 2010 by Mktrost (talk | contribs)
Jump to: navigation, search

Reading/Writing SAM/BAM Files In Your Program

Reading/Writing Standard SAM/BAM Files

The SamFile class allows a user to easily read/write a SAM/BAM file. The methods found in this class are:

Method Name Description
bool SamFile::IsEOF() bool: true if the end of file has been reached, false if not.
bool SamFile::OpenForRead(const char* filename) Opens the specified file for reading.

Determines if it is a BAM/SAM file by reading the beginning of the file. Returns true if successfully opened reading, false if not.

bool SamFile::OpenForWrite(const char * filename) bool: true if successfully opened, false if not.

Opens as BAM file if the specified filename ends in .bam. Otherwise it is opened as a SAM file. Returns true if successfully opened for writing, false if not.

void SamFile::Close() Close the file if there is one open.
bool SamFile::ReadHeader(SamFileHeader& header) Reads the header section from the file and stores it in the passed in header.

Returns true if successfully read, false if not.

bool SamFile::WriteHeader(SamFileHeader& header) Writes the specified header into the file.

Returns true if successfully written, false if not.

bool SamFile::ReadRecord(SamFileHeader& header, SamRecord& record) Reads the next record from the file and stores it in the passed in record.

Returns true if successfully read, false if not.

bool SamFile::WriteRecord(SamFileHeader& header, SamRecord& record) Writes the specified record into the file.

Returns true if successfully written, false if not.


Usage Example

The following example reads in a sam/bam file and writes it out as a sam/bam file. The file format of the input sam/bam is determined by the SamFile class based on reading the type from the file. The file format of the output sam/bam file is determined by the SamFile class based on the extension of the output file. A ".bam" extension indicates a BAM file. All other extensions indicate SAM files.

int main(int argc, char ** argv)
{
   if(argc != 3)
   {
      printf("./bam <inputFile> <outputFile.sam/bam>\n");
      exit(-1);
   }


   SamFile samIn;
      
   samIn.OpenForRead(argv[1]);

   SamFile samOut;

   samOut.OpenForWrite(argv[2]);

   // Read the sam header.
   SamFileHeader samHeader;
   samIn.ReadHeader(samHeader);

   samOut.WriteHeader(samHeader);

   // Read the first sam record.
   SamRecord samRecord;

   // Keep reading records until the end of the file is reached.
   int numValidRecords = 0;
   int numRecords = 0;
   while(!samIn.IsEOF())
   {
      if(samIn.ReadRecord(samHeader, samRecord) == true)
      {
         numRecords++;
         // Successfully read a record from the file, so check to see
         // if it is valid.
         if(samRecord.isValid(samHeader))
         {
            //  It is valid, so write it.
            numValidRecords++;
            samOut.WriteRecord(samHeader, samRecord);
         }
      }
      
   }
   std::cout << "Number of records = " << numRecords << std::endl;
   std::cout << "Number of valid records = " << numValidRecords << std::endl;
}


Reading Indexed (and Sorted) BAM Files

The IndexedBamReader class allows a user to easily read BAM files that are sorted and indexed. This class allows a user to read only alignments for specific reference sequence. This saves the time of having to read the entire file. It takes advantage of the seeking capability of BGZF files, using the BAM Index file to determine where in the BAM file to seek to.

Future Enhancements: Add the ability to read alignments that match a given start, end position for a specific reference sequence. The methods found in this class are:

SamFile Class Methods
Method Name Description
bool OpenForRead(const char* bamFilename, const char* bamIndexFilename) Opens the bam file for reading and reads in the corresponding index file.

Returns true if successfully opened, false if not.

bool SetReadSection(int32_t refID) Tell the class which reference ID should be read from the BAM file. This is the index into the BAM Index list of reference information: 0 - #references. The records for that reference id will be retrieved on each ReadRecord call. When all records have been retrieved for the specified reference id, ReadRecord will return false until a new read section is set.

Pass in -1 in order to read the section of the bam file not associated with any reference ID. Returns true if the read section was successfully set, false if not.

bool ReadRecord(SamFileHeader& header, SamRecord& record) Reads the next record from the file and stores it in the passed in record. Only alignments in the section specified by SetReadSection are read. If they have all already been read, this method returns false. If SetReadSection has not been called, then the entire file is read.

Returns true if successfully read, false if not.

bool IsEOF() While this is available, if you are only reading a specific reference sequence, you may never hit the end of the file, so be careful using this method.

bool: true if the end of file has been reached, false if not.

bool OpenForRead(const char* filename) This method exists, but does not do anything and just returns false. You cannot open a file without an associated index file.

Returns false.

bool OpenForWrite(const char * filename) This method exists, but does not do anything and just returns false. This class is only for reading.

Returns false.

void Close() Close the file if there is one open.
bool ReadHeader(SamFileHeader& header) Reads the header section from the file and stores it in the passed in header.

Returns true if successfully read, false if not.

bool WriteHeader(SamFileHeader& header) This method exists, but does not do anything and just returns false. This class is only for reading.

Returns false.

bool WriteRecord(SamFileHeader& header, SamRecord& record) This method exists, but does not do anything and just returns false. This class is only for reading.

Returns false.


Usage Example

This example reads in the inputFilename bam file and writes it back out section by section to the specified outputFilename, starting with section -1. It also prints a count of the number of records in each section.

int ReadIndexedBam(const char* inputFilename,
                   const char* outputFilename,
                   const char* indexFilename)
{
   IndexedBamReader samIn;
      
   samIn.OpenForRead(inputFilename, indexFilename);

   SamFile samOut;

   samOut.OpenForWrite(outputFilename);

   // Read the sam header.
   SamFileHeader samHeader;
   samIn.ReadHeader(samHeader);
   // Write the sam header.
   samOut.WriteHeader(samHeader);

   SamRecord samRecord;
   
   int numValidRecords = 0;
   int numRecords = 0;

   // Loop through each Reference.
   for(int i = -1; i < 23; i++)
   {
      int numSectionRecords = 0;
      samIn.SetReadSection(i);
      // Keep reading records until they aren't read successfully.
      while(samIn.ReadRecord(samHeader, samRecord) == true)
      {
         numSectionRecords++;
         numRecords++;
         // Successfully read a record from the file, so check to see
         // if it is valid.
         if(samRecord.isValid(samHeader))
         {
            //  It is valid, so write it.
            numValidRecords++;
            samOut.WriteRecord(samHeader, samRecord);
         }
      }
      std::cout << "Reference ID " << i << " has " << numSectionRecords 
                << " records" << std::endl;
   }
      
   std::cout << "Number of records = " << numRecords << std::endl;
   std::cout << "Number of valid records = " << numValidRecords << std::endl;

   return(0);
}