LibStatGen: BAM

From Genome Analysis Wiki
Revision as of 10:05, 6 April 2010 by Mktrost (talk | contribs) (Reading/Writing SAM/BAM Files In Your Program)
Jump to: navigation, search

SAM/BAM File

SAM Library Change Log

Read & Write BAM/SAM Executable

When the pipeline is compiled, the sam/bam executable, "bam" is generated in the pipeline/bam/ directory.

This executable takes 2 arguments. The first argument is the input file. The second argument is the output file. The executable converts the first file into the format of the second file. So if you want to convert a Bam file to a SAM file, from the pipeline/bam/ directory you just call:

./bam <bamFile>.bam <newSamFile>.sam

Don't forget to put in the paths to the executable and your test files.

The software reads the beginning of the input file to determine if it is SAM/BAM. To determine the format (SAM/BAM) of the output file, the software checks the output file's extension. If the extension is ".bam" it writes a BAM file, otherwise it writes a SAM file.


Read & Write BAM/SAM Library

The software for reading/writing/parsing/validating SAM/BAM files is also available in library format. The library is also found in pipeline/bam, and is called libbam.a.

This library is dependent on two other libraries, so be sure to include them all in the proper order:

<path to base pipeline directory>/libbam.a <path to base pipeline directory>/libcsg/libcsg.a <path to base pipeline directory>/thirdParty/samtools/libbam.a


Reading/Writing SAM/BAM Files In Your Program

Documentation on the SamFile class can be found at C++ Class: SamFile

Setting fields in a SAM/BAM Header

The SamRecord class contains accessors to set the header lines of a SAM/BAM header. By using these set methods to setup the header, they can be pulled back out using the get accessors or the header can be later written to a SAM/BAM file. The methods found in the SamFileHeader class for setting fields are:

SamFileHeader Class Methods
Method Name Description
bool addHeaderLine(const char* type, const char* tag, int value) Adds the type, tag, and integer value to the header.

Returns true if successfully added, false if not. NOTE: currently, this method will only do one tag per type on a line. If a type has multiple tags, then the whole line needs to be added at once.

bool addHeaderLine(const char* type, const char* tag, const char* value) Adds the type, tag, and const char* value to the header.

Returns true if successfully added, false if not. NOTE: currently, this method will only do one tag per type on a line. If a type has multiple tags, then the whole line needs to be added at once.

bool addHeaderLine(const char* headerLine) Adds the already setup/formatted headerLine to the header. It is assumed that the line does not contain a “\n”.

Returns true if successfully added, false if not.


Setting fields in a SAM/BAM Record

The SamRecord class contains accessors to set the fields of a SAM/BAM record. They are used for creating a record that is not read from a SAM/BAM file. By using these set methods to setup the record, they can be pulled back out using the get accessors or the record can be later written as either a SAM/BAM record. The methods found in the SamRecord class for setting fields are:

SamRecord Class Methods
Method Name Description
void resetRecord() Resets the record to be an empty record. This is not necessary when you are reading a Sam/Bam file, but if you are setting fields, it is a good idea to clean out a record before reusing it. Clearing it allows you to not have to set any empty fields.
bool setReadName(const char* readName) Sets QNAME to the passed in name.

Returns true if successfully set, false if not.

bool setFlag(int16_t flag) Sets the bitwise FLAG to the passed in value.

Returns true if successfully set, false if not.

bool setReferenceName(SamFileHeader& header, const char* referenceName) Sets the reference sequence name. The reference id is calculated using the header.

Returns true if successfully set, false if not.

bool set1BasedPosition(int32_t position) Sets the leftmost position. The value passed in is 1-based (SAM formatted). Internal processing handles switching between SAM/BAM formats when read/written.

Returns true if successfully set, false if not.

bool set0BasedPosition(int32_t position) Sets the leftmost position. The value passed in is 0-based (BAM formatted). Internal processing handles switching between SAM/BAM formats when read/written.

Returns true if successfully set, false if not.

bool setMapQuality(int8_t mapQuality) Sets the mapping quality.

Returns true if successfully set, false if not.

bool setCigar(const char* cigar) Sets the cigar string to the passed in CIGAR. This is a SAM formatted CIGAR string. Internal processing handles switching between SAM/BAM formats when read/written.

Returns true if successfully set, false if not.

bool setMateReferenceName(SamFileHeader& header, const char* referenceName) Sets the mate reference sequence name. The mate reference id is calculated using the header.

Returns true if successfully set, false if not.

bool set1BasedMatePosition(int32_t matePosition) Sets the leftmost mate position. The value passed in is 1-based (SAM formatted). Internal processing handles switching between SAM/BAM formats when read/written.

Returns true if successfully set, false if not.

bool set0BasedMatePosition(int32_t matePosition) Sets the leftmost mate position. The value passed in is 0-based (BAM formatted). Internal processing handles switching between SAM/BAM formats when read/written.

Returns true if successfully set, false if not.

bool setInsertSize(int32_t insertSize) Sets the inferred insert size.

Returns true if successfully set, false if not.

bool setSequence(const char* seq) Sets the sequence string to the passed in string. This is a SAM formatted sequence string. Internal processing handles switching between SAM/BAM formats when read/written.

Returns true if successfully set, false if not.

bool setQuality(const char* quality) Sets the quality string to the passed in string. This is a SAM formatted quality string. Internal processing handles switching between SAM/BAM formats when read/written.

Returns true if successfully set, false if not.

bool addTag(const char* tag, char vtype, const char* value) Adds a tag to the record with the specified tag, vtype, and value. Vtype can be SAM/BAM vtype. Internal processing handles switching between SAM/BAM vtypes when read/written.

Returns true if successfully set, false if not.

When set, SAM fields are validated against: SAM Validation Criteria


Retrieving fields from a SAM/BAM Record

The SamRecord class contains accessors to access the fields of a SAM/BAM record. They assume that the class has already been populated, either by using the set commands or by calling SamFile::ReadRecord. Not all of the values that can be retrieved using these get accessors have set methods. That is because they are internally calculated values if they were not read from a file.

The methods found in the SamRecord class for setting fields are:

SamRecord Class Get Methods
Method Name Description
bool isValid(SamFileHeader& header) Returns true if the record is valid. This performs validation steps. TODO: the method exists, but it does not yet perform any checks, so just returns true.
int32_t getBlockSize() Returns the BAM block size of the record.
const char* getReferenceName(SamFileHeader& header) Returns the reference sequence name (SAM format).
int32_t getReferenceID() Returns the reference sequence ID (BAM format).
int32_t get1BasedPosition() Returns the 1-based (SAM formatted) leftmost position.
int32_t get0BasedPosition() Returns the 0-based (BAM formatted) leftmost position.
int8_t getReadNameLength() Returns the length of the ReadName (QNAME).
int8_t getMapQuality() Returns the map quality.
int16_t getBin() Returns the BAM bin for the record.
int16_t getCigarLength() Returns the length of the CIGAR in BAM format.
int16_t getFlag() Returns the flag.
int32_t getReadLength() Returns the length of the read.
const char* getMateReferenceName(SamFileHeader& header) Returns the mate reference sequence name (SAM format). Returns the mate reference sequence name even if it is the same as the reference sequence name.
const char* getMateReferenceNameOrEqual(SamFileHeader& header) Returns the mate reference sequence name (SAM format). Returns the mate reference sequence name, unless it is the same as the reference sequence name, then an "=" is returned..
int32_t getMateReferenceID() Returns the mate reference sequence id (BAM format).
int32_t get1BasedMatePosition() Returns the 1-based (SAM formatted) mate leftmost position.
int32_t get0BasedMatePosition() Returns the 0-based (BAM formatted) mate leftmost position.
int32_t getInsertSize() Returns the insert size.
int32_t get0BasedAlignmentEnd(); Returns the 0-based inclusive right-most position of the clipped sequence.
int32_t get1BasedAlignmentEnd(); Returns the 1-based inclusive right-most position of the clipped sequence.
int32_t get0BasedUnclippedStart(); Returns the 0-based inclusive left-most position adjusted for clipped bases.
int32_t get1BasedUnclippedStart(); Returns the 1-based inclusive left-most position adjusted for clipped bases.
int32_t get0BasedUnclippedEnd(); Returns the 0-based inclusive right-most position adjusted for clipped bases.
int32_t get1BasedUnclippedEnd(); Returns the 1-based inclusive right-most position adjusted for clipped bases.
const char* getReadName() Returns the SAM formatted Read Name (QNAME).
const char* getCigar() Returns the SAM formatted CIGAR string.
const char* getSequence() Returns the SAM formatted Sequence string.
const char* getQuality() Returns the SAM formatted Quality string.
bool getNextSamTag(char* tag, char& vtype, void** value) Returns true if a tag was read, false if there are no more tags.

For a true return value, tag is sent to the tag of the tag, vtype is set to the vtype of the tag, and value is a pointer to the value of the tag. You will then need to use a switch to cast value to int, double, char, or String.

bool isIntegerType(char vtype) Returns true if the passed in vtype is of integer ('c', 'C', 's', 'S', 'i', 'I') type.
bool isDoubleType(char vtype) Returns true if the passed in vtype is of double ('f') type.
bool isCharType(char vtype) Returns true if the passed in vtype is of char ('A') type.
bool isStringType(char vtype) Returns true if the passed in vtype is of String ('Z') type.


Example of using getNextSamTag:

   // record is a previously setup SamRecord.
   String recordString = "";
   char tag[3];
   char vtype;
   void* value;

   // While there are more tags, write them to the recordString.
   while(record.getNextSamTag(tag, vtype, &value) != false)
   {
      recordString += "\t";
      recordString += tag;
      recordString += ":"; 
      recordString += vtype;
      recordString += ":";
      if(record.isIntegerType(vtype))
      {
         recordString += (int)*(int*)value;
      }
      else if(record.isDoubleType(vtype))
      {
         recordString += (double)*(double*)value;
      }
      else if(record.isCharType(vtype))
      {
         recordString += (char)*(char*)value;
      }
      else
      {
         // String type.
         recordString += (String)*(String*)value;
      }
   }

   recordString += "\n";


Suggested Improvements/Features

  • Add optional user flag for checking if a file is sorted when it is read. It would report an error if it is unsorted.