BAM Review Action Items

From Genome Analysis Wiki
Revision as of 10:16, 17 September 2010 by Mktrost (talk | contribs)
Jump to navigationJump to search

Review Sept 17th

Review Discussion Topics

http://genome.sph.umich.edu/wiki/SAM/BAM_Library_FAQs http://www.sph.umich.edu/csg/mktrost/doxygen/html/

Example of using the library to set values: http://www.sph.umich.edu/csg/mktrost/doxygen/html/WriteFiles_8cpp-source.html

Return Statuses

Currently anytime you do anything on a SAM/BAM file, you have to check the status for failure:

   SamFile samIn;
   if(!samIn.OpenForRead(argv[1]))
   {
      fprintf(stderr, "%s\n", samIn.GetStatusMessage());
      return(samIn.GetStatus());
   }

   // Read the sam header.
   SamFileHeader samHeader;
   if(!samIn.ReadHeader(samHeader))
   {
      fprintf(stderr, "%s\n", samIn.GetStatusMessage());
      return(samIn.GetStatus());
   }

A previous recommendation was to "Add an option by class that says whether or not to abort on failure. (or even an option on each method)"

I am proposing modifying the classes to throw exceptions on failures. It would then be up to the user to catch them if they want to handle them or to let them exit the program (which would print out the error message)

   SamFile samIn;
   samIn.OpenForRead(argv[1]);

   // Read the sam header.
   SamFileHeader samHeader;
   samIn.ReadHeader(samHeader);

   // Open the output file for writing.
   SamFile samOut;
   try
   {
      samOut.OpenForWrite(argv[2]);
      samOut.WriteHeader(samHeader);
   }
   catch(GenomeException e)
   {
      std::cout << "Caught an Exception" << e.what() << std::endl;
   }
   std::cout << "Continue Processing\n";

For caught exceptions, you would see the following and processing would continue:

Caught Exception:
FAIL_IO: Failed to Open testFiles/unknown for writing
Continue Processing

For an uncaught exception, you would see the following and processing would be stopped:

terminate called after throwing an instance of 'GenomeException'
  what():  
FAIL_IO: Failed to Open testFiles/unknown for reading
Aborted


Accessing String Values

SAM/BAM files have strings in them that people will want to read out. How should we handle this interface? Currently we do a mix of returning const char*, like:

const char* SamRecord::getSequence()
{
    myStatus = SamStatus::SUCCESS;
    if(mySequence.Length() == 0)
    {
        // 0 Length, means that it is in the buffer, but has not yet
        // been synced to the string, so do the sync.
        setSequenceAndQualityFromBuffer();
    }
    return mySequence.c_str();
}

and passing in references to strings, like:

// Set the passed in string to the header line at the specified index.
// It does NOT clear the current contents of header.
// NOTE: some indexes will return blank if the entry was deleted.
bool SamFileHeader::getHeaderLine(unsigned int index, std::string& header) const
{
    // Check to see if the index is in range of the header records vector.
    if(index < myHeaderRecords.size())
    {
        // In range of the header records vector, so get the string for
        // that record.
        SamHeaderRecord* hdrRec = myHeaderRecords[index];
        hdrRec->appendString(header);
        return(true);
    }
    else
    {
        unsigned int commentIndex = index - myHeaderRecords.size();
        // Check to see if it is in range of the comments.
        if(commentIndex < myComments.size())
        {
            // It is in range of the comments, so add the type.
            header += "@CO\t";
            // Add the comment.
            header += myComments[commentIndex];
            // Add the new line.
            header += "\n";
            return(true);
        }
    }
    // Invalid index.
    return(false);
}

http://www.sph.umich.edu/csg/mktrost/doxygen/html/SamRecord_8h-source.html

SamFileHeader
  • Should this be renamed to SamHeader?
  • Do we like the classes being named starting with Sam? Should it be Bam?

Should we add the following to SamFileHeader:

    //////////////////////////////////
    // Set methods for header fields.
    bool setVersion(const char* version);
    bool setSortOrder(const char* sortOrder);
    bool addSequenceName(const char* sequenceName);
    bool setSequenceLength(const char* keyID, int sequenceLength);
    bool setGenomeAssemblyId(const char* keyID, const char* genomeAssemblyId);
    bool setMD5Checksum(const char* keyID, const char* md5sum);
    bool setURI(const char* keyID, const char* uri);
    bool setSpecies(const char* keyID, const char* species);
    bool addReadGroupID(const char* readGroupID);
    bool setSample(const char* keyID, const char* sample);
    bool setLibrary(const char* keyID, const char* library);
    bool setDescription(const char* keyID, const char* description);
    bool setPlatformUnit(const char* keyID, const char* platform);
    bool setPredictedMedianInsertSize(const char* keyID, const char* isize);
    bool setSequencingCenter(const char* keyID, const char* center);
    bool setRunDate(const char* keyID, const char* runDate);
    bool setTechnology(const char* keyID, const char* technology);
    bool addProgram(const char* programID);
    bool setProgramVersion(const char* keyID, const char* version);
    bool setCommandLine(const char* keyID, const char* commandLine);
    
    ///////////////////////////////////
    // Get methods for header fields.
    // Returns the number of SQ entries in the header.
    int32_t getSequenceDictionaryCount();
    // Return the Sort Order value that is set in the Header.
    // If this field does not exist, "" is returned.
    const char* getSortOrder();
/// Additional gets for the rest of the fields.

Should these also be added to SamHeaderRG, SamHeaderSQ, etc as appropriate....

Review June 7th

  • Move the examples from the SamFile wiki page to their own page
    • include links from the main library page and the SamFile page.
    • look into why the one example have two if checks on SamIn status - one was printing the result and one was setting the return value - cleaned up to be in one if statement.
  • Create 1 library for all of our library code rather than having libcsg, libbam, libfqf separated.
    • What should this library be called?
      • libdna
      • libdna++
      • libsequence++
      • libDNA
      • libgenotype
  • Add an option by class that says whether or not to abort on failure. (or even an option on each method)
    • This allows calling code to set that option and then not have to check for failures since the code it calls would abort on a failure.
    • Could/should this be achieved using exceptions? User can decide to catch them or let them terminate the program.
  • SamFile add a constructor that takes the filename and a flag to indicate open for read/write. (abort on failure to open)
    • Also have 2 subclasses one that opens for read, one for write: SamReadFile, SamWriteFile? Or SamFileRead, SamFileWrite? - went with SamFileReader and SamFileWriter
  • Add a function that says: skipInvalidRecords, validateRecords, etc.
    • That way, ReadRecord will keep reading records until a valid/parseable one is found.
  • SamFileHeader::setTag - instead of having separate ones for PG, RG, etc, have a generic one that takes as a parameter which one it is.
    • KeyID, then Value as parameters....(keyID first, then value)
  • SamFileHEader::setProgramName, etc...have specific methods for setting fields so users don't need to know the specific tags, etc used for certain values in the header.
    • KeyID, then Value as parameters....(keyID first, then value)
  • BAM write utility could add a PG field with default settings (user could specify alternate settings) when it writes a file.
  • Future methods to add:
    • SamFile::setReadSection(const std::string& refName) - take in the reference name by string since that is what most people will know.
      • "" would indicate the ones not associated with a reference.