SAM/BAM Convert Sequence

From Genome Analysis Wiki
Revision as of 10:36, 2 February 2017 by Ppwhite (talk | contribs) (Proposed Solution)
Jump to: navigation, search

SAM/BAM support conversion between '=' and the base in a sequence

Request: To be able to call SamRecord::getSequence and have it return the sequence with '=' or with the correct bases.

Requester: Goncalo Abecasis & Hyun Min Kang

Date Requested: Nov 23, 2010?

Notes

SAM/BAM format allows '=' in the sequence when it matches the reference base. Using '=' could make it compress better in BAM format, but a user analyzing the file may want to know what the actual bases are. This library enhancements performs the conversion between these two formats.

It was requested to have this handled within the getSequence call so the user doesn't have to convert the sequence after calling getSequence.

Proposed Solution

The diffs are provided so you can see the classes as they were both before and after the modifications.

Just scroll through the files - the diffs are highlighted, but the entire file is included so you have context.

Sequence Translation Modifications

The doxygen output for the current git repository is: Current Doxygen

USER API UPDATES

SamFile

Added:

// Sets the reference to the specified genome sequence object.
// \param reference pointer to the GenomeSequence object.
void SetReference(GenomeSequence* reference);

/// Set the type of sequence translation to use when reading
/// the sequence.  Passed down to the SamRecord when it is read.  
/// The default type (if this method is never called) is
/// NONE (the sequence is left as-is).
/// \param translation type of sequence translation to use.
void SetReadSequenceTranslation(SamRecord::SequenceTranslation translation);

/// Set the type of sequence translation to use when writing
/// the sequence.  Passed down to the SamRecord when it is written.
/// The default type (if this method is never called) is
/// NONE (the sequence is left as-is).
/// \param translation type of sequence translation to use.
void SetWriteSequenceTranslation(SamRecord::SequenceTranslation translation);

If you want to perform sequence translation either when pulling out information from a record or when writing a new record you need to call SamFile::SetReference on the SamFile object that you want to perform translations on. Then call SetReadSequenceTranslation or SetWriteSequenceTranslation as appropriate. The read & write translations do not need to be the same. The read and write translations may need to be set on different SamFiles, one that reads in the file and one that writes out the file. In this case, both SamFiles should have SetReference called on them.

The reference that is passed to SamFile::SetReference is passed onto the SamRecord that is passed to SamFile methods. The translation specified in SamFile::SetReadSequenceTranslation is passed to the SamRecord specified when performing reads. The translation specified in SamFile::SetWriteSequenceTranslation is not stored in the record but is used to write the data with the correct translation.

The SamRecord::SequenceTranslation values are described below.

SamRecord

Added:

/// Enum containing the settings on how to translate the sequence if a
/// reference is available.  If no reference is available, no translation
/// is done.
enum SequenceTranslation { 
    NONE,   ///< Leave the sequence as is.
    EQUAL,  ///< Translate bases that match the reference to '='
    BASES,  ///< Translate '=' to the actual base.
};

/// Set the reference to the specified genome sequence object.
/// \param reference pointer to the GenomeSequence object.
void setReference(GenomeSequence* reference);

/// Set the type of sequence translation to use when getting
/// the sequence.  The default type (if this method is never called) is
/// NONE (the sequence is left as-is).  Can be over-ridden by using 
/// the accessors that take a SequenceTranslation parameter.
/// \param translation type of sequence translation to use.
void setSequenceTranslation(SequenceTranslation translation);

/// Get a const pointer to the buffer that contains the BAM representation
/// of the record.
/// \param translation type of sequence translation to use.
/// \return const pointer to the buffer that contains the BAM representation
/// of the record.
const void* getRecordBuffer(SequenceTranslation translation);

/// Write the record as a BAM into the specified file.
/// \param filePtr file to write the BAM record into.
/// \param translation type of sequence translation to use.
/// \return status of the write.
SamStatus::Status writeRecordBuffer(IFILE filePtr, 
                                    SequenceTranslation translation);
/// Returns the SAM formatted sequence string performing the specified
/// sequence translation.
/// \param translation type of sequence translation to use.
/// \return sequence string.
const char* getSequence(SequenceTranslation translation);

/// Get the sequence base at the specified index into this sequence 0 to
/// readLength -  performing the specified sequence translation1.
/// \param index index into the sequence string (0 to readLength-1).
/// \param translation type of sequence translation to use.
/// \return the sequence base at the specified index into the sequence.
char getSequence(int index, SequenceTranslation translation);

/// Returns the values of all fields except the tags.
/// \param recStruct structure containing the contents of all 
/// non-variable length fields.
/// \param readName read name from the record (return param)
/// \param cigar cigar string from the record (return param)
/// \param sequence sequence string from the record (return param)
/// \param quality quality string from the record (return param)
/// \param translation type of sequence translation to use.
/// \return true if all fields were successfully set, false otherwise.
bool getFields(bamRecordStruct& recStruct, String& readName, 
               String& cigar, String& sequence, String& quality,
               SequenceTranslation translation);

SequenceTranslation is an enum used to define the type of sequence translation that should be performed on the sequence stored in the record. NONE means that no translation should be performed on the sequence. EQUAL means that any bases that match the reference should be translated to '=' instead of the actual base. BASES means that any '=' in the sequence should be translated to the actual base using the reference.

setReference is automatically called from SamFile when the record is passed in to a SamFile method if the reference has been set. setSequenceTranslation is automatically called by SamFile when reading records. So, when interfacing through SamFile SamRecord::setReference and SamRecord::setSequenceTranslation should not need to be called by the user unless records are being created from scratch.

There are now versions of getRecordBuffer, writeRecordBuffer, getSequence, and getFields that take and do not take SequenceTranslation as a parameter. The versions that do not take SequenceTranslation as a parameter use the translation setting that has been set for the SamRecord either by SamRecord::setSequenceTranslation or through SamFile.

If SamRecord::setReference was never called on the record either by the user or by SamFile, no sequence translation is done, regardless of the SequenceTranslation setting.

Once the reference or sequence translation has been set on a SamRecord object it remains set until a setReference or setSequenceTranslation call is made. Resetting or reusing the object does not clear these settings.

Possibly useful to a User

Added a class, SamQuerySeqWithRef, with static methods seqWithEquals and seqWithoutEquals. SamRecord uses these methods to translate the sequence for the user.

// Gets the sequence with '=' in any position where the sequence matches the reference.
// NOTE: 'N' in both the sequence and the reference is not considered a match.
// Parameters:
//    	currentSeq 	sequence that should be converted
//    	seq0BasedPos 	0 based start position of currentSeq on the reference.
//   	cigar 	cigar string for currentSeq (used for determining how the sequence aligns to the reference)
//    	referenceName 	reference name associated with this sequence
//    	refSequence 	reference sequence object
//    	updatedSeq 	return parameter that this method sets to the current sequence, replacing any matches to the reference with '='.
static void SamQuerySeqWithRef::seqWithEquals(const char * 		currentSeq,
                                              int32_t 	 		seq0BasedPos,
                                              Cigar &  			cigar,
                                              const char *  		referenceName,
                                              const GenomeSequence &  	refSequence,
                                              std::string &  		updatedSeq)

// Gets the sequence converting '=' to the appropriate base using the reference.
// Parameters:
//    	currentSeq 	sequence that should be converted
//    	seq0BasedPos 	0 based start position of currentSeq on the reference.
//   	cigar 	cigar string for currentSeq (used for determining how the sequence aligns to the reference)
//    	referenceName 	reference name associated with this sequence
//    	refSequence 	reference sequence object
//    	updatedSeq 	return parameter that this method sets to the current sequence, replacing any "=" with the base from the reference.
static void SamQuerySeqWithRef::seqWithoutEquals(const char * 		currentSeq,
                                                 int32_t 	 	seq0BasedPos,
                                                 Cigar &  		cigar,
                                                 const char *  		referenceName,
                                                 const GenomeSequence & refSequence,
                                                 std::string &  	updatedSeq)

These static methods can be used without going through a SamRecord object.


INTERNAL MODIFICATIONS

  • Updated GenericSamInterface, BamInterface, and SamInterface to take TranslationType as a parameter to writeRecord.
    • BamInterface::writeRecord passes the translation parameter to SamRecord::writeRecordBuffer
    • SamInterface::writeRecord passes the translation parameter to SamRecord::getSequence when it obtains the sequence to write into the SAM file.

FAQ

Q: How do you know what the actual base was? Doesn't that depend on the specific reference used to do the alignment

A: The user needs to create a GenomeSequence object, specifying the fasta file. That object is used to map to the reference.

Q: What if I don't know what my reference file is?

A: Then I don't either. Although....additional validation will be added to check the reference specified versus the SQ fields in the header: Additional Validation Coming Soon

STILL NEEDS WORK

  • Additional validation will be added to check the specified GenomeSequence against the Header from the SamFile. It will check things like sequence length. An error will be reported if the reference sequence length does not match the SQ LN field for that reference name. This will give you an indication of if you are using the wrong reference file.