SAM/BAM Convert Sequence
SAM/BAM support conversion between '=' and the base in a sequence
Request: To be able to call SamRecord::getSequence and have it return the sequence with '=' or with the correct bases.
Requester: Goncalo Abecasis & Hyun Min Kang
Date Requested: Nov 23, 2010?
Notes
It was requested to have this handled within the getSequence call so the user doesn't have to convert the sequence after calling getSequence.
Proposed Solution
The diffs are provided so you can see the classes as they were both before and after the modifications.
Just scroll through the files - the diffs are highlighted, but the entire file is included so you have context.
Sequence Translation Modifications
The doxygen output for the current git repository is: Current Doxygen
USER API UPDATES
SamFile
Added:
// Sets the reference to the specified genome sequence object.
// \param reference pointer to the GenomeSequence object.
void SetReference(GenomeSequence* reference);
/// Set the type of sequence translation to use when reading
/// the sequence. Passed down to the SamRecord when it is read.
/// The default type (if this method is never called) is
/// NONE (the sequence is left as-is).
/// \param translation type of sequence translation to use.
void SetReadSequenceTranslation(SamRecord::SequenceTranslation translation);
/// Set the type of sequence translation to use when writing
/// the sequence. Passed down to the SamRecord when it is written.
/// The default type (if this method is never called) is
/// NONE (the sequence is left as-is).
/// \param translation type of sequence translation to use.
void SetWriteSequenceTranslation(SamRecord::SequenceTranslation translation);
If you want to perform sequence translation either when pulling out information from a record or when writing a new record you need to call SamFile::SetReference
on the SamFile
object that you want to perform translations on. Then call SetReadSequenceTranslation
or SetWriteSequenceTranslation
as appropriate. The read & write translations do not need to be the same. The read and write translations may need to be set on different SamFiles, one that reads in the file and one that writes out the file. In this case, both SamFile
s should have SetReference
called on them.
The reference that is passed to SamFile::SetReference
is passed onto the SamRecord
that is passed to SamFile
methods. The translation specified in SamFile::SetReadSequenceTranslation
is passed to the SamRecord
specified when performing reads. The translation specified in SamFile::SetWriteSequenceTranslation
is not stored in the record but is used to write the data with the correct translation.
The SamRecord::SequenceTranslation
values are described below.
SamRecord
Added:
/// Enum containing the settings on how to translate the sequence if a
/// reference is available. If no reference is available, no translation
/// is done.
enum SequenceTranslation {
NONE, ///< Leave the sequence as is.
EQUAL, ///< Translate bases that match the reference to '='
BASES, ///< Translate '=' to the actual base.
};
/// Set the reference to the specified genome sequence object.
/// \param reference pointer to the GenomeSequence object.
void setReference(GenomeSequence* reference);
/// Set the type of sequence translation to use when getting
/// the sequence. The default type (if this method is never called) is
/// NONE (the sequence is left as-is). Can be over-ridden by using
/// the accessors that take a SequenceTranslation parameter.
/// \param translation type of sequence translation to use.
void setSequenceTranslation(SequenceTranslation translation);
/// Get a const pointer to the buffer that contains the BAM representation
/// of the record.
/// \param translation type of sequence translation to use.
/// \return const pointer to the buffer that contains the BAM representation
/// of the record.
const void* getRecordBuffer(SequenceTranslation translation);
/// Write the record as a BAM into the specified file.
/// \param filePtr file to write the BAM record into.
/// \param translation type of sequence translation to use.
/// \return status of the write.
SamStatus::Status writeRecordBuffer(IFILE filePtr,
SequenceTranslation translation);
/// Returns the SAM formatted sequence string performing the specified
/// sequence translation.
/// \param translation type of sequence translation to use.
/// \return sequence string.
const char* getSequence(SequenceTranslation translation);
/// Get the sequence base at the specified index into this sequence 0 to
/// readLength - performing the specified sequence translation1.
/// \param index index into the sequence string (0 to readLength-1).
/// \param translation type of sequence translation to use.
/// \return the sequence base at the specified index into the sequence.
char getSequence(int index, SequenceTranslation translation);
/// Returns the values of all fields except the tags.
/// \param recStruct structure containing the contents of all
/// non-variable length fields.
/// \param readName read name from the record (return param)
/// \param cigar cigar string from the record (return param)
/// \param sequence sequence string from the record (return param)
/// \param quality quality string from the record (return param)
/// \param translation type of sequence translation to use.
/// \return true if all fields were successfully set, false otherwise.
bool getFields(bamRecordStruct& recStruct, String& readName,
String& cigar, String& sequence, String& quality,
SequenceTranslation translation);
SequenceTranslation
is an enum used to define the type of sequence translation that should be performed on the sequence stored in the record. NONE
means that no translation should be performed on the sequence. EQUAL
means that any bases that match the reference should be translated to '=' instead of the actual base. BASES
means that any '=' in the sequence should be translated to the actual base using the reference.
setReference
is automatically called from SamFile
when the record is passed in to a SamFile
method if the reference has been set. setSequenceTranslation
is automatically called by
SamFile
when reading records. So, when interfacing through SamFile
SamRecord::setReference
and SamRecord::setSequenceTranslation
should not need to be called by the user unless records are being created from scratch.
There are now versions of
getRecordBuffer
, writeRecordBuffer
, getSequence
, and getFields
that take and do not take SequenceTranslation
as a parameter. The versions that do not take SequenceTranslation
as a parameter use the translation setting that has been set for the SamRecord
either by SamRecord::setSequenceTranslation
or through SamFile
.
If
SamRecord::setReference
was never called on the record either by the user or by SamFile
, no sequence translation is done, regardless of the SequenceTranslation
setting.
Once the reference or sequence translation has been set on a
SamRecord
object it remains set until a setReference
or setSequenceTranslation
call is made. Resetting or reusing the object does not clear these settings.
Possibly useful to a User
Added a class, SamQuerySeqWithRef
, with static methods seqWithEquals
and seqWithoutEquals
. SamRecord uses these methods to translate the sequence for the user.
// Gets the sequence with '=' in any position where the sequence matches the reference.
// NOTE: 'N' in both the sequence and the reference is not considered a match.
// Parameters:
// currentSeq sequence that should be converted
// seq0BasedPos 0 based start position of currentSeq on the reference.
// cigar cigar string for currentSeq (used for determining how the sequence aligns to the reference)
// referenceName reference name associated with this sequence
// refSequence reference sequence object
// updatedSeq return parameter that this method sets to the current sequence, replacing any matches to the reference with '='.
static void SamQuerySeqWithRef::seqWithEquals(const char * currentSeq,
int32_t seq0BasedPos,
Cigar & cigar,
const char * referenceName,
const GenomeSequence & refSequence,
std::string & updatedSeq)
// Gets the sequence converting '=' to the appropriate base using the reference.
// Parameters:
// currentSeq sequence that should be converted
// seq0BasedPos 0 based start position of currentSeq on the reference.
// cigar cigar string for currentSeq (used for determining how the sequence aligns to the reference)
// referenceName reference name associated with this sequence
// refSequence reference sequence object
// updatedSeq return parameter that this method sets to the current sequence, replacing any "=" with the base from the reference.
static void SamQuerySeqWithRef::seqWithoutEquals(const char * currentSeq,
int32_t seq0BasedPos,
Cigar & cigar,
const char * referenceName,
const GenomeSequence & refSequence,
std::string & updatedSeq)
These static methods can be used without going through a SamRecord
object.
INTERNAL MODIFICATIONS
- Updated GenericSamInterface, BamInterface, and SamInterface to take TranslationType as a parameter to
writeRecord
.
BamInterface::writeRecord
passes the translation parameter to SamRecord::writeRecordBuffer
SamInterface::writeRecord
passes the translation parameter to SamRecord::getSequence
when it obtains the sequence to write into the SAM file.