BAM to FASTQ

From Genome Analysis Wiki

Please see BamUtil: bam2FastQ for this tool.


Request: Software to convert from BAM to FASTQ, using the OQ (original base quality) tag for the quality.

Requester: Hyun Min Kang & Goo Jun

Date Requested: December 7, 2010

Date Needed: Soon

Current Status: On hold per direction from Hyun Min Kang (12/7/2010)

  • On hold to determine whether it is worth updating Bingshan's tool to use OQ (it would need conversion to the new BAM Library), or whether that would not provide much benefit.
  • Even so, having it work quickly & efficiently on unsorted BAMs may be useful.

Notes

  • Bingshan has code, Bam2FastQ, that already converts from BAM to FASTQ, but it does not use OQ for the quality.

The code needs to determine the strand of each read and reverse complement the reads on the reverse strand.

The output is one file for the first read in each pair & one file for the second read in each pair - the order of the reads in the two files must match.

Reverse complementing means:

  • if the sequence in the BAM is: ACTG, the reverse complement is: CAGT
  • if the quality in OQ is: 1234, it is only reversed (qualities have no complement), giving: 4321
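A minimal sketch of the two operations, using the examples above. The function names are hypothetical and not taken from Bam2FastQ or the BAM library, and the choice to leave unrecognized characters such as N unchanged is an assumption of this sketch.

```cpp
#include <algorithm>
#include <string>

// Complement of a single base. Anything that is not A, C, G or T
// (for example N) is returned unchanged.
char complementBase(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return base;
    }
}

// Reverse the sequence, then complement every base.
std::string reverseComplement(std::string seq) {
    std::reverse(seq.begin(), seq.end());
    std::transform(seq.begin(), seq.end(), seq.begin(), complementBase);
    return seq;
}

// Qualities are only reversed so that each value stays with its base.
std::string reverseQuality(std::string qual) {
    std::reverse(qual.begin(), qual.end());
    return qual;
}
```

Both functions would be applied only to records whose reverse-strand flag (0x10 in the SAM specification) is set; forward-strand records are written as stored.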

  • Initial release would require the files to be sorted by ReadName (producing an error if not already sorted).
    • This is done for you by calling SamFile::setSortedValidation(SamFile::QUERY_NAME). Then ReadRecord will fail if the record is not sorted.
    • It would also error if the read is not paired.
  • Later Release: work on unsorted BAM files.
    • Prefer to sort at the same time as writing the FASTQ files rather than 1 step to sort and a 2nd step to write the FASTQs.
    • It would be useful to have something implemented within the library (it would also help with deduping, etc.), but it might be tricky to implement as an API - sometimes the two reads of a pair may be far apart.
      • maybe something like SamFile::getNextReadPair or SamFileHelper::getNextReadPair; because of the bookkeeping, it may be useful to separate it out from SamFile - either one would handle the logic and return a pair of records
      • At some point may have to start writing a file.
      • could attempt to store just the readname and FilePosition and use random access to jump around when a pair is found (but that would be inefficient if the mates are close together) - and whether storing readname & filePosition is still too much information would depend on how big the file is
      • A two scan approach on the original BAM may be the best
  • Separate suggestion: implement a smart pileup, which retains a clone of the SamRecord until you see its mate
    • useful in the dedupper, the variant caller, etc., but we probably need to discuss whether to implement it
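The idea of retaining a record until its mate is seen could be sketched as below. This is only an illustration under stated assumptions: `Read` and `MateBuffer` are hypothetical names, not part of the BAM library, and the sketch assumes each read name occurs exactly twice (no secondary or supplementary alignments). Everything is held in memory, so it does not address the case where mates are far apart and the buffer grows too large.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical minimal record standing in for a cloned SamRecord.
struct Read {
    std::string name;
    bool firstInPair;
    std::string seq;
    std::string qual;
};

// Holds reads whose mate has not been seen yet, keyed by read name.
class MateBuffer {
public:
    // Returns true and fills first/second when rec completes a pair.
    // Otherwise stores a copy of rec and returns false.
    bool add(const Read& rec, Read& first, Read& second) {
        auto it = pending_.find(rec.name);
        if (it == pending_.end()) {
            pending_.emplace(rec.name, rec);
            return false;
        }
        if (it->second.firstInPair == rec.firstInPair) {
            throw std::runtime_error("same end seen twice: " + rec.name);
        }
        if (rec.firstInPair) {
            first = rec;
            second = it->second;
        } else {
            first = it->second;
            second = rec;
        }
        pending_.erase(it);
        return true;
    }

    // Number of reads still waiting for a mate.
    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::unordered_map<std::string, Read> pending_;
};
```

With this approach the FASTQ records come out in the order in which pairs are completed rather than in read-name order, which still keeps the two output files consistent with each other. Any reads left in the buffer at end of file have no mate and would be reported as an error.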

Proposed Solution