BamUtil: convert

From Genome Analysis Wiki


Overview of the convert function of bamUtil

The convert option on the BamUtil executable reads a SAM/BAM file and writes it as a SAM/BAM file.

The executable converts the input file into the format of the output file.

It has options to convert the sequence between '=' and the actual bases by using the reference sequence.

It also has an option to left shift indels in the CIGARs before writing the output file.

If you want to convert a BAM file to a SAM file, just call:

<pathToExe>/bam convert --in <bamFile>.bam --out <newSamFile>.sam

Don't forget to put in the paths to the executable and your test files.


Usage

./bam convert --in <inputFile> --out <outputFile.sam/bam/ubam (ubam is uncompressed bam)> [--refFile <reference filename>] [--useBases|--useEquals|--useOrigSeq] [--lshift] [--noeof] [--params]


Parameters

	Required Parameters:
		--in         : the SAM/BAM file to be read
		--out        : the SAM/BAM file to be written
	Optional Parameters:
		--refFile    : reference file name
		--lshift     : left shift indels when writing records
		--noeof      : do not expect an EOF block on a bam file
		--params     : print the parameter settings
		--recover    : attempt error recovery while reading a bam file
	Optional Sequence Parameters (only specify one):
		--useOrigSeq : Leave the sequence as is (default & used if reference is not specified)
		--useBases   : Convert any '=' in the sequence to the appropriate base using the reference (requires --refFile)
		--useEquals  : Convert any bases that match the reference to '=' (requires --refFile)
	PhoneHome:
		--noPhoneHome       : disable PhoneHome (default enabled)
		--phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)

Required Parameters

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.

Use - to read from stdin. In that case the extension after the - determines the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file : --in yourFileName
SAM from stdin                     : --in -
BAM from stdin                     : --in -.bam
Uncompressed BAM from stdin        : --in -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Output File (--out)

Use --out followed by your file name to specify the SAM/BAM output file.

The file extension determines whether SAM, BAM, or uncompressed BAM is written. Use - to write to stdout, with the extension after the - indicating the file type (no extension indicates SAM).

SAM to file                : --out yourFileName.sam
BAM to file                : --out yourFileName.bam
Uncompressed BAM to file   : --out yourFileName.ubam
SAM to stdout              : --out -
BAM to stdout              : --out -.bam
Uncompressed BAM to stdout : --out -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
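
The extension rule above can be sketched as a small shell helper. This is only an illustration: out_format is a hypothetical name, not part of bamUtil, and the sketch says nothing about how the tool treats extensions other than the ones listed above.

```shell
# Hypothetical helper (not part of bamUtil): report which format a given
# --out value selects, following the extension rule described above.
out_format() {
    case "$1" in
        *.ubam)  echo "uncompressed BAM" ;;
        *.bam)   echo "BAM" ;;
        *.sam|-) echo "SAM" ;;
        *)       echo "not covered by the table above" ;;
    esac
}

out_format yourFileName.bam   # BAM
out_format -.ubam             # uncompressed BAM
out_format -                  # SAM
```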

Optional Parameters

Reference File (--refFile)

Use --refFile followed by the reference file name to specify the reference sequence file.

Left Shift Indels in the CIGAR (--lshift)

Left shift indels as far as they can go in the read.

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing EOF block in your BGZF file.

By default, the trailing empty block is expected and checked for.
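
To see whether a file ends with the trailing block before deciding on --noeof, one option is to compare its last 28 bytes with the fixed empty BGZF block given in the SAM/BAM specification. The following is a minimal sketch: has_bgzf_eof is a hypothetical helper, not a bamUtil command, and it assumes tail, od, and tr are available.

```shell
# Hypothetical helper (not part of bamUtil): succeed if the file named in $1
# ends with the 28-byte empty BGZF block used as the EOF marker.
has_bgzf_eof() {
    # The marker bytes in hex, written in groups of four bytes.
    expected="1f8b0804""00000000""00ff0600""42430200""1b000300""00000000""00000000"
    # Dump the last 28 bytes as hex and strip spaces and newlines.
    actual=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$actual" = "$expected" ]
}
```

A call such as has_bgzf_eof yourFile.bam || echo "no EOF block found" reports files for which --noeof may be needed.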

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

Recover a corrupted BAM file (--recover)

See BAM File Recovery.

Sequence Representation Parameters (--useOrigSeq, --useBases, --useEquals, --refFile)

The sequence parameter options specify how to represent the sequence when the reference is specified (--refFile option).

If the reference is not specified or --useOrigSeq is specified, no modifications are made to the sequence.

If the reference and --useBases are specified, any matches between the sequence and the reference are represented in the sequence as the appropriate base.

If the reference and --useEquals are specified, any matches between the sequence and the reference are represented in the sequence as '='.

Examples

ExtendedCigar: SSMMMDDMMMIMNNNMPMSSS
Sequence:      AATAA  CTAGA   T AGGG
Reference:       TAACCCTA ACCCT A
Sequence with Orig:   AATAACTAGATAGGG
Sequence with Bases:  AATAACTAGATAGGG
Sequence with Equals: AA======G===GGG

ExtendedCigar: SSMMMDDMMMIMNNNMPMSSS
Sequence:      AATGA  CTGGA   T AGGG
Reference:       TAACCCTA ACCCT A
Sequence with Orig:   AATGACTGGATAGGG
Sequence with Bases:  AATGACTGGATAGGG
Sequence with Equals: AA=G===GG===GGG

ExtendedCigar: SSMMMDDMMMIMNNNMPMSSS
Sequence:      AAT=A  CT=GA   T AGGG
Reference:       TAACCCTA ACCCT A
Sequence with Orig:   AAT=ACT=GATAGGG
Sequence with Bases:  AATAACTAGATAGGG
Sequence with Equals: AA======G===GGG

ExtendedCigar: SSMMMDDMMMIMNNNMPMSSS
Sequence:      AA===  ===G=   = =GGG
Reference:       TAACCCTA ACCCT A
Sequence with Orig:   AA======G===GGG
Sequence with Bases:  AATAACTAGATAGGG
Sequence with Equals: AA======G===GGG
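
The per-base rule behind --useEquals and --useBases can be sketched in shell for a single aligned stretch of match/mismatch (M) positions. This is an illustration under simplifying assumptions, not bamUtil's implementation: use_equals and use_bases are hypothetical names, both inputs must have the same length, and soft clips, insertions, deletions, skips, and pads from the CIGAR are not handled.

```shell
# Hypothetical helpers (not bamUtil code). Both take an aligned stretch of
# read bases ($1) and the reference bases at the same positions ($2).

# Replace read bases that match the reference with '=' (like --useEquals).
use_equals() {
    awk -v s="$1" -v r="$2" 'BEGIN {
        out = ""
        for (i = 1; i <= length(s); i++) {
            b = substr(s, i, 1)
            c = (b == "=" || b == substr(r, i, 1)) ? "=" : b
            out = out c
        }
        print out
    }'
}

# Replace any '=' in the read with the reference base (like --useBases).
use_bases() {
    awk -v s="$1" -v r="$2" 'BEGIN {
        out = ""
        for (i = 1; i <= length(s); i++) {
            b = substr(s, i, 1)
            c = (b == "=") ? substr(r, i, 1) : b
            out = out c
        }
        print out
    }'
}

use_equals TGA TAA   # =G=
use_bases  T=A TAA   # TAA
```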

PhoneHome Parameters

See PhoneHome for more information on how PhoneHome works and what it does.

Turn off PhoneHome (--noPhoneHome)

Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.

Adjust the Frequency of PhoneHome (--phoneHomeThinning)

Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).

  • By default, --phoneHomeThinning is set to 50, running 50% of the time.
  • PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
  • N/A if --noPhoneHome is set.
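
The thinning rule above amounts to a single comparison, which can be sketched as follows. should_phone_home is a hypothetical helper used only to illustrate the rule; how the tool obtains its random number is not described here, so the number is passed in as an argument.

```shell
# Hypothetical helper illustrating the thinning rule; not bamUtil code.
# $1 = the run's random number, $2 = --phoneHomeThinning value (0-100).
should_phone_home() {
    [ $(( $1 % 100 )) -lt "$2" ]
}

if should_phone_home 149 50; then
    echo "PhoneHome runs"
else
    echo "PhoneHome skipped"
fi
# prints "PhoneHome runs" because 149 modulo 100 is 49, which is less than 50
```

With a thinning value of 0 the comparison never succeeds, and with 100 it always does.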

BAM File Recovery

A BAM file that has been corrupted or truncated due to a copy or disk problem can often be partially recovered.

Both the BGZF format and binary BAM format have enough information to scan forward and resynchronize the input data. While some data will be lost, substantial recovery can often be done.

When a file has bad blocks in it, normal copy commands (cp) will truncate the file at the point of disk read failure. To recover the maximum amount of data possible, use the dd command with the conv=noerror option.

So a normal use case for recovery would look like this:

# dd if=brokenbamfile.bam of=/tmp/brokenbamfile1.bam conv=noerror bs=4k
# bam convert --recover --in /tmp/brokenbamfile1.bam --out /tmp/brokenbamfilerecovered.bam

Note that the result file must be written to a known good filesystem.

Currently, no statistics are printed on how many BAM records were recovered, but subsequent tests can readily be run on the resulting file to determine the quality of the recovery.

In real cases, we have recovered better than 94% of reads from a set of severely damaged files (numerous 64K chunks of a RAID were lost), and better than 99.9% recovery from a moderately damaged file (3 disk pages were corrupt).


Return Value

Returns the SamStatus for the reads/writes (0 for success, non-0 for failure).
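
In a script, the return value can be checked like any other exit status. The wrapper below is a generic sketch, not something bamUtil provides: run_checked is a hypothetical name, and the only assumption is the 0 for success, non-0 for failure convention stated above.

```shell
# Hypothetical wrapper: run a command, report a non-zero exit status on
# stderr, and pass that status through to the caller.
run_checked() {
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed with status $status: $*" >&2
    fi
    return "$status"
}
```

For example, run_checked ./bam convert --in yourFile.bam --out yourFile.sam would print a message and return the same non-zero status if the conversion fails.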

Example Output

Number of records read = 10
Number of records written = 10
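
A summary in this form can be checked automatically. The sketch below assumes the two lines have been captured to a file or pipe; which stream the tool prints them to is not stated on this page, and records_match is a hypothetical helper name.

```shell
# Hypothetical helper: read the summary on stdin and succeed only if the
# read and written record counts are both present and equal.
records_match() {
    awk -F' = ' '
        /^Number of records read/    { r = $2 }
        /^Number of records written/ { w = $2 }
        END { exit !(r != "" && r == w) }
    '
}
```

If the summary was saved to a file, records_match < savedOutput.txt succeeds when every record read was also written.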