Difference between revisions of "BamUtil"

From Genome Analysis Wiki

Revision as of 12:59, 13 June 2011


bam Executable

When statgen is compiled, the SAM/BAM executable, "bam", is generated in the statgen/src/bin/ directory.

The software reads the beginning of an input file to determine whether it is SAM or BAM. To determine the format (SAM/BAM) of the output file, the software checks the output file's extension. If the extension is ".bam", it writes a BAM file; otherwise it writes a SAM file.
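The extension rule above can be sketched in Python. This is only an illustration of the documented behavior, not the tool's own code: the function name `output_format` is hypothetical, the match is assumed to be case-sensitive, and the `.ubam` extension mentioned in the convert usage is not handled.

```python
def output_format(filename: str) -> str:
    """Return "BAM" for names ending in ".bam", otherwise "SAM"."""
    # Assumption: the check is a plain, case-sensitive suffix match.
    return "BAM" if filename.endswith(".bam") else "SAM"


print(output_format("sample.bam"))  # BAM
print(output_format("sample.sam"))  # SAM
print(output_format("sample.txt"))  # SAM
```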

The bam executable has the following functions.

This executable is built using StatGenLibrary: BAM.

Just running ./bam will print the Usage information for the bam executable.


validate

The validate option on the bam executable reads and validates a SAM/BAM file. This option is documented at: BamValidator

convert

The convert option on the bam executable reads a SAM/BAM file and writes it as a SAM/BAM file.

The executable converts the input file into the format of the output file. So if you want to convert a BAM file to a SAM file, from the pipeline/bam/ directory you just call:

./bam convert --in <bamFile>.bam --out <newSamFile>.sam

Don't forget to put in the paths to the executable and your test files.

Sequence Representation

The sequence parameter options specify how to represent the sequence if the reference is specified (refFile option). If the reference is not specified or seqOrig is specified, no modifications are made to the sequence. If the reference and seqBases are specified, any matches between the sequence and the reference are represented in the sequence as the appropriate base. If the reference and seqEquals are specified, any matches between the sequence and the reference are represented in the sequence as '='.

Examples

ExtendedCigar: SSMMMDDMMMIMNNNMPMSSS
Sequence:      AATAA  CTAGA   T AGGG
Reference:       TAACCCTA ACCCT A
Sequence with Orig:   AATAACTAGATAGGG
Sequence with Bases:  AATAACTAGATAGGG
Sequence with Equals: AA======G===GGG

ExtendedCigar: SSMMMDDMMMIMNNNMPMSSS
Sequence:      AATGA  CTGGA   T AGGG
Reference:       TAACCCTA ACCCT A
Sequence with Orig:   AATGACTGGATAGGG
Sequence with Bases:  AATGACTGGATAGGG
Sequence with Equals: AA=G===GG===GGG

ExtendedCigar: SSMMMDDMMMIMNNNMPMSSS
Sequence:      AAT=A  CT=GA   T AGGG
Reference:       TAACCCTA ACCCT A
Sequence with Orig:   AAT=ACT=GATAGGG
Sequence with Bases:  AATAACTAGATAGGG
Sequence with Equals: AA======G===GGG

ExtendedCigar: SSMMMDDMMMIMNNNMPMSSS
Sequence:      AA===  ===G=   = =GGG
Reference:       TAACCCTA ACCCT A
Sequence with Orig:   AA======G===GGG
Sequence with Bases:  AATAACTAGATAGGG
Sequence with Equals: AA======G===GGG
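
The conversions shown in these examples can be reproduced with a short Python sketch. It is not the bam executable's implementation: the function name `convert_sequence` and its mode strings are hypothetical, the CIGAR is assumed to be already expanded to one character per operation (as in the ExtendedCigar lines), the reference string is assumed to start at the alignment position, and bases are compared case-insensitively.

```python
def convert_sequence(expanded_cigar, seq, ref, mode):
    """Rewrite seq according to mode: "orig", "bases" or "equals"."""
    out = []
    si = 0  # next index into the read sequence
    ri = 0  # next index into the reference
    for op in expanded_cigar:
        if op in "M=X":  # consumes both read and reference
            base, ref_base = seq[si], ref[ri]
            if mode == "equals" and base.upper() == ref_base.upper():
                base = "="
            elif mode == "bases" and base == "=":
                base = ref_base
            out.append(base)
            si += 1
            ri += 1
        elif op in "IS":  # insertion / soft clip: read only, copied as is
            out.append(seq[si])
            si += 1
        elif op in "DN":  # deletion / skip: reference only
            ri += 1
        # H and P consume neither the read nor the reference
    return "".join(out)


cigar = "SSMMMDDMMMIMNNNMPMSSS"
ref = "TAACCCTAACCCTA"
print(convert_sequence(cigar, "AATGACTGGATAGGG", ref, "equals"))  # AA=G===GG===GGG
print(convert_sequence(cigar, "AA======G===GGG", ref, "bases"))   # AATAACTAGATAGGG
```

Insertions and soft clips are copied unchanged because they have no reference base to compare against, which is why the inserted G and the clipped AA/GGG survive in every representation.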

Parameters

    Required Parameters:
        --in        : the SAM/BAM file to be read
        --out       : the SAM/BAM file to be written
    Optional Parameters:
        --refFile   : reference file name
        --noeof     : do not expect an EOF block on a bam file.
        --params    : print the parameter settings
    Optional Sequence Parameters (only specify one):
        --seqOrig   : Leave the sequence as is (default & used if reference is not specified).
        --seqBases  : Convert any '=' in the sequence to the appropriate base using the reference (requires --refFile).
        --seqEquals : Convert any bases that match the reference to '=' (requires --refFile).

Usage

./bam convert --in <inputFile> --out <outputFile.sam/bam/ubam (ubam is uncompressed bam)> [--refFile <reference filename>] [--seqBases|--seqEquals|--seqOrig] [--noeof] [--params]


Return Value

Returns the SamStatus for the reads/writes.

Example Output

Number of records read = 10
Number of records written = 10

dumpHeader

The dumpHeader option on the bam executable prints the header of the specified SAM/BAM file to standard output (cout).

Parameters

    Required Parameters:
        filename : the sam/bam filename whose header should be printed.

Usage

./bam dumpHeader <inputFile>

Return Value

  • 0: the header was successfully read and printed.
  • non-0: the header was not successfully read or was not printed. (Returns the SamStatus.)


Example Output

@SQ	SN:1	LN:247249719
@SQ	SN:2	LN:242951149
@SQ	SN:3	LN:199501827


splitChromosome

The splitChromosome option on the bam executable splits an indexed BAM file into multiple files based on the Chromosome (Reference Name).

The files all have the same base name, but with an _# where # corresponds with the associated reference id from the BAM file.

Parameters

    Required Parameters:
        --in       : the BAM file to be split
        --out      : the base filename for the SAM/BAM files to write into.  Does not include the extension.
                     _N will be appended to the basename where N indicates the Chromosome.
    Optional Parameters:
        --noeof    : do not expect an EOF block on a bam file.
        --bamIndex : the path/name of the bam index file
                     (if not specified, uses the --in value + ".bai")
        --bamout   : write the output files in BAM format (default).
        --samout   : write the output files in SAM format.
        --params   : print the parameter settings

Usage

./bam splitChromosome --in <inputFilename>  --out <outputFileBaseName> [--bamIndex <bamIndexFile>] [--noeof] [--bamout|--samout] [--params]


Return Value

  • 0: all records are successfully read and written.
  • non-0: at least one record was not successfully read or written.

Example Output

Reference ID -1 has 2 records
Reference ID 0 has 5 records
Reference ID 1 has 2 records
Reference ID 2 has 1 records
Reference ID 3 has 0 records
Reference ID 4 has 0 records
Reference ID 5 has 0 records
Reference ID 6 has 0 records
Reference ID 7 has 0 records
Reference ID 8 has 0 records
Reference ID 9 has 0 records
Reference ID 10 has 0 records
Reference ID 11 has 0 records
Reference ID 12 has 0 records
Reference ID 13 has 0 records
Reference ID 14 has 0 records
Reference ID 15 has 0 records
Reference ID 16 has 0 records
Reference ID 17 has 0 records
Reference ID 18 has 0 records
Reference ID 19 has 0 records
Reference ID 20 has 0 records
Reference ID 21 has 0 records
Reference ID 22 has 0 records
Number of records = 10
Returning: 0 (SUCCESS)


writeRegion

The writeRegion option on the bam executable writes the alignments in the indexed BAM file that fall into the specified region (reference id and start/end position).

Parameters

    Required Parameters:
        --in       : the BAM file to be read
        --out      : the SAM/BAM file to write to
    Optional Parameters:
        --noeof    : do not expect an EOF block on a bam file.
        --bamIndex : the path/name of the bam index file
                     (if not specified, uses the --in value + ".bai")
        --refName  : the BAM reference Name to read (either this or refID can be specified)
        --refID    : the BAM reference ID to read (defaults to -1: unmapped)
        --start    : inclusive 0-based start position (defaults to -1)
        --end      : exclusive 0-based end position (defaults to -1: meaning til the end of the reference)
        --params   : print the parameter settings

Usage

./bam writeRegion --in <inputFilename>  --out <outputFilename> [--bamIndex <bamIndexFile>] [--noeof] [--refName <reference Name> | --refID <reference ID>] [--start <0-based start pos>] [--end <0-based end position>] [--params]
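
Because --start is 0-based inclusive and --end is 0-based exclusive, a region given in the common 1-based inclusive form (for example 1:1001-2000) has to be shifted before it is passed to writeRegion. The Python sketch below shows that arithmetic; the helper name `write_region_args` and the file names are hypothetical, and only options listed above are used.

```python
def write_region_args(in_bam, out_file, ref_name, start_1based, end_1based):
    """Build a writeRegion argument list from a 1-based inclusive region."""
    start0 = start_1based - 1  # 1-based inclusive -> 0-based inclusive
    end0 = end_1based          # 1-based inclusive -> 0-based exclusive
    return ["./bam", "writeRegion",
            "--in", in_bam, "--out", out_file,
            "--refName", ref_name,
            "--start", str(start0), "--end", str(end0)]


# Hypothetical file names; region 1:1001-2000 becomes --start 1000 --end 2000.
print(" ".join(write_region_args("in.bam", "region.sam", "1", 1001, 2000)))
```

The resulting list can be handed to `subprocess.run` if the executable is on the given path.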

Return Value

  • 0: all records are successfully read and written.
  • non-0: at least one record was not successfully read or written.

Example Output


Wrote t.sam with 2 records.


dumpRefInfo

The dumpRefInfo option on the bam executable prints the SAM/BAM file's reference information.

Parameters

    Required Parameters:
        --in               : the SAM/BAM file to be read
    Optional Parameters:
        --noeof            : do not expect an EOF block on a bam file.
        --printRecordRefs  : print the reference information for the records in the file (grouped by reference).
        --params           : print the parameter settings

Usage

./bam dumpRefInfo --in <inputFilename> [--noeof] [--printRecordRefs] [--params]

Return Value

  • 0: the file was processed successfully.
  • non-0: the file was not processed successfully.


dumpIndex

The dumpIndex option on the bam executable prints a BAM index file in an easy-to-read format.

Parameters

    Required Parameters:
        --bamIndex : the path/name of the bam index file to display
    Optional Parameters:
        --refID    : the reference ID to read, defaults to print all
        --summary  : only print a summary - 1 line per reference.
        --params   : print the parameter settings

Usage

./bam dumpIndex --bamIndex <bamIndexFile> [--refID <ref#>] [--summary] [--params]

Return Value

  • 0: the BAM index file was processed successfully.
  • non-0: the BAM index file was not processed successfully.


readIndexedBam

The readIndexedBam option on the bam executable reads an indexed BAM file one reference ID at a time, from -1 up to the maximum reference ID, and writes the records out as a SAM/BAM file.

Parameters

	Required Parameters:
		inputFilename      - path/name of the input BAM file
		outputFile.sam/bam - path/name of the output file
		bamIndexFile       - path/name of the BAM index file

Usage

./bam readIndexedBam <inputFilename> <outputFile.sam/bam> <bamIndexFile>

Return Value

  • 0

filter

The filter option on the bam executable filters the reads in a SAM/BAM file. This option is documented at: Bam Executable: Filter

diff

***Coming Soon***

The diff option on the bam executable prints the difference between two coordinate sorted SAM/BAM files. This can be used to compare the outputs of running a SAM/BAM through different tools/versions of tools.

The diff tool compares records that have the same Read Name and Fragment (from the flag). If a matching ReadName & Fragment is not found, the record is considered to be different.

diff assumes the files are coordinate sorted and uses this assumption for determining how long to store a record before determining that the other file does not contain a matching ReadName/Fragment. If the files are not coordinate sorted, this logic does not work.

By default, just the chromosome/position and cigar are compared for each record.

Options are available to compare:

  • sequence
  • base quality
  • specified tags
  • turn off position comparison
  • turn off cigar comparison

Parameters

	Required Parameters:
		--in1         : first coordinate sorted SAM/BAM file to be diffed
		--in2         : second coordinate sorted SAM/BAM file to be diffed
	Optional Parameters:
		--out         : output filename, use .bam extension to output in BAM format instead of diff format.
		                In BAM format there will be 3 output files:
		                    1) the specified name with record diffs
		                    2) specified name with _only_<in1>.bam with records only in the in1 file
		                    3) specified name with _only_<in2>.bam with records only in the in2 file
		--seq         : diff the sequence bases.
		--baseQual    : diff the base qualities.
		--tags        : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type...
		--noCigar     : do not diff the cigars.
		--noPos       : do not diff the positions.
		--onlyDiffs   : only print the fields that are different, otherwise for any diff all the fields that are compared are printed.
		--recPoolSize : number of records to allow to be stored at a time, default value: 1000000
		--posDiff     : max base pair difference between possibly matching records, default value: 100000
		--noeof       : do not expect an EOF block on a bam file.
		--params      : print the parameter settings

Usage

./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--seq] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]
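
The --tags value is a semicolon-separated list of Tag:Type pairs. A minimal Python sketch for building it from a list of pairs follows; the helper name `format_tags` is hypothetical.

```python
def format_tags(tags):
    """Join (tag, type) pairs into the Tag:Type;Tag:Type form used by --tags."""
    return ";".join(f"{tag}:{typ}" for tag, typ in tags)


print(format_tags([("OP", "i"), ("MD", "Z")]))  # OP:i;MD:Z
```

When the value is typed directly on a shell command line it should be quoted, since an unquoted ';' ends the command.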

Return Value

  • 0: all records are successfully read and written.
  • non-0: an error occurred processing the parameters or reading one of the files.

Output Format

2 Output Formats:

  1. Diff Format
  2. BAM Format

Diff Format

There are 2 types of differences.

  • ReadName/Fragment combo is in one file, but not in the other file within the window set by recPoolSize & posDiff
  • ReadName/Fragment combo is in both files, but at least one of the specified fields to diff is different

Each difference output consists of 2 or 3 lines: 2 lines if the record appears in only one of the files, 3 lines if it appears in both.

The first line of the difference output is just the read name.

The 2nd and 3rd line (if present) begin with either a '<' or a '>'. If the record is from the first file (--in1), it begins with a '<'. If the record is from the 2nd file (--in2), it begins with a '>'.

The 2nd line is the flag followed by the diff'd fields from one of the records.

The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.


The diff'd record lines are tab separated, and are in the following order if --onlyDiffs is not specified:

  • '<' or '>'
  • flag
  • chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
  • cigar - if --noCigar is not specified
  • sequence - if --seq is specified
  • base quality - if --baseQual is specified
  • tag:type:value - for each tag:type specified in --tags
  • ...
  • tag:type:value

If onlyDiffs is specified, only the fields that are specified and are different get printed in lines 2 & 3.
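
The layout described above (a read name line followed by one or two tab-separated lines starting with '<' or '>') can be read back with a small Python sketch. The function name `parse_diff` is hypothetical, and the sketch assumes that a read name never begins with '<' or '>' followed by a tab.

```python
def parse_diff(lines):
    """Group diff-format lines into (read_name, records) entries.

    Each record is (side, fields): side is '<' for --in1 or '>' for --in2,
    and fields holds the flag followed by the diff'd fields.
    """
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line[0] in "<>" and (len(line) == 1 or line[1] == "\t"):
            if not entries:
                raise ValueError("record line found before any read name")
            side, *fields = line.split("\t")
            entries[-1][1].append((side, fields))
        else:
            entries.append((line, []))  # a new read name starts a new entry
    return entries


# Illustrative input only.
sample = ["readA", "<\ta1\t1:78", ">\ta1\t1:74", "readB", ">\tcd\t*:0\t*"]
for name, records in parse_diff(sample):
    print(name, records)
```

An entry with one record corresponds to a read found in only one file; an entry with two records corresponds to a read found in both files with differing fields.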

Example Output

Command:

../bin/bam diff --in1 testFiles/testDiff1.sam --in2 testFiles/testDiff2.sam --seq --baseQual --tags "OP:i;MD:Z" --onlyDiffs --out results/diffOrderSam.log

Output:

18:462+29M5I3M:F:295
<	a1	1:78
>	a1	1:74
1
>	a1	1:70	3S1M1S	ACGTN	;46>>	OP:i:75	MD:Z:30A0C5
2
>	a1	1:72	3S1M1S	ACGTN	;47>>	OP:i:75	MD:Z:30A0C5
ABC
>	cd	*:0	*	*	*
DEF
>	cd	*:0	*	*	*

Bam Format

Use a .bam extension on the --out filename to output in BAM format instead of diff format.

In BAM format there will be 3 output files:

  1. the specified name with record diffs
  2. specified name with _only_<in1>.bam with records only in the in1 file
  3. specified name with _only_<in2>.bam with records only in the in2 file

When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file.

The following tags are used:

  • ZF - Flag
  • ZP - Pos
  • ZC - Cigar
  • ZS - Sequence
  • ZQ - Base Quality
  • ZT - Tags
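
Once the diff BAM has been converted to SAM text, the tags listed above can be pulled out of a record with a short Python sketch. The function name `diff_tags` is hypothetical, and so is the sample record: the type codes and value formats of these tags are not stated here, so the sketch accepts whatever type and value it finds.

```python
DIFF_TAGS = {
    "ZF": "Flag", "ZP": "Pos", "ZC": "Cigar",
    "ZS": "Sequence", "ZQ": "Base Quality", "ZT": "Tags",
}


def diff_tags(sam_line):
    """Return {field name: value from the second file} for one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    found = {}
    for opt in fields[11:]:  # optional TAG:TYPE:VALUE fields follow column 11
        tag, _type, value = opt.split(":", 2)  # the value may contain ':'
        if tag in DIFF_TAGS:
            found[DIFF_TAGS[tag]] = value
    return found


# Hypothetical record; tag types and values are assumptions for illustration.
record = "r1\t0\t1\t75\t30\t5M\t*\t0\t0\tACGTN\t;46>>\tZP:Z:1:79\tZC:Z:3S1M1S"
print(diff_tags(record))  # {'Pos': '1:79', 'Cigar': '3S1M1S'}
```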

readReference

The readReference option on the bam executable prints the specified region of the reference sequence in an easy-to-read format.

Parameters

    Required Parameters:
        --refFile  : the reference
        --refName  : the SAM/BAM reference Name to read
        --start    : inclusive 0-based start position (defaults to -1)
    Required Length Parameter (one but not both needs to be specified):
        --end      : exclusive 0-based end position (defaults to -1: meaning til the end of the reference)
        --numBases : number of bases from start to display
    Optional Parameters:
        --params   : print the parameter settings

Usage

./bam readReference --refFile <referenceFilename> --refName <reference Name> --start <0 based start> --end <0 based end>|--numBases <number of bases> [--params]
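
The rule that exactly one of --end and --numBases is given can be sketched in Python as follows. The helper name `read_reference_args` and the file names are hypothetical. With a 0-based inclusive start and exclusive end, --start 10 --end 15 spans 5 bases, so it should select the same bases as --start 10 --numBases 5, assuming --numBases counts from the start position itself.

```python
def read_reference_args(ref_file, ref_name, start, end=None, num_bases=None):
    """Build a readReference argument list with exactly one length option."""
    if (end is None) == (num_bases is None):
        raise ValueError("specify exactly one of end or num_bases")
    args = ["./bam", "readReference",
            "--refFile", ref_file, "--refName", ref_name,
            "--start", str(start)]
    if end is not None:
        args += ["--end", str(end)]
    else:
        args += ["--numBases", str(num_bases)]
    return args


# Hypothetical reference file name.
print(" ".join(read_reference_args("ref.fa", "1", 10, end=15)))
print(" ".join(read_reference_args("ref.fa", "1", 10, num_bases=5)))
```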

Return Value

  • 0: the reference file was successfully read.
  • non-0: the reference file was not successfully read.

Example Output