Changes

BamUtil (view source)

Revision as of 14:35, 2 September 2011

17,218 bytes removed , 14:35, 2 September 2011

Split into multiple pages and add missing tools

Line 18: Line 18:

(It will be available without libStatGen in case you already have a downloaded version of libStatGen that you want to use.

−

=== ~~Using github~~ ===

+

=== Releases ===

−

~~Releases~~ are '''Coming Soon'''.

+

Release downloads are '''Coming Soon'''.

Line 62: Line 62:

The bam executable has the following functions.

−

* [[~~C++ Executable~~: ~~bam#~~validate|validate - Read and Validate a SAM/BAM file]]

+

* [[BamUtil: validate|validate - Read and Validate a SAM/BAM file]]

* [[BamUtil: convert|convert - Read a SAM/BAM file and write as a SAM/BAM file (optionally converts between '=' & bases in the sequence)]]

−

* [[~~C++ Executable~~: ~~bam#~~dumpHeader|dumpHeader - Print SAM/BAM header]]

+

* [[BamUtil: dumpHeader|dumpHeader - Print SAM/BAM header]]

−

* [[~~C++ Executable~~: ~~bam#~~splitChromosome|splitChromosome - Split BAM by Chromosome]]

+

* [[BamUtil: splitChromosome|splitChromosome - Split BAM by Chromosome]]

−

* [[~~C++ Executable~~: ~~bam#~~writeRegion|writeRegion - Write the alignments in the indexed BAM file that fall into the specified region]]

+

* [[BamUtil: writeRegion|writeRegion - Write the alignments in the indexed BAM file that fall into the specified region]]

−

* [[~~C++ Executable~~: ~~bam#~~dumpRefInfo|dumpRefInfo - Print SAM/BAM Reference Information]]

+

* [[BamUtil: dumpRefInfo|dumpRefInfo - Print SAM/BAM Reference Information]]

−

* [[~~C++ Executable~~: ~~bam#~~dumpIndex|dumpIndex - Dump a BAM index file into an easy to read text version]]

+

* [[BamUtil: dumpIndex|dumpIndex - Dump a BAM index file into an easy to read text version]]

−

* [[~~C++ Executable~~: ~~bam#~~readIndexedBam|readIndexedBam - Read an indexed BAM file reference by reference id -1 to the max reference id and write it out as a SAM/BAM file]]

+

* [[BamUtil: readIndexedBam|readIndexedBam - Read an indexed BAM file reference by reference id -1 to the max reference id and write it out as a SAM/BAM file]]

−

* [[~~C++ Executable~~: ~~bam#~~filter|filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high]]

+

* [[BamUtil: filter|filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high]]

−

* [[~~C++ Executable~~: ~~bam#~~readReference|readReference - Print the reference string for the specified region]]

+

* [[BamUtil: readReference|readReference - Print the reference string for the specified region]]

−

* [[~~C++ Executable~~: ~~bam#~~diff|diff - Print the diffs between 2 bams]]

+

* [[BamUtil: diff|diff - Print the diffs between 2 bams]]

+

* [[BamUtil: stats|stats - Generate the specified statistics on a SAM/BAM file]]

+

* [[BamUtil: revert|revert - Revert SAM/BAM replacing the specified fields with their previous values (if known).]]

+

* [[BamUtil: squeeze|squeeze - Reduces file size by dropping OQ fields, duplicates, and specified tags, using '=' when a base matches the reference, and binning quality scores.]]

+

* [[BamUtil: findCigars|findCigars - Output just the reads that contain any of the specified CIGAR operations.]]

This executable is built using [[C++ Library: libStatGen]].

Just running ./bam will print the Usage information for the bam executable.

−

~~== validate ==~~

−

~~The <code>validate</code> option on the bam executable reads and validates a SAM/BAM file. This option is documented at: [[BamValidator]]~~

−

~~== dumpHeader ==~~

−

~~The <code>dumpHeader</code> option on the bam executable prints the header of the specified SAM/BAM file to cout.~~

−

~~=== Parameters ===~~

−

~~<pre>~~

−

~~Required Parameters:~~

−

~~filename : the sam/bam filename whose header should be printed.~~

−

~~</pre>~~

−

~~=== Usage ===~~

−

~~./bam dumpHeader <inputFile>~~

−

~~=== Return Value ===~~

−

* 0: the header was successfully read and printed.

−

* non-0: the header was not successfully read or was not printed. (Returns the SamStatus.)

−

~~=== Example Output ===~~

−

~~<pre>~~

−

~~@SQ SN:1 LN:247249719~~

−

~~@SQ SN:2 LN:242951149~~

−

~~@SQ SN:3 LN:199501827~~

−

~~</pre>~~

−

~~== splitChromosome ==~~

−

~~The <code>splitChromosome</code> option on the bam executable splits an indexed BAM file into multiple files based on the Chromosome (Reference Name).~~

−

~~The files all have the same base name, but with an _# where # corresponds with the associated reference id from the BAM file.~~

−

~~=== Parameters ===~~

−

~~<pre>~~

−

~~Required Parameters:~~

−

~~--in : the BAM file to be split~~

−

~~--out : the base filename for the SAM/BAM files to write into. Does not include the extension.~~

−

~~_N will be appended to the basename where N indicates the Chromosome.~~

−

~~Optional Parameters:~~

−

~~--noeof : do not expect an EOF block on a bam file.~~

−

~~--bamIndex : the path/name of the bam index file~~

−

~~(if not specified, uses the --in value + ".bai")~~

−

~~--bamout : write the output files in BAM format (default).~~

−

~~--samout : write the output files in SAM format.~~

−

~~--params : print the parameter settings~~

−

~~</pre>~~

−

~~=== Usage ===~~

−

~~./bam splitChromosome --in <inputFilename> --out <outputFileBaseName> [--bamIndex <bamIndexFile>] [--noeof] [--bamout|--samout] [--params]~~

−

~~=== Return Value ===~~

−

* 0: all records are successfully read and written.

−

* non-0: at least one record was not successfully read or written.

−

~~=== Example Output ===~~

−

~~<pre>~~

−

~~Reference ID -1 has 2 records~~

−

~~Reference ID 0 has 5 records~~

−

~~Reference ID 1 has 2 records~~

−

~~Reference ID 2 has 1 records~~

−

~~Reference ID 3 has 0 records~~

−

~~Reference ID 4 has 0 records~~

−

~~Reference ID 5 has 0 records~~

−

~~Reference ID 6 has 0 records~~

−

~~Reference ID 7 has 0 records~~

−

~~Reference ID 8 has 0 records~~

−

~~Reference ID 9 has 0 records~~

−

~~Reference ID 10 has 0 records~~

−

~~Reference ID 11 has 0 records~~

−

~~Reference ID 12 has 0 records~~

−

~~Reference ID 13 has 0 records~~

−

~~Reference ID 14 has 0 records~~

−

~~Reference ID 15 has 0 records~~

−

~~Reference ID 16 has 0 records~~

−

~~Reference ID 17 has 0 records~~

−

~~Reference ID 18 has 0 records~~

−

~~Reference ID 19 has 0 records~~

−

~~Reference ID 20 has 0 records~~

−

~~Reference ID 21 has 0 records~~

−

~~Reference ID 22 has 0 records~~

−

~~Number of records = 10~~

−

~~Returning: 0 (SUCCESS)~~

−

~~</pre>~~

−

~~== writeRegion ==~~

−

~~The <code>writeRegion</code> option on the bam executable writes the alignments in the indexed BAM file that fall into the specified region (reference id and start/end position).~~

−

~~=== Parameters ===~~

−

~~<pre>~~

−

~~Required Parameters:~~

−

~~--in : the BAM file to be read~~

−

~~--out : the SAM/BAM file to write to~~

−

~~Optional Parameters:~~

−

~~--noeof : do not expect an EOF block on a bam file.~~

−

~~--bamIndex : the path/name of the bam index file~~

−

~~(if not specified, uses the --in value + ".bai")~~

−

~~--refName : the BAM reference Name to read (either this or refID can be specified)~~

−

~~--refID : the BAM reference ID to read (defaults to -1: unmapped)~~

−

~~--start : inclusive 0-based start position (defaults to -1)~~

−

~~--end : exclusive 0-based end position (defaults to -1: meaning til the end of the reference)~~

−

~~--params : print the parameter settings~~

−

~~</pre>~~
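Because <code>--start</code> is inclusive and 0-based while <code>--end</code> is exclusive and 0-based, a region given in the more familiar 1-based inclusive form has to be shifted before it is passed in. The sketch below shows that conversion; the helper name <code>to_write_region_args</code> is made up for illustration and is not part of the bam executable.

```python
def to_write_region_args(first_base, last_base):
    """Convert a 1-based inclusive interval to --start/--end arguments.

    Hypothetical helper: it only builds the argument list, it does not
    run the bam executable.
    """
    if first_base < 1 or last_base < first_base:
        raise ValueError("expected 1 <= first_base <= last_base")
    start = first_base - 1  # inclusive, 0-based
    end = last_base         # exclusive, 0-based
    return ["--start", str(start), "--end", str(end)]


# Bases 100 through 200 (1-based, inclusive) of the chosen reference.
print(to_write_region_args(100, 200))  # ['--start', '99', '--end', '200']
```

The resulting list can be appended to the rest of a <code>writeRegion</code> command line.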

−

~~=== Usage ===~~

−

./bam writeRegion --in <inputFilename> --out <outputFilename> [--bamIndex <bamIndexFile>] [--noeof] [--refName <reference Name> | --refID <reference ID>] [--start <0-based start pos>] [--end <0-based end position>] [--params]

−

~~=== Return Value ===~~

−

* 0: all records are successfully read and written.

−

* non-0: at least one record was not successfully read or written.

−

~~=== Example Output ===~~

−

~~<pre>~~

−

~~Wrote t.sam with 2 records.~~

−

~~</pre>~~

−

~~== dumpRefInfo ==~~

−

~~The <code>dumpRefInfo</code> option on the bam executable prints the SAM/BAM file's reference information.~~

−

~~=== Parameters ===~~

−

~~<pre>~~

−

~~Required Parameters:~~

−

~~--in : the SAM/BAM file to be read~~

−

~~Optional Parameters:~~

−

~~--noeof : do not expect an EOF block on a bam file.~~

−

~~--printRecordRefs : print the reference information for the records in the file (grouped by reference).~~

−

~~--params : print the parameter settings~~

−

~~</pre>~~

−

~~=== Usage ===~~

−

~~./bam dumpRefInfo --in <inputFilename> [--noeof] [--printRecordRefs] [--params]~~

−

~~=== Return Value ===~~

−

* 0: the file was processed successfully.

−

* non-0: the file was not processed successfully.

−

~~== dumpIndex ==~~

−

~~The <code>dumpIndex</code> option on the bam executable prints a BAM index file in an easy-to-read format.~~

−

~~=== Parameters ===~~

−

~~<pre>~~

−

~~Required Parameters:~~

−

~~--bamIndex : the path/name of the bam index file to display~~

−

~~Optional Parameters:~~

−

~~--refID : the reference ID to read, defaults to print all~~

−

~~--summary : only print a summary - 1 line per reference.~~

−

~~--params : print the parameter settings~~

−

~~</pre>~~

−

~~=== Usage ===~~

−

~~./bam dumpIndex --bamIndex <bamIndexFile> [--refID <ref#>] [--summary] [--params]~~

−

~~=== Return Value ===~~

−

* 0: the BAM index file was processed successfully.

−

* non-0: the BAM index file was not processed successfully.

−

~~== readIndexedBam ==~~

−

~~The <code>readIndexedBam</code> option on the bam executable reads an indexed BAM file reference id by reference id -1 to the max reference id and writes it out as a SAM/BAM file.~~

−

~~=== Parameters ===~~

−

~~<pre>~~

−

~~Required Parameters:~~

−

~~inputFilename - path/name of the input BAM file~~

−

~~outputFile.sam/bam - path/name of the output file~~

−

~~bamIndexFile - path/name of the BAM index file~~

−

~~</pre>~~

−

~~=== Usage ===~~

−

~~./bam readIndexedBam <inputFilename> <outputFile.sam/bam> <bamIndexFile>~~

−

~~=== Return Value ===~~

−

* 0

−

~~== filter ==~~

−

~~The <code>filter</code> option on the bam executable filters the reads in a SAM/BAM file. This option is documented at: [[Bam Executable: Filter]]~~

−

~~== diff ==~~

−

~~<span style="color:#D2691E">'''***Coming Soon***'''</span>~~

−

The <code>diff</code> option on the bam executable prints the difference between two coordinate sorted SAM/BAM files. This can be used to compare the outputs of running a SAM/BAM through different tools/versions of tools.

−

~~The <code>diff</code> tool compares records that have the same Read Name and Fragment (from the flag). If a matching ReadName & Fragment is not found, the record is considered to be different.~~

−

<code>diff</code> assumes the files are coordinate sorted and uses this assumption for determining how long to store a record before determining that the other file does not contain a matching ReadName/Fragment. If the files are not coordinate sorted, this logic does not work.

−

~~By default, just the chromosome/position and cigar are compared for each record.~~

−

~~Options are available to compare:~~

−

* sequence

−

* base quality

−

* specified tags

−

* turn off position comparison

−

* turn off cigar comparison

−

~~=== Parameters ===~~

−

~~<pre>~~

−

~~Required Parameters:~~

−

~~--in1 : first coordinate sorted SAM/BAM file to be diffed~~

−

~~--in2 : second coordinate sorted SAM/BAM file to be diffed~~

−

~~Optional Parameters:~~

−

~~--out : output filename, use .bam extension to output in SAM/BAM format instead of diff format.~~

−

~~In SAM/BAM format there will be 3 output files:~~

−

~~1) the specified name with record diffs~~

−

~~2) specified name with _only_<in1>.sam/bam with records only in the in1 file~~

−

~~3) specified name with _only_<in2>.sam/bam with records only in the in2 file~~

−

~~--seq : diff the sequence bases.~~

−

~~--baseQual : diff the base qualities.~~

−

~~--tags : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type...~~

−

~~--noCigar : do not diff the cigars.~~

−

~~--noPos : do not diff the positions.~~

−

~~--onlyDiffs : only print the fields that are different, otherwise for any diff all the fields that are compared are printed.~~

−

~~--recPoolSize : number of records to allow to be stored at a time, default value: 1000000~~

−

~~--posDiff : max base pair difference between possibly matching records, default value: 100000~~

−

~~--noeof : do not expect an EOF block on a bam file.~~

−

~~--params : print the parameter settings~~

−

~~</pre>~~
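The <code>--tags</code> value is a semicolon-separated list of <code>Tag:Type</code> pairs. As a rough illustration of that format, the sketch below splits such a string into pairs. The function name <code>parse_tag_spec</code> is hypothetical, and the length checks assume the usual SAM convention of two-character tags and one-character types.

```python
def parse_tag_spec(spec):
    """Split 'Tag:Type;Tag:Type' into a list of (tag, type) tuples.

    Hypothetical helper; assumes two-character tags and one-character
    types, as in the SAM optional-field convention.
    """
    pairs = []
    for item in spec.split(";"):
        if not item:
            continue  # tolerate a trailing ';'
        tag, sep, type_char = item.partition(":")
        if not sep or len(tag) != 2 or len(type_char) != 1:
            raise ValueError("bad Tag:Type entry: %r" % item)
        pairs.append((tag, type_char))
    return pairs


print(parse_tag_spec("OP:i;MD:Z"))  # [('OP', 'i'), ('MD', 'Z')]
```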

−

~~=== Usage ===~~

−

./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--seq] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]

−

~~=== Return Value ===~~

−

* 0: all records are successfully read and written.

−

* non-0: an error occurred processing the parameters or reading one of the files.

−

~~=== Output Format ===~~

−

~~2 Output Formats:~~

−

~~# Diff Format~~

−

~~# BAM Format~~

−

~~==== Diff Format ====~~

−

~~There are 2 types of differences.~~

−

* ReadName/Fragment combo is in one file, but not in the other file within the window set by recPoolSize & posDiff

−

* ReadName/Fragment combo is in both files, but at least one of the specified fields to diff is different

−

~~Each difference output consists of 2 or 3 lines. If the record only appears in one of the files, the diff is 2 lines, if it appears in both files, the diff is 3 lines.~~

−

~~The first line of the difference output is just the read name.~~

−

The 2nd and 3rd line (if present) begin with either a '<' or a '>'. If the record is from the first file (--in1), it begins with a '<'. If the record is from the 2nd file (--in2), it begins with a '>'.

−

~~The 2nd line is the flag followed by the diff'd fields from one of the records.~~

−

~~The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.~~

−

~~The diff'd record lines are tab separated, and are in the following order if --onlyDiffs is not specified:~~

−

* '<' or '>'

−

* flag

−

* chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified

−

* cigar - if --noCigar is not specified

−

* sequence - if --seq is specified

−

* base quality - if --baseQual is specified

−

* tag:type:value - for each tag:type specified in --tags

−

* ...

−

* tag:type:value

−

~~If <code>onlyDiffs</code> is specified, only the fields that are specified and are different get printed in lines 2 & 3.~~
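The 2-or-3-line layout described above can be read back with a short script. This is only a sketch: the function name <code>parse_diff</code> and the sample text are made up for illustration, and it assumes that read names never begin with '<' or '>'.

```python
def parse_diff(text):
    """Group diff-format output into one entry per read name.

    Each entry holds 1 record (read found in only one file) or
    2 records (read found in both files with differing fields).
    """
    entries = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in "<>":
            if current is None:
                raise ValueError("record line before any read name")
            source = "in1" if line[0] == "<" else "in2"
            # Fields after the marker are tab separated: flag first.
            fields = line[1:].lstrip().split("\t")
            current["records"].append((source, fields[0], fields[1:]))
        else:
            current = {"read_name": line.strip(), "records": []}
            entries.append(current)
    return entries


# Made-up sample, not real bam output.
sample = "readA\n<\t63\t1:100\t5M\n>\t63\t1:104\t5M\nreadB\n>\t4\t*:0\t*\n"
for entry in parse_diff(sample):
    print(entry["read_name"], len(entry["records"]))
```

An entry with one record corresponds to the first type of difference (present in only one file), and an entry with two records to the second type.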

−

~~===== Example Output =====~~

−

~~Command:~~

−

~~../bin/bam diff --in1 testFiles/testDiff1.sam --in2 testFiles/testDiff2.sam --seq --baseQual --tags "OP:i;MD:Z" --onlyDiffs --out results/diffOrderSam.log~~

−

~~Output:~~

−

~~<pre>~~

−

~~18:462+29M5I3M:F:295~~

−

~~< a1 1:78~~

−

~~> a1 1:74~~

−

1

−

~~> a1 1:70 3S1M1S ACGTN ;46>> OP:i:75 MD:Z:30A0C5~~

−

2

−

~~> a1 1:72 3S1M1S ACGTN ;47>> OP:i:75 MD:Z:30A0C5~~

−

~~ABC~~

−

> cd *:0 * * *

−

~~DEF~~

−

> cd *:0 * * *

−

~~</pre>~~

−

~~==== SAM/BAM Format ====~~

−

~~Use a .sam/.bam extension on the output filename to output in SAM/BAM format instead of diff format.~~

−

~~In SAM/BAM format there will be 3 output files:~~

−

~~# the specified name with record diffs~~

−

~~# specified name with _only_<in1>.sam/bam with records only in the in1 file~~

−

~~# specified name with _only_<in2>.sam/bam with records only in the in2 file~~

−

When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file, using the following tags:

−

* ZF - Flag

−

* ZP - Pos

−

* ZC - Cigar

−

* ZS - Sequence

−

* ZQ - Base Quality

−

* ZT - Tags

−

~~== readReference ==~~

−

~~The <code>readReference</code> option on the bam executable prints the specified region of the reference sequence in an easy to read format.~~

−

~~=== Parameters ===~~

−

~~<pre>~~

−

~~Required Parameters:~~

−

~~--refFile : the reference~~

−

~~--refName : the SAM/BAM reference Name to read~~

−

~~--start : inclusive 0-based start position (defaults to -1)~~

−

~~Required Length Parameter (one but not both needs to be specified):~~

−

~~--end : exclusive 0-based end position (defaults to -1: meaning til the end of the reference)~~

−

~~--numBases : number of bases from start to display~~

−

~~--params : print the parameter settings~~

−

~~</pre>~~

−

~~=== Usage ===~~

−

~~./bam readReference --refFile <referenceFilename> --refName <reference Name> --start <0 based start> --end <0 based end>|--numBases <number of bases> [--params]~~

−

~~=== Return Value ===~~

−

* 0: the reference file was successfully read.

−

* non-0: the reference file was not successfully read.

−

~~=== Example Output ===~~

−

~~<pre>~~

−

~~</pre>~~

−

~~== stats ==~~

−

~~The <code>stats</code> option on the bam executable generates the specified statistics on a SAM/BAM file.~~

−

~~=== Parameters ===~~

−

~~<pre>~~

−

~~Required Parameters:~~

−

~~--in : the SAM/BAM file to calculate stats for~~

−

~~Types of Statistics that can be generated:~~

−

~~--basic : Turn on basic statistic generation~~

−

~~--qual : Generate a count for each quality (displayed as non-phred quality)~~

−

~~--phred : Generate a count for each quality (displayed as phred quality)~~

−

~~--baseQC : Write per base statistics to the specified file.~~

−

~~Optional Parameters:~~

−

~~--maxNumReads : Maximum number of reads to process~~

−

~~Defaults to -1 to indicate all reads.~~

−

~~--unmapped : Only process unmapped reads (requires a bamIndex file)~~

−

~~--bamIndex : The path/name of the bam index file~~

−

~~(if required and not specified, uses the --in value + ".bai")~~

−

~~--regionList : File containing the region list chr<tab>start_pos<tab>end_pos.~~

−

~~Positions are 0 based and the end_pos is not included in the region.~~

−

~~Uses bamIndex.~~

−

~~--minMapQual : The minimum mapping quality for filtering reads in the baseQC stats.~~

−

~~--dbsnp : The dbSnp file of positions to exclude from baseQC analysis.~~

−

~~--noeof : Do not expect an EOF block on a bam file.~~

−

~~--params : Print the parameter settings~~

−

~~</pre>~~

−

~~For all types of statistics, the bam file used is specified by <code>--in</code>.~~

−

~~The optional parameters are also used for all types of statistics.~~

−

~~Usage:~~

−

~~<pre>~~

−

./bam stats --in <inputFile> [--basic] [--qual] [--phred] [--baseQC <outputFileName>] [--maxNumReads <maxNum>] [--unmapped] [--bamIndex <bamIndexFile>] [--regionList <regFileName>] [--minMapQual <minMapQ>] [--dbsnp <dbsnpFile>] [--noeof] [--params]

−

~~</pre>~~

−

~~=== Types of Statistics ===~~

−

~~==== Basic ====~~

−

~~Prints summary statistics for the file:~~

−

*TotalReads - # of reads that are in the file

−

*MappedReads - # of reads marked mapped in the flag

−

*PairedReads - # of reads marked paired in the flag

−

*ProperPair - # of reads marked paired AND proper paired in the flag

−

*DuplicateReads - # of reads marked duplicate in the flag

−

*QCFailureReads - # of reads marked QC failure in the flag

−

*MappingRate(%) - # of reads marked mapped in the flag / TotalReads

−

*PairedReads(%) - # of reads marked paired in the flag / TotalReads

−

*ProperPair(%) - # of reads marked paired AND proper paired in the flag / TotalReads

−

*DupRate(%) - # of reads marked duplicate in the flag / TotalReads

−

*QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads

−

*TotalBases - # of bases in all reads

−

*BasesInMappedReads - # of bases in reads marked mapped in the flag
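To make the definitions above concrete, the following sketch computes the same counts from a list of (flag, read length) pairs. It is an illustration only, not the bam implementation: the function name <code>basic_stats</code> is hypothetical, the flag bit values are the standard ones from the SAM specification, and scaling the rates by 100 to obtain percentages is an assumption.

```python
# SAM flag bits (values from the SAM specification).
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
QC_FAILURE = 0x200
DUPLICATE = 0x400


def basic_stats(reads):
    """Summarize an iterable of (flag, read_length) tuples."""
    counts = {
        "TotalReads": 0, "MappedReads": 0, "PairedReads": 0,
        "ProperPair": 0, "DuplicateReads": 0, "QCFailureReads": 0,
        "TotalBases": 0, "BasesInMappedReads": 0,
    }
    for flag, length in reads:
        counts["TotalReads"] += 1
        counts["TotalBases"] += length
        if not flag & UNMAPPED:
            counts["MappedReads"] += 1
            counts["BasesInMappedReads"] += length
        if flag & PAIRED:
            counts["PairedReads"] += 1
            if flag & PROPER_PAIR:  # paired AND proper paired
                counts["ProperPair"] += 1
        if flag & DUPLICATE:
            counts["DuplicateReads"] += 1
        if flag & QC_FAILURE:
            counts["QCFailureReads"] += 1

    total = counts["TotalReads"]

    def pct(n):
        # Assumed scaling: count / TotalReads expressed as a percentage.
        return 100.0 * n / total if total else 0.0

    counts["MappingRate(%)"] = pct(counts["MappedReads"])
    counts["PairedReads(%)"] = pct(counts["PairedReads"])
    counts["ProperPair(%)"] = pct(counts["ProperPair"])
    counts["DupRate(%)"] = pct(counts["DuplicateReads"])
    counts["QCFailRate(%)"] = pct(counts["QCFailureReads"])
    return counts


stats = basic_stats([(99, 50), (4, 50), (1024, 100), (2, 30)])
print(stats["MappedReads"], stats["MappingRate(%)"])  # 3 75.0
```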

−

~~==== Qual/Phred ====~~

−

~~Prints a count of the number of times each quality value appears in the file.~~

−

*<code>phred</code> Displays Quality as phred integers [0-93]

−

*<code>qual</code> Displays Quality as non-phred integers (phred + 33) [33-126]
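The two displays differ only by the offset of 33. A minimal sketch of that relationship, and of counting quality values from a SAM quality string, is shown below; the function names are hypothetical and the example string is made up.

```python
from collections import Counter


def phred_to_qual(phred):
    """Map a phred integer [0-93] to the non-phred integer [33-126]."""
    if not 0 <= phred <= 93:
        raise ValueError("phred quality must be in [0, 93]")
    return phred + 33


def count_phred(quality_string):
    """Count how often each phred value occurs in a SAM quality string."""
    return Counter(ord(ch) - 33 for ch in quality_string)


print(phred_to_qual(20))    # 53
print(count_phred("II5!"))  # Counter({40: 2, 20: 1, 0: 1})
```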

−

~~==== BaseQC ====~~

−

~~'''This capability is coming soon, so these notes may be updated prior to it being completed...'''~~

−

~~Do we print stats for positions where the reference base is 'N'?? (any special note for those? Qplot would not count them in the depth.)~~

−

~~The <code>baseQC</code> option generates the following statistics:~~

−

~~For each position, the following counts are incremented if:~~

−

~~# a read spans the reference position (starts before or at this reference position and ends at or after this position)~~

−

~~# regardless of duplicate/qc failure/unmapped/mapping quality~~

−

~~# regardless of the CIGAR for this position (other than clips at the beginning/end which are not counted, but deletions and skips are counted)~~

−

*TotalReads(e6) - # of reads that span this position.

−

*DupRate(%) - # of reads marked duplicate in the flag / TotalReads

−

*QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads

−

*PairedReads(%) - # of reads marked paired in the flag / TotalReads

−

*ProperPaired(%) - # of reads marked paired AND proper paired in the flag / TotalReads

−

*MappedBases(e9) - # of reads marked mapped in the flag

−

*MappingRate(%) - # of reads marked mapped in the flag / TotalReads

−

*ZeroMapQual(%) - # of reads marked mapped in the flag AND have a Mapping Quality of 0 / TotalReads

−

*MapQual<10(%) - # of reads marked mapped in the flag AND have a Mapping Quality < 10 / TotalReads

−

*MapRate_MQpass(%) - # of reads marked mapped in the flag AND have a Mapping Quality >= a minimum Mapping Quality / TotalReads

−

~~For each position, the following counts are incremented if:~~

−

~~# a read spans the reference position (starts before or at this reference position and ends at or after this position)~~

−

~~# the read is NOT a duplicate, qc failure, unmapped, or mapped with a mapping quality less than the min~~

−

~~# the CIGAR for this position is a M/=/X (match/mismatch)~~

−

~~TBD - should it count if the read has a base of 'N'~~

−

*Depth - # of reads.

−

*Q20Bases(e9) - # of bases at this position with a base quality (from the read) of Q20 or higher.

−

*Q20BasesPct(%) - Q20Bases / Depth

Mktrost
