Difference between revisions of "BamUtil"

From Genome Analysis Wiki
(→‎Programs: Add comments to convert for switching between '=' and bases in the sequence.)
 
(43 intermediate revisions by one other user not shown)
Line 1: Line 1:
 
[[Category:bamUtil]]
 
[[Category:bamUtil]]
 +
[[Category:C++]]
 +
[[Category:Software]]
  
 
= bamUtil Overview =
 
= bamUtil Overview =
Line 6: Line 8:
  
  
== Where to Find It ==
+
== Getting Help ==
The bamUtil repository is available both via release downloads (coming soon) and via github.
 
  
On github, you can both browse and download the latest version of the repository as well as explore the history of changes.
+
If you have any questions, please use the [https://github.com/statgen/bamUtil bamUtil GitHub page] to raise an issue.
  
You can access the latest version with or without git.
+
See [[BamUtil: FAQ]] to see if your question has already been answered.
  
If you download from github or use git to keep up to date, you also need to download our library: [[C++ Library: libStatGen|libStatGen]].
+
== Where to Find It ==
 +
{{ToolGitRepo|repoName=bamUtil}}
  
The releases will be available both with and without libStatGen included.  If you download the version without libStatGen included, you will also need to download libStatGen separately.
+
== Releases ==
(It will be available without libStatGen in case you already have a downloaded version of libStatGen that you want to use.)
 
  
=== Using github ===
+
If you prefer to run the last official release rather than the latest development version, you can download that here.
Releases are '''Coming Soon'''.
 
  
 +
There are two versions of the release, one that includes libStatGen and one that does not.  If you already have libStatGen installed and want to use your own copy, use the version that does not include libStatGen.
  
=== Using github ===
+
=== Full Release (includes libStatGen) ===
  
==== Using Git To Track the Current Development Version ====
+
To install an official release, unpack the downloaded file (<code>tar xvf</code>), <code>cd</code> into the <code>bamUtil_x.x.x</code> directory, and type <code>make all</code>.
  
===== Clone (get your own copy) =====
+
For version 1.0.14 and later, please download libStatGen and bamUtil separately:  
You can create your own git clone (copy) using:
 
git clone https://github.com/statgen/bamUtil.git
 
or
 
git clone git://github.com/statgen/bamUtil.git
 
  
Either of these commands creates a directory called <code>bamUtil</code> in the current directory.
 
  
Then just <code>cd bamUtil</code> and [[BamUtil#Building|compile]].
+
'''Version 1.0.14 - Released 7/8/2015'''
 +
*[[LibStatGen Download#Official Releases|libStatGen version 1.0.14]]
 +
*[[#Release of just BamUtil (does not include libStatGen)|bamUtil version 1.0.14]]
  
===== Get the latest Updates (update your copy) =====
 
To update your copy to the latest version (a major advantage of using git):
 
# <code>cd pathToYourCopy/bamUtil</code>
 
# <code>make clean</code>
 
# <code>git pull</code>
 
# <code>make all</code>
 
  
=== Git Refresher ===
+
'''Older Releases'''
If you decide to use git, but need a refresher, see [[How To Use Git]] or [https://statgen.sph.umich.edu/wiki/How_To_Use_Git Notes on how to use git] (if you have access)
+
* [[Media:BamUtilLibStatGen.1.0.13.tgz|BamUtilLibStatGen.1.0.13.tgz‎]] - Released 2/20/2015
 +
** Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.13]] - see link for version updates
 +
** Contains: [[#Release of just BamUtil (does not include libStatGen)|bamUtil version 1.0.13]] - see link for version updates
  
  
==== Downloading From GitHub Without Git ====
+
* [[Media:BamUtilLibStatGen.1.0.12.tar.gz|BamUtilLibStatGen.1.0.12.tgz‎]] - Released 5/14/2014
Periodically download the latest copy from github from the "Downloads" link on the webpage: https://github.com/statgen/bamUtil/archives/master.
+
** Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.12]] - see link for version updates
 +
** Contains: [[#Release of just BamUtil (does not include libStatGen)|bamUtil version 1.0.12]] - see link for version updates
 +
** Adds regions to [[BamUtil: mergeBam|mergeBam]]
 +
** Accept ',' delimiters for the tags string input in [[BamUtil: squeeze|squeeze]], [[BamUtil: revert|revert]], & [[BamUtil: diff|diff]]
  
The downloaded tar file is named "statgen-bamUtil-someHexNumber.tar.gz". The directory created when it is untarred shares the same base name. I recommend that you do not change the name of the directory. If you want one called bamUtil, create a link to this directory. The hex number in the directory name identifies the version of the repository that you downloaded and is needed to troubleshoot any issues you encounter. If you must rename the directory, be sure to record the hex number that was on the download for future reference.
+
*[[Media:BamUtilLibStatGen.1.0.11.tar.gz|BamUtilLibStatGen.1.0.11.tar.gz‎]] - Released 2/28/2014
 +
** Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.11]] - see link for version updates
 +
** Contains: [[#Release of just BamUtil (does not include libStatGen)|bamUtil version 1.0.11]] - see link for version updates
 +
** Now properly supports 'B' & 'f' tags
 +
** Cleanup - compile issues
  
== Building ==
+
*[[Media:BamUtilLibStatGen.1.0.10.tar.gz|BamUtilLibStatGen.1.0.10.tar.gz‎]] - Released 1/2/2014
After obtaining the bamUtil repository (either by download or from github), compile the code using <code>make all</code>. This creates the executable, <code>bam</code>, in the <code>bamUtil/bin/</code> directory, the debug executable in the <code>bamUtil/bin/debug/</code> directory, and the profiling executable in the <code>bamUtil/bin/profile/</code> directory.
+
** Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.10]] - see link for version updates
 +
** Contains: [[#Release of just BamUtil (does not include libStatGen)|bamUtil version 1.0.10]] - see link for version updates
 +
** Adds PhoneHome/Version checking.  
  
 +
*[[Media:BamUtilLibStatGen.1.0.9.tgz|BamUtilLibStatGen.1.0.9.tgz‎]] - Released 7/7/2013
 +
** Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.9]]
 +
** Contains: [[#Release of just BamUtil (does not include libStatGen)|bamUtil version 1.0.9]]
 +
** Update to [[BamUtil: mergeBam|mergeBam]]
 +
*** Update to ignore PG lines with duplicate IDs
 +
*** Update to accept merges of matching RG lines
 +
*** Update to log to stderr if no log/out file is specified
 +
* There is no version 1.0.8.  It was skipped to stay in line with libStatGen versions (libStatGen 1.0.8 added vcf support)
 +
*[[Media:BamUtilLibStatGen.1.0.7.tgz|BamUtilLibStatGen.1.0.7.tgz‎]] - Released 1/29/2013
 +
** Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.7]]
 +
** Contains: [[#Release of just BamUtil (does not include libStatGen)|bamUtil version 1.0.7]]
 +
** Update to fix some compile issues on ubuntu 12.10
 +
** Update use of SamRecord::getStringTag to expect the return of a const string pointer due to libStatGen v1.0.7 updates
 +
** Update SamReferenceInfo usage due to libStatGen v1.0.7 updates
 +
** Update to [[BamUtil: diff|diff]]
 +
***  Fix DIFF to test and properly handle running out of available records.  Previously no message was printed when this happened and there was a bug for which file it freed
 +
** Update to [[BamUtil: clipOverlap|clipOverlap]]
 +
*** Update to facilitate adding other overlap handling functions
 +
** Update to [[BamUtil: mergeBam|mergeBam]] (formerly RGMergeBam)
 +
*** Rename RGMergeBam to MergeBam
 +
*** Update to handle files that already have an RG
  
 +
*[[Media:BamUtilLibStatGen.1.0.6.tgz|BamUtilLibStatGen.1.0.6.tgz‎]] - Released 11/14/2012
 +
** Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.6]]
 +
** Contains: [[#Release of just BamUtil (does not include libStatGen)|bamUtil version 1.0.6]]
 +
** Update to [[BamUtil: trimBam|trimBam]]
 +
*** Update to allow trimming a different number of bases from each end of the read
 +
*[[Media:BamUtilLibStatGen.1.0.5.tgz|BamUtilLibStatGen.1.0.5.tgz‎]] - Released 10/24/2012
 +
** Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.5]]
 +
** Contains: [[#Release of just BamUtil (does not include libStatGen)|bamUtil version 1.0.5]]
 +
** Updates to: [[BamUtil: dedup|dedup]], [[BamUtil: polishBam|polishBam]], [[BamUtil: recab|recab]]
 +
** Update to add compile option to compile without C++0x/C++11
 +
** See [[#Release of just BamUtil (does not include libStatGen)|below]] for additional details on updates
 +
*BamUtilLibStatGen.1.0.4.tgz - Release skipped
 +
*[[Media:BamUtilLibStatGen.1.0.3.tgz|BamUtilLibStatGen.1.0.3.tgz‎]] - Released 09/19/2012
 +
** Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.3]]
 +
** Contains: [[#Release of just BamUtil (does not include libStatGen)|bamUtil version 1.0.3]]
 +
** Adds: [[BamUtil: dedup|dedup]] [[BamUtil: recab|recab]]
 +
*[[Media:BamUtilLibStatGen.1.0.2.tgz|BamUtilLibStatGen.1.0.2.tgz‎]] - Released 05/16/2012
 +
** Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.2]]
 +
** Adds: [[BamUtil: bam2FastQ|bam2FastQ]]
 +
*[[Media:BamUtilLibStatGen.1.0.1.tgz|BamUtilLibStatGen.1.0.1.tgz‎]] - Released 05/04/2012
 +
** Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.1]]
 +
** Adds: [[BamUtil: splitBam|splitBam]], [[BamUtil: clipOverlap|clipOverlap]],  [[BamUtil: trimBam|trimBam]], [[BamUtil: polishBam|polishBam]], [[BamUtil: rgMergeBam|rgMergeBam]], [[BamUtil: gapInfo|gapInfo]]
 +
** Adds additional functionality to [[BamUtil: stats|stats]]
 +
** Adds leftShifting to [[BamUtil: writeRegion|writeRegion]] and [[BamUtil: convert|convert]]
 +
** Adds more diff fields to [[BamUtil: diff|diff]]
 +
* [[Media:BamUtilLibStatGen.1.0.0.tgz|BamUtilLibStatGen.1.0.0.tgz‎]] - Released 10/10/2011
 +
**Initial release of bamUtil that includes libStatGen version 1.0.0.  It started from the tool found in the deprecated StatGen repository.
 +
**Contains: [[LibStatGen Download#Official Releases|libStatGen version 1.0.0]] [[BamUtil: validate|validate]], [[BamUtil: convert|convert]], [[BamUtil: dumpHeader|dumpHeader]], [[BamUtil: splitChromosome|splitChromosome]], [[BamUtil: writeRegion|writeRegion]], [[BamUtil: dumpRefInfo|dumpRefInfo]], [[BamUtil: dumpIndex|dumpIndex]], [[BamUtil: readIndexedBam|readIndexedBam]], [[BamUtil: filter|filter]], [[BamUtil: readReference|readReference]], [[BamUtil: revert|revert]], [[BamUtil: diff|diff]], [[BamUtil: squeeze|squeeze]], [[BamUtil: findCigars|findCigars]], [[BamUtil: stats|stats]]
  
= Programs =
+
=== Release of just BamUtil (does not include libStatGen) ===
  
The software reads the beginning of an input file to determine if it is SAM/BAM.  To determine the format (SAM/BAM) of the output file, the software checks the output file's extension. If the extension is ".bam" it writes a BAM file, otherwise it writes a SAM file.
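The extension rule described above can be sketched in Python. This is only an illustration of the stated behavior, not bamUtil's own code: the function name <code>output_format</code> is hypothetical, and the sketch assumes the extension check is case sensitive, which the text does not specify.

```python
def output_format(filename):
    """Return "BAM" if the output name ends in ".bam", otherwise "SAM".

    Assumption: the comparison is case sensitive (not stated in the text).
    """
    return "BAM" if filename.endswith(".bam") else "SAM"


for name in ("result.bam", "result.sam", "result.txt"):
    print(name, "->", output_format(name))
```

Any extension other than ".bam", including no extension at all, falls through to SAM in this sketch, matching the "otherwise" in the description.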
+
To install an official release, unpack the downloaded file (<code>tar xvf</code>), <code>cd</code> into the <code>bamUtil_x.x.x</code> directory, and type <code>make all</code>.
  
The bam executable has the following functions.
+
'''BamUtil.1.0.14 Release Notes'''
* [[C++ Executable: bam#validate|validate - Read and Validate a SAM/BAM file]]
+
* BamUtil Version 1.0.14 - Released 7/8/2015
* [[BamUtil: convert|convert - Read a SAM/BAM file and write as a SAM/BAM file (optionally converts between '=' & bases in the sequence)]]
+
** https://github.com/statgen/bamUtil/archive/v1.0.14.tar.gz
* [[C++ Executable: bam#dumpHeader|dumpHeader - Print SAM/BAM header]]
+
** Requires, but does not include: [[LibStatGen Download#Official Releases|libStatGen version 1.0.14]]
* [[C++ Executable: bam#splitChromosome|splitChromosome - Split BAM by Chromosome]]
+
** Update [[BamUtil: trimBam|trimBam]]
* [[C++ Executable: bam#writeRegion|writeRegion - Write the alignments in the indexed BAM file that fall into the specified region]]
+
*** Add option to soft clip (-c) instead of trimming
* [[C++ Executable: bam#dumpRefInfo|dumpRefInfo - Print SAM/BAM Reference Information]]
+
** Update [[BamUtil: clipOverlap|clipOverlap]]
* [[C++ Executable: bam#dumpIndex|dumpIndex - Dump a BAM index file into an easy to read text version]]
+
*** Add option to mark reads as unmapped if they are entirely clipped
* [[C++ Executable: bam#readIndexedBam|readIndexedBam - Read an indexed BAM file reference by reference id -1 to the max reference id and write it out as a SAM/BAM file]]
+
** Update to [[BamUtil: bam2FastQ|bam2FastQ]]
* [[C++ Executable: bam#filter|filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high]]
+
*** Add option to gzip the output files
* [[C++ Executable: bam#readReference|readReference - Print the reference string for the specified region]]
+
*** Add option to split Read Groups into separate fastq files
* [[C++ Executable: bam#diff|diff - Print the diffs between 2 bams]]
+
*** Add option to get the quality from a tag
 +
** Update [[BamUtil: recab|recab]]
 +
*** Update to ignore ref 'N' when building the recalibration table
 +
*** Add ability to bin
 +
** Add Dedup_LowMem tool
  
This executable is built using [[C++ Library: libStatGen]].
+
'''Older Releases'''
 +
* BamUtil Version 1.0.13 - Released 2/20/2015
 +
** https://github.com/statgen/bamUtil/archive/v1.0.13.tar.gz
 +
** Requires, but does not include: [[LibStatGen Download#Official Releases|libStatGen version 1.0.13]]
 +
** Makefile Updates
 +
*** Improve logic to determine actual path for the library
 +
*** Update to append to USER_COMPILE_VARS even if specified on the command line
 +
** Update [[BamUtil: writeRegion|writeRegion]]
 +
*** Add option to specify readnames to keep in a file
 +
*** Fixed bug that if a read overlapped 2 BED positions, it was printed twice
 +
** Update to [[BamUtil: bam2FastQ|bam2FastQ]]
 +
*** Update to skip non-primary reads
 +
** Update to [[BamUtil: polishBam|polishBam]]
 +
*** Update to handle '\t' string inputs and to add CO option
 +
*** Fix MD5sum calculation to convert fasta to uppercase prior to calculating
  
Just running ./bam will print the Usage information for the bam executable.
+
* [[Media:BamUtil.1.0.12.tgz|BamUtil.1.0.12.tgz‎]] - Released 5/14/2014
 +
** Requires, but does not include: [[LibStatGen Download#Official Releases|libStatGen version 1.0.12]]
 +
** Update [[BamUtil: mergeBam|mergeBam]]
 +
*** Add a regions option
 +
** Update to [[BamUtil: squeeze|squeeze]], [[BamUtil: revert|revert]], [[BamUtil: diff|diff]]
 +
*** Also accept ',' instead of just ';' as the delimiter in the input tags string.
  
 +
* [[Media:BamUtil.1.0.11.tgz|BamUtil.1.0.11.tgz‎]] - Released 2/28/2014
 +
** Requires, but does not include: [[LibStatGen Download#Official Releases|libStatGen version 1.0.11]]
 +
*** Adds support for 'B' & 'f' tags that did not work properly before.
 +
** Update [[BamUtil: splitBam|splitBam]] & [[BamUtil: polishBam|polishBam]]
 +
*** Update to work properly if log & output file are not specified (no longer creates '.log')
 +
** Update Main dummy/example tool to indicate the correct tool
 +
** Update to [[BamUtil: bam2FastQ|bam2FastQ]], [[BamUtil: clipOverlap|clipOverlap]], [[BamUtil: filter|filter]], [[BamUtil: mergeBam|mergeBam]], [[BamUtil: splitBam|splitBam]], [[BamUtil: squeeze|squeeze]], [[BamUtil: stats|stats]]
 +
*** Cleanup usage/parameter descriptions
 +
** Update [[BamUtil: revert|revert]]
 +
*** Update compatibility with libStatGen due to 'B' & 'f' tag handling updates
 +
** Add tests for 'B' & 'f' tags
  
== validate ==
+
* [[Media:BamUtil.1.0.10.tar.gz|BamUtil.1.0.10.tar.gz‎]] - Released 1/2/2014
 +
** Requires, but does not include: [[LibStatGen Download#Official Releases|libStatGen version 1.0.10]]
 +
** All
 +
*** Add PhoneHome/version checking
 +
*** Make sub-program names case independent
 +
*** Fix Logger.cpp compiler warning
 +
** Adds: [[BamUtil: explainFlags|explainFlags]] - describes the SAM/BAM flags based on the flag value
 +
** Update to [[BamUtil: stats|stats]]
 +
*** Fix Stats to not try to process a record after it is out of the loop (it would already have been processed or is invalid)
 +
** Update to [[BamUtil: splitBam|splitBam]]
 +
*** fix description of --noeof option
 +
** Update to [[BamUtil: writeRegion|writeRegion]]
 +
*** add exclude/required flags
 +
** Update to [[BamUtil: dedup|dedup]] & [[BamUtil: recab|recab]]
 +
*** Ignore secondary reads for dedup and making the recalibration table.
 +
*** skip QC Failures
 +
*** add excludeFlags parameters
 +
** Update to [[BamUtil: clipOverlap|clipOverlap]]
 +
*** add exclude flags
 +
*** fix bug for readName sorted when a read is filtered due to flags
 +
*** add sorting validation
 +
** Update to [[BamUtil: bam2FastQ|bam2FastQ]]
 +
*** add --merge option to generate interleaved files.
 +
*** update to open the input file before opening the output files, so if there is an error, the outputs aren't opened
 +
** Update to [[BamUtil: mergeBam|mergeBam]]
 +
*** add option to ignore the RG PI field when checking headers
 +
*** add more informative header merge error messages
  
The <code>validate</code> option on the bam executable reads and validates a SAM/BAM file. This option is documented at: [[BamValidator]]
+
* [[Media:BamUtil.1.0.9.tgz|BamUtil.1.0.9.tgz‎]] - Released 7/7/2013
 +
** Requires, but does not include: [[LibStatGen Download#Official Releases|libStatGen version 1.0.9]] (version 1.0.7 should also work)
 +
** Update to [[BamUtil: mergeBam|mergeBam]]
 +
*** Update to ignore PG lines with duplicate IDs
 +
*** Update to accept merges of matching RG lines
 +
*** Update to log to stderr if no log/out file is specified
  
== dumpHeader ==
+
*[[Media:BamUtil.1.0.7.tgz|BamUtil.1.0.7.tgz‎]] - Released 1/29/2013
The <code>dumpHeader</code> option on the bam executable prints the header of the specified SAM/BAM file to cout.   
+
** Requires, but does not include: [[LibStatGen Download#Official Releases|libStatGen version 1.0.7]] or above
 +
** Update to fix some compile issues on ubuntu 12.10
 +
** Update use of SamRecord::getStringTag to expect the return of a const string pointer due to libStatGen v1.0.7 updates
 +
** Update SamReferenceInfo usage due to libStatGen v1.0.7 updates
 +
** Update to [[BamUtil: diff|diff]]
 +
***  Fix DIFF to test and properly handle running out of available records.  Previously no message was printed when this happened and there was a bug for which file it freed
 +
** Update to [[BamUtil: clipOverlap|clipOverlap]]
 +
*** Update to facilitate adding other overlap handling functions
 +
** Update to [[BamUtil: mergeBam|mergeBam]] (formerly RGMergeBam)
 +
*** Rename RGMergeBam to MergeBam
 +
*** Update to handle files that already have an RG
 +
*[[Media:BamUtil.1.0.6.tgz|BamUtil.1.0.6.tgz‎]] - Released 11/14/2012
 +
** Update to [[BamUtil: trimBam|trimBam]]
 +
*** Update to allow trimming a different number of bases from each end of the read
 +
*[[Media:BamUtil.1.0.5.tgz|BamUtil.1.0.5.tgz‎]] - Released 10/24/2012
 +
** Update to [[BamUtil: dedup|dedup]]
 +
*** Update logic for which pair to keep if they have the same quality
 +
** Update to [[BamUtil: polishBam|polishBam]]
 +
*** Update to print the number of successful header additions
 +
** Update to [[BamUtil: recab|recab]]
 +
*** Update to print the number of bases skipped due to the base quality
 +
** General Updates
 +
*** Update to add compile option to compile without C++0x/C++11
 +
*BamUtil.1.0.4.tgz - Release skipped
 +
*[[Media:BamUtil.1.0.3.tgz|BamUtil.1.0.3.tgz‎]] - Released 09/19/2012
 +
** Adds: [[BamUtil: dedup|dedup]] [[BamUtil: recab|recab]]
 +
** General Updates
 +
*** Update Logger to write to stderr if output is stdout
 +
** Update to [[BamUtil: stats|stats]]
 +
*** Add required/exclude flags
 +
*** Exclude Clips if excluding unmapped
 +
*** Add --withinRegion flag
 +
*** Update phred/qual counts to be uint64_t instead of int to avoid overflow
 +
** Update to [[BamUtil: validate|validate]]
 +
*** Detect header failures
 +
** Update to [[BamUtil: diff|diff]]
 +
*** Update to specify chromosome/pos in ZP as a string rather than int so both can be shown
 +
** Update to [[BamUtil: readReference|readReference]]
 +
*** Output error message if the reference name is not found
 +
** Update to [[BamUtil: splitChromosome|splitChromosome]]
 +
*** Update to actually split the chromosomes rather than being hard coded to output chromosome ids 0-22
 +
** Update Makefile to have cloneLib for cloning libStatGen
 +
*[[Media:BamUtil.1.0.2.tgz|BamUtil.1.0.2.tgz‎]] - Released 05/16/2012
 +
** Adds: [[BamUtil: bam2FastQ|bam2FastQ]]
 +
*[[Media:BamUtil.1.0.1.tgz|BamUtil.1.0.1.tgz‎]] - Released 05/04/2012
 +
** Adds: [[BamUtil: splitBam|splitBam]], [[BamUtil: clipOverlap|clipOverlap]],  [[BamUtil: trimBam|trimBam]], [[BamUtil: polishBam|polishBam]], [[BamUtil: rgMergeBam|rgMergeBam]], [[BamUtil: gapInfo|gapInfo]]
 +
** Adds additional functionality to [[BamUtil: stats|stats]]
 +
** Adds leftShifting to [[BamUtil: writeRegion|writeRegion]] and [[BamUtil: convert|convert]]
 +
** Adds more diff fields to [[BamUtil: diff|diff]]
 +
*[[Media:BamUtil.1.0.0.tgz|BamUtil.1.0.0.tgz‎]] - Released 10/10/2011
 +
**Initial release of just bamUtil.  It started from the tool found in the deprecated StatGen repository.
 +
**Contains: [[BamUtil: validate|validate]], [[BamUtil: convert|convert]], [[BamUtil: dumpHeader|dumpHeader]], [[BamUtil: splitChromosome|splitChromosome]], [[BamUtil: writeRegion|writeRegion]], [[BamUtil: dumpRefInfo|dumpRefInfo]], [[BamUtil: dumpIndex|dumpIndex]], [[BamUtil: readIndexedBam|readIndexedBam]], [[BamUtil: filter|filter]], [[BamUtil: readReference|readReference]], [[BamUtil: revert|revert]], [[BamUtil: diff|diff]], [[BamUtil: squeeze|squeeze]], [[BamUtil: findCigars|findCigars]], [[BamUtil: stats|stats]]
  
=== Parameters ===
+
== Citation ==
<pre>
+
If you use BamUtil, please cite our publication on GotCloud which includes BamUtil:  
    Required Parameters:
+
[http://genome.cshlp.org/content/early/2015/04/14/gr.176552.114.abstract Jun, Goo, et al. "An efficient and scalable analysis framework for variant extraction and refinement from population scale DNA sequence data." Genome research (2015): gr-176552.]
filename : the sam/bam filename whose header should be printed.
 
</pre>
 
  
=== Usage ===
 
  
./bam dumpHeader <inputFile>
+
= Programs =
 
 
=== Return Value ===
 
*    0: the header was successfully read and printed.
 
* non-0: the header was not successfully read or was not printed.  (Returns the SamStatus.)
 
 
 
 
 
=== Example Output ===
 
<pre>
@SQ SN:1 LN:247249719
@SQ SN:2 LN:242951149
@SQ SN:3 LN:199501827
</pre>
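Header output like the example above is easy to post-process. The following Python sketch collects the sequence name (SN) and length (LN) from each @SQ line. The helper name <code>parse_sq_lines</code> is hypothetical, and the sketch splits on any whitespace so that it handles both tab-separated SAM headers and the space-separated rendering shown here.

```python
def parse_sq_lines(header_text):
    """Map each @SQ sequence name (SN) to its length (LN)."""
    lengths = {}
    for line in header_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "@SQ":
            continue  # ignore blank lines and other header record types
        # Each remaining field looks like TAG:VALUE; split on the first colon only.
        tags = dict(field.split(":", 1) for field in fields[1:] if ":" in field)
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


example = """@SQ SN:1 LN:247249719
@SQ SN:2 LN:242951149
@SQ SN:3 LN:199501827"""
print(parse_sq_lines(example))
```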
 
 
 
 
 
== splitChromosome ==
 
 
 
The <code>splitChromosome</code> option on the bam executable splits an indexed BAM file into multiple files based on the Chromosome (Reference Name). 
 
 
 
The output files all share the same base name, with _# appended, where # is the associated reference id from the BAM file.
 
 
 
=== Parameters ===
 
<pre>
    Required Parameters:
        --in       : the BAM file to be split
        --out      : the base filename for the SAM/BAM files to write into.  Does not include the extension.
                     _N will be appended to the basename where N indicates the Chromosome.
    Optional Parameters:
        --noeof    : do not expect an EOF block on a bam file.
        --bamIndex : the path/name of the bam index file
                     (if not specified, uses the --in value + ".bai")
        --bamout   : write the output files in BAM format (default).
        --samout   : write the output files in SAM format.
        --params   : print the parameter settings
</pre>
 
 
 
=== Usage ===
 
 
 
./bam splitChromosome --in <inputFilename>  --out <outputFileBaseName> [--bamIndex <bamIndexFile>] [--noeof] [--bamout|--samout] [--params]
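As a sketch of how the usage line above might be driven from a script, the Python helpers below assemble the argument list and predict an output file name. Both function names are hypothetical. The ".bam"/".sam" extension on the output name is an assumption inferred from <code>--bamout</code>/<code>--samout</code>; the text only says that _N is appended to the base name.

```python
def split_chromosome_command(in_bam, out_base, bam_index=None, sam_out=False):
    """Build the argument list for ./bam splitChromosome (nothing is executed)."""
    args = ["./bam", "splitChromosome", "--in", in_bam, "--out", out_base]
    if bam_index is not None:
        # When omitted, the tool is documented to use the --in value + ".bai".
        args += ["--bamIndex", bam_index]
    if sam_out:
        args.append("--samout")  # BAM output is the documented default
    return args


def expected_output_name(out_base, ref_id, sam_out=False):
    """Guess the per-reference file name: base name, "_", reference id, extension."""
    extension = ".sam" if sam_out else ".bam"  # assumed, not confirmed by the text
    return f"{out_base}_{ref_id}{extension}"


print(split_chromosome_command("sample.bam", "split/sample"))
print(expected_output_name("split/sample", 0))
```

The argument list can be passed to <code>subprocess.run</code> once the <code>bam</code> executable has been built.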
 
 
 
 
 
=== Return Value ===
 
*    0: all records are successfully read and written.
 
* non-0: at least one record was not successfully read or written.
 
 
 
=== Example Output ===
 
<pre>
Reference ID -1 has 2 records
Reference ID 0 has 5 records
Reference ID 1 has 2 records
Reference ID 2 has 1 records
Reference ID 3 has 0 records
Reference ID 4 has 0 records
Reference ID 5 has 0 records
Reference ID 6 has 0 records
Reference ID 7 has 0 records
Reference ID 8 has 0 records
Reference ID 9 has 0 records
Reference ID 10 has 0 records
Reference ID 11 has 0 records
Reference ID 12 has 0 records
Reference ID 13 has 0 records
Reference ID 14 has 0 records
Reference ID 15 has 0 records
Reference ID 16 has 0 records
Reference ID 17 has 0 records
Reference ID 18 has 0 records
Reference ID 19 has 0 records
Reference ID 20 has 0 records
Reference ID 21 has 0 records
Reference ID 22 has 0 records
Number of records = 10
Returning: 0 (SUCCESS)
</pre>
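The per-reference counts in the example output add up to the reported total (2 + 5 + 2 + 1 = 10). A small Python sketch can make that check automatic. The function name <code>summarize_split_output</code> is hypothetical, and the line patterns are taken from the example above rather than from a documented output format.

```python
import re

RECORD_LINE = re.compile(r"Reference ID (-?\d+) has (\d+) records")
TOTAL_LINE = re.compile(r"Number of records = (\d+)")


def summarize_split_output(output_text):
    """Return ({reference id: record count}, reported total or None)."""
    per_reference = {}
    total = None
    for line in output_text.splitlines():
        match = RECORD_LINE.search(line)
        if match:
            per_reference[int(match.group(1))] = int(match.group(2))
            continue
        match = TOTAL_LINE.search(line)
        if match:
            total = int(match.group(1))
    return per_reference, total


sample_output = """Reference ID -1 has 2 records
Reference ID 0 has 5 records
Reference ID 1 has 2 records
Reference ID 2 has 1 records
Reference ID 3 has 0 records
Number of records = 10
Returning: 0 (SUCCESS)"""

counts, reported_total = summarize_split_output(sample_output)
print(sum(counts.values()) == reported_total)
```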
 
 
 
 
 
== writeRegion ==
 
 
 
The <code>writeRegion</code> option on the bam executable writes the alignments in the indexed BAM file that fall into the specified region (reference id and start/end position).
 
 
 
=== Parameters ===
 
<pre>
    Required Parameters:
        --in       : the BAM file to be read
        --out      : the SAM/BAM file to write to
    Optional Parameters:
        --noeof    : do not expect an EOF block on a bam file.
        --bamIndex : the path/name of the bam index file
                     (if not specified, uses the --in value + ".bai")
        --refName  : the BAM reference Name to read (either this or refID can be specified)
        --refID    : the BAM reference ID to read (defaults to -1: unmapped)
        --start    : inclusive 0-based start position (defaults to -1)
        --end      : exclusive 0-based end position (defaults to -1: meaning til the end of the reference)
        --params   : print the parameter settings
</pre>
 
 
 
=== Usage ===
 
 
 
./bam writeRegion --in <inputFilename>  --out <outputFilename> [--bamIndex <bamIndexFile>] [--noeof] [--refName <reference Name> | --refID <reference ID>] [--start <0-based start pos>] [--end <0-based end position>] [--params]
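Because <code>--start</code> is inclusive and <code>--end</code> is exclusive, both 0-based, a region given in the common 1-based inclusive form needs its start shifted by one while its end stays the same. The Python sketch below shows that conversion while building the argument list. The function name and its parameters are hypothetical, and nothing is executed.

```python
def write_region_command(in_bam, out_file, ref_name, start_1based, end_1based):
    """Build ./bam writeRegion arguments for a 1-based inclusive region."""
    start = start_1based - 1  # 1-based inclusive -> 0-based inclusive
    end = end_1based          # 1-based inclusive end == 0-based exclusive end
    return [
        "./bam", "writeRegion",
        "--in", in_bam,
        "--out", out_file,
        "--refName", ref_name,
        "--start", str(start),
        "--end", str(end),
    ]


# Bases 100 through 200 (1-based, inclusive) on reference "1".
print(" ".join(write_region_command("in.bam", "out.sam", "1", 100, 200)))
```

The half-open interval still covers end - start = 101 bases, the same count as the original 1-based inclusive range.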
 
 
=== Return Value ===
 
*    0: all records are successfully read and written.
 
* non-0: at least one record was not successfully read or written.
 
 
 
=== Example Output ===
 
<pre>
Wrote t.sam with 2 records.
</pre>
 
 
 
 
 
== dumpRefInfo ==
 
The <code>dumpRefInfo</code> option on the bam executable prints the SAM/BAM file's reference information.
 
 
 
=== Parameters ===
 
<pre>
    Required Parameters:
        --in              : the SAM/BAM file to be read
    Optional Parameters:
        --noeof           : do not expect an EOF block on a bam file.
        --printRecordRefs : print the reference information for the records in the file (grouped by reference).
        --params          : print the parameter settings
</pre>
 
 
 
=== Usage ===
 
./bam dumpRefInfo --in <inputFilename> [--noeof] [--printRecordRefs] [--params]
 
 
 
=== Return Value ===
 
*    0: the file was processed successfully.
 
* non-0: the file was not processed successfully.
 
 
 
 
 
== dumpIndex ==
 
The <code>dumpIndex</code> option on the bam executable prints a BAM index file in an easy-to-read format.
 
 
 
=== Parameters ===
 
<pre>
    Required Parameters:
        --bamIndex : the path/name of the bam index file to display
    Optional Parameters:
        --refID    : the reference ID to read, defaults to print all
        --summary  : only print a summary - 1 line per reference.
        --params   : print the parameter settings
</pre>
 
 
 
=== Usage ===
 
./bam dumpIndex --bamIndex <bamIndexFile> [--refID <ref#>] [--summary] [--params]
 
 
 
=== Return Value ===
 
*    0: the BAM index file was processed successfully.
 
* non-0: the BAM index file was not processed successfully.
 
 
 
 
 
== readIndexedBam ==
 
The <code>readIndexedBam</code> option on the bam executable reads an indexed BAM file one reference id at a time, from -1 (unmapped) up to the max reference id, and writes it out as a SAM/BAM file.
 
 
 
=== Parameters ===
 
<pre>
Required Parameters:
inputFilename      - path/name of the input BAM file
outputFile.sam/bam - path/name of the output file
bamIndexFile       - path/name of the BAM index file
</pre>
 
 
 
=== Usage ===
 
./bam readIndexedBam <inputFilename> <outputFile.sam/bam> <bamIndexFile>
 
 
 
=== Return Value ===
 
* 0
 
 
 
== filter ==
 
 
 
The <code>filter</code> option on the bam executable filters the reads in a SAM/BAM file.  This option is documented at: [[Bam Executable: Filter]]
 
 
 
== diff ==
 
<span style="color:#D2691E">'''***Coming Soon***'''</span>
 
 
 
The <code>diff</code> option on the bam executable prints the difference between two coordinate sorted SAM/BAM files.  This can be used to compare the outputs of running a SAM/BAM through different tools/versions of tools.
 
 
 
The <code>diff</code> tool compares records that have the same Read Name and Fragment (from the flag).  If a matching ReadName & Fragment is not found, the record is considered to be different.
 
 
 
<code>diff</code> assumes the files are coordinate sorted and uses this assumption for determining how long to store a record before determining that the other file does not contain a matching ReadName/Fragment. If the files are not coordinate sorted, this logic does not work.
 
 
 
By default, just the chromosome/position and cigar are compared for each record.
 
 
 
Options are available to:
* compare the sequence
* compare the base quality
* compare specified tags
* turn off position comparison
* turn off cigar comparison
 
 
 
=== Parameters ===
 
<pre>
Required Parameters:
--in1         : first coordinate sorted SAM/BAM file to be diffed
--in2         : second coordinate sorted SAM/BAM file to be diffed
Optional Parameters:
--out         : output filename, use .sam/.bam extension to output in SAM/BAM format instead of diff format.
                In SAM/BAM format there will be 3 output files:
                    1) the specified name with record diffs
                    2) specified name with _only_<in1>.sam/bam with records only in the in1 file
                    3) specified name with _only_<in2>.sam/bam with records only in the in2 file
--seq         : diff the sequence bases.
--baseQual    : diff the base qualities.
--tags        : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type...
--noCigar     : do not diff the cigars.
--noPos       : do not diff the positions.
--onlyDiffs   : only print the fields that are different, otherwise for any diff all the fields that are compared are printed.
--recPoolSize : number of records to allow to be stored at a time, default value: 1000000
--posDiff     : max base pair difference between possibly matching records, default value: 100000
--noeof       : do not expect an EOF block on a bam file.
--params      : print the parameter settings
</pre>
 
 
 
=== Usage ===
 
./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--seq] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]
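The <code>--tags</code> value is a ';'-separated list of Tag:Type pairs. The Python sketch below builds that string and a matching argument list. The helper names <code>format_tags</code> and <code>diff_command</code> are hypothetical, only a subset of the documented options is covered, and nothing is executed.

```python
def format_tags(tags):
    """Join (tag, type) pairs into the Tag:Type;Tag:Type form used by --tags."""
    return ";".join(f"{tag}:{tag_type}" for tag, tag_type in tags)


def diff_command(in1, in2, out=None, tags=(), seq=False, base_qual=False,
                 only_diffs=False):
    """Build the argument list for ./bam diff from a few of its options."""
    args = ["./bam", "diff", "--in1", in1, "--in2", in2]
    if out is not None:
        args += ["--out", out]
    if seq:
        args.append("--seq")
    if base_qual:
        args.append("--baseQual")
    if tags:
        args += ["--tags", format_tags(tags)]
    if only_diffs:
        args.append("--onlyDiffs")
    return args


print(format_tags([("OP", "i"), ("MD", "Z")]))
print(diff_command("a.sam", "b.sam", tags=[("OP", "i"), ("MD", "Z")], seq=True))
```

Passing the arguments as a list avoids shell quoting problems with the ';' separator, which a shell would otherwise treat as a command terminator.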
 
 
 
=== Return Value ===
 
* 0: all records are successfully read and written.
 
* non-0: an error occurred processing the parameters or reading one of the files.
 
 
 
=== Output Format ===
 
2 Output Formats:
 
# Diff Format
 
# SAM/BAM Format
 
 
 
==== Diff Format ====
 
There are 2 types of differences.
 
* ReadName/Fragment combo is in one file, but not in the other file within the window set by recPoolSize & posDiff
 
* ReadName/Fragment combo is in both files, but at least one of the specified fields to diff is different
 
 
 
Each difference output consists of 2 or 3 lines.  If the record only appears in one of the files, the diff is 2 lines; if it appears in both files, the diff is 3 lines.
 
 
 
The first line of the difference output is just the read name.
 
 
 
The 2nd and 3rd line (if present) begin with either a '<' or a '>'.  If the record is from the first file (--in1), it begins with a '<'.  If the record is from the 2nd file (--in2), it begins with a '>'.
 
 
 
The 2nd line is the flag followed by the diff'd fields from one of the records.
 
 
 
The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.
 
 
 
 
 
The diff'd record lines are tab separated, and are in the following order if --onlyDiffs is not specified:
 
* '<' or '>'
 
* flag
 
* chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
 
* cigar - if --noCigar is not specified
 
* sequence - if --seq is specified
 
* base quality - if --baseQual is specified
 
* tag:type:value - for each tag:type specified in --tags
 
* ...
 
* tag:type:value
 
 
 
If <code>--onlyDiffs</code> is specified, only the fields that are selected for comparison and actually differ are printed in lines 2 & 3.
 
 
 
===== Example Output =====
 
Command:
 
../bin/bam diff --in1 testFiles/testDiff1.sam --in2 testFiles/testDiff2.sam --seq --baseQual --tags "OP:i;MD:Z" --onlyDiffs --out results/diffOrderSam.log
 
 
 
Output:
 
<pre>
 
18:462+29M5I3M:F:295
 
< a1 1:78
 
> a1 1:74
 
1
 
> a1 1:70 3S1M1S ACGTN ;46>> OP:i:75 MD:Z:30A0C5
 
2
 
> a1 1:72 3S1M1S ACGTN ;47>> OP:i:75 MD:Z:30A0C5
 
ABC
 
> cd *:0 * * *
 
DEF
 
> cd *:0 * * *
 
</pre>
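
The diff format described above can be read back with a few lines of code. The following is a minimal Python sketch and is not part of bamUtil: the function name <code>parse_diff</code> and the dictionary keys are hypothetical. It assumes the record lines are tab separated as described, and it treats any line that does not start with '<' or '>' as a read name (so a read name that itself begins with one of those characters would be misread).

```python
def parse_diff(lines):
    """Group diff-format lines into one entry per read name."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line[0] in "<>":
            if not entries:
                raise ValueError("record line found before any read name")
            fields = line.split("\t")
            entries[-1]["records"].append({
                # '<' marks a record from --in1, '>' a record from --in2
                "file": "in1" if line[0] == "<" else "in2",
                "flag": fields[1] if len(fields) > 1 else "",
                "fields": fields[2:],
            })
        else:
            entries.append({"read_name": line, "records": []})
    return entries


# Made-up input in the layout described above (not real bamUtil output).
sample = [
    "readA",
    "<\ta1\t1:78",
    ">\ta1\t1:74",
    "readB",
    ">\tcd\t*:0\t*",
]
for entry in parse_diff(sample):
    kind = "differs" if len(entry["records"]) == 2 else "only in one file"
    print(entry["read_name"], kind)
```

An entry with two records corresponds to a read found in both files with differing fields; an entry with one record corresponds to a read found in only one file.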
 
 
 
==== SAM/BAM Format ====

Use a .sam/.bam extension to output in SAM/BAM format instead of diff format.
 
 
 
In SAM/BAM format there will be 3 output files:
 
# the specified name with record diffs
 
# specified name with _only_<in1>.sam/bam with records only in the in1 file
 
# specified name with _only_<in2>.sam/bam with records only in the in2 file
 
 
 
When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file, using the following tags:
 
* ZF - Flag
 
* ZP - Pos
 
* ZC - Cigar
 
* ZS - Sequence
 
* ZQ - Base Quality
 
* ZT - Tags
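
As a rough illustration of how these tags might be inspected in a SAM-format diff file, here is a Python sketch. It is not part of bamUtil, and the function name <code>second_file_values</code> is hypothetical. It relies only on the SAM convention that optional fields follow the 11 mandatory columns and are written as TAG:TYPE:VALUE; the record in the example is made up, and the tag types shown in it are assumptions.

```python
DIFF_TAGS = {
    "ZF": "flag",
    "ZP": "pos",
    "ZC": "cigar",
    "ZS": "sequence",
    "ZQ": "base quality",
    "ZT": "tags",
}


def second_file_values(sam_line):
    """Return {field name: value from the second file} for one SAM record line."""
    columns = sam_line.rstrip("\n").split("\t")
    values = {}
    for optional in columns[11:]:  # optional fields follow the 11 mandatory columns
        tag, _type, value = optional.split(":", 2)  # value may itself contain ':'
        if tag in DIFF_TAGS:
            values[DIFF_TAGS[tag]] = value
    return values


# Hypothetical record for illustration only; the tag types (Z) are assumed.
record = "read1\t0\t1\t75\t30\t5M\t*\t0\t0\tACGTN\t;46>>\tZP:Z:1:79\tZC:Z:4M1S"
print(second_file_values(record))
```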
 
 
 
== readReference ==
 
The <code>readReference</code> option on the bam executable prints the specified region of the reference sequence in an easy to read format.
 
 
 
=== Parameters ===
 
<pre>
 
    Required Parameters:
 
        --refFile  : the reference
 
        --refName  : the SAM/BAM reference Name to read
 
        --start    : inclusive 0-based start position (defaults to -1)
 
    Required Length Parameter (one but not both needs to be specified):
 
        --end      : exclusive 0-based end position (defaults to -1: meaning until the end of the reference)
 
        --numBases : number of bases from start to display
 
        --params  : print the parameter settings
 
</pre>
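
The coordinate convention above (inclusive 0-based start, exclusive 0-based end) matches half-open slicing. The Python sketch below illustrates it on a toy string; it is not bamUtil code, <code>resolve_region</code> is a hypothetical helper, and it assumes that <code>--numBases N</code> is equivalent to an end of start + N. The default of -1 for <code>--end</code> is not modeled.

```python
def resolve_region(start, end=None, num_bases=None):
    """Return the 0-based half-open interval (start, end) to display."""
    if (end is None) == (num_bases is None):
        raise ValueError("specify exactly one of end or num_bases")
    if num_bases is not None:
        end = start + num_bases  # assumption: numBases counts from start
    return start, end


reference = "ACGTACGTAC"  # toy sequence, not real data
start, end = resolve_region(2, num_bases=4)
print(reference[start:end])  # the bases at 0-based positions 2, 3, 4 and 5
```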
 
 
 
=== Usage ===
 
./bam readReference --refFile <referenceFilename> --refName <reference Name> --start <0 based start> --end <0 based end>|--numBases <number of bases> [--params]
 
 
 
=== Return Value ===
 
*    0: the reference file was successfully read.
 
* non-0: the reference file was not successfully read.
 
 
 
=== Example Output ===
 
<pre>
 
 
 
</pre>
 
 
 
== stats ==
 
The <code>stats</code> option on the bam executable generates the specified statistics on a SAM/BAM file.
 
 
 
=== Parameters ===
 
<pre>
 
Required Parameters:
 
--in : the SAM/BAM file to calculate stats for
 
Types of Statistics that can be generated:
 
--basic      : Turn on basic statistic generation
 
--qual        : Generate a count for each quality (displayed as non-phred quality)
 
--phred      : Generate a count for each quality (displayed as phred quality)
 
--baseQC      : Write per base statistics to the specified file.
 
Optional Parameters:
 
--maxNumReads : Maximum number of reads to process
 
                Defaults to -1 to indicate all reads.
 
--unmapped    : Only process unmapped reads (requires a bamIndex file)
 
--bamIndex    : The path/name of the bam index file
 
                (if required and not specified, uses the --in value + ".bai")
 
--regionList  : File containing the region list chr<tab>start_pos<tab>end_pos.
 
                Positions are 0 based and the end_pos is not included in the region.
 
                Uses bamIndex.
 
--minMapQual  : The minimum mapping quality for filtering reads in the baseQC stats.
 
--dbsnp      : The dbSnp file of positions to exclude from baseQC analysis.
 
--noeof      : Do not expect an EOF block on a bam file.
 
--params      : Print the parameter settings
 
</pre>
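
Because the <code>--regionList</code> positions are 0 based with the end position excluded, ranges written in the common 1-based inclusive style need to be shifted before they go into the file. A small Python sketch of that conversion follows; <code>region_line</code> is a hypothetical helper, not part of bamUtil.

```python
def region_line(chrom, first_base, last_base):
    """Format a 1-based inclusive range as a 0-based, end-exclusive region line."""
    if first_base < 1 or last_base < first_base:
        raise ValueError("expected 1 <= first_base <= last_base")
    # Start shifts down by one; the exclusive end equals the inclusive last base.
    return f"{chrom}\t{first_base - 1}\t{last_base}"


# Bases 101 through 200 of chromosome "1" in 1-based inclusive terms.
print(region_line("1", 101, 200))
```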
 
 
 
For all types of statistics, the bam file used is specified by <code>--in</code>.
 
 
 
The optional parameters are also used for all types of statistics.
 
 
 
Usage:
 
<pre>
 
./bam stats --in <inputFile> [--basic] [--qual] [--phred] [--baseQC <outputFileName>] [--maxNumReads <maxNum>] [--unmapped] [--bamIndex <bamIndexFile>] [--regionList <regFileName>] [--minMapQual <minMapQ>] [--dbsnp <dbsnpFile>] [--noeof] [--params]
 
</pre>
 
 
 
 
 
 
 
=== Types of Statistics ===
 
 
 
==== Basic ====
 
Prints summary statistics for the file:
 
*TotalReads - # of reads that are in the file
 
*MappedReads - # of reads marked mapped in the flag
 
*PairedReads - # of reads marked paired in the flag
 
*ProperPair - # of reads marked paired AND proper paired in the flag
 
*DuplicateReads - # of reads marked duplicate in the flag
 
*QCFailureReads - # of reads marked QC failure in the flag
 
*MappingRate(%) - # of reads marked mapped in the flag / TotalReads
 
*PairedReads(%) - # of reads marked paired in the flag / TotalReads
 
*ProperPair(%) - # of reads marked paired AND proper paired in the flag / TotalReads
 
*DupRate(%) - # of reads marked duplicate in the flag / TotalReads
 
*QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
 
*TotalBases - # of bases in all reads
 
*BasesInMappedReads - # of bases in reads marked mapped in the flag
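
The definitions above can be sketched in Python as follows. This is an illustration only, not bamUtil's implementation: <code>basic_stats</code> is a hypothetical function, the flag bit values come from the SAM specification, and the rates are assumed to be reported as percentages (count / TotalReads × 100).

```python
PAIRED, PROPER_PAIR, UNMAPPED = 0x1, 0x2, 0x4
QC_FAIL, DUPLICATE = 0x200, 0x400


def basic_stats(reads):
    """Summarize an iterable of (flag, read_length) tuples."""
    s = dict.fromkeys(
        ["TotalReads", "MappedReads", "PairedReads", "ProperPair",
         "DuplicateReads", "QCFailureReads", "TotalBases", "BasesInMappedReads"],
        0,
    )
    for flag, length in reads:
        mapped = not (flag & UNMAPPED)
        paired = bool(flag & PAIRED)
        s["TotalReads"] += 1
        s["MappedReads"] += mapped
        s["PairedReads"] += paired
        s["ProperPair"] += paired and bool(flag & PROPER_PAIR)
        s["DuplicateReads"] += bool(flag & DUPLICATE)
        s["QCFailureReads"] += bool(flag & QC_FAIL)
        s["TotalBases"] += length
        if mapped:
            s["BasesInMappedReads"] += length
    total = s["TotalReads"]
    rates = [("MappingRate(%)", "MappedReads"), ("PairedReads(%)", "PairedReads"),
             ("ProperPair(%)", "ProperPair"), ("DupRate(%)", "DuplicateReads"),
             ("QCFailRate(%)", "QCFailureReads")]
    for rate, count in rates:
        s[rate] = 100.0 * s[count] / total if total else 0.0
    return s


# Four made-up reads: (flag, read length).
example = [(0x1 | 0x2, 100), (0x1 | 0x4, 100), (0x400, 50), (0x200 | 0x4, 50)]
for name, value in basic_stats(example).items():
    print(name, value)
```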
 
 
 
 
 
 
 
==== Qual/Phred ====
 
Prints a count of the number of times each quality value appears in the file.
 
*<code>phred</code> Displays Quality as phred integers [0-93]
 
*<code>qual</code>  Displays Quality as non-phred integers (phred + 33) [33-126]
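
The two displays differ only by the offset of 33. A short Python sketch of the counting, assuming the quality strings are the ASCII-encoded values as they appear in a SAM file (<code>quality_counts</code> is a hypothetical name, not a bamUtil function):

```python
from collections import Counter


def quality_counts(quality_strings, phred=True):
    """Count quality values, as phred (0-93) or as non-phred (phred + 33)."""
    counts = Counter()
    for qual in quality_strings:
        for char in qual:
            value = ord(char)  # non-phred value, e.g. ';' is 59
            counts[value - 33 if phred else value] += 1
    return counts


print(sorted(quality_counts([";46>>"]).items()))
print(sorted(quality_counts([";46>>"], phred=False).items()))
```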
 
 
 
 
 
==== BaseQC ====
 
'''This capability is coming soon, so these notes may be updated prior to it being completed...'''
 
 
 
Open question: should stats be printed for positions where the reference base is 'N'?  (Is any special note needed for those?  Qplot would not count them in the depth.)
 
 
 
The <code>baseQC</code> option generates the following statistics:
 
 
 
For each position, the following counts are incremented if:
 
# a read spans the reference position (starts before or at this reference position and ends at or after this position)
 
# regardless of duplicate/qc failure/unmapped/mapping quality
 
# regardless of the CIGAR for this position (other than clips at the beginning/end which are not counted, but deletions and skips are counted)
 
*TotalReads(e6) - # of reads that span this position.
 
*DupRate(%) - # of reads marked duplicate in the flag / TotalReads
 
*QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
 
*PairedReads(%) - # of reads marked paired in the flag / TotalReads
 
*ProperPaired(%) - # of reads marked paired AND proper paired in the flag / TotalReads
 
*MappedBases(e9) - # of reads marked mapped in the flag
 
*MappingRate(%) - # of reads marked mapped in the flag / TotalReads
 
*ZeroMapQual(%) - # of reads marked mapped in the flag AND have a Mapping Quality of 0 / TotalReads
 
*MapQual<10(%) - # of reads marked mapped in the flag AND have a Mapping Quality < 10 / TotalReads
 
*MapRate_MQpass(%) - # of reads marked mapped in the flag AND have a Mapping Quality >= a minimum Mapping Quality / TotalReads
 
  
  
For each position, the following counts are incremented if:
# a read spans the reference position (starts before or at this reference position and ends at or after this position)
 
# the read is NOT a duplicate, qc failure, unmapped, or mapped with a mapping quality less than the min
 
# the CIGAR for this position is a M/=/X (match/mismatch)
 
TBD - should it count if the read has a base of 'N'
 
*Depth - # of reads. 
 
*Q20Bases(e9) - # of bases at this position with a base quality (from the read) of Q20 or higher.
 
*Q20BasesPct(%) - Q20Bases / Depth
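
The rules for Depth and Q20Bases can be sketched in Python as shown here. Since this capability is described as not yet complete, the code only illustrates the stated rules and should not be read as bamUtil's implementation. All names are hypothetical. It assumes 0-based positions, Phred+33 quality strings, and Q20BasesPct reported as a percentage; reads with a base of 'N' are counted, which the notes above leave open.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
UNMAPPED, QC_FAIL, DUPLICATE = 0x4, 0x200, 0x400


def query_index_at(ref_pos, read_start, cigar):
    """Index into the read of the base aligned (M/=/X) to ref_pos, else None."""
    ref, query = read_start, 0
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":      # consumes both reference and read
            if ref <= ref_pos < ref + length:
                return query + (ref_pos - ref)
            ref += length
            query += length
        elif op in "DN":     # consumes reference only (deletion/skip)
            ref += length
        elif op in "IS":     # consumes read only (insertion/soft clip)
            query += length
    return None


def depth_and_q20(reads, ref_pos, min_map_qual=0):
    """Return (Depth, Q20Bases, Q20BasesPct) at one reference position."""
    depth = q20 = 0
    for read in reads:
        if read["flag"] & (UNMAPPED | QC_FAIL | DUPLICATE):
            continue
        if read["mapq"] < min_map_qual:
            continue
        index = query_index_at(ref_pos, read["pos"], read["cigar"])
        if index is None:    # not spanned, or not a match/mismatch here
            continue
        depth += 1
        if ord(read["qual"][index]) - 33 >= 20:
            q20 += 1
    pct = 100.0 * q20 / depth if depth else 0.0
    return depth, q20, pct


# Made-up reads for illustration.
reads = [
    {"flag": 0, "pos": 100, "mapq": 30, "cigar": "3S5M2D5M", "qual": "!!!IIIII#####"},
    {"flag": 0x400, "pos": 100, "mapq": 30, "cigar": "10M", "qual": "IIIIIIIIII"},
    {"flag": 0, "pos": 100, "mapq": 30, "cigar": "10M", "qual": "##########"},
]
print(depth_and_q20(reads, 102))
```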
 

Latest revision as of 17:14, 11 September 2021


bamUtil Overview

bamUtil is a repository that contains several programs that perform operations on SAM/BAM files. All of these programs are built into a single executable, bam.


Getting Help

If you have any questions, please use the bamUtil GitHub page to raise an issue.

See BamUtil: FAQ to see if your question has already been answered.

Where to Find It

The bamUtil repository is available both via release downloads and via github.

On github, https://github.com/statgen/bamUtil, you can both browse and download the bamUtil source code as well as explore the history of changes.

You can obtain the source either with or without git.

The releases may be available both with and without libStatGen included.

If you do not use the release version that already contains libStatGen, you need to download the library: libStatGen.

If you try to compile bamUtil and it cannot find libStatGen, it will fail and provide instructions on what to do next:

  • if libStatGen is in a different location than expected
    • follow the directions to set the path to libStatGen
  • if libStatGen is not downloaded and you have git
    • make libStatGen will download via git and build libStatGen
  • if libStatGen is not downloaded and you don't have git

Using Git To Track the Current Development Version

Clone (get your own copy)

You can create your own git clone (copy) using:

git clone https://github.com/statgen/bamUtil.git

or

git clone git://github.com/statgen/bamUtil.git

Either of these commands creates a directory called bamUtil in the current directory.

Then just cd bamUtil and compile.

Get the latest Updates (update your copy)

To update your copy to the latest version (a major advantage of using git):

  1. cd pathToYourCopy/bamUtil
  2. make clean
  3. git pull
  4. make all

Git Refresher

If you decide to use git, but need a refresher, see How To Use Git or Notes on how to use git (if you have access)


Downloading From GitHub Without Git

If you download the latest code/version, make sure you periodically update it by downloading a newer version.

From github you can download:

  1. Latest Code (master branch)
    via Website
    1. Go to: https://github.com/statgen/bamUtil
    2. Click on the Download ZIP button on the right side panel.
    via Command Line
    wget https://github.com/statgen/bamUtil/archive/master.tar.gz
    or
    wget https://github.com/statgen/bamUtil/archive/master.zip
  2. Specific Release (via a tag)
    via Website
    1. Go to: https://github.com/statgen/bamUtil/releases to see the available releases
    2. Click zip or tar.gz for the desired version.
    via Command Line
    wget https://github.com/statgen/bamUtil/archive/<tagName>.tar.gz
    or
    wget https://github.com/statgen/bamUtil/archive/<tagName>.zip


After downloading the file, uncompress (unzip/untar) it. The directory created will be named bamUtil-<name of version you downloaded>.

Building

After obtaining the bamUtil repository (either by download or from github), compile the code using:

make all  

Object (.o) files are compiled into the obj directory, with debug and profile subdirectories for the debugging and profiling objects.

This creates the executable(s) in the bamUtil/bin/ directory, the debug executable(s) in the bamUtil/bin/debug/ directory, and the profiling executable(s) in the bamUtil/bin/profile/ directory.

make install installs the opt binary if you have permission.

make test compiles for opt, debug, and profile and runs the tests (found in the test subdirectory).

To see all make options, type make help.


If compilation fails due to warnings being treated as errors, please contact us so we can fix the warnings. As a work-around to get it to compile, you can disable the treatment of warnings as errors by editing libStatGen/general/Makefile to remove -Werror.

Releases

If you prefer to run the last official release rather than the latest development version, you can download that here.

There are two versions of the release: one that includes libStatGen and one that does not. If you already have libStatGen installed and want to use your own copy, use the version that does not include libStatGen.

Full Release (includes libStatGen)

To install an official release, unpack the downloaded file (tar xvf), cd into the bamUtil_x.x.x directory and type make all.

For version 1.0.14 and later, please download libStatGen and bamUtil separately:


Version 1.0.14 - Released 7/8/2015


Older Releases


  • BamUtilLibStatGen.1.0.9.tgz‎ - Released 7/7/2013
  • There is no version 1.0.8. It was skipped to stay in line with libStatGen versions (libStatGen 1.0.8 added vcf support)
  • BamUtilLibStatGen.1.0.7.tgz‎ - Released 1/29/2013
    • Contains: libStatGen version 1.0.7
    • Contains: bamUtil version 1.0.7
    • Update to fix some compile issues on ubuntu 12.10
    • Update use of SamRecord::getStringTag to expect the return of a const string pointer due to libStatGen v1.0.7 updates
    • Update SamReferenceInfo usage due to libStatGen v1.0.7 updates
    • Update to diff
      • Fix DIFF to test and properly handle running out of available records. Previously no message was printed when this happened and there was a bug for which file it freed
    • Update to clipOverlap
      • Update to facilitate adding other overlap handling functions
    • Update to mergeBam (formerly RGMergeBam)
      • Rename RGMergeBam to MergeBam
      • Update to handle files that already have an RG

Release of just BamUtil (does not include libStatGen)

To install an official release, unpack the downloaded file (tar xvf), cd into the bamUtil_x.x.x directory and type make all.

BamUtil.1.0.14 Release Notes

  • BamUtil Version 1.0.14 - Released 7/8/2015

Older Releases

  • BamUtil Version 1.0.13 - Released 2/20/2015
    • https://github.com/statgen/bamUtil/archive/v1.0.13.tar.gz
    • Requires, but does not include: libStatGen version 1.0.13
    • Makefile Updates
      • Improve logic to determine actual path for the library
      • Update to append to USER_COMPILE_VARS even if specified on the command line
    • Update writeRegion
      • Add option to specify readnames to keep in a file
      • Fixed bug that if a read overlapped 2 BED positions, it was printed twice
    • Update to bam2FastQ
      • Update to skip non-primary reads
    • Update to polishBam
      • Update to handle '\t' string inputs and to add CO option
      • Fix MD5sum calculation to convert fasta to uppercase prior to calculating
  • BamUtil.1.0.10.tar.gz‎ - Released 1/2/2014
    • Requires, but does not include: libStatGen version 1.0.10
    • All
      • Add PhoneHome/version checking
      • Make sub-program names case independent
      • Fix Logger.cpp compiler warning
    • Adds: explainFlags - describes the SAM/BAM flags based on the flag value
    • Update to stats
      • Fix Stats to not try to process a record after it is out of the loop (it would already have been processed or is invalid)
    • Update to splitBam
      • fix description of --noeof option
    • Update to writeRegion
      • add exclude/required flags
    • Update to dedup & recab
      • Ignore secondary reads for dedup and making the recalibration table.
      • skip QC Failures
      • add excludeFlags parameters
    • Update to clipOverlap
      • add exclude flags
      • fix bug for readName sorted when a read is filtered due to flags
      • add sorting validation
    • Update to bam2FastQ
      • add --merge option to generate interleaved files.
      • update to open the input file before opening the output files, so if there is an error, the outputs aren't opened
    • Update to mergeBam
      • add option to ignore the RG PI field when checking headers
      • add more informative header merge error messages
  • BamUtil.1.0.9.tgz‎ - Released 7/7/2013
    • Requires, but does not include: libStatGen version 1.0.9 (version 1.0.7 should also work)
    • Update to mergeBam
      • Update to ignore PG lines with duplicate IDs
      • Update to accept merges of matching RG lines
      • Update to log to stderr if no log/out file is specified
  • BamUtil.1.0.7.tgz‎ - Released 1/29/2013
    • Requires, but does not include: libStatGen version 1.0.7 or above
    • Update to fix some compile issues on ubuntu 12.10
    • Update use of SamRecord::getStringTag to expect the return of a const string pointer due to libStatGen v1.0.7 updates
    • Update SamReferenceInfo usage due to libStatGen v1.0.7 updates
    • Update to diff
      • Fix DIFF to test and properly handle running out of available records. Previously no message was printed when this happened and there was a bug for which file it freed
    • Update to clipOverlap
      • Update to facilitate adding other overlap handling functions
    • Update to mergeBam (formerly RGMergeBam)
      • Rename RGMergeBam to MergeBam
      • Update to handle files that already have an RG
  • BamUtil.1.0.6.tgz‎ - Released 11/14/2012
    • Update to trimBam
      • Update to allow trimming a different number of bases from each end of the read
  • BamUtil.1.0.5.tgz‎ - Released 10/24/2012
    • Update to dedup
      • Update logic for which pair to keep if they have the same quality
    • Update to polishBam
      • Update to print the number of successful header additions
    • Update to recab
      • Update to print the number of bases skipped due to the base quality
    • General Updates
      • Update to add compile option to compile without C++0x/C++11
  • BamUtil.1.0.4.tgz‎ - Released skipped
  • BamUtil.1.0.3.tgz‎ - Released 09/19/2012
    • Adds: dedup recab
    • General Updates
      • Update Logger to write to stderr if output is stdout
    • Update to stats
      • Add required/exclude flags
      • Exclude Clips if excluding unmapped
      • Add --withinRegion flag
      • Update phred/qual counts to be uint64_t instead of int to avoid overflow
    • Update to validate
      • Detect header failures
    • Update to diff
      • Update to specify chromosome/pos in ZP as a string rather than int so both can be shown
    • Update to readReference
      • Output error message if the reference name is not found
    • Update to splitChromosome
      • Update to actually split the chromosomes and not just hard coded to output chromosomes ids 0-22
    • Update Makefile to have cloneLib for cloning libStatGen
  • BamUtil.1.0.2.tgz‎ - Released 05/16/2012
  • BamUtil.1.0.1.tgz‎ - Released 05/04/2012
  • BamUtil.1.0.0.tgz‎ - Released 10/10/2011

Citation

If you use BamUtil, please cite our publication on GotCloud which includes BamUtil: Jun, Goo, et al. "An efficient and scalable analysis framework for variant extraction and refinement from population scale DNA sequence data." Genome research (2015): gr-176552.


Programs

The software reads the beginning of an input file to determine if it is SAM/BAM. To determine the format (SAM/BAM) of the output file, the software checks the output file's extension. If the extension is ".bam" it writes a BAM file, otherwise it writes a SAM file.
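
The output rule stated above amounts to a single check on the file name. A minimal Python sketch, where <code>output_format</code> is a hypothetical name and the comparison is assumed to be case sensitive (the text does not say):

```python
def output_format(filename):
    """Return the format implied by the output file's extension."""
    # ".bam" selects BAM; any other extension falls back to SAM.
    return "BAM" if filename.endswith(".bam") else "SAM"


for name in ["out.bam", "out.sam", "out.txt"]:
    print(name, output_format(name))
```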


BamUtil is built using libStatGen. Running bin/bam with no parameters will print the usage information for the bam executable. Running bin/bam subProgram will print the usage information for the BamUtil sub-program.

Tools to Rewrite SAM/BAM Files:

  • convert - Convert SAM/BAM to SAM/BAM (optionally converts between '=' & bases in the sequence)
  • writeRegion - Write a file with reads in the specified region and/or have the specified read name
  • splitChromosome - Split BAM into 1 file per Chromosome
  • splitBam - Split BAM into 1 file per Read Group
  • findCigars - Output just the reads that contain any of the specified CIGAR operations.
  • BAM Recovery - Recover corrupted BAM files

Tools to Modify & write SAM/BAM Files:

  • clipOverlap - Clip overlapping read pairs in a SAM/BAM File already sorted by Coordinate or ReadName so they do not overlap
  • filter - Filter reads by soft clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high
  • revert - Revert SAM/BAM replacing the specified fields with their previous values (if known) and removes specified tags
  • squeeze - Reduce file size by dropping OQ fields, duplicates, & specified tags, using '=' when a base matches the reference, binning quality scores, and replacing readNames with unique integers
  • trimBam - Trim the ends of reads in a SAM/BAM file changing read ends to 'N' and quality to '!' or by doing soft clips
  • mergeBam - Merge multiple BAMs and headers appending ReadGroupIDs if necessary
  • polishBam - Add/update header lines & add the RG tag to each record
  • dedup - Mark or remove duplicates, can also perform recalibration
  • recab - Recalibrate base qualities

Informational Tools:

  • validate - Validate a SAM/BAM File, checking file format & printing statistics
  • diff - Diff 2 coordinate sorted SAM/BAM files.
  • stats - Generate some basic statistics for a SAM/BAM file
  • gapInfo - Print information on the gap between read pairs in a SAM/BAM File.

Helper Tools to Print Information In Readable Format:

  • dumpHeader - Print the SAM/BAM Header to the screen
  • dumpRefInfo - Print SAM/BAM Reference Name Information from the header
  • dumpIndex - Print BAM Index File to the screen in a readable format
  • readReference - Print the reference string for the specified region to the screen
  • explainFlags - Describe SAM/BAM flags

Additional Tools:

  • bam2FastQ - Convert the specified BAM file to fastQs.

Dummy/Example Tools:

  • readIndexedBam - Read an indexed BAM file reference by reference id -1 to the max reference id and write it out as a SAM/BAM file

ASP programs: ASP is a new format that is currently in production, so this tool is not yet available for public release.

  • asp - perform an asynchronous pileup producing an ASP file.
  • dumpAsp - perform an asynchronous pileup producing an ASP file.