Difference between revisions of "BamUtil"
(Remove convert) |
|||
Line 65: | Line 65: | ||
The bam executable has the following functions. | The bam executable has the following functions. | ||
* [[C++ Executable: bam#validate|validate - Read and Validate a SAM/BAM file]] | * [[C++ Executable: bam#validate|validate - Read and Validate a SAM/BAM file]] | ||
− | * [[ | + | * [[BamUtil: convert|convert - Read a SAM/BAM file and write as a SAM/BAM file]] |
* [[C++ Executable: bam#dumpHeader|dumpHeader - Print SAM/BAM header]] | * [[C++ Executable: bam#dumpHeader|dumpHeader - Print SAM/BAM header]] | ||
* [[C++ Executable: bam#splitChromosome|splitChromosome - Split BAM by Chromosome]] | * [[C++ Executable: bam#splitChromosome|splitChromosome - Split BAM by Chromosome]] |
Revision as of 16:27, 1 September 2011
bamUtil Overview
bamUtil is a repository that contains several programs that perform operations on SAM/BAM files. All of these programs are built into a single executable, bam
.
Where to Find It
The bamUtil repository is available both via release downloads (coming soon) and via github.
On github, you can both browse and download the latest version of the repository as well as explore the history of changes.
You can access the latest version with or without git.
If you download from github or use git to keep up to date, you also need to download our library: libStatGen.
The releases will be available both with and without libStatGen included. If you download the verison without libStatGen included, you will also need to download libStatGen separately. (It will be available without libStatGen in case you already have a downloaded version of libStatGen that you want to use.
Using github
Releases are Coming Soon.
Using github
Using Git To Track the Current Development Version
Clone (get your own copy)
You can create your own git clone (copy) using:
git clone https://github.com/statgen/bamUtil.git
or
git clone git://github.com/statgen/bamUtil.git
Either of these commands create a directory called bamUtil
in the current directory.
Then just cd bamUtil
and compile.
Get the latest Updates (update your copy)
To update your copy to the latest version (a major advantage of using git):
cd pathToYourCopy/bamUtil
make clean
git pull
make all
Git Refresher
If you decide to use git, but need a refresher, see How To Use Git or Notes on how to use git (if you have access)
Downloading From GitHub Without Git
Periodically download the latest copy from github from the "Downloads" link on the webpage: https://github.com/statgen/bamUtil/archives/master.
The downloaded tar file is named "statgen-bamUtil-someHexNumber.tar.gz". The directory created when it is untared shares the same base name. I recommend that you do not change the name of the directory. If you want one called bamUtil, create a link to this directory. The hex number in the directory name identifies the version of the repository that you downloaded and is necessary to easily troubleshoot any issues you encounter. If you must rename the directory, be sure to record the hex number that was on the download for future reference.
Building
After obtaining the bamUtil repository (either by download or from github), compile the code using make all
. This creates the executable, bam
, in the bamUtil/bin/
directory, the debug executable in the bamUtil/bin/debug/
directory, and the profiling executable in the bamUtil/bin/profile/
directory.
Programs
The software reads the beginning of an input file to determine if it is SAM/BAM. To determine the format (SAM/BAM) of the output file, the software checks the output file's extension. If the extension is ".bam" it writes a BAM file, otherwise it writes a SAM file.
The bam executable has the following functions.
- validate - Read and Validate a SAM/BAM file
- convert - Read a SAM/BAM file and write as a SAM/BAM file
- dumpHeader - Print SAM/BAM header
- splitChromosome - Split BAM by Chromosome
- writeRegion - Write the alignments in the indexed BAM file that fall into the specified region
- dumpRefInfo - Print SAM/BAM Reference Information
- dumpIndex - Dump a BAM index file into an easy to read text version
- readIndexedBam - Read an indexed BAM file reference by reference id -1 to the max reference id and write it out as a SAM/BAM file
- filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high
- readReference - Print the reference string for the specified region
- diff - Print the diffs between 2 bams
This executable is built using C++ Library: libStatGen.
Just running ./bam will print the Usage information for the bam executable.
validate
The validate
option on the bam executable reads and validates a SAM/BAM file. This option is documented at: BamValidator
dumpHeader
The dumpHeader
option on the bam executable prints the header of the specified SAM/BAM file to cout.
Parameters
Required Parameters: filename : the sam/bam filename whose header should be printed.
Usage
./bam dumpHeader <inputFile>
Return Value
- 0: the header was successfully read and printed.
- non-0: the header was not successfully read or was not printed. (Returns the SamStatus.)
Example Output
@SQ SN:1 LN:247249719 @SQ SN:2 LN:242951149 @SQ SN:3 LN:199501827
splitChromosome
The splitChromosome
option on the bam executable splits an indexed BAM file into multiple files based on the Chromosome (Reference Name).
The files all have the same base name, but with an _# where # corresponds with the associated reference id from the BAM file.
Parameters
Required Parameters: --in : the BAM file to be split --out : the base filename for the SAM/BAM files to write into. Does not include the extension. _N will be appended to the basename where N indicates the Chromosome. Optional Parameters: --noeof : do not expect an EOF block on a bam file. --bamIndex : the path/name of the bam index file (if not specified, uses the --in value + ".bai") --bamout : write the output files in BAM format (default). --samout : write the output files in SAM format. --params : print the parameter settings
Usage
./bam splitChromosome --in <inputFilename> --out <outputFileBaseName> [--bamIndex <bamIndexFile>] [--noeof] [--bamout|--samout] [--params]
Return Value
- 0: all records are successfully read and written.
- non-0: at least one record was not successfully read or written.
Example Output
Reference ID -1 has 2 records Reference ID 0 has 5 records Reference ID 1 has 2 records Reference ID 2 has 1 records Reference ID 3 has 0 records Reference ID 4 has 0 records Reference ID 5 has 0 records Reference ID 6 has 0 records Reference ID 7 has 0 records Reference ID 8 has 0 records Reference ID 9 has 0 records Reference ID 10 has 0 records Reference ID 11 has 0 records Reference ID 12 has 0 records Reference ID 13 has 0 records Reference ID 14 has 0 records Reference ID 15 has 0 records Reference ID 16 has 0 records Reference ID 17 has 0 records Reference ID 18 has 0 records Reference ID 19 has 0 records Reference ID 20 has 0 records Reference ID 21 has 0 records Reference ID 22 has 0 records Number of records = 10 Returning: 0 (SUCCESS)
writeRegion
The writeRegion
option on the bam executable writes the alignments in the indexed BAM file that fall into the specified region (reference id and start/end position).
Parameters
Required Parameters: --in : the BAM file to be read --out : the SAM/BAM file to write to Optional Parameters: --noeof : do not expect an EOF block on a bam file. --bamIndex : the path/name of the bam index file (if not specified, uses the --in value + ".bai") --refName : the BAM reference Name to read (either this or refID can be specified) --refID : the BAM reference ID to read (defaults to -1: unmapped) --start : inclusive 0-based start position (defaults to -1) --end : exclusive 0-based end position (defaults to -1: meaning til the end of the reference) --params : print the parameter settings
Usage
./bam writeRegion --in <inputFilename> --out <outputFilename> [--bamIndex <bamIndexFile>] [--noeof] [--refName <reference Name> | --refID <reference ID>] [--start <0-based start pos>] [--end <0-based end psoition>] [--params]
Return Value
- 0: all records are successfully read and written.
- non-0: at least one record was not successfully read or written.
Example Output
Wrote t.sam with 2 records.
dumpRefInfo
The dumpRefInfo
option on the bam executable prints the SAM/BAM file's reference information.
Parameters
Required Parameters: --in : the SAM/BAM file to be read Optional Parameters: --noeof : do not expect an EOF block on a bam file. --printRecordRefs : print the reference information for the records in the file (grouped by reference). --params : print the parameter settings
Usage
./bam dumpRefInfo --in <inputFilename> [--noeof] [--printRecordRefs] [--params]
Return Value
- 0: the file was processed successfully.
- non-0: the file was not processed successfully.
dumpIndex
The dumpIndex
option on the bam executable prints BAM index file in an easy to read format.
Parameters
Required Parameters: --bamIndex : the path/name of the bam index file to display Optional Parameters: --refID : the reference ID to read, defaults to print all --summary : only print a summary - 1 line per reference. --params : print the parameter settings
Usage
./bam dumpIndex --bamIndex <bamIndexFile> [--refID <ref#>] [--summary] [--params]
Return Value
- 0: the BAM index file was processed successfully.
- non-0: the BAM index file was not processed successfully.
readIndexedBam
The readIndexedBam
option on the bam executable reads an indexed BAM file reference id by reference id -1 to the max reference id and writes it out as a SAM/BAM file.
Parameters
Required Parameters: inputFilename - path/name of the input BAM file outputFile.sam/bam - path/name of the output file bamIndexFile - path/name of the BAM index file
Usage
./bam readIndexedBam <inputFilename> <outputFile.sam/bam> <bamIndexFile>
Return Value
- 0
filter
The filter
option on the bam executable filters the reads in a a SAM/BAM file. This option is documented at: Bam Executable: Filter
diff
***Coming Soon***
The diff
option on the bam executable prints the difference between two coordinate sorted SAM/BAM files. This can be used to compare the outputs of running a SAM/BAM through different tools/versions of tools.
The diff
tool compares records that have the same Read Name and Fragment (from the flag). If a matching ReadName & Fragment is not found, the record is considered to be different.
diff
assumes the files are coordinate sorted and uses this assumption for determining how long to store a record before determining that the other file does not contain a matching ReadName/Fragment. If the files are not coordinate sorted, this logic does not work.
By default, just the chromosome/position and cigar are compared for each record.
Options are available to compare:
- sequence
- base quality
- specified tags
- turn off position comparison
- turn off cigar comparison
Parameters
Required Parameters: --in1 : first coordinate sorted SAM/BAM file to be diffed --in2 : second coordinate sorted SAM/BAM file to be diffed Optional Parameters: --out : output filename, use .bam extension to output in SAM/BAM format instead of diff format. In SAMBAM format there will be 3 output files: 1) the specified name with record diffs 2) specified name with _only_<in1>.sam/bam with records only in the in1 file 3) specified name with _only_<in2>.sam/bam with records only in the in2 file --seq : diff the sequence bases. --baseQual : diff the base qualities. --tags : diff the specified Tags formatted as Tag:Type;Tag:Type;Tag:Type... --noCigar : do not diff the the cigars. --noPos : do not diff the positions. --onlyDiffs : only print the fields that are different, otherwise for any diff all the fields that are compared are printed. --recPoolSize : number of records to allow to be stored at a time, default value: 1000000 --posDiff : max base pair difference between possibly matching records100000 --noeof : do not expect an EOF block on a bam file. --params : print the parameter settings
Usage
./bam diff --in1 <inputFile> --in2 <inputFile> [--out <outputFile>] [--baseQual] [--tags <Tag:Type[;Tag:Type]*>] [--noCigar] [--noPos] [--onlyDiffs] [--recPoolSize <int>] [--posDiff <int>] [--noeof] [--params]
Return Value
- 0: all records are successfully read and written.
- non-0: an error occurred processing the parameters or reading one of the files.e
Output Format
2 Output Formats:
- Diff Format
- BAM Format
Diff Format
There are 2 types of differences.
- ReadName/Fragment combo is in one file, but not in the other file within the window set by recPoolSize & posDiff
- ReadName/Fragment combo is in both files, but at least one of the specified fields to diff is different
Each difference output consists of 2 or 3 lines. If the record only appears in one of the files, the diff is 2 lines, if it appears in both files, the diff is 3 lines.
The first line of the difference output is just the read name.
The 2nd and 3rd line (if present) begin with either a '<' or a '>'. If the record is from the first file (--in1), it begins with a '<'. If the record is from the 2nd file (--in2), it begins with a '>'.
The 2nd line is the flag followed by the diff'd fields from one of the records.
The 3rd line (if a matching record was found) is the flag followed by the diff'd fields from the matching record.
The diff'd record lines are tab separated, and are in the following order if --onlyDiffs is not specified:
- '<' or '>'
- flag
- chrom:pos (chromosome name ':' 1 based position) - if --noPos is not specified
- cigar - if --noCigar is not specified
- sequence - if --seq is specified
- base quality - if --baseQual is specified
- tag:type:value - for each tag:type specified in --tags
- ...
- tag:type:value
If onlyDiffs
is specified, only the fields that are specified and are different get printed in lines 2 & 3.
Example Output
Command:
../bin/bam diff --in1 testFiles/testDiff1.sam --in2 testFiles/testDiff2.sam --seq --baseQual --tags "OP:i;MD:Z" --onlyDiffs --out results/diffOrderSam.log
Output:
18:462+29M5I3M:F:295 < a1 1:78 > a1 1:74 1 > a1 1:70 3S1M1S ACGTN ;46>> OP:i:75 MD:Z:30A0C5 2 > a1 1:72 3S1M1S ACGTN ;47>> OP:i:75 MD:Z:30A0C5 ABC > cd *:0 * * * DEF > cd *:0 * * *
SAM/Bam Format
use .sam/.bam extension to output in SAM/BAM format instead of diff format.
In SAM/BAM format there will be 3 output files:
- the specified name with record diffs
- specified name with _only_<in1>.sam/bam with records only in the in1 file
- specified name with _only_<in2>.sam/bam with records only in the in2 file
When a record is found in both input files, but a difference is found, the record from the first file is written with additional tags to indicate the values from the second file, using the following tags:
- ZF - Flag
- ZP - Pos
- ZC - Cigar
- ZS - Sequence
- ZQ - Base Quality
- ZT - Tags
readReference
The readReference
option on the bam executable prints the specified region of the reference sequence in an easy to read format.
Parameters
Required Parameters: --refFile : the reference --refName : the SAM/BAM reference Name to read --start : inclusive 0-based start position (defaults to -1) Required Length Parameter (one but not both needs to be specified): --end : exclusive 0-based end position (defaults to -1: meaning til the end of the reference) --numBases : number of bases from start to display --params : print the parameter settings
Usage
./bam readReference --refFile <referenceFilename> --refName <reference Name> --start <0 based start> --end <0 based end>|--numBases <number of bases> [--params]
Return Value
- 0: the reference file was successfully read.
- non-0: the reference file was not successfully read.
Example Output
stats
The stats
option on the bam executable generates the specified statistics on a SAM/BAM file.
Parameters
Required Parameters: --in : the SAM/BAM file to calculate stats for Types of Statistics that can be generated: --basic : Turn on basic statistic generation --qual : Generate a count for each quality (displayed as non-phred quality) --phred : Generate a count for each quality (displayed as phred quality) --baseQC : Write per base statistics to the specified file. Optional Parameters: --maxNumReads : Maximum number of reads to process Defaults to -1 to indicate all reads. --unmapped : Only process unmapped reads (requires a bamIndex file) --bamIndex : The path/name of the bam index file (if required and not specified, uses the --in value + ".bai") --regionList : File containing the region list chr<tab>start_pos<tab>end<pos>. Positions are 0 based and the end_pos is not included in the region. Uses bamIndex. --minMapQual : The minimum mapping quality for filtering reads in the baseQC stats. --dbsnp : The dbSnp file of positions to exclude from baseQC analysis. --noeof : Do not expect an EOF block on a bam file. --params : Print the parameter settings
For all types of statistics, the bam file used is specified by --in
.
The optional parameters are also used for all types of statistics.
Usage:
./bam stats --in <inputFile> [--basic] [--qual] [--phred] [--baseQC <outputFileName>] [--maxNumReads <maxNum>] [--unmapped] [--bamIndex <bamIndexFile>] [--regionList <regFileName>] [--minMapQual <minMapQ>] [--dbsnp <dbsnpFile>] [--noeof] [--params]
Types of Statistics
Basic
Prints summary statistics for the file:
- TotalReads - # of reads that are in the file
- MappedReads - # of reads marked mapped in the flag
- PairedReads - # of reads marked paired in the flag
- ProperPair - # of reads marked paired AND proper paired in the flag
- DuplicateReads - # of reads marked duplicate in the flag
- QCFailureReads - # of reads marked QC failure in the flag
- MappingRate(%) - # of reads marked mapped in the flag / TotalReads
- PairedReads(%) - # of reads marked paired in the flag / TotalReads
- ProperPair(%) - # of reads marked paired AND proper paired in the flag / TotalReads
- DupRate(%) - # of reads marked duplicate in the flag / TotalReads
- QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
- TotalBases - # of bases in all reads
- BasesInMappedReads - # of bases in reads marked mapped in the flag
Qual/Phred
Prints a count of the number of times each quality value appears in the file.
phred
Displays Quality as phred integers [0-93]qual
Displays Quality as non-phred integers (phred + 33) [33-126]
BaseQC
This capability is coming soon, so these notes may be updated prior to it being completed...
Do we print stats for positions where the reference base is 'N'?? (any special note for those? Qplot would not count them in the depth.)
The baseQC
option generates the following statistics:
For each position, the following counts are incremented if:
- a read spans the reference position (starts before or at this reference position and ends at or after this position)
- regardless of duplicate/qc failure/unmapped/mapping quality
- regardless of the CIGAR for this position (other than clips at the beginning/end which are not counted, but deletions and skips are counted)
- TotalReads(e6) - # of reads that span this position.
- DupRate(%) - # of reads marked duplicate in the flag / TotalReads
- QCFailRate(%) - # of reads marked QC failure in the flag / TotalReads
- PairedReads(%) - # of reads marked paired in the flag / TotalReads
- ProperPaired(%) - # of reads marked paired AND proper paired in the flag / TotalReads
- MappedBases(e9) - # of reads marked mapped in the flag
- MappingRate(%) - # of reads marked mapped in the flag / TotalReads
- ZeroMapQual(%) - # of reads marked mapped in the flag AND have a Mapping Quality of 0 / TotalReads
- MapQual<10(%) - # of reads marked mapped in the flag AND have a Mapping Quality < 10 / TotalReads
- MapRate_MQpass(%) - # of reads marked mapped in the flag AND have a Mapping Quality >= a minimum Mapping Quality / TotalReads
For each position, the following counts are incremented if:
- a read spans the reference position (starts before or at this reference position and ends at or after this position)
- the read is NOT a duplicate, qc failure, unmapped, or mapped with a mapping quality less than the min
- the CIGAR for this position is a M/=/X (match/mismatch)
TBD - should it count if the read has a base of 'N'
- Depth - # of reads.
- Q20Bases(e9) - # of bases at this position with a base quality (from the read) of Q20 or higher.
- Q20BasesPct(%) - Q20Bases / Depth