BamUtil: squeeze
Overview of the squeeze function of bamUtil
The squeeze option on the bamUtil executable reduces file size by optionally:
- dropping OQ fields (default, disable using --keepOQ)
- dropping duplicates (default, disable using --keepDups)
- dropping specified tags (--rmTags "Tag1:Type1;Tag2:Type2")
- using '=' when a base matches the reference (--refFile refFileName.fa)
- binning quality scores (--binQualS/--binQualF)
- replacing readNames with unique integers (--readName/--sReadName)
Usage
./bam squeeze --in <inputFile> --out <outputFile.sam/bam/ubam (ubam is uncompressed bam)> [--refFile <refFilePath/Name>] [--keepOQ] [--keepDups] [--readName <readNameMapFile.txt>] [--sReadName <readNameMapFile.txt>] [--binQualS <minQualBin2>,<minQualBin3><...>] [--binQualF <filename>] [--rmTags <"Tag:Type[;Tag:Type]*>"] [--noeof] [--params]
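As a minimal sketch of how the usage line above might be assembled from a script, the following helper builds an argument list using only options documented on this page. The function name, its keyword arguments, and the default executable path are hypothetical and are not part of bamUtil.

```python
def build_squeeze_args(in_file, out_file, ref_file=None, keep_oq=False,
                       keep_dups=False, rm_tags=None, bin_qual_s=None,
                       bam_exe="./bam"):
    """Assemble an argument list for `bam squeeze` (hypothetical helper)."""
    # --in and --out are the two required parameters.
    args = [bam_exe, "squeeze", "--in", in_file, "--out", out_file]
    if ref_file:
        args += ["--refFile", ref_file]
    if keep_oq:
        args.append("--keepOQ")
    if keep_dups:
        args.append("--keepDups")
    if rm_tags:
        # rm_tags is a list of (tag, type) pairs, joined as "Tag:Type;Tag:Type".
        args += ["--rmTags", ";".join(f"{tag}:{typ}" for tag, typ in rm_tags)]
    if bin_qual_s:
        # bin_qual_s is a list of minimum phred scores, joined with commas.
        args += ["--binQualS", ",".join(str(q) for q in bin_qual_s)]
    return args


if __name__ == "__main__":
    print(build_squeeze_args("in.bam", "out.bam",
                             rm_tags=[("OQ", "Z"), ("MD", "Z")]))
```

Passing a list like this to subprocess.run without a shell means the ';' in the --rmTags value needs no extra quoting.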
Parameters
Required Parameters:
    --in : the SAM/BAM file to be read
    --out : the SAM/BAM file to be written
Optional Parameters:
    --refFile : reference file name used to convert any bases that match the reference to '='
    --keepOQ : keep the OQ tag rather than removing it. Default is to remove it.
    --keepDups : keep duplicates rather than removing records marked duplicate. Default is to remove them.
    --sReadName : Replace read names with unique integers and write the mapping to the specified file. This version requires the input file to have been presorted by readname, but no validation is done to ensure this. If it is not sorted, a readname will get mapped to multiple new values.
    --readName : Replace read names with unique integers and write the mapping to the specified file. This version does not require the input file to have been presorted by readname, but uses a lot of memory since it stores all the read names.
    --rmTags : Remove the specified Tags formatted as "Tag:Type;Tag:Type;Tag:Type"...
    --noeof : do not expect an EOF block on a bam file.
    --params : print the parameter settings
Quality Binning Parameters (optional):
  Bin qualities by phred score, into the ranges specified by binQualS or binQualF (both cannot be used)
  Ranges are specified by comma separated minimum phred score for the bin, example: 1,17,20,30,40,50,70
  The first bin always starts at 0, so does not need to be specified.
  By default, the bin value is the low end of the range.
    --binQualS : Bin the Qualities as specified (phred): minQualOfBin2, minQualofBin3...
    --binQualF : Bin the Qualities based on the specified file
    --binMid : Use the mid point of the quality bin range for the quality value of the bin.
    --binHigh : Use the high end of the quality bin range for the quality value of the bin.
PhoneHome:
    --noPhoneHome : disable PhoneHome (default enabled)
    --phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)
Required Parameters
Input File (--in)
Use --in followed by your file name to specify the SAM/BAM input file.
The program automatically determines whether your input file is SAM, BAM, or uncompressed BAM; you only need to supply the filename, unless your input file is stdin.
Use - to read from stdin. In that case the extension determines the file type (no extension indicates SAM).
SAM/BAM/Uncompressed BAM from file | --in yourFileName |
SAM from stdin | --in - |
BAM from stdin | --in -.bam |
Uncompressed BAM from stdin | --in -.ubam |
Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
Output File (--out)
Use --out followed by your file name to specify the SAM/BAM output file.
The file extension determines whether SAM, BAM, or uncompressed BAM is written. Use - to write to stdout; the extension still selects the file type (no extension is SAM).
SAM to file | --out yourFileName.sam |
BAM to file | --out yourFileName.bam |
Uncompressed BAM to file | --out yourFileName.ubam |
SAM to stdout | --out - |
BAM to stdout | --out -.bam |
Uncompressed BAM to stdout | --out -.ubam |
Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
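The extension rules in the tables above can be sketched as a small lookup. This is an illustration only, not bamUtil's code: the function name is hypothetical, and raising an error for an unrecognized extension is an assumption, since this page does not say what the tool does in that case.

```python
import os


def file_type_from_name(name):
    """Map an --out value (or a stdin --in value) to "SAM", "BAM", or "UBAM"."""
    # os.path.splitext("-.bam") gives ("-", ".bam"); a bare "-" has no extension.
    ext = os.path.splitext(name)[1]
    types = {
        "": "SAM",        # no extension, including a bare "-"
        ".sam": "SAM",
        ".bam": "BAM",
        ".ubam": "UBAM",  # BAM written with compression level 0
    }
    if ext not in types:
        # Assumption: the real tool's handling of other extensions is not documented here.
        raise ValueError(f"unrecognized extension: {ext!r}")
    return types[ext]


if __name__ == "__main__":
    for value in ["-", "-.bam", "-.ubam", "yourFileName.sam"]:
        print(value, file_type_from_name(value))
```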
Optional Parameters
Reference File (--refFile)
Use --refFile followed by the reference file name to specify the reference sequence file. Bases that match the reference are converted to '='.
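To illustrate the '=' conversion, here is a minimal sketch that compares a read against the reference bases it aligns to. It assumes an ungapped alignment (read and reference segment of equal length) and a case-insensitive comparison; both are simplifying assumptions, and the function name is hypothetical. The real tool has to account for the CIGAR string.

```python
def mask_matching_bases(read_seq, ref_seq):
    """Replace each read base that equals the aligned reference base with '='."""
    if len(read_seq) != len(ref_seq):
        # Assumption: ungapped alignment, so both strings cover the same positions.
        raise ValueError("read and reference segment must have equal length")
    return "".join(
        "=" if read_base.upper() == ref_base.upper() else read_base
        for read_base, ref_base in zip(read_seq, ref_seq)
    )


if __name__ == "__main__":
    # Only the third base differs from the reference, so only it is kept.
    print(mask_matching_bases("ACGTT", "ACCTT"))
```

A read that mostly agrees with the reference becomes a long run of '=' characters, which compresses well.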
Keep OQ Tag (--keepOQ)
Use --keepOQ to keep the OQ tag rather than removing it. By default, the OQ tag is removed.
Keep Duplicates (--keepDups)
Use --keepDups to keep records that are marked as duplicate (in the flag). By default, records marked as duplicate are removed.
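The duplicate marking lives in the SAM flag field, where bit 0x400 means "PCR or optical duplicate". A minimal sketch of the default filtering on text SAM lines follows; the function name is hypothetical and this is not bamUtil's implementation.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit: PCR or optical duplicate


def drop_duplicates(sam_lines, keep_dups=False):
    """Return the lines to write, skipping duplicate records unless keep_dups is set."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines are always kept
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if keep_dups or not (flag & DUPLICATE_FLAG):
            kept.append(line)
    return kept


if __name__ == "__main__":
    records = ["r1\t0\tchr1\t100", "r2\t1024\tchr1\t100", "r3\t16\tchr1\t200"]
    print(len(drop_duplicates(records)))
```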
Replace Read Names with Unique Integers (--sReadName, --readName)
Use --sReadName or --readName to replace read names with unique integers and write the mapping to the specified file.
--sReadName requires the input file to have been presorted by readname, but no validation is done to ensure proper sorting. If it is not sorted, a readname will get mapped to multiple new values.
--readName does not require the input file to have been presorted by readname, but uses a lot of memory since it stores all the read names in memory.
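The trade-off between the two options can be sketched as follows. This is an illustration under assumptions, not bamUtil's code: both function names are hypothetical, and numbering from 0 is a guess, since this page does not state the starting integer.

```python
def rename_presorted(read_names):
    """--sReadName style: compare only with the previous name (constant memory)."""
    new_ids, mapping = [], []
    previous, current = None, -1
    for name in read_names:
        if name != previous:
            current += 1                  # a new name starts a new integer
            mapping.append((name, current))
            previous = name
        new_ids.append(current)
    return new_ids, mapping


def rename_unsorted(read_names):
    """--readName style: remember every name seen (memory grows with the input)."""
    seen = {}
    new_ids = []
    for name in read_names:
        if name not in seen:
            seen[name] = len(seen)
        new_ids.append(seen[name])
    return new_ids, list(seen.items())


if __name__ == "__main__":
    unsorted_names = ["readA", "readB", "readA"]
    print(rename_presorted(unsorted_names)[0])  # readA receives two different integers
    print(rename_unsorted(unsorted_names)[0])   # readA receives one integer
```

On unsorted input the first function assigns a fresh integer each time a name reappears, which is the "mapped to multiple new values" failure described above.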
Remove Tags (--rmTags)
Use --rmTags followed by a list of tags separated by ';' to remove the specified tags. Each tag should be formatted as "Tag:Type".
Note: when using ';' to specify multiple tags, be sure to put the whole string in quotes. Otherwise the shell will interpret the ';' as the end of the command.
Example: --rmTags "OQ:Z;MD:Z" or --rmTags 'OQ:Z;MD:Z'
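As a sketch of what the "Tag:Type" list means for a record, the following parses the list and drops matching optional fields from a text SAM line, where optional fields are written TAG:TYPE:VALUE after the 11 mandatory columns. Both function names are hypothetical, and this is not bamUtil's implementation.

```python
def parse_rm_tags(spec):
    """Turn "OQ:Z;MD:Z" into {("OQ", "Z"), ("MD", "Z")}."""
    pairs = set()
    for item in spec.split(";"):
        if item:
            tag, tag_type = item.split(":")
            pairs.add((tag, tag_type))
    return pairs


def strip_tags(sam_line, spec):
    """Remove optional fields whose tag and type both match an entry in spec."""
    remove = parse_rm_tags(spec)
    fields = sam_line.rstrip("\n").split("\t")
    mandatory, optional = fields[:11], fields[11:]
    # Each optional field is TAG:TYPE:VALUE; compare on (TAG, TYPE) only.
    kept = [f for f in optional if tuple(f.split(":", 2)[:2]) not in remove]
    return "\t".join(mandatory + kept)


if __name__ == "__main__":
    record = "\t".join(["r1", "0", "1", "100", "60", "4M", "*", "0", "0",
                        "ACGT", "IIII", "OQ:Z:JJJJ", "NM:i:0", "MD:Z:4"])
    print(strip_tags(record, "OQ:Z;MD:Z"))
```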
Do not require BGZF EOF block (--noeof)
Use --noeof if you do not expect a trailing EOF block in your BGZF file.
By default, the trailing empty block is expected and checked for.
Print the Program Parameters (--params)
Use --params to print the parameters for your program to stderr.
Optional Quality Binning Parameters
Quality scores can optionally be binned to reduce the number of distinct quality values.
Quality Score Bins (--binQualS, --binQualF)
Use --binQualS or --binQualF to bin qualities by phred score, into the specified ranges (only one of the two options can be specified).
The ranges are specified by comma separated minimum phred score for the bin, example: 1,17,20,30,40,50,70
The first bin always starts at 0, so does not need to be specified.
By default, the bin value is the low end of the range. Use --binMid or --binHigh to change the value for the bin.
Use --binQualS followed by the comma-separated bin minimum phred scores to specify the ranges on the command line.
Use --binQualF followed by the filename to specify the ranges in a file.
Quality Score Bin Value (--binMid, --binHigh)
By default the lowest number in a bin is used as the bin's value.
Use --binMid to use the mid point of the quality bin range for the quality value of the bin.
Use --binHigh to use the highest number in the quality bin for the quality value of the bin.
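The binning rules can be sketched with the example ranges 1,17,20,30,40,50,70. This is an illustration under assumptions rather than bamUtil's implementation: the function name is hypothetical, the top of the last bin is taken to be 93 (the largest phred value a SAM quality string can encode), and the mid point is rounded down. This page does not specify either of those details.

```python
import bisect

MAX_PHRED = 93  # assumption: upper end of the last bin


def make_binner(bin_starts, mode="low", max_qual=MAX_PHRED):
    """Return a function mapping a phred score to its bin value."""
    starts = [0] + list(bin_starts)  # the first bin always starts at 0

    def bin_quality(qual):
        index = bisect.bisect_right(starts, qual) - 1
        low = starts[index]
        # A bin ends just below the next bin's minimum; the last bin ends at max_qual.
        high = starts[index + 1] - 1 if index + 1 < len(starts) else max_qual
        if mode == "low":
            return low                  # default behaviour
        if mode == "mid":
            return (low + high) // 2    # --binMid (rounding is an assumption)
        if mode == "high":
            return high                 # --binHigh
        raise ValueError(f"unknown mode: {mode}")

    return bin_quality


if __name__ == "__main__":
    example = [1, 17, 20, 30, 40, 50, 70]
    for mode in ("low", "mid", "high"):
        print(mode, make_binner(example, mode)(25))
```

With these ranges a score of 25 falls in the bin covering 20 to 29, so the three modes give 20, 24, and 29.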
PhoneHome Parameters
See PhoneHome for more information on how PhoneHome works and what it does.
Turn off PhoneHome (--noPhoneHome)
Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.
Adjust the Frequency of PhoneHome (--phoneHomeThinning)
Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).
- By default, --phoneHomeThinning is set to 50, running 50% of the time.
- PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
- N/A if --noPhoneHome is set.
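The thinning rule in the list above amounts to a single comparison. A minimal sketch, with a hypothetical function name and with the random number passed in rather than generated:

```python
def should_phone_home(run_random_number, thinning=50, no_phone_home=False):
    """Apply the documented rule: run if (random number mod 100) < thinning."""
    if no_phone_home:
        return False  # --noPhoneHome disables PhoneHome entirely
    return run_random_number % 100 < thinning


if __name__ == "__main__":
    print(should_phone_home(149))               # 149 % 100 = 49, below the default of 50
    print(should_phone_home(150))               # 150 % 100 = 50, not below 50
    print(should_phone_home(7, thinning=0))     # a thinning of 0 never runs
```

Under this rule a thinning value of 100 always runs and a value of 0 never does.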
Return Value
Returns the SamStatus for the reads/writes (0 for success, non-0 for failure).
Example Output
Number of records read = 13
Number of records written = 10