BamUtil: bam2FastQ
Overview of the bam2FastQ function of bamUtil
The bam2FastQ option of bamUtil converts a BAM file into FastQ files. This is necessary when only BAM files are delivered but a new alignment is desired: once the BAM has been converted, the new alignment can be run on the FastQ files.
NOTE: Secondary and Supplementary reads are skipped when converting to FastQ. It assumes that there will only be 2 reads (the 2 primary mates) with the same read name that are not secondary or supplementary.
NOTE: Use the --splitRG option to split reads into read group specific FASTQs.
How to use it
When bam2FastQ is invoked without any arguments the usage information is displayed as described below under Usage.
The input SAM/BAM file (--in) is required.
It works on both read/query name and coordinate sorted SAM/BAM files.
If you want to convert a SAM/BAM that is read/query name sorted but the SO field of the header does not specify "queryname", then use the --readName option.
When processing files sorted by read name, the only requirement is that matching read names are next to each other. It does not need to be in strict alphabetical order.
Read Names in paired-end FASTQ files are appended with "/1" for the first in the pair, and "/2" for the second in the pair. Override these defaults using --firstRNExt and --secondRNExt.
Sequences marked as Reverse strands in the SAM/BAM file are reverse complemented prior to writing to the FASTQ files. To skip this step, specify --noReverseComp.
Any errors and a summary of how many pairs and unpaired reads were processed are written to stderr.
NOTE: This tool does not work on templates that have more than 2 segments. It does not properly match reads when more than 2 reads have the same read name.
NOTE: By default, this tool does not split reads into read group specific FASTQs. If you want Read Group specific FASTQ files, either use the --splitRG option, or first run BamUtil: splitBam to split the BAM into 1 BAM per Read Group and then run bam2FastQ on each BAM.
Output Files
By default, this program produces 3 output fastq files.
- unpaired reads
- first end of paired reads
- second end of paired reads
If the --merge option is specified, the program produces 2 output fastq files.
- unpaired reads
- interleaved paired-end reads
The default fastq file names are determined by taking the base name of the input file and adding an extension for each filetype.
Default: Output File Contents | Default: Extension | --merge: Output File Contents | --merge: Extension |
---|---|---|---|
unpaired reads | .fastq | unpaired reads | .fastq |
first end of paired reads | _1.fastq | interleaved paired-end reads (both first & second end) | _interleaved.fastq |
second end of paired reads | _2.fastq | | |
If the inputFile was "myPath/myFile.bam", the resulting fastqs would be:
- myPath/myFile.fastq
- myPath/myFile_1.fastq
- myPath/myFile_2.fastq
With the --merge option, the resulting fastqs would be:
- myPath/myFile.fastq
- myPath/myFile_interleaved.fastq
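The default naming rule above can be sketched as a small shell helper. This is only an illustration: `default_fastq_names` is a hypothetical function, not part of bamUtil, and it assumes the "base name" is the input path with its final extension removed.

```shell
# Hypothetical helper (not part of bamUtil): print the FASTQ names that
# bam2FastQ is documented to produce for a given input file.
# Assumption: the base name is the input path minus its last ".ext".
default_fastq_names() {
    in_file=$1
    mode=${2:-default}            # "default" or "merge"
    base=${in_file%.*}            # strip the final extension
    echo "${base}.fastq"          # unpaired reads
    if [ "$mode" = "merge" ]; then
        echo "${base}_interleaved.fastq"   # interleaved paired-end reads
    else
        echo "${base}_1.fastq"             # first end of paired reads
        echo "${base}_2.fastq"             # second end of paired reads
    fi
}

default_fastq_names myPath/myFile.bam
default_fastq_names myPath/myFile.bam merge
```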
Instead of using the inputFile base name as the output file base, you can specify a different base name by using the --outBase option.
You can optionally directly specify the output fastq filenames using:
- --firstOut firstReadInAPair.fastq (also used for the interleaved filename with --merge)
- --secondOut secondReadInAPair.fastq
- --unpairedOut unpairedReads.fastq
If any of these are not specified, the --outBase or default is used for that file.
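The precedence described here (an explicit filename first, then --outBase, then the default derived from the input file) can be illustrated with a short sketch. `resolve_outputs` is a hypothetical helper written for this page, not bamUtil code; empty arguments stand for options that were not given, and the default base is assumed to be the input path without its final extension.

```shell
# Hypothetical helper: show which name each output file would get.
# Arguments: input file, outBase, firstOut, secondOut, unpairedOut
# (pass "" for any option that is not specified).
resolve_outputs() {
    in_file=$1; out_base=$2; first_out=$3; second_out=$4; unpaired_out=$5
    base=${out_base:-${in_file%.*}}          # --outBase wins over the default base
    echo "first=${first_out:-${base}_1.fastq}"
    echo "second=${second_out:-${base}_2.fastq}"
    echo "unpaired=${unpaired_out:-${base}.fastq}"
}

# Only --firstOut given: the other two names fall back to the default base.
resolve_outputs myFile.bam "" myFileEnd1.fastq "" ""
```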
Usage
./bam bam2FastQ --in <inputFile> [--readName] [--splitRG] [--qualField <tag>] [--refFile <referenceFile>] [--outBase <outputFileBase>] [--firstOut <1stReadInPairOutFile>] [--merge|--secondOut <2ndReadInPairOutFile>] [--unpairedOut <unpairedOutFile>] [--firstRNExt <firstInPairReadNameExt>] [--secondRNExt <secondInPairReadNameExt>] [--rnPlus] [--noReverseComp] [--region <chr>[:<pos>[:<base>]]] [--gzip] [--noeof] [--params]
Parameters
Required Parameters:
    --in : the SAM/BAM file to convert to FastQ
Optional Parameters:
    --readname : Process the BAM as readName sorted instead of coordinate if the header does not indicate a sort order.
    --splitRG : Split into RG specific fastqs.
    --qualField : Use the base quality from the specified tag rather than from the Quality field (default)
    --merge : Generate 1 interleaved (merged) FASTQ for paired-ends (unpaired in a separate file)
              use firstOut to override the filename of the interleaved file.
    --refFile : Reference file for converting '=' in the sequence to the actual base
                if '=' are found and the refFile is not specified, 'N' is written to the FASTQ
    --firstRNExt : read name extension to use for first read in a pair default is "/1"
    --secondRNExt : read name extension to use for second read in a pair default is "/2"
    --rnPlus : Add the Read Name/extension to the '+' line of the fastq records
    --noReverseComp : Do not reverse complement reads marked as reverse
    --region : Only convert reads containing the specified region/nucleotide.
               Position formatted as: chr:pos:base
               pos (0-based) & base are optional.
    --gzip : Compress the output FASTQ files using gzip
    --noeof : Do not expect an EOF block on a bam file.
    --params : Print the parameter settings to stderr
Optional OutputFile Names:
    --outBase : Base output name for generated output files
    --firstOut : Output name for the first in pair file over-rides setting of outBase
    --secondOut : Output name for the second in pair file over-rides setting of outBase
    --unpairedOut : Output name for unpaired reads over-rides setting of outBase
Required Parameters
Input File (--in)
Use --in followed by your file name to specify the SAM/BAM input file.
The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.
Use - to read from stdin; the extension that follows it determines the file type (no extension indicates SAM).
SAM/BAM/Uncompressed BAM from file | --in yourFileName |
SAM from stdin | --in - |
BAM from stdin | --in -.bam |
Uncompressed BAM from stdin | --in -.ubam |
Note: Uncompressed BAM is compressed using compression level 0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
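As a rough illustration of the stdin conventions in the table above, the sketch below maps a stream type to the matching --in value. `stdin_in_arg` is a hypothetical helper, not part of bamUtil. The commented pipe shows how it might be combined with `samtools view -u` (uncompressed BAM output); the file names and the use of --outBase there are assumptions for the example.

```shell
# Hypothetical helper: choose the --in value for data arriving on stdin.
stdin_in_arg() {
    case $1 in
        sam)  echo "-" ;;        # no extension indicates SAM
        bam)  echo "-.bam" ;;
        ubam) echo "-.ubam" ;;   # uncompressed (level-0) BAM
        *)    echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Example pipe (needs samtools and bamUtil installed, so it is left commented out):
# samtools view -u input.bam | ./bam bam2FastQ --in "$(stdin_in_arg ubam)" --outBase myFastQBase
echo "--in $(stdin_in_arg ubam)"
```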
Optional Parameters
BAM File Is Sorted By Read Name (--readName)
The bam2FastQ program by default checks the sort order in the SAM/BAM header when converting to FASTQ, and if that is not specified, assumes it is sorted by coordinate.
To override the default and force it to assume the file is sorted by read name, specify the --readName option.
The file does not need to be strictly sorted by read name. The only requirement is that matching read names are next to each other.
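The adjacency requirement can be checked before conversion with a sketch like the one below. `names_are_grouped` is a hypothetical helper, not part of bamUtil: it reads one read name per line on stdin and fails if a name reappears after a different name has intervened. The commented pipeline assumes samtools is available and that `input.bam` is your file.

```shell
# Hypothetical helper: succeed only if identical read names are adjacent.
# Reads one read name per line from stdin; order does not have to be alphabetical.
names_are_grouped() {
    awk '
        $0 != prev {
            if ($0 in seen) { bad = 1; exit }   # name came back after a gap
            seen[$0] = 1
            prev = $0
        }
        END { exit bad + 0 }
    '
}

# With real data (left commented out because it needs samtools):
# samtools view input.bam | cut -f1 | names_are_grouped && echo "usable with --readName"

# r2 sorts after r1 alphabetically, but the names are still grouped.
printf 'r2\nr2\nr1\nr1\n' | names_are_grouped && echo "grouped"
```

Note that the check keeps every distinct read name in memory, so on very large files it is only a rough diagnostic.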
Split into RG Specific FASTQs (--splitRG)
Create RG specific FASTQ files.
Cannot be specified with --firstOut/--secondOut/--unpairedOut since there will be a different filename for each RG.
Cannot write to stdout when --splitRG is specified.
Output filenames will be <outBase>.<RG>_1.fastq, <outBase>.<RG>_2.fastq, and <outBase>.<RG>.fastq. A fastq list file <outBase>.list will be created containing MERGE_NAME (the RG tag's SM value or outBase if the value is empty), fastq 1, fastq 2 (or . if it is a single ended fastq), and the RG tag string.
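The per-read-group naming pattern can be spelled out with a small sketch. `rg_fastq_names` is a hypothetical helper, not part of bamUtil, and it assumes that `<RG>` in the pattern above is the read group identifier.

```shell
# Hypothetical helper: expand the documented --splitRG filename pattern
# for one read group. Assumption: <RG> is the read group identifier.
rg_fastq_names() {
    out_base=$1; rg=$2
    echo "${out_base}.${rg}_1.fastq"   # first end of paired reads
    echo "${out_base}.${rg}_2.fastq"   # second end of paired reads
    echo "${out_base}.${rg}.fastq"     # unpaired reads
}

rg_fastq_names myFastQBase RG1
```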
Use the Base Quality from the Specified Tag (--qualField)
By default, the quality field is used for the Base Qualities in the FASTQ file. Specify --qualField <tagName> to use the base qualities from the specified tag instead of the quality field.
Generate 1 Paired-End Output File (--merge)
Use the --merge option to generate 1 interleaved (merged) FASTQ for paired-ends instead of 2 files. Unpaired reads are still written to a separate file.
The default extension for the interleaved output file is "_interleaved.fastq".
Use --firstOut to override the filename of the interleaved file.
This parameter was added in version 1.0.10.
Reference File for Converting '=' in the Sequence to Bases (--refFile)
If the SAM/BAM file contains '=' in the sequence instead of the actual bases, the bam2FastQ program needs to convert the '=' back to the bases. To do that it needs the reference. Specify the reference by using --refFile followed by the reference filename.
For example:
./bam bam2FastQ --in myFile.bam --refFile myPath/myRefFile.fa
First in Pair FastQ ReadName Extension (--firstRNExt)
--firstRNExt overrides the default "/1" that is appended to the Read Name of the first-end of a read pair with the specified value.
Second in Pair FastQ ReadName Extension (--secondRNExt)
--secondRNExt overrides the default "/2" that is appended to the Read Name of the second-end of a read pair with the specified value.
Include the Read Name on the "+" line of the FASTQ (--rnPlus)
By default the read name is not included on the "+" line of the FASTQ files. To include the read name and the extension for paired-end reads, specify --rnPlus.
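To make the effect of the read name extensions and --rnPlus concrete, the sketch below prints a FASTQ record in both styles. `fastq_record` is a hypothetical formatter written for illustration, not bam2FastQ's own code; the read name, sequence, and quality string are made-up values.

```shell
# Hypothetical formatter: print one 4-line FASTQ record.
# Arguments: read name, extension, sequence, qualities, and "yes" to mimic --rnPlus.
fastq_record() {
    name=$1; ext=$2; seq=$3; qual=$4; rn_plus=${5:-no}
    printf '@%s%s\n%s\n' "$name" "$ext" "$seq"
    if [ "$rn_plus" = "yes" ]; then
        printf '+%s%s\n' "$name" "$ext"   # read name and extension repeated on the "+" line
    else
        printf '+\n'                      # default: bare "+" line
    fi
    printf '%s\n' "$qual"
}

fastq_record read1 /1 ACCGTG IIIIII        # default extension, bare "+" line
fastq_record read1 _R1 ACCGTG IIIIII yes   # custom extension with the name on the "+" line
```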
Do Not Reverse Complement Reverse Strands (--noReverseComp)
By default, reads marked as reverse in the BAM file are reverse complemented prior to writing to the FASTQ files. --noReverseComp disables this feature, and skips the reverse complement step.
For example, if a sequence is ACCGTG marked as reverse, the default FASTQ record will be written as: CACGGT
Specifying --noReverseComp would result in a FASTQ sequence of ACCGTG
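The reverse complement in the example above can be reproduced with standard tools. This is a minimal sketch of the operation itself, not bam2FastQ's implementation; `revcomp` is a hypothetical helper that handles only A, C, G, and T (upper or lower case) and leaves any other character unchanged.

```shell
# Hypothetical helper: reverse complement a DNA sequence.
revcomp() {
    printf '%s\n' "$1" |
        # reverse the string one character at a time
        awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }' |
        # then swap each base for its complement
        tr 'ACGTacgt' 'TGCAtgca'
}

revcomp ACCGTG   # prints CACGGT
```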
Only Convert the Specified Region (--region)
Only convert reads containing the specified region/nucleotide.
Position formatted as: chr:pos:base
pos (0-based) & base are optional.
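The chr:pos:base format can be taken apart as in the sketch below, which may help when building --region values in a script. `parse_region` is a hypothetical helper, not part of bamUtil; it only splits the string and does not validate the chromosome, position, or base.

```shell
# Hypothetical helper: split a --region value into its parts.
# pos is 0-based; pos and base are optional and shown as "-" when absent.
parse_region() {
    IFS=: read -r chr pos base <<EOF
$1
EOF
    echo "chr=$chr pos=${pos:--} base=${base:--}"
}

parse_region chr1:1000:A
parse_region chr1
```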
Do not require BGZF EOF block (--noeof)
Use --noeof if you do not expect a trailing EOF block in your BGZF file.
By default, the trailing empty block is expected and checked for.
Print the Program Parameters (--params)
Use --params to print the parameters for your program to stderr.
Optional Output Filenames
Output FastQ File Base Name (--outBase)
You can replace the default output base name by using the --outBase option.
If the outBase was "myNewPath/myFastQBase", the resulting fastqs would be:
- myNewPath/myFastQBase.fastq
- myNewPath/myFastQBase_1.fastq
- myNewPath/myFastQBase_2.fastq
The value specified by this parameter is overridden by --firstOut, --secondOut, and --unpairedOut, but is used for whichever output files are not specified.
With the --merge option, the resulting fastqs would instead be:
- myNewPath/myFastQBase.fastq
- myNewPath/myFastQBase_interleaved.fastq
Output FastQ File Name For the First End of Paired End (--firstOut)
This setting overrides the default and --outBase file name.
The entire filename and extension must be specified.
Does not affect the filenames for the second end or for unpaired reads.
For example:
./bam bam2FastQ --in myFile.bam --firstOut myFileEnd1.fastq
Output FastQ File Name For the Second End of Paired End (--secondOut)
This setting overrides the default and --outBase file name.
The entire filename and extension must be specified.
Does not affect the filenames for the first end or for unpaired reads.
For example:
./bam bam2FastQ --in myFile.bam --secondOut myFileEnd2.fastq
Output FastQ File Name For Unpaired Reads (--unpairedOut)
This setting overrides the default and --outBase file name.
The entire filename and extension must be specified.
Does not affect the filenames for the paired-end fastq files.
For example:
./bam bam2FastQ --in myFile.bam --unpairedOut myFileUnpaired.fastq
PhoneHome Parameters
See PhoneHome for more information on how PhoneHome works and what it does.
Turn off PhoneHome (--noPhoneHome)
Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.
Adjust the Frequency of PhoneHome (--phoneHomeThinning)
Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).
- By default, --phoneHomeThinning is set to 50, running 50% of the time.
- PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
- N/A if --noPhoneHome is set.
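The thinning rule in the list above is a simple modulo comparison, sketched below. `would_phone_home` is a hypothetical helper that only restates the documented condition; how bamUtil draws its random number is not covered here.

```shell
# Hypothetical helper: apply the documented thinning rule.
# Arguments: the run's random number, and the --phoneHomeThinning value (default 50).
would_phone_home() {
    random_number=$1; thinning=${2:-50}
    [ $(( random_number % 100 )) -lt "$thinning" ]
}

# 149 modulo 100 is 49, which is below the default of 50.
if would_phone_home 149; then echo "PhoneHome would run"; fi
# 150 modulo 100 is 50, which is not below 50.
if would_phone_home 150; then echo "PhoneHome would run"; else echo "PhoneHome would be skipped"; fi
```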
Return Value
Returns -1 if input parameters are invalid.
Returns the SamStatus for the reads/writes (0 on success, non-0 on failure).
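In a script, the return value can be checked through the shell's exit status, as in the sketch below. `run_and_check` is a hypothetical wrapper, not part of bamUtil. Because exit statuses are limited to 0-255 on POSIX systems, a return value of -1 is seen by the shell as 255.

```shell
# Hypothetical wrapper: run a command and report its exit status.
run_and_check() {
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "success"
    else
        echo "failed with exit status $status"
    fi
    return "$status"
}

# With bamUtil installed, the call might look like this (left commented out):
# run_and_check ./bam bam2FastQ --in myFile.bam
run_and_check true
```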