BamUtil: bam2FastQ


Overview of the bam2FastQ function of bamUtil

The bam2FastQ option of bamUtil converts a SAM/BAM file into FastQ files. This is necessary when only BAM files are delivered but a new alignment is desired: once the reads are back in FastQ format, they can be realigned.

NOTE: Secondary and Supplementary reads are skipped when converting to FastQ. The tool assumes that, among reads that are not secondary or supplementary, at most 2 reads (the 2 primary mates) share the same read name.

NOTE: Use the --splitRG option to split reads into read group specific FASTQs.

How to use it

When bam2FastQ is invoked without any arguments the usage information is displayed as described below under Usage.

The input SAM/BAM file is required and is specified with --in (see Input File (--in) below).

It works on both read/query name and coordinate sorted SAM/BAM files.

If you want to convert a SAM/BAM that is read/query name sorted but the SO field of the header does not specify "queryname", then use the --readName option.

When processing files sorted by read name, the only requirement is that matching read names are next to each other. The file does not need to be in strict alphabetical order.
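To decide whether --readName is needed, it can help to look at the SO field of the @HD header line. The following is a minimal sketch, not part of bamUtil: sam_sort_order is a hypothetical helper that reads SAM header text on stdin, and the header is assumed to come from a tool such as samtools view -H.

```shell
# Hypothetical helper (not part of bamUtil): print the SO (sort order) value
# found on the @HD line of SAM header text read from stdin, or "unspecified"
# when there is no @HD line or it carries no SO field.
sam_sort_order() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
        }
        END { if (!found) print "unspecified" }'
}

# Example call (assumes samtools is installed; not run here):
#   samtools view -H myFile.bam | sam_sort_order
```

If this prints something other than "queryname" for a file you know has matching read names next to each other, --readName is the option described above for that case.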

Read names in paired-end FASTQ files are appended with "/1" for the first in the pair, and "/2" for the second in the pair. Override these defaults using --firstRNExt and --secondRNExt.

Sequences marked as reverse strand in the SAM/BAM file are reverse complemented prior to writing to the FASTQ files. To skip this step, specify --noReverseComp.

Any errors and a summary of how many pairs and unpaired reads were processed are written to stderr.
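Since the summary and any errors go to stderr, redirecting stderr to a file keeps them for later inspection. A minimal sketch, assuming the executable is at ./bam as in the examples on this page; the BAM variable, the function name, and the log file name are hypothetical:

```shell
# Path to the bamUtil executable; ./bam matches the examples on this page.
BAM="${BAM:-./bam}"

# Hypothetical helper: convert one file and keep the stderr output in a log.
# $1 = input SAM/BAM file
# $2 = log file that receives errors and the pair/unpaired summary
bam_to_fastq_logged() {
    "$BAM" bam2FastQ --in "$1" 2> "$2"
}

# Example call (not run here):
#   bam_to_fastq_logged myFile.bam myFile.bam2FastQ.log
```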

NOTE: This tool does not work on templates that have more than 2 segments. It does not properly match reads when more than 2 reads have the same read name.

NOTE: By default, this tool does not split reads into read group specific FASTQs. If you want Read Group specific FASTQ files, either use the --splitRG option, or first run BamUtil: splitBam to split the BAM into 1 BAM per Read Group and then run bam2FastQ on each BAM.

Output Files

By default, this program produces 3 output fastq files.

  1. unpaired reads
  2. first end of paired reads
  3. second end of paired reads

If the --merge option is specified, the program produces 2 output fastq files.

  1. unpaired reads
  2. interleaved paired-end reads

The default fastq file names are determined by taking the base name of the input file and adding an extension for each filetype.

Default:

  Output File Contents          Extension
  unpaired reads                .fastq
  first end of paired reads     _1.fastq
  second end of paired reads    _2.fastq

With --merge:

  Output File Contents                                      Extension
  unpaired reads                                            .fastq
  interleaved paired-end reads (both first & second end)    _interleaved.fastq

If the inputFile was "myPath/myFile.bam", the resulting fastqs would be:

  1. myPath/myFile.fastq
  2. myPath/myFile_1.fastq
  3. myPath/myFile_2.fastq

With the --merge option, the resulting fastqs would be:

  1. myPath/myFile.fastq
  2. myPath/myFile_interleaved.fastq
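The naming rule above can be sketched in shell. This only mirrors the examples on this page and is not taken from the bamUtil source: default_fastq_names is a hypothetical helper, and it assumes the base name is the input name with its final extension removed.

```shell
# Hypothetical helper: print the default output names for an input file.
# $1 = input file name
# $2 = "merge" to mimic --merge (optional)
# Assumption: the base name is the input name minus its final extension.
default_fastq_names() {
    base="${1%.*}"
    echo "${base}.fastq"                  # unpaired reads
    if [ "${2:-}" = "merge" ]; then
        echo "${base}_interleaved.fastq"  # both ends, interleaved
    else
        echo "${base}_1.fastq"            # first end of paired reads
        echo "${base}_2.fastq"            # second end of paired reads
    fi
}

default_fastq_names myPath/myFile.bam
default_fastq_names myPath/myFile.bam merge
```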

Instead of using the inputFile base name as the output file base, you can specify a different base name by using the --outBase option.

You can optionally directly specify the output fastq filenames using:

  • --firstOut firstReadInAPair.fastq (also used for the interleaved filename with --merge)
  • --secondOut secondReadInAPair.fastq
  • --unpairedOut unpairedReads.fastq

If any of these are not specified, the --outBase or default is used for that file.

Usage

./bam bam2FastQ --in <inputFile> [--readName] [--splitRG] [--qualField <tag>] [--refFile <referenceFile>] [--outBase <outputFileBase>] [--firstOut <1stReadInPairOutFile>] [--merge|--secondOut <2ndReadInPairOutFile>] [--unpairedOut <unpairedOutFile>] [--firstRNExt <firstInPairReadNameExt>] [--secondRNExt <secondInPairReadNameExt>] [--rnPlus] [--noReverseComp] [--region <chr>[:<pos>[:<base>]]] [--gzip] [--noeof] [--params]

Parameters

        Required Parameters:
                --in       : the SAM/BAM file to convert to FastQ
        Optional Parameters:
                --readname      : Process the BAM as readName sorted instead
                                  of coordinate if the header does not indicate a sort order.
                --splitRG       : Split into RG specific fastqs.
                --qualField     : Use the base quality from the specified tag
                                  rather than from the Quality field (default)
                --merge         : Generate 1 interleaved (merged) FASTQ for paired-ends (unpaired in a separate file)
                                  use firstOut to override the filename of the interleaved file.
                --refFile       : Reference file for converting '=' in the sequence to the actual base
                                  if '=' are found and the refFile is not specified, 'N' is written to the FASTQ
                --firstRNExt    : read name extension to use for first read in a pair
                                  default is "/1"
                --secondRNExt   : read name extension to use for second read in a pair
                                  default is "/2"
                --rnPlus        : Add the Read Name/extension to the '+' line of the fastq records
                --noReverseComp : Do not reverse complement reads marked as reverse
                --region        : Only convert reads containing the specified region/nucleotide.
                                  Position formatted as: chr:pos:base
                                  pos (0-based) & base are optional.
                --gzip          : Compress the output FASTQ files using gzip
                --noeof         : Do not expect an EOF block on a bam file.
                --params        : Print the parameter settings to stderr
        Optional OutputFile Names:
                --outBase       : Base output name for generated output files
                --firstOut      : Output name for the first in pair file
                                  over-rides setting of outBase
                --secondOut     : Output name for the second in pair file
                                  over-rides setting of outBase
                --unpairedOut   : Output name for unpaired reads
                                  over-rides setting of outBase

Required Parameters

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.

To read from stdin, use - as the file name. In that case the extension appended to the - determines the file type (no extension indicates SAM).

  SAM/BAM/Uncompressed BAM from file    --in yourFileName
  SAM from stdin                        --in -
  BAM from stdin                        --in -.bam
  Uncompressed BAM from stdin           --in -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
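A pipe from samtools into bam2FastQ could look like the sketch below. It assumes samtools is installed and relies on samtools view -u writing uncompressed BAM to stdout; the BAM and SAMTOOLS variables and the function name are hypothetical. An explicit --outBase is passed as a precaution, since there is no input file name from which to derive the output names.

```shell
# Paths to the executables; ./bam matches the examples on this page.
BAM="${BAM:-./bam}"
SAMTOOLS="${SAMTOOLS:-samtools}"

# Hypothetical helper: stream a BAM through samtools into bam2FastQ.
# $1 = input BAM file
# $2 = base name for the output FASTQ files
# "-.ubam" tells bam2FastQ to read uncompressed BAM from stdin.
stream_to_fastq() {
    "$SAMTOOLS" view -u "$1" | "$BAM" bam2FastQ --in -.ubam --outBase "$2"
}

# Example call (not run here):
#   stream_to_fastq myFile.bam myPath/myFastQBase
```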

Optional Parameters

BAM File Is Sorted By Read Name (--readName)

The bam2FastQ program by default checks the sort order in the SAM/BAM header when converting to FASTQ, and if that is not specified, assumes it is sorted by coordinate.

To override the default and force the program to assume the file is sorted by read name, specify the --readName option.

The file does not need to be strictly sorted by read name. The only requirement is that matching read names are next to each other.

Split into RG Specific FASTQs (--splitRG)

Create RG specific FASTQ files.

This option cannot be combined with --firstOut, --secondOut, or --unpairedOut, since there will be a different filename for each RG.

Output cannot be written to stdout when --splitRG is specified.

Output filenames will be <outBase>.<RG>_1.fastq, <outBase>.<RG>_2.fastq, and <outBase>.<RG>.fastq. A fastq list file <outBase>.list will be created containing MERGE_NAME (the RG tag's SM value or outBase if the value is empty), fastq 1, fastq 2 (or . if it is a single ended fastq), and the RG tag string.
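The per-RG naming pattern above can be spelled out as a small sketch. rg_fastq_names is a hypothetical helper and RG1 is a made-up read group ID; the patterns themselves are the ones listed on this page.

```shell
# Hypothetical helper: print the FASTQ names expected for one read group.
# $1 = value passed to --outBase
# $2 = read group ID
rg_fastq_names() {
    echo "${1}.${2}_1.fastq"   # first end of paired reads
    echo "${1}.${2}_2.fastq"   # second end of paired reads
    echo "${1}.${2}.fastq"     # unpaired reads
}

# "RG1" is a made-up read group ID used only for illustration.
rg_fastq_names myNewPath/myFastQBase RG1

# The matching conversion would be run along these lines (not run here):
#   ./bam bam2FastQ --in myFile.bam --splitRG --outBase myNewPath/myFastQBase
```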

Use the Base Quality from the Specified Tag (--qualField)

By default, the quality field is used for the Base Qualities in the FASTQ file. Specify --qualField <tagName> to use the base qualities from the specified tag instead of the quality field.


Generate 1 Paired-End Output File (--merge)

Use the --merge option to generate 1 interleaved (merged) FASTQ for paired-ends instead of 2 files. Unpaired reads are still written to a separate file.

The default extension for the interleaved output file is "_interleaved.fastq".

Use --firstOut to override the filename of the interleaved file.

This parameter was added in version 1.0.10.

Reference File for Converting '=' in the Sequence to Bases (--refFile)

If the SAM/BAM file contains '=' in the sequence instead of the actual bases, the bam2FastQ program needs to convert the '=' back to the bases. To do that it needs the reference. Specify the reference by using --refFile followed by the reference filename.

For example:

./bam bam2FastQ --in myFile.bam --refFile myPath/myRefFile.fa

First in Pair FastQ ReadName Extension (--firstRNExt)

--firstRNExt overrides the default "/1" that is appended to the Read Name of the first-end of a read pair with the specified value.

Second in Pair FastQ ReadName Extension (--secondRNExt)

--secondRNExt overrides the default "/2" that is appended to the Read Name of the second-end of a read pair with the specified value.

Include the Read Name on the "+" line of the FASTQ (--rnPlus)

By default the read name is not included on the "+" line of the FASTQ files. To include the read name and the extension for paired-end reads, specify --rnPlus.
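To make the effect of the read name extensions and --rnPlus concrete, the sketch below prints a FASTQ record the way these options are described on this page. It is an illustration only, not output captured from bam2FastQ: fastq_record is a hypothetical helper and the read name, sequence, and qualities are made up.

```shell
# Hypothetical helper: print one FASTQ record.
# $1 = read name
# $2 = extension ("/1", "/2", or "" for an unpaired read)
# $3 = sequence
# $4 = base qualities
# $5 = "rnPlus" to repeat the name and extension on the "+" line (optional)
fastq_record() {
    name="${1}${2}"
    plus="+"
    if [ "${5:-}" = "rnPlus" ]; then
        plus="+${name}"
    fi
    printf '@%s\n%s\n%s\n%s\n' "$name" "$3" "$plus" "$4"
}

# Made-up read, first in its pair, default "+" line.
fastq_record read1 /1 ACCGTG IIIIII
# Same read with the name repeated on the "+" line.
fastq_record read1 /1 ACCGTG IIIIII rnPlus
```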

Do Not Reverse Complement Reverse Strands (--noReverseComp)

By default, reads marked as reverse in the BAM file are reverse complemented prior to writing to the FASTQ files. --noReverseComp disables this feature, and skips the reverse complement step.

For example, if a sequence is ACCGTG marked as reverse, the default FASTQ record will be written as: CACGGT

Specifying --noReverseComp would result in a FASTQ sequence of ACCGTG
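The reverse complement in this example can be reproduced with standard shell tools. This is a sketch of the operation itself, not of bamUtil's implementation; revcomp is a hypothetical helper name.

```shell
# Reverse complement a DNA sequence: reverse the string, then swap
# A<->T and C<->G. Other characters (such as N) are left unchanged.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp ACCGTG   # prints CACGGT
```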

Only Convert the Specified Region (--region)

Only convert reads containing the specified region/nucleotide.

Position formatted as: chr:pos:base

pos (0-based) & base are optional.

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing EOF block in your BGZF file.

By default, the trailing empty block is expected and checked for.

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

Optional Output Filenames

Output FastQ File Base Name (--outBase)

You can replace the default output base name by using the --outBase option. If the outBase was "myNewPath/myFastQBase", the resulting fastqs would be:

  1. myNewPath/myFastQBase.fastq
  2. myNewPath/myFastQBase_1.fastq
  3. myNewPath/myFastQBase_2.fastq

The value specified by this parameter is overridden by --firstOut, --secondOut, and --unpairedOut, but is used for whichever output files are not specified.

With the --merge option, the resulting fastqs would instead be:

  1. myNewPath/myFastQBase.fastq
  2. myNewPath/myFastQBase_interleaved.fastq

Output FastQ File Name For the First End of Paired End (--firstOut)

This setting overrides the default and --outBase file name.

The entire filename and extension must be specified.

Does not affect the filenames for the second end or for unpaired reads.

For example:

./bam bam2FastQ --in myFile.bam --firstOut myFileEnd1.fastq

Output FastQ File Name For the Second End of Paired End (--secondOut)

This setting overrides the default and --outBase file name.

The entire filename and extension must be specified.

Does not affect the filenames for the first end or for unpaired reads.

For example:

./bam bam2FastQ --in myFile.bam --secondOut myFileEnd2.fastq

Output FastQ File Name For Unpaired Reads (--unpairedOut)

This setting overrides the default and --outBase file name.

The entire filename and extension must be specified.

Does not affect the filenames for the paired-end fastq files.

For example:

./bam bam2FastQ --in myFile.bam --unpairedOut myFileUnpaired.fastq

PhoneHome Parameters

See PhoneHome for more information on how PhoneHome works and what it does.

Turn off PhoneHome (--noPhoneHome)

Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.

Adjust the Frequency of PhoneHome (--phoneHomeThinning)

Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).

  • By default, --phoneHomeThinning is set to 50, running 50% of the time.
  • PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
  • N/A if --noPhoneHome is set.
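The thinning rule in the list above amounts to a single comparison. A minimal sketch, with would_phone_home as a hypothetical helper and the random number supplied by the caller:

```shell
# Hypothetical helper: succeed (exit status 0) when PhoneHome would run.
# $1 = the run's random number
# $2 = the --phoneHomeThinning value (0-100)
would_phone_home() {
    [ $(( $1 % 100 )) -lt "$2" ]
}

# With the default thinning of 50, a random number of 149 gives 49 (runs),
# while 150 gives 50 (does not run).
would_phone_home 149 50 && echo "PhoneHome runs"
would_phone_home 150 50 || echo "PhoneHome is skipped"
```

Under this rule a thinning value of 0 never triggers PhoneHome and a value of 100 always does.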

Return Value

Returns -1 if input parameters are invalid.

Returns the SamStatus for the reads/writes (0 on success, non-0 on failure).
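In a script, the return value can be checked through the shell's exit status. A minimal sketch, assuming the executable is at ./bam; the BAM variable and the function name are hypothetical. Note that a POSIX shell reports exit statuses in the range 0-255, so a return value of -1 appears as 255.

```shell
# Path to the bamUtil executable; ./bam matches the examples on this page.
BAM="${BAM:-./bam}"

# Hypothetical helper: run a conversion and report a failing exit status.
# $1 = input SAM/BAM file
convert_checked() {
    "$BAM" bam2FastQ --in "$1"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "bam2FastQ failed for $1 (exit status $status)" >&2
    fi
    return "$status"
}

# Example call (not run here):
#   convert_checked myFile.bam || exit 1
```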