Difference between revisions of "BamUtil: bam2FastQ"

From Genome Analysis Wiki

Revision as of 17:48, 16 May 2012

Overview of the bam2FastQ function of bamUtil

The bam2FastQ option of bamUtil converts a BAM file into FastQ files. This is necessary when only BAM files are delivered but a new alignment is desired: once the reads have been converted back to FastQ, they can be realigned.

How to use it

When bam2FastQ is invoked without any arguments the usage information is displayed as described below under Usage.

The input SAM/BAM file is required; specify it with Input File (--in).

It works on both read/query name and coordinate sorted SAM/BAM files.

If you want to convert a SAM/BAM that is read/query name sorted but the SO field of the header does not specify "queryname", then use the --readName option.

When processing files sorted by read name, the only requirement is that matching read names are next to each other. It does not need to be in strict alphabetical order.

Any errors and a summary of how many pairs and unpaired reads were processed are written to stderr.

Output Files

This program produces 3 output fastq files.

  1. unpaired reads
  2. first end of paired reads
  3. second end of paired reads

The default fastq file names are determined by taking the base name of the input file and adding an extension for each filetype.

Output File Contents          Extension
unpaired reads                .fastq
first end of paired reads     _1.fastq
second end of paired reads    _2.fastq

If the input file were "myPath/myFile.bam", the resulting FASTQ files would be:

  1. myPath/myFile.fastq
  2. myPath/myFile_1.fastq
  3. myPath/myFile_2.fastq
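
The default naming above can be sketched in a few lines of shell. This is only an illustration, not bamUtil's own code: default_fastq_names is a hypothetical helper, and it assumes the base name is simply the input path with its final extension removed, as in the myPath/myFile.bam example.

```shell
# Hypothetical helper (not part of bamUtil): print the three default FASTQ
# names for an input path. Assumes the base name is the input path with its
# final extension removed.
default_fastq_names() {
    base="${1%.*}"
    printf '%s\n' "${base}.fastq" "${base}_1.fastq" "${base}_2.fastq"
}

default_fastq_names myPath/myFile.bam
```

The three printed names appear in the same order as the list above: unpaired, first end, second end.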

Instead of using the inputFile base name as the output file base, you can specify a different base name by using the --outBase option.

You can optionally directly specify the output fastq filenames using:

  • --firstOut firstReadInAPair.fastq
  • --secondOut secondReadInAPair.fastq
  • --unpairedOut unpairedReads.fastq

If any of these are not specified, the --outBase or default is used for that file.

Usage

./bam bam2FastQ --in <inputFile> [--readName] [--refFile <referenceFile>] [--outBase <outputFileBase>] [--firstOut <1stReadInPairOutFile>] [--secondOut <2ndReadInPairOutFile>] [--unpairedOut <unpairedOutFile>] [--firstRNExt <firstInPairReadNameExt>] [--secondRNExt <secondInPairReadNameExt>] [--rnPlus] [--noReverseComp] [--noeof] [--params]


Parameters

	Required Parameters:
		--in       : the SAM/BAM file to convert to FastQ
	Optional Parameters:
		--readname      : Process the BAM as readName sorted instead
		                  of coordinate if the header does not indicate a sort order.
		--refFile       : Reference file for converting '=' in the sequence to the actual base
		                  if '=' are found and the refFile is not specified, 'N' is written to the FASTQ
		--outBase       : Base output name for generated output files
		--firstOut      : Output name for the first in pair file
		                  over-rides setting of outBase
		--secondOut     : Output name for the second in pair file
		                  over-rides setting of outBase
		--unpairedOut   : Output name for unpaired reads
		                  over-rides setting of outBase
		--firstRNExt    : read name extension to use for first read in a pair
		                  default is "/1"
		--secondRNExt   : read name extension to use for second read in a pair
		                  default is "/2"
		--rnPlus        : Add the Read Name/extension to the '+' line of the fastq records
		--noReverseComp : Do not reverse complement reads marked as reverse
		--noeof         : Do not expect an EOF block on a bam file.
		--params        : Print the parameter settings to stderr
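
To make the effect of --firstRNExt, --secondRNExt, and --rnPlus concrete, the sketch below prints FASTQ records by hand. fastq_record is a hypothetical helper, and the read name, sequences, and qualities are made-up illustration values; it only mirrors the record layout described in the parameter list and is not how bamUtil itself writes records.

```shell
# Hypothetical formatter (not part of bamUtil): print one FASTQ record.
# Arguments: readName extension sequence qualities [rnPlus]
fastq_record() {
    plus="+"
    # When the fifth argument is "rnPlus", repeat the read name and
    # extension on the '+' line, as --rnPlus is described to do.
    if [ "${5:-}" = "rnPlus" ]; then
        plus="+$1$2"
    fi
    printf '@%s%s\n%s\n%s\n%s\n' "$1" "$2" "$3" "$plus" "$4"
}

# First read in a pair with the default "/1" extension.
fastq_record read1 /1 ACGT IIII
# Second read in a pair with the default "/2" extension and rnPlus behavior.
fastq_record read1 /2 TTGA IIII rnPlus
```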

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines whether your input file is SAM, BAM, or uncompressed BAM. No input other than the filename is needed from the user, unless your input file is stdin.

Use - to read from stdin. In that case the extension following the - determines the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file    --in yourFileName
SAM from stdin                        --in -
BAM from stdin                        --in -.bam
Uncompressed BAM from stdin           --in -.ubam


Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.
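
The stdin conventions listed above can be summarized as a small lookup. stdin_input_type is a hypothetical helper written only for illustration; the commented pipe at the end assumes samtools is installed and uses made-up file names (in.bam, myFastQBase).

```shell
# Hypothetical helper (not part of bamUtil): report how a given --in value
# is interpreted according to the stdin conventions listed above.
stdin_input_type() {
    case "$1" in
        -)      echo "SAM" ;;
        -.bam)  echo "BAM" ;;
        -.ubam) echo "uncompressed BAM" ;;
        *)      echo "file (type detected automatically)" ;;
    esac
}

stdin_input_type -.ubam

# Example pipe, not run here (assumes samtools is installed and in.bam exists):
#   samtools view -u in.bam | ./bam bam2FastQ --in -.ubam --outBase myFastQBase
```

The sketch passes --outBase in the piped example because there is no input file name from which to derive a default base.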

Output FastQ File Base Name (--outBase)

You can replace the default output base name by using the --outBase option. If the outBase were "myNewPath/myFastQBase", the resulting FASTQ files would be:

  1. myNewPath/myFastQBase.fastq
  2. myNewPath/myFastQBase_1.fastq
  3. myNewPath/myFastQBase_2.fastq

The value specified by this parameter is overridden by --firstOut, --secondOut, and --unpairedOut, but is used for whichever output files are not specified.
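
The precedence between --outBase and the explicit file name options can be sketched as follows. resolve_fastq_names is a hypothetical helper that only mirrors the documented rule (an explicit name wins, otherwise the name is built from the base); it is not taken from the bamUtil source.

```shell
# Hypothetical helper (not part of bamUtil): resolve the three output names.
# Arguments: base firstOut secondOut unpairedOut (pass "" for "not specified").
resolve_fastq_names() {
    base="$1"
    # ${N:-default} falls back to the base-derived name when the explicit
    # name is missing or empty.
    echo "unpaired=${4:-${base}.fastq}"
    echo "first=${2:-${base}_1.fastq}"
    echo "second=${3:-${base}_2.fastq}"
}

# Equivalent in spirit to: --outBase myNewPath/myFastQBase --firstOut myFileEnd1.fastq
resolve_fastq_names myNewPath/myFastQBase myFileEnd1.fastq "" ""
```

Here only the first-end file takes the explicit name; the other two still follow the base.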


Output FastQ File Name For the First End of Paired End (--firstOut)

This setting overrides the default and --outBase file name.

The entire filename and extension must be specified.

Does not affect the filenames for the second end or for unpaired reads.

For example:

./bam bam2FastQ --in myFile.bam --firstOut myFileEnd1.fastq


Output FastQ File Name For the Second End of Paired End (--secondOut)

This setting overrides the default and --outBase file name.

The entire filename and extension must be specified.

Does not affect the filenames for the first end or for unpaired reads.

For example:

./bam bam2FastQ --in myFile.bam --secondOut myFileEnd2.fastq


Output FastQ File Name For Unpaired Reads (--unpairedOut)

This setting overrides the default and --outBase file names.

The entire filename and extension must be specified.

Does not affect the filenames for the two paired end fastq files.

For example:

./bam bam2FastQ --in myFile.bam --unpairedOut myFileUnpaired.fastq


BAM File Is Sorted By Read Name (--readname)

The bam2FastQ program by default checks the sort order in the SAM/BAM header when converting to FASTQ, and if that is not specified, assumes it is sorted by coordinate.

To override the default and force it to assume the file is sorted by read name, specify the --readName option.

The file does not need to be strictly sorted by read name. The only requirement is that matching read names are next to each other.
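
One way to check the "matching read names are next to each other" requirement before using --readName is sketched below. names_are_grouped is a hypothetical helper; it reads one read name per line and succeeds only if each name forms a single contiguous run, regardless of the order of the runs. The commented pipe assumes samtools is installed and relies on the read name being the first column of SAM output.

```shell
# Hypothetical check (not part of bamUtil): succeed if every read name on
# stdin (one per line) appears in exactly one contiguous run.
names_are_grouped() {
    # When the name changes, fail if the new name was already seen earlier.
    awk '$0 != prev { if (seen[$0]++) exit 1; prev = $0 }'
}

# Grouped but not alphabetical: acceptable.
printf 'readB\nreadB\nreadA\nreadA\n' | names_are_grouped && echo "grouped"
# readA appears in two separate runs: not acceptable.
printf 'readA\nreadB\nreadA\n' | names_are_grouped || echo "not grouped"

# Example against a real file, not run here (assumes samtools is installed):
#   samtools view myFile.bam | cut -f1 | names_are_grouped
```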


Reference File for Converting '=' in the Sequence to Bases (--refFile)

If the SAM/BAM file contains '=' in the sequence instead of the actual bases, the bam2FastQ program needs to convert the '=' back to the bases. To do that it needs the reference. Specify the reference by using --refFile followed by the reference filename.

For example:

./bam bam2FastQ --in myFile.bam --refFile myPath/myRefFile.fa
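
As the parameter list notes, when '=' is found and --refFile is not specified, 'N' is written to the FASTQ instead of the actual base. The sketch below only mimics that fallback on a made-up sequence string so the result is easy to picture; mask_equals is a hypothetical helper, not part of bamUtil.

```shell
# Hypothetical illustration (not part of bamUtil): replace every '=' in a
# sequence with 'N', as described for the case where no reference is given.
mask_equals() {
    tr '=' 'N'
}

printf 'AC=T=\n' | mask_equals
```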