Changes

6,392 bytes added , 23:53, 5 March 2016

Line 1: Line 1: −

~~=This functionality will be released on 5/17/2012=~~

−

= Overview of the <code>bam2FastQ</code> function of <code>[[bamUtil]]</code> =

The <code>bam2FastQ</code> option of [[bamUtil]] converts a BAM file into FastQ files. This is necessary when only BAM files are delivered but a new alignment is desired: once the BAM is converted, the new alignment can be run on the resulting FastQ files.

+

'''NOTE: Secondary and Supplementary reads are skipped when converting to FastQ. It assumes that there will only be 2 reads (the 2 primary mates) with the same read name that are not secondary or supplementary.'''

+

'''NOTE: Use the --splitRG option to split reads into read group specific FASTQs.'''

== How to use it ==

Line 16: Line 17:

When processing files sorted by read name, the only requirement is that matching read names are next to each other. It does not need to be in strict alphabetical order.

+

Read names in paired-end FASTQ files are appended with "/1" for the first in the pair and "/2" for the second in the pair. Override these defaults using [[#First in Pair FastQ ReadName Extension (--firstRNExt)|--firstRNExt]] and [[#Second in Pair FastQ ReadName Extension (--secondRNExt)|--secondRNExt]].

+

Sequences marked as reverse strand in the SAM/BAM file are reverse complemented prior to writing to the FASTQ files. To skip this step, specify [[#Do Not Reverse Complement Reverse Strands (--noReverseComp)|--noReverseComp]].

Any errors and a summary of how many pairs and unpaired reads were processed are written to stderr.

+

'''NOTE: This tool does not work on templates that have more than 2 segments. It does not properly match reads when more than 2 reads have the same read name.'''

+

'''NOTE: This tool does not split reads into read group specific FASTQs. If you want Read Group specific FASTQ files, first run [[BamUtil: splitBam]] to split the BAM into 1 BAM per Read Group. Then run bam2FastQ on each BAM.'''

=== Output Files ===

−

~~This~~ program produces 3 output fastq files.

+

By default, this program produces 3 output fastq files.

# unpaired reads

# first end of paired reads

# second end of paired reads

+

If the [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]] option is specified, the program produces 2 output fastq files.

+

# unpaired reads

+

# interleaved paired-end reads

The default fastq file names are determined by taking the base name of the input file and adding an extension for each filetype.

{|border="1" cellspacing="0" cellpadding="2"

−

! Output File Contents !! Extension

+

! colspan="2"|Default !!colspan="2"|[[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]]

+

|-

+

! Output File Contents !! Extension !! Output File Contents !! Extension

|-

+

|unpaired reads

+

| .fastq

|unpaired reads

| .fastq

Line 34: Line 51:

|first end of paired reads

| _1.fastq

+

| rowspan="2"|interleaved paired-end reads

+

(both first & second end)

+

| rowspan="2"|_interleaved.fastq

|-

|second end of paired reads

Line 39: Line 59:

|}

−

If the inputFile was "myPath/myFile.bam", the resulting ~~fastq's~~ would be:

+

If the inputFile was "myPath/myFile.bam", the resulting fastqs would be:

#myPath/myFile.fastq

#myPath/myFile_1.fastq

#myPath/myFile_2.fastq

+

With the [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]] option, the resulting fastqs would be:

+

#myPath/myFile.fastq

+

#myPath/myFile_interleaved.fastq
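The default naming rule above can be sketched as a small shell function. This is only an illustration: <code>default_fastq_names</code> is a hypothetical helper, not part of bamUtil, and it assumes the base name is the input path with its final extension removed.

```shell
# Hypothetical helper (not part of bamUtil): print the default FASTQ names
# expected for an input file, following the extension table above.
# Assumption: the base name is the input path minus its final extension.
default_fastq_names() {
    base="${1%.*}"                          # "myPath/myFile.bam" -> "myPath/myFile"
    echo "${base}.fastq"                    # unpaired reads
    if [ "${2:-}" = "merge" ]; then
        echo "${base}_interleaved.fastq"    # interleaved pairs (--merge)
    else
        echo "${base}_1.fastq"              # first end of paired reads
        echo "${base}_2.fastq"              # second end of paired reads
    fi
}

default_fastq_names myPath/myFile.bam         # default: three names
default_fastq_names myPath/myFile.bam merge   # with --merge: two names
```

Checking the printed names against the files actually written is a quick way to confirm which naming mode a run used.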

Instead of using the inputFile base name as the output file base, you can specify a different base name by using the [[#Output FastQ File Base Name (--outBase)|--outBase]] option.

You can optionally directly specify the output fastq filenames using:

−

* --firstOut firstReadInAPair.fastq

+

* --firstOut firstReadInAPair.fastq (also used for the interleaved filename with [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]])

* --secondOut secondReadInAPair.fastq

* --unpairedOut unpairedReads.fastq

Line 53: Line 77:

= Usage =

−

./bam bam2FastQ --in <inputFile> [--readName] [--refFile <referenceFile>] [--outBase <outputFileBase>] [--firstOut <1stReadInPairOutFile>] [--secondOut <2ndReadInPairOutFile>] [--unpairedOut <unpairedOutFile>] [--firstRNExt <~~firstInPairReadNameExt~~>] [--secondRNExt <secondInPairReadNameExt>] [--rnPlus] [--noReverseComp] [--noeof] [--params]

+

./bam bam2FastQ --in <inputFile> [--readName] [--splitRG] [--qualField <tag>] [--refFile <referenceFile>] [--outBase <outputFileBase>] [--firstOut <1stReadInPairOutFile>] [--merge|--secondOut <2ndReadInPairOutFile>] [--unpairedOut <unpairedOutFile>] [--firstRNExt <firstInPairReadNameExt>] [--secondRNExt <secondInPairReadNameExt>] [--rnPlus] [--noReverseComp] [--region <chr>[:<pos>[:<base>]]] [--gzip] [--noeof] [--params]

−

= Parameters =

<pre>

−

Required Parameters:

+

Required Parameters:

−

--in : the SAM/BAM file to convert to FastQ

+

--in : the SAM/BAM file to convert to FastQ

−

Optional Parameters:

+

Optional Parameters:

−

--readname : Process the BAM as readName sorted instead

+

--readname : Process the BAM as readName sorted instead

−

of coordinate if the header does not indicate a sort order.

+

of coordinate if the header does not indicate a sort order.

−

--refFile : Reference file for converting '=' in the sequence to the actual base

+

--splitRG : Split into RG specific fastqs.

−

if '=' are found and the refFile is not specified, 'N' is written to the FASTQ

+

--qualField : Use the base quality from the specified tag

−

~~--outBase : Base output name for generated output files~~

+

rather than from the Quality field (default)

−

~~--firstOut : Output name for the first in pair file~~

+

--merge : Generate 1 interleaved (merged) FASTQ for paired-ends (unpaired in a separate file)

−

~~over-rides setting of outBase~~

+

use firstOut to override the filename of the interleaved file.

−

~~--secondOut : Output name for the second in pair file~~

+

--refFile : Reference file for converting '=' in the sequence to the actual base

−

~~over-rides setting of outBase~~

+

if '=' are found and the refFile is not specified, 'N' is written to the FASTQ

−

~~--unpairedOut : Output name for unpaired reads~~

+

--firstRNExt : read name extension to use for first read in a pair

−

~~over-rides setting of outBase~~

+

default is "/1"

−

--firstRNExt : read name extension to use for first read in a pair

+

--secondRNExt : read name extension to use for second read in a pair

−

default is "/1"

+

default is "/2"

−

--secondRNExt : read name extension to use for second read in a pair

+

--rnPlus : Add the Read Name/extension to the '+' line of the fastq records

−

default is "/2"

+

--noReverseComp : Do not reverse complement reads marked as reverse

−

--rnPlus : Add the Read Name/extension to the '+' line of the fastq records

+

--region : Only convert reads containing the specified region/nucleotide.

−

--noReverseComp : Do not reverse complement reads marked as reverse

+

Position formatted as: chr:pos:base

−

--noeof : Do not expect an EOF block on a bam file.

+

pos (0-based) & base are optional.

−

--params : Print the parameter settings to stderr

+

--gzip : Compress the output FASTQ files using gzip

+

--noeof : Do not expect an EOF block on a bam file.

+

--params : Print the parameter settings to stderr

+

Optional OutputFile Names:

+

--outBase : Base output name for generated output files

+

--firstOut : Output name for the first in pair file

+

over-rides setting of outBase

+

--secondOut : Output name for the second in pair file

+

over-rides setting of outBase

+

--unpairedOut : Output name for unpaired reads

+

over-rides setting of outBase

</pre>

+

== Required Parameters ==

−

== Output FastQ File Base Name (<code>--outBase</code>) ==

+

== Optional Parameters ==

+

=== BAM File Is Sorted By Read Name (<code>--readname</code>) ===

+

The bam2FastQ program by default checks the sort order in the SAM/BAM header when converting to FASTQ, and if that is not specified, assumes it is sorted by coordinate.

+

To override the default and force it to assume the file is sorted by read name, specify the <code>--readName</code> option.

+

The file does not need to be strictly sorted by read name. The only requirement is that matching read names are next to each other.

+

=== Split into RG Specific FASTQs (<code>--splitRG</code>) ===

+

Create RG specific FASTQ files.

+

Cannot be specified with firstOut/secondOut/unpairedOut since there will be a different filename for each RG.

+

Cannot write to stdout when <code>--splitRG</code> is specified.

+

Output filenames will be <outBase>.<RG>_1.fastq, <outBase>.<RG>_2.fastq, and <outBase>.<RG>.fastq. A fastq list file <outBase>.list will be created containing MERGE_NAME (the RG tag's SM value or outBase if the value is empty), fastq 1, fastq 2 (or . if it is a single ended fastq), and the RG tag string.
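The per-read-group naming pattern described above can be sketched in shell. <code>rg_fastq_names</code> is a hypothetical helper, not part of bamUtil, and the output base and read group ID used in the call are made-up values.

```shell
# Hypothetical helper (not part of bamUtil): print the three FASTQ names
# that the --splitRG naming pattern implies for one read group.
rg_fastq_names() {
    out_base="$1"                       # value that --outBase would carry
    rg="$2"                             # read group ID
    echo "${out_base}.${rg}_1.fastq"    # first end of paired reads
    echo "${out_base}.${rg}_2.fastq"    # second end of paired reads
    echo "${out_base}.${rg}.fastq"      # unpaired reads
}

# "myOut" and "RG1" are illustrative values only.
rg_fastq_names myOut RG1
```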

+

=== Use the Base Quality from the Specified Tag (<code>--qualField</code>) ===

+

By default, the quality field is used for the Base Qualities in the FASTQ file. Specify <code>--qualField <tagName></code> to use the base qualities from the specified tag instead of the quality field.

+

=== Generate 1 Paired-End Output File (<code>--merge</code>) ===

+

Use the <code>--merge</code> option to generate 1 interleaved (merged) FASTQ for paired-ends instead of 2 files. Unpaired reads are still written to a separate file.

+

The default extension for the output file is "_interleaved".

+

Use [[#Output FastQ File Name For the First End of Paired End (--firstOut)|<code>--firstOut</code>]] to override the filename of the interleaved file.

+

This parameter was added in version 1.0.10.

+

=== Reference File for Converting '=' in the Sequence to Bases (<code>--refFile</code>) ===

+

If the SAM/BAM file contains '=' in the sequence instead of the actual bases, the bam2FastQ program needs to convert the '=' back to the bases. To do that it needs the reference. Specify the reference by using <code>--refFile</code> followed by the reference filename.

+

For example:

+

./bam bam2FastQ --in myFile.bam --refFile myPath/myRefFile.fa

+

=== First in Pair FastQ ReadName Extension (<code>--firstRNExt</code>) ===

+

<code>--firstRNExt</code> overrides the default "/1" that is appended to the Read Name of the first-end of a read pair with the specified value.

+

=== Second in Pair FastQ ReadName Extension (<code>--secondRNExt</code>) ===

+

<code>--secondRNExt</code> overrides the default "/2" that is appended to the Read Name of the second-end of a read pair with the specified value.

+

=== Include the Read Name on the "+" line of the FASTQ (<code>--rnPlus</code>) ===

+

By default the read name is not included on the "+" line of the FASTQ files. To include the read name and the extension for paired-end reads, specify <code>--rnPlus</code>.
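To make the effect on the "+" line concrete, the following sketch prints one FASTQ record with and without the read name repeated. <code>fastq_record</code> is a hypothetical helper that only imitates the record layout described here; the read name, sequence, and quality string are made-up values.

```shell
# Hypothetical helper (not part of bamUtil): print one 4-line FASTQ record.
# Arguments: read name, pair extension, sequence, qualities, optional "rnPlus".
fastq_record() {
    name="$1$2"                    # read name plus pair extension, e.g. "read001/1"
    printf '@%s\n' "$name"         # header line
    printf '%s\n' "$3"             # sequence line
    if [ "${5:-}" = "rnPlus" ]; then
        printf '+%s\n' "$name"     # "+" line repeating the name (--rnPlus)
    else
        printf '+\n'               # default: bare "+" line
    fi
    printf '%s\n' "$4"             # quality line
}

fastq_record read001 /1 ACCGTG IIIIII          # default "+" line
fastq_record read001 /1 ACCGTG IIIIII rnPlus   # "+" line carries the name
```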

+

=== Do Not Reverse Complement Reverse Strands (<code>--noReverseComp</code>) ===

+

By default, reads marked as reverse in the BAM file are reverse complemented prior to writing to the FASTQ files. <code>--noReverseComp</code> disables this feature, and skips the reverse complement step.

+

For example, if a sequence is ACCGTG marked as reverse, the default FASTQ record will be written as: CACGGT

+

Specifying <code>--noReverseComp</code> would result in a FASTQ sequence of ACCGTG
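The ACCGTG example can be reproduced with standard tools. This is a minimal sketch of reverse complementing a sequence string, not the code bam2FastQ itself uses; <code>revcomp</code> is a hypothetical helper name, and only the bases A, C, G, and T (either case) are handled.

```shell
# Hypothetical helper (not part of bamUtil): reverse complement a DNA string.
# Characters other than A/C/G/T (either case) are reversed but left unchanged.
revcomp() {
    printf '%s\n' "$1" |
        # reverse the characters of the line
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        # swap each base for its complement
        tr 'ACGTacgt' 'TGCAtgca'
}

revcomp ACCGTG   # prints CACGGT
```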

+

=== Only Convert the Specified Region (<code>--region</code>) ===

+

Only convert reads containing the specified region/nucleotide.

+

Position formatted as: chr:pos:base

+

pos (0-based) & base are optional.
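The chr:pos:base layout, with its optional trailing fields, can be illustrated by splitting a region string on colons. <code>parse_region</code> is a hypothetical helper for building or checking a region argument; it does not reflect how bam2FastQ parses the option internally, and the region values are made up.

```shell
# Hypothetical helper (not part of bamUtil): split "chr[:pos[:base]]" into fields.
# Missing optional fields are printed as empty values.
parse_region() {
    old_ifs="$IFS"
    IFS=:
    set -- $1                  # unquoted on purpose: split the spec on ":"
    IFS="$old_ifs"
    echo "chr=${1:-} pos=${2:-} base=${3:-}"
}

parse_region chr1:1000:A   # all three fields given
parse_region chr2          # pos and base omitted
```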

+

== Optional Output Filenames ==

+

=== Output FastQ File Base Name (<code>--outBase</code>) ===

You can replace the default output base name by using the <code>--outBase</code> option.

Line 94: Line 201:

The value specified by this parameter is overridden by <code>--firstOut</code>, <code>--secondOut</code>, and <code>--unpairedOut</code>, but is used for whichever output files are not specified.

+

With the [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]] option, the resulting fastqs would instead be:

+

#myNewPath/myFastQBase.fastq

+

#myNewPath/myFastQBase_interleaved.fastq

−

== Output FastQ File Name For the First End of Paired End (<code>--firstOut</code>) ==

+

=== Output FastQ File Name For the First End of Paired End (<code>--firstOut</code>) ===

This setting overrides the default and <code>--outBase</code> file name.

Line 101: Line 211:

The entire filename and extension must be specified.

−

Does not affect the filenames for the ~~first~~ end or for unpaired reads.

+

Does not affect the filenames for the second end or for unpaired reads.

For example:

./bam bam2FastQ --in myFile.bam --firstOut myFileEnd1.fastq

−

+

=== Output FastQ File Name For the Second End of Paired End (<code>--secondOut</code>) ===

−

== Output FastQ File Name For the Second End of Paired End (<code>--secondOut</code>) ==

This setting overrides the default and <code>--outBase</code> file name.

Line 118: Line 227:

./bam bam2FastQ --in myFile.bam --secondOut myFileEnd2.fastq

−

+

=== Output FastQ File Name For Unpaired Reads (<code>--unpairedOut</code>) ===

−

== Output FastQ File Name For Unpaired Reads (<code>--unpairedOut</code>) ==

This setting overrides the default and <code>--outBase</code> file names.

Line 125: Line 233:

The entire filename and extension must be specified.

−

Does not affect the filenames for the ~~two~~ paired end fastq files.

+

Does not affect the filenames for the paired-end fastq files.

For example:

./bam bam2FastQ --in myFile.bam --unpairedOut myFileUnpaired.fastq

+

−

=~~= BAM File Is Sorted By Read Name (<code>--readname</code>) =~~=

+

= Return Value =

−

~~The bam2FastQ program by default checks the sort order in the SAM/BAM header when converting to FASTQ, and~~ if ~~that is not specified, assumes it is sorted by coordinate~~.

+

Returns -1 if input parameters are invalid.

−

~~To override~~ the ~~default and force it to assume the file is sorted by readname, specify the <code>--readName</code> option~~

+

Returns the SamStatus for the reads/writes (0 on success, non-0 on failure).
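In a script, the return value is normally checked through the shell exit status. The wrapper below is a sketch; <code>report_status</code> is a hypothetical helper, not part of bamUtil. Shell exit statuses are reported in the range 0 to 255, so a program that returns -1 is typically seen as status 255.

```shell
# Hypothetical helper (not part of bamUtil): run a command and report its
# exit status. Any non-zero status is treated as a failure.
report_status() {
    if "$@"; then
        status=0
    else
        status=$?              # status of the command that just failed
    fi
    if [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "failed with status $status"
    fi
    return "$status"
}

# Intended use (requires the bam executable, so it is left commented out):
# report_status ./bam bam2FastQ --in myFile.bam
```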

−

~~The file does not need to be strictly sorted by read name. The only requirement is that matching read names are next to each other.~~

−

~~== Reference File~~ for ~~Converting '=' in~~ the ~~Sequence to Bases <code>--refFile</code>==~~

−

~~If the SAM~~/~~BAM file contains '=' in the sequence instead of the actual bases~~, ~~the bam2FastQ program needs to convert the '=' back to the bases. To do that it needs the reference. Specify the reference by using <code>--refFile</code> followed by the reference filename.~~

−

~~For example:~~

−

~~./bam bam2FastQ --in myFile.bam -~~-~~refFile myPath/myRefFile~~.fa

Mktrost

Administrators

3,045

edits

BamUtil: bam2FastQ (view source)

Revision as of 23:53, 5 March 2016
