Changes

From Genome Analysis Wiki
11,423 bytes added, 23:53, 5 March 2016
Line 1: Line 1: −
=== Purpose ===
+
= Overview of the <code>bam2FastQ</code> function of <code>[[bamUtil]]</code> =
This converts a BAM file into FastQ files. This is necessary when only BAM files are delivered but a new alignment is desired. By converting BAM to FastQ files new alignments can be done using FastQ files
+
The <code>bam2FastQ</code> option of [[bamUtil]] converts a BAM file into FastQ files. This is necessary when only BAM files are delivered but a new alignment is desired: once the BAM has been converted, the new alignment can be run from the FastQ files.
 +
 
 +
'''NOTE: Secondary and Supplementary reads are skipped when converting to FastQ.  The tool assumes that, for any given read name, only 2 reads (the 2 primary mates) are neither secondary nor supplementary.'''
 +
 
 +
'''NOTE: Use the --splitRG option to split reads into read group specific FASTQs.'''
    
== How to use it ==
 
== How to use it ==
   −
When bam2FastQ is invoked without any arguments the following information is displayed
+
When bam2FastQ is invoked without any arguments the usage information is displayed as described below under [[#Usage|Usage]].
  The following parameters are in effect:
+
 
            Input BAM/SAM File :                 (-iname)
+
The input BAM file is required; see [[#input File (--in)|input File (--in)]].
+
 
Output FastQ Files
+
It works on both read/query name and coordinate sorted SAM/BAM files.  
  Output : --first [], --second [], --single []
+
 
 +
If you want to convert a SAM/BAM that is read/query name sorted but the SO field of the header does not specify "queryname", then use the [[#BAM File Is Sorted By Read Name (--readname)|--readName]] option.
 +
 
 +
When processing files sorted by read name, the only requirement is that matching read names are next to each other.  It does not need to be in strict alphabetical order.
 +
 
 +
Read Names in paired-end FASTQ files are appended with "/1" for the first in the pair, and "/2" for the second in the pair.  Override these defaults using [[#First in Pair FastQ ReadName Extension (--firstRNExt)|--firstRNExt]] and [[#Second in Pair FastQ ReadName Extension (--secondRNExt)|--secondRNExt]].
 +
 
 +
Sequences marked as Reverse strands in the SAM/BAM file are reverse complemented prior to writing to the FASTQ files.  To skip this step, specify [[#Do Not Reverse Complement Reverse Strands (--noReverseComp)|--noReverseComp]].
 +
 
 +
Any errors and a summary of how many pairs and unpaired reads were processed are written to stderr.
 +
 
 +
'''NOTE: This tool does not work on templates that have more than 2 segments.  It does not properly match reads when more than 2 reads have the same read name.'''
 +
 
 +
'''NOTE: By default, this tool does not split reads into read group specific FASTQs.  If you want Read Group specific FASTQ files, either specify <code>--splitRG</code>, or first run [[BamUtil: splitBam]] to split the BAM into 1 BAM per Read Group and then run bam2FastQ on each BAM.'''
 +
 
 +
=== Output Files ===
 +
By default, this program produces 3 output fastq files.
 +
# unpaired reads
 +
# first end of paired reads
 +
# second end of paired reads
 +
 
 +
If the [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]] option is specified, the program produces 2 output fastq files.
 +
# unpaired reads
 +
# interleaved paired-end reads
 +
 
 +
The default fastq file names are determined by taking the base name of the input file and adding an extension for each filetype. 
 +
{|border="1" cellspacing="0" cellpadding="2"
 +
! colspan="2"|Default !!colspan="2"|[[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]]
 +
|-
 +
! Output File Contents !! Extension !! Output File Contents !! Extension
 +
|-
 +
|unpaired reads
 +
| .fastq
 +
|unpaired reads
 +
| .fastq
 +
|-
 +
|first end of paired reads
 +
| _1.fastq
 +
| rowspan="2"|interleaved paired-end reads
 +
(both first & second end)
 +
| rowspan="2"|_interleaved.fastq
 +
|-
 +
|second end of paired reads
 +
| _2.fastq
 +
|}
 +
 
 +
If the inputFile was "myPath/myFile.bam", the resulting fastqs would be:
 +
#myPath/myFile.fastq
 +
#myPath/myFile_1.fastq
 +
#myPath/myFile_2.fastq
 +
 
 +
With the [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]] option, the resulting fastqs would be:
 +
#myPath/myFile.fastq
 +
#myPath/myFile_interleaved.fastq
 +
 
 +
Instead of using the inputFile base name as the output file base, you can specify a different base name by using the [[#Output FastQ File Base Name (--outBase)|--outBase]] option.
 +
 
 +
You can optionally directly specify the output fastq filenames using:
 +
* --firstOut firstReadInAPair.fastq (also used for the interleaved filename with [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]])
 +
* --secondOut secondReadInAPair.fastq
 +
* --unpairedOut unpairedReads.fastq
 +
If any of these are not specified, the <code>--outBase</code> or default is used for that file.
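
The naming rules above can be summarised in a short Python sketch. It is an illustration only and is not taken from the bam2FastQ source: the function <code>resolve_fastq_names</code> and its argument names are hypothetical, and it assumes the base name is obtained by dropping the final extension of the input file.

```python
import os


def resolve_fastq_names(in_file, out_base=None, merge=False,
                        first_out=None, second_out=None, unpaired_out=None):
    """Return the output FASTQ names implied by the rules in this section.

    Hypothetical helper: the real tool may derive the base name differently
    for inputs with unusual extensions.
    """
    # --outBase replaces the input file's base name when given.
    base = out_base if out_base is not None else os.path.splitext(in_file)[0]

    # Explicit filenames win; otherwise fall back to base + extension.
    names = {"unpaired": unpaired_out or base + ".fastq"}
    if merge:
        # With --merge, --firstOut names the interleaved file and
        # second_out is ignored in this sketch.
        names["interleaved"] = first_out or base + "_interleaved.fastq"
    else:
        names["first"] = first_out or base + "_1.fastq"
        names["second"] = second_out or base + "_2.fastq"
    return names
```

Calling <code>resolve_fastq_names("myPath/myFile.bam")</code> reproduces the three default names listed above, and passing <code>merge=True</code> reproduces the two names listed for <code>--merge</code>.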
 +
 
 +
= Usage =
 +
./bam bam2FastQ --in <inputFile> [--readName] [--splitRG] [--qualField <tag>] [--refFile <referenceFile>] [--outBase <outputFileBase>] [--firstOut <1stReadInPairOutFile>] [--merge|--secondOut <2ndReadInPairOutFile>] [--unpairedOut <unpairedOutFile>] [--firstRNExt <firstInPairReadNameExt>] [--secondRNExt <secondInPairReadNameExt>] [--rnPlus] [--noReverseComp] [--region <chr>[:<pos>[:<base>]]] [--gzip] [--noeof] [--params]
 +
 
 +
= Parameters =
 +
<pre>
 +
        Required Parameters:
 +
                --in      : the SAM/BAM file to convert to FastQ
 +
        Optional Parameters:
 +
                --readname      : Process the BAM as readName sorted instead
 +
                                  of coordinate if the header does not indicate a sort order.
 +
                --splitRG      : Split into RG specific fastqs.
 +
                --qualField    : Use the base quality from the specified tag
 +
                                  rather than from the Quality field (default)
 +
                --merge        : Generate 1 interleaved (merged) FASTQ for paired-ends (unpaired in a separate file)
 +
                                  use firstOut to override the filename of the interleaved file.
 +
                --refFile      : Reference file for converting '=' in the sequence to the actual base
 +
                                  if '=' are found and the refFile is not specified, 'N' is written to the FASTQ
 +
                --firstRNExt    : read name extension to use for first read in a pair
 +
                                  default is "/1"
 +
                --secondRNExt  : read name extension to use for second read in a pair
 +
                                  default is "/2"
 +
                --rnPlus        : Add the Read Name/extension to the '+' line of the fastq records
 +
                --noReverseComp : Do not reverse complement reads marked as reverse
 +
                --region        : Only convert reads containing the specified region/nucleotide.
 +
                                  Position formatted as: chr:pos:base
 +
                                  pos (0-based) & base are optional.
 +
                --gzip          : Compress the output FASTQ files using gzip
 +
                --noeof        : Do not expect an EOF block on a bam file.
 +
                --params        : Print the parameter settings to stderr
 +
        Optional OutputFile Names:
 +
                --outBase      : Base output name for generated output files
 +
                --firstOut      : Output name for the first in pair file
 +
                                  over-rides setting of outBase
 +
                --secondOut    : Output name for the second in pair file
 +
                                  over-rides setting of outBase
 +
                --unpairedOut  : Output name for unpaired reads
 +
                                  over-rides setting of outBase
 +
</pre>
 +
 
 +
== Required Parameters ==
 +
{{inBAMInputFile}}
 +
 
 +
== Optional Parameters ==
 +
=== BAM File Is Sorted By Read Name (<code>--readname</code>) ===
 +
 
 +
The bam2FastQ program by default checks the sort order in the SAM/BAM header when converting to FASTQ, and if that is not specified, assumes it is sorted by coordinate.
 +
 
 +
To override the default and force it to assume the file is sorted by read name, specify the <code>--readName</code> option.
 +
 
 +
The file does not need to be strictly sorted by read name.  The only requirement is that matching read names are next to each other.
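
To see why adjacency is enough, the following Python sketch groups consecutive records that share a read name. It is a simplified illustration, not the tool's implementation: <code>pair_adjacent</code> and the <code>(read_name, payload)</code> tuples are hypothetical, and treating runs of any length other than 2 as unpaired is an arbitrary choice made for the sketch.

```python
from itertools import groupby
from operator import itemgetter


def pair_adjacent(reads):
    """Split reads into pairs and unpaired reads using adjacency only.

    reads: iterable of (read_name, payload) tuples in file order.
    No global ordering is required; mates just have to be neighbours.
    """
    pairs, unpaired = [], []
    # groupby only merges *consecutive* items with the same key.
    for name, run in groupby(reads, key=itemgetter(0)):
        group = [payload for _, payload in run]
        if len(group) == 2:
            pairs.append((name, group[0], group[1]))
        else:
            # Sketch-only choice: anything that is not exactly 2 is unpaired.
            unpaired.extend((name, payload) for payload in group)
    return pairs, unpaired
```

In this sketch an input ordered as r2, r2, r1, r3, r3 still yields the pairs for r2 and r3, even though the names are not in alphabetical order.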
 +
 
 +
=== Split into RG Specific FASTQs (<code>--splitRG</code>) ===
 +
 
 +
Create RG specific FASTQ files.
 +
 
 +
<code>--splitRG</code> cannot be specified together with <code>--firstOut</code>, <code>--secondOut</code>, or <code>--unpairedOut</code>, since there will be a different filename for each RG.
 +
 
 +
Cannot write to stdout when <code>--splitRG</code> is specified.
 +
 
 +
Output filenames will be <outBase>.<RG>_1.fastq, <outBase>.<RG>_2.fastq, and <outBase>.<RG>.fastq.  A fastq list file <outBase>.list will be created containing MERGE_NAME (the RG tag's SM value or outBase if the value is empty), fastq 1, fastq 2 (or . if it is a single ended fastq), and the RG tag string.
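
As a rough illustration of the filename pattern above, the Python sketch below builds the names for one read group. The helper <code>split_rg_names</code> is hypothetical, and it assumes that <code><RG></code> stands for the read group identifier; this page does not state which RG field is used.

```python
def split_rg_names(out_base, rg):
    """Build the --splitRG output names for one read group.

    Hypothetical helper; `rg` is whatever the tool substitutes for <RG>
    (assumed here to be the read group ID).
    """
    prefix = "{}.{}".format(out_base, rg)
    return {
        "first": prefix + "_1.fastq",
        "second": prefix + "_2.fastq",
        "unpaired": prefix + ".fastq",
        # One list file is shared by all read groups.
        "list": out_base + ".list",
    }
```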
 +
 
 +
=== Use the Base Quality from the Specified Tag (<code>--qualField</code>) ===
 +
 
 +
By default, the quality field is used for the Base Qualities in the FASTQ file.  Specify <code>--qualField <tagName></code> to use the base qualities from the specified tag instead of the quality field.
 +
 
 +
 
 +
=== Generate 1 Paired-End Output File (<code>--merge</code>) ===
 +
 
 +
Use the <code>--merge</code> option to generate 1 interleaved (merged) FASTQ for paired-ends instead of 2 files.  Unpaired reads are still written to a separate file.
 +
 
 +
The default extension for the output file is "_interleaved".
 +
 
 +
Use [[#Output FastQ File Name For the First End of Paired End (--firstOut)|<code>--firstOut</code>]] to override the filename of the interleaved file.
 +
 
 +
This parameter was added in version 1.0.10.
 +
 
 +
=== Reference File for Converting '=' in the Sequence to Bases (<code>--refFile</code>) ===
 +
If the SAM/BAM file contains '=' in the sequence instead of the actual bases, the bam2FastQ program needs to convert the '=' back to the bases.  To do that it needs the reference.  Specify the reference by using <code>--refFile</code> followed by the reference filename.
 +
 
 +
For example:
 +
./bam bam2FastQ --in myFile.bam --refFile myPath/myRefFile.fa
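
The substitution itself can be sketched in Python as follows. This is a simplified illustration, not the tool's code: <code>restore_equals</code> is a hypothetical helper, and it assumes the caller has already looked up the reference base for every read position (working out those positions from the alignment is omitted).

```python
def restore_equals(seq, ref_bases=None):
    """Replace '=' in a read sequence.

    seq:       read sequence that may contain '='.
    ref_bases: reference base for each read position (same length as seq),
               or None when no reference is available.
    """
    if ref_bases is None:
        # Without a reference, 'N' is written in place of each '='.
        return seq.replace("=", "N")
    if len(ref_bases) != len(seq):
        raise ValueError("ref_bases must be the same length as seq")
    # Keep explicit bases; take the reference base wherever the read has '='.
    return "".join(r if s == "=" else s for s, r in zip(seq, ref_bases))
```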
 +
 
 +
=== First in Pair FastQ ReadName Extension (<code>--firstRNExt</code>) ===
 +
 
 +
<code>--firstRNExt</code> overrides the default "/1" that is appended to the Read Name of the first-end of a read pair with the specified value.
 +
 
 +
=== Second in Pair FastQ ReadName Extension (<code>--secondRNExt</code>) ===
 +
 
 +
<code>--secondRNExt</code> overrides the default "/2" that is appended to the Read Name of the second-end of a read pair with the specified value.
 +
 
 +
=== Include the Read Name on the "+" line of the FASTQ (<code>--rnPlus</code>) ===
 +
 
 +
By default the read name is not included on the "+" line of the FASTQ files.  To include the read name and the extension for paired-end reads, specify <code>--rnPlus</code>.
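
The effect of the read name extensions and of <code>--rnPlus</code> on a record can be sketched in Python. The function <code>fastq_record</code> is a hypothetical illustration built on the standard four-line FASTQ layout; it is not taken from the bam2FastQ source.

```python
def fastq_record(read_name, seq, qual, ext="", rn_plus=False):
    """Format one four-line FASTQ record.

    ext:     read name extension, e.g. "/1" or "/2" ("" for unpaired reads).
    rn_plus: repeat the read name and extension on the '+' line.
    """
    name = read_name + ext
    plus = "+" + name if rn_plus else "+"
    return "@{}\n{}\n{}\n{}\n".format(name, seq, plus, qual)
```

For example, <code>fastq_record("read1", "ACGT", "IIII", ext="/1")</code> gives a record whose first line is <code>@read1/1</code> and whose third line is a bare <code>+</code>; with <code>rn_plus=True</code> the third line becomes <code>+read1/1</code>.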
 +
 
 +
=== Do Not Reverse Complement Reverse Strands (<code>--noReverseComp</code>) ===
 +
 
 +
By default, reads marked as reverse in the BAM file are reverse complemented prior to writing to the FASTQ files.  <code>--noReverseComp</code> disables this feature, and skips the reverse complement step.
 +
 
 +
For example, if a sequence is ACCGTG marked as reverse, the default FASTQ record will be written as: CACGGT
 +
 
 +
Specifying <code>--noReverseComp</code> would result in a FASTQ sequence of ACCGTG
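
The example above can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names, not the tool's implementation, and it only handles the sequence line.

```python
# Complement table for the standard bases; 'N' maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def fastq_sequence(seq, is_reverse, no_reverse_comp=False):
    """Return the sequence as it would be written for one read."""
    if is_reverse and not no_reverse_comp:
        return reverse_complement(seq)
    return seq
```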
 +
 
 +
=== Only Convert the Specified Region (<code>--region</code>) ===
 +
 
 +
Only convert reads containing the specified region/nucleotide.
 +
 
 +
Position formatted as: chr:pos:base
 +
 
 +
pos (0-based) & base are optional.
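
The <code>chr:pos:base</code> format can be illustrated with a small Python parser. <code>parse_region</code> is a hypothetical helper written for this page; it assumes the chromosome name itself contains no colon and says nothing about how the tool validates its input.

```python
def parse_region(region):
    """Split a region string into (chrom, pos, base).

    pos (0-based) and base are optional and returned as None when absent.
    """
    parts = region.split(":")
    if not 1 <= len(parts) <= 3 or not parts[0]:
        raise ValueError("expected chr[:pos[:base]], got %r" % region)
    chrom = parts[0]
    pos = int(parts[1]) if len(parts) > 1 else None
    base = parts[2] if len(parts) > 2 else None
    return chrom, pos, base
```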
 +
 
 +
{{noeofBGZFParameter}}
 +
{{paramsParameter}}
 +
 
 +
== Optional Output Filenames ==
 +
 
 +
=== Output FastQ File Base Name (<code>--outBase</code>) ===
 +
 
 +
You can replace the default output base name by using the <code>--outBase</code> option.
 +
If the outBase was "myNewPath/myFastQBase", the resulting fastqs would be:
 +
#myNewPath/myFastQBase.fastq
 +
#myNewPath/myFastQBase_1.fastq
 +
#myNewPath/myFastQBase_2.fastq
 +
 
 +
The value specified by this parameter is overridden by <code>--firstOut</code>, <code>--secondOut</code>, and <code>--unpairedOut</code>, but is used for whichever output files are not specified.
 +
 
 +
With the [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]] option, the resulting fastqs would instead be:
 +
#myNewPath/myFastQBase.fastq
 +
#myNewPath/myFastQBase_interleaved.fastq
 +
 
 +
=== Output FastQ File Name For the First End of Paired End (<code>--firstOut</code>) ===
 +
 
 +
This setting overrides the default and <code>--outBase</code> file name.
 +
 
 +
The entire filename and extension must be specified.
 +
 
 +
Does not affect the filenames for the second end or for unpaired reads.
 +
 
 +
For example:
 +
./bam bam2FastQ --in myFile.bam --firstOut myFileEnd1.fastq
 +
 
 +
=== Output FastQ File Name For the Second End of Paired End (<code>--secondOut</code>) ===
 +
 
 +
This setting overrides the default and <code>--outBase</code> file name.
 +
 
 +
The entire filename and extension must be specified.
 +
 
 +
Does not affect the filenames for the first end or for unpaired reads.
 +
 
 +
For example:
 +
./bam bam2FastQ --in myFile.bam --secondOut myFileEnd2.fastq
 +
 
 +
=== Output FastQ File Name For Unpaired Reads (<code>--unpairedOut</code>) ===
   −
Required parameter
+
This setting overrides the default and <code>--outBase</code> file names.
-i InputBAM/SAM
     −
Optional parameters for output (however either --single or both --first and --second have to be provided)
+
The entire filename and extension must be specified.
--first firstReadInAPair_FastQ
  −
--second secondReadInAPair_FastQ
  −
--single unpairedReads_FastQ
     −
In order to extract paired end reads, the BAM file has to be sorted by name, e.g. using samtools. Suppose the BAM file is myinput.bam
+
Does not affect the filenames for the paired-end fastq files.
   −
  samtools sort -n myinput.bam myinout.sortByName.bam
+
For example:
 +
  ./bam bam2FastQ --in myFile.bam --unpairedOut myFileUnpaired.fastq
   −
Using sorted bam file to extract paired end fastq files
+
{{PhoneHomeParameters}}
   −
  bam2FastQ -i myinput.sortByName.BAM --first myread1.fastQ --second myread2.fastQ
+
= Return Value =
   −
Or to extract both paired end and single end fastq files (if the bam file contains both single and paired end reads)
+
Returns -1 if input parameters are invalid.
  −
  bam2FastQ -i myinput.sortByName.BAM --first myread1.fastQ --second myread2.fastQ --single myreadSingle.fastQ
     −
Or using bam (sorted or not) file to extract single end fastq files
+
Returns the SamStatus for the reads/writes (0 on success, non-0 on failure).
  −
  bam2FastQ -i myinput.sortByName.BAM --single myreadSingle.fastQ
 
