Line 1: |
Line 1: |
− | =This functionality will be released on 5/17/2012=
| |
− |
| |
| = Overview of the <code>bam2FastQ</code> function of <code>[[bamUtil]]</code> = | | = Overview of the <code>bam2FastQ</code> function of <code>[[bamUtil]]</code> = |
| The <code>bam2FastQ</code> option of [[bamUtil]] converts a BAM file into FastQ files. This is necessary when only BAM files are delivered but a new alignment is desired: once the BAM file has been converted, the new alignment can be run from the FastQ files. | | The <code>bam2FastQ</code> option of [[bamUtil]] converts a BAM file into FastQ files. This is necessary when only BAM files are delivered but a new alignment is desired: once the BAM file has been converted, the new alignment can be run from the FastQ files. |
| | | |
| + | '''NOTE: Secondary and Supplementary reads are skipped when converting to FastQ. It assumes that there will only be 2 reads (the 2 primary mates) with the same read name that are not secondary or supplementary.''' |
| + | |
| + | '''NOTE: Use the --splitRG option to split reads into read group specific FASTQs.''' |
| | | |
| == How to use it == | | == How to use it == |
Line 16: |
Line 17: |
| | | |
| When processing files sorted by read name, the only requirement is that matching read names are next to each other. It does not need to be in strict alphabetical order. | | When processing files sorted by read name, the only requirement is that matching read names are next to each other. It does not need to be in strict alphabetical order. |
| + | |
| + | Read names in paired-end FASTQ files are appended with "/1" for the first in the pair and "/2" for the second in the pair. Override these defaults using [[#First in Pair FastQ ReadName Extension (--firstRNExt)|--firstRNExt]] and [[#Second in Pair FastQ ReadName Extension (--secondRNExt)|--secondRNExt]]. |
| + | |
| + | Sequences marked as reverse strand in the SAM/BAM file are reverse complemented prior to writing to the FASTQ files. To skip this step, specify [[#Do Not Reverse Complement Reverse Strands (--noReverseComp)|--noReverseComp]]. |
| | | |
| Any errors and a summary of how many pairs and unpaired reads were processed are written to stderr. | | Any errors and a summary of how many pairs and unpaired reads were processed are written to stderr. |
| + | |
| + | '''NOTE: This tool does not work on templates that have more than 2 segments. It does not properly match reads when more than 2 reads have the same read name.''' |
| + | |
| + | '''NOTE: This tool does not split reads into read group specific FASTQs. If you want Read Group specific FASTQ files, first run [[BamUtil: splitBam]] to split the BAM into 1 BAM per Read Group. Then run bam2FastQ on each BAM.''' |
| | | |
| === Output Files === | | === Output Files === |
− | This program produces 3 output fastq files.
| + | By default, this program produces 3 output fastq files. |
| # unpaired reads | | # unpaired reads |
| # first end of paired reads | | # first end of paired reads |
| # second end of paired reads | | # second end of paired reads |
| + | |
| + | If the [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]] option is specified, the program produces 2 output fastq files. |
| + | # unpaired reads |
| + | # interleaved paired-end reads |
| | | |
| The default fastq file names are determined by taking the base name of the input file and adding an extension for each filetype. | | The default fastq file names are determined by taking the base name of the input file and adding an extension for each filetype. |
| {|border="1" cellspacing="0" cellpadding="2" | | {|border="1" cellspacing="0" cellpadding="2" |
− | ! Output File Contents !! Extension | + | ! colspan="2"|Default !!colspan="2"|[[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]] |
| + | |- |
| + | ! Output File Contents !! Extension !! Output File Contents !! Extension |
| |- | | |- |
| + | |unpaired reads |
| + | | .fastq |
| |unpaired reads | | |unpaired reads |
| | .fastq | | | .fastq |
Line 34: |
Line 51: |
| |first end of paired reads | | |first end of paired reads |
| | _1.fastq | | | _1.fastq |
| + | | rowspan="2"|interleaved paired-end reads |
| + | (both first & second end) |
| + | | rowspan="2"|_interleaved.fastq |
| |- | | |- |
| |second end of paired reads | | |second end of paired reads |
Line 39: |
Line 59: |
| |} | | |} |
| | | |
− | If the inputFile was "myPath/myFile.bam", the resulting fastq's would be: | + | If the inputFile was "myPath/myFile.bam", the resulting fastqs would be: |
| #myPath/myFile.fastq | | #myPath/myFile.fastq |
| #myPath/myFile_1.fastq | | #myPath/myFile_1.fastq |
| #myPath/myFile_2.fastq | | #myPath/myFile_2.fastq |
| + | |
| + | With the [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]] option, the resulting fastqs would be: |
| + | #myPath/myFile.fastq |
| + | #myPath/myFile_interleaved.fastq |
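The default naming rule described above can be sketched as a small shell function. This is only an illustration of the rule as stated here, not code from bamUtil: the function name <code>default_fastq_names</code> and its <code>merge</code> argument are hypothetical, and it assumes the base name is the input path with its final extension removed.

```sh
# Hypothetical helper (not part of bamUtil): print the default FASTQ names
# described above for a given input file, one per line.
# Usage: default_fastq_names <inputFile> [merge]
default_fastq_names() (
    input="$1"
    mode="${2:-default}"
    base="${input%.*}"             # assumption: drop the final extension, e.g. ".bam"
    echo "${base}.fastq"           # unpaired reads
    if [ "$mode" = "merge" ]; then
        echo "${base}_interleaved.fastq"   # interleaved paired-end reads
    else
        echo "${base}_1.fastq"     # first end of paired reads
        echo "${base}_2.fastq"     # second end of paired reads
    fi
)
```

Calling <code>default_fastq_names myPath/myFile.bam</code> lists the three default names, and adding <code>merge</code> as the second argument lists the two names used with <code>--merge</code>.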
| | | |
| Instead of using the inputFile base name as the output file base, you can specify a different base name by using the [[#Output FastQ File Base Name (--outBase)|--outBase]] option. | | Instead of using the inputFile base name as the output file base, you can specify a different base name by using the [[#Output FastQ File Base Name (--outBase)|--outBase]] option. |
| | | |
| You can optionally directly specify the output fastq filenames using: | | You can optionally directly specify the output fastq filenames using: |
− | * --firstOut firstReadInAPair.fastq | + | * --firstOut firstReadInAPair.fastq (also used for the interleaved filename with [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]]) |
| * --secondOut secondReadInAPair.fastq | | * --secondOut secondReadInAPair.fastq |
| * --unpairedOut unpairedReads.fastq | | * --unpairedOut unpairedReads.fastq |
Line 53: |
Line 77: |
| | | |
| = Usage = | | = Usage = |
− | ./bam bam2FastQ --in <inputFile> [--readName] [--refFile <referenceFile>] [--outBase <outputFileBase>] [--firstOut <1stReadInPairOutFile>] [--secondOut <2ndReadInPairOutFile>] [--unpairedOut <unpairedOutFile>] [--firstRNExt <firstInPairReadNameExt>] [--secondRNExt <secondInPairReadNameExt>] [--rnPlus] [--noReverseComp] [--noeof] [--params]
| + | ./bam bam2FastQ --in <inputFile> [--readName] [--splitRG] [--qualField <tag>] [--refFile <referenceFile>] [--outBase <outputFileBase>] [--firstOut <1stReadInPairOutFile>] [--merge|--secondOut <2ndReadInPairOutFile>] [--unpairedOut <unpairedOutFile>] [--firstRNExt <firstInPairReadNameExt>] [--secondRNExt <secondInPairReadNameExt>] [--rnPlus] [--noReverseComp] [--region <chr>[:<pos>[:<base>]]] [--gzip] [--noeof] [--params] |
− | | |
| | | |
| = Parameters = | | = Parameters = |
| <pre> | | <pre> |
− | Required Parameters:
| + | Required Parameters: |
− | --in : the SAM/BAM file to convert to FastQ
| + | --in : the SAM/BAM file to convert to FastQ |
− | Optional Parameters:
| + | Optional Parameters: |
− | --readname : Process the BAM as readName sorted instead
| + | --readname : Process the BAM as readName sorted instead |
− | of coordinate if the header does not indicate a sort order.
| + | of coordinate if the header does not indicate a sort order. |
− | --refFile : Reference file for converting '=' in the sequence to the actual base
| + | --splitRG : Split into RG specific fastqs. |
− | if '=' are found and the refFile is not specified, 'N' is written to the FASTQ
| + | --qualField : Use the base quality from the specified tag |
− | --outBase : Base output name for generated output files
| + | rather than from the Quality field (default) |
− | --firstOut : Output name for the first in pair file
| + | --merge : Generate 1 interleaved (merged) FASTQ for paired-ends (unpaired in a separate file) |
− | over-rides setting of outBase
| + | use firstOut to override the filename of the interleaved file. |
− | --secondOut : Output name for the second in pair file
| + | --refFile : Reference file for converting '=' in the sequence to the actual base |
− | over-rides setting of outBase
| + | if '=' are found and the refFile is not specified, 'N' is written to the FASTQ |
− | --unpairedOut : Output name for unpaired reads
| + | --firstRNExt : read name extension to use for first read in a pair |
− | over-rides setting of outBase
| + | default is "/1" |
− | --firstRNExt : read name extension to use for first read in a pair
| + | --secondRNExt : read name extension to use for second read in a pair |
− | default is "/1"
| + | default is "/2" |
− | --secondRNExt : read name extension to use for second read in a pair
| + | --rnPlus : Add the Read Name/extension to the '+' line of the fastq records |
− | default is "/2"
| + | --noReverseComp : Do not reverse complement reads marked as reverse |
− | --rnPlus : Add the Read Name/extension to the '+' line of the fastq records
| + | --region : Only convert reads containing the specified region/nucleotide. |
− | --noReverseComp : Do not reverse complement reads marked as reverse
| + | Position formatted as: chr:pos:base |
− | --noeof : Do not expect an EOF block on a bam file.
| + | pos (0-based) & base are optional. |
− | --params : Print the parameter settings to stderr
| + | --gzip : Compress the output FASTQ files using gzip |
| + | --noeof : Do not expect an EOF block on a bam file. |
| + | --params : Print the parameter settings to stderr |
| + | Optional OutputFile Names: |
| + | --outBase : Base output name for generated output files |
| + | --firstOut : Output name for the first in pair file |
| + | over-rides setting of outBase |
| + | --secondOut : Output name for the second in pair file |
| + | over-rides setting of outBase |
| + | --unpairedOut : Output name for unpaired reads |
| + | over-rides setting of outBase |
| </pre> | | </pre> |
| | | |
| + | == Required Parameters == |
| {{inBAMInputFile}} | | {{inBAMInputFile}} |
| | | |
− | == Output FastQ File Base Name (<code>--outBase</code>) == | + | == Optional Parameters == |
| + | === BAM File Is Sorted By Read Name (<code>--readname</code>) === |
| + | |
| + | The bam2FastQ program by default checks the sort order in the SAM/BAM header when converting to FASTQ, and if that is not specified, assumes it is sorted by coordinate. |
| + | |
| + | To override the default and force it to assume the file is sorted by read name, specify the <code>--readName</code> option. |
| + | |
| + | The file does not need to be strictly sorted by read name. The only requirement is that matching read names are next to each other. |
| + | |
| + | === Split into RG Specific FASTQs (<code>--splitRG</code>) === |
| + | |
| + | Create RG specific FASTQ files. |
| + | |
| + | Cannot be specified with firstOut/secondOut/unpairedOut since there will be a different filename for each RG. |
| + | |
| + | Cannot write to stdout when <code>--splitRG</code> is specified. |
| + | |
| + | Output filenames will be <outBase>.<RG>_1.fastq, <outBase>.<RG>_2.fastq, and <outBase>.<RG>.fastq. A fastq list file <outBase>.list will be created containing MERGE_NAME (the RG tag's SM value or outBase if the value is empty), fastq 1, fastq 2 (or . if it is a single ended fastq), and the RG tag string. |
| + | |
| + | === Use the Base Quality from the Specified Tag (<code>--qualField</code>) === |
| + | |
| + | By default, the quality field is used for the Base Qualities in the FASTQ file. Specify <code>--qualField <tagName></code> to use the base qualities from the specified tag instead of the quality field. |
| + | |
| + | |
| + | === Generate 1 Paired-End Output File (<code>--merge</code>) === |
| + | |
| + | Use the <code>--merge</code> option to generate 1 interleaved (merged) FASTQ for paired-ends instead of 2 files. Unpaired reads are still written to a separate file. |
| + | |
| + | The default extension for the output file is "_interleaved". |
| + | |
| + | Use [[#Output FastQ File Name For the First End of Paired End (--firstOut)|<code>--firstOut</code>]] to override the filename of the interleaved file. |
| + | |
| + | This parameter was added in version 1.0.10. |
| + | |
| + | === Reference File for Converting '=' in the Sequence to Bases (<code>--refFile</code>) === |
| + | If the SAM/BAM file contains '=' in the sequence instead of the actual bases, the bam2FastQ program needs to convert the '=' back to the bases. To do that it needs the reference. Specify the reference by using <code>--refFile</code> followed by the reference filename. |
| + | |
| + | For example: |
| + | ./bam bam2FastQ --in myFile.bam --refFile myPath/myRefFile.fa |
| + | |
| + | === First in Pair FastQ ReadName Extension (<code>--firstRNExt</code>) === |
| + | |
| + | <code>--firstRNExt</code> overrides the default "/1" that is appended to the Read Name of the first-end of a read pair with the specified value. |
| + | |
| + | === Second in Pair FastQ ReadName Extension (<code>--secondRNExt</code>) === |
| + | |
| + | <code>--secondRNExt</code> overrides the default "/2" that is appended to the Read Name of the second-end of a read pair with the specified value. |
| + | |
| + | === Include the Read Name on the "+" line of the FASTQ (<code>--rnPlus</code>) === |
| + | |
| + | By default the read name is not included on the "+" line of the FASTQ files. To include the read name and the extension for paired-end reads, specify <code>--rnPlus</code>. |
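To make the effect of the read name extensions and <code>--rnPlus</code> concrete, the following sketch formats a single four-line FASTQ record. It is an illustration under stated assumptions, not bamUtil's implementation: the function <code>fastq_record</code> and its <code>yes</code>/<code>no</code> switch are hypothetical names, and the read name, sequence, and quality strings in the usage note are made-up sample values.

```sh
# Hypothetical helper (not part of bamUtil): print one FASTQ record.
# Usage: fastq_record <readName> <extension> <sequence> <qualities> [yes|no]
# The fifth argument mimics --rnPlus: "yes" repeats the read name and
# extension on the '+' line; anything else leaves the '+' line bare.
fastq_record() (
    name="$1"; ext="$2"; seq="$3"; qual="$4"
    rn_plus="${5:-no}"
    printf '@%s%s\n%s\n' "$name" "$ext" "$seq"   # header line, then sequence
    if [ "$rn_plus" = "yes" ]; then
        printf '+%s%s\n' "$name" "$ext"          # '+' line with read name/extension
    else
        printf '+\n'                             # default: bare '+' line
    fi
    printf '%s\n' "$qual"                        # base qualities
)
```

For example, <code>fastq_record read7 /1 ACGT IIII</code> prints a record whose header is <code>@read7/1</code> and whose third line is a bare <code>+</code>, while passing <code>yes</code> as the fifth argument makes the third line <code>+read7/1</code>.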
| + | |
| + | === Do Not Reverse Complement Reverse Strands (<code>--noReverseComp</code>) === |
| + | |
| + | By default, reads marked as reverse in the BAM file are reverse complemented prior to writing to the FASTQ files. <code>--noReverseComp</code> disables this feature, and skips the reverse complement step. |
| + | |
| + | For example, if a sequence is ACCGTG marked as reverse, the default FASTQ record will be written as: CACGGT |
| + | |
| + | Specifying <code>--noReverseComp</code> would result in a FASTQ sequence of ACCGTG |
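The reverse complement in the example above can be reproduced with a few lines of portable shell. This is a minimal sketch for checking the example by hand; <code>reverse_complement</code> is a hypothetical helper, not a bamUtil command, and it only handles the bases A, C, G, and T (other characters such as N are left unchanged).

```sh
# Hypothetical helper (not part of bamUtil): reverse complement a sequence.
# Usage: reverse_complement <sequence>
reverse_complement() (
    seq="$1"
    out=""
    # Reverse the string by moving one base at a time to the front of $out.
    while [ -n "$seq" ]; do
        rest="${seq#?}"            # everything after the first base
        first="${seq%"$rest"}"     # the first base
        out="${first}${out}"
        seq="$rest"
    done
    # Complement each base: A<->T, C<->G (upper and lower case).
    printf '%s\n' "$out" | tr 'ACGTacgt' 'TGCAtgca'
)
```

Running <code>reverse_complement ACCGTG</code> prints <code>CACGGT</code>, matching the default output described above.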
| + | |
| + | === Only Convert the Specified Region (<code>--region</code>) === |
| + | |
| + | Only convert reads containing the specified region/nucleotide. |
| + | |
| + | Position formatted as: chr:pos:base |
| + | |
| + | pos (0-based) & base are optional. |
| + | |
| + | {{noeofBGZFParameter}} |
| + | {{paramsParameter}} |
| + | |
| + | == Optional Output Filenames == |
| + | |
| + | === Output FastQ File Base Name (<code>--outBase</code>) === |
| | | |
| You can replace the default output base name by using the <code>--outBase</code> option. | | You can replace the default output base name by using the <code>--outBase</code> option. |
Line 94: |
Line 201: |
| The value specified by this parameter is overridden by <code>--firstOut</code>, <code>--secondOut</code>, and <code>--unpairedOut</code>, but is used for whichever output files are not specified. | | The value specified by this parameter is overridden by <code>--firstOut</code>, <code>--secondOut</code>, and <code>--unpairedOut</code>, but is used for whichever output files are not specified. |
| | | |
| + | With the [[#Generate 1 Paired-End Output File (--merge)|<code>--merge</code>]] option, the resulting fastqs would instead be: |
| + | #myNewPath/myFastQBase.fastq |
| + | #myNewPath/myFastQBase_interleaved.fastq |
| | | |
− | == Output FastQ File Name For the First End of Paired End (<code>--firstOut</code>) == | + | === Output FastQ File Name For the First End of Paired End (<code>--firstOut</code>) === |
| | | |
| This setting overrides the default and <code>--outBase</code> file name. | | This setting overrides the default and <code>--outBase</code> file name. |
Line 101: |
Line 211: |
| The entire filename and extension must be specified. | | The entire filename and extension must be specified. |
| | | |
− | Does not affect the filenames for the first end or for unpaired reads. | + | Does not affect the filenames for the second end or for unpaired reads. |
| | | |
| For example: | | For example: |
| ./bam bam2FastQ --in myFile.bam --firstOut myFileEnd1.fastq | | ./bam bam2FastQ --in myFile.bam --firstOut myFileEnd1.fastq |
| | | |
− | | + | === Output FastQ File Name For the Second End of Paired End (<code>--secondOut</code>) === |
− | == Output FastQ File Name For the Second End of Paired End (<code>--secondOut</code>) == | |
| | | |
| This setting overrides the default and <code>--outBase</code> file name. | | This setting overrides the default and <code>--outBase</code> file name. |
Line 118: |
Line 227: |
| ./bam bam2FastQ --in myFile.bam --secondOut myFileEnd2.fastq | | ./bam bam2FastQ --in myFile.bam --secondOut myFileEnd2.fastq |
| | | |
− | | + | === Output FastQ File Name For Unpaired Reads (<code>--unpairedOut</code>) === |
− | == Output FastQ File Name For Unpaired Reads (<code>--unpairedOut</code>) == | |
| | | |
| This setting overrides the default and <code>--outBase</code> file names. | | This setting overrides the default and <code>--outBase</code> file names. |
Line 125: |
Line 233: |
| The entire filename and extension must be specified. | | The entire filename and extension must be specified. |
| | | |
− | Does not affect the filenames for the two paired end fastq files. | + | Does not affect the filenames for the paired-end fastq files. |
| | | |
| For example: | | For example: |
| ./bam bam2FastQ --in myFile.bam --unpairedOut myFileUnpaired.fastq | | ./bam bam2FastQ --in myFile.bam --unpairedOut myFileUnpaired.fastq |
| | | |
| + | {{PhoneHomeParameters}} |
| | | |
− | == BAM File Is Sorted By Read Name (<code>--readname</code>) == | + | = Return Value = |
| | | |
− | The bam2FastQ program by default checks the sort order in the SAM/BAM header when converting to FASTQ, and if that is not specified, assumes it is sorted by coordinate.
| + | Returns -1 if input parameters are invalid. |
| | | |
− | To override the default and force it to assume the file is sorted by readname, specify the <code>--readName</code> option
| + | Returns the SamStatus for the reads/writes (0 on success, non-0 on failure). |
− | | |
− | The file does not need to be strictly sorted by read name. The only requirement is that matching read names are next to each other.
| |
− | | |
− | | |
− | == Reference File for Converting '=' in the Sequence to Bases <code>--refFile</code>==
| |
− | If the SAM/BAM file contains '=' in the sequence instead of the actual bases, the bam2FastQ program needs to convert the '=' back to the bases. To do that it needs the reference. Specify the reference by using <code>--refFile</code> followed by the reference filename.
| |
− | | |
− | For example:
| |
− | ./bam bam2FastQ --in myFile.bam --refFile myPath/myRefFile.fa
| |