BamUtil: polishBam

From Genome Analysis Wiki

Latest revision as of 13:06, 6 January 2014

Overview of the polishBam function of bamUtil

The polishBam option of the bamUtil executable adds or updates header lines and adds the RG tag to each record.

Usage

./bam polishBam (options) --in <inBamFile> --out <outBamFile>

Parameters

   Required parameters: 
        -i/--in : input BAM file
        -o/--out : output BAM file
   Optional parameters:
        -v : turn on verbose mode
        -l/--log : writes logfile. <outBamFile>.log will be used if value is unspecified
        --HD : add @HD header line
        --RG : add @RG header line
        --PG : add @PG header line
        -f/--fasta : fasta reference file to compute MD5sums and update SQ tags
        --AS : AS tag for genome assembly identifier
        --UR : UR tag for @SQ tag (if different from --fasta)
        --SP : SP tag for @SQ tag
        --checkSQ : check the consistency of SQ tags (SN and LN) with existing header lines. Must be used with --fasta option
   PhoneHome:
        --noPhoneHome       : disable PhoneHome (default enabled)
        --phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)

Required Parameters

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines whether the input file is SAM, BAM, or uncompressed BAM; you only need to supply the filename, unless the input is stdin.

Use - to read from stdin. In that case the extension determines the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file: --in yourFileName
SAM from stdin: --in -
BAM from stdin: --in -.bam
Uncompressed BAM from stdin: --in -.ubam

Note: Uncompressed BAM is compressed using compression level 0 (so it is not an entirely uncompressed file). This matches the samtools implementation, so pipes between our tools and samtools are supported.

Output File (--out)

Use --out followed by your file name to specify the SAM/BAM output file.

The file extension determines whether SAM, BAM, or uncompressed BAM is written. Use - to write to stdout; the extension after it selects the file type (no extension is SAM).

SAM to file: --out yourFileName.sam
BAM to file: --out yourFileName.bam
Uncompressed BAM to file: --out yourFileName.ubam
SAM to stdout: --out -
BAM to stdout: --out -.bam
Uncompressed BAM to stdout: --out -.ubam

Note: Uncompressed BAM is compressed using compression level 0 (so it is not an entirely uncompressed file). This matches the samtools implementation, so pipes between our tools and samtools are supported.
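
As an illustration only, the extension rules for --out can be sketched in Python. The function name output_format is hypothetical and not part of bamUtil, and treating every name without a .bam or .ubam extension as SAM is an assumption extrapolated from the "no extension is SAM" rule for stdout.

```python
def output_format(out_arg):
    """Return (destination, file type) for a --out value.

    Hypothetical helper that mirrors the documented extension rules;
    it is not bamUtil code.
    """
    # A leading "-" (alone or followed by an extension) means stdout.
    if out_arg == "-" or out_arg.startswith("-."):
        destination = "stdout"
    else:
        destination = "file"

    if out_arg.endswith(".ubam"):
        file_type = "uncompressed BAM"
    elif out_arg.endswith(".bam"):
        file_type = "BAM"
    else:
        # Assumption: anything without a .bam/.ubam extension is SAM.
        file_type = "SAM"
    return destination, file_type
```

For example, output_format("-.ubam") yields ("stdout", "uncompressed BAM"), matching the last row of the list above.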

Optional Parameters

Verbose (--verbose)

Use --verbose to turn on verbose mode.

Specify Log Filename (--log)

Use --log followed by a filename to specify where the log is written. The default is the output file basename with a .log extension.

Add the HD Header (--HD)

Use --HD followed by the HD header line to add an HD header. Be sure to include "@HD" in the line you specify.

Add the RG Header (--RG)

Use --RG followed by the RG header line to add an RG header. Be sure to include "@RG" in the line you specify.
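
The Example section of this page passes tab-separated header lines and shows every output record gaining an RG:Z: tag whose value equals the ID of the added @RG line. The Python sketch below only illustrates that relationship; build_header_line and read_group_tag are hypothetical helper names, not bamUtil functions.

```python
def build_header_line(record_type, fields):
    """Join a header type such as 'RG' and (tag, value) pairs with tabs."""
    parts = ["@" + record_type]
    parts.extend("{}:{}".format(tag, value) for tag, value in fields)
    return "\t".join(parts)


def read_group_tag(rg_line):
    """Return the per-record tag implied by the ID field of an @RG line."""
    for field in rg_line.split("\t")[1:]:
        if field.startswith("ID:"):
            # Keep everything after "ID:", including any further colons.
            return "RG:Z:" + field[3:]
    raise ValueError("@RG line has no ID field")


# Values taken from the example command on this page.
rg_line = build_header_line(
    "RG",
    [("ID", "UM0037:1"), ("SM", "Sample2"), ("LB", "lb2"), ("PL", "ILLUMINA")],
)
record_tag = read_group_tag(rg_line)
```

The resulting rg_line is the kind of string that would be quoted and passed after --RG, and record_tag corresponds to the RG:Z:UM0037:1 field seen on each record in the example output.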

Add the PG Header (--PG)

Use --PG followed by the PG header line to add a PG header. Be sure to include "@PG" in the line you specify.

Add MD5 and UR tags to SQ Headers (--fasta)

Use --fasta followed by the FASTA reference file name to compute MD5sums and update the SQ header lines with the M5 and UR tags. Use the --UR option to specify a different UR value.
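
To make the M5 value more concrete, here is a minimal Python sketch that computes a per-sequence length and MD5 checksum from FASTA text. It follows the SAM specification's description of M5 (MD5 of the uppercase sequence with whitespace removed); that polishBam normalizes sequences in exactly this way is an assumption, and fasta_sq_values is a hypothetical name. The length is included because --checkSQ compares the SN and LN values of existing SQ lines against the reference.

```python
import hashlib


def fasta_sq_values(fasta_text):
    """Map each sequence name to (length, md5 hex digest) from FASTA text."""
    chunks_by_name = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first word after '>'.
            name = line[1:].split()[0]
            chunks_by_name[name] = []
        elif name is not None:
            # Upper-case and drop embedded whitespace before hashing.
            chunks_by_name[name].append("".join(line.split()).upper())

    values = {}
    for seq_name, chunks in chunks_by_name.items():
        sequence = "".join(chunks)
        digest = hashlib.md5(sequence.encode("ascii")).hexdigest()
        values[seq_name] = (len(sequence), digest)
    return values


# Tiny made-up reference, not the testFasta.fa used in the example.
example = fasta_sq_values(">1 first sequence\nacgt\nACGT\n>2\nNNNN\n")
```

Each entry of example holds the values that would appear as LN and M5 on the matching @SQ line under these assumptions.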

Add the AS tag to SQ Headers (--AS)

Use --AS followed by the genome assembly identifier to add the AS tag to the SQ Headers.

Add the UR tag to SQ Headers (--UR)

Use --UR followed by the URI of the sequence to add the UR tag to the SQ Headers.

The UR tag is added automatically by the --fasta option, so when --fasta is used, --UR only needs to be specified if the UR value should differ from the --fasta value.

Add the SP tag to SQ Headers (--SP)

Use --SP followed by the species to add the SP tag to the SQ Headers.

PhoneHome Parameters

See PhoneHome for more information on how PhoneHome works and what it does.

Turn off PhoneHome (--noPhoneHome)

Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.

Adjust the Frequency of PhoneHome (--phoneHomeThinning)

Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).

  • By default, --phoneHomeThinning is set to 50, running 50% of the time.
  • PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
  • N/A if --noPhoneHome is set.
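
The thinning rule in the list above can be sketched in Python as follows. This is an illustration of the documented modulo test only; should_phone_home is a hypothetical name, and how bamUtil actually draws its random number is not stated on this page, so random.randrange is an assumption.

```python
import random


def should_phone_home(thinning=50, no_phone_home=False, random_number=None):
    """Decide whether PhoneHome would run under the documented rule."""
    if no_phone_home:
        # --noPhoneHome disables PhoneHome regardless of thinning.
        return False
    if random_number is None:
        # Assumption: any non-negative random integer will do here.
        random_number = random.randrange(2 ** 31)
    # PhoneHome runs only if the number modulo 100 is below the threshold.
    return random_number % 100 < thinning
```

With the default of 50, a run whose random number is 149 would phone home (149 modulo 100 is 49), while one with 150 would not.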

Return Value

Returns 0 on success, non-0 on failure.
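
When polishBam is called from a script, the return value can be checked before the output is used. The Python sketch below assumes the executable is reachable as ./bam, as in the usage line; polish_bam_command and run_polish_bam are hypothetical wrapper names, and only options documented on this page should be passed through extra_options.

```python
import subprocess


def polish_bam_command(in_file, out_file, extra_options=(), executable="./bam"):
    """Build the argument list for a polishBam call."""
    command = [executable, "polishBam", "--in", in_file, "--out", out_file]
    command.extend(extra_options)
    return command


def run_polish_bam(in_file, out_file, extra_options=()):
    """Run polishBam and raise if it reports failure (non-0 return value)."""
    command = polish_bam_command(in_file, out_file, extra_options)
    completed = subprocess.run(command)
    if completed.returncode != 0:
        raise RuntimeError(
            "polishBam failed with return value {}".format(completed.returncode)
        )
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting problems with the tab-separated header lines given to --HD, --RG, and --PG.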

Example

Command:

./bam polishBam  --in testFiles/sortedSam.sam --out results/updatedSam.sam --log results/updated.log --checkSQ --fasta testFiles/testFasta.fa --AS my37 --UR testFasta.fa --RG "@RG	ID:UM0037:1	SM:Sample2	LB:lb2	PU:mypu	CN:UMCORE	DT:2010-11-01	PL:ILLUMINA" --PG "@PG	ID:polish	VN:0.0.1" --SP new --HD "@HD	VN:1.0	SO:coordinate	GO:none"

Input File:

@SQ	SN:1	LN:2004
@SQ	SN:2	LN:2000
@SQ	SN:3	LN:2005
@SQ	SN:4	LN:2040
@SQ	SN:5	LN:2006
@RG	ID:myID	LB:library	SM:sample
@RG	ID:myID2	SM:sample2	LB:library2
@CO	Comment 1
@CO	Comment 2
18:462+29M5I3M:F:295	97	1	75	0	5M	18	757	0	ACGTN	;>>>>	AM:i:0	MD:Z:30A0C5	NM:i:2	XT:A:R
18:462+29M5I3M:F:295	97	1	75	0	*	18	757	0	*	*	AM:i:0
1:1011:F:255+17M15D20M	73	1	1011	0	5M2D	=	1011	0	CCGAA	6>6+4	AM:i:0	MD:Z:37	NM:i:0	XT:A:R
1:1011:F:255+17M15D20M	133	1	1012	0	*	=	1011	0	CTGT	>>9>
18:462+29M5I3M:F:296	97	1	1751	0	3S2H5M	18	757	0	TGCACGTN	453;>>>>
18:462+29M5I3M:F:295	97	2	75	0	5M	18	757	0	ACGTN	*	AM:i:0	MD:Z:30A0C5	NM:i:2	XT:A:R
18:462+29M5I3M:F:297	97	2	1751	0	3S5M1S3H	18	757	0	TGCACGTNG	453;>>>>5
18:462+29M5I3M:F:298	97	3	75	0	3S5M4H	18	757	0	TGCACGTN	453;>>>>
Y:16597235+13M13I11M:F:181	141	*	0	0	*	*	0	0	AACT	==;;
Y:16597235+13M13I11M:F:181	141	*	0	0	*	*	0	0	*	*


Output File:

@SQ	SN:1	LN:2004	AS:my37	M5:a9cfe5b8c11aa0cc2c0d2bf3602c9804	UR:testFasta.fa	SP:new
@SQ	SN:2	LN:2000	AS:my37	M5:7c342606b54aa211a50f5f63ac1cb2eb	UR:testFasta.fa	SP:new
@SQ	SN:3	LN:2005	AS:my37	M5:c30e547093f33de240b164a4a2ebe3b5	UR:testFasta.fa	SP:new
@SQ	SN:4	LN:2040	AS:my37	M5:fc4c559e9da51e93e7875031ddf65f2a	UR:testFasta.fa	SP:new
@SQ	SN:5	LN:2006	AS:my37	M5:c876194283debb8b507ebd0f82309ec4	UR:testFasta.fa	SP:new
@RG	ID:myID	LB:library	SM:sample
@RG	ID:myID2	SM:sample2	LB:library2
@HD	VN:1.0	SO:coordinate	GO:none
@RG	ID:UM0037:1	SM:Sample2	LB:lb2	PU:mypu	CN:UMCORE	DT:2010-11-01	PL:ILLUMINA
@PG	ID:polish	VN:0.0.1
@CO	Comment 1
@CO	Comment 2
18:462+29M5I3M:F:295	97	1	75	0	5M	18	757	0	ACGTN	;>>>>	AM:i:0	MD:Z:30A0C5	NM:i:2	RG:Z:UM0037:1	XT:A:R
18:462+29M5I3M:F:295	97	1	75	0	*	18	757	0	*	*	AM:i:0	RG:Z:UM0037:1
1:1011:F:255+17M15D20M	73	1	1011	0	5M2D	=	1011	0	CCGAA	6>6+4	AM:i:0	MD:Z:37	NM:i:0	RG:Z:UM0037:1	XT:A:R
1:1011:F:255+17M15D20M	133	1	1012	0	*	=	1011	0	CTGT	>>9>	RG:Z:UM0037:1
18:462+29M5I3M:F:296	97	1	1751	0	3S2H5M	18	757	0	TGCACGTN	453;>>>>	RG:Z:UM0037:1
18:462+29M5I3M:F:295	97	2	75	0	5M	18	757	0	ACGTN	*	AM:i:0	MD:Z:30A0C5	NM:i:2	RG:Z:UM0037:1	XT:A:R
18:462+29M5I3M:F:297	97	2	1751	0	3S5M1S3H	18	757	0	TGCACGTNG	453;>>>>5	RG:Z:UM0037:1
18:462+29M5I3M:F:298	97	3	75	0	3S5M4H	18	757	0	TGCACGTN	453;>>>>	RG:Z:UM0037:1
Y:16597235+13M13I11M:F:181	141	*	0	0	*	*	0	0	AACT	==;;	RG:Z:UM0037:1
Y:16597235+13M13I11M:F:181	141	*	0	0	*	*	0	0	*	*	RG:Z:UM0037:1

Output:

in	testFiles/sortedSam.sam
out	results/updatedSam.sam
log	results/updated.log
checkSQ

Log File:

Arguments in effect:
	--in [testFiles/sortedSam.sam]
	--out [results/updatedSam.sam]
	--log [results/updated.log]
	--fasta [testFiles/testFasta.fa]
	--AS [my37]
	--UR [testFasta.fa]
	--SP [new]
	--checkSQ [ON]
	--HD [@HD	VN:1.0	SO:coordinate	GO:none]
	--RG [@RG	ID:UM0037:1	SM:Sample2	LB:lb2	PU:mypu	CN:UMCORE	DT:2010-11-01	PL:ILLUMINA]
	--PG [@PG	ID:polish	VN:0.0.1]
Reading the reference file testFiles/testFasta.fa
Finished reading the reference file testFiles/testFasta.fa
Finished checking the consistency of SQ tags
Creating the header of new output file
Adding 1 HD, 1 RG, and 1 PG headers
Finished writing output headers
Writing output BAM file
Successfully written 10 records