BamUtil: gapInfo

From Genome Analysis Wiki

Overview of the gapInfo function of bamUtil

The gapInfo option of bamUtil prints information on the gap between read pairs in a SAM/BAM file.

gapInfo can be run in two modes: standard and detailed. To run in detailed mode, use the --detailed option.

Standard output prints the number of pairs that have a given gap size.

The gap size is the number of bases between the clipped end of the first read and the clipped start of the second read: (mate0BasedClippedStart - 0BasedPositionClippedEnd - 1).

The gap size will be negative if the reads overlap.
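
As a worked illustration of the formula above, here is a minimal Python sketch. The function name and the sample positions are made up for this example; only the arithmetic comes from the description above.

```python
def gap_size(clipped_end_0based: int, mate_clipped_start_0based: int) -> int:
    """Number of bases strictly between the end of this read and the start of its mate."""
    return mate_clipped_start_0based - clipped_end_0based - 1


# Read ends at 0-based position 99 and the mate starts at 150:
# positions 100..149 lie between them.
print(gap_size(99, 150))  # 50

# Mate starts immediately after the read ends: no gap.
print(gap_size(99, 100))  # 0

# Mate starts at 80 while the read ends at 99: the reads overlap by 20 bases.
print(gap_size(99, 80))   # -20
```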


gapInfo skips any records that are marked in the flag as:

  • unmapped
  • not paired
  • mate is unmapped
  • secondary alignment (not primary alignment)
  • supplementary alignment
  • duplicates
  • QC Failure
  • mate is on a different chromosome
  • chromosome is unknown (-1/*)
  • mate starts before this record
  • mate starts at the same location as this record & this record is the reverse strand
  • reverse strands (unless --detailed is specified)
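
The skip rules above can be sketched as a single predicate. This is an illustrative Python helper, not bamUtil's code: the function name, its arguments, and the choice of which positions to compare are assumptions, while the flag bit values are those defined in the SAM specification.

```python
# SAM flag bits, as defined in the SAM specification.
PAIRED, UNMAPPED, MATE_UNMAPPED, REVERSE = 0x1, 0x4, 0x8, 0x10
SECONDARY, QC_FAIL, DUPLICATE, SUPPLEMENTARY = 0x100, 0x200, 0x400, 0x800


def should_skip(flag, ref_id, mate_ref_id, pos, mate_pos, detailed=False):
    """Return True if a record would be skipped under the rules listed above.

    ref_id / mate_ref_id are reference indexes (-1 means unknown, i.e. '*');
    pos / mate_pos are the start positions of the record and its mate.
    Hypothetical helper for illustration only.
    """
    if not flag & PAIRED:
        return True
    if flag & (UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY | DUPLICATE | QC_FAIL):
        return True
    if ref_id == -1 or ref_id != mate_ref_id:
        return True  # unknown chromosome, or mate on a different chromosome
    if mate_pos < pos:
        return True  # mate starts before this record
    if mate_pos == pos and flag & REVERSE:
        return True  # same start position and this record is on the reverse strand
    if flag & REVERSE and not detailed:
        return True  # reverse-strand records are kept only with --detailed
    return False


print(should_skip(99, 0, 0, 100, 200))                   # forward read of a pair
print(should_skip(147, 0, 0, 100, 200))                  # reverse read, standard mode
print(should_skip(147, 0, 0, 100, 200, detailed=True))   # reverse read, detailed mode
```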

When --refFile is specified and --detailed is not specified, gaps that contain reference base 'N' are skipped.


./bam gapInfo --in <inputFile> --out <outputFile> [--noeof] [--params]


	Required Parameters:
		--in          : the SAM/BAM file to print read pair gap info for
		--out         : the output file to be written
	Optional Parameters:
		--refFile     : reference file, used to skip gaps that include reference base 'N' (for runs without --detailed)
		--detailed    : Print the details for each read pair
	Optional Parameters for the Detailed Option:
		--checkFirst  : Check the first in pair flag and print "NotFirst" if it isn't first
		--checkStrand : Check the strand flag and print "Reverse" if it is reverse complemented
		--noeof       : Do not expect an EOF block on a bam file.
		--params      : Print the parameter settings to stderr
		--noPhoneHome       : disable PhoneHome (default enabled)
		--phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)
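
When gapInfo is driven from a script, the options above can be assembled into an argument list. The following Python sketch only builds the list and does not run anything. The helper name, the file names, and the ./bam executable path are placeholders, and the check that rejects --checkFirst/--checkStrand without --detailed is this helper's own choice rather than documented bamUtil behavior.

```python
def build_gapinfo_args(in_file, out_file, ref_file=None, detailed=False,
                       check_first=False, check_strand=False,
                       noeof=False, params=False, bam_exe="./bam"):
    """Assemble an argument list for `bam gapInfo` from the options listed above."""
    if (check_first or check_strand) and not detailed:
        # Helper-level validation only; the usage text just says these
        # options belong to the detailed mode.
        raise ValueError("--checkFirst/--checkStrand only apply with --detailed")

    args = [bam_exe, "gapInfo", "--in", in_file, "--out", out_file]
    if ref_file is not None:
        args += ["--refFile", ref_file]
    switches = [
        (detailed, "--detailed"),
        (check_first, "--checkFirst"),
        (check_strand, "--checkStrand"),
        (noeof, "--noeof"),
        (params, "--params"),
    ]
    args += [option for enabled, option in switches if enabled]
    return args


# Hypothetical file names, used only to show the resulting command line.
print(" ".join(build_gapinfo_args("sample.bam", "gaps.txt",
                                  detailed=True, check_strand=True)))
```

The returned list can be passed to subprocess.run(args, check=True) on a machine where bamUtil is installed.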

Required Parameters

Input File (--in)

Use --in followed by your file name to specify the SAM/BAM input file.

The program automatically determines if your input file is SAM/BAM/uncompressed BAM without any input other than a filename from the user, unless your input file is stdin.

Use - to read from stdin; the extension that follows it determines the file type (no extension indicates SAM).

SAM/BAM/Uncompressed BAM from file	--in yourFileName
SAM from stdin	--in -
BAM from stdin	--in -.bam
Uncompressed BAM from stdin	--in -.ubam

Note: Uncompressed BAM is compressed using compression level-0 (so it is not an entirely uncompressed file). This matches the samtools implementation so pipes between our tools and samtools are supported.

Output File (--out)

Use --out followed by a file name to specify the output file to write.

The Standard Output prints a 2-column (separated by tabs) line for each gapSize found in the SAM/BAM file. The first column contains the gap size and the 2nd column contains the number of pairs that have that gap size. The first line is a header line describing the columns.
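
A minimal Python sketch for reading this two-column format follows. The function name is made up for the example, and the sample text reuses the rows shown under Example Output.

```python
import csv
import io


def read_gap_counts(handle):
    """Return {gap_size: num_pairs} from gapInfo standard output."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line describing the two columns
    return {int(row[0]): int(row[1]) for row in reader if row}


sample = io.StringIO(
    "GapSize\tNumPairs\n"
    "-23\t3\n-21\t3\n-20\t4\n-5\t1\n30\t1\n70\t3\n"
)
counts = read_gap_counts(sample)

total_pairs = sum(counts.values())
overlapping_pairs = sum(n for gap, n in counts.items() if gap < 0)
print(total_pairs, overlapping_pairs)  # 15 11
```

For a real run, open the file given to --out and pass the file handle instead of the in-memory sample.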

Detailed output does not have a header line and is described below under the --detailed parameter.

Optional Parameters

Reference File (--refFile)

Use --refFile followed by the reference file name to specify the reference sequence file.

When this option is specified, a pair is not counted for its gap size if any of the reference bases in the gap is an 'N'. (Not applicable if --detailed is specified.)
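
The check can be pictured with a short Python sketch. It assumes the reference sequence for the chromosome is already available as a string and uses the same 0-based clipped positions as the gap size formula. The function name and the toy sequence are invented for illustration, and whether bamUtil also treats a lowercase 'n' as a match is not covered here.

```python
def gap_has_n(reference: str, clipped_end_0based: int,
              mate_clipped_start_0based: int) -> bool:
    """True if any reference base strictly between the two reads is 'N'."""
    gap = reference[clipped_end_0based + 1:mate_clipped_start_0based]
    return "N" in gap


ref = "ACGTACGTNNACGTACGT"  # toy reference; positions 8 and 9 are 'N'

print(gap_has_n(ref, 5, 12))  # gap covers positions 6..11: True
print(gap_has_n(ref, 9, 14))  # gap covers positions 10..13: False
print(gap_has_n(ref, 9, 4))   # overlapping reads, empty gap: False
```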

Print Detailed Per-Pair Information (--detailed)

With this option, for every record processed per the above rules, the following information is printed on a line as tab separated columns:

  • Reference/Chromosome Name
  • 1-based read end position (clipped)
  • gap size

Additional columns are printed if --checkFirst and/or --checkStrand are specified.

Detailed output does not have a header line.

See Optional Parameters for --detailed for additional options related to --detailed.

Do not require BGZF EOF block (--noeof)

Use --noeof if you do not expect a trailing EOF block in your BGZF file.

By default, the trailing empty block is expected and checked for.

Print the Program Parameters (--params)

Use --params to print the parameters for your program to stderr.

Optional Parameters for --detailed

Check First (--checkFirst)

Only applicable if --detailed is also provided.

When specified along with --detailed, the output for each record processed also includes "NotFirst" if it is not marked as FirstFragment in the flags.

Check Strand (--checkStrand)

Only applicable if --detailed is also provided.

When specified along with --detailed, the output for each record processed also includes "Reverse" if it is marked as the reverse strand in the flags.

PhoneHome Parameters

See PhoneHome for more information on how PhoneHome works and what it does.

Turn off PhoneHome (--noPhoneHome)

Use the --noPhoneHome option to completely disable PhoneHome. PhoneHome is enabled by default based on the thinning parameter.

Adjust the Frequency of PhoneHome (--phoneHomeThinning)

Use --phoneHomeThinning to modify the percentage of the time that PhoneHome will run (0-100).

  • By default, --phoneHomeThinning is set to 50, running 50% of the time.
  • PhoneHome will only occur if the run's random number modulo 100 is less than the --phoneHomeThinning value.
  • N/A if --noPhoneHome is set.
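
The thinning rule can be written out in a few lines of Python. This is only a restatement of the modulo test described above; the function name is made up, and how bamUtil draws its random number is not shown.

```python
def phone_home_runs(random_number: int, thinning: int = 50) -> bool:
    """Apply the rule above: run only if random_number modulo 100 is below thinning."""
    return random_number % 100 < thinning


# Count how many of the 100 possible remainders trigger PhoneHome.
for thinning in (0, 50, 100):
    hits = sum(phone_home_runs(r, thinning) for r in range(100))
    print(thinning, hits)
```

With a thinning value of 0 the test never passes, and with 100 it always passes.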

Return Value

Returns -1 if input parameters are invalid.

Returns the SamStatus for the reads/writes (0 on success, non-0 on failure).


All status messages are written to stderr.

Output is written as tab-delimited columns, as described above.

Example Output

For standard output:

GapSize	NumPairs
-23	3
-21	3
-20	4
-5	1
30	1
70	3

For detailed output with both --checkFirst & --checkStrand specified:

1	28	70
1	10028	71	NotFirst	Reverse
1	10028	70
1	10028	70
1	10028	30
1	10028	-19	NotFirst	Reverse
1	10028	-19	NotFirst	Reverse
1	10028	-19	NotFirst	Reverse
1	10030	-21
1	10030	-20
1	10030	-20
1	10030	-21
1	10030	-21
1	10030	-20
1	10030	-20
2	32	-18	NotFirst	Reverse
4	24	-23
4	27	-23
4	30	-23
4	34	-5
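
The detailed rows above can be parsed and tallied with a short Python sketch. The function name is made up, and any trailing columns are treated as a set of markers because the exact column layout when only one marker applies is not documented here. For this sample, counting only the rows without a "Reverse" marker gives the same numbers as the standard output example; that is an observation about these sample rows, not a statement about bamUtil internals.

```python
from collections import Counter

DETAILED = """\
1\t28\t70
1\t10028\t71\tNotFirst\tReverse
1\t10028\t70
1\t10028\t70
1\t10028\t30
1\t10028\t-19\tNotFirst\tReverse
1\t10028\t-19\tNotFirst\tReverse
1\t10028\t-19\tNotFirst\tReverse
1\t10030\t-21
1\t10030\t-20
1\t10030\t-20
1\t10030\t-21
1\t10030\t-21
1\t10030\t-20
1\t10030\t-20
2\t32\t-18\tNotFirst\tReverse
4\t24\t-23
4\t27\t-23
4\t30\t-23
4\t34\t-5
"""


def parse_detailed(line):
    """Split one detailed line into (reference, end position, gap size, markers)."""
    ref, end_pos, gap, *markers = line.split("\t")
    return ref, int(end_pos), int(gap), set(markers)


rows = [parse_detailed(line) for line in DETAILED.splitlines() if line]

# Tally gap sizes for rows that are not marked as reverse strand.
forward_counts = Counter(gap for _, _, gap, markers in rows
                         if "Reverse" not in markers)

print("GapSize\tNumPairs")
for gap in sorted(forward_counts):
    print(f"{gap}\t{forward_counts[gap]}")
```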