Bam read count

From Genome Analysis Wiki
Jump to: navigation, search



NOTE: The current version works only for single end reads. If the input bam file contains paired end sequences, reads from the same fragment will be counted independently

  • The tools calculates the read count for each region in the input list of regions from a BAM file, and also outputs the normalized read count as Read Per Million Mapped Reads per Kilobases (RPKM). To correct for the bias of the read count due to GC bias, it will also output the GC content of each region along with the total reads mapped to the corresponding GC content bins. By providing this statistic, the extent of GC bias can be investigated and if excessive bias is observed this statistic can be used to correct the global GC bias.
  • This tool was initially developed for RNA-seq data to quantify gene/exon expression but can also be used for other purposes as well.


A command without any input will display the basic usage

The following parameters are available.  Ones with "[]" are in effect:
      Input Files : --reference [/data/local/ref/karma.ref/human.g1k.v37],
                    --regions [/home/bingshan/data/db/refFlat.tbl/refFlat.exons.v37.tbl]
  Mapping filters : --min_overlap [1], --min_map_quality
    Normalization : --norm_by_mapped2target
      Output file : --exon_out [], --gene_out [], --uniq

An example command for quantify RNAseq gene/exon expression looks like the following:

./readCount --reference hg19.fa --regions refFalt.exon.hg19 --min_overlap 5 --uniq --exon_out exon.readcount --gene_out gene.readcount input.bam

The following example generate normalized read counts using the number of reads that are only mapped to targets:

./readCount --reference hg19.fa --regions refFalt.exon.hg19 --min_overlap 5 --uniq --norm_by_mapped2target --exon_out exon.readcount --gene_out gene.readcount input.bam

Input files

Required input files are --reference, --regions and input.bam

  • An example --regions file is a BED file and looks like the following:
1  1000  2000  GENE1
1  5000  6000  GENE1
1  8000  9000  GENE1
2  1000  2000  GENE2
2  5000  6000  GENE2
2  8000  9000  GENE2

Other options

Some of command line options are explained below and others are self-explanatory.

--min_overlap : minimum number of bases overlapping an input region to consider that the read is in the region.
--min_map_quality : reads with map quality below this number will be ignored
--uniq : If a read is mapped to multiple regions (e.g. exons) this read will be counted only once toward the read count in the gene
--norm_by_mapped2target : The RPM and RPKM will be normalized by reads mapped only to target regions. Default is to use all mapped reads in a bam file.
--exon_cout : read count for each exon. A read may be counted to multiple exons if this read is mapped to multiple exons.
--gene_cout : read count for each gene.

Output files

  • --exon_out
##Mapped2targets reads: 19220055
#Chr    Start   End     Len     ReadCnt RPM     RPKM    %GC     nReadGCBin
12      53701240        53701497        AAAS    257     55      2.862   11.135  56.42   581629
12      53701628        53701713        AAAS    85      16      0.832   9.794   60.00   324929
12      53701835        53701917        AAAS    82      22      1.145   13.959  51.22   611596
  • --gene_out
##Mapped2targets reads: 19220055
#Gene   Len     ReadCnt RPM     RPKM    %GC     nReadGCBin
AAAS:12 1836    360     18.730  10.202  57.52   389925
AAA1:7  6184    0       0.000   0.000   37.18   289310
AACS:12 3322    700     36.420  10.963  55.72   835010
AACSL:5 2171    15      0.780   0.359   49.65   443653
AADAC:3 1607    172     8.949   5.569   38.39   468603


The latest version of source code v0.01 can be downloaded here. To compile, cd readcount.0.01/ and type make. The executable is under bin/ directory.


For questions please contact the authors (Bingshan Li: