Bam read count
From Genome Analysis Wiki
Introduction
NOTE: the current version works only for single end reads
- The tools calculates the read count for each region in the input list of regions from a BAM file, and also outputs the normalized read count as Read Per Million Mapped Reads per Kilobases (RPKM). To correct for the bias of the read count due to GC bias, it will also output the GC contact of each region along with the total reads mapped to the corresponding GC content bins. By providing this statistic, the extent of GC bias can be investigated and if excessive bias is observed this statistic can be used to correct the global GC bias.
- This tool was initially developed for RNA-seq data to quantify gene/exon expression but can also be used for other purposes as well.
Usage
A command without any input will display the basic usage
readcount
The following parameters are available. Ones with "[]" are in effect: Input Files : --reference [/data/local/ref/karma.ref/human.g1k.v37], --regions [/home/bingshan/data/db/refFlat.tbl/refFlat.exons.v37.tbl] Mapping filters : --min_overlap [1], --min_map_quality Normalization : --norm_by_mapped2target Output file : --exon_out [], --gene_out [], --uniq
An example command for quantify RNAseq gene/exon expression looks like the following:
./readCount --reference hg19.fa --regions refFalt.exon.hg19 --min_overlap 5 --uniq --exon_out exon.readcount --gene_out gene.readcount input.bam
The following example generate normalized read counts using the number of reads that are only mapped to targets:
./readCount --reference hg19.fa --regions refFalt.exon.hg19 --min_overlap 5 --uniq --norm_by_mapped2target --exon_out exon.readcount --gene_out gene.readcount input.bam
Input files
Required input files are --reference, --regions and input.bam
- An example --regions file is a BED file and looks like the following:
1 1000 2000 GENE1 1 5000 6000 GENE1 1 8000 9000 GENE1 2 1000 2000 GENE2 2 5000 6000 GENE2 2 8000 9000 GENE2 ...
Other options
Some of command line options are explained below and others are self-explanatory.
--min_overlap : minimum number of bases overlapping an input region to consider that the read is in the region. --min_map_quality : reads with map quality below this number will be ignored --uniq : If a read is mapped to multiple regions (e.g. exons) this read will be counted only once toward the read count in the gene --norm_by_mapped2target : The RPM and RPKM will be normalized by reads mapped only to target regions. Default is to use all mapped reads in a bam file. --exon_cout : read count for each exon. A read may be counted to multiple exons if this read is mapped to multiple exons. --gene_cout : read count for each gene.
Output files
- --exon_out
##Mapped2targets reads: 19220055 #Chr Start End Len ReadCnt RPM RPKM %GC nReadGCBin 12 53701240 53701497 AAAS 257 55 2.862 11.135 56.42 581629 12 53701628 53701713 AAAS 85 16 0.832 9.794 60.00 324929 12 53701835 53701917 AAAS 82 22 1.145 13.959 51.22 611596 ...
- --gene_out
##Mapped2targets reads: 19220055 #Gene Len ReadCnt RPM RPKM %GC nReadGCBin AAAS:12 1836 360 18.730 10.202 57.52 389925 AAA1:7 6184 0 0.000 0.000 37.18 289310 AACS:12 3322 700 36.420 10.963 55.72 835010 AACSL:5 2171 15 0.780 0.359 49.65 443653 AADAC:3 1607 172 8.949 5.569 38.39 468603
Download
The latest version of source code v0.01 can be downloaded here. To compile, cd readcount.0.01/ and type make. The executable is under bin/ directory.
Contact
For questions please contact the authors (Bingshan Li: bingshan@umich.edu)