Difference between revisions of "QPLOT"
Line 104: | Line 104: | ||
*Region list | *Region list | ||
− | If the interest of qplot is a list of regions, e.g. exons, this can be achieved by providing a list of regions. The regions should be in the form of "chr start end label" each line in the file. In order for this option to work, within each chromosome (contig) the regions have to be sorted by starting position, and also the input bam files have to be sorted. | + | If the interest of qplot is a list of regions, e.g. exons, this can be achieved by providing a list of regions. The regions should be in the form of "chr start end label" each line in the file (NOTE: ''start'' and ''end'' position are inclusive and they follow the convention of [http://genome.ucsc.edu/FAQ/FAQformat#format1 BED file]). |
+ | In order for this option to work, within each chromosome (contig) the regions have to be sorted by starting position, and also the input bam files have to be sorted. | ||
+ | For example, you can create a text file, region.txt like following: | ||
+ | |||
1 100 500 region_A | 1 100 500 region_A | ||
1 600 800 region_B | 1 600 800 region_B | ||
2 100 300 region_C | 2 100 300 region_C | ||
− | . | + | |
+ | Then specifying <code> --regions region.txt</code> enables qplot calculate various statistics out of sequenced bases only from this region. | ||
Qplot also provides the <code>--invertRegion</code> option. Enabling this option tells qplot to operate on those sequence bases that are outside the given region. | Qplot also provides the <code>--invertRegion</code> option. Enabling this option tells qplot to operate on those sequence bases that are outside the given region. |
Revision as of 08:57, 15 February 2012
Introduction
The qplot program calculates various summary statistics some of which are plotted in a PDF file. These statistics can be used to assess the sequencing quality of Illumina sequence reads mapped to the reference genome. The main statistics are empirical Phred scores which are calculated based on the background mismatch rate. Background mismatch rate is the rate that sequenced bases are different from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC biases, insert size distribution, depth distribution, genome coverage, empirical Q20 count, and so on.
In the following sections, we will guide you through: how to obtain qplot, how to use qplot, example outputs, interactive diagnostic plots, and real applications in which qplot has helped identify sequencing problems.
Where to Find It
You can obtain qplot in two ways:
(1) Download and extract pre-compiled binary as described in Binary Download.
(2) Download source code of qplot and compile it on your own machine. Please follow the instruction in Source Code Distribution on fetching source code and building instructions.
We recommend the first method since the pre-compiled binary should work out of the box.
Binary Download
We have prepared a pre-compiled qplot. You can download it from: qplot.20120213.tar.gz (File Size: 1.7G)
The executable file is under qplot/bin/qplot.
In addition, we provided the necessary input files under qplot/data/ (NCBI human genome build v37, dbSNP 130, and pre-computed GC file with windows size 100).
You can also find an example BAM input file under qplot/example/chrom20.9M.10M.bam. It is taken from the 1000 Genome Project with sequencing reads aligned to chromosome 20 positions 8M to 9M.
Source Code Distribution
The qplot repository is available both via release downloads and via github.
On github, https://github.com/statgen/qplot, you can both browse and download the qplot source code as well as explore the history of changes.
You can obtain the source either with or without git.
The releases may be available both with and without libStatGen included.
If you do not use the release version that already contains libStatGen, you need to download the library: libStatGen.
If you try to compile qplot and it cannot find libStatGen, it will fail and provide instructions of what to do next:
- if libStatGen is in a different location then expected
- follow the directions to set the path to libStatGen
- if libStatGen is not downloaded and you have git
make libStatGen
will download via git and build libStatGen
- if libStatGen is not downloaded and you don't have git
- See libStatGen
Using Git To Track the Current Development Version
Clone (get your own copy)
You can create your own git clone (copy) using:
git clone https://github.com/statgen/qplot.git
or
git clone git://github.com/statgen/qplot.git
Either of these commands create a directory called qplot
in the current directory.
Then just cd qplot
and compile.
Get the latest Updates (update your copy)
To update your copy to the latest version (a major advantage of using git):
cd pathToYourCopy/qplot
make clean
git pull
make all
Git Refresher
If you decide to use git, but need a refresher, see How To Use Git or Notes on how to use git (if you have access)
Downloading From GitHub Without Git
If you download the latest code/version, make sure you periodically update it by downloading a newer version.
From github you can download:
- Latest Code (master branch)
- via Website
- Goto: https://github.com/statgen/qplot
- Click on the
Download ZIP
button on the right side panel.
- via Command Line
- via Website
- Specific Release (via a tag)
- via Website
- Goto: https://github.com/statgen/qplot/releases to see the available releases
- Click
zip
ortar.gz
for the desired version.
- via Command Line
wget https://github.com/statgen/qplot/archive/<tagName>.tar.gz
- or
wget https://github.com/statgen/qplot/archive/<tagName>.zip
- via Website
After downloading the file, uncompress (unzip/untar) it. The directory created will be named qplot-<name of version you downloaded>
.
Building
After obtaining the qplot repository (either by download or from github), compile the code using:
make all
Object (.o) files are compiled into the obj
directory with a subdirectory debug
and profile
for the debugging and profiling objects.
This creates the executable(s) in the qplot/bin/
directory, the debug executable(s) in the qplot/bin/debug/
directory, and the profiling executable(s) in the qplot/bin/profile/
directory.
make install
installs the opt binary if you have permission.
make test
compiles for opt, debug, and profile and runs the tests (found in the test
subdirectory).
To see all make options, type make help
.
If compilation fails due to warnings being treated as errors, please contact us so we can fix the warnings. As a work-around to get it to compile, you can disable the treatment of warnings as errors by editing libStatGen/general/Makefile to remove -Werror
.
Usage
Command line
After you obtain the qplot executable (either by compiling the source code or by downloading the pre-compiled binary file), you will find the executable file under qplot/bin/qplot.
Here is the qplot help page:
some_linux_host > qplot/bin/qplot The following parameters are available. Ones with "[]" are in effect: References : --reference [/data/local/ref/karma.ref/human.g1k.v37.umfa], --dbsnp [/home/bingshan/data/db/dbSNP/dbSNP130.UCSC.coordinates.tbl], --gccontent [/home/bingshan/data/db/gcContent/gcContent.hg37.w250.out] Create gcContent file : --create_gc [], --winsize [100] Region list : --regions [], --invertRegion Flag filters : --read1_skip, --read2_skip, --paired_skip, --unpaired_skip Dup and QCFail : --dup_keep, --qcfail_keep Mapping filters : --minMapQuality [0.00] Records to process : --first_n_record [-1] Lanes to process : --lanes [] Output files : --plot [], --stats [], --Rcode [] Plot labels : --label [], --bamLabel []
Input files
qplot runs on the input BAM/SAM file(s) specified on the command-line after all other parameters.
Additoinally, three (3) precomputed files are required.
--reference
The reference genome is the same as karma reference genome. If the index files do not exist, qplot will create the index files using the input reference fasta file.
--dbsnp
This file has two columns. First column is the chromosome name which must be consistent with the reference created above.
--gccontent
Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file. To generate the file, use the following command:
qplot --rerefence reference.fa --windowsize winsize --create_gc reference.gc
Note: Before running qplot, it is critical to check how the chromosome numbers are coded. Some bam files use just numbers, others use chr + numbers. You need to make sure that the chromosome numbers from the reference and dbSNP are consistent with the bam file.
Parameters
Some of the command line parameters are described here, but most are self explanatory.
- Flag filter
By default all reads are processed. If it is desired to check only the first read of a pair, use --read2_skip
to ignore the second read. And so on.
- Duplication and QCFail
By default reads marked as duplication and QCFail are ignored but can be retained by
--dup_keep
or
--qcfail_keep
- Records to process
The --first_n_record
option followed by a number, n, will enable qplot to read the first n reads to test the bam files and verify it works.
- Lanes to process
If the input bam files have more than one lane and only some of them need to be checked, use something like --lanes 1,3,5
to specify that only lanes 1, 3, and 5 need to be checked.
NOTE In order for this to work, the lane info has to be encoded in the read name such that lane number is the second field with the delimiter ":".
- Mapping filters
Qplot will exclude reads with lower mapping qualities than the user specified parameter, --minMapQuality
. By default, all reads will be included in analysis.
- Region list
If the interest of qplot is a list of regions, e.g. exons, this can be achieved by providing a list of regions. The regions should be in the form of "chr start end label" each line in the file (NOTE: start and end position are inclusive and they follow the convention of BED file). In order for this option to work, within each chromosome (contig) the regions have to be sorted by starting position, and also the input bam files have to be sorted. For example, you can create a text file, region.txt like following:
1 100 500 region_A 1 600 800 region_B 2 100 300 region_C
Then specifying --regions region.txt
enables qplot calculate various statistics out of sequenced bases only from this region.
Qplot also provides the --invertRegion
option. Enabling this option tells qplot to operate on those sequence bases that are outside the given region.
- Plot labels
Two kinds of labels are enabled. --label
is the label for the plot (default is empty) which is prepended to the title of each subplot. --bamLabels
followed by a column separated list of labels provides the labels for each input SAM/BAM file, e.g. sample ID (default is numbers 1, 2, ... until the number of input bam files). For example:
--label Run100 --bamLabels s1,s2,s3,s4,s5,s6,s7,s8
- Multiple threading (not officially supported)
Number of concurrent threads running for the input bam files. One bam file at a time will be processed by one thread. Therefore using a number which is dividable by the number of input bam files will make it more efficient. One extra thread requires about 375Mb more memory on top of the around 4Gb of memory used to hold reference and GC content files.
Output files
There are three (optional) output files.
--plot qa.pdf
Qplot will generate a PDF file named qa.pdf containing 2 pages each with 4 figures. If --pages 1
is specified, only page 1 is output. The plot is generated using Rscript.
--stats qa.stats
Qplot will generate a text file named qa.stats containing various summary statistics for each input BAM/SAM file.
--Rcode qa.R
Qplot will generate qa.R which is R code used for plotting the figures in qa.pdf file. If Rscript is not installed in the system, you can use the qa.R to generate the figures on other machines, or extract plotting data from each run and combine multiple runs together to generate more comprehensive plots (See Example).
Example
Qplot can generate diagnostic graphs, related R code, and summary statistics for each SAM/BAM file.
Built-in example
In the pre-compiled binary download, you will find a subdirectory named examples. We provide a sample file from the 1000 Genome project, it contains aligned reads on chromosome 20 from position 8 Mbp to 9Mbp. You can use qplot using the following commandline:
../bin/qplot --reference ../data/human.g1k.v37.umfa --dbsnp ../data/dbSNP130.UCSC.coordinates.tbl --gccontent ../data/human.g1k.w100.gc --plot qplot.pdf --stats qplot.stats --Rcode qplot.R --label "chr20:9M-10M" chrom20.9M.10M.bam
Sample outputs are listed below:
Figure: qplot.pdf
Summary statistics:
Stats\BAM chrom20.9M.10M.bam TotalReads(e6) 1.11 MappingRate(%) 97.24 MapRate_MQpass(%) 97.24 TargetMapping(%) 0.00 ZeroMapQual(%) 2.39 MapQual<10(%) 2.86 PairedReads(%) 83.76 ProperPaired(%) 71.34 MappedBases(e9) 0.04 Q20Bases(e9) 0.04 Q20BasesPct(%) 88.63 MeanDepth 42.22 GenomeCover(%) 0.03 EPS_MSE 1.81 EPS_Cycle_Mean 18.71 GCBiasMSE 0.01 ISize_mode 137 ISize_medium 184 DupRate(%) 5.90 QCFailRate(%) 0.00 BaseComp_A(%) 29.9 BaseComp_C(%) 20.1 BaseComp_G(%) 20.2 BaseComp_T(%) 29.8 BaseComp_O(%) 0.1
Gallery of examples
Here we show qplot can be applied in various sequencing scenarios. Also users can customize statistics generated by qplot to their needs.
- Whole genome sequencing with more than one lane
Figures
https://statgen.sph.umich.edu/w/images/5/53/Sardinia_Run_84_QA.pdf
Summary statistics text file
TotalReads(e6) 72.94 64.52 74.87 62.25 67.21 MappingRate(%) 97.62 97.51 97.75 97.52 97.35 MapRate_MQpass(%) 97.62 97.51 97.75 97.52 97.35 TargetMapping(%) 45.53 45.51 46.39 45.81 46.23 ZeroMapQual(%) 11.52 11.64 11.77 10.97 11.14 MapQual<10(%) 11.91 12.03 12.17 11.34 11.52 PairedReads(%) 100.00 100.00 100.00 100.00 100.00 ProperPaired(%) 96.14 96.10 96.34 95.60 95.91 MappedBases(e9) 2.11 1.87 2.20 1.82 1.97 Q20Bases(e9) 2.05 1.81 2.13 1.76 1.91 Q20BasesPct(%) 97.12 96.98 96.75 96.90 96.91 MeanDepth 35.08 31.05 36.55 30.30 32.82 GenomeCover(%) 2.10 2.10 2.10 2.09 2.10 EPS_MSE 8.89 6.88 8.50 13.32 6.86 EPS_Cycle_Mean 26.04 25.88 25.86 26.12 25.77 GCBiasMSE 0.04 0.05 0.04 0.07 0.04 ISize_mode 250 250 249 210 250 ISize_medium 271 270 270 260 270 DupRate(%) 3.50 3.79 3.27 4.51 3.56 QCFailRate(%) 0.00 0.00 0.00 0.00 0.00 BaseComp_A(%) 26.3 26.4 26.3 26.8 26.3 BaseComp_C(%) 23.7 23.6 23.7 23.2 23.7 BaseComp_G(%) 23.2 23.0 23.2 22.7 23.1 BaseComp_T(%) 26.8 27.1 26.8 27.3 26.9 BaseComp_O(%) 0.0 0.0 0.0 0.0 0.0
- Whole genome sequencing with 24-multiplexing
With a customized script, we aggregated 24 bar-coded samples in the same graph. The graph will help compare sequencing quality between samples.
- Interactive qplot
Qplot can be interactive. In the following example, you can use mouse scroll to zoom in and zoom out on each graph and pan to a certain part of the graph. By presenting qplot data on a web page, users can easily identify problematic sequencing samples. Users of qplot can customize its outputs into webpage format greatly easing the data exploring process.
Diagnose sequencing quality
Qplot is designed and implemented for the need of checking sequencing quality. Besides the example of analyzing RNA-seq data as shown in our manuscript, here we demonstrate two additional scenarios in which qplot can help identify problems after obtaining sequencing data.
- Base quality distributed abnormally
Example of qplot helping to identify wrong phred base quality
By checking the first graph "Empirical vs reported Phred score", we found reported base qualities are shifted to the right. Further we notice that effects is caused by different software from Illumina sequencers. In this particular example, '33' was incorrectly added to all base qualities. Such data used in variant calling may increase false positive SNP calling.
- Bar-coded samples
Example of qplot identifying the effect of ignoring bar-coding
By checking "Empirical phred score by cycle" (top right graph on the first page), we noticed the empirical qualities in the first several cycles are abnormally low. This phenomenon leads us to hypothesize that the first several bases have different properties. Further investigation confirmed that this sequencing was done using bar-coded DNA samples, but the analysis did not properly de-multiplex each sample.
Contact
Questions and requests should be sent to Bingshan Li (bingshan@umich.edu)