QPLOT

From Genome Analysis Wiki
Revision as of 12:36, 14 February 2012 by Zhanxw (talk | contribs) (Diagnose sequencing quality)
Jump to: navigation, search

Introduction

The qplot program is to calculate various summary statistics some of which will be plotted in a PDF file which can be used to assess the sequencing quality for Illumina sequencing after mapping reads to the reference genome. The main statistics are empirical Phred scores which was calculated based on the background mismatch rate. By background mismatch rate, it means the rate that sequenced bases are different from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC biases, insert size distribution, depth distribution, genome coverage, empirical Q20 count and so on.

In the following sections, we will guide through: how to obtain qplot, how to use qplot, example outputs, interactive diagnostic plots and real applications in which qplot has helped identify sequencing problems.

Where to Find It

You can obtain qplot in two ways:

(1) Download and extract pre-compiled binary as described in Binary Download.

(2) Download source code of qplot and compile it on your own machine. Please follow the instruction in Source Code Distribution on fetching source code and building instructions.

We suggest use the first method as we try to make pre-compiled binary working out of the box.

Binary Download

We have prepared pre-compiled qplot. You can download from: qplot.20120213.tar.gz (File Size: 1.7G)

The executable file is under qplot/bin/qplot.

In addition, we provided necessary inputs files (NCBI human genome build v37, dbSNP 130 and pre-computed GC file with windows size 100, they are all under qplot/data/).

You can also find example BAM input file under qplot/example/chrom20.9M.10M.bam. it is taken from 1000 Genome Project with sequencing reads aligned to chromosome 20 positioned 8M to 9M.

Source Code Distribution

The qplot repository is available both via release downloads and via github.

On github, https://github.com/statgen/qplot, you can both browse and download the qplot source code as well as explore the history of changes.

You can obtain the source either with or without git.

The releases may be available both with and without libStatGen included.

If you do not use the release version that already contains libStatGen, you need to download the library: libStatGen.

If you try to compile qplot and it cannot find libStatGen, it will fail and provide instructions of what to do next:

  • if libStatGen is in a different location then expected
    • follow the directions to set the path to libStatGen
  • if libStatGen is not downloaded and you have git
    • make libStatGen will download via git and build libStatGen
  • if libStatGen is not downloaded and you don't have git

Using Git To Track the Current Development Version

Clone (get your own copy)

You can create your own git clone (copy) using:

git clone https://github.com/statgen/qplot.git

or

git clone git://github.com/statgen/qplot.git

Either of these commands create a directory called qplot in the current directory.

Then just cd qplot and compile.

Get the latest Updates (update your copy)

To update your copy to the latest version (a major advantage of using git):

  1. cd pathToYourCopy/qplot
  2. make clean
  3. git pull
  4. make all

Git Refresher

If you decide to use git, but need a refresher, see How To Use Git or Notes on how to use git (if you have access)


Downloading From GitHub Without Git

If you download the latest code/version, make sure you periodically update it by downloading a newer version.

From github you can download:

  1. Latest Code (master branch)
    via Website
    1. Goto: https://github.com/statgen/qplot
    2. Click on the Download ZIP button on the right side panel.
    via Command Line
    wget https://github.com/statgen/qplot/archive/master.tar.gz
    or
    wget https://github.com/statgen/qplot/archive/master.zip
  2. Specific Release (via a tag)
    via Website
    1. Goto: https://github.com/statgen/qplot/releases to see the available releases
    2. Click zip or tar.gz for the desired version.
    via Command Line
    wget https://github.com/statgen/qplot/archive/<tagName>.tar.gz
    or
    wget https://github.com/statgen/qplot/archive/<tagName>.zip


After downloading the file, uncompress (unzip/untar) it. The directory created will be named qplot-<name of version you downloaded>.

Building

After obtaining the qplot repository (either by download or from github), compile the code using:

make all  

Object (.o) files are compiled into the obj directory with a subdirectory debug and profile for the debugging and profiling objects.

This creates the executable(s) in the qplot/bin/ directory, the debug executable(s) in the qplot/bin/debug/ directory, and the profiling executable(s) in the qplot/bin/profile/ directory.

make install installs the opt binary if you have permission.

make test compiles for opt, debug, and profile and runs the tests (found in the test subdirectory).

To see all make options, type make help.


If compilation fails due to warnings being treated as errors, please contact us so we can fix the warnings. As a work-around to get it to compile, you can disable the treatment of warnings as errors by editing libStatGen/general/Makefile to remove -Werror.

Usage

Command line

After you obtain qplot executable (either from compiling source codes or download pre-compiled binary file), you will find executable file under qplot/bin/qplot. Here is the qplot help page

 some_linux_host > qplot/bin/qplot

 The following parameters are available.  Ones with "[]" are in effect:

             References : --reference [/data/local/ref/karma.ref/human.g1k.v37.umfa],
                          --dbsnp [/home/bingshan/data/db/dbSNP/dbSNP130.UCSC.coordinates.tbl],
                          --gccontent [/home/bingshan/data/db/gcContent/gcContent.hg37.w250.out]
  Create gcContent file : --create_gc [], --winsize [100]
           Flag filters : --read1_skip, --read2_skip, --paired_skip,
                          --unpaired_skip
         Dup and QCFail : --dup_keep, --qcfail_keep
        Mapping filters : --minMapQuality [0.00]
     Records to process : --first_n_record [-1]
       Lanes to process : --lanes []
           Output files : --plot [], --stats [], --Rcode []
            Plot labels : --label [], --bamLabel []

Input files

Three (3) precomputed files are required. Multiple bam/sam files should be appended after all other parameters.

  • --reference

The reference genome is the same as karma reference genome. If the index files do not exist, qplot will create the index files using the input reference fasta file.

  • --dbsnp

This file has two columns. First column is the chromosome name wich have to be consistent with the reference created above.

  • --gccontent

Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file. To generate the file, use the following command

qplot --rerefence reference.fa --windowsize winsize --create_gc reference.gc

Note: Before running the qplot, it is critical to check how the chromosome numbers are coded. Some bam files use just numbers, others use chr + numbers. You need to make sure that the chromosome numbers from reference and dbSNP are consistent with the bam file.

Parameters

Most of command line parameters are self explanatory and some of them are described here

  • Flag filter

By default all reads are processed. If it is desired to check only the first read of a pair, use --read2_skip to ignore the second read. And so on.

  • Duplication and QCFail

By default reads marked as duplication and QCFail are ignored but can be retained by

--dup_keep 

or

--qcfail_keep
  • Records to process

This option will enable qplot to read the first n reads to test the bam files and check whether it works.

  • Lanes to process

If the input bam files have more than one lane and only some of them need to be checked, they can be specified by --lanes 1,3,5 whatever the number of lanes needed.

NOTE In order for this to work, the lane info has to be encoded in the read name such that lane number is the second field with the delimit of ":".

  • Mapping filters

Qplot will exclude reads with lower mapping qualities than user specified parameter. By default, all reads will be included in analysis.

  • Region list

If the interest of qplot is a list of regions, e.g. exons, this can be achieved by providing a list of region. The regions should be in the form of "chr start end label" each line in the file. In order for this option to work, within each chromosome (contig) the regions have to be sorted by starting position, and also the input bam files have to be sorted.

1 100 500 region_A
1 600 800 region_B
2 100 300 region_C
...

Qplot also provide --invertRegion option. Enabling this option will let qplot calculate those sequence bases that are off the given region.


  • Plot labels

Two kinds of labels are enabled. First one is the label for the plot (default is empty), e.g. label on the title of each subplot. Second one is a set of labels for each input bam files, e.g. sample ID (default is numbers 1, 2, ... until the number of input bam files. For example:

--label Run100 --bamLabels s1,s2,s3,s4,s5,s6,s7,s8
  • Multiple threading (not officially supported)

Number of concurrent threads running for the input bam files. One bam file will be processed by one thread. Therefore using a number which is dividable by the number of input bam files will make it efficient. One extra thread requires memory about 375Mb on top of around 4Gb memory used to hold reference and GC content file.

Output files

There are three (optional) output files.

  • --plot qa.pdf

Qplot will generate a PDF file named qa.pdf containing 2 pages each with 4 figures. If --pages 1 is specified, only page 1 is output. The plot is generated using Rscript.

  • --stats qa.stats

Qplot will generate a text file names qa.stats containing various summary statistics for each input bam/sam file.

  • --Rcode qa.R

Qplot will generate qa.R which is R code used for plotting the figures in qa.pdf file. If Rscript is not installed in the system, you can use the qa.R to generate the figures in other machines, or extract plotting data from each run and combine multiple runs together to generate more comprehensive plots (See Example).

Example

Qplot can generate diagnostic graphs, related R code and summary statistics for each sam/bam files.

Build-in example

In pre-compiled binary file, you will find a subdirectory named examples. We provide a sample file from 1000 Genome project, it contained aligned read on chromosome 20 from position 8 Mbp to 9Mbp. You can use qplot using the following commandline:

../bin/qplot --reference ../data/human.g1k.v37.umfa --dbsnp ../data/dbSNP130.UCSC.coordinates.tbl --gccontent ../data/human.g1k.w100.gc --plot qplot.pdf --stats qplot.stats --Rcode qplot.R --label "chr20:9M-10M" chrom20.9M.10M.bam

Sample outputs are listed below:

Figure: qplot.pdf

Summary statistics:

Stats\BAM       chrom20.9M.10M.bam
TotalReads(e6)  1.11
MappingRate(%)  97.24
MapRate_MQpass(%)       97.24
TargetMapping(%)        0.00
ZeroMapQual(%)  2.39
MapQual<10(%)   2.86
PairedReads(%)  83.76
ProperPaired(%) 71.34
MappedBases(e9) 0.04
Q20Bases(e9)    0.04
Q20BasesPct(%)  88.63
MeanDepth       42.22
GenomeCover(%)  0.03
EPS_MSE 1.81
EPS_Cycle_Mean  18.71
GCBiasMSE       0.01
ISize_mode      137
ISize_medium    184
DupRate(%)      5.90
QCFailRate(%)   0.00
BaseComp_A(%)   29.9
BaseComp_C(%)   20.1
BaseComp_G(%)   20.2
BaseComp_T(%)   29.8
BaseComp_O(%)   0.1


Gallery of examples

Here we show qplot can be applied in different sequencing scenarios. Also users can customize statistics generated by qplot in various formats.


  • Whole genome sequencing with more than one lanes

Figures

https://statgen.sph.umich.edu/w/images/5/53/Sardinia_Run_84_QA.pdf

Summary statistics text file

TotalReads(e6)  72.94   64.52   74.87   62.25   67.21
MappingRate(%)  97.62   97.51   97.75   97.52   97.35
MapRate_MQpass(%)       97.62   97.51   97.75   97.52   97.35
TargetMapping(%)        45.53   45.51   46.39   45.81   46.23
ZeroMapQual(%)  11.52   11.64   11.77   10.97   11.14 
MapQual<10(%)   11.91   12.03   12.17   11.34   11.52
PairedReads(%)  100.00  100.00  100.00  100.00  100.00
ProperPaired(%) 96.14   96.10   96.34   95.60   95.91 
MappedBases(e9) 2.11    1.87    2.20    1.82    1.97
Q20Bases(e9)    2.05    1.81    2.13    1.76    1.91
Q20BasesPct(%)  97.12   96.98   96.75   96.90   96.91
MeanDepth       35.08   31.05   36.55   30.30   32.82
GenomeCover(%)  2.10    2.10    2.10    2.09    2.10
EPS_MSE 8.89    6.88    8.50    13.32   6.86
EPS_Cycle_Mean  26.04   25.88   25.86   26.12   25.77
GCBiasMSE       0.04    0.05    0.04    0.07    0.04
ISize_mode      250     250     249     210     250
ISize_medium    271     270     270     260     270 
DupRate(%)      3.50    3.79    3.27    4.51    3.56
QCFailRate(%)   0.00    0.00    0.00    0.00    0.00
BaseComp_A(%)   26.3    26.4    26.3    26.8    26.3
BaseComp_C(%)   23.7    23.6    23.7    23.2    23.7
BaseComp_G(%)   23.2    23.0    23.2    22.7    23.1
BaseComp_T(%)   26.8    27.1    26.8    27.3    26.9
BaseComp_O(%)   0.0     0.0     0.0     0.0     0.0


  • Whole genome sequencing with 24-multiplexing

With customized script, we aggregated 24 bar-coded samples in the same graph. The graph will help compare sequencing quality between samples.

QPlot of 24 samples(PDF)


  • Interactive qplot

Qplot can be interactive. In the following example, you can use scroll mouse to zoom in, zoom out each graph; pan to certain part of graph. By presenting qplot data in web page, users can identify problematic sequencing samples easily.

QPlot of 24 samples(HTML)

Diagnose sequencing quality

Qplot is designed and implemented by the need of checking sequencing quality. Besides the exampled of analyzing RNA-seq data as shown in our manuscript, here we demonstrate two additional scenarios in which qplot can help identify problems after obtaining sequencing data.


  • Base quality distributed abnormally

Example of qplot help identify wrong phred base quality

By checking the first graph "Empirical vs reported Phred score", we found reported base qualities are shifted to right. Further we notice that effects is caused by different software from Illumina sequencers. In this particular example, all base qualities are wrongly added '33'. Such data used in variant calling may increase false positive SNP calling.


  • Bar-coded samples

Example of qplot identifying the effect of ignoring bar-coding

By checking "Empirical phred score by cycle" (top right graph on the first page), we notice the empirical qualities in the first several cycle are abnormally low. This question leads us hypnotize the first several bases have different properties. Further investigation revealed that this sequencing was done using bar-coded DNA samples, but the analysis did not properly de-multiplexing to each sample.

Contact

Questions and requests should be sent to Bingshan Li (bingshan@umich.edu)