Difference between revisions of "QPLOT"

From Genome Analysis Wiki
Jump to: navigation, search
m (Where to Find It)
(Source Code Distribution)
 
(75 intermediate revisions by 6 users not shown)
Line 1: Line 1:
 
= Introduction =
 
= Introduction =
The qplot program is to calculate various summary statistics some of which will be plotted in a pdf file which can be used to assess the sequencing quality for illumina sequencing after mapping reads to the reference genome. The main statistics are empirical Phred scores which was calculated based on the background mismatch rate. By background mismatch rate, it means the rate that sequenced bases are different from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC biases, insert size distribution, depth distribution, genome coverage, empirical Q20 count and so on. An example plot and summary text will follow at the end
 
  
 +
The qplot program calculates various summary statistics some of which are plotted in a PDF file. These statistics can be used to assess the sequencing quality of sequence reads mapped to the reference genome. The main statistics are empirical Phred scores which are calculated based on the background mismatch rate. Background mismatch rate is the rate that sequenced bases are different from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC biases, insert size distribution, depth distribution, genome coverage, empirical Q20 count, and so on.
 +
 +
In the following sections, we will guide you through: [[#Where to Find It |how to obtain qplot]], [[#Usage |how to use qplot]], [[#Built-in example |example outputs]], [[#anchorOfInteractiveQplot |interactive diagnostic plots]], and [[#Diagnose sequencing quality |real applications]] in which qplot has helped identify sequencing problems.
 +
 +
= Citing QPLOT =
 +
 +
If you found QPLOT useful and wants to cite in your paper, please copy and paste the information below.
 +
 +
* Bingshan Li, Xiaowei Zhan, Mary-Kate Wing, Paul Anderson, Hyun Min Kang, and Goncalo R. Abecasis, “QPLOT: A Quality Assessment Tool for Next Generation Sequencing Data,” BioMed Research International, vol. 2013, Article ID 865181, 4 pages, 2013. doi:10.1155/2013/865181  http://www.hindawi.com/journals/bmri/2013/865181/
  
 
= Where to Find It =
 
= Where to Find It =
Line 7: Line 15:
 
You can obtain qplot in two ways:  
 
You can obtain qplot in two ways:  
  
(1) Download and extract pre-compiled binary as described in [[#Binary Download|Binary Download]].  
+
(1) Download the pre-compiled binary along with the source code as described in [[#Binary Download|Binary Download]].  
  
(2) Download source code of qplot and compile it on your own machine. Please follow the instruction in [[#Source Code Distribution|Source Code Distribution]] on fetching source code and building instructions.
+
(2) Download source code only and compile it on your own machine. Please follow the instruction in [[#Source Code Distribution|Source Code Distribution]] on fetching source code and building instructions.
 
 
We suggest use the first method as we try to make pre-compiled binary working out of the box.
 
  
 
== Binary Download ==
 
== Binary Download ==
  
We have prepared pre-compiled qplot. You can download from: [http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot.20120213.tar.gz qplot.20120213.tar.gz (File Size: 1.7G)]  
+
We have prepared a pre-compiled (under Ubuntu) qplot along with source code . You can download it from: [http://csg.sph.umich.edu//zhanxw/software/qplot/qplot.20130627.tar.gz qplot.20130627.tar.gz (File Size: 1.7G)]  
  
 
The executable file is under qplot/bin/qplot.  
 
The executable file is under qplot/bin/qplot.  
  
In addition, we provided necessary inputs files (NCBI human genome build v37, dbSNP 130 and pre-computed GC file with windows size 100, they are all under qplot/data/).
+
In addition, we provided the necessary input files under qplot/data/ (NCBI human genome build v37, dbSNP 130, and pre-computed GC file with windows size 100).
  
You can also find example BAM input file under qplot/example/chrom20.9M.10M.bam. it is taken from 1000 Genome Project with sequencing reads aligned to chromosome 20 positioned 8M to 9M.
+
You can also find an example BAM input file under qplot/example/chrom20.9M.10M.bam. It is taken from the 1000 Genome Project with sequencing reads aligned to chromosome 20 positions 8M to 9M.
  
 
== Source Code Distribution ==
 
== Source Code Distribution ==
  
 +
We provide a source code only download in [http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-source.20130627.tar.gz qplot-source.20130627.tar.gz]. Optionally, you can download example file and/or data file:
 +
 +
[http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-example.tar.gz  example]: example input file, and expected outputs if you following the [[#Built-in example | direction]].
 +
 +
[http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-data.tar.gz resources data]: necessary input files for qplot, including NCBI human genome build v37, dbSNP 130, and pre-computed GC file with windows size 100.
 +
 +
You can put above file(s) in the same folder and follow these steps:
 +
 +
* 1. Unarchive downloaded file
 +
tar zvxf qplot-source.20130627.tar.gz
 +
 +
A new folder ''qplot'' will be created.
 +
 +
* 2. Build libStatGen
 +
cd qplot
 +
(cd ../libStatGen; make cloneLib)
 +
 +
This step will download a necessary software library [http://genome.sph.umich.edu/wiki/C%2B%2B_Library:_libStatGen libStatGen] and compile source code into a binary code library.
 +
 +
* 3. Build qplot
 +
make
 +
 +
This step will then build qplot. Upon success, the executable qplot can be found under qplot/bin/.
 +
 +
* 4. (Optional) unarchive example and/or data
 +
tar zvxf qplot-example.tar.gz
 +
 +
An example file, ''chrom20.9M.10M.bam'', will be extracted to qplot/example/. It contains ~1.1 million aligned Illumina sequencing reads of NA12878 from 1000 Genome Project. Example command line, ''cmd.sh'', example outputs, ''qplot.pdf'', ''qplot.stats'', and ''qplot.R'' are also provided and will be extracted qplot/example/ as well.
 +
 +
tar zvxf qplot-data.tar.gz
 +
 +
Three files will be extracted to qplot/data/: ''human.g1k.v37-bs.umfa'' is binary NCBI reference genome build 37; ''dbSNP130.UCSC.coordinates.tbl'' is dbSNP version 130; and ''human.g1k.w100.gc'' is pre-calculated GC content with windows size 100.
 +
 +
<!-- Please download source code from [[]], the building
 
{{ToolGitRepo|repoName=qplot|noDownload=}}
 
{{ToolGitRepo|repoName=qplot|noDownload=}}
 +
-->
  
 
= Usage =
 
= Usage =
  
 
== Command line ==
 
== Command line ==
After you obtained qplot executable (either from compiling source codes or downloaded pre-compiled binary file), you will find executable file under qplot/bin/qplot. Here is the qplot help page
+
 
 +
After you obtain the qplot executable (either by compiling the source code or by downloading the pre-compiled binary file), you will find the executable file under qplot/bin/qplot.  
 +
 
 +
Here is the qplot help page by invoking qplot without any command line arguments:
  
 
   some_linux_host > qplot/bin/qplot
 
   some_linux_host > qplot/bin/qplot
+
    The following parameters are available.  Ones with "[]" are in effect:
  The following parameters are available.  Ones with "[]" are in effect:
+
   
+
   
              References : --reference [/data/local/ref/karma.ref/human.g1k.v37.umfa],
+
   
                          --dbsnp [/home/bingshan/data/db/dbSNP/dbSNP130.UCSC.coordinates.tbl],
+
                    References : --reference [/net/fantasia/home/zhanxw/software/qplot/data/human.g1k.v37.fa],
                          --gccontent [/home/bingshan/data/db/gcContent/gcContent.hg37.w250.out]
+
                                --dbsnp [/net/fantasia/home/zhanxw/software/qplot/data/dbSNP130.UCSC.coordinates.tbl]
  Create gcContent file : --create_gc [], --winsize [100]
+
      GC content file options : --winsize [100]
            Flag filters : --read1_skip, --read2_skip, --paired_skip,
+
                  Region list : --regions [], --invertRegion
                          --unpaired_skip
+
                  Flag filters : --read1_skip, --read2_skip, --paired_skip,
          Dup and QCFail : --dup_keep, --qcfail_keep
+
                                --unpaired_skip
        Mapping filters : --minMapQuality [0.00]
+
                Dup and QCFail : --dup_keep, --qcfail_keep
      Records to process : --first_n_record [-1]
+
              Mapping filters : --minMapQuality [0.00]
        Lanes to process : --lanes []
+
            Records to process : --first_n_record [-1]
             Output files : --plot [], --stats [], --Rcode []
+
              Lanes to process : --lanes []
            Plot labels : --label [], --bamLabel []
+
        Read group to process : --readGroup []
 +
             Input file options : --noeof
 +
                  Output files : --plot [], --stats [], --Rcode [], --xml []
 +
                  Plot labels : --label [], --bamLabel []
 +
        Obsoleted (DO NOT USE) : --gccontent [], --create_gc
  
 
== Input files ==
 
== Input files ==
  
Three (3) precomputed files are required. Multiple bam/sam files should be appended after all other parameters.
+
qplot runs on the input BAM/SAM file(s) specified on the command-line after all other parameters.
 +
 
 +
Additionally, three (3) precomputed files are required.  
 +
 
 +
* <code>--reference</code>
  
* --reference
+
The reference genome is the same as karma reference genome. If the index files do not exist, qplot will create the index files '''automatically''' using the input reference fasta file.
  
The reference genome is the same as karma reference genome. If the index files do not exist, qplot will create the index files using the input reference fasta file.
+
* <code>--dbsnp</code>
  
* --dbsnp
+
This file has two columns. First column is the chromosome name which must be consistent with the reference created above. Second column is 1-based SNP position. If you want to create your own dbSNP data from downloaded UCSC dbSNP file, one way to do it is: <code>cat dbsnp_129_b36.rod|grep "single" | awk '$4-$3==1' |cut -f2,4 > dbSNP_129_b36.tbl</code>
  
This file has two columns. First column is the chromosome name wich have to be consistent with the reference created above.
+
* <code> **OBSOLETED** --gccontent, --create_gc </code>
  
* --gccontent
+
Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file.
 +
GC content file name is automatically determined in this format: <reference_genome_base_file_name>.winsize<gc_content_window_size>.gc.
 +
For example, if your reference genome is human.g1k.v37.fa and the window size is 100, then the GC content file name is: human.g1k.v37.winsize100.gc .
  
Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file. To generate the file, use the following command
+
As it said, there is no need to use --gccontent to specify GC content file in each run.
qplot --rerefence reference.fa --windowsize winsize --create_gc reference.gc
 
  
''Note'': Before running the qplot, it is critical to check how the chromosome numbers are coded. Some bam files use just numbers, others use chr + numbers. '''You need to make sure that the chromosome numbers from reference and dbSNP are consistent with the bam file.'''
+
* <code> input files </code>
 +
 
 +
QPLOT take SAM/BAM files.
 +
 
 +
''Note'': Before running qplot, it is critical to check how the chromosome names are coded. Some BAM/SAM files use just numbers, others use chr + numbers. '''You need to make sure that the chromosome names from the reference and dbSNP are consistent with the BAM/SAM files.'''
  
 
== Parameters ==
 
== Parameters ==
  
Most of command line parameters are self explanatory and some of them are described here
+
Some of the command line parameters are described here, but most are self explanatory.
  
 
*Flag filter
 
*Flag filter
  
By default all reads are processed. If it is desired to check only the first read of a pair, use --read2_skip to ignore the second read. And so on.
+
By default all reads are processed. If it is desired to check only the first read of a pair, use <code>--read2_skip</code> to ignore the second read. And so on.
  
 
*Duplication and QCFail
 
*Duplication and QCFail
Line 82: Line 139:
 
or  
 
or  
 
  --qcfail_keep
 
  --qcfail_keep
 +
  
 
*Records to process  
 
*Records to process  
This option will enable qplot to read the first '''n''' reads to test the bam files and check whether it works.
 
  
* Lanes to process
+
The <code>--first_n_record</code> option followed by a number, '''n''', will enable qplot to read the first '''n''' reads to test the bam files and verify it works.
 +
 
 +
* Lanes to process (only works for Illumina sequences)
 +
 
 +
If the input bam files have more than one lane and only some of them need to be checked, use something like <code>--lanes 1,3,5</code> to specify that only lanes 1, 3, and 5 need to be checked.
 +
 
 +
'''NOTE''' In order for this to work, the lane info has to be encoded in the read name such that the lane number is the second field with the delimiter ":".
 +
 
 +
 
 +
* Read group to process :
 +
 
 +
The read group option can restrict qplot to process a subset of reads. For example, if the BAM contains the following @RG tags:
 +
 
 +
@RG ID:UM0348_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0348_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0348_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0348_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
 
 +
QPLOT will by default (without specifying --readgroup) process all reads.
 +
 
 +
If you specify "--readGroup UM0348", then only read groups UM0348_1, UM_0348_2, UM_0348_3, UM_0348_4 will be processed.
 +
 
 +
If you specify "--readGroup UM0348_1", then only one read group, UM0348_1, will be processed.
 +
 
 +
 
 +
* Input file options :
 +
 
 +
BAM files are compressed using BGZF and should contain the EOF indicator by default. QPLOT will, by default, stop working if it does not find a valid EOF indicator inside the BAM files.
 +
However, you can force QPLOT to continue processing BAM files without an EOF indicator using --noeof. But you should be aware the input files may be corrupted.
  
If the input bam files have more than one lane and only some of them need to be checked, they can be specified by --lanes 1,3,5 whatever the number of lanes needed.
 
  
'''NOTE''' In order for this to work, the lane info has to be encoded in the read name such that lane number is the second field with the delimit of ":".
+
* Mapping filters
  
*Mapping filters
+
Qplot will exclude reads with lower mapping qualities than the user specified parameter, <code>--minMapQuality</code>. By default, mapped reads with all mapping quality will be included in the analysis.
  
Qplot will exclude reads with lower mapping qualities than user specified parameter. By default, all reads will be included in analysis.
 
  
 
*Region list
 
*Region list
  
If the interest of qplot is a list of regions, e.g. exons, this can be achieved by providing a list of region. The regions should be in the form of "chr start end label" each line in the file. In order for this option to work, within each chromosome (contig) the regions have to be sorted by starting position, and also the input bam files have to be sorted.  
+
If the interest of qplot is a list of regions, e.g. exons, this can be achieved by providing a list of regions. The regions should be in the form of "chr start end label" each line in the file (NOTE: ''start'' and ''end'' position are inclusive and they follow the convention of [http://genome.ucsc.edu/FAQ/FAQformat#format1 BED file]).
 +
In order for this option to work, within each chromosome (contig) the regions have to be sorted by starting position, and also the input bam files have to be sorted.  
 +
For example, you can create a text file, region.txt like following:
 +
 
 
  1 100 500 region_A
 
  1 100 500 region_A
 
  1 600 800 region_B
 
  1 600 800 region_B
 
  2 100 300 region_C
 
  2 100 300 region_C
  ...
+
   
 +
Then specifying <code> --regions region.txt</code> enables qplot to calculate various statistics out of sequenced bases only within the above 3 regions.
  
Qplot also provide --invertRegion option. Enabling this option will let qplot calculate those sequence bases that are off the given region.
+
Qplot also provides the <code>--invertRegion</code> option. Enabling this option tells qplot to operate on those sequence bases that are outside the given region.
  
  
 
* Plot labels
 
* Plot labels
  
Two kinds of labels are enabled. First one is the label for the plot (default is empty), e.g. label on the title of each subplot. Second one is a set of labels for each input bam files, e.g. sample ID (default is numbers 1, 2, ... until the number of input bam files. For example:
+
Two kinds of labels are enabled. <code>--label</code> is the label for the plot (default is empty) which is appended to the title of each subplot. <code>--bamLabels</code> followed by a column separated list of labels provides the labels for each input SAM/BAM file, e.g. sample ID (default is numbers 1, 2, ... until the number of input bam files). For example:
 
  --label Run100 --bamLabels s1,s2,s3,s4,s5,s6,s7,s8
 
  --label Run100 --bamLabels s1,s2,s3,s4,s5,s6,s7,s8
 
* Multiple threading (not officially supported)
 
 
Number of concurrent threads running for the input bam files. One bam file will be processed by one thread. Therefore using a number which is dividable by the number of input bam files will make it efficient. One extra thread requires memory about 375Mb on top of around 4Gb memory used to hold reference and GC content file.
 
  
 
== Output files ==
 
== Output files ==
  
 
There are three (optional) output files.
 
There are three (optional) output files.
* --plot ''qa.pdf''
+
* <code>--plot ''qa.pdf''</code>
  
Qplot will generate a PDF file named ''qa.pdf'' containing 2 pages each with 4 figures. If --pages 1 is specified, only page 1 is output. The plot is generated using Rscript.
+
Qplot will generate a PDF file named ''qa.pdf'' containing 2 pages each with 4 figures. The plot is generated using Rscript.
  
* --stats ''qa.stats''
+
* <code>--stats ''qa.stats''</code>
  
Qplot will generate a text file names ''qa.stats'' containing various summary statistics for each input bam/sam file.
+
Qplot will generate a text file named ''qa.stats'' containing various summary statistics for each input BAM/SAM file.
  
* --Rcode ''qa.R''
+
* <code>--Rcode ''qa.R''</code>
  
Qplot will generate ''qa.R'' which is R code used for plotting the figures in ''qa.pdf'' file. If Rscript is not installed in the system, you can use the qa.R to generate the figures in other machines, or extract plotting data from each run and combine multiple runs together to generate more comprehensive plots (See [[Example]]).
+
Qplot will generate ''qa.R'' which is the R code used for plotting the figures in the ''qa.pdf'' file. If Rscript is not installed in the system, you can use the qa.R to generate the figures on other machines, or extract plotting data from each run and combine multiple runs together to generate more comprehensive plots (See [[#Example | Example]]).
  
 
= Example =
 
= Example =
  
Qplot can generate diagnostic graphs, related R code and summary statistics for each sam/bam files.
+
Qplot can generate diagnostic graphs, related R code, and summary statistics for each SAM/BAM file.
  
== Build-in example ==
+
== Built-in example ==
  
In pre-compiled binary file, you will find a subdirectory named examples. We provide a sample file from 1000 Genome project, it contained aligned read on chromosome 20 from position 8 Mbp to 9Mbp. You can use qplot using the following commandline:
+
In the pre-compiled binary download, you will find a subdirectory named examples. We provide a sample file from the 1000 Genome project, it contains aligned reads on chromosome 20 from position 8 Mbp to 9Mbp. You can invoke qplot using the following command line:
  
 
  ../bin/qplot --reference ../data/human.g1k.v37.umfa --dbsnp ../data/dbSNP130.UCSC.coordinates.tbl --gccontent ../data/human.g1k.w100.gc --plot qplot.pdf --stats qplot.stats --Rcode qplot.R --label "chr20:9M-10M" chrom20.9M.10M.bam
 
  ../bin/qplot --reference ../data/human.g1k.v37.umfa --dbsnp ../data/dbSNP130.UCSC.coordinates.tbl --gccontent ../data/human.g1k.w100.gc --plot qplot.pdf --stats qplot.stats --Rcode qplot.R --label "chr20:9M-10M" chrom20.9M.10M.bam
Line 143: Line 230:
 
Sample outputs are listed below:
 
Sample outputs are listed below:
  
Figure: [[Media:qplot.pdf | qplot.pdf]]
+
1) Figure: [[Media:qplot.pdf | qplot.pdf]]
  
Summary statistics:
+
2) Summary statistics:
 
  Stats\BAM      chrom20.9M.10M.bam
 
  Stats\BAM      chrom20.9M.10M.bam
 
  TotalReads(e6)  1.11
 
  TotalReads(e6)  1.11
Line 173: Line 260:
 
  BaseComp_O(%)  0.1
 
  BaseComp_O(%)  0.1
  
 
 
== Gallery of examples ==
 
== Gallery of examples ==
  
Here we show qplot can be applied in different sequencing scenarios. Also users can customize statistics generated by qplot in various formats.
+
Here we show qplot can be applied in various sequencing scenarios. Also users can customize statistics generated by qplot to their needs.
 
 
* Whole genome sequencing with more than one lanes
 
 
 
 
 
Figures
 
 
 
https://statgen.sph.umich.edu/w/images/5/53/Sardinia_Run_84_QA.pdf
 
 
 
Summary statistics text file
 
TotalReads(e6)  72.94  64.52  74.87  62.25  67.21
 
MappingRate(%)  97.62  97.51  97.75  97.52  97.35
 
MapRate_MQpass(%)      97.62  97.51  97.75  97.52  97.35
 
TargetMapping(%)        45.53  45.51  46.39  45.81  46.23
 
ZeroMapQual(%)  11.52  11.64  11.77  10.97  11.14
 
MapQual<10(%)  11.91  12.03  12.17  11.34  11.52
 
PairedReads(%)  100.00  100.00  100.00  100.00  100.00
 
ProperPaired(%) 96.14  96.10  96.34  95.60  95.91
 
MappedBases(e9) 2.11    1.87    2.20    1.82    1.97
 
Q20Bases(e9)    2.05    1.81    2.13    1.76    1.91
 
Q20BasesPct(%)  97.12  96.98  96.75  96.90  96.91
 
MeanDepth      35.08  31.05  36.55  30.30  32.82
 
GenomeCover(%)  2.10    2.10    2.10    2.09    2.10
 
EPS_MSE 8.89    6.88    8.50    13.32  6.86
 
EPS_Cycle_Mean  26.04  25.88  25.86  26.12  25.77
 
GCBiasMSE      0.04    0.05    0.04    0.07    0.04
 
ISize_mode      250    250    249    210    250
 
ISize_medium    271    270    270    260    270
 
DupRate(%)      3.50    3.79    3.27    4.51    3.56
 
QCFailRate(%)  0.00    0.00    0.00    0.00    0.00
 
BaseComp_A(%)  26.3    26.4    26.3    26.8    26.3
 
BaseComp_C(%)  23.7    23.6    23.7    23.2    23.7
 
BaseComp_G(%)  23.2    23.0    23.2    22.7    23.1
 
BaseComp_T(%)  26.8    27.1    26.8    27.3    26.9
 
BaseComp_O(%)  0.0    0.0    0.0    0.0    0.0
 
  
 
* Whole genome sequencing with 24-multiplexing
 
* Whole genome sequencing with 24-multiplexing
  
With customized script, we aggregated 24 bar-coded samples in the same graph.
+
With a customized script, we aggregated 24 bar-coded samples in the same graph.
 
The graph will help compare sequencing quality between samples.  
 
The graph will help compare sequencing quality between samples.  
  
Line 221: Line 273:
 
* Interactive qplot  
 
* Interactive qplot  
  
Qplot can be interactive. In the following example, you can use scroll mouse to zoom in, zoom out each graph; pan to certain part of graph.
+
<span id="anchorOfInteractiveQplot"></span>
By presenting qplot data in web page, users can identify problematic sequencing samples easily.  
+
Qplot can be interactive. In the following example, you can use mouse scroll to zoom in and zoom out on each graph and pan to a certain part of the graph.
 +
By presenting qplot data on a web page, users can easily identify problematic sequencing samples. Users of qplot can customize its outputs into web page format greatly easing the data exploring process.
  
 
[http://www-personal.umich.edu/~zhanxw/qplot.Pool.9847.html  QPlot of 24 samples(HTML) ]
 
[http://www-personal.umich.edu/~zhanxw/qplot.Pool.9847.html  QPlot of 24 samples(HTML) ]
  
 +
== Diagnose sequencing quality ==
  
== Diagnose sequencing quality ==
+
Qplot is designed and implemented for the need of checking sequencing quality.  
Qplot is designed and implemented by the need of checking sequencing quality.  
+
Besides the example of analyzing RNA-seq data as shown in our manuscript,  
Besides the exampled of analyzing RNA-seq data as in the manuscript,  
+
here we demonstrate two additional scenarios in which qplot can help identify problems after obtaining sequencing data.  
here we demonstrate two scenarios in which qplot help us identify potential problem after obtaining sequencing data.  
 
  
 
* Base quality distributed abnormally
 
* Base quality distributed abnormally
  
[[Media: WrongBaseQual.pdf | Example of qplot help identify wrong phred base quality]]
+
[[Media: WrongBaseQual.pdf | Example of qplot helping to identify wrong phred base quality]]
  
By checking the first graph "Empirical vs reported Phred score", we found reported base qualities are shifted to right.
+
By checking the first graph "Empirical vs reported Phred score", we found reported base qualities are shifted to the right.
Further we notice that effects is caused by different software from Illumina sequencers.  
+
In this particular example, '33' was incorrectly added to all base qualities.  
In this particular example, all base qualities are wrongly added '33'. Such data used in variant calling may increase false positive SNP calling.
+
When such data used in variant calling, we may increase false positive SNP variants.
  
 
* Bar-coded samples
 
* Bar-coded samples
Line 244: Line 297:
 
[[Media: WrongBarCoding.pdf | Example of qplot identifying the effect of ignoring bar-coding]]
 
[[Media: WrongBarCoding.pdf | Example of qplot identifying the effect of ignoring bar-coding]]
  
By checking "Empirical phred score by cycle" (top right graph on the first page), we notice the empirical qualities in the first several cycle are abnormally low. This question leads us hypnotize the first several bases have different properties. Further investigation revealed that this sequencing was done using bar-coded DNA samples, but the analysis did not properly de-multiplexing to each sample.
+
By checking "Empirical phred score by cycle" (top right graph on the first page), we noticed the empirical qualities in the first several cycles are abnormally low. This phenomenon leads us to hypothesize that the first several bases have different properties. Further investigation confirmed that this sequencing was done using bar-coded DNA samples, but the analysis did not properly de-multiplex each sample.
  
 
= Contact =
 
= Contact =
  
Questions and requests should be sent to Bingshan Li ([mailto:bingshan@umich.edu bingshan@umich.edu])
+
Questions and requests should be sent to Bingshan Li ([mailto:bingshan@umich.edu bingshan@umich.edu]) or Xiaowei Zhan ([mailto:zhanxw@umich.edu zhanxw@umich.edu]) or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu])

Latest revision as of 11:42, 2 February 2017

Introduction

The qplot program calculates various summary statistics some of which are plotted in a PDF file. These statistics can be used to assess the sequencing quality of sequence reads mapped to the reference genome. The main statistics are empirical Phred scores which are calculated based on the background mismatch rate. Background mismatch rate is the rate that sequenced bases are different from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC biases, insert size distribution, depth distribution, genome coverage, empirical Q20 count, and so on.

In the following sections, we will guide you through: how to obtain qplot, how to use qplot, example outputs, interactive diagnostic plots, and real applications in which qplot has helped identify sequencing problems.

Citing QPLOT

If you found QPLOT useful and wants to cite in your paper, please copy and paste the information below.

  • Bingshan Li, Xiaowei Zhan, Mary-Kate Wing, Paul Anderson, Hyun Min Kang, and Goncalo R. Abecasis, “QPLOT: A Quality Assessment Tool for Next Generation Sequencing Data,” BioMed Research International, vol. 2013, Article ID 865181, 4 pages, 2013. doi:10.1155/2013/865181 http://www.hindawi.com/journals/bmri/2013/865181/

Where to Find It

You can obtain qplot in two ways:

(1) Download the pre-compiled binary along with the source code as described in Binary Download.

(2) Download source code only and compile it on your own machine. Please follow the instruction in Source Code Distribution on fetching source code and building instructions.

Binary Download

We have prepared a pre-compiled (under Ubuntu) qplot along with source code . You can download it from: qplot.20130627.tar.gz (File Size: 1.7G)

The executable file is under qplot/bin/qplot.

In addition, we provided the necessary input files under qplot/data/ (NCBI human genome build v37, dbSNP 130, and pre-computed GC file with windows size 100).

You can also find an example BAM input file under qplot/example/chrom20.9M.10M.bam. It is taken from the 1000 Genome Project with sequencing reads aligned to chromosome 20 positions 8M to 9M.

Source Code Distribution

We provide a source code only download in qplot-source.20130627.tar.gz. Optionally, you can download example file and/or data file:

example: example input file, and expected outputs if you following the direction.

resources data: necessary input files for qplot, including NCBI human genome build v37, dbSNP 130, and pre-computed GC file with windows size 100.

You can put above file(s) in the same folder and follow these steps:

  • 1. Unarchive downloaded file
tar zvxf qplot-source.20130627.tar.gz

A new folder qplot will be created.

  • 2. Build libStatGen
cd qplot
(cd ../libStatGen; make cloneLib)

This step will download a necessary software library libStatGen and compile source code into a binary code library.

  • 3. Build qplot
make 

This step will then build qplot. Upon success, the executable qplot can be found under qplot/bin/.

  • 4. (Optional) unarchive example and/or data
tar zvxf qplot-example.tar.gz

An example file, chrom20.9M.10M.bam, will be extracted to qplot/example/. It contains ~1.1 million aligned Illumina sequencing reads of NA12878 from 1000 Genome Project. Example command line, cmd.sh, example outputs, qplot.pdf, qplot.stats, and qplot.R are also provided and will be extracted qplot/example/ as well.

tar zvxf qplot-data.tar.gz

Three files will be extracted to qplot/data/: human.g1k.v37-bs.umfa is binary NCBI reference genome build 37; dbSNP130.UCSC.coordinates.tbl is dbSNP version 130; and human.g1k.w100.gc is pre-calculated GC content with windows size 100.


Usage

Command line

After you obtain the qplot executable (either by compiling the source code or by downloading the pre-compiled binary file), you will find the executable file under qplot/bin/qplot.

Here is the qplot help page by invoking qplot without any command line arguments:

 some_linux_host > qplot/bin/qplot
   The following parameters are available.  Ones with "[]" are in effect:
   
   
   
                   References : --reference [/net/fantasia/home/zhanxw/software/qplot/data/human.g1k.v37.fa],
                                --dbsnp [/net/fantasia/home/zhanxw/software/qplot/data/dbSNP130.UCSC.coordinates.tbl]
      GC content file options : --winsize [100]
                  Region list : --regions [], --invertRegion
                 Flag filters : --read1_skip, --read2_skip, --paired_skip,
                                --unpaired_skip
               Dup and QCFail : --dup_keep, --qcfail_keep
              Mapping filters : --minMapQuality [0.00]
           Records to process : --first_n_record [-1]
             Lanes to process : --lanes []
        Read group to process : --readGroup []
           Input file options : --noeof
                 Output files : --plot [], --stats [], --Rcode [], --xml []
                  Plot labels : --label [], --bamLabel []
       Obsoleted (DO NOT USE) : --gccontent [], --create_gc

Input files

qplot runs on the input BAM/SAM file(s) specified on the command-line after all other parameters.

Additionally, three (3) precomputed files are required.

  • --reference

The reference genome is the same as karma reference genome. If the index files do not exist, qplot will create the index files automatically using the input reference fasta file.

  • --dbsnp

This file has two columns. First column is the chromosome name which must be consistent with the reference created above. Second column is 1-based SNP position. If you want to create your own dbSNP data from downloaded UCSC dbSNP file, one way to do it is: cat dbsnp_129_b36.rod|grep "single" | awk '$4-$3==1' |cut -f2,4 > dbSNP_129_b36.tbl

  • **OBSOLETED** --gccontent, --create_gc

Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file. GC content file name is automatically determined in this format: <reference_genome_base_file_name>.winsize<gc_content_window_size>.gc. For example, if your reference genome is human.g1k.v37.fa and the window size is 100, then the GC content file name is: human.g1k.v37.winsize100.gc .

As it said, there is no need to use --gccontent to specify GC content file in each run.

  • input files

QPLOT take SAM/BAM files.

Note: Before running qplot, it is critical to check how the chromosome names are coded. Some BAM/SAM files use just numbers, others use chr + numbers. You need to make sure that the chromosome names from the reference and dbSNP are consistent with the BAM/SAM files.

Parameters

Some of the command line parameters are described here, but most are self explanatory.

  • Flag filter

By default all reads are processed. If it is desired to check only the first read of a pair, use --read2_skip to ignore the second read. And so on.

  • Duplication and QCFail

By default reads marked as duplication and QCFail are ignored but can be retained by

--dup_keep 

or

--qcfail_keep


  • Records to process

The --first_n_record option followed by a number, n, will enable qplot to read the first n reads to test the bam files and verify it works.

  • Lanes to process (only works for Illumina sequences)

If the input bam files have more than one lane and only some of them need to be checked, use something like --lanes 1,3,5 to specify that only lanes 1, 3, and 5 need to be checked.

NOTE In order for this to work, the lane info has to be encoded in the read name such that the lane number is the second field with the delimiter ":".


  • Read group to process :

The read group option can restrict qplot to process a subset of reads. For example, if the BAM contains the following @RG tags:

@RG	ID:UM0348_1:1	PL:ILLUMINA	LB:M5390	SM:M5390	CN:UM
@RG	ID:UM0348_2:1	PL:ILLUMINA	LB:M5390	SM:M5390	CN:UM
@RG	ID:UM0348_3:1	PL:ILLUMINA	LB:M5390	SM:M5390	CN:UM
@RG	ID:UM0348_4:1	PL:ILLUMINA	LB:M5390	SM:M5390	CN:UM
@RG	ID:UM0360_1:1	PL:ILLUMINA	LB:M5390	SM:M5390	CN:UM
@RG	ID:UM0360_2:1	PL:ILLUMINA	LB:M5390	SM:M5390	CN:UM
@RG	ID:UM0360_3:1	PL:ILLUMINA	LB:M5390	SM:M5390	CN:UM
@RG	ID:UM0360_4:1	PL:ILLUMINA	LB:M5390	SM:M5390	CN:UM

QPLOT will by default (without specifying --readgroup) process all reads.

If you specify "--readGroup UM0348", then only read groups UM0348_1, UM_0348_2, UM_0348_3, UM_0348_4 will be processed.

If you specify "--readGroup UM0348_1", then only one read group, UM0348_1, will be processed.


  • Input file options :

BAM files are compressed using BGZF and should contain the EOF indicator by default. QPLOT will, by default, stop working if it does not find a valid EOF indicator inside the BAM files. However, you can force QPLOT to continue processing BAM files without an EOF indicator using --noeof. But you should be aware the input files may be corrupted.


  • Mapping filters

Qplot will exclude reads with lower mapping qualities than the user specified parameter, --minMapQuality. By default, mapped reads with all mapping quality will be included in the analysis.


  • Region list

If the interest of qplot is a list of regions, e.g. exons, this can be achieved by providing a list of regions. The regions should be in the form of "chr start end label" each line in the file (NOTE: start and end position are inclusive and they follow the convention of BED file). In order for this option to work, within each chromosome (contig) the regions have to be sorted by starting position, and also the input bam files have to be sorted. For example, you can create a text file, region.txt like following:

1 100 500 region_A
1 600 800 region_B
2 100 300 region_C

Then specifying --regions region.txt enables qplot to calculate various statistics out of sequenced bases only within the above 3 regions.

Qplot also provides the --invertRegion option. Enabling this option tells qplot to operate on those sequence bases that are outside the given region.


  • Plot labels

Two kinds of labels are enabled. --label is the label for the plot (default is empty) which is appended to the title of each subplot. --bamLabels followed by a column separated list of labels provides the labels for each input SAM/BAM file, e.g. sample ID (default is numbers 1, 2, ... until the number of input bam files). For example:

--label Run100 --bamLabels s1,s2,s3,s4,s5,s6,s7,s8

Output files

There are three (optional) output files.

  • --plot qa.pdf

Qplot will generate a PDF file named qa.pdf containing 2 pages each with 4 figures. The plot is generated using Rscript.

  • --stats qa.stats

Qplot will generate a text file named qa.stats containing various summary statistics for each input BAM/SAM file.

  • --Rcode qa.R

Qplot will generate qa.R which is the R code used for plotting the figures in the qa.pdf file. If Rscript is not installed in the system, you can use the qa.R to generate the figures on other machines, or extract plotting data from each run and combine multiple runs together to generate more comprehensive plots (See Example).

Example

Qplot can generate diagnostic graphs, related R code, and summary statistics for each SAM/BAM file.

Built-in example

In the pre-compiled binary download, you will find a subdirectory named examples. We provide a sample file from the 1000 Genome project, it contains aligned reads on chromosome 20 from position 8 Mbp to 9Mbp. You can invoke qplot using the following command line:

../bin/qplot --reference ../data/human.g1k.v37.umfa --dbsnp ../data/dbSNP130.UCSC.coordinates.tbl --gccontent ../data/human.g1k.w100.gc --plot qplot.pdf --stats qplot.stats --Rcode qplot.R --label "chr20:9M-10M" chrom20.9M.10M.bam

Sample outputs are listed below:

1) Figure: qplot.pdf

2) Summary statistics:

Stats\BAM       chrom20.9M.10M.bam
TotalReads(e6)  1.11
MappingRate(%)  97.24
MapRate_MQpass(%)       97.24
TargetMapping(%)        0.00
ZeroMapQual(%)  2.39
MapQual<10(%)   2.86
PairedReads(%)  83.76
ProperPaired(%) 71.34
MappedBases(e9) 0.04
Q20Bases(e9)    0.04
Q20BasesPct(%)  88.63
MeanDepth       42.22
GenomeCover(%)  0.03
EPS_MSE 1.81
EPS_Cycle_Mean  18.71
GCBiasMSE       0.01
ISize_mode      137
ISize_medium    184
DupRate(%)      5.90
QCFailRate(%)   0.00
BaseComp_A(%)   29.9
BaseComp_C(%)   20.1
BaseComp_G(%)   20.2
BaseComp_T(%)   29.8
BaseComp_O(%)   0.1

Gallery of examples

Here we show qplot can be applied in various sequencing scenarios. Also users can customize statistics generated by qplot to their needs.

  • Whole genome sequencing with 24-multiplexing

With a customized script, we aggregated 24 bar-coded samples in the same graph. The graph will help compare sequencing quality between samples.

QPlot of 24 samples(PDF)

  • Interactive qplot

Qplot can be interactive. In the following example, you can use mouse scroll to zoom in and zoom out on each graph and pan to a certain part of the graph. By presenting qplot data on a web page, users can easily identify problematic sequencing samples. Users of qplot can customize its outputs into web page format greatly easing the data exploring process.

QPlot of 24 samples(HTML)

Diagnose sequencing quality

Qplot is designed and implemented for the need of checking sequencing quality. Besides the example of analyzing RNA-seq data as shown in our manuscript, here we demonstrate two additional scenarios in which qplot can help identify problems after obtaining sequencing data.

  • Base quality distributed abnormally

Example of qplot helping to identify wrong phred base quality

By checking the first graph "Empirical vs reported Phred score", we found reported base qualities are shifted to the right. In this particular example, '33' was incorrectly added to all base qualities. When such data used in variant calling, we may increase false positive SNP variants.

  • Bar-coded samples

Example of qplot identifying the effect of ignoring bar-coding

By checking "Empirical phred score by cycle" (top right graph on the first page), we noticed the empirical qualities in the first several cycles are abnormally low. This phenomenon leads us to hypothesize that the first several bases have different properties. Further investigation confirmed that this sequencing was done using bar-coded DNA samples, but the analysis did not properly de-multiplex each sample.

Contact

Questions and requests should be sent to Bingshan Li (bingshan@umich.edu) or Xiaowei Zhan (zhanxw@umich.edu) or Goncalo Abecasis (goncalo@umich.edu)