Changes

From Genome Analysis Wiki
2,526 bytes added ,  11:42, 2 February 2017
Line 1: Line 1:  
= Introduction =
 
= Introduction =
 +
 
The qplot program calculates various summary statistics some of which are plotted in a PDF file. These statistics can be used to assess the sequencing quality of sequence reads mapped to the reference genome. The main statistics are empirical Phred scores which are calculated based on the background mismatch rate. Background mismatch rate is the rate that sequenced bases are different from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC biases, insert size distribution, depth distribution, genome coverage, empirical Q20 count, and so on.  
 
The qplot program calculates various summary statistics some of which are plotted in a PDF file. These statistics can be used to assess the sequencing quality of sequence reads mapped to the reference genome. The main statistics are empirical Phred scores which are calculated based on the background mismatch rate. Background mismatch rate is the rate that sequenced bases are different from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC biases, insert size distribution, depth distribution, genome coverage, empirical Q20 count, and so on.  
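
As an illustration of how a mismatch rate maps to a Phred score, here is a minimal sketch using the standard Phred definition, Q = -10 * log10(error rate). The function name and the counts are hypothetical, and the sketch assumes dbSNP positions have already been excluded from both counts; qplot's internal computation may differ in detail.

```python
import math


def empirical_phred(mismatches, bases):
    """Phred-scaled quality from an observed background mismatch rate.

    Uses the standard Phred definition Q = -10 * log10(error rate).
    Both counts are assumed to already exclude dbSNP positions.
    """
    if bases <= 0:
        raise ValueError("bases must be positive")
    if mismatches == 0:
        return float("inf")  # no observed mismatches
    return -10.0 * math.log10(mismatches / bases)


# Hypothetical counts: 1 mismatch per 1,000 non-dbSNP bases.
print(round(empirical_phred(1, 1000), 2))
```

Under this definition, one mismatch per 1,000 bases corresponds to Q30 and one per 100 bases to Q20.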
    
In the following sections, we will guide you through: [[#Where to Find It |how to obtain qplot]], [[#Usage |how to use qplot]], [[#Built-in example |example outputs]], [[#anchorOfInteractiveQplot |interactive diagnostic plots]], and [[#Diagnose sequencing quality |real applications]] in which qplot has helped identify sequencing problems.
 
In the following sections, we will guide you through: [[#Where to Find It |how to obtain qplot]], [[#Usage |how to use qplot]], [[#Built-in example |example outputs]], [[#anchorOfInteractiveQplot |interactive diagnostic plots]], and [[#Diagnose sequencing quality |real applications]] in which qplot has helped identify sequencing problems.
 +
 +
= Citing QPLOT =
 +
 +
If you find QPLOT useful and want to cite it in your paper, please copy and paste the information below.
 +
 +
* Bingshan Li, Xiaowei Zhan, Mary-Kate Wing, Paul Anderson, Hyun Min Kang, and Goncalo R. Abecasis, “QPLOT: A Quality Assessment Tool for Next Generation Sequencing Data,” BioMed Research International, vol. 2013, Article ID 865181, 4 pages, 2013. doi:10.1155/2013/865181  http://www.hindawi.com/journals/bmri/2013/865181/
    
= Where to Find It =
 
= Where to Find It =
Line 14: Line 21:  
== Binary Download ==
 
== Binary Download ==
   −
We have prepared a pre-compiled (under Ubuntu) qplot along with source code . You can download it from: [http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot.20120213.tar.gz qplot.20120213.tar.gz (File Size: 1.7G)]  
+
We have prepared a pre-compiled (under Ubuntu) qplot along with the source code. You can download it from: [http://csg.sph.umich.edu//zhanxw/software/qplot/qplot.20130627.tar.gz qplot.20130627.tar.gz (File Size: 1.7G)]  
    
The executable file is under qplot/bin/qplot.  
 
The executable file is under qplot/bin/qplot.  
Line 24: Line 31:  
== Source Code Distribution ==
 
== Source Code Distribution ==
   −
We provide a source code only download in [http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot-source.20120213.tar.gz qplot-source.20120213.tar.gz]. Optionally, you can download example file and/or data file:
+
We provide a source code only download in [http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-source.20130627.tar.gz qplot-source.20130627.tar.gz]. Optionally, you can download example file and/or data file:
   −
[http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot-example.tar.gz  example]: example input file, and expected outputs if you following the [[#Built-in example direction | direction]].  
+
[http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-example.tar.gz  example]: example input file, and expected outputs if you follow the [[#Built-in example | directions]].  
   −
[http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot-data.tar.gz resources data]: necessary input files for qplot, including NCBI human genome build v37, dbSNP 130, and pre-computed GC file with windows size 100.
+
[http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-data.tar.gz resources data]: necessary input files for qplot, including NCBI human genome build v37, dbSNP 130, and pre-computed GC file with windows size 100.
    
You can put above file(s) in the same folder and follow these steps:
 
You can put above file(s) in the same folder and follow these steps:
    
* 1. Unarchive downloaded file
 
* 1. Unarchive downloaded file
  tar zvxf qplot-source.20120213.tar.gz
+
  tar zvxf qplot-source.20130627.tar.gz
    
A new folder ''qplot'' will be created.
 
A new folder ''qplot'' will be created.
Line 39: Line 46:  
* 2. Build libStatGen
 
* 2. Build libStatGen
 
  cd qplot
 
  cd qplot
  make libStatGen
+
  (cd ../libStatGen; make cloneLib)
    
This step will download a necessary software library [http://genome.sph.umich.edu/wiki/C%2B%2B_Library:_libStatGen libStatGen] and compile source code into a binary code library.
 
This step will download a necessary software library [http://genome.sph.umich.edu/wiki/C%2B%2B_Library:_libStatGen libStatGen] and compile source code into a binary code library.
    
* 3. Build qplot
 
* 3. Build qplot
  make all
+
  make  
    
This step will then build qplot. Upon success, the executable qplot can be found under qplot/bin/.
 
This step will then build qplot. Upon success, the executable qplot can be found under qplot/bin/.
Line 64: Line 71:     
== Command line ==
 
== Command line ==
 +
 
After you obtain the qplot executable (either by compiling the source code or by downloading the pre-compiled binary file), you will find the executable file under qplot/bin/qplot.  
 
After you obtain the qplot executable (either by compiling the source code or by downloading the pre-compiled binary file), you will find the executable file under qplot/bin/qplot.  
   Line 69: Line 77:     
   some_linux_host > qplot/bin/qplot
 
   some_linux_host > qplot/bin/qplot
 
+
    The following parameters are available.  Ones with "[]" are in effect:
  The following parameters are available.  Ones with "[]" are in effect:
+
   
+
   
              References : --reference [/data/local/ref/karma.ref/human.g1k.v37.umfa],
+
   
                          --dbsnp [/home/bingshan/data/db/dbSNP/dbSNP130.UCSC.coordinates.tbl],
+
                    References : --reference [/net/fantasia/home/zhanxw/software/qplot/data/human.g1k.v37.fa],
                          --gccontent [/home/bingshan/data/db/gcContent/gcContent.hg37.w250.out]
+
                                --dbsnp [/net/fantasia/home/zhanxw/software/qplot/data/dbSNP130.UCSC.coordinates.tbl]
  Create gcContent file : --create_gc [], --winsize [100]
+
      GC content file options : --winsize [100]
            Region list : --regions [], --invertRegion
+
                  Region list : --regions [], --invertRegion
            Flag filters : --read1_skip, --read2_skip, --paired_skip,
+
                  Flag filters : --read1_skip, --read2_skip, --paired_skip,
                          --unpaired_skip
+
                                --unpaired_skip
          Dup and QCFail : --dup_keep, --qcfail_keep
+
                Dup and QCFail : --dup_keep, --qcfail_keep
        Mapping filters : --minMapQuality [0.00]
+
              Mapping filters : --minMapQuality [0.00]
      Records to process : --first_n_record [-1]
+
            Records to process : --first_n_record [-1]
        Lanes to process : --lanes []
+
              Lanes to process : --lanes []
             Output files : --plot [], --stats [], --Rcode []
+
        Read group to process : --readGroup []
            Plot labels : --label [], --bamLabel []
+
             Input file options : --noeof
 +
                  Output files : --plot [], --stats [], --Rcode [], --xml []
 +
                  Plot labels : --label [], --bamLabel []
 +
        Obsoleted (DO NOT USE) : --gccontent [], --create_gc
    
== Input files ==
 
== Input files ==
Line 94: Line 105:  
* <code>--reference</code>
 
* <code>--reference</code>
   −
The reference genome is the same as karma reference genome. If the index files do not exist, qplot will create the index files using the input reference fasta file.
+
The reference genome is the same as karma reference genome. If the index files do not exist, qplot will create the index files '''automatically''' using the input reference fasta file.
    
* <code>--dbsnp</code>
 
* <code>--dbsnp</code>
   −
This file has two columns. First column is the chromosome name which must be consistent with the reference created above.
+
This file has two columns. First column is the chromosome name which must be consistent with the reference created above. Second column is 1-based SNP position. If you want to create your own dbSNP data from downloaded UCSC dbSNP file, one way to do it is: <code>cat dbsnp_129_b36.rod|grep "single" | awk '$4-$3==1' |cut -f2,4 > dbSNP_129_b36.tbl</code>
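
The same filtering can be sketched in Python for readers who prefer it over the shell pipeline. This is only an illustration: the function name is hypothetical, and it assumes tab-separated input in which column 2 is the chromosome, column 3 the start and column 4 the end, as the <code>cut -f2,4</code> and <code>awk</code> steps above imply.

```python
def ucsc_to_tbl(lines):
    """Yield (chromosome, position) pairs for single-base SNPs.

    Mirrors the shell pipeline: keep lines containing "single",
    require column 4 - column 3 == 1, then emit columns 2 and 4.
    """
    for line in lines:
        if "single" not in line:
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # malformed row, skip it
        start, end = int(fields[2]), int(fields[3])
        if end - start == 1:
            yield fields[1], end


# Hypothetical rows in the assumed column layout.
rows = [
    "585\tchr1\t100\t101\trs1\t0\t+\tA\tA\tA/G\tgenomic\tsingle",
    "585\tchr1\t200\t202\trs2\t0\t+\tA\tA\t-/AT\tgenomic\tinsertion",
]
for chrom, pos in ucsc_to_tbl(rows):
    print(f"{chrom}\t{pos}")
```

The emitted position is the end coordinate, which is the 1-based SNP position when the start coordinate is 0-based.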
 +
 
 +
* <code> **OBSOLETED** --gccontent, --create_gc </code>
 +
 
 +
Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file.
 +
GC content file name is automatically determined in this format: <reference_genome_base_file_name>.winsize<gc_content_window_size>.gc.
 +
For example, if your reference genome is human.g1k.v37.fa and the window size is 100, then the GC content file name is: human.g1k.v37.winsize100.gc .
 +
 
 +
As noted above, there is no need to use --gccontent to specify the GC content file in each run.
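
The naming rule can be sketched as a small helper. The function name is hypothetical, and the sketch assumes that the "base file name" means the reference file name with its directory and final extension removed, which matches the human.g1k.v37.fa example; qplot itself may place the file next to the reference.

```python
import os


def gc_content_file_name(reference_path, winsize):
    """Build <reference_base_name>.winsize<window_size>.gc.

    Assumption: the base name drops the directory and the last
    extension (e.g. ".fa") of the reference file.
    """
    base, _ext = os.path.splitext(os.path.basename(reference_path))
    return f"{base}.winsize{winsize}.gc"


print(gc_content_file_name("human.g1k.v37.fa", 100))
```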
   −
* <code>--gccontent</code>
+
* <code> input files </code>
   −
Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file. To generate the file, use the following command:
+
QPLOT takes SAM/BAM files.
qplot --rerefence reference.fa --windowsize winsize --create_gc reference.gc
     −
''Note'': Before running qplot, it is critical to check how the chromosome numbers are coded. Some BAM/SAM files use just numbers, others use chr + numbers. '''You need to make sure that the chromosome numbers from the reference and dbSNP are consistent with the BAM/SAM file.'''
+
''Note'': Before running qplot, it is critical to check how the chromosome names are coded. Some BAM/SAM files use just numbers, others use chr + numbers. '''You need to make sure that the chromosome names from the reference and dbSNP are consistent with the BAM/SAM files.'''
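
One way to check naming consistency before a run is to compare the chromosome names in the reference FASTA headers with the first column of the dbSNP table. The sketch below is not part of qplot; all function names are hypothetical, and it works on any iterable of text lines (for example, open file handles).

```python
def fasta_names(lines):
    """Sequence names from FASTA headers (text after '>' up to whitespace)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def dbsnp_names(lines):
    """Chromosome names from the first column of a two-column dbSNP table."""
    names = set()
    for line in lines:
        parts = line.split()
        if parts:
            names.add(parts[0])
    return names


def missing_from_reference(ref_names, other_names):
    """Names used elsewhere that the reference does not define."""
    return sorted(set(other_names) - set(ref_names))


# Hypothetical inputs: the reference uses "1", the dbSNP table uses "chr1".
ref = fasta_names([">1 dna:chromosome", "ACGT", ">2"])
snp = dbsnp_names(["chr1\t101", "chr2\t55"])
print(missing_from_reference(ref, snp))
```

A non-empty result indicates a mismatch such as "1" versus "chr1". The same comparison can be applied to the sequence names listed in the SAM/BAM header.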
    
== Parameters ==
 
== Parameters ==
Line 121: Line 139:  
or  
 
or  
 
  --qcfail_keep
 
  --qcfail_keep
 +
    
*Records to process  
 
*Records to process  
 +
 
The <code>--first_n_record</code> option followed by a number, '''n''', will enable qplot to read the first '''n''' reads to test the bam files and verify it works.
 
The <code>--first_n_record</code> option followed by a number, '''n''', will enable qplot to read the first '''n''' reads to test the bam files and verify it works.
   Line 131: Line 151:  
'''NOTE''' In order for this to work, the lane info has to be encoded in the read name such that the lane number is the second field with the delimiter ":".
 
'''NOTE''' In order for this to work, the lane info has to be encoded in the read name such that the lane number is the second field with the delimiter ":".
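
The convention described in the note can be illustrated with a short sketch. The function and the read name are hypothetical; only the rule that the lane is the second ":"-delimited field comes from the note above.

```python
def lane_from_read_name(read_name):
    """Return the second ':'-delimited field of a read name, or None."""
    fields = read_name.split(":")
    if len(fields) < 2:
        return None  # no lane information encoded
    return fields[1]


# Hypothetical read name with the lane number in the second field.
print(lane_from_read_name("HWUSI-EAS100R:6:73:941:1973"))
```

Read names that do not follow this layout return None here, which is the case where lane selection cannot work.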
   −
*Mapping filters
+
 
 +
* Read group to process :
 +
 
 +
The read group option can restrict qplot to process a subset of reads. For example, if the BAM contains the following @RG tags:
 +
 
 +
@RG ID:UM0348_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0348_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0348_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0348_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
 
 +
By default (without specifying --readGroup), QPLOT will process all reads.
 +
 
 +
If you specify "--readGroup UM0348", then only read groups UM0348_1, UM0348_2, UM0348_3, UM0348_4 will be processed.
 +
 
 +
If you specify "--readGroup UM0348_1", then only one read group, UM0348_1, will be processed.
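
The two examples above are consistent with matching the given value as a prefix of the @RG ID. The sketch below illustrates that reading only; prefix matching is an assumption, the function name is hypothetical, and the actual rule used by qplot may differ.

```python
def select_read_groups(rg_ids, wanted=None):
    """Return the read group IDs that would be processed.

    Assumption: the --readGroup value is matched as a prefix of the
    @RG ID. With no value given, every read group is kept.
    """
    if not wanted:
        return list(rg_ids)
    return [rg for rg in rg_ids if rg.startswith(wanted)]


# @RG IDs taken from the header example above.
ids = [
    "UM0348_1:1", "UM0348_2:1", "UM0348_3:1", "UM0348_4:1",
    "UM0360_1:1", "UM0360_2:1", "UM0360_3:1", "UM0360_4:1",
]
print(select_read_groups(ids, "UM0348"))
print(select_read_groups(ids, "UM0348_1"))
```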
 +
 
 +
 
 +
* Input file options :
 +
 
 +
BAM files are compressed using BGZF and should contain the EOF indicator by default. QPLOT will, by default, stop working if it does not find a valid EOF indicator inside the BAM files.
 +
However, you can force QPLOT to continue processing BAM files without an EOF indicator using --noeof. But you should be aware the input files may be corrupted.
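
To check a file before deciding whether --noeof is needed, one can test whether it ends with the 28-byte empty BGZF block that the SAM/BAM specification defines as the EOF marker. The sketch below is independent of qplot and the function name is hypothetical.

```python
import os

# Empty BGZF block used as the EOF marker (28 bytes, per the SAM/BAM spec).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF EOF marker block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

A file that fails this check is often truncated, so it is worth confirming that the transfer or upstream job completed before falling back to --noeof.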
 +
 
 +
 
 +
* Mapping filters
    
Qplot will exclude reads with lower mapping qualities than the user specified parameter, <code>--minMapQuality</code>. By default, mapped reads with all mapping quality will be included in the analysis.
 
Qplot will exclude reads with lower mapping qualities than the user specified parameter, <code>--minMapQuality</code>. By default, mapped reads with all mapping quality will be included in the analysis.
 +
    
*Region list
 
*Region list
Line 152: Line 200:  
* Plot labels
 
* Plot labels
   −
Two kinds of labels are enabled. <code>--label</code> is the label for the plot (default is empty) which is prepended to the title of each subplot. <code>--bamLabels</code> followed by a column separated list of labels provides the labels for each input SAM/BAM file, e.g. sample ID (default is numbers 1, 2, ... until the number of input bam files). For example:
+
Two kinds of labels are enabled. <code>--label</code> is the label for the plot (default is empty) which is appended to the title of each subplot. <code>--bamLabels</code> followed by a comma-separated list of labels provides the labels for each input SAM/BAM file, e.g. sample ID (default is numbers 1, 2, ... up to the number of input BAM files). For example:
 
  --label Run100 --bamLabels s1,s2,s3,s4,s5,s6,s7,s8
 
  --label Run100 --bamLabels s1,s2,s3,s4,s5,s6,s7,s8
   Line 176: Line 224:  
== Built-in example ==
 
== Built-in example ==
   −
In the pre-compiled binary download, you will find a subdirectory named examples. We provide a sample file from the 1000 Genome project, it contains aligned reads on chromosome 20 from position 8 Mbp to 9Mbp. You can invoke qplot using the following commandline:
+
In the pre-compiled binary download, you will find a subdirectory named examples. We provide a sample file from the 1000 Genomes Project; it contains aligned reads on chromosome 20 from position 8 Mbp to 9 Mbp. You can invoke qplot using the following command line:
    
  ../bin/qplot --reference ../data/human.g1k.v37.umfa --dbsnp ../data/dbSNP130.UCSC.coordinates.tbl --gccontent ../data/human.g1k.w100.gc --plot qplot.pdf --stats qplot.stats --Rcode qplot.R --label "chr20:9M-10M" chrom20.9M.10M.bam
 
  ../bin/qplot --reference ../data/human.g1k.v37.umfa --dbsnp ../data/dbSNP130.UCSC.coordinates.tbl --gccontent ../data/human.g1k.w100.gc --plot qplot.pdf --stats qplot.stats --Rcode qplot.R --label "chr20:9M-10M" chrom20.9M.10M.bam
Line 227: Line 275:  
<span id="anchorOfInteractiveQplot"></span>
 
<span id="anchorOfInteractiveQplot"></span>
 
Qplot can be interactive. In the following example, you can use mouse scroll to zoom in and zoom out on each graph and pan to a certain part of the graph.
 
Qplot can be interactive. In the following example, you can use mouse scroll to zoom in and zoom out on each graph and pan to a certain part of the graph.
By presenting qplot data on a web page, users can easily identify problematic sequencing samples. Users of qplot can customize its outputs into webpage format greatly easing the data exploring process.
+
By presenting qplot data on a web page, users can easily identify problematic sequencing samples. Users of qplot can customize its outputs into web page format greatly easing the data exploring process.
    
[http://www-personal.umich.edu/~zhanxw/qplot.Pool.9847.html  QPlot of 24 samples(HTML) ]
 
[http://www-personal.umich.edu/~zhanxw/qplot.Pool.9847.html  QPlot of 24 samples(HTML) ]
Line 253: Line 301:  
= Contact =
 
= Contact =
   −
Questions and requests should be sent to Bingshan Li ([mailto:bingshan@umich.edu bingshan@umich.edu]) or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu])
+
Questions and requests should be sent to Bingshan Li ([mailto:bingshan@umich.edu bingshan@umich.edu]), Xiaowei Zhan ([mailto:zhanxw@umich.edu zhanxw@umich.edu]), or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu]).