Changes

From Genome Analysis Wiki
2,240 bytes added ,  11:42, 2 February 2017
Line 1: Line 1:  
= Introduction =
 
= Introduction =
 +
 
The qplot program calculates various summary statistics, some of which are plotted in a PDF file. These statistics can be used to assess the sequencing quality of sequence reads mapped to the reference genome. The main statistics are empirical Phred scores, which are calculated from the background mismatch rate. The background mismatch rate is the rate at which sequenced bases differ from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC biases, insert size distribution, depth distribution, genome coverage, empirical Q20 count, and so on.  
 
The qplot program calculates various summary statistics, some of which are plotted in a PDF file. These statistics can be used to assess the sequencing quality of sequence reads mapped to the reference genome. The main statistics are empirical Phred scores, which are calculated from the background mismatch rate. The background mismatch rate is the rate at which sequenced bases differ from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC biases, insert size distribution, depth distribution, genome coverage, empirical Q20 count, and so on.  
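
The Phred scale relates an error rate p to a quality score Q = -10 * log10(p). The following is a minimal sketch of how an empirical score could be derived from a background mismatch rate, assuming the mismatch and base counts already exclude dbSNP positions. The function name and the handling of zero mismatches are illustrative assumptions, not qplot's actual implementation.

```python
import math


def empirical_phred(mismatches: int, bases: int) -> float:
    """Phred-scaled quality for an observed mismatch rate: Q = -10 * log10(rate)."""
    if bases <= 0:
        raise ValueError("bases must be positive")
    if mismatches <= 0:
        # No observed mismatches: the score is unbounded (assumed convention).
        return float("inf")
    return -10.0 * math.log10(mismatches / bases)


# One mismatch per 1,000 non-dbSNP bases corresponds to roughly Q30.
print(round(empirical_phred(1, 1000), 2))
```

A score well below the quality reported by the sequencer for the same bases would suggest that the reported qualities are too optimistic.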
    
In the following sections, we will guide you through: [[#Where to Find It |how to obtain qplot]], [[#Usage |how to use qplot]], [[#Built-in example |example outputs]], [[#anchorOfInteractiveQplot |interactive diagnostic plots]], and [[#Diagnose sequencing quality |real applications]] in which qplot has helped identify sequencing problems.
 
In the following sections, we will guide you through: [[#Where to Find It |how to obtain qplot]], [[#Usage |how to use qplot]], [[#Built-in example |example outputs]], [[#anchorOfInteractiveQplot |interactive diagnostic plots]], and [[#Diagnose sequencing quality |real applications]] in which qplot has helped identify sequencing problems.
 +
 +
= Citing QPLOT =
 +
 +
If you find QPLOT useful and want to cite it in your paper, please use the reference below.
 +
 +
* Bingshan Li, Xiaowei Zhan, Mary-Kate Wing, Paul Anderson, Hyun Min Kang, and Goncalo R. Abecasis, “QPLOT: A Quality Assessment Tool for Next Generation Sequencing Data,” BioMed Research International, vol. 2013, Article ID 865181, 4 pages, 2013. doi:10.1155/2013/865181  http://www.hindawi.com/journals/bmri/2013/865181/
    
= Where to Find It =
 
= Where to Find It =
Line 14: Line 21:  
== Binary Download ==
 
== Binary Download ==
   −
We have prepared a pre-compiled qplot binary (built under Ubuntu) along with the source code. You can download it from: [http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot.20120213.tar.gz qplot.20120213.tar.gz (File Size: 1.7G)]  
+
We have prepared a pre-compiled qplot binary (built under Ubuntu) along with the source code. You can download it from: [http://csg.sph.umich.edu//zhanxw/software/qplot/qplot.20130627.tar.gz qplot.20130627.tar.gz (File Size: 1.7G)]  
    
The executable file is under qplot/bin/qplot.  
 
The executable file is under qplot/bin/qplot.  
Line 24: Line 31:  
== Source Code Distribution ==
 
== Source Code Distribution ==
   −
We provide a source-code-only download in [http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot-source.20120213.tar.gz qplot-source.20120213.tar.gz]. Optionally, you can also download the example and/or data files:
+
We provide a source-code-only download in [http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-source.20130627.tar.gz qplot-source.20130627.tar.gz]. Optionally, you can also download the example and/or data files:
   −
[http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot-example.tar.gz  example]: example input files, and the expected outputs if you follow the [[#Built-in example | directions]].  
+
[http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-example.tar.gz  example]: example input files, and the expected outputs if you follow the [[#Built-in example | directions]].  
   −
[http://www.sph.umich.edu/csg/zhanxw/software/qplot/qplot-data.tar.gz resources data]: necessary input files for qplot, including NCBI human genome build v37, dbSNP 130, and a pre-computed GC file with window size 100.
+
[http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-data.tar.gz resources data]: necessary input files for qplot, including NCBI human genome build v37, dbSNP 130, and a pre-computed GC file with window size 100.
    
Put the above file(s) in the same folder and follow these steps:
 
Put the above file(s) in the same folder and follow these steps:
    
* 1. Unarchive downloaded file
 
* 1. Unarchive downloaded file
  tar zvxf qplot-source.20120213.tar.gz
+
  tar zvxf qplot-source.20130627.tar.gz
    
A new folder ''qplot'' will be created.
 
A new folder ''qplot'' will be created.
Line 39: Line 46:  
* 2. Build libStatGen
 
* 2. Build libStatGen
 
  cd qplot
 
  cd qplot
  make libStatGen
+
  (cd ../libStatGen; make cloneLib)
    
This step will download a necessary software library [http://genome.sph.umich.edu/wiki/C%2B%2B_Library:_libStatGen libStatGen] and compile source code into a binary code library.
 
This step will download a necessary software library [http://genome.sph.umich.edu/wiki/C%2B%2B_Library:_libStatGen libStatGen] and compile source code into a binary code library.
    
* 3. Build qplot
 
* 3. Build qplot
  make all
+
  make  
    
This step will then build qplot. Upon success, the executable qplot can be found under qplot/bin/.
 
This step will then build qplot. Upon success, the executable qplot can be found under qplot/bin/.
Line 64: Line 71:     
== Command line ==
 
== Command line ==
 +
 
After you obtain the qplot executable (either by compiling the source code or by downloading the pre-compiled binary file), you will find the executable file under qplot/bin/qplot.  
 
After you obtain the qplot executable (either by compiling the source code or by downloading the pre-compiled binary file), you will find the executable file under qplot/bin/qplot.  
   Line 69: Line 77:     
   some_linux_host > qplot/bin/qplot
 
   some_linux_host > qplot/bin/qplot
   
   −
               References : --reference [/net/fantasia/home/zhanxw/software/qplot/data/human.g1k.v37.fa],
                            --dbsnp [/net/fantasia/home/zhanxw/software/qplot/data/dbSNP130.UCSC.coordinates.tbl],
                            --gccontent [/net/fantasia/home/zhanxw/software/qplot/data/human.g1k.w100.gc]
    Create gcContent file : --create_gc [], --winsize [100]
              Region list : --regions [], --invertRegion
             Flag filters : --read1_skip, --read2_skip, --paired_skip,
                            --unpaired_skip
           Dup and QCFail : --dup_keep, --qcfail_keep
          Mapping filters : --minMapQuality [0.00]
       Records to process : --first_n_record [-1]
         Lanes to process : --lanes []
    Read group to process : --readGroup []
       Input file options : --noeof
             Output files : --plot [], --stats [], --Rcode []
              Plot labels : --label [], --bamLabel []
+
    The following parameters are available. Ones with "[]" are in effect:
   
                     References : --reference [/net/fantasia/home/zhanxw/software/qplot/data/human.g1k.v37.fa],
                                  --dbsnp [/net/fantasia/home/zhanxw/software/qplot/data/dbSNP130.UCSC.coordinates.tbl]
        GC content file options : --winsize [100]
                    Region list : --regions [], --invertRegion
                   Flag filters : --read1_skip, --read2_skip, --paired_skip,
                                  --unpaired_skip
                 Dup and QCFail : --dup_keep, --qcfail_keep
                Mapping filters : --minMapQuality [0.00]
             Records to process : --first_n_record [-1]
               Lanes to process : --lanes []
          Read group to process : --readGroup []
             Input file options : --noeof
                   Output files : --plot [], --stats [], --Rcode [], --xml []
                    Plot labels : --label [], --bamLabel []
         Obsoleted (DO NOT USE) : --gccontent [], --create_gc
    
== Input files ==
 
== Input files ==
Line 100: Line 111:  
This file has two columns. First column is the chromosome name which must be consistent with the reference created above. Second column is 1-based SNP position. If you want to create your own dbSNP data from downloaded UCSC dbSNP file, one way to do it is: <code>cat dbsnp_129_b36.rod|grep "single" | awk '$4-$3==1' |cut -f2,4 > dbSNP_129_b36.tbl</code>  
 
This file has two columns. First column is the chromosome name which must be consistent with the reference created above. Second column is 1-based SNP position. If you want to create your own dbSNP data from downloaded UCSC dbSNP file, one way to do it is: <code>cat dbsnp_129_b36.rod|grep "single" | awk '$4-$3==1' |cut -f2,4 > dbSNP_129_b36.tbl</code>  
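
For readers who prefer a script over the shell pipeline above, the same filtering can be sketched in Python. This is an illustrative equivalent, assuming the UCSC file is tab-delimited with the chromosome in column 2 and the start and end coordinates in columns 3 and 4; the function name and the sample rows are hypothetical.

```python
def ucsc_rod_to_tbl(lines):
    """Yield (chromosome, 1-based position) pairs for single-base SNP rows."""
    for line in lines:
        if "single" not in line:              # grep "single"
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        try:
            start, end = int(fields[2]), int(fields[3])
        except ValueError:
            continue                          # skip rows with non-numeric coordinates
        if end - start == 1:                  # awk '$4-$3==1'
            yield fields[1], end              # cut -f2,4


# Hypothetical rows in the UCSC snp table layout (bin, chrom, start, end, name, ..., class).
sample = [
    "585\tchr1\t10018\t10019\trs0000001\t0\t+\tC\tC\tC/T\tgenomic\tsingle\n",
    "585\tchr1\t10050\t10053\trs0000002\t0\t+\tAAA\tAAA\t-/AAA\tgenomic\tdeletion\n",
]
for chrom, pos in ucsc_rod_to_tbl(sample):
    print(f"{chrom}\t{pos}")
```

Only the first sample row is kept, because the second is neither labelled "single" nor one base wide.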
   −
* <code>--gccontent</code>
+
* <code> **OBSOLETED** --gccontent, --create_gc </code>
 +
 
 +
Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file.
 +
GC content file name is automatically determined in this format: <reference_genome_base_file_name>.winsize<gc_content_window_size>.gc.
 +
For example, if your reference genome is human.g1k.v37.fa and the window size is 100, then the GC content file name is: human.g1k.v37.winsize100.gc .
 +
 
 +
Because the file name is determined automatically, there is no need to use --gccontent to specify the GC content file in each run.
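
The naming rule for the GC content file can be expressed as a short sketch. The function name is hypothetical, and the sketch assumes that the "base file name" means the reference file name without its final extension and that any directory part passed in is kept unchanged; the source does not say where qplot writes the file.

```python
import os


def gc_content_filename(reference: str, winsize: int = 100) -> str:
    """Build <reference_base>.winsize<N>.gc from a reference FASTA file name."""
    base, _extension = os.path.splitext(reference)  # drops only the last extension
    return f"{base}.winsize{winsize}.gc"


print(gc_content_filename("human.g1k.v37.fa", 100))  # human.g1k.v37.winsize100.gc
```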
 +
 
 +
* <code> input files </code>
   −
Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file. To generate the file, use the following command:
 qplot --reference reference.fa --winsize winsize --create_gc reference.gc
+
QPLOT takes SAM/BAM files.
      
''Note'': Before running qplot, it is critical to check how the chromosome names are coded. Some BAM/SAM files use just numbers, others use chr + numbers. '''You need to make sure that the chromosome names from the reference and dbSNP are consistent with the BAM/SAM files.'''
 
''Note'': Before running qplot, it is critical to check how the chromosome names are coded. Some BAM/SAM files use just numbers, others use chr + numbers. '''You need to make sure that the chromosome names from the reference and dbSNP are consistent with the BAM/SAM files.'''
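
A quick consistency check along these lines can catch naming mismatches before a long run. This is a minimal sketch with a hypothetical function name; it assumes you have already collected the chromosome names from the BAM/SAM header and from the reference (or dbSNP) file by other means.

```python
def missing_chromosomes(bam_names, reference_names):
    """Return the BAM/SAM chromosome names that the reference does not define, in order."""
    known = set(reference_names)
    return [name for name in bam_names if name not in known]


# A BAM that uses "chr"-prefixed names against a reference that uses plain numbers.
print(missing_chromosomes(["chr1", "chr2"], ["1", "2"]))  # ['chr1', 'chr2']
```

An empty result means every chromosome in the BAM/SAM file has a counterpart in the other file.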
Line 121: Line 139:  
or  
 
or  
 
  --qcfail_keep
 
  --qcfail_keep
 +
    
*Records to process  
 
*Records to process  
 +
 
The <code>--first_n_record</code> option followed by a number, '''n''', makes qplot read only the first '''n''' reads, which is useful for quickly testing that a BAM file works.
 
The <code>--first_n_record</code> option followed by a number, '''n''', makes qplot read only the first '''n''' reads, which is useful for quickly testing that a BAM file works.
   Line 131: Line 151:  
'''NOTE''' In order for this to work, the lane info has to be encoded in the read name such that the lane number is the second field with the delimiter ":".
 
'''NOTE''' In order for this to work, the lane info has to be encoded in the read name such that the lane number is the second field with the delimiter ":".
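
The read-name convention described in the note can be illustrated as follows. This is a sketch only: the function name and the sample read name are hypothetical, and it simply takes the second ":"-delimited field as the lane.

```python
def lane_from_read_name(read_name: str) -> str:
    """Return the second ':'-delimited field of a read name, taken to be the lane."""
    fields = read_name.split(":")
    if len(fields) < 2:
        raise ValueError(f"read name has no lane field: {read_name!r}")
    return fields[1]


# Hypothetical read name in which the second field is the lane.
print(lane_from_read_name("HWUSI-EAS100R:6:73:941:1973"))  # 6
```

Read names that do not follow this layout would yield an unrelated field or raise an error, so lane selection is not meaningful for them.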
   −
*Mapping filters
+
 
 +
* Read group to process :
 +
 
 +
The read group option can restrict qplot to process a subset of reads. For example, if the BAM contains the following @RG tags:
 +
 
 +
@RG ID:UM0348_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0348_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0348_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0348_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
@RG ID:UM0360_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM
 +
 
 +
QPLOT will by default (without specifying --readGroup) process all reads.
 +
 
 +
If you specify "--readGroup UM0348", then only read groups UM0348_1, UM0348_2, UM0348_3, UM0348_4 will be processed.
 +
 
 +
If you specify "--readGroup UM0348_1", then only one read group, UM0348_1, will be processed.
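
The two examples above are consistent with matching the requested value as a prefix of the @RG ID. The sketch below assumes prefix matching, which the source does not confirm, and uses a hypothetical function name; the IDs are the ones listed above.

```python
def select_read_groups(read_group_ids, requested=""):
    """Return the read group IDs that start with the requested value (assumed rule)."""
    if not requested:
        return list(read_group_ids)  # default: process all read groups
    return [rg for rg in read_group_ids if rg.startswith(requested)]


ids = [
    "UM0348_1:1", "UM0348_2:1", "UM0348_3:1", "UM0348_4:1",
    "UM0360_1:1", "UM0360_2:1", "UM0360_3:1", "UM0360_4:1",
]
print(select_read_groups(ids, "UM0348"))    # the four UM0348 read groups
print(select_read_groups(ids, "UM0348_1"))  # ['UM0348_1:1']
```

Under this assumed rule, a value such as UM0348_1 would also match an ID like UM0348_10 if one existed, so a more specific value is safer when IDs share prefixes.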
 +
 
 +
 
 +
* Input file options :
 +
 
 +
BAM files are compressed using BGZF and should contain the EOF indicator by default. QPLOT will, by default, stop working if it does not find a valid EOF indicator inside the BAM files.
 +
However, you can force QPLOT to continue processing BAM files without an EOF indicator using --noeof. Be aware that a missing EOF indicator may mean the input file is truncated or corrupted.
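
To check a file yourself before deciding whether --noeof is appropriate, you can look for the EOF indicator directly. The sketch below relies on the SAM/BAM specification, in which the indicator is a fixed 28-byte empty BGZF block at the end of the file; the function name is hypothetical, and this is not necessarily how QPLOT performs its own check.

```python
import os
import tempfile

# The 28-byte empty BGZF block that marks end-of-file in the SAM/BAM specification.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "00000000" "00000000"
)


def has_bgzf_eof(path: str) -> bool:
    """Return True if the file ends with the BGZF EOF marker block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF


# Demonstration on a stand-in file (not a real BAM): arbitrary bytes plus the marker.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.bam")
    with open(demo, "wb") as handle:
        handle.write(b"payload" + BGZF_EOF)
    print(has_bgzf_eof(demo))  # True
```

A file that fails this check was most likely cut short during transfer or writing, and re-copying it is usually preferable to forcing QPLOT past the missing indicator.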
 +
 
 +
 
 +
* Mapping filters
    
Qplot excludes reads whose mapping quality is lower than the value given to <code>--minMapQuality</code>. By default, mapped reads of any mapping quality are included in the analysis.
 
Qplot excludes reads whose mapping quality is lower than the value given to <code>--minMapQuality</code>. By default, mapped reads of any mapping quality are included in the analysis.
 +
    
*Region list
 
*Region list
Line 253: Line 301:  
= Contact =
 
= Contact =
   −
Questions and requests should be sent to Bingshan Li ([mailto:bingshan@umich.edu bingshan@umich.edu]) or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu])
+
Questions and requests should be sent to Bingshan Li ([mailto:bingshan@umich.edu bingshan@umich.edu]) or Xiaowei Zhan ([mailto:zhanxw@umich.edu zhanxw@umich.edu]) or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu])