Changes

2,240 bytes added , 11:42, 2 February 2017

Line 1: Line 1:

= Introduction =

+

The qplot program calculates various summary statistics, some of which are plotted in a PDF file. These statistics can be used to assess the quality of sequence reads mapped to the reference genome. The main statistics are empirical Phred scores, which are calculated from the background mismatch rate. The background mismatch rate is the rate at which sequenced bases differ from the reference genome, excluding dbSNP positions. Other statistics include GC bias, insert size distribution, depth distribution, genome coverage, empirical Q20 count, and so on.
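The relationship between a mismatch rate and a Phred-scaled score follows the standard Phred definition, Q = -10 * log10(p). The sketch below applies that definition to a background mismatch rate in POSIX shell. The function name <code>empirical_phred</code> is made up for illustration, and this is not qplot's own code.

```shell
# Convert a background mismatch rate (0 < rate <= 1) into a Phred-scaled score.
# Standard Phred definition: Q = -10 * log10(rate).
# NOTE: 'empirical_phred' is a hypothetical helper, not part of qplot.
empirical_phred() {
    awk -v rate="$1" 'BEGIN {
        # Reject rates outside (0, 1]; log() is undefined at 0.
        if (rate <= 0 || rate > 1) exit 1
        printf "%.1f\n", -10 * log(rate) / log(10)
    }'
}
```

For example, a mismatch rate of 0.01 (one mismatch per 100 bases) corresponds to Q20, and 0.001 corresponds to Q30.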

In the following sections, we will guide you through: [[#Where to Find It |how to obtain qplot]], [[#Usage |how to use qplot]], [[#Built-in example |example outputs]], [[#anchorOfInteractiveQplot |interactive diagnostic plots]], and [[#Diagnose sequencing quality |real applications]] in which qplot has helped identify sequencing problems.

+

= Citing QPLOT =

+

If you find QPLOT useful and want to cite it in your paper, please use the reference below.

+

* Bingshan Li, Xiaowei Zhan, Mary-Kate Wing, Paul Anderson, Hyun Min Kang, and Goncalo R. Abecasis, “QPLOT: A Quality Assessment Tool for Next Generation Sequencing Data,” BioMed Research International, vol. 2013, Article ID 865181, 4 pages, 2013. doi:10.1155/2013/865181 http://www.hindawi.com/journals/bmri/2013/865181/

= Where to Find It =

Line 14: Line 21:

== Binary Download ==

−

We have prepared a pre-compiled (under Ubuntu) qplot along with source code . You can download it from: [http://~~www~~.sph.umich.edu/~~csg~~/zhanxw/software/qplot/qplot.~~20120213~~.tar.gz qplot.~~20120213~~.tar.gz (File Size: 1.7G)]

+

We have prepared a pre-compiled (under Ubuntu) qplot along with its source code. You can download it from: [http://csg.sph.umich.edu//zhanxw/software/qplot/qplot.20130627.tar.gz qplot.20130627.tar.gz (File Size: 1.7G)]

The executable file is under qplot/bin/qplot.

Line 24: Line 31:

== Source Code Distribution ==

−

We provide a source code only download in [http://~~www~~.sph.umich.edu/~~csg~~/zhanxw/software/qplot/qplot-source.~~20120213~~.tar.gz qplot-source.~~20120213~~.tar.gz]. Optionally, you can download example file and/or data file:

+

We provide a source-code-only download in [http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-source.20130627.tar.gz qplot-source.20130627.tar.gz]. Optionally, you can also download the example file and/or the data file:

−

[http://~~www~~.sph.umich.edu/~~csg~~/zhanxw/software/qplot/qplot-example.tar.gz example]: example input file, and expected outputs if you following the [[#Built-in example | direction]].

+

[http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-example.tar.gz example]: example input file, and the expected outputs if you follow the [[#Built-in example | directions]].

−

[http://~~www~~.sph.umich.edu/~~csg~~/zhanxw/software/qplot/qplot-data.tar.gz resources data]: necessary input files for qplot, including NCBI human genome build v37, dbSNP 130, and pre-computed GC file with windows size 100.

+

[http://csg.sph.umich.edu//zhanxw/software/qplot/qplot-data.tar.gz resources data]: necessary input files for qplot, including NCBI human genome build v37, dbSNP 130, and a pre-computed GC file with window size 100.

You can put the above file(s) in the same folder and follow these steps:

* 1. Unarchive the downloaded file

−

tar zvxf qplot-source.~~20120213~~.tar.gz

+

tar zvxf qplot-source.20130627.tar.gz

A new folder ''qplot'' will be created.

Line 39: Line 46:

* 2. Build libStatGen

cd qplot

−

make ~~libStatGen~~

+

(cd ../libStatGen; make cloneLib)

This step downloads the required software library [http://genome.sph.umich.edu/wiki/C%2B%2B_Library:_libStatGen libStatGen] and compiles its source code into a binary library.

* 3. Build qplot

−

make ~~all~~

+

make

This step will then build qplot. Upon success, the executable qplot can be found under qplot/bin/.
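After the build steps, a quick check that the executable exists can save a confusing error later. The sketch below only tests for an executable file at <code>bin/qplot</code> under a given folder. The helper name <code>check_qplot_built</code> is hypothetical, and the default folder <code>qplot</code> is the one created by the unarchive step.

```shell
# Report whether the build produced an executable at <dir>/bin/qplot.
# The default directory 'qplot' matches the folder created by the tar step.
# NOTE: 'check_qplot_built' is a hypothetical helper name.
check_qplot_built() {
    qplot_dir="${1:-qplot}"
    if [ -x "$qplot_dir/bin/qplot" ]; then
        echo "found $qplot_dir/bin/qplot"
    else
        echo "missing $qplot_dir/bin/qplot" >&2
        return 1
    fi
}
```

Run it from the folder that contains ''qplot'', or pass the folder as the first argument.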

Line 64: Line 71:

== Command line ==

+

After you obtain the qplot executable (either by compiling the source code or by downloading the pre-compiled binary file), you will find the executable file under qplot/bin/qplot.

Line 69: Line 77:

some_linux_host > qplot/bin/qplot

−

+

The following parameters are available. Ones with "[]" are in effect:

−

References : --reference [/net/fantasia/home/zhanxw/software/qplot/data/human.g1k.v37.fa],

+

−

--dbsnp [/net/fantasia/home/zhanxw/software/qplot/data/dbSNP130.UCSC.coordinates.tbl],

+

−

~~--gccontent [/net/fantasia/home/zhanxw/software/qplot/data/human.g1k.w100.gc]~~

+

−

~~Create gcContent~~ file : ~~--create_gc [],~~ --winsize [100]

+

References : --reference [/net/fantasia/home/zhanxw/software/qplot/data/human.g1k.v37.fa],

−

Region list : --regions [], --invertRegion

+

--dbsnp [/net/fantasia/home/zhanxw/software/qplot/data/dbSNP130.UCSC.coordinates.tbl]

−

Flag filters : --read1_skip, --read2_skip, --paired_skip,

+

GC content file options : --winsize [100]

−

--unpaired_skip

+

Region list : --regions [], --invertRegion

−

Dup and QCFail : --dup_keep, --qcfail_keep

+

Flag filters : --read1_skip, --read2_skip, --paired_skip,

−

Mapping filters : --minMapQuality [0.00]

+

--unpaired_skip

−

Records to process : --first_n_record [-1]

+

Dup and QCFail : --dup_keep, --qcfail_keep

−

Lanes to process : --lanes []

+

Mapping filters : --minMapQuality [0.00]

−

Read group to process : --readGroup []

+

Records to process : --first_n_record [-1]

−

Input file options : --noeof

+

Lanes to process : --lanes []

−

Output files : --plot [], --stats [], --Rcode []

+

Read group to process : --readGroup []

−

Plot labels : --label [], --bamLabel []

+

Input file options : --noeof

+

Output files : --plot [], --stats [], --Rcode [], --xml []

+

Plot labels : --label [], --bamLabel []

+

Obsoleted (DO NOT USE) : --gccontent [], --create_gc

== Input files ==

Line 100: Line 111:

This file has two columns. First column is the chromosome name which must be consistent with the reference created above. Second column is 1-based SNP position. If you want to create your own dbSNP data from downloaded UCSC dbSNP file, one way to do it is: <code>cat dbsnp_129_b36.rod|grep "single" | awk '$4-$3==1' |cut -f2,4 > dbSNP_129_b36.tbl</code>
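Because the dbSNP table format is simple, it can be validated before running qplot. The sketch below checks that each line has exactly two whitespace-separated columns and that the second is a positive integer. The helper name <code>check_dbsnp_tbl</code> is hypothetical, and the check does not verify that chromosome names match the reference.

```shell
# Check that every line of a dbSNP table has exactly two whitespace-separated
# columns and that the second column is a positive integer (1-based position).
# Prints the first offending line and returns non-zero on failure.
# NOTE: 'check_dbsnp_tbl' is a hypothetical helper, not part of qplot.
check_dbsnp_tbl() {
    awk '
        NF != 2 || $2 !~ /^[0-9]+$/ || $2 < 1 {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
            exit            # jumps to the END block
        }
        END { exit bad + 0 }
    ' "$1"
}
```

A position of 0 is rejected because the positions are 1-based.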

−

* <code>--gccontent</code>

+

* <code> **OBSOLETED** --gccontent, --create_gc </code>

+

Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file.

+

GC content file name is automatically determined in this format: <reference_genome_base_file_name>.winsize<gc_content_window_size>.gc.

+

For example, if your reference genome is human.g1k.v37.fa and the window size is 100, then the GC content file name is: human.g1k.v37.winsize100.gc .

+

As a result, there is no need to use --gccontent to specify the GC content file in each run.
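The naming rule can be sketched as a small shell function. It assumes that the "base file name" is the reference file name with its final extension (such as <code>.fa</code>) removed, which matches the example given. The helper name <code>gc_file_name</code> is hypothetical.

```shell
# Derive the GC content file name from a reference file name and window size,
# following the pattern <reference_base_name>.winsize<window_size>.gc.
# ASSUMPTION: the "base file name" is the reference name with its final
# extension (e.g. ".fa") removed; 'gc_file_name' is a hypothetical helper.
gc_file_name() {
    reference="$1"
    winsize="${2:-100}"     # 100 is the default shown for --winsize
    echo "${reference%.*}.winsize${winsize}.gc"
}
```

Calling <code>gc_file_name human.g1k.v37.fa 100</code> prints <code>human.g1k.v37.winsize100.gc</code>.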

+

* <code> input files </code>

−

~~Although GC content can be calculated on the fly each time, it is much more efficient to load a precomputed GC content from a file~~. ~~To generate the file, use the following command:~~

+

QPLOT takes SAM/BAM files as input.

−

~~qplot --rerefence reference.fa --windowsize winsize --create_gc reference.gc~~

''Note'': Before running qplot, it is critical to check how the chromosome names are coded. Some BAM/SAM files use just numbers, others use chr + numbers. '''You need to make sure that the chromosome names from the reference and dbSNP are consistent with the BAM/SAM files.'''
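One way to compare naming styles is to classify the chromosome names from each input and check that the results agree. The sketch below reads names from standard input, one per line, and reports whether they use the "chr" prefix. The helper name <code>chrom_style</code> is hypothetical.

```shell
# Classify chromosome names read from standard input (one name per line):
#   "chr"   - every name starts with "chr" (e.g. chr1, chrX)
#   "plain" - no name starts with "chr" (e.g. 1, X); also printed for empty input
#   "mixed" - both styles are present
# NOTE: 'chrom_style' is a hypothetical helper, not part of qplot.
chrom_style() {
    awk '
        NF == 0  { next }                 # ignore blank lines
        /^chr/   { with_prefix++; next }
                 { without_prefix++ }
        END {
            if (with_prefix && without_prefix) print "mixed"
            else if (with_prefix)              print "chr"
            else                               print "plain"
        }
    '
}
```

For the dbSNP table, the names can be supplied with <code>cut -f1 dbSNP.tbl | chrom_style</code>; for the reference, the names are on the lines beginning with ">". For a BAM file, the names are in the @SQ header lines, which <code>samtools view -H</code> prints if samtools is installed (an assumption, since samtools is not part of qplot).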

Line 121: Line 139:

or

--qcfail_keep

+

*Records to process

+

The <code>--first_n_record</code> option, followed by a number '''n''', makes qplot read the first '''n''' reads. This is useful for testing the BAM files and verifying that qplot works on them.

Line 131: Line 151:

'''NOTE''' In order for this to work, the lane info has to be encoded in the read name such that the lane number is the second field with the delimiter ":".

−

*Mapping filters

+

* Read group to process :

+

The read group option can restrict qplot to process a subset of reads. For example, if the BAM contains the following @RG tags:

+

@RG ID:UM0348_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM

+

@RG ID:UM0348_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM

+

@RG ID:UM0348_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM

+

@RG ID:UM0348_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM

+

@RG ID:UM0360_1:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM

+

@RG ID:UM0360_2:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM

+

@RG ID:UM0360_3:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM

+

@RG ID:UM0360_4:1 PL:ILLUMINA LB:M5390 SM:M5390 CN:UM

+

By default (without specifying --readGroup), QPLOT will process all reads.

+

If you specify "--readGroup UM0348", then only read groups UM0348_1, UM0348_2, UM0348_3, UM0348_4 will be processed.

+

If you specify "--readGroup UM0348_1", then only one read group, UM0348_1, will be processed.
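To preview which read groups a given value would select, the two examples above can be modelled as a prefix match on the @RG ID field. This is an assumption made for illustration; qplot's actual matching rule may differ. The helper name <code>select_read_groups</code> is hypothetical.

```shell
# Read SAM header lines on standard input and print the @RG IDs that start
# with the given prefix. An empty prefix selects every read group.
# ASSUMPTION: --readGroup is modelled here as a simple prefix match on the
# ID field, which reproduces the two examples above; qplot's actual matching
# rule may differ. 'select_read_groups' is a hypothetical helper.
select_read_groups() {
    awk -v prefix="$1" '
        $1 == "@RG" {
            for (i = 2; i <= NF; i++) {
                if (index($i, "ID:") == 1) {
                    id = substr($i, 4)      # text after "ID:"
                    if (prefix == "" || index(id, prefix) == 1) print id
                }
            }
        }
    '
}
```

With the header shown above, the prefix UM0348 prints the four UM0348 IDs, while UM0348_1 prints only <code>UM0348_1:1</code>.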

+

* Input file options :

+

BAM files are compressed using BGZF and should contain the EOF indicator by default. QPLOT will, by default, stop working if it does not find a valid EOF indicator inside the BAM files.

+

However, you can force QPLOT to continue processing BAM files without an EOF indicator by using --noeof. Be aware that a missing EOF indicator may mean the input files are truncated or corrupted.
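Before resorting to --noeof, it can help to check whether a file really lacks the marker. The SAM/BAM specification defines the BGZF end-of-file marker as a fixed 28-byte empty block, so a file can be checked by comparing its last 28 bytes. The helper name <code>has_bgzf_eof</code> is hypothetical, and this sketch is independent of qplot's own check.

```shell
# Return success if the file ends with the 28-byte BGZF end-of-file marker
# defined in the SAM/BAM specification.
# NOTE: 'has_bgzf_eof' is a hypothetical helper; it only inspects the last
# 28 bytes and is not qplot's own check.
has_bgzf_eof() {
    # The marker, written as hex in 4-byte groups for readability.
    expected="1f8b0804""00000000""00ff0600""42430200""1b000300""00000000""00000000"
    # Dump the last 28 bytes as hex and strip all whitespace.
    actual=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \t\n')
    [ "$actual" = "$expected" ]
}
```

A file that fails this check was most likely truncated during writing or transfer, so it is worth regenerating or copying it again before forcing qplot to process it.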

+

* Mapping filters

Qplot will exclude reads whose mapping quality is lower than the user-specified parameter, <code>--minMapQuality</code>. By default, mapped reads of any mapping quality are included in the analysis.

+

*Region list

Line 253: Line 301:

= Contact =

−

Questions and requests should be sent to Bingshan Li ([mailto:bingshan@umich.edu bingshan@umich.edu]) or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu])

+

Questions and requests should be sent to Bingshan Li ([mailto:bingshan@umich.edu bingshan@umich.edu]), Xiaowei Zhan ([mailto:zhanxw@umich.edu zhanxw@umich.edu]), or Goncalo Abecasis ([mailto:goncalo@umich.edu goncalo@umich.edu]).

Ppwhite

96

edits

Changes

QPLOT (view source)

Revision as of 11:42, 2 February 2017
