QPLOT

From Genome Analysis Wiki
Revision as of 16:33, 2 November 2010 by Pha (talk | contribs) (initial copy from internal wiki - needs more edits)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Introduction

The qplot program is to calculate various summary statistics some of which will be plotted in a pdf file which can be used to assess the sequencing quality for illumina sequencing after mapping reads to the reference genome. The main statistics are empirical Phred scores which was calculated based on the background mismatch rate. By background mismatch rate, it means the rate that sequenced bases are different from the reference genome, EXCLUDING dbSNP positions. Other statistics include GC biases, insert size distribution, depth distribution, genome coverage, empirical Q20 count and so on. An example plot and summary text will follow at the end

Usage

The code is now deposited into the pipeline. If you have not seen it in your pipeline, you shall update you pipeline. See pipeline and git.

In the pipeline folder, go to qplot and make will create an executable qplot in the qplot sub-directory. Here is the qplot help page

 wonderland > ./qplot
 The following parameters are available.  Ones with "[]" are in effect:
             References : --reference [/data/local/ref/karma.ref/human.g1k.v37.umfa],
                          --dbsnp [/home/bingshan/data/db/dbSNP/dbSNP130.UCSC.coordinates.tbl],
                          --gccontent [/home/bingshan/data/db/gcContent/gcContent.hg37.w250.out]
  Create gcContent file : --create_gc [], --winsize [100]
           Flag filters : --read1_skip, --read2_skip, --paired_skip,
                          --unpaired_skip
         Dup and QCFail : --dup_keep, --qcfail_keep
        Mapping filters : --minMapQuality [0.00]
     Records to process : --first_n_record [-1]
       Lanes to process : --lanes []
           Output files : --plot [], --stats [], --Rcode []
            Plot labels : --label [], --bamLabel []

Input files

Three (3) precomputed files are required. Multiple bam/sam files should be appended after all other parameters.

  • --reference
    • The reference genome is the same as karma reference genome. If the index files do not exist, qplot will create the index files using the input reference fasta file.
    • Or use the latest karma to generate reference index files using the command:
karma create -i reference.fa
  • --dbsnp

This file has two columns. First column is the chromosome name wich have to be consistent with the reference created above

  • --gccontent

Although GC content can be calculated on fly each time, it is much more efficient to load a precomputed GC content from a file. To generate the file, use the following command

/home/bingshan/code/calcGCContent/genGCContent -r referenceByKarma -w windowSize -o gccontent.out

Note: Before running the qplot, it is critical to check how the chromosome numbers are coded. Some bam file use just numbers, others use chr + numbers. You need to make sure that the chromosome numbers from reference and dbsnp are consistent with the bam file.

Parameters

Most of command line parameters are self explanatory and some of them are described here

  • Flag filter
    • By default all reads are processed. If it is desired to check only the first read of a pair, use --read2_skip to ignore the second read. And so on.
  • Duplication and QCFile
    • By default reads marked as dup and QCFile are ignored but can be retained by
--dup_keep or --qcfail_keep
  • Records to process is to try the first n reads to test the bam files and check whether it works.
  • Lanes to process
    • If the input bam files have more than one lane and only some of them need to be checked, they can be specified by --lanes 1,3,5 whatever the number of lanes needed.
    • In order for this to work, the lane info has to be encoded in the read name such that lane number is the second field with the delimit of ":".
  • Plot filters
    • GC bias plot requires large memory (genome size which is ~3Gb) and if it is not desired to generate it to save memory it can be disabled by
--gc_plot_skip
  • Region list
    • If the interest of QA is a list of regions, e.g. exons, this can be achieved by providing a list of region. The regions should be in the form of "chr start end label" each line in the file. In order for this option to work, within each chromosome (contig) the regions have to be sorted by starting position, and also the input bam files have to be sorted.
1 100 500 region_A
1 600 800 region_B
2 100 300 region_C
...
  • Plot labels
    • Two kinds of labels are enabled. First one is the label for the plot (default is empty), e.g. label on the title of each subplot. Second one is a set of labels for each input bam files, e.g. sample ID (default is numbers 1, 2, ... until the number of input bam files.
--label Run100 --bamLabels s1,s2,s3,s4,s5,s6,s7,s8
  • Multiple threading
    • Number of concurrent threads running for the input bam files. One bam file will be processed by one thread. Therefore using a number which is dividable by the number of input bam files will make it efficient. One extra thread requires memory about 375Mb on top of around 4Gb memory used to hold reference and GC content file.

Output files

There are three (optional) output files.

  • --plot qa.pdf
    • This is a pdf file containing 2 pages each with 4 figures. If --pages 1 is specified, only page 1 is output. The plot is generated using Rscript.
  • --stats qa.stats
    • This is a text file containing various summary statistics for each input bam/sam file
  • --Rcode qa.R
    • This is the R code used for plotting the figures in qa.pdf file. If Rscript is not installed in the system the qplot is run, you can use the qa.R to generate the figures in other machines, or extract plotting data from each run and combine multiple runs together to generate more comprehensive plots.


Example output

  • Figures
https://statgen.sph.umich.edu/w/images/5/53/Sardinia_Run_84_QA.pdf
  • Summary statistics text file
TotalReads(e6)  75.74   76.37   76.33   76.47   76.76   76.33   76.34   75.30
MappingRate(%)  95.27   95.21   95.68   95.64   95.49   95.99   95.72   95.54
MapRate_pass(%) 95.27   95.21   95.68   95.64   95.49   95.99   95.72   95.54
ZeroMapQual(%)  3.91    3.90    3.86    3.95    3.65    3.44    3.47    3.46
MapQual<10(%)   4.15    4.16    4.12    4.22    3.91    3.69    3.72    3.72
TargetMapping(%)        0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
PairedReads(%)  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00
ProperPaired(%) 91.74   90.92   92.53   92.19   91.74   92.64   91.99   92.10
MappedBases(e9) 7.62    7.51    7.56    7.52    7.55    7.60    7.62    7.41
Q20Bases(e9)    7.41    7.35    7.42    7.37    7.40    7.44    7.47    7.28
Q20BasesPct(%)  97.27   97.96   98.16   98.02   98.00   97.99   98.12   98.16
MeanDepth       3.01    2.98    3.01    2.99    2.99    3.00    2.99    2.93
GenomeCover(%)  88.45   87.92   87.74   87.88   88.00   88.37   88.78   88.21
GCBiasMSE       0.02    0.02    0.02    0.02    0.02    0.02    0.02    0.02
ISize_mode      242     267     284     282     268     258     260     262
ISize_medium    241     268     285     281     269     258     259     262
DupRate(%)      1.58    1.55    1.77    1.71    1.46    1.51    1.73    1.95
QCFailRate(%)   0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
BaseComp_A(%)   30.3    30.1    30.6    30.3    30.0    30.4    30.4    30.5
BaseComp_C(%)   19.9    20.1    19.6    19.9    20.2    19.8    19.8    19.7
BaseComp_G(%)   19.9    20.1    19.7    20.0    20.2    19.8    19.9    19.8
BaseComp_T(%)   29.9    29.7    30.1    29.8    29.6    30.0    29.9    30.0
BaseComp_O(%)   0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0