Talk:QPLOT

From Genome Analysis Wiki

QPLOT processes each aligned BAM file.

(1) Empirical vs reported Phred score:

Conditioning on the reported base quality, we count the bases that match and the bases that do not match the reference genome, and calculate the empirical quality as: -10 * log10(total # of mismatched bases / total # of bases). Bases are not used for calculating empirical qualities in the following cases:

(a) the base is 'N' in either the sequence read or the reference genome,

(b) the read has SAM flags marking it as duplicated or unmapped,

(c) the base overlaps a dbSNP site (if a dbSNP position file is provided).

If --regions is specified, only bases in the target regions are counted.
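As a rough illustration of this calculation, here is a minimal Python sketch. It assumes the standard Phred definition (quality = -10 * log10 of the error rate, with the error rate taken as mismatched bases over total bases). The function name `empirical_quality` and the input layout are hypothetical and are not part of QPLOT. Only exclusion (a) is shown; the flag and dbSNP filters are assumed to have been applied upstream.

```python
import math
from collections import defaultdict

def empirical_quality(observations):
    """Map each reported quality to its empirical Phred quality.

    observations: iterable of (reported_quality, read_base, ref_base)
    tuples (a hypothetical input layout chosen for this sketch).
    """
    total = defaultdict(int)
    mismatch = defaultdict(int)
    for qual, read_base, ref_base in observations:
        # Exclusion (a): skip bases where the read or the reference is 'N'.
        if read_base == "N" or ref_base == "N":
            continue
        total[qual] += 1
        if read_base != ref_base:
            mismatch[qual] += 1

    result = {}
    for qual, n in total.items():
        if mismatch[qual] == 0:
            # Assumption of this sketch: no observed error maps to infinity.
            result[qual] = math.inf
        else:
            result[qual] = -10 * math.log10(mismatch[qual] / n)
    return result

# 1000 bases reported as Q30, one of which mismatches the reference.
obs = [(30, "A", "A")] * 999 + [(30, "A", "C")] + [(30, "N", "A")]
print(empirical_quality(obs))
```

With one mismatch in 1000 usable bases, the empirical quality for reported Q30 comes out at 30, so reported and empirical quality agree in this toy case.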

(2) Empirical Phred score by cycle:

Conditioning on read cycle (e.g. first base, second base; be cautious with quality-trimmed or bar-coded reads, as the real cycle may differ), we calculate the empirical quality as above. If --region is specified, only bases falling in the target regions are counted.
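The per-cycle variant can be sketched the same way, keyed on the position within the read instead of the reported quality. This is only an illustration: `empirical_quality_by_cycle` and its input format are hypothetical, reads are assumed to be given in sequencing order together with the reference bases they align to, and the Phred definition (-10 * log10 of the mismatch rate) is assumed.

```python
import math

def empirical_quality_by_cycle(aligned_reads):
    """Return a list where element i is the empirical quality at cycle i.

    aligned_reads: iterable of (read_seq, ref_seq) string pairs of equal
    length (hypothetical input; indels are ignored in this sketch).
    """
    total = []
    mismatch = []
    for read_seq, ref_seq in aligned_reads:
        for cycle, (read_base, ref_base) in enumerate(zip(read_seq, ref_seq)):
            if cycle >= len(total):
                total.append(0)
                mismatch.append(0)
            # Skip 'N' in either the read or the reference.
            if read_base == "N" or ref_base == "N":
                continue
            total[cycle] += 1
            if read_base != ref_base:
                mismatch[cycle] += 1

    qualities = []
    for n, m in zip(total, mismatch):
        if n == 0:
            qualities.append(None)       # no usable bases at this cycle
        elif m == 0:
            qualities.append(math.inf)   # assumption: no error seen
        else:
            qualities.append(-10 * math.log10(m / n))
    return qualities

reads = [("AC", "AC")] * 9 + [("AT", "AC")]
print(empirical_quality_by_cycle(reads))
```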

(3) Mean depth vs. GC

We count depth for the whole genome or for the specified region (--region). The default GC window size is 100, so the GC count at each position is between 0 and 100. Across the whole genome or the specified region, we sum the depth over all positions that share the same GC count, divide that sum by the number of positions with that GC count, and then divide by the mean coverage (total mapped bases divided by total covered positions; see (7) Mean depth of sequencing).
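The normalization can be sketched as follows. The text does not say how the window is placed relative to a position, so this sketch assumes the window starts at the position and skips positions without a full window. The name `normalized_depth_by_gc` and the inputs are hypothetical.

```python
from collections import defaultdict

def normalized_depth_by_gc(reference, depth, window=100):
    """Return {gc_count: mean depth at that GC count / overall mean depth}.

    reference: upper-case reference sequence (string)
    depth:     per-position depth, same length as reference
    """
    depth_sum = defaultdict(int)
    n_positions = defaultdict(int)
    for start in range(len(reference) - window + 1):
        # Assumption: the window begins at the position itself.
        gc = sum(base in "GC" for base in reference[start:start + window])
        depth_sum[gc] += depth[start]
        n_positions[gc] += 1

    # Mean depth as in (7): total mapped bases / positions covered at least once.
    covered = sum(1 for d in depth if d > 0)
    if covered == 0:
        return {}
    mean_depth = sum(depth) / covered
    return {gc: depth_sum[gc] / n_positions[gc] / mean_depth
            for gc in n_positions}

ref = "GGGGAAAA"
dep = [4, 4, 4, 4, 2, 2, 2, 2]
print(normalized_depth_by_gc(ref, dep, window=2))
```

In this toy example the GC-rich windows sit above 1 and the AT-only windows below 1, which is the kind of GC bias the plot is meant to expose.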

(4) Insert size

For mapped paired-end reads, the insert size distribution is plotted; otherwise, this graph is empty. Specifying --region does not affect this graph.

(5) Empirical Q20 bases count by cycle

We count the number of Q20 bases (bases whose quality is larger than 20) conditioning on cycle number. If --regions is specified, only bases in the target regions are counted. In that case, some reads will have their head or tail outside the region, so you will likely see a parabolic shape.
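A minimal sketch of the per-cycle count, assuming each read is given as a list of base qualities in cycle order. The function name is hypothetical, and the strict "larger than 20" comparison follows the wording above.

```python
def q20_count_by_cycle(read_qualities, threshold=20):
    """Count, for each cycle, the bases whose quality exceeds threshold.

    read_qualities: iterable of per-read quality lists (hypothetical input).
    """
    counts = []
    for quals in read_qualities:
        # Grow the result when a longer read is encountered.
        if len(quals) > len(counts):
            counts.extend([0] * (len(quals) - len(counts)))
        for cycle, q in enumerate(quals):
            if q > threshold:
                counts[cycle] += 1
    return counts

print(q20_count_by_cycle([[30, 25, 10], [21, 20]]))  # [2, 1, 0]
```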

(6) Flag stats

We count the number of reads in these categories: total, mapped, paired, properly paired, duplicated, QC failed. These categories are determined by the FLAG field of each read in the BAM file.
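For illustration, the categories can be derived from the FLAG bits defined in the SAM specification (0x1 paired, 0x2 properly paired, 0x4 unmapped, 0x200 QC failed, 0x400 duplicate). The function `flag_stats` below is a hypothetical sketch that works on plain integer flags and is not QPLOT's own code.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
QC_FAIL = 0x200
DUPLICATE = 0x400

def flag_stats(flags):
    """Tally read categories from an iterable of integer SAM FLAG values."""
    stats = {"total": 0, "mapped": 0, "paired": 0,
             "properly_paired": 0, "duplicated": 0, "qc_failed": 0}
    for flag in flags:
        stats["total"] += 1
        if not flag & UNMAPPED:
            stats["mapped"] += 1
        if flag & PAIRED:
            stats["paired"] += 1
        if flag & PROPER_PAIR:
            stats["properly_paired"] += 1
        if flag & DUPLICATE:
            stats["duplicated"] += 1
        if flag & QC_FAIL:
            stats["qc_failed"] += 1
    return stats

print(flag_stats([0x1 | 0x2, 0x4, 0x400, 0x200 | 0x1]))
```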

(7) Mean depth of sequencing

Mean depth is the total number of mapped bases divided by the total number of positions that are covered by at least one base. The percentage on the y-axis is the number of sites divided by the total number of sites (e.g. for whole-genome sequencing, the total genome length; for targeted sequencing, the total length of all targeted regions).
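A small sketch of both quantities, assuming a per-position depth list that spans the whole genome or all targeted regions. The helper names are hypothetical, and reading the y-axis as "percentage of sites at each depth" is an assumption about what the plot shows.

```python
from collections import Counter

def mean_depth(depth):
    """Total mapped bases / number of positions covered at least once."""
    covered = sum(1 for d in depth if d > 0)
    return sum(depth) / covered if covered else 0.0

def depth_percentages(depth):
    """Percentage of all sites observed at each depth value.

    Assumption: the denominator is every site in the genome or target,
    including sites with zero coverage.
    """
    total_sites = len(depth)
    counts = Counter(depth)
    return {d: 100.0 * n / total_sites for d, n in sorted(counts.items())}

depth = [0, 2, 4, 0]
print(mean_depth(depth))         # 3.0
print(depth_percentages(depth))  # {0: 50.0, 2: 25.0, 4: 25.0}
```

Note that the mean uses only covered positions as its denominator, while the percentages use all sites, so the two are not directly comparable for sparsely covered targets.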

(8) Empirical Q20 count

We examine each base by its reported base quality: if that reported base quality corresponds to an empirical base quality better than Phred score 20, then we count it once as a Q20 base.
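One way to read this rule is sketched below: every base whose reported quality bin has an empirical quality above 20 contributes one to the count. The function name, the input layout, and the strict "greater than" comparison are assumptions of this sketch. The empirical quality is taken as -10 * log10(mismatches / total), the standard Phred definition.

```python
import math

def empirical_q20_count(counts_by_reported_quality, threshold=20.0):
    """Count bases whose reported quality maps to an empirical quality > threshold.

    counts_by_reported_quality: {reported_quality: (total_bases, mismatches)}
    (a hypothetical summary table, e.g. produced while computing plot (1)).
    """
    q20_bases = 0
    for total, mismatches in counts_by_reported_quality.values():
        if total == 0:
            continue
        if mismatches == 0:
            empirical = math.inf  # assumption: no observed error
        else:
            empirical = -10 * math.log10(mismatches / total)
        if empirical > threshold:
            # Each base in this reported-quality bin counts once.
            q20_bases += total
    return q20_bases

table = {30: (1000, 1), 10: (100, 10), 25: (50, 0)}
print(empirical_q20_count(table))  # 1050
```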