Code Sample: Generating QQ Plots in R

From Genome Analysis Wiki
Jump to: navigation, search

Quantile-quantile plots (qq-plots) can be useful for verifying that a set of values come from a certain distribution. For example in a genome-wide association study, we expect that most of the SNPs we are testing not to be associated with the disease. Under the null, this means that the p-values we get from tests where no true association exists should follow a uniform(0,1) distribution. Since we're usually most interested in really small p-values, we generally transform the p-values by -log10 so that the smallest values near zero become the larger values and are thus easier to see.

Pval-qq-sample.png

Credit

This page is based on a tutorial originally written by Matthew Flickinger

Credits

This page is entirely cribbed from an earlier version by Matthew Flickinger, one of our more outstanding graduate students.

R Lattice Graphics

The easiest way to create a -log10 qq-plot is with the qqmath function in the lattice package. It can make a quantile-quantile plot for any distribution as long as you supply it with the correct quantile function. Many of the quantile functions for the standard distributions are built in (qnorm, qt, qbeta, qgamma, qunif, etc). However, we must specify the correct function for the -log10 uniform ourself. Here is some code which will do that with some sample data:

#FAKE SAMPLE DATA
my.pvalues<-runif(10000)

library(lattice);
qqmath(~-log10(my.pvalues),
        distribution=function(x){-log10(qunif(1-x))}
);

A Fancier QQ Plot by Matthew Flickinger

Unfortunately the simple way of doing it leaves out many of the things that are nice to have on the plot such as a reference line and a confidence interval plus if your data set is large it plots a lot of points that aren't very interesting in the lower left. Here is a more complex example that adds a few more niceties and thins the data to only plot meaningful points

library(lattice)
qqunif.plot<-function(pvalues, 
	should.thin=T, thin.obs.places=2, thin.exp.places=2, 
	xlab=expression(paste("Expected (",-log[10], " p-value)")),
	ylab=expression(paste("Observed (",-log[10], " p-value)")), 
	draw.conf=TRUE, conf.points=1000, conf.col="lightgray", conf.alpha=.05,
	already.transformed=FALSE, pch=20, aspect="iso", prepanel=prepanel.qqunif,
	par.settings=list(superpose.symbol=list(pch=pch)), ...) {
	
	
	#error checking
	if (length(pvalues)==0) stop("pvalue vector is empty, can't draw plot")
	if(!(class(pvalues)=="numeric" || 
		(class(pvalues)=="list" && all(sapply(pvalues, class)=="numeric"))))
		stop("pvalue vector is not numeric, can't draw plot")
	if (any(is.na(unlist(pvalues)))) stop("pvalue vector contains NA values, can't draw plot")
	if (already.transformed==FALSE) {
		if (any(unlist(pvalues)==0)) stop("pvalue vector contains zeros, can't draw plot")
	} else {
		if (any(unlist(pvalues)<0)) stop("-log10 pvalue vector contains negative values, can't draw plot")
	}
	
	
	grp<-NULL
	n<-1
	exp.x<-c()
	if(is.list(pvalues)) {
		nn<-sapply(pvalues, length)
		rs<-cumsum(nn)
		re<-rs-nn+1
		n<-min(nn)
		if (!is.null(names(pvalues))) {
			grp=factor(rep(names(pvalues), nn), levels=names(pvalues))
			names(pvalues)<-NULL
		} else {
			grp=factor(rep(1:length(pvalues), nn))
		}
		pvo<-pvalues
		pvalues<-numeric(sum(nn))
		exp.x<-numeric(sum(nn))
		for(i in 1:length(pvo)) {
			if (!already.transformed) {
				pvalues[rs[i]:re[i]] <- -log10(pvo[[i]])
				exp.x[rs[i]:re[i]] <- -log10((rank(pvo[[i]], ties.method="first")-.5)/nn[i])
			} else {
				pvalues[rs[i]:re[i]] <- pvo[[i]]
				exp.x[rs[i]:re[i]] <- -log10((nn[i]+1-rank(pvo[[i]], ties.method="first")-.5)/(nn[i]+1))
			}
		}
	} else {
		n <- length(pvalues)+1
		if (!already.transformed) {
			exp.x <- -log10((rank(pvalues, ties.method="first")-.5)/n)
			pvalues <- -log10(pvalues)
		} else {
			exp.x <- -log10((n-rank(pvalues, ties.method="first")-.5)/n)
		}
	}


	#this is a helper function to draw the confidence interval
	panel.qqconf<-function(n, conf.points=1000, conf.col="gray", conf.alpha=.05, ...) {
		require(grid)
		conf.points = min(conf.points, n-1);
		mpts<-matrix(nrow=conf.points*2, ncol=2)
        	for(i in seq(from=1, to=conf.points)) {
            		mpts[i,1]<- -log10((i-.5)/n)
            		mpts[i,2]<- -log10(qbeta(1-conf.alpha/2, i, n-i))
            		mpts[conf.points*2+1-i,1]<- -log10((i-.5)/n)
            		mpts[conf.points*2+1-i,2]<- -log10(qbeta(conf.alpha/2, i, n-i))
        	}
        	grid.polygon(x=mpts[,1],y=mpts[,2], gp=gpar(fill=conf.col, lty=0), default.units="native")
    	}

	#reduce number of points to plot
	if (should.thin==T) {
		if (!is.null(grp)) {
			thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places),
				exp.x = round(exp.x, thin.exp.places),
				grp=grp))
			grp = thin$grp
		} else {
			thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places),
				exp.x = round(exp.x, thin.exp.places)))
		}
		pvalues <- thin$pvalues
		exp.x <- thin$exp.x
	}
	gc()
	
	prepanel.qqunif= function(x,y,...) {
		A = list()
		A$xlim = range(x, y)*1.02
		A$xlim[1]=0
		A$ylim = A$xlim
		return(A)
	}

	#draw the plot
	xyplot(pvalues~exp.x, groups=grp, xlab=xlab, ylab=ylab, aspect=aspect,
		prepanel=prepanel, scales=list(axs="i"), pch=pch,
		panel = function(x, y, ...) {
			if (draw.conf) {
				panel.qqconf(n, conf.points=conf.points, 
					conf.col=conf.col, conf.alpha=conf.alpha)
			};
			panel.xyplot(x,y, ...);
			panel.abline(0,1);
		}, par.settings=par.settings, ...
	)
}

Sample Usage

A sample call to this function would be

qqunif.plot(my.pvalues) #these are the raw p-values, not log-transformed

Under the Hood: Multiple P-value Lists

If you are comparing two-test or want to show data before and after it has been corrected for genomic control, you can pass multiple sets of p-values to the function via a list.

my.pvalue.list<-list("Study 1"=runif(10000), "Study 2"=runif(10000,0,.90))
qqunif.plot(my.pvalue.list, auto.key=list(corner=c(.95,.05)))

Internally the different groups are drawn using the lattice superpose settings, so if you want more control over the color and shapes, you can use the par.settings=list(superpose.symbol=) settings. Furthermore, you can use any of the lattice methods of adding a legend to your plot. The names used in the legend correspond to the names of the elements in the list you pass in.

Under the Hood: Confidence Intervals

The confidence intervals are calculated using the fact that the standard uniform order statistics follow a beta distribution. The default settings will draw confidence intervals around the 1000 more significant points. You can change that with the conf.points= parameter and you can change the alpha level from the default .05 using the conf.alpha= parameter. If you wish to disable the confidence interval, use draw.conf=F in your call to qqunif.plot().

Note that the confidence interval drawn depends on the total number of p-values given. When you pass in a list, the number of tests the confidence interval uses is determined by the vector with the least number of p-values - this gives the widest, most conservative confidence bands.

Under the Hood: Thinning the Data

This function does thin the data by rounding the observer and expected -log10 p-values to two places by default. You can control the thinning with the should.thin=, thin.obs.places=, and thin.exp.places= parameters.

Under the Hood: Customizing Graphics

The function should also accept any other lattice graphing parameters should you want to change the plot title (main=), plotting character (pch=), or plot colors (col= for points, conf.col= for confidence interval). By default the aspect="iso" parameter is set which ensures that the reference line lies on a 45-degree angle. If you have very significant results, this may make your plot taller than you would like. You can set the parameter to aspect="fill" to use the standard layout which stretches the values on each axis to take up as much room as possible.

R Base Graphics

Unfortunately, base graphics only offers a built in plot type for normal qq plots. Luckily, it's not too hard to calculate our own expected p-values under the null. We simply rank the p-values from lowest to highest and divide by the total number of tests. Then we take the -log10 transformation of these values. Here is an example

#Fake sample data
my.pvalues=runif(10000)

#Calculate expectations
exp.pvalues<-(rank(my.pvalues, ties.method="first")+.5)/(length(my.pvalues)+1)

#Make plot
plot(-log10(exp.pvalues), -log10(my.pvalues), asp=1)
abline(0,1)