Difference between revisions of "Code Sample: Generating QQ Plots in R"

From Genome Analysis Wiki
Jump to navigationJump to search
 
(3 intermediate revisions by one other user not shown)
Line 2: Line 2:
  
 
[[File:Pval-qq-sample.png|center]]
 
[[File:Pval-qq-sample.png|center]]
 +
 +
= Credit =
 +
 +
This page is based on a tutorial originally written by [mailto:mflick@umich.edu Matthew Flickinger]
  
 
== Credits ==
 
== Credits ==
Line 20: Line 24:
 
);
 
);
 
</syntaxhighlight>
 
</syntaxhighlight>
 +
 +
== A Fancier QQ Plot by Matthew Flickinger ==
  
 
Unfortunately the simple way of doing it leaves out many of the things that are nice to have on the plot such as a reference line and a confidence interval plus if your data set is large it plots a lot of points that aren't very interesting in the lower left. Here is a more complex example that adds a few more niceties and thins the data to only plot meaningful points
 
Unfortunately the simple way of doing it leaves out many of the things that are nice to have on the plot such as a reference line and a confidence interval plus if your data set is large it plots a lot of points that aren't very interesting in the lower left. Here is a more complex example that adds a few more niceties and thins the data to only plot meaningful points
Line 30: Line 36:
 
ylab=expression(paste("Observed (",-log[10], " p-value)")),  
 
ylab=expression(paste("Observed (",-log[10], " p-value)")),  
 
draw.conf=TRUE, conf.points=1000, conf.col="lightgray", conf.alpha=.05,
 
draw.conf=TRUE, conf.points=1000, conf.col="lightgray", conf.alpha=.05,
already.transformed=FALSE, pch=20, aspect="iso",  
+
already.transformed=FALSE, pch=20, aspect="iso", prepanel=prepanel.qqunif,
 
par.settings=list(superpose.symbol=list(pch=pch)), ...) {
 
par.settings=list(superpose.symbol=list(pch=pch)), ...) {
 
 
Line 40: Line 46:
 
stop("pvalue vector is not numeric, can't draw plot")
 
stop("pvalue vector is not numeric, can't draw plot")
 
if (any(is.na(unlist(pvalues)))) stop("pvalue vector contains NA values, can't draw plot")
 
if (any(is.na(unlist(pvalues)))) stop("pvalue vector contains NA values, can't draw plot")
if (any(unlist(pvalues)==0)) stop("pvalue vector contains zeros, can't draw plot")
+
if (already.transformed==FALSE) {
 
+
if (any(unlist(pvalues)==0)) stop("pvalue vector contains zeros, can't draw plot")
 +
} else {
 +
if (any(unlist(pvalues)<0)) stop("-log10 pvalue vector contains negative values, can't draw plot")
 +
}
 
 
 
 
Line 111: Line 120:
 
gc()
 
gc()
 
 
prepanel.origin = function(x,y,...) {
+
prepanel.qqunif= function(x,y,...) {
 
A = list()
 
A = list()
 
A$xlim = range(x, y)*1.02
 
A$xlim = range(x, y)*1.02
Line 121: Line 130:
 
#draw the plot
 
#draw the plot
 
xyplot(pvalues~exp.x, groups=grp, xlab=xlab, ylab=ylab, aspect=aspect,
 
xyplot(pvalues~exp.x, groups=grp, xlab=xlab, ylab=ylab, aspect=aspect,
prepanel=prepanel.origin, scales=list(axs="i"), pch=pch,
+
prepanel=prepanel, scales=list(axs="i"), pch=pch,
 
panel = function(x, y, ...) {
 
panel = function(x, y, ...) {
 
if (draw.conf) {
 
if (draw.conf) {
Line 133: Line 142:
 
}
 
}
 
</syntaxhighlight>
 
</syntaxhighlight>
 +
 +
=== Sample Usage ===
  
 
A sample call to this function would be
 
A sample call to this function would be
 +
 
<syntaxhighlight lang="rsplus">
 
<syntaxhighlight lang="rsplus">
 
qqunif.plot(my.pvalues) #these are the raw p-values, not log-transformed
 
qqunif.plot(my.pvalues) #these are the raw p-values, not log-transformed
 
</syntaxhighlight>
 
</syntaxhighlight>
  
The confidence intervals are calculated using the fact that the standard uniform order statistics follow a beta distribution. The default settings will draw confidence intervals around the 1000 more significant points. You can change that with the <code>conf.points=</code> parameter and you can change the alpha level from the default .05 using the <tt>conf.alpha=</tt> parameter. If you wish to disable the confidence interval, use <tt>draw.conf=F</tt> in your call to <tt>qqunif.plot()</tt>.
+
=== Under the Hood: Multiple P-value Lists ===
  
This function does thin the data by rounding the observer and expected -log10 p-values to two places by default. You can control the thinning with the <tt>should.thin=</tt>, <tt>thin.obs.places=</tt>, and <tt>thin.exp.places=</tt> parameters.
+
If you are comparing two-test or want to show data before and after it has been corrected for genomic control, you can pass multiple sets of p-values to the function via a list.
 
 
The function should also accept any other lattice graphing parameters should you want to change the plot title (<tt>main=</tt>), plotting character (<tt>pch=</tt>), or plot colors (<tt>col=</tt> for points, <tt>conf.col=</tt> for confidence interval). By default the <tt>aspect="iso"</tt> parameter is set which ensures that the reference line lies on a 45-degree angle. If you have very significant results, this may make your plot taller than you would like. You can set the parameter to <tt>aspect="fill"</tt> to use the standard layout which stretches the values on each axis to take up as much room as possible.
 
  
If you are comparing two-test or want to show data before and after it has been corrected for genomic control, you can pass multiple sets of p-values to the function via a list.
 
 
<syntaxhighlight lang="rsplus">
 
<syntaxhighlight lang="rsplus">
 
my.pvalue.list<-list("Study 1"=runif(10000), "Study 2"=runif(10000,0,.90))
 
my.pvalue.list<-list("Study 1"=runif(10000), "Study 2"=runif(10000,0,.90))
Line 151: Line 160:
 
</syntaxhighlight>
 
</syntaxhighlight>
  
Note that the confidence interval drawn depends on the total number of p-values given. When you pass in a list, the number of tests the confidence interval uses is determined by the vector with the '''least number of p-values''' - this gives the widest, most conservative confidence bands. Internally the different groups are drawn using the lattice superpose settings, so if you want more control over the color and shapes, you can use the <tt>par.settings=list(superpose.symbol=)</tt> settings. Furthermore, you can use any of the lattice methods of adding a legend to your plot. The names used in the legend correspond to the names of the elements in the list you pass in.
+
Internally the different groups are drawn using the lattice superpose settings, so if you want more control over the color and shapes, you can use the <tt>par.settings=list(superpose.symbol=)</tt> settings. Furthermore, you can use any of the lattice methods of adding a legend to your plot. The names used in the legend correspond to the names of the elements in the list you pass in.
 +
 
 +
=== Under the Hood: Confidence Intervals ===
 +
 
 +
The confidence intervals are calculated using the fact that the standard uniform order statistics follow a beta distribution. The default settings will draw confidence intervals around the 1000 more significant points. You can change that with the <tt>conf.points=</tt> parameter and you can change the alpha level from the default .05 using the <tt>conf.alpha=</tt> parameter. If you wish to disable the confidence interval, use <tt>draw.conf=F</tt> in your call to <tt>qqunif.plot()</tt>.
 +
 
 +
Note that the confidence interval drawn depends on the total number of p-values given. When you pass in a list, the number of tests the confidence interval uses is determined by the vector with the '''least number of p-values''' - this gives the widest, most conservative confidence bands.
 +
 
 +
=== Under the Hood: Thinning the Data ===
 +
 
 +
This function does thin the data by rounding the observer and expected -log10 p-values to two places by default. You can control the thinning with the <tt>should.thin=</tt>, <tt>thin.obs.places=</tt>, and <tt>thin.exp.places=</tt> parameters.
 +
 
 +
=== Under the Hood: Customizing Graphics ===
 +
 
 +
The function should also accept any other lattice graphing parameters should you want to change the plot title (<tt>main=</tt>), plotting character (<tt>pch=</tt>), or plot colors (<tt>col=</tt> for points, <tt>conf.col=</tt> for confidence interval). By default the <tt>aspect="iso"</tt> parameter is set which ensures that the reference line lies on a 45-degree angle. If you have very significant results, this may make your plot taller than you would like. You can set the parameter to <tt>aspect="fill"</tt> to use the standard layout which stretches the values on each axis to take up as much room as possible.
  
 
== R Base Graphics ==
 
== R Base Graphics ==

Latest revision as of 14:14, 12 November 2013

Quantile-quantile plots (qq-plots) can be useful for verifying that a set of values come from a certain distribution. For example in a genome-wide association study, we expect that most of the SNPs we are testing not to be associated with the disease. Under the null, this means that the p-values we get from tests where no true association exists should follow a uniform(0,1) distribution. Since we're usually most interested in really small p-values, we generally transform the p-values by -log10 so that the smallest values near zero become the larger values and are thus easier to see.

Pval-qq-sample.png

Credit

This page is based on a tutorial originally written by Matthew Flickinger

Credits

This page is entirely cribbed from an earlier version by Matthew Flickinger, one of our more outstanding graduate students.

R Lattice Graphics

The easiest way to create a -log10 qq-plot is with the qqmath function in the lattice package. It can make a quantile-quantile plot for any distribution as long as you supply it with the correct quantile function. Many of the quantile functions for the standard distributions are built in (qnorm, qt, qbeta, qgamma, qunif, etc). However, we must specify the correct function for the -log10 uniform ourself. Here is some code which will do that with some sample data:

#FAKE SAMPLE DATA
my.pvalues<-runif(10000)

library(lattice);
qqmath(~-log10(my.pvalues),
        distribution=function(x){-log10(qunif(1-x))}
);

A Fancier QQ Plot by Matthew Flickinger

Unfortunately the simple way of doing it leaves out many of the things that are nice to have on the plot such as a reference line and a confidence interval plus if your data set is large it plots a lot of points that aren't very interesting in the lower left. Here is a more complex example that adds a few more niceties and thins the data to only plot meaningful points

library(lattice)
qqunif.plot<-function(pvalues, 
	should.thin=T, thin.obs.places=2, thin.exp.places=2, 
	xlab=expression(paste("Expected (",-log[10], " p-value)")),
	ylab=expression(paste("Observed (",-log[10], " p-value)")), 
	draw.conf=TRUE, conf.points=1000, conf.col="lightgray", conf.alpha=.05,
	already.transformed=FALSE, pch=20, aspect="iso", prepanel=prepanel.qqunif,
	par.settings=list(superpose.symbol=list(pch=pch)), ...) {
	
	
	#error checking
	if (length(pvalues)==0) stop("pvalue vector is empty, can't draw plot")
	if(!(class(pvalues)=="numeric" || 
		(class(pvalues)=="list" && all(sapply(pvalues, class)=="numeric"))))
		stop("pvalue vector is not numeric, can't draw plot")
	if (any(is.na(unlist(pvalues)))) stop("pvalue vector contains NA values, can't draw plot")
	if (already.transformed==FALSE) {
		if (any(unlist(pvalues)==0)) stop("pvalue vector contains zeros, can't draw plot")
	} else {
		if (any(unlist(pvalues)<0)) stop("-log10 pvalue vector contains negative values, can't draw plot")
	}
	
	
	grp<-NULL
	n<-1
	exp.x<-c()
	if(is.list(pvalues)) {
		nn<-sapply(pvalues, length)
		rs<-cumsum(nn)
		re<-rs-nn+1
		n<-min(nn)
		if (!is.null(names(pvalues))) {
			grp=factor(rep(names(pvalues), nn), levels=names(pvalues))
			names(pvalues)<-NULL
		} else {
			grp=factor(rep(1:length(pvalues), nn))
		}
		pvo<-pvalues
		pvalues<-numeric(sum(nn))
		exp.x<-numeric(sum(nn))
		for(i in 1:length(pvo)) {
			if (!already.transformed) {
				pvalues[rs[i]:re[i]] <- -log10(pvo[[i]])
				exp.x[rs[i]:re[i]] <- -log10((rank(pvo[[i]], ties.method="first")-.5)/nn[i])
			} else {
				pvalues[rs[i]:re[i]] <- pvo[[i]]
				exp.x[rs[i]:re[i]] <- -log10((nn[i]+1-rank(pvo[[i]], ties.method="first")-.5)/(nn[i]+1))
			}
		}
	} else {
		n <- length(pvalues)+1
		if (!already.transformed) {
			exp.x <- -log10((rank(pvalues, ties.method="first")-.5)/n)
			pvalues <- -log10(pvalues)
		} else {
			exp.x <- -log10((n-rank(pvalues, ties.method="first")-.5)/n)
		}
	}


	#this is a helper function to draw the confidence interval
	panel.qqconf<-function(n, conf.points=1000, conf.col="gray", conf.alpha=.05, ...) {
		require(grid)
		conf.points = min(conf.points, n-1);
		mpts<-matrix(nrow=conf.points*2, ncol=2)
        	for(i in seq(from=1, to=conf.points)) {
            		mpts[i,1]<- -log10((i-.5)/n)
            		mpts[i,2]<- -log10(qbeta(1-conf.alpha/2, i, n-i))
            		mpts[conf.points*2+1-i,1]<- -log10((i-.5)/n)
            		mpts[conf.points*2+1-i,2]<- -log10(qbeta(conf.alpha/2, i, n-i))
        	}
        	grid.polygon(x=mpts[,1],y=mpts[,2], gp=gpar(fill=conf.col, lty=0), default.units="native")
    	}

	#reduce number of points to plot
	if (should.thin==T) {
		if (!is.null(grp)) {
			thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places),
				exp.x = round(exp.x, thin.exp.places),
				grp=grp))
			grp = thin$grp
		} else {
			thin <- unique(data.frame(pvalues = round(pvalues, thin.obs.places),
				exp.x = round(exp.x, thin.exp.places)))
		}
		pvalues <- thin$pvalues
		exp.x <- thin$exp.x
	}
	gc()
	
	prepanel.qqunif= function(x,y,...) {
		A = list()
		A$xlim = range(x, y)*1.02
		A$xlim[1]=0
		A$ylim = A$xlim
		return(A)
	}

	#draw the plot
	xyplot(pvalues~exp.x, groups=grp, xlab=xlab, ylab=ylab, aspect=aspect,
		prepanel=prepanel, scales=list(axs="i"), pch=pch,
		panel = function(x, y, ...) {
			if (draw.conf) {
				panel.qqconf(n, conf.points=conf.points, 
					conf.col=conf.col, conf.alpha=conf.alpha)
			};
			panel.xyplot(x,y, ...);
			panel.abline(0,1);
		}, par.settings=par.settings, ...
	)
}

Sample Usage

A sample call to this function would be

qqunif.plot(my.pvalues) #these are the raw p-values, not log-transformed

Under the Hood: Multiple P-value Lists

If you are comparing two-test or want to show data before and after it has been corrected for genomic control, you can pass multiple sets of p-values to the function via a list.

my.pvalue.list<-list("Study 1"=runif(10000), "Study 2"=runif(10000,0,.90))
qqunif.plot(my.pvalue.list, auto.key=list(corner=c(.95,.05)))

Internally the different groups are drawn using the lattice superpose settings, so if you want more control over the color and shapes, you can use the par.settings=list(superpose.symbol=) settings. Furthermore, you can use any of the lattice methods of adding a legend to your plot. The names used in the legend correspond to the names of the elements in the list you pass in.

Under the Hood: Confidence Intervals

The confidence intervals are calculated using the fact that the standard uniform order statistics follow a beta distribution. The default settings will draw confidence intervals around the 1000 more significant points. You can change that with the conf.points= parameter and you can change the alpha level from the default .05 using the conf.alpha= parameter. If you wish to disable the confidence interval, use draw.conf=F in your call to qqunif.plot().

Note that the confidence interval drawn depends on the total number of p-values given. When you pass in a list, the number of tests the confidence interval uses is determined by the vector with the least number of p-values - this gives the widest, most conservative confidence bands.

Under the Hood: Thinning the Data

This function does thin the data by rounding the observer and expected -log10 p-values to two places by default. You can control the thinning with the should.thin=, thin.obs.places=, and thin.exp.places= parameters.

Under the Hood: Customizing Graphics

The function should also accept any other lattice graphing parameters should you want to change the plot title (main=), plotting character (pch=), or plot colors (col= for points, conf.col= for confidence interval). By default the aspect="iso" parameter is set which ensures that the reference line lies on a 45-degree angle. If you have very significant results, this may make your plot taller than you would like. You can set the parameter to aspect="fill" to use the standard layout which stretches the values on each axis to take up as much room as possible.

R Base Graphics

Unfortunately, base graphics only offers a built in plot type for normal qq plots. Luckily, it's not too hard to calculate our own expected p-values under the null. We simply rank the p-values from lowest to highest and divide by the total number of tests. Then we take the -log10 transformation of these values. Here is an example

#Fake sample data
my.pvalues=runif(10000)

#Calculate expectations
exp.pvalues<-(rank(my.pvalues, ties.method="first")+.5)/(length(my.pvalues)+1)

#Make plot
plot(-log10(exp.pvalues), -log10(my.pvalues), asp=1)
abline(0,1)