Changes

From Genome Analysis Wiki
10,459 bytes added, 16:29, 9 May 2013
no edit summary
Line 16: Line 16:  
*[http://www.python.org/download/ Python 2.6] (do '''not''' download the 3.0 branch!)  
 
*[http://www.python.org/download/ Python 2.6] (do '''not''' download the 3.0 branch!)  
 
*[http://www.r-project.org/ R 2.10+]  
 
*[http://www.r-project.org/ R 2.10+]  
 +
 +
 +
The following software is optional but recommended:
 +
 
*[[New Fugue|new_fugue]], a program for computing LD, written by Goncalo Abecasis.
 
*[[New Fugue|new_fugue]], a program for computing LD, written by Goncalo Abecasis.
 
*[http://pngu.mgh.harvard.edu/~purcell/plink/ PLINK], written by Shaun Purcell.  
 
*[http://pngu.mgh.harvard.edu/~purcell/plink/ PLINK], written by Shaun Purcell.  
   −
For the latest stable LocusZoom package, see our [https://statgen.sph.umich.edu/locuszoom/download/ download] page.  
+
 
 +
The following R packages are optional but recommended:
 +
*[http://cran.r-project.org/web/packages/gridExtra/index.html gridExtra] (used for creating summary tables of GWAS hits / fine-mapping SNPs as additional pages in the PDF)
 +
 
 +
 
 +
For the latest stable LocusZoom package, see our [https://statgen.sph.umich.edu/locuszoom/download/ download] page. The current version is '''1.2''', released on May 10th, 2013.  
    
Currently only '''Unix/Linux''' is supported, though Mac OS X should be supported in a future release. Support for Windows may come at a much later date.
 
Currently only '''Unix/Linux''' is supported, though Mac OS X should be supported in a future release. Support for Windows may come at a much later date.
Line 36: Line 45:     
See our [https://statgen.sph.umich.edu/locuszoom/download/ download] page for links to the latest as well as previous releases.
 
See our [https://statgen.sph.umich.edu/locuszoom/download/ download] page for links to the latest as well as previous releases.
 +
 +
== Changes in Version 1.2 ==
 +
 +
A number of new features have been added for this version. See the following sections for more info:
 +
 +
* [[#EPACTS formatted file|Loading EPACTS results]]
 +
* [[#Plotting LD with additional reference SNPs|Plotting LD with additional reference SNPs]]
 +
* [[#Labeling multiple SNPs|Labeling multiple SNPs]]
 +
* [[#Fine-mapping credible sets|Fine-mapping credible sets]]
 +
* [[#GWAS catalog variants|GWAS catalog variants]]
 +
* [[#Supply VCF files for calculating LD|Supply VCF files for calculating LD]]
 +
 +
 +
The full changelog is available on the [https://statgen.sph.umich.edu/locuszoom/download/ download] site.
    
== Installation  ==
 
== Installation  ==
Line 47: Line 70:  
R is also required for generating the plots. You can download R at [http://www.r-project.org/ www.r-project.org]. Version 2.10 or greater is required.  
 
R is also required for generating the plots. You can download R at [http://www.r-project.org/ www.r-project.org]. Version 2.10 or greater is required.  
   −
=== Step 3: Install new_fugue ===
+
=== Step 3: Install LD calculation software (optional) ===
 +
 
 +
* If you wish to calculate LD from hg18 sources (HapMap, earlier releases of 1000G): install '''new_fugue''' (see below).
 +
* If you wish to calculate LD from hg19 sources (latest 1000G): install '''PLINK''' (see below).
 +
* If you plan to supply your own LD files per region, or calculate LD directly from VCF files: install nothing! See options for --ld and --ld-vcf.
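
The choice described in the list above can be sketched as a small shell helper. This is only an illustration and is not shipped with LocusZoom: the function name <code>ld_tool_for_build</code> is hypothetical, and the binary names <code>new_fugue</code> and <code>plink</code> are assumed to match what is on your path.

<syntaxhighlight lang="bash">
# Hypothetical helper (not part of LocusZoom): print the name of the LD
# program needed for a given genome build, following the list above.
ld_tool_for_build() {
  case "$1" in
    hg18) echo new_fugue ;;   # HapMap and earlier 1000G releases
    hg19) echo plink ;;       # latest 1000G releases
    *)    echo "unknown build: $1" >&2; return 1 ;;
  esac
}

# Example usage: check whether the program for hg19 is installed.
#   command -v "$(ld_tool_for_build hg19)" || echo "PLINK not found on path"
</syntaxhighlight>

If you supply your own LD files (--ld) or a VCF file (--ld-vcf), neither program is needed and this check can be skipped.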
 +
 
 +
==== new_fugue ====
    
New_fugue is a program that calculates linkage disequilibrium measures from genotype files. While installing new_fugue is optional, we highly recommend it as it makes the process of generating plots much easier. If you opt to skip installing new_fugue, you will need to provide your own computed LD files for each region that you want to plot.  
 
New_fugue is a program that calculates linkage disequilibrium measures from genotype files. While installing new_fugue is optional, we highly recommend it as it makes the process of generating plots much easier. If you opt to skip installing new_fugue, you will need to provide your own computed LD files for each region that you want to plot.  
Line 60: Line 89:  
You may need administrator rights to install this program.
 
You may need administrator rights to install this program.
   −
=== Step 4: Install PLINK ===  
+
==== PLINK ====
    
PLINK is now used to calculate LD for all future LD sources / populations that we may add. The program new_fugue (above) is used to calculate LD from older sources (such as hapmap) and older builds (such as hg18) where LD files are sufficiently small.  
 
PLINK is now used to calculate LD for all future LD sources / populations that we may add. The program new_fugue (above) is used to calculate LD from older sources (such as hapmap) and older builds (such as hg18) where LD files are sufficiently small.  
   −
You can download PLINK and find instructions for installing it [http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml here].  
+
You can download PLINK and find instructions for installing it [http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml here].
    
=== Step 5: Install LocusZoom  ===
 
=== Step 5: Install LocusZoom  ===
Line 109: Line 138:     
For annotation:  
 
For annotation:  
*We used various sources including RefSeq Genes (refFlat), TFBS Conserved (tfbsConsSites), and Conservation (phastConsElements44wayPlacental), all available from the [http://genome.ucsc.edu UCSC Genome Browser].  
+
*We use various sources including RefSeq Genes (refFlat), TFBS Conserved (tfbsConsSites), and Conservation (phastConsElements44wayPlacental), all available from the [http://genome.ucsc.edu UCSC Genome Browser].  
 
*[ftp://ftp.hapmap.org/hapmap/recombination/2008-03_rel22_B36/rates/ Recombination rates from HapMap].
 
*[ftp://ftp.hapmap.org/hapmap/recombination/2008-03_rel22_B36/rates/ Recombination rates from HapMap].
 +
 +
For GWAS hits:
 +
*We use the NHGRI GWAS catalog, available at [http://www.genome.gov/gwastudies/ genome.gov]
    
== Input  ==
 
== Input  ==
   −
=== Association results file ("metal" file===
+
=== Association results file ===
 +
 
 +
LocusZoom requires an association results file formatted like the output of METAL or EPACTS.
 +
 
 +
==== METAL formatted file ====
   −
The main input to LocusZoom is a file containing results from an association scan or meta-analysis. The file must have 2 columns: markers (SNPs), and p-values. The file should look something like this:  
+
The file must have 2 columns: markers (SNPs), and p-values. The file should look something like this:  
    
<br>  
 
<br>  
Line 144: Line 180:     
P-values of any magnitude are supported in scientific notation (we use an arbitrary precision library built-in to python, and transform p-values to the log scale.) If you've already transformed your p-values to the log scale, simply use <code>--no-transform</code> and LocusZoom will not transform them.
 
P-values of any magnitude are supported in scientific notation (we use an arbitrary precision library built-in to python, and transform p-values to the log scale.) If you've already transformed your p-values to the log scale, simply use <code>--no-transform</code> and LocusZoom will not transform them.
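
To illustrate why scientific notation allows p-values of any magnitude, the sketch below computes a log-scale value from the mantissa and exponent separately, so a p-value such as 1e-400 does not underflow to zero. It is not LocusZoom's implementation (which uses Python's arbitrary precision library, as noted above); the function name <code>neg_log10_p</code> is hypothetical, and the sketch assumes that "log scale" means -log10(p).

<syntaxhighlight lang="bash">
# Hypothetical helper (not LocusZoom's code): print -log10(p) for a p-value
# written as a plain decimal or in scientific notation (e.g. 5e-8).
# Assumption: the log scale used for plotting is -log10(p).
neg_log10_p() {
  awk -v p="$1" 'BEGIN {
    n = split(p, part, /[eE]/)            # part[1] = mantissa, part[2] = exponent
    mant = part[1] + 0
    expo = (n > 1) ? part[2] + 0 : 0      # no exponent means 10^0
    # -log10(mant * 10^expo) = -(log10(mant) + expo)
    printf "%.4f\n", -(log(mant) / log(10) + expo)
  }'
}

# Example usage:
#   neg_log10_p 5e-8      # prints 7.3010
#   neg_log10_p 1e-400    # prints 400.0000
</syntaxhighlight>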
 +
 +
==== EPACTS formatted file ====
 +
 +
The file can come directly from [[EPACTS]], or simply be formatted similarly to the following:
 +
 +
{|
 +
|-
 +
! scope="col" | #CHROM
 +
! scope="col" | BEGIN
 +
! scope="col" | END
 +
! scope="col" | MARKER_ID
 +
! scope="col" | NS
 +
! scope="col" | AC
 +
! scope="col" | CALLRATE
 +
! scope="col" | MAF
 +
! scope="col" | PVALUE
 +
! scope="col" | SCORE
 +
! scope="col" | N.CASE
 +
! scope="col" | N.CTRL
 +
! scope="col" | AF.CASE
 +
! scope="col" | AF.CTRL
 +
|-
 +
| 1 || 15903 || 15903 || 1:15903_G/GC || 2657 || 3892.2 || 1 || 0.26757 || 0.36771 || 0.90077 || 1326 || 1331 || 1.4688 || 1.4609
 +
|-
 +
| 1 || 19190 || 19191 || 1:19190_GC/G || 2657 || 823.65 || 1 || 0.155 || 0.67173 || 0.42378 || 1326 || 1331 || 0.3115 || 0.30849
 +
|-
 +
| 1 || 20316 || 20317 || 1:20316_GA/G || 2657 || 1005.3 || 1 || 0.18917 || 0.50804 || 0.66189 || 1326 || 1331 || 0.38062 || 0.37607
 +
|-
 +
| 1 || 30967 || 30970 || 1:30967_CCCA/C || 2657 || 435.35 || 1 || 0.081925 || 0.08848 || -1.7035 || 1326 || 1331 || 0.16007 || 0.16762
 +
|-
 +
| 1 || 51972 || 51975 || 1:51972_GGAC/G || 2657 || 207.8 || 1 || 0.039104 || 0.51638 || -0.64893 || 1326 || 1331 || 0.077187 || 0.079226
 +
|-
 +
| 1 || 53138 || 53140 || 1:53138_TAA/T || 2657 || 216.2 || 1 || 0.040685 || 0.55679 || 0.58762 || 1326 || 1331 || 0.083145 || 0.079602
 +
|-
 +
| 1 || 54421 || 54421 || 1:54421_A/G || 2657 || 179.45 || 1 || 0.033769 || 0.73592 || 0.33726 || 1326 || 1331 || 0.068213 || 0.066867
 +
|-
 +
| 1 || 66221 || 66221 || 1:66221_A/AT || 2657 || 664.45 || 1 || 0.12504 || 0.48676 || 0.69547 || 1326 || 1331 || 0.25366 || 0.24651
 +
|-
 +
| 1 || 66222 || 66223 || 1:66222_TA/T || 2657 || 470.3 || 1 || 0.088502 || 0.64258 || 0.4641 || 1326 || 1331 || 0.17941 || 0.17461
 +
|-
 +
|}
 +
 +
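The chrom, start, end, marker ID, and p-value columns (<code>#CHROM</code>, <code>BEGIN</code>, <code>END</code>, <code>MARKER_ID</code>, and <code>PVALUE</code>) must all be present. The file must be tab-delimited.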
 +
 +
To load this file, use --epacts.
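
Before passing a file to --epacts, it can help to confirm that the required columns are present. The following is a minimal sketch and not part of LocusZoom: the function name <code>check_epacts_header</code> is hypothetical, and the column names are taken from the example table above.

<syntaxhighlight lang="bash">
# Hypothetical helper (not part of LocusZoom): check that the header line of a
# tab-delimited results file contains the columns named in the table above.
# Prints each missing column and returns a non-zero status if any is absent.
check_epacts_header() {
  head -n 1 "$1" | awk -F'\t' '
    { for (i = 1; i <= NF; i++) seen[$i] = 1 }
    END {
      n = split("#CHROM BEGIN END MARKER_ID PVALUE", need, " ")
      for (j = 1; j <= n; j++)
        if (!(need[j] in seen)) { print "missing column: " need[j]; bad = 1 }
      exit bad
    }'
}

# Example usage (my_results.epacts is a placeholder file name):
#   check_epacts_header my_results.epacts && echo "header looks OK"
</syntaxhighlight>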
    
=== Region  ===
 
=== Region  ===
Line 184: Line 265:  
! align="left" scope="col" | Population  
 
! align="left" scope="col" | Population  
 
! align="left" scope="col" | LocusZoom Arguments
 
! align="left" scope="col" | LocusZoom Arguments
 +
|-
 +
| March 2012
 +
| hg19
 +
| ASN
 +
| --pop ASN --build hg19 --source 1000G_March2012
 +
|-
 +
| March 2012
 +
| hg19
 +
| AFR
 +
| --pop AFR --build hg19 --source 1000G_March2012
 +
|-
 +
| March 2012
 +
| hg19
 +
| EUR
 +
| --pop EUR --build hg19 --source 1000G_March2012
 +
|-
 +
| March 2012
 +
| hg19
 +
| AMR
 +
| --pop AMR --build hg19 --source 1000G_March2012
 
|-
 
|-
 
| Nov 2010
 
| Nov 2010
Line 251: Line 352:  
| --pop JPT+CHB --build hg18 --source hapmap
 
| --pop JPT+CHB --build hg18 --source hapmap
 
|}
 
|}
 +
    
=== Batch mode  ===
 
=== Batch mode  ===
Line 363: Line 465:     
The file should be whitespace delimited, and the header (column names shown above) must exist.
 
The file should be whitespace delimited, and the header (column names shown above) must exist.
 +
 +
=== Supply VCF files for calculating LD ===
 +
 +
You can give LocusZoom a VCF file directly to use for calculating LD:
 +
 +
<syntaxhighlight lang="bash">
 +
locuszoom --ld-vcf my_genotypes.vcf.gz ...
 +
</syntaxhighlight>
 +
 +
This option replaces both alternatives: supplying pre-calculated LD for each region (--ld), or specifying --pop and --source to calculate LD from the genotype files supplied with LocusZoom.
 +
 +
<span style="color:#FF6600">'''Warning: '''</span> The VCF file must also have a [http://samtools.sourceforge.net/tabix.shtml tabix] index located in the same directory. For the above example, the tabix index "my_genotypes.vcf.gz.tbi" must exist.
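
The index requirement in the warning above can be checked before running LocusZoom. This is a minimal sketch; the function name <code>has_tabix_index</code> is hypothetical and not part of LocusZoom.

<syntaxhighlight lang="bash">
# Hypothetical helper (not part of LocusZoom): succeed only if both the VCF
# file and its tabix index (<vcf>.tbi, in the same directory) exist.
has_tabix_index() {
  [ -f "$1" ] && [ -f "$1.tbi" ]
}

# Example usage, building the index with tabix if it is missing
# (my_genotypes.vcf.gz is the file name from the example above):
#   has_tabix_index my_genotypes.vcf.gz || tabix -p vcf my_genotypes.vcf.gz
</syntaxhighlight>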
 +
 +
 +
You can also calculate D' from phased VCF files:
 +
 +
<syntaxhighlight lang="bash">
 +
locuszoom --ld-vcf my_genotypes.vcf.gz --ld-measure dprime ...
 +
</syntaxhighlight>
 +
 +
The default measure is "rsquared".
 +
 +
== Optional Input ==
 +
 +
=== Plotting LD with additional reference SNPs ===
 +
 +
LocusZoom can now show LD with multiple SNPs in a region (for example, you might want to show LD with a number of SNPs from a conditional analysis.)
 +
 +
You give LocusZoom the usual reference SNP (used for centering the plot and calculating the region), plus an additional set of lead/reference SNPs.
 +
 +
For every other SNP (that is, each SNP not in the "lead SNP set" of { reference SNP, additional reference SNPs }), LocusZoom finds the lead SNP with which it is in highest LD and colors it to match that lead SNP. The extent of LD with the lead SNP is shown by a gradient of color.
 +
 +
 +
As an example:
 +
 +
<syntaxhighlight lang="bash">
 +
locuszoom --metal <DIAGRAM T2D results> --refsnp "rs231362" --add-refsnps "rs163184"
 +
</syntaxhighlight>
 +
 +
This will generate the following plot:
 +
 +
[[File:New lz cond only.png|700px]]
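
The coloring rule described above (each SNP takes the color of the lead SNP it is in highest LD with) can be sketched in a few lines of awk. This is not LocusZoom's own code: the function name <code>best_lead_snp</code> and the three-column input layout (SNP, lead SNP, LD value) are assumptions made for illustration.

<syntaxhighlight lang="bash">
# Hypothetical helper (not LocusZoom's code): read whitespace-delimited lines
# of "<snp> <lead_snp> <ld>" and print, for each SNP, the lead SNP with the
# highest LD value. Ties keep the first lead SNP seen.
best_lead_snp() {
  awk '
    !($1 in best) || ($3 + 0) > best[$1] { best[$1] = $3 + 0; lead[$1] = $2 }
    END { for (s in best) print s, lead[s], best[s] }
  ' "$1" | sort
}

# Example usage (ld_pairs.txt is a placeholder file name):
#   best_lead_snp ld_pairs.txt
</syntaxhighlight>

Each output line names the lead SNP whose color a given SNP would take; the LD value in the third column then determines the shade within that color's gradient.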
 +
 +
 +
The following options are available for changing the style of these types of plots:
 +
 +
{| width="85%" cellspacing="0" cellpadding="5" border="1"
 +
|-
 +
! scope="col" | Option (with default value)
 +
! scope="col" | Description
 +
|-
 +
| condLdColors="gray60,#E41A1C,#377EB8,#4DAF4A,#984EA3,#FF7F00,#A65628,#F781BF"
 +
| First color is missing LD color, the rest are used as needed for each additional lead SNP
 +
|-
 +
| drawMarkerNames = T
 +
| Display marker names (or not) above lead SNPs
 +
|-
 +
| condLdLow=NULL
 +
| Used to set all SNPs with LD in the lowest bin to the same color, for example condLdLow="gray70"
 +
|-
 +
| condRefsnpPch=23
 +
| Symbol for each lead SNP, defaults to diamond
 +
|-
 +
| condPch='4,16,17,15,25,8,7,13,12,9,10'
 +
| Plotting symbols for groups of SNPs in LD with additional refsnps, make sure they don't overlap with condRefsnpPch above
 +
|-
 +
| ldCuts = "0,.2,.4,.6,.8,1"
 +
| Bins for LD
 +
|}
 +
 +
 +
=== GWAS catalog variants ===
 +
 +
You can add known GWAS variants to your plots. For example:
 +
 +
<syntaxhighlight lang="bash">
 +
locuszoom ... --gwas-cat whole-cat_significant-only --build hg19
 +
</syntaxhighlight>
 +
 +
[[File:New lz gwas cat.png|900px]]
 +
 +
 +
Currently the only catalog is the NHGRI GWAS catalog from [http://www.genome.gov/gwastudies/ genome.gov].
 +
 +
<pre>
 +
Available GWAS catalogs for build hg19:
 +
 +
+----------------------------+----------------------------------------------------------------+
 +
|          Option          |                          Description                          |
 +
+----------------------------+----------------------------------------------------------------+
 +
| whole-cat_significant-only | The entire GWAS catalog, filtered to SNPs with p-value < 5E-08 |
 +
+----------------------------+----------------------------------------------------------------+
 +
</pre>
 +
 +
 +
If the R package '''gridExtra''' is installed, a summary of each GWAS catalog variant in your region is listed later in the PDF:
 +
 +
[[File:New lz gwas summary.png|500px]]
 +
 +
=== Fine-mapping credible sets ===
 +
 +
LocusZoom can add an additional track to the plot showing results from a fine-mapping analysis. These are typically SNPs within the 95% credible set (see [http://www.nature.com/ng/journal/v44/n12/full/ng.2435.html this paper] for an example.)
 +
 +
To add this fine-mapping track, you supply (as a plotting option) the fine-mapping set of credible SNPs as a file:
 +
 +
<syntaxhighlight lang="bash">
 +
locuszoom ... fineMap="my_finemapping_results.txt"
 +
</syntaxhighlight>
 +
 +
 +
The fine-mapping results file should be tab-delimited, with one row per fine-mapping SNP (for example, every SNP in the 95% credible set). Its columns are snp, chr, pos, pp, group (a descriptive label such as EUR/AMR/AFR), and color:
 +
 +
{| class="wikitable sortable"
 +
|-
 +
! scope="col" | snp
 +
! scope="col" | chr
 +
! scope="col" | pos
 +
! scope="col" | pp
 +
! scope="col" | group
 +
! scope="col" | color
 +
|-
 +
| rs1 || 18 || 55931115 || 0.88 || AMR || red
 +
|-
 +
| rs1 || 18 || 55920115 || 0.88 || AMR || red
 +
|-
 +
| rs1 || 18 || 55940115 || 0.88 || AMR || red
 +
|-
 +
| rs1 || 18 || 55930115 || 0.88 || EUR || blue
 +
|-
 +
| rs2 || 18 || 55940115 || 0.02 || EUR || blue
 +
|-
 +
| rs3 || 18 || 56000000 || 0.03 || AFR || green
 +
|-
 +
| rs4 || 18 || 56022000 || 0.03 || AFR || green
 +
|-
 +
| rs3 || 18 || 56100000 || 0.03 || ASN || purple
 +
|-
 +
| rs3 || 18 || 56150000 || 0.03 || ASN || purple
 +
|-
 +
| rs4 || 18 || 56160000 || 0.03 || ASN || purple
 +
|-
 +
| rs4 || 18 || 56180000 || 0.03 || ASN || purple
 +
|-
 +
|}
 +
 +
LocusZoom will extract from the file only those SNPs falling within the region to be plotted, so you can provide all of your fine-mapping results in a single file.
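
As a sketch of how such a file might be produced, the commands below write a small fine-mapping file and check that every line has six tab-separated fields. The rows are copied from the example table above and are illustrative only. The sketch assumes the header row shown in the table is expected in the file.

<syntaxhighlight lang="bash">
# Write a small fine-mapping file using the column layout from the table above.
# Assumption: the file starts with the header row shown in that table.
printf 'snp\tchr\tpos\tpp\tgroup\tcolor\n'      >  my_finemapping_results.txt
printf 'rs1\t18\t55930115\t0.88\tEUR\tblue\n'   >> my_finemapping_results.txt
printf 'rs3\t18\t56000000\t0.03\tAFR\tgreen\n'  >> my_finemapping_results.txt

# Sanity check: report any line that does not have exactly 6 tab-separated fields.
awk -F'\t' 'NF != 6 { print "line " NR ": expected 6 fields, found " NF; bad = 1 }
            END { exit bad }' my_finemapping_results.txt
</syntaxhighlight>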
 +
 +
 +
The generated plot will have a track showing the fine-mapping SNPs:
 +
 +
[[File:New lz finemap.png|900px]]
 +
 +
 +
If the R package '''gridExtra''' is installed, the PDF will also have a summary of each fine-mapping SNP:
 +
 +
[[File:New lz finemap summary.png|400px]]
 +
 +
 +
=== Labeling multiple SNPs ===
 +
 +
You can specify a file controlling the labels for either the reference SNP, or any other arbitrary SNP within the region. For example:
 +
 +
[[File:New lz denote markers.png|700px]]
 +
 +
Use the --denote-markers-file <file> argument to do this:
 +
 +
<syntaxhighlight lang="bash">
 +
locuszoom ... --denote-markers-file <your file>
 +
</syntaxhighlight>
 +
 +
The file looks like:
 +
 +
{|
 +
|-
 +
! scope="col" align="left" | snp
 +
! scope="col" align="left" | string
 +
! scope="col" align="left" | color
 +
|-
 +
| rs231362 || GWAS || blue
 +
|-
 +
| rs163184 || Conditional || purple
 +
|-
 +
|}
 +
 +
The file must be tab-delimited and must begin with a header row using exactly the column names shown above (snp, string, color).
    
== Output  ==
 
== Output  ==
Line 396: Line 683:  
| --markercol  
 
| --markercol  
 
| Name of the SNP column in the --metal file.
 
| Name of the SNP column in the --metal file.
 +
|-
 +
| --epacts
 +
| Provide a results file generated by [[EPACTS]] instead of a --metal file.
 
|-
 
|-
 
| --refsnp  
 
| --refsnp  
Line 416: Line 706:  
| --ld  
 
| --ld  
 
| Provide a file specifying LD between your reference SNP and all SNPs within the region you wish to plot. You only need to supply this file if you have created LD specifically for your purposes (perhaps a different population or genome build.) Otherwise, LD is computed automatically for you.
 
| Provide a file specifying LD between your reference SNP and all SNPs within the region you wish to plot. You only need to supply this file if you have created LD specifically for your purposes (perhaps a different population or genome build.) Otherwise, LD is computed automatically for you.
 +
|-
 +
| --ld-vcf
 +
| Use a VCF file to calculate LD between SNPs. This can be a VCF file with an entire genome of SNPs and does not have to be subsetted to your region. The VCF file must also have a tabix index file. For calculating D', the VCF must be phased.
 
|-
 
|-
 
| --source  
 
| --source  
Line 920: Line 1,213:  
| PLINK_PATH
 
| PLINK_PATH
 
| Path to the PLINK binary. Defaults to "plink", which searches for PLINK&nbsp;on your path. If it is not on your path, specify the full path here.  
 
| Path to the PLINK binary. Defaults to "plink", which searches for PLINK&nbsp;on your path. If it is not on your path, specify the full path here.  
 +
|-
 +
| RSCRIPT_PATH
 +
| Path to the Rscript binary. Defaults to "Rscript", which searches for Rscript&nbsp;on your path. If it is not on your path, specify the full path here.
 
|-
 
|-
 
| SQLITE_DB  
 
| SQLITE_DB  
Line 926: Line 1,222:  
| LD_DB  
 
| LD_DB  
 
| Contains a "tree" which maps a tuple of (genotype source, genotype population, genome build) to genotype files.
 
| Contains a "tree" which maps a tuple of (genotype source, genotype population, genome build) to genotype files.
 +
|-
 +
| GWAS_CATS
 +
| Contains a "tree" which maps genome build and the name of a GWAS catalog to the actual file containing the GWAS hits.
 
|}
 
|}
  