METAL Quick Start

From Genome Analysis Wiki

Learning by Example

Many people find it easiest to learn by example. So, if reading through our extensive documentation is not for you, you might find it more appealing to walk through a simple METAL analysis.

The Glucose Data

This example examines evidence for association between fasting glucose levels and genetic markers in the G6PC2 (chr. 2), GCK (chr. 7) and MTNR1B (chr. 11) regions. It uses results from three genome-wide association studies: FUSION, SardiNIA and DGI. Genetic variants in the three loci impact fasting glucose levels and, in the case of MTNR1B, also impact the risk of type 2 diabetes.

You can download a copy of the input files used in this example analysis from the METAL Download Page.

Input Files

Initially, each of the three studies analyzed association between fasting glucose levels and genotyped and imputed variants in each of the three loci. Study specific results are stored in the following files:

Input Files with Single Study Results
STUDY      INPUT FILE                    FILE SIZE
FUSION     MAGIC_FUSION_Results.txt.gz   46 kb
SardiNIA   magic_SARDINIA.tbl            188 kb
DGI        DGI_three_regions.txt         188 kb

Although some effort was made to harmonize analysis strategies (for example, by excluding individuals with a diagnosis of diabetes as well as other individuals with elevated fasting glucose levels), you will notice that the three files are formatted somewhat differently. To combine results across studies, one critical piece of information that METAL needs is the formatting used for each file. You probably also noticed that the FUSION input file is a bit smaller than those for the other studies. This is because it has been compressed with gzip (www.gzip.org). This is not a problem, because METAL can transparently handle gzip-compressed files.
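Before describing a file to METAL, it helps to look at its column labels. The following is a minimal Python sketch (not part of METAL) that reads the first line of a plain or gzip-compressed results file. The helper names open_text and peek_header are hypothetical, and the sketch assumes the first line is a whitespace- or tab-delimited header.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        # "rt" asks gzip to decompress and decode to text on the fly
        return gzip.open(path, "rt")
    return open(path, "r")


def peek_header(path):
    """Return the column labels found on the first line of a results file."""
    with open_text(path) as handle:
        # Assumption: labels are separated by tabs and/or spaces
        return handle.readline().split()
```

Calling peek_header on each of the three input files should list the labels that the MARKER, ALLELE, EFFECT and related commands need to refer to later in this walkthrough.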

Running METAL

If you haven't used METAL before, try starting the program from the Linux or Windows command prompt. By default, METAL runs in interactive mode and responds to each command you issue (by typing; unfortunately, there is no point-and-click interface).

For example, you can try issuing the HELP command, to which METAL responds by printing a list of available options. Or you could issue the command MARKERLABEL SNP to indicate that marker names are tabulated in a column labelled SNP. METAL would respond to this latter command by reporting:

## Set marker header to SNP ...

For convenience, many commands can be shortened. For example, instead of writing MARKERLABEL SNP, you could write MARKER SNP. If you make a mistake and METAL doesn't understand your command, it will usually say:

## ERROR: The command you issued could not be processed ...

Interactive mode is an easy way to learn METAL. Once you are familiar with the basic workings of the program, it will typically be better to store a series of commands in a METAL script, which can be conveniently edited and run multiple times. To run commands stored in a script, just specify the script name on the METAL command line. As with other command line programs, you can redirect screen output to a file using the > operator.

The METAL Script

We will now walk through the METAL Glucose Example Script. The first thing to know is that METAL scripts can include comments, which are indicated by a hash sign # as the first character in a line. Thus:

 # This is a comment.

Our example script starts with a series of comments, which we will ignore for now. Instead, we will proceed directly to the description of study specific input files -- which is an essential step in any meta-analysis with METAL.

Describing the DGI Input Files

METAL expects study specific results to be stored in a plain text tabular file, with one line per marker. While that sounds simple, there is a wide variety of ways to implement the details. Thus, the first step in any analysis is to specify these details for each study being analyzed. The Diabetes Genetics Initiative (DGI) results are described in the following snippet of METAL code:

MARKER   SNP
WEIGHT   N
ALLELE   EFFECT_ALLELE NON_EFFECT_ALLELE
FREQ     EFFECT_ALLELE_FREQ
EFFECT   BETA
STDERR   SE
PVAL     P_VAL

Each line specifies the header for a key column in the input. Here we specify that the marker name is stored in a column labelled SNP (with the MARKER SNP command). The number of individuals analyzed for each row, which can be used to weight the contribution of each study in a sample size and p-value based meta-analysis, is stored in a column labelled N (with the WEIGHT N command). The two allele labels are stored in columns labelled EFFECT_ALLELE and NON_EFFECT_ALLELE (with the ALLELE EFFECT_ALLELE NON_EFFECT_ALLELE command), and the allele frequency of the first of these alleles is stored in the column EFFECT_ALLELE_FREQ (with the FREQ EFFECT_ALLELE_FREQ command). The effect size is stored in a column labelled BETA (with the EFFECT BETA command), and the standard error and p-value are stored in columns labelled SE and P_VAL (with the STDERR SE and PVAL P_VAL commands).

Most of these columns are optional. For example, for a p-value and sample size based analysis only the columns with marker name, allele labels, sample size, p-value and direction of effect are required.

After we have described the structure of input files, we can load results for the DGI study into memory using the command:

 
PROCESS DGI_three_regions.txt

This is the expected output from the above series of commands:

## Set marker header to SNP ...
## Set weight header to N ...
## Set allele headers to EFFECT_ALLELE and NON_EFFECT_ALLELE ...
## Set frequency header to EFFECT_ALLELE_FREQ ...
## If you want frequencies to averaged, issue the 'AVERAGEFREQ ON' command
## Set effect header to BETA ...
## Set standard error header to SE ...
## Set p-value header to P_VAL ...
###########################################################################
## Processing file 'DGI_three_regions.txt'
## Processed 2369 markers ...

Describing the FUSION Input Files

To process the FUSION input files, we use a similar approach. Hopefully, the meaning of this next snippet of METAL code will be easier to decode:

 
# Describe and process the FUSION input files
MARKER   SNP
ALLELE   EFFECT_ALLELE NON_EFFECT_ALLELE
FREQ     FREQ_EFFECT
WEIGHT   N
EFFECT   BETA
STDERR   SE
PVAL     PVALUE
 
PROCESS MAGIC_FUSION_Results.txt.gz

The structure of the code is very similar to that used in processing the DGI file. One difference that more eagle-eyed readers might notice is that the FUSION results are stored in a file with the .gz extension, rather than a regular text file. This is okay, because METAL will automatically recognize and transparently handle .gz files.

Because many of the column labels are exactly the same as in the DGI input file, we might have shortened the code by writing:

 
# Describe and process the FUSION input files
FREQ     FREQ_EFFECT
PVAL     PVALUE
 
PROCESS MAGIC_FUSION_Results.txt.gz

Although this would work, it would be slightly riskier: if we now edit any of the column labels in the DGI section of the METAL script, we could unexpectedly prevent METAL from processing the FUSION input file.

This is the expected output from the above series of commands:

## Set marker header to SNP ...
## Set allele headers to EFFECT_ALLELE and NON_EFFECT_ALLELE ...
## Set frequency header to FREQ_EFFECT ...
## If you want frequencies to averaged, issue the 'AVERAGEFREQ ON' command
## Set weight header to N ...
## Set effect header to BETA ...
## Set standard error header to SE ...
## Set p-value header to PVALUE ...
###########################################################################
## Processing file 'MAGIC_FUSION_Results.txt.gz'
## Processed 2293 markers ...

Describing the SardiNIA Input Files

The last step is to describe and load the SardiNIA study results. This is achieved with the code:

# Describe and process the SardiNIA input files
MARKER   SNP
DEFAULT  4106
ALLELE   AL1 AL2
FREQ     FREQ1
EFFECT   EFFECT
STDERR   SE
PVAL     PVALUE
 
PROCESS magic_SARDINIA.tbl

In contrast to the previous files, the unique thing about the SardiNIA input files is that they include no column detailing the number of individuals analyzed for each marker. Instead, all missing genotypes were filled in using genotype imputation, and exactly 4,106 individuals were analyzed in each row. This is specified with the DEFAULT 4106 command.

This is the expected output from the above series of commands:

## Set marker header to SNP ...
## Set default weight to 4106.00 ...
## Set allele headers to AL1 and AL2 ...
## Set frequency header to FREQ1 ...
## If you want frequencies to averaged, issue the 'AVERAGEFREQ ON' command
## Set effect header to EFFECT ...
## Set standard error header to SE ...
## Set p-value header to PVALUE ...
###########################################################################
## Processing file 'magic_SARDINIA.tbl'
## WARNING: No 'N' column found -- using DEFAULTWEIGHT = 4106
## Processed 2361 markers ...
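The three study descriptions above follow the same pattern, so one way to avoid accidentally inheriting column labels from a previous study is to generate the script from a table of settings. The following Python sketch is only an illustration and is not part of METAL. The function names describe_study and build_script and the STUDIES list are hypothetical, while the METAL commands and column labels are the ones used in this walkthrough.

```python
def describe_study(path, marker, alleles, freq, effect, stderr, pval,
                   weight=None, default_weight=None):
    """Return METAL commands that describe and then process one results file."""
    lines = [f"MARKER   {marker}"]
    if weight is not None:
        lines.append(f"WEIGHT   {weight}")
    if default_weight is not None:
        lines.append(f"DEFAULT  {default_weight}")
    lines += [
        f"ALLELE   {alleles[0]} {alleles[1]}",
        f"FREQ     {freq}",
        f"EFFECT   {effect}",
        f"STDERR   {stderr}",
        f"PVAL     {pval}",
        "",
        f"PROCESS {path}",
    ]
    return "\n".join(lines)


# Settings copied from the three script sections shown in this tutorial
STUDIES = [
    dict(path="DGI_three_regions.txt", marker="SNP", weight="N",
         alleles=("EFFECT_ALLELE", "NON_EFFECT_ALLELE"),
         freq="EFFECT_ALLELE_FREQ", effect="BETA", stderr="SE", pval="P_VAL"),
    dict(path="MAGIC_FUSION_Results.txt.gz", marker="SNP", weight="N",
         alleles=("EFFECT_ALLELE", "NON_EFFECT_ALLELE"),
         freq="FREQ_EFFECT", effect="BETA", stderr="SE", pval="PVALUE"),
    dict(path="magic_SARDINIA.tbl", marker="SNP", default_weight=4106,
         alleles=("AL1", "AL2"),
         freq="FREQ1", effect="EFFECT", stderr="SE", pval="PVALUE"),
]


def build_script(studies):
    """Join every study block and finish with the ANALYZE command."""
    blocks = [describe_study(**study) for study in studies]
    return "\n\n".join(blocks) + "\n\nANALYZE\n"
```

Writing the result of build_script(STUDIES) to a text file gives a script in which every study restates all of its column labels, so editing one study cannot silently change how another is read.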

Executing the Meta-Analysis

Once we have loaded all input files into memory, we are ready to execute the meta-analysis and store relevant results to disk. To do this, simply tell METAL:

ANALYZE

If you are interested in the possibility of between study heterogeneity, use the ANALYZE HETEROGENEITY command instead.

This is the expected output from the ANALYZE command:

###########################################################################
## Executing meta-analysis ...
## Complete results will be stored in file 'METAANALYSIS1.TBL'
## Column descriptions will be stored in file 'METAANALYSIS1.TBL.info'
## Completed meta-analysis for 2495 markers!
## Smallest p-value is 1.491e-12 at marker 'rs560887'

Note that results are stored in files labeled 'METAANALYSIS1.TBL' and 'METAANALYSIS1.TBL.info'. If you load data for additional studies and repeat the meta-analysis command, those results will be stored in files named 'METAANALYSIS2.TBL' and 'METAANALYSIS2.TBL.info'. You can change default output file names with the OUTFILE command.
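To make the sample size and p-value based analysis more concrete, here is a small Python sketch of the sample size weighted combination as it is usually described for METAL: each study's p-value and direction of effect are converted to a signed z-score, weighted by the square root of its sample size, and summed. This is an illustration under that assumption, not METAL's own code, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def signed_z(p_value, direction):
    """Convert a two-sided p-value and a direction (+1 or -1) to a signed z-score."""
    magnitude = -NormalDist().inv_cdf(p_value / 2.0)  # always >= 0
    return math.copysign(magnitude, direction)


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    # erfc keeps precision in the far tail, where 1 - cdf would round to zero
    return math.erfc(abs(z) / math.sqrt(2.0))


def combine(p_values, directions, sample_sizes):
    """Return (Z, P) with Z = sum(w * z) / sqrt(sum(w^2)) and w = sqrt(N)."""
    weights = [math.sqrt(n) for n in sample_sizes]
    zs = [signed_z(p, d) for p, d in zip(p_values, directions)]
    numerator = sum(w * z for w, z in zip(weights, zs))
    denominator = math.sqrt(sum(w * w for w in weights))
    z = numerator / denominator
    return z, two_sided_p(z)
```

Two studies of equal size that agree in direction reinforce each other, while two equally sized studies with the same p-value but opposite directions cancel to a combined z-score of zero.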

Meta-Analysis Results

Meta-analysis results are stored in the file 'METAANALYSIS1.TBL'. By default, this file isn't sorted, but you should be able to sort it in a variety of ways. Here are the 10 markers with the smallest p-values, as obtained with the UNIX sort command:

Top 10 Meta-Analysis Results
MarkerName  Allele1  Allele2  Weight  Zscore  P-value    Direction
rs560887    t        c        6806    -7.075  1.491e-12  ---
rs853787    t        g        6806    6.691   2.221e-11  +++
rs853789    a        g        5339    -6.597  4.189e-11  ?--
rs853773    a        g        6806    -6.132  8.662e-10  ---
rs537183    t        c        6806    6.007   1.887e-9   +++
rs557462    t        c        6806    6.005   1.917e-9   +++
rs502570    a        g        6806    -6.001  1.955e-9   ---
rs563694    a        c        6806    5.975   2.300e-9   +++
rs475612    t        c        6806    -5.867  4.423e-9   ---
rs853781    a        g        6806    -5.844  5.092e-9   ---
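If the UNIX sort command is not available, the same ranking can be sketched in a few lines of Python. The function name top_hits is hypothetical, and the sketch assumes a header line followed by whitespace- or tab-delimited rows, with a column named P-value as in the table above.

```python
def top_hits(table_text, n=10):
    """Return the n rows with the smallest p-values from a METAL-style table."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    header = lines[0].split()
    # One dictionary per marker, keyed by the column labels in the header
    rows = [dict(zip(header, line.split())) for line in lines[1:]]
    # Compare p-values numerically; sorting them as text would misplace exponents
    rows.sort(key=lambda row: float(row["P-value"]))
    return rows[:n]
```

For example, after reading the contents of 'METAANALYSIS1.TBL' into a string, top_hits(text) returns the ten strongest signals as a list of dictionaries.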


One good way to understand the contents of this file is to look at the accompanying .info file. This file describes the contents of each column.

# This file contains a short description of the columns in the
# meta-analysis summary file, named 'METAANALYSIS1.TBL'

# Marker - this is the marker name
# Allele1 - the first allele for this marker in the first file where it occurs
# Allele2 - the second allele for this marker in the first file where it occurs
# Weight - the sum of the individual study weights (typically, N) for this marker
# Z-score - the combined z-statistic for this marker
# P-value - meta-analysis p-value
# Direction - summary of effect direction for each study, with one '+' or '-' per study

# Input for this meta-analysis was stored in the files:
# --> Input File 1 : DGI_three_regions.txt
# --> Input File 2 : MAGIC_FUSION_Results.txt.gz
# --> Input File 3 : magic_SARDINIA.tbl

In this case, a few interesting patterns are worth noting. First, marker rs560887 (near G6PC2) exhibits the strongest evidence for association. Second, you will notice that nearly all markers that make this top ten list have been examined in a total of 6,806 individuals and exhibit consistent directions of effect across studies (these are noted by the pattern of pluses (+) and minuses (-) in the last column). The exception is marker rs853789, which was not examined in the DGI study and thus has a smaller corresponding weight and a question mark (?) in the first entry of the direction of effect column. When analyzing very large numbers of samples, the direction of effect column allows you to check, at a glance, whether all studies support your strongest findings. If you find that one study consistently suggests an opposite direction of effect for your strongest signals, that might suggest inconsistent labeling of alleles 1 and 2 for that study.
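The check for a study that consistently disagrees with the others can be sketched in Python as follows. The function name discordant_counts is hypothetical. The sketch takes the Direction strings of your top markers and counts, for each study position, how often that study opposes the majority sign.

```python
from collections import Counter


def discordant_counts(directions):
    """Count, per study position, how often that study opposes the majority sign."""
    counts = Counter()
    for direction in directions:
        plus, minus = direction.count("+"), direction.count("-")
        if plus == minus:
            continue  # no majority (or no data), so nothing to compare against
        majority = "+" if plus > minus else "-"
        for position, symbol in enumerate(direction):
            # '?' means the marker was not examined in that study, so skip it
            if symbol in "+-" and symbol != majority:
                counts[position] += 1
    return dict(counts)
```

Positions are numbered from zero in the order the input files were processed. A position with a high count relative to the number of markers examined is a candidate for inconsistent allele labeling.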

Refining Your Meta-Analysis

Here we briefly review some of the possibilities for tweaking the meta-analysis.

Effect Size and Standard Error Based Analysis

If effect sizes are reported consistently across studies, you can execute an effect size and standard error based analysis by uncommenting the command SCHEME STDERR at the beginning of the METAL input file. Look for the section that reads:

# Meta-analysis weighted by standard error does not work well
# when different studies used very different transformations.
# In this case, some attempt was made to use similar trait
# transformation and you can request a standard error based
# analysis by uncommenting the following line:
# SCHEME   STDERR
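For reference, an effect size and standard error based analysis is commonly implemented as an inverse variance weighted fixed effects combination. The Python sketch below illustrates that standard calculation. It is an assumption about what the scheme computes rather than METAL's own code, and the function name is hypothetical.

```python
import math


def inverse_variance_meta(betas, std_errs):
    """Combine per-study effects with weights w = 1 / SE^2.

    Returns the pooled effect, its standard error and the resulting z-score.
    """
    weights = [1.0 / (se * se) for se in std_errs]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se
```

Because each study is weighted by the precision of its estimate, this approach only makes sense when the effect sizes are on the same scale in every study, which is why consistent trait transformations matter here.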

Track Allele Frequency Information

To track information about allele frequencies across studies, you can execute the commands AVERAGEFREQ ON and MINMAXFREQ ON at the beginning of the analysis. If you find markers with very wide allele frequency ranges (for example, with allele frequency of ~0.05 in one study but ~0.95 in others), that can suggest inconsistencies in allele labeling across studies.

# To help identify allele flips, it can be useful to track
# allele frequencies in the meta-analysis. To enable this
# capability, uncomment the following two lines.
# AVERAGEFREQ ON
# MINMAXFREQ ON
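The idea behind this check can be sketched in Python, working directly from per-study allele frequencies. The function name, the marker names in any example input and the threshold of 0.5 are hypothetical choices for illustration, not METAL defaults.

```python
def frequency_range_flags(freqs_by_marker, max_range=0.5):
    """Return markers whose allele frequency range across studies exceeds max_range."""
    flagged = {}
    for marker, freqs in freqs_by_marker.items():
        # None marks a study in which the marker was not examined
        observed = [f for f in freqs if f is not None]
        if len(observed) >= 2 and max(observed) - min(observed) > max_range:
            flagged[marker] = (min(observed), max(observed))
    return flagged
```

A marker with frequencies of 0.05, 0.95 and 0.94 across three studies would be flagged, which is the pattern expected when one study labels the alleles the other way round.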

Extracting Study by Study Results For Individual SNPs

Sometimes, it is desirable to tabulate results from individual studies in a consistent format. If you enable the VERBOSE option, METAL outputs details of the key information it extracts from each study. Although you could enable this option for all SNPs, that typically generates unmanageable amounts of output. Instead, you should combine this option with the METAL filter command, which lets you focus on specific rows in each input file. In this case, we will focus on two SNPs that are the focus of the Chen et al. and Prokopenko et al. papers.

The key section of the script is the following one:

# To restrict meta-analysis to two previously reported SNPs
# and summarize study specific results, uncomment the two
# lines that follow.
# ADDFILTER SNP IN (rs10830963,rs563694)
# VERBOSE ON

To see how the two options work together, first uncomment the last two lines of the script and save the script as "metal-targeted-analysis-script.txt". Next, run METAL:

prompt> metal metal-targeted-analysis-script.txt > metal-targeted-analysis-script.log

Now, to review results for each of the markers, use:

prompt> grep "rs10830963" metal-targeted-analysis-script.log
prompt> grep "rs563694" metal-targeted-analysis-script.log

Acknowledgements

Thanks to colleagues in the FUSION, DGI and SardiNIA studies for sharing study results for the MTNR1B, G6PC2 and GCK regions. Special thanks also to Josee Dupuis, at Boston College, for helping prepare scripts and input files for this worked example.

Comments and suggestions on this tutorial are always welcome but, since this is a wiki, you can also help by creating an account and contributing your own edits (all thoughtful contributions, ranging from small typo fixes to additional examples are welcome).