SeqShop: Genetic Association Analysis Practical, June 2014

From Genome Analysis Wiki
Jump to: navigation, search

Note: the latest version of this practical is available at: SeqShop: Genetic Association Analysis Practical

  • The ones here is the original one from the June workshop (updated to be run from elsewhere)


Introduction

View Lecture Slides

View Introductory Slides for Practical Session

Goals of This Session

  • Understand how to annotate variants using EPACTS
  • Understand how to run single variant association analysis using EPACTS
  • Understand how to run rare variant association test using EPACTS
  • Understand how to visualize the association output from EPACTS

Setup in person at the SeqShop Workshop

This section is specifically for the SeqShop Workshop computers.

If you are not running during the SeqShop Workshop, please skip this section.

Login to the seqshop-server Linux Machine

This section will appear redundantly in each session. If you are already logged in or know how to log in to the server, please skip this section

  1. Login to the windows machine
    • The username/password for the Windows machine should be written on the right-hand monitor
  2. Start xming so you can open external windows on our Linux machine
    • Start->Enter "Xming" in the search and select "Xming" from the program list
    • Nothing will happen, but Xming was started.
    • View Screenshot
    • Xming.png

  3. Open putty
    • Start->Enter "putty" in the search and select "PuTTY" from the program list
    • View Screenshot
    • PuttyS.png

  4. Configure PuTTY in the PuTTY Configuration window
    • Host Name: seqshop-server.sph.umich.edu
    • View Screenshot
    • Seqshop.png

    • Setup to allow you to open external windows:
      • In the left pannel: Connection->SSH->X11
        • Add a check mark in the box next to Enable X11 forwarding
        • View Screenshot
        • SeqshopX11.png

    • Click Open
    • If it prompts about a key, click OK
  5. Enter your provided username & password as provided


You should now be logged into a terminal on the seqshop-server and be able to access the test files.

  • If you need another terminal, repeat from step 3.

Login to the seqshop Machine

So you can each run multiple jobs at once, we will have you run on 4 different machines within our seqshop setup.

  • You can only access these machines after logging onto seqshop-server

3 users logon to:

ssh -X seqshop1

3 users logon to:

ssh -X seqshop2

2 users logon to:

ssh -X seqshop3

2 users logon to:

ssh -X seqshop4

Setup your run environment

This is the same setup you did for the previous tutorial, but you need to redo it each time you log in.

This will setup some environment variables to point you to

  • GotCloud program
  • Tutorial input files
  • Setup an output directory
    • It will leave your output directory from the previous tutorial in tact.
source /home/hmkang/seqshop/setup.txt
  • You won't see any output after running source
    • It silently sets up your environment
    • If you want to view the detail of the setup, type
less /home/mktrost/seqshop/setup.txt

and press 'q' to finish.

View setup.txt

export GC=/home/hmkang/seqshop/gotcloud
export IN=/home/hmkang/seqshop/inputs
export REF=/home/hmkang/seqshop/reference/chr22
export VTREF=/home/hmkang/seqshop/reference/vtRef
export SV=/home/hmkang/seqshop/reference/svtoolkit
export EXT=/home/hmkang/seqshop/external
export EPACTS=/home/hmkang/seqshop/epacts
export OUT=~/out
mkdir -p ${OUT}


Setup when running on your own outside of the SeqShop Workshop

This section is specifically for running on your own outside of the SeqShop Workshop.

If you are running during the SeqShop Workshop, please skip this section.

This tutorial builds on the alignment & snpcall tutorials, if you have not already, please first run those tutorials: Alignment Tutorial & SNP Calling Tutorial

Download & Build EPACTS

If you do not already have EPACTS:

  • cd to where you want EPACTS installed (you can change this to any directory you want)
mkdir -p ~/seqshop
cd ~/seqshop/
  • download, decompress, and build the version of epacts that was tested with this tutorial:
wget http://csg.sph.umich.edu//kang/epacts/download/EPACTS-3.2.6.tar.gz
tar xvf EPACTS-3.2.6.tar.gz
cd EPACTS-3.2.6
./configure --prefix=$HOME/seqshop/epacts
make
make install
cd ../..

Setup your run environment

Environment variables will be used throughout the tutorial.

We recommend that you setup these variables so you won't have to modify every command in the tutorial.

  1. Point to where you installed GotCloud
  2. Point to where you installed the seqshop files
  3. Point to where you want the output to go
Using bash (replace the paths below with the appropriate paths):
export GC=~/seqshop/gotcloud
export SS=~/seqshop/example
export OUT=~/seqshop/output
Using tcsh (replace the paths below with the appropriate paths):
setenv GC ~/seqshop/gotcloud
setenv SS ~/seqshop/example
setenv OUT ~/seqshop/output
  • Additional variables for EPACTS:
    • Using bash (replace the paths below with the appropriate paths):
    • export EPACTS=~/seqshop/epacts
    • Using tcsh (replace the paths below with the appropriate paths):
    • setenv EPACTS ~/seqshop/epacts

Preparing Input Files

Input VCF file

We will use SNP genotypes from the SNP calling session, after LD-aware genotype refinement. Check the contents of the VCF file using the following command.

zless ${OUT}/thunder/chr22/ALL/thunder/chr22.filtered.PASS.beagled.ALL.thunder.vcf.gz

Phenotype Information

Phenotype information is prepared in PED format commonly used in other GWAS software such as MERLIN or PLINK.

less ${SS}/assoc/seqshop.ped

The first several line should look like below.

View Data

#FAM_ID	IND_ID	DAD_ID	MOM_ID	SEX	PHENO
HG00551	HG00551	0	0	0	0
HG00553	HG00553	0	0	0	0
HG00554	HG00554	0	0	0	0
HG00637	HG00637	0	0	0	0
HG00638	HG00638	0	0	0	0
HG00640	HG00640	0	0	0	1
HG00641	HG00641	0	0	0	1
HG00734	HG00734	0	0	0	1
HG00736	HG00736	0	0	0	0
...

Binary phenotype can be encoded as 0-1 or 1-2. If the column contains more than two distinct values, it will automatically be recognized as quantitative values.

EPACTS allows PED file to have a header line. The header line should contain the description of each column. EPACTS also accepts a standard PED format where .ped file contains the phenotype data and .dat file contains the information about each column.

Installed version of EPACTS

EPACTS are installed in the server. If you want to install EPACTS by yourself, visit EPACTS page for more details

ls $EPACTS/bin

View EPACTS executable files

anno   epacts       epacts-cis-extract  epacts-group       epacts-multi     epacts.pm      epstopdf  test_run_epacts.sh
bgzip  epacts-anno  epacts-download     epacts-make-group  epacts-pca-plot  epacts-single  pEmmax    vcfast
chaps  epacts-cat   epacts-enrich       epacts-make-kin    epacts-plot      epacts-zoom    tabix     wGetOptions.pm

Note that some tools undocumented in EPACTS documentation is under development and may not work.

Annotating Variants with EPACTS

There are multiple software tools that provides a function to annotate variants, such as Variant Effect Predictor (VEP) that is used in 1000 Genomes Project. While most annotation software provides very similar results to each other, their computational efficiency can substantially vary. The annotation software EPACTS provides is extremely fast and can provide genome-wide annotation results in orders of magnitude faster than other widely available annotation software.

In order to annotate variants with EPACTS, one can use epacts-anno module.

mkdir --p $OUT/assoc
$EPACTS/bin/epacts-anno --in $OUT/thunder/chr22/ALL/thunder/chr22.filtered.PASS.beagled.ALL.thunder.vcf.gz --out $OUT/assoc/snps.anno.vcf.gz --ref $SS/ref22/human.g1k.v37.chr22.fa

Then you will see a series of messages before annotation finishes.

View the expected messages

/home/hmkang/seqshop/epacts/bin/anno -i /net/seqshop-server/hmkang/out/thunder/chr22/ALL/thunder/chr22.filtered.PASS.beagled.ALL.thunder.vcf.gz -r \
   /home/hmkang/seqshop/ref22/human_g1k_v37.chr22.fa -f refGene -g /home/hmkang/seqshop/epacts/share/EPACTS/hg19_gencodeV14.txt.gz \
  -c /home/hmkang/seqshop/epacts/share/EPACTS/codon.txt -o  /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz --inputFormat vcf \
  -p /home/hmkang/seqshop/epacts/share/EPACTS/priority.txt
The following parameters are available.  Ones with "[]" are in effect:

Available Options
    Required Parameters :
                         -i [/net/seqshop-server/hmkang/out/thunder/chr22/ALL/thunder/chr22.filtered.PASS.beagled.ALL.thunder.vcf.gz]
                         -o [/net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz]
         Gene Annotation
            Parameters : -g [/home/hmkang/seqshop/epacts/share/EPACTS/hg19_gencodeV14.txt.gz]
                         -r [/home/hmkang/seqshop/ref22/human_g1k_v37.chr22.fa]
                         --inputFormat [vcf], --checkReference, -f [refGene]
                         -p [/home/hmkang/seqshop/epacts/share/EPACTS/priority.txt]
                         -c [/home/hmkang/seqshop/epacts/share/EPACTS/codon.txt]
                         -u [], -d [], --se [], --si [], --outputFormat []
 Other Annotation Tools : --genomeScore [], --bed [], --tabix []
Load reference genome /home/hmkang/seqshop/ref22/human_g1k_v37.chr22.fa...
DONE: 1 chromosomes and 51304566 bases are loaded.
Load codon file /home/hmkang/seqshop/epacts/share/EPACTS/codon.txt...
DONE: codon file loaded.
Load priority file /home/hmkang/seqshop/epacts/share/EPACTS/priority.txt...
DONE: 24 priority annotation types loaded.
Load gene file /home/hmkang/seqshop/epacts/share/EPACTS/hg19_gencodeV14.txt.gz...
DONE: 92627 gene loaded.
DONE: Generated frequency of each annotype type in [ /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz.anno.frq ].
DONE: Generated frequency of each highest priority annotation type in [ /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz.top.anno.frq ].
Ts/Tv ratio: 2.35733
Ts observed: 2718  times; Tv observed: 1153 times.
DONE: Generated frequency of each base change in [ /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz.base.frq ].
DONE: Generated frequency of each codon change in [ /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz.codon.frq ].
DONE: Generated frequency of indel length in [ /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz.indel.frq ].
..............................................         
 ...      Anno(tation)                       ...       
 ...      Xiaowei Zhan, Goncalo Abecasis     ...      
  ...      Speical Thanks:                    ...     
   ...      Hyun Ming Kang, Yanming Li         ...    
    ...      zhanxw@umich.edu                    ...  
     ...      Sep 2011                            ... 
      ................................................
                                                      
DONE: 3871 varaints are annotated.
DONE: Generated annotation output in [ /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz ].
Annotation succeed!
mv /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz /net/seqshop-server/hmkang/out/thunder/chr22/ALL/thunder/chr22.filtered.PASS.beagled.ALL.thunder.vcf.gz.tmp
/home/hmkang/seqshop/epacts/bin/bgzip -c /net/seqshop-server/hmkang/out/thunder/chr22/ALL/thunder/chr22.filtered.PASS.beagled.ALL.thunder.vcf.gz.tmp > /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz
/home/hmkang/seqshop/epacts/bin/tabix -pvcf -f /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz
rm /net/seqshop-server/hmkang/out/thunder/chr22/ALL/thunder/chr22.filtered.PASS.beagled.ALL.thunder.vcf.gz.tmp
rm /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz.log /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz.top.anno.frq /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz.anno.frq /net/seqshop- server/hmkang/out/assoc/snps.anno.vcf.gz.base.frq /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz.codon.frq /net/seqshop-server/hmkang/out/assoc/snps.anno.vcf.gz.indel.frq

After running annotation, you can check the annotation results. Let's look at the APOL g1 risk allele we manually examined in the SNP calling section.

$GC/bin/tabix $OUT/assoc/snps.anno.vcf.gz 22:36661906 | head -1 | cut -f 1-8

View the annotation results

22	36661906	.	A	G	18	PASS	DP=409;MQ=59;NS=62;AN=124;AC=2;AF=0.013827;AB=0.4065;AZ=-0.5287;FIC=-0.0092;
            SLRT=-0.0075;HWEAF=0.0138;HWDAF=0.0276,0.0000;LBS=36,36,0,0,1,1,0,0;OBS=145,191,0,0,3,2,0,0;STR=-0.040;
            STZ=-0.740;CBR=0.008;CBZ=0.144;IOR=0.000;IOZ=-1.370;AOI=-5.614;AOZ=-4.243;LQR=0.178;MQ0=0.000;MQ10=0.000;MQ20=0.000;
            MQ30=0.000;SVM=1.51214;BAVGPOST=0.998;BRSQ=0.941;LDAF=0.0161;AVGPOST=1.0000;RSQ=1.0000;ERATE=0.0019;THETA=0.0013;
           ANNO=Nonsynonymous:APOL1;ANNOFULL=APOL1/ENST00000397278.3:+:Nonsynonymous(AGC/Ser/S->GGC/Gly/G:Base1025/1197:Codon342/399:Exon6/6):Exon|
           APOL1/ENST00000426053.1:+:Nonsynonymous(AGC/Ser/S->GGC/Gly/G:Base971/1143:Codon324/381:Exon5/5):Exon|
           APOL1/ENST00000422706.1:+:Nonsynonymous(AGC/Ser/S->GGC/Gly/G:Base1025/1197:Codon342/399:Exon6/6):Exon|
           APOL1/ENST00000319136.4:+:Nonsynonymous(AGC/Ser/S->GGC/Gly/G:Base1073/1245:Codon358/415:Exon7/7):Exon|
           APOL1/ENST00000347595.7:+:Nonsynonymous(AGC/Ser/S->GGC/Gly/G:Base662/834:Codon221/278:Exon3/3):Exon|
           APOL1/ENST00000397279.4:+:Nonsynonymous(AGC/Ser/S->GGC/Gly/G:Base1025/1197:Codon342/399:Exon6/7):Exon
  • What is the function of this variant?
  • How many different transcript does the variant overlap with?
  • How can you represent the variant in terms of amino acid changes?

Single Variant Association Analysis

Let's run a single-variant association analysis using a score test.

$EPACTS/bin/epacts-single --ped $SS/assoc/seqshop.ped --vcf $OUT/assoc/snps.anno.vcf.gz --out $OUT/assoc/single --region 22:36000000-37000000 --test b.score --pheno PHENO --run 2 

After running it, you will see EPACTS output files by looking at

ls $OUT/assoc

The top association results can be viewed by

head $OUT/assoc/single.epacts.top5000

View top association results

#CHROM	BEGIN	END	MARKER_ID	NS	AC	CALLRATE	MAF	PVALUE	SCORE	NS.CASE	NS.CTRL	AF.CASE	AF.CTRL
22	36995620	36995620	22:36995620_A/G	62	36	1	0.29032	5.6717e-09	5.8262	31	31	0.51613	0.064516
22	36993088	36993088	22:36993088_G/C	62	30	1	0.24194	7.3258e-07	4.9525	31	31	0.43548	0.048387
22	36997871	36997871	22:36997871_G/T	62	30	1	0.24194	7.3258e-07	4.9525	31	31	0.43548	0.048387
22	36987368	36987368	22:36987368_G/A	62	31	1	0.25	2.0898e-06	4.7445	31	31	0.43548	0.064516
22	36987861	36987861	22:36987861_A/G	62	31	1	0.25	2.0898e-06	4.7445	31	31	0.43548	0.064516
22	36985499	36985499	22:36985499_C/T	62	29	1	0.23387	5.7389e-06	4.5358	31	31	0.40323	0.064516
22	36978260	36978260	22:36978260_G/T	62	28	1	0.22581	1.5051e-05	4.3279	31	31	0.3871	0.064516
22	36998907	36998907	22:36998907_C/T	62	61	1	0.49194	0.00015557	-3.782	31	31	0.30645	0.67742
22	36667082	36667082	22:36667082_T/G	62	28	1	0.22581	0.0003506	-3.5747	31	31	0.080645	0.37097

You can look also visualize the results by QQ-plot and Manhattan plot

View QQ plots

evince $OUT/assoc/single.epacts.qq.pdf&

Single.epacts.qq.png

View Manhattan plots

evince $OUT/assoc/single.epacts.mh.pdf&

Single.epacts.mh.png

Also, you can create a zoom plot focusing on the region of interest

$EPACTS/bin/epacts-zoom --vcf $OUT/assoc/snps.anno.vcf.gz --pos 22:36995620 --prefix $OUT/assoc/single

View Zoom Plots

evince $OUT/assoc/single.zoom.22.36995620.pdf&

Single.zoom.22.36995620.png

If you want to run EMMAX, you first need to create a kinship matrix

$EPACTS/bin/epacts-make-kin --vcf $OUT/assoc/snps.anno.vcf.gz --min-maf 0.01 --out $OUT/assoc/snps.anno.kinf --run 2 --chr 22

And run EMMAX test specifying the kinship matrix

$EPACTS/bin/epacts-single --ped $SS/assoc/seqshop.ped --vcf $OUT/assoc/snps.anno.vcf.gz --out $OUT/assoc/emmax --region 22:36000000-37000000 --test q.emmax --pheno PHENO --run 2 --kinf $OUT/assoc/snps.anno.kinf

Then the results may look similar to previous ones.

cat $OUT/assoc/emmax.epacts.top5000

Run Groupwise Test

To run group-wise test such as gene-level burden test, you need to create a marker group file. If your VCF is already annotated, you can create a group file by running

$EPACTS/bin/epacts make-group --vcf $OUT/assoc/snps.anno.vcf.gz --out $OUT/assoc/snps.anno.grp --nonsyn

The group file is simply a list of marker per group name, as shown below.

cat $OUT/assoc/snps.anno.grp 
APOL1	22:36655735_G/A	22:36657740_G/A	22:36661330_G/A	22:36661566_G/A	22:36661646_G/A	22:36661891_G/A	22:36661906_A/G
APOL2	22:36623731_T/C	22:36623920_G/A	22:36629466_T/A	22:36633107_C/A
APOL3	22:36537763_C/T	22:36537798_G/A	22:36556768_G/A	22:36556823_G/T
APOL4	22:36587154_G/T	22:36587202_G/A	22:36587223_G/T	22:36587346_C/T	22:36587511_C/T	22:36587704_T/C	22:36587886_C/T	22:36593714_G/A	22:36597744_A/C	22:36598049_C/G	22:36598058_T/C	22:36598081_A/T
APOL5	22:36122356_G/A	22:36122380_T/A	22:36122930_C/T	22:36123083_C/T	22:36124860_C/G
FOXRED2	22:36900271_T/C	22:36900806_A/G
MYH9	22:36681163_G/C	22:36684354_T/C	22:36710183_T/C
Metazoa_SRP	22:36711990_C/G
RBFOX2	22:36424450_A/C
RP4-633O19__A.1	22:36792162_G/A

If you have your own annotation, you can create your own burden test unit by modifying this file.

If you want to run a collapsing burden test (CMC), run the following command

$EPACTS/bin/epacts group --ped $SS/assoc/seqshop.ped --vcf $OUT/assoc/snps.anno.vcf.gz --out $OUT/assoc/group.collapse --test b.collapse --groupf $OUT/assoc/snps.anno.grp --pheno PHENO --run 2 

You can view the results by examining the output file

cat $OUT/assoc/group.collapse.epacts

View Output file

#CHROM	BEGIN	END	MARKER_ID	NS	FRAC_WITH_RARE	NUM_ALL_VARS	NUM_PASS_VARS	NUM_SING_VARS	PVALUE	STATRHO
22	36655735	36661906	22:36655735-36661906_APOL1	62	0.14516	7	4	0	0.42748	1
22	36623731	36633107	22:36623731-36633107_APOL2	62	0.064516	4	1	0	0.038657	NA
22	36537763	36556823	22:36537763-36556823_APOL3	62	0.080645	4	2	0	0.40634	0
22	36587154	36598081	22:36587154-36598081_APOL4	62	0.14516	12	4	0	0.67891	0
22	36122356	36124860	22:36122356-36124860_APOL5	62	0.1129	5	2	0	0.15055	0.3
22	36900271	36900806	22:36900271-36900806_FOXRED2	NA	NA	2	0	0	NA	NA
22	36681163	36710183	22:36681163-36710183_MYH9	62	0.032258	3	1	0	1	NA
22	36711990	36711990	22:36711990-36711990_Metazoa_SRP	NA	NA	1	0	0	NA	NA
22	36424450	36424450	22:36424450-36424450_RBFOX2	62	0.032258	1	1	0	1	NA
22	36792162	36792162	22:36792162-36792162_RP4-633O19__A.1	NA	NA	1	0	0	NA	NA

You can run SKAT-O test in a similar way, but with a special tag

$EPACTS/bin/epacts group --ped $SS/assoc/seqshop.ped --vcf $OUT/assoc/snps.anno.vcf.gz --out $OUT/assoc/group.skato --test skat --skat-o --groupf $OUT/assoc/snps.anno.grp --pheno PHENO --run 2

And view output files

cat $OUT/assoc/group.skato.epacts

View Output file

#CHROM	BEGIN	END	MARKER_ID	NS	FRAC_WITH_RARE	NUM_ALL_VARS	NUM_PASS_VARS	NUM_SING_VARS	PVALUE	STATRHO
22	36655735	36661906	22:36655735-36661906_APOL1	62	0.14516	7	4	0	0.42748	1
22	36623731	36633107	22:36623731-36633107_APOL2	62	0.064516	4	1	0	0.038657	NA
22	36537763	36556823	22:36537763-36556823_APOL3	62	0.080645	4	2	0	0.40634	0
22	36587154	36598081	22:36587154-36598081_APOL4	62	0.14516	12	4	0	0.67891	0
22	36122356	36124860	22:36122356-36124860_APOL5	62	0.1129	5	2	0	0.15055	0.3
22	36900271	36900806	22:36900271-36900806_FOXRED2	NA	NA	2	0	0	NA	NA
22	36681163	36710183	22:36681163-36710183_MYH9	62	0.032258	3	1	0	1	NA
22	36711990	36711990	22:36711990-36711990_Metazoa_SRP	NA	NA	1	0	0	NA	NA
22	36424450	36424450	22:36424450-36424450_RBFOX2	62	0.032258	1	1	0	1	NA
22	36792162	36792162	22:36792162-36792162_RP4-633O19__A.1	NA	NA	1	0	0	NA	NA