Changes

606 bytes removed , 12:35, 10 January 2014

no edit summary

Line 7: Line 7:

GotCloud and this basic tutorial were presented at the [http://ibg.colorado.edu/dokuwiki/doku.php?id=workshop:2013:announcement 2013 IBG Workshop] in two sessions. On Wednesday, an overview was presented with steps for running the tutorial data: [[Media:IBG2013GotCloud.pdf|IBG2013GotCloud.pdf]]. On Friday, more detail was presented on the input files and how they are generated: [[Media:GotCloudIBGWorkshop2013Friday.pdf|GotCloudIBGWorkshop2013Friday.pdf]].

+

'''This tutorial is in the process of being updated for gotcloud version 1.08 (July 30, 2013).'''

== STEP 1 : Setup GotCloud ==

Line 12: Line 14:

[[GotCloud]] has been developed and tested on Linux Ubuntu 12.10 and 12.04.2 LTS, but has not been tested on other Linux operating systems. It is not available for Windows. If you do not have your own set of machines to run on, GotCloud is also available for Ubuntu running on the Amazon Elastic Compute Cloud; see [[Amazon_Snapshot]] for more information.

+

We will use 3 different directories for this tutorial:

+

# path to the directory where gotcloud is installed, default ~/gotcloud-latest/

+

# path to the directory where the example data is installed, default ~/gotcloudExample

+

# path to your output directory, default ~/gotcloudTutorialOut/

−

~~=== Step 1a: Setup Environment ===~~

+

If the directories specified above do not reflect the directories you would like to use, replace their occurrences in the instructions below with the appropriate paths.
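Before starting, it can help to confirm which of the three directories already exist. The following is a minimal sketch, assuming a POSIX shell; the function name check_tutorial_dirs is hypothetical and is not part of GotCloud. It defaults to the three paths listed above and accepts your own paths as arguments.

```shell
# Hypothetical helper (not part of GotCloud): report whether each of the
# three tutorial directories exists. Defaults are the paths listed above.
check_tutorial_dirs() {
    install_dir="${1:-$HOME/gotcloud-latest}"
    example_dir="${2:-$HOME/gotcloudExample}"
    out_dir="${3:-$HOME/gotcloudTutorialOut}"
    missing=0
    for dir in "$install_dir" "$example_dir" "$out_dir"; do
        if [ -d "$dir" ]; then
            echo "found:   $dir"
        else
            echo "missing: $dir"
            missing=$((missing + 1))
        fi
    done
    # The return value is the number of missing directories (0 means all exist).
    return "$missing"
}
```

Calling check_tutorial_dirs with no arguments checks the default paths. The output directory may be reported as missing before the first pipeline run.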

−

~~We will use 3 different directories for this tutorial and will use the following variables to stand for these directories:~~

−

~~# $GCHOME : path to the directory where gotcloud is installed~~

−

~~# $GCDATA : path to the directory where the example data is installed~~

−

~~# $GCOUT : path to your output directory~~

−

~~The following steps will set these environment variables in your Linux terminal allowing you to type $GCHOME to specify the path to gotcloud instead of having to type the entire path (absolute path).~~

−

~~Note: These settings will be local to that specific terminal. If you open a new terminal you will need to set the variables again in that terminal.~~

−

There are two different ways to set these variables and they depend on what type of shell your terminal is using. The shell you are using does not matter, other than for using the appropriate commands for setting the variables. To determine which shell your terminal is running, the following command will display your shell type, so type the following in your terminal:

−

~~ps -p $$ -ocomm=~~

−

~~If your shell is bash or sh, set your variables using:~~

−

~~export GCHOME=~/gotcloud~~

−

~~export GCDATA=~/gotcloudExample~~

−

~~export GCOUT=~/gotcloudTutorial~~

−

~~If your shell is csh, tcsh, set your variables using:~~

−

~~setenv GCHOME ~/gotcloud~~

−

~~setenv GCDATA ~/gotcloudExample~~

−

~~setenv GCOUT ~/gotcloudTutorial~~

−

If the directories specified above do not reflect ~~where gotcloud and/or~~ the ~~example data is installed or where~~ you ~~want your output~~ to go, ~~then~~ replace ~~those directories with the full paths to the appropriate directories.~~

−

~~After setting these variables, you can copy and paste the rest of the commands~~ in the ~~tutorial. If you do not want to use variables, you can type the commands in~~ with the appropriate paths ~~specified~~.

−

=== Step 1b: Install GotCloud ===

+

=== Step 1a: Install GotCloud ===

In order to run this tutorial, you need to make sure you have GotCloud installed on your system.

Line 49: Line 28:

Otherwise, you can install it in your own directory:

−

# ~~Create & change~~ to the directory where you want gotcloud installed

+

# Change to the directory where you want gotcloud-latest/ installed

−

# Download the gotcloud tar ~~from the ftp site~~.

+

# Download the gotcloud tar.

# Extract the tar

# Build (compile) the source

#* Note: as the source builds, many messages will scroll through your terminal. You may even see some warnings. These messages are normal and expected. As long as the build does not end with an error, you have successfully built the source.

−

~~mkdir -p $GCHOME;~~ cd ~~$GCHOME~~

+

cd ~

−

wget ~~ftp~~://~~share~~.~~sph.umich.edu~~/gotcloud/~~gotcloud_latest~~.~~tgz~~ # Download

+

wget https://github.com/statgen/gotcloud/archive/latest.tar.gz # Download

−

tar ~~xvf gotcloud_latest~~.~~tgz~~ -~~-strip 1 # Extract~~

+

tar xf latest.tar.gz # Extracts into gotcloud-latest/

−

cd ~~$GCHOME~~/src; make # Build source

+

cd gotcloud-latest/src; make # Build source

−

~~GotCloud requires the following tools to be installed.~~

−

~~You can run $GCHOME/scripts/check_requirements.sh~~

−

~~...TBD – put in required programs/tools.~~

−

* java (java-common default-jre on ubuntu)

−

* make (make on ubuntu)

−

* libssl (libssl0.9.8 on ubuntu)

−

* gcc 4.4 or newer
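A quick way to see whether the command-line tools named in the list above are available is to look them up on your PATH. This is a minimal sketch, assuming a POSIX shell; check_tools is a hypothetical helper, not a GotCloud script, and it does not verify versions or the libssl library.

```shell
# Hypothetical helper (not part of GotCloud): report which of the given
# commands can be found on PATH. Versions are not checked.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok:      $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # The return value is the number of tools that were not found.
    return "$missing"
}

# libssl is a library rather than a command, so it is not checked here.
check_tools java make gcc || echo "install the missing tools before building"
```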

−

=== Step 1c: Install Example Dataset ===

+

=== Step 1b: Install Example Dataset ===

−

Our dataset consists of 60 individuals from Great Britain (GBR) sequenced by the 1000 Genomes Project. These individuals have been sequenced to an average depth of about 4x.

+

Our dataset consists of individuals from Great Britain (GBR) sequenced by the 1000 Genomes Project. These individuals have been sequenced to an average depth of about 4x.

To conserve time and disk-space, our analysis will focus on a small region on chromosome 20, 42900000 - 43200000.
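The same region is written as 20:42900000-43200000 when it is passed to --region in STEP 3. As a small illustration, the sketch below splits a region string of that form and computes its length with POSIX shell parameter expansion. Counting both endpoints is an assumption about how the interval is interpreted.

```shell
# Split a region string of the form CHR:START-END into its parts.
region="20:42900000-43200000"
chr="${region%%:*}"      # text before the first ':'
range="${region#*:}"     # text after the first ':'
start="${range%-*}"      # text before the '-'
end="${range#*-}"        # text after the '-'

# Assumption: both endpoints are included, hence the +1.
length=$((end - start + 1))
echo "chr=$chr start=$start end=$end length=$length"
```

Under that assumption the tutorial region spans 300001 bases.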

−

The tutorial will run ~~the alignment pipeline~~ on 2 of the individuals (HG00096, HG00100). The fastqs used for this step are reduced to reads that fall into our target region.

+

The alignment pipeline in this tutorial will be run on 2 of the individuals (HG00096, HG00100). The fastqs used for this step are reduced to reads that fall into our target region.

−

The ~~tutorial~~ will ~~then used~~ previously aligned/mapped reads for ~~the full~~ 60 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.

+

The snpcall and ldrefine pipelines will use previously aligned/mapped reads for 60 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.

−

The example dataset we'll be using is available at: ftp://share.sph.umich.edu/gotcloud/gotcloudExample.~~tar~~

+

The example dataset we'll be using is available at: ftp://share.sph.umich.edu/gotcloud/gotcloudExample.tgz

−

# ~~Create &~~ Change directory to where you want to install the ~~Tutorail~~ data

+

# Change directory to where you want to install the Tutorial data

# Download the dataset tar from the ftp site

# Extract the tar

−

~~mkdir -p $GCDATA;~~ cd ~~$GCDATA~~

+

cd ~

−

wget ftp://share.sph.umich.edu/gotcloud/~~gotcloudExample~~.~~tar~~ # Download

+

wget ftp://share.sph.umich.edu/gotcloud/gotcloudExample_latest.tgz # Download

−

tar ~~xvf gotcloudExample~~.~~tar --strip 1~~ # ~~Extract~~

+

tar xf gotcloudExample_latest.tgz # Extracts into gotcloudExample/
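If you want to confirm which directory an archive will extract into before unpacking it, you can list its contents instead. This is a minimal sketch, assuming a gzip-compressed tar file and a tar that supports the -z option; tar_top_level is a hypothetical helper name.

```shell
# Hypothetical helper: print the distinct top-level entries of a
# gzip-compressed tar archive without extracting it.
tar_top_level() {
    tar -tzf "$1" | cut -d/ -f1 | sort -u
}
```

For example, tar_top_level gotcloudExample_latest.tgz should print a single directory name if the archive unpacks into one directory. Archives whose entries start with ./ will print a single dot instead.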

== STEP 2 : Run GotCloud Alignment Pipeline ==

Line 104: Line 75:

Run the alignment pipeline (the example aligns 2 samples) :

−

~~$GCHOME~~/gotcloud align --conf ~~$GCDATA~~/[[#Alignment Configuration File|~~GBR60align~~.conf]] --outdir [[#Alignment Output Directory|~~$GCOUT~~]]

+

cd ~

+

gotcloud-latest/gotcloud align --conf gotcloudExample/[[#Alignment Configuration File|GBR2align.conf]] --outdir [[#Alignment Output Directory|gotcloudTutorialOut]] --baseprefix gotcloudExample

−

Upon successful completion of the alignment pipeline (about 1-2 minutes), you will see the following message:

+

Upon successful completion of the alignment pipeline (about 1-3 minutes), you will see the following message:

−

Processing finished in nn secs with no errors reported

+

Processing finished in n secs with no errors reported

The final BAM files produced by the alignment pipeline are:

−

ls ~~$GCOUT~~/bams

+

ls gotcloudTutorialOut/bams

In this directory you will see:

* BAM (.bam) files - 1 per sample

Line 118: Line 90:

** HG00096.recal.bam.bai

** HG00100.recal.bam.bai

−

* ~~BAM checksum~~ files ~~(.md5) – 1 per sample~~

+

* Indicator files that the step completed successfully:

−

** HG00096.recal.bam.md5

−

** HG00100.recal.bam.md5

−

* Indicator flies that the step completed successfully:

** HG00096.recal.bam.done

** HG00100.recal.bam.done
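Rather than inspecting the listing by eye, you could check for the expected files per sample. The sketch below is a hypothetical helper, not part of GotCloud; it assumes the file names follow the SAMPLE.recal.bam pattern shown in the list above.

```shell
# Hypothetical helper (not part of GotCloud): check that each sample has a
# BAM, a BAM index, and a .done indicator file in the given directory.
check_alignment_outputs() {
    bam_dir="$1"
    shift
    status=0
    for sample in "$@"; do
        for suffix in recal.bam recal.bam.bai recal.bam.done; do
            if [ ! -e "$bam_dir/$sample.$suffix" ]; then
                echo "missing: $bam_dir/$sample.$suffix"
                status=1
            fi
        done
    done
    # Returns 0 only if every expected file was found.
    return "$status"
}
```

For this tutorial the call would be check_alignment_outputs gotcloudTutorialOut/bams HG00096 HG00100, which prints nothing when all files are present.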

The Quality Control (QC) files are:

−

ls ~~$GCOUT~~/QCFiles

+

ls gotcloudTutorialOut/QCFiles

In this directory you will see:

* VerifyBamID output files:

Line 154: Line 123:

For information on the VerifyBamID output, see: [[Understanding VerifyBamID output]]

−

For information on the QPLOT output, see: [[Understanding QPLOT output]]

+

For information on the QPLOT output, see: [[Understanding QPLOT output]]

== STEP 3 : Run GotCloud Variant Calling Pipeline ==

Line 172: Line 141:

Run the variant calling pipeline:

−

~~$GCHOME~~/gotcloud snpcall --conf [[GBR60vc.conf]] --outdir ~~$GCOUT~~ --numjobs 2 --region 20:42900000-43200000

+

cd ~

+

gotcloud-latest/gotcloud snpcall --conf gotcloudExample/[[GBR60vc.conf]] --outdir gotcloudTutorialOut --numjobs 2 --region 20:42900000-43200000 --baseprefix gotcloudExample

−

Upon successful completion of the variant calling pipeline (about 3-4 minutes), you will see the following message:

+

Upon successful completion of the variant calling pipeline (about 2-4 minutes), you will see the following message:

Commands finished in nnn secs with no errors reported

On SNP Call success, the VCF files of interest are:

−

ls ~~$GCOUT~~/vcfs/chr20/chr20.filtered*

+

ls gotcloudTutorialOut/vcfs/chr20/chr20.filtered*

This gives you the following files:

−

* '''chr20.filtered.vcf.gz ''' - vcf for whole chromosome after it has been run through filters and marked with PASS/FAIL including per sample genotypes

+

* '''chr20.filtered.vcf.gz ''' - vcf for whole chromosome after it has been run through hardfilters and SVM filters and marked with PASS/FAIL including per sample genotypes

* chr20.filtered.sites.vcf - vcf for whole chromosome after it has been run through filters and marked with PASS/FAIL without the per sample genotypes

−

* chr20.filtered.sites.vcf.log - log file

+

* chr20.filtered.sites.vcf.norm.log - log file

* chr20.filtered.sites.vcf.summary - summary of filters applied

* chr20.filtered.vcf.gz.OK - indicator that the filtering completed successfully

* chr20.filtered.vcf.gz.tbi - index file for the vcf file
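To get a quick view of how many sites passed or failed, you can tally the FILTER column (column 7 in the VCF format) of the filtered file. This is a minimal sketch using gzip and awk; filter_counts is a hypothetical helper name, and the labels you see for failed sites depend on the filters the pipeline applied.

```shell
# Hypothetical helper: count the records in a gzip-compressed VCF by the
# value of the FILTER column (column 7), skipping header lines.
filter_counts() {
    gzip -dc "$1" |
        awk -F'\t' '!/^#/ { n[$7]++ } END { for (f in n) print f "\t" n[f] }' |
        sort
}
```

For example, filter_counts gotcloudTutorialOut/vcfs/chr20/chr20.filtered.vcf.gz prints one line per FILTER value with its count.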

−

Also in the ~~$GCOUT~~/vcfs/chr20 directory are intermediate files:

+

Also in the ~/gotcloudTutorialOut/vcfs/chr20 directory are intermediate files:

* the whole chromosome variant calls prior to any filtering:

** chr20.merged.sites.vcf - without per sample genotypes

Line 194: Line 164:

** chr20.merged.vcf - including per sample genotypes

** chr20.merged.vcf.OK - indicator that the step completed successfully

+

* the hardfiltered (pre-svm filtered) variant calls:

+

** chr20.hardfiltered.vcf.gz - vcf for whole chromosome after it has been run through hard filters

+

** chr20.hardfiltered.sites.vcf - vcf for whole chromosome after it has been run through filters and marked with PASS/FAIL without the per sample genotypes

+

** chr20.hardfiltered.sites.vcf.log - log file

+

** chr20.hardfiltered.sites.vcf.summary - summary of filters applied

+

** chr20.hardfiltered.vcf.gz.OK - indicator that the filtering completed successfully

+

** chr20.hardfiltered.vcf.gz.tbi - index file for the vcf file

* 40000001.45000000 subdirectory contains the data for just that region.

−

The ~~$GCOUT~~/split/chr20 folder contains a VCF with just the sites that pass the filters.

+

The ~/gotcloudTutorialOut/split/chr20 folder contains a VCF with just the sites that pass the filters.

−

ls ~~$GCOUT~~/split/chr20/

+

ls ~/gotcloudTutorialOut/split/chr20/

* '''chr20.filtered.PASS.vcf.gz ''' – vcf of just sites that pass all filters

* chr20.filtered.PASS.split.1.vcf.gz - intermediate file

Line 214: Line 191:

Run the LD-aware genotype refinement pipeline:

−

~~$GCHOME~~/gotcloud ldrefine --conf [[GBR60vc.conf]] --outdir ~~$GCOUT~~ --numjobs 2

+

cd ~

+

gotcloud-latest/gotcloud ldrefine --conf gotcloudExample/[[GBR60vc.conf]] --outdir gotcloudTutorialOut --numjobs 2 --baseprefix gotcloudExample

−

Upon successful completion of this pipeline, you will see the following message:

+

Upon successful completion of this pipeline (about 3-10 minutes), you will see the following message:

Commands finished in nnn secs with no errors reported

The output from the beagle step of the genotype refinement pipeline is found in:

−

ls ~~$GCOUT~~/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz ~~$GCOUT~~/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz.tbi

+

ls gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz.tbi

The output from the thunderVcf (final step) of the genotype refinement pipeline is found in:

−

ls ~~$GCOUT~~/thunder/chr20/GBR/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz ~~$GCOUT~~/thunder/chr20/GBR/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz.tbi

+

ls gotcloudTutorialOut/thunder/chr20/GBR/thunder/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz gotcloudTutorialOut/thunder/chr20/GBR/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz.tbi
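As a simple sanity check on the refined output, you could count the variant records in each VCF and compare the beagle and thunder files. This is a minimal sketch, assuming the .vcf.gz files can be read with gzip; count_vcf_records is a hypothetical helper name.

```shell
# Hypothetical helper: count the non-header lines (variant records) in a
# gzip-compressed VCF file.
count_vcf_records() {
    gzip -dc "$1" | grep -vc '^#'
}
```

Running count_vcf_records on the beagle and thunder VCFs listed above prints the number of sites in each file.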

−

~~== STEP 5 : Run Support Vector Machine (SVM) Pipeline ==~~

−

== STEP 6 : Run GotCloud Association Analysis Pipeline (EPACTS) ==

+

== STEP 5 : Run GotCloud Association Analysis Pipeline (EPACTS) ==

We will assume that EPACTS is installed in the following directory:

Line 234: Line 209:

(If you need to install EPACTS, please refer to the documentation at [[EPACTS#Installation_Details]])

−

$EPACTS/epacts single --vcf ~~$GCOUT~~/vcfs/chr20/chr20.filtered.vcf.gz --ped ~~$GCDATA~~/test.GBR60.ped \\

+

$EPACTS/epacts single --vcf ~/gotcloudTutorialOut/vcfs/chr20/chr20.filtered.vcf.gz --ped ~/gotcloudExample/test.GBR60.ped \\

−

--out ~~$GCOUT~~/epacts --test q.linear --run 1 --top 1 --chr 20

+

--out ~/gotcloudTutorialOut/epacts --test q.linear --run 1 --top 1 --chr 20

−

Upon successful run, you will see files starting with ~~$GCOUT~~/epacts

+

Upon successful run, you will see files starting with ~/gotcloudTutorialOut/epacts

−

ls ~~$GCOUT~~/epacts*

+

ls ~/gotcloudTutorialOut/epacts*

To see the top associated variants, you can run

−

less ~~$GCOUT~~/epacts.epacts.top5000

+

less ~/gotcloudTutorialOut/epacts.epacts.top5000

To see the LocusZoom-like plot, you can run the following command (assuming GNU gnuplot 4.2 or higher is installed):

−

xpdf ~~$GCOUT~~/epacs.zoom.20.42987877.pdf

+

xpdf ~/gotcloudTutorialOut/epacts.zoom.20.42987877.pdf

Click [[Media:EPACTS TEST.zoom.20.42987877.pdf | Example LocusZoom PDF]] to see the expected output PDF.

Line 250: Line 225:

= Frequently Asked Questions (FAQs) =

−

'''I ran the ~~tutorai~~ example successfully, how can I run it with my real sequence data?'''

+

'''I ran the tutorial example successfully, how can I run it with my real sequence data?'''

−

Congratulations for your successful run of your [[GotCloud]] Tutorial. Please see [[#Tutorial Inputs]] section to prepare your own input files for your sequence data. You will need to specify the FASTQ files associated with its sample names as explained. In addition, you will need to download the full reference and resource file across whole genome (the Tutorial contains only chr20 portion to make it compact) See [[#Alignment Configuration File]] section for the detailed information. Also, please refer to the original documentation of [[GotCloud]] for more detailed guide on installation beyond the scope of tutorial

+

Congratulations on successfully running the [[GotCloud]] Tutorial. Please see the [[#Tutorial Inputs]] section to prepare your own input files for your sequence data. You will need to specify the FASTQ files associated with each sample name, as explained there. In addition, you will need to download the full reference and resource files for the whole genome (the Tutorial contains only the chr20 portion to keep it compact). See the [[#Alignment Configuration File]] section for detailed information. Also, please refer to the original documentation of [[GotCloud]] for a more detailed guide on installation beyond the scope of the tutorial.

= Input Files for GotCloud Tutorial =

Line 290: Line 265:

** AS - assembly value to put in the BAM

−

The index file and chromosome 20 references used in this tutorial are included with the example data under the ~~$GCDATA~~ directory. The tutorial uses chromosome 20 only references in order to speed the processing time.

+

The index file and chromosome 20 references used in this tutorial are included with the example data under the ~/gotcloudExample directory. The tutorial uses chromosome 20 only references in order to reduce processing time.

When running with your own data, you will need to update the:

Line 313: Line 288:

* INDEL_PREFIX - Indel sites file (need for variant calling pipeline)

−

The tutorial configuration file is setup to point to the required chromosome 20 reference files which are included with the tutorial example data in ~~$GCDATA~~/chr20Ref/.

+

The tutorial configuration file is set up to point to the required chromosome 20 reference files, which are included with the tutorial example data in ~/gotcloudExample/chr20Ref/.

If you are running more than just chromosome 20, you will need whole genome reference files which can be downloaded from [[GotCloudReference]].

Mktrost

Administrators

3,045

edits


Tutorial: GotCloud (view source)

Revision as of 12:35, 10 January 2014
