From Genome Analysis Wiki
= GotCloud Tutorial =
In this tutorial, we illustrate some of the essential steps in the analysis of next-generation sequence data.

For background on GotCloud and its sequence analysis pipelines, see [[GotCloud]].

While GotCloud can run on a cluster of machines or cloud instances, this tutorial is a small test that runs entirely on a single machine.
  
GotCloud and this basic tutorial were presented at the [http://ibg.colorado.edu/dokuwiki/doku.php?id=workshop:2013:announcement 2013 IBG Workshop] in two sessions.  On Wednesday, an overview was presented along with the steps for running the tutorial data: [[Media:IBG2013GotCloud.pdf|IBG2013GotCloud.pdf]].  On Friday, more detail was presented on the input files and how they are generated: [[Media:GotCloudIBGWorkshop2013Friday.pdf|GotCloudIBGWorkshop2013Friday.pdf]].

'''This tutorial is in the process of being updated for gotcloud version 1.08 (July 30, 2013).'''
== STEP 1 : Setup GotCloud ==
  
[[GotCloud]] has been developed and tested on Ubuntu Linux 12.10 and 12.04.2 LTS but has not been tested on other Linux distributions.  It is not available for Windows.  If you do not have your own machines to run on, GotCloud is also available for Ubuntu running on the Amazon Elastic Compute Cloud; see [[Amazon_Snapshot]] for more information.
We will use 3 different directories for this tutorial:
# the directory where gotcloud is installed, by default ~/gotcloud-latest/
# the directory where the example data is installed, by default ~/gotcloudExample
# your output directory, by default ~/gotcloudTutorialOut/

If these defaults do not reflect the directories you would like to use, replace their occurrences in the instructions below with the appropriate paths.
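If you use non-default locations, one way to keep the later commands copy-pasteable is to put the three paths in shell variables. A minimal sketch, assuming a Bourne-style shell; the variable names (GC_HOME, GC_DATA, GC_OUT) are our own and not part of GotCloud:

```shell
# Hypothetical convenience variables for the three tutorial directories;
# adjust the right-hand sides if you installed elsewhere.
GC_HOME=${GC_HOME:-$HOME/gotcloud-latest}     # GotCloud install
GC_DATA=${GC_DATA:-$HOME/gotcloudExample}     # example data
GC_OUT=${GC_OUT:-$HOME/gotcloudTutorialOut}   # output directory

# The output directory can safely be created up front.
mkdir -p "$GC_OUT"
echo "output will go to $GC_OUT"
```

Later commands could then be written as, for example, "$GC_HOME/gotcloud" align ... --outdir "$GC_OUT".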
  
  
=== Step 1a: Install GotCloud ===
To run this tutorial, you need GotCloud installed on your system.

If you have root access and would like to install gotcloud system-wide, follow the [[GotCloud#Install_GotCloud_Software|root access installation instructions]].

Otherwise, you can install it in your own directory:
# Change to the directory where you want gotcloud-latest/ installed.
# Download the gotcloud tar.
# Extract the tar.
# Build (compile) the source.
#* Note: as the source builds, many messages will scroll through your terminal.  You may even see some warnings.  These messages are normal and expected.  As long as the build does not end with an error, you have successfully built the source.

 cd ~
 wget https://github.com/statgen/gotcloud/archive/latest.tar.gz  # Download
 tar xf latest.tar.gz                 # Extracts into gotcloud-latest/
 cd gotcloud-latest/src; make         # Build source
  
=== Step 1b: Install Example Dataset ===
Our dataset consists of individuals from Great Britain (GBR) sequenced by the 1000 Genomes Project.  These individuals have been sequenced to an average depth of about 4x.

To conserve time and disk space, our analysis will focus on a small region of chromosome 20, 42900000 - 43200000.

The alignment pipeline in this tutorial will be run on 2 of the individuals (HG00096, HG00100).  The fastqs used for this step are reduced to reads that fall into our target region.
The snpcall and ldrefine pipelines will use previously aligned/mapped reads for 60 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.

The example dataset we'll be using is available at: ftp://share.sph.umich.edu/gotcloud/gotcloudExample.tgz

# Change to the directory where you want to install the tutorial data.
# Download the dataset tar from the ftp site.
# Extract the tar.

 cd ~
 wget ftp://share.sph.umich.edu/gotcloud/gotcloudExample_latest.tgz  # Download
 tar xf gotcloudExample_latest.tgz    # Extracts into gotcloudExample/
== STEP 2 : Run GotCloud Alignment Pipeline ==
The first step in processing next-generation sequence data is mapping the reads to the reference genome, generating per sample BAM files.

The alignment pipeline has multiple built-in steps to generate BAMs:
# Align the fastqs to the reference genome
#* handles both single-end & paired-end fastqs
# Merge the results from multiple fastqs into 1 file per sample
# Mark duplicate reads
# Recalibrate base qualities

This processing results in 1 BAM file per sample.

The alignment pipeline also includes Quality Control (QC) steps:
# Visualization of various quality measures (QPLOT)
# Screening for sample contamination & swaps (VerifyBamID)

Run the alignment pipeline (the example aligns 2 samples):
 cd ~
 gotcloud-latest/gotcloud align --conf gotcloudExample/[[#Alignment Configuration File|GBR2align.conf]] --outdir [[#Alignment Output Directory|gotcloudTutorialOut]] --baseprefix gotcloudExample

Upon successful completion of the alignment pipeline (about 1-3 minutes), you will see the following message:
 Processing finished in n secs with no errors reported
The final BAM files produced by the alignment pipeline are:
 ls gotcloudTutorialOut/bams
In this directory you will see:
* BAM (.bam) files - 1 per sample
** HG00096.recal.bam
** HG00100.recal.bam
* BAM index files (.bai) - 1 per sample
** HG00096.recal.bam.bai
** HG00100.recal.bam.bai
* Indicator files that the step completed successfully:
** HG00096.recal.bam.done
** HG00100.recal.bam.done
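Before moving on, you can confirm that each sample's outputs are all in place by checking that every .bam has its companion .bai and .done files. A minimal sketch, run here against a throwaway fixture directory so it is self-contained; for the real check, point BAMS at gotcloudTutorialOut/bams instead of the mktemp fixture:

```shell
# Verify every BAM has a matching index (.bai) and completion marker (.done).
# BAMS is a generated fixture for demonstration; use
#   BAMS=gotcloudTutorialOut/bams
# against real tutorial output.
BAMS=$(mktemp -d)
touch "$BAMS/HG00096.recal.bam" "$BAMS/HG00096.recal.bam.bai" "$BAMS/HG00096.recal.bam.done"
touch "$BAMS/HG00100.recal.bam" "$BAMS/HG00100.recal.bam.bai" "$BAMS/HG00100.recal.bam.done"

missing=0
for bam in "$BAMS"/*.bam; do
  for ext in .bai .done; do
    if [ ! -e "$bam$ext" ]; then
      echo "missing: $bam$ext"
      missing=1
    fi
  done
done
if [ "$missing" -eq 0 ]; then
  echo "all alignment outputs complete"
fi
```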
The Quality Control (QC) files are:
 ls gotcloudTutorialOut/QCFiles
In this directory you will see:
* VerifyBamID output files:
** HG00096.genoCheck.depthRG
** HG00096.genoCheck.depthSM
** HG00096.genoCheck.selfRG
** HG00096.genoCheck.selfSM
** HG00100.genoCheck.depthRG
** HG00100.genoCheck.depthSM
** HG00100.genoCheck.selfRG
** HG00100.genoCheck.selfSM
* VerifyBamID step completion files - 1 per sample
** HG00096.genoCheck.done
** HG00100.genoCheck.done
* QPLOT output files
** HG00096.qplot.R
** HG00096.qplot.stats
** HG00100.qplot.R
** HG00100.qplot.stats
* QPLOT step completion files - 1 per sample
** HG00096.qplot.done
** HG00100.qplot.done

For information on the VerifyBamID output, see: [[Understanding VerifyBamID output]]

For information on the QPLOT output, see: [[Understanding QPLOT output]]
== STEP 3 : Run GotCloud Variant Calling Pipeline ==
The next step is to analyze the BAM files by calling SNPs and generating a VCF file containing the variant calls.

The variant calling pipeline has multiple built-in steps to generate the VCF:
# Filter out reads with low mapping quality
# Perform per-base alignment quality (BAQ) adjustment
# Resolve overlapping paired-end reads
# Generate genotype likelihood files
# Perform variant calling
# Extract features from variant sites
# Perform variant filtering

To speed variant calling, each chromosome is broken up into smaller regions which are processed separately.  While the data is initially split by sample, the per sample data gets merged and processed together for each region.  These regions are later merged, resulting in a single Variant Call File (VCF) per chromosome.  For this tutorial, all of the data falls within a single region.
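To make the chunking idea concrete, the sketch below splits the tutorial region into fixed-size pieces. The 100 kb chunk size is purely illustrative and is not GotCloud's internal chunk size:

```shell
# Split the tutorial region 20:42900000-43200000 into fixed-size chunks,
# schematically mimicking how the pipeline processes regions in parallel.
start=42900000; end=43200000; step=100000   # step is an illustrative value
s=$start
while [ "$s" -lt "$end" ]; do
  e=$((s + step))
  if [ "$e" -gt "$end" ]; then e=$end; fi   # clamp the last chunk
  echo "20:$s-$e"
  s=$e
done
```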
Run the variant calling pipeline:
 cd ~
 gotcloud-latest/gotcloud snpcall --conf gotcloudExample/[[GBR60vc.conf]] --outdir gotcloudTutorialOut --numjobs 2 --region 20:42900000-43200000 --baseprefix gotcloudExample

Upon successful completion of the variant calling pipeline (about 2-4 minutes), you will see the following message:
 Commands finished in nnn secs with no errors reported
On snpcall success, the VCF files of interest are:
 ls gotcloudTutorialOut/vcfs/chr20/chr20.filtered*

This gives you the following files:
* '''chr20.filtered.vcf.gz''' - VCF for the whole chromosome after it has been run through the hard filters and SVM filters and marked with PASS/FAIL, including per sample genotypes
* chr20.filtered.sites.vcf - VCF for the whole chromosome after it has been run through the filters and marked with PASS/FAIL, without the per sample genotypes
* chr20.filtered.sites.vcf.norm.log - log file
* chr20.filtered.sites.vcf.summary - summary of filters applied
* chr20.filtered.vcf.gz.OK - indicator that the filtering completed successfully
* chr20.filtered.vcf.gz.tbi - index file for the VCF file
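As a quick sanity check on the filtered output, you can count how many records were marked PASS in the FILTER column (column 7). The sketch below uses a tiny made-up inline VCF so it can run anywhere; on the real output you would feed `zcat gotcloudTutorialOut/vcfs/chr20/chr20.filtered.vcf.gz` into the same awk command:

```shell
# Count PASS vs total records using the FILTER column (7th) of a VCF.
# The inline records below are fabricated purely for demonstration.
vcf=$(mktemp)
cat > "$vcf" <<'EOF'
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO
20 42900123 . A G 100 PASS .
20 42900456 . C T 30 SVM .
20 42900789 . G A 99 PASS .
EOF

pass=$(awk '!/^#/ && $7 == "PASS"' "$vcf" | wc -l)   # non-header rows with FILTER == PASS
total=$(grep -vc '^#' "$vcf")                        # all non-header rows
echo "$pass of $total records PASS"
```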
Also in the ~/gotcloudTutorialOut/vcfs/chr20 directory are intermediate files:
* the whole chromosome variant calls prior to any filtering:
** chr20.merged.sites.vcf - without per sample genotypes
** chr20.merged.stats.vcf
** chr20.merged.vcf - including per sample genotypes
** chr20.merged.vcf.OK - indicator that the step completed successfully
* the hardfiltered (pre-SVM-filtered) variant calls:
** chr20.hardfiltered.vcf.gz - VCF for the whole chromosome after it has been run through the hard filters
** chr20.hardfiltered.sites.vcf - VCF for the whole chromosome after it has been run through the filters and marked with PASS/FAIL, without the per sample genotypes
** chr20.hardfiltered.sites.vcf.log - log file
** chr20.hardfiltered.sites.vcf.summary - summary of filters applied
** chr20.hardfiltered.vcf.gz.OK - indicator that the filtering completed successfully
** chr20.hardfiltered.vcf.gz.tbi - index file for the VCF file
* 40000001.45000000 - subdirectory containing the data for just that region

The ~/gotcloudTutorialOut/split/chr20 folder contains a VCF with just the sites that pass the filters.
 ls ~/gotcloudTutorialOut/split/chr20/
* '''chr20.filtered.PASS.vcf.gz''' - VCF of just the sites that pass all filters
* chr20.filtered.PASS.split.1.vcf.gz - intermediate file
* chr20.filtered.PASS.split.err - log file
* chr20.filtered.PASS.split.vcflist - list of intermediate files
* subset.OK - indicator that the step completed successfully

In addition to the vcfs subdirectory, there are additional intermediate files/directories:
* glfs - holds genotype likelihood format [[GLF]] files split by chromosome, region, and sample
* pvcfs - holds intermediate VCF files split by chromosome and region

Note: the tutorial does not produce a target directory, but if you run with targeted data, you may see one.
== STEP 4 : Run GotCloud Genotype Refinement Pipeline ==
The next step is to perform genotype refinement using linkage disequilibrium information, using [http://faculty.washington.edu/browning/beagle/beagle.html Beagle] & [[ThunderVCF]].

Run the LD-aware genotype refinement pipeline:
 cd ~
 gotcloud-latest/gotcloud ldrefine --conf gotcloudExample/[[GBR60vc.conf]] --outdir gotcloudTutorialOut --numjobs 2 --baseprefix gotcloudExample

Upon successful completion of this pipeline (about 3-10 minutes), you will see the following message:
 Commands finished in nnn secs with no errors reported

The output from the Beagle step of the genotype refinement pipeline is found in:
 ls gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz.tbi

The output from the thunderVcf (final) step of the genotype refinement pipeline is found in:
 ls gotcloudTutorialOut/thunder/chr20/GBR/thunder/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz gotcloudTutorialOut/thunder/chr20/GBR/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz.tbi
== STEP 5 : Run GotCloud Association Analysis Pipeline (EPACTS) ==

We will assume that EPACTS is installed in the following directory:
 setenv EPACTS /path/to/epacts
(If you need to install EPACTS, please refer to the documentation at [[EPACTS#Installation_Details]].)

 $EPACTS/epacts single --vcf ~/gotcloudTutorialOut/vcfs/chr20/chr20.filtered.vcf.gz --ped ~/gotcloudExample/test.GBR60.ped \
     --out ~/gotcloudTutorialOut/epacts --test q.linear --run 1 --top 1 --chr 20

Upon a successful run, you will see files starting with ~/gotcloudTutorialOut/epacts
 ls ~/gotcloudTutorialOut/epacts*

To see the top associated variants, you can run
 less ~/gotcloudTutorialOut/epacts.epacts.top5000

To see the LocusZoom-like plot, you can type the following command (assuming GNU gnuplot 4.2 or a higher version is installed):
 xpdf ~/gotcloudTutorialOut/epacts.zoom.20.42987877.pdf

Click [[Media:EPACTS TEST.zoom.20.42987877.pdf | Example LocusZoom PDF]] to see the expected output PDF.
= Frequently Asked Questions (FAQs) =

'''I ran the tutorial example successfully; how can I run it with my real sequence data?'''

Congratulations on your successful run of the [[GotCloud]] Tutorial.  Please see the [[#Input Files for GotCloud Tutorial|Input Files for GotCloud Tutorial]] section to prepare your own input files for your sequence data.  You will need to specify the FASTQ files associated with each sample name as explained there.  In addition, you will need to download the full reference and resource files for the whole genome (the tutorial contains only the chr20 portion to keep it compact); see the [[#Alignment Configuration File]] section for detailed information.  Also, please refer to the main [[GotCloud]] documentation for a more detailed installation guide beyond the scope of this tutorial.
  
= Input Files for GotCloud Tutorial =

This section describes the input files needed for the GotCloud tutorial.  You don't need to know these details to run the tutorial, but if you're interested in understanding the structure of the GotCloud pipelines and running with your own samples, this is a good starting point.
  
== Alignment Pipeline ==
=== List of Input Files needed for Alignment ===
The command-line inputs to the tutorial alignment pipeline are:
# [[#Alignment Configuration File|Configuration File (--conf)]]
#* Specifies the configuration file to use when running.
# [[#Alignment Output Directory|Output Directory (--outdir)]]
#* Directory where the output should be placed.

Additional information required to run the alignment pipeline:
# [[#Alignment FASTQ Index File|Index file of FASTQs]]
# [[#Alignment Reference Files|Alignment Reference Files]]
For the tutorial, these values are specified in the configuration file.

=== Alignment Configuration File ===
The configuration file contains KEY = VALUE settings that override defaults and set specific values for the given run.
  
<pre>
INDEX_FILE = GBR60fastq.index
############
# References
REF_DIR = chr20Ref
AS = NCBI37
FA_REF = $(REF_DIR)/human_g1k_v37_chr20.fa
DBSNP_VCF = $(REF_DIR)/dbsnp135_chr20.vcf.gz
HM3_VCF = $(REF_DIR)/hapmap_3.3.b37.sites.chr20.vcf.gz
</pre>
  
This configuration file sets:
* [[#Alignment FASTQ Index File|INDEX_FILE]] - file containing the fastqs to be processed as well as the read group information for those fastqs
* Reference Information: see [[#Alignment Reference Files|Alignment Reference Files]] for more information
** AS - assembly value to put in the BAM

The index file and chromosome 20 references used in this tutorial are included with the example data under the ~/gotcloudExample directory.  The tutorial uses chromosome-20-only references in order to speed the processing time.

When running with your own data, you will need to update:
* the index file, to contain the information for your own fastq files
** See [[#Alignment FASTQ Index File|Alignment FASTQ Index File]] for more information on the contents of the index file.
* the reference files, to be whole genome references
** If you are just running chromosome 20, you can use the tutorial references.
** Whole genome reference files can be downloaded from [[GotCloudReference]].
** See [[#Alignment Reference Files|Alignment Reference Files]] for more information.

Note: it is recommended that you use absolute paths (full path names, like "/home/mktrost/gotcloudReference" rather than just "gotcloudReference").  This example does not use absolute paths in order to be flexible about where the data is installed, but using relative paths requires running from the correct directory.
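One portable way to follow this recommendation is to let the shell resolve the absolute path before you paste it into a configuration file. A minimal sketch; myRefDir is a made-up example directory:

```shell
# Resolve a relative path to an absolute one before putting it in a
# configuration file.  "myRefDir" is a hypothetical example directory.
mkdir -p myRefDir
abs=$(cd myRefDir && pwd)     # absolute path, e.g. /home/you/myRefDir
echo "REF_DIR = $abs"
rmdir myRefDir                # clean up the example directory
```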
=== Alignment Reference Files ===

Reference files are required for running both the alignment and variant calling pipelines.

The configuration keys for setting these are:
* FA_REF - genome sequence reference files (needed for both pipelines)
* DBSNP_VCF - dbSNP site VCF file (needed for both pipelines)
* HM3_VCF - HapMap site VCF file (needed for both pipelines)
* INDEL_PREFIX - indel sites file (needed for the variant calling pipeline)

The tutorial configuration file is set up to point to the required chromosome 20 reference files, which are included with the tutorial example data in ~/gotcloudExample/chr20Ref/.

If you are running more than just chromosome 20, you will need whole genome reference files, which can be downloaded from [[GotCloudReference]].

The configuration settings for these files are set up in the default configuration, so they do not need to be specified.  You just need to set REF_DIR in your configuration file to the path where you installed your reference files.

To learn more about the required reference files, see [[GotCloud: Reference Files]].
  
=== Alignment Output Directory ===
This setting tells the alignment pipeline where to write the output and intermediate files.

The output directory will be created if it doesn't already exist and will contain the following directories/files:
* bams - directory containing the final bam/bai files
** HG00096.OK - indicates that this sample completed alignment processing
** HG00100.OK - indicates that this sample completed alignment processing
* failLogs - directory containing logs from steps that failed
** this directory is only created if an error is detected
* Makefiles - directory containing the makefiles with commands for processing each sample
** biopipe_HG00096.Makefile - commands for processing sample HG00096
** biopipe_HG00100.Makefile - commands for processing sample HG00100
** biopipe_HG00096.Makefile.log - log file from running the associated Makefile
** biopipe_HG00100.Makefile.log - log file from running the associated Makefile
* QCFiles - directory containing the QC results
* tmp - directory containing temporary alignment files
** bwa.sai.t - contains temporary files for the 1st step of bwa, which generates sai files from fastq files
*** fastq
**** *.done - indicator files that the step to generate the file completed
**** HG00096
** alignment.bwa - contains temporary files for the 2nd step of bwa, which generates BAM files
*** fastq
**** *.done - indicator files that the step to generate the file completed
** alignment.pol - contains temporary files for the polish bam step that cleans up BAM files
** alignment.dedup - contains temporary files for the deduping step
*** *.done - indicator files that the step to generate the file completed
*** *.metrics - metrics files for the deduping step
** alignment.recal - contains temporary files for the recalibration step
*** *.log - contains information about the recalibration step
*** *.qemp - contains the recalibration tables used for recalibrating each BAM file
  
=== Alignment FASTQ Index File ===

There are four fastq files in {ROOT_DIR}/test/align/fastq/Sample_1 and four fastq files in {ROOT_DIR}/test/align/fastq/Sample_2, both in paired-end format.  Normally, we would need to build an index file for these files.  Conveniently, an index file (indexFile.txt) already exists for the automatic test samples.  It can be found in {ROOT_DIR}/test/align/, and contains the following information in tab-delimited format:

 MERGE_NAME FASTQ1                           FASTQ2                           RGID   SAMPLE    LIBRARY CENTER PLATFORM
 Sample1    fastq/Sample_1/File1_R1.fastq.gz fastq/Sample_1/File1_R2.fastq.gz RGID1  SampleID1 Lib1    UM    ILLUMINA
 Sample1    fastq/Sample_1/File2_R1.fastq.gz fastq/Sample_1/File2_R2.fastq.gz RGID1a SampleID1 Lib1    UM    ILLUMINA
 Sample2    fastq/Sample_2/File1_R1.fastq.gz fastq/Sample_2/File1_R2.fastq.gz RGID2  SampleID2 Lib2    UM    ILLUMINA
 Sample2    fastq/Sample_2/File2_R1.fastq.gz fastq/Sample_2/File2_R2.fastq.gz RGID2  SampleID2 Lib2    UM    ILLUMINA

If you are in the {ROOT_DIR}/test/align directory, you can use this file as-is.  If you prefer, you can create a new index file and change the MERGE_NAME, RGID, SAMPLE, LIBRARY, CENTER, or PLATFORM values.  It is recommended that you do not modify existing files in {ROOT_DIR}/test/align.

If you want to run this example from a different directory, make sure the FASTQ1 and FASTQ2 paths are correct.  That is, each FASTQ1 and FASTQ2 entry in the index file should look like the following:

 {ROOT_DIR}/test/align/fastq/Sample_1/File1_R1.fastq.gz

Alternately, if you want to run this example from a different directory but do not want to edit the index file, you can create a relative path to the test fastq files so their path agrees with that listed in the index file:

 ln -s {ROOT_DIR}/test/align/fastq fastq

This will create a symbolic link to the test fastq directory from your current directory.

(More information about: [[Mapping_Pipeline#Sequence_Index_File|the index file]].)
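Because the index is strictly tab-delimited, a row with spaces instead of tabs is an easy mistake to make. A quick awk check that every row has the expected 8 columns can catch this; the sketch below generates a tiny two-row index so it is self-contained, but you can point $idx at your own index file instead:

```shell
# Check that every row of a tab-delimited FASTQ index has 8 columns.
# A two-row index is generated here purely for demonstration.
idx=$(mktemp)
printf 'MERGE_NAME\tFASTQ1\tFASTQ2\tRGID\tSAMPLE\tLIBRARY\tCENTER\tPLATFORM\n' > "$idx"
printf 'Sample1\tfastq/Sample_1/File1_R1.fastq.gz\tfastq/Sample_1/File1_R2.fastq.gz\tRGID1\tSampleID1\tLib1\tUM\tILLUMINA\n' >> "$idx"

# Rows whose tab-separated field count is not 8 are malformed.
bad=$(awk -F'\t' 'NF != 8' "$idx" | wc -l)
echo "$bad malformed rows"
```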
  
==Analyzing a Sample==

Using UMAKE, you can analyze BAM files by calling SNPs and generate a VCF file containing the results.  Once again, we can analyze the BAM files used in the automatic test.  For this example, we have 60 BAM files, which can be found in {ROOT_DIR}/test/umake/bams.  These contain sequence information for a targeted region in chromosome 20.

In addition to the BAM files, you will need three files to run UMAKE: an index file, a configuration file, and a bed file (needed to analyze BAM files from targeted/exome sequencing).

===Index file===

First, you need a list of all the BAM files to be analyzed.  Conveniently, a test index file (umake_test.index) already exists in {ROOT_DIR}/test/umake/.  It contains the following information:

 NA12272 ALL    bams/NA12272.mapped.ILLUMINA.bwa.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam
 NA12004 ALL    bams/NA12004.mapped.ILLUMINA.bwa.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam
 ...
 NA12874 ALL    bams/NA12874.mapped.LS454.ssaha2.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam

You can use this file directly if you change your current directory to {ROOT_DIR}/test/umake/.

Alternately, if you want to copy and use this index file in a different directory, you can create a symbolic link to the bams folder as follows:

 ln -s {ROOT_DIR}/test/umake/bams bams

(More information about: [[Variant_Calling_Pipeline_(UMAKE)#Index_File|the index file]].)

===BED file===

This file contains a single line:

 chr20  20000050        20300000

You can copy this to the current directory and use it as-is.
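The three whitespace-delimited columns are the chromosome, start position, and end position. For example, the target above spans 20300000 - 20000050 = 299950 bases, which you can confirm directly in the shell:

```shell
# Parse the tutorial's one-line BED region and compute its length.
read -r chrom start end <<'EOF'
chr20  20000050        20300000
EOF
echo "$chrom spans $((end - start)) bases"
```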
(More information about: [[Variant_Calling_Pipeline_(UMAKE)#Targeted.2FExome_Sequencing_Settings|targeted/exome sequencing settings]].)
===Configuration file===
  
An additional option can be added in the configuration file:
+
A configuration file (umake_test.conf) already exists in {ROOT_DIR}/test/umake/.  It contains the following information:  
  
  OUT_DIR = {OUT_DIR}
+
CHRS = 20
 +
BAM_INDEX = GBR60bam.index
 +
############
 +
# References
 +
REF_ROOT = chr20Ref
 +
#
 +
REF = $(REF_ROOT)/human_g1k_v37_chr20.fa
 +
INDEL_PREFIX = $(REF_ROOT)/1kg.pilot_release.merged.indels.sites.hg19
 +
DBSNP_VCF = $(REF_ROOT)/dbsnp135_chr20.vcf.gz
 +
HM3_VCF = $(REF_ROOT)/hapmap_3.3.b37.sites.chr20.vcf.gz
  
where {OUT_DIR} is the name of directory in which you want the output to be stored.  If you do not specify this in the configuration file, you will need to add an extra parameter when you run UMAKE in the next step.
 
  
(More information about: [[Variant_Calling_Pipeline_(UMAKE)#Configuration_File|the configuration file]], [[Variant_Calling_Pipeline_(UMAKE)#Reference_Files|reference files]].)
+
CHRS = 20
 +
TEST_ROOT = $(UMAKE_ROOT)/test/umake
 +
BAM_INDEX = $(TEST_ROOT)/umake_test.index
 +
OUT_PREFIX = umake_test
 +
REF_ROOT = $(TEST_ROOT)/ref
 +
#
 +
REF = $(REF_ROOT)/karma.ref/human.g1k.v37.chr20.fa
 +
INDEL_PREFIX = $(REF_ROOT)/indels/1kg.pilot_release.merged.indels.sites.hg19
 +
DBSNP_PREFIX =  $(REF_ROOT)/dbSNP/dbsnp_135_b37.rod
 +
HM3_PREFIX =  $(REF_ROOT)/HapMap3/hapmap3_r3_b37_fwd.consensus.qc.poly
 +
#
 +
RUN_INDEX = TRUE        # create BAM index file
 +
RUN_PILEUP = TRUE      # create GLF file from BAM
 +
RUN_GLFMULTIPLES = TRUE # create unfiltered SNP calls
 +
RUN_VCFPILEUP = TRUE    # create PVCF files using vcfPileup and run infoCollector
 +
RUN_FILTER = TRUE      # filter SNPs using vcfCooker
 +
RUN_SPLIT = TRUE        # split SNPs into chunks for genotype refinement
 +
RUN_BEAGLE = FALSE  # BEAGLE - MUST SET AFTER FINISHING PREVIOUS STEPS
 +
RUN_SUBSET = FALSE  # SUBSET FOR THUNDER - MAY BE SET WITH BEAGLE STEP TOGETHER
 +
RUN_THUNDER = FALSE # THUNDER - MUST SET AFTER FINISHING PREVIOUS STEPS
 +
###############################################################################
 +
WRITE_TARGET_LOCI = TRUE  # FOR TARGETED SEQUENCING ONLY -- Write loci file when performing pileup
 +
UNIFORM_TARGET_BED = $(TEST_ROOT)/umake_test.bed # Targeted sequencing : When all individuals has the same target. Otherwise, comment it out
 +
OFFSET_OFF_TARGET = 50 # Extend target by given # of bases
 +
MULTIPLE_TARGET_MAP =  # Target per individual : Each line contains [SM_ID] [TARGET_BED]  
 +
TARGET_DIR = target    # Directory to store target information
 +
SAMTOOLS_VIEW_TARGET_ONLY = TRUE # When performing samtools view, exclude off-target regions (may make command line too long)  
  
 +
If you are running this from a different directory, you will want to change some of the lines as follows:
  
===Running UMAKE===
+
BAM_INDEX = {CURRENT_DIR}/umake_test.index
 +
UNIFORM_TARGET_BED = {CURRENT_DIR}/umake_test.bed
  
If you added an OUTDIR line to the configuration file, you can run UMAKE with the following command:
+
where {CURRENT_DIR} is the absolute path to the directory that contains the index and bed files. 
  
{ROOT_DIR}/bin/umake.pl --conf umake_test.conf --snpcall --numjobs 2
+
An additional option can be added in the configuration file:
  
If you have not added an OUTDIR line to the configuration file, you can specify the output directory directly with the following command:
+
OUT_DIR = {OUT_DIR}
  
{ROOT_DIR}/bin/umake.pl --conf umake_test.conf --outdir {OUT_DIR} --snpcall --numjobs 2
+
where {OUT_DIR} is the name of directory in which you want the output to be stored. If you do not specify this in the configuration file, you will need to add an extra parameter when you run UMAKE in the next step.  
  
where {OUT_DIR} is the directory in which you want the output to be stored.
+
(More information about: [[Variant_Calling_Pipeline_(UMAKE)#Configuration_File|the configuration file]], [[Variant_Calling_Pipeline_(UMAKE)#Reference_Files|reference files]].)
  
Either command will perform SNP calling on the test samples. The resulting VCF files from this will be located in {OUT_DIR}/vcfs/chr20.
 
  
  
==Further Information==
+
==Further Information==  
  
 
[[Mapping_Pipeline|Mapping (Alignment) Pipeline]]  
 
[[Mapping_Pipeline|Mapping (Alignment) Pipeline]]  
  
 
[[Variant_Calling_Pipeline_(UMAKE)|Variant Calling Pipeline (UMAKE)]]
 
[[Variant_Calling_Pipeline_(UMAKE)|Variant Calling Pipeline (UMAKE)]]

Latest revision as of 12:35, 10 January 2014

GotCloud Tutorial

In this tutorial, we illustrate some of the essential steps in the analysis of next generation sequence data.

For a background on GotCloud and Sequence Analysis Pipelines, see GotCloud

While GotCloud can run on a cluster of machines or cloud instances, this tutorial is a small test that runs entirely on the machine the commands are run on.

GotCloud and this basic tutorial were presented in two sessions at the 2013 IBG Workshop: a Wednesday overview with steps for running the tutorial data (IBG2013GotCloud.pdf), and a Friday session with more detail on the input files and how they are generated (GotCloudIBGWorkshop2013Friday.pdf).

This tutorial is in the process of being updated for gotcloud version 1.08 (July 30, 2013).

STEP 1 : Setup GotCloud

GotCloud has been developed and tested on Ubuntu Linux 12.10 and 12.04.2 LTS, but has not been tested on other Linux distributions. It is not available for Windows. If you do not have your own set of machines to run on, GotCloud is also available for Ubuntu running on the Amazon Elastic Compute Cloud; see Amazon_Snapshot for more information.

We will use 3 different directories for this tutorial:

  1. path to the directory where gotcloud is installed, default ~/gotcloud-latest/
  2. path to the directory where the example data is installed, default ~/gotcloudExample
  3. path to your output directory, default ~/gotcloudTutorialOut/

If the directories specified above do not reflect the directories you would like to use, replace their occurrences in the instructions below with the appropriate paths.
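The three paths above can be captured once as shell variables so the later commands are easy to adapt; a small sketch (the variable names are ours for convenience, not anything GotCloud reads):

```shell
# Default tutorial directories (adjust to match your installation).
# These variable names are just a convenience for this tutorial.
GOTCLOUD_DIR=~/gotcloud-latest
EXAMPLE_DIR=~/gotcloudExample
OUT_DIR=~/gotcloudTutorialOut

# The output directory is created if it does not already exist.
mkdir -p "$OUT_DIR"
echo "output: $OUT_DIR"
```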


Step 1a: Install GotCloud

In order to run this tutorial, you need to make sure you have GotCloud installed on your system.

If you have root access and would like to install GotCloud system-wide, follow the root access installation instructions

Otherwise, you can install it in your own directory:

  1. Change to the directory where you want gotcloud-latest/ installed
  2. Download the gotcloud tar.
  3. Extract the tar
  4. Build (compile) the source
    • Note: as the source builds, many messages will scroll through your terminal. You may even see some warnings. These messages are normal and expected. As long as the build does not end with an error, you have successfully built the source.
cd ~
wget https://github.com/statgen/gotcloud/archive/latest.tar.gz  # Download
tar xf latest.tar.gz     # Extracts into gotcloud-latest/
cd gotcloud-latest/src; make         # Build source

Step 1b: Install Example Dataset

Our dataset consists of individuals from Great Britain (GBR) sequenced by the 1000 Genomes Project. These individuals have been sequenced to an average depth of about 4x.

To conserve time and disk-space, our analysis will focus on a small region on chromosome 20, 42900000 - 43200000.

The alignment pipeline in this tutorial will be run on 2 of the individuals (HG00096, HG00100). The fastqs used for this step are reduced to reads that fall into our target region.

The snpcall and ldrefine pipelines will use previously aligned/mapped reads for 60 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.

The example dataset we'll be using is available at: ftp://share.sph.umich.edu/gotcloud/gotcloudExample.tgz

  1. Change directory to where you want to install the Tutorial data
  2. Download the dataset tar from the ftp site
  3. Extract the tar
cd ~
wget ftp://share.sph.umich.edu/gotcloud/gotcloudExample_latest.tgz  # Download 
tar xf gotcloudExample_latest.tgz    # Extracts into gotcloudExample/

STEP 2 : Run GotCloud Alignment Pipeline

The first step in processing next generation sequence data is mapping the reads to the reference genome, generating per sample BAM files.

The alignment pipeline has multiple built-in steps to generate BAMs:

  1. Align the fastqs to the reference genome
    • handles both single & paired end
  2. Merge the results from multiple fastqs into 1 file per sample
  3. Mark duplicate reads
  4. Recalibrate Base Qualities

This processing results in 1 BAM file per sample.

The alignment pipeline also includes Quality Control (QC) steps:

  1. Visualization of various quality measures (QPLOT)
  2. Screen for sample contamination & swap (VerifyBamID)

Run the alignment pipeline (the example aligns 2 samples) :

cd ~
gotcloud-latest/gotcloud align --conf gotcloudExample/GBR2align.conf --outdir gotcloudTutorialOut --baseprefix gotcloudExample

Upon successful completion of the alignment pipeline (about 1-3 minutes), you will see the following message:

Processing finished in n secs with no errors reported 

The final BAM files produced by the alignment pipeline are:

ls gotcloudTutorialOut/bams

In this directory you will see:

  • BAM (.bam) files - 1 per sample
    • HG00096.recal.bam
    • HG00100.recal.bam
  • BAM index files (.bai) – 1 per sample
    • HG00096.recal.bam.bai
    • HG00100.recal.bam.bai
  • Indicator files that the step completed successfully:
    • HG00096.recal.bam.done
    • HG00100.recal.bam.done

The Quality Control (QC) files are:

ls gotcloudTutorialOut/QCFiles

In this directory you will see:

  • VerifyBamID output files:
    • HG00096.genoCheck.depthRG
    • HG00096.genoCheck.depthSM
    • HG00096.genoCheck.selfRG
    • HG00096.genoCheck.selfSM
    • HG00100.genoCheck.depthRG
    • HG00100.genoCheck.depthSM
    • HG00100.genoCheck.selfRG
    • HG00100.genoCheck.selfSM
  • VerifyBamID step completion files – 1 per sample
    • HG00096.genoCheck.done
    • HG00100.genoCheck.done
  • QPLOT output files
    • HG00096.qplot.R
    • HG00096.qplot.stats
    • HG00100.qplot.R
    • HG00100.qplot.stats
  • QPLOT step completion files – 1 per sample
    • HG00096.qplot.done
    • HG00100.qplot.done

For information on the VerifyBamID output, see: Understanding VerifyBamID output

For information on the QPLOT output, see: Understanding QPLOT output

STEP 3 : Run GotCloud Variant Calling Pipeline

The next step is to analyze BAM files by calling SNPs and generating a VCF file containing the variant calls.

The variant calling pipeline has multiple built-in steps to generate a VCF:

  1. Filter out reads with low mapping quality
  2. Per Base Alignment Quality Adjustment (BAQ)
  3. Resolve overlapping paired end reads
  4. Generate genotype likelihood files
  5. Perform variant calling
  6. Extract features from variant sites
  7. Perform variant filtering


To speed variant calling, each chromosome is broken up into smaller regions that are processed separately. While the data is initially split by sample, the per-sample data is merged and processed together for each region. These regions are later merged into a single Variant Call Format (VCF) file per chromosome. For the tutorial, all of the data falls within a single region.
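As a sketch of the region arithmetic (assuming 5,000,000-base chunks, which is consistent with the 40000001.45000000 region subdirectory this run produces), the tutorial region 20:42900000-43200000 falls entirely within one chunk:

```shell
# Which region chunk does a position fall into?
# Chunk size of 5,000,000 bases is an assumption consistent with the
# 40000001.45000000 directory name seen in the tutorial output.
pos=42900000
chunk=5000000
start=$(( (pos - 1) / chunk * chunk + 1 ))
end=$(( start + chunk - 1 ))
echo "region directory: ${start}.${end}"
```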

Run the variant calling pipeline:

cd ~
gotcloud-latest/gotcloud snpcall --conf gotcloudExample/GBR60vc.conf --outdir gotcloudTutorialOut --numjobs 2 --region 20:42900000-43200000 --baseprefix gotcloudExample

Upon successful completion of the variant calling pipeline (about 2-4 minutes), you will see the following message:

 Commands finished in nnn secs with no errors reported 

On SNP Call success, the VCF files of interest are:

ls gotcloudTutorialOut/vcfs/chr20/chr20.filtered*

This gives you the following files:

  • chr20.filtered.vcf.gz - vcf for whole chromosome after it has been run through hardfilters and SVM filters and marked with PASS/FAIL including per sample genotypes
  • chr20.filtered.sites.vcf - vcf for whole chromosome after it has been run through filters and marked with PASS/FAIL without the per sample genotypes
  • chr20.filtered.sites.vcf.norm.log - log file
  • chr20.filtered.sites.vcf.summary - summary of filters applied
  • chr20.filtered.vcf.gz.OK - indicator that the filtering completed successfully
  • chr20.filtered.vcf.gz.tbi - index file for the vcf file

Also in the ~/gotcloudTutorialOut/vcfs/chr20 directory are intermediate files:

  • the whole chromosome variant calls prior to any filtering:
    • chr20.merged.sites.vcf - without per sample genotypes
    • chr20.merged.stats.vcf
    • chr20.merged.vcf - including per sample genotypes
    • chr20.merged.vcf.OK - indicator that the step completed successfully
  • the hardfiltered (pre-svm filtered) variant calls:
    • chr20.hardfiltered.vcf.gz - vcf for whole chromosome after it has been run through hard filters
    • chr20.hardfiltered.sites.vcf - vcf for whole chromosome after it has been run through filters and marked with PASS/FAIL without the per sample genotypes
    • chr20.hardfiltered.sites.vcf.log - log file
    • chr20.hardfiltered.sites.vcf.summary - summary of filters applied
    • chr20.hardfiltered.vcf.gz.OK - indicator that the filtering completed successfully
    • chr20.hardfiltered.vcf.gz.tbi - index file for the vcf file
  • 40000001.45000000 subdirectory contains the data for just that region.

The ~/gotcloudTutorialOut/split/chr20 folder contains a VCF with just the sites that pass the filters.

ls ~/gotcloudTutorialOut/split/chr20/
  • chr20.filtered.PASS.vcf.gz – vcf of just sites that pass all filters
  • chr20.filtered.PASS.split.1.vcf.gz - intermediate file
  • chr20.filtered.PASS.split.err - log file
  • chr20.filtered.PASS.split.vcflist - list of intermediate files
  • subset.OK

In addition to the vcfs subdirectory, there are other intermediate files/directories:

  • glfs – holds genotype likelihood format GLF files split by chromosome, region, and sample
  • pvcfs – holds intermediate vcf files split by chromosome and region

Note: the tutorial does not produce a target directory, but if you run with targeted data, you may see one.

STEP 4 : Run GotCloud Genotype Refinement Pipeline

The next step is to perform genotype refinement, using linkage disequilibrium information, with Beagle & ThunderVCF.

Run the LD-aware genotype refinement pipeline:

cd ~
gotcloud-latest/gotcloud ldrefine --conf gotcloudExample/GBR60vc.conf --outdir gotcloudTutorialOut --numjobs 2 --baseprefix gotcloudExample

Upon successful completion of this pipeline (about 3-10 minutes), you will see the following message:

Commands finished in nnn secs with no errors reported 

The output from the beagle step of the genotype refinement pipeline is found in:

ls gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz gotcloudTutorialOut/beagle/chr20/chr20.filtered.PASS.beagled.vcf.gz.tbi 

The output from the thunderVcf (final step) of the genotype refinement pipeline is found in:

ls gotcloudTutorialOut/thunder/chr20/GBR/thunder/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz gotcloudTutorialOut/thunder/chr20/GBR/chr20.filtered.PASS.beagled.GBR.thunder.vcf.gz.tbi

STEP 5 : Run GotCloud Association Analysis Pipeline (EPACTS)

We will assume that EPACTS is installed in the following directory:

setenv EPACTS /path/to/epacts

(If you need to install EPACTS, please refer to the documentation at EPACTS#Installation_Details)
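The setenv command above is for csh/tcsh shells; in bash or zsh the equivalent would be (the path is a placeholder, as above):

```shell
# bash/zsh equivalent of the csh `setenv` above.
# /path/to/epacts is a placeholder for your actual EPACTS install directory.
export EPACTS=/path/to/epacts
echo "$EPACTS"
```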

$EPACTS/epacts single --vcf ~/gotcloudTutorialOut/vcfs/chr20/chr20.filtered.vcf.gz --ped ~/gotcloudExample/test.GBR60.ped \
   --out ~/gotcloudTutorialOut/epacts --test q.linear --run 1 --top 1 --chr 20

Upon a successful run, you will see files whose names start with ~/gotcloudTutorialOut/epacts:

ls ~/gotcloudTutorialOut/epacts*

To see the top associated variants, you can run

less ~/gotcloudTutorialOut/epacts.epacts.top5000

To see the LocusZoom-like plot, you can type the following command (assuming GNU gnuplot 4.2 or higher is installed):

xpdf ~/gotcloudTutorialOut/epacts.zoom.20.42987877.pdf

Click Example LocusZoom PDF to see the expected output PDF

Frequently Asked Questions (FAQs)

I ran the tutorial example successfully, how can I run it with my real sequence data?

Congratulations on your successful run of the GotCloud tutorial. Please see the #Tutorial Inputs section to prepare your own input files for your sequence data. You will need to specify the FASTQ files associated with their sample names as explained there. In addition, you will need to download the full reference and resource files for the whole genome (the tutorial contains only the chr20 portion to keep it compact); see the #Alignment Configuration File section for details. Also, please refer to the main GotCloud documentation for a more detailed guide to installation beyond the scope of this tutorial.

Input Files for GotCloud Tutorial

This section describes the input files needed for the GotCloud tutorial. You don't need to know these details to run the tutorial, but if you are interested in understanding the structure of the GotCloud pipeline and running it with your own samples, this is a good starting point.

Alignment Pipeline

List of Input Files needed for Alignment

The command-line inputs to the tutorial alignment pipeline are:

  1. Configuration File (--conf)
    • Specifies the configuration file to use when running
  2. Output Directory (--outdir)
    • Directory where the output should be placed.

Additional information required to run the alignment pipeline:

  1. Index file of FASTQs
  2. Alignment Reference Files

For the tutorial, these values are specified in the configuration file.

Alignment Configuration File

The configuration file contains KEY = VALUE settings that override defaults and set specific values for the given run.

 
INDEX_FILE = GBR60fastq.index 
############ 
# References 
REF_DIR = chr20Ref 
AS = NCBI37 
FA_REF = $(REF_DIR)/human_g1k_v37_chr20.fa 
DBSNP_VCF =  $(REF_DIR)/dbsnp135_chr20.vcf.gz 
HM3_VCF = $(REF_DIR)/hapmap_3.3.b37.sites.chr20.vcf.gz 

This configuration file sets:

  • INDEX_FILE - file containing the fastqs to be processed as well as the read group information for these fastqs.
  • Reference Information: see Alignment Reference Files for more information
    • AS - assembly value to put in the BAM

The index file and chromosome 20 references used in this tutorial are included with the example data under the ~/gotcloudExample directory. The tutorial uses chromosome 20 only references in order to speed the processing time.

When running with your own data, you will need to update the:

  • Index File to contain the information for your own fastq files
  • The reference files to be whole genome references

Note: It is recommended that you use absolute paths (full path names, like “/home/mktrost/gotcloudReference” rather than just “gotcloudReference”). This example does not use absolute paths in order to be flexible to where the data is installed, but using relative paths requires it to be run from the correct directory.
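For example, one way to obtain the absolute path to put in a configuration file is to resolve it from the current directory (a sketch; readlink -f is the GNU/Linux behavior, and chr20Ref here stands in for your reference directory):

```shell
# Resolve a relative directory name into the absolute path you would
# write into the configuration file. chr20Ref is a stand-in name.
abs=$(readlink -f chr20Ref)
echo "REF_DIR = $abs"
```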


Reference Files

Reference files are required for running both the alignment and variant calling pipelines.

The configuration keys for setting these are:

  • FA_REF - Genome sequence reference files (needed for both pipelines)
  • DBSNP_VCF – DBSNP site VCF file (needed for both pipelines)
  • HM3_VCF - HAPMAP site VCF file (needed for both pipelines)
  • INDEL_PREFIX - Indel sites file (needed for the variant calling pipeline)

The tutorial configuration file is set up to point to the required chromosome 20 reference files, which are included with the tutorial example data in ~/gotcloudExample/chr20Ref/.

If you are running more than just chromosome 20, you will need whole genome reference files which can be downloaded from GotCloudReference.

The configuration settings for these files are set up in the default configuration, so they do not need to be specified. You just need to set REF_DIR in your configuration file to the path where you installed your reference files.
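Concretely, a whole-genome run would then only need a line like the following in your configuration file, where the path is a placeholder for wherever you installed the reference bundle:

```
REF_DIR = /path/to/gotcloudReference
```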

To learn more about the reference files that are required, see GotCloud: Reference Files.


Alignment Output Directory

This setting tells the alignment pipeline where to write the output and intermediate files.

The output directory will be created if it doesn't already exist and will contain the following Directories/files:

  • bams - directory containing the final bams/bai files
    • HG00096.OK - indicates that this sample completed alignment processing
    • HG00100.OK - indicates that this sample completed alignment processing
  • failLogs - directory containing logs from steps that failed
    • this directory is only created if an error is detected
  • Makefiles - directory containing the makefiles with commands for processing each sample
    • biopipe_HG00096.Makefile – commands for processing sample HG00096
    • biopipe_HG00100.Makefile – commands for processing sample HG00100
    • biopipe_HG00096.Makefile.log – log file from running the associated Makefile
    • biopipe_HG00100.Makefile.log – log file from running the associated Makefile
  • QCFiles - directory containing the QC Results
  • tmp - directory containing temporary alignment files
    • bwa.sai.t – contains temporary files for the 1st step of bwa that generates sai files from fastq files
      • fastq
        • *.done – indicator files that the step to generate the file completed
        • HG00096
    • alignment.bwa – contains temporary files for the 2nd step of bwa that generates BAM files
      • fastq
        • *.done – indicator files that the step to generate the file completed
    • alignment.pol - contains temporary files for the polish bam step that cleans up BAM files
    • alignment.dedup – contains temporary files for the deduping step
      • *.done – indicator files that the step to generate the file completed
      • *.metrics – metrics files for the deduping step
    • alignment.recal – contains temporary files for the recalibration step
      • *.log – contains information about the recalibration step
      • *.qemp – contains the recalibration tables used for recalibrating each BAM file


Alignment FASTQ Index File

There are four fastq files in {ROOT_DIR}/test/align/fastq/Sample_1 and four fastq files in {ROOT_DIR}/test/align/fastq/Sample_2, both in paired-end format. Normally, we would need to build an index file for these files. Conveniently, an index file (indexFile.txt) already exists for the automatic test samples. It can be found in {ROOT_DIR}/test/align/, and contains the following information in tab-delimited format:

MERGE_NAME FASTQ1                           FASTQ2                           RGID   SAMPLE    LIBRARY CENTER PLATFORM 
Sample1    fastq/Sample_1/File1_R1.fastq.gz fastq/Sample_1/File1_R2.fastq.gz RGID1  SampleID1 Lib1    UM     ILLUMINA 
Sample1    fastq/Sample_1/File2_R1.fastq.gz fastq/Sample_1/File2_R2.fastq.gz RGID1a SampleID1 Lib1    UM     ILLUMINA 
Sample2    fastq/Sample_2/File1_R1.fastq.gz fastq/Sample_2/File1_R2.fastq.gz RGID2  SampleID2 Lib2    UM     ILLUMINA 
Sample2    fastq/Sample_2/File2_R1.fastq.gz fastq/Sample_2/File2_R2.fastq.gz RGID2  SampleID2 Lib2    UM     ILLUMINA 

If you are in the {ROOT_DIR}/test/align directory, you can use this file as-is. If you prefer, you can create a new index file and change the MERGE_NAME, RGID, SAMPLE, LIBRARY, CENTER, or PLATFORM values. It is recommended that you do not modify existing files in {ROOT_DIR}/test/align.
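Because the index file is plain delimited text, standard tools can sanity-check it. A sketch that counts fastq pairs per MERGE_NAME, the unit that determines how many final BAMs are produced (the heredoc reproduces the rows above, space-delimited for brevity; real index files are tab-delimited):

```shell
# Reproduce the example index rows in a demo file (space-delimited here).
cat > indexFile.demo.txt <<'EOF'
MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY CENTER PLATFORM
Sample1 fastq/Sample_1/File1_R1.fastq.gz fastq/Sample_1/File1_R2.fastq.gz RGID1 SampleID1 Lib1 UM ILLUMINA
Sample1 fastq/Sample_1/File2_R1.fastq.gz fastq/Sample_1/File2_R2.fastq.gz RGID1a SampleID1 Lib1 UM ILLUMINA
Sample2 fastq/Sample_2/File1_R1.fastq.gz fastq/Sample_2/File1_R2.fastq.gz RGID2 SampleID2 Lib2 UM ILLUMINA
Sample2 fastq/Sample_2/File2_R1.fastq.gz fastq/Sample_2/File2_R2.fastq.gz RGID2 SampleID2 Lib2 UM ILLUMINA
EOF

# Count fastq pairs per merge name; one final BAM is produced per MERGE_NAME.
awk 'NR > 1 { count[$1]++ } END { for (m in count) print m, count[m] }' indexFile.demo.txt | sort
```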

If you want to run this example from a different directory, make sure the FASTQ1 and FASTQ2 paths are correct. That is, each FASTQ1 and FASTQ2 entry in the index file should look like the following:

{ROOT_DIR}/test/align/fastq/Sample_1/File1_R1.fastq.gz 

Alternatively, if you want to run this example from a different directory but do not want to edit the index file, you can create a symbolic link so that the relative fastq paths listed in the index file resolve correctly:

ln -s {ROOT_DIR}/test/align/fastq fastq 

This will create a symbolic link to the test fastq directory from your current directory.

(More information about: the index file.)


Analyzing a Sample

Using UMAKE, you can analyze BAM files by calling SNPs, and generate a VCF file containing the results. Once again, we can analyze BAM files used in the automatic test. For this example, we have 60 BAM files, which can be found in {ROOT_DIR}/test/umake/bams. These contain sequence information for a targeted region in chromosome 20.

In addition to the BAM files, you will need three files to run UMAKE: an index file, a configuration file, and a bed file (needed to analyze BAM files from targeted/exome sequencing).


Index file

First, you need a list of all the BAM files to be analyzed. Conveniently, a test index file (umake_test.index) already exists in {ROOT_DIR}/test/umake/. It contains the following information:

NA12272 ALL     bams/NA12272.mapped.ILLUMINA.bwa.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam 
NA12004 ALL     bams/NA12004.mapped.ILLUMINA.bwa.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam 
... 
NA12874 ALL     bams/NA12874.mapped.LS454.ssaha2.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam 

You can use this file directly if you change your current directory to {ROOT_DIR}/test/umake/.

Alternatively, if you want to copy this index file to a different directory and use it there, you can create a symbolic link to the bams folder as follows:

ln -s {ROOT_DIR}/test/umake/bams bams 

(More information about: the index file.)


BED file

This file contains a single line:

chr20   20000050        20300000 

You can copy this to the current directory and use it as-is.

(More information about: targeted/exome sequencing settings.)
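With the test configuration's OFFSET_OFF_TARGET = 50 setting, the target is extended by 50 bases (assuming the extension applies on both sides); a sketch of the arithmetic for the single BED line above:

```shell
# Extend the single BED target by OFFSET_OFF_TARGET bases on each side.
# Assumption: the offset is applied symmetrically to start and end.
start=20000050
end=20300000
offset=50
echo "chr20 $((start - offset)) $((end + offset))"
```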


Configuration file

A configuration file (umake_test.conf) already exists in {ROOT_DIR}/test/umake/. It contains the following information:

CHRS = 20 
BAM_INDEX = GBR60bam.index 
############ 
# References 
REF_ROOT = chr20Ref 
# 
REF = $(REF_ROOT)/human_g1k_v37_chr20.fa 
INDEL_PREFIX = $(REF_ROOT)/1kg.pilot_release.merged.indels.sites.hg19 
DBSNP_VCF = $(REF_ROOT)/dbsnp135_chr20.vcf.gz 
HM3_VCF = $(REF_ROOT)/hapmap_3.3.b37.sites.chr20.vcf.gz 


CHRS = 20 
TEST_ROOT = $(UMAKE_ROOT)/test/umake 
BAM_INDEX = $(TEST_ROOT)/umake_test.index 
OUT_PREFIX = umake_test 
REF_ROOT = $(TEST_ROOT)/ref 
# 
REF = $(REF_ROOT)/karma.ref/human.g1k.v37.chr20.fa 
INDEL_PREFIX = $(REF_ROOT)/indels/1kg.pilot_release.merged.indels.sites.hg19 
DBSNP_PREFIX =  $(REF_ROOT)/dbSNP/dbsnp_135_b37.rod 
HM3_PREFIX =  $(REF_ROOT)/HapMap3/hapmap3_r3_b37_fwd.consensus.qc.poly 
# 
RUN_INDEX = TRUE        # create BAM index file 
RUN_PILEUP = TRUE       # create GLF file from BAM 
RUN_GLFMULTIPLES = TRUE # create unfiltered SNP calls 
RUN_VCFPILEUP = TRUE    # create PVCF files using vcfPileup and run infoCollector 
RUN_FILTER = TRUE       # filter SNPs using vcfCooker 
RUN_SPLIT = TRUE        # split SNPs into chunks for genotype refinement 
RUN_BEAGLE = FALSE  # BEAGLE - MUST SET AFTER FINISHING PREVIOUS STEPS 
RUN_SUBSET = FALSE  # SUBSET FOR THUNDER - MAY BE SET WITH BEAGLE STEP TOGETHER 
RUN_THUNDER = FALSE # THUNDER - MUST SET AFTER FINISHING PREVIOUS STEPS 
############################################################################### 
WRITE_TARGET_LOCI = TRUE  # FOR TARGETED SEQUENCING ONLY -- Write loci file when performing pileup 
UNIFORM_TARGET_BED = $(TEST_ROOT)/umake_test.bed # Targeted sequencing : When all individuals has the same target. Otherwise, comment it out 
OFFSET_OFF_TARGET = 50 # Extend target by given # of bases 
MULTIPLE_TARGET_MAP =  # Target per individual : Each line contains [SM_ID] [TARGET_BED] 
TARGET_DIR = target    # Directory to store target information 
SAMTOOLS_VIEW_TARGET_ONLY = TRUE # When performing samtools view, exclude off-target regions (may make command line too long) 

If you are running this from a different directory, you will want to change some of the lines as follows:

BAM_INDEX = {CURRENT_DIR}/umake_test.index 
UNIFORM_TARGET_BED = {CURRENT_DIR}/umake_test.bed 

where {CURRENT_DIR} is the absolute path to the directory that contains the index and bed files.

An additional option can be added in the configuration file:

OUT_DIR = {OUT_DIR} 

where {OUT_DIR} is the name of directory in which you want the output to be stored. If you do not specify this in the configuration file, you will need to add an extra parameter when you run UMAKE in the next step.

(More information about: the configuration file, reference files.)


Further Information

Mapping (Alignment) Pipeline

Variant Calling Pipeline (UMAKE)