Difference between revisions of "Tutorial: GotCloud UW CMG"

From Genome Analysis Wiki
Jump to navigationJump to search
Line 25: Line 25:
 
In order to run this tutorial, you need to make sure you have GotCloud installed on your system.   
 
In order to run this tutorial, you need to make sure you have GotCloud installed on your system.   
  
Try to run the following commands sequentially to follow the instruction
+
Try to run the following commands sequentially to follow the instruction.
 +
 
 +
First, let's set an environment variable for your convenience.
  
 
  % export GC=/home/presenter02/day2/session1/uwcmg_2013_08/
 
  % export GC=/home/presenter02/day2/session1/uwcmg_2013_08/
 +
 +
The raw sequence reads can be found by
  
 
  % ls ${GC}/examples/fastq
 
  % ls ${GC}/examples/fastq
Line 33: Line 37:
 
  SRR035022_2.fastq.gz  SRR035023_2.fastq.gz  SRR035024_2.fastq.gz  SRR035025_2.fastq.gz  SRR035026_2.fastq.gz  SRR035027_2.fastq.gz  SRR035669_2.fastq.gz  SRR622461_2.fastq.gz
 
  SRR035022_2.fastq.gz  SRR035023_2.fastq.gz  SRR035024_2.fastq.gz  SRR035025_2.fastq.gz  SRR035026_2.fastq.gz  SRR035027_2.fastq.gz  SRR035669_2.fastq.gz  SRR622461_2.fastq.gz
 
  SRR035022.fastq.gz    SRR035023.fastq.gz    SRR035024.fastq.gz    SRR035025.fastq.gz    SRR035026.fastq.gz    SRR035027.fastq.gz    SRR035669.fastq.gz    SRR622461.fastq.gz
 
  SRR035022.fastq.gz    SRR035023.fastq.gz    SRR035024.fastq.gz    SRR035025.fastq.gz    SRR035026.fastq.gz    SRR035027.fastq.gz    SRR035669.fastq.gz    SRR622461.fastq.gz
 +
 +
This set of FASTQ are extracted from two samples from the 1000 Genomes project, focusing around 500kb region are CFTR.
 +
 +
The example directory contain other set of files, such as reference sequence, index, and BAM files (for variant calling).
  
 
  % ls ${GC}/examples
 
  % ls ${GC}/examples
 
  bams  chr7Ref  fastq  index
 
  bams  chr7Ref  fastq  index
 +
 +
In order to run alignment pipeline, we need an index file for FASTQ files.
  
 
  % cat ${GC}/examples/index/chr7.CFTR.fastq.index
 
  % cat ${GC}/examples/index/chr7.CFTR.fastq.index
Line 55: Line 65:
 
  NA12878 fastq/SRR622461.fastq.gz . SRR622461 NA12878 Illumina_NA12878 ILLUMINA ILLUMINA
 
  NA12878 fastq/SRR622461.fastq.gz . SRR622461 NA12878 Illumina_NA12878 ILLUMINA ILLUMINA
 
  NA12878 fastq/SRR622461_1.fastq.gz fastq/SRR622461_2.fastq.gz SRR622461 NA12878 Illumina_NA12878 ILLUMINA ILLUMINA
 
  NA12878 fastq/SRR622461_1.fastq.gz fastq/SRR622461_2.fastq.gz SRR622461 NA12878 Illumina_NA12878 ILLUMINA ILLUMINA
 +
 +
Also, we need a configuration file for GotCloud (default configuration usually works if you install GotCloud on your own, but here we're using reduced examples).
  
 
  % cat ${GC}/examples/index/chr7.CFTR.align.conf
 
  % cat ${GC}/examples/index/chr7.CFTR.align.conf
Line 65: Line 77:
 
  DBSNP_VCF =  $(REF_DIR)/dbsnp_135.b37.chr7.CFTR.vcf.gz
 
  DBSNP_VCF =  $(REF_DIR)/dbsnp_135.b37.chr7.CFTR.vcf.gz
 
  HM3_VCF = $(REF_DIR)/hapmap_3.3.b37.sites.chr7.CFTR.vcf.gz
 
  HM3_VCF = $(REF_DIR)/hapmap_3.3.b37.sites.chr7.CFTR.vcf.gz
 +
 +
Create a test directory and enter it for running alignment pipeline
  
 
  % mkdir test
 
  % mkdir test
Line 86: Line 100:
 
  Processing finished in 42 secs with no errors reported
 
  Processing finished in 42 secs with no errors reported
  
Examine the QC metrics by
+
Let's see where the BAM files are generated
  
 
  % ls align/bams
 
  % ls align/bams
 
  NA06984.recal.bam      NA06984.recal.bam.bai.done  NA06984.recal.bam.metrics  NA12878.recal.bam      NA12878.recal.bam.bai.done  NA12878.recal.bam.metrics
 
  NA06984.recal.bam      NA06984.recal.bam.bai.done  NA06984.recal.bam.metrics  NA12878.recal.bam      NA12878.recal.bam.bai.done  NA12878.recal.bam.metrics
 
  NA06984.recal.bam.bai  NA06984.recal.bam.done      NA06984.recal.bam.qemp    NA12878.recal.bam.bai  NA12878.recal.bam.done      NA12878.recal.bam.qemp
 
  NA06984.recal.bam.bai  NA06984.recal.bam.done      NA06984.recal.bam.qemp    NA12878.recal.bam.bai  NA12878.recal.bam.done      NA12878.recal.bam.qemp
+
 
 +
Examine the QC metrics by
 +
 
 
  % ls align/QCFiles
 
  % ls align/QCFiles
 
  NA06984.genoCheck.depthRG  NA06984.genoCheck.selfRG  NA06984.qplot.R            NA12878.genoCheck.depthSM  NA12878.genoCheck.selfSM  NA12878.qplot.stats
 
  NA06984.genoCheck.depthRG  NA06984.genoCheck.selfRG  NA06984.qplot.R            NA12878.genoCheck.depthSM  NA12878.genoCheck.selfSM  NA12878.qplot.stats
 
  NA06984.genoCheck.depthSM  NA06984.genoCheck.selfSM  NA06984.qplot.stats        NA12878.genoCheck.done    NA12878.qplot.done
 
  NA06984.genoCheck.depthSM  NA06984.genoCheck.selfSM  NA06984.qplot.stats        NA12878.genoCheck.done    NA12878.qplot.done
 
  NA06984.genoCheck.done    NA06984.qplot.done        NA12878.genoCheck.depthRG  NA12878.genoCheck.selfRG  NA12878.qplot.R
 
  NA06984.genoCheck.done    NA06984.qplot.done        NA12878.genoCheck.depthRG  NA12878.genoCheck.selfRG  NA12878.qplot.R
 +
 +
Text-version of QC metrics looks as follows
  
 
  % cat align/QCFiles/NA06984.qplot.stats
 
  % cat align/QCFiles/NA06984.qplot.stats
Line 124: Line 142:
 
  BaseComp_T(%) 31.6
 
  BaseComp_T(%) 31.6
 
  BaseComp_O(%) 0.0
 
  BaseComp_O(%) 0.0
 +
 +
You can also run the R command to create PDF files (see lecture slides for examples).
 +
 +
To check whether the samples are contaminated, look at the verifyBamID results.
  
 
  % cat align/QCFiles/NA06984.genoCheck.selfSM
 
  % cat align/QCFiles/NA06984.genoCheck.selfSM

Revision as of 12:16, 6 August 2013

GotCloud / EPACTS Tutorial

In this tutorial, we illustrate some of the essential steps in the analysis of next generation sequence data.

For a background on GotCloud and Sequence Analysis Pipelines, see GotCloud

While GotCloud can run on a cluster of machines or instances, this tutorial is just a small test that just runs on the machine the commands are run on.

This tutorial is a specialized version for UW Center for Mendelian Genomics. For other tutorial, please see Tutorial: GotCloud.

STEP 0 : Login to the cluster

In UW CMG cluster, you need to login to one of the high-performance cluster node via the following way.

% ssh yourid@uwcmg-head.gs.washington.edu
Password:
% qlogin
Your job 120 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled …
Your interactive job 120 has been successfully scheduled.
Establishing /usr/local/bin/qlogin_command session to host uwcmg001.gs.washington.edu ...
yourid@uwcmg001.gs.washington.edu's password: 

Step 1 : GotCloud Alignment Pipeline

In order to run this tutorial, you need to make sure you have GotCloud installed on your system.

Try to run the following commands sequentially to follow the instruction.

First, let's set an environment variable for your convenience.

% export GC=/home/presenter02/day2/session1/uwcmg_2013_08/

The raw sequence reads can be found by

% ls ${GC}/examples/fastq
SRR035022_1.fastq.gz  SRR035023_1.fastq.gz  SRR035024_1.fastq.gz  SRR035025_1.fastq.gz  SRR035026_1.fastq.gz  SRR035027_1.fastq.gz  SRR035669_1.fastq.gz  SRR622461_1.fastq.gz
SRR035022_2.fastq.gz  SRR035023_2.fastq.gz  SRR035024_2.fastq.gz  SRR035025_2.fastq.gz  SRR035026_2.fastq.gz  SRR035027_2.fastq.gz  SRR035669_2.fastq.gz  SRR622461_2.fastq.gz
SRR035022.fastq.gz    SRR035023.fastq.gz    SRR035024.fastq.gz    SRR035025.fastq.gz    SRR035026.fastq.gz    SRR035027.fastq.gz    SRR035669.fastq.gz    SRR622461.fastq.gz

This set of FASTQ are extracted from two samples from the 1000 Genomes project, focusing around 500kb region are CFTR.

The example directory contain other set of files, such as reference sequence, index, and BAM files (for variant calling).

% ls ${GC}/examples
bams  chr7Ref  fastq  index

In order to run alignment pipeline, we need an index file for FASTQ files.

% cat ${GC}/examples/index/chr7.CFTR.fastq.index
MERGE_NAME	FASTQ1	FASTQ2	RGID	SAMPLE	LIBRARY	CENTER	PLATFORM
NA06984	fastq/SRR035022.fastq.gz	.	SRR035022	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035022_1.fastq.gz	fastq/SRR035022_2.fastq.gz	SRR035022	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035023.fastq.gz	.	SRR035023	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035023_1.fastq.gz	fastq/SRR035023_2.fastq.gz	SRR035023	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035024.fastq.gz	.	SRR035024	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035024_1.fastq.gz	fastq/SRR035024_2.fastq.gz	SRR035024	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035025.fastq.gz	.	SRR035025	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035025_1.fastq.gz	fastq/SRR035025_2.fastq.gz	SRR035025	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035026.fastq.gz	.	SRR035026	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035026_1.fastq.gz	fastq/SRR035026_2.fastq.gz	SRR035026	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035027.fastq.gz	.	SRR035027	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035027_1.fastq.gz	fastq/SRR035027_2.fastq.gz	SRR035027	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035669.fastq.gz	.	SRR035669	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035669_1.fastq.gz	fastq/SRR035669_2.fastq.gz	SRR035669	NA06984	Solexa-16556	BI	ILLUMINA
NA12878	fastq/SRR622461.fastq.gz	.	SRR622461	NA12878	Illumina_NA12878	ILLUMINA	ILLUMINA
NA12878	fastq/SRR622461_1.fastq.gz	fastq/SRR622461_2.fastq.gz	SRR622461	NA12878	Illumina_NA12878	ILLUMINA	ILLUMINA

Also, we need a configuration file for GotCloud (default configuration usually works if you install GotCloud on your own, but here we're using reduced examples).

% cat ${GC}/examples/index/chr7.CFTR.align.conf
INDEX_FILE = index/chr7.CFTR.fastq.index
###################
# References
REF_DIR = chr7Ref
AS = NCBI37
REF = $(REF_DIR)/hs37d5.chr7.fa
DBSNP_VCF =  $(REF_DIR)/dbsnp_135.b37.chr7.CFTR.vcf.gz
HM3_VCF = $(REF_DIR)/hapmap_3.3.b37.sites.chr7.CFTR.vcf.gz

Create a test directory and enter it for running alignment pipeline

% mkdir test
% cd test

Then, run the alignment pipeline through the following command

% ${GC}/gotcloud/gotcloud align --conf ${GC}/examples/index/chr7.CFTR.align.conf --outDir align --baseprefix ${GC}/examples
Retrieved 168 bytes.
Random chance is 54  (compared to 30)
File sizes of 32 FASTQ input files referenced in '/home/presenter02/day2/session1/uwcmg_2013_08/examples/index/chr7.CFTR.fastq.index' = 6849341.4
Size of BAMs from aligner will be about 0.01 GB
Intermediate files from snpcaller will be about 0.01 GB
Final VCF output from snpcaller will be about 0.00 GB
Be sure you have enough space to hold all this data
Created /home/presenter02/test/align/Makefiles/align_NA06984.Makefile
Created /home/presenter02/test/align/Makefiles/align_NA12878.Makefile
---------------------------------------------------------------------
Waiting while samples are processed...
Processing finished in 42 secs with no errors reported

Let's see where the BAM files are generated

% ls align/bams
NA06984.recal.bam      NA06984.recal.bam.bai.done  NA06984.recal.bam.metrics  NA12878.recal.bam      NA12878.recal.bam.bai.done  NA12878.recal.bam.metrics
NA06984.recal.bam.bai  NA06984.recal.bam.done      NA06984.recal.bam.qemp     NA12878.recal.bam.bai  NA12878.recal.bam.done      NA12878.recal.bam.qemp

Examine the QC metrics by

% ls align/QCFiles
NA06984.genoCheck.depthRG  NA06984.genoCheck.selfRG  NA06984.qplot.R            NA12878.genoCheck.depthSM  NA12878.genoCheck.selfSM  NA12878.qplot.stats
NA06984.genoCheck.depthSM  NA06984.genoCheck.selfSM  NA06984.qplot.stats        NA12878.genoCheck.done     NA12878.qplot.done
NA06984.genoCheck.done     NA06984.qplot.done        NA12878.genoCheck.depthRG  NA12878.genoCheck.selfRG   NA12878.qplot.R

Text-version of QC metrics looks as follows

% cat align/QCFiles/NA06984.qplot.stats
Stats\BAM	/home/presenter02/test2/align/bams/NA06984.recal.bam
TotalReads(e6)	0.06
MappingRate(%)	97.44
MapRate_MQpass(%)	97.44
TargetMapping(%)	0.00
ZeroMapQual(%)	0.56
MapQual<10(%)	0.70
PairedReads(%)	98.27
ProperPaired(%)	92.83
MappedBases(e9)	0.00
Q20Bases(e9)	0.00
Q20BasesPct(%)	95.27
MeanDepth	7.33
GenomeCover(%)	0.33
EPS_MSE	293.97
EPS_Cycle_Mean	25.06
GCBiasMSE	0.00
ISize_mode	335
ISize_medium	331
DupRate(%)	2.05
QCFailRate(%)	0.00
BaseComp_A(%)	31.8
BaseComp_C(%)	18.4
BaseComp_G(%)	18.1
BaseComp_T(%)	31.6
BaseComp_O(%)	0.0

You can also run the R command to create PDF files (see lecture slides for examples).

To check whether the samples are contaminated, look at the verifyBamID results.

% cat align/QCFiles/NA06984.genoCheck.selfSM
#SEQ_ID	RG	CHIP_ID	#SNPS	#READS	AVG_DP	FREEMIX	FREELK1	FREELK0	FREE_RH	FREE_RA	CHIPMIX	CHIPLK1	CHIPLK0	CHIP_RH	CHIP_RA	DPREF	RDPHET	RDPALT
NA06984	ALL	NA	175	1038	5.93	0.00841	351.80	351.85	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

STEP 3 : Run GotCloud SNP Calling Pipeline

The next step is to analyze BAM files by calling SNPs and generating a VCF file containing the variant calls.

The variant calling pipeline has multiple built-in steps to generate BAMs:

  1. Filter out reads with low mapping quality
  2. Per Base Alignment Quality Adjustment (BAQ)
  3. Resolve overlapping paired end reads
  4. Generate genotype likelihood files
  5. Perform variant calling
  6. Extract features from variant sites
  7. Perform variant filtering


To speed variant calling, each chromosome is broken up into smaller regions which are processed separately. While initially split by sample, the per sample data gets merged and is processed together for each region. These regions are later merged to result in a single Variant Call File (VCF) per chromosome.

% ls ${GC}/examples/bams
% cat ${GC}/examples/index/chr7.CFTR.low_coverage.index
% cat ${GC}/examples/index/chr7.CFTR.low_coverage.conf
% time ${GC}/gotcloud/gotcloud snpcall --conf ${GC}/examples/index/chr7.CFTR.low_coverage.conf --outDir snps --baseprefix ${GC}/examples --region 7:117000000-117500000 --numjobs 2
% cat ${GC}/out/snps/vcfs/chr7/chr7.filtered.sites.vcf.summary
% time ${GC}/gotcloud/gotcloud beagle --conf ${GC}/examples/index/chr7.CFTR.low_coverage.conf --outDir snps --baseprefix ${GC}/examples --region 7:117000000-117500000 --numjobs 2
% samtools tview ${GC}/examples/bams/NA12843.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam ${GC}/examples/chr7Ref/hs37d5.chr7.fa

STEP 4 : EPACTS association analysis

% head ${GC}/examples/index/chr7.CFTR.ped 
% mkdir assoc
% time ${GC}/epacts/bin/epacts single  --ped ${GC}/examples/index/chr7.CFTR.ped   --vcf ${GC}/out/snps/beagle/chr7/chr7.filtered.PASS.beagled.vcf.gz  --pheno PHENO --out assoc/single.b.score --test b.score --anno   --ref ${GC}/examples/chr7Ref/hs37d5.chr7.fa   --region 7:117000000-117500000 --run 1
% time ${GC}/epacts/bin/epacts anno --in ${GC}/out/snps/beagle/chr7/chr7.filtered.PASS.beagled.vcf.gz --out snps/chr7.filtered.PASS.beagled.anno.vcf.gz --ref ${GC}examples/chr7Ref/hs37d5.chr7.fa
% zcat snps/chr7.filtered.PASS.beagled.anno.vcf.gz | grep Nonsynonymous | grep CFTR | cut -f 1-8 | head -1
% ${GC}/epacts/bin/epacts make-group --vcf snps/chr7.filtered.PASS.beagled.anno.vcf.gz --out snps/chr7.filtered.PASS.beagled.anno.grp –nonsyn
% ${GC}/epacts/bin/epacts group --ped ${GC}/examples/index/chr7.CFTR.ped --vcf snps/chr7.filtered.PASS.beagled.anno.vcf.gz --out assoc/group.skat.o --groupf snps/chr7.filtered.PASS.beagled.anno.grp --test skat --skat-o --run 2
% cat assoc/group.skat.o.epacts 
% ${GC}/epacts/bin/epacts group --ped ${GC}/examples/index/chr7.CFTR.ped --vcf snps/

STEP 5 : All-in-one running of all scripts (for later use)

% tar xzvf /home/presenter02/day2/session1/uwcmg_2013_08.tar.gz�
% cd /home/presenter02/day2/session1/uwcmg_2013_08�
% sh go.sh  �