Tutorial: GotCloud UW CMG
GotCloud / EPACTS Tutorial
In this tutorial, we illustrate some of the essential steps in the analysis of next generation sequence data.
For a background on GotCloud and Sequence Analysis Pipelines, see GotCloud
While GotCloud can run on a cluster of machines or instances, this tutorial is just a small test that just runs on the machine the commands are run on.
This tutorial is a specialized version for UW Center for Mendelian Genomics. For other tutorial, please see Tutorial: GotCloud.
STEP 0 : Login to the cluster
In UW CMG cluster, you need to login to one of the high-performance cluster node via the following way.
% ssh yourid@uwcmg-head.gs.washington.edu Password:
% qlogin Your job 120 ("QLOGIN") has been submitted waiting for interactive job to be scheduled … Your interactive job 120 has been successfully scheduled. Establishing /usr/local/bin/qlogin_command session to host uwcmg001.gs.washington.edu ... yourid@uwcmg001.gs.washington.edu's password:
Step 1 : GotCloud Alignment Pipeline
In order to run this tutorial, you need to make sure you have GotCloud installed on your system.
Try to run the following commands sequentially to follow the instruction
% export GC=/home/presenter02/day2/session1/uwcmg_2013_08/
% ls ${GC}/examples/fastq SRR035022_1.fastq.gz SRR035023_1.fastq.gz SRR035024_1.fastq.gz SRR035025_1.fastq.gz SRR035026_1.fastq.gz SRR035027_1.fastq.gz SRR035669_1.fastq.gz SRR622461_1.fastq.gz SRR035022_2.fastq.gz SRR035023_2.fastq.gz SRR035024_2.fastq.gz SRR035025_2.fastq.gz SRR035026_2.fastq.gz SRR035027_2.fastq.gz SRR035669_2.fastq.gz SRR622461_2.fastq.gz SRR035022.fastq.gz SRR035023.fastq.gz SRR035024.fastq.gz SRR035025.fastq.gz SRR035026.fastq.gz SRR035027.fastq.gz SRR035669.fastq.gz SRR622461.fastq.gz
% ls ${GC}/examples bams chr7Ref fastq index
% cat ${GC}/examples/index/chr7.CFTR.fastq.index MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY CENTER PLATFORM NA06984 fastq/SRR035022.fastq.gz . SRR035022 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035022_1.fastq.gz fastq/SRR035022_2.fastq.gz SRR035022 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035023.fastq.gz . SRR035023 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035023_1.fastq.gz fastq/SRR035023_2.fastq.gz SRR035023 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035024.fastq.gz . SRR035024 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035024_1.fastq.gz fastq/SRR035024_2.fastq.gz SRR035024 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035025.fastq.gz . SRR035025 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035025_1.fastq.gz fastq/SRR035025_2.fastq.gz SRR035025 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035026.fastq.gz . SRR035026 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035026_1.fastq.gz fastq/SRR035026_2.fastq.gz SRR035026 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035027.fastq.gz . SRR035027 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035027_1.fastq.gz fastq/SRR035027_2.fastq.gz SRR035027 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035669.fastq.gz . SRR035669 NA06984 Solexa-16556 BI ILLUMINA NA06984 fastq/SRR035669_1.fastq.gz fastq/SRR035669_2.fastq.gz SRR035669 NA06984 Solexa-16556 BI ILLUMINA NA12878 fastq/SRR622461.fastq.gz . SRR622461 NA12878 Illumina_NA12878 ILLUMINA ILLUMINA NA12878 fastq/SRR622461_1.fastq.gz fastq/SRR622461_2.fastq.gz SRR622461 NA12878 Illumina_NA12878 ILLUMINA ILLUMINA
% cat ${GC}/examples/index/chr7.CFTR.align.conf INDEX_FILE = index/chr7.CFTR.fastq.index ################### # References REF_DIR = chr7Ref AS = NCBI37 REF = $(REF_DIR)/hs37d5.chr7.fa DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.chr7.CFTR.vcf.gz HM3_VCF = $(REF_DIR)/hapmap_3.3.b37.sites.chr7.CFTR.vcf.gz
% mkdir test
% cd test
Then, run the alignment pipeline through the following command
% ${GC}/gotcloud/gotcloud align --conf ${GC}/examples/index/chr7.CFTR.align.conf --outDir align --baseprefix ${GC}/examples Retrieved 168 bytes. Random chance is 54 (compared to 30) File sizes of 32 FASTQ input files referenced in '/home/presenter02/day2/session1/uwcmg_2013_08/examples/index/chr7.CFTR.fastq.index' = 6849341.4 Size of BAMs from aligner will be about 0.01 GB Intermediate files from snpcaller will be about 0.01 GB Final VCF output from snpcaller will be about 0.00 GB Be sure you have enough space to hold all this data Created /home/presenter02/test/align/Makefiles/align_NA06984.Makefile Created /home/presenter02/test/align/Makefiles/align_NA12878.Makefile --------------------------------------------------------------------- Waiting while samples are processed... Processing finished in 42 secs with no errors reported
Examine the QC metrics by
% ls align/bams NA06984.recal.bam NA06984.recal.bam.bai.done NA06984.recal.bam.metrics NA12878.recal.bam NA12878.recal.bam.bai.done NA12878.recal.bam.metrics NA06984.recal.bam.bai NA06984.recal.bam.done NA06984.recal.bam.qemp NA12878.recal.bam.bai NA12878.recal.bam.done NA12878.recal.bam.qemp % ls align/QCFiles NA06984.genoCheck.depthRG NA06984.genoCheck.selfRG NA06984.qplot.R NA12878.genoCheck.depthSM NA12878.genoCheck.selfSM NA12878.qplot.stats NA06984.genoCheck.depthSM NA06984.genoCheck.selfSM NA06984.qplot.stats NA12878.genoCheck.done NA12878.qplot.done NA06984.genoCheck.done NA06984.qplot.done NA12878.genoCheck.depthRG NA12878.genoCheck.selfRG NA12878.qplot.R
% cat align/QCFiles/NA06984.qplot.stats Stats\BAM /home/presenter02/test2/align/bams/NA06984.recal.bam TotalReads(e6) 0.06 MappingRate(%) 97.44 MapRate_MQpass(%) 97.44 TargetMapping(%) 0.00 ZeroMapQual(%) 0.56 MapQual<10(%) 0.70 PairedReads(%) 98.27 ProperPaired(%) 92.83 MappedBases(e9) 0.00 Q20Bases(e9) 0.00 Q20BasesPct(%) 95.27 MeanDepth 7.33 GenomeCover(%) 0.33 EPS_MSE 293.97 EPS_Cycle_Mean 25.06 GCBiasMSE 0.00 ISize_mode 335 ISize_medium 331 DupRate(%) 2.05 QCFailRate(%) 0.00 BaseComp_A(%) 31.8 BaseComp_C(%) 18.4 BaseComp_G(%) 18.1 BaseComp_T(%) 31.6 BaseComp_O(%) 0.0
% cat align/QCFiles/NA06984.genoCheck.selfSM #SEQ_ID RG CHIP_ID #SNPS #READS AVG_DP FREEMIX FREELK1 FREELK0 FREE_RH FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF RDPHET RDPALT NA06984 ALL NA 175 1038 5.93 0.00841 351.80 351.85 NA NA NA NA NA NA NA NA NA NA
STEP 3 : Run GotCloud SNP Calling Pipeline
The next step is to analyze BAM files by calling SNPs and generating a VCF file containing the variant calls.
The variant calling pipeline has multiple built-in steps to generate BAMs:
- Filter out reads with low mapping quality
- Per Base Alignment Quality Adjustment (BAQ)
- Resolve overlapping paired end reads
- Generate genotype likelihood files
- Perform variant calling
- Extract features from variant sites
- Perform variant filtering
To speed variant calling, each chromosome is broken up into smaller regions which are processed separately. While initially split by sample, the per sample data gets merged and is processed together for each region. These regions are later merged to result in a single Variant Call File (VCF) per chromosome.
% ls ${GC}/examples/bams
% cat ${GC}/examples/index/chr7.CFTR.low_coverage.index
% cat ${GC}/examples/index/chr7.CFTR.low_coverage.conf
% time ${GC}/gotcloud/gotcloud snpcall --conf ${GC}/examples/index/chr7.CFTR.low_coverage.conf --outDir snps --baseprefix ${GC}/examples --region 7:117000000-117500000 --numjobs 2
% cat ${GC}/out/snps/vcfs/chr7/chr7.filtered.sites.vcf.summary
% time ${GC}/gotcloud/gotcloud beagle --conf ${GC}/examples/index/chr7.CFTR.low_coverage.conf --outDir snps --baseprefix ${GC}/examples --region 7:117000000-117500000 --numjobs 2
% samtools tview ${GC}/examples/bams/NA12843.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam ${GC}/examples/chr7Ref/hs37d5.chr7.fa
STEP 4 : EPACTS association analysis
% head ${GC}/examples/index/chr7.CFTR.ped
% mkdir assoc
% time ${GC}/epacts/bin/epacts single --ped ${GC}/examples/index/chr7.CFTR.ped --vcf ${GC}/out/snps/beagle/chr7/chr7.filtered.PASS.beagled.vcf.gz --pheno PHENO --out assoc/single.b.score --test b.score --anno --ref ${GC}/examples/chr7Ref/hs37d5.chr7.fa --region 7:117000000-117500000 --run 1
% time ${GC}/epacts/bin/epacts anno --in ${GC}/out/snps/beagle/chr7/chr7.filtered.PASS.beagled.vcf.gz --out snps/chr7.filtered.PASS.beagled.anno.vcf.gz --ref ${GC}examples/chr7Ref/hs37d5.chr7.fa
% zcat snps/chr7.filtered.PASS.beagled.anno.vcf.gz | grep Nonsynonymous | grep CFTR | cut -f 1-8 | head -1
% ${GC}/epacts/bin/epacts make-group --vcf snps/chr7.filtered.PASS.beagled.anno.vcf.gz --out snps/chr7.filtered.PASS.beagled.anno.grp –nonsyn
% ${GC}/epacts/bin/epacts group --ped ${GC}/examples/index/chr7.CFTR.ped --vcf snps/chr7.filtered.PASS.beagled.anno.vcf.gz --out assoc/group.skat.o --groupf snps/chr7.filtered.PASS.beagled.anno.grp --test skat --skat-o --run 2
% cat assoc/group.skat.o.epacts
% ${GC}/epacts/bin/epacts group --ped ${GC}/examples/index/chr7.CFTR.ped --vcf snps/
STEP 5 : All-in-one running of all scripts (for later use)
% tar xzvf /home/presenter02/day2/session1/uwcmg_2013_08.tar.gz�
% cd /home/presenter02/day2/session1/uwcmg_2013_08�
% sh go.sh �