Tutorial: GotCloud UW CMG

From Genome Analysis Wiki
Jump to navigationJump to search

GotCloud / EPACTS Tutorial

In this tutorial, we illustrate some of the essential steps in the analysis of next generation sequence data.

For a background on GotCloud and Sequence Analysis Pipelines, see GotCloud

While GotCloud can run on a cluster of machines or instances, this tutorial is just a small test that just runs on the machine the commands are run on.

This tutorial is a specialized version for UW Center for Mendelian Genomics. For other tutorial, please see Tutorial: GotCloud.

STEP 0 : Login to the cluster

In UW CMG cluster, you need to login to one of the high-performance cluster node via the following way.

% ssh yourid@uwcmg-head.gs.washington.edu
Password:
% qlogin
Your job 120 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled …
Your interactive job 120 has been successfully scheduled.
Establishing /usr/local/bin/qlogin_command session to host uwcmg001.gs.washington.edu ...
yourid@uwcmg001.gs.washington.edu's password: 

Step 1 : GotCloud Alignment Pipeline

In order to run this tutorial, you need to make sure you have GotCloud installed on your system.

Try to run the following commands sequentially to follow the instruction.

First, let's set an environment variable for your convenience.

% export GC=/home/presenter02/day2/session1/uwcmg_2013_08/

The raw sequence reads can be found by

% ls ${GC}/examples/fastq
SRR035022_1.fastq.gz  SRR035023_1.fastq.gz  SRR035024_1.fastq.gz  SRR035025_1.fastq.gz  SRR035026_1.fastq.gz  SRR035027_1.fastq.gz  SRR035669_1.fastq.gz  SRR622461_1.fastq.gz
SRR035022_2.fastq.gz  SRR035023_2.fastq.gz  SRR035024_2.fastq.gz  SRR035025_2.fastq.gz  SRR035026_2.fastq.gz  SRR035027_2.fastq.gz  SRR035669_2.fastq.gz  SRR622461_2.fastq.gz
SRR035022.fastq.gz    SRR035023.fastq.gz    SRR035024.fastq.gz    SRR035025.fastq.gz    SRR035026.fastq.gz    SRR035027.fastq.gz    SRR035669.fastq.gz    SRR622461.fastq.gz

This set of FASTQ are extracted from two samples from the 1000 Genomes project, focusing around 500kb region are CFTR.

The example directory contain other set of files, such as reference sequence, index, and BAM files (for variant calling).

% ls ${GC}/examples
bams  chr7Ref  fastq  index

In order to run alignment pipeline, we need an index file for FASTQ files.

% cat ${GC}/examples/index/chr7.CFTR.fastq.index
MERGE_NAME	FASTQ1	FASTQ2	RGID	SAMPLE	LIBRARY	CENTER	PLATFORM
NA06984	fastq/SRR035022.fastq.gz	.	SRR035022	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035022_1.fastq.gz	fastq/SRR035022_2.fastq.gz	SRR035022	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035023.fastq.gz	.	SRR035023	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035023_1.fastq.gz	fastq/SRR035023_2.fastq.gz	SRR035023	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035024.fastq.gz	.	SRR035024	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035024_1.fastq.gz	fastq/SRR035024_2.fastq.gz	SRR035024	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035025.fastq.gz	.	SRR035025	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035025_1.fastq.gz	fastq/SRR035025_2.fastq.gz	SRR035025	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035026.fastq.gz	.	SRR035026	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035026_1.fastq.gz	fastq/SRR035026_2.fastq.gz	SRR035026	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035027.fastq.gz	.	SRR035027	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035027_1.fastq.gz	fastq/SRR035027_2.fastq.gz	SRR035027	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035669.fastq.gz	.	SRR035669	NA06984	Solexa-16556	BI	ILLUMINA
NA06984	fastq/SRR035669_1.fastq.gz	fastq/SRR035669_2.fastq.gz	SRR035669	NA06984	Solexa-16556	BI	ILLUMINA
NA12878	fastq/SRR622461.fastq.gz	.	SRR622461	NA12878	Illumina_NA12878	ILLUMINA	ILLUMINA
NA12878	fastq/SRR622461_1.fastq.gz	fastq/SRR622461_2.fastq.gz	SRR622461	NA12878	Illumina_NA12878	ILLUMINA	ILLUMINA

Also, we need a configuration file for GotCloud (default configuration usually works if you install GotCloud on your own, but here we're using reduced examples).

% cat ${GC}/examples/index/chr7.CFTR.align.conf
INDEX_FILE = index/chr7.CFTR.fastq.index
###################
# References
REF_DIR = chr7Ref
AS = NCBI37
REF = $(REF_DIR)/hs37d5.chr7.fa
DBSNP_VCF =  $(REF_DIR)/dbsnp_135.b37.chr7.CFTR.vcf.gz
HM3_VCF = $(REF_DIR)/hapmap_3.3.b37.sites.chr7.CFTR.vcf.gz

Create a test directory and enter it for running alignment pipeline

% mkdir test
% cd test

Then, run the alignment pipeline through the following command

% ${GC}/gotcloud/gotcloud align --conf ${GC}/examples/index/chr7.CFTR.align.conf --outDir align --baseprefix ${GC}/examples
Retrieved 168 bytes.
Random chance is 54  (compared to 30)
File sizes of 32 FASTQ input files referenced in '/home/presenter02/day2/session1/uwcmg_2013_08/examples/index/chr7.CFTR.fastq.index' = 6849341.4
Size of BAMs from aligner will be about 0.01 GB
Intermediate files from snpcaller will be about 0.01 GB
Final VCF output from snpcaller will be about 0.00 GB
Be sure you have enough space to hold all this data
Created /home/presenter02/test/align/Makefiles/align_NA06984.Makefile
Created /home/presenter02/test/align/Makefiles/align_NA12878.Makefile
---------------------------------------------------------------------
Waiting while samples are processed...
Processing finished in 42 secs with no errors reported

Let's see where the BAM files are generated

% ls align/bams
NA06984.recal.bam      NA06984.recal.bam.bai.done  NA06984.recal.bam.metrics  NA12878.recal.bam      NA12878.recal.bam.bai.done  NA12878.recal.bam.metrics
NA06984.recal.bam.bai  NA06984.recal.bam.done      NA06984.recal.bam.qemp     NA12878.recal.bam.bai  NA12878.recal.bam.done      NA12878.recal.bam.qemp

Examine the QC metrics by

% ls align/QCFiles
NA06984.genoCheck.depthRG  NA06984.genoCheck.selfRG  NA06984.qplot.R            NA12878.genoCheck.depthSM  NA12878.genoCheck.selfSM  NA12878.qplot.stats
NA06984.genoCheck.depthSM  NA06984.genoCheck.selfSM  NA06984.qplot.stats        NA12878.genoCheck.done     NA12878.qplot.done
NA06984.genoCheck.done     NA06984.qplot.done        NA12878.genoCheck.depthRG  NA12878.genoCheck.selfRG   NA12878.qplot.R

Text-version of QC metrics looks as follows

% cat align/QCFiles/NA06984.qplot.stats
Stats\BAM	/home/presenter02/test2/align/bams/NA06984.recal.bam
TotalReads(e6)	0.06
MappingRate(%)	97.44
MapRate_MQpass(%)	97.44
TargetMapping(%)	0.00
ZeroMapQual(%)	0.56
MapQual<10(%)	0.70
PairedReads(%)	98.27
ProperPaired(%)	92.83
MappedBases(e9)	0.00
Q20Bases(e9)	0.00
Q20BasesPct(%)	95.27
MeanDepth	7.33
GenomeCover(%)	0.33
EPS_MSE	293.97
EPS_Cycle_Mean	25.06
GCBiasMSE	0.00
ISize_mode	335
ISize_medium	331
DupRate(%)	2.05
QCFailRate(%)	0.00
BaseComp_A(%)	31.8
BaseComp_C(%)	18.4
BaseComp_G(%)	18.1
BaseComp_T(%)	31.6
BaseComp_O(%)	0.0

You can also run the R command to create PDF files (see lecture slides for examples).

To check whether the samples are contaminated, look at the verifyBamID results.

% cat align/QCFiles/NA06984.genoCheck.selfSM
#SEQ_ID	RG	CHIP_ID	#SNPS	#READS	AVG_DP	FREEMIX	FREELK1	FREELK0	FREE_RH	FREE_RA	CHIPMIX	CHIPLK1	CHIPLK0	CHIP_RH	CHIP_RA	DPREF	RDPHET	RDPALT
NA06984	ALL	NA	175	1038	5.93	0.00841	351.80	351.85	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

STEP 3 : Run GotCloud SNP Calling Pipeline

The next step is to analyze BAM files by calling SNPs and generating a VCF file containing the variant calls.

The variant calling pipeline has multiple built-in steps to generate BAMs:

  1. Filter out reads with low mapping quality
  2. Per Base Alignment Quality Adjustment (BAQ)
  3. Resolve overlapping paired end reads
  4. Generate genotype likelihood files
  5. Perform variant calling
  6. Extract features from variant sites
  7. Perform variant filtering


To speed variant calling, each chromosome is broken up into smaller regions which are processed separately. While initially split by sample, the per sample data gets merged and is processed together for each region. These regions are later merged to result in a single Variant Call File (VCF) per chromosome.

% ls ${GC}/examples/bams
NA06984.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam      NA12272.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam.bai
NA06984.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam.bai  NA12273.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.CFTR.bam
NA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam      NA12273.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.CFTR.bam.bai
NA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam.bai  NA12275.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam
NA06986.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam      NA12275.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam.bai
NA06986.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam.bai  NA12282.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam
NA06989.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam      NA12282.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam.bai
NA06989.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam.bai  NA12283.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam
NA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam      NA12283.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam.bai
NA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam.bai  NA12286.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam
NA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam      NA12286.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam.bai
NA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam.bai  NA12287.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam
NA07037.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.CFTR.bam      NA12287.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam.bai
...

To run SNP calling pipeline, we need to prepare index file corresponding to the BAM files with sample IDs

% cat ${GC}/examples/index/chr7.CFTR.low_coverage.index
NA06984	EUR	bams/NA06984.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam
NA06985	EUR	bams/NA06985.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam
NA06986	EUR	bams/NA06986.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam
NA06989	EUR	bams/NA06989.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam
NA06994	EUR	bams/NA06994.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam
NA07000	EUR	bams/NA07000.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam
NA07037	EUR	bams/NA07037.mapped.ILLUMINA.bwa.CEU.low_coverage.20130502.CFTR.bam
NA07048	EUR	bams/NA07048.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam
NA07051	EUR	bams/NA07051.mapped.ILLUMINA.bwa.CEU.low_coverage.20120522.CFTR.bam
NA07056	EUR	bams/NA07056.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam
...

Also, we need a configuration file.

% cat ${GC}/examples/index/chr7.CFTR.low_coverage.conf
CHRS = 7
BAM_INDEX = index/chr7.CFTR.low_coverage.index 
############
# References
REF_ROOT = chr7Ref
#
REF = $(REF_ROOT)/hs37d5.chr7.fa
INDEL_PREFIX = $(REF_ROOT)/1kg.pilot_release.merged.indels.sites.hg19
DBSNP_VCF =  $(REF_ROOT)/dbsnp_135.b37.chr7.CFTR.vcf.gz
HM3_VCF =  $(REF_ROOT)/hapmap_3.3.b37.sites.chr7.CFTR.vcf.gz
OMNI_VCF = $(REF_ROOT)/1000G_omni2.5.b37.sites.PASS.chr7.CFTR.vcf.gz

Then we're ready to run SNP calling pipeline

% ${GC}/gotcloud/gotcloud snpcall --conf ${GC}/examples/index/chr7.CFTR.low_coverage.conf --outDir snps --baseprefix ${GC}/examples --region 7:117000000-117500000 --numjobs 2
CHRS = 7
BAM_INDEX = index/chr7.CFTR.low_coverage.index 
############
# References
REF_ROOT = chr7Ref
#
REF = $(REF_ROOT)/hs37d5.chr7.fa
INDEL_PREFIX = $(REF_ROOT)/1kg.pilot_release.merged.indels.sites.hg19
DBSNP_VCF =  $(REF_ROOT)/dbsnp_135.b37.chr7.CFTR.vcf.gz
HM3_VCF =  $(REF_ROOT)/hapmap_3.3.b37.sites.chr7.CFTR.vcf.gz
OMNI_VCF = $(REF_ROOT)/1000G_omni2.5.b37.sites.PASS.chr7.CFTR.vcf.gz
[presenter02@uwcmg002 test]$ time ${GC}/gotcloud/gotcloud snpcall --conf ${GC}/examples/index/chr7.CFTR.low_coverage.conf --outDir snps --baseprefix ${GC}/examples --region 7:117000000-117500000 --numjobs 2
Key configurations:
GOTCLOUD_ROOT: /home/presenter02/day2/session1/uwcmg_2013_08/gotcloud
OUT_DIR:       snps
BAM_INDEX:     index/chr7.CFTR.low_coverage.index 
REF:           chr7Ref/hs37d5.chr7.fa
CHRS:          7
BATCH_TYPE:    local
BATCH_OPTS:    

Processing the following steps...
2: RUN_PILEUP
3: RUN_GLFMULTIPLES
4: RUN_VCFPILEUP
5: RUN_FILTER
6: RUN_SVM
8: RUN_SPLIT
Call region is 7:117000000-117500000
Generating commands for chr7...
Creating glf INDEX at 7:115000001-120000000..
--------------------------------------------------------------------
Finished creating makefile /home/presenter02/test/snps/umake.Makefile

Running /home/presenter02/test/snps/umake.Makefile

Start merging VCFs...
Merging /home/presenter02/test/snps/vcfs/chr7/115000001.120000000/chr7.115000001.120000000.vcf into STDOUT...
Start merging VCFs...
Merging /home/presenter02/test/snps/vcfs/chr7/115000001.120000000/chr7.115000001.120000000.stats.vcf into STDOUT...

The following parameters are available.  Ones with "[]" are in effect:

Available Options
                         Recipes : --write-bed, --write-vcf [ON], --upgrade,
                                   --summarize, --filter [ON], --subset
               VCF Input options : --in-vcf [/home/presenter02/test/snps/vcfs/chr7/chr7.merged.stats.vcf]
               BED Input options : --in-bfile [], --in-bed [], --in-bim [],
                                   --in-fam [],
                                   --ref [/data/local/ref/karma.ref/human.g1k.v37.fa]
              Subsetting options : --in-subset [], --mono-subset,
                                   --filt-only-subset
                  Output Options : --out [/home/presenter02/test/snps/vcfs/chr7/chr7.hardfiltered.sites.vcf],
                                   --qGeno, --print-every [10000]
      Output compression Options : --plain [ON], --bgzf, --gzip
   Genotype-level Filter Options : --minGQ, --minGD
                  Filter Options : --winIndel [5],
                                   --indelVCF [/home/presenter02/day2/session1/uwcmg_2013_08/examples/chr7Ref/1kg.pilot_release.merged.indels.sites.hg19.chr7.vcf],
                                   --minQUAL [5], --minMQ [20],
                                   --maxDP [99000], --minDP [99],
                                   --maxABL [70], --winFFRQ, --maxFFRQ,
                                   --winFVAR, --merFVAR, --maxFVAR,
                                   --minNS [1], --maxSTP [2147483647],
                                   --maxTTT [2147483647],
                                   --minTTT [-2147483648], --maxSTR [20],
                                   --minSTR [-20], --maxSTZ [5], --minSTZ [-5],
                                   --maxCBR [20], --minCBR [-100],
                                   --maxQBR [100], --minQBR [-100],
                                   --maxCBZ [2147483647], --maxCSR [100],
                                   --minCSR [-100], --maxLQZ [2147483647],
                                   --minLQZ [-2147483648],
                                   --maxRBZ [2147483647],
                                   --minRBZ [-2147483648],
                                   --maxIOZ [2147483647],
                                   --minIOR [-2147483648],
                                   --maxIOR [2147483647],
                                   --maxAOZ [2147483647], --maxAOI [5],
                                   --maxMQ0 [10], --maxMQ10 [100],
                                   --maxMQ20 [100], --minFIC [-20],
                                   --minABE [-100], --maxABE [100],
                                   --maxLQR [30], --minMBR [-100],
                                   --maxMBR [100], --minABZ [-2147483648],
                                   --maxABZ [2147483647],
                                   --maxBCS [2147483647], --keepFilter

Analysis started on Tue Aug  6 12:25:04 2013

The following filters are in effect:
m20 : MQ >= 20.00
dp99 : DP >= 99.00
DP99000 : DP <= 99000.00
ns1 : NS >= 1.00
AB70 : AB <= 0.70
STR20 : STR <= 0.20
str-20 : STR >= -0.20
stz-5 : STZ >= -5.00
STZ5 : STZ <= 5.00
CBR20 : CBR <= 0.20
AOI5 : AOI <= 5.00
MQ0_10 : MQ0 <= 0.10
fic-20 : FIC >= -0.20
LQR30 : LQR <= 0.30
q5 : QUAL >= 5
INDEL5 : INDEL >= 5 bp with /home/presenter02/day2/session1/uwcmg_2013_08/examples/chr7Ref/1kg.pilot_release.merged.indels.sites.hg19.chr7.vcf

Writing to VCF file /home/presenter02/test/snps/vcfs/chr7/chr7.hardfiltered.sites.vcf
Analysis finished on Tue Aug  6 12:25:05 2013

loading /home/presenter02/day2/session1/uwcmg_2013_08/examples/chr7Ref/dbsnp_135.b37.chr7.CFTR.vcf.gz as a VCF input..
finished loading dbSNP for 7415 variants
TOTAL	1538	1538	662	378	166	25	145	129	27	6

The following parameters are available.  Ones with "[]" are in effect:

Available Options
                Input options : --in [/home/presenter02/test/snps/vcfs/chr7/chr7.filtered.sites.vcf.raw],
                                --ignore
               Output Options : --out [/home/presenter02/test/snps/vcfs/chr7/chr7.filtered.sites.vcf.norm],
                                --digits [3], --rank, --verbose [ON]
   Output compression Options : --plain [ON], --bgzf, --gzip

Analysis started on Tue Aug  6 12:25:05 2013

Reading Input File /home/presenter02/test/snps/vcfs/chr7/chr7.filtered.sites.vcf.raw
Finished Reading 1538 lines
Starting sorting values
Writing output values
Analysis finished on Tue Aug  6 12:25:05 2013

loading /home/presenter02/day2/session1/uwcmg_2013_08/examples/chr7Ref/dbsnp_135.b37.chr7.CFTR.vcf.gz as a VCF input..
finished loading dbSNP for 7415 variants
TOTAL	1538	1538	662	378	166	25	145	129	27	6
 Commands finished in 306 secs with no errors reported
--------------------------------------------------------------------

We can summarize the quality of variants.

% cat ${GC}/out/snps/vcfs/chr7/chr7.filtered.sites.vcf.summary
--------------------------------------------------------------------------------------------------------------
              FILTER	#SNPs	#dbSNP	%dbSNP	%CpG	%CpG	%Known	%Novel	%nCpG-K	%nCpG-N	%HM3	%HM3
                    				Known	Novel	Ts/Tv	Ts/Tv	Ts/Tv	Ts/Tv	sens	/SNP
--------------------------------------------------------------------------------------------------------------
              INDEL5	11	8	72.7	25.0	0.0	3.00	0.50	2.00	0.50	0.000	0.000
          INDEL5;SVM	8	4	50.0	50.0	25.0	0.33	0.33	1.00	0.00	0.000	0.000
                PASS	1388	1186	85.4	15.5	11.4	2.10	1.62	1.79	1.42	85.083	11.095
                 SVM	131	33	25.2	9.1	9.2	1.06	0.85	0.88	0.78	0.000	0.000
--------------------------------------------------------------------------------------------------------------
              FILTER	#SNPs	#dbSNP	%dbSNP	%CpG	%CpG	%Known	%Novel	%nCpG-K	%nCpG-N	%HM3	%HM3
                    				Known	Novel	Ts/Tv	Ts/Tv	Ts/Tv	Ts/Tv	sens	/SNP 
--------------------------------------------------------------------------------------------------------------
              INDEL5	19	12	63.2	33.3	14.3	1.40	0.40	1.67	0.20	0.000	0.000
                PASS	1388	1186	85.4	15.5	11.4	2.10	1.62	1.79	1.42	85.083	11.095
                 SVM	139	37	26.6	13.5	9.8	0.95	0.82	0.88	0.74	0.000	0.000 
--------------------------------------------------------------------------------------------------------------
                PASS	1388	1186	85.4	15.5	11.4	2.10	1.62	1.79	1.42	85.083	11.095
                FAIL	150	45	30.0	15.6	9.5	1.14	0.81	1.00	0.73	0.000	0.000
               TOTAL	1538	1231	80.0	15.5	10.7	2.05	1.27	1.75	1.12	85.083	10.013
------------------------------------------------------------------------------------------------------------

To run LD-refinement you can run beagle only (gotcloud beagle) or beagle+thunder (gotcloud ldrefine) to improve the quality of genotypes.

If you're interested in looking at individual reads from the variant, samtools tview provides a good alternative to IGV. In order to see CFTR R75Q variant in one of the variant carriers, you can run

% samtools tview ${GC}/examples/bams/NA12843.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.CFTR.bam ${GC}/examples/chr7Ref/hs37d5.chr7.fa

And follow the steps

  • Type ‘g’
  • Type 7:117149147
  • Type ‘n’ to color by nucleotide
  • Move with left arrow to see the variant site clearly
% ${GC}/gotcloud/gotcloud beagle --conf ${GC}/examples/index/chr7.CFTR.low_coverage.conf --outDir snps --baseprefix ${GC}/examples --region 7:117000000-117500000 --numjobs 2
Key configurations:
GOTCLOUD_ROOT: /home/presenter02/day2/session1/uwcmg_2013_08/gotcloud
OUT_DIR:       snps
BAM_INDEX:     index/chr7.CFTR.low_coverage.index 
REF:           chr7Ref/hs37d5.chr7.fa
CHRS:          7
BATCH_TYPE:    local
BATCH_OPTS:    

Processing the following steps...
9: RUN_BEAGLE
10: RUN_SUBSET
Call region is 7:117000000-117500000
Generating commands for chr7...
--------------------------------------------------------------------
Finished creating makefile /home/presenter02/test/snps/umake.Makefile

Running /home/presenter02/test/snps/umake.Makefile

 Commands finished in 74 secs with no errors reported
--------------------------------------------------------------------

STEP 4 : EPACTS association analysis

In addition to variant calls, we need phenotype information to perform association mapping. Here, the phenotypes are created so that case is enriched in the individuals carrying rare nonsynonymous variants at CFTR.

% less ${GC}/examples/index/chr7.CFTR.ped 
#FAM_ID	IND_ID	DAD_ID	MOM_ID	SEX	PHENO	QT
NA06984	NA06984	0	0	0	1	-0.657149296617419
NA06985	NA06985	0	0	0	2	-0.449034856685281
NA06986	NA06986	0	0	0	1	0.0975849179986626
NA06989	NA06989	0	0	0	2	1.77095069670763
NA06994	NA06994	0	0	0	1	0.287475900193007
NA07000	NA07000	0	0	0	1	-1.36632872407691
NA07037	NA07037	0	0	0	2	-0.1278913321612
NA07048	NA07048	0	0	0	2	-1.45798868745693
NA07051	NA07051	0	0	0	1	-0.599618650565132
...

Create a directory to store association results

% mkdir assoc

Run single variant logistic score test.

% ${GC}/epacts/bin/epacts single  --ped ${GC}/examples/index/chr7.CFTR.ped   --vcf ${GC}/out/snps/beagle/chr7/chr7.filtered.PASS.beagled.vcf.gz  --pheno PHENO --out assoc/single.b.score --test b.score --anno   --ref ${GC}/examples/chr7Ref/hs37d5.chr7.fa   --region 7:117000000-117500000 --run 1
% time ${GC}/epacts/bin/epacts anno --in ${GC}/out/snps/beagle/chr7/chr7.filtered.PASS.beagled.vcf.gz --out snps/chr7.filtered.PASS.beagled.anno.vcf.gz --ref ${GC}examples/chr7Ref/hs37d5.chr7.fa
% zcat snps/chr7.filtered.PASS.beagled.anno.vcf.gz | grep Nonsynonymous | grep CFTR | cut -f 1-8 | head -1
% ${GC}/epacts/bin/epacts make-group --vcf snps/chr7.filtered.PASS.beagled.anno.vcf.gz --out snps/chr7.filtered.PASS.beagled.anno.grp –nonsyn
% ${GC}/epacts/bin/epacts group --ped ${GC}/examples/index/chr7.CFTR.ped --vcf snps/chr7.filtered.PASS.beagled.anno.vcf.gz --out assoc/group.skat.o --groupf snps/chr7.filtered.PASS.beagled.anno.grp --test skat --skat-o --run 2
% cat assoc/group.skat.o.epacts 
% ${GC}/epacts/bin/epacts group --ped ${GC}/examples/index/chr7.CFTR.ped --vcf snps/

STEP 5 : All-in-one running of all scripts (for later use)

% tar xzvf /home/presenter02/day2/session1/uwcmg_2013_08.tar.gz�
% cd /home/presenter02/day2/session1/uwcmg_2013_08�
% sh go.sh  �