Tutorial: GotCloud
GotCloud Tutorial
In this tutorial, we illustrate some of the essential steps in the analysis of next generation sequence data.
We will start with a set of sequence reads and associated base quality scores stored in fastq file.
The mapping pipeline will find the most likely genomic location for each read producing a BAM file.
The variant calling pipeline generates an initial list of polymorphic sites and genotypes stored in a VCF file and then uses haplotype information to refine these genotypes in an updated VCF file.
Example Dataset
Our dataset consists of 60 individuals from GBR sequenced by the 1000 Genomes Project. These individuals have been sequenced to an average depth of about 4x.
To conserve time and disk-space, our analysis will focus on a small region on chromosome 20, 42900000 - 43200000. We will first map the reads for a single individual (labeled TBD). We will then combine the results with mapped reads from the other 59 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.
The example dataset we'll be using is included in this tar-ball TBD.
Required Software
In order to run this tutorial, you need to make sure you have GotCloud is installed on your system.
Installation instructions here.
Mapping Reads
The first step in processing next generation sequence data is mapping the reads to the reference genome, generating per sample BAM files.
The mapping pipeline has multiple built-in steps to generate BAMs:
- Align the fastqs to the reference genome
- handles both single & paired end
- Merge the results from multiple fastqs into 1 file per sample
- Mark Duplicate Reads are marked
- Recalibrate Base Qualities
This processing results in 1 BAM file per sample.
The mapping pipeline also includes Quality Control (QC) steps:
- Visualization of various quality measures (QPLOT)
- Screen for sample contamination & swap (VerifyBamID)
Run the mapping pipeline:
gen_biopipeline.pl --conf GBR60map.conf --out_dir mappingResults
TBD - add link explaining the contents of the .conf & .index files.
Upon successful completion of the mapping pipeline, you will see the following message:
Commands finished in nn secs with no errors reported
The final BAM files produced by the mapping pipeline can be found in the files:
ls mappingResults/alignment.recal/*.recal.bam
Index files (.bai) for these BAMs are also in that directory.
The QC files for verifyBamID are:
ls mappingResults/QCFiles/*.genoCheck.selfRG mappingResults/QCFiles/*.genoCheck.selfSM
Understanding VerifyBamID output
The QC files for qplot are:
ls mappingResults/QCFiles/*.qplot.R mappingResults/QCFiles/*.qplot.stats
Index file
There are four fastq files in {ROOT_DIR}/test/align/fastq/Sample_1 and four fastq files in {ROOT_DIR}/test/align/fastq/Sample_2, both in paired-end format. Normally, we would need to build an index file for these files. Conveniently, an index file (indexFile.txt) already exists for the automatic test samples. It can be found in {ROOT_DIR}/test/align/, and contains the following information in tab-delimited format:
MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY CENTER PLATFORM Sample1 fastq/Sample_1/File1_R1.fastq.gz fastq/Sample_1/File1_R2.fastq.gz RGID1 SampleID1 Lib1 UM ILLUMINA Sample1 fastq/Sample_1/File2_R1.fastq.gz fastq/Sample_1/File2_R2.fastq.gz RGID1a SampleID1 Lib1 UM ILLUMINA Sample2 fastq/Sample_2/File1_R1.fastq.gz fastq/Sample_2/File1_R2.fastq.gz RGID2 SampleID2 Lib2 UM ILLUMINA Sample2 fastq/Sample_2/File2_R1.fastq.gz fastq/Sample_2/File2_R2.fastq.gz RGID2 SampleID2 Lib2 UM ILLUMINA
If you are in the {ROOT_DIR}/test/align directory, you can use this file as-is. If you prefer, you can create a new index file and change the MERGE_NAME, RGID, SAMPLE, LIBRARY, CENTER, or PLATFORM values. It is recommended that you do not modify existing files in {ROOT_DIR}/test/align.
If you want to run this example from a different directory, make sure the FASTQ1 and FASTQ2 paths are correct. That is, each of the FASTQ1 and FASTQ2 entry in the index file should look like the following:
{ROOT_DIR}/test/align/fastq/Sample_1/File1_R1.fastq.gz
Alternately, if you want to run this example from a different directory, but do not want to edit the index file, you can create a relative path to the test fastq files so their path agrees with that listed in the index file:
ln -s {ROOT_DIR}/test/align/fastq fastq
This will create a symbolic link to the test fastq directory from your current directory.
(More information about: the index file.)
Configuration file
Similar to the index file, a configuration file (test.conf) already exists for the automatic test samples. It contains the following information:
INDEX_FILE = indexFile.txt ############ # References REF_DIR = $(PIPELINE_DIR)/test/align/chr20Ref AS = NCBI37 FA_REF = $(REF_DIR)/human_g1k_v37_chr20.fa DBSNP_VCF = $(REF_DIR)/dbsnp.b130.ncbi37.chr20.vcf.gz PLINK = $(REF_DIR)/hapmap_3.3.b37.chr20
If you are in the {ROOT_DIR}/test/align directory, you can use this file as-is. If you are using a different index file, make sure your index file is named correctly in the first line. If you are not running this from {ROOT_DIR}/test/align, make sure your configuration and index files are in the same directory.
(More information about: reference files, optional configurable settings, or command-line options.)
Running the alignment pipeline
You are now ready to run the alignment pipeline.
To run the alignment pipeline, enter the following command:
{ROOT_DIR}/bin/gen_biopipeline.pl -conf test.conf -out_dir {OUT_DIR}
where {OUT_DIR} is the directory in which you wish to store the resulting BAM files (for example, ~/out).
If everything went well, you will see the following messages:
Created {OUT_DIR}/Makefiles/biopipe_Sample2.Makefile Created {OUT_DIR}/Makefiles/biopipe_Sample1.Makefile --------------------------------------------------------------------- Submitted 2 commands Waiting for commands to complete... . . Commands finished in 33 secs with no errors reported
The aligned BAM files are found in {OUT_DIR}/alignment.recal/
Analyzing a Sample
Using UMAKE, you can analyze BAM files by calling SNPs, and generate a VCF file containing the results. Once again, we can analyze BAM files used in the automatic test. For this example, we have 60 BAM files, which can be found in {ROOT_DIR}/test/umake/bams. These contain sequence information for a targeted region in chromosome 20.
In addition to the BAM files, you will need three files to run UMAKE: an index file, a configuration file, and a bed file (needed to analyze BAM files from targeted/exome sequencing).
Index file
First, you need a list of all the BAM files to be analyzed. Conveniently, the a test index file (umake_test.index) already exists in {ROOT_DIR}/test/umake/. It contains the following information:
NA12272 ALL bams/NA12272.mapped.ILLUMINA.bwa.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam NA12004 ALL bams/NA12004.mapped.ILLUMINA.bwa.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam ... NA12874 ALL bams/NA12874.mapped.LS454.ssaha2.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam
You can use this file directly if you change your current directory to {ROOT_DIR}/test/umake/.
Alternately, if you want to copy and use this index file to a different directory, you can create a symbolic link to the bams folder as follows:
ln -s {ROOT_DIR}/test/umake/bams bams
(More information about: the index file.)
BED file
This file contains a single line:
chr20 20000050 20300000
You can copy this to the current directory and use it as-is.
(More information about: targeted/exome sequencing settings.)
Configuration file
A configuration file (umake_test.conf) already exists in {ROOT_DIR}/test/umake/. It contains the following information:
CHRS = 20 TEST_ROOT = $(UMAKE_ROOT)/test/umake BAM_INDEX = $(TEST_ROOT)/umake_test.index OUT_PREFIX = umake_test REF_ROOT = $(TEST_ROOT)/ref # REF = $(REF_ROOT)/karma.ref/human.g1k.v37.chr20.fa INDEL_PREFIX = $(REF_ROOT)/indels/1kg.pilot_release.merged.indels.sites.hg19 DBSNP_PREFIX = $(REF_ROOT)/dbSNP/dbsnp_135_b37.rod HM3_PREFIX = $(REF_ROOT)/HapMap3/hapmap3_r3_b37_fwd.consensus.qc.poly # RUN_INDEX = TRUE # create BAM index file RUN_PILEUP = TRUE # create GLF file from BAM RUN_GLFMULTIPLES = TRUE # create unfiltered SNP calls RUN_VCFPILEUP = TRUE # create PVCF files using vcfPileup and run infoCollector RUN_FILTER = TRUE # filter SNPs using vcfCooker RUN_SPLIT = TRUE # split SNPs into chunks for genotype refinement RUN_BEAGLE = FALSE # BEAGLE - MUST SET AFTER FINISHING PREVIOUS STEPS RUN_SUBSET = FALSE # SUBSET FOR THUNDER - MAY BE SET WITH BEAGLE STEP TOGETHER RUN_THUNDER = FALSE # THUNDER - MUST SET AFTER FINISHING PREVIOUS STEPS ############################################################################### WRITE_TARGET_LOCI = TRUE # FOR TARGETED SEQUENCING ONLY -- Write loci file when performing pileup UNIFORM_TARGET_BED = $(TEST_ROOT)/umake_test.bed # Targeted sequencing : When all individuals has the same target. Otherwise, comment it out OFFSET_OFF_TARGET = 50 # Extend target by given # of bases MULTIPLE_TARGET_MAP = # Target per individual : Each line contains [SM_ID] [TARGET_BED] TARGET_DIR = target # Directory to store target information SAMTOOLS_VIEW_TARGET_ONLY = TRUE # When performing samtools view, exclude off-target regions (may make command line too long)
If you are running this from a different directory, you will want to change some of the lines as follows:
BAM_INDEX = {CURRENT_DIR}/umake_test.index UNIFORM_TARGET_BED = {CURRENT_DIR}/umake_test.bed
where {CURRENT_DIR} is the absolute path to the directory that contains the index and bed files.
An additional option can be added in the configuration file:
OUT_DIR = {OUT_DIR}
where {OUT_DIR} is the name of directory in which you want the output to be stored. If you do not specify this in the configuration file, you will need to add an extra parameter when you run UMAKE in the next step.
(More information about: the configuration file, reference files.)
Running UMAKE
If you added an OUT_DIR line to the configuration file, you can run UMAKE with the following command:
{ROOT_DIR}/bin/umake.pl --conf umake_test.conf --snpcall --numjobs 2
If you have not added an OUT_DIR line to the configuration file, you can specify the output directory directly with the following command:
{ROOT_DIR}/bin/umake.pl --conf umake_test.conf --outdir {OUT_DIR} --snpcall --numjobs 2
where {OUT_DIR} is the directory in which you want the output to be stored.
Either command will perform SNP calling on the test samples. If you find the resulting VCF files located in {OUT_DIR}/vcfs/chr20, then you have successfully called the SNPs from the test BAM files.