Tutorial: GotCloud

From Genome Analysis Wiki
Revision as of 01:44, 10 January 2013 by Amkwong (talk | contribs)
Jump to navigationJump to search

Installation

First, make sure GotCloud is installed on your system. Installation instructions here.


Running the Automatic Test

This verifies that GotCloud was installed correctly.

To run the test case automatically, change your current directory to GotCloud's root directory, and type in the following command:

bin/gen_biopipeline.pl --test OUTPUT_DIR

where OUTPUT_DIR is the directory where you want to store the results.

If you see "Test Passed", then you are ready to run a sample.


Aligning a Sample

As an example, we can align the sample fastq files used in the automatic test. They belong to two different samples, which we will call "Sample1" and "Sample2". They are found in {ROOT_DIR}/test/align/fastq. (We will call the directory in which GotCloud is installed "{ROOT_DIR}".)

To make this easier, change to the {ROOT_DIR}/test/align directory. It contains an index file and a configuration file that can be used directly.


Index file

There are four fastq files in {ROOT_DIR}/test/align/fastq/Sample_1 and four fastq files in {ROOT_DIR}/test/align/fastq/Sample_2, both in paired-end format. Normally, we would need to build an index file for these files. Conveniently, an index file (indexFile.txt) already exists for the automatic test samples. It can be found in {ROOT_DIR}/test/align/, and contains the following information in tab-delimited format:

MERGE_NAME FASTQ1                           FASTQ2                           RGID   SAMPLE    LIBRARY CENTER PLATFORM
Sample1    fastq/Sample_1/File1_R1.fastq.gz fastq/Sample_1/File1_R2.fastq.gz RGID1  SampleID1 Lib1    UM     ILLUMINA
Sample1    fastq/Sample_1/File2_R1.fastq.gz fastq/Sample_1/File2_R2.fastq.gz RGID1a SampleID1 Lib1    UM     ILLUMINA
Sample2    fastq/Sample_2/File1_R1.fastq.gz fastq/Sample_2/File1_R2.fastq.gz RGID2  SampleID2 Lib2    UM     ILLUMINA
Sample2    fastq/Sample_2/File2_R1.fastq.gz fastq/Sample_2/File2_R2.fastq.gz RGID2  SampleID2 Lib2    UM     ILLUMINA

If you are in the {ROOT_DIR}/test/align directory, you can use this file as-is. If you prefer, you can create a new index file and change the MERGE_NAME, RGID, SAMPLE, LIBRARY, CENTER, or PLATFORM values. It is recommended that you do not modify existing files in {ROOT_DIR}/test/align.

If you want to run this example from a different directory, make sure the FASTQ1 and FASTQ2 paths are correct. That is, each of the FASTQ1 and FASTQ2 entry in the index file should look like the following:

{ROOT_DIR}/test/align/fastq/Sample_1/File1_R1.fastq.gz 

Alternately, if you want to run this example from a different directory, but do not want to edit the index file, you can copy all the fastq files to a new directory with the relative path listed in the index file:

ln -s {ROOT_DIR}/test/align/fastq fastq

This will create a symbolic link to the test fastq directory from your current directory.

(More information about: the index file.)


Configuration file

Similar to the index file, a configuration file (test.conf) already exists for the automatic test samples. It contains the following information:

INDEX_FILE = indexFile.txt
############
# References
REF_DIR = $(PIPELINE_DIR)/test/align/chr20Ref
AS = NCBI37
FA_REF = $(REF_DIR)/human_g1k_v37_chr20.fa
DBSNP_VCF =  $(REF_DIR)/dbsnp.b130.ncbi37.chr20.vcf.gz
PLINK = $(REF_DIR)/hapmap_3.3.b37.chr20

If you are in the {ROOT_DIR}/test/align directory, you can use this file as-is. If you are using a different index file, make sure your index file is named correctly in the first line. If you are not running this from {ROOT_DIR}/test/align, make sure your configuration and index files are in the same directory.

(More information about: reference files, optional configurable settings, or command-line options.)


Running the alignment pipeline

You are now ready to run the alignment pipeline. This requires two steps: first, generating the Makefiles; and second, running those Makefiles.


Generating the Makefiles

Enter the following command:

{ROOT_DIR}/bin/gen_biopipeline.pl --conf test.conf --out_dir {OUT_DIR}

where {OUT_DIR} is the directory in which you wish to store the resulting BAM files (for example, ~/out).

If everything went well, you will see the following messages:

Finished creating makefile {OUT_DIR}/Makefiles/biopipe_Sample2.Makefile
Finished creating makefile {OUT_DIR}/Makefiles/biopipe_Sample1.Makefile
--------------------------------------------------------------------
Run the following commands:

make -f {OUT_DIR}/Makefiles/biopipe_Sample2.Makefile > {OUT_DIR}/Makefiles/biopipe_Sample2.Makefile.log
make -f {OUT_DIR}/Makefiles/biopipe_Sample1.Makefile > {OUT_DIR}/Makefiles/biopipe_Sample1.Makefile.log

where {OUT_DIR} will be replaced with the directory you entered above.


Running the Makefiles

To run a Makefile, simply enter one-by-one the commands generated in the previous step. If you wish to run the alignment in the background, add "&" after the make command, as follows:

make -f {OUT_DIR}/Makefiles/biopipe_Sample2.Makefile > {OUT_DIR}/Makefiles/biopipe_Sample2.Makefile.log &

The log files for the runs will be found in the Makefiles directory, while the BAM files will be found in the {OUT_DIR}/alignment.recal directory. If you see two BAM files, one for each sample, then you have successfully aligned the fastq files.


Analyzing a Sample

Using UMAKE, you can analyze BAM files by calling SNPs, and generate a VCF file containing the results. Once again, we can analyze BAM files used in the automatic test. For this example, we have 60 BAM files, found in {ROOT_DIR}/test/umake/bams. In addition to the BAM files, you will need three files to run UMAKE: an index file, a configuration file, and a bed file.


Index file

First, you need a list of all the BAM files to be analyzed. Conveniently, the a test index file (umake_test.index) already exists in {ROOT_DIR}/test/umake/. It contains the following information:

NA12272 ALL     bams/NA12272.mapped.ILLUMINA.bwa.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam
NA12004 ALL     bams/NA12004.mapped.ILLUMINA.bwa.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam
...
NA12874 ALL     bams/NA12874.mapped.LS454.ssaha2.CEU.low_coverage.20101123.chrom20.20000001.20300000.bam

You can use this file directly if you change your current directory to {ROOT_DIR}/test/umake/.

Alternately, if you want to copy and use this index file to a different directory, you can create a symbolic link to the bams folder as follows:

ln -s {ROOT_DIR}/test/umake/bams bams

(More information about: the index file.)


BED file

This file contains a single line:

chr20   20000050        20300000

You can copy this to the current directory and use it as-is.

(More information about: targeted/exome sequencing settings.)


Configuration file

A configuration file (umake_test.conf) already exists in {ROOT_DIR}/test/umake/. It contains the following information:

CHRS = 20
TEST_ROOT = $(UMAKE_ROOT)/test/umake
BAM_INDEX = $(TEST_ROOT)/umake_test.index
OUT_PREFIX = umake_test
REF_ROOT = $(TEST_ROOT)/ref
#
REF = $(REF_ROOT)/karma.ref/human.g1k.v37.chr20.fa
INDEL_PREFIX = $(REF_ROOT)/indels/1kg.pilot_release.merged.indels.sites.hg19
DBSNP_PREFIX =  $(REF_ROOT)/dbSNP/dbsnp_135_b37.rod
HM3_PREFIX =  $(REF_ROOT)/HapMap3/hapmap3_r3_b37_fwd.consensus.qc.poly
#
RUN_INDEX = TRUE        # create BAM index file
RUN_PILEUP = TRUE       # create GLF file from BAM
RUN_GLFMULTIPLES = TRUE # create unfiltered SNP calls
RUN_VCFPILEUP = TRUE    # create PVCF files using vcfPileup and run infoCollector
RUN_FILTER = TRUE       # filter SNPs using vcfCooker
RUN_SPLIT = TRUE        # split SNPs into chunks for genotype refinement
RUN_BEAGLE = FALSE  # BEAGLE - MUST SET AFTER FINISHING PREVIOUS STEPS
RUN_SUBSET = FALSE  # SUBSET FOR THUNDER - MAY BE SET WITH BEAGLE STEP TOGETHER
RUN_THUNDER = FALSE # THUNDER - MUST SET AFTER FINISHING PREVIOUS STEPS
###############################################################################
WRITE_TARGET_LOCI = TRUE  # FOR TARGETED SEQUENCING ONLY -- Write loci file when performing pileup
UNIFORM_TARGET_BED = $(TEST_ROOT)/umake_test.bed # Targeted sequencing : When all individuals has the same target. Otherwise, comment it out
OFFSET_OFF_TARGET = 50 # Extend target by given # of bases
MULTIPLE_TARGET_MAP =  # Target per individual : Each line contains [SM_ID] [TARGET_BED]
TARGET_DIR = target    # Directory to store target information
SAMTOOLS_VIEW_TARGET_ONLY = TRUE # When performing samtools view, exclude off-target regions (may make command line too long)

If you are running this from a different directory, you will want to change some of the lines as follows:

BAM_INDEX = {CURRENT_DIR}/umake_test.index
UNIFORM_TARGET_BED = {CURRENT_DIR}/umake_test.bed

where {CURRENT_DIR} is the absolute path to the directory that contains the index and bed files.

An additional option can be added in the configuration file:

OUT_DIR = {OUT_DIR}

where {OUT_DIR} is the name of directory in which you want the output to be stored. If you do not specify this in the configuration file, you will need to add an extra parameter when you run UMAKE in the next step.

(More information about: the configuration file, reference files.)


Running UMAKE

If you added an OUT_DIR line to the configuration file, you can run UMAKE with the following command:

{ROOT_DIR}/bin/umake.pl --conf umake_test.conf --snpcall --numjobs 2

If you have not added an OUT_DIR line to the configuration file, you can specify the output directory directly with the following command:

{ROOT_DIR}/bin/umake.pl --conf umake_test.conf --outdir {OUT_DIR} --snpcall --numjobs 2

where {OUT_DIR} is the directory in which you want the output to be stored.

Either command will perform SNP calling on the test samples. If you find the resulting VCF files located in {OUT_DIR}/vcfs/chr20, then you have successfully called the SNPs from the test BAM files.


Further Information

Mapping (Alignment) Pipeline

Variant Calling Pipeline (UMAKE)