Changes

Tutorial: GotCloud (view source)

Revision as of 19:42, 5 March 2013

337 bytes removed, 19:42, 5 March 2013

no edit summary

Line 74: Line 74:

* make (make on ubuntu)

* libssl (libssl0.9.8 on ubuntu)

−

* gcc 4.6 or newer

+

* gcc 4.4 or newer

=== Step 1c: Install Example Dataset ===

Line 112: Line 112:

Run the alignment pipeline (the example aligns 2 samples):

−

$GCHOME/gotcloud align --conf $GCDATA/[[Alignment Configuration File|GBR60align.conf]] --outdir $GCOUT

+

$GCHOME/gotcloud align --conf $GCDATA/[[#Alignment Configuration File|GBR60align.conf]] --outdir [[#Alignment Output Directory|$GCOUT]]

Upon successful completion of the alignment pipeline (about 1-2 minutes), you will see the following message:

Line 263: Line 263:

== Alignment Pipeline ==

−

=== List of Input Files needed ===

+

=== List of Input Files needed for Alignment ===

The command-line inputs to the tutorial alignment pipeline are:

# [[#Alignment Configuration File|Configuration File (--conf)]]

Line 269: Line 269:

# [[#Alignment Output Directory|Output Directory (--outdir)]]

#* Directory where the output should be placed.

+

Additional information required to run the alignment pipeline:

−

# Index file of FASTQs

+

# [[#Alignment FASTQ Index File|Index file of FASTQs]]

−

# Reference ~~files~~

+

# [[#Alignment Reference Files|Alignment Reference Files]]

For the tutorial, these values are specified in the configuration file.

Line 289: Line 290:

This configuration file sets:

−

* INDEX_FILE - file containing the fastqs to be processed as well as the read group information for these fastqs.

+

* [[#Alignment FASTQ Index File|INDEX_FILE]] - file containing the fastqs to be processed as well as the read group information for these fastqs.

+

* Reference Information: see [[#Alignment Reference Files|Alignment Reference Files]] for more information

+

** AS - assembly value to put in the BAM

+

The index file and chromosome 20 references used in this tutorial are included with the example data under the $GCDATA directory. The tutorial uses chromosome 20-only references to reduce processing time.

+

When running with your own data, you will need to update the:

+

* Index File to contain the information for your own fastq files

+

** See [[#Alignment FASTQ Index File|Alignment FASTQ Index File]] for more information on the contents of the index file.

+

* The reference files to be whole genome references

+

** If you are just running chromosome 20, you can use the tutorial references

+

** Whole genome reference files can be downloaded from [[GotCloudReference]].

+

** See [[#Alignment Reference Files|Alignment Reference Files]] for more information.

−

* Reference Information:

+

Note: It is recommended that you use absolute paths (full path names, like “/home/mktrost/gotcloudReference” rather than just “gotcloudReference”). This example does not use absolute paths in order to be flexible to where the data is installed, but using relative paths requires it to be run from the correct directory.
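The recommendation above to use absolute paths can be followed without typing them by hand. The following is a minimal sketch, not part of GotCloud: the helper name to_abs_dir is hypothetical, and it relies only on the standard cd and pwd shell commands.

```shell
# Resolve a possibly relative directory name to an absolute path.
# to_abs_dir is a hypothetical helper, not a GotCloud command.
# It runs in a subshell, so the caller's working directory is unchanged.
# It prints nothing and returns non-zero if the directory does not exist.
to_abs_dir() {
    (cd -- "$1" 2>/dev/null && pwd)
}

# Example call (commented out; "gotcloudReference" is the directory name
# used in the note above and must exist relative to where you run this):
# REF_DIR_ABS=$(to_abs_dir gotcloudReference)
```

The printed value can then be pasted into the configuration file, so the pipeline no longer depends on the directory it is started from.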

−

** AS - assembly value to ~~put in the BAM~~

+

−

** FA_REF - the ~~reference file (.fa extension)~~, ~~the additional files should~~ be at the ~~same location:~~

+

−

*** human_g1k_v37_chr20-bs.~~umfa~~

+

=== Reference Files ===

−

*** human_g1k_v37_chr20.dict

+

−

*** human_g1k_v37_chr20.fa

+

Reference files are required for running both the alignment and variant calling pipelines.

−

*** human_g1k_v37_chr20.fa.amb

+

−

*** human_g1k_v37_chr20.fa.ann

+

The configuration keys for setting these are:

−

*** human_g1k_v37_chr20.~~fa.bwt~~

+

* FA_REF - Genome sequence reference files (needed for both pipelines)

−

*** human_g1k_v37_chr20.fa.fai

+

* DBSNP_VCF – DBSNP site VCF file (needed for both pipelines)

−

*** human_g1k_v37_chr20.fa.GCcontent

+

* HM3_VCF - HAPMAP site VCF file (needed for both pipelines)

−

*** human_g1k_v37_chr20.fa.pac

+

* INDEL_PREFIX - Indel sites file (needed for variant calling pipeline)

−

*** human_g1k_v37_chr20.fa.rbwt

−

*** human_g1k_v37_chr20.fa.rpac

−

*** human_g1k_v37_chr20.fa.rsa

−

*** human_g1k_v37_chr20.fa.sa

−

** DBSNP_VCF - ~~a vcf containing the dbsnp positions~~

−

** HM3_VCF - ~~hapmap vcf~~

−

The ~~index~~ file ~~and~~ chromosome 20 ~~references~~ are included with the example data ~~under the~~ $GCDATA ~~directory. The tutorial uses chromosome 20 only references in order to speed the processing time~~.

+

The tutorial configuration file is set up to point to the required chromosome 20 reference files, which are included with the tutorial example data in $GCDATA/chr20Ref/.

−

~~When~~ you ~~run on your own data, that is~~ more than just chromosome 20, you will need ~~to use the full~~ reference files~~. Full Reference files~~ can be downloaded from [[GotCloudReference]]~~. If you are using these reference files, you will only need to specify REF_DIR in your configuration file to the full path to where they are installed~~.

+

If you are running more than just chromosome 20, you will need whole genome reference files which can be downloaded from [[GotCloudReference]].

+

The configuration settings for these files are set up in the default configuration, so they do not need to be specified. You only need to set REF_DIR in your configuration file to the path where you installed your reference files.
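As an illustration of a configuration for your own data, the sketch below writes a two-line file containing only the INDEX_FILE and REF_DIR keys named in this tutorial, in the "KEY = value" form used by the tutorial configuration files. The function name write_align_conf and the paths in the example call are hypothetical placeholders, and a real configuration may need further settings.

```shell
# Write a minimal alignment configuration file.
# write_align_conf is a hypothetical helper, not a GotCloud command.
# Arguments: 1) configuration file to create
#            2) path to your FASTQ index file
#            3) directory where the reference files are installed
write_align_conf() {
    conf_path=$1
    index_file=$2
    ref_dir=$3
    cat > "$conf_path" <<EOF
INDEX_FILE = $index_file
REF_DIR = $ref_dir
EOF
}

# Example call (commented out; all three paths are placeholders):
# write_align_conf /home/user/myAlign.conf /home/user/myIndex.txt /home/user/gotcloudReference
```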

−

~~When running on your own data~~, ~~you will also need to update the INDEX_FILE to point to your own index file. See~~ [[~~#Alignment Index File|Alignment Index File~~]] ~~for more information on the contents of the index file~~.

+

To learn more about the reference files that are required, see [[GotCloud: Reference Files]].

−

Note: It is recommended that you use absolute paths (full path names, like “/home/mktrost/gotcloudReference” rather than just “gotcloudReference”). This example does not use absolute paths in order to be flexible to where the data is installed, but using relative paths requires it to be run from the correct directory.

=== Alignment Output Directory ===

−

This setting tells the pipeline where to write the output files.

+

This setting tells the alignment pipeline where to write the output and intermediate files.

−

The output directory will be created if ~~necessary~~ and will contain the following Directories/files:

+

The output directory will be created if it doesn't already exist and will contain the following directories/files:

* bams - directory containing the final bams/bai files

** HG00096.OK - indicates that this sample completed alignment processing

** HG00100.OK - indicates that this sample completed alignment processing

* failLogs - directory containing logs from steps that failed

+

** this directory is only created if an error is detected

* Makefiles - directory containing the makefiles with commands for processing each sample

−

** biopipe_HG00096.Makefile

+

** biopipe_HG00096.Makefile – commands for processing sample HG00096

−

** biopipe_HG00100.Makefile

+

** biopipe_HG00100.Makefile – commands for processing sample HG00100

−

** biopipe_HG00096.Makefile.log – log file from running the associated ~~Makefiles~~

+

** biopipe_HG00096.Makefile.log – log file from running the associated Makefile

−

** biopipe_HG00100.Makefile.log – log file from running the associated ~~Makefiles~~

+

** biopipe_HG00100.Makefile.log – log file from running the associated Makefile

* QCFiles - directory containing the QC Results

**

* tmp - directory containing temporary alignment files

−

**

+

** bwa.sai.t – contains temporary files for the 1st step of bwa that generates sai files from fastq files

+

*** fastq

+

**** *.done – indicator files that the step to generate the file completed

+

**** HG00096

+

** alignment.bwa – contains temporary files for the 2nd step of bwa that generates BAM files

+

*** fastq

+

**** *.done – indicator files that the step to generate the file completed

+

** alignment.pol - contains temporary files for the polish bam step that cleans up BAM files

+

** alignment.dedup – contains temporary files for the deduping step

+

*** *.done – indicator files that the step to generate the file completed

+

*** *.metrics – metrics files for the deduping step

+

** alignment.recal – contains temporary files for the recalibration step

+

*** *.log – contains information about the recalibration step

+

*** *.qemp – contains the recalibration tables used for recalibrating each BAM file
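The layout above suggests a quick way to confirm that a run finished. The following sketch assumes only what the list describes: a <sample>.OK file under bams for each completed sample, and a failLogs directory that exists only when an error was detected. The function name check_align_output is hypothetical and is not part of GotCloud.

```shell
# Report whether each named sample completed alignment.
# check_align_output is a hypothetical helper, not a GotCloud command.
# Arguments: 1) the alignment output directory, then one or more sample names.
# Returns 0 only if every sample has an .OK file and no failLogs directory exists.
check_align_output() {
    outdir=$1
    shift
    rc=0
    for sample in "$@"; do
        if [ -f "$outdir/bams/$sample.OK" ]; then
            echo "$sample: completed"
        else
            echo "$sample: not completed"
            rc=1
        fi
    done
    if [ -d "$outdir/failLogs" ]; then
        echo "failLogs present: at least one step reported an error"
        rc=1
    fi
    return $rc
}

# Example call for the two tutorial samples (commented out):
# check_align_output "$GCOUT" HG00096 HG00100
```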

−

+

===Alignment FASTQ Index File===

−

===Index ~~file~~===

There are four fastq files in {ROOT_DIR}/test/align/fastq/Sample_1 and four fastq files in {ROOT_DIR}/test/align/fastq/Sample_2, both in paired-end format. Normally, we would need to build an index file for these files. Conveniently, an index file (indexFile.txt) already exists for the automatic test samples. It can be found in {ROOT_DIR}/test/align/, and contains the following information in tab-delimited format:

Line 363: Line 383:

(More information about: [[Mapping_Pipeline#Sequence_Index_File|the index file]].)

−

~~===Configuration file===~~

−

~~Similar to the index file, a configuration file (test.conf) already exists for the automatic test samples. It contains the following information:~~

−

~~INDEX_FILE = indexFile.txt~~

−

~~############~~

−

~~# References~~

−

~~REF_DIR = $(PIPELINE_DIR)/test/align/chr20Ref~~

−

~~AS = NCBI37~~

−

~~FA_REF = $(REF_DIR)/human_g1k_v37_chr20.fa~~

−

~~DBSNP_VCF = $(REF_DIR)/dbsnp.b130.ncbi37.chr20.vcf.gz~~

−

~~PLINK = $(REF_DIR)/hapmap_3.3.b37.chr20~~

−

If you are in the {ROOT_DIR}/test/align directory, you can use this file as-is. If you are using a different index file, make sure your index file is named correctly in the first line. If you are not running this from {ROOT_DIR}/test/align, make sure your configuration and index files are in the same directory.

−

(More information about: [[Mapping_Pipeline#Reference_Files|reference files]], [[Mapping_Pipeline#Optional_Configurable_Settings|optional configurable settings]], or [[Mapping_Pipeline#Command-Line_Options|command-line options]].)

−

~~===Running the alignment pipeline===~~

−

~~You are now ready to run the alignment pipeline.~~

−

~~To run the alignment pipeline, enter the following command:~~

−

~~{ROOT_DIR}/bin/gen_biopipeline.pl -conf test.conf -out_dir {OUT_DIR}~~

−

~~where {OUT_DIR} is the directory in which you wish to store the resulting BAM files (for example, ~/out).~~

−

~~If everything went well, you will see the following messages:~~

−

~~Created {OUT_DIR}/Makefiles/biopipe_Sample2.Makefile~~

−

~~Created {OUT_DIR}/Makefiles/biopipe_Sample1.Makefile~~

−

~~---------------------------------------------------------------------~~

−

~~Submitted 2 commands~~

−

~~Waiting for commands to complete... . . Commands finished in 33 secs with no errors reported~~

−

~~The aligned BAM files are found in {OUT_DIR}/alignment.recal/~~

Line 443: Line 424:

A configuration file (umake_test.conf) already exists in {ROOT_DIR}/test/umake/. It contains the following information:

+

CHRS = 20

+

BAM_INDEX = GBR60bam.index

+

############

+

# References

+

REF_ROOT = chr20Ref

+

#

+

REF = $(REF_ROOT)/human_g1k_v37_chr20.fa

+

INDEL_PREFIX = $(REF_ROOT)/1kg.pilot_release.merged.indels.sites.hg19

+

DBSNP_VCF = $(REF_ROOT)/dbsnp135_chr20.vcf.gz

+

HM3_VCF = $(REF_ROOT)/hapmap_3.3.b37.sites.chr20.vcf.gz

+

CHRS = 20

Line 487: Line 480:

(More information about: [[Variant_Calling_Pipeline_(UMAKE)#Configuration_File|the configuration file]], [[Variant_Calling_Pipeline_(UMAKE)#Reference_Files|reference files]].)

−

~~===Running UMAKE===~~

−

~~If you added an OUT_DIR line to the configuration file, you can run UMAKE with the following command:~~

−

~~{ROOT_DIR}/bin/umake.pl --conf umake_test.conf --snpcall --numjobs 2~~

−

~~If you have not added an OUT_DIR line to the configuration file, you can specify the output directory directly with the following command:~~

−

~~{ROOT_DIR}/bin/umake.pl --conf umake_test.conf --outdir {OUT_DIR} --snpcall --numjobs 2~~

−

~~where {OUT_DIR} is the directory in which you want the output to be stored.~~

−

~~Either command will perform SNP calling on the test samples. If you find the resulting VCF files located in {OUT_DIR}/vcfs/chr20, then you have successfully called the SNPs from the test BAM files.~~

Mktrost

Administrators

3,045

edits
