Changes

9,550 bytes added , 14:55, 31 August 2015

→‎Command Line Parameters: fix formatting about BAM_LIST

Line 1: Line 1: +

== Creating a New BAM Processing Pipeline ==

GotCloud lets you define new basic BAM processing pipelines through its configuration files.

+

To define a new processing pipeline, you use configuration sections to define both the pipeline and each of its steps, so you first need to understand how configuration sections work.

+

=== GotCloud Configuration Sections ===

+

GotCloud configuration files can be broken into sections:

+

* Section names are specified between square brackets (<code>[]</code>)

+

*: <pre>[sectionName]</pre>

+

** Any configuration settings specified after the section header belong to that section

+

** A section can be specified multiple times in the file and the configuration settings are accumulated

+

** To access a value for a key defined in another section, use <code>$(otherSectionName/keyName)</code>

+

* If a section is not specified, the configuration settings belong to the <code>global</code> section

+

** The <code>global</code> section does not need to be specified at the beginning of the file (it is the default section).

+

** Additional <code>global</code> settings can be set later in the file, after other settings, by explicitly defining the section:

+

**: <pre>[global]</pre>

+

* Sections can be derived from another section

+

** All sections automatically derive from <code>[global]</code>

+

** A derived section inherits all the configuration settings from its parent sections

+

*** Parent settings are overridden by redefining the configuration key/value pair

+

** A parent section is specified following a colon <code>:</code> on the section definition line:

+

**: <pre>[childSectionName] : parentSectionName</pre>

+

<ul>

+

<li> Section specific configuration settings are specified on the lines following the section definition:

+

<dd><pre>[section1]

+

KEY1 = VAL1

+

KEY2 = VAL2

+

[section2]

+

KEY1 = VAL1_2

+

KEY3 = VAL3</pre></dd>

+

</li>

+

</ul>
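
As a minimal sketch of how derivation and cross-section references combine, the fragment below uses hypothetical section and key names (<code>baseTool</code>, <code>fastTool</code>, <code>report</code>, <code>THREADS</code>, <code>MODE</code>, <code>SUMMARY</code>); only the syntax comes from the rules above, and it assumes that whole-line <code>#</code> comments are accepted and that a cross-section reference also sees inherited values.

<pre># Hypothetical sections, for illustration only
[baseTool]
THREADS = 1
MODE = strict

# fastTool inherits MODE from baseTool and overrides THREADS
[fastTool] : baseTool
THREADS = 8

# Values from another section are read with $(sectionName/keyName)
[report]
SUMMARY = mode $(fastTool/MODE) with $(fastTool/THREADS) threads</pre>

Under those assumptions, <code>SUMMARY</code> would be expected to resolve to <code>mode strict with 8 threads</code>.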

+

=== Defining a New Pipeline ===

+

Creating a new pipeline has two parts:

+

# [[#Overall Pipeline Definition|Overall Pipeline Definition]]

+

#* Basics for the overall pipeline

+

#* '''NOTE: Currently, settings in the overall pipeline's section are not passed on to the steps' configurations by default'''

+

# [[#Configure Each Step|Configure Each Step]]

+

==== Overall Pipeline Definition ====

+

<ol>

+

<li> Define a new configuration section for your pipeline

+

<ul><li> Example:</li></ul>

+

<dd> <pre>[pipelineName]</pre>

+

</li>

+

<li>Define the steps in this pipeline using the key <code>STEPS</code> under that section

+

<ul><li> Example:</li></ul>

+

<dd><pre>[pipelineName]

+

STEPS = stepName1 stepName2 stepName3</pre>

+

<ul><li> Note: each step must have its own configuration section</li></ul>

+

</li>

+

</ol>

+

Optional Overall Pipeline Settings:

+

* BATCH_OPTS

+

* BATCH_TYPE

+

* IGNORE_SM_CHECK - turn off the default validation that the @RG SM tag matches the bam list sample name.

+

* IGNORE_REF_CHR_CHECK - turn off the default validation that all of the BAM's chromosomes are in the reference file (eventually this may be updated to validate only those in CHRS).

+

* OUT_DIR

+

* BAM_LIST

+

* REF

+

* REF_FAI

+

* MULTIPLE_TARGET_MAP

+

* UNIFORM_TARGET_BED

+

* OFFSET_OFF_TARGET

+

* CHRS - defines which chromosomes to run.

+

* UNIT_CHUNK

+

* NO_CRAM - do not allow CRAM files as input

+

* MAKE_BASE_NAME_PIPE - base makefile name

+

* MAKE_OPTS - options to pass to the make command that runs the jobs.

+

* BAM_DEPEND - set to TRUE if you want the BAM file to be included as a make dependency

+

NOTES:

+

* The BAM_LIST file can contain config values within it - the overall pipeline section will be checked for those config values.

+

* By default, if a value is not defined in the pipeline's section, GotCloud checks the <code>global</code> section.
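
A pipeline section that sets a few of the optional keys might look like the sketch below. The pipeline name, step names, paths, and chromosome names are placeholders rather than values from GotCloud itself, and whole-line <code>#</code> comments are assumed to be accepted.

<pre># Hypothetical pipeline definition; all values are placeholders
[myPipeline]
STEPS = stepName1 stepName2
OUT_DIR = /path/to/output
BAM_LIST = /path/to/bam.list
REF = /path/to/reference.fa
# CHRS is space separated
CHRS = 20 21 22
BATCH_TYPE = local
BAM_DEPEND = TRUE</pre>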

+

==== Configure Each Step ====

+

'''Create a section for each step'''

+

* Example: <code>[stepName1]</code>

+

====Required keys for each step:====

+

# <code>DEPEND</code> - dependencies for this step

+

#: Valid Values (separate multiple dependencies with a space):

+

#:*<code>BAM</code>

+

#:*Name of step that must complete prior to this step

+

#:*PER_SAMPLE_BAM??? can only be BAM or PER_SAMPLE_BAM

+

#<code>OUTPUT</code> - name of output file

+

#* See below for temporary keys for step iteration

+

#<code>CMD</code> - command for running the step

+

#* See below for temporary keys for step iteration
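
Putting the three required keys together, a step section could be sketched as follows. <code>myStatsTool</code> and <code>myPipeline</code> are hypothetical names, and the sketch assumes that <code>?(INPUT)</code> expands to the input file(s) supplied by the step's dependencies. Because pipeline settings are not passed on to steps by default, the output directory is read explicitly from the pipeline section.

<pre># Hypothetical step that runs once per sample
[stepName1]
DEPEND = BAM
OUTPUT = $(myPipeline/OUT_DIR)/stats/?(SAMPLE).stats
CMD = myStatsTool --in ?(INPUT) --out $(myPipeline/OUT_DIR)/stats/?(SAMPLE).stats</pre>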

+

====Optional Step Settings:====

+

General Settings:

+

* <code>LOCAL</code> - run the step locally rather than on the cluster

+

* <code>NEED_BAI</code> - Set if a step requires a BAI file

+

** Per chromosome steps always require a BAI file

+

** Tells GotCloud to fail if a BAI can't be found

+

* <code>BAM_DEPEND</code> - Add the BAM file as a Makefile dependency for this step

+

Settings to limit which samples this step runs on:

+

* <code>SAMPLES</code> - use this to define a step to run only for samples with a single BAM or multiple BAMs (merging)

+

*: Possible values:

+

*:* <code>MULTI_BAM</code> - run the step only for samples that have multiple BAMs

+

*:* <code>SINGLE_BAM</code> - run the step only for samples that have one BAM

+

*Deprecated settings - still in pipeline.pl and may or may not work:

+

** <code>MULTI_ONLY</code> - set to non-blank if the step should run only when there is more than one input per output.

+

** <code>SINGLE_ONLY</code> - set to non-blank if the step should run only when there is exactly one input per output.
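
For example, a merge step that should be skipped for samples with only one BAM could be restricted with <code>SAMPLES</code>. This is a sketch only: <code>myMergeTool</code> and <code>myPipeline</code> are hypothetical names, and <code>?(INPUT)</code> is assumed to expand to the sample's BAM files.

<pre># Hypothetical merge step, run only for samples that have several BAMs
[mergePerSample]
DEPEND = BAM
SAMPLES = MULTI_BAM
OUTPUT = $(myPipeline/OUT_DIR)/merged/?(SAMPLE).merged.bam
CMD = myMergeTool --out $(myPipeline/OUT_DIR)/merged/?(SAMPLE).merged.bam ?(INPUT)</pre>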

+

Joining multiple inputs for a single output:

+

* Can occur if there are multiple dependencies

+

* Can occur if a step runs at a more generic iteration level than a dependency

+

* <code>INPUT_JOIN</code> - value to pass to perl "join" command for joining multiple inputs for each output.

+

** Looks across all dependencies

+

* <code>dependStepName_JOIN</code> - how to join the outputs of the step "dependStepName" on the command line of a step that depends on it, when there are multiple such outputs per input of this step

+

** Substitutes <code>?(${depend}/OUTPUT)</code> with perl "join" using the specified value to join multiple outputs for that dependency
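
The sketch below illustrates the join settings for a per-sample step that depends on a per-chromosome step. The names <code>callStep</code>, <code>combineCalls</code>, <code>myCombineTool</code>, and <code>myPipeline</code> are hypothetical, and it assumes that <code>${depend}</code> in <code>?(${depend}/OUTPUT)</code> stands for the name of the dependency.

<pre># Hypothetical step that collects all per-chromosome outputs of callStep
[combineCalls]
DEPEND = callStep
OUTPUT = $(myPipeline/OUT_DIR)/combined/?(SAMPLE).out
# Join callStep's outputs with a comma
callStep_JOIN = ,
CMD = myCombineTool --inputs ?(callStep/OUTPUT) --out $(myPipeline/OUT_DIR)/combined/?(SAMPLE).out</pre>

<code>INPUT_JOIN</code> would play the same role when the command uses <code>?(INPUT)</code> instead of a dependency-specific key.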

+

Log Output filenames

+

* <code>FILELIST</code> - writes/appends the iteration's output file name into the specified file list.

+

** Typically will be used in a later "merge" step

+

** See below for temporary keys for step iteration that can be used in this filename

+

*** Temporary keys can be more general than those in OUTPUT, but cannot be more specific.
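
As an illustration of the generality rule, the hypothetical step below writes one output per sample and chromosome but records them in one file list per sample, which is more general than <code>OUTPUT</code>. <code>myCallTool</code> and <code>myPipeline</code> are placeholder names, and combining <code>?(SAMPLE)</code> with <code>?(CHR)</code> in one <code>OUTPUT</code> is an assumption.

<pre># Hypothetical per-sample, per-chromosome step with a per-sample file list
[callStep]
DEPEND = BAM
OUTPUT = $(myPipeline/OUT_DIR)/calls/?(SAMPLE)/?(CHR).out
FILELIST = $(myPipeline/OUT_DIR)/calls/?(SAMPLE).list
CMD = myCallTool --in ?(INPUT) --chr ?(CHR) --out $(myPipeline/OUT_DIR)/calls/?(SAMPLE)/?(CHR).out</pre>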

+

====Iterating a command for each BAM/Sample/Chromosome/Region====

+

Temporary keys are used when iterating a command per BAM/sample/chromosome/region.

+

* Specify using <code>?()</code> rather than <code>$()</code>

+

* Temporary keys can be used in:

+

** <code>OUTPUT</code>

+

** <code>CMD</code>

+

** <code>FILELIST</code>

+

* They are substituted with the current values on each iteration

+

* How to iterate a command is determined by the temporary keys in <code>OUTPUT</code>

+

* Temporary Keys for determining iterations:

+

** <code>?(BAM)</code> - per BAM per sample

+

** <code>?(SAMPLE)</code> - per sample

+

** <code>?(CHR)</code> - per chromosome

+

** <code>?(START)</code> - per region of a chromosome (must also include <code>?(CHR)</code>)

+

* Additional Temporary Keys:

+

** <code>?(END)</code> - end of the region - only used if <code>?(START)</code> is also specified.

+

** <code>?(INPUT)</code>

+

** <code>?(${depend}/OUTPUT)</code>

+

'''Notes:'''

+

* Currently each step iteration will:

+

** be its own Makefile target/.OK file

+

** run independently on the cluster
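
Since the temporary keys in <code>OUTPUT</code> decide how a step iterates, the alternative <code>OUTPUT</code> values below sketch three iteration levels. They are alternatives for different steps, not one section; the paths and the <code>myPipeline</code> name are placeholders, and combining <code>?(SAMPLE)</code> with the chromosome and region keys is an assumption.

<pre># One iteration per sample
OUTPUT = $(myPipeline/OUT_DIR)/?(SAMPLE).out

# One iteration per sample and chromosome
OUTPUT = $(myPipeline/OUT_DIR)/?(SAMPLE)/?(CHR).out

# One iteration per region; ?(START) requires ?(CHR)
OUTPUT = $(myPipeline/OUT_DIR)/?(SAMPLE)/?(CHR).?(START).?(END).out</pre>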

+

== Command Line Parameters ==

+

Required Parameters:

+

* <code>--name</code> <pipelineName> - name of the pipeline to run

+

* <code>--conf</code> <configuration file> - configuration file to use

+

NOTE: Currently, any "overrides" are for the global setting only - not for the pipeline/step.

+

* This needs to be fixed so that overrides can also apply to the pipeline settings

+

Optional Parameters:

+

* <code>--ignoreSmCheck</code> - overrides <code>IGNORE_SM_CHECK</code>

+

* <code>--ignoreRefChrCheck</code> - overrides <code>IGNORE_REF_CHR_CHECK</code>

+

* <code>--verbose</code> <number> - verbose value passed to the loadConf method

+

Optional Parameters like SnpCall:

+

* <code>--numjobs|numjobs</code> <number> - number of jobs to run in parallel

+

* <code>--maxlocaljobs</code> <number> - number of jobs to allow to run when batchtype is local (default 10) - does not validate for commands running LOCAL

+

* <code>--region</code> <region to process> - like snpcall, specifies a single region to process

+

* <code>--bam_list|list|bamlist|bam_index|bamindex</code> <bam list file> - overrides <code>BAM_LIST</code>, the list of sample bam files to process

+

* <code>--out_dir|outdir</code> <output directory> - overrides <code>OUT_DIR</code>

+

* <code>--batchtype</code> <type> - overrides <code>BATCH_TYPE</code>

+

* <code>--batchopts</code> <options> - overrides <code>BATCH_OPTS</code>

+

* <code>--chrs|chroms</code> <comma separated chromosomes> - overrides <code>CHRS</code> (CHRS is space separated - commas are converted to spaces)

+

* <code>--ref_dir|refdir</code> <reference directory> - overrides <code>REF_DIR</code>

+

* <code>--ref_prefix|refprefix</code> <prefix> - overrides <code>REF_PREFIX</code>

+

* <code>--bam_prefix|bamprefix</code> <prefix> - overrides <code>BAM_PREFIX</code>

+

* <code>--base_prefix|baseprefix</code> <prefix> - overrides <code>BASE_PREFIX</code>

+

* <code>--gotcloudroot|gcroot</code> <path to gotcloud> - by default gotcloud root is determined from the path to the pipeline script, but this setting overrides that.

+

* <code>--help</code> - print Usage

+

* <code>--test</code> <test directory> - run the test code (just for indel right now)
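
A typical invocation might combine the required parameters with a few overrides, as sketched below. The script name is taken from the mention of <code>pipeline.pl</code> above; how it is launched depends on your installation, and the pipeline name, file names, and chromosome names are placeholders.

<pre>pipeline.pl --name myPipeline --conf myPipeline.conf --out_dir /path/to/output --chrs 20,21,22 --numjobs 4</pre>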

+

Unused command line options:

+

* These options are in the code but are not actually used:

+

* <code>--keeptmp</code> - overrides <code>KEEP_TMP</code>

+

* <code>--keeplog</code> - overrides <code>KEEP_LOG</code>

−

~~# Define a new configuration section~~ for ~~your pipeline~~

+

== Example Pipelines Created ==

−

#: <code>~~[pipelineName]~~</code>

+

For examples, look for the pipeline sections and their <code>STEPS</code> keys in the default configuration files:

+

https://github.com/statgen/gotcloud/blob/master/bin/gotcloudDefaults.conf

+

https://github.com/statgen/gotcloud/blob/alignPrep/bin/gotcloudDefaults.conf

Pjvh

61

edits


GotCloud: Creating a New Pipeline (view source)

Revision as of 14:55, 31 August 2015
