GotCloud: Creating a New Pipeline

From Genome Analysis Wiki
Latest revision as of 14:55, 31 August 2015

Creating a New BAM Processing Pipeline

GotCloud lets you define new basic BAM processing pipelines through configuration alone.

To define a new processing pipeline, you use configuration sections to describe both the pipeline and each of its steps, so you first need to understand how configuration sections work.

GotCloud Configuration Sections

GotCloud configuration files can be broken into sections:

  • Section names are specified between square brackets ([])
    [sectionName]
    • Any configuration settings specified after the section header belong to that section
    • A section can be specified multiple times in the file and the configuration settings are accumulated
    • To access a value for a key defined in another section, use $(otherSectionName/keyName)


  • If a section is not specified, the configuration settings belong to the global section
    • The global section does not need to be specified at the beginning of the file (it is the default section).
    • Additional global settings can be set later in the file, after other sections, by explicitly defining the global section:
      [global]


  • Sections can be derived from another section
    • All sections automatically derive from [global]
    • A derived section inherits all the configuration settings from its parent sections
      • Parent settings are overridden by redefining the configuration key/value pair
    • A parent section is specified following a colon (:) on the section definition line:
      [childSectionName] : parentSectionName


  • Section specific configuration settings are specified on the lines following the section definition:
    [section1]
    KEY1 = VAL1
    KEY2 = VAL2
    
    [section2]
    KEY1 = VAL1_2
    KEY3 = VAL3
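
As a hedged illustration of the inheritance and cross-section lookup rules above, the fragment below derives one section from another and reads a key from a different section. All section and key names (baseStep, childStep, report, THREADS, TOOL, TOOL_USED) are hypothetical, and the # comment lines assume the usual comment syntax of GotCloud configuration files.

```ini
[baseStep]
THREADS = 4
TOOL = myTool

# childStep inherits THREADS and TOOL from baseStep, then overrides THREADS.
[childStep] : baseStep
THREADS = 8

[report]
# Read a key that is defined in another section.
TOOL_USED = $(baseStep/TOOL)
```

Under these assumptions, childStep would see THREADS = 8 and TOOL = myTool, while baseStep keeps THREADS = 4.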

Defining a New Pipeline

Creating a new pipeline has two parts:

  1. Overall Pipeline Definition
    • Basics for the overall pipeline
    • NOTE: Currently, configurations set in the overall pipeline's section are not passed on to the steps' configurations by default
  2. Configure Each Step

Overall Pipeline Definition

  1. Define a new configuration section for your pipeline
    • Example:
    [pipelineName]
  2. Define the steps in this pipeline using the key STEPS under that section
    • Example:
    [pipelineName]
    STEPS = stepName1 stepName2 stepName3
    • Note: each step must have its own configuration section

Optional Overall Pipeline Settings:

  • BATCH_OPTS
  • BATCH_TYPE
  • IGNORE_SM_CHECK - turn off the default validation that the @RG SM tag matches the bam list sample name.
  • IGNORE_REF_CHR_CHECK - turn off the default validation that checks that all of the BAM's chromosomes are in the reference file - eventually we may update to just validate those in CHRS.
  • OUT_DIR
  • BAM_LIST
  • REF
  • REF_FAI
  • MULTIPLE_TARGET_MAP
  • UNIFORM_TARGET_BED
  • OFFSET_OFF_TARGET
  • CHRS - defines which chromosomes to run.
  • UNIT_CHUNK
  • NO_CRAM - do not allow CRAM files as input
  • MAKE_BASE_NAME_PIPE - base makefile name
  • MAKE_OPTS - options to pass to the make command that runs the jobs.
  • BAM_DEPEND - set to TRUE if you want the BAM file to be included as a make dependency


NOTES:

  • The BAM_LIST file can contain config values within it - the overall pipeline section will be checked for those config values.
  • By default, if a value is not defined in the pipeline section, GotCloud falls back to the value in the global section.
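
A minimal sketch of a pipeline section that uses a few of the optional settings listed above. The pipeline name, step names, paths, and chromosome list are placeholders, and the # comment syntax is an assumption. Shared paths are placed in the global section here because, as noted earlier, settings in the pipeline's own section are not passed on to the step sections by default.

```ini
# Keys before any section header belong to [global], so steps can see them too.
OUT_DIR = /path/to/output
BAM_LIST = /path/to/bam.list

[myPipeline]
STEPS = stepName1 stepName2
# Space-separated list of chromosomes to run.
CHRS = 20 21 22
# Include each BAM file as a make dependency.
BAM_DEPEND = TRUE
```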

Configure Each Step

Create a section for each step

  • Example: [stepName1]


Required keys for each step:

  1. DEPEND - dependencies for this step
    Valid Values (separate multiple dependencies with a space):
    • BAM
    • Name of step that must complete prior to this step
    • PER_SAMPLE_BAM??? can only be BAM or PER_SAMPLE_BAM
  2. OUTPUT - name of output file
    • See below for temporary keys for step iteration
  3. CMD - command for running the step
    • See below for temporary keys for step iteration
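
The three required keys might be combined as in the sketch below. It assumes that OUT_DIR is defined in the global section, that each sample has a single BAM, and that ?(INPUT) expands to the step's input file; none of this is confirmed on this page. The tool summarizeStats and its options are hypothetical, while samtools flagstat is a real command used only as a stand-in.

```ini
[stepName1]
# Runs directly on the BAM files from the BAM list.
DEPEND = BAM
OUTPUT = $(OUT_DIR)/stats/?(SAMPLE).flagstat
CMD = samtools flagstat ?(INPUT) > $(OUT_DIR)/stats/?(SAMPLE).flagstat

[stepName2]
# Runs only after stepName1 has completed.
DEPEND = stepName1
OUTPUT = $(OUT_DIR)/stats/?(SAMPLE).summary
CMD = summarizeStats --in ?(stepName1/OUTPUT) --out $(OUT_DIR)/stats/?(SAMPLE).summary
```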


Optional Step Settings:

General Settings:

  • LOCAL - run the step locally rather than on the cluster
  • NEED_BAI - Set if a step requires a BAI file
    • Per chromosome steps always require a BAI file
    • Tells GotCloud to fail if a BAI can't be found
  • BAM_DEPEND - Add the BAM file as a Makefile dependency for this step

Settings to limit which samples this step runs on:

  • SAMPLES - use this to define a step to run only for samples with a single BAM or multiple BAMs (merging)
    Possible values:
    • MULTI_BAM - run the step only for samples that have multiple BAMs
    • SINGLE_BAM - run the step only for samples that have one BAM
  • Deprecated settings - still in pipeline.pl and may or may not work:
    • MULTI_ONLY - set to a non-blank value if the step should run only when there is more than one input per output.
    • SINGLE_ONLY - set to a non-blank value if the step should run only when there is exactly one input per output.
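
As a hedged example of the SAMPLES setting, the step below would run only for samples that have more than one BAM. The step name and paths are hypothetical, OUT_DIR is assumed to be defined in the global section, and the sketch assumes that ?(INPUT) expands to all of the sample's BAMs separated by whitespace, which this page does not confirm.

```ini
[mergeSampleBams]
DEPEND = BAM
# Skip samples that have only one BAM.
SAMPLES = MULTI_BAM
OUTPUT = $(OUT_DIR)/merged/?(SAMPLE).bam
CMD = samtools merge $(OUT_DIR)/merged/?(SAMPLE).bam ?(INPUT)
```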

Joining multiple inputs for a single output:

  • Can occur if there are multiple dependencies
  • Can occur if a step runs at a more generic iteration level than a dependency
  • INPUT_JOIN - value to pass to perl "join" command for joining multiple inputs for each output.
    • Looks across all dependencies
  • dependStepName_JOIN - the value used to join the outputs of the step "dependStepName" in the command line of a step that depends on it, when there are multiple outputs per input of this step
    • Substitutes ?(${depend}/OUTPUT) with perl "join" using the specified value to join multiple outputs for that dependency
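
A sketch of a per-dependency join, assuming a hypothetical upstream step named perChrStep that produces one output per chromosome, and a hypothetical tool myCombineTool that accepts a comma-separated input list. Writing the join value as a bare comma is also an assumption about the accepted value format.

```ini
[combineStep]
DEPEND = perChrStep
# Join the perChrStep outputs with commas when building the command line.
perChrStep_JOIN = ,
OUTPUT = $(OUT_DIR)/combined/?(SAMPLE).txt
CMD = myCombineTool --inputs ?(perChrStep/OUTPUT) --out $(OUT_DIR)/combined/?(SAMPLE).txt
```

Because OUTPUT here iterates per sample while the dependency iterates per sample and chromosome, this is the "more generic iteration level" case described above.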

Logging output filenames:

  • FILELIST - writes/appends the iteration's output file name into the specified file list.
    • Typically will be used in a later "merge" step
    • See below for temporary keys for step iteration that can be used in this filename
      • Temporary keys can be more general than those in OUTPUT, but cannot be more specific.
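
To illustrate the "more general, not more specific" rule, the sketch below writes one file list per chromosome while the outputs themselves are per sample and per chromosome. The step name, the tool myPerChrTool, and all paths are hypothetical, and OUT_DIR is assumed to be defined in the global section.

```ini
[perChrStep]
DEPEND = BAM
OUTPUT = $(OUT_DIR)/perChr/?(SAMPLE).?(CHR).txt
# One list per chromosome: uses ?(CHR) only, so it is more general than OUTPUT.
FILELIST = $(OUT_DIR)/lists/chr?(CHR).list
CMD = myPerChrTool --in ?(INPUT) --chr ?(CHR) --out $(OUT_DIR)/perChr/?(SAMPLE).?(CHR).txt
```

A FILELIST that used ?(START) here would be more specific than OUTPUT and so would not be allowed under the rule above.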


Iterating a command for each Bam/Sample/Chromosome/Region

Temporary keys are used when iterating a command per BAM/sample/chromosome/region.

  • Specify using ?() rather than $()
  • Temporary keys can be used in:
    • OUTPUT
    • CMD
    • FILELIST
  • They are substituted with the current values as the command iterates
  • How to iterate a command is determined by the temporary keys in OUTPUT
  • Temporary Keys for determining iterations:
    • ?(BAM) - per BAM per sample
    • ?(SAMPLE) - per sample
    • ?(CHR) - per chromosome
    • ?(START) - per region of a chromosome (must also include ?(CHR))
  • Additional Temporary Keys:
    • ?(END) - end of the region - only used if ?(START) is also specified.
    • ?(INPUT)
    • ?(${depend}/OUTPUT)

Notes:

  • Currently each step iteration will:
    • be its own Makefile target/.OK file
    • run independently on the cluster
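
A minimal sketch of per-region iteration using the temporary keys above. Including ?(CHR) and ?(START) in OUTPUT is what requests one run per region; the step name, the tool myRegionTool, its options, and the paths are hypothetical, and OUT_DIR is assumed to be defined in the global section.

```ini
[perRegionStep]
DEPEND = BAM
# ?(SAMPLE), ?(CHR) and ?(START) in OUTPUT: one run per sample per region.
OUTPUT = $(OUT_DIR)/regions/?(SAMPLE)/?(CHR).?(START).?(END).txt
CMD = myRegionTool --in ?(INPUT) --region ?(CHR):?(START)-?(END) --out $(OUT_DIR)/regions/?(SAMPLE)/?(CHR).?(START).?(END).txt
```

Each expanded OUTPUT would then be its own Makefile target, as described in the notes above.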

Command Line Parameters

Required Parameters:

  • --name <pipelineName> - name of the pipeline to run
  • --conf <configuration file> - configuration file to use

NOTE: Currently, any "overrides" are for the global setting only - not for the pipeline/step.

  • this needs to be fixed so they can override the pipeline settings

Optional Parameters:

  • --ignoreSmCheck - overrides IGNORE_SM_CHECK
  • --ignoreRefChrCheck - overrides IGNORE_REF_CHR_CHECK
  • --verbose <number> - verbose value passed to the loadConf method

Optional Parameters like SnpCall:

  • --numjobs|numjobs <number> - number of jobs to run in parallel
  • --maxlocaljobs <number> - number of jobs to allow to run when batchtype is local (default 10) - does not validate for commands running LOCAL
  • --region <region to process> - like snpcall, specifies a single region to process
  • --bam_list|list|bamlist|bam_index|bamindex <bam list file> - overrides BAM_LIST, the list of sample bam files to process
  • --out_dir|outdir <output directory> - overrides OUT_DIR
  • --batchtype <type> - overrides BATCH_TYPE
  • --batchopts <options> - overrides BATCH_OPTS
  • --chrs|chroms <comma separated chromosomes> - overrides CHRS (CHRS is space separated - commas are converted to spaces)
  • --ref_dir|refdir <reference directory> - overrides REF_DIR
  • --ref_prefix|refprefix <prefix> - overrides REF_PREFIX
  • --bam_prefix|bamprefix <prefix> - overrides BAM_PREFIX
  • --base_prefix|baseprefix <prefix> - overrides BASE_PREFIX
  • --gotcloudroot|gcroot <path to gotcloud> - by default gotcloud root is determined from the path to the pipeline script, but this setting overrides that.
  • --help - print Usage
  • --test <test directory> - run the test code (just for indel right now)

Unused command line options (present in the code, but not actually used):

  • --keeptmp - overrides KEEP_TMP
  • --keeplog - overrides KEEP_LOG

Example Pipelines Created

For examples, look for the pipeline sections and their STEPS keys in the default configuration files:

https://github.com/statgen/gotcloud/blob/master/bin/gotcloudDefaults.conf
https://github.com/statgen/gotcloud/blob/alignPrep/bin/gotcloudDefaults.conf