| + | |
| == Creating a New BAM Processing Pipeline == | | == Creating a New BAM Processing Pipeline == |
| | | |
| GotCloud allows you to configure new basic BAM processing pipelines via configuration. | | GotCloud allows you to configure new basic BAM processing pipelines via configuration. |
| | | |
| + | To define new processing pipelines, you will use Configuration sections to define both the pipeline and each of the steps. So first you need to understand how configuration sections work. |
| + | |
| + | === GotCloud Configuration Sections === |
| + | |
| + | |
| + | GotCloud configuration files can be broken into sections: |
| + | * Section names are specified between square brackets (<code>[]</code>)
| + | *: <pre>[sectionName]</pre> |
| + | ** Any configuration settings specified after the section header belong to that section |
| + | ** A section can be specified multiple times in the file and the configuration settings are accumulated |
| + | ** To access a value for a key defined in another section, use <code>$(otherSectionName/keyName)</code> |
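| + | 
| + | As a minimal sketch of the cross-section reference described above (the section and key names are placeholders, and <code>KEY4</code> is a hypothetical key used only for illustration), a value defined in one section can be reused in another:
| + | <pre>[section1]
| + | KEY1 = VAL1
| + | 
| + | [section2]
| + | KEY4 = $(section1/KEY1)</pre>
| + | Under the rule above, <code>KEY4</code> in <code>section2</code> would be expected to resolve to <code>VAL1</code>.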
| + | |
| + | |
| + | * If a section is not specified, the configuration settings belong to the <code>global</code> section |
| + | ** The <code>global</code> section does not need to be specified at the beginning of the file (it is the default section). |
| + | ** Additional <code>global</code> settings can be set later in the file, after other settings, by explicitly defining the section:
| + | **: <pre>[global]</pre> |
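| + | 
| + | The following sketch illustrates the default-section behavior described above; all key and section names are placeholders. Settings that appear before any section header, and settings under an explicit <code>[global]</code> header later in the file, should both end up in the <code>global</code> section:
| + | <pre>KEY1 = VAL1
| + | 
| + | [section1]
| + | KEY2 = VAL2
| + | 
| + | [global]
| + | KEY3 = VAL3</pre>
| + | Here <code>KEY1</code> and <code>KEY3</code> would belong to <code>global</code>, while <code>KEY2</code> belongs to <code>section1</code>.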
| + | |
| + | |
| + | * Sections can be derived from another section |
| + | ** All sections automatically derive from <code>[global]</code> |
| + | ** A derived section inherits all the configuration settings from its parent sections |
| + | *** Parent settings are overridden by redefining the configuration key/value pair |
| + | ** A parent section is specified following a colon <code>:</code> on the section definition line:
| + | **: <pre>[childSectionName] : parentSectionName</pre> |
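| + | 
| + | A small sketch of inheritance and overriding, using placeholder section and key names that are not part of GotCloud:
| + | <pre>[parentSectionName]
| + | KEY1 = VAL1
| + | KEY2 = VAL2
| + | 
| + | [childSectionName] : parentSectionName
| + | KEY2 = CHILD_VAL2</pre>
| + | Following the rules above, <code>childSectionName</code> would inherit <code>KEY1 = VAL1</code> from its parent and override <code>KEY2</code> with <code>CHILD_VAL2</code>.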
| + | |
| + | |
| + | <ul> |
| + | <li> Section specific configuration settings are specified on the lines following the section definition: |
| + | <dd><pre>[section1] |
| + | KEY1 = VAL1 |
| + | KEY2 = VAL2 |
| + | |
| + | [section2] |
| + | KEY1 = VAL1_2 |
| + | KEY3 = VAL3</pre></dd> |
| + | </li> |
| + | </ul> |
| + | === Defining a New Pipeline === |
| + | |
| + | Creating a new pipeline has two parts:
| + | # [[#Overall Pipeline Definition|Overall Pipeline Definition]] |
| + | #* Basics for the overall pipeline |
| + | #* '''NOTE: Currently, configuration settings in the overall pipeline's section are not passed on to the steps' configurations by default'''
| + | # [[#Configure Each Step|Configure Each Step]] |
| + | |
| + | ==== Overall Pipeline Definition ==== |
| + | <ol> |
| + | <li> Define a new configuration section for your pipeline |
| + | <ul><li> Example:</li></ul> |
| + | <dd> <pre>[pipelineName]</pre> |
| + | </li> |
| + | <li>Define the steps in this pipeline using the key <code>STEPS</code> under that section |
| + | <ul><li> Example:</li></ul> |
| + | <dd><pre>[pipelineName] |
| + | STEPS = stepName1 stepName2 stepName3</pre> |
| + | <ul><li> Note: each step must have its own configuration section</li></ul> |
| + | </li> |
| + | </ol> |
| + | |
| + | Optional Overall Pipeline Settings: |
| + | * BATCH_OPTS |
| + | * BATCH_TYPE |
| + | * IGNORE_SM_CHECK - turn off the default validation that the @RG SM tag matches the bam list sample name. |
| + | * IGNORE_REF_CHR_CHECK - turn off the default validation that all of the BAM's chromosomes are in the reference file (this may eventually be changed to validate only those in CHRS).
| + | * OUT_DIR |
| + | * BAM_LIST |
| + | * REF |
| + | * REF_FAI |
| + | * MULTIPLE_TARGET_MAP |
| + | * UNIFORM_TARGET_BED |
| + | * OFFSET_OFF_TARGET
| + | * CHRS - defines which chromosomes to run. |
| + | * UNIT_CHUNK |
| + | * NO_CRAM - do not allow CRAM files as input |
| + | * MAKE_BASE_NAME_PIPE - base makefile name |
| + | * MAKE_OPTS - options to pass to the make command that runs the jobs. |
| + | * BAM_DEPEND - set to TRUE if you want the BAM file to be included as a make dependency |
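| + | 
| + | As an illustration only, a pipeline section that combines <code>STEPS</code> with a few of the optional settings above might look like the sketch below. All paths and the chromosome list are hypothetical values chosen for the example; <code>CHRS</code> is shown space separated, as in the configuration file:
| + | <pre>[pipelineName]
| + | STEPS = stepName1 stepName2 stepName3
| + | OUT_DIR = /path/to/output
| + | BAM_LIST = /path/to/bam.list
| + | REF = /path/to/reference.fa
| + | CHRS = 20 21 22</pre>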
| + | |
| + | |
| + | |
| + | NOTES: |
| + | * The BAM_LIST file can contain config values within it - the overall pipeline section will be checked for those config values. |
| + | * By default, if a value is not defined in the pipeline's section, GotCloud checks the <code>global</code> section.
| + | |
| + | ==== Configure Each Step ==== |
| + | '''Create a section for each step''' |
| + | * Example: <code>[stepName1]</code> |
| + | |
| + | |
| + | ====Required keys for each step:==== |
| + | |
| + | # <code>DEPEND</code> - dependencies for this step |
| + | #: Valid Values (separate multiple dependencies with a space): |
| + | #:*<code>BAM</code> |
| + | #:*Name of step that must complete prior to this step |
| + | #:*<code>PER_SAMPLE_BAM</code> (unconfirmed: the file dependency can only be <code>BAM</code> or <code>PER_SAMPLE_BAM</code>)
| + | #<code>OUTPUT</code> - name of output file |
| + | #* See below for temporary keys for step iteration |
| + | #<code>CMD</code> - command for running the step |
| + | #* See below for temporary keys for step iteration |
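| + | 
| + | A minimal sketch of a step section with the three required keys follows. <code>someTool</code> and its options are hypothetical, and the output path is only an example; <code>OUT_DIR</code> is referenced through the pipeline's section because pipeline settings are not passed to steps by default. It assumes that <code>?(INPUT)</code> expands to the input file(s) for the current iteration:
| + | <pre>[stepName1]
| + | DEPEND = BAM
| + | OUTPUT = $(pipelineName/OUT_DIR)/stepName1/?(SAMPLE).txt
| + | CMD = someTool --in ?(INPUT) --out $(pipelineName/OUT_DIR)/stepName1/?(SAMPLE).txt</pre>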
| + | |
| + | |
| + | ====Optional Step Settings:==== |
| + | General Settings: |
| + | * <code>LOCAL</code> - run the step locally rather than on the cluster |
| + | * <code>NEED_BAI</code> - Set if a step requires a BAI file |
| + | ** Per chromosome steps always require a BAI file |
| + | ** Tells GotCloud to fail if a BAI can't be found |
| + | * <code>BAM_DEPEND</code> - Add the BAM file as a Makefile dependency for this step |
| + | |
| + | Settings to limit which samples this step runs on: |
| + | * <code>SAMPLES</code> - use this to define a step to run only for samples with a single BAM or multiple BAMs (merging) |
| + | *: Possible values: |
| + | *:* <code>MULTI_BAM</code> - run the step only for samples that have multiple BAMs |
| + | *:* <code>SINGLE_BAM</code> - run the step only for samples that have one BAM |
| + | *Deprecated settings - still in pipeline.pl and may or may not work: |
| + | ** <code>MULTI_ONLY</code> - set to non-blank if step should run if there are more than 1 input per output. |
| + | ** <code>SINGLE_ONLY</code> - set to non-blank if step should run if there is only 1 input per output. |
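| + | 
| + | For example, a merge-style step could be restricted to samples that have more than one BAM, as in this sketch. The step name <code>mergeStep</code>, the tool <code>someMergeTool</code>, and the paths are hypothetical:
| + | <pre>[mergeStep]
| + | DEPEND = BAM
| + | SAMPLES = MULTI_BAM
| + | OUTPUT = $(pipelineName/OUT_DIR)/merged/?(SAMPLE).txt
| + | CMD = someMergeTool --in ?(INPUT) --out $(pipelineName/OUT_DIR)/merged/?(SAMPLE).txt</pre>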
| + | |
| + | Joining multiple inputs for a single output: |
| + | * Can occur if there are multiple dependencies |
| + | * Can occur if a step runs at a more generic iteration level than a dependency |
| + | * <code>INPUT_JOIN</code> - value to pass to perl "join" command for joining multiple inputs for each output. |
| + | ** Looks across all dependencies |
| + | * <code>dependStepName_JOIN</code> - how to join the outputs of the dependency named "dependStepName" into the command line of a step that depends on it, when that dependency has multiple outputs per input of this step
| + | ** Substitutes <code>?(${depend}/OUTPUT)</code> with perl "join" using the specified value to join multiple outputs for that dependency |
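| + | 
| + | The sketch below shows one way the joining described above might be configured; it is an assumption-laden example rather than a tested configuration. The step names, tools, and paths are hypothetical. <code>perChrStep</code> iterates per sample and chromosome, while <code>mergeStep</code> iterates per sample, so each <code>mergeStep</code> output has several inputs. With <code>INPUT_JOIN</code> set to a comma, <code>?(INPUT)</code> would be expected to expand to those inputs joined by commas:
| + | <pre>[perChrStep]
| + | DEPEND = BAM
| + | OUTPUT = $(pipelineName/OUT_DIR)/perChr/?(SAMPLE).?(CHR).txt
| + | CMD = someTool --in ?(INPUT) --chr ?(CHR) --out $(pipelineName/OUT_DIR)/perChr/?(SAMPLE).?(CHR).txt
| + | 
| + | [mergeStep]
| + | DEPEND = perChrStep
| + | INPUT_JOIN = ,
| + | OUTPUT = $(pipelineName/OUT_DIR)/merged/?(SAMPLE).txt
| + | CMD = someMergeTool --in ?(INPUT) --out $(pipelineName/OUT_DIR)/merged/?(SAMPLE).txt</pre>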
| + | |
| + | Log Output filenames |
| + | * <code>FILELIST</code> - writes/appends the iteration's output file name into the specified file list. |
| + | ** Typically will be used in a later "merge" step |
| + | ** See below for temporary keys for step iteration that can be used in this filename |
| + | *** Temporary keys can be more general than those in OUTPUT, but cannot be more specific. |
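| + | 
| + | As a sketch of the rule above (step name, tool, and paths are hypothetical), the step below produces one output per sample and chromosome, while its <code>FILELIST</code> uses only <code>?(SAMPLE)</code>, which is more general than the keys in <code>OUTPUT</code>. Each sample's list file would then collect that sample's per-chromosome output names:
| + | <pre>[perChrStep]
| + | DEPEND = BAM
| + | OUTPUT = $(pipelineName/OUT_DIR)/perChr/?(SAMPLE).?(CHR).txt
| + | CMD = someTool --in ?(INPUT) --chr ?(CHR) --out $(pipelineName/OUT_DIR)/perChr/?(SAMPLE).?(CHR).txt
| + | FILELIST = $(pipelineName/OUT_DIR)/perChr/?(SAMPLE).list</pre>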
| + | |
| + | |
| + | ====Iterating a command for each Bam/Sample/Chromosome/Region==== |
| + | Temporary keys are used when iterating a command per BAM/sample/chromosome/region. |
| + | * Specify using <code>?()</code> rather than <code>$()</code> |
| + | * Temporary keys can be used in: |
| + | ** <code>OUTPUT</code> |
| + | ** <code>CMD</code> |
| + | ** <code>FILELIST</code> |
| + | * They are substituted with the current values on each iteration
| + | * How to iterate a command is determined by the temporary keys in <code>OUTPUT</code> |
| + | * Temporary Keys for determining iterations: |
| + | ** <code>?(BAM)</code> - per BAM per sample |
| + | ** <code>?(SAMPLE)</code> - per sample |
| + | ** <code>?(CHR)</code> - per chromosome |
| + | ** <code>?(START)</code> - per region of a chromosome (must also include <code>?(CHR)</code>)
| + | * Additional Temporary Keys: |
| + | ** <code>?(END)</code> - end of the region - only used if <code>?(START)</code> is also specified. |
| + | ** <code>?(INPUT)</code> |
| + | ** <code>?(${depend}/OUTPUT)</code> |
| + | |
| + | '''Notes:''' |
| + | * Currently each step iteration will: |
| + | ** be its own Makefile target/.OK file |
| + | ** run independently on the cluster |
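| + | 
| + | To illustrate how the temporary keys in <code>OUTPUT</code> select the iteration level, the hypothetical step below would iterate per sample and per region of each chromosome, since its <code>OUTPUT</code> contains <code>?(SAMPLE)</code>, <code>?(CHR)</code>, and <code>?(START)</code>. The step name, <code>someTool</code>, its options, and the paths are assumptions made for the example:
| + | <pre>[regionStep]
| + | DEPEND = BAM
| + | OUTPUT = $(pipelineName/OUT_DIR)/regions/?(SAMPLE).?(CHR).?(START).?(END).txt
| + | CMD = someTool --in ?(INPUT) --region ?(CHR):?(START)-?(END) --out $(pipelineName/OUT_DIR)/regions/?(SAMPLE).?(CHR).?(START).?(END).txt</pre>
| + | Removing <code>?(START)</code> and <code>?(END)</code> from <code>OUTPUT</code> and <code>CMD</code> would be expected to reduce this to one iteration per sample and chromosome.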
| + | |
| + | == Command Line Parameters == |
| + | Required Parameters: |
| + | * <code>--name</code> <pipelineName> - name of the pipeline to run |
| + | * <code>--conf</code> <configuration file> - configuration file to use |
| + | |
| + | NOTE: Currently, any command-line "overrides" apply to the global settings only, not to the pipeline/step settings.
| + | * This needs to be fixed so that they can override the pipeline settings
| + | |
| + | Optional Parameters: |
| + | * <code>--ignoreSmCheck</code> - overrides <code>IGNORE_SM_CHECK</code> |
| + | * <code>--ignoreRefChrCheck</code> - overrides <code>IGNORE_REF_CHR_CHECK</code> |
| + | * <code>--verbose</code> <number> - verbose value passed to the loadConf method |
| + | |
| + | Optional Parameters like SnpCall: |
| + | * <code>--numjobs|numjobs</code> <number> - number of jobs to run in parallel |
| + | * <code>--maxlocaljobs</code> <number> - number of jobs to allow to run when batchtype is local (default 10) - does not validate for commands running LOCAL |
| + | * <code>--region</code> <region to process> - like snpcall, specifies a single region to process |
| + | * <code>--bam_list|list|bamlist|bam_index|bamindex</code> <bam list file> - overrides <code>BAM_LIST</code>, the list of sample bam files to process |
| + | * <code>--out_dir|outdir</code> <output directory> - overrides <code>OUT_DIR</code> |
| + | * <code>--batchtype</code> <type> - overrides <code>BATCH_TYPE</code>
| + | * <code>--batchopts</code> <options> - overrides <code>BATCH_OPTS</code>
| + | * <code>--chrs|chroms</code> <comma separated chromosomes> - overrides <code>CHRS</code> (CHRS is space separated - commas are converted to spaces) |
| + | * <code>--ref_dir|refdir</code> <reference directory> - overrides <code>REF_DIR</code> |
| + | * <code>--ref_prefix|refprefix</code> <prefix> - overrides <code>REF_PREFIX</code> |
| + | * <code>--bam_prefix|bamprefix</code> <prefix> - overrides <code>BAM_PREFIX</code> |
| + | * <code>--base_prefix|baseprefix</code> <prefix> - overrides <code>BASE_PREFIX</code> |
| + | * <code>--gotcloudroot|gcroot</code> <path to gotcloud> - by default gotcloud root is determined from the path to the pipeline script, but this setting overrides that. |
| + | * <code>--help</code> - print Usage |
| + | * <code>--test</code> <test directory> - run the test code (just for indel right now) |
| + | |
| + | Unused command line options: |
| + | * In the code, but not actually used:
| + | ** <code>--keeptmp</code> - overrides <code>KEEP_TMP</code>
| + | ** <code>--keeplog</code> - overrides <code>KEEP_LOG</code>
| | | |
| + | == Example Pipelines Created == |
| + | For examples, look for sections that define <code>STEPS</code> in the default configuration files:
| + | https://github.com/statgen/gotcloud/blob/master/bin/gotcloudDefaults.conf |
| + | https://github.com/statgen/gotcloud/blob/alignPrep/bin/gotcloudDefaults.conf |