Changes

From Genome Analysis Wiki
→‎Command Line Parameters: fix formatting about BAM_LIST
Line 43: Line 43:  
</ul>
 
</ul>
 
=== Defining a New Pipeline ===
 
=== Defining a New Pipeline ===
# Define a new configuration section for your pipeline
+
 
#: <code>[pipelineName]</code>
+
Creating a new pipeline has two parts:
 +
# [[#Overall Pipeline Definition|Overall Pipeline Definition]]
 +
#* Basics for the overall pipeline
 +
#* '''NOTE: Currently, settings in the overall pipeline's section are not passed on to the step sections by default'''
 +
# [[#Configure Each Step|Configure Each Step]]
 +
 
 +
==== Overall Pipeline Definition ====
 +
<ol>
 +
<li> Define a new configuration section for your pipeline
 +
<ul><li> Example:</li></ul>
 +
<dd> <pre>[pipelineName]</pre>
 +
</li>
 +
<li>Define the steps in this pipeline using the key <code>STEPS</code> under that section
 +
<ul><li> Example:</li></ul>
 +
<dd><pre>[pipelineName]
 +
STEPS = stepName1 stepName2 stepName3</pre>
 +
<ul><li> Note: each step must have its own configuration section</li></ul>
 +
</li>
 +
</ol>
 +
 
 +
Optional Overall Pipeline Settings:
 +
* BATCH_OPTS
 +
* BATCH_TYPE
 +
* IGNORE_SM_CHECK - turn off the default validation that the @RG SM tag matches the bam list sample name.
 +
* IGNORE_REF_CHR_CHECK - turn off the default validation that all of the BAMs' chromosomes are in the reference file - eventually this may be updated to validate only those in CHRS.
 +
* OUT_DIR
 +
* BAM_LIST
 +
* REF
 +
* REF_FAI
 +
* MULTIPLE_TARGET_MAP
 +
* UNIFORM_TARGET_BED
 +
* OFFSET_OFF_TARGET
 +
* CHRS - defines which chromosomes to run.
 +
* UNIT_CHUNK
 +
* NO_CRAM - do not allow CRAM files as input
 +
* MAKE_BASE_NAME_PIPE - base makefile name
 +
* MAKE_OPTS - options to pass to the make command that runs the jobs.
 +
* BAM_DEPEND - set to TRUE if you want the BAM file to be included as a make dependency
 +
 
 +
 
 +
 
 +
NOTES:
 +
* The BAM_LIST file can contain config values within it - the overall pipeline section will be checked for those config values.
 +
* By default, if a value is not defined in the pipeline's section, the global settings are checked for it.
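As a minimal sketch of an overall pipeline section that sets a few of the optional settings above, assuming the same <code>KEY = value</code> syntax shown for <code>STEPS</code>: the pipeline name, step names, paths, and chromosome list are hypothetical placeholders, and the <code>#</code> comment lines are assumed to be accepted by the configuration parser.
<pre>
# Hypothetical pipeline section - all names, paths, and values are illustrative
[pipelineName]
STEPS = stepName1 stepName2
# Placeholder paths - replace with your own locations
OUT_DIR = /path/to/output
BAM_LIST = /path/to/bam.list
REF = /path/to/reference.fa
# CHRS is space separated
CHRS = 20 21 22
# Include the BAM file as a make dependency
BAM_DEPEND = TRUE
</pre>
Any key not set in <code>[pipelineName]</code> would then be looked up in the global settings, as described in the notes above.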
 +
 
 +
==== Configure Each Step ====
 +
'''Create a section for each step'''
 +
* Example: <code>[stepName1]</code>
 +
 
 +
 
 +
====Required keys for each step:====
 +
 
 +
# <code>DEPEND</code> - dependencies for this step
 +
#: Valid Values (separate multiple dependencies with a space):
 +
#:*<code>BAM</code>
 +
#:*Name of step that must complete prior to this step
 +
#:*<code>PER_SAMPLE_BAM</code> (to be confirmed: can only be <code>BAM</code> or <code>PER_SAMPLE_BAM</code>)
 +
#<code>OUTPUT</code> - name of output file
 +
#* See below for temporary keys for step iteration
 +
#<code>CMD</code> - command for running the step
 +
#* See below for temporary keys for step iteration
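The three required keys might be combined as in the following sketch. The step name, output path, and <code>exampleTool</code> command are hypothetical, and referring to the step's own output as <code>$(OUTPUT)</code> inside <code>CMD</code> is an assumption based on the <code>$()</code> key syntax.
<pre>
# Hypothetical step - exampleTool is not a real program
[stepName1]
# Run once the input BAMs are available
DEPEND = BAM
# One output file per sample (?(SAMPLE) is a temporary key)
OUTPUT = out/stepName1/?(SAMPLE).txt
# Assumption: $(OUTPUT) expands to the OUTPUT value above
CMD = exampleTool --in ?(INPUT) --out $(OUTPUT)
</pre>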
 +
 
 +
 
 +
====Optional Step Settings:====
 +
General Settings:
 +
* <code>LOCAL</code> - run the step locally rather than on the cluster
 +
* <code>NEED_BAI</code> - Set if a step requires a BAI file
 +
** Per chromosome steps always require a BAI file
 +
** Tells GotCloud to fail if a BAI can't be found
 +
* <code>BAM_DEPEND</code> - Add the BAM file as a Makefile dependency for this step
 +
 
 +
Settings to limit which samples this step runs on:
 +
* <code>SAMPLES</code> - use this to define a step to run only for samples with a single BAM or multiple BAMs (merging)
 +
*: Possible values:
 +
*:* <code>MULTI_BAM</code> - run the step only for samples that have multiple BAMs
 +
*:* <code>SINGLE_BAM</code> - run the step only for samples that have one BAM
 +
*Deprecated settings - still in pipeline.pl and may or may not work:
 +
** <code>MULTI_ONLY</code> - set to non-blank if the step should run only when there is more than one input per output.
 +
** <code>SINGLE_ONLY</code> - set to non-blank if the step should run only when there is exactly one input per output.
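For instance, a merge step that should be skipped for samples with only one BAM could be restricted as sketched here. The step name, output path, and <code>exampleMerge</code> command are hypothetical, and <code>$(OUTPUT)</code> is assumed to expand to the step's own <code>OUTPUT</code> value.
<pre>
# Hypothetical step that runs only for samples with multiple BAMs
[mergeBams]
DEPEND = BAM
SAMPLES = MULTI_BAM
# One merged output per sample
OUTPUT = out/mergeBams/?(SAMPLE).bam
# exampleMerge is a placeholder command, not a real program
CMD = exampleMerge --out $(OUTPUT) ?(INPUT)
</pre>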
 +
 
 +
Joining multiple inputs for a single output:
 +
* Can occur if there are multiple dependencies
 +
* Can occur if a step runs at a more generic iteration level than a dependency
 +
* <code>INPUT_JOIN</code> - value to pass to perl "join" command for joining multiple inputs for each output.
 +
** Looks across all dependencies
 +
* <code>dependStepName_JOIN</code> - how to join the outputs of the step "dependStepName" on the command line of a step that depends on it, when that dependency provides multiple outputs for each input of this step
 +
** Substitutes <code>?(${depend}/OUTPUT)</code> with perl "join" using the specified value to join multiple outputs for that dependency
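A sketch of joining several inputs into one command, assuming a per-chromosome step feeds a per-sample step: the step names, paths, and commands are hypothetical, the comma is only an example value for <code>INPUT_JOIN</code>, and <code>$(OUTPUT)</code> is assumed to expand to the step's own <code>OUTPUT</code> value.
<pre>
# Hypothetical dependency: one output per sample per chromosome
[perChrStep]
DEPEND = BAM
OUTPUT = out/perChrStep/?(SAMPLE).?(CHR).txt
CMD = exampleTool --in ?(INPUT) --chr ?(CHR) --out $(OUTPUT)

# Hypothetical merge: one output per sample, so each output has several inputs
[perSampleMerge]
DEPEND = perChrStep
OUTPUT = out/perSampleMerge/?(SAMPLE).txt
# Join the per-chromosome inputs with a comma
INPUT_JOIN = ,
CMD = exampleMerge --inputs ?(INPUT) --out $(OUTPUT)
</pre>
With chromosomes 20 and 21, <code>?(INPUT)</code> for one sample would be expected to expand to the two <code>perChrStep</code> outputs separated by a comma.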
 +
 
 +
Log Output filenames
 +
* <code>FILELIST</code> - writes/appends the iteration's output file name into the specified file list.
 +
** Typically will be used in a later "merge" step
 +
** See below for temporary keys for step iteration that can be used in this filename
 +
*** Temporary keys can be more general than those in OUTPUT, but cannot be more specific.
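As an illustration, assuming that a "more general" file list name is one that uses only a subset of the temporary keys in <code>OUTPUT</code>: the step name, paths, and <code>exampleTool</code> command below are hypothetical.
<pre>
# Hypothetical step: outputs are per sample per chromosome
[stepName2]
DEPEND = BAM
OUTPUT = out/stepName2/?(SAMPLE).?(CHR).txt
# The file list is per sample, so it collects that sample's per-chromosome outputs
FILELIST = out/stepName2/?(SAMPLE).list
# Assumption: $(OUTPUT) expands to the OUTPUT value above
CMD = exampleTool --in ?(INPUT) --chr ?(CHR) --out $(OUTPUT)
</pre>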
 +
 
 +
 
 +
====Iterating a command for each Bam/Sample/Chromosome/Region====
 +
Temporary keys are used when iterating a command per BAM/sample/chromosome/region.
 +
* Specify using <code>?()</code> rather than <code>$()</code>
 +
* Temporary keys can be used in:
 +
** <code>OUTPUT</code>
 +
** <code>CMD</code>
 +
** <code>FILELIST</code>
 +
* They are substituted with the current value on each iteration
 +
* How to iterate a command is determined by the temporary keys in <code>OUTPUT</code>
 +
* Temporary Keys for determining iterations:
 +
** <code>?(BAM)</code> - per BAM per sample
 +
** <code>?(SAMPLE)</code> - per sample
 +
** <code>?(CHR)</code> - per chromosome
 +
** <code>?(START)</code> - per region of a chromosome (must also include <code>?(CHR)</code>)
 +
* Additional Temporary Keys:
 +
** <code>?(END)</code> - end of the region - only used if <code>?(START)</code> is also specified.
 +
** <code>?(INPUT)</code>
 +
** <code>?(${depend}/OUTPUT)</code>
 +
 
 +
'''Notes:'''
 +
* Currently each step iteration will:
 +
** be its own Makefile target/.OK file
 +
** run independently on the cluster
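A per-region step might look like the following sketch. The step name, output path, and <code>exampleCaller</code> command with its <code>--region</code> option are hypothetical, and <code>$(OUTPUT)</code> is assumed to expand to the step's own <code>OUTPUT</code> value.
<pre>
# Hypothetical per-region step - exampleCaller is a placeholder
[regionStep]
DEPEND = BAM
# ?(START) requires ?(CHR); together they give one iteration per region
OUTPUT = out/regionStep/?(SAMPLE).?(CHR).?(START).txt
# ?(END) can be used because ?(START) is specified in OUTPUT
CMD = exampleCaller --in ?(INPUT) --region ?(CHR):?(START)-?(END) --out $(OUTPUT)
</pre>
Each region's output would then be its own Makefile target and could run independently on the cluster.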
 +
 
 +
== Command Line Parameters ==
 +
Required Parameters:
 +
* <code>--name</code> <pipelineName> - name of the pipeline to run
 +
* <code>--conf</code> <configuration file> - configuration file to use
 +
 
 +
NOTE: Currently, any "overrides" are for the global setting only - not for the pipeline/step.
 +
* this needs to be fixed so they can override the pipeline settings
 +
 
 +
Optional Parameters:
 +
* <code>--ignoreSmCheck</code> - overrides <code>IGNORE_SM_CHECK</code>
 +
* <code>--ignoreRefChrCheck</code> - overrides <code>IGNORE_REF_CHR_CHECK</code>
 +
* <code>--verbose</code> <number> - verbose value passed to the loadConf method
 +
 
 +
Optional Parameters like SnpCall:
 +
* <code>--numjobs|numjobs</code> <number> - number of jobs to run in parallel
 +
* <code>--maxlocaljobs</code> <number> - maximum number of jobs allowed to run when the batch type is local (default 10) - not validated for commands running <code>LOCAL</code>
 +
* <code>--region</code> <region to process> - like snpcall, specifies a single region to process
 +
* <code>--bam_list|list|bamlist|bam_index|bamindex</code> <bam list file> - overrides <code>BAM_LIST</code>, the list of sample bam files to process
 +
* <code>--out_dir|outdir</code> <output directory> - overrides <code>OUT_DIR</code>
 +
* <code>--batchtype</code> <type> - overrides <code>BATCH_TYPE</code>
 +
* <code>--batchopts</code> <options> - overrides <code>BATCH_OPTS</code>
 +
* <code>--chrs|chroms</code> <comma separated chromosomes> - overrides <code>CHRS</code> (CHRS is space separated - commas are converted to spaces)
 +
* <code>--ref_dir|refdir</code> <reference directory> - overrides <code>REF_DIR</code>
 +
* <code>--ref_prefix|refprefix</code> <prefix> - overrides <code>REF_PREFIX</code>
 +
* <code>--bam_prefix|bamprefix</code> <prefix> - overrides <code>BAM_PREFIX</code>
 +
* <code>--base_prefix|baseprefix</code> <prefix> - overrides <code>BASE_PREFIX</code>
 +
* <code>--gotcloudroot|gcroot</code> <path to gotcloud> - by default gotcloud root is determined from the path to the pipeline script, but this setting overrides that.
 +
* <code>--help</code> - print Usage
 +
* <code>--test</code> <test directory> - run the test code (just for indel right now)
 +
 
 +
Unused command line options:
 +
* These options are accepted by the code but are not actually used:
 +
* <code>--keeptmp</code> - overrides <code>KEEP_TMP</code>
 +
* <code>--keeplog</code> - overrides <code>KEEP_LOG</code>
 +
 
 +
== Example Pipelines Created ==
 +
Look for sections that define <code>STEPS</code> in the default configuration files:
 +
https://github.com/statgen/gotcloud/blob/master/bin/gotcloudDefaults.conf
 +
https://github.com/statgen/gotcloud/blob/alignPrep/bin/gotcloudDefaults.conf