GotCloud: Variant Calling Options
Latest revision as of 14:57, 11 February 2015
Required Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--outdir path | OUT_DIR | output directory | |
--list/--bam_list/--bamlist file | BAM_LIST | path to the BAM List File | $(OUT_DIR)/bam.list |
--numjobs # | | number of jobs to run in parallel | 0 (generate Makefile of steps, but do not run) |
Common Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--conf file | | configuration file to use | |
Cluster Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--batchtype type | BATCH_TYPE | name of cluster type | local |
--batchopts opts | BATCH_OPTS | options to pass to the cluster command | |
--copyglf path | COPY_GLF | path to copy glfs to before processing them (path local to remote nodes, maybe in /tmp) | |
Test/Debug Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--help | | print help information | |
--test path | | run the snpcall/ldrefine test and write output to the specified path | |
--verbose | | add additional messages when reading configuration | |
Reference/Resource Files
- See GotCloud: Genetic Reference and Resource Files for reference/resource file configuration settings
Analysis Region Options
See Targeted/Exome Sequencing Settings for more information on specifying exome/targeted regions and other settings.
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--chrs # # | CHRS | space-separated list of chromosomes to process | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X |
--region #:#-# | | call region: skip the parts of the chromosome outside the specified region; format (-end is optional): chr:start-end | |
| UNIT_CHUNK | chunk size for SNP calling (GotCloud breaks up each chromosome into regions of this size) | 5000000 |
| LD_NSNPS | chunk size (number of SNPs) for genotype refinement | 10000 |
| LD_OVERLAP | number of SNPs that overlap between chunks for genotype refinement | 1000 |
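To make the UNIT_CHUNK setting concrete, the following Python sketch splits a chromosome into consecutive regions of that size, written in the chr:start-end format used by --region. It is illustrative only: the function name `chunk_regions`, the 1-based inclusive coordinates, and the example chromosome length are assumptions, and GotCloud's own boundary handling may differ.

```python
def chunk_regions(chrom, chrom_length, unit_chunk=5_000_000):
    """Split one chromosome into consecutive chr:start-end regions.

    Assumption: coordinates are 1-based and inclusive, and the last
    region is truncated at the end of the chromosome.
    """
    regions = []
    start = 1
    while start <= chrom_length:
        end = min(start + unit_chunk - 1, chrom_length)
        regions.append(f"{chrom}:{start}-{end}")
        start = end + 1
    return regions


# Hypothetical 12,000,000-base chromosome with the default UNIT_CHUNK.
for region in chunk_regions("20", 12_000_000):
    print(region)
```

With a larger UNIT_CHUNK there are fewer, longer regions; with a smaller one there are more, shorter regions.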
Chromosome X/Y Calling
For proper Chromosome X/Y calling, it is recommended to specify a PED file with sex information:
Configuration Key | Value Description |
---|---|
PED_INDEX | ped file containing sampleID (2nd column) and sex (5th column) |
Format of PED file:
familyID sampleID fatherID motherID sex
- Only sampleID and sex are used
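As a rough illustration of the PED layout above, this Python sketch reads PED-style lines and keeps only the two columns that are used. The function name `read_ped_index`, whitespace-separated columns, the skipping of `#` lines, and the sample IDs are assumptions made for the example. The sex values `1` and `2` follow the common PED convention (1 = male, 2 = female), which this page does not itself specify, so the sketch returns the value unchanged.

```python
def read_ped_index(lines):
    """Return {sampleID: sex} from PED-style lines.

    Only column 2 (sampleID) and column 5 (sex) are read. The sex value
    is kept as the raw string; its encoding is not interpreted here.
    """
    sex_by_sample = {}
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        if len(fields) < 5:
            raise ValueError(
                f"line {line_number}: expected at least 5 columns, got {len(fields)}"
            )
        sex_by_sample[fields[1]] = fields[4]
    return sex_by_sample


# Hypothetical family and sample IDs.
example = [
    "FAM1 SampleA 0 0 1",
    "FAM1 SampleB 0 0 2",
]
print(read_ped_index(example))
```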
Targeted/Exome Sequencing Settings
If you are running targeted/exome sequencing, specify the following settings:
Configuration Key | Value Description |
---|---|
UNIFORM_TARGET_BED | BED file of targeted regions (same BED file for all samples) |
MULTIPLE_TARGET_MAP | name of a file that maps each sample ID to a BED file of targeted regions; each line of the file contains: [SM_ID] [TARGET_BED] |
OFFSET_OFF_TARGET | number of bases by which to extend the target region (default is 0: do not extend the target region) |
SAMTOOLS_VIEW_TARGET_ONLY | true: speeds up processing by excluding off-target regions up front when running samtools view; false (default): off-target regions are not excluded when running samtools view, but are excluded at a later step. Warning: you may not want to set this to true, because it may make the command line too long and may produce an error if reads overlap multiple targeted regions (see GotCloud: FAQs->Targetted/Exome) |
WGS_SVM | whether or not to run SVM on the whole genome rather than by chromosome (default is by chromosome). Set to TRUE if you are running with a small number of target regions. |
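The MULTIPLE_TARGET_MAP file described above holds one [SM_ID] [TARGET_BED] pair per line. The Python sketch below builds that text from a dictionary. The function name `format_target_map`, the single-space separator, and the sample IDs and BED paths are assumptions for illustration; check the separator GotCloud expects before relying on it.

```python
def format_target_map(sample_to_bed):
    """Build the text of a target map: one 'SM_ID TARGET_BED' pair per line.

    Assumption: the two fields are separated by a single space, so
    neither field may itself contain whitespace.
    """
    lines = []
    for sample_id, bed_path in sample_to_bed.items():
        if any(ch.isspace() for ch in sample_id + bed_path):
            raise ValueError(
                f"whitespace not allowed in entry: {sample_id!r} {bed_path!r}"
            )
        lines.append(f"{sample_id} {bed_path}\n")
    return "".join(lines)


# Hypothetical sample IDs and BED file paths.
print(format_target_map({
    "SampleA": "targets/kitA.bed",
    "SampleB": "targets/kitB.bed",
}), end="")
```

The returned string can be written to a file whose name is then given as the MULTIPLE_TARGET_MAP value.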
Path Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--makebasename name | MAKE_BASE_NAME | basename of the Makefile generated by GotCloud | umake |
--bamprefix prefix | BAM_PREFIX | path to prepend to relative BAM file paths in the BAM list | |
--refprefix prefix | REF_PREFIX | path to prepend to relative reference/resource file paths | |
--baseprefix prefix | BASE_PREFIX | path to prepend to relative paths for the BAM list file, PED_INDEX, BAM (if BAM_PREFIX isn't specified), reference/resource files (if REF_PREFIX isn't specified) | |
--refdir path | REF_DIR | value to use for REF_DIR key | $(GOTCLOUD_ROOT)/gotcloud.ref |
--gotcloudroot path | GOTCLOUD_ROOT | specify to use a different directory for finding GotCloud bins/scripts | based on the location of the gotcloud/umake.pl script |
Validation Adjustment Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--maxlocaljobs # | | maximum # of jobs that can run if batchtype is local (to prevent accidentally starting jobs locally that were meant to be on a cluster) | 10 |
--ignoresmcheck | IGNORE_SM_CHECK | disable the validation that the Sample name in the BAM file matches the one in the BAM list file | |
Miscellaneous Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--nophonehome | | disable phonehome in GotCloud and the tools it calls | |
| BAMUTIL_THINNING | thinning parameter for bamUtil programs (set to 0 if --nophonehome is specified) | --phoneHomeThinning 10 |
Directory Settings Options
These values set GotCloud output subdirectories (relative paths under the OUT_DIR directory). You should not need to change these from the defaults unless you want to use different sub-directory names.
Configuration Key | Value Description | Default Value |
---|---|---|
BAM_GLF_DIR | GLF outputs per BAM (if multiple BAMs per sample) (intermediate files) | glfs/bams |
SM_GLF_DIR | GLF outputs per sample (intermediate files) | glfs/samples |
VCF_DIR | unfiltered and filtered VCFs | vcfs |
PVCF_DIR | vcfPileup results (intermediate files) | pvcfs |
SPLIT_DIR | VCFs with PASS variants only & split into multiple files | split |
BEAGLE_DIR | beagle output | beagle |
SPLIT4_DIR | VCFs with PASS variants only & split into multiple files for running beagle4 | split4 |
BEAGLE4_DIR | beagle version 4 output | beagle4 |
THUNDER_DIR | thunder output | thunder |
TARGET_DIR | directory to store target information when running with a BED file | target |
GLF_INDEX | filename for index file needed for glfflex (file is created by GotCloud) | glfIndex.ped |
Tool Options
These values set the binaries GotCloud should use. You should not need to change these from the defaults unless you want to try a different version of one of the tools.
Some tools have their options specified with the binary command, while others have them specified separately or hard-coded.
Configuration Key | Program Description | Default Value |
---|---|---|
SAMTOOLS_FOR_PILEUP | samtools to use for pileup | $(BIN_DIR)/samtools-hybrid |
SAMTOOLS_FOR_OTHERS | samtools to use for view and calmd | $(BIN_DIR)/samtools-hybrid |
GLFMERGE | merge glf files when there are multiple BAMs per individual | $(BIN_DIR)/glfMerge |
GLFFLEX | perform glf-based variant calling (replacement for glfMultiples) | $(BIN_DIR)/glfFlex --minMapQuality 0 --minDepth 1 --maxDepth 10000000 --uniformTsTv --smartFilter |
VCFPILEUP | vcfPileup to generate rich per-site information | $(BIN_DIR)/vcfPileup |
INFOCOLLECTOR | gather filtering statistics | $(BIN_DIR)/infoCollector |
VCFMERGE | merge multiple VCFs separated by chunk of genomes | perl $(SCRIPT_DIR)/bams2vcfMerge.pl |
VCFCOOKER | vcfCooker program for filtering | $(BIN_DIR)/vcfCooker |
VCFSUMMARY | script to generate summary statistics of discovered sites | perl $(SCRIPT_DIR)/vcf-summary |
VCFSPLIT | splits VCF into overlapping chunks for genotype refinement | perl $(SCRIPT_DIR)/vcfSplit.pl |
VCFSPLIT4 | splits VCF into overlapping chunks for beagle version 4 genotype refinement | perl $(SCRIPT_DIR)/vcfSplit4.pl |
VCF_SPLIT_CHROM | splits VCF into per chromosome VCFs | perl $(SCRIPT_DIR)/vcfSplitChr.pl |
VCFPASTE | generate filtered genotype VCF | perl $(SCRIPT_DIR)/vcfPaste.p |
BEAGLE | beagle program | java -Xmx4g -jar $(BIN_DIR)/beagle.20101226.jar seed=993478 gprobs=true niterations=50 lowmem=true |
BEAGLE4 | beagle version 4 program | java -Xmx4g -jar $(BIN_DIR)/b4.r1219.jar seed=993478 gprobs=true |
VCF2BEAGLE | convert VCF (with PL tag) into beagle input | perl $(SCRIPT_DIR)/vcf2Beagle.pl --PL |
BEAGLE2VCF | convert beagle output to VCF | perl $(SCRIPT_DIR)/beagle2Vcf.pl |
SVM_SCRIPT | SVM script | perl $(SCRIPT_DIR)/run_libsvm.pl |
SVMLEARN | SVM program | $(BIN_DIR)/svm-train |
SVMCLASSIFY | SVM program | $(BIN_DIR)/svm-predict |
INVNORM | SVM program | $(BIN_DIR)/invNorm |
THUNDER_STATES | flags for thunder states and weighted states | --states 400 --weightedStates 300 |
THUNDER | MaCH/Thunder genotype refinement step | $(BIN_DIR)/thunderVCF -r 30 --phase --dosage --compact --inputPhased $(THUNDER_STATES) |
LIGATEVCF | ligate multiple phased VCFs while resolving the phase between VCFs | perl $(SCRIPT_DIR)/ligateVcf.pl |
LIGATEVCF4 | ligate multiple phased VCFs while resolving the phase between VCFs | perl $(SCRIPT_DIR)/ligateVcf4.pl |
VCFCAT | concatenate multiple VCFs | perl $(SCRIPT_DIR)/vcfCat.pl |
BGZIP | bgzip program | $(BIN_DIR)/bgzip |
TABIX | tabix program | $(BIN_DIR)/tabix |
BAMUTIL | bam util program | $(BIN_DIR)/bam |
Options
Configuration Key | Description | Default Value |
---|---|---|
SLEEP_MULT | add sleep time prior to some steps; use only if too many steps are starting at the same time doing the same thing | 0 |
REMOTE_PREFIX | add a prefix to paths when sending across to a remote machine | |
GlfFlex Options
Configuration Key | Program Description | Default Value |
---|---|---|
WGS_SVM | whether or not to run SVM on the whole genome rather than by chromosome (default is by chromosome). Set to TRUE if you are running with a small number of target regions. | |
VCF_EXTRACT | position file to use for glfFlex | |
MODEL_GLFSINGLE | set to TRUE if glfSingle model should be used for glfFlex | |
MODEL_SKIP_DISCOVER | set to true to disable variant discovery for glfFlex | |
MODEL_AF_PRIOR | set to true to use AF prior for genotyping for glfFlex |
SVM Filtering Options
Configuration Key | Program Description | Default Value |
---|---|---|
POS_SAMPLE | percentage of positive samples used for training | 100 |
NEG_SAMPLE | percentage of negative samples used for training | 100 |
SVM_CUTOFF | SVM score cutoff for PASS/FAIL | 0 |
USE_SVMMODEL | whether to use pre-trained model for SVM filtering | FALSE |
SVMMODEL | pre-trained model file (if USE_SVMMODEL is set to TRUE) |
Hard Filtering Options
These options set the values to use when applying hard filters.
- To remove any filter, set it to blank in your configuration file
For additional hard filter information, see: GotCloud: Filters
Basic per variant filters:
Filter | Configuration Key | VCF value checked | Filter Variants with... | Default Value |
---|---|---|---|---|
max depth | FILTER_MAX_SAMPLE_DP | INFO:DP | > conf value * total number of samples | 1000 |
min depth | FILTER_MIN_SAMPLE_DP | INFO:DP | < conf value * total number of samples | 1 |
number of samples with coverage | FILTER_MIN_NS_FRAC | INFO:NS | < conf value * total number of samples | .50 |
number of samples with coverage | FILTER_MIN_NS | INFO:NS | < conf value | |
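The comparisons in the table above can be sketched in Python as follows. This is not the vcfCooker implementation: the function name `basic_filter_flags` is hypothetical, `None` stands in for a blank (disabled) setting, and the returned labels are simply the configuration key names rather than the strings written to the VCF FILTER column.

```python
def basic_filter_flags(info_dp, info_ns, num_samples,
                       max_sample_dp=1000, min_sample_dp=1,
                       min_ns_frac=0.50, min_ns=None):
    """Return the configuration keys of the basic filters a variant fails.

    A setting of None stands for a blank (disabled) filter.
    """
    failed = []
    if max_sample_dp is not None and info_dp > max_sample_dp * num_samples:
        failed.append("FILTER_MAX_SAMPLE_DP")
    if min_sample_dp is not None and info_dp < min_sample_dp * num_samples:
        failed.append("FILTER_MIN_SAMPLE_DP")
    if min_ns_frac is not None and info_ns < min_ns_frac * num_samples:
        failed.append("FILTER_MIN_NS_FRAC")
    if min_ns is not None and info_ns < min_ns:
        failed.append("FILTER_MIN_NS")
    return failed


# Hypothetical variant: total depth 50 and 40 covered samples out of 100.
print(basic_filter_flags(info_dp=50, info_ns=40, num_samples=100))
```

Because the depth thresholds are multiplied by the total number of samples, the default minimum depth of 1 corresponds to an average of one read per sample.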
Per variant filters that allow a range of values:
- Values of these filters must be numbers (or a comma/space-separated list of numbers)
- Rules:
  - Specifying 1 value in the filter turns that filter on and uses that value
  - Specifying 2 values in the filter (separated by ',' and/or ' ') turns on the filter and selects the value by the number of samples:
    - Use the 1st value if the number of samples is below FILTER_FORMULA_MIN_SAMPLES
    - Use the 2nd value if the number of samples is above FILTER_FORMULA_MAX_SAMPLES
    - If the number of samples is between the MIN & MAX, a log scale is used, where minVal and maxVal are the 1st and 2nd values and minSamples and maxSamples are FILTER_FORMULA_MIN_SAMPLES and FILTER_FORMULA_MAX_SAMPLES:
      (minVal - maxVal) * (log(maxSamples) - log(numSamples)) / (log(maxSamples) - log(minSamples)) + maxVal
Configuration settings for min/max # samples to determine filter value when the filter setting contains multiple values separated by ',' or ' ':
Configuration Key | Description | Default Value |
---|---|---|
FILTER_FORMULA_MIN_SAMPLES | if the total number of samples < conf value, use the value before the ',' or ' ' | 100 |
FILTER_FORMULA_MAX_SAMPLES | if the total number of samples > conf value, use the value after the ',' or ' ' | 1000 |
If the total number of samples is between min & max, the log scale is used.
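The two-value rules and the log-scale formula above can be worked through with a short Python sketch. The function names `parse_filter_setting` and `scaled_filter_value` are hypothetical, and the sketch assumes that minVal is the first configured value and maxVal the second, which is the reading under which the formula agrees with the below-MIN and above-MAX rules at the boundaries.

```python
import math
import re


def parse_filter_setting(text):
    """Split a setting such as '70,65' or '-20, -10' into numbers."""
    return [float(token) for token in re.split(r"[,\s]+", text.strip()) if token]


def scaled_filter_value(values, num_samples, min_samples=100, max_samples=1000):
    """Pick the threshold for a one- or two-value filter setting."""
    if len(values) == 1:
        return values[0]
    if len(values) != 2:
        raise ValueError("expected 1 or 2 values")
    min_val, max_val = values
    if num_samples < min_samples:
        return min_val
    if num_samples > max_samples:
        return max_val
    # Log-scale interpolation between the two values.
    return ((min_val - max_val)
            * (math.log(max_samples) - math.log(num_samples))
            / (math.log(max_samples) - math.log(min_samples))
            + max_val)


# Example with the default FILTER_MAX_ABL setting '70,65'.
abl = parse_filter_setting("70,65")
for n in (50, 100, 300, 1000, 5000):
    print(n, round(scaled_filter_value(abl, n), 3))
```

At exactly 100 samples the formula returns the first value and at exactly 1000 samples it returns the second, so the threshold changes continuously across the whole range.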
Filter | Configuration Key | VCF value checked | Filter Variants with... | Default Value | Conf Value Requirements |
---|---|---|---|---|---|
max Allele Balance in Heterozygotes | FILTER_MAX_ABL | INFO:AB | > conf value/100.0 | 70,65 | < 100 |
max Strand Bias Pearson's Correlation | FILTER_MAX_STR | INFO:STR | > conf value/100.0 | 20, 10 | < 100 |
min Strand Bias Pearson's Correlation | FILTER_MIN_STR | INFO:STR | < conf value/100.0 | -20, -10 | > -100 |
distance from known indel | FILTER_WIN_INDEL | position | distance from known indel < conf value | 5 | > 0 |
max Strand Bias z-score | FILTER_MAX_STZ | INFO:STZ | > conf value | 5, 10 | < INT_MAX |
min Strand Bias z-score | FILTER_MIN_STZ | INFO:STZ | < conf value | -5, -10 | > INT_MIN |
max Alternate allele inflation score | FILTER_MAX_AOI | INFO:AOI | > conf value | 5 | < INT_MAX |
min FIC | FILTER_MIN_FIC | INFO:FIC | < conf value/100.0 | -20, -10 | > INT_MIN |
max Cycle Bias Pearson's Correlation | FILTER_MAX_CBR | INFO:CBR | > conf value/100.0 | 20, 10 | < 100 |
max LQR | FILTER_MAX_LQR | INFO:LQR | > conf value/100.0 | 30, 20 | < 100 |
min phred-scaled quality score | FILTER_MIN_QUAL | QUAL | < conf value | 5 | > 0 |
min Root Mean Squared Mapping Quality | FILTER_MIN_MQ | INFO:MQ | < conf value | 20 | > 0 |
max Fraction of bases with mapQ=0 | FILTER_MAX_MQ0 | INFO:MQ0 | > conf value/100.0 | 10 | < 100 |
max Alternate allele quality z-score | FILTER_MAX_AOZ | INFO:AOZ | > conf value | | < INT_MAX |
max Ratio of base-quality inflation | FILTER_MAX_IOR | INFO:IOR | > conf value | | < INT_MAX |
Additional VCF Cooker filters:
- To apply additional VCF Cooker filters that do not already have a configuration key, add the vcfCooker command-line filter options to GotCloud using this setting:
Configuration Key | Default Value |
---|---|
FILTER_ADDITIONAL | |
Additional Options
Configuration Key | Program Description | Default Value |
---|---|---|
SAMTOOLS_VIEW_FILTER | filter settings for samtools view (default filters by mapping quality and flag) | -q 20 -F 0x0704 |
NOBAQ_SUBSTRINGS | skip the BAQ step if the BAM filename contains the specified space-separated substrings | SOLID |
BAM_DEPEND | set to true to rerun the pipeline if the BAM files are newer than previously run steps that use them | FALSE |
MAKE_OPTS | set to add additional makefile options |
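The default SAMTOOLS_VIEW_FILTER value, -q 20 -F 0x0704, can be unpacked with a small Python sketch. The flag meanings come from the SAM format specification and the -q/-F semantics from samtools view (skip reads with mapping quality below the -q value, and skip reads with any of the -F bits set). The function names `decode_flag_mask` and `is_excluded` are hypothetical and only mirror that behaviour for illustration.

```python
# SAM FLAG bits as defined in the SAM format specification.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "properly paired",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag_mask(mask):
    """List the names of the flag bits set in a mask."""
    return [name for bit, name in sorted(SAM_FLAG_BITS.items()) if mask & bit]


def is_excluded(read_flag, read_mapq, exclude_mask=0x0704, min_mapq=20):
    """Mirror '-q min_mapq -F exclude_mask': True if the read is dropped."""
    return bool(read_flag & exclude_mask) or read_mapq < min_mapq


print(decode_flag_mask(0x0704))
print(is_excluded(read_flag=0x400, read_mapq=60))  # duplicate read
print(is_excluded(read_flag=99, read_mapq=60))     # paired, mapped, MAPQ 60
```

So the default setting drops unmapped reads, secondary alignments, reads that fail quality checks, duplicates, and reads with mapping quality below 20.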