GotCloud: Variant Calling Options

From Genome Analysis Wiki

Required Options

Command-line Flag Configuration Key Value Description Default Value
--outdir path OUT_DIR output directory
--list/--bam_list/--bamlist file BAM_LIST path to the BAM List File $(OUT_DIR)/bam.list
--numjobs # number of jobs to run in parallel 0 (generate Makefile of steps, but do not run)

Common Options

Command-line Flag Configuration Key Value Description Default Value
--conf file configuration file to use

Cluster Options

Command-line Flag Configuration Key Value Description Default Value
--batchtype type BATCH_TYPE name of cluster type local
--batchopts opts BATCH_OPTS options to pass to the cluster command
--copyglf path COPY_GLF path to copy glfs to before processing them (path local to remote nodes, maybe in /tmp)

Test/Debug Options

Command-line Flag Configuration Key Value Description Default Value
--help print help information
--test path run the snpcall/ldrefine test and write output to the specified path
--verbose Add additional messages when reading configuration

Reference/Resource Files

Analysis Region Options

See Targeted/Exome Sequencing Settings for more information on specifying exome/targeted regions and other settings.

Command-line Flag Configuration Key Value Description Default Value
--chrs # # CHRS space separated list of chromosomes to process 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
--region #:#-# call region - skip regions of chromosome outside of specified region

format (-end is optional): chr:start-end

UNIT_CHUNK chunk size of SNP calling (GotCloud breaks up each chromosome into regions of this size) 5000000
LD_NSNPS chunk size (number of SNPs) of genotype refinement 10000
LD_OVERLAP overlapping # of SNPs between chunks for genotype refinement 1000
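
The region format and the chunk sizes above can be illustrated with a minimal Python sketch. It is not GotCloud's own code: the function names are hypothetical, and the sketch assumes that chunks start at the first position of the region, that a missing end means "to the end of the chromosome", and that LD_NSNPS counts the whole window including the SNPs shared with the previous chunk. GotCloud may lay out its chunk boundaries differently.

```python
def parse_region(region):
    """Parse 'chr:start-end'; the '-end' part is optional."""
    chrom, colon, span = region.partition(":")
    if not (chrom and colon and span):
        raise ValueError(f"expected chr:start[-end], got {region!r}")
    start_text, dash, end_text = span.partition("-")
    start = int(start_text)
    # Assumption: None stands for "to the end of the chromosome".
    end = int(end_text) if dash else None
    if end is not None and end < start:
        raise ValueError(f"end before start in {region!r}")
    return chrom, start, end


def split_into_chunks(start, end, unit_chunk=5_000_000):
    """Cover start..end (inclusive) with pieces of at most unit_chunk bases."""
    chunks = []
    pos = start
    while pos <= end:
        chunk_end = min(pos + unit_chunk - 1, end)
        chunks.append((pos, chunk_end))
        pos = chunk_end + 1
    return chunks


def snp_windows(num_snps, ld_nsnps=10_000, ld_overlap=1_000):
    """Half-open SNP index windows; neighbours share ld_overlap SNPs (assumed layout)."""
    if ld_overlap >= ld_nsnps:
        raise ValueError("LD_OVERLAP must be smaller than LD_NSNPS")
    windows = []
    first = 0
    while first < num_snps:
        last = min(first + ld_nsnps, num_snps)
        windows.append((first, last))
        if last == num_snps:
            break
        first = last - ld_overlap
    return windows
```

For example, under these assumptions a 12,000,000 base region with the default UNIT_CHUNK is split into three pieces, the last one shorter than the others.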

Chromosome X/Y Calling

For proper Chromosome X/Y calling, it is recommended to specify a PED file with sex information:

Configuration Key Value Description
PED_INDEX ped file containing sampleID (2nd column) and sex (5th column)

Format of PED file:

familyID sampleID fatherID motherID sex
  • Only sampleID and sex are used
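
As a rough illustration of the PED layout above, the following Python sketch collects the sex value for each sample from the 2nd and 5th columns. The function name is hypothetical, the sex value is returned exactly as written in the file (this page does not state how it is coded), and skipping blank lines and lines starting with '#' is an assumption rather than documented GotCloud behavior.

```python
def read_ped_sex(lines):
    """Return {sampleID: sex} using only columns 2 and 5 of each PED line."""
    sex_by_sample = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' comment lines carry no sample data.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 5:
            raise ValueError(f"expected at least 5 columns, got: {line!r}")
        sample_id, sex = fields[1], fields[4]
        sex_by_sample[sample_id] = sex  # value kept verbatim, not interpreted
    return sex_by_sample
```

The function accepts any iterable of lines, so an open file handle can be passed directly.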

Targeted/Exome Sequencing Settings

If you are running targeted/exome sequencing, you should specify:

Configuration Key Value Description
UNIFORM_TARGET_BED Bed file of targeted regions (same bed for all samples)
MULTIPLE_TARGET_MAP Filename of file mapping: sample id -> bed file of targeted regions

Each line of the file contains: [SM_ID] [TARGET_BED]

OFFSET_OFF_TARGET Number of bases by which to extend the target region

(default is 0, do not extend the target region)

SAMTOOLS_VIEW_TARGET_ONLY true: speeds up processing by excluding off-target regions initially when performing samtools view

false (default): off-target regions are not excluded when performing samtools view, but are excluded at a later step

Warning: You may not want to set this to true because it may:

WGS_SVM whether or not to run SVM on the whole genome rather than by chromosome (default is by chromosome). Set to TRUE if you are running with a small number of target regions.
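
To make the MULTIPLE_TARGET_MAP layout and the effect of OFFSET_OFF_TARGET concrete, here is a small Python sketch. Both function names are hypothetical. It assumes whitespace-separated columns in the map file and that the offset widens each target interval on both sides, clamped at position 0. How GotCloud applies the offset internally is not described on this page.

```python
def read_target_map(lines):
    """Return {SM_ID: TARGET_BED} from lines of the form '[SM_ID] [TARGET_BED]'."""
    target_bed_by_sample = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError(f"expected '[SM_ID] [TARGET_BED]', got: {line!r}")
        sample_id, bed_path = fields
        target_bed_by_sample[sample_id] = bed_path
    return target_bed_by_sample


def extend_target(start, end, offset=0):
    """Widen one target interval by offset bases on each side (assumed behavior)."""
    return max(0, start - offset), end + offset
```

With the default offset of 0 the interval is returned unchanged, which matches the "do not extend the target region" default.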

Path Options

Command-line Flag Configuration Key Value Description Default Value
--makebasename name MAKE_BASE_NAME basename of the Makefile generated by GotCloud umake
--bamprefix prefix BAM_PREFIX path to prepend to relative BAM file paths in the BAM list
--refprefix prefix REF_PREFIX path to prepend to relative reference/resource file paths
--baseprefix prefix BASE_PREFIX path to prepend to relative paths for the BAM list file, PED_INDEX, BAM (if BAM_PREFIX isn't specified), reference/resource files (if REF_PREFIX isn't specified)
--refdir path REF_DIR value to use for REF_DIR key $(GOTCLOUD_ROOT)/gotcloud.ref
--gotcloudroot path GOTCLOUD_ROOT specify to use a different directory for finding GotCloud bins/scripts (default: based on the location of the gotcloud/umake.pl script)
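
The fallback between the specific prefixes (BAM_PREFIX, REF_PREFIX) and BASE_PREFIX can be sketched in Python as follows. This is only an illustration of the rule described in the table: the function name is hypothetical, and joining the prefix and path with '/' as well as leaving absolute paths untouched are assumptions.

```python
import posixpath


def resolve_path(path, specific_prefix="", base_prefix=""):
    """Prepend a prefix to a relative path.

    specific_prefix plays the role of BAM_PREFIX or REF_PREFIX;
    base_prefix (BASE_PREFIX) is used only when the specific one is empty.
    """
    if posixpath.isabs(path):
        return path  # assumption: prefixes apply to relative paths only
    prefix = specific_prefix or base_prefix
    return posixpath.join(prefix, path) if prefix else path
```

For example, a BAM listed as bams/a.bam resolves under BAM_PREFIX when that is set, and under BASE_PREFIX otherwise.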

Validation Adjustment Options

Command-line Flag Configuration Key Value Description Default Value
--maxlocaljobs # maximum # of jobs that can run if batchtype is local (to prevent accidentally starting jobs locally that were meant to be on a cluster) 10
--ignoresmcheck IGNORE_SM_CHECK disable the validation that the Sample name in the BAM file matches the one in the BAM list file

Miscellaneous Options

Command-line Flag Configuration Key Value Description Default Value
--nophonehome disable phonehome in GotCloud and the tools it calls
BAMUTIL_THINNING thinning parameter for bamUtil programs (will be set to 0 - if --nophonehome is specified) --phoneHomeThinning 10


Directory Settings Options

These values set GotCloud output subdirectories (relative paths under the OUT_DIR directory). You should not need to change these from the defaults unless you want to use different sub-directory names.

Configuration Key Value Description Default Value
BAM_GLF_DIR GLF outputs per BAM (if multiple BAMs per sample) (intermediate files) glfs/bams
SM_GLF_DIR GLF outputs per sample (intermediate files) glfs/samples
VCF_DIR unfiltered and filtered VCFs vcfs
PVCF_DIR vcfPileup results (intermediate files) pvcfs
SPLIT_DIR VCFs with PASS variants only & split into multiple files split
BEAGLE_DIR beagle output beagle
SPLIT4_DIR VCFs with PASS variants only & split into multiple files for running beagle4 split4
BEAGLE4_DIR beagle version 4 output beagle4
THUNDER_DIR thunder output thunder
TARGET_DIR directory to store target information when running with a BED file target
GLF_INDEX filename for index file needed for glfflex (file is created by GotCloud) glfIndex.ped

Tool Options

These values set the binaries GotCloud should use. You should not need to change these from the defaults unless you want to try a different version of one of the tools.

Some tools have their options specified together with the binary command, while others have them set separately or hard-coded.

Configuration Key Program Description Default Value
SAMTOOLS_FOR_PILEUP samtools to use for pileup $(BIN_DIR)/samtools-hybrid
SAMTOOLS_FOR_OTHERS samtools to use for view and calmd $(BIN_DIR)/samtools-hybrid
GLFMERGE merge glf files when there are multiple BAMs per individual $(BIN_DIR)/glfMerge
GLFFLEX perform glf-based variant calling (replacement for glfMultiples) $(BIN_DIR)/glfFlex --minMapQuality 0 --minDepth 1 --maxDepth 10000000 --uniformTsTv --smartFilter
VCFPILEUP vcfPileup to generate rich per-site information $(BIN_DIR)/vcfPileup
INFOCOLLECTOR gather filtering statistics $(BIN_DIR)/infoCollector
VCFMERGE merge multiple VCFs separated by chunk of genomes perl $(SCRIPT_DIR)/bams2vcfMerge.pl
VCFCOOKER vcfCooker program for filtering $(BIN_DIR)/vcfCooker
VCFSUMMARY script to generate summary statistics of discovered sites perl $(SCRIPT_DIR)/vcf-summary
VCFSPLIT splits VCF into overlapping chunks for genotype refinement perl $(SCRIPT_DIR)/vcfSplit.pl
VCFSPLIT4 splits VCF into overlapping chunks for beagle version 4 genotype refinement perl $(SCRIPT_DIR)/vcfSplit4.pl
VCF_SPLIT_CHROM splits VCF into per chromosome VCFs perl $(SCRIPT_DIR)/vcfSplitChr.pl
VCFPASTE generate filtered genotype VCF perl $(SCRIPT_DIR)/vcfPaste.p
BEAGLE beagle program java -Xmx4g -jar $(BIN_DIR)/beagle.20101226.jar seed=993478 gprobs=true niterations=50 lowmem=true
BEAGLE4 beagle version 4 program java -Xmx4g -jar $(BIN_DIR)/b4.r1219.jar seed=993478 gprobs=true
VCF2BEAGLE convert VCF (with PL tag) into beagle input perl $(SCRIPT_DIR)/vcf2Beagle.pl --PL
BEAGLE2VCF convert beagle output to VCF perl $(SCRIPT_DIR)/beagle2Vcf.pl
SVM_SCRIPT SVM script perl $(SCRIPT_DIR)/run_libsvm.pl
SVMLEARN SVM program $(BIN_DIR)/svm-train
SVMCLASSIFY SVM program $(BIN_DIR)/svm-predict
INVNORM SVM program $(BIN_DIR)/invNorm
THUNDER_STATES flags for thunder states and weighted states --states 400 --weightedStates 300
THUNDER MaCH/Thunder genotype refinement step $(BIN_DIR)/thunderVCF -r 30 --phase --dosage --compact --inputPhased $(THUNDER_STATES)
LIGATEVCF ligate multiple phased VCFs while resolving the phase between VCFs perl $(SCRIPT_DIR)/ligateVcf.pl
LIGATEVCF4 ligate multiple phased VCFs while resolving the phase between VCFs perl $(SCRIPT_DIR)/ligateVcf4.pl
VCFCAT concatenate multiple VCFs perl $(SCRIPT_DIR)/vcfCat.pl
BGZIP bgzip program $(BIN_DIR)/bgzip
TABIX tabix program $(BIN_DIR)/tabix
BAMUTIL bam util program $(BIN_DIR)/bam

Options

Configuration Key Program Description Default Value
SLEEP_MULT add sleep time prior to some steps; use only if too many steps are starting at the same time doing the same thing 0
REMOTE_PREFIX add a prefix to paths when sending across to a remote machine

GlfFlex Options

Configuration Key Program Description Default Value
WGS_SVM whether or not to run SVM on the whole genome rather than by chromosome (default is by chromosome). Set to TRUE if you are running with a small number of target regions.
VCF_EXTRACT position file to use for glfFlex
MODEL_GLFSINGLE set to TRUE if glfSingle model should be used for glfFlex
MODEL_SKIP_DISCOVER set to true to disable variant discovery for glfFlex
MODEL_AF_PRIOR set to true to use AF prior for genotyping for glfFlex


SVM Filtering Options

Configuration Key Program Description Default Value
POS_SAMPLE percentage of positive samples used for training 100
NEG_SAMPLE percentage of negative samples used for training 100
SVM_CUTOFF SVM score cutoff for PASS/FAIL 0
USE_SVMMODEL whether to use pre-trained model for SVM filtering FALSE
SVMMODEL pre-trained model file (if USE_SVMMODEL is set to TRUE)


Hard Filtering Options

These options set the values to use when applying hard filters.

  • To remove any filter, set it to blank in your configuration file

For additional hard filter information, see: GotCloud: Filters

Basic per variant filters:

Filter Configuration Key VCF value checked Filter Variants with... Default Value
max depth FILTER_MAX_SAMPLE_DP INFO:DP > conf value * total number of samples 1000
min depth FILTER_MIN_SAMPLE_DP < conf value * total number of samples 1
number of samples with coverage FILTER_MIN_NS_FRAC INFO:NS < conf value * total number of samples .50
FILTER_MIN_NS < conf value
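
The basic per variant filters above scale with the number of samples. The Python sketch below shows that arithmetic under a few assumptions: the function name is hypothetical, the min depth filter is assumed to check INFO:DP like the max depth filter (the table does not list a VCF value for it), and None stands for a filter that has been set to blank.

```python
def basic_filter_failures(dp, ns, num_samples,
                          max_sample_dp=1000, min_sample_dp=1,
                          min_ns_frac=0.50, min_ns=None):
    """Return the configuration keys of the basic filters a variant fails.

    dp is INFO:DP and ns is INFO:NS for the variant; a setting of None
    means the filter is switched off.
    """
    failures = []
    if max_sample_dp is not None and dp > max_sample_dp * num_samples:
        failures.append("FILTER_MAX_SAMPLE_DP")
    if min_sample_dp is not None and dp < min_sample_dp * num_samples:
        failures.append("FILTER_MIN_SAMPLE_DP")
    if min_ns_frac is not None and ns < min_ns_frac * num_samples:
        failures.append("FILTER_MIN_NS_FRAC")
    if min_ns is not None and ns < min_ns:
        failures.append("FILTER_MIN_NS")
    return failures
```

With 100 samples and the defaults, a variant is flagged when its total depth exceeds 100,000, falls below 100, or when fewer than 50 samples have coverage.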


Per variant filters that allow a range of values:

  • values of these filters must be numbers (or comma/space separated list of numbers)
  • Rules:
    • Specifying 1 value in the filter will turn that filter on and use that value
    • Specifying 2 values in the filter (separated by ',' and/or ' ') turns on the filter
      • Use the 1st value if the number of samples is below FILTER_FORMULA_MIN_SAMPLES
      • Use the 2nd value if the number of samples is above FILTER_FORMULA_MAX_SAMPLES
      • If the number of samples is between the MIN & MAX, a logscale is used:
         (minVal - maxVal) * (log(maxSamples) - log(numSamples)) / (log(maxSamples) - log(minSamples)) + maxVal
Configuration settings for min/max # samples to determine filter value when the filter setting contains multiple values separated by ',' or ' '
Configuration Key Description Default Value
FILTER_FORMULA_MIN_SAMPLES total number of samples < conf value, use the value before the ',' or ' ' 100
FILTER_FORMULA_MAX_SAMPLES total number of samples > conf value, use the value after the ',' or ' ' 1000
total number of samples between min & max, use logscale
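
The rules for one-value and two-value filter settings, including the logscale formula, can be written out as a short Python sketch. The function names are hypothetical and the natural logarithm is assumed (the ratio in the formula is the same for any log base). The formula itself is copied from the rule above.

```python
import math
import re


def parse_filter_setting(setting):
    """Split a setting such as '70,65' or '20, 10' into a list of numbers."""
    return [float(tok) for tok in re.split(r"[,\s]+", setting.strip()) if tok]


def filter_value(values, num_samples, min_samples=100, max_samples=1000):
    """Pick the effective filter value for a given number of samples.

    min_samples and max_samples correspond to FILTER_FORMULA_MIN_SAMPLES
    and FILTER_FORMULA_MAX_SAMPLES.
    """
    if len(values) == 1:
        return values[0]
    min_val, max_val = values  # 1st value: few samples, 2nd value: many samples
    if num_samples < min_samples:
        return min_val
    if num_samples > max_samples:
        return max_val
    scale = ((math.log(max_samples) - math.log(num_samples))
             / (math.log(max_samples) - math.log(min_samples)))
    return (min_val - max_val) * scale + max_val
```

For a setting of 70,65 the value is 70 below 100 samples, 65 above 1000 samples, and moves from 70 down to 65 on a log scale in between. Filters listed with "conf value/100.0" then divide the result by 100 before comparing it with the VCF value.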
Filters
Filter Configuration Key VCF value checked Filter Variants with... Default Value Conf Value Requirements
max Allele Balance in Heterozygotes FILTER_MAX_ABL INFO:AB > conf value/100.0 70,65 < 100
max Strand Bias Pearson's Correlation FILTER_MAX_STR INFO:STR > conf value/100.0 20, 10 < 100
min Strand Bias Pearson's Correlation FILTER_MIN_STR < conf value/100.0 -20, -10 > -100
distance from known indel FILTER_WIN_INDEL position distance from known indel < conf value 5 > 0
max Strand Bias z-score FILTER_MAX_STZ INFO:STZ > conf value 5, 10 < INT_MAX
min Strand Bias z-score FILTER_MIN_STZ < conf value -5, -10 > INT_MIN
max Alternate allele inflation score FILTER_MAX_AOI INFO:AOI > conf value 5 < INT_MAX
min FIC FILTER_MIN_FIC INFO:FIC < conf value/100.0 -20, -10 > INT_MIN
max Cycle Bias Pearson's correlation FILTER_MAX_CBR INFO:CBR > conf value/100.0 20, 10 < 100
max LQR FILTER_MAX_LQR INFO:LQR > conf value/100.0 30, 20 < 100
min Phred-scaled quality score FILTER_MIN_QUAL QUAL < conf value 5 > 0
min Root Mean Squared Mapping Quality FILTER_MIN_MQ INFO:MQ < conf value 20 > 0
max Fraction of bases with mapQ=0 FILTER_MAX_MQ0 INFO:MQ0 > conf value/100.0 10 < 100
max Alternate allele quality z-score FILTER_MAX_AOZ INFO:AOZ > conf value < INT_MAX
max Ratio of base-quality inflation FILTER_MAX_IOR INFO:IOR > conf value < INT_MAX


Additional VCF Cooker filters:

  • If you want to add any additional VCF Cooker filters that don't already have a configuration item, you can do that by adding the vcfCooker command-line filter to GotCloud:
Configuration Key Default Value
FILTER_ADDITIONAL

Additional Options

Configuration Key Program Description Default Value
SAMTOOLS_VIEW_FILTER filter settings for samtools view (default filters by mapping quality and flag) -q 20 -F 0x0704
NOBAQ_SUBSTRINGS skip the BAQ step if the BAM filename contains the specified space-separated substrings SOLID
BAM_DEPEND set to true to rerun the pipeline if the BAM files are newer than previously run steps that use them FALSE
MAKE_OPTS set to add additional makefile options
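
To show what the default SAMTOOLS_VIEW_FILTER value excludes, the Python sketch below decodes the -F 0x0704 mask using the flag bits from the SAM specification and mimics the combined effect of -q 20 and -F on a single read. The function names are hypothetical; the sketch only reproduces the documented meaning of these two samtools view options and does not call samtools.

```python
# Flag bits as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "not passing quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flags(mask):
    """List the names of all flag bits set in mask."""
    return [name for bit, name in SAM_FLAG_BITS.items() if mask & bit]


def read_is_kept(flag, mapq, min_mapq=20, exclude_mask=0x0704):
    """True if a read passes '-q min_mapq -F exclude_mask'."""
    # -q drops reads with MAPQ below the threshold;
    # -F drops reads with any bit of the mask set.
    return mapq >= min_mapq and (flag & exclude_mask) == 0
```

Decoding 0x0704 shows that the default drops unmapped reads, secondary alignments, reads failing quality checks, and duplicates, in addition to reads with mapping quality below 20.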