GotCloud: Variant Calling Options

Latest revision as of 14:57, 11 February 2015

Required Options

Command-line Flag	Configuration Key	Value Description	Default Value
--outdir path	OUT_DIR	output directory
--list/--bam_list/--bamlist file	BAM_LIST	path to the BAM List File	$(OUT_DIR)/bam.list
--numjobs #		number of jobs to run in parallel	0 (generate Makefile of steps, but do not run)

Common Options

Command-line Flag	Configuration Key	Value Description	Default Value
--conf file		configuration file to use

Cluster Options

Command-line Flag	Configuration Key	Value Description	Default Value
--batchtype type	BATCH_TYPE	name of cluster type	local
--batchopts opts	BATCH_OPTS	options to pass to the cluster command
--copyglf path	COPY_GLF	path to copy glfs to before processing them (path local to remote nodes, maybe in /tmp)

Test/Debug Options

Command-line Flag	Configuration Key	Value Description	Default Value
--help		print help information
--test path		run the snpcall/ldrefine test and write output to the specified path
--verbose		Add additional messages when reading configuration

Reference/Resource Files

See GotCloud: Genetic Reference and Resource Files for reference/resource file configuration settings

Analysis Region Options

See Targeted/Exome Sequencing Settings for more information on specifying exome/targeted regions and other settings.

Command-line Flag	Configuration Key	Value Description	Default Value
--chrs # #	CHRS	space separated list of chromosomes to process	1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
--region #:#-#		call region - skip regions of chromosome outside of specified region format (-end is optional): chr:start-end
	UNIT_CHUNK	chunk size of SNP calling (GotCloud breaks up each chromosome into regions of this size)	5000000
	LD_NSNPS	chunk size (number of SNPs) of genotype refinement	10000
	LD_OVERLAP	overlapping # of SNPs between chunks for genotype refinement	1000
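
The chunking keys above can be illustrated with a short Python sketch. It assumes 1-based inclusive coordinates for UNIT_CHUNK regions and a sliding window that advances by LD_NSNPS - LD_OVERLAP SNPs for genotype refinement; the function names and the exact boundary handling are hypothetical and may differ from what GotCloud does internally.

```python
def split_into_regions(chrom_length, unit_chunk=5_000_000):
    """Return 1-based inclusive (start, end) regions of at most unit_chunk bases."""
    regions = []
    start = 1
    while start <= chrom_length:
        end = min(start + unit_chunk - 1, chrom_length)
        regions.append((start, end))
        start = end + 1
    return regions


def split_snp_windows(n_snps, ld_nsnps=10_000, ld_overlap=1_000):
    """Return 0-based half-open (first, last) SNP index windows.

    Consecutive windows share ld_overlap SNPs (assumed stepping rule).
    """
    if ld_overlap >= ld_nsnps:
        raise ValueError("ld_overlap must be smaller than ld_nsnps")
    if n_snps <= 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + ld_nsnps, n_snps)
        windows.append((start, end))
        if end == n_snps:
            break
        start += ld_nsnps - ld_overlap
    return windows
```

With the default values, a 12,000,000-base chromosome yields three calling regions, and 25,000 SNPs yield three refinement windows that overlap by 1,000 SNPs.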

Chromosome X/Y Calling

For proper Chromosome X/Y calling, it is recommended to specify a PED file with sex information:

Configuration Key	Value Description
PED_INDEX	ped file containing sampleID (2nd column) and sex (5th column)

Format of PED file:

familyID sampleID fatherID motherID sex

Only sampleID and sex are used
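
As an illustration of the format above, the following Python sketch reads the two columns that are used (sampleID and sex). It assumes whitespace-separated fields and skips blank lines and lines starting with '#'; both of these, and the function name, are assumptions. Sex values are kept as raw strings because the encoding is not described here.

```python
def read_sample_sex(ped_lines):
    """Map sampleID (2nd column) to sex (5th column) from PED-style lines."""
    sex_by_sample = {}
    for line in ped_lines:
        line = line.strip()
        # Skipping blank and '#' lines is an assumption, not documented behavior.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 5:
            raise ValueError(f"expected at least 5 columns, got: {line!r}")
        sex_by_sample[fields[1]] = fields[4]
    return sex_by_sample
```

To read from a file, pass the open file handle, for example `read_sample_sex(open(path))` with a hypothetical `path`.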

Targeted/Exome Sequencing Settings

If you are running Targeted/Exome Sequencing, you should specify:

Configuration Key	Value Description
UNIFORM_TARGET_BED	Bed file of targeted regions (same bed for all samples)
MULTIPLE_TARGET_MAP	Filename of file mapping: sample id -> bed file of targeted regions Each line of the file contains: [SM_ID] [TARGET_BED]
OFFSET_OFF_TARGET	Number of bases by which to extend the target region (default is 0, do not extend the target region)
SAMTOOLS_VIEW_TARGET_ONLY	true: speeds up processing by excluding off-target regions up front when performing samtools view. false (default): off-target regions are not excluded when performing samtools view, but are excluded at a later step. Warning: setting this to true may make the command line too long, or may produce an error if reads overlap multiple targeted regions; see GotCloud: FAQs->Targetted/Exome
WGS_SVM	whether or not to run SVM on the whole genome rather than by chromosome (default is by chromosome). Set to TRUE if you are running with a small number of target regions.
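
To make the effect of OFFSET_OFF_TARGET concrete, here is a minimal Python sketch. It assumes BED-style 0-based half-open intervals, extends each target on both sides, and clamps the start at 0; the function name, the two-sided extension, and the clamping are assumptions for illustration rather than GotCloud's actual code.

```python
def extend_target(interval, offset=0):
    """Extend a (chrom, start, end) target by offset bases on each side."""
    chrom, start, end = interval
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # Clamp at 0 so the extended start never becomes negative.
    return (chrom, max(0, start - offset), end + offset)


targets = [("20", 100, 200), ("20", 10, 40)]
extended = [extend_target(t, offset=50) for t in targets]
```

With the default offset of 0 the targets are returned unchanged, matching the documented default of not extending the target region.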

Path Options

Command-line Flag	Configuration Key	Value Description	Default Value
--makebasename name	MAKE_BASE_NAME	basename of the Makefile generated by GotCloud	umake
--bamprefix prefix	BAM_PREFIX	path to prepend to relative BAM file paths in the BAM list
--refprefix prefix	REF_PREFIX	path to prepend to relative reference/resource file paths
--baseprefix prefix	BASE_PREFIX	path to prepend to relative paths for the BAM list file, PED_INDEX, BAM (if BAM_PREFIX isn't specified), reference/resource files (if REF_PREFIX isn't specified)
--refdir path	REF_DIR	value to use for REF_DIR key	$(GOTCLOUD_ROOT)/gotcloud.ref
--gotcloudroot path	GOTCLOUD_ROOT	specify to use a different directory for finding GotCloud bins/scripts	based on the location of the gotcloud/umake.pl script
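
The prefix fallback described for BAM_PREFIX, REF_PREFIX, and BASE_PREFIX can be sketched in Python as follows. The function name is hypothetical, and joining prefix and path with a '/' separator is an assumption; only the precedence (specific prefix first, then BASE_PREFIX, relative paths only) comes from the table above.

```python
import posixpath


def resolve_path(path, specific_prefix=None, base_prefix=None):
    """Prepend a prefix to a relative path.

    specific_prefix stands in for BAM_PREFIX or REF_PREFIX; base_prefix
    stands in for BASE_PREFIX and is used only if no specific prefix is set.
    """
    if posixpath.isabs(path):
        return path  # prefixes apply to relative paths only
    prefix = specific_prefix or base_prefix
    if not prefix:
        return path
    return posixpath.join(prefix, path)
```

For example, a relative BAM path with no BAM_PREFIX set falls back to BASE_PREFIX, while an absolute path is left untouched.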

Validation Adjustment Options

Command-line Flag	Configuration Key	Value Description	Default Value
--maxlocaljobs #		maximum # of jobs that can run if batchtype is local (to prevent accidentally starting jobs locally that were meant to be on a cluster)	10
--ignoresmcheck	IGNORE_SM_CHECK	disable the validation that the Sample name in the BAM file matches the one in the BAM list file

Miscellaneous Options

Command-line Flag	Configuration Key	Value Description	Default Value
--nophonehome		disable phonehome in GotCloud and the tools it calls
	BAMUTIL_THINNING	thinning parameter for bamUtil programs (will be set to 0 - if --nophonehome is specified)	--phoneHomeThinning 10

Directory Settings Options

These values set GotCloud output subdirectories (relative paths under the OUT_DIR directory). You should not need to change these from the defaults unless you want to use different sub-directory names.

Configuration Key	Value Description	Default Value
BAM_GLF_DIR	GLF outputs per BAM (if multiple BAMs per sample) (intermediate files)	glfs/bams
SM_GLF_DIR	GLF outputs per sample (intermediate files)	glfs/samples
VCF_DIR	unfiltered and filtered VCFs	vcfs
PVCF_DIR	vcfPileup results (intermediate files)	pvcfs
SPLIT_DIR	VCFs with PASS variants only & split into multiple files	split
BEAGLE_DIR	beagle output	beagle
SPLIT4_DIR	VCFs with PASS variants only & split into multiple files for running beagle4	split4
BEAGLE4_DIR	beagle version 4 output	beagle4
THUNDER_DIR	thunder output	thunder
TARGET_DIR	directory to store target information when running with a BED file	target
GLF_INDEX	filename for index file needed for glfflex (file is created by GotCloud)	glfIndex.ped

Tool Options

These values set the binaries GotCloud should use. You should not need to change these from the defaults unless you want to try a different version of one of the tools.

Some tools have the options specified with the binary command, while others have them separate or hard coded

Configuration Key	Program Description	Default Value
SAMTOOLS_FOR_PILEUP	samtools to use for pileup	$(BIN_DIR)/samtools-hybrid
SAMTOOLS_FOR_OTHERS	samtools to use for view and calmd	$(BIN_DIR)/samtools-hybrid
GLFMERGE	merge glf files when there are multiple BAMs per individual	$(BIN_DIR)/glfMerge
GLFFLEX	perform glf-based variant calling (replacement for glfMultiples)	$(BIN_DIR)/glfFlex --minMapQuality 0 --minDepth 1 --maxDepth 10000000 --uniformTsTv --smartFilter
VCFPILEUP	vcfPileup to generate rich per-site information	$(BIN_DIR)/vcfPileup
INFOCOLLECTOR	gather filtering statistics	$(BIN_DIR)/infoCollector
VCFMERGE	merge multiple VCFs separated by chunk of genomes	perl $(SCRIPT_DIR)/bams2vcfMerge.pl
VCFCOOKER	vcfCooker program for filtering	$(BIN_DIR)/vcfCooker
VCFSUMMARY	script to generate summary statistics of discovered sites	perl $(SCRIPT_DIR)/vcf-summary
VCFSPLIT	splits VCF into overlapping chunks for genotype refinement	perl $(SCRIPT_DIR)/vcfSplit.pl
VCFSPLIT4	splits VCF into overlapping chunks for beagle version 4 genotype refinement	perl $(SCRIPT_DIR)/vcfSplit4.pl
VCF_SPLIT_CHROM	splits VCF into per chromosome VCFs	perl $(SCRIPT_DIR)/vcfSplitChr.pl
VCFPASTE	generate filtered genotype VCF	perl $(SCRIPT_DIR)/vcfPaste.p
BEAGLE	beagle program	java -Xmx4g -jar $(BIN_DIR)/beagle.20101226.jar seed=993478 gprobs=true niterations=50 lowmem=true
BEAGLE4	beagle version 4 program	java -Xmx4g -jar $(BIN_DIR)/b4.r1219.jar seed=993478 gprobs=true
VCF2BEAGLE	convert VCF (with PL tag) into beagle input	perl $(SCRIPT_DIR)/vcf2Beagle.pl --PL
BEAGLE2VCF	convert beagle output to VCF	perl $(SCRIPT_DIR)/beagle2Vcf.pl
SVM_SCRIPT	SVM script	perl $(SCRIPT_DIR)/run_libsvm.pl
SVMLEARN	SVM program	$(BIN_DIR)/svm-train
SVMCLASSIFY	SVM program	$(BIN_DIR)/svm-predict
INVNORM	SVM program	$(BIN_DIR)/invNorm
THUNDER_STATES	flags for thunder states and weighted states	--states 400 --weightedStates 300
THUNDER	MaCH/Thunder genotype refinement step	$(BIN_DIR)/thunderVCF -r 30 --phase --dosage --compact --inputPhased $(THUNDER_STATES)
LIGATEVCF	ligate multiple phased VCFs while resolving the phase between VCFs	perl $(SCRIPT_DIR)/ligateVcf.pl
LIGATEVCF4	ligate multiple phased VCFs while resolving the phase between VCFs	perl $(SCRIPT_DIR)/ligateVcf4.pl
VCFCAT	concatenate multiple VCFs	perl $(SCRIPT_DIR)/vcfCat.pl
BGZIP	bgzip program	$(BIN_DIR)/bgzip
TABIX	tabix program	$(BIN_DIR)/tabix
BAMUTIL	bam util program	$(BIN_DIR)/bam

Options

Configuration Key	Program Description	Default Value
SLEEP_MULT	add sleep time prior to some steps; use only if too many steps are starting at the same time doing the same thing	0
REMOTE_PREFIX	add a prefix to paths when sending across to a remote machine

GlfFlex Options

Configuration Key	Program Description	Default Value
WGS_SVM	whether or not to run SVM on the whole genome rather than by chromosome (default is by chromosome). Set to TRUE if you are running with a small number of target regions.
VCF_EXTRACT	position file to use for glfFlex
MODEL_GLFSINGLE	set to TRUE if glfSingle model should be used for glfFlex
MODEL_SKIP_DISCOVER	set to true to disable variant discovery for glfFlex
MODEL_AF_PRIOR	set to true to use AF prior for genotyping for glfFlex

SVM Filtering Options

Configuration Key	Program Description	Default Value
POS_SAMPLE	percentage of positive samples used for training	100
NEG_SAMPLE	percentage of negative samples used for training	100
SVM_CUTOFF	SVM score cutoff for PASS/FAIL	0
USE_SVMMODEL	whether to use pre-trained model for SVM filtering	FALSE
SVMMODEL	pre-trained model file (if USE_SVMMODEL is set to TRUE)

Hard Filtering Options

These options set the values to use when applying hard filters.

To remove any filter, set it to blank in your configuration file

For additional hard filter information, see: GotCloud: Filters

Basic per variant filters:

Filter	Configuration Key	VCF value checked	Filter Variants with...	Default Value
max depth	FILTER_MAX_SAMPLE_DP	INFO:DP	> conf value * total number of samples	1000
min depth	FILTER_MIN_SAMPLE_DP	INFO:DP	< conf value * total number of samples	1
number of samples with coverage	FILTER_MIN_NS_FRAC	INFO:NS	< conf value * total number of samples	.50
number of samples with coverage	FILTER_MIN_NS	INFO:NS	< conf value
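
The comparisons in the table above can be written out as a small Python sketch. The function name and the use of `None` to represent a filter left blank are assumptions for illustration; the thresholds and comparisons follow the table.

```python
def basic_filter_failures(info_dp, info_ns, n_samples,
                          max_sample_dp=1000, min_sample_dp=1,
                          min_ns_frac=0.50, min_ns=None):
    """Return the configuration keys of the basic filters a variant fails.

    A threshold of None means the filter is blank (disabled).
    """
    failures = []
    if max_sample_dp is not None and info_dp > max_sample_dp * n_samples:
        failures.append("FILTER_MAX_SAMPLE_DP")
    if min_sample_dp is not None and info_dp < min_sample_dp * n_samples:
        failures.append("FILTER_MIN_SAMPLE_DP")
    if min_ns_frac is not None and info_ns < min_ns_frac * n_samples:
        failures.append("FILTER_MIN_NS_FRAC")
    if min_ns is not None and info_ns < min_ns:
        failures.append("FILTER_MIN_NS")
    return failures
```

With 10 samples and the defaults, a site with INFO:DP of 50 and INFO:NS of 4 fails only FILTER_MIN_NS_FRAC, because 4 is less than 0.50 * 10.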

Per variant filters that allow a range of values:

Values of these filters must be numbers (or a comma/space separated list of numbers)
Rules:
- Specifying 1 value in the filter will turn that filter on and use that value
- Specifying 2 values in the filter (separated by ',' and/or ' ') turns on the filter
  - Use the 1st value if the number of samples is below FILTER_FORMULA_MIN_SAMPLES
  - Use the 2nd value if the number of samples is above FILTER_FORMULA_MAX_SAMPLES
  - If the number of samples is between the MIN & MAX, a logscale is used:
    (minVal - maxVal) * (log(maxSamples) - log(numSamples)) / (log(maxSamples) - log(minSamples)) + maxVal
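
The rules above can be sketched in Python as follows. The function name is hypothetical, a blank setting is treated as "filter off" (returned as `None`), and sample counts exactly equal to the MIN or MAX setting are treated as the endpoints, which gives the same result as the formula at those points.

```python
import math
import re


def filter_value(setting, n_samples, min_samples=100, max_samples=1000):
    """Resolve a filter setting such as "70,65" or "20, 10" to one threshold."""
    values = [float(v) for v in re.split(r"[,\s]+", setting.strip()) if v]
    if not values:
        return None  # blank setting: filter is off
    if len(values) == 1:
        return values[0]
    min_val, max_val = values[0], values[1]
    if n_samples <= min_samples:
        return min_val
    if n_samples >= max_samples:
        return max_val
    # Log-scale interpolation between the two values, as in the formula above.
    frac = (math.log(max_samples) - math.log(n_samples)) / (
        math.log(max_samples) - math.log(min_samples))
    return (min_val - max_val) * frac + max_val
```

Because the formula uses a ratio of logarithm differences, the result does not depend on the logarithm base.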

Configuration settings for min/max # samples to determine filter value when the filter setting contains multiple values separated by ',' or ' '
Configuration Key	Description	Default Value
FILTER_FORMULA_MIN_SAMPLES	total number of samples < conf value, use the value before the ',' or ' '	100
FILTER_FORMULA_MAX_SAMPLES	total number of samples > conf value, use the value after the ',' or ' '	1000
total number of samples between min & max, use logscale

Filters
Filter	Configuration Key	VCF value checked	Filter Variants with...	Default Value	Conf Value Requirements
max Allele Balance in Heterozygotes	FILTER_MAX_ABL	INFO:AB	> conf value/100.0	70,65	< 100
max Strand Bias Pearson's Correlation	FILTER_MAX_STR	INFO:STR	> conf value/100.0	20, 10	< 100
min Strand Bias Pearson's Correlation	FILTER_MIN_STR	INFO:STR	< conf value/100.0	-20, -10	> -100
distance from known indel	FILTER_WIN_INDEL	position	distance from known indel < conf value	5	> 0
max Strand Bias z-score	FILTER_MAX_STZ	INFO:STZ	> conf value	5, 10	< INT_MAX
min Strand Bias z-score	FILTER_MIN_STZ	INFO:STZ	< conf value	-5, -10	> INT_MIN
max Alternate allele inflation score	FILTER_MAX_AOI	INFO:AOI	> conf value	5	< INT_MAX
min FIC	FILTER_MIN_FIC	INFO:FIC	< conf value/100.0	-20, -10	> INT_MIN
max Cycle Bias Pearson's correlation	FILTER_MAX_CBR	INFO:CBR	> conf value/100.0	20, 10	< 100
max LQR	FILTER_MAX_LQR	INFO:LQR	> conf value/100.0	30, 20	< 100
min phred-scaled quality score	FILTER_MIN_QUAL	QUAL	< conf value	5	> 0
min Root Mean Squared Mapping Quality	FILTER_MIN_MQ	INFO:MQ	< conf value	20	> 0
max Fraction of bases with mapQ=0	FILTER_MAX_MQ0	INFO:MQ0	> conf value/100.0	10	< 100
max Alternate allele quality z-score	FILTER_MAX_AOZ	INFO:AOZ	> conf value		< INT_MAX
max Ratio of base-quality inflation	FILTER_MAX_IOR	INFO:IOR	> conf value		< INT_MAX

Additional VCF Cooker filters:

If you want to add any additional VCF Cooker filters that don't already have a configuration item, you can do that by adding the vcfCooker command-line filter to GotCloud:

Configuration Key	Default Value
FILTER_ADDITIONAL

Additional Options

Configuration Key	Program Description	Default Value
SAMTOOLS_VIEW_FILTER	filter settings for samtools view (default filters by mapping quality and flag)	-q 20 -F 0x0704
NOBAQ_SUBSTRINGS	skip the BAQ step if the BAM filename contains the specified space-separated substrings	SOLID
BAM_DEPEND	set to true to rerun the pipeline if the BAM files are newer than previously run steps that use them	FALSE
MAKE_OPTS	set to add additional makefile options