Difference between revisions of "GotCloud: Variant Calling Options"
Latest revision as of 14:57, 11 February 2015
Required Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--outdir path | OUT_DIR | output directory | |
--list/--bam_list/--bamlist file | BAM_LIST | path to the BAM List File | $(OUT_DIR)/bam.list |
--numjobs # | | number of jobs to run in parallel | 0 (generate Makefile of steps, but do not run) |
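As a rough illustration of how the required options fit together, the Python sketch below assembles an argument list from them. Only the flags come from the table above; the `gotcloud snpcall` invocation, the helper name `build_snpcall_command`, and the example paths are assumptions made for illustration and may need adjusting for your installation.

```python
import shlex


def build_snpcall_command(out_dir, bam_list=None, num_jobs=0):
    """Assemble an argument list from the required options.

    Assumption: the pipeline is started as `gotcloud snpcall`; change the
    first two elements if your installation differs.
    """
    args = ["gotcloud", "snpcall", "--outdir", out_dir]
    if bam_list is not None:
        # When omitted, BAM_LIST defaults to $(OUT_DIR)/bam.list.
        args += ["--list", bam_list]
    # 0 only generates the Makefile of steps without running it.
    args += ["--numjobs", str(num_jobs)]
    return args


if __name__ == "__main__":
    # Hypothetical paths, shown as a shell-quoted command line.
    print(shlex.join(build_snpcall_command("out", "out/bam.list", 4)))
```

Returning a list rather than a single string keeps each argument separate, which avoids quoting problems if the list is later passed to `subprocess.run`.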
Common Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--conf file | | configuration file to use | |
Cluster Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--batchtype type | BATCH_TYPE | name of cluster type | local |
--batchopts opts | BATCH_OPTS | options to pass to the cluster command | |
--copyglf path | COPY_GLF | path to copy glfs to before processing them (path local to remote nodes, maybe in /tmp) |
Test/Debug Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--help | | print help information | |
--test path | | run the snpcall/ldrefine test and write output to the specified path | |
--verbose | | add additional messages when reading configuration | |
Reference/Resource Files
- See GotCloud: Genetic Reference and Resource Files for reference/resource file configuration settings
Analysis Region Options
See Targeted/Exome Sequencing Settings for more information on specifying exome/targeted regions and other settings.
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--chrs # # | CHRS | space separated list of chromosomes to process | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X |
--region #:#-# | | call region: skip regions of the chromosome outside of the specified region; format (-end is optional): chr:start-end | |
 | UNIT_CHUNK | chunk size of SNP calling (GotCloud breaks up each chromosome into regions of this size) | 5000000 |
 | LD_NSNPS | chunk size (number of SNPs) of genotype refinement | 10000 |
 | LD_OVERLAP | overlapping # of SNPs between chunks for genotype refinement | 1000 |
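To make the region format and the effect of UNIT_CHUNK concrete, here is a minimal Python sketch. It parses a `chr:start-end` string (with the `-end` part optional) and splits a range into chunks of at most UNIT_CHUNK bases. The function names are hypothetical, and the alignment of chunks to multiples of UNIT_CHUNK is an assumption; the boundaries GotCloud actually uses may differ.

```python
def parse_region(region):
    """Parse 'chr:start-end'; the '-end' part is optional."""
    chrom, _, span = region.partition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else None
    return chrom, start, end


def split_into_chunks(start, end, unit_chunk=5_000_000):
    """Yield (chunk_start, chunk_end) pairs, 1-based and inclusive.

    Assumption: chunks are aligned to multiples of unit_chunk.
    """
    chunk_start = start
    while chunk_start <= end:
        # End of the unit_chunk-aligned window that contains chunk_start.
        window_end = ((chunk_start - 1) // unit_chunk + 1) * unit_chunk
        chunk_end = min(window_end, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + 1


if __name__ == "__main__":
    chrom, start, end = parse_region("20:42900000-43200000")
    for chunk in split_into_chunks(start, end):
        print(chrom, *chunk)
```

A region smaller than UNIT_CHUNK that does not cross an aligned boundary yields a single chunk under this scheme.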
Chromosome X/Y Calling
For proper Chromosome X/Y calling, it is recommended to specify a PED file with sex information:
Configuration Key | Value Description |
---|---|
PED_INDEX | ped file containing sampleID (2nd column) and sex (5th column) |
Format of PED file:
familyID sampleID fatherID motherID sex
- Only sampleID and sex are used
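The following Python sketch shows one way to check a PED file before handing it to GotCloud, by extracting the two columns that are used. It assumes whitespace-separated columns and that blank lines or lines starting with `#` can be skipped; the function name and sample rows are hypothetical, and the sex codes in the example follow the common PED convention (1 = male, 2 = female), which this page does not state explicitly.

```python
def read_ped_sex(lines):
    """Map sampleID (2nd column) to sex (5th column) from PED lines.

    Assumption: columns are whitespace separated; blank lines and lines
    starting with '#' are skipped.
    """
    sex_by_sample = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) < 5:
            raise ValueError(f"expected at least 5 columns: {line!r}")
        sex_by_sample[fields[1]] = fields[4]
    return sex_by_sample


if __name__ == "__main__":
    # Hypothetical rows: familyID sampleID fatherID motherID sex
    example = [
        "FAM1 SAMPLE1 0 0 1",
        "FAM1 SAMPLE2 0 0 2",
    ]
    print(read_ped_sex(example))
```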
Targeted/Exome Sequencing Settings
If you are running Targeted/Exome Sequencing, you should specify:
Configuration Key | Value Description | |
---|---|---|
UNIFORM_TARGET_BED | Bed file of targeted regions (same bed for all samples) | |
MULTIPLE_TARGET_MAP | Filename of file mapping sample id -> bed file of targeted regions. Each line of the file contains: [SM_ID] [TARGET_BED] | |
OFFSET_OFF_TARGET | Number of bases by which to extend the target region (default is 0, do not extend the target region) | |
SAMTOOLS_VIEW_TARGET_ONLY | true: speeds up processing by excluding off-target regions initially when performing samtools view. false (default): off-target regions are not excluded when performing samtools view, but are excluded at a later step. Warning: You may not want to set this to true because it may: | |
WGS_SVM | whether or not to run SVM on the whole genome rather than by chromosome (default is by chromosome). Set to TRUE if you are running with a small number of target regions. |
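As a sketch of what MULTIPLE_TARGET_MAP and OFFSET_OFF_TARGET describe, the Python below reads `[SM_ID] [TARGET_BED]` lines into a dictionary and extends a BED interval by a number of bases on each side. The function names are hypothetical, and clamping the start at 0 is an assumption; how GotCloud treats intervals near chromosome ends is not stated here.

```python
def read_target_map(lines):
    """Map sample ID to BED path from '[SM_ID] [TARGET_BED]' lines."""
    target_by_sample = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            target_by_sample[fields[0]] = fields[1]
    return target_by_sample


def extend_target(start, end, offset=0):
    """Extend an interval by offset bases on each side.

    Assumption: the start is clamped at 0 so it never becomes negative.
    """
    return max(0, start - offset), end + offset


if __name__ == "__main__":
    # Hypothetical sample IDs and BED file names.
    print(read_target_map(["SAMPLE1 exomeA.bed", "SAMPLE2 exomeB.bed"]))
    print(extend_target(100, 200, offset=50))
```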
Path Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--makebasename name | MAKE_BASE_NAME | basename of the Makefile generated by GotCloud | umake |
--bamprefix prefix | BAM_PREFIX | path to prepend to relative BAM file paths in the BAM list | |
--refprefix prefix | REF_PREFIX | path to prepend to relative reference/resource file paths | |
--baseprefix prefix | BASE_PREFIX | path to prepend to relative paths for the BAM list file, PED_INDEX, BAM (if BAM_PREFIX isn't specified), reference/resource files (if REF_PREFIX isn't specified) | |
--refdir path | REF_DIR | value to use for REF_DIR key | $(GOTCLOUD_ROOT)/gotcloud.ref |
--gotcloudroot path | GOTCLOUD_ROOT | specify to use a different directory for finding GotCloud bins/scripts | based on the location of the gotcloud/umake.pl script |
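The precedence between the prefix options can be sketched as below. This is an illustrative Python model, not GotCloud's own code: `resolve_path` is a hypothetical helper, and joining prefix and path with a `/` separator is an assumption.

```python
import posixpath


def resolve_path(path, specific_prefix="", base_prefix=""):
    """Prepend a prefix to a relative path; absolute paths are unchanged.

    specific_prefix stands for BAM_PREFIX or REF_PREFIX. base_prefix
    stands for BASE_PREFIX and is used only when the specific prefix is
    not set.
    """
    if posixpath.isabs(path):
        return path
    prefix = specific_prefix or base_prefix
    return posixpath.join(prefix, path) if prefix else path


if __name__ == "__main__":
    # Hypothetical paths.
    print(resolve_path("bams/s1.bam", base_prefix="/data"))
    print(resolve_path("bams/s1.bam", "/bamroot", "/data"))
    print(resolve_path("/abs/s1.bam", "/bamroot", "/data"))
```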
Validation Adjustment Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--maxlocaljobs # | | maximum # of jobs that can run if batchtype is local (to prevent accidentally starting jobs locally that were meant to be on a cluster) | 10 |
--ignoresmcheck | IGNORE_SM_CHECK | disable the validation that the Sample name in the BAM file matches the one in the BAM list file |
Miscellaneous Options
Command-line Flag | Configuration Key | Value Description | Default Value |
---|---|---|---|
--nophonehome | | disable phonehome in GotCloud and the tools it calls | |
 | BAMUTIL_THINNING | thinning parameter for bamUtil programs (set to 0 if --nophonehome is specified) | --phoneHomeThinning 10 |
Directory Settings Options
These values set GotCloud output subdirectories (relative paths under the OUT_DIR directory). You should not need to change these from the defaults unless you want to use different sub-directory names.
Configuration Key | Value Description | Default Value |
---|---|---|
BAM_GLF_DIR | GLF outputs per BAM (if multiple BAMs per sample) (intermediate files) | glfs/bams |
SM_GLF_DIR | GLF outputs per sample (intermediate files) | glfs/samples |
VCF_DIR | unfiltered and filtered VCFs | vcfs |
PVCF_DIR | vcfPileup results (intermediate files) | pvcfs |
SPLIT_DIR | VCFs with PASS variants only & split into multiple files | split |
BEAGLE_DIR | beagle output | beagle |
SPLIT4_DIR | VCFs with PASS variants only & split into multiple files for running beagle4 | split4 |
BEAGLE4_DIR | beagle version 4 output | beagle4 |
THUNDER_DIR | thunder output | thunder |
TARGET_DIR | directory to store target information when running with a BED file | target |
GLF_INDEX | filename for index file needed for glfflex (file is created by GotCloud) | glfIndex.ped |
Tool Options
These values set the binaries GotCloud should use. You should not need to change these from the defaults unless you want to try a different version of one of the tools.
Some tools have their options specified together with the binary command, while others have them specified separately or hard-coded.
Configuration Key | Program Description | Default Value |
---|---|---|
SAMTOOLS_FOR_PILEUP | samtools to use for pileup | $(BIN_DIR)/samtools-hybrid |
SAMTOOLS_FOR_OTHERS | samtools to use for view and calmd | $(BIN_DIR)/samtools-hybrid |
GLFMERGE | merge glf files when there are multiple BAMs per individual | $(BIN_DIR)/glfMerge |
GLFFLEX | perform glf-based variant calling (replacement for glfMultiples) | $(BIN_DIR)/glfFlex --minMapQuality 0 --minDepth 1 --maxDepth 10000000 --uniformTsTv --smartFilter |
VCFPILEUP | vcfPileup to generate rich per-site information | $(BIN_DIR)/vcfPileup |
INFOCOLLECTOR | gather filtering statistics | $(BIN_DIR)/infoCollector |
VCFMERGE | merge multiple VCFs separated by chunk of genomes | perl $(SCRIPT_DIR)/bams2vcfMerge.pl |
VCFCOOKER | vcfCooker program for filtering | $(BIN_DIR)/vcfCooker |
VCFSUMMARY | script to generate summary statistics of discovered sites | perl $(SCRIPT_DIR)/vcf-summary |
VCFSPLIT | splits VCF into overlapping chunks for genotype refinement | perl $(SCRIPT_DIR)/vcfSplit.pl |
VCFSPLIT4 | splits VCF into overlapping chunks for beagle version 4 genotype refinement | perl $(SCRIPT_DIR)/vcfSplit4.pl |
VCF_SPLIT_CHROM | splits VCF into per chromosome VCFs | perl $(SCRIPT_DIR)/vcfSplitChr.pl |
VCFPASTE | generate filtered genotype VCF | perl $(SCRIPT_DIR)/vcfPaste.p |
BEAGLE | beagle program | java -Xmx4g -jar $(BIN_DIR)/beagle.20101226.jar seed=993478 gprobs=true niterations=50 lowmem=true |
BEAGLE4 | beagle version 4 program | java -Xmx4g -jar $(BIN_DIR)/b4.r1219.jar seed=993478 gprobs=true |
VCF2BEAGLE | convert VCF (with PL tag) into beagle input | perl $(SCRIPT_DIR)/vcf2Beagle.pl --PL |
BEAGLE2VCF | convert beagle output to VCF | perl $(SCRIPT_DIR)/beagle2Vcf.pl |
SVM_SCRIPT | SVM script | perl $(SCRIPT_DIR)/run_libsvm.pl |
SVMLEARN | SVM program | $(BIN_DIR)/svm-train |
SVMCLASSIFY | SVM program | $(BIN_DIR)/svm-predict |
INVNORM | SVM program | $(BIN_DIR)/invNorm |
THUNDER_STATES | flags for thunder states and weighted states | --states 400 --weightedStates 300 |
THUNDER | MaCH/Thunder genotype refinement step | $(BIN_DIR)/thunderVCF -r 30 --phase --dosage --compact --inputPhased $(THUNDER_STATES) |
LIGATEVCF | ligate multiple phased VCFs while resolving the phase between VCFs | perl $(SCRIPT_DIR)/ligateVcf.pl |
LIGATEVCF4 | ligate multiple phased VCFs while resolving the phase between VCFs | perl $(SCRIPT_DIR)/ligateVcf4.pl |
VCFCAT | concatenate multiple VCFs | perl $(SCRIPT_DIR)/vcfCat.pl |
BGZIP | bgzip program | $(BIN_DIR)/bgzip |
TABIX | tabix program | $(BIN_DIR)/tabix |
BAMUTIL | bam util program | $(BIN_DIR)/bam |
Options
Configuration Key | Program Description | Default Value | |
---|---|---|---|
SLEEP_MULT | add sleep time prior to some steps; use only if too many steps are starting at the same time doing the same thing | 0 | |
REMOTE_PREFIX | add a prefix to paths when sending across to a remote machine |
GlfFlex Options
Configuration Key | Program Description | Default Value |
---|---|---|
WGS_SVM | whether or not to run SVM on the whole genome rather than by chromosome (default is by chromosome). Set to TRUE if you are running with a small number of target regions. | |
VCF_EXTRACT | position file to use for glfFlex | |
MODEL_GLFSINGLE | set to TRUE if glfSingle model should be used for glfFlex | |
MODEL_SKIP_DISCOVER | set to true to disable variant discovery for glfFlex | |
MODEL_AF_PRIOR | set to true to use AF prior for genotyping for glfFlex |
SVM Filtering Options
Configuration Key | Program Description | Default Value |
---|---|---|
POS_SAMPLE | percentage of positive samples used for training | 100 |
NEG_SAMPLE | percentage of negative samples used for training | 100 |
SVM_CUTOFF | SVM score cutoff for PASS/FAIL | 0 |
USE_SVMMODEL | whether to use pre-trained model for SVM filtering | FALSE |
SVMMODEL | pre-trained model file (if USE_SVMMODEL is set to TRUE) |
Hard Filtering Options
These options set the values to use when applying hard filters.
- To remove any filter, set it to blank in your configuration file
For additional hard filter information, see: GotCloud: Filters
Basic per variant filters:
Filter | Configuration Key | VCF value checked | Filter Variants with... | Default Value |
---|---|---|---|---|
max depth | FILTER_MAX_SAMPLE_DP | INFO:DP | > conf value * total number of samples | 1000 |
min depth | FILTER_MIN_SAMPLE_DP | INFO:DP | < conf value * total number of samples | 1 |
number of samples with coverage | FILTER_MIN_NS_FRAC | INFO:NS | < conf value * total number of samples | .50 |
number of samples with coverage | FILTER_MIN_NS | INFO:NS | < conf value | |
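The comparisons in the table above can be sketched in Python as follows. This is an illustrative model only: `basic_filter_failures` is a hypothetical helper, `None` stands in for a filter that has been set to blank, and the returned strings are the configuration keys rather than the labels written to the VCF.

```python
def basic_filter_failures(dp, ns, num_samples,
                          max_sample_dp=1000, min_sample_dp=1,
                          min_ns_frac=0.50, min_ns=None):
    """Return the configuration keys of the basic filters a variant fails.

    dp and ns are the site's total depth and number of samples with
    coverage. A threshold of None models a filter that is left blank.
    """
    failures = []
    if max_sample_dp is not None and dp > max_sample_dp * num_samples:
        failures.append("FILTER_MAX_SAMPLE_DP")
    if min_sample_dp is not None and dp < min_sample_dp * num_samples:
        failures.append("FILTER_MIN_SAMPLE_DP")
    if min_ns_frac is not None and ns < min_ns_frac * num_samples:
        failures.append("FILTER_MIN_NS_FRAC")
    if min_ns is not None and ns < min_ns:
        failures.append("FILTER_MIN_NS")
    return failures


if __name__ == "__main__":
    # 100 samples: depth 50 is below 1 * 100, and 40 samples is below 0.50 * 100.
    print(basic_filter_failures(dp=50, ns=40, num_samples=100))
```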
Per variant filters that allow a range of values:
- values of these filters must be numbers (or comma/space separated list of numbers)
- Rules:
- Specifying 1 value in the filter will turn that filter on and use that value
- Specifying 2 values in the filter (separated by ',' and/or ' ') turns on the filter
- Use the 1st value if the number of samples is below FILTER_FORMULA_MIN_SAMPLES
- Use the 2nd value if the number of samples is above FILTER_FORMULA_MAX_SAMPLES
- If the number of samples is between the MIN & MAX, a logscale is used:
(minVal - maxVal) * (log(maxSamples) - log(numSamples)) / (log(maxSamples) - log(minSamples)) + maxVal
Configuration settings for the min/max number of samples used to determine the filter value when the filter setting contains multiple values separated by ',' or ' ':
Configuration Key | Description | Default Value |
---|---|---|
FILTER_FORMULA_MIN_SAMPLES | total number of samples < conf value, use the value before the ',' or ' ' | 100 |
FILTER_FORMULA_MAX_SAMPLES | total number of samples > conf value, use the value after the ',' or ' ' | 1000 |
 | total number of samples between min & max, use logscale | |
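The rules for one-value and two-value filter settings can be worked through with a short Python sketch. The function names are hypothetical, and the treatment of a sample count exactly equal to the min or max is an assumption (the formula gives the same result at those points). Because the formula uses a ratio of logarithms, the choice of log base does not matter.

```python
import math
import re


def parse_filter_setting(text):
    """Split a setting such as '70,65' or '20, 10' into numbers."""
    return [float(part) for part in re.split(r"[,\s]+", text.strip()) if part]


def filter_value(values, num_samples, min_samples=100, max_samples=1000):
    """Pick the threshold for a filter given the number of samples.

    values holds one number, or two numbers (minVal, maxVal) as described
    in the rules above.
    """
    if len(values) == 1:
        return float(values[0])
    min_val, max_val = values
    if num_samples <= min_samples:
        return float(min_val)
    if num_samples >= max_samples:
        return float(max_val)
    scale = (math.log(max_samples) - math.log(num_samples)) / (
        math.log(max_samples) - math.log(min_samples))
    return (min_val - max_val) * scale + max_val


if __name__ == "__main__":
    setting = parse_filter_setting("70,65")
    for n in (50, 300, 5000):
        print(n, filter_value(setting, n))
```

With the setting `70,65` and the default sample bounds, 50 samples gives 70 and 5000 samples gives 65; the value falls halfway between them (67.5) at the geometric mean of the two bounds, roughly 316 samples.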
Filters
Filter | Configuration Key | VCF value checked | Filter Variants with... | Default Value | Conf Value Requirements |
---|---|---|---|---|---|
max Allele Balance in Heterozygotes | FILTER_MAX_ABL | INFO:AB | > conf value/100.0 | 70,65 | < 100 |
max Strand Bias Pearson's Correlation | FILTER_MAX_STR | INFO:STR | > conf value/100.0 | 20, 10 | < 100 |
min Strand Bias Pearson's Correlation | FILTER_MIN_STR | INFO:STR | < conf value/100.0 | -20, -10 | > -100 |
distance from known indel | FILTER_WIN_INDEL | position | distance from known indel < conf value | 5 | > 0 |
max Strand Bias z-score | FILTER_MAX_STZ | INFO:STZ | > conf value | 5, 10 | < INT_MAX |
min Strand Bias z-score | FILTER_MIN_STZ | INFO:STZ | < conf value | -5, -10 | > INT_MIN |
max Alternate allele inflation score | FILTER_MAX_AOI | INFO:AOI | > conf value | 5 | < INT_MAX |
min FIC | FILTER_MIN_FIC | INFO:FIC | < conf value/100.0 | -20, -10 | > INT_MIN |
max Cycle Bias Pearson's correlation | FILTER_MAX_CBR | INFO:CBR | > conf value/100.0 | 20, 10 | < 100 |
max LQR | FILTER_MAX_LQR | INFO:LQR | > conf value/100.0 | 30, 20 | < 100 |
min phred-scaled quality score | FILTER_MIN_QUAL | QUAL | < conf value | 5 | > 0 |
min Root Mean Squared Mapping Quality | FILTER_MIN_MQ | INFO:MQ | < conf value | 20 | > 0 |
max Fraction of bases with mapQ=0 | FILTER_MAX_MQ0 | INFO:MQ0 | > conf value/100.0 | 10 | < 100 |
max Alternate allele quality z-score | FILTER_MAX_AOZ | INFO:AOZ | > conf value | | < INT_MAX |
max Ratio of base-quality inflation | FILTER_MAX_IOR | INFO:IOR | > conf value | | < INT_MAX |
Additional VCF Cooker filters:
- If you want to add any additional VCF Cooker filters that don't already have a configuration item, you can do that by adding the vcfCooker command-line filter to GotCloud:
Configuration Key | Default Value |
---|---|
FILTER_ADDITIONAL |
Additional Options
Configuration Key | Program Description | Default Value |
---|---|---|
SAMTOOLS_VIEW_FILTER | filter settings for samtools view (default filters by mapping quality and flag) | -q 20 -F 0x0704 |
NOBAQ_SUBSTRINGS | skip the BAQ step if the BAM filename contains the specified space-separated substrings | SOLID |
BAM_DEPEND | set to true to rerun the pipeline if the BAM files are newer than previously run steps that use them | FALSE |
MAKE_OPTS | set to add additional makefile options |