GotCloud: Variant Calling Pipeline

From Genome Analysis Wiki
Jump to navigationJump to search

Back to parent: GotCloud

The Variant Calling Pipeline (previously called 'UMAKE') makes genotype calls from recalibrated BAM files. These genotype calls are output into VCF (Variant Call Format) files.


Running the GotCloud Variant Calling Pipeline

The variant calling pipeline (umake) is run using gotcloud snpcall and gotcloud ldrefine.

Running the Automatic Test

The automatic test runs the variant calling pipeline on a small test set and checks the results against expected results validating that GotCloud is installed correctly.

  • Run snpcall pipeline test:
gotcloud snpcall --test OUTPUT_DIR
    • Where OUTPUT_DIR is the directory where you want to store the test results
    • If you see Successfully ran the test case, congratulations!, then you are ready to run snpcall on your own samples.
  • Run ldrefine pipeline test:
gotcloud ldrefine --test OUTPUT_DIR
    • Where OUTPUT_DIR is the directory where you want to store the test results
    • If you see Successfully ran the test case, congratulations!, then you are ready to run ldrefine on your own samples.

Overview of Variant Calling Pipeline Steps

Here is an overview of the Variant Calling Pipeline:

 


For more information on the filters applied during the Variant Calling Pipeline, see, GotCloud: Filters.

Input Data

BAM Files

The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is automatically done as part of the Alignment Pipeline of GotCloud.

BAM List File

  • Automatically created when running the GotCloud Alignment Pipeline
  • Each line of the BAM list file represents a single individual

Columns:

  1. sample id
  2. comma separated population labels (optional column)
  3. BAM File 1 (preferable to have full paths to BAM files)
  4. BAM File 2 (if more than 1 BAM per sample)
...
# BAM File N (if more than 1 BAM per sample)
[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...

or

[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...
  • Notes:
    • tab delimited
    • multiple BAMs per individual may be provided, but should all be on the same line of the list file
    • population label is optional - it will default to ALL
      • only used by Thunder (part of ldrefine pipeline)
      • if all samples are from the same population, population label can be skipped or you can just specify ALL for the population label for each sample.

The path to the BAM List file is defaulted to the outputDirectory/bam.list. It can be overridden by setting --bamlist, --bam_list, or --list on the command-line or by setting BAM_LIST in your configuration file to the path to the BAM List File. See Required Options for more information.

Reference Files

See GotCloud: Genetic Reference and Resource Files for detailed information about the multiple required reference files for the variant calling pipeline, including:

  • How to obtain default references
  • Configuration keys & default values
  • How to generate your own references
  • How to point GotCloud to your reference files

Required Reference File Types:

Configuration File

The GotCloud configuration file contains the run-time options, including software binaries and command line arguments. A default configuration file is automatically loaded. Users may specify their own configuration file specifying just the values different than the defaults. The configuration file is not required if there are no values to override.

  • Default GotCloud configuration file is gotcloud/bin/gotcloudDefaults.conf
  • Comments begin with a #
  • Format: KEY = value
    • where KEY is the item being set and value is its new value
  • Some settings can be defined both in the configuration file and on the GotCloud command-line
    • command-line options take priority over configuration file settings
  • A KEY can be used in another KEY's value by specifying $(KEY)
    • Example:
      KEY1 = value1
      KEY2 = $(KEY1)/value2
      • When KEY2 is used, it will be equal to: value1/value2

Output Directory

  • The output directory is required for running GotCloud, so GotCloud knows where to write its output
Configuration Key Command-line Flag Value Description
OUT_DIR --outdir output directory

Reference/Resource Files

Cluster Configuration

Regardless of the type of cluster system used, GotCloud will wait for each job to complete after launching it.

  • For any BATCH_TYPEs that run in batch mode, GotCloud generates a script that will wait until the step is complete before returning
    • In a sense, it "fakes" interactive mode for all batch types since it will not proceed until a command is finished
  • If you are at UM and are using flux, you can specify either flux or pbs
Configuration Key Command-line Flag Value Description
BATCH_TYPE --batchtype type of cluster system
Valid Values Command to Launch Command to Check for Completion
mosix mosbatch -E/tmp N/A - interactive type
sge qsub qstat -u $USER
sgei qrsh -now n N/A - interactive type
pbs qsub qstat -u $USER
slurm sbatch squeue -u $USER
slurmi N/A - interactive type
local N/A - local command N/A - interactive type
BATCH_OPTS --batchopts options to pass to your cluster type, example:
-j36,37,38,39,40,41,45,46,47,48,49

See Variant Calling Command-line Options/Configuration Settings for more information on Configuration options.

Example Configuration File

Example configuration file where reference files happen to be stored in /path/reference, and bam index file in path/freeze5

CHRS = 20 22
BAM_LIST = /path/freeze5.bam.list
OUT_DIR = /path/freeze5/output
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
INDEL_PREFIX = $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz


Variant Calling Command-line Options/Configuration Settings

Required Options

Command-line Flag Configuration Key Value Description Default Value
--outdir path OUT_DIR output directory
--list/--bam_list/--bamlist file BAM_LIST path to the BAM List File $(OUT_DIR)/bam.list
--numjobs # number of jobs to run in parallel 0 (generate Makefile of steps, but do not run)

Common Options

Common Options
Command-line Flag Configuration Key Value Description Default Value
--conf file configuration file to use

Cluster Options

Command-line Flag Configuration Key Value Description Default Value
--batchtype type BATCH_TYPE name of cluster type local
--batchopts opts BATCH_OPTS options to pass to the cluster command
--copyglf path COPY_GLF path to copy glfs to before processing them (path local to remote nodes, maybe in /tmp)

Test/Debug Options

Command-line Flag Configuration Key Value Description Default Value
--help print help information
--test path run the snpcall/ldrefine test and write output to the specified path
--verbose Add additional messages when reading configuration

Reference/Resource Files

Analysis Region Options

See Targeted/Exome Sequencing Settings for more information on specifying exome/targetted regions and other settings.

Command-line Flag Configuration Key Value Description Default Value
--chrs # # CHRS pace separated list of chromosomes to process 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
--region #:#-# call region - skip regions of chromosome outside of specified region

format (-end is optional): chr:start-end

UNIT_CHUNK chunk size of SNP calling (GotCloud breaks up each chromosome into regions of this size) 5000000
LD_NSNPS chunk size (number of SNPs) of genotype refinement 10000
LD_OVERLAP overlapping # of SNPs between chunks for genotype refinement 1000

Chromosome X/Y Calling

For proper Chromosome X/Y calling, it is recommended to specify a PED file with sex information:

Configuration Key Value Description
PED_INDEX ped file containing sampleID (2nd column) and sex (5th column)

Format of PED file:

familyID sampleID fatherID motherID sex
  • Only sampleID and sex are used

Targeted/Exome Sequencing Settings

If you are running Targeted/Exome Sequencing, the user should specify:

Configuration Key Value Description
UNIFORM_TARGET_BED Bed file of targeted regions (same bed for all samples)
MULTIPLE_TARGET_MAP Filename of file mapping: sample id -> bed file of targeted regions

Each line of the file contains: [SM_ID] [TARGET_BED]

OFFSET_OFF_TARGET Number of bases by which to extend the target region

(default is 0, do not extend the target region)

SAMTOOLS_VIEW_TARGET_ONLY true: speeds up processing by excluding off-target regions initially when performing samtools view

false (default): off-target regions are not excluded when performing samtools view, but are excluded at a later step

Warning: You may not want to set this to true due to it may:

WGS_SVM whether or not to run SVM on the whole genome rather than by chromosome (default is by chromosome). Set to TRUE if you are running with a small number of target regions.

Path Options

Command-line Flag Configuration Key Value Description Default Value
--makebasename name MAKE_BASE_NAME basename of the Makefile generated by GotCloud umake
--bamprefix prefix BAM_PREFIX path to prepend to relative BAM file paths in the BAM list
--refprefix prefix REF_PREFIX path to prepend to relative reference/resource file paths
--baseprefix prefix BASE_PREFIX path to prepend to relative paths for the BAM list file, PED_INDEX, BAM (if BAM_PREFIX isn't specified), reference/resource files (if REF_PREFIX isn't specified)
--refdir path REF_DIR value to use for REF_DIR key $(GOTCLOUD_ROOT)/gotcloud.ref
--gotcloudroot path GOTCLOUD_ROOT specify to use a different directory for finding GotCloud bins/scripts based on the location of the gotcloud/umake.pl script

Validation Adjustment Options

Command-line Flag Configuration Key Value Description Default Value
--maxlocaljobs # maximum # of jobs that can run if batchtype is local (to prevent accidentally starting jobs locally that were meant to be on a cluster) 10
--ignoresmcheck IGNORE_SM_CHECK disable the validation that the Sample name in the BAM file matches the one in the BAM list file

Miscellaneous Options

Command-line Flag Configuration Key Value Description Default Value
--nophonehome disable phonehome in GotCloud and the tools it calls
BAMUTIL_THINNING thinning parameter for bamUtil programs (will be set to 0 - if --nophonehome is specified) --phoneHomeThinning 10


Directory Settings Options

These values set GotCloud output subdirectories (relative paths under the OUT_DIR directory). You should not need to change these from the defaults unless you want to use different sub-directory names.

Configuration Key Value Description Default Value
BAM_GLF_DIR GLF outputs per BAM (if multiple BAMs per sample) (intermediate files) glfs/bams
SM_GLF_DIR GLF outputs per sample (intermediate files) glfs/samples
VCF_DIR unfiltered and filtered VCFs vcfs
PVCF_DIR vcfPileup results (intermediate files) pvcfs
SPLIT_DIR VCFs with PASS variants only & split into multiple files split
BEAGLE_DIR beagle output beagle
SPLIT4_DIR VCFs with PASS variants only & split into multiple files for running beagle4 split4
BEAGLE4_DIR beagle version 4 output beagle4
THUNDER_DIR thunder output thunder
TARGET_DIR directory to store target information when running with a BED file target
GLF_INDEX filename for index file needed for glfflex (file is created by GotCloud) glfIndex.ped

Tool Options

These values set the binaries GotCloud should use. You should not need to change these from the defaults unless you want to try a different version of one of the tools.

Some tools have the options specified with the binary command, while others have them separate or hard coded

Configuration Key Program Description Default Value
SAMTOOLS_FOR_PILEUP samtools to use for pileup $(BIN_DIR)/samtools-hybrid
SAMTOOLS_FOR_OTHERS samtools to use for view and calmd $(BIN_DIR)/samtools-hybrid
GLFMERGE merge glf files when there are multiple BAMs per indvidual $(BIN_DIR)/glfMerge
GLFFLEX perform glf-based variant calling (replacement for glfMultiples) $(BIN_DIR)/glfFlex --minMapQuality 0 --minDepth 1 --maxDepth 10000000 --uniformTsTv --smartFilter
VCFPILEUP vcfPileup to generate rich per-site information $(BIN_DIR)/vcfPileup
INFOCOLLECTOR gather filtering statistics $(BIN_DIR)/infoCollector
VCFMERGE merge multiple VCFs separated by chunk of genomes perl $(SCRIPT_DIR)/bams2vcfMerge.pl
VCFCOOKER vcfCooker program for filtering $(BIN_DIR)/vcfCooker
VCFSUMMARY script to generate summary statistics of discovered sites perl $(SCRIPT_DIR)/vcf-summary
VCFSPLIT splits VCF into overlapping chunks for genotype refinement perl $(SCRIPT_DIR)/vcfSplit.pl
VCFSPLIT4 splits VCF into overlapping chunks for beagle version 4 genotype refinement perl $(SCRIPT_DIR)/vcfSplit4.pl
VCF_SPLIT_CHROM splits VCF into per chromosome VCFs perl $(SCRIPT_DIR)/vcfSplitChr.pl
VCFPASTE generate filtered genotype VCF perl $(SCRIPT_DIR)/vcfPaste.p
BEAGLE beagle program java -Xmx4g -jar $(BIN_DIR)/beagle.20101226.jar seed=993478 gprobs=true niterations=50 lowmem=true
BEAGLE4 beagle version 4 program java -Xmx4g -jar $(BIN_DIR)/b4.r1219.jar seed=993478 gprobs=true
VCF2BEAGLE convert VCF (with PL tag) into beagle input perl $(SCRIPT_DIR)/vcf2Beagle.pl --PL
BEAGLE2VCF convert beagle output to VCF perl $(SCRIPT_DIR)/beagle2Vcf.pl
SVM_SCRIPT SVM script perl $(SCRIPT_DIR)/run_libsvm.pl
SVMLEARN SVM program $(BIN_DIR)/svm-train
SVMCLASSIFY SVM program $(BIN_DIR)/svm-predict
INVNORM SVM program $(BIN_DIR)/invNorm
THUNDER_STATES flags for thunder states and weighted states --states 400 --weightedStates 300
THUNDER MaCH/Thunder genotype refinement step $(BIN_DIR)/thunderVCF -r 30 --phase --dosage --compact --inputPhased $(THUNDER_STATES)
LIGATEVCF ligate multiple phased VCFs while resolving the phase between VCFs perl $(SCRIPT_DIR)/ligateVcf.pl
LIGATEVCF4 ligate multiple phased VCFs while resolving the phase between VCFs perl $(SCRIPT_DIR)/ligateVcf4.pl
VCFCAT concatenate multiple VCFs perl $(SCRIPT_DIR)/vcfCat.pl
BGZIP bgzip program $(BIN_DIR)/bgzip
TABIX tabix program $(BIN_DIR)/tabix
BAMUTIL bam util program $(BIN_DIR)/bam

Options

Configuration Key Program Description Default Value
SLEEP_MULT add sleep time prior to some steps; use only if too many steps are starting at the same time doing the same thing 0
REMOTE_PREFIX add a prefix to paths when sending across to a remote machine

GlfFlex Options

Configuration Key Program Description Default Value
WGS_SVM whether or not to run SVM on the whole genome rather than by chromosome (default is by chromosome). Set to TRUE if you are running with a small number of target regions.
VCF_EXTRACT position file to use for glfFlex
MODEL_GLFSINGLE set to TRUE if glfSingle model should be used for glfFlex
MODEL_SKIP_DISCOVER set to true to disable variant discovery for glfFlex
MODEL_AF_PRIOR set to true to use AF prior for genotyping for glfFlex


SVM Filtering Options

Configuration Key Program Description Default Value
POS_SAMPLE percentage of positive samples used for training 100
NEG_SAMPLE percentage of negative samples used for training 100
SVM_CUTOFF SVM score cutoff for PASS/FAIL 0
USE_SVMMODEL whether to use pre-trained model for SVM filtering FALSE
SVMMODEL pre-trained model file (if USE_SVMMODEL is set to TRUE)


Hard Filtering Options

These options set the values to use when applying hard filters.

  • To remove any filter, set it to blank in your configuration file

For additional hard filter information, see: GotCloud: Filters

Basic per variant filters:

Filter Configuration Key VCF value checked Filter Variants with... Default Value
max depth FILTER_MAX_SAMPLE_DP INFO:DP > conf value * total number of samples 1000[[
min depth FILTER_MIN_SAMPLE_DP < conf value * total number of samples 1
number of samples with coverage FILTER_MIN_NS_FRAC INFO:NS < conf value * total number of samples .50
FILTER_MIN_NS < conf value


Per variant filters that allow a range of values:

  • values of these filters must be numbers (or comma/space separated list of numbers)
  • Rules:
    • Specifying 1 value in the filter will turn that filter on and use that value
    • Specifying 2 values in the filter (separated by ',' and/or ' ') turns on the filter
      • Use the 1st value if the number of samples is below FILTER_FORMULA_MIN_SAMPLES
      • Use the 2nd value if the number of samples is above FILTER_FORMULA_MAX_SAMPLES
      • If the number of samples is between the MIN & MAX, a logscale is used:
         (minVal - maxVal) * (log(maxSamples) - log(numSamples)) / (log(maxSamples) - log(minSamples)) + maxVal
Configuration settings for min/max # samples to determine filter value when the filter setting contains multiple values separated by ',' or ' '
Configuration Key Description Default Value
FILTER_FORMULA_MIN_SAMPLES total number of samples < conf value, use the value before the ',' or ' ' 100
FILTER_FORMULA_MAX_SAMPLES total number of samples > conf value, use the value after the ',' or ' ' 1000
total number of samples between min & max, use logscale
Filters
Filter Configuration Key VCF value checked Filter Variants with... Default Value Conf Value Requirements
max Allele Balance in Heterozygotes FILTER_MAX_ABL INFO:AB > conf value/100.0 70,65 < 100
max Strand Bias Pearson's Correlation FILTER_MAX_STR INFO:STR > conf value/100.0 20, 10 < 100
min Strand Bias Pearson's Correlation FILTER_MIN_STR < conf value/100.0 -20, -10 > -100
distance from known indel FILTER_WIN_INDEL position distance from known indel < conf value 5 > 0
max Strand Bias z-score FILTER_MAX_STZ INFO:STZ > conf value 5, 10 < INT_MAX
min Strand Bias z-score FILTER_MIN_STZ < conf value -5, -10 > INT_MIN
max Alternate allele inflation score FILTER_MAX_AOI INFO:AOI > conf value 5 < INT_MAX
min FIC FILTER_MIN_FIC INFO:FIC < conf value/100.0 -20, -10 > INT_MIN
max Cycle Bias Peason's correlation FILTER_MAX_CBR INFO:CBR > conf value/100.0 20, 10 < 100
max LQR FILTER_MAX_LQR INFO:LQR > conf value/100.0 30, 20 < 100
min pred-scaled quality score FILTER_MIN_QUAL QUAL < conf value 5 > 0
min Root Mean Squared Mapping Quality FILTER_MIN_MQ INFO:MQ < conf value 20 > 0
max Fraction of bases with mapQ=0 FILTER_MAX_MQ0 INFO:MQ0 > conf value/100.0 10 < 100
max Alternate allele quality z-score FILTER_MAX_AOZ INFO:AOZ > conf value < INT_MAX
max Ratio of base-quality inflation FILTER_MAX_IOR INFO:IOR > conf value < INT_MAX


Additional VCF Cooker filters:

  • If you want to add any additional VCF Cooker filters that don't already have a configuration item, you can do that by adding the vcfCooker command-line filter to GotCloud:
Configuration Key Default Value
FILTER_ADDITIONAL

Additional Options

Configuration Key Program Description Default Value
SAMTOOLS_VIEW_FILTER filter settings for samtools view (default filters by mapping quality and flag) -q 20 -F 0x0704
NOBAQ_SUBSTRINGS skip the BAQ step if the BAM filename contains the specified space-separated substrings SOLID
BAM_DEPEND set to true to rerun the pipeline if the BAM files are newer than previously run steps that use them FALSE
MAKE_OPTS set to add additional makefile options


Use Cases & Recommended Settings

Single Sample Processing

To run single sample processing we recommend adding the following settings to your configuration file:

UNIT_CHUNK = 20000000
MODEL_GLFSINGLE = TRUE
MODEL_SKIP_DISCOVER = FALSE
MODEL_AF_PRIOR = TRUE
VCF_EXTRACT = $(REF_DIR)/snpOnly.vcf.gz
EXT = $(REF_DIR)/ALL.chrCHR.phase3.combined.sites.unfiltered.vcf.gz $(REF_DIR)/chrCHR.filtered.sites.vcf.gz

Explanation of these settings:

  • UNIT_CHUNK - since this is only 1 sample, process larger regions at a time than default
  • MODEL_GLFSINGLE - single sample, so model glfsingle
  • MODEL_SKIP_DISCOVER - do not skip the variant discovery step
  • MODEL_AF_PRIOR - use AF prior for genotyping
  • VCF_EXTRACT - VCF file to use for extracting the site information to genotype
    • This file is included in the latest reference release: hs37d5-db142
  • EXT - VCF reference files to use for the external filtering
    • These files are included in the latest reference release: hs37d5-db142


Running

Running variant calling is straightforward:

gotcloud snpcall --conf vc.conf --numjobs 2
gotcloud ldrefine --conf vc.conf --numjobs 2

  • Replace vc.conf with the path/name of the user's configuration file
    • If you are not overriding any defaults, you can alternatively specify --list path/bam.list replacing path/bam.list with the path/name of your BAM list file.
  • Replace 2 following --numjobs with the number of jobs to be run in parallel
  • If OUT_DIR is not defined in the configuration file, add --outdir followed by the path to the user's desired output directory.

Running on a Cluster

See Cluster Configuration for information on how to configure GotCloud to run on a cluster.

Results

If there is a failure, you should see a message like:

make: *** [...] Error 1

Where ... is filled in with other text indicating what step failed.

On SNP Call success, you should see the following output sub-directories under your output directory:

  • glfs with a bams & samples subdirectory
  • pvcfs with a subdirectory per chromosome and then per region
  • split with a subdirectory per chromosome
  • vcfs with a subdirectory per chromosome
  • (optionally your target directory)

Under the vcf/chrXX directory, there should be:

  • chrXX.filtered.sites.vcf
  • chrXX.filtered.sites.vcf.norm.log
  • chrXX.filtered.sites.vcf.summary
  • chrXX.filtered.vcf.gz - final filtered variant call file
  • chrXX.filtered.vcf.gz.OK
  • chrXX.filtered.vcf.gz.tbi
  • chrXX.hardfiltered.sites.vcf
  • chrXX.hardfiltered.sites.vcf.log
  • chrXX.hardfiltered.sites.vcf.summary
  • chrXX.hardfiltered.vcf.gz
  • chrXX.hardfiltered.vcf.gz.OK
  • chrXX.hardfiltered.vcf.gz.tbi
  • chrXX.merged.sites.vcf
  • chrXX.merged.stats.vcf
  • chrXX.merged.vcf
  • chrXX.merged.vcf.OK

The .merged.vcf is the merged together versions of the separate regions in the same chromosome.

The filtered is the merged.vcf after it has been run through filters and is marked with PASS/FAIL.

Under the split/chrXX directory, there should be:

  • chrXX.filtered.PASS.split.[N].vcf.gz
  • chrXX.filtered.PASS.split.err
  • chrXX.filtered.PASS.split.vcflist
  • chrXX.filtered.PASS.gz - final variant call file with only PASS variants
  • subset.OK