Difference between revisions of "GotCloud: Variant Calling Pipeline"

Revision as of 18:30, 23 October 2014

Back to parent: GotCloud

The Variant Calling Pipeline (previously called 'UMAKE') makes genotype calls from recalibrated BAM files. These genotype calls are output into VCF (Variant Call Format) files.

Running the GotCloud Variant Calling Pipeline

The variant calling pipeline (umake) is run using gotcloud snpcall and gotcloud ldrefine.

Running the Automatic Test

The automatic test runs the variant calling pipeline on a small test set and checks the results against expected results validating that GotCloud is installed correctly.

Run snpcall pipeline test:

gotcloud snpcall --test OUTPUT_DIR

- Where OUTPUT_DIR is the directory where you want to store the test results
- If you see Successfully ran the test case, congratulations!, then you are ready to run snpcall on your own samples.
Run ldrefine pipeline test:

gotcloud snpcall --test OUTPUT_DIR

- Where OUTPUT_DIR is the directory where you want to store the test results
- If you see Successfully ran the test case, congratulations!, then you are ready to run ldrefine on your own samples.

Overview of Variant Calling Pipeline Steps

Here is an overview of the Variant Calling Pipeline:

For more information on the filters applied during the Variant Calling Pipeline, see, GotCloud: Filters.

Input Data

Aligned/Processed/Recalibrated BAM files
BAM list file containing Sample IDs & BAM file names
Reference files
(Optional) Configuration file to override default options

BAM Files

The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is automatically done as part of the Alignment Pipeline of GotCloud.

BAM List File

Automatically created when running the GotCloud Alignment Pipeline
Each line of the BAM list file represents a single individual

Columns:

sample id
comma separated population labels (optional column)
BAM File 1 (preferable to have full paths to BAM files)
BAM File 2 (if more than 1 BAM per sample)

...

# BAM File N (if more than 1 BAM per sample)

[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...

or

[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...

Notes:
- tab delimited
- multiple BAMs per individual may be provided, but should all be on the same line of the list file
- population label is optional - it will default to ALL
  - only used by Thunder (part of ldrefine pipeline)
  - if all samples are from the same population, population label can be skipped or you can just specify ALL for the population label for each sample.

Reference Files

See GotCloud: Genetic Reference and Resource Files for detailed information about the multiple required reference files for the variant calling pipeline, including:

How to obtain default references
Configuration keys & default values
How to generate your own references
How to point GotCloud to your reference files

Required Reference File Types:

Configuration File

The GotCloud configuration file contains the run-time options, including software binaries and command line arguments. A default configuration file is automatically loaded. Users may specify their own configuration file specifying just the values different than the defaults. The configuration file is not required if there are no values to override.

Default GotCloud configuration file is gotcloud/bin/gotcloudDefaults.conf
Comments begin with a #
Format: KEY = value
- where KEY is the item being set and value is its new value
Some settings can be defined both in the configuration file and on the GotCloud command-line
- command-line options take priority over configuration file settings
A KEY can be used in another KEY's value by specifying $(KEY)
- Example:
  KEY1 = value1
  
  KEY2 = $(KEY1)/value2
  - When KEY2 is used, it will be equal to: value1/value2

Output Directory

The output directory is required for running GotCloud, so GotCloud knows where to write its output

Configuration Key	Command-line Flag	Value Description
OUT_DIR	--outdir	output directory

Reference/Resource Files

See GotCloud: Genetic Reference and Resource Files for reference/resource file configuration settings

Cluster Configuration

Regardless of the type of cluster system used, GotCloud will wait for each job to complete after launching it.

For any BATCH_TYPEs that run in batch mode, GotCloud generates a script that will wait until the step is complete before returning
- In a sense, it "fakes" interactive mode for all batch types since it will not proceed until a command is finished
If you are at UM and are using flux, you can specify either flux or pbs

Configuration Key	Command-line Flag	Value Description
BATCH_TYPE	--batchtype	type of cluster system
		Valid Values	Command to Launch	Command to Check for Completion
		`mosix`	`mosbatch -E/tmp`	N/A - interactive type
		`sge`	`qsub`	`qstat -u $USER`
		`sgei`	`qrsh -now n`	N/A - interactive type
		`pbs`	`qsub`	`qstat -u $USER`
		`slurm`	`sbatch`	`squeue -u $USER`
		`slurmi`		N/A - interactive type
		`local`	N/A - local command	N/A - interactive type
BATCH_OPTS	--batchopts	options to pass to your cluster type, example: -j36,37,38,39,40,41,45,46,47,48,49

Additional Required User Config Files Settings

Configuration Key	Command-line Flag	Value Description	Default Value
CHRS	--chrs	pace separated list of chromosomes to process	1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
BAM_LIST	--list	path to the BAM List File	$(OUT_DIR)/bam.list

Targeted/Exome Sequencing Settings

If you are running Targeted/Exome Sequencing, the user should specify:

Configuration Key	Value Description
UNIFORM_TARGET_BED	Bed file of targeted regions (same bed for all samples)
MULTIPLE_TARGET_MAP	Filename of file mapping: sample id -> bed file of targeted regions Each line of the file contains: [SM_ID] [TARGET_BED]
OFFSET_OFF_TARGET	Number of bases by which to extend the target region (default is 0, do not extend the target region)
SAMTOOLS_VIEW_TARGET_ONLY	When performing samtools view, if set to true, exclude off-target regions (default is false) You may not want to set this to true due to it may: make command line too long produce an error if reads overlap multiple targeted regions see: GotCloud: FAQs->Targetted/Exome

Chromosome X Calling

Making calls on the X chromosome requires the user to specifty a PED file with sex information.

PED_INDEX = pedfile.ped

Example Configuration File

Example configuration file where reference files happen to be stored in /path/reference, and bam index file in path/freeze5

CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
BAM_INDEX = /path/freeze5/freeze5.bam.index  ### The BAM index file described above
OUT_DIR = /path/freeze5/output               ### Directory in which to put all gotcloud output
REF = /path/reference/hs37d5.fa              ### Reference sequence
INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19   ### Known indel sites
HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz    ### HapMap variants (requires tabix index file in same directory)
DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz   ### dbSNP variants (requires tabix index file in same directory)

Running

Running variant calling is straightforward:

gotcloud snpcall --conf vc.conf --numjobs 2
gotcloud ldrefine --conf vc.conf --numjobs 2

Replace vc.conf with the path/name of the user's configuration file
- If you are not overriding any defaults, you can alternatively specify --list path/bam.list replacing path/bam.list with the path/name of your BAM list file.
Replace 2 following --numjobs with the number of jobs to be run in parallel
If OUT_DIR is not defined in the configuration file, add --outdir followed by the path to the user's desired output directory.

Running on a Cluster

See Cluster Configuration for information on how to configure GotCloud to run on a cluster.

Results

If there is a failure, you should see a message like:

make: *** [...] Error 1

Where ... is filled in with other text indicating what step failed.

On SNP Call success, you should see the following output sub-directories under your output directory:

glfs with a bams & samples subdirectory
pvcfs with a subdirectory per chromosome and then per region
split with a subdirectory per chromosome
vcfs with a subdirectory per chromosome
(optionally your target directory)

Under the vcf/chrXX directory, there should be:

chrXX.filtered.sites.vcf
chrXX.filtered.sites.vcf.norm.log
chrXX.filtered.sites.vcf.summary
chrXX.filtered.vcf.gz
chrXX.filtered.vcf.gz.OK
chrXX.filtered.vcf.gz.tbi
chrXX.hardfiltered.sites.vcf
chrXX.hardfiltered.sites.vcf.log
chrXX.hardfiltered.sites.vcf.summary
chrXX.hardfiltered.vcf.gz
chrXX.hardfiltered.vcf.gz.OK
chrXX.hardfiltered.vcf.gz.tbi
chrXX.merged.sites.vcf
chrXX.merged.stats.vcf
chrXX.merged.vcf
chrXX.merged.vcf.OK

The .merged.vcf is the merged together versions of the separate regions in the same chromosome.

The filtered is the merged.vcf after it has been run through filters and is marked with PASS/FAIL.

Under the split/chrXX directory, there should be:

chrXX.filtered.PASS.split.[N].vcf.gz
chrXX.filtered.PASS.split.err
chrXX.filtered.PASS.split.vcflist
chrXX.filtered.PASS.gz
subset.OK

@@ Line 90: / Line 90: @@
 ====Targeted/Exome Sequencing Settings====
 If you are running Targeted/Exome Sequencing, the user should specify:
-* Write loci file when performing pileup
-** WRITE_TARGET_LOCI = TRUE
-* Specify the output sub-directory to store target information, for example: targetDir
-** Should not be a full path as this will co under the OUT_DIR directory.
-** TARGET_DIR = targetDir
-If all individuals have the same target:
+{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
-* Specify the single bed file, for example: target.bed
+! Configuration Key !! Value Description
-** UNIFORM_TARGET_BED = target.bed
+|-
+|UNIFORM_TARGET_BED|| Bed file of targeted regions (same bed for all samples)
+|-
+|MULTIPLE_TARGET_MAP|| Filename of file mapping: sample id -> bed file of targeted regions
+Each line of the file contains: [SM_ID] [TARGET_BED]
+|-
+|OFFSET_OFF_TARGET|| Number of bases by which to extend the target region
+(default is 0, do not extend the target region)
+|-
+|SAMTOOLS_VIEW_TARGET_ONLY || When performing samtools view, if set to true, exclude off-target regions
+(default is false)
-If not all individuals have the same target:
+You may not want to set this to true due to it may:
-* Specify the file containing the sample id -> bed map, for example: targetMap.txt
+*''make command line too long''
-** MULTIPLE_TARGET_MAP = targetMap.txt
+*''produce an error if reads overlap multiple targeted regions''
-*** Each line of the file contains [SM_ID] [TARGET_BED]
+** see: [[GotCloud: FAQs#Targetted/Exome|GotCloud: FAQs->Targetted/Exome]]
+|}
-Optional Settings:
-* Extend the target region by a given number of bases, for example: 50
-** OFFSET_OFF_TARGET = 50
 ==== Chromosome X Calling ====