Difference between revisions of "GotCloud: Variant Calling Pipeline"
Revision as of 18:04, 23 October 2014
The Variant Calling Pipeline (previously called 'UMAKE') makes genotype calls from recalibrated BAM files. These genotype calls are output into VCF (Variant Call Format) files.
Running the GotCloud Variant Calling Pipeline
The variant calling pipeline (umake) is run using gotcloud snpcall and gotcloud ldrefine.
Running the Automatic Test
The automatic test runs the variant calling pipeline on a small test set and checks the results against the expected results, validating that GotCloud is installed correctly.
- Run snpcall pipeline test:
  gotcloud snpcall --test OUTPUT_DIR
  - Where OUTPUT_DIR is the directory where you want to store the test results
  - If you see "Successfully ran the test case, congratulations!", then you are ready to run snpcall on your own samples.
- Run ldrefine pipeline test:
  gotcloud ldrefine --test OUTPUT_DIR
  - Where OUTPUT_DIR is the directory where you want to store the test results
  - If you see "Successfully ran the test case, congratulations!", then you are ready to run ldrefine on your own samples.
Overview of Variant Calling Pipeline Steps
Here is an overview of the Variant Calling Pipeline:
For more information on the filters applied during the Variant Calling Pipeline, see GotCloud: Filters.
Input Data
- Aligned/Processed/Recalibrated BAM files
- BAM list file containing Sample IDs & BAM file names
- Reference files
- (Optional) Configuration file to override default options
BAM Files
The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is automatically done as part of the Alignment Pipeline of GotCloud.
BAM List File
- Automatically created when running the GotCloud Alignment Pipeline
- Each line of the BAM list file represents a single individual
Columns:
- sample id
- comma separated population labels (optional column)
- BAM File 1 (preferable to have full paths to BAM files)
- BAM File 2 (if more than 1 BAM per sample)
- ...
- BAM File N (if more than 1 BAM per sample)
[SAMPLE_ID] [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
or
[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...
- Notes:
  - tab delimited
  - multiple BAMs per individual may be provided, but should all be on the same line of the list file
  - population label is optional - it will default to ALL
    - only used by Thunder (part of ldrefine pipeline)
    - if all samples are from the same population, the population label can be skipped, or you can specify ALL as the population label for each sample
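The line format above can be sketched in Python. This is only an illustration and not part of GotCloud: the helper name format_bam_list_line, the sample IDs, and the BAM paths are all hypothetical.

```python
def format_bam_list_line(sample_id, bam_files, populations=None):
    """Build one tab-delimited BAM list line for a single individual.

    populations is optional; when it is omitted the column is skipped,
    which the format above treats as the default label ALL.
    """
    if not bam_files:
        raise ValueError("each sample needs at least one BAM file")
    fields = [sample_id]
    if populations:
        # Several population labels share one comma-separated column.
        fields.append(",".join(populations))
    # All BAMs for the individual go on the same line.
    fields.extend(bam_files)
    return "\t".join(fields)


# Hypothetical sample IDs and paths, for illustration only.
with_pop = format_bam_list_line(
    "Sample1", ["/path/bams/s1.a.bam", "/path/bams/s1.b.bam"], ["EUR"]
)
without_pop = format_bam_list_line("Sample2", ["/path/bams/s2.bam"])
print(with_pop)
print(without_pop)
```

Full BAM paths are used in the example because the column description above prefers them.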
Reference Files
See GotCloud: Genetic Reference and Resource Files for detailed information about the multiple required reference files for the variant calling pipeline, including:
- How to obtain default references
- Configuration keys & default values
- How to generate your own references
- How to point GotCloud to your reference files
Required Reference File Types:
Configuration File
The GotCloud configuration file contains the run-time options, including software binaries and command line arguments. A default configuration file is automatically loaded. Users may supply their own configuration file containing only the values that differ from the defaults. The configuration file is not required if there are no values to override.
- Default GotCloud configuration file is gotcloud/bin/gotcloudDefaults.conf
- Comments begin with a #
- Format: KEY = value
  - where KEY is the item being set and value is its new value
- Some settings can be defined both in the configuration file and on the GotCloud command-line
  - command-line options take priority over configuration file settings
- A KEY can be used in another KEY's value by specifying $(KEY)
  - Example:
    KEY1 = value1
    KEY2 = $(KEY1)/value2
  - When KEY2 is used, it will be equal to: value1/value2
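The KEY = value format, $(KEY) substitution, and command-line priority described above can be sketched in Python. This is not GotCloud's own parser. The function names are hypothetical, and the sketch assumes that a # anywhere on a line starts a comment and that references to undefined keys are left untouched.

```python
import re

_REFERENCE = re.compile(r"\$\((\w+)\)")


def parse_config(text):
    """Read KEY = value lines into a dict, dropping # comments."""
    settings = {}
    for raw_line in text.splitlines():
        # Assumption: everything from the first # onward is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def expand(settings, key, _seen=()):
    """Return the value of key with every $(OTHER_KEY) reference resolved."""
    if key in _seen:
        raise ValueError(f"circular reference involving {key}")

    def substitute(match):
        name = match.group(1)
        if name not in settings:
            return match.group(0)  # leave unknown references as written
        return expand(settings, name, _seen + (key,))

    return _REFERENCE.sub(substitute, settings[key])


config = parse_config("KEY1 = value1\nKEY2 = $(KEY1)/value2  # uses KEY1\n")
key2_value = expand(config, "KEY2")

# Command-line options take priority over configuration file settings.
effective = {**config, **{"KEY1": "fromCommandLine"}}
key2_overridden = expand(effective, "KEY2")
print(key2_value, key2_overridden)
```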
Output Directory
- The output directory is required for running GotCloud, so GotCloud knows where to write its output
Configuration Key | Command-line Flag | Value Description
---|---|---
OUT_DIR | --outdir | output directory
Reference/Resource Files
- See GotCloud: Genetic Reference and Resource Files for reference/resource file configuration settings
Cluster Configuration
Regardless of the type of cluster system used, GotCloud will wait for each job to complete after launching it.
- For any BATCH_TYPEs that run in batch mode, GotCloud generates a script that will wait until the step is complete before returning
- In a sense, it "fakes" interactive mode for all batch types since it will not proceed until a command is finished
- If you are at UM and are using flux, you can specify either flux or pbs
Configuration Key | Command-line Flag | Value Description
---|---|---
BATCH_TYPE | --batchtype | type of cluster system (see valid values below)
BATCH_OPTS | --batchopts | options to pass to your cluster type, example: -j36,37,38,39,40,41,45,46,47,48,49
Valid values for BATCH_TYPE:
Valid Values | Command to Launch | Command to Check for Completion
---|---|---
mosix | mosbatch -E/tmp | N/A - interactive type
sge | qsub | qstat -u $USER
sgei | qrsh -now n | N/A - interactive type
pbs | qsub | qstat -u $USER
slurm | sbatch | squeue -u $USER
slurmi | | N/A - interactive type
local | N/A - local command | N/A - interactive type
Additional Required User Config Files Settings
The following Config File Settings must be specified by the user:
- CHRS = space separated list of chromosomes you want
- BAM_INDEX = path to the Index File of BAMs
Targeted/Exome Sequencing Settings
If you are running Targeted/Exome Sequencing, you should specify:
- Write loci file when performing pileup
  - WRITE_TARGET_LOCI = TRUE
- Specify the output sub-directory to store target information, for example: targetDir
  - Should not be a full path, as this will go under the OUT_DIR directory.
  - TARGET_DIR = targetDir
If all individuals have the same target:
- Specify the single bed file, for example: target.bed
  - UNIFORM_TARGET_BED = target.bed
If not all individuals have the same target:
- Specify the file containing the sample id -> bed map, for example: targetMap.txt
  - MULTIPLE_TARGET_MAP = targetMap.txt
  - Each line of the file contains [SM_ID] [TARGET_BED]
Optional Settings:
- Extend the target region by a given number of bases, for example: 50
  - OFFSET_OFF_TARGET = 50
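As a rough sketch, the targeted/exome settings above can be assembled with a small Python helper. The helper targeted_settings is hypothetical and not part of GotCloud. It assumes that exactly one of UNIFORM_TARGET_BED and MULTIPLE_TARGET_MAP is given, which is this sketch's reading of the two cases above.

```python
def targeted_settings(target_dir, uniform_bed=None, target_map=None, offset=None):
    """Return the configuration lines for a targeted/exome run."""
    if target_dir.startswith("/"):
        # TARGET_DIR goes under OUT_DIR, so it should not be a full path.
        raise ValueError("TARGET_DIR should be a sub-directory name, not a full path")
    if (uniform_bed is None) == (target_map is None):
        # Assumption of this sketch: the two target settings are alternatives.
        raise ValueError("give exactly one of uniform_bed or target_map")
    lines = ["WRITE_TARGET_LOCI = TRUE", f"TARGET_DIR = {target_dir}"]
    if uniform_bed is not None:
        lines.append(f"UNIFORM_TARGET_BED = {uniform_bed}")
    else:
        lines.append(f"MULTIPLE_TARGET_MAP = {target_map}")
    if offset is not None:
        lines.append(f"OFFSET_OFF_TARGET = {offset}")
    return lines


exome_lines = targeted_settings("targetDir", uniform_bed="target.bed", offset=50)
print("\n".join(exome_lines))
```

The printed lines can be pasted into a user configuration file alongside CHRS and BAM_INDEX.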
Chromosome X Calling
Making calls on the X chromosome requires the user to specify a PED file with sex information.
- PED_INDEX = pedfile.ped
Example Configuration File
Example configuration file where the reference files are stored in /path/reference and the BAM index file in /path/freeze5:
CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
BAM_INDEX = /path/freeze5/freeze5.bam.index ### The BAM index file described above
OUT_DIR = /path/freeze5/output ### Directory in which to put all gotcloud output
REF = /path/reference/hs37d5.fa ### Reference sequence
INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19 ### Known indel sites
HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz ### HapMap variants (requires tabix index file in same directory)
DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz ### dbSNP variants (requires tabix index file in same directory)
Running
Running variant calling is straightforward:
gotcloud snpcall --conf vc.conf --numjobs 2
gotcloud ldrefine --conf vc.conf --numjobs 2
- Replace vc.conf with the path/name of the user's configuration file
  - If you are not overriding any defaults, you can alternatively specify --list path/bam.list, replacing path/bam.list with the path/name of your BAM list file.
- Replace 2 following --numjobs with the number of jobs to be run in parallel
- If OUT_DIR is not defined in the configuration file, add --outdir followed by the path to the user's desired output directory.
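The command lines above can also be assembled programmatically. The Python sketch below only builds the argument lists and does not run GotCloud. The helper name build_gotcloud_command is hypothetical, and requiring at least one of --conf or --list is an assumption drawn from the notes above.

```python
def build_gotcloud_command(step, conf=None, bam_list=None, numjobs=None, outdir=None):
    """Return the argument list for one variant calling step."""
    if step not in ("snpcall", "ldrefine"):
        raise ValueError(f"unknown step: {step}")
    if conf is None and bam_list is None:
        # Assumption: a run needs a configuration file or a BAM list.
        raise ValueError("specify a configuration file or a BAM list")
    command = ["gotcloud", step]
    if conf is not None:
        command += ["--conf", conf]
    if bam_list is not None:
        command += ["--list", bam_list]
    if numjobs is not None:
        command += ["--numjobs", str(numjobs)]
    if outdir is not None:
        # Only needed when OUT_DIR is not set in the configuration file.
        command += ["--outdir", outdir]
    return command


commands = [
    build_gotcloud_command(step, conf="vc.conf", numjobs=2)
    for step in ("snpcall", "ldrefine")
]
for command in commands:
    print(" ".join(command))
```

Each list could then be handed to subprocess.run(command, check=True), running snpcall before ldrefine.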
Running on a Cluster
To run on a cluster, the following settings need to be added to the configuration file:
BATCH_TYPE = batch_type
BATCH_OPTS = options to your batch system, as you would normally specify them
Alternatively, --batchtype and --batchopts can be specified on the command line.
Valid values for BATCH_TYPE are: mosix, sge, sgei, slurm, slurmi, pbs, local
- If you are at UM and are using flux, you can specify either flux or pbs.
- sgei and slurmi run in interactive mode.
- For any BATCH_TYPEs that run in batch mode, GotCloud generates a script that will wait until the step is complete before returning.
  - In a sense, it "fakes" interactive mode for all batch types since it will not proceed until a command is finished.
Here's the same configuration file we used above but now made to run on a cluster computer with MOSIX.
Example Configuration File
CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
BAM_INDEX = /path/freeze5/freeze5.bam.index
OUT_DIR = /path/freeze5/output
REF = /path/reference/hs37d5.fa
INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19
HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz
DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz
BATCH_TYPE = mosix ### Specify MOSIX as the batch system
BATCH_OPTS = -j10,11,12,13 ### Specify available MOSIX compute nodes
Results
If there is a failure, you should see a message like:
make: *** [...] Error 1
Where ... is filled in with other text indicating what step failed.
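If the pipeline output is captured to a log, failure lines of this shape can be picked out with a regular expression. The Python sketch below is illustrative only. The function name and the sample log text, including the path inside the brackets, are hypothetical. The pattern also accepts the make[N]: prefix that GNU make prints for nested invocations.

```python
import re

# Matches lines such as: make: *** [some/target] Error 1
MAKE_ERROR = re.compile(r"^make(?:\[\d+\])?: \*\*\* \[(?P<step>.*)\] Error (?P<code>\d+)")


def find_make_failures(log_text):
    """Return (step, exit_code) pairs for every make error line in the log."""
    failures = []
    for line in log_text.splitlines():
        match = MAKE_ERROR.match(line.strip())
        if match:
            failures.append((match.group("step"), int(match.group("code"))))
    return failures


# Hypothetical log text, for illustration only.
sample_log = "\n".join([
    "some earlier pipeline output",
    "make: *** [/path/freeze5/output/vcfs/chr20/chr20.merged.vcf.OK] Error 1",
    "some later pipeline output",
])
failures = find_make_failures(sample_log)
print(failures)
```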
On SNP Call success, you should see the following output sub-directories under your output directory:
- glfs with bams and samples subdirectories
- pvcfs with a subdirectory per chromosome and then per region
- split with a subdirectory per chromosome
- vcfs with a subdirectory per chromosome
- (optionally your target directory)
Under the vcfs/chrXX directory, there should be:
- chrXX.filtered.sites.vcf
- chrXX.filtered.sites.vcf.norm.log
- chrXX.filtered.sites.vcf.summary
- chrXX.filtered.vcf.gz
- chrXX.filtered.vcf.gz.OK
- chrXX.filtered.vcf.gz.tbi
- chrXX.hardfiltered.sites.vcf
- chrXX.hardfiltered.sites.vcf.log
- chrXX.hardfiltered.sites.vcf.summary
- chrXX.hardfiltered.vcf.gz
- chrXX.hardfiltered.vcf.gz.OK
- chrXX.hardfiltered.vcf.gz.tbi
- chrXX.merged.sites.vcf
- chrXX.merged.stats.vcf
- chrXX.merged.vcf
- chrXX.merged.vcf.OK
The merged.vcf file combines the calls from the separate regions of the same chromosome.
The filtered files are the merged.vcf after it has been run through the filters, with each site marked PASS/FAIL.
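A quick way to confirm that a chromosome finished is to check for the files listed above. The Python sketch below is not part of GotCloud. It assumes the files live under OUT_DIR/vcfs/chrN and that XX stands for the chromosome name. The function name is hypothetical, and the demo runs against a temporary directory rather than real output.

```python
import os
import tempfile

# Suffixes of the per-chromosome files listed above.
VCF_SUFFIXES = [
    "filtered.sites.vcf", "filtered.sites.vcf.norm.log", "filtered.sites.vcf.summary",
    "filtered.vcf.gz", "filtered.vcf.gz.OK", "filtered.vcf.gz.tbi",
    "hardfiltered.sites.vcf", "hardfiltered.sites.vcf.log", "hardfiltered.sites.vcf.summary",
    "hardfiltered.vcf.gz", "hardfiltered.vcf.gz.OK", "hardfiltered.vcf.gz.tbi",
    "merged.sites.vcf", "merged.stats.vcf", "merged.vcf", "merged.vcf.OK",
]


def missing_vcf_outputs(out_dir, chrom):
    """Return the expected snpcall files that are absent for one chromosome."""
    chr_name = f"chr{chrom}"
    chr_dir = os.path.join(out_dir, "vcfs", chr_name)  # assumed layout
    expected = [f"{chr_name}.{suffix}" for suffix in VCF_SUFFIXES]
    return [name for name in expected if not os.path.exists(os.path.join(chr_dir, name))]


# Demo on a throwaway directory: create every file except the last one.
with tempfile.TemporaryDirectory() as out_dir:
    chr_dir = os.path.join(out_dir, "vcfs", "chr20")
    os.makedirs(chr_dir)
    for suffix in VCF_SUFFIXES[:-1]:
        open(os.path.join(chr_dir, f"chr20.{suffix}"), "w").close()
    demo_missing = missing_vcf_outputs(out_dir, 20)
print(demo_missing)
```

An empty list means every expected file is present for that chromosome.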
Under the split/chrXX directory, there should be:
- chrXX.filtered.PASS.split.[N].vcf.gz
- chrXX.filtered.PASS.split.err
- chrXX.filtered.PASS.split.vcflist
- chrXX.filtered.PASS.gz
- subset.OK