Line 5: |
Line 5: |
| | | |
| | | |
− | = Running the GotCloud Variant Calling Pipeline = | + | == Running the GotCloud Variant Calling Pipeline == |
| | | |
| The variant calling pipeline (umake) is run using <code>gotcloud snpcall</code> and <code>gotcloud ldrefine</code>. | | The variant calling pipeline (umake) is run using <code>gotcloud snpcall</code> and <code>gotcloud ldrefine</code>. |
| | | |
− | ==Running the Automatic Test== | + | ===Running the Automatic Test=== |
| | | |
| The automatic test runs the variant calling pipeline on a small test set and checks the results against the expected results, validating that GotCloud is installed correctly. | | The automatic test runs the variant calling pipeline on a small test set and checks the results against the expected results, validating that GotCloud is installed correctly. |
Line 18: |
Line 18: |
| ** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run snpcall on your own samples. | | ** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run snpcall on your own samples. |
| *Run <code>ldrefine</code> pipeline test: | | *Run <code>ldrefine</code> pipeline test: |
− | gotcloud snpcall --test OUTPUT_DIR | + | gotcloud ldrefine --test OUTPUT_DIR |
| ** Where <code>OUTPUT_DIR</code> is the directory where you want to store the test results | | ** Where <code>OUTPUT_DIR</code> is the directory where you want to store the test results |
| ** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run ldrefine on your own samples. | | ** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run ldrefine on your own samples. |
| | | |
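The two test commands above can be wrapped in a small script. This is a minimal sketch, not part of GotCloud: the helper names <code>output_reports_success</code> and <code>run_pipeline_test</code> are hypothetical, and it assumes the success message is written to standard output or standard error.

```python
import subprocess

# Message quoted in the test instructions above.
SUCCESS_MESSAGE = "Successfully ran the test case, congratulations!"


def output_reports_success(output: str) -> bool:
    """Return True if the captured test output contains the success message."""
    return SUCCESS_MESSAGE in output


def run_pipeline_test(pipeline: str, output_dir: str) -> bool:
    """Run `gotcloud <pipeline> --test <output_dir>` and check the result.

    Assumption: the success message appears on stdout or stderr.
    """
    if pipeline not in ("snpcall", "ldrefine"):
        raise ValueError(f"unknown pipeline: {pipeline}")
    result = subprocess.run(
        ["gotcloud", pipeline, "--test", output_dir],
        capture_output=True,
        text=True,
    )
    return output_reports_success(result.stdout + result.stderr)
```

Calling <code>run_pipeline_test("snpcall", "OUTPUT_DIR")</code> requires <code>gotcloud</code> to be on the <code>PATH</code>.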
− | = Overview of Variant Calling Pipeline Steps = | + | == Overview of Variant Calling Pipeline Steps == |
| Here is an overview of the Variant Calling Pipeline: | | Here is an overview of the Variant Calling Pipeline: |
| | | |
Line 30: |
Line 30: |
| For more information on the filters applied during the Variant Calling Pipeline, see [[GotCloud: Filters]]. | | For more information on the filters applied during the Variant Calling Pipeline, see [[GotCloud: Filters]]. |
| | | |
− | = Input Data= | + | == Input Data== |
| * [[#BAM Files|Aligned/Processed/Recalibrated BAM files]] | | * [[#BAM Files|Aligned/Processed/Recalibrated BAM files]] |
| * [[#BAM List File|BAM list file containing Sample IDs & BAM file names]] | | * [[#BAM List File|BAM list file containing Sample IDs & BAM file names]] |
Line 36: |
Line 36: |
| * (Optional) [[#Configuration File|Configuration file to override default options]] | | * (Optional) [[#Configuration File|Configuration file to override default options]] |
| | | |
− | == BAM Files == | + | === BAM Files === |
| The BAM files need to be duplicate-marked and base-quality recalibrated to obtain high-quality SNP calls. Generating these BAM files from the original FASTQs is done automatically as part of the [[Alignment Pipeline]] of GotCloud. | | The BAM files need to be duplicate-marked and base-quality recalibrated to obtain high-quality SNP calls. Generating these BAM files from the original FASTQs is done automatically as part of the [[Alignment Pipeline]] of GotCloud. |
| | | |
− | == BAM List File == | + | === BAM List File === |
| * Automatically created when running the GotCloud [[Alignment Pipeline]] | | * Automatically created when running the GotCloud [[Alignment Pipeline]] |
| * Each line of the BAM list file represents a single individual | | * Each line of the BAM list file represents a single individual |
Line 59: |
Line 59: |
| ** multiple BAMs per individual may be provided, but should all be on the same line of the list file | | ** multiple BAMs per individual may be provided, but should all be on the same line of the list file |
| ** population label is optional - it will default to <code>ALL</code> | | ** population label is optional - it will default to <code>ALL</code> |
− | *** population is only used by Thunder (part of ldrefine pipeline) | + | *** only used by Thunder (part of ldrefine pipeline) |
| *** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample. | | *** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample. |
| | | |
− | == Reference Files == | + | The path to the BAM list file defaults to <code>outputDirectory/bam.list</code>. It can be overridden with <code>--bamlist</code>, <code>--bam_list</code>, or <code>--list</code> on the command line, or by setting <code>BAM_LIST</code> in your configuration file to the path of the BAM list file. See [[#Required_Options|Required Options]] for more information. |
| + | |
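The BAM list rules above (one individual per line, several BAMs allowed on that line, population label defaulting to <code>ALL</code>) can be sketched as a small parser. This is an illustration only, not GotCloud's own reader: the function name <code>parse_bam_list</code> is hypothetical, and it assumes whitespace-separated fields in the order sample ID, optional population label, BAM paths, treating any field that ends in <code>.bam</code> as a BAM path.

```python
def parse_bam_list(text):
    """Parse BAM list text into {sample_id: (population, [bam paths])}.

    Assumptions (not confirmed by the GotCloud documentation shown here):
    fields are whitespace-separated, the sample ID comes first, and a
    second field that does not end in ".bam" is the population label.
    """
    samples = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sample_id, rest = fields[0], fields[1:]
        if rest and not rest[0].lower().endswith(".bam"):
            population, bams = rest[0], rest[1:]
        else:
            population, bams = "ALL", rest  # label omitted: default to ALL
        if not bams:
            raise ValueError(f"line {line_number}: no BAM file for {sample_id}")
        if sample_id in samples:
            # Each individual must occupy exactly one line of the list.
            raise ValueError(
                f"line {line_number}: {sample_id} appears twice; "
                "list all of its BAMs on a single line"
            )
        samples[sample_id] = (population, bams)
    return samples
```

The duplicate check reflects the rule that all BAMs for an individual belong on the same line.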
| + | === Reference Files === |
| See [[GotCloud: Genetic Reference and Resource Files]] for detailed information about the multiple required reference files for the variant calling pipeline, including: | | See [[GotCloud: Genetic Reference and Resource Files]] for detailed information about the multiple required reference files for the variant calling pipeline, including: |
| * How to obtain default references | | * How to obtain default references |
Line 76: |
Line 78: |
| * [[GotCloud: Genetic Reference and Resource Files#INDEL VCF File(s)|INDEL VCF File(s)]] | | * [[GotCloud: Genetic Reference and Resource Files#INDEL VCF File(s)|INDEL VCF File(s)]] |
| | | |
− | == Configuration File == | + | === Configuration File === |
− | Configuration file contains the run-time options including the software binaries and command line arguments. A default configuration file is automatically loaded. Users must specify their own configuration file specifying just the values different than the defaults. | + | {{:GotCloud: Configuration}} |
| | | |
− | Comments begin with a <code>#</code>
| + | See [[#Variant Calling Command-line Options/Configuration Settings|Variant Calling Command-line Options/Configuration Settings]] for more information on Configuration options. |
| | | |
− | Format: KEY = value
| + | ==== Example Configuration File ==== |
− | | + | Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list
− | Where KEY is the item being set and value is its new value
| + | CHRS = 20 22 |
− | | + | BAM_LIST = /path/freeze5.bam.list |
− | ===Required User Config Files Settings=== | + | OUT_DIR = /path/freeze5/output |
− | The following Config File Settings must be specified by the user:
| + | REF_DIR = /path/reference/ |
− | * CHRS = space separated list of chromosomes you want
| + | REF = $(REF_DIR)/hs37d5.fa |
− | * BAM_INDEX = path to the Index File of BAMs
| + | INDEL_PREFIX = $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19 |
− | | + | HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz |
− | ===Required on Command-Line or in Config File=== | + | DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz |
− | The following Command-Line or Config File Settings must be specified by the user:
| |
− | * --outdir/OUT_DIR= path to desired output directory
| |
| | | |
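The configuration format (<code>KEY = value</code> lines, <code>#</code> comments, and <code>$(KEY)</code> references such as <code>$(REF_DIR)</code> in the example) can be illustrated with a short parser. This is a sketch under assumptions, not GotCloud's own loader: the names <code>parse_config</code> and <code>expand</code> are hypothetical, and it assumes <code>$(KEY)</code> is replaced by the value of <code>KEY</code> from the same file, with unknown keys left untouched.

```python
import re

_REFERENCE = re.compile(r"\$\((\w+)\)")


def parse_config(text):
    """Read KEY = value lines; anything after '#' is treated as a comment."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"expected KEY = value, got: {raw_line!r}")
        settings[key.strip()] = value.strip()
    return settings


def expand(settings, key, _seen=()):
    """Return settings[key] with $(NAME) references substituted.

    Assumption: references resolve against other keys in the same file.
    """
    if key in _seen:
        raise ValueError(f"circular reference involving {key}")

    def substitute(match):
        name = match.group(1)
        if name not in settings:
            return match.group(0)  # leave unknown references as written
        return expand(settings, name, _seen + (key,))

    return _REFERENCE.sub(substitute, settings[key])
```

With the example file, <code>expand(settings, "REF")</code> yields <code>/path/reference//hs37d5.fa</code>: the doubled slash comes from the trailing slash on <code>REF_DIR</code> and is harmless on POSIX file systems.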
− | ===Targeted/Exome Sequencing Settings===
| |
− | If you are running Targeted/Exome Sequencing, the user should specify:
| |
− | * Write loci file when performing pileup
| |
− | ** WRITE_TARGET_LOCI = TRUE
| |
− | * Specify the output sub-directory to store target information, for example: targetDir
| |
− | ** Should not be a full path as this will go under the OUT_DIR directory.
| |
− | ** TARGET_DIR = targetDir
| |
| | | |
− | If all individuals have the same target:
| + | == Variant Calling Command-line Options/Configuration Settings == |
− | * Specify the single bed file, for example: target.bed
| + | {{:GotCloud: Variant Calling Options}} |
− | ** UNIFORM_TARGET_BED = target.bed
| |
| | | |
− | If not all individuals have the same target:
| |
− | * Specify the file containing the sample id -> bed map, for example: targetMap.txt
| |
− | ** MULTIPLE_TARGET_MAP = targetMap.txt
| |
− | *** Each line of the file contains [SM_ID] [TARGET_BED]
| |
| | | |
− | Optional Settings:
| + | == Use Cases & Recommended Settings == |
− | * Extend the target region by a given number of bases, for example: 50
| + | === Single Sample Processing === |
− | ** OFFSET_OFF_TARGET = 50
| + | To run single sample processing we recommend adding the following settings to your configuration file: |
| + | UNIT_CHUNK = 20000000 |
| + | MODEL_GLFSINGLE = TRUE |
| + | MODEL_SKIP_DISCOVER = FALSE |
| + | MODEL_AF_PRIOR = TRUE |
| + | VCF_EXTRACT = $(REF_DIR)/snpOnly.vcf.gz |
| + | EXT = $(REF_DIR)/ALL.chrCHR.phase3.combined.sites.unfiltered.vcf.gz $(REF_DIR)/chrCHR.filtered.sites.vcf.gz |
| | | |
− | === Configure Reference Files ===
| + | Explanation of these settings: |
− | See [[#Reference Files| Reference Files]] for information on how to specify the reference files.
| + | * <code>UNIT_CHUNK</code> - since this is only 1 sample, process larger regions at a time than default |
| + | * <code>MODEL_GLFSINGLE</code> - single sample, so model glfsingle |
| + | * <code>MODEL_SKIP_DISCOVER</code> - do not skip the variant discovery step |
| + | * <code>MODEL_AF_PRIOR</code> - use AF prior for genotyping |
| + | * <code>VCF_EXTRACT</code> - VCF file to use for extracting the site information to genotype |
| + | ** This file is included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]] |
| + | * <code>EXT</code> - VCF reference files to use for the external filtering |
| + | ** These files are included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]] |
| | | |
− | === Chromosome X Calling ===
| |
− | Making calls on the X chromosome requires the user to specify a PED file with sex information.
| |
− | * PED_INDEX = pedfile.ped
| |
| | | |
− | == Example Configuration File ==
| |
− | Example configuration file where reference files happen to be stored in /path/reference, and bam index file in path/freeze5
| |
− | CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
| |
− | BAM_INDEX = /path/freeze5/freeze5.bam.index ### The BAM index file described above
| |
− | OUT_DIR = /path/freeze5/output ### Directory in which to put all gotcloud output
| |
− | REF = /path/reference/hs37d5.fa ### Reference sequence
| |
− | INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19 ### Known indel sites
| |
− | HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz ### HapMap variants (requires tabix index file in same directory)
| |
− | DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz ### dbSNP variants (requires tabix index file in same directory)
| |
| | | |
− | = Running = | + | == Running == |
| | | |
| Running variant calling is straightforward: | | Running variant calling is straightforward: |
Line 146: |
Line 135: |
| * If <code>OUT_DIR</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory. | | * If <code>OUT_DIR</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory. |
| | | |
| + | === Running on a Cluster === |
| + | See [[#Cluster Configuration|Cluster Configuration]] for information on how to configure GotCloud to run on a cluster. |
| | | |
− | == Running on a Cluster == | + | == Results == |
− | To run on the Cluster, the following settings need to be added to the configuration file:
| |
− | BATCH_TYPE = batch_type
| |
− | BATCH_OPTS = options to your batch system, as you would normally specify them
| |
− | | |
− | Alternatively, <code>--batchtype</code> and <code>--batchopts</code> can be specified on the command line.
| |
− | | |
− | Valid values for BATCH_TYPE are: mosix, sge, sgei, slurm, slurmi, pbs, local
| |
− | * If you are at UM and are using flux, you can specify either <code>flux</code> or <code>pbs</code>.
| |
− | * <code>sgei</code> and <code>slurmi</code> run in interactive mode.
| |
− | * For any BATCH_TYPEs that run in batch mode, GotCloud generates a script that will wait until the step is complete before returning.
| |
− | ** In a sense, it "fakes" interactive mode for all batch types since it will not proceed until a command is finished.
| |
− | | |
− | | |
− | Here's the same configuration file we used above but now made to run on a cluster computer with MOSIX.
| |
− | == Example Configuration File ==
| |
− | CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
| |
− | BAM_INDEX = /path/freeze5/freeze5.bam.index
| |
− | OUT_DIR = /path/freeze5/output
| |
− | REF = /path/reference/hs37d5.fa
| |
− | INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19
| |
− | HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz
| |
− | DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz
| |
− | BATCH_TYPE = mosix ### Specify MOSIX as the batch system
| |
− | BATCH_OPTS = -j10,11,12,13 ### Specify available MOSIX compute nodes
| |
− | | |
− | = Results =
| |
| | | |
| If there is a failure, you should see a message like: | | If there is a failure, you should see a message like: |
Line 182: |
Line 147: |
| * glfs with a bams & samples subdirectory | | * glfs with a bams & samples subdirectory |
| * pvcfs with a subdirectory per chromosome and then per region | | * pvcfs with a subdirectory per chromosome and then per region |
− | * split with a subdirectory per chromosome | + | * '''split''' with a subdirectory per chromosome |
− | * vcfs with a subdirectory per chromosome | + | * '''vcfs''' with a subdirectory per chromosome |
| * (optionally your target directory) | | * (optionally your target directory) |
| | | |
− | Under the vcf/chrXX directory, there should be: | + | Under the '''vcf/chrXX''' directory, there should be: |
| * chrXX.filtered.sites.vcf | | * chrXX.filtered.sites.vcf |
| * chrXX.filtered.sites.vcf.norm.log | | * chrXX.filtered.sites.vcf.norm.log |
| * chrXX.filtered.sites.vcf.summary | | * chrXX.filtered.sites.vcf.summary |
− | * chrXX.filtered.vcf.gz | + | * '''chrXX.filtered.vcf.gz''' - final filtered variant call file |
| * chrXX.filtered.vcf.gz.OK | | * chrXX.filtered.vcf.gz.OK |
| * chrXX.filtered.vcf.gz.tbi | | * chrXX.filtered.vcf.gz.tbi |
Line 208: |
Line 173: |
| The filtered VCF is the merged.vcf after it has been run through the filters, with each variant marked PASS/FAIL. | | The filtered VCF is the merged.vcf after it has been run through the filters, with each variant marked PASS/FAIL. |
| | | |
− | Under the split/chrXX directory, there should be: | + | Under the '''split/chrXX''' directory, there should be: |
| * chrXX.filtered.PASS.split.[N].vcf.gz | | * chrXX.filtered.PASS.split.[N].vcf.gz |
| * chrXX.filtered.PASS.split.err | | * chrXX.filtered.PASS.split.err |
| * chrXX.filtered.PASS.split.vcflist | | * chrXX.filtered.PASS.split.vcflist |
− | * chrXX.filtered.PASS.gz | + | * '''chrXX.filtered.PASS.gz''' - final variant call file with only PASS variants |
| * subset.OK | | * subset.OK |
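The per-chromosome listings above can be checked with a short script after a run. This is a minimal sketch: the function name <code>missing_outputs</code> is hypothetical, it covers only the files named in the listings shown here, and the VCF directory name is a parameter because the text refers to it both as <code>vcfs</code> and as <code>vcf</code>.

```python
from pathlib import Path

# File names taken from the listings above; "{c}" stands for chrXX.
VCF_FILES = [
    "{c}.filtered.sites.vcf",
    "{c}.filtered.sites.vcf.norm.log",
    "{c}.filtered.sites.vcf.summary",
    "{c}.filtered.vcf.gz",
    "{c}.filtered.vcf.gz.OK",
    "{c}.filtered.vcf.gz.tbi",
]
SPLIT_FILES = [
    "{c}.filtered.PASS.split.err",
    "{c}.filtered.PASS.split.vcflist",
    "{c}.filtered.PASS.gz",
    "subset.OK",
]


def missing_outputs(out_dir, chrom, vcf_dir="vcfs", split_dir="split"):
    """Return the expected output paths that do not exist for one chromosome.

    Assumption: chromosome subdirectories are named "chr" + chrom, e.g. chr20.
    """
    name = f"chr{chrom}"
    missing = []
    for subdir, templates in ((vcf_dir, VCF_FILES), (split_dir, SPLIT_FILES)):
        for template in templates:
            path = Path(out_dir) / subdir / name / template.format(c=name)
            if not path.is_file():
                missing.append(path)
    return missing
```

An empty list from <code>missing_outputs("/path/freeze5/output", "20")</code> means every listed file is present for chromosome 20.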