Changes

917 bytes removed , 18:54, 30 March 2015

→‎Running the Automatic Test

Line 5: Line 5:

−

= Running the GotCloud Variant Calling Pipeline =

+

== Running the GotCloud Variant Calling Pipeline ==

The variant calling pipeline (umake) is run using <code>gotcloud snpcall</code> and <code>gotcloud ldrefine</code>.

−

==Running the Automatic Test==

+

===Running the Automatic Test===

The automatic test runs the variant calling pipeline on a small test set and compares the output against expected results, confirming that GotCloud is installed correctly.

Line 18: Line 18:

** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run snpcall on your own samples.

*Run <code>ldrefine</code> pipeline test:

−

gotcloud ~~snpcall~~ --test OUTPUT_DIR

+

gotcloud ldrefine --test OUTPUT_DIR

** Where <code>OUTPUT_DIR</code> is the directory where you want to store the test results

** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run ldrefine on your own samples.
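For scripted installs, the two test commands above can be wrapped in a small shell helper. This is a minimal sketch, not part of GotCloud: the function name <code>run_gotcloud_test</code> is hypothetical, and it assumes the success message appears somewhere in the command's combined standard output and standard error.

```shell
# Hypothetical helper (not part of GotCloud): run one pipeline self-test and
# report whether the documented success message appeared in its output.
# Assumption: the message is printed to stdout or stderr.
run_gotcloud_test() {
    pipeline=$1      # snpcall or ldrefine
    output_dir=$2    # directory where the test results are stored
    output=$(gotcloud "$pipeline" --test "$output_dir" 2>&1)
    case $output in
        *'Successfully ran the test case, congratulations!'*)
            echo "$pipeline test passed"
            ;;
        *)
            echo "$pipeline test FAILED; inspect $output_dir" >&2
            return 1
            ;;
    esac
}
```

A typical call would be <code>run_gotcloud_test snpcall OUTPUT_DIR && run_gotcloud_test ldrefine OUTPUT_DIR</code>, which stops at the first failing pipeline.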

−

= Overview of Variant Calling Pipeline Steps =

+

== Overview of Variant Calling Pipeline Steps ==

Here is an overview of the Variant Calling Pipeline:

Line 30: Line 30:

For more information on the filters applied during the Variant Calling Pipeline, see [[GotCloud: Filters]].

−

= Input Data=

+

== Input Data==

* [[#BAM Files|Aligned/Processed/Recalibrated BAM files]]

* [[#BAM List File|BAM list file containing Sample IDs & BAM file names]]

Line 36: Line 36:

* (Optional) [[#Configuration File|Configuration file to override default options]]

−

== BAM Files ==

+

=== BAM Files ===

The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is automatically done as part of the [[Alignment Pipeline]] of GotCloud.

−

== BAM List File ==

+

=== BAM List File ===

* Automatically created when running the GotCloud [[Alignment Pipeline]]

* Each line of the BAM list file represents a single individual

Line 62: Line 62:

*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.

−

== Reference Files ==

+

The path to the BAM list file defaults to <code>outputDirectory/bam.list</code>. Override it by passing <code>--bamlist</code>, <code>--bam_list</code>, or <code>--list</code> on the command line, or by setting <code>BAM_LIST</code> in your configuration file to the path of the BAM list file. See [[#Required_Options|Required Options]] for more information.
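Because each line of the BAM list file represents a single individual, a sample ID that appears on more than one line usually signals a mistake. The sketch below flags such duplicates before a run. It is not part of GotCloud: the function name is hypothetical, and it assumes the sample ID is the first tab-delimited field on each line.

```shell
# Hypothetical check (not part of GotCloud): print each sample ID that occurs
# on more than one line of a BAM list file.
# Assumption: the sample ID is the first tab-delimited field of each line.
find_duplicate_samples() {
    # seen[$1]++ == 1 is true only on the second occurrence of an ID,
    # so each duplicated ID is printed once. Empty lines are skipped.
    awk -F '\t' 'NF > 0 && seen[$1]++ == 1 { print $1 }' "$1"
}
```

Running <code>find_duplicate_samples bam.list</code> prints nothing when every sample ID is unique.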

+

=== Reference Files ===

See [[GotCloud: Genetic Reference and Resource Files]] for detailed information about the multiple required reference files for the variant calling pipeline, including:

* How to obtain default references

Line 76: Line 78:

* [[GotCloud: Genetic Reference and Resource Files#INDEL VCF File(s)|INDEL VCF File(s)]]

−

== Configuration File ==

+

=== Configuration File ===

−

===~~Additional Required User Config Files Settings~~===

+

See [[#Variant Calling Command-line Options/Configuration Settings|Variant Calling Command-line Options/Configuration Settings]] for more information on Configuration options.

−

~~The following Config File Settings must~~ be ~~specified by the user:~~

+

−

* CHRS = ~~space separated~~ list ~~of chromosomes you want~~

+

==== Example Configuration File ====

−

* BAM_INDEX = path ~~to the Index File of BAMs~~

+

Example configuration file in which the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:

+

CHRS = 20 22

+

BAM_LIST = /path/freeze5.bam.list

+

OUT_DIR = /path/freeze5/output

+

REF_DIR = /path/reference/

+

REF = $(REF_DIR)/hs37d5.fa

+

INDEL_PREFIX = $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19

+

HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz

+

DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
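The previous revision of this example noted that the <code>HM3_VCF</code> and <code>DBSNP_VCF</code> files require a tabix index file in the same directory. A pre-flight check along these lines can catch a missing index before the pipeline starts. This is only a sketch: the function name is hypothetical, and it assumes the index uses the usual tabix naming of the VCF path plus <code>.tbi</code>.

```shell
# Hypothetical pre-flight check (not part of GotCloud): confirm that a
# bgzipped VCF and its tabix index (<file>.tbi) both exist.
check_indexed_vcf() {
    vcf=$1
    if [ ! -f "$vcf" ]; then
        echo "missing VCF: $vcf" >&2
        return 1
    fi
    if [ ! -f "$vcf.tbi" ]; then
        echo "missing tabix index: $vcf.tbi" >&2
        return 1
    fi
}
```

For the example above, the check would be called once per file, for instance <code>check_indexed_vcf /path/reference/hapmap3_r3_b37.sites.vcf.gz</code>.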

+

−

==~~=Targeted~~/~~Exome Sequencing~~ Settings===

+

== Variant Calling Command-line Options/Configuration Settings ==

−

~~If you are running Targeted/Exome Sequencing, the user should specify~~:

+

−

* Write loci file when performing pileup

−

** WRITE_TARGET_LOCI = TRUE

−

* Specify the output sub-directory to store target information, for example: ~~targetDir~~

−

** Should not be a full path as this will go under the OUT_DIR directory.

−

** TARGET_DIR = targetDir

−

~~If all individuals have the same target:~~

−

* Specify the single bed file, for example: target.bed

−

** UNIFORM_TARGET_BED = target.bed

−

~~If not all individuals have the same target:~~

+

== Use Cases & Recommended Settings ==

−

* Specify the file ~~containing the sample id -> bed map, for example~~: ~~targetMap~~.~~txt~~

+

=== Single Sample Processing ===

−

** MULTIPLE_TARGET_MAP = ~~targetMap~~.~~txt~~

+

To run single sample processing we recommend adding the following settings to your configuration file:

−

*** Each line of the file contains [SM_ID] [TARGET_BED]

+

UNIT_CHUNK = 20000000

+

MODEL_GLFSINGLE = TRUE

+

MODEL_SKIP_DISCOVER = FALSE

+

MODEL_AF_PRIOR = TRUE

+

VCF_EXTRACT = $(REF_DIR)/snpOnly.vcf.gz

+

EXT = $(REF_DIR)/ALL.chrCHR.phase3.combined.sites.unfiltered.vcf.gz $(REF_DIR)/chrCHR.filtered.sites.vcf.gz

−

~~Optional Settings~~:

+

Explanation of these settings:

−

* ~~Extend the target region by~~ a ~~given number of bases~~, for ~~example~~: 50

+

* <code>UNIT_CHUNK</code> - since there is only one sample, process larger regions at a time than the default

−

** ~~OFFSET_OFF_TARGET = 50~~

+

* <code>MODEL_GLFSINGLE</code> - single sample, so model glfsingle

+

* <code>MODEL_SKIP_DISCOVER</code> - do not skip the variant discovery step

+

* <code>MODEL_AF_PRIOR</code> - use AF prior for genotyping

+

* <code>VCF_EXTRACT</code> - VCF file to use for extracting the site information to genotype

+

** This file is included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]]

+

* <code>EXT</code> - VCF reference files to use for the external filtering

+

** These files are included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]]
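One way to apply the recommended single-sample settings is to append them to an existing configuration file from the shell. The helper below is a sketch with a hypothetical function name. It assumes <code>REF_DIR</code> is already defined earlier in the same configuration file and that the listed reference files are present there.

```shell
# Hypothetical helper (not part of GotCloud): append the recommended
# single-sample settings to an existing configuration file.
append_single_sample_settings() {
    conf=$1
    # The quoted 'EOF' delimiter keeps $(REF_DIR) literal, so the shell does
    # not try to run it as a command substitution.
    cat >> "$conf" <<'EOF'
UNIT_CHUNK = 20000000
MODEL_GLFSINGLE = TRUE
MODEL_SKIP_DISCOVER = FALSE
MODEL_AF_PRIOR = TRUE
VCF_EXTRACT = $(REF_DIR)/snpOnly.vcf.gz
EXT = $(REF_DIR)/ALL.chrCHR.phase3.combined.sites.unfiltered.vcf.gz $(REF_DIR)/chrCHR.filtered.sites.vcf.gz
EOF
}
```

The quoted here-document matters: with an unquoted delimiter the shell would attempt to execute <code>REF_DIR</code> and write an empty string in its place.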

−

~~=== Chromosome X Calling ===~~

−

~~Making calls on the X chromosome requires the user to specify a PED file with sex information.~~

−

* PED_INDEX = pedfile.ped

−

~~== Example Configuration File ==~~

−

~~Example configuration file where reference files happen to be stored in /path/reference, and bam index file in path/freeze5~~

−

~~CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22~~

−

~~BAM_INDEX = /path/freeze5/freeze5.bam.index ### The BAM index file described above~~

−

~~OUT_DIR = /path/freeze5/output ### Directory in which to put all gotcloud output~~

−

~~REF = /path/reference/hs37d5.fa ### Reference sequence~~

−

~~INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19 ### Known indel sites~~

−

~~HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz ### HapMap variants (requires tabix index file in same directory)~~

−

~~DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz ### dbSNP variants (requires tabix index file in same directory)~~

−

= Running =

+

== Running ==

Running variant calling is straightforward:

Line 133: Line 135:

* If <code>OUT_DIR</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory.
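The <code>--outdir</code> rule above can be captured in a thin wrapper that adds the option only when an output directory is supplied. This is a sketch under stated assumptions: the function name is hypothetical, and <code>--conf</code> as the option for passing the configuration file does not appear in this excerpt, so confirm it against your GotCloud version.

```shell
# Hypothetical wrapper (not part of GotCloud): run snpcall with a
# configuration file, adding --outdir only when a directory is given.
# Assumption: --conf is the option that names the configuration file.
run_snpcall() {
    conf=$1
    out_dir=${2:-}   # optional; leave empty when OUT_DIR is set in the config
    if [ -n "$out_dir" ]; then
        gotcloud snpcall --conf "$conf" --outdir "$out_dir"
    else
        gotcloud snpcall --conf "$conf"
    fi
}
```

With <code>OUT_DIR</code> defined in the configuration file, <code>run_snpcall my.conf</code> is enough; otherwise pass the directory as the second argument.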

+

=== Running on a Cluster ===

+

See [[#Cluster Configuration|Cluster Configuration]] for information on how to configure GotCloud to run on a cluster.

−

== ~~Running on a Cluster ==~~

+

== Results ==

−

~~To run on the Cluster, the following settings need to be added to the configuration file:~~

−

~~BATCH_TYPE = batch_type~~

−

~~BATCH_OPTS = options to your batch system, as you would normally specify them~~

−

~~Alternatively, <code>--batchtype</code> and <code>--batchopts</code> can be specified on the command line.~~

−

~~Valid values for BATCH_TYPE are: mosix, sge, sgei, slurm, slurmi, pbs, local~~

−

* If you are at UM and are using flux, you can specify either <code>flux</code> or <code>pbs</code>.

−

* <code>sgei</code> and <code>slurmi</code> run in interactive mode.

−

* For any BATCH_TYPEs that run in batch mode, GotCloud generates a script that will wait until the step is complete before returning.

−

** In a sense, it "fakes" interactive mode for all batch types since it will not proceed until a command is finished.

−

~~Here's the same configuration file we used above but now made to run on a cluster computer with MOSIX.~~

−

== ~~Example Configuration File ==~~

−

~~CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22~~

−

~~BAM_INDEX = /path/freeze5/freeze5.bam.index~~

−

~~OUT_DIR = /path/freeze5/output~~

−

~~REF = /path/reference/hs37d5.fa~~

−

~~INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19~~

−

~~HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz~~

−

~~DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz~~

−

~~BATCH_TYPE = mosix ### Specify MOSIX as the batch system~~

−

~~BATCH_OPTS = -j10,11,12,13 ### Specify available MOSIX compute nodes~~

−

~~= Results~~ =

If there is a failure, you should see a message like:

Line 169: Line 147:

* glfs with a bams & samples subdirectory

* pvcfs with a subdirectory per chromosome and then per region

−

* split with a subdirectory per chromosome

+

* '''split''' with a subdirectory per chromosome

−

* vcfs with a subdirectory per chromosome

+

* '''vcfs''' with a subdirectory per chromosome

* (optionally your target directory)

−

Under the vcf/chrXX directory, there should be:

+

Under the '''vcfs/chrXX''' directory, there should be:

* chrXX.filtered.sites.vcf

* chrXX.filtered.sites.vcf.norm.log

* chrXX.filtered.sites.vcf.summary

−

* chrXX.filtered.vcf.gz

+

* '''chrXX.filtered.vcf.gz''' - final filtered variant call file

* chrXX.filtered.vcf.gz.OK

* chrXX.filtered.vcf.gz.tbi

Line 195: Line 173:

The filtered VCF is the merged.vcf after it has been run through the filters, with variants marked PASS/FAIL.

−

Under the split/chrXX directory, there should be:

+

Under the '''split/chrXX''' directory, there should be:

* chrXX.filtered.PASS.split.[N].vcf.gz

* chrXX.filtered.PASS.split.err

* chrXX.filtered.PASS.split.vcflist

−

* chrXX.filtered.PASS.gz

+

* '''chrXX.filtered.PASS.gz''' - final variant call file with only PASS variants

* subset.OK
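After a run, the presence of the final files listed above can be checked per chromosome. The following is a minimal sketch with a hypothetical function name. It assumes the VCF directory is named <code>vcfs</code>, as in the directory listing, and that <code>chrXX</code> stands for <code>chr</code> followed by the chromosome name.

```shell
# Hypothetical check (not part of GotCloud): verify that the final call files
# exist for one chromosome under the output directory.
# Assumption: the per-chromosome VCF directory is vcfs/chrXX.
check_snpcall_results() {
    out_dir=$1
    chr=$2           # chromosome name, e.g. 20
    status=0
    for f in \
        "$out_dir/vcfs/chr$chr/chr$chr.filtered.vcf.gz" \
        "$out_dir/vcfs/chr$chr/chr$chr.filtered.vcf.gz.OK" \
        "$out_dir/vcfs/chr$chr/chr$chr.filtered.vcf.gz.tbi" \
        "$out_dir/split/chr$chr/chr$chr.filtered.PASS.gz" \
        "$out_dir/split/chr$chr/subset.OK"
    do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

The function reports every missing file rather than stopping at the first, which makes it easier to see how far a failed run got.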

Kleckner

87

edits

GotCloud: Variant Calling Pipeline (view source)

Revision as of 18:54, 30 March 2015
