Changes

From Genome Analysis Wiki
Line 5: Line 5:
−
= Running the GotCloud Variant Calling Pipeline =
+
== Running the GotCloud Variant Calling Pipeline ==
    
The variant calling pipeline (umake) is run using <code>gotcloud snpcall</code> and <code>gotcloud ldrefine</code>.  
 
   −
==Running the Automatic Test==
+
===Running the Automatic Test===
    
The automatic test runs the variant calling pipeline on a small test set and checks the results against the expected results, validating that GotCloud is installed correctly.
Line 18: Line 18:  
** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run snpcall on your own samples.
 
 
*Run <code>ldrefine</code> pipeline test:
 
−
  gotcloud snpcall --test OUTPUT_DIR
+
  gotcloud ldrefine --test OUTPUT_DIR
 
** Where <code>OUTPUT_DIR</code> is the directory where you want to store the test results
 
 
 
** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run ldrefine on your own samples.
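The check described above (run the test, then look for the success message) could be scripted roughly as follows. This is a minimal sketch, not part of GotCloud: the helper names <code>output_reports_success</code> and <code>run_pipeline_test</code> are hypothetical, and it assumes <code>gotcloud</code> is on your <code>PATH</code> and prints the success message to standard output or standard error.

```python
import subprocess

# Message quoted from the test instructions above.
SUCCESS_MESSAGE = "Successfully ran the test case, congratulations!"


def output_reports_success(output: str) -> bool:
    """Return True if captured test output contains the success message."""
    return SUCCESS_MESSAGE in output


def run_pipeline_test(pipeline: str, output_dir: str) -> bool:
    """Run `gotcloud <pipeline> --test <output_dir>` and check the output.

    Hypothetical wrapper: only the two pipelines named in this section
    are accepted.
    """
    if pipeline not in ("snpcall", "ldrefine"):
        raise ValueError(f"unsupported pipeline: {pipeline}")
    result = subprocess.run(
        ["gotcloud", pipeline, "--test", output_dir],
        capture_output=True,
        text=True,
    )
    # Assumption: the message may appear on either stream.
    return output_reports_success(result.stdout + result.stderr)
```

For example, <code>run_pipeline_test("ldrefine", "/tmp/gotcloud_test")</code> would return <code>True</code> only if the success message was printed.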
   −
= Overview of Variant Calling Pipeline Steps =
+
== Overview of Variant Calling Pipeline Steps ==
 
Here is an overview of the Variant Calling Pipeline:
 
   Line 30: Line 30:  
For more information on the filters applied during the Variant Calling Pipeline, see [[GotCloud: Filters]].
   −
= Input Data=
+
== Input Data==
 
* [[#BAM Files|Aligned/Processed/Recalibrated BAM files]]
 
 
* [[#BAM List File|BAM list file containing Sample IDs & BAM file names]]
 
Line 36: Line 36:  
* (Optional) [[#Configuration File|Configuration file to override default options]]
 
   −
== BAM Files ==
+
=== BAM Files ===
 
The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high-quality SNP calls. Generating these BAM files from the original FASTQs is done automatically as part of the [[Alignment Pipeline]] of GotCloud.
   −
== BAM List File ==
+
=== BAM List File ===
 
* Automatically created when running the GotCloud [[Alignment Pipeline]]
 
 
* Each line of the BAM list file represents a single individual
 
Line 62: Line 62:  
*** If all samples are from the same population, the population label can be skipped, or you can specify <code>ALL</code> as the population label for each sample.
   −
== Reference Files ==
+
The path to the BAM List file defaults to <code>outputDirectory/bam.list</code>. It can be overridden by setting <code>--bamlist</code>, <code>--bam_list</code>, or <code>--list</code> on the command line, or by setting <code>BAM_LIST</code> in your configuration file to the path to the BAM List file. See [[#Required_Options|Required Options]] for more information.
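The lookup order described above can be sketched as a small function. This is an illustration only: <code>resolve_bam_list</code> is a hypothetical helper, and the assumption that a command-line value takes precedence over <code>BAM_LIST</code> in the configuration file is not stated in the text.

```python
import posixpath
from typing import Mapping, Optional


def resolve_bam_list(
    out_dir: str,
    config: Mapping[str, str],
    cli_value: Optional[str] = None,
) -> str:
    """Pick the BAM list path from the command line, config, or default.

    Assumption: the command-line value (--bamlist, --bam_list or --list)
    wins over BAM_LIST from the configuration file.
    """
    if cli_value:
        return cli_value
    if config.get("BAM_LIST"):
        return config["BAM_LIST"]
    # Default described in the text: bam.list inside the output directory.
    return posixpath.join(out_dir, "bam.list")
```

With no overrides, <code>resolve_bam_list("/path/freeze5/output", {})</code> falls back to the default inside the output directory.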
 +
 
 +
=== Reference Files ===
 
See [[GotCloud: Genetic Reference and Resource Files]] for detailed information about the multiple required reference files for the variant calling pipeline, including:
 
 
* How to obtain default references
 
Line 76: Line 78:  
* [[GotCloud: Genetic Reference and Resource Files#INDEL VCF File(s)|INDEL VCF File(s)]]
 
   −
== Configuration File ==
+
=== Configuration File ===
 
{{:GotCloud: Configuration}}
 
−
===Additional Required User Config Files Settings===
−
The following Config File Settings must be specified by the user:
−
* CHRS = space separated list of chromosomes you want
−
* BAM_INDEX = path to the Index File of BAMs
+
See [[#Variant Calling Command-line Options/Configuration Settings|Variant Calling Command-line Options/Configuration Settings]] for more information on Configuration options.
+

+
==== Example Configuration File ====
+
Example configuration file where the reference files are stored in /path/reference and the BAM list file is /path/freeze5.bam.list:
+
CHRS = 20 22
+
BAM_LIST = /path/freeze5.bam.list
+
OUT_DIR = /path/freeze5/output
+
REF_DIR = /path/reference/
+
REF = $(REF_DIR)/hs37d5.fa
+
INDEL_PREFIX = $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19
+
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz
+
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
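The example configuration defines <code>REF_DIR</code> once and reuses it through <code>$(REF_DIR)</code>. As a rough illustration of how such references resolve, here is a minimal Python sketch. <code>parse_config</code> is a hypothetical helper, not GotCloud's own parser, and it assumes <code>KEY = VALUE</code> lines, <code>#</code> comments, and that a variable is defined before it is referenced; GotCloud itself may behave differently.

```python
import re
from typing import Dict

# Matches references of the form $(NAME).
_VAR_REF = re.compile(r"\$\((\w+)\)")


def parse_config(text: str) -> Dict[str, str]:
    """Parse KEY = VALUE lines and expand $(KEY) references.

    Assumptions (not confirmed by the source): '#' starts a comment,
    and only keys defined on earlier lines are expanded. Unknown
    references are left untouched.
    """
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        expanded = _VAR_REF.sub(
            lambda match: settings.get(match.group(1), match.group(0)),
            value.strip(),
        )
        settings[key.strip()] = expanded
    return settings
```

Note that with <code>REF_DIR = /path/reference/</code> the expanded <code>REF</code> contains a double slash (<code>/path/reference//hs37d5.fa</code>), which POSIX systems treat the same as a single slash.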
 +
 
−
===Targeted/Exome Sequencing Settings===
−
If you are running Targeted/Exome Sequencing, the user should specify:
−
* Write loci file when performing pileup
−
** WRITE_TARGET_LOCI = TRUE
−
* Specify the output sub-directory to store target information, for example: targetDir
−
** Should not be a full path as this will go under the OUT_DIR directory.
−
** TARGET_DIR = targetDir
−
If all individuals have the same target:
−
* Specify the single bed file, for example: target.bed
−
** UNIFORM_TARGET_BED = target.bed
−
If not all individuals have the same target:
−
* Specify the file containing the sample id -> bed map, for example: targetMap.txt
−
** MULTIPLE_TARGET_MAP = targetMap.txt
−
*** Each line of the file contains [SM_ID] [TARGET_BED]
−
Optional Settings:
−
* Extend the target region by a given number of bases, for example: 50
−
** OFFSET_OFF_TARGET = 50
+
== Variant Calling Command-line Options/Configuration Settings ==
+
{{:GotCloud: Variant Calling Options}}
+
== Use Cases & Recommended Settings ==
+
=== Single Sample Processing ===
+
To run single sample processing, we recommend adding the following settings to your configuration file:
+
UNIT_CHUNK = 20000000
+
MODEL_GLFSINGLE = TRUE
+
MODEL_SKIP_DISCOVER = FALSE
+
MODEL_AF_PRIOR = TRUE
+
VCF_EXTRACT = $(REF_DIR)/snpOnly.vcf.gz
+
EXT = $(REF_DIR)/ALL.chrCHR.phase3.combined.sites.unfiltered.vcf.gz $(REF_DIR)/chrCHR.filtered.sites.vcf.gz
+
Explanation of these settings:
+
* <code>UNIT_CHUNK</code> - because there is only one sample, process larger regions at a time than the default
+
* <code>MODEL_GLFSINGLE</code> - single sample, so model glfsingle
+
* <code>MODEL_SKIP_DISCOVER</code> - do not skip the variant discovery step
+
* <code>MODEL_AF_PRIOR</code> - use AF prior for genotyping
+
* <code>VCF_EXTRACT</code> - VCF file to use for extracting the site information to genotype
+
** This file is included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]]
+
* <code>EXT</code> - VCF reference files to use for the external filtering
+
** These files are included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]]
   −
=== Chromosome X Calling ===
  −
Making calls on the X chromosome requires the user to specify a PED file with sex information.
  −
* PED_INDEX = pedfile.ped
     −
== Example Configuration File ==
  −
Example configuration file where reference files happen to be stored in /path/reference, and bam index file in path/freeze5
  −
CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
  −
BAM_INDEX = /path/freeze5/freeze5.bam.index  ### The BAM index file described above
  −
OUT_DIR = /path/freeze5/output              ### Directory in which to put all gotcloud output
  −
REF = /path/reference/hs37d5.fa              ### Reference sequence
  −
INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19  ### Known indel sites
  −
HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz    ### HapMap variants (requires tabix index file in same directory)
  −
DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz  ### dbSNP variants (requires tabix index file in same directory)
     −
= Running =
+
== Running ==
    
Running variant calling is straightforward:
 
Line 133: Line 135:  
* If <code>OUT_DIR</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory.
 
    +
=== Running on a Cluster ===
 +
See [[#Cluster Configuration|Cluster Configuration]] for information on how to configure GotCloud to run on a cluster.
   −
== Running on a Cluster ==
+
== Results ==
−
To run on the Cluster, the following settings need to be added to the configuration file:
  −
BATCH_TYPE = batch_type
  −
BATCH_OPTS = options to your batch system, as you would normally specify them
  −
 
  −
Alternatively, <code>--batchtype</code> and <code>--batchopts</code> can be specified on the command line.
  −
 
  −
Valid values for BATCH_TYPE are: mosix, sge, sgei, slurm, slurmi, pbs, local
  −
* If you are at UM and are using flux, you can specify either <code>flux</code> or <code>pbs</code>.
  −
* <code>sgei</code> and <code>slurmi</code> run in interactive mode.
  −
* For any BATCH_TYPEs that run in batch mode, GotCloud generates a script that will wait until the step is complete before returning.
  −
** In a sense, it "fakes" interactive mode for all batch types since it will not proceed until a command is finished.
  −
 
  −
 
  −
Here's the same configuration file we used above but now made to run on a cluster computer with MOSIX.
  −
== Example Configuration File ==
  −
CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
  −
BAM_INDEX = /path/freeze5/freeze5.bam.index
  −
OUT_DIR = /path/freeze5/output             
  −
REF = /path/reference/hs37d5.fa           
  −
INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19
  −
HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz
  −
DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz
  −
BATCH_TYPE = mosix            ### Specify MOSIX as the batch system
  −
BATCH_OPTS = -j10,11,12,13    ### Specify available MOSIX compute nodes
  −
 
  −
= Results =
      
If there is a failure, you should see a message like:  
 
Line 169: Line 147:  
* glfs with a bams & samples subdirectory
 
 
* pvcfs with a subdirectory per chromosome and then per region
 
−
* split with a subdirectory per chromosome
+
* '''split''' with a subdirectory per chromosome
−
* vcfs with a subdirectory per chromosome
+
* '''vcfs''' with a subdirectory per chromosome
 
* (optionally your target directory)
 
   −
Under the vcf/chrXX directory, there should be:
+
Under the '''vcf/chrXX''' directory, there should be:
 
* chrXX.filtered.sites.vcf
 
 
* chrXX.filtered.sites.vcf.norm.log
 
 
* chrXX.filtered.sites.vcf.summary
 
−
* chrXX.filtered.vcf.gz
+
* '''chrXX.filtered.vcf.gz''' - final filtered variant call file
 
* chrXX.filtered.vcf.gz.OK
 
 
* chrXX.filtered.vcf.gz.tbi
 
Line 195: Line 173:  
The filtered VCF is the merged.vcf after it has been run through the filters, with each variant marked PASS/FAIL.
   −
Under the split/chrXX directory, there should be:
+
Under the '''split/chrXX''' directory, there should be:
 
* chrXX.filtered.PASS.split.[N].vcf.gz
 
 
* chrXX.filtered.PASS.split.err
 
 
* chrXX.filtered.PASS.split.vcflist
 
−
* chrXX.filtered.PASS.gz
+
* '''chrXX.filtered.PASS.gz''' - final variant call file with only PASS variants
 
 
* subset.OK
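After a run, the listings above can be turned into a quick completeness check. The following is a minimal sketch, not a GotCloud tool: <code>missing_outputs</code> is a hypothetical helper, it covers only the file names visible in the listings above (the per-split <code>[N]</code> files are skipped), and because the text uses both <code>vcfs</code> and <code>vcf</code> for the directory name, that name is left as a parameter.

```python
from pathlib import Path
from typing import List

# Suffixes taken from the listings above; files not shown there are omitted.
VCF_SUFFIXES = [
    "filtered.sites.vcf",
    "filtered.sites.vcf.norm.log",
    "filtered.sites.vcf.summary",
    "filtered.vcf.gz",
    "filtered.vcf.gz.OK",
    "filtered.vcf.gz.tbi",
]
SPLIT_SUFFIXES = [
    "filtered.PASS.split.err",
    "filtered.PASS.split.vcflist",
    "filtered.PASS.gz",
]


def missing_outputs(out_dir: str, chrom: str, vcf_dir_name: str = "vcfs") -> List[str]:
    """Return the expected output files for one chromosome that do not exist."""
    chr_name = f"chr{chrom}"
    vcf_dir = Path(out_dir) / vcf_dir_name / chr_name
    split_dir = Path(out_dir) / "split" / chr_name
    expected = [vcf_dir / f"{chr_name}.{suffix}" for suffix in VCF_SUFFIXES]
    expected += [split_dir / f"{chr_name}.{suffix}" for suffix in SPLIT_SUFFIXES]
    expected.append(split_dir / "subset.OK")
    return [str(path) for path in expected if not path.exists()]
```

An empty list from <code>missing_outputs("/path/freeze5/output", "20")</code> would mean every listed file for chromosome 20 is present.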