Changes

From Genome Analysis Wiki
Line 5: Line 5:       −
= Running the GotCloud Variant Calling Pipeline =
+
== Running the GotCloud Variant Calling Pipeline ==
   −
The variant calling pipeline (umake) is run using <code>gotcloud snpcall</code> and <code>gotcloud ldrefine</code>.  <code>gotcloud</code> is found under <code>gotcloud/</code>.
+
The variant calling pipeline (umake) is run using <code>gotcloud snpcall</code> and <code>gotcloud ldrefine</code>.  
   −
==Running the Automatic Test==
+
===Running the Automatic Test===
   −
The automatic test runs the variant calling pipeline on a small testset and checks the results against expected results validating that GotCloud is installed correctly.
+
The automatic test runs the variant calling pipeline on a small test set and checks the results against the expected results, validating that GotCloud is installed correctly.
   −
*Run variant calling pipeline test:
+
*Run <code>snpcall</code> pipeline test:
 
  gotcloud snpcall --test OUTPUT_DIR
 
  gotcloud snpcall --test OUTPUT_DIR
where OUTPUT_DIR is the directory where you want to store the test results
+
** Where OUTPUT_DIR is the directory where you want to store the test results
 +
** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run snpcall on your own samples.
 +
*Run <code>ldrefine</code> pipeline test:
 +
gotcloud ldrefine --test OUTPUT_DIR
 +
** Where <code>OUTPUT_DIR</code> is the directory where you want to store the test results
 +
** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run ldrefine on your own samples.
   −
If you see "Successfully ran the test case, congratulations!", then you are ready to align samples.
+
== Overview of Variant Calling Pipeline Steps ==
 
  −
= Overview of Variant Calling Pipeline Steps =
   
Here is an overview of the Variant Calling Pipeline:
 
Here is an overview of the Variant Calling Pipeline:
   Line 25: Line 28:       −
= Input Data=
+
For more information on the filters applied during the Variant Calling Pipeline, see [[GotCloud: Filters]].
*Aligned/Processed/Recalibrated BAM files
+
 
*Index file containing Sample IDs & BAM file names
+
== Input Data ==
*Reference files
+
* [[#BAM Files|Aligned/Processed/Recalibrated BAM files]]
*(Optional) Configuration file to override default options
+
* [[#BAM List File|BAM list file containing Sample IDs & BAM file names]]
 +
* [[#Reference Files|Reference files]]
 +
* (Optional) [[#Configuration File|Configuration file to override default options]]
   −
== BAM files ==
+
=== BAM Files ===
The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is documented elsewhere as part of the [[Alignment Pipeline]] of gotCloud.
+
The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is automatically done as part of the [[Alignment Pipeline]] of GotCloud.
   −
== Index File ==
+
=== BAM List File ===
Each line of the index file represents each individual under the following format. Note that multiple BAMs per individual may be provided. Note that if all samples are from the same population, just specify "ALL" for the population label for each sample.
+
* Automatically created when running the GotCloud [[Alignment Pipeline]]
[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
+
* Each line of the BAM list file represents a single individual
    
Columns:
 
Columns:
 
# sample id
 
# sample id
# comma separated population labels
+
# comma separated population labels (optional column)
 
# BAM File 1 (preferable to have full paths to BAM files)
 
# BAM File 1 (preferable to have full paths to BAM files)
# BAM File 2 (if applicable)
+
# BAM File 2 (if more than 1 BAM per sample)
 
:...
 
:...
   −
: # BAM File N
+
: # BAM File N (if more than 1 BAM per sample)
 +
[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
 +
or
 +
[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...
   −
== Reference Files ==
+
* Notes:
The variant calling pipeline requires multiple reference files in order to work correctly.  
+
** tab delimited
 +
** multiple BAMs per individual may be provided, but should all be on the same line of the list file
 +
** population label is optional - it will default to <code>ALL</code>
 +
*** only used by Thunder (part of ldrefine pipeline)
 +
*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.
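The BAM list format described in the notes above can be read with a few lines of Python. This is only a minimal sketch, not GotCloud's own parser: the names <code>BamListEntry</code> and <code>parse_bam_list_line</code> are hypothetical, and the rule used to detect the optional population column (a second column that does not end in <code>.bam</code>) is an assumption made for illustration.

```python
from typing import List, NamedTuple


class BamListEntry(NamedTuple):
    """One individual from a BAM list file (hypothetical helper type)."""
    sample_id: str
    populations: List[str]
    bam_files: List[str]


def parse_bam_list_line(line: str) -> BamListEntry:
    """Parse one tab-delimited BAM list line into its parts."""
    # Drop the trailing newline and ignore empty fields from repeated tabs.
    fields = [f for f in line.rstrip("\n").split("\t") if f]
    if len(fields) < 2:
        raise ValueError("expected a sample ID and at least one BAM file")
    sample_id, rest = fields[0], fields[1:]
    # ASSUMPTION: a second column that is not a .bam path holds the
    # comma-separated population labels; otherwise the label is "ALL".
    if not rest[0].lower().endswith(".bam"):
        populations, bam_files = rest[0].split(","), rest[1:]
    else:
        populations, bam_files = ["ALL"], rest
    if not bam_files:
        raise ValueError("no BAM file listed for sample " + sample_id)
    return BamListEntry(sample_id, populations, bam_files)
```

For example, <code>parse_bam_list_line("S1\tEUR,CEU\t/d/a.bam\t/d/b.bam")</code> yields one entry with two population labels and two BAM files, while a line without a population column falls back to the <code>ALL</code> label.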
   −
* Reference Sequence in fasta format.
+
The path to the BAM list file defaults to <code>outputDirectory/bam.list</code>. It can be overridden by setting <code>--bamlist</code>, <code>--bam_list</code>, or <code>--list</code> on the command line, or by setting <code>BAM_LIST</code> in your configuration file to the path to the BAM list file. See [[#Required_Options|Required Options]] for more information.
** Configuration File Setting:  <code>REF = path/file.fa</code>
  −
* Indel VCF File Prefix
  −
** Configuration File Setting:  <code>INDEL_PREFIX = path/indels.sites.hg19</code>
  −
** <code>path/</code> contains <code>indels.sites.hg19.chr20.vcf</code> for each chromosome being processed
  −
* DBSNP File vcf.gz file (must be indexed with tabix)
  −
** Configuration File Setting:  <code>DBSNP_VCF = path/dbsnp_135.b37.vcf.gz</code>
  −
** <code>path/</code> contains <code>dbsnp_135_b37.rod.chr20.map</code> for each chromosome being processed
  −
* HapMap3 polymorphic site vcf.gz file (must be indexed with tabix)
  −
** Configuration File Setting: <code>HM3_VCF = path/hapmap_3.3.b37.sites.vcf.gz<code>
  −
** <code>path/</code> contains <code>hapmap3.qc.poly.chr20.bim</code> & <code>hapmap3.qc.poly.chr20.frq</code> for each chromosome being processed
     −
A set of reference files can be downloaded from: [[ftp://share.sph.umich.edu/1000genomes/umake-resources/ | FTP Download of Full Resource Files]]
+
=== Reference Files ===
 +
See [[GotCloud: Genetic Reference and Resource Files]] for detailed information about the multiple required reference files for the variant calling pipeline, including:
 +
* How to obtain default references
 +
* Configuration keys & default values
 +
* How to generate your own references
 +
* How to point GotCloud to your reference files
   −
Configuration File Example Reference Settings:
+
Required Reference File Types:
REF = path/file.fa
+
* [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]]
INDEL_PREFIX = path/indels.sites.hg19
+
* [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF Files|DBSNP VCF Files]]
DBSNP_VCF = path/dbsnp_135_b37.vcf.gz
+
* [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF Files|HapMap3 VCF Files]]
HM3_VCF = path/hapmap_3.3.b37.sites.vcf.gz
+
* [[GotCloud: Genetic Reference and Resource Files#OMNI VCF Files|OMNI VCF Files]]
 +
* [[GotCloud: Genetic Reference and Resource Files#INDEL VCF File(s)|INDEL VCF File(s)]]
   −
== Configuration File ==
+
=== Configuration File ===
Configuration file contains the run-time options including the software binaries and command line arguments.  A default configuration file is automatically loaded.  Users must specify their own configuration file specifying just the values different than the defaults.
+
{{:GotCloud: Configuration}}
   −
Comments begin with a <code>#</code>
+
See [[#Variant Calling Command-line Options/Configuration Settings|Variant Calling Command-line Options/Configuration Settings]] for more information on Configuration options.
   −
Format: KEY = value
+
==== Example Configuration File ====
 
+
Example configuration file where the reference files happen to be stored in <code>/path/reference</code> and the BAM list file is <code>/path/freeze5.bam.list</code>:
Where KEY is the item being set and value is its new value
+
CHRS = 20 22
 
+
BAM_LIST = /path/freeze5.bam.list
===Required User Config Files Settings===
+
OUT_DIR = /path/freeze5/output
The following Config File Settings must be specified by the user:
+
REF_DIR = /path/reference/
* CHRS = space separated list of chromosomes you want
+
REF = $(REF_DIR)/hs37d5.fa
* BAM_INDEX = path to the Index File of BAMs
+
INDEL_PREFIX = $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19
 
+
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz
===Required on Command-Line or in Config File===
+
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
The following Command-Line or Config File Settings must be specified by the user:
  −
* --outdir/OUT_DIR= path to desired output directory
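To see what the <code>$(REF_DIR)</code> references in the example configuration resolve to, the settings can be read and expanded with a short Python sketch. This is not GotCloud's configuration loader: <code>parse_config</code> and <code>expand</code> are hypothetical helpers, and the sketch assumes <code>KEY = value</code> lines, <code>#</code> comments, and plain textual substitution of <code>$(KEY)</code>.

```python
import re
from typing import Dict

_REF = re.compile(r"\$\((\w+)\)")  # matches $(KEY) references


def parse_config(text: str) -> Dict[str, str]:
    """Collect KEY = value settings, ignoring comments and blank lines."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ASSUMPTION: '#' starts a comment
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        settings[key] = value
    return settings


def expand(settings: Dict[str, str]) -> Dict[str, str]:
    """Replace $(KEY) references with the value of KEY, when KEY is defined."""
    def resolve(value: str, depth: int = 0) -> str:
        if depth > 10:
            raise ValueError("circular $(KEY) reference")

        def repl(match: "re.Match[str]") -> str:
            name = match.group(1)
            if name not in settings:
                return match.group(0)  # leave unknown references untouched
            return resolve(settings[name], depth + 1)

        return _REF.sub(repl, value)

    return {key: resolve(value) for key, value in settings.items()}
```

With the example settings, <code>REF_DIR = /path/reference/</code> and <code>REF = $(REF_DIR)/hs37d5.fa</code> expand to <code>/path/reference//hs37d5.fa</code>; the doubled slash comes from the trailing slash on <code>REF_DIR</code> and is harmless on POSIX file systems.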
     −
===Targeted/Exome Sequencing Settings===
  −
If you are running Targeted/Exome Sequencing, the user should specify:
  −
* Write loci file when performing pileup
  −
** WRITE_TARGET_LOCI = TRUE
  −
* Specify the output sub-directory to store target information, for example: targetDir
  −
** Should not be a full path as this will co under the OUT_DIR directory.
  −
** TARGET_DIR = targetDir
     −
If all individuals have the same target:
+
== Variant Calling Command-line Options/Configuration Settings ==
* Specify the single bed file, for example: target.bed
+
{{:GotCloud: Variant Calling Options}}
** UNIFORM_TARGET_BED = target.bed
     −
If not all individuals have the same target:
  −
* Specify the file containing the sample id -> bed map, for example: targetMap.txt
  −
** MULTIPLE_TARGET_MAP = targetMap.txt
  −
*** Each line of the file contains [SM_ID] [TARGET_BED]
     −
Optional Settings:
+
== Use Cases & Recommended Settings ==
* Extend the target region by a given number of bases, for example: 50
+
=== Single Sample Processing ===
** OFFSET_OFF_TARGET = 50
+
To run single-sample processing, we recommend adding the following settings to your configuration file:
 +
UNIT_CHUNK = 20000000
 +
MODEL_GLFSINGLE = TRUE
 +
MODEL_SKIP_DISCOVER = FALSE
 +
MODEL_AF_PRIOR = TRUE
 +
VCF_EXTRACT = $(REF_DIR)/snpOnly.vcf.gz
 +
EXT = $(REF_DIR)/ALL.chrCHR.phase3.combined.sites.unfiltered.vcf.gz $(REF_DIR)/chrCHR.filtered.sites.vcf.gz
   −
=== Configure Reference Files ===
+
Explanation of these settings:
See [[#Reference Files| Reference Files]] for information on how to specify the reference files.
+
* <code>UNIT_CHUNK</code> - since this is only 1 sample, process larger regions at a time than default
 +
* <code>MODEL_GLFSINGLE</code> - single sample, so model glfsingle
 +
* <code>MODEL_SKIP_DISCOVER</code> - do not skip the variant discovery step
 +
* <code>MODEL_AF_PRIOR</code> - use AF prior for genotyping
 +
* <code>VCF_EXTRACT</code> - VCF file to use for extracting the site information to genotype
 +
**  This file is included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]]
 +
* <code>EXT</code> - VCF reference files to use for the external filtering
 +
** These files are included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]]
   −
=== Chromosome X Calling ===
  −
Making calls on the X chromosome requires the user to specifty a PED file with sex information.
  −
* PED_INDEX = pedfile.ped
     −
== Example Configuration File ==
  −
Example configuration file where reference files happen to be stored in /path/reference, and bam index file in path/freeze5
  −
CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
  −
BAM_INDEX = /path/freeze5/freeze5.bam.index  ### The BAM index file described above
  −
OUT_DIR = /path/freeze5/output              ### Directory in which to put all gotcloud output
  −
REF = /path/reference/hs37d5.fa              ### Reference sequence
  −
INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19  ### Known indel sites
  −
HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz    ### HapMap variants (requires tabix index file in same directory)
  −
DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz  ### dbSNP variants (requires tabix index file in same directory)
     −
= Running =
+
== Running ==
    
Running variant calling is straightforward:
 
Running variant calling is straightforward:
Line 135: Line 130:  
</code>
 
</code>
   −
Replace vc.conf with the approprate path/name of the user's configuration file.
+
* Replace <code>vc.conf</code> with the path/name of the user's configuration file
 
+
** If you are not overriding any defaults, you can alternatively specify <code>--list path/bam.list</code> replacing <code>path/bam.list</code> with the path/name of your BAM list file.
If <code>OUT_DIR</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory.
+
* Replace <code>2</code> following <code>--numjobs</code> with the number of jobs to be run in parallel
 
+
* If <code>OUT_DIR</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory.
Update the value following <code>--numjobs</code> to the appropriate number of jobs to be run in parallel.
  −
 
  −
 
  −
== Running on a Cluster ==
  −
To run on the Cluster, the following settings need to be added to the configuration file:
  −
 
  −
<--- Following may need revision --->
  −
TODO: COMING SOON
  −
SLEEP_MULT =    20
  −
REMOTE_PREFIX =  # REMOTE_PREFIX : Set if cluster node see the directory differently (e.g. /net/mymachine/[original-dir])
  −
<--- End: Following may need revision --->
  −
 
     −
Here's the same configuration file we used above but now made to run on a cluster computer with MOSIX.
+
=== Running on a Cluster ===
== Example Configuration File ==
+
See [[#Cluster Configuration|Cluster Configuration]] for information on how to configure GotCloud to run on a cluster.
Example configuration file where reference files happen to be stored in /path/reference, and bam index file in path/freeze5
  −
CHRS = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
  −
BAM_INDEX = /path/freeze5/freeze5.bam.index
  −
OUT_DIR = /path/freeze5/output             
  −
REF = /path/reference/hs37d5.fa           
  −
INDEL_PREFIX = /path/reference/1kg.pilot_release.merged.indels.sites.hg19
  −
HM3_VCF = /path/reference/hapmap3_r3_b37.sites.vcf.gz
  −
DBSNP_VCF = /path/reference/dbsnp_135.b37.sites.vcf.gz
  −
BATCH_TYPE = mosix            ### Specify MOSIX as the batch system
  −
BATCH_OPTS = -j10,11,12,13    ### Specify available MOSIX compute nodes
     −
= Results =
+
== Results ==
    
If there is a failure, you should see a message like:  
 
If there is a failure, you should see a message like:  
Line 174: Line 147:  
* glfs with a bams & samples subdirectory
 
* glfs with a bams & samples subdirectory
 
* pvcfs with a subdirectory per chromosome and then per region
 
* pvcfs with a subdirectory per chromosome and then per region
* split with a subdirectory per chromosome
+
* '''split''' with a subdirectory per chromosome
* vcfs with a subdirectory per chromosome
+
* '''vcfs''' with a subdirectory per chromosome
 
* (optionally your target directory)
 
* (optionally your target directory)
   −
Under the vcf/chrXX directory, there should be:
+
Under the '''vcf/chrXX''' directory, there should be:
 
* chrXX.filtered.sites.vcf
 
* chrXX.filtered.sites.vcf
* chrXX.filtered.sites.vcf.log
+
* chrXX.filtered.sites.vcf.norm.log
 
* chrXX.filtered.sites.vcf.summary
 
* chrXX.filtered.sites.vcf.summary
* chrXX.filtered.vcf.gz
+
* '''chrXX.filtered.vcf.gz''' - final filtered variant call file
 
* chrXX.filtered.vcf.gz.OK
 
* chrXX.filtered.vcf.gz.OK
 
* chrXX.filtered.vcf.gz.tbi
 
* chrXX.filtered.vcf.gz.tbi
 +
* chrXX.hardfiltered.sites.vcf
 +
* chrXX.hardfiltered.sites.vcf.log
 +
* chrXX.hardfiltered.sites.vcf.summary
 +
* chrXX.hardfiltered.vcf.gz
 +
* chrXX.hardfiltered.vcf.gz.OK
 +
* chrXX.hardfiltered.vcf.gz.tbi
 
* chrXX.merged.sites.vcf
 
* chrXX.merged.sites.vcf
 
* chrXX.merged.stats.vcf
 
* chrXX.merged.stats.vcf
Line 194: Line 173:  
The filtered is the merged.vcf after it has been run through filters and is marked with PASS/FAIL.
 
The filtered is the merged.vcf after it has been run through filters and is marked with PASS/FAIL.
   −
Under the split/chrXX directory, there should be:
+
Under the '''split/chrXX''' directory, there should be:
 
* chrXX.filtered.PASS.split.[N].vcf.gz
 
* chrXX.filtered.PASS.split.[N].vcf.gz
 
* chrXX.filtered.PASS.split.err
 
* chrXX.filtered.PASS.split.err
 
* chrXX.filtered.PASS.split.vcflist
 
* chrXX.filtered.PASS.split.vcflist
* chrXX.filtered.PASS.gz
+
* '''chrXX.filtered.PASS.gz''' - final variant call file with only PASS variants
 
* subset.OK
 
* subset.OK
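A quick way to confirm that a run produced the final call set is to check for the <code>chrXX.filtered.vcf.gz</code> file and its <code>.OK</code> and <code>.tbi</code> companions listed above. The sketch below is illustrative only: <code>missing_final_outputs</code> is a hypothetical helper, and it assumes the per-chromosome files live under <code>OUT_DIR/vcfs/chrXX</code>.

```python
from pathlib import Path
from typing import List

# File name suffixes taken from the per-chromosome listing above.
FINAL_SUFFIXES = ("filtered.vcf.gz", "filtered.vcf.gz.OK", "filtered.vcf.gz.tbi")


def missing_final_outputs(out_dir: str, chrom: str) -> List[str]:
    """Return the expected final VCF files that are absent for one chromosome."""
    # ASSUMPTION: results are laid out as OUT_DIR/vcfs/chr<chrom>/...
    chr_dir = Path(out_dir) / "vcfs" / f"chr{chrom}"
    expected = [chr_dir / f"chr{chrom}.{suffix}" for suffix in FINAL_SUFFIXES]
    return [str(path) for path in expected if not path.exists()]
```

An empty list from <code>missing_final_outputs("/path/freeze5/output", "20")</code> means all three files exist for chromosome 20; it does not check their contents.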