Changes

Line 1: Line 1: −
      
Back to parent: [[GotCloud]]
 
Line 8: Line 7:  
== Running the GotCloud Variant Calling Pipeline ==
 
   −
The variant calling pipeline (umake) is run using <code>gotcloud snpcall</code> and <code>gotcloud ldrefine</code>.  <code>gotcloud</code> is found under <code>gotcloud/</code>.
+
The variant calling pipeline (umake) is run using <code>gotcloud snpcall</code> and <code>gotcloud ldrefine</code>.  
    
===Running the Automatic Test===
 
   −
The automatic test runs the variant calling pipeline on a small testset and checks the results against expected results validating that GotCloud is installed correctly.
+
The automatic test runs the variant calling pipeline on a small test set and checks the results against expected results, validating that GotCloud is installed correctly.
   −
*Run variant calling pipeline test:
+
*Run <code>snpcall</code> pipeline test:
 
  gotcloud snpcall --test OUTPUT_DIR
 
where OUTPUT_DIR is the directory where you want to store the test results
+
** Where OUTPUT_DIR is the directory where you want to store the test results
 
+
** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run snpcall on your own samples.
If you see "Successfully ran the test case, congratulations!", then you are ready to align samples.
+
*Run <code>ldrefine</code> pipeline test:
 
+
  gotcloud ldrefine --test OUTPUT_DIR
 +
** Where <code>OUTPUT_DIR</code> is the directory where you want to store the test results
 +
** If you see <code>Successfully ran the test case, congratulations!</code>, then you are ready to run ldrefine on your own samples.
    
== Overview of Variant Calling Pipeline Steps ==
 
Line 26: Line 27:  
[[File: umakeSteps.png]]
 
    +
 +
For more information on the filters applied during the Variant Calling Pipeline, see [[GotCloud: Filters]].
    
== Input Data==
 
*Aligned/Processed/Recalibrated BAM files
+
* [[#BAM Files|Aligned/Processed/Recalibrated BAM files]]
*Index file containing Sample IDs & BAM file names
+
* [[#BAM List File|BAM list file containing Sample IDs & BAM file names]]
*Reference files
+
* [[#Reference Files|Reference files]]
*(Optional) Configuration file to override default options
+
* (Optional) [[#Configuration File|Configuration file to override default options]]
   −
=== BAM files ===
+
=== BAM Files ===
The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high quality SNP calls. Generating these BAM files from original FASTQs is documented elsewhere as part of the [[Alignment Pipeline]] of gotCloud.
+
The BAM files need to be duplicate-marked and base-quality recalibrated in order to obtain high-quality SNP calls. Generating these BAM files from the original FASTQs is done automatically as part of the [[Alignment Pipeline]] of GotCloud.
   −
=== Index File ===
+
=== BAM List File ===
Each line of the index file represents each individual under the following format. Note that multiple BAMs per individual may be provided.
+
* Automatically created when running the GotCloud [[Alignment Pipeline]]
[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
+
* Each line of the BAM list file represents a single individual
    
Columns:
 
 
# sample id
 
# comma separated population labels
+
# comma separated population labels (optional column)
 
# BAM File 1 (preferable to have full paths to BAM files)
 
# BAM File 2 (if applicable)
+
# BAM File 2 (if more than 1 BAM per sample)
 
:...
 
   −
: # BAM File N
+
: # BAM File N (if more than 1 BAM per sample)
 +
[SAMPLE_ID]    [COMMA SEPARATED POPULATION LABELS] [BAM_FILE1] [BAM_FILE2] ...
 +
or
 +
[SAMPLE_ID] [BAM_FILE1] [BAM_FILE2] ...
   −
=== Reference Files ===
+
* Notes:
The variant calling pipeline requires multiple reference files in order to work correctly.  
+
** tab delimited
 +
** multiple BAMs per individual may be provided, but should all be on the same line of the list file
 +
** population label is optional - it will default to <code>ALL</code>
 +
*** only used by Thunder (part of ldrefine pipeline)
 +
*** if all samples are from the same population, population label can be skipped or you can just specify <code>ALL</code> for the population label for each sample.
   −
* Reference Sequence in fasta format.
+
The path to the BAM list file defaults to <code>outputDirectory/bam.list</code>. It can be overridden with <code>--bamlist</code>, <code>--bam_list</code>, or <code>--list</code> on the command line, or by setting <code>BAM_LIST</code> in your configuration file to the path of the BAM list file. See [[#Required_Options|Required Options]] for more information.
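The BAM list layout described above (tab-delimited, one individual per line, optional population column defaulting to <code>ALL</code>) can be illustrated with a small Python sketch. This is not GotCloud's own parser: the function name <code>parse_bam_list_line</code>, the <code>BamListEntry</code> type, and the rule used to detect the optional population column (a second field that does not end in <code>.bam</code>) are assumptions made for illustration only.

```python
from typing import List, NamedTuple


class BamListEntry(NamedTuple):
    """One individual from a BAM list file (hypothetical helper type)."""
    sample_id: str
    populations: List[str]
    bam_files: List[str]


def parse_bam_list_line(line: str) -> BamListEntry:
    """Split one tab-delimited BAM list line into its columns."""
    fields = [field for field in line.rstrip("\n").split("\t") if field]
    if len(fields) < 2:
        raise ValueError("expected a sample id and at least one BAM file")

    sample_id = fields[0]
    # Assumption: the second column is a population label only when more
    # columns follow and it does not look like a BAM file name.
    if len(fields) >= 3 and not fields[1].lower().endswith(".bam"):
        populations = fields[1].split(",")
        bam_files = fields[2:]
    else:
        populations = ["ALL"]  # documented default population label
        bam_files = fields[1:]
    return BamListEntry(sample_id, populations, bam_files)


# Hypothetical sample ids and paths, one line in each documented layout.
with_label = parse_bam_list_line("Sample1\tEUR,CEU\t/path/s1.a.bam\t/path/s1.b.bam")
without_label = parse_bam_list_line("Sample2\t/path/s2.bam")
print(with_label)
print(without_label)
```

Keeping every BAM for an individual on one line, as the notes require, is what lets a single split per line recover the full sample record.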
** Configuration File Setting:  <code>REF = path/file.fa</code>
  −
* Indel VCF File Prefix
  −
** Configuration File Setting:  <code>INDEL_PREFIX = path/indels.sites.hg19</code>
  −
** <code>path/</code> contains <code>indels.sites.hg19.chr20.vcf</code> for each chromosome being processed
  −
* DBSNP File Prefix
  −
** Configuration File Setting: <code>DBSNP_PREFIX = path/dbsnp_135_b37.rod</code>
  −
** <code>path/</code> contains <code>dbsnp_135_b37.rod.chr20.map</code> for each chromosome being processed
  −
* HapMap3 polymorphic site prefix
  −
** Configuration File Setting:  <code>HM3_PREFIX = path/hapmap3.qc.poly</code>
  −
** <code>path/</code> contains <code>hapmap3.qc.poly.chr20.bim</code> & <code>hapmap3.qc.poly.chr20.frq</code> for each chromosome being processed
     −
A set of reference files can be downloaded from: [[ftp://share.sph.umich.edu/1000genomes/umake-resources/ | FTP Download of Full Resource Files]]
+
=== Reference Files ===
 +
See [[GotCloud: Genetic Reference and Resource Files]] for detailed information about the multiple required reference files for the variant calling pipeline, including:
 +
* How to obtain default references
 +
* Configuration keys & default values
 +
* How to generate your own references
 +
* How to point GotCloud to your reference files
   −
Configuration File Example Reference Settings:
+
Required Reference File Types:
REF = path/file.fa
+
* [[GotCloud: Genetic Reference and Resource Files#Reference fasta Files|Reference fasta Files]]
INDEL_PREFIX = path/indels.sites.hg19
+
* [[GotCloud: Genetic Reference and Resource Files#DBSNP VCF Files|DBSNP VCF Files]]
DBSNP_PREFIX = path/dbsnp_135_b37.rod
+
* [[GotCloud: Genetic Reference and Resource Files#HapMap3 VCF Files|HapMap3 VCF Files]]
HM3_PREFIX = path/hapmap3.qc.poly
+
* [[GotCloud: Genetic Reference and Resource Files#OMNI VCF Files|OMNI VCF Files]]
 +
* [[GotCloud: Genetic Reference and Resource Files#INDEL VCF File(s)|INDEL VCF File(s)]]
    
=== Configuration File ===
 
Configuration file contains the run-time options including the software binaries and command line arguments.  A default configuration file is automatically loaded.  Users must specify their own configuration file specifying just the values different than the defaults.
+
{{:GotCloud: Configuration}}
   −
Comments begin with a <code>#</code>
+
See [[#Variant Calling Command-line Options/Configuration Settings|Variant Calling Command-line Options/Configuration Settings]] for more information on Configuration options.
   −
Format: KEY = value
+
==== Example Configuration File ====
 +
Example configuration file where the reference files are stored in <code>/path/reference</code> and the BAM list file is <code>/path/freeze5.bam.list</code>:
 +
CHRS = 20 22
 +
BAM_LIST = /path/freeze5.bam.list
 +
OUT_DIR = /path/freeze5/output
 +
REF_DIR = /path/reference/
 +
REF = $(REF_DIR)/hs37d5.fa
 +
INDEL_PREFIX = $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19
 +
HM3_VCF = $(REF_DIR)/hapmap3_r3_b37.sites.vcf.gz
 +
DBSNP_VCF = $(REF_DIR)/dbsnp_135.b37.sites.vcf.gz
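To make the <code>KEY = value</code> format and the <code>$(REF_DIR)</code> references in the example concrete, here is a minimal Python sketch of how such a file could be read. It is an illustration under stated assumptions, not GotCloud's actual configuration loader: the function name <code>parse_config</code> is hypothetical, <code>#</code> is treated as starting a comment, and <code>$(NAME)</code> is expanded only from keys defined earlier in the same text.

```python
import re
from typing import Dict


def parse_config(text: str) -> Dict[str, str]:
    """Parse KEY = value lines, expanding $(NAME) from earlier keys."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        # Assumption: everything after '#' is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue  # blank line, comment-only line, or not a setting
        key, value = (part.strip() for part in line.split("=", 1))
        # Replace $(NAME) with an earlier value; leave unknown names as-is.
        value = re.sub(
            r"\$\((\w+)\)",
            lambda match: values.get(match.group(1), match.group(0)),
            value,
        )
        values[key] = value
    return values


example = """\
CHRS = 20 22
BAM_LIST = /path/freeze5.bam.list
OUT_DIR = /path/freeze5/output
REF_DIR = /path/reference/
REF = $(REF_DIR)/hs37d5.fa
"""
config = parse_config(example)
print(config["CHRS"].split())
print(config["REF"])
```

Because <code>REF_DIR</code> in the example ends with a slash and the reference paths begin with one, the expanded paths contain a doubled slash, which POSIX file systems treat the same as a single one.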
   −
Where KEY is the item being set and value is its new value
     −
====Required User Config Files Settings====
+
== Variant Calling Command-line Options/Configuration Settings ==
The following Config File Settings must be specified by the user:
+
{{:GotCloud: Variant Calling Options}}
* CHRS = space separated list of chromosomes you want
  −
* BAM_INDEX = path to the Index File of BAMs
     −
====Required on Command-Line or in Config File====
  −
The following Command-Line or Config File Settings must be specified by the user:
  −
* --outdir/OUT_DIR= path to desired output directory
     −
====Targeted/Exome Sequencing Settings====
+
== Use Cases & Recommended Settings ==
If you are running Targeted/Exome Sequencing, the user should specify:
+
=== Single Sample Processing ===
* Write loci file when performing pileup
+
To run single-sample processing, we recommend adding the following settings to your configuration file:
** WRITE_TARGET_LOCI = TRUE
+
UNIT_CHUNK = 20000000
* Specify the output sub-directory to store target information, for example: targetDir
+
MODEL_GLFSINGLE = TRUE
** Should not be a full path as this will go under the OUT_DIR directory.
+
MODEL_SKIP_DISCOVER = FALSE
** TARGET_DIR = targetDir
+
MODEL_AF_PRIOR = TRUE
 +
VCF_EXTRACT = $(REF_DIR)/snpOnly.vcf.gz
 +
EXT = $(REF_DIR)/ALL.chrCHR.phase3.combined.sites.unfiltered.vcf.gz $(REF_DIR)/chrCHR.filtered.sites.vcf.gz
   −
If all individuals have the same target:
+
Explanation of these settings:
* Specify the single bed file, for example: target.bed
+
* <code>UNIT_CHUNK</code> - because there is only one sample, process larger regions at a time than the default
** UNIFORM_TARGET_BED = target.bed
+
* <code>MODEL_GLFSINGLE</code> - single sample, so model glfsingle
 +
* <code>MODEL_SKIP_DISCOVER</code> - do not skip the variant discovery step
 +
* <code>MODEL_AF_PRIOR</code> - use AF prior for genotyping
 +
* <code>VCF_EXTRACT</code> - VCF file to use for extracting the site information to genotype
 +
**  This file is included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]]
 +
* <code>EXT</code> - VCF reference files to use for the external filtering
 +
** These files are included in the latest reference release: [[GotCloud:_Genetic_Reference_and_Resource_Files#hs37d5-db142|hs37d5-db142]]
   −
If not all individuals have the same target:
  −
* Specify the file containing the sample id -> bed map, for example: targetMap.txt
  −
** MULTIPLE_TARGET_MAP = targetMap.txt
  −
*** Each line of the file contains [SM_ID] [TARGET_BED]
     −
Optional Settings:
  −
* Extend the target region by a given number of bases, for example: 50
  −
** OFFSET_OFF_TARGET = 50
  −
  −
==== Configure Reference Files ====
  −
See [[#Reference Files| Reference Files]] for information on how to specify the reference files.
  −
  −
==== Chromosome X Calling ====
  −
* PED_INDEX = pedfile.ped
      
== Running ==
 
Line 126: Line 130:  
</code>
 
</code>
   −
Replace vc.conf with the appropriate path/name of the user's configuration file.
+
* Replace <code>vc.conf</code> with the path/name of the user's configuration file
 
+
** If you are not overriding any defaults, you can alternatively specify <code>--list path/bam.list</code> replacing <code>path/bam.list</code> with the path/name of your BAM list file.
If <code>OUT_DIR</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory.
+
* Replace <code>2</code> following <code>--numjobs</code> with the number of jobs to be run in parallel
 
+
* If <code>OUT_DIR</code> is not defined in the configuration file, add <code>--outdir</code> followed by the path to the user's desired output directory.
Update the value following <code>--numjobs</code> to the appropriate number of jobs to be run in parallel.
  −
 
      
=== Running on a Cluster ===
 
To run on the Cluster, the following settings need to be added to the configuration file:
+
See [[#Cluster Configuration|Cluster Configuration]] for information on how to configure GotCloud to run on a cluster.
 
  −
TODO: COMING SOON
  −
SLEEP_MULT =    20
  −
REMOTE_PREFIX =  # REMOTE_PREFIX : Set if cluster node see the directory differently (e.g. /net/mymachine/[original-dir])
  −
 
     −
=== Results ===
+
== Results ==
    
If there is a failure, you should see a message like:  
 
Line 150: Line 147:  
* glfs with a bams & samples subdirectory
 
 
* pvcfs with a subdirectory per chromosome and then per region
 
* split with a subdirectory per chromosome
+
* '''split''' with a subdirectory per chromosome
* vcfs with a subdirectory per chromosome
+
* '''vcfs''' with a subdirectory per chromosome
 
* (optionally your target directory)
 
   −
Under the vcfs/chrXX directory, there should be:
+
Under the '''vcfs/chrXX''' directory, there should be:
 
* chrXX.filtered.sites.vcf
 
* chrXX.filtered.sites.vcf.log
+
* chrXX.filtered.sites.vcf.norm.log
 
* chrXX.filtered.sites.vcf.summary
 
* chrXX.filtered.vcf.gz
+
* '''chrXX.filtered.vcf.gz''' - final filtered variant call file
 
* chrXX.filtered.vcf.gz.OK
 
 
* chrXX.filtered.vcf.gz.tbi
 
 +
* chrXX.hardfiltered.sites.vcf
 +
* chrXX.hardfiltered.sites.vcf.log
 +
* chrXX.hardfiltered.sites.vcf.summary
 +
* chrXX.hardfiltered.vcf.gz
 +
* chrXX.hardfiltered.vcf.gz.OK
 +
* chrXX.hardfiltered.vcf.gz.tbi
 
* chrXX.merged.sites.vcf
 
 
* chrXX.merged.stats.vcf
 
Line 170: Line 173:  
The filtered file is the merged.vcf after it has been run through the filters and marked with PASS/FAIL.
   −
Under the split/chrXX directory, there should be:
+
Under the '''split/chrXX''' directory, there should be:
 
* chrXX.filtered.PASS.split.[N].vcf.gz
 
 
* chrXX.filtered.PASS.split.err
 
 
* chrXX.filtered.PASS.split.vcflist
 
* chrXX.filtered.PASS.gz
+
* '''chrXX.filtered.PASS.gz''' - final variant call file with only PASS variants
 
 
* subset.OK
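The file listings above suggest a simple way to check a finished run: look for the final filtered VCF together with its <code>.OK</code> marker and <code>.tbi</code> index. The Python sketch below does only that. The function name <code>missing_final_outputs</code> and the example output path are hypothetical, and the sketch assumes the per-chromosome directory is <code>vcfs/chrXX</code> under the output directory, following the directory list above.

```python
from pathlib import Path
from typing import List


def missing_final_outputs(out_dir: str, chrom: str) -> List[str]:
    """Return the expected final snpcall files that are absent for one chromosome."""
    # Assumption: results live under <out_dir>/vcfs/chr<chrom>/.
    chrom_dir = Path(out_dir) / "vcfs" / f"chr{chrom}"
    final_vcf = f"chr{chrom}.filtered.vcf.gz"
    expected = [final_vcf, final_vcf + ".OK", final_vcf + ".tbi"]
    return [name for name in expected if not (chrom_dir / name).is_file()]


if __name__ == "__main__":
    # Hypothetical output directory and chromosomes, matching CHRS = 20 22.
    for chrom in ("20", "22"):
        missing = missing_final_outputs("/path/freeze5/output", chrom)
        status = "complete" if not missing else "missing " + ", ".join(missing)
        print(f"chr{chrom}: {status}")
```

An empty list means all three files are present for that chromosome; it does not by itself validate the contents of the VCF.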