Changes

Line 1: Line 1: −
= Genetic Reference and Resource Files =
+
== Genetic Reference and Resource Files ==
    
Back to parent: [[GotCloud]]
 
Line 6: Line 6:     
You can generate your own files or use the set available for [[#Downloadable Reference and Resource Files|download]].
 
 +
* By default, GotCloud looks for the reference/resource files in the <code>gotcloud.ref</code> subdirectory within the base GotCloud directory
 +
* To look in a different directory, set either of the following to that path:
 +
** <code>REF_DIR</code> in your configuration file
 +
** <code>--ref_dir</code> on the command-line
    +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Description !! Configuration Key !! Default Value !! Pipelines !! Special Info
 +
|-
 +
| [[#Reference fasta Files| Reference fasta]] || REF || $(REF_DIR)/human.g1k.v37.fa
 +
| align, snpcall, indel || [[#Additional files generated from the reference fasta|Additional Files Required]]
 +
|-
 +
| [[#DBSNP VCF File|DBSNP VCF File]] || DBSNP_VCF || $(REF_DIR)/dbsnp_135.b37.vcf.gz
 +
| align, snpcall || Must be tabixed
 +
|-
 +
| [[#HapMap3 VCF File|HapMap3 VCF File]] || HM3_VCF || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
 +
| align, snpcall || Must be tabixed
 +
|-
 +
| [[#OMNI VCF File|OMNI VCF File]] || OMNI_VCF || $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz
 +
| snpcall || Must be tabixed
 +
|-
 +
| rowspan="2"|[[#INDEL VCF File(s)|INDEL VCF File(s)]] || INDEL_PREFIX || $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19 || rowspan="2"|snpcall || .chr#.vcf extension will be appended
 +
|-
 +
| INDEL_VCF || ''alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome''||Must be tabixed
 +
|}
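
Several of the files in the table above must be tabixed. As a rough sketch (not part of GotCloud), the shell function below lists every <code>.vcf.gz</code> file in a directory that has no <code>.tbi</code> index next to it. The function name <code>missing_tbi</code> is hypothetical, and the check assumes the index sits beside the VCF under the same name plus <code>.tbi</code>.

```shell
# Print every bgzipped VCF in a directory that has no tabix index (.tbi) beside it.
# Hypothetical helper, not shipped with GotCloud.
# Usage: missing_tbi /path/to/gotcloud.ref
missing_tbi() {
    dir="$1"
    for vcf in "$dir"/*.vcf.gz; do
        [ -e "$vcf" ] || continue          # the glob matched nothing
        [ -e "$vcf.tbi" ] || echo "$vcf"   # report a VCF without an index
    done
}
```

A file reported by this check can usually be indexed with <code>tabix -p vcf file.vcf.gz</code>, after compressing it with <code>bgzip</code> if it is not already compressed.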
   −
== Required Files ==
     −
=== Human Reference Files ===
+
=== Reference fasta Files ===
 +
Reference Sequence in fasta format
 +
* Contains reference base at each reference position
 +
 
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| REF || $(REF_DIR)/human.g1k.v37.fa
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| align || mapping to reference, recalibration, quality control
 +
|-
 +
| snpcall || pileup & identify variants, summarize filtered variants
 +
|-
 +
| indel || discovery, genotyping
 +
|}
 +
 
 +
==== Additional files generated from the reference fasta ====
 +
In addition to the fasta file, a few additional files generated from it are required
 +
* Already included with default reference files
 +
* If you are using your own reference files, you will need to be sure to create these files
 +
** Expected to be at the same location as the reference file
 +
** Be sure to create these additional files using the version of the tool being run by GotCloud (by default they are in the <code>gotcloud/bin/</code> directory)
 +
** In the commands below, replace <code>ref.fa</code> with the path/name of the reference fasta file
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Step !! Required Extensions !! Command to Create !! More Information
 +
|-
 +
| align, snpcall, indel || ||.fai || <code>bin/samtools faidx ref.fa</code>
 +
|-
 +
| align, snpcall, indel || || -bs.umfa || || If it does not already exist, GotCloud automatically creates this file in the same directory as the REF file
 +
|-
 +
| align || bwa mapping || .amb, .ann, .bwt, .pac, .sa || <code>bin/bwa index ref.fa</code> || http://bio-bwa.sourceforge.net/bwa.shtml
 +
|-
 +
| align || qplot || .winsize100.gc || <code>bin/qplot --reference ref.fa</code> || NOTE: Ignore the error at the end of the qplot output that says:
 +
<pre>FATAL ERROR -
 +
No SAM/BAM files provided, stopped!</pre>
 +
This error occurs because qplot is being used only to generate a GC content file, without also processing a BAM file.
 +
 
 +
[[QPLOT#Input_files|QPLOT: InputFiles]]
 +
|}
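
The following is a minimal sketch, not part of GotCloud, for checking a custom reference before running the pipelines. It reports which of the index files from the table above are missing next to the fasta. The function name <code>missing_ref_files</code> is hypothetical. It covers only the <code>samtools faidx</code> and <code>bwa index</code> outputs, assuming each is named as the fasta path plus the extension. The <code>-bs.umfa</code> file is skipped because GotCloud creates it automatically, and the GC content file is skipped because its exact name is not spelled out here.

```shell
# List index files expected next to a reference fasta that are missing.
# Hypothetical helper, not shipped with GotCloud.
# Usage: missing_ref_files /path/to/ref.fa
missing_ref_files() {
    ref="$1"
    # .fai comes from "samtools faidx"; the other five come from "bwa index".
    for ext in fai amb ann bwt pac sa; do
        [ -e "$ref.$ext" ] || echo "$ref.$ext"
    done
}
```

If the function prints nothing, all six files are present. Otherwise, rerun the matching command from the table for each file it lists.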
    
=== DBSNP VCF File ===
 
 +
VCF file containing known dbsnp variant positions
 +
* Must be bgzip'd and tabix'd
 +
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| DBSNP_VCF || $(REF_DIR)/dbsnp_142.b37.vcf.gz
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| align || recalibration (exclude known dbsnps when generating recalibration tables) & qplot
 +
|-
 +
| snpcall || generating filtered VCF summary statistics
 +
|}
 +
 +
=== HapMap3 VCF File ===
 +
HapMap3 Polymorphic Sites VCF File
 +
* Must be bgzip'd and tabix'd
 +
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| HM3_VCF || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| align || verifyBamID (contamination checking)
 +
|-
 +
| snpcall || generating filtered VCF summary statistics & positive example sites for SVM filtering
 +
|}
   −
=== ===
+
=== OMNI VCF File ===
 +
VCF file containing OMNI positions
 +
* Must be bgzip'd and tabix'd
   −
= Downloadable Reference and Resource Files =
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| OMNI_VCF || $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| snpcall || positive example sites for SVM filtering
 +
|}
   −
'''Installing Genetic Reference and Resource Files'''
+
=== INDEL VCF File(s) ===
 +
VCF file containing known INDEL positions
    +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| INDEL_PREFIX || $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19
 +
|-
 +
| INDEL_VCF || ''alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome''
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| snpcall || used to filter variants that are too close to a known indel
 +
|}
   −
'''Get the Resource Files
+
* Use <code>INDEL_PREFIX</code> if <code>path/</code> contains a separate file for each chromosome in the format: <code>indels.sites.hg19.chr#.vcf</code> for each <code>#</code> chromosome being processed
'''
+
* Use <code>INDEL_VCF</code> if you have all chromosomes in a single VCF file (it can be, but does not have to be, a gz file)
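
To illustrate how <code>INDEL_PREFIX</code> maps to per-chromosome files, here is a small sketch that builds the expected file name by appending <code>.chr#.vcf</code> to the prefix. The function name <code>indel_vcf_for_chr</code> is hypothetical and the function is not part of GotCloud. It only shows the naming pattern described above.

```shell
# Build the per-chromosome INDEL VCF path for a given INDEL_PREFIX value.
# Hypothetical helper, not shipped with GotCloud.
# Usage: indel_vcf_for_chr PREFIX CHR
indel_vcf_for_chr() {
    printf '%s.chr%s.vcf\n' "$1" "$2"
}
```

For example, <code>indel_vcf_for_chr path/indels.sites.hg19 20</code> prints <code>path/indels.sites.hg19.chr20.vcf</code>.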
The GotCloud Aligner and Umake makes use of various reference and other genetic resource files.
  −
You are free to use your own files, of course, but we also are making the files we use available.
     −
<code>
+
== Downloadable Reference and Resource Files ==
#  The easiest way to get the data:
+
* When running on Amazon, a default set of reference files is included in the GotCloud AMI in the default <code>REF_DIR</code>
<b>cd /tmp</b>
  −
<b>wget ftp://share.sph.umich.edu/gotcloud/hs37-db132.tar.gz</b>
     −
#  Another way:
  −
<b>cd /tmp</b>
  −
<b>ftp share.sph.umich.edu</b>
  −
Connected to share.sph.umich.edu.
  −
220 (vsFTPd 2.3.5)
  −
Name (share.sph.umich.edu:tpg): <b>anonymous</b>
  −
230 Login successful.
  −
Remote system type is UNIX.
  −
Using binary mode to transfer files.
  −
ftp> <b>prompt</b>
  −
Interactive mode off.
  −
ftp> <b>cd gotcloud</b>
  −
250 Directory successfully changed.
  −
ftp> <b>mget hs37-db132.tar.gz</b>
  −
ftp> <b>quit</b>
  −
221 Goodbye.
  −
</code>
     −
'''Install the Resource Files'''
+
'''Installing Genetic Reference and Resource Files'''
   −
Choose a destination for these files and install them as shown below (we'll assume you will use '''/usr/local/gotcloud.ref''').
+
Choose a destination for these files and install them as shown below.  We'll assume you will use '''gotcloud/gotcloud.ref'''.  Replace <code>gotcloud</code> with the path to where you installed GotCloud.
    
<code>
 
<code>
  <b>mkdir -p /usr/local/gotcloud.ref</b>   # Where you want the files installed
+
  <b>cd gotcloud</b>   # path to where you installed gotcloud
<b>cd /usr/local/gotcloud.ref</b>
  −
<b>tar xzvf hs37-db132.tar.gz</b>
  −
  ref/
  −
  ref/hs37d5.fa.fai
  −
  ref/metabochip.batch2.broken.b37.chr2.plink.MAF01.bed
  −
  ref/hs37d5-bs.umfa
  −
  ref/metabochip.batch2.broken.b37.chr2.plink.MAF01.fam
  −
  ref/dbsnp_132.b37.vcf.gz.tbi
  −
  ref/dbsnp_132.UCSC.coordinates.tbl
  −
    [lines deleted]
  −
<b>rm -f hs37-db132.tar.gz</b>
   
</code>
 
</code>
   −
Note this path as you will need to set the variable '''REF_DIR''' in the configuration file or options
+
If you install to a path other than the <code>gotcloud.ref</code> subdirectory of the GotCloud directory, note the path, as you will need to set either of the following to it:
'''gen_biopipeline.pl''' and '''umake.pl'''.
+
* <code>REF_DIR</code> in your configuration file
 +
* <code>--ref_dir</code> on the command-line
 +
 
 +
 
 +
'''Get & Install the Resource Files'''
 +
 
 +
GotCloud makes use of various reference and other genetic resource files.
 +
You are free to use your own files, of course, but we also make the files we use available.
 +
 
 +
<ul>
 +
<li> <div id="h37-db135">Human reference 37, dbsnp 135:</div></li>
 +
<b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db135-v3.tgz</b>
 +
<b>tar xzf h37-db135-v3.tgz</b>
 +
<b>rm -f h37-db135-v3.tgz</b>
 +
<li><div id="h37-db142">Human reference 37, dbsnp 142:</div></li>
 +
<b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db142-v1.tgz</b>
 +
<b>tar xzf h37-db142-v1.tgz</b>
 +
<b>rm -f h37-db142-v1.tgz</b>
 +
<li><div id="hs37d5-db142">Human reference 37 with decoy, dbsnp 142:</div></li>
 +
<b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/hs37d5-db142-v1.tgz</b>
 +
<b>tar xzf hs37d5-db142-v1.tgz</b>
 +
<b>rm -f hs37d5-db142-v1.tgz</b>
 +
</ul>
