Changes

From Genome Analysis Wiki
Line 1: Line 1: −
= Genetic Reference and Resource Files =
+
== Genetic Reference and Resource Files ==
    
Back to parent: [[GotCloud]]
 
Line 6: Line 6:     
You can generate your own files or use the set available for [[#Downloadable Reference and Resource Files|download]].
 
 +
* By default, GotCloud looks for the reference/resource files in the <code>gotcloud.ref</code> subdirectory within the base GotCloud directory
 +
* To look in a different directory, set either of the following to that path:
 +
** <code>REF_DIR</code> in your configuration file
 +
** <code>--ref_dir</code> on the command-line
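As a rough illustration of the two options above, the sketch below writes a configuration fragment that sets <code>REF_DIR</code>. The path <code>/data/gotcloud.ref</code>, the file name <code>my.conf</code>, and the <code>KEY = value</code> layout are assumptions made for this example, not taken from this page. The commented command only shows where <code>--ref_dir</code> would go; its other arguments are elided.

```shell
#!/bin/sh
# Hypothetical location of the reference/resource files (assumption).
ref_dir="/data/gotcloud.ref"

# Hypothetical configuration file name (assumption).
conf="${TMPDIR:-/tmp}/my.conf"

# Option 1: set REF_DIR in the configuration file.
# The "KEY = value" layout is assumed for illustration.
printf 'REF_DIR = %s\n' "$ref_dir" > "$conf"
cat "$conf"

# Option 2: pass the same path on the command line instead, e.g.
#   gotcloud align ... --ref_dir "$ref_dir"
# (other arguments elided; only --ref_dir comes from this page)
```

Only one of the two settings is needed.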
    +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Description !! Configuration Key !! Default Value !! Pipelines !! Special Info
 +
|-
 +
| [[#Reference fasta Files| Reference fasta]] || REF || $(REF_DIR)/human.g1k.v37.fa
 +
| align, snpcall, indel || [[#Additional files generated from the reference fasta|Additional Files Required]]
 +
|-
 +
| [[#DBSNP VCF File|DBSNP VCF File]] || DBSNP_VCF || $(REF_DIR)/dbsnp_135.b37.vcf.gz
 +
| align, snpcall || Must be tabixed
 +
|-
 +
| [[#HapMap3 VCF File|HapMap3 VCF File]] || HM3_VCF || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
 +
| align, snpcall || Must be tabixed
 +
|-
 +
| [[#OMNI VCF File|OMNI VCF File]] || OMNI_VCF || $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz
 +
| snpcall || Must be tabixed
 +
|-
 +
| rowspan="2"|[[#INDEL VCF File(s)|INDEL VCF File(s)]] || INDEL_PREFIX || $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19 || rowspan="2"|snpcall || .chr#.vcf extension will be appended
 +
|-
 +
| INDEL_VCF || ''alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome''||Must be tabixed
 +
|}
   −
== Required Files ==
     −
=== Human Reference Files ===
+
=== Reference fasta Files ===
 +
Reference Sequence in fasta format
 +
* Contains the reference base at each reference position
   −
=== DBSNP VCF File ===
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| REF || $(REF_DIR)/human.g1k.v37.fa
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| align || mapping to reference, recalibration, quality control
 +
|-
 +
| snpcall || pileup & identify variants, summarize filtered variants
 +
|-
 +
| indel || discovery, genotyping
 +
|}
   −
=== ===
+
==== Additional files generated from the reference fasta ====
 +
In addition to the fasta file, a few additional files generated from the fasta are required.
 +
* Already included with default reference files
 +
* If you are using your own reference files, you will need to be sure to create these files
 +
** Expected to be at the same location as the reference file
 +
** Be sure to create these additional files using the version of each tool that GotCloud runs (by default they are in the <code>gotcloud/bin/</code> directory)
 +
** In the commands below, replace <code>ref.fa</code> with the path/name of the reference fasta file
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Step !! Required Extensions !! Command to Create !! More Information
 +
|-
 +
| align, snpcall, indel || ||.fai || <code>bin/samtools faidx ref.fa</code>
 +
|-
 +
| align, snpcall, indel || || -bs.umfa || || If it does not already exist, GotCloud automatically creates this file in the same directory as the REF file
 +
|-
 +
| align || bwa mapping || .amb, .ann, .bwt, .pac, .sa || <code>bin/bwa index ref.fa</code> || http://bio-bwa.sourceforge.net/bwa.shtml
 +
|-
 +
| align || qplot || .winsize100.gc || <code>bin/qplot --reference ref.fa</code> || NOTE: Ignore the error at the end of qplot that says:
 +
<pre>FATAL ERROR -
 +
No SAM/BAM files provided, stopped!</pre>
 +
This error is due to using qplot to just generate a GC Content file and not also process a BAM file.
   −
= Downloadable Reference and Resource Files =
+
[[QPLOT#Input_files|QPLOT: InputFiles]]
 +
|}
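Before running a pipeline with your own reference, it may help to confirm that the generated files sit next to the fasta. The following is a minimal sketch, not part of GotCloud: the function name <code>check_ref_files</code> is hypothetical. It assumes that <code>samtools faidx</code> and <code>bwa index</code> append their extensions to the full fasta name (<code>ref.fa.fai</code>, <code>ref.fa.bwt</code>, and so on), and that the GC content file replaces the final extension (<code>ref.fa</code> becomes <code>ref.winsize100.gc</code>). The <code>-bs.umfa</code> file is not checked because GotCloud creates it automatically.

```shell
#!/bin/sh
# Report which files derived from a reference fasta are missing.
# Usage: check_ref_files path/to/ref.fa
# Returns 0 when everything is present, 1 otherwise.
check_ref_files() {
    ref="$1"
    missing=0
    # GC content file: path/ref.fa -> path/ref.winsize100.gc
    gc="${ref%.*}.winsize100.gc"
    # samtools faidx writes ref.fa.fai; bwa index writes ref.fa.amb, .ann, .bwt, .pac, .sa
    for f in "$ref" "$ref.fai" "$ref.amb" "$ref.ann" "$ref.bwt" "$ref.pac" "$ref.sa" "$gc"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, <code>check_ref_files gotcloud.ref/human.g1k.v37.fa</code> prints one <code>missing:</code> line per absent file and returns a non-zero status if any are absent.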
   −
'''Installing Genetic Reference and Resource Files'''
+
=== DBSNP VCF File ===
 
+
VCF file containing known dbsnp variant positions
 
+
* Must be bgzip'd and tabix'd
'''Get the Resource Files
  −
'''
  −
The GotCloud Aligner and Umake make use of various reference and other genetic resource files.
  −
You are free to use your own files, of course, but we also are making the files we use available.
     −
<code>
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
# The easiest way to get the data:
+
! Configuration Key !! Default Value
<b>cd /tmp</b>
+
|-
<b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db135-v2.tgz</b>
+
| DBSNP_VCF || $(REF_DIR)/dbsnp_142.b37.vcf.gz
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| align || recalibration (exclude known dbsnps when generating recalibration tables) & qplot
 +
|-
 +
| snpcall || generating filtered VCF summary statistics
 +
|}
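If you supply your own VCF, the "bgzip'd and tabix'd" requirement can be met roughly as follows. This is a sketch under assumptions: the function name <code>bgzip_and_tabix</code> is hypothetical, <code>bgzip</code> and <code>tabix</code> are assumed to be on your <code>PATH</code>, and the input VCF is assumed to be uncompressed and already sorted by position.

```shell
#!/bin/sh
# Compress a plain-text VCF with bgzip and index it with tabix.
# Usage: bgzip_and_tabix path/to/sites.vcf
bgzip_and_tabix() {
    vcf="$1"
    if [ ! -f "$vcf" ]; then
        echo "no such file: $vcf" >&2
        return 1
    fi
    # bgzip -c writes the compressed stream to stdout, leaving the input in place.
    bgzip -c "$vcf" > "$vcf.gz" || return 1
    # tabix -p vcf builds the index next to the compressed file as "$vcf.gz.tbi".
    tabix -p vcf "$vcf.gz"
}
```

The resulting <code>.vcf.gz</code> file is the one to name in the configuration key; the <code>.tbi</code> index must stay in the same directory.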
   −
#  Another way:
+
=== HapMap3 VCF File ===
<b>cd /tmp</b>
+
HapMap3 Polymorphic Sites VCF File
<b>ftp share.sph.umich.edu</b>
+
* Must be bgzip'd and tabix'd
Connected to share.sph.umich.edu.
  −
220 (vsFTPd 2.3.5)
  −
Name (share.sph.umich.edu:tpg): <b>anonymous</b>
  −
230 Login successful.
  −
Remote system type is UNIX.
  −
Using binary mode to transfer files.
  −
ftp> <b>prompt</b>
  −
Interactive mode off.
  −
ftp> <b>cd gotcloud</b>
  −
250 Directory successfully changed.
  −
ftp> <b>mget h37-db135.tar.gz</b>
  −
ftp> <b>quit</b>
  −
221 Goodbye.
  −
</code>
     −
'''Install the Resource Files'''
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| HM3_VCF || $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| align || verifyBamID (contamination checking)
 +
|-
 +
| snpcall || generating filtered VCF summary statistics & positive example sites for SVM filtering
 +
|}
   −
Choose a destination for these files and install them as shown below (we'll assume you will use '''/usr/local/gotcloud.ref''').
+
=== OMNI VCF File ===
 +
VCF file containing OMNI positions
 +
* Must be bgzip'd and tabix'd
   −
<code>
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
<b>mkdir -p /usr/local/gotcloud.ref</b>    # Where you want the files installed
+
! Configuration Key !! Default Value
<b>cd /usr/local/gotcloud.ref</b>
+
|-
<b>tar xzf h37-db135.tar.gz</b>
+
| OMNI_VCF || $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz
<b>rm -f h37-db135.tar.gz</b>
+
|}
</code>
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| snpcall || positive example sites for SVM filtering
 +
|}
   −
Note this path as you will need to set the variable '''REF_DIR''' in the configuration file for gotcloud.
+
=== INDEL VCF File(s) ===
 +
VCF file containing known INDEL positions
   −
= Using Your own Reference Files =
+
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Configuration Key !! Default Value
 +
|-
 +
| INDEL_PREFIX || $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19
 +
|-
 +
| INDEL_VCF || ''alternate configuration setting if all INDEL sites are in a single VCF rather than broken up by chromosome''
 +
|}
 +
{| class="wikitable" style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
! Pipeline !! Use
 +
|-
 +
| snpcall || used to filter variants that are too close to a known indel
 +
|}
   −
== Human Reference ==
+
* Use <code>INDEL_PREFIX</code> if <code>path/</code> contains a separate file for each chromosome in the format: <code>indels.sites.hg19.chr#.vcf</code> for each <code>#</code> chromosome being processed
 +
* Use <code>INDEL_VCF</code> if you have all chromosomes in a single VCF file (it can be, but does not have to be, a gz file)
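To see which per-chromosome files an <code>INDEL_PREFIX</code> value implies, the naming rule can be sketched as below. The function name <code>indel_files</code> and the chromosome list are made up for this example; the prefix is the default value with <code>gotcloud.ref</code> standing in for <code>$(REF_DIR)</code>.

```shell
#!/bin/sh
# Print the per-chromosome VCF name expected for each chromosome.
# Usage: indel_files PREFIX CHR [CHR...]
indel_files() {
    prefix="$1"
    shift
    for chr in "$@"; do
        # ".chr#.vcf" is appended to the prefix, with # replaced by the chromosome.
        echo "${prefix}.chr${chr}.vcf"
    done
}

indel_files "gotcloud.ref/1kg.pilot_release.merged.indels.sites.hg19" 20 21 22
# prints:
#   gotcloud.ref/1kg.pilot_release.merged.indels.sites.hg19.chr20.vcf
#   gotcloud.ref/1kg.pilot_release.merged.indels.sites.hg19.chr21.vcf
#   gotcloud.ref/1kg.pilot_release.merged.indels.sites.hg19.chr22.vcf
```

If any of the listed files is absent for a chromosome you plan to process, either add it or switch to <code>INDEL_VCF</code> with a single combined file.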
   −
=== Generating BWA Reference Files ===
+
== Downloadable Reference and Resource Files ==
Use "bwa index" to generate the human reference files with the required extensions:
+
* When running on Amazon, a default set of reference files is included in the GotCloud AMI in the default <code>REF_DIR</code>
* .amb
  −
* .ann
  −
* .bwt
  −
* .fai
  −
* .pac
  −
* .rbwt
  −
* .rpac
  −
* .rsa
  −
* .sa
     −
See http://bio-bwa.sourceforge.net/bwa.shtml for more information about using "bwa index".
     −
=== Generating GC Content File ===
+
'''Installing Genetic Reference and Resource Files'''
The GC Content file is used by QPLOT.  It is assumed to be at the same location as the reference file.
     −
If the reference file is at path/ref.fa, the GC Content file is expected to be: path/ref.winsize100.gc
+
Choose a destination for these files and install them as shown below.  We'll assume you will use '''gotcloud/gotcloud.ref'''. Replace <code>gotcloud</code> with the path to where you installed gotcloud.
    +
<code>
 +
<b>cd gotcloud</b>  # path to where you installed gotcloud
 +
</code>
   −
To generate the GC content file, run qplot:
+
If you install to a path other than the <code>gotcloud.ref</code> subdirectory of gotcloud, note that path; you will need to set either of the following to it:
GOTCLOUD_DIR/bin/qplot --reference reference.fa --winsize windowSize
+
* <code>REF_DIR</code> in your configuration file
* Replace reference.fa with the name of your human reference fasta file.
+
* <code>--ref_dir</code> on the command-line
* Replace windowSize with your desired window size, or leave out --winsize to use the default (100).
     −
NOTE: You will get an error at the end of qplot that says:
  −
<pre>
  −
FATAL ERROR -
  −
No SAM/BAM files provided, stopped!
  −
</pre>
  −
This error is due to using qplot to just generate a GC Content file and not also process a BAM file.
     −
But it was successful as long as you see (where reference is the name of your reference file):
+
'''Get & Install the Resource Files'''
<pre>
  −
GC content file [ reference.winsize100.gc ] created.
  −
</pre>
      +
GotCloud makes use of various reference and other genetic resource files.
 +
You are free to use your own files, of course, but we also make the files we use available.
   −
See [[QPLOT#Input_files|QPLOT: InputFiles]] for more information.
+
<ul>
 +
<li> <div id="h37-db135">Human reference 37, dbsnp 135:</div></li>
 +
<b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db135-v3.tgz</b>
 +
<b>tar xzf h37-db135-v3.tgz</b>
 +
<b>rm -f h37-db135-v3.tgz</b>
 +
<li><div id="h37-db142">Human reference 37, dbsnp 142:</div></li>
 +
<b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/h37-db142-v1.tgz</b>
 +
<b>tar xzf h37-db142-v1.tgz</b>
 +
<b>rm -f h37-db142-v1.tgz</b>
 +
<li><div id="hs37d5-db142">Human reference 37 with decoy, dbsnp 142:</div></li>
 +
<b>wget ftp://anonymous@share.sph.umich.edu/gotcloud/ref/hs37d5-db142-v1.tgz</b>
 +
<b>tar xzf hs37d5-db142-v1.tgz</b>
 +
<b>rm -f hs37d5-db142-v1.tgz</b>
 +
</ul>
