Changes

From Genome Analysis Wiki
Line 164: Line 164:     
=== GotCloud FASTQ Index File ===
 
=== GotCloud FASTQ Index File ===
You need to tell GotCloud about each FASTQ file
+
The FASTQ index file is created by you to tell GotCloud about each of your FASTQ files:
* Full path
+
* Where to find it
 
* Sample name
 
* Sample name
 
** Each sample can have multiple FASTQs
 
** Each sample can have multiple FASTQs
 
** Each FASTQ is for a single sample
 
** Each FASTQ is for a single sample
 +
* Run identifier
 +
** For recalibration we need to know which reads were in the same run.
   −
The FASTQ index file is created by you to direct GotCloud to your FASTQ files, providing additional information for them.  
+
FASTQ Index Format:
 +
* Tab delimited
 +
* Starts with a header line
 +
* One line per single-end read
 +
* One line per paired-end read (only 1 line per pair).  
   −
* tab delimited
+
Let's take a look at the index file I prepared for this tutorial:
* columns may be in any order
+
less -S ${IN}/align.index
* starts with a header line
  −
* one line per single-end read
  −
* one line per paired-end read (only 1 line per pair).  
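To make the format concrete, here is a minimal Python sketch that parses a small index and checks for the required columns. The sample names, file paths, and the helper name <code>read_index</code> are hypothetical; only the column names (MERGE_NAME, FASTQ1, FASTQ2, RGID) come from this page.

```python
import csv
import io

# Hypothetical index contents: tab delimited, header line first,
# one line per paired-end pair or per single-end FASTQ.
INDEX_TEXT = (
    "MERGE_NAME\tFASTQ1\tFASTQ2\tRGID\n"
    "SampleA\tfastq/A_run1_1.fastq.gz\tfastq/A_run1_2.fastq.gz\trun1\n"
    "SampleA\tfastq/A_run2_1.fastq.gz\tfastq/A_run2_2.fastq.gz\trun2\n"
    "SampleB\tfastq/B_run3.fastq.gz\t.\trun3\n"
)


def read_index(handle):
    """Return one dict per index line, keyed by header column name."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"MERGE_NAME", "FASTQ1"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    return list(reader)


rows = read_index(io.StringIO(INDEX_TEXT))

# A line is paired-end when FASTQ2 is present and is not the '.' placeholder.
paired = [r for r in rows if (r.get("FASTQ2") or ".") != "."]
print(len(rows), "index lines,", len(paired), "of them paired-end")
```

Because the rows are keyed by header name, the sketch does not depend on the order of the columns.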
      +
; Which samples had multiple Runs?
 +
<ul>
 +
<div class="mw-collapsible mw-collapsed" style="width:500px">
 +
<li>Need a reminder of the format?</li>
 +
<div class="mw-collapsible-content">
 +
[[File:fqindex.png|650px]]
 +
</div>
 +
</div>
 +
<div class="mw-collapsible mw-collapsed" style="width:500px">
 +
<li>Answer:</li>
 +
<div class="mw-collapsible-content">
 +
<ul>
 +
<li>HG00553 & HG00640</li>
 +
<li>They have multiple unique values in the RGID field</li>
 +
[[File:fqindexRG.png|650px]]
 +
</div>
 +
</div>
 +
</ul>
 +
</ul>
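The same question can also be answered programmatically. The following is a minimal sketch, assuming the index has a MERGE_NAME column for the sample and an RGID column for the run; the function name and the rows below are made-up placeholders, not the tutorial's data.

```python
import csv
import io
from collections import defaultdict


def samples_with_multiple_runs(handle, sample_col="MERGE_NAME", run_col="RGID"):
    """Return sorted sample names that have more than one distinct run ID."""
    runs = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        runs[row[sample_col]].add(row[run_col])
    return sorted(name for name, ids in runs.items() if len(ids) > 1)


# Placeholder rows (not the tutorial's data): SampleA spans two runs.
example = io.StringIO(
    "MERGE_NAME\tFASTQ1\tFASTQ2\tRGID\n"
    "SampleA\ta_1.fastq\ta_2.fastq\trunX\n"
    "SampleA\tb_1.fastq\tb_2.fastq\trunY\n"
    "SampleB\tc_1.fastq\tc_2.fastq\trunZ\n"
)
multi = samples_with_multiple_runs(example)
print(multi)
```

To run it on a real index file, pass an open file handle (for example <code>open(path, newline="")</code>) in place of the <code>io.StringIO</code> object.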
   −
'''Required Columns'''
  −
{|class="wikitable" cellpadding=5
  −
! Column Name !! Description !! Recommended Value
  −
|-
  −
| MERGE_NAME ||
  −
* Base name for the resulting BAM file for the sample
  −
* Used to group multiple fastqs or fastq pairs into a single BAM
  −
| Sample Name
  −
|-
  −
| FASTQ1 ||
  −
* Name of the fastq or the first in the pair if paired-end.  (Only 1 line per pair)
  −
| path/fastq1
  −
|-
  −
| FASTQ2 ||
  −
*Name of the 2nd fastq in paired-end reads. 
  −
*Column is not required if all fastqs are single-end
  −
* Use '.' if the column is present but this line is single-end.
  −
| path/fastq2
  −
|}
     −
 
+
How do you point GotCloud to your index file?
The following columns are optional and used to populate the Read Group Information in the BAM file.
+
* Command-line <code>--index_file</code> option
* The RGID field is required if any of these fields are used; the others are optional.
+
: or
 
+
* Configuration file <code>INDEX_FILE</code> setting.   
What is a Read Group?
  −
* Groups reads together
  −
* Used for recalibration
  −
** Each sequencing run should get a different ReadGroup
  −
* Typically a new name for each fastq pair/group
  −
 
  −
If you do not want the field for:
  −
* any fastq, leave the column out of the header line
  −
* a single line, use a '.'
  −
 
  −
 
  −
'''Optional Columns'''
  −
{|class="wikitable" cellpadding=5
  −
|-
  −
! Column Name !! Description !! Recommended Value
  −
|-
  −
| RGID || Read Group ID || Run ID
  −
|-
  −
| SAMPLE || Sample Name || Sample Name
  −
|-
  −
| LIBRARY || Library
  −
* separate FASTQs for a sample that were prepped separately
  −
| if you don't know or it is all the same, use Sample Name
  −
|-
  −
| CENTER || Center Name || Name of the sequencing center producing the FASTQ
  −
|-
  −
| PLATFORM || Platform || CAPILLARY, LS454, ILLUMINA,
  −
SOLID, HELICOS, IONTORRENT, or PACBIO
  −
|}
  −
 
  −
Your sequencing core may provide you with a file containing the information needed to fill in these columns.
  −
 
  −
For our example, we have <code>sequence.index</code> which contains the information from 1000 Genomes for the FASTQs we are processing.
  −
less -S ${GC}/inputs/fastq/sequence.index  
  −
 
  −
In this file, we want the SAMPLE_NAME, FASTQ_FILE, RUN_ID, LIBRARY_NAME, CENTER_NAME, and INSTRUMENT_PLATFORM fields (columns 10, 1, 3, 15, 6, and 13, respectively).
  −
* You can use Perl, awk, or other Linux tools to extract these fields and format them as necessary.
  −
* I prepared a perl script that you can use:
  −
perl ${GC}/scripts/genIndex.pl > ${SETUP}/align.index
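If you would rather see the idea spelled out, here is a rough Python sketch of the same kind of conversion. It is not a reproduction of <code>genIndex.pl</code>. It assumes the input has a header line naming the fields listed above, and that the two mates of a paired-end run share a RUN_ID and have file names containing <code>_1.</code> and <code>_2.</code>; that naming rule, the function names, and the example rows are all assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

OUT_COLUMNS = ["MERGE_NAME", "FASTQ1", "FASTQ2", "RGID",
               "SAMPLE", "LIBRARY", "CENTER", "PLATFORM"]


def index_rows(seq_handle):
    """Yield one index row per FASTQ pair, or per single-end FASTQ."""
    by_run = defaultdict(list)
    for row in csv.DictReader(seq_handle, delimiter="\t"):
        by_run[row["RUN_ID"]].append(row)
    for run_id, rows in by_run.items():
        # Assumed naming rule: mates are named *_1.* and *_2.*
        first = [r for r in rows if "_1." in r["FASTQ_FILE"].rsplit("/", 1)[-1]]
        second = [r for r in rows if "_2." in r["FASTQ_FILE"].rsplit("/", 1)[-1]]
        if len(first) == 1 and len(second) == 1:
            entries = [(first[0], second[0]["FASTQ_FILE"])]
            used = (first[0], second[0])
        else:
            entries, used = [], ()
        # Everything that was not paired becomes a single-end line ('.').
        entries += [(r, ".") for r in rows if not any(r is u for u in used)]
        for row, fastq2 in entries:
            yield {
                "MERGE_NAME": row["SAMPLE_NAME"],
                "FASTQ1": row["FASTQ_FILE"],
                "FASTQ2": fastq2,
                "RGID": run_id,
                "SAMPLE": row["SAMPLE_NAME"],
                "LIBRARY": row["LIBRARY_NAME"],
                "CENTER": row["CENTER_NAME"],
                "PLATFORM": row["INSTRUMENT_PLATFORM"],
            }


def write_index(seq_handle, out_handle):
    """Write a tab-delimited index with a header line to out_handle."""
    writer = csv.DictWriter(out_handle, fieldnames=OUT_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(index_rows(seq_handle))


# Hypothetical input with only the fields needed here.
SEQ = (
    "FASTQ_FILE\tRUN_ID\tCENTER_NAME\tSAMPLE_NAME\tINSTRUMENT_PLATFORM\tLIBRARY_NAME\n"
    "fastq/S1/R1_1.fastq.gz\tR1\tCTR\tS1\tILLUMINA\tLib1\n"
    "fastq/S1/R1_2.fastq.gz\tR1\tCTR\tS1\tILLUMINA\tLib1\n"
    "fastq/S1/R1.fastq.gz\tR1\tCTR\tS1\tILLUMINA\tLib1\n"
)
out = io.StringIO()
write_index(io.StringIO(SEQ), out)
print(out.getvalue(), end="")
```

Looking fields up by header name rather than by column number keeps the sketch working even if the input columns are reordered.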
  −
 
  −
Let's look at the index file:
  −
less -S ${SETUP}/align.index
  −
[[File:Align index.png|1000px]]
  −
 
  −
The command-line <code>--fastq</code> option or the configuration file <code>FASTQ_PREFIX</code> setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths.
  −
 
  −
This file is specified either via the command-line <code>--index_file</code> parameter or via the configuration file <code>INDEX_FILE</code> setting.   
      
The command-line setting takes precedence over the configuration file setting.
 
The command-line setting takes precedence over the configuration file setting.
