Line 164: |
Line 164: |
| | | |
| === GotCloud FASTQ Index File === | | === GotCloud FASTQ Index File === |
− | You need to tell GotCloud about each FASTQ file
| + | The FASTQ index file is created by you to tell GotCloud about each of your FASTQ files: |
− | * Full path | + | * Where to find it |
| * Sample name | | * Sample name |
| ** Each sample can have multiple FASTQs | | ** Each sample can have multiple FASTQs |
| ** Each FASTQ is for a single sample | | ** Each FASTQ is for a single sample |
| + | * Run identifier |
| + | ** For recalibration we need to know which reads were in the same run. |
| | | |
− | The FASTQ index file is created by you to direct GotCloud to your FASTQ files, providing additional information for them.
| + | FASTQ Index Format: |
| + | * Tab delimited |
| + | * Starts with a header line |
| + | * One line per single-end read |
| + | * One line per paired-end read (only 1 line per pair). |
| | | |
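The format rules above can be checked mechanically before handing the file to GotCloud. The following is a minimal sketch, not part of GotCloud: the function name <code>check_fastq_index</code> is hypothetical, and it assumes the header uses the <code>MERGE_NAME</code> and <code>FASTQ1</code> column names listed as required in the older revision of this page.

```shell
# Sketch: sanity-check a tab-delimited FASTQ index file.
# Assumption: the header line contains MERGE_NAME and FASTQ1 columns.
# check_fastq_index is a hypothetical helper, not a GotCloud command.
check_fastq_index() {
    awk -F'\t' '
        NR == 1 {
            # Map header names to column positions (any column order works).
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("MERGE_NAME" in col) || !("FASTQ1" in col)) {
                print "header is missing MERGE_NAME or FASTQ1"
                bad = 1
                exit
            }
            nf = NF
            next
        }
        NF != nf {
            # Every data line should have as many fields as the header.
            printf "line %d has %d fields, expected %d\n", NR, NF, nf
            bad = 1
        }
        END {
            if (NR == 0) { print "empty index file"; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}
```

The function returns a non-zero status when a problem is found, so it could be run as <code>check_fastq_index ${IN}/align.index</code> before starting the alignment.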
− | * tab delimited
| + | Let's take a look at the index file I prepared for this tutorial: |
− | * columns may be in any order
| + | less -S ${IN}/align.index |
− | * starts with a header line
| |
− | * one line per single-end read
| |
− | * one line per paired-end read (only 1 line per pair).
| |
| | | |
| + | ; Which samples had multiple runs? |
| + | <ul> |
| + | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
| + | <li>Need a reminder of the format?</li> |
| + | <div class="mw-collapsible-content"> |
| + | [[File:fqindex.png|650px]] |
| + | </div> |
| + | </div> |
| + | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
| + | <li>Answer:</li> |
| + | <div class="mw-collapsible-content"> |
| + | <ul> |
| + | <li>HG00553 & HG00640</li> |
| + | <li>They have multiple unique values in the RGID field</li> |
| + | [[File:fqindexRG.png|650px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| + | </ul> |
| | | |
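The question above can also be answered without reading the file by eye. The sketch below is one possible approach, assuming the index has a header line with <code>MERGE_NAME</code> and <code>RGID</code> columns; the function name <code>multi_run_samples</code> and the demo sample names are hypothetical.

```shell
# Sketch: list samples that have more than one distinct RGID.
# Assumption: tab-delimited index with MERGE_NAME and RGID in the header.
# multi_run_samples is a hypothetical helper, not a GotCloud command.
multi_run_samples() {
    awk -F'\t' '
        NR == 1 {
            # Find the two columns by name, since column order may vary.
            for (i = 1; i <= NF; i++) {
                if ($i == "MERGE_NAME") m = i
                if ($i == "RGID") r = i
            }
            if (!m || !r) {
                print "missing MERGE_NAME or RGID column" > "/dev/stderr"
                exit 1
            }
            next
        }
        {
            # Count each (sample, run) pair only once.
            key = $m SUBSEP $r
            if (!(key in seen)) { seen[key] = 1; runs[$m]++ }
        }
        END { for (s in runs) if (runs[s] > 1) print s }
    ' "$1" | sort
}

# Demo on a small made-up index: sampleA has two runs, sampleB has one.
demo=$(mktemp)
printf 'MERGE_NAME\tFASTQ1\tFASTQ2\tRGID\n' > "$demo"
printf 'sampleA\ta1_1.fastq\ta1_2.fastq\trun1\n' >> "$demo"
printf 'sampleA\ta2_1.fastq\ta2_2.fastq\trun2\n' >> "$demo"
printf 'sampleB\tb1_1.fastq\tb1_2.fastq\trun3\n' >> "$demo"
printf 'sampleB\tb2_1.fastq\tb2_2.fastq\trun3\n' >> "$demo"
multi_run_samples "$demo"   # prints: sampleA
rm -f "$demo"
```

For the tutorial data, the same function could be pointed at <code>${IN}/align.index</code>.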
− | '''Required Columns'''
| |
− | {|class="wikitable" cellpadding=5
| |
− | ! Column Name !! Description !! Recommended Value
| |
− | |-
| |
− | | MERGE_NAME ||
| |
− | * Base name for the resulting BAM file for the sample
| |
− | * Used to group multiple fastqs or fastq pairs into a single BAM
| |
− | | Sample Name
| |
− | |-
| |
− | | FASTQ1 ||
| |
− | * Name of the fastq or the first in the pair if paired-end. (Only 1 line per pair)
| |
− | | path/fastq1
| |
− | |-
| |
− | | FASTQ2 ||
| |
− | *Name of the 2nd fastq in paired-end reads.
| |
− | *Column is not required if all fastqs are single-end
| |
− | *'.' if the column is used, but this line is single-ended.
| |
− | | path/fastq2
| |
− | |}
| |
| | | |
− | + | How do you point GotCloud to your index file? |
− | The following columns are optional and used to populate the Read Group Information in the BAM file.
| + | * Command-line <code>--index_file</code> option |
− | * The RGID field is required if you use any of these fields; the others are optional.
| + | : or |
− | | + | * Configuration file <code>INDEX_FILE</code> setting. |
− | What is a Read Group?
| |
− | * Groups reads together
| |
− | * Used for recalibration
| |
− | ** Each sequencing run should get a different ReadGroup
| |
− | * Typically a new name for each fastq pair/group
| |
− | | |
− | If you do not want the field for:
| |
− | * any fastq, leave the column out of the header line
| |
− | * a single line, use a '.'
| |
− | | |
− | | |
− | '''Optional Columns'''
| |
− | {|class="wikitable" cellpadding=5
| |
− | |-
| |
− | ! Column Name !! Description !! Recommended Value
| |
− | |-
| |
− | | RGID || Read Group ID || Run ID
| |
− | |-
| |
− | | SAMPLE || Sample Name || Sample Name
| |
− | |-
| |
− | | LIBRARY || Library
| |
− | * separate FASTQs for a sample that were prepped separately
| |
− | | if you don't know or it is all the same, use Sample Name
| |
− | |-
| |
− | | CENTER || Center Name || Name of the sequencing center producing the FASTQ
| |
− | |-
| |
− | | PLATFORM || Platform || CAPILLARY, LS454, ILLUMINA,
| |
− | SOLID, HELICOS, IONTORRENT, or PACBIO
| |
− | |}
| |
− | | |
− | Your sequencing core may provide you with a file containing the information needed to fill in these columns.
| |
− | | |
− | For our example, we have <code>sequence.index</code> which contains the information from 1000 Genomes for the FASTQs we are processing.
| |
− | less -S ${GC}/inputs/fastq/sequence.index
| |
− | | |
− | In this file, we want the SAMPLE_NAME, FASTQ_FILE, RUN_ID, LIBRARY_NAME, CENTER_NAME, and INSTRUMENT_PLATFORM fields (columns 10, 1, 3, 15, 6, and 13, respectively).
| |
− | * You can use perl/awk/linux to extract these fields & format as necessary. | |
− | * I prepared a perl script that you can use:
| |
− | perl ${GC}/scripts/genIndex.pl > ${SETUP}/align.index
| |
− | | |
− | Let's look at the index file:
| |
− | less -S ${SETUP}/align.index
| |
− | [[File:Align index.png|1000px]]
| |
− | | |
− | The command-line <code>--fastq</code> option or the configuration file <code>FASTQ_PREFIX</code> setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths.
| |
− | | |
− | This file is specified either via the command-line <code>--index_file</code> parameter or via the configuration file <code>INDEX_FILE</code> setting.
| |
| | | |
| The command-line setting takes precedence over the configuration file setting. | | The command-line setting takes precedence over the configuration file setting. |