Changes

SeqShop: Sequence Mapping and Assembly Practical, June 2014 (view source)

Revision as of 19:03, 14 June 2014

1,821 bytes removed, 19:03, 14 June 2014

→‎GotCloud FASTQ Index File

Line 164: Line 164:

=== GotCloud FASTQ Index File ===

−

~~You need~~ to tell GotCloud about each FASTQ ~~file~~

+

You create the FASTQ index file to tell GotCloud about each of your FASTQ files:

−

* ~~Full path~~

+

* Where to find it

* Sample name

** Each sample can have multiple FASTQs

** Each FASTQ is for a single sample

+

* Run identifier

+

** For recalibration we need to know which reads were in the same run.

−

~~The~~ FASTQ ~~index file is created by you to direct GotCloud to your FASTQ files, providing additional information for them~~.

+

FASTQ Index Format:

+

* Tab delimited

+

* Starts with a header line

+

* One line per single-end read

+

* One line per paired-end read (only 1 line per pair).
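The format rules above can be illustrated with a small sketch that writes an index file. This is only an illustration, not part of GotCloud: the sample names and FASTQ paths are made up, the column names (MERGE_NAME, FASTQ1, FASTQ2, RGID) are the ones listed in the earlier revision of this section, and <code>write_index</code> is a hypothetical helper.

```python
import csv
import io

# Hypothetical sample and file names, for illustration only.
rows = [
    # A paired-end read: both mates go on ONE line.
    {"MERGE_NAME": "SampleA", "FASTQ1": "fastq/SampleA/run1_1.fastq.gz",
     "FASTQ2": "fastq/SampleA/run1_2.fastq.gz", "RGID": "run1"},
    # A single-end read: FASTQ2 is '.' because the column is in use.
    {"MERGE_NAME": "SampleA", "FASTQ1": "fastq/SampleA/run2.fastq.gz",
     "FASTQ2": ".", "RGID": "run2"},
]


def write_index(rows, handle):
    """Write a tab-delimited index: a header line, then one line per read or pair."""
    columns = ["MERGE_NAME", "FASTQ1", "FASTQ2", "RGID"]
    writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_index(rows, buffer)
index_text = buffer.getvalue()
print(index_text, end="")
```

To write a real file, pass a handle from <code>open(path, "w", newline="")</code> instead of the in-memory buffer.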

−

* tab delimited

+

Let's take a look at the index file I prepared for this tutorial:

−

* columns may be in any order

+

less -S ${IN}/align.index

−

* starts with a ~~header line~~

−

* one line per single-end read

−

* one line per paired-~~end read (only 1 line per pair)~~.

+

; Which samples had multiple runs?

+

<ul>

+

+

<li>Need a reminder of the format?</li>

+

+

[[File:fqindex.png|650px]]

+

</div>

+

</div>

+

+

<li>Answer:</li>

+

+

<ul>

+

+

<li>They have multiple unique values in the RGID field</li>

+

[[File:fqindexRG.png|650px]]

+

</div>

+

</div>

+

</ul>

+

</ul>
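If you would rather check this programmatically than by eye, the sketch below groups the RGID values by sample and reports the samples with more than one. It assumes the sample is identified by the MERGE_NAME column and that a <code>.</code> means "no value", as described in the earlier revision of this section; the index contents and the function name <code>samples_with_multiple_runs</code> are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical index contents; for a real file use open(path, newline="").
EXAMPLE_INDEX = (
    "MERGE_NAME\tFASTQ1\tFASTQ2\tRGID\n"
    "SampleA\ta_run1_1.fastq.gz\ta_run1_2.fastq.gz\trun1\n"
    "SampleA\ta_run2_1.fastq.gz\ta_run2_2.fastq.gz\trun2\n"
    "SampleB\tb_run3_1.fastq.gz\tb_run3_2.fastq.gz\trun3\n"
)


def samples_with_multiple_runs(handle):
    """Return {sample: sorted RGIDs} for samples with more than one distinct RGID."""
    runs = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        rgid = row.get("RGID")
        # Skip lines where the RGID column is absent or marked as unused.
        if rgid and rgid != ".":
            runs[row["MERGE_NAME"]].add(rgid)
    return {sample: sorted(ids) for sample, ids in runs.items() if len(ids) > 1}


result = samples_with_multiple_runs(io.StringIO(EXAMPLE_INDEX))
print(result)  # {'SampleA': ['run1', 'run2']}
```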

−

~~'''Required Columns'''~~

−

~~{|class="wikitable" cellpadding=5~~

−

~~! Column Name !! Description !! Recommended Value~~

−

|-

−

~~| MERGE_NAME ||~~

−

* Base name for the resulting BAM file for the sample

−

* Used to group multiple fastqs or fastq pairs into a single BAM

−

~~| Sample Name~~

−

|-

−

~~| FASTQ1 ||~~

−

* Name of the fastq or the first in the pair if paired-end. (Only 1 line per pair)

−

~~| path/fastq1~~

−

|-

−

~~| FASTQ2 ||~~

−

*Name of the 2nd fastq in paired-end reads.

−

*Column is not required if all fastqs are single-end

−

*'.' if the column is used, but this line is single-ended.

−

~~| path/fastq2~~

−

|}

−

+

How do you point GotCloud to your index file?

−

~~The following columns are optional and used to populate the Read Group Information in the BAM file.~~

+

* Command-line <code>--index_file</code> option

−

* RGID field is required if using any of these fields, the others are optional.

+

: or

−

+

* Configuration file <code>INDEX_FILE</code> setting.

−

~~What is a Read Group?~~

−

* Groups reads together

−

* Used for recalibration

−

** Each sequencing run should get a different ReadGroup

−

* Typically a new name for each fastq pair/group

−

~~If you~~ do ~~not want the field for:~~

−

* any fastq, leave the column out of the header line

−

* a single line, use a '.'

−

~~'''Optional Columns'''~~

−

~~{|class="wikitable" cellpadding=5~~

−

|-

−

~~! Column Name !! Description !! Recommended Value~~

−

|-

−

~~| RGID || Read Group ID || Run ID~~

−

|-

−

~~| SAMPLE || Sample Name || Sample Name~~

−

|-

−

~~| LIBRARY || Library~~

−

* separate FASTQs for a sample that were prepped separately

−

~~| if you don't know or it is all the same, use Sample Name~~

−

|-

−

~~| CENTER || Center Name || Name of the sequencing center producing the FASTQ~~

−

|-

−

~~| PLATFORM || Platform || CAPILLARY, LS454, ILLUMINA,~~

−

~~SOLID, HELICOS, IONTORRENT, or PACBIO~~

−

|}

−

~~Your sequencing core may provide to~~ you ~~a file with information~~ to ~~fill in these columns.~~

−

~~For our example, we have <code>sequence.index</code> which contains the information from 1000 Genomes for the FASTQs we are processing.~~

−

~~less -S ${GC}/inputs/fastq/sequence.~~index

−

~~In this~~ file~~, we want the SAMPLE_NAME, FASTQ_FILE, RUN_ID, LIBRARY_NAME, CENTER_NAME, INSTRUMENT_PLATFORM (columns 10, 1, 15, 6, 13).~~

−

* ~~You can use perl/awk/linux to extract these fields & format as necessary.~~

−

* I prepared a perl script that you can use:

−

~~perl ${GC}/scripts/genIndex.pl > ${SETUP}/align.index~~

−

~~Let's look at the index file:~~

−

~~less -S ${SETUP}/align.index~~

−

~~[[File:Align index.png|1000px]]~~

−

~~The command~~-line <code>--~~fastq~~</code> option or ~~the configuration file <code>FASTQ_PREFIX</code> setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths.~~

−

~~This file is specified either via the command-line <code>--index_file</code> parameter or via the configuration~~ file <code>INDEX_FILE</code> setting.

The command-line setting takes precedence over the configuration file setting.
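The precedence rule can be pictured with a short sketch. This is not GotCloud's own code: it only mimics the rule, assuming a configuration file made of <code>KEY = value</code> lines with <code>#</code> comments, and the helper names <code>read_conf</code> and <code>resolve_index_file</code> and the file paths are hypothetical.

```python
import argparse


def read_conf(text):
    """Parse KEY = value lines, skipping blanks and # comments (assumed layout)."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def resolve_index_file(argv, conf_text):
    """Return the index path, letting the command line override the config file."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--index_file", default=None)
    # Ignore any other options so this sketch only looks at --index_file.
    args, _unknown = parser.parse_known_args(argv)
    if args.index_file is not None:
        return args.index_file
    return read_conf(conf_text).get("INDEX_FILE")


conf = "INDEX_FILE = conf/align.index\n"
print(resolve_index_file(["--index_file", "cli/align.index"], conf))  # cli/align.index
print(resolve_index_file([], conf))                                   # conf/align.index
```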

Mktrost

Administrators

3,045

edits
