Difference between revisions of "SeqShop: Sequence Mapping and Assembly Practical, June 2014"
Line 29: | Line 29: | ||
− | === Input Files === | + | === Examining GotCloud Align Input Files === |
==== Sequence Data Files : FASTQs ==== | ==== Sequence Data Files : FASTQs ==== | ||
The FASTQ files are provided to you by those who did the sequencing. | The FASTQ files are provided to you by those who did the sequencing. | ||
Line 207: | Line 207: | ||
Let's look at the configuration file I created for this test: | Let's look at the configuration file I created for this test: | ||
more ${GC}/inputs/gotcloud.conf | more ${GC}/inputs/gotcloud.conf | ||
+ | |||
+ | It already points to your align file. | ||
+ | |||
+ | === Run GotCloud Align === | ||
+ | Now that we have all of our input files, we need just a simple command to run | ||
+ | ${GC}/gotcloud/gotcloud align --conf ${GC}/inputs/gotcloud.conf | ||
+ | |||
+ | [[File:gcalignStart.png|700px]] | ||
+ | |||
+ | This will take about 4 minutes to run. | ||
+ | |||
+ | It should end with a line like: <code>Processing finished in 251 secs with no errors reported</code> | ||
+ | |||
+ | If you cancelled, GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off. | ||
+ | |||
+ | === Examining GotCloud Align Output === | ||
+ | |||
+ | Let's look at the output directory: | ||
+ | ls ${OUTPUT} | ||
+ | |||
+ | [[File:gcalignOutM.png]] |
Revision as of 16:30, 11 June 2014
Step 0: Login to the machine & setup environment
- Login to the windows machine
- The username/password for the Windows machine should be written on it
- Open putty
- Start->.....
- In putty, login to seqshop-server.sph.umich.edu
- Server name: seqshop-server.sph.umich.edu
- Enter your provided username & password
- To simplify commands/typing, we will setup a couple of environment variables
- to point to the GotCloud directory export GC=/home/mktrost/seqshop/
- to point to your test setup export SETUP=~/setup
- to point to your output directory export OUTPUT=~/out
- Create your directories: mkdir ${SETUP} mkdir ${OUTPUT}
GotCloud Alignment Pipeline
Examining GotCloud Align Input Files
Sequence Data Files : FASTQs
The FASTQ files are provided to you by those who did the sequencing.
For this tutorial, we will use FASTQs for 6 1000 Genome samples
ls ${GC}/inputs/fastq/
There are 24 fastq files: combination of single-end & paired-end.
- Single-end: HG00641.chr7.CFTR.SRR069531.fastq
- Paired-end: HG00641.chr7.CFTR.SRR069531_1.fastq & HG00641.chr7.CFTR.SRR069531_2.fastq
Look at FASTQ:
less -S ${GC}/inputs/fastq/HG00641.chr7.CFTR.SRR069531_1.fastq
less
is a Linux command that allows you to look at a file.
-S
option prevents line wrap.- Use the arrow (up/down/left/right) keys to scroll through the file.
- Use
zless
if the file is compressed.
Use 'q'
to exit out of less
q
Reference Files
Reference files can be downloaded with GotCloud or from other sources.
ls ${GC}/reference/chr7
VCF files
- List of chromosome/position
- Used for:
- dbsnp - recalibration skips known variants
- hapmap - used for sample contamination/sample swap validation
- variant filtering
Let's read the first 5 lines of the genome reference FASTA file (all reference bases for a chromosome):
head -n 5 ${GC}/reference/chr7/human.g1k.v37.chr7.fa
The start of the chromosome is all N's, so let's look at a later section (reading 5 lines starting at line 2000):
tail -n+2000 ${GC}/reference/chr7/human.g1k.v37.chr7.fa |head -n 5
See GotCloud: Genetic Reference and Resource Files for more information on downloading/generating reference files.
GotCloud FASTQ Index File
This file is created by you and directs GotCloud to your FASTQ files, providing additional information for them.
- tab delimited
- columns may be in any order
- starts with a header line
- one line per single-end read
- one line per paired-end read (only 1 line per pair).
Required Columns
Column Name | Description | Recommended Value |
---|---|---|
MERGE_NAME |
|
Sample Name |
FASTQ1 |
|
path/fastq1 |
FASTQ2 |
|
path/fastq2 |
The following columns are optional and used to populate the Read Group Information in the BAM file.
- RGID field is required if using any of these fields, the others are optional.
What is a Read Group?
- Groups reads together
- Used for recalibration
- Each sequencing run should get a different ReadGroup
If you do not want the field for:
- any fastq, leave the column out of the header line
- a single line, use a '.'
Column Name | Description | Recommended Value |
---|---|---|
RGID | Read Group ID | Run ID |
SAMPLE | Sample Name | Sample Name |
LIBRARY | Library
|
if you don't know or it is all the same, use Sample Name |
CENTER | Center Name | Name of the sequencing center producing the FASTQ |
PLATFORM | Platform | CAPILLARY, LS454, ILLUMINA,
SOLID, HELICOS, IONTORRENT, or PACBIO |
Your sequencing core may provide to you a file with information to fill in these columns.
For our example, we have sequence.index
which contains the information from 1000 Genomes for the FASTQs we are processing.
less -S ${GC}/inputs/fastq/sequence.index
In this file, we want the SAMPLE_NAME, FASTQ_FILE, RUN_ID, LIBRARY_NAME, CENTER_NAME, INSTRUMENT_PLATFORM (columns 10, 1, 15, 6, 13).
- You can use perl/awk/linux to extract these fields & format as necessary.
- I prepared a perl script that you can use:
perl ${GC}/scripts/genIndex.pl > ${SETUP}/align.index
Let's look at the index file:
less -S ${SETUP}/align.index
The command-line --fastq
option or the configuration file FASTQ_PREFIX
setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths.
This file is specified either via the command-line --index_file
parameter or via the configuration file INDEX_FILE
setting.
The command-line setting takes precedence over the configuration file setting.
GotCloud Configuration File
This file is created by you to configure GotCloud for your data.
- Default values are provided in ${GC}/gotcloud/bin/gotcloudDefaults.conf
- Most values should be left as the defaults
- Specify values in your configuration file as:
KEY = value
- Use $(KEY) to refer to another key's value
- If a KEY is specified twice, the later value is used
- Does not have access to environment variables
- '#' indicates a comment
- Keys to override:
Key Name | Description |
---|---|
Index File Settings - pointing GotCloud to your data | |
INDEX_FILE | Path to the FASTQ index file that you created
|
FASTQ_PREFIX | Prefix to be added to the FASTQ files in INDEX_FILE
|
BAM_INDEX | Path to the BAM index file
|
Reference File Settings - telling GotCloud where to find your reference files | |
REF_DIR | Path to your reference files
|
REF | Path/filename of the FASTA reference file
|
DBSNP_VCF | Path/filename of the DBSNP file
|
HM3_VCF | Path/filename of the HapMap3 file
|
OMNI_VCF | Path/filename of the OMNI file
|
INDEL_PREFIX | Path/filename base of the indels file
|
Let's look at the configuration file I created for this test:
more ${GC}/inputs/gotcloud.conf
It already points to your align file.
Run GotCloud Align
Now that we have all of our input files, we need just a simple command to run
${GC}/gotcloud/gotcloud align --conf ${GC}/inputs/gotcloud.conf
This will take about 4 minutes to run.
It should end with a line like: Processing finished in 251 secs with no errors reported
If you cancelled, GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off.
Examining GotCloud Align Output
Let's look at the output directory:
ls ${OUTPUT}