SeqShop: Sequence Mapping and Assembly Practical, June 2014
Step 0: Login to the machine & setup environment
- Login to the windows machine
- The username/password for the Windows machine should be written on it
- Open putty
- Start->.....
- In putty, login to seqshop-server.sph.umich.edu
- Server name: seqshop-server.sph.umich.edu
- Enter your provided username & password
- To simplify commands/typing, we will setup a couple of environment variables
- to point to the GotCloud directory export GC=/home/mktrost/seqshop/
- to point to your test setup export SETUP=~/setup
- to point to your output directory export OUTPUT=~/out
- Create your directories: mkdir ${SETUP} mkdir ${OUTPUT}
GotCloud Alignment Pipeline
Input Files
Sequence Data Files : FASTQs
The FASTQ files are provided to you by those who did the sequencing.
For this tutorial, we will use FASTQs for 6 1000 Genome samples
ls ${GC}/inputs/fastq/
There are 24 fastq files: combination of single-end & paired-end.
- Single-end: HG00641.chr7.CFTR.SRR069531.fastq
- Paired-end: HG00641.chr7.CFTR.SRR069531_1.fastq & HG00641.chr7.CFTR.SRR069531_2.fastq
Look at FASTQ:
less -S ${GC}/inputs/fastq/HG00641.chr7.CFTR.SRR069531_1.fastq
less
is a Linux command that allows you to look at a file.
-S
option prevents line wrap.- Use the arrow (up/down/left/right) keys to scroll through the file.
- Use
zless
if the file is compressed.
Use 'q'
to exit out of less
q
Reference Files
Reference files can be downloaded with GotCloud or from other sources.
ls ${GC}/reference/chr7
VCF files
- List of chromosome/position
- Used for:
- dbsnp - recalibration skips known variants
- hapmap - used for sample contamination/sample swap validation
- variant filtering
Let's read the first 5 lines of the genome reference FASTA file (all reference bases for a chromosome):
head -n 5 ${GC}/reference/chr7/human.g1k.v37.chr7.fa
The start of the chromosome is all N's, so let's look at a later section (reading 5 lines starting at line 2000):
tail -n+2000 ${GC}/reference/chr7/human.g1k.v37.chr7.fa |head -n 5
See GotCloud: Genetic Reference and Resource Files for more information on downloading/generating reference files.
GotCloud FASTQ Index File
This file is created by you and directs GotCloud to your FASTQ files, providing additional information for them.
- tab delimited
- columns may be in any order
- starts with a header line
- one line per single-end read
- one line per paired-end read (only 1 line per pair).
Required Columns
Column Name | Description | Recommended Value |
---|---|---|
MERGE_NAME |
|
Sample Name |
FASTQ1 |
|
path/fastq1 |
FASTQ2 |
|
path/fastq2 |
The following columns are optional and used to populate the Read Group Information in the BAM file.
- RGID field is required if using any of these fields, the others are optional.
What is a Read Group?
- Groups reads together
- Used for recalibration
- Each sequencing run should get a different ReadGroup
If you do not want the field for:
- any fastq, leave the column out of the header line
- a single line, use a '.'
Column Name | Description | Recommended Value |
---|---|---|
RGID | Read Group ID | Run ID |
SAMPLE | Sample Name | Sample Name |
LIBRARY | Library
|
if you don't know or it is all the same, use Sample Name |
CENTER | Center Name | Name of the sequencing center producing the FASTQ |
PLATFORM | Platform | CAPILLARY, LS454, ILLUMINA,
SOLID, HELICOS, IONTORRENT, or PACBIO |
Your sequencing core may provide to you a file with information to fill in these columns.
For our example, we have sequence.index
which contains the information from 1000 Genomes for the FASTQs we are processing.
less -S ${GC}/inputs/fastq/sequence.index
In this file, we want the SAMPLE_NAME, FASTQ_FILE, RUN_ID, LIBRARY_NAME, CENTER_NAME, INSTRUMENT_PLATFORM (columns 10, 1, 15, 6, 13).
- You can use perl/awk/linux to extract these fields & format as necessary.
- I prepared a perl script that you can use:
perl ${GC}/scripts/genIndex.pl > ${SETUP}/align.index
Let's look at the index file:
less -S ${SETUP}/align.index
The command-line --fastq
option or the configuration file FASTQ_PREFIX
setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths.
This file is specified either via the command-line --index_file
parameter or via the configuration file INDEX_FILE
setting.
The command-line setting takes precedence over the configuration file setting.
GotCloud Configuration File
This file is created by you to configure GotCloud for your data.
- Default values are provided in ${GC}/gotcloud/bin/gotcloudDefaults.conf
- Most values should be left as the defaults
- Specify values in your configuration file as:
KEY = value
- Use $(KEY) to refer to another key's value
- If a KEY is specified twice, the later value is used
- Does not have access to environment variables
- '#' indicates a comment
- Keys to override:
Key Name | Description |
---|---|
Index File Settings - pointing GotCloud to your data | |
INDEX_FILE | Path to the FASTQ index file that you created
|
FASTQ_PREFIX | Prefix to be added to the FASTQ files in INDEX_FILE
|
BAM_INDEX | Path to the BAM index file
|
Reference File Settings - telling GotCloud where to find your reference files | |
REF_DIR | Path to your reference files
|
REF | Path/filename of the FASTA reference file
|
DBSNP_VCF | Path/filename of the DBSNP file
|
HM3_VCF | Path/filename of the HapMap3 file
|
OMNI_VCF | Path/filename of the OMNI file
|
INDEL_PREFIX | Path/filename base of the indels file
|
Let's look at the configuration file I created for this test:
more ${GC}/inputs/gotcloud.conf