Difference between revisions of "SeqShop: Sequence Mapping and Assembly Practical, June 2014"
Line 63: | Line 63: | ||
[[File:AlignDiagram.png|500px]] | [[File:AlignDiagram.png|500px]] | ||
+ | |||
+ | === Why GotCloud?=== | ||
+ | * Easy to learn & run | ||
+ | ** All-in-one package for sequence analysis pipeline | ||
+ | ** You don’t have to know the details of individual component | ||
+ | * Robust parallelization | ||
+ | ** Automatic partition of multi-sample jobs | ||
+ | ** Reliable and fault-tolerant parallelization via GNU make | ||
+ | *** Restart from where it stopped upon unexpected crash | ||
+ | * Cloud & Cluster-friendly | ||
+ | ** Supports multiple clusters such as MOSIX, Slurm, & SGE | ||
+ | ** Amazon instances allow running large-scale jobs without having your own cluster | ||
=== Examining GotCloud Align Input Files === | === Examining GotCloud Align Input Files === | ||
==== Sequence Data Files : FASTQs ==== | ==== Sequence Data Files : FASTQs ==== | ||
− | + | We already looked at those in: [[Examining Raw Sequence Reads]] | |
==== Reference Files ==== | ==== Reference Files ==== |
Revision as of 21:28, 12 June 2014
Goals of This Session
- What we want to learn
- Basic sequence data file formats (FASTQ, BAM)
- How to generate aligned sequences that are ready for variant calling from raw sequence reads
- How to evaluate the quality of sequence data
- How to visualize sequence data to examine the reads aligned to particular genomic positions
Step 0: Login to the machine & setup environment
- Login to the windows machine
- The username/password for the Windows machine should be written on it
- Open putty
- Start->.....
- In putty, login to seqshop-server.sph.umich.edu
- Server name: seqshop-server.sph.umich.edu
- Enter your provided username & password
- To simplify commands/typing, we will setup a couple of environment variables
- to point to the GotCloud directory export GC=/home/mktrost/seqshop/
- to point to your test setup export SETUP=~/setup
- to point to your output directory export OUTPUT=~/out
- Create your directories: mkdir ${SETUP} mkdir ${OUTPUT}
Examining Raw Sequence Reads
FASTQ : standard file format provided to you by those who did the sequencing.
For this tutorial, we will use FASTQs for 6 1000 Genome samples
ls ${GC}/inputs/fastq/
There are 24 fastq files: combination of single-end & paired-end.
- Single-end: HG00641.chr7.CFTR.SRR069531.fastq
- Paired-end: HG00641.chr7.CFTR.SRR069531_1.fastq & HG00641.chr7.CFTR.SRR069531_2.fastq
Look at FASTQ:
less -S ${GC}/inputs/fastq/HG00641.chr7.CFTR.SRR069531_1.fastq
less
is a Linux command that allows you to look at a file.
-S
option prevents line wrap.- Use the arrow (up/down/left/right) keys to scroll through the file.
- Use
zless
if the file is compressed.
Use 'q'
to exit out of less
q
Quality represented as ASCII code of (33 + Phred-scale-quality)
- Lower qualities : special characters, or digits
- ! (Q=0), “ (Q=1), # (Q=3), + (Q=10), / (Q=14)
- 0 (Q=15), 5 (Q=20), 9 (Q=24)
- Higher qualities : alphabet
- : (Q=25), ? (Q=30), @ (Q=31)
- A (Q=32), B (Q=33), G (Q=38)
GotCloud Alignment Pipeline
Why GotCloud?
- Easy to learn & run
- All-in-one package for sequence analysis pipeline
- You don’t have to know the details of individual component
- Robust parallelization
- Automatic partition of multi-sample jobs
- Reliable and fault-tolerant parallelization via GNU make
- Restart from where it stopped upon unexpected crash
- Cloud & Cluster-friendly
- Supports multiple clusters such as MOSIX, Slurm, & SGE
- Amazon instances allow running large-scale jobs without having your own cluster
Examining GotCloud Align Input Files
Sequence Data Files : FASTQs
We already looked at those in: Examining Raw Sequence Reads
Reference Files
Reference files can be downloaded with GotCloud or from other sources.
ls ${GC}/reference/chr7
VCF files
- List of chromosome/position
- Used for:
- dbsnp - recalibration skips known variants
- hapmap - used for sample contamination/sample swap validation
- variant filtering
Let's read the first 5 lines of the genome reference FASTA file (all reference bases for a chromosome):
head -n 5 ${GC}/reference/chr7/human.g1k.v37.chr7.fa
The start of the chromosome is all N's, so let's look at a later section (reading 5 lines starting at line 2000):
tail -n+2000 ${GC}/reference/chr7/human.g1k.v37.chr7.fa |head -n 5
See GotCloud: Genetic Reference and Resource Files for more information on downloading/generating reference files.
GotCloud FASTQ Index File
This file is created by you and directs GotCloud to your FASTQ files, providing additional information for them.
- tab delimited
- columns may be in any order
- starts with a header line
- one line per single-end read
- one line per paired-end read (only 1 line per pair).
Required Columns
Column Name | Description | Recommended Value |
---|---|---|
MERGE_NAME |
|
Sample Name |
FASTQ1 |
|
path/fastq1 |
FASTQ2 |
|
path/fastq2 |
The following columns are optional and used to populate the Read Group Information in the BAM file.
- RGID field is required if using any of these fields, the others are optional.
What is a Read Group?
- Groups reads together
- Used for recalibration
- Each sequencing run should get a different ReadGroup
If you do not want the field for:
- any fastq, leave the column out of the header line
- a single line, use a '.'
Column Name | Description | Recommended Value |
---|---|---|
RGID | Read Group ID | Run ID |
SAMPLE | Sample Name | Sample Name |
LIBRARY | Library
|
if you don't know or it is all the same, use Sample Name |
CENTER | Center Name | Name of the sequencing center producing the FASTQ |
PLATFORM | Platform | CAPILLARY, LS454, ILLUMINA,
SOLID, HELICOS, IONTORRENT, or PACBIO |
Your sequencing core may provide to you a file with information to fill in these columns.
For our example, we have sequence.index
which contains the information from 1000 Genomes for the FASTQs we are processing.
less -S ${GC}/inputs/fastq/sequence.index
In this file, we want the SAMPLE_NAME, FASTQ_FILE, RUN_ID, LIBRARY_NAME, CENTER_NAME, INSTRUMENT_PLATFORM (columns 10, 1, 15, 6, 13).
- You can use perl/awk/linux to extract these fields & format as necessary.
- I prepared a perl script that you can use:
perl ${GC}/scripts/genIndex.pl > ${SETUP}/align.index
Let's look at the index file:
less -S ${SETUP}/align.index
The command-line --fastq
option or the configuration file FASTQ_PREFIX
setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths.
This file is specified either via the command-line --index_file
parameter or via the configuration file INDEX_FILE
setting.
The command-line setting takes precedence over the configuration file setting.
GotCloud Configuration File
This file is created by you to configure GotCloud for your data.
- Default values are provided in ${GC}/gotcloud/bin/gotcloudDefaults.conf
- Most values should be left as the defaults
- Specify values in your configuration file as:
KEY = value
- Use $(KEY) to refer to another key's value
- If a KEY is specified twice, the later value is used
- Does not have access to environment variables
- '#' indicates a comment
- Keys to override:
Key Name | Description |
---|---|
Index File Settings - pointing GotCloud to your data | |
INDEX_FILE | Path to the FASTQ index file that you created
|
FASTQ_PREFIX | Prefix to be added to the FASTQ files in INDEX_FILE
|
BAM_INDEX | Path to the BAM index file
|
Reference File Settings - telling GotCloud where to find your reference files | |
REF_DIR | Path to your reference files
|
REF | Path/filename of the FASTA reference file
|
DBSNP_VCF | Path/filename of the DBSNP file
|
HM3_VCF | Path/filename of the HapMap3 file
|
OMNI_VCF | Path/filename of the OMNI file
|
INDEL_PREFIX | Path/filename base of the indels file
|
Let's look at the configuration file I created for this test:
more ${GC}/inputs/gotcloud.conf
It already points to your align file.
Run GotCloud Align
Now that we have all of our input files, we need just a simple command to run
${GC}/gotcloud/gotcloud align --conf ${GC}/inputs/gotcloud.conf --numcs 2
--numcs
means to run 2 samples at a time.- Depends on your system
This should take < 4 minutes to run.
It should end with a line like: Processing finished in 133 secs with no errors reported
If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off.
Examining GotCloud Align Output
Let's look at the output directory:
ls ${OUTPUT}
Let's look at the BAMs that will be used for variant calling:
ls ${OUTPUT}/bams
Let's take a look at our quality control output directory:
ls ${OUTPUT}/QCFiles
Check for sample contamination:
We need to generate the pdf's of our Quality metrics: