Difference between revisions of "SeqShop: Aligning Your Own Genome, December 2014"

Latest revision as of 10:23, 8 December 2014

Viewing Genetic Info

Slides on risks of viewing genetic information

Goals of This Session

Learn how to go from your FASTQ files to aligned BAMs.

Your samples have already been aligned, so we will review the steps that were done
- Workshop computers don't have enough compute to align everyone's samples during the workshop
You will get to take home both the original FASTQs and the aligned BAMs on a USB drive
- You will get it by the end of the week if not before - it takes a while to copy 74G-111G
- 4 samples will get "modified" BAMs so both FASTQ & BAM would fit on the drive
  - "binned" qualities, duplicates & unmapped reads removed
    - In the future, all generated BAMs will be automatically binned
- Those who weren't sequenced will get a drive with the NA12878 public sample on it

Login instructions for seqshop-server

Login to the seqshop-server Linux Machine

This section is repeated in each session. If you are already logged in or know how to log in to the server, please skip this section.

Log in to the Windows machine

The username/password for the Windows machine should be written on the right-hand monitor

Start Xming so you can open external windows on our Linux machine

Start->Enter "Xming" in the search and select "Xming" from the program list
Nothing visible will happen, but Xming is now running.

Open PuTTY

Start->Enter "putty" in the search and select "PuTTY" from the program list

Configure PuTTY in the PuTTY Configuration window

Host Name: seqshop-server.sph.umich.edu

Setup to allow you to open external windows:

In the left panel: Connection->SSH->X11

Add a check mark in the box next to Enable X11 forwarding

Click Open
If it prompts about a key, click OK

Enter the username & password provided to you

You should now be logged into a terminal on the seqshop-server and be able to access the test files.

If you need another terminal, repeat from step 3 (Open PuTTY).

Login to the seqshop Machine

So you can each run multiple jobs at once, we will have you run on 4 different machines within our seqshop setup.

You can only access these machines after logging onto seqshop-server

3 users log on to:

ssh -X seqshop1

3 users log on to:

ssh -X seqshop2

2 users log on to:

ssh -X seqshop3

2 users log on to:

ssh -X seqshop4

First Things First

When you see Sample*/Sample#, replace it with your sample name/number

If you are using generic data, use NA12878

Locating your FASTQs

Your FASTQ files are under the ~/Sample*/fastqs directory.

Look at your directory:

ls ~/Sample*/fastqs

ls does a directory listing
~/ means to start from your home/base directory
- Using ~/ means the command will work even if you have changed directories

You will see 2 files:

Sample#_R1.fastq.gz - 1st in pair
Sample#_R2.fastq.gz - 2nd in pair

What did I do to run the alignment?

Created FASTQ_LIST file

I created a FASTQ_LIST file with the sample #s and fastqs.

What columns do we need in our file that tells GotCloud about our FASTQ?
- SAMPLE
- FASTQ1
- FASTQ2

SAMPLE   FASTQ1                                        FASTQ2
Sample#  /path/to/Sample#/fastqs/Sample#_R1.fastq.gz    /path/to/Sample#/fastqs/Sample#_R2.fastq.gz

You can see what your fastq.list looks like.

I removed the full paths since I ran the alignment in a different directory:

cat ~/Sample*/fastq.list
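
If you ever need to build a FASTQ_LIST yourself, the layout above can be sketched with a small shell function. This is only a sketch, not the script I used: `make_fastq_list` is a hypothetical helper, and it assumes one paired-end FASTQ pair per sample, the Sample#_R1/_R2 naming shown earlier, and tab-separated columns.

```shell
# Hypothetical helper (not part of GotCloud): print a tab-delimited
# FASTQ_LIST for one paired-end sample.
# Usage: make_fastq_list SAMPLE FASTQ_DIR
make_fastq_list() {
    sample=$1
    dir=$2
    # Header row with the three column names described above.
    printf 'SAMPLE\tFASTQ1\tFASTQ2\n'
    # One row per FASTQ pair, assuming the Sample#_R1 / Sample#_R2 naming.
    printf '%s\t%s\t%s\n' "$sample" \
        "$dir/${sample}_R1.fastq.gz" "$dir/${sample}_R2.fastq.gz"
}

# Example (the path is a placeholder):
# make_fastq_list NA12878 /path/to/NA12878/fastqs > fastq.list
```

Writing the rows with printf keeps the separators as real tabs, which is easy to get wrong when typing the file by hand.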

Created a GotCloud Configuration File

After creating the FASTQ_LIST file, I created the GotCloud configuration.

What changes from the default settings did I make?
1. Use BWA instead of BWA_MEM
2. Use multiple BWA_THREADS
3. Set FASTQ_LIST

Look at the gotcloud.conf I set up for you

It is not "exactly" what I used. In the original gotcloud.conf, I had
- cluster settings (now blank since you won't be using a cluster for further processing)
- a different number of BWA_THREADS for each sample
- full paths to:
  - FASTQ_LIST (and I sometimes had multiple samples in 1 list - but each gets aligned independently)
  - OUT_DIR

cat ~/Sample*/gotcloud.conf

You'll notice that this file is very similar to the one we have been using.

Just a few modifications to run a new test on the whole genome

The settings needed for tomorrow's single-sample SNP calling are already in the gotcloud.conf. Single-sample calling requires extra settings because the default snpcall works best for multiple samples.
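
To pick out just the settings discussed in this section rather than reading the whole file, a filter along these lines may help. It is a rough sketch: `show_align_settings` is a hypothetical helper, and it simply matches any non-comment line mentioning BWA, FASTQ_LIST or OUT_DIR, so it makes no assumption about the exact key names in your conf.

```shell
# Hypothetical helper: show the non-comment lines of a GotCloud conf
# that mention the settings discussed above.
# Usage: show_align_settings CONF_FILE
show_align_settings() {
    # Drop comment lines first, then keep lines naming the settings.
    grep -v '^[[:space:]]*#' "$1" | grep -E 'BWA|FASTQ_LIST|OUT_DIR'
}

# Example:
# show_align_settings ~/Sample*/gotcloud.conf
```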

Ran the Alignment

It took many threads & a couple of days to complete all of the alignments, which is why I ran them last week.

Used screen to run overnight.

I ran something like:

gotcloud align --conf gotcloud.conf --numjobs 4

I set numjobs to the number of samples I was processing on that machine
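
Since numjobs was set to the number of samples, that number can be read off the FASTQ_LIST instead of being counted by hand. The function below is a sketch under two assumptions: the list is tab-delimited with a header row, and SAMPLE is the first column. `count_samples` is a hypothetical name, not a GotCloud command.

```shell
# Hypothetical helper: count the distinct sample names in a FASTQ_LIST.
# Assumes a header row and SAMPLE in the first tab-delimited column.
# Usage: count_samples FASTQ_LIST
count_samples() {
    awk -F'\t' '
        NR > 1 && $1 != "" { seen[$1] = 1 }               # skip the header row
        END { n = 0; for (s in seen) n++; print n }
    ' "$1"
}

# Example, mirroring the command above:
# gotcloud align --conf gotcloud.conf --numjobs "$(count_samples fastq.list)"
```

Counting distinct names rather than rows matters when one sample has several FASTQ pairs in the list.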

Alignment Pipeline Output

The output is in ~/Sample#/output

cd ~/Sample*/output
ls

What is there?

bam.list - list of samples/BAMs (just one)
bams - directory with BAM files
Makefiles - makefiles generated when I ran GotCloud
QCFiles - quality control metrics
- We will look at this later

Look at the BAM list (it will be used for the snpcall step that we will start tomorrow)

cat bam.list

sample(tab)bam

Look in the bams directory:

ls bams
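
Before starting snpcall it can be worth confirming that every BAM named in bam.list is really there. A minimal sketch, assuming the sample(tab)bam layout shown above and that any relative BAM paths resolve from the directory you run it in; `check_bam_list` is a hypothetical helper.

```shell
# Hypothetical helper: report BAMs listed in a bam.list that are missing
# or empty. Returns non-zero if at least one problem is found.
# Usage: check_bam_list BAM_LIST
check_bam_list() {
    missing=0
    # Split each line on tabs: first field is the sample, the rest the BAM path.
    while IFS="$(printf '\t')" read -r sample bam || [ -n "$sample" ]; do
        [ -z "$sample" ] && continue          # skip blank lines
        if [ ! -s "$bam" ]; then              # -s: exists and is not empty
            echo "missing or empty: $sample $bam" >&2
            missing=1
        fi
    done < "$1"
    return "$missing"
}

# Example, run from ~/Sample*/output:
# check_bam_list bam.list && echo "all BAMs present"
```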

Quality Control Output

We may hold off on reviewing this until Friday.

Check QC directory

ls QCFiles/

Check for Sample Contamination:

less -S QCFiles/Sample*.genoCheck.selfSM

Look for the FREEMIX column, or note that it is column 7:

cut -f7 QCFiles/Sample*.genoCheck.selfSM
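
If you would rather not rely on the column position, the value can be looked up by its header name instead. This is a sketch that assumes the selfSM file is tab-delimited with a single header line containing a FREEMIX column; `freemix` is a hypothetical helper name.

```shell
# Hypothetical helper: print the FREEMIX value(s) from a selfSM file,
# locating the column by its header name instead of its position.
# Usage: freemix SELFSM_FILE
freemix() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FREEMIX") col = i; next }
        col     { print $col }                # prints nothing if no FREEMIX header
    ' "$1"
}

# Example:
# freemix QCFiles/Sample*.genoCheck.selfSM
```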

Look at QPLOT stats:

less QCFiles/Sample*.qplot.stats

What is your Mapping Rate%?
What is your MeanDepth?
What is your GenomeCover(%)?
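
To answer the three questions above without scrolling through the whole file, you could filter for just those rows. A rough sketch: `qplot_summary` is a hypothetical helper, and the patterns assume the row labels in qplot.stats contain the words Mapping, MeanDepth and GenomeCover.

```shell
# Hypothetical helper: show only the qplot.stats rows asked about above.
# The patterns are assumptions about how the rows are labelled.
# Usage: qplot_summary QPLOT_STATS_FILE
qplot_summary() {
    grep -E 'Mapping|MeanDepth|GenomeCover' "$1"
}

# Example:
# qplot_summary QCFiles/Sample*.qplot.stats
```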

Let's generate the plots:

The R script will create the PDF
- By default, the script sets the PDF path to the full path where the R script was generated
  - That wouldn't work since I didn't align in your directory & instead moved the files in there afterwards
  - I hand-modified it to a path relative to your home directory, so you need to move to your home directory to create the PDF

cd
Rscript Sample*/output/QCFiles/Sample*.qplot.R
evince Sample*/output/QCFiles/Sample*.qplot.pdf&

Recalibration Comparison

We may hold off on reviewing this until Friday.

I also ran picard/GATK on NA12878.

Tool	Time
Picard MarkDuplicates	5hrs 41min
GATK BaseRecalibrator	18hrs 57min
GATK PrintReads	18hrs 33min
Picard/GATK Total	43hrs 11min
Our Dedup & Recalibration	15hrs 3min
Just Dedup	5hrs 19min
Just Recalibration	13hrs 5min

We run Dedup & Recalibration at the same time for 2 total passes through the BAM file.

Alternatively you can run them separately
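
The times in the table can be cross-checked with a little shell arithmetic. This sketch only re-adds the numbers reported above; `to_min` is a hypothetical helper that converts hours and minutes to minutes.

```shell
# Hypothetical helper: convert "hours minutes" to total minutes.
to_min() { echo $(( $1 * 60 + $2 )); }

# Times taken from the table above.
picard_gatk=$(( $(to_min 5 41) + $(to_min 18 57) + $(to_min 18 33) ))
combined=$(to_min 15 3)                                  # Our Dedup & Recalibration
separate=$(( $(to_min 5 19) + $(to_min 13 5) ))          # Just Dedup + Just Recalibration
saved=$(( separate - combined ))

printf 'Picard/GATK total: %dh %dm\n' $(( picard_gatk / 60 )) $(( picard_gatk % 60 ))
printf 'Dedup and recalibration run separately: %dh %dm\n' $(( separate / 60 )) $(( separate % 60 ))
printf 'Saved by running them together: %dh %dm\n' $(( saved / 60 )) $(( saved % 60 ))
```

The first line reproduces the 43hrs 11min total in the table; the other two show what running dedup and recalibration in the same passes saves compared with running them one after the other.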

Our samples ranged from 8-19 hrs (only 2 at 19-19)

QPLOT comparison:

qplot.stats differences:

Stats\BAM	NA12878.recal.bam	NA12878.markDup_GATK.bam
Q20Bases(e9)	54.29	54.20
Q20BasesPct(%)	94.20	94.05
EPS_MSE	3.80	1.37

Plots: QplotComp.pdf

FEEDBACK!

Please provide feedback on the lectures/tutorials from today: https://docs.google.com/forms/d/1HY9g_GzTZzwddA9bmoyG1IJBgKukk3ouVR5uBboydec/viewform

@@ Line 1: / Line 1: @@
+== Viewing Genetic Info ==
+Slides on [[Media:Sequence Analysis Workshop-Risks viewing genetic information Dec2014.pdf|risks of viewing genetic information]]
 == Goals of This Session ==
 Learn how to go from your FASTQ files to generate Aligned BAMs.
@@ Line 5: / Line 8: @@
 * You will get to take home both the original FASTQs and the aligned BAMs on a USB drive
 ** You will get it by the end of the week if not before - it takes a while to copy 74G-111G
-** 4 samples will get BAMs with "binned" qualities so both FASTQ & BAM would fit on the drive
+** 4 samples will get "modified" BAMs so both FASTQ & BAM would fit on the drive
-*** In future all generated BAMs will be automatically binned.
+*** "binned" qualities, duplicates & unmapped reads removed
+**** In future all generated BAMs will be automatically binned
 ** Those that didn't get sequenced will get a drive with NA12878 public sample on it
+<div class="mw-collapsible mw-collapsed" style="width:500px">
+''Login instructions for seqshop-server''
+<div class="mw-collapsible-content">
 {{SeqShopLogin}}
+</div>
+</div>
-== I didn't get sequenced, what can I do? ==
+== First Things First ==
-I prepared some test files for you.
+When you see Sample*/Sample#, replace it with your sample name/number
-* I took a 1000g sample and reduced it to 2x.
+* If you are using generic data, use NA12878
-** I already created an align.2x.index file for you.
-*** This is a 1000g sample, so the filenames/RG information for these do not match the ones produced by our sequencer that are described below.
-To be consistent with everyone else you can do:
- mkdir ~/personal
- cp -r /home/mktrost/seqshop/inputs/2x/* ~/personal/.
- ls ~/personal/
- ls ~/personal/fastq
-== I got sequence, how to I get my data ready to run? ==
+== Locating your FASTQs ==
-=== Finding your FASTQs ===
+Your FASTQ files are under <code>~/Sample*/fastqs</code> directory.
-Your FASTQ files are under your <code>personal</code> directory.
 Look at your directory:
-  ls ~/personal
+  ls ~/Sample*/fastqs
 * <code>ls</code> does a directory listing
 * <code>~/</code> means to start from your home/base directory
 ** Using ~/ means the command will work even if you have changed directories
-You should see 2 directories: Run_992 & Run_993
+You will see 2 files:
-* Our DNA was run through as 2 sequencing runs.
+* Sample#_R1.fastq.gz - 1st in pair
+* Sample#_R2.fastq.gz - 2nd in pair
-What is under those directories?  Let's check:
+== What did I do to run the alignment? ==
- ls ~/personal/Run_*
+=== Created FASTQ_LIST file ===
-* <code>*</code> is a Linux wild-card match
+I created a FASTQ_LIST file with the sample #s and fastqs.
-** It will do a directory listing on both Run directories at the same time:
+* What columns do we need in our file that tells GotCloud about our FASTQ?
+** SAMPLE
+** FASTQ1
+** FASTQ2
+ SAMPLE   FASTQ1                                        FASTQ2
+ Sample#  /path/to/Sample#/fastqs/Sample#_R1.fastq.gz    /path/to/Sample#/fastqs/Sample#_R2.fastq.gz
-[[File:RunDir.png]]
+You can see what your fastq.list looks like
+* I took out the full path since I ran alignment in a different path:
+ cat ~/Sample*/fastq.list
-You will see your sample name instead of <code>12345</code>.
+=== Created a GotCloud Configuration File ===
+After creating the FASTQ_LIST file, I created the GotCloud configuration.
+* What changes from the default settings did I make?
+*# Use BWA instead of BWA_MEM
+*# Use multiple BWA_THREADS
+*# Set FASTQ_LIST
-Since we found another directory, let's check below that for our fastqs:
+Look at the gotcloud.conf I setup for you
-  ls ~/personal/Run_*/*
+* Not "exactly" what I used.  In the original gotcloud.conf, I had
-You should see your fastq files:
+** cluster settings (now blank since you won't be using a cluster for further processing)
+** various number of BWA_THREADS for each sample
+** full paths to:
+*** FASTQ_LIST (and I sometimes had multiple samples in 1 list - but each gets aligned independently)
+*** OUT_DIR
+  cat ~/Sample*/gotcloud.conf
-[[File:Fastqlist.png]]
+You'll notice that this file is very similar to the one we have been using.
+* Just a few modifications to run a new test on the whole genome
+The settings needed for Single Sample SNP calling that we need for tomorrow are already in the gotcloud.conf (requires extra settings as the default snpcall works best for multiple samples).
-Those filenames are cryptic, what do they mean?
+=== Ran the Alignment ===
+It took many threads & a couple of days to get all of the alignments complete - which is why I ran them last week.
+* Used screen to run overnight.
-[[File:FastqlistAnnotated.png]]
+I ran something like:
+ gotcloud align --conf gotcloud.conf --numjobs 4
+* I set numjobs to the number of samples I was processing on that machine
-=== Checking your index file listing your FASTQs ===
+== Alignment Pipeline Output ==
-Are you analyzing your own genome?  Do you think you setup your file correctly?
+The output is in ~/Sample#/output
+ cd ~/Sample*/output
+  ls
-Try running this script to see if you have any errors:
+What is there?
- perl /home/mktrost/seqshop/inputs/checkIndex.pl ~/personal/align.2x.index
+* '''bam.list''' - list of samples/BAMs (just one)
+* '''bams''' - directory with BAM files
+* Makefiles - makefiles generated when I ran GotCloud
+* QCFiles - quality control metrics
+** We will look at this later
-On success it prints: <code>Congratulations, your fastq index looks valid</code>
+Look at the BAM list (will be used for snpcall that we will start tomorrow)
+ cat bam.list
+* sample(tab)bam
-NOTE: This script is tailored to the filenames provided by our sequencing core as described above.
+Look in the bams directory:
-* It could be tailored to other methods, but is designed for the paths of our data.
+ ls bams
-=== Generating the index file listing your FASTQs ===
+=== Quality Control Output ===
-What columns do we need in our file that tells GotCloud about our FASTQ?
+<div class="mw-collapsible mw-collapsed" style="width:500px">
-* MERGE_NAME
+''We may hold off on reviewing this until Friday.''
-* FASTQ1
+<div class="mw-collapsible-content">
-* FASTQ2
+Check QC directory
-* RGID
+ ls QCFiles/
-* SAMPLE
-* LIBRARY
-* PLATFORM
-We will store our FASTQ info file in: ~/personal/align.2x.index.
+Check for Sample Contamination:
+ less -S QCFiles/Sample*.genoCheck.selfSM
+Look for FREEMIX column.  OR notice that it is column 7:
+ cut -f7 QCFiles/Sample*.genoCheck.selfSM
-There are a few ways to create this file.
+Look at QPLOT stats:
-* Write into a text file one fastq pair at a time.
+ less QCFiles/Sample*.qplot.stats
-* Copy fastq1s into a spreadsheet, fill it in and copy back to a text file
-* [[#Using a Script|Write/Use a script]]
-==== Using a Regular Text File ====
-Follow the instructions below, but do it one FASTQ1 at a time (you won't be able to paste a full column of FASTQs at a time).
-* Remember to put a tab between each field.
-==== Using a Spreadsheet ====
+* What is your Mapping Rate%?
-Since we just have a handful of FASTQs, we can use a spreadsheet to construct our file and then copy the data into a text file.
+* What is your MeanDepth?
-* Thanks to those who thought to do this yesterday - it was a great idea.
+* What is your GenomeCover(%)?
-First, open Excel
+Let's generate the plots:
+* R script will create PDF
+** automatically set PDF path to full path where the R script is
+*** That wouldn't work since I didn't align in your directory & instead moved the files in there afterwards
+*** I hand modified it to relative directory from your home directory, so you need to move to your home directory to create the PDF
+ cd
+ Rscript Sample*/output/QCFiles/Sample*.qplot.R
+ evince Sample*/output/QCFiles/Sample*.qplot.pdf&
+</div>
+</div>
-===== Header Row =====
+== Recalibration Comparison ==
-Create the header line by typing each of the column names in a row (you may be able to copy this line):
+<div class="mw-collapsible mw-collapsed" style="width:500px">
-* make sure you enter these in all CAPS & spelling does matter
+''We may hold off on reviewing this until Friday.''
- MERGE_NAME	FASTQ1	FASTQ2	RGID	SAMPLE	LIBRARY	PLATFORM
+<div class="mw-collapsible-content">
-[[File:HdrRow.png]]
+I also ran picard/GATK on NA12878.
+{| class="wikitable" cellpadding=5
-===== MERGE_NAME =====
+! Tool !! Time
-MERGE_NAME is just your sample name
+|-
-* Type your Sample name under the MERGE_NAME column, for example: <code>Sample_12345</code>
+| Picard MarkDuplicates || 5hrs 41min
-[[File:FastqHdrMN.png]]
+|-
+| GATK BaseRecalibrator || 18hrs 57min
-All FASTQs are for the same sample, so you will use <code>Sample_12345</code> on every line.  We will fill those in after we get know how many rows we need.
+|-
+| GATK PrintReads || 18hrs 33min
-===== FASTQ1 =====
+|-
-FASTQ1 is just the 1st in pair FASTQs (or the single FASTQ in single end)
+! Picard/GATK Total || 43hrs 11min
-* Our sequencing core indicated 1st in pair by <code>R1</code> in the filename.
+|-
-** All our FASTQs are paired end (no single end fastqs)
+! Our Dedup & Recalibration || 15hrs 3min
-* To get a a list of just the <code>R1</code> files:
+|-
- ls -1 ~/personal/Run_*/*/*R1*
+| Just Dedup || 5hr 19min
-* The -1 option tells <code>ls</code> to list the matching files in a single column
+|-
-** If you have noticed, the number of columns in a directory listing varies based on the width of your window.
+| Just Recalibration || 13hrs 5 min
-** We want to copy these files into one column of a spreadsheet so want them displayed as a single column
+|}
+We run Dedup & Recalibration at the same time for 2 total passes through the BAM file.
-Highlight this list and copy them into the FASTQ1 column of your spreadsheet:
+* Alternatively you can run them separately
-[[File:HighlightedFASTQs.png]]
+Our samples ranged from 8-19 hrs (only 2 at 19-19)
-[[File:HdrSheetFQ1.png|700]]
+QPLOT comparison:
+* qplot.stats differences:
+{| class="wikitable" cellpadding=5
+! Stats\BAM !! NA12878.recal.bam !! NA12878.markDup_GATK.bam
+|-
+| Q20Bases(e9) || 54.29 || 54.20
+|-
+| Q20BasesPct(%) || 94.20 || 94.05
+|-
+| EPS_MSE || 3.80 || 1.37
+|}
-Now that we know how many rows we have, copy your sample name into all rows:
+Plots: [[Media:QplotComp.pdf|QplotComp.pdf]]
-[[File:HdrSheetMN1.png|700]]
-=====FASTQ2=====
-As mentioned before, FASTQ2 files are the 2nd in pair.
-* They have the same filename as FASTQ1, except replace the R1 with R2
-You could do an ls and copy & paste as we did for FASTQ1, BUT we'd have to make sure we properly matched up the mates.
-EASIER solution:
-* In your spreadsheet, copy your FASTQ1 filenames into the FASTQ2 column:
-[[File:HdrSheetFQ2 1.png]]
-* Now, replace <code>R1</code> with <code>R2</code> in JUST the FASTQ2 column
-** Make sure FASTQ1 column keeps the <code>R1</code>
-[[File:HdrSheetFQ2 2.png]]
-===== RGID =====
-We want to group our FASTQs by Run & Lane.
-* Each Run/Lane combination should have a unique Read Group
-** Run is in the directory path: Run_992 or Run_993
-** Lane is indicated by L001/L002
-*** Lanes may have multiple FASTQ pairs
-**** the sequencing core split them into smaller FASTQs, but they are the same Run/Lane, so should have the same read group.
-Populate the RGID column with a name unique to each Read Group/Lane combination:
-[[File:HdrSheetRGannotated.png]]
-=====SAMPLE =====
-Put your sample name in each row of this column (you can copy from MERGE_NAME)
-===== LIBRARY =====
-Put your sample name in each row of this column (you can copy from MERGE_NAME)
-* If a sample has multiple library preparations done on it, you would want to give unique names
-** That is not our case, so just put in the sample name.
-===== PLATFORM =====
-Your data was sequenced on ILLUMINA, so enter <code>ILLUMINA</code> in each row of the platform column.
-[[File:HdrSheetDone.png]]
-===== Copy to Text File =====
-Open nedit or your favorite linux editor
- nedit ~/personal/align.2x.index&
-Click <code>New File</code> in the pop-up stating that it couldn't find that file.
-Copy (Ctrl-c) your table from EXCEL (including the header row, but with no extra rows/columns.
-Paste (Ctrl-v) into your nedit window.
-Navigate through the file - the columns should be delimited with tabs.
-Save (Ctrl-s) & close nedit.
-You now have a tab delimited align.2x.index file (a little simpler than yesterday).
-==== Using a Script ====
-When generating an index of your FASTQs, it can be easiest to have a script.
-* Especially if you have many samples/runs, it would be very tedious to do by hand
-If you are good at scripting, this may be even easier than doing it by hand
-* If you aren't good at scripting, and you have too much data to do by hand
-** Make friends with someone who is :-)
-** I always find it useful to start from another script (reminds me of commands/tricks)
-If you still need to create your file and you don't want to use the spreadsheet method above, you can run a script that I made:
- perl /home/mktrost/seqshop/inputs/buildIndex.pl ~/personal > ~/personal/align.2x.index
-* <code>></code> means to direct the output to the file specified after the <code>></code>
-Curious what the script looks like and what it does in case you want to create one in the future?
-<div class="mw-collapsible mw-collapsed" style="width:200px">
-<li>View Annotated Script</li>
-<div class="mw-collapsible-content">
-[[File:BuildIndex.png|800px]]
 </div>
 </div>
-=== Checking your index file listing your FASTQs ===
-Are you analyzing your own genome?  Do you think you setup your file correctly?
-Try running this script to see if you have any errors:
- perl /home/mktrost/seqshop/inputs/checkIndex.pl ~/personal/align.2x.index
-On success it prints: <code>Congratulations, your fastq index looks valid</code>
-NOTE: This script is tailored to the filenames provided by our sequencing core as described above.
-* It could be tailored to other methods, but is designed for the paths of our data.
-== Create your GotCloud Configuration File ==
-Now that you have your FASTQ info file created, we need to setup the configuration.
-'''If you updated your gotcloud.2x.conf file yesterday, you still need to do this step'''
-* We updated a few settings to match 1000g
-*# Updated the version of the genome reference
-*# Updated to use BWA instead of BWA_MEM
-*# Updated the version of BWA
-You can copy from my directory to your personal directory:
- cp /home/mktrost/seqshop/inputs/gotcloud.2x.conf ~/personal/.
-You need to update the conf file to find your align.2x.index and ensure your output directory is setup.
- nedit ~/personal/gotcloud.2x.conf &
-(or use your favorite editor)
-* Change the <code>IN_DIR</code> line, replacing <code>YOUR_USER_NAME</code> with your user name.
-* Everything else is configured already.
-[[File:Gc2xconf.png]]
-You'll notice that this file is very similar to the one we have been using.
-* Just a few modifications to run a new test on the whole genome
-== Run your Alignment ==
-=== Screen ===
-The alignment pipeline will run overnight, but you'll want to log out.
-; How do I leave something running on the server even if I log out?
-: One solution is screen!
-; How do I use screen?
-: Before running your command, you need to start screen:
-: <pre>screen</pre>
-[[File:Screen.png]]
-As it says, press <code>Space</code> or <code>Return</code>.
-* It should now look basically the same as your normal command line.
-You can now start your alignment:
- /home/mktrost/seqshop/gotcloud/gotcloud align --conf ~/personal/gotcloud.2x.conf --numjobs 1
-Yes, leave that as /home/mktrost/seqshop/gotcloud/gotcloud - that is where gotcloud is installed.  The ~/... points GotCloud to your specific configuration file.
-You should now see your alignment running:
-;Want to log out and leave your job running?
-In the screen window, type:
- Ctrl-a d
-(Hold down Ctrl and type 'a', let go of both and type 'd')
-* This will "detach" from your screen session while your alignment continues to run.
-;How do you log back into screen tomorrow?
- screen -r
-This will resume an already running screen.
-* Feel free to test it out and you will see your alignment still running
-** Just use Ctrl-a d to detach from screen and leave your job running
-; Scrolling problems?
-: If you want to scroll and screen doesn't scroll like you normally would?
-:* Type Ctrl-a Esc and you should be able to scroll up with your mouse wheel
-:** Or at least that is what I do from my Linux machine - (sorry I'm typing this up/testing these commands from Linux and not windows, so can't test it out)
-== Log Out ==
-If you have not detached from screen:
- Ctrl-a d
-exit PuTTY
 == FEEDBACK! ==
-Since I didn't send this out yesterday, today's survey has feedback for Tuesday & Wednesday.
+Please provide feedback on the lectures/tutorials from today: https://docs.google.com/forms/d/1HY9g_GzTZzwddA9bmoyG1IJBgKukk3ouVR5uBboydec/viewform
-https://docs.google.com/forms/d/1qaLHq9w1Ib3FZq0CtlrbK_-breNiqGRV06oRYNmUuME/viewform