Difference between revisions of "SeqShop: Aligning Your Own Genome, December 2014"

From Genome Analysis Wiki
Jump to navigationJump to search
 
(9 intermediate revisions by the same user not shown)
Line 20: Line 20:
 
</div>
 
</div>
 
</div>
 
</div>
 +
 +
== First Things First ==
 +
When you see Sample*/Sample#, replace it with your sample name/number
 +
* If you are using generic data, use NA12878
  
 
== Locating your FASTQs ==
 
== Locating your FASTQs ==
 
Your FASTQ files are under <code>~/Sample*/fastqs</code> directory.
 
Your FASTQ files are under <code>~/Sample*/fastqs</code> directory.
* ''or <code>~/NA12878/fastqs</code>''
 
  
 
Look at your directory:  
 
Look at your directory:  
 
  ls ~/Sample*/fastqs
 
  ls ~/Sample*/fastqs
or
+
 
ls ~/NA12878/fastqs
 
 
* <code>ls</code> does a directory listing
 
* <code>ls</code> does a directory listing
 
* <code>~/</code> means to start from your home/base directory
 
* <code>~/</code> means to start from your home/base directory
Line 34: Line 36:
  
 
You will see 2 files:
 
You will see 2 files:
* Sample_R1.fastq.gz - 1st in pair
+
* Sample#_R1.fastq.gz - 1st in pair
* Sample_R2.fastq.gz - 2nd in pair
+
* Sample#_R2.fastq.gz - 2nd in pair
  
 
== What did I do to run the alignment? ==
 
== What did I do to run the alignment? ==
Line 52: Line 54:
 
  cat ~/Sample*/fastq.list
 
  cat ~/Sample*/fastq.list
  
== Create your GotCloud Configuration File ==
+
=== Created a GotCloud Configuration File ===
Now that you have your FASTQ info file created, we need to setup the configuration.
+
After creating the FASTQ_LIST file, I created the GotCloud configuration.
  
'''If you updated your gotcloud.2x.conf file yesterday, you still need to do this step'''
+
* What changes from the default settings did I make?
* We updated a few settings to match 1000g
+
*# Use BWA instead of BWA_MEM
*# Updated the version of the genome reference
+
*# Use multiple BWA_THREADS
*# Updated to use BWA instead of BWA_MEM
+
*# Set FASTQ_LIST
*# Updated the version of BWA
 
  
You can copy from my directory to your personal directory:
+
Look at the gotcloud.conf I setup for you
  cp /home/mktrost/seqshop/inputs/gotcloud.2x.conf ~/personal/.
+
* Not "exactly" what I used.  In the original gotcloud.conf, I had
 +
** cluster settings (now blank since you won't be using a cluster for further processing)
 +
** various number of BWA_THREADS for each sample
 +
** full paths to:
 +
*** FASTQ_LIST (and I sometimes had multiple samples in 1 list - but each gets aligned independently)
 +
*** OUT_DIR
 +
  cat ~/Sample*/gotcloud.conf
  
You need to update the conf file to find your align.2x.index and ensure your output directory is setup.
+
You'll notice that this file is very similar to the one we have been using.
 +
* Just a few modifications to run a new test on the whole genome
  
nedit ~/personal/gotcloud.2x.conf &
+
The settings needed for Single Sample SNP calling that we need for tomorrow are already in the gotcloud.conf (requires extra settings as the default snpcall works best for multiple samples).
(or use your favorite editor)
 
* Change the <code>IN_DIR</code> line, replacing <code>YOUR_USER_NAME</code> with your user name.
 
* Everything else is configured already.
 
[[File:Gc2xconf.png]]
 
  
 +
=== Ran the Alignment ===
 +
It took many threads & a couple of days to get all of the alignments complete - which is why I ran them last week.
 +
* Used screen to run overnight.
  
You'll notice that this file is very similar to the one we have been using.
+
I ran something like:
* Just a few modifications to run a new test on the whole genome
+
gotcloud align --conf gotcloud.conf --numjobs 4
 +
* I set numjobs to the number of samples I was processing on that machine
  
== Run your Alignment ==
+
== Alignment Pipeline Output ==
 +
The output is in ~/Sample#/output
 +
cd ~/Sample*/output
 +
ls
  
=== Screen ===
+
What is there?
The alignment pipeline will run overnight, but you'll want to log out.
+
* '''bam.list''' - list of samples/BAMs (just one)
 +
* '''bams''' - directory with BAM files
 +
* Makefiles - makefiles generated when I ran GotCloud
 +
* QCFiles - quality control metrics
 +
** We will look at this later
  
; How do I leave something running on the server even if I log out?
+
Look at the BAM list (will be used for snpcall that we will start tomorrow)
: One solution is screen!
+
cat bam.list
 +
* sample(tab)bam
  
; How do I use screen?
+
Look in the bams directory:
: Before running your command, you need to start screen:
+
ls bams
: <pre>screen</pre>
 
  
[[File:Screen.png]]
+
=== Quality Control Output ===
 +
<div class="mw-collapsible mw-collapsed" style="width:500px">
 +
''We may hold off on reviewing this until Friday.''
 +
<div class="mw-collapsible-content">
 +
Check QC directory
 +
ls QCFiles/
  
As it says, press <code>Space</code> or <code>Return</code>.
+
Check for Sample Contamination:
* It should now look basically the same as your normal command line.
+
less -S QCFiles/Sample*.genoCheck.selfSM
 +
Look for FREEMIX column. OR notice that it is column 7:
 +
cut -f7 QCFiles/Sample*.genoCheck.selfSM
  
You can now start your alignment:
+
Look at QPLOT stats:
  /home/mktrost/seqshop/gotcloud/gotcloud align --conf ~/personal/gotcloud.2x.conf --numjobs 1
+
  less QCFiles/Sample*.qplot.stats
Yes, leave that as /home/mktrost/seqshop/gotcloud/gotcloud - that is where gotcloud is installed.  The ~/... points GotCloud to your specific configuration file.
 
  
You should now see your alignment running:
+
* What is your Mapping Rate%?
 +
* What is your MeanDepth?
 +
* What is your GenomeCover(%)?
  
;Want to log out and leave your job running?
+
Let's generate the plots:
In the screen window, type:
+
* R script will create PDF
Ctrl-a d
+
** automatically set PDF path to full path where the R script is
(Hold down Ctrl and type 'a', let go of both and type 'd')
+
*** That wouldn't work since I didn't align in your directory & instead moved the files in there afterwards
* This will "detach" from your screen session while your alignment continues to run.
+
*** I hand modified it to relative directory from your home directory, so you need to move to your home directory to create the PDF
 +
cd
 +
Rscript Sample*/output/QCFiles/Sample*.qplot.R
 +
evince Sample*/output/QCFiles/Sample*.qplot.pdf&
 +
</div>
 +
</div>
  
;How do you log back into screen tomorrow?
+
== Recalibration Comparison ==
screen -r
+
<div class="mw-collapsible mw-collapsed" style="width:500px">
This will resume an already running screen.
+
''We may hold off on reviewing this until Friday.''
* Feel free to test it out and you will see your alignment still running
+
<div class="mw-collapsible-content">
** Just use Ctrl-a d to detach from screen and leave your job running
+
I also ran picard/GATK on NA12878.
 +
{| class="wikitable" cellpadding=5
 +
! Tool !! Time
 +
|-
 +
| Picard MarkDuplicates || 5hrs 41min
 +
|-
 +
| GATK BaseRecalibrator || 18hrs 57min
 +
|-
 +
| GATK PrintReads || 18hrs 33min
 +
|-
 +
! Picard/GATK Total || 43hrs 11min
 +
|-
 +
! Our Dedup & Recalibration || 15hrs 3min
 +
|-
 +
| Just Dedup || 5hr 19min
 +
|-
 +
| Just Recalibration || 13hrs 5 min
 +
|}
 +
We run Dedup & Recalibration at the same time for 2 total passes through the BAM file.
 +
* Alternatively you can run them separately
  
; Scrolling problems?
+
Our samples ranged from 8-19 hrs (only 2 at 19-19)
: If you want to scroll and screen doesn't scroll like you normally would?
 
:* Type Ctrl-a Esc and you should be able to scroll up with your mouse wheel
 
:** Or at least that is what I do from my Linux machine - (sorry I'm typing this up/testing these commands from Linux and not windows, so can't test it out)
 
  
 +
QPLOT comparison:
 +
* qplot.stats differences:
 +
{| class="wikitable" cellpadding=5
 +
! Stats\BAM !! NA12878.recal.bam !! NA12878.markDup_GATK.bam
 +
|-
 +
| Q20Bases(e9) || 54.29 || 54.20
 +
|-
 +
| Q20BasesPct(%) || 94.20 || 94.05
 +
|-
 +
| EPS_MSE || 3.80 || 1.37
 +
|}
  
== Log Out ==
+
Plots: [[Media:QplotComp.pdf|QplotComp.pdf]]
If you have not detached from screen:
 
Ctrl-a d
 
  
exit PuTTY
+
</div>
 +
</div>
  
 
== FEEDBACK! ==
 
== FEEDBACK! ==
Since I didn't send this out yesterday, today's survey has feedback for Tuesday & Wednesday.
+
Please provide feedback on the lectures/tutorials from today: https://docs.google.com/forms/d/1HY9g_GzTZzwddA9bmoyG1IJBgKukk3ouVR5uBboydec/viewform
https://docs.google.com/forms/d/1qaLHq9w1Ib3FZq0CtlrbK_-breNiqGRV06oRYNmUuME/viewform
 

Latest revision as of 10:23, 8 December 2014

Viewing Genetic Info

Slides on risks of viewing genetic information

Goals of This Session

Learn how to go from your FASTQ files to generate Aligned BAMs.

  • Your samples have already been aligned, so we will review the steps that were done
    • Workshop computers don't have enough compute to align everyone's samples during the workshop
  • You will get to take home both the original FASTQs and the aligned BAMs on a USB drive
    • You will get it by the end of the week if not before - it takes a while to copy 74G-111G
    • 4 samples will get "modified" BAMs so both FASTQ & BAM would fit on the drive
      • "binned" qualities, duplicates & unmapped reads removed
        • In future all generated BAMs will be automatically binned
    • Those that didn't get sequenced will get a drive with NA12878 public sample on it


Login instructions for seqshop-server

Login to the seqshop-server Linux Machine

This section will appear redundantly in each session. If you are already logged in or know how to log in to the server, please skip this section

  1. Login to the windows machine
    • The username/password for the Windows machine should be written on the right-hand monitor
  2. Start xming so you can open external windows on our Linux machine
    • Start->Enter "Xming" in the search and select "Xming" from the program list
    • Nothing will happen, but Xming was started.
    • View Screenshot
    • Xming.png

  3. Open putty
    • Start->Enter "putty" in the search and select "PuTTY" from the program list
    • View Screenshot
    • PuttyS.png

  4. Configure PuTTY in the PuTTY Configuration window
    • Host Name: seqshop-server.sph.umich.edu
    • View Screenshot
    • Seqshop.png

    • Setup to allow you to open external windows:
      • In the left pannel: Connection->SSH->X11
        • Add a check mark in the box next to Enable X11 forwarding
        • View Screenshot
        • SeqshopX11.png

    • Click Open
    • If it prompts about a key, click OK
  5. Enter your provided username & password as provided


You should now be logged into a terminal on the seqshop-server and be able to access the test files.

  • If you need another terminal, repeat from step 3.

Login to the seqshop Machine

So you can each run multiple jobs at once, we will have you run on 4 different machines within our seqshop setup.

  • You can only access these machines after logging onto seqshop-server

3 users logon to:

ssh -X seqshop1

3 users logon to:

ssh -X seqshop2

2 users logon to:

ssh -X seqshop3

2 users logon to:

ssh -X seqshop4

First Things First

When you see Sample*/Sample#, replace it with your sample name/number

  • If you are using generic data, use NA12878

Locating your FASTQs

Your FASTQ files are under ~/Sample*/fastqs directory.

Look at your directory:

ls ~/Sample*/fastqs
  • ls does a directory listing
  • ~/ means to start from your home/base directory
    • Using ~/ means the command will work even if you have changed directories

You will see 2 files:

  • Sample#_R1.fastq.gz - 1st in pair
  • Sample#_R2.fastq.gz - 2nd in pair

What did I do to run the alignment?

Created FASTQ_LIST file

I created a FASTQ_LIST file with the sample #s and fastqs.

  • What columns do we need in our file that tells GotCloud about our FASTQ?
    • SAMPLE
    • FASTQ1
    • FASTQ2
SAMPLE   FASTQ1                                        FASTQ2
Sample#  /path/to/Sample#/fastqs/Sample#_R1.fastq.gz    /path/to/Sample#/fastqs/Sample#_R2.fastq.gz

You can see what your fastq.list looks like

  • I took out the full path since I ran alignment in a different path:
cat ~/Sample*/fastq.list

Created a GotCloud Configuration File

After creating the FASTQ_LIST file, I created the GotCloud configuration.

  • What changes from the default settings did I make?
    1. Use BWA instead of BWA_MEM
    2. Use multiple BWA_THREADS
    3. Set FASTQ_LIST

Look at the gotcloud.conf I setup for you

  • Not "exactly" what I used. In the original gotcloud.conf, I had
    • cluster settings (now blank since you won't be using a cluster for further processing)
    • various number of BWA_THREADS for each sample
    • full paths to:
      • FASTQ_LIST (and I sometimes had multiple samples in 1 list - but each gets aligned independently)
      • OUT_DIR
cat ~/Sample*/gotcloud.conf

You'll notice that this file is very similar to the one we have been using.

  • Just a few modifications to run a new test on the whole genome

The settings needed for Single Sample SNP calling that we need for tomorrow are already in the gotcloud.conf (requires extra settings as the default snpcall works best for multiple samples).

Ran the Alignment

It took many threads & a couple of days to get all of the alignments complete - which is why I ran them last week.

  • Used screen to run overnight.

I ran something like:

gotcloud align --conf gotcloud.conf --numjobs 4
  • I set numjobs to the number of samples I was processing on that machine

Alignment Pipeline Output

The output is in ~/Sample#/output

cd ~/Sample*/output
ls

What is there?

  • bam.list - list of samples/BAMs (just one)
  • bams - directory with BAM files
  • Makefiles - makefiles generated when I ran GotCloud
  • QCFiles - quality control metrics
    • We will look at this later

Look at the BAM list (will be used for snpcall that we will start tomorrow)

cat bam.list
  • sample(tab)bam

Look in the bams directory:

ls bams

Quality Control Output

We may hold off on reviewing this until Friday.

Check QC directory

ls QCFiles/

Check for Sample Contamination:

less -S QCFiles/Sample*.genoCheck.selfSM

Look for FREEMIX column. OR notice that it is column 7:

cut -f7 QCFiles/Sample*.genoCheck.selfSM

Look at QPLOT stats:

less QCFiles/Sample*.qplot.stats
  • What is your Mapping Rate%?
  • What is your MeanDepth?
  • What is your GenomeCover(%)?

Let's generate the plots:

  • R script will create PDF
    • automatically set PDF path to full path where the R script is
      • That wouldn't work since I didn't align in your directory & instead moved the files in there afterwards
      • I hand modified it to relative directory from your home directory, so you need to move to your home directory to create the PDF
cd
Rscript Sample*/output/QCFiles/Sample*.qplot.R
evince Sample*/output/QCFiles/Sample*.qplot.pdf&

Recalibration Comparison

We may hold off on reviewing this until Friday.

I also ran picard/GATK on NA12878.

Tool Time
Picard MarkDuplicates 5hrs 41min
GATK BaseRecalibrator 18hrs 57min
GATK PrintReads 18hrs 33min
Picard/GATK Total 43hrs 11min
Our Dedup & Recalibration 15hrs 3min
Just Dedup 5hr 19min
Just Recalibration 13hrs 5 min

We run Dedup & Recalibration at the same time for 2 total passes through the BAM file.

  • Alternatively you can run them separately

Our samples ranged from 8-19 hrs (only 2 at 19-19)

QPLOT comparison:

  • qplot.stats differences:
Stats\BAM NA12878.recal.bam NA12878.markDup_GATK.bam
Q20Bases(e9) 54.29 54.20
Q20BasesPct(%) 94.20 94.05
EPS_MSE 3.80 1.37

Plots: QplotComp.pdf

FEEDBACK!

Please provide feedback on the lectures/tutorials from today: https://docs.google.com/forms/d/1HY9g_GzTZzwddA9bmoyG1IJBgKukk3ouVR5uBboydec/viewform