Changes

11,648 bytes added , 14:25, 13 November 2014

no edit summary

Line 1: Line 1: +

'''Note:''' the latest version of this practical is available at: [[SeqShop: Aligning Your Own Genome]]

+

* The ones here is the original one from the June workshop (updated to be run from elsewhere)

+

== First Things First ==

+

*Helpful reference to many tools:

+

** http://infoplatter.wordpress.com/2014/04/06/bioinformaticians-pocket-reference/

+

*** links to "cheat-sheets", including, Unix, screen, and vi

+

* Our wiki with some brief description of how to do some basic commands

+

**http://genome.sph.umich.edu/wiki/Basic_Linux_Intro

+

* [[Screen Commands]] for commands to use screen (one way to leave your commands running even after you log out)

+

== Goals of This Session ==

+

Learn how to go from your FASTQ files to generate Aligned BAMs.

+

* Practice setting up and running GotCloud on your own.

+

== I didn't get sequenced, what can I do? ==

+

I prepared some test files for you.

+

* I took a 1000g sample and reduced it to 2x.

+

** I already created an align.2x.index file for you.

+

*** This is a 1000g sample, so the filenames/RG information for these do not match the ones produced by our sequencer that are described below.

+

To be consistent with everyone else you can do:

+

mkdir ~/personal

+

cp -r /home/mktrost/seqshop/inputs/2x/* ~/personal/.

+

ls ~/personal/

+

ls ~/personal/fastq

+

== I got sequence, how to I get my data ready to run? ==

+

=== Finding your FASTQs ===

+

Your FASTQ files are under your <code>personal</code> directory.

+

Look at your directory:

+

ls ~/personal

+

* <code>ls</code> does a directory listing

+

* <code>~/</code> means to start from your home/base directory

+

** Using ~/ means the command will work even if you have changed directories

+

You should see 2 directories: Run_992 & Run_993

+

* Our DNA was run through as 2 sequencing runs.

+

What is under those directories? Let's check:

+

ls ~/personal/Run_*

+

* <code>*</code> is a Linux wild-card match

+

** It will do a directory listing on both Run directories at the same time:

+

[[File:RunDir.png]]

+

You will see your sample name instead of <code>12345</code>.

+

Since we found another directory, let's check below that for our fastqs:

+

ls ~/personal/Run_*/*

+

You should see your fastq files:

+

[[File:Fastqlist.png]]

+

Those filenames are cryptic, what do they mean?

+

[[File:FastqlistAnnotated.png]]

+

=== Checking your index file listing your FASTQs ===

+

Are you analyzing your own genome? Do you think you setup your file correctly?

+

Try running this script to see if you have any errors:

+

perl /home/mktrost/seqshop/inputs/checkIndex.pl ~/personal/align.2x.index

+

On success it prints: <code>Congratulations, your fastq index looks valid</code>

+

NOTE: This script is tailored to the filenames provided by our sequencing core as described above.

+

* It could be tailored to other methods, but is designed for the paths of our data.

+

=== Generating the index file listing your FASTQs ===

+

What columns do we need in our file that tells GotCloud about our FASTQ?

+

* MERGE_NAME

+

* FASTQ1

+

* FASTQ2

+

* RGID

+

* SAMPLE

+

* LIBRARY

+

* PLATFORM

+

We will store our FASTQ info file in: ~/personal/align.2x.index.

+

There are a few ways to create this file.

+

* Write into a text file one fastq pair at a time.

+

* Copy fastq1s into a spreadsheet, fill it in and copy back to a text file

+

* [[#Using a Script|Write/Use a script]]

+

==== Using a Regular Text File ====

+

Follow the instructions below, but do it one FASTQ1 at a time (you won't be able to paste a full column of FASTQs at a time).

+

* Remember to put a tab between each field.

+

==== Using a Spreadsheet ====

+

Since we just have a handful of FASTQs, we can use a spreadsheet to construct our file and then copy the data into a text file.

+

* Thanks to those who thought to do this yesterday - it was a great idea.

+

First, open Excel

+

===== Header Row =====

+

Create the header line by typing each of the column names in a row (you may be able to copy this line):

+

* make sure you enter these in all CAPS & spelling does matter

+

MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY PLATFORM

+

[[File:HdrRow.png]]

+

===== MERGE_NAME =====

+

MERGE_NAME is just your sample name

+

* Type your Sample name under the MERGE_NAME column, for example: <code>Sample_12345</code>

+

[[File:FastqHdrMN.png]]

+

All FASTQs are for the same sample, so you will use <code>Sample_12345</code> on every line. We will fill those in after we get know how many rows we need.

+

===== FASTQ1 =====

+

FASTQ1 is just the 1st in pair FASTQs (or the single FASTQ in single end)

+

* Our sequencing core indicated 1st in pair by <code>R1</code> in the filename.

+

** All our FASTQs are paired end (no single end fastqs)

+

* To get a a list of just the <code>R1</code> files:

+

ls -1 ~/personal/Run_*/*/*R1*

+

* The -1 option tells <code>ls</code> to list the matching files in a single column

+

** If you have noticed, the number of columns in a directory listing varies based on the width of your window.

+

** We want to copy these files into one column of a spreadsheet so want them displayed as a single column

+

Highlight this list and copy them into the FASTQ1 column of your spreadsheet:

+

[[File:HighlightedFASTQs.png]]

+

[[File:HdrSheetFQ1.png|700]]

+

Now that we know how many rows we have, copy your sample name into all rows:

+

[[File:HdrSheetMN1.png|700]]

+

=====FASTQ2=====

+

As mentioned before, FASTQ2 files are the 2nd in pair.

+

* They have the same filename as FASTQ1, except replace the R1 with R2

+

You could do an ls and copy & paste as we did for FASTQ1, BUT we'd have to make sure we properly matched up the mates.

+

EASIER solution:

+

* In your spreadsheet, copy your FASTQ1 filenames into the FASTQ2 column:

+

[[File:HdrSheetFQ2 1.png]]

+

* Now, replace <code>R1</code> with <code>R2</code> in JUST the FASTQ2 column

+

** Make sure FASTQ1 column keeps the <code>R1</code>

+

[[File:HdrSheetFQ2 2.png]]

+

===== RGID =====

+

We want to group our FASTQs by Run & Lane.

+

* Each Run/Lane combination should have a unique Read Group

+

** Run is in the directory path: Run_992 or Run_993

+

** Lane is indicated by L001/L002

+

*** Lanes may have multiple FASTQ pairs

+

**** the sequencing core split them into smaller FASTQs, but they are the same Run/Lane, so should have the same read group.

+

Populate the RGID column with a name unique to each Read Group/Lane combination:

+

[[File:HdrSheetRGannotated.png]]

+

=====SAMPLE =====

+

Put your sample name in each row of this column (you can copy from MERGE_NAME)

+

===== LIBRARY =====

+

Put your sample name in each row of this column (you can copy from MERGE_NAME)

+

* If a sample has multiple library preparations done on it, you would want to give unique names

+

** That is not our case, so just put in the sample name.

+

===== PLATFORM =====

+

Your data was sequenced on ILLUMINA, so enter <code>ILLUMINA</code> in each row of the platform column.

+

[[File:HdrSheetDone.png]]

+

===== Copy to Text File =====

+

Open nedit or your favorite linux editor

+

nedit ~/personal/align.2x.index&

+

Click <code>New File</code> in the pop-up stating that it couldn't find that file.

+

Copy (Ctrl-c) your table from EXCEL (including the header row, but with no extra rows/columns.

+

Paste (Ctrl-v) into your nedit window.

+

Navigate through the file - the columns should be delimited with tabs.

+

Save (Ctrl-s) & close nedit.

+

You now have a tab delimited align.2x.index file (a little simpler than yesterday).

+

==== Using a Script ====

+

When generating an index of your FASTQs, it can be easiest to have a script.

+

* Especially if you have many samples/runs, it would be very tedious to do by hand

+

If you are good at scripting, this may be even easier than doing it by hand

+

* If you aren't good at scripting, and you have too much data to do by hand

+

** Make friends with someone who is :-)

+

** I always find it useful to start from another script (reminds me of commands/tricks)

+

If you still need to create your file and you don't want to use the spreadsheet method above, you can run a script that I made:

+

perl /home/mktrost/seqshop/inputs/buildIndex.pl ~/personal > ~/personal/align.2x.index

+

* <code>></code> means to direct the output to the file specified after the <code>></code>

+

Curious what the script looks like and what it does in case you want to create one in the future?

+

+

<li>View Annotated Script</li>

+

+

[[File:BuildIndex.png|800px]]

+

</div>

+

</div>

+

=== Checking your index file listing your FASTQs ===

+

Are you analyzing your own genome? Do you think you setup your file correctly?

+

Try running this script to see if you have any errors:

+

perl /home/mktrost/seqshop/inputs/checkIndex.pl ~/personal/align.2x.index

+

On success it prints: <code>Congratulations, your fastq index looks valid</code>

+

NOTE: This script is tailored to the filenames provided by our sequencing core as described above.

+

* It could be tailored to other methods, but is designed for the paths of our data.

+

== Create your GotCloud Configuration File ==

+

Now that you have your FASTQ info file created, we need to setup the configuration.

+

'''If you updated your gotcloud.2x.conf file yesterday, you still need to do this step'''

+

* We updated a few settings to match 1000g

+

*# Updated the version of the genome reference

+

*# Updated to use BWA instead of BWA_MEM

+

*# Updated the version of BWA

+

You can copy from my directory to your personal directory:

+

cp /home/mktrost/seqshop/inputs/gotcloud.2x.conf ~/personal/.

+

You need to update the conf file to find your align.2x.index and ensure your output directory is setup.

+

nedit ~/personal/gotcloud.2x.conf &

+

(or use your favorite editor)

+

* Change the <code>IN_DIR</code> line, replacing <code>YOUR_USER_NAME</code> with your user name.

+

* Everything else is configured already.

+

[[File:Gc2xconf.png]]

+

You'll notice that this file is very similar to the one we have been using.

+

* Just a few modifications to run a new test on the whole genome

+

== Run your Alignment ==

+

=== Screen ===

+

The alignment pipeline will run overnight, but you'll want to log out.

+

; How do I leave something running on the server even if I log out?

+

: One solution is screen!

+

; How do I use screen?

+

: Before running your command, you need to start screen:

+

: <pre>screen</pre>

+

[[File:Screen.png]]

+

As it says, press <code>Space</code> or <code>Return</code>.

+

* It should now look basically the same as your normal command line.

+

You can now start your alignment:

+

/home/mktrost/seqshop/gotcloud/gotcloud align --conf ~/personal/gotcloud.2x.conf --numjobs 1

+

Yes, leave that as /home/mktrost/seqshop/gotcloud/gotcloud - that is where gotcloud is installed. The ~/... points GotCloud to your specific configuration file.

+

You should now see your alignment running:

+

;Want to log out and leave your job running?

+

In the screen window, type:

+

Ctrl-a d

+

(Hold down Ctrl and type 'a', let go of both and type 'd')

+

* This will "detach" from your screen session while your alignment continues to run.

+

;How do you log back into screen tomorrow?

+

screen -r

+

This will resume an already running screen.

+

* Feel free to test it out and you will see your alignment still running

+

** Just use Ctrl-a d to detach from screen and leave your job running

+

; Scrolling problems?

+

: If you want to scroll and screen doesn't scroll like you normally would?

+

:* Type Ctrl-a Esc and you should be able to scroll up with your mouse wheel

+

:** Or at least that is what I do from my Linux machine - (sorry I'm typing this up/testing these commands from Linux and not windows, so can't test it out)

+

== Log Out ==

+

If you have not detached from screen:

+

Ctrl-a d

+

exit PuTTY

+

== FEEDBACK! ==

+

Since I didn't send this out yesterday, today's survey has feedback for Tuesday & Wednesday.

+

https://docs.google.com/forms/d/1qaLHq9w1Ib3FZq0CtlrbK_-breNiqGRV06oRYNmUuME/viewform

Mktrost

Administrators

3,045

edits

Changes

SeqShop: Aligning Your Own Genome, June 2014 (view source)

Revision as of 14:25, 13 November 2014

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools