Changes

SeqShop: Aligning Your Own Genome, December 2014 (view source)

Revision as of 21:52, 7 December 2014

6,504 bytes removed, 21:52, 7 December 2014

→‎I got sequence, how to I get my data ready to run?

Line 21: Line 21:

</div>

−

== ~~I got sequence, how to I get my data ready to run? ==~~

+

== Locating your FASTQs ==

−

~~=== Finding~~ your FASTQs ===

+

Your FASTQ files are under <code>~/Sample*/fastqs</code> directory.

−

Your FASTQ files are under ~~your~~ <code>~~personal~~</code> directory.

+

* ''or <code>~/NA12878/fastqs</code>''

Look at your directory:

−

ls ~/~~personal~~

+

ls ~/Sample*/fastqs

+

or

+

ls ~/NA12878/fastqs

* <code>ls</code> does a directory listing

* <code>~/</code> means to start from your home/base directory

** Using ~/ means the command will work even if you have changed directories

−

You ~~should~~ see 2 ~~directories~~: ~~Run_992 & Run_993~~

+

You will see 2 files:

−

* ~~Our DNA was run through as 2 sequencing runs~~.

+

* Sample_R1.fastq.gz - 1st in pair

+

* Sample_R2.fastq.gz - 2nd in pair
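A quick sanity check on a pair like this is to confirm that both files hold the same number of reads. The sketch below assumes the common FASTQ layout of 4 lines per read; <code>count_reads</code> is a hypothetical helper name, and the tiny demo pair is made up so the example runs anywhere. Point it at your own <code>~/Sample*/fastqs</code> files instead.

```shell
#!/bin/sh
# Sketch: count reads in a gzipped FASTQ, assuming the usual 4 lines per read.
# count_reads is a hypothetical helper name, not a GotCloud command.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $(($lines / 4))
}

# Demo with a tiny made-up pair written to a temporary directory.
pair_dir=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' | gzip -c > "$pair_dir/Sample_R1.fastq.gz"
printf '@r1/2\nTGCA\n+\nIIII\n@r2/2\nCCAT\n+\nIIII\n' | gzip -c > "$pair_dir/Sample_R2.fastq.gz"

n1=$(count_reads "$pair_dir/Sample_R1.fastq.gz")
n2=$(count_reads "$pair_dir/Sample_R2.fastq.gz")
if [ "$n1" -eq "$n2" ]; then
    echo "pair OK: $n1 reads in each file"
else
    echo "pair mismatch: R1 has $n1 reads, R2 has $n2" >&2
fi
```

A mismatch usually points to a truncated or incomplete copy of one of the two files.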

−

What ~~is under those directories~~? ~~Let'~~s ~~check:~~

+

== What did I do to run the alignment? ==

−

~~ls ~/personal/Run_~~*

+

=== Created FASTQ_LIST file ===

−

* ~~<code>~~*~~</code> is a Linux wild-card match~~

+

I created a FASTQ_LIST file with the sample #s and fastqs.

−

** ~~It will do a directory listing on both Run directories at the same time:~~

+

* What columns do we need in our file that tells GotCloud about our FASTQ?

+

** SAMPLE

+

** FASTQ1

+

** FASTQ2

+

SAMPLE FASTQ1 FASTQ2

+

Sample# /path/to/Sample#/fastqs/Sample#_R1.fastq.gz /path/to/Sample#/fastqs/Sample#_R2.fastq.gz

−

~~[[File:RunDir.png]]~~

+

You can see what your fastq.list looks like

−

+

* I took out the full path since I ran alignment in a different path:

−

You ~~will see your sample name instead of <code>12345</code>.~~

+

cat ~/Sample*/fastq.list
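For reference, a three-column list like the one described above (SAMPLE, FASTQ1, FASTQ2, tab-delimited) could be generated with a short script rather than typed by hand. This is only a sketch, not the script used for the workshop: <code>build_fastq_list</code> is a hypothetical helper, and it assumes the sample name is the filename without the <code>_R1.fastq.gz</code> suffix and that each mate file differs only in <code>R1</code> versus <code>R2</code>.

```shell
#!/bin/sh
# Sketch: write a tab-delimited FASTQ list with the three columns
# SAMPLE, FASTQ1, FASTQ2 for every *_R1.fastq.gz found in a directory.
# build_fastq_list is a hypothetical helper, not part of GotCloud.
build_fastq_list() {
    fastq_dir=$1
    printf 'SAMPLE\tFASTQ1\tFASTQ2\n'
    for r1 in "$fastq_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue             # skip if the glob matched nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"  # mate file: swap R1 for R2
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            continue
        fi
        # Assumption: the sample name is the filename minus the R1 suffix.
        sample=$(basename "$r1" _R1.fastq.gz)
        printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
    done
}

# Demo on an empty placeholder pair in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/Sample1_R1.fastq.gz" "$demo_dir/Sample1_R2.fastq.gz"
build_fastq_list "$demo_dir" > "$demo_dir/fastq.list"
cat "$demo_dir/fastq.list"
```

Deriving FASTQ2 from FASTQ1 keeps the mates matched up, which is easier than listing the two sets of files separately and pairing them afterwards.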

−

~~Since we found another directory, let's check below that for our fastqs:~~

−

ls ~/personal/Run_*/*

−

~~You should~~ see your fastq ~~files:~~

−

~~[[File:Fastqlist~~.~~png]]~~

−

~~Those filenames are cryptic, what do they mean?~~

−

~~[[File:FastqlistAnnotated.png]]~~

−

~~=== Checking your index file listing your FASTQs ===~~

−

~~Are you analyzing your own genome? Do you think you setup your file correctly?~~

−

~~Try running this script to see if you have any errors:~~

−

~~perl /home/mktrost/seqshop/inputs/checkIndex.pl ~/personal/align.2x.index~~

−

~~On success it prints: <code>Congratulations, your fastq index~~ looks ~~valid</code>~~

−

~~NOTE: This script is tailored to the filenames provided by our sequencing core as described above.~~

−

* ~~It could be tailored to other methods, but is designed for~~ the ~~paths of our data.~~

−

~~=== Generating the index file listing your FASTQs ===~~

−

~~What columns do we need in our file that tells GotCloud about our FASTQ?~~

−

* MERGE_NAME

−

* FASTQ1

−

* FASTQ2

−

* RGID

−

* SAMPLE

−

* LIBRARY

−

* PLATFORM

−

~~We will store our FASTQ info file in: ~/personal/align.2x.index.~~

−

~~There are a few ways to create this file.~~

−

* Write into a text file one fastq pair at a time.

−

* Copy fastq1s into a spreadsheet, fill it in and copy back to a text file

−

* [[#Using a Script|Write/Use a script]]

−

~~==== Using a Regular Text File ====~~

−

~~Follow the instructions below, but do it one FASTQ1 at a time (you won't be able to paste a~~ full ~~column of FASTQs at a time).~~

−

* Remember to put a tab between each field.

−

~~==== Using a Spreadsheet ====~~

−

~~Since we just have a handful of FASTQs, we can use a spreadsheet to construct our file and then copy the data into a text file.~~

−

* Thanks to those who thought to do this yesterday - it was a great idea.

−

~~First, open Excel~~

−

~~===== Header Row =====~~

−

~~Create the header line by typing each of the column names in a row (you may be able to copy this line):~~

−

* make sure you enter these in all CAPS & spelling does matter

−

~~MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY PLATFORM~~

−

~~[[File:HdrRow.png]]~~

−

~~===== MERGE_NAME =====~~

−

~~MERGE_NAME is just your sample name~~

−

* Type your Sample name under the MERGE_NAME column, for example: <code>Sample_12345</code>

−

~~[[File:FastqHdrMN.png]]~~

−

~~All FASTQs are for the same sample, so you will use <code>Sample_12345</code> on every line. We will fill those in after we get know how many rows we need.~~

−

~~===== FASTQ1 =====~~

−

~~FASTQ1 is just the 1st in pair FASTQs (or the single FASTQ in single end)~~

−

* Our sequencing core indicated 1st in pair by <code>R1</code> in the filename.

−

** All our FASTQs are paired end (no single end fastqs)

−

* To get a a list of just the <code>R1</code> files:

−

ls -1 ~/personal/Run_*/*/*R1*

−

* The -1 option tells <code>ls</code> to list the matching files in a single column

−

** If you have noticed, the number of columns in a directory listing varies based on the width of your window.

−

** We want to copy these files into one column of a spreadsheet so want them displayed as a single column

−

~~Highlight this list and copy them into the FASTQ1 column of your spreadsheet:~~

−

~~[[File:HighlightedFASTQs.png]]~~

−

~~[[File:HdrSheetFQ1.png|700]]~~

−

~~Now that we know how many rows we have, copy your sample name into all rows:~~

−

~~[[File:HdrSheetMN1.png|700]]~~

−

~~=====FASTQ2=====~~

−

~~As mentioned before, FASTQ2 files are the 2nd~~ in ~~pair.~~

−

* They have the same filename as FASTQ1, except replace the R1 with R2

−

~~You could do an ls and copy & paste as we did for FASTQ1, BUT we'd have to make sure we properly matched up the mates.~~

−

~~EASIER solution:~~

−

* In your spreadsheet, copy your FASTQ1 filenames into the FASTQ2 column:

−

~~[[File:HdrSheetFQ2 1.png]]~~

−

* Now, replace <code>R1</code> with <code>R2</code> in JUST the FASTQ2 column

−

** Make sure FASTQ1 column keeps the <code>R1</code>

−

~~[[File:HdrSheetFQ2 2.png]]~~

−

~~===== RGID =====~~

−

~~We want to group our FASTQs by Run & Lane.~~

−

* Each Run/Lane combination should have a ~~unique Read Group~~

−

** Run is in the directory path: ~~Run_992 or Run_993~~

−

** Lane is indicated by L001/L002

−

*** Lanes may have multiple FASTQ pairs

−

**** the sequencing core split them into smaller FASTQs, but they are the same Run/Lane, so should have the same read group.

−

~~Populate the RGID column with a name unique to each Read Group/Lane combination:~~

−

~~[[File:HdrSheetRGannotated.png]]~~

−

~~=====SAMPLE =====~~

−

~~Put your sample name in each row of this column (you can copy from MERGE_NAME)~~

−

~~===== LIBRARY =====~~

−

~~Put your sample name in each row of this column (you can copy from MERGE_NAME)~~

−

* If a sample has multiple library preparations done on it, you would want to give unique names

−

** That is not our case, so just put in the sample name.

−

~~===== PLATFORM =====~~

−

~~Your data was sequenced on ILLUMINA, so enter <code>ILLUMINA</code> in each row of the platform column.~~

−

~~[[File:HdrSheetDone.png]]~~

−

~~===== Copy to Text File =====~~

−

~~Open nedit or your favorite linux editor~~

−

~~nedit~~ ~/~~personal/align.2x.index&~~

−

~~Click <code>New File</code> in the pop-up stating that it couldn't find that file.~~

−

~~Copy (Ctrl-c) your table from EXCEL (including the header row, but with no extra rows/columns.~~

−

~~Paste (Ctrl-v) into your nedit window.~~

−

~~Navigate through the file - the columns should be delimited with tabs.~~

−

~~Save (Ctrl-s) & close nedit.~~

−

~~You now have a tab delimited align.2x.index file (a little simpler than yesterday).~~

−

~~==== Using a Script ====~~

−

~~When generating an index of your FASTQs, it can be easiest to have a script.~~

−

* Especially if you have many samples/runs, it would be very tedious to do by hand

−

~~If you are good at scripting, this may be even easier than doing it by hand~~

−

* If you aren't good at scripting, and you have too much data to do by hand

−

** Make friends with someone who is :-)

−

** I always find it useful to start from another script (reminds me of commands/tricks)

−

~~If you still need to create your file and you don't want to use the spreadsheet method above, you can run a script that I made:~~

−

~~perl /home/mktrost/seqshop/inputs/buildIndex.pl ~/personal > ~/personal/align.2x.index~~

−

* ~~<code>></code> means to direct the output to the file specified after the <code>></code>~~

−

~~Curious what the script looks like and what it does in case you want to create one in the future?~~

−

~~<div class="mw-collapsible mw-collapsed" style="width:200px">~~

−

~~<li>View Annotated Script<~~/~~li>~~

−

~~<div class="mw-collapsible-content">~~

−

~~[[File:BuildIndex.png|800px]]~~

−

~~</div>~~

−

~~</div>~~

−

~~=== Checking your index file listing your FASTQs ===~~

−

~~Are you analyzing your own genome? Do you think you setup your file correctly?~~

−

~~Try running this script to see if you have any errors:~~

−

~~perl /home/mktrost/seqshop/inputs/checkIndex.pl ~/personal/align.2x.index~~

−

~~On success it prints: <code>Congratulations, your~~ fastq ~~index looks valid</code>~~

−

~~NOTE: This script is tailored to the filenames provided by our sequencing core as described above.~~

−

* It could be tailored to other methods, but is designed for the paths of our data.

== Create your GotCloud Configuration File ==

Mktrost
