Changes

From Genome Analysis Wiki
Line 21: Line 21:  
</div>
 
</div>
   −
== I got sequence, how do I get my data ready to run? ==
+
== Locating your FASTQs ==
=== Finding your FASTQs ===
+
Your FASTQ files are under the <code>~/Sample*/fastqs</code> directory.
Your FASTQ files are under your <code>personal</code> directory.
+
* ''or <code>~/NA12878/fastqs</code>''
    
Look at your directory:
  ls ~/personal
+
  ls ~/Sample*/fastqs
 +
or
 +
ls ~/NA12878/fastqs
 
* <code>ls</code> does a directory listing
* <code>~/</code> means to start from your home/base directory
** Using ~/ means the command will work even if you have changed directories
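The point about <code>~/</code> can be tried out safely with a small sketch. It uses a throwaway directory as a stand-in home, so nothing in your real home directory is touched. The sample name <code>NA12878</code> comes from the example above; the temporary directory and the empty placeholder files (named after the <code>Sample_R1.fastq.gz</code> pattern) are assumptions made only for illustration.

```shell
# Build a stand-in home directory with two empty placeholder FASTQs.
demo_home=$(mktemp -d)
mkdir -p "$demo_home/NA12878/fastqs"
touch "$demo_home/NA12878/fastqs/NA12878_R1.fastq.gz" \
      "$demo_home/NA12878/fastqs/NA12878_R2.fastq.gz"

# Inside the $( ... ) subshell only: pretend demo_home is our home,
# move to an unrelated directory, and list via ~/ anyway.
listing=$(
    HOME="$demo_home"
    cd /
    ls ~/NA12878/fastqs
)
echo "$listing"
```

Because <code>~/</code> is expanded to the home directory before <code>ls</code> runs, the listing does not depend on the current directory.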
   −
You should see 2 directories: Run_992 & Run_993
+
You will see 2 files:
* Our DNA was sequenced in 2 sequencing runs.
+
* Sample_R1.fastq.gz - 1st in pair
 +
* Sample_R2.fastq.gz - 2nd in pair
   −
What is under those directories? Let's check:
+
== What did I do to run the alignment? ==
ls ~/personal/Run_*
+
=== Created FASTQ_LIST file ===
* <code>*</code> is a Linux wild-card match
+
I created a FASTQ_LIST file with the sample #s and fastqs.
** It will do a directory listing on both Run directories at the same time:
+
* What columns do we need in the file that tells GotCloud about our FASTQs?
 +
** SAMPLE
 +
** FASTQ1
 +
** FASTQ2
 +
 +
SAMPLE  FASTQ1                                        FASTQ2
 +
Sample#  /path/to/Sample#/fastqs/Sample#_R1.fastq.gz    /path/to/Sample#/fastqs/Sample#_R2.fastq.gz
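For reference, a file with this layout could be produced with a few lines of shell. This is only a sketch, not the script actually used here: the sample name <code>Sample_12345</code>, the temporary directory standing in for <code>/path/to</code>, and the assumption that fields are separated by tabs are all illustrative.

```shell
# Hypothetical sample name and location; substitute your own.
sample="Sample_12345"
base=$(mktemp -d)                  # stand-in for /path/to
fq_dir="$base/$sample/fastqs"
mkdir -p "$fq_dir"
touch "$fq_dir/${sample}_R1.fastq.gz" "$fq_dir/${sample}_R2.fastq.gz"

list="$base/$sample/fastq.list"

# Header row first, then one row per FASTQ pair (tab-separated, assumed).
printf 'SAMPLE\tFASTQ1\tFASTQ2\n' > "$list"
printf '%s\t%s\t%s\n' "$sample" \
    "$fq_dir/${sample}_R1.fastq.gz" \
    "$fq_dir/${sample}_R2.fastq.gz" >> "$list"

cat "$list"
```

Using <code>printf</code> with explicit <code>\t</code> avoids the risk of an editor silently turning tabs into spaces.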
   −
[[File:RunDir.png]]
+
You can see what your fastq.list looks like
 
+
* I took out the full path since I ran alignment in a different path:
You will see your sample name instead of <code>12345</code>.
+
  cat ~/Sample*/fastq.list
 
  −
 
  −
Since we found another directory, let's check below that for our fastqs:
  −
ls ~/personal/Run_*/*
  −
You should see your fastq files:
  −
 
  −
[[File:Fastqlist.png]]
  −
 
  −
 
  −
Those filenames are cryptic, what do they mean?
  −
 
  −
[[File:FastqlistAnnotated.png]]
  −
 
  −
=== Checking your index file listing your FASTQs ===
  −
Are you analyzing your own genome?  Do you think you set up your file correctly?
  −
 
  −
Try running this script to see if you have any errors:
  −
perl /home/mktrost/seqshop/inputs/checkIndex.pl ~/personal/align.2x.index
  −
 
  −
On success it prints: <code>Congratulations, your fastq index looks valid</code>
  −
 
  −
NOTE: This script is tailored to the filenames provided by our sequencing core as described above.
  −
* It could be tailored to other methods, but is designed for the paths of our data.
  −
 
  −
=== Generating the index file listing your FASTQs ===
  −
What columns do we need in the file that tells GotCloud about our FASTQs?
  −
* MERGE_NAME
  −
* FASTQ1
  −
* FASTQ2
  −
* RGID
  −
* SAMPLE
  −
* LIBRARY
  −
* PLATFORM
  −
 
  −
We will store our FASTQ info file in: ~/personal/align.2x.index.
  −
 
  −
There are a few ways to create this file.
  −
* Write into a text file one fastq pair at a time.
  −
* Copy fastq1s into a spreadsheet, fill it in and copy back to a text file
  −
* [[#Using a Script|Write/Use a script]]
  −
==== Using a Regular Text File ====
  −
Follow the instructions below, but do it one FASTQ1 at a time (you won't be able to paste a full column of FASTQs at a time).
  −
* Remember to put a tab between each field.
  −
 
  −
==== Using a Spreadsheet ====
  −
Since we just have a handful of FASTQs, we can use a spreadsheet to construct our file and then copy the data into a text file.
  −
* Thanks to those who thought to do this yesterday - it was a great idea.
  −
 
  −
First, open Excel
  −
 
  −
===== Header Row =====
  −
Create the header line by typing each of the column names in a row (you may be able to copy this line):
  −
* Make sure you enter these in all CAPS; spelling matters.
  −
MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY PLATFORM
  −
[[File:HdrRow.png]]
  −
 
  −
===== MERGE_NAME =====
  −
MERGE_NAME is just your sample name
  −
* Type your Sample name under the MERGE_NAME column, for example: <code>Sample_12345</code>
  −
[[File:FastqHdrMN.png]]
  −
 
  −
All FASTQs are for the same sample, so you will use <code>Sample_12345</code> on every line.  We will fill those in once we know how many rows we need.
  −
 
  −
===== FASTQ1 =====
  −
FASTQ1 is the 1st in pair FASTQ (or the only FASTQ for single-end data)
  −
* Our sequencing core indicated 1st in pair by <code>R1</code> in the filename.
  −
** All our FASTQs are paired end (no single end fastqs)
  −
* To get a list of just the <code>R1</code> files:
  −
ls -1 ~/personal/Run_*/*/*R1*
  −
* The -1 option tells <code>ls</code> to list the matching files in a single column
  −
** If you have noticed, the number of columns in a directory listing varies based on the width of your window.
  −
** We want to copy these files into one column of a spreadsheet, so we want them displayed as a single column
  −
 
  −
Highlight this list and copy them into the FASTQ1 column of your spreadsheet:
  −
 
  −
[[File:HighlightedFASTQs.png]]
  −
 
  −
[[File:HdrSheetFQ1.png|700]]
  −
 
  −
Now that we know how many rows we have, copy your sample name into all rows:
  −
 
  −
[[File:HdrSheetMN1.png|700]]
  −
 
  −
=====FASTQ2=====
  −
As mentioned before, FASTQ2 files are the 2nd in pair.
  −
* They have the same filename as FASTQ1, except replace the R1 with R2
  −
 
  −
You could do an ls and copy & paste as we did for FASTQ1, BUT we'd have to make sure we properly matched up the mates.
  −
 
  −
EASIER solution:
  −
* In your spreadsheet, copy your FASTQ1 filenames into the FASTQ2 column:
  −
[[File:HdrSheetFQ2 1.png]]
  −
* Now, replace <code>R1</code> with <code>R2</code> in JUST the FASTQ2 column
  −
** Make sure FASTQ1 column keeps the <code>R1</code>
  −
[[File:HdrSheetFQ2 2.png]]
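The same R1-to-R2 swap can be done on the command line instead of in the spreadsheet. The sketch below assumes a hypothetical filename in which the read tag appears as <code>_R1_</code>; the real names from the sequencing core may differ, so check the pattern against your own files.

```shell
# Hypothetical FASTQ1 path; real core filenames may differ.
fq1="Run_992/Sample_12345/Sample_12345_L001_R1_001.fastq.gz"

# Replace only the "_R1_" read tag in the filename part, so that an
# "R1" occurring elsewhere in the path is left untouched.
fq2=$(printf '%s\n' "$fq1" | sed 's/_R1_\([^/]*\)$/_R2_\1/')

echo "$fq2"
```

Anchoring the match to the filename is the command-line equivalent of restricting the replace to JUST the FASTQ2 column.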
  −
 
  −
===== RGID =====
  −
We want to group our FASTQs by Run & Lane.
  −
* Each Run/Lane combination should have a unique Read Group
  −
** Run is in the directory path: Run_992 or Run_993
  −
** Lane is indicated by L001/L002
  −
*** Lanes may have multiple FASTQ pairs
  −
**** the sequencing core split them into smaller FASTQs, but they are the same Run/Lane, so should have the same read group.
  −
 
  −
Populate the RGID column with a name unique to each Run/Lane combination:
  −
[[File:HdrSheetRGannotated.png]]
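If you would rather derive the read group from the path than type it, something like the following could work. It is a sketch under assumptions: the helper name <code>rgid_for</code>, the example paths, and the <code>Run_LANE</code> naming scheme are all hypothetical, and any name is fine as long as it is unique per Run/Lane.

```shell
# Hypothetical helper: build an RGID such as Run_992_L001 from a FASTQ path.
rgid_for() {
    # Run comes from the directory name (Run_992, Run_993, ...).
    run=$(printf '%s\n' "$1" | sed -n 's|.*/\(Run_[0-9][0-9]*\)/.*|\1|p')
    # Lane comes from the L001/L002 tag in the filename.
    lane=$(printf '%s\n' "$1" | sed -n 's|.*_\(L[0-9][0-9]*\)_.*|\1|p')
    printf '%s_%s\n' "$run" "$lane"
}

# Two chunks of the same Run/Lane share one read group; another run/lane does not.
rgid_for "personal/Run_992/Sample_12345/Sample_12345_L001_R1_001.fastq.gz"
rgid_for "personal/Run_992/Sample_12345/Sample_12345_L001_R1_002.fastq.gz"
rgid_for "personal/Run_993/Sample_12345/Sample_12345_L002_R1_001.fastq.gz"
```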
  −
 
  −
=====SAMPLE =====
  −
Put your sample name in each row of this column (you can copy from MERGE_NAME)
  −
 
  −
===== LIBRARY =====
  −
Put your sample name in each row of this column (you can copy from MERGE_NAME)
  −
* If a sample has multiple library preparations done on it, you would want to give unique names
  −
** That is not our case, so just put in the sample name.
  −
 
  −
===== PLATFORM =====
  −
Your data was sequenced on ILLUMINA, so enter <code>ILLUMINA</code> in each row of the platform column.
  −
[[File:HdrSheetDone.png]]
  −
 
  −
===== Copy to Text File =====
  −
Open nedit or your favorite Linux editor
  −
  nedit ~/personal/align.2x.index&
  −
Click <code>New File</code> in the pop-up stating that it couldn't find that file.
  −
 
  −
Copy (Ctrl-c) your table from Excel (including the header row, but with no extra rows/columns).
  −
 
  −
Paste (Ctrl-v) into your nedit window.
  −
 
  −
Navigate through the file - the columns should be delimited with tabs.
  −
 
  −
Save (Ctrl-s) & close nedit.
  −
 
  −
You now have a tab delimited align.2x.index file (a little simpler than yesterday).
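A quick way to confirm the file really is tab-delimited is to count the fields on every line. The sketch below is not the workshop's checking script; it builds a tiny example index in a temporary file (the FASTQ names and the RGID value are made up) and reports any line that does not have exactly 7 tab-separated fields.

```shell
# Tiny example index in a temporary file (contents are illustrative only).
index=$(mktemp)
printf 'MERGE_NAME\tFASTQ1\tFASTQ2\tRGID\tSAMPLE\tLIBRARY\tPLATFORM\n' > "$index"
printf 'Sample_12345\ta_R1.fastq.gz\ta_R2.fastq.gz\tRun_992_L001\tSample_12345\tSample_12345\tILLUMINA\n' >> "$index"

# Collect every line whose tab-separated field count is not 7.
bad_lines=$(awk -F'\t' 'NF != 7 { print "line " NR ": " NF " fields" }' "$index")

if [ -z "$bad_lines" ]; then
    echo "all lines have 7 tab-separated fields"
else
    echo "$bad_lines"
fi
```

To run the same check on your own file, point <code>index</code> at <code>~/personal/align.2x.index</code> instead of the temporary file.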
  −
 
  −
==== Using a Script ====
  −
When generating an index of your FASTQs, it can be easiest to have a script.
  −
* Especially if you have many samples/runs, it would be very tedious to do by hand
  −
 
  −
If you are good at scripting, this may be even easier than doing it by hand
  −
* If you aren't good at scripting, and you have too much data to do by hand
  −
** Make friends with someone who is :-)
  −
** I always find it useful to start from another script (reminds me of commands/tricks)
  −
 
  −
If you still need to create your file and you don't want to use the spreadsheet method above, you can run a script that I made:
  −
perl /home/mktrost/seqshop/inputs/buildIndex.pl ~/personal > ~/personal/align.2x.index
  −
* <code>></code> means to direct the output to the file specified after the <code>></code>
  −
 
  −
Curious what the script looks like and what it does in case you want to create one in the future?
  −
<div class="mw-collapsible mw-collapsed" style="width:200px">
  −
<li>View Annotated Script</li>
  −
<div class="mw-collapsible-content">
  −
[[File:BuildIndex.png|800px]]
  −
</div>
  −
</div>
  −
 
  −
      
== Create your GotCloud Configuration File ==
