Line 21: |
Line 21: |
| </div> | | </div> |
| | | |
− | == I got sequence, how to I get my data ready to run? == | + | == Locating your FASTQs == |
− | === Finding your FASTQs ===
| + | Your FASTQ files are under the <code>~/Sample*/fastqs</code> directory. |
− | Your FASTQ files are under your <code>personal</code> directory. | + | * ''or <code>~/NA12878/fastqs</code>'' |
| | | |
| Look at your directory: | | Look at your directory: |
− | ls ~/personal | + | ls ~/Sample*/fastqs |
| + | or |
| + | ls ~/NA12878/fastqs |
| * <code>ls</code> does a directory listing | | * <code>ls</code> does a directory listing |
| * <code>~/</code> means to start from your home/base directory | | * <code>~/</code> means to start from your home/base directory |
| ** Using ~/ means the command will work even if you have changed directories | | ** Using ~/ means the command will work even if you have changed directories |
| | | |
− | You should see 2 directories: Run_992 & Run_993 | + | You will see 2 files: |
− | * Our DNA was run through as 2 sequencing runs. | + | * Sample_R1.fastq.gz - 1st in pair |
| + | * Sample_R2.fastq.gz - 2nd in pair |
| | | |
− | What is under those directories? Let's check: | + | == What did I do to run the alignment? == |
− | ls ~/personal/Run_*
| + | === Created FASTQ_LIST file === |
− | * <code>*</code> is a Linux wild-card match | + | I created a FASTQ_LIST file with the sample #s and fastqs. |
− | ** It will do a directory listing on both Run directories at the same time: | + | * What columns do we need in our file that tells GotCloud about our FASTQ? |
| + | ** SAMPLE |
| + | ** FASTQ1 |
| + | ** FASTQ2 |
| + | |
| + | SAMPLE FASTQ1 FASTQ2 |
| + | Sample# /path/to/Sample#/fastqs/Sample#_R1.fastq.gz /path/to/Sample#/fastqs/Sample#_R2.fastq.gz |
| | | |
− | [[File:RunDir.png]]
| + | You can see what your fastq.list looks like |
− | | + | * I took out the full path since I ran alignment in a different path: |
− | You will see your sample name instead of <code>12345</code>. | + | cat ~/Sample*/fastq.list |
− | | |
− | | |
− | Since we found another directory, let's check below that for our fastqs:
| |
− | ls ~/personal/Run_*/*
| |
− | You should see your fastq files:
| |
− | | |
− | [[File:Fastqlist.png]]
| |
− | | |
− | | |
− | Those filenames are cryptic, what do they mean?
| |
− | | |
− | [[File:FastqlistAnnotated.png]]
| |
− | | |
− | === Checking your index file listing your FASTQs ===
| |
− | Are you analyzing your own genome? Do you think you setup your file correctly?
| |
− | | |
− | Try running this script to see if you have any errors:
| |
− | perl /home/mktrost/seqshop/inputs/checkIndex.pl ~/personal/align.2x.index
| |
− | | |
− | On success it prints: <code>Congratulations, your fastq index looks valid</code>
| |
− | | |
− | NOTE: This script is tailored to the filenames provided by our sequencing core as described above.
| |
− | * It could be tailored to other methods, but is designed for the paths of our data. | |
− | | |
− | === Generating the index file listing your FASTQs ===
| |
− | What columns do we need in our file that tells GotCloud about our FASTQ?
| |
− | * MERGE_NAME
| |
− | * FASTQ1
| |
− | * FASTQ2
| |
− | * RGID
| |
− | * SAMPLE
| |
− | * LIBRARY
| |
− | * PLATFORM
| |
− | | |
− | We will store our FASTQ info file in: ~/personal/align.2x.index.
| |
− | | |
− | There are a few ways to create this file.
| |
− | * Write into a text file one fastq pair at a time.
| |
− | * Copy fastq1s into a spreadsheet, fill it in and copy back to a text file
| |
− | * [[#Using a Script|Write/Use a script]]
| |
− | ==== Using a Regular Text File ====
| |
− | Follow the instructions below, but do it one FASTQ1 at a time (you won't be able to paste a full column of FASTQs at a time).
| |
− | * Remember to put a tab between each field.
| |
− | | |
− | ==== Using a Spreadsheet ====
| |
− | Since we just have a handful of FASTQs, we can use a spreadsheet to construct our file and then copy the data into a text file.
| |
− | * Thanks to those who thought to do this yesterday - it was a great idea.
| |
− | | |
− | First, open Excel
| |
− | | |
− | ===== Header Row =====
| |
− | Create the header line by typing each of the column names in a row (you may be able to copy this line):
| |
− | * make sure you enter these in all CAPS & spelling does matter
| |
− | MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY PLATFORM
| |
− | [[File:HdrRow.png]]
| |
− | | |
− | ===== MERGE_NAME =====
| |
− | MERGE_NAME is just your sample name
| |
− | * Type your Sample name under the MERGE_NAME column, for example: <code>Sample_12345</code>
| |
− | [[File:FastqHdrMN.png]]
| |
− | | |
− | All FASTQs are for the same sample, so you will use <code>Sample_12345</code> on every line. We will fill those in after we get know how many rows we need.
| |
− | | |
− | ===== FASTQ1 =====
| |
− | FASTQ1 is just the 1st in pair FASTQs (or the single FASTQ in single end)
| |
− | * Our sequencing core indicated 1st in pair by <code>R1</code> in the filename.
| |
− | ** All our FASTQs are paired end (no single end fastqs)
| |
− | * To get a a list of just the <code>R1</code> files:
| |
− | ls -1 ~/personal/Run_*/*/*R1*
| |
− | * The -1 option tells <code>ls</code> to list the matching files in a single column
| |
− | ** If you have noticed, the number of columns in a directory listing varies based on the width of your window.
| |
− | ** We want to copy these files into one column of a spreadsheet so want them displayed as a single column
| |
− | | |
− | Highlight this list and copy them into the FASTQ1 column of your spreadsheet:
| |
− | | |
− | [[File:HighlightedFASTQs.png]]
| |
− | | |
− | [[File:HdrSheetFQ1.png|700]]
| |
− | | |
− | Now that we know how many rows we have, copy your sample name into all rows:
| |
− | | |
− | [[File:HdrSheetMN1.png|700]]
| |
− | | |
− | =====FASTQ2=====
| |
− | As mentioned before, FASTQ2 files are the 2nd in pair.
| |
− | * They have the same filename as FASTQ1, except replace the R1 with R2
| |
− | | |
− | You could do an ls and copy & paste as we did for FASTQ1, BUT we'd have to make sure we properly matched up the mates.
| |
− | | |
− | EASIER solution:
| |
− | * In your spreadsheet, copy your FASTQ1 filenames into the FASTQ2 column:
| |
− | [[File:HdrSheetFQ2 1.png]]
| |
− | * Now, replace <code>R1</code> with <code>R2</code> in JUST the FASTQ2 column
| |
− | ** Make sure FASTQ1 column keeps the <code>R1</code>
| |
− | [[File:HdrSheetFQ2 2.png]]
| |
− | | |
− | ===== RGID =====
| |
− | We want to group our FASTQs by Run & Lane.
| |
− | * Each Run/Lane combination should have a unique Read Group
| |
− | ** Run is in the directory path: Run_992 or Run_993
| |
− | ** Lane is indicated by L001/L002
| |
− | *** Lanes may have multiple FASTQ pairs
| |
− | **** the sequencing core split them into smaller FASTQs, but they are the same Run/Lane, so should have the same read group.
| |
− | | |
− | Populate the RGID column with a name unique to each Read Group/Lane combination:
| |
− | [[File:HdrSheetRGannotated.png]]
| |
− | | |
− | =====SAMPLE =====
| |
− | Put your sample name in each row of this column (you can copy from MERGE_NAME)
| |
− | | |
− | ===== LIBRARY =====
| |
− | Put your sample name in each row of this column (you can copy from MERGE_NAME)
| |
− | * If a sample has multiple library preparations done on it, you would want to give unique names
| |
− | ** That is not our case, so just put in the sample name.
| |
− | | |
− | ===== PLATFORM =====
| |
− | Your data was sequenced on ILLUMINA, so enter <code>ILLUMINA</code> in each row of the platform column.
| |
− | [[File:HdrSheetDone.png]]
| |
− | | |
− | ===== Copy to Text File =====
| |
− | Open nedit or your favorite linux editor
| |
− | nedit ~/personal/align.2x.index& | |
− | Click <code>New File</code> in the pop-up stating that it couldn't find that file.
| |
− | | |
− | Copy (Ctrl-c) your table from EXCEL (including the header row, but with no extra rows/columns.
| |
− | | |
− | Paste (Ctrl-v) into your nedit window.
| |
− | | |
− | Navigate through the file - the columns should be delimited with tabs.
| |
− | | |
− | Save (Ctrl-s) & close nedit.
| |
− | | |
− | You now have a tab delimited align.2x.index file (a little simpler than yesterday).
| |
− | | |
− | ==== Using a Script ====
| |
− | When generating an index of your FASTQs, it can be easiest to have a script.
| |
− | * Especially if you have many samples/runs, it would be very tedious to do by hand
| |
− | | |
− | If you are good at scripting, this may be even easier than doing it by hand
| |
− | * If you aren't good at scripting, and you have too much data to do by hand
| |
− | ** Make friends with someone who is :-)
| |
− | ** I always find it useful to start from another script (reminds me of commands/tricks)
| |
− | | |
− | If you still need to create your file and you don't want to use the spreadsheet method above, you can run a script that I made:
| |
− | perl /home/mktrost/seqshop/inputs/buildIndex.pl ~/personal > ~/personal/align.2x.index
| |
− | * <code>></code> means to direct the output to the file specified after the <code>></code> | |
− | | |
− | Curious what the script looks like and what it does in case you want to create one in the future?
| |
− | <div class="mw-collapsible mw-collapsed" style="width:200px">
| |
− | <li>View Annotated Script</li>
| |
− | <div class="mw-collapsible-content">
| |
− | [[File:BuildIndex.png|800px]]
| |
− | </div>
| |
− | </div>
| |
− | | |
− | === Checking your index file listing your FASTQs ===
| |
− | Are you analyzing your own genome? Do you think you setup your file correctly?
| |
− | | |
− | Try running this script to see if you have any errors:
| |
− | perl /home/mktrost/seqshop/inputs/checkIndex.pl ~/personal/align.2x.index
| |
− | | |
− | On success it prints: <code>Congratulations, your fastq index looks valid</code>
| |
− | | |
− | NOTE: This script is tailored to the filenames provided by our sequencing core as described above.
| |
− | * It could be tailored to other methods, but is designed for the paths of our data.
| |
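A FASTQ_LIST file with the three columns described above (SAMPLE, FASTQ1, FASTQ2) could also be generated with a small shell function. This is only a sketch, not the script that was used for the workshop data: the function name <code>make_fastq_list</code> is hypothetical, and it assumes tab-delimited columns and the <code>_R1.fastq.gz</code> / <code>_R2.fastq.gz</code> naming shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GotCloud): print a FASTQ_LIST for one sample.
# Usage: make_fastq_list SAMPLE_NAME FASTQ_DIR
# Assumes paired files named *_R1.fastq.gz and *_R2.fastq.gz in FASTQ_DIR.
make_fastq_list() {
    local sample="$1" dir="$2" r1 r2

    # Header row: column names separated by tabs (assumed delimiter).
    printf 'SAMPLE\tFASTQ1\tFASTQ2\n'

    for r1 in "$dir"/*_R1.fastq.gz; do
        # Skip the unexpanded pattern when no R1 files exist.
        [ -e "$r1" ] || continue

        # Derive the mate's filename by swapping the _R1 suffix for _R2.
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "warning: no R2 mate found for $r1" >&2
            continue
        fi

        printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
    done
}
```

For example, <code>make_fastq_list NA12878 ~/NA12878/fastqs > ~/NA12878/fastq.list</code> would write one header row plus one row per FASTQ pair. Deriving FASTQ2 from FASTQ1 keeps the mates matched, rather than relying on two separate directory listings lining up.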
| | | |
| == Create your GotCloud Configuration File == | | == Create your GotCloud Configuration File == |