Line 1: |
Line 1: |
| + | '''Note:''' the latest version of this practical is available at: [[SeqShop: Aligning Your Own Genome]] |
| + | * The ones here is the original one from the June workshop (updated to be run from elsewhere) |
| + | |
| + | |
| + | == First Things First == |
| + | *Helpful reference to many tools: |
| + | ** http://infoplatter.wordpress.com/2014/04/06/bioinformaticians-pocket-reference/ |
| + | *** links to "cheat-sheets", including, Unix, screen, and vi |
| + | * Our wiki with some brief description of how to do some basic commands |
| + | **http://genome.sph.umich.edu/wiki/Basic_Linux_Intro |
| + | * [[Screen Commands]] for commands to use screen (one way to leave your commands running even after you log out) |
| + | |
| == Goals of This Session == | | == Goals of This Session == |
| + | Learn how to go from your FASTQ files to generate Aligned BAMs. |
| + | * Practice setting up and running GotCloud on your own. |
| + | |
| + | {{SeqShopLogin}} |
| + | |
| + | |
| + | == I didn't get sequenced, what can I do? == |
| + | I prepared some test files for you. |
| + | * I took a 1000g sample and reduced it to 2x. |
| + | ** I already created an align.2x.index file for you. |
| + | *** This is a 1000g sample, so the filenames/RG information for these do not match the ones produced by our sequencer that are described below. |
| + | To be consistent with everyone else you can do: |
| + | mkdir ~/personal |
| + | cp -r /home/mktrost/seqshop/inputs/2x/* ~/personal/. |
| + | ls ~/personal/ |
| + | ls ~/personal/fastq |
| + | |
| + | == I got sequence, how to I get my data ready to run? == |
| + | === Finding your FASTQs === |
| + | Your FASTQ files are under your <code>personal</code> directory. |
| + | |
| + | Look at your directory: |
| + | ls ~/personal |
| + | * <code>ls</code> does a directory listing |
| + | * <code>~/</code> means to start from your home/base directory |
| + | ** Using ~/ means the command will work even if you have changed directories |
| + | |
| + | You should see 2 directories: Run_992 & Run_993 |
| + | * Our DNA was run through as 2 sequencing runs. |
| + | |
| + | What is under those directories? Let's check: |
| + | ls ~/personal/Run_* |
| + | * <code>*</code> is a Linux wild-card match |
| + | ** It will do a directory listing on both Run directories at the same time: |
| + | |
| + | [[File:RunDir.png]] |
| + | |
| + | You will see your sample name instead of <code>12345</code>. |
| + | |
| + | |
| + | Since we found another directory, let's check below that for our fastqs: |
| + | ls ~/personal/Run_*/* |
| + | You should see your fastq files: |
| + | |
| + | [[File:Fastqlist.png]] |
| + | |
| + | |
| + | Those filenames are cryptic, what do they mean? |
| + | |
| + | [[File:FastqlistAnnotated.png]] |
| + | |
| + | === Checking your index file listing your FASTQs === |
| + | Are you analyzing your own genome? Do you think you setup your file correctly? |
| + | |
| + | Try running this script to see if you have any errors: |
| + | perl /home/mktrost/seqshop/inputs/checkIndex.pl ~/personal/align.2x.index |
| + | |
| + | On success it prints: <code>Congratulations, your fastq index looks valid</code> |
| + | |
| + | NOTE: This script is tailored to the filenames provided by our sequencing core as described above. |
| + | * It could be tailored to other methods, but is designed for the paths of our data. |
| + | |
| + | === Generating the index file listing your FASTQs === |
| + | What columns do we need in our file that tells GotCloud about our FASTQ? |
| + | * MERGE_NAME |
| + | * FASTQ1 |
| + | * FASTQ2 |
| + | * RGID |
| + | * SAMPLE |
| + | * LIBRARY |
| + | * PLATFORM |
| + | |
| + | We will store our FASTQ info file in: ~/personal/align.2x.index. |
| + | |
| + | There are a few ways to create this file. |
| + | * Write into a text file one fastq pair at a time. |
| + | * Copy fastq1s into a spreadsheet, fill it in and copy back to a text file |
| + | * [[#Using a Script|Write/Use a script]] |
| + | ==== Using a Regular Text File ==== |
| + | Follow the instructions below, but do it one FASTQ1 at a time (you won't be able to paste a full column of FASTQs at a time). |
| + | * Remember to put a tab between each field. |
| + | |
| + | ==== Using a Spreadsheet ==== |
| + | Since we just have a handful of FASTQs, we can use a spreadsheet to construct our file and then copy the data into a text file. |
| + | * Thanks to those who thought to do this yesterday - it was a great idea. |
| + | |
| + | First, open Excel |
| + | |
| + | ===== Header Row ===== |
| + | Create the header line by typing each of the column names in a row (you may be able to copy this line): |
| + | * make sure you enter these in all CAPS & spelling does matter |
| + | MERGE_NAME FASTQ1 FASTQ2 RGID SAMPLE LIBRARY PLATFORM |
| + | [[File:HdrRow.png]] |
| + | |
| + | ===== MERGE_NAME ===== |
| + | MERGE_NAME is just your sample name |
| + | * Type your Sample name under the MERGE_NAME column, for example: <code>Sample_12345</code> |
| + | [[File:FastqHdrMN.png]] |
| + | |
| + | All FASTQs are for the same sample, so you will use <code>Sample_12345</code> on every line. We will fill those in after we get know how many rows we need. |
| + | |
| + | ===== FASTQ1 ===== |
| + | FASTQ1 is just the 1st in pair FASTQs (or the single FASTQ in single end) |
| + | * Our sequencing core indicated 1st in pair by <code>R1</code> in the filename. |
| + | ** All our FASTQs are paired end (no single end fastqs) |
| + | * To get a a list of just the <code>R1</code> files: |
| + | ls -1 ~/personal/Run_*/*/*R1* |
| + | * The -1 option tells <code>ls</code> to list the matching files in a single column |
| + | ** If you have noticed, the number of columns in a directory listing varies based on the width of your window. |
| + | ** We want to copy these files into one column of a spreadsheet so want them displayed as a single column |
| + | |
| + | Highlight this list and copy them into the FASTQ1 column of your spreadsheet: |
| + | |
| + | [[File:HighlightedFASTQs.png]] |
| + | |
| + | [[File:HdrSheetFQ1.png|700]] |
| + | |
| + | Now that we know how many rows we have, copy your sample name into all rows: |
| + | |
| + | [[File:HdrSheetMN1.png|700]] |
| + | |
| + | =====FASTQ2===== |
| + | As mentioned before, FASTQ2 files are the 2nd in pair. |
| + | * They have the same filename as FASTQ1, except replace the R1 with R2 |
| + | |
| + | You could do an ls and copy & paste as we did for FASTQ1, BUT we'd have to make sure we properly matched up the mates. |
| + | |
| + | EASIER solution: |
| + | * In your spreadsheet, copy your FASTQ1 filenames into the FASTQ2 column: |
| + | [[File:HdrSheetFQ2 1.png]] |
| + | * Now, replace <code>R1</code> with <code>R2</code> in JUST the FASTQ2 column |
| + | ** Make sure FASTQ1 column keeps the <code>R1</code> |
| + | [[File:HdrSheetFQ2 2.png]] |
| + | |
| + | ===== RGID ===== |
| + | We want to group our FASTQs by Run & Lane. |
| + | * Each Run/Lane combination should have a unique Read Group |
| + | ** Run is in the directory path: Run_992 or Run_993 |
| + | ** Lane is indicated by L001/L002 |
| + | *** Lanes may have multiple FASTQ pairs |
| + | **** the sequencing core split them into smaller FASTQs, but they are the same Run/Lane, so should have the same read group. |
| + | |
| + | Populate the RGID column with a name unique to each Read Group/Lane combination: |
| + | [[File:HdrSheetRGannotated.png]] |
| + | |
| + | =====SAMPLE ===== |
| + | Put your sample name in each row of this column (you can copy from MERGE_NAME) |
| + | |
| + | ===== LIBRARY ===== |
| + | Put your sample name in each row of this column (you can copy from MERGE_NAME) |
| + | * If a sample has multiple library preparations done on it, you would want to give unique names |
| + | ** That is not our case, so just put in the sample name. |
| + | |
| + | ===== PLATFORM ===== |
| + | Your data was sequenced on ILLUMINA, so enter <code>ILLUMINA</code> in each row of the platform column. |
| + | [[File:HdrSheetDone.png]] |
| + | |
| + | ===== Copy to Text File ===== |
| + | Open nedit or your favorite linux editor |
| + | nedit ~/personal/align.2x.index& |
| + | Click <code>New File</code> in the pop-up stating that it couldn't find that file. |
| + | |
| + | Copy (Ctrl-c) your table from EXCEL (including the header row, but with no extra rows/columns. |
| + | |
| + | Paste (Ctrl-v) into your nedit window. |
| + | |
| + | Navigate through the file - the columns should be delimited with tabs. |
| + | |
| + | Save (Ctrl-s) & close nedit. |
| + | |
| + | You now have a tab delimited align.2x.index file (a little simpler than yesterday). |
| + | |
| + | ==== Using a Script ==== |
| + | When generating an index of your FASTQs, it can be easiest to have a script. |
| + | * Especially if you have many samples/runs, it would be very tedious to do by hand |
| + | |
| + | If you are good at scripting, this may be even easier than doing it by hand |
| + | * If you aren't good at scripting, and you have too much data to do by hand |
| + | ** Make friends with someone who is :-) |
| + | ** I always find it useful to start from another script (reminds me of commands/tricks) |
| + | |
| + | If you still need to create your file and you don't want to use the spreadsheet method above, you can run a script that I made: |
| + | perl /home/mktrost/seqshop/inputs/buildIndex.pl ~/personal > ~/personal/align.2x.index |
| + | * <code>></code> means to direct the output to the file specified after the <code>></code> |
| + | |
| + | Curious what the script looks like and what it does in case you want to create one in the future? |
| + | <div class="mw-collapsible mw-collapsed" style="width:200px"> |
| + | <li>View Annotated Script</li> |
| + | <div class="mw-collapsible-content"> |
| + | [[File:BuildIndex.png|800px]] |
| + | </div> |
| + | </div> |
| + | |
| + | === Checking your index file listing your FASTQs === |
| + | Are you analyzing your own genome? Do you think you setup your file correctly? |
| + | |
| + | Try running this script to see if you have any errors: |
| + | perl /home/mktrost/seqshop/inputs/checkIndex.pl ~/personal/align.2x.index |
| + | |
| + | On success it prints: <code>Congratulations, your fastq index looks valid</code> |
| + | |
| + | NOTE: This script is tailored to the filenames provided by our sequencing core as described above. |
| + | * It could be tailored to other methods, but is designed for the paths of our data. |
| + | |
| + | == Create your GotCloud Configuration File == |
| + | Now that you have your FASTQ info file created, we need to setup the configuration. |
| + | |
| + | '''If you updated your gotcloud.2x.conf file yesterday, you still need to do this step''' |
| + | * We updated a few settings to match 1000g |
| + | *# Updated the version of the genome reference |
| + | *# Updated to use BWA instead of BWA_MEM |
| + | *# Updated the version of BWA |
| + | |
| + | You can copy from my directory to your personal directory: |
| + | cp /home/mktrost/seqshop/inputs/gotcloud.2x.conf ~/personal/. |
| + | |
| + | You need to update the conf file to find your align.2x.index and ensure your output directory is setup. |
| + | |
| + | nedit ~/personal/gotcloud.2x.conf & |
| + | (or use your favorite editor) |
| + | * Change the <code>IN_DIR</code> line, replacing <code>YOUR_USER_NAME</code> with your user name. |
| + | * Everything else is configured already. |
| + | [[File:Gc2xconf.png]] |
| + | |
| + | |
| + | You'll notice that this file is very similar to the one we have been using. |
| + | * Just a few modifications to run a new test on the whole genome |
| + | |
| + | == Run your Alignment == |
| + | |
| + | === Screen === |
| + | The alignment pipeline will run overnight, but you'll want to log out. |
| + | |
| + | ; How do I leave something running on the server even if I log out? |
| + | : One solution is screen! |
| + | |
| + | ; How do I use screen? |
| + | : Before running your command, you need to start screen: |
| + | : <pre>screen</pre> |
| + | |
| + | [[File:Screen.png]] |
| + | |
| + | As it says, press <code>Space</code> or <code>Return</code>. |
| + | * It should now look basically the same as your normal command line. |
| + | |
| + | You can now start your alignment: |
| + | /home/mktrost/seqshop/gotcloud/gotcloud align --conf ~/personal/gotcloud.2x.conf --numjobs 1 |
| + | Yes, leave that as /home/mktrost/seqshop/gotcloud/gotcloud - that is where gotcloud is installed. The ~/... points GotCloud to your specific configuration file. |
| + | |
| + | You should now see your alignment running: |
| + | |
| + | ;Want to log out and leave your job running? |
| + | In the screen window, type: |
| + | Ctrl-a d |
| + | (Hold down Ctrl and type 'a', let go of both and type 'd') |
| + | * This will "detach" from your screen session while your alignment continues to run. |
| + | |
| + | ;How do you log back into screen tomorrow? |
| + | screen -r |
| + | This will resume an already running screen. |
| + | * Feel free to test it out and you will see your alignment still running |
| + | ** Just use Ctrl-a d to detach from screen and leave your job running |
| + | |
| + | ; Scrolling problems? |
| + | : If you want to scroll and screen doesn't scroll like you normally would? |
| + | :* Type Ctrl-a Esc and you should be able to scroll up with your mouse wheel |
| + | :** Or at least that is what I do from my Linux machine - (sorry I'm typing this up/testing these commands from Linux and not windows, so can't test it out) |
| + | |
| + | |
| + | == Log Out == |
| + | If you have not detached from screen: |
| + | Ctrl-a d |
| + | |
| + | exit PuTTY |
| + | |
| + | == FEEDBACK! == |
| + | Since I didn't send this out yesterday, today's survey has feedback for Tuesday & Wednesday. |
| + | https://docs.google.com/forms/d/1qaLHq9w1Ib3FZq0CtlrbK_-breNiqGRV06oRYNmUuME/viewform |