Line 1: |
Line 1: |
| + | '''Note:''' the latest version of this practical is available at: [[SeqShop: Sequence Mapping and Assembly Practical]] |
| + | * The ones here is the original one from the June workshop (updated to be run from elsewhere) |
| + | |
| == Introduction == | | == Introduction == |
| See the [[Media:SeqShop - GotCloud Align.pdf|introductory slides]] for an intro to this tutorial. | | See the [[Media:SeqShop - GotCloud Align.pdf|introductory slides]] for an intro to this tutorial. |
| + | |
| == Goals of This Session == | | == Goals of This Session == |
| * What we want to learn | | * What we want to learn |
Line 7: |
Line 11: |
| ** How to evaluate the quality of sequence data | | ** How to evaluate the quality of sequence data |
| ** How to visualize sequence data to examine the reads aligned to particular genomic positions | | ** How to visualize sequence data to examine the reads aligned to particular genomic positions |
| + | |
| + | == Setup in person at the SeqShop Workshop == |
| + | ''This section is specifically for the SeqShop Workshop computers.'' |
| + | <div class="mw-collapsible mw-collapsed" style="width:600px"> |
| + | ''If you are not running during the SeqShop Workshop, please skip this section.'' |
| + | <div class="mw-collapsible-content"> |
| | | |
| | | |
| {{SeqShopLogin}} | | {{SeqShopLogin}} |
| | | |
− | == Setup your run environment== | + | === Setup your run environment=== |
| | | |
| This will setup some environment variables to point you to | | This will setup some environment variables to point you to |
Line 24: |
Line 34: |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
| [[File:setup.png|500px]] | | [[File:setup.png|500px]] |
| + | </div> |
| + | </div> |
| </div> | | </div> |
| </div> | | </div> |
| + | |
| + | == Setup when running on your own outside of the SeqShop Workshop == |
| + | ''This section is specifically for running on your own outside of the SeqShop Workshop.'' |
| + | <div class="mw-collapsible" style="width:600px"> |
| + | ''If you are running during the SeqShop Workshop, please skip this section.'' |
| + | <div class="mw-collapsible-content"> |
| + | === Download & Build GotCloud === |
| + | If you do not already have GotCloud: |
| + | * cd to where you want GotCloud installed (you can change this to any directory you want) |
| + | mkdir -p ~/seqshop |
| + | cd ~/seqshop/ |
| + | * download, decompress, and build the version of gotcloud that was tested with this tutorial: |
| + | wget https://github.com/statgen/gotcloud/archive/gotcloud.workshop.tar.gz |
| + | tar xvf gotcloud.workshop.tar.gz |
| + | mv gotcloud-gotcloud.workshop gotcloud |
| + | cd gotcloud/src |
| + | make |
| + | cd ../.. |
| + | |
| + | Remember the path to gotcloud/ that is what you will need to set your GC variable to. |
| + | |
| + | === Download the example data === |
| + | Download and untar file containing the example data used in the practicals: |
| + | wget http://csg.sph.umich.edu//mktrost/seqshopExample.tar.gz |
| + | tar xvf seqshopExample.tar.gz |
| + | |
| + | You will see the names of all the files included in the example data scrolling on the screen as they are unpacked from the tar file. |
| + | |
| + | {{SeqShopRemoteEnv}} |
| | | |
| == Examining [[GotCloud]] Align Input Files == | | == Examining [[GotCloud]] Align Input Files == |
Line 35: |
Line 76: |
| * Subset of FASTQs - should map to chromosome 22 36000000-37000000 | | * Subset of FASTQs - should map to chromosome 22 36000000-37000000 |
| | | |
− | ls ${IN}/fastq/ | + | ls ${SS}/fastq/ |
| There are 24 fastq files: combination of single-end & paired-end. | | There are 24 fastq files: combination of single-end & paired-end. |
| | | |
Line 54: |
Line 95: |
| | | |
| Look at a couple of FASTQs: | | Look at a couple of FASTQs: |
− | less -S ${IN}/fastq/HG00551.SRR190851_1.fastq | + | less -S ${SS}/fastq/HG00551.SRR190851_1.fastq |
| <code>less</code> is a Linux command that allows you to look at a file. | | <code>less</code> is a Linux command that allows you to look at a file. |
| *<code>-S</code> option prevents line wrap | | *<code>-S</code> option prevents line wrap |
Line 74: |
Line 115: |
| | | |
| Look at the paired read: | | Look at the paired read: |
− | less -S ${IN}/fastq/HG00551.SRR190851_2.fastq | + | less -S ${SS}/fastq/HG00551.SRR190851_2.fastq |
| | | |
| Remember, use <code>'q'</code> to exit out of <code>less</code> | | Remember, use <code>'q'</code> to exit out of <code>less</code> |
Line 107: |
Line 148: |
| | | |
| Take a look at the chromosome 22 reference files included for this tutorial: | | Take a look at the chromosome 22 reference files included for this tutorial: |
− | ls ${REF} | + | ls ${SS}/ref22 |
| | | |
| <ul> | | <ul> |
Line 113: |
Line 154: |
| <li>View Screenshot</li> | | <li>View Screenshot</li> |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
− | [[File:RefDir.png|500px]] | + | [[File:RefDir.png|700px]] |
| </div> | | </div> |
| </div> | | </div> |
Line 119: |
Line 160: |
| | | |
| Let's read the reference FASTA file (all reference bases for the chromosome): | | Let's read the reference FASTA file (all reference bases for the chromosome): |
− | less ${REF}/human.g1k.v37.chr22.fa | + | less ${SS}/ref22/human.g1k.v37.chr22.fa |
| | | |
| Remember, use <code>'q'</code> to exit out of <code>less</code> | | Remember, use <code>'q'</code> to exit out of <code>less</code> |
Line 132: |
Line 173: |
| <li>The ends of a chromosome are 'N' - unknown bases</li> | | <li>The ends of a chromosome are 'N' - unknown bases</li> |
| <li>Let's look at 5 lines of the file starting at line 300,000</li> | | <li>Let's look at 5 lines of the file starting at line 300,000</li> |
− | tail -n+300000 ${REF}/human.g1k.v37.chr22.fa |head -n 5 | + | tail -n+300000 ${SS}/ref22/human.g1k.v37.chr22.fa |head -n 5 |
| [[File:Fasta.png|500px]] | | [[File:Fasta.png|500px]] |
| </div> | | </div> |
Line 138: |
Line 179: |
| </ul> | | </ul> |
| </ul> | | </ul> |
| + | |
| + | If you want to access the FASTA file by position, you can use <code>samtools faidx</code> command |
| + | ${GC}/bin/samtools faidx ${SS}/ref22/human.g1k.v37.chr22.fa 22:36000000 | less |
| + | or |
| + | ${GC}/bin/samtools faidx ${SS}/ref22/human.g1k.v37.chr22.fa 22:36000000-36000100 |
| | | |
| === GotCloud FASTQ Index File === | | === GotCloud FASTQ Index File === |
Line 155: |
Line 201: |
| | | |
| Let's look a look at the index file I prepared for this tutorial: | | Let's look a look at the index file I prepared for this tutorial: |
− | less -S ${IN}/align.index | + | less -S ${SS}/align.index |
| | | |
| Remember, use <code>'q'</code> to exit out of <code>less</code> | | Remember, use <code>'q'</code> to exit out of <code>less</code> |
Line 165: |
Line 211: |
| <li>Need a reminder of the format?</li> | | <li>Need a reminder of the format?</li> |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
− | [[File:fqindex.png|650px]] | + | [[File:fqindex.png|750px]] |
| </div> | | </div> |
| </div> | | </div> |
− | | + | <ul> |
| + | <li>Note: in the screenshots, the fields are shifted into clear columns to make it easier to read</li> |
| + | <ul> |
| + | <li>When you view the file, the fields will not line up in neat columns and it can be hard to read</li> |
| + | </ul> |
| + | </ul> |
| <div class="mw-collapsible mw-collapsed" style="width:500px"> | | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
| <li>Hard to read the index? Need a hint?</li> | | <li>Hard to read the index? Need a hint?</li> |
Line 174: |
Line 225: |
| <ul> | | <ul> |
| <li>Use cut to extract just the MERGE_NAME & RGID fields </li> | | <li>Use cut to extract just the MERGE_NAME & RGID fields </li> |
− | cut -f 1,4 ${IN}/align.index | + | cut -f 1,4 ${SS}/align.index |
| + | </ul> |
| </div> | | </div> |
| </div> | | </div> |
− | </ul>
| |
− | </ul>
| |
| <div class="mw-collapsible mw-collapsed" style="width:500px"> | | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
| <li>Answer:</li> | | <li>Answer:</li> |
Line 185: |
Line 235: |
| <li>HG00553 & HG00640</li> | | <li>HG00553 & HG00640</li> |
| <li>They have multiple unique values in the RGID field</li> | | <li>They have multiple unique values in the RGID field</li> |
− | [[File:fqindexRG.png|650px]] | + | [[File:fqindexRG.png|800px]] |
| </div> | | </div> |
| </div> | | </div> |
Line 205: |
Line 255: |
| ** Most values should be left as the defaults | | ** Most values should be left as the defaults |
| * Specify values in your configuration file as: | | * Specify values in your configuration file as: |
− | KEY = value
| + | ** <code>KEY = value</code> |
| * Use $(KEY) to refer to another key's value | | * Use $(KEY) to refer to another key's value |
| * If a KEY is specified twice, the later value is used | | * If a KEY is specified twice, the later value is used |
Line 212: |
Line 262: |
| | | |
| Let's look at the configuration file I created for this test: | | Let's look at the configuration file I created for this test: |
− | more ${IN}/gotcloud.conf | + | more ${SS}/gotcloud.conf |
| | | |
| Use the <code>space bar</code> to advance if the whole file isn't displayed. | | Use the <code>space bar</code> to advance if the whole file isn't displayed. |
| | | |
− | ; If your input and references are at different paths than specified, what would you change? | + | ; If your references are in a different path than what is specified, what would you change? |
| <ul> | | <ul> |
− | <div class="mw-collapsible mw-collapsed" style="width:200px"> | + | <div class="mw-collapsible mw-collapsed" style="width:300px"> |
| <li>Answer:</li> | | <li>Answer:</li> |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
| <ul> | | <ul> |
− | <li>You would change <code>IN_DIR</code> & <code>REF_DIR</code> to the new paths</li> | + | <li>You would change <code>REF_DIR</code> to the new path</li> |
− | [[File:gcConf.png|500px]] | + | [[File:gcConf.png|800px]] |
| </div> | | </div> |
| </div> | | </div> |
Line 234: |
Line 284: |
| | | |
| Now that we have all of our input files, we need just a simple command to run them | | Now that we have all of our input files, we need just a simple command to run them |
− | ${GC}/gotcloud align --conf ${IN}/gotcloud.conf --numcs 2 | + | ${GC}/gotcloud align --conf ${SS}/gotcloud.conf --numcs 2 --base_prefix ${SS} --outdir ${OUT} |
| + | |
| + | * <code>${GC}/gotcloud</code> runs GotCloud |
| + | * <code>align</code> tells GotCloud you want to run the alignment pipeline. |
| + | * <code>--conf</code> tells GotCloud the name of the configuration file to use. |
| + | ** The configuration for this test was downloaded with the seqshop input files. |
| * <code>--numcs</code> means to run 2 samples at a time. | | * <code>--numcs</code> means to run 2 samples at a time. |
− | ** How many you can run concurrently depends on your system | + | ** How many you can run concurrently depends on your system. |
| + | * <code>--base_prefix</code> tells GotCloud the prefix to append to relative paths. |
| + | ** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS} |
| + | ** Alternatively, gotcloud.conf could be updated to specify the full paths |
| + | * <code>--out_dir</code> tells GotCloud where to write the output. |
| + | ** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line |
| + | |
| [[File:gcalignStart.png|850px]] | | [[File:gcalignStart.png|850px]] |
| | | |
Line 244: |
Line 305: |
| | | |
| If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off. | | If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off. |
| + | |
| + | Inside GotCloud align, not only sequence alignment but also pre-processing of sequence data, including deduplication and base quality recalibration are performed along with quality assessment, as illustrated below. |
| + | |
| + | [[File:Gotcloud_align_detail.png|500px]] |
| | | |
| == Examining GotCloud Align Output == | | == Examining GotCloud Align Output == |
Line 319: |
Line 384: |
| [[File:Qplotpdf.png|400px]] | | [[File:Qplotpdf.png|400px]] |
| <li> Look at the PDF I produced when I ran the whole genome:</li> | | <li> Look at the PDF I produced when I ran the whole genome:</li> |
− | evince ${IN}/example/HG00551.wg.qplot.pdf& | + | evince ${SS}/ext/HG00551.wg.qplot.pdf& |
| </ul> | | </ul> |
| [[File:Qplotpdfwg.png|400px]] | | [[File:Qplotpdfwg.png|400px]] |
Line 342: |
Line 407: |
| ; What are the chromosome and position of the first record in the BAM file? | | ; What are the chromosome and position of the first record in the BAM file? |
| <ul> | | <ul> |
− | <div class="mw-collapsible mw-collapsed" style="width:200px"> | + | <div class="mw-collapsible mw-collapsed" style="width:300px"> |
| + | <li>Need a reminder of the format?</li> |
| + | <div class="mw-collapsible-content"> |
| + | [[File:Bam.png|750px]] |
| + | </div> |
| + | </div> |
| + | <div class="mw-collapsible mw-collapsed" style="width:300px"> |
| <li>Answer</li> | | <li>Answer</li> |
| <div class="mw-collapsible-content"> | | <div class="mw-collapsible-content"> |
Line 348: |
Line 419: |
| <li>Chr 22, Pos: 16114122</li> | | <li>Chr 22, Pos: 16114122</li> |
| </ul> | | </ul> |
− | [[File:Bam.png|750px]] | + | [[File:BamRec.png|650px]] |
| </div> | | </div> |
| </div> | | </div> |
Line 355: |
Line 426: |
| ==== Accessing BAMs by Position ==== | | ==== Accessing BAMs by Position ==== |
| BAM's are so big, what if we want to see a position part way through the file? | | BAM's are so big, what if we want to see a position part way through the file? |
− | :[http://samtools.sourceforge.net/samtools.shtml#3 samtools] has an option for that.
| + | *[http://samtools.sourceforge.net/samtools.shtml#3 samtools] has an option for that. |
| | | |
| Add a region to the view command we used above. Let's find all reads that overlap positions 36907000-36907005: | | Add a region to the view command we used above. Let's find all reads that overlap positions 36907000-36907005: |
Line 362: |
Line 433: |
| | | |
| Let's visualize what reads in that area look like using samtools tview: | | Let's visualize what reads in that area look like using samtools tview: |
− | ${GC}/bin/samtools tview ${OUT}/bams/HG00551.recal.bam ${REF}/human.g1k.v37.chr22.fa | + | ${GC}/bin/samtools tview ${OUT}/bams/HG00551.recal.bam ${SS}/ref22/human.g1k.v37.chr22.fa |
| * Type ‘g’ | | * Type ‘g’ |
| ** Type 22:36907000 | | ** Type 22:36907000 |
Line 406: |
Line 477: |
| | | |
| == Logging Off == | | == Logging Off == |
| + | |
| + | ''This section is specifically for the SeqShop Workshop computers.'' |
| + | <div class="mw-collapsible mw-collapsed" style="width:600px"> |
| + | ''If you are not running during the SeqShop Workshop, please skip this section.'' |
| + | <div class="mw-collapsible-content"> |
| To logout of seqshop-server, type: | | To logout of seqshop-server, type: |
| exit | | exit |
Line 411: |
Line 487: |
| | | |
| When done, log out of the Windows machine. | | When done, log out of the Windows machine. |
| + | </div> |
| + | </div> |