Line 1: |
Line 1: |
| + | '''Note:''' the latest version of this practical is available at: [[SeqShop: Sequence Mapping and Assembly Practical]] |
| + | * The ones here is the original one from the June workshop (updated to be run from elsewhere) |
| + | |
| + | == Introduction == |
| + | See the [[Media:SeqShop - GotCloud Align.pdf|introductory slides]] for an intro to this tutorial. |
| + | |
| == Goals of This Session == | | == Goals of This Session == |
| * What we want to learn | | * What we want to learn |
Line 6: |
Line 12: |
| ** How to visualize sequence data to examine the reads aligned to particular genomic positions | | ** How to visualize sequence data to examine the reads aligned to particular genomic positions |
| | | |
| + | == Setup in person at the SeqShop Workshop == |
| + | ''This section is specifically for the SeqShop Workshop computers.'' |
| + | <div class="mw-collapsible mw-collapsed" style="width:600px"> |
| + | ''If you are not running during the SeqShop Workshop, please skip this section.'' |
| + | <div class="mw-collapsible-content"> |
| | | |
− | == Step 0: Login to the machine & setup environment==
| |
| | | |
− | <ol>
| + | {{SeqShopLogin}} |
− | <li> Login to the windows machine
| |
− | * The username/password for the Windows machine should be written on it</li>
| |
− | <li> Open putty
| |
− | * Start->.....</li>
| |
− | <li> In putty, login to seqshop-server.sph.umich.edu
| |
− | *
| |
− | * Server name: seqshop-server.sph.umich.edu
| |
− | * Enter your provided username & password</li>
| |
− | <li> To simplify commands/typing, we will setup a couple of environment variables
| |
− | <ol>
| |
− | <li>to point to the GotCloud directory</li>
| |
− | export GC=/home/mktrost/seqshop/
| |
− | <li> to point to your test setup</li>
| |
− | export SETUP=~/setup
| |
− | <li> to point to your output directory</li>
| |
− | export OUTPUT=~/out
| |
− | </ol>
| |
− | <li> Create your directories:</li>
| |
− | mkdir ${SETUP}
| |
− | mkdir ${OUTPUT}
| |
− | </ol>
| |
| | | |
− | == Examining Raw Sequence Reads == | + | === Setup your run environment=== |
− | FASTQ : standard file format provided to you by those who did the sequencing.
| |
− | : For more information on the FASTQ format, see: http://en.wikipedia.org/wiki/FASTQ_format
| |
| | | |
− | For this tutorial, we will use FASTQs for 6 1000 Genome samples
| + | This will setup some environment variables to point you to |
− | * Subset of FASTQs - should map around CFTR on chr7 | + | * [[GotCloud]] program |
| + | * Tutorial input files |
| + | * Setup an output directory |
| + | source /home/mktrost/seqshop/setup.txt |
| + | * You won't see any output after running <code>source</code> |
| + | ** It silently sets up your environment |
| + | <div class="mw-collapsible mw-collapsed" style="width:200px"> |
| + | View setup.txt |
| + | <div class="mw-collapsible-content"> |
| + | [[File:setup.png|500px]] |
| + | </div> |
| + | </div> |
| + | </div> |
| + | </div> |
| | | |
− | ls ${GC}/inputs/fastq/
| + | == Setup when running on your own outside of the SeqShop Workshop == |
− | There are 24 fastq files: combination of single-end & paired-end.
| + | ''This section is specifically for running on your own outside of the SeqShop Workshop.'' |
− | * Single-end: HG00641.chr7.CFTR.SRR069531.fastq | + | <div class="mw-collapsible" style="width:600px"> |
− | * Paired-end: HG00641.chr7.CFTR.SRR069531'''_1'''.fastq & HG00641.chr7.CFTR.SRR069531'''_2'''.fastq
| + | ''If you are running during the SeqShop Workshop, please skip this section.'' |
| + | <div class="mw-collapsible-content"> |
| + | === Download & Build GotCloud === |
| + | If you do not already have GotCloud: |
| + | * cd to where you want GotCloud installed (you can change this to any directory you want) |
| + | mkdir -p ~/seqshop |
| + | cd ~/seqshop/ |
| + | * download, decompress, and build the version of gotcloud that was tested with this tutorial: |
| + | wget https://github.com/statgen/gotcloud/archive/gotcloud.workshop.tar.gz |
| + | tar xvf gotcloud.workshop.tar.gz |
| + | mv gotcloud-gotcloud.workshop gotcloud |
| + | cd gotcloud/src |
| + | make |
| + | cd ../.. |
| | | |
− | Look at FASTQ:
| + | Remember the path to gotcloud/ that is what you will need to set your GC variable to. |
− | less -S ${GC}/inputs/fastq/HG00641.chr7.CFTR.SRR069531_1.fastq
| |
− | <code>less</code> is a Linux command that allows you to look at a file.
| |
− | *<code>-S</code> option prevents line wrap.
| |
− | * Use the arrow (up/down/left/right) keys to scroll through the file.
| |
− | * Use <code>zless</code> if the file is compressed.
| |
| | | |
− | Use <code>'q'</code> to exit out of <code>less</code>
| + | === Download the example data === |
− | q | + | Download and untar file containing the example data used in the practicals: |
− | [[File:FASTQ.png|500px]]
| + | wget http://csg.sph.umich.edu//mktrost/seqshopExample.tar.gz |
| + | tar xvf seqshopExample.tar.gz |
| | | |
− | Quality represented as ASCII code of (33 + Phred-scale-quality)
| + | You will see the names of all the files included in the example data scrolling on the screen as they are unpacked from the tar file. |
− | * Lower qualities : special characters, or digits
| |
− | ** ! (Q=0), “ (Q=1), # (Q=3), + (Q=10), / (Q=14)
| |
− | ** 0 (Q=15), 5 (Q=20), 9 (Q=24)
| |
− | * Higher qualities : alphabet
| |
− | ** : (Q=25), ? (Q=30), @ (Q=31)
| |
− | ** A (Q=32), B (Q=33), G (Q=38)
| |
| | | |
− | == GotCloud Alignment Pipeline==
| + | {{SeqShopRemoteEnv}} |
| | | |
− | [[File:AlignDiagram.png|500px]] | + | == Examining [[GotCloud]] Align Input Files == |
| + | === Examining Raw Sequence Reads : FASTQs === |
| + | FASTQ : standard file format provided to you by those who did the sequencing. |
| + | : For more information on the FASTQ format, see: http://en.wikipedia.org/wiki/FASTQ_format |
| | | |
− | === Why GotCloud?===
| + | For this tutorial, we will use FASTQs for 4 1000 Genome samples |
− | * Easy to learn & run
| + | * Subset of FASTQs - should map to chromosome 22 36000000-37000000 |
− | ** All-in-one package for sequence analysis pipeline
| |
− | ** You don’t have to know the details of individual component | |
− | * Robust parallelization
| |
− | ** Automatic partition of multi-sample jobs
| |
− | ** Reliable and fault-tolerant parallelization via GNU make
| |
− | *** Restart from where it stopped upon unexpected crash
| |
− | * Cloud & Cluster-friendly
| |
− | ** Supports multiple clusters such as MOSIX, Slurm, & SGE
| |
− | ** Amazon instances allow running large-scale jobs without having your own cluster
| |
| | | |
− | Sequence Processing Recommendations
| + | ls ${SS}/fastq/ |
− | * Be consistent within a project
| + | There are 24 fastq files: combination of single-end & paired-end. |
− | ** Process all samples with same pipeline
| |
− | *** Batch effect may arise if different pipelines are used due to pipeline differences
| |
− | ** Use the same configuration within a project
| |
| | | |
− | === Examining GotCloud Align Input Files === | + | ;Can you tell which files are single-end and which are paired-end? |
− | ==== Sequence Data Files : FASTQs ====
| + | <ul> |
− | We already looked at those in: [[#Examining Raw Sequence Reads|Examining Raw Sequence Reads]]
| + | <div class="mw-collapsible mw-collapsed" style="width:400px"> |
| + | <li>View answer:</li> |
| + | <div class="mw-collapsible-content"> |
| + | <ul> |
| + | <li> Paired-end files have a '''_1.fastq''' or '''_2.fastq''' extension</li> |
| + | <li> This convention isn't mandatory, but something similar is common</li> |
| + | [[File:Fastqsm.png|700px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| + | </ul> |
| | | |
− | ==== Reference Files ====
| |
− | Reference files can be downloaded with GotCloud or from other sources.
| |
− | * Partial reference for this example
| |
| | | |
− | ls ${GC}/reference/chr7 | + | Look at a couple of FASTQs: |
| + | less -S ${SS}/fastq/HG00551.SRR190851_1.fastq |
| + | <code>less</code> is a Linux command that allows you to look at a file. |
| + | *<code>-S</code> option prevents line wrap |
| + | * Use the arrow (up/down/left/right) keys to scroll through the file |
| + | * Use the <code>space bar</code> to jump down a page |
| + | Use <code>'q'</code> to exit out of <code>less</code> |
| + | q |
| | | |
− | [[File:RefDir.png|500px]] | + | ;Do you remember the parts of a FASTQ? |
| + | <ul> |
| + | <div class="mw-collapsible mw-collapsed" style="width:200px"> |
| + | <li>No, remind me:</li> |
| + | <div class="mw-collapsible-content"> |
| + | [[File:Fastq.png|500px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| | | |
| | | |
− | VCF files
| + | Look at the paired read: |
− | * List of chromosome/position
| + | less -S ${SS}/fastq/HG00551.SRR190851_2.fastq |
− | * Used for:
| |
− | ** dbsnp - recalibration skips known variants
| |
− | ** hapmap - used for sample contamination/sample swap validation
| |
− | ** variant filtering
| |
| | | |
| + | Remember, use <code>'q'</code> to exit out of <code>less</code> |
| + | q |
| | | |
− | Let's read the first 5 lines of the genome reference FASTA file (all reference bases for a chromosome):
| + | ;Do you notice something in common? |
− | head -n 5 ${GC}/reference/chr7/human.g1k.v37.chr7.fa
| |
− | [[File:Fasta3.png|600px]]
| |
| | | |
− | The start of the chromosome is all N's, so let's look at a later section (reading 5 lines starting at line 2000):
| + | <ul> |
− | tail -n+2000 ${GC}/reference/chr7/human.g1k.v37.chr7.fa |head -n 5
| + | <div class="mw-collapsible mw-collapsed" style="width:400px"> |
− | [[File:Fasta3 (copy).png|600px]] | + | <li>View answer:</li> |
| + | <div class="mw-collapsible-content"> |
| + | <ul> |
| + | <li> Paired-end reads have matching read names with a different extensions</li> |
| + | <li> This convention isn't mandatory, but something similar is common</li> |
| + | [[File:Fastq3.png|500px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| + | </ul> |
| | | |
− | Additional information on the FASTA format: http://en.wikipedia.org/wiki/FASTA_format
| + | === Reference Files === |
| + | Reference files can be downloaded with [[GotCloud]] or from other sources |
| + | * See [[GotCloud: Genetic Reference and Resource Files]] for more information on downloading/generating reference files |
| | | |
− | See [[GotCloud: Genetic Reference and Resource Files]] for more information on downloading/generating reference files.
| + | For alignment, you need: |
| + | # Reference genome FASTA file |
| + | #* Contains the reference base for each position of each chromosome |
| + | #* Additional information on the FASTA format: http://en.wikipedia.org/wiki/FASTA_format |
| + | # VCF (variant call format) files with chromosomes/positions |
| + | #* dbsnp - used to skip known variants when recalibrating |
| + | #* hapmap - used for sample contamination/sample swap validation |
| | | |
− | ==== GotCloud FASTQ Index File ====
| + | Take a look at the chromosome 22 reference files included for this tutorial: |
− | You need to tell GotCloud about each FASTQ file
| + | ls ${SS}/ref22 |
− | * Full path
| |
− | * Sample name
| |
− | ** Each sample can have multiple FASTQs
| |
− | ** Each FASTQ is for a single sample
| |
| | | |
− | The FASTQ index file is created by you to direct GotCloud to your FASTQ files, providing additional information for them.
| + | <ul> |
| + | <div class="mw-collapsible mw-collapsed" style="width:200px"> |
| + | <li>View Screenshot</li> |
| + | <div class="mw-collapsible-content"> |
| + | [[File:RefDir.png|700px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| | | |
− | * tab delimited
| + | Let's read the reference FASTA file (all reference bases for the chromosome): |
− | * columns may be in any order
| + | less ${SS}/ref22/human.g1k.v37.chr22.fa |
− | * starts with a header line
| |
− | * one line per single-end read
| |
− | * one line per paired-end read (only 1 line per pair).
| |
| | | |
| + | Remember, use <code>'q'</code> to exit out of <code>less</code> |
| + | q |
| | | |
− | '''Required Columns'''
| + | ; Where is the reference sequence? |
− | {|class="wikitable" cellpadding=5
| + | <ul> |
− | ! Column Name !! Description !! Recommended Value
| + | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
− | |-
| + | <li>Answer:</li> |
− | | MERGE_NAME ||
| + | <div class="mw-collapsible-content"> |
− | * Base name for the resulting BAM file for the sample
| + | <ul> |
− | * Used to group multiple fastqs or fastq pairs into a single BAM
| + | <li>The ends of a chromosome are 'N' - unknown bases</li> |
− | | Sample Name
| + | <li>Let's look at 5 lines of the file starting at line 300,000</li> |
− | |-
| + | tail -n+300000 ${SS}/ref22/human.g1k.v37.chr22.fa |head -n 5 |
− | | FASTQ1 ||
| + | [[File:Fasta.png|500px]] |
− | * Name of the fastq or the first in the pair if paired-end. (Only 1 line per pair)
| + | </div> |
− | | path/fastq1
| + | </div> |
− | |- | + | </ul> |
− | | FASTQ2 || | + | </ul> |
− | *Name of the 2nd fastq in paired-end reads.
| |
− | *Column is not required if all fastqs are single-end
| |
− | *'.' if the column is used, but this line is single-ended.
| |
− | | path/fastq2
| |
− | |}
| |
| | | |
| + | If you want to access the FASTA file by position, you can use <code>samtools faidx</code> command |
| + | ${GC}/bin/samtools faidx ${SS}/ref22/human.g1k.v37.chr22.fa 22:36000000 | less |
| + | or |
| + | ${GC}/bin/samtools faidx ${SS}/ref22/human.g1k.v37.chr22.fa 22:36000000-36000100 |
| | | |
− | The following columns are optional and used to populate the Read Group Information in the BAM file. | + | === GotCloud FASTQ Index File === |
− | * RGID field is required if using any of these fields, the others are optional. | + | The FASTQ index file is created by you to tell GotCloud about each of your FASTQ files: |
| + | * Where to find it |
| + | * Sample name |
| + | ** Each sample can have multiple FASTQs |
| + | ** Each FASTQ is for a single sample |
| + | * Run identifier |
| + | ** For recalibration we need to know which reads were in the same run. |
| | | |
− | What is a Read Group?
| + | FASTQ Index Format: |
− | * Groups reads together | + | * Tab delimited |
− | * Used for recalibration | + | * Starts with a header line |
− | ** Each sequencing run should get a different ReadGroup | + | * One line per single-end read |
− | * Typically a new name for each fastq pair/group | + | * One line per paired-end read (only 1 line per pair). |
| | | |
− | If you do not want the field for:
| + | Let's look a look at the index file I prepared for this tutorial: |
− | * any fastq, leave the column out of the header line
| + | less -S ${SS}/align.index |
− | * a single line, use a '.'
| |
| | | |
| + | Remember, use <code>'q'</code> to exit out of <code>less</code> |
| + | q |
| | | |
− | '''Optional Columns'''
| + | ; Which samples had multiple runs? |
− | {|class="wikitable" cellpadding=5
| + | <ul> |
− | |-
| + | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
− | ! Column Name !! Description !! Recommended Value
| + | <li>Need a reminder of the format?</li> |
− | |-
| + | <div class="mw-collapsible-content"> |
− | | RGID || Read Group ID || Run ID | + | [[File:fqindex.png|750px]] |
− | |-
| + | </div> |
− | | SAMPLE || Sample Name || Sample Name
| + | </div> |
− | |-
| + | <ul> |
− | | LIBRARY || Library
| + | <li>Note: in the screenshots, the fields are shifted into clear columns to make it easier to read</li> |
− | * separate FASTQs for a sample that were prepped separately
| + | <ul> |
− | | if you don't know or it is all the same, use Sample Name
| + | <li>When you view the file, the fields will not line up in neat columns and it can be hard to read</li> |
− | |-
| + | </ul> |
− | | CENTER || Center Name || Name of the sequencing center producing the FASTQ
| + | </ul> |
− | |-
| + | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
− | | PLATFORM || Platform || CAPILLARY, LS454, ILLUMINA, | + | <li>Hard to read the index? Need a hint?</li> |
− | SOLID, HELICOS, IONTORRENT, or PACBIO
| + | <div class="mw-collapsible-content"> |
− | |}
| + | <ul> |
| + | <li>Use cut to extract just the MERGE_NAME & RGID fields </li> |
| + | cut -f 1,4 ${SS}/align.index |
| + | </ul> |
| + | </div> |
| + | </div> |
| + | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
| + | <li>Answer:</li> |
| + | <div class="mw-collapsible-content"> |
| + | <ul> |
| + | <li>HG00553 & HG00640</li> |
| + | <li>They have multiple unique values in the RGID field</li> |
| + | [[File:fqindexRG.png|800px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| + | </ul> |
| | | |
− | Your sequencing core may provide to you a file with information to fill in these columns.
| |
| | | |
− | For our example, we have <code>sequence.index</code> which contains the information from 1000 Genomes for the FASTQs we are processing.
| + | How do you point GotCloud to your index file? |
− | less -S ${GC}/inputs/fastq/sequence.index
| + | * Command-line <code>--index_file</code> option |
− | | + | : or |
− | In this file, we want the SAMPLE_NAME, FASTQ_FILE, RUN_ID, LIBRARY_NAME, CENTER_NAME, INSTRUMENT_PLATFORM (columns 10, 1, 15, 6, 13).
| + | * Configuration file <code>INDEX_FILE</code> setting. |
− | * You can use perl/awk/linux to extract these fields & format as necessary.
| |
− | * I prepared a perl script that you can use: | |
− | perl ${GC}/scripts/genIndex.pl > ${SETUP}/align.index
| |
− | | |
− | Let's look at the index file:
| |
− | less -S ${SETUP}/align.index
| |
− | [[File:Align index.png|1000px]]
| |
− | | |
− | The command-line <code>--fastq</code> option or the configuration file <code>FASTQ_PREFIX</code> setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths.
| |
− | | |
− | This file is specified either via the command-line <code>--index_file</code> parameter or via the configuration file <code>INDEX_FILE</code> setting.
| |
| | | |
| The command-line setting takes precedence over the configuration file setting. | | The command-line setting takes precedence over the configuration file setting. |
| | | |
− | ==== GotCloud Configuration File ====
| + | === GotCloud Configuration File === |
| This file is created by you to configure GotCloud for your data. | | This file is created by you to configure GotCloud for your data. |
| | | |
− | * Default values are provided in ${GC}/gotcloud/bin/gotcloudDefaults.conf | + | * Default values are provided in ${GC}/bin/gotcloudDefaults.conf |
| ** Most values should be left as the defaults | | ** Most values should be left as the defaults |
| * Specify values in your configuration file as: | | * Specify values in your configuration file as: |
− | KEY = value
| + | ** <code>KEY = value</code> |
| * Use $(KEY) to refer to another key's value | | * Use $(KEY) to refer to another key's value |
| * If a KEY is specified twice, the later value is used | | * If a KEY is specified twice, the later value is used |
| * Does not have access to environment variables | | * Does not have access to environment variables |
| * '#' indicates a comment | | * '#' indicates a comment |
− | * Keys to override:
| |
− | {|class="wikitable" cellpadding=5
| |
− | |-
| |
− | ! Key Name !! Description
| |
− | |-
| |
− | | colspan=2 style="text-align:center"| Index File Settings - pointing GotCloud to your data
| |
− | |-
| |
− | | INDEX_FILE || Path to the FASTQ index file that you created
| |
− | * Alternatively, this can be specified on the command-line as <code>--index_file</code>
| |
− | |-
| |
− | | FASTQ_PREFIX || Prefix to be added to the FASTQ files in INDEX_FILE
| |
− | * Not required
| |
− | |-
| |
− | | BAM_INDEX || Path to the BAM index file
| |
− | * to be created by alignment
| |
− | * to be used for snp calling
| |
− | |-
| |
− | | colspan=2 style="text-align:center"| Reference File Settings - telling GotCloud where to find your reference files
| |
− | |-
| |
− | | REF_DIR || Path to your reference files
| |
− | * You don't have to use this, you can specify the full path for each file
| |
− | |-
| |
− | | REF || Path/filename of the FASTA reference file
| |
− | * If different than default: $(REF_DIR)/human.g1k.v37.fa
| |
− | |-
| |
− | | DBSNP_VCF || Path/filename of the DBSNP file
| |
− | * If different than default: $(REF_DIR)/dbsnp_135.b37.vcf.gz
| |
− | |-
| |
− | | HM3_VCF || Path/filename of the HapMap3 file
| |
− | * If different than default: $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz
| |
− | |-
| |
− | | OMNI_VCF || Path/filename of the OMNI file
| |
− | * If different than default: $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz
| |
− | |-
| |
− | | INDEL_PREFIX || Path/filename base of the indels file
| |
− | * If different than default: $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19
| |
− | |}
| |
| | | |
| Let's look at the configuration file I created for this test: | | Let's look at the configuration file I created for this test: |
− | more ${GC}/inputs/gotcloud.conf | + | more ${SS}/gotcloud.conf |
| + | |
| + | Use the <code>space bar</code> to advance if the whole file isn't displayed. |
| + | |
| + | ; If your references are in a different path than what is specified, what would you change? |
| + | <ul> |
| + | <div class="mw-collapsible mw-collapsed" style="width:300px"> |
| + | <li>Answer:</li> |
| + | <div class="mw-collapsible-content"> |
| + | <ul> |
| + | <li>You would change <code>REF_DIR</code> to the new path</li> |
| + | [[File:gcConf.png|800px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| + | </ul> |
| | | |
− | It already points to your align file.
| + | == Run [[GotCloud]] Align == |
| | | |
− | === Run GotCloud Align ===
| + | [[File:AlignDiagram.png|500px]] |
− | Now that we have all of our input files, we need just a simple command to run | + | |
− | ${GC}/gotcloud/gotcloud align --conf ${GC}/inputs/gotcloud.conf --numcs 2 | + | Now that we have all of our input files, we need just a simple command to run them |
| + | ${GC}/gotcloud align --conf ${SS}/gotcloud.conf --numcs 2 --base_prefix ${SS} --outdir ${OUT} |
| + | |
| + | * <code>${GC}/gotcloud</code> runs GotCloud |
| + | * <code>align</code> tells GotCloud you want to run the alignment pipeline. |
| + | * <code>--conf</code> tells GotCloud the name of the configuration file to use. |
| + | ** The configuration for this test was downloaded with the seqshop input files. |
| * <code>--numcs</code> means to run 2 samples at a time. | | * <code>--numcs</code> means to run 2 samples at a time. |
− | ** Depends on your system | + | ** How many you can run concurrently depends on your system. |
| + | * <code>--base_prefix</code> tells GotCloud the prefix to append to relative paths. |
| + | ** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS} |
| + | ** Alternatively, gotcloud.conf could be updated to specify the full paths |
| + | * <code>--out_dir</code> tells GotCloud where to write the output. |
| + | ** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line |
| + | |
| [[File:gcalignStart.png|850px]] | | [[File:gcalignStart.png|850px]] |
| | | |
− | This should take < 4 minutes to run. | + | This should take 1-3 minutes to run. |
| | | |
| It should end with a line like: <code>Processing finished in 133 secs with no errors reported</code> | | It should end with a line like: <code>Processing finished in 133 secs with no errors reported</code> |
Line 275: |
Line 306: |
| If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off. | | If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off. |
| | | |
− | === Examining GotCloud Align Output ===
| + | Inside GotCloud align, not only sequence alignment but also pre-processing of sequence data, including deduplication and base quality recalibration are performed along with quality assessment, as illustrated below. |
| + | |
| + | [[File:Gotcloud_align_detail.png|500px]] |
| + | |
| + | == Examining GotCloud Align Output == |
| | | |
| Let's look at the output directory: | | Let's look at the output directory: |
− | ls ${OUTPUT} | + | ls ${OUT} |
| [[File:gcalignOutM.png|600px]] | | [[File:gcalignOutM.png|600px]] |
| | | |
− | Let's look at the BAMs (aligned reads that are ready for variant calling):
| + | === Quality Control Files === |
− | ls ${OUTPUT}/bams
| |
− | [[File:GcalignOutBAMm.png|600px]]
| |
− | | |
− | BAM Files:
| |
− | * Binary Sequence Alignment/Map (SAM) Format
| |
− | * Maps reads to Chromosome/Position
| |
− | * For a detailed explanation of the SAM/BAM format, see:
| |
− | ** SAM/BAM Spec: http://samtools.github.io/hts-specs/SAMv1.pdf
| |
− | ** Additional information I put together as I started working with SAM/BAM: [[SAM]]
| |
− | * Consists of:
| |
− | ** Header
| |
− | *** Starts with '@'
| |
− | *** Records - one for each sequence read
| |
− | Let's examine a BAM file:
| |
− | samtools view -h ${OUTPUT}/bams/
| |
− | [[File:BAM.png|750px]]
| |
− | | |
| Let's take a look at our quality control output directory: | | Let's take a look at our quality control output directory: |
− | ls ${OUTPUT}/QCFiles | + | ls ${OUT}/QCFiles |
| [[File:GcalignOutQCm.png|600px]] | | [[File:GcalignOutQCm.png|600px]] |
| | | |
| + | ==== Sample Contamination/Swap ==== |
| Check for sample contamination: | | Check for sample contamination: |
| * *.selfSM : Main output file containing the contamination estimate. | | * *.selfSM : Main output file containing the contamination estimate. |
− | ** If you are only interested in checking sample contamination:
| + | ** Check the 'FREEMIX' column for genotype-free estimate of contamination |
− | *** Check the 'FREEMIX' column for genotype-free estimate of contamination
| |
| **** 0-1 scale, the lower, the better | | **** 0-1 scale, the lower, the better |
− | **** See [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information | + | **** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination |
− | *** Check the 'CHIPMIX' column for contamination estimates with external genotypes (if provided)
| + | ** See [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information |
− | * *.selfRG : Same output to .*selfSM, but separated by readGroup (which might be helpful for library-level examination)
| + | less -S ${OUT}/QCFiles/HG00551.genoCheck.selfSM |
− | * *.depthSM : depth distribution of reads covering the marker position of the input VCF, across all readGroups.
| + | |
− | * *.depthRG : depth distribution of reads covering the marker position of the input VCF, per readGroups.
| + | Remember, use <code>'q'</code> to exit out of <code>less</code> |
− | less -S ${OUTPUT}/QCFiles/HG00551.genoCheck.selfSM | + | q |
| + | |
| + | ; Is there evidence of sample contamination? |
| + | <ul> |
| + | <div class="mw-collapsible mw-collapsed" style="width:200px"> |
| + | <li>Answer:</li> |
| + | <div class="mw-collapsible-content"> |
| + | <ul> |
| + | <li>No, FREEMIX = 0.00000 (<0.03)</li> |
| + | </ul> |
| [[File:Contam1.png|700px]] | | [[File:Contam1.png|700px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| + | |
| + | ==== QC Metrics ==== |
| + | See: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results. |
| | | |
− | Next, let's look at some quality control metrics:
| + | Let's look at some quality control metrics: |
− | cat ${OUTPUT}/QCFiles/HG00551.qplot.stats | + | cat ${OUT}/QCFiles/HG00551.qplot.stats |
− | * 99.16% mapping rate
| |
− | * 94.01% high quality bases
| |
− | * 7x coverage
| |
− | * 31.3% A, 31.3% T
| |
− | * 18.7% C, 18.7% G
| |
| | | |
| + | ; What is the mapping rate & average coverage for HG00551? |
| + | <ul> |
| + | <div class="mw-collapsible mw-collapsed" style="width:200px"> |
| + | <li>Answer</li> |
| + | <div class="mw-collapsible-content"> |
| + | <ul> |
| + | <li> 98.93% Mapped</li> |
| + | <li>7.43 MeanDepth</li> |
| + | </ul> |
| + | [[File:qplots.png|200px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| | | |
− | Generate the pdf's of our quality metrics: | + | Generate a pdf of quality metrics: |
− | Rscript ${OUTPUT}/QCFiles/HG00551.qplot.R | + | Rscript ${OUT}/QCFiles/HG00551.qplot.R |
− | Rscript ${OUTPUT}/QCFiles/HG00553.qplot.R
| |
− | Rscript ${OUTPUT}/QCFiles/HG00640.qplot.R
| |
− | Rscript ${OUTPUT}/QCFiles/HG00641.qplot.R
| |
| | | |
| Examine the PDF: | | Examine the PDF: |
− | evince ${OUTPUT}/QCFiles/HG00551.qplot.pdf& | + | evince ${OUT}/QCFiles/HG00551.qplot.pdf& |
− | The first plot: Empirical vs reported Phred score does not look as good as we would like.
| |
− | * This is due to the small region used for recalibration
| |
− | Look at the PDF I produced when I ran the whole genome:
| |
− | evince ${GC}/example/HG00551.wg.qplot.pdf&
| |
| | | |
− | See: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results.
| + | It is ok if you see a warning message when opening evince. It should still open. If not, let me know. To close evince, just close the pdf window. |
| + | |
| + | ;Does the Empirical vs reported Phred score look as good as we would like? |
| + | <ul> |
| + | <div class="mw-collapsible mw-collapsed" style="width:400px"> |
| + | <li>Answer</li> |
| + | <div class="mw-collapsible-content"> |
| + | <ul> |
| + | <li> No, it is well above the line</li> |
| + | <li> This is due to the small region used for recalibration</li> |
| + | [[File:Qplotpdf.png|400px]] |
| + | <li> Look at the PDF I produced when I ran the whole genome:</li> |
| + | evince ${SS}/ext/HG00551.wg.qplot.pdf& |
| + | </ul> |
| + | [[File:Qplotpdfwg.png|400px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| + | |
| + | === BAM Files === |
| + | Binary Sequence Alignment/Map (SAM) Format |
| + | * Maps reads to Chromosome/Position |
| + | * For a detailed explanation of the SAM/BAM format, see: |
| + | ** SAM/BAM Spec: http://samtools.github.io/hts-specs/SAMv1.pdf |
| + | ** Additional information I put together as I started working with SAM/BAM: [[SAM]] |
| + | |
| + | Let's look at the BAMs (aligned reads that are ready for variant calling): |
| + | ls ${OUT}/bams |
| + | [[File:GcalignOutBAMm.png|600px]] |
| + | |
| + | Let's examine at the first 5 lines of the BAM file using [http://samtools.sourceforge.net/samtools.shtml#3 samtools view]: |
| + | ${GC}/bin/samtools view -h ${OUT}/bams/HG00551.recal.bam|head -n 5 |
| + | |
| + | ; What are the chromosome and position of the first record in the BAM file? |
| + | <ul> |
| + | <div class="mw-collapsible mw-collapsed" style="width:300px"> |
| + | <li>Need a reminder of the format?</li> |
| + | <div class="mw-collapsible-content"> |
| + | [[File:Bam.png|750px]] |
| + | </div> |
| + | </div> |
| + | <div class="mw-collapsible mw-collapsed" style="width:300px"> |
| + | <li>Answer</li> |
| + | <div class="mw-collapsible-content"> |
| + | <ul> |
| + | <li>Chr 22, Pos: 16114122</li> |
| + | </ul> |
| + | [[File:BamRec.png|650px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| + | |
| + | ==== Accessing BAMs by Position ==== |
| + | BAM's are so big, what if we want to see a position part way through the file? |
| + | *[http://samtools.sourceforge.net/samtools.shtml#3 samtools] has an option for that. |
| + | |
| + | Add a region to the view command we used above. Let's find all reads that overlap positions 36907000-36907005: |
| + | ${GC}/bin/samtools view -h ${OUT}/bams/HG00551.recal.bam 22:36907000-36907005 |
| + | * Just a few reads. |
| + | |
| + | Let's visualize what reads in that area look like using samtools tview: |
| + | ${GC}/bin/samtools tview ${OUT}/bams/HG00551.recal.bam ${SS}/ref22/human.g1k.v37.chr22.fa |
| + | * Type ‘g’ |
| + | ** Type 22:36907000 |
| + | * Type ‘n’ to color by nucleotide |
| + | * Use the arrow keys to move around and look at the area. |
| + | |
| + | Understanding the syntax: |
| + | * '.' : match to the reference on the forward strand |
| + | * ',' : match to the reference on the reverse strand |
| + | * ACGTN : mismatch to reference on the forward strand |
| + | * acgtn : mismatch to reference on the reverse strand |
| + | |
| + | ; Do you see anything interesting? |
| + | <ul> |
| + | <div class="mw-collapsible mw-collapsed" style="width:500px"> |
| + | <li>Screenshot</li> |
| + | <div class="mw-collapsible-content"> |
| + | <ul> |
| + | <li>We will have to remember this region when we run snpcall to see what it says.</li> |
| + | </ul> |
| + | [[File:tview.png|750px]] |
| + | </div> |
| + | </div> |
| + | </ul> |
| + | |
| + | Other tview commands: |
| + | * Type '?' for a help screen |
| + | * Type 'q' to quit tview |
| + | |
| + | Feel free to play around more and browse the BAM files. |
| + | |
| + | ==== Other tools for BAMs ==== |
| + | We have developed a lot of tools that operate on BAM files. |
| + | |
| + | See [[Software#BAM_Util_Tools|Software: BamUtil Tools]] for a list |
| + | * Many operations: |
| + | ** diff : diff 2 BAM files |
| + | ** stats: per positions statistics |
| + | ** bam2Fastq : convert a BAM back to a FASTQ (how I created the fastqs for this tutorial) |
| + | ** Lots of others |
| + | * Feel free to try some out |
| + | * If you have any questions, let me know, I wrote most of them and am happy to help. |
| + | |
| + | == Logging Off == |
| + | |
| + | ''This section is specifically for the SeqShop Workshop computers.'' |
| + | <div class="mw-collapsible mw-collapsed" style="width:600px"> |
| + | ''If you are not running during the SeqShop Workshop, please skip this section.'' |
| + | <div class="mw-collapsible-content"> |
| + | To logout of seqshop-server, type: |
| + | exit |
| + | And close the windows. |
| + | |
| + | When done, log out of the Windows machine. |
| + | </div> |
| + | </div> |