Changes

4,093 bytes added , 11:07, 2 February 2017

→‎Download the example data

Line 1: Line 1: +

'''Note:''' the latest version of this practical is available at: [[SeqShop: Sequence Mapping and Assembly Practical]]

+

* The ones here is the original one from the June workshop (updated to be run from elsewhere)

+

== Introduction ==

+

See the [[Media:SeqShop - GotCloud Align.pdf|introductory slides]] for an intro to this tutorial.

+

== Goals of This Session ==

* What we want to learn

Line 6: Line 12:

** How to visualize sequence data to examine the reads aligned to particular genomic positions

+

== Setup in person at the SeqShop Workshop ==

+

''This section is specifically for the SeqShop Workshop computers.''

+

+

''If you are not running during the SeqShop Workshop, please skip this section.''

+

−

~~== Step 0: Login to the machine & setup environment==~~

−

~~<ol>~~

+

−

~~<li> Login to the windows machine~~

−

* The username/password for the Windows machine should be written on it</li>

−

~~<li> Open putty~~

−

* Start->.....</li>

−

~~<li> In putty, login to seqshop-server.sph.umich.edu~~

−

*

−

* Server name: seqshop-server.sph.umich.edu

−

* Enter your provided username & password</li>

−

~~<li> To simplify commands/typing, we will setup a couple of environment variables~~

−

~~<ol>~~

−

~~<li>to point to the GotCloud directory</li>~~

−

~~export GC=/home/mktrost/seqshop/~~

−

~~<li> to point to your test setup</li>~~

−

~~export SETUP=~/setup~~

−

~~<li> to point to your output directory</li>~~

−

~~export OUTPUT=~/out~~

−

~~</ol>~~

−

~~<li> Create your directories:</li>~~

−

~~mkdir $~~{~~SETUP~~}

−

~~mkdir ${OUTPUT~~}

−

~~</ol>~~

−

== ~~Examining Raw Sequence Reads~~ ==

+

=== Setup your run environment===

−

~~FASTQ : standard file format provided to you by those who did the sequencing.~~

−

~~: For more information on the FASTQ format, see: http://en.wikipedia.org/wiki/FASTQ_format~~

−

~~For this tutorial, we~~ will ~~use FASTQs for 6 1000 Genome samples~~

+

This will setup some environment variables to point you to

−

* ~~Subset of FASTQs~~ - ~~should map around CFTR on chr7~~

+

* [[GotCloud]] program

+

* Tutorial input files

+

* Setup an output directory

+

source /home/mktrost/seqshop/setup.txt

+

* You won't see any output after running <code>source</code>

+

** It silently sets up your environment

+

+

View setup.txt

+

+

[[File:setup.png|500px]]

+

</div>

+

</div>

+

</div>

+

</div>

−

~~ls ${GC}/inputs/fastq/~~

+

== Setup when running on your own outside of the SeqShop Workshop ==

−

~~There~~ are ~~24 fastq files: combination of single~~-~~end~~ & ~~paired~~-~~end.~~

+

''This section is specifically for running on your own outside of the SeqShop Workshop.''

−

* ~~Single-end~~: ~~HG00641~~.~~chr7~~.~~CFTR~~.~~SRR069531~~.~~fastq~~

+

−

* Paired-end: ~~HG00641~~.~~chr7~~.~~CFTR~~.~~SRR069531'''_1'''~~.~~fastq & HG00641~~.~~chr7~~.~~CFTR~~.~~SRR069531'''_2'''~~.~~fastq~~

+

''If you are running during the SeqShop Workshop, please skip this section.''

+

+

=== Download & Build GotCloud ===

+

If you do not already have GotCloud:

+

* cd to where you want GotCloud installed (you can change this to any directory you want)

+

mkdir -p ~/seqshop

+

cd ~/seqshop/

+

* download, decompress, and build the version of gotcloud that was tested with this tutorial:

+

wget https://github.com/statgen/gotcloud/archive/gotcloud.workshop.tar.gz

+

tar xvf gotcloud.workshop.tar.gz

+

mv gotcloud-gotcloud.workshop gotcloud

+

cd gotcloud/src

+

make

+

cd ../..

−

~~Look at FASTQ:~~

+

Remember the path to gotcloud/ that is what you will need to set your GC variable to.

−

~~less -S ${GC}~~/~~inputs/fastq/HG00641.chr7.CFTR.SRR069531_1.fastq~~

−

~~<code>less</code>~~ is ~~a Linux command that allows~~ you to ~~look at a file.~~

−

*<code>-S</code> option prevents line wrap.

−

* Use the arrow (up/down/left/right) keys to ~~scroll through the file.~~

−

* Use <code>zless</code> if the file is compressed.

−

~~Use <code>'q'<~~/~~code> to exit out of <code>less<~~/~~code>~~

+

=== Download the example data ===

−

q

+

Download and untar file containing the example data used in the practicals:

−

~~[[File:FASTQ~~.~~png|500px]]~~

+

wget http://csg.sph.umich.edu//mktrost/seqshopExample.tar.gz

+

tar xvf seqshopExample.tar.gz

−

~~Quality represented~~ as ~~ASCII code of (33 + Phred-scale-quality)~~

+

You will see the names of all the files included in the example data scrolling on the screen as they are unpacked from the tar file.

−

* Lower qualities : special characters, or digits

−

** ! (Q=0), “ (Q=1), # (Q=3), + (Q=10), / (Q=14)

−

**  0 (Q=15), 5 (Q=20), 9 (Q=24)

−

* Higher qualities : alphabet

−

** : (Q=25), ? (Q=30), @ (Q=31)

−

** A (Q=32), B (Q=33), G (Q=38)

−

~~== GotCloud Alignment Pipeline==~~

+

−

[[~~File~~:~~AlignDiagram~~.~~png|500px]]~~

+

== Examining [[GotCloud]] Align Input Files ==

+

=== Examining Raw Sequence Reads : FASTQs ===

+

FASTQ : standard file format provided to you by those who did the sequencing.

+

: For more information on the FASTQ format, see: http://en.wikipedia.org/wiki/FASTQ_format

−

~~=== Why GotCloud?===~~

+

For this tutorial, we will use FASTQs for 4 1000 Genome samples

−

* Easy to learn & run

+

* Subset of FASTQs - should map to chromosome 22 36000000-37000000

−

** All-in-one package for ~~sequence analysis pipeline~~

−

** You don’t have to know the details of ~~individual component~~

−

* Robust parallelization

−

** Automatic partition of multi-~~sample jobs~~

−

** Reliable and fault-tolerant parallelization via GNU make

−

*** Restart from where it stopped upon unexpected crash

−

* Cloud & Cluster-friendly

−

** Supports multiple clusters such as MOSIX, Slurm, & SGE

−

** Amazon instances allow running large-~~scale jobs without having your own cluster~~

−

~~Sequence Processing Recommendations~~

+

ls ${SS}/fastq/

−

* Be consistent within a project

+

There are 24 fastq files: combination of single-end & paired-end.

−

** Process all samples with same pipeline

−

*** Batch effect may arise if different pipelines are ~~used due to pipeline differences~~

−

** Use the same configuration within a project

−

==~~= Examining GotCloud Align Input Files ===~~

+

;Can you tell which files are single-end and which are paired-end?

−

~~==== Sequence Data Files~~ : ~~FASTQs ===~~=

+

<ul>

−

~~We already looked at those in:~~ [[~~#Examining Raw Sequence Reads~~|~~Examining Raw Sequence Reads~~]]

+

+

<li>View answer:</li>

+

+

<ul>

+

<li> Paired-end files have a '''_1.fastq''' or '''_2.fastq''' extension</li>

+

<li> This convention isn't mandatory, but something similar is common</li>

+

[[File:Fastqsm.png|700px]]

+

</div>

+

</div>

+

</ul>

+

</ul>

−

~~==== Reference Files ====~~

−

~~Reference files can be downloaded with GotCloud or from other sources.~~

−

* Partial reference for this example

−

ls ${GC}/~~reference~~/~~chr7~~

+

Look at a couple of FASTQs:

+

less -S ${SS}/fastq/HG00551.SRR190851_1.fastq

+

<code>less</code> is a Linux command that allows you to look at a file.

+

*<code>-S</code> option prevents line wrap

+

* Use the arrow (up/down/left/right) keys to scroll through the file

+

* Use the <code>space bar</code> to jump down a page

+

Use <code>'q'</code> to exit out of <code>less</code>

+

q

−

[[File:~~RefDir~~.png|500px]]

+

;Do you remember the parts of a FASTQ?

+

<ul>

+

+

<li>No, remind me:</li>

+

+

[[File:Fastq.png|500px]]

+

</div>

+

</div>

+

</ul>

−

~~VCF files~~

+

Look at the paired read:

−

* List of chromosome/position

+

less -S ${SS}/fastq/HG00551.SRR190851_2.fastq

−

* Used for:

−

** dbsnp - ~~recalibration skips known variants~~

−

** hapmap - used for sample contamination/~~sample swap validation~~

−

** variant filtering

+

Remember, use <code>'q'</code> to exit out of <code>less</code>

+

q

−

~~Let's read the first 5 lines of the genome reference FASTA file (all reference bases for a chromosome):~~

+

;Do you notice something in common?

−

~~head -n 5 ${GC}/reference/chr7/human.g1k.v37.chr7.fa~~

−

~~[[File:Fasta3.png|600px]]~~

−

~~The start of the chromosome is all N's, so let's look at a later section (reading 5 lines starting at line 2000)~~:

+

<ul>

−

~~tail~~ -~~n+2000 ${GC}~~/~~reference~~/~~chr7/human.g1k.v37.chr7.fa |head -n 5~~

+

−

[[File:~~Fasta3 (copy)~~.png|~~600px~~]]

+

<li>View answer:</li>

+

+

<ul>

+

<li> Paired-end reads have matching read names with a different extensions</li>

+

<li> This convention isn't mandatory, but something similar is common</li>

+

[[File:Fastq3.png|500px]]

+

</div>

+

</div>

+

</ul>

+

</ul>

−

~~Additional~~ information on ~~the FASTA format: http://en.wikipedia.org/wiki~~/~~FASTA_format~~

+

=== Reference Files ===

+

Reference files can be downloaded with [[GotCloud]] or from other sources

+

* See [[GotCloud: Genetic Reference and Resource Files]] for more information on downloading/generating reference files

−

~~See [[GotCloud~~: ~~Genetic~~ Reference ~~and Resource Files]]~~ for ~~more~~ information on ~~downloading~~/~~generating reference~~ files.

+

For alignment, you need:

+

# Reference genome FASTA file

+

#* Contains the reference base for each position of each chromosome

+

#* Additional information on the FASTA format: http://en.wikipedia.org/wiki/FASTA_format

+

# VCF (variant call format) files with chromosomes/positions

+

#* dbsnp - used to skip known variants when recalibrating

+

#* hapmap - used for sample contamination/sample swap validation

−

~~==== GotCloud FASTQ Index File ====~~

+

Take a look at the chromosome 22 reference files included for this tutorial:

−

~~You need to tell GotCloud about each FASTQ file~~

+

ls ${SS}/ref22

−

* Full path

−

* Sample name

−

** Each sample can have multiple FASTQs

−

** Each FASTQ is for a single sample

−

~~The FASTQ index file is created by you to direct GotCloud to your FASTQ files, providing additional information for them~~.

+

<ul>

+

+

<li>View Screenshot</li>

+

+

[[File:RefDir.png|700px]]

+

</div>

+

</div>

+

</ul>

−

* tab delimited

+

Let's read the reference FASTA file (all reference bases for the chromosome):

−

* columns may be in any order

+

less ${SS}/ref22/human.g1k.v37.chr22.fa

−

* starts with a header line

−

* one line per single-end read

−

* one line per paired-end read (~~only 1 line per pair~~).

+

Remember, use <code>'q'</code> to exit out of <code>less</code>

+

q

−

~~'''Required Columns'''~~

+

; Where is the reference sequence?

−

{|class="~~wikitable~~" ~~cellpadding~~=5

+

<ul>

−

~~! Column Name !! Description !! Recommended Value~~

+

−

|-

+

<li>Answer:</li>

−

~~| MERGE_NAME ||~~

+

−

* Base name for the resulting BAM file for the sample

+

<ul>

−

* Used to group multiple fastqs or fastq pairs into a ~~single BAM~~

+

<li>The ends of a chromosome are 'N' - unknown bases</li>

−

~~| Sample Name~~

+

<li>Let's look at 5 lines of the file starting at line 300,000</li>

−

|-

+

tail -n+300000 ${SS}/ref22/human.g1k.v37.chr22.fa |head -n 5

−

~~| FASTQ1 ||~~

+

[[File:Fasta.png|500px]]

−

* Name of the ~~fastq or the first in the pair if paired-end. (Only 1~~ line ~~per pair)~~

+

</div>

−

~~| path~~/~~fastq1~~

+

</div>

−

|-

+

</ul>

−

| ~~FASTQ2 ||~~

+

</ul>

−

*Name of the 2nd fastq in paired-end reads.

−

*Column is not required if all fastqs are single-end

−

*'.' if the column is used, but this line is single-ended.

−

~~| path~~/~~fastq2~~

−

|}

+

If you want to access the FASTA file by position, you can use <code>samtools faidx</code> command

+

${GC}/bin/samtools faidx ${SS}/ref22/human.g1k.v37.chr22.fa 22:36000000 | less

+

or

+

${GC}/bin/samtools faidx ${SS}/ref22/human.g1k.v37.chr22.fa 22:36000000-36000100

−

The ~~following columns are optional and used~~ to ~~populate the Read Group Information in the BAM file.~~

+

=== GotCloud FASTQ Index File ===

−

* ~~RGID field~~ is ~~required if using any of these fields,~~ the ~~others are optional~~.

+

The FASTQ index file is created by you to tell GotCloud about each of your FASTQ files:

+

* Where to find it

+

* Sample name

+

** Each sample can have multiple FASTQs

+

** Each FASTQ is for a single sample

+

* Run identifier

+

** For recalibration we need to know which reads were in the same run.

−

~~What is a Read Group?~~

+

FASTQ Index Format:

−

* ~~Groups reads together~~

+

* Tab delimited

−

* ~~Used for recalibration~~

+

* Starts with a header line

−

** Each sequencing run should get a different ReadGroup

+

* One line per single-end read

−

* ~~Typically a new name for each fastq~~ pair~~/group~~

+

* One line per paired-end read (only 1 line per pair).

−

~~If you do not want~~ the ~~field~~ for:

+

Let's look a look at the index file I prepared for this tutorial:

−

* any fastq, leave the column out of the header line

+

less -S ${SS}/align.index

−

* a single line, use a '.'

+

Remember, use <code>'q'</code> to exit out of <code>less</code>

+

q

−

~~'''Optional Columns'''~~

+

; Which samples had multiple runs?

−

{|class="~~wikitable~~" ~~cellpadding~~=5

+

<ul>

−

|-

+

−

~~! Column Name !! Description !! Recommended Value~~

+

<li>Need a reminder of the format?</li>

−

|-

+

−

| ~~RGID || Read Group ID || Run ID~~

+

[[File:fqindex.png|750px]]

−

|-

+

</div>

−

~~| SAMPLE || Sample Name || Sample Name~~

+

</div>

−

|-

+

<ul>

−

~~| LIBRARY || Library~~

+

<li>Note: in the screenshots, the fields are shifted into clear columns to make it easier to read</li>

−

* separate FASTQs for a sample that were prepped separately

+

<ul>

−

~~| if~~ you ~~don't know or it is all~~ the ~~same~~, ~~use Sample Name~~

+

<li>When you view the file, the fields will not line up in neat columns and it can be hard to read</li>

−

|-

+

</ul>

−

~~| CENTER || Center Name || Name of~~ the ~~sequencing center producing~~ the ~~FASTQ~~

+

</ul>

−

|-

+

−

| ~~PLATFORM || Platform || CAPILLARY, LS454, ILLUMINA,~~

+

<li>Hard to read the index? Need a hint?</li>

−

~~SOLID, HELICOS, IONTORRENT, or PACBIO~~

+

−

|}

+

<ul>

+

<li>Use cut to extract just the MERGE_NAME & RGID fields </li>

+

cut -f 1,4 ${SS}/align.index

+

</ul>

+

</div>

+

</div>

+

+

<li>Answer:</li>

+

+

<ul>

+

+

<li>They have multiple unique values in the RGID field</li>

+

[[File:fqindexRG.png|800px]]

+

</div>

+

</div>

+

</ul>

+

</ul>

−

~~Your sequencing core may provide to you a file with information to fill in these columns.~~

−

~~For our example, we have <code>sequence.~~index~~</code> which contains the information from 1000 Genomes for the FASTQs we are processing.~~

+

How do you point GotCloud to your index file?

−

~~less -S ${GC}/inputs/fastq/sequence.index~~

+

* Command-line <code>--index_file</code> option

−

+

: or

−

~~In this~~ file~~, we want the SAMPLE_NAME, FASTQ_FILE, RUN_ID, LIBRARY_NAME, CENTER_NAME, INSTRUMENT_PLATFORM (columns 10, 1, 15, 6, 13).~~

+

* Configuration file <code>INDEX_FILE</code> setting.

−

* You can use perl/awk/linux to extract these fields & format as necessary.

−

* ~~I prepared a perl script that you can use:~~

−

~~perl ${GC}/scripts/genIndex.pl > ${SETUP}/align.index~~

−

~~Let's look at the index file:~~

−

~~less -S ${SETUP}/align.index~~

−

~~[[File:Align index.png|1000px]]~~

−

~~The command~~-line <code>--~~fastq~~</code> option or ~~the configuration file <code>FASTQ_PREFIX</code> setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths.~~

−

~~This file is specified either via the command-line <code>--index_file</code> parameter or via the configuration~~ file <code>INDEX_FILE</code> setting.

The command-line setting takes precedence over the configuration file setting.

−

==== GotCloud Configuration File ====

+

=== GotCloud Configuration File ===

This file is created by you to configure GotCloud for your data.

−

* Default values are provided in ${GC}~~/gotcloud~~/bin/gotcloudDefaults.conf

+

* Default values are provided in ${GC}/bin/gotcloudDefaults.conf

** Most values should be left as the defaults

* Specify values in your configuration file as:

−

KEY = value

+

** <code>KEY = value</code>

* Use $(KEY) to refer to another key's value

* If a KEY is specified twice, the later value is used

* Does not have access to environment variables

* '#' indicates a comment

−

* Keys to override:

−

~~{|class="wikitable" cellpadding=5~~

−

|-

−

~~! Key Name !! Description~~

−

|-

−

~~| colspan=2 style="text-align:center"| Index File Settings - pointing GotCloud to your data~~

−

|-

−

~~| INDEX_FILE || Path to the FASTQ index file that you created~~

−

* Alternatively, this can be specified on the command-line as <code>--index_file</code>

−

|-

−

~~| FASTQ_PREFIX || Prefix to be added to the FASTQ files in INDEX_FILE~~

−

* Not required

−

|-

−

~~| BAM_INDEX || Path to the BAM index file~~

−

* to be created by alignment

−

* to be used for snp calling

−

|-

−

~~| colspan=2 style="text-align:center"| Reference File Settings - telling GotCloud where to find your reference files~~

−

|-

−

~~| REF_DIR || Path to your reference files~~

−

* You don't have to use this, you can specify the full path for each file

−

|-

−

~~| REF || Path/filename of the FASTA reference file~~

−

* If different than default: $(REF_DIR)/human.g1k.v37.fa

−

|-

−

~~| DBSNP_VCF || Path/filename of the DBSNP file~~

−

* If different than default: $(REF_DIR)/dbsnp_135.b37.vcf.gz

−

|-

−

~~| HM3_VCF || Path/filename of the HapMap3 file~~

−

* If different than default: $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz

−

|-

−

~~| OMNI_VCF || Path/filename of the OMNI file~~

−

* If different than default: $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz

−

|-

−

~~| INDEL_PREFIX || Path/filename base of the indels file~~

−

* If different than default: $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19

−

|}

Let's look at the configuration file I created for this test:

−

more ${GC}~~/inputs~~/gotcloud.conf

+

more ${SS}/gotcloud.conf

+

Use the <code>space bar</code> to advance if the whole file isn't displayed.

+

; If your references are in a different path than what is specified, what would you change?

+

<ul>

+

+

<li>Answer:</li>

+

+

<ul>

+

<li>You would change <code>REF_DIR</code> to the new path</li>

+

[[File:gcConf.png|800px]]

+

</div>

+

</div>

+

</ul>

+

</ul>

−

~~It already points to your align file.~~

+

== Run [[GotCloud]] Align ==

−

~~=== Run GotCloud Align ===~~

+

[[File:AlignDiagram.png|500px]]

−

Now that we have all of our input files, we need just a simple command to run

+

−

${GC}~~/gotcloud~~/gotcloud align --conf ${GC}~~/inputs~~/gotcloud.conf --numcs 2

+

Now that we have all of our input files, we need just a simple command to run them

+

${GC}/gotcloud align --conf ${SS}/gotcloud.conf --numcs 2 --base_prefix ${SS} --outdir ${OUT}

+

* <code>${GC}/gotcloud</code> runs GotCloud

+

* <code>align</code> tells GotCloud you want to run the alignment pipeline.

+

* <code>--conf</code> tells GotCloud the name of the configuration file to use.

+

** The configuration for this test was downloaded with the seqshop input files.

* <code>--numcs</code> means to run 2 samples at a time.

−

** ~~Depends~~ on your system

+

** How many you can run concurrently depends on your system.

+

* <code>--base_prefix</code> tells GotCloud the prefix to append to relative paths.

+

** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}

+

** Alternatively, gotcloud.conf could be updated to specify the full paths

+

* <code>--out_dir</code> tells GotCloud where to write the output.

+

** This could be specified in gotcloud.conf, but to allow you to use the ${OUT} to change the output location, it is specified on the command-line

+

[[File:gcalignStart.png|850px]]

−

This should take ~~< 4~~ minutes to run.

+

This should take 1-3 minutes to run.

It should end with a line like: <code>Processing finished in 133 secs with no errors reported</code>

Line 275: Line 306:

If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off.

−

=== Examining GotCloud Align Output ===

+

Inside GotCloud align, not only sequence alignment but also pre-processing of sequence data, including deduplication and base quality recalibration are performed along with quality assessment, as illustrated below.

+

[[File:Gotcloud_align_detail.png|500px]]

+

== Examining GotCloud Align Output ==

Let's look at the output directory:

−

ls ${~~OUTPUT~~}

+

ls ${OUT}

[[File:gcalignOutM.png|600px]]

−

~~Let's look at the BAMs (aligned reads that are ready for variant calling):~~

+

=== Quality Control Files ===

−

~~ls ${OUTPUT}/bams~~

−

~~[[File:GcalignOutBAMm.png|600px]]~~

−

~~BAM~~ Files:

−

* Binary Sequence Alignment/Map (SAM) Format

−

* Maps reads to Chromosome/Position

−

* For a detailed explanation of the SAM/BAM format, see:

−

** SAM/BAM Spec: http://samtools.github.io/hts-specs/SAMv1.pdf

−

** Additional information I put together as I started working with SAM/BAM: [[SAM]]

−

* Consists of:

−

** Header

−

*** Starts with '@'

−

*** Records - one for each sequence read

−

~~Let's examine a BAM file:~~

−

~~samtools view -h ${OUTPUT}/bams/~~

−

~~[[File:BAM.png|750px]]~~

−

Let's take a look at our quality control output directory:

−

ls ${~~OUTPUT~~}/QCFiles

+

ls ${OUT}/QCFiles

[[File:GcalignOutQCm.png|600px]]

+

==== Sample Contamination/Swap ====

Check for sample contamination:

* *.selfSM : Main output file containing the contamination estimate.

−

** If you are only interested in checking sample contamination:

+

** Check the 'FREEMIX' column for genotype-free estimate of contamination

−

*** Check the 'FREEMIX' column for genotype-free estimate of contamination

**** 0-1 scale, the lower, the better

−

**** See [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information

+

**** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination

−

*** Check the 'CHIPMIX' column for contamination estimates with external genotypes (if provided)

+

** See [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information

−

* *.selfRG : Same output to .*selfSM, but separated by readGroup (which might be helpful for library-level examination)

+

less -S ${OUT}/QCFiles/HG00551.genoCheck.selfSM

−

* *.depthSM : depth distribution of reads covering the marker position of the input VCF, across all readGroups.

+

−

* *.depthRG : depth distribution of reads covering the marker position of the input VCF, per readGroups.

+

Remember, use <code>'q'</code> to exit out of <code>less</code>

−

less -S ${~~OUTPUT~~}/QCFiles/HG00551.genoCheck.selfSM

+

q

+

; Is there evidence of sample contamination?

+

<ul>

+

+

<li>Answer:</li>

+

+

<ul>

+

<li>No, FREEMIX = 0.00000 (<0.03)</li>

+

</ul>

[[File:Contam1.png|700px]]

+

</div>

+

</div>

+

</ul>

+

==== QC Metrics ====

+

See: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results.

−

~~Next, let~~'s look at some quality control metrics:

+

Let's look at some quality control metrics:

−

cat ${~~OUTPUT~~}/QCFiles/HG00551.qplot.stats

+

cat ${OUT}/QCFiles/HG00551.qplot.stats

−

* 99.16% mapping rate

−

* 94.01% high quality bases

−

* 7x coverage

−

* 31.3% A, 31.3% T

−

* 18.7% C, 18.7% G

+

; What is the mapping rate & average coverage for HG00551?

+

<ul>

+

+

<li>Answer</li>

+

+

<ul>

+

<li> 98.93% Mapped</li>

+

<li>7.43 MeanDepth</li>

+

</ul>

+

[[File:qplots.png|200px]]

+

</div>

+

</div>

+

</ul>

−

Generate ~~the~~ pdf's of ~~our~~ quality metrics:

+

Generate a pdf of quality metrics:

−

Rscript ${~~OUTPUT~~}/QCFiles/HG00551~~.qplot.R~~

+

Rscript ${OUT}/QCFiles/HG00551.qplot.R

−

~~Rscript ${OUTPUT}/QCFiles/HG00553.qplot.R~~

−

~~Rscript ${OUTPUT}/QCFiles/HG00640.qplot.R~~

−

~~Rscript ${OUTPUT}/QCFiles/HG00641~~.qplot.R

Examine the PDF:

−

evince ${~~OUTPUT~~}/QCFiles/HG00551~~.qplot.pdf&~~

+

evince ${OUT}/QCFiles/HG00551.qplot.pdf&

−

~~The first plot: Empirical vs reported Phred score does not look as good as we would like.~~

−

* This is due to the small region used for recalibration

−

~~Look at the PDF I produced when I ran the whole genome:~~

−

~~evince ${GC}/example/HG00551.wg~~.qplot.pdf&

−

~~See~~: [[~~QPLOT~~#~~Diagnose_sequencing_quality~~|~~QPLOT~~: ~~Diagnose sequencing quality~~]] for more ~~info~~ on how to ~~use QPLOT results~~.

+

It is ok if you see a warning message when opening evince. It should still open. If not, let me know. To close evince, just close the pdf window.

+

;Does the Empirical vs reported Phred score look as good as we would like?

+

<ul>

+

+

<li>Answer</li>

+

+

<ul>

+

<li> No, it is well above the line</li>

+

<li> This is due to the small region used for recalibration</li>

+

[[File:Qplotpdf.png|400px]]

+

<li> Look at the PDF I produced when I ran the whole genome:</li>

+

evince ${SS}/ext/HG00551.wg.qplot.pdf&

+

</ul>

+

[[File:Qplotpdfwg.png|400px]]

+

</div>

+

</div>

+

</ul>

+

=== BAM Files ===

+

Binary Sequence Alignment/Map (SAM) Format

+

* Maps reads to Chromosome/Position

+

* For a detailed explanation of the SAM/BAM format, see:

+

** SAM/BAM Spec: http://samtools.github.io/hts-specs/SAMv1.pdf

+

** Additional information I put together as I started working with SAM/BAM: [[SAM]]

+

Let's look at the BAMs (aligned reads that are ready for variant calling):

+

ls ${OUT}/bams

+

[[File:GcalignOutBAMm.png|600px]]

+

Let's examine at the first 5 lines of the BAM file using [http://samtools.sourceforge.net/samtools.shtml#3 samtools view]:

+

${GC}/bin/samtools view -h ${OUT}/bams/HG00551.recal.bam|head -n 5

+

; What are the chromosome and position of the first record in the BAM file?

+

<ul>

+

+

<li>Need a reminder of the format?</li>

+

+

[[File:Bam.png|750px]]

+

</div>

+

</div>

+

+

<li>Answer</li>

+

+

<ul>

+

+

</ul>

+

[[File:BamRec.png|650px]]

+

</div>

+

</div>

+

</ul>

+

==== Accessing BAMs by Position ====

+

BAM's are so big, what if we want to see a position part way through the file?

+

*[http://samtools.sourceforge.net/samtools.shtml#3 samtools] has an option for that.

+

Add a region to the view command we used above. Let's find all reads that overlap positions 36907000-36907005:

+

${GC}/bin/samtools view -h ${OUT}/bams/HG00551.recal.bam 22:36907000-36907005

+

* Just a few reads.

+

Let's visualize what reads in that area look like using samtools tview:

+

${GC}/bin/samtools tview ${OUT}/bams/HG00551.recal.bam ${SS}/ref22/human.g1k.v37.chr22.fa

+

* Type ‘g’

+

** Type 22:36907000

+

* Type ‘n’ to color by nucleotide

+

* Use the arrow keys to move around and look at the area.

+

Understanding the syntax:

+

* '.' : match to the reference on the forward strand

+

* ',' : match to the reference on the reverse strand

+

* ACGTN : mismatch to reference on the forward strand

+

* acgtn : mismatch to reference on the reverse strand

+

; Do you see anything interesting?

+

<ul>

+

+

<li>Screenshot</li>

+

+

<ul>

+

<li>We will have to remember this region when we run snpcall to see what it says.</li>

+

</ul>

+

[[File:tview.png|750px]]

+

</div>

+

</div>

+

</ul>

+

Other tview commands:

+

* Type '?' for a help screen

+

* Type 'q' to quit tview

+

Feel free to play around more and browse the BAM files.

+

==== Other tools for BAMs ====

+

We have developed a lot of tools that operate on BAM files.

+

See [[Software#BAM_Util_Tools|Software: BamUtil Tools]] for a list

+

* Many operations:

+

** diff : diff 2 BAM files

+

** stats: per positions statistics

+

** bam2Fastq : convert a BAM back to a FASTQ (how I created the fastqs for this tutorial)

+

** Lots of others

+

* Feel free to try some out

+

* If you have any questions, let me know, I wrote most of them and am happy to help.

+

== Logging Off ==

+

''This section is specifically for the SeqShop Workshop computers.''

+

+

''If you are not running during the SeqShop Workshop, please skip this section.''

+

+

To logout of seqshop-server, type:

+

exit

+

And close the windows.

+

When done, log out of the Windows machine.

+

</div>

+

</div>

Ppwhite

96

edits

Changes

SeqShop: Sequence Mapping and Assembly Practical, June 2014 (view source)

Revision as of 11:07, 2 February 2017

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools