Changes

3,480 bytes added , 11:07, 2 February 2017

→‎Download the example data

Line 1: Line 1: +

'''Note:''' the latest version of this practical is available at: [[SeqShop: Sequence Mapping and Assembly Practical]]

+

* The one here is the original from the June workshop (updated so it can be run from elsewhere)

+

== Introduction ==

−

See the [[introductory slides]] for an intro to this tutorial.

+

See the [[Media:SeqShop - GotCloud Align.pdf|introductory slides]] for an intro to this tutorial.

+

== Goals of This Session ==

* What we want to learn

Line 7: Line 11:

** How to evaluate the quality of sequence data

** How to visualize sequence data to examine the reads aligned to particular genomic positions

+

== Setup in person at the SeqShop Workshop ==

+

''This section is specifically for the SeqShop Workshop computers.''

+

+

''If you are not running during the SeqShop Workshop, please skip this section.''

+

−

== Setup your run environment==

+

=== Setup your run environment===

This will setup some environment variables to point you to

−

* GotCloud program

+

* [[GotCloud]] program

* Tutorial input files

* Setup an output directory

source /home/mktrost/seqshop/setup.txt

+

* You won't see any output after running <code>source</code>

+

** It silently sets up your environment

View setup.txt

−

[[File:setup.png|~~300px~~]]

+

[[File:setup.png|500px]]

+

</div>

+

</div>

+

== Setup when running on your own outside of the SeqShop Workshop ==

+

''This section is specifically for running on your own outside of the SeqShop Workshop.''

+

+

''If you are running during the SeqShop Workshop, please skip this section.''

+

+

=== Download & Build GotCloud ===

+

If you do not already have GotCloud:

+

* cd to where you want GotCloud installed (you can change this to any directory you want)

+

mkdir -p ~/seqshop

+

cd ~/seqshop/

+

* download, decompress, and build the version of gotcloud that was tested with this tutorial:

+

wget https://github.com/statgen/gotcloud/archive/gotcloud.workshop.tar.gz

+

tar xvf gotcloud.workshop.tar.gz

+

mv gotcloud-gotcloud.workshop gotcloud

+

cd gotcloud/src

+

make

+

cd ../..

−

~~Alternatively, if~~ you ~~would like~~ to ~~change the output directory, copy the file, make the modifications and source~~ your ~~own file:~~

+

Remember the path to gotcloud/; that is what you will need to set your GC variable to.

−

~~cp /home/mktrost/seqshop/setup.txt ~/setup.txt~~

−

~~nedit ~/setup.txt~~

−

~~source ~/setup.txt~~

−

~~(You can use your favorite editor instead of nedit. I typically use emacs, but nedit is more like Windows~~.)

−

== Examining Raw Sequence Reads ==

+

=== Download the example data ===

+

Download and untar the file containing the example data used in the practicals:

+

wget http://csg.sph.umich.edu//mktrost/seqshopExample.tar.gz

+

tar xvf seqshopExample.tar.gz

+

You will see the names of all the files included in the example data scrolling on the screen as they are unpacked from the tar file.
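The commands in the rest of this practical refer to <code>${GC}</code>, <code>${SS}</code>, and <code>${OUT}</code>. A minimal sketch of setting them yourself is shown here. It assumes everything was installed under <code>~/seqshop</code> as above, and that the example tar file unpacked into a directory called <code>example</code> (that directory name and the <code>SEQSHOP_ROOT</code> helper variable are assumptions, so check the names tar printed and adjust).

```shell
# Assumed layout: everything lives under ~/seqshop (override with SEQSHOP_ROOT,
# a helper variable invented for this sketch).
SEQSHOP_ROOT="${SEQSHOP_ROOT:-${HOME}/seqshop}"

export GC="${SEQSHOP_ROOT}/gotcloud"   # directory holding the gotcloud script and bin/
export SS="${SEQSHOP_ROOT}/example"    # assumed name of the unpacked example data directory
export OUT="${SEQSHOP_ROOT}/output"    # where you want the results to be written

echo "GC=${GC}"
echo "SS=${SS}"
echo "OUT=${OUT}"
```

These settings only last for the current terminal session, so rerun them if you open a new window.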

+

== Examining [[GotCloud]] Align Input Files ==

+

=== Examining Raw Sequence Reads : FASTQs ===

FASTQ : standard file format provided to you by those who did the sequencing.

: For more information on the FASTQ format, see: http://en.wikipedia.org/wiki/FASTQ_format
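As a quick illustration of the format, here is a small sketch that writes a two-read FASTQ and counts the reads in it. The reads are made up for the demo and are not from the tutorial data; the only thing it relies on is that each FASTQ record takes exactly 4 lines.

```shell
# Write a two-read demo FASTQ (made-up reads).
fq_dir=$(mktemp -d)
cat > "${fq_dir}/demo.fastq" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
TTGCAATGCC
+
IIIIHHHHGG
EOF

# Each record is 4 lines: read name, bases, '+' separator, base qualities.
line_count=$(wc -l < "${fq_dir}/demo.fastq")
read_count=$((line_count / 4))
echo "reads: ${read_count}"
```

The same line count divided by 4 works on any uncompressed FASTQ whose records are not line-wrapped.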

Line 39: Line 76:

* Subset of FASTQs - should map to chromosome 22 36000000-37000000

−

ls ${IN}/fastq/

+

ls ${SS}/fastq/

There are 24 fastq files: combination of single-end & paired-end.

Line 47: Line 84:

<li>View answer:</li>

+

<ul>

<li> Paired-end files have a '''_1.fastq''' or '''_2.fastq''' extension</li>

<li> This convention isn't mandatory, but something similar is common</li>

Line 52: Line 90:

</div>

+

</ul>

Look at a couple of FASTQs:

−

less -S ${IN}/fastq/HG00551.SRR190851_1.fastq

+

less -S ${SS}/fastq/HG00551.SRR190851_1.fastq

<code>less</code> is a Linux command that allows you to look at a file.

−

*<code>-S</code> option prevents line wrap.

+

*<code>-S</code> option prevents line wrap

−

* Use the arrow (up/down/left/right) keys to scroll through the file.

+

* Use the arrow (up/down/left/right) keys to scroll through the file

−

* Use <code>~~zless~~</code> ~~if the file is compressed.~~

+

* Use the <code>space bar</code> to jump down a page

Use <code>'q'</code> to exit out of <code>less</code>

q

Line 76: Line 115:

Look at the paired read:

−

less -S ${IN}/fastq/HG00551.SRR190851_2.fastq

+

less -S ${SS}/fastq/HG00551.SRR190851_2.fastq

+

Remember, use <code>'q'</code> to exit out of <code>less</code>

+

q

;Do you notice something in common?

+

<ul>

<li>View answer:</li>

+

<ul>

<li> Paired-end reads have matching read names with different extensions</li>

<li> This convention isn't mandatory, but something similar is common</li>

Line 88: Line 132:

</div>

+

</ul>
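If you want to check the pairing rather than eyeball it, a sketch like the following compares the read names of two FASTQs. It assumes the names differ only by a trailing <code>/1</code> or <code>/2</code>, which is a common convention but not guaranteed for your data. The demo files and the <code>read_names</code> helper are invented for this example.

```shell
pair_dir=$(mktemp -d)
# Made-up paired FASTQs with two reads each.
printf '@SRR000001.1/1\nACGT\n+\nIIII\n@SRR000001.2/1\nGGCA\n+\nIIII\n' > "${pair_dir}/demo_1.fastq"
printf '@SRR000001.1/2\nTTAG\n+\nIIII\n@SRR000001.2/2\nCCAT\n+\nIIII\n' > "${pair_dir}/demo_2.fastq"

# Print the read names (line 1 of every 4-line record) with the
# trailing /1 or /2 removed.
read_names() {
    awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' "$1"
}

read_names "${pair_dir}/demo_1.fastq" > "${pair_dir}/names_1.txt"
read_names "${pair_dir}/demo_2.fastq" > "${pair_dir}/names_2.txt"

# cmp -s is silent and only sets the exit status.
if cmp -s "${pair_dir}/names_1.txt" "${pair_dir}/names_2.txt"; then
    pair_status="names match"
else
    pair_status="names differ"
fi
echo "${pair_status}"
```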

−

== ~~GotCloud Alignment Pipeline~~==

+

=== Reference Files ===

+

Reference files can be downloaded with [[GotCloud]] or from other sources

+

* See [[GotCloud: Genetic Reference and Resource Files]] for more information on downloading/generating reference files

−

~~[[File~~:~~AlignDiagram~~.~~png|500px]]~~

+

For alignment, you need:

+

# Reference genome FASTA file

+

#* Contains the reference base for each position of each chromosome

+

#* Additional information on the FASTA format: http://en.wikipedia.org/wiki/FASTA_format

+

# VCF (variant call format) files with chromosomes/positions

+

#* dbsnp - used to skip known variants when recalibrating

+

#* hapmap - used for sample contamination/sample swap validation

−

~~=== Why GotCloud?===~~

+

Take a look at the chromosome 22 reference files included for this tutorial:

−

* Easy to learn & run

+

ls ${SS}/ref22

−

** All-in-one sequence analysis pipeline

−

** You don’t have to know the ~~details of individual component~~

−

* Robust parallelization

−

** Automatic partition of multi-sample jobs

−

** Reliable and fault-tolerant parallelization via GNU make

−

*** Restart from where it stopped upon unexpected crash

−

* Cloud & Cluster-friendly

−

** Supports multiple clusters such as MOSIX, Slurm, & SGE

−

** Amazon instances allow running large-scale jobs without having your own cluster

−

~~Sequence Processing Recommendations~~

+

<ul>

−

* Be consistent within a project

+

−

** Process all samples with same pipeline

+

<li>View Screenshot</li>

−

*** Batch effect may arise if different pipelines are used due to pipeline differences

+

−

** Use the same configuration within a project

+

[[File:RefDir.png|700px]]

+

</div>

+

</div>

+

</ul>

−

~~=== Examining GotCloud Align Input Files ===~~

+

Let's read the reference FASTA file (all reference bases for the chromosome):

−

~~==== Sequence Data Files~~ : ~~FASTQs ====~~

+

less ${SS}/ref22/human.g1k.v37.chr22.fa

−

~~We already looked at those in: [[#Examining Raw Sequence Reads|Examining Raw Sequence Reads]]~~

−

~~==== Reference Files ====~~

+

Remember, use <code>'q'</code> to exit out of <code>less</code>

−

~~Reference files can be downloaded with GotCloud or from other sources.~~

+

q

−

* Partial reference for this example

−

ls ${GC}/~~reference~~/~~chr7~~

+

; Where is the reference sequence?

+

<ul>

+

+

<li>Answer:</li>

+

+

<ul>

+

<li>The ends of a chromosome are 'N' - unknown bases</li>

+

<li>Let's look at 5 lines of the file starting at line 300,000</li>

+

tail -n+300000 ${SS}/ref22/human.g1k.v37.chr22.fa |head -n 5

+

[[File:Fasta.png|500px]]

+

</div>

+

</div>

+

</ul>

+

</ul>
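Rather than guessing a line number, you can ask for the first line that contains a real base. This is a small sketch on a made-up FASTA; on the tutorial data you would point the same awk command at <code>${SS}/ref22/human.g1k.v37.chr22.fa</code>.

```shell
fa_dir=$(mktemp -d)
# Made-up FASTA: a header line, two all-N lines, then real sequence.
cat > "${fa_dir}/demo.fa" <<'EOF'
>demo
NNNNNNNNNN
NNNNNNNNNN
NNNNACGTAC
GGCATTACGA
EOF

# Skip the '>' header and report the line number of the first line
# that contains at least one A, C, G or T.
first_base_line=$(awk '!/^>/ && /[ACGTacgt]/ { print NR; exit }' "${fa_dir}/demo.fa")
echo "first line with sequence: ${first_base_line}"
```

The reported number can then be used as the starting line for <code>tail -n+</code>.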

−

~~[[File~~:~~RefDir~~.~~png|500px]]~~

+

If you want to access the FASTA file by position, you can use <code>samtools faidx</code> command

+

${GC}/bin/samtools faidx ${SS}/ref22/human.g1k.v37.chr22.fa 22:36000000 | less

+

or

+

${GC}/bin/samtools faidx ${SS}/ref22/human.g1k.v37.chr22.fa 22:36000000-36000100

−

+

=== GotCloud FASTQ Index File ===

−

~~VCF files~~

+

The FASTQ index file is created by you to tell GotCloud about each of your FASTQ files:

−

* List of chromosome/position

+

* Where to find it

−

* Used for:

−

** dbsnp - recalibration skips known variants

−

** hapmap - used for sample contamination/sample swap validation

−

** variant filtering

−

~~Let's read the first 5 lines of the genome reference FASTA file (all reference bases for a chromosome):~~

−

~~head -n 5 ${GC}/reference/chr7/human.g1k.v37.chr7.fa~~

−

~~[[File:Fasta3.png|600px]]~~

−

~~The start of the chromosome is all N's, so let's look at a later section (reading 5 lines starting at line 2000):~~

−

~~tail -n+2000 ${GC}/reference/chr7/human.g1k.v37.chr7.fa |head -n 5~~

−

~~[[File:Fasta3 (copy).png|600px]]~~

−

~~Additional information on the FASTA format: http://en.wikipedia.org/wiki/FASTA_format~~

−

~~See [[GotCloud: Genetic Reference and Resource Files]] for more information on downloading/generating reference files.~~

−

==== GotCloud FASTQ Index File ====

−

~~You need~~ to tell GotCloud about each FASTQ ~~file~~

−

* ~~Full path~~

* Sample name

** Each sample can have multiple FASTQs

** Each FASTQ is for a single sample

+

* Run identifier

+

** For recalibration we need to know which reads were in the same run.

−

~~The~~ FASTQ ~~index file is created by you to direct GotCloud to your FASTQ files, providing additional information for them~~.

+

FASTQ Index Format:

+

* Tab delimited

+

* Starts with a header line

+

* One line per single-end read

+

* One line per paired-end read (only 1 line per pair).
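To make the format concrete, here is a sketch that writes a tiny index file and checks that every line has the same number of tab-separated fields as the header. All sample names, paths, and values are made up. The column order is an assumption, chosen so that fields 1 and 4 are MERGE_NAME and RGID, and the <code>.</code> in FASTQ2 is used here to mark a single-end line.

```shell
idx_dir=$(mktemp -d)
index_file="${idx_dir}/demo.index"

# Header line, then one line per single-end FASTQ or per FASTQ pair.
{
    printf 'MERGE_NAME\tFASTQ1\tFASTQ2\tRGID\tSAMPLE\tLIBRARY\tCENTER\tPLATFORM\n'
    printf 'SampleA\tfastq/SampleA.run1_1.fastq\tfastq/SampleA.run1_2.fastq\trun1\tSampleA\tLibA\tDemoCenter\tILLUMINA\n'
    printf 'SampleA\tfastq/SampleA.run2.fastq\t.\trun2\tSampleA\tLibA\tDemoCenter\tILLUMINA\n'
} > "${index_file}"

# Sanity check: count lines whose field count differs from the header's.
bad_lines=$(awk -F'\t' 'NR == 1 { n = NF } NF != n { c++ } END { print c + 0 }' "${index_file}")
echo "lines with wrong field count: ${bad_lines}"
```

A non-zero count usually means a space was typed where a tab was intended.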

−

* tab delimited

+

Let's take a look at the index file I prepared for this tutorial:

−

* columns may be in any order

+

less -S ${SS}/align.index

−

* starts with a ~~header line~~

−

* one line per single-end read

−

* one line per paired-~~end read (only 1 line per pair)~~.

+

Remember, use <code>'q'</code> to exit out of <code>less</code>

+

q

−

~~'''Required Columns'''~~

+

; Which samples had multiple runs?

−

{|class="~~wikitable~~" ~~cellpadding~~=5

+

<ul>

−

~~! Column Name !! Description !! Recommended Value~~

+

−

|-

+

<li>Need a reminder of the format?</li>

−

| ~~MERGE_NAME ||~~

+

−

* Base name for the ~~resulting BAM~~ file ~~for~~ the ~~sample~~

+

[[File:fqindex.png|750px]]

−

* Used to ~~group multiple fastqs or fastq pairs into~~ a ~~single BAM~~

+

</div>

−

~~| Sample Name~~

+

</div>

−

|-

+

<ul>

−

~~| FASTQ1 ||~~

+

<li>Note: in the screenshots, the fields are shifted into clear columns to make it easier to read</li>

−

* Name of the ~~fastq or the first in the pair if paired~~-~~end~~. ~~(Only 1 line per pair)~~

+

<ul>

−

~~| path~~/~~fastq1~~

+

<li>When you view the file, the fields will not line up in neat columns and it can be hard to read</li>

−

|-

+

</ul>

−

~~| FASTQ2 ||~~

+

</ul>

−

*Name of the 2nd fastq in paired-~~end reads.~~

+

−

*Column is not required if all fastqs are single-end

+

<li>Hard to read the index? Need a hint?</li>

−

*'.' if the ~~column is used, but this line is single-ended~~.

+

−

~~| path~~/~~fastq2~~

+

<ul>

−

|}

+

<li>Use cut to extract just the MERGE_NAME & RGID fields </li>

+

cut -f 1,4 ${SS}/align.index

+

</ul>

+

</div>

+

</div>

+

+

<li>Answer:</li>

+

+

<ul>

+

+

<li>They have multiple unique values in the RGID field</li>

+

[[File:fqindexRG.png|800px]]

+

</div>

+

</div>

+

</ul>

+

</ul>
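With a larger index file, picking out the samples with multiple runs by eye gets tedious. The sketch below counts the distinct RGID values per MERGE_NAME on a made-up index. It assumes MERGE_NAME is field 1 and RGID is field 4, as in the <code>cut</code> hint above.

```shell
run_dir=$(mktemp -d)
run_index="${run_dir}/demo.index"
# Made-up index: SampleA has two runs, SampleB has two FASTQ lines from one run.
{
    printf 'MERGE_NAME\tFASTQ1\tFASTQ2\tRGID\n'
    printf 'SampleA\ta_1.fastq\ta_2.fastq\trun1\n'
    printf 'SampleA\tb.fastq\t.\trun2\n'
    printf 'SampleB\tc_1.fastq\tc_2.fastq\trun3\n'
    printf 'SampleB\td.fastq\t.\trun3\n'
} > "${run_index}"

# Drop the header, keep MERGE_NAME and RGID, collapse duplicate pairs,
# count the distinct RGIDs left per sample, and keep samples with more than one.
multi_run_samples=$(tail -n +2 "${run_index}" | cut -f 1,4 | sort -u | cut -f 1 | uniq -c | awk '$1 > 1 { print $2 }')
echo "${multi_run_samples}"
```

To try it on the tutorial data, replace <code>${run_index}</code> with <code>${SS}/align.index</code>.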

−

~~The following columns are optional and used to populate the Read Group Information in the BAM file.~~

+

How do you point GotCloud to your index file?

−

* RGID field is required if using any of these fields, the others are optional.

+

* Command-line <code>--index_file</code> option

−

+

: or

−

~~What is a Read Group?~~

+

* Configuration file <code>INDEX_FILE</code> setting.

−

* Groups reads together

−

* Used for recalibration

−

** Each sequencing run should get a different ReadGroup

−

* Typically a new name for each fastq pair/group

−

~~If you~~ do ~~not want the field for:~~

−

* any fastq, leave the column out of the header line

−

* a single line, use a '.'

−

~~'''Optional Columns'''~~

−

~~{|class="wikitable" cellpadding=5~~

−

|-

−

~~! Column Name !! Description !! Recommended Value~~

−

|-

−

~~| RGID || Read Group ID || Run ID~~

−

|-

−

~~| SAMPLE || Sample Name || Sample Name~~

−

|-

−

~~| LIBRARY || Library~~

−

* separate FASTQs for a sample that were prepped separately

−

~~| if you don't know or it is all the same, use Sample Name~~

−

|-

−

~~| CENTER || Center Name || Name of the sequencing center producing the FASTQ~~

−

|-

−

~~| PLATFORM || Platform || CAPILLARY, LS454, ILLUMINA,~~

−

~~SOLID, HELICOS, IONTORRENT, or PACBIO~~

−

|}

−

~~Your sequencing core may provide to~~ you ~~a file with information~~ to ~~fill in these columns.~~

−

~~For our example, we have <code>sequence.~~index~~</code> which contains the information from 1000 Genomes for the FASTQs we are processing.~~

−

~~less -S ${GC}/inputs/fastq/sequence.index~~

−

~~In this~~ file~~, we want the SAMPLE_NAME, FASTQ_FILE, RUN_ID, LIBRARY_NAME, CENTER_NAME, INSTRUMENT_PLATFORM (columns 10, 1, 15, 6, 13).~~

−

* You can use perl/awk/linux to extract these fields & format as necessary.

−

* ~~I prepared a perl script that you can use:~~

−

~~perl ${GC}/scripts/genIndex.pl > ${SETUP}/align.index~~

−

~~Let's look at the index file:~~

−

~~less -S ${SETUP}/align.index~~

−

~~[[File:Align index.png|1000px]]~~

−

~~The command~~-line <code>--~~fastq~~</code> option or ~~the configuration file <code>FASTQ_PREFIX</code> setting can be used to specify a prefix to the FASTQ1/FASTQ2 file paths.~~

−

~~This file is specified either via the command-line <code>--index_file</code> parameter or via the configuration~~ file <code>INDEX_FILE</code> setting.

The command-line setting takes precedence over the configuration file setting.

−

==== GotCloud Configuration File ====

+

=== GotCloud Configuration File ===

This file is created by you to configure GotCloud for your data.

−

* Default values are provided in ${GC}~~/gotcloud~~/bin/gotcloudDefaults.conf

+

* Default values are provided in ${GC}/bin/gotcloudDefaults.conf

** Most values should be left as the defaults

* Specify values in your configuration file as:

−

KEY = value

+

** <code>KEY = value</code>

* Use $(KEY) to refer to another key's value

* If a KEY is specified twice, the later value is used

* Does not have access to environment variables

* '#' indicates a comment

−

* Keys to override:

−

~~{|class="wikitable" cellpadding=5~~

−

|-

−

~~! Key Name !! Description~~

−

|-

−

~~| colspan=2 style="text-align:center"| Index File Settings - pointing GotCloud to your data~~

−

|-

−

~~| INDEX_FILE || Path to the FASTQ index file that you created~~

−

* Alternatively, this can be specified on the command-line as <code>--index_file</code>

−

|-

−

~~| FASTQ_PREFIX || Prefix to be added to the FASTQ files in INDEX_FILE~~

−

* Not required

−

|-

−

~~| BAM_INDEX || Path to the BAM index file~~

−

* to be created by alignment

−

* to be used for snp calling

−

|-

−

~~| colspan=2 style="text-align:center"| Reference File Settings - telling GotCloud where to find your reference files~~

−

|-

−

~~| REF_DIR || Path to your reference files~~

−

* You don't have to use this, you can specify the full path for each file

−

|-

−

~~| REF || Path/filename of the FASTA reference file~~

−

* If different than default: $(REF_DIR)/human.g1k.v37.fa

−

|-

−

~~| DBSNP_VCF || Path/filename of the DBSNP file~~

−

* If different than default: $(REF_DIR)/dbsnp_135.b37.vcf.gz

−

|-

−

~~| HM3_VCF || Path/filename of the HapMap3 file~~

−

* If different than default: $(REF_DIR)/hapmap_3.3.b37.sites.vcf.gz

−

|-

−

~~| OMNI_VCF || Path/filename of the OMNI file~~

−

* If different than default: $(REF_DIR)/1000G_omni2.5.b37.sites.PASS.vcf.gz

−

|-

−

~~| INDEL_PREFIX || Path/filename base of the indels file~~

−

* If different than default: $(REF_DIR)/1kg.pilot_release.merged.indels.sites.hg19

−

|}

Let's look at the configuration file I created for this test:

−

more ${GC}~~/inputs~~/gotcloud.conf

+

more ${SS}/gotcloud.conf

+

Use the <code>space bar</code> to advance if the whole file isn't displayed.

+

; If your references are in a different path than what is specified, what would you change?

+

<ul>

+

+

<li>Answer:</li>

+

+

<ul>

+

<li>You would change <code>REF_DIR</code> to the new path</li>

+

[[File:gcConf.png|800px]]

+

</div>

+

</div>

+

</ul>

+

</ul>
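The configuration rules above can be tried out on a small made-up file. This sketch is not the tutorial's gotcloud.conf, and the awk one-liner is only an imitation of the "later value is used" rule, not how GotCloud itself parses the file. The second REF_DIR path is a hypothetical location.

```shell
conf_dir=$(mktemp -d)
conf_file="${conf_dir}/demo.conf"
# The quoted 'EOF' keeps the shell from touching $(REF_DIR).
cat > "${conf_file}" <<'EOF'
# '#' starts a comment
INDEX_FILE = align.index
REF_DIR = ref22
REF = $(REF_DIR)/human.g1k.v37.chr22.fa
# A key given twice: the later value is the one that counts
REF_DIR = /data/other_refs
EOF

# Imitate the "later value wins" rule: remember the last value seen per key.
ref_dir=$(awk -F' = ' '!/^#/ && NF == 2 { value[$1] = $2 } END { print value["REF_DIR"] }' "${conf_file}")
echo "REF_DIR resolves to: ${ref_dir}"
```

Because REF is written in terms of <code>$(REF_DIR)</code>, changing that one key is enough to move every reference file that is defined relative to it.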

+

== Run [[GotCloud]] Align ==

−

~~It already points to your align file~~.

+

[[File:AlignDiagram.png|500px]]

−

~~=== Run GotCloud Align ===~~

+

Now that we have all of our input files, we need just a simple command to run them

−

Now that we have all of our input files, we need just a simple command to run

+

${GC}/gotcloud align --conf ${SS}/gotcloud.conf --numcs 2 --base_prefix ${SS} --outdir ${OUT}

−

${GC}~~/gotcloud~~/gotcloud align --conf ${GC}~~/inputs~~/gotcloud.conf --numcs 2

+

* <code>${GC}/gotcloud</code> runs GotCloud

+

* <code>align</code> tells GotCloud you want to run the alignment pipeline.

+

* <code>--conf</code> tells GotCloud the name of the configuration file to use.

+

** The configuration for this test was downloaded with the seqshop input files.

* <code>--numcs</code> means to run 2 samples at a time.

−

** ~~Depends~~ on your system

+

** How many you can run concurrently depends on your system.

+

* <code>--base_prefix</code> tells GotCloud the prefix to prepend to relative paths.

+

** The Configuration file cannot read environment variables, so we need to tell GotCloud the path to the input files, ${SS}

+

** Alternatively, gotcloud.conf could be updated to specify the full paths

+

* <code>--outdir</code> tells GotCloud where to write the output.

+

** This could be specified in gotcloud.conf, but it is given on the command-line here so that you can change the output location through ${OUT}

+

[[File:gcalignStart.png|850px]]

−

This should take ~~< 4~~ minutes to run.

+

This should take 1-3 minutes to run.

It should end with a line like: <code>Processing finished in 133 secs with no errors reported</code>

Line 302: Line 306:

If you cancelled GotCloud part way through, just rerun your GotCloud command and it will pick up where it left off.
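If you run GotCloud from a script, you may want to check the final message automatically. The sketch below looks for the closing phrase in a saved copy of the screen output. The log file here is a made-up stand-in, and saving the output with <code>tee</code> is just one possible approach.

```shell
log_dir=$(mktemp -d)
log_file="${log_dir}/align.log"

# In a real run you could capture GotCloud's screen output, for example:
#   ${GC}/gotcloud align --conf ${SS}/gotcloud.conf --numcs 2 \
#       --base_prefix ${SS} --outdir ${OUT} 2>&1 | tee "${log_file}"
# Here a made-up log stands in for that output.
printf 'Starting alignment (demo line)\nProcessing finished in 133 secs with no errors reported\n' > "${log_file}"

# Check whether the last line carries the success message.
if tail -n 1 "${log_file}" | grep -q 'with no errors reported'; then
    run_status="finished cleanly"
else
    run_status="check the log and rerun"
fi
echo "${run_status}"
```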

−

=== Examining GotCloud Align Output ===

+

GotCloud align does more than sequence alignment: it also pre-processes the sequence data (deduplication and base quality recalibration) and runs quality assessment, as illustrated below.

+

[[File:Gotcloud_align_detail.png|500px]]

+

== Examining GotCloud Align Output ==

Let's look at the output directory:

−

ls ${~~OUTPUT~~}

+

ls ${OUT}

[[File:gcalignOutM.png|600px]]

−

~~Let's look at the BAMs (aligned reads that are ready for variant calling):~~

+

=== Quality Control Files ===

−

~~ls ${OUTPUT}/bams~~

−

~~[[File:GcalignOutBAMm.png|600px]]~~

−

~~BAM~~ Files:

−

* Binary Sequence Alignment/Map (SAM) Format

−

* Maps reads to Chromosome/Position

−

* For a detailed explanation of the SAM/BAM format, see:

−

** SAM/BAM Spec: http://samtools.github.io/hts-specs/SAMv1.pdf

−

** Additional information I put together as I started working with SAM/BAM: [[SAM]]

−

* Consists of:

−

** Header

−

*** Starts with '@'

−

*** Records - one for each sequence read

−

~~Let's examine a BAM file:~~

−

~~samtools view -h ${OUTPUT}/bams/~~

−

~~[[File:BAM.png|750px]]~~

−

Let's take a look at our quality control output directory:

−

ls ${~~OUTPUT~~}/QCFiles

+

ls ${OUT}/QCFiles

[[File:GcalignOutQCm.png|600px]]

+

==== Sample Contamination/Swap ====

Check for sample contamination:

* *.selfSM : Main output file containing the contamination estimate.

−

** If you are only interested in checking sample contamination:

+

** Check the 'FREEMIX' column for genotype-free estimate of contamination

−

*** Check the 'FREEMIX' column for genotype-free estimate of contamination

**** 0-1 scale, the lower, the better

−

**** See [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information

+

**** If [FREEMIX] >= 0.03 and [FREELK1]-[FREELK0] is large, possible contamination

−

*** Check the 'CHIPMIX' column for contamination estimates with external genotypes (if provided)

+

** See [[VerifyBamID#A_guideline_to_interpret_output_files|VerifyBamID: A guideline to interpret output files]] for more information

−

* *.selfRG : Same output to .*selfSM, but separated by readGroup (which might be helpful for library-level examination)

+

less -S ${OUT}/QCFiles/HG00551.genoCheck.selfSM

−

* *.depthSM : depth distribution of reads covering the marker position of the input VCF, across all readGroups.

+

−

* *.depthRG : depth distribution of reads covering the marker position of the input VCF, per readGroups.

+

Remember, use <code>'q'</code> to exit out of <code>less</code>

−

less -S ${~~OUTPUT~~}/QCFiles/HG00551.genoCheck.selfSM

+

q

+

; Is there evidence of sample contamination?

+

<ul>

+

+

<li>Answer:</li>

+

+

<ul>

+

<li>No, FREEMIX = 0.00000 (<0.03)</li>

+

</ul>

[[File:Contam1.png|700px]]

+

</div>

+

</div>

+

</ul>
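With many samples, it is easier to pull out the FREEMIX column than to read each file. The sketch below finds the column by its header name and flags values at or above 0.03. The demo file is a made-up, shortened stand-in (real .selfSM files have more columns), and the sketch only covers the FREEMIX half of the guideline, not the FREELK1-FREELK0 difference.

```shell
sm_dir=$(mktemp -d)
selfsm="${sm_dir}/demo.selfSM"
# Made-up, shortened stand-in for a .selfSM file.
printf '#SEQ_ID\tAVG_DP\tFREEMIX\tFREELK1\tFREELK0\n' > "${selfsm}"
printf 'DemoSample\t7.43\t0.00000\t1000.00\t1000.00\n' >> "${selfsm}"

# Locate FREEMIX in the header, then report each sample's value and a flag.
freemix_report=$(awk -F'\t' -v limit=0.03 '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FREEMIX") col = i; next }
    col     { flag = ($col + 0 >= limit) ? "possible contamination" : "ok"
              print $1 ": FREEMIX=" $col " (" flag ")" }
' "${selfsm}")
echo "${freemix_report}"
```

On the tutorial output, replace <code>${selfsm}</code> with <code>${OUT}/QCFiles/HG00551.genoCheck.selfSM</code>.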

−

~~Next, let's look at some quality control metrics~~:

+

==== QC Metrics ====

−

~~cat ${OUTPUT}/QCFiles/HG00551.qplot.stats~~

+

See: [[QPLOT#Diagnose_sequencing_quality|QPLOT: Diagnose sequencing quality]] for more info on how to use QPLOT results.

−

* 99.16% mapping rate

−

* 94.01% high quality ~~bases~~

−

* 7x coverage

−

* 31.3% A, 31.3% T

−

* 18.7% C, 18.~~7% G~~

+

Let's look at some quality control metrics:

+

cat ${OUT}/QCFiles/HG00551.qplot.stats

−

~~Generate~~ the ~~pdf's of our quality metrics~~:

+

; What is the mapping rate & average coverage for HG00551?

−

~~Rscript ${OUTPUT}~~/~~QCFiles~~/~~HG00551~~.~~qplot~~.R

+

<ul>

−

~~Rscript ${OUTPUT}~~/~~QCFiles~~/~~HG00553.qplot.R~~

+

−

~~Rscript ${OUTPUT}~~/~~QCFiles/HG00640.qplot.R~~

+

<li>Answer</li>

−

Rscript ${~~OUTPUT~~}/QCFiles/~~HG00641~~.qplot.R

+

+

<ul>

+

<li> 98.93% Mapped</li>

+

<li>7.43 MeanDepth</li>

+

</ul>

+

[[File:qplots.png|200px]]

+

</div>

+

</div>

+

</ul>

+

Generate a pdf of quality metrics:

+

Rscript ${OUT}/QCFiles/HG00551.qplot.R

Examine the PDF:

−

evince ${~~OUTPUT~~}/QCFiles/HG00551.qplot.pdf&

+

evince ${OUT}/QCFiles/HG00551.qplot.pdf&

−

~~The first plot:~~ Empirical vs reported Phred score ~~does not~~ look as good as we would like.

+

−

* This is due to the small region used for recalibration

+

It is ok if you see a warning message when opening evince. It should still open. If not, let me know. To close evince, just close the pdf window.

−

Look at the PDF I produced when I ran the whole genome:

+

−

evince ${GC}/~~example~~/HG00551.wg.qplot.pdf&

+

;Does the Empirical vs reported Phred score look as good as we would like?

+

<ul>

+

+

<li>Answer</li>

+

+

<ul>

+

<li> No, it is well above the line</li>

+

<li> This is due to the small region used for recalibration</li>

+

[[File:Qplotpdf.png|400px]]

+

<li> Look at the PDF I produced when I ran the whole genome:</li>

+

evince ${SS}/ext/HG00551.wg.qplot.pdf&

+

</ul>

+

[[File:Qplotpdfwg.png|400px]]

+

</div>

+

</div>

+

</ul>

+

=== BAM Files ===

+

BAM is the binary version of the Sequence Alignment/Map (SAM) format

+

* Maps reads to Chromosome/Position

+

* For a detailed explanation of the SAM/BAM format, see:

+

** SAM/BAM Spec: http://samtools.github.io/hts-specs/SAMv1.pdf

+

** Additional information I put together as I started working with SAM/BAM: [[SAM]]

−

~~See~~: [[~~QPLOT~~#~~Diagnose_sequencing_quality~~|~~QPLOT~~: ~~Diagnose sequencing quality~~]] for more ~~info~~ on how to ~~use QPLOT results~~.

+

Let's look at the BAMs (aligned reads that are ready for variant calling):

+

ls ${OUT}/bams

+

[[File:GcalignOutBAMm.png|600px]]

+

Let's examine the first 5 lines of the BAM file using [http://samtools.sourceforge.net/samtools.shtml#3 samtools view]:

+

${GC}/bin/samtools view -h ${OUT}/bams/HG00551.recal.bam|head -n 5

+

; What are the chromosome and position of the first record in the BAM file?

+

<ul>

+

+

<li>Need a reminder of the format?</li>

+

+

[[File:Bam.png|750px]]

+

</div>

+

</div>

+

+

<li>Answer</li>

+

+

<ul>

+

+

</ul>

+

[[File:BamRec.png|650px]]

+

</div>

+

</div>

+

</ul>
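In the SAM text that <code>samtools view</code> prints, the chromosome (RNAME) is field 3 and the position (POS) is field 4 of each record. The sketch below pulls those two fields from the first record of a made-up SAM file. The record is invented for the demo and is not the answer for HG00551.

```shell
sam_dir=$(mktemp -d)
sam_file="${sam_dir}/demo.sam"
# Made-up SAM text: one '@' header line and one alignment record.
printf '@SQ\tSN:22\tLN:51304566\n' > "${sam_file}"
printf 'read1\t0\t22\t36000001\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' >> "${sam_file}"

# Skip header lines (they start with '@'), then print RNAME and POS
# of the first record and stop.
first_record=$(awk -F'\t' '!/^@/ { print $3 ":" $4; exit }' "${sam_file}")
echo "${first_record}"
```

On the tutorial output, the same awk command can read from <code>${GC}/bin/samtools view -h ${OUT}/bams/HG00551.recal.bam</code> through a pipe.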

+

==== Accessing BAMs by Position ====

+

BAMs are so big; what if we want to see a position part way through the file?

+

*[http://samtools.sourceforge.net/samtools.shtml#3 samtools] has an option for that.

+

Add a region to the view command we used above. Let's find all reads that overlap positions 36907000-36907005:

+

${GC}/bin/samtools view -h ${OUT}/bams/HG00551.recal.bam 22:36907000-36907005

+

* Just a few reads.

+

Let's visualize what reads in that area look like using samtools tview:

+

${GC}/bin/samtools tview ${OUT}/bams/HG00551.recal.bam ${SS}/ref22/human.g1k.v37.chr22.fa

+

* Type ‘g’

+

** Type 22:36907000

+

* Type ‘n’ to color by nucleotide

+

* Use the arrow keys to move around and look at the area.

+

Understanding the syntax:

+

* '.' : match to the reference on the forward strand

+

* ',' : match to the reference on the reverse strand

+

* ACGTN : mismatch to reference on the forward strand

+

* acgtn : mismatch to reference on the reverse strand

+

; Do you see anything interesting?

+

<ul>

+

+

<li>Screenshot</li>

+

+

<ul>

+

<li>We will have to remember this region when we run snpcall to see what it says.</li>

+

</ul>

+

[[File:tview.png|750px]]

+

</div>

+

</div>

+

</ul>

+

Other tview commands:

+

* Type '?' for a help screen

+

* Type 'q' to quit tview

+

Feel free to play around more and browse the BAM files.

+

==== Other tools for BAMs ====

+

We have developed a lot of tools that operate on BAM files.

+

See [[Software#BAM_Util_Tools|Software: BamUtil Tools]] for a list

+

* Many operations:

+

** diff : diff 2 BAM files

+

** stats : per-position statistics

+

** bam2Fastq : convert a BAM back to a FASTQ (how I created the fastqs for this tutorial)

+

** Lots of others

+

* Feel free to try some out

+

* If you have any questions, let me know, I wrote most of them and am happy to help.

+

== Logging Off ==

+

''This section is specifically for the SeqShop Workshop computers.''

+

+

''If you are not running during the SeqShop Workshop, please skip this section.''

+

+

To logout of seqshop-server, type:

+

exit

+

And close the windows.

+

When done, log out of the Windows machine.

+

</div>

+

</div>

Ppwhite

96

edits

Changes

SeqShop: Sequence Mapping and Assembly Practical, June 2014 (view source)

Revision as of 11:07, 2 February 2017
