Sequence Analysis Workshop

In this workshop, we will illustrate some of the essential steps in the analysis of next generation sequence data. As part of the process, you will learn about many of

the file formats commonly used to store next generation sequence data.

We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read

(producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes

(resulting in an updated VCF file).

Example Dataset

Our dataset consists of 30 individuals from Tuscany (in Italy) sequenced by the 1000 Genomes Project. As with other 1000 Genomes Project

samples, these individuals have been sequenced to an average depth of about 4x.

To conserve time and disk-space, our analysis will focus on a small region surrounding the HNF4A gene on chromosome 20.

We will first map reads for a single individual (labeled NA20589), combine the results with mapped reads from other individuals, and generate a list of polymorphic

sites.

Mapping Reads to The Genome

There are many different tools for mapping DNA sequence reads. One of the most commonly used tools is BWA, developed by Heng Li and Richard Durbin at the Sanger

Center. As with other read mappers, BWA first builds an index of the reference genome and then uses this index to quickly assign each sequence read to a genomic

location.

To learn more about BWA, you should visit the BWA website at http://bwa.sourceforge.net

Here, we will simply use BWA to find the most likely sequence location for each read using the bwa aln command. This command requires two parameters,

one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.

 bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/NA20589.fastq.gz > bwa.sai/NA20589.sai

The file NA20589.fastq.gz contains DNA sequence reads for sample NA20589. To conserve disk space, the file has been compressed with gzip but, since fastq is a simple

text format, you can easily view the contents of the file using a command like:

 zcat NA20589.fastq.gz | more

The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream

analysis tools. We can do this with a combination of the bwa samse command and samtools view and samtoosl sort commands.

 bwa samse ref/human_g1k_v37_chr20.fa bwa.sai/NA20589.sai fastq/NA20589.fastq.gz | \
     samtools view -uhS - | samtools sort -m 2000000000 -  bams/noheader.NA20589

The result BAM file uses a compact binary format to represent the alignment of each short read to the genome. You can view the contents of the file using the samtools view command, like so:

 samtools view bams/noheader.NA20589

Altough the current file contains all necessary information about reads and their genomic locations, it is missing some auxiliary information that BAM files typically contain to help describe their contents (for example, to specify that this file contains DNA sequence reads for sample NA20589). So, the very next step is to add this information to the file:

 samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam

If you reached this far, rejoice! The mapping process is almost done. We will now create and index for the file, which makes it convenient to quickly extract reads from any genome location. We do this with the samtools index command, like so:

 samtools index bams/NA20589.bam

You can view the contents of the alignment at any location using the samtools view and samtools tview commands. While the tview generates prettier output, it is not compatible with all screens. For example, to view reads overlapping a 50-bp window starting at position 43,000,000 on chromosome 20, we could:

  samtools tview NA20589 20:43000000-43000100

So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more commonly used format used by many different programs, sorted and indexed the results.

Initial set of variant calls

You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that the next few steps are much simpler.

Tutorial: Low Pass Sequence Analysis

Contents

Sequence Analysis Workshop

Example Dataset

Mapping Reads to The Genome

Initial set of variant calls

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools