Revision as of 04:35, 7 September 2011

Sequence Analysis Workshop

In this workshop, we will illustrate some of the essential steps in the analysis of next generation sequence data. As part of the process, you will learn about many of the file formats commonly used to store next generation sequence data.

We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).

Example Dataset

Our dataset consists of 31 individuals from Tuscany (in Italy) sequenced by the 1000 Genomes Project. As with other 1000 Genomes Project samples, these individuals have been sequenced to an average depth of about 4x.

To conserve time and disk-space, our analysis will focus on a small region surrounding the HNF4A gene on chromosome 20. We will first map reads for a single individual (labeled NA20589), combine the results with mapped reads from the other 30 individuals, and generate a list of polymorphic sites.

Mapping Reads to The Genome

There are many different tools for mapping DNA sequence reads. One of the most commonly used tools is BWA, developed by Heng Li and Richard Durbin at the Sanger Center. As with other read mappers, BWA first builds an index of the reference genome and then uses this index to quickly assign each sequence read to a genomic location.

To learn more about BWA, you should visit the BWA website at http://bio-bwa.sourceforge.net

Here, we will simply use BWA to find the most likely sequence location for each read using the bwa aln command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.

 bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/NA20589.fastq.gz > bwa.sai/NA20589.sai

The file NA20589.fastq.gz contains DNA sequence reads for sample NA20589. To conserve disk space, the file has been compressed with gzip but, since fastq is a simple text format, you can easily view the contents of the file using a command like:

 zcat NA20589.fastq.gz | more

The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the bwa samse command and samtools view and samtoosl sort commands.

 bwa samse ref/human_g1k_v37_chr20.fa bwa.sai/NA20589.sai fastq/NA20589.fastq.gz | \
     samtools view -uhS - | samtools sort -m 2000000000 -  bams/noheader.NA20589

The result BAM file uses a compact binary format to represent the alignment of each short read to the genome. You can view the contents of the file using the samtools view command, like so:

 samtools view bams/noheader.NA20589 | more

Although the current file contains all necessary information about reads and their genomic locations, it is missing some auxiliary information that BAM files typically contain to help describe their contents (for example, to specify that this file contains DNA sequence reads for sample NA20589). So, the very next step is to add this information to the file:

 samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam

If you reached this far, rejoice! The mapping process is almost done. We will now create and index for the file, which makes it convenient to quickly extract reads from any genome location. We do this with the samtools index command, like so:

 samtools index bams/NA20589.bam

You can view the contents of the alignment at any location using the samtools view and samtools tview commands. While the tview generates prettier output, it is not compatible with all screens. For example, to view reads overlapping a 50-bp window starting at position 43,000,000 on chromosome 20, we could:

  samtools tview NA20589 20:43000000-43000100

So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more commonly used format used by many different programs, sorted and indexed the results.

Initial set of variant calls

You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that the next few steps are much simpler.

Difference between revisions of "Tutorial: Low Pass Sequence Analysis"

Revision as of 04:35, 7 September 2011

Contents

Sequence Analysis Workshop

Example Dataset

Mapping Reads to The Genome

Initial set of variant calls

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools

@@ Line 1: / Line 1: @@
 = Sequence Analysis Workshop =
-In this workshop, we will illustrate some of the essential steps in the analysis of next generation sequence data. As part of the process, you will learn about many of
+In this workshop, we will illustrate some of the essential steps in the analysis of next generation sequence data. As part of the process, you will learn about many of the file formats commonly used to store next generation sequence data.
-the file formats commonly used to store next generation sequence data.
+We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).
-We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read
-(producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes
-(resulting in an updated VCF file).
 == Example Dataset ==
-Our dataset consists of 30 individuals from Tuscany (in Italy) sequenced by the [http://www.1000genomes.org 1000 Genomes Project]. As with other 1000 Genomes Project
+Our dataset consists of 31 individuals from Tuscany (in Italy) sequenced by the [http://www.1000genomes.org 1000 Genomes Project]. As with other 1000 Genomes Project samples, these individuals have been sequenced to an average depth of about 4x.
-samples, these individuals have been sequenced to an average depth of about 4x.
-To conserve time and disk-space, our analysis will focus on a small region surrounding the HNF4A gene on chromosome 20.
+To conserve time and disk-space, our analysis will focus on a small region surrounding the HNF4A gene on chromosome 20. We will first map reads for a single individual (labeled NA20589), combine the results with mapped reads from the other 30 individuals, and generate a list of polymorphic sites.
-We will first map reads for a single individual (labeled NA20589), combine the results with mapped reads from other individuals, and generate a list of polymorphic
-sites.
 == Mapping Reads to The Genome ==
-There are many different tools for mapping DNA sequence reads. One of the most commonly used tools is BWA, developed by Heng Li and Richard Durbin at the Sanger
+There are many different tools for mapping DNA sequence reads. One of the most commonly used tools is BWA, developed by Heng Li and Richard Durbin at the Sanger Center. As with other read mappers, BWA first builds an index of the reference genome and then uses this index to quickly assign each sequence read to a genomic location.
-Center. As with other read mappers, BWA first builds an index of the reference genome and then uses this index to quickly assign each sequence read to a genomic
+To learn more about BWA, you should visit the BWA website at http://bio-bwa.sourceforge.net
-location.
+Here, we will simply use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.
-To learn more about BWA, you should visit the BWA website at http://bwa.sourceforge.net
-Here, we will simply use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters,
-one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.
    bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/NA20589.fastq.gz > bwa.sai/NA20589.sai
-The file NA20589.fastq.gz contains DNA sequence reads for sample NA20589. To conserve disk space, the file has been compressed with gzip but, since fastq is a simple
+The file NA20589.fastq.gz contains DNA sequence reads for sample NA20589. To conserve disk space, the file has been compressed with gzip but, since fastq is a simple text format, you can easily view the contents of the file using a command like:
-text format, you can easily view the contents of the file using a command like:
    zcat NA20589.fastq.gz | more
-The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream
+The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.
-analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands.
    bwa samse ref/human_g1k_v37_chr20.fa bwa.sai/NA20589.sai fastq/NA20589.fastq.gz | \
@@ Line 56: / Line 34: @@
 of the file using the <code>samtools view</code> command, like so:
-   samtools view bams/noheader.NA20589
+   samtools view bams/noheader.NA20589 | more
-Altough the current file contains all necessary information about reads and their
+Although the current file contains all necessary information about reads and their
 genomic locations, it is missing some auxiliary information that BAM files typically
 contain to help describe their contents (for example, to specify that this file contains