Difference between revisions of "Tutorial: Low Pass Sequence Analysis"
(Created page with '= Sequence Analysis Workshop = In this workshop, we will illustrate some of the essential steps in the analysis of next generation sequence data. As part of the process, you wil…') |
|||
Line 1: | Line 1: | ||
= Sequence Analysis Workshop = | = Sequence Analysis Workshop = | ||
− | In this workshop, we will illustrate some of the essential steps in the analysis of next generation sequence data. As part of the process, you will learn about many of | + | In this workshop, we will illustrate some of the essential steps in the analysis of next generation sequence data. As part of the process, you will learn about many of the file formats commonly used to store next generation sequence data. |
− | + | We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file). | |
− | |||
− | We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read | ||
− | |||
− | (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes | ||
− | |||
− | (resulting in an updated VCF file). | ||
== Example Dataset == | == Example Dataset == | ||
− | Our dataset consists of | + | Our dataset consists of 31 individuals from Tuscany (in Italy) sequenced by the [http://www.1000genomes.org 1000 Genomes Project]. As with other 1000 Genomes Project samples, these individuals have been sequenced to an average depth of about 4x. |
− | |||
− | samples, these individuals have been sequenced to an average depth of about 4x. | ||
− | To conserve time and disk-space, our analysis will focus on a small region surrounding the HNF4A gene on chromosome 20. | + | To conserve time and disk-space, our analysis will focus on a small region surrounding the HNF4A gene on chromosome 20. We will first map reads for a single individual (labeled NA20589), combine the results with mapped reads from the other 30 individuals, and generate a list of polymorphic sites. |
− | |||
− | We will first map reads for a single individual (labeled NA20589), combine the results with mapped reads from other individuals, and generate a list of polymorphic | ||
− | |||
− | sites. | ||
== Mapping Reads to The Genome == | == Mapping Reads to The Genome == | ||
− | There are many different tools for mapping DNA sequence reads. One of the most commonly used tools is BWA, developed by Heng Li and Richard Durbin at the Sanger | + | There are many different tools for mapping DNA sequence reads. One of the most commonly used tools is BWA, developed by Heng Li and Richard Durbin at the Sanger Center. As with other read mappers, BWA first builds an index of the reference genome and then uses this index to quickly assign each sequence read to a genomic location. |
− | + | To learn more about BWA, you should visit the BWA website at http://bio-bwa.sourceforge.net | |
− | + | Here, we will simply use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. | |
− | |||
− | |||
− | |||
− | Here, we will simply use BWA to find the most likely sequence location for each read using the <code>bwa aln</code> command. This command requires two parameters, | ||
− | |||
− | one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped. | ||
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/NA20589.fastq.gz > bwa.sai/NA20589.sai | bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/NA20589.fastq.gz > bwa.sai/NA20589.sai | ||
− | The file NA20589.fastq.gz contains DNA sequence reads for sample NA20589. To conserve disk space, the file has been compressed with gzip but, since fastq is a simple | + | The file NA20589.fastq.gz contains DNA sequence reads for sample NA20589. To conserve disk space, the file has been compressed with gzip but, since fastq is a simple text format, you can easily view the contents of the file using a command like: |
− | |||
− | text format, you can easily view the contents of the file using a command like: | ||
zcat NA20589.fastq.gz | more | zcat NA20589.fastq.gz | more | ||
− | The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream | + | The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands. |
− | |||
− | analysis tools. We can do this with a combination of the <code>bwa samse</code> command and <code>samtools view</code> and <code>samtoosl sort</code> commands. | ||
bwa samse ref/human_g1k_v37_chr20.fa bwa.sai/NA20589.sai fastq/NA20589.fastq.gz | \ | bwa samse ref/human_g1k_v37_chr20.fa bwa.sai/NA20589.sai fastq/NA20589.fastq.gz | \ | ||
Line 56: | Line 34: | ||
of the file using the <code>samtools view</code> command, like so: | of the file using the <code>samtools view</code> command, like so: | ||
− | samtools view bams/noheader.NA20589 | + | samtools view bams/noheader.NA20589 | more |
− | + | Although the current file contains all necessary information about reads and their | |
genomic locations, it is missing some auxiliary information that BAM files typically | genomic locations, it is missing some auxiliary information that BAM files typically | ||
contain to help describe their contents (for example, to specify that this file contains | contain to help describe their contents (for example, to specify that this file contains |
Revision as of 04:35, 7 September 2011
Sequence Analysis Workshop
In this workshop, we will illustrate some of the essential steps in the analysis of next generation sequence data. As part of the process, you will learn about many of the file formats commonly used to store next generation sequence data.
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).
Example Dataset
Our dataset consists of 31 individuals from Tuscany (in Italy) sequenced by the 1000 Genomes Project. As with other 1000 Genomes Project samples, these individuals have been sequenced to an average depth of about 4x.
To conserve time and disk-space, our analysis will focus on a small region surrounding the HNF4A gene on chromosome 20. We will first map reads for a single individual (labeled NA20589), combine the results with mapped reads from the other 30 individuals, and generate a list of polymorphic sites.
Mapping Reads to The Genome
There are many different tools for mapping DNA sequence reads. One of the most commonly used tools is BWA, developed by Heng Li and Richard Durbin at the Sanger Center. As with other read mappers, BWA first builds an index of the reference genome and then uses this index to quickly assign each sequence read to a genomic location.
To learn more about BWA, you should visit the BWA website at http://bio-bwa.sourceforge.net
Here, we will simply use BWA to find the most likely sequence location for each read using the bwa aln
command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.
bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/NA20589.fastq.gz > bwa.sai/NA20589.sai
The file NA20589.fastq.gz contains DNA sequence reads for sample NA20589. To conserve disk space, the file has been compressed with gzip but, since fastq is a simple text format, you can easily view the contents of the file using a command like:
zcat NA20589.fastq.gz | more
The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the bwa samse
command and samtools view
and samtoosl sort
commands.
bwa samse ref/human_g1k_v37_chr20.fa bwa.sai/NA20589.sai fastq/NA20589.fastq.gz | \ samtools view -uhS - | samtools sort -m 2000000000 - bams/noheader.NA20589
The result BAM file uses a compact binary format to represent the
alignment of each short read to the genome. You can view the contents
of the file using the samtools view
command, like so:
samtools view bams/noheader.NA20589 | more
Although the current file contains all necessary information about reads and their genomic locations, it is missing some auxiliary information that BAM files typically contain to help describe their contents (for example, to specify that this file contains DNA sequence reads for sample NA20589). So, the very next step is to add this information to the file:
samtools reheader bams/NA20589.header bams/noheader.NA20589.bam > bams/NA20589.bam
If you reached this far, rejoice! The mapping process is almost done. We will now create
and index for the file, which makes it convenient to quickly extract reads from any
genome location. We do this with the samtools index
command, like so:
samtools index bams/NA20589.bam
You can view the contents of the alignment at any location using the samtools view
and samtools tview
commands. While the tview
generates prettier output,
it is not compatible with all screens. For example, to view reads overlapping a 50-bp window
starting at position 43,000,000 on chromosome 20, we could:
samtools tview NA20589 20:43000000-43000100
So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more commonly used format used by many different programs, sorted and indexed the results.
Initial set of variant calls
You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that the next few steps are much simpler.