Difference between revisions of "TrioCaller"

Revision as of 13:17, 27 January 2012

TrioCaller

Test

In this workshop, we will illustrate some of the essential steps in the analysis of next generation sequence data. As part of the process, you will learn about many of the file formats commonly used to store next generation sequence data.

We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).

Example Dataset

Our dataset consists of 31 individuals from Tuscany (in Italy) sequenced by the 1000 Genomes Project. As with other 1000 Genomes Project samples, these individuals have been sequenced to an average depth of about 4x.

To conserve time and disk-space, our analysis will focus on a small region surrounding the HNF4A gene on chromosome 20. We will first map reads for a single individual (labeled NA20589), combine the results with mapped reads from the other 30 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.

The example dataset we'll be using is included in this tar-ball lowPassWorkshop-2012-01-23.tar.gz.

Required Software

In order to run this exercise, you should have BWA (available from Sourceforge), samtools (also from Sourceforge) and thunderVCF (included in the UMAKE pipeline download) installed.

Building an Index for Short Read Alignment

To quickly place short reads along the genome, BWA and other read mappers typically build a word index for the genome. This index lists the location of particular short words along the genome and can be used to seed and then extend particular matches.

The sequence index is typically not compatible across different BWA versions. To rebuild the sequence index, issue the following commands:

 rm ref/human_g1k_v37_chr20.fa.*
 bwa index -a is ref/human_g1k_v37_chr20.fa

Mapping Reads to The Genome

There are many different tools for mapping DNA sequence reads. One of the most commonly used tools is BWA, developed by Heng Li and Richard Durbin at the Sanger Center. As with other read mappers, BWA first builds an index of the reference genome and then uses this index to quickly assign each sequence read to a genomic location.

To learn more about BWA, you should visit the BWA website at http://bio-bwa.sourceforge.net

Here, we will simply use BWA to find the most likely sequence location for each read using the bwa aln command. This command requires two parameters, one corresponding to the reference genome, the other corresponding to a fastq file containing reads to be mapped.

 bwa aln -q 15 ref/human_g1k_v37_chr20.fa fastq/NA20589.fastq.gz > bwa.sai/NA20589.sai

The file NA20589.fastq.gz contains DNA sequence reads for sample NA20589. To conserve disk space, the file has been compressed with gzip but, since fastq is a simple text format, you can easily view the contents of the file using a command like:

 zcat NA20589.fastq.gz | more

A fastq file consists of a series of multi-line records. Each record starts with a read name, followed by a DNA sequencing, a separator line, and a set of per base quality scores. Base quality scores estimate the probability of error at each sequenced base (a base quality of 10 denotes an error probability of 10%, base quality 20 denotes 1% error probability and base quality 30 denotes 0.1% error probability). These error probabilities are each encoded in a single character (for compactness) and can be decoded using an [1] - you should look up the ascii code for each base and subtract 33 to get base quality. By inspecting the FastQ file you should be able to learn about the length of reads being mapped and their base qualities (is base quality typically higher at the start or end of each read).

Converting Alignments to BAM format

The .sai alignment format is specific to BWA, so the first thing to do is to convert the alignment to a more standard format that will be compatible with downstream analysis tools. We can do this with a combination of the bwa samse command and samtools view and samtoosl sort commands.

 bwa samse ref/human_g1k_v37_chr20.fa bwa.sai/NA20589.sai fastq/NA20589.fastq.gz | \
     samtools view -uhS - | samtools sort -m 2000000000 - bams/NA20589

The result BAM file uses a compact binary format to represent the alignment of each short read to the genome. You can view the contents of the file using the samtools view command, like so:

 samtools view bams/NA20589.bam | more

The text representation of the alignemt produced by samtools view describes the alignment of one read per line. The most interesting fields are column 1 (the read name), columns 3 and 4 (the alignment position), column 5 (the CIGAR string, describing any gaps in the alignment), and columns 10 and 11 (with the sequence and quality score). In this representation, all alignments are automatically converted to the forward strand.

Indexing the BAM file

If you reached this far, rejoice! The mapping process is almost done. We will now create and index for the file, which makes it convenient to quickly extract reads from any genome location. We do this with the samtools index command, like so:

 samtools index bams/NA20589.bam

Browsing Alignment Results

You can view the contents of the alignment at any location using the samtools view and samtools tview commands. While the tview generates prettier output, it is not compatible with all screens. For example, to view reads overlapping starting at position 43,000,000 on chromosome 20, we could run:

  samtools tview bams/NA20589.bam ref/human_g1k_v37_chr20.fa

Then, type "g 20:43000000"

So let's recap: we have mapped reads to genome, converted them from a BWA specific format to a more commonly used format used by many different programs, sorted and indexed the results.

In most cases, the next step would be to remove duplicate reads and to ensure that base quality scores are properly calibrated. To save time, we'll skip those steps now.

Initial set of variant calls

You probably thought the initial mapping process was quite convoluted ... you'll be glad to know that the next few steps are much simpler.

The first thing we'll do is use samtools to generate an initial list of variant sites, using the mpileup command. This command looks at the bases aligned to each location and flags locations taht are likely to vary. By default, the results are stored in BCF file, which can be converted into the more widely used VCF format using bcftools (a companion set of tools distributed with samtools).

  samtools mpileup -r 20:42000000-44000000 -Iuf ref/human_g1k_v37_chr20.fa bams/NA*bam | bcftools view -bvcg - > mpileup/chr20.bcf

  bcftools view mpileup/chr20.bcf  | sed s/AF1/AF/g | grep -v AF=1 > mpileup/chr20.vcf

The VCF format is a simple text format. It starts with several header lines, which all start with the two '##' characters, and is followed by a single line per marker that provides both summary information about the marker and genotypes for each individual. You can review the contents of the VCF file using the 'more' command:

  more mpileup/chr20.vcf

Here are some questions for you to investigate:

How many variant sites were detected in this dataset? Try a command like this one:

 grep -vE ^# chr20.vcf | wc -l

(The grep command line excludes all lines beginning with # and then the wc command counts the number of lines in the file).

Genotype Refinement Using Linkage Disequilibrium Information

The initial set of genotype calls is generated examining a single individual at a time. These calls are typically quite good for deep sequencing data, but much less accurate for low pass sequence data. In either case, they can be greatly improved by models that combine information across sites and individuals.

Here is how that might work:

  thunderVCF --shotgun mpileup/chr20.vcf --phase --dosage -r 10 --prefix thunder/chr20.thunder

Again, you can review the contents of the updated VCF file using the more command:

  more thunder/chr20.vcf

Difference between revisions of "TrioCaller"

Revision as of 13:17, 27 January 2012

Contents

TrioCaller

Example Dataset

Required Software

Building an Index for Short Read Alignment

Mapping Reads to The Genome

Converting Alignments to BAM format

Indexing the BAM file

Browsing Alignment Results

Initial set of variant calls

Genotype Refinement Using Linkage Disequilibrium Information

Navigation menu

Page actions

Page actions

Personal tools

quick links

teaching

Navigation

Search

Tools

Revision as of 13:15, 27 January 2012 (view source) Weich (talk \| contribs) ← Older edit	Revision as of 13:17, 27 January 2012 (view source) Weich (talk \| contribs) m (moved User talk:Weich to TrioCaller) Newer edit →
(No difference)