Changes

From Genome Analysis Wiki
66 bytes removed, 17:28, 27 January 2012
no edit summary
Line 6: Line 6:  
Test
 
   −
In this workshop, we will illustrate some of the essential steps in the analysis of next generation sequence data. As part of the process, you will learn about many of the file formats commonly used to store next generation sequence data.
+
We will illustrate how TrioCaller works on sequence data that include both trios and unrelated samples. We will start from scratch and walk through all the necessary steps
 +
from raw sequence data to called genotypes. If you are new to sequence data, please take the time to work through every step. If you are experienced, you may skip ahead to the TrioCaller section.
    
We will start with a set of short sequence reads and associated base quality scores (stored in a fastq file), find the most likely genomic location for each read (producing a BAM file), generate an initial list of polymorphic sites and genotypes (stored in a VCF file) and use haplotype information to refine these genotypes (resulting in an updated VCF file).
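As a rough illustration of the first input mentioned above, the following Python sketch reads a fastq stream and converts each base quality string into numeric scores. It is not part of the workshop materials: the function name `read_fastq` and the example record are hypothetical, and it assumes the common four-line record layout with Phred+33 quality encoding.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, scores) for each four-line fastq record.

    Assumes unwrapped records (exactly four lines each) and
    Phred+33 quality encoding; both are assumptions of this sketch.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, not needed here
        quality = handle.readline().rstrip("\n")
        # Each quality character encodes one score as ASCII code minus 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Hypothetical single-read example, held in memory instead of a file.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
```

Each tuple in `records` pairs a read name with its bases and per-base quality scores, which is the information a read mapper uses when it produces a BAM file.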
 
Line 12: Line 13:  
== Example Dataset ==
 
   −
Our dataset consists of 31 individuals from Tuscany (in Italy) sequenced by the [http://www.1000genomes.org 1000 Genomes Project]. As with other 1000 Genomes Project samples, these individuals have been sequenced to an average depth of about 4x.
+
Our dataset consists of 40 individuals, which have been sequenced at an average depth of about 4x.
   −
To conserve time and disk-space, our analysis will focus on a small region surrounding the HNF4A gene on chromosome 20. We will first map reads for a single individual (labeled NA20589), combine the results with mapped reads from the other 30 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.
+
To conserve time and disk space, our analysis will focus on a small region on chromosome 20. We will first map reads for a single individual (labeled NA20589), then combine the results with mapped reads from the other 39 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.
   −
The example dataset we'll be using is included in this tar-ball [http://www.sph.umich.edu/csg/abecasis/downloads/lowPassWorkshop-2012-01-23.tar.gz lowPassWorkshop-2012-01-23.tar.gz].
+
The example dataset we'll be using is included in this tar-ball [http://www.sph.umich.edu/csg/abecasis/downloads/TrioCaller-2012-01-28.tar.gz TrioCaller-2012-01-28.tar.gz].
    
=== Required Software ===
 