Changes

From Genome Analysis Wiki
1,468 bytes added, 00:34, 21 February 2013
no edit summary
Line 1: Line 1: −
==Installation==
+
= GotCloud Tutorial =
 +
In this tutorial, we illustrate some of the essential steps in the analysis of next generation sequence data.
   −
First, make sure GotCloud is installed on your system.  Installation instructions [[GotCloud#Setup|here]].
+
We will start with a set of sequence reads and associated base quality scores stored in a fastq file.
    +
The mapping pipeline will find the most likely genomic location for each read producing a BAM file.
   −
==Running the Automatic Test==
+
The variant calling pipeline generates an initial list of polymorphic sites and genotypes stored in a VCF file and then uses haplotype information to refine these genotypes in an updated VCF file.
   −
This will verify whether GotCloud was installed correctly.
+
== Example Dataset ==
 +
Our dataset consists of 60 individuals from GBR sequenced by the 1000 Genomes Project. These individuals have been sequenced to an average depth of about 4x.
   −
*To run the test case for the alignment pipeline automatically, type in the following command:
+
To conserve time and disk space, our analysis will focus on a small region of chromosome 20, positions 42900000 to 43200000. We will first map the reads for a single individual (labeled TBD).  We will then combine the results with mapped reads from the other 59 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.
   −
{ROOT_DIR}/bin/gen_biopipeline.pl -test OUTPUT_DIR
+
The example dataset we'll be using is included in this tar-ball TBD.
   −
where OUTPUT_DIR is the directory where you want to store the test results.  (We will call the directory in which GotCloud is installed "{ROOT_DIR}".)
     −
If you see "Test Passed", then you are ready to align samples.
+
== Required Software ==
    +
To run this tutorial, make sure GotCloud is installed on your system.
   −
*To run the test case for the variant-calling pipeline (UMAKE), change your current directory to GotCloud's root directory, and type in the following command:
+
Installation instructions are [[GotCloud#Setup|here]].
   −
{ROOT_DIR}/bin/umake.pl -test OUTPUT_DIR
     −
where OUTPUT_DIR is the directory where you want to store the test results.
+
==Mapping Reads==
 +
The first step in processing next generation sequence data is mapping the reads to the reference genome, generating per-sample BAM files.
   −
If you see "Test Passed", then you are ready to call variants.
+
The mapping pipeline has multiple built-in steps to generate BAMs:
 +
# Align the fastqs to the reference genome
 +
#* Handles both single-end and paired-end reads
 +
# Merge the results from multiple fastqs into 1 file per sample
 +
# Mark duplicate reads
 +
# Recalibrate Base Qualities
    +
This processing results in 1 BAM file per sample.
   −
==Aligning a Sample==
+
The mapping pipeline also includes Quality Control (QC) steps:
 +
# Visualization of various quality measures (QPLOT)
 +
# Screen for sample contamination and sample swaps (VerifyBamID)
   −
As an example, we can align the sample fastq files used in the automatic test. They belong to two different samples, which we will call "Sample1" and "Sample2". They are found in {ROOT_DIR}/test/align/fastq.
+
Run the mapping pipeline:
 +
  gen_biopipeline.pl --conf [[GBR60map.conf]] --out_dir mappingResults
   −
To make this easier, change to the {ROOT_DIR}/test/align directory. It contains an index file and a configuration file that can be used directly.
+
TBD - add link explaining the contents of the .conf & .index files.
 +
 
 +
Upon successful completion of the mapping pipeline, you will see the following message:
 +
Commands finished in nn secs with no errors reported
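
A minimal sketch of how the check for this message could be scripted, for example when the pipeline is run unattended. The helper name <code>mapping_succeeded</code> and the log file name <code>mapping.log</code> are hypothetical, and the sketch assumes the completion message is written to the terminal output that you capture in the log.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the given log file contains the
# completion message quoted above ("nn" is matched loosely as any text).
mapping_succeeded() {
    grep -Eq 'Commands finished in .* secs with no errors reported' "$1"
}

# Illustrative use (assumes gen_biopipeline.pl is on your PATH and that
# mapping.log is a file name of your choosing):
#   gen_biopipeline.pl --conf GBR60map.conf --out_dir mappingResults 2>&1 | tee mapping.log
#   mapping_succeeded mapping.log && echo "mapping finished without reported errors"
```

Matching on the message text rather than on the exit status is a design choice for this sketch only, since the tutorial documents the message but not the exit code.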
 +
 
 +
The final BAM files produced by the mapping pipeline can be listed with:
 +
ls mappingResults/alignment.recal/*.recal.bam
 +
 
 +
Index files (.bai) for these BAMs are also in that directory.
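
As a quick sanity check, the following sketch lists any final BAM that has no index next to it. The helper name <code>missing_bam_indexes</code> is hypothetical, and the index naming is an assumption: the sketch accepts either <code>name.recal.bam.bai</code> or <code>name.recal.bai</code>, since the tutorial only states that the .bai files are in the same directory.

```shell
#!/bin/sh
# Hypothetical helper: print every *.recal.bam under <out_dir>/alignment.recal
# that has no .bai index beside it.
# Assumption: the index is named either <bam>.bai or <bam without .bam>.bai.
missing_bam_indexes() {
    for bam in "$1"/alignment.recal/*.recal.bam; do
        [ -e "$bam" ] || continue   # the glob matched nothing
        if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
            echo "$bam"
        fi
    done
}

# Illustrative use:
#   missing_bam_indexes mappingResults
```

An empty result means every BAM found has an index under one of the two assumed names.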
 +
 
 +
The QC files for verifyBamID are:
 +
ls mappingResults/QCFiles/*.genoCheck.selfRG mappingResults/QCFiles/*.genoCheck.selfSM
 +
 
 +
[[Understanding VerifyBamID output]]
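
For a first look at the .selfSM files, the sketch below prints the sample identifier and the FREEMIX value (VerifyBamID's sequence-only contamination estimate) from each file. The helper name <code>print_freemix</code> is hypothetical. The sketch assumes the files are tab-delimited, that the first line is a header containing a column named <code>FREEMIX</code>, and that the first column identifies the sample; check these assumptions against your own output before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: for each .selfSM file given, print "<first column><TAB><FREEMIX>".
# The FREEMIX column is located by its header name, not by a fixed position.
print_freemix() {
    awk -F '\t' '
        FNR == 1 {                       # header line of each file
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "FREEMIX") col = i
            next
        }
        col { print $1 "\t" $col }       # data rows, only if the column was found
    ' "$@"
}

# Illustrative use:
#   print_freemix mappingResults/QCFiles/*.genoCheck.selfSM
```

Files without a FREEMIX header are skipped silently, which keeps the output limited to values that were actually found.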
 +
 
 +
The QC files for qplot are:
 +
ls mappingResults/QCFiles/*.qplot.R mappingResults/QCFiles/*.qplot.stats
 +
 
 +
[[Understanding QPLOT output]]
     
