Changes

Tutorial: GotCloud (view source)

Revision as of 00:34, 21 February 2013

1,468 bytes added, 00:34, 21 February 2013

no edit summary

Line 1: Line 1: −

==~~Installation==~~

+

= GotCloud Tutorial =

+

In this tutorial, we illustrate some of the essential steps in the analysis of next generation sequence data.

−

~~First, make sure GotCloud is installed on your system. Installation instructions [[GotCloud#Setup|here]]~~.

+

We will start with a set of sequence reads and associated base quality scores stored in a FASTQ file.

+

The mapping pipeline will find the most likely genomic location for each read, producing a BAM file.

−

~~==Running the Automatic Test==~~

+

The variant calling pipeline generates an initial list of polymorphic sites and genotypes stored in a VCF file and then uses haplotype information to refine these genotypes in an updated VCF file.

−

~~This will verify whether GotCloud was installed correctly~~.

+

== Example Dataset ==

+

Our dataset consists of 60 individuals from GBR sequenced by the 1000 Genomes Project. These individuals have been sequenced to an average depth of about 4x.

−

*To ~~run~~ the ~~test case~~ for the ~~alignment pipeline automatically, type in~~ the ~~following command:~~

+

To conserve time and disk space, our analysis will focus on a small region of chromosome 20, positions 42900000 to 43200000. We will first map the reads for a single individual (labeled TBD). We will then combine the results with mapped reads from the other 59 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites.

−

~~{ROOT_DIR}/bin/gen_biopipeline~~.~~pl -test OUTPUT_DIR~~

+

The example dataset we'll be using is included in this tar-ball TBD.

−

~~where OUTPUT_DIR is the directory where you want to store the test results. (We will call the directory in which GotCloud is installed "{ROOT_DIR}".)~~

−

~~If you see "Test Passed", then you are ready to align samples.~~

+

== Required Software ==

+

To run this tutorial, you need to make sure GotCloud is installed on your system.

−

*To run the test case for the variant-calling pipeline (UMAKE), change your current directory to GotCloud~~'s root directory, and type in the following command:~~

+

Installation instructions are available [[GotCloud#Setup|here]].

−

~~{ROOT_DIR}/bin/umake.pl -test OUTPUT_DIR~~

−

~~where OUTPUT_DIR~~ is the ~~directory where you want~~ to ~~store~~ the ~~test results~~.

+

==Mapping Reads==

+

The first step in processing next generation sequence data is mapping the reads to the reference genome, generating per-sample BAM files.

−

~~If you see "Test Passed", then you~~ are ~~ready to call variants.~~

+

The mapping pipeline has multiple built-in steps to generate BAMs:

+

# Align the fastqs to the reference genome

+

#* Handles both single-end and paired-end reads

+

# Merge the results from multiple fastqs into 1 file per sample

+

# Mark duplicate reads

+

# Recalibrate Base Qualities

+

This processing results in 1 BAM file per sample.

−

~~==Aligning a Sample==~~

+

The mapping pipeline also includes Quality Control (QC) steps:

+

# Visualization of various quality measures (QPLOT)

+

# Screen for sample contamination & swap (VerifyBamID)

−

~~As an example, we can align~~ the ~~sample fastq files used in the automatic test.~~ ~~They belong to two different samples, which we will call "Sample1" and "Sample2"~~. ~~They are found in {ROOT_DIR}/test/align/fastq~~.

+

Run the mapping pipeline:

+

gen_biopipeline.pl --conf [[GBR60map.conf]] --out_dir mappingResults

−

~~To make this easier~~, ~~change to~~ the ~~{ROOT_DIR}~~/~~test~~/~~align~~ directory. ~~It contains an index file and a configuration file that can be used directly~~.

+

TBD - add link explaining the contents of the .conf & .index files.

+

Upon successful completion of the mapping pipeline, you will see the following message:

+

Commands finished in nn secs with no errors reported
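If the pipeline output has been captured to a file, the success message can be checked for automatically. The following is a minimal sketch, not part of GotCloud: the helper name pipeline_succeeded and the log file name mapping.log are hypothetical, and it assumes the message shown above appears verbatim (apart from the elapsed time) in the captured output.

```shell
# Hypothetical helper (not part of GotCloud): succeed only if the captured
# pipeline output contains the "no errors reported" completion message.
# Assumed usage, with the pipeline output saved to mapping.log beforehand:
#   pipeline_succeeded mapping.log && echo "mapping finished cleanly"
pipeline_succeeded() {
    # Match only the fixed part of the message; the elapsed time varies per run.
    grep -q 'with no errors reported' "$1"
}
```

Matching only the fixed tail of the message avoids guessing how the elapsed time is formatted.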

+

The final BAM files produced by the mapping pipeline can be listed with:

+

ls mappingResults/alignment.recal/*.recal.bam

+

Index files (.bai) for these BAMs are also in that directory.
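To confirm that every BAM has its index, a small loop can compare the two sets of files. This is a sketch under assumptions: the helper name check_bam_indexes is hypothetical, and because the exact index naming is not stated here, it accepts either sample.recal.bam.bai or sample.recal.bai.

```shell
# Hypothetical helper (not part of GotCloud): list recalibrated BAMs in a
# directory that have no accompanying .bai index. Returns non-zero if any
# index is missing; a directory with no BAMs counts as "nothing missing".
check_bam_indexes() {
    local dir="$1" missing=0 bam
    for bam in "$dir"/*.recal.bam; do
        [ -e "$bam" ] || continue   # the glob matched no files
        # Accept either naming convention (assumption): x.bam.bai or x.bai
        if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
            echo "missing index: $bam"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For the output directory used in this tutorial, the call would be check_bam_indexes mappingResults/alignment.recal.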

+

The QC files for verifyBamID are:

+

ls mappingResults/QCFiles/*.genoCheck.selfRG mappingResults/QCFiles/*.genoCheck.selfSM

+

[[Understanding VerifyBamID output]]
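As a quick first look at the contamination results, the FREEMIX column of each .selfSM file can be summarized on the command line. This is a sketch under assumptions: the helper name report_freemix is hypothetical, the files are assumed to be tab-delimited with a header row that names a FREEMIX column, and the 0.02 default cutoff is only an illustrative threshold, not a value taken from this tutorial.

```shell
# Hypothetical helper (not part of GotCloud): print sample ID, FREEMIX value,
# and a flag for every *.selfSM file in a directory.
# Assumes tab-delimited files whose first row is a header containing FREEMIX.
report_freemix() {
    local dir="$1" threshold="${2:-0.02}" f   # 0.02 is an illustrative cutoff
    for f in "$dir"/*.selfSM; do
        [ -e "$f" ] || continue   # the glob matched no files
        awk -F'\t' -v t="$threshold" '
            # Header row: locate the FREEMIX column by name.
            NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FREEMIX") col = i; next }
            # Data rows: flag samples whose FREEMIX exceeds the threshold.
            col { printf "%s\t%s\t%s\n", $1, $col, (($col + 0 > t + 0) ? "CHECK" : "ok") }
        ' "$f"
    done
}
```

For the files listed above, the call would be report_freemix mappingResults/QCFiles, optionally followed by a different cutoff as the second argument.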

+

The QC files for qplot are:

+

ls mappingResults/QCFiles/*.qplot.R mappingResults/QCFiles/*.qplot.stats

+

[[Understanding QPLOT output]]

Mktrost

Administrators

3,045

edits
