Line 1: |
Line 1: |
− | ==Installation== | + | = GotCloud Tutorial = |
| + | In this tutorial, we illustrate some of the essential steps in the analysis of next generation sequence data. |
| | | |
− | First, make sure GotCloud is installed on your system. Installation instructions [[GotCloud#Setup|here]].
| + | We will start with a set of sequence reads and associated base quality scores stored in a fastq file. |
| | | |
| + | The mapping pipeline will find the most likely genomic location for each read producing a BAM file. |
| | | |
− | ==Running the Automatic Test==
| + | The variant calling pipeline generates an initial list of polymorphic sites and genotypes stored in a VCF file and then uses haplotype information to refine these genotypes in an updated VCF file. |
| | | |
− | This will verify whether GotCloud was installed correctly.
| + | == Example Dataset == |
| + | Our dataset consists of 60 individuals from GBR sequenced by the 1000 Genomes Project. These individuals have been sequenced to an average depth of about 4x. |
| | | |
− | *To run the test case for the alignment pipeline automatically, type in the following command:
| + | To conserve time and disk space, our analysis will focus on a small region of chromosome 20, positions 42900000 - 43200000. We will first map the reads for a single individual (labeled TBD). We will then combine the results with mapped reads from the other 59 individuals to generate a list of polymorphic sites and estimate accurate genotypes at each of these sites. |
| | | |
− | {ROOT_DIR}/bin/gen_biopipeline.pl -test OUTPUT_DIR
| + | The example dataset we'll be using is included in this tar-ball TBD. |
| | | |
− | where OUTPUT_DIR is the directory where you want to store the test results. (We will call the directory in which GotCloud is installed "{ROOT_DIR}".)
| |
| | | |
− | If you see "Test Passed", then you are ready to align samples.
| + | == Required Software == |
| | | |
| + | To run this tutorial, make sure GotCloud is installed on your system. |
| | | |
− | *To run the test case for the variant-calling pipeline (UMAKE), change your current directory to GotCloud's root directory, and type in the following command:
| + | Installation instructions [[GotCloud#Setup|here]]. |
| | | |
− | {ROOT_DIR}/bin/umake.pl -test OUTPUT_DIR
| |
| | | |
− | where OUTPUT_DIR is the directory where you want to store the test results.
| + | ==Mapping Reads== |
| + | The first step in processing next generation sequence data is mapping the reads to the reference genome, generating per-sample BAM files. |
| | | |
− | If you see "Test Passed", then you are ready to call variants.
| + | The mapping pipeline has multiple built-in steps to generate BAMs: |
| + | # Align the fastqs to the reference genome |
| + | #* handles both single-end and paired-end reads |
| + | # Merge the results from multiple fastqs into 1 file per sample |
| + | # Mark Duplicate Reads |
| + | # Recalibrate Base Qualities |
| | | |
| + | This processing results in 1 BAM file per sample. |
| | | |
− | ==Aligning a Sample==
| + | The mapping pipeline also includes Quality Control (QC) steps: |
| + | # Visualization of various quality measures (QPLOT) |
| + | # Screen for sample contamination & swap (VerifyBamID) |
| | | |
− | As an example, we can align the sample fastq files used in the automatic test. They belong to two different samples, which we will call "Sample1" and "Sample2". They are found in {ROOT_DIR}/test/align/fastq.
| + | Run the mapping pipeline: |
| + | gen_biopipeline.pl --conf [[GBR60map.conf]] --out_dir mappingResults |
| | | |
− | To make this easier, change to the {ROOT_DIR}/test/align directory. It contains an index file and a configuration file that can be used directly.
| + | TBD - add link explaining the contents of the .conf & .index files. |
| + | |
| + | Upon successful completion of the mapping pipeline, you will see the following message: |
| + | Commands finished in nn secs with no errors reported |
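If the pipeline output has been captured to a file, the completion message can be checked with a small helper. This is a minimal sketch, not part of GotCloud: the function name `pipeline_ok` and the log file name `mapping.log` are hypothetical, and it assumes that `nn` in the message above stands for a whole number of seconds.

```shell
# Hypothetical helper (not part of GotCloud).
# Succeeds if the captured pipeline output contains the completion message.
# Assumption: "nn" in the documented message is a whole number of seconds.
pipeline_ok() {
    # $1: path to a file holding the captured pipeline output
    grep -Eq 'Commands finished in [0-9]+ secs with no errors reported' "$1"
}

# Example use, assuming the output was saved to mapping.log (hypothetical name):
#   pipeline_ok mapping.log && echo "mapping finished cleanly"
```

The check only looks for the success message; it does not try to interpret any error text.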
| + | |
| + | The final BAM files produced by the mapping pipeline can be listed with: |
| + | ls mappingResults/alignment.recal/*.recal.bam |
| + | |
| + | Index files (.bai) for these BAMs are also in that directory. |
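To confirm that every BAM really has an index beside it, a loop like the following can help. It is a sketch under assumptions: the function name `missing_bam_indexes` is hypothetical, and the index is assumed to be named either `X.recal.bam.bai` or `X.recal.bai`; adjust this if your output uses a different convention.

```shell
# Hypothetical helper (not part of GotCloud).
# Prints every *.recal.bam in a directory that has no .bai index next to it.
# Assumption: the index is named X.recal.bam.bai or X.recal.bai.
missing_bam_indexes() {
    dir=$1
    for bam in "$dir"/*.recal.bam; do
        [ -e "$bam" ] || continue   # the glob matched nothing
        if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
            echo "$bam"
        fi
    done
}

# Example use:
#   missing_bam_indexes mappingResults/alignment.recal
```

No output means that every BAM found in the directory has an index.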
| + | |
| + | The QC files for VerifyBamID are: |
| + | ls mappingResults/QCFiles/*.genoCheck.selfRG mappingResults/QCFiles/*.genoCheck.selfSM |
| + | |
| + | [[Understanding VerifyBamID output]] |
| + | |
| + | The QC files for QPLOT are: |
| + | ls mappingResults/QCFiles/*.qplot.R mappingResults/QCFiles/*.qplot.stats |
| + | |
| + | [[Understanding QPLOT output]] |
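A quick way to see whether each kind of QC file was produced is to count the files per suffix. The sketch below uses only the four suffixes listed above; the function name `qc_file_counts` is hypothetical, and how many files to expect per sample is not stated here, so the counts are for inspection only.

```shell
# Hypothetical helper (not part of GotCloud).
# Prints "<suffix> <count>" for each QC file suffix named in this tutorial.
qc_file_counts() {
    dir=$1
    for suffix in genoCheck.selfRG genoCheck.selfSM qplot.R qplot.stats; do
        n=0
        for f in "$dir"/*."$suffix"; do
            [ -e "$f" ] && n=$((n + 1))   # skip the unexpanded glob pattern
        done
        echo "$suffix $n"
    done
}

# Example use:
#   qc_file_counts mappingResults/QCFiles
```

A count of 0 for any suffix suggests that the corresponding QC step did not write its output.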
| | | |
| | | |