Biostatistics 815 Term Project

BIOSTAT815 Fall 2012 Term Project

Overview

The BIOSTAT815 term project encourages students to develop the computational concepts, skills, and experience needed to carry out their own research in the statistical analysis of large data sets. Choosing from the projects suggested below, students may team up with another student to (1) implement a software package, (2) present their achievements in class, and (3) write a brief report on those achievements. Students are encouraged to pick projects closely related to their own research.

Timeline

  • Pick your own project by the end of September
  • You may change the topic later
  • Present your results in the Dec 6th class
  • Submit your report and full software package in the Dec 20th class

Expected Outcomes

  • Biweekly progress report (by email or in person) to the instructor.
  • A software package (C++ standalone or R/C++ combined package) containing source code, binary, and example data for testing
  • A 15-minute presentation (+5 min discussion) in class (Dec 6)
  • A brief report (fewer than 5 pages, as a Google document) summarizing and highlighting the implementation and the analysis results on real data, due Dec 20.

Suggested Projects

Suggest your own project

If you have an open research question that involves computational challenges, you are welcome to suggest your own project. Discuss it with the instructor to determine the scope of the term project.

Re-implement a published method

If there is a published method of interest in your research area, you may want to re-implement it yourself to understand the method better, to tweak its parameters or details as you need, or to overcome certain limitations of existing implementations. You will need to discuss with the instructor whether your goal is sufficient and reasonable for an 815 project.

Digit Recognizer

Classification of handwritten digits is a widely used benchmark problem for evaluating statistical inference methods. You may use one or more numerical methods covered in class (possibly combined with other methods) to develop a method for classifying the images.

  • Input Data : Training and test data at http://www.kaggle.com/c/digit-recognizer
  • Methods : Use one or more methods covered in class, or a more advanced method (requires discussion with the instructor)
  • You need to submit your classification results on the test data to kaggle.com and report the performance. (The rank itself is not very important; the quality of the model and implementation matters more.)
  • Implementation : C++ standalone, or R/C++ combined package.
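
As a starting point for the classification task above, the following is a minimal sketch of a 1-nearest-neighbor baseline in C++. It is not one of the methods prescribed for the project; it is swapped in here only as a simple reference to compare a real model against. The names LabeledImage, squaredDistance, classifyNearest, and accuracy are hypothetical, and the sketch assumes each image has already been loaded into a flat vector of pixel intensities.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// One labeled image: a digit label plus a flattened vector of pixel
// intensities. (Hypothetical type; the data loading step is not shown.)
struct LabeledImage {
    int label;
    std::vector<double> pixels;
};

// Squared Euclidean distance between two pixel vectors of equal length.
double squaredDistance(const std::vector<double>& a,
                       const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Return the label of the training image closest to the query
// (1-nearest neighbor; ties keep the earliest training image).
int classifyNearest(const std::vector<LabeledImage>& training,
                    const std::vector<double>& query) {
    assert(!training.empty());
    double best = std::numeric_limits<double>::infinity();
    int bestLabel = training.front().label;
    for (const LabeledImage& img : training) {
        const double d = squaredDistance(img.pixels, query);
        if (d < best) {
            best = d;
            bestLabel = img.label;
        }
    }
    return bestLabel;
}

// Fraction of held-out images whose predicted label matches the true label.
double accuracy(const std::vector<LabeledImage>& training,
                const std::vector<LabeledImage>& heldOut) {
    assert(!heldOut.empty());
    std::size_t correct = 0;
    for (const LabeledImage& img : heldOut) {
        if (classifyNearest(training, img.pixels) == img.label) ++correct;
    }
    return static_cast<double>(correct) / static_cast<double>(heldOut.size());
}
```

Holding out part of the labeled training data and calling accuracy on it gives a local performance estimate before submitting predictions. A linear scan over all training images for every query is slow on the full data set, so this is only meant as a correctness baseline.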

Haplotype Phasing

Deconvoluting the genetic variation data of a diploid genome into its two haploid sequences is statistically challenging and scientifically important. Although sophisticated algorithms already exist, haplotype phasing is a great problem for understanding the concept of hidden Markov models, particularly for those who are interested in genetics.

  • Input Data : You may pick one dataset from http://www.sph.umich.edu/csg/abecasis/MACH/download/
  • Evaluation : The input data is already phased. You are expected to start from the input data while ignoring the phase information, and to compare your phased outcome with the original input data.
  • Methods : Use a hidden Markov model or more advanced methods (e.g. conditional random fields)
  • Implementation : C++ standalone
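
The evaluation step above can be sketched as follows. This page does not prescribe a comparison metric; counting switch errors is one common choice and is used here as an assumption. The function names toGenotypes and countSwitchErrors are hypothetical, alleles are assumed to be coded 0/1, and the estimated haplotype is assumed to be consistent with the unphased genotypes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Discard phase: turn two haplotypes (alleles coded 0/1) into unphased
// genotypes 0, 1, or 2 (the count of allele 1 at each site).
std::vector<int> toGenotypes(const std::vector<int>& h1,
                             const std::vector<int>& h2) {
    assert(h1.size() == h2.size());
    std::vector<int> genotypes(h1.size());
    for (std::size_t i = 0; i < h1.size(); ++i) genotypes[i] = h1[i] + h2[i];
    return genotypes;
}

// Count switch errors between the true haplotype pair and one estimated
// haplotype (its partner is implied by the genotypes).
// Only heterozygous sites carry phase information. At each such site the
// estimate either matches the first true haplotype (orientation 0) or the
// second (orientation 1); every change of orientation between consecutive
// heterozygous sites counts as one switch error.
int countSwitchErrors(const std::vector<int>& trueH1,
                      const std::vector<int>& trueH2,
                      const std::vector<int>& estH1) {
    assert(trueH1.size() == trueH2.size() && trueH1.size() == estH1.size());
    int switches = 0;
    int previous = -1;  // orientation at the last heterozygous site, -1 if none
    for (std::size_t i = 0; i < trueH1.size(); ++i) {
        if (trueH1[i] == trueH2[i]) continue;  // homozygous: skip
        const int orientation = (estH1[i] == trueH1[i]) ? 0 : 1;
        if (previous != -1 && orientation != previous) ++switches;
        previous = orientation;
    }
    return switches;
}
```

A phasing program would take the output of toGenotypes as its input and its result would then be scored with countSwitchErrors. Because the labeling of the two haplotypes is arbitrary, an estimate that is the exact mirror image of the truth scores zero switch errors under this definition.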

Computing multivariate normal rectangle probability

Computing a multi-dimensional integral of a multivariate normal distribution is a computationally important problem in many statistical methods that handle high-dimensional data. The mvtnorm package (http://cran.r-project.org/web/packages/mvtnorm/index.html) has a good implementation for calculating the multivariate normal rectangle probability, but it is not perfect.

  • Input data : You need to develop equivalents of the pmvnorm and qmvnorm R functions (with the core implemented in C++), similar to those in http://cran.r-project.org/web/packages/mvtnorm/index.html
  • Methods : Use at least one of the methods covered in class (e.g. MCMC, importance sampling, etc.)
  • Implementation : R/C++ combined package.
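
To make the target quantity concrete, the following is a minimal sketch of a plain Monte Carlo estimate of a rectangle probability P(lower <= X <= upper) for X ~ N(mean, sigma). Plain Monte Carlo is a naive baseline, not the algorithm used by mvtnorm and not a substitute for the methods listed above; it is mainly useful for checking a more refined implementation. The names Matrix, choleskyLower, and mvnRectangleProb are hypothetical, and sigma is assumed to be symmetric positive definite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Lower-triangular Cholesky factor L such that sigma = L * L^T.
// Assumes sigma is symmetric positive definite.
Matrix choleskyLower(const Matrix& sigma) {
    const std::size_t n = sigma.size();
    Matrix L(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j <= i; ++j) {
            double s = sigma[i][j];
            for (std::size_t k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            if (i == j) {
                assert(s > 0.0);  // fails if sigma is not positive definite
                L[i][i] = std::sqrt(s);
            } else {
                L[i][j] = s / L[j][j];
            }
        }
    }
    return L;
}

// Plain Monte Carlo estimate of P(lower <= X <= upper), X ~ N(mean, sigma).
// Each draw is x = mean + L * z with z a vector of independent N(0, 1).
double mvnRectangleProb(const std::vector<double>& mean, const Matrix& sigma,
                        const std::vector<double>& lower,
                        const std::vector<double>& upper,
                        std::size_t numSamples, unsigned seed) {
    const std::size_t n = mean.size();
    assert(sigma.size() == n && lower.size() == n && upper.size() == n);
    assert(numSamples > 0);
    const Matrix L = choleskyLower(sigma);
    std::mt19937_64 rng(seed);
    std::normal_distribution<double> stdNormal(0.0, 1.0);
    std::vector<double> z(n);
    std::size_t hits = 0;
    for (std::size_t s = 0; s < numSamples; ++s) {
        for (double& zi : z) zi = stdNormal(rng);
        bool inside = true;
        for (std::size_t i = 0; i < n && inside; ++i) {
            double x = mean[i];
            for (std::size_t k = 0; k <= i; ++k) x += L[i][k] * z[k];
            inside = (x >= lower[i] && x <= upper[i]);
        }
        if (inside) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(numSamples);
}
```

Infinite bounds can be passed as plus or minus std::numeric_limits<double>::infinity(). The standard error of this estimator shrinks only as one over the square root of numSamples, so small tail probabilities need very many draws; that inefficiency is what importance sampling and related methods aim to reduce.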