Biostatistics 615/815 Fall 2011
Objective
In Fall 2011, Biostatistics 615/815 aims for providing students with a practical understanding of computational aspects in implementing statistical methods. Although C++ language will be used throughout the course, using Java programming language for homework and project will be acceptable.
Target Audience
Students in Biostatistics 615 should be comfortable with simple algebra and basic statistics including probability distribution, linear model, and hypothesis testing. Previous experience in programming is not required, but those who do not have previous programming experience should expect to spend additional time studying and learning to be familiar with a programming language during the coursework. Most students registering for the course are Masters or Doctoral students in Biostatistics, Statistics, Bioinformatics or Human Genetics.
Students in Biostatistics 815 should be familiar with programming languages so that they can complete the class project tackling an advanced statistical problem during the semester. Project will be carried out in teams of 2. The details of the possible projects will be announced soon.
Textbook
- Recommended Textbook : Cormen, Leiserson, Rivest, and Stein, "Introduction to Algorithms", Third Edition, The MIT Press, 2009 [Official Book Web Site]
- Optional Textbook : Press, Teukolsky, Vetterling, Flannery, "Numerical Recipes", 3rd Edition, Cambridge University Press, 2007 [Official Book Web Site]
Class Schedule
Classes are scheduled for Tuesday and Thursdays, 8:30 - 10:00 am at SPH II M4332
Topics
The following contents are planned to be covered.
Part I : C++ Basics and Introductory Algorithms
- Computational Time Complexity
- Sorting
- Divide and Conquer Algorithms
- Searching
- Key Data Structure
- Dynamic Programming
- Hidden Markov Models
Part II : Numerical Methods and Randomized Algorithms
- Random Numbers
- Matrix Operations and Least Square Methods
- Importance Sampling
- Expectation Maximization
- Markov-Chain Monte Carlo Methods
- Simulated Annealing
- Gibbs Sampling
Class Notes
- Lecture 1 : Introduction to Statistical Computing -- (Handout PDF) (Presentation PDF)
- Lecture 2 : Introduction to C++ Programming -- (Handout PDF) (Presentation PDF)
- Lecture 3 : C++ Basics and Fisher's Exact Test -- (Handout PDF) (Presentation PDF)
- Lecture 4 : Standard Template Libraries + Divide and Conquer Algorithms-- (Handout PDF) (Presentation PDF)
- Lecture 5 : Sorting Algorithms -- (Handout PDF) (Presentation PDF)
- Lecture 6 : Sorting Algorithms & Arrays -- (Handout PDF) (Presentation PDF)
- Lecture 7 : List and Binary Search Trees -- (Handout PDF) (Presentation PDF)
- Lecture 8 : Hash Tables -- (Handout PDF) (Presentation PDF)
- Lecture 9 : Dynamic Programming -- (Handout PDF) (Presentation PDF)
- Review : Dynamic Programming & Midterm Review -- (PDF)
- Lecture 10 : Hidden Markov Model-- (Handout PDF) (Presentation PDF)
- Lecture 11 : Hidden Markov Model (cont'd) -- (Handout PDF) (Presentation PDF)
Problem Sets
- Problem Set 0 - Running screenshots of helloWorld.cpp and towerOfHanoi.cpp - Due before the submission of Problem Set 1
- Problem Set 1 -- Due on Tuesday September 27th, 2011 (PDF) (PDF-SOLUTIONS)
- Problem Set 2 -- Due on Thursday October 6th, 2011 (PDF) (PDF-SOLUTIONS)
- (Example data - shuf-1M.txt.gz) 1,000,000 randomly shuffled data (gzipped)
- (Example data - Rand-1M-3digits.txt.gz) 1,000,000 random data from 1 to 1,000]] (gzipped)
- (Example data - Rand-50k.txt.gz) 50,000 random data from 1 to 1,000,000)]] (gzippd)
- (Update Oct 2, 2011 : Note that the problem 1 and 3 are slightly updated for clarification)
- (If you can't decompress the files above properly, use this alternative link by CLICKING HERE )
- Problem Set 3 -- Due on Tuesday November 1st, 2011 (PDF)
- Example input data for problem 3-2
TIME TOSS Pr(F) Pr(HB) Pr(TB) MLSTATE 1 T 0.9758 0.0068 0.0174 FAIR 2 H 0.9640 0.0312 0.0048 FAIR 3 H 0.9584 0.0341 0.0075 FAIR 4 T 0.9504 0.0091 0.0406 FAIR 5 T 0.9444 0.0118 0.0438 FAIR 6 H 0.9313 0.0582 0.0105 FAIR 7 H 0.9216 0.0663 0.0121 FAIR 8 T 0.9068 0.0358 0.0574 FAIR 9 H 0.8794 0.0672 0.0534 FAIR 10 T 0.8124 0.0316 0.1560 FAIR 11 T 0.7474 0.0699 0.1827 FAIR 12 H 0.5663 0.4101 0.0236 HEAD-BIASED 13 H 0.4432 0.5512 0.0056 HEAD-BIASED 14 H 0.3642 0.6325 0.0032 HEAD-BIASED 15 H 0.3164 0.6809 0.0027 HEAD-BIASED 16 H 0.2911 0.7063 0.0026 HEAD-BIASED 17 H 0.2840 0.7134 0.0026 HEAD-BIASED 18 H 0.2937 0.7033 0.0031 HEAD-BIASED 19 H 0.3215 0.6714 0.0071 HEAD-BIASED 20 H 0.3699 0.5879 0.0422 HEAD-BIASED 21 T 0.4269 0.2127 0.3604 TAIL-BASED 22 T 0.4257 0.2133 0.3610 TAIL-BASED 23 H 0.3642 0.5936 0.0422 HEAD-BIASED 24 H 0.3129 0.6800 0.0071 HEAD-BIASED 25 H 0.2828 0.7141 0.0031 HEAD-BIASED 26 H 0.2709 0.7263 0.0028 HEAD-BIASED 27 H 0.2751 0.7203 0.0046 HEAD-BIASED 28 H 0.2947 0.6840 0.0213 HEAD-BIASED 29 T 0.3214 0.5070 0.1716 HEAD-BIASED 30 H 0.2823 0.6290 0.0887 HEAD-BIASED
Office Hours
- Friday 9:00AM-10:30PM
Standards of Academic Conduct
The following is an extract from the School of Public Health's Student Code of Conduct [1]:
Student academic misconduct includes behavior involving plagiarism, cheating, fabrication, falsification of records or official documents, intentional misuse of equipment or materials, and aiding and abetting the perpetration of such acts. The preparation of reports, papers, and examinations, assigned on an individual basis, must represent each student’s own effort. Reference sources should be indicated clearly. The use of assistance from other students or aids of any kind during a written examination, except when the use of books or notes has been approved by an instructor, is a violation of the standard of academic conduct.
In the context of this course, any work you hand-in should be your own.
Course History
- Winter 2011 Course Web Site Biostatistics_615/815_Winter_2011
- Goncalo Abecasis taught it in several academic years previously. For previous course notes, see [Goncalo's older class notes].