Biostatistics 615/815 Fall 2011
Objective
In Fall 2011, Biostatistics 615/815 aims to give students a practical understanding of the computational aspects of implementing statistical methods. Although C++ will be used throughout the course, Java is also acceptable for homework and the project.
Target Audience
Students in Biostatistics 615 should be comfortable with simple algebra and basic statistics, including probability distributions, linear models, and hypothesis testing. Previous programming experience is not required, but students without it should expect to spend additional time becoming familiar with a programming language during the course. Most students registering for the course are Masters or Doctoral students in Biostatistics, Statistics, Bioinformatics or Human Genetics.
Students in Biostatistics 815 should be familiar enough with programming to complete a class project tackling an advanced statistical problem during the semester. Projects will be carried out in teams of two. Details of the possible projects will be announced soon.
Textbook
- Recommended Textbook : Cormen, Leiserson, Rivest, and Stein, "Introduction to Algorithms", Third Edition, The MIT Press, 2009 [Official Book Web Site]
- Optional Textbook : Press, Teukolsky, Vetterling, Flannery, "Numerical Recipes", 3rd Edition, Cambridge University Press, 2007 [Official Book Web Site]
Class Schedule
Classes are scheduled for Tuesdays and Thursdays, 8:30 - 10:00 am, at SPH II M4332.
Topics
The following topics are planned.
Part I : C++ Basics and Introductory Algorithms
- Computational Time Complexity
- Sorting
- Divide and Conquer Algorithms
- Searching
- Key Data Structures
- Dynamic Programming
- Hidden Markov Models
Part II : Numerical Methods and Randomized Algorithms
- Random Numbers
- Matrix Operations and Least Squares Methods
- Importance Sampling
- Expectation Maximization
- Markov-Chain Monte Carlo Methods
- Simulated Annealing
- Gibbs Sampling
Class Notes
- Lecture 1 : Introduction to Statistical Computing -- (Handout PDF) (Presentation PDF)
- Lecture 2 : Introduction to C++ Programming -- (Handout PDF) (Presentation PDF)
- Lecture 3 : C++ Basics and Fisher's Exact Test -- (Handout PDF) (Presentation PDF)
- Lecture 4 : Standard Template Libraries + Divide and Conquer Algorithms -- (Handout PDF) (Presentation PDF)
- Lecture 5 : Sorting Algorithms -- (Handout PDF) (Presentation PDF)
- Lecture 6 : Sorting Algorithms & Arrays -- (Handout PDF) (Presentation PDF)
- Lecture 7 : List and Binary Search Trees -- (Handout PDF) (Presentation PDF)
- Lecture 8 : Hash Tables -- (Handout PDF) (Presentation PDF)
- Lecture 9 : Dynamic Programming -- (Handout PDF) (Presentation PDF)
- Review : Dynamic Programming & Midterm Review -- (PDF)
- Lecture 10 : Hidden Markov Model -- (PDF) (UPDATED on Oct 29th at 12:51PM)
- Lecture 11 : Hidden Markov Model (cont'd) -- (PDF) (UPDATED on Oct 28th at 1:42PM)
- Lecture 12 : Boost Library & Random Numbers -- (PDF)
- Lecture 13 : Single dimensional optimization -- (PDF)
- Lecture 14 : Single and multi dimensional optimizations -- (PDF) (Updated on Nov 3rd 1:25AM)
- Lecture 15 : Multi dimensional optimizations -- (PDF) (Updated Nov 8 10:35AM)
- Lecture 16 : E-M algorithm -- (PDF) (Updated Nov 8 10:35AM)
- Lecture 17 : Simulated Annealing -- (PDF)
- Lecture 18 : Gibbs Sampling -- (PDF)
Problem Sets
- Problem Set 0 -- Screenshots of helloWorld.cpp and towerOfHanoi.cpp running -- Due before the submission of Problem Set 1
- Problem Set 1 -- Due on Tuesday September 27th, 2011 (PDF) (PDF-SOLUTIONS)
- Problem Set 2 -- Due on Thursday October 6th, 2011 (PDF) (PDF-SOLUTIONS)
- (Update Oct 2, 2011 : Note that problems 1 and 3 were slightly updated for clarification)
- (If you can't decompress the files above properly, use this alternative link by CLICKING HERE )
- Problem Set 3 -- Due on Tuesday November 1st, 2011 (PDF) (UPDATED on Oct 25th at 11:10AM)
- Problem Set 4 -- Due on Tuesday November 15th, 2011 (PDF)
- Problem Set 5 -- Due on Tuesday November 29th, 2011 (PDF)
Supplementary Data sets for Problem Sets
- Problem Set 2
- (Example data - shuf-1M.txt.gz) 1,000,000 randomly shuffled data (gzipped)
- (Example data - Rand-1M-3digits.txt.gz) 1,000,000 random data from 1 to 1,000 (gzipped)
- (Example data - Rand-50k.txt.gz) 50,000 random data from 1 to 1,000,000 (gzipped)
- Problem Set 3
- Example output data for problem 3-1 (input is the second column) (NOTE : ADDED on Oct 25 11:45PM) -- This is also reflected in the Lecture 11 class notes.
TIME  TOSS  P(FAIR)  P(BIAS)  MLSTATE
1     H     0.5950   0.4050   FAIR
2     T     0.8118   0.1882   FAIR
3     H     0.8071   0.1929   FAIR
4     T     0.8584   0.1416   FAIR
5     H     0.7613   0.2387   FAIR
6     H     0.7276   0.2724   FAIR
7     T     0.7495   0.2505   FAIR
8     H     0.5413   0.4587   BIASED
9     H     0.4187   0.5813   BIASED
10    H     0.3533   0.6467   BIASED
11    H     0.3301   0.6699   BIASED
12    H     0.3436   0.6564   BIASED
13    H     0.3971   0.6029   BIASED
14    T     0.5028   0.4972   BIASED
15    H     0.3725   0.6275   BIASED
16    H     0.2985   0.7015   BIASED
17    H     0.2635   0.7365   BIASED
18    H     0.2596   0.7404   BIASED
19    H     0.2858   0.7142   BIASED
20    H     0.3482   0.6518   BIASED
- Example output data for problem 3-2 (input is the second column) (NOTE : UPDATED on Oct 25 11:23PM)
TIME  TOSS  Pr(F)   Pr(HB)  Pr(TB)  MLSTATE
1     T     0.8844  0.0326  0.0830  FAIR
2     H     0.9012  0.0791  0.0198  FAIR
3     H     0.9075  0.0735  0.0189  FAIR
4     T     0.9091  0.0145  0.0764  FAIR
5     T     0.9068  0.0114  0.0818  FAIR
6     H     0.9058  0.0440  0.0502  FAIR
7     T     0.8834  0.0275  0.0891  FAIR
8     H     0.8520  0.0698  0.0783  FAIR
9     T     0.7713  0.0347  0.1940  FAIR
10    T     0.6927  0.0823  0.2249  FAIR
11    H     0.4730  0.4984  0.0286  HEAD-BIASED
12    H     0.3227  0.6706  0.0066  HEAD-BIASED
13    H     0.2236  0.7726  0.0037  HEAD-BIASED
14    H     0.1589  0.8381  0.0031  HEAD-BIASED
15    H     0.1169  0.8803  0.0028  HEAD-BIASED
16    H     0.0902  0.9072  0.0026  HEAD-BIASED
17    H     0.0740  0.9235  0.0025  HEAD-BIASED
18    H     0.0654  0.9321  0.0025  HEAD-BIASED
19    H     0.0630  0.9346  0.0025  HEAD-BIASED
20    H     0.0661  0.9314  0.0025  HEAD-BIASED
21    H     0.0755  0.9219  0.0026  HEAD-BIASED
22    H     0.0926  0.9038  0.0036  HEAD-BIASED
23    H     0.1204  0.8684  0.0113  HEAD-BIASED
24    H     0.1603  0.7586  0.0811  HEAD-BIASED
25    T     0.1904  0.0858  0.7238  TAIL-BASED
26    T     0.1819  0.0118  0.8063  TAIL-BASED
27    T     0.1797  0.0036  0.8167  TAIL-BASED
28    T     0.1894  0.0028  0.8077  TAIL-BASED
29    T     0.2136  0.0038  0.7826  TAIL-BASED
30    T     0.2561  0.0123  0.7317  TAIL-BASED
- Example input/output data for problem 3-3 (Applying 2-state HMM in Problem 3-1): Download using THIS LINK
Office Hours
- Friday 9:00AM-10:30PM
Standards of Academic Conduct
The following is an extract from the School of Public Health's Student Code of Conduct [1]:
Student academic misconduct includes behavior involving plagiarism, cheating, fabrication, falsification of records or official documents, intentional misuse of equipment or materials, and aiding and abetting the perpetration of such acts. The preparation of reports, papers, and examinations, assigned on an individual basis, must represent each student’s own effort. Reference sources should be indicated clearly. The use of assistance from other students or aids of any kind during a written examination, except when the use of books or notes has been approved by an instructor, is a violation of the standard of academic conduct.
In the context of this course, any work you hand in should be your own.
Course History
- Winter 2011 Course Web Site Biostatistics_615/815_Winter_2011
- Goncalo Abecasis taught this course in several previous academic years. For previous course notes, see [Goncalo's older class notes].