Difference between revisions of "Biostatistics 615/815 Fall 2011"

Revision as of 09:26, 17 February 2011

Objective

In this winter, Biostatistics 615/815 aims for providing students with a practical understanding of computational aspects in implementing statistical methods. Although C++ language will be used throughout the course, using Java programming language for homework and project will be acceptable.

Target Audience

Students in Biostatistics 615 should be comfortable with simple algebra and basic statistics including probability distribution, linear model, and hypothesis testing. Previous experience in programming is not required, but those who do not have previous programming experience should expect to spend additional time studying and learning to be familiar with a programming language during the coursework. Most students registering for the course are Masters or Doctoral students in Biostatistics, Statistics, Bioinformatics or Human Genetics.

Students in Biostatistics 815 should be familiar with programming languages so that they can complete the class project tackling an advanced statistical problem during the semester. Project will be carried out in teams of 2. The details of the possible projects will be announced soon.

Textbook

Required Textbook : Cormen, Leiserson, Rivest, and Stein, "Introduction to Algorithms", Third Edition, The MIT Press, 2009 [Official Book Web Site]
Optional Textbook : Press, Teukolsky, Vetterling, Flannery, "Numerical Recipes", 3rd Edition, Cambridge University Press, 2007 [Official Book Web Site]

Class Schedule

Classes are scheduled for Tuesday and Thursdays, 8:30 - 10:00 am at SPH II M4318

Topics

The following contents are planned to be covered.

Part I : Algorithms 101

Understanding of Computational Time Complexity
Sorting
Divide and Conquer Algorithms
Searching
Key Data Structure
Dynamic Programming

Part II : Matrix Operations and Numerical Optimizations

Matrix decomposition (LU, QR, SVD)
Implementation of Linear Models
Numerical Optimizations

Part III : Advanced Statistical Methods

Hidden Markov Models
Expectation Maximization
Markov-Chain Monte Carlo Methods

Class Notes

Lecture 1 : Statistical Computing -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 2 : C++ Basics and Precisions -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 3 : Implementing Fisher's Exact Test -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 4 : Classes and STLs -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 5 : Divide and Conquer Algorithms -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 6 : Linear Sorting Algorithms and Elementary Data Structures -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 7 : Data Structures -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 8 : Hash Tables -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 9 : Dyamic Programming -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 10 : Boost Libraries and Graphical Algorithms -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 11 : Hidden Markov Models -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 12 : Hidden Markov Models -- (Handout mode - PDF) (Presentation mode - PDF)
Lecture 13 : Matrix Computation -- (Handout mode - PDF) (Presentation mode - PDF)

Problem Sets

Problem Set 1 : Due on January 20th, 2011 -- (Problem Set 1 - PDF)
Problem Set 2 : Due on February 8th, 2011 -- (Problem Set 2 - PDF)
- (Example data - shuf-1M.txt.gz) 1,000,000 randomly shuffled data (gzipped)
- (Example data - Rand-1M-3digits.txt.gz) 1,000,000 random data from 1 to 1,000]] (gzipped)
- (Example data - Rand-50k.txt.gz) 50,000 random data from 1 to 1,000,000)]] (gzippd)
Problem Set 3 : Due on February 17th, 2011 -- (Problem Set 3 - PDF)
- For problem 2 - For Visual C++ users : Refer to the (Code Example) for a better start
Problem Set 4 : Due on February 24th, 2011 -- (Problem Set 4 - PDF)
- (Example input file for problem 2 - gzipped)
- (Example output file for problem 2 - gzipped)
- See below if you want to copy and paste example code

Source code for Homework 4 - Problem 1

#include <iostream>
#include <fstream>
#include <vector>
#include <boost/tokenizer.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/foreach.hpp>

#define MAX_DISTANCE 9999

// A simple integer matrix class
class IntMatrix {
protected:  // internal data
  std::vector<int> data;
  int nRows;
  int nCols;

public:
  IntMatrix() : nRows(0), nCols(0) {}  // default contructor

  // constructor specifying # rows, # cols, and default value
  IntMatrix(int m, int n, int defaultValue = 0) : nRows(m), nCols(n) {
    for(int i=0; i < m; ++i) {
      for(int j=0; j < n; ++j) {
        data.push_back(defaultValue);
      }
    }
  }

  // constructor from a file
  IntMatrix(const char* file) {
    open(file);
  }

  // opens a file to fill the matrix
  void open(const char* file) {
    std::ifstream ifs(file);
    if ( ! ifs.is_open() ) {
      std::cerr << "Cannot open file " << file << std::endl;
      abort();
    }

    std::string line;
    boost::char_separator<char> sep(" \t");
    typedef boost::tokenizer< boost::char_separator<char> > wsTokenizer;

    nRows = 0;
    nCols = 0;
    while( std::getline(ifs, line) ) {
      wsTokenizer t(line,sep);
      for(wsTokenizer::iterator i=t.begin(); i != t.end(); ++i) {
        data.push_back(boost::lexical_cast<int>(i->c_str()));  // similar to atoi()
        if ( nRows == 0 ) ++nCols;
      }
      ++nRows;
      if ( data.size() != nRows*nCols ) {
        std::cerr << "The input file is not rectangle at line " << nRows << std::endl;
        abort();
      }
    }
  }

  int numRows() { return nRows; }
  int numCols() { return nCols; }
  int& at(int i, int j) { return data[i*nCols + j]; } // row i, col j

  // function to print the content of matrix
  // Example usage: M.print(std::cout);
  void print(std::ostream& o) {
    for(int i=0; i < nRows; ++i) {
      for(int j=0; j < nCols; ++j) {
        if ( j > 0 ) o << "\t";
        o << at(i,j);
      }
      o << std::endl;
    }
  }
};

int main(int argc, char** argv) {
  if ( argc != 2 ) {
    std::cerr << "Usage: floydWarshall [weight matrix]" << std::endl;
    abort();
  }

  IntMatrix W(argv[1]);  // initialize weight matrix from a file
  if ( W.numRows() != W.numCols() ) {  // make sure that the matrix is square
    std::cerr << "The input file is not exactly square" << std::endl;
    abort();
  }

  // this part needs to be filled out

  std::cerr << "All-pair shortest distance is:" << std::endl;
  // this also needs to be filled out
  std::cerr << std::endl;
  std::cerr << "Previous node in the shortest path is:" << std::endl;
  // this also needs to be filled out
  return 0;
}

Source code for Homework 4 - Problem 2

#include <iostream>
#include <fstream>
#include <vector>
#include <cstdlib>

// note that the code below is really minimal.
int main(int argc, char** argv) {
  if ( argc != 2 ) {
    std::cerr << "Usage: coinHMM [inputFile]" << std::endl;
    return -1;
  }

  std::vector<int> trueStates;
  std::vector<int> observations;
  std::ifstream ifs(argv[1]);

  if ( !ifs.is_open() ) {
    std::cerr << "Cannot open file " << argv[1] << std::endl;
    return -1;    
  }

  std::string tok;
  for(int i=0; ifs >> tok; ++i) {
    if ( i % 3 == 1 ) {
      if ( tok.compare("F") == 0 ) {
        trueStates.push_back(0);
      }
      else if ( tok.compare("B") == 0 ) {
        trueStates.push_back(1);
      }
      else {
        std::cerr << "Cannot recognize state " << tok << std::endl;	
      }
    }
    else if ( i % 3 == 2 ) {
      if ( tok.compare("H") == 0 ) {
        observations.push_back(0);
      }
      else if ( tok.compare("T") == 0 ) {
        observations.push_back(1);
      }
      else {
        std::cerr << "Cannot recognize observation " << tok << std::endl;	
      }
    }
  }

  std::cout << "Read " << trueStates.size() << " states and observations" << std::endl;

  int T = (int)trueStates.size();
  double* alphas = new double[T*2];  // or double** alphas = ...
  double* betas = new double[T*2];
  double* gammas = new double[T*2];  // to store Pr(q|o,lambda)
  double* deltas = new double[T*2];  
  int* phis = new int[T*2];          // to store viterbi path
  double trans[2][2] = { {0.95,0.2}, {0.05,0.8} };
  double emis[2][2] = { {0.5,0.9}, {0.5,0.1} };
  double pi[2] = {0.9,0.1};

  // fill out the details here

  for(int i=0; i < T; ++i) {
    std::cout << i+1 << "\t" << (trueStates[i] ? "B" : "F") << "\t" << (observations[i] ? "T" : "H") << "\t" << 
              gammas[i*2+1] << "\t" << (paths[i] ? "B" : "F") << std::endl;
  }

  delete [] alphas;
  delete [] betas;
  delete [] gammas;
  delete [] deltas;
  delete [] phis;
  delete [] paths;
  return 0;
}

Source code from Homework 2 - Problem 3

#include <iostream>
#include <vector>
#include <ctime>
#include <fstream>
#include <set>
#include "mySortedArray.h"
#include "myTree.h"
#include "myList.h"                                            

int main(int argc, char** argv) {
  int tok;
  std::vector<int> v;
  if ( argc > 1 ) {
    std::ifstream fin(argv[1]);
    while( fin >> tok ) { v.push_back(tok); }
    fin.close();
  }
  else {
    while( std::cin >> tok ) { v.push_back(tok); }
  }                                             
  mySortedArray<int> c1;
  myList<int> c2;
  myTree<int> c3;
  std::set<int> s;

  clock_t start = clock();
  for(int i=0; i < (int)v.size(); ++i) {
    c1.insert(v[i]);
  }
  clock_t finish = clock();
  double duration = (double)(finish-start)/CLOCKS_PER_SEC;  
  std::cout << "Sorted Array (Insert) " << duration << std::endl;

  start = clock();
  for(int i=0; i < (int)v.size(); ++i) {
    c2.insert(v[i]);
  }
  finish = clock();
  duration = (double)(finish-start)/CLOCKS_PER_SEC;  
  std::cout << "List (Insert) " << duration << std::endl;
 
  start = clock();
  for(int i=0; i < (int)v.size(); ++i) {
    c3.insert(v[i]);
  }
  finish = clock();
  duration = (double)(finish-start)/CLOCKS_PER_SEC;  
  std::cout << "Tree (Insert) " << duration << std::endl;

  start = clock();
  for(int i=0; i < (int)v.size(); ++i) {
    s.insert(v[i]);
  }
  finish = clock();
  duration = (double)(finish-start)/CLOCKS_PER_SEC;  
  std::cout << "std::set (Insert) " << duration << std::endl;

  start = clock();
  for(int i=0; i < (int)v.size(); ++i) {
    c1.search(v[i]);
  }
  finish = clock();
  duration = (double)(finish-start)/CLOCKS_PER_SEC;  
  std::cout << "Sorted Array (Search) " << duration << std::endl;

  start = clock();
  for(int i=0; i < (int)v.size(); ++i) {
    c2.search(v[i]);
  }
  finish = clock();
  duration = (double)(finish-start)/CLOCKS_PER_SEC;  
  std::cout << "List (Search) " << duration << std::endl;

  start = clock();
  for(int i=0; i < (int)v.size(); ++i) {
    c3.search(v[i]);
  }
  finish = clock();
  duration = (double)(finish-start)/CLOCKS_PER_SEC;  
  std::cout << "Tree (Search) " << duration << std::endl;

  start = clock();
  for(int i=0; i < (int)v.size(); ++i) {
    s.find(v[i]);
  }
  finish = clock();
  duration = (double)(finish-start)/CLOCKS_PER_SEC;  
  std::cout << "std::set (Search) " << duration << std::endl;
}

The header file of mySortedArray.h (For Homework 2 Problem 3)

#include <iostream>
#define DEFAULT_ALLOC 1024
template <class T> // template supporting a generic type
class mySortedArray {
 protected:     // member variables hidden from outside
   T *data;    // array of the genetic type
   int size;   // number of elements in the container
   int nalloc; // # of objects allocated in the memory
  mySortedArray(mySortedArray& a) {};
  int search(const T& x, int begin, int end);
public:      // abstract interface visible to outside
   mySortedArray();         // default constructor
   ~mySortedArray();        // desctructor
   void insert(const T& x); // insert an element x
   int search(const T& x);  // search for an element x and return its location
   bool remove(const T& x); // delete a particular element
};

template <class T>
mySortedArray<T>::mySortedArray() {     // default constructor
  size = 0;              // array do not have element initially
  nalloc = DEFAULT_ALLOC;
  data = new T[nalloc];  // allocate default # of objects in memory
}

template <class T>
mySortedArray<T>::~mySortedArray() {    // destructor
  if ( data != NULL ) {
    delete [] data;      // delete the allocated memory before destorying
  }                      // the object. otherwise, memory leak happens
}

template <class T>
void mySortedArray<T>::insert(const T& x) {
  if ( size >= nalloc ) {  // if container has more elements than allocated
    T* newdata = new T[nalloc*2];   // make an array at doubled size
    for(int i=0; i < nalloc; ++i) {
      newdata[i] = data[i];         // copy the contents of array
    }
    delete [] data;                 // delete the original array
    data = newdata;                 // and reassign data ptr
  }

  int i;
  for(i=size-1; (i >= 0) && (data[i] > x); --i) {
    data[i+1] = data[i];            // insert the list into right position
  }
  data[i+1] = x;
  ++size;                           // increase the size
}

template <class T>
int mySortedArray<T>::search(const T& x) {
  return search(x, 0, size-1);
}

template <class T>
int mySortedArray<T>::search(const T& x, int begin, int end) {
  if ( begin > end ) {
    return -1;
  }
  else {
    int mid = (begin+end)/2;
    if ( data[mid] == x ) {
      return mid;
    }
    else if ( data[mid] < x ) {
      return search(x, mid+1, end);
    }
    else {
      return search(x, begin, mid);
    }
  }
}

template <class T>
bool mySortedArray<T>::remove(const T& x) {
  int i = search(x);  // try to find the element
  if ( i >= 0 ) {      // if found
    for(int j=i; j < size-1; ++j) {
      data[i] = data[i+1];  // shift all the elements by one
    }
    --size;           // and reduce the array size
    return true;      // successfully removed the value
  }
  else {
    return false;     // cannot find the value to remove
  }
}

Example code to generare all possible permutations

Note that http://www.cplusplus.com/reference/algorithm/next_permutation/ provides much simpler solution for generating permutation. (Thanks to Matthew Snyder)

// Code to generate permutations from 1..n
// These code was adopted from
// http://geeksforgeeks.org/?p=767 by Hyun Min Kang
#include <iostream>
#include <vector>

// swaps two elements
void swap(std::vector<int>& v, int i, int j) {
  int tmp = v[i];
  v[i] = v[j];
  v[j] = tmp;
  return;
}

// actual engine - recursively calls permutation
void permute(std::vector<int>& v, int from, int to) {
  if ( from == to ) {           // print the permutation
    for(int i=0; i < v.size(); ++i) {
      std::cout << " " << v[i];
    }
    std::cout << std::endl;
  }
  else {
    for(int i = from; i <= to; ++i) {
      swap(v,from,i);           // swaps two elements
      permute(v, from+1, to);   // permute the rest
      swap(v,from,i);           // recover the vector to the original state
    }
  }
}

// Permutation Example
int main(int argc, char** argv) {
  if ( argc != 2 ) {
    std::cerr << "Usage: " << argv[0] << " " << "[# of samples]" << std::endl;
    return -1;
  }

  // takes input from 1..n
  int n = atoi(argv[1]);
  std::vector<int> input;
  for(int i=1; i <= n; ++i) {
    input.push_back(i);
  }
  permute(input, 0, n-1); // print all possible permutations
  return 0;
}

#include <iostream>                       // for input/output
#include <boost/graph/adjacency_list.hpp> // for using graph type 
#include <boost/graph/dijkstra_shortest_paths.hpp> // for dijkstra algorithm
using namespace std;   // allow to omit prefix 'std::'
using namespace boost; // allow to omit prefix 'boost::'

int getNodeID(int node) {
  return ((node/5)+1)*10+(node%5+1);
}

Code for Problem 3 - Problem 2

#include <iostream>                       // for input/output
#include <boost/graph/adjacency_list.hpp> // for using graph type 
#include <boost/graph/dijkstra_shortest_paths.hpp> // for dijkstra algorithm
using namespace std;   // allow to omit prefix 'std::'
using namespace boost; // allow to omit prefix 'boost::'

int getNodeID(int node) {
  return ((node/5)+1)*10+(node%5+1);
}

int main(int argc, char** argv) {
  // defining a graph type
  // 1. edges are stored as std::list internally
  // 2. verticies are stored as std::vector internally
  // 3. the graph is directed (undirectedS, bidirectionalS can be used)
  // 4. vertices do not carry particular properties
  // 5. edges contains weight property as integer value
  typedef adjacency_list< listS, vecS, directedS, no_property, property< edge_weight_t, int> > graph_t;

  // vertex_descriptor is a type for representing vertices
  typedef graph_traits< graph_t >::vertex_descriptor vertex_descriptor;
  // a nodes is represented as an integer, and an edge is a pair of integers
  typedef std::pair<int, int> E;

  // Connect between verticies as in the Manhattan Tourist Problem
  // Each node is numbered as a two-digit integer of [#row] and [#col]
  enum { N11, N12, N13, N14, N15,
	 N21, N22, N23, N24, N25,
	 N31, N32, N33, N34, N35,
	 N41, N42, N43, N44, N45,
	 N51, N52, N53, N54, N55 };

  E edges [] = { E(N11,N12), E(N12,N13), E(N13,N14), E(N14,N15),
		 E(N21,N22), E(N22,N23), E(N23,N24), E(N24,N25),
		 E(N31,N32), E(N32,N33), E(N33,N34), E(N34,N35),
		 E(N41,N42), E(N42,N43), E(N43,N44), E(N44,N45),
		 E(N51,N52), E(N52,N53), E(N53,N54), E(N54,N55),
		 E(N11,N21), E(N12,N22), E(N13,N23), E(N14,N24), E(N15,N25),
		 E(N21,N31), E(N22,N32), E(N23,N33), E(N24,N34), E(N25,N35),
		 E(N31,N41), E(N32,N42), E(N33,N43), E(N34,N44), E(N35,N45),
		 E(N41,N51), E(N42,N52), E(N43,N53), E(N44,N54), E(N45,N55) };
  // Assign weights for each edge
  int weight [] = { 4, 2, 0, 7,     // horizontal weights
		    7, 4, 5, 9,
		    6, 8, 1, 0,
		    1, 6, 4, 7,
		    1, 5, 8, 5,
		    0, 6, 6, 2, 4,  // vertical weights
		    9, 7, 1, 0, 6,
		    1, 8, 4, 8, 9,
		    3, 6, 6, 0, 7 };

  // define a graph as an array of edges and weights
#if defined(BOOST_MSVC) && BOOST_MSVC <= 1300 // special routine for MS VC++
  typedef graph_traits < graph_t >::edge_descriptor edge_descriptor;
  graph_t g(25);
  property_map<graph_t, edge_weight_t>::type weightmap = get(edge_weight, g);
  for (std::size_t j = 0; j < sizeof(edges)/sizeof(E); ++j) {
    edge_descriptor e; 
    bool inserted;
    boost::tie(e, inserted) = add_edge(edge_array[j].first, edge_array[j].second, g);
    weightmap[e] = weights[j];
  }
#else  // for regulat compilers
  graph_t g(edges, edges + sizeof(edges) / sizeof(E), weight, 25);
#endif

  std::vector<vertex_descriptor> p(num_vertices(g));
  std::vector<int> d(num_vertices(g));
  vertex_descriptor s = vertex(N11, g);

#if defined(BOOST_MSVC) && BOOST_MSVC <= 1300 // for VC++
  property_map<graph_t, vertex_index_t>::type indexmap = get(vertex_index, g);
  dijkstra_shortest_paths(g, s, &p[0], &d[0], weightmap, indexmap, 
                          std::less<int>(), closed_plus<int>(), 
                          (std::numeric_limits<int>::max)(), 0,
                          default_dijkstra_visitor());
#else  // for regular compilers
  dijkstra_shortest_paths(g, s, predecessor_map(&p[0]).distance_map(&d[0]));
#endif
  graph_traits < graph_t >::vertex_iterator vi, vend;

  std::cout << "Backtracking the optimal path from the destination to source" << std::endl;
  for(int node = N55; node != N11; node = p[node]) {
    std::cout << "Path: N" << getNodeID(p[node]) << " -> N" << getNodeID(node) << ", Distance from origin is " << d[node] << std::endl;
  }
  return 0;
}

Office Hours

Friday 9:30AM-12:30PM

Information on Biostatistics Cluster

You may want to request an account to the biostatistics cluster. Refer to https://webservices.itcs.umich.edu/mediawiki/Biostatistics/index.php/Cluster for further information.

Standards of Academic Conduct

The following is an extract from the School of Public Health's Student Code of Conduct [1]:

Student academic misconduct includes behavior involving plagiarism, cheating, fabrication, falsification of records or official documents, intentional misuse of equipment or materials, and aiding and abetting the perpetration of such acts. The preparation of reports, papers, and examinations, assigned on an individual basis, must represent each student’s own effort. Reference sources should be indicated clearly. The use of assistance from other students or aids of any kind during a written examination, except when the use of books or notes has been approved by an instructor, is a violation of the standard of academic conduct.

In the context of this course, any work you hand-in should be your own.

Course History

Goncalo Abecasis taught it in several academic years previously. For previous course notes, see [Goncalo's older class notes].

@@ Line 52: / Line 52: @@
 * Lecture 11 : Hidden Markov Models -- [[Media:Biostat615-lecture11-nup.pdf | (Handout mode - PDF)]] [[Media:Biostat615-lecture11.pdf | (Presentation mode - PDF)]]
 * Lecture 12 : Hidden Markov Models -- [[Media:Biostat615-lecture12-nup.pdf | (Handout mode - PDF)]] [[Media:Biostat615-lecture12-presentation.pdf | (Presentation mode - PDF)]]
+* Lecture 13 : Matrix Computation -- [[Media:Biostat615-lecture13-nup.pdf | (Handout mode - PDF)]] [[Media:Biostat615-lecture13-presentation.pdf | (Presentation mode - PDF)]]
 == Problem Sets ==