Base Caller Summaries

From Genome Analysis Wiki

Standard Illumina Base Caller (Bustard)

Sequencing-by-Synthesis (SBS)

  • DNA sample obtained, containing many copies of the same sequences, and randomly fragmented
  • Single-stranded DNA fragments attached to slide and amplified so that each fragment forms a cluster
  • DNA polymerase and 4 reversible terminator bases (each with a distinct fluorescent marker) added
  • Clusters excited by lasers and images taken at the optimal wavelength for each of the 4 fluorophores
  • Fluorophores and terminators removed and process repeated for L cycles

Image Analysis

  • Corrects for imperfect repositioning of the camera and for lens aberrations by aligning images to a reference image from the first cycle
  • Signal for each cluster characterized as time series data of fluorescence intensities and noise

Base Calling

  • Converts fluorescence signals into actual sequence data with quality scores
  • Takes intensities of four channels for every cluster in each cycle and determines concentration of each base
  • Renormalizes concentrations by multiplying by the ratio of the average concentration in the first cycle to that in the current cycle
  • Uses a Markov model to derive a transition matrix giving the probabilities of phasing (no new base synthesized), prephasing (two new bases synthesized), and normal incorporation (one new base synthesized)
  • Uses the transition matrix and the observed concentrations of each base to estimate the concentrations in the absence of phasing, and reports base calls from these corrected concentrations
    • Assumes crosstalk matrix constant for a given sequencing run and that phasing affects all nucleotides in the same way
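
The phasing correction described above can be sketched as follows. This is a minimal illustration and not Bustard's actual code: it assumes a single phasing probability and a single prephasing probability shared by all nucleotides, works on one channel at a time, and truncates strands that run past the read length. The function names (phasing_matrix, apply_phasing, solve, correct_phasing) and the example probabilities are hypothetical.

```python
def phasing_matrix(n_cycles, p_phase, p_prephase):
    """Return Q where Q[t][j] is the probability that a strand has
    incorporated j+1 bases after cycle t+1 (so it emits the signal of base j+1)."""
    p_normal = 1.0 - p_phase - p_prephase
    # dist[k] = probability that a strand has incorporated k bases so far
    dist = [1.0] + [0.0] * n_cycles
    rows = []
    for _ in range(n_cycles):
        new = [0.0] * (n_cycles + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * p_phase              # phasing: no new base
            if k + 1 <= n_cycles:
                new[k + 1] += prob * p_normal     # normal: one new base
            if k + 2 <= n_cycles:
                new[k + 2] += prob * p_prephase   # prephasing: two new bases
        dist = new
        rows.append(dist[1:])  # strands with k = 0 emit no signal yet
    return rows


def apply_phasing(q, true_signal):
    """Forward model: mix the phasing-free signal of one channel across cycles."""
    return [sum(w * s for w, s in zip(row, true_signal)) for row in q]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def correct_phasing(q, observed):
    """Estimate the signal in the absence of phasing from the observed signal."""
    return solve(q, observed)


if __name__ == "__main__":
    q = phasing_matrix(6, p_phase=0.01, p_prephase=0.005)  # assumed rates
    truth = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0]                 # one channel, six cycles
    observed = apply_phasing(q, truth)
    print([round(v, 3) for v in correct_phasing(q, observed)])
```

In this sketch the same matrix would be applied to each of the four channels in turn, which mirrors the assumption that phasing affects all nucleotides in the same way.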

Alta-Cyclic

General Noise Factors

  • Phasing
    • Failure of nucleotide incorporation or of block (terminator) removal, or incorporation of more than one nucleotide in a single cycle
  • Fading
    • Decay in fluorescent signal intensity with each cycle
    • Likely attributable to material loss during sequencing
  • Crosstalk
    • Signal from the C dye overlaps the A channel: a C label also fluoresces in the A channel (G and T overlap similarly)
    • Likely caused by overlap in the dye emission spectra
  • T Accumulation
    • The fluorophores used for thymine are not always removed properly after each cycle
    • Intensity of T signal increases across sequencing run
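
As an illustration of the crosstalk item above, the sketch below treats the A/C pair as a 2x2 linear mixing problem and undoes it with the inverse matrix. The mixing coefficients are made up for the example, and the pairwise model and function names are assumptions, not values or code from any base caller. The G/T pair could be handled the same way with its own matrix.

```python
# Hypothetical mixing matrix for the A/C channel pair (illustrative values only):
#   observed_A = 1.0 * true_A + 0.4 * true_C
#   observed_C = 0.1 * true_A + 1.0 * true_C
AC_CROSSTALK = [[1.0, 0.4],
                [0.1, 1.0]]


def mix(matrix, pair):
    """Apply a 2x2 matrix to a (first, second) intensity pair."""
    return [matrix[0][0] * pair[0] + matrix[0][1] * pair[1],
            matrix[1][0] * pair[0] + matrix[1][1] * pair[1]]


def invert_2x2(matrix):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("crosstalk matrix is singular")
    return [[d / det, -b / det],
            [-c / det, a / det]]


def correct_crosstalk(matrix, observed_pair):
    """Recover dye-specific intensities from the observed channel intensities."""
    return mix(invert_2x2(matrix), observed_pair)


if __name__ == "__main__":
    observed = mix(AC_CROSSTALK, [0.0, 100.0])   # a pure C cluster leaks into A
    print(observed)
    print(correct_crosstalk(AC_CROSSTALK, observed))
```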


Training Stage

  • Learns run-specific noise patterns according to the model and uses a Support Vector Machine (SVM) to find an optimized solution that reduces the effect of the noise sources
  • Half of training set used for cross-validation

Base Calling Stage

  • Reports all sequences from run with optimized parameters

Differences from Standard Illumina Base Caller

  • Calling parameters optimized empirically and tested to enhance accuracy of each run
  • Calculates phasing parameters based on parametric model
  • Dynamically tracks changes in crosstalk, which disrupt signals in later cycles

Probabilistic Base Calling

  • Provides an alternative probabilistic base calling method, based on the fluorescence intensity quantifications, that uses:
    • Extended IUPAC alphabet to code ambiguous bases
    • Information criterion to control the length of reliable reads
  • Reduces systematic bias by addressing:
    • Crosstalk
    • Dephasing
    • Optical effect in which tiles in the center of the image appear brighter; corrected by fitting a 2D loess model to the intensities and subtracting the difference between the fit and the median intensity
  • Measures level of uncertainty in base calling by entropy (uncertainty in determination of the correct kth base)
  • Does not consider fine-tuning image analysis
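
The entropy measure and the extended IUPAC alphabet can be illustrated with a short sketch. The entropy formula and the IUPAC nucleotide codes are standard. The rule used here to choose an ambiguity code (the smallest set of top-ranked bases whose probabilities sum to at least a threshold) is an assumption made for illustration and is not necessarily the rule used by the published method. The function names are hypothetical.

```python
import math

# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def entropy_bits(probs):
    """Shannon entropy (in bits) of a base probability distribution.
    0 means a certain call, 2 means all four bases are equally likely."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def iupac_call(probs, mass=0.9):
    """Assumed rule: take the top-ranked bases until their total
    probability reaches `mass`, then report the matching IUPAC code."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    chosen, total = [], 0.0
    for base in ranked:
        chosen.append(base)
        total += probs[base]
        if total >= mass:
            break
    return IUPAC[frozenset(chosen)]


if __name__ == "__main__":
    confident = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
    ambiguous = {"A": 0.50, "C": 0.03, "G": 0.45, "T": 0.02}
    for probs in (confident, ambiguous):
        print(iupac_call(probs), round(entropy_bits(probs), 3))
```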


BayesCall

  • Model-based approach to base calling
  • Main goal is to model the sequencing process by taking stochasticity into account and by explicitly modeling how errors may arise
  • Obtains base calls by maximizing the posterior distribution of sequences given the observed data, assuming a uniform prior on sequences
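
With a uniform prior, the posterior is proportional to the likelihood, so maximizing the posterior is the same as choosing the most likely sequence. The toy sketch below shows this for a single cycle only. It assumes independent Gaussian noise in each channel and a fixed signal level, which is far simpler than a full model of the sequencing process; the function names and the signal and sigma values are hypothetical.

```python
import math

BASES = "ACGT"  # assumed channel order


def posterior(intensities, signal=1.0, sigma=0.3):
    """Posterior over the four bases for one cycle under a toy model:
    the true base's channel has mean `signal`, the others have mean 0,
    and every channel has independent Gaussian noise with std `sigma`."""
    log_like = []
    for i in range(4):
        expected = [signal if ch == i else 0.0 for ch in range(4)]
        sq_err = sum((o - e) ** 2 for o, e in zip(intensities, expected))
        log_like.append(-sq_err / (2 * sigma ** 2))
    # Uniform prior: the log prior is the same constant for every base,
    # so it cancels when the posterior is normalized.
    top = max(log_like)
    weights = [math.exp(v - top) for v in log_like]
    total = sum(weights)
    return {base: w / total for base, w in zip(BASES, weights)}


def map_call(intensities):
    """Return the base with the highest posterior probability."""
    post = posterior(intensities)
    return max(post, key=post.get)


if __name__ == "__main__":
    obs = [0.9, 0.1, 0.0, 0.05]
    print(map_call(obs), posterior(obs))
```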


Swift

Performs both image analysis and base calling

Image Analysis

  • Background subtraction – minimum pixel value within a window around each pixel is subtracted from the central pixel’s value
  • Image correlation – alignment of images to reference cycle
  • Object identification and intensity extraction
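
The background subtraction step can be sketched as a sliding-window minimum filter. This is an illustration on a nested-list image, not Swift's implementation; the window radius and the handling of image edges (the window is simply clipped) are assumptions.

```python
def subtract_background(image, radius=1):
    """Subtract from each pixel the minimum value found in the
    (2 * radius + 1) square window centered on it, clipped at the edges."""
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out_row.append(image[r][c] - min(window))
        result.append(out_row)
    return result


if __name__ == "__main__":
    tile = [[10, 10, 10],
            [10, 50, 10],
            [10, 10, 10]]
    for row in subtract_background(tile):
        print(row)
```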

Base Calling

  • Corrects for crosstalk by performing linear regression on crosstalk plots and using the slope to derive a correction matrix; repeated iteratively until the slope is zero
  • Corrects for phasing by ranking clusters by chastity (the ratio of the highest intensity to the sum of the top two intensities), using the top 400 clusters to estimate phasing, and applying the estimate as a correction
  • After correction, base with maximum intensity chosen as called base
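
The chastity ranking and the final maximum-intensity call can be sketched as follows. The chastity definition is the one given above; the channel order (A, C, G, T), the function names, and the treatment of a zero denominator are assumptions for illustration.

```python
BASES = "ACGT"  # assumed channel order of each intensity vector


def chastity(intensities):
    """Highest intensity divided by the sum of the two highest intensities."""
    top, second = sorted(intensities, reverse=True)[:2]
    denom = top + second
    return top / denom if denom > 0 else 0.0


def top_clusters(clusters, n=400):
    """Return the n clusters with the highest chastity."""
    return sorted(clusters, key=chastity, reverse=True)[:n]


def call_base(intensities):
    """After correction, call the base whose channel has the maximum intensity."""
    return BASES[max(range(4), key=lambda i: intensities[i])]


if __name__ == "__main__":
    clusters = [[90.0, 10.0, 5.0, 5.0],
                [40.0, 38.0, 2.0, 1.0],
                [3.0, 4.0, 70.0, 7.0]]
    for cluster in top_clusters(clusters, n=2):
        print(call_base(cluster), round(chastity(cluster), 3))
```

A chastity near 1 means one channel dominates, while a value near 0.5 means the top two channels are almost equal, which is why high-chastity clusters are the natural choice for estimating phasing.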



Ibis

  • Treats the sequencing chemistry model as something to be estimated directly from the data using statistical learning
  • Training set built from Bustard output using raw cluster intensities
  • Uses a base caller with SVM classifiers for each cycle, which take as input the intensity values of the current cycle as well as of the previous and following cycles (if they exist)
  • Data set created by aligning raw reads from a fraction of the tiles to a reference sequence, allowing mismatches
    • Half of this set used as a training set and the other half as a test set used to check results of training
  • Estimates parameters for calculating a quality score from the class assignment and the distances to the SVM classification/decision boundary
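
The per-cycle classifier inputs described above can be sketched as a feature-building step. The code only assembles the feature vectors and splits the data in half; it does not train an SVM. The function names and the ordering of features (previous, current, following cycle) are assumptions for illustration.

```python
def cycle_features(read_intensities, cycle):
    """Concatenate the four channel intensities of the previous, current,
    and following cycles, skipping neighbors that do not exist."""
    features = []
    for c in (cycle - 1, cycle, cycle + 1):
        if 0 <= c < len(read_intensities):
            features.extend(read_intensities[c])
    return features


def split_half(examples):
    """Split examples into a training half and a test half.
    A real pipeline would likely shuffle first; this sketch keeps the order."""
    middle = len(examples) // 2
    return examples[:middle], examples[middle:]


if __name__ == "__main__":
    # One read, three cycles, channel order assumed to be A, C, G, T.
    read = [[1.0, 0.0, 0.0, 0.0],
            [0.1, 0.9, 0.0, 0.0],
            [0.0, 0.1, 0.8, 0.1]]
    for cycle in range(len(read)):
        print(cycle, cycle_features(read, cycle))
```

Because each cycle has its own classifier, the first and last cycles can use 8 features while interior cycles use 12 without any padding.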


  • Unlike Alta-Cyclic, includes base-specific phasing parameters, so can correct raw intensities for T accumulation
  • Does not call an 'N' character for poor quality bases
  • Approach is unique in that causes of sequencing error are not modeled separately
    • Causes are considered together by using neighboring signals in the statistical learning procedure


References

Erlich, Y., Mitra, P.P., delaBastide, M., McCombie, W.R., Hannon, G.J. (2008) Alta-Cyclic: A self-optimizing base caller for next-generation sequencing. Nature Methods 5:679-682

Kao, W.-C., Stevens, K., Song, Y.S. (2009) BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome Research 19:1884-1895

Kircher, M., Stenzel, U., Kelso, J. (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol. 10(8):Article R83

Rougemont, J., Amzallag, A., Iseli, C., Farinelli, L., Xenarios, I., Naef, F. (2008) Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics 9:Article 431

Whiteford, N., Skelly, T., Curtis, C., Ritchie, M.E., Löhr, A., Zaranek, A.W., Abnizova, I., Brown, C. (2009) Swift: Primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics 25:2194-2199