Base Caller Summaries
Latest revision as of 17:04, 12 March 2010
Standard Illumina Base Caller (Bustard)
Sequencing-by-Synthesis (SBS)
- DNA sample containing many copies of the same sequences is obtained and randomly fragmented
- Single-stranded DNA fragments attached to slide and amplified so that each fragment forms a cluster of copies
- DNA polymerase and the 4 terminator bases (each with a distinct fluorescent label) added
- Clusters excited by lasers and photos taken in optimal wavelengths for 4 fluorophores
- Fluorophores and terminators removed and process repeated for L cycles
Image Analysis
- Corrects for imperfect repositioning of camera and aberrations of lens by aligning images to reference from original cycle
- Signal for each cluster characterized as time series data of fluorescence intensities and noise
Base Calling
- Converts fluorescence signals into actual sequence data with quality scores
- Takes intensities of four channels for every cluster in each cycle and determines concentration of each base
- Renormalizes concentrations by multiplying by the ratio of the average concentration in the first cycle to that in the current cycle
- Uses a Markov model to determine a transition matrix giving the probabilities of phasing (no new base synthesized), prephasing (two new bases synthesized), and normal incorporation
- Uses the transition matrix and the observed concentrations of each base to determine the concentrations in the absence of phasing, and reports these as base calls
  - Assumes the crosstalk matrix is constant for a given sequencing run and that phasing affects all nucleotides in the same way
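The phasing model described above can be sketched in a few lines of Python. This is a minimal illustration, not Bustard's actual implementation: the function names, the per-cycle probabilities `p_phase` and `p_pre`, and the example values are all assumptions made for the sketch. It builds the distribution of strand positions cycle by cycle, simulates the blurred signal, and then solves the linear system to recover the intensities that would be seen without phasing.

```python
BASES = "ACGT"

def position_distribution(n_cycles, p_phase, p_pre):
    """Return M where M[t][k] is the fraction of strands in a cluster whose
    most recently incorporated base is template position k after cycle t
    (both 0-based). Each cycle a strand adds 0, 1 or 2 bases."""
    p_normal = 1.0 - p_phase - p_pre
    # state[j] = fraction of strands that have incorporated j bases so far
    state = [1.0] + [0.0] * n_cycles
    M = []
    for _ in range(n_cycles):
        nxt = [0.0] * (n_cycles + 1)
        for j, frac in enumerate(state):
            for step, prob in ((0, p_phase), (1, p_normal), (2, p_pre)):
                # clamp at the end of the modeled template
                nxt[min(j + step, n_cycles)] += frac * prob
        state = nxt
        M.append(state[1:])  # strands with 0 bases carry no label
    return M

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def correct_phasing(observed, p_phase, p_pre):
    """observed[t] holds the four channel intensities at cycle t."""
    n = len(observed)
    M = position_distribution(n, p_phase, p_pre)
    per_channel = [solve(M, [cycle[c] for cycle in observed]) for c in range(4)]
    return [[per_channel[c][t] for c in range(4)] for t in range(n)]

# Simulate a cluster with hypothetical phasing rates, then undo the blur.
true_seq = "ACGTTG"
ideal = [[1.0 if b == base else 0.0 for b in BASES] for base in true_seq]
M = position_distribution(len(true_seq), 0.01, 0.005)
observed = [[sum(M[t][k] * ideal[k][c] for k in range(len(ideal)))
             for c in range(4)] for t in range(len(ideal))]
corrected = correct_phasing(observed, 0.01, 0.005)
called = "".join(BASES[row.index(max(row))] for row in corrected)
```

Because one set of rates is applied to every channel, the sketch shares the assumption noted above that phasing affects all nucleotides in the same way.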
General Noise Factors
- Phasing
  - Failure of nucleotide incorporation or of block removal, or incorporation of more than one nucleotide in a single cycle
- Fading
  - Decay in fluorescent signal intensity with each cycle
  - Likely attributable to material loss during sequencing
- Crosstalk
  - The C and A channels overlap: a C label also fluoresces in the A channel (G and T overlap similarly)
  - Likely caused by overlap in the dye emission spectra
- T Accumulation
  - The fluorophores used for thymine are not always removed completely after each cycle
  - Intensity of the T signal increases across the sequencing run
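Two of these noise factors, crosstalk and fading, are easy to illustrate numerically. The sketch below is a toy model under stated assumptions: crosstalk is treated as a 2x2 linear mixing within the A/C pair and within the G/T pair, fading as a uniform per-cycle loss of intensity, and all function names and coefficients are hypothetical rather than taken from any of the base callers summarized here.

```python
from statistics import mean

def unmix_pair(obs_x, obs_y, x_into_y, y_into_x):
    """Invert the 2x2 mixing
        obs_x = true_x + y_into_x * true_y
        obs_y = x_into_y * true_x + true_y
    and return (true_x, true_y)."""
    det = 1.0 - x_into_y * y_into_x
    true_x = (obs_x - y_into_x * obs_y) / det
    true_y = (obs_y - x_into_y * obs_x) / det
    return true_x, true_y

def correct_crosstalk(obs, a_into_c, c_into_a, g_into_t, t_into_g):
    """obs is a tuple of (A, C, G, T) channel intensities for one cluster."""
    a, c, g, t = obs
    a, c = unmix_pair(a, c, a_into_c, c_into_a)
    g, t = unmix_pair(g, t, g_into_t, t_into_g)
    return (a, c, g, t)

def renormalize(cycle_intensities):
    """cycle_intensities[t] lists per-cluster intensities at cycle t.
    Each cycle is scaled by mean(first cycle) / mean(current cycle)."""
    reference = mean(cycle_intensities[0])
    return [[v * reference / mean(cycle) for v in cycle]
            for cycle in cycle_intensities]

# A pure C signal of 100 leaks 30% into the A channel (hypothetical value).
observed = (30.0, 100.0, 0.0, 0.0)
unmixed = correct_crosstalk(observed, a_into_c=0.2, c_into_a=0.3,
                            g_into_t=0.1, t_into_g=0.25)

# Two clusters losing 5% of their signal per cycle (hypothetical value).
faded = [[100 * 0.95 ** t, 80 * 0.95 ** t] for t in range(3)]
restored = renormalize(faded)
```

The renormalization step is the same ratio-of-averages scaling listed under Base Calling above, applied here to per-cluster totals for simplicity.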
Alta-Cyclic
Training Stage
- Learns run-specific noise patterns according to the model and uses a Support Vector Machine (SVM) to find an optimized solution that reduces the effect of the noise sources
- Half of training set used for cross-validation
Base Calling Stage
- Reports all sequences from run with optimized parameters
Differences from Standard Illumina Base Caller
- Calling parameters optimized empirically and tested to enhance accuracy of each run
- Calculates phasing parameters based on parametric model
- Dynamically tracks changes in crosstalk, which disrupt signals in later cycles
Probabilistic Base Calling
- Provides an alternative, probabilistic base calling method based on the fluorescence intensity quantifications that uses:
  - Extended IUPAC alphabet to code ambiguous bases
  - Information criterion to control length of trustable reads
- Reduces systematic bias by addressing:
  - Crosstalk
  - Dephasing
  - An optical effect in which tiles in the center of the image appear brighter, corrected by fitting a 2D loess model to the intensities and subtracting the difference between the fit and the median intensity
- Measures the level of uncertainty in base calling by entropy (the uncertainty in determining the correct kth base)
- Does not consider fine-tuning image analysis
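The entropy measure and the extended IUPAC alphabet can be illustrated as follows. The IUPAC ambiguity codes are standard, but the selection rule used here (take the smallest set of most probable bases whose cumulative probability reaches a threshold `mass`) is an assumption made for this sketch and is not necessarily the rule used by Rougemont et al.

```python
import math

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases covered.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def entropy_bits(probs):
    """Shannon entropy (in bits) of the base probabilities at one position."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def iupac_call(probs, mass=0.9):
    """Hypothetical rule: add bases in order of decreasing probability until
    their cumulative probability reaches `mass`, then emit the IUPAC code
    for that set of bases."""
    chosen, total = [], 0.0
    for base, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(base)
        total += p
        if total >= mass:
            break
    return IUPAC[frozenset(chosen)]

confident = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
purine = {"A": 0.50, "C": 0.01, "G": 0.48, "T": 0.01}
unknown = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
```

Entropy ranges from 0 bits for a certain call to 2 bits when all four bases are equally likely, so it gives a per-position measure of how much an ambiguity code is hiding.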
BayesCall
- Model-based approach to base calling
- Main goal is to model sequencing process by taking stochasticity into account and by explicitly modeling how errors may arise
- Obtain base calls by maximizing posterior distribution of sequences given observed data and assuming a uniform prior on sequences
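The role of the uniform prior can be made explicit with Bayes' rule. Writing S for a candidate sequence and I for the observed intensities (symbols introduced here for illustration, not taken from the BayesCall paper):

```latex
\hat{S} \;=\; \arg\max_{S} P(S \mid I)
        \;=\; \arg\max_{S} \frac{P(I \mid S)\, P(S)}{P(I)}
        \;=\; \arg\max_{S} P(I \mid S)
```

The last step holds because P(I) does not depend on S and a uniform P(S) is constant, so maximizing the posterior is equivalent to maximizing the likelihood of the observed data under the sequencing model.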
Swift
Performs both image analysis and base calling
Image Analysis
- Background subtraction – minimal pixel value within a window around each pixel subtracted from central pixel’s value
- Image correlation – alignment of images to reference cycle
- Object identification and intensity extraction
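The background subtraction step above amounts to a sliding-window minimum filter. A minimal sketch, assuming the image is a list of rows of pixel values and a square window of hypothetical `radius` (the window size Swift actually uses is not given here):

```python
def subtract_background(image, radius=1):
    """Subtract from each pixel the minimum value found in the
    (2 * radius + 1)-wide square window centered on it; the window is
    clipped at the image borders."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = image[y][x] - min(window)
    return out

# A single bright spot on a flat background of 10.
image = [
    [10, 10, 10],
    [10, 50, 10],
    [10, 10, 10],
]
cleaned = subtract_background(image)
```

In this example the flat background is removed everywhere and only the spot's excess intensity of 40 remains.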
Base Calling
- Corrects for crosstalk by performing linear regression on crosstalk plots and using the slope to derive a correction matrix; repeated iteratively until the slope is zero
- Corrects for phasing by ranking clusters by chastity (the ratio of the highest intensity to the sum of the top two intensities), using the top 400 clusters to estimate phasing, and applying the estimate as a correction
- After correction, base with maximum intensity chosen as called base
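The chastity ranking and the final maximum-intensity call can be sketched as below. The function names and the example intensities are hypothetical, and since how chastity is aggregated across cycles is not specified here, the sketch scores a single cycle per cluster.

```python
BASES = "ACGT"

def chastity(intensities):
    """Ratio of the highest intensity to the sum of the two highest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0

def most_chaste(clusters, n=400):
    """Return the ids of the n clusters with the highest chastity.
    `clusters` maps a cluster id to its (A, C, G, T) intensities."""
    ranked = sorted(clusters, key=lambda cid: chastity(clusters[cid]),
                    reverse=True)
    return ranked[:n]

def call_base(intensities):
    """Call the base whose (corrected) channel intensity is largest."""
    return BASES[max(range(4), key=lambda i: intensities[i])]

clusters = {
    "c1": (900, 40, 30, 10),    # clean A signal
    "c2": (300, 280, 20, 15),   # ambiguous between A and C
    "c3": (50, 60, 700, 100),   # clean G signal
}
best_two = most_chaste(clusters, n=2)
calls = {cid: call_base(vals) for cid, vals in clusters.items()}
```

Restricting the phasing estimate to high-chastity clusters means the estimate is driven by clusters whose signal is dominated by a single base.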
Ibis
Method
- Estimate sequencing chemistry model as a parameter directly from data using statistical learning
- Training set from Bustard output using raw cluster intensities
- Uses a base caller with one SVM classifier per cycle, each taking as input the intensity values of the current cycle as well as the previous and following cycles (if they exist)
- Data set created by aligning raw reads from a fraction of the tiles to a reference sequence, allowing mismatches
  - Half of this set used as a training set and the other half as a test set to check the results of training
- Estimate parameters for calculating a quality score given class assignment and distances to the classification/decision boundary from SVM
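The per-cycle inputs described above can be assembled as in the sketch below, which builds the feature vector for one cycle from neighboring cycles and splits a data set in half. It deliberately stops short of training an SVM. The function names, the feature ordering, and the random shuffle used for the split are assumptions for illustration, not details taken from Ibis.

```python
import random

def cycle_features(intensities, cycle):
    """Concatenate the four channel intensities of the previous, current
    and following cycle; neighbors that do not exist are skipped.
    `intensities[t]` is the (A, C, G, T) tuple of one cluster at cycle t."""
    features = []
    for t in (cycle - 1, cycle, cycle + 1):
        if 0 <= t < len(intensities):
            features.extend(intensities[t])
    return features

def split_half(examples, seed=0):
    """Shuffle reproducibly and split into training and test halves."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

# One cluster observed over four cycles (hypothetical intensities).
cluster = [
    (900, 40, 30, 10),
    (35, 850, 20, 15),
    (25, 30, 800, 90),
    (20, 25, 85, 760),
]
first = cycle_features(cluster, 0)
middle = cycle_features(cluster, 1)
last = cycle_features(cluster, 3)
train, test = split_half(range(10))
```

First and last cycles therefore get 8 features and interior cycles 12, which is one reason a separate classifier per cycle is a natural fit.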
Comparison
- Unlike Alta-Cyclic, includes base-specific phasing parameters, so it can correct raw intensities for T accumulation
- Does not call an 'N' character for poor quality bases
- Process is unique in that causes of sequencing error are not modeled separately
  - Causes are considered together by using neighboring signals in the statistical learning procedure
References
Erlich, Y., Mitra, P.P., delaBastide, M., McCombie, W.R., Hannon, G.J. (2008) Alta-Cyclic: A self-optimizing base caller for next-generation sequencing. Nature Methods 5:679-682
Kao, W.-C., Stevens, K., Song, Y.S. (2009) BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome Research 19:1884-1895
Kircher, M., Stenzel, U., Kelso, J. (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biology 10(8):Article R83
Rougemont, J., Amzallag, A., Iseli, C., Farinelli, L., Xenarios, I., Naef, F. (2008) Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics 9:Article 431
Whiteford, N., Skelly, T., Curtis, C., Ritchie, M.E., Löhr, A., Zaranek, A.W., Abnizova, I., Brown, C. (2009) Swift: Primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics 25:2194-2199