Software


Because of the increasing volume of next generation sequencing and genotyping data, we have created a C++ library and a set of tools built on that library.

This page links to downloads, documentation, and papers for software written at the Center for Statistical Genetics.

If you have any questions or comments, please email Mary Kate Wing (mktrost@umich.edu).

StatGen C++ Software

We have developed a C++ library and tools for handling and analyzing next generation sequencing and genotyping data.

Library

The library contains easy-to-use APIs for developing tools that process and analyze next generation sequencing and genotyping data. It simplifies processing of SAM/BAM, GLF, and FASTQ files (VCF support is coming).

More information on the library can be found at: C++ Library: libStatGen

The library can be downloaded at: libStatGen Download

Programs/Tools

Follow the program links for more information on obtaining each tool. Some tools are packaged together.

SAM/BAM

  • QPLOT - Calculate & plot summary statistics
  • VerifyBamID – Check sample identities for contamination/sample swap
    • Genotype concordance based detection
    • Estimate based on population allele frequencies without genotype data
  • Pileup – Pile up every base, or only the bases in a specified region, and write VCF

BAM Util Tools

The following tools are part of the BamUtil program.

QC/Stats

  • validate – Check file format & print statistics
  • diff - Print the differences between two SAM/BAM files
  • stats - Generate some statistics for a SAM/BAM file

Rewrite SAM/BAM file

  • convert – Convert between SAM & BAM
  • splitBam – Split into 1 file per Read Group
  • splitChromosome – Split into 1 file per Chromosome
  • writeRegion – Write only reads that are in the specified region and/or have the specified read name
  • BAM Recovery - Recover corrupted BAM files
  • asp - Perform an asynchronous pileup, producing an ASP file. ASP is a new format that is still under development, so this tool is not yet available for public release.

File Updates

  • dedup – Mark or remove duplicates; can also perform recalibration
  • recab - Recalibrate base qualities based on an adaptive logistic regression model
  • clipOverlap - Clip overlapping read pairs so they do not overlap
  • filter – Soft clip read ends whose mismatch percentage is too high, and mark reads unmapped if the quality of their mismatches is too high
  • revert - Revert a SAM/BAM file, replacing the specified fields with their previous values (if known) and removing the specified tags
  • squeeze - Reduce file size by dropping OQ fields, duplicates, and specified tags; using '=' when a base matches the reference; binning quality scores; and replacing read names with unique integers
  • trimBam – Trim the ends of reads, changing the trimmed bases to ‘N’ and their qualities to ‘!’
  • polishBam – Add/Update header lines & add RG tag to each record
  • rgMergeBam – Merge sorted BAM files adding Read Groups
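The end masking that trimBam is described as performing (bases to 'N', qualities to '!') can be sketched in a few lines. This is only an illustration of the idea, not the BamUtil implementation: `ReadSketch` and `maskReadEnds` are hypothetical names, the same number of positions is assumed to be masked at both ends, and strand handling is ignored.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical record type for illustration only; not a libStatGen class.
struct ReadSketch {
    std::string bases;      // read sequence, e.g. "ACGT"
    std::string qualities;  // Phred+33 quality string, one character per base
};

// Mask `count` positions at each end of the read: bases become 'N' and
// qualities become '!' (Phred 0 in Phred+33). The read length is unchanged.
// Assumption: the same count is masked on both ends.
void maskReadEnds(ReadSketch& read, std::size_t count) {
    const std::size_t len = std::min(read.bases.size(), read.qualities.size());
    const std::size_t n = std::min(count, len);  // never mask past the read
    for (std::size_t i = 0; i < n; ++i) {
        read.bases[i] = 'N';
        read.qualities[i] = '!';
        read.bases[len - 1 - i] = 'N';
        read.qualities[len - 1 - i] = '!';
    }
}
```

Masking rather than deleting keeps the read length intact, so fields that depend on the sequence length would not need to be recomputed in this sketch.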

Additional Tools

  • bam2FastQ - Convert the specified BAM file to FASTQ files

Helper Tools to Print Readable Information

  • dumpHeader - Print the File Header to the screen.
  • dumpRefInfo - Print the reference information from the SAM/BAM header.
  • dumpIndex - Print the BAM Index to the screen in a readable format
  • dumpAsp - Print an ASP file in a readable format. ASP is a new format that is still under development, so this tool is not yet available for public release.
  • readReference - Print the reference string for the specified region to the screen.

FASTQ

  • fastqValidator - Validate a FASTQ file
    • Reports errors for badly formatted files
    • Reports base composition statistics (percentage of each base at each read index)
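As a rough illustration of what per-index base composition statistics involve, the sketch below tallies A, C, G, T, and N at each read position and converts a tally to a percentage. It is not the fastqValidator code: the names `BaseCounts`, `baseColumn`, `countBasesByIndex`, and `basePercent` are hypothetical, and any character other than A, C, G, or T is assumed to count as N.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Counts of A, C, G, T, and N at a single read index (hypothetical type).
using BaseCounts = std::array<std::size_t, 5>;

// Map a base character to a column: A=0, C=1, G=2, T=3, anything else=4 (N).
std::size_t baseColumn(char base) {
    switch (base) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return 4;
    }
}

// Tally bases at every read index; reads may have different lengths.
std::vector<BaseCounts> countBasesByIndex(const std::vector<std::string>& reads) {
    std::vector<BaseCounts> counts;
    for (const std::string& read : reads) {
        if (read.size() > counts.size()) {
            counts.resize(read.size(), BaseCounts{});  // new indexes start at zero
        }
        for (std::size_t i = 0; i < read.size(); ++i) {
            ++counts[i][baseColumn(read[i])];
        }
    }
    return counts;
}

// Percentage of reads covering this index that carry the given base column.
double basePercent(const BaseCounts& counts, std::size_t column) {
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    if (total == 0) return 0.0;
    return 100.0 * static_cast<double>(counts[column]) / static_cast<double>(total);
}
```

Because shorter reads do not contribute to later indexes, each percentage is relative to the number of reads that actually reach that index.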


Meta Analysis

Other Tools


Requested Tools

BAM to FASTQ

Other Tools

  • samtools-hybrid - Because many of our tools still rely on GLF files and samtools stopped supporting them, we created a version of samtools that still supports pileup to GLF files and incorporates the updated BAQ logic. This version is called samtools-hybrid. The code can be downloaded at: https://github.com/statgen/samtools-0.1.7a-hybrid
  • baseQualityCheck - Compare the observed base quality with the empirical base quality (helps to evaluate mappers)
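An empirical base quality is conventionally expressed on the Phred scale as Q = -10 * log10(mismatches / total bases). The sketch below shows only that standard conversion; it is not taken from baseQualityCheck, and the function name `empiricalQuality`, the `maxQuality` cap, and the handling of empty input are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Phred-scaled empirical quality from mismatch counts (hypothetical helper).
// Assumes mismatches <= total. With zero mismatches the error rate is 0 and
// Q is unbounded, so the result is capped at maxQuality (an assumed default).
// Returns 0.0 when no bases were observed, since there is no evidence.
double empiricalQuality(std::size_t mismatches, std::size_t total,
                        double maxQuality = 60.0) {
    if (total == 0) return 0.0;
    if (mismatches == 0) return maxQuality;
    const double errorRate =
        static_cast<double>(mismatches) / static_cast<double>(total);
    return std::min(maxQuality, -10.0 * std::log10(errorRate));
}
```

Grouping bases by their reported quality and comparing each group's reported value with its empirical value is one way such a comparison could be used to evaluate a mapper.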

Variant Calling

  • glfSingle - Variant calling for a single, deeply sequenced individual
  • glfMultiples - Variant calling for multiple, unrelated individuals
  • polymutt - Variant and de novo mutation detection in families (nuclear or extended pedigrees) from sequencing

Variant Annotation

Additional Pedigree & Sequence Analysis Tools

These can be found at: http://sph.umich.edu/csg/abecasis/software.html

Other Useful Links

Links to Sequence Analysis Tools

Other

ASHG 2010 Poster: C++ library & tools for next generation sequence data