Software
Because the volume of next generation sequencing and genotyping data keeps increasing, we have created a C++ library and a set of tools that use that library.
This page points to downloads, documentation, and papers for software written here at the Center for Statistical Genetics.
StatGen C++ Software
We have developed a C++ library and tools for handling and analyzing next generation sequencing and genotyping data.
Library
The library contains easy-to-use APIs for developing tools that process and analyze next generation sequencing and genotyping data. It supports easy processing of SAM/BAM, GLF, and FASTQ files (VCF support is coming).
More information on the library can be found at: C++ Library: libStatGen
The library can be downloaded at: libStatGen Download
Programs/Tools
Follow the program links for more information on obtaining each tool. Some tools are packaged together.
SAM/BAM
General Tools
- QPLOT - Calculate & plot summary statistics
- Validate – Check file format & print statistics
- VerifyBamID – Check sample identities for contamination/sample swap
  - Genotype concordance based detection
  - Estimate based on population allele frequencies without genotype data
- Diff - Print the differences between 2 BAM files
- Stats - Generate some statistics for a SAM/BAM file
Rewrite SAM/BAM file
- Convert – Convert between SAM & BAM
- SplitBam – Split into 1 file per Read Group
- SplitChromosome – Split into 1 file per Chromosome
- WriteRegion – Write only reads that fall in the specified region and/or have the specified read name
- Pileup – Pileup every base or just bases in specified region and write VCF
- ReadIndexedBam - Read an indexed BAM file one reference at a time, from reference id -1 to the max reference id, and write it out as a SAM/BAM file
Update the File
- SuperDeDuper - Determine duplicate alignments, either marking or removing the lower quality duplicates. In addition, it may modify paired-end reads where the ends overlap by soft clipping the end with the lower quality bases in the region of overlap.
- RGMergeBam – Merge sorted BAM files adding Read Groups
- PolishBam – Add/Update header lines & add RG tag to each record
- TrimBam – Trim end of reads, changing read ends to ‘N’ & quality to ‘!’
- Filter – Soft clip ends with too high mismatch % and mark unmapped if quality of mismatches is too high
- Revert - Revert a SAM/BAM file, replacing the specified fields with their previous values (if known) and removing the specified tags
- Squeeze - Reduce file size by dropping OQ fields, duplicates, and specified tags, using '=' when a base matches the reference, binning quality scores, and replacing readNames with unique integers
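The masking that TrimBam is described as doing (read ends changed to 'N', qualities changed to '!') can be illustrated with a minimal sketch. This is not TrimBam's code and does not use libStatGen: the `Read` struct and the `maskEnds` function are hypothetical names, and the sketch assumes the same number of positions is masked at both ends and that qualities are stored as a Phred+33 string of the same length as the bases.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical minimal read record for illustration; not a libStatGen type.
struct Read {
    std::string bases;      // e.g. "ACGTACGT"
    std::string qualities;  // Phred+33 string, assumed same length as bases
};

// Mask up to `count` positions at each end of the read: bases become 'N'
// and qualities become '!' (Phred+33 for quality 0). The read keeps its
// length, so alignment coordinates are not affected.
void maskEnds(Read& read, std::size_t count) {
    // Guard against mismatched lengths by working on the shorter of the two.
    const std::size_t len = std::min(read.bases.size(), read.qualities.size());
    const std::size_t n = std::min(count, len);
    for (std::size_t i = 0; i < n; ++i) {
        read.bases[i] = 'N';
        read.qualities[i] = '!';
        read.bases[len - 1 - i] = 'N';
        read.qualities[len - 1 - i] = '!';
    }
}
```

Under these assumptions, calling `maskEnds` with a count of 2 on a read with bases `ACGTACGT` leaves the middle four bases untouched and masks two positions at each end. A count larger than the read length masks the whole read.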
Helper Tools to Print Readable Information
- DumpHeader - Print the File Header to the screen.
- DumpRefInfo - Print the reference information from the SAM/BAM header.
- DumpIndex - Print the BAM Index to the screen in a readable format
- ReadReference - Print the reference string for the specified region to the screen.
FASTQ
- fastqValidator - Validate a FASTQ file
  - Reports errors for badly formatted files
  - Reports Base Composition Statistics (%reads at each read index)
Other Tools
- createUMref - Create the University of Michigan formatted reference used by many of our tools
- thunderVCF
- vcfCooker – Manipulate, filter, summarize VCF/BED file in various forms
- VcfGenomeStat – Print flanking sequences and how often they appear for input VCF file
Requested Tools
Other Tools
Read Mapping
- Karma - Our fast short read aligner, which generates Mapping Quality Scores
- Karma-ColorSpace - QUICKSTART on mapping color space reads
- baseQualityCheck - a mature tool to calculate the observed base quality vs. empirical base quality (helps to evaluate mappers)
- Examples - Sample command lines with discussion
- MapabilityScores - Definitions of the various mappability scores used by the UCSC Genome Browser
SAM/BAM
- Recalibrator – Resource-efficient tool, which recalibrates base qualities based on an adaptive logistic regression model - Available upon request
- Deduper – Mark or remove duplicates - Coming Soon
Variant Calling
- glfSingle - Variant calling for a single, deeply sequenced individual
- glfTrio - Variant calling for a single, deeply sequenced nuclear family with two parents and one child
- glfMultiples - Variant calling for multiple, unrelated individuals
- polymutt - Variant and de novo mutation detection in families (nuclear or extended pedigrees) from sequencing
Variant Annotation
- vcfCodingSnps - Annotate coding variants in a VCF file.
Quality Control
- GenotypeIDcheck - Check that mapped reads are consistent with known genotypes for each individual.
File Conversion
- bam2FastQ - Convert BAM files into FastQ files
Other Useful Links
Other
ASHG 2010 Poster: C++ library & tools for next generation sequence data