Difference between revisions of "Software"
From Genome Analysis Wiki
Revision as of 01:11, 2 November 2010
Software
Due to the increasing volume of next-generation sequencing and genotyping data, we have created a C++ library and a set of tools that use that library.
This page points to downloads, documentation, and papers for software written here at the Center for Statistical Genetics (http://genome.sph.umich.edu).
StatGen C++ Software
A library and set of tools developed for handling and analyzing next-generation sequencing and genotyping data.
Download
Library
- C++ Library: libStatGen - Library containing easy-to-use APIs for developing tools for processing and analyzing next generation sequencing and genotyping data. Allows easy processing of SAM/BAM, GLF, FASTQ.
Tools
SAM/BAM
General Tools
- QPLOT - Calculate & plot summary statistics
- BamValidator – Check file format & print statistics
- Convert – Convert between SAM & BAM
- WriteRegion – Write only reads in the specified region
- Pileup – Pile up every base, or only the bases in a specified region, and write VCF - Coming Soon
- ReadIndexedBam - Read an indexed BAM file one reference at a time, from reference id -1 through the maximum reference id, and write it out as a SAM/BAM file
Update the File
- RGMergeBam – Merge sorted BAM files adding Read Groups
- PolishBam – Add/Update header lines & add RG tag to each record
- TrimBam – Trim end of reads, changing read ends to ‘N’ & quality to ‘!’
- Filter – Soft clip read ends whose mismatch percentage is too high, and mark reads as unmapped if the quality of the mismatches is too high
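The trimming described for TrimBam can be illustrated with a minimal Python sketch. It is not TrimBam's implementation: the function name and the `trim_left`/`trim_right` parameters are hypothetical, and it works on plain sequence and quality strings rather than BAM records. It only shows the idea of keeping the read length unchanged while masking the ends with 'N' and the lowest quality character '!'.

```python
def trim_read_ends(seq, qual, trim_left, trim_right):
    """Mask the first trim_left and last trim_right bases with 'N' and '!'.

    Hypothetical helper: the read keeps its length, only the ends are masked.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have the same length")
    n = len(seq)
    # Never mask more bases than the read contains.
    left = min(trim_left, n)
    right = min(trim_right, n - left)
    keep = n - left - right
    masked_seq = "N" * left + seq[left:left + keep] + "N" * right
    masked_qual = "!" * left + qual[left:left + keep] + "!" * right
    return masked_seq, masked_qual
```

For example, `trim_read_ends("ACGTACGT", "IIIIIIII", 2, 3)` masks two bases on the left and three on the right while leaving the middle three untouched.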
Split the File
- SplitBam – Split into 1 file per Read Group
- SplitChromosome – Split into 1 file per Chromosome
Helper Tools to Print Readable Information
- DumpHeader - Print the File Header to the screen.
- DumpRefInfo - Print the reference information from the SAM/BAM header.
- DumpIndex - Print the BAM Index to the screen in a readable format
- ReadReference - Print the reference string for the specified region to the screen.
FASTQ
- fastqValidator - Validate a FASTQ file
  - Reports errors for badly formatted files
  - Reports Base Composition Statistics (% of reads with each base at each read index)
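The two kinds of report listed for the FASTQ validator can be sketched in a few lines of Python. This is an illustration only, not the fastqValidator code: it assumes simple four-line records (no wrapped sequence lines), checks just a handful of format rules, and all function names are hypothetical.

```python
from collections import Counter


def validate_fastq(lines):
    """Check 4-line FASTQ records; return (errors, per-index base composition)."""
    errors = []
    composition = []  # composition[i] counts the bases seen at read index i
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % 4 != 0:
        errors.append("file does not contain a whole number of 4-line records")
    # Only complete records are inspected; a trailing partial record is skipped.
    for start in range(0, len(lines) - len(lines) % 4, 4):
        header, seq, plus, qual = lines[start:start + 4]
        record = start // 4 + 1
        if not header.startswith("@"):
            errors.append(f"record {record}: header does not start with '@'")
        if not plus.startswith("+"):
            errors.append(f"record {record}: separator does not start with '+'")
        if len(seq) != len(qual):
            errors.append(f"record {record}: sequence and quality lengths differ")
        for i, base in enumerate(seq.upper()):
            if i == len(composition):
                composition.append(Counter())
            composition[i][base] += 1
    return errors, composition


def percent_at_index(composition, index):
    """Percentage of reads carrying each base at the given read index."""
    total = sum(composition[index].values())
    return {base: 100.0 * n / total for base, n in composition[index].items()}
```

Passing an open file object (or any list of lines) to `validate_fastq` yields the error list and the counts that `percent_at_index` turns into percentages for one read position.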
Other Tools
- VcfGenomeStat – Print flanking sequences and how often they appear for input VCF file
Other Tools
Read Mapping
- Karma - Our fast short read aligner, which generates Mapping Quality Scores
- Karma-ColorSpace - QUICKSTART on mapping color space reads
- baseQualityCheck - a mature tool to calculate the observed base quality vs. empirical base quality (helps to evaluate mappers)
- Examples - Sample command lines with discussion
- MapabilityScores - Definitions of the various mappability scores adopted by the UCSC Genome Browser.
SAM/BAM
- VerifyBamID – Check sample identities for contamination/sample swap
  - Genotype concordance based detection
  - Estimate based on population allele frequencies without genotype data
- Recalibrator – Resource-efficient tool, which recalibrates base qualities based on an adaptive logistic regression model - Available upon request
- Deduper – Mark or remove duplicates - Coming Soon
Variant Calling
- glfSingle - Variant calling for a single, deeply sequenced individual
- glfTrio - Variant calling for a single, deeply sequenced nuclear family with two parents and one child
- glfMultiples - Variant calling for multiple, unrelated individuals
Variant Annotation
- vcfCodingSnps - Annotate coding variants in a VCF file.
Quality Control
- GenotypeIDcheck - Check that mapped reads are consistent with known genotypes for each individual.
File Conversion
- bam2FastQ - Convert BAM files into FastQ files
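As a rough illustration of what a BAM-to-FASTQ conversion involves, the Python sketch below converts one text SAM alignment line into a FASTQ record. It is not the bam2FastQ implementation and the function name is hypothetical; it reads SAM text rather than binary BAM, and it ignores read pairing and secondary alignments. The one detail it shows is that reads flagged as reverse strand (FLAG bit 0x10) are stored reverse complemented, so the sequence must be reverse complemented and the qualities reversed to recover the original read.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
REVERSE_STRAND = 0x10  # SAM FLAG bit: SEQ is stored reverse complemented


def sam_line_to_fastq(sam_line):
    """Turn one tab-separated SAM alignment line into a 4-line FASTQ record."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & REVERSE_STRAND:
        # Restore the read as it came off the sequencer.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

Applying the function to every alignment line of a SAM file (skipping header lines that start with '@') and concatenating the results gives a FASTQ file under the assumptions stated above.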