Difference between revisions of "Software"

From Genome Analysis Wiki
Line 1: Line 1:
= Software Page Overview =
+
=Software=
 +
Due to the increasing volume of next generation sequencing and genotyping data, we have created a C++ library and tools that use that library.
  
 
This page points to downloads, documentation, and papers for software written at the [http://genome.sph.umich.edu Center for Statistical Genetics].
 
  
 +
=StatGen C++ Software=
 +
A library and set of tools developed for handling and analyzing next generation sequencing and genotyping data.
  
= [[Read Mapping]] =
+
== Download ==
  
==[[Karma|Karma]]==
 
Our fast short read aligner, which generates [[Mapping Quality Scores]]
 
  
==[[Karma-colorspace|Karma-ColorSpace]]==
+
== Library ==
QUICKSTART on mapping color space reads
+
* [[C++ Library: libStatGen]] - Library containing easy-to-use APIs for developing tools for processing and analyzing next generation sequencing and genotyping data.  Allows easy processing of SAM/BAM, GLF, FASTQ.
  
==[[Examples|Examples]]==
 
Sample command lines with discussion
 
  
==[[MapabilityScores]]==
+
== Tools ==
Definitions of various mappability scores adopted at UCSC genome browser.
+
=== SAM/BAM ===
  
==Evaluation of Mappers==
+
==== General Tools ====
[[baseQualityCheck]] is a mature tool to calculate the observed base quality vs. empirical base quality.
+
*[[QPLOT]] - Calculate & plot summary statistics
 +
*[[BamValidator]] – Check file format & print statistics
 +
*[[C++ Executable: bam#convert|Convert]] – Convert between SAM & BAM
 +
*[[C++ Executable: bam#writeRegion|WriteRegion]] – Write only reads in the specified region
 +
*[[Pileup]] – Pileup every base or just bases in specified region and write VCF - <span style="color:#D2691E">Coming Soon</span>
 +
*[[C++ Executable: bam#readIndexedBam|ReadIndexedBam]] - Read an indexed BAM file one reference at a time, from reference id -1 through the max reference id, and write it out as a SAM/BAM file
  
= Variant Calling =
 
  
==[[glfSingle]]==
+
==== Update the File ====
Variant calling for a single, deeply sequenced individual
+
*[[RGMergeBam]] – Merge sorted BAM files adding Read Groups
 +
*[[PolishBam]] – Add/Update header lines & add RG tag to each record
 +
*[[TrimBam]] – Trim end of reads, changing read ends to ‘N’ & quality to ‘!’
 +
*[[C++ Executable: bam#filter|Filter]] – Soft clip ends with too high mismatch % and mark unmapped if quality of mismatches is too high
  
==[[glfTrio]]==
 
Variant calling for a single, deeply sequenced nuclear family with two parents and one child
 
  
==[[glfMultiples]]==
+
==== Split the File ====
Variant calling for multiple, unrelated individuals
+
*[[SplitBam]] – Split into 1 file per Read Group
 +
*[[C++ Executable: bam#splitChromosome|SplitChromosome]] – Split into 1 file per Chromosome
  
= Variant Annotation =
 
  
==[[vcfCodingSnps]]==
+
==== Helper Tools to Print Readable Information ====
Annotate coding variants in a VCF file.
+
*[[C++ Executable: bam#dumpHeader|DumpHeader]] - Print the File Header to the screen.
 +
*[[C++ Executable: bam#dumpRefInfo|DumpRefInfo]] - Print the reference information from the SAM/BAM header.
 +
*[[C++ Executable: bam#dumpIndex|DumpIndex]] - Print the BAM Index to the screen in a readable format
 +
*[[C++ Executable: bam#readReference|ReadReference]] - Print the reference string for the specified region to the screen.
  
= Quality Control Utilities =
 
  
== Validators ==
 
  
[[C++ Executable: fastQValidator|FastQValidator]] -- Check that a FASTQ file conforms to specification.
+
=== FASTQ ===
 +
* [[FastQValidator|fastqValidator]] - validate a FASTQ file
 +
**Reports errors for badly formatted files
 +
**Reports Base Composition Statistics (%reads at each read index)
  
[[GenotypeIDcheck]] -- Check that mapped reads are consistent with known genotypes for each individual.
 
  
[[BamValidator]] -- Checks that a SAM/BAM file conforms to specification and generates some statistics on the file.
+
=== Other Tools ===
 +
*[[VcfGenomeStat]] – Print flanking sequences and how often they appear for input VCF file
  
== File Readers ==
+
=Other Tools=
  
[[C++ Library: libbam|BamFile]] -- Reads a BAM/SAM file.
+
== [[Read Mapping]] ==
 +
*[[Karma|Karma]] - Our fast short read aligner, which generates [[Mapping Quality Scores]]
 +
*[[Karma-colorspace|Karma-ColorSpace]] - QUICKSTART on mapping color space reads
 +
*[[baseQualityCheck]] - a mature tool to calculate the observed base quality vs. empirical base quality (helps to evaluate mappers)
  
[[C++ Library: libfqf|FastQFile]] -- Read a FASTQ file sequence by sequence. Validating the sequence as it is read.
+
*[[Examples|Examples]] - Sample command lines with discussion
 +
 
 +
*[[MapabilityScores]] - Definitions of various mappability scores adopted by the UCSC Genome Browser.
 +
 
 +
 
 +
 
 +
==SAM/BAM==
 +
*[[VerifyBamID]] – Check sample identities for contamination/sample swap
 +
**Genotype concordance based detection
 +
**Estimate based on population allele frequencies without genotype data
 +
*Recalibrator – Resource-efficient tool, which recalibrates base qualities based on an adaptive logistic regression model - <span style="color:#D2691E">Available upon request</span>
 +
*Deduper – Mark or remove duplicates - <span style="color:#D2691E">Coming Soon</span>
 +
 
 +
== Variant Calling ==
 +
* [[glfSingle]] - Variant calling for a single, deeply sequenced individual
 +
* [[glfTrio]] - Variant calling for a single, deeply sequenced nuclear family with two parents and one child
 +
* [[glfMultiples]] - Variant calling for multiple, unrelated individuals
 +
 
 +
== Variant Annotation ==
 +
*[[vcfCodingSnps]] - Annotate coding variants in a VCF file.
 +
 
 +
== Quality Control ==
 +
*[[GenotypeIDcheck]] - Check that mapped reads are consistent with known genotypes for each individual.
  
 
== File Conversion ==
 
 
+
*[[bam2FastQ]] - Convert BAM files into FastQ files
[[bam2FastQ]] -- Convert BAM files into FastQ files
 
  
  
 
= [[Links to Sequence Analysis Tools|Other Useful Links]] =
 

Revision as of 01:11, 2 November 2010

Software

Due to the increasing volume of next generation sequencing and genotyping data, we have created a C++ library and tools that use that library.

This page points to downloads, documentation, and papers for software written at the Center for Statistical Genetics.

StatGen C++ Software

A library and set of tools developed for handling and analyzing next generation sequencing and genotyping data.

Download

Library

  • C++ Library: libStatGen - Library containing easy-to-use APIs for developing tools for processing and analyzing next generation sequencing and genotyping data. Allows easy processing of SAM/BAM, GLF, FASTQ.


Tools

SAM/BAM

General Tools

  • QPLOT - Calculate & plot summary statistics
  • BamValidator – Check file format & print statistics
  • Convert – Convert between SAM & BAM
  • WriteRegion – Write only reads in the specified region
  • Pileup – Pileup every base or just bases in specified region and write VCF - Coming Soon
  • ReadIndexedBam - Read an indexed BAM file one reference at a time, from reference id -1 through the max reference id, and write it out as a SAM/BAM file


Update the File

  • RGMergeBam – Merge sorted BAM files adding Read Groups
  • PolishBam – Add/Update header lines & add RG tag to each record
  • TrimBam – Trim end of reads, changing read ends to ‘N’ & quality to ‘!’
  • Filter – Soft clip ends with too high mismatch % and mark unmapped if quality of mismatches is too high
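
The TrimBam entry above describes changing read ends to ‘N’ and their qualities to ‘!’. A minimal sketch of that masking step follows. It is not the TrimBam source and does not use the libStatGen API: the `Read` struct and `trimEnds` function are hypothetical names, and masking the same number of bases at both ends is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical record type for illustration only; not a libStatGen class.
struct Read {
    std::string bases;      // e.g. "ACGTACGT"
    std::string qualities;  // Phred+33 string of the same length
};

// Mask n bases at each end of the read: bases become 'N' and qualities
// become '!' (the Phred+33 character for quality 0).
// Assumption: both ends are masked by the same count. Reads shorter than
// 2*n end up fully masked.
Read trimEnds(Read read, std::size_t n) {
    assert(read.bases.size() == read.qualities.size());
    const std::size_t len = read.bases.size();
    const std::size_t k = std::min(n, len);
    for (std::size_t i = 0; i < k; ++i) {
        read.bases[i] = 'N';
        read.qualities[i] = '!';
        read.bases[len - 1 - i] = 'N';
        read.qualities[len - 1 - i] = '!';
    }
    return read;
}
```

Masking keeps the read length and alignment coordinates unchanged, which is presumably why the tool overwrites the ends rather than deleting them.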


Split the File

  • SplitBam – Split into 1 file per Read Group
  • SplitChromosome – Split into 1 file per Chromosome


Helper Tools to Print Readable Information

  • DumpHeader - Print the File Header to the screen.
  • DumpRefInfo - Print the reference information from the SAM/BAM header.
  • DumpIndex - Print the BAM Index to the screen in a readable format
  • ReadReference - Print the reference string for the specified region to the screen.


FASTQ

  • fastqValidator - validate a FASTQ file
    • Reports errors for badly formatted files
    • Reports Base Composition Statistics (%reads at each read index)
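
The two kinds of report listed above, format errors and base composition per read index, can be sketched as follows. This is not the fastqValidator implementation: `FastqReport`, `checkFastq`, and `checkFastqText` are hypothetical names, and the only format rules checked are the basic four-line record layout ('@' header, sequence, '+' separator, quality string of equal length).

```cpp
#include <array>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical report type for illustration only.
struct FastqReport {
    std::size_t records = 0;  // records seen, valid or not
    std::size_t errors = 0;   // records that failed a format check
    // counts[i][b]: number of valid reads with base b at read index i,
    // where b indexes the characters "ACGTN" (anything else counts as N).
    std::vector<std::array<std::size_t, 5>> counts;
};

int baseIndex(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return 4;
    }
}

FastqReport checkFastq(std::istream& in) {
    FastqReport report;
    std::string id, seq, plus, qual;
    while (std::getline(in, id)) {
        if (id.empty()) continue;  // tolerate blank lines between records
        const bool complete = std::getline(in, seq) &&
                              std::getline(in, plus) &&
                              std::getline(in, qual);
        ++report.records;
        if (!complete || id[0] != '@' || plus.empty() || plus[0] != '+' ||
            seq.size() != qual.size()) {
            ++report.errors;
            if (!complete) break;  // truncated file: nothing more to read
            continue;
        }
        if (seq.size() > report.counts.size()) {
            report.counts.resize(seq.size(), std::array<std::size_t, 5>{});
        }
        for (std::size_t i = 0; i < seq.size(); ++i) {
            ++report.counts[i][baseIndex(seq[i])];
        }
    }
    return report;
}

// Convenience wrapper for checking FASTQ text held in memory.
FastqReport checkFastqText(const std::string& text) {
    std::istringstream in(text);
    return checkFastq(in);
}
```

Dividing `counts[i][b]` by the number of valid reads that reach index `i` gives the percentage of reads with each base at that index.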


Other Tools

  • VcfGenomeStat – Print flanking sequences and how often they appear for input VCF file

Other Tools

Read Mapping

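  • Karma - Our fast short read aligner, which generates Mapping Quality Scores
  • Karma-ColorSpace - QUICKSTART on mapping color space reads
  • baseQualityCheck - a mature tool to calculate the observed base quality vs. empirical base quality (helps to evaluate mappers)
  • Examples - Sample command lines with discussion
  • MapabilityScores - Definitions of various mappability scores adopted by the UCSC Genome Browser.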


SAM/BAM

  • VerifyBamID – Check sample identities for contamination/sample swap
    • Genotype concordance based detection
    • Estimate based on population allele frequencies without genotype data
  • Recalibrator – Resource-efficient tool, which recalibrates base qualities based on an adaptive logistic regression model - Available upon request
  • Deduper – Mark or remove duplicates - Coming Soon

Variant Calling

  • glfSingle - Variant calling for a single, deeply sequenced individual
  • glfTrio - Variant calling for a single, deeply sequenced nuclear family with two parents and one child
  • glfMultiples - Variant calling for multiple, unrelated individuals

Variant Annotation

  • vcfCodingSnps - Annotate coding variants in a VCF file.

Quality Control

  • GenotypeIDcheck - Check that mapped reads are consistent with known genotypes for each individual.

File Conversion

  • bam2FastQ - Convert BAM files into FastQ files


Other Useful Links