Difference between revisions of "Software"

From Genome Analysis Wiki
(46 intermediate revisions by 3 users not shown)
Line 3: Line 3:
  
 
=Software=
 
=Software=
Due to increasing volume of next generation sequencing and genotyping data, we have created these created C++ library and tools that use that library.
+
Due to the increasing volume of next generation sequencing and genotyping data, we have created a C++ library and tools that use that library.
  
 
This page points to downloads, documentation, and papers for software that is written here at the [http://genome.sph.umich.edu Center for Statistical Genetics]
 
This page points to downloads, documentation, and papers for software that is written here at the [http://genome.sph.umich.edu Center for Statistical Genetics]
 +
 +
If you have any questions or comments, please email Mary Kate Wing (mktrost@umich.edu).
  
 
=StatGen C++ Software=
 
=StatGen C++ Software=
A library and set of set of tools developed for handling and analyzing next generation sequencing and genotyping data.
 
  
[[StatGen Repository]]
+
We have developed a C++ library and tools for handling and analyzing next generation sequencing and genotyping data.
 +
 
 +
== Library ==
  
== Download ==
+
The library contains easy-to-use APIs for developing tools for processing and analyzing next generation sequencing and genotyping data. It allows easy processing of SAM/BAM, GLF, and FASTQ files (VCF support is coming).
  
[[StatGen Download]]
+
More information on the library can be found at: [[C++ Library: libStatGen]]
 
== Library ==
 
* [[C++ Library: libStatGen]] - Library containing easy-to-use APIs for developing tools for processing and analyzing next generation sequencing and genotyping data.  Allows easy processing of SAM/BAM, GLF, FASTQ.
 
  
 +
The library can be downloaded at: [[libStatGen Download]]
  
 
== Programs/Tools ==
 
== Programs/Tools ==
 +
 +
Follow the program links for more information on obtaining the tool.  Some tools are packaged together.
 +
 
=== SAM/BAM ===
 
=== SAM/BAM ===
  
==== General Tools ====
 
 
*[[QPLOT]] - Calculate & plot summary statistics
 
*[[QPLOT]] - Calculate & plot summary statistics
*[[BamValidator]] – Check file format & print statistics
 
*[[C++ Executable: bam#convert|Convert]] – Convert between SAM & BAM
 
*[[C++ Executable: bam#writeRegion|WriteRegion]] – Write only reads in the specified region
 
*[[Pileup]] – Pileup every base or just bases in specified region and write VCF
 
*[[C++ Executable: bam#readIndexedBam|ReadIndexedBam]] - Read an indexed BAM file reference by reference id -1 to the max reference id and write it out as a SAM/BAM file
 
 
*[[VerifyBamID]] – Check sample identities for contamination/sample swap
 
*[[VerifyBamID]] – Check sample identities for contamination/sample swap
 
**Genotype concordance based detection
 
**Genotype concordance based detection
 
**Estimate based on population allele frequencies without genotype data
 
**Estimate based on population allele frequencies without genotype data
 +
*[[Pileup]] – Pileup every base or just bases in specified region and write VCF
  
 
+
==== BAM Util Tools ====
==== Update the File ====
+
{{BamUtilPrograms}}
*[[SuperDeDuper]] - Determine duplicate alignments, either marking or removing the lower quality duplicates. In addition, it may modify paired-end reads where the ends overlap by soft clipping the end with the lower quality bases in the region of overlap.
 
*[[RGMergeBam]] – Merge sorted BAM files adding Read Groups
 
*[[PolishBam]] – Add/Update header lines & add RG tag to each record
 
*[[TrimBam]] – Trim end of reads, changing read ends to ‘N’ & quality to ‘!’
 
*[[C++ Executable: bam#filter|Filter]] – Soft clip ends with too high mismatch % and mark unmapped if quality of mismatches is too high
 
 
 
==== Split the File ====
 
*[[SplitBam]] – Split into 1 file per Read Group
 
*[[C++ Executable: bam#splitChromosome|SplitChromosome]] – Split into 1 file per Chromosome
 
 
 
 
 
==== Helper Tools to Print Readable Information ====
 
*[[C++ Executable: bam#dumpHeader|DumpHeader]] - Print the File Header to the screen.
 
*[[C++ Executable: bam#dumpRefInfo|DumpRefInfo]] - Print the reference information from the SAM/BAM header.
 
*[[C++ Executable: bam#dumpIndex|DumpIndex]] - Print the BAM Index to the screen in a readable format
 
*[[C++ Executable: bam#readReference|ReadReference]] - Print the reference string for the specified region to the screen.
 
 
 
 
 
  
 
=== FASTQ ===
 
=== FASTQ ===
Line 60: Line 41:
 
**Reports Base Composition Statistics (%reads at each read index)
 
**Reports Base Composition Statistics (%reads at each read index)
  
 +
 +
=== Meta Analysis ===
 +
* [[Rare-Metal-Worker|RAREMETALWORKER - generate summary level statistics for meta analysis using Rare-Metal]]
 +
* [[Rare-Metal|RAREMETAL - perform genome-wide meta analysis of rare variants]]
  
 
=== Other Tools ===
 
=== Other Tools ===
 +
*[[statgenTools#createUMref|createUMref - Create the University of Michigan formatted reference used by many of our tools]]
 
*[[Thunder|thunderVCF]]
 
*[[Thunder|thunderVCF]]
 
*[[vcfCooker]] – Manipulate, filter, summarize VCF/BED file in various forms
 
*[[vcfCooker]] – Manipulate, filter, summarize VCF/BED file in various forms
Line 69: Line 55:
  
 
=== Requested Tools ===
 
=== Requested Tools ===
[[BAM to FASTQ]]
 
  
 
=Other Tools=
 
=Other Tools=
  
== [[Read Mapping]] ==
+
* [[samtools-hybrid]] - Since many of our tools still rely on GLF files and samtools stopped supporting GLF files, we created a version of samtools that still supports pileup to GLF files AND incorporates the updated BAQ logic. This version is called samtools-hybrid. The code can be downloaded at: https://github.com/statgen/samtools-0.1.7a-hybrid
*[[Karma|Karma]] - Our fast short read aligner, which generates [[Mapping Quality Scores]]
+
*[[baseQualityCheck]] - tool to calculate the observed base quality vs. empirical base quality (helps to evaluate mappers)
*[[Karma-colorspace|Karma-ColorSpace]] - QUICKSTART on mapping color space reads
 
*[[baseQualityCheck]] - a mature tool to calculate the observed base quality vs. empirical base quality (helps to evaluate mappers)
 
 
 
*[[Examples|Examples]] - Sample command lines with discussion
 
 
 
*[[MapabilityScores]] - Definitions of various mappability scores adopted at UCSC genome browser.
 
 
 
 
 
 
 
==SAM/BAM==
 
*Recalibrator – Resource-efficient tool, which recalibrates base qualities based on an adaptive logistic regression model - <span style="color:#D2691E">Available upon request</span>
 
*Deduper – Mark or remove duplicates - <span style="color:#D2691E">Coming Soon</span>
 
  
 
== Variant Calling ==
 
== Variant Calling ==
 
* [[glfSingle]] - Variant calling for a single, deeply sequenced individual
 
* [[glfSingle]] - Variant calling for a single, deeply sequenced individual
* [[glfTrio]]- Variant calling for a single, deeply sequenced nuclear family with two parents and one child
 
 
* [[glfMultiples]] - Variant calling for multiple, unrelated individuals
 
* [[glfMultiples]] - Variant calling for multiple, unrelated individuals
* [[Polymutt:_a_tool_for_calling_polymorphism_and_de_novo_mutations|polymutt]] - Polymorphism and ''de novo'' mutation detection in families from sequencing
+
* [[Polymutt|polymutt]] - Variant and ''de novo'' mutation detection in families (nuclear or extended pedigrees) from sequencing
  
 
== Variant Annotation ==
 
== Variant Annotation ==
 
*[[vcfCodingSnps]] - Annotate coding variants in a VCF file.
 
*[[vcfCodingSnps]] - Annotate coding variants in a VCF file.
  
== Quality Control ==
+
== Genotype Imputation ==
*[[GenotypeIDcheck]] - Check that mapped reads are consistent with known genotypes for each individual.
+
*[[Minimac3]] - Fast and Efficient Genotype Imputation.
 
 
== File Conversion ==
 
*[[bam2FastQ]] - Convert BAM files into FastQ files
 
  
 +
== Additional Pedigree & Sequence Analysis Tools ==
 +
Can be found at: http://sph.umich.edu/csg/abecasis/software.html
  
 
= Other Useful Links =
 
= Other Useful Links =

Revision as of 01:07, 31 January 2015


Software

Due to the increasing volume of next generation sequencing and genotyping data, we have created a C++ library and tools that use that library.

This page points to downloads, documentation, and papers for software written here at the Center for Statistical Genetics.

If you have any questions or comments, please email Mary Kate Wing (mktrost@umich.edu).

StatGen C++ Software

We have developed a C++ library and tools for handling and analyzing next generation sequencing and genotyping data.

Library

The library contains easy-to-use APIs for developing tools for processing and analyzing next generation sequencing and genotyping data. It allows easy processing of SAM/BAM, GLF, and FASTQ files (VCF support is coming).

More information on the library can be found at: C++ Library: libStatGen

The library can be downloaded at: libStatGen Download

Programs/Tools

Follow the program links for more information on obtaining the tool. Some tools are packaged together.

SAM/BAM

  • QPLOT - Calculate & plot summary statistics
  • VerifyBamID – Check sample identities for contamination/sample swap
    • Genotype concordance based detection
    • Estimate based on population allele frequencies without genotype data
  • Pileup – Pileup every base or just bases in specified region and write VCF

BAM Util Tools

BamUtil is built using libStatGen. Running bin/bam with no parameters will print the usage information for the bam executable. Running bin/bam subProgram will print the usage information for the BamUtil sub-program.

Tools to Rewrite SAM/BAM Files:

  • convert - Convert SAM/BAM to SAM/BAM (optionally converts between '=' & bases in the sequence)
  • writeRegion - Write a file with the reads that are in the specified region and/or have the specified read name
  • splitChromosome - Split BAM into 1 file per Chromosome
  • splitBam - Split BAM into 1 file per Read Group
  • findCigars - Output just the reads that contain any of the specified CIGAR operations.
  • BAM Recovery - Recover corrupted BAM files

Tools to Modify & write SAM/BAM Files:

  • clipOverlap - Clip overlapping read pairs in a SAM/BAM File already sorted by Coordinate or ReadName so they do not overlap
  • filter - Filter reads by soft clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high
  • revert - Revert a SAM/BAM file, replacing the specified fields with their previous values (if known) and removing specified tags
  • squeeze - Reduce file size by dropping OQ fields, duplicates, & specified tags, using '=' when a base matches the reference, binning quality scores, and replacing readNames with unique integers
  • trimBam - Trim the ends of reads in a SAM/BAM file changing read ends to 'N' and quality to '!' or by doing soft clips
  • mergeBam - Merge multiple BAMs and headers appending ReadGroupIDs if necessary
  • polishBam - Add/update header lines & add the RG tag to each record
  • dedup - Mark or remove duplicates, can also perform recalibration
  • recab - Recalibrate base qualities
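
The trimBam entry above describes changing read ends to 'N' and their qualities to '!'. That masking step can be sketched in Python as follows. This is only an illustration under stated assumptions, not BamUtil's implementation: the function name mask_read_ends and its parameters are hypothetical, and '!' is taken to be the lowest quality in the usual Phred+33 encoding.

```python
def mask_read_ends(seq, qual, n_left, n_right):
    """Mask n_left bases at the start and n_right bases at the end of a read.

    Masked bases become 'N' and their qualities become '!'.
    Hypothetical helper for illustration only.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    if n_left < 0 or n_right < 0:
        raise ValueError("trim lengths must be non-negative")
    length = len(seq)
    # Clamp so the two masked regions never overlap or run past the read.
    n_left = min(n_left, length)
    n_right = min(n_right, length - n_left)
    keep_end = length - n_right
    masked_seq = "N" * n_left + seq[n_left:keep_end] + "N" * n_right
    masked_qual = "!" * n_left + qual[n_left:keep_end] + "!" * n_right
    return masked_seq, masked_qual
```

For example, mask_read_ends("ACGTACGT", "IIIIIIII", 2, 3) keeps only the middle three bases. The read length is unchanged, which is what distinguishes this kind of masking from removing bases.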

Informational Tools:

  • validate - Validate a SAM/BAM File, checking file format & printing statistics
  • diff - Diff 2 coordinate sorted SAM/BAM files.
  • stats - Generate some basic statistics for a SAM/BAM file
  • gapInfo - Print information on the gap between read pairs in a SAM/BAM File.

Helper Tools to Print Information In Readable Format:

  • dumpHeader - Print the SAM/BAM Header to the screen
  • dumpRefInfo - Print SAM/BAM Reference Name Information from the header
  • dumpIndex - Print BAM Index File to the screen in a readable format
  • readReference - Print the reference string for the specified region to the screen
  • explainFlags - Describe SAM/BAM flags
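
To show what describing SAM/BAM flags involves, here is a minimal Python sketch that decodes the bitwise FLAG field using the bit values defined in the SAM specification. The function name explain_flags and the wording of the descriptions are assumptions made for this example; it does not reproduce the output format of the explainFlags tool.

```python
# Bit values follow the SAM specification; the descriptions are paraphrased.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "properly paired"),
    (0x4, "unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flags(flag):
    """Return the descriptions of every bit that is set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]
```

For example, explain_flags(99) decodes 0x63 into its four set bits, and explain_flags(0) returns an empty list.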

Additional Tools:

  • bam2FastQ - Convert the specified BAM file to fastQs.

Dummy/Example Tools:

  • readIndexedBam - Read an indexed BAM file reference by reference id -1 to the max reference id and write it out as a SAM/BAM file

ASP programs: ASP is a new format that is currently in production, so these tools are not yet available for public release.

  • asp - perform an asynchronous pileup producing an ASP file.
  • dumpAsp - perform an asynchronous pileup producing an ASP file.

FASTQ

  • fastqValidator - validate a FASTQ file
    • Reports errors for badly formatted files
    • Reports Base Composition Statistics (%reads at each read index)
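
The base composition statistic mentioned above (the percentage of reads with each base at each read index) can be sketched in Python as follows. This is an illustrative sketch, not fastqValidator's code: the function name base_composition is hypothetical, and the choice to compute each percentage over only the reads long enough to cover that index is an assumption.

```python
from collections import Counter


def base_composition(reads):
    """Return, for each read index, the percentage of reads with each base."""
    counts = []  # counts[i] tallies the bases seen at read index i
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    composition = []
    for counter in counts:
        total = sum(counter.values())  # reads that cover this index
        composition.append({b: 100.0 * n / total for b, n in counter.items()})
    return composition
```

For example, base_composition(["ACGT", "ACGA", "AAG"]) reports 100% 'A' at index 0, and at index 3 it reports 50% 'T' and 50% 'A' because only two of the three reads reach that position.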


Meta Analysis

  • RAREMETALWORKER - generate summary level statistics for meta analysis using Rare-Metal
  • RAREMETAL - perform genome-wide meta analysis of rare variants

Other Tools


Requested Tools

Other Tools

  • samtools-hybrid - Since many of our tools still rely on GLF files and samtools stopped supporting GLF files, we created a version of samtools that still supports pileup to GLF files AND incorporates the updated BAQ logic. This version is called samtools-hybrid. The code can be downloaded at: https://github.com/statgen/samtools-0.1.7a-hybrid
  • baseQualityCheck - tool to calculate the observed base quality vs. empirical base quality (helps to evaluate mappers)

Variant Calling

  • glfSingle - Variant calling for a single, deeply sequenced individual
  • glfMultiples - Variant calling for multiple, unrelated individuals
  • polymutt - Variant and de novo mutation detection in families (nuclear or extended pedigrees) from sequencing

Variant Annotation

  • vcfCodingSnps - Annotate coding variants in a VCF file.

Genotype Imputation

  • Minimac3 - Fast and Efficient Genotype Imputation.

Additional Pedigree & Sequence Analysis Tools

Can be found at: http://sph.umich.edu/csg/abecasis/software.html

Other Useful Links

Links to Sequence Analysis Tools

Other

ASHG 2010 Poster: C++ library & tools for next generation sequence data