Difference between revisions of "Software"

From Genome Analysis Wiki
(→‎Programs/Tools: fix bam validate name)
(→‎Programs/Tools: Add new BamUtil tools.)
Line 26: Line 26:
  *[[QPLOT]] - Calculate & plot summary statistics
  *[[BamUtil: validate|Validate]] – Check file format & print statistics
+ *[[VerifyBamID]] – Check sample identities for contamination/sample swap
+ **Genotype concordance based detection
+ **Estimate based on population allele frequencies without genotype data
+ *[[BamUtil: diff|Diff]] - Print the diffs between 2 bams
+ *[[BamUtil: stats|Stats]] - Print the diffs between 2 bams
+
+
+ ==== Rewrite SAM/BAM file ====
  *[[BamUtil: convert|Convert]] – Convert between SAM & BAM
- *[[BamUtil: writeRegion|WriteRegion]] – Write only reads in the specified region
+ *[[SplitBam]] – Split into 1 file per Read Group
+ *[[BamUtil: splitChromosome|SplitChromosome]] – Split into 1 file per Chromosome
+ *[[BamUtil: writeRegion|WriteRegion]] – Write only reads in the specified region and/or have the specified read name
  *[[Pileup]] – Pileup every base or just bases in specified region and write VCF
  *[[BamUtil: readIndexedBam|ReadIndexedBam]] - Read an indexed BAM file reference by reference id -1 to the max reference id and write it out as a SAM/BAM file
- *[[VerifyBamID]] – Check sample identities for contamination/sample swap
- **Genotype concordance based detection
- **Estimate based on population allele frequencies without genotype data

Line 41: Line 48:
  *[[TrimBam]] – Trim end of reads, changing read ends to ‘N’ & quality to ‘!’
  *[[BamUtil: filter|Filter]] – Soft clip ends with too high mismatch % and mark unmapped if quality of mismatches is too high
- ==== Split the File ====
- *[[SplitBam]] – Split into 1 file per Read Group
- *[[BamUtil: splitChromosome|SplitChromosome]] – Split into 1 file per Chromosome
+ *[[BamUtil: revert|Revert]] - Revert SAM/BAM replacing the specified fields with their previous values (if known) and removes specified tags
+ *[[BamUtil: squeeze|Squeeze]] - Reduce files size by dropping OQ fields, duplicates, specified tags, using '=' when a base matches the reference, binning quality scores, and replacing readNames with unique integers

  ==== Helper Tools to Print Readable Information ====

Revision as of 15:08, 2 September 2011


Software

Due to the increasing volume of next-generation sequencing and genotyping data, we have created a C++ library and a set of tools that use it.

This page points to downloads, documentation, and papers for software written here at the Center for Statistical Genetics.

StatGen C++ Software

A library and set of tools developed for handling and analyzing next-generation sequencing and genotyping data.

StatGen Repository

Download

StatGen Download

Library

  • C++ Library: libStatGen - Library containing easy-to-use APIs for developing tools for processing and analyzing next generation sequencing and genotyping data. Allows easy processing of SAM/BAM, GLF, FASTQ.


Programs/Tools

SAM/BAM

General Tools

  • QPLOT - Calculate & plot summary statistics
  • Validate – Check file format & print statistics
  • VerifyBamID – Check sample identities for contamination/sample swap
    • Genotype concordance based detection
    • Estimate based on population allele frequencies without genotype data
  • Diff - Print the differences between 2 BAM files
  • Stats - Print statistics for a SAM/BAM file


Rewrite SAM/BAM file

  • Convert – Convert between SAM & BAM
  • SplitBam – Split into 1 file per Read Group
  • SplitChromosome – Split into 1 file per Chromosome
  • WriteRegion – Write only reads that fall in the specified region and/or have the specified read name
  • Pileup – Pileup every base or just bases in specified region and write VCF
  • ReadIndexedBam - Read an indexed BAM file one reference at a time, from reference id -1 to the max reference id, and write it out as a SAM/BAM file


Update the File

  • SuperDeDuper - Determine duplicate alignments, either marking or removing the lower quality duplicates. In addition, it may modify paired-end reads where the ends overlap by soft clipping the end with the lower quality bases in the region of overlap.
  • RGMergeBam – Merge sorted BAM files adding Read Groups
  • PolishBam – Add/Update header lines & add RG tag to each record
  • TrimBam – Trim end of reads, changing read ends to ‘N’ & quality to ‘!’
  • Filter – Soft clip ends with too high mismatch % and mark unmapped if quality of mismatches is too high
  • Revert - Revert a SAM/BAM file, replacing the specified fields with their previous values (if known) and removing the specified tags
  • Squeeze - Reduce file size by dropping OQ fields, duplicates, and specified tags, using '=' when a base matches the reference, binning quality scores, and replacing readNames with unique integers
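
The TrimBam entry above describes masking read ends rather than shortening the read. A minimal sketch of that idea follows, assuming the same number of positions is masked at each end. The function name maskReadEnds is hypothetical and is not part of TrimBam or libStatGen; the code works on plain strings rather than on SAM/BAM records.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; not TrimBam's actual code.
// Masks `numBases` positions at each end of a read: bases become 'N' and
// the matching quality characters become '!' (the lowest Phred+33 value).
// The read keeps its original length.
void maskReadEnds(std::string& bases, std::string& quals, std::size_t numBases)
{
    // Only touch positions that exist in both strings.
    const std::size_t len = std::min(bases.size(), quals.size());
    // Never mask more positions than the read has.
    const std::size_t n = std::min(numBases, len);
    for (std::size_t i = 0; i < n; ++i)
    {
        bases[i] = 'N';
        quals[i] = '!';
        bases[len - 1 - i] = 'N';
        quals[len - 1 - i] = '!';
    }
}
```

Because the sequence and quality strings keep their length, a record masked this way would not need changes to its other fields. Whether the real tool treats the two ends symmetrically is an assumption of this sketch.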

Helper Tools to Print Readable Information

  • DumpHeader - Print the File Header to the screen.
  • DumpRefInfo - Print the reference information from the SAM/BAM header.
  • DumpIndex - Print the BAM Index to the screen in a readable format
  • ReadReference - Print the reference string for the specified region to the screen.


FASTQ

  • fastqValidator - validate a FASTQ file
    • Reports errors for badly formatted files
    • Reports Base Composition Statistics (%reads at each read index)
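
The base composition statistic mentioned above (percentage of reads with each base at each read index) can be sketched as follows. This is an illustration under stated assumptions, not fastqValidator's code: the names BaseCounts, baseSlot, countBasesByIndex, and percentAtIndex are hypothetical, the input is a plain list of read sequences rather than a parsed FASTQ file, and every character other than A, C, G, or T is counted in a single "other" slot.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names throughout; for illustration only.
// Slots: 0 = A, 1 = C, 2 = G, 3 = T, 4 = anything else (for example N).
using BaseCounts = std::array<std::size_t, 5>;

std::size_t baseSlot(char base)
{
    switch (base)
    {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default:            return 4;
    }
}

// Count, for every read index, how many reads carry each base there.
std::vector<BaseCounts> countBasesByIndex(const std::vector<std::string>& reads)
{
    std::vector<BaseCounts> counts;
    for (const std::string& read : reads)
    {
        // Grow the table when a longer read is seen; new entries start at zero.
        if (read.size() > counts.size())
        {
            counts.resize(read.size(), BaseCounts{});
        }
        for (std::size_t i = 0; i < read.size(); ++i)
        {
            ++counts[i][baseSlot(read[i])];
        }
    }
    return counts;
}

// Percentage of the reads covering `index` that have `base` at that index.
// Returns 0.0 when no read covers the index.
double percentAtIndex(const std::vector<BaseCounts>& counts,
                      std::size_t index, char base)
{
    if (index >= counts.size())
    {
        return 0.0;
    }
    std::size_t total = 0;
    for (std::size_t c : counts[index])
    {
        total += c;
    }
    if (total == 0)
    {
        return 0.0;
    }
    return 100.0 * static_cast<double>(counts[index][baseSlot(base)])
                 / static_cast<double>(total);
}
```

The denominator at each index is the number of reads long enough to cover that index, so reads of different lengths do not dilute the percentages at later positions. That choice is a design assumption of this sketch.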


Other Tools

  • thunderVCF
  • vcfCooker – Manipulate, filter, and summarize VCF/BED files in various forms
  • VcfGenomeStat – Print flanking sequences and how often they appear for input VCF file


Requested Tools

BAM to FASTQ

Other Tools

Read Mapping

  • Examples - Sample command lines with discussion
  • MapabilityScores - Definitions of the various mappability scores adopted by the UCSC Genome Browser.


SAM/BAM

  • Recalibrator – Resource-efficient tool that recalibrates base qualities using an adaptive logistic regression model - Available upon request
  • Deduper – Mark or remove duplicates - Coming Soon

Variant Calling

  • glfSingle - Variant calling for a single, deeply sequenced individual
  • glfTrio - Variant calling for a single, deeply sequenced nuclear family with two parents and one child
  • glfMultiples - Variant calling for multiple, unrelated individuals
  • polymutt - Variant and de novo mutation detection in families (nuclear or extended pedigrees) from sequencing data

Variant Annotation

Quality Control

  • GenotypeIDcheck - Check that mapped reads are consistent with known genotypes for each individual.

File Conversion

  • bam2FastQ - Convert BAM files into FastQ files


Other Useful Links

Links to Sequence Analysis Tools

Other

ASHG 2010 Poster: C++ library & tools for next generation sequence data