SAV File Format

From Genome Analysis Wiki
Revision as of 15:51, 19 April 2018 by Lefaivej (talk | contribs)

SAV (Sparse Allele Vectors) is a file format for storing very large sets of genotypes and haplotype dosages. It produces small files and is optimized for fast association analysis. The design supports fine-grained random access to variants and places no limits on variant annotation, sample size, ploidy level, or genome length.

SAV capitalizes on the sparsity of genetic variation to both compress data and reduce deserialization overhead. Because the proportion of rare variants increases with sample size, the compression ratio and efficiency of the format both improve as study sizes grow. Beyond the I/O and computational savings that come from a smaller storage footprint, further computational efficiency can be achieved through sparse matrix operations.
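
To illustrate why sparsity helps, the following is a minimal sketch of a sparse allele vector and a sparse dot product against a dense phenotype vector. The type SparseAlleleVector and the functions to_sparse and sparse_dot are hypothetical names chosen for this example. They are not the SAV on-disk layout and are not part of the Savvy API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory representation, for illustration only.
// Only non-zero alleles are stored, as (offset, value) pairs.
struct SparseAlleleVector {
    std::size_t size = 0;                // total number of entries (e.g. samples x ploidy)
    std::vector<std::uint32_t> offsets;  // ascending positions of the non-zero entries
    std::vector<float> values;           // allele or dosage value at each stored position
};

// Build the sparse form from a dense vector by keeping non-zero entries.
SparseAlleleVector to_sparse(const std::vector<float>& dense) {
    SparseAlleleVector sv;
    sv.size = dense.size();
    for (std::size_t i = 0; i < dense.size(); ++i) {
        if (dense[i] != 0.0f) {
            sv.offsets.push_back(static_cast<std::uint32_t>(i));
            sv.values.push_back(dense[i]);
        }
    }
    return sv;
}

// Dot product with a dense vector y. The loop visits only the stored
// non-zero entries, so the cost scales with the number of non-zero
// alleles rather than with the total number of entries.
double sparse_dot(const SparseAlleleVector& x, const std::vector<double>& y) {
    assert(x.size == y.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < x.offsets.size(); ++k) {
        sum += static_cast<double>(x.values[k]) * y[x.offsets[k]];
    }
    return sum;
}
```

For a rare variant carried by a handful of haplotypes among many thousands, both the stored data and the work in sparse_dot are proportional to the number of carriers.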

Accompanying the SAV format is the Savvy C++ programming library for interfacing with it and other file formats. This library was designed for efficient association analysis and provides a mechanism to plug in linear algebra and numerical libraries, which reduces the overhead of copying data and lowers the memory footprint.
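
The idea of plugging in a caller's own vector type can be sketched with a function template that writes decoded values directly into a destination container, so no intermediate copy is needed. This is only an illustration of the general technique. The function scatter_into is a hypothetical helper and is not taken from Savvy. It assumes only that the destination type provides size() and operator[].

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the Savvy API.
// Writes sparse (offset, value) pairs straight into a caller-supplied
// dense container. Vec may be any type that offers size() and operator[].
// Every offset is assumed to be smaller than dest.size().
template <typename Vec>
void scatter_into(const std::vector<std::uint32_t>& offsets,
                  const std::vector<float>& values,
                  Vec& dest) {
    // Reset the destination so entries that are not stored read as zero.
    for (std::size_t i = 0; i < static_cast<std::size_t>(dest.size()); ++i) {
        dest[i] = 0;
    }
    // Place each stored value at its position in the caller's container.
    for (std::size_t k = 0; k < offsets.size(); ++k) {
        dest[offsets[k]] = values[k];
    }
}
```

Because the destination type is a template parameter, the same code fills a std::vector<double> or a vector type from a numerical library without first materializing a separate buffer.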