SAV File Format

From Genome Analysis Wiki

SAV (Sparse Allele Vectors) is a file format for storing very large sets of genotypes and haplotype dosages. It produces small files and is optimized for fast association analysis. The design supports fine-grained random access to variants and places no limits on variant annotation, sample size, ploidy level or genome length.

SAV capitalizes on the sparsity of genetic variation to both compress data and reduce deserialization overhead. Because the proportion of rare variants increases with sample size, both the compression ratio and the efficiency of the format improve as study sizes grow. Beyond the I/O and computational savings attributable to a smaller storage footprint, further computational efficiency can be achieved through sparse matrix operations.
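To illustrate why sparsity helps in association analysis, here is a minimal C++ sketch of a dot product between a sparse allele vector and a dense phenotype vector. The `SparseVector` type and the `to_sparse` and `sparse_dot` functions are hypothetical names introduced for this example only; they are not part of the SAV format or the Savvy API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sparse allele vector: only non-zero entries are stored.
struct SparseVector {
    std::size_t size = 0;              // total number of haplotypes or samples
    std::vector<std::size_t> offsets;  // positions of non-zero entries, ascending
    std::vector<float> values;         // allele or dosage value at each offset
};

// Build a sparse vector from a dense one by dropping zero entries.
SparseVector to_sparse(const std::vector<float>& dense) {
    SparseVector sv;
    sv.size = dense.size();
    for (std::size_t i = 0; i < dense.size(); ++i) {
        if (dense[i] != 0.0f) {
            sv.offsets.push_back(i);
            sv.values.push_back(dense[i]);
        }
    }
    return sv;
}

// Dot product with a dense vector y; the loop touches only non-zero entries,
// so the cost scales with the number of non-reference alleles, not with size.
double sparse_dot(const SparseVector& g, const std::vector<double>& y) {
    assert(y.size() == g.size);
    double sum = 0.0;
    for (std::size_t k = 0; k < g.offsets.size(); ++k) {
        sum += g.values[k] * y[g.offsets[k]];
    }
    return sum;
}
```

For a rare variant carried by a handful of samples out of many thousands, the loop in `sparse_dot` runs only a handful of iterations, which is the kind of saving the paragraph above refers to.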

Accompanying the SAV format is the Savvy C++ programming library for interfacing with it and other file formats. This library was designed for efficient association analysis and provides a mechanism to plug in linear algebra and numerical libraries, which reduces the overhead of copying data and lowers the memory footprint.

S1R Index

SAV files are indexed using an S1R (Sort-tile-recursive One-dimensional R-tree) index file. Genomic regions are organized into an R-tree, which enables fast random access to a SAV file without parsing the entire index file. Each leaf entry in the tree points to a zstd-compressed block in the corresponding SAV file. The entry also encodes the number of variants in the block, which varies with the parameters used to compress the SAV file.
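The idea can be sketched in memory as follows. This is an illustration only, not the on-disk S1R layout: the `Entry`, `Node` and `S1RSketch` names, the `block_offset` field and the `fanout` parameter are assumptions made for this example. In one dimension, sort-tile-recursive packing reduces to sorting the leaf entries by start position and grouping consecutive runs under parent nodes, level by level.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical leaf entry: one compressed block of variants.
struct Entry {
    std::uint64_t begin, end;     // inclusive genomic interval covered by the block
    std::uint64_t block_offset;   // assumed: byte offset of the block in the data file
    std::uint32_t variant_count;  // number of variants stored in the block
};

// Internal node: bounding interval plus a range of children in the level below.
struct Node {
    std::uint64_t begin, end;
    std::size_t first_child, child_count;
};

class S1RSketch {
public:
    S1RSketch(std::vector<Entry> entries, std::size_t fanout)
        : leaves_(std::move(entries)), fanout_(std::max<std::size_t>(2, fanout)) {
        for (const Entry& e : leaves_) assert(e.begin <= e.end);
        // Sort step: order leaf entries by start position.
        std::sort(leaves_.begin(), leaves_.end(),
                  [](const Entry& a, const Entry& b) { return a.begin < b.begin; });
        // Tile and recurse: pack consecutive items until a single root remains.
        levels_.push_back(pack(leaves_));
        while (levels_.back().size() > 1) levels_.push_back(pack(levels_.back()));
    }

    // Return every leaf entry whose interval overlaps [begin, end].
    std::vector<Entry> query(std::uint64_t begin, std::uint64_t end) const {
        std::vector<Entry> hits;
        if (!levels_.back().empty()) visit(levels_.size() - 1, 0, begin, end, hits);
        return hits;
    }

private:
    // Group up to fanout_ consecutive items under one parent node.
    template <typename T>
    std::vector<Node> pack(const std::vector<T>& items) const {
        std::vector<Node> parents;
        for (std::size_t i = 0; i < items.size(); i += fanout_) {
            std::size_t n = std::min(fanout_, items.size() - i);
            Node node{items[i].begin, items[i].end, i, n};
            for (std::size_t j = i + 1; j < i + n; ++j) {
                node.begin = std::min(node.begin, items[j].begin);
                node.end = std::max(node.end, items[j].end);
            }
            parents.push_back(node);
        }
        return parents;
    }

    // Descend only into subtrees whose bounding interval overlaps the query.
    void visit(std::size_t level, std::size_t idx, std::uint64_t begin,
               std::uint64_t end, std::vector<Entry>& hits) const {
        const Node& node = levels_[level][idx];
        if (node.end < begin || node.begin > end) return;
        for (std::size_t c = node.first_child; c < node.first_child + node.child_count; ++c) {
            if (level == 0) {
                const Entry& e = leaves_[c];
                if (e.end >= begin && e.begin <= end) hits.push_back(e);
            } else {
                visit(level - 1, c, begin, end, hits);
            }
        }
    }

    std::vector<Entry> leaves_;
    std::size_t fanout_;
    std::vector<std::vector<Node>> levels_;  // levels_[0] sits directly above the leaves
};
```

A region query skips every subtree whose bounding interval misses the region, so only the blocks that may contain matching variants need to be read and decompressed.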

Figure: Diagram of an S1R R-tree