GLF is a format for storing marginal likelihoods for next-generation sequence data, conditional on a set of possible genotypes.
Generating GLF Files
GLF files can be generated using samtools-hybrid. To generate a GLF file, use the
samtools-hybrid pileup -g command, which requires a sorted SAM file and a FASTA file with the human genome reference sequence.
samtools-hybrid pileup -g -f human_b36_male.fa.gz NA19240.SLX.maq.bam > NA19240.SLX.maq.glf
NOTE: Newer versions of samtools do not have the pileup command and do not generate GLF files. samtools-hybrid is a version that still supports those operations but also has the updated BAQ logic.
Generating GLF Files for a Specific Region
Sometimes, you want to generate a GLF file for a specific chromosome region. This can be accomplished by first using
samtools-hybrid view to extract reads for a specific region from a SAM or BAM file and then using
samtools-hybrid pileup -g to generate the GLF file using the selected reads as input. Here is an example:
samtools-hybrid view -u NA19240.SLX.maq.bam chr20:10000000-20000000 | samtools-hybrid pileup -g -f human_b36_male.fa.gz - > NA19240.SLX.chr20_region.glf
The GLF-file format is defined in an Appendix to the SAM-file format specification.
The current specification (GLF version 3) follows. All integers are stored in little-endian byte order. Most GLF files are compressed in a GZIP-compatible format; SAMTOOLS will only read GLF files that are compressed with the BGZF library.
Each GLF file starts with a header that identifies the file format.
char[4] magicNumber = "GLF\3"; // This identifies the file format
int32_t headerLength; // This is typically zero
char[headerLength] headerText; // This is typically unused
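As a minimal sketch of how the main header could be read, the Python snippet below checks the magic number and reads the optional header text. The function names `read_exact` and `read_glf_header` are hypothetical, and the example builds a tiny gzip-compressed header in memory rather than reading a real GLF file.

```python
import gzip
import io
import struct

GLF_MAGIC = b"GLF\x03"  # "GLF" followed by the version byte 3


def read_exact(stream, size):
    """Read exactly `size` bytes or raise if the stream ends early."""
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("truncated GLF stream")
    return data


def read_glf_header(stream):
    """Read the main GLF header and return the header text as bytes."""
    magic = read_exact(stream, 4)
    if magic != GLF_MAGIC:
        raise ValueError(f"not a GLF version 3 stream: magic={magic!r}")
    # headerLength is a little-endian signed 32-bit integer.
    (header_length,) = struct.unpack("<i", read_exact(stream, 4))
    if header_length < 0:
        raise ValueError("negative header length")
    return read_exact(stream, header_length)


# Build a minimal gzip-compressed header in memory and read it back.
raw = GLF_MAGIC + struct.pack("<i", 0)
with gzip.open(io.BytesIO(gzip.compress(raw)), "rb") as handle:
    header_text = read_glf_header(handle)
```

Because BGZF output is GZIP compatible, the standard `gzip` module should be able to decompress such a file sequentially; a real file would be opened with `gzip.open(path, "rb")` instead of the in-memory buffer used here.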
The main header is followed by a series of blocks, summarizing likelihoods along each chromosome. Each of these blocks starts with a header that records the chromosome label and length.
int32_t labelLength; // Including the terminating null character
char[labelLength] label; // Printable string identifying the chromosome label; typically, "1", "2", "3"... "22", "X", "Y", "MT" are used as labels for human chromosomes.
uint32_t chromosomeLength; // Length of the original reference sequence. This is a useful sanity check, but samtools can generate GLF entries that go past the end of the sequence.
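The chromosome block header can be decoded in the same way. This is a sketch only: `read_chromosome_header` is a hypothetical name, the label is assumed to be ASCII, and the example bytes are constructed by hand for illustration.

```python
import io
import struct


def read_chromosome_header(stream):
    """Return (label, chromosome_length) from a chromosome block header."""
    (label_length,) = struct.unpack("<i", stream.read(4))
    # labelLength includes the terminating null character, so strip it.
    label = stream.read(label_length).rstrip(b"\x00").decode("ascii")
    (chromosome_length,) = struct.unpack("<I", stream.read(4))
    return label, chromosome_length


# Hand-built header for a chromosome labelled "20" (illustrative length).
example = struct.pack("<i", 3) + b"20\x00" + struct.pack("<I", 62435964)
label, chromosome_length = read_chromosome_header(io.BytesIO(example))
```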
Likelihood Record Header
Each chromosome header is followed by a series of likelihood records, terminating with an end record of type 0.
char refBase:4; // 0..15 => XACMGRSVTWYHKDBN, so that 0x01 = A, 0x02 = C, 0x04 = G, 0x08 = T
char recordType:4; // 0 for the last record in this chromosome, 1 for regular records, 2 for indels
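The record header occupies a single byte. The sketch below splits it into its two 4-bit fields, assuming that refBase (declared first) sits in the low nibble and recordType in the high nibble; that bit order is an assumption and should be checked against a real file. `decode_record_header` is a hypothetical helper name.

```python
REF_BASE_CODES = "XACMGRSVTWYHKDBN"  # index 0..15, as listed in the spec


def decode_record_header(byte_value):
    """Split a record header byte into (reference base, record type)."""
    ref_base = REF_BASE_CODES[byte_value & 0x0F]   # low 4 bits (assumed)
    record_type = (byte_value >> 4) & 0x0F         # high 4 bits (assumed)
    return ref_base, record_type


# 0x14: record type 1 (regular record) with reference base code 0x04.
ref_base, record_type = decode_record_header(0x14)
```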
Simple Likelihood Record
These are records with recordType = 1.
uint32_t offset; // Offset from the previous record
uint32_t depth:24; // Depth of coverage for the current record
uint32_t maxLLK:8; // Maximum log-likelihood, multiplied by -10 log 10
uint8_t mappingQuality; // Root mean squared mapping quality
uint8_t llk[10]; // Log-likelihood for each genotype, in the order AA..AT..CC..CT..GG..TT
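A simple likelihood record could be unpacked as sketched below. Several points are assumptions rather than statements from the specification text: the record body is taken to be 19 bytes (following the one-byte record header), depth is taken to occupy the low 24 bits of the packed 32-bit word with maxLLK in the high 8 bits, and the genotype order is expanded to the ten unordered genotypes AA, AC, AG, AT, CC, CG, CT, GG, GT, TT. `parse_simple_record` is a hypothetical name.

```python
import struct

GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]

# offset, packed depth/maxLLK word, mapping quality, ten likelihood bytes
SIMPLE_RECORD = struct.Struct("<IIB10B")


def parse_simple_record(data):
    """Decode the body of a recordType = 1 record into a dictionary."""
    fields = SIMPLE_RECORD.unpack_from(data, 0)
    offset, packed, mapping_quality = fields[:3]
    return {
        "offset": offset,
        "depth": packed & 0xFFFFFF,   # low 24 bits (assumed)
        "max_llk": packed >> 24,      # high 8 bits (assumed)
        "mapping_quality": mapping_quality,
        "llk": dict(zip(GENOTYPES, fields[3:])),
    }


# Hand-built record: offset 1, depth 12, maxLLK 30, mapping quality 60.
llk_values = [0, 25, 40, 55, 70, 85, 100, 115, 130, 255]
body = SIMPLE_RECORD.pack(1, (30 << 24) | 12, 60, *llk_values)
record = parse_simple_record(body)
```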
Indel Likelihood Record
These are records with recordType = 2.
uint32_t offset; // Offset from the previous record
uint32_t depth:24; // Depth of coverage for the current record
uint32_t maxLLK:8; // Maximum log-likelihood, multiplied by -10 log 10
uint8_t mappingQuality; // Root mean squared mapping quality
uint8_t llkHomozygous11; // Log-likelihood for an allele 1 homozygote
uint8_t llkHomozygous22; // Log-likelihood for an allele 2 homozygote
uint8_t llkHomozygous12; // Log-likelihood for an allele 1/2 heterozygote
int16_t signedAllele1Length; // Length of the first indel allele (positive=ins; negative=del; zero=no-indel)
int16_t signedAllele2Length; // Length of the second indel allele (positive=ins; negative=del; zero=no-indel)
char indelSequence1[abs(signedAllele1Length)]; // Sequence of the first indel allele
char indelSequence2[abs(signedAllele2Length)]; // Sequence of the second indel allele
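Indel records have a fixed-size part followed by two variable-length sequences. The sketch below assumes the same depth/maxLLK bit layout as for simple records (depth in the low 24 bits) and assumes each sequence occupies as many bytes as the absolute value of its signed length; `parse_indel_record` is a hypothetical name.

```python
import struct

# offset, packed depth/maxLLK word, mapping quality, three likelihoods,
# and the two signed allele lengths
INDEL_FIXED = struct.Struct("<IIB3B2h")


def parse_indel_record(data):
    """Decode a recordType = 2 body; return (record, bytes consumed)."""
    (offset, packed, mapping_quality,
     llk11, llk22, llk12, len1, len2) = INDEL_FIXED.unpack_from(data, 0)
    pos = INDEL_FIXED.size
    # The sign encodes insertion or deletion; the byte count is abs(length).
    seq1 = data[pos:pos + abs(len1)].decode("ascii")
    pos += abs(len1)
    seq2 = data[pos:pos + abs(len2)].decode("ascii")
    pos += abs(len2)
    record = {
        "offset": offset,
        "depth": packed & 0xFFFFFF,   # low 24 bits (assumed)
        "max_llk": packed >> 24,      # high 8 bits (assumed)
        "mapping_quality": mapping_quality,
        "llk": {"11": llk11, "22": llk22, "12": llk12},
        "alleles": [(len1, seq1), (len2, seq2)],
    }
    return record, pos


# Hand-built record: allele 1 is a 2-base insertion "AC", allele 2 is a
# 1-base deletion of "T".
body = INDEL_FIXED.pack(5, (40 << 24) | 8, 50, 0, 60, 20, 2, -1) + b"ACT"
record, consumed = parse_indel_record(body)
```

Returning the number of bytes consumed lets a caller advance to the next record header, since indel records do not have a fixed size.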
Records with recordType = 0 are empty.
Tools That Use GLF Files
- glfMerge - Combines multiple GLF files generated for the same individual. Useful for combining data across platforms.