GLF

From Genome Analysis Wiki
Jump to navigationJump to search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

GLF is a format for storing marginal likelihoods for next-generation sequence data, conditional on a set of possible genotypes.

Generating GLF Files

GLF files can be generated using samtools-hybrid. To generate a GLF file, use the samtools-hybrid pileup -g command, which requires a sorted SAM file and a FASTA file with the human genome reference sequence.

  samtools-hybrid pileup -g -f human_b36_male.fa.gz NA19240.SLX.maq.bam > NA19240.SLX.maq.glf

NOTE: Newer versions of samtools do not have the pileup command and do not generate glf files. samtools-hybrid is a version that still supports those operations but also has the updated BAQ logic.

Generating GLF Files for a Specific Region

Sometimes, you want to generate a GLF file for a specific chromosome region. This can be accomplished by first using samtools-hybrid view to extract reads for a specific region from a SAM or BAM file and then using samtools-hybrid pileup -g to generate the GLF file using the selected reads as input. Here is an example:

  samtools-hybrid view -u NA19240.SLX.maq.bam chr20:10000000-20000000 | samtools-hybrid pileup -g -f human_b36_male.fa.gz - > NA19240.SLX.chr20_region.glf

File Format

The GLF-file format was defined in an Appendix to the SAM-file format specification. It has since been removed from that document.

The current specification (GLF version 3) follows. All integers in are stored in the little-endian byte order. Most GLF files are compressed in a GZIP compatible format; SAMTOOLS will only read GLF files that are compressed with the BGZF library.

Header

Each GLF file starts with a header that identifies the files.

 char[4]              magicNumber = "GLF\3";         // This identifies the file format
 int32_t              headerLength;                  // This is typically zero
 char[headerLength]   headerText;                    // This is typically unused

Chromosome Header

The main header is followed by a series of blocks, summarizing likelihoods along each chromosome. Each of these blocks starts with a header, that records the chromosome label and length.

  int32_t             labelLength;                   // Including the terminating null character
  char[labelLength]   label;                         // Printable string identified the chromosome label; typically, "1", "2", "3"... "22", "X", "Y", "MT" are used as labels for human chromosomes.
  uint32_t            chromosomeLength;              // Length of the original reference sequence. This is a useful sanity check, but samtools can generate GLF entries that go past the end of the sequence.

Likelihood Record Header

Each chromosome header is followed by a series of likelihood records, terminating with an end record of type 0.

   char               refBase:4;                     // 0..15 => XACMGRSVTWYHKDBN, so that 0x01 = A, 0x02 = C, 0x04 = G, 0x08 = T
   char               recordType:4;                  // 0 for the last record in this chromosome, 1 for regular records, 2 for indels

Simple Likelihood Record

These are records with recordType = 1.

   uint32_t           offset;                       // Offset from the previous record
   uint32_t:24        depth;                        // Depth of coverage for the current record 
   uint32_t:8         maxLLK;                       // Maximum log-likelihood, multiplied by -10 log 10
   uint8              mappingQuality;               // Root mean squared mapping quality
   uint8_t            llk[10];                      // Log-likelihood for each genotype, in the order AA..AT..CC..CT..GG..TT 

Indel Likelihood Record

These are records with recordType = 2.

   uint32_t           offset;                       // Offset from the previous record
   uint32_t:24        depth;                        // Depth of coverage for the current record 
   uint32_t:8         maxLLK;                       // Maximum log-likelihood, multiplied by -10 log 10
   uint8              mappingQuality;               // Root mean squared mapping quality
   uint8_t            llkHomozygous11;              // Log-likelihood for an allele 1 homozygote
   uint8_t            llkHomozygous22;              // Log-likelihood for an allele 2 homozygote
   uint8_t            llkHomozygous12;              // Log-likelihood for an allele 1/2 heterozygote
   int16_t            signedAllele1length;          // Length of the first indel allele (positive=ins; negative=del; zero=no-indel)
   int16_t            signedAllele2length;          // Length of the first indel allele (positive=ins; negative=del; zero=no-indel) 
   char               indelSequence1[signedAllele1Length];      // Sequence of the first indel allele
   char               indelSequence2[signedAllele2Length];      // Sequence of the first indel allele

Last Record

Records with recordType = 0 are empty.


SAM Format Specification 0.1.2-draft (20090820) A. Genotype Likelihood Format version 3 (GLFv3)

GLFs store the probability of a genotype given data.

All integers are in little-endian.

Field Description Type Value
magic GLFv3 magic number char[4] GLF\3
l_text Length of the header text, including any zero padding int32_t
text Text char[l_text]
List of reference information until the end of the file
l_name Length of the reference sequence name plus 1 (including NULL) int32_t
name Name; NULL terminated char[l_name]
ref_len length of the reference sequence uint32_t
List of sites until a record with rtype==0
rtype_ref record_type<<4 | ref_base; 0..15=>XACMGRSVTWYHKDBN uint8_t
if rtype==1 offset offset from the precedent record1 uint32_t
min_depth min_lk<<24 | read_depth (min_lk capped at 255) uint32_t
rmsMapQ RMS of mapping qualities of reads covering the site uint8_t
lk likelihood of each genotype in the order of AA AC AG AT CC GC CT GG GT TT uint8_t[10]
if rtype==2 offset offset from the precedent record1,2 uint32_t
min_depth min_lk<<24 | read_depth uint32_t
rmsMapQ RMS of mapping qualities of reads covering the site uint8_t
lkHom1 likelihood of the first homozygous indel allele (capped at 255) uint8_t
lkHom2 likelihood of the second homozygous indel allele (capped at 255) uint8_t
lkHet likelihood of a heterozygote the (capped at 255) uint8_t
indelLen1 length of the first indel allele (positive=ins; negative=del; zero=no-indel) uint16_t
indelLen2 length of the second indel allele uint16_t
indelSeq1 sequence of the first indel allele char[indelLen1]
indelSeq2 sequence of the second indel allele char[indelLen2]
if rtype==0 endMarker end of this chromosome; no data in this record (null)

Notes:

1. Field offset equals the zero-based coordinate of the current record minus the coordinate of the previous record. For the first record in a reference sequence, the coordinate of the precedent record is assumed to be zero. Offset is non-negative.

2. If a sequence is inserted between position [x,x+1] on the reference sequence, the coordinate of this record is x; if the sequence between [x,y] on the reference is is deleted, the coordinate of this record is x.

Tools That Use GLF Files

Variant Callers

glfSingle

glfTrio

glfMultiples

Utilities

  • glfMerge - Combines GLF multiple glfFiles generated for the same individual. Useful for combining data across platforms.