GLF

From Genome Analysis Wiki
Revision as of 16:12, 5 April 2012 by Mktrost (talk | contribs)

GLF is a format for storing marginal likelihoods for next-generation sequence data, conditional on a set of possible genotypes.

Generating GLF Files

GLF files can be generated using samtools-hybrid. To generate a GLF file, use the samtools-hybrid pileup -g command, which requires a sorted alignment file (a BAM file in the example below) and a FASTA file with the human genome reference sequence.

  samtools-hybrid pileup -g -f human_b36_male.fa.gz NA19240.SLX.maq.bam > NA19240.SLX.maq.glf

NOTE: Newer versions of samtools do not have the pileup command and do not generate GLF files. samtools-hybrid is a version that still supports those operations and also includes the updated BAQ logic.

Generating GLF Files for a Specific Region

Sometimes, you want to generate a GLF file for a specific chromosome region. This can be accomplished by first using samtools-hybrid view to extract reads for a specific region from a SAM or BAM file and then using samtools-hybrid pileup -g to generate the GLF file using the selected reads as input. Here is an example:

  samtools-hybrid view -u NA19240.SLX.maq.bam chr20:10000000-20000000 | samtools-hybrid pileup -g -f human_b36_male.fa.gz - > NA19240.SLX.chr20_region.glf

File Format

The GLF-file format is defined in an Appendix to the SAM-file format specification.

The current specification (GLF version 3) follows. All integers are stored in little-endian byte order. Most GLF files are compressed in a GZIP-compatible format; SAMTOOLS will only read GLF files that are compressed with the BGZF library.

Header

Each GLF file starts with a header that identifies the file format.

 char[4]              magicNumber = "GLF\3";         // This identifies the file format
 int32_t              headerLength;                  // This is typically zero
 char[headerLength]   headerText;                    // This is typically unused
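
As a minimal sketch of how this header might be read, the Python snippet below parses the magic number and header text from a GZIP-compatible stream. The function name read_glf_header and the in-memory test data are illustrative assumptions, not part of any GLF tool.

```python
import gzip
import io
import struct

def read_glf_header(stream):
    """Read the main GLF header and return the header text as bytes."""
    magic = stream.read(4)
    if magic != b"GLF\x03":          # "GLF" followed by the version byte 3
        raise ValueError("not a GLF version 3 file")
    # headerLength is a little-endian signed 32-bit integer
    (header_length,) = struct.unpack("<i", stream.read(4))
    return stream.read(header_length)

# Illustrative data only: a header with zero-length header text,
# compressed so that it can be opened like a GZIP-compatible GLF file.
raw = b"GLF\x03" + struct.pack("<i", 0)
compressed = gzip.compress(raw)

with gzip.open(io.BytesIO(compressed), "rb") as handle:
    header_text = read_glf_header(handle)
```

For a file on disk, the same function could be called on a handle returned by gzip.open with the file name.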

Chromosome Header

The main header is followed by a series of blocks, each summarizing likelihoods along one chromosome. Each block starts with a header that records the chromosome label and length.

  int32_t             labelLength;                   // Including the terminating null character
  char[labelLength]   label;                         // Printable string identifying the chromosome; typically, "1", "2", "3"... "22", "X", "Y", "MT" are used as labels for human chromosomes.
  uint32_t            chromosomeLength;              // Length of the original reference sequence. This is a useful sanity check, but samtools can generate GLF entries that go past the end of the sequence.
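
A chromosome header laid out this way could be decoded roughly as follows. This is a sketch under the layout given above; read_chromosome_header, the label "20" and the length 1000000 are hypothetical values chosen for illustration.

```python
import io
import struct

def read_chromosome_header(stream):
    """Return (label, chromosome_length) from a chromosome header."""
    (label_length,) = struct.unpack("<i", stream.read(4))
    # labelLength counts the terminating null character, so strip it
    label = stream.read(label_length).rstrip(b"\x00").decode("ascii")
    (chromosome_length,) = struct.unpack("<I", stream.read(4))
    return label, chromosome_length

# Illustrative data only: label "20" plus its null terminator.
label_bytes = b"20\x00"
raw = struct.pack("<i", len(label_bytes)) + label_bytes + struct.pack("<I", 1000000)
label, chromosome_length = read_chromosome_header(io.BytesIO(raw))
```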

Likelihood Record Header

Each chromosome header is followed by a series of likelihood records, terminating with an end record of type 0.

   char               refBase:4;                     // 0..15 => XACMGRSVTWYHKDBN, so that 0x01 = A, 0x02 = C, 0x04 = G, 0x08 = T
   char               recordType:4;                  // 0 for the last record in this chromosome, 1 for regular records, 2 for indels
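
The two 4-bit fields share a single byte. The sketch below assumes the reference base occupies the low 4 bits and the record type the high 4 bits; that bit order is an assumption to check against the files you have, and decode_record_header is a hypothetical helper name.

```python
# Lookup string from the refBase comment: index 0..15 maps to a base code.
REF_BASES = "XACMGRSVTWYHKDBN"

def decode_record_header(byte_value):
    """Split one header byte into (reference base, record type).

    Assumes refBase is stored in the low nibble and recordType in the
    high nibble; swap the two expressions if your files differ.
    """
    ref_base = REF_BASES[byte_value & 0x0F]
    record_type = byte_value >> 4
    return ref_base, record_type

# 0x14: high nibble 1 (regular record), low nibble 4 (G)
example = decode_record_header(0x14)
```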

Simple Likelihood Record

These are records with recordType = 1.

   uint32_t           offset;                       // Offset from the previous record
   uint32_t:24        depth;                        // Depth of coverage for the current record 
   uint32_t:8         maxLLK;                       // Maximum log-likelihood, multiplied by -10 log 10
   uint8_t            mappingQuality;               // Root mean squared mapping quality
   uint8_t            llk[10];                      // Log-likelihood for each genotype, in the order AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
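
The following sketch shows one way to unpack a simple likelihood record (19 bytes after the record header byte). It assumes depth sits in the low 24 bits and maxLLK in the high 8 bits of the second 32-bit word, and that the ten genotypes run alphabetically from AA to TT; read_simple_record and the sample values are illustrative.

```python
import io
import struct

# Assumed genotype order for llk[10]
GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]

def read_simple_record(stream):
    """Parse the body of a recordType = 1 record into a dict."""
    offset, packed = struct.unpack("<II", stream.read(8))
    mapping_quality = stream.read(1)[0]
    llk_values = stream.read(10)
    return {
        "offset": offset,
        "depth": packed & 0xFFFFFF,   # assumed: low 24 bits
        "maxLLK": packed >> 24,       # assumed: high 8 bits
        "mappingQuality": mapping_quality,
        "llk": dict(zip(GENOTYPES, llk_values)),
    }

# Illustrative data only: offset 5, depth 12, maxLLK 25, mapping quality 60.
raw = struct.pack("<II", 5, (25 << 24) | 12) + bytes([60]) + bytes(range(0, 100, 10))
record = read_simple_record(io.BytesIO(raw))
```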

Indel Likelihood Record

These are records with recordType = 2.

   uint32_t           offset;                       // Offset from the previous record
   uint32_t:24        depth;                        // Depth of coverage for the current record 
   uint32_t:8         maxLLK;                       // Maximum log-likelihood, multiplied by -10 log 10
   uint8_t            mappingQuality;               // Root mean squared mapping quality
   uint8_t            llkHomozygous11;              // Log-likelihood for an allele 1 homozygote
   uint8_t            llkHomozygous22;              // Log-likelihood for an allele 2 homozygote
   uint8_t            llkHeterozygous12;            // Log-likelihood for an allele 1/2 heterozygote
   int16_t            signedAllele1length;          // Length of the first indel allele (positive=ins; negative=del; zero=no-indel)
   int16_t            signedAllele2length;          // Length of the second indel allele (positive=ins; negative=del; zero=no-indel)
   char               indelSequence1[abs(signedAllele1length)];   // Sequence of the first indel allele
   char               indelSequence2[abs(signedAllele2length)];   // Sequence of the second indel allele
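
Indel records have a variable length because the two allele sequences follow the fixed fields. The sketch below assumes each sequence holds as many characters as the absolute value of its signed length, and uses the same bit packing for depth and maxLLK as in simple records; read_indel_record and the sample values are hypothetical.

```python
import io
import struct

def read_indel_record(stream):
    """Parse the body of a recordType = 2 record into a dict."""
    offset, packed = struct.unpack("<II", stream.read(8))
    mapping_quality, llk11, llk22, llk12 = struct.unpack("<4B", stream.read(4))
    length1, length2 = struct.unpack("<2h", stream.read(4))
    # Assumed: sequence length is the absolute value of the signed length
    sequence1 = stream.read(abs(length1)).decode("ascii")
    sequence2 = stream.read(abs(length2)).decode("ascii")
    return {
        "offset": offset,
        "depth": packed & 0xFFFFFF,
        "maxLLK": packed >> 24,
        "mappingQuality": mapping_quality,
        "llk": {"11": llk11, "22": llk22, "12": llk12},
        "alleles": [(length1, sequence1), (length2, sequence2)],
    }

# Illustrative data only: allele 1 is "no indel", allele 2 deletes "AC".
raw = struct.pack("<II4B2h", 3, (10 << 24) | 8, 50, 0, 40, 15, 0, -2) + b"AC"
record = read_indel_record(io.BytesIO(raw))
```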

Last Record

Records with recordType = 0 are empty.

Tools That Use GLF Files

Variant Callers

glfSingle

glfTrio

glfMultiples

Utilities

  • glfMerge - Combines multiple GLF files generated for the same individual. Useful for combining data across platforms.