Difference between revisions of "GLF"

From Genome Analysis Wiki
Jump to: navigation, search
m (Reverted edits by Upugema (Talk) to last revision by Goncalo)
(File Format)
 
(One intermediate revision by the same user not shown)
Line 3: Line 3:
 
== Generating GLF Files ==
 
== Generating GLF Files ==
  
GLF files can be generated using [http://samtools.sourceforge.net samtools]. To generate a GLF file, use the <code>samtools pileup -g</code> command, which requires a sorted [[SAM]] file and a [[FASTA]] file with the human genome reference sequence.  
+
GLF files can be generated using [https://github.com/statgen/samtools-0.1.7a-hybrid samtools-hybrid]. To generate a GLF file, use the <code>samtools-hybrid pileup -g</code> command, which requires a sorted [[SAM]] file and a [[FASTA]] file with the human genome reference sequence.  
  
   samtools pileup -g -f human_b36_male.fa.gz NA19240.SLX.maq.bam > NA19240.SLX.maq.glf
+
   samtools-hybrid pileup -g -f human_b36_male.fa.gz NA19240.SLX.maq.bam > NA19240.SLX.maq.glf
 +
 
 +
NOTE: Newer versions of samtools do not have the pileup command and do not generate glf files.  samtools-hybrid is a version that still supports those operations but also has the updated BAQ logic.
  
 
=== Generating GLF Files for a Specific Region ===
 
=== Generating GLF Files for a Specific Region ===
  
Sometimes, you want to generate a GLF file for a specific chromosome region. This can be accomplished by first using <code>samtools view</code> to extract reads for a specific region from a [[SAM]] or [[BAM]] file and then using <code>samtools pileup -g</code> to generate the GLF file using the selected reads as input. Here is an example:
+
Sometimes, you want to generate a GLF file for a specific chromosome region. This can be accomplished by first using <code>samtools-hybrid view</code> to extract reads for a specific region from a [[SAM]] or [[BAM]] file and then using <code>samtools-hybrid pileup -g</code> to generate the GLF file using the selected reads as input. Here is an example:
  
   samtools view -u NA19240.SLX.maq.bam chr20:10000000-20000000 | samtools pileup -g -f human_b36_male.fa.gz - > NA19240.SLX.chr20_region.glf
+
   samtools-hybrid view -u NA19240.SLX.maq.bam chr20:10000000-20000000 | samtools-hybrid pileup -g -f human_b36_male.fa.gz - > NA19240.SLX.chr20_region.glf
  
 
== File Format ==
 
== File Format ==
  
The GLF-file format is defined in an Appendix to the [http://samtools.sourceforge.net/SAM1.pdf SAM-file format specification].
+
The GLF-file format was defined in an Appendix to the [http://samtools.sourceforge.net/SAM1.pdf SAM-file format specification].  It has since been removed from that document.
  
 
The current specification (GLF version 3) follows. All integers in are stored in the little-endian byte order. Most GLF files are compressed in a GZIP compatible format; SAMTOOLS will only read GLF files that are compressed with the BGZF library.
 
The current specification (GLF version 3) follows. All integers in are stored in the little-endian byte order. Most GLF files are compressed in a GZIP compatible format; SAMTOOLS will only read GLF files that are compressed with the BGZF library.
Line 71: Line 73:
  
 
Records with recordType = 0 are empty.
 
Records with recordType = 0 are empty.
 +
 +
 +
=== SAM Format Specification 0.1.2-draft (20090820) A. Genotype Likelihood Format version 3 (GLFv3)===
 +
GLFs store the probability of a genotype given data.
 +
 +
All integers are in little-endian.
 +
 +
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
|-style="background: #f2f2f2; text-align: center;"
 +
! colspan="3"|Field !! Description !!  Type !! Value
 +
|-
 +
| colspan="3"|magic || GLFv3 magic number || char[4]||GLF\3
 +
|-
 +
|colspan="3"|l_text||Length of the header text, including any zero padding||int32_t||
 +
|-
 +
|colspan="3"|text||Text||char[l_text]||
 +
|-
 +
|colspan="6" align="center"|''List of reference information until the end of the file''
 +
|-
 +
| rowspan="20" style="width: 20px"| || colspan="2"|l_name||Length of the reference sequence name plus 1 (including NULL) || int32_t||
 +
|-
 +
| colspan="2"|name||Name; NULL terminated || char[l_name]||
 +
|-
 +
| colspan="2"|ref_len||length of the reference sequence || uint32_t||
 +
|-
 +
| colspan="5" align="center"|''List of sites until a record with rtype==0''
 +
|-
 +
| colspan="2"|rtype_ref||<nowiki>record_type<<4 | ref_base; 0..15=>XACMGRSVTWYHKDBN</nowiki> || uint8_t||
 +
|-
 +
| rowspan="4"|if rtype==1||offset||offset from the precedent record<sup>1</sup>||uint32_t||
 +
|-
 +
| min_depth||<nowiki>min_lk<<24 | read_depth (min_lk capped at 255)</nowiki>||uint32_t||
 +
|-
 +
| rmsMapQ|| RMS of mapping qualities of reads covering the site ||uint8_t||
 +
|-
 +
| lk||likelihood of each genotype in the order of AA AC AG AT CC GC CT GG GT TT||uint8_t[10]||
 +
|-style="border-top: 1px #aaa dashed;"
 +
| rowspan="10"|if rtype==2||offset||offset from the precedent record<sup>1,2</sup>||uint32_t||
 +
|-
 +
| min_depth||<nowiki>min_lk<<24 | read_depth</nowiki>||uint32_t||
 +
|-
 +
| rmsMapQ|| RMS of mapping qualities of reads covering the site ||uint8_t||
 +
|-
 +
| lkHom1||likelihood of the first homozygous indel allele (capped at 255)||uint8_t||
 +
|-
 +
| lkHom2||likelihood of the second homozygous indel allele (capped at 255)||uint8_t||
 +
|-
 +
| lkHet|| likelihood of a heterozygote the (capped at 255)||uint8_t||
 +
|-
 +
| indelLen1|| length of the first indel allele (positive=ins; negative=del; zero=no-indel)||uint16_t||
 +
|-
 +
| indelLen2|| length of the second indel allele ||uint16_t||
 +
|-
 +
| indelSeq1|| sequence of the first indel allele||char[indelLen1]||
 +
|-
 +
| indelSeq2|| sequence of the second indel allele||char[indelLen2]||
 +
|-style="border-top: 1px #aaa dashed;"
 +
| if rtype==0||endMarker||end of this chromosome; no data in this record||(null)||
 +
|}
 +
 +
'''Notes:'''
 +
 +
1. Field offset equals the zero-based coordinate of the current record minus the coordinate of the previous record.  For the first record in a reference sequence, the coordinate of the precedent record is assumed to be zero.  Offset is non-negative.
 +
 +
2. If a sequence is inserted between position [x,x+1] on the reference sequence, the coordinate of this record is x; if the sequence between [x,y] on the reference is is deleted, the coordinate of this record is x.
  
 
== Tools That Use GLF Files ==
 
== Tools That Use GLF Files ==

Latest revision as of 10:22, 13 August 2013

GLF is a format for storing marginal likelihoods for next-generation sequence data, conditional on a set of possible genotypes.

Generating GLF Files

GLF files can be generated using samtools-hybrid. To generate a GLF file, use the samtools-hybrid pileup -g command, which requires a sorted SAM file and a FASTA file with the human genome reference sequence.

  samtools-hybrid pileup -g -f human_b36_male.fa.gz NA19240.SLX.maq.bam > NA19240.SLX.maq.glf

NOTE: Newer versions of samtools do not have the pileup command and do not generate glf files. samtools-hybrid is a version that still supports those operations but also has the updated BAQ logic.

Generating GLF Files for a Specific Region

Sometimes, you want to generate a GLF file for a specific chromosome region. This can be accomplished by first using samtools-hybrid view to extract reads for a specific region from a SAM or BAM file and then using samtools-hybrid pileup -g to generate the GLF file using the selected reads as input. Here is an example:

  samtools-hybrid view -u NA19240.SLX.maq.bam chr20:10000000-20000000 | samtools-hybrid pileup -g -f human_b36_male.fa.gz - > NA19240.SLX.chr20_region.glf

File Format

The GLF-file format was defined in an Appendix to the SAM-file format specification. It has since been removed from that document.

The current specification (GLF version 3) follows. All integers in are stored in the little-endian byte order. Most GLF files are compressed in a GZIP compatible format; SAMTOOLS will only read GLF files that are compressed with the BGZF library.

Header

Each GLF file starts with a header that identifies the files.

 char[4]              magicNumber = "GLF\3";         // This identifies the file format
 int32_t              headerLength;                  // This is typically zero
 char[headerLength]   headerText;                    // This is typically unused

Chromosome Header

The main header is followed by a series of blocks, summarizing likelihoods along each chromosome. Each of these blocks starts with a header, that records the chromosome label and length.

  int32_t             labelLength;                   // Including the terminating null character
  char[labelLength]   label;                         // Printable string identified the chromosome label; typically, "1", "2", "3"... "22", "X", "Y", "MT" are used as labels for human chromosomes.
  uint32_t            chromosomeLength;              // Length of the original reference sequence. This is a useful sanity check, but samtools can generate GLF entries that go past the end of the sequence.

Likelihood Record Header

Each chromosome header is followed by a series of likelihood records, terminating with an end record of type 0.

   char               refBase:4;                     // 0..15 => XACMGRSVTWYHKDBN, so that 0x01 = A, 0x02 = C, 0x04 = G, 0x08 = T
   char               recordType:4;                  // 0 for the last record in this chromosome, 1 for regular records, 2 for indels

Simple Likelihood Record

These are records with recordType = 1.

   uint32_t           offset;                       // Offset from the previous record
   uint32_t:24        depth;                        // Depth of coverage for the current record 
   uint32_t:8         maxLLK;                       // Maximum log-likelihood, multiplied by -10 log 10
   uint8              mappingQuality;               // Root mean squared mapping quality
   uint8_t            llk[10];                      // Log-likelihood for each genotype, in the order AA..AT..CC..CT..GG..TT 

Indel Likelihood Record

These are records with recordType = 2.

   uint32_t           offset;                       // Offset from the previous record
   uint32_t:24        depth;                        // Depth of coverage for the current record 
   uint32_t:8         maxLLK;                       // Maximum log-likelihood, multiplied by -10 log 10
   uint8              mappingQuality;               // Root mean squared mapping quality
   uint8_t            llkHomozygous11;              // Log-likelihood for an allele 1 homozygote
   uint8_t            llkHomozygous22;              // Log-likelihood for an allele 2 homozygote
   uint8_t            llkHomozygous12;              // Log-likelihood for an allele 1/2 heterozygote
   int16_t            signedAllele1length;          // Length of the first indel allele (positive=ins; negative=del; zero=no-indel)
   int16_t            signedAllele2length;          // Length of the first indel allele (positive=ins; negative=del; zero=no-indel) 
   char               indelSequence1[signedAllele1Length];      // Sequence of the first indel allele
   char               indelSequence2[signedAllele2Length];      // Sequence of the first indel allele

Last Record

Records with recordType = 0 are empty.


SAM Format Specification 0.1.2-draft (20090820) A. Genotype Likelihood Format version 3 (GLFv3)

GLFs store the probability of a genotype given data.

All integers are in little-endian.

Field Description Type Value
magic GLFv3 magic number char[4] GLF\3
l_text Length of the header text, including any zero padding int32_t
text Text char[l_text]
List of reference information until the end of the file
l_name Length of the reference sequence name plus 1 (including NULL) int32_t
name Name; NULL terminated char[l_name]
ref_len length of the reference sequence uint32_t
List of sites until a record with rtype==0
rtype_ref record_type<<4 | ref_base; 0..15=>XACMGRSVTWYHKDBN uint8_t
if rtype==1 offset offset from the precedent record1 uint32_t
min_depth min_lk<<24 | read_depth (min_lk capped at 255) uint32_t
rmsMapQ RMS of mapping qualities of reads covering the site uint8_t
lk likelihood of each genotype in the order of AA AC AG AT CC GC CT GG GT TT uint8_t[10]
if rtype==2 offset offset from the precedent record1,2 uint32_t
min_depth min_lk<<24 | read_depth uint32_t
rmsMapQ RMS of mapping qualities of reads covering the site uint8_t
lkHom1 likelihood of the first homozygous indel allele (capped at 255) uint8_t
lkHom2 likelihood of the second homozygous indel allele (capped at 255) uint8_t
lkHet likelihood of a heterozygote the (capped at 255) uint8_t
indelLen1 length of the first indel allele (positive=ins; negative=del; zero=no-indel) uint16_t
indelLen2 length of the second indel allele uint16_t
indelSeq1 sequence of the first indel allele char[indelLen1]
indelSeq2 sequence of the second indel allele char[indelLen2]
if rtype==0 endMarker end of this chromosome; no data in this record (null)

Notes:

1. Field offset equals the zero-based coordinate of the current record minus the coordinate of the previous record. For the first record in a reference sequence, the coordinate of the precedent record is assumed to be zero. Offset is non-negative.

2. If a sequence is inserted between position [x,x+1] on the reference sequence, the coordinate of this record is x; if the sequence between [x,y] on the reference is is deleted, the coordinate of this record is x.

Tools That Use GLF Files

Variant Callers

glfSingle

glfTrio

glfMultiples

Utilities

  • glfMerge - Combines GLF multiple glfFiles generated for the same individual. Useful for combining data across platforms.