Difference between revisions of "GLF"

From Genome Analysis Wiki

Revision as of 22:51, 17 November 2010

GLF is a format for storing marginal likelihoods for next-generation sequence data, conditional on a set of possible genotypes.

Generating GLF Files

GLF files can be generated using samtools. To generate a GLF file, use the samtools pileup -g command, which requires a sorted SAM file and a FASTA file with the human genome reference sequence.

  samtools pileup -g -f human_b36_male.fa.gz NA19240.SLX.maq.bam > NA19240.SLX.maq.glf

Generating GLF Files for a Specific Region

Sometimes, you want to generate a GLF file for a specific chromosome region. This can be accomplished by first using samtools view to extract reads for a specific region from a SAM or BAM file and then using samtools pileup -g to generate the GLF file using the selected reads as input. Here is an example:

  samtools view -u NA19240.SLX.maq.bam chr20:10000000-20000000 | samtools pileup -g -f human_b36_male.fa.gz - > NA19240.SLX.chr20_region.glf

File Format

The GLF-file format is defined in an Appendix to the SAM-file format specification.

The current specification (GLF version 3) follows. All integers are stored in little-endian byte order. Most GLF files are compressed in a GZIP-compatible format; samtools will only read GLF files that are compressed with the BGZF library.

Header

Each GLF file starts with a header that identifies the file format.

 char[4]              magicNumber = "GLF\3";         // This identifies the file format
 int32_t              headerLength;                  // This is typically zero
 char[headerLength]   headerText;                    // This is typically unused
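
The header layout above can be read with a few lines of Python. This is a minimal sketch, not part of samtools: the function name `read_glf_header` is hypothetical, and it assumes that Python's `gzip` module is sufficient to decompress the file (BGZF output is a series of gzip members, which `gzip` reads transparently).

```python
import gzip
import io
import struct

GLF_MAGIC = b"GLF\x03"  # the characters "GLF" followed by the version byte 3


def read_glf_header(stream):
    """Read the main GLF header and return the header text as bytes.

    Hypothetical helper: `stream` is any binary file-like object that
    yields the decompressed GLF data.
    """
    magic = stream.read(4)
    if magic != GLF_MAGIC:
        raise ValueError(f"not a GLF version 3 file: magic={magic!r}")
    # "<i" is a little-endian int32_t, matching headerLength
    (header_length,) = struct.unpack("<i", stream.read(4))
    header_text = stream.read(header_length)
    if len(header_text) != header_length:
        raise ValueError("truncated GLF header")
    return header_text


# Build a tiny in-memory example with an empty header text, gzip-compressed.
raw = GLF_MAGIC + struct.pack("<i", 0)
compressed = gzip.compress(raw)
with gzip.open(io.BytesIO(compressed), "rb") as fh:
    header_text = read_glf_header(fh)
```

For a real file, the in-memory buffer would be replaced by a path, for example `gzip.open("NA19240.SLX.maq.glf", "rb")`.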

Chromosome Header

The main header is followed by a series of blocks, each summarizing likelihoods along one chromosome. Each block starts with a header that records the chromosome label and length.

  int32_t             labelLength;                   // Including the terminating null character
  char[labelLength]   label;                         // Printable string identifying the chromosome; typically, "1", "2", "3"... "22", "X", "Y", "MT" are used as labels for human chromosomes.
  uint32_t            chromosomeLength;              // Length of the original reference sequence. This is a useful sanity check, but samtools can generate GLF entries that go past the end of the sequence.
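
A chromosome header can be decoded along the same lines. The sketch below is illustrative only: `read_chromosome_header` is a hypothetical name, the label is assumed to be ASCII, and the example length is an arbitrary value rather than a real chromosome size.

```python
import io
import struct


def read_chromosome_header(stream):
    """Return (label, chromosome_length), or None at end of data.

    Hypothetical helper; `stream` is positioned at the start of a
    chromosome block in the decompressed GLF data.
    """
    raw_length = stream.read(4)
    if len(raw_length) < 4:
        return None  # no further chromosome blocks
    (label_length,) = struct.unpack("<i", raw_length)
    # labelLength includes the terminating null character, so strip it
    label = stream.read(label_length).rstrip(b"\x00").decode("ascii")
    (chromosome_length,) = struct.unpack("<I", stream.read(4))
    return label, chromosome_length


# Example block header for a chromosome labelled "20".
example = struct.pack("<i", 3) + b"20\x00" + struct.pack("<I", 1000000)
label, chromosome_length = read_chromosome_header(io.BytesIO(example))
```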

Likelihood Record Header

Each chromosome header is followed by a series of likelihood records, terminating with an end record of type 0.

   char               refBase:4;                     // 0..15 => XACMGRSVTWYHKDBN, so that 0x01 = A, 0x02 = C, 0x04 = G, 0x08 = T
   char               recordType:4;                  // 0 for the last record in this chromosome, 1 for regular records, 2 for indels
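
Both fields share a single byte. The sketch below assumes that `refBase` occupies the low four bits and `recordType` the high four bits, which is how a little-endian compiler such as GCC typically lays out the bit-fields in the order declared above; the function name is hypothetical.

```python
REF_BASE_CODES = "XACMGRSVTWYHKDBN"  # index 0..15, as listed in the layout above


def decode_record_header(byte_value):
    """Split the one-byte record header into (ref_base, record_type).

    Assumption: refBase is the low nibble, recordType the high nibble.
    """
    ref_base = REF_BASE_CODES[byte_value & 0x0F]
    record_type = (byte_value >> 4) & 0x0F
    return ref_base, record_type


simple_g = decode_record_header(0x14)  # regular record, reference base G
end_marker = decode_record_header(0x00)  # end record for the chromosome
```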

Simple Likelihood Record

These are records with recordType = 1.

   uint32_t           offset;                       // Offset from the previous record
   uint32_t:24        depth;                        // Depth of coverage for the current record
   uint32_t:8         maxLLK;                       // Maximum likelihood on the Phred scale, i.e. -10 * log10(likelihood)
   uint8_t            mappingQuality;               // Root mean squared mapping quality
   uint8_t            llk[10];                      // Log-likelihood for each genotype, in the order AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
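
The fixed part of a simple record is 19 bytes (4 + 4 + 1 + 10). The following sketch unpacks it with `struct`. It assumes that `depth` sits in the low 24 bits of the shared 32-bit word and `maxLLK` in the high 8 bits; `parse_simple_record` is a hypothetical name.

```python
import struct

GENOTYPES = ["AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT"]

# offset, packed depth/maxLLK word, mappingQuality, llk[10]; "<" = no padding
SIMPLE_RECORD = struct.Struct("<IIB10B")


def parse_simple_record(data):
    """Decode the bytes that follow a record header with recordType = 1."""
    offset, packed, mapping_quality, *llk = SIMPLE_RECORD.unpack(data)
    return {
        "offset": offset,
        "depth": packed & 0xFFFFFF,  # assumed: low 24 bits
        "max_llk": packed >> 24,     # assumed: high 8 bits
        "mapping_quality": mapping_quality,
        "llk": dict(zip(GENOTYPES, llk)),
    }


# Made-up record: offset 5, depth 12, maxLLK 30, mapping quality 60.
example = SIMPLE_RECORD.pack(5, (30 << 24) | 12, 60, *range(0, 100, 10))
record = parse_simple_record(example)
```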

Indel Likelihood Record

These are records with recordType = 2.

   uint32_t           offset;                       // Offset from the previous record
   uint32_t:24        depth;                        // Depth of coverage for the current record
   uint32_t:8         maxLLK;                       // Maximum likelihood on the Phred scale, i.e. -10 * log10(likelihood)
   uint8_t            mappingQuality;               // Root mean squared mapping quality
   uint8_t            llkHomozygous11;              // Log-likelihood for an allele 1 homozygote
   uint8_t            llkHomozygous22;              // Log-likelihood for an allele 2 homozygote
   uint8_t            llkHomozygous12;              // Log-likelihood for an allele 1/2 heterozygote
   int16_t            signedAllele1length;          // Length of the first indel allele (positive=ins; negative=del; zero=no-indel)
   int16_t            signedAllele2length;          // Length of the second indel allele (positive=ins; negative=del; zero=no-indel)
   char               indelSequence1[abs(signedAllele1length)];      // Sequence of the first indel allele
   char               indelSequence2[abs(signedAllele2length)];      // Sequence of the second indel allele
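
Indel records have a 16-byte fixed part followed by two variable-length sequences. The sketch below assumes each sequence occupies as many bytes as the absolute value of its signed length, and that the sequences are ASCII; `read_indel_record` is a hypothetical name.

```python
import io
import struct

# offset, packed depth/maxLLK word, mappingQuality, three llk bytes,
# then the two signed allele lengths
INDEL_FIXED = struct.Struct("<IIB3Bhh")


def read_indel_record(stream):
    """Decode the bytes that follow a record header with recordType = 2."""
    fields = INDEL_FIXED.unpack(stream.read(INDEL_FIXED.size))
    offset, packed, mapping_quality, llk11, llk22, llk12, len1, len2 = fields
    # Assumption: sequence length in bytes is the absolute allele length
    seq1 = stream.read(abs(len1)).decode("ascii")
    seq2 = stream.read(abs(len2)).decode("ascii")
    return {
        "offset": offset,
        "depth": packed & 0xFFFFFF,
        "max_llk": packed >> 24,
        "mapping_quality": mapping_quality,
        "llk": {"11": llk11, "22": llk22, "12": llk12},
        "alleles": [(len1, seq1), (len2, seq2)],
    }


# Made-up record: allele 1 is "no indel", allele 2 is a 2-base deletion of AC.
example = INDEL_FIXED.pack(1, 8, 50, 10, 40, 0, 0, -2) + b"AC"
indel = read_indel_record(io.BytesIO(example))
```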

Last Record

Records with recordType = 0 are empty.
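
Putting the record types together, one chromosome block can be walked by reading record headers until the type 0 record appears. This is a rough sketch under several assumptions: the record type is the high nibble of the header byte, positions are obtained by summing the offsets starting from zero, and the payloads are skipped rather than decoded. The name `iter_records` is hypothetical.

```python
import io
import struct


def iter_records(stream):
    """Yield (position, record_type, depth) for one chromosome block.

    `stream` must be positioned just after a chromosome header.
    """
    position = 0  # assumed starting point for the cumulative offsets
    while True:
        header = stream.read(1)
        if not header:
            raise ValueError("chromosome block ended without a type 0 record")
        record_type = header[0] >> 4  # assumed: high nibble
        if record_type == 0:
            return  # end record carries no further data
        offset, packed, _mapq = struct.unpack("<IIB", stream.read(9))
        position += offset
        if record_type == 1:
            stream.read(10)  # skip llk[10]
        elif record_type == 2:
            stream.read(3)  # skip the three indel likelihoods
            len1, len2 = struct.unpack("<hh", stream.read(4))
            stream.read(abs(len1) + abs(len2))  # skip both sequences
        else:
            raise ValueError(f"unknown record type {record_type}")
        yield position, record_type, packed & 0xFFFFFF


# Two made-up simple records followed by the end record.
block = (
    bytes([0x11]) + struct.pack("<IIB", 10, 3, 40) + bytes(10)
    + bytes([0x12]) + struct.pack("<IIB", 5, 4, 40) + bytes(10)
    + bytes([0x00])
)
records = list(iter_records(io.BytesIO(block)))
```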

Tools That Use GLF Files

Variant Callers

glfSingle

glfTrio

glfMultiples

Utilities

  • glfMerge - Combines multiple GLF files generated for the same individual. Useful for combining data across platforms.