Changes

From Genome Analysis Wiki
3,714 bytes added, 18:54, 14 November 2009
'''GLF''' is a format for storing marginal likelihoods for next-generation sequence data, conditional on a set of possible genotypes.
 
== Generating GLF Files ==
    
GLF files can be generated using [http://samtools.sourceforge.net samtools]. To generate a GLF file, use the <code>samtools pileup -g</code> command, which requires a sorted alignment file ([[SAM]] format, or its binary BAM equivalent as in the example below) and a [[FASTA]] file with the human genome reference sequence.
    
   samtools pileup -g -f human_b36_male.fa.gz NA19240.chrom21.SLX.maq.SRP000032.2009_07.bam > NA19240.chrom21.SLX.maq.SRP000032.2009_07.glf
 
== File Format ==

The GLF file format is defined in an appendix to the [http://samtools.sourceforge.net/SAM1.pdf SAM file format specification].

The current specification (GLF version 3) follows. All integers are stored in little-endian byte order. Most GLF files are compressed in a GZIP-compatible format; samtools will only read GLF files that are compressed with the BGZF library.

=== Header ===

Each GLF file starts with a header that identifies the file format.

  char[4]             magicNumber = "GLF\3";   // Identifies the file format and version
  int32_t             headerLength;            // Typically zero
  char[headerLength]  headerText;              // Typically unused
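
As a rough illustration of the layout above, the main header could be read with Python's standard <code>struct</code> module. This is a minimal sketch, not code taken from samtools: the function name <code>read_glf_header</code> is made up for this example, and it assumes the stream has already been decompressed.

```python
import io
import struct

GLF_MAGIC = b"GLF\x03"  # "GLF" followed by the version byte (3)


def read_glf_header(stream):
    """Read the main GLF header from an uncompressed binary stream.

    Returns the header text as bytes (usually empty).
    """
    magic = stream.read(4)
    if magic != GLF_MAGIC:
        raise ValueError(f"not a GLF version 3 stream: magic={magic!r}")

    raw = stream.read(4)
    if len(raw) != 4:
        raise ValueError("truncated header length")
    (header_length,) = struct.unpack("<i", raw)  # little-endian int32_t
    if header_length < 0:
        raise ValueError("negative header length")

    header_text = stream.read(header_length)
    if len(header_text) != header_length:
        raise ValueError("truncated header text")
    return header_text


# In-memory example: a header with no text.
example = io.BytesIO(GLF_MAGIC + struct.pack("<i", 0))
print(read_glf_header(example))  # b''
```

For a file on disk, <code>gzip.open(path, "rb")</code> should provide a suitable stream, since the compressed files are described as GZIP compatible.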

=== Chromosome Header ===

The main header is followed by a series of blocks, each summarizing likelihoods along one chromosome. Each block starts with a header that records the chromosome label and length.

  int32_t             labelLength;             // Including the terminating null character
  char[labelLength]   label;                   // Printable string identifying the chromosome; typically "1", "2", "3" ... "22", "X", "Y", "MT" for human chromosomes
  uint32_t            chromosomeLength;        // Length of the original reference sequence; a useful sanity check, although samtools can generate GLF entries that go past the end of the sequence
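
The chromosome header can be decoded in the same way. The sketch below is illustrative only: <code>read_chromosome_header</code> is a hypothetical helper, and treating a clean end of stream as "no more chromosome blocks" is an assumption of this example rather than something the specification states.

```python
import io
import struct


def read_chromosome_header(stream):
    """Return (label, chromosome_length), or None at a clean end of stream."""
    raw = stream.read(4)
    if len(raw) == 0:
        return None  # assumption: end of stream means no further blocks
    if len(raw) != 4:
        raise ValueError("truncated label length")
    (label_length,) = struct.unpack("<i", raw)  # little-endian int32_t
    if label_length < 0:
        raise ValueError("negative label length")

    label_bytes = stream.read(label_length)
    if len(label_bytes) != label_length:
        raise ValueError("truncated chromosome label")
    # labelLength includes the terminating null character, so strip it.
    label = label_bytes.rstrip(b"\x00").decode("ascii")

    raw = stream.read(4)
    if len(raw) != 4:
        raise ValueError("truncated chromosome length")
    (chromosome_length,) = struct.unpack("<I", raw)  # little-endian uint32_t
    return label, chromosome_length


# In-memory example with a made-up chromosome length.
example = io.BytesIO(struct.pack("<i", 3) + b"21\x00" + struct.pack("<I", 1000))
print(read_chromosome_header(example))  # ('21', 1000)
```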

=== Likelihood Record Header ===

Each chromosome header is followed by a series of likelihood records, terminated by an end record of type 0. Every record starts with a one-byte header.

  char                refBase:4;               // 0..15 => XACMGRSVTWYHKDBN, so that 0x01 = A, 0x02 = C, 0x04 = G, 0x08 = T
  char                recordType:4;            // 0 for the last record in this chromosome, 1 for regular records, 2 for indels
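
The two 4-bit fields share a single byte. The sketch below assumes that the first bit field (<code>refBase</code>) occupies the low nibble and <code>recordType</code> the high nibble, which is how bit fields are commonly laid out on little-endian platforms; that ordering, and the helper name <code>decode_record_header</code>, are assumptions of this example.

```python
REF_BASE_CODES = "XACMGRSVTWYHKDBN"  # index 0..15, as listed in the field comment


def decode_record_header(byte_value):
    """Split the one-byte record header into (reference base, record type)."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("record header must fit in one byte")
    ref_code = byte_value & 0x0F   # assumed: low nibble holds refBase
    record_type = byte_value >> 4  # assumed: high nibble holds recordType
    return REF_BASE_CODES[ref_code], record_type


print(decode_record_header(0x14))  # ('G', 1)
```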

=== Simple Likelihood Record ===

These are records with recordType = 1.

  uint32_t            offset;                  // Offset from the previous record
  uint32_t:24         depth;                   // Depth of coverage for the current record
  uint32_t:8          maxLLK;                  // Maximum log-likelihood, scaled as -10 * log10(likelihood)
  uint8_t             mappingQuality;          // Root mean squared mapping quality
  uint8_t             llk[10];                 // Log-likelihood for each genotype, in the order AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
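
A simple likelihood record has a fixed size of 19 bytes after the record header byte (4 + 4 + 1 + 10). The following sketch unpacks one such record. It assumes <code>depth</code> sits in the low 24 bits and <code>maxLLK</code> in the high 8 bits of the shared 32-bit word; the name <code>read_simple_record</code> and the dictionary layout are made up for illustration.

```python
import io
import struct

GENOTYPES = ("AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT")


def read_simple_record(stream):
    """Parse the body of a recordType = 1 record (after the record header byte)."""
    raw = stream.read(19)  # 4 + 4 + 1 + 10 bytes
    if len(raw) != 19:
        raise ValueError("truncated simple likelihood record")
    offset, packed, mapping_quality = struct.unpack("<IIB", raw[:9])
    return {
        "offset": offset,
        "depth": packed & 0xFFFFFF,  # assumed: low 24 bits
        "max_llk": packed >> 24,     # assumed: high 8 bits
        "mapping_quality": mapping_quality,
        # Pair each stored byte with its genotype label.
        "llk": dict(zip(GENOTYPES, raw[9:])),
    }


# In-memory example with made-up values.
body = struct.pack("<IIB", 5, (2 << 24) | 30, 60) + bytes(range(0, 100, 10))
record = read_simple_record(io.BytesIO(body))
print(record["depth"], record["max_llk"], record["llk"]["AC"])  # 30 2 10
```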

=== Indel Likelihood Record ===

These are records with recordType = 2.

  uint32_t            offset;                  // Offset from the previous record
  uint32_t:24         depth;                   // Depth of coverage for the current record
  uint32_t:8          maxLLK;                  // Maximum log-likelihood, scaled as -10 * log10(likelihood)
  uint8_t             mappingQuality;          // Root mean squared mapping quality
  uint8_t             llkHomozygous11;         // Log-likelihood for an allele 1 homozygote
  uint8_t             llkHomozygous22;         // Log-likelihood for an allele 2 homozygote
  uint8_t             llkHomozygous12;         // Log-likelihood for an allele 1/2 heterozygote
  int16_t             signedAllele1length;     // Length of the first indel allele (positive=ins; negative=del; zero=no-indel)
  int16_t             signedAllele2length;     // Length of the second indel allele (positive=ins; negative=del; zero=no-indel)
  char                indelSequence1[abs(signedAllele1length)];  // Sequence of the first indel allele
  char                indelSequence2[abs(signedAllele2length)];  // Sequence of the second indel allele
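
Indel records have a 16-byte fixed part followed by two variable-length sequences. The sketch below assumes that each sequence occupies as many bytes as the absolute value of its signed length, with no terminating null character, and that the bit fields are packed as in the simple record. <code>read_indel_record</code> is a hypothetical helper.

```python
import io
import struct


def read_indel_record(stream):
    """Parse the body of a recordType = 2 record (after the record header byte)."""
    fixed = stream.read(16)  # 4 + 4 + 1 + 3 * 1 + 2 * 2 bytes
    if len(fixed) != 16:
        raise ValueError("truncated indel likelihood record")
    (offset, packed, mapping_quality,
     llk11, llk22, llk12, length1, length2) = struct.unpack("<IIBBBBhh", fixed)

    # Assumption: each sequence is abs(length) bytes long, without a terminator.
    sequence1 = stream.read(abs(length1))
    sequence2 = stream.read(abs(length2))
    if len(sequence1) != abs(length1) or len(sequence2) != abs(length2):
        raise ValueError("truncated indel sequence")

    return {
        "offset": offset,
        "depth": packed & 0xFFFFFF,  # assumed: low 24 bits
        "max_llk": packed >> 24,     # assumed: high 8 bits
        "mapping_quality": mapping_quality,
        "llk": {"11": llk11, "22": llk22, "12": llk12},
        # Positive length = insertion, negative = deletion, zero = no indel.
        "alleles": [
            (length1, sequence1.decode("ascii")),
            (length2, sequence2.decode("ascii")),
        ],
    }


# In-memory example: a 2-base insertion and a 1-base deletion (made-up values).
body = struct.pack("<IIBBBBhh", 7, (3 << 24) | 12, 50, 0, 40, 20, 2, -1) + b"ACG"
print(read_indel_record(io.BytesIO(body))["alleles"])  # [(2, 'AC'), (-1, 'G')]
```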

=== Last Record ===

These are records with recordType = 0. They are empty: nothing follows the record header.
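
Putting the record types together, one chromosome block can be walked until its end record is reached. The sketch below only tracks positions and record types and skips the likelihood values. It assumes that the first offset counts from position 0, that the record type is the high nibble of the header byte, and that indel sequences take <code>abs(length)</code> bytes; <code>iter_record_positions</code> is a made-up name.

```python
import io
import struct


def iter_record_positions(stream):
    """Yield (position, record_type) for each record up to the end record."""
    position = 0  # assumption: the first offset is counted from position 0
    while True:
        head = stream.read(1)
        if len(head) != 1:
            raise ValueError("stream ended before the end record")
        record_type = head[0] >> 4  # assumed: high nibble holds recordType
        if record_type == 0:
            return  # end record: nothing follows the header byte

        fixed = stream.read(9)  # offset, depth/maxLLK word, mappingQuality
        if len(fixed) != 9:
            raise ValueError("truncated record")
        (offset,) = struct.unpack_from("<I", fixed)
        position += offset

        if record_type == 1:
            if len(stream.read(10)) != 10:  # skip llk[10]
                raise ValueError("truncated simple record")
        elif record_type == 2:
            tail = stream.read(7)  # three likelihoods and two int16_t lengths
            if len(tail) != 7:
                raise ValueError("truncated indel record")
            length1, length2 = struct.unpack_from("<hh", tail, 3)
            skip = abs(length1) + abs(length2)
            if len(stream.read(skip)) != skip:
                raise ValueError("truncated indel sequence")
        else:
            raise ValueError(f"unknown record type {record_type}")
        yield position, record_type


# In-memory example: one simple record, one indel record, then the end record.
simple = bytes([0x11]) + struct.pack("<IIB", 10, 20, 60) + bytes(10)
indel = (bytes([0x21]) + struct.pack("<IIB", 5, 20, 60) + bytes(3)
         + struct.pack("<hh", 2, -1) + b"ACG")
block = io.BytesIO(simple + indel + bytes([0x00]))
print(list(iter_record_positions(block)))  # [(10, 1), (15, 2)]
```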
