Line 1: |
Line 1: |
| '''GLF''' is a format for storing marginal likelihoods for next-generation sequence data, conditional on a set of possible genotypes. | | '''GLF''' is a format for storing marginal likelihoods for next-generation sequence data, conditional on a set of possible genotypes. |
| | | |
− | === Generating GLF Files ===
| + | == Generating GLF Files == |
| | | |
| GLF files can be generated using [http://samtools.sourceforge.net samtools]. To generate a GLF file, use the <code>samtools pileup -g</code> command, which requires a sorted [[SAM]] file and a [[FASTA]] file with the human genome reference sequence. | | GLF files can be generated using [http://samtools.sourceforge.net samtools]. To generate a GLF file, use the <code>samtools pileup -g</code> command, which requires a sorted [[SAM]] file and a [[FASTA]] file with the human genome reference sequence. |
| | | |
| samtools pileup -g -f human_b36_male.fa.gz NA19240.chrom21.SLX.maq.SRP000032.2009_07.bam > NA19240.chrom21.SLX.maq.SRP000032.2009_07.glf | | samtools pileup -g -f human_b36_male.fa.gz NA19240.chrom21.SLX.maq.SRP000032.2009_07.bam > NA19240.chrom21.SLX.maq.SRP000032.2009_07.glf |
| + | |
| + | == File Format == |
| + | |
| + | The GLF-file format is defined in an Appendix to the [http://samtools.sourceforge.net/SAM1.pdf SAM-file format specification]. |
| + | |
| + | The current specification (GLF version 3) follows. All integers are stored in little-endian byte order. Most GLF files are compressed in a GZIP-compatible format; samtools will only read GLF files that are compressed with the BGZF library. |
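The two conventions above (gzip-compatible compression and little-endian integers) can be illustrated with a minimal Python sketch. It assumes that a generic gzip reader is enough for reading, since BGZF output is a series of concatenated gzip members; the helper names <code>open_glf_bytes</code> and <code>read_int32_le</code> are hypothetical and not part of samtools.

```python
import gzip
import io
import struct


def open_glf_bytes(compressed):
    """Return a binary stream over the decompressed bytes of a GLF file image.

    Assumption: the input is gzip-compatible (plain gzip or BGZF blocks).
    GzipFile transparently continues across concatenated gzip members.
    """
    return gzip.GzipFile(fileobj=io.BytesIO(compressed), mode="rb")


def read_int32_le(stream):
    """Read one signed 32-bit integer stored in little-endian byte order."""
    (value,) = struct.unpack("<i", stream.read(4))
    return value
```

Reading this way says nothing about writing: a file produced with plain gzip would still not satisfy the BGZF requirement that samtools imposes.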
| + | |
| + | === Header === |
| + | |
| + | Each GLF file starts with a header that identifies the file format. |
| + | |
| + | char[4] magicNumber = "GLF\3"; // This identifies the file format |
| + | int32_t headerLength; // This is typically zero |
| + | char[headerLength] headerText; // This is typically unused |
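As an illustration of the header layout, the following Python sketch reads and writes the three header fields on an already decompressed binary stream. The function names are hypothetical, and the error handling is only one possible choice.

```python
import io
import struct

GLF_MAGIC = b"GLF\x03"  # "GLF" followed by the byte value 3 (format version)


def read_glf_header(stream):
    """Read the main header and return headerText as bytes (usually empty)."""
    magic = stream.read(4)
    if magic != GLF_MAGIC:
        raise ValueError(f"not a GLF version 3 stream: magic={magic!r}")
    (header_length,) = struct.unpack("<i", stream.read(4))
    if header_length < 0:
        raise ValueError("negative headerLength")
    text = stream.read(header_length)
    if len(text) != header_length:
        raise ValueError("truncated header text")
    return text


def write_glf_header(stream, text=b""):
    """Write the magic number, headerLength and headerText."""
    stream.write(GLF_MAGIC + struct.pack("<i", len(text)) + text)
```

A round trip through <code>io.BytesIO</code> is a quick way to check both functions against each other.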
| + | |
| + | === Chromosome Header === |
| + | |
| + | The main header is followed by a series of blocks, each summarizing likelihoods along one chromosome. Each of these blocks starts with a header that records the chromosome label and length. |
| + | |
| + | int32_t labelLength; // Including the terminating null character |
| + | char[labelLength] label; // Printable string identifying the chromosome; typically, "1", "2", "3"... "22", "X", "Y", "MT" are used as labels for human chromosomes. |
| + | uint32_t chromosomeLength; // Length of the original reference sequence. This is a useful sanity check, but samtools can generate GLF entries that go past the end of the sequence. |
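A chromosome header could be decoded along these lines. This is a sketch under the assumption that the label is ASCII and that hitting end of file where a new block would start simply means there are no more chromosomes; <code>read_chromosome_header</code> is a hypothetical name.

```python
import io
import struct


def read_chromosome_header(stream):
    """Return (label, chromosome_length), or None at end of stream."""
    raw = stream.read(4)
    if not raw:
        return None  # no further chromosome blocks
    (label_length,) = struct.unpack("<i", raw)
    raw_label = stream.read(label_length)
    (chromosome_length,) = struct.unpack("<I", stream.read(4))
    # labelLength counts the terminating null character, so strip it.
    label = raw_label.rstrip(b"\x00").decode("ascii")
    return label, chromosome_length
```

Because samtools can emit entries past the end of the sequence, a reader would probably treat <code>chromosome_length</code> as a sanity check rather than a hard bound.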
| + | |
| + | === Likelihood Record Header === |
| + | |
| + | Each chromosome header is followed by a series of likelihood records, terminating with an end record of type 0. |
| + | |
| + | char refBase:4; // 0..15 => XACMGRSVTWYHKDBN, so that 0x01 = A, 0x02 = C, 0x04 = G, 0x08 = T |
| + | char recordType:4; // 0 for the last record in this chromosome, 1 for regular records, 2 for indels |
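The record header is a single byte holding two 4-bit fields. The sketch below assumes that <code>refBase</code> occupies the low four bits and <code>recordType</code> the high four bits, which matches the declaration order on common little-endian compilers but is an assumption here rather than something the listing states. The function names are hypothetical.

```python
REF_BASE_CODES = "XACMGRSVTWYHKDBN"  # index 0..15, as listed in the record header


def decode_record_header(byte_value):
    """Split one header byte (0..255) into (reference base letter, record type)."""
    ref_base = REF_BASE_CODES[byte_value & 0x0F]  # assumed: low nibble
    record_type = byte_value >> 4                 # assumed: high nibble
    return ref_base, record_type


def bases_for_code(code):
    """Expand a 4-bit code into the bases it covers (A=1, C=2, G=4, T=8)."""
    return "".join(base for bit, base in zip((1, 2, 4, 8), "ACGT") if code & bit)
```

The second helper shows why the table reads as it does: each ambiguity code is the bitwise OR of its bases, so code 3 (M) covers A and C, and code 15 (N) covers all four.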
| + | |
| + | === Simple Likelihood Record === |
| + | |
| + | These are records with recordType = 1. |
| + | |
| + | uint32_t offset; // Offset from the previous record |
| + | uint32_t depth:24; // Depth of coverage for the current record |
| + | uint32_t maxLLK:8; // Maximum log-likelihood, multiplied by -10 log 10 |
| + | uint8_t mappingQuality; // Root mean squared mapping quality |
| + | uint8_t llk[10]; // Log-likelihood for each genotype, in the order AA, AC, AG, AT, CC, CG, CT, GG, GT, TT |
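The fields of a simple likelihood record (everything after the one-byte record header) add up to 19 bytes. A minimal parsing sketch follows; it assumes that <code>depth</code> sits in the low 24 bits and <code>maxLLK</code> in the high 8 bits of the shared 32-bit word, and the names <code>SIMPLE_RECORD</code>, <code>GENOTYPES</code> and <code>parse_simple_record</code> are hypothetical.

```python
import struct

# The ten unordered genotypes, in the order used by the llk array.
GENOTYPES = ("AA", "AC", "AG", "AT", "CC", "CG", "CT", "GG", "GT", "TT")

# offset (4) + depth/maxLLK word (4) + mappingQuality (1) + llk[10] (10)
SIMPLE_RECORD = struct.Struct("<IIB10B")


def parse_simple_record(payload):
    """Decode the 19 bytes that follow the record header of a type-1 record."""
    offset, packed, mapping_quality, *llk = SIMPLE_RECORD.unpack(payload)
    return {
        "offset": offset,
        "depth": packed & 0xFFFFFF,   # assumed: low 24 bits
        "max_llk": packed >> 24,      # assumed: high 8 bits
        "mapping_quality": mapping_quality,
        "llk": dict(zip(GENOTYPES, llk)),
    }
```

Keying the likelihoods by genotype string avoids having to remember the array order at every call site.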
| + | |
| + | === Indel Likelihood Record === |
| + | |
| + | These are records with recordType = 2. |
| + | |
| + | uint32_t offset; // Offset from the previous record |
| + | uint32_t depth:24; // Depth of coverage for the current record |
| + | uint32_t maxLLK:8; // Maximum log-likelihood, multiplied by -10 log 10 |
| + | uint8_t mappingQuality; // Root mean squared mapping quality |
| + | uint8_t llkHomozygous11; // Log-likelihood for an allele 1 homozygote |
| + | uint8_t llkHomozygous22; // Log-likelihood for an allele 2 homozygote |
| + | uint8_t llkHomozygous12; // Log-likelihood for an allele 1/2 heterozygote |
| + | int16_t signedAllele1Length; // Length of the first indel allele (positive=ins; negative=del; zero=no-indel) |
| + | int16_t signedAllele2Length; // Length of the second indel allele (positive=ins; negative=del; zero=no-indel) |
| + | char indelSequence1[abs(signedAllele1Length)]; // Sequence of the first indel allele |
| + | char indelSequence2[abs(signedAllele2Length)]; // Sequence of the second indel allele |
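Indel records have a 16-byte fixed part followed by two variable-length sequences. The sketch below assumes that each sequence is stored with as many bytes as the absolute value of its signed length, that sequences are ASCII, and that <code>depth</code> occupies the low 24 bits of the shared word; <code>read_indel_record</code> is a hypothetical name.

```python
import io
import struct

# offset (4) + depth/maxLLK word (4) + mappingQuality (1)
# + three likelihoods (3) + two signed allele lengths (2 + 2)
INDEL_FIXED = struct.Struct("<IIB3Bhh")


def read_indel_record(stream):
    """Decode a type-2 record, starting just after the record header byte."""
    fields = INDEL_FIXED.unpack(stream.read(INDEL_FIXED.size))
    offset, packed, mapq, llk11, llk22, llk12, len1, len2 = fields
    # Negative lengths denote deletions, so the byte count is the magnitude.
    seq1 = stream.read(abs(len1)).decode("ascii")
    seq2 = stream.read(abs(len2)).decode("ascii")
    return {
        "offset": offset,
        "depth": packed & 0xFFFFFF,
        "max_llk": packed >> 24,
        "mapping_quality": mapq,
        "llk": {"11": llk11, "22": llk22, "12": llk12},
        "allele1": (len1, seq1),
        "allele2": (len2, seq2),
    }
```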
| + | |
| + | === Last Record === |
| + | |
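| + | These records are empty: a record with recordType = 0 carries no fields beyond the record header and marks the end of the current chromosome block. |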
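Putting the record types together, a reader can walk one chromosome block until it meets the type-0 record. The sketch below only tracks positions and record types and skips the likelihood fields. It assumes that the record type is the high nibble of the header byte, that positions accumulate from zero by adding each record's offset, and that indel sequences occupy the absolute value of their signed lengths; <code>iter_record_types</code> is a hypothetical name.

```python
import io
import struct


def iter_record_types(stream):
    """Yield (position, record_type) for each record in one chromosome block."""
    position = 0
    while True:
        header = stream.read(1)
        if not header:
            raise EOFError("chromosome block ended without a type-0 record")
        record_type = header[0] >> 4
        if record_type == 0:
            return  # last record: nothing follows the header byte
        (offset,) = struct.unpack("<I", stream.read(4))
        position += offset
        if record_type == 1:
            stream.read(15)  # depth/maxLLK (4) + mappingQuality (1) + llk[10]
        elif record_type == 2:
            stream.read(8)   # depth/maxLLK (4) + mappingQuality (1) + 3 likelihoods
            len1, len2 = struct.unpack("<hh", stream.read(4))
            stream.read(abs(len1) + abs(len2))  # skip both indel sequences
        else:
            raise ValueError(f"unknown record type {record_type}")
        yield position, record_type
```

After the generator finishes, the stream is positioned at the next chromosome header, if there is one.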