Difference between revisions of "GLF"
(One intermediate revision by the same user not shown) | |||
Line 3: | Line 3: | ||
== Generating GLF Files == | == Generating GLF Files == | ||
− | GLF files can be generated using [ | + | GLF files can be generated using [https://github.com/statgen/samtools-0.1.7a-hybrid samtools-hybrid]. To generate a GLF file, use the <code>samtools-hybrid pileup -g</code> command, which requires a sorted [[SAM]] file and a [[FASTA]] file with the human genome reference sequence. |
− | samtools pileup -g -f human_b36_male.fa.gz NA19240.SLX.maq.bam > NA19240.SLX.maq.glf | + | samtools-hybrid pileup -g -f human_b36_male.fa.gz NA19240.SLX.maq.bam > NA19240.SLX.maq.glf |
+ | |||
+ | NOTE: Newer versions of samtools do not have the pileup command and do not generate glf files. samtools-hybrid is a version that still supports those operations but also has the updated BAQ logic. | ||
=== Generating GLF Files for a Specific Region === | === Generating GLF Files for a Specific Region === | ||
− | Sometimes, you want to generate a GLF file for a specific chromosome region. This can be accomplished by first using <code>samtools view</code> to extract reads for a specific region from a [[SAM]] or [[BAM]] file and then using <code>samtools pileup -g</code> to generate the GLF file using the selected reads as input. Here is an example: | + | Sometimes, you want to generate a GLF file for a specific chromosome region. This can be accomplished by first using <code>samtools-hybrid view</code> to extract reads for a specific region from a [[SAM]] or [[BAM]] file and then using <code>samtools-hybrid pileup -g</code> to generate the GLF file using the selected reads as input. Here is an example: |
− | samtools view -u NA19240.SLX.maq.bam chr20:10000000-20000000 | samtools pileup -g -f human_b36_male.fa.gz - > NA19240.SLX.chr20_region.glf | + | samtools-hybrid view -u NA19240.SLX.maq.bam chr20:10000000-20000000 | samtools-hybrid pileup -g -f human_b36_male.fa.gz - > NA19240.SLX.chr20_region.glf |
== File Format == | == File Format == | ||
− | The GLF-file format | + | The GLF-file format was defined in an Appendix to the [http://samtools.sourceforge.net/SAM1.pdf SAM-file format specification]. It has since been removed from that document. |
The current specification (GLF version 3) follows. All integers in are stored in the little-endian byte order. Most GLF files are compressed in a GZIP compatible format; SAMTOOLS will only read GLF files that are compressed with the BGZF library. | The current specification (GLF version 3) follows. All integers in are stored in the little-endian byte order. Most GLF files are compressed in a GZIP compatible format; SAMTOOLS will only read GLF files that are compressed with the BGZF library. | ||
Line 71: | Line 73: | ||
Records with recordType = 0 are empty. | Records with recordType = 0 are empty. | ||
+ | |||
+ | |||
+ | === SAM Format Specification 0.1.2-draft (20090820) A. Genotype Likelihood Format version 3 (GLFv3)=== | ||
+ | GLFs store the probability of a genotype given data. | ||
+ | |||
+ | All integers are in little-endian. | ||
+ | |||
+ | {| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1" | ||
+ | |-style="background: #f2f2f2; text-align: center;" | ||
+ | ! colspan="3"|Field !! Description !! Type !! Value | ||
+ | |- | ||
+ | | colspan="3"|magic || GLFv3 magic number || char[4]||GLF\3 | ||
+ | |- | ||
+ | |colspan="3"|l_text||Length of the header text, including any zero padding||int32_t|| | ||
+ | |- | ||
+ | |colspan="3"|text||Text||char[l_text]|| | ||
+ | |- | ||
+ | |colspan="6" align="center"|''List of reference information until the end of the file'' | ||
+ | |- | ||
+ | | rowspan="20" style="width: 20px"| || colspan="2"|l_name||Length of the reference sequence name plus 1 (including NULL) || int32_t|| | ||
+ | |- | ||
+ | | colspan="2"|name||Name; NULL terminated || char[l_name]|| | ||
+ | |- | ||
+ | | colspan="2"|ref_len||length of the reference sequence || uint32_t|| | ||
+ | |- | ||
+ | | colspan="5" align="center"|''List of sites until a record with rtype==0'' | ||
+ | |- | ||
+ | | colspan="2"|rtype_ref||<nowiki>record_type<<4 | ref_base; 0..15=>XACMGRSVTWYHKDBN</nowiki> || uint8_t|| | ||
+ | |- | ||
+ | | rowspan="4"|if rtype==1||offset||offset from the precedent record<sup>1</sup>||uint32_t|| | ||
+ | |- | ||
+ | | min_depth||<nowiki>min_lk<<24 | read_depth (min_lk capped at 255)</nowiki>||uint32_t|| | ||
+ | |- | ||
+ | | rmsMapQ|| RMS of mapping qualities of reads covering the site ||uint8_t|| | ||
+ | |- | ||
+ | | lk||likelihood of each genotype in the order of AA AC AG AT CC GC CT GG GT TT||uint8_t[10]|| | ||
+ | |-style="border-top: 1px #aaa dashed;" | ||
+ | | rowspan="10"|if rtype==2||offset||offset from the precedent record<sup>1,2</sup>||uint32_t|| | ||
+ | |- | ||
+ | | min_depth||<nowiki>min_lk<<24 | read_depth</nowiki>||uint32_t|| | ||
+ | |- | ||
+ | | rmsMapQ|| RMS of mapping qualities of reads covering the site ||uint8_t|| | ||
+ | |- | ||
+ | | lkHom1||likelihood of the first homozygous indel allele (capped at 255)||uint8_t|| | ||
+ | |- | ||
+ | | lkHom2||likelihood of the second homozygous indel allele (capped at 255)||uint8_t|| | ||
+ | |- | ||
+ | | lkHet|| likelihood of a heterozygote the (capped at 255)||uint8_t|| | ||
+ | |- | ||
+ | | indelLen1|| length of the first indel allele (positive=ins; negative=del; zero=no-indel)||uint16_t|| | ||
+ | |- | ||
+ | | indelLen2|| length of the second indel allele ||uint16_t|| | ||
+ | |- | ||
+ | | indelSeq1|| sequence of the first indel allele||char[indelLen1]|| | ||
+ | |- | ||
+ | | indelSeq2|| sequence of the second indel allele||char[indelLen2]|| | ||
+ | |-style="border-top: 1px #aaa dashed;" | ||
+ | | if rtype==0||endMarker||end of this chromosome; no data in this record||(null)|| | ||
+ | |} | ||
+ | |||
+ | '''Notes:''' | ||
+ | |||
+ | 1. Field offset equals the zero-based coordinate of the current record minus the coordinate of the previous record. For the first record in a reference sequence, the coordinate of the precedent record is assumed to be zero. Offset is non-negative. | ||
+ | |||
+ | 2. If a sequence is inserted between position [x,x+1] on the reference sequence, the coordinate of this record is x; if the sequence between [x,y] on the reference is is deleted, the coordinate of this record is x. | ||
== Tools That Use GLF Files == | == Tools That Use GLF Files == |
Latest revision as of 10:22, 13 August 2013
GLF is a format for storing marginal likelihoods for next-generation sequence data, conditional on a set of possible genotypes.
Generating GLF Files
GLF files can be generated using samtools-hybrid. To generate a GLF file, use the samtools-hybrid pileup -g
command, which requires a sorted SAM file and a FASTA file with the human genome reference sequence.
samtools-hybrid pileup -g -f human_b36_male.fa.gz NA19240.SLX.maq.bam > NA19240.SLX.maq.glf
NOTE: Newer versions of samtools do not have the pileup command and do not generate glf files. samtools-hybrid is a version that still supports those operations but also has the updated BAQ logic.
Generating GLF Files for a Specific Region
Sometimes, you want to generate a GLF file for a specific chromosome region. This can be accomplished by first using samtools-hybrid view
to extract reads for a specific region from a SAM or BAM file and then using samtools-hybrid pileup -g
to generate the GLF file using the selected reads as input. Here is an example:
samtools-hybrid view -u NA19240.SLX.maq.bam chr20:10000000-20000000 | samtools-hybrid pileup -g -f human_b36_male.fa.gz - > NA19240.SLX.chr20_region.glf
File Format
The GLF-file format was defined in an Appendix to the SAM-file format specification. It has since been removed from that document.
The current specification (GLF version 3) follows. All integers in are stored in the little-endian byte order. Most GLF files are compressed in a GZIP compatible format; SAMTOOLS will only read GLF files that are compressed with the BGZF library.
Header
Each GLF file starts with a header that identifies the files.
char[4] magicNumber = "GLF\3"; // This identifies the file format int32_t headerLength; // This is typically zero char[headerLength] headerText; // This is typically unused
Chromosome Header
The main header is followed by a series of blocks, summarizing likelihoods along each chromosome. Each of these blocks starts with a header, that records the chromosome label and length.
int32_t labelLength; // Including the terminating null character char[labelLength] label; // Printable string identified the chromosome label; typically, "1", "2", "3"... "22", "X", "Y", "MT" are used as labels for human chromosomes. uint32_t chromosomeLength; // Length of the original reference sequence. This is a useful sanity check, but samtools can generate GLF entries that go past the end of the sequence.
Likelihood Record Header
Each chromosome header is followed by a series of likelihood records, terminating with an end record of type 0.
char refBase:4; // 0..15 => XACMGRSVTWYHKDBN, so that 0x01 = A, 0x02 = C, 0x04 = G, 0x08 = T char recordType:4; // 0 for the last record in this chromosome, 1 for regular records, 2 for indels
Simple Likelihood Record
These are records with recordType = 1.
uint32_t offset; // Offset from the previous record uint32_t:24 depth; // Depth of coverage for the current record uint32_t:8 maxLLK; // Maximum log-likelihood, multiplied by -10 log 10 uint8 mappingQuality; // Root mean squared mapping quality uint8_t llk[10]; // Log-likelihood for each genotype, in the order AA..AT..CC..CT..GG..TT
Indel Likelihood Record
These are records with recordType = 2.
uint32_t offset; // Offset from the previous record uint32_t:24 depth; // Depth of coverage for the current record uint32_t:8 maxLLK; // Maximum log-likelihood, multiplied by -10 log 10 uint8 mappingQuality; // Root mean squared mapping quality uint8_t llkHomozygous11; // Log-likelihood for an allele 1 homozygote uint8_t llkHomozygous22; // Log-likelihood for an allele 2 homozygote uint8_t llkHomozygous12; // Log-likelihood for an allele 1/2 heterozygote int16_t signedAllele1length; // Length of the first indel allele (positive=ins; negative=del; zero=no-indel) int16_t signedAllele2length; // Length of the first indel allele (positive=ins; negative=del; zero=no-indel) char indelSequence1[signedAllele1Length]; // Sequence of the first indel allele char indelSequence2[signedAllele2Length]; // Sequence of the first indel allele
Last Record
Records with recordType = 0 are empty.
SAM Format Specification 0.1.2-draft (20090820) A. Genotype Likelihood Format version 3 (GLFv3)
GLFs store the probability of a genotype given data.
All integers are in little-endian.
Field | Description | Type | Value | ||
---|---|---|---|---|---|
magic | GLFv3 magic number | char[4] | GLF\3 | ||
l_text | Length of the header text, including any zero padding | int32_t | |||
text | Text | char[l_text] | |||
List of reference information until the end of the file | |||||
l_name | Length of the reference sequence name plus 1 (including NULL) | int32_t | |||
name | Name; NULL terminated | char[l_name] | |||
ref_len | length of the reference sequence | uint32_t | |||
List of sites until a record with rtype==0 | |||||
rtype_ref | record_type<<4 | ref_base; 0..15=>XACMGRSVTWYHKDBN | uint8_t | |||
if rtype==1 | offset | offset from the precedent record1 | uint32_t | ||
min_depth | min_lk<<24 | read_depth (min_lk capped at 255) | uint32_t | |||
rmsMapQ | RMS of mapping qualities of reads covering the site | uint8_t | |||
lk | likelihood of each genotype in the order of AA AC AG AT CC GC CT GG GT TT | uint8_t[10] | |||
if rtype==2 | offset | offset from the precedent record1,2 | uint32_t | ||
min_depth | min_lk<<24 | read_depth | uint32_t | |||
rmsMapQ | RMS of mapping qualities of reads covering the site | uint8_t | |||
lkHom1 | likelihood of the first homozygous indel allele (capped at 255) | uint8_t | |||
lkHom2 | likelihood of the second homozygous indel allele (capped at 255) | uint8_t | |||
lkHet | likelihood of a heterozygote the (capped at 255) | uint8_t | |||
indelLen1 | length of the first indel allele (positive=ins; negative=del; zero=no-indel) | uint16_t | |||
indelLen2 | length of the second indel allele | uint16_t | |||
indelSeq1 | sequence of the first indel allele | char[indelLen1] | |||
indelSeq2 | sequence of the second indel allele | char[indelLen2] | |||
if rtype==0 | endMarker | end of this chromosome; no data in this record | (null) |
Notes:
1. Field offset equals the zero-based coordinate of the current record minus the coordinate of the previous record. For the first record in a reference sequence, the coordinate of the precedent record is assumed to be zero. Offset is non-negative.
2. If a sequence is inserted between position [x,x+1] on the reference sequence, the coordinate of this record is x; if the sequence between [x,y] on the reference is is deleted, the coordinate of this record is x.
Tools That Use GLF Files
Variant Callers
Utilities
- glfMerge - Combines GLF multiple glfFiles generated for the same individual. Useful for combining data across platforms.