Changes

From Genome Analysis Wiki
7,555 bytes added ,  16:10, 11 September 2015
m
Line 1: Line 1:  
== What is SAM ==
 
== What is SAM ==
The '''SAM Format''' is a text format for storing aligned reads in a series of tab delimited ASCII columns.  
+
The '''SAM Format''' is a text format for storing sequence data in a series of tab delimited ASCII columns.  
   −
Most often it is generated as a human readable projection of its sister [[BAM]] format, which can store data in a compact, indexed, binary representation.  
+
Most often it is generated as a human readable version of its sister [[BAM]] format, which stores the same data in a compressed, indexed, binary form.
 +
 
 +
Currently, most SAM format data is output from aligners that read FASTQ files and assign the sequences to a position with respect to a known reference genome.  In the future, SAM will also be used to archive unaligned sequence data generated directly from sequencing machines.
    
The current definition of the format is at [[http://samtools.sourceforge.net/SAM1.pdf BAM/SAM Specification]].
 
The current definition of the format is at [[http://samtools.sourceforge.net/SAM1.pdf BAM/SAM Specification]].
 +
 +
If you are writing software to read SAM or BAM data, our C++ [[C++ Library: libStatGen|libStatGen]] is a good resource to use.
      Line 10: Line 14:  
SAM files and BAM files contain the same information, but in a different format.  Refer to the specs to see a format description.
 
SAM files and BAM files contain the same information, but in a different format.  Refer to the specs to see a format description.
   −
Both SAM & BAM files contain a header section and an alignment section.
+
Both SAM & BAM files contain an optional header section followed by the alignment section.
 +
 
 
The header section may contain information about the entire file and additional information for alignments.  The alignments then associate themselves with specific header information.
 
The header section may contain information about the entire file and additional information for alignments.  The alignments then associate themselves with specific header information.
 +
 +
The alignment section contains the information for each sequence about where/how it aligns to the reference genome.
    
=== What Information Does SAM/BAM Have for an Alignment ===
 
=== What Information Does SAM/BAM Have for an Alignment ===
 
Each Alignment has:
 
Each Alignment has:
 
* query name, QNAME (SAM)/read_name (BAM).  It is used to group/identify alignments that are together, like paired alignments or a read that appears in multiple alignments.
 
* query name, QNAME (SAM)/read_name (BAM).  It is used to group/identify alignments that are together, like paired alignments or a read that appears in multiple alignments.
* a bitwise set of information describing the alignment, FLAG:
+
* a bitwise set of information describing the alignment, FLAG.  Provides the following information:
** are there multiple fragments
+
** are there multiple fragments?
 +
** are all fragments properly aligned?
 +
** is this fragment unmapped?
 +
** is the next fragment unmapped?
 +
** is this query the reverse strand?
 +
** is the next fragment the reverse strand?
 +
** is this the 1st fragment?
 +
** is this the last fragment?
 +
** is this a secondary alignment?
 +
** did this read fail quality controls?
 +
** is this read a PCR or optical duplicate?
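
The FLAG questions above can be answered by testing individual bits. The following is a minimal Python sketch, not part of libStatGen; the bit values are the ones given in the SAM specification, while the table name <code>FLAG_BITS</code> and the function <code>decode_flag</code> are illustrative names chosen for this example.

```python
# Bit values follow the SAM specification; the descriptions mirror the
# questions in the list above. FLAG_BITS and decode_flag are illustrative names.
FLAG_BITS = [
    (0x1, "multiple fragments"),
    (0x2, "all fragments properly aligned"),
    (0x4, "this fragment unmapped"),
    (0x8, "next fragment unmapped"),
    (0x10, "this query on reverse strand"),
    (0x20, "next fragment on reverse strand"),
    (0x40, "first fragment"),
    (0x80, "last fragment"),
    (0x100, "secondary alignment"),
    (0x200, "failed quality controls"),
    (0x400, "PCR or optical duplicate"),
]


def decode_flag(flag):
    """Return the descriptions of every bit that is set in a FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


# FLAG values 99 and 147 appear in the example alignments on this page.
for value in (99, 147):
    print(value, decode_flag(value))
```

A FLAG of 0 has no bits set, so every question in the list is answered "no".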
    
Not all alignments contain all of this information.  The rest of the alignment fields may be set to default values if the information is unknown.
 
Not all alignments contain all of this information.  The rest of the alignment fields may be set to default values if the information is unknown.
 
* reference sequence name, RNAME, often contains the Chromosome name.   
 
* reference sequence name, RNAME, often contains the Chromosome name.   
 
* leftmost position of where this alignment maps to the reference, POS.  For SAM, the reference starts at 1, so this value is 1-based, while for BAM the reference starts at 0, so this value is 0-based.  Be careful to always use the correct base when referencing positions.
 
* leftmost position of where this alignment maps to the reference, POS.  For SAM, the reference starts at 1, so this value is 1-based, while for BAM the reference starts at 0, so this value is 0-based.  Be careful to always use the correct base when referencing positions.
* mapping quality, MAPQ, which contains the "phred-scaled posterior probability that the mapping position" is wrong. (from SAM-1.pdf)
+
* mapping quality, MAPQ, which contains the "phred-scaled posterior probability that the mapping position" is wrong. (see [[http://samtools.sourceforge.net/SAM1.pdf]])
* CIGAR
+
* string describing how the query aligns to the reference, including matches/mismatches, insertions, deletions, and clipped bases, [[SAM#What is a CIGAR?|CIGAR]]
 
* the reference sequence name of the next alignment in this group, MRNM or RNEXT.  In paired alignments, it is the mate's reference sequence name. (A group is alignments with the same query name.)
 
* the reference sequence name of the next alignment in this group, MRNM or RNEXT.  In paired alignments, it is the mate's reference sequence name. (A group is alignments with the same query name.)
 
* leftmost position of where the next alignment in this group maps to the reference, MPOS or PNEXT.  For SAM, the reference starts at 1, so this value is 1-based, while for BAM the reference starts at 0, so this value is 0-based.  Be careful to always use the correct base when referencing positions.
 
* leftmost position of where the next alignment in this group maps to the reference, MPOS or PNEXT.  For SAM, the reference starts at 1, so this value is 1-based, while for BAM the reference starts at 0, so this value is 0-based.  Be careful to always use the correct base when referencing positions.
 
* length of this group from the leftmost position to the rightmost position, ISIZE or TLEN
 
* length of this group from the leftmost position to the rightmost position, ISIZE or TLEN
 
* the query sequence for this alignment, SEQ
 
* the query sequence for this alignment, SEQ
* the query quality for this alignment, QUAL, one for each base in the query sequence.
+
* the query quality for this alignment, [[SAM#What is QUAL?|QUAL]], one for each base in the query sequence.
* Additional optional information is also contained within the alignment, TAGS.  A bunch of different information can be stored here and they appear as key/value pairs.  See the spec for a detailed list of commonly used tags and what they mean.
+
* Additional optional information is also contained within the alignment, [[SAM#What are TAGs?|TAGs]].  Many kinds of information can be stored here as key/value pairs.  See the spec for a detailed list of commonly used tags and what they mean.
 +
 
    
==== What is a CIGAR? ====
 
==== What is a CIGAR? ====
 
You may have heard the term CIGAR, but wondered what it means.  Hopefully this section will help clarify it.
 
You may have heard the term CIGAR, but wondered what it means.  Hopefully this section will help clarify it.
 +
 +
The sequence being aligned to a reference may have additional bases that are not in the reference or may be missing bases that are in the reference.  The CIGAR string is a sequence of base lengths, each followed by its associated operation.  The operations indicate which bases align with the reference (as either a match or a mismatch), which bases are deleted from the reference, and which bases are insertions that are not in the reference.
 +
 +
For example:
 +
RefPos:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
 +
Reference:  C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
 +
Read: ACTAGAATGGCT
 +
Aligning these two:
 +
RefPos:    1  2  3  4  5  6  7    8  9 10 11 12 13 14 15 16 17 18 19
 +
Reference:  C  C  A  T  A  C  T    G  A  A  C  T  G  A  C  T  A  A  C
 +
Read:                  A  C  T  A  G  A  A    T  G  G  C  T
 +
With the alignment above, you get:
 +
POS: 5
 +
CIGAR: 3M1I3M1D5M
 +
 +
The POS indicates that the read aligns starting at position 5 on the reference.
 +
The CIGAR says that the first 3 bases in the read sequence align with the reference.  The next base in the read does not exist in the reference.  Then 3 bases align with the reference.  The next reference base does not exist in the read sequence, then 5 more bases align with the reference.  Note that at position 14, the base in the read is different than the reference, but it still counts as an M since it aligns to that position.
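
The walk-through above can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function names <code>parse_cigar</code> and <code>cigar_lengths</code> are made up for this example, and the operations beyond M, I and D (N, S, H, P, = and X) are taken from the SAM specification rather than from this page.

```python
import re

# One CIGAR element is a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that consume bases of the read (query) and of the reference,
# as listed in the SAM specification.
QUERY_OPS = set("MIS=X")
REF_OPS = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def cigar_lengths(cigar):
    """Return (bases consumed in the read, bases consumed in the reference)."""
    ops = parse_cigar(cigar)
    query_len = sum(length for length, op in ops if op in QUERY_OPS)
    ref_len = sum(length for length, op in ops if op in REF_OPS)
    return query_len, ref_len


pos = 5  # 1-based POS from the example above
query_len, ref_len = cigar_lengths("3M1I3M1D5M")
end = pos + ref_len - 1  # last reference position covered by the read
print(query_len, ref_len, end)  # 12 12 16
```

The read length of 12 matches the 12 bases of ACTAGAATGGCT, and the alignment covers reference positions 5 through 16.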
 +
 +
 +
==== What is QUAL? ====
 +
QUAL stands for query quality.  It is an indicator for how accurate each base in the query sequence (SEQ) is.  If QUAL is specified, there is a quality value for each base in SEQ.
 +
 +
Quality is calculated based on the probability that a base is wrong, p, using the following formula:
 +
<math>quality = -10 \log_{10}p</math>
 +
This quality is called the [http://en.wikipedia.org/wiki/Phred_quality_score Phred Quality Score].
 +
 +
Since a human readable format is desired for SAM, 33 is added to the calculated quality in order to make it a printable character ranging from ! - ~.
 +
 +
So, for SAM, the QUAL field is:
 +
<math>QUAL = (-10 \log_{10}p) + 33</math>
 +
 +
Phred Quality is also found in a FASTQ file, described here: http://en.wikipedia.org/wiki/FASTQ_format#Quality
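
The two formulas above can be applied in either direction. The sketch below is a hedged Python illustration, not code from any SAM library; the function names are hypothetical.

```python
import math


def qual_to_phred(qual):
    """Convert a SAM QUAL string to a list of Phred scores (subtract 33)."""
    return [ord(ch) - 33 for ch in qual]


def phred_to_error_probability(quality):
    """Invert quality = -10 * log10(p) to recover the error probability p."""
    return 10 ** (-quality / 10)


def error_probability_to_qual_char(p):
    """Apply QUAL = (-10 * log10(p)) + 33, rounded to a printable character."""
    return chr(round(-10 * math.log10(p)) + 33)


scores = qual_to_phred("!+5I")
print(scores)  # [0, 10, 20, 40]
print([phred_to_error_probability(q) for q in scores])
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base is wrong, and a score of 40 to a 1 in 10,000 chance.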
 +
 +
 +
==== What are TAGs? ====
 +
TAGs are optional fields on a SAM/BAM Alignment.
 +
A TAG is composed of a two-character TAG key, the type of the value, and the value:
 +
[A-Za-z][A-Za-z0-9]:[AifZH]:.*
 +
 +
The types, A, i, f, Z, H are used to indicate the type of value stored in the tag.
 +
{| style="margin: 1em 1em 1em 0; background-color: #f9f9f9; border: 1px #aaa solid; border-collapse: collapse;" border="1"
 +
|-style="background: #f2f2f2; text-align: center;"
 +
! '''Type''' !! '''Description'''
 +
|-
 +
|A
 +
|character
 +
|-
 +
|i
 +
|signed 32-bit integer
 +
|-
 +
|f
 +
|single-precision float
 +
|-
 +
|Z
 +
|string
 +
|-
 +
|H
 +
|hex string
 +
|}
 +
 +
 +
There is a set of predefined tags that are generally used in Alignments.  They are documented in the SAM Specification.
 +
Predefined tags have been specified for storing information about the read or alignment.
 +
Examples of things stored in predefined tags:
 +
* Previous settings for various fields if they have been updated due to additional processing
 +
* Mappings from the alignment to Header values, used to match to a read group or program.
 +
* Additional information which may already be in the header like library and platform.
 +
 +
A user can also use any additional tags to store any information they want.  TAGs starting with X, Y, or Z are reserved to be user defined.
 +
 +
Examples:
 +
XT:A:U  - user defined tag called XT.  It holds a character.  The value associated with this tag is 'U'.
 +
NM:i:2  - predefined tag NM means: Edit distance to the reference (number of changes necessary to make this equal the reference, excluding clipping)
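
As a rough illustration of the key:type:value layout, the following Python sketch splits a text TAG into its three parts and converts the value according to the type table above. The function name <code>parse_tag</code> is an assumption made for this example, and only the five types listed on this page are handled.

```python
def parse_tag(text):
    """Split a SAM text TAG into (key, type, value), converting numeric types."""
    # Split on the first two colons only, since the value may itself contain ':'.
    key, type_code, raw = text.split(":", 2)
    if type_code == "i":
        value = int(raw)
    elif type_code == "f":
        value = float(raw)
    elif type_code in ("A", "Z", "H"):
        # Characters, strings and hex strings are kept as text here.
        value = raw
    else:
        raise ValueError("unsupported TAG type: " + type_code)
    return key, type_code, value


for tag in ("XT:A:U", "NM:i:2", "RG:Z:UM0098:1"):
    print(parse_tag(tag))
```

Limiting the split to two colons matters for values such as the read group ID UM0098:1, which contains a colon of its own.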
 +
 +
=== What Information is in the SAM/BAM Header ===
 +
 +
The SAM/BAM header is not required, but if it is there, it contains generic information for the SAM/BAM file. 
 +
 +
The header may contain the version information for the SAM/BAM file and information regarding whether or not and how the file is sorted.
 +
 +
It also contains supplemental information for alignment records like information about the reference sequences, the processing that was used to generate the various reads in the file, and the programs that have been used to process the different reads.  The alignment records may then point to this supplemental information identifying which ones the specific alignment is associated with.
 +
 +
For example, a group of reads in the SAM/BAM file may all be assigned to the same reference sequence.  Rather than every alignment containing information about the reference sequence, this information is put in the header, and the alignment "points" to the appropriate reference sequence in the header via the RNAME field.  The header contains generic information about this reference like its length.
 +
 +
The SAM/BAM Header also may contain comments which are free-form text lines that can contain any information.
 +
 +
Header lines start with an '@'.
    
== Example SAM ==
 
== Example SAM ==
 +
=== Example Header Lines ===
 +
@HD VN:1.0 SO:coordinate
 +
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
 +
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
 +
@SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5
 +
@RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743 CN:UMCORE
 +
@RG ID:UM0098:2 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L002 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743 CN:UMCORE
 +
@PG ID:bwa VN:0.5.4
 +
@PG ID:GATK TableRecalibration VN:1.0.3471 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, CycleCovariate, DinucCovariate, TileCovariate], default_read_group=null, default_platform=null, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, exception_if_no_tile=false, ignore_nocall_colorspace=false, pQ=5, maxQ=40, smoothing=1
 +
 +
In the alignment examples below, you will see that the 2nd alignment maps back to the RG line with ID UM0098:1, and all of the alignments point back to the SQ line with SN:1 because their RNAME is 1.
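
To show how an alignment can be matched to a header line, here is a small Python sketch that reads tab-delimited header lines into dictionaries. It is a simplified illustration: the function name <code>parse_header_line</code> is hypothetical, and the sample lines are shortened versions of the header lines above with the tabs written out explicitly.

```python
def parse_header_line(line):
    """Return (record type, fields) for one tab-delimited SAM header line."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0][1:]  # drop the leading '@'
    if record_type == "CO":
        # Comment lines are free-form text, not key:value fields.
        return record_type, "\t".join(fields[1:])
    # Split each field on the first ':' only, since values may contain ':'.
    return record_type, dict(field.split(":", 1) for field in fields[1:])


header = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:1\tLN:249250621",
    "@RG\tID:UM0098:1\tPL:ILLUMINA\tSM:SD37743",
]
parsed = [parse_header_line(line) for line in header]

# Index read groups by ID so that an RG tag value can be looked up directly.
read_groups = {tags["ID"]: tags for kind, tags in parsed if kind == "RG"}
print(read_groups["UM0098:1"]["SM"])  # SD37743
```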
 +
 +
 
=== Example Alignments ===
 
=== Example Alignments ===
 
This is what the alignment section of a SAM file looks like:
 
This is what the alignment section of a SAM file looks like:
    
  1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
 
  1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
  19:20389:F:275+18M2D19M 99 1 17644 0 37M = 17919 314 TATGACTGCTAATAATACCTACACATGTTAGAACCAT >>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9 XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
+
  19:20389:F:275+18M2D19M 99 1 17644 0 37M = 17919 314 TATGACTGCTAATAATACCTACACATGTTAGAACCAT >>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9 RG:Z:UM0098:1 XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
 
  19:20389:F:275+18M2D19M 147 1 17919 0 18M2D19M = 17644 -314 GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT ;44999;499<8<8<<<8<<><<<<><7<;<<<>><< XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:18^CA19
 
  19:20389:F:275+18M2D19M 147 1 17919 0 18M2D19M = 17644 -314 GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT ;44999;499<8<8<<<8<<><<<<><7<;<<<>><< XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:18^CA19
 
  9:21597+10M2I25M:R:-209 83 1 21678 0 8M2I27M = 21469 -244 CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT <;9<<5><<<<><<<>><<><>><9>><>>>9>>><> XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35
 
  9:21597+10M2I25M:R:-209 83 1 21678 0 8M2I27M = 21469 -244 CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT <;9<<5><<<<><<<>><<><>><9>><>>>9>>><> XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35
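
A line like the ones above can be split into its eleven mandatory fields and its optional TAGs. The Python sketch below is a minimal illustration that assumes the columns are tab-delimited as in a real SAM file; the names <code>FIELD_NAMES</code> and <code>parse_alignment</code> are chosen for this example, and the sample line is the first alignment above with only two of its TAGs kept.

```python
# The eleven mandatory SAM columns, in order.
FIELD_NAMES = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_alignment(line):
    """Split one tab-delimited SAM alignment line into named fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(FIELD_NAMES, columns[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # Everything after the eleventh column is an optional TAG.
    record["TAGS"] = columns[11:]
    return record


line = "\t".join([
    "1:497:R:-272+13M17D24M", "113", "1", "497", "37", "37M",
    "15", "100338662", "0",
    "CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG",
    "0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>",
    "XT:A:U", "NM:i:0",
])
record = parse_alignment(line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["TAGS"])

# SAM positions are 1-based; subtract 1 for the 0-based value used by BAM.
zero_based_pos = record["POS"] - 1
```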
Line 116: Line 234:  
|TAGs
 
|TAGs
 
|XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
 
|XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
|XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
+
|RG:Z:UM0098:1 XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
 
|XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:18^CA19
 
|XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:18^CA19
 
|XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35
 
|XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35
 
|}
 
|}
 +
 +
== Tips/Tricks ==
 +
*Calculating BAM Block Size
 +
** Block Size = 8*4 + ReadNameLength(including null) + CigarLength*4 + (ReadLength+1)/2 + ReadLength + TagLength
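
The block size formula above can be written out as a short Python sketch. The function name <code>bam_block_size</code> and its parameter names are illustrative; the sketch assumes CigarLength means the number of CIGAR operations, that the read name is plain ASCII, and that the caller already knows the TAG length in bytes.

```python
def bam_block_size(read_name, cigar_op_count, read_length, tag_bytes):
    """Compute the BAM block size in bytes using the formula above."""
    fixed_bytes = 8 * 4                    # eight 32-bit fixed fields
    name_bytes = len(read_name) + 1        # read name plus terminating null
    cigar_bytes = cigar_op_count * 4       # one 32-bit value per CIGAR operation
    seq_bytes = (read_length + 1) // 2     # two bases packed into each byte
    qual_bytes = read_length               # one quality byte per base
    return (fixed_bytes + name_bytes + cigar_bytes
            + seq_bytes + qual_bytes + tag_bytes)


# A 37-base read with CIGAR 37M (one operation) and no TAGs.
print(bam_block_size("19:20389:F:275+18M2D19M", 1, 37, 0))
```

The integer division rounds up for reads with an odd number of bases, because the final base still occupies half of a byte.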
 +
 +
 +
 +
You should now be a SAM expert :-)