Difference between revisions of "SAM"

Revision as of 11:48, 30 July 2010

What is SAM

The SAM Format is a text format for storing sequence data in a series of tab-delimited ASCII columns.

Most often it is generated as a human-readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form.

Currently, most SAM format data is output from aligners that read FASTQ files and assign the sequences to a position with respect to a genome. In the future, SAM will also be used to archive unaligned sequence data generated directly from sequencing machines.

The current definition of the format is at [BAM/SAM Specification].

What Information is in SAM & BAM

SAM files and BAM files contain the same information, but in different formats. Refer to the specification for a description of each format.

Both SAM & BAM files contain a header section and an alignment section. The header section may contain information about the entire file and additional information for alignments. The alignments then associate themselves with specific header information.

The alignment section contains the information for each sequence about where/how it aligns to the reference genome.

What Information Does SAM/BAM Have for an Alignment

Each Alignment has:

query name, QNAME (SAM)/read_name (BAM). It is used to group/identify alignments that are together, like paired alignments or a read that appears in multiple alignments.
a bitwise set of information describing the alignment, FLAG. Provides the following information:
- are there multiple fragments?
- are all fragments properly aligned?
- is this fragment unmapped?
- is the next fragment unmapped?
- is this query the reverse strand?
- is the next fragment the reverse strand?
- is this the 1st fragment?
- is this the last fragment?
- is this a secondary alignment?
- did this read fail quality controls?
- is this read a PCR or optical duplicate?

Not all alignments contain every field listed below. The rest of the alignment fields may be set to default values if the information is unknown.

reference sequence name, RNAME, often contains the Chromosome name.
leftmost position of where this alignment maps to the reference, POS. For SAM, the reference starts at 1, so this value is 1-based, while for BAM the reference starts at 0, so this value is 0-based. Always use the correct base when referencing positions.
mapping quality, MAPQ, which contains the "phred-scaled posterior probability that the mapping position" is wrong. (from SAM-1.pdf)
string indicating alignment information that allows the storing of clipped bases, insertions, and deletions, CIGAR
the reference sequence name of the next alignment in this group, MRNM or RNEXT. In paired alignments, it is the mate's reference sequence name. (A group is alignments with the same query name.)
leftmost position of where the next alignment in this group maps to the reference, MPOS or PNEXT. For SAM, the reference starts at 1, so this value is 1-based, while for BAM the reference starts at 0, so this value is 0-based. Always use the correct base when referencing positions.
length of this group from the leftmost position to the rightmost position, ISIZE or TLEN
the query sequence for this alignment, SEQ
the query quality for this alignment, QUAL, one for each base in the query sequence.
Additional optional information is also contained within the alignment, TAGs. Many different kinds of information can be stored here, each appearing as a key/value pair. See the spec for a detailed list of commonly used tags and what they mean.
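
The FLAG questions listed above each correspond to one bit of the FLAG value. The following is a minimal sketch of decoding a FLAG, assuming the bit values given in the SAM specification (they are not stated on this page). The names FLAG_BITS and decode_flag are illustrative only and do not come from any library.

```python
# Bit values are taken from the SAM specification (an assumption relative to
# this page); the descriptions mirror the FLAG questions listed above.
FLAG_BITS = [
    (0x1, "multiple fragments"),
    (0x2, "all fragments properly aligned"),
    (0x4, "this fragment unmapped"),
    (0x8, "next fragment unmapped"),
    (0x10, "this query on reverse strand"),
    (0x20, "next fragment on reverse strand"),
    (0x40, "first fragment"),
    (0x80, "last fragment"),
    (0x100, "secondary alignment"),
    (0x200, "failed quality controls"),
    (0x400, "PCR or optical duplicate"),
]


def decode_flag(flag):
    """Return the description of every bit that is set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


# FLAG values 99 and 147 are a typical pair of mates.
for flag in (99, 147):
    print(flag, decode_flag(flag))
```

For example, 99 is 1 + 2 + 32 + 64, so it describes the first fragment of a properly aligned pair whose mate is on the reverse strand.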

What is a CIGAR?

You may have heard the term CIGAR, but wondered what it means. Hopefully this section will help clarify it.

The sequence being aligned to a reference may have additional bases that are not in the reference or may be missing bases that are in the reference. The CIGAR string is a sequence of base lengths, each followed by its associated operation. The operations indicate which bases align with the reference (either a match or a mismatch), which reference bases are deleted from the read, and which read bases are insertions that are not in the reference.

For example:

RefPos:     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T  G  A  A  C  T  G  A  C  T  A  A  C
Read: ACTAGAATGGCT

Aligning these two:

RefPos:     1  2  3  4  5  6  7     8  9 10 11 12 13 14 15 16 17 18 19
Reference:  C  C  A  T  A  C  T     G  A  A  C  T  G  A  C  T  A  A  C
Read:                   A  C  T  A  G  A  A     T  G  G  C  T

With the alignment above, you get:

POS: 5
CIGAR: 3M1I3M1D5M

The POS indicates that the read aligns starting at position 5 on the reference. The CIGAR says that the first 3 bases in the read sequence align with the reference. The next base in the read does not exist in the reference. Then 3 bases align with the reference. The next reference base does not exist in the read sequence, then 5 more bases align with the reference. Note that at position 14, the base in the read is different than the reference, but it still counts as an M since it aligns to that position.
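
The walk-through above can be sketched in code. This is a minimal example, assuming the operation letters and their read/reference consumption rules from the SAM specification (only M, I, and D are discussed on this page). The function names parse_cigar and cigar_lengths are hypothetical.

```python
import re

# Which operations consume reference bases and which consume read bases,
# per the SAM specification (an assumption relative to this page).
CONSUMES_REF = set("MDN=X")
CONSUMES_READ = set("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '3M1I3M1D5M' into (length, operation) pairs."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Reject strings that contain anything other than length/operation pairs.
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError("malformed CIGAR: %r" % cigar)
    return [(int(n), op) for n, op in ops]


def cigar_lengths(cigar):
    """Return (read bases, reference bases) covered by the CIGAR."""
    pairs = parse_cigar(cigar)
    read_len = sum(n for n, op in pairs if op in CONSUMES_READ)
    ref_len = sum(n for n, op in pairs if op in CONSUMES_REF)
    return read_len, ref_len


pos = 5  # 1-based POS from the example above
read_len, ref_len = cigar_lengths("3M1I3M1D5M")
end = pos + ref_len - 1  # last reference position covered by the read
print(read_len, ref_len, end)  # prints: 12 12 16
```

The read has 12 bases and also spans 12 reference positions (5 through 16), because the one inserted base and the one deleted base cancel out.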

What is QUAL?

QUAL stands for query quality. It is an indicator for how accurate each base in the query sequence (SEQ) is. If QUAL is specified, there is a quality value for each base in SEQ.

Quality is calculated based on the probability that a base is wrong, p, using the following formula:

 $quality=-10\log _{10}p$

This quality is called the Phred Quality Score.

Since a human-readable format is desired for SAM, 33 is added to the calculated quality and the result is written as the ASCII character with that code, giving a printable character in the range ! (33) to ~ (126).

So, for SAM, the QUAL field is:

 $QUAL=(-10\log _{10}p)+33$
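
The two formulas above can be checked with a short script. This is a sketch only: the function names are made up for illustration, and rounding the quality to the nearest integer before adding 33 is an assumption, since this page does not say how fractional values are handled.

```python
import math


def phred_quality(p):
    """Phred quality for an error probability p (0 < p <= 1), rounded to an integer."""
    return int(round(-10 * math.log10(p)))


def qual_char(p):
    """SAM QUAL character: the quality plus 33, written as an ASCII character."""
    return chr(phred_quality(p) + 33)


def decode_qual(qual):
    """Convert a SAM QUAL string back into a list of Phred quality scores."""
    return [ord(c) - 33 for c in qual]


def error_probability(quality):
    """Invert the formula: p = 10 ** (-quality / 10)."""
    return 10 ** (-quality / 10.0)


print(phred_quality(0.001), qual_char(0.001))  # prints: 30 ?
print(decode_qual("0;==-"))  # prints: [15, 26, 28, 28, 12]
```

A base with a 1 in 1000 chance of being wrong has quality 30, which is stored in SAM as the character with ASCII code 63, "?".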

What are TAGs?

Example SAM

Example Alignments

This is what the alignment section of a SAM file looks like:

1:497:R:-272+13M17D24M	113	1	497	37	37M	15	100338662	0	CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG	0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>	XT:A:U	NM:i:0	SM:i:37	AM:i:0	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:37
19:20389:F:275+18M2D19M	99	1	17644	0	37M	=	17919	314	TATGACTGCTAATAATACCTACACATGTTAGAACCAT	>>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9	XT:A:R	NM:i:0	SM:i:0	AM:i:0	X0:i:4	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:37
19:20389:F:275+18M2D19M	147	1	17919	0	18M2D19M	=	17644	-314	GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT	;44999;499<8<8<<<8<<><<<<><7<;<<<>><<	XT:A:R	NM:i:2	SM:i:0	AM:i:0	X0:i:4	X1:i:0	XM:i:0	XO:i:1	XG:i:2	MD:Z:18^CA19
9:21597+10M2I25M:R:-209	83	1	21678	0	8M2I27M	=	21469	-244	CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT	<;9<<5><<<<><<<>><<><>><9>><>>>9>>><>	XT:A:R	NM:i:2	SM:i:0	AM:i:0	X0:i:5	X1:i:0	XM:i:0	XO:i:1	XG:i:2	MD:Z:35

In this example, the fields are:

Field	Alignment 1	Alignment 2	Alignment 3	Alignment 4
QNAME	1:497:R:-272+13M17D24M	19:20389:F:275+18M2D19M	19:20389:F:275+18M2D19M	9:21597+10M2I25M:R:-209
FLAG	113	99	147	83
RNAME	1	1	1	1
POS	497	17644	17919	21678
MAPQ	37	0	0	0
CIGAR	37M	37M	18M2D19M	8M2I27M
MRNM/RNEXT	15	=	=	=
MPOS/PNEXT	100338662	17919	17644	21469
ISIZE/TLEN	0	314	-314	-244
SEQ	CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG	TATGACTGCTAATAATACCTACACATGTTAGAACCAT	GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT	CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT
QUAL	0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>	>>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9	;44999;499<8<8<<<8<<><<<<><7<;<<<>><<	<;9<<5><<<<><<<>><<><>><9>><>>>9>>><>
TAGs	XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37	XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37	XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:18^CA19	XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35
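
The breakdown in the table can be reproduced programmatically. The following is a minimal sketch that splits one alignment line on tabs and names the eleven mandatory fields. The names parse_alignment, FIELD_NAMES, and INT_FIELDS are hypothetical. It also assumes each tag has the form key:type:value, with type i meaning an integer, as described in the SAM specification.

```python
FIELD_NAMES = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_alignment(line):
    """Parse one tab-delimited SAM alignment line into a dict of fields plus tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(FIELD_NAMES):
        raise ValueError("expected at least 11 tab-delimited columns")
    record = {}
    for name, value in zip(FIELD_NAMES, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # Everything after the mandatory fields is an optional key:type:value tag.
    tags = {}
    for column in columns[len(FIELD_NAMES):]:
        key, type_code, value = column.split(":", 2)
        tags[key] = int(value) if type_code == "i" else value
    record["TAGS"] = tags
    return record


# Alignment 3 from the example, rebuilt with explicit tab separators.
line = "\t".join([
    "19:20389:F:275+18M2D19M", "147", "1", "17919", "0", "18M2D19M",
    "=", "17644", "-314",
    "GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT",
    ";44999;499<8<8<<<8<<><<<<><7<;<<<>><<",
    "XT:A:R", "NM:i:2", "SM:i:0", "AM:i:0", "X0:i:4", "X1:i:0",
    "XM:i:0", "XO:i:1", "XG:i:2", "MD:Z:18^CA19",
])
record = parse_alignment(line)
print(record["POS"], record["CIGAR"], record["TLEN"], record["TAGS"]["MD"])
print(record["POS"] - 1)  # the same position expressed 0-based, as in BAM
```

Keeping POS as the 1-based value read from SAM, and converting it only where a 0-based value is needed, makes it harder to mix the two conventions by accident.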

