SAM & BAM Files: Format, Structure & Usage in RNA-Seq

Last updated: February 26, 2026

Introduction

SAM and BAM files are formats used to represent the results of mapping reads (nucleotide sequences) generated by a sequencer to a reference sequence. They describe where each read was mapped and how it was mapped.

Reads are generally represented in FASTQ files, and reference sequences are represented in FASTA files. Using these as input, mapping software such as HISAT2, STAR, Bowtie2, or BWA produces SAM or BAM files. SAM files are text-based, while BAM files are binary files that contain equivalent information. Because BAM files are smaller in size, mapping results are usually stored in BAM format.
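
The text/binary distinction can be checked programmatically. The sketch below relies on two facts about the BAM format: the file is compressed in BGZF, which is readable as gzip, and the decompressed stream begins with the four magic bytes "BAM\1". The function name looks_like_bam is made up for this illustration and is not part of any library.

```python
import gzip

BAM_MAGIC = b"BAM\x01"  # first four bytes of the decompressed BAM stream


def looks_like_bam(path):
    """Return True if the file reads as gzip/BGZF and starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compressed, for example a plain-text SAM file.
        return False
```

A plain-text SAM file fails the gzip check and returns False. This is only a quick sanity check, not a validation of the whole file.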

SAM / BAM format

Let us consider the following example of mapping results.

Position	1	2	3	4	5	6	7	8	9	10	11	12	13	14	*	*	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36	37	38	39	40	41	42	43	44	45
Reference	A	G	C	A	T	G	T	T	A	G	A	T	A	A	*	*	G	A	T	A	G	C	T	G	T	G	C	T	A	G	T	A	G	G	C	A	G	T	C	A	G	C	G	C	C	A	T
+r001/1							T	T	A	G	A	T	A	A	A	G	G	A	T	A	*	C	T	G
+r002						a	a	a	A	G	A	T	A	A	*	G	G	A	T	A
+r003				g	c	c	t	a	A	G	C	T	A	A
+r004																		A	T	A	G	C	T	.	.	.	.	.	.	.	.	.	.	.	.	.	.	T	C	A	G	C
-r003																									t	t	a	g	c	t	T	A	G	G	C
-r001/2																																							C	A	G	C	G	G	C	A	T

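Bases written in lowercase are clipped: they lie at the ends of a read and are not part of its alignment to the reference sequence. r001/1 and r001/2 form a read pair, r003 is a chimeric read aligned in two parts (+r003 and -r003), and r004 is a split alignment whose two aligned segments are separated by skipped reference bases (shown as dots).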

The corresponding SAM file looks like the following.

Example of a SAM file

@HD	VN:1.6	SO:coordinate
@SQ	SN:ref	LN:45
r001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
r002	0	ref	9	30	3S6M1P1I4M	*	0	0	AAAAGATAAGGATA	*
r003	0	ref	9	30	5S6M	*	0	0	GCCTAAGCTAA	*	SA:Z:ref,29,-,6H5M,17,0;
r004	0	ref	16	30	6M14N5M	*	0	0	ATAGCTTCAGC	*
r003	2064	ref	29	17	6H5M	*	0	0	TAGGC	*	SA:Z:ref,9,+,5S6M,30,1;
r001	147	ref	37	30	9M	=	7	-39	CAGCGGCAT	*	NM:i:1

Lines starting with @ are header lines.

The remaining lines are the mapping results, one alignment record per line. Each record consists of 11 required tab-separated columns, followed by optional additional columns such as NM:i:1. The contents of each column are as follows.

	Column name	Description
Column 1	QNAME	Read name
Column 2	FLAG	Flags describing the mapping result
Column 3	RNAME	Reference sequence name
Column 4	POS	1-based leftmost mapping position
Column 5	MAPQ	Mapping quality
Column 6	CIGAR	String representation of the alignment
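Column 7	RNEXT	Reference sequence name of the mate (paired read); = means the same as RNAME
Column 8	PNEXT	Mapping position of the mate (paired read)
Column 9	TLEN	Observed template length (often called insert size); negative for the rightmost read of a pair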
Column 10	SEQ	Nucleotide sequence
Column 11	QUAL	Base quality scores
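
The column layout above can be sketched as a small parser. This is a minimal illustration, not a full SAM parser: the names SamRecord and parse_sam_line are invented for this example, and optional fields are kept as plain strings without type conversion.

```python
from typing import Dict, NamedTuple


class SamRecord(NamedTuple):
    """The 11 required SAM columns plus a dict of optional tags."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Dict[str, str]


def parse_sam_line(line):
    """Split one alignment line on tabs and name its columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated columns")
    tags = {}
    for item in fields[11:]:
        # Optional fields have the form TAG:TYPE:VALUE; the value may contain ':'.
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Last record of the example SAM file.
record = parse_sam_line("r001\t147\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1")
print(record.qname, record.pos, record.tlen, record.tags)
```

For real data, an established library is a safer choice than hand-written parsing; the sketch only shows how the columns map to fields.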

For more detailed information about the FLAG and CIGAR fields, please refer to this document.
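
As a brief illustration of these two fields, the sketch below decodes a FLAG value into its individual bits and computes how many reference bases a CIGAR string covers. The bit values and CIGAR operations follow the SAM specification; the short labels such as "proper_pair" and the function names are chosen for this example only.

```python
import re

# FLAG bits defined by the SAM specification, with informal labels.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def decode_flag(flag):
    """Return the labels of all bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def reference_span(cigar):
    """Number of reference bases covered by an alignment with this CIGAR."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


print(decode_flag(99))    # first read of the r001 pair
print(decode_flag(2064))  # second part of the chimeric read r003
# r001 starts at POS 7; its last aligned reference position is POS + span - 1.
print(7 + reference_span("8M2I4M1D3M") - 1)
```

In the example above, r001 with CIGAR 8M2I4M1D3M covers 16 reference bases (8 + 4 + 1 + 3; the 2-base insertion does not consume the reference), so it ends at position 22, which matches the alignment diagram.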

What is a Sorted BAM?

BAM files written directly by mapping software usually keep the order in which the reads were processed. A sorted BAM file is a BAM file whose records have been reordered by reference coordinate. Coordinate sorting is required for indexing the file and is expected by many downstream tools, so it is almost always performed before proceeding to the next stage of analysis.
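
To make "ordered by reference coordinate" concrete, here is a minimal sketch that sorts SAM text lines in memory. It assumes the reference order is the order of the @SQ header lines and places unmapped reads (RNAME "*") last. The function name coordinate_sort is hypothetical, and the sketch does not update the SO tag in the header. In practice this step is done with a dedicated tool such as samtools sort.

```python
def coordinate_sort(sam_lines):
    """Return header lines followed by alignment lines sorted by (reference, POS)."""
    lines = [line.rstrip("\n") for line in sam_lines]
    headers = [line for line in lines if line.startswith("@")]
    records = [line for line in lines if line and not line.startswith("@")]

    # Rank each reference by the order in which its @SQ line appears.
    ref_rank = {}
    for line in headers:
        fields = line.split("\t")
        if fields[0] == "@SQ":
            for field in fields[1:]:
                if field.startswith("SN:"):
                    ref_rank[field[3:]] = len(ref_rank)

    def sort_key(line):
        fields = line.split("\t")
        rname, pos = fields[2], int(fields[3])
        # References missing from the header (including "*") sort last.
        return (ref_rank.get(rname, len(ref_rank)), pos)

    return headers + sorted(records, key=sort_key)


unsorted_lines = [
    "@SQ\tSN:ref\tLN:45",
    "r001\t147\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*",
    "r004\t0\tref\t16\t30\t6M14N5M\t*\t0\t0\tATAGCTTCAGC\t*",
    "r002\t0\tref\t9\t30\t3S6M1P1I4M\t*\t0\t0\tAAAAGATAAGGATA\t*",
]
for line in coordinate_sort(unsorted_lines):
    print(line.split("\t")[0])
```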

Whether a BAM file is sorted can be determined by checking the SO tag on the @HD line of the header. If the file is sorted by coordinate, it is labeled SO:coordinate. In practice, filenames ending in .sorted.bam are often used to make this clear.
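
The header check described above can be sketched as follows. The function operates on SAM header text; for a BAM file, the header would first have to be extracted as text (for example with samtools view -H). The function name sort_order is made up for this illustration. When the @HD line or its SO tag is missing, the sketch returns "unknown", which is the default value in the SAM specification.

```python
def sort_order(sam_lines):
    """Return the SO value from the @HD header line, or 'unknown' if absent."""
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment record
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@HD":
            for field in fields[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return "unknown"


header = ["@HD\tVN:1.6\tSO:coordinate", "@SQ\tSN:ref\tLN:45"]
print(sort_order(header) == "coordinate")
```

Note that the SO tag only records what the writing tool declared; it does not prove that the records are actually in that order.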

About the Author

BxINFO LLC

A research support company specializing in bioinformatics.

We provide tools and information to support life science research, with a focus on RNA-Seq analysis.
