SAM & BAM Files: Format, Structure & Usage in RNA-Seq

Last updated: March 13, 2026

Introduction

SAM and BAM files store the results of mapping sequencer reads (nucleotide sequences) to a reference sequence. They record where each read mapped and how it aligned to the reference.

Reads are typically stored in FASTQ files, and reference sequences are stored in FASTA files. These serve as input to mapping tools such as HISAT2, STAR, Bowtie2, and BWA, which produce SAM or BAM files as output. SAM is a text-based format, while BAM is a compressed binary format that holds the same information. Because BAM files are much smaller, mapping results are usually stored in BAM format.
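The difference between the two formats can be seen from the first few bytes of a file. The following is a minimal sketch, not a validator: it assumes that a BAM file is compressed in gzip-compatible blocks whose decompressed content begins with the four bytes "BAM\1", and that a SAM file begins either with a header line or with an alignment line of at least 11 tab-separated columns. The function name sniff_alignment_format is hypothetical and chosen only for this illustration.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of any gzip-compatible stream
BAM_MAGIC = b"BAM\x01"     # first four bytes of the decompressed BAM stream
SAM_HEADER_STARTS = (b"@HD\t", b"@SQ\t", b"@RG\t", b"@PG\t", b"@CO\t")


def sniff_alignment_format(stream):
    """Guess "BAM", "SAM", or "unknown" from a seekable binary stream.

    The stream is assumed to be positioned at its start.
    """
    head = stream.read(2)
    stream.seek(0)
    if head == GZIP_MAGIC:
        # BAM is stored in gzip-compatible blocks, so the standard
        # library can decompress enough of it to read the magic bytes.
        with gzip.GzipFile(fileobj=stream) as gz:
            return "BAM" if gz.read(4) == BAM_MAGIC else "unknown"
    first_line = stream.readline()
    # SAM is plain text: either a header line such as "@HD\t..." or an
    # alignment line with at least 11 tab-separated columns.
    if first_line.startswith(SAM_HEADER_STARTS) or first_line.count(b"\t") >= 10:
        return "SAM"
    return "unknown"


# Usage with in-memory data; a real file would be opened with open(path, "rb").
sam_bytes = b"@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:ref\tLN:45\n"
print(sniff_alignment_format(io.BytesIO(sam_bytes)))  # SAM
```

A gzip-compressed SAM file is reported as "unknown" by this sketch, because only the BAM magic bytes are checked after decompression.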

SAM / BAM format

Consider the following example of mapping results.

Position           11111  1111122222222223333333333444444
          12345678901234**5678901234567890123456789012345
Reference AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT
+r001/1         TTAGATAAAGGATA*CTG
+r002          aaaAGATAA*GGATA
+r003        gcctaAGCTAA
+r004                      ATAGCT..............TCAGC
-r003                             ttagctTAGGC
-r001/2                                         CAGCGGCAT

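Lowercase bases are clipped: they sit at the ends of reads and are not part of the alignment to the reference. An asterisk marks a gap: in the Reference row it pads the columns where a read carries an insertion, and in a read it marks a deleted or padded position. The dots in r004 mark a skipped stretch of the reference. r001/1 and r001/2 are paired-end reads, r003 is a chimeric read, and r004 is a split alignment.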

The corresponding SAM file is shown below.

Example of a SAM file

@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001   99 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
r001  147 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1

Lines starting with @ are header lines.

The remaining lines contain the mapping results. Each line consists of 11 required tab-separated columns, optionally followed by additional columns. The columns are described below.

Column      Name    Description
Column 1    QNAME   Read name
Column 2    FLAG    Bitwise flags describing various properties of the mapping result
Column 3    RNAME   Reference sequence name
Column 4    POS     Mapping position
Column 5    MAPQ    Mapping quality
Column 6    CIGAR   A string encoding of the alignment
Column 7    RNEXT   Reference name of the mate read
Column 8    PNEXT   Mapping position of the mate read
Column 9    TLEN    Template length (insert size)
Column 10   SEQ     Nucleotide sequence
Column 11   QUAL    Per-base quality scores
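To make the column layout concrete, here is a minimal sketch that splits one alignment line into the 11 required columns and collects any additional columns into a dictionary. The names SamRecord and parse_sam_record are hypothetical, chosen for this illustration, and the sketch assumes each optional column has the form TAG:TYPE:VALUE, converting only the integer type "i". It is not a replacement for a dedicated SAM/BAM library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # column 1: read name
    flag: int    # column 2: bitwise flags
    rname: str   # column 3: reference sequence name
    pos: int     # column 4: mapping position
    mapq: int    # column 5: mapping quality
    cigar: str   # column 6: alignment string
    rnext: str   # column 7: reference name of the mate read
    pnext: int   # column 8: mapping position of the mate read
    tlen: int    # column 9: template length
    seq: str     # column 10: nucleotide sequence
    qual: str    # column 11: per-base quality scores
    tags: dict   # optional columns, keyed by two-letter tag


def parse_sam_record(line):
    """Parse one tab-separated alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    tags = {}
    for optional in fields[11:]:
        tag, type_code, value = optional.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# The last record of the example above, with real tab characters.
record = parse_sam_record("r001\t147\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1")
print(record.qname, record.pos, record.cigar, record.tags)
```

Splitting the optional column on the first two colons only keeps values such as SA:Z:ref,29,-,6H5M,17,0; intact, even if the value itself contains a colon.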

For more details on the FLAG and CIGAR fields, see the SAM format specification.
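As a rough illustration of how these two fields are read, the sketch below decodes a FLAG value into its individual bits and sums the CIGAR operations that consume reference or read bases. The bit values and operation letters follow the SAM format specification; the function names and the short flag labels are hypothetical, chosen for this example.

```python
import re

# FLAG bits as defined in the SAM specification (labels are informal).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """Return the labels of all bits set in FLAG, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered (operations M, D, N, =, X)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Number of bases expected in SEQ (operations M, I, S, =, X)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


print(decode_flag(99))    # ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
print(decode_flag(2064))  # ['reverse', 'supplementary']
# r001 starts at position 7, so its last aligned reference base is 7 + 16 - 1 = 22.
print(reference_span("8M2I4M1D3M"), read_length("8M2I4M1D3M"))  # 16 17
```

For r004, the 14N operation counts toward the reference span but not toward the read length, which is how a short read can cover positions 16 to 40 of the reference.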

What is a Sorted BAM?

BAM files produced directly by mapping software are typically ordered by the sequence in which reads were processed. A Sorted BAM is a BAM file whose records have been rearranged by genomic coordinate: first by reference sequence, then by mapping position. Coordinate sorting is required before a BAM file can be indexed, and most downstream tools, such as genome browsers, expect sorted and indexed input.

You can check whether a BAM file has been sorted by examining the SO tag on the @HD line of the header. A coordinate-sorted file will have SO:coordinate; the other possible values are unsorted, queryname, and unknown. In practice, files are often named with a .sorted.bam extension to make the sort status immediately obvious.
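The check described above can be sketched in a few lines, assuming the file content is available as SAM text (for a BAM file, samtools view -h prints the header and records in this form). The first function reads the SO tag; the second verifies that the records really are in coordinate order, since the tag is only a declaration. Both function names are hypothetical, and the second assumes that reference sequences are ranked in the order of their @SQ lines, with reads on unlisted references (such as "*") placed last.

```python
def header_sort_order(sam_text):
    """Return the SO value from the @HD line, or "unknown" if absent."""
    for line in sam_text.splitlines():
        if not line.startswith("@"):
            break  # header lines always come before alignment lines
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return "unknown"


def records_in_coordinate_order(sam_text):
    """Check that records are ordered by reference (in @SQ order), then POS."""
    ref_rank = {}
    previous = (-1, -1)
    for line in sam_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    ref_rank[field[3:]] = len(ref_rank)
        elif line and not line.startswith("@"):
            cols = line.split("\t")
            # Column 3 is RNAME and column 4 is POS.
            current = (ref_rank.get(cols[2], len(ref_rank)), int(cols[3]))
            if current < previous:
                return False
            previous = current
    return True


EXAMPLE = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:ref\tLN:45",
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*",
    "r004\t0\tref\t16\t30\t6M14N5M\t*\t0\t0\tATAGCTTCAGC\t*",
    "r001\t147\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1",
])

print(header_sort_order(EXAMPLE))            # coordinate
print(records_in_coordinate_order(EXAMPLE))  # True
```

Reading the whole file as text is only practical for small examples; for real data the header alone (samtools view -H) is usually enough to read the SO tag.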

BxINFO LLC

A research support company specializing in bioinformatics.

We provide tools and information to support life science research, with a focus on RNA-Seq analysis.