SAM & BAM Files: Format, Structure & Usage in RNA-Seq
Introduction
SAM and BAM files store the results of mapping sequencer reads (nucleotide sequences) to a reference sequence. They record where each read mapped and how it aligned to the reference.
Reads are typically stored in FASTQ files, and reference sequences are stored in FASTA files. These serve as input to mapping tools such as HISAT2, STAR, Bowtie2, and BWA, which produce SAM or BAM files as output. SAM is a text-based format, while BAM is a binary format that holds the same information. Because BAM files are much smaller, mapping results are usually stored in BAM format.
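The text/binary distinction can be checked directly. The following is a minimal sketch, assuming only that BAM data is stored in BGZF blocks (which are gzip-compatible) and that the decompressed stream begins with the four magic bytes `BAM\1`. The helper name `looks_like_bam` is hypothetical and not part of any library.

```python
import gzip

# Magic bytes at the start of a decompressed BAM stream.
BAM_MAGIC = b"BAM\x01"


def looks_like_bam(path):
    """Return True if the file decompresses to data that starts with the BAM magic.

    Hypothetical helper for illustration; it does not validate the rest of the file.
    """
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compatible (for example a plain-text SAM file) or truncated.
        return False
```

A plain-text SAM file fails the gzip check and returns `False`, so the function can serve as a quick sanity test before passing a file to tools that expect BAM input.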
SAM / BAM format
Consider the following example of mapping results.
| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | * | * | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reference | A | G | C | A | T | G | T | T | A | G | A | T | A | A | * | * | G | A | T | A | G | C | T | G | T | G | C | T | A | G | T | A | G | G | C | A | G | T | C | A | G | C | G | C | C | A | T |
| +r001/1 |  |  |  |  |  |  | T | T | A | G | A | T | A | A | A | G | G | A | T | A | * | C | T | G |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| +r002 |  |  |  |  |  | a | a | a | A | G | A | T | A | A | * | G | G | A | T | A |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| +r003 |  |  |  | g | c | c | t | a | A | G | C | T | A | A |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| +r004 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | A | T | A | G | C | T | . | . | . | . | . | . | . | . | . | . | . | . | . | . | T | C | A | G | C |  |  |  |  |  |
| -r003 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | t | t | a | g | c | t | T | A | G | G | C |  |  |  |  |  |  |  |  |  |  |  |  |
| -r001/2 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | C | A | G | C | G | G | C | A | T |
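A leading + or - on a read name indicates the strand to which the read aligned. Lowercase bases are clipped bases: they sit at the ends of reads and are not part of the alignment to the reference. In the reference row, * marks padding opposite bases inserted in a read; in a read row, * marks a deleted or padded position, and . marks a stretch of the reference that the read skips. r001/1 and r001/2 are paired-end reads, r003 is a chimeric read, and r004 is a split alignment.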
The corresponding SAM file is shown below.
Example of a SAM file
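For readers working from text, the sketch below holds the records as a Python string and separates header lines from alignment lines. It assumes the example reproduces the one in the SAM format specification, which the alignment table above matches base for base. The records are transcribed from that specification, and the function name `split_sam` is hypothetical.

```python
# Records transcribed from the SAM specification's example (assumed to match
# the alignment table above). Columns are tab-separated.
SAM_TEXT = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:ref\tLN:45",
    "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*",
    "r002\t0\tref\t9\t30\t3S6M1P1I4M\t*\t0\t0\tAAAAGATAAGGATA\t*",
    "r003\t0\tref\t9\t30\t5S6M\t*\t0\t0\tGCCTAAGCTAA\t*\tSA:Z:ref,29,-,6H5M,17,0;",
    "r004\t0\tref\t16\t30\t6M14N5M\t*\t0\t0\tATAGCTTCAGC\t*",
    "r003\t2064\tref\t29\t17\t6H5M\t*\t0\t0\tTAGGC\t*\tSA:Z:ref,9,+,5S6M,30,1;",
    "r001\t147\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1",
])


def split_sam(text):
    """Split SAM text into (header_lines, alignment_lines)."""
    header_lines = []
    alignment_lines = []
    for line in text.splitlines():
        if not line:
            continue  # ignore blank lines
        if line.startswith("@"):
            header_lines.append(line)
        else:
            alignment_lines.append(line)
    return header_lines, alignment_lines


header_lines, alignment_lines = split_sam(SAM_TEXT)
# header_lines holds the 2 lines starting with "@"; alignment_lines holds the 6 records.
```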
Lines starting with @ are header lines.
The remaining lines contain the mapping results. Each line consists of 11 required tab-separated columns, optionally followed by additional columns. The columns are described below.
| Column | Name | Description |
|---|---|---|
| Column 1 | QNAME | Read name |
| Column 2 | FLAG | Bitwise flags describing various properties of the mapping result |
| Column 3 | RNAME | Reference sequence name |
| Column 4 | POS | 1-based leftmost mapping position |
| Column 5 | MAPQ | Mapping quality |
| Column 6 | CIGAR | A string encoding of the alignment |
| Column 7 | RNEXT | Reference name of the mate read |
| Column 8 | PNEXT | Mapping position of the mate read |
| Column 9 | TLEN | Template length (insert size) |
| Column 10 | SEQ | Nucleotide sequence |
| Column 11 | QUAL | Per-base quality scores |
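The column layout above maps naturally onto a small record type. The following is a minimal sketch, not a full parser: the `SamRecord` class and `parse_sam_line` function are hypothetical names, and optional columns are kept as raw strings rather than decoded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SamRecord:
    """The 11 required SAM columns, plus any optional columns as raw strings."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str] = field(default_factory=list)


def parse_sam_line(line):
    """Parse one alignment line (not a header line) into a SamRecord."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(cols)}")
    return SamRecord(
        qname=cols[0],
        flag=int(cols[1]),
        rname=cols[2],
        pos=int(cols[3]),
        mapq=int(cols[4]),
        cigar=cols[5],
        rnext=cols[6],
        pnext=int(cols[7]),
        tlen=int(cols[8]),
        seq=cols[9],
        qual=cols[10],
        tags=cols[11:],
    )


# Example record in the style of the SAM specification.
record = parse_sam_line("r001\t147\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1")
```

Keeping FLAG, POS, MAPQ, PNEXT, and TLEN as integers makes later filtering and sorting straightforward, while CIGAR and the optional tags stay as strings until they are needed.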
For more details on the FLAG and CIGAR fields, see the SAM format specification.
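As a rough illustration of how these two fields are read, the sketch below decodes a FLAG value into the names of its set bits and splits a CIGAR string into (length, operation) pairs. The bit values and operation letters follow the SAM specification; the function names and the short labels given to each bit are illustrative choices, not an official API.

```python
import re

# FLAG bits as defined in the SAM specification; the labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")


def decode_flag(flag):
    """Return the labels of all bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs; '*' means unavailable."""
    if cigar == "*":
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered: operations M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
```

For example, `decode_flag(99)` yields `['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']`, and `reference_span("8M2I4M1D3M")` is 16, so a read with that CIGAR at POS 7 covers reference positions 7 through 22.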
What is a Sorted BAM?
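BAM files produced directly by mapping software are typically ordered by the sequence in which reads were processed. A sorted BAM is a BAM file whose records have been rearranged by genomic coordinate, that is, by reference sequence and then by mapping position. Many downstream steps, such as indexing the file or viewing alignments in a genome browser, require coordinate-sorted input, so sorting is usually the first step after mapping.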
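To make the ordering concrete, here is a minimal sketch of coordinate sorting on in-memory records. In practice a dedicated tool such as `samtools sort` does this on the BAM file itself; the function below only illustrates the sort key. It assumes each record is a `(rname, pos, qname)` tuple and that `reference_order` lists reference names in header order, both of which are simplifications introduced for the example.

```python
def coordinate_sort(records, reference_order):
    """Sort (rname, pos, qname) tuples by reference order, then by position.

    References are ranked by their order in `reference_order` (as in the @SQ
    header lines), not alphabetically. Records whose reference is not listed,
    such as unmapped reads with rname "*", are placed last.
    """
    rank = {name: index for index, name in enumerate(reference_order)}
    last = len(rank)
    return sorted(records, key=lambda rec: (rank.get(rec[0], last), rec[1]))


# Hypothetical records in the order a mapper might emit them.
unsorted_records = [
    ("chr2", 5, "readA"),
    ("chr1", 100, "readB"),
    ("*", 0, "readC"),
    ("chr1", 7, "readD"),
]
sorted_records = coordinate_sort(unsorted_records, ["chr1", "chr2"])
```

Because Python's `sorted` is stable, records with the same reference and position keep their original relative order.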
You can check whether a BAM file has been sorted by examining the SO tag on the @HD line of the header. A coordinate-sorted file has SO:coordinate. In practice, files are often named with a .sorted.bam suffix to make the sort status immediately obvious.
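The header check can be sketched in a few lines. This example assumes the header has already been obtained as text lines (for instance from the output of `samtools view -H`); the function name `sort_order` is hypothetical.

```python
def sort_order(header_lines):
    """Return the SO value from the @HD header line, or None if it is absent."""
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@HD":
            continue
        for tag in fields[1:]:
            if tag.startswith("SO:"):
                return tag[3:]
    return None


def is_coordinate_sorted(header_lines):
    """True only when the header explicitly declares SO:coordinate."""
    return sort_order(header_lines) == "coordinate"


example_header = ["@HD\tVN:1.6\tSO:coordinate", "@SQ\tSN:ref\tLN:45"]
```

Note that the tag only records what the producing tool declared. A file name ending in .sorted.bam is a convention, so the header is the more reliable of the two signals.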
About the Author
BxINFO LLC
A research support company specializing in bioinformatics.
We provide tools and information to support life science research, with a focus on RNA-Seq analysis.