What is TPM? Understanding Normalization Methods for Gene Expression Levels in RNA-Seq Analysis

Introduction

Raw read counts obtained from RNA-Seq analysis cannot be compared directly between genes or between samples.

There are two reasons. First, at the same expression level, a longer gene has more reads mapped to it. Second, the more reads a sample yields in total from sequencing, the more reads are mapped to each gene in that sample.

Various normalization methods have therefore been proposed; this page focuses on TPM. FPKM/RPKM were widely used in the past, but they have been criticized as a measure of expression level because the values in a sample do not sum to a fixed total, which makes comparison between samples difficult. TPM is now increasingly used instead.

Definition of TPM

TPM stands for 'transcripts per million' and was proposed as an alternative normalization method to FPKM/RPKM.

Like FPKM/RPKM, TPM scales by a total of one million and by a transcript length of 1,000 bases. The order of normalization differs, however: TPM first divides each read count by the transcript length, then rescales the length-normalized values so that they sum to one million.

The formula is as follows (where \(q_i\) is the number of reads mapped to transcript \(i\), and \(l_i\) is the length of that transcript in bases):

\(A_i = \frac{q_i}{l_i} \times 10^3\)
\(TPM_i = \frac{A_i}{\sum_j A_j} \times 10^6\)
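
The two steps can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of any particular tool; the function name `tpm` and the example counts and lengths are made up for this page.

```python
def tpm(counts, lengths):
    """Compute TPM from mapped read counts and transcript lengths (in bases).

    counts  -- q_i, number of reads mapped to each transcript
    lengths -- l_i, length of each transcript in bases
    """
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same number of entries")

    # Step 1: adjust for length (A_i = q_i / l_i * 10^3, reads per kilobase).
    per_kilobase = [q / l * 1e3 for q, l in zip(counts, lengths)]

    # Step 2: rescale so that the values in the sample sum to one million.
    total = sum(per_kilobase)
    if total == 0:
        # No reads at all: return zeros rather than dividing by zero.
        return [0.0 for _ in per_kilobase]
    return [a / total * 1e6 for a in per_kilobase]


# Hypothetical sample with three transcripts.
counts = [100, 200, 300]
lengths = [1000, 2000, 1000]
values = tpm(counts, lengths)
```

In this made-up sample the first two transcripts have the same number of reads per kilobase (100), so they receive the same TPM even though the second has twice as many reads. The three TPM values sum to one million, which holds for any sample with at least one mapped read.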

Using FPKM, it can also be expressed as follows:

\(TPM_i = \frac{FPKM_i}{\sum_j FPKM_j} \times 10^6\)
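
To see why this expression agrees with the definition of TPM, the usual definition of FPKM can be substituted into it. The symbol \(N\), the total number of mapped reads (or fragments) in the sample, is introduced here only for this illustration:

\(FPKM_i = \frac{q_i}{l_i \times N} \times 10^9\)
\(\frac{FPKM_i}{\sum_j FPKM_j} \times 10^6 = \frac{q_i / l_i}{\sum_j q_j / l_j} \times 10^6 = \frac{A_i}{\sum_j A_j} \times 10^6 = TPM_i\)

The factor \(10^9 / N\) is the same for every transcript in a sample, so it cancels between the numerator and the denominator, leaving only the length-normalized counts.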

Effective length

The calculation method for TPM varies slightly depending on the software. For example, some software may use the effective length instead of the actual transcript length for \(l_i\).

The effective length can be calculated as follows:

\(\tilde{l}_i = l_i - \mu_{FLD} + 1\)

Here \(\mu_{FLD}\) is the mean of the fragment length distribution (FLD), that is, the average fragment length.

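Using the effective length in the TPM calculation is generally considered to correct for transcript length more appropriately, because a fragment of average length can originate from only \(l_i - \mu_{FLD} + 1\) positions on a transcript of length \(l_i\).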
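
As a rough sketch of how the effective length changes the calculation, the Python below applies the formula above and then computes TPM from the adjusted lengths. The function names, the example data, and the choice to raise an error for transcripts with a non-positive effective length are assumptions made for this illustration; actual software handles such short transcripts in its own way.

```python
def effective_length(length, mean_fragment_length):
    """Effective length as defined above: l_i - mu_FLD + 1."""
    return length - mean_fragment_length + 1


def tpm_with_effective_length(counts, lengths, mean_fragment_length):
    """Compute TPM using effective lengths instead of full transcript lengths."""
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same number of entries")

    eff = [effective_length(l, mean_fragment_length) for l in lengths]

    # Assumption of this sketch: transcripts no longer than the mean fragment
    # length (effective length <= 0) are rejected instead of being adjusted.
    if any(e <= 0 for e in eff):
        raise ValueError("a transcript has a non-positive effective length")

    # Same two steps as ordinary TPM, with the effective length as l_i.
    rates = [q / e for q, e in zip(counts, eff)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [r / total * 1e6 for r in rates]


# Hypothetical sample: three transcripts, mean fragment length of 200 bases.
counts = [100, 200, 300]
lengths = [1000, 2000, 1000]
eff_values = tpm_with_effective_length(counts, lengths, mean_fragment_length=200)
```

Subtracting the same mean fragment length from every transcript shortens short transcripts proportionally more than long ones, so in this example the two 1,000-base transcripts gain TPM relative to the 2,000-base transcript compared with a calculation based on the full lengths.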

RNA-Seq Data Analysis Software

If you don't have time to learn the analysis methods, or lack the high-spec computer that the analysis requires, please consider using our RNA-Seq data analysis software.

Overview

Starting from either raw RNA-Seq data (FASTQ files or public data) or expression tables (CSV/TSV files), users can quantify gene expression, identify differentially expressed genes, run Gene Ontology (GO) and pathway analyses, and draw volcano plots, MA plots, and heatmaps.