What is FPKM/RPKM? Understanding Normalization Methods for Gene Expression Levels in RNA-Seq Analysis

Introduction

The raw count obtained from RNA-Seq analysis cannot be directly compared between genes or samples.

There are two reasons. First, the longer a gene is, the more reads map to it. Second, the more reads a sequencing run produces in total, the more reads map to each gene in that sample.

Various normalization methods have been proposed to correct for these effects; this page explains FPKM/RPKM. Note that FPKM/RPKM has recently been criticized for not adequately representing gene expression levels, and TPM is increasingly used instead.

Definition of FPKM/RPKM

FPKM stands for 'Fragments Per Kilobase of exon per Million mapped reads', and RPKM stands for 'Reads Per Kilobase of exon per Million mapped reads'. As the names suggest, they scale the count to a total of one million mapped reads and to a transcript length of 1,000 bases. FPKM and RPKM share the same formula and differ only in what is counted: RPKM counts reads, whereas FPKM counts fragments, so in paired-end data the two reads of a pair are counted once.

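The formula is as follows (where \(q_i\) represents the number of reads or fragments mapped to transcript \(i\), \(l_i\) represents the length of that transcript in bases, and \(\sum_j q_j\) is the total number of mapped reads or fragments in the sample):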

\(\mathrm{FPKM}_i = \frac{q_i}{\frac{l_i}{10^3} \times \frac{\sum_j q_j}{10^6}} = \frac{q_i}{l_i \times \sum_j q_j} \times 10^9\)
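
As a rough illustration of the formula above, here is a minimal Python sketch. The function name `fpkm` and the toy counts and lengths are made up for this example and do not come from any particular tool; real software may differ in details such as which reads are included in the total.

```python
def fpkm(counts, lengths):
    """Return FPKM/RPKM values for a list of transcripts.

    counts  -- reads (or fragments) mapped to each transcript, q_i
    lengths -- transcript lengths in bases, l_i
    """
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same number of entries")
    total = sum(counts)  # sum_j q_j: all mapped reads in the sample
    if total == 0:
        raise ValueError("the sample has no mapped reads")
    # q_i / (l_i * sum_j q_j) * 10^9, multiplying first to keep the arithmetic simple
    return [q * 1e9 / (l * total) for q, l in zip(counts, lengths)]


# Hypothetical toy sample with three transcripts (values invented for illustration)
counts = [500, 1000, 8500]
lengths = [1000, 2000, 4000]
values = fpkm(counts, lengths)

for q, l, v in zip(counts, lengths, values):
    print(f"reads={q:>5}  length={l:>5}  FPKM={v:.1f}")
```

In this toy sample the second transcript has twice as many reads as the first but is also twice as long, so the two end up with the same FPKM, which is exactly the length correction the formula is meant to provide.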

Effective length

The calculation method for FPKM/RPKM varies slightly depending on the software. For example, some software may use the effective length instead of the actual transcript length for \(l_i\).

The effective length can be calculated as follows:

\(\tilde{l}_i = l_i - \mu_{FLD} + 1\)

Here \(\mu_{FLD}\) is the mean of the fragment length distribution (FLD), that is, the average fragment length.

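Using the effective length in the calculation of FPKM/RPKM is generally considered to correct for length more appropriately. The reasoning is that a fragment of average length \(\mu_{FLD}\) can start at only \(l_i - \mu_{FLD} + 1\) positions on a transcript of length \(l_i\), so this is the number of positions from which a fragment could actually have been sampled.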
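
The effective-length variant can be sketched in Python as follows. This is only an illustration under simple assumptions: the function names, the toy numbers, and the choice to raise an error when a transcript is shorter than the average fragment length are all decisions made for this example, and actual tools handle such short transcripts in their own ways.

```python
def effective_length(length, mean_fragment_length):
    """Effective length: l_i - mu_FLD + 1."""
    return length - mean_fragment_length + 1


def fpkm_effective(counts, lengths, mean_fragment_length):
    """FPKM computed with effective lengths in place of l_i."""
    eff = [effective_length(l, mean_fragment_length) for l in lengths]
    # Assumption of this sketch: transcripts too short to hold an
    # average fragment are rejected rather than silently adjusted.
    if any(e <= 0 for e in eff):
        raise ValueError("a transcript is shorter than the mean fragment length")
    total = sum(counts)  # sum_j q_j
    if total == 0:
        raise ValueError("the sample has no mapped reads")
    return [q * 1e9 / (e * total) for q, e in zip(counts, eff)]


# Hypothetical toy data (values invented for illustration)
lengths = [1000, 2000]
mean_fragment_length = 200
eff_lengths = [effective_length(l, mean_fragment_length) for l in lengths]
counts = [801, 1801]  # chosen to be proportional to the effective lengths
eff_values = fpkm_effective(counts, lengths, mean_fragment_length)

print(eff_lengths)
print(eff_values)
```

Because the toy counts are proportional to the effective lengths rather than to the full lengths, the two transcripts receive the same value here, whereas dividing by the full lengths would give the shorter transcript a slightly lower value.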

RNA-Seq Data Analysis Software

If you don't have the time to study analysis methods, or lack the high-spec computer that the analysis requires, please consider using our RNA-Seq data analysis software.

Overview

Starting with either raw RNA-Seq data (FASTQ files/public data) or expression tables (CSV/TSV files), users can perform gene expression quantification, identification of differentially expressed genes, Gene Ontology (GO) analysis, and pathway analysis, and can draw volcano plots, MA plots, and heatmaps.