edgeR: A Tutorial for Differential Expression Analysis in RNA-Seq Data

What is edgeR?

RNA-Seq analysis using next-generation sequencing allows for the measurement of gene expression levels for each gene. By comparing these quantitative results of gene expression across multiple samples, differentially expressed genes can be identified through comparisons between sample groups.

This page provides a tutorial on how to use and install edgeR, a software for identifying differentially expressed genes.

If you find the following procedures difficult, we also offer a web-based software that allows you to easily identify differentially expressed genes.

Installing DESeq2

First, if R is not already installed, install R. (The following is an example of installation using Homebrew.)

$ brew install r

Launch R and execute the following to install BiocManager and edgeR.

> if (!requireNamespace("BiocManager", quietly=TRUE)) > install.packages("BiocManager") > BiocManager::install("edgeR")

Execute the following, and if no errors are displayed, the installation is successful.

> library(edgeR)

Preparing Data

Using software such as featureCounts, StringTie, and RSEM, obtain quantitative results of gene expression levels.

Ultimately, the data was organized into a comma-separated file (CSV file) as shown below. Please note that the file input into DESeq2 should be raw read counts, not normalized data such as FPKM/RPKM or TPM.

edgeRのデータの準備

How to Use edgeR

> counts <- read.csv("counts.csv", sep=",", row.names=1) > group <- factor(c("A", "A", "A", "A", "B", "B", "B", "B")) > y <- DGEList(counts=counts, group=group)

The count data is loaded and combined with the group information to create a DGEList object. In this analysis, the samples were divided into two groups for comparison: samples 1 to 4 as Group A and samples 5 to 8 as Group B.

> nrow(y) [1] 35627 > keep <- filterByExpr(y) > y <- y[keep, , keep.lib.sizes=FALSE] > nrow(y) [1] 14698

The genes with low expression were filtered using filterByExpr. In this example, the number of genes was reduced from 35,627 to 14,698.

> y <- calcNormFactors(y) > y$samples group lib.size norm.factors sample1 A 20190676 0.5252614 sample2 A 17815578 0.7264956 sample3 A 16858297 0.8385346 sample4 A 17080450 0.5972758 sample5 B 18305317 1.4576576 sample6 B 23123425 1.5620988 sample7 B 23260262 1.6452239 sample8 B 19145522 1.3967109

TMM normalization was performed using calcNormFactors to correct for biases between samples.

Finally, quasi-likelihood F-tests are performed.

> design <- model.matrix(~group) > y <- estimateDisp(y, design) > fit <- glmQLFit(y, design) > qlf <- glmQLFTest(fit, coef=2) > topTags(qlf) Coefficient: groupB logFC logCPM F PValue FDR ENSMUSG00000038738 -3.934239 4.352090 191.17683 4.406800e-08 0.0006477115 ENSMUSG00000021250 2.543736 6.859874 111.07055 6.409002e-07 0.0027772690 ENSMUSG00000032487 3.859926 1.891284 103.08209 9.187893e-07 0.0027772690 ENSMUSG00000070495 -3.377550 3.217322 99.99038 1.063542e-06 0.0027772690 ENSMUSG00000064356 -3.325373 11.863375 96.88620 1.237031e-06 0.0027772690 ENSMUSG00000033453 -1.469414 5.458883 95.42623 1.330180e-06 0.0027772690 ENSMUSG00000100862 -15.490014 7.031386 169.49446 1.993728e-06 0.0027772690 ENSMUSG00000096887 -3.332279 8.621508 85.88952 2.194494e-06 0.0027772690 ENSMUSG00000054942 -1.453327 5.837918 83.01079 2.577827e-06 0.0027772690 ENSMUSG00000004842 -5.072918 2.248245 82.51494 2.651651e-06 0.0027772690

Differentially expressed genes were identified. The explanation of logFC can be found here. In logCPM, CPM stands for Counts Per Million.

The rows with FDR < 0.05 can be extracted as follows.

> result <- as.data.frame(topTags(qlf, n=nrow(y))) > result[result$FDR<0.05,] logFC logCPM F PValue FDR ENSMUSG00000038738 -3.9342393 4.35208965 191.17683 4.406800e-08 0.0006477115 ENSMUSG00000021250 2.5437356 6.85987389 111.07055 6.409002e-07 0.0027772690 ENSMUSG00000032487 3.8599263 1.89128388 103.08209 9.187893e-07 0.0027772690 ENSMUSG00000070495 -3.3775501 3.21732235 99.99038 1.063542e-06 0.0027772690 ENSMUSG00000064356 -3.3253735 11.86337535 96.88620 1.237031e-06 0.0027772690 ENSMUSG00000033453 -1.4694139 5.45888339 95.42623 1.330180e-06 0.0027772690 ENSMUSG00000100862 -15.4900144 7.03138596 169.49446 1.993728e-06 0.0027772690 ENSMUSG00000096887 -3.3322793 8.62150780 85.88952 2.194494e-06 0.0027772690 ENSMUSG00000054942 -1.4533272 5.83791761 83.01079 2.577827e-06 0.0027772690 ENSMUSG00000004842 -5.0729179 2.24824466 82.51494 2.651651e-06 0.0027772690 ...

FDR stands for False Discovery Rate. When extracting with FDR < 0.05, it means that among the extracted genes, the proportion of genes that are not actually differentially expressed (false positives) is 5%.

What is TMM Normalization?

TMM normalization is a method for correcting gene expression levels in RNA-Seq analysis, implemented in the edgeR software.

In RNA-Seq, what can be measured is not the absolute expression level, but the relative expression level. Therefore, when a few genes are highly expressed, it can appear that the expression levels of other genes are relatively reduced. TMM normalization addresses this by adjusting to minimize differences in expression levels between samples. Using this method, it is possible to make appropriate adjustments for data in which the expression of most genes does not vary between samples.

It should be noted that TMM normalization does not adjust for common factors between samples. For instance, gene length is said to correlate with read counts, with longer genes having higher read counts, but TMM normalization does not correct for this. In edgeR, the focus is on identifying differentially expressed genes, so adjustments between genes are not necessary. Therefore, TMM normalization is sufficient.

On the other hand, normalization methods such as FPKM/RPKM and TPM also account for gene length to enable comparisons of expression levels between genes.

RNA-Seq Data Analysis Software

With our RNA-Seq data analysis software, you won't need to outsource or rely on collaborators. You can start analyzing the data yourself right away, without the need for high-spec computers or knowledge of Linux commands.

概要

Users can perform gene expression quantification, identification of differentially expressed genes, gene ontology(GO) analysis, pathway analysis, as well as drawing volcano plots, MA plots, and heatmaps.