Step-by-Step Guide to RNA-Seq Data Analysis

Introduction

RNA-Seq analysis with a next-generation sequencer produces raw data in the form of FASTQ files. Starting from this raw data, several processing steps are required to quantify gene expression, identify differentially expressed genes (DEGs), and perform functional analysis of the DEGs (such as gene ontology analysis and pathway analysis).

This page provides a step-by-step guide for analyzing RNA-Seq data from raw data for species with well-annotated genomes, such as humans, mice, and rats.

Overview of Data Analysis Process

Figure: Overview of the data analysis process

List of Steps

Steps | Overview
Data Preparation | Prepare the FASTQ files for analysis.
Data Preprocessing | Trim adapter sequences from FASTQ files and filter out low-quality reads.
Mapping | Find the original location of each read in the reference genome.
Read Counting Based on Mapping Results | Count the number of reads mapped to each gene, based on the mapping results.
Read Counting by Pseudo-Alignment | Count the number of reads assigned to each gene directly from the FASTQ files, without prior mapping.
Identification of Differentially Expressed Genes (DEGs) | Identify genes whose expression levels change between conditions or groups.
Functional Analysis (Gene Ontology Analysis and Pathway Analysis) | Clarify the functions that the genes are involved in.

Data Preparation

First, prepare the FASTQ files to be used for data analysis. FASTQ files are obtained by sequencing with a next-generation sequencer; you may either conduct the sequencing yourself or outsource it. FASTQ files can also be downloaded from public databases such as the Sequence Read Archive (SRA). In that case, a tool such as fasterq-dump from the SRA Toolkit is recommended.
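To make the FASTQ format concrete, the following is a minimal Python sketch that reads FASTQ records and counts reads and bases. It assumes an uncompressed file in which every record occupies exactly four lines; the function names and the two-read example are hypothetical and serve only as illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def summarize(handle: TextIO) -> Tuple[int, int]:
    """Return (number of reads, number of bases) in the stream."""
    n_reads = 0
    n_bases = 0
    for _, sequence, _ in read_fastq(handle):
        n_reads += 1
        n_bases += len(sequence)
    return n_reads, n_bases


# Hypothetical two-read example standing in for a real FASTQ file.
EXAMPLE_FASTQ = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nGGCCA\n+\nIIII#\n"
print(summarize(io.StringIO(EXAMPLE_FASTQ)))
```

For a real file, the same functions could be applied to a handle returned by open(); gzip-compressed files would need gzip.open in text mode instead.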

Data Preprocessing

Raw data output from next-generation sequencers may contain reads with adapter sequences and low-quality reads. Therefore, it's often beneficial to preprocess the FASTQ files by trimming adapter sequences and filtering out low-quality reads.

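Commonly used software for preprocessing FASTQ files includes Trimmomatic, Cutadapt, and fastp. FastQC does not modify reads, but it is widely used alongside these tools to check read quality before and after preprocessing.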

An example of the results after preprocessing with fastp is provided below. The data initially comprised 19,786,002 reads and 2,967,900,000 bases; after preprocessing, 19,722,456 reads and 2,941,557,000 bases remained (about 99.7% of the reads and 99.1% of the bases).

Figure: fastp summary
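The two preprocessing operations described in this section, adapter trimming and quality filtering, can be sketched in Python as follows. This is a simplified illustration, not how fastp or the other tools are implemented: it assumes Phred+33 quality encoding, matches the adapter only exactly, and uses hypothetical thresholds. Real tools also tolerate mismatches and partial adapter matches at the read end.

```python
from typing import List, Optional, Tuple

# Start of a widely used Illumina adapter sequence, used here only as an example.
ADAPTER = "AGATCGGAAGAGC"


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to Phred scores (Phred+33 encoding assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_adapter(sequence: str, quality: str, adapter: str) -> Tuple[str, str]:
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = sequence.find(adapter)
    if pos == -1:
        return sequence, quality
    return sequence[:pos], quality[:pos]


def preprocess(
    sequence: str,
    quality: str,
    adapter: str = ADAPTER,
    min_mean_quality: float = 20.0,  # hypothetical threshold
    min_length: int = 5,             # hypothetical threshold
) -> Optional[Tuple[str, str]]:
    """Return the trimmed read, or None if it is too short or of low quality."""
    sequence, quality = trim_adapter(sequence, quality, adapter)
    if len(sequence) < min_length:
        return None
    scores = phred_scores(quality)
    if sum(scores) / len(scores) < min_mean_quality:
        return None
    return sequence, quality


# A read with adapter contamination and high quality ('I' is Phred 40).
print(preprocess("ACGTACGT" + ADAPTER + "TT", "I" * 23))
# A read with uniformly low quality ('#' is Phred 2) is discarded.
print(preprocess("ACGTACGTAC", "#" * 10))
```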

Mapping

Next, the reads obtained from the next-generation sequencer are mapped to a reference genome. Mapping means finding the location in the reference genome from which each read originated. The results are written in SAM format or, more commonly, in its compressed binary equivalent, BAM. In other words, mapping takes FASTQ files and a reference genome as inputs and produces a BAM file.

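Commonly used software for mapping includes HISAT2, STAR, and Bowtie2. HISAT2 and STAR are splice-aware and can align reads that span exon junctions to the genome, whereas Bowtie2 is not splice-aware and is typically used with a transcriptome reference.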

An example of the visualization of the mapping results is provided below.

Figure: An example of the visualization of the mapping results
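The idea of mapping, finding where a read came from in the reference, can be illustrated with a toy Python sketch. It builds an index of all k-mers in the reference, looks up the first k-mer of the read as a seed, and keeps the positions where the whole read matches exactly. This is only a conceptual sketch under strong assumptions (exact matches, forward strand only, no splicing); HISAT2, STAR, and Bowtie2 use far more sophisticated index structures and scoring. The reference sequence and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the 0-based positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]], k: int) -> List[int]:
    """Return 0-based reference positions where the read matches exactly."""
    if len(read) < k:
        return []
    hits = []
    # Use the first k-mer of the read as a seed, then verify the full read.
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Hypothetical toy reference and reads.
REFERENCE = "ACGTTAGCCGATAGGCTTACG"
K = 4
INDEX = build_index(REFERENCE, K)

print(map_read("GCCGATAG", REFERENCE, INDEX, K))  # read taken from the reference
print(map_read("TTTTTTTT", REFERENCE, INDEX, K))  # read absent from the reference
```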

Read Counting Based on Mapping Results

The next step is to count the number of reads mapped to each gene, based on the mapping results. This step uses the annotation file for the reference genome (such as a GTF or GFF3 file) together with the mapping results (BAM file).

Commonly used software for counting reads based on mapping results includes featureCounts, htseq-count, RSEM, and StringTie. If gene-level counts are all that is needed, featureCounts or htseq-count is sufficient. When transcript-level quantification is required, tools such as RSEM or StringTie are used.

An example of the read counting results is provided below.

Figure: An example of the read counting results
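Gene-level read counting can be sketched in Python as an interval-overlap problem: each aligned read is assigned to a gene if it overlaps exactly one gene. The sketch below is a simplification under stated assumptions: genes are single intervals rather than sets of exons, strand is ignored, and coordinates are 0-based and half-open. The gene names, coordinates, and the labels no_feature and ambiguous are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 0-based, half-open


def count_reads(genes: Dict[str, Interval], alignments: List[Interval]) -> Counter:
    """Count alignments per gene; reads hitting zero or several genes are set aside."""
    counts: Counter = Counter({gene: 0 for gene in genes})
    for chrom, start, end in alignments:
        overlapping = [
            gene
            for gene, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and g_start < end
        ]
        if len(overlapping) == 1:
            counts[overlapping[0]] += 1
        elif not overlapping:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


# Hypothetical annotation (as it might be derived from a GTF file) and alignments.
GENES = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 180, 300),
    "geneC": ("chr2", 50, 150),
}
ALIGNMENTS = [
    ("chr1", 110, 160),  # inside geneA only
    ("chr1", 150, 190),  # overlaps geneA and geneB
    ("chr1", 250, 300),  # inside geneB only
    ("chr2", 60, 110),   # inside geneC
    ("chr2", 400, 450),  # outside every gene
    ("chr1", 120, 170),  # inside geneA only
]

for name, value in sorted(count_reads(GENES, ALIGNMENTS).items()):
    print(name, value)
```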

Read Counting by Pseudo-Alignment

In addition to the traditional method of read counting based on mapping results, reads can also be counted by a process known as pseudoalignment. This approach does not require a prior mapping step. Instead, it generates read counts directly from the FASTQ files and the reference transcriptome.

Software commonly used for pseudoalignment includes Salmon and kallisto.

The read counts obtained by pseudoalignment are generally comparable to those obtained from mapping-based read counting.

Figure: An example of the read counting results
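The core idea behind pseudoalignment can be sketched in Python: instead of locating a read base by base, the method determines which transcripts are compatible with the k-mers the read contains. The sketch below is a heavily simplified illustration and is not how Salmon or kallisto are implemented; in particular, real tools resolve reads compatible with several transcripts using statistical models, which is omitted here. The transcript sequences and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def build_kmer_index(transcripts: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map every k-mer to the set of transcripts that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, sequence in transcripts.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index


def pseudoalign(read: str, index: Dict[str, Set[str]], k: int) -> Set[str]:
    """Return the transcripts that contain every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # no transcript is compatible with all k-mers
    return compatible or set()


# Two hypothetical transcripts that share their first nine bases.
TRANSCRIPTS = {
    "tx1": "ACGTACGGTCAAGT",
    "tx2": "ACGTACGGTTTTGC",
}
K = 5
INDEX = build_kmer_index(TRANSCRIPTS, K)

print(sorted(pseudoalign("ACGTACGG", INDEX, K)))  # from the shared region
print(sorted(pseudoalign("GGTCAAG", INDEX, K)))   # from the region unique to tx1
print(sorted(pseudoalign("TTTTTTT", INDEX, K)))   # not present in any transcript
```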

Identification of Differentially Expressed Genes

The process of identifying differentially expressed genes (DEGs) involves finding genes that exhibit significantly increased or decreased expression levels under different conditions or between groups. DEGs can be identified based on read counting results.

Software commonly used for identifying DEGs includes Ballgown, edgeR, and DESeq2.

An example of the results from identifying differentially expressed genes (DEGs) is provided below.

Figure: An example of the results from identifying differentially expressed genes (DEGs)
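As a rough illustration of what comparing expression levels between groups means, the following Python sketch normalizes read counts to counts per million (CPM) and computes a log2 fold change per gene. It is deliberately minimal and is not a substitute for edgeR or DESeq2, which model count variability statistically and report p-values; no significance test is performed here. The sample counts, the pseudocount of 1, and the function names are hypothetical.

```python
import math
from typing import Dict, List

Counts = Dict[str, int]


def cpm(counts: Counts) -> Dict[str, float]:
    """Scale raw counts to counts per million to adjust for sequencing depth."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def mean_cpm(samples: List[Counts]) -> Dict[str, float]:
    """Average the CPM of each gene across the samples of one group."""
    normalized = [cpm(sample) for sample in samples]
    return {
        gene: sum(sample[gene] for sample in normalized) / len(normalized)
        for gene in normalized[0]
    }


def log2_fold_change(
    control: List[Counts], treated: List[Counts], pseudocount: float = 1.0
) -> Dict[str, float]:
    """Return log2(treated / control) per gene; the pseudocount avoids division by zero."""
    ctrl = mean_cpm(control)
    trt = mean_cpm(treated)
    return {
        gene: math.log2((trt[gene] + pseudocount) / (ctrl[gene] + pseudocount))
        for gene in ctrl
    }


# Hypothetical read counts for one control and one treated sample.
CONTROL = [{"geneA": 100, "geneB": 900}]
TREATED = [{"geneA": 400, "geneB": 600}]

for gene, lfc in log2_fold_change(CONTROL, TREATED).items():
    print(gene, round(lfc, 2))
```

A positive value indicates higher expression in the treated group, and a negative value indicates lower expression.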

Functional Analysis (Gene Ontology Analysis and Pathway Analysis)

This step investigates which functions the differentially expressed genes (DEGs) are mainly involved in. Gene functions are commonly described using Gene Ontology (GO) terms or pathways.

Software commonly used for functional analysis includes clusterProfiler, topGO, and GOseq.

An example of the results from a Gene Ontology (GO) analysis is provided below.

Figure: An example of the results from a Gene Ontology (GO) analysis
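One common way to ask whether DEGs are concentrated in a particular GO term or pathway is an over-representation test based on the hypergeometric distribution. The Python sketch below shows that calculation under stated assumptions: it tests a single term, applies no multiple-testing correction, and uses hypothetical gene sets and function names. Dedicated tools offer additional methods and corrections, so this is only an illustration of the principle.

```python
from math import comb
from typing import Set


def hypergeom_pvalue(n_universe: int, n_term: int, n_degs: int, n_overlap: int) -> float:
    """Probability of seeing at least n_overlap term genes among n_degs genes
    drawn at random from a universe that contains n_term term genes."""
    upper = min(n_term, n_degs)
    total = comb(n_universe, n_degs)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def enrichment_pvalue(universe: Set[str], degs: Set[str], term_genes: Set[str]) -> float:
    """Over-representation p-value of one term among the DEGs."""
    degs = degs & universe
    term_genes = term_genes & universe
    return hypergeom_pvalue(
        len(universe), len(term_genes), len(degs), len(degs & term_genes)
    )


# Hypothetical example: 10 genes in total, 4 annotated to the term, 3 DEGs.
UNIVERSE = {f"gene{i}" for i in range(10)}
TERM_GENES = {"gene0", "gene1", "gene2", "gene3"}
DEGS = {"gene0", "gene1", "gene2"}

print(enrichment_pvalue(UNIVERSE, DEGS, TERM_GENES))
```

In this example all three DEGs belong to the term, so the p-value equals 4/120, the chance of drawing three term genes at random.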

Related Pages

RNA-Seq Data Analysis Software

If you do not have time to study analysis methods, or lack the high-spec computer needed for the analysis, please consider using our RNA-Seq data analysis software.

Overview

Starting from either raw RNA-Seq data (FASTQ files or public data) or expression tables (CSV/TSV files), users can quantify gene expression, identify differentially expressed genes, perform gene ontology (GO) analysis and pathway analysis, and draw volcano plots, MA plots, and heatmaps.