RNA-Seq Data Analysis Workflow: A Step-by-Step Guide

Last updated: March 13, 2026

Introduction

Performing RNA-Seq with a next-generation sequencer produces raw data in the form of FASTQ files. Going from this raw data to gene expression quantification, differentially expressed gene (DEG) identification, and functional enrichment analysis (such as GO analysis and pathway analysis) requires a series of data processing steps.

This page walks through the RNA-Seq data analysis workflow from raw FASTQ files, targeting well-annotated species such as human, mouse, and rat.

Overview of the Analysis Workflow

Steps at a Glance

Step: Overview
Data Preparation: Obtain the FASTQ files needed for analysis.
Data Preprocessing: Trim adapter sequences from FASTQ files and filter out low-quality reads.
Mapping: Align reads to their corresponding positions in the reference genome.
Read Counting Based on Mapping Results: Count reads mapped to each gene using the mapping results.
Read Counting by Pseudo-Alignment: Count reads per gene directly from FASTQ files without a separate mapping step.
Identification of Differentially Expressed Genes (DEGs): Find genes whose expression has significantly changed between conditions or groups.
Functional Enrichment Analysis (Gene Ontology Analysis and Pathway Analysis): Determine which biological functions the identified genes are involved in.

Data Preparation

The first step is to obtain the FASTQ files for your analysis. FASTQ files are generated by sequencing on a next-generation sequencer; you can run the sequencing yourself or use a sequencing service provider. You can also download publicly available FASTQ files from databases such as the Sequence Read Archive (SRA). For downloading from SRA, a tool such as fasterq-dump (part of the SRA Toolkit) is recommended.
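As a rough illustration of the download step, the sketch below assembles a fasterq-dump command line in Python. The accession "SRR0000000" is a placeholder rather than a real run, and the function name, default output directory, and thread count are assumptions made for this example. Only the --outdir and --threads options of fasterq-dump are used.

```python
def build_fasterq_dump_cmd(accession, outdir="fastq", threads=4):
    """Build the argument list for downloading one SRA run as FASTQ."""
    return [
        "fasterq-dump",
        accession,
        "--outdir", outdir,          # directory where FASTQ files are written
        "--threads", str(threads),   # number of worker threads
    ]


# "SRR0000000" is a hypothetical placeholder accession.
cmd = build_fasterq_dump_cmd("SRR0000000", outdir="raw_fastq", threads=8)
print(" ".join(cmd))

# To actually run the download (requires the SRA Toolkit to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when it is later passed to subprocess.run.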

Data Preprocessing

Raw sequencer output often contains adapter sequences and low-quality reads. It is therefore common practice to preprocess the FASTQ files by trimming adapters and filtering out low-quality reads.

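Popular preprocessing tools include Trimmomatic, Cutadapt, and fastp. FastQC is often used alongside them to check read quality before and after preprocessing, but it only reports on quality and does not trim or filter reads itself.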
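To make the two operations concrete, here is a minimal Python sketch of adapter trimming followed by quality filtering. It is not how fastp or the other tools are implemented: it assumes exact adapter matches, Phred+33 quality encoding, and arbitrary example thresholds (mean quality of 20, minimum length of 15). The function names and the toy reads are made up for illustration.

```python
def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter sequence."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def preprocess(records, adapter, min_mean_q=20, min_len=15):
    """Trim adapters, then drop reads that are too short or too low in quality."""
    kept = []
    for name, seq, qual in records:
        seq, qual = trim_adapter(seq, qual, adapter)
        if len(seq) >= min_len and mean_phred(qual) >= min_mean_q:
            kept.append((name, seq, qual))
    return kept


ADAPTER = "AGATCGGAAGAGC"  # start of a common Illumina adapter, used as an example
insert = "ACGTACGTACGTACGTACGT"
records = [
    ("read1", insert + ADAPTER, "I" * 33),   # good read with adapter at the 3' end
    ("read2", insert, "#" * 20),             # low-quality read
    ("read3", "ACGT" + ADAPTER, "I" * 17),   # too short once the adapter is removed
]
kept = preprocess(records, ADAPTER)
print([name for name, _, _ in kept])
```

Real tools additionally tolerate mismatches in the adapter, handle paired-end reads, and can trim low-quality bases from read ends.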

Below is an example of preprocessing results using fastp. The input data contained 19,786,002 reads (2,967,900,000 bases), which was reduced to 19,722,456 reads (2,941,557,000 bases) after preprocessing.

Result of Data Preprocessing
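The read and base counts quoted above can be turned into the share of data removed by preprocessing. The short calculation below uses only those four numbers; the variable and function names are chosen for this example.

```python
reads_before, reads_after = 19_786_002, 19_722_456
bases_before, bases_after = 2_967_900_000, 2_941_557_000


def removed_pct(before, after):
    """Percentage of the input removed by preprocessing."""
    return 100 * (before - after) / before


reads_removed = reads_before - reads_after
bases_removed = bases_before - bases_after
reads_pct = removed_pct(reads_before, reads_after)
bases_pct = removed_pct(bases_before, bases_after)

print(f"Reads removed: {reads_removed:,} ({reads_pct:.2f}%)")
print(f"Bases removed: {bases_removed:,} ({bases_pct:.2f}%)")
```

For this dataset, about 0.32% of reads and 0.89% of bases are removed. The base loss is proportionally larger than the read loss, which is what you would expect when trimming shortens reads that are still kept.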

Mapping

Next, the sequenced reads are mapped (aligned) to a reference genome. Mapping is the process of finding where each read originated in the reference genome. It takes FASTQ files and a reference genome as input and produces alignments, which are usually stored as a BAM file (the compressed binary form of the SAM format).

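Popular mapping tools for RNA-Seq include HISAT2, STAR, and Bowtie2. HISAT2 and STAR are splice-aware, so they can align reads that span exon-exon junctions to the genome. Bowtie2 is not splice-aware and is typically used to align reads to a transcriptome rather than a genome.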
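The idea of mapping can be illustrated with a deliberately naive Python sketch that looks for an exact match of each read, or of its reverse complement, in a tiny reference. This is a toy under strong assumptions (no mismatches, no splicing, no index, one short made-up reference) and is not how HISAT2, STAR, or Bowtie2 work internally. All sequences and names below are invented.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def map_read(read, reference):
    """Return (0-based position, strand) of the first exact match, or None."""
    pos = reference.find(read)
    if pos != -1:
        return pos, "+"
    pos = reference.find(reverse_complement(read))
    if pos != -1:
        return pos, "-"
    return None


reference = "TTGACCGTAAGCTTGGA"   # hypothetical toy reference
reads = {
    "read1": "CGTAAGC",   # matches the forward strand
    "read2": "CCAAGCT",   # matches the reverse strand
    "read3": "GGGGGGG",   # does not match anywhere
}
for name, seq in reads.items():
    print(name, map_read(seq, reference))
```

Real aligners build an index of the genome so they do not scan it for every read, and they report mismatches, gaps, and splice junctions in the SAM/BAM output.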

Below is a visualization of mapping results, showing where reads align along the reference genome.

Result of Mapping

Read Counting Based on Mapping Results

The next step is to count how many reads mapped to each gene. This requires the reference genome, annotation files (GTF or GFF3), and the mapping results (BAM file).

Common tools for read counting include featureCounts, htseq-count, RSEM, and StringTie. For gene-level quantification, featureCounts or htseq-count is sufficient. For transcript-level quantification, use RSEM or StringTie.
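Conceptually, gene-level counting checks which gene each aligned read overlaps. The Python sketch below shows that idea under simplifying assumptions: each gene is a single interval, coordinates are 0-based and half-open, strand is ignored, and reads overlapping more than one gene are skipped as ambiguous. The gene names, coordinates, and function name are hypothetical, and the actual tools apply more detailed rules (for example, exon-level features and strandedness).

```python
def count_reads_per_gene(alignments, genes):
    """Count reads overlapping exactly one gene.

    alignments: list of (chrom, start, end) tuples
    genes: dict mapping gene_id -> (chrom, start, end)
    """
    counts = {gene_id: 0 for gene_id in genes}
    for chrom, start, end in alignments:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and end > g_start
        ]
        if len(hits) == 1:          # skip unassigned and ambiguous reads
            counts[hits[0]] += 1
    return counts


genes = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 180, 300),
    "geneC": ("chr2", 50, 150),
}
alignments = [
    ("chr1", 110, 160),   # geneA
    ("chr1", 120, 170),   # geneA
    ("chr1", 185, 195),   # overlaps geneA and geneB, so it is skipped
    ("chr1", 250, 300),   # geneB
    ("chr2", 140, 190),   # geneC
    ("chr2", 400, 450),   # no gene
]
counts = count_reads_per_gene(alignments, genes)
for gene_id, n in counts.items():
    print(f"{gene_id},{n}")
```

The printed lines have the same shape as a minimal count table: one row per gene, one column per sample.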

The output of read counting is a table (e.g., a CSV file) like the one shown below.

Result of Read Count

Read Counting by Pseudo-Alignment

As an alternative to the mapping-based approach above, you can quantify reads using pseudo-alignment. This method does not require a separate mapping step -- it produces read counts directly from FASTQ files and a reference transcriptome at very high speed.

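Popular pseudo-alignment tools include Salmon and kallisto.

For gene-level analysis, pseudo-alignment generally produces read count results comparable to those from the mapping-based approach. Note that these tools estimate abundance per transcript, so the estimates are usually summarized to the gene level (for example, with tximport) before DEG analysis.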
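The core idea behind pseudo-alignment can be sketched in a few lines of Python: instead of finding the exact position of a read, find the set of transcripts that contain all of its k-mers. This is a simplified illustration with invented sequences and k = 5. Salmon and kallisto use far more efficient data structures and follow the compatibility step with a statistical model to estimate abundances, which is not shown here.

```python
def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def compatible_transcripts(read, index, k):
    """Transcripts that contain every k-mer of the read (may be empty)."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()


K = 5
transcripts = {                 # hypothetical transcripts sharing a prefix
    "tx1": "ACGTTGCAAGGCT",
    "tx2": "ACGTTGCATTTAC",
}
index = build_index(transcripts, K)
print(compatible_transcripts("ACGTTGCA", index, K))   # read from the shared region
print(compatible_transcripts("TGCAAGGC", index, K))   # read specific to tx1
print(compatible_transcripts("GGGGGGGG", index, K))   # read matching nothing
```

A read from a region shared by several transcripts is compatible with all of them, which is why the tools need a model to split such reads between transcripts.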

Result of Read Count

Identification of Differentially Expressed Genes

Differentially expressed genes (DEGs) are genes whose expression levels are significantly upregulated or downregulated between conditions or groups. DEGs are identified from the read count data produced in the previous step.

Popular DEG analysis tools include edgeR, DESeq2, and Ballgown.
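To show what "upregulated" and "downregulated" mean numerically, the Python sketch below normalizes counts to counts per million (CPM) and computes a log2 fold change per gene. It covers only the effect-size part, assuming one sample per condition, a pseudocount of 1, and invented counts. It is not a substitute for edgeR or DESeq2, which use replicates and statistical models to decide whether a change is significant.

```python
import math


def cpm(counts):
    """Scale raw counts to counts per million for one sample."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2(treated / control) on CPM values; the pseudocount avoids log of zero."""
    c, t = cpm(control), cpm(treated)
    return {
        gene: math.log2((t[gene] + pseudocount) / (c[gene] + pseudocount))
        for gene in control
    }


# Hypothetical raw counts for three genes in two samples.
control = {"g1": 100, "g2": 100, "g3": 800}
treated = {"g1": 400, "g2": 100, "g3": 500}
lfc = log2_fold_change(control, treated)
for gene, value in lfc.items():
    print(f"{gene}\t{value:+.2f}")
```

A positive value means higher expression in the treated sample, a negative value means lower expression, and a log2 fold change of 2 corresponds to a four-fold increase.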

Below is an example of DEG analysis results.

Results of Differentially Expressed Gene Identification

DEG results are often visualized with figures such as a heatmap or a volcano plot.

Heatmap

Volcano plot
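A volcano plot places each gene at x = log2 fold change and y = -log10 of the adjusted p-value, so that strongly changed and highly significant genes end up in the upper left and upper right. The Python sketch below computes those coordinates and labels each gene. The cutoffs (absolute log2 fold change of at least 1, adjusted p-value below 0.05) are common choices assumed for this example, and the gene results are invented. Adjusted p-values are assumed to be greater than zero.

```python
import math


def volcano_points(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Return (gene, x, y, label) for each (gene, log2FC, padj) result."""
    points = []
    for gene, lfc, padj in results:
        y = -math.log10(padj)            # larger y means more significant
        if padj < padj_cutoff and lfc >= lfc_cutoff:
            label = "up"
        elif padj < padj_cutoff and lfc <= -lfc_cutoff:
            label = "down"
        else:
            label = "ns"                 # not significant or change too small
        points.append((gene, lfc, y, label))
    return points


# Hypothetical DEG results: (gene, log2 fold change, adjusted p-value).
results = [
    ("geneA", 2.5, 0.001),
    ("geneB", -1.8, 0.01),
    ("geneC", 0.3, 0.6),
    ("geneD", 3.0, 0.2),
]
points = volcano_points(results)
for gene, x, y, label in points:
    print(f"{gene}\t{x:+.1f}\t{y:.2f}\t{label}")
```

The resulting coordinates and labels can be passed to any plotting library to draw and color the points.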

Functional Enrichment Analysis (Gene Ontology Analysis and Pathway Analysis)

Functional enrichment analysis identifies which biological functions are overrepresented among a list of genes. In RNA-Seq, this is typically applied to the DEG list from the previous step. Gene functions are commonly described using Gene Ontology (GO) terms or biological pathways.

Popular tools for functional enrichment analysis include clusterProfiler, topGO, and GOseq.
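One common way to test for overrepresentation is a hypergeometric test: given N background genes of which K are annotated with a term, how likely is it to see at least k annotated genes among n DEGs by chance? The Python sketch below computes that probability with the standard library. The numbers are invented, the function name is an assumption, and individual tools may use different statistical models and also apply multiple-testing correction, which is omitted here.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) for a hypergeometric distribution.

    k: DEGs annotated with the term
    n: total number of DEGs
    K: background genes annotated with the term
    N: total number of background genes
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical example: 20 background genes, 5 annotated with the term,
# 4 DEGs, of which 3 carry the annotation.
p = enrichment_p_value(k=3, n=4, K=5, N=20)
print(f"p = {p:.4f}")
```

A small p-value suggests the term appears among the DEGs more often than expected from the background, so the choice of background gene set directly affects the result.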

Below is an example of GO analysis results.

Results of Functional Enrichment Analysis
