RNA-Seq Data Analysis Workflow: A Step-by-Step Guide
Introduction
Performing RNA-Seq with a next-generation sequencer produces raw data in the form of FASTQ files. Going from this raw data to gene expression quantification, differentially expressed gene (DEG) identification, and functional enrichment analysis (such as GO analysis and pathway analysis) requires a series of data processing steps.
This page walks through the RNA-Seq data analysis workflow from raw FASTQ files, targeting well-annotated species such as human, mouse, and rat.
Overview of the Analysis Workflow
Steps at a Glance
| Steps | Overview |
|---|---|
| Data Preparation | Obtain the FASTQ files needed for analysis. |
| Data Preprocessing | Trim adapter sequences from FASTQ files and filter out low-quality reads. |
| Mapping | Align reads to their corresponding positions in the reference genome. |
| Read Counting Based on Mapping Results | Count reads mapped to each gene using the mapping results. |
| Read Counting by Pseudo-Alignment | Count reads per gene directly from FASTQ files without a separate mapping step. |
| Identification of Differentially Expressed Genes (DEGs) | Find genes whose expression has significantly changed between conditions or groups. |
| Functional Enrichment Analysis (Gene Ontology Analysis and Pathway Analysis) | Determine which biological functions the identified genes are involved in. |
Data Preparation
The first step is to obtain the FASTQ files for your analysis. FASTQ files are generated by sequencing on a next-generation sequencer -- you can run the sequencing yourself or use a sequencing service provider. You can also download publicly available FASTQ files from databases such as the NCBI Sequence Read Archive (SRA). For downloading from SRA, fasterq-dump from the SRA Toolkit is recommended.
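As a rough sketch of the download step, the snippet below only assembles a fasterq-dump command line and prints it. The accession `SRR0000000`, the output directory name, and the helper function are placeholders chosen for illustration, not values from this guide.

```python
import shlex

def build_fasterq_dump_command(accession, outdir="fastq", threads=4):
    """Return the argument list for a fasterq-dump call (nothing is executed here)."""
    return [
        "fasterq-dump",
        accession,
        "--split-files",            # write paired-end mates to separate files
        "--outdir", outdir,         # destination directory for the FASTQ files
        "--threads", str(threads),  # number of worker threads
    ]

cmd = build_fasterq_dump_command("SRR0000000")  # placeholder accession
print(shlex.join(cmd))
# To actually download: import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when it is later passed to `subprocess.run`.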
Data Preprocessing
Raw sequencer output often contains adapter sequences and low-quality reads. It is therefore common practice to preprocess the FASTQ files by trimming adapters and filtering out low-quality reads.
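Popular tools for trimming and filtering include Trimmomatic, Cutadapt, and fastp. FastQC is commonly used alongside them to check read quality before and after preprocessing; it reports quality metrics but does not trim or filter reads itself.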
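To make "filtering out low-quality reads" concrete, here is a minimal sketch that drops reads whose mean Phred score falls below a cutoff. It assumes Phred+33 quality encoding and an arbitrary cutoff of 20, and it is a simplified illustration rather than the algorithm any of the tools above actually use.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)

def filter_reads(records, min_mean_quality=20.0):
    """Keep (header, sequence, quality) records whose mean quality passes the cutoff."""
    return [r for r in records if r[2] and mean_phred(r[2]) >= min_mean_quality]

# Two hypothetical reads: one high quality, one low quality.
records = [
    ("@read1", "ACGTACGT", "IIIIIIII"),  # 'I' encodes Phred 40
    ("@read2", "ACGTACGT", "########"),  # '#' encodes Phred 2
]
kept = filter_reads(records)
print([header for header, _, _ in kept])
```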
Below is an example of preprocessing results using fastp. The input data contained 19,786,002 reads (2,967,900,000 bases), which was reduced to 19,722,456 reads (2,941,557,000 bases) after preprocessing.
Result of Data Preprocessing
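The read and base totals quoted for the fastp example can be turned into removal percentages with a few lines of arithmetic. The numbers are taken from the text above; only the helper function name is introduced here.

```python
reads_before, reads_after = 19_786_002, 19_722_456
bases_before, bases_after = 2_967_900_000, 2_941_557_000

def percent_removed(before, after):
    """Percentage of the input that was discarded by preprocessing."""
    return 100.0 * (before - after) / before

print(f"reads removed: {reads_before - reads_after:,} "
      f"({percent_removed(reads_before, reads_after):.2f}%)")
print(f"bases removed: {bases_before - bases_after:,} "
      f"({percent_removed(bases_before, bases_after):.2f}%)")
# reads removed: 63,546 (0.32%)
# bases removed: 26,343,000 (0.89%)
```

A larger share of bases than of reads is removed because trimming shortens some reads that are still kept.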
Mapping
Next, the sequenced reads are mapped (aligned) to a reference genome, that is, each read is assigned to the position it most likely originated from. Mapping takes FASTQ files and a reference genome as input and produces an alignment file, usually stored in BAM format (aligners typically emit SAM, which is then converted to its compressed binary form, BAM).
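Popular splice-aware mapping tools for RNA-Seq include HISAT2 and STAR. Bowtie2 is also used, but because it does not align reads across splice junctions, it is typically run against a transcriptome rather than the genome.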
Below is a visualization of mapping results, showing where reads align along the reference genome.
Result of Mapping
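A mapping result is easier to picture with one alignment record in hand. BAM is the compressed binary form of the text-based SAM format, so the sketch below parses the first six mandatory fields of a single SAM line. The example read, chromosome, and position are made up for illustration.

```python
from typing import NamedTuple

class Alignment(NamedTuple):
    read_name: str
    flag: int
    reference: str
    position: int  # 1-based leftmost mapped position
    mapq: int
    cigar: str

def parse_sam_line(line):
    """Parse the first six mandatory fields of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return Alignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5])

def is_mapped(alignment):
    """Flag bit 0x4 marks a read as unmapped."""
    return not (alignment.flag & 0x4)

# Hypothetical 50 bp read aligned to chr1 without gaps (CIGAR 50M).
example = "read1\t0\tchr1\t14363\t60\t50M\t*\t0\t0\t" + "A" * 50 + "\t" + "I" * 50
aln = parse_sam_line(example)
print(aln.reference, aln.position, is_mapped(aln))
```

In practice a library such as pysam or the samtools command line would be used to read BAM files; the manual parsing here is only meant to show what each record contains.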
Read Counting Based on Mapping Results
The next step is to count how many reads mapped to each gene. This requires a gene annotation file (GTF or GFF3) and the mapping results (BAM file). The annotation must correspond to the same reference genome that was used for mapping.
Common tools for read counting include featureCounts, htseq-count, RSEM, and StringTie. For gene-level quantification, featureCounts or htseq-count is sufficient. For transcript-level quantification, use RSEM or StringTie.
The output of read counting is a table (e.g., a CSV file) like the one shown below.
Result of Read Count
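The counting idea described above can be sketched as an interval-overlap problem. The gene coordinates and reads below are hypothetical, and the logic is deliberately simplified: real counters such as featureCounts or htseq-count work on exons, can take strand into account, and offer several rules for ambiguous reads.

```python
from collections import Counter

# Hypothetical gene intervals: name -> (chromosome, start, end), 1-based inclusive.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

# Hypothetical aligned reads as (chromosome, start, end).
reads = [
    ("chr1", 120, 169),
    ("chr1", 450, 499),
    ("chr1", 790, 839),
    ("chr1", 600, 649),  # overlaps no gene
    ("chr2", 120, 169),  # different chromosome
]

def count_reads_per_gene(reads, genes):
    """Count reads that overlap exactly one gene interval."""
    counts = Counter({name: 0 for name in genes})
    for chrom, start, end in reads:
        hits = [name for name, (g_chrom, g_start, g_end) in genes.items()
                if chrom == g_chrom and start <= g_end and end >= g_start]
        if len(hits) == 1:  # skip unassigned and ambiguous reads
            counts[hits[0]] += 1
    return dict(counts)

print(count_reads_per_gene(reads, genes))
# {'geneA': 2, 'geneB': 1}
```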
Read Counting by Pseudo-Alignment
As an alternative to the mapping-based approach above, you can quantify reads using pseudo-alignment. This method does not require a separate mapping step -- it produces read counts directly from FASTQ files and a reference transcriptome at very high speed.
Popular pseudo-alignment tools include Salmon and kallisto.
For gene-level analysis, pseudo-alignment generally produces read counts that closely agree with those from the mapping-based approach.
Result of Read Count
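The core idea behind pseudo-alignment, finding which transcripts a read is compatible with without computing a base-by-base alignment, can be illustrated with a toy k-mer index. The transcript sequences and k=5 are made up, and this is a simplified illustration only: Salmon and kallisto use far more elaborate index structures and a statistical model to assign reads that match several transcripts.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of every k-mer in the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()

# Two hypothetical transcripts sharing their first eight bases.
transcripts = {
    "tx1": "ACGTTGCAAGGCTA",
    "tx2": "ACGTTGCATTTCGA",
}
index = build_kmer_index(transcripts, k=5)
print(sorted(compatible_transcripts("ACGTTGCA", index, k=5)))  # shared region
print(sorted(compatible_transcripts("GCAAGGCT", index, k=5)))  # tx1-specific region
# ['tx1', 'tx2']
# ['tx1']
```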
Identification of Differentially Expressed Genes
Differentially expressed genes (DEGs) are genes whose expression levels are significantly upregulated or downregulated between conditions or groups. DEGs are identified from the read count data produced in the previous step.
Popular DEG analysis tools include edgeR, DESeq2, and Ballgown.
Below is an example of DEG analysis results.
Results of Differentially Expressed Gene Identification
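One column that typically appears in a DEG result table is the log2 fold change. The sketch below derives it from hypothetical raw counts after a simple counts-per-million (CPM) normalization. It is not a statistical test: tools such as edgeR and DESeq2 fit count models across replicates and report p-values as well, which this example does not attempt.

```python
import math

# Hypothetical raw counts: gene -> (control sample, treated sample).
counts = {
    "geneA": (100, 400),
    "geneB": (500, 500),
    "geneC": (400, 100),
}

def counts_per_million(counts, sample_index):
    """Scale one sample's counts so that they sum to one million."""
    total = sum(values[sample_index] for values in counts.values())
    return {gene: values[sample_index] * 1e6 / total for gene, values in counts.items()}

def log2_fold_changes(counts, pseudocount=1.0):
    """log2(treated / control) on CPM values; the pseudocount avoids division by zero."""
    control = counts_per_million(counts, 0)
    treated = counts_per_million(counts, 1)
    return {gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
            for gene in counts}

for gene, lfc in log2_fold_changes(counts).items():
    print(f"{gene}\t{lfc:+.2f}")
```

A positive value means higher expression in the treated sample, and a value of 2 corresponds to roughly a four-fold increase.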
DEG results are often visualized with figures such as a heatmap or a volcano plot.
Heatmap
Volcano plot
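A volcano plot places each gene at x = log2 fold change and y = -log10 of its (adjusted) p-value. The following sketch computes those coordinates and labels each gene. The result rows and the cutoffs (|log2FC| >= 1, adjusted p < 0.05) are assumptions chosen for illustration, not values from this guide.

```python
import math

# Hypothetical DEG table rows: (gene, log2 fold change, adjusted p-value).
results = [
    ("geneA", 2.5, 0.0001),
    ("geneB", -1.8, 0.003),
    ("geneC", 0.2, 0.6),
    ("geneD", 3.0, 0.2),
]

def volcano_points(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Return (gene, x, y, label) tuples ready for plotting."""
    points = []
    for gene, lfc, padj in results:
        if padj < padj_cutoff and lfc >= lfc_cutoff:
            label = "up"
        elif padj < padj_cutoff and lfc <= -lfc_cutoff:
            label = "down"
        else:
            label = "not significant"
        points.append((gene, lfc, -math.log10(padj), label))
    return points

for gene, x, y, label in volcano_points(results):
    print(f"{gene}\tx={x:+.1f}\ty={y:.2f}\t{label}")
```

Note that geneD has a large fold change but is still labeled "not significant" because its adjusted p-value does not pass the cutoff.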
Functional Enrichment Analysis (Gene Ontology Analysis and Pathway Analysis)
Functional enrichment analysis identifies which biological functions are overrepresented among a list of genes. In RNA-Seq, this is typically applied to the DEG list from the previous step. Gene functions are commonly described using Gene Ontology (GO) terms or biological pathways.
Popular tools for functional enrichment analysis include clusterProfiler, topGO, and GOseq.
Below is an example of GO analysis results.
Results of Functional Enrichment Analysis
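Over-representation of a GO term or pathway in a gene list is commonly assessed with a hypergeometric test (equivalent to a one-sided Fisher's exact test). The sketch below computes that p-value for a single term using only the standard library. All gene counts are hypothetical, and multiple-testing correction across terms, which real tools apply, is left out.

```python
from math import comb

def enrichment_p_value(background_size, term_size, deg_count, overlap):
    """P(X >= overlap) when deg_count genes are drawn without replacement
    from background_size genes, term_size of which carry the annotation."""
    upper = min(term_size, deg_count)
    total = comb(background_size, deg_count)
    tail = sum(comb(term_size, k) * comb(background_size - term_size, deg_count - k)
               for k in range(overlap, upper + 1))
    return tail / total

# Hypothetical numbers: 20,000 background genes, 200 annotated with the term,
# 500 DEGs, 15 of which carry the annotation (5 would be expected by chance).
p = enrichment_p_value(background_size=20000, term_size=200, deg_count=500, overlap=15)
print(f"p = {p:.3g}")
```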
RNA-Seq Data Analysis Software
This RNA-Seq data analysis software is recommended for those who:
✔︎ Want to avoid outsourcing or collaboration for RNA-Seq data analysis.
✔︎ Lack the time to learn RNA-Seq data analysis.
✔︎ Are frustrated by the complexity of existing tools.
Users can quantify gene expression, identify differentially expressed genes, run Gene Ontology (GO) analysis and pathway analysis, and draw volcano plots, MA plots, and heatmaps.
About the Author
BxINFO LLC
A research support company specializing in bioinformatics.
We provide tools and information to support life science research, with a focus on RNA-Seq analysis.