How to Use StringTie: Gene Expression Quantification in RNA-Seq Analysis
Introduction
RNA-Seq analysis with next-generation sequencing produces raw reads stored in FASTQ files. After the reads are mapped to a reference genome, gene expression levels are quantified from the reads aligned to each gene.
This page explains how to use StringTie, a software tool for identifying novel isoforms and estimating expression levels for each isoform from RNA-Seq alignments.
Installation
Precompiled binaries are available here. Download the appropriate binary for your environment. (Example: using StringTie v2.2.1 on macOS)
Check the installation by printing the help message with stringtie -h.
If a usage summary listing the available options is printed, the installation succeeded.
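The installation check can be sketched as a small shell function. This is a minimal sketch that assumes the downloaded stringtie binary has been placed in a directory on PATH; the function name check_stringtie is hypothetical.

```shell
# Report whether a stringtie binary is reachable on PATH.
# check_stringtie is a hypothetical helper name used only in this sketch.
check_stringtie() {
    if command -v stringtie >/dev/null 2>&1; then
        echo "stringtie found at: $(command -v stringtie)"
        stringtie --version    # prints the installed version
    else
        echo "stringtie not found on PATH"
        return 1
    fi
}

check_stringtie || echo "Add the directory containing the binary to PATH and retry." >&2
```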
Identification of Novel Isoforms
Use the following commands to identify novel isoforms and estimate isoform-level expression. First, run StringTie for each sample (sample1, sample2, sample3, and sample4):
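A minimal sketch of this per-sample step, assuming coordinate-sorted alignments named sample1.bam to sample4.bam and a reference annotation named annotation.gtf. The file names, the thread count, and the function name assemble_sample are placeholders, not part of StringTie.

```shell
# Placeholder names (assumptions): replace with your own files.
ANNOTATION=annotation.gtf                   # reference annotation passed to -G
SAMPLES="sample1 sample2 sample3 sample4"   # expects sample1.bam ... sample4.bam

# Assemble transcripts for one sample; $1 is the sample name.
assemble_sample() {
    # -p: number of threads, -G: guide annotation, -o: output GTF
    stringtie -p 4 -G "$ANNOTATION" -o "$1.gtf" "$1.bam"
}

for s in $SAMPLES; do
    if [ -f "$s.bam" ]; then
        assemble_sample "$s"
    else
        echo "skipping $s: $s.bam not found" >&2
    fi
done
```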
The -G option provides the reference annotation as a guide during isoform assembly.
This produces one GTF file per sample, sample1.gtf through sample4.gtf.
The output GTF file has the following columns:
| Column | Description |
| --- | --- |
| 1st | Chromosome (sequence) name |
| 2nd | Source; always contains "StringTie" |
| 3rd | Feature type (exon, transcript, mRNA, 5'UTR, etc.) |
| 4th | Feature start position (1-based index) |
| 5th | Feature end position (1-based index, inclusive) |
| 6th | Score; always contains 1000 |
| 7th | Strand direction (+ for forward, - for reverse) |
| 8th | Frame; always contains "." |
| 9th | Additional attributes separated by semicolons |
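The 9th column holds key-value attributes separated by semicolons (";"). These include gene_id and transcript_id, exon_number on exon lines, cov, and FPKM and TPM on transcript lines. Transcripts that match the reference annotation may also carry reference_id, ref_gene_id, and ref_gene_name.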
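To illustrate how the 9th column can be read, the sketch below pulls transcript_id and TPM out of every transcript line with awk. The function name extract_tpm and the input file name are assumptions for illustration.

```shell
# Print "transcript_id<TAB>TPM" for each transcript line of a StringTie GTF.
# extract_tpm is a hypothetical helper; $1 is the GTF path.
extract_tpm() {
    awk -F'\t' '$3 == "transcript" {
        id = ""; tpm = ""
        n = split($9, attrs, ";")             # attributes are ";"-separated
        for (i = 1; i <= n; i++) {
            a = attrs[i]
            gsub(/^ +| +$/, "", a)            # trim surrounding spaces
            if (a == "") continue
            key = a; sub(/ .*/, "", key)      # text before the first space
            val = a; sub(/^[^ ]+ /, "", val)  # text after the first space
            gsub(/"/, "", val)                # drop the surrounding quotes
            if (key == "transcript_id") id = val
            if (key == "TPM") tpm = val
        }
        print id "\t" tpm
    }' "$1"
}

# sample1.gtf is a placeholder file name.
if [ -f sample1.gtf ]; then
    extract_tpm sample1.gtf
fi
```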
Merging
Although you obtained results for each sample, isoforms may differ between samples, making it difficult to compare expression across samples. To address this, merge all sample annotations as follows:
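A minimal sketch of the merge step, assuming the per-sample files are named sample1.gtf to sample4.gtf and the reference annotation is annotation.gtf. The file names and the function name merge_annotations are placeholders.

```shell
ANNOTATION=annotation.gtf   # placeholder: reference annotation passed to -G

# Merge per-sample assemblies into one annotation; arguments are the GTF files.
merge_annotations() {
    stringtie --merge -G "$ANNOTATION" -o merged.gtf "$@"
}

if [ -f sample1.gtf ]; then
    merge_annotations sample1.gtf sample2.gtf sample3.gtf sample4.gtf
else
    echo "per-sample GTF files not found; run the assembly step first" >&2
fi
```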
This produces a merged annotation file named merged.gtf.
Estimating Isoform-Level Expression
Finally, estimate isoform-level expression based on the merged annotation file:
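A minimal sketch of the quantification step, assuming the same sample1.bam to sample4.bam alignments and writing each result into its own sub-directory of a folder named ballgown. The folder layout, the thread count, and the function name quantify_sample are assumptions.

```shell
SAMPLES="sample1 sample2 sample3 sample4"   # placeholder sample names

# Quantify the isoforms in merged.gtf for one sample; $1 is the sample name.
quantify_sample() {
    mkdir -p "ballgown/$1"
    # The *.ctab files from -B are written next to the output GTF.
    stringtie -e -B -p 4 -G merged.gtf -o "ballgown/$1/$1.gtf" "$1.bam"
}

for s in $SAMPLES; do
    if [ -f "$s.bam" ] && [ -f merged.gtf ]; then
        quantify_sample "$s"
    else
        echo "skipping $s: $s.bam or merged.gtf not found" >&2
    fi
done
```

Keeping one sub-directory per sample also suits tools that scan a parent folder for sample directories.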
The -e option disables novel isoform assembly and quantifies only the isoforms present in merged.gtf. The -B option outputs files (*.ctab) for Ballgown.
Preparing Files for DESeq2 / edgeR
Although the steps above generate Ballgown-compatible files, those files cannot be used directly by differential expression tools such as DESeq2 or edgeR.
To enable analysis in DESeq2/edgeR, create count matrix CSV files that combine results across all samples.
For this, use prepDE.py3, the Python 3 version of prepDE.py. Both scripts produce the same results.
In this example, the read length is 150 bp, so we specify -l 150 (the script's default is 75):
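A minimal sketch of the call, assuming prepDE.py3 has been downloaded into the current directory and that the GTF files produced with -e sit in per-sample sub-directories of a folder named ballgown. Both the folder name and the function name build_count_matrices are assumptions.

```shell
# Build the gene and transcript count matrices.
# $1 = folder with one sub-directory per sample, $2 = read length in bp
build_count_matrices() {
    python3 prepDE.py3 -i "$1" -l "$2"
}

if [ -d ballgown ]; then
    build_count_matrices ballgown 150
else
    echo "ballgown folder not found; run the quantification step first" >&2
fi
```

The -i option also accepts a text file that lists a sample ID and the path to its GTF file on each line, which helps when the files are not arranged in sample sub-directories.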
This generates two files: gene_count_matrix.csv and transcript_count_matrix.csv.

As discussed in https://github.com/gpertea/stringtie/issues/126, for paired-end reads, the summed read counts may not match the actual number of reads. However, it seems this output is typically used as-is.
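One way to see why these counts are estimates rather than exact tallies: the StringTie manual describes them as derived from each transcript's coverage, using reads_per_transcript = coverage * transcript_len / read_len. The sketch below evaluates that formula with awk. The function name estimate_reads is hypothetical, and rounding up to a whole number is an assumption of this sketch.

```shell
# Estimate a read count from coverage, as in the formula above.
# $1 = coverage, $2 = transcript length (bp), $3 = read length (bp)
estimate_reads() {
    awk -v cov="$1" -v len="$2" -v rl="$3" 'BEGIN {
        x = cov * len / rl
        n = int(x)
        if (n < x) n++     # round up (assumption of this sketch)
        print n
    }'
}

estimate_reads 12.5 1200 150   # prints 100
```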
About the Author
BxINFO LLC
A research support company specializing in bioinformatics.
We provide tools and information to support life science research, with a focus on RNA-Seq analysis.