StringTie Tutorial: Transcript Assembly & Quantification

Last updated: March 13, 2026

Introduction

When performing RNA-Seq analysis with a next-generation sequencer, you obtain raw data in the form of FASTQ files. After mapping these reads to a reference genome, gene expression levels are quantified by counting the reads that align to each gene.

This page explains how to use StringTie, a tool for discovering novel isoforms and estimating expression levels at the isoform level from RNA-Seq alignment results.

Installation

Pre-compiled binaries are available from the StringTie website (http://ccb.jhu.edu/software/stringtie/). Download the binary that matches your environment. (The example below uses StringTie v2.2.1 on macOS.)

$ wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.1.OSX_x86_64.tar.gz
$ tar -zxvf stringtie-2.2.1.OSX_x86_64.tar.gz

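The archive extracts to a directory that contains the stringtie binary. Add that directory to your PATH (or call the binary by its full path), then display the help message to verify the installation.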

$ stringtie -h

If you see output similar to the following, the installation was successful.

StringTie v2.2.1 usage:
stringtie <in.bam ..> [-G <guide_gff>] [-l <prefix>] [-o <out.gtf>] [-p <cpus>]
 [-v] [-a <min_anchor_len>] [-m <min_len>] [-j <min_anchor_cov>] [-f <min_iso>]
 [-c <min_bundle_cov>] [-g <bdist>] [-u] [-L] [-e] [--viral] [-E <err_margin>]
 [--ptf <f_tab>] [-x <seqid,..>] [-A <gene_abund.out>] [-h] {-B|-b <dir_path>}
 [--mix] [--conservative] [--rf] [--fr]
Assemble RNA-Seq alignments into potential transcripts.
Options:
...

Identifying Novel Isoforms

Run the following commands to identify novel isoforms and estimate isoform-level expression. The input BAM files must be sorted by coordinate. Start by running StringTie on each of the four samples (sample1 through sample4):

$ stringtie sample1.bam -G annotation.gtf -o sample1.gtf
$ stringtie sample2.bam -G annotation.gtf -o sample2.gtf
$ stringtie sample3.bam -G annotation.gtf -o sample3.gtf
$ stringtie sample4.bam -G annotation.gtf -o sample4.gtf

The -G option provides a reference annotation file that serves as a guide during isoform assembly.

This produces output files sample1.gtf through sample4.gtf, one for each sample.
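With more samples, the per-sample commands above can be generated by a loop. The following is a minimal sketch, assuming the BAM files are named sample1.bam through sample4.bam as in this tutorial. The run helper and the RUN variable are hypothetical names introduced here for illustration: commands are only printed unless RUN=1 is set, so nothing is executed by accident.

```shell
# Hypothetical helper: print the command unless RUN=1 is set.
# With RUN=1, stringtie is assumed to be on PATH.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$@"
  fi
}

# One assembly per sample, guided by the reference annotation (-G).
assemble_all() {
  for s in sample1 sample2 sample3 sample4; do
    run stringtie "$s.bam" -G annotation.gtf -o "$s.gtf"
  done
}

assemble_all
```

Once the printed commands look right, set RUN=1 in the environment and run the same script again to execute them.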

The output GTF file contains the following columns:

Column	Description
1st	Chromosome (sequence) name
2nd	Source; always contains "StringTie"
3rd	Feature type (e.g., transcript, exon)
4th	Feature start position (1-based index)
5th	Feature end position (1-based index)
6th	Score; always contains 1000
7th	Strand direction of the transcript
8th	Frame; always contains "."
9th	Additional attributes separated by semicolons

The 9th column contains the following attributes, delimited by semicolons ";":

Name	Description
gene_id	Gene ID
transcript_id	Transcript ID
exon_number	Position of the exon within the transcript
reference_id	Transcript ID in the reference annotation
ref_gene_id	Gene ID in the reference annotation
ref_gene_name	Gene name in the reference annotation
cov	Average per-base coverage
FPKM	FPKM value
TPM	TPM value
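To illustrate how the 9th column can be read, the sketch below pulls transcript_id, FPKM, and TPM out of the transcript rows with awk. The function name transcript_expr is made up for this example. It assumes that attribute values contain no semicolons and that each attribute has the form key "value", as in the table above.

```shell
# Print "transcript_id<TAB>FPKM<TAB>TPM" for every transcript row of a GTF.
# Assumption: attribute values contain no ";" characters.
transcript_expr() {
  awk -F'\t' '
    $3 == "transcript" {
      id = ""; fpkm = ""; tpm = ""
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        s = attrs[i]
        gsub(/^ +| +$/, "", s)          # trim surrounding spaces
        if (s == "") continue
        key = s; sub(/ .*/, "", key)    # text before the first space
        val = s; sub(/^[^ ]+ /, "", val)
        gsub(/"/, "", val)              # drop the surrounding quotes
        if (key == "transcript_id") id = val
        else if (key == "FPKM") fpkm = val
        else if (key == "TPM") tpm = val
      }
      print id "\t" fpkm "\t" tpm
    }
  ' "$1"
}

# Usage: only runs if the file from the step above exists.
if [ -f sample1.gtf ]; then
  transcript_expr sample1.gtf
fi
```

Exon rows are skipped because only the rows whose 3rd column is transcript are processed.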

Merging

Although results have been generated for each sample, the set of identified isoforms may differ between samples, so expression values cannot be compared directly across samples. To resolve this, merge the results from all samples using the following command:

$ stringtie --merge -G annotation.gtf -o merged.gtf sample1.gtf sample2.gtf sample3.gtf sample4.gtf

This produces a unified annotation file called merged.gtf.
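When there are many samples, listing every GTF on the command line becomes unwieldy. In merge mode StringTie also accepts a text file that lists the input GTF files, one per line. The sketch below writes such a list; the file name mergelist.txt and the function name write_mergelist are arbitrary choices made for this example.

```shell
# Write one per-sample GTF path per line to the given list file.
write_mergelist() {
  : > "$1"
  for s in sample1 sample2 sample3 sample4; do
    echo "$s.gtf" >> "$1"
  done
}

write_mergelist mergelist.txt
cat mergelist.txt

# The list can then replace the individual GTF arguments (not executed here):
#   stringtie --merge -G annotation.gtf -o merged.gtf mergelist.txt
```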

Estimating Isoform-Level Expression

Finally, re-estimate isoform-level expression using the merged annotation file:

$ stringtie sample1.bam -G merged.gtf -o result/sample1/sample1.gtf -e -B
$ stringtie sample2.bam -G merged.gtf -o result/sample2/sample2.gtf -e -B
$ stringtie sample3.bam -G merged.gtf -o result/sample3/sample3.gtf -e -B
$ stringtie sample4.bam -G merged.gtf -o result/sample4/sample4.gtf -e -B

The -e option skips novel isoform discovery and restricts the analysis to only the isoforms listed in merged.gtf. The -B option outputs Ballgown-compatible files (*.ctab) in the same directory as the output GTF.
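After these runs, each sample directory should hold the output GTF plus the Ballgown tables written by -B (e2t.ctab, e_data.ctab, i2t.ctab, i_data.ctab, and t_data.ctab). The following sketch checks for them, assuming the result/sampleN layout used above; check_ballgown_dir is a hypothetical helper name.

```shell
# Return 0 if all five Ballgown tables exist in the directory, 1 otherwise.
check_ballgown_dir() {
  dir=$1
  missing=0
  for f in e2t.ctab e_data.ctab i2t.ctab i_data.ctab t_data.ctab; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}

for s in sample1 sample2 sample3 sample4; do
  check_ballgown_dir "result/$s" || echo "result/$s is incomplete"
done
```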

Preparing Files for DESeq2 / edgeR

The steps above produce Ballgown-compatible files, but these cannot be used directly with differential expression tools such as DESeq2 or edgeR.

To enable analysis with DESeq2 or edgeR, you need to create CSV files that consolidate the results from all samples into count matrices.

Use prepDE.py3 for this purpose. prepDE.py3 is the Python 3 version of prepDE.py; using either script produces the same results.

prepDE.py3 assumes an average read length of 75 bp unless told otherwise. In this example, the read length is 150 bp, so we specify -l 150:

$ prepDE.py3 -i result -l 150

This generates two files: gene_count_matrix.csv and transcript_count_matrix.csv.
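The read length matters because the counts are derived from coverage rather than tallied directly. The StringTie manual gives the conversion as reads_per_transcript = coverage * transcript_len / read_len. The sketch below reproduces that arithmetic for a single transcript. The function name cov_to_count is made up here, and rounding up to the next integer is an assumption about how the script handles fractions.

```shell
# Estimate a read count from average coverage, transcript length (bp),
# and read length (bp). Rounding up is an assumption, see the text above.
cov_to_count() {
  awk -v cov="$1" -v len="$2" -v rl="$3" 'BEGIN {
    x = cov * len / rl
    n = int(x)
    if (n < x) n++      # round up to the next whole read
    print n
  }'
}

# Illustrative values only: coverage 12.5 over a 1001 bp transcript.
cov_to_count 12.5 1001 150
cov_to_count 12.5 1001 75
```

With the same coverage, halving the assumed read length doubles the estimated count, which is why -l should match the actual data.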

As discussed in https://github.com/gpertea/stringtie/issues/126, when using paired-end reads, the summed read counts may not match the actual number of reads. However, in practice, the output is typically used as-is.
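To see how the per-sample totals compare with your sequencing depth, you can sum each sample column of the count matrix. This is a minimal sketch, assuming the first column holds the gene ID, the header row holds the sample names, and no field contains a comma; column_sums is a name chosen for this example.

```shell
# Print "sample<TAB>total" for each sample column of a count matrix CSV.
column_sums() {
  awk -F',' '
    NR == 1 {
      ncol = NF
      for (i = 2; i <= NF; i++) name[i] = $i
      next
    }
    {
      for (i = 2; i <= NF; i++) sum[i] += $i
    }
    END {
      for (i = 2; i <= ncol; i++) printf "%s\t%d\n", name[i], sum[i]
    }
  ' "$1"
}

# Usage: only runs if prepDE.py3 has already produced the matrix.
if [ -f gene_count_matrix.csv ]; then
  column_sums gene_count_matrix.csv
fi
```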


About the Author

BxINFO LLC

A research support company specializing in bioinformatics.

We provide tools and information to support life science research, with a focus on RNA-Seq analysis.
