StringTie Tutorial: Transcript Assembly & Quantification

Last updated: March 13, 2026

Introduction

When performing RNA-Seq analysis with a next-generation sequencer, you obtain raw data in the form of FASTQ files. After mapping these reads to a reference genome, gene expression levels are quantified by counting the reads that align to each gene.

This page explains how to use StringTie, a tool for discovering novel isoforms and estimating expression levels at the isoform level from RNA-Seq alignment results.

Installation

Pre-compiled binaries are available from the StringTie website (ccb.jhu.edu/software/stringtie). Download the one that matches your platform. The example below uses StringTie v2.2.1 on macOS.

$ wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.1.OSX_x86_64.tar.gz
$ tar -zxvf stringtie-2.2.1.OSX_x86_64.tar.gz

Display the help message to verify the installation. Run the command from the extracted directory, or add that directory to your PATH first.

$ stringtie -h

If you see output similar to the following, the installation was successful.

StringTie v2.2.1 usage:

stringtie <in.bam ..> [-G <guide_gff>] [-l <prefix>] [-o <out.gtf>] [-p <cpus>]
 [-v] [-a <min_anchor_len>] [-m <min_len>] [-j <min_anchor_cov>] [-f <min_iso>]
 [-c <min_bundle_cov>] [-g <bdist>] [-u] [-L] [-e] [--viral] [-E <err_margin>]
 [--ptf <f_tab>] [-x <seqid,..>] [-A <gene_abund.out>] [-h] {-B|-b <dir_path>}
 [--mix] [--conservative] [--rf] [--fr]
Assemble RNA-Seq alignments into potential transcripts.
Options:
 ...

Identifying Novel Isoforms

Run the following commands to identify novel isoforms and estimate isoform-level expression. Start by running StringTie on each of the four samples (sample1 through sample4). The input BAM files must be sorted by coordinate.

$ stringtie sample1.bam -G annotation.gtf -o sample1.gtf
$ stringtie sample2.bam -G annotation.gtf -o sample2.gtf
$ stringtie sample3.bam -G annotation.gtf -o sample3.gtf
$ stringtie sample4.bam -G annotation.gtf -o sample4.gtf

The -G option provides a reference annotation file that serves as a guide during isoform assembly.

This produces output files sample1.gtf through sample4.gtf, one for each sample.
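With more than a few samples, the same step is easier to write as a loop. The following is a minimal sketch, not part of StringTie itself: the function name assemble_samples is hypothetical, and it assumes each BAM file is named <sample>.bam and sits next to annotation.gtf in the current directory.

```shell
# Assemble transcripts for each sample, guided by the reference annotation.
# Assumes <sample>.bam and annotation.gtf are in the current directory.
assemble_samples() {
    for sample in "$@"; do
        # Stop at the first failure so a bad sample is not silently skipped.
        stringtie "${sample}.bam" -G annotation.gtf -o "${sample}.gtf" || return 1
    done
}

# Usage:
#   assemble_samples sample1 sample2 sample3 sample4
```

Passing the sample names as arguments keeps the loop reusable when samples are added later.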

The output GTF file contains the following tab-separated columns:

Column  Description
1st     Chromosome name
2nd     Always contains "StringTie"
3rd     Feature type (exon, transcript, mRNA, 5'UTR, etc.)
4th     Feature start position (1-based index)
5th     Feature end position (1-based index)
6th     Always contains 1000
7th     Strand direction of the transcript
8th     Always contains "."
9th     Additional attributes separated by semicolons

The 9th column contains the following attributes, delimited by semicolons ";":

Name           Description
gene_id        Gene ID
transcript_id  Transcript ID
exon_number    Position of the exon within the transcript
reference_id   Transcript ID in the reference annotation
ref_gene_id    Gene ID in the reference annotation
ref_gene_name  Gene name in the reference annotation
cov            Per-base coverage
FPKM           FPKM value
TPM            TPM value
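To illustrate how the attribute column can be read, the sketch below prints the transcript_id and TPM of every transcript row in a StringTie GTF file. The function name transcript_tpm is hypothetical. It assumes the attributes are written as key "value" pairs separated by semicolons, and that no value contains a semicolon.

```shell
# Print transcript_id and TPM (tab-separated) for each transcript row of a GTF file.
transcript_tpm() {
    awk -F '\t' '$3 == "transcript" {
        id = ""; tpm = ""
        # Split the 9th column into individual key "value" attributes.
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            a = attrs[i]
            sub(/^ +/, "", a)   # drop the space that follows each semicolon
            if (a ~ /^transcript_id /) { id = a; sub(/^transcript_id /, "", id); gsub(/"/, "", id) }
            if (a ~ /^TPM /) { tpm = a; sub(/^TPM /, "", tpm); gsub(/"/, "", tpm) }
        }
        print id "\t" tpm
    }' "$1"
}

# Usage:
#   transcript_tpm sample1.gtf
```

Exon rows and header comment lines are skipped because their third column is not "transcript".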

Merging

Although results have been generated for each sample, the identified isoforms may differ between samples, making direct comparison impossible. To resolve this, merge the results from all samples using the following command:

$ stringtie --merge -G annotation.gtf -o merged.gtf sample1.gtf sample2.gtf sample3.gtf sample4.gtf

This produces a unified annotation file called merged.gtf.
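When there are many samples, listing every GTF on the command line gets unwieldy. StringTie's merge mode also accepts a text file containing one GTF path per line. The sketch below writes such a list; the function name write_mergelist and the file name mergelist.txt are assumptions for illustration, and it expects the per-sample files to be named <sample>.gtf.

```shell
# Write one GTF path per line: <sample>.gtf for each sample name given.
write_mergelist() {
    list_file=$1
    shift
    for sample in "$@"; do
        printf '%s.gtf\n' "$sample"
    done > "$list_file"
}

# Usage:
#   write_mergelist mergelist.txt sample1 sample2 sample3 sample4
#   stringtie --merge -G annotation.gtf -o merged.gtf mergelist.txt
```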

Estimating Isoform-Level Expression

Finally, re-estimate isoform-level expression using the merged annotation file:

$ stringtie sample1.bam -G merged.gtf -o result/sample1/sample1.gtf -e -B
$ stringtie sample2.bam -G merged.gtf -o result/sample2/sample2.gtf -e -B
$ stringtie sample3.bam -G merged.gtf -o result/sample3/sample3.gtf -e -B
$ stringtie sample4.bam -G merged.gtf -o result/sample4/sample4.gtf -e -B

The -e option skips novel isoform discovery and restricts the analysis to only the isoforms listed in merged.gtf. The -B option outputs Ballgown-compatible files (*.ctab).
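The re-estimation step can be looped in the same way. This is a sketch under the same naming assumptions as before (<sample>.bam in the current directory), with a hypothetical function name quantify_samples. It creates each output directory up front as a precaution, which is harmless if the directory already exists.

```shell
# Re-estimate expression against merged.gtf, one output directory per sample.
quantify_samples() {
    for sample in "$@"; do
        # Make sure result/<sample>/ exists before StringTie writes into it.
        mkdir -p "result/${sample}" || return 1
        stringtie "${sample}.bam" -G merged.gtf \
            -o "result/${sample}/${sample}.gtf" -e -B || return 1
    done
}

# Usage:
#   quantify_samples sample1 sample2 sample3 sample4
```

Keeping one subdirectory per sample under result/ matches the layout that prepDE.py3 is pointed at in the next step.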

Preparing Files for DESeq2 / edgeR

The steps above produce Ballgown-compatible files, but these cannot be used directly with differential expression tools such as DESeq2 or edgeR.

To enable analysis with DESeq2 or edgeR, you need to create CSV files that consolidate the results from all samples into count matrices.

Use prepDE.py3 for this purpose. prepDE.py3 is the Python 3 version of prepDE.py; using either script produces the same results.

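In this example the average read length is 150 bp, so we specify -l 150. prepDE.py3 converts StringTie's coverage values into read counts as coverage × transcript length / read length, so this value should match your data (the default is 75):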

$ prepDE.py3 -i result -l 150

This generates two files: gene_count_matrix.csv and transcript_count_matrix.csv.

Count matrix files for DESeq2/edgeR

As discussed in https://github.com/gpertea/stringtie/issues/126, when using paired-end reads, the summed read counts may not match the actual number of reads. However, in practice, the output is typically used as-is.
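To see how far the summed counts are from your actual read numbers, you can total each sample column of a count matrix. The sketch below uses a hypothetical function name, sum_counts. It assumes the CSV has a header row with the ID in the first column and one column per sample, and that no ID contains a comma.

```shell
# Print the total count per sample column of a count matrix CSV.
sum_counts() {
    awk -F ',' '
        # Header row: remember the sample names.
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; ncol = NF; next }
        # Data rows: accumulate each sample column.
        { for (i = 2; i <= ncol; i++) total[i] += $i }
        END { for (i = 2; i <= ncol; i++) printf "%s\t%d\n", name[i], total[i] }
    ' "$1"
}

# Usage:
#   sum_counts gene_count_matrix.csv
```

Comparing these totals with the number of aligned reads per sample gives a rough sense of the discrepancy described in the issue.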
