FastQC Tutorial: Quality Control for FASTQ Files
Introduction
Next-generation sequencing (NGS) produces raw data in the form of FASTQ files, which contain the base sequence of each read along with per-base quality scores. After a sequencing run, the first step is to check the quality of your FASTQ files so that problems with read quality are caught before downstream analysis. FastQC is one of the most widely used tools for this purpose.
This page explains how to run FastQC from the command line to perform quality checks on your sequencing data.
Installation
FastQC can be downloaded from the Babraham Bioinformatics project page (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
If you are on macOS and want to use FastQC from the command line, select the Win/Linux zip file.
You can also download the zip file directly from the command line, for example with wget; adjust the version number in the file name as needed.
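The command-line download can be sketched as follows. This is a minimal example that assumes wget and unzip are installed. The version number (0.12.1) is an assumption and should be replaced with the current release listed on the download page, and install_fastqc is a helper name introduced here for illustration.

```shell
# Version is an assumption; check the download page for the current release.
FASTQC_VERSION="0.12.1"
FASTQC_ZIP="fastqc_v${FASTQC_VERSION}.zip"
FASTQC_URL="https://www.bioinformatics.babraham.ac.uk/projects/fastqc/${FASTQC_ZIP}"

# Hypothetical helper: download, extract, and make the launcher executable.
install_fastqc() {
    wget "$FASTQC_URL"        # download the Win/Linux zip file
    unzip "$FASTQC_ZIP"       # extracts into a FastQC/ folder
    chmod +x FastQC/fastqc    # the launcher script may not be executable after unzip
}
```

Calling install_fastqc in an empty working directory downloads the archive and extracts it into a FastQC folder.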
Verify that FastQC is working correctly by displaying the help message.
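A minimal check might look like this, assuming the archive was extracted into a FastQC folder in the current directory (the FASTQC_BIN variable is introduced here for illustration). FastQC needs a Java runtime, so if the command fails, checking that java is installed is a reasonable first step.

```shell
# Path assumes the zip file was extracted in the current directory.
FASTQC_BIN="${FASTQC_BIN:-./FastQC/fastqc}"

if [ -x "$FASTQC_BIN" ]; then
    "$FASTQC_BIN" --help      # prints usage information and the list of options
else
    echo "FastQC launcher not found or not executable: $FASTQC_BIN" >&2
fi
```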
If FastQC prints its usage information and list of options, the installation was successful. It is recommended to add the FastQC folder to your system's PATH so that the fastqc command can be run from any directory.
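Adding FastQC to the PATH could be done along these lines. The location in FASTQC_DIR is an assumption (the folder extracted in the current directory) and should point to wherever FastQC actually lives on your system.

```shell
# Assumed install location; change this to the real FastQC folder.
FASTQC_DIR="$PWD/FastQC"

# Prepend the folder to PATH for the current shell session.
export PATH="$FASTQC_DIR:$PATH"
```

To make the change permanent, the export line (with an absolute path in place of $PWD) can be appended to your shell startup file, such as ~/.bashrc or ~/.zshrc.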
Running the Quality Check
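A typical invocation might look like the following. The input file name sample.fastq.gz and the results folder are placeholders chosen for this sketch. The output folder is created first because FastQC does not create it.

```shell
INPUT="sample.fastq.gz"   # placeholder: replace with your own FASTQ file
OUTDIR="results"          # folder that will receive the report files

mkdir -p "$OUTDIR"        # the output folder must exist before FastQC runs

if command -v fastqc >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    fastqc -o "$OUTDIR" "$INPUT"
else
    echo "Skipping: fastqc is not on PATH or $INPUT does not exist" >&2
fi
```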
FastQC writes one HTML report and one ZIP archive per input file to the output folder (results in this tutorial). If both files appear there, the quality check completed successfully.
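Besides the HTML report, the ZIP archive contains a summary.txt file with one tab-separated line per module (status, module name, input file name). As a rough sketch, the helper below (flagged_modules, a name introduced here for illustration) prints every module that did not receive a PASS status.

```shell
# Print every module whose status is not PASS.
# Each line of summary.txt is: STATUS <tab> module name <tab> input file name.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}
```

After extracting the archive, for example with unzip results/sample_fastqc.zip -d results, the helper could be run as flagged_modules results/sample_fastqc/summary.txt, where sample is a placeholder for your input file name.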
Understanding the FastQC Report
The HTML report contains the following sections.
Basic Statistics
Provides an overview of basic information about the input file, such as total sequences, sequence length, and GC content.
Per base sequence quality
Shows the distribution of quality scores at each position along the reads, drawn as a box plot per position. The horizontal axis represents the position within the read, and the vertical axis represents the quality score.
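To get a feel for the scale of the vertical axis, it helps to know that the quality scores are Phred scores, where a score Q corresponds to an estimated base-calling error probability of 10^(-Q/10). The small helper below (phred_to_error, a name introduced here for illustration) converts a score into that probability.

```shell
# Convert a Phred quality score Q into an error probability: p = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}
```

For example, phred_to_error 20 gives 0.0100 (1 error in 100 bases) and phred_to_error 30 gives 0.0010 (1 error in 1,000 bases).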
Per sequence quality scores
Shows the distribution of average quality scores across all reads. The horizontal axis represents the mean quality score, and the vertical axis represents the number of reads.
Per base sequence content
Shows the proportion of each base (A, T, G, C) at every position along the reads. The horizontal axis represents the position within the read, and the vertical axis represents the base proportion.
Per sequence GC content
Shows the distribution of GC content across all reads. The horizontal axis represents the GC percentage, and the vertical axis represents the number of reads.
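The idea behind this plot can be approximated on the command line. The sketch below is a simplification, not FastQC's own implementation: it assumes an uncompressed FASTQ file with exactly four lines per record, rounds each read's GC percentage to the nearest integer, and counts reads per percentage. The name gc_distribution is introduced here for illustration.

```shell
# Print "GC percent <tab> number of reads" for an uncompressed 4-line FASTQ file.
gc_distribution() {
    awk 'NR % 4 == 2 && length($0) > 0 {
             len = length($0)
             gc = gsub(/[GCgc]/, "")            # gsub returns the number of G/C bases
             bin[int(100 * gc / len + 0.5)]++   # round to the nearest whole percent
         }
         END { for (p in bin) print p "\t" bin[p] }' "$1" | sort -n
}
```

For a gzipped file, the input can first be decompressed to a temporary file, for example with gunzip -c sample.fastq.gz > sample.fastq (sample is a placeholder name).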
Per base N content
Shows the percentage of ambiguous bases (N) at each position along the reads. The horizontal axis represents the position within the read, and the vertical axis represents the proportion of N calls.
Sequence Length Distribution
Shows the distribution of read lengths in the dataset. The horizontal axis represents the read length, and the vertical axis represents the number of reads.
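The same information can be tabulated directly from a FASTQ file. The sketch below assumes an uncompressed file with four lines per record; length_distribution is a helper name introduced here for illustration.

```shell
# Print "read length <tab> number of reads" for an uncompressed 4-line FASTQ file.
length_distribution() {
    awk 'NR % 4 == 2 { count[length($0)]++ }
         END { for (len in count) print len "\t" count[len] }' "$1" | sort -n
}
```

Untrimmed reads from a single run often all have the same length, so a single output line is common; a spread of lengths usually appears after adapter or quality trimming.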
Sequence Duplication Levels
Shows the degree of duplication among reads. The horizontal axis represents how many times a sequence is duplicated, and the vertical axis represents the percentage of reads at each duplication level.
Overrepresented sequences
Lists sequences that appear at unusually high frequency in the dataset. FastQC reports any sequence that makes up more than 0.1% of the total.
Adapter Content
Shows the cumulative percentage of reads in which each adapter sequence has been detected up to a given position. The horizontal axis represents the position within the read, and the vertical axis represents the percentage of reads containing the adapter.
About the Author
BxINFO LLC
A research support company specializing in bioinformatics.
We provide tools and information to support life science research, with a focus on RNA-Seq analysis.