How to Run FastQC from the Command Line

Introduction

Sequencing on a next-generation sequencer (NGS) produces raw data as FASTQ files, which contain the base sequences of the reads and their quality scores. The first step after sequencing is to check the quality of these FASTQ files to make sure there are no problems with the reads. The most widely used software for this check is FastQC.
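
As a rough illustration of the FASTQ layout: each read conventionally occupies four lines (an identifier starting with @, the bases, a + separator, and one quality character per base). The sketch below writes a two-read example file and counts its reads. The reads are made up, count_reads is a hypothetical helper name, and the code assumes an uncompressed file in which sequences are not wrapped across lines.

```shell
# Write a tiny two-read FASTQ file (made-up reads, for illustration only).
example_fastq="$(mktemp)"
printf '%s\n' \
    '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
    '@read2' 'GGCCTTAANN' '+' 'IIIIIHHH##' > "$example_fastq"

# Count reads, assuming exactly four lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

count_reads "$example_fastq"   # prints 2
```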

On this page, we explain how to run quality checks with FastQC from the command line.

Installation

FastQC can be downloaded from the FastQC project page (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

[Figure: Installing FastQC]

Mac users who prefer to work on the command line (CLI) should also choose the Win/Linux zip file.

It's also possible to download it via command line as shown below (adjust the version as needed).

$ wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip

The following steps unzip the file and grant execution permission.

$ unzip fastqc_v0.12.1.zip
$ cd FastQC/
$ chmod u+x fastqc

Let's check if it works properly by displaying the help message.

$ ./fastqc -h

If a message like the following is displayed, the setup was successful.

    FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

    fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of
    which will help to identify a different potential type of problem in your
    data.

It is recommended to add the FastQC directory to your system's PATH so that the fastqc command can be run from any directory.
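
To act on the PATH recommendation above, one option is to prepend the FastQC directory to PATH. This is a minimal sketch, assuming FastQC was unzipped to $HOME/FastQC (a hypothetical location; substitute your own) and that your shell reads ~/.bashrc or ~/.zshrc at startup.

```shell
# Hypothetical install location; change this to wherever you unzipped FastQC.
FASTQC_DIR="$HOME/FastQC"

# Prepend the directory to PATH for the current shell session.
export PATH="$FASTQC_DIR:$PATH"

# To make the change permanent, add the export line above to ~/.bashrc or
# ~/.zshrc. Afterwards, `fastqc -h` should work from any directory.
```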

Performing Quality Check

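Create an output directory and run FastQC on the FASTQ files with the following commands. The directory passed to -o must already exist, because FastQC does not create it.

$ mkdir results
$ fastqc -o results/ *.fastq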

If an HTML file and a ZIP file are created in the results folder for each input file, the run was successful.
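
The check above can be sketched as a small shell function. It assumes FastQC's usual naming, where sample.fastq yields sample_fastqc.html and sample_fastqc.zip in the output directory. check_fastqc_outputs is a hypothetical helper, not part of FastQC.

```shell
# Report any .fastq file in the current directory whose FastQC outputs are
# missing from the given output directory. Returns non-zero if any are missing.
check_fastqc_outputs() {
    outdir="$1"
    missing=0
    for fq in *.fastq; do
        [ -e "$fq" ] || continue          # the glob matched no .fastq files
        base="${fq%.fastq}"
        for ext in html zip; do
            if [ ! -f "$outdir/${base}_fastqc.$ext" ]; then
                echo "missing: $outdir/${base}_fastqc.$ext"
                missing=1
            fi
        done
    done
    return "$missing"
}

# Usage (after running fastqc as above):
#   check_fastqc_outputs results
```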

FastQC Report

The HTML report contains the following sections.

Basic Statistics

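Displays basic information about the input file, such as the file name, quality encoding, total number of sequences, sequence length, and %GC.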

[Figure: Basic Statistics]

Per base sequence quality

Shows the distribution of quality scores at each position in the reads. The horizontal axis represents the position in the read, and the vertical axis represents the quality score (Phred score).

[Figure: Per base sequence quality]
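
For context, FASTQ quality scores are Phred-scaled: a score Q corresponds to an estimated error probability of 10^(-Q/10). In the common Phred+33 encoding, each score is stored as the ASCII character with code Q + 33. The sketch below shows that conversion. It assumes Phred+33 encoding, and both function names are hypothetical helpers.

```shell
# Convert one quality character to its Phred score (assumes Phred+33).
phred_score() {
    # A leading quote makes printf print the character's numeric code.
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

# Convert a Phred score to the estimated probability of a wrong base call.
error_probability() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_score I          # prints 40
error_probability 20   # prints 0.0100
```

A score of 20 therefore means roughly one wrong call in 100 bases, and a score of 40 roughly one in 10,000.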

Per sequence quality scores

Displays the distribution of the average quality score per read. The horizontal axis represents the average quality score, and the vertical axis represents the number of reads.

[Figure: Per sequence quality scores]

Per base sequence content

Shows the base composition at each position in the reads. The horizontal axis represents the position in the read, and the vertical axis represents the proportion of each base.

[Figure: Per base sequence content]

Per sequence GC content

Displays the distribution of GC content for each read. The horizontal axis represents the GC content, and the vertical axis represents the number of reads.

[Figure: Per sequence GC content]
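
To make per-read GC content concrete, the sketch below computes it directly from a FASTQ file with awk. It assumes an uncompressed file with four lines per record. The example reads are made up, gc_per_read is a hypothetical helper, and FastQC's own binning may differ in detail.

```shell
# Print the GC percentage of every read (sequence lines are lines 2, 6, 10, ...).
gc_per_read() {
    awk 'NR % 4 == 2 && length($0) > 0 {
        seq = toupper($0)
        gc = gsub(/[GC]/, "", seq)       # gsub returns the number of matches
        printf "%.1f\n", 100 * gc / length($0)
    }' "$1"
}

# Made-up example with two reads.
example_fastq="$(mktemp)"
printf '%s\n' \
    '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
    '@read2' 'GGCCGGCCAT' '+' 'IIIIIIIIII' > "$example_fastq"

gc_per_read "$example_fastq"   # prints 50.0 and 80.0, one per line
```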

Per base N content

Shows the proportion of N bases at each position in the reads. The horizontal axis represents the position in the read, and the vertical axis represents the proportion.

[Figure: Per base N content]

Sequence Length Distribution

Displays the distribution of read lengths. The horizontal axis represents the read length, and the vertical axis represents the number of reads.

[Figure: Sequence Length Distribution]
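
The same kind of distribution can be tabulated by hand, which is sometimes useful for a quick check. This is a sketch under the assumption of an uncompressed FASTQ file with four lines per record. length_distribution is a hypothetical helper and the reads are made up.

```shell
# Print "length<TAB>count" for each read length, sorted by length.
length_distribution() {
    awk 'NR % 4 == 2 { count[length($0)]++ }
         END { for (len in count) print len "\t" count[len] }' "$1" | sort -n
}

# Made-up example: two reads of length 10 and one of length 8.
example_fastq="$(mktemp)"
printf '%s\n' \
    '@r1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
    '@r2' 'GGCCTTAAGG' '+' 'IIIIIIIIII' \
    '@r3' 'ACGTACGT'   '+' 'IIIIIIII' > "$example_fastq"

length_distribution "$example_fastq"
```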

Sequence Duplication Levels

Indicates the level of duplication in the reads. The horizontal axis represents the duplication level (the number of times a sequence occurs), and the vertical axis represents the percentage of reads at that duplication level.

[Figure: Sequence Duplication Levels]
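
Duplication level can be illustrated by counting how often each exact sequence occurs. This is a rough sketch, assuming an uncompressed FASTQ file with four lines per record. duplication_levels is a hypothetical helper and the reads are made up. FastQC itself tracks only a subset of sequences to save memory, so its figures can differ from an exhaustive count like this one.

```shell
# For each duplication level, print how many distinct sequences occur that
# many times, as "level<TAB>distinct sequences".
duplication_levels() {
    awk 'NR % 4 == 2 { seen[$0]++ }
         END {
             for (s in seen) level[seen[s]]++
             for (n in level) print n "\t" level[n]
         }' "$1" | sort -n
}

# Made-up example: ACGT occurs three times, GGCC and TTAA once each.
example_fastq="$(mktemp)"
printf '%s\n' \
    '@r1' 'ACGT' '+' 'IIII' \
    '@r2' 'ACGT' '+' 'IIII' \
    '@r3' 'ACGT' '+' 'IIII' \
    '@r4' 'GGCC' '+' 'IIII' \
    '@r5' 'TTAA' '+' 'IIII' > "$example_fastq"

duplication_levels "$example_fastq"
```

Here two sequences occur once and one sequence occurs three times.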

Overrepresented sequences

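Lists sequences that make up more than 0.1% of all reads, together with their counts, percentages, and possible sources (for example, adapters).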

[Figure: Overrepresented sequences]

Adapter Content

Shows the cumulative percentage of reads in which an adapter sequence has been seen, at each position in the reads. The horizontal axis represents the position in the read, and the vertical axis represents the percentage of reads.

[Figure: Adapter Content]
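
FastQC flags each of the modules above as PASS, WARN, or FAIL, and records these flags in a text file, summary.txt, inside the ZIP output, as tab-separated lines of status, module name, and file name. The sketch below lists the modules that did not pass. non_passing_modules is a hypothetical helper, and the ZIP name in the usage comment is a made-up example.

```shell
# Print "STATUS<TAB>module" for every module whose status is not PASS.
# Reads the content of summary.txt from standard input.
non_passing_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 "\t" $2 }'
}

# Usage (hypothetical file name; unzip -p writes a ZIP member to stdout):
#   unzip -p results/sample_fastqc.zip sample_fastqc/summary.txt | non_passing_modules
```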
