What is a FASTQ File and the FASTQ Format?

Introduction

A FASTQ file is a file written in the FASTQ format, containing nucleotide sequences and their corresponding quality scores (confidence levels). It is commonly used to represent nucleotide sequences output from sequencers.

FASTQ files usually have the extension ".fastq" or ".fq". They are also frequently gzip-compressed, with extensions like ".fastq.gz" or ".fq.gz". Reads output from next-generation sequencers (NGS) often come in pairs, and the two files of a pair are commonly named "[sample_name]_1.fastq.gz" and "[sample_name]_2.fastq.gz" (or "[sample_name]_1.fq.gz" and "[sample_name]_2.fq.gz") so that they are easy to match.
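The naming convention above can be used to match paired files programmatically. The following is a minimal sketch in Python, assuming the "_1"/"_2" suffix pattern described here; the function name `group_paired_files` and the regular expression are illustrative and not part of any standard tool, and other naming schemes would need a different pattern.

```python
import re
from collections import defaultdict

# Matches names such as "sampleA_1.fastq.gz" or "sampleA_2.fq"
# (assumed convention: "<sample>_<1|2>.<fastq|fq>[.gz]").
PAIRED_NAME = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.(?:fastq|fq)(?:\.gz)?$")


def group_paired_files(filenames):
    """Group file names by sample: {sample: {1: file, 2: file}}."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = PAIRED_NAME.match(name)
        if match:  # names that do not follow the convention are skipped
            pairs[match.group("sample")][int(match.group("mate"))] = name
    return dict(pairs)
```

Calling `group_paired_files(["s_1.fastq.gz", "s_2.fastq.gz"])` groups both names under the sample `"s"`.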

Files that contain only nucleotide sequences without quality scores are called FASTA files.

FASTQ Format

In the FASTQ format, each record consists of four lines, which together describe one nucleotide sequence and its per-base quality.

Example of a FASTQ file

@SRR21484222.626.1 626 length=51
GCCTTGGTGGTGAAATGGTAGACTGGAATTCTCGGGTGCCAAGGAACTCCA
+SRR21484222.626.1 626 length=51
F:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:,FFFFFFFFF
@SRR21484222.627.1 627 length=51
TAGCGGCACCATGGAATTCTCGGGTGCCAAGGAACTCCAGTCACATCGTGA
+SRR21484222.627.1 627 length=51
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF
...

In this example, there are 8 lines, displaying information for two sequences.

The information for each line is as follows:

| Line | Contents | Example |
| --- | --- | --- |
| 1st Line | Sequence ID or description, starting with '@'. | @SRR21484222.626.1 626 length=51 |
| 2nd Line | Nucleotide sequence | GCCTTGGTGGTGAAATGGTAGACTGGAATTCTCGGGTGCCAAGGAACTCCA |
| 3rd Line | The "+" symbol is written. Sometimes the sequence ID is also included after it. | +SRR21484222.626.1 626 length=51 |
| 4th Line | Quality score | F:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:,FFFFFFFFF |
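The four-line layout can be read with a few lines of Python. The following is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences) and that the input has no trailing blank lines; the function name `read_fastq` is illustrative, and established libraries should be preferred for real analyses.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        # Take the next four lines as one record.
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return  # end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality  # drop the leading '@'
```

Because the function accepts any iterable of lines, it can be given an open text file, for example one opened with `gzip.open(path, "rt")` for a ".fastq.gz" file.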

Phred Quality Score

There are several ways to represent the quality score, but the Phred quality score is commonly used.

The Phred quality score is calculated using the following formula:

\(Q = -10 \log_{10} p_{err}\)

\(p_{err}\) is the probability that the base call is incorrect.

In other words, Q10 corresponds to a 10% probability of error, Q20 to 1%, Q30 to 0.1%, and Q40 to 0.01%.
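The correspondence between Q values and error probabilities listed above (Q10 to 10%, Q20 to 1%, and so on) can be checked numerically. The following is a small sketch in Python; the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q into an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_err):
    """Convert an error probability (0 < p_err <= 1) into a Phred quality score."""
    return -10 * math.log10(p_err)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability = {phred_to_error_prob(q):.4%}")
```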

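In a FASTQ file, the quality score of each base is represented by a single ASCII character. In the encoding shown here, the character's ASCII code is the score plus 33, so a score of 0 is written as "!".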

Below is the table showing the correspondence between characters and scores.

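| Quality Score | Character |
| --- | --- |
| 0 | ! |
| 1 | " |
| 2 | # |
| 3 | $ |
| 4 | % |
| 5 | & |
| 6 | ' |
| 7 | ( |
| 8 | ) |
| 9 | * |
| 10 | + |
| 11 | , |
| 12 | - |
| 13 | . |
| 14 | / |
| 15 | 0 |
| 16 | 1 |
| 17 | 2 |
| 18 | 3 |
| 19 | 4 |
| 20 | 5 |
| 21 | 6 |
| 22 | 7 |
| 23 | 8 |
| 24 | 9 |
| 25 | : |
| 26 | ; |
| 27 | < |
| 28 | = |
| 29 | > |
| 30 | ? |
| 31 | @ |
| 32 | A |
| 33 | B |
| 34 | C |
| 35 | D |
| 36 | E |
| 37 | F |
| 38 | G |
| 39 | H |
| 40 | I |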
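The table follows a fixed rule: each character's ASCII code is the quality score plus 33. A quality string can therefore be decoded without a lookup table. The following is a minimal sketch in Python, assuming this offset of 33; the function name `decode_quality` is illustrative.

```python
PHRED_OFFSET = 33  # assumed offset: score 0 is written as '!' (ASCII 33)


def decode_quality(quality_string, offset=PHRED_OFFSET):
    """Return the list of Phred scores encoded in a FASTQ quality string."""
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    # 'F', ':' and ',' all appear in the example record above.
    print(decode_quality("F:,"))
```

For the characters that appear in the example record, "F" decodes to 37, ":" to 25, and "," to 11, in agreement with the table.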
