FASTQ Format: Structure, Quality Scores & Examples
Introduction
A FASTQ file is a plain-text file that stores nucleotide sequences together with per-base quality scores, which indicate the confidence of each base call. It is the standard format for raw sequencer output.
Common file extensions are ".fastq" and ".fq". These files are often gzip-compressed, giving extensions such as ".fastq.gz" or ".fq.gz". Because next-generation sequencing (NGS) instruments frequently produce paired-end reads, the two files of a pair are typically named "[sample_name]_1.fastq.gz" and "[sample_name]_2.fastq.gz" (or "[sample_name]_1.fq.gz" and "[sample_name]_2.fq.gz") so that read pairs are easy to identify.
Files that contain only nucleotide sequences, without quality scores, typically use the FASTA format.
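The extension and naming conventions above can be handled with a few lines of Python. This is a minimal sketch, not a standard API: the helper names `open_fastq` and `mate_path` are hypothetical, and `mate_path` assumes the "_1" / "_2" convention described here (other pipelines use different suffixes such as "_R1" / "_R2").

```python
import gzip
from pathlib import Path

# Extensions mentioned in the text, plain and gzip-compressed.
FASTQ_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def open_fastq(path):
    """Open a FASTQ file in text mode, decompressing on the fly if it ends in .gz."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def mate_path(read1_path):
    """Return the read-2 path for a read-1 file named '<sample>_1.<ext>'.

    Assumes the '_1' / '_2' naming convention; raises ValueError otherwise.
    """
    read1_path = Path(read1_path)
    name = read1_path.name
    for ext in FASTQ_EXTENSIONS:
        suffix = "_1" + ext
        if name.endswith(suffix):
            return read1_path.with_name(name[: -len(suffix)] + "_2" + ext)
    raise ValueError(f"not a recognised read-1 file name: {name}")
```

For instance, `mate_path("sample_1.fq.gz")` yields a path named `sample_2.fq.gz`, and `open_fastq` returns a file object that can be iterated line by line whether or not the file is compressed.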
FASTQ Format
In FASTQ format, every four consecutive lines form one record, representing a single sequence and its quality information.
Example of a FASTQ File
A file with 8 lines, for example, holds two sequence records.
The four lines of each record contain the following, illustrated here with a single record:
| Line | Contents | Example |
|---|---|---|
| 1st Line | Sequence ID and description. Begins with '@'. | @SRR21484222.626.1 626 length=51 |
| 2nd Line | Nucleotide sequence | GCCTTGGTGGTGAAATGGTAGACTGGAATTCTCGGGTGCCAAGGAACTCCA |
| 3rd Line | A "+" separator. May optionally be followed by the sequence ID. | +SRR21484222.626.1 626 length=51 |
| 4th Line | Quality score | F:FFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:,FFFFFFFFF |
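The four-line record layout can be read with a small parser. The sketch below assumes strictly four lines per record with no line wrapping; the function name `read_fastq` and the two short records are made up for illustration and are not taken from a real sequencing run.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of text lines.

    Assumes each record occupies exactly four lines.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return  # clean end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, separator, qual = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual  # drop the leading '@'


# Two made-up records (8 lines in total), for illustration only.
example = """@read1 first made-up read
ACGTACGT
+
FFFF:FFF
@read2 second made-up read
GGCA
+read2
F,FF
"""

records = list(read_fastq(example.splitlines()))
print(len(records))  # 2
```

Reading in fixed groups of four, rather than searching for lines that start with "@", matters because "@" is also a valid quality character and can appear at the start of a quality line.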
Phred Quality Score
While there are several ways to encode quality, the Phred quality score is by far the most common.
The Phred quality score \(Q\) is defined by the following formula, where \(p_{err}\) is the probability that the base call is incorrect:
\[ Q = -10 \log_{10} p_{err} \]
For example, Q10 indicates a 10% error probability, Q20 means 1%, Q30 means 0.1%, and Q40 means 0.01%.
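These values follow from the standard Phred relationship \(Q = -10 \log_{10} p_{err}\), which can be checked numerically. The function names below are illustrative and not part of any particular library.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_err):
    """Convert an error probability (0 < p_err <= 1) to a Phred quality score."""
    return -10 * math.log10(p_err)


for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
```

Each increase of 10 in Q corresponds to a tenfold drop in the error probability.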
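In FASTQ files, each quality score is encoded as a single ASCII character. In the widely used Phred+33 encoding, the character's ASCII code is the quality score plus 33, so a score of 0 is written as "!" (ASCII 33).
The table below shows the mapping between quality scores and their corresponding characters in this encoding.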
| Quality Score | Character |
|---|---|
| 0 | ! |
| 1 | " |
| 2 | # |
| 3 | $ |
| 4 | % |
| 5 | & |
| 6 | ' |
| 7 | ( |
| 8 | ) |
| 9 | * |
| 10 | + |
| 11 | , |
| 12 | - |
| 13 | . |
| 14 | / |
| 15 | 0 |
| 16 | 1 |
| 17 | 2 |
| 18 | 3 |
| 19 | 4 |
| 20 | 5 |
| 21 | 6 |
| 22 | 7 |
| 23 | 8 |
| 24 | 9 |
| 25 | : |
| 26 | ; |
| 27 | < |
| 28 | = |
| 29 | > |
| 30 | ? |
| 31 | @ |
| 32 | A |
| 33 | B |
| 34 | C |
| 35 | D |
| 36 | E |
| 37 | F |
| 38 | G |
| 39 | H |
| 40 | I |
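The mapping in the table can be reproduced in code rather than looked up. The sketch below assumes an offset of 33, which is what the table implies (score 0 maps to "!", ASCII 33); files from some older Illumina pipelines used an offset of 64 instead, so the offset is left as a parameter. The function names are illustrative.

```python
PHRED_OFFSET = 33  # assumed from the table: score 0 <-> '!' (ASCII 33)


def decode_quality(qual, offset=PHRED_OFFSET):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


def encode_quality(scores, offset=PHRED_OFFSET):
    """Convert integer Phred scores back into a FASTQ quality string."""
    return "".join(chr(q + offset) for q in scores)


scores = decode_quality("F:,")
print(scores)  # [37, 25, 11]
```

The three characters used here, "F", ":" and ",", are the ones that appear in the quality line of the example record, and they decode to scores of 37, 25 and 11.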
About the Author
BxINFO LLC
A research support company specializing in bioinformatics.
We provide tools and information to support life science research, with a focus on RNA-Seq analysis.