Audience: Students, researchers, analysts, and practitioners
Theme: Evaluating sequencing read quality before downstream analysis
Introduction
The first computational stage of an RNA-Seq workflow begins with raw sequencing reads.
Before quantification, normalization, or differential expression analysis, it is important to evaluate the quality of the sequencing data. Poor-quality reads can introduce technical artifacts that affect downstream results and interpretation.
Quality control (QC) helps determine whether sequencing data are suitable for analysis and whether corrective actions may be needed.
Where This Chapter Fits
flowchart TD
A[Sequencing]
subgraph DP["Data Processing"]
B[Raw Reads]
C[Read Quality Control]
D[Read Processing & Quantification]
E[Count Matrix]
end
A --> B
B --> C --> D --> E
This chapter focuses on the transition from raw reads to quality-assessed sequencing data.
What Are Raw Reads?
RNA-Seq experiments typically deliver raw reads as FASTQ files.
Each record in a FASTQ file contains:
Read identifiers
Nucleotide sequences
Quality scores
Each base is assigned a Phred quality score that reflects the probability of an incorrect base call.
Higher quality scores indicate greater confidence in the sequencing result.
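To make the record structure concrete, here is a minimal sketch of a FASTQ reader. It assumes the common four-lines-per-record layout and the Phred+33 ASCII encoding; the names `FastqRecord` and `read_fastq` are hypothetical and not part of any library. Real files are often gzip-compressed and are better handled by an established parser.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record.

    Assumption: qualities use the Phred+33 encoding (offset=33).
    Reading stops at end of input or at the first empty header line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality_line = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality_line):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        # Each quality character encodes one Phred score: ord(char) - offset.
        qualities = [ord(ch) - offset for ch in quality_line]
        yield FastqRecord(header[1:], sequence, qualities)


# Illustrative record: 'I' is ASCII 73, so 73 - 33 = Phred 40.
example = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(read_fastq(example))
```

Each yielded record pairs every base with its own quality score, which is the input the per-base metrics later in the chapter rely on.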
Objectives of Read Quality Control
Quality control aims to answer questions such as:
Are the reads high quality?
Is quality consistent across read positions?
Are adapter sequences present?
Is the GC-content distribution unusual?
Is sequence duplication excessive?
Are there technical artifacts that require attention?
These assessments help determine whether downstream analyses can proceed confidently.
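As one illustration of how such a question can be turned into a number, the sketch below estimates the fraction of reads containing an adapter sequence by exact substring matching. The function name `adapter_fraction` is hypothetical, and the default adapter is only an example prefix often listed for Illumina universal adapters; the correct sequence depends on the library preparation kit. Dedicated QC tools use more tolerant matching than this.

```python
from typing import Iterable

# Assumption: an example 12-base adapter prefix, used here for illustration.
# Replace it with the adapter documented for your own library-prep kit.
EXAMPLE_ADAPTER = "AGATCGGAAGAG"


def adapter_fraction(sequences: Iterable[str], adapter: str = EXAMPLE_ADAPTER) -> float:
    """Return the fraction of reads containing an exact copy of the adapter."""
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq.upper():
            hits += 1
    return hits / total if total else 0.0


# Made-up reads: the first and third contain the example adapter.
reads = [
    "ACGTACGTAGATCGGAAGAGCACAC",
    "ACGTACGTACGTACGT",
    "TTTTAGATCGGAAGAGTT",
]
fraction_with_adapter = adapter_fraction(reads)
```

Exact matching misses adapters with sequencing errors or partial adapters at read ends, so this should be read as a rough lower bound rather than a definitive measurement.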
Common QC Metrics
Typical RNA-Seq QC metrics include:
Per-base sequence quality
Per-sequence quality scores
Sequence length distribution
GC-content distribution
Adapter contamination
Sequence duplication levels
Overrepresented sequences
Read counts per sample
No single metric determines success or failure. Interpretation requires considering multiple metrics together.
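Two of the metrics above, GC content and sequence duplication, are simple enough to sketch directly. The functions below are hypothetical helpers, and the duplication measure used here (one minus the ratio of distinct sequences to total sequences) is a deliberately simple definition that may differ from what a given QC tool reports.

```python
from collections import Counter
from typing import Sequence


def gc_fraction(sequence: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C; N is ignored."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def duplicate_fraction(sequences: Sequence[str]) -> float:
    """Fraction of reads that are exact repeats of an earlier read."""
    if not sequences:
        return 0.0
    distinct = len(Counter(sequences))
    return 1 - distinct / len(sequences)


# Made-up reads for illustration only.
reads = ["ACGT", "ACGT", "GGCC", "ATAT"]
gc_values = [gc_fraction(r) for r in reads]
dup = duplicate_fraction(reads)
```

In RNA-Seq, highly expressed transcripts naturally produce many identical reads, so a raised duplication value is a prompt for interpretation alongside the other metrics, not a verdict on its own.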
Per-Base Sequence Quality
Per-base quality plots show how sequencing quality changes across read positions.
A common pattern is:
High quality at the beginning of reads
Slight quality decline toward the end
Acceptable overall quality throughout most positions
Large or abrupt quality drops may indicate a technical problem and warrant closer inspection.
The goal is not perfection but confidence that sequencing quality supports downstream analyses.
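The numbers behind a per-base quality plot can be approximated with a short sketch. It assumes each read has already been decoded into a list of integer Phred scores. The function name and the threshold of 20 are illustrative assumptions, not recommended cutoffs.

```python
from typing import List, Sequence


def per_position_mean_quality(reads: Sequence[Sequence[int]]) -> List[float]:
    """Mean quality at each read position, tolerating reads of unequal length."""
    longest = max((len(r) for r in reads), default=0)
    means = []
    for pos in range(longest):
        # Only reads long enough to cover this position contribute.
        column = [r[pos] for r in reads if len(r) > pos]
        means.append(sum(column) / len(column))
    return means


# Made-up quality scores showing a decline toward the 3' end.
reads = [
    [38, 37, 30, 20],
    [40, 39, 32],
    [36, 38, 28, 12],
]
profile = per_position_mean_quality(reads)

# Assumption: 20 is used purely as an example threshold.
low_positions = [i for i, q in enumerate(profile) if q < 20]
```

QC tools usually report medians and quantile ranges per position rather than means alone, since a mean can hide a subset of very poor reads.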
Phred Quality Scores
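Quality scores are commonly represented as Phred scores, defined as Q = -10 × log10(P), where P is the estimated probability that a base call is incorrect. A score of 20 corresponds to an error probability of 1 in 100, and a score of 30 to 1 in 1,000.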
The goal is to understand data quality before generating expression measurements.
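The relationship between a Phred score and its error probability is logarithmic, so small differences in score correspond to large differences in error rate. A minimal sketch of the conversion in both directions follows; the function names are hypothetical.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Every 10-point increase in Q divides the error probability by 10.
p20 = phred_to_error_probability(20)  # about 1 error in 100 base calls
p30 = phred_to_error_probability(30)  # about 1 error in 1,000 base calls
q_from_p = error_probability_to_phred(0.001)
```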
QC Does Not Mean Automatic Filtering
Quality control is primarily an assessment process.
A warning does not automatically mean:
The sample should be removed.
The sequencing run failed.
The study is invalid.
QC findings should be interpreted in the context of:
Study objectives
Organism
Library preparation protocol
Sequencing platform
Downstream analysis plans
Context matters.
Common Mistakes
Common QC mistakes include:
Ignoring quality reports entirely
Treating every warning as a failure
Removing samples without justification
Evaluating only one QC metric
Ignoring metadata when interpreting QC results
Assuming all datasets should look identical
QC is a process of evidence-based evaluation rather than rule-based filtering.
QC Checklist
Before moving to quantification, confirm that:
Quality reports have been reviewed.
Sequencing depth is reasonable.
Adapter contamination has been assessed.
Outlier samples have been investigated.
Metadata have been consulted.
Major technical issues are documented.
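Parts of this checklist, such as checking sequencing depth and spotting outlier samples, can be supported by a simple screen. The sketch below flags samples whose read count falls well below the median across samples. The function name, the sample names, and the cutoff of half the median are all assumptions chosen for illustration. A flagged sample is a candidate for investigation, not for automatic removal.

```python
from statistics import median
from typing import Dict, List


def flag_low_depth_samples(
    read_counts: Dict[str, int],
    min_fraction_of_median: float = 0.5,  # assumption: illustrative cutoff
) -> List[str]:
    """Return names of samples with read counts far below the cohort median."""
    if not read_counts:
        return []
    typical = median(read_counts.values())
    cutoff = min_fraction_of_median * typical
    return sorted(name for name, n in read_counts.items() if n < cutoff)


# Made-up read counts per sample.
counts = {
    "S1": 25_000_000,
    "S2": 27_000_000,
    "S3": 9_000_000,
    "S4": 26_000_000,
}
to_investigate = flag_low_depth_samples(counts)
```

Whether a given depth is "reasonable" still depends on the organism, the library protocol, and the planned analysis, so the cutoff should be set with the study context in mind.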
Key Takeaway
Raw read quality control is the first computational checkpoint in the RNA-Seq workflow.
Its purpose is to understand the quality of the sequencing data, identify potential technical issues, and provide confidence that downstream analyses are based on reliable input data.
What Comes Next
The next chapter focuses on read processing and quantification, where sequencing reads are converted into quantitative expression measurements.