Audience: Students, biologists, bioinformaticians, data scientists, researchers, and practitioners
Theme: Making expression measurements comparable and exploring data structure before formal modeling
Introduction
After count quality assessment and filtering, the RNA-Seq workflow transitions from data preparation to data understanding.
At this stage, expression measurements must be normalized to account for technical differences between samples. Once normalized, exploratory analyses help reveal patterns, relationships, potential outliers, and sources of variation that may influence downstream statistical modeling.
The goal is not yet to identify differentially expressed genes. The goal is to understand the structure of the dataset before formal inference begins.
Where This Chapter Fits
flowchart TD
A[Filtered Count Matrix]
subgraph EA["Expression Analysis"]
B[Normalization & Exploratory Analysis]
C[Differential Expression Analysis]
end
A --> B --> C
This chapter transforms filtered counts into comparable expression measurements and explores the overall structure of the dataset.
Why Normalization Matters
RNA-Seq samples often differ in sequencing depth and library composition.
Consider two samples:
Sample     Total Reads
Sample1    20,000,000
Sample2    40,000,000
Even if the underlying biology is identical, Sample2 may contain roughly twice as many observed counts simply because more reads were sequenced.
Without normalization, direct comparisons of raw counts between samples confound sequencing depth with expression level and may be misleading.
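The arithmetic behind this can be sketched in a few lines of Python. Only the two library sizes come from the example above; the gene and its abundance (25 reads per million) are hypothetical values chosen for illustration.

```python
# Library sizes from the two-sample example above.
total_reads = {"Sample1": 20_000_000, "Sample2": 40_000_000}

# Hypothetical gene that makes up the same share of both libraries
# (25 reads per million sequenced). This value is an assumption.
GENE_READS_PER_MILLION = 25

# Raw counts scale with sequencing depth even though the biology is identical.
raw_counts = {
    sample: GENE_READS_PER_MILLION * total // 1_000_000
    for sample, total in total_reads.items()
}

# Rescaling by library size (per million reads) removes the depth effect.
per_million = {
    sample: raw_counts[sample] * 1_000_000 / total_reads[sample]
    for sample in total_reads
}

print(raw_counts)   # {'Sample1': 500, 'Sample2': 1000}
print(per_million)  # {'Sample1': 25.0, 'Sample2': 25.0}
```

The raw counts differ twofold, while the depth-scaled values agree, which is the simplest form of the correction that normalization provides.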
Objectives of Normalization
Normalization aims to:
Reduce technical variation
Improve comparability across samples
Preserve biological differences
Support reliable statistical inference
Normalization does not remove all variation. It provides a better foundation for downstream analyses.
Common Normalization Approaches
RNA-Seq workflows commonly use:
DESeq2 size-factor normalization
TMM normalization (edgeR)
Counts per million (CPM)
Variance stabilizing transformations
Regularized log transformations
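Different methods serve different analytical purposes: DESeq2 size factors and TMM are used within count-based statistical models, CPM offers a simple depth-scaled measure, and variance stabilizing and regularized log transformations are typically used for visualization and exploratory analysis.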
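As one illustration of why the choice of method matters, the sketch below computes CPM on a small, entirely hypothetical dataset in which both libraries have the same depth but one gene dominates the second sample. The helper name `cpm` and all counts are assumptions made for this example, not part of any library.

```python
def cpm(counts):
    """Scale each sample's counts to counts per million (CPM).

    counts maps a sample name to a list of raw gene counts,
    with genes in the same order for every sample.
    """
    scaled = {}
    for sample, values in counts.items():
        library_size = sum(values)
        scaled[sample] = [c * 1_000_000 / library_size for c in values]
    return scaled


# Hypothetical data: both libraries contain 1,000,000 reads.
# Genes g1-g3 have the same true abundance in both samples;
# g4 is strongly induced in sample B and absorbs most of B's reads.
genes = ["g1", "g2", "g3", "g4"]
counts = {
    "A": [100_000, 200_000, 300_000, 400_000],
    "B": [20_000, 40_000, 60_000, 880_000],
}

cpm_values = cpm(counts)
for i, gene in enumerate(genes):
    print(gene, cpm_values["A"][i], cpm_values["B"][i])
```

In this constructed case, CPM reports g1 to g3 as five times lower in sample B even though only g4 changed. This is the kind of library composition effect that size-factor and TMM approaches are designed to reduce.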
DESeq2 Size Factors
DESeq2 estimates sample-specific size factors that account for differences in sequencing depth and library composition.
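The size factors estimated at this stage carry into formal statistical modeling, where DESeq2 fits the raw counts and uses the size factors to account for these technical differences.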
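The median-of-ratios idea behind size factors can be sketched as follows. DESeq2 itself is an R/Bioconductor package, so this Python version is a simplified illustration under stated assumptions rather than its implementation. The function name `median_of_ratios_size_factors` and the counts are hypothetical.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample with a median-of-ratios sketch.

    counts maps a sample name to a list of raw gene counts,
    with genes in the same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log geometric mean of each gene across samples.
    # Genes with a zero in any sample are skipped (marked None).
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)

    # Size factor = median ratio of a sample's counts to the gene-wise
    # geometric means (computed on the log scale).
    # Note: median() raises an error if every gene contains a zero.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Hypothetical data: sample B is sequenced twice as deeply as A,
# the fourth gene is induced in B, and the fifth gene has a zero in A.
counts = {
    "A": [100, 200, 300, 400, 0],
    "B": [200, 400, 600, 8800, 10],
}

factors = median_of_ratios_size_factors(counts)
normalized = {s: [c / factors[s] for c in counts[s]] for s in counts}
print(factors)
print(normalized)
```

Because the median ignores the single strongly induced gene, the estimated factors differ by the twofold depth difference, and the first three genes receive matching normalized values in both samples.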
Key Takeaway
Normalization improves comparability across samples, while exploratory analysis helps reveal the structure of the dataset.
Together, these steps create a strong foundation for differential expression analysis and help ensure that downstream biological conclusions are based on well-understood data.
What Comes Next
The next chapter focuses on differential expression analysis, where counts are formally modeled, with normalization taken into account, to identify genes associated with the biological question.