Audience: Students, biologists, bioinformaticians, data scientists, researchers, and practitioners
Theme: Identifying genes associated with biological conditions through statistical modeling
Introduction
After normalization and exploratory analysis, the next stage of the RNA-Seq workflow is formal statistical inference.
The objective of differential expression analysis is to determine whether observed differences in expression between conditions are larger than would be expected from biological and technical variability alone.
Rather than simply comparing counts, differential expression analysis uses statistical models to evaluate evidence while accounting for biological variability and experimental design.
Where This Chapter Fits
flowchart TD
A[Normalized Expression Data]
subgraph EA["Expression Analysis"]
B[Differential Expression Analysis]
C[Results QC & Visualization]
end
A --> B --> C
This chapter covers the formal statistical modeling stage of the RNA-Seq workflow.
Biological Questions
Differential expression analysis is often used to address questions such as:
Which genes respond to treatment?
Which genes differ between disease and healthy samples?
Which pathways may be activated or suppressed?
Which genes are associated with experimental conditions?
The goal is to identify expression differences supported by statistical evidence.
Differential Expression Concepts
Differential expression analysis evaluates whether observed expression differences are larger than expected by chance.
This requires:
Expression measurements
Biological replication
Variability estimation
Statistical modeling
Reliable inference depends on all four components.
Statistical Modeling
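RNA-Seq counts are typically modeled with methods designed for count data, whose variance grows with the mean and usually exceeds what a Poisson model would predict (overdispersion).
Popular approaches include:
DESeq2 (negative binomial generalized linear model)
edgeR (negative binomial generalized linear model)
limma-voom (linear model on log-CPM values with precision weights)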
These methods estimate variation across biological replicates and evaluate evidence for differential expression.
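To illustrate why replicate-to-replicate variation matters for the choice of model, the following base R sketch simulates counts for a single hypothetical gene under a Poisson model and under a negative binomial model. The mean (100) and dispersion (0.2) are assumed values chosen only for illustration; they are not taken from any real dataset.

```r
# Minimal sketch: compare variance-to-mean ratios for Poisson and
# negative binomial counts. All parameter values are assumptions.
set.seed(1)

mu <- 100          # assumed mean count for one gene
dispersion <- 0.2  # assumed NB dispersion: variance = mu + dispersion * mu^2
n <- 10000         # number of simulated draws

pois_counts <- rpois(n, lambda = mu)
nb_counts <- rnbinom(n, mu = mu, size = 1 / dispersion)

# Under a Poisson model the variance is about equal to the mean.
pois_ratio <- var(pois_counts) / mean(pois_counts)

# Under a negative binomial model the variance is much larger than the mean.
nb_ratio <- var(nb_counts) / mean(nb_counts)

print(c(poisson = pois_ratio, negative_binomial = nb_ratio))
```

The Poisson ratio should be close to 1, while the negative binomial ratio should be far larger. A model that ignores this extra variability would tend to overstate the evidence for differential expression.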
Design Formula
Differential expression models depend on the study design.
A simple design formula might be:
~ condition
A model accounting for batch effects might be:
~ batch + condition
The design formula links the statistical model to the experimental design and metadata.
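One way to see what a design formula does is to expand it into a design matrix with base R's model.matrix(). The sample table below is hypothetical (eight samples, two batches, two conditions) and exists only to make the sketch self-contained.

```r
# Minimal sketch: expand design formulas into design matrices.
# The metadata below is invented for illustration.
sample_info <- data.frame(
  sample = paste0("S", 1:8),
  batch = factor(rep(c("b1", "b2"), each = 4)),
  condition = factor(rep(c("control", "treated"), times = 4),
                     levels = c("control", "treated"))
)

# ~ condition: an intercept plus one coefficient for treated vs control
design_simple <- model.matrix(~ condition, data = sample_info)

# ~ batch + condition: adds a coefficient that absorbs the batch difference
design_batch <- model.matrix(~ batch + condition, data = sample_info)

print(colnames(design_simple))
print(colnames(design_batch))
```

Each column of the design matrix corresponds to one coefficient the model estimates. Setting "control" as the first factor level makes it the reference, so the condition coefficient describes treated relative to control.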
Example DESeq2 Workflow
# dds is a DESeqDataSet holding raw counts, sample metadata, and the design formula
dds <- DESeq2::DESeq(dds)
Differential expression results can then be extracted.
# By default, results() reports the comparison for the last variable in the design formula
results_tbl <- DESeq2::results(dds)
The resulting table contains statistics used for interpretation.
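For readers who want to run the two calls above end to end, here is a hedged sketch that builds a DESeqDataSet from simulated data. The count matrix, gene names, sample names, and parameter values are all invented for illustration, and the code only runs if the DESeq2 package is installed. Note that DESeq2 takes un-normalized integer counts as input and estimates size factors itself inside DESeq().

```r
# Minimal sketch, assuming DESeq2 is installed. All data are simulated.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  set.seed(42)
  n_genes <- 200
  n_samples <- 6

  # Simulated raw counts: per-gene means vary, dispersion assumed to be 0.2
  gene_means <- 2^runif(n_genes, min = 4, max = 10)
  counts <- matrix(
    rnbinom(n_genes * n_samples, mu = gene_means, size = 5),
    nrow = n_genes,
    dimnames = list(paste0("gene", seq_len(n_genes)),
                    paste0("S", seq_len(n_samples)))
  )

  # Sample metadata; row names must match the count matrix column names
  col_data <- data.frame(
    condition = factor(rep(c("control", "treated"), each = 3),
                       levels = c("control", "treated")),
    row.names = colnames(counts)
  )

  dds <- DESeq2::DESeqDataSetFromMatrix(
    countData = counts,
    colData = col_data,
    design = ~ condition
  )

  dds <- DESeq2::DESeq(dds)

  # Name the comparison explicitly: treated relative to control
  results_tbl <- DESeq2::results(
    dds, contrast = c("condition", "treated", "control")
  )
  print(head(as.data.frame(results_tbl)))
}
```

Because the simulated data contain no real condition effect, few or no genes should come out as significant here. In DESeq2's own output the fold change column is named log2FoldChange.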
Typical Output
A differential expression table often contains:
| Gene  | log2FC | pvalue | padj  |
|-------|--------|--------|-------|
| GeneA | 2.1    | 0.0001 | 0.001 |
| GeneB | -1.8   | 0.0008 | 0.004 |
| GeneC | 0.3    | 0.7200 | 0.880 |
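Each column carries different information: log2FC is the estimated effect size, pvalue is the unadjusted evidence against no change, and padj is the p-value after adjustment for multiple testing.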
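As a small illustration of how such a table is typically used, the sketch below rebuilds the three example rows in base R and labels each gene. The cutoffs (padj below 0.05 and an absolute log2 fold change of at least 1) are common conventions assumed here for illustration, not values prescribed by this chapter.

```r
# Minimal sketch: classify genes from the example table.
# Thresholds are assumed conventions, not fixed rules.
results_df <- data.frame(
  gene = c("GeneA", "GeneB", "GeneC"),
  log2FC = c(2.1, -1.8, 0.3),
  pvalue = c(0.0001, 0.0008, 0.7200),
  padj = c(0.001, 0.004, 0.880)
)

padj_cutoff <- 0.05  # assumed false discovery rate threshold
lfc_cutoff <- 1      # assumed minimum absolute log2 fold change

is_hit <- results_df$padj < padj_cutoff &
  abs(results_df$log2FC) >= lfc_cutoff

results_df$direction <- ifelse(
  is_hit,
  ifelse(results_df$log2FC > 0, "up", "down"),
  "not significant"
)

print(results_df)
```

With these cutoffs GeneA is labeled "up", GeneB "down", and GeneC "not significant".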
Log2 Fold Change
The log2 fold change describes the magnitude and direction of expression differences.
Examples:
| log2FC | Interpretation                 |
|--------|--------------------------------|
| +1     | Expression doubled             |
| +2     | Expression increased four-fold |
| -1     | Expression reduced by half     |
| 0      | No change                      |
Fold changes describe effect size rather than statistical significance.
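The arithmetic behind the log2 scale can be checked directly in base R. The mean expression values below are hypothetical, and the calculation is a naive ratio of means; tools such as DESeq2 estimate fold changes from the fitted model instead.

```r
# Minimal sketch: convert between fold change and log2 fold change.
# The mean values are invented for illustration.
mean_control <- 50
mean_treated <- 200

# A four-fold increase corresponds to a log2 fold change of 2
log2fc <- log2(mean_treated / mean_control)

# Converting log2 fold changes back to plain fold changes
example_log2fc <- c(1, 2, -1, 0)
fold_change <- 2^example_log2fc  # 2, 4, 0.5, 1

print(log2fc)
print(data.frame(log2FC = example_log2fc, fold_change = fold_change))
```

The log2 scale is symmetric around zero, so a doubling (+1) and a halving (-1) have the same magnitude with opposite signs.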
P-values
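A p-value measures evidence against the null hypothesis, which for each gene is usually that expression does not differ between conditions (a log2 fold change of zero).
A small p-value means that a difference at least as large as the one observed would be unlikely if the null hypothesis were true.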
However, RNA-Seq studies test thousands of genes simultaneously, which creates a multiple-testing problem.
Multiple Testing
RNA-Seq experiments often evaluate thousands of genes.
For example:
20,000 genes tested
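Even if no true differences exist, about 5% of them (roughly 1,000 genes) would be expected to have p < 0.05 by chance alone.
Adjusted p-values, such as the padj column reported by DESeq2, address this by controlling the false discovery rate (by default with the Benjamini-Hochberg procedure).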
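The scale of the problem can be sketched with a short base R simulation. It assumes 20,000 independent genes with no true differences, so every p-value is drawn from a uniform distribution, and it applies the Benjamini-Hochberg adjustment through p.adjust(). The 0.05 threshold is an assumed convention.

```r
# Minimal sketch: false positives under the null, before and after
# Benjamini-Hochberg adjustment. Simulated p-values only.
set.seed(123)

n_genes <- 20000
alpha <- 0.05  # assumed significance threshold

# With no true differences, p-values are uniform on [0, 1]
null_pvalues <- runif(n_genes)

# Unadjusted: roughly n_genes * alpha genes fall below the threshold
n_raw <- sum(null_pvalues < alpha)

# Adjusted: BH controls the expected proportion of false discoveries
padj_bh <- p.adjust(null_pvalues, method = "BH")
n_bh <- sum(padj_bh < alpha)

print(c(expected_by_chance = n_genes * alpha,
        unadjusted_hits = n_raw,
        bh_adjusted_hits = n_bh))
```

The unadjusted count should land near 1,000, while the adjusted count should be far smaller, and zero in most runs. This is why the adjusted p-value rather than the raw p-value is normally used to call genes significant.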
The next stage focuses on evaluating and communicating these results.
Key Takeaway
Differential expression analysis transforms normalized expression measurements into statistical evidence.
By combining study design, biological replication, normalization, and statistical modeling, researchers can identify genes associated with biological conditions while accounting for variability and uncertainty.
What Comes Next
The next chapter focuses on results quality assessment and visualization, where statistical findings are evaluated, summarized, and prepared for biological interpretation.