Q&A 7 How do you log-transform RNA-Seq counts for PCA or clustering using R?
7.1 Explanation
Raw RNA-Seq counts are:
- Not normally distributed
- Heteroscedastic (variance increases with mean)
- Dominated by a few highly expressed genes
These properties make them unsuitable for PCA, clustering, or heatmaps without transformation.
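The mean-variance dependence is easy to see on simulated data. The sketch below draws negative binomial counts for three hypothetical genes and compares per-gene variance before and after `log2(count + 1)`. The means and the dispersion (`size = 10`) are arbitrary illustrative values, not estimates from a real dataset.

```r
# Simulate counts for three hypothetical genes with low, medium, and high expression.
# The means and dispersion (size) are arbitrary values chosen for illustration.
set.seed(42)
gene_means <- c(low = 10, medium = 100, high = 10000)
sim_counts <- sapply(gene_means, function(mu) rnbinom(n = 200, mu = mu, size = 10))

# Variance on the raw scale grows steeply with the mean
raw_var <- apply(sim_counts, 2, var)

# After log2(count + 1), the variances are far closer to one another
log_var <- apply(log2(sim_counts + 1), 2, var)

print(signif(raw_var, 3))
print(signif(log_var, 3))
```

On the log scale the low-count gene typically keeps a somewhat larger variance than the others, which is the residual effect that rlog() and vst() are designed to damp.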
To correct this, we apply a log transformation to stabilize variance:
- log2(count + 1): simple and fast
- rlog(): regularized log transformation (DESeq2), ideal for small sample sizes
- vst(): variance-stabilizing transformation, faster for large datasets
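As a rough side-by-side of the three options, the sketch below applies each to a small simulated dataset. It uses DESeq2's `makeExampleDESeqDataSet()` as a stand-in for real data. The gene and sample numbers are arbitrary, and `nsub = 200` is an assumption made so that `vst()` runs on such a small matrix (its default of 1000 expects a larger dataset).

```r
library(DESeq2)

# Simulated stand-in for real data: 1000 genes, 6 samples (arbitrary sizes)
set.seed(1)
dds_demo <- makeExampleDESeqDataSet(n = 1000, m = 6)

# Option 1: plain log2 with a pseudocount of 1 (no library-size normalization)
log2_mat <- log2(counts(dds_demo) + 1)

# Option 2: regularized log; blind = TRUE ignores the design formula
rlog_mat_demo <- assay(rlog(dds_demo, blind = TRUE))

# Option 3: variance-stabilizing transformation; nsub lowered for this small dataset
vst_mat_demo <- assay(vst(dds_demo, blind = TRUE, nsub = 200))

# All three return a genes x samples matrix on a log2-like scale
dim(log2_mat)
dim(rlog_mat_demo)
dim(vst_mat_demo)
```

Unlike the plain log2 option, rlog() and vst() also account for differences in sequencing depth between samples, so their values are comparable across columns without a separate normalization step.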
The code below uses rlog() and saves the transformed matrix so that downstream visualizations can reuse it without recomputing the transformation.
7.2 R Code
library(tidyverse)
library(DESeq2)
# 📊 Load count data and metadata
counts <- read_csv("data/demo_counts.csv") |>
  column_to_rownames("gene") |>
  as.matrix()
metadata <- read_csv("data/demo_metadata.csv")
# 📦 Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~ condition)
# 🔄 Transform (rlog() defaults to blind = TRUE, so the design formula is not used here)
rlog_dds <- rlog(dds)
# 🧬 Extract transformed matrix
rlog_mat <- assay(rlog_dds) |>
  as.data.frame() |>
  rownames_to_column("gene")
# 💾 Save for reuse
write_csv(rlog_mat, "data/rlog_matrix.csv")
✅ Takeaway: Log transformations, especially rlog(), stabilize variance and prepare RNA-Seq data for PCA, clustering, and heatmaps. Saving the transformed matrix improves reproducibility.
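To show how the saved file might feed a downstream step, here is a minimal PCA sketch in base R. To stay self-contained it writes a small made-up matrix to a temporary file in the same gene-by-sample CSV layout as data/rlog_matrix.csv. The values, gene and sample names, and the temporary path are placeholders, not real data.

```r
# Stand-in for data/rlog_matrix.csv: a made-up 50-gene x 6-sample matrix
set.seed(7)
toy <- matrix(rnorm(50 * 6, mean = 8, sd = 2), nrow = 50,
              dimnames = list(paste0("gene", 1:50), paste0("sample", 1:6)))
toy_file <- tempfile(fileext = ".csv")
write.csv(data.frame(gene = rownames(toy), toy), toy_file, row.names = FALSE)

# Read the matrix back, restoring gene IDs as row names
expr <- as.matrix(read.csv(toy_file, row.names = "gene"))

# prcomp() expects observations in rows, so transpose to samples x genes
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 3)

# Sample coordinates on the first two components, ready for plotting
head(pca$x[, 1:2])
```

With real data, replacing `toy_file` with the path of the saved rlog matrix should be the only change needed, assuming the file keeps a `gene` column followed by one column per sample.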