Q&A 7 How do you log-transform RNA-Seq counts for PCA or clustering using R?

7.1 Explanation

Raw RNA-Seq counts are:

  • Not normally distributed
  • Heteroscedastic (variance increases with mean)
  • Influenced by a few highly expressed genes

These properties make them unsuitable for PCA, clustering, or heatmaps without transformation.
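
The mean-variance relationship can be seen directly on simulated data. The sketch below uses base R only and is an illustration rather than part of the workflow: the negative binomial parameters (`size = 10`, means between 2^2 and 2^14) are arbitrary assumptions chosen to mimic count data.

```r
set.seed(1)

# Simulate 2000 genes x 6 samples of negative binomial counts.
# Gene means span several orders of magnitude; size = 10 is an arbitrary choice.
n_genes <- 2000
n_samples <- 6
gene_means <- 2^runif(n_genes, min = 2, max = 14)
sim_counts <- matrix(
  rnbinom(n_genes * n_samples, mu = rep(gene_means, times = n_samples), size = 10),
  nrow = n_genes
)

# Per-gene mean and variance on the raw count scale
raw_mean <- rowMeans(sim_counts)
raw_var <- apply(sim_counts, 1, var)

# Per-gene variance after log2(count + 1)
log_var <- apply(log2(sim_counts + 1), 1, var)

# Rank correlation between mean expression and variance, before and after
cor_raw <- cor(raw_mean, raw_var, method = "spearman")
cor_log <- cor(raw_mean, log_var, method = "spearman")
c(raw = cor_raw, log2 = cor_log)
```

On the raw scale the correlation should be strongly positive, while after the log transformation the dependence of variance on the mean is much weaker.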

To correct this, we apply a log or log-like transformation that compresses the dynamic range and reduces the dependence of variance on the mean:

  • log2(count + 1): simple and fast, but low-count genes remain noisy
  • rlog(): regularized log transformation (DESeq2), well suited to small sample sizes (roughly fewer than 30 samples)
  • vst(): variance-stabilizing transformation (DESeq2), much faster on large datasets
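
To compare the three options side by side, here is a minimal sketch on simulated data. It assumes DESeq2 is installed and uses its `makeExampleDESeqDataSet()` helper as a stand-in for a real experiment; the object names (`dds_demo`, `log2_mat`, and so on) and the low-count cutoff of 10 are illustrative choices only.

```r
library(DESeq2)

set.seed(1)
# Simulated stand-in for a real experiment: 2000 genes, 6 samples, two conditions
dds_demo <- makeExampleDESeqDataSet(n = 2000, m = 6)

# Option 1: plain log2(count + 1) on the raw counts
log2_mat <- log2(counts(dds_demo) + 1)

# Option 2: regularized log transformation
rlog_mat_demo <- assay(rlog(dds_demo, blind = TRUE))

# Option 3: variance-stabilizing transformation
vst_mat_demo <- assay(vst(dds_demo, blind = TRUE))

# Median per-gene standard deviation among low-count genes, per method
low <- rowMeans(counts(dds_demo)) < 10
sapply(
  list(log2 = log2_mat, rlog = rlog_mat_demo, vst = vst_mat_demo),
  function(m) median(apply(m[low, ], 1, sd))
)
```

All three matrices have the same genes-by-samples shape, so any of them can be passed to the same PCA or clustering code. For a quick look, DESeq2's `plotPCA()` also accepts the transformed object directly, for example `plotPCA(rlog(dds_demo), intgroup = "condition")`.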

We save the rlog matrix so it can be reused by downstream visualizations.

7.2 R Code

library(tidyverse)
library(DESeq2)

# 📊 Load count data and metadata
counts <- read_csv("data/demo_counts.csv") |>
  column_to_rownames("gene") |>
  as.matrix()

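metadata <- read_csv("data/demo_metadata.csv") |>
  mutate(condition = factor(condition))  # design variables should be factors

# DESeq2 expects one metadata row per count column, in the same order;
# this only checks that the numbers agree
stopifnot(ncol(counts) == nrow(metadata))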

# 📦 Create DESeq2 object
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = metadata,
                              design = ~ condition)

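# 🔄 Transform (blind = TRUE is the default: the design is ignored,
# which suits exploratory PCA and clustering)
rlog_dds <- rlog(dds, blind = TRUE)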

# 🧬 Extract transformed matrix
rlog_mat <- assay(rlog_dds) |>
  as.data.frame() |>
  rownames_to_column("gene")

# 💾 Save for reuse
write_csv(rlog_mat, "data/rlog_matrix.csv")
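
A downstream script can then reload the saved table and run PCA or clustering on it. The sketch below is one possible way to do this in base R, not a fixed recipe: it builds a small random table with the same layout (a `gene` column plus one column per sample) and writes it to a temporary file, so the file path, the sample names, and the cutoff of 100 most variable genes are all hypothetical.

```r
set.seed(1)

# Hypothetical stand-in for the saved table: "gene" column plus one column per sample
demo_tbl <- data.frame(
  gene = paste0("gene", 1:200),
  matrix(rnorm(200 * 6, mean = 8), nrow = 200,
         dimnames = list(NULL, paste0("sample", 1:6)))
)

# Round-trip through CSV, as a downstream script would
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_tbl, csv_path, row.names = FALSE)
reloaded <- read.csv(csv_path)

# Rebuild the numeric genes x samples matrix
expr_mat <- as.matrix(reloaded[, -1])
rownames(expr_mat) <- reloaded$gene

# Keep the 100 most variable genes, then run PCA with samples as rows
gene_var <- apply(expr_mat, 1, var)
top_genes <- order(gene_var, decreasing = TRUE)[1:100]
pca <- prcomp(t(expr_mat[top_genes, ]), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Hierarchical clustering of samples on Euclidean distances
sample_tree <- hclust(dist(t(expr_mat)))
```

Sample coordinates for plotting are in `pca$x`, and `plot(sample_tree)` draws the dendrogram.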

Takeaway: Log-type transformations, especially rlog(), stabilize variance and prepare RNA-Seq data for PCA, clustering, and heatmaps. Saving the transformed matrix lets downstream analyses reuse the same values, which improves reproducibility.