Q&A 9 How do you visualize RNA-Seq samples using PCA in R?
9.1 Explanation
Principal Component Analysis (PCA) reduces the dimensionality of high-throughput data such as RNA-Seq by projecting samples onto a small number of orthogonal axes (principal components), ordered by how much of the total variance each one captures. PCA is useful for:
- Detecting sample outliers
- Checking for batch effects
- Visualizing group separation
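We run PCA on the rlog-transformed matrix (rlog_matrix.csv) rather than on raw counts. The rlog transformation stabilizes variance across the range of expression levels, so a handful of highly expressed genes does not dominate the components and the resulting plot is easier to interpret.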
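As a rough illustration of what PCA does with samples in rows and genes in columns, the sketch below runs prcomp() on a small simulated matrix. All values and names (expr, group, the sample and gene labels) are invented for this example and are not taken from the dataset used in this chapter.

```r
# Toy expression matrix: 6 samples (rows) x 50 genes (columns).
# All values are simulated for illustration only.
set.seed(1)
n_genes <- 50
group <- rep(c("control", "treated"), each = 3)
expr <- matrix(
  rnorm(6 * n_genes),
  nrow = 6,
  dimnames = list(paste0("S", 1:6), paste0("gene", 1:n_genes))
)

# Shift the first 10 genes in the treated samples to create a group effect
expr[group == "treated", 1:10] <- expr[group == "treated", 1:10] + 4

# PCA on centered data; rows are samples, columns are genes
pca <- prcomp(expr, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# PC1 scores per group: the simulated group effect should dominate PC1
pc1_by_group <- split(pca$x[, "PC1"], group)
print(pc1_by_group)
```

In this toy setup the group shift is the largest source of variation, so the two groups should fall on opposite sides of PC1. With real data, whether condition, batch, or something else dominates PC1 is exactly what the plot is meant to reveal.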
9.2 R Code
library(tidyverse)
# Load rlog-transformed matrix
rlog_mat <- read_csv("data/rlog_matrix.csv")
# Prepare PCA input: prcomp() expects samples as rows and genes as columns
pca_input <- rlog_mat |>
  column_to_rownames("gene") |>
  t() |>
  as.data.frame()
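# Run PCA
# Drop zero-variance genes first: prcomp() cannot scale constant columns
pca_input <- pca_input[, apply(pca_input, 2, var) > 0, drop = FALSE]
pca_res <- prcomp(pca_input, center = TRUE, scale. = TRUE)
# Row order is preserved, so sample names can be re-attached directly
pca_df <- as_tibble(pca_res$x) |>
  mutate(Sample = rownames(pca_input))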
# Join with metadata
metadata <- read_csv("data/demo_metadata.csv")
plot_df <- left_join(pca_df, metadata, by = "Sample")
# Plot
ggplot(plot_df, aes(x = PC1, y = PC2, color = condition)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(title = "PCA of RNA-Seq Samples",
       x = paste0("PC1 (", round(summary(pca_res)$importance[2, 1] * 100, 1), "%)"),
       y = paste0("PC2 (", round(summary(pca_res)$importance[2, 2] * 100, 1), "%)")) +
  theme_minimal()
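The axis labels above read the proportion of variance from summary(pca_res)$importance. The same quantity can be computed from the component standard deviations, which makes it easy to look beyond PC1 and PC2. The sketch below uses a hypothetical helper, variance_explained(), that is not part of the original code, and runs it on simulated data so that it works on its own.

```r
# Hypothetical helper (not part of the original code): tabulate the variance
# explained by every principal component of a prcomp() result.
variance_explained <- function(pca) {
  prop <- pca$sdev^2 / sum(pca$sdev^2)
  data.frame(
    PC = paste0("PC", seq_along(prop)),
    proportion = prop,
    cumulative = cumsum(prop)
  )
}

# Demo on simulated data (8 samples x 30 genes) so the block is self-contained;
# in the workflow above you would pass pca_res instead of toy_pca.
set.seed(42)
toy <- matrix(rnorm(8 * 30), nrow = 8)
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

ve <- variance_explained(toy_pca)
print(head(ve, 3))
```

If the first two components capture only a small share of the variance, group separation on a PC1 vs. PC2 plot should be read with caution. The proportion column can also be passed to barplot() for a quick scree plot.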
Takeaway: PCA on rlog-transformed RNA-Seq data helps visualize sample similarity, spot outliers, and check whether the experimental conditions, rather than batch or other technical factors, drive the major sources of variation.