Lesson 5 Normalization and Exploratory Data Analysis (Computation)

CDI goal: compute normalized and transformed representations for exploratory analysis, and save derived tables for visualization in Lesson 05b.

5.1 Learning outcomes

By the end of this lesson, you will be able to:

  • Explain why normalization precedes exploratory analysis
  • Understand the purpose of rlog transformation
  • Prepare transformed matrices for visualization
  • Save EDA-ready outputs for reuse (a/b pattern)

5.2 Why normalization before EDA

Raw RNA-Seq counts are dominated by sequencing depth and composition effects. Exploratory analysis on raw counts often reveals technical structure rather than biology.

In CDI, we:

  • Normalize counts to adjust for technical effects
  • Transform values to stabilize variance
  • Use transformed values only for exploration, not modeling

5.3 Load required data

library(tidyverse)

counts_raw <- readr::read_csv("data/demo_counts.csv")
meta <- readr::read_csv("data/demo_metadata.csv")

5.4 Prepare count matrix

gene_ids <- counts_raw[[1]]

counts_mat <- counts_raw |>
  dplyr::select(-1) |>
  as.matrix()

storage.mode(counts_mat) <- "numeric"

5.5 Load rlog-transformed matrix

For this guide, we use a precomputed rlog matrix to focus on interpretation rather than computation cost.

rlog_mat <- readr::read_csv("data/rlog_matrix.csv")

rlog_mat
# A tibble: 1,000 × 15
   gene  Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Gene1    7.05    6.87    6.96    5.72    5.76    5.87    5.57    3.63    5.26
 2 Gene2    5.72    6.40    5.87    5.73    7.58    6.08    5.54    5.82    6.48
 3 Gene3    5.54    4.55    7.22    6.32    6.30    6.43    6.73    4.15    6.50
 4 Gene4    7.63    4.85    7.11    5.88    6.53    7.04    6.06    6.89    5.76
 5 Gene5    7.18    7.62    6.22    5.31    4.76    5.49    5.09    5.90    6.06
 6 Gene6    7.54    7.87    5.71    6.04    6.45    7.10    5.86    7.70    8.05
 7 Gene7    6.54    6.59    6.25    5.09    7.87    6.25    5.17    7.53    6.08
 8 Gene8    4.08    4.35    4.41    6.45    6.28    4.76    6.89    7.29    6.09
 9 Gene9    5.64    6.04    6.38    7.16    3.90    6.78    5.55    6.26    6.32
10 Gene…    7.11    6.09    5.66    6.74    4.47    4.12    5.67    7.26    6.68
# ℹ 990 more rows
# ℹ 5 more variables: Sample10 <dbl>, Sample11 <dbl>, Sample12 <dbl>,
#   Sample13 <dbl>, Sample14 <dbl>

5.6 Prepare rlog matrix for EDA

We separate identifiers and numeric values, then compute sample-level summaries used in visualization.

rlog_gene_ids <- rlog_mat[[1]]

rlog_values <- rlog_mat |>
  dplyr::select(-1) |>
  as.matrix()

storage.mode(rlog_values) <- "numeric"

5.7 Principal component analysis (PCA)

PCA summarizes dominant axes of variation in the transformed data.

# Transpose so rows = samples
pca <- prcomp(t(rlog_values), scale. = FALSE)

pca_scores <- as_tibble(pca$x, rownames = "sample_id")

# Identify metadata sample ID column
sample_id_col <- if ("sample_id" %in% names(meta)) "sample_id" else names(meta)[1]

# Make the join explicit and safe
pca_scores <- pca_scores |>
  dplyr::left_join(
    meta |> dplyr::rename(sample_id = .data[[sample_id_col]]),
    by = "sample_id"
  )

5.8 Save PCA results for visualization

dir.create("results", showWarnings = FALSE, recursive = TRUE)

readr::write_csv(pca_scores, "results/05a-pca-scores.csv")

5.9 Takeaway

You have prepared normalized, transformed representations suitable for exploratory analysis. In the next section, we will visualize and interpret these patterns using Python.

5.10 Exploratory Data Analysis and Interpretation

CDI goal: visualize transformed RNA-Seq data and interpret dominant patterns using CDI’s Python visualization workflow.

5.11 Learning outcomes

By the end of this lesson, you will be able to:

  • Load PCA results generated in Lesson 05a
  • Create consistent CDI figures using Python
  • Interpret PCA structure in the context of experimental design

5.12 Load PCA results

import pandas as pd

pca = pd.read_csv("results/05a-pca-scores.csv")
pca.head()
sample_id PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 condition
0 Sample1 -4.240090 3.775028 0.161505 -6.365113 -18.277427 17.382098 0.910165 6.516211 -5.656467 13.128635 -15.520337 2.576621 1.670239 2.106427e-14 Positive
1 Sample2 -3.242645 10.317651 26.092297 -1.664219 7.949342 6.973966 -1.008776 -10.141592 -4.067700 1.469867 6.033461 8.685954 -11.516987 2.287780e-14 Positive
2 Sample3 -5.615398 14.608090 -14.222394 16.621951 4.572279 -4.095005 3.134252 -17.063624 6.034119 0.263393 -13.230639 2.125597 -2.532548 2.369863e-14 Positive
3 Sample4 -5.304612 -6.202227 -0.009784 2.127661 4.671486 -1.494118 -0.045156 16.620939 5.730119 -7.887317 -7.822560 -11.126510 -21.009351 2.235085e-14 Positive
4 Sample5 -8.377700 13.287879 -4.037823 -8.287961 12.943720 11.393271 -0.376822 4.618224 11.966963 4.033432 9.757514 -13.609210 11.071384 2.190929e-14 Positive

5.13 Initialize CDI visualization

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl
import matplotlib.pyplot as plt

cdi_notebook_init(chapter='05', title_x=0)

5.14 PCA: PC1 vs PC2

This plot reveals the dominant axes of variation across samples.

plt.figure()
plt.scatter(pca['PC1'], pca['PC2'])
plt.xlabel('PC1')
plt.ylabel('PC2')

show_and_save_mpl()
Saved PNG → figures/05_001.png

5.15 Interpreting PCA structure

When interpreting PCA plots:

  • Check whether separation aligns with biological variables
  • Look for clustering driven by batch or technical factors
  • Remember that PCA reflects variance, not differential expression

5.16 Takeaway

Exploratory analysis helps you understand dominant patterns before modeling. These insights guide decisions about design, contrasts, and covariates in DE analysis.