Lesson 5 Normalization and Exploratory Data Analysis (Computation)

CDI goal: compute normalized and transformed representations for exploratory analysis, and save derived tables for visualization in Lesson 05b.

5.1 Learning outcomes

By the end of this lesson, you will be able to:

Explain why normalization precedes exploratory analysis
Understand the purpose of rlog transformation
Prepare transformed matrices for visualization
Save EDA-ready outputs for reuse (a/b pattern)

5.2 Why normalization before EDA

Raw RNA-Seq counts are dominated by sequencing depth and composition effects. Exploratory analysis on raw counts often reveals technical structure rather than biology.

In CDI, we:

Normalize counts to adjust for technical effects
Transform values to stabilize variance
Use transformed values only for exploration, not modeling

5.3 Load required data

library(tidyverse)

counts_raw <- readr::read_csv("data/demo_counts.csv")
meta <- readr::read_csv("data/demo_metadata.csv")

5.4 Prepare count matrix

gene_ids <- counts_raw[[1]]

counts_mat <- counts_raw |>
  dplyr::select(-1) |>
  as.matrix()

storage.mode(counts_mat) <- "numeric"

5.5 Load rlog-transformed matrix

For this guide, we use a precomputed rlog matrix to focus on interpretation rather than computation cost.

rlog_mat <- readr::read_csv("data/rlog_matrix.csv")

rlog_mat

# A tibble: 1,000 × 15
   gene  Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Gene1    7.05    6.87    6.96    5.72    5.76    5.87    5.57    3.63    5.26
 2 Gene2    5.72    6.40    5.87    5.73    7.58    6.08    5.54    5.82    6.48
 3 Gene3    5.54    4.55    7.22    6.32    6.30    6.43    6.73    4.15    6.50
 4 Gene4    7.63    4.85    7.11    5.88    6.53    7.04    6.06    6.89    5.76
 5 Gene5    7.18    7.62    6.22    5.31    4.76    5.49    5.09    5.90    6.06
 6 Gene6    7.54    7.87    5.71    6.04    6.45    7.10    5.86    7.70    8.05
 7 Gene7    6.54    6.59    6.25    5.09    7.87    6.25    5.17    7.53    6.08
 8 Gene8    4.08    4.35    4.41    6.45    6.28    4.76    6.89    7.29    6.09
 9 Gene9    5.64    6.04    6.38    7.16    3.90    6.78    5.55    6.26    6.32
10 Gene…    7.11    6.09    5.66    6.74    4.47    4.12    5.67    7.26    6.68
# ℹ 990 more rows
# ℹ 5 more variables: Sample10 <dbl>, Sample11 <dbl>, Sample12 <dbl>,
#   Sample13 <dbl>, Sample14 <dbl>

5.6 Prepare rlog matrix for EDA

We separate identifiers and numeric values, then compute sample-level summaries used in visualization.

rlog_gene_ids <- rlog_mat[[1]]

rlog_values <- rlog_mat |>
  dplyr::select(-1) |>
  as.matrix()

storage.mode(rlog_values) <- "numeric"

5.7 Principal component analysis (PCA)

PCA summarizes dominant axes of variation in the transformed data.

# Transpose so rows = samples
pca <- prcomp(t(rlog_values), scale. = FALSE)

pca_scores <- as_tibble(pca$x, rownames = "sample_id")

# Identify metadata sample ID column
sample_id_col <- if ("sample_id" %in% names(meta)) "sample_id" else names(meta)[1]

# Make the join explicit and safe
pca_scores <- pca_scores |>
  dplyr::left_join(
    meta |> dplyr::rename(sample_id = .data[[sample_id_col]]),
    by = "sample_id"
  )

5.8 Save PCA results for visualization

dir.create("results", showWarnings = FALSE, recursive = TRUE)

readr::write_csv(pca_scores, "results/05a-pca-scores.csv")

5.9 Takeaway

You have prepared normalized, transformed representations suitable for exploratory analysis. In the next section, we will visualize and interpret these patterns using Python.

5.10 Exploratory Data Analysis and Interpretation

CDI goal: visualize transformed RNA-Seq data and interpret dominant patterns using CDI’s Python visualization workflow.

5.11 Learning outcomes

By the end of this lesson, you will be able to:

Load PCA results generated in Lesson 05a
Create consistent CDI figures using Python
Interpret PCA structure in the context of experimental design

5.12 Load PCA results

import pandas as pd

pca = pd.read_csv("results/05a-pca-scores.csv")
pca.head()

	sample_id	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10	PC11	PC12	PC13	PC14	condition
0	Sample1	-4.240090	3.775028	0.161505	-6.365113	-18.277427	17.382098	0.910165	6.516211	-5.656467	13.128635	-15.520337	2.576621	1.670239	2.106427e-14	Positive
1	Sample2	-3.242645	10.317651	26.092297	-1.664219	7.949342	6.973966	-1.008776	-10.141592	-4.067700	1.469867	6.033461	8.685954	-11.516987	2.287780e-14	Positive
2	Sample3	-5.615398	14.608090	-14.222394	16.621951	4.572279	-4.095005	3.134252	-17.063624	6.034119	0.263393	-13.230639	2.125597	-2.532548	2.369863e-14	Positive
3	Sample4	-5.304612	-6.202227	-0.009784	2.127661	4.671486	-1.494118	-0.045156	16.620939	5.730119	-7.887317	-7.822560	-11.126510	-21.009351	2.235085e-14	Positive
4	Sample5	-8.377700	13.287879	-4.037823	-8.287961	12.943720	11.393271	-0.376822	4.618224	11.966963	4.033432	9.757514	-13.609210	11.071384	2.190929e-14	Positive

5.13 Initialize CDI visualization

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl
import matplotlib.pyplot as plt

cdi_notebook_init(chapter='05', title_x=0)

5.14 PCA: PC1 vs PC2

This plot reveals the dominant axes of variation across samples.

plt.figure()
plt.scatter(pca['PC1'], pca['PC2'])
plt.xlabel('PC1')
plt.ylabel('PC2')

show_and_save_mpl()

Saved PNG → figures/05_001.png

5.15 Interpreting PCA structure

When interpreting PCA plots:

Check whether separation aligns with biological variables
Look for clustering driven by batch or technical factors
Remember that PCA reflects variance, not differential expression

5.16 Takeaway

Exploratory analysis helps you understand dominant patterns before modeling. These insights guide decisions about design, contrasts, and covariates in DE analysis.

Proceed to Lesson 06: Differential Expression Modeling Concepts