Lesson 5 Normalization and Exploratory Data Analysis (Computation)
5.1 Learning outcomes
By the end of this lesson, you will be able to:
- Explain why normalization precedes exploratory analysis
- Understand the purpose of rlog transformation
- Prepare transformed matrices for visualization
- Save EDA-ready outputs for reuse (a/b pattern)
5.2 Why normalization before EDA
Raw RNA-Seq counts are dominated by sequencing depth and composition effects. Exploratory analysis on raw counts often reveals technical structure rather than biology.
In CDI, we:
- Normalize counts to adjust for technical effects
- Transform values to stabilize variance
- Use transformed values only for exploration, not modeling
5.5 Load rlog-transformed matrix
For this guide, we use a precomputed rlog matrix to focus on interpretation rather than computation cost.
# A tibble: 1,000 × 15
gene Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Gene1 7.05 6.87 6.96 5.72 5.76 5.87 5.57 3.63 5.26
2 Gene2 5.72 6.40 5.87 5.73 7.58 6.08 5.54 5.82 6.48
3 Gene3 5.54 4.55 7.22 6.32 6.30 6.43 6.73 4.15 6.50
4 Gene4 7.63 4.85 7.11 5.88 6.53 7.04 6.06 6.89 5.76
5 Gene5 7.18 7.62 6.22 5.31 4.76 5.49 5.09 5.90 6.06
6 Gene6 7.54 7.87 5.71 6.04 6.45 7.10 5.86 7.70 8.05
7 Gene7 6.54 6.59 6.25 5.09 7.87 6.25 5.17 7.53 6.08
8 Gene8 4.08 4.35 4.41 6.45 6.28 4.76 6.89 7.29 6.09
9 Gene9 5.64 6.04 6.38 7.16 3.90 6.78 5.55 6.26 6.32
10 Gene… 7.11 6.09 5.66 6.74 4.47 4.12 5.67 7.26 6.68
# ℹ 990 more rows
# ℹ 5 more variables: Sample10 <dbl>, Sample11 <dbl>, Sample12 <dbl>,
# Sample13 <dbl>, Sample14 <dbl>
5.6 Prepare rlog matrix for EDA
We separate identifiers and numeric values, then compute sample-level summaries used in visualization.
5.7 Principal component analysis (PCA)
PCA summarizes dominant axes of variation in the transformed data.
# Transpose so rows = samples
pca <- prcomp(t(rlog_values), scale. = FALSE)
pca_scores <- as_tibble(pca$x, rownames = "sample_id")
# Identify metadata sample ID column
sample_id_col <- if ("sample_id" %in% names(meta)) "sample_id" else names(meta)[1]
# Make the join explicit and safe
pca_scores <- pca_scores |>
dplyr::left_join(
meta |> dplyr::rename(sample_id = .data[[sample_id_col]]),
by = "sample_id"
)5.9 Takeaway
You have prepared normalized, transformed representations suitable for exploratory analysis. In the next section, we will visualize and interpret these patterns using Python.
5.11 Learning outcomes
By the end of this lesson, you will be able to:
- Load PCA results generated in Lesson 05a
- Create consistent CDI figures using Python
- Interpret PCA structure in the context of experimental design
5.12 Load PCA results
| sample_id | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | condition | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sample1 | -4.240090 | 3.775028 | 0.161505 | -6.365113 | -18.277427 | 17.382098 | 0.910165 | 6.516211 | -5.656467 | 13.128635 | -15.520337 | 2.576621 | 1.670239 | 2.106427e-14 | Positive |
| 1 | Sample2 | -3.242645 | 10.317651 | 26.092297 | -1.664219 | 7.949342 | 6.973966 | -1.008776 | -10.141592 | -4.067700 | 1.469867 | 6.033461 | 8.685954 | -11.516987 | 2.287780e-14 | Positive |
| 2 | Sample3 | -5.615398 | 14.608090 | -14.222394 | 16.621951 | 4.572279 | -4.095005 | 3.134252 | -17.063624 | 6.034119 | 0.263393 | -13.230639 | 2.125597 | -2.532548 | 2.369863e-14 | Positive |
| 3 | Sample4 | -5.304612 | -6.202227 | -0.009784 | 2.127661 | 4.671486 | -1.494118 | -0.045156 | 16.620939 | 5.730119 | -7.887317 | -7.822560 | -11.126510 | -21.009351 | 2.235085e-14 | Positive |
| 4 | Sample5 | -8.377700 | 13.287879 | -4.037823 | -8.287961 | 12.943720 | 11.393271 | -0.376822 | 4.618224 | 11.966963 | 4.033432 | 9.757514 | -13.609210 | 11.071384 | 2.190929e-14 | Positive |
5.14 PCA: PC1 vs PC2
This plot reveals the dominant axes of variation across samples.
plt.figure()
plt.scatter(pca['PC1'], pca['PC2'])
plt.xlabel('PC1')
plt.ylabel('PC2')
show_and_save_mpl()
Saved PNG → figures/05_001.png

5.15 Interpreting PCA structure
When interpreting PCA plots:
- Check whether separation aligns with biological variables
- Look for clustering driven by batch or technical factors
- Remember that PCA reflects variance, not differential expression