From Results to Biological Claims

  • ID: RNASEQ-L06
  • Type: Lesson
  • Audience: Public
  • Theme: Interpreting results responsibly
source("scripts/R/cdi-plot-theme.R")

Why this lesson matters

Most RNA-seq confusion does not come from computation.

It comes from interpretation.

You can normalize data.
You can visualize structure.
You can compute p-values.

But the real question is:

What can you confidently claim?

This lesson builds the bridge between statistical output and biological reasoning.


Separate structure from inference

In previous lessons you observed:

  • Global structure using PCA
  • Sample similarity using clustering
  • Mean–variance relationships

These are descriptive layers.

They suggest patterns.

They do not prove biological mechanisms.

A PCA separation does not mean any individual gene is differentially expressed.
A cluster split does not imply causality.

Structure provides context.
Inference requires modeling.
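
A small simulation can make this distinction concrete. The sketch below uses made-up data; the gene count, group sizes, shift, and the names expr, group_sim, pca, and adj_sim are all assumptions chosen for illustration. Every gene is shifted slightly between groups, so the samples may separate along the first principal component while few or no individual genes survive multiple-testing correction.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n_genes <- 2000
group_sim <- rep(c("Control", "Treatment"), each = 3)

# Simulated log-scale expression: pure noise plus a small shift in every
# Treatment sample (0.8 SD per gene, an arbitrary illustrative value)
expr <- matrix(rnorm(n_genes * 6), nrow = n_genes)
expr[, group_sim == "Treatment"] <- expr[, group_sim == "Treatment"] + 0.8

# Descriptive layer: PCA on samples (prcomp expects samples in rows)
pca <- stats::prcomp(t(expr))
pc1_by_group <- split(pca$x[, 1], group_sim)

# Inferential layer: per-gene Welch t-test with BH adjustment
p_sim <- apply(expr, 1, function(x) {
  stats::t.test(x[group_sim == "Control"], x[group_sim == "Treatment"])$p.value
})
adj_sim <- stats::p.adjust(p_sim, method = "BH")

pc1_by_group         # compare PC1 scores between the two groups
sum(adj_sim < 0.05)  # number of genes passing FDR < 0.05
```

With three samples per group, a gene-level test has little power, so a visible split in the PCA scores can coexist with a short or empty list of significant genes. The structure is real context, but it is not gene-level evidence.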


A simplified teaching comparison (not production)

To illustrate interpretation logic, we perform a simple gene-wise t-test on log-CPM values.

This is not a production RNA-seq method.
It is a teaching device for practicing the structure of the reasoning.

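# Load the demo count table and the sample metadata
counts <- readr::read_csv("data/demo-counts.csv", show_col_types = FALSE)
meta   <- readr::read_csv("data/demo-metadata.csv", show_col_types = FALSE)

# Genes in rows, samples in columns (assumes gene_id is the first column)
count_matrix <- as.matrix(counts[-1])
rownames(count_matrix) <- counts$gene_id

# Counts per million: divide each sample (column) by its library size
library_sizes <- colSums(count_matrix)
cpm <- sweep(count_matrix, 2, library_sizes, FUN = "/") * 1e6
log_cpm <- log2(cpm + 1)  # the pseudocount of 1 keeps zero counts finite

# Assumes the rows of meta are in the same order as the sample columns
group <- meta$condition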

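# Welch two-sample t-test per gene; a gene that cannot be tested
# (for example, constant values in both groups) gets NA instead of an error
p_values <- apply(log_cpm, 1, function(x) {
  tryCatch(
    stats::t.test(x[group == "Control"], x[group == "Treatment"])$p.value,
    error = function(e) NA_real_
  )
})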

results <- tibble::tibble(
  gene_id = rownames(log_cpm),
  p_value = p_values
) |>
  dplyr::mutate(
    adjusted_p = p.adjust(p_value, method = "BH")
  )
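
To see what the BH adjustment does, it can help to reproduce it by hand on a few values. The sketch below uses five made-up p-values (not taken from the demo data) and compares a manual Benjamini-Hochberg calculation with stats::p.adjust; the variable names are illustrative only.

```r
# Five hypothetical raw p-values
raw_p <- c(0.001, 0.01, 0.03, 0.04, 0.5)
n <- length(raw_p)

# Step 1: sort ascending and scale each p-value by n / rank
ord <- order(raw_p)
scaled <- raw_p[ord] * n / seq_len(n)

# Step 2: enforce monotonicity, working down from the largest p-value
adj_sorted <- rev(cummin(rev(scaled)))

# Step 3: cap at 1 and restore the original order
manual_bh <- pmin(1, adj_sorted)[order(ord)]

builtin_bh <- stats::p.adjust(raw_p, method = "BH")

data.frame(raw_p, manual_bh, builtin_bh)
```

Each adjusted value is at least as large as its raw p-value, and the adjustment depends on how many tests were run. The same raw p-value therefore means less when thousands of genes are tested.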

Add effect size

Interpretation requires magnitude, not just significance.

group_means <- t(apply(log_cpm, 1, function(x) {
  c(
    mean_control = mean(x[group == "Control"]),
    mean_treatment = mean(x[group == "Treatment"])
  )
}))

effect_df <- tibble::as_tibble(group_means) |>
  dplyr::mutate(
    gene_id = rownames(group_means),
    mean_diff = mean_treatment - mean_control
  )

results <- dplyr::left_join(results, effect_df, by = "gene_id")

Now each gene has:

  • an adjusted p-value (adjusted_p)
  • a direction of change (the sign of mean_diff)
  • a magnitude of change (the size of mean_diff, on the log2-CPM scale)
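
Because log_cpm is on a log2 scale, mean_diff can be read approximately as a log2 fold change. The +1 pseudocount makes the correspondence only approximate, especially for low counts. A minimal sketch with made-up values (example_diff and approx_fold_change are illustrative names) shows the conversion:

```r
# Hypothetical mean differences on the log2 scale (not from the demo data)
example_diff <- c(-2, -1, 0, 1, 2)

# Back-transform: a difference of d log2 units is roughly a 2^d-fold change
approx_fold_change <- 2^example_diff

data.frame(mean_diff = example_diff, approx_fold_change)
# mean_diff of  1 -> roughly 2-fold higher in Treatment
# mean_diff of -1 -> roughly 2-fold lower in Treatment (ratio 0.5)
```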

Visualize the relationship: effect size vs significance

volcano_df <- results |>
  dplyr::mutate(
    neg_log10_adj_p = -log10(adjusted_p),
    significant = adjusted_p < 0.05
  )

ggplot2::ggplot(
  volcano_df,
  ggplot2::aes(x = mean_diff,
               y = neg_log10_adj_p,
               color = significant)
) +
  ggplot2::geom_point(alpha = 0.6) +
  ggplot2::labs(
    title = "Effect Size vs Statistical Significance",
    subtitle = "Teaching-only comparison using log-CPM values",
    x = "Mean log2-CPM Difference (Treatment − Control)",
    y = "-log10 Adjusted p-value"
  ) +
  cdi_theme() +
  ggplot2::scale_color_manual(
    values = c("FALSE" = "#d9500f", "TRUE" = "#2a9d8f"),
    labels = c("Not significant", "FDR < 0.05"),
    name = NULL
  )

In a plot like this, look for:

  • Genes with small effects but strong statistical evidence
  • Genes with large effects but weaker statistical support
  • How the two axes together, not either one alone, locate the genes worth a closer look

Statistical significance alone is insufficient.
Effect size alone is insufficient.

Responsible claims require both.
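
One way to make "both" operational is to label each gene by significance and effect size together. The sketch below uses a made-up four-gene table; the cutoffs (FDR < 0.05, absolute mean difference of at least 1) and the category labels are assumptions for illustration, not recommended thresholds.

```r
# Hypothetical results for four genes (not from the demo data)
toy <- data.frame(
  gene_id    = c("geneA", "geneB", "geneC", "geneD"),
  mean_diff  = c(0.2, 2.5, 1.8, -0.1),
  adjusted_p = c(0.001, 0.30, 0.01, 0.80)
)

fdr_cutoff    <- 0.05  # illustrative choice
effect_cutoff <- 1     # illustrative choice, in log2 units

is_sig   <- toy$adjusted_p < fdr_cutoff
is_large <- abs(toy$mean_diff) >= effect_cutoff

# Combine the two criteria into one descriptive label per gene
toy$category <- ifelse(
  is_sig & is_large, "significant, large effect",
  ifelse(
    is_sig, "significant, small effect",
    ifelse(is_large, "large effect, weak support", "neither")
  )
)

toy
```

Only geneC meets both criteria here. geneA and geneB each satisfy one criterion and would need more evidence before supporting a biological claim.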


Calibrated interpretation

A disciplined statement sounds like this:

Global structure suggests condition-related variation. A subset of genes shows statistically detectable mean shifts with varying magnitudes. Formal count-based modeling is required before drawing pathway-level or mechanistic conclusions.

Notice what this does:

  • acknowledges structure
  • acknowledges statistical evidence
  • avoids overstatement
  • defers mechanism

That is calibrated reasoning.


What the free track establishes

You now understand:

  • How count matrices behave
  • Why normalization exists
  • How exploratory analysis reveals structure
  • Why modeling is necessary
  • How to separate statistical output from biological claims

Premium bridge

In production RNA-seq workflows:

  • Differential testing uses negative binomial models
  • Dispersion is explicitly estimated
  • Shrinkage stabilizes fold changes
  • Diagnostics assess reliability
  • Pathway analysis integrates biological context

Full DESeq2 model fitting, dispersion estimation, shrinkage mechanics, diagnostics, and pathway-level interpretation are covered in the premium edition.
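
One reason count-based models are preferred is the mean-variance relationship noted in earlier lessons. Under a negative binomial model with mean mu and dispersion alpha, the variance is mu + alpha * mu^2, whereas a Poisson model gives a variance equal to mu. The sketch below uses an arbitrary dispersion of 0.1 (an assumption for illustration, not an estimate from the demo data) to show how quickly the two diverge:

```r
mu <- c(10, 100, 1000)  # hypothetical mean counts
dispersion <- 0.1       # arbitrary illustrative value

poisson_var <- mu                    # Poisson: variance equals the mean
nb_var <- mu + dispersion * mu^2     # negative binomial: extra quadratic term

data.frame(mu, poisson_var, nb_var)
```

At low counts the two variances are similar. At high counts the quadratic term dominates, which is why dispersion has to be estimated rather than ignored.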


Closing perspective

RNA-seq analysis is not a sequence of commands.

It is a reasoning chain:

Design → Structure → Modeling → Effect Size → Context → Claim

Break that chain, and results become fragile.
Respect it, and even complex outputs remain interpretable.