Lesson 6 Differential Expression Modeling Concepts

CDI goal: understand what differential expression (DE) modeling is, what questions it answers, and what assumptions underlie statistical tests before running any DESeq2 code.

6.1 Learning outcomes

By the end of this lesson, you will be able to:

  • Explain what differential expression means in RNA-Seq analysis
  • Distinguish modeling goals from visualization and exploration
  • Understand the role of the statistical model in DE analysis
  • Identify key assumptions behind count-based models
  • Interpret contrasts conceptually (without fitting a model yet)

6.2 What is differential expression?

Differential expression (DE) analysis asks a focused statistical question:

Which genes show evidence of systematic expression differences between conditions, beyond random variability?

In RNA-Seq, this question is answered using count-based statistical models that explicitly account for biological and technical variability.

6.3 Differential expression is a modeling problem

It is tempting to think of DE as:

  • “genes that look different in a plot”
  • “genes with large fold changes”

But DE is fundamentally about probabilistic modeling, not visual separation.

Visualization supports intuition; models support inference.

6.4 Why raw counts are modeled (not transformed values)

Count-based DE methods such as DESeq2 take raw (un-normalized) integer counts as input, not rlog- or log-transformed values.

Why?

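  • The variance of a count depends on its expression level
  • Raw counts preserve this mean–variance relationship, which the model estimates directly
  • Transformations change the distribution of the data, so count-based assumptions no longer apply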

Exploration uses transformed data; modeling uses raw counts.
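
To make the mean–variance point concrete, here is a minimal sketch. It assumes the negative binomial variance function, variance = mean + dispersion × mean², which is the form commonly used by count-based DE tools. The function name nb_variance and the dispersion value 0.1 are illustrative choices, not values taken from the demo dataset.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count with the given mean and dispersion."""
    return mean + dispersion * mean ** 2

dispersion = 0.1  # illustrative value, not estimated from real data

for mean in (10, 100, 1000):
    poisson_var = mean  # a Poisson count has variance equal to its mean
    nb_var = nb_variance(mean, dispersion)
    print(f"mean={mean:>5}  Poisson variance={poisson_var:>5}  NB variance={nb_var:>10.1f}")
```

The gap between the two variances widens as the mean grows, which is why a single variance value cannot describe both low- and high-expression genes.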

6.5 The basic ingredients of a DE model

A differential expression model requires:

  • A count matrix (genes × samples)
  • Sample metadata describing experimental variables
  • A design formula specifying which effects to model

The model links each gene's counts to the experimental variables named in the design formula, under explicit distributional assumptions.
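
The three ingredients can be sketched in plain Python without fitting anything. The toy counts, sample names, and the helper design_matrix below are hypothetical. The helper only handles a single two-level factor, which is roughly what a design such as ~ condition expands to: an intercept column plus one indicator column.

```python
# Hypothetical toy data: 3 genes x 4 samples (not the lesson's demo dataset).
samples = ["s1", "s2", "s3", "s4"]
counts = {
    "geneA": [12, 15, 48, 52],
    "geneB": [100, 96, 103, 99],
    "geneC": [0, 3, 1, 2],
}

# Sample metadata: one experimental variable (condition) per sample.
metadata = {"s1": "negative", "s2": "negative", "s3": "positive", "s4": "positive"}

def design_matrix(samples, metadata, reference):
    """Build intercept + indicator columns for a single two-level factor."""
    return [[1, 0 if metadata[s] == reference else 1] for s in samples]

X = design_matrix(samples, metadata, reference="negative")
for s, row in zip(samples, X):
    print(s, metadata[s], row)
```

Each row of X describes one sample. The second column is 1 only for samples in the non-reference condition, so its coefficient captures the difference between conditions.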

6.6 Experimental conditions and contrasts

In the demo dataset, samples belong to two conditions:

  • positive
  • negative

A DE analysis typically asks questions like:

  • How does gene expression differ between positive and negative samples?

This comparison is encoded as a contrast within the model.
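
Conceptually, a contrast is a set of weights applied to the model's coefficients. The sketch below assumes a model with two coefficients on the log2 scale (an intercept for the negative group and a positive-versus-negative effect). The coefficient values and the helper apply_contrast are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical coefficients for one gene on the log2 scale:
# [intercept (negative group), effect of positive vs negative].
beta = [5.0, 1.5]

def apply_contrast(contrast, beta):
    """Return the linear combination of coefficients selected by the contrast."""
    return sum(c * b for c, b in zip(contrast, beta))

negative_mean = [1, 0]          # log2 expression in the reference group
positive_mean = [1, 1]          # reference plus the condition effect
positive_vs_negative = [0, 1]   # difference between the two groups

lfc = apply_contrast(positive_vs_negative, beta)
print("log2 expression, negative:", apply_contrast(negative_mean, beta))
print("log2 expression, positive:", apply_contrast(positive_mean, beta))
print("log2 fold change (positive vs negative):", lfc)
```

The contrast [0, 1] is simply the difference between the two group-level weight vectors, which is why it isolates the condition effect.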

6.7 What a DE result represents

For each gene, a DE method estimates:

  • A log2 fold change between conditions
  • An estimate of uncertainty (standard error)
  • A p-value: the probability of observing a difference at least this large if the gene were not truly differentially expressed

These quantities are always interpreted in the context of the model assumptions.
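
The sketch below shows how the three quantities relate under a Wald-style test: the log2 fold change is divided by its standard error and compared with a standard normal distribution. DESeq2's default test has this general form, but real tools also estimate dispersion and adjust p-values for multiple testing, which this sketch leaves out. The gene names and numbers are hypothetical.

```python
import math

def wald_test(log2_fold_change, standard_error):
    """Return the Wald statistic and a two-sided p-value from the standard normal."""
    z = log2_fold_change / standard_error
    # Two-sided tail probability: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical estimates: same fold change, different uncertainty.
for gene, lfc, se in [("geneA", 2.0, 0.5), ("geneB", 2.0, 1.5)]:
    z, p = wald_test(lfc, se)
    print(f"{gene}: log2FC={lfc}, SE={se}, z={z:.2f}, p={p:.4g}")
```

Both genes have the same fold change, yet only the one with the small standard error gives strong evidence of a difference. This is why fold change alone does not define differential expression.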

6.8 Common assumptions in RNA-Seq DE models

Most count-based DE methods assume:

  • Counts follow a negative binomial distribution
  • Samples are independent
  • Most genes are not differentially expressed (normalization relies on this)
  • Technical effects have been reasonably controlled

Violations of these assumptions can lead to misleading results.
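
As a sanity check on the first assumption, the sketch below writes out a negative binomial probability function in terms of a mean and a dispersion, then confirms numerically that its variance matches mean + dispersion × mean². The parameter values are illustrative, and nb_pmf is a hypothetical helper, not a function from any DE package.

```python
import math

def nb_pmf(k, mean, dispersion):
    """Negative binomial probability of count k, given a mean and a dispersion."""
    r = 1.0 / dispersion   # "size" parameter
    p = r / (r + mean)     # success probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

mean, dispersion = 50.0, 0.2  # illustrative values
ks = range(0, 2000)           # wide enough that the remaining tail is negligible
probs = [nb_pmf(k, mean, dispersion) for k in ks]

total = sum(probs)
est_mean = sum(k * pr for k, pr in zip(ks, probs))
est_var = sum((k - est_mean) ** 2 * pr for k, pr in zip(ks, probs))

print(f"total probability: {total:.6f}")
print(f"numerical mean: {est_mean:.2f}, numerical variance: {est_var:.2f}")
print(f"mean + dispersion * mean^2: {mean + dispersion * mean ** 2:.2f}")
```

If real counts are far more variable than this distribution allows (for example because of an unmodeled batch effect), the uncertainty estimates built on it will be too optimistic.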

6.9 Why exploratory analysis comes first

Exploratory data analysis (EDA, Lesson 5) helps you assess whether modeling assumptions are plausible:

  • Are samples clustering by condition?
  • Are there strong batch effects?
  • Are there obvious outliers?

Modeling without this context is risky.

6.10 What we are not doing yet

In this lesson, we deliberately avoid:

  • Fitting a DESeq2 model
  • Choosing thresholds
  • Interpreting volcano plots

Those steps come after the modeling framework is understood.

6.11 Takeaway

Differential expression analysis is not a visualization task or a filtering exercise. It is a statistical modeling problem grounded in assumptions about RNA-Seq data.

You’ve already demonstrated careful thinking, patience, and attention to detail — exactly what RNA-Seq analysis demands.

At this point, you understand what differential expression is, why it requires modeling, and which assumptions must hold for results to be meaningful.

As you move forward, treat every downstream result as something you can explain, justify, and reproduce — not just generate.

That mindset is what separates routine analysis from defensible science.