Lesson 2 RNA-Seq Study Design and Metadata

CDI goal: understand RNA-Seq experimental design and how sample metadata drives downstream analysis.

2.1 Why study design comes first

RNA-Seq analysis begins before any sequencing data are processed. Decisions made at the study design stage determine:

  • which questions can be answered,
  • which comparisons are valid,
  • and how results should be interpreted.

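Poor design cannot be fixed downstream with better statistics or visualization. For example, if every treated sample is processed in one batch and every control sample in another, no model can separate the treatment effect from the batch effect.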

2.2 Key concepts in RNA-Seq study design

  • Experimental unit: the entity to which a condition is applied
  • Sample: the sequenced material derived from an experimental unit
  • Biological replication: measuring several independent experimental units per condition
  • Technical variation: variation introduced during library prep or sequencing
  • Batch effects: systematic differences between groups of samples processed together (e.g. same day, run, or operator) that are unrelated to the biological question
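
The concepts above can be checked against a planned design before any sequencing is done. The following is a minimal sketch in base R, assuming a hypothetical six-sample design; the `design` table, its column names, and the batch labels are invented for illustration and are not part of the demo dataset. It cross-tabulates condition against batch and flags the case where each batch contains only one condition.

```r
# Hypothetical design: six samples, two conditions, two batches.
design <- data.frame(
  sample    = paste0("S", 1:6),
  condition = rep(c("control", "treated"), each = 3),
  batch     = c("A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Number of biological replicates per condition.
replicates <- table(design$condition)
print(replicates)

# Cross-tabulate condition (rows) against batch (columns).
design_table <- table(design$condition, design$batch)
print(design_table)

# The design is fully confounded when every batch holds a single condition,
# i.e. each column of the table has exactly one non-zero cell.
fully_confounded <- all(colSums(design_table > 0) == 1)
print(fully_confounded)
```

In this invented layout condition and batch are fully confounded. Splitting each condition across both batches would make every column of the table contain both conditions.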

2.3 Sample metadata structure

Sample metadata describe how each RNA-Seq sample relates to the experimental design.

In this guide, metadata are stored as a simple tabular file:

  • one row per sample
  • one column per variable
  • a unique sample identifier
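
As an illustration of this layout, here is a small sketch in base R. The table `metadata_example` is hypothetical; it borrows the column names `Sample` and `condition` from the demo file but its rows are made up.

```r
# Hypothetical four-sample table following the layout described above.
metadata_example <- data.frame(
  Sample    = c("Sample1", "Sample2", "Sample3", "Sample4"),
  condition = c("Positive", "Positive", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

n_samples  <- nrow(metadata_example)                      # one row per sample
variables  <- names(metadata_example)                     # one column per variable
ids_unique <- !any(duplicated(metadata_example$Sample))   # identifiers are unique

print(metadata_example)
```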

2.4 Loading demo metadata

We use a small demo dataset originally created for earlier CDI RNA-Seq Q&A guides.

# Load the tidyverse, which includes readr and dplyr
library(tidyverse)

# Read the demo sample metadata into a tibble
metadata <- readr::read_csv("data/demo_metadata.csv")

2.5 Inspecting metadata

glimpse(metadata)
Rows: 14
Columns: 2
$ Sample    <chr> "Sample1", "Sample2", "Sample3", "Sample4", "Sample5", "Samp…
$ condition <chr> "Positive", "Positive", "Positive", "Positive", "Positive", …

At minimum, confirm that:

  • each sample appears exactly once
  • sample identifiers are unique
  • categorical variables are correctly encoded (consistent spelling and capitalization, e.g. not both "Positive" and "positive")
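
These checks can be scripted. The sketch below uses base R on a hypothetical table, `toy`, that contains two deliberate problems (a repeated identifier and inconsistent capitalization) so that each check has something to find. The same expressions could be pointed at `metadata`, assuming its columns are named `Sample` and `condition` as in the output above.

```r
# Hypothetical metadata with two deliberate problems:
# a duplicated identifier and inconsistent capitalization.
toy <- data.frame(
  Sample    = c("Sample1", "Sample2", "Sample2", "Sample4"),
  condition = c("Positive", "positive", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

# Identifiers that occur more than once.
dup_ids <- unique(toy$Sample[duplicated(toy$Sample)])
print(dup_ids)

# Distinct condition labels as they are written in the table.
condition_levels <- sort(unique(toy$condition))
print(condition_levels)

# Labels that collide once case is ignored.
lowered <- tolower(unique(toy$condition))
case_collisions <- unique(lowered[duplicated(lowered)])
print(case_collisions)
```

An empty `dup_ids` and an empty `case_collisions` would indicate that the table passes both checks.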

2.6 Common metadata variables

Typical RNA-Seq metadata include:

  • experimental condition (e.g. positive / negative; control / treated)
  • batch or sequencing run
  • biological covariates (sex, genotype, timepoint)

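Not all variables must be used in every analysis, but all should be recorded, because a variable that was never recorded cannot later be checked for confounding or adjusted for.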

2.7 Preparing metadata for analysis

Before modeling, metadata should be:

  • complete (no missing sample identifiers)
  • consistent (identical column names and sample identifiers across files)
  • interpretable (clear factor levels)
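
A minimal sketch of these three preparation steps in base R follows. It assumes that one of the other files is a count matrix whose column names are sample identifiers. The objects `meta` and `counts`, their values, and the choice of "Negative" as the reference level are all hypothetical.

```r
# Hypothetical metadata and count matrix (all values are made up).
meta <- data.frame(
  Sample    = c("Sample1", "Sample2", "Sample3", "Sample4"),
  condition = c("Positive", "Positive", "Negative", "Negative"),
  stringsAsFactors = FALSE
)
counts <- matrix(
  1:8, nrow = 2,
  dimnames = list(c("geneA", "geneB"),
                  c("Sample3", "Sample1", "Sample4", "Sample2"))
)

# Complete: no missing or empty sample identifiers.
stopifnot(!anyNA(meta$Sample), all(nzchar(meta$Sample)))

# Consistent: the same identifiers appear in both objects.
stopifnot(setequal(meta$Sample, colnames(counts)))

# Reorder the metadata rows to follow the count matrix columns.
meta <- meta[match(colnames(counts), meta$Sample), ]
rownames(meta) <- NULL

# Interpretable: explicit factor levels, with the assumed reference level first.
meta$condition <- factor(meta$condition, levels = c("Negative", "Positive"))
print(levels(meta$condition))
```

Setting the levels explicitly avoids relying on alphabetical ordering, which is what `factor()` uses by default.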

No modeling is performed at this stage.

2.8 Takeaway

Careful study design and well-structured metadata are the foundation of reliable RNA-Seq analysis.

In the next lesson, we begin working with sequencing-level quality control metrics.