Lesson 02: RNA-Seq Study Design and Metadata

CDI goal: understand RNA-Seq experimental design and how sample metadata drives downstream analysis.

2.1 Why study design comes first

RNA-Seq analysis begins before any sequencing data are processed. Decisions made at the study design stage determine:

which questions can be answered,
which comparisons are valid,
and how results should be interpreted.

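Poor design cannot be fixed downstream with better statistics or visualization. For example, if every treated sample is sequenced in one run and every control in another, the effect of treatment cannot be separated from the effect of the run.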

2.2 Key concepts in RNA-Seq study design

Experimental unit: the entity to which a condition is applied
Sample: the sequenced material derived from an experimental unit
Biological replication: independent experimental units
Technical variation: variation introduced during library prep or sequencing
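Batch effects: systematic differences between groups of samples processed together (for example, on the same day or in the same sequencing run) that are unrelated to the biological question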
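
To make the interaction between condition and batch concrete, here is a minimal sketch in base R. The two designs, the column names (sample, condition, batch), and the helper is_confounded() are all hypothetical and exist only for illustration. The sketch cross-tabulates condition against batch and flags the case where each batch contains a single condition.

```r
# Hypothetical design: 8 samples, two conditions, two library-prep batches,
# with both conditions represented in each batch.
design_balanced <- data.frame(
  sample    = paste0("S", 1:8),
  condition = rep(c("control", "treated"), each = 4),
  batch     = rep(c("B1", "B2"), times = 4)
)

# Same samples, but each condition was processed entirely in one batch.
design_confounded <- transform(
  design_balanced,
  batch = ifelse(condition == "control", "B1", "B2")
)

# Cross-tabulate condition (rows) against batch (columns).
table(design_balanced$condition, design_balanced$batch)
table(design_confounded$condition, design_confounded$batch)

# Illustrative helper: a design is fully confounded when every batch
# contains samples from only one condition.
is_confounded <- function(design) {
  counts <- table(design$condition, design$batch)
  all(colSums(counts > 0) == 1)
}

is_confounded(design_balanced)
is_confounded(design_confounded)
```

In the confounded design the cross-tabulation has zeros off the diagonal, so any difference between conditions could equally be a difference between batches.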

2.3 Sample metadata structure

Sample metadata describe how each RNA-Seq sample relates to the experimental design.

In this guide, metadata are stored as a simple tabular file:

one row per sample
one column per variable
a unique sample identifier
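
As an illustration of this layout, the sketch below reads a small, made-up CSV from an in-memory string. The sample names, the batch column, and its values are assumptions for the example and are not taken from the demo file.

```r
# Hypothetical metadata in CSV form: one row per sample, one column per variable.
csv_text <- "Sample,condition,batch
Sample1,Positive,run1
Sample2,Positive,run2
Sample3,Negative,run1
Sample4,Negative,run2"

# read.csv() accepts the CSV content directly through its `text` argument.
toy_metadata <- read.csv(text = csv_text, stringsAsFactors = FALSE)

toy_metadata          # print the table
nrow(toy_metadata)    # number of samples (rows)
names(toy_metadata)   # recorded variables (columns)
```

The Sample column plays the role of the unique identifier; every other column describes one property of that sample.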

2.4 Loading demo metadata

We use a small demo dataset originally created for earlier CDI RNA-Seq Q&A guides.

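# Load the tidyverse, which provides read_csv() and glimpse()
library(tidyverse)

# Read the demo metadata; the path is relative to the working directory
metadata <- readr::read_csv("data/demo_metadata.csv")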

2.5 Inspecting metadata

glimpse(metadata)

Rows: 14
Columns: 2
$ Sample    <chr> "Sample1", "Sample2", "Sample3", "Sample4", "Sample5", "Samp…
$ condition <chr> "Positive", "Positive", "Positive", "Positive", "Positive", …

At minimum, confirm that:

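each sample appears in exactly one row
sample identifiers are unique and none are missing
categorical variables are correctly encoded (for example, consistent spelling and capitalization of condition labels)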
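
The checks above can be sketched as a small base R helper. The function check_metadata() and the toy table md_demo are hypothetical; md_demo only mimics the two columns shown by glimpse() and does not reproduce the real demo file.

```r
# Toy stand-in for the demo metadata (values are illustrative only).
md_demo <- data.frame(
  Sample    = paste0("Sample", 1:6),
  condition = rep(c("Positive", "Negative"), each = 3),
  stringsAsFactors = FALSE
)

# Illustrative helper returning the three basic checks as a list.
check_metadata <- function(md, id_col = "Sample", group_col = "condition") {
  ids <- md[[id_col]]
  list(
    # No identifier is NA or an empty string.
    no_missing_ids = !anyNA(ids) && all(nzchar(trimws(ids))),
    # No identifier occurs more than once.
    unique_ids     = anyDuplicated(ids) == 0,
    # Tabulating labels exposes typos such as "positive" vs "Positive".
    group_counts   = table(md[[group_col]], useNA = "ifany")
  )
}

checks <- check_metadata(md_demo)
checks$no_missing_ids
checks$unique_ids
checks$group_counts

# A duplicated row is caught by the uniqueness check.
md_bad <- rbind(md_demo, md_demo[1, ])
check_metadata(md_bad)$unique_ids
```

To run the same helper on the loaded table, call check_metadata(metadata), assuming its columns are named Sample and condition as in the glimpse() output.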

2.6 Common metadata variables

Typical RNA-Seq metadata include:

experimental condition (e.g. positive / negative; control / treated)
batch or sequencing run
biological covariates (sex, genotype, timepoint)

Not all variables must be used in every analysis, but all should be recorded.

2.7 Preparing metadata for analysis

Before modeling, metadata should be:

complete (no missing sample identifiers)
consistent (column names and sample identifiers match across files)
interpretable (clear factor levels)

No modeling is performed at this stage.
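
A minimal sketch of this preparation step follows, in base R. It assumes that "consistent" includes sample identifiers lining up with the columns of a count table; the count matrix, gene names, and the choice of "Negative" as the first factor level are all hypothetical.

```r
# Toy metadata (illustrative values only).
md_prep <- data.frame(
  Sample    = c("Sample1", "Sample2", "Sample3", "Sample4"),
  condition = c("Positive", "Positive", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

# Interpretable: set factor levels explicitly instead of relying on
# alphabetical order. The first level is the reference under R's default
# treatment contrasts.
md_prep$condition <- factor(md_prep$condition, levels = c("Negative", "Positive"))
levels(md_prep$condition)

# Hypothetical count matrix whose columns are in a different sample order.
counts <- matrix(
  1:8, nrow = 2,
  dimnames = list(c("geneA", "geneB"),
                  c("Sample3", "Sample1", "Sample4", "Sample2"))
)

# Consistent: both objects must describe the same set of samples.
stopifnot(setequal(colnames(counts), md_prep$Sample))

# Reorder metadata rows so that row i describes count column i.
md_prep <- md_prep[match(colnames(counts), md_prep$Sample), ]
rownames(md_prep) <- NULL

identical(md_prep$Sample, colnames(counts))
```

Reordering the metadata rather than the counts is an arbitrary choice here; what matters is that both objects end up in the same sample order before any model is fitted.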

2.8 Takeaway

Careful study design and well-structured metadata are the foundation of reliable RNA-Seq analysis.

In the next lesson, we begin working with sequencing-level quality control metrics.

Proceed to Lesson 03: FASTQ Intake and Quality Control Metrics