Lesson 2 RNA-Seq Study Design and Metadata
CDI goal: understand RNA-Seq experimental design and how sample metadata drives downstream analysis.
2.1 Why study design comes first
RNA-Seq analysis begins before any sequencing data are processed. Decisions made at the study design stage determine:
- which questions can be answered,
- which comparisons are valid,
- and how results should be interpreted.
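Poor design cannot be fixed downstream with better statistics or visualization. For example, if every treated sample is sequenced in one run and every control in another, no model can separate the treatment effect from the run effect.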
2.2 Key concepts in RNA-Seq study design
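- Experimental unit: the entity to which a condition is independently applied (for example, an individual animal, patient, or culture)
- Sample: the sequenced material derived from an experimental unit
- Biological replication: independent experimental units measured under the same condition
- Technical variation: variation introduced during library prep or sequencing rather than by the biology
- Batch effects: systematic differences between groups of samples processed together (for example, by date, operator, or sequencing run) that are unrelated to the biological question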
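The batch-effect concept above can be checked at the design stage by cross-tabulating condition against batch. The following is a minimal sketch in Python using only the standard library. The sample names, batch labels, and the helpers `crosstab` and `is_confounded` are hypothetical and are not part of the demo dataset.

```python
from collections import Counter

# Hypothetical design: each tuple is (sample, condition, batch).
design = [
    ("S1", "control", "run1"),
    ("S2", "control", "run1"),
    ("S3", "treated", "run2"),
    ("S4", "treated", "run2"),
]

def crosstab(rows):
    """Count samples for every (condition, batch) combination."""
    return Counter((cond, batch) for _, cond, batch in rows)

def is_confounded(rows):
    """Return True when there are several batches and each batch holds
    samples from a single condition only."""
    conds_per_batch = {}
    for _, cond, batch in rows:
        conds_per_batch.setdefault(batch, set()).add(cond)
    several_batches = len(conds_per_batch) > 1
    return several_batches and all(len(c) == 1 for c in conds_per_batch.values())

counts = crosstab(design)
confounded = is_confounded(design)
print(dict(counts))
print("condition confounded with batch:", confounded)
```

In this made-up design every control sample sits in run1 and every treated sample in run2, so the check flags it. A design that places both conditions in each run would pass.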
2.3 Sample metadata structure
Sample metadata describe how each RNA-Seq sample relates to the experimental design.
In this guide, metadata are stored as a simple tabular file:
- one row per sample
- one column per variable
- a unique sample identifier
2.4 Loading demo metadata
We use a small demo dataset originally created for earlier CDI RNA-Seq Q&A guides.
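As a rough illustration of this step, the sketch below reads a small metadata table in Python with the standard `csv` module. The column names `Sample` and `condition` follow the demo output shown in this lesson, but the rows, the inline text standing in for the file, and the helper `load_metadata` are assumptions made for the example, not the actual demo data.

```python
import csv
import io

# Hypothetical stand-in for the metadata file: the column names match the
# demo output (Sample, condition), but the rows are made up for illustration.
METADATA_TEXT = """Sample,condition
Sample1,Positive
Sample2,Positive
Sample3,Negative
Sample4,Negative
"""

def load_metadata(handle):
    """Read a comma-separated metadata table into a list of dicts,
    one dict per sample."""
    return list(csv.DictReader(handle))

metadata = load_metadata(io.StringIO(METADATA_TEXT))
columns = list(metadata[0].keys())

# Print a short structural summary: table size, then each column's values.
print("Rows:", len(metadata))
print("Columns:", len(columns))
for name in columns:
    print(name, [row[name] for row in metadata])
```

To read a file on disk instead, pass an open file handle, for example `open(path, newline="")`, where `path` points to your own metadata file.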
2.5 Inspecting metadata
A structural summary of the demo metadata reports 14 samples and 2 variables, Sample and condition:
Rows: 14
Columns: 2
$ Sample <chr> "Sample1", "Sample2", "Sample3", "Sample4", "Sample5", "Samp…
$ condition <chr> "Positive", "Positive", "Positive", "Positive", "Positive", …
At minimum, confirm that:
- each sequenced sample is described by exactly one row
- sample identifiers are unique
- categorical variables are correctly encoded (consistent spelling and case, no stray whitespace)
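The checks listed above can be sketched in a few lines of Python. The rows below are hypothetical and contain two deliberate problems so that each check has something to report. The helpers `duplicated_ids` and `inconsistent_levels` are illustrative names, not functions from any RNA-Seq package.

```python
from collections import Counter

# Hypothetical metadata rows with two deliberate problems:
# a duplicated identifier and an inconsistently spelled condition label.
metadata = [
    {"Sample": "Sample1", "condition": "Positive"},
    {"Sample": "Sample2", "condition": "positive "},
    {"Sample": "Sample3", "condition": "Negative"},
    {"Sample": "Sample3", "condition": "Negative"},
]

def duplicated_ids(rows, id_column="Sample"):
    """Return identifiers that occur more than once, in sorted order."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(s for s, n in counts.items() if n > 1)

def inconsistent_levels(rows, column):
    """Group raw labels that differ only by case or surrounding whitespace."""
    groups = {}
    for row in rows:
        raw = row[column]
        groups.setdefault(raw.strip().lower(), set()).add(raw)
    return {key: sorted(raws) for key, raws in groups.items() if len(raws) > 1}

dupes = duplicated_ids(metadata)
clashes = inconsistent_levels(metadata, "condition")
print("duplicated identifiers:", dupes)
print("inconsistent labels:", clashes)
```

Both results should be empty for clean metadata. A non-empty result points to rows that need correcting at the source rather than patching later in the analysis.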
2.6 Common metadata variables
Typical RNA-Seq metadata include:
- experimental condition (e.g. positive / negative; control / treated)
- batch or sequencing run
- biological covariates (sex, genotype, timepoint)
Not all variables must be used in every analysis, but all should be recorded.
2.7 Preparing metadata for analysis
Before modeling, metadata should be:
- complete (no missing sample identifiers)
- consistent (identifiers and column names match across files, for example metadata sample identifiers and count-matrix column names)
- interpretable (clear factor levels)
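A minimal sketch of these preparation steps, in Python, is shown below. It assumes that sample identifiers are already unique, that the count data carry one column per sample, and that "Negative" serves as the reference level. All three are assumptions made for illustration, as are the example rows and the helpers `align_metadata` and `ordered_levels`.

```python
# Hypothetical inputs: metadata rows and the sample columns of a count matrix.
metadata = [
    {"Sample": "Sample2", "condition": "Positive"},
    {"Sample": "Sample1", "condition": "Negative"},
    {"Sample": "Sample3", "condition": "Positive"},
]
count_columns = ["Sample1", "Sample2", "Sample3"]

def align_metadata(rows, columns, id_column="Sample"):
    """Reorder metadata rows to follow the count-matrix columns.

    Raises ValueError if the two sets of identifiers differ.
    Assumes identifiers in `rows` are unique."""
    by_id = {row[id_column]: row for row in rows}
    wanted = set(columns)
    missing = [c for c in columns if c not in by_id]
    extra = [s for s in by_id if s not in wanted]
    if missing or extra:
        raise ValueError(
            f"missing from metadata: {missing}; not in counts: {extra}"
        )
    return [by_id[c] for c in columns]

def ordered_levels(rows, column, reference):
    """List the levels of a categorical column with the reference level first."""
    levels = sorted({row[column] for row in rows})
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found in {column}")
    return [reference] + [lv for lv in levels if lv != reference]

aligned = align_metadata(metadata, count_columns)
levels = ordered_levels(aligned, "condition", reference="Negative")
print([row["Sample"] for row in aligned])
print(levels)
```

Failing loudly on mismatched identifiers is a deliberate choice here: silently dropping or reordering samples is a common source of mislabeled comparisons.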
No modeling is performed at this stage.