Study Design and Metadata

  • ID: RNASEQ-L02
  • Type: Lesson
  • Audience: Public
  • Theme: Study design and metadata integrity

Why study design comes first

RNA-seq analysis begins before any sequencing data are processed.

Decisions made at the study design stage determine:

  • which questions can be answered,
  • which comparisons are valid,
  • how results should be interpreted.

A poor design, such as one with too few replicates or with condition confounded with batch, cannot be corrected downstream with better statistics or visualization.
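One concrete case of an uncorrectable design is complete confounding, where a nuisance variable such as processing batch lines up exactly with the condition. The sketch below uses base R only; the `design` table, its `batch` column, and all values are hypothetical and are not part of the demo data.

```r
# Hypothetical design: every Control sample was processed in batch A and
# every Treatment sample in batch B (names and values are illustrative).
design <- data.frame(
  sample_id = sprintf("s%02d", 1:6),
  condition = factor(rep(c("Control", "Treatment"), each = 3)),
  batch     = factor(rep(c("A", "B"), each = 3))
)

# Cross-tabulate the two variables: empty off-diagonal cells indicate
# that condition and batch cannot be separated.
print(table(design$condition, design$batch))

# Expand the design into a model matrix containing both terms.
mm <- model.matrix(~ batch + condition, data = design)

# A usable design has rank equal to the number of coefficients.
n_coef <- ncol(mm)
mm_rank <- qr(mm)$rank
confounded <- mm_rank < n_coef
confounded
```

When the rank is lower than the number of coefficients, the batch and condition effects cannot be estimated separately, whatever method is applied afterwards.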

Key concepts in RNA-seq study design

  • Experimental unit: the entity to which a condition is applied
  • Sample: the sequenced material derived from an experimental unit
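  • Biological replication: multiple independent experimental units per condition
  • Technical variation: variation introduced during library preparation or sequencing
  • Batch effects: systematic differences between groups of samples processed together (for example, on different days or sequencing runs), unrelated to the biological question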

Understanding these concepts prevents interpretive errors later.
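The difference between samples and experimental units can be made concrete with a small sample sheet. The following is a minimal sketch, assuming a hypothetical `unit_id` column that records which experimental unit each library came from; the demo metadata do not contain such a column.

```r
# Hypothetical sample sheet: 4 animals (experimental units), each sequenced
# twice, giving 8 samples in total. All names are illustrative.
sheet <- data.frame(
  sample_id = sprintf("lib-%02d", 1:8),
  unit_id   = rep(c("mouse-1", "mouse-2", "mouse-3", "mouse-4"), each = 2),
  condition = rep(c("Control", "Treatment"), each = 4),
  stringsAsFactors = FALSE
)

# Samples per condition: this counts libraries, not independent units.
samples_per_condition <- table(sheet$condition)

# Biological replicates per condition: count distinct experimental units.
units <- unique(sheet[, c("unit_id", "condition")])
units_per_condition <- table(units$condition)

samples_per_condition
units_per_condition
```

In this sketch each condition has four samples but only two biological replicates. Treating the four libraries as independent would overstate replication.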

Sample metadata structure

Sample metadata describe how each RNA-seq sample relates to the experimental design.

In this guide, metadata are stored as a simple tabular file:

  • one row per sample
  • one column per variable
  • a unique sample identifier

This structure ensures traceability between the count matrix and the experimental design.
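Traceability can be checked by comparing the sample identifiers in the metadata with the column names of the count matrix. The sketch below assumes a hypothetical count matrix `counts_demo` with samples in columns; both objects and their values are made up for illustration.

```r
# Hypothetical metadata and count matrix; values are illustrative only.
meta_demo <- data.frame(
  sample_id = c("sample-01", "sample-02", "sample-03"),
  condition = c("Control", "Control", "Treatment"),
  stringsAsFactors = FALSE
)
counts_demo <- matrix(
  c(10L, 0L, 5L, 12L, 3L, 7L),
  nrow = 2,
  dimnames = list(
    c("geneA", "geneB"),
    c("sample-03", "sample-01", "sample-02")
  )
)

# Every matrix column needs a metadata row, and every row needs a column.
same_samples <- setequal(colnames(counts_demo), meta_demo$sample_id)

# Reorder the metadata rows to follow the matrix columns.
meta_aligned <- meta_demo[match(colnames(counts_demo), meta_demo$sample_id), ]

# After alignment, the identifiers agree position by position.
aligned <- identical(colnames(counts_demo), meta_aligned$sample_id)
aligned
```

Matching by identifier rather than by position avoids silently pairing a sample with the wrong condition when the two files are ordered differently.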

Load the demo metadata

meta <- readr::read_csv("data/demo-metadata.csv", show_col_types = FALSE)

meta
# A tibble: 12 × 3
   sample_id condition library_size
   <chr>     <chr>            <dbl>
 1 sample-01 Control        2763082
 2 sample-02 Control        3050899
 3 sample-03 Control        5217936
 4 sample-04 Control        3338902
 5 sample-05 Control        3398302
 6 sample-06 Control        5468525
 7 sample-07 Treatment      3753784
 8 sample-08 Treatment      2236632
 9 sample-09 Treatment      2660286
10 sample-10 Treatment      2859912
11 sample-11 Treatment      4719552
12 sample-12 Treatment      3641638

Inspect metadata structure

str(meta)
spc_tbl_ [12 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ sample_id   : chr [1:12] "sample-01" "sample-02" "sample-03" "sample-04" ...
 $ condition   : chr [1:12] "Control" "Control" "Control" "Control" ...
 $ library_size: num [1:12] 2763082 3050899 5217936 3338902 3398302 ...
 - attr(*, "spec")=
  .. cols(
  ..   sample_id = col_character(),
  ..   condition = col_character(),
  ..   library_size = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Confirm that:

  • each sample appears exactly once
  • sample identifiers are unique
  • categorical variables such as condition use consistent labels and the intended data type

Check uniqueness explicitly:

any(duplicated(meta$sample_id))
[1] FALSE

This should return FALSE; TRUE would mean that at least one identifier is repeated.
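The encoding check can also be made explicit. In the str() output, condition is stored as character. A minimal sketch of one way to encode it, assuming Control is the intended reference level, is to convert it to a factor with a declared level order. The vectors below are rebuilt inline so the snippet runs on its own.

```r
# Condition labels as they appear in the demo metadata (6 Control, 6 Treatment).
condition_chr <- rep(c("Control", "Treatment"), each = 6)

# Encode as a factor with an explicit level order; with R's default
# treatment contrasts, the first level acts as the reference.
condition_fct <- factor(condition_chr, levels = c("Control", "Treatment"))
levels(condition_fct)

# Labels outside the declared levels become NA, which exposes typos
# such as inconsistent capitalization.
typo_fct <- factor(c("Control", "treatment"), levels = c("Control", "Treatment"))
has_unknown <- anyNA(typo_fct)
has_unknown
```

Declaring the levels avoids relying on alphabetical ordering, which happens to put Control first here but would not for every pair of labels.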

Confirm balanced replication

table(meta$condition)

  Control Treatment 
        6         6 

Here each condition has six samples, so the design is balanced. Balanced replication improves stability in downstream modeling and simplifies interpretation.

Inspect library sizes

summary(meta$library_size)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
2236632 2835704 3368602 3592454 3995226 5468525 

Variation in library size is expected. In the demo data, the largest library (5468525) is about 2.4 times the smallest (2236632).
Extremely unbalanced sizes require careful normalization and cautious interpretation.
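One rough way to screen for unbalanced libraries is to express each size relative to the median. The sketch below copies the library sizes from the demo table so that it runs on its own. The cutoffs of half and twice the median are an illustrative assumption, not an established standard.

```r
# Library sizes copied from the demo metadata table.
library_size <- c(
  2763082, 3050899, 5217936, 3338902, 3398302, 5468525,
  3753784, 2236632, 2660286, 2859912, 4719552, 3641638
)
names(library_size) <- sprintf("sample-%02d", 1:12)

# Fold difference between the largest and smallest library.
fold_range <- max(library_size) / min(library_size)

# Size of each library relative to the median library.
rel_size <- library_size / median(library_size)

# Illustrative cutoffs (an assumption): flag libraries smaller than half
# or larger than twice the median.
flagged <- names(rel_size)[rel_size < 0.5 | rel_size > 2]

fold_range
flagged
```

With these cutoffs no demo sample is flagged, which is consistent with the moderate spread shown in the summary.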

Why metadata integrity matters

If metadata are incorrect:

  • samples may be mislabeled
  • conditions may be inverted
  • replication may be overstated
  • variance may be misinterpreted

Every downstream step assumes metadata are accurate.
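Some of these problems can be caught automatically before analysis. The following is a minimal sketch of a validation helper in base R. The function name `validate_metadata` and its `allowed_conditions` argument are hypothetical and do not come from any package.

```r
# Hypothetical helper: returns a character vector describing problems,
# or an empty vector when none of the checks fail.
validate_metadata <- function(meta, allowed_conditions) {
  required <- c("sample_id", "condition")
  missing_cols <- setdiff(required, names(meta))
  if (length(missing_cols) > 0) {
    return(paste("missing column:", missing_cols))
  }

  problems <- character(0)
  if (anyNA(meta$sample_id) || anyNA(meta$condition)) {
    problems <- c(problems, "missing values in sample_id or condition")
  }
  if (any(duplicated(meta$sample_id))) {
    problems <- c(problems, "duplicated sample_id")
  }
  unknown <- setdiff(unique(na.omit(meta$condition)), allowed_conditions)
  if (length(unknown) > 0) {
    problems <- c(problems, paste("unknown condition label:", unknown))
  }
  problems
}

# Illustrative inputs: one clean table and one with two deliberate errors.
good <- data.frame(
  sample_id = c("sample-01", "sample-02"),
  condition = c("Control", "Treatment"),
  stringsAsFactors = FALSE
)
bad <- data.frame(
  sample_id = c("sample-01", "sample-01"),
  condition = c("Control", "treated"),
  stringsAsFactors = FALSE
)

good_problems <- validate_metadata(good, c("Control", "Treatment"))
bad_problems <- validate_metadata(bad, c("Control", "Treatment"))
bad_problems
```

Checks like these catch structural problems only. Labels that are well formed but swapped between samples still have to be verified against the laboratory records.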

Premium note

In production workflows, the experimental design is encoded as a design formula that is passed directly to the statistical model.

Full DESeq2 model fitting, design specification, diagnostics, and interpretation are covered in the premium edition.