Lesson 3 Data Intake and Basic QC Checks

CDI goal: load the demo RNA-Seq count matrix and metadata, validate alignment, compute basic QC summaries, and save them for plotting in Lesson 03b.

3.1 Learning outcomes

By the end of this lesson, you will be able to:

Load demo_counts.csv and demo_metadata.csv
Validate that sample identifiers match between counts and metadata
Compute basic count-level QC summaries (library sizes, detected genes, zero fractions)
Save QC tables for downstream visualization (a/b lesson pattern)

3.2 Why we do intake + QC before analysis

Before normalization or differential expression, we confirm that:

counts and metadata refer to the same samples
counts look plausible (no all-zero samples, no swapped identifiers)
basic distributions do not contain obvious anomalies

This lesson performs computation + saving only. No plots are produced here by design.

3.3 Load demo data

The guide reuses the small demo datasets from the earlier CDI RNA-Seq Q&A versions.

library(tidyverse)

counts_raw <- readr::read_csv("data/demo_counts.csv")
meta <- readr::read_csv("data/demo_metadata.csv")

3.4 Inspect structure

tibble::glimpse(counts_raw)

Rows: 1,000
Columns: 15
$ gene     <chr> "Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7"…
$ Sample1  <dbl> 186, 45, 38, 293, 215, 246, 99, 8, 38, 197, 28, 98, 93, 142, …
$ Sample2  <dbl> 175, 101, 12, 0, 377, 376, 115, 13, 66, 82, 89, 14, 38, 114, …
$ Sample3  <dbl> 175, 54, 214, 172, 72, 34, 74, 13, 88, 49, 223, 237, 315, 145…
$ Sample4  <dbl> 54, 48, 92, 35, 18, 53, 18, 118, 200, 146, 169, 38, 93, 64, 1…
$ Sample5  <dbl> 54, 287, 86, 85, 3, 81, 369, 96, 1, 12, 101, 41, 135, 102, 82…
$ Sample6  <dbl> 67, 74, 110, 173, 27, 180, 82, 22, 146, 8, 148, 70, 0, 432, 1…
$ Sample7  <dbl> 50, 42, 149, 50, 12, 46, 22, 192, 39, 55, 15, 47, 78, 53, 3, …
$ Sample8  <dbl> 3, 54, 6, 142, 50, 312, 289, 264, 82, 244, 48, 208, 77, 167, …
$ Sample9  <dbl> 37, 118, 124, 32, 68, 480, 72, 94, 96, 156, 54, 46, 196, 30, …
$ Sample10 <dbl> 24, 28, 11, 198, 284, 86, 41, 21, 93, 51, 96, 175, 56, 229, 2…
$ Sample11 <dbl> 38, 42, 121, 138, 0, 46, 43, 66, 266, 14, 18, 96, 56, 9, 166,…
$ Sample12 <dbl> 361, 437, 388, 173, 590, 221, 334, 108, 583, 193, 428, 826, 3…
$ Sample13 <dbl> 238, 314, 324, 984, 1208, 287, 350, 438, 170, 346, 857, 986, …
$ Sample14 <dbl> 266, 586, 589, 889, 650, 189, 659, 190, 650, 172, 106, 434, 1…

tibble::glimpse(meta)

Rows: 14
Columns: 2
$ Sample    <chr> "Sample1", "Sample2", "Sample3", "Sample4", "Sample5", "Samp…
$ condition <chr> "Positive", "Positive", "Positive", "Positive", "Positive", …

Expectation:

counts_raw contains a gene identifier column plus one column per sample
meta contains one row per sample and a sample ID column

3.5 Validate sample alignment (namespace-safe)

Some Bioconductor packages define functions with the same names as dplyr verbs. In CDI, we avoid masking issues by using explicit namespaces (e.g., dplyr::slice).

# Identify the gene ID column (assume first column)
gene_id_col <- names(counts_raw)[1]

# Sample columns are everything except the gene ID
sample_cols <- setdiff(names(counts_raw), gene_id_col)

# Choose the metadata sample ID column
# Adjust here if your metadata uses a different column name
sample_id_col <- if ("sample_id" %in% names(meta)) "sample_id" else names(meta)[1]

stopifnot(length(sample_cols) > 1)
stopifnot(sample_id_col %in% names(meta))

# Are all count columns present in metadata?
missing_in_meta <- setdiff(sample_cols, meta[[sample_id_col]])
stopifnot(length(missing_in_meta) == 0)

# Are there metadata samples not present in counts?
missing_in_counts <- setdiff(meta[[sample_id_col]], sample_cols)
stopifnot(length(missing_in_counts) == 0)

# Reorder metadata to match the column order in counts (explicit namespaces)
meta <- meta |>
  dplyr::mutate(.sample_id = .data[[sample_id_col]]) |>
  dplyr::slice(match(sample_cols, .sample_id))

stopifnot(all(meta$.sample_id == sample_cols))

3.6 Compute basic QC summaries

We compute three simple summaries per sample:

Library size: total reads/counts per sample
Detected genes: number of genes with count > 0
Zero fraction: fraction of genes with count == 0

counts_mat <- counts_raw |>
  dplyr::select(-dplyr::all_of(gene_id_col)) |>
  as.matrix()

# Ensure numeric
storage.mode(counts_mat) <- "numeric"

qc_samples <- tibble::tibble(
  sample_id = sample_cols,
  library_size = colSums(counts_mat, na.rm = TRUE),
  detected_genes = colSums(counts_mat > 0, na.rm = TRUE),
  zero_fraction = colMeans(counts_mat == 0, na.rm = TRUE)
)

# Join sample metadata (explicit namespaces)
qc_samples <- qc_samples |>
  dplyr::left_join(meta |> dplyr::rename(sample_id = .sample_id), by = "sample_id")

qc_samples

# A tibble: 14 × 6
   sample_id library_size detected_genes zero_fraction Sample   condition
   <chr>            <dbl>          <dbl>         <dbl> <chr>    <chr>    
 1 Sample1          93252            989         0.011 Sample1  Positive 
 2 Sample2         106031            986         0.014 Sample2  Positive 
 3 Sample3         100478            984         0.016 Sample3  Positive 
 4 Sample4          99657            994         0.006 Sample4  Positive 
 5 Sample5          99824            994         0.006 Sample5  Positive 
 6 Sample6         101970            988         0.012 Sample6  Positive 
 7 Sample7         105678            994         0.006 Sample7  Positive 
 8 Sample8          99746            992         0.008 Sample8  Positive 
 9 Sample9         108620            996         0.004 Sample9  Positive 
10 Sample10        101298            992         0.008 Sample10 Positive 
11 Sample11         97301            987         0.013 Sample11 Positive 
12 Sample12        108093            989         0.011 Sample12 Negative 
13 Sample13        115129            980         0.02  Sample13 Negative 
14 Sample14        109639            987         0.013 Sample14 Negative

3.7 Save QC table for downstream analysis

These outputs are intentionally simple CSV files so they can be read in Python without any special tooling.

dir.create("results", showWarnings = FALSE, recursive = TRUE)
readr::write_csv(qc_samples, "results/03a-qc-samples.csv")

3.8 QC Visualization and Interpretation

CDI goal: read the saved QC tables from Lesson 03a and generate consistent QC figures using CDI’s Python visualization tooling.

3.9 Learning outcomes

By the end of this lesson, you will be able to:

Load the saved QC table from results/03a-qc-samples.csv
Initialize CDI plotting with cdi_notebook_init(chapter='03')
Create and save QC figures using show_and_save_mpl()
Interpret the QC plots to identify potential outliers

3.10 Load QC results

This lesson assumes Lesson 03a has been run and the QC table exists on disk.

from pathlib import Path
import pandas as pd

qc_path = Path('results/03a-qc-samples.csv')
print('QC file exists:', qc_path.exists())
if not qc_path.exists():
    raise FileNotFoundError('Missing results/03a-qc-samples.csv. Run Lesson 03a first.')

qc = pd.read_csv(qc_path)
qc.head()

QC file exists: True

	sample_id	library_size	detected_genes	zero_fraction	Sample	condition
0	Sample1	93252	989	0.011	Sample1	Positive
1	Sample2	106031	986	0.014	Sample2	Positive
2	Sample3	100478	984	0.016	Sample3	Positive
3	Sample4	99657	994	0.006	Sample4	Positive
4	Sample5	99824	994	0.006	Sample5	Positive

3.11 Quick sanity checks

Before plotting, confirm the table has the expected columns.

expected = {'sample_id','library_size','detected_genes','zero_fraction'}
print('Columns:', list(qc.columns))
missing = expected - set(qc.columns)
print('Missing expected columns:', missing)
qc[['library_size','detected_genes','zero_fraction']].describe()

Columns: ['sample_id', 'library_size', 'detected_genes', 'zero_fraction', 'Sample', 'condition']
Missing expected columns: set()

	library_size	detected_genes	zero_fraction
count	14.000000	14.000000	14.000000
mean	103336.857143	989.428571	0.010571
std	5772.304377	4.501526	0.004502
min	93252.000000	980.000000	0.004000
25%	99765.500000	987.000000	0.006500
50%	101634.000000	989.000000	0.011000
75%	107577.500000	993.500000	0.013000
max	115129.000000	996.000000	0.020000

3.12 Notebook initialization

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl
cdi_notebook_init(chapter='03', title_x=0)

3.13 Plot 1: library size distribution

Look for extreme low-depth samples, which can distort downstream normalization and DE.

import matplotlib.pyplot as plt

plt.figure()
plt.hist(qc['library_size'], bins=25)
plt.xlabel('Library size (total counts)')
plt.ylabel('Number of samples')
show_and_save_mpl()

Saved PNG → figures/03_001.png

3.14 Plot 2: detected genes per sample

Low detected gene counts can indicate poor library complexity.

plt.figure()
plt.hist(qc['detected_genes'], bins=25)
plt.xlabel('Detected genes (count > 0)')
plt.ylabel('Number of samples')
show_and_save_mpl()

Saved PNG → figures/03_002.png

3.15 Plot 3: zero fraction per sample

High zero fractions suggest sparsity or low depth.

plt.figure()
plt.hist(qc['zero_fraction'], bins=25)
plt.xlabel('Zero fraction (genes with count == 0)')
plt.ylabel('Number of samples')
show_and_save_mpl()

Saved PNG → figures/03_003.png

3.16 Interpreting QC in context

QC thresholds are study-dependent, but a practical CDI approach is:

Identify outliers (extremely low library size or complexity)
Check if outliers align with known technical variables (batch, run)
Decide whether to exclude, reprocess, or keep with caution

At this stage, the goal is not to remove samples aggressively, but to understand risk before modeling.

3.17 Takeaway

By separating QC computation (03a) from visualization (03b), you can regenerate figures quickly without recomputing inputs. This pattern will be reused throughout the guide.

Proceed to Lesson 04: Quantification and Count Matrix Concepts