Lesson 3 Data Intake and Basic QC Checks

CDI goal: load the demo RNA-Seq count matrix and metadata, validate alignment, compute basic QC summaries, and save them for plotting in Lesson 03b.

3.1 Learning outcomes

By the end of this lesson, you will be able to:

  • Load demo_counts.csv and demo_metadata.csv
  • Validate that sample identifiers match between counts and metadata
  • Compute basic count-level QC summaries (library sizes, detected genes, zero fractions)
  • Save QC tables for downstream visualization (a/b lesson pattern)

3.2 Why we do intake + QC before analysis

Before normalization or differential expression, we confirm that:

  • counts and metadata refer to the same samples
  • counts look plausible (no all-zero samples, no swapped identifiers)
  • basic distributions do not contain obvious anomalies

This lesson performs computation + saving only. No plots are produced here by design.

3.3 Load demo data

The guide reuses the small demo datasets from the earlier CDI RNA-Seq Q&A versions.

library(tidyverse)

counts_raw <- readr::read_csv("data/demo_counts.csv")
meta <- readr::read_csv("data/demo_metadata.csv")

3.4 Inspect structure

tibble::glimpse(counts_raw)
Rows: 1,000
Columns: 15
$ gene     <chr> "Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7"…
$ Sample1  <dbl> 186, 45, 38, 293, 215, 246, 99, 8, 38, 197, 28, 98, 93, 142, …
$ Sample2  <dbl> 175, 101, 12, 0, 377, 376, 115, 13, 66, 82, 89, 14, 38, 114, …
$ Sample3  <dbl> 175, 54, 214, 172, 72, 34, 74, 13, 88, 49, 223, 237, 315, 145…
$ Sample4  <dbl> 54, 48, 92, 35, 18, 53, 18, 118, 200, 146, 169, 38, 93, 64, 1…
$ Sample5  <dbl> 54, 287, 86, 85, 3, 81, 369, 96, 1, 12, 101, 41, 135, 102, 82…
$ Sample6  <dbl> 67, 74, 110, 173, 27, 180, 82, 22, 146, 8, 148, 70, 0, 432, 1…
$ Sample7  <dbl> 50, 42, 149, 50, 12, 46, 22, 192, 39, 55, 15, 47, 78, 53, 3, …
$ Sample8  <dbl> 3, 54, 6, 142, 50, 312, 289, 264, 82, 244, 48, 208, 77, 167, …
$ Sample9  <dbl> 37, 118, 124, 32, 68, 480, 72, 94, 96, 156, 54, 46, 196, 30, …
$ Sample10 <dbl> 24, 28, 11, 198, 284, 86, 41, 21, 93, 51, 96, 175, 56, 229, 2…
$ Sample11 <dbl> 38, 42, 121, 138, 0, 46, 43, 66, 266, 14, 18, 96, 56, 9, 166,…
$ Sample12 <dbl> 361, 437, 388, 173, 590, 221, 334, 108, 583, 193, 428, 826, 3…
$ Sample13 <dbl> 238, 314, 324, 984, 1208, 287, 350, 438, 170, 346, 857, 986, …
$ Sample14 <dbl> 266, 586, 589, 889, 650, 189, 659, 190, 650, 172, 106, 434, 1…
tibble::glimpse(meta)
Rows: 14
Columns: 2
$ Sample    <chr> "Sample1", "Sample2", "Sample3", "Sample4", "Sample5", "Samp…
$ condition <chr> "Positive", "Positive", "Positive", "Positive", "Positive", …

Expectation:

  • counts_raw contains a gene identifier column plus one column per sample
  • meta contains one row per sample and a sample ID column

3.5 Validate sample alignment (namespace-safe)

Some Bioconductor packages define functions with the same names as dplyr verbs. In CDI, we avoid masking issues by using explicit namespaces (e.g., dplyr::slice).

# Identify the gene ID column (assume first column)
gene_id_col <- names(counts_raw)[1]

# Sample columns are everything except the gene ID
sample_cols <- setdiff(names(counts_raw), gene_id_col)

# Choose the metadata sample ID column
# Adjust here if your metadata uses a different column name
sample_id_col <- if ("sample_id" %in% names(meta)) "sample_id" else names(meta)[1]

stopifnot(length(sample_cols) > 1)
stopifnot(sample_id_col %in% names(meta))

# Are all count columns present in metadata?
missing_in_meta <- setdiff(sample_cols, meta[[sample_id_col]])
stopifnot(length(missing_in_meta) == 0)

# Are there metadata samples not present in counts?
missing_in_counts <- setdiff(meta[[sample_id_col]], sample_cols)
stopifnot(length(missing_in_counts) == 0)

# Reorder metadata to match the column order in counts (explicit namespaces)
meta <- meta |>
  dplyr::mutate(.sample_id = .data[[sample_id_col]]) |>
  dplyr::slice(match(sample_cols, .sample_id))

stopifnot(all(meta$.sample_id == sample_cols))

3.6 Compute basic QC summaries

We compute three simple summaries per sample:

  • Library size: total reads/counts per sample
  • Detected genes: number of genes with count > 0
  • Zero fraction: fraction of genes with count == 0
counts_mat <- counts_raw |>
  dplyr::select(-dplyr::all_of(gene_id_col)) |>
  as.matrix()

# Ensure numeric
storage.mode(counts_mat) <- "numeric"

qc_samples <- tibble::tibble(
  sample_id = sample_cols,
  library_size = colSums(counts_mat, na.rm = TRUE),
  detected_genes = colSums(counts_mat > 0, na.rm = TRUE),
  zero_fraction = colMeans(counts_mat == 0, na.rm = TRUE)
)

# Join sample metadata (explicit namespaces)
qc_samples <- qc_samples |>
  dplyr::left_join(meta |> dplyr::rename(sample_id = .sample_id), by = "sample_id")

qc_samples
# A tibble: 14 × 6
   sample_id library_size detected_genes zero_fraction Sample   condition
   <chr>            <dbl>          <dbl>         <dbl> <chr>    <chr>    
 1 Sample1          93252            989         0.011 Sample1  Positive 
 2 Sample2         106031            986         0.014 Sample2  Positive 
 3 Sample3         100478            984         0.016 Sample3  Positive 
 4 Sample4          99657            994         0.006 Sample4  Positive 
 5 Sample5          99824            994         0.006 Sample5  Positive 
 6 Sample6         101970            988         0.012 Sample6  Positive 
 7 Sample7         105678            994         0.006 Sample7  Positive 
 8 Sample8          99746            992         0.008 Sample8  Positive 
 9 Sample9         108620            996         0.004 Sample9  Positive 
10 Sample10        101298            992         0.008 Sample10 Positive 
11 Sample11         97301            987         0.013 Sample11 Positive 
12 Sample12        108093            989         0.011 Sample12 Negative 
13 Sample13        115129            980         0.02  Sample13 Negative 
14 Sample14        109639            987         0.013 Sample14 Negative 

3.7 Save QC table for downstream analysis

These outputs are intentionally simple CSV files so they can be read in Python without any special tooling.

dir.create("results", showWarnings = FALSE, recursive = TRUE)
readr::write_csv(qc_samples, "results/03a-qc-samples.csv")

3.8 QC Visualization and Interpretation

CDI goal: read the saved QC tables from Lesson 03a and generate consistent QC figures using CDI’s Python visualization tooling.

3.9 Learning outcomes

By the end of this lesson, you will be able to:

  • Load the saved QC table from results/03a-qc-samples.csv
  • Initialize CDI plotting with cdi_notebook_init(chapter='03')
  • Create and save QC figures using show_and_save_mpl()
  • Interpret the QC plots to identify potential outliers

3.10 Load QC results

This lesson assumes Lesson 03a has been run and the QC table exists on disk.

from pathlib import Path
import pandas as pd

qc_path = Path('results/03a-qc-samples.csv')
print('QC file exists:', qc_path.exists())
if not qc_path.exists():
    raise FileNotFoundError('Missing results/03a-qc-samples.csv. Run Lesson 03a first.')

qc = pd.read_csv(qc_path)
qc.head()
QC file exists: True
sample_id library_size detected_genes zero_fraction Sample condition
0 Sample1 93252 989 0.011 Sample1 Positive
1 Sample2 106031 986 0.014 Sample2 Positive
2 Sample3 100478 984 0.016 Sample3 Positive
3 Sample4 99657 994 0.006 Sample4 Positive
4 Sample5 99824 994 0.006 Sample5 Positive

3.11 Quick sanity checks

Before plotting, confirm the table has the expected columns.

expected = {'sample_id','library_size','detected_genes','zero_fraction'}
print('Columns:', list(qc.columns))
missing = expected - set(qc.columns)
print('Missing expected columns:', missing)
qc[['library_size','detected_genes','zero_fraction']].describe()
Columns: ['sample_id', 'library_size', 'detected_genes', 'zero_fraction', 'Sample', 'condition']
Missing expected columns: set()
library_size detected_genes zero_fraction
count 14.000000 14.000000 14.000000
mean 103336.857143 989.428571 0.010571
std 5772.304377 4.501526 0.004502
min 93252.000000 980.000000 0.004000
25% 99765.500000 987.000000 0.006500
50% 101634.000000 989.000000 0.011000
75% 107577.500000 993.500000 0.013000
max 115129.000000 996.000000 0.020000

3.12 Notebook initialization

from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl
cdi_notebook_init(chapter='03', title_x=0)

3.13 Plot 1: library size distribution

Look for extreme low-depth samples, which can distort downstream normalization and DE.

import matplotlib.pyplot as plt

plt.figure()
plt.hist(qc['library_size'], bins=25)
plt.xlabel('Library size (total counts)')
plt.ylabel('Number of samples')
show_and_save_mpl()
Saved PNG → figures/03_001.png

3.14 Plot 2: detected genes per sample

Low detected gene counts can indicate poor library complexity.

plt.figure()
plt.hist(qc['detected_genes'], bins=25)
plt.xlabel('Detected genes (count > 0)')
plt.ylabel('Number of samples')
show_and_save_mpl()
Saved PNG → figures/03_002.png

3.15 Plot 3: zero fraction per sample

High zero fractions suggest sparsity or low depth.

plt.figure()
plt.hist(qc['zero_fraction'], bins=25)
plt.xlabel('Zero fraction (genes with count == 0)')
plt.ylabel('Number of samples')
show_and_save_mpl()
Saved PNG → figures/03_003.png

3.16 Interpreting QC in context

QC thresholds are study-dependent, but a practical CDI approach is:

  • Identify outliers (extremely low library size or complexity)
  • Check if outliers align with known technical variables (batch, run)
  • Decide whether to exclude, reprocess, or keep with caution

At this stage, the goal is not to remove samples aggressively, but to understand risk before modeling.

3.17 Takeaway

By separating QC computation (03a) from visualization (03b), you can regenerate figures quickly without recomputing inputs. This pattern will be reused throughout the guide.