Lesson 3 Data Intake and Basic QC Checks
3.1 Learning outcomes
By the end of this lesson, you will be able to:
- Load
demo_counts.csvanddemo_metadata.csv - Validate that sample identifiers match between counts and metadata
- Compute basic count-level QC summaries (library sizes, detected genes, zero fractions)
- Save QC tables for downstream visualization (a/b lesson pattern)
3.2 Why we do intake + QC before analysis
Before normalization or differential expression, we confirm that:
- counts and metadata refer to the same samples
- counts look plausible (no all-zero samples, no swapped identifiers)
- basic distributions do not contain obvious anomalies
This lesson performs computation + saving only. No plots are produced here by design.
3.3 Load demo data
The guide reuses the small demo datasets from the earlier CDI RNA-Seq Q&A versions.
3.4 Inspect structure
Rows: 1,000
Columns: 15
$ gene <chr> "Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7"…
$ Sample1 <dbl> 186, 45, 38, 293, 215, 246, 99, 8, 38, 197, 28, 98, 93, 142, …
$ Sample2 <dbl> 175, 101, 12, 0, 377, 376, 115, 13, 66, 82, 89, 14, 38, 114, …
$ Sample3 <dbl> 175, 54, 214, 172, 72, 34, 74, 13, 88, 49, 223, 237, 315, 145…
$ Sample4 <dbl> 54, 48, 92, 35, 18, 53, 18, 118, 200, 146, 169, 38, 93, 64, 1…
$ Sample5 <dbl> 54, 287, 86, 85, 3, 81, 369, 96, 1, 12, 101, 41, 135, 102, 82…
$ Sample6 <dbl> 67, 74, 110, 173, 27, 180, 82, 22, 146, 8, 148, 70, 0, 432, 1…
$ Sample7 <dbl> 50, 42, 149, 50, 12, 46, 22, 192, 39, 55, 15, 47, 78, 53, 3, …
$ Sample8 <dbl> 3, 54, 6, 142, 50, 312, 289, 264, 82, 244, 48, 208, 77, 167, …
$ Sample9 <dbl> 37, 118, 124, 32, 68, 480, 72, 94, 96, 156, 54, 46, 196, 30, …
$ Sample10 <dbl> 24, 28, 11, 198, 284, 86, 41, 21, 93, 51, 96, 175, 56, 229, 2…
$ Sample11 <dbl> 38, 42, 121, 138, 0, 46, 43, 66, 266, 14, 18, 96, 56, 9, 166,…
$ Sample12 <dbl> 361, 437, 388, 173, 590, 221, 334, 108, 583, 193, 428, 826, 3…
$ Sample13 <dbl> 238, 314, 324, 984, 1208, 287, 350, 438, 170, 346, 857, 986, …
$ Sample14 <dbl> 266, 586, 589, 889, 650, 189, 659, 190, 650, 172, 106, 434, 1…
Rows: 14
Columns: 2
$ Sample <chr> "Sample1", "Sample2", "Sample3", "Sample4", "Sample5", "Samp…
$ condition <chr> "Positive", "Positive", "Positive", "Positive", "Positive", …
Expectation:
counts_rawcontains a gene identifier column plus one column per samplemetacontains one row per sample and a sample ID column
3.5 Validate sample alignment (namespace-safe)
Some Bioconductor packages define functions with the same names as dplyr verbs.
In CDI, we avoid masking issues by using explicit namespaces (e.g., dplyr::slice).
# Identify the gene ID column (assume first column)
gene_id_col <- names(counts_raw)[1]
# Sample columns are everything except the gene ID
sample_cols <- setdiff(names(counts_raw), gene_id_col)
# Choose the metadata sample ID column
# Adjust here if your metadata uses a different column name
sample_id_col <- if ("sample_id" %in% names(meta)) "sample_id" else names(meta)[1]
stopifnot(length(sample_cols) > 1)
stopifnot(sample_id_col %in% names(meta))
# Are all count columns present in metadata?
missing_in_meta <- setdiff(sample_cols, meta[[sample_id_col]])
stopifnot(length(missing_in_meta) == 0)
# Are there metadata samples not present in counts?
missing_in_counts <- setdiff(meta[[sample_id_col]], sample_cols)
stopifnot(length(missing_in_counts) == 0)
# Reorder metadata to match the column order in counts (explicit namespaces)
meta <- meta |>
dplyr::mutate(.sample_id = .data[[sample_id_col]]) |>
dplyr::slice(match(sample_cols, .sample_id))
stopifnot(all(meta$.sample_id == sample_cols))3.6 Compute basic QC summaries
We compute three simple summaries per sample:
- Library size: total reads/counts per sample
- Detected genes: number of genes with count > 0
- Zero fraction: fraction of genes with count == 0
counts_mat <- counts_raw |>
dplyr::select(-dplyr::all_of(gene_id_col)) |>
as.matrix()
# Ensure numeric
storage.mode(counts_mat) <- "numeric"
qc_samples <- tibble::tibble(
sample_id = sample_cols,
library_size = colSums(counts_mat, na.rm = TRUE),
detected_genes = colSums(counts_mat > 0, na.rm = TRUE),
zero_fraction = colMeans(counts_mat == 0, na.rm = TRUE)
)
# Join sample metadata (explicit namespaces)
qc_samples <- qc_samples |>
dplyr::left_join(meta |> dplyr::rename(sample_id = .sample_id), by = "sample_id")
qc_samples# A tibble: 14 × 6
sample_id library_size detected_genes zero_fraction Sample condition
<chr> <dbl> <dbl> <dbl> <chr> <chr>
1 Sample1 93252 989 0.011 Sample1 Positive
2 Sample2 106031 986 0.014 Sample2 Positive
3 Sample3 100478 984 0.016 Sample3 Positive
4 Sample4 99657 994 0.006 Sample4 Positive
5 Sample5 99824 994 0.006 Sample5 Positive
6 Sample6 101970 988 0.012 Sample6 Positive
7 Sample7 105678 994 0.006 Sample7 Positive
8 Sample8 99746 992 0.008 Sample8 Positive
9 Sample9 108620 996 0.004 Sample9 Positive
10 Sample10 101298 992 0.008 Sample10 Positive
11 Sample11 97301 987 0.013 Sample11 Positive
12 Sample12 108093 989 0.011 Sample12 Negative
13 Sample13 115129 980 0.02 Sample13 Negative
14 Sample14 109639 987 0.013 Sample14 Negative
3.7 Save QC table for downstream analysis
These outputs are intentionally simple CSV files so they can be read in Python without any special tooling.
3.9 Learning outcomes
By the end of this lesson, you will be able to:
- Load the saved QC table from
results/03a-qc-samples.csv - Initialize CDI plotting with
cdi_notebook_init(chapter='03') - Create and save QC figures using
show_and_save_mpl() - Interpret the QC plots to identify potential outliers
3.10 Load QC results
This lesson assumes Lesson 03a has been run and the QC table exists on disk.
from pathlib import Path
import pandas as pd
qc_path = Path('results/03a-qc-samples.csv')
print('QC file exists:', qc_path.exists())
if not qc_path.exists():
raise FileNotFoundError('Missing results/03a-qc-samples.csv. Run Lesson 03a first.')
qc = pd.read_csv(qc_path)
qc.head()QC file exists: True
| sample_id | library_size | detected_genes | zero_fraction | Sample | condition | |
|---|---|---|---|---|---|---|
| 0 | Sample1 | 93252 | 989 | 0.011 | Sample1 | Positive |
| 1 | Sample2 | 106031 | 986 | 0.014 | Sample2 | Positive |
| 2 | Sample3 | 100478 | 984 | 0.016 | Sample3 | Positive |
| 3 | Sample4 | 99657 | 994 | 0.006 | Sample4 | Positive |
| 4 | Sample5 | 99824 | 994 | 0.006 | Sample5 | Positive |
3.11 Quick sanity checks
Before plotting, confirm the table has the expected columns.
expected = {'sample_id','library_size','detected_genes','zero_fraction'}
print('Columns:', list(qc.columns))
missing = expected - set(qc.columns)
print('Missing expected columns:', missing)
qc[['library_size','detected_genes','zero_fraction']].describe()Columns: ['sample_id', 'library_size', 'detected_genes', 'zero_fraction', 'Sample', 'condition']
Missing expected columns: set()
| library_size | detected_genes | zero_fraction | |
|---|---|---|---|
| count | 14.000000 | 14.000000 | 14.000000 |
| mean | 103336.857143 | 989.428571 | 0.010571 |
| std | 5772.304377 | 4.501526 | 0.004502 |
| min | 93252.000000 | 980.000000 | 0.004000 |
| 25% | 99765.500000 | 987.000000 | 0.006500 |
| 50% | 101634.000000 | 989.000000 | 0.011000 |
| 75% | 107577.500000 | 993.500000 | 0.013000 |
| max | 115129.000000 | 996.000000 | 0.020000 |
3.13 Plot 1: library size distribution
Look for extreme low-depth samples, which can distort downstream normalization and DE.
import matplotlib.pyplot as plt
plt.figure()
plt.hist(qc['library_size'], bins=25)
plt.xlabel('Library size (total counts)')
plt.ylabel('Number of samples')
show_and_save_mpl()Saved PNG → figures/03_001.png

3.14 Plot 2: detected genes per sample
Low detected gene counts can indicate poor library complexity.
plt.figure()
plt.hist(qc['detected_genes'], bins=25)
plt.xlabel('Detected genes (count > 0)')
plt.ylabel('Number of samples')
show_and_save_mpl()Saved PNG → figures/03_002.png

3.15 Plot 3: zero fraction per sample
High zero fractions suggest sparsity or low depth.
plt.figure()
plt.hist(qc['zero_fraction'], bins=25)
plt.xlabel('Zero fraction (genes with count == 0)')
plt.ylabel('Number of samples')
show_and_save_mpl()Saved PNG → figures/03_003.png

3.16 Interpreting QC in context
QC thresholds are study-dependent, but a practical CDI approach is:
- Identify outliers (extremely low library size or complexity)
- Check if outliers align with known technical variables (batch, run)
- Decide whether to exclude, reprocess, or keep with caution
At this stage, the goal is not to remove samples aggressively, but to understand risk before modeling.