Lesson 4 Quantification and Count Matrix Concepts
4.1 Learning outcomes
By the end of this lesson, you will be able to:
- Explain what RNA-Seq quantification means
- Distinguish raw counts from normalized or transformed values
- Understand the structure of a gene × sample count matrix
- Recognize common sources of bias introduced during quantification
- Relate the demo count matrix to downstream differential expression analysis
4.2 What is RNA-Seq quantification?
After sequencing and alignment (or pseudo-alignment), RNA-Seq pipelines produce quantified expression values.
In the simplest case, this is a count: the number of reads assigned to each gene (or transcript) in each sample.
These counts form the raw material for downstream statistical analysis.
4.3 From reads to counts (conceptual overview)
Although tools differ (e.g., alignment-based vs pseudo-alignment), most pipelines follow the same logic:
- Sequence reads are generated from RNA molecules
- Reads are assigned to genomic features (genes or transcripts)
- Assignments are summarized into a matrix
The output is a count matrix with:
- Rows = genes (or transcripts)
- Columns = samples
- Values = integer counts (pseudo-alignment tools may report fractional estimated counts)
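The summarization step can be illustrated with a toy example. The sketch below assumes a hypothetical table of read assignments (one row per read, with made-up gene and sample labels) and cross-tabulates it into a gene × sample matrix. Real quantification tools perform this step internally and resolve ambiguous reads in tool-specific ways, so this is only a conceptual illustration.

```r
# Hypothetical read assignments: one row per read (toy data, not from the demo file)
assignments <- data.frame(
  gene   = c("GeneA", "GeneA", "GeneB", "GeneA", "GeneB", "GeneC"),
  sample = c("S1",    "S1",    "S1",    "S2",    "S2",    "S2")
)

# Cross-tabulate reads by gene (rows) and sample (columns);
# unclass() turns the table into a plain integer matrix
toy_counts <- unclass(table(assignments$gene, assignments$sample))
toy_counts
```

Genes with no assigned reads in a sample (here GeneC in S1) appear as zeros rather than being dropped.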
4.4 The demo count matrix used in this guide
In CDI, we work with a small demo dataset to focus on reasoning rather than scale.
The file:
data/demo_counts.csv
contains:
- One row per gene
- One column per sample
- Raw integer counts
# A tibble: 1,000 × 15
gene Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Gene1 186 175 175 54 54 67 50 3 37
2 Gene2 45 101 54 48 287 74 42 54 118
3 Gene3 38 12 214 92 86 110 149 6 124
4 Gene4 293 0 172 35 85 173 50 142 32
5 Gene5 215 377 72 18 3 27 12 50 68
6 Gene6 246 376 34 53 81 180 46 312 480
7 Gene7 99 115 74 18 369 82 22 289 72
8 Gene8 8 13 13 118 96 22 192 264 94
9 Gene9 38 66 88 200 1 146 39 82 96
10 Gene… 197 82 49 146 12 8 55 244 156
# ℹ 990 more rows
# ℹ 5 more variables: Sample10 <dbl>, Sample11 <dbl>, Sample12 <dbl>,
# Sample13 <dbl>, Sample14 <dbl>
4.5 Gene identifiers and sample columns
In most RNA-Seq workflows:
- The first column identifies the gene (e.g., gene ID or symbol)
- Remaining columns correspond to samples
This separation is important, because statistical models operate on the numeric matrix only.
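# counts_raw is assumed to hold the table loaded from data/demo_counts.csv
# (first column = gene identifiers, remaining columns = samples)
gene_ids <- counts_raw[[1]]
# Drop the identifier column and convert the sample columns to a numeric matrix
counts_mat <- counts_raw |>
  dplyr::select(-1) |>
  as.matrix()
storage.mode(counts_mat) <- "numeric"
dim(counts_mat)
[1] 1000 14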
4.6 Properties of raw counts
Raw RNA-Seq counts have characteristic properties:
- Non-negative integers
- Strongly right-skewed distributions
- Depend on sequencing depth
- Depend on gene length and composition
These properties explain why raw counts are not directly comparable across samples (different sequencing depths) or across genes within a sample (different lengths).
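These properties are easy to check programmatically. The sketch below uses a simulated matrix (negative binomial draws with arbitrary parameters, not estimated from the demo data) and a hypothetical helper, `looks_like_raw_counts()`, that tests for non-negative whole numbers.

```r
# Toy gene x sample matrix drawn from a negative binomial distribution
# (simulated stand-in; parameters are arbitrary)
set.seed(1)
toy <- matrix(rnbinom(200 * 4, mu = 100, size = 0.5), nrow = 200,
              dimnames = list(paste0("g", 1:200), paste0("S", 1:4)))

# Hypothetical helper: TRUE when a matrix is plausibly raw counts
looks_like_raw_counts <- function(m) {
  is.numeric(m) && !anyNA(m) && all(m >= 0) && all(m == round(m))
}

looks_like_raw_counts(toy)            # simulated raw counts
looks_like_raw_counts(log2(toy + 1))  # log-transformed values

# Right skew: the mean sits above the median
c(mean = mean(toy), median = median(toy))
```

A check like this is a cheap guard before passing a matrix to a count-based model, although it cannot prove that values are raw counts (rounded normalized values would also pass).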
4.7 Library size and sequencing depth
The total number of counts per sample is often called the library size.
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8
93252 106031 100478 99657 99824 101970 105678 99746
Sample9 Sample10 Sample11 Sample12 Sample13 Sample14
108620 101298 97301 108093 115129 109639
Samples with larger library sizes tend to have larger raw counts even if the underlying biology is identical.
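Library sizes are simply column sums of the count matrix; for the demo data this would presumably be `colSums(counts_mat)`. The sketch below uses a made-up two-sample matrix in which one sample is sequenced exactly twice as deeply as the other, to show that raw counts differ while within-sample proportions do not.

```r
# Toy counts: 3 genes x 2 samples (made-up values)
toy <- matrix(c(10, 20, 70,
                20, 40, 140),
              nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("shallow", "deep")))

# Library size = total counts per sample (column sums)
lib_size <- colSums(toy)
lib_size

# Divide each column by its library size to get within-sample proportions
prop <- sweep(toy, 2, lib_size, "/")
prop
```

Every gene in `deep` has twice the raw count of `shallow`, yet the two columns of `prop` are identical.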
4.8 Why normalization is required
Normalization aims to remove technical effects so that remaining differences reflect biology.
Typical goals:
- Adjust for sequencing depth
- Reduce composition bias
- Preserve relative expression differences
Different methods make different assumptions, which is why understanding the count matrix matters.
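To make these goals concrete, here are two base-R sketches on made-up counts: counts per million (CPM), which adjusts for depth only, and median-of-ratios size factors, the idea behind DESeq2's default normalization. This is a simplified illustration that assumes no zero counts in the toy matrix; it is not the package's implementation.

```r
# Toy counts: 4 genes x 3 samples (made-up values, no zeros)
toy <- matrix(c(100, 200, 300, 400,
                200, 400, 600, 800,
                100, 200, 300, 4000),
              nrow = 4,
              dimnames = list(paste0("g", 1:4), paste0("S", 1:3)))

# Depth-only scaling: counts per million
cpm <- sweep(toy, 2, colSums(toy), "/") * 1e6

# Median-of-ratios size factors
# 1. per-gene log geometric mean across samples (the reference)
log_geo_mean <- rowMeans(log(toy))
# 2. per-sample median of the log ratios to that reference
size_factors <- exp(apply(log(toy) - log_geo_mean, 2, median))
# 3. divide each column by its size factor
norm_counts <- sweep(toy, 2, size_factors, "/")

size_factors
```

In S3 a single highly expressed gene (g4) inflates the library size, so CPM lowers g1 to g3 relative to S1. The median-of-ratios factors for S1 and S3 agree, because most genes are unchanged between them; this is the composition bias the second method is designed to reduce.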
4.9 Counts vs transformed values
It is crucial to distinguish:
- Raw counts → used for modeling (e.g., DESeq2)
- Normalized counts → adjusted for depth/composition
- Transformed values (e.g., log, rlog, VST) → used for visualization and EDA
In CDI, we keep these representations explicitly separate.
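One way to keep the representations separate is to store them under distinct names and never overwrite the raw matrix. The sketch below uses made-up counts, a simple depth scaling as the normalization, and a plain log2 transform with a pseudocount as a stand-in for rlog or VST (which are provided by DESeq2 and not reproduced here). The object names are illustrative.

```r
# Toy raw counts (made-up values)
raw <- matrix(c(0, 10, 100, 1000,
                0, 20, 200, 2000),
              nrow = 4,
              dimnames = list(paste0("g", 1:4), c("S1", "S2")))

# Normalized: scaled to the mean library size (one simple choice among many)
scale_factor <- colSums(raw) / mean(colSums(raw))
normalized   <- sweep(raw, 2, scale_factor, "/")

# Transformed: log2 with a pseudocount of 1, for plots and exploration only
log_norm <- log2(normalized + 1)

# Keep the three representations in clearly named slots
expr <- list(raw = raw, normalized = normalized, log2_normalized = log_norm)
str(expr, max.level = 1)
```

Only `expr$raw` would be passed to a count-based model; the other two slots are for comparison across samples and for visualization.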
4.10 Preview: rlog matrix
This guide also provides a precomputed transformation for exploration:
data/rlog_matrix.csv
This file is not used for differential expression modeling. It exists to support visualization and intuition.
# A tibble: 1,000 × 15
gene Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Gene1 7.05 6.87 6.96 5.72 5.76 5.87 5.57 3.63 5.26
2 Gene2 5.72 6.40 5.87 5.73 7.58 6.08 5.54 5.82 6.48
3 Gene3 5.54 4.55 7.22 6.32 6.30 6.43 6.73 4.15 6.50
4 Gene4 7.63 4.85 7.11 5.88 6.53 7.04 6.06 6.89 5.76
5 Gene5 7.18 7.62 6.22 5.31 4.76 5.49 5.09 5.90 6.06
6 Gene6 7.54 7.87 5.71 6.04 6.45 7.10 5.86 7.70 8.05
7 Gene7 6.54 6.59 6.25 5.09 7.87 6.25 5.17 7.53 6.08
8 Gene8 4.08 4.35 4.41 6.45 6.28 4.76 6.89 7.29 6.09
9 Gene9 5.64 6.04 6.38 7.16 3.90 6.78 5.55 6.26 6.32
10 Gene… 7.11 6.09 5.66 6.74 4.47 4.12 5.67 7.26 6.68
# ℹ 990 more rows
# ℹ 5 more variables: Sample10 <dbl>, Sample11 <dbl>, Sample12 <dbl>,
# Sample13 <dbl>, Sample14 <dbl>
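A common exploratory use of a transformed matrix is a sample-level principal component analysis. The sketch below runs `prcomp()` on a simulated log-scale matrix that stands in for the rlog values (the real file is not read here, and the simulated numbers carry no biological meaning). With the actual data, the numeric matrix would first be built from data/rlog_matrix.csv by dropping the gene column.

```r
# Simulated stand-in for a log-scale gene x sample matrix (not the real rlog values)
set.seed(42)
toy_log <- matrix(rnorm(50 * 6, mean = 6, sd = 1), nrow = 50,
                  dimnames = list(paste0("g", 1:50), paste0("S", 1:6)))

# prcomp() expects observations in rows, so transpose to samples x genes
pca <- prcomp(t(toy_log), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Sample coordinates on the first two components
head(pca$x[, 1:2])
```

The rows of `pca$x` are samples, so plotting its first two columns gives the usual sample-level PCA view.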
4.11 Common pitfalls
- Treating normalized values as raw counts
- Filtering genes before inspecting library sizes
- Mixing transformed and untransformed data in the same analysis
- Forgetting how the count matrix was generated
Most RNA-Seq errors originate before statistical testing.