Lesson 4 Quantification and Count Matrix Concepts

CDI goal: understand what RNA-Seq quantification represents, how count matrices are constructed, and what assumptions they encode before statistical modeling.

4.1 Learning outcomes

By the end of this lesson, you will be able to:

Explain what RNA-Seq quantification means
Distinguish raw counts from normalized or transformed values
Understand the structure of a gene × sample count matrix
Recognize common sources of bias introduced during quantification
Relate the demo count matrix to downstream differential expression analysis

4.2 What is RNA-Seq quantification?

After sequencing and alignment (or pseudo-alignment), RNA-Seq pipelines produce quantified expression values.

In the simplest case, each value is a count: the number of reads assigned to a gene (or transcript) in a given sample.

These counts form the raw material for downstream statistical analysis.

4.3 From reads to counts (conceptual overview)

Although tools differ (e.g., alignment-based vs pseudo-alignment), most pipelines follow the same logic:

Sequence reads are generated from RNA molecules
Reads are assigned to genomic features (genes or transcripts)
Assignments are summarized into a matrix

The output is a count matrix with:

Rows = genes (or transcripts)
Columns = samples
Values = integer counts (pseudo-alignment tools may instead report estimated, non-integer counts)
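
The summarization step can be sketched with a toy example. The read assignments below are invented for illustration (the genes GeneA and GeneB and the samples S1 and S2 are hypothetical); in a real pipeline they would come from an aligner or pseudo-aligner followed by a counting step.

```r
# Toy read-to-gene assignments (hypothetical data, one row per assigned read)
assignments <- data.frame(
  sample = c("S1", "S1", "S1", "S2", "S2", "S2", "S2"),
  gene   = c("GeneA", "GeneA", "GeneB", "GeneA", "GeneB", "GeneB", "GeneB")
)

# Cross-tabulate: rows = genes, columns = samples, values = read counts
toy_counts <- unclass(table(assignments$gene, assignments$sample))

toy_counts
```

Each cell answers one question: how many reads from this sample were assigned to this gene.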

4.4 The demo count matrix used in this guide

In CDI, we work with a small demo dataset to focus on reasoning rather than scale.

The file:

data/demo_counts.csv

contains:

One row per gene
One column per sample
Raw integer counts

library(tidyverse)

counts_raw <- readr::read_csv("data/demo_counts.csv")

counts_raw

# A tibble: 1,000 × 15
   gene  Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Gene1     186     175     175      54      54      67      50       3      37
 2 Gene2      45     101      54      48     287      74      42      54     118
 3 Gene3      38      12     214      92      86     110     149       6     124
 4 Gene4     293       0     172      35      85     173      50     142      32
 5 Gene5     215     377      72      18       3      27      12      50      68
 6 Gene6     246     376      34      53      81     180      46     312     480
 7 Gene7      99     115      74      18     369      82      22     289      72
 8 Gene8       8      13      13     118      96      22     192     264      94
 9 Gene9      38      66      88     200       1     146      39      82      96
10 Gene…     197      82      49     146      12       8      55     244     156
# ℹ 990 more rows
# ℹ 5 more variables: Sample10 <dbl>, Sample11 <dbl>, Sample12 <dbl>,
#   Sample13 <dbl>, Sample14 <dbl>

4.5 Gene identifiers and sample columns

In most RNA-Seq workflows:

The first column identifies the gene (e.g., gene ID or symbol)
Remaining columns correspond to samples

This separation is important, because statistical models operate on the numeric matrix only.

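# Keep the gene identifiers separately from the numeric data
gene_ids <- counts_raw[[1]]

# Drop the identifier column and convert the sample columns to a matrix
counts_mat <- counts_raw |>
  dplyr::select(-1) |>
  as.matrix()

# Store values as doubles and label the rows with the gene identifiers
storage.mode(counts_mat) <- "numeric"
rownames(counts_mat) <- gene_ids

dim(counts_mat)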

[1] 1000   14

4.6 Properties of raw counts

Raw RNA-Seq counts have characteristic properties:

Non-negative integers
Strongly right-skewed distributions
Depend on sequencing depth
Depend on gene length and composition

These properties explain why raw counts are not directly comparable across samples (which differ in sequencing depth) or across genes (which differ in length).
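
These properties can be checked on any count matrix. The sketch below uses simulated data rather than the demo file so that it runs on its own; the negative binomial parameters (`mu = 100`, `size = 0.5`) are arbitrary assumptions chosen to produce a visibly skewed distribution.

```r
set.seed(1)

# Simulated counts (hypothetical): 200 genes x 4 samples, negative binomial
sim_counts <- matrix(
  rnbinom(200 * 4, mu = 100, size = 0.5),
  nrow = 200,
  dimnames = list(paste0("Gene", 1:200), paste0("S", 1:4))
)

# Non-negative whole numbers
all(sim_counts >= 0)
all(sim_counts == round(sim_counts))

# Right skew: the mean sits above the median
mean(sim_counts)
median(sim_counts)
```

The same checks could be run on `counts_mat` once it has been built from the demo file.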

4.7 Library size and sequencing depth

The total number of counts per sample is often called the library size.

library_sizes <- colSums(counts_mat)

library_sizes

 Sample1  Sample2  Sample3  Sample4  Sample5  Sample6  Sample7  Sample8 
   93252   106031   100478    99657    99824   101970   105678    99746 
 Sample9 Sample10 Sample11 Sample12 Sample13 Sample14 
  108620   101298    97301   108093   115129   109639

Samples with larger library sizes tend to have larger raw counts even when the underlying biology is identical.
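
A minimal illustration of the depth effect, using invented numbers: two hypothetical samples share exactly the same expression proportions, but one is sequenced twice as deeply. Sampling noise is ignored to keep the arithmetic transparent.

```r
# Hypothetical counts for the same biology at two sequencing depths
shallow <- c(GeneA = 500, GeneB = 300, GeneC = 200)
deep    <- 2 * shallow
toy <- cbind(shallow, deep)

# Library sizes differ by a factor of two
colSums(toy)

# So does every raw count
toy[, "deep"] / toy[, "shallow"]

# Dividing each column by its library size recovers the shared proportions
props <- sweep(toy, 2, colSums(toy), "/")
props
```

The raw counts suggest every gene is twice as abundant in the deeper sample; the proportions show there is no difference at all.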

4.8 Why normalization is required

Normalization aims to remove technical effects so that remaining differences reflect biology.

Typical goals:

Adjust for sequencing depth
Reduce composition bias
Preserve relative expression differences

Different methods make different assumptions — which is why understanding the count matrix matters.
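
To make the role of assumptions concrete, the sketch below contrasts simple library-size scaling with a median-of-ratios size factor, in the spirit of the approach used by DESeq2. It is a simplified base R illustration on invented data, not that package's implementation; in particular it assumes every count is non-zero, whereas real data would need genes containing zeros to be handled before taking logarithms.

```r
# Toy count matrix (hypothetical): S2 is sequenced twice as deeply as S1,
# and GeneD is strongly up-regulated in S2
toy <- cbind(
  S1 = c(GeneA = 100, GeneB = 200, GeneC = 50, GeneD = 10),
  S2 = c(GeneA = 200, GeneB = 400, GeneC = 100, GeneD = 500)
)

# Library-size scaling: GeneD inflates the total for S2
lib_factors <- colSums(toy) / mean(colSums(toy))
lib_factors

# Median-of-ratios: the per-gene geometric mean acts as a pseudo-reference,
# and each sample's factor is its median ratio to that reference
log_geo_means <- rowMeans(log(toy))
size_factors <- apply(toy, 2, function(x) exp(median(log(x) - log_geo_means)))
size_factors

# Depth-adjusted counts, kept separate from the raw matrix
normalized <- sweep(toy, 2, size_factors, "/")
normalized
```

Library-size scaling treats the extra GeneD reads as extra depth, while the median-based factor assumes most genes are unchanged and is therefore less affected by a single highly expressed gene.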

4.9 Counts vs transformed values

It is crucial to distinguish:

Raw counts → used for modeling (e.g., DESeq2)
Normalized counts → adjusted for depth/composition
Transformed values (e.g., log, rlog, VST) → used for visualization and EDA

In CDI, we keep these representations explicitly separate.
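
One way to keep the representations separate is to give each its own object and never overwrite the raw matrix. The sketch below uses a plain log2 transform with a pseudocount of 1 on invented values; rlog and VST are more involved and are not reproduced here.

```r
# Toy raw counts (hypothetical values)
raw_counts <- cbind(
  S1 = c(GeneA = 0, GeneB = 10, GeneC = 1000),
  S2 = c(GeneA = 3, GeneB = 20, GeneC = 900)
)

# Log transform with a pseudocount of 1 so that zeros stay finite;
# stored under a separate name so the raw matrix is left untouched
log_counts <- log2(raw_counts + 1)

# The raw matrix still holds whole numbers; the transformed one does not
all(raw_counts == round(raw_counts))
all(log_counts == round(log_counts))
```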

4.10 Preview: rlog matrix

This guide also provides a precomputed transformation for exploration:

data/rlog_matrix.csv

This file is not used for differential expression modeling. It exists to support visualization and intuition.

rlog_mat <- readr::read_csv("data/rlog_matrix.csv")

rlog_mat

# A tibble: 1,000 × 15
   gene  Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Gene1    7.05    6.87    6.96    5.72    5.76    5.87    5.57    3.63    5.26
 2 Gene2    5.72    6.40    5.87    5.73    7.58    6.08    5.54    5.82    6.48
 3 Gene3    5.54    4.55    7.22    6.32    6.30    6.43    6.73    4.15    6.50
 4 Gene4    7.63    4.85    7.11    5.88    6.53    7.04    6.06    6.89    5.76
 5 Gene5    7.18    7.62    6.22    5.31    4.76    5.49    5.09    5.90    6.06
 6 Gene6    7.54    7.87    5.71    6.04    6.45    7.10    5.86    7.70    8.05
 7 Gene7    6.54    6.59    6.25    5.09    7.87    6.25    5.17    7.53    6.08
 8 Gene8    4.08    4.35    4.41    6.45    6.28    4.76    6.89    7.29    6.09
 9 Gene9    5.64    6.04    6.38    7.16    3.90    6.78    5.55    6.26    6.32
10 Gene…    7.11    6.09    5.66    6.74    4.47    4.12    5.67    7.26    6.68
# ℹ 990 more rows
# ℹ 5 more variables: Sample10 <dbl>, Sample11 <dbl>, Sample12 <dbl>,
#   Sample13 <dbl>, Sample14 <dbl>

4.11 Common pitfalls

Treating normalized values as raw counts
Filtering genes before inspecting library sizes
Mixing transformed and untransformed data in the same analysis
Forgetting how the count matrix was generated

Many RNA-Seq analysis errors originate before statistical testing.
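
A small guard can catch the first pitfall before any modeling starts. The helper below, `check_raw_counts()`, is a hypothetical function written for this lesson, not part of any package; it only verifies that a matrix looks like raw counts (numeric, no missing values, non-negative, whole numbers).

```r
# Hypothetical helper: stop early if a matrix does not look like raw counts
check_raw_counts <- function(mat) {
  stopifnot(is.matrix(mat), is.numeric(mat))
  if (anyNA(mat)) stop("matrix contains missing values")
  if (any(mat < 0)) stop("matrix contains negative values")
  if (any(mat != round(mat))) stop("matrix contains non-integer values")
  invisible(TRUE)
}

# Raw counts pass silently
raw_counts <- matrix(c(0, 5, 12, 300), nrow = 2)
check_raw_counts(raw_counts)

# A log-transformed matrix is rejected; capture the message instead of stopping
result <- tryCatch(
  check_raw_counts(log2(raw_counts + 1)),
  error = function(e) conditionMessage(e)
)
result
```

Passing this check does not prove a matrix holds raw counts (rounded normalized values would also pass), so it complements rather than replaces knowing how the matrix was generated.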

4.12 Takeaway

The count matrix is the foundation of RNA-Seq analysis. Understanding what it represents — and what it does not — is essential before modeling.

Proceed to Lesson 05: Normalization and Exploratory Data Analysis