Lesson 4 Quantification and Count Matrix Concepts

CDI goal: understand what RNA-Seq quantification represents, how count matrices are constructed, and what assumptions they encode before statistical modeling.

4.1 Learning outcomes

By the end of this lesson, you will be able to:

  • Explain what RNA-Seq quantification means
  • Distinguish raw counts from normalized or transformed values
  • Understand the structure of a gene × sample count matrix
  • Recognize common sources of bias introduced during quantification
  • Relate the demo count matrix to downstream differential expression analysis

4.2 What is RNA-Seq quantification?

After sequencing and alignment (or pseudo-alignment), RNA-Seq pipelines produce quantified expression values.

In the simplest case, this is a count: the number of reads assigned to each gene (or transcript) in each sample.

These counts form the raw material for downstream statistical analysis.

4.3 From reads to counts (conceptual overview)

Although tools differ (e.g., alignment-based vs pseudo-alignment), most pipelines follow the same logic:

  1. Sequence reads are generated from RNA molecules
  2. Reads are assigned to genomic features (genes or transcripts)
  3. Assignments are summarized into a matrix

The output is a count matrix with:

  • Rows = genes (or transcripts)
  • Columns = samples
  • Values = integer counts (pseudo-alignment tools report estimated counts, which can be fractional)
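
The three steps above can be sketched on a toy example. The read assignments below are invented for illustration (the gene and sample names are hypothetical); real pipelines produce them with dedicated quantification tools, and the summarization step is usually handled by those tools rather than by hand.

```r
# Hypothetical read-level assignments: one row per read that was
# assigned to a gene in a given sample (toy data, not the demo file).
assignments <- data.frame(
  sample = c("S1", "S1", "S1", "S2", "S2", "S2", "S2"),
  gene   = c("GeneA", "GeneA", "GeneB", "GeneA", "GeneB", "GeneB", "GeneC"),
  stringsAsFactors = FALSE
)

# Fix the factor levels so a gene with zero reads in a sample still gets a cell
assignments$gene   <- factor(assignments$gene, levels = c("GeneA", "GeneB", "GeneC"))
assignments$sample <- factor(assignments$sample, levels = c("S1", "S2"))

# Summarize: rows = genes, columns = samples, values = integer counts
toy_counts <- unclass(table(assignments$gene, assignments$sample))
toy_counts
```

Note that GeneC receives an explicit zero in S1. Zeros are real entries in a count matrix, not missing values.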

4.4 The demo count matrix used in this guide

In CDI, we work with a small demo dataset to focus on reasoning rather than scale.

The file:

  • data/demo_counts.csv

contains:

  • One row per gene
  • One column per sample
  • Raw integer counts

library(tidyverse)

counts_raw <- readr::read_csv("data/demo_counts.csv")

counts_raw
# A tibble: 1,000 × 15
   gene  Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Gene1     186     175     175      54      54      67      50       3      37
 2 Gene2      45     101      54      48     287      74      42      54     118
 3 Gene3      38      12     214      92      86     110     149       6     124
 4 Gene4     293       0     172      35      85     173      50     142      32
 5 Gene5     215     377      72      18       3      27      12      50      68
 6 Gene6     246     376      34      53      81     180      46     312     480
 7 Gene7      99     115      74      18     369      82      22     289      72
 8 Gene8       8      13      13     118      96      22     192     264      94
 9 Gene9      38      66      88     200       1     146      39      82      96
10 Gene…     197      82      49     146      12       8      55     244     156
# ℹ 990 more rows
# ℹ 5 more variables: Sample10 <dbl>, Sample11 <dbl>, Sample12 <dbl>,
#   Sample13 <dbl>, Sample14 <dbl>

4.5 Gene identifiers and sample columns

In most RNA-Seq workflows:

  • The first column identifies the gene (e.g., gene ID or symbol)
  • Remaining columns correspond to samples

This separation matters because statistical models operate only on the numeric matrix; the gene identifiers serve as labels for its rows.

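# Keep the gene identifiers (first column) as a separate vector
gene_ids <- counts_raw[[1]]

# Drop the identifier column and convert the sample columns to a matrix
counts_mat <- counts_raw |>
  dplyr::select(-1) |>
  as.matrix()

# Make sure the matrix is stored as numeric (double) values
storage.mode(counts_mat) <- "numeric"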

dim(counts_mat)
[1] 1000   14
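
Because the identifiers are split from the numeric matrix, it can help to confirm that they are unique and to carry them along as row names. The following is a minimal sketch on a toy data frame; `toy_raw`, `toy_ids`, and `toy_mat` are illustrative names and values, not part of the guide's data.

```r
# Toy stand-in for an imported table: first column = gene IDs,
# remaining columns = samples (hypothetical values).
toy_raw <- data.frame(
  gene = c("GeneA", "GeneB", "GeneC"),
  S1   = c(10, 0, 5),
  S2   = c(12, 3, 0),
  stringsAsFactors = FALSE
)

toy_ids <- toy_raw[[1]]

# Duplicate identifiers would make row lookups ambiguous
stopifnot(anyDuplicated(toy_ids) == 0)

# Drop the ID column, convert to a matrix, and keep the IDs as row names
toy_mat <- as.matrix(toy_raw[, -1])
rownames(toy_mat) <- toy_ids

is.numeric(toy_mat)   # only numeric sample columns remain
toy_mat["GeneB", ]    # look up one gene across samples by name
```

With row names attached, the matrix stays purely numeric while each row remains traceable to its gene.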

4.6 Properties of raw counts

Raw RNA-Seq counts have characteristic properties:

  • Non-negative integers
  • Strongly right-skewed distributions
  • Depend on sequencing depth
  • Depend on gene length and composition

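These properties explain why raw counts are not directly comparable: sequencing depth confounds comparisons across samples, and gene length and composition confound comparisons across genes.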
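
To build intuition for the first two properties, the sketch below simulates counts for one hypothetical sample. The negative binomial distribution and its parameters are assumptions chosen only for illustration; they are not estimated from the demo data.

```r
set.seed(1)  # reproducible toy simulation

# Simulated counts for 1,000 hypothetical genes in one sample.
# mu and size are arbitrary illustrative values.
sim_counts <- rnbinom(n = 1000, mu = 100, size = 0.5)

all(sim_counts >= 0)                  # non-negative
all(sim_counts == round(sim_counts))  # integer-valued

# Right skew: a few very large values pull the mean above the median
mean(sim_counts)
median(sim_counts)

# A log2(x + 1) scale spreads out the bulk of the distribution
summary(log2(sim_counts + 1))
```

The gap between mean and median is one quick numeric signature of right skew. A histogram of `sim_counts` next to one of `log2(sim_counts + 1)` shows the same thing visually.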

4.7 Library size and sequencing depth

The total number of counts per sample is often called the library size.

library_sizes <- colSums(counts_mat)

library_sizes
 Sample1  Sample2  Sample3  Sample4  Sample5  Sample6  Sample7  Sample8 
   93252   106031   100478    99657    99824   101970   105678    99746 
 Sample9 Sample10 Sample11 Sample12 Sample13 Sample14 
  108620   101298    97301   108093   115129   109639 

Samples with larger library sizes tend to have larger raw counts even when the underlying biology is identical.
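
A toy example makes the depth effect concrete. The two hypothetical samples below have identical relative expression, but the second was sequenced twice as deeply. Counts per million (CPM) is used here only as the simplest possible depth adjustment; it is an assumption of this sketch, not necessarily the method used later in the guide.

```r
# Two hypothetical samples with the same relative expression;
# S2 simply has twice the sequencing depth of S1.
toy <- cbind(
  S1 = c(GeneA = 100, GeneB = 300, GeneC = 600),
  S2 = c(GeneA = 200, GeneB = 600, GeneC = 1200)
)

lib_sizes <- colSums(toy)
lib_sizes

# Raw counts differ by the same factor for every gene
toy[, "S2"] / toy[, "S1"]

# Counts per million: divide each column by its library size, times 1e6
cpm <- sweep(toy, 2, lib_sizes, "/") * 1e6
cpm
```

After scaling, the two columns agree, which is what we expect when depth is the only difference between samples.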

4.8 Why normalization is required

Normalization aims to remove technical effects so that remaining differences reflect biology.

Typical goals:

  • Adjust for sequencing depth
  • Reduce composition bias
  • Preserve relative expression differences

Different methods make different assumptions — which is why understanding the count matrix matters.
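
The difference between depth adjustment and composition adjustment can be illustrated with a simplified sketch of the median-of-ratios idea, the approach behind DESeq2's size factors. This is a base R illustration on hypothetical values, not DESeq2's own code, and it assumes that most genes are not differentially expressed.

```r
# Toy count matrix (hypothetical values): S2 has twice the depth of S1,
# and GeneD is strongly elevated in S2 only.
toy <- cbind(
  S1 = c(GeneA = 100, GeneB = 200, GeneC = 300, GeneD = 50),
  S2 = c(GeneA = 200, GeneB = 400, GeneC = 600, GeneD = 5000)
)

# Library-size scaling is dominated by the single highly expressed gene
colSums(toy) / mean(colSums(toy))

# Median-of-ratios idea: compare each sample to a per-gene reference
# (the geometric mean across samples) and take the median ratio.
log_geo_means <- rowMeans(log(toy))
keep <- is.finite(log_geo_means)  # drops genes with a zero in any sample
size_factors <- apply(toy[keep, , drop = FALSE], 2, function(col) {
  exp(median(log(col) - log_geo_means[keep]))
})
size_factors

# Divide each column by its size factor to obtain normalized counts
norm_counts <- sweep(toy, 2, size_factors, "/")
norm_counts
```

In this toy case the size factors differ by a factor of two, matching the true depth difference, whereas the library sizes differ by much more because GeneD inflates the total for S2.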

4.9 Counts vs transformed values

It is crucial to distinguish:

  • Raw counts → used for modeling (e.g., DESeq2)
  • Normalized counts → adjusted for depth/composition
  • Transformed values (e.g., log, rlog, VST) → used for visualization and EDA

In CDI, we keep these representations explicitly separate.
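
One way to keep the representations separate in practice is to store each under its own name. The sketch below uses hypothetical values, a simple library-size scaling, and a log2 transform with a pseudocount; these are stand-ins chosen for brevity, not the normalization or transformation used elsewhere in the guide.

```r
# Toy raw counts (hypothetical values)
raw <- cbind(
  S1 = c(GeneA = 0, GeneB = 10, GeneC = 1000),
  S2 = c(GeneA = 2, GeneB = 25, GeneC = 1500)
)

# Depth-adjusted values (simple per-million scaling, chosen for brevity)
normalized <- sweep(raw, 2, colSums(raw), "/") * 1e6

# Log-scale values for plots; the +1 pseudocount avoids log2(0)
transformed <- log2(normalized + 1)

# Keep each representation under its own name so they cannot be confused
expr <- list(raw = raw, normalized = normalized, transformed = transformed)

# The three representations live on very different scales
sapply(expr, range)
```

Passing `expr$raw` to a count-based model and `expr$transformed` to plotting code makes the intended use of each object explicit at the call site.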

4.10 Preview: rlog matrix

This guide also provides a precomputed transformation for exploration:

  • data/rlog_matrix.csv

This file is not used for differential expression modeling. It exists to support visualization and intuition.

rlog_mat <- readr::read_csv("data/rlog_matrix.csv")

rlog_mat
# A tibble: 1,000 × 15
   gene  Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Gene1    7.05    6.87    6.96    5.72    5.76    5.87    5.57    3.63    5.26
 2 Gene2    5.72    6.40    5.87    5.73    7.58    6.08    5.54    5.82    6.48
 3 Gene3    5.54    4.55    7.22    6.32    6.30    6.43    6.73    4.15    6.50
 4 Gene4    7.63    4.85    7.11    5.88    6.53    7.04    6.06    6.89    5.76
 5 Gene5    7.18    7.62    6.22    5.31    4.76    5.49    5.09    5.90    6.06
 6 Gene6    7.54    7.87    5.71    6.04    6.45    7.10    5.86    7.70    8.05
 7 Gene7    6.54    6.59    6.25    5.09    7.87    6.25    5.17    7.53    6.08
 8 Gene8    4.08    4.35    4.41    6.45    6.28    4.76    6.89    7.29    6.09
 9 Gene9    5.64    6.04    6.38    7.16    3.90    6.78    5.55    6.26    6.32
10 Gene…    7.11    6.09    5.66    6.74    4.47    4.12    5.67    7.26    6.68
# ℹ 990 more rows
# ℹ 5 more variables: Sample10 <dbl>, Sample11 <dbl>, Sample12 <dbl>,
#   Sample13 <dbl>, Sample14 <dbl>
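
As an example of the kind of exploration a log-scale matrix supports, the sketch below computes distances between samples. It uses a small hypothetical table in place of the real file, and the object names are illustrative.

```r
# Toy stand-in for a transformed (log-scale) table; values are hypothetical
toy_rlog <- data.frame(
  gene = c("GeneA", "GeneB", "GeneC", "GeneD"),
  S1   = c(7.0, 5.7, 5.5, 7.6),
  S2   = c(6.9, 5.8, 5.4, 7.5),
  S3   = c(4.1, 7.3, 6.9, 5.0),
  stringsAsFactors = FALSE
)

# Same separation as for counts: IDs become row names, values a matrix
toy_rlog_mat <- as.matrix(toy_rlog[, -1])
rownames(toy_rlog_mat) <- toy_rlog$gene

# Samples are columns, so transpose before computing Euclidean distances
sample_dist <- dist(t(toy_rlog_mat))
sample_dist
```

Here S1 and S2 sit close together while S3 is far from both. On real data, a pattern like this is a prompt to check the experimental design, not a statistical test.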

4.11 Common pitfalls

  • Treating normalized values as raw counts
  • Filtering genes before inspecting library sizes
  • Mixing transformed and untransformed data in the same analysis
  • Forgetting how the count matrix was generated

Many RNA-Seq analysis errors originate upstream of statistical testing, in how the data were generated, quantified, or prepared.
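
A small guard can catch the first pitfall early. The helper below is hypothetical (it is not part of any package) and checks only necessary conditions: a matrix that passes may still not be raw counts, for example if normalized values were rounded.

```r
# Hypothetical helper: TRUE if a matrix is plausibly raw counts
looks_like_raw_counts <- function(m) {
  is.numeric(m) &&
    !anyNA(m) &&
    all(m >= 0) &&
    all(m == round(m))
}

# Toy inputs: one integer-valued matrix, one log-scale matrix
raw_toy  <- matrix(c(0, 12, 7, 300), nrow = 2)
rlog_toy <- matrix(c(3.63, 5.26, 7.05, 6.87), nrow = 2)

looks_like_raw_counts(raw_toy)   # TRUE
looks_like_raw_counts(rlog_toy)  # FALSE
```

Calling such a check immediately before model fitting documents the assumption that the input is untransformed.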

4.12 Takeaway

The count matrix is the foundation of RNA-Seq analysis. Understanding what it represents — and what it does not — is essential before modeling.