Read Processing and Quantification

Published: Jun 2026

ID: RNASEQ-005
Type: Data Processing
Audience: Students, researchers, analysts, and practitioners
Theme: Converting sequencing reads into quantitative expression measurements

Introduction

After evaluating raw sequencing quality, the next stage of the RNA-Seq system is to transform sequencing reads into quantitative measurements of gene expression.

At this point, the goal is no longer to assess data quality but to determine how many reads support each biological feature, such as a gene or transcript.

The output of this stage forms the foundation for downstream statistical analysis.

Where This Chapter Fits

flowchart TD

    A[Sequencing]

    subgraph DP["Data Processing"]
        B[Raw Reads]
        C[Read Quality Control]
        D[Read Processing & Quantification]
        E[Count Matrix]
    end

    A --> B
    B --> C --> D --> E

This chapter focuses on converting quality-assessed sequencing reads into quantitative expression measurements.

What Is Quantification?

Quantification is the process of estimating expression levels from sequencing reads.

The central question is:

Which genes or transcripts generated these reads?

The answer allows us to summarize millions of sequencing reads into biologically meaningful expression measurements.

Reference-Based Quantification

Most RNA-Seq workflows use a reference genome or transcriptome.

Reads are compared against known biological sequences to determine their likely origin.

Common references include:

Genome assemblies
Transcriptome assemblies
Gene annotation databases

The quality of the reference influences downstream quantification accuracy.

Alignment-Based Approaches

Traditional RNA-Seq workflows often begin with sequence alignment.

Alignment attempts to determine where each read originated in the genome.

Common aligners include:

STAR
HISAT2

Typical workflow:

FASTQ Files
      ↓
Alignment
      ↓
Aligned Reads (BAM)
      ↓
Feature Counting
      ↓
Count Matrix

Alignment provides detailed mapping information but may require substantial computational resources.

Pseudoalignment Approaches

Modern workflows often use pseudoalignment or lightweight mapping approaches.

Examples include:

Salmon
kallisto

Rather than performing full base-by-base genomic alignment, these tools identify which transcripts each read is compatible with and estimate transcript abundance from those compatibilities.

Typical workflow:

FASTQ Files
      ↓
Pseudoalignment
      ↓
Transcript Quantification
      ↓
Gene-Level Summarization
      ↓
Count Matrix

These methods are typically much faster than full alignment while producing expression estimates of comparable accuracy for annotated transcripts.
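The core idea can be illustrated with a toy sketch. It is not how Salmon or kallisto are implemented; it only shows the principle of finding the set of transcripts a read is compatible with by intersecting k-mer matches. The sequences, the k-mer size, and the function names are all invented for illustration.

```python
from collections import defaultdict

K = 5  # toy k-mer size; real tools use much longer k-mers

# Hypothetical transcript sequences, invented for illustration only.
transcripts = {
    "GeneA-001": "ATGGCGTACGTTAGC",
    "GeneA-002": "ATGGCGTACGAACCT",
    "GeneB-001": "TTGACCGGATCCAAG",
}


def kmers(seq, k=K):
    """Return every substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Index: k-mer -> set of transcripts that contain it.
index = defaultdict(set)
for name, seq in transcripts.items():
    for km in kmers(seq):
        index[km].add(name)


def compatible_transcripts(read):
    """Intersect the transcript sets of all indexed k-mers in the read."""
    hits = [index[km] for km in kmers(read) if km in index]
    if not hits:
        return set()
    return set.intersection(*hits)


shared_read = "ATGGCGTACG"    # lies in the region shared by both GeneA isoforms
specific_read = "TACGAACCT"   # overlaps sequence found only in GeneA-002

print(sorted(compatible_transcripts(shared_read)))
print(sorted(compatible_transcripts(specific_read)))
```

The first read is compatible with both GeneA isoforms, the second with only one. Real tools go further and use a statistical model to decide how to apportion ambiguous reads among compatible transcripts.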

Gene-Level Versus Transcript-Level Quantification

Expression can be quantified at different biological levels.

Gene-Level

Gene-level quantification summarizes expression across all transcripts associated with a gene.

Example:

GeneA = 250 reads
GeneB = 1020 reads
GeneC = 430 reads

Gene-level analyses are commonly used for differential expression studies.

Transcript-Level

Transcript-level quantification estimates abundance for individual transcript isoforms.

Example:

GeneA-001 = 150 reads
GeneA-002 = 100 reads

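Transcript-level analyses can provide additional biological detail but are harder to estimate, because isoforms of the same gene share sequence and many reads are compatible with more than one transcript.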
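The relationship between the two levels can be sketched in a few lines of Python. This is a minimal illustration that reuses the example numbers above; the `tx2gene` dictionary and the function name are assumptions for the sketch, and in practice the transcript-to-gene mapping comes from the annotation.

```python
from collections import defaultdict

# Transcript-level counts from the example above.
transcript_counts = {"GeneA-001": 150, "GeneA-002": 100}

# Hypothetical transcript-to-gene map; normally derived from the annotation.
tx2gene = {"GeneA-001": "GeneA", "GeneA-002": "GeneA"}


def summarize_to_gene(tx_counts, mapping):
    """Sum transcript-level counts over all transcripts of each gene."""
    gene_counts = defaultdict(int)
    for transcript, count in tx_counts.items():
        gene_counts[mapping[transcript]] += count
    return dict(gene_counts)


gene_counts = summarize_to_gene(transcript_counts, tx2gene)
print(gene_counts)  # {'GeneA': 250}
```

The two isoform counts add up to the 250 reads shown for GeneA in the gene-level example.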

Feature Counting

For alignment-based workflows, reads are commonly assigned to annotated features.

Features may include:

Genes
Exons
Transcripts

Feature counting converts mapped reads into a numerical expression matrix suitable for statistical analysis.
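As a rough sketch of the idea, the following Python assigns reads to genes by coordinate overlap. The gene coordinates, read positions, and category names are hypothetical, and the logic is far simpler than a real counting tool, which also considers strand, exon structure, and paired-end fragments.

```python
from collections import Counter

# Hypothetical annotation on one chromosome: (gene, start, end), inclusive.
genes = [("GeneA", 100, 500), ("GeneB", 800, 1500), ("GeneC", 1400, 2000)]

# Hypothetical aligned reads: (start, end), inclusive.
reads = [(120, 170), (450, 500), (900, 950), (1420, 1470), (3000, 3050)]


def overlaps(a_start, a_end, b_start, b_end):
    """True if two inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def count_features(reads, genes):
    """Count reads that overlap exactly one gene; tally the rest separately."""
    counts = Counter({name: 0 for name, _, _ in genes})
    unassigned = Counter()
    for r_start, r_end in reads:
        hits = [name for name, g_start, g_end in genes
                if overlaps(r_start, r_end, g_start, g_end)]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            unassigned["no_feature"] += 1
        else:
            unassigned["ambiguous"] += 1
    return dict(counts), dict(unassigned)


counts, unassigned = count_features(reads, genes)
print(counts)
print(unassigned)
```

Leaving ambiguous reads uncounted is only one possible policy; tools differ in how they treat reads that overlap several features.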

Example Count Matrix

The primary output of this stage is a count matrix.

Gene	Sample1	Sample2	Sample3
GeneA	250	310	295
GeneB	1020	980	1105
GeneC	430	390	470

Rows represent biological features.

Columns represent samples.

Values represent observed read counts (or estimated counts, when they come from tools such as Salmon or kallisto).

This matrix becomes the foundation for downstream expression analysis.
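A minimal sketch of loading such a matrix and computing per-sample totals, using only the Python standard library. The matrix text is the example above embedded as a string; in practice it would be read from a file whose name and layout depend on the workflow.

```python
import csv
import io

# The example count matrix above, as tab-separated text.
matrix_text = (
    "Gene\tSample1\tSample2\tSample3\n"
    "GeneA\t250\t310\t295\n"
    "GeneB\t1020\t980\t1105\n"
    "GeneC\t430\t390\t470\n"
)


def read_count_matrix(handle):
    """Parse a tab-separated count matrix into sample names and gene rows."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_names = header[1:]
    gene_rows = {row[0]: [int(value) for value in row[1:]] for row in reader}
    return sample_names, gene_rows


samples, counts = read_count_matrix(io.StringIO(matrix_text))

# Library size: total counts per sample (column sums).
library_sizes = {
    sample: sum(row[i] for row in counts.values())
    for i, sample in enumerate(samples)
}
print(library_sizes)  # {'Sample1': 1700, 'Sample2': 1680, 'Sample3': 1870}
```

Column totals like these are a quick sanity check that samples were sequenced and quantified to broadly similar depth.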

Common Quantification Outputs

Typical outputs include:

Transcript abundance estimates
Gene counts
Alignment statistics
Mapping rates
Quantification summaries

These outputs help evaluate how successfully reads were assigned to biological features.

Mapping Rate

Mapping rate refers to the proportion of reads successfully assigned to the reference.

For example:

Total Reads: 20,000,000
Mapped Reads: 18,500,000
Mapping Rate: 92.5%

Mapping rates can help identify potential issues with:

Reference quality
Sample contamination
Sequencing quality
Library preparation

Mapping rates should be interpreted together with other quality metrics.
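The calculation in the example above can be written as a small Python helper. The second sample and the 70% review threshold below are purely illustrative assumptions, not recommended values; an appropriate threshold depends on the organism, library type, and reference.

```python
def mapping_rate(mapped_reads, total_reads):
    """Return the percentage of reads assigned to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


rate = mapping_rate(18_500_000, 20_000_000)
print(f"Mapping rate: {rate:.1f}%")  # Mapping rate: 92.5%

# Hypothetical per-sample summaries: (mapped reads, total reads).
summaries = {
    "Sample1": (18_500_000, 20_000_000),
    "Sample2": (9_000_000, 20_000_000),
}

REVIEW_THRESHOLD = 70.0  # illustrative cut-off only, not a standard

flagged = [
    sample for sample, (mapped, total) in summaries.items()
    if mapping_rate(mapped, total) < REVIEW_THRESHOLD
]
print(flagged)  # ['Sample2']
```

A flagged sample is a prompt to look at the other quality metrics, not a verdict on its own.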

Multi-Mapping Reads

Some reads may align to multiple locations.

This can occur because of:

Gene families
Repetitive sequences
Shared transcript regions

Different quantification tools handle multi-mapping reads differently.

Understanding these decisions is important when interpreting expression estimates.
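To make the effect of these decisions concrete, the sketch below compares two simple policies on hypothetical data: counting only uniquely assigned reads, and splitting each multi-mapping read equally among its candidate genes. Neither is presented as what any specific tool does; some tools instead allocate such reads probabilistically.

```python
from collections import defaultdict

# Hypothetical data: each read lists the genes it aligns to.
read_hits = [
    ["GeneA"],
    ["GeneA"],
    ["GeneB"],
    ["GeneA", "GeneB"],  # a multi-mapping read
]


def count_unique_only(read_hits):
    """Policy 1: discard any read that maps to more than one gene."""
    counts = defaultdict(float)
    for hits in read_hits:
        if len(hits) == 1:
            counts[hits[0]] += 1.0
    return dict(counts)


def count_fractional(read_hits):
    """Policy 2: split each read equally among all genes it maps to."""
    counts = defaultdict(float)
    for hits in read_hits:
        for gene in hits:
            counts[gene] += 1.0 / len(hits)
    return dict(counts)


print(count_unique_only(read_hits))  # {'GeneA': 2.0, 'GeneB': 1.0}
print(count_fractional(read_hits))   # {'GeneA': 2.5, 'GeneB': 1.5}
```

The same reads yield different counts under each policy, which is why the handling of multi-mapping reads should be recorded alongside the results.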

Gene Length Considerations

At the same expression level, longer genes generate more reads than shorter genes because they yield more sequenced fragments.

For this reason, raw counts are influenced by:

Expression level
Gene length
Sequencing depth

Downstream normalization methods help account for some of these effects.
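One widely used length- and depth-adjusted unit is TPM (transcripts per million). The sketch below computes it for the Sample1 counts from this chapter; the gene lengths are hypothetical values chosen for illustration, and TPM is shown only as an example of such an adjustment, not as the method used in later chapters.

```python
def tpm(counts, lengths_kb):
    """Compute transcripts per million from raw counts and lengths in kilobases."""
    # Step 1: divide each count by gene length to get reads per kilobase.
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    # Step 2: rescale so the values sum to one million within the sample.
    scale = sum(rates.values()) / 1_000_000
    return {gene: rate / scale for gene, rate in rates.items()}


# Sample1 counts from the example count matrix.
sample1_counts = {"GeneA": 250, "GeneB": 1020, "GeneC": 430}

# Hypothetical gene lengths in kilobases (not from the source).
lengths_kb = {"GeneA": 1.0, "GeneB": 4.0, "GeneC": 2.0}

tpm_values = tpm(sample1_counts, lengths_kb)
for gene, value in tpm_values.items():
    print(f"{gene}: {value:.1f}")
```

With these assumed lengths, GeneB has about four times the raw count of GeneA, yet the two have nearly equal TPM, because GeneB is four times longer.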

Common Tools

Frequently used RNA-Seq quantification tools include:

Tool	Purpose
STAR	Alignment
HISAT2	Alignment
featureCounts	Gene counting
Salmon	Quantification
kallisto	Quantification
tximport	Import and summarize transcript estimates

The specific tool is less important than understanding the role it plays in the RNA-Seq system.

Example Salmon Workflow

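# Quantify one paired-end sample against a prebuilt Salmon index.
# -i: index directory; -l A: infer the library type automatically;
# -1/-2: mate 1 and mate 2 FASTQ files; -o: output directory.
salmon quant \
  -i transcriptome_index \
  -l A \
  -1 sample_R1.fastq.gz \
  -2 sample_R2.fastq.gz \
  -o sample_quant

This command estimates transcript abundance from paired-end sequencing reads and writes the results, including the quant.sf abundance table, to the sample_quant directory.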

Example tximport Workflow

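library(tximport)

# quant_files: named character vector of per-sample quant.sf paths (assumed defined elsewhere).
# tx2gene: data frame mapping transcript IDs to gene IDs, required for
# gene-level summarization (assumed defined elsewhere).
txi <- tximport(
  files = quant_files,
  type = "salmon",
  tx2gene = tx2gene
)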

The imported object can then be used for downstream differential expression analysis.

Quantification Checklist

Before moving to expression analysis, confirm that:

Sequencing reads have passed QC review.
Reference files are documented.
Quantification completed successfully.
Mapping summaries have been reviewed.
Sample identifiers match metadata.
Count matrices have been generated.
Output files are stored reproducibly.
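One item on this checklist, matching sample identifiers against metadata, is easy to automate. The sketch below is a minimal Python check; the function name, the report keys, and the metadata ordering are assumptions made for illustration.

```python
def compare_sample_ids(matrix_samples, metadata_samples):
    """Report sample IDs missing on either side and whether the order agrees."""
    matrix_set = set(matrix_samples)
    metadata_set = set(metadata_samples)
    return {
        "missing_from_metadata": sorted(matrix_set - metadata_set),
        "missing_from_matrix": sorted(metadata_set - matrix_set),
        "same_order": list(matrix_samples) == list(metadata_samples),
    }


# Column names of the count matrix and row order of a hypothetical metadata table.
matrix_samples = ["Sample1", "Sample2", "Sample3"]
metadata_samples = ["Sample1", "Sample3", "Sample2"]

report = compare_sample_ids(matrix_samples, metadata_samples)
print(report)
```

Here every sample is present on both sides, but the order differs. That matters whenever a downstream tool pairs count columns with metadata rows by position rather than by name.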

Common Mistakes

Common quantification mistakes include:

Using inconsistent sample names
Ignoring mapping summaries
Mixing transcript and gene-level analyses unintentionally
Losing metadata connections during file processing
Using count matrices without understanding how they were generated

The goal is not simply to obtain counts but to understand how those counts were produced.

Key Takeaway

Read processing and quantification convert sequencing reads into expression measurements.

The count matrix produced at this stage becomes the central input for normalization, exploratory analysis, and differential expression modeling.

Understanding how counts are generated is essential for interpreting downstream biological conclusions.

What Comes Next

The next chapter focuses on count matrix quality assessment and filtering, the first step in preparing expression measurements for statistical analysis.