Count Quality Assessment and Filtering

Published: Jun 2026

  • ID: RNASEQ-007
  • Type: Expression Analysis
  • Audience: Students, biologists, bioinformaticians, data scientists, researchers, and practitioners
  • Theme: Evaluating count matrices and preparing expression data for downstream analysis

Introduction

The count matrix represents the transition between Data Generation & Processing and Expression Analysis.

Although quantification has been completed, not every feature in the count matrix contributes meaningful information to downstream analyses. Some genes may have extremely low counts, while others may show strong and consistent expression across samples.

The goal of this chapter is to assess count matrix quality and prepare expression data for normalization and statistical modeling.

Where This Chapter Fits

flowchart TD

    A[Count Matrix]

    subgraph EA["Expression Analysis"]
        B[Count Quality Assessment & Filtering]
        C[Normalization & Exploratory Analysis]
        D[Differential Expression Analysis]
    end

    A --> B --> C --> D

This chapter represents the first step in the Expression Analysis workflow.

Why Evaluate the Count Matrix?

The count matrix is a measurement table, not a final analytical result.

Before normalization, it is important to understand:

  • How many genes were detected
  • Whether sequencing depth varies substantially across samples
  • Whether low-count genes dominate the dataset
  • Whether sample-level characteristics are reasonable
  • Whether potential technical issues are present

These assessments help determine whether the data are ready for downstream analysis.

Understanding Low-Count Features

RNA-Seq datasets often contain many genes with little or no detectable expression.

Examples include:

Gene    Sample1  Sample2  Sample3
GeneA       250      310      295
GeneB      1020      980     1105
GeneC         0        1        0
GeneD         2        0        1

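GeneA and GeneB have hundreds to thousands of counts in every sample and carry substantial information.

GeneC and GeneD have at most two counts in any sample. They contribute little evidence and may increase noise during statistical testing.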
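As a minimal sketch, the example table can be rebuilt as an R matrix and summarized per gene. The object names (`counts`, `gene_summary`) and the choice of summaries are illustrative assumptions, not a prescribed procedure.

```r
# Example counts from the table above (genes in rows, samples in columns).
counts <- matrix(
  c( 250,  310,  295,
    1020,  980, 1105,
       0,    1,    0,
       2,    0,    1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("GeneA", "GeneB", "GeneC", "GeneD"),
    c("Sample1", "Sample2", "Sample3")
  )
)

# Per-gene summaries that make low-count features easy to spot.
gene_summary <- data.frame(
  total  = rowSums(counts),        # total counts across all samples
  max    = apply(counts, 1, max),  # largest count in any single sample
  n_zero = rowSums(counts == 0)    # number of samples with zero counts
)

print(gene_summary)
```

GeneC and GeneD stand out immediately: their totals are 1 and 3, compared with 855 and 3105 for GeneA and GeneB.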

Why Filter Low-Count Genes?

Low-count filtering can:

  • Reduce noise
  • Improve statistical efficiency
  • Reduce multiple-testing burden
  • Improve model stability
  • Focus analyses on informative features

Filtering is not intended to manipulate results. It is intended to remove features that provide insufficient evidence for reliable inference.

Library Sizes

One of the first count-level assessments involves library size.

Library size refers to the total number of counts observed within a sample.

Example:

Sample    Total Counts
Sample1     18,500,000
Sample2     19,200,000
Sample3     17,900,000
Sample4     24,800,000

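Substantial differences in library size should be understood before normalization. In this example, Sample4 has roughly a third more counts than the other three samples on average, a difference worth investigating.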
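In R, library sizes are simply the column sums of the count matrix. The sketch below reuses the four-gene example from earlier in this chapter, so the totals are tiny compared with a real experiment. The ratio to the median library size is one possible way to compare samples, shown here as an assumption rather than a standard.

```r
# Four-gene example matrix from earlier in this chapter.
counts <- matrix(
  c( 250,  310,  295,
    1020,  980, 1105,
       0,    1,    0,
       2,    0,    1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("GeneA", "GeneB", "GeneC", "GeneD"),
    c("Sample1", "Sample2", "Sample3")
  )
)

# Library size: total counts observed in each sample (column).
lib_size <- colSums(counts)

# Ratio of each library size to the median, to highlight unusual samples.
lib_ratio <- lib_size / median(lib_size)

print(data.frame(lib_size = lib_size, lib_ratio = round(lib_ratio, 3)))
```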

Gene Detection Rates

Another useful summary is the number of detected genes per sample.

For example:

Sample    Detected Genes
Sample1           15,300
Sample2           15,100
Sample3           15,450
Sample4           10,200

Samples with unusually low detection rates may require further investigation. Here, Sample4 detects about a third fewer genes than the other samples.
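One way to count detected genes per sample is sketched below. The helper `count_detected()` and its `min_count` argument are hypothetical names. Treating a gene as "detected" when it has at least one count is an assumption, and some analyses use a stricter cutoff.

```r
# Four-gene example matrix from earlier in this chapter.
counts <- matrix(
  c( 250,  310,  295,
    1020,  980, 1105,
       0,    1,    0,
       2,    0,    1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("GeneA", "GeneB", "GeneC", "GeneD"),
    c("Sample1", "Sample2", "Sample3")
  )
)

# Hypothetical helper: number of genes per sample with at least min_count counts.
count_detected <- function(counts, min_count = 1) {
  colSums(counts >= min_count)
}

detected    <- count_detected(counts)                  # any non-zero count
detected_10 <- count_detected(counts, min_count = 10)  # stricter definition

print(rbind(detected, detected_10))
```

The two definitions give different answers even on this tiny matrix, so the cutoff used for "detected" should be stated alongside the numbers.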

Expression Prevalence

Filtering often considers not only count magnitude but also prevalence across samples.

A common principle is:

Retain genes that show sufficient counts in multiple samples.

This helps remove genes that appear only sporadically.

Example Filtering Rule

A simple filtering rule might require:

At least 10 counts in at least 3 samples

Genes failing this criterion would be removed before normalization.

The exact threshold depends on the study design and analysis objectives.

Filtering with R

A simple filtering approach may look like:

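# counts: numeric matrix with genes in rows and samples in columns
# TRUE for genes with at least 10 counts in at least 3 samples
keep <- rowSums(counts >= 10) >= 3

# Subset rows; drop = FALSE keeps the result a matrix even if one gene remains
counts_filtered <- counts[keep, , drop = FALSE]

This code retains genes with at least 10 counts in at least three samples.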
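Applied to the four-gene example from earlier in this chapter, the rule can be wrapped in a small function so the thresholds are explicit. The function name `filter_by_prevalence()` and its argument names are hypothetical, chosen only for this sketch.

```r
# Four-gene example matrix from earlier in this chapter.
counts <- matrix(
  c( 250,  310,  295,
    1020,  980, 1105,
       0,    1,    0,
       2,    0,    1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("GeneA", "GeneB", "GeneC", "GeneD"),
    c("Sample1", "Sample2", "Sample3")
  )
)

# Hypothetical helper: TRUE for genes with at least min_count counts
# in at least min_samples samples.
filter_by_prevalence <- function(counts, min_count = 10, min_samples = 3) {
  rowSums(counts >= min_count) >= min_samples
}

keep <- filter_by_prevalence(counts, min_count = 10, min_samples = 3)
counts_filtered <- counts[keep, , drop = FALSE]

print(keep)
print(counts_filtered)
```

GeneA and GeneB pass the rule, while GeneC and GeneD are removed because they never reach 10 counts in any sample.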

Filtering with edgeR

The edgeR package provides a commonly used filtering method.

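# counts: numeric matrix with genes in rows and samples in columns
# metadata$condition: group label for each sample, in the same order as the columns
keep <- edgeR::filterByExpr(
  counts,
  group = metadata$condition
)

counts_filtered <- counts[keep, ]

This method accounts for the experimental groups. According to the edgeR documentation, it keeps genes that reach a minimum count (10 by default, applied on the counts-per-million scale so that library size is taken into account) in roughly as many samples as the smallest group contains. It is widely used in RNA-Seq workflows.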
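To illustrate what a group-aware filter can look like, here is a simplified base-R sketch. It is not edgeR's implementation and omits several of its safeguards. It assumes that the count threshold is converted to counts per million (CPM) using the median library size, and that a gene must pass in at least as many samples as the smallest group contains. The function name, the four-sample matrix, and the group labels are all made up for this example.

```r
# Made-up matrix: GeneD is expressed only in the first two samples.
counts <- matrix(
  c( 250,  310,  295,  280,
    1020,  980, 1105,  990,
       0,    1,    0,    2,
      40,   35,    0,    1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("GeneA", "GeneB", "GeneC", "GeneD"),
    paste0("Sample", 1:4)
  )
)
group <- c("control", "control", "treated", "treated")  # hypothetical labels

# Simplified group-aware filter (an illustration, not edgeR's algorithm).
cpm_prevalence_filter <- function(counts, group, min_count = 10) {
  lib_size <- colSums(counts)
  # Counts per million: divide each sample's column by its library size.
  cpm <- t(t(counts) / lib_size) * 1e6
  # Express the count threshold on the CPM scale via the median library size.
  cpm_cutoff <- min_count / median(lib_size) * 1e6
  # Require as many passing samples as the smallest group contains.
  min_samples <- min(table(group))
  rowSums(cpm >= cpm_cutoff) >= min_samples
}

keep_grouped <- cpm_prevalence_filter(counts, group)
keep_simple  <- rowSums(counts >= 10) >= 3

print(rbind(keep_grouped, keep_simple))
```

GeneD is retained by the group-aware sketch because it is clearly expressed in both control samples, whereas the fixed "3 samples" rule removes it. This is the kind of gene a group-aware filter is designed to keep.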

Sample-Level Diagnostics

Count matrices can also reveal sample-level issues.

Questions include:

  • Are some samples substantially different from others?
  • Are library sizes unusually small?
  • Are detection rates unusually low?
  • Are there potential sample outliers?

These questions should be investigated before normalization and modeling.
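A rough screen for such samples can be sketched by comparing each sample with the median. The example below reuses the library sizes and detected-gene counts from the tables earlier in this chapter, assuming for illustration that both tables describe the same four samples. The 25% tolerance is an arbitrary, hypothetical cutoff. A flagged sample is a prompt for closer inspection, not a reason to discard it.

```r
# Values from the library size and gene detection tables in this chapter.
sample_qc <- data.frame(
  sample   = paste0("Sample", 1:4),
  lib_size = c(18500000, 19200000, 17900000, 24800000),
  detected = c(15300, 15100, 15450, 10200)
)

# Ratio of each sample to the median across samples.
sample_qc$lib_ratio <- sample_qc$lib_size / median(sample_qc$lib_size)
sample_qc$det_ratio <- sample_qc$detected / median(sample_qc$detected)

# Hypothetical tolerance: flag samples more than 25% away from the median
# on either metric.
tolerance <- 0.25
sample_qc$flag <- abs(sample_qc$lib_ratio - 1) > tolerance |
  abs(sample_qc$det_ratio - 1) > tolerance

print(sample_qc)
```

Under these assumptions only Sample4 is flagged, for both its larger library size and its lower number of detected genes.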

Filtering Is Not Normalization

Filtering and normalization serve different purposes.

Filtering:

  • Removes uninformative features
  • Reduces noise
  • Improves statistical efficiency

Normalization:

  • Adjusts for sequencing depth
  • Improves comparability across samples
  • Supports downstream modeling

Filtering prepares the dataset for normalization but does not replace it.

Documenting Filtering Decisions

Filtering criteria should always be documented.

Reports should include:

  • Filtering thresholds
  • Number of genes removed
  • Number of genes retained
  • Rationale for chosen criteria

Transparent documentation improves reproducibility and interpretation.
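The items listed above can be collected programmatically so that they travel with the analysis. This is a minimal sketch: the function `summarize_filter()` and its column names are hypothetical, and the fraction of total counts retained is an extra summary added here as an assumption about what may be useful to report.

```r
# Four-gene example matrix from earlier in this chapter.
counts <- matrix(
  c( 250,  310,  295,
    1020,  980, 1105,
       0,    1,    0,
       2,    0,    1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("GeneA", "GeneB", "GeneC", "GeneD"),
    c("Sample1", "Sample2", "Sample3")
  )
)

keep <- rowSums(counts >= 10) >= 3

# Hypothetical helper: one-row record of a filtering decision.
summarize_filter <- function(counts, keep, rule) {
  data.frame(
    rule            = rule,          # human-readable threshold description
    genes_before    = nrow(counts),
    genes_retained  = sum(keep),
    genes_removed   = sum(!keep),
    # Share of all counts that belong to retained genes.
    fraction_counts_retained = sum(counts[keep, , drop = FALSE]) / sum(counts)
  )
}

report <- summarize_filter(counts, keep, rule = ">= 10 counts in >= 3 samples")
print(report)
```

The rationale for the chosen criteria still needs to be written by the analyst; code can record what was done, not why.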

Common Mistakes

Common filtering mistakes include:

  • Skipping filtering entirely
  • Applying arbitrary thresholds without justification
  • Ignoring sample-level diagnostics
  • Treating filtering as a substitute for normalization
  • Removing genes without documenting criteria
  • Losing metadata alignment during filtering

These mistakes can affect every downstream stage of Expression Analysis.
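The last item in the list, metadata alignment, can be guarded against with an explicit check. The sketch below assumes sample metadata is stored in a data frame whose row names are sample IDs. The `metadata` object and its condition labels are made up for illustration.

```r
# Four-gene example matrix from earlier in this chapter.
counts <- matrix(
  c( 250,  310,  295,
    1020,  980, 1105,
       0,    1,    0,
       2,    0,    1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("GeneA", "GeneB", "GeneC", "GeneD"),
    c("Sample1", "Sample2", "Sample3")
  )
)

# Hypothetical metadata, deliberately stored in a different sample order.
metadata <- data.frame(
  condition = c("treated", "control", "control"),
  row.names = c("Sample3", "Sample1", "Sample2")
)

# Reorder metadata rows to match the count matrix columns, then verify.
metadata <- metadata[colnames(counts), , drop = FALSE]
stopifnot(identical(rownames(metadata), colnames(counts)))

# Gene filtering only drops rows, so the sample order is unchanged;
# re-checking afterwards catches accidental column subsetting.
keep <- rowSums(counts >= 10) >= 3
counts_filtered <- counts[keep, , drop = FALSE]
stopifnot(identical(colnames(counts_filtered), rownames(metadata)))
```

Running the same check after any step that removes or reorders samples keeps group labels attached to the right columns.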

Count Matrix Handoff

After filtering, the workflow transitions to normalization and exploratory analysis.

Count Matrix
      ↓
Count Quality Assessment & Filtering
      ↓
Filtered Count Matrix
      ↓
Normalization & Exploratory Analysis

The filtered count matrix becomes the primary input for downstream normalization and visualization.

Key Takeaway

Count quality assessment and filtering help ensure that downstream analyses focus on informative expression measurements.

By reducing noise and documenting filtering decisions, researchers create a stronger foundation for normalization, exploratory analysis, and differential expression testing.

What Comes Next

The next chapter focuses on normalization and exploratory analysis, where filtered count data are transformed into a form suitable for meaningful comparison across samples.