Audience: Students, biologists, bioinformaticians, data scientists, researchers, and practitioners
Theme: Evaluating count matrices and preparing expression data for downstream analysis
Introduction
The count matrix represents the transition between Data Generation & Processing and Expression Analysis.
Although quantification has been completed, not every feature in the count matrix contributes meaningful information to downstream analyses. Some genes may have extremely low counts, while others may show strong and consistent expression across samples.
The goal of this chapter is to assess count matrix quality and prepare expression data for normalization and statistical modeling.
Where This Chapter Fits
flowchart TD
A[Count Matrix]
subgraph EA["Expression Analysis"]
B[Count Quality Assessment & Filtering]
C[Normalization & Exploratory Analysis]
D[Differential Expression Analysis]
end
A --> B --> C --> D
This chapter represents the first step in the Expression Analysis workflow.
Why Evaluate the Count Matrix?
The count matrix is a measurement table, not a final analytical result.
Before normalization, it is important to understand:
How many genes were detected
Whether sequencing depth varies substantially across samples
Whether low-count genes dominate the dataset
Whether sample-level characteristics are reasonable
Whether potential technical issues are present
These assessments help determine whether the data are ready for downstream analysis.
Understanding Low-Count Features
RNA-Seq datasets often contain many genes with little or no detectable expression.
Examples include:
| Gene | Sample1 | Sample2 | Sample3 |
|-------|---------|---------|---------|
| GeneA | 250 | 310 | 295 |
| GeneB | 1020 | 980 | 1105 |
| GeneC | 0 | 1 | 0 |
| GeneD | 2 | 0 | 1 |
Genes A and B contain substantial information.
Genes C and D contribute little evidence and may increase noise during statistical testing.
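The contrast between these genes can be made concrete with two per-gene summaries: the total count and the number of samples with any counts. The following is a minimal sketch using only the four example genes above; the function name `summarize_gene` is a hypothetical helper, not part of any particular RNA-Seq package.

```python
# Toy count matrix from the example table: gene -> counts in Sample1..Sample3.
counts = {
    "GeneA": [250, 310, 295],
    "GeneB": [1020, 980, 1105],
    "GeneC": [0, 1, 0],
    "GeneD": [2, 0, 1],
}


def summarize_gene(gene_counts):
    """Return (total count, number of samples with a nonzero count)."""
    total = sum(gene_counts)
    n_nonzero = sum(1 for c in gene_counts if c > 0)
    return total, n_nonzero


# Hypothetical summary structure: gene -> (total, samples with counts).
gene_summary = {gene: summarize_gene(values) for gene, values in counts.items()}

for gene, (total, n_nonzero) in gene_summary.items():
    print(f"{gene}: total={total}, samples with counts={n_nonzero}")
```

Summaries like these make it easy to see that GeneC and GeneD carry only a handful of reads in total, spread thinly across samples.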
Why Filter Low-Count Genes?
Low-count filtering can:
Reduce noise
Improve statistical efficiency
Reduce multiple-testing burden
Improve model stability
Focus analyses on informative features
Filtering is not intended to manipulate results. It is intended to remove features that provide insufficient evidence for reliable inference.
Library Sizes
One of the first count-level assessments involves library size.
Library size refers to the total number of counts observed within a sample.
Example:
| Sample | Total Counts |
|---------|--------------|
| Sample1 | 18,500,000 |
| Sample2 | 19,200,000 |
| Sample3 | 17,900,000 |
| Sample4 | 24,800,000 |
Substantial differences in library size should be understood before normalization. In this example, Sample4 has roughly a third more counts than the other three samples.
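As an illustration, library sizes can be computed as column sums of the count matrix and then compared against the median library size. This is a minimal sketch, assuming the matrix is stored as a plain dictionary of gene to per-sample counts; the helper names `library_sizes` and `ratio_to_median`, and the three-gene `toy_counts` matrix, are hypothetical and exist only for this example.

```python
from statistics import median


def library_sizes(counts, sample_names):
    """Sum counts over all genes for each sample (column sums)."""
    return {
        name: sum(gene_counts[i] for gene_counts in counts.values())
        for i, name in enumerate(sample_names)
    }


def ratio_to_median(sizes):
    """Express each library size as a ratio to the median library size."""
    mid = median(sizes.values())
    return {sample: size / mid for sample, size in sizes.items()}


# Hypothetical three-gene matrix, used only to show the column-sum step.
toy_counts = {"Gene1": [5, 7, 6], "Gene2": [10, 12, 9], "Gene3": [0, 1, 0]}
toy_sizes = library_sizes(toy_counts, ["S1", "S2", "S3"])

# Library sizes taken from the example table in this section.
example_sizes = {
    "Sample1": 18_500_000,
    "Sample2": 19_200_000,
    "Sample3": 17_900_000,
    "Sample4": 24_800_000,
}
ratios = ratio_to_median(example_sizes)

for sample, ratio in ratios.items():
    print(f"{sample}: {ratio:.2f}x the median library size")
```

Comparing against the median rather than the mean keeps a single unusually deep or shallow sample from shifting the reference point. What counts as a "substantial" ratio is a judgment call that depends on the study.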
Gene Detection Rates
Another useful summary is the number of detected genes per sample.
For example:
| Sample | Detected Genes |
|---------|----------------|
| Sample1 | 15,300 |
| Sample2 | 15,100 |
| Sample3 | 15,450 |
| Sample4 | 10,200 |
Samples with unusually low detection rates, such as Sample4 here with roughly a third fewer detected genes than the others, may require further investigation.
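One way to obtain such a summary is to count, for each sample, the genes whose count meets a minimum value. The sketch below reuses the four example genes from earlier in this chapter. The helper `detected_genes_per_sample` and its `min_count` parameter are assumptions made for illustration; the definition of "detected" (here, at least one count by default) varies between analyses.

```python
# Toy matrix reusing the four example genes from earlier in this chapter.
sample_names = ["Sample1", "Sample2", "Sample3"]
counts = {
    "GeneA": [250, 310, 295],
    "GeneB": [1020, 980, 1105],
    "GeneC": [0, 1, 0],
    "GeneD": [2, 0, 1],
}


def detected_genes_per_sample(counts, sample_names, min_count=1):
    """Count genes whose count in each sample is at least `min_count`."""
    return {
        name: sum(1 for gene_counts in counts.values() if gene_counts[i] >= min_count)
        for i, name in enumerate(sample_names)
    }


# Default definition: a gene is detected if it has at least one count.
detected = detected_genes_per_sample(counts, sample_names)

# Stricter, assumed definition: at least 10 counts.
detected_strict = detected_genes_per_sample(counts, sample_names, min_count=10)

print(detected)
print(detected_strict)
```

Raising `min_count` shows how sensitive the detection summary is to the chosen definition, which is worth recording alongside the numbers.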
Expression Prevalence
Filtering often considers not only count magnitude but also prevalence across samples.
A common principle is:
Retain genes that show sufficient counts in multiple samples.
This helps remove genes that appear only sporadically.
Example Filtering Rule
A simple filtering rule might require:
At least 10 counts in at least 3 samples
Genes failing this criterion would be removed before normalization.
The exact threshold depends on the study design and analysis objectives.
The filtered count matrix becomes the primary input for downstream normalization and visualization.
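The rule of at least 10 counts in at least 3 samples can be sketched in a few lines. This is an illustrative implementation applied to the four example genes from earlier in this chapter, not the method of any specific package; the function name `filter_low_count_genes` and its parameters `min_count` and `min_samples` are hypothetical.

```python
# Toy count matrix: gene -> counts in Sample1..Sample3.
counts = {
    "GeneA": [250, 310, 295],
    "GeneB": [1020, 980, 1105],
    "GeneC": [0, 1, 0],
    "GeneD": [2, 0, 1],
}


def filter_low_count_genes(counts, min_count=10, min_samples=3):
    """Keep genes with at least `min_count` counts in at least `min_samples` samples."""
    kept = {}
    for gene, gene_counts in counts.items():
        # Number of samples in which this gene reaches the count threshold.
        n_passing = sum(1 for c in gene_counts if c >= min_count)
        if n_passing >= min_samples:
            kept[gene] = gene_counts
    return kept


filtered = filter_low_count_genes(counts)
removed = sorted(set(counts) - set(filtered))

print("Kept:", list(filtered))
print("Removed:", removed)
```

With only three samples in this toy matrix, requiring 3 passing samples means a gene must meet the threshold in every sample. In a larger study the same rule is less strict, which is one reason the threshold should be chosen with the study design in mind.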
Key Takeaway
Count quality assessment and filtering help ensure that downstream analyses focus on informative expression measurements.
By reducing noise and documenting filtering decisions, researchers create a stronger foundation for normalization, exploratory analysis, and differential expression testing.
What Comes Next
The next chapter focuses on normalization and exploratory analysis, where filtered count data are transformed into a form suitable for meaningful comparison across samples.