Q&A 3 How do you validate RNA-Seq input data before analysis using R?

3.1 Explanation

Before proceeding with differential expression analysis, confirm that the count matrix and the sample metadata loaded correctly and agree with each other. This includes:

✅ Ensuring all samples in the metadata are present in the count matrix
✅ Verifying that the matrix is numeric and genes are in rows
✅ Checking for NA or non-finite values

These checks catch problems such as a missing sample or text in a count column before they surface as hard-to-trace errors downstream.

3.2 R Code

library(tidyverse)

# 🔄 Load the count matrix and metadata
count_df <- read_csv("data/demo_counts.csv")
metadata <- read_csv("data/demo_metadata.csv")

# 🧪 Inspect the data
glimpse(count_df)

Rows: 1,000
Columns: 15
$ gene     <chr> "Gene1", "Gene2", "Gene3", "Gene4", "Gene5", "Gene6", "Gene7"…
$ Sample1  <dbl> 186, 45, 38, 293, 215, 246, 99, 8, 38, 197, 28, 98, 93, 142, …
$ Sample2  <dbl> 175, 101, 12, 0, 377, 376, 115, 13, 66, 82, 89, 14, 38, 114, …
$ Sample3  <dbl> 175, 54, 214, 172, 72, 34, 74, 13, 88, 49, 223, 237, 315, 145…
$ Sample4  <dbl> 54, 48, 92, 35, 18, 53, 18, 118, 200, 146, 169, 38, 93, 64, 1…
$ Sample5  <dbl> 54, 287, 86, 85, 3, 81, 369, 96, 1, 12, 101, 41, 135, 102, 82…
$ Sample6  <dbl> 67, 74, 110, 173, 27, 180, 82, 22, 146, 8, 148, 70, 0, 432, 1…
$ Sample7  <dbl> 50, 42, 149, 50, 12, 46, 22, 192, 39, 55, 15, 47, 78, 53, 3, …
$ Sample8  <dbl> 3, 54, 6, 142, 50, 312, 289, 264, 82, 244, 48, 208, 77, 167, …
$ Sample9  <dbl> 37, 118, 124, 32, 68, 480, 72, 94, 96, 156, 54, 46, 196, 30, …
$ Sample10 <dbl> 24, 28, 11, 198, 284, 86, 41, 21, 93, 51, 96, 175, 56, 229, 2…
$ Sample11 <dbl> 38, 42, 121, 138, 0, 46, 43, 66, 266, 14, 18, 96, 56, 9, 166,…
$ Sample12 <dbl> 361, 437, 388, 173, 590, 221, 334, 108, 583, 193, 428, 826, 3…
$ Sample13 <dbl> 238, 314, 324, 984, 1208, 287, 350, 438, 170, 346, 857, 986, …
$ Sample14 <dbl> 266, 586, 589, 889, 650, 189, 659, 190, 650, 172, 106, 434, 1…

glimpse(metadata)

Rows: 14
Columns: 2
$ Sample    <chr> "Sample1", "Sample2", "Sample3", "Sample4", "Sample5", "Samp…
$ condition <chr> "Positive", "Positive", "Positive", "Positive", "Positive", …

# Check that all sample names in metadata are in counts
all(metadata$Sample %in% colnames(count_df))  # Should return TRUE

[1] TRUE
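
If that check returns FALSE, the single logical value does not say which names are at fault. A minimal sketch using base R's setdiff() can list them in both directions. The two vectors below are made-up stand-ins for metadata$Sample and the sample columns of count_df; the lowercase typo is illustrative and does not come from the demo data.

```r
# Hypothetical stand-ins for metadata$Sample and the sample columns of count_df
meta_samples  <- c("Sample1", "Sample2", "Sample3")
count_samples <- c("Sample1", "sample2", "Sample3")  # note the lowercase typo

# Samples listed in the metadata but absent from the count columns
missing_in_counts <- setdiff(meta_samples, count_samples)

# Count columns that have no matching metadata row
missing_in_metadata <- setdiff(count_samples, meta_samples)

print(missing_in_counts)
print(missing_in_metadata)
```

With the real objects, the same two calls would take metadata$Sample and setdiff(colnames(count_df), "gene") as inputs.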

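# Move gene names into rownames and convert to a matrix (genes in rows, samples in columns)
counts <- count_df |>
  column_to_rownames("gene") |>
  as.matrix()

# Stop with an error if any metadata sample is missing or the number of samples differs
stopifnot(all(metadata$Sample %in% colnames(counts)))
stopifnot(ncol(counts) == nrow(metadata))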
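
The explanation also lists checks for numeric type and for NA or non-finite values. One way to sketch them is shown below, together with a column-order check. The toy matrix and metadata are made-up stand-ins for counts and metadata. The non-negativity test and the reordering step are additions, assuming the downstream tool expects raw counts with columns in the same order as the metadata rows.

```r
# Toy stand-ins for `counts` and `metadata` (values are made up for illustration)
toy_counts <- matrix(
  c(10, 0, 5,
     3, 7, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2"), c("Sample2", "Sample1", "Sample3"))
)
toy_meta <- data.frame(
  Sample    = c("Sample1", "Sample2", "Sample3"),
  condition = c("Positive", "Positive", "Negative"),
  stringsAsFactors = FALSE
)

# The matrix must be numeric; a stray text column would make it character
stopifnot(is.numeric(toy_counts))

# is.finite() is FALSE for NA, NaN, Inf, and -Inf, so one test covers all of them
stopifnot(all(is.finite(toy_counts)))

# Assumption: raw counts should never be negative
stopifnot(all(toy_counts >= 0))

# Reorder columns to follow the metadata rows, then confirm the alignment
toy_counts <- toy_counts[, toy_meta$Sample, drop = FALSE]
stopifnot(identical(colnames(toy_counts), toy_meta$Sample))
```

To run this against the real data, substitute counts for toy_counts and metadata for toy_meta. The order check matters because the membership test with %in% passes even when the columns and the metadata rows are in different orders.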

✅ Takeaway: A quick check of structure, dimensions, and sample consistency confirms that the count matrix and metadata agree before analysis begins. It catches common mistakes early, such as mismatched sample names or non-numeric entries.