Q&A 2 How do you generate synthetic RNA-Seq counts and metadata using R?
2.1 Explanation
In this step, we generate synthetic RNA-Seq data with known differences between conditions. This allows you to simulate differential expression, save the data into a data/ folder, and later analyze it using DESeq2.
We simulate:
- 1000 genes
- 14 samples (11 Positive, 3 Negative)
- Upregulation of the first 30 genes (Gene1 to Gene30) in Negative samples
- Downregulation of the next 30 genes (Gene31 to Gene60) in Negative samples
🔧 This setup mimics an unbalanced real-world study design. The strong built-in effects should make the differentially expressed genes stand out: as the two arms of the V shape in a volcano plot, and as points well away from the zero line in an MA plot.
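As a rough guide to what the design above implies, the expected fold changes can be worked out from the simulation means alone. This is a minimal sketch: the values 100, 400, and 10 are the `mu` arguments used in the R code of section 2.2, while the variable names (`mu_base`, `expected_lfc`, `nb_sd`) are illustrative and not part of the original script.

```r
# Expected means implied by the simulation settings (mu values from section 2.2)
mu_base <- 100            # baseline mean for every gene in both groups
mu_up   <- mu_base + 400  # up genes: baseline draw plus an extra draw with mu = 400
mu_down <- 10             # down genes: counts replaced by draws with mu = 10

# Log2 ratio of expected means, Negative relative to Positive
expected_lfc <- c(up = log2(mu_up / mu_base), down = log2(mu_down / mu_base))
round(expected_lfc, 2)

# Negative binomial variance is mu + mu^2 / size, so size = 1 gives a wide spread
nb_sd <- sqrt(mu_base + mu_base^2 / 1)
round(nb_sd, 1)
```

The ratios come out at about +2.32 and -3.32 on the log2 scale, and the baseline standard deviation is about 100.5, roughly equal to the mean. With only 3 Negative samples, per-gene estimates can therefore scatter noticeably around these expected values.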
2.2 R Code
library(tidyverse)
set.seed(42)
# 📦 Create output directory
if (!dir.exists("data")) dir.create("data", recursive = TRUE)
# 🧬 Simulation settings
n_genes <- 1000
n_pos <- 11
n_neg <- 3
n_de_up <- 30
n_de_down <- 30
gene_ids <- paste0("Gene", seq_len(n_genes))
# Simulate Positive group (baseline expression)
counts_pos <- matrix(rnbinom(n_genes * n_pos, mu = 100, size = 1), nrow = n_genes)
# Simulate Negative group
counts_neg <- matrix(rnbinom(n_genes * n_neg, mu = 100, size = 1), nrow = n_genes)
# ⬆️ Upregulate first 30 genes in Negative samples (adds extra counts with mu = 400)
counts_neg[1:n_de_up, ] <- counts_neg[1:n_de_up, ] + rnbinom(n_de_up * n_neg, mu = 400, size = 1)
# ⬇️ Downregulate next 30 genes in Negative samples
counts_neg[(n_de_up + 1):(n_de_up + n_de_down), ] <- rnbinom(n_de_down * n_neg, mu = 10, size = 1)
# Combine counts
count_matrix <- cbind(counts_pos, counts_neg)
colnames(count_matrix) <- paste0("Sample", seq_len(n_pos + n_neg))
rownames(count_matrix) <- gene_ids
# 📄 Metadata
metadata <- tibble(
  Sample = colnames(count_matrix),
  condition = c(rep("Positive", n_pos), rep("Negative", n_neg))
)
# 💾 Save to data/
write_csv(as.data.frame(count_matrix) |> rownames_to_column("gene"), "data/demo_counts.csv")
write_csv(metadata, "data/demo_metadata.csv")
# 👁️ Preview first 5 genes × 5 samples
as.data.frame(count_matrix)[1:5, 1:5]
      Sample1 Sample2 Sample3 Sample4 Sample5
Gene1     186     175     175      54      54
Gene2      45     101      54      48     287
Gene3      38      12     214      92      86
Gene4     293       0     172      35      85
Gene5     215     377      72      18       3
# 👁️ Preview first 5 rows of the metadata
head(metadata, 5)
# A tibble: 5 × 2
  Sample  condition
  <chr>   <chr>
1 Sample1 Positive
2 Sample2 Positive
3 Sample3 Positive
4 Sample4 Positive
5 Sample5 Positive
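✅ Takeaway: This simulation builds a count matrix in which Gene1 to Gene30 are upregulated and Gene31 to Gene60 are downregulated in the Negative group, while the remaining 940 genes follow the same distribution in both groups. Because the truly differential genes are known in advance, the data set is well suited to learning DE analysis, checking MA and volcano plots, and testing downstream workflows.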
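Before moving on to DESeq2, it can help to confirm that the planted signal is visible in the raw counts. The sketch below re-creates the simulation in base R with the same settings so that it runs on its own, then compares group means per gene. The pseudocount of 1, the `block` labels, and the use of medians are assumptions of this sketch, not part of the original code, and this crude comparison is no substitute for a proper DE test.

```r
set.seed(42)

# Same settings as the simulation above
n_genes <- 1000
n_pos <- 11
n_neg <- 3
n_de <- 30  # genes per DE block (up, then down)

counts_pos <- matrix(rnbinom(n_genes * n_pos, mu = 100, size = 1), nrow = n_genes)
counts_neg <- matrix(rnbinom(n_genes * n_neg, mu = 100, size = 1), nrow = n_genes)
counts_neg[1:n_de, ] <- counts_neg[1:n_de, ] + rnbinom(n_de * n_neg, mu = 400, size = 1)
counts_neg[(n_de + 1):(2 * n_de), ] <- rnbinom(n_de * n_neg, mu = 10, size = 1)

# Per-gene log2 ratio of group means; the pseudocount of 1 avoids log2(0)
lfc <- log2((rowMeans(counts_neg) + 1) / (rowMeans(counts_pos) + 1))

# Label each gene by the block it was simulated in
block <- factor(
  rep(c("up", "down", "background"), times = c(n_de, n_de, n_genes - 2 * n_de)),
  levels = c("up", "down", "background")
)

# Median observed log2 fold change per block
block_medians <- tapply(lfc, block, median)
round(block_medians, 2)
```

The "up" block should show a clearly positive median, the "down" block a clearly negative one, and the background a median near zero. If they do not, the simulation settings are worth rechecking before running the full analysis.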