Q&A 2 How do you generate synthetic RNA-Seq counts and metadata using R?

2.1 Explanation

In this step, we generate synthetic RNA-Seq data with known differences between conditions. This allows you to simulate differential expression, save the data into a data/ folder, and later analyze it using DESeq2.

We simulate:

1000 genes
14 samples (11 Positive, 3 Negative)
Upregulation in the top 30 genes for Negative samples
Downregulation in the next 30 genes for Negative samples

🔧 This setup mimics a real-world study design and ensures that the resulting volcano and MA plots clearly show the V-shape pattern of differential expression.

2.2 R Code

library(tidyverse)

set.seed(42)

# 📦 Create output directory
if (!dir.exists("data")) dir.create("data", recursive = TRUE)

# 🧬 Simulation settings
n_genes <- 1000
n_pos <- 11
n_neg <- 3
n_de_up <- 30
n_de_down <- 30
gene_ids <- paste0("Gene", seq_len(n_genes))

# Simulate Positive group (baseline expression)
counts_pos <- matrix(rnbinom(n_genes * n_pos, mu = 100, size = 1), nrow = n_genes)

# Simulate Negative group
counts_neg <- matrix(rnbinom(n_genes * n_neg, mu = 100, size = 1), nrow = n_genes)

# ⬆️ Upregulate top 30 genes in Negative samples
counts_neg[1:n_de_up, ] <- counts_neg[1:n_de_up, ] + rnbinom(n_de_up * n_neg, mu = 400, size = 1)

# ⬇️ Downregulate next 30 genes in Negative samples
counts_neg[(n_de_up + 1):(n_de_up + n_de_down), ] <- rnbinom(n_de_down * n_neg, mu = 10, size = 1)

# Combine counts
count_matrix <- cbind(counts_pos, counts_neg)
colnames(count_matrix) <- paste0("Sample", seq_len(n_pos + n_neg))
rownames(count_matrix) <- gene_ids

# 📄 Metadata
metadata <- tibble(
  Sample = colnames(count_matrix),
  condition = c(rep("Positive", n_pos), rep("Negative", n_neg))
)

# 💾 Save to data/
write_csv(as.data.frame(count_matrix) |> rownames_to_column("gene"), "data/demo_counts.csv")
write_csv(metadata, "data/demo_metadata.csv")

# 👁️ Preview first 5 genes × 5 samples
as.data.frame(count_matrix)[1:5, 1:5]

      Sample1 Sample2 Sample3 Sample4 Sample5
Gene1     186     175     175      54      54
Gene2      45     101      54      48     287
Gene3      38      12     214      92      86
Gene4     293       0     172      35      85
Gene5     215     377      72      18       3

# 👁️ Preview metadata
head(metadata, 5)

# A tibble: 5 × 2
  Sample  condition
  <chr>   <chr>    
1 Sample1 Positive 
2 Sample2 Positive 
3 Sample3 Positive 
4 Sample4 Positive 
5 Sample5 Positive

✅ Takeaway: This simulation creates a realistic expression pattern where some genes are clearly upregulated or downregulated in one condition. This structure is ideal for learning DE analysis, producing excellent MA and volcano plots, and testing downstream workflows.