Environment and Project Setup

Published: Jun 2026

ID: RNASEQ-003
Type: Foundations
Audience: Students, researchers, analysts, and practitioners
Theme: Preparing a reproducible RNA-Seq analysis environment

Introduction

After the biological question, study design, and metadata are defined, the next step is to prepare a reproducible computational environment.

In RNA-Seq analysis, reproducibility is not only about writing code. It also depends on how the project is organized, how data files are stored, how software versions are documented, and how outputs are generated.

A well-structured project makes the workflow easier to run, audit, share, and extend.

Where This Chapter Fits

flowchart TD

    A[Biological Question]
    B[Study Design & Metadata]
    C[Environment & Project Setup]
    D[Data Generation]
    E[Data Processing]
    F[Statistical Analysis]
    G[Biological Interpretation]
    H[Reproducible Reporting]

    A --> B --> C --> D --> E --> F --> G --> H

This chapter prepares the computational structure needed before working with RNA-Seq data files.

Why Setup Matters

RNA-Seq projects often involve many files:

Raw sequencing files
Quality control reports
Alignment or quantification outputs
Count matrices
Metadata tables
Analysis scripts
Figures
Reports

Without a clear structure, projects can quickly become difficult to understand or reproduce.

A good setup helps answer questions such as:

Where are the raw data stored?
Which files were generated by the analysis?
Which scripts produced which outputs?
Which software versions were used?
Can the analysis be rerun later?

Recommended Project Structure

A simple RNA-Seq project can use the following structure:

rnaseq-system/
├── data/
│   ├── raw/
│   ├── metadata/
│   ├── processed/
│   └── external/
├── results/
│   ├── qc/
│   ├── counts/
│   ├── differential-expression/
│   ├── enrichment/
│   └── figures/
├── scripts/
│   ├── 00-setup.R
│   ├── 01-read-metadata.R
│   ├── 02-qc-summary.R
│   ├── 03-count-filtering.R
│   ├── 04-normalization-eda.R
│   ├── 05-differential-expression.R
│   └── 06-interpretation.R
├── reports/
├── docs/
├── renv.lock
├── README.md
└── _quarto.yml

The exact structure can be adjusted, but the principle should remain the same: raw data, processed data, scripts, results, and reports should be clearly separated.
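
The folder layout above can also be created from R rather than by hand. The following is a minimal sketch using base R only; the function name create_project_skeleton and the temporary demo location are illustrative assumptions, not part of the original workflow.

```r
# Sketch: create the folder skeleton under a chosen project root.
# The function name and the demo location are illustrative.
create_project_skeleton <- function(root) {
  subdirs <- c(
    "data/raw", "data/metadata", "data/processed", "data/external",
    "results/qc", "results/counts", "results/differential-expression",
    "results/enrichment", "results/figures",
    "scripts", "reports", "docs"
  )
  paths <- file.path(root, subdirs)
  for (p in paths) {
    # recursive = TRUE also creates any missing parent folders
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Usage: build the skeleton in a throwaway location first
demo_root <- file.path(tempdir(), "rnaseq-system")
created <- create_project_skeleton(demo_root)
all(dir.exists(created))
```

Running the function twice is harmless, because existing folders are left untouched.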

Raw Data Should Remain Unchanged

Raw data should be treated as read-only.

Examples of raw data include:

FASTQ files
Original metadata files
External reference files
Original count files from collaborators

These files should not be manually edited. If changes are needed, create a processed version in data/processed/ and document the transformation.

This protects the integrity of the workflow.
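
One way to check that raw files really have stayed unchanged is to record a checksum for each file once and compare against it later. The sketch below uses tools::md5sum() from base R; the function names, the manifest layout, and the demo file are assumptions made for illustration.

```r
# Sketch: record MD5 checksums for raw files, then verify them later.
# Function names and the manifest layout are illustrative assumptions.
write_checksums <- function(raw_dir, manifest) {
  files <- sort(list.files(raw_dir, recursive = TRUE))
  sums <- tools::md5sum(file.path(raw_dir, files))
  out <- data.frame(file = files, md5 = unname(sums), stringsAsFactors = FALSE)
  write.csv(out, manifest, row.names = FALSE)
  invisible(out)
}

verify_checksums <- function(raw_dir, manifest) {
  recorded <- read.csv(manifest, colClasses = "character")
  current <- tools::md5sum(file.path(raw_dir, recorded$file))
  # A deleted file yields NA, which is reported as FALSE
  ok <- !is.na(current) & unname(current) == recorded$md5
  setNames(ok, recorded$file)
}

# Usage with a throwaway folder and a made-up file
raw_dir <- file.path(tempdir(), "raw-demo")
dir.create(raw_dir, showWarnings = FALSE)
writeLines("@read1", file.path(raw_dir, "sample1.fastq"))
manifest <- file.path(tempdir(), "raw-checksums.csv")
write_checksums(raw_dir, manifest)
verify_checksums(raw_dir, manifest)
```

The manifest is kept outside the raw data folder so that it is not itself listed as a raw file.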

Metadata Location

Metadata should be stored in a dedicated folder, such as:

data/metadata/

A typical metadata file might be:

sample-metadata.csv

This file should contain one row per sample and enough variables to support quality control, modeling, and interpretation.

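The metadata table is one of the most important files in the entire RNA-Seq system, because sample grouping for quality control, the statistical model, and the interpretation of results all depend on it.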
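
The one-row-per-sample rule can be checked before any analysis begins. The following is a small sketch in base R; the column names sample_id and condition, and the example table itself, are assumptions for illustration and should be replaced with the project's own variables.

```r
# Sketch: basic structural checks on a sample metadata table.
# The column names sample_id and condition are illustrative assumptions.
validate_metadata <- function(metadata,
                              required = c("sample_id", "condition")) {
  problems <- character(0)
  missing_cols <- setdiff(required, names(metadata))
  if (length(missing_cols) > 0) {
    problems <- c(
      problems,
      paste("Missing columns:", paste(missing_cols, collapse = ", "))
    )
  }
  if ("sample_id" %in% names(metadata)) {
    if (anyDuplicated(metadata$sample_id) > 0) {
      problems <- c(problems, "Duplicate sample IDs (expected one row per sample)")
    }
    if (anyNA(metadata$sample_id)) {
      problems <- c(problems, "Missing sample IDs")
    }
  }
  problems
}

# Usage with a made-up table; an empty result means no problems were found
metadata <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  condition = c("control", "control", "treated", "treated")
)
validate_metadata(metadata)
```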

Results Should Be Regenerable

The results/ folder should contain outputs generated by the workflow.

Examples include:

QC summaries
Count matrices
Normalized expression tables
Differential expression results
Enrichment results
Figures

Ideally, these outputs should be reproducible from scripts and input data.

If a result cannot be regenerated, it becomes difficult to verify.

Script Naming

Scripts should be named in workflow order.

For example:

00-setup.R
01-read-metadata.R
02-qc-summary.R
03-count-filtering.R
04-normalization-eda.R
05-differential-expression.R
06-interpretation.R

Numbered scripts make the workflow easier to follow.

They also help new learners understand the order of operations.
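
A numbered naming scheme also makes it possible to run the workflow in order from a single place. The sketch below is one possible approach in base R; the function names are hypothetical, and it assumes every script uses a two-digit prefix and is run from the project root.

```r
# Sketch: list numbered scripts in workflow order and run them in sequence.
# Assumes file names such as 00-setup.R; function names are illustrative.
list_workflow_scripts <- function(scripts_dir = "scripts") {
  files <- list.files(scripts_dir, pattern = "^[0-9]{2}-.*\\.R$")
  sort(files)
}

run_workflow <- function(scripts_dir = "scripts") {
  for (script in list_workflow_scripts(scripts_dir)) {
    message("Running ", script)
    source(file.path(scripts_dir, script))
  }
}

# Usage, from the project root:
# run_workflow("scripts")
```

Files that do not follow the numbering pattern, such as notes or helper files, are ignored by the listing function.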

R Project Setup

For R-based RNA-Seq analysis, it is helpful to work inside an RStudio Project or a single, self-contained project directory.

The project root should contain files such as:

README.md
_quarto.yml
renv.lock

Using a project root helps avoid fragile file paths and keeps the analysis portable.
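
To illustrate how a project root can be located, the sketch below walks upward from a starting folder until it finds one of the marker files listed above. The function name find_project_root is hypothetical; the here and rprojroot packages provide tested implementations of the same idea.

```r
# Sketch: walk up from a starting folder until a marker file is found.
# The function name is illustrative; the markers follow the files listed above.
find_project_root <- function(start = getwd(),
                              markers = c("renv.lock", "_quarto.yml", "README.md")) {
  dir <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    if (any(file.exists(file.path(dir, markers)))) {
      return(dir)
    }
    parent <- dirname(dir)
    if (identical(parent, dir)) {
      # Reached the top of the file system without finding a marker
      stop("No project root found above ", start)
    }
    dir <- parent
  }
}

# Usage: start from a subfolder of a throwaway project
demo_root <- file.path(tempdir(), "root-demo")
dir.create(file.path(demo_root, "scripts"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "renv.lock"))
find_project_root(file.path(demo_root, "scripts"))
```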

Package Management with renv

The renv package helps record and restore R package versions.

A simple setup begins with:

install.packages("renv")
renv::init()

After installing required packages, save the environment with:

renv::snapshot()

Later, the environment can be restored with:

renv::restore()

This improves reproducibility by recording the package versions used in the project in renv.lock, so that the same package library can be rebuilt later or on another computer.
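
Alongside renv, it can be useful to save a plain-text record of the R version, platform, and loaded packages next to the results. The following is a minimal sketch using sessionInfo() from base R; the function name and the output file name are assumptions for illustration.

```r
# Sketch: save R version, platform, and loaded package details to a text file.
# The function name and output location are illustrative assumptions.
record_session_info <- function(path) {
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  writeLines(capture.output(sessionInfo()), path)
  invisible(path)
}

# Usage: write to a throwaway location and preview the first line
info_path <- record_session_info(
  file.path(tempdir(), "results", "session-info.txt")
)
readLines(info_path, n = 1)
```

Calling this at the end of an analysis script captures the packages that were actually loaded during the run.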

Common R Packages

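RNA-Seq workflows often rely on CRAN packages such as the following:

# readr, dplyr, and ggplot2 are installed as part of the tidyverse;
# they are listed explicitly because the setup script loads them by name
install.packages(c(
  "tidyverse",
  "readr",
  "dplyr",
  "ggplot2",
  "pheatmap"
))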

Bioconductor packages are often installed using BiocManager:

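# Install BiocManager only if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install(c(
  "DESeq2",
  "tximport",
  "SummarizedExperiment",
  "apeglm",
  "clusterProfiler",
  "org.Hs.eg.db"  # human annotation; replace for other organisms
))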

The exact packages depend on the workflow and organism being studied.

Explicit Namespaces

To reduce ambiguity and avoid function masking, this guide uses explicit namespaces when helpful.

For example:

dplyr::filter(metadata, condition == "treated")
ggplot2::ggplot(metadata, ggplot2::aes(x = condition))

Explicit namespaces make code easier to read and reduce confusion when multiple packages contain functions with the same name, such as dplyr::filter() and stats::filter().

Quarto Reporting

Quarto can be used to combine explanation, code, results, and interpretation in one reproducible document.

A Quarto chapter may include:

Narrative explanation
Code chunks
Tables
Figures
Interpretation notes
Key takeaways

This supports the CDI goal of moving from analysis output to defensible interpretation.

Example Setup Script

A minimal setup script might look like this:

# 00-setup.R

# Load packages
library(readr)
library(dplyr)
library(ggplot2)

# Define project paths
metadata_path <- "data/metadata/sample-metadata.csv"
counts_dir <- "results/counts"
figures_dir <- "results/figures"

# Create output directories if needed
dir.create(counts_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(figures_dir, recursive = TRUE, showWarnings = FALSE)

This script prepares common packages and paths used later in the workflow.
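
A setup script can also check that the packages it needs are installed before trying to load them. The sketch below is one possible check in base R; the function name missing_packages is hypothetical.

```r
# Sketch: report which required packages are not yet installed.
# requireNamespace() returns FALSE instead of failing when a package is
# absent; note that it loads the namespace of packages that are present.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Usage: check the packages loaded by the setup script
missing_packages(c("readr", "dplyr", "ggplot2"))
```

An empty result means all listed packages are available; otherwise the returned names can be passed to install.packages().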

File Path Principles

Good file paths should be:

Relative to the project root
Clear and descriptive
Stable across computers
Free of spaces when possible

Avoid absolute paths such as:

/Users/name/Desktop/final-analysis/new-version/results.csv

Prefer project-relative paths such as:

results/differential-expression/deseq2-results.csv

Project-relative paths make the analysis easier to share and rerun.
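
The path principles above can be turned into a simple automated check. The following sketch flags absolute paths and paths containing spaces; the function name path_issues is hypothetical, and the rule used to detect absolute paths (a leading slash, tilde, or Windows drive letter) is a simplifying assumption.

```r
# Sketch: flag paths that break the principles above.
# The function name and the detection rules are illustrative assumptions.
path_issues <- function(path) {
  issues <- character(0)
  # A leading "/", "~", or Windows drive letter marks an absolute path
  if (grepl("^(/|~|[A-Za-z]:[/\\\\])", path)) {
    issues <- c(issues, "absolute path")
  }
  if (grepl(" ", path, fixed = TRUE)) {
    issues <- c(issues, "contains spaces")
  }
  issues
}

# Usage with the two example paths from this section
path_issues("/Users/name/Desktop/final-analysis/new-version/results.csv")
path_issues("results/differential-expression/deseq2-results.csv")
```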

README File

Every RNA-Seq project should include a README.md file.

The README should explain:

Project purpose
Data sources
Main workflow steps
Folder structure
Required software
How to reproduce the analysis
Main outputs

The README acts as the entry point for future users, collaborators, and reviewers.

Setup Checklist

Before starting analysis, confirm that:

The project folder has a clear structure.
Raw data are stored separately and remain unchanged.
Metadata are stored in a dedicated location.
Results have organized output folders.
Scripts are numbered in workflow order.
Package versions are documented.
File paths are project-relative.
A README file explains the workflow.
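
Parts of this checklist can be verified automatically. The sketch below checks only for the presence of a few folders and files from the recommended structure; the function name check_project_setup is hypothetical, and items such as whether raw data remain unchanged still need separate checks.

```r
# Sketch: check that key folders and files from the checklist exist.
# The function name and the selection of items are illustrative assumptions.
check_project_setup <- function(root = ".") {
  dirs <- c("data/raw", "data/metadata", "results", "scripts")
  files <- c("README.md", "renv.lock")
  checks <- c(
    dir.exists(file.path(root, dirs)),
    file.exists(file.path(root, files))
  )
  # Named logical vector: TRUE where the item is present
  setNames(checks, c(dirs, files))
}

# Usage, from the project root:
check_project_setup(".")
```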

Common Mistakes

Common setup mistakes include:

Mixing raw and processed files
Editing raw data manually
Saving outputs only on the desktop
Using unclear file names such as final2.csv
Forgetting software versions
Writing scripts that only work on one computer
Producing figures without documenting how they were generated

These problems reduce reproducibility and make interpretation harder to defend.

Key Takeaway

A reproducible RNA-Seq workflow requires more than correct statistical methods.

The project environment must be organized so that data, code, results, and reports work together as a coherent system.

Good setup protects the analysis from confusion, supports collaboration, and makes biological claims easier to verify.

What Comes Next

The next chapter begins the data processing workflow by focusing on raw read quality control.