Environment and Project Setup

Published

Jun 2026

  • ID: RNASEQ-003
  • Type: Foundations
  • Audience: Students, researchers, analysts, and practitioners
  • Theme: Preparing a reproducible RNA-Seq analysis environment

Introduction

After the biological question, study design, and metadata are defined, the next step is to prepare a reproducible computational environment.

In RNA-Seq analysis, reproducibility is not only about writing code. It also depends on how the project is organized, how data files are stored, how software versions are documented, and how outputs are generated.

A well-structured project makes the workflow easier to run, audit, share, and extend.

Where This Chapter Fits

flowchart TD

    A[Biological Question]
    B[Study Design & Metadata]
    C[Environment & Project Setup]
    D[Data Generation]
    E[Data Processing]
    F[Statistical Analysis]
    G[Biological Interpretation]
    H[Reproducible Reporting]

    A --> B --> C --> D --> E --> F --> G --> H

This chapter prepares the computational structure needed before working with RNA-Seq data files.

Why Setup Matters

RNA-Seq projects often involve many files:

  • Raw sequencing files
  • Quality control reports
  • Alignment or quantification outputs
  • Count matrices
  • Metadata tables
  • Analysis scripts
  • Figures
  • Reports

Without a clear structure, projects can quickly become difficult to understand or reproduce.

A good setup helps answer questions such as:

  • Where are the raw data stored?
  • Which files were generated by the analysis?
  • Which scripts produced which outputs?
  • Which software versions were used?
  • Can the analysis be rerun later?

Raw Data Should Remain Unchanged

Raw data should be treated as read-only.

Examples of raw data include:

  • FASTQ files
  • Original metadata files
  • External reference files
  • Original count files from collaborators

These files should not be manually edited. If changes are needed, create a processed version and document the transformation.

This protects the integrity of the workflow.
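One way to back up the read-only rule is to record a checksum for each raw file and remove write permission. The following is a minimal sketch in base R, not a prescribed procedure. The data/raw folder name, the file name, and the file contents are illustrative assumptions, and a temporary directory stands in for the project.

```r
# Hypothetical raw-data folder inside a throwaway project directory
raw_dir <- file.path(tempfile("project-"), "data", "raw")
dir.create(raw_dir, recursive = TRUE)

# Stand-in for a real FASTQ file (illustrative content only)
fastq_path <- file.path(raw_dir, "sample-01.fastq")
writeLines(c("@read1", "ACGT", "+", "IIII"), fastq_path)

# Record a checksum so that later changes to the file can be detected
checksums <- tools::md5sum(fastq_path)

# Remove write permission (mainly effective on Unix-like systems)
Sys.chmod(fastq_path, mode = "0444")

# Later in the project: confirm the file still matches its recorded checksum
unchanged <- identical(unname(tools::md5sum(fastq_path)), unname(checksums))
```

Saving the checksums next to the raw data gives later users a way to confirm that the files they received match the ones that were analyzed.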

Metadata Location

Metadata should be stored in a dedicated folder, such as:

data/metadata/

A typical metadata file might be:

sample-metadata.csv

This file should contain one row per sample and enough variables to support quality control, modeling, and interpretation.

The metadata table is one of the most important files in the entire RNA-Seq system, because quality control, modeling, and interpretation all depend on it.
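The one-row-per-sample requirement can be checked as soon as the file is read. The sketch below uses base R so that it runs without extra packages; readr::read_csv() would serve equally well. The column names sample_id, condition, and batch and the example rows are assumptions made for illustration.

```r
# Write a small illustrative metadata file into a throwaway project directory
metadata_path <- file.path(
  tempfile("project-"), "data", "metadata", "sample-metadata.csv"
)
dir.create(dirname(metadata_path), recursive = TRUE)
writeLines(c(
  "sample_id,condition,batch",
  "S1,control,A",
  "S2,control,B",
  "S3,treated,A",
  "S4,treated,B"
), metadata_path)

metadata <- read.csv(metadata_path, stringsAsFactors = FALSE)

# Required variables (assumed names) must be present
required_cols <- c("sample_id", "condition")
missing_cols <- setdiff(required_cols, names(metadata))
stopifnot(length(missing_cols) == 0)

# One row per sample: identifiers must be present and unique
stopifnot(
  !anyNA(metadata$sample_id),
  !any(duplicated(metadata$sample_id))
)

# Number of samples per condition, as a quick design check
samples_per_condition <- table(metadata$condition)
```

Failing early on a duplicated or missing sample identifier is usually cheaper than discovering the problem after counts have been matched to samples.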

Results Should Be Regenerable

The results/ folder should contain outputs generated by the workflow.

Examples include:

  • QC summaries
  • Count matrices
  • Normalized expression tables
  • Differential expression results
  • Enrichment results
  • Figures

Ideally, these outputs should be reproducible from scripts and input data.

If a result cannot be regenerated, it becomes difficult to verify.

Script Naming

Scripts should be named in workflow order.

For example:

00-setup.R
01-read-metadata.R
02-qc-summary.R
03-count-filtering.R
04-normalization-eda.R
05-differential-expression.R
06-interpretation.R

Numbered scripts make the workflow easier to follow.

They also help new learners understand the order of operations.
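Numbered names also allow the whole workflow to be rerun in order with a few lines of code. The sketch below assumes a scripts/ folder, which is a hypothetical name, and uses two tiny stand-in scripts in a temporary directory so that the example is self-contained.

```r
# Hypothetical scripts/ folder inside a throwaway project directory
scripts_dir <- file.path(tempfile("project-"), "scripts")
dir.create(scripts_dir, recursive = TRUE)

# Stand-in scripts, deliberately written out of order
writeLines('run_log <- c(run_log, "metadata")',
           file.path(scripts_dir, "01-read-metadata.R"))
writeLines('run_log <- c(run_log, "setup")',
           file.path(scripts_dir, "00-setup.R"))

# Find scripts with a two-digit prefix and sort them by name
script_files <- sort(list.files(
  scripts_dir,
  pattern = "^[0-9]{2}-.*\\.R$",
  full.names = TRUE
))

# Run each script in a shared environment, in workflow order
run_env <- new.env()
run_env$run_log <- character(0)
for (script in script_files) {
  message("Running ", basename(script))
  source(script, local = run_env)
}
run_log <- run_env$run_log
```

Because the two-digit prefixes sort alphabetically, the setup script runs before the metadata script even though it was created second.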

R Project Setup

For R-based RNA-Seq analysis, it is helpful to work inside an RStudio Project or another clearly defined project directory, so that every script runs from the same root.

The project root should contain files such as:

README.md
_quarto.yml
renv.lock

Using a project root helps avoid fragile file paths and keeps the analysis portable.
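A project skeleton can be created with a short script. The sketch below is one possible layout, not a required one: data/metadata, results/counts, and results/figures come from this chapter, while data/raw and scripts are assumed names. A temporary directory stands in for the project root.

```r
# Throwaway location standing in for the real project root
project_root <- tempfile("rnaseq-project-")

project_dirs <- c(
  "data/raw",         # assumed name for read-only raw data
  "data/metadata",
  "scripts",          # assumed name for numbered analysis scripts
  "results/counts",
  "results/figures"
)

# Create each folder, including any missing parent folders
for (d in project_dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# Add a placeholder README only if one does not already exist
readme_path <- file.path(project_root, "README.md")
if (!file.exists(readme_path)) {
  writeLines("# RNA-Seq project", readme_path)
}

created <- dir.exists(file.path(project_root, project_dirs))
```

In a real project, project_root would be the directory that holds README.md, _quarto.yml, and renv.lock.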

Package Management with renv

The renv package helps record and restore R package versions.

A simple setup begins with:

install.packages("renv")
renv::init()

After installing required packages, save the environment with:

renv::snapshot()

Later, the environment can be restored with:

renv::restore()

This improves reproducibility by documenting the package versions used in the project.
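Alongside the renv lockfile, it can help to save a human-readable record of the R version and attached packages for each analysis run. The sketch below uses only base R. The output file name session-info.txt is an assumption, and a temporary directory stands in for the results folder.

```r
# Hypothetical output location; in a project this might sit under results/
versions_path <- file.path(tempdir(), "session-info.txt")

# Save the R version, platform, and attached packages as plain text
writeLines(capture.output(sessionInfo()), versions_path)

# The R version string alone is often worth quoting in reports
r_version <- R.version.string
```

This record complements the lockfile rather than replacing it: the lockfile is what restores the environment, while the text file is what a reader can inspect quickly.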

Common R Packages

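RNA-Seq workflows often use general-purpose CRAN packages such as:

# readr, dplyr, and ggplot2 are installed as part of tidyverse;
# they are listed here to make the workflow's dependencies explicit
install.packages(c(
  "tidyverse",
  "readr",
  "dplyr",
  "ggplot2",
  "pheatmap"
))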

Bioconductor packages are often installed using BiocManager:

install.packages("BiocManager")

BiocManager::install(c(
  "DESeq2",
  "tximport",
  "SummarizedExperiment",
  "apeglm",
  "clusterProfiler",
  "org.Hs.eg.db"
))

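The exact packages depend on the workflow and the organism being studied. For example, org.Hs.eg.db provides human gene annotation and would be replaced by another annotation package for a different species.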

Explicit Namespaces

To reduce ambiguity and avoid function masking, this guide uses explicit namespaces when helpful.

For example:

dplyr::filter(metadata, condition == "treated")
ggplot2::ggplot(metadata, ggplot2::aes(x = condition)) +
  ggplot2::geom_bar()

Explicit namespaces make code easier to read and reduce confusion when multiple packages contain functions with the same name.
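The filter example is a common case of masking. In a fresh R session the bare name filter refers to stats::filter(), a time-series filter, and after library(dplyr) it refers to dplyr::filter() instead. The base R sketch below shows how differently the stats version behaves from row filtering; the small metadata table is illustrative.

```r
# Illustrative metadata table (assumed column names)
metadata <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  condition = c("control", "treated", "treated")
)

# stats::filter() does not subset rows; with these weights it computes
# a 3-point centered moving average over a numeric series
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

# Row subsetting in base R, roughly equivalent to
# dplyr::filter(metadata, condition == "treated")
treated <- metadata[metadata$condition == "treated", ]
```

Writing the namespace explicitly means the call does the same thing regardless of which packages happen to be attached.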

Quarto Reporting

Quarto can be used to combine explanation, code, results, and interpretation in one reproducible document.

A Quarto chapter may include:

  • Narrative explanation
  • Code chunks
  • Tables
  • Figures
  • Interpretation notes
  • Key takeaways

This supports the CDI goal of moving from analysis output to defensible interpretation.

Example Setup Script

A minimal setup script might look like this:

# 00-setup.R

# Load packages
library(readr)
library(dplyr)
library(ggplot2)

# Define project paths
metadata_path <- "data/metadata/sample-metadata.csv"
counts_dir <- "results/counts"
figures_dir <- "results/figures"

# Create output directories if needed
dir.create(counts_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(figures_dir, recursive = TRUE, showWarnings = FALSE)

This script prepares common packages and paths used later in the workflow.

File Path Principles

Good file paths should be:

  • Relative to the project root
  • Clear and descriptive
  • Stable across computers
  • Free of spaces when possible

Avoid absolute paths such as:

/Users/name/Desktop/final-analysis/new-version/results.csv

Prefer project-relative paths such as:

results/differential-expression/deseq2-results.csv

Project-relative paths make the analysis easier to share and rerun.
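In R, file.path() builds project-relative paths from their components. The sketch below also includes a small helper, is_absolute_path(), which is a hypothetical name introduced here and a rough heuristic rather than a complete check.

```r
# Build a project-relative path from its components
results_path <- file.path(
  "results", "differential-expression", "deseq2-results.csv"
)

# Hypothetical helper: flags paths that start at a filesystem root,
# a home directory shortcut, or a Windows drive letter
is_absolute_path <- function(path) {
  grepl("^(/|~|[A-Za-z]:)", path)
}

absolute_flag <- is_absolute_path(
  "/Users/name/Desktop/final-analysis/new-version/results.csv"
)
relative_flag <- is_absolute_path(results_path)
```

Here absolute_flag is TRUE and relative_flag is FALSE, so a check like this could be run over the paths defined in a setup script before sharing the project.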

README File

Every RNA-Seq project should include a README.md file.

The README should explain:

  • Project purpose
  • Data sources
  • Main workflow steps
  • Folder structure
  • Required software
  • How to reproduce the analysis
  • Main outputs

The README acts as the entry point for future users, collaborators, and reviewers.

Setup Checklist

Before starting analysis, confirm that:

  • The project folder has a clear structure.
  • Raw data are stored separately and remain unchanged.
  • Metadata are stored in a dedicated location.
  • Results have organized output folders.
  • Scripts are numbered in workflow order.
  • Package versions are documented.
  • File paths are project-relative.
  • A README file explains the workflow.
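Parts of this checklist can be checked automatically. The sketch below covers only the items that reduce to "does this file or folder exist"; the function name check_setup() is hypothetical, and the demonstration runs on a throwaway directory that deliberately lacks a lockfile.

```r
# Hypothetical helper: reports which expected files and folders exist
check_setup <- function(project_root) {
  c(
    readme       = file.exists(file.path(project_root, "README.md")),
    lockfile     = file.exists(file.path(project_root, "renv.lock")),
    metadata_dir = dir.exists(file.path(project_root, "data", "metadata")),
    results_dir  = dir.exists(file.path(project_root, "results"))
  )
}

# Demonstration on a throwaway project with no renv.lock
demo_root <- tempfile("rnaseq-project-")
dir.create(file.path(demo_root, "data", "metadata"), recursive = TRUE)
dir.create(file.path(demo_root, "results"))
writeLines("# RNA-Seq project", file.path(demo_root, "README.md"))

status <- check_setup(demo_root)
missing_items <- names(status)[!status]
```

Items such as whether raw data remain unchanged or whether the README is adequate still need human review.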

Common Mistakes

Common setup mistakes include:

  • Mixing raw and processed files
  • Editing raw data manually
  • Saving outputs only on the desktop
  • Using unclear file names such as final2.csv
  • Forgetting software versions
  • Writing scripts that only work on one computer
  • Producing figures without documenting how they were generated

These problems reduce reproducibility and make interpretation harder to defend.

Key Takeaway

A reproducible RNA-Seq workflow requires more than correct statistical methods.

The project environment must be organized so that data, code, results, and reports work together as a coherent system.

Good setup protects the analysis from confusion, supports collaboration, and makes biological claims easier to verify.

What Comes Next

The next chapter begins the data processing workflow by focusing on raw read quality control.