Week 11 · Modularizing and Scaling Up
0 · Overview
Up to this point, you have worked with individual participant files and written code step by step to import, score, filter, and summarize the data.
Now it is time to bring everything together into a modular, reproducible workflow.
You will learn how to structure your project so that scripts, functions, and Quarto work together to process all participants automatically.
Goals for this section:
- Build the process_participant() function.
- Automate processing for all participant files.
- Save intermediate and final datasets.
- Understand how modularization and vectorization improve reproducibility.
1 · Modularization and Project Organization
In earlier weeks, all code was written directly inside a Quarto document.
That made sense while we were learning each concept, but now it is time to separate our analysis into distinct, modular parts.
Quarto vs. Script Files
| Component | Purpose | Example Contents |
|---|---|---|
| Quarto Report (.qmd) | A readable narrative that explains your analysis and displays results. It should load data, source your scripts, and summarize outputs. | YAML header, code chunks calling your functions, markdown text interpreting results. |
| R Scripts (.R) | Contain reusable, clearly defined functions. Each script should focus on one logical task. | score_questionnaire.R for scoring, process_participant.R for summarizing data, etc. |
When you knit your Quarto document, it will source these scripts to access the functions they contain.
This separation keeps your code organized, readable, and easy to debug.
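For example, a setup chunk near the top of your Quarto report might look like the sketch below (the script names follow the examples in this section; adjust them to match your repository):

#### Setup chunk in the Quarto report ------------------------------------------
source("scripts/score_questionnaire.R")   # defines score_questionnaire()
source("scripts/process_participant.R")   # defines process_participant()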
2 · Building the Participant-Level Function
The process_participant() function performs all the steps needed to process one participant’s data file.
This function is saved in scripts/process_participant.R.
Conceptually, the function should:
- Read a participant’s .csv file.
- Extract and score the questionnaire responses.
- Separate practice and experiment blocks.
- Filter RTs between 250 and 900 ms.
- Compute mean RT and accuracy for each subset.
- Return a one-row data frame summarizing that participant.
This structure mirrors the logic you already developed when working interactively with single participants, but now wrapped into a self-contained function that can be reused automatically.
Code Scaffold
Below is a commented outline to guide you as you build process_participant().
#### process_participant.R -----------------------------------------------------
## Purpose: Process one participant's data and return a summary data frame.
process_participant <- function(file_path) {
## 1) Read in the data
## Example:
## data <- read.csv(file_path, stringsAsFactors = FALSE)
## Create a participant ID using basename() and sub() to remove the ".csv" extension.
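##      Example (one possible approach):
##      subject_id <- sub("\\.csv$", "", basename(file_path))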
## 2) Extract and score questionnaire
## Find the row where trialType == "questionnaire".
## Retrieve the JSON string from the response column.
## Call score_questionnaire(json_string) to calculate a total score.
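##      Example (column and trial-type names follow the steps above):
##      q_row <- data[data$trialType == "questionnaire", ]
##      json_string <- q_row$response[1]
##      questionnaire_score <- score_questionnaire(json_string)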
## 3) Split task data into blocks and trial types
## Example:
## practice_data <- subset(data, block == "practice")
## Then separate experiment trials (e.g., magnitude and parity).
## 4) Filter RTs
## Apply the 250–900 ms filter to each subset.
## Example:
## practice_filtered <- practice_data[practice_data$rt >= 250 & practice_data$rt <= 900, ]
## 5) Compute summary statistics
## Calculate mean RT and accuracy for each subset.
## Example:
## practice_mean_rt <- mean(practice_filtered$rt, na.rm = TRUE)
## practice_acc <- mean(practice_filtered$correct, na.rm = TRUE)
## 6) Return a one-row data frame
## data.frame(subject_id = subject_id,
## q_score = questionnaire_score,
## practice_mean_rt = practice_mean_rt,
## practice_acc = practice_acc,
## magnitude_mean_rt = magnitude_mean_rt,
## magnitude_acc = magnitude_acc,
## parity_mean_rt = parity_mean_rt,
## parity_acc = parity_acc)
}
Once this function runs correctly for one participant, it can be applied automatically across all files.
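Before scaling up, it helps to test the function on a single file. A minimal sketch, assuming the scripts live in scripts/ and with a placeholder file name:

#### Quick check on one participant file ---------------------------------------
source("scripts/score_questionnaire.R")   # needed inside process_participant()
source("scripts/process_participant.R")
one_summary <- process_participant("data/raw/sub-001.csv")   # placeholder file name
str(one_summary)   # should show a one-row data frame with the columns listed above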
3 · Why Modularization Matters
Modularization means breaking your workflow into small, well-defined pieces that can work together smoothly.
This approach improves your workflow in several ways:
- Clarity: Each script and function has a single, well-named purpose.
- Reusability: The same functions can process new data without rewriting code.
- Scalability: You can apply your functions across many files using lapply().
- Reproducibility: Your Quarto report simply sources and narrates the pipeline, ensuring the same results every time it runs.
By structuring your project this way, anyone can open your repository, run the same scripts, and reproduce your analysis exactly.
4 · Scaling Up to All Participants
Once your process_participant() function works for one file, you can scale it to the full dataset.
The steps are:
- Collect all file paths using list.files(), specifying the folder that contains your participant .csv files (usually data/raw/).
- Apply your function to each file using lapply(). This applies the same logic across participants without writing a loop manually.
- Combine the outputs into one data frame using do.call(rbind, ...).
Example scaffold:
#### In your Quarto report or analysis script ----------------------------------
# Step 1: Find all files
file_list <- list.files("data/raw/", pattern = "*.csv", full.names = TRUE)
# Step 2: Apply the function to each file
all_data <- lapply(file_list, process_participant)
# Step 3: Combine into one data frame
combined <- do.call(rbind, all_data)
This approach replaces manual loops and produces a concise, vectorized workflow that can handle any number of participants.
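A couple of quick sanity checks on the result, using the objects created in the scaffold above:

#### Sanity checks --------------------------------------------------------------
length(file_list)   # number of participant files found
nrow(combined)      # should match: one summary row per participant
head(combined)      # inspect the first few rows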
5 · Saving Intermediate and Final Outputs
Good workflows save key stages of data processing.
Saving both intermediate and final datasets helps you verify each step and avoid repeating expensive computations.
Common checkpoints include:
- Filtered versions of each participant’s task data (for quality checks).
- A participant-level summary file (one row per participant).
- A combined dataset of all participants ready for analysis.
Example:
write.csv(combined, "data/cleaned/study_level_summary.csv", row.names = FALSE)
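As a quick check that the file was written correctly, you can read it straight back in (paths follow the example above):

check <- read.csv("data/cleaned/study_level_summary.csv")
stopifnot(nrow(check) == nrow(combined))   # same number of rows as the object you saved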
By saving intermediate files, you create a transparent record of how raw data were transformed into final analytic variables.
6 · Refactoring and Vectorization
After verifying that your pipeline works correctly, small improvements can make your code cleaner and faster.
| Task | Original | Vectorized |
|---|---|---|
| Reverse-scoring items | for (i in c(2,4,7)) responses[i] <- 4 - responses[i] | responses[c(2,4,7)] <- 4 - responses[c(2,4,7)] |
| Filtering RTs | Loop through rows | data[data$rt >= 250 & data$rt <= 900, ] |
| Applying to files | Manual loop | lapply(file_list, process_participant) |
Vectorization replaces explicit loops with concise, array-based operations that are easier to read and more efficient.
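Here is a toy reverse-scoring example to make the difference concrete; the 0–4 response scale is an assumption, so substitute your questionnaire's actual maximum:

#### Toy example: vectorized reverse-scoring ------------------------------------
responses <- c(3, 1, 4, 0, 2, 3, 4)                   # made-up responses on a 0-4 scale
responses[c(2, 4, 7)] <- 4 - responses[c(2, 4, 7)]    # reverse items 2, 4, and 7 at once
responses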
7 · Reproducibility Checks
Before knitting your Quarto report:
- Clear your environment with rm(list = ls()).
- Re-run your Quarto document or analysis script from start to finish.
- Check that all expected files appear in data/cleaned/.
- Confirm that the final dataset loads without errors.
A workflow that runs cleanly from a fresh environment is genuinely reproducible.
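A minimal sketch of that check, assuming your report is named analysis.qmd (substitute your own file name):

#### Reproducibility check (run in a fresh R session) ---------------------------
rm(list = ls())                       # clear the environment
## Re-render the report; "analysis.qmd" is a placeholder for your own file
# quarto::quarto_render("analysis.qmd")
list.files("data/cleaned/")           # expected outputs should appear here
final <- read.csv("data/cleaned/study_level_summary.csv")   # should load without errors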
8 · Summary
You now have a complete, modular workflow for processing experimental data.
Your Quarto report narrates the analysis, while your scripts perform the reusable work behind the scenes.
This structure makes your research clear, efficient, and reproducible.
In the next stage of analysis, you will use this combined dataset to group and summarize results across participants and visualize patterns using ggplot2.
Because your code is modularized, you can build directly on this foundation without rewriting earlier steps.