Week 11 · Grouping and Summarizing Across Participants
How to use these notes: Create a new Quarto file named reports/npt_group.qmd. Copy the sections below into that file in order. All fenced code chunks use Quarto syntax so you can render immediately.
YAML header for reports/npt_group.qmd
Copy this header to the very top of your Quarto file:
---
title: "Grouping and Summarizing Across Participants"
format: html
execute:
  echo: true
  warning: true
  message: false
---
0 · Overview
We now have a clean study-level dataset where each row is one participant. In this report, we will summarize across participants to create group-level values that we can interpret and later visualize.
Goals for today:
- Load and inspect the study-level data.
- Create derived variables that clarify comparisons.
- Compute summaries across participants.
- Compare groups based on a questionnaire measure of focus (High vs Low using `tef10_score`).
- Save a compact summary table to `output/tables/`.
What are derived variables?
Derived variables are new columns we calculate from existing ones. They help us express a question more directly than raw variables alone. For example, a difference score captures parity minus magnitude reaction time, which is easier to interpret than two separate means.
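To make the idea concrete, here is a minimal sketch on a made-up data frame with three participants. The object `toy` and its values are invented for illustration and are not from the class dataset; only the column names mirror the ones used later in these notes.

```{r}
# Toy data: invented values, not the class dataset
toy <- data.frame(
  subject_id        = c("s01", "s02", "s03"),
  magnitude_mean_rt = c(600, 650, 700),
  parity_mean_rt    = c(640, 630, 760)
)

# Derived variable: parity minus magnitude reaction time
toy$rt_diff <- toy$parity_mean_rt - toy$magnitude_mean_rt

toy
```

Each participant now has a single number describing the contrast between the two conditions.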
1 · Load and inspect
#### Load and Inspect Study-Level Data
```{r}
# Load here::here robustly
if (!require(here)) install.packages("here")
library(here)
# Load the study-level dataset built in our previous workflow
study_level <- read.csv(here("data/cleaned/study_level.csv"))
# Quick inspection
head(study_level)
str(study_level)
summary(study_level)
```
Let's remove the `behavior.` prefix from our variable names so they are easier to type, then save the renamed data back to the same file.
```{r}
## Replace Column Names
names(study_level) <- c("subject_id", "tef10_score", "practice_mean_rt", "practice_acc", "magnitude_mean_rt", "magnitude_acc", "parity_mean_rt", "parity_acc", "practice_sd_rt", "magnitude_sd_rt", "parity_sd_rt")
write.csv(study_level, here::here("data/cleaned/study_level.csv"), row.names = FALSE)
```
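Assigning names by position works only if the columns arrive in exactly this order. As an alternative sketch, a prefix can be stripped by pattern with `sub()`. The example below uses a toy data frame and assumes the raw names begin with `behavior.`, which has not been checked against the actual file, so inspect `names(study_level)` before adapting it.

```{r}
# Toy data frame with hypothetical prefixed column names
toy_names <- data.frame(
  subject_id              = c("s01", "s02"),
  behavior.parity_mean_rt = c(640, 630),
  behavior.parity_acc     = c(0.95, 0.90)
)

# Remove a leading "behavior." wherever it occurs; other names are untouched
names(toy_names) <- sub("^behavior\\.", "", names(toy_names))

names(toy_names)
```

Because the pattern is anchored with `^`, only a prefix at the start of a name is removed.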
2 · Derived variables (participant-level)
We will add variables that make interpretation easier.
A. Reaction time difference: parity mean RT minus magnitude mean RT.
B. Accuracy difference: parity accuracy minus magnitude accuracy.
C. Overall averages across conditions (magnitude + parity) for each participant.
#### Calculate Difference Scores
```{r}
# A) Reaction time difference: parity mean RT minus magnitude mean RT
study_level$rt_diff <- study_level$parity_mean_rt - study_level$magnitude_mean_rt
# B) Accuracy difference: parity accuracy minus magnitude accuracy
study_level$acc_diff <- study_level$parity_acc - study_level$magnitude_acc
# C) Overall averages across conditions
study_level$mean_rt_overall <- rowMeans(
  study_level[, c("magnitude_mean_rt", "parity_mean_rt")],
  na.rm = TRUE
)
study_level$mean_acc_overall <- rowMeans(
  study_level[, c("magnitude_acc", "parity_acc")],
  na.rm = TRUE
)
# Peek at the new columns
head(study_level[, c("subject_id", "rt_diff", "acc_diff",
                     "mean_rt_overall", "mean_acc_overall")])
```
Interpretation prompt
- If `rt_diff` is positive, what does that imply about parity vs magnitude reaction times?
- What does a negative `acc_diff` imply?
3 · Summaries across participants
We summarize across participants to understand overall trends in the class. We will start with simple vectorized summaries, then use aggregate() to summarize by groups.
Why aggregate()? It computes a summary (like a mean) for each level of a grouping variable. This is useful when we want to compare the same measures for different subgroups (for example, High vs Low Focus).
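As a small illustration of what `aggregate()` returns, the sketch below averages one column within each level of a grouping column. The data frame `toy_groups` and its values are invented for this example.

```{r}
# Toy data: invented values, two groups of two participants each
toy_groups <- data.frame(
  focus_group    = c("High Focus", "High Focus", "Low Focus", "Low Focus"),
  parity_mean_rt = c(600, 620, 680, 700)
)

# One row per group, holding the mean of parity_mean_rt within that group
toy_means <- aggregate(
  toy_groups["parity_mean_rt"],
  by  = list(Focus = toy_groups$focus_group),
  FUN = mean
)

toy_means
```

The result is itself a data frame, with one row per group and one column per summarized variable, so it can be printed, saved, or plotted like any other table.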
3A · Vectorized summaries
#### Summarize Across Participants
```{r}
# Class-level means
colMeans(study_level[, c("magnitude_mean_rt",
                         "parity_mean_rt",
                         "rt_diff")],
         na.rm = TRUE)
# Class-level variability (standard deviations)
sapply(study_level[, c("magnitude_mean_rt",
                       "parity_mean_rt",
                       "rt_diff")],
       sd, na.rm = TRUE)
```
You can also write brief inline summaries in prose. For example:
Parity mean RT is `r round(mean(study_level$parity_mean_rt, na.rm = TRUE), 2)` ms, plus or minus `r round(sd(study_level$parity_mean_rt, na.rm = TRUE), 2)` ms.
3B · Group summaries with aggregate()
Sometimes we are curious whether performance depends on a participant characteristic. Maybe accuracy or reaction time depends on how focused a participant reported being. We can categorize participants by tef10_score into High Focus and Low Focus using the class mean as a cut point, then compare groups.
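Two details of a mean split are easy to miss: when the comparison is `>=`, a score exactly at the cut point lands in High Focus, and a missing score stays missing rather than joining either group. A minimal sketch on a made-up vector (`toy_scores` and its values are invented for illustration):

```{r}
# Toy scores: invented values, including one missing response
toy_scores <- c(10, 20, 30, NA)

# Cut point is the mean of the non-missing scores
toy_cut <- mean(toy_scores, na.rm = TRUE)

# Scores at or above the cut point are High Focus; NA stays NA
toy_group <- ifelse(toy_scores >= toy_cut, "High Focus", "Low Focus")

# useNA = "ifany" makes the missing case visible in the counts
table(toy_group, useNA = "ifany")
```

Checking the counts this way shows whether any participants will silently drop out of the group comparison.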
#### Grouping using aggregate
```{r}
# Create a focus grouping variable from tef10_score (mean split)
focus_cut <- mean(study_level$tef10_score, na.rm = TRUE)
study_level$focus_group <- ifelse(study_level$tef10_score >= focus_cut,
                                  "High Focus", "Low Focus")
# Check group counts (useNA shows participants with a missing tef10_score)
table(study_level$focus_group, useNA = "ifany")
# Aggregate reaction times and differences by focus group
rt_by_focus <- aggregate(
  study_level[, c("magnitude_mean_rt",
                  "parity_mean_rt",
                  "rt_diff")],
  by = list(Focus = study_level$focus_group),
  FUN = mean,
  na.rm = TRUE  # passed on to mean() so missing values do not produce NA
)
rt_by_focus
# Aggregate accuracies and differences by focus group
acc_by_focus <- aggregate(
  study_level[, c("magnitude_acc",
                  "parity_acc",
                  "acc_diff")],
  by = list(Focus = study_level$focus_group),
  FUN = mean,
  na.rm = TRUE  # passed on to mean() so missing values do not produce NA
)
acc_by_focus
str(study_level)
# What do we notice here?
```
Example of inline reporting:
For High Focus participants, parity mean RT is approximately `r {
hf <- subset(study_level, focus_group == "High Focus");
round(mean(hf$parity_mean_rt, na.rm = TRUE), 2) }` ms, plus or minus `r {
hf <- subset(study_level, focus_group == "High Focus");
round(sd(hf$parity_mean_rt, na.rm = TRUE), 2) }` ms.
For Low Focus participants, parity mean RT is approximately `r {
lf <- subset(study_level, focus_group == "Low Focus");
round(mean(lf$parity_mean_rt, na.rm = TRUE), 2) }` ms, plus or minus `r {
lf <- subset(study_level, focus_group == "Low Focus");
round(sd(lf$parity_mean_rt, na.rm = TRUE), 2) }` ms.
This is a mean difference of `r round(
mean(study_level$parity_mean_rt[study_level$focus_group == "High Focus"], na.rm = TRUE) -
mean(study_level$parity_mean_rt[study_level$focus_group == "Low Focus"], na.rm = TRUE),
2
)` ms between the High Focus and Low Focus groups on the parity trials.
For High Focus participants, magnitude mean RT is approximately `r {
hf <- subset(study_level, focus_group == "High Focus");
round(mean(hf$magnitude_mean_rt, na.rm = TRUE), 2) }` ms, plus or minus `r {
hf <- subset(study_level, focus_group == "High Focus");
round(sd(hf$magnitude_mean_rt, na.rm = TRUE), 2) }` ms.
For Low Focus participants, magnitude mean RT is approximately `r {
lf <- subset(study_level, focus_group == "Low Focus");
round(mean(lf$magnitude_mean_rt, na.rm = TRUE), 2) }` ms, plus or minus `r {
lf <- subset(study_level, focus_group == "Low Focus");
round(sd(lf$magnitude_mean_rt, na.rm = TRUE), 2) }` ms.
This is a mean difference of `r round(
mean(study_level$magnitude_mean_rt[study_level$focus_group == "High Focus"], na.rm = TRUE) -
mean(study_level$magnitude_mean_rt[study_level$focus_group == "Low Focus"], na.rm = TRUE),
2
)` ms between the High Focus and Low Focus groups on the magnitude trials.
Discussion prompt
- Do the group means suggest that more focused participants were faster, more accurate, both, or neither?
- What additional information would help interpret these differences?
4 · Explore and interpret
#### Exploring and Interpreting
```{r}
# Summaries for key variables
summary(study_level[, c("magnitude_mean_rt",
                        "parity_mean_rt",
                        "rt_diff",
                        "magnitude_acc",
                        "parity_acc",
                        "acc_diff")])
# Optional exploration: association between overall RT and overall accuracy
cor(study_level$mean_rt_overall, study_level$mean_acc_overall, use = "complete.obs")
```
Interpretation prompt
- Which condition appears slower on average?
- Are accuracy differences small or large relative to RT differences?
5 · Save group-level tables
We save a compact table of overall group summaries for later use in reports and figures.
#### Save Files
```{r}
# Make an output directory for tables if needed
dir.create(here("output/tables"), recursive = TRUE, showWarnings = FALSE)
# Build a compact one-row table of overall means and SDs
group_summary <- data.frame(
  mean_magnitude_rt = mean(study_level$magnitude_mean_rt, na.rm = TRUE),
  mean_parity_rt    = mean(study_level$parity_mean_rt, na.rm = TRUE),
  mean_rt_diff      = mean(study_level$rt_diff, na.rm = TRUE),
  sd_magnitude_rt   = sd(study_level$magnitude_mean_rt, na.rm = TRUE),
  sd_parity_rt      = sd(study_level$parity_mean_rt, na.rm = TRUE),
  sd_rt_diff        = sd(study_level$rt_diff, na.rm = TRUE)
)
# Preview and save
group_summary
write.csv(group_summary, here("output/tables/group_summary.csv"), row.names = FALSE)
saveRDS(group_summary, here("output/tables/group_summary.rds"))
# Confirm
file.exists(here("output/tables/group_summary.csv"))
```
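When a later report needs this table, it can be read back in. The sketch below shows the round trip on a toy table written to a temporary folder, so it does not touch the project files; `toy_summary` and its values are made up. One reason to keep an `.rds` copy alongside the `.csv` is that `readRDS()` restores the object exactly as it was saved, including column types.

```{r}
# Toy summary table with invented values
toy_summary <- data.frame(mean_parity_rt = 655.5, sd_parity_rt = 42.1)

# Write to a temporary location rather than output/tables/
tmp_csv <- file.path(tempdir(), "toy_summary.csv")
tmp_rds <- file.path(tempdir(), "toy_summary.rds")
write.csv(toy_summary, tmp_csv, row.names = FALSE)
saveRDS(toy_summary, tmp_rds)

# Read both copies back in
from_csv <- read.csv(tmp_csv)
from_rds <- readRDS(tmp_rds)

# Compare each copy with the original table
identical(from_rds, toy_summary)
all.equal(from_csv, toy_summary)
```

The `.csv` copy is the easier one to open in a spreadsheet or share, while the `.rds` copy is the safer one to load in another R script.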
Appendix · Sanity checks (optional)
```{r}
# Check for missing values
anyNA(study_level)
# Check number of participants
nrow(study_level)
```