Handling Missing Values in Group Summaries with tapply()
When working with real data, it’s common to encounter missing values (NAs).
If you use tapply() to calculate a summary statistic such as a mean or median, any group that contains NAs will return an NA by default — even if other valid numbers exist in that group.
For example, if one of our experimental conditions includes some missing reaction times, R will give us this output:
tapply(experiment_data$rt, experiment_data$condition, mean)
# control incongruent
# 498 NA
At first glance, this can be confusing — one of the groups ("incongruent") has valid data, but because at least one value was missing, R’s default behavior is to return NA.
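To see this behavior on a small scale, here is a minimal sketch. It assumes a made-up experiment_data data frame: the column names rt and condition follow the text above, but the values are invented for illustration and are not the course dataset. Counting the NAs per group first makes it easier to see which groups will be affected.

```r
# Hypothetical data: column names match the notes, the values are made up.
experiment_data <- data.frame(
  condition = c("control", "control", "control",
                "incongruent", "incongruent", "incongruent"),
  rt = c(480, 500, 520, 510, NA, 550)
)

# Count the missing reaction times in each condition.
na_counts <- tapply(is.na(experiment_data$rt), experiment_data$condition, sum)

# Default behaviour: a single NA in a group turns that group's mean into NA.
group_means <- tapply(experiment_data$rt, experiment_data$condition, mean)

print(na_counts)
print(group_means)
```

In this toy data only the incongruent group contains a missing value, so only that group's mean comes back as NA.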
Why This Happens
The mean() function (and many others in R) has an argument called na.rm, short for “NA remove.”
By default, na.rm = FALSE, meaning that if any missing values are present, the entire result becomes NA.
If you set na.rm = TRUE, R will ignore missing values when calculating the mean.
For example:
mean(c(100, 200, NA))
# [1] NA
mean(c(100, 200, NA), na.rm = TRUE)
# [1] 150
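This same logic applies inside tapply(). One way to pass this extra argument is to wrap the call in an anonymous function, which is the approach used here. (tapply() also forwards any additional arguments to the function, so tapply(x, group, mean, na.rm = TRUE) gives the same result.)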
Passing na.rm = TRUE Inside tapply()
You can include arguments like na.rm = TRUE inside tapply() by wrapping the function in an anonymous function using the syntax function(x) { ... }.
Here’s how it works:
tapply(experiment_data$rt, experiment_data$condition, function(x) mean(x, na.rm = TRUE))
# control incongruent
# 498.0 531.1
✅ Explanation:
- tapply() still splits the rt data by condition.
- The anonymous function function(x) tells R what to do with each subset:
  - It takes each subset of reaction times (x),
  - then computes mean(x, na.rm = TRUE), the mean of that subset while ignoring missing values.
- R applies this anonymous function separately to each group and returns the results as a named one-dimensional array, which prints like a named vector.
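The split-then-apply steps described above can be sketched by hand with split() and sapply(). This is only an illustration of the idea, not a description of how tapply() is implemented internally, and the experiment_data values are again made up.

```r
# Hypothetical data: column names match the notes, the values are made up.
experiment_data <- data.frame(
  condition = c("control", "control", "control",
                "incongruent", "incongruent", "incongruent"),
  rt = c(480, 500, 520, 510, NA, 550)
)

# Step 1: split rt into one vector per condition.
rt_by_condition <- split(experiment_data$rt, experiment_data$condition)

# Step 2: apply the anonymous function to each subset in turn.
manual_means <- sapply(rt_by_condition, function(x) mean(x, na.rm = TRUE))

# tapply() performs both steps in a single call.
tapply_means <- tapply(experiment_data$rt, experiment_data$condition,
                       function(x) mean(x, na.rm = TRUE))

print(manual_means)
print(tapply_means)
```

Both approaches give one mean per condition, computed only from the non-missing reaction times.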
This approach works with any summary function that supports na.rm = TRUE, including sum(), sd(), and median().
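As a rough illustration, the same pattern with sum(), sd(), and median() might look like this. The data frame below is hypothetical and exists only to make the example runnable.

```r
# Hypothetical data: column names match the notes, the values are made up.
experiment_data <- data.frame(
  condition = c("control", "control", "control",
                "incongruent", "incongruent", "incongruent"),
  rt = c(480, 500, 520, 510, NA, 550)
)

rt <- experiment_data$rt
condition <- experiment_data$condition

# Same wrapper pattern, different summary function each time.
group_sums    <- tapply(rt, condition, function(x) sum(x, na.rm = TRUE))
group_sds     <- tapply(rt, condition, function(x) sd(x, na.rm = TRUE))
group_medians <- tapply(rt, condition, function(x) median(x, na.rm = TRUE))

print(group_sums)
print(group_sds)
print(group_medians)
```

One caveat worth knowing: if every value in a group is missing, mean(x, na.rm = TRUE) returns NaN rather than NA, because there is nothing left to average.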
Example with Multiple Grouping Variables
You can even apply this approach when summarizing by more than one variable — for example, by both condition and congruent:
tapply(experiment_data$rt, list(experiment_data$condition, experiment_data$congruent),
function(x) mean(x, na.rm = TRUE))
This will produce a two-way table showing mean reaction times for each combination of condition and congruency, ignoring any missing values.
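A minimal runnable sketch of the two-variable case is shown below. The congruent column and all values are assumptions made for illustration; the real dataset may code this variable differently.

```r
# Hypothetical data: the "yes"/"no" coding of congruent is an assumption.
experiment_data <- data.frame(
  condition = c("control", "control", "control", "control",
                "incongruent", "incongruent", "incongruent", "incongruent"),
  congruent = c("yes", "yes", "no", "no", "yes", "yes", "no", "no"),
  rt = c(480, 500, 520, NA, 510, 530, NA, 570)
)

# Rows are the levels of condition, columns are the levels of congruent.
cell_means <- tapply(experiment_data$rt,
                     list(experiment_data$condition, experiment_data$congruent),
                     function(x) mean(x, na.rm = TRUE))

print(cell_means)

# Individual cells can be looked up by row and column name.
cell_means["control", "yes"]
```

The result is a matrix, so a single cell can be extracted by name as in the last line.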
What Is an Anonymous Function?
An anonymous function is a function you define “on the fly” — without giving it a name.
It’s a quick way to describe what R should do with each subset of data, rather than creating a separate named function first.
For example:
function(x) mean(x, na.rm = TRUE)
is equivalent to writing:
ignore_na_mean <- function(x) {
mean(x, na.rm = TRUE)
}
and then using:
tapply(experiment_data$rt, experiment_data$condition, ignore_na_mean)
But defining it inline keeps your code shorter and easier to read when the operation is simple.
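To check that these forms really are interchangeable, the sketch below runs the named version, the inline version, and a third variant that relies on tapply() forwarding extra arguments to the function. The data values are made up for illustration.

```r
# Hypothetical data: column names match the notes, the values are made up.
experiment_data <- data.frame(
  condition = c("control", "control", "control",
                "incongruent", "incongruent", "incongruent"),
  rt = c(480, 500, 520, 510, NA, 550)
)

# Named helper, as defined in the notes.
ignore_na_mean <- function(x) {
  mean(x, na.rm = TRUE)
}

named_result <- tapply(experiment_data$rt, experiment_data$condition,
                       ignore_na_mean)

# The same operation written inline as an anonymous function.
inline_result <- tapply(experiment_data$rt, experiment_data$condition,
                        function(x) mean(x, na.rm = TRUE))

# Extra arguments after the function are passed along to it by tapply().
forwarded_result <- tapply(experiment_data$rt, experiment_data$condition,
                           mean, na.rm = TRUE)

print(all.equal(named_result, inline_result))
print(all.equal(named_result, forwarded_result))
```

The anonymous-function form is the most flexible of the three, since the body can contain any expression rather than a single call with fixed extra arguments.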
Summary
| Situation | What Happens | Fix |
|---|---|---|
| Group contains an NA value | Result is NA | Use na.rm = TRUE |
| Want to apply this within tapply() | Wrap function in function(x) mean(x, na.rm = TRUE) | ✅ Works per group |
| Want to summarize by multiple variables | Use list() for grouping variables | ✅ Produces table output |
In short, wrapping a function inside function(x) { ... } lets you customize what happens within tapply(), including how missing data are handled.
This simple trick makes your code both more robust and more expressive, especially when working with real-world datasets that often include missing values.