Update to Handling Missing Values in Group Summaries with `tapply()`

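You can pass na.rm = TRUE directly inside tapply() without wrapping it in an anonymous function, for example tapply(experiment_data$rt, experiment_data$condition, mean, na.rm = TRUE). This works because tapply() forwards any extra arguments to the summary function through its ... argument, so it does not require R 4.1.0 or later. (What R 4.1.0 did add is the \(x) shorthand for writing anonymous functions.) To check which version of R you are using, run R.version.string in the Console.

Keep the anonymous function approach in mind for cases where an argument cannot simply be forwarded this way, such as when the grouped values are not the function’s first argument or when the summary combines several function calls.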
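
As a minimal sketch of the direct form next to the anonymous-function form, the example below uses a small made-up data frame. The values in experiment_data here are hypothetical and chosen only for illustration; they are not the course dataset.

```r
# Hypothetical data: reaction times (ms) with one missing value
experiment_data <- data.frame(
  rt = c(480, 510, 504, 520, NA, 540),
  condition = c("control", "control", "control",
                "incongruent", "incongruent", "incongruent")
)

# Direct form: na.rm = TRUE is handed on to mean() for every group
direct <- tapply(experiment_data$rt, experiment_data$condition,
                 mean, na.rm = TRUE)

# Anonymous-function form: the same computation, spelled out
wrapped <- tapply(experiment_data$rt, experiment_data$condition,
                  function(x) mean(x, na.rm = TRUE))

print(direct)
print(all.equal(direct, wrapped))
```

Both calls should give the same group means, so the choice between them is mostly about readability.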

Handling Missing Values in Group Summaries with `tapply()`

When working with real data, it’s common to encounter missing values (NAs).
If you use tapply() to calculate a summary statistic such as a mean or median, any group that contains NAs will return an NA by default — even if other valid numbers exist in that group.

For example, if one of our experimental conditions includes some missing reaction times, R will give us this output:

tapply(experiment_data$rt, experiment_data$condition, mean)
#   control incongruent 
#      498          NA

This can be confusing at first: the "incongruent" group has valid data, but because at least one of its values is missing, mean() returns NA for the whole group.
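
To see this behavior on data you can inspect, here is a small sketch with made-up reaction times (the vectors rt and condition are hypothetical, not the course dataset). It also counts the missing values per group, which is a quick way to find out where an NA result comes from.

```r
# Hypothetical reaction times (ms); one value in "incongruent" is missing
rt <- c(480, 510, 504, 520, NA, 540)
condition <- c("control", "control", "control",
               "incongruent", "incongruent", "incongruent")

# Default behavior: a single NA makes the whole group mean NA
group_means <- tapply(rt, condition, mean)
print(group_means)   # control is 498; incongruent is NA

# Count missing values per group: is.na() gives TRUE/FALSE, sum() counts TRUEs
na_counts <- tapply(is.na(rt), condition, sum)
print(na_counts)     # control has 0 missing values; incongruent has 1
```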

Why This Happens

The mean() function (and many others in R) has an argument called na.rm, short for “NA remove.”
By default, na.rm = FALSE, meaning that if any missing values are present, the entire result becomes NA.

If you set na.rm = TRUE, R will ignore missing values when calculating the mean.

For example:

mean(c(100, 200, NA))
# [1] NA

mean(c(100, 200, NA), na.rm = TRUE)
# [1] 150

The same logic applies inside tapply(). One way to pass this extra argument is to wrap the call in an anonymous function.

Passing `na.rm = TRUE` Inside `tapply()`

You can include arguments like na.rm = TRUE inside tapply() by wrapping the function in an anonymous function using the syntax function(x) { ... }.

Here’s how it works:

tapply(experiment_data$rt, experiment_data$condition, function(x) mean(x, na.rm = TRUE))
#   control incongruent
#     498.0       531.1

✅ Explanation:

tapply() still splits the rt data by condition.
The anonymous function, function(x) mean(x, na.rm = TRUE), tells R what to do with each subset:
- It takes each subset of reaction times (x),
- Then computes mean(x, na.rm = TRUE), the mean of that subset while ignoring missing values.
R applies this anonymous function separately to each group and returns the results as a named one-dimensional array, which prints like a named vector.
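
The split-then-apply steps described above can be written out by hand. The sketch below uses split() and sapply() on made-up data (the vectors rt and condition are hypothetical) to mirror what tapply() does; it is an illustration of the idea, not a description of how tapply() is implemented internally.

```r
# Hypothetical reaction times (ms) with one missing value
rt <- c(480, 510, 504, 520, NA, 540)
condition <- c("control", "control", "control",
               "incongruent", "incongruent", "incongruent")

# Step 1: split rt into one vector per condition
groups <- split(rt, condition)

# Step 2: apply the anonymous function to each subset
by_hand <- sapply(groups, function(x) mean(x, na.rm = TRUE))

# The same result in a single tapply() call
via_tapply <- tapply(rt, condition, function(x) mean(x, na.rm = TRUE))

print(by_hand)
print(via_tapply)
```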

This approach works with any summary function that supports na.rm = TRUE, including sum(), sd(), and median().
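
As a brief sketch, the same pattern is shown below with median(), sd(), and sum() on made-up data (rt and condition are hypothetical). The last call counts non-missing values per group, a case where an anonymous function is convenient because the summary combines more than one call.

```r
# Hypothetical reaction times (ms) with one missing value
rt <- c(480, 510, 504, 520, NA, 540)
condition <- c("control", "control", "control",
               "incongruent", "incongruent", "incongruent")

# Same wrapper pattern, different summary functions
group_median <- tapply(rt, condition, function(x) median(x, na.rm = TRUE))
group_sd     <- tapply(rt, condition, function(x) sd(x, na.rm = TRUE))
group_sum    <- tapply(rt, condition, function(x) sum(x, na.rm = TRUE))

# Number of non-missing observations that each summary is based on
group_n <- tapply(rt, condition, function(x) sum(!is.na(x)))

print(group_median)
print(group_sd)
print(group_sum)
print(group_n)
```

Reporting group_n alongside the summaries makes it clear how many observations each group mean or median actually rests on.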

Example with Multiple Grouping Variables

You can even apply this approach when summarizing by more than one variable — for example, by both condition and congruent:

tapply(experiment_data$rt, list(experiment_data$condition, experiment_data$congruent),
       function(x) mean(x, na.rm = TRUE))

This will produce a two-way table showing mean reaction times for each combination of condition and congruency, ignoring any missing values.
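
Here is a small runnable sketch of the two-variable case. The data frame below is made up for illustration (its values and the yes/no coding of congruent are assumptions, not the course dataset).

```r
# Hypothetical data with two grouping variables and one missing rt
experiment_data <- data.frame(
  rt        = c(480, 510, 504, 495, 520, NA, 540, 550),
  condition = rep(c("control", "incongruent"), each = 4),
  congruent = rep(c("yes", "no"), times = 4)
)

# Rows are levels of condition, columns are levels of congruent
rt_means <- tapply(experiment_data$rt,
                   list(experiment_data$condition, experiment_data$congruent),
                   function(x) mean(x, na.rm = TRUE))

print(rt_means)

# Individual cells can be looked up by row and column name
print(rt_means["incongruent", "no"])
```

A combination of levels that has no observations at all appears as NA in the table, which is different from a cell whose values were all removed by na.rm = TRUE.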

What Is an Anonymous Function?

An anonymous function is a function you define “on the fly” — without giving it a name.
It’s a quick way to describe what R should do with each subset of data, rather than creating a separate named function first.

For example:

function(x) mean(x, na.rm = TRUE)

is equivalent to writing:

ignore_na_mean <- function(x) {
  mean(x, na.rm = TRUE)
}

and then using:

tapply(experiment_data$rt, experiment_data$condition, ignore_na_mean)

But defining it inline keeps your code shorter and easier to read when the operation is simple.

Summary

| Situation | Fix | Result |
| --- | --- | --- |
| Group contains an `NA` value, so the result is `NA` | Use `na.rm = TRUE` | ✅ Missing values are ignored |
| Want to apply this within `tapply()` | Wrap function in `function(x) mean(x, na.rm = TRUE)` | ✅ Works per group |
| Want to summarize by multiple variables | Use `list()` for grouping variables | ✅ Produces table output |

In short, wrapping a function inside function(x) { ... } lets you customize what happens within tapply(), including how missing data are handled.
This keeps your group summaries usable when working with real-world datasets, which often include missing values.