Functions, Loops, and Control Structures in R

In this section, we’ll explore how R defines custom functions, controls program flow, and repeats operations efficiently.
You’ll learn to write your own functions, use if statements and for loops, and understand when to use vectorized alternatives like ifelse() and the apply() family.

These are the building blocks of writing flexible, reproducible, and efficient R code.

What Are Control Structures?

Control structures determine how your program runs — what happens, when it happens, and under what conditions. They let you make decisions and repeat actions automatically.

In R, the main types of control structures are:

Conditional statements — such as if, else if, and else, which let your code make decisions based on logical tests.
Loops — especially for loops, which repeat tasks for each item in a sequence.
Vectorized alternatives — like ifelse() or apply(), which perform the same operation across entire vectors or data frames without writing explicit loops.

We’ll mostly use if and for for code clarity, but you’ll also learn how vectorized operations make R code faster and more concise.

1. Creating Your Own Functions

Functions help you automate tasks and make your code reusable.
In R, you define them with the function() command:

function_name <- function(arg1, arg2, ...) {
  ## Code to perform the function’s task
  ## Optionally use return(value)
}

function_name — the name of the function.
arg1, arg2 — the function’s input arguments.
Function body — code inside { }.
Return value — by default, R returns the last expression, or you can use return() explicitly.
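
As a minimal sketch of the two return styles described above, the two helper functions below (add_implicit and add_explicit are hypothetical names, not part of these notes) should give the same result:

```r
# Hypothetical helpers illustrating implicit vs. explicit return values
add_implicit <- function(a, b) {
  a + b            # the last evaluated expression is returned automatically
}

add_explicit <- function(a, b) {
  return(a + b)    # return() hands back the value and exits the function
}

add_implicit(2, 3)   # 5
add_explicit(2, 3)   # 5
```

Both styles are common; an explicit return() is mostly useful when a function needs to exit early.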

Example: Random Number Generator

For example, remember the JavaScript function for generating a random number between 1 and 10:

// JavaScript version
function getRandomNumber() {
  return Math.floor(Math.random() * 10) + 1;
}

This function works by:

Generating a random number between 0 (inclusive) and 1 (exclusive) with Math.random().
Multiplying it by 10 to get a number between 0 (inclusive) and 10 (exclusive).
Using Math.floor to round down to the nearest integer, which gives a whole number from 0 to 9.
Adding 1 to shift the range from [0, 9] to [1, 10].

In R, we can use the sample() function to create a similar function.

## R version
getRandomNumber <- function() {
  sample(1:10, 1)
}

This function works by:

sample(1:10, 1) selects a single integer randomly from the sequence 1:10.
- The 1 tells sample to return a single number from that range.
This avoids the need for rounding and shifts the range directly to [1, 10].

Once the function has been defined (i.e., you have run the code that creates it once), it will appear in the Functions section of the RStudio Environment panel. To use it, call getRandomNumber() wherever you need it in your code; each call returns a random integer between 1 and 10.

Functions with Arguments

You can also define arguments (or parameters) that the function will accept to make a function more flexible. These arguments are used to pass data into the function for it to perform a specific task.

For example, instead of generating a random number that is always between 1 and 10, we can input a minimum and maximum number to our getRandomNumber() function:

getRandomNumber <- function(min, max) {
  sample(min:max, 1)
}

getRandomNumber(5, 25)

Now, we can pick random numbers from any range. When defining arguments, it's a good practice to give them descriptive names that reflect the purpose of the variable, for example min and max.

You can specify as many arguments as necessary for the function. For instance, if we wanted our getRandomNumber(min, max) function to return more than one random number from the sequence, we could add a third argument, number:

getRandomNumber <- function(min, max, number) {
  sample(min:max, number)
}

number: the count of random numbers we want the function to return. If we input 1 it will return 1 number, 3 will return 3 numbers, etc.
For example, getRandomNumber(1, 10, 3) will return 3 random numbers between 1 and 10.
- You can also use the argument names and the = operator to specify the arguments in your function, for example: getRandomNumber(min = 1, max = 10, number = 3)
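
One caveat worth sketching here: sample() draws without replacement by default, so asking for more numbers than the range contains raises an error. The snippet below redefines the three-argument function from above and catches that error; the message string and the variable names result and draws are illustrative assumptions.

```r
getRandomNumber <- function(min, max, number) {
  sample(min:max, number)
}

# Requesting 5 values from a range of only 3 fails, because values
# are not reused by default. tryCatch() lets us capture the error.
result <- tryCatch(
  getRandomNumber(1, 3, 5),
  error = function(e) "error: range too small"
)
result   # "error: range too small"

# One way around this is to allow repeats with sample()'s replace argument
draws <- sample(1:3, 5, replace = TRUE)
length(draws)   # 5
```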

If a function in R is called without inputs and does not have default values set for its arguments, R will return an error indicating that the required arguments are missing.

getRandomNumber <- function(min, max) {
  sample(min:max, 1)
}

getRandomNumber()
# Error in getRandomNumber() : argument "min" is missing, with no default

Functions with Default Arguments

To avoid errors from forgetting to supply an argument, or to give a function sensible defaults that can still be overridden, you can set a default value for each argument using the syntax:

my_function <- function(arg1 = default_value1, arg2 = default_value2) {
  # Function body
}

For example, if we want to be able to specify a min, max and number of random numbers to return, but we want to default to a single number between 1 and 10, we can define our getRandomNumber(min, max, number) function as:

getRandomNumber <- function(min = 1, max = 10, number = 1) {
  sample(min:max, number)
}

This will by default return a single number between 1 and 10 if we just use getRandomNumber() without inputting arguments, but we can also input the min, max, and number arguments to overwrite these defaults (or some of the defaults):

getRandomNumber()         # Will return a random number based on the default values of min=1, max=10, number=1
getRandomNumber(2)        # Will override the default minimum value to 2 and return a single number (default number) between 2 and 10 (default maximum)
getRandomNumber(max=20)   # Will override the default maximum value and return a single number (default number) between 1 (default minimum) and 20
getRandomNumber(18,65,1)  # Will override all defaults and return a single number between 18 and 65
getRandomNumber(min=18, max=65, number=1)  # Will override all defaults and return a single number between 18 and 65

2. Conditionals: `if`, `else if`, `else`

Conditionals let R make decisions based on logical tests.

`If` Statements

if statements allow us to only execute certain code if a particular condition is TRUE. The basic syntax of an if statement is:

if (condition) {
  # Code to execute if condition is TRUE
}

condition: A logical expression that evaluates to TRUE or FALSE.
If condition is TRUE, R runs the code inside the braces { }.
If condition is FALSE, R skips the code inside the braces { }.

For example, if we wanted to print "You are an adult." but only if age >= 18, we could use the code:

age <- 21 # assign age a value of 21
if (age >= 18) {
  print("You are an adult.")
}
# Output is "You are an adult." because the condition is TRUE

age <- 17 # assign age a value of 17
if (age >= 18) {
  print("You are an adult.")
}
# Nothing happens because the condition is FALSE

Adding an `else` clause

Adding an else clause allows us to execute a separate block of code if the initial condition is FALSE. The basic syntax for an if else combination is:

if (condition) {
  # Code if condition is TRUE
} else {
  # Code if condition is FALSE
}

For example, if we wanted to print "You are an adult." for ages >= 18, and "You are not an adult." for ages < 18, we could use the code:

if (age >= 18) {
  print("You are an adult.")
} else {
  print("You are not an adult.")
}

Adding an `else if` clause

If we want to have another conditional, we can use an else if statement. The basic syntax for this type of statement is:

if (condition1) {
  # Code if condition1 is TRUE
} else if (condition2) {
  # Code if condition2 is TRUE
} else {
  # Code if neither condition1 nor condition2 is TRUE
}

So for our age example, we could add two additional else if statements to make a more nuanced age categorization:

if (age >= 65) {
  print("You are a senior.")
} else if (age >= 18) {
  print("You are an adult.")
} else if (age >= 13) {
  print("You are a teen.")
} else {
  print("You are a child.")
}

JavaScript comparison:

let age = 21;
if (age >= 65) {
  console.log("You are a senior.");
} else if (age >= 18) {
  console.log("You are an adult.");
} else if (age >= 13) {
  console.log("You are a teen.");
} else {
  console.log("You are a child.");
}

Concept	R	JavaScript
Variable assignment	`age <- 21`	`let age = 21;`
Print to console	`print("text")`	`console.log("text");`
Conditional	`if (condition) {}`	`if (condition) {}`
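
To tie conditionals back to functions, here is a small sketch that wraps the same age thresholds in a function and returns the label instead of printing it. The name classify_age and the lowercase labels are assumptions made for illustration.

```r
# Hypothetical helper: same thresholds as the age example, returned as a string
classify_age <- function(age) {
  if (age >= 65) {
    "senior"
  } else if (age >= 18) {
    "adult"
  } else if (age >= 13) {
    "teen"
  } else {
    "child"
  }
}

classify_age(70)   # "senior"
classify_age(21)   # "adult"
classify_age(15)   # "teen"
classify_age(8)    # "child"
```

The order of the tests matters: R runs the first branch whose condition is TRUE and skips the rest, so the largest threshold is checked first.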

3. For Loops

A for loop allows you to iterate over a sequence (such as a vector or a range of numbers) and perform an action for each element in that sequence. The basic syntax is:

for (variable in sequence) {
  # Code to execute for each iteration
}

variable: This is a temporary variable that takes the value of each element in the sequence during each iteration.
- By convention, R users often name this variable i (for index or item)
sequence: A vector, list, or range of numbers that the loop will iterate over.
The code inside the loop is executed once for each element in the sequence.

Example: Printing Numbers 1 to 5

for (i in 1:5) {
  print(paste("Iteration:", i))
}

The loop starts with the sequence 1:5 (1 through 5) and takes each item i from that sequence in turn.
The loop starts with i = 1, then i = 2, and so on until i = 5.
On each pass, the loop body runs with the current value of i; here, print(paste("Iteration:", i)) builds and prints a label such as "Iteration: 1".
Each label is printed on a new line.

JavaScript comparison:

for (let i = 1; i <= 5; i++) {
  console.log("Iteration: " + i);
}
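
Loops are often used to fill a results vector rather than just print. The sketch below uses made-up values (rts, doubled, and empty are illustrative names) and also shows why seq_along() is a common alternative to 1:length(x).

```r
rts <- c(480, 530, 495)            # illustrative values, not from the notes
doubled <- numeric(length(rts))    # pre-allocate space for the results

for (i in seq_along(rts)) {        # seq_along(rts) is 1, 2, 3
  doubled[i] <- rts[i] * 2
}
doubled
# 960 1060  990

empty <- numeric(0)
seq_along(empty)    # integer(0): a loop over this runs zero times
1:length(empty)     # 1 0: a loop over this would run twice, which is rarely intended
```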

4. Combining Loops and Conditionals in Data Frames

So far, we’ve used for loops and if statements on small, simple examples — single numbers or short vectors. In real data analysis, though, we usually work with data frames, where each row represents one observation (like a participant or trial), and each column represents a variable (like age or reaction_time).

To update or compute new information for each observation, we often need to be able to identify and modify individual cells inside that data frame — the intersection of one row and one column.

Being able to do this is essential because it allows us to:

Apply conditional logic row by row (e.g., mark trials as “Fast” or “Slow”),
Fill in or transform data programmatically (e.g., calculate per-participant results),
Debug or test how code interacts with real data structures step-by-step.

Let’s look at an example.

Example: Classifying Reaction Times Row by Row

Let’s build a small example data frame representing subjects’ reaction times in a cognitive task:

experiment_data <- data.frame(
  subject_id = c(1:20),
  rt = c(480, 530, 495, 610, 455, 390, 510, 565, 430, NA, 380, 230, 395, 710, 755, 590, 810, 365, 630, 200),
  congruent = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE),
condition = c("control", "control", "incongruent", "control",
               "incongruent", "control", "incongruent", "incongruent",
               "control", "incongruent", "control", "control", "incongruent",
               "control", "incongruent", "control", "incongruent", "incongruent",
               "control", "incongruent")
)

Now suppose we want to classify each subject’s reaction time as "Fast" or "Slow", using 500 milliseconds as our cutoff. We’ll create a new column called rt_category and fill it one row at a time using a for loop.

## Create a new (empty) column
experiment_data$rt_category <- NA

## Use a for loop to classify each subject
for (i in 1:nrow(experiment_data)) {
  if (is.na(experiment_data[i, "rt"])) {
    experiment_data[i, "rt_category"] <- "Unknown"
  } else if (experiment_data[i, "rt"] < 500) {
    experiment_data[i, "rt_category"] <- "Fast"
  } else {
    experiment_data[i, "rt_category"] <- "Slow"
  }
}

Let’s unpack what’s happening:

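for (i in 1:nrow(experiment_data)): creates a loop that runs once for each row in experiment_data.
Inside the loop, experiment_data[i, "rt"] retrieves the reaction time value for the i-th subject.
The if statement first checks whether that reaction time is missing, using is.na(); if so, it assigns "Unknown". This check has to come first, because comparing NA with 500 yields NA, and if () stops with an error when its condition is NA.
Otherwise, the else if tests whether the reaction time is below 500 milliseconds. If it is, we assign "Fast" to that subject’s rt_category cell (experiment_data[i, "rt_category"]); if not, we assign "Slow".
The assignment operator <- updates that single cell in the data frame.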

After the loop completes, every participant now has a new rt_category label, which you can check with View(experiment_data) or head(experiment_data).

This example illustrates row-wise data manipulation — one of the most fundamental tasks in R programming.

Conceptually, Why This Matters

Being able to access and update a single cell teaches you how R “thinks” about data. It reinforces three key programming concepts:

Dynamic indexing: Using variables (like i) to locate and modify elements programmatically.
Iteration: Performing actions sequentially across rows or columns.
Data integrity: Ensuring you’re modifying only what you intend — a specific cell, not an entire column or the wrong variable.

Once you understand this cell-level control, it becomes much easier to understand what vectorization is doing behind the scenes — performing these same operations automatically, across every element at once.
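
The difference between modifying one cell and modifying a whole column can be sketched with a tiny, made-up data frame (scores and after_cell are hypothetical names):

```r
scores <- data.frame(id = 1:3, score = c(10, 20, 30))   # hypothetical data

i <- 2
scores[i, "score"] <- 99      # row index given: changes only row 2
after_cell <- scores$score
after_cell
# 10 99 30

scores[, "score"] <- 0        # no row index: overwrites the whole column
scores$score
# 0 0 0
```

Leaving out the row index is an easy mistake to make inside a loop, which is why it helps to check a few rows after running it.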

5. Vectorized Conditionals: `ifelse()`

In the previous example, we used a for loop to label each subject’s reaction time as "Fast" or "Slow". That approach works perfectly, and it’s useful for understanding how R processes data one row at a time, but it’s not the most efficient way to work with larger datasets. This is because a for loop relies on explicit iteration, so R has to:

Create a sequence of indices (1:length(rt)),
Step through the vector one element at a time,
Run the if test for each element, and
Store the result in the corresponding position.

R is designed to work with vectors: ordered collections of values that it can process all at once. A vectorized function automatically performs the same operation on every element of a vector without you needing to write an explicit loop. That means R can evaluate a condition, perform calculations, or make assignments across an entire column in one step.

ifelse() is a vectorized version of if, meaning it can evaluate and return results for every element of a vector in a single operation.

Revisiting Our Reaction Time Example

With a vectorized function like ifelse(), we can express the same logic as our for loop in a single, readable line:

## Using a vectorized function (the column does not need to be created first)
experiment_data$rt_category_vector <- ifelse(experiment_data$rt < 500, "Fast", "Slow")

Here’s what happens behind the scenes:

R evaluates the logical test experiment_data$rt < 500 for every element in the vector at once, producing a logical vector. For the first five subjects this is: [1] TRUE FALSE TRUE FALSE TRUE
Then it automatically assigns "Fast" wherever the condition is TRUE and "Slow" wherever it’s FALSE, returning a character vector of the same length. For the first five subjects this is: [1] "Fast" "Slow" "Fast" "Slow" "Fast"
That’s it: no manual looping required.
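
As a quick check that the two approaches agree, the sketch below runs both on the first five reaction times from experiment_data, copied into a standalone vector so the snippet runs on its own (loop_result and vec_result are illustrative names):

```r
rt <- c(480, 530, 495, 610, 455)   # first five rt values, no missing data

# Loop version: one element at a time
loop_result <- character(length(rt))
for (i in seq_along(rt)) {
  if (rt[i] < 500) {
    loop_result[i] <- "Fast"
  } else {
    loop_result[i] <- "Slow"
  }
}

# Vectorized version: the whole vector in one call
vec_result <- ifelse(rt < 500, "Fast", "Slow")

vec_result
# "Fast" "Slow" "Fast" "Slow" "Fast"
identical(loop_result, vec_result)   # TRUE
```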

Handling More Than Two Outcomes with `ifelse()`

By default, a single ifelse() statement can only make two decisions: one for when the condition is TRUE, and another for when it’s FALSE.

Our previous ifelse() example works, but there’s a limitation:

If a reaction time (rt) is missing (NA), the test NA < 500 evaluates to NA, which is neither TRUE nor FALSE.
As a result, ifelse() returns NA for that element instead of placing it in either category, so the row is left without a label.

To fix this, and to include an additional category for missing values, we can nest one ifelse() inside another:

experiment_data$rt_category_vector <- ifelse(
  is.na(experiment_data$rt), "Unknown",              # If RT is missing
  ifelse(experiment_data$rt < 500, "Fast", "Slow")   # Otherwise, check Fast vs Slow
)

What’s Happening Here

is.na(experiment_data$rt): Returns a logical vector: TRUE where RT is missing, FALSE otherwise.
The first ifelse() checks for NAs.
- If TRUE, assigns "Unknown".
- If FALSE, evaluates the next expression — the nested ifelse().
The inner ifelse() tests experiment_data$rt < 500.
- If TRUE, assigns "Fast", else assigns "Slow".
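
The behaviour with missing values is easy to verify on a three-element vector. This is a minimal sketch with made-up values; single and nested are illustrative names.

```r
rt <- c(480, NA, 530)   # illustrative values with one missing RT

# A single ifelse(): a missing value in the test comes back as NA
single <- ifelse(rt < 500, "Fast", "Slow")
single
# "Fast" NA     "Slow"

# The nested form labels the missing value explicitly
nested <- ifelse(is.na(rt), "Unknown", ifelse(rt < 500, "Fast", "Slow"))
nested
# "Fast"    "Unknown" "Slow"
```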

Why This Works

ifelse() operates element-by-element across the vector. For every row in experiment_data, R runs both logical tests — first checking whether the value is missing, then checking whether it’s below 500 — and writes the appropriate category to the new column.

In short, this nested form lets us classify each trial in a single, concise, and fully vectorized command that’s functionally equivalent to the earlier for loop — but much faster and cleaner.

Why Vectorization Matters

Efficiency: Many vectorized functions do their element-by-element work in compiled C code inside R, so they usually run much faster than an equivalent loop written in R code.
Simplicity: You write less code and reduce the chance of off-by-one or indexing errors.
Readability: Vectorized expressions more clearly express your intent (e.g., “label all trials below 500 as Fast”) instead of the mechanics of looping.

In short, ifelse() is the vectorized version of if. While a for loop works one element at a time, a vectorized function operates on entire vectors at once — making your code faster, cleaner, and easier to understand.

6. Vectorized Summary Functions: `rowMeans()`, `colMeans()`, `apply()`, and `tapply()`

Now that we’ve seen how ifelse() can replace a loop for labeling individual trials, let’s look at a few other vectorized functions that summarize data across rows, columns, or groups.

Example 1: `rowMeans()` and `colMeans()`

If we had multiple reaction time measures for each subject (e.g., across multiple blocks), we could compute their average response across columns or rows using rowMeans() and colMeans().

## Example data frame of reaction times from 3 blocks per subject
rt_data <- data.frame(
  block1 = c(520, 480, 610, 390, 450),
  block2 = c(530, 470, 600, 420, 500),
  block3 = c(540, 490, 590, 410, 480)
)
rt_data

## Mean reaction time for each subject across blocks (row-wise)
rowMeans(rt_data)
# [1] 530.0000 480.0000 600.0000 406.6667 476.6667

## Mean reaction time for each block across subjects (column-wise)
colMeans(rt_data)
# block1 block2 block3 
#    490    504    502

rowMeans() averages across columns (within each subject).
colMeans() averages across rows (within each block).

If you wanted to calculate the same row or column means with an explicit for loop, you could use the following:

# Initialize an empty vector to store row means
row_means_manual <- numeric(nrow(rt_data))

# Loop over each row
for (i in 1:nrow(rt_data)) {
  total <- sum(rt_data[i, ])         # Sum across the columns in this row
  count <- ncol(rt_data)             # Number of columns (3)
  row_means_manual[i] <- total / count
}

row_means_manual
# [1] 530.0000 480.0000 600.0000 406.6667 476.6667

# Initialize an empty vector to store column means
col_means_manual <- numeric(ncol(rt_data))

# Loop over each column
for (j in 1:ncol(rt_data)) {
  total <- sum(rt_data[, j])         # Sum all rows in this column
  count <- nrow(rt_data)             # Number of rows (5)
  col_means_manual[j] <- total / count
}

col_means_manual
# [1] 490 504 502
names(col_means_manual) <- names(rt_data)
col_means_manual
# block1 block2 block3 
#    490    504    502

Example 2: `apply()` for Flexible Vectorization

apply() is a more general function that can apply any function to rows or columns of a data frame or matrix.

## Apply the mean function across columns (2 = columns)
apply(rt_data, 2, mean)
# Same as colMeans()

## Apply the standard deviation across rows (1 = rows)
apply(rt_data, 1, sd)

This makes apply() especially useful when you need to run a function other than mean — for example, standard deviation, median, or a custom function.
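
As a sketch of the custom-function case, the example below computes the spread (maximum minus minimum) of reaction times. The helper rt_spread and the result names are hypothetical; the data frame repeats rt_data from above so the snippet runs on its own.

```r
rt_data <- data.frame(
  block1 = c(520, 480, 610, 390, 450),
  block2 = c(530, 470, 600, 420, 500),
  block3 = c(540, 490, 590, 410, 480)
)

# Hypothetical custom summary: difference between the slowest and fastest RT
rt_spread <- function(x) max(x) - min(x)

row_spread <- apply(rt_data, 1, rt_spread)   # 1 = rows (per subject)
row_spread
# 20 20 20 30 50

# An anonymous function works the same way
col_spread <- apply(rt_data, 2, function(x) max(x) - min(x))   # 2 = columns (per block)
col_spread
# block1 block2 block3
#    220    180    180
```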

Example 3: `tapply()` for Grouped Summaries

tapply() is one of R’s most powerful built-in tools for computing summaries by group — such as condition or congruency.

Returning to our experiment_data example:

experiment_data$rt[is.na(experiment_data$rt)] <- 500 # To start, replace the NA with 500 so each group mean is a number rather than NA
## Mean RT by experimental condition
tapply(experiment_data$rt, experiment_data$condition, mean)
#   control incongruent 
#   498     505

## Mean RT by congruency (TRUE/FALSE)
tapply(experiment_data$rt, experiment_data$congruent, mean)
#   FALSE  TRUE 
#   447    556

tapply() takes three arguments:

The vector of data to summarize (experiment_data$rt),
A grouping variable (experiment_data$condition),
The function to apply (mean).

It returns one summarized value per group — no loops or subsetting required.

This is much simpler than doing the same thing with a loop:

# Step-by-Step Manual Version
# Identify the unique groups in the condition column
conditions <- unique(experiment_data$condition)
conditions
# [1] "control" "incongruent"

# Create an empty vector to store the results
condition_means <- numeric(length(conditions))
names(condition_means) <- conditions

# Loop through each unique condition
for (i in 1:length(conditions)) {
  
  current_condition <- conditions[i]  # The condition for this iteration
  
  # Subset reaction times for the current condition
  subset_rt <- experiment_data$rt[experiment_data$condition == current_condition]
  
  # Compute the mean, removing missing values if there are any
  condition_means[i] <- mean(subset_rt, na.rm = TRUE)
}

condition_means
#     control incongruent 
#         498         505

Here’s what’s happening step by step:

unique(experiment_data$condition) extracts all distinct group labels — "control" and "incongruent".
We initialize an empty numeric vector (condition_means) to store the mean for each group, and name its elements after the conditions.
The for loop iterates through each condition:
- On each iteration, R filters the rt column to include only rows for that condition.
- It then calculates the mean of that subset and stores the result in the vector.
When the loop finishes, we have one mean per condition, just like the tapply() output.

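Notice that we replaced the NA value with 500 before running tapply(). Without that step, mean() would return NA for any group containing a missing value. Replacing missing data with a fixed number changes the group means, though, so dropping the missing values is often preferable. tapply() passes any extra arguments on to the summary function, so tapply(experiment_data$rt, experiment_data$condition, mean, na.rm = TRUE) works, as does wrapping the call in an anonymous function (function(x) mean(x, na.rm = TRUE)). Read more: Handling Missing Values in Group Summaries with tapply()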
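
For missing values in grouped summaries, base R's tapply() forwards any arguments placed after the function on to that function. The sketch below uses a small made-up vector (not the full experiment_data) to compare the options; the variable names are illustrative.

```r
# Illustrative data with one missing reaction time
rt <- c(480, 530, NA, 610, 455, 390)
condition <- c("control", "control", "incongruent",
               "control", "incongruent", "incongruent")

# Without na.rm, a group that contains NA gets NA as its mean
with_na <- tapply(rt, condition, mean)
with_na
#     control incongruent
#         540          NA

# Extra arguments after the function are passed on to it
with_na_rm <- tapply(rt, condition, mean, na.rm = TRUE)
with_na_rm
#     control incongruent
#       540.0       422.5

# Equivalent form using an anonymous function
with_anon <- tapply(rt, condition, function(x) mean(x, na.rm = TRUE))
```

Dropping the missing value leaves the other observations in the group untouched, whereas replacing it with a fixed number shifts the group mean.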

Summary

Function	Purpose	Typical Input	Example
`rowMeans()`	Mean across columns (per subject)	Numeric data frame or matrix	`rowMeans(rt_data)`
`colMeans()`	Mean across rows (per block)	Numeric data frame or matrix	`colMeans(rt_data)`
`apply()`	Apply a custom function to rows or columns	Matrix or data frame	`apply(rt_data, 2, sd)`
`tapply()`	Apply a function by group	Vector + grouping variable	`tapply(rt, condition, mean)`

7. Summary

Functions let you bundle operations and reuse them easily.
Conditionals let your code make decisions with if, else if, and else.
Loops let your code repeat actions for each element of a sequence.
Combining loops and conditionals lets you apply logic row by row in a data frame.
Use vectorized alternatives like ifelse(), rowMeans(), apply(), and tapply() to make your code faster, simpler, and more readable.
- As you get comfortable, you’ll start to see where loops make sense for clarity — and where vectorization makes sense for speed.
Advanced structures like while, repeat, next, and break exist but are less common in data analysis.
When writing R code, prioritize clarity first, then efficiency through vectorization once the logic is clear and correct.

The goal is not just to write working code, but to write code that’s readable, efficient, and expressive of your intent.