Functions, Loops, and Control Structures in R
In this section, we’ll explore how R defines custom functions, controls program flow, and repeats operations efficiently.
You’ll learn to write your own functions, use if statements and for loops, and understand when to use vectorized alternatives like ifelse() and the apply() family.
These are the building blocks of writing flexible, reproducible, and efficient R code.
What Are Control Structures?
Control structures determine how your program runs — what happens, when it happens, and under what conditions. They let you make decisions and repeat actions automatically.
In R, the main types of control structures are:
- Conditional statements, such as if, else if, and else, which let your code make decisions based on logical tests.
- Loops, especially for loops, which repeat tasks for each item in a sequence.
- Vectorized alternatives, like ifelse() or apply(), which perform the same operation across entire vectors or data frames without writing explicit loops.
We’ll mostly use if and for for code clarity, but you’ll also learn how vectorized operations make R code faster and more concise.
1. Creating Your Own Functions
Functions help you automate tasks and make your code reusable.
In R, you define them with the function() command:
function_name <- function(arg1, arg2, ...) {
## Code to perform the function’s task
## Optionally use return(value)
}
- function_name: the name of the function.
- arg1, arg2: the function’s input arguments.
- Function body: code inside { }.
- Return value: by default, R returns the last expression, or you can use return() explicitly.
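As a minimal sketch of the two return styles described above (the names add_implicit and add_explicit are made up for illustration):

```r
# Implicit return: the value of the last expression is returned
add_implicit <- function(x, y) {
  x + y
}

# Explicit return: return() hands back a value and exits the function
add_explicit <- function(x, y) {
  return(x + y)
}

add_implicit(2, 3)  # 5
add_explicit(2, 3)  # 5
```

Both functions behave the same way here; return() is mainly useful when you want to leave a function early.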
Example: Random Number Generator
For example, remember the JavaScript function for generating a random number between 1 and 10:
// JavaScript version
function getRandomNumber() {
return Math.floor(Math.random() * 10) + 1;
}
This function works by:
- Generating a random number between 0 (inclusive) and 1 (exclusive) with Math.random().
- Multiplying it by 10 to get a number between 0 and 10.
- Using Math.floor to round down to the nearest integer.
- Adding 1 to shift the range from [0, 9] to [1, 10].
In R, we can use the sample() function to create a similar function.
## R version
getRandomNumber <- function() {
sample(1:10, 1)
}
This function works by:
- sample(1:10, 1) selects a single integer randomly from the sequence 1:10.
  - The 1 tells sample to return a single number from that range.
- This avoids the need for rounding and shifts the range directly to [1, 10].
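For comparison, the JavaScript arithmetic can also be mirrored step by step in R. The sketch below assumes runif() as the stand-in for Math.random() and uses a made-up function name; it is shown only to illustrate the correspondence, since sample() is the simpler choice.

```r
# Hypothetical name; mirrors the JavaScript arithmetic step by step
getRandomNumberJsStyle <- function() {
  u <- runif(1)        # one random number between 0 and 1, like Math.random()
  floor(u * 10) + 1    # scale up, round down, then shift to the range 1..10
}

getRandomNumberJsStyle()
```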
Once the function has been defined (i.e., you have run the code that creates it), it will appear in the Functions section of the RStudio Environment panel. To use it, call getRandomNumber() wherever you need it in your code; each call returns a random integer between 1 and 10.
Functions with Arguments
You can also define arguments (or parameters) that the function will accept to make a function more flexible. These arguments are used to pass data into the function for it to perform a specific task.
For example, instead of generating a random number that is always between 1 and 10, we can input a minimum and maximum number to our getRandomNumber() function:
getRandomNumber <- function(min, max) {
sample(min:max, 1)
}
getRandomNumber(5, 25)
Now, we can pick random numbers from any range. When defining arguments, it's a good practice to give them descriptive names that reflect the purpose of the variable, for example min and max.
You can specify as many arguments as necessary for the function. For instance, if we wanted our getRandomNumber(min, max) function to return more than one random number from the sequence, we could add a third argument, number:
getRandomNumber <- function(min, max, number) {
sample(min:max, number)
}
- number is now the number of values we want the function to return. If we input 1 it will return 1 number, 3 will return 3 numbers, etc.
  - For example, getRandomNumber(1, 10, 3) will return 3 random numbers between 1 and 10.
  - You can also use the argument names and the = operator to specify the arguments in your function call, for example: getRandomNumber(min = 1, max = 10, number = 3)
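One detail worth knowing: by default sample() draws without replacement, so asking for more numbers than the range contains raises an error. A possible variant, sketched here with a hypothetical allow_repeats argument that is not part of the original function, passes replace = TRUE to sample():

```r
# Hypothetical variant of getRandomNumber() with an optional allow_repeats flag
getRandomNumbers <- function(min, max, number, allow_repeats = FALSE) {
  # replace = TRUE lets the same value be drawn more than once
  sample(min:max, number, replace = allow_repeats)
}

getRandomNumbers(1, 10, 3)                       # 3 distinct values from 1 to 10
getRandomNumbers(1, 3, 5, allow_repeats = TRUE)  # 5 values from 1 to 3, repeats allowed
# getRandomNumbers(1, 3, 5) would stop with an error: the range holds only 3 values
```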
If a function in R is called without inputs and does not have default values set for its arguments, R will return an error indicating that the required arguments are missing.
getRandomNumber <- function(min, max) {
sample(min:max, 1)
}
getRandomNumber()
# Error in getRandomNumber() : argument "min" is missing, with no default
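If you would rather show your own message than R’s default one, a function can test for missing arguments itself. The following is only a sketch: the function name and the wording of the message are made up, and it relies on base R’s missing(), stop(), and tryCatch().

```r
# Hypothetical variant that checks its inputs before using them
getRandomNumberChecked <- function(min, max) {
  if (missing(min) || missing(max)) {
    stop("Please supply both min and max, e.g. getRandomNumberChecked(1, 10)")
  }
  sample(min:max, 1)
}

# tryCatch() captures the error message so the script keeps running
result <- tryCatch(
  getRandomNumberChecked(),
  error = function(e) conditionMessage(e)
)
result
```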
Functions with Default Arguments
To avoid errors from a forgotten argument, or to give a function typical values that can still be overridden, you can set a default value for each argument using the syntax:
my_function <- function(arg1 = default_value1, arg2 = default_value2) {
# Function body
}
For example, if we want to be able to specify a min, max and number of random numbers to return, but we want to default to a single number between 1 and 10, we can define our getRandomNumber(min, max, number) function as:
getRandomNumber <- function(min = 1, max = 10, number = 1) {
sample(min:max, number)
}
This will by default return a single number between 1 and 10 if we just use getRandomNumber() without inputting arguments, but we can also input the min, max, and number arguments to overwrite these defaults (or some of the defaults):
getRandomNumber() # Will return a random number based on the default values of min=1, max=10, number=1
getRandomNumber(2) # Will override the default minimum value to 2 and return a single number (default number) between 2 and 10 (default maximum)
getRandomNumber(max=20) # Will override the default maximum value and return a single number (default number) between 1 (default minimum) and 20
getRandomNumber(18,65,1) # Will override all defaults and return a single number between 18 and 65
getRandomNumber(min=18, max=65, number=1) # Will override all defaults and return a single number between 18 and 65
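One caveat with sample(min:max, number): when min and max are equal, min:max is a single number, and sample() treats a single number n (of 1 or more) as shorthand for 1:n. So getRandomNumber(7, 7) can return any integer from 1 to 7. A defensive sketch (the name getRandomNumberSafe is hypothetical) samples positions instead of values:

```r
# Hypothetical safer variant: sample positions, then look up the values
getRandomNumberSafe <- function(min = 1, max = 10, number = 1) {
  values <- min:max
  # sample.int() picks positions 1..length(values), so a one-element
  # range is never expanded into 1:n
  values[sample.int(length(values), number)]
}

getRandomNumberSafe(7, 7)  # always 7
getRandomNumberSafe()      # a single number between 1 and 10
```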
2. Conditionals: if, else if, else
Conditionals let R make decisions based on logical tests.
If Statements
if statements allow us to only execute certain code if a particular condition is TRUE. The basic syntax of an if statement is:
if (condition) {
# Code to execute if condition is TRUE
}
- condition: A logical expression that evaluates to TRUE or FALSE.
- If condition is TRUE, R runs the code inside the braces { }.
- If condition is FALSE, R skips the code inside the braces { }.
For example, if we wanted to print "You are an adult." but only if age >= 18, we could use the code:
age <- 21 # assign age a value of 21
if (age >= 18) {
print("You are an adult.")
}
# Output is "You are an adult." because the condition is TRUE
age <- 17 # assign age a value of 17
if (age >= 18) {
print("You are an adult.")
}
# Nothing happens because the condition is FALSE
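Note that if expects a single TRUE or FALSE. If the condition is NA (for example, because age is missing), R stops with an error. One way to guard against that, sketched here with a made-up is_adult variable, is to test for a missing value first:

```r
age <- NA  # a missing age

# NA >= 18 evaluates to NA, which if cannot use as a condition.
# && stops at the first FALSE, so age >= 18 is never evaluated here.
is_adult <- !is.na(age) && age >= 18

if (is_adult) {
  print("You are an adult.")
}
# Nothing is printed, and no error is raised
```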
Adding an else clause
Adding an else clause allows us to execute a separate block of code if the initial condition is FALSE. The basic syntax for an if else combination is:
if (condition) {
# Code if condition is TRUE
} else {
# Code if condition is FALSE
}
For example, if we wanted to print "You are an adult." for ages >= 18, and "You are not an adult." for ages < 18, we could use the code:
if (age >= 18) {
print("You are an adult.")
} else {
print("You are not an adult.")
}
Adding an else if clause
If we want to have another conditional, we can use an else if statement. The basic syntax for this type of statement is:
if (condition1) {
# Code if condition1 is TRUE
} else if (condition2) {
# Code if condition2 is TRUE
} else {
# Code if neither condition1 nor condition2 is TRUE
}
So for our age example, we could add two additional else if statements to make a more nuanced age categorization:
if (age >= 65) {
print("You are a senior.")
} else if (age >= 18) {
print("You are an adult.")
} else if (age >= 13) {
print("You are a teen.")
} else {
print("You are a child.")
}
JavaScript comparison:
let age = 21;
if (age >= 65) {
console.log("You are a senior.");
} else if (age >= 18) {
console.log("You are an adult.");
} else if (age >= 13) {
console.log("You are a teen.");
} else {
console.log("You are a child.");
}
| Concept | R | JavaScript |
|---|---|---|
| Variable assignment | age <- 21 | let age = 21; |
| Print to console | print("text") | console.log("text"); |
| Conditional | if (condition) {} | if (condition) {} |
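The age categorization above can also be wrapped in a function, combining this section with the previous one. This is a sketch with a hypothetical helper name, categorize_age, that returns a short label instead of printing a sentence:

```r
# Hypothetical helper combining a function with if / else if / else
categorize_age <- function(age) {
  if (age >= 65) {
    "senior"
  } else if (age >= 18) {
    "adult"
  } else if (age >= 13) {
    "teen"
  } else {
    "child"
  }
}

categorize_age(21)  # "adult"
categorize_age(8)   # "child"
```

Because the if / else chain is the last expression in the function body, its value becomes the function’s return value.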
3. For Loops
A for loop allows you to iterate over a sequence (such as a vector or a range of numbers) and perform an action for each element in that sequence. The basic syntax is:
for (variable in sequence) {
# Code to execute for each iteration
}
- variable: A temporary variable that takes the value of each element in the sequence during each iteration.
  - R users often name this variable i, short for the index or item in the sequence.
- sequence: A vector, list, or range of numbers that the loop will iterate over.
- The code inside the loop is executed once for each element in the sequence.
Example: Printing Numbers 1 to 5
for (i in 1:5) {
print(paste("Iteration:", i))
}
- The loop takes the sequence 1:5 (1 through 5) and selects each individual item i from that sequence.
- The loop starts with i = 1, then i = 2, and so on until i = 5.
- On each iteration, the loop runs the code in its body on the current item i; here, print(paste("Iteration:", i)) prints the label followed by the value of i.
- Each value of i is printed on a new line.
JavaScript comparison:
for (let i = 1; i <= 5; i++) {
console.log("Iteration: " + i);
}
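A for loop can also step through the positions of an existing vector and store one result per element. The sketch below uses made-up condition labels and the base R helper seq_along(), which returns the positions 1, 2, 3, ... of a vector:

```r
conditions <- c("control", "incongruent", "neutral")  # made-up labels

# Pre-allocate a vector to hold one result per element
label_lengths <- numeric(length(conditions))

# seq_along() gives the positions 1, 2, 3 of the vector
for (i in seq_along(conditions)) {
  label_lengths[i] <- nchar(conditions[i])  # number of characters in each label
}

label_lengths
# 7 11 7
```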
4. Combining Loops and Conditionals in Data Frames
So far, we’ve used for loops and if statements on small, simple examples — single numbers or short vectors.
In real data analysis, though, we usually work with data frames, where each row represents one observation (like a participant or trial), and each column represents a variable (like age or reaction_time).
To update or compute new information for each observation, we often need to be able to identify and modify individual cells inside that data frame — the intersection of one row and one column.
Being able to do this is essential because it allows us to:
- Apply conditional logic row by row (e.g., mark trials as “Fast” or “Slow”),
- Fill in or transform data programmatically (e.g., calculate per-participant results),
- Debug or test how code interacts with real data structures step-by-step.
Let’s look at an example.
Example: Classifying Reaction Times Row by Row
Let’s build a small example data frame representing subjects’ reaction times in a cognitive task:
experiment_data <- data.frame(
subject_id = c(1:20),
rt = c(480, 530, 495, 610, 455, 390, 510, 565, 430, NA, 380, 230, 395, 710, 755, 590, 810, 365, 630, 200),
congruent = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE),
condition = c("control", "control", "incongruent", "control",
"incongruent", "control", "incongruent", "incongruent",
"control", "incongruent", "control", "control", "incongruent",
"control", "incongruent", "control", "incongruent", "incongruent",
"control", "incongruent")
)
Now suppose we want to classify each subject’s reaction time as "Fast" or "Slow", using 500 milliseconds as our cutoff.
We’ll create a new column called rt_category and fill it one row at a time using a for loop.
## Create a new (empty) column
experiment_data$rt_category <- NA
## Use a for loop to classify each subject
for (i in 1:nrow(experiment_data)) {
if (is.na(experiment_data[i, "rt"])) {
experiment_data[i, "rt_category"] <- "Unknown"
} else if (experiment_data[i, "rt"] < 500) {
experiment_data[i, "rt_category"] <- "Fast"
} else {
experiment_data[i, "rt_category"] <- "Slow"
}
}
Let’s unpack what’s happening:
- for (i in 1:nrow(experiment_data)): creates a loop that runs once for each row in experiment_data.
- Inside the loop, experiment_data[i, "rt"] retrieves the reaction time value for the i-th subject.
- The if statement first checks whether that reaction time is missing (is.na()); if so, we assign "Unknown".
- Otherwise, the else if checks whether the reaction time is below 500 milliseconds. If it is, we assign "Fast" to that subject’s rt_category cell (experiment_data[i, "rt_category"]); otherwise, we assign "Slow".
- The assignment operator <- updates that single cell in the data frame.
After the loop completes, every participant now has a new rt_category label, which you can check with View(experiment_data) or head(experiment_data).
This example illustrates row-wise data manipulation — one of the most fundamental tasks in R programming.
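One caveat with 1:nrow(...): if the data frame has zero rows, 1:0 counts down and yields c(1, 0), so the loop body still runs. seq_len(nrow(...)) returns an empty sequence in that case. A sketch of the same classification on a small made-up data frame (trial_data is not part of the original example):

```r
trial_data <- data.frame(rt = c(480, NA, 530))  # made-up values
trial_data$rt_category <- NA

# seq_len() is a safer way to write 1:nrow(...)
for (i in seq_len(nrow(trial_data))) {
  if (is.na(trial_data[i, "rt"])) {
    trial_data[i, "rt_category"] <- "Unknown"
  } else if (trial_data[i, "rt"] < 500) {
    trial_data[i, "rt_category"] <- "Fast"
  } else {
    trial_data[i, "rt_category"] <- "Slow"
  }
}

trial_data$rt_category
# "Fast" "Unknown" "Slow"

# With zero rows, seq_len() gives an empty sequence, so a loop would not run
empty_data <- trial_data[0, ]
length(seq_len(nrow(empty_data)))  # 0
length(1:nrow(empty_data))         # 2, because 1:0 is c(1, 0)
```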
Conceptually, Why This Matters
Being able to access and update a single cell teaches you how R “thinks” about data. It reinforces three key programming concepts:
- Dynamic indexing: Using variables (like
i) to locate and modify elements programmatically. - Iteration: Performing actions sequentially across rows or columns.
- Data integrity: Ensuring you’re modifying only what you intend — a specific cell, not an entire column or the wrong variable.
Once you understand this cell-level control, it becomes much easier to understand what vectorization is doing behind the scenes — performing these same operations automatically, across every element at once.
5. Vectorized Conditionals: ifelse()
In the previous example, we used a for loop to label each subject’s reaction time as "Fast" or "Slow". That approach works, and it’s useful for understanding how R processes data one row at a time, but it’s not the most efficient way to work with larger datasets. This is because with a for loop, R has to iterate explicitly and:
- Create a sequence of indices (1:length(rt)),
- Step through the vector one element at a time,
- Run the if test for each element, and
- Store the result in the corresponding position.
R is designed to work with vectors: ordered collections of values that it can process all at once. A vectorized function automatically performs the same operation on every element of a vector without you needing to write an explicit loop. That means R can evaluate a condition, perform calculations, or make assignments across an entire column in one step.
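As a small illustration of what "all at once" means, the sketch below applies arithmetic and a comparison to a whole vector without any loop. The vector holds the first five reaction times from the example data:

```r
rt <- c(480, 530, 495, 610, 455)  # first five reaction times from the example

rt / 1000      # converts every value from milliseconds to seconds
# 0.480 0.530 0.495 0.610 0.455

rt < 500       # tests every value against the cutoff
# TRUE FALSE TRUE FALSE TRUE

sum(rt < 500)  # TRUE counts as 1, so this counts the fast trials
# 3
```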
ifelse() is a vectorized version of if, meaning it can evaluate and return results for every element of a vector in a single operation.
Revisiting Our Reaction Time Example
With a vectorized function like ifelse(), we can express the same logic as our for loop in a single, readable line:
experiment_data$rt_category_vector <- NA
## Using a vectorized function
experiment_data$rt_category_vector <- ifelse(experiment_data$rt < 500, "Fast", "Slow")
Here’s what happens behind the scenes:
- R evaluates the logical test experiment_data$rt < 500 for every element in the vector at once, producing a logical vector. For the first five subjects: [1] TRUE FALSE TRUE FALSE TRUE
- Then it automatically assigns "Fast" wherever the condition is TRUE and "Slow" wherever it’s FALSE, returning the full character vector. For the first five subjects: [1] "Fast" "Slow" "Fast" "Slow" "Fast"

That’s it: no manual looping required.
Handling More Than Two Outcomes with ifelse()
By default, a single ifelse() statement can only make two decisions:
one for when the condition is TRUE, and another for when it’s FALSE.
Our previous example (experiment_data$rt_category_vector <- ifelse(experiment_data$rt < 500, "Fast", "Slow")) works, but there’s a limitation:
- If a reaction time (rt) is missing (NA), the test NA < 500 evaluates to NA, which is neither TRUE nor FALSE.
- As a result, ifelse() returns NA for that row rather than "Fast", "Slow", or an explicit label such as "Unknown".
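A quick check on a small made-up vector shows how ifelse() treats a missing value:

```r
rt_small <- c(480, NA, 530)  # made-up values, one of them missing

rt_small < 500
# TRUE NA FALSE

ifelse(rt_small < 500, "Fast", "Slow")
# "Fast" NA "Slow"
```

The missing value stays NA in the result, so it needs its own test if you want an explicit label for it.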
To fix this, and to include an additional category for missing values, we can nest one ifelse() inside another:
experiment_data$rt_category_vector <- ifelse(
is.na(experiment_data$rt), "Unknown", # If RT is missing
ifelse(experiment_data$rt < 500, "Fast", "Slow") # Otherwise, check Fast vs Slow
)
What’s Happening Here
- is.na(experiment_data$rt): Returns a logical vector: TRUE where RT is missing, FALSE otherwise.
- The first ifelse() checks for NAs.
  - If TRUE, assigns "Unknown".
  - If FALSE, uses the result of the next expression, the nested ifelse().
- The inner ifelse() tests experiment_data$rt < 500.
  - If TRUE, assigns "Fast", else assigns "Slow".
Why This Works
ifelse() operates element-by-element across the vector.
For every row in experiment_data, R runs both logical tests — first checking whether the value is missing, then checking whether it’s below 500 — and writes the appropriate category to the new column.
In short, this nested form lets us classify each trial in a single, concise, and fully vectorized command that’s functionally equivalent to the earlier for loop — but much faster and cleaner.
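To confirm that the two approaches agree, you can run both on the same input and compare the results with identical(). The sketch below uses a short made-up vector rather than the full data frame:

```r
rt_check <- c(480, NA, 530, 390)  # made-up values

# Loop version
loop_result <- character(length(rt_check))
for (i in seq_along(rt_check)) {
  if (is.na(rt_check[i])) {
    loop_result[i] <- "Unknown"
  } else if (rt_check[i] < 500) {
    loop_result[i] <- "Fast"
  } else {
    loop_result[i] <- "Slow"
  }
}

# Vectorized version
vector_result <- ifelse(is.na(rt_check), "Unknown",
                        ifelse(rt_check < 500, "Fast", "Slow"))

identical(loop_result, vector_result)
# TRUE
```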
Why Vectorization Matters
- Efficiency: Many vectorized functions do their looping in optimized, compiled C code inside R, so they usually run much faster than an explicit loop written in R code.
- Simplicity: You write less code and reduce the chance of off-by-one or indexing errors.
- Readability: Vectorized expressions more clearly express your intent (e.g., “label all trials below 500 as Fast”) instead of the mechanics of looping.
In short, ifelse() is the vectorized version of if.
While a for loop works one element at a time, a vectorized function operates on entire vectors at once — making your code faster, cleaner, and easier to understand.
6. Vectorized Summary Functions: rowMeans(), colMeans(), apply(), and tapply()
Now that we’ve seen how ifelse() can replace a loop for labeling individual trials, let’s look at a few other vectorized functions that summarize data across rows, columns, or groups.
Example 1: rowMeans() and colMeans()
If we had multiple reaction time measures for each subject (e.g., across multiple blocks), we could compute their average response across columns or rows using rowMeans() and colMeans().
## Example data frame of reaction times from 3 blocks per subject
rt_data <- data.frame(
block1 = c(520, 480, 610, 390, 450),
block2 = c(530, 470, 600, 420, 500),
block3 = c(540, 490, 590, 410, 480)
)
rt_data
## Mean reaction time for each subject across blocks (row-wise)
rowMeans(rt_data)
# [1] 530.0 480.0 600.0 406.7 476.7
## Mean reaction time for each block across subjects (column-wise)
colMeans(rt_data)
# block1 block2 block3
# 490 504 502
rowMeans() averages across columns (within each subject).
colMeans() averages across rows (within each block).
If you wanted to calculate the same row or column means with an explicit for loop, you could use the following:
# Initialize an empty vector to store row means
row_means_manual <- numeric(nrow(rt_data))
# Loop over each row
for (i in 1:nrow(rt_data)) {
total <- sum(rt_data[i, ]) # Sum across the columns in this row
count <- ncol(rt_data) # Number of columns (3)
row_means_manual[i] <- total / count
}
row_means_manual
# [1] 530.0 480.0 600.0 406.7 476.7
# Initialize an empty vector to store column means
col_means_manual <- numeric(ncol(rt_data))
# Loop over each column
for (j in 1:ncol(rt_data)) {
total <- sum(rt_data[, j]) # Sum all rows in this column
count <- nrow(rt_data) # Number of rows (5)
col_means_manual[j] <- total / count
}
col_means_manual
# [1] 490 504 502
names(col_means_manual) <- names(rt_data)
col_means_manual
# block1 block2 block3
# 490 504 502
Example 2: apply() for Flexible Vectorization
apply() is a more general function that can apply any function to rows or columns of a data frame or matrix.
## Apply the mean function across columns (2 = columns)
apply(rt_data, 2, mean)
# Same as colMeans()
## Apply the standard deviation across rows (1 = rows)
apply(rt_data, 1, sd)
This makes apply() especially useful when you need to run a function other than mean — for example, standard deviation, median, or a custom function.
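As a sketch of the custom-function case, the example below computes each subject’s range (slowest block minus fastest block) by passing an anonymous function to apply(). The data frame is repeated so the block runs on its own:

```r
rt_data <- data.frame(
  block1 = c(520, 480, 610, 390, 450),
  block2 = c(530, 470, 600, 420, 500),
  block3 = c(540, 490, 590, 410, 480)
)

# Range of reaction times for each subject (1 = rows)
rt_range <- apply(rt_data, 1, function(x) max(x) - min(x))
rt_range
# 20 20 20 30 50
```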
Example 3: tapply() for Grouped Summaries
tapply() is one of R’s most powerful built-in tools for computing summaries by group — such as condition or congruency.
Returning to our experiment_data example:
experiment_data$rt[is.na(experiment_data$rt)] <- 500 # To start, replace the NA with 500 so mean() does not return NA for that group
## Mean RT by experimental condition
tapply(experiment_data$rt, experiment_data$condition, mean)
# control incongruent
# 498 505
## Mean RT by congruency (TRUE/FALSE)
tapply(experiment_data$rt, experiment_data$congruent, mean)
# FALSE TRUE
# 447 556
tapply() takes three arguments:
- The vector of data to summarize (experiment_data$rt),
- A grouping variable (experiment_data$condition),
- The function to apply (mean).
It returns one summarized value per group — no loops or subsetting required.
It's much simpler than if we wanted to do the same thing with a loop:
# Step-by-Step Manual Version
# Identify the unique groups in the condition column
conditions <- unique(experiment_data$condition)
conditions
# [1] "control" "incongruent"
# Create an empty vector to store the results
condition_means <- numeric(length(conditions))
names(condition_means) <- conditions
# Loop through each unique condition
for (i in 1:length(conditions)) {
current_condition <- conditions[i] # The condition for this iteration
# Subset reaction times for the current condition
subset_rt <- experiment_data$rt[experiment_data$condition == current_condition]
# Compute the mean, removing missing values if there are any
condition_means[i] <- mean(subset_rt, na.rm = TRUE)
}
condition_means
# control incongruent
# 498 505
Here’s what’s happening step by step:
- unique(experiment_data$condition) extracts all distinct group labels: "control" and "incongruent".
- We initialize an empty numeric vector (condition_means) to store the mean for each group, and name its elements after the conditions.
- The for loop iterates through each condition:
  - On each iteration, R filters the rt column to include only rows for that condition.
  - It then calculates the mean of that subset and stores the result in the vector.
- When the loop finishes, we have one mean per condition, just like the tapply() output.
Notice that we replaced the NA value with 500 before running tapply(). Without that step, mean() returns NA for any group that contains a missing value. Instead of replacing the value, you can pass na.rm = TRUE through to the function inside tapply(), for example by wrapping it in an anonymous function (function(x) {}). Read more: Handling Missing Values in Group Summaries with tapply()
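A minimal sketch of handling missing values inside tapply(), using short made-up vectors rather than experiment_data. The first call wraps mean() in an anonymous function; the second relies on tapply() passing any extra arguments on to the function it applies:

```r
rt <- c(480, NA, 530, 390)  # made-up values with one NA
condition <- c("control", "control", "incongruent", "incongruent")

# Option 1: wrap mean() in an anonymous function
means_wrapped <- tapply(rt, condition, function(x) mean(x, na.rm = TRUE))

# Option 2: extra arguments after the function are passed on to it
means_direct <- tapply(rt, condition, mean, na.rm = TRUE)

means_direct
# control 480, incongruent 460
```

Both options drop the missing value within its group instead of replacing it with a fixed number.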
Summary
| Function | Purpose | Typical Input | Example |
|---|---|---|---|
| rowMeans() | Mean across columns (per subject) | Numeric data frame or matrix | rowMeans(rt_data) |
| colMeans() | Mean across rows (per block) | Numeric data frame or matrix | colMeans(rt_data) |
| apply() | Apply a custom function to rows or columns | Matrix or data frame | apply(rt_data, 2, sd) |
| tapply() | Apply a function by group | Vector + grouping variable | tapply(rt, condition, mean) |
7. Summary
- Functions let you bundle operations and reuse them easily.
- Conditionals let your code make decisions with if, else if, and else.
- Loops let your code repeat actions for each element of a sequence.
- Combining loops and conditionals lets you apply logic row by row in a data frame.
- Use vectorized alternatives like ifelse(), rowMeans(), apply(), and tapply() to make your code faster, simpler, and more readable.
  - As you get comfortable, you’ll start to see where loops make sense for clarity, and where vectorization makes sense for speed.
- Advanced structures like while, repeat, next, and break exist but are less common in data analysis.
- When writing R code, prioritize clarity first, then efficiency through vectorization once the logic is clear and correct.
The goal is not just to write working code, but to write code that’s readable, efficient, and expressive of your intent.