Functions and Loops in R
Creating New Functions
In R, you create a new function by using function() followed by the code needed to perform the task. The syntax for creating a function is:
function_name <- function(arg1, arg2, ...) {
# Code to perform the function's task
# Return a result using return() or the final evaluated expression
}
- function_name: The name of your function, which will be used to call it later.
- function: Uses function() to define the code that follows as a function, which is stored under the name function_name.
- arg1, arg2, ...: Arguments or parameters the function will take. These are optional if the function does not need inputs.
- Function body: Code inside the { } braces that performs the desired task.
- Return value: The last evaluated expression is returned automatically, or you can use return(value) to specify what the function should return explicitly.
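As a small illustration of the two ways of returning a value described above, the sketch below defines two functions. The names addNumbers and squareNumber are hypothetical and chosen only for this example.

```r
# Implicit return: the last evaluated expression becomes the result
addNumbers <- function(a, b) {
  a + b
}

# Explicit return: return() states the result directly
squareNumber <- function(x) {
  return(x^2)
}

addNumbers(2, 3)   # returns 5
squareNumber(4)    # returns 16
```

Both styles behave the same here; return() is mostly useful when a function needs to exit early.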
Function without Arguments (inputs)
For example, remember the JavaScript function for generating a random number between 1 and 10:
// JavaScript random number between 1 and 10 generator
function getRandomNumber() {
return Math.floor(Math.random() * 10) + 1;
}
This function works by:
- Generating a random number between 0 (inclusive) and 1 (exclusive) with Math.random().
- Multiplying it by 10 to get a number between 0 (inclusive) and 10 (exclusive).
- Using Math.floor to round down to the nearest integer.
- Adding 1 to shift the range from [0, 9] to [1, 10].
In R, we can use the sample() function to create a similar function.
# R random number between 1 and 10 generator
getRandomNumber <- function() {
sample(1:10, 1)
}
This function works by:
- sample(1:10, 1) selects a single integer randomly from the sequence 1:10.
  - The 1 tells sample to return a single number from that range.
- This avoids the need for rounding or shifting, because the sequence 1:10 already covers the range [1, 10].
Once the function has been defined (i.e., you have run the code that sets up the function once), it will appear in the Functions section of the RStudio Environment Panel. To use this function, call getRandomNumber() within your code as necessary, which will return a random integer between 1 and 10.
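A minimal sketch of defining and then calling the function is shown below. The variable names first_draw and second_draw are hypothetical, and the set.seed() call is an optional addition that makes the random draws repeatable between runs.

```r
# Define the function once
getRandomNumber <- function() {
  sample(1:10, 1)
}

set.seed(123)                    # optional: makes the draws repeatable

# Call the function whenever a new random number is needed
first_draw <- getRandomNumber()
second_draw <- getRandomNumber()

print(first_draw)                # an integer between 1 and 10
print(second_draw)               # another integer between 1 and 10
```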
Function with Arguments (inputs)
You can also define arguments (or parameters) that the function will accept. These arguments are used to pass data into the function for it to perform a specific task. When defining arguments, it's good practice to give them descriptive names that reflect the purpose of the variable, for example min and max.
For example, instead of generating a random number that is always between 1 and 10, we can input a minimum and maximum number to our getRandomNumber() function:
getRandomNumber <- function(min, max) {
sample(min:max, 1)
}
- min and max: These are the arguments to the function, allowing you to specify the range from which the random number will be drawn.
- min:max creates a sequence from min to max.
- sample(min:max, 1): The sample() function takes the sequence from min to max and picks one random number from that range.
  - The 1 tells sample to return a single number from that range.
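One caveat from the help page for sample() is worth knowing: when its first argument is a single number of 1 or more, sample() draws from 1 up to that number rather than returning the number itself. With min and max both equal to 5, sample(5:5, 1) can therefore return any integer from 1 to 5. The sketch below shows one possible way around this by sampling a position instead of a value; the name getRandomNumberSafe is hypothetical and not part of the original example.

```r
# Hypothetical variant that also behaves as expected when min equals max
getRandomNumberSafe <- function(min, max) {
  values <- min:max                        # all candidate values
  values[sample.int(length(values), 1)]    # pick one position, then look up its value
}

getRandomNumberSafe(5, 5)    # always 5, even though the range holds one value
getRandomNumberSafe(1, 10)   # an integer between 1 and 10
```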
You can specify as many arguments as necessary for the function. For instance, if we wanted our getRandomNumber(min, max) function to return more than one random number from the sequence, we could add an additional input, number:
getRandomNumber <- function(min, max, number) {
sample(min:max, number)
}
- number: The number of random numbers we want the function to return. If we input 1 it will return 1 number, 3 will return 3 numbers, etc.
- For example, getRandomNumber(1, 10, 3) will return 3 random numbers between 1 and 10.
  - You can also use the argument names and the = operator to specify the arguments in your function call, for example: getRandomNumber(min = 1, max = 10, number = 3)
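By default, sample() draws without replacement, so the numbers returned are all different and asking for more numbers than the range contains produces an error. If repeated values are acceptable, one option is to pass replace = TRUE. The sketch below assumes that behaviour is wanted; the function name getRandomNumbersWithRepeats is hypothetical.

```r
# Hypothetical variant that allows the same value to be drawn more than once
getRandomNumbersWithRepeats <- function(min, max, number) {
  sample(min:max, number, replace = TRUE)
}

# Fifteen draws from a range of only ten values works because repeats are allowed
draws <- getRandomNumbersWithRepeats(1, 10, 15)
length(draws)   # 15
```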
If a function in R is called without inputs and does not have default values set for its arguments, R will return an error indicating that the required arguments are missing.
getRandomNumber <- function(min, max) {
sample(min:max, 1)
}
getRandomNumber()
# Error in getRandomNumber() : argument "min" is missing, with no default
Functions with Default Arguments
To avoid errors caused by forgetting to supply an argument, or to create a function that usually runs with certain values but can be overridden when needed, you can set a default value for each argument using the syntax:
my_function <- function(arg1 = default_value1, arg2 = default_value2) {
# Function body
}
For example, if we want to be able to specify a min, max, and number of random numbers to return, but we want to default to a single number between 1 and 10, we can define our getRandomNumber(min, max, number) function as:
getRandomNumber <- function(min = 1, max = 10, number = 1) {
sample(min:max, number)
}
By default, this will return a single number between 1 and 10 if we just use getRandomNumber() without inputting arguments, but we can also input the min, max, and number arguments to override all or some of these defaults:
getRandomNumber() # Will return a random number based on the default values of min=1, max=10, number=1
getRandomNumber(2) # Will override the default minimum value to 2 and return a single number (default number) between 2 and 10 (default maximum)
getRandomNumber(max=20) # Will override the default maximum value and return a single number (default number) between 1 (default minimum) and 20
getRandomNumber(18,65,1) # Will override all defaults and return a single number between 18 and 65
getRandomNumber(min=18, max=65, number=1) # Will override all defaults and return a single number between 18 and 65
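Named and positional arguments can also be mixed, which is handy when only a later default needs to change. The sketch below repeats the function definition so it runs on its own; the variable names three_draws and mixed_call are hypothetical.

```r
getRandomNumber <- function(min = 1, max = 10, number = 1) {
  sample(min:max, number)
}

# Keep the default min and max, but ask for three numbers
three_draws <- getRandomNumber(number = 3)

# 5 is matched to min by position, max keeps its default of 10, number is set by name
mixed_call <- getRandomNumber(5, number = 2)
```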
Loops and Conditionals
Loops and conditional statements are used to control the flow of code. Specifically, for loops, if statements, and the ifelse() function are frequently used for iteration and decision-making.
For Loops
A for loop allows you to iterate over a sequence (such as a vector or a range of numbers) and perform an action for each element in that sequence. The basic syntax is:
for (variable in sequence) {
# Code to execute for each iteration
}
- variable: This is a temporary variable that takes the value of each element in the sequence during each iteration.
  - R users often default to i as the name for the individual item or index in the sequence.
- sequence: A vector, list, or range of numbers that the loop will iterate over.
- The code inside the loop is executed once for each element in the sequence.
Example: Printing Numbers 1 to 5
for (i in 1:5) {
print(i) # Prints numbers from 1 to 5
}
- The loop starts with a sequence of 1:5 (1 through 5) and selects each individual item i from that sequence.
- The loop starts with i = 1, then i = 2, and so on until i = 5.
- The loop runs the code inside the braces for each item i, in this case print(i), which prints each i value.
- Each value of i is printed on a new line.
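A loop can also build up a result across iterations instead of only printing. The sketch below adds up the elements of a vector one at a time; the scores vector and the total variable are hypothetical values chosen for illustration, and seq_along() is used to generate the positions 1 through the length of the vector.

```r
scores <- c(4, 8, 15, 16)   # hypothetical example values
total <- 0                  # running total, updated on each iteration

for (i in seq_along(scores)) {
  total <- total + scores[i]   # add the i-th score to the running total
}

print(total)   # 43
```

In practice sum(scores) gives the same result in one step; the loop is shown only to make the iteration visible.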
If we want to add an age column to our mydata data frame and fill it with random numbers between 18 and 65, we can use the combination of a for loop and our getRandomNumber(min, max, number) function. We can also use what we learned about indexing rows dynamically with the nrow() function to calculate the end of our for loop sequence (making it 1:last_row):
mydata$age <- NA # Creates new column called "age" and fills it with NA (blank) values
for (i in 1:nrow(mydata)) {
mydata[i, ]$age <- getRandomNumber(min = 18, max = 65, number = 1)
}
Where:
- for (i in 1:nrow(mydata)): sets up a for loop that iterates over each row of mydata.
  - 1:nrow(mydata) creates a sequence from 1 to the total number of rows in mydata.
  - i is the loop index variable, representing the row number in each iteration.
  - Everything within the { } braces is the code run on each iteration.
- mydata[i, ] selects the i-th row of mydata.
- Using mydata[i, ] lets you access all columns in the i-th row of mydata, so to specify only the age column, we use $age.
  - Now we are only accessing and modifying a single cell: row i, column age.
- getRandomNumber(min = 18, max = 65, number = 1) generates a single number between 18 and 65.
- The assignment operator <- assigns this random number to the single cell we are accessing.
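Two variations on this pattern may be useful. First, 1:nrow(mydata) counts down from 1 to 0 when a data frame has no rows, whereas seq_len(nrow(mydata)) gives an empty sequence, so the loop body is simply skipped. Second, because sample() can return many values at once, the whole column can be filled without a loop. The sketch below uses a small hypothetical stand-in for mydata and a hypothetical column name age_vectorized.

```r
# Hypothetical stand-in for the mydata data frame used in the text
mydata <- data.frame(id = 1:5)

# Loop version, using seq_len() so that a data frame with zero rows is handled safely
mydata$age <- NA
for (i in seq_len(nrow(mydata))) {
  mydata$age[i] <- sample(18:65, 1)
}

# Vectorized version: draw one value per row in a single call
# replace = TRUE lets the same age appear in more than one row
mydata$age_vectorized <- sample(18:65, nrow(mydata), replace = TRUE)
```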
If Statements
if statements allow us to execute certain code only if a particular condition is TRUE. The basic syntax of an if statement is:
if (condition) {
# Code to execute if condition is TRUE
}
- condition: A logical expression that evaluates to TRUE or FALSE.
- If condition is TRUE, R runs the code inside the braces { }.
- If condition is FALSE, R skips the code inside the braces { }.
For example, if we wanted to print "You are an adult." but only if age >= 18, we could use the code:
age <- 21 # assign age a value of 21
if (age >= 18) {
print("You are an adult.")
}
# Output is "You are an adult." because the condition is TRUE
age <- 17 # assign age a value of 17
if (age >= 18) {
print("You are an adult.")
}
# Nothing happens because the condition is FALSE
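The condition in an if statement is expected to be a single TRUE or FALSE value; recent versions of R treat a condition with more than one element as an error. When the comparison involves a whole vector, one option is to collapse it first with all() or any(). A minimal sketch, in which the ages vector and the other variable names are hypothetical:

```r
ages <- c(21, 17, 30)                  # hypothetical example values

everyone_adult <- all(ages >= 18)      # FALSE, because 17 is below 18
someone_minor <- any(ages < 18)        # TRUE, because 17 is below 18

if (everyone_adult) {
  print("Everyone is an adult.")       # skipped
}
if (someone_minor) {
  print("At least one person is under 18.")   # printed
}
```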
Adding an else clause
Adding an else clause allows us to execute a separate block of code if the initial condition is FALSE. The basic syntax for an if else combination is:
if (condition) {
# Code if condition is TRUE
} else {
# Code if condition is FALSE
}
For example, if we wanted to print "You are an adult." for ages >= 18, and "You are not an adult." for ages < 18, we could use the code:
if (age >= 18) {
print("You are an adult.")
} else {
print("You are not an adult.")
}
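In R, an if else statement also produces a value, so its result can be assigned directly to a variable rather than printed inside each branch. A small sketch of that form, where the variable name status is hypothetical:

```r
age <- 17

# The value of whichever branch runs is stored in status
status <- if (age >= 18) "an adult" else "not an adult"

print(paste0("You are ", status, "."))   # prints "You are not an adult."
```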
Adding an else if clause
If we want to have another conditional, we can use an else if statement. The basic syntax for this type of statement is:
if (condition1) {
# Code if condition1 is TRUE
} else if (condition2) {
# Code if condition2 is TRUE
} else {
# Code if neither condition1 nor condition2 is TRUE
}
So for our age example, we could add two additional else if statements to make a more nuanced age categorization:
if (age >= 65) {
print("You are a senior.")
} else if (age >= 18) {
print("You are an adult.")
} else if (age >= 13) {
print("You are a teen.")
} else {
print("You are a child.")
}
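The conditions are checked from top to bottom and only the first one that is TRUE runs, which is why the largest threshold comes first. To reuse this logic, it can be wrapped in a function, combining the two topics of this section. The sketch below is one possible version; the name categorizeAge is hypothetical, and it returns the category label instead of printing a sentence.

```r
# Hypothetical helper that returns the age category for a single age
categorizeAge <- function(age) {
  if (age >= 65) {
    "senior"
  } else if (age >= 18) {
    "adult"
  } else if (age >= 13) {
    "teen"
  } else {
    "child"
  }
}

categorizeAge(70)   # "senior"
categorizeAge(21)   # "adult"
categorizeAge(15)   # "teen"
categorizeAge(8)    # "child"
```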
Nesting for loops and if else statements
You can build more complex code by nesting for loops and if or if else statements. Nesting for loops by placing one for loop inside another is useful for performing operations that involve two or more levels of iteration. For example, you might use nested loops to iterate over rows and columns in a data frame, or for operations that require comparisons between elements. The basic syntax is:
for (outer_variable in outer_sequence) {
for (inner_variable in inner_sequence) {
# Code to execute in the inner loop
}
# Code to execute after the inner loop completes for each outer iteration
}
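As a concrete instance of the nested syntax above, the sketch below fills a small multiplication table one cell at a time: the outer loop walks over rows and the inner loop walks over columns. The names n_rows, n_cols, and times_table are hypothetical.

```r
n_rows <- 3
n_cols <- 4

# Start with an empty matrix and fill it in cell by cell
times_table <- matrix(NA, nrow = n_rows, ncol = n_cols)

for (i in seq_len(n_rows)) {         # outer loop: one pass per row
  for (j in seq_len(n_cols)) {       # inner loop: one pass per column in that row
    times_table[i, j] <- i * j
  }
}

print(times_table)
```

The inner loop runs in full for every step of the outer loop, so the body executes 3 x 4 = 12 times in total.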
You can also place if or if else statements within a for loop. For example, if we wanted to compare the scores for positiveEmotion and negativeEmotion and categorize each participant by their primaryEmotion, we could use the code:
# Initialize the new 'primaryEmotion' column
mydata$primaryEmotion <- NA
# Loop over each row to apply the logic and fill the 'primaryEmotion' column
for (i in 1:nrow(mydata)) { # Loop over each row
  # Get the positive and negative emotion values for the current row
  pos_emotion <- mydata$positiveEmotion[i]
  neg_emotion <- mydata$negativeEmotion[i]
  # Ensure that we handle NA values gracefully
  if (is.na(pos_emotion) || is.na(neg_emotion)) {
    mydata$primaryEmotion[i] <- "unknown" # If either is NA, set to "unknown"
  } else if (pos_emotion > neg_emotion) {
    mydata$primaryEmotion[i] <- "positive" # Positive emotion is greater
  } else if (neg_emotion > pos_emotion) {
    mydata$primaryEmotion[i] <- "negative" # Negative emotion is greater
  } else {
    mydata$primaryEmotion[i] <- "neutral" # Both emotions are equal
  }
}
Explanation:
- For Loop: We loop over each row of the data frame to examine the values in positiveEmotion and negativeEmotion.
- If/Else Logic:
  - If either positiveEmotion or negativeEmotion is NA, we assign "unknown" to primaryEmotion.
  - If positiveEmotion is greater than negativeEmotion, we assign "positive" to primaryEmotion.
  - If negativeEmotion is greater than positiveEmotion, we assign "negative" to primaryEmotion.
  - If both emotions are equal, we assign "neutral" to primaryEmotion.
While nested for loops are useful and often intuitive, they can be slow with large data sets in R. Vectorized functions, apply() family functions, or tidyverse methods (e.g., map() from purrr) are often better choices for performance when you’re performing straightforward operations across data structures.
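As one possible vectorized alternative, the primaryEmotion categorization can be written with nested calls to the ifelse() function mentioned at the start of this section, which evaluates a condition for every row at once. This is a sketch under stated assumptions: the small mydata data frame below is a hypothetical stand-in with made-up values, and pos and neg are hypothetical helper variables.

```r
# Hypothetical stand-in data with the two emotion score columns
mydata <- data.frame(
  positiveEmotion = c(5, 2, 3, NA),
  negativeEmotion = c(1, 4, 3, 2)
)

pos <- mydata$positiveEmotion
neg <- mydata$negativeEmotion

# Each ifelse() handles one condition for all rows; the last value is the fallback
mydata$primaryEmotion <- ifelse(
  is.na(pos) | is.na(neg), "unknown",
  ifelse(pos > neg, "positive",
         ifelse(neg > pos, "negative", "neutral"))
)

print(mydata$primaryEmotion)   # "positive" "negative" "neutral" "unknown"
```

Note the single | here: it compares element by element across the whole column, whereas the || used inside the loop works on one value at a time.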