PSY 1903 Programming for Psychologists

Indexing and Subsetting in R

In this section, we’ll focus on how to access and manipulate specific parts of your data frames—the most common structure you’ll work with in R.
These skills, known collectively as indexing and subsetting, allow you to extract, filter, and modify portions of your data efficiently.

Before we get there, we’ll briefly review how indexing works with simpler objects like vectors, lists, and matrices, since those rules form the foundation for how data frame indexing works.


1. Indexing Basics

Indexing means selecting elements from a data object using their position, names, or logical conditions.
R uses square brackets [] for indexing.

Example with a Vector

fruits <- c("apple", "banana", "cherry", "date")
fruits[1]       # first element
fruits[2:4]     # elements 2 through 4
fruits[-1]      # all but the first element

R is 1-indexed, meaning counting starts at 1 (not 0, as in some other languages like JavaScript).

JavaScript Comparison

let fruits = ["apple", "banana", "cherry", "date"];
console.log(fruits[0]); // first element in JS
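
As a small sketch of what 1-based indexing means in practice (the fruits vector is repeated so the block runs by itself, and the variable names are only illustrative): index 0 selects nothing rather than the first element, and an index past the end gives NA rather than an error.

```r
fruits <- c("apple", "banana", "cherry", "date")

first  <- fruits[1]                # "apple": counting starts at 1
last   <- fruits[length(fruits)]   # "date": length() gives the last position
empty  <- fruits[0]                # character(0): index 0 selects nothing
beyond <- fruits[5]                # NA: there is no fifth element, but no error either
```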

2. Logical Indexing

You can subset data using logical values (TRUE or FALSE).
Only elements corresponding to TRUE will be selected.

nums <- c(5, 10, 15, 20)
nums[c(TRUE, FALSE, TRUE, FALSE)]  # selects 5 and 15
nums[nums > 10]                    # selects elements greater than 10

Important: Logical indexing is powerful for filtering because you select rows or elements that meet a condition without having to look up their positions. You write a logical test that returns TRUE or FALSE for each element, and R keeps only the elements where the result is TRUE. This makes it a concise way to subset data frames, especially with large datasets or complex criteria.

Logical indexing is also robust to structural changes. If your code adds or deletes rows, or adds or reorders columns, a condition such as nums > 10 still selects the right elements, because it is evaluated against the current values rather than fixed positions. Numeric indices, in contrast, have to be rechecked every time the data frame changes.
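
To make the mechanics concrete, here is a minimal sketch that stores the result of the condition before using it. The names is_big, where, and nums2 are hypothetical and exist only for this illustration.

```r
nums <- c(5, 10, 15, 20)

is_big <- nums > 10       # logical vector: FALSE FALSE TRUE TRUE
big    <- nums[is_big]    # 15 20
n_big  <- sum(is_big)     # 2, because TRUE counts as 1 and FALSE as 0
where  <- which(is_big)   # 3 4, the positions of the TRUE values

# Adding an element at the front shifts every position,
# but the same condition still picks out the right values
nums2 <- c(100, nums)
big2  <- nums2[nums2 > 10]   # 100 15 20
```

Storing the logical vector first is optional, but it can make long conditions easier to read and check.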


3. Indexing by Name

If a vector or list has names, you can access elements by those names.

scores <- c(math = 90, english = 85, science = 92)
scores["math"]
scores[c("math", "science")]

The same vector can be indexed by position or by name, whichever is more convenient:

scores[1]
scores["english"]
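
Names can also be inspected and used for assignment. The following is a short sketch under the same scores vector; subject_names, top, and unnamed are illustrative names only.

```r
scores <- c(math = 90, english = 85, science = 92)

subject_names <- names(scores)                 # "math" "english" "science"
scores["english"] <- 88                        # update a value by name
top <- names(scores)[scores == max(scores)]    # "science": name of the highest score
unnamed <- unname(scores["math"])              # 90, with the name label removed
```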

4. Subsetting Lists

Lists can contain elements of different types — numbers, strings, vectors, or even other lists.

student <- list(
  name = "Alex",
  age = 20,
  scores = c(88, 92, 95)
)

Access elements with $ or double brackets [[]]:

student$name
student[["age"]]
student$scores[2]

JavaScript Comparison

let student = { name: "Alex", age: 20, scores: [88, 92, 95] };
console.log(student.name);
console.log(student.scores[1]);
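
One detail worth seeing side by side is the difference between single and double brackets on a list. This is a minimal sketch using the same student list; as_list, as_value, and next_year are hypothetical names.

```r
student <- list(name = "Alex", age = 20, scores = c(88, 92, 95))

as_list  <- student["age"]     # single brackets: a list of length 1
as_value <- student[["age"]]   # double brackets: the number 20 itself

class(as_list)    # "list"
class(as_value)   # "numeric"

# Arithmetic needs the value, not the one-element list
next_year <- student[["age"]] + 1   # 21
```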

5. Indexing Matrices

Matrices are two-dimensional, so you index them with [row, column].

m <- matrix(1:9, nrow = 3, byrow = TRUE)
m
m[1, 2]     # row 1, column 2
m[ , 3]     # all rows, column 3
m[2, ]      # entire second row

You can also use negative indices to exclude specific rows or columns:

m[-1, ]     # exclude the first row
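
For reference, here is a sketch of what these expressions return for the 3 x 3 matrix above, plus two variations (drop = FALSE and a logical condition). The variable names are illustrative only.

```r
m <- matrix(1:9, nrow = 3, byrow = TRUE)   # rows are 1 2 3 / 4 5 6 / 7 8 9

cell     <- m[1, 2]               # 2
col3     <- m[, 3]                # 3 6 9, returned as a plain vector
row2_mat <- m[2, , drop = FALSE]  # 4 5 6, kept as a 1 x 3 matrix
above5   <- m[m > 5]              # 7 8 6 9 (values are read column by column)
```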

6. Indexing Data Frames (Most Important!)

Data frames are the central data structure in R for working with tabular data—that is, data organized in rows and columns, much like a spreadsheet or a table in a database.

Every data frame has:

  • Rows representing individual observations or cases (e.g., participants, trials, or measurements).
  • Columns representing variables (e.g., reaction time, condition, group).

You can access or modify data frames using the format:

data[row, column]

Both row and column can be specified using numbers, names, or logical conditions.
If you leave one part blank, R assumes “all” (e.g., data[ , 2] means all rows in column 2).

You can extract data using $column_name for single columns, or [row, column] when you want to select multiple rows or columns at once.

[Figure: R indexing diagram]

Let’s explore three essential ways to index data frames.


a. Index by Position

Position-based indexing is useful when you want to refer to specific rows or columns by number.

df <- data.frame(
  id = 1:4,
  name = c("Alice", "Bob", "Carmen", "Diego"),
  score = c(88, 92, 95, 90)
)

df[1, ]       # selects the first row (all columns)
df[, 2]       # selects the second column (all rows)
df[1:2, c(1, 3)]  # selects rows 1–2 and columns 1 and 3

You can also use negative indices to exclude rows or columns:

df[-1, ]   # all rows except the first
df[, -2]   # all columns except the second

When to use:
Position-based indexing is fast for quick checks, but it can break if your data changes (e.g., columns added or removed). For most analyses, named or logical indexing is safer and clearer.
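
The risk can be sketched with a hypothetical change to the data: suppose a new group column is placed in front of the existing ones. The group column and the df_new name are assumptions made only for this example.

```r
df <- data.frame(
  id = 1:4,
  name = c("Alice", "Bob", "Carmen", "Diego"),
  score = c(88, 92, 95, 90)
)

# Hypothetical change: a new 'group' column is added as the first column
df_new <- cbind(group = c("a", "b", "a", "b"), df)

by_position_before <- df[, 3]             # the scores
by_position_after  <- df_new[, 3]         # now the names, not the scores
by_name_after      <- df_new[, "score"]   # still the scores
```

The position-based line runs without any error after the change, which is exactly why this kind of mistake is easy to miss.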


b. Index by Name

Accessing data by column name is the most readable and reliable approach.

You can use either $ or the bracket notation data[ , "column_name"].

df$name       # returns the 'name' column as a vector
df$score      # returns the 'score' column as a vector
df[, "score"] # identical result

You can combine named and numeric indexing:

df[1:2, c("id", "score")] 

Returning Data Frames vs. Vectors:
When you select a single column with [row, column], R simplifies the result to a vector by default. To keep the result as a data frame, add drop = FALSE:

df[, "score", drop = FALSE]
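
A quick way to see the difference is to check what kind of object each form returns. This sketch uses a reduced version of df; v, d1, and d2 are illustrative names.

```r
df <- data.frame(id = 1:4, score = c(88, 92, 95, 90))

v  <- df[, "score"]                 # numeric vector
d1 <- df[, "score", drop = FALSE]   # data frame with one column
d2 <- df["score"]                   # single-bracket form without a comma: also a data frame

is.data.frame(v)    # FALSE
is.data.frame(d1)   # TRUE
is.data.frame(d2)   # TRUE
```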

Accessing Columns Programmatically:
The $ operator only works when you type the column name directly.
If you need to access a column by name stored in another variable, use double brackets [[]]:

col_to_access <- "score"
df[[col_to_access]]   # same as df$score
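
As a sketch of why this matters, the block below shows what $ does with a variable and then loops over several column names. The numeric_cols vector is a hypothetical helper introduced for the example.

```r
df <- data.frame(
  id = 1:4,
  name = c("Alice", "Bob", "Carmen", "Diego"),
  score = c(88, 92, 95, 90)
)

col_to_access <- "score"
df$col_to_access      # NULL: $ looks for a column literally named "col_to_access"
df[[col_to_access]]   # 88 92 95 90

# Loop over column names stored in a vector and print each column's mean
numeric_cols <- c("id", "score")
for (col in numeric_cols) {
  cat(col, "mean:", mean(df[[col]]), "\n")
}
```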

Why this matters:
Indexing by name keeps your code clear and readable, so you and others can see what each line does without guessing which column number it refers to. Similarly to logical indexing, it also prevents errors when your data frame changes structure, since named references stay consistent even when columns are added, removed, or rearranged.


c. Logical Indexing (Filtering Rows by Condition)

Logical indexing lets you filter your data frame based on one or more conditions that evaluate to TRUE or FALSE.

df[df$score > 90, ]      # rows where score is greater than 90
df[df$name == "Alice", ] # rows where name is Alice

You can also combine conditions using & (and) and | (or):

df[df$score > 90 & df$id < 4, ]

In JavaScript, you’d use similar logic with && (and) and || (or):

let df = [
  { id: 1, name: "Alice", score: 88 },
  { id: 2, name: "Bob", score: 92 },
  { id: 3, name: "Carmen", score: 95 },
  { id: 4, name: "Diego", score: 90 }
];

let subset = df.filter(row => row.score > 90 && row.id < 4);
console.log(subset);

Key differences between R and JavaScript:

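  • In R, you use a single & or | to combine conditions when filtering. R also has && and ||, but those are meant for single TRUE/FALSE values (for example, inside an if statement), not for whole columns.
  • In JavaScript, you use the double operators && and || for logical “and” / “or”.
  • In R, & and | work element-wise: they compare every element of two logical vectors at once, giving one TRUE or FALSE per row.
  • In JavaScript, && and || are used inside a function (like .filter()), which runs once per element of the array.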

Both achieve the same goal—filtering data based on multiple logical conditions—but the logic is applied differently: R evaluates across an entire column vector, while JavaScript evaluates one record at a time.
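
The element-wise behavior in R can be sketched by building each condition separately before combining them. The names high, early, both, and either are illustrative; %in% is a base R operator that offers a compact alternative to chaining several == tests with |.

```r
df <- data.frame(
  id = 1:4,
  name = c("Alice", "Bob", "Carmen", "Diego"),
  score = c(88, 92, 95, 90)
)

high  <- df$score > 90    # FALSE  TRUE  TRUE FALSE
early <- df$id < 4        #  TRUE  TRUE  TRUE FALSE

both   <- high & early    # FALSE  TRUE  TRUE FALSE
either <- high | early    #  TRUE  TRUE  TRUE FALSE

both_names <- df[both, "name"]                              # "Bob" "Carmen"
picked     <- df[df$name %in% c("Alice", "Diego"), "score"] # 88 90
```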

Important Note:
Logical indexing and indexing by name are the two methods you’ll rely on most often in R data analysis.
They let you extract specific participants, trials, or conditions that meet defined criteria, and because they depend on conditions and column names rather than fixed positions, they keep working when your data frame changes (for example, if rows are added, removed, or reordered).


7. Manipulating Data Frames: Adding and Removing Columns or Rows

Data frames are flexible—you can easily add or remove columns (variables) and rows (cases, often participant data) as needed.
In psychological research, however, it’s best practice to add new columns (e.g., computed variables or coded conditions) or create subsets of the data in new data frames, rather than altering or overwriting the original.
This approach supports transparent, reproducible analysis by preserving the raw data exactly as it was collected.

Adding a Column

You can assign a new vector directly using $:

df$passed <- df$score >= 90
df

This adds a new logical column (TRUE/FALSE) showing which students passed.

Removing a Column

To delete a column, assign NULL:

df$passed <- NULL
df

Adding a Row

Use rbind() (“row bind”) to add new rows that match the column names.

new_row <- data.frame(id = 5, name = "Eva", score = 93)
df <- rbind(df, new_row)

Removing Rows

Use negative indexing:

df <- df[-1, ]  # removes the first row

Best Practice:
When adding rows or columns, make sure data types match (e.g., numeric column with numeric values). Avoid modifying or overwriting your original data frame—always preserve the raw dataset for reproducibility and transparency.
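
One way to follow this practice, sketched here under the assumption that the collected data live in an object called raw (a hypothetical name), is to copy the data frame first and make every change on the copy.

```r
raw <- data.frame(
  id = 1:4,
  name = c("Alice", "Bob", "Carmen", "Diego"),
  score = c(88, 92, 95, 90)
)

# Work on a copy so 'raw' stays exactly as collected
analysis <- raw
analysis$passed <- analysis$score >= 90

# A new row must supply a value of the matching type for every column
new_row <- data.frame(id = 5, name = "Eva", score = 93, passed = TRUE)
analysis <- rbind(analysis, new_row)

dim(raw)        # 4 3: the original is unchanged
dim(analysis)   # 5 4
```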


8. Advanced Examples

Once you understand the basics, you can combine techniques for more powerful subsetting.

Example 1: Multiple Conditions

df[df$score > 90 & df$name != "Bob", c("name", "score")]

→ Returns all rows where score > 90 and name is not Bob, keeping only the name and score columns.

Example 2: Selecting Columns Programmatically

Instead of typing column names manually, you can store them in a vector:

columns_to_keep <- c("id", "score")
df[, columns_to_keep]

Example 3: Using Functions on Subsets

You can run functions directly on filtered data frames.

mean(df[df$score > 90, "score"])  # mean of the scores above 90
sum(df$score > 90)                # number of rows with a score above 90 (TRUE counts as 1)

Example 4: Handling Missing Data in Logical Indexing

data[data$rt > 400 & !is.na(data$rt), ]

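→ Here data is the reaction-time data frame created in Section 9 below, where rt contains a missing value. A comparison with NA returns NA rather than TRUE or FALSE, and a row indexed with NA comes back as a row filled with NA. Use !is.na() to explicitly remove missing values when filtering.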
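
The effect of a missing value is easiest to see on a tiny example. The trials data frame below is a hypothetical four-row data set made up for this sketch, not part of the course data.

```r
trials <- data.frame(id = 1:4, rt = c(470, NA, 360, 600))   # hypothetical mini data set

cond <- trials$rt > 400                  # TRUE NA FALSE TRUE

with_na_row <- trials[cond, ]            # 3 rows: ids 1 and 4 plus one row filled with NA
clean       <- trials[cond & !is.na(trials$rt), ]   # 2 rows: ids 1 and 4
also_clean  <- trials[which(cond), ]     # which() ignores NA, so the same 2 rows
```

Both clean forms give the same rows here; !is.na() states the intent explicitly, while which() is shorter.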

Example 5: Combining Row and Column Logic

df[df$score > 90, "name"]

→ Returns just the name values of students with scores above 90.

Example 6: The subset() Function

For quick filtering, R also provides the subset() function:

subset(df, score > 90 & id < 4, select = c(name, score))

We’ll mostly use bracket indexing for reproducibility and explicit control, but subset() can be convenient for quick exploration.
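
As a brief comparison, subset() treats a missing condition as FALSE, so it drops those rows without an explicit !is.na(). The sketch below uses a hypothetical four-row trials data frame to show that both forms select the same rows.

```r
trials <- data.frame(id = 1:4, rt = c(470, NA, 360, 600))   # hypothetical mini data set

via_subset   <- subset(trials, rt > 400)                          # NA condition counts as FALSE
via_brackets <- trials[!is.na(trials$rt) & trials$rt > 400, ]     # explicit NA handling

via_subset$id     # 1 4
via_brackets$id   # 1 4
```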


9. Practice Example: Reaction Time Data

Let’s apply these indexing skills to a realistic data frame from an experiment measuring reaction time (RT) under different conditions.

data <- data.frame(
  subject_id = 1:20,
  rt = c(470, 360, 665, 400, 445, 270, 500, 565, 350, 445,
         275, NA, 600, 290, 560, 375, 450, 480, 325, 430),
  congruent = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
                FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE,
                TRUE, FALSE, TRUE, FALSE, TRUE, FALSE),
  color = c("red", "blue", "blue", "green", "red", "red", "blue",
            "green", "blue", "green", "red", "blue", "green", "blue",
            "green", "red", "blue", "blue", "green", "red")
)

Subsetting by Condition

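congruent_trials <- data[data$congruent == TRUE, ]        # congruent trials only
fast_trials <- data[!is.na(data$rt) & data$rt < 500, ]    # rt under 500; !is.na() avoids an all-NA row
fast_congruent <- data[data$congruent == TRUE & !is.na(data$rt) & data$rt < 500, ]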

Running Functions on Subsets

mean(data[data$congruent == TRUE, "rt"], na.rm = TRUE)
mean(data[data$congruent == FALSE, "rt"], na.rm = TRUE)
sum(is.na(data$rt))
mean(data[data$color == "blue", "rt"], na.rm = TRUE)
mean(data[data$color == "red", "rt"], na.rm = TRUE)
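
When the same function is applied to every level of a grouping variable, base R's tapply() can compute all group values in one call. This is a sketch on a hypothetical five-row rt_data data frame (made up for illustration), alongside the bracket-based version for comparison.

```r
rt_data <- data.frame(
  rt = c(400, 500, NA, 600, 300),
  congruent = c(TRUE, FALSE, TRUE, FALSE, TRUE)
)

# One subset at a time, as in the examples above
mean_congruent <- mean(rt_data[rt_data$congruent == TRUE, "rt"], na.rm = TRUE)   # 350

# All groups at once: the result is named by group ("FALSE" and "TRUE")
group_means <- tapply(rt_data$rt, rt_data$congruent, mean, na.rm = TRUE)
group_means["TRUE"]    # 350
group_means["FALSE"]   # 550
```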

Subsetting Columns

subset_cols <- data[, c("subject_id", "rt", "congruent")]

Example output (first five rows, as shown by head(subset_cols, 5)):

  subject_id  rt congruent
1          1 470      TRUE
2          2 360      TRUE
3          3 665     FALSE
4          4 400      TRUE
5          5 445     FALSE

10. Summary

  • Index data frames using [row, column] or $column_name.
  • Logical indexing (data[data$variable > value, ]) is the most flexible and reliable way to filter data.
  • Use rbind() and $ to add data, and negative indices or NULL to remove it.
  • Combine multiple conditions (&, |) for advanced filters.
  • Functions like mean(), sum(), and nrow() can operate directly on subsets.
  • Use drop = FALSE to preserve data frame structure when selecting single columns.
  • Use !is.na() to handle missing data when filtering.
  • Indexing is the foundation of data wrangling in R — once you master it, everything else builds naturally from here.