Introduction to Data Frames in R

What is a data frame?

A data frame is a two-dimensional structure (like a table) that is commonly used to store and manage datasets in R.
- Data frames are two-dimensional because they have both rows and columns, unlike a vector (one-dimensional, used to store a sequence of elements of the same type) or a list (also one-dimensional, but less structured: its elements can be of different data types and can themselves be vectors or other lists).
Data frames are particularly well-suited for data analysis because they allow you to organize data in a structured way, making it easy to view, manipulate, and analyze. Common manipulations are summarized in Table 1.
Data frames are also flexible enough to handle mixed data types in one structure, meaning each column can hold a different data type (e.g., numerical, character, or logical), accommodating the variety found in real datasets.
Because data frames reflect how data is typically collected in the real world, they are intuitive and can be easily inspected, subsetted, summarized, and transformed.
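
As a minimal sketch of the mixed-type point above, the following builds a small data frame with data.frame() and checks the type of each column. The name `demo` and its values are made up for illustration and are not part of this tutorial's dataset.

```r
# Hypothetical example data (made up for illustration)
demo <- data.frame(
  participant = c("P01", "P02", "P03"),  # character column
  age         = c(24, 31, 28),           # numeric column
  correct     = c(TRUE, FALSE, TRUE),    # logical column
  stringsAsFactors = FALSE               # keep text as character, not factor
)

# Each column keeps its own data type
col_types <- sapply(demo, class)
col_types

# Number of rows and number of columns
dim(demo)
```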

Table 1: Summary of Common Data Frame Manipulations

Task	Description	Command or Example
Subsetting Data	Selecting rows, columns, or both based on conditions.	`data$column` (select column), `data[ , 1]` (select column by index), `data[1:5, ]` (select rows by index)
Conditional Filtering	Filtering data based on logical conditions (e.g., greater than, equal to).	`data[data$column > 10, ]` (filter rows based on condition)
Removing NA Values	Excluding missing data to ensure analyses are not biased by missing values.	`data <- na.omit(data)` (remove rows with NA values), `data[!is.na(data$column), ]` (remove NA from specific column)
Adding or Modifying Columns	Creating new columns or modifying existing ones, e.g., recoding or calculations.	`data$new_column <- data$column1 + data$column2` (add new column), `data$column <- scale(data$column)` (modify a column)
Summarizing and Aggregating Data	Calculating summary statistics, e.g., mean, median, or count.	`summary(data)` (summary statistics), `aggregate(column ~ group, data = data, FUN = mean)` (aggregate by group)
Reshaping Data	Converting data from long to wide format or vice versa.	`pivot_wider(data, names_from = column, values_from = value_column)` (long to wide), `pivot_longer(data, cols = starts_with("prefix"))` (wide to long); both functions come from the tidyr package, which must be loaded first
Merging Data	Combining multiple data frames by a common key (e.g., participant ID).	`merged_data <- merge(data1, data2, by = "ID")` (merge by a common column)
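
To show how a few of the base R commands in Table 1 fit together, here is a short sketch on two tiny data frames. The names `scores`, `ages`, and `rt_sec`, and all values, are hypothetical and chosen only for this example.

```r
# Hypothetical example data (made up for illustration)
scores <- data.frame(
  ID    = c(1, 2, 3, 4),
  group = c("a", "a", "b", "b"),
  rt    = c(400, NA, 520, 480),
  stringsAsFactors = FALSE
)
ages <- data.frame(ID = c(1, 2, 3, 4), age = c(21, 35, 29, 40))

# Removing NA values: drops the row for ID 2, whose rt is missing
complete <- na.omit(scores)

# Adding a column: reaction time converted from milliseconds to seconds
complete$rt_sec <- complete$rt / 1000

# Aggregating: mean rt per group
group_means <- aggregate(rt ~ group, data = complete, FUN = mean)

# Merging: add each participant's age by the shared ID column
merged <- merge(complete, ages, by = "ID")
merged
```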

Structure of a data frame

A data frame is a type of object (variable) that can be loaded into the R environment and used for data analysis. Data frames are identified by their variable name, e.g., mydata, P01_dataset, or congruent_subset. Think of a data frame as a spreadsheet or a table with rows and columns:

Rows typically represent individual observations (such as participants in a study or trials in a task).
Columns represent variables (characteristics or measurements, like age, group assignment, reaction time, or their score on a questionnaire).
- Within a column, all data must be of the same type, but each different column can contain a different data type (summarized in Table 3).

Data frames can be thought of as a range of columns (1 through end) and rows (1 through end) with each cell having the position [row #, column #]. Thus, each cell can be accessed individually by its specific [row_index, column_index]. You can also make row_index or column_index a range of numbers to access a range of cells (more below).

Table 2: Structure of a Data Frame

	column1	column2	column3	column4	column5
row1	[1, 1]	[1, 2]	[1, 3]	[1, 4]	[1, 5]
row2	[2, 1]	[2, 2]	[2, 3]	[2, 4]	[2, 5]
row3	[3, 1]	[3, 2]	[3, 3]	[3, 4]	[3, 5]
row4	[4, 1]	[4, 2]	[4, 3]	[4, 4]	[4, 5]
row5	[5, 1]	[5, 2]	[5, 3]	[5, 4]	[5, 5]
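
The [row, column] layout in Table 2 can be tried out with a small sketch. Here the hypothetical data frame `grid` stores, in each cell, a number whose first digit is the row and whose second digit is the column, so the value returned shows which cell was accessed.

```r
# Hypothetical data: each value encodes its own position (row digit, column digit)
grid <- data.frame(
  column1 = c(11, 21, 31, 41, 51),
  column2 = c(12, 22, 32, 42, 52),
  column3 = c(13, 23, 33, 43, 53)
)

grid[2, 3]   # row 2, column 3 -> 23
grid[4, 1]   # row 4, column 1 -> 41

# A range of rows and a range of columns returns a smaller data frame
block <- grid[1:2, 2:3]
dim(block)   # 2 rows, 2 columns
```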

Table 3: Summary of Data Types

Data Type	Description	Command to Check Type	Command to Change to Type
Numeric	Real numbers (e.g., decimals, floating-point numbers)	`is.numeric(df$column)`	`as.numeric(df$column)`
Integer	Whole numbers (e.g., `1L`, `2L`, etc.)	`is.integer(df$column)`	`as.integer(df$column)`
Character	Text or string values (e.g., `"Hello"`, `"data"`)	`is.character(df$column)`	`as.character(df$column)`
Factor	Categorical data with levels (e.g., "low", "high")	`is.factor(df$column)`	`as.factor(df$column)`
Ordered Factor	Categorical data with a specified order (e.g., "low", "medium", "high")	`is.ordered(df$column)`	`factor(df$column, levels = c("low", "medium", "high"), ordered = TRUE)`
Logical	Boolean values (`TRUE` or `FALSE`)	`is.logical(df$column)`	`as.logical(df$column)`
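
The check and conversion commands in Table 3 can be sketched as follows. The data frame `df` and its columns `score` and `level` are hypothetical. Note that for an ordered factor the intended order is given explicitly through the levels argument; without it, R orders the levels alphabetically.

```r
# Hypothetical data: numbers stored as text, plus a categorical column
df <- data.frame(
  score = c("10", "20", "30"),
  level = c("low", "high", "medium"),
  stringsAsFactors = FALSE
)

is.numeric(df$score)               # FALSE: the values are text
df$score <- as.numeric(df$score)   # convert text to numbers
is.numeric(df$score)               # TRUE

# Ordered factor with the order low < medium < high stated explicitly
df$level <- factor(df$level, levels = c("low", "medium", "high"), ordered = TRUE)
is.ordered(df$level)               # TRUE

# Ordered factors support comparisons that follow the level order
below_high <- df$level < "high"    # TRUE FALSE TRUE
below_high
```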

Indexing Data Frames

Data frames (like vectors and lists) are indexable, meaning you can access their individual elements (or parts) using a specific reference or index. For data frames, the index is given by rows and columns.

Rows are indexed by their row number
Columns can be indexed by either their column number or their column name
You can also index by a combination of row and column

The rows and columns of a data frame can be specified following the name of the data frame using square brackets [] and separated by a comma. Rows are specified before the comma, and columns are specified after the comma:

dataframe[row_index, column_index]

row_index specifies which rows to access (all rows or a subset)
column_index specifies which columns to access (all columns or a subset)

Important Functions and Operators Needed to Index

Comparison Operators

The greater than operator > checks if the left-hand side is greater than the right-hand side and returns TRUE or FALSE

5 > 3  # TRUE
2 > 3  # FALSE
5 > 5  # FALSE

The less than operator < checks if the left-hand side is less than the right-hand side and returns TRUE or FALSE

5 < 3  # FALSE
2 < 3  # TRUE
5 < 5  # FALSE

The greater than or equal to operator >= checks if the left-hand side is greater than or equal to the right-hand side and returns TRUE or FALSE

5 >= 3  # TRUE
2 >= 3  # FALSE
5 >= 5  # TRUE

The less than or equal to operator <= checks if the left-hand side is less than or equal to the right-hand side and returns TRUE or FALSE

5 <= 3  # FALSE
2 <= 3  # TRUE
5 <= 5  # TRUE

The equal to operator == checks if the left-hand side is equal to the right-hand side and returns TRUE or FALSE

5 == 3  # FALSE
2 == 3  # FALSE
5 == 5  # TRUE

The not equal to operator != checks if the left-hand side is not equal to the right-hand side and returns TRUE or FALSE

5 != 3  # TRUE
2 != 3  # TRUE
5 != 5  # FALSE

Logical Operators

These operators are used to perform logical comparisons between values or conditions. The and operator & checks if the condition on the left-hand side is TRUE and if the condition on the right-hand side is TRUE. It returns TRUE only if both conditions are TRUE.

TRUE & TRUE    # TRUE
TRUE & FALSE   # FALSE
2 < 3 & 5 > 3  # TRUE
2 > 3 & 5 > 3  # FALSE
2 > 3 & 5 < 3  # FALSE

The or operator | checks if the condition on the left-hand side is TRUE or if the condition on the right-hand side is TRUE. It returns TRUE if either condition (or both) is TRUE.

TRUE | TRUE    # TRUE
TRUE | FALSE   # TRUE
2 < 3 | 5 > 3  # TRUE
2 > 3 | 5 > 3  # TRUE
2 > 3 | 5 < 3  # FALSE

The not operator ! is used to reverse or negate logical values. It takes a TRUE value and makes it FALSE, or a FALSE value and makes it TRUE. This operator is commonly used to filter data, apply conditional statements, and work with logical vectors. It can also be helpful in conditional statements, such as if statements, when you want to run code if a condition is not met. For example:

x <- 5
if (!(x == 10)) {  # Parentheses make it explicit that ! negates the whole comparison
  print("x is not equal to 10")
} else {
  print("x is equal to 10")
}
# Prints "x is not equal to 10" because x = 5, which is not 10

x <- 10
if (!(x == 10)) {  # Parentheses make it explicit that ! negates the whole comparison
  print("x is not equal to 10")
} else {
  print("x is equal to 10")
}
# Prints "x is equal to 10" because x = 10, which is 10
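
The examples above compare single values, but the same operators also work element by element on whole vectors, which is what makes them useful for indexing data frames later on. A minimal sketch, using a made-up vector named `age`:

```r
# Hypothetical ages (made up for illustration)
age <- c(22, 30, 47, 55)

over_25     <- age > 25              # FALSE  TRUE  TRUE  TRUE
between     <- age > 25 & age < 50   # FALSE  TRUE  TRUE FALSE
outside     <- age < 25 | age > 50   #  TRUE FALSE FALSE  TRUE
not_over_25 <- !(age > 25)           #  TRUE FALSE FALSE FALSE

between
```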

Colon Operator

The colon operator : is used to create sequences of integers (whole numbers) between a start and end number, inclusive of the start and end numbers, in the format: start:end.

3:5    # Generates the sequence of numbers: 3 4 5
3:8    # Generates the sequence of numbers: 3 4 5 6 7 8
11:13  # Generates the sequence of numbers: 11 12 13

Concatenate Function

The concatenate function c() (or combine function) is used to combine elements into a vector. The c() function can create vectors of numbers or of characters (letters or words). Importantly, it can be used in combination with the colon operator : to create a vector of numbers that skips selected values.

c(1, 3:5)            # Generates a vector of numbers, output: 1 3 4 5
c(1, 3:5, 7)         # Generates a vector of numbers, output: 1 3 4 5 7
c(1, 3:5, 7, 11:13)  # Generates a vector of numbers, output: 1 3 4 5 7 11 12 13
c("b", "r", "g")     # Generates a vector of characters (in this case letters), output: "b" "r" "g"
c("blue", "red", "green")  # Generates a vector of characters (in this case words), output: "blue" "red" "green"
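
One behavior of c() worth knowing is that a vector holds only one data type, so mixing types converts every element to a common type. The sketch below illustrates this; the variable names `idx` and `mixed` are made up for the example.

```r
# Combining single numbers and ranges gives one numeric vector
idx <- c(1, 3:5, 7)
idx            # 1 3 4 5 7
length(idx)    # 5

# Mixing numbers, text, and logical values converts everything to text
mixed <- c(1, "a", TRUE)
class(mixed)   # "character"
mixed          # "1" "a" "TRUE"
```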

Dollar Sign Operator

The dollar sign operator $ is used to access or extract specific columns of a data frame by the variable (column) name, in the format: dataframe$column_name.

mydata$participant  # Access the "participant" column, output: P01 P02 P03 ...
mydata$moodGroup    # Access the "moodGroup" column, output: negativeMood negativeMood positiveMood ...
# mydata$mood group  # Invalid (a syntax error if run) because:
                     # 1) there is no column named mood group, and
                     # 2) column names with spaces must be in quotes or backticks
# Assuming there was a column named "mood group" with a space (bad practice), you would need to index it with quotes or backticks:
mydata$"mood group"  # Access the "mood group" column, output: negativeMood negativeMood positiveMood ...
mydata$`mood group`  # Same column, accessed with backticks

Indexing by Column

Indexing by column is a way to access and manipulate specific subsets of data based on the column number or column name. We have already seen that we can index one specific column by name using the dollar sign operator $, e.g., data_frame$column_name, but this is limited to one column at a time. To index a subset of columns (or as an alternative way to index a single column), put the column numbers or names in the column_index position of data_frame[row_index, column_index]. Leave the row_index blank to select all rows: data_frame[, column_index].

Specifying columns by column numbers or column names

data_frame[, 1]              # Access the first column, all rows
data_frame[, 1:3]            # Access columns 1 through 3, all rows
data_frame[, c(1, 3)]        # Access columns 1 and 3, all rows
data_frame[, c(1, 3:5)]      # Access columns 1 and 3 through 5 (based on the vector: 1 3 4 5), all rows

data_frame[, "column_name"]  # Access a specific column by name, all rows
data_frame$column_name       # Access a specific column by name, all rows
data_frame[, c("column_name1", "column_name2")]  # Access two specific columns by column name, all rows
data_frame[, c("rt", "accuracy", "whichPrime")]    # Access columns rt, accuracy, and whichPrime, all rows
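
One detail that the commands above do not show is what kind of object comes back. In base R, selecting a single column with [, ] returns a plain vector by default, while selecting several columns returns a data frame. The sketch below uses a hypothetical data frame `trials` whose column names are borrowed from the examples above; the values are made up.

```r
# Hypothetical example data (made up for illustration)
trials <- data.frame(
  rt         = c(450, 512, 390),
  accuracy   = c("correct", "incorrect", "correct"),
  whichPrime = c("positive", "negative", "positive"),
  stringsAsFactors = FALSE
)

class(trials[, "rt"])                  # "numeric": a single column is returned as a vector
class(trials[, "rt", drop = FALSE])    # "data.frame": drop = FALSE keeps the data frame structure
class(trials[, c("rt", "accuracy")])   # "data.frame": several columns stay a data frame

# The $ operator and single-column [ , ] indexing give the same vector
same_column <- identical(trials$rt, trials[, "rt"])
same_column                            # TRUE
```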

Indexing by Row

Indexing by row is a way to access and manipulate specific subsets of data based on the order or characteristics of individual observations. In a dataset, each row typically represents a unique observation or participant. By indexing by row, you can easily extract or manipulate individual observations based on their position (row number) or specific criteria (conditions). We can index a subset of rows by putting the row numbers in the row_index position of data_frame[row_index, column_index]. Leave the column_index blank to select all columns: data_frame[row_index, ]. Note that data frames do have row names, but by default these are simply the row numbers, so in this tutorial we index rows by number (or by condition) rather than by name.

Specifying rows by row number

You can specify a single row by putting its row index number in the row_index position, dataframe[row_index, ]:

data_frame[1, ]     # Access the first row, all columns
data_frame[1,]      # The space after the comma is not necessary, this will still access the first row, all columns
data_frame[10, ]    # Access the tenth row, all columns
data_frame[100, ]   # Access the hundredth row, all columns

Or you can specify multiple rows using a vector of numbers created with the colon operator : and/or the c() function:

data_frame[c(1, 3), ]              # Access rows 1 and 3, all columns
data_frame[3:5, ]                  # Access rows 3 through 5 (based on the vector: 3 4 5), all columns
data_frame[c(1, 3:5), ]            # Access rows 1 and 3 through 5 (based on the vector: 1 3 4 5), all columns
data_frame[c(1, 3:5, 7, 11:13), ]  # Access rows 1, 3 through 5, 7, and 11 through 13 (based on the vector: 1 3 4 5 7 11 12 13), all columns

Specifying rows by specific criteria

Indexing rows by specific criteria (conditions) allows you to extract subsets of a data frame based on logical conditions. This is done by placing a logical vector, which specifies whether each row meets the condition, in the row_index position of data_frame[row_index, ]. To index rows based on a condition, you use the following syntax:

data_frame[condition, ]              # Access rows where condition is met, all columns
data_frame[condition & condition, ]  # Access rows where both conditions are met, all columns
data_frame[condition | condition, ]  # Access rows where either condition is met, all columns

Where:

condition is a logical expression that evaluates to TRUE or FALSE for each row.
Rows where the condition is TRUE will be returned (or used in the function/calculation), and the rest will be excluded.
Conditions are most commonly expressed by identifying a criterion based on a column (variable).
- For example, if you have collected age and only want participants over 25 years old, you would use data_frame$age > 25, which looks at the age column and returns a matching vector of TRUE or FALSE values indicating whether each age is greater than 25. To then index by this condition, you can put it in the row_index position:

data_frame[data_frame$age > 25, ]                        # Access rows where age is greater than 25 (26+), all columns
data_frame[data_frame$age < 50, ]                        # Access rows where age is less than 50 (0-49), all columns
data_frame[data_frame$age < 50 & data_frame$age > 25, ]  # Access rows where age is less than 50 AND age is greater than 25 (26-49), all columns
data_frame[data_frame$age > 50 & data_frame$age < 25, ]  # Access rows where age is greater than 50 AND age is less than 25 (not possible), all columns
data_frame[data_frame$age > 50 | data_frame$age < 25, ]  # Access rows where age is greater than 50 OR age is less than 25 (0-24, 51+), all columns
data_frame[data_frame$age > 25 & data_frame$accuracy == "correct", ]  # Access rows where age is greater than 25 AND accuracy is "correct", all columns
data_frame[data_frame$prime == "positive" & data_frame$accuracy == "correct", ]  # Access rows where prime is "positive" AND accuracy is "correct", all columns
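
A caveat when filtering by condition: if the column contains missing values, the comparison returns NA for those rows, and indexing with NA produces a row filled with NA rather than dropping the row. The sketch below shows this and one common workaround, wrapping the condition in which(). The data frame `people` and its values are hypothetical.

```r
# Hypothetical example data with one missing age (made up for illustration)
people <- data.frame(
  participant = c("P01", "P02", "P03", "P04"),
  age         = c(22, 31, NA, 47),
  stringsAsFactors = FALSE
)

people$age > 25   # FALSE TRUE NA TRUE

# Plain logical indexing keeps a row of NAs where the condition is NA
over25 <- people[people$age > 25, ]
nrow(over25)      # 3: P02, an all-NA row, and P04

# which() returns only the positions where the condition is TRUE
over25_clean <- people[which(people$age > 25), ]
over25_clean$participant   # "P02" "P04"
```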

Indexing by Row and Column

You can further subset your data by using a combination of indexing by row and indexing by column. To do so, you place selection criteria in both the row_index and column_index positions within your square brackets []:

data_frame[data_frame$age > 25, "rt"]                                            # Access rows where age is greater than 25 (26+), only column "rt"
data_frame[data_frame$accuracy == "correct", c("rt", "accuracy", "whichPrime")]  # Access rows where accuracy is correct, only columns "rt" "accuracy" and "whichPrime"
data_frame[data_frame$accuracy == "correct" & data_frame$rt > 300, c("rt", "accuracy", "whichPrime")]  # Access rows where accuracy is correct and reaction times are greater than 300, only columns "rt" "accuracy" and "whichPrime"

If you are only interested in one column (variable) with a subset of rows, you can also use a combination of the row_index and the dollar sign operator $:

data_frame[data_frame$age > 25, ]$accuracy                    # Looking at "accuracy" for only participants where age is greater than 25
mydata[mydata$moodGroup == "positiveMood", ]$positiveEmotion  # Looking at "positiveEmotion" for only participants in the "positiveMood" group, returns a vector: 8  7  9  8 10
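
The different ways of pulling one column for a subset of rows give the same result, and that result can be passed straight to a summary function such as mean(). A minimal sketch, assuming a hypothetical data frame `trials` with made-up values:

```r
# Hypothetical example data (made up for illustration)
trials <- data.frame(
  age      = c(22, 31, 47, 29),
  rt       = c(450, 512, 390, 610),
  accuracy = c("correct", "incorrect", "correct", "correct"),
  stringsAsFactors = FALSE
)

older <- trials$age > 25     # FALSE TRUE TRUE TRUE

a  <- trials[older, "rt"]    # row condition plus column name
b  <- trials[older, ]$rt     # row condition, then the $ operator
c3 <- trials$rt[older]       # take the column first, then index the vector

identical(a, b)              # TRUE
identical(a, c3)             # TRUE

# Mean reaction time for participants older than 25
mean_rt_older <- mean(a)
mean_rt_older                # 504
```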

Assigning data selection to a new data frame

Sometimes you want to work with a subset of data for an extended period of time. Instead of re-specifying the subset every time you need it, you can assign the subset to a new data frame and work with that, using the syntax new_data <- data_frame[row_index, column_index]. It can be helpful to give your new data frame a meaningful name so that you know how it differs from the original.

# Create a new data frame with columns "rt", "accuracy", and "whichPrime", keeping only correct trials (data_frame$accuracy == "correct" in the row_index) whose reaction times were not false starts (data_frame$rt > 300 in the row_index)
dataCorRT <- data_frame[data_frame$accuracy == "correct" & data_frame$rt > 300, c("rt", "accuracy", "whichPrime")]

# Create a new data frame with all columns selecting only trials where reaction times were not false starts (data_frame$rt > 300 in the row_index) and reaction times were not excessively long (& data_frame$rt < 2000)
new_data <- data_frame[data_frame$rt > 300 & data_frame$rt < 2000, ]

# Create a new data frame with only participants who had positiveEmotion scores greater than 2 AND negativeEmotion scores less than 8, only columns "participant", "rt", and "moodGroup" (not "positiveEmotion" and "negativeEmotion")
mydataEmoFilter <- mydata[mydata$positiveEmotion > 2 & mydata$negativeEmotion < 8, c("participant", "rt", "moodGroup")]

`mydataEmoFilter` will return a data frame that looks like this:

participant	rt	moodGroup
P02	846	negativeMood
P03	497	positiveMood
P04	308	positiveMood
P05	457	positiveMood
P06	575	positiveMood
P08	509	positiveMood
P10	654	negativeMood
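
Two properties of an assigned subset are worth seeing in a small sketch: the new data frame is an independent copy, so changing it leaves the original untouched, and it keeps the row names (by default, the row numbers) of the rows it came from. The data frames `full_demo` and `kept` below are hypothetical, with made-up values.

```r
# Hypothetical example data (made up for illustration)
full_demo <- data.frame(
  participant = c("P01", "P02", "P03"),
  rt          = c(250, 846, 497),
  stringsAsFactors = FALSE
)

# Keep only rows with rt greater than 300
kept <- full_demo[full_demo$rt > 300, ]

# Changing the subset does not change the original data frame
kept$rt <- kept$rt / 1000
full_demo$rt      # still 250 846 497

# The subset keeps the original row names
rownames(kept)    # "2" "3"

# Reset the row names to 1, 2, ... if preferred
rownames(kept) <- NULL
rownames(kept)    # "1" "2"
```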

Adding a New Row or Column

Adding a New Row

You can add a new row by specifying either the index of the new row (if known) or by calculating it dynamically.

Option 1: If you know the last row number (in our mydata example, the last row is 10), we can do mydata[last_row + 1, ] <- NA, which in this case would be mydata[11,] <- NA. This will create a new row at the end of the data frame (row 11) filled with NA values (effectively a blank row).
Option 2: If you do not know the last row number, or want to calculate it dynamically to not accidentally overwrite an existing row if you get the last row number wrong, you can use the nrow() function: mydata[nrow(mydata) + 1, ] <- NA
If you do not want to create a blank row, but instead know the values you want to input, you can supply one value per column. Use the list() function rather than c() for this, because c() converts all of its values to a single type (here, character), which would turn the numeric columns into text: mydata[nrow(mydata) + 1, ] <- list("P11", 5, 5, 500, "negativeMood")
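
When a new row mixes text and numbers, the container used for the values matters. The sketch below compares c() and list() on a hypothetical two-column data frame `d` (names and values are made up): with c() the numeric column is converted to character, while with list() each value keeps its own type.

```r
# Hypothetical example data (made up for illustration)
d <- data.frame(
  participant = c("P01", "P02"),
  rt          = c(500, 620),
  stringsAsFactors = FALSE
)

# c() turns all values into text first, so rt becomes a character column
bad <- d
bad[nrow(bad) + 1, ] <- c("P03", 480)
class(bad$rt)    # "character"

# list() keeps each value's own type, so rt stays numeric
good <- d
good[nrow(good) + 1, ] <- list("P03", 480)
class(good$rt)   # "numeric"
good
```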

Adding a New Column

You can add a new column by specifying either the index of the new column (if known), by calculating it dynamically, or by using a new variable name.

Option 1: If you know the last column number (in our mydata example, the last column is 5), we can do mydata[, last_column + 1] <- NA, which in this case would be mydata[, 6] <- NA. This will create a new column at the end of the data frame (column 6) filled with NA values (effectively a blank column). The column will have a default variable name, which can be changed later, but if we already know what we want the variable (column) name to be, Option 3 is the most efficient.
Option 2: If you do not know the last column number, or want to calculate it dynamically to not accidentally overwrite an existing column if you get the last column number wrong, you can use the ncol() function: mydata[, ncol(mydata) + 1] <- NA
Option 3 (best option): If you know the variable (column) name, you can simply use the dollar sign operator $ and the new variable name. This will create a new column automatically as the last column: mydata$age <- NA
If you do not want to create a blank column, but instead know the values you want to input, you can use the c() function to create a vector with one value per row (11 here, counting the P11 row added above) and assign that instead: mydata$gender <- c("female", NA, "male", "nonbinary", "agender", "male", "female", "female", "nonbinary", "agender", "nonbinary")
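
The column options above can be sketched on a small hypothetical data frame `d` (names and values are made up). The sketch also shows a column computed from an existing one, renaming a column that was added by position, and what happens when the number of values does not fit the number of rows.

```r
# Hypothetical example data (made up for illustration)
d <- data.frame(
  participant = c("P01", "P02", "P03"),
  rt          = c(500, 620, 480),
  stringsAsFactors = FALSE
)

# Blank column added by name
d$age <- NA

# Column computed from an existing column
d$rt_sec <- d$rt / 1000

# Column added by position gets an automatically generated name, renamed here
d[, ncol(d) + 1] <- c("a", "b", "c")
names(d)[ncol(d)] <- "group"
names(d)

# Two values cannot fill three rows, so R raises an error
outcome <- tryCatch(
  {
    d$too_short <- c(1, 2)
    "assigned"
  },
  error = function(e) "error"
)
outcome   # "error"
```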