Introduction to Data Frames in R
What is a data frame?
- A data frame is a two-dimensional structure (like a table) that is commonly used to store and manage datasets in R.
- Data frames are two-dimensional because they have both rows and columns, unlike a vector (one dimensional, used to store a sequence of elements that are of the same type) or a list (can be multidimensional but less structured, used to store a collection of elements of different data types).
- Data frames are particularly well-suited for data analysis because they allow you to organize data in a structured way, making it easy to view, manipulate, and analyze. Common manipulations are summarized in Table 1.
- Data frames are also flexible enough to handle mixed data types in one structure, meaning each column can hold a different data type (e.g., numerical, character, or logical), accommodating the variety found in real datasets.
- Because data frames reflect how data is typically collected in the real world, they are intuitive and can be easily inspected, subsetted, summarized, and transformed.
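As a minimal sketch of the points above, the following builds a small data frame with one column per data type. The column names and values are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data: names and values are illustrative only
mydata <- data.frame(
  participant = c("P01", "P02", "P03"),  # character column
  rt          = c(512, 846, 497),        # numeric column
  correct     = c(TRUE, FALSE, TRUE),    # logical column
  stringsAsFactors = FALSE               # keep text as character, not factor
)

str(mydata)  # prints the dimensions and the type of each column
dim(mydata)  # number of rows, then number of columns
```

Each column holds a single type, while the data frame as a whole mixes types, which is what distinguishes it from a vector.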
Table 1: Summary of Common Data Frame Manipulations
Task | Description | Command or Example |
---|---|---|
Subsetting Data | Selecting rows, columns, or both based on conditions. | data$column (select column), data[ , 1] (select column by index), data[1:5, ] (select rows by index) |
Conditional Filtering | Filtering data based on logical conditions (e.g., greater than, equal to). | data[data$column > 10, ] (filter rows based on condition) |
Removing NA Values | Excluding missing data to ensure analyses are not biased by missing values. | data <- na.omit(data) (remove rows with NA values), data[!is.na(data$column), ] (remove NA from specific column) |
Adding or Modifying Columns | Creating new columns or modifying existing ones, e.g., recoding or calculations. | data$new_column <- data$column1 + data$column2 (add new column), data$column <- scale(data$column) (modify a column) |
Summarizing and Aggregating Data | Calculating summary statistics, e.g., mean, median, or count. | summary(data) (summary statistics), aggregate(column ~ group, data = data, FUN = mean) (aggregate by group) |
Reshaping Data | Converting data from long to wide format or vice versa. | pivot_wider(data, names_from = column, values_from = value_column) (long to wide), pivot_longer(data, cols = starts_with("prefix")) (wide to long); both functions come from the tidyr package |
Merging Data | Combining multiple data frames by a common key (e.g., participant ID). | merged_data <- merge(data1, data2, by = "ID") (merge by a common column) |
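The base R commands in Table 1 can be chained together. The sketch below uses two hypothetical data frames (scores and ages, invented for this example) to show removing missing values, aggregating by group, and merging by a shared ID column.

```r
# Hypothetical data frames for illustration only
scores <- data.frame(
  ID    = c(1, 2, 3, 4),
  group = c("a", "a", "b", "b"),
  score = c(10, NA, 14, 18),
  stringsAsFactors = FALSE
)
ages <- data.frame(ID = c(1, 2, 3, 4), age = c(21, 35, 28, 40))

# Remove rows containing NA (here, the row for ID 2)
complete <- na.omit(scores)

# Mean score per group, computed on the complete rows
group_means <- aggregate(score ~ group, data = complete, FUN = mean)

# Combine the two data frames by their common ID column
merged <- merge(complete, ages, by = "ID")
```

By default merge() keeps only IDs present in both data frames, so the merged result here has three rows.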
Structure of a data frame
Data frames are a type of object (variable) that can be loaded into the R environment and used for data analysis. Data frames are identified by their variable name, e.g., mydata, P01_dataset, or congruent_subset.
Think of a data frame as a spreadsheet or a table with rows and columns:
- Rows typically represent individual observations (such as participants in a study or trials in a task).
- Columns represent variables (characteristics or measurements, like age, group assignment, reaction time, or their score on a questionnaire).
- Within a column, all data must be of the same type, but each different column can contain a different data type (summarized in Table 3).
Data frames can be thought of as a range of rows (1 through end) and columns (1 through end), with each cell having the position [row #, column #]. Thus, each cell can be accessed individually by its specific [row_index, column_index]. You can also make row_index or column_index a range of numbers to access a range of cells (more below).
Table 2: Structure of a Data Frame
| | column1 | column2 | column3 | column4 | column5 |
---|---|---|---|---|---|
row1 | [1, 1] | [1, 2] | [1, 3] | [1, 4] | [1, 5] |
row2 | [2, 1] | [2, 2] | [2, 3] | [2, 4] | [2, 5] |
row3 | [3, 1] | [3, 2] | [3, 3] | [3, 4] | [3, 5] |
row4 | [4, 1] | [4, 2] | [4, 3] | [4, 4] | [4, 5] |
row5 | [5, 1] | [5, 2] | [5, 3] | [5, 4] | [5, 5] |
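The [row, column] positions in Table 2 map directly onto R's square-bracket syntax. In this sketch, a hypothetical data frame named grid stores values whose digits encode their own position (for example, 23 sits in row 2, column 3).

```r
# Hypothetical 3 x 3 data frame; each value encodes its row and column
grid <- data.frame(
  column1 = c(11, 21, 31),
  column2 = c(12, 22, 32),
  column3 = c(13, 23, 33)
)

grid[2, 3]      # single cell at row 2, column 3: 23
grid[1:2, 2:3]  # block of cells: rows 1-2, columns 2-3
```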
Table 3: Summary of Data Types
Data Type | Description | Command to Check Type | Command to Change to Type |
---|---|---|---|
Numeric | Real numbers (e.g., decimals, floating-point numbers) | is.numeric(df$column) | as.numeric(df$column) |
Integer | Whole numbers (e.g., 1L, 2L, etc.) | is.integer(df$column) | as.integer(df$column) |
Character | Text or string values (e.g., "Hello", "data") | is.character(df$column) | as.character(df$column) |
Factor | Categorical data with levels (e.g., "low", "high") | is.factor(df$column) | as.factor(df$column) |
Ordered Factor | Categorical data with a specified order (e.g., "low", "medium", "high") | is.ordered(df$column) | factor(df$column, ordered = TRUE) |
Logical | Boolean values (TRUE or FALSE) | is.logical(df$column) | as.logical(df$column) |
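A short sketch of the check and convert commands from Table 3, using a hypothetical data frame df in which numbers were read in as text. The explicit levels argument is an addition here: without it, factor() orders levels alphabetically, which is rarely the intended order.

```r
# Hypothetical data: age was read in as text rather than numbers
df <- data.frame(
  age   = c("23", "31", "45"),
  level = c("low", "high", "medium"),
  stringsAsFactors = FALSE
)

is.numeric(df$age)            # FALSE: the column is character
df$age <- as.numeric(df$age)  # convert text to numbers
is.numeric(df$age)            # TRUE

# Ordered factor with an explicit low < medium < high ordering
df$level <- factor(df$level,
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)
is.ordered(df$level)          # TRUE
df$level < "high"             # TRUE FALSE TRUE
```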
Indexing Data Frames
Data frames (like vectors and lists) are indexable, meaning you can access their individual elements (or parts) using a specific reference or index by rows and columns.
- Rows are indexed by their row number
- Columns can be indexed by either their column number or their column name
- You can also index by a combination of row and column
The rows and columns of a data frame can be specified following the name of the data frame using square brackets [] and separated by a comma. Rows are specified before the comma, and columns are specified after the comma:
dataframe[row_index, column_index]
- row_index specifies which rows to access (all rows or a subset)
- column_index specifies which columns to access (all columns or a subset)
Important Functions and Operators Needed to Index
Comparison Operators
The greater than operator > checks if the left-hand side is greater than the right-hand side and returns TRUE or FALSE.
5 > 3 # TRUE
2 > 3 # FALSE
5 > 5 # FALSE
The less than operator < checks if the left-hand side is less than the right-hand side and returns TRUE or FALSE.
5 < 3 # FALSE
2 < 3 # TRUE
5 < 5 # FALSE
The greater than or equal to operator >= checks if the left-hand side is greater than or equal to the right-hand side and returns TRUE or FALSE.
5 >= 3 # TRUE
2 >= 3 # FALSE
5 >= 5 # TRUE
The less than or equal to operator <= checks if the left-hand side is less than or equal to the right-hand side and returns TRUE or FALSE.
5 <= 3 # FALSE
2 <= 3 # TRUE
5 <= 5 # TRUE
The equal to operator == checks if the left-hand side is equal to the right-hand side and returns TRUE or FALSE.
5 == 3 # FALSE
2 == 3 # FALSE
5 == 5 # TRUE
The not equal to operator != checks if the left-hand side is not equal to the right-hand side and returns TRUE or FALSE.
5 != 3 # TRUE
2 != 3 # TRUE
5 != 5 # FALSE
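Comparison operators also work on whole vectors, comparing each element in turn and returning one TRUE or FALSE per element. This is what makes them useful for filtering columns later on. A small sketch with a hypothetical vector of reaction times:

```r
# Hypothetical reaction times in milliseconds
rt <- c(250, 480, 1200, 300)

rt > 300       # FALSE TRUE TRUE FALSE (one result per element)
rt == 300      # FALSE FALSE FALSE TRUE
sum(rt > 300)  # 2, because TRUE counts as 1 and FALSE as 0
```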
Logical Operators
These operators are used to perform logical comparisons between values or conditions.
The and operator & checks whether the condition on the left-hand side is TRUE and whether the condition on the right-hand side is TRUE. It returns TRUE only if both conditions are TRUE.
TRUE & TRUE # TRUE
TRUE & FALSE # FALSE
2 < 3 & 5 > 3 # TRUE
2 > 3 & 5 > 3 # FALSE
2 > 3 & 5 < 3 # FALSE
The or operator | checks whether the condition on the left-hand side is TRUE or the condition on the right-hand side is TRUE. It returns TRUE if either condition (or both) is TRUE.
TRUE | TRUE # TRUE
TRUE | FALSE # TRUE
2 < 3 | 5 > 3 # TRUE
2 > 3 | 5 > 3 # TRUE
2 > 3 | 5 < 3 # FALSE
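Like the comparison operators, & and | operate element by element when given vectors. The sketch below combines two conditions on a hypothetical vector of ages.

```r
# Hypothetical ages
age <- c(19, 30, 62)

age > 25 & age < 50  # FALSE TRUE FALSE (only 30 is between 25 and 50)
age < 25 | age > 50  # TRUE FALSE TRUE  (19 and 62 fall outside 25-50)
```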
The not operator ! is used to reverse (negate) logical values: it turns TRUE into FALSE and FALSE into TRUE. This operator is commonly used to filter data, apply conditional statements, and work with logical vectors. It is also helpful in conditional statements, such as if statements, when you want to run code only if a condition is not met. For example:
x <- 5
if (!(x == 10)) {
  print("x is not equal to 10")
} else {
  print("x is equal to 10")
}
# Prints "x is not equal to 10" because x is 5, which is not 10
x <- 10
if (!(x == 10)) {
  print("x is not equal to 10")
} else {
  print("x is equal to 10")
}
# Prints "x is equal to 10" because x is 10
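For filtering, ! is most often paired with a function that returns a logical vector, such as is.na(). A minimal sketch with a hypothetical vector containing one missing value:

```r
# Hypothetical scores with one missing value
scores <- c(4, NA, 7)

is.na(scores)           # FALSE TRUE FALSE (which values are missing)
!is.na(scores)          # TRUE FALSE TRUE  (which values are present)
scores[!is.na(scores)]  # 4 7 (keep only the values that are present)
```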
Colon Operator
The colon operator : is used to create sequences of integers (whole numbers) between a start and end number, inclusive of both, in the format start:end.
3:5 # Generates the sequence 3 4 5
3:8 # Generates the sequence 3 4 5 6 7 8
11:13 # Generates the sequence 11 12 13
Concatenate Function
The concatenate function c() (also called the combine function) is used to combine elements into a vector. The c() function can create vectors of numbers or of characters (letters or words). Importantly, it can be used in combination with the colon operator : to create a vector of numbers that skips selected values.
c(1,3:5) # Generates a vector of numbers, output: 1 3 4 5
c(1,3:5,7) # Generates a vector of numbers, output: 1 3 4 5 7
c(1,3:5,7,11:13) # Generates a vector of numbers, output: 1 3 4 5 7 11 12 13
c("b", "r", "g") # Generates a vector of characters (in this case letters), output: "b" "r" "g"
c("blue", "red", "green") # Generates a vector of characters (in this case words), output: "blue" "red" "green"
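Two properties of c() matter for indexing. The sketch below uses R's built-in letters vector to show an index vector in action, and a hypothetical mixed vector to show that c() converts all of its elements to a single type.

```r
# Build an index vector that skips position 2
idx <- c(1, 3:5)
letters[idx]  # "a" "c" "d" "e" (elements 1, 3, 4, and 5 of the alphabet)

# A vector holds one type only: mixing numbers and text gives all text
mixed <- c(1, "two", 3)
class(mixed)  # "character"
mixed         # "1" "two" "3"
```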
Dollar Sign Operator
The dollar sign operator $ is used to access or extract specific columns of a data frame by the variable (column) name, in the format dataframe$column_name.
mydata$participant # Access the "participant" column, output: P01 P02 P03 ...
mydata$moodGroup # Access the "moodGroup" column, output: negativeMood negativeMood positiveMood ...
mydata$mood group # Invalid because:
# 1) there is no column mood group, and
# 2) column names with spaces must be in quotes
# Assuming there was a column named "mood group" with a space (bad practice), you would need to index it with quotes:
mydata$"mood group" # Access the "mood group" column, output: negativeMood negativeMood positiveMood ...
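A column extracted with $ is an ordinary vector, so it can be passed straight to other functions. This sketch uses a hypothetical three-row version of mydata; the values are illustrative only.

```r
# Hypothetical toy data
mydata <- data.frame(
  participant = c("P01", "P02", "P03"),
  rt          = c(512, 846, 497),
  stringsAsFactors = FALSE
)

mydata$rt         # 512 846 497 (a numeric vector)
max(mydata$rt)    # 846
mydata[["rt"]]    # double brackets with a quoted name return the same vector
```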
Indexing by Column
Indexing by column is a way to access and manipulate specific subsets of data based on the column number or column name. We have already seen that we can index one specific column by name using the dollar sign operator $, e.g., data_frame$column_name, but this is limited to one column. We can index a subset of columns (or, alternatively, a single column) by giving the column numbers or names in the column_index position of data_frame[row_index, column_index]. Leaving the row_index blank selects all rows: data_frame[, column_index].
Specifying columns by column numbers or column names
data_frame[, 1] # Access the first column, all rows
data_frame[, 1:3] # Access columns 1 through 3, all rows
data_frame[, c(1, 3)] # Access columns 1 and 3, all rows
data_frame[, c(1,3:5)] # Access columns 1 and 3 through 5 (based on the vector: 1 3 4 5), all rows
data_frame[, "column_name"] # Access a specific column by name, all rows
data_frame$column_name # Access a specific column by name, all rows
data_frame[, c("column_name1", "column_name2")] # Access two specific columns by column name, all rows
data_frame[, c("rt", "accuracy", "whichPrime")] # Access columns rt, accuracy, and whichPrime, all rows
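One detail worth knowing: selecting a single column with square brackets returns a plain vector by default, while selecting several columns returns a data frame. The sketch below uses a hypothetical data_frame and the drop = FALSE argument, which keeps a single-column result as a data frame.

```r
# Hypothetical trial data
data_frame <- data.frame(
  rt         = c(412, 530, 615),
  accuracy   = c("correct", "incorrect", "correct"),
  whichPrime = c("positive", "negative", "positive"),
  stringsAsFactors = FALSE
)

one_col    <- data_frame[, "rt"]                 # numeric vector
one_col_df <- data_frame[, "rt", drop = FALSE]   # one-column data frame
two_cols   <- data_frame[, c("rt", "accuracy")]  # two-column data frame
```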
Indexing by Row
Indexing by row is a way to access and manipulate specific subsets of data based on the order or characteristics of individual observations. In a dataset, each row typically represents a unique observation or participant. By indexing by row, you can easily extract or manipulate individual observations based on their position (row number) or specific criteria (conditions). We can index a subset of rows by giving the row numbers in the row_index position of data_frame[row_index, column_index]. Leaving the column_index blank selects all columns: data_frame[row_index, ]. Note that, by default, rows are labeled only by their row numbers, so we index rows by number or by condition rather than by name.
Specifying rows by row number dataframe[row_index,]
You can specify a single row using its row index number:
data_frame[1, ] # Access the first row, all columns
data_frame[1,] # The space after the comma is not necessary, this will still access the first row, all columns
data_frame[10, ] # Access the tenth row, all columns
data_frame[100, ] # Access the hundredth row, all columns
Or you can specify multiple rows using a vector of numbers created with the colon operator : and the c() function:
data_frame[c(1, 3), ] # Access rows 1 and 3, all columns
data_frame[3:5, ] # Access rows 3 through 5 (based on the vector: 3 4 5), all columns
data_frame[c(1, 3:5), ] # Access rows 1 and 3 through 5 (based on the vector: 1 3 4 5), all columns
data_frame[c(1, 3:5, 7, 11:13), ] # Access rows 1, 3 through 5, 7, and 11 through 13 (based on the vector: 1 3 4 5 7 11 12 13), all columns
Specifying rows by specific criteria
Indexing rows by specific criteria (conditions) allows you to extract subsets of a data frame based on logical conditions. This is done by placing a logical vector, which specifies whether each row meets a certain condition, in the row_index position of data_frame[row_index, ]. To index rows based on a condition, you use the following syntax:
data_frame[condition, ] # Access rows where condition is met, all columns
data_frame[condition & condition, ] # Access rows where both conditions are met, all columns
data_frame[condition | condition, ] # Access rows where either condition is met, all columns
Where:
- condition is a logical expression that evaluates to TRUE or FALSE for each row.
- Rows where the condition is TRUE will be returned (or used in the function/calculation), and the rest will be excluded.
- Conditions are most commonly expressed by identifying a criterion based on a column (variable).
  - For example, if you have collected age and only want participants over 25 years old, you would use data_frame$age > 25, which will look at the age column and return a matching vector of TRUE or FALSE values indicating whether each age is greater than 25. To then index by this condition, you can put it in the row_index position:
data_frame[data_frame$age > 25, ] # Access rows where age is greater than 25 (26+), all columns
data_frame[data_frame$age < 50, ] # Access rows where age is less than 50 (0-49), all columns
data_frame[data_frame$age < 50 & data_frame$age > 25, ] # Access rows where age is less than 50 AND age is greater than 25 (26-49), all columns
data_frame[data_frame$age > 50 & data_frame$age < 25, ] # Access rows where age is greater than 50 AND age is less than 25 (not possible), all columns
data_frame[data_frame$age > 50 | data_frame$age < 25, ] # Access rows where age is greater than 50 OR age is less than 25 (0-24, 51+), all columns
data_frame[data_frame$age > 25 & data_frame$accuracy == "correct", ] # Access rows where age is greater than 25 AND accuracy is "correct", all columns
data_frame[data_frame$prime == "positive" & data_frame$accuracy == "correct", ] # Access rows where prime is "positive" AND accuracy is "correct", all columns
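It can help to build the condition as a separate logical vector and inspect it before indexing. The sketch below does this with a hypothetical data_frame that includes one missing age. A missing value makes the condition NA rather than TRUE or FALSE, and indexing with NA returns a row filled with NA; wrapping the condition in which() is one common way to avoid that.

```r
# Hypothetical data with one missing age
data_frame <- data.frame(
  age      = c(22, 31, NA, 47, 58),
  accuracy = c("correct", "incorrect", "correct", "correct", "correct"),
  stringsAsFactors = FALSE
)

# One logical value per row: FALSE TRUE NA TRUE FALSE
keep <- data_frame$age > 25 & data_frame$age < 50

with_na <- data_frame[keep, ]         # 3 rows: two matches plus an all-NA row
clean   <- data_frame[which(keep), ]  # 2 rows: which() keeps only TRUE positions
```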
Indexing by Row and Column
You can further subset your data by using a combination of indexing by row and indexing by column. To do so, you place selection criteria in both the row_index and column_index positions within your square brackets []:
data_frame[data_frame$age > 25, "rt"] # Access rows where age is greater than 25 (26+), only column "rt"
data_frame[data_frame$accuracy == "correct", c("rt", "accuracy", "whichPrime")] # Access rows where accuracy is correct, only columns "rt" "accuracy" and "whichPrime"
data_frame[data_frame$accuracy == "correct" & data_frame$rt > 300, c("rt", "accuracy", "whichPrime")] # Access rows where accuracy is correct and reaction times are greater than 300, only columns "rt" "accuracy" and "whichPrime"
If you are only interested in one column (variable) with a subset of rows, you can also use a combination of the row_index and the dollar sign operator $:
data_frame[data_frame$age > 25, ]$accuracy # Looking at "accuracy" for only participants where age is greater than 25
mydata[mydata$moodGroup == "positiveMood", ]$positiveEmotion # Looking at "positiveEmotion" for only participants in the "positiveMood" group, returns a vector: 8 7 9 8 10
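The different ways of combining a row condition with a single column give the same result, as this sketch with a hypothetical data_frame shows. Which form to use is largely a matter of readability.

```r
# Hypothetical data
data_frame <- data.frame(
  age      = c(22, 31, 47),
  rt       = c(412, 530, 615),
  accuracy = c("correct", "incorrect", "correct"),
  stringsAsFactors = FALSE
)

older <- data_frame$age > 25        # FALSE TRUE TRUE

by_name   <- data_frame[older, "rt"]  # row condition plus column name
by_dollar <- data_frame[older, ]$rt   # subset rows, then extract the column
by_vector <- data_frame$rt[older]     # extract the column, then subset it

mean(by_name)  # 572.5, the mean rt for participants older than 25
```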
Assigning data selection to a new data frame
Sometimes you want to work with a subset of data for an extended period of time. Instead of re-specifying the subset every time you need it, you can assign the subset to a new data frame and work with that, using the syntax new_data <- data_frame[row_index, column_index]. It can be helpful to give your new data frame a meaningful name so that you know how it differs from the original.
# Create new data frame with columns "rt" "accuracy" and "whichPrime" selecting only correct trials (data_frame$accuracy == "correct" in the row_index) and reaction times were not false starts (data_frame$rt > 300 in the row_index)
dataCorRT <- data_frame[data_frame$accuracy == "correct" & data_frame$rt > 300, c("rt", "accuracy", "whichPrime")]
# Create a new data frame with all columns selecting only trials where reaction times were not false starts (data_frame$rt > 300 in the row_index) and reaction times were not excessively long (& data_frame$rt < 2000)
new_data <- data_frame[data_frame$rt > 300 & data_frame$rt < 2000, ]
# Create a new data frame with only participants who had positiveEmotion scores greater than 2 AND negativeEmotion scores less than 8, only columns "participant" "rt" and "moodGroup" (not "positiveEmotion" and "negativeEmotion")
mydataEmoFilter <- mydata[mydata$positiveEmotion > 2 & mydata$negativeEmotion < 8, c("participant", "rt", "moodGroup")]
mydataEmoFilter will return a data frame that looks like this:
participant | rt | moodGroup |
---|---|---|
P02 | 846 | negativeMood |
P03 | 497 | positiveMood |
P04 | 308 | positiveMood |
P05 | 457 | positiveMood |
P06 | 575 | positiveMood |
P08 | 509 | positiveMood |
P10 | 654 | negativeMood |
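Assigning a subset to a new name creates a copy and leaves the original data frame unchanged. The sketch below uses hypothetical trial data (trials and valid_trials are names invented for this example). It also shows that the subset keeps the original row numbers as labels, which can be reset if that is preferred.

```r
# Hypothetical trial data
trials <- data.frame(
  rt       = c(150, 420, 980, 2500),
  accuracy = c("correct", "correct", "incorrect", "correct"),
  stringsAsFactors = FALSE
)

# Keep trials that are neither false starts nor excessively long
valid_trials <- trials[trials$rt > 300 & trials$rt < 2000, ]

nrow(trials)        # 4: the original is untouched
nrow(valid_trials)  # 2: only the rows with rt 420 and 980 remain

# The subset still carries the original row labels ("2" and "3");
# setting them to NULL renumbers the rows from 1
rownames(valid_trials) <- NULL
```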
Adding a New Row or Column
Adding a New Row
You can add a new row by specifying either the index of the new row (if known) or by calculating it dynamically.
- Option 1: If you know the last row number (in our mydata example, the last row is 10), we can do mydata[last_row + 1, ] <- NA, which in this case would be mydata[11, ] <- NA. This will create a new row at the end of the data frame (row 11) filled with NA values (effectively a blank row).
- Option 2: If you do not know the last row number, or want to calculate it dynamically so that you do not accidentally overwrite an existing row by getting the last row number wrong, you can use the nrow() function: mydata[nrow(mydata) + 1, ] <- NA
- If you do not want to create a blank row, but instead know the values you want to input, you can use the list() function to supply one value per column: mydata[nrow(mydata) + 1, ] <- list("P11", 5, 5, 500, "negativeMood"). Use list() rather than c() here: because a vector can hold only one data type, c("P11", 5, 5, 500, "negativeMood") would turn every value into text and convert the numeric columns to character.
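The sketch below adds rows to a hypothetical, shortened version of mydata while keeping each column's type intact. It supplies the new values with list(), since c() would coerce mixed values to character, and also shows rbind() with a one-row data frame as an alternative that matches values to columns by name.

```r
# Hypothetical toy data with three columns
mydata <- data.frame(
  participant = c("P01", "P02"),
  rt          = c(512, 846),
  moodGroup   = c("negativeMood", "positiveMood"),
  stringsAsFactors = FALSE
)

# Append a row; list() lets each value keep its own type
mydata[nrow(mydata) + 1, ] <- list("P03", 497, "positiveMood")

# Alternative: build a one-row data frame and bind it on
new_row <- data.frame(
  participant = "P04",
  rt          = 308,
  moodGroup   = "positiveMood",
  stringsAsFactors = FALSE
)
mydata <- rbind(mydata, new_row)

is.numeric(mydata$rt)  # TRUE: the rt column is still numeric
```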
Adding a New Column
You can add a new column by specifying either the index of the new column (if known), by calculating it dynamically, or by using a new variable name.
- Option 1: If you know the last column number (in our mydata example, the last column is 5), we can do mydata[, last_column + 1] <- NA, which in this case would be mydata[, 6] <- NA. This will create a new column at the end of the data frame (column 6) filled with NA values (effectively a blank column). The column will have a default variable name, which can be changed later, but if we already know what we want the variable (column) name to be, Option 3 is the most efficient.
- Option 2: If you do not know the last column number, or want to calculate it dynamically so that you do not accidentally overwrite an existing column by getting the last column number wrong, you can use the ncol() function: mydata[, ncol(mydata) + 1] <- NA
- Option 3 (best option): If you know the variable (column) name, you can simply use the dollar sign operator $ and the new variable name. This will create a new column automatically as the last column: mydata$age <- NA
- If you do not want to create a blank column, but instead know the values you want to input, you can use the c() function to create a vector with one value per row and input that instead: mydata$gender <- c("female", NA, "male", "nonbinary", "agender", "male", "female", "female", "nonbinary", "agender", "nonbinary")
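A minimal sketch of adding columns by name, using a hypothetical three-row mydata. The column names rt_sec and group are invented for this example. A new column can be computed from an existing one, supplied as one value per row, or filled with a single repeated value.

```r
# Hypothetical toy data
mydata <- data.frame(
  participant = c("P01", "P02", "P03"),
  rt          = c(512, 846, 497),
  stringsAsFactors = FALSE
)

mydata$rt_sec <- mydata$rt / 1000  # computed from an existing column
mydata$group  <- c("a", "b", "a")  # one value per row
mydata$age    <- NA                # a single value is repeated for every row

names(mydata)  # "participant" "rt" "rt_sec" "group" "age"

# Note: a vector whose length does not match the number of rows
# (for example, two values for three rows) produces an error.
```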