Data Preparation and Troubleshooting for Plotting
This guide explains how to prepare data for plotting and how to diagnose the most common issues that arise when making figures in R using ggplot2. Each section explains what the step accomplishes and why it matters for producing clean, accurate visualizations.
1. Inspecting and Understanding Your Data
Before making any plot, it is important to understand what is in your dataset and how R is interpreting each column.
Why this matters
Most plotting problems arise because ggplot is receiving the wrong kind of data (for example, a character when a factor is needed, or a factor when a numeric is expected). Inspecting your dataset early saves time and prevents confusing errors.
Key tools
Run these commands every time you load a new dataset:
str(npt_data)
head(npt_data)
summary(npt_data)
names(npt_data)
What to look for
- Are numeric variables truly numeric? Reaction times, accuracy, and scores must be numeric for ggplot to place them on continuous axes.
- Are grouping variables factors? Group comparisons (for example, bar plots and boxplots) behave most predictably when categorical variables are stored as factors. Groups coded as numbers (such as 0/1) must be converted, or ggplot will treat them as continuous.
- Are labels consistent? Typos and inconsistent capitalization create new groups accidentally.
- Are there missing values? NA values may silently remove data from plots or cause geoms to fail.
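As a rough sketch of these checks, the snippet below builds a small made-up data frame (`toy`, a hypothetical stand-in for `npt_data` with invented values) and inspects column classes, label consistency, and missing values.

```r
# Hypothetical stand-in for npt_data; all values are invented for illustration.
toy <- data.frame(
  focus_group      = c("Low Focus", "High Focus", "high focus", "Low Focus"),
  mean_rt_overall  = c(512.3, 478.9, NA, 530.1),
  stringsAsFactors = FALSE
)

# Class of every column: measures should report "numeric"
col_classes <- sapply(toy, class)

# Distinct labels: "high focus" and "High Focus" would be plotted as separate groups
group_labels <- sort(unique(toy$focus_group))

# Number of missing values per column
na_counts <- colSums(is.na(toy))

print(col_classes)
print(group_labels)
print(na_counts)
```

Here the label check turns up three distinct groups where only two were intended, which is the kind of problem that is much easier to catch before plotting than after.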
2. Converting, Reordering, and Cleaning Variables
Many plots require specific data types to function correctly. This section explains how to control and clean variable types so your plots behave as expected.
2A. Converting to factor
Use this when categorical variables appear as characters.
npt_data$focus_group <- factor(npt_data$focus_group)
Why it matters
ggplot treats character variables as discrete, but always orders them alphabetically. Factors let you control the order and labels of the categories directly.
2B. Reordering factor levels
This determines the order in which categories appear in plots.
npt_data$focus_group <- factor(
npt_data$focus_group,
levels = c("Low Focus", "High Focus")
)
Why it matters
If R uses alphabetical ordering, “High Focus” may appear before “Low Focus,” which is not intuitive for interpretation.
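A minimal illustration of the difference, using an invented vector of group labels (`groups` is hypothetical and not part of `npt_data`):

```r
# Hypothetical group labels; invented for illustration.
groups <- c("Low Focus", "High Focus", "Low Focus", "High Focus")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(groups))

# Explicit: levels follow the order given in `levels`
ordered_levels <- levels(factor(groups, levels = c("Low Focus", "High Focus")))

print(default_levels)  # "High Focus" "Low Focus"
print(ordered_levels)  # "Low Focus"  "High Focus"
```

One thing to watch for: any value that does not exactly match an entry in `levels` becomes NA, so it is worth checking the result with `table(..., useNA = "ifany")` after reordering.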
2C. Renaming factor levels
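Use human-readable labels. The assignment is positional, so check the current order with levels(npt_data$focus_group) first; otherwise the labels can end up attached to the wrong groups.
levels(npt_data$focus_group) <- c("Low Focus", "High Focus")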
Why it matters
Legend and axis labels should reflect meaningful names rather than coding artifacts.
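Because assigning to `levels()` matches labels by position, one alternative is to pair each raw code with its label explicitly through the `levels` and `labels` arguments of `factor()`. The sketch below assumes hypothetical raw codes `"lo"` and `"hi"`; the actual codes in your data may differ.

```r
# Hypothetical raw codes; invented for illustration.
raw_codes <- c("lo", "hi", "hi", "lo")

# Pair each raw code with its display label explicitly
focus_group <- factor(
  raw_codes,
  levels = c("lo", "hi"),
  labels = c("Low Focus", "High Focus")
)

# Cross-tabulate old codes against new labels to confirm the mapping
mapping <- table(raw_codes, focus_group)
print(mapping)
```

Each raw code should land in exactly one label column of the cross-tabulation; anything else signals a mismatched mapping.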
3. Summarizing Data for Plotting
Plots differ in the type of input data they require. Bar plots with geom_col() expect summarized data; scatterplots expect raw data. This section clarifies how to generate summaries correctly.
Why this matters
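If you pass raw data into a plot that expects summaries, the figure will be wrong even though no error appears. For example, geom_col() given raw data stacks every row's value, so each bar shows a group sum rather than a group mean.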
3A. Summarizing with aggregate()
mean_rt_by_group <- aggregate(
mean_rt_overall ~ focus_group,
data = npt_data,
FUN = mean
)
Why it matters
Many scientific plots compare group-level averages. aggregate() creates a dataset where each row represents one group.
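The following sketch runs the same summary on a small invented data frame (`toy` is hypothetical) and, if ggplot2 happens to be installed, passes the result to `geom_col()`.

```r
# Hypothetical stand-in for npt_data; values are invented for illustration.
toy <- data.frame(
  focus_group = factor(
    c("Low Focus", "Low Focus", "High Focus", "High Focus"),
    levels = c("Low Focus", "High Focus")
  ),
  mean_rt_overall = c(500, 520, 460, 480)
)

# One row per group, holding that group's mean
mean_rt_by_group <- aggregate(mean_rt_overall ~ focus_group, data = toy, FUN = mean)
print(mean_rt_by_group)

# Optional plot, drawn only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mean_rt_by_group, aes(x = focus_group, y = mean_rt_overall)) +
    geom_col()
  print(p)
}
```

The summarized data frame has two rows (means of 510 and 470), so each bar height is a group mean rather than a stacked total.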
3B. Summarizing with tapply()
tapply(npt_data$mean_rt_overall,
npt_data$focus_group,
mean)
Why it matters
tapply() is useful for quick checks before you build graphs. It reveals whether your summary numbers look reasonable.
3C. Summarizing multiple variables
aggregate(
cbind(parity_mean_rt, magnitude_mean_rt, mean_rt_overall) ~ focus_group,
data = npt_data,
FUN = mean
)
Why it matters
This allows you to prepare several measures at once, useful for multi-bar plots and faceted plots.
4. Reshaping Data for Multi-Variable Plots
Some plots require long-format data, where each row reflects a single measurement rather than a wide layout.
Why this matters
Grouped bar plots, violin plots, boxplots with multiple variables, and line plots often require long-format data. If you try to plot from wide format, ggplot will not know how to map different variables to a single aesthetic.
4A. Reshaping with reshape() (base R)
multi_rt <- npt_data[, c("focus_group",
"parity_mean_rt",
"magnitude_mean_rt")]
multi_rt_long <- reshape(
multi_rt,
varying = list(c("parity_mean_rt", "magnitude_mean_rt")),
v.names = "rt_value",
timevar = "task_type",
times = c("Parity RT", "Magnitude RT"),
direction = "long"
)
Why it matters
This creates a tidy dataset where each row contains:
- a group,
- a task type,
- one reaction time value.
This structure is directly compatible with ggplot.
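To see what the reshaped result looks like, here is the same call applied to a two-row invented data frame (`toy_wide` is hypothetical):

```r
# Hypothetical wide-format data; values are invented for illustration.
toy_wide <- data.frame(
  focus_group       = c("Low Focus", "High Focus"),
  parity_mean_rt    = c(510, 470),
  magnitude_mean_rt = c(540, 495)
)

toy_long <- reshape(
  toy_wide,
  varying   = list(c("parity_mean_rt", "magnitude_mean_rt")),
  v.names   = "rt_value",
  timevar   = "task_type",
  times     = c("Parity RT", "Magnitude RT"),
  direction = "long"
)

# reshape() also adds an `id` column and composite row names;
# reset the row names for readability
rownames(toy_long) <- NULL
print(toy_long)
```

Two wide rows with two measures become four long rows, one per combination of group and task type. The extra `id` column records which original row each value came from and can be ignored when plotting.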
5. Handling Missing Values
Missing values affect both summaries and plots.
Why this matters
Plots may silently drop NA values, making patterns appear stronger or weaker than they actually are.
5A. Check for missing values
colSums(is.na(npt_data))
5B. Remove incomplete rows if justified
cleaned <- na.omit(npt_data)
Why it matters
Removing rows is acceptable if missingness is minimal and not systematic.
5C. Use na.rm = TRUE inside summaries
mean(npt_data$mean_rt_overall, na.rm = TRUE)
Why it matters
Without na.rm = TRUE, functions like mean return NA if any value is missing.
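A small demonstration with invented values (`rt` and `group` are hypothetical vectors). It also shows that `na.rm = TRUE` can be passed through `tapply()`, and that counting the non-missing values per group makes clear how many observations each mean is based on.

```r
# Hypothetical reaction times and group labels; invented for illustration.
rt    <- c(480, 515, NA, 502)
group <- c("A", "A", "B", "B")

mean_default <- mean(rt)                # NA, because one value is missing
mean_clean   <- mean(rt, na.rm = TRUE)  # 499

# Extra arguments to tapply() are forwarded to the summary function
by_group <- tapply(rt, group, mean, na.rm = TRUE)

# Number of non-missing values behind each group mean
n_used <- tapply(!is.na(rt), group, sum)

print(mean_default)
print(mean_clean)
print(by_group)
print(n_used)
```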
6. Troubleshooting Common Plotting Problems
Expanded explanations
Troubleshooting is a normal part of data visualization. The key is to understand why the error appears and what ggplot is expecting from your data.
6A. Plot is blank
Why this happens
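- The aesthetic mapping references a misspelled or nonexistent variable (this often raises an "object not found" error instead of drawing anything).
- No geom layer was added, so ggplot draws only an empty panel.
- The dataset is empty or was filtered incorrectly.
- The plot is never printed: inside loops, functions, and sourced scripts, a ggplot object must be displayed explicitly with print().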
How to fix
names(npt_data)
Double-check spelling inside aes().
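One way to make that spelling check mechanical is to compare the variable names you intend to map against the column names. This is a sketch with an invented data frame; `wanted` is a hypothetical vector containing a deliberate typo.

```r
# Hypothetical stand-in for npt_data; values are invented for illustration.
toy <- data.frame(
  focus_group     = c("Low Focus", "High Focus"),
  mean_rt_overall = c(510, 470)
)

# Variables you plan to use inside aes(); "mean_rt" is a deliberate typo
wanted <- c("focus_group", "mean_rt")

# Names that do not exist as columns
missing_vars <- setdiff(wanted, names(toy))
print(missing_vars)  # "mean_rt"

# Confirm that filtering has not left zero rows
print(nrow(toy))
```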
6B. “Discrete value supplied to continuous scale”
Why this happens
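A variable needed for a continuous axis (here, the y-axis) is a factor or character instead of numeric.
Fix
# Convert via character first: as.numeric() on a factor returns level codes, not the stored values
npt_data$mean_rt_overall <- as.numeric(as.character(npt_data$mean_rt_overall))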
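One caution when converting: if the column is a factor, calling `as.numeric()` on it directly returns the internal level codes rather than the values. The sketch below uses invented values (`rt_factor` and `rt_char` are hypothetical) to show the difference and to flag entries that cannot be converted.

```r
# Hypothetical reaction times stored as a factor; invented for illustration.
rt_factor <- factor(c("512.3", "478.9", "530.1"))

codes  <- as.numeric(rt_factor)                # 2 1 3 (level codes, not values)
values <- as.numeric(as.character(rt_factor))  # 512.3 478.9 530.1

# Non-numeric text becomes NA (with a warning); locate it before plotting
rt_char   <- c("512.3", "n/a")
converted <- suppressWarnings(as.numeric(rt_char))
bad_rows  <- which(is.na(converted))

print(codes)
print(values)
print(bad_rows)  # 2
```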
6C. Bars or boxes appear in alphabetical order
Why this happens
Factors default to alphabetical ordering.
Fix
npt_data$focus_group <- factor(
npt_data$focus_group,
levels = c("Low Focus", "High Focus")
)
6D. Jitter points appear off-center
Why this happens
You placed width inside aes(), which makes ggplot interpret it as a variable mapping.
Fix
geom_jitter(width = 0.2)
6E. geom_smooth() errors out
Why this happens
geom_smooth(method = "lm") requires both variables to be numeric.
Fix
class(npt_data$tef10_score)
class(npt_data$mean_rt_overall)
Convert if needed.
6F. Labels are overlapping or unreadable
Why this happens
Labels overlap when spacing is tight, labels are long, or the figure is small.
Fix options
theme(axis.text.x = element_text(angle = 45, hjust = 1))
coord_flip()
6G. Legend colors look wrong or inconsistent
Why this happens
Different plots recreate color scales independently unless you set them manually.
Fix
my_colors <- c("High Focus" = "steelblue",
"Low Focus" = "gray40")
+ scale_fill_manual(values = my_colors)
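A sketch of how the named vector might be used in a complete plot, assuming an invented summary data frame (`toy` is hypothetical). Because the colors are matched by name, they stay attached to the same group regardless of the factor's level order. The plot is drawn only if ggplot2 is installed.

```r
my_colors <- c("High Focus" = "steelblue",
               "Low Focus"  = "gray40")

# Hypothetical summary data; values are invented for illustration.
toy <- data.frame(
  focus_group = factor(c("Low Focus", "High Focus"),
                       levels = c("Low Focus", "High Focus")),
  mean_rt_overall = c(510, 470)
)

# Every factor level needs a matching name in the color vector
unmatched <- setdiff(levels(toy$focus_group), names(my_colors))
print(unmatched)  # character(0) when all levels have a color

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = focus_group, y = mean_rt_overall, fill = focus_group)) +
    geom_col() +
    scale_fill_manual(values = my_colors)
  print(p)
}
```

Defining `my_colors` once near the top of a document and reusing it in every plot keeps the group colors consistent across figures.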
7. Quarto-Specific Plotting Guidance
How to integrate visualizations smoothly into a reproducible report
7A. Use section dividers to break up content
---
7B. Add inline statistics below the plot
The mean RT was `r round(mean_rt, 2)` ms.
Why this matters
Inline reporting keeps code and interpretation connected and reduces errors.
7C. Control figure size and captions with chunk options
#| fig-width: 6
#| fig-height: 4
#| fig-cap: "Mean RT by group."
7D. Set a consistent theme for all figures
theme_set(theme_classic())
This ensures visual consistency across the document.
8. Summary
Preparing data carefully prevents most plotting errors.
The most important habits include:
- Inspecting your dataset before plotting
- Converting and ordering factors deliberately
- Summarizing data appropriately for each plot type
- Reshaping data when needed
- Handling missing values thoughtfully
- Applying structured troubleshooting steps
- Using Quarto features to keep plots reproducible
Good data preparation leads to clearer figures and smoother analysis.