Data Preparation and Troubleshooting for Plotting

This guide explains how to prepare data for plotting and how to diagnose the most common issues that arise when making figures in R using ggplot2. Each section explains what the step accomplishes and why it matters for producing clean, accurate visualizations.

1. Inspecting and Understanding Your Data

Before making any plot, it is important to understand what is in your dataset and how R is interpreting each column.

Why this matters

Most plotting problems arise because ggplot is receiving the wrong kind of data (for example, a character when a factor is needed, or a factor when a numeric is expected). Inspecting your dataset early saves time and prevents confusing errors.

Key tools

Run these commands every time you load a new dataset:

str(npt_data)
head(npt_data)
summary(npt_data)
names(npt_data)

What to look for

Are numeric variables truly numeric?
Reaction times, accuracy, and scores must be numeric for ggplot to place them on continuous axes.
Are grouping variables factors?
Group comparisons (for example, bar plots and boxplots) behave most predictably when categorical variables are stored as factors, because factors let you control the order and labels of the groups.
Are labels consistent?
Typos create new groups accidentally.
Are there missing values?
ggplot2 drops NA values from most geoms, usually with only a warning, so data can disappear from a plot without an error. Some geoms and summaries may fail outright.
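The checks above can be sketched on a small made-up data frame. The values below are hypothetical stand-ins for npt_data, chosen only so that each problem (numbers stored as text, an inconsistent label, a missing value) actually appears:

```r
# Hypothetical stand-in for npt_data; all values are invented for illustration.
npt_data <- data.frame(
  focus_group      = c("High Focus", "Low Focus", "Low focus", "High Focus"),
  mean_rt_overall  = c("512.3", "640.1", "598.7", NA),  # numbers stored as text
  stringsAsFactors = FALSE
)

# 1. Which class does R assign to each column?
col_classes <- sapply(npt_data, class)
print(col_classes)

# 2. Are labels consistent? "Low focus" vs "Low Focus" shows up as an extra group.
group_labels <- unique(npt_data$focus_group)
print(group_labels)

# 3. How many missing values does each column contain?
na_counts <- colSums(is.na(npt_data))
print(na_counts)
```

Here the reaction-time column reports as character, three group labels appear where two were intended, and one value is missing, which are exactly the conditions to fix before plotting.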

2. Converting, Reordering, and Cleaning Variables

Many plots require specific data types to function correctly. This section explains how to control and clean variable types so your plots behave as expected.

2A. Converting to factor

Use this when categorical variables appear as characters.

npt_data$focus_group <- factor(npt_data$focus_group)

Why it matters

ggplot2 treats character variables as discrete, but it always orders them alphabetically. Factors let you control both the order and the labels of the categories.

2B. Reordering factor levels

This determines the order in which categories appear in plots.

npt_data$focus_group <- factor(
  npt_data$focus_group,
  levels = c("Low Focus", "High Focus")
)

Why it matters

If R uses alphabetical ordering, “High Focus” may appear before “Low Focus,” which is not intuitive for interpretation.
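As a quick illustration of the difference, the sketch below compares the default level order with an explicit one. The vector is a hypothetical stand-in for npt_data$focus_group:

```r
# Hypothetical group labels standing in for npt_data$focus_group.
focus_group <- c("Low Focus", "High Focus", "High Focus", "Low Focus")

# Default: factor() sorts the levels alphabetically.
default_order <- factor(focus_group)
print(levels(default_order))

# Explicit: the levels follow the order given in `levels`.
custom_order <- factor(focus_group, levels = c("Low Focus", "High Focus"))
print(levels(custom_order))
```

Only the level order changes; the values stored in each row stay the same.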

2C. Renaming factor levels

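Use human-readable labels. The new names are assigned to the existing levels by position, so run levels(npt_data$focus_group) first and list the new names in that same order.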

levels(npt_data$focus_group) <- c("Low Focus", "High Focus")

Why it matters

Legend and axis labels should reflect meaningful names rather than coding artifacts.
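One way to reduce the risk of swapping labels is to pair each existing code with its display label explicitly. This is a minimal sketch; the raw codes "lo" and "hi" are assumptions made for illustration and may not match how npt_data is actually coded:

```r
# Hypothetical raw codes; the real coding in npt_data may differ.
raw_group <- c("hi", "lo", "lo", "hi")

# Pair each existing code with its display label, so the mapping
# does not depend on the alphabetical order of the levels.
focus_group <- factor(
  raw_group,
  levels = c("lo", "hi"),
  labels = c("Low Focus", "High Focus")
)

# Cross-tabulate old codes against new labels to confirm the mapping.
print(table(raw_group, focus_group))
```

The cross-tabulation should show each raw code landing in exactly one labeled column.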

3. Summarizing Data for Plotting

Plots differ in the type of input data they require. Bar plots with geom_col() expect summarized data; scatterplots expect raw data. This section clarifies how to generate summaries correctly.

Why this matters

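If you pass raw data into a plot that expects summaries, the figure will be wrong even though no error appears. For example, geom_col() stacks the values of every row in a group, so bar heights show sums rather than means.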

3A. Summarizing with `aggregate()`

mean_rt_by_group <- aggregate(
  mean_rt_overall ~ focus_group,
  data = npt_data,
  FUN = mean
)

Why it matters

Many scientific plots compare group-level averages. aggregate() creates a dataset where each row represents one group.
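The sketch below shows how the summarized table might feed a bar plot. The data frame is a hypothetical stand-in for npt_data with invented values, and the axis labels are illustrative choices:

```r
library(ggplot2)

# Hypothetical stand-in for npt_data; values are invented for illustration.
npt_data <- data.frame(
  focus_group = factor(
    c("Low Focus", "Low Focus", "High Focus", "High Focus"),
    levels = c("Low Focus", "High Focus")
  ),
  mean_rt_overall = c(600, 640, 500, 520)
)

# One row per group, holding that group's mean.
mean_rt_by_group <- aggregate(
  mean_rt_overall ~ focus_group,
  data = npt_data,
  FUN  = mean
)
print(mean_rt_by_group)

# geom_col() takes bar heights directly from the summarized values.
p <- ggplot(mean_rt_by_group, aes(x = focus_group, y = mean_rt_overall)) +
  geom_col() +
  labs(x = "Focus group", y = "Mean RT (ms)")
print(p)
```

Printing the summary table before plotting makes it easy to confirm that bar heights match the numbers you intend to report.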

3B. Summarizing with `tapply()`

tapply(npt_data$mean_rt_overall,
       npt_data$focus_group,
       mean)

Why it matters

tapply() is useful for quick checks before you build graphs. It reveals whether your summary numbers look reasonable.

3C. Summarizing multiple variables

aggregate(
  cbind(parity_mean_rt, magnitude_mean_rt, mean_rt_overall) ~ focus_group,
  data = npt_data,
  FUN  = mean
)

Why it matters

This allows you to prepare several measures at once, useful for multi-bar plots and faceted plots.

4. Reshaping Data for Multi-Variable Plots

Some plots require long-format data, in which each row holds a single measurement, rather than wide-format data, in which each measure has its own column.

Why this matters

Grouped bar plots, violin plots, boxplots with multiple variables, and line plots often require long-format data. If you try to plot from wide format, ggplot will not know how to map different variables to a single aesthetic.

4A. Reshaping with `reshape()` (base R)

multi_rt <- npt_data[, c("focus_group",
                         "parity_mean_rt",
                         "magnitude_mean_rt")]

multi_rt_long <- reshape(
  multi_rt,
  varying  = list(c("parity_mean_rt", "magnitude_mean_rt")),
  v.names  = "rt_value",
  timevar  = "task_type",
  times    = c("Parity RT", "Magnitude RT"),
  direction = "long"
)

Why it matters

This creates a tidy dataset where each row contains:

a group,
a task type,
one reaction time value.

This structure is directly compatible with ggplot.
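To see what the reshaped table looks like, the same call can be run on a tiny made-up data frame. The three rows below are hypothetical stand-ins for npt_data:

```r
# Hypothetical stand-in for npt_data; values are invented for illustration.
npt_data <- data.frame(
  focus_group       = c("Low Focus", "High Focus", "Low Focus"),
  parity_mean_rt    = c(610, 505, 630),
  magnitude_mean_rt = c(590, 495, 615)
)

multi_rt <- npt_data[, c("focus_group", "parity_mean_rt", "magnitude_mean_rt")]

multi_rt_long <- reshape(
  multi_rt,
  varying   = list(c("parity_mean_rt", "magnitude_mean_rt")),
  v.names   = "rt_value",
  timevar   = "task_type",
  times     = c("Parity RT", "Magnitude RT"),
  direction = "long"
)

# Three rows x two tasks = six rows. reshape() also adds an `id` column
# that records which original row each measurement came from.
print(multi_rt_long)
```

In the long table, task_type can be mapped to an aesthetic such as x or fill, and rt_value to y.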

5. Handling Missing Values

Missing values affect both summaries and plots.

Why this matters

ggplot2 drops NA values with only a warning, so a plot may show fewer observations than you expect, making patterns appear stronger or weaker than they actually are.

5A. Check for missing values

colSums(is.na(npt_data))

5B. Remove incomplete rows if justified

cleaned <- na.omit(npt_data)

Why it matters

Removing rows is acceptable if missingness is minimal and not systematic.

5C. Use `na.rm = TRUE` inside summaries

mean(npt_data$mean_rt_overall, na.rm = TRUE)

Why it matters

Without na.rm = TRUE, functions like mean return NA if any value is missing.
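The three steps in this section can be tried together on a small example. The data frame is a hypothetical stand-in for npt_data with one reaction time deliberately set to NA:

```r
# Hypothetical stand-in for npt_data with one missing reaction time.
npt_data <- data.frame(
  focus_group     = c("Low Focus", "Low Focus", "High Focus", "High Focus"),
  mean_rt_overall = c(600, NA, 500, 520)
)

# 5A: count missing values per column.
print(colSums(is.na(npt_data)))

# 5B: na.omit() drops every row that has an NA in any column.
cleaned <- na.omit(npt_data)
print(nrow(cleaned))

# 5C: without na.rm, a single NA turns the whole group mean into NA.
means_default <- tapply(npt_data$mean_rt_overall, npt_data$focus_group, mean)
print(means_default)

# Extra arguments to tapply() are passed on to mean().
means_na_rm <- tapply(npt_data$mean_rt_overall, npt_data$focus_group, mean,
                      na.rm = TRUE)
print(means_na_rm)
```

Comparing the two sets of means shows which groups are affected by missing data before any plot is drawn.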

6. Troubleshooting Common Plotting Problems

Expanded explanations

Troubleshooting is a normal part of data visualization. The key is to understand why the error appears and what ggplot is expecting from your data.

6A. Plot is blank

Why this happens

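An aesthetic mapping references a misspelled or missing variable (this often raises an “object not found” error instead of a blank plot).
The dataset is empty or filtered incorrectly.
No geom layer was added, or the plot is created inside a loop or function and never passed to print().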

How to fix

names(npt_data)

Double-check spelling inside aes().
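Two other causes are quick to rule out: a plot object with no geom layer, and a plot created inside a loop without print(). The sketch below illustrates both, using a hypothetical stand-in for npt_data:

```r
library(ggplot2)

# Hypothetical stand-in for npt_data; values are invented for illustration.
npt_data <- data.frame(
  tef10_score     = c(12, 18, 25, 31),
  mean_rt_overall = c(640, 600, 540, 510)
)

# No geom layer: ggplot2 draws the axes and panel but no data.
p_blank <- ggplot(npt_data, aes(x = tef10_score, y = mean_rt_overall))

# Adding a geom gives the plot something to draw.
p_points <- p_blank + geom_point()

# Inside loops and functions, plots are not printed automatically,
# so print() has to be called explicitly.
for (p in list(p_blank, p_points)) {
  print(p)
}
```

The first printed figure shows only an empty panel; the second shows the points.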

6B. “Discrete value supplied to continuous scale”

Why this happens

A variable mapped to a continuous scale (here, the y-axis) is stored as a factor or character instead of numeric.

Fix

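# Convert through character first: as.numeric() on a factor returns level codes, not values
npt_data$mean_rt_overall <- as.numeric(as.character(npt_data$mean_rt_overall))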
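One caveat applies when the column is a factor rather than a character vector: calling as.numeric() on it directly returns the internal level codes. The vector below is a hypothetical example of reaction times that were read in as a factor:

```r
# Hypothetical reaction times that were read in as a factor.
rt_factor <- factor(c("512.3", "640.1", "598.7"))

# Direct conversion returns the internal level codes, not the values.
rt_codes <- as.numeric(rt_factor)
print(rt_codes)

# Converting through character recovers the original numbers.
rt_numeric <- as.numeric(as.character(rt_factor))
print(rt_numeric)
```

After converting, it is worth checking summary() of the column to confirm that the values fall in a plausible range.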

6C. Bars or boxes appear in alphabetical order

Why this happens

Factors default to alphabetical ordering.

Fix

npt_data$focus_group <- factor(
  npt_data$focus_group,
  levels = c("Low Focus", "High Focus")
)

6D. Jitter points appear off-center

Why this happens

You placed width inside aes(), which makes ggplot interpret it as a variable mapping.

Fix

geom_jitter(width = 0.2)
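In context, the fix might look like the sketch below. The data frame is a hypothetical stand-in for npt_data, and the height = 0 and set.seed() settings are additions made here to keep the reaction times unaltered and the figure reproducible:

```r
library(ggplot2)

# Hypothetical stand-in for npt_data; values are invented for illustration.
npt_data <- data.frame(
  focus_group = factor(
    rep(c("Low Focus", "High Focus"), each = 3),
    levels = c("Low Focus", "High Focus")
  ),
  mean_rt_overall = c(600, 640, 620, 500, 520, 510)
)

set.seed(1)  # jitter is random; fixing the seed makes the figure reproducible

# width is a fixed setting, so it goes outside aes().
# height = 0 keeps the y values (reaction times) exactly as measured.
p <- ggplot(npt_data, aes(x = focus_group, y = mean_rt_overall)) +
  geom_jitter(width = 0.2, height = 0)
print(p)
```

With width set as a constant, each point stays within 0.2 units of its group position on the x-axis.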

6E. geom_smooth() errors out

Why this happens

geom_smooth(method = "lm") requires both variables to be numeric.

Fix

class(npt_data$tef10_score)
class(npt_data$mean_rt_overall)

Convert if needed.
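A minimal sketch of the check-and-convert step, assuming tef10_score was read in as text. The values are invented, and formula = y ~ x is written out only to make the fitted model explicit:

```r
library(ggplot2)

# Hypothetical stand-in for npt_data; tef10_score is stored as text on purpose.
npt_data <- data.frame(
  tef10_score      = c("12", "18", "25", "31", "22"),
  mean_rt_overall  = c(640, 600, 540, 510, 560),
  stringsAsFactors = FALSE
)

print(class(npt_data$tef10_score))      # character
print(class(npt_data$mean_rt_overall))  # numeric

# Convert the text column so both axes are continuous.
npt_data$tef10_score <- as.numeric(npt_data$tef10_score)

p <- ggplot(npt_data, aes(x = tef10_score, y = mean_rt_overall)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
print(p)
```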

6F. Labels are overlapping or unreadable

Why this happens

Labels overlap when spacing is tight, labels are long, or the figure is small.

Fix options

# Option 1: rotate the x-axis labels
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Option 2: flip the axes so long labels read horizontally
coord_flip()

6G. Legend colors look wrong or inconsistent

Why this happens

Different plots recreate color scales independently unless you set them manually.

Fix

my_colors <- c("High Focus" = "steelblue",
               "Low Focus"  = "gray40")

+ scale_fill_manual(values = my_colors)
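Put together, a named color vector can be reused across plots as in the sketch below. The summary table is a hypothetical stand-in with invented values:

```r
library(ggplot2)

# Hypothetical group means; values are invented for illustration.
summary_df <- data.frame(
  focus_group = factor(
    c("Low Focus", "High Focus"),
    levels = c("Low Focus", "High Focus")
  ),
  mean_rt_overall = c(620, 510)
)

# Names tie each color to a level, regardless of level order.
my_colors <- c("High Focus" = "steelblue",
               "Low Focus"  = "gray40")

p <- ggplot(summary_df,
            aes(x = focus_group, y = mean_rt_overall, fill = focus_group)) +
  geom_col() +
  scale_fill_manual(values = my_colors)
print(p)
```

Because the colors are matched by name, "Low Focus" stays gray and "High Focus" stays blue in every plot that adds the same scale.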

7. Quarto-Specific Plotting Guidance

How to integrate visualizations smoothly into a reproducible report

7A. Use section dividers to break up content

---

7B. Add inline statistics below the plot

The mean RT was `r round(mean_rt, 2)` ms.

Why this matters

Inline reporting keeps code and interpretation connected and reduces errors.

7C. Control figure size and captions with chunk options

#| fig-width: 6
#| fig-height: 4
#| fig-cap: "Mean RT by group."

7D. Set a consistent theme for all figures

theme_set(theme_classic())

This ensures visual consistency across the document.

8. Summary

Preparing data carefully prevents most plotting errors.
The most important habits include:

Inspecting your dataset before plotting
Converting and ordering factors deliberately
Summarizing data appropriately for each plot type
Reshaping data when needed
Handling missing values thoughtfully
Applying structured troubleshooting steps
Using Quarto features to keep plots reproducible

Good data preparation leads to clearer figures and smoother analysis.
