PSY 1903 Programming for Psychologists

Data Preparation and Troubleshooting for Plotting

This guide explains how to prepare data for plotting and how to diagnose the most common issues that arise when making figures in R using ggplot2. Each section explains what the step accomplishes and why it matters for producing clean, accurate visualizations.


1. Inspecting and Understanding Your Data

Before making any plot, it is important to understand what is in your dataset and how R is interpreting each column.

Why this matters

Most plotting problems arise because ggplot is receiving the wrong kind of data (for example, a character when a factor is needed, or a factor when a numeric is expected). Inspecting your dataset early saves time and prevents confusing errors.

Key tools

Run these commands every time you load a new dataset:

str(npt_data)
head(npt_data)
summary(npt_data)
names(npt_data)

What to look for

  • Are numeric variables truly numeric?
    Reaction times, accuracy, and scores must be numeric for ggplot to place them on continuous axes.
  • Are grouping variables factors?
    Group comparisons (for example, bar plots and boxplots) also work with character columns, but storing grouping variables as factors lets you control the order and labels of the groups.
  • Are labels consistent?
    Typos create new groups accidentally.
  • Are there missing values?
    ggplot2 drops rows with NA values from a plot, usually with only a warning, and summary functions such as mean() return NA unless told otherwise.
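
The checks in this list can be sketched on a small made-up data frame. The object toy and all of its values are hypothetical stand-ins for npt_data, chosen so that each problem shows up once.

```r
# Hypothetical toy data standing in for npt_data (values are made up).
toy <- data.frame(
  focus_group     = c("Low Focus", "High Focus", "high focus", "Low Focus"),
  mean_rt_overall = c("512", "498", "530", NA),  # numbers stored as text
  stringsAsFactors = FALSE
)

# One class per column: mean_rt_overall is "character", not "numeric".
col_classes <- sapply(toy, class)

# Distinct labels: "High Focus" and "high focus" would become two groups.
group_labels <- sort(unique(toy$focus_group))

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))

print(col_classes)
print(group_labels)
print(na_counts)
```

If a column's class or label set is not what you expect, fix it before plotting rather than inside the plotting code.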

2. Converting, Reordering, and Cleaning Variables

Many plots require specific data types to function correctly. This section explains how to control and clean variable types so your plots behave as expected.

2A. Converting to factor

Use this when categorical variables appear as characters.

npt_data$focus_group <- factor(npt_data$focus_group)

Why it matters

ggplot treats character variables as discrete values but always plots them in alphabetical order. Converting to a factor lets you control the order and labels of the categories.


2B. Reordering factor levels

This determines the order in which categories appear in plots.

npt_data$focus_group <- factor(
  npt_data$focus_group,
  levels = c("Low Focus", "High Focus")
)

Why it matters

If R uses alphabetical ordering, “High Focus” may appear before “Low Focus,” which is not intuitive for interpretation.
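
A short sketch of the difference, using a made-up vector of group labels (the vector groups is hypothetical):

```r
# Hypothetical group labels (made up for illustration).
groups <- c("Low Focus", "High Focus", "Low Focus", "High Focus")

# Default: levels are sorted alphabetically.
default_order <- levels(factor(groups))

# Explicit: levels appear in the order you list them.
explicit_order <- levels(factor(groups, levels = c("Low Focus", "High Focus")))

print(default_order)
print(explicit_order)
```

The order of the levels is the order ggplot uses for the axis and the legend.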


2C. Renaming factor levels

Use human-readable labels.

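# Renaming is positional: run levels(npt_data$focus_group) first and list the new names in that same order.
levels(npt_data$focus_group) <- c("Low Focus", "High Focus")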

Why it matters

Legend and axis labels should reflect meaningful names rather than coding artifacts.
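
Assigning to levels() renames by position, so it is worth checking the existing level order first. The sketch below assumes hypothetical raw codes "lo" and "hi" (not taken from the course data) and shows how the labels can end up swapped, along with a safer alternative that pairs each old value with its new label.

```r
# Hypothetical raw codes, as they might appear in an exported file.
raw <- c("lo", "hi", "hi", "lo")

# Risky: the default levels are alphabetical ("hi", "lo"),
# so this assignment labels "hi" as "Low Focus".
risky <- factor(raw)
levels(risky) <- c("Low Focus", "High Focus")

# Safer: pair each existing value with its label explicitly.
safe <- factor(raw,
               levels = c("lo", "hi"),
               labels = c("Low Focus", "High Focus"))

print(as.character(risky))
print(as.character(safe))
```

The levels and labels arguments of factor() set the order and the display names in one step.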


3. Summarizing Data for Plotting

Plots differ in the type of input data they require. Bar plots with geom_col() expect summarized data; scatterplots expect raw data. This section clarifies how to generate summaries correctly.

Why this matters

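If you pass raw data to geom_col(), it stacks every row within a group, so each bar shows the sum of the values rather than the mean. Conversely, a scatterplot built from summarized data shows only one point per group.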
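
A minimal sketch of the difference, assuming ggplot2 is installed. The data frame toy and its values are made up. The plots are stored rather than printed, and the per-group sums are computed separately to show what bar heights the raw data would produce.

```r
library(ggplot2)

# Hypothetical toy data (values are made up).
toy <- data.frame(
  focus_group = factor(c("Low Focus", "Low Focus", "High Focus", "High Focus"),
                       levels = c("Low Focus", "High Focus")),
  mean_rt_overall = c(520, 540, 480, 500)
)

# One row per group: the input geom_col() expects.
group_means <- aggregate(mean_rt_overall ~ focus_group, data = toy, FUN = mean)

# Bar heights are the group means (530 and 490).
p_means <- ggplot(group_means, aes(x = focus_group, y = mean_rt_overall)) +
  geom_col()

# Raw rows: geom_col() stacks them, so bar heights equal these sums.
p_raw <- ggplot(toy, aes(x = focus_group, y = mean_rt_overall)) +
  geom_col()
group_sums <- tapply(toy$mean_rt_overall, toy$focus_group, sum)

print(group_means)
print(group_sums)
```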


3A. Summarizing with aggregate()

mean_rt_by_group <- aggregate(
  mean_rt_overall ~ focus_group,
  data = npt_data,
  FUN = mean
)

Why it matters

Many scientific plots compare group-level averages. aggregate() creates a dataset where each row represents one group.


3B. Summarizing with tapply()

tapply(npt_data$mean_rt_overall,
       npt_data$focus_group,
       mean)

Why it matters

tapply() is useful for quick checks before you build graphs. It reveals whether your summary numbers look reasonable.


3C. Summarizing multiple variables

aggregate(
  cbind(parity_mean_rt, magnitude_mean_rt, mean_rt_overall) ~ focus_group,
  data = npt_data,
  FUN  = mean
)

Why it matters

This allows you to prepare several measures at once, useful for multi-bar plots and faceted plots.
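
To see the shape of the result, here is the same call on a made-up data frame (toy and its values are hypothetical). The output has one row per group and one column per summarized measure.

```r
# Hypothetical toy data (values are made up).
toy <- data.frame(
  focus_group       = c("Low Focus", "Low Focus", "High Focus", "High Focus"),
  parity_mean_rt    = c(500, 520, 460, 480),
  magnitude_mean_rt = c(550, 570, 500, 510),
  mean_rt_overall   = c(525, 545, 480, 495)
)

# cbind() on the left-hand side summarizes several columns at once.
multi_means <- aggregate(
  cbind(parity_mean_rt, magnitude_mean_rt, mean_rt_overall) ~ focus_group,
  data = toy,
  FUN  = mean
)

print(multi_means)
```

Note that with cbind(), the formula interface drops a row from every measure if any one of the listed columns is NA in that row.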


4. Reshaping Data for Multi-Variable Plots

Some plots require long-format data, where each row holds a single measurement, rather than wide-format data, where each measure has its own column.

Why this matters

Grouped bar plots, violin plots, boxplots with multiple variables, and line plots often require long-format data. If you try to plot from wide format, ggplot will not know how to map different variables to a single aesthetic.


4A. Reshaping with reshape() (base R)

multi_rt <- npt_data[, c("focus_group",
                         "parity_mean_rt",
                         "magnitude_mean_rt")]

multi_rt_long <- reshape(
  multi_rt,
  varying  = list(c("parity_mean_rt", "magnitude_mean_rt")),
  v.names  = "rt_value",
  timevar  = "task_type",
  times    = c("Parity RT", "Magnitude RT"),
  direction = "long"
)

Why it matters

This creates a tidy dataset where each row contains:

  • a group,
  • a task type,
  • one reaction time value.

This structure is directly compatible with ggplot.
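
A sketch of the result on two made-up rows (the values are hypothetical). Besides the three columns listed above, reshape() also adds an id column that identifies the original row.

```r
# Hypothetical wide data: one row per group (values are made up).
multi_rt <- data.frame(
  focus_group       = c("Low Focus", "High Focus"),
  parity_mean_rt    = c(510, 470),
  magnitude_mean_rt = c(560, 505)
)

multi_rt_long <- reshape(
  multi_rt,
  varying   = list(c("parity_mean_rt", "magnitude_mean_rt")),
  v.names   = "rt_value",
  timevar   = "task_type",
  times     = c("Parity RT", "Magnitude RT"),
  direction = "long"
)

# Drop the compound row names that reshape() creates; this is optional.
rownames(multi_rt_long) <- NULL

print(multi_rt_long)
```

With this layout, task_type can be mapped to fill or color and rt_value to the y-axis.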


5. Handling Missing Values

Missing values affect both summaries and plots.

Why this matters

ggplot2 drops rows with missing values and reports this only as a warning, so patterns can appear stronger or weaker than they actually are.


5A. Check for missing values

colSums(is.na(npt_data))

5B. Remove incomplete rows if justified

cleaned <- na.omit(npt_data)

Why it matters

Removing rows is acceptable if missingness is minimal and not systematic.
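
One thing to keep in mind is that na.omit() drops a row if any column is missing, including columns the plot does not use. A possible alternative, sketched here on made-up data (toy and the choice of plot_cols are hypothetical), is to check only the columns the plot needs.

```r
# Hypothetical toy data (values are made up).
toy <- data.frame(
  focus_group     = c("Low Focus", "High Focus", "High Focus"),
  mean_rt_overall = c(520, 480, NA),
  tef10_score     = c(NA, 31, 28)
)

# Drops every row with an NA in any column: only one row survives.
all_complete <- na.omit(toy)

# Keeps rows that are complete for the plotted columns: two rows survive.
plot_cols  <- c("focus_group", "mean_rt_overall")
plot_ready <- toy[complete.cases(toy[, plot_cols]), ]

print(nrow(all_complete))
print(nrow(plot_ready))
```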


5C. Use na.rm = TRUE inside summaries

mean(npt_data$mean_rt_overall, na.rm = TRUE)

Why it matters

Without na.rm = TRUE, functions such as mean() and sd() return NA if any value is missing.
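
The same argument can be passed through grouped summaries. The sketch below uses made-up data (toy is hypothetical) to compare the behavior of mean(), tapply(), and the formula interface of aggregate(), which removes NA rows by default.

```r
# Hypothetical toy data with one missing value (values are made up).
toy <- data.frame(
  focus_group     = c("Low Focus", "Low Focus", "High Focus", "High Focus"),
  mean_rt_overall = c(520, NA, 480, 500)
)

overall_raw   <- mean(toy$mean_rt_overall)                # NA
overall_clean <- mean(toy$mean_rt_overall, na.rm = TRUE)  # 500

# Extra arguments to tapply() are passed on to the summary function.
by_group_raw   <- tapply(toy$mean_rt_overall, toy$focus_group, mean)
by_group_clean <- tapply(toy$mean_rt_overall, toy$focus_group, mean, na.rm = TRUE)

# The formula interface of aggregate() omits NA rows before summarizing.
by_group_agg <- aggregate(mean_rt_overall ~ focus_group, data = toy, FUN = mean)

print(by_group_raw)
print(by_group_clean)
print(by_group_agg)
```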


6. Troubleshooting Common Plotting Problems

Expanded explanations

Troubleshooting is a normal part of data visualization. The key is to understand why the error appears and what ggplot is expecting from your data.


6A. Plot is blank

Why this happens

  • Aesthetic mapping references a variable that does not exist.
  • The dataset is empty or filtered incorrectly.
  • The plot is never printed: inside a for loop or a function, a ggplot object is only drawn when you wrap it in print().

How to fix

names(npt_data)

Double-check spelling inside aes().
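
These checks can also be done in code. A minimal sketch on made-up data follows; toy, the misspelled name, and the filter condition are all hypothetical examples of the mistakes described above.

```r
# Hypothetical toy data (values are made up).
toy <- data.frame(
  focus_group     = c("Low Focus", "High Focus"),
  mean_rt_overall = c(530, 490)
)

# Names you intend to use inside aes(); the second one has a deliberate typo.
needed       <- c("focus_group", "mean_rt_ovrall")
missing_vars <- setdiff(needed, names(toy))

# A filter with the wrong capitalization leaves nothing to plot.
filtered <- toy[toy$focus_group == "low focus", ]

print(missing_vars)
print(nrow(filtered))
```

An empty missing_vars and a nonzero row count rule out the first two causes.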


6B. “Discrete value supplied to continuous scale”

Why this happens

A variable needed for the y-axis is a factor or character instead of numeric.

Fix

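# Convert through as.character() so a factor is converted by its labels, not its internal level codes.
npt_data$mean_rt_overall <- as.numeric(as.character(npt_data$mean_rt_overall))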
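
When the column is a factor, as.numeric() on its own returns the internal level codes rather than the values shown on screen. A short sketch with made-up reaction times (rt_factor is hypothetical):

```r
# Hypothetical reaction times that were read in as a factor.
rt_factor <- factor(c("512", "498", "530"))

# Level codes: positions in the sorted levels, not reaction times.
codes <- as.numeric(rt_factor)

# Going through as.character() recovers the actual values.
values <- as.numeric(as.character(rt_factor))

print(codes)
print(values)
```

Going through as.character() is also safe for character columns, so it works for either case.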

6C. Bars or boxes appear in alphabetical order

Why this happens

Factors default to alphabetical ordering.

Fix

npt_data$focus_group <- factor(
  npt_data$focus_group,
  levels = c("Low Focus", "High Focus")
)

6D. Jitter points appear off-center

Why this happens

You placed width inside aes(), which makes ggplot interpret it as a variable mapping.

Fix

geom_jitter(width = 0.2)
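
In context, the fix might look like the sketch below, assuming ggplot2 is installed. The data frame toy is made up, and height = 0 is an added choice (not part of the fix above) that keeps the reaction times at their true vertical positions.

```r
library(ggplot2)

# Hypothetical toy data (values are made up).
toy <- data.frame(
  focus_group = factor(rep(c("Low Focus", "High Focus"), each = 3),
                       levels = c("Low Focus", "High Focus")),
  mean_rt_overall = c(520, 540, 530, 480, 500, 490)
)

# width and height are fixed settings, so they go outside aes().
p <- ggplot(toy, aes(x = focus_group, y = mean_rt_overall)) +
  geom_jitter(width = 0.2, height = 0)

# Number of points the layer will draw.
n_points <- nrow(layer_data(p))
print(n_points)
```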

6E. geom_smooth() errors out

Why this happens

geom_smooth(method = "lm") requires both variables to be numeric.

Fix

class(npt_data$tef10_score)
class(npt_data$mean_rt_overall)

Convert if needed.
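
A sketch of the check and the conversion on made-up data (toy is hypothetical, with tef10_score deliberately stored as text). The lm() call at the end fits the same kind of straight-line model that geom_smooth(method = "lm") draws, so it is a quick way to confirm the columns are usable.

```r
# Hypothetical toy data; tef10_score was read in as text (values are made up).
toy <- data.frame(
  tef10_score     = c("28", "31", "35", "40"),
  mean_rt_overall = c(540, 515, 498, 470)
)

# Check both variables at once.
is_num <- sapply(toy[, c("tef10_score", "mean_rt_overall")], is.numeric)
print(is_num)

# Convert the offending column; as.character() guards against factors.
toy$tef10_score <- as.numeric(as.character(toy$tef10_score))

# A linear fit now runs without error.
fit <- lm(mean_rt_overall ~ tef10_score, data = toy)
print(coef(fit))
```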


6F. Labels are overlapping or unreadable

Why this happens

Labels overlap when categories are tightly spaced, the labels are long, or the figure is small.

Fix options

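# Option 1: rotate the x-axis labels (add to the plot with +)
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Option 2: swap the axes so long labels run along the y-axis (add to the plot with +)
coord_flip()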

6G. Legend colors look wrong or inconsistent

Why this happens

Different plots recreate color scales independently unless you set them manually.

Fix

my_colors <- c("High Focus" = "steelblue",
               "Low Focus"  = "gray40")

# Add this scale with + to every plot that fills by focus_group
scale_fill_manual(values = my_colors)
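
A complete sketch of a plot that uses the named color vector, assuming ggplot2 is installed. The summary values in group_means are made up. Because the colors are matched to groups by name, each group keeps its color regardless of factor order.

```r
library(ggplot2)

my_colors <- c("High Focus" = "steelblue",
               "Low Focus"  = "gray40")

# Hypothetical group means (values are made up).
group_means <- data.frame(
  focus_group = factor(c("Low Focus", "High Focus"),
                       levels = c("Low Focus", "High Focus")),
  mean_rt_overall = c(530, 490)
)

p <- ggplot(group_means,
            aes(x = focus_group, y = mean_rt_overall, fill = focus_group)) +
  geom_col() +
  scale_fill_manual(values = my_colors)

# Fill colors the layer will actually use.
fills <- layer_data(p)$fill
print(fills)
```

Reusing my_colors in every figure keeps the group colors identical across the report.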

7. Quarto-Specific Plotting Guidance

How to integrate visualizations smoothly into a reproducible report

7A. Use section dividers to break up content

---

7B. Add inline statistics below the plot

The mean RT was `r round(mean_rt, 2)` ms.

Why this matters

Inline reporting keeps code and interpretation connected and reduces errors.
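
The inline expression needs mean_rt to exist before the sentence is rendered. A minimal sketch of an earlier chunk that could define it, using made-up values (toy_rt is hypothetical; in a real report this would be a column such as npt_data$mean_rt_overall):

```r
# Hypothetical reaction times in ms, with one missing value.
toy_rt <- c(512.3, 498.1, NA, 530.2)

# Define the object that the inline expression refers to.
mean_rt <- mean(toy_rt, na.rm = TRUE)

# The value the rendered sentence would show.
print(round(mean_rt, 2))
```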


7C. Control figure size and captions with chunk options

#| fig-width: 6
#| fig-height: 4
#| fig-cap: "Mean RT by group."
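
These option lines go at the very top of an R code chunk, before any code. The sketch below shows what the body of such a chunk might look like, assuming ggplot2 is installed. The label fig-mean-rt and the values in group_means are hypothetical. Because the option lines start with #, R reads them as comments.

```r
#| label: fig-mean-rt
#| fig-width: 6
#| fig-height: 4
#| fig-cap: "Mean RT by group."

library(ggplot2)

# Hypothetical group means (values are made up).
group_means <- data.frame(
  focus_group     = c("Low Focus", "High Focus"),
  mean_rt_overall = c(530, 490)
)

# The last plot produced in the chunk becomes the captioned figure.
ggplot(group_means, aes(x = focus_group, y = mean_rt_overall)) +
  geom_col()
```

A label that begins with fig- also lets Quarto number the figure and cross-reference it in the text.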

7D. Set a consistent theme for all figures

theme_set(theme_classic())

This ensures visual consistency across the document.


8. Summary

Preparing data carefully prevents most plotting errors.
The most important habits include:

  • Inspecting your dataset before plotting
  • Converting and ordering factors deliberately
  • Summarizing data appropriately for each plot type
  • Reshaping data when needed
  • Handling missing values thoughtfully
  • Applying structured troubleshooting steps
  • Using Quarto features to keep plots reproducible

Good data preparation leads to clearer figures and smoother analysis.