Data Visualization with ggplot2

0. Introduction

Why visualization matters

Good visualization is an important part of data analysis because it helps us see patterns that are hard to notice in tables or summaries. A well-made figure gives us a clearer view of our data and often points us toward the right questions to ask next. When we work with reaction times, accuracy scores, or questionnaire results, plots help us understand both the big picture and the smaller details.

A few reasons this matters:

Plots reveal structure. We can see distributions, relationships, and group differences that might be hidden in raw numbers.
Plots help us check our work. We can spot unusual values, inconsistent patterns, or mistakes in preprocessing before they cause problems later.
Plots support scientific communication. A clear figure can convey an idea faster and more precisely than several paragraphs of text.
Plots guide interpretation. They help us connect our statistical results to the psychological processes we are studying.
Plots help us understand our research question more fully when paired with the statistical tests. The statistical results tell us whether an effect is likely to be real, and the visualization helps us see the size and shape of that effect.
Plots make our analysis reproducible and transparent when paired with code in Quarto.

As we learn ggplot, the goal is to create visualizations that are both accurate and easy to interpret. We want figures that help us understand our data and help others understand our work.

1. Project Setup

Make sure your project has the following structure:

npt_project/
 ├── data/
 │    ├── raw/
 │    └── cleaned/
 │         └── npt_data_cleaned.csv
 ├── scripts/
 ├── output/
 │    └── plots/
 └── reports/

You can download today's data file, study_level_processed.csv, and place it in your npt_project/data/cleaned/ directory to follow along.

Alternatively, you can generate it yourself by adding the following lines at the end of the #### Save Files section of last week's npt_group.qmd. Render the whole Quarto report, and it will write the data to your data/cleaned/ directory.

write.csv(study_level, here("data/cleaned/study_level_processed.csv"), row.names = FALSE)
saveRDS(study_level, here("data/cleaned/study_level_processed.rds"))

Your file should look like this:

subject_id	tef10_score	practice_mean_rt	practice_acc	magnitude_mean_rt	magnitude_acc	parity_mean_rt	parity_acc	practice_sd_rt	magnitude_sd_rt	parity_sd_rt	rt_diff	acc_diff	mean_rt_overall	mean_acc_overall	focus_group
id	1.22	545.65	1.00	502.86	0.82	544.22	0.87	100.33	104.52	63.91	41.36	0.05	523.53	0.84	Low Focus
id	1.80	519.53	1.00	576.63	0.88	488.34	0.80	71.71	90.50	76.12	-88.29	-0.08	532.49	0.84	Low Focus
id	1.30	587.90	1.00	529.58	0.94	547.88	0.88	88.74	87.04	67.71	18.30	-0.07	538.73	0.91	Low Focus
id	2.11	558.85	1.00	552.62	0.69	583.87	0.92	59.32	105.32	84.99	31.25	0.22	568.25	0.80	High Focus

Create `npt_dataviz.qmd` Quarto Report

Throughout these notes, we will use npt_data to demonstrate plotting and summarizing. Create a new Quarto report named npt_dataviz.qmd by running the following command once in your console; it places the file in your scripts directory.

file.create(here::here("scripts/npt_dataviz.qmd"))
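
One caution: file.create() truncates a file that already exists, so re-running the command above after you have started writing would empty your report. Below is a minimal sketch of a guard. It runs in a temporary directory so it is safe to try anywhere, and the report_path name is only for illustration:

```r
## Illustration only: use a temporary folder instead of the real project
report_path <- file.path(tempdir(), "npt_dataviz.qmd")

## Create the file only if it does not exist yet
if (!file.exists(report_path)) {
  file.create(report_path)
}

## Simulate having written something into the report
writeLines("---\ntitle: \"NPT Data Visualization\"\n---", report_path)

## Running the guarded command again leaves the contents untouched
if (!file.exists(report_path)) {
  file.create(report_path)
}

length(readLines(report_path))  # still 3 lines
```

In your own project, you would point report_path at the same location used in the command above.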

Add the following to your Quarto Report. For a shorter data frame name, we are going to load study_level_processed.csv as npt_data:

---
title: "NPT Data Visualization"
author: "Your Name"
format:
  html:
    fig-number: true
---

## This report will walk through plotting and visualizing data from the NPT Project in Base R and `ggplot2`.

---

### Set Up Environment
*This section sets up our environment and data frame.*

#### Load Packages
```{r}
if (!require("pacman")) {install.packages("pacman"); require("pacman")}
p_load("ggplot2", "here")  # here is needed for the here::here() calls below
```

#### Load `study_level_processed` data frame as `npt_data`
```{r}
npt_data <- read.csv(here::here("data/cleaned/study_level_processed.csv"))
```

#### Create output directories
```{r}
dir.create(here::here("output/plots"), recursive = TRUE, showWarnings = FALSE)
```

---

Note: Three dashes --- create a section break in a Quarto document. This is an easy way to separate major parts of your report so that the structure is clear when someone reads it. You can place a section break after a figure, before a new topic, or anywhere you want to make the document flow more naturally. Think of it as a horizontal divider that helps the reader see where one idea ends and the next one begins.

The three dashes at the end of the template above therefore create a section break between the environment setup and the plotting sections that follow.

2. Using fig-cap in Quarto

Before we begin plotting, we will set up our Quarto document so that our figures have clear, consistent captions. Captions help organize reports: Quarto's fig-cap option supplies the caption text, and a chunk label lets Quarto number and cross-reference each figure.

A fig-cap goes in the chunk options at the top of the code chunk, on a line that starts with #|. If you also want Quarto to treat the plot as an official figure that can be numbered and cross-referenced, add a label that starts with fig-.

```{r}
#| label: fig-example
#| fig-cap: "Example figure caption generated by Quarto."
plot(1:10, 1:10)
```

fig-cap controls the caption text. A label that starts with fig- tells Quarto to recognize this chunk as a figure and make it eligible for automatic numbering and cross-referencing.

Quarto will:

place the caption directly under the figure
automatically number the figure (for example Figure 1:)
keep numbering consistent if figures move around
allow you to reference the figure elsewhere using @fig-example

We will use simple fig-cap captions and then write dynamic text below the chunk to expand upon our results.

Example of dynamic text outside the chunk:

The mean value is `r mean(1:10)` ± `r sd(1:10)` units (@fig-example).

This pattern keeps captions clean and avoids technical issues with inline R inside the "fig-cap" field.

3. The Grammar of ggplot2

Every ggplot is built from a consistent structure.
Rather than remembering separate commands for each type of plot, we focus on combining layers.

A minimal ggplot follows this pattern:

```{r}
#| eval: false
## Template only: replace the placeholders with your own data and variables
ggplot(data_frame, aes(x = variable_x, y = variable_y)) +
  geom_layer()
```

Key pieces:

ggplot(data) tells ggplot which dataset we are working with.
aes() stands for aesthetics and maps variables to visual properties like x-position, y-position, color, or fill.
geom_layer() is a placeholder for a geom function such as geom_histogram() or geom_point(), which adds the visual marks (bars, points, lines, etc.).

We will build each of our plots using this shared structure, adding labels and themes to make the figures publication-ready.
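
To make the template concrete, here is a minimal sketch that uses R's built-in mtcars dataset rather than our NPT data (an assumption made only so the example runs on its own). It builds the plot one layer at a time and stores each step in an object:

```r
library(ggplot2)

## Step 1: data + aesthetic mappings (draws empty axes, no marks yet)
p_base <- ggplot(mtcars, aes(x = wt, y = mpg))

## Step 2: add a geom layer to draw the points
p_points <- p_base + geom_point()

## Step 3: add labels and a theme
p_final <- p_points +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_classic()

## Printing the object displays the finished plot
p_final
```

Printing p_base or p_points on its own shows what each step contributes before the next layer is added.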

4. Example 1: Histogram of Reaction Times

Let’s begin by visualizing the distribution of mean overall reaction times in our dataset.
Histograms help us see the shape of a distribution and whether values cluster, spread out, or show skew.

4A. Basic Histogram

#### Histograms
*A histogram displays the distribution of one numeric variable.*

Basic Histogram
```{r}
#| label: fig-rt-hist
#| fig-cap: "Histogram of overall reaction times."

ggplot(npt_data, aes(x = mean_rt_overall)) +
  geom_histogram(binwidth = 5,
                 fill = "gray80",
                 color = "black") +
  labs(x = "Mean RT (ms)", y = "Count")
```
---

This gives us our starting point: a clear view of how reaction times are distributed.
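
The binwidth argument sets how wide each bar is in the units of the x-axis (here, milliseconds), so a sensible value depends on the spread of the data. As a rough sketch, using simulated reaction times rather than our real data (the sim_rt values and the seed are made up for illustration), we can compare a few bin widths and see roughly how many bins each would need:

```r
## Simulated mean RTs for illustration only (not NPT data)
set.seed(123)
sim_rt <- rnorm(60, mean = 540, sd = 25)

## Approximate number of bins needed to cover the range at each width
rt_range   <- diff(range(sim_rt))
bin_widths <- c(5, 10, 20)
n_bins     <- ceiling(rt_range / bin_widths)

data.frame(binwidth = bin_widths, approx_bins = n_bins)

## Base R can also suggest a number of bins (Freedman-Diaconis rule)
nclass.FD(sim_rt)
```

Narrow bins show more detail but can look noisy with few participants, while wide bins are smoother but can hide structure, so it is worth trying more than one value.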

4B. Adding Mean and Standard Deviation Lines

Now let’s add a few important reference points — the mean and standard deviation — on top of this histogram.
This demonstrates the layer-by-layer structure of ggplot. We will also update the figure caption and introduce dynamic text expanding on the results below the code chunk, as well as a section break --- after the text and before the next section.

Histogram with Mean and SD
```{r}
#| label: fig-rt-hist-lines
#| fig-cap: "Histogram of overall reaction times with reference lines showing the mean and one standard deviation."

## Compute summary statistics
mean_rt <- mean(npt_data$mean_rt_overall, na.rm = TRUE)
sd_rt   <- sd(npt_data$mean_rt_overall,   na.rm = TRUE)

## Histogram with reference lines
ggplot(npt_data, aes(x = mean_rt_overall)) +
  geom_histogram(binwidth = 5,
                 fill = "gray80",
                 color = "black") +
  geom_vline(xintercept = mean_rt,         color = "red") +
  geom_vline(xintercept = mean_rt - sd_rt, color = "blue", linetype = "dashed") +
  geom_vline(xintercept = mean_rt + sd_rt, color = "blue", linetype = "dashed") +
  labs(x = "Mean RT (ms)", y = "Count")
```

The mean overall reaction time was `r round(mean_rt, 2)` ms, and the standard deviation was `r round(sd_rt, 2)` ms.

---

5. Example 2: Scatterplot with Regression Line

Next, we will visualize the relationship between two continuous variables in our study‑level dataset.
A natural pair to examine is:

TEF‑10 score
Mean overall reaction time

Scatterplots help us see whether higher TEF‑10 scores are associated with faster or slower RTs.
We will build this visualization in layers so we can see how each piece contributes to the final figure.

5A. Basic Scatterplot

#### Scatterplot
*Scatterplots show the relationship between two numeric variables.*

Simple Scatterplot
```{r}
#| label: fig-scatter
#| fig-cap: "Scatterplot of TEF-10 scores and overall reaction times."

ggplot(npt_data, aes(x = tef10_score, y = mean_rt_overall)) +
  geom_point(color = "gray40") +
  labs(x = "TEF-10 Score",
       y = "Mean RT (ms)") +
  theme_minimal()
```
---

This first layer shows us the raw relationship between the two variables.
We can already get a sense of whether the relationship looks positive, negative, or flat.

5B. Adding a Regression Line

To make the relationship clearer, we can add a fitted regression line.
This helps us see the overall trend more easily.

Before plotting, we run a correlation test so we can report the strength and statistical significance of the relationship.

Scatterplot with Regression Line
```{r}
#| label: fig-scatter-line
#| fig-cap: "Scatterplot with a fitted regression line showing the relationship between TEF-10 score and overall reaction time."

## Correlation test
cor_test <- cor.test(npt_data$tef10_score,
                     npt_data$mean_rt_overall)

## Scatterplot + regression line
ggplot(npt_data, aes(x = tef10_score, y = mean_rt_overall)) +
  geom_point(color = "gray40") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(x = "TEF-10 Score",
       y = "Mean RT (ms)") +
  theme_minimal()
```

The correlation was `r round(cor_test$estimate, 2)`, with a p-value of `r signif(cor_test$p.value, 3)`.

---
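
Raw p-values can render awkwardly in inline text (for example, very small values appear in scientific notation). One possible approach is sketched below with a hypothetical helper, format_p(), that is not part of any package. It is demonstrated on the built-in mtcars data instead of npt_data:

```r
## Hypothetical helper: format a p-value for inline reporting
format_p <- function(p) {
  if (p < 0.001) {
    "< .001"
  } else {
    ## Three decimals, then drop the leading zero (e.g. "= .042")
    paste("=", sub("^0", "", sprintf("%.3f", p)))
  }
}

## Example with built-in data (illustration only)
ct <- cor.test(mtcars$wt, mtcars$mpg)

r_value <- round(unname(ct$estimate), 2)
df      <- unname(ct$parameter)

## Assemble the sentence pieces you might place in inline R
paste0("r(", df, ") = ", r_value, ", p ", format_p(ct$p.value))
```

The same pieces could be used with cor_test from the chunk above, for example `r format_p(cor_test$p.value)` in the text below the figure.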

5C. Using Built-In Themes

As we start refining our plots, it helps to know that ggplot2 includes several built-in themes that change the overall look of a figure without affecting its data or layers. Themes control elements like backgrounds, borders, gridlines, and text styles.

All themes can be added at the end of a plot:

+ theme_classic()

Below are a few of the most commonly used options.

Theme Name	Description	Good For
`theme_classic()`	White background with axis lines and no gridlines	Scientific figures, journal-style plots
`theme_bw()`	White background with a panel border and light gray gridlines	Grayscale printing, precise value reading
`theme_minimal()`	Light, modern look with faint gridlines	Presentations, teaching slides
`theme_light()`	Light background with more visible gridlines	Comparisons requiring structure
`theme_linedraw()`	Only black lines of various widths on a white background	Technical drawings, engineering-style plots
`theme_void()`	Almost no plot elements	Maps, minimalist visualizations
`theme_dark()`	Dark background with contrasting gridlines	Presentations, high-contrast displays

Different themes work better for different situations.
As we continue building visualizations, we will use these themes to improve clarity. You can also build your own custom themes to personalize your visualizations.
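
Because a theme is just another layer, one convenient pattern is to store the plot in an object and try different themes on it without retyping the rest of the code. Here is a small sketch, using the built-in mtcars data as a stand-in for our own:

```r
library(ggplot2)

## Build the plot once and store it (no theme chosen yet)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "gray40") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

## Swap themes by adding them to the stored plot
p + theme_classic()
p + theme_bw()

## Themes accept a base font size, useful for slides or posters
p + theme_minimal(base_size = 16)
```

Comparing the same plot under two or three themes is usually the quickest way to decide which one suits the figure's purpose.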

6. Example 3: Comparing Groups with a Bar Plot

In this section, we will build a bar plot step by step to compare mean overall reaction times across the two focus groups in our study. We will follow the same layered structure as before, starting with a basic bar plot and adding meaningful refinements.

6.1 Summarizing Mean RT by Focus Group

Before we can plot group means, we need to compute them. We will use aggregate() to create a summarized data frame with one row per focus group.

#### Bar Plots
*Bar plots are used to compare group-level summaries.*

Simple Bar Plot
```{r}
#| label: fig-bar
#| fig-cap: "Mean overall reaction time by focus group."

## Compute group means
mean_rt_by_group <- aggregate(mean_rt_overall ~ focus_group,
                              data = npt_data,
                              FUN  = mean)

mean_rt_by_group

## Extract values for dynamic reporting
mean_high <- mean_rt_by_group$mean_rt_overall[mean_rt_by_group$focus_group == "High Focus"]
mean_low  <- mean_rt_by_group$mean_rt_overall[mean_rt_by_group$focus_group == "Low Focus"]

## Plot the bar plot
ggplot(mean_rt_by_group, aes(x = focus_group, y = mean_rt_overall)) +
  geom_col(fill = "steelblue") +
  labs(x = "Focus Group", y = "Mean RT (ms)") +
  theme_classic()
```

The High Focus group had an average RT of `r round(mean_high, 2)` ms, and the Low Focus group averaged `r round(mean_low, 2)` ms.

---

6.2 Extension: Comparing More Than One Variable with a Bar Plot

In some situations, we want to compare more than one variable across the same grouping variable.
Here we extend our bar plot to show parity_mean_rt and magnitude_mean_rt across the two focus groups.

This example helps us see how different task conditions relate to the same grouping variable.

6.2.1 Reshaping the Data for Multiple Bars

To create a bar plot with multiple bars per group, we first reshape the data into a “long” format. Long format stores one observation per row with a variable describing its type, while wide format spreads different measurements across multiple columns for each observation.

reshape() is a base R tool that reorganizes data into the format we need for a specific analysis or plot. Many ggplot figures work best in long format, where each row reflects a single measurement and other columns describe what that measurement refers to. Raw datasets, however, often come in wide format, with several related variables in separate columns. reshape() converts from wide to long by gathering multiple columns into one measurement column and creating a new variable that labels each measurement type. We use it because it keeps everything in base R, works well with mixed data types, and produces clean, long-format data that ggplot can interpret without confusion. This makes it easier to create flexible and informative plots, especially when comparing more than one variable across the same groups.
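
To see what the wide-to-long conversion does before applying it to the real data, here is a minimal sketch on a two-row toy data frame. The values are made up for illustration; only the column names mirror our dataset:

```r
## Toy wide-format data: one row per person, one column per task
wide <- data.frame(
  focus_group       = c("High Focus", "Low Focus"),
  parity_mean_rt    = c(500, 550),
  magnitude_mean_rt = c(520, 580)
)

## Gather the two RT columns into a single measurement column
long <- reshape(wide,
                varying   = list(c("parity_mean_rt", "magnitude_mean_rt")),
                v.names   = "rt_value",
                timevar   = "task_type",
                times     = c("Parity RT", "Magnitude RT"),
                direction = "long")

## reshape() creates compound row names; reset them for readability
rownames(long) <- NULL

long
```

Two rows with two measurement columns become four rows with one measurement column. reshape() also adds an id column that records which original row each value came from.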

#### Multi-Group Bar Plots
*Bar plots with multiple bars per group allow us to compare several variables at once.*

Prepare Data
```{r}
## Select relevant columns
multi_rt <- npt_data[, c("focus_group",
                         "parity_mean_rt",
                         "magnitude_mean_rt")]

## Reshape to 'long' format for ggplot
multi_rt_long <- reshape(multi_rt,
                         varying = list(c("parity_mean_rt", "magnitude_mean_rt")),
                         v.names = "rt_value",
                         timevar = "task_type",
                         times = c("Parity RT", "Magnitude RT"),
                         direction = "long")

head(multi_rt_long)
```

Now each row contains:

a focus group,
a task type, and
the corresponding reaction time value.

6.2.2 Multi-Group Bar Plot

With the long-format data, we can create grouped bars using position = "dodge".

Multi-Group Bar Plot
```{r}
#| label: fig-multi-bar
#| fig-cap: "Comparison of parity and magnitude mean RTs across focus groups."

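## Average within each group and task first; with one row per participant,
## dodged bars would be drawn on top of each other instead of showing means
multi_rt_means <- aggregate(rt_value ~ focus_group + task_type,
                            data = multi_rt_long,
                            FUN  = mean)

ggplot(multi_rt_means,
       aes(x = focus_group,
           y = rt_value,
           fill = task_type)) +
  geom_col(position = "dodge") +
  labs(x = "Focus Group",
       y = "Mean RT (ms)",
       fill = "Task Type") +
  theme_classic()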
```

This visualization lets us compare how parity and magnitude reaction times vary across focus groups.

6.2.3 Dynamic Reporting

Below the plot, we can calculate and report the means for each task type within each group.

```{r}
tapply(multi_rt_long$rt_value,
       list(multi_rt_long$focus_group,
            multi_rt_long$task_type),
       mean)
```

This table summarizes:

Mean parity RT by focus group
Mean magnitude RT by focus group
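
Since tapply() with two grouping variables returns a matrix whose rows and columns are named after the groups, individual cells can be pulled out by name for inline text. Here is a minimal sketch with made-up numbers (the toy_long data frame is hypothetical and only mirrors the column names of multi_rt_long):

```r
## Made-up values for illustration only
toy_long <- data.frame(
  focus_group = rep(c("High Focus", "Low Focus"), each = 4),
  task_type   = rep(c("Parity RT", "Magnitude RT"), times = 4),
  rt_value    = c(500, 520, 510, 530, 550, 580, 560, 590)
)

## Rows are focus groups, columns are task types
task_means <- tapply(toy_long$rt_value,
                     list(toy_long$focus_group, toy_long$task_type),
                     mean)

## Index by row name (group) and column name (task)
high_parity <- task_means["High Focus", "Parity RT"]
low_parity  <- task_means["Low Focus",  "Parity RT"]

high_parity   # 505
low_parity    # 555
```

Objects like these can then be dropped into a sentence below the figure, for example `r round(high_parity, 2)` ms.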

7. Saving Plots

Once we have created visualizations that communicate our findings clearly, we may want to save them for use in reports, presentations, or manuscripts. We can save ggplot figures directly from a Quarto document using ggsave() alongside here::here() so that everything stays organized and paths remain reproducible.

We save plots into the output/plots/ folder:

npt_project/
└── output/
    ├── plots/
    ├── tables/
    └── figures/

7A. Saving the Most Recent Plot

If no plot object is specified, ggsave() saves the most recently displayed ggplot. It does not capture plots made with base R functions such as plot().

#### Saving the last plot
```{r}
ggsave(
  filename = here::here("output", "plots", "last_plot.png"),
  width = 6,
  height = 4
)
```

This saves the most recent figure in your document as last_plot.png.

7B. Saving a Named Plot Object

It is often better practice to assign plots to an object and then save them explicitly.
This helps keep your workflow organized and makes your plotting code reusable.

#### Saving a named plot
```{r}
## Create a plot object
p_focus <- ggplot(mean_rt_by_group,
                  aes(x = focus_group, y = mean_rt_overall)) +
  geom_col(fill = "steelblue") +
  labs(x = "Focus Group", y = "Mean RT (ms)") +
  theme_minimal()

## Save the plot object
ggsave(
  filename = here::here("output", "plots", "focus_group_barplot.png"),
  plot     = p_focus,
  width    = 6,
  height   = 4
)

p_focus   # this displays the plot in the rendered document

```

Now the figure is saved with a clear name so you can include it in final reports.

7C. Saving Higher Resolution Figures

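ggsave() saves raster images at 300 dpi by default. For a figure intended for publication, it helps to set dpi explicitly so the resolution is documented in your code, and to raise it if the journal asks for more: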

```{r}
ggsave(
  filename = here::here("output", "plots", "focus_group_barplot_highres.png"),
  plot     = p_focus,
  width    = 6,
  height   = 4,
  dpi      = 300
)
```

Many journals and poster printers ask for raster figures at 300 dpi or higher, so check the specific requirements before submitting.

7D. Saving as PDF for Vector Graphics

PDF versions of plots are useful because they scale cleanly without pixelation, which can be good if you need to change the size for different formats (write-up, slide presentation, poster presentation).

```{r}
ggsave(
  filename = here::here("output", "plots", "focus_group_barplot.pdf"),
  plot     = p_focus,
  width    = 6,
  height   = 4
)
```

This creates a vector graphic PDF that preserves lines, text, and symbols crisply at any zoom level.

Saving plots lets us preserve our work, share it with others, and include it in final reports.
We now have a reliable workflow for exporting any ggplot figure directly from a Quarto analysis.
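
If you regularly need both a PNG and a PDF of the same figure, the two ggsave() calls can be wrapped in a small function. The sketch below uses a hypothetical helper, save_plot_formats(), which is not part of ggplot2. It is demonstrated with built-in data and a temporary folder so it can be run outside the project:

```r
library(ggplot2)

## Hypothetical helper: save one plot as both PNG and PDF
save_plot_formats <- function(plot, name, dir, width = 6, height = 4) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  paths <- file.path(dir, paste0(name, c(".png", ".pdf")))
  for (path in paths) {
    ## ggsave() picks the file type from the extension
    ggsave(filename = path, plot = plot, width = width, height = height)
  }
  invisible(paths)
}

## Illustration with built-in data and a temporary folder
p_demo <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
saved  <- save_plot_formats(p_demo, "demo_plot", file.path(tempdir(), "plots"))

## Confirm that both files were written
file.exists(saved)
```

In the NPT project, the dir argument would be here::here("output", "plots") and the plot would be an object such as p_focus.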

8. Summary

As we worked through these visualizations, our goal was not only to produce specific figures, but to understand the broader principles that make ggplot2 such a useful tool for scientific communication. Here are the key ideas we want to carry forward:

Plots are built in layers. Starting with a simple foundation and adding structure one step at a time helps us stay organized and intentional in how we visualize our data.
The grammar of graphics helps us think clearly. Separating the data, aesthetic mappings, and geoms makes it easier to understand what each component contributes and how to adjust it when our needs change.
Data preparation and plotting go hand in hand. Clean, well-structured data (wide or long) makes plotting more straightforward, and understanding our data types helps us choose appropriate visualizations.
Good figures communicate a purpose. Every line, color, and axis label should help the reader understand the pattern or comparison we want to highlight.
Reproducibility supports clarity. Using Quarto, fig-cap, inline R, and saved plot files helps ensure our visualizations are consistent, documented, and easy to update when our data changes.
We are learning a workflow, not just commands. By practicing a step-by-step process—starting simple, refining with layers, and saving polished plots—we build habits that will support our work in the NPT project and future analyses.

Together, these ideas form the foundation for creating visualizations that are accurate, interpretable, and polished. In the next notes set, we will build on this foundation by learning how to design custom themes that make our figures even clearer and more professional.

9. Additional resources

If you want to explore more advanced plotting tools or refine your figures even further, the following reference notes are available. These are optional, but they will help you build stronger, clearer, and more professional visualizations.

Custom themes: This guide explains how themes work in ggplot and how you can build your own custom theme for consistent styling. It covers controlling text, gridlines, spacing, margins, legends, and how to save your theme as an object or function that you can reuse across figures.
Plot Types and Aesthetics Reference: A collection of additional ggplot templates and examples, including boxplots, violin plots, density plots, dot plots, faceted plots, and jittered points. It also includes examples of common aesthetics, such as labels, colors, axis controls, and annotations.
Data Preparation & Troubleshooting for Plotting: A detailed reference on preparing data for plotting, including converting variables, handling missing values, reshaping wide-to-long data, summarizing with base R tools, and diagnosing common ggplot errors. This guide helps you understand what ggplot expects and how to fix issues quickly.
Good Plotting Practices: A practical guide to making clear, readable, and scientifically useful figures. It covers best practices for labeling, color choices, axes, consistency, caption writing, accessibility, and how to use figures to support your research question.

These resources are here to support you as you continue developing your visualization skills. Use them whenever you want to explore a new plot type, troubleshoot a figure, or polish your final report.