Whole Game

R for Data Science Chapters 1-8

Julianne G. Clina, PhD

University of Kansas Medical Center

June 24, 2026

Welcome to the R-Lab!!!

Goal: get just enough of every major data science tool to tackle a real dataset, start to finish.

The data science cycle:

Import → Tidy → Transform ⇄ Visualize ⇄ Model → Communicate

Today’s Plan

Workflow basics

Naming, style, scripts, projects
These habits show up in every lesson going forward, so we won’t repeat them

The four tools

Import data → readr
Tidy data → tidyr
Transform data → dplyr
Visualize data → ggplot2

We’re covering this in import → tidy → transform → visualize order today (rather than the book’s order) because that’s the order you actually touch a new dataset.

The Whole Game, Explained Like a Kitchen

Think of data science like cooking a meal:

Data science step	Kitchen analogy
Import	Grocery shopping
Tidy	Washing & chopping ingredients
Transform	Cooking & seasoning
Visualize	Plating the dish
Communicate	Serving it to others

You can’t plate raw ingredients, and you shouldn’t cook before you chop.

Workflow: The Habits That Make This Easier

Naming Things and Writing Code

Who are you writing code for?

Not your professor
Not your classmates
Future you (who has forgotten everything)

If code only makes sense right now, it will absolutely fail you in 2 weeks.

# Assignment uses <- (shortcut: Alt/Option + -)
x <- 3 * 4
primes <- c(2, 3, 5, 7, 11, 13)

Strive for:

short_flights <- flights %>%
  filter(air_time < 60)

Avoid:

SHORTFLIGHTS <- flights %>%
  filter(air_time < 60)

short.flights <- flights %>%
  filter(air_time < 60)

ShortFlights <- flights %>%
  filter(air_time < 60)

snake_case, spaces around <- and operators, and %>% at the end of a line (not the start) — these three habits make code dramatically easier to read later.

Scripts and Projects

Write code in the script editor, not just the console — Cmd/Ctrl + Enter runs a line, Cmd/Ctrl + Shift + S runs the whole script
Keep every project (data, scripts, outputs) together using an RStudio Project (.Rproj)
For those using Positron, this is set up for you by working in a specific workspace (folder)
Use relative paths (data/students.csv), never absolute ones (/Users/you/...) — relative paths work on anyone’s machine

# Good — relative to the project folder
write_csv(diamonds, "data/diamonds.csv")

Restart R often (Cmd/Ctrl + Shift + 0) and re-run your script from scratch. If it still works, your script, not your environment, is the real source of truth.
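To see why relative paths travel well, here is a minimal sketch. The folder name demo-project and its location under tempdir() are made up for illustration. In practice, opening an RStudio Project (or a Positron workspace) sets the working directory for you, so you would not normally call setwd() yourself.

```r
library(readr)

# Hypothetical project folder, created under tempdir() purely for illustration
project_dir <- file.path(tempdir(), "demo-project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Stand-in for opening the project: make its folder the working directory
old_wd <- setwd(project_dir)

# The relative path resolves against the project folder, wherever that lives
write_csv(head(mtcars), "data/mtcars-head.csv")
saved_ok <- file.exists("data/mtcars-head.csv")

# Put the working directory back the way we found it
setwd(old_wd)
```

The same "data/mtcars-head.csv" line would work unchanged on a classmate's machine, because nothing in it mentions your user name or drive.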

Importing Data

Reading Data In With readr

library(tidyverse)

students <- read_csv(
  "data/students.csv",
  na = c("N/A", "") # treat "N/A" strings as real missing values
)

readr guesses each column’s type (logical → number → date → string) from a sample of rows (up to 1,000 by default, set by guess_max) and prints what it guessed, so the message is always worth a glance.

Common cleanup after import:

janitor::clean_names() → tidy up column names to snake_case
mutate(x = factor(x)) → turn categorical text into a factor
parse_number() → pull a number out of a messy string column

Other formats: read_tsv() and read_delim(). To read multiple files at once, use read_csv(list.files(...), id = "file").
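The cleanup steps above can be sketched on a tiny inline CSV so that nothing depends on a file being present. The data are invented for illustration, and rename() stands in for janitor::clean_names() only to avoid an extra package in this sketch.

```r
library(readr)
library(dplyr)

# Made-up CSV text; I() tells read_csv() to treat the string as literal data
csv_text <- I("Student ID,Full Name,favourite.food,AGE
1,Sunil Huffmann,Strawberry yoghurt,4
2,Barclay Lynn,N/A,five
3,Jayendra Lyne,Pizza,7")

# "N/A" becomes a real NA; AGE is read as text because of the stray "five"
students <- read_csv(csv_text, na = c("N/A", ""), show_col_types = FALSE)

students_clean <- students %>%
  rename(
    student_id = `Student ID`,
    full_name = `Full Name`,
    favourite_food = favourite.food,
    age = AGE
  ) %>%
  mutate(
    age = parse_number(if_else(age == "five", "5", age)), # text -> number
    favourite_food = factor(favourite_food)               # text -> factor
  )
```

Checking the guessed types right after import is what surfaces problems like the text-valued AGE column early.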

Tidying Data

What Makes Data “Tidy”?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” — Hadley Wickham

Three rules:

Each variable is a column.
Each observation is a row.
Each value is a cell.

Tidy data is what lets dplyr and ggplot2 “just work” — most real-world data isn’t tidy yet, which is why tidyr exists.

Reshaping With `pivot_longer()` / `pivot_wider()`

Too wide — values trapped in column names (e.g. a column per week)

billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
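If billboard feels too big to reason about, the same reshaping can be sketched on a made-up three-week table. The songs and ranks below are invented; only the pivot_longer() arguments mirror the call above.

```r
library(dplyr)
library(tidyr)

# Made-up "too wide" data: one column per week
songs <- tribble(
  ~track,   ~wk1, ~wk2, ~wk3,
  "Song A",   87,   82,   80,
  "Song B",   91,   NA,   NA
)

songs_long <- songs %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE   # drop weeks where the song was not on the chart
  ) %>%
  mutate(week = readr::parse_number(week))   # "wk1" -> 1
```

Six cells of week data become four rows, because the two missing weeks for Song B are dropped.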

Too long — one observation spread across multiple rows

cms_patient_experience %>%
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )
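The widening direction can be sketched the same way. The organisations, measure codes, and rates below are made up; the column names are chosen to match the call above.

```r
library(dplyr)
library(tidyr)

# Made-up "too long" data: each organisation is spread over several rows
measures <- tribble(
  ~org_id, ~measure_cd, ~prf_rate,
  "A",     "M1",        63,
  "A",     "M2",        87,
  "B",     "M1",        59,
  "B",     "M2",        85
)

measures_wide <- measures %>%
  pivot_wider(
    id_cols = starts_with("org"),   # what identifies one observation
    names_from = measure_cd,        # values that become column names
    values_from = prf_rate          # values that fill the new columns
  )
```

Four rows collapse to two, one per organisation, with a column for each measure.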

Transforming Data

dplyr: One Grammar for Rows, Columns, and Groups

Every dplyr verb: (1) takes a data frame first, (2) refers to columns by bare name, (3) returns a new data frame.

Acts on	Verbs
Rows	`filter()`, `arrange()`, `distinct()`
Columns	`mutate()`, `select()`, `rename()`, `relocate()`
Groups	`group_by()`, `summarize()`, `slice_max()` / `slice_min()`

Before — flights, untouched:

head(flights)

# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

dplyr in Action: Before → After

The pipeline:

fast_iah <- flights %>%
  filter(dest == "IAH") %>%
  mutate(speed = distance / air_time * 60) %>%
  select(year:day, dep_time, carrier, speed) %>%
  arrange(desc(speed))

After — filtered to IAH, new speed column, fewer columns, sorted:

head(fast_iah)

# A tibble: 6 × 6
   year month   day dep_time carrier speed
  <int> <int> <int>    <int> <chr>   <dbl>
1  2013     7     9      707 UA       522.
2  2013     8    27     1850 UA       521.
3  2013     8    28      902 UA       519.
4  2013     8    28     2122 UA       519.
5  2013     6    11     1628 UA       515.
6  2013     8    27     1017 UA       515.

Read %>% or |> as “then.” This pipeline reads: take flights, then filter, then mutate, then select, then arrange.
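Several verbs from the table (distinct(), rename(), relocate(), slice_max()) do not appear in that pipeline. Here is a rough sketch of them on a small made-up data frame; the trips data and its values are hypothetical, not taken from flights.

```r
library(dplyr)

# Made-up data with the same column names as flights
trips <- tibble(
  dest     = c("IAH", "IAH", "ORD", "ORD", "ORD"),
  carrier  = c("UA", "AA", "UA", "UA", "AA"),
  distance = c(1400, 1416, 719, 733, 719),
  air_time = c(227, 227, 150, 120, 138)
)

# Rows: keep each destination/carrier pair once
pairs <- trips %>%
  distinct(dest, carrier)

# Columns and groups: compute speed, tidy the layout, keep the fastest per group
fastest <- trips %>%
  mutate(speed = distance / air_time * 60) %>%
  rename(destination = dest) %>%
  relocate(speed, .after = destination) %>%
  group_by(destination) %>%
  slice_max(speed, n = 1) %>%
  ungroup()
```

Each step still follows the same three rules: data frame in, bare column names, data frame out.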

Grouped Summaries

flights %>%
  group_by(month) %>%
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  )

💡 Always include n() alongside any summary statistic. A mean from 3 observations and a mean from 3,000 observations look identical in a table — but you should trust them very differently.

# Newer shortcut — group for just this one operation
flights %>%
  summarize(delay = mean(dep_delay, na.rm = TRUE), n = n(), .by = month)
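A small made-up example shows why the count matters, and that the two grouping styles give the same answer. The delays data are invented, and .by assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Made-up delays: month 1 has three flights (one missing), month 2 has one
delays <- tibble(
  month     = c(1, 1, 1, 2),
  dep_delay = c(10, NA, 20, 90)
)

by_group <- delays %>%
  group_by(month) %>%
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n(),                              # rows in the group, NAs included
    n_non_missing = sum(!is.na(dep_delay)) # rows that actually fed the mean
  )

by_arg <- delays %>%
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    n_non_missing = sum(!is.na(dep_delay)),
    .by = month
  )
```

Month 2 has the scarier average, but it rests on a single flight. Note that n() counts rows including missing values, so a separate non-missing count can be worth keeping next to it.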

Visualizing Data

ggplot2: The Grammar of Graphics

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

Every plot: data + mapping (aes()) + geom.

Let’s build one up layer by layer so you can see exactly what each piece adds.

Building a Plot, Layer by Layer (1/4)

Start with data + mapping — aes() just sets up the coordinate system, no geometry yet:

library(palmerpenguins)
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g))

No data is drawn yet: ggplot() + aes() produce an empty canvas with labeled axes, declaring what goes where but not how to draw it.

Building a Plot, Layer by Layer (2/4)

Add a geom — now there’s something to look at:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

Building a Plot, Layer by Layer (3/4)

Map species to color and shape — a third variable, no new geom required:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species))

Building a Plot, Layer by Layer (4/4)

Add a second geom — geom_smooth() layers a trend line on top:

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm")

Every plot is data + mapping (aes()) + one or more geoms — each + stacks another layer on the same coordinate system.
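Where a mapping is placed decides which layers see it. As a sketch, the two plots below differ only in whether color = species is mapped inside geom_point() (local to that layer) or inside ggplot() (inherited by every layer).

```r
library(ggplot2)
library(palmerpenguins)

# Local mapping: only the points are colored, so geom_smooth() fits one line
p_local <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm")

# Global mapping: every layer inherits color, so geom_smooth() fits one line
# per species
p_global <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")
```

Printing p_local and p_global side by side makes the difference visible: one overall trend line versus three.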

Matching Plot Type to Question

Question	Geom(s)
One categorical variable	`geom_bar()`
One numerical variable	`geom_histogram()`, `geom_density()`
Numerical vs. categorical	`geom_boxplot()`
Two numerical variables	`geom_point()`, `geom_smooth()`
3+ variables	extra aesthetics (`color`, `shape`) or `facet_wrap()`

# Faceting beats cramming too many aesthetics onto one plot
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +
  facet_wrap(~island)

Map a variable to an aesthetic inside aes() when you want it to vary with the data; set it outside aes() (color = "blue") when you just want a fixed look.
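The difference between mapping and setting comes down to which side of aes() the argument sits on. A minimal sketch:

```r
library(ggplot2)
library(palmerpenguins)

# Mapped: color varies with the data and a legend is drawn
p_mapped <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species))

# Set: every point is blue, no legend, because color sits outside aes()
p_set <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "blue")
```

A common slip is writing aes(color = "blue"), which treats "blue" as a data value: the points get a default color and a legend labeled "blue".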

Polishing and Saving

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  )

ggsave("penguin-plot.png")

Same styling rules from the workflow section apply here: treat + like %>% — space before it, end of the line, one argument per line once it gets long.
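Called with only a file name, ggsave() saves the most recently displayed plot at the size of the current graphics device. A slightly more reproducible sketch names the plot and fixes the size explicitly. The output location under tempdir() and the 6 x 4 inch size are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species))

# Hypothetical output path; in a project this would be a relative path
out_file <- file.path(tempdir(), "penguin-plot.png")

# Naming the plot and its dimensions makes the saved file independent of
# whatever happens to be on screen
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)
```

Expect a warning about rows removed for missing values; a couple of penguins have no measurements.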

Bringing It All Together

A Realistic Pipeline

library(tidyverse)

flights_summary <- read_csv("data/flights.csv") %>% # import
  drop_na(dep_delay, arr_delay) %>% # tidy / clean
  filter(month %in% 6:8) %>% # transform
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay), n = n())

ggplot(flights_summary, aes(x = carrier, y = avg_delay)) + # visualize
  geom_col()

This is the whole game in a handful of lines: import → tidy → transform → visualize. Everything else in the book is depth on one of these four steps.
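The pipeline above assumes data/flights.csv already exists. To try the same steps without that file, here is a sketch that first writes a tiny made-up CSV to a temporary folder. The toy_flights values are invented, chosen so the result is easy to check by hand.

```r
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up stand-in for the flights file
toy_flights <- tibble(
  month     = c(6, 7, 8, 1, 7),
  carrier   = c("UA", "UA", "AA", "AA", "AA"),
  dep_delay = c(10, 20, 5, 50, NA),
  arr_delay = c(8, 25, 0, 45, NA)
)
csv_path <- file.path(tempdir(), "flights.csv")
write_csv(toy_flights, csv_path)

flights_summary <- read_csv(csv_path, show_col_types = FALSE) %>% # import
  drop_na(dep_delay, arr_delay) %>%                              # tidy / clean
  filter(month %in% 6:8) %>%                                     # transform
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay), n = n())

summer_plot <- ggplot(flights_summary, aes(x = carrier, y = avg_delay)) + # visualize
  geom_col()
```

Of the five rows, one is dropped for missing delays and one for being in January, leaving AA with a single summer flight (average 5) and UA with two (average 15).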

Getting Help

When You’re Stuck

Google it — add “R” or a package name to your search
Search Stack Overflow — include the [R] tag
Still stuck? Build a reprex — a minimal, self-contained example someone else can copy, paste, and run

reprex::reprex()

About 80% of the time, just writing a clean minimal example reveals the bug to you before anyone else has to answer.
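As a rough illustration of what "minimal and self-contained" can look like, the snippet below reproduces a common surprise with three made-up numbers and no packages. The scores data frame is hypothetical.

```r
# Small made-up data, created in the snippet itself so anyone can run it
scores <- data.frame(x = c(1, NA, 3))

# The surprise: one missing value makes the whole mean missing
surprising <- mean(scores$x)

# What was expected, and the fix that writing the example makes obvious
expected <- mean(scores$x, na.rm = TRUE)
```

Copying a snippet like this and then calling reprex::reprex() should give you a formatted version, with output, that is ready to paste into a question.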

Summary

The four tools: readr (import) → tidyr (tidy) → dplyr (transform) → ggplot2 (visualize)

The habits that make it sustainable: descriptive names, consistent style, RStudio projects with relative paths, restarting R often, and knowing how to ask for help.

We’ll go much deeper on each of these tools in upcoming lessons — today was about seeing how they fit together.

In-Class Activity: Your First GitHub Push

Goal for the remainder of the meeting: Everyone will…

Edit a file in their own folder
Push it to GitHub
See it render online

Step 1: Fork the Repository

Open the class repo on GitHub:

Click Fork
Accept all defaults
Wait for GitHub to finish

You should now see the repo under your username

Step 2: Clone Your Fork

From your fork on GitHub:

Click Code
Copy the HTTPS URL, or open the repo in GitHub Desktop
In RStudio or Positron:
- Clone repository
- Paste the URL
- Open the project

Step 3: Navigate to Your Folder

Inside the project (on your computer in Positron), go to:

attendees/
  YourName/
    code/

Step 4: Create a Simple Quarto Document

Inside your code/ folder:

Create a file called: whole-game.qmd

Paste this as the YAML header:

---
title: "The Whole Game Practice"
format: html
---

Create an R Code Chunk and enter:

# Hello R-LAB

message("My first GitHub push worked!")

x <- 1:10
mean(x)

Then save the file!

Step 5: Render with Quarto Locally

Open your .qmd file
In the terminal (not the R console), run:

quarto render

This creates the HTML file and confirms that your document renders without errors!

Step 6: Commit & Push!

Open GitHub Desktop
You should see your .qmd file listed as a change
Add a message and description
Click Commit to main
Click Push origin

You should now see your file on GitHub!

Step 7: Enable GitHub Pages

Within your fork on GitHub.com

Click Settings
Click Pages
Under Source
+ Branch: main
+ Folder: /(root)
Click Save

You should get a Pages URL. Open it and navigate to your folder to preview your rendered page!

Step 8: Create a Pull Request

A pull request lets you merge your work into the class repository. From your fork on GitHub:

Click Contribute
Click Open pull request
Confirm the information is correct
Click Create pull request
Add a title and description with a clear message
+ Title: Whole Game Practice (Your Name)
+ Description: - added whole-game.qmd