R for Data Science Chapters 1-8
University of Kansas Medical Center
June 24, 2026
Goal: get just enough of every major data science tool to tackle a real dataset, start to finish.
The data science cycle:
Import → Tidy → Transform ⇄ Visualize ⇄ Model → Communicate
Workflow basics
The four tools
We’re covering this in import → tidy → transform → visualize order today (rather than the book’s order) because that’s the order you actually touch a new dataset.
Think of data science like cooking a meal:
| Data science step | Kitchen analogy |
|---|---|
| Import | Grocery shopping |
| Tidy | Washing & chopping ingredients |
| Transform | Cooking & seasoning |
| Visualize | Plating the dish |
| Communicate | Serving it to others |
You can’t plate raw ingredients, and you shouldn’t cook before you chop.
If code only makes sense right now, it will absolutely fail you in 2 weeks.
snake_case, spaces around <- and operators, and %>% at the end of a line (not the start) — these three habits make code dramatically easier to read later.
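As a small illustration of those three habits, here is a sketch using made-up data; the table and column names (`student_scores`, `exam_score`) are hypothetical and only there to show the style.

```r
library(dplyr)

# Hypothetical data, used only to demonstrate the style habits.
student_scores <- tibble(
  student_id = c(1, 2, 3),
  exam_score = c(88, 92, 79)
)

# snake_case names, spaces around <- and operators,
# and %>% at the end of each line rather than the start.
high_scorers <- student_scores %>%
  filter(exam_score >= 80) %>%
  mutate(score_pct = exam_score / 100) %>%
  arrange(desc(exam_score))

high_scorers
```

Each step sits on its own line, so the pipeline can be read top to bottom and a single step can be commented out while debugging.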
Work inside an RStudio project (.Rproj)
Use relative paths (data/students.csv), never absolute ones (/Users/you/...): relative paths work on anyone’s machine
Restart R often (Cmd/Ctrl + Shift + 0) and re-run your script from scratch. If it still works, your script, not your environment, is the real source of truth.
readr guesses each column’s type (logical → number → date → string) from a sample of rows, and prints what it guessed — always worth a glance.
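To see the guessing in action without needing a file on disk, the sketch below feeds `read_csv()` a literal string wrapped in `I()`. The column names and values are invented for illustration.

```r
library(readr)

# Hypothetical CSV content; I() tells readr this is literal data, not a path.
csv_text <- I(
  "student_id,full_name,enrolled,start_date
1,Student One,TRUE,2024-01-15
2,Student Two,FALSE,2024-02-01
"
)

students <- read_csv(csv_text)

# Show the column specification readr settled on.
spec(students)
```

If a guess turns out wrong, passing `col_types` to `read_csv()` overrides it for the columns you name.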
Common cleanup after import:
janitor::clean_names() → tidy up column names to snake_case
mutate(x = factor(x)) → turn categorical text into a factor
parse_number() → pull a number out of a messy string column
Other formats: read_tsv(), read_delim(), multiple files at once with read_csv(list.files(...), id = "file").
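The three cleanup steps can be chained in one pass. This is a minimal sketch on an invented table (the column names and fee values are hypothetical), assuming the janitor package is installed.

```r
library(dplyr)
library(readr)
library(janitor)

# Hypothetical messy import: awkward column names, text categories, text numbers.
raw <- tibble(
  `Student ID` = 1:3,
  `Meal Plan` = c("Lunch only", "Breakfast and lunch", "Lunch only"),
  `Fee Paid` = c("$1,200", "$950", "about 1100 USD")
)

clean <- raw %>%
  clean_names() %>%                     # Student ID -> student_id, etc.
  mutate(
    meal_plan = factor(meal_plan),      # categorical text -> factor
    fee_paid = parse_number(fee_paid)   # "$1,200" -> 1200
  )

clean
```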
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” — Hadley Wickham
Three rules: each variable is a column, each observation is a row, and each value is a cell.
Tidy data is what lets dplyr and ggplot2 “just work” — most real-world data isn’t tidy yet, which is why tidyr exists.
pivot_longer() / pivot_wider()
Too wide: values trapped in column names (e.g. a column per week)
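A sketch of the "column per week" case, using a small invented table (the track names and ranks are hypothetical). `pivot_longer()` moves the week out of the column names and into a variable; `pivot_wider()` reverses it.

```r
library(dplyr)
library(tidyr)
library(readr)

# Hypothetical wide data: one column per week.
ranks_wide <- tribble(
  ~track,   ~wk1, ~wk2, ~wk3,
  "Song A",   87,   82,   72,
  "Song B",   91,   87,   NA
)

# Longer: one row per track-week, with the week stored as a number.
ranks_long <- ranks_wide %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  ) %>%
  mutate(week = parse_number(week))

# Wider: back to one column per week.
ranks_back <- ranks_long %>%
  pivot_wider(
    names_from = week,
    values_from = rank,
    names_prefix = "wk"
  )
```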
Every dplyr verb: (1) takes a data frame first, (2) refers to columns by bare name, (3) returns a new data frame.
| Acts on | Verbs |
|---|---|
| Rows | filter(), arrange(), distinct() |
| Columns | mutate(), select(), rename(), relocate() |
| Groups | group_by(), summarize(), slice_max() / slice_min() |
Before — flights, untouched:
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
The pipeline:
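A minimal sketch of a pipeline that fits this before/after pair. It assumes `flights` comes from the nycflights13 package, that "IAH" is matched on the `dest` column, and that speed is distance over air time converted to miles per hour; none of those details are spelled out here, so treat them as assumptions.

```r
library(dplyr)
library(nycflights13)

iah_fastest <- flights %>%
  filter(dest == "IAH") %>%                        # keep rows: flights to IAH
  mutate(speed = distance / air_time * 60) %>%     # new column (assumed formula, mph)
  select(year:day, dep_time, carrier, speed) %>%   # keep a handful of columns
  arrange(desc(speed))                             # fastest first

head(iah_fastest)
```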
After — filtered to IAH, new speed column, fewer columns, sorted:
# A tibble: 6 × 6
   year month   day dep_time carrier speed
  <int> <int> <int>    <int> <chr>   <dbl>
1  2013     7     9      707 UA       522.
2  2013     8    27     1850 UA       521.
3  2013     8    28      902 UA       519.
4  2013     8    28     2122 UA       519.
5  2013     6    11     1628 UA       515.
6  2013     8    27     1017 UA       515.
Read %>% or |> as “then.” This pipeline reads: take flights, then filter, then mutate, then select, then arrange.
💡 Always include n() alongside any summary statistic. A mean from 3 observations and a mean from 3,000 observations look identical in a table — but you should trust them very differently.
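A small sketch of why the count matters, using invented delays for two hypothetical carriers ("AA" with four flights, "ZZ" with one):

```r
library(dplyr)

# Hypothetical data: carrier ZZ has a single, very late flight.
delays <- tibble(
  carrier = c("AA", "AA", "AA", "AA", "ZZ"),
  dep_delay = c(5, 12, -3, 20, 45)
)

delay_summary <- delays %>%
  group_by(carrier) %>%
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n()   # how many rows each mean is based on
  )

delay_summary
```

Without the `n` column, ZZ simply looks like the worst carrier; with it, the reader can see that its average rests on one flight.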
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
Every plot: data + mapping (aes()) + geom.
Let’s build one up layer by layer so you can see exactly what each piece adds.
Start with data + mapping — aes() just sets up the coordinate system, no geometry yet:
Nothing plots yet — ggplot() + aes() only declares what goes where, not how to draw it.
Add a geom — now there’s something to look at:
Map species to color and shape — a third variable, no new geom required:
Add a second geom — geom_smooth() layers a trend line on top:
Every plot is data + mapping (aes()) + one or more geoms — each + stacks another layer on the same coordinate system.
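The four steps above can be sketched as one build-up, saving each stage as its own object. This assumes `penguins` comes from the palmerpenguins package (which uses the `flipper_length_mm` and `body_mass_g` column names); the object names are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# 1. Data + mapping only: axes are set up, but nothing is drawn.
p_empty <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g))

# 2. Add a geom: one point per penguin.
p_points <- p_empty + geom_point()

# 3. Map species to color and shape inside the point layer.
p_species <- p_empty +
  geom_point(aes(color = species, shape = species))

# 4. Add a second geom: a linear trend line on top of the points.
p_trend <- p_species +
  geom_smooth(method = "lm")

# Print any stage to see what that layer added, e.g.:
# p_trend
```

Because species is mapped only inside `geom_point()`, the smooth layer draws a single line for all penguins rather than one per species.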
| Question | Geom(s) |
|---|---|
| One categorical variable | geom_bar() |
| One numerical variable | geom_histogram(), geom_density() |
| Numerical vs. categorical | geom_boxplot() |
| Two numerical variables | geom_point(), geom_smooth() |
| 3+ variables | extra aesthetics (color, shape) or facet_wrap() |
Map a variable to an aesthetic when you want it to vary by data; set it (color = "blue") when you just want a fixed look.
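A side-by-side sketch of mapping versus setting, again assuming `penguins` from the palmerpenguins package:

```r
library(ggplot2)
library(palmerpenguins)

base <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g))

# Mapped: color goes inside aes(), so it varies with the data and gets a legend.
p_mapped <- base + geom_point(aes(color = species))

# Set: color goes outside aes(), so every point is the same fixed blue.
p_set <- base + geom_point(color = "blue")
```

A common slip is writing `aes(color = "blue")`, which treats "blue" as a data value and produces a legend with one entry instead of blue points.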
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  )
ggsave("penguin-plot.png")
Same styling rules from the workflow section apply here: treat + like %>%, with a space before it, at the end of the line, and one argument per line once a call gets long.
library(tidyverse)
flights_summary <- read_csv("data/flights.csv") %>%          # import
  drop_na(dep_delay, arr_delay) %>%                          # tidy / clean
  filter(month %in% 6:8) %>%                                 # transform
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay), n = n())
ggplot(flights_summary, aes(x = carrier, y = avg_delay)) +   # visualize
  geom_col()
This is the whole game in a few lines: import → tidy → transform → visualize. Everything else in the book is depth on one of these four steps.
[R] tag
80% of the time, just writing a clean minimal example reveals the bug to you before anyone has to answer.
The four tools: readr (import) → tidyr (tidy) → dplyr (transform) → ggplot2 (visualize)
The habits that make it sustainable: descriptive names, consistent style, RStudio projects with relative paths, restarting R often, and knowing how to ask for help.
We’ll go much deeper on each of these tools in upcoming lessons — today was about seeing how they fit together.
Goal for the remainder of the meeting: Everyone will…
Open the class repo on GitHub:
You should now see the repo under your username
From your fork on GitHub:
Inside the project (on your computer in Positron), go to:
Inside your code/ folder:
Create a file called: whole-game.qmd
Paste this as the YAML:
Create an R code chunk and enter:
Then save the file!
quarto render
You should now see your file on GitHub!
Within your fork on GitHub.com
You should get a Pages URL, which you can preview. Navigate to your folder to preview it!
This will allow you to merge your work into the class repository. From your fork on GitHub
R for Lifestyle and Brain Health (R-LAB)