Transform

Tools to Work with Common Variable Types

Brian C. Helsel, PhD

University of Kansas Medical Center

July 1, 2026

Logical Vectors

Commonly created with a numeric comparison operator like <, <=, >, >=, !=, and == inside dplyr::filter.

flights <- nycflights13::flights

flights |>
  dplyr::filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)

#> # A tibble: 172,286 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1     1      601            600         1      844
#>  2  2013     1     1      602            610        -8      812
#>  3  2013     1     1      602            605        -3      821
#>  4  2013     1     1      606            610        -4      858
#>  5  2013     1     1      606            610        -4      837
#>  6  2013     1     1      607            607         0      858
#>  7  2013     1     1      611            600        11      945
#>  8  2013     1     1      613            610         3      925
#>  9  2013     1     1      615            615         0      833
#> 10  2013     1     1      622            630        -8     1017
#> # ℹ 172,276 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>
#> # ℹ Use `print(n = ...)` to see more rows

Keep Logical Vectors with Mutate

Creating logical variables with mutate before filtering can be useful for more complicated logic.

flights |>
  dplyr::mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
    .keep = "used"
  )

#> # A tibble: 336,776 × 4
#>    dep_time arr_delay daytime approx_ontime
#>       <int>     <dbl> <lgl>   <lgl>
#>  1      517        11 FALSE   TRUE
#>  2      533        20 FALSE   FALSE
#>  3      542        33 FALSE   FALSE
#>  4      544       -18 FALSE   TRUE
#>  5      554       -25 FALSE   FALSE
#>  6      554        12 FALSE   TRUE
#>  7      555        19 FALSE   TRUE
#>  8      557       -14 FALSE   TRUE
#>  9      557        -8 FALSE   TRUE
#> 10      558         8 FALSE   TRUE
#> # ℹ 336,766 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Floating Point Comparison

Using == with decimal numbers may not give the matches you expect because computers store numbers with a fixed number of decimal places.

x <- c(1 / 49 * 49, sqrt(2)^2)

print(x)
#> [1] 1 2

x == c(1, 2)
#> [1] FALSE FALSE

print(x, digits = 16)
#> [1] 0.9999999999999999 2.0000000000000004

The dplyr::near function will ignore small differences.

dplyr::near(x, c(1, 2))
#> [1] TRUE TRUE
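As a rough sketch of the idea behind near, the same check can be written by hand as an absolute difference compared against a small tolerance. The tolerance used here, .Machine$double.eps^0.5, is the documented default of dplyr::near; the names target, tol, and close_enough are illustrative only.

```r
x <- c(1 / 49 * 49, sqrt(2)^2)
target <- c(1, 2)

# Tolerance: square root of machine epsilon (roughly 1.5e-8)
tol <- .Machine$double.eps^0.5

# TRUE where the two values differ by less than the tolerance
close_enough <- abs(x - target) < tol
close_enough
#> [1] TRUE TRUE
```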

Missing Values

Operations involving an unknown value will be unknown.

NA > 5
#> [1] NA

10 == NA
#> [1] NA

NA == NA
#> [1] NA

Since filtering with variable == NA returns NA rather than TRUE or FALSE, it cannot be used to identify missing values. Instead, use the is.na() function to test whether a value is missing.

is.na(c(TRUE, NA, FALSE))
#> [1] FALSE TRUE FALSE

flights |>
  dplyr::filter(is.na(dep_time))

#> # A tibble: 8,255 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1     1       NA           1630        NA       NA
#>  2  2013     1     1       NA           1935        NA       NA
#>  3  2013     1     1       NA           1500        NA       NA
#>  4  2013     1     1       NA            600        NA       NA
#>  5  2013     1     2       NA           1540        NA       NA
#>  6  2013     1     2       NA           1620        NA       NA
#>  7  2013     1     2       NA           1355        NA       NA
#>  8  2013     1     2       NA           1420        NA       NA
#>  9  2013     1     2       NA           1321        NA       NA
#> 10  2013     1     2       NA           1545        NA       NA
#> # ℹ 8,245 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>
#> # ℹ Use `print(n = ...)` to see more rows

Boolean Algebra

You can combine multiple logical vectors together with Boolean algebra.

  • & is and
  • | is or
  • ! is not
  • xor is xor

xor is TRUE if x is TRUE or y is TRUE but not both
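A small base R sketch of the four operators on two illustrative vectors (a and b are made-up names). The last lines show how a missing value behaves: the result is only NA when the answer actually depends on the unknown value.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

and_ab <- a & b      # TRUE only where both are TRUE
or_ab  <- a | b      # TRUE where at least one is TRUE
not_a  <- !a         # flips each value
xor_ab <- xor(a, b)  # TRUE where exactly one is TRUE

xor_ab
#> [1] FALSE  TRUE  TRUE FALSE

# Missing values: the known side can decide the answer
NA | TRUE
#> [1] TRUE
NA & FALSE
#> [1] FALSE
NA & TRUE
#> [1] NA
```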

The %in% Operator

Returns a logical vector the same length as x that is TRUE when a value in x is anywhere in y.

letters[1:10] %in% c("a", "e", "i", "o", "u")
#> [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

# Find all flights in November and December
flights |>
  dplyr::filter(month %in% c(11, 12)) |>
  dplyr::count(month)

#> # A tibble: 2 × 2
#>   month     n
#>   <int> <int>
#> 1    11 27268
#> 2    12 28135
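One reason to prefer %in% over == for this kind of test is recycling: == compares element by element and silently recycles the shorter vector, which rarely gives the intended answer. The sketch below uses a made-up month vector to show the difference, and also shows that %in% treats NA as an ordinary value to match.

```r
month <- c(1, 11, 12, 5)

# == recycles c(11, 12): compares 1 == 11, 11 == 12, 12 == 11, 5 == 12
recycled <- month == c(11, 12)
recycled
#> [1] FALSE FALSE FALSE FALSE

# %in% asks, for each month, whether it appears anywhere in c(11, 12)
matched <- month %in% c(11, 12)
matched
#> [1] FALSE  TRUE  TRUE FALSE

# Unlike ==, %in% returns TRUE or FALSE for NA rather than NA
c(1, NA) %in% NA
#> [1] FALSE  TRUE
```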

Logical Summaries

Two main logical summaries exist:

  • any(x) is the equivalent of |; it returns TRUE if any value of x is TRUE
  • all(x) is the equivalent of &; it returns TRUE only if every value of x is TRUE

flights |>
  dplyr::group_by(year, month, day) |>
  dplyr::summarise(
    all_delayed = all(dep_delay <= 60, na.rm = TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )

#> # A tibble: 365 × 5
#>     year month   day all_delayed any_long_delay
#>    <int> <int> <int> <lgl>       <lgl>
#>  1  2013     1     1 FALSE       TRUE
#>  2  2013     1     2 FALSE       TRUE
#>  3  2013     1     3 FALSE       FALSE
#>  4  2013     1     4 FALSE       FALSE
#>  5  2013     1     5 FALSE       TRUE
#>  6  2013     1     6 FALSE       FALSE
#>  7  2013     1     7 FALSE       TRUE
#>  8  2013     1     8 FALSE       FALSE
#>  9  2013     1     9 FALSE       TRUE
#> 10  2013     1    10 FALSE       TRUE
#> # ℹ 355 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Numeric Summaries

A logical vector in a numeric context becomes 1 for TRUE and 0 for FALSE. This makes functions like sum and mean useful with logical vectors as they give the number and proportion of TRUE values.

flights |>
  dplyr::group_by(year, month, day) |>
  dplyr::summarize(
    proportion_delayed = mean(dep_delay <= 60, na.rm = TRUE),
    count_long_delay = sum(arr_delay >= 300, na.rm = TRUE),
    .groups = "drop"
  )

#> # A tibble: 365 × 5
#>     year month   day proportion_delayed count_long_delay
#>    <int> <int> <int>              <dbl>            <int>
#>  1  2013     1     1              0.939                3
#>  2  2013     1     2              0.914                3
#>  3  2013     1     3              0.941                0
#>  4  2013     1     4              0.953                0
#>  5  2013     1     5              0.964                1
#>  6  2013     1     6              0.959                0
#>  7  2013     1     7              0.956                1
#>  8  2013     1     8              0.975                0
#>  9  2013     1     9              0.986                1
#> 10  2013     1    10              0.977                2
#> # ℹ 355 more rows
#> # ℹ Use `print(n = ...)` to see more rows

There are several other useful functions for summarizing data like length, median, range, sd, IQR, quantile, min, max, first, last, etc.

Conditional Transformations

Use a logical vector to return one value where the condition is TRUE and another where it is FALSE with ifelse or dplyr::if_else. The main difference between these functions is an extra argument in if_else that defines what is returned for missing values.

x <- c(-3:-1, 1:3, NA)

ifelse(x > 0, "pos", "neg")
#> [1] "neg" "neg" "neg" "pos" "pos" "pos" NA

dplyr::if_else(x > 0, "pos", "neg", "???")
#> [1] "neg" "neg" "neg" "pos" "pos" "pos" "???"

You can use dplyr::case_when when you need a flexible way of performing different computations for different conditions. You can also define a .default value for when there are no matches.

y <- c(-3:3, NA)
dplyr::case_when(
  y == 0 ~ "zero",
  y > 0 ~ "pos",
  y < 0 ~ "neg",
  .default = "???"
)
#> [1] "neg"  "neg"  "neg"  "zero" "pos"  "pos"  "pos"  "???"
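Conditions in case_when are checked in order and the first one that is TRUE wins, so overlapping conditions can be written from most to least specific. The sketch below uses a made-up delay vector and illustrative category names to show this.

```r
delay <- c(-20, 5, 45, 200, NA)

status <- dplyr::case_when(
  is.na(delay) ~ "missing",  # handle NA first
  delay <= 15  ~ "on time",  # early and slightly late flights
  delay <= 60  ~ "late",     # only reached when delay > 15
  .default = "very late"     # everything else
)

status
#> [1] "on time"   "on time"   "late"      "very late" "missing"
```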

Counting Numbers

Counts of character or factor variables are great for data exploration and checks during analysis. You can optionally sort to see the most common values.

dplyr::count(flights, dest, sort = TRUE)

#> # A tibble: 105 × 2
#>    dest      n
#>    <chr> <int>
#>  1 ORD   17283
#>  2 ATL   17215
#>  3 LAX   16174
#>  4 BOS   15508
#>  5 MCO   14082
#>  6 CLT   14064
#>  7 SFO   13331
#>  8 FLL   12055
#>  9 MIA   11728
#> 10 DCA    9705
#> # ℹ 95 more rows
#> # ℹ Use `print(n = ...)` to see more rows

You can accomplish the same computation using dplyr::summarize which allows you to compute other summaries at the same time.

flights |>
  dplyr::group_by(dest) |>
  dplyr::summarise(
    n = dplyr::n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) |>
  dplyr::arrange(dplyr::desc(n))

#> # A tibble: 105 × 3
#>    dest      n  delay
#>    <chr> <int>  <dbl>
#>  1 ORD   17283  5.88
#>  2 ATL   17215 11.3
#>  3 LAX   16174  0.547
#>  4 BOS   15508  2.91
#>  5 MCO   14082  5.45
#>  6 CLT   14064  7.36
#>  7 SFO   13331  2.67
#>  8 FLL   12055  8.08
#>  9 MIA   11728  0.299
#> 10 DCA    9705  9.07
#> # ℹ 95 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Modular Arithmetic

Modular arithmetic is the technical name for the type of math you did before you learned about decimal places: division that yields a whole number and a remainder. In R, %/% does integer division and %% calculates the remainder.

flights$sched_dep_time[1:10]
#> [1] 515 529 540 545 600 558 600 600 600 600

(hour <- flights$sched_dep_time[1:10] %/% 100)
#> [1] 5 5 5 5 6 5 6 6 6 6

(minute <- flights$sched_dep_time[1:10] %% 100)
#> [1] 15 29 40 45  0 58  0  0  0  0
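One use for the two operators together is converting a clock time stored as HHMM into minutes since midnight. The sketch below uses a few made-up times in the same format as sched_dep_time; the variable names are illustrative.

```r
sched_dep_time <- c(515, 529, 600, 1545)

hour   <- sched_dep_time %/% 100  # whole hours
minute <- sched_dep_time %% 100   # leftover minutes

minutes_since_midnight <- hour * 60 + minute
minutes_since_midnight
#> [1] 315 329 360 945
```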

Rounding

The round function uses banker’s rounding, or “round half to even”, which means that if a number is halfway between two integers, it will be rounded to the even integer.

round(c(1.5, 2.5))
#> [1] 2 2

You can control the number of digits rounded or use floor to round down or ceiling to round up.

x <- 123.456

round(x, 1)
#> [1] 123.5

floor(x)
#> [1] 123

ceiling(x)
#> [1] 124

The floor and ceiling functions do not have a digits argument like round, but you can scale down, round, and then scale back up.

# Round down to two decimal places
floor(x / 0.01) * 0.01
#> [1] 123.45

# Round up to two decimal places
ceiling(x / 0.01) * 0.01
#> [1] 123.46
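The same scale, round, and rescale trick works with round to reach the nearest multiple of any number, and a negative digits value rounds to the left of the decimal point. A short sketch with the same x:

```r
x <- 123.456

# Round to the nearest multiple of 4
nearest_4 <- round(x / 4) * 4
nearest_4
#> [1] 124

# Round to the nearest 0.25
nearest_quarter <- round(x / 0.25) * 0.25
nearest_quarter
#> [1] 123.5

# Negative digits round to tens, hundreds, and so on
round(x, -1)
#> [1] 120
round(x, -2)
#> [1] 100
```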

Cutting Numbers into Ranges

The cut() function can be used to break up a numeric vector into discrete categories. You can also apply your own custom labels.

x <- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = c(0, 5, 10, 15, 20), labels = c("sm", "md", "lg", "xl"))
#> [1] sm sm sm md lg xl
#> Levels: sm md lg xl

Review the documentation of the cut function to explore other arguments like right and include.lowest.

Cumulative and Rolling Aggregates

Base R provides functions for cumulative sums (cumsum), products (cumprod), mins (cummin), and maxes (cummax).

x <- 1:10
cumsum(x)
#> [1]  1  3  6 10 15 21 28 36 45 55
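The other cumulative functions work the same way. Base R has no dedicated rolling-window function, so the rolling mean below is only one possible sketch: it uses stats::embed to lay out each window of three values as a matrix row and pads the start with NA. The names window and roll_mean are illustrative.

```r
cumprod(1:5)
#> [1]   1   2   6  24 120

cummax(c(1, 3, 2, 5, 4))
#> [1] 1 3 3 5 5

cummin(c(5, 3, 4, 1, 2))
#> [1] 5 3 3 1 1

# Rolling mean over the current value and the two before it
x <- 1:10
window <- 3
roll_mean <- c(rep(NA, window - 1), rowMeans(embed(x, window)))
roll_mean
#> [1] NA NA  2  3  4  5  6  7  8  9
```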

Ranks

The dplyr package provides different ranking functions including:

  • min_rank: Gives every tie the same (smallest) value
  • dense_rank: Works like min_rank, but doesn’t leave any gaps
  • percent_rank: Counts the total number of values less than x, and divides it by the number of observations minus 1
  • cume_dist: Counts the number of values less than or equal to x, and divides it by the number of observations

df <- tibble::tibble(x = c(1, 5, 5, 17, 22))

df |>
  dplyr::mutate(
    row_number = dplyr::row_number(x),
    min_rank = dplyr::min_rank(x),
    dense_rank = dplyr::dense_rank(x),
    percent_rank = dplyr::percent_rank(x),
    cume_dist = dplyr::cume_dist(x)
  )

#> # A tibble: 5 × 6
#>       x row_number min_rank dense_rank percent_rank cume_dist
#>   <dbl>      <int>    <int>      <int>        <dbl>     <dbl>
#> 1     1          1        1          1         0          0.2
#> 2     5          2        2          2         0.25       0.6
#> 3     5          3        2          2         0.25       0.6
#> 4    17          4        4          3         0.75       0.8
#> 5    22          5        5          4         1          1

Offsets

The lead and lag functions allow you to refer to the values after or before the current value. They return a vector of the same length as the input, padded with NAs at the end (lead) or at the start (lag).

df |>
  dplyr::mutate(
    lag_x = dplyr::lag(x),
    lead_x = dplyr::lead(x)
  )

#> # A tibble: 5 × 3
#>       x lag_x lead_x
#>   <dbl> <dbl>  <dbl>
#> 1     1    NA      5
#> 2     5     1      5
#> 3     5     5     17
#> 4    17     5     22
#> 5    22    17     NA

You can lag or lead by more than one position by using the second argument, n, in the lag and lead functions.
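A short sketch of the n argument, plus two common uses of lag: the change from the previous value and a flag for repeated values. The vector reuses the x values from the ranking example.

```r
x <- c(1, 5, 5, 17, 22)

# Look two positions back
dplyr::lag(x, n = 2)
#> [1] NA NA  1  5  5

# Change from the previous value
change <- x - dplyr::lag(x)
change
#> [1] NA  4  0 12  5

# Is the value the same as the previous one?
repeated <- x == dplyr::lag(x)
repeated
#> [1]    NA FALSE  TRUE FALSE FALSE
```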

Strings

You can create strings with single quotes (') or double quotes ("). If you want to include a double quote inside a string, wrap the string in single quotes.

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string'

You can use \ as an escape to include a literal single or double quote or other special characters in a string.

single_quote <- "\'"
double_quote <- "\""
backslash <- "\\"
x <- c(single_quote, double_quote, backslash)
x
#> [1] "'"  "\"" "\\"
stringr::str_view(x)
#> [1] │ '
#> [2] │ "
#> [3] │ \

Raw Strings

Complex strings requiring multiple quotes or backslashes can get confusing with all the escape characters. Using a raw string can help make the code more readable.

tricky <- "double_quote <- \"\\\"\" # or '\"' single_quote <- '\\'' # or \"'\""
stringr::str_view(tricky)
#> [1] │ double_quote <- "\"" # or '"' single_quote <- '\'' # or "'"

raw_string <- r"(double_quote <- "\"" # or '"' single_quote <- '\'' # or "'")"
stringr::str_view(raw_string)
#> [1] │ double_quote <- "\"" # or '"' single_quote <- '\'' # or "'"

A raw string usually starts with r"( and ends with )", but you can also use r"[]" or r"{}" if your string contains )".

Creating Strings from Data

The stringr::str_c function works like paste0 but is designed to be used within dplyr::mutate, where it combines fixed strings with variables in your dataset. You can also use stringr::str_glue, which lets you embed variables inside {} and avoids the need to split the string into parts with multiple quotation marks.

df <- tibble::tibble(name = c("Jonathan", "David", "Susan"))

df |>
  dplyr::mutate(
    greeting = stringr::str_c("Hi ", name, "!"),
    question = stringr::str_glue("{name}, what is your favorite color?")
  )

#> # A tibble: 3 × 3
#>   name     greeting     question
#>   <chr>    <chr>        <glue>
#> 1 Jonathan Hi Jonathan! Jonathan, what is your favorite color?
#> 2 David    Hi David!    David, what is your favorite color?
#> 3 Susan    Hi Susan!    Susan, what is your favorite color?

The stringr::str_flatten function works with dplyr::summarize to return a single string from a character vector.

df <- tibble::tribble(
  ~name      , ~color   ,
  "Jonathan" , "blue"   ,
  "Jonathan" , "red"    ,
  "David"    , "green"  ,
  "David"    , "yellow" ,
  "David"    , "orange" ,
  "Susan"    , "purple" ,
  "Susan"    , "red"
)

df |>
  dplyr::summarise(
    color = stringr::str_flatten(color, ", "),
    .by = name
  )

#> # A tibble: 3 × 2
#>   name     color
#>   <chr>    <chr>
#> 1 Jonathan blue, red
#> 2 David    green, yellow, orange
#> 3 Susan    purple, red

Extracting Data from Strings

Functions in the tidyr package can help you separate a single string into multiple rows or columns.

Separating into Rows

tibble::tibble(x = c("a,b,c", "d,e", "f")) |>
  tidyr::separate_longer_delim(x, delim = ",")

#> # A tibble: 6 × 1
#>   x
#>   <chr>
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
#> 6 f

Separating into Columns

# By delimiter
tibble::tibble(x = c("07.01.2026", "07.02.2026", "07.03.2026")) |>
  tidyr::separate_wider_delim(
    x,
    delim = ".",
    names = c("month", "day", "year")
  )

# By widths
tibble::tibble(x = c("07012026", "07022026", "07032026")) |>
  tidyr::separate_wider_position(
    x,
    widths = c(month = 2, day = 2, year = 4)
  )

#> # A tibble: 3 × 3
#>   month day   year
#>   <chr> <chr> <chr>
#> 1 07    01    2026
#> 2 07    02    2026
#> 3 07    03    2026

Regular Expressions

A concise and powerful language for describing patterns within strings (often abbreviated as regex or regexp). The simplest regex patterns consist of letters or numbers that are a direct match.

fruit <- stringr::fruit
stringr::str_view(fruit, "berry")

#>  [6] │ bil<berry>
#>  [7] │ black<berry>
#> [10] │ blue<berry>
#> [11] │ boysen<berry>
#> [19] │ cloud<berry>
#> [21] │ cran<berry>
#> [29] │ elder<berry>
#> [32] │ goji <berry>
#> [33] │ goose<berry>
#> [38] │ huckle<berry>
#> [50] │ mul<berry>
#> [70] │ rasp<berry>
#> [73] │ salal <berry>
#> [76] │ straw<berry>

Metacharacters

Most punctuation characters like ., +, *, [, ], and ? have special meanings in regex and are called metacharacters. For example, . matches any character, so “a...e” will match any string that contains an “a” followed by any three characters and then an “e”.

stringr::str_view(fruit, "a...e")

#>  [1] │ <apple>
#>  [7] │ bl<ackbe>rry
#> [48] │ mand<arine>
#> [51] │ nect<arine>
#> [62] │ pine<apple>
#> [64] │ pomegr<anate>
#> [70] │ r<aspbe>rry
#> [73] │ sal<al be>rry

Quantifiers

Control how many times a pattern can match:

  • ? makes a pattern optional (i.e., it matches 0 or 1 times)
  • + lets a pattern repeat (i.e., it matches it at least once)
  • * lets a pattern be optional or repeat (i.e., it matches any number of times, including 0)

# Matches an ap, optionally followed by an r
stringr::str_view(fruit, "apr?")

#>  [1] │ <ap>ple
#>  [2] │ <apr>icot
#> [34] │ gr<ap>e
#> [35] │ gr<ap>efruit
#> [56] │ p<ap>aya
#> [62] │ pine<ap>ple

# Matches an ap, followed by at least one r
stringr::str_view(fruit, "apr+")
#>  [2] │ <apr>icot

You can use {n}, {n,}, and {n,m} to match exactly n times, at least n times, or between n and m times.
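A minimal sketch of the three counted quantifiers, using a made-up vector of repeated letters. The patterns are anchored with ^ and $ so that the whole string has to consist of the repeated a.

```r
x <- c("a", "aa", "aaa", "aaaa")

# Exactly two
exactly_2 <- stringr::str_detect(x, "^a{2}$")
exactly_2
#> [1] FALSE  TRUE FALSE FALSE

# Two or more
at_least_2 <- stringr::str_detect(x, "^a{2,}$")
at_least_2
#> [1] FALSE  TRUE  TRUE  TRUE

# Between two and three
between_2_3 <- stringr::str_detect(x, "^a{2,3}$")
between_2_3
#> [1] FALSE  TRUE  TRUE FALSE
```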

Character Classes

Defined by [] and lets you match a set of characters. You can also invert the match by starting with ^. This matches anything except the listed characters.

words <- stringr::words

# Find an x surrounded by vowels
stringr::str_view(words, "[aeiou]x[aeiou]")
#> [284] │ <exa>ct
#> [285] │ <exa>mple
#> [288] │ <exe>rcise
#> [289] │ <exi>st

# Find a y surrounded by consonants
stringr::str_view(words, "[^aeiou]y[^aeiou]")
#> [836] │ <sys>tem
#> [901] │ <typ>e

You can also use a - to define a range of letters or numbers.

x <- "abcd ABCD 12345 -!@#%."

stringr::str_view(x, "[A-Z]")
#> [1] │ abcd <A><B><C><D> 12345 -!@#%.

stringr::str_view(x, "[0-9]")
#> [1] │ abcd ABCD <1><2><3><4><5> -!@#%.

Character Class Shortcuts

  • \d matches any digit
  • \D matches anything that is not a digit
  • \s matches any whitespace (e.g., space, tab, newline)
  • \S matches anything that is not a whitespace
  • \w matches any “word” character (i.e., letters, numbers, or an underscore)
  • \W matches any non-word character

stringr::str_view(x, "\\W+")
#> [1] │ abcd< >ABCD< >12345< -!@#%.>

stringr::str_view(x, "\\d+")
#> [1] │ abcd ABCD <12345> -!@#%.

stringr::str_view(x, "\\D+")
#> [1] │ <abcd ABCD >12345< -!@#%.>

Alternation

You can use | to pick between one or more alternative patterns.

stringr::str_view(fruit, "aa|ee|ii|oo|uu")

#>  [9] │ bl<oo>d orange
#> [33] │ g<oo>seberry
#> [47] │ lych<ee>
#> [66] │ purple mangost<ee>n

Detecting Strings in a Dataset

tibble::tibble(fruit = fruit) |>
  dplyr::mutate(berry = stringr::str_detect(fruit, "berry")) |>
  dplyr::count(berry)

#> # A tibble: 2 × 2
#>   berry     n
#>   <lgl> <int>
#> 1 FALSE    66
#> 2 TRUE     14

The stringr::str_detect function can also be used with dplyr::filter and dplyr::summarize. For example, the code below calculates the number and percentage of fruits containing the word “berry”.

tibble::tibble(fruit = fruit) |>
  dplyr::summarize(
    berry_sum = sum(stringr::str_detect(fruit, "berry")),
    berry_prop = mean(stringr::str_detect(fruit, "berry")) * 100
  )

#> # A tibble: 1 × 2
#>   berry_sum berry_prop
#>       <int>      <dbl>
#> 1        14       17.5

The stringr::str_subset and stringr::str_which functions are closely related to stringr::str_detect. The stringr::str_subset function returns a character vector with the strings that match and stringr::str_which returns an integer vector with the positions of the strings that match.
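The relationship between the three functions is easiest to see side by side. The sketch below uses a small made-up vector rather than the full fruit data.

```r
x <- c("apple", "blueberry", "banana", "cranberry")

# Logical vector: does each string match?
stringr::str_detect(x, "berry")
#> [1] FALSE  TRUE FALSE  TRUE

# The matching strings themselves
stringr::str_subset(x, "berry")
#> [1] "blueberry" "cranberry"

# The positions of the matching strings
stringr::str_which(x, "berry")
#> [1] 2 4
```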

Counting Matches

tibble::tibble(fruit = fruit) |>
  dplyr::mutate(
    vowels = stringr::str_count(fruit, "[aeiou]"),
    consonants = stringr::str_count(fruit, "[^aeiou]")
  )

#> # A tibble: 80 × 3
#>    fruit        vowels consonants
#>    <chr>         <int>      <int>
#>  1 apple             2          3
#>  2 apricot           3          4
#>  3 avocado           4          3
#>  4 banana            3          3
#>  5 bell pepper       3          8
#>  6 bilberry          2          6
#>  7 blackberry        2          8
#>  8 blackcurrant      3          9
#>  9 blood orange      5          7
#> 10 blueberry         3          6
#> # ℹ 70 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Removing and Replacing Values

tibble::tibble(fruit = fruit) |>
  dplyr::filter(stringr::str_detect(fruit, "berry")) |>
  dplyr::mutate(
    prefix = stringr::str_remove(fruit, "berry"),
    asterisk = stringr::str_replace(fruit, "berry", "*")
  )

Use stringr::str_remove_all and stringr::str_replace_all to remove or replace all matches as str_remove and str_replace only remove or replace the first match.
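A quick sketch of the difference on a single made-up string with several matches:

```r
x <- "banana"

# First match only
stringr::str_remove(x, "a")
#> [1] "bnana"
stringr::str_replace(x, "a", "-")
#> [1] "b-nana"

# Every match
stringr::str_remove_all(x, "a")
#> [1] "bnn"
stringr::str_replace_all(x, "a", "-")
#> [1] "b-n-n-"
```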

Anchors

Matching the start or end of the string requires an anchor. Use ^ to match the start or $ to match the end of the string.

# All fruits that start with an a
stringr::str_view(fruit, "^a")
#> [1] │ <a>pple
#> [2] │ <a>pricot
#> [3] │ <a>vocado

# All fruits that end with an a
stringr::str_view(fruit, "a$")
#>  [4] │ banan<a>
#> [15] │ cherimoy<a>
#> [30] │ feijo<a>
#> [36] │ guav<a>
#> [56] │ papay<a>
#> [74] │ satsum<a>

Capturing Groups

Parentheses create capturing groups that allow you to use sub-components of the match by referring back to them with a back reference (e.g., \1 refers to the match contained in the first set of parentheses, \2 in the second set, etc.).

# Finds all fruits with a repeated pair of letters
stringr::str_view(fruit, "(..)\\1")
#>  [4] │ b<anan>a
#> [20] │ <coco>nut
#> [22] │ <cucu>mber
#> [41] │ <juju>be
#> [56] │ <papa>ya
#> [73] │ s<alal> berry

# Finds all words that start and end with the same pair of letters
stringr::str_view(words, "^(..).*\\1$")
#> [152] │ <church>
#> [217] │ <decide>
#> [617] │ <photograph>
#> [699] │ <require>
#> [739] │ <sense>

Factors

Factors are used for categorical variables. The forcats package is part of the tidyverse and provides tools for dealing with factors.

# Create a vector of valid levels
weekdays <- c(
  "Sunday",
  "Monday",
  "Tuesday",
  "Wednesday",
  "Thursday",
  "Friday",
  "Saturday"
)

weekday_levels <- substr(weekdays, 1, 3)

x <- c("Tue", "Thu", "Fri")

factor(x, levels = weekday_levels, labels = weekdays)
#> [1] Tuesday  Thursday Friday
#> Levels: Sunday Monday Tuesday Wednesday Thursday Friday Saturday

Missing Factor Levels

Any values that are not in the levels will be silently converted to NA. The forcats::fct function turns this into an error message and may be preferred.

x <- c(x, "Satur")

factor(x, levels = weekday_levels, labels = weekdays)
#> [1] Tuesday  Thursday Friday <NA>
#> Levels: Sunday Monday Tuesday Wednesday Thursday Friday Saturday

forcats::fct(x, levels = weekday_levels)
#> Error in `forcats::fct()`:
#> ! All values of `x` must appear in `levels` or `na`
#> ℹ Missing level: "Satur"

Omitting the levels argument in factor(x) uses the unique values in alphabetical order as the levels. The forcats::fct function orders the levels by first appearance instead.
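A short sketch of the two default orderings, using three of the abbreviated weekdays. The name days is illustrative, and forcats::fct assumes forcats 1.0.0 or later.

```r
days <- c("Tue", "Thu", "Fri")

# factor(): levels sorted alphabetically
levels(factor(days))
#> [1] "Fri" "Thu" "Tue"

# fct(): levels in order of first appearance
levels(forcats::fct(days))
#> [1] "Tue" "Thu" "Fri"
```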

Modifying Factor Order

You can use forcats::fct_reorder to reorder the levels of a factor by a numeric vector. For example, we may want to plot miles per gallon for the cars in the mtcars dataset with the models ordered by miles per gallon.

mtcars_df <- tibble::tibble(model = factor(rownames(mtcars)), mtcars)

mtcars_df_ordered <- mtcars_df |>
  dplyr::mutate(model = forcats::fct_reorder(model, mpg))

[Figure: two panels, “Not Ordered by Miles Per Gallon” and “Ordered by Miles Per Gallon”]

Other Useful Functions for Factors

Function                   Purpose
fct_reorder2(.f, .x, .y)   Reorder the factor .f by the .y values associated with the largest .x values
fct_infreq(f)              Order factor levels by frequency (most common first)
fct_rev(f)                 Reverse the order of factor levels
fct_recode(.f, ...)        Rename factor levels
fct_collapse(.f, ...)      Combine multiple levels into broader categories
fct_lump(f, n, prop)       Combine infrequent levels into "Other"
fct_lump_n(f, n)           Keep the top n levels by frequency and lump the rest
fct_lump_min(f, min)       Lump levels occurring fewer than a minimum count
fct_lump_prop(f, prop)     Lump levels whose proportion is below a threshold
fct_lump_lowfreq(f)        Lump the lowest-frequency levels into "Other"

Dates and Times

ISO8601 is the international standard for writing dates where the components of the dates are organized from largest to smallest and separated with a - (e.g., 2026-07-01). This format can also include times where the hour, minute, and second are separated by a : (e.g., 2026-07-01 09:00:00).

Describing Date and Time Formats

You can describe a date or time format with a % followed by a single character.

  • Year: %Y (four digits) or %y (two digits)
  • Month: %m (number), %b (abbreviated name like Jul), %B (full name)
  • Day: %d (one or two digits) or %e (two digits like 07)
  • Time: %H (24-hour hour), %I (12-hour hour), %M (minute), %S (second)
  • Other Time: %p (AM/PM), %Z (time zone name), %z (time zone offset)
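As a rough illustration of how these codes are combined into a format string, the sketch below parses a few made-up inputs. It uses base R's as.Date and as.POSIXct rather than a tidyverse parser (an assumption made so the example runs without extra packages); they accept the same codes for the fields shown here.

```r
# Month/day/four-digit year separated by slashes
d1 <- as.Date("07/01/2026", format = "%m/%d/%Y")

# Day.month.two-digit year separated by periods
d2 <- as.Date("01.07.26", format = "%d.%m.%y")

d1
#> [1] "2026-07-01"
d1 == d2
#> [1] TRUE

# Date plus a 24-hour time
dt <- as.POSIXct("2026-07-01 09:00", format = "%Y-%m-%d %H:%M", tz = "UTC")

# The same codes work in reverse to format output
format(dt, "%H:%M")
#> [1] "09:00"
```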

Creating Dates from Strings

The lubridate package is designed to make working with dates and times easier and more intuitive.

lubridate::ymd("2026-07-01")
#> [1] "2026-07-01"

lubridate::mdy("July 1, 2026")
#> [1] "2026-07-01"

lubridate::mdy_hm("07/01/2026 09:00")
#> [1] "2026-07-01 09:00:00 UTC"

Creating Dates from Components

flights <- nycflights13::flights

flights |>
  dplyr::select(year, month, day, hour, minute) |>
  dplyr::mutate(
    departure = lubridate::make_datetime(year, month, day, hour, minute)
  )

#> # A tibble: 336,776 × 6
#>     year month   day  hour minute departure
#>    <int> <int> <int> <dbl>  <dbl> <dttm>
#>  1  2013     1     1     5     15 2013-01-01 05:15:00
#>  2  2013     1     1     5     29 2013-01-01 05:29:00
#>  3  2013     1     1     5     40 2013-01-01 05:40:00
#>  4  2013     1     1     5     45 2013-01-01 05:45:00
#>  5  2013     1     1     6      0 2013-01-01 06:00:00
#>  6  2013     1     1     5     58 2013-01-01 05:58:00
#>  7  2013     1     1     6      0 2013-01-01 06:00:00
#>  8  2013     1     1     6      0 2013-01-01 06:00:00
#>  9  2013     1     1     6      0 2013-01-01 06:00:00
#> 10  2013     1     1     6      0 2013-01-01 06:00:00
#> # ℹ 336,766 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Retrieving Components from a Date

You can extract individual components from a date or time with functions like year, month, mday (day of the month), yday (day of the year), wday (day of the week), hour, minute, and second. For month and wday, set label = TRUE to return the name instead of a number and abbr = FALSE to return the full name.

datetime <- lubridate::ymd_hms("2026-07-01 09:00:00")

lubridate::year(datetime)
#> [1] 2026

lubridate::month(datetime, label = TRUE)
#> [1] Jul
#> Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec

lubridate::yday(datetime)
#> [1] 182

lubridate::wday(datetime, label = TRUE, abbr = FALSE)
#> [1] Wednesday
#> Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday

Rounding

The lubridate package also has round_date, floor_date, and ceiling_date to adjust the date to the nearest second, minute, hour, day, week, month, bimonth, quarter, season, halfyear and year.

lubridate::round_date(datetime, unit = "week")
#> [1] "2026-06-28 UTC"

lubridate::floor_date(datetime, unit = "week")
#> [1] "2026-06-28 UTC"

lubridate::ceiling_date(datetime, unit = "week")
#> [1] "2026-07-05 UTC"

Modifying Date Components

You can also use each accessor function to modify the components of a date or time or use update to modify multiple components at the same time.

lubridate::day(datetime) <- 8

datetime
#> [1] "2026-07-08 09:00:00 UTC"

lubridate::hour(datetime) <- lubridate::hour(datetime) + 1

datetime
#> [1] "2026-07-08 10:00:00 UTC"

update(datetime, day = 7, hour = 9)
#> [1] "2026-07-07 09:00:00 UTC"

Time Spans

You can also perform addition, subtraction, and division with date and time objects. Time spans can be represented by durations (an exact number of seconds), periods (human units like weeks and months), and intervals (a starting and ending point).

# How old is a person born on January 1, 1990 as of the date stored in datetime (July 8, 2026 after the modifications above)

age <- lubridate::date(datetime) - lubridate::ymd("1990-01-01")

age
#> Time difference of 13337 days

A difftime class object records a time span in seconds, minutes, hours, days, or weeks. The lubridate as.duration function always records in seconds. The package also includes convenient constructors like dseconds, dminutes, dhours, ddays, dweeks, and dyears for adding durations and minutes, hours, days, months, and years for adding periods.

lubridate::dyears(3)
#> [1] "94672800s (~3 years)"

datetime + lubridate::dyears(3)
#> [1] "2029-07-08 04:00:00 UTC"

datetime + lubridate::years(3)
#> [1] "2029-07-08 10:00:00 UTC"

Periods are time spans without a fixed length in seconds and are more likely than durations to do what you would expect.

Intervals

An interval is a pair of starting and ending dates. You can think of it as a duration with a starting point.

`%--%` <- lubridate::`%--%`
y2024 <- lubridate::ymd("2024-01-01") %--% lubridate::ymd("2025-01-01")
y2026 <- lubridate::ymd("2026-01-01") %--% lubridate::ymd("2027-01-01")

y2024
#> [1] 2024-01-01 UTC--2025-01-01 UTC

y2026
#> [1] 2026-01-01 UTC--2027-01-01 UTC

y2024 / lubridate::days(1)
#> [1] 366

y2026 / lubridate::days(1)
#> [1] 365
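Dividing an interval by a period is also a convenient way to express an age in whole years. The sketch below assumes a birth date of January 1, 1990 and a reference date of July 1, 2026; the names birth, today, span, and age_years are illustrative.

```r
birth <- lubridate::ymd("1990-01-01")
today <- lubridate::ymd("2026-07-01")

# Interval from birth to the reference date
span <- lubridate::interval(birth, today)

# Length of the interval in days
span / lubridate::days(1)
#> [1] 13330

# Length in years, rounded down to completed years
age_years <- span / lubridate::years(1)
floor(age_years)
#> [1] 36
```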

Missing Values

You can use last observation carried forward with tidyr::fill or fill in fixed values with dplyr::coalesce. An NA can be added using dplyr::na_if if a value like -99 represents a missing value.

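# Example data, reconstructed from the printed output
anthro <- tibble::tribble(
  ~subject, ~timepoint, ~height, ~weight,
      1001,          1,     165,      65,
      1001,          2,      NA,      67,
      1001,          3,      NA,      68,
      1002,          1,     170,      72,
      1002,          3,      NA,     -99
)

print(anthro)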

#> # A tibble: 5 × 4
#>   subject timepoint height weight
#>     <dbl>     <dbl>  <dbl>  <dbl>
#> 1    1001         1    165     65
#> 2    1001         2     NA     67
#> 3    1001         3     NA     68
#> 4    1002         1    170     72
#> 5    1002         3     NA    -99

tidyr::fill(anthro, height)[["height"]]
#> [1] 165 165 165 170 170

dplyr::coalesce(anthro$height, mean(anthro$height, na.rm = TRUE))
#> [1] 165.0 167.5 167.5 170.0 167.5

dplyr::na_if(anthro$weight, -99)
#> [1] 65 67 68 72 NA

Implicit Missing Values

Explicit missing values are usually represented by NA, NaN, or a negative number like -99. Implicit missing values are rows that are simply absent from the dataset. You can use tidyr::complete to add the missing rows.

tidyr::complete(anthro, subject, timepoint = 1:3)

#> # A tibble: 6 × 4
#>   subject timepoint height weight
#>     <dbl>     <dbl>  <dbl>  <dbl>
#> 1    1001         1    165     65
#> 2    1001         2     NA     67
#> 3    1001         3     NA     68
#> 4    1002         1    170     72
#> 5    1002         2     NA     NA
#> 6    1002         3     NA    -99

Joins

Multiple data frames can be joined together with mutating or filtering joins. Mutating joins add new variables to one data frame from matching observations in another. Filtering joins filter observations from one data frame based on whether or not they match an observation in another.
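The join examples in this section operate on two small data frames named x and y whose definitions are not shown. The sketch below is a reconstruction from the printed outputs, so treat the exact definitions as an assumption. It also passes by = "key" explicitly, which avoids the message dplyr prints when it has to guess the join columns.

```r
# Assumed definitions, reconstructed from the join outputs
x <- tibble::tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)

y <- tibble::tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)

# Naming the key column makes the match explicit
inner <- dplyr::inner_join(x, y, by = "key")
full  <- dplyr::full_join(x, y, by = "key")

nrow(inner)
#> [1] 2
nrow(full)
#> [1] 4
```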

Inner Join

Only observations that match in both data frames are kept.

dplyr::inner_join(x, y)

#> # A tibble: 2 × 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1
#> 2     2 x2    y2

Outer joins (left, right, and full) are the counterpart to an inner join: they keep observations that appear in at least one of the data frames.

Left Join

Keeps all observations of X.

dplyr::left_join(x, y)

#> # A tibble: 3 × 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1
#> 2     2 x2    y2
#> 3     3 x3    NA

Right Join

Keeps all observations of Y.

dplyr::right_join(x, y)

#> # A tibble: 3 × 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1
#> 2     2 x2    y2
#> 3     4 NA    y3

Full Join

Keeps all observations in X and Y.

dplyr::full_join(x, y)

#> # A tibble: 4 × 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1
#> 2     2 x2    y2
#> 3     3 x3    NA
#> 4     4 NA    y3

Semi-Join

Keeps observations in X that appear in Y but does not join Y.

dplyr::semi_join(x, y)

#> # A tibble: 2 × 2
#>     key val_x
#>   <dbl> <chr>
#> 1     1 x1
#> 2     2 x2

Anti-Join

Keeps observations in X that do not appear in Y but does not join Y.

dplyr::anti_join(x, y)

#> # A tibble: 1 × 2
#>     key val_x
#>   <dbl> <chr>
#> 1     3 x3