The dplyr::near function will ignore small differences.
dplyr::near(x, c(1, 2))
#> [1] TRUE TRUE
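As a small self-contained illustration (the value below is an assumption chosen for the example, not the x used above), squaring a square root typically misses the exact value by a tiny floating-point error. An exact comparison with == reports the difference, while near tolerates it:

```r
# Hypothetical value: sqrt(2)^2 is not stored as exactly 2
value <- sqrt(2)^2

value == 2              # exact comparison is FALSE
dplyr::near(value, 2)   # comparison within a small tolerance is TRUE
```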
Missing Values
Operations involving an unknown value will be unknown.
NA > 5
#> [1] NA
10 == NA
#> [1] NA
NA == NA
#> [1] NA
Since filtering with variable == NA returns NA rather than TRUE or FALSE, it cannot be used to identify missing values. Instead, use the is.na() function to test whether a value is missing.
is.na(c(TRUE, NA, FALSE))
#> [1] FALSE  TRUE FALSE

flights |>
  dplyr::filter(is.na(dep_time))
#> # A tibble: 8,255 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1     1       NA           1630        NA       NA
#>  2  2013     1     1       NA           1935        NA       NA
#>  3  2013     1     1       NA           1500        NA       NA
#>  4  2013     1     1       NA            600        NA       NA
#>  5  2013     1     2       NA           1540        NA       NA
#>  6  2013     1     2       NA           1620        NA       NA
#>  7  2013     1     2       NA           1355        NA       NA
#>  8  2013     1     2       NA           1420        NA       NA
#>  9  2013     1     2       NA           1321        NA       NA
#> 10  2013     1     2       NA           1545        NA       NA
#> # ℹ 8,245 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>
#> # ℹ Use `print(n = ...)` to see more rows
Boolean Algebra
You can combine multiple logical vectors together with Boolean algebra.
& is and
| is or
! is not
xor() is exclusive or
xor(x, y) is TRUE if x is TRUE or y is TRUE, but not both
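The operators listed above can be tried on two short logical vectors. This is a minimal base R sketch; the vectors x and y are made up for illustration:

```r
# Hypothetical logical vectors covering all four TRUE/FALSE combinations
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

x & y      # TRUE only where both are TRUE
x | y      # TRUE where at least one is TRUE
!x         # flips every value
xor(x, y)  # TRUE where exactly one is TRUE
```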
The %in% Operator
x %in% y returns a logical vector the same length as x that is TRUE wherever a value in x appears anywhere in y.
letters[1:10] %in% c("a", "e", "i", "o", "u")
#>  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

# Find all flights in November and December
flights |>
  dplyr::filter(month %in% c(11, 12)) |>
  dplyr::count(month)
#> # A tibble: 2 × 2
#>   month     n
#>   <int> <int>
#> 1    11 27268
#> 2    12 28135
A logical vector in a numeric context becomes 1 for TRUE and 0 for FALSE. This makes functions like sum and mean useful with logical vectors as they give the number and proportion of TRUE values.
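A minimal sketch of this idea in base R follows. The delay values and the 15-minute cutoff are assumptions made up for the example:

```r
# Hypothetical arrival delays in minutes
delay_minutes <- c(5, -2, 30, 0, 45)

# Logical vector: TRUE where the delay exceeds 15 minutes
is_late <- delay_minutes > 15

sum(is_late)   # number of TRUE values: 2
mean(is_late)  # proportion of TRUE values: 0.4
```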
There are several other useful functions for summarizing data like length, median, range, sd, IQR, quantile, min, max, first, last, etc.
Conditional Transformations
Use a logical vector to return one value where a condition is TRUE and another where it is FALSE with ifelse or dplyr::if_else. The difference between these functions is an extra argument in if_else, missing, that defines the value returned when the condition is NA.
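The two functions can be compared on a vector containing a missing value. This is a small sketch; the vector x and the labels are made up for illustration:

```r
# Hypothetical numeric vector with one missing value
x <- c(-2, 0, 3, NA)

# Base R: the NA condition yields NA in the result
base_result <- ifelse(x > 0, "pos", "non-pos")

# dplyr: the missing argument supplies a value for NA conditions
dplyr_result <- dplyr::if_else(x > 0, "pos", "non-pos", missing = "unknown")

base_result
dplyr_result
```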
You can use dplyr::case_when when you need a flexible way of performing different computations for different conditions. You can also define a .default value for when there are no matches.
y <- c(-3:3, NA)
dplyr::case_when(
  y == 0 ~ "zero",
  y > 0 ~ "pos",
  y < 0 ~ "neg",
  .default = "???"
)
#> [1] "neg"  "neg"  "neg"  "zero" "pos"  "pos"  "pos"  "???"
Counting Numbers
Counts of character or factor variables are great for data exploration and checks during analysis. You can optionally sort to see the most common values.
dplyr::count(flights, dest, sort = TRUE)
#> # A tibble: 105 × 2
#>    dest      n
#>    <chr> <int>
#>  1 ORD   17283
#>  2 ATL   17215
#>  3 LAX   16174
#>  4 BOS   15508
#>  5 MCO   14082
#>  6 CLT   14064
#>  7 SFO   13331
#>  8 FLL   12055
#>  9 MIA   11728
#> 10 DCA    9705
#> # ℹ 95 more rows
#> # ℹ Use `print(n = ...)` to see more rows
You can accomplish the same computation using dplyr::summarize which allows you to compute other summaries at the same time.
flights |>
  dplyr::group_by(dest) |>
  dplyr::summarise(
    n = dplyr::n(),
    delay = mean(arr_delay, na.rm = TRUE)
  ) |>
  dplyr::arrange(dplyr::desc(n))
#> # A tibble: 105 × 3
#>    dest      n  delay
#>    <chr> <int>  <dbl>
#>  1 ORD   17283  5.88
#>  2 ATL   17215 11.3
#>  3 LAX   16174  0.547
#>  4 BOS   15508  2.91
#>  5 MCO   14082  5.45
#>  6 CLT   14064  7.36
#>  7 SFO   13331  2.67
#>  8 FLL   12055  8.08
#>  9 MIA   11728  0.299
#> 10 DCA    9705  9.07
#> # ℹ 95 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Modular Arithmetic
Modular arithmetic is the technical name for the type of math you did before you learned about decimal places: division that yields a whole number and a remainder. In R, %/% does integer division and %% calculates the remainder.
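One common use is splitting a time stored as a single HHMM number into hours and minutes. The sketch below assumes hypothetical scheduled departure times encoded that way:

```r
# Hypothetical departure times stored as HHMM numbers (e.g., 1630 is 4:30 pm)
sched_dep_time <- c(1630, 1935, 600)

hour <- sched_dep_time %/% 100    # integer division: 16 19 6
minute <- sched_dep_time %% 100   # remainder: 30 35 0

hour
minute
```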
The lead and lag functions allow you to refer to the values just after or just before the current value. They return a vector of the same length as the input, padded with NAs at the start (lag) or at the end (lead).
df |>
  dplyr::mutate(
    lag_x = dplyr::lag(x),
    lead_x = dplyr::lead(x)
  )
#> # A tibble: 5 × 3
#>       x lag_x lead_x
#>   <dbl> <dbl>  <dbl>
#> 1     1    NA      5
#> 2     5     1      5
#> 3     5     5     17
#> 4    17     5     22
#> 5    22    17     NA
You can lag or lead by more than one position by setting the second argument, n, in the lag and lead functions.
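A short sketch of shifting by two positions follows. The vector x is defined here only so that the example runs on its own:

```r
# Example vector, defined locally so the snippet is self-contained
x <- c(1, 5, 5, 17, 22)

lag_2 <- dplyr::lag(x, n = 2)    # NA NA  1  5  5
lead_2 <- dplyr::lead(x, n = 2)  #  5 17 22 NA NA

lag_2
lead_2
```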
Strings
You can create strings with single quotes (') or double quotes ("). If you want to include a double quote inside a string, wrap the string in single quotes.
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string'
You can use \ as an escape to include a literal single or double quote or other special characters in a string.
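A minimal base R sketch of escaping is shown below; the variable names are chosen only for illustration. Printing a string shows the escapes, whereas writeLines shows the raw contents:

```r
# Each string holds exactly one character once the escape is processed
double_quote <- "\""
single_quote <- '\''
backslash <- "\\"

# writeLines displays the actual contents without the escape characters
writeLines(c(double_quote, single_quote, backslash))

nchar(backslash)  # 1
```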
Complex strings requiring multiple quotes or backslashes can get confusing with all the escape characters. Using a raw string can help make the code more readable.
tricky <- "double_quote <- \"\\\"\" # or '\"' single_quote <- '\\'' # or \"'\""
stringr::str_view(tricky)
#> [1] │ double_quote <- "\"" # or '"' single_quote <- '\'' # or "'"

raw_string <- r"(double_quote <- "\"" # or '"' single_quote <- '\'' # or "'")"
stringr::str_view(raw_string)
#> [1] │ double_quote <- "\"" # or '"' single_quote <- '\'' # or "'"
A raw string usually starts with r"( and ends with )", but you can also use r"[]" or r"{}" if your string contains )".
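As an illustration of the alternative delimiters, the sketch below stores a made-up piece of text that itself contains )". It assumes R 4.0.0 or later, where raw strings are available:

```r
# The text say(")") contains )" so the default r"( )" delimiters would end
# the string too early; square brackets avoid the clash
raw_brackets <- r"[say(")")]"

# The same text written with ordinary escapes
escaped <- "say(\")\")"

identical(raw_brackets, escaped)  # TRUE
```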
Creating Strings from Data
The stringr::str_c function is like paste0 but designed to be used within dplyr::mutate. It combines fixed strings with variables in your dataset. You can also use stringr::str_glue, which avoids the need to split the string into separately quoted parts.
df <- tibble::tibble(name = c("Jonathan", "David", "Susan"))
df |>
  dplyr::mutate(
    greeting = stringr::str_c("Hi ", name, "!"),
    question = stringr::str_glue("{name}, what is your favorite color?")
  )
#> # A tibble: 3 × 3
#>   name     greeting     question
#>   <chr>    <chr>        <glue>
#> 1 Jonathan Hi Jonathan! Jonathan, what is your favorite color?
#> 2 David    Hi David!    David, what is your favorite color?
#> 3 Susan    Hi Susan!    Susan, what is your favorite color?
The stringr::str_flatten function works with dplyr::summarize to return a single string from a character vector.
df <- tibble::tribble(
  ~name,      ~color,
  "Jonathan", "blue",
  "Jonathan", "red",
  "David",    "green",
  "David",    "yellow",
  "David",    "orange",
  "Susan",    "purple",
  "Susan",    "red"
)
df |>
  dplyr::summarise(
    color = stringr::str_flatten(color, ", "),
    .by = name
  )
#> # A tibble: 3 × 2
#>   name     color
#>   <chr>    <chr>
#> 1 Jonathan blue, red
#> 2 David    green, yellow, orange
#> 3 Susan    purple, red
Extracting Data from Strings
Functions in the tidyr package can help you separate a single string into multiple rows or columns.
# By delimiter
tibble::tibble(x = c("07.01.2026", "07.02.2026", "07.03.2026")) |>
  tidyr::separate_wider_delim(
    x,
    delim = ".",
    names = c("month", "day", "year")
  )

# By widths
tibble::tibble(x = c("07012026", "07022026", "07032026")) |>
  tidyr::separate_wider_position(
    x,
    widths = c(month = 2, day = 2, year = 4)
  )
#>   month day   year
#>   <chr> <chr> <chr>
#> 1 07    01    2026
#> 2 07    02    2026
#> 3 07    03    2026
Regular Expressions
Regular expressions (often abbreviated as regex or regexp) are a concise and powerful language for describing patterns within strings. The simplest regex patterns consist of letters or numbers that are matched directly.
Most punctuation characters like ., +, *, [, ], and ? have special meanings in regex. For example, "a...e" will match any string that contains an "a" followed by any three characters and then an "e".
? makes a pattern optional (i.e., it matches 0 or 1 times)
+ lets a pattern repeat (i.e., it matches it at least once)
* lets a pattern be optional or repeat
# Matches an ap, optionally followed by an r
stringr::str_view(fruit, "apr?")
#>  [1] │ <ap>ple
#>  [2] │ <apr>icot
#> [34] │ gr<ap>e
#> [35] │ gr<ap>efruit
#> [56] │ p<ap>aya
#> [62] │ pine<ap>ple

# Matches an ap, followed by at least one r
stringr::str_view(fruit, "apr+")
#> [2] │ <apr>icot
You can use {n}, {n,}, and {n,m} to match exactly n times, at least n times, or between n and m times.
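The three counted quantifiers can be compared on a made-up vector of repeated letters. In this sketch the ^ and $ anchors are added so that the whole string has to match the pattern:

```r
# Hypothetical strings with one to four repetitions of "a"
x <- c("a", "aa", "aaa", "aaaa")

exactly_two <- stringr::str_detect(x, "^a{2}$")     # FALSE  TRUE FALSE FALSE
at_least_two <- stringr::str_detect(x, "^a{2,}$")   # FALSE  TRUE  TRUE  TRUE
two_to_three <- stringr::str_detect(x, "^a{2,3}$")  # FALSE  TRUE  TRUE FALSE
```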
Character Classes
A character class is defined by [] and lets you match any one of a set of characters. You can also invert the match by starting the class with ^, which matches anything except the listed characters.
words <- stringr::words

# Find an x surrounded by vowels
stringr::str_view(words, "[aeiou]x[aeiou]")
#> [284] │ <exa>ct
#> [285] │ <exa>mple
#> [288] │ <exe>rcise
#> [289] │ <exi>st

# Find a y surrounded by consonants
stringr::str_view(words, "[^aeiou]y[^aeiou]")
#> [836] │ <sys>tem
#> [901] │ <typ>e
You can also use a - to define a range of letters or numbers.
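A small sketch of ranges inside a character class is shown below; the codes vector is made up for illustration:

```r
# Hypothetical identifiers mixing letters and digits
codes <- c("A1", "b7", "Z", "42")

# An uppercase letter immediately followed by a digit
upper_then_digit <- stringr::str_detect(codes, "[A-Z][0-9]")

# The first run of digits in each string (NA when there is none)
digits <- stringr::str_extract(codes, "[0-9]+")

upper_then_digit  # TRUE FALSE FALSE FALSE
digits            # "1" "7" NA "42"
```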
The stringr::str_detect function can also be used with dplyr::filter and dplyr::summarize. For example, the code below calculates the total number and proportion of fruits containing the word “berry”.
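One way to write that calculation is sketched here. It is a reconstruction under the assumption that fruit refers to the stringr::fruit vector; the column names n_berry and prop_berry are made up for the example:

```r
# Count and proportion of fruit names that contain "berry"
berry_summary <- tibble::tibble(fruit = stringr::fruit) |>
  dplyr::summarize(
    n_berry = sum(stringr::str_detect(fruit, "berry")),
    prop_berry = mean(stringr::str_detect(fruit, "berry"))
  )

berry_summary
```

Because str_detect returns a logical vector, sum gives the number of matches and mean gives the proportion.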
The stringr::str_subset and stringr::str_which functions are closely related to stringr::str_detect. The stringr::str_subset function returns a character vector with the strings that match and stringr::str_which returns an integer vector with the positions of the strings that match.
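The three functions can be compared side by side on a short made-up vector (snacks is a hypothetical name used only here):

```r
# Hypothetical character vector
snacks <- c("apple", "blueberry", "pear", "strawberry")

detected <- stringr::str_detect(snacks, "berry")  # FALSE TRUE FALSE TRUE
matches <- stringr::str_subset(snacks, "berry")   # "blueberry" "strawberry"
positions <- stringr::str_which(snacks, "berry")  # 2 4
```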
Counting Matches
tibble::tibble(fruit = fruit) |>
  dplyr::mutate(
    vowels = stringr::str_count(fruit, "[aeiou]"),
    consonants = stringr::str_count(fruit, "[^aeiou]")
  )
#> # A tibble: 80 × 3
#>    fruit        vowels consonants
#>    <chr>         <int>      <int>
#>  1 apple             2          3
#>  2 apricot           3          4
#>  3 avocado           4          3
#>  4 banana            3          3
#>  5 bell pepper       3          8
#>  6 bilberry          2          6
#>  7 blackberry        2          8
#>  8 blackcurrant      3          9
#>  9 blood orange      5          7
#> 10 blueberry         3          6
#> # ℹ 70 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Use stringr::str_remove_all and stringr::str_replace_all to remove or replace all matches as str_remove and str_replace only remove or replace the first match.
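The difference between the first-match and all-matches versions is easiest to see on a single word. This is a minimal sketch with a made-up input:

```r
x <- "banana"

first_replaced <- stringr::str_replace(x, "a", "o")    # "bonana"
all_replaced <- stringr::str_replace_all(x, "a", "o")  # "bonono"
first_removed <- stringr::str_remove(x, "an")          # "bana"
all_removed <- stringr::str_remove_all(x, "an")        # "ba"
```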
Anchors
Matching the start or end of the string requires an anchor. Use ^ to match the start or $ to match the end of the string.
# All fruits that start with an a
stringr::str_view(fruit, "^a")
#> [1] │ <a>pple
#> [2] │ <a>pricot
#> [3] │ <a>vocado

# All fruits that end with an a
stringr::str_view(fruit, "a$")
#>  [4] │ banan<a>
#> [15] │ cherimoy<a>
#> [30] │ feijo<a>
#> [36] │ guav<a>
#> [56] │ papay<a>
#> [74] │ satsum<a>
Capturing Groups
Parentheses create capturing groups, which allow you to reuse sub-components of a match by referring back to them with back references (e.g., \1 refers to the match contained in the first set of parentheses, \2 to the second, etc.).
# Finds all fruits with a repeated pair of letters
stringr::str_view(fruit, "(..)\\1")
#>  [4] │ b<anan>a
#> [20] │ <coco>nut
#> [22] │ <cucu>mber
#> [41] │ <juju>be
#> [56] │ <papa>ya
#> [73] │ s<alal> berry

# Finds all words that start and end with the same pair of letters
stringr::str_view(words, "^(..).*\\1$")
#> [152] │ <church>
#> [217] │ <decide>
#> [617] │ <photograph>
#> [699] │ <require>
#> [739] │ <sense>
Factors
Factors are used for categorical variables. The forcats package is part of the tidyverse and provides tools for dealing with factors.
Any values that are not in the levels will be silently converted to NA. The forcats::fct function turns this into an error message and may be preferred.
x <- c(x, "Satur")
factor(x, levels = weekday_levels, labels = weekdays)
#> [1] Tuesday  Thursday Friday   <NA>
#> Levels: Sunday Monday Tuesday Wednesday Thursday Friday Saturday

forcats::fct(x, levels = weekday_levels)
#> Error in `forcats::fct()` at presentations/DataScience/transform.qmd:987:1:
#> ! All values of `x` must appear in `levels` or `na`
#> ℹ Missing level: "Satur"
Omitting the levels argument, as in factor(x), uses the unique values in alphabetical order as the levels. The forcats::fct function orders the levels by first appearance instead.
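The two default orderings can be compared directly. In this sketch the vector days is made up for illustration and is not the x used above:

```r
# Hypothetical character vector, not in alphabetical order
days <- c("Tue", "Thu", "Fri")

base_levels <- levels(factor(days))          # alphabetical: "Fri" "Thu" "Tue"
fct_levels <- levels(forcats::fct(days))     # first appearance: "Tue" "Thu" "Fri"
```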
Modifying Factor Order
You can use forcats::fct_reorder to reorder the levels of a factor by a numeric vector. For example, we may want to plot miles per gallon for the cars in the mtcars dataset, ordered by miles per gallon.
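A minimal sketch of that reordering is shown below, without the plot itself. The data frame name cars_mpg and its column names are assumptions made for the example:

```r
# Build a small data frame with the car model as a column
cars_mpg <- data.frame(model = rownames(mtcars), mpg = mtcars$mpg)

# Reorder the model levels from lowest to highest miles per gallon
cars_mpg$model <- forcats::fct_reorder(cars_mpg$model, cars_mpg$mpg)

# The last level is the car with the highest miles per gallon
tail(levels(cars_mpg$model), 1)  # "Toyota Corolla"
```

Plotting functions that respect factor order would then draw the models in this order rather than alphabetically.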
fct_reorder2(.f, .x, .y): Reorder the factor .f by the .y values associated with the largest .x values
fct_infreq(f): Order factor levels by frequency (most common first)
fct_rev(f): Reverse the order of factor levels
fct_recode(.f, ...): Rename factor levels
fct_collapse(.f, ...): Combine multiple levels into broader categories
fct_lump(f, n, prop): Combine infrequent levels into "Other"
fct_lump_n(f, n): Keep the top n levels by frequency and lump the rest
fct_lump_min(f, min): Lump levels occurring fewer than a minimum count
fct_lump_prop(f, prop): Lump levels whose proportion is below a threshold
fct_lump_lowfreq(f): Lump the lowest-frequency levels into "Other"
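A few of these helpers are sketched below on a small made-up factor. The new level names alpha and other are hypothetical and chosen only for the example:

```r
# Hypothetical factor: "a" appears 3 times, "b" twice, "c" once
f <- factor(c("b", "a", "a", "c", "a", "b"))

by_frequency <- levels(forcats::fct_infreq(f))                    # "a" "b" "c"
reversed <- levels(forcats::fct_rev(f))                           # "c" "b" "a"
recoded <- levels(forcats::fct_recode(f, alpha = "a"))            # "alpha" "b" "c"
collapsed <- levels(forcats::fct_collapse(f, other = c("b", "c")))
```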
Dates and Times
ISO8601 is the international standard for writing dates where the components of the dates are organized from largest to smallest and separated with a - (e.g., 2026-07-01). This format can also include times where the hour, minute, and second are separated by a : (e.g., 2026-07-01 09:00:00).
Describing Date and Time Formats
You can describe a date or time format with a % followed by a single character.
Year: %Y (four digits) or %y (two digits)
Month: %m (number), %b (abbreviated name like Jul), %B (full name)
Day: %d (one or two digits) or %e (two digits like 07)
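These format codes can be tried with base R's as.Date and format, which accept the same % notation. This is a minimal sketch with made-up date strings:

```r
# Parse non-ISO dates by describing their layout
us_style <- as.Date("07/01/2026", format = "%m/%d/%Y")   # month/day/four-digit year
short_year <- as.Date("01.07.26", format = "%d.%m.%y")   # day.month.two-digit year

# Format a date back into a custom layout
slashes <- format(as.Date("2026-07-01"), "%Y/%m/%d")     # "2026/07/01"

us_style
short_year
```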
You can extract individual components from a date or time with functions like year, month, mday (day of the month), yday (day of the year), wday (day of the week), hour, minute, and second. For month and wday, you can also set the label and abbr arguments to return labels instead of numeric values.
datetime <- lubridate::ymd_hms("2026-07-01 09:00:00")
lubridate::year(datetime)
#> [1] 2026
lubridate::month(datetime, label = TRUE)
#> [1] Jul
#> Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec
lubridate::yday(datetime)
#> [1] 182
lubridate::wday(datetime, label = TRUE, abbr = FALSE)
#> [1] Wednesday
#> Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday
Rounding
The lubridate package also has round_date, floor_date, and ceiling_date to adjust the date to the nearest second, minute, hour, day, week, month, bimonth, quarter, season, halfyear and year.
lubridate::round_date(datetime, unit = "week")
#> [1] "2026-06-28 UTC"
lubridate::floor_date(datetime, unit = "week")
#> [1] "2026-06-28 UTC"
lubridate::ceiling_date(datetime, unit = "week")
#> [1] "2026-07-05 UTC"
Modifying Date Components
You can also use each accessor function to modify the components of a date or time or use update to modify multiple components at the same time.
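Both approaches are sketched below on a separate object. The names appointment and rescheduled are hypothetical, and the new component values are arbitrary:

```r
# Work on a separate object so other date-time variables are left unchanged
appointment <- lubridate::ymd_hms("2026-07-01 09:00:00")

# Accessor functions on the left-hand side modify one component at a time
lubridate::year(appointment) <- 2030
lubridate::hour(appointment) <- lubridate::hour(appointment) + 1
appointment

# update() returns a modified copy with several components changed at once
rescheduled <- update(appointment, month = 2, mday = 2, hour = 2)
rescheduled
```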
You can also perform addition, subtraction, and division with date and time objects. Time spans can be represented by durations (seconds), periods (weeks and months), and intervals (starting and ending points).
# How old is a person born on January 1, 1990 if today is July 1, 2026
age <- lubridate::date(datetime) - lubridate::ymd("1990-01-01")
age
#> Time difference of 13337 days
A difftime class object records a time span in seconds, minutes, hours, days, or weeks. The lubridate as.duration function always records the span in seconds. The package also includes convenient constructors: dseconds, dminutes, dhours, ddays, dweeks, and dyears create durations, while minutes, hours, days, months, and years create periods.
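The difference between a duration and a period shows up around leap years. The sketch below uses made-up dates in 2024 (a leap year); the variable names are chosen for illustration:

```r
start <- lubridate::ymd("2024-01-01")

# Subtracting dates gives a difftime; as.duration stores it in seconds
span <- lubridate::ymd("2024-01-08") - start
week_duration <- lubridate::as.duration(span)
week_duration == lubridate::dweeks(1)   # TRUE

# A period of one year follows the calendar
plus_period <- start + lubridate::years(1)     # 2025-01-01

# A duration of 365 days is a fixed length, one day short in a leap year
plus_duration <- start + lubridate::ddays(365) # 2024-12-31
```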
You can use last observation carried forward with tidyr::fill or fill in fixed values with dplyr::coalesce. An NA can be added using dplyr::na_if if a value like -99 represents a missing value.
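These three helpers can be combined on a small made-up dataset. The tibble measurements and its values are assumptions for the example, with -99 standing in for a missing weight:

```r
# Hypothetical repeated measurements for one subject
measurements <- tibble::tibble(
  subject = c(1, 1, 1),
  timepoint = 1:3,
  height = c(165, NA, NA),
  weight = c(65, -99, 68)
)

cleaned <- measurements |>
  dplyr::mutate(weight = dplyr::na_if(weight, -99)) |>  # -99 becomes NA
  tidyr::fill(height)                                   # carry last height forward

# Replace any remaining NA with a fixed value
weight_filled <- dplyr::coalesce(cleaned$weight, 0)

cleaned
weight_filled
```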
Explicit missing values are usually represented by NA, NaN, or a negative number like -99. Implicit missing values are rows that are simply absent from the dataset. You can use tidyr::complete to add the missing rows.
tidyr::complete(anthro, subject, timepoint = 1:3)
#> # A tibble: 6 × 4
#>   subject timepoint height weight
#>     <dbl>     <dbl>  <dbl>  <dbl>
#> 1    1001         1    165     65
#> 2    1001         2     NA     67
#> 3    1001         3     NA     68
#> 4    1002         1    170     72
#> 5    1002         2     NA     NA
#> 6    1002         3     NA    -99
Joins
Multiple data frames can be joined together with mutating or filtering joins. Mutating joins add new variables to one data frame from matching observations in another. Filtering joins filter observations from one data frame based on whether or not they match an observation in another.
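The contrast between the two families can be sketched with dplyr's left_join (mutating) and semi_join and anti_join (filtering). The patients and labs tables below are made up for illustration:

```r
# Hypothetical tables sharing an id column
patients <- tibble::tibble(id = c(1, 2, 3), age = c(34, 51, 47))
labs <- tibble::tibble(id = c(1, 3, 4), glucose = c(90, 110, 95))

# Mutating join: adds the glucose column to patients (NA where no match)
with_labs <- dplyr::left_join(patients, labs, by = "id")

# Filtering joins: keep or drop patients rows, without adding columns
has_labs <- dplyr::semi_join(patients, labs, by = "id")  # ids 1 and 3
no_labs <- dplyr::anti_join(patients, labs, by = "id")   # id 2
```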
Inner Join
Only observations that match in both data frames are kept.
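A minimal sketch with dplyr::inner_join follows; the students and scores tables are hypothetical:

```r
# Hypothetical tables: id 1 appears only in students, id 4 only in scores
students <- tibble::tibble(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
scores <- tibble::tibble(id = c(2, 3, 4), score = c(88, 92, 75))

# Only ids 2 and 3 appear in both tables, so only those rows are kept
matched <- dplyr::inner_join(students, scores, by = "id")
matched
```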