Functional Programming

Applying R to Lifestyle and Brain Health Research

Brian C. Helsel, PhD

University of Kansas Medical Center

October 7, 2026

Introduction

R is a functional programming language. It uses functions to solve problems.

Functional programming languages have two common features:

  • First-class functions: Functions that behave like any other data structure. For example, you can assign functions to variables, store them in lists, pass them as arguments to other functions, create them inside functions, and return them as a result from another function.
  • Functions must be pure: A function is pure if the output only depends on the inputs (i.e., the same inputs lead to the same outputs each time the function is called) and the function has no unintended side-effects.

Functionals

A functional is a function that takes a function as an input and returns a vector as an output.

randomize <- function(f) f(runif(1e3))

set.seed(123)
randomize(mean)
#> [1] 0.4972778

set.seed(321)
randomize(sum)
#> [1] 511.1584

Common functionals in R that you may have already used include the apply family and purrr::map. It is likely that you have used one of these functions as a for loop replacement. Each functional is tailored for a specific task.

A Basic Functional

The most fundamental functional is purrr::map. It takes a vector and a function, calls the function once for each element in the vector, and returns the results in a list.

triple <- function(x) x * 3

purrr::map(1:3, triple)

#> [[1]]
#> [1] 3

#> [[2]]
#> [1] 6

#> [[3]]
#> [1] 9
  • purrr::map(1:3, triple) is the same as list(triple(1), triple(2), triple(3))
  • map comes from mathematics and refers to an operation that associates each element of a given set (or vector) with one or more elements of a second set

All map functions return a vector the same length as the input and each function must return a single value.

pair <- function(x) c(x, x)

purrr::map_dbl(1:2, pair)

#> Error: Result must be length 1, not 2.

You will also get an error if the function returns the wrong type of result (e.g., purrr::map_dbl(1:2, as.character)). Switching back to map solves this as it can accept any type of output.

Anonymous Functions and Shortcuts

Instead of using map with an existing function, you can create an inline anonymous function. It also supports a special short cut.

purrr::map_dbl(mtcars[1:5], function(x) length(unique(x)))

#>  mpg  cyl disp   hp drat
#>   25    3   27   22   22

purrr::map_dbl(mtcars[1:5], ~ length(unique(.x)))

#>  mpg  cyl disp   hp drat
#>   25    3   27   22   22

The shortcut works as all purrr functions translate formulas created by ~ (pronounced “twiddle”) into functions. You can use as_mapper to see what is happening behind the scenes.

purrr::as_mapper(~ length(unique(.x)))

#> <lambda>
#> function (..., .x = ..1, .y = ..2, . = ..1)
#> length(unique(.x))
#> attr(,"class")
#> [1] "rlang_lambda_function" "function"

The shortcut is useful for generating random data. However, it should only be used if your function does not span multiple lines. The map functions also have shortcuts for extracting elements from a vector. This is useful for extracting elements of deeply nested lists often found when working with JSON data.

x <- list(
  list(-1, x = 1, y = c(2), z = "a"),
  list(-2, x = 4, y = c(5, 6), z = "b"),
  list(-3, x = 8, y = c(9, 10, 11))
)

# Select by name
purrr::map_dbl(x, "x")
#> [1] 1 4 8

# Select by position
purrr::map_dbl(x, 1)
#> [1]  -1 -2 -3

# Select by name and position
purrr::map_dbl(x, list("y", 1))
#> [1] 2 5 9

# You will get an error if a component does not exist
# Unless you provide a default value
purrr::map_chr(x, "z", .default = NA)
#> [1] "a" "b" NA

Passing Arguments with …

It can be convenient to pass additional arguments to the function that you are calling. This can be accomplished with anonymous functions. However, since map functions pass … along, a simpler form is available.

x <- list(1:5, c(1:10, NA))

purrr::map_dbl(x, ~ mean(.x, na.rm = TRUE))
#> [1] 3.0 5.5

purrr::map_dbl(x, mean, na.rm = TRUE)
#> [1] 3.0 5.5

It is important to write out the full names of the arguments (e.g., na.rm = TRUE). Instead, if you write out purrr::map_dbl(x, mean, TRUE), the TRUE value is a positional argument and goes to trim, the second argument in the mean function.

Purrr Style

Here is a realistic example of fitting a linear model to multiple subgroups and extracting the coefficients from the models.

library(magrittr)

# Split the mtcars dataset into 4, 6, and 8 cylinders
by_cyl <- split(mtcars, mtcars$cyl)

# Fit a linear model and extract the second coefficient (i.e., slope).
by_cyl %>%
  purrr::map(~ lm(mpg ~ wt, data = .x)) %>%
  purrr::map(coef) %>%
  purrr::map_dbl(2)

#>         4         6         8
#> -5.647025 -2.780106 -2.192438

# In base R, you could accomplish this with lapply and vapply
models <- lapply(by_cyl, function(data) lm(mpg ~ wt, data = data))
vapply(models, function(x) coef(x)[[2]], double(1))
#>         4         6         8
#> -5.647025 -2.780106 -2.192438

# Or with a for loop
intercepts <- double(length(by_cyl))
for (i in seq_along(by_cyl)) {
  model <- lm(mpg ~ wt, data = by_cyl[[i]])
  intercepts[[i]] <- coef(model)[[2]]
}
intercepts
#> [1] -5.647025 -2.780106 -2.192438

We iterate 3 times with purrr, 2 times with the apply function, and 1 time with the for loop. The purrr package uses more but simpler steps and is easier to follow.

Map Variants

There are 23 variants of map. Learning all of the variants requires an understanding of five new ideas:

  • Output same type as input with modify()
  • Iterate over two inputs with map2()
  • Iterate with an index using imap()
  • Return nothing with walk()
  • Iterate over any number of inputs with pmap()

The map family of functions has orthogonal input and outputs.

List Atomic Same Type Nothing
One Argument map() map_lgl(), … modify() walk()
Two Arguments map2() map2_lgl(), … modify2() walk2()
One Argument + Index imap() imap_lgl(), … imodify() iwalk()
N Arguments pmap pmap_lgl(), … ———- pwalk()

Same Type of Output as Input (modify)

Imagine you wanted double every column in a data frame. Modify would be a way to return a data frame whereas map would return a list.

df <- data.frame(x = 1:3, y = 6:4)

purrr::map(df, ~ .x * 2)

#> $x
#> [1] 2 4 6

#> $y
#> [1] 12 10  8

purrr::modify(df, ~ .x * 2)

#>   x  y
#> 1 2 12
#> 2 4 10
#> 3 6  8

The purrr::modify function does not modify in place but rather returns a modified copy. If you want to permanently modify the object, you must assign the result of purrr::modify to the object.

The basic implementation of modify is simpler than map as we do not need to create a new output vector. Instead, we progressively replace the input.

simple_modify <- function(x, f, ...) {
  for (i in seq_along(x)) {
    x[[i]] <- f(x[[i]], ...)
  }
  return(x)
}

simple_modify(df, f = function(x) x * 2)

#>   x  y
#> 1 2 12
#> 2 4 10
#> 3 6  8

Two Inputs (map2)

The map function is vectorized over a single argument and only varies over .x when calling .f. This is problematic for some types of problems like calculating weighted means where you have a list of observations and weights.

\[ \text{Avg MET} = \frac{\sum_{i=1}^{n} \bigl(\text{minutes}_i \times \text{MET}_i\bigr)}{\sum_{i=1}^{n} \text{minutes}_i} \]

# Weekly minutes of physical activity for each participant
activity <- list(
  person1 = c(walk = 150, run = 60, bike = 90),
  person2 = c(walk = 200, run = 0, bike = 30),
  person3 = c(walk = 100, run = 120, bike = 60)
)

# MET values (energy cost for each activity type)
# Walk ~3.3 METs, Run ~7 METs, Bike ~6 METs
met_weights <- list(
  person1 = c(walk = 3.3, run = 7.0, bike = 6.0),
  person2 = c(walk = 3.3, run = 7.0, bike = 6.0),
  person3 = c(walk = 3.3, run = 7.0, bike = 6.0)
)

# Weighted mean activity intensity (minutes weighted by METs)
weighted_activity <- purrr::map2_dbl(
  activity,
  met_weights,
  ~ sum(.x * .y) / sum(.x)
)

weighted_activity
#>  person1  person2  person3
#> 4.850000 3.652174 5.464286

The basic implementation of map2 is like map except it iterates over two vectors in parellel.

simple_map2 <- function(x, y, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]], ...)
  }
  return(unlist(out))
}

simple_map2(activity, met_weights, f = function(x, y) sum(x * y) / sum(x))
#> [1] 4.850000 3.652174 5.464286

No Output (walk)

Most functions are called for their returned value. However, some functions are called primarily for their side-effects (e.g., cat, write.csv, or ggsave). For example, the cat function returns NULL after printing out the result so purrr::map will return a list of NULL values.

You can assign the results of map to a variable that you would never use or you can use purrr::walk. This function ignores the return values of the function and instead invisibly returns x.

welcome <- function(x) {
  cat("Welcome to RLAB, ", x, "!\n", sep = "")
}

names <- c("Brian", "Julianne", "Morgan", "Ashlyn", "Courtney")

purrr::map(names, welcome)

#> Welcome to RLAB, Brian!
#> Welcome to RLAB, Julianne!
#> Welcome to RLAB, Morgan!
#> Welcome to RLAB, Ashlyn!
#> Welcome to RLAB, Courtney!
#> [[1]]
#> NULL

#> [[2]]
#> NULL

#> [[3]]
#> NULL

#> [[4]]
#> NULL

#> [[5]]
#> NULL

# Returned value is invisble but here we can see it is equal to .x
all(purrr::walk(names, welcome) == names)
#> Welcome to RLAB, Brian!
#> Welcome to RLAB, Julianne!
#> Welcome to RLAB, Morgan!
#> Welcome to RLAB, Ashlyn!
#> Welcome to RLAB, Courtney!
#> [1] TRUE

One of the most useful walk variants is walk2. It allows you to iterate over an object and path when saving something to your computer which makes it easier to save a list of a data frames (e.g., subsetted or filtered) to your computer.

Iterating Over Values and Indices (imap)

The imap function iterates over indices or names. It is comparable to map2 in that your function is called with 2 arguments. However, both are derived from the vector.

  • imap(x, f) equals map2(x, names(x), f) when x has names.
  • imap(x, f) equals map2(x, seq_along(x), f) when x does not have names.
# Creating labels
purrr::imap_chr(
  data.frame(t(mtcars))[1:5], # Transpose data frame so car names are columns
  ~ paste0("The mpg for ", .y, " is ", .x[[1]], "\n")
) |>
  cat() # Interpret newline character for display

#> The mpg for Mazda.RX4 is 21
#> The mpg for Mazda.RX4.Wag is 21
#> The mpg for Datsun.710 is 22.8
#> The mpg for Hornet.4.Drive is 21.4
#> The mpg for Hornet.Sportabout is 18.7

Any Number of Inputs (pmap)

The pmap function takes a list containing any number of arguments (i.e., equal-length vectors as in a data frame). In the case of map2, pmap(list(x, y), f) equals map2(x, y, f). A big difference between pmap and the other map functions is finer control over argument matching by using named lists. Naming the components will increase the clarity related to how the function will be called.

purrr::pmap_dbl(list(na.rm = TRUE), mean, x = c(1, NA, 3, NA, 5))
#> [1] 3

We can also use pmap to draw random uniform numbers with varying parameters by first setting up our parameters in a data frame.

# n, min, and max are the argument names for runif
params <- data.frame(
  n = c(1:3),
  min = c(0, 10, 100),
  max = c(1, 100, 1000)
)

set.seed(123)
purrr::pmap(params, runif)

#> [[1]]
#> [1] 0.2875775

#> [[2]]
#> [1] 80.94746 46.80792

#> [[3]]
#> [1] 894.7157 946.4206 141.0008

The base equivalents to pmap() would be Map() or mapply(). However, the Map function requires that arguments do not vary and mapply has similar drawbacks to sapply as it simplifies the output of Map.

Reduce Family

Takes a vector of length n and produces a vector of length 1 by calling a function with a pair of values at a time. For example, reduce(1:4, f) equals f(f(f(1, 2), 3), 4). It allows a function programmed to work with 2 inputs (i.e., a binary function) to work with any number of inputs.

Imagine you have a physical activity intervention lasting 7 days with 15 participants and you have the participants’ daily minutes of moderate-to-vigorous physical activity recorded in a vector by participant within a list. You could use purrr::reduce to get the weekly total instead of needing to add repeatedly.

# Generate a random data set representing minutes of daily activity for 25 participants
set.seed(123)
minutes_of_activity <- purrr::map(1:7, ~ sample(0:60, 15, replace = TRUE))
names(minutes_of_activity) <- paste0("Day ", 1:length(minutes_of_activity))

str(minutes_of_activity)
#> List of 7
#>  $ Day 1: int [1:15] 30 14 50 13 2 41 49 53 42 36 ...
#>  $ Day 2: int [1:15] 26 4 50 26 56 27 56 8 28 34 ...
#>  $ Day 3: int [1:15] 18 35 13 16 42 38 52 11 14 31 ...
#>  $ Day 4: int [1:15] 9 22 26 59 52 6 52 26 31 37 ...
#>  $ Day 5: int [1:15] 11 12 17 32 56 26 24 37 20 14 ...
#>  $ Day 6: int [1:15] 15 51 29 5 42 7 21 21 38 30 ...
#>  $ Day 7: int [1:15] 33 3 12 4 53 50 24 51 21 24 ...

# Without purrr::reduce
out <- minutes_of_activity[[1]]
out <- out + minutes_of_activity[[2]]
out <- out + minutes_of_activity[[3]]
out <- out + minutes_of_activity[[4]]
out <- out + minutes_of_activity[[5]]
out <- out + minutes_of_activity[[6]]
out <- out + minutes_of_activity[[7]]
out
#> [1] 142 141 197 155 303 195 278 207 194 206 241 222 195 207 192

# With purrr::reduce
purrr::reduce(minutes_of_activity, `+`)
#> [1] 142 141 197 155 303 195 278 207 194 206 241 222 195 207 192

The implementation of reduce can be seen using a simple wrapper around a for loop. The base R equivalent is Reduce but the argument order is different and the user can not add additional arguments.

simple_reduce <- function(x, f) {
  out <- x[[1]]
  for (i in seq(2, length(x))) {
    out <- f(out, x[[i]])
  }
  return(out)
}

simple_reduce(minutes_of_activity, function(x, y) x + y)
#> [1] 142 141 197 155 303 195 278 207 194 206 241 222 195 207 192

Reduce(f = function(x, y) x + y, x = minutes_of_activity)
#> [1] 142 141 197 155 303 195 278 207 194 206 241 222 195 207 192

Accumulate

A variant of reduce that returns the intermediate results of each step. When using + in our example, this would be the cumulative sum of minutes after each of the days.

purrr::accumulate(minutes_of_activity, `+`)

#> $`Day 1`
#>  [1] 30 14 50 13  2 41 49 53 42 36 51 13 53 24 25

#> $`Day 2`
#>  [1]  56  18 100  39  58  68 105  61  70  70  58  38  59  65  33

#> $`Day 3`
#>  [1]  74  53 113  55 100 106 157  72  84 101  99  82  65  73  73

#> $`Day 4`
#>  [1]  83  75 139 114 152 112 209  98 115 138 123 115  93  77  80

#> $`Day 5`
#>  [1]  94  87 156 146 208 138 233 135 135 152 163 161 118 136 110

#> $`Day 6`
#>  [1] 109 138 185 151 250 145 254 156 173 182 210 177 171 185 158

#> $`Day 7`
#>  [1] 142 141 197 155 303 195 278 207 194 206 241 222 195 207 192

Both reduce and accumulate can take an .init argument as the first value used to start the accumulation rather than using .x[[1]]. This ensures that these functions return a correct value when .x is empty rather than throwing an error (e.g., purrr::reduce(integer(), +, .init = 0)). It also ensures that errors are thrown if .x is not valid for the function (e.g., purrr::reduce(“a”, +, .init = 0))

Predicate Functionals

A predicate is a function that returns TRUE or FALSE (e.g., is.character, is.null, all). We say that a predicate matches a vector if it returns TRUE.

A predicate functional applies a predicate to each element of a vector. The purrr package provides 6 useful functions in 3 pairs:

  • some(.x, .p): returns TRUE if any element matches; every(.x, .p): returns TRUE if all elements match
  • detect(.x, .p): returns the value of the first match; detect_index(.x, .p): returns the location of the first match
  • keep(.x, .p): keeps all matching elements; discard(.x, .p): drops all matching elements
df <- data.frame(x = 1:3, y = letters[1:3])

purrr::detect(df, is.numeric)
#> [1] 1 2 3

purrr::detect_index(df, is.numeric)
#> [1] 1

str(purrr::keep(df, is.numeric))
#> 'data.frame':    3 obs. of  1 variable:
#>  $ x: int  1 2 3

str(purrr::discard(df, is.numeric))
#> 'data.frame':    3 obs. of  1 variable:
#>  $ y: chr  "a" "b" "c"

Map Variants

Both map and modify come in variants that also take predicate functions.

df <- data.frame(
  num1 = c(0, 10, 20),
  num2 = c(5, 6, 7),
  chr1 = c("a", "b", "c")
)

str(purrr::map_if(df, is.numeric, mean))
#> List of 3
#>  $ num1: num 10
#>  $ num2: num 6
#>  $ chr1: chr [1:3] "a" "b" "c"

str(purrr::modify_if(df, is.numeric, mean))
#> 'data.frame':    3 obs. of  3 variables:
#>  $ num1: num  10 10 10
#>  $ num2: num  6 6 6
#>  $ chr1: chr  "a" "b" "c"

str(purrr::map(purrr::keep(df, is.numeric), mean))
#> List of 2
#>  $ num1: num 10
#>  $ num2: num 6

Matrices and Arrays

The base::apply function is designed to work with matrices and arrays. It is an operation that summarizes a matrix or array by collapsing each row or column into a single value. The apply function has four arguments:

  • X: the matrix or array
  • MARGIN: An integer vector giving the dimensions to summarize over (i.e., 1: rows; 2: columns)
  • FUN: A summary function
  • *: Additional arguments to pass to FUN
a2d <- matrix(1:20, nrow = 5)

#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20

# Row Means
apply(X = a2d, MARGIN = 1, FUN = mean)
#> [1]  8.5  9.5 10.5 11.5 12.5

# Column Means
apply(X = a2d, MARGIN = 2, FUN = mean)
#> [1] 3 8 13 18

You can add multiple dimensions to MARGIN to summarize high-dimensional arrays.

# Creates an array of 4 matrices with 2 rows and 3 columns
a3d <- array(1:24, c(2, 3, 4))

str(a3d)
#> int [1:2, 1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...

# Row Means for ALL arrays
apply(X = a3d, MARGIN = 1, FUN = mean)

# Row BY Column Means for ALL Arrays
apply(X = a3d, MARGIN = c(1, 2), FUN = mean)

Never use the apply function with a data frame as it coerces it to a matrix and may lead to undesirable results if your data frame contains non-numeric data. The apply function will also automatically simplify the output to a list, matrix, or vetor, reducing the predictability of using the function.

Function Factories

A function factory is a function that makes functions. Here is an example where power makes two child (manufactured) functions (i.e., square, cube).

power1 <- function(exp) {
  function(x) {
    x^exp
  }
}

square <- power1(2)
cube <- power1(3)

square(3)
#> [1] 9
cube(3)
#> [1] 27

Factory Fundamentals

Function factories work because the enclosing environment of the manufactured function is the execution environment of the function factory.

# Printing the square function shows us where x comes from
square
#> function(x){
#>     x ^ exp
#>   }
#> <environment: 0x12a7ebbc0>

# Using env_print shows the binding to exp
rlang::env_print(square)
#> <environment: 0x12a7ebbc0>
#> Parent: <environment: global>
#> Bindings:
#> • exp: <dbl>

# Extract the exp from the enclosure environment of square
rlang::fn_env(square)$exp
#> [1] 2

Diagram Conventions

  • Any environment without an explicit parent inherits from the global environment
  • The link between square and cube is through their function body (not pictured)

When square executes (e.g., square(10)), x is found in the execution environment and exp is found in the enclosing environment.

Forcing Evaluation

Lazy evaluation causes a small problem in the power function. The value x is only evaluated when square is run. We can fix this by forcing evaluation.

x <- 2
square <- power1(x)
x <- 3

# Returns 8 since the binding to x changed to 3
square(2)
#> [1] 8

power2 <- function(exp) {
  force(exp)
  function(x) {
    x^exp
  }
}

x <- 2
square <- power2(x)
x <- 3

square(2)
#> [1] 4

Use force when creating a function factory to ensure every argument is evaluated if the argument is only used by the manufactured function.

Stateful Functions

Function factories allow you to maintain state across function calls. This is made possible because the enclosing environment of the manufactured function is unique and constant and the <<- special assignment operator can modify bindings in the enclosing environment.

Here is an example of combining these ideas to create a function that can record how many times it has been called.

new_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1
    i
  }
}

counter_one <- new_counter()

for (i in 1:5) {
  print(counter_one())
}

#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5

Graphical Factories

The scales package makes it easy to customize the labels in ggplot2 plots, providing many functions to control the fine details of axes and legends. The formatter functions help control the appearance of axis breaks and all return functions that need to be called to format a number.

y <- c(12345, 123456, 1234567)

scales::comma_format()(y)
#> [1] "12,345" "123,456" "1,234,567"

scales::number_format(scale = 1e-3, suffix = " K")(y)
#> [1] "12 K" "123 K" "1 235 K"

The primary interface is a function factory which allows a seamless integration with ggplot2 scales as the labels argument can accept functions.

df <- data.frame(x = 1, y = c(12345, 123456, 1234567))

base <- ggplot2::ggplot(df, ggplot2::aes(x, y)) +
  ggplot2::geom_point() +
  ggplot2::scale_x_continuous(breaks = 1, labels = NULL) +
  ggplot2::labs(x = NULL, y = NULL)

base

base + ggplot2::scale_y_continuous(labels = scales::comma_format())

base +
  ggplot2::scale_y_continuous(
    labels = scales::number_format(scale = 1e-3, suffix = " K")
  )

base + ggplot2::scale_y_continuous(labels = scales::scientific_format())

Histogram Bins

The binwidth argument in geom_histogram can be a function which is useful because the function is executed once for each group. This means that you can have different binwidths in different facets.

sd <- c(1, 5, 15)
n <- 100
df <- data.frame(x = rnorm(3 * n, sd = sd), sd = rep(sd, n))

binwidth_bins <- function(n) {
  force(n)
  function(x) {
    (max(x) - min(x)) / n
  }
}

ggplot2::ggplot(df, ggplot2::aes(x)) +
  ggplot2::geom_histogram(binwidth = 2) +
  ggplot2::facet_wrap(~sd, scales = "free_x") +
  ggplot2::labs(x = NULL)

ggplot2::ggplot(df, ggplot2::aes(x)) +
  ggplot2::geom_histogram(binwidth = binwidth_bins(20)) +
  ggplot2::facet_wrap(~sd, scales = "free_x") +
  ggplot2::labs(x = NULL)

Using ggsave

A function factory used internally by ggsave is ggplot2::plot_dev. This helps transition from a file extension (e.g., png, jpeg, etc) to a graphics device function (e.g., png(), jpeg()). There are some inconsistencies with the base graphic devices such as some having filename as the first argument but others having file and a difference in units for width and height between the raster (i.e., pixels) and vector (i.e., inches) graphic devices.

# A simplified version of ggplot2::plot_dev for some image types

plot_dev <- function(ext, dpi = 96) {
  force(dpi)
  switch(
    ext,
    pdf = function(filename, ...) {
      grDevices::pdf(file = filename, ...)
    },
    png = function(...) grDevices::png(..., res = dpi, units = "in"),
    jpg = ,
    jpeg = function(...) grDevices::jpeg(..., res = dpi, units = "in"),
    stop("Unkown graphics extension: ", ext, call. = FALSE)
  )
}

plot_dev("pdf")
#> function(filename, ...) {
#>       grDevices::pdf(file = filename, ...)
#>     }
#> <environment: 0x11c6ed5a8>

plot_dev("png")
#> function(...) grDevices::png(..., res = dpi, units = "in")
#> <bytecode: 0x12d693258>
#> <environment: 0x12f5c7458>

Statistical Factories

Function factories are also a good fit for certain statistical techniques like box-cox transformations, bootstrap resampling, and maximum likelihood estimation (MLE; See Advanced R for MLE example).

The Box-Cox transformation is a type of power transformation often used to transform data toward normality. It takes a single parameter, λ (lambda), to control the strength of the transformation.

# Define boxcox function factory
boxcox <- function(lambda) {
  if (lambda == 0) {
    function(x) log(x)
  } else {
    function(x) (x^lambda - 1) / lambda
  }
}

# Include the function factory in ggplot2::stat_function
stat_boxcox <- function(lambda) {
  ggplot2::stat_function(
    ggplot2::aes(color = lambda),
    fun = boxcox(lambda),
    linewidth = 1
  )
}

# λ = 0: Log transformation
# λ = 1: No transformation
# λ = 0.5: Square-root transformation
# λ = 2: Square transformation

ggplot2::ggplot(data.frame(x = c(0, 5)), ggplot2::aes(x)) +
  lapply(c(0.5, 1, 1.5), stat_boxcox) +
  ggplot2::scale_color_viridis_c(limits = c(0, 1.5))

ggplot2::ggplot(data.frame(x = c(0.01, 1)), ggplot2::aes(x)) +
  lapply(c(0.5, 0.25, 0.1, 0), stat_boxcox) +
  ggplot2::scale_color_viridis_c(limits = c(0, 1.5))

Bootstrap Generators

A bootstrap generator can create a fresh bootstrap every time it is called. The advantage of using a function factory is clear with a parametric bootstrap where we need to fit a model before generating the bootstrap. This approach only runs the model at set up and not each time we generate the bootstrap.

boot_model <- function(df, formula) {
  mod <- lm(formula, data = df)
  fitted <- unname(fitted(mod))
  resid <- unname(resid(mod))
  # Removing the linear model object to keep
  # the manufactured function small
  rm(mod)
  function() {
    round(fitted + sample(resid), 2)
  }
}

boot_mcars2 <- boot_model(mtcars, mpg ~ wt)

# Remove set.seed to generate completely random sample each time

set.seed(123)
head(boot_mcars2())
#> [1] 20.08 23.09 26.63 18.22 16.81 19.09

set.seed(456)
head(boot_mcars2())
#> [1] 23.23 21.72 22.80 19.41 16.29 18.95

Function Factories and Functionals

power1 <- function(exp) {
  function(x) {
    x^exp
  }
}

names <- list(
  square = 2,
  cube = 3,
  root = 1 / 2,
  cuberoot = 1 / 3,
  reciprocal = -1
)

funs <- purrr::map(names, power1)

funs$root(64)
#> [1] 8

A downside of this set up is needing to use the prefix funs$ each time. There are 3 ways to eliminate this additional syntax.

  • Use with for a temporary effect
  • Use attach and detach for a longer effect
  • Copy the functions to the global environment with rlang::env_bind
with(funs, root(100))
#> [1] 10

attach(funs)
root(100)
#> [1] 10
detach(funs)

rlang::env_bind(rlang::global_env(), !!!funs)
root(100)
#> [1] 10
rlang::env_unbind(rlang::global_env(), names(funs))

Function Operators

A function operator is a function that takes one (or more) functions as input and returns a function as output. Function operators are function factories that take a function as an input and are typically paired with functionals.

A Basic Example

chatty <- function(f) {
  force(f)
  function(x, ...) {
    res <- f(x, ...)
    cat("Processing ", x, "\n", sep = "")
    Sys.sleep(0.5)
    res
  }
}

purrr::map_dbl(c(3, 2, 1), chatty(function(x) x^2))
#> Processing 3
#> Processing 2
#> Processing 1
#> [1] 9 4 1

Existing Function Operators

Two existing function operators that can help solve common recurring problems:

  • Capturing errors with purrr::safely
  • Caching computations with memoise::memoise

Capturing Errors

One advantage of using for loops is being able to capture results up to a failure. A functional does not return any output which makes it difficult to figure out the problem.

x <- list(
  c(0.51, 0.17, 0.72),
  c(0.06, 0.78, 0.43),
  c(0.89, 0.79, 0.50),
  "This is a problem!"
)

out <- rep(NA_real_, length(x))
for (i in seq_along(x)) {
  out[[i]] <- sum(x[[i]])
}
#> Error in `sum()`: invalid 'type' (character) of argument

out
#> [1] 1.40 1.27 2.18   NA

purrr::map_dbl(x, sum)
#> Error in `purrr::map_dbl()`: In index: 4.
#> Caused by error: ! invalid 'type' (character) of argument

The function operator safely() from the purrr package that transforms a function to turn errors into data. A function transformed by safely always returns a list with two elements: result and error.

safe_sum <- purrr::safely(sum)
safe_sum
#> function (...)
#> capture_error(.f(...), otherwise, quiet)
#> <bytecode: 0x13d0c3be0>
#> <environment: 0x13dff12c8>

str(safe_sum(x[[1]]))
#> List of 2
#>  $ result: num 1.4
#>  $ error : NULL

str(safe_sum(x[[4]]))
#> List of 2
#>  $ result: NULL
#>  $ error :List of 2
#>   ..$ message: chr "invalid 'type' (character) of argument"
#>   ..$ call   : language .Primitive("sum")(..., na.rm = na.rm)
#>   ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

Using safely with a functional returns all of the results in a list, but we can transpose it to get a list of results and errors.

out <- purrr::map(x, purrr::safely(sum))
str(out)
#> List of 4
#>  $ :List of 2
#>   ..$ result: num 1.4
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: num 1.27
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: num 2.18
#>   ..$ error : NULL
#>  $ :List of 2
#>   ..$ result: NULL
#>   ..$ error :List of 2
#>   .. ..$ message: chr "invalid 'type' (character) of argument"
#>   .. ..$ call   : language .Primitive("sum")(..., na.rm = na.rm)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

out <- purrr::transpose(purrr::map(x, purrr::safely(sum)))
str(out)

#> List of 2
#>  $ result:List of 4
#>   ..$ : num 1.4
#>   ..$ : num 1.27
#>   ..$ : num 2.18
#>   ..$ : NULL
#>  $ error :List of 4
#>   ..$ : NULL
#>   ..$ : NULL
#>   ..$ : NULL
#>   ..$ :List of 2
#>   .. ..$ message: chr "invalid 'type' (character) of argument"
#>   .. ..$ call   : language .Primitive("sum")(..., na.rm = na.rm)
#>   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

We can now easily find the results that worked or the inputs that failed. This approach could also work for fitting a linear model to multiple data sets and looking back on the results that failed.

ok <- purrr::map_lgl(out$error, is.null)

x[!ok]
#> [[1]]
#> [1] "This is a problem!"

unlist(out$result[ok])
#> [1] 1.40 1.27 2.18

Other Function Operators in purrr

  • Return a default value when there is an error with possibly
  • Turn output, messages, and warning side-effects into output, message, and warning components of the output with quietly
  • Automatically execute browser inside the function when there is an error with auto_browser

Caching Computations

The function memoise from the memoise package will remember previous inputs and return cached results. Memoisation is an example of the classic tradeoff between memory and speed. A memoised funcition can run faster, but it takes up more memory as it stores all the prior inputs and outputs.

slow_function <- function(x) {
  Sys.sleep(1)
  set.seed(123)
  x * 10 * runif(1)
}

system.time(print(slow_function(1)))
#> [1] 2.875775
#>    user  system elapsed
#>   0.003   0.002   1.003

fast_function <- memoise::memoise(slow_function)

# Elapsed time is the same as slow function the first time
system.time(print(fast_function(1)))
#> [1] 2.875775
#>    user  system elapsed
#>   0.002   0.003   1.002

# Elapsed time is much faster the second time
system.time(print(fast_function(1)))
#> [1] 2.875775
#>    user  system elapsed
#>   0.021   0.001   0.007

A realistic example of using memoisation is computing the Fibonacci series. The Fibonacci series is defined recursively with the first two values defined as \(f(0) = 0\), \(f(n) = 1\), adn then \(f(n) = f(n - 1) + f(n - 2)\) for any positive integer. A naive version is slow because it needs to compute multiple Fibonacci values within the function.

fib <- memoise::memoise(function(n) {
  if (n < 2) {
    return(1)
  }
  fib(n - 2) + fib(n - 1)
})

system.time(fib(23))
#>  user  system elapsed
#> 0.011   0.002   0.005

system.time(fib(23))
#> user  system elapsed
#>  0       0       0

This illustrates dynamic programming, where a complex problem is divided into smaller subproblems, and storing their solutions significantly enhances efficiency.

Creating Your Own Function Operator

Imagine you have a named vector of URLs and you want to download each one to your computer. It is simple with purrr::walk2 and utils::file.download().

urls <- c(
  "adv-r" = "https://adv-r.hadley.nz",
  "r4ds" = "https://r4ds.had.co.nz"
)

paths <- paste0(tempdir(), names(urls), ".html")

purrr::walk2(urls, paths, utils::download.file, quiet = TRUE)

This works great for a few URLs, but as the vector gets longer, you might want to add a small delay to avoid overloading the server and display a . every few URLs to know the function is still working. You can do this with a for loop, but the code is harder to read and the components are not reusable in new situations.

for (i in seq_along(urls)) {
  Sys.sleep(0.1)
  if (i %% 10 == 0) {
    cat(".")
  }
  utils::download.file(urls[[i]], paths[[i]])
}

A function operator overcomes the limitations of a for loop and we can create delay_by and dot_every functions that can be reused.

delay_by <- function(f, amount) {
  force(f)
  force(amount)
  function(...) {
    Sys.sleep(amount)
    f(...)
  }
}

dot_every <- function(f, n) {
  force(f)
  force(n)
  i <- 0
  function(...) {
    i <<- i + 1
    if (i %% n == 0) {
      cat(".")
    }
    f(...)
  }
}

purrr::walk2(
  urls,
  paths,
  utils::download.file |> dot_every(10) |> delay_by(0.1),
  quiet = TRUE
)