Applying R to Lifestyle and Brain Health Research
University of Kansas Medical Center
August 26, 2026
Subsetting in R is easy to learn but hard to master
There are three subsetting operators ([[, [, and $).
The interactive viewer can help you understand the structure of complex objects.
Positive integers return elements at the specified positions.
Negative integers exclude elements at the specified positions.
Logical vectors select the elements where the logical value is TRUE.
Character vectors can be used to return elements with matching names.
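# x is not defined in the extracted material; these values are taken from the output below
x <- c(2.1, 4.2, 3.3, 5.4)
(y <- setNames(x, letters[1:4]))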
#> a b c d
#> 2.1 4.2 3.3 5.4
y[c("d", "c", "a")]
#> d c a
#> 5.4 3.3 2.1
# When subsetting with [, names are matched exactly
z <- c(abc = 1, def = 2)
z[c("a", "d")]
#> <NA> <NA>
#> NA NA
# Subsetting with a factor uses the underlying integer codes, not the character levels
y[factor("b")]
#> a
#> 2.1
Subsetting a list works the same way as subsetting an atomic vector, except that [ always returns a list; [[ and $ must be used to extract list elements.
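The difference between the list operators can be sketched as follows. The list `visit` and its contents are hypothetical and serve only as an illustration.

```r
# Hypothetical participant record, used only for illustration
visit <- list(id = 101L, steps = c(4200, 8100, 6500), active = TRUE)

# [ always returns a list, even when a single element is selected
sub <- visit["steps"]
class(sub)
length(sub)

# [[ and $ extract the element itself
visit[["steps"]]
visit$steps
```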
Subsetting higher-dimensional structures can happen with multiple vectors, a single vector, or a matrix.
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")
# Blank subsetting allows you to keep all the rows or columns
a[1:2, ]
#> A B C
#> [1,] 1 4 7
#> [2,] 2 5 8
a[c(TRUE, FALSE, TRUE), c("B", "A")]
#> B A
#> [1,] 4 1
#> [2,] 6 3
# By default, [ simplifies the results to the lowest dimensionality
a[1, ]
#> A B C
#> 1 4 7
Matrices and arrays are vectors with special dimension attributes, stored in column-major order. This allows them to be subset as if they were a single vector.
(vals <- outer(1:5, 1:5, FUN = "paste", sep = ", "))
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] "1, 1" "1, 2" "1, 3" "1, 4" "1, 5"
#> [2,] "2, 1" "2, 2" "2, 3" "2, 4" "2, 5"
#> [3,] "3, 1" "3, 2" "3, 3" "3, 4" "3, 5"
#> [4,] "4, 1" "4, 2" "4, 3" "4, 4" "4, 5"
#> [5,] "5, 1" "5, 2" "5, 3" "5, 4" "5, 5"
vals[c(4, 15)]
#> [1] "4, 1" "5, 3"
You can also use a two-column integer matrix to subset a matrix and a three-column matrix to subset a 3D array. Each row of the index matrix gives the location of one value.
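As a small sketch of matrix-based subsetting, the index matrix `idx` below (a name chosen here for illustration) picks three cells from the same `vals` matrix.

```r
vals <- outer(1:5, 1:5, FUN = "paste", sep = ", ")

# Each row of the index matrix is one (row, column) pair
idx <- rbind(
  c(1, 1),
  c(3, 1),
  c(2, 4)
)
picked <- vals[idx]
picked
#> [1] "1, 1" "3, 1" "2, 4"
```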
Data frames have the characteristics of lists and matrices.
Subsetting with a single index returns columns, while two indices select rows and columns.
When selecting a single column, matrix-like subsetting simplifies by default. Subsetting a tibble with [ always returns a tibble.
Adding drop = FALSE will preserve the original dimensionality when subsetting like a matrix.
The default drop = TRUE can be a common source of bugs in functions. When writing functions, get in the habit of using drop = FALSE or tibbles.
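The data frame behaviors described above can be sketched with a small made-up data frame. The name `df_demo` and its columns are assumptions for illustration only.

```r
df_demo <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])

# List-like subsetting: a single index selects columns and keeps a data frame
one_col_list <- df_demo["x"]

# Matrix-like subsetting: a single column is simplified to a vector by default
one_col_vec <- df_demo[, "x"]

# drop = FALSE keeps the data frame structure
one_col_df <- df_demo[, "x", drop = FALSE]

str(one_col_list)
str(one_col_vec)
str(one_col_df)
```

Inside a function, the `drop = FALSE` form is usually the safer choice because the return type no longer depends on how many columns happen to be selected.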
Factor subsetting also has a drop argument, but it controls levels rather than dimensions. Unused levels are dropped when subsetting factors with drop = TRUE.
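A brief sketch of the factor case, using a hypothetical factor named `intensity`. Note that for factors the default is drop = FALSE, so levels are kept unless you ask otherwise.

```r
intensity <- factor(c("light", "moderate", "vigorous"))

# Default: all levels are kept even if they are no longer used
levels(intensity[1])
#> [1] "light"    "moderate" "vigorous"

# drop = TRUE removes the unused levels
levels(intensity[1, drop = TRUE])
#> [1] "light"
```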
The two other subsetting operators, [[ and $, are used for extracting single items.
$ is a useful shorthand: x$y is roughly equivalent to x[["y"]].
The [[ operator is important when working with lists because subsetting a list with [ always returns a smaller list.
[[ is normally used with a single positive integer or string because it returns a single item. If you supply a vector, [[ subsets recursively (e.g., x[[c(1, 2)]] is the same as x[[1]][[2]]).
Think of a list as a train: [ creates a smaller train, while [[ extracts the contents of a single car.
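Recursive subsetting with [[ can be sketched on a small nested list. The list `study` and its field names are hypothetical.

```r
# Hypothetical nested list used only for illustration
study <- list(
  site = "site_a",
  participant = list(id = 7L, scores = c(10, 12, 15))
)

# A vector index walks down the nesting, one level per element
by_position <- study[[c(2, 1)]]
by_name <- study[[c("participant", "id")]]

# Both are the same as chaining single [[ calls
chained <- study[[2]][[1]]
```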
The $ can be used to access variables in a data frame, but [[ must be used if the name of the column is stored in a variable.
Another difference between $ and [[ is that the $ operator does left-to-right partial matching.
Setting options(warnPartialMatchDollar = TRUE) will issue a warning when partial matching occurs.
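The two differences between $ and [[ can be sketched as follows. The objects `df_demo`, `col`, and `x` are made up for the example.

```r
df_demo <- data.frame(activity = c("walk", "run"), minutes = c(30, 20))

# The column name is stored in a variable, so [[ is required
col <- "minutes"
df_demo[[col]]   # the minutes column
df_demo$col      # NULL: $ looks for a column literally named "col"

# $ partially matches names from the left; [[ does not by default
x <- list(abc = 1)
x$a
x[["a"]]
```

With options(warnPartialMatchDollar = TRUE) set, the `x$a` call would still return 1 but would also raise a warning.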
[[ behaves inconsistently when given an invalid index, depending on the type of object being subset:
| row[[col]] | Zero-length | OOB (int) | OOB (chr) | Missing |
|---|---|---|---|---|
| Atomic | Error | Error | Error | Error |
| List | Error | Error | NULL | NULL |
| NULL | NULL | NULL | NULL | NULL |
OOB: out of bounds
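A few cells of the table can be checked directly. The helper `try_index()` is a hypothetical convenience defined here so that errors are reported as the string "Error" instead of stopping the script.

```r
# Hypothetical helper: return the value, or "Error" if indexing fails
try_index <- function(expr) {
  tryCatch(expr, error = function(e) "Error")
}

atomic <- c(a = 1, b = 2)
lst <- list(a = 1, b = 2)

try_index(atomic[[3]])     # out-of-bounds integer on an atomic vector: "Error"
try_index(atomic[["z"]])   # out-of-bounds name on an atomic vector: "Error"
try_index(lst[[3]])        # out-of-bounds integer on a list: "Error"
try_index(lst[["z"]])      # out-of-bounds name on a list: NULL
try_index(NULL[["z"]])     # any index on NULL: NULL
```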
Functions in the purrr package can be used for more predictable behavior when subsetting errors may occur.
purrr::pluck: Always returns NULL (or the .default argument) when an element is missing
purrr::chuck: Always throws an error when an element is missing
The function purrr::pluck is useful for deeply nested data structures where the component you are searching for may not exist. An example of this is JSON data from web application programming interfaces (APIs).
Using purrr::pluck allows you to mix integer and character indices and provides an alternative default value if an item does not exist.
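A minimal sketch of both functions, assuming the purrr package is installed. The nested list `resp` is a made-up stand-in for parsed JSON, and the calls are wrapped in a `requireNamespace()` check so the script still runs without purrr.

```r
# Hypothetical nested structure resembling parsed JSON
resp <- list(
  list(id = 1, device = list(model = "A1", battery = 80)),
  list(id = 2)
)

if (requireNamespace("purrr", quietly = TRUE)) {
  # Mix an integer position with character names
  found <- purrr::pluck(resp, 1, "device", "model")

  # A missing component returns NULL, or .default when supplied
  missing_null <- purrr::pluck(resp, 2, "device", "model")
  missing_default <- purrr::pluck(resp, 2, "device", "model", .default = NA)

  # chuck() throws an error instead of returning NULL
  chuck_result <- tryCatch(
    purrr::chuck(resp, 2, "device", "model"),
    error = function(e) "Error"
  )
}
```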
Subassignment: All subsetting operators can be combined with assignment to modify selected values. The basic form is x[i] <- value.
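Make sure that length(value) is the same as length(x[i]) and that i is unique. R will recycle value if needed, but the recycling rules are complex, particularly when i contains missing or duplicated values.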
Subsetting with nothing can be useful with assignment to preserve the original object’s structure.
You can remove a component from a list with x[[i]] <- NULL. To add a literal NULL, use x[i] <- list(NULL).
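The subassignment forms above can be sketched together. All object names here are chosen for illustration.

```r
x <- 1:5
x[c(1, 2)] <- c(101L, 102L)   # value has the same length as the selection

# Empty subsetting keeps the data frame structure while replacing its contents
df_demo <- data.frame(a = c(1, 10), b = c(2, 20))
df_demo[] <- lapply(df_demo, function(col) col * 2)
is.data.frame(df_demo)

# Remove a list component, then store a literal NULL under a new name
lst <- list(a = 1, b = 2)
lst[["b"]] <- NULL
lst["c"] <- list(NULL)
str(lst)
```

Without the empty brackets, `df_demo <- lapply(...)` would replace the data frame with a plain list.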
Many of the basic subsetting principles have been integrated within functions like subset(), merge(), dplyr::select(), and dplyr::filter(). However, a deeper understanding of how subsetting principles have been implemented will be valuable when encountering a situation where functions you need do not exist.
Here are 8 subsetting applications:
Use character matching to create lookup tables.
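A simple lookup table can be sketched with a named character vector. The vectors `codes` and `lookup` are hypothetical examples.

```r
# Hypothetical observed codes and a named lookup vector
codes <- c("sed", "lig", "lig", "vig")
lookup <- c(sed = "Sedentary", lig = "Light", mod = "Moderate", vig = "Vigorous")

# Character subsetting matches each code against the names of lookup
labels <- lookup[codes]
labels

# unname() drops the names if only the values are needed
unname(labels)
#> [1] "Sedentary" "Light"     "Light"     "Vigorous"
```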
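For more complex lookup tables with multiple columns, use match() to return the position where each element of activity_codes is found in activity_info$code, then use those positions to subset the rows.
# activity_codes is not defined in the extracted material; these values are reconstructed from the output below
activity_codes <- c("sed", "lig", "lig", "mod", "sed", "vig", "lig")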
activity_info <- data.frame(
code = c("sed", "lig", "mod", "vig"),
label = c("Sedentary", "Light", "Moderate", "Vigorous"),
mvpa = c(FALSE, FALSE, TRUE, TRUE)
)
activity_info[match(activity_codes, activity_info$code), ]
#> code label mvpa
#> 1 sed Sedentary FALSE
#> 2 lig Light FALSE
#> 2.1 lig Light FALSE
#> 3 mod Moderate TRUE
#> 1.1 sed Sedentary FALSE
#> 4 vig Vigorous TRUE
#> 2.2 lig Light FALSE
Integer indices can be used to randomly sample a vector or data frame.
df <- activity_info[match(activity_codes, activity_info$code), ]
# Run to reproduce random sample output
set.seed(1)
# Randomly reorder
df[sample(nrow(df)), ]
#> code label mvpa
#> 1 sed Sedentary FALSE
#> 3 mod Moderate TRUE
#> 2.2 lig Light FALSE
#> 2 lig Light FALSE
#> 1.1 sed Sedentary FALSE
#> 2.1 lig Light FALSE
#> 4 vig Vigorous TRUE
# Select 3 random rows
set.seed(1)
df[sample(nrow(df), size = 3), ]
#> code label mvpa
#> 1 sed Sedentary FALSE
#> 3 mod Moderate TRUE
#> 2.2 lig Light FALSE
The order() function returns an integer vector describing how to order the subsetted vector.
Sort in descending order with decreasing = TRUE.
By default, missing values are placed at the end; remove them with na.last = NA or put them at the front with na.last = FALSE.
Use sort() or dplyr::arrange() to directly sort a vector or data frame.
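The order() arguments can be sketched on a short vector with one missing value. The vector `steps` is made up for the example.

```r
steps <- c(8100, NA, 4200, 6500)

order(steps)                      # ascending, NA last
#> [1] 3 4 1 2
order(steps, decreasing = TRUE)   # descending, NA still last
#> [1] 1 4 3 2
order(steps, na.last = NA)        # NA removed
#> [1] 3 4 1
order(steps, na.last = FALSE)     # NA first
#> [1] 2 3 4 1

# Use the result as an integer index to reorder the vector itself
steps[order(steps)]
#> [1] 4200 6500 8100   NA
```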
Expanding identical rows that have been collapsed with a count column is easier with rep() and integer subsetting since rep(x, y) repeats x[i] y[i] times.
df <- subset(
data.frame(table(df$code, df$label, df$mvpa)),
Freq > 0
)
colnames(df) <- c("code", "label", "mvpa", "count")
rownames(df) <- NULL
df
#> code label mvpa count
#> 1 lig Light FALSE 3
#> 2 sed Sedentary FALSE 2
#> 3 mod Moderate TRUE 1
#> 4 vig Vigorous TRUE 1
(df <- df[rep(seq_len(nrow(df)), df$count), ])
#> code label mvpa count
#> 1 lig Light FALSE 3
#> 1.1 lig Light FALSE 3
#> 1.2 lig Light FALSE 3
#> 2 sed Sedentary FALSE 2
#> 2.1 sed Sedentary FALSE 2
#> 3 mod Moderate TRUE 1
#> 4 vig Vigorous TRUE 1
Columns can be removed from a data frame by subsetting to the columns you want to keep or by setting individual columns to NULL. If you only know which columns you do not want, use set operations such as setdiff() to work out which columns to keep.
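The three ways of removing columns can be sketched on a small made-up data frame (`df_demo` is an illustrative name).

```r
df_demo <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])

# Keep only the columns you want
keep <- df_demo[c("x", "y")]

# Set a column to NULL to remove it
drop_null <- df_demo
drop_null$z <- NULL

# When only the unwanted columns are known, compute the ones to keep
drop_setdiff <- df_demo[setdiff(names(df_demo), "z")]
```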
Logical subsetting allows you to easily combine conditions from multiple columns to extract rows out of a data frame.
Remember to use the vectorized Boolean operators & (and) and | (or) when subsetting on multiple columns. The operators && and || are short-circuiting scalar operators that are more useful inside if statements.
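Combining conditions across columns can be sketched as follows. The data frame `act` and its values are hypothetical.

```r
# Hypothetical activity summary used only for illustration
act <- data.frame(
  code = c("sed", "lig", "mod", "vig"),
  minutes = c(300, 120, 45, 10),
  mvpa = c(FALSE, FALSE, TRUE, TRUE)
)

# & and | are vectorized, so each row is tested separately
long_mvpa <- act[act$mvpa & act$minutes > 20, ]
sed_or_short <- act[act$code == "sed" | act$minutes < 50, ]

long_mvpa$code
#> [1] "mod"
sed_or_short$code
#> [1] "sed" "mod" "vig"
```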
De Morgan’s laws can be used to simplify negations:
Example: !(X & !(Y | Z)) == !X | !!(Y | Z) == !X | Y | Z
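One way to check a simplification like this is to evaluate both forms on every combination of inputs. The sketch below does that with expand.grid(); the names `grid`, `original`, and `simplified` are chosen for illustration.

```r
# All eight combinations of TRUE/FALSE for X, Y, and Z
grid <- expand.grid(X = c(TRUE, FALSE), Y = c(TRUE, FALSE), Z = c(TRUE, FALSE))

original <- with(grid, !(X & !(Y | Z)))
simplified <- with(grid, !X | Y | Z)

# TRUE when the two expressions agree on every combination
all(original == simplified)
#> [1] TRUE
```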
Using set operations (integer subsetting) over Boolean algebra (logical subsetting) may be more effective when you want to find the first (or last) TRUE or you only have a few TRUE values.
Here are the Boolean operators with their set equivalents.
| Boolean Algebra (Logical) | Set Operators (Integer) |
|---|---|
| X & Y | intersect(X, Y) |
| X \| Y | union(X, Y) |
| X & !Y | setdiff(X, Y) |
| xor(X, Y) | setdiff(union(X, Y), intersect(X, Y)) |
xor: Stands for exclusive or and returns TRUE when exactly one of X and Y is TRUE.
which() allows the conversion from a Boolean representation to integer, but a function needs to be created to reverse the operation.
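A minimal sketch of the reverse operation and of the Boolean and set equivalences. The helper name `unwhich()` and the example vectors are assumptions made here, not part of base R.

```r
# Hypothetical helper: rebuild a logical vector of length n from positions
unwhich <- function(idx, n) {
  out <- rep_len(FALSE, n)
  out[idx] <- TRUE
  out
}

x <- c(10, 3, 8, 1, 6, 9)
big <- x > 5
small <- x < 9

pos_big <- which(big)
pos_small <- which(small)

# Round trip: logical -> integer -> logical
identical(unwhich(pos_big, length(x)), big)

# Boolean operators and their set equivalents select the same positions
identical(which(big & small), intersect(pos_big, pos_small))
identical(which(big | small), sort(union(pos_big, pos_small)))
identical(which(big & !small), setdiff(pos_big, pos_small))
```

The sort() call is needed because union() returns positions in order of first appearance rather than in ascending order.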
R for Lifestyle and Brain Health (R-LAB)