Dreams of a “Lazyverse”

A delayed, functional interface for tidy data wrangling

code
r
Author

Michael DeCrescenzo

Published

July 6, 2023

It is hard to introduce what this post is about. I will try.

This post proposes a delayed evaluation, function-forward interface for Tidyverse data manipulation, focusing primarily on dplyr verbs. Instead of writing data |> verb(args), it should be easier to write verb(args) as its own expression, save it ahead of time, and later evaluate it on data.

This post provides some less-than-ideal code to implement such an interface. It is surely incomplete. But the interface is interesting and appealing at least to me, a nostalgic R user with a new love of functional programming.

(Many) R users love the Tidyverse

Let us count the reasons:

  • Uniform interfaces mean functions are composable. dplyr and tidyr provide functions that map dataframes to dataframes. stringr functions map string vector to string vector. forcats functions map factor to factor. And the argument structures are similar: an object of type T goes in the first argument, and an object of type T is returned. (N.B. by “type” I basically mean “class”.)
  • The forward pipe (%>% or |>). Write a linear sequence of procedures instead of a nested expression. Instead of g(f(x)) we can write x |> f() |> g(). That rules.
  • Unquoted expressions let you write “ergonomic” code that remains flexible. Instead of data[["new_var"]] = ..., we can write mutate(data, new_var = ...) which feels much nicer.

These are the same reasons why I love the Tidyverse too.

But the Tidyverse is also missing something for me. Something major.

The functional style

the tidyverse advertises itself as a “functional” interface for data manipulation in r. and compared to the object oriented style of python (pandas in particular), that’s certainly true. it is much easier to combine tidy functions (dplyr and tidyr “verbs” in particular) than it is to do, frankly, anything at all in pandas. and i write python every day for my job.

But taking a wider view, the “functional” interfaces of the Tidyverse have some key limitations. Take the following example. I want to filter the palmerpenguins data to the “Adelie” species only.

library(palmerpenguins)
library(dplyr)
penguins |> 
    filter(species == "Adelie")
## # A tibble: 152 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 142 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

This works fine if I want to tie my functionality (drop all species but “Adelie”) to my data (penguins). But can I separate these steps? Can I write a function (using dplyr) that filters on species, and then later apply that function to a dataframe?

Here is what I want.

# a "partial" recipe for filtering.
# here is how I want to filter, regardless of the dataset.
adelie_only = filter(species == "Adelie")

# apply the filter to a dataset
adelie_only(penguins)

Here is what I get:

adelie_only = filter(species == "Adelie")
## Error in filter(species == "Adelie"): object 'species' not found

So what we have is a clash between two tenets of the Tidyverse interface: functional style and unquoted expressions. I cannot write the function that I want to apply to my data. Not that easily, at least.

I could write the function with a “base” style…

# clunkier
adelie_only = function(d) {
    condition = (d$species == "Adelie")
    d[condition, ]
}

# but at least it works
adelie_only(penguins)
## # A tibble: 152 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 142 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

But there is more boilerplate code, and the magic of unquoted expressions is gone. So the interface is less expressive than what the Tidyverse ordinarily feels like.

More abstractly, I want to separate functions from data. I want to pass data x to functions f, g, and h. But instead of writing this:

x |>
    f(...) |>
    g(...) |>
    h(...)

I want to write this:

do_f = f(...)
do_g = g(...)
do_h = h(...)

fgh = compose(f, g, h)
fgh(x)

Why do I want this? Because I should be able to reuse the do_f, do_g, and do_h functions on different data. These operations don’t have any necessary dependence on the data x, but I cannot easily write these functions as abstract expressions on some yet-to-be-provided data. Not with an ergonomic Tidy style, anyway.

What can we do to make this possible?

Tidy evaluation magic

Take just the do_f function. What I want to do is write an expression f(...) and delay the evaluation of the arguments ... until I call this function with data.

If this is unclear, I am going to rephrase it a handful of ways in hopes that one of them clicks.

  • If f(.data = x, ...) is a fully-formed expression that takes input data and returns output data, then f(...) is a procedure that we could do to some data x, if we were to pass x. But if I don’t pass data x, the statement f(...) is just a function.
  • If a fully-formed expression is f(.data = x, ...), then the expression without data f(.data = ?, ...) is a partial/curried function. Whereas f is a function that takes arguments .data and ..., if we fix the ... arguments ahead of time, the result is a function that takes only .data. I have written about curried functions before, hint hint, and have interpreted pipe chains as curried functions quite explicitly.
  • This is just as if do_f = f(...) and do_f = \(x) f(x, ...) were the same thing.

The issue is, of course, the evaluation of the expression in the ... slot of f. You have probably noticed that functions like mutate, select, filter, etc. allow you to write unquoted variable names without throwing errors. This is similar to writing formulae in R (such as in the lm or aggregate functions).

How can we achieve this? In general, the answer to write a function that “captures” arguments unevaluated in one environment and evaluates those arguments in some different environment containing other necessary information. To use an example, in the code filter(penguins, species == "Adelie") the object species does not exist in the global environment. But it does exist in the environment of the dataset penguins. What the filter function does is prevent the evaluation of the expression species == "Adelie" in the global environment, evaluating that expression in the .data = penguins environment instead.

We must use a similar trick.

The simplest example, delay_filter

The filter function already delays the evaluation of ... into the data environment. But we must delay it further; I don’t want to pass data at all when the expression is captured.

So we write a function that captures arguments ..., and returns a new function that would take .data and execute filter(.data, ...). We capture the unevaluated arguments with rlang::enquos and evaluate them in the data context with rlang::eval_tidy and the !!! operator. Sadly I cannot explain these tools in detail here. If you want to learn more, I would recommend the Advanced R chapters on metaprogramming.

# args -> function(.data, args)
delay_filter = function(...) {
    args = rlang::enquos(...)
    function(.data) rlang::eval_tidy(filter(.data, !!!args))
}

Here it is in action. I delay the evaluation of filter and pass some arguments to filter on. The result is a new function.

# creates a function
adelie_only = delay_filter(species == "Adelie")
class(adelie_only)
## [1] "function"

And only when I pass a dataset is the expression species == "Adelie" evaluated.

adelie_only(penguins)
## # A tibble: 152 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 142 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

More generally, delay_verb

There are more dplyr verbs than filter. We also have mutate, select, count and so on. How can we scale this routine up to these functions as well?

This is where functional programming shines. Let’s write a function that takes a verb and returns a delayed-verb-creator function. That is, we pass a function (like filter), and it returns a function of args ... which would, in turn, return a function of .data. It sounds like a mouthful, but let’s see how it goes.

# verb -> (function(...) -> function(.data))
delay_verb = function(verb) {
    function(...) {
        args = rlang::enquos(...)
        function(.data) rlang::eval_tidy(verb(.data, !!!args))
    }
}

And now we can create many delayed verbs with this function.

dfilter = delay_verb(filter)
dmutate = delay_verb(mutate)
dselect = delay_verb(select)
dcount = delay_verb(count)
dsummarize = delay_verb(summarize)
dgroup_by = delay_verb(group_by)
darrange = delay_verb(arrange)

These delayed verbs let us pre-supply arguments without data.

# create the function
adelie_only = dfilter(species == "Adelie")

And then call these functions with whatever data we want.

adelie_only(penguins)
## # A tibble: 152 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 142 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Let’s verify that some other delayed verbs work as expected.

# delay mutate: convert species to uppercase
upper_species = dmutate(species = toupper(species))
upper_species(penguins)
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 ADELIE  Torgersen           39.1          18.7               181        3750
##  2 ADELIE  Torgersen           39.5          17.4               186        3800
##  3 ADELIE  Torgersen           40.3          18                 195        3250
##  4 ADELIE  Torgersen           NA            NA                  NA          NA
##  5 ADELIE  Torgersen           36.7          19.3               193        3450
##  6 ADELIE  Torgersen           39.3          20.6               190        3650
##  7 ADELIE  Torgersen           38.9          17.8               181        3625
##  8 ADELIE  Torgersen           39.2          19.6               195        4675
##  9 ADELIE  Torgersen           34.1          18.1               193        3475
## 10 ADELIE  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

# delay select: only columns that start with "bill"
bill_cols = dselect(starts_with("bill"))
bill_cols(penguins)
## # A tibble: 344 × 2
##    bill_length_mm bill_depth_mm
##             <dbl>         <dbl>
##  1           39.1          18.7
##  2           39.5          17.4
##  3           40.3          18  
##  4           NA            NA  
##  5           36.7          19.3
##  6           39.3          20.6
##  7           38.9          17.8
##  8           39.2          19.6
##  9           34.1          18.1
## 10           42            20.2
## # ℹ 334 more rows

# delay summarize, compute mean mass
avg_mass = dsummarize(avg_mass = mean(body_mass_g, na.rm = TRUE))
avg_mass(penguins)
## # A tibble: 1 × 1
##   avg_mass
##      <dbl>
## 1    4202.

The payoff: even more “functional” approach to data munging

As we mentioned above, one of the key benefits of the Tidyverse was composable, linear code. The drawback was that pipe chains need both functions and data.

With this new delayed interface, we can separate the functions from the data. So I can write the same linear pipeline of functionality without passing any data. This lets me evaluate the same function pipeline of multiple datasets without redundant typing.

This new style is much more like “traditional” function composition. Say I have two functions \(f\) and \(g\). If I want to compose these functions, I can write the composition \(g \cdot f\), which means “do \(g\) after \(f\)”. The result is a new function, and there is no reference to data. I can pass data later.

Let’s consider an example. I want to (a) filter to Adelie species, (b) compute a ratio of bill length to depth, and (c) find the mean of this ratio. I can write this routine as a series of functions without any data.

But first… Function composition helpers

First I will write some tools to facilitate function composition. This compose function takes two functions, g and f, and returns a new function that applies g after f to whatever data you provide. I also write a compose_right, which reverses the direction of composition, and infix operators for both functions. If you are unfamiliar with what is going on here, you are invited to read this other post.

# (g, f) -> (g . f)
compose = function(g, f) function(...) g(f(...))
`%.%` = compose

# (f, g) -> (g . f)
compose_right = function(f, g) compose(g, f)
`%;%` = compose_right

To demonstrate:

x = c(1, 1, 2, 2, 2)
compose(length, unique)(x)
## [1] 2

# same as...
(length %.% unique)(x)
## [1] 2
compose_right(unique, length)(x)
## [1] 2
(unique %;% length)(x)
## [1] 2

Okay, now the example.

Here’s the recipe I stated above:

  1. filter to Adelie species
  2. compute a ratio of bill length to depth
  3. find the mean of this ratio

Here we write these steps with delayed verbs. Each of these expressions return separate functions from dataframe to dataframe.

adelie_only = dfilter(species == "Adelie")

compute_length_to_depth = dmutate(
    bill_length_to_depth = bill_length_mm / bill_depth_mm
)

mean_bill_ratio = dsummarize(
    avg_bill_ratio = mean(bill_length_to_depth, na.rm = TRUE)
)

I can now write a “data pipeline” not as eager actions on data, but as functions that don’t need to know anything about data.

# using "right" / "postfix" composition
adelie_mean_bill_ratio = (
    adelie_only %;%
    compute_length_to_depth %;%
    mean_bill_ratio
)

Since my pipeline is a function, it is simple to apply it to any dataset I want. For instance, here it is in the full dataset:

adelie_mean_bill_ratio(penguins)
## # A tibble: 1 × 1
##   avg_bill_ratio
##            <dbl>
## 1           2.12

But let’s say I wanted to group by year and sex and compute the same pipeline. Do I need to write the same pipeline with a different starting point? Do I need to stack my data and apply my pipeline with purrr::map? No, I simply call the function on a different dataset.

# grouping: this is also a function :)
by_year_sex = dgroup_by(year, sex)

# using left composition this time
compose(adelie_mean_bill_ratio, by_year_sex)(penguins)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 7 × 3
## # Groups:   year [3]
##    year sex    avg_bill_ratio
##   <int> <fct>           <dbl>
## 1  2007 female           2.10
## 2  2007 male             2.05
## 3  2007 <NA>             2.07
## 4  2008 female           2.10
## 5  2008 male             2.15
## 6  2009 female           2.16
## 7  2009 male             2.16

The argument for “Delayplyr”: functions over pipelines

The proposition of a delayed / functional approach to data manipulation is that you can create complex results by writing and applying functions instead of executing procedures on data.

One key benefit of this style is modularity. Many of the “steps” in your pipeline are not dependent on the exact state of your data in the pipeline. They can instead be expressions that would do some operation to some data. In that approach, your “pipeline” isn’t a dataset being transformed; it is the recipe for transforming the data. The difference? A pipeline binds the functions to the data The recipe lets the functions exist separate from the data.

As a result, you can write and save many separate components ahead of time: adelie_only, compute_length_to_depth, mean_bill_ratio, by_year_sex… In a traditional dplyr chain, these functions would be steps in a pipe chain that have to carry the data from one state to the next. The state of the data in one function is intimately related to the state of the data in all prior functions. And although dplyr verbs are in theory abstract functions on dataframes, the pipelines you create are not abstract at all. They are actually quite fragile.

But when you write a lot of separate functions, these functions don’t depend on one another. You can costlessly combine and rearrange them. You only pay the cost when you apply the functions on data, which you can do much more flexibly now that you have separated the data from the chain of operations.

Now that is functional programming with the Tidyverse!