currr: a functional grammar for data frame operations.

Functional, composable, and deferred evaluation.

code
r
Author: Michael DeCrescenzo

Published: October 3, 2023

This post proposes a dialect for data frame operations that I call currr. (I am not married to the name.) You can find and install a very unfinished package on GitHub.

currr is built on dplyr, but it uses a distinct grammar.

A typical dplyr workflow will pipe a data frame through a series of functions (in this case filter, group_by, and summarize).

library("palmerpenguins")
library("dplyr")

penguins |>
    filter(sex %in% c("male", "female")) |>
    group_by(sex, island) |>
    summarize(
        mass_med = median(body_mass_g, na.rm = TRUE),
        mass_mean = mean(body_mass_g, na.rm = TRUE),
        mass_std = sd(body_mass_g, na.rm = TRUE),
        .groups = "drop"
    )
## # A tibble: 6 × 5
##   sex    island    mass_med mass_mean mass_std
##   <fct>  <fct>        <dbl>     <dbl>    <dbl>
## 1 female Biscoe       4588.     4319.     660.
## 2 female Dream        3450      3446.     270.
## 3 female Torgersen    3400      3396.     259.
## 4 male   Biscoe       5350      5105.     714.
## 5 male   Dream        3950      3987.     350.
## 6 male   Torgersen    4000      4035.     372.

The purpose of currr is to provide a convenient way to pre-define functions that perform these data manipulation steps, without referring to a data frame. Instead of filter, group_by, and summarize, here we use currr::filtering, currr::grouping, and currr::summarizing to create new functions.

library(currr)
## 
## Attaching package: 'currr'
## The following object is masked from 'package:base':
## 
##     grouping

# each of these returns a new function that is dataframe -> dataframe
flt_sex_FM = filtering(sex %in% c("male", "female"))
by_sex_isl = grouping(sex, island)
smz_mass = summarizing(
    mass_med = median(body_mass_g, na.rm = TRUE),
    mass_mean = mean(body_mass_g, na.rm = TRUE),
    mass_std = sd(body_mass_g, na.rm = TRUE),
    .groups = "drop"
)

These functions can be rearranged, composed, and called later.

(smz_mass %.% by_sex_isl)(penguins)
## # A tibble: 9 × 5
##   sex    island    mass_med mass_mean mass_std
##   <fct>  <fct>        <dbl>     <dbl>    <dbl>
## 1 female Biscoe       4588.     4319.     660.
## 2 female Dream        3450      3446.     270.
## 3 female Torgersen    3400      3396.     259.
## 4 male   Biscoe       5350      5105.     714.
## 5 male   Dream        3950      3987.     350.
## 6 male   Torgersen    4000      4035.     372.
## 7 <NA>   Biscoe       4688.     4588.     338.
## 8 <NA>   Dream        2975      2975       NA 
## 9 <NA>   Torgersen    3588.     3681.     413.

(smz_mass %.% by_sex_isl %.% flt_sex_FM)(penguins)
## # A tibble: 6 × 5
##   sex    island    mass_med mass_mean mass_std
##   <fct>  <fct>        <dbl>     <dbl>    <dbl>
## 1 female Biscoe       4588.     4319.     660.
## 2 female Dream        3450      3446.     270.
## 3 female Torgersen    3400      3396.     259.
## 4 male   Biscoe       5350      5105.     714.
## 5 male   Dream        3950      3987.     350.
## 6 male   Torgersen    4000      4035.     372.

At this point you may be wondering,

If dplyr is already so good, why should I complicate my life with this new style?

…and this post will try to answer.

We proceed in three sections below:

  1. The tl;dr: how to use and understand currr.
  2. For the skeptics: why you would want to use currr.
  3. For the dorks: how currr really works.

The basics of currr

If you know dplyr, you know almost everything you need to know about currr. There are only two things going on here.

  1. currr code is just curried dplyr code.
  2. currr functions are meant to be composed without evaluating them immediately.

Thing 1 of 2: currr code is just curried dplyr code.

“Currying” a function means turning a function of many arguments into a function of fewer arguments. I have written about this before, but we will explain it plenty here too.1
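To make the idea concrete without any dplyr in the way, here is a minimal base R sketch. The helper `partially` is a made-up name for illustration, not something currr or base R provides. It fixes every argument except the first, which is the pattern currr relies on (the data frame comes first and arrives later).

```r
# `partially` is a hypothetical helper for illustration only.
# It fixes the trailing arguments now and waits for the first argument.
partially <- function(f, ...) {
    fixed <- list(...)                          # arguments supplied now
    function(x) do.call(f, c(list(x), fixed))   # first argument supplied later
}

add <- function(a, b) a + b

add_one <- partially(add, 1)   # a function of one argument
add_one(2)                     # same as add(2, 1), i.e. 3

# Strict currying, by contrast, takes exactly one argument per call.
curried_add <- function(a) function(b) a + b
curried_add(1)(2)              # also 3
```

Both versions return a function rather than a value, which is the only property the rest of this post depends on.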

dplyr functions like filter and select have analogous currr functions like filtering and selecting. The currr functions create curried versions of the dplyr functions. Let me give you an example. Here is how we can use dplyr::filter to keep only known male and female penguins in the palmerpenguins dataset.

filter(penguins, sex %in% c("male", "female"))
## # A tibble: 333 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # ℹ 323 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Notice that filter has several arguments: the data frame itself, and as many boolean expressions as you want to filter with. And it returns a new data frame.

Here is how to accomplish the same thing with currr::filtering. filtering is just like filter, except that I don’t pass the data frame.

filtering(sex %in% c("male", "female"))
## Memoised Function:
## function (.data) 
## {
##     return((purrr::partial(verb))(.data, ...))
## }
## <bytecode: 0x13cd86920>
## <environment: 0x12c97eac8>

And instead of getting a data frame back, I get a new function. The new function maps a data frame to a data frame. I can pass the data frame later on, though. I don’t have to do it right now.

Another way to say this is that I have partially applied the argument sex %in% c("male", "female") to the filter function. That is, I created a version of the filter function with that boolean condition pre-specified. When I call that function later, it will apply the condition without needing me to pass it again.

To assure you that it works, here I build the same function, give it a name, and then pass the penguins data. I get the same result as the original, fully-specified filter call.

flt_sex = filtering(sex %in% c("male", "female"))
flt_sex(penguins)
## # A tibble: 333 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           41.1          17.6               182        3200
##  9 Adelie  Torgersen           38.6          21.2               191        3800
## 10 Adelie  Torgersen           34.6          21.1               198        4400
## # ℹ 323 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

You typically don’t want to pass the data immediately after creating the function. The purpose of currr is to create the curried function and evaluate it on data later on. Why delay the evaluation of data? Because by separating the function from the data, it is easier to compose and recycle functionality to act on multiple datasets. More explanation in the “Why” section below.

Thing 2 of 2: currr functions can be composed without evaluating on data.

Most R users hear “function composition” and think of the pipe operator |>, which turns a nested function call like h(g(f(x))) into x |> f() |> g() |> h(). That is convenient, but its behavior is too eager for our needs. We want to combine functions without passing the data x.

I could write a new function that combines f, g, and h and then pass x to that function like this…

fgh = function(...) f(...) |> g() |> h()

x |> fgh()

…but that is a bit clunky. Mathematical notation for function composition, meanwhile, is simple: \(fgh = h \circ g \circ f\).2 Can I achieve something that simple in R? Yes.

currr provides operators for two different kinds of function composition. First, the %.% operator for classical mathematical composition: (g %.% f)(x) is like g(f(x)). You should read g %.% f as “do g after f”. Second, we have a postfix-style %;% composition operator that reads more “pipe-like”: (f %;% g)(x) is like x |> f() |> g().3 f %;% g is “do f and then g”.
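Here is a small base R sketch of how the two directions relate. The operator definitions below are stand-ins written for this example, and the `%;%` definition in particular is an assumption (simply `%.%` with its arguments flipped); the package's own versions may differ. `add_one` and `times_two` are toy functions.

```r
# Stand-in definitions for illustration; not copied from the package.
`%.%` <- function(g, f) function(...) g(f(...))   # "g after f"
`%;%` <- function(f, g) function(...) g(f(...))   # "f and then g"

add_one   <- function(x) x + 1
times_two <- function(x) x * 2

(times_two %.% add_one)(3)   # times_two(add_one(3)) = 8
(add_one %;% times_two)(3)   # same pipeline, written left to right = 8
(times_two %;% add_one)(3)   # add_one(times_two(3)) = 7
```

Nothing is evaluated until the final call with `3`; each parenthesized expression on its own is just another function.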

These operators let us compose little functions into bigger functions without applying them on data. Take the example at the top of the post. Each of these objects is a function.

flt_sex_FM = filtering(sex %in% c("male", "female"))

by_sex_isl = grouping(sex, island)

smz_mass = summarizing(
    mass_med = median(body_mass_g, na.rm = TRUE),
    mass_mean = mean(body_mass_g, na.rm = TRUE),
    mass_std = sd(body_mass_g, na.rm = TRUE),
    .groups = "drop"
)

I compose them into a “chain” of operations, which is itself a new function, without evaluating it.

mass_by_sex_isl = (flt_sex_FM %;% by_sex_isl %;% smz_mass)

And only when I need the results, I can pass the data.

mass_by_sex_isl(penguins)
## # A tibble: 6 × 5
##   sex    island    mass_med mass_mean mass_std
##   <fct>  <fct>        <dbl>     <dbl>    <dbl>
## 1 female Biscoe       4588.     4319.     660.
## 2 female Dream        3450      3446.     270.
## 3 female Torgersen    3400      3396.     259.
## 4 male   Biscoe       5350      5105.     714.
## 5 male   Dream        3950      3987.     350.
## 6 male   Torgersen    4000      4035.     372.

Why use the currr style?

Now that you know how to use currr, you may want to know why you would bother.

First, pipe chains are easy to write but hard to re-use. Let’s look at an example in dplyr world. I have the penguins and I want to summarize the body mass column.

penguins |>
    summarize(
        mass_med = median(body_mass_g, na.rm = TRUE),
        mass_mean = mean(body_mass_g, na.rm = TRUE),
        mass_std = sd(body_mass_g, na.rm = TRUE),
        .groups = "drop"
    )
## # A tibble: 1 × 3
##   mass_med mass_mean mass_std
##      <dbl>     <dbl>    <dbl>
## 1     4050     4202.     802.

Okay, I also want to compute the same stats, grouped by sex. Notice I have to write the summarize step again.

penguins |>
    group_by(sex) |>
    summarize(
        mass_med = median(body_mass_g, na.rm = TRUE),
        mass_mean = mean(body_mass_g, na.rm = TRUE),
        mass_std = sd(body_mass_g, na.rm = TRUE),
        .groups = "drop"
    )
## # A tibble: 3 × 4
##   sex    mass_med mass_mean mass_std
##   <fct>     <dbl>     <dbl>    <dbl>
## 1 female     3650     3862.     666.
## 2 male       4300     4546.     788.
## 3 <NA>       4100     4006.     679.

And if I also wanted it grouped by sex and island, summarize yet again.

penguins |>
    group_by(sex, island) |>
    summarize(
        mass_med = median(body_mass_g, na.rm = TRUE),
        mass_mean = mean(body_mass_g, na.rm = TRUE),
        mass_std = sd(body_mass_g, na.rm = TRUE),
        .groups = "drop"
    )
## # A tibble: 9 × 5
##   sex    island    mass_med mass_mean mass_std
##   <fct>  <fct>        <dbl>     <dbl>    <dbl>
## 1 female Biscoe       4588.     4319.     660.
## 2 female Dream        3450      3446.     270.
## 3 female Torgersen    3400      3396.     259.
## 4 male   Biscoe       5350      5105.     714.
## 5 male   Dream        3950      3987.     350.
## 6 male   Torgersen    4000      4035.     372.
## 7 <NA>   Biscoe       4688.     4588.     338.
## 8 <NA>   Dream        2975      2975       NA 
## 9 <NA>   Torgersen    3588.     3681.     413.

Now I have three places in my code where I write the same summarize code, because I wanted to see it three different ways. And if I want to change that code (say, add a mean absolute deviation statistic), I have to change it in all three places. This is a good scenario for writing a function! currr gives us a way to write that function conveniently and composably. We write the summarize step one time and re-use it whenever we want later.

smz_mass = summarizing(
    mass_med = median(body_mass_g, na.rm = TRUE),
    mass_mean = mean(body_mass_g, na.rm = TRUE),
    mass_std = sd(body_mass_g, na.rm = TRUE),
    .groups = "drop"
)

smz_mass(penguins)
## # A tibble: 1 × 3
##   mass_med mass_mean mass_std
##      <dbl>     <dbl>    <dbl>
## 1     4050     4202.     802.

penguins |> group_by(sex) |> smz_mass()
## # A tibble: 3 × 4
##   sex    mass_med mass_mean mass_std
##   <fct>     <dbl>     <dbl>    <dbl>
## 1 female     3650     3862.     666.
## 2 male       4300     4546.     788.
## 3 <NA>       4100     4006.     679.

penguins |> group_by(sex, island) |> smz_mass()
## # A tibble: 9 × 5
##   sex    island    mass_med mass_mean mass_std
##   <fct>  <fct>        <dbl>     <dbl>    <dbl>
## 1 female Biscoe       4588.     4319.     660.
## 2 female Dream        3450      3446.     270.
## 3 female Torgersen    3400      3396.     259.
## 4 male   Biscoe       5350      5105.     714.
## 5 male   Dream        3950      3987.     350.
## 6 male   Torgersen    4000      4035.     372.
## 7 <NA>   Biscoe       4688.     4588.     338.
## 8 <NA>   Dream        2975      2975       NA 
## 9 <NA>   Torgersen    3588.     3681.     413.

This is the benefit we get by separating the data from the functionality.

  • When functions are pre-defined, we can pass whatever data whenever we want.
  • It is easier to abstract over the data because we wrote modular functions instead of a pipeline that forces us to provide the data up front.
  • We expend a little up-front effort to define these little functions, but we amortize the costs when we invoke those functions repeatedly. Lots of little functions may look silly in isolation, but they aren’t silly when you compose them to create bigger functionality.
  • We greatly reduce the cost of changing the definitions of these functions; change the function definition in one place, and inherit the change everywhere the function is called.
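The same separation of data from functionality can be sketched in plain base R, with no currr or dplyr involved. `summarizing_col` is a hypothetical stand-in I made up for this example, and it uses the built-in mtcars data rather than the penguins.

```r
# Hypothetical stand-in for a summarizing-style function: defined once, no data attached.
summarizing_col <- function(col) {
    function(.data) {
        x <- .data[[col]]
        data.frame(
            med  = median(x, na.rm = TRUE),
            mean = mean(x, na.rm = TRUE),
            std  = sd(x, na.rm = TRUE)
        )
    }
}

smz_mpg <- summarizing_col("mpg")

# The same function, applied to whatever data we like, whenever we like.
smz_mpg(mtcars)
smz_mpg(subset(mtcars, cyl == 4))
by_cyl <- lapply(split(mtcars, mtcars$cyl), smz_mpg)
```

If I later want another statistic, I add it inside `summarizing_col` once and every call site picks it up.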

How currr works

All of currr fits in one small file. Although in fairness, it probably isn’t finished yet.

currr works by creating curryable dplyr verbs. For example:

selecting <- currify_verb(dplyr::select)

The implementation of currify_verb is, in turn…

currify_verb <- function(verb) {
    function(...) {
        intention <- function(.data) {
            return(purrr::partial(verb)(.data, ...))
        }
        memoise::memoise(intention)
    }
}

The outermost function currify_verb takes a verb and returns a new function of args .... The args are then used to return another function of .data, enclosing both the verb and the args in the environment of the innermost function. I call the innermost function an intention, because it declares an intent to evaluate a verb without actually evaluating it (yet).
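To see the three layers in isolation, here is a stripped-down sketch of the same closure pattern with purrr and memoise left out. `currify_sketch` is a name made up for this example, and I apply it to base R's `head` rather than a dplyr verb so that it runs without any packages.

```r
# Simplified illustration of the pattern: verb -> args -> data.
# Not the package's implementation (no purrr::partial, no memoization).
currify_sketch <- function(verb) {
    function(...) {
        function(.data) verb(.data, ...)   # verb and ... are enclosed here
    }
}

heading   <- currify_sketch(head)   # layer 1: choose the verb
first_two <- heading(n = 2)         # layer 2: fix the arguments
first_two(mtcars)                   # layer 3: supply the data, evaluate
```

The inner function has no `...` of its own, so the `...` it forwards is the one captured from the enclosing call to `heading`.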

Another notable detail is that we return not the intention itself but a memoized intention, using the memoise package. Memoization caches function values in (morally) a hash table keyed by the function arguments. The first time I evaluate a function on some arguments, the value is stored under a key derived from those arguments. If I evaluate the function on the same arguments again, I can look up the stored value instead of recomputing it from scratch. This gives us much the same efficiency as, say, storing a copy of the data in some intermediate state, without the need to pollute our environment with intermediate data that we don’t care about in itself. Now that’s functional programming!
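For intuition only, here is a toy memoizer in base R. `memoize_sketch` is a hypothetical name, and keying the cache on deparsed arguments is a crude assumption made for the example; the memoise package handles hashing and cache management far more robustly.

```r
# Toy memoizer for illustration; use the memoise package for real work.
memoize_sketch <- function(f) {
    cache <- new.env()
    function(...) {
        # Crude cache key: the deparsed argument list.
        key <- paste(deparse(list(...)), collapse = "")
        if (!exists(key, envir = cache, inherits = FALSE)) {
            assign(key, f(...), envir = cache)   # first call: compute and store
        }
        get(key, envir = cache, inherits = FALSE)  # later calls: look up
    }
}

n_calls <- 0
slow_square <- function(x) {
    n_calls <<- n_calls + 1   # count how often the real work happens
    x^2
}

fast_square <- memoize_sketch(slow_square)
first  <- fast_square(4)
second <- fast_square(4)   # served from the cache
n_calls                    # the underlying function ran only once
```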

The only other thing to note is how we implement function composition:

`%.%` <- function(g, f) function(...) g(f(...))

In R, we can define custom binary operators as functions of two arguments. Function composition is associative, which lets me compose multiple functions like f %.% g %.% h %.% i without parentheses: associativity tells us the grouping doesn’t matter. This is similar to the way + is a function of two arguments, yet we can still write a + b + c + d and so on. Now that’s also functional programming!
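A quick check of that associativity claim, using the `%.%` definition above and three toy functions (`f`, `g`, and `h` are placeholders invented for this example):

```r
`%.%` <- function(g, f) function(...) g(f(...))

f <- function(x) x + 1
g <- function(x) x * 2
h <- function(x) x^2

left  <- (h %.% g) %.% f   # group the first two
right <- h %.% (g %.% f)   # group the last two
chain <- h %.% g %.% f     # no parentheses at all

left(3)    # h(g(f(3))) = (2 * 4)^2 = 64
right(3)   # 64
chain(3)   # 64

# Because grouping doesn't matter, a whole list of functions can be folded into one.
folded <- Reduce(`%.%`, list(h, g, f))
folded(3)  # 64
```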

Footnotes

  1. If add(1, 2) returns 3, then curry(add, 1) creates a new function that adds 1 to its argument. Technically speaking, currying has a stricter definition than that: currying is turning a multi-argument function into a series of lambda functions that each take one argument. What we are doing is “merely” partial application of arguments. Read about the differences if you are so inclined.↩︎

  2. You can read more about function composition in this other post as well.↩︎

  3. There also exists purrr::compose, which takes an arbitrary number of functions and an optional .dir argument to switch the direction of composition. Take your pick.↩︎