library("palmerpenguins")
library("dplyr")
|>
penguins filter(sex %in% c("male", "female")) |>
group_by(sex, island) |>
summarize(
mass_med = median(body_mass_g, na.rm = TRUE),
mass_mean = mean(body_mass_g, na.rm = TRUE),
mass_std = sd(body_mass_g, na.rm = TRUE),
.groups = "drop"
)## # A tibble: 6 × 5
## sex island mass_med mass_mean mass_std
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 female Biscoe 4588. 4319. 660.
## 2 female Dream 3450 3446. 270.
## 3 female Torgersen 3400 3396. 259.
## 4 male Biscoe 5350 5105. 714.
## 5 male Dream 3950 3987. 350.
## 6 male Torgersen 4000 4035. 372.
This post proposes a dialect for data frame operations that I call currr
. (I am not married to the name.) You can find and install a very unfinished package on Github.
currr
is built on dplyr
, but it uses a distinct grammar.
A typical dplyr
workflow will pipe a data frame through a series of functions (in this case filter
, group_by
, and summarize
).
The purpose of currr
is to provide a convenient way to pre-define functions that perform these data manipulation steps, without referring to a data frame. Instead of filter
, group_by
, and summarize
, here we use currr::filtering
, currr::grouping
, and currr::summarizing
to create new functions.
library(currr)
##
## Attaching package: 'currr'
## The following object is masked from 'package:base':
##
## grouping
# each of these returns a new function that is dataframe -> dataframe
= filtering(sex %in% c("male", "female"))
flt_sex_FM = grouping(sex, island)
by_sex_isl = summarizing(
smz_mass mass_med = median(body_mass_g, na.rm = TRUE),
mass_mean = mean(body_mass_g, na.rm = TRUE),
mass_std = sd(body_mass_g, na.rm = TRUE),
.groups = "drop"
)
These functions can be rearranged, composed, and called later.
%.% by_sex_isl)(penguins)
(smz_mass ## # A tibble: 9 × 5
## sex island mass_med mass_mean mass_std
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 female Biscoe 4588. 4319. 660.
## 2 female Dream 3450 3446. 270.
## 3 female Torgersen 3400 3396. 259.
## 4 male Biscoe 5350 5105. 714.
## 5 male Dream 3950 3987. 350.
## 6 male Torgersen 4000 4035. 372.
## 7 <NA> Biscoe 4688. 4588. 338.
## 8 <NA> Dream 2975 2975 NA
## 9 <NA> Torgersen 3588. 3681. 413.
%.% by_sex_isl %.% flt_sex_FM)(penguins)
(smz_mass ## # A tibble: 6 × 5
## sex island mass_med mass_mean mass_std
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 female Biscoe 4588. 4319. 660.
## 2 female Dream 3450 3446. 270.
## 3 female Torgersen 3400 3396. 259.
## 4 male Biscoe 5350 5105. 714.
## 5 male Dream 3950 3987. 350.
## 6 male Torgersen 4000 4035. 372.
At this point you may be wondering,
If
dplyr
is already so good, why should I complicate my life with this new style?
…and this post will try to answer.
We proceed in three sections below:
- The tl;dr: how to use and understand
currr
- For the skeptics: why you would want to use
currr
. - For the dorks: how
currr
really works.
The basics of currr
If you know dplyr
, you know almost everything you need to know about currr
. There are only two things going on here.
currr
code is just currieddplyr
code.currr
functions are meant to be composed without evaluating them immediately.
Thing 1 of 2: currr
code is just curried dplyr
code.
“Currying” a function means turning a function of many arguments into a function of fewer arguments. I have written about this before, but we will explain it plenty here too.1
dplyr
functions like filter
and select
have analogous currr
functions like filtering
and selecting
. The currr
functions create curried versions of the dplyr
functions. Let me give you an example. Here is how we can use dplyr::filter
to keep only known male and female penguins in the palmerpenguins
dataset.
filter(penguins, sex %in% c("male", "female"))
## # A tibble: 333 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 41.1 17.6 182 3200
## 9 Adelie Torgersen 38.6 21.2 191 3800
## 10 Adelie Torgersen 34.6 21.1 198 4400
## # ℹ 323 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
Notice that filter
has several arguments: the data frame itself, and as many boolean expressions as you want to filter with. And it returns a new data frame.
Here is how to accomplish a similar functionality with currr::filtering
. filtering
is just like filter
, but I don’t pass the data frame.
filtering(sex %in% c("male", "female"))
## Memoised Function:
## function (.data)
## {
## return((purrr::partial(verb))(.data, ...))
## }
## <bytecode: 0x13cd86920>
## <environment: 0x12c97eac8>
And instead of getting a data frame back, I get a new function. The new function maps me from dataframe to dataframe. I can pass that dataframe later on, though. I don’t have to do it right now.
Another way to say this is that I have partially applied the argument sex %in% c("male", "female")
to the filter
function. That is, I created a version of the filter
function with that boolean condition pre-specified. When I call that function later, it will invoke the boolean expression without needing me to pass it again.
To assure you that it works, here I build the same function, give it a name, and then pass the penguins
data. I get the same result as the original, fully-specified filter
call.
= filtering(sex %in% c("male", "female"))
flt_sex flt_sex(penguins)
## # A tibble: 333 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 41.1 17.6 182 3200
## 9 Adelie Torgersen 38.6 21.2 191 3800
## 10 Adelie Torgersen 34.6 21.1 198 4400
## # ℹ 323 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
You typically don’t want to pass the data immediately after creating the function. The purpose of currr
is to create the curried function and evaluate it on data later on. Why delay the evaluation of data? Because by separating the function from the data, it is easier to compose and recycle functionality to act on multiple datasets. More explanation in the “Why” section below.
Thing 2 of 2: currr
functions can be composed without evaluating on data.
Most R users hear “function composition” and think of the pipe operator |>
, which turns a nested function call like h(g(f(x)))
into x |> f() |> g() |> h()
. That is convenient, but its behavior is too eager for our needs. We want to combine functions without passing the data x
.
I could write a new function that combines f
, g
, and h
and then pass x
to that function like this…
= function(...) f(...) |> g() |> h()
fgh
|> fgh() x
…but that is a bit clunky. Mathematical notation for function composition, meanwhile, is simple: \(fgh = h \cdot g \cdot f\).2 Can I achieve something that simple in R? Yes.
currr
provides operators for two different kinds of function composition. First, the %.%
operator for classical mathematical composition: (g %.% f)(x)
is like g(f(x))
. You should read g %.% f
as “do g
after f
”. Second, we have a postfix-style %;%
composition operator that reads more “pipe-like”: (f %;% g)(x)
is like x |> f() |> g()
.3 f %;% g
is “do f
and then g
”.
These operators let us compose little functions into bigger functions without applying them on data. Take the example at the top of the post. Each of these objects is a function.
= filtering(sex %in% c("male", "female"))
flt_sex_FM
= grouping(sex, island)
by_sex_isl
= summarizing(
smz_mass mass_med = median(body_mass_g, na.rm = TRUE),
mass_mean = mean(body_mass_g, na.rm = TRUE),
mass_std = sd(body_mass_g, na.rm = TRUE),
.groups = "drop"
)
I compose them into a “chain” of operations, which is itself a new function, without evaluating it.
= (flt_sex_FM %;% by_sex_isl %;% smz_mass) mass_by_sex_isl
And only when I need the results, I can pass the data.
mass_by_sex_isl(penguins)
## # A tibble: 6 × 5
## sex island mass_med mass_mean mass_std
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 female Biscoe 4588. 4319. 660.
## 2 female Dream 3450 3446. 270.
## 3 female Torgersen 3400 3396. 259.
## 4 male Biscoe 5350 5105. 714.
## 5 male Dream 3950 3987. 350.
## 6 male Torgersen 4000 4035. 372.
Why use the currr
style?
Now that you know how to use currr
, you may want to know why you would bother.
First, pipe chains are easy to write but hard to re-use. Let’s look at an example in dplyr
world. I have the penguins
and I want to summarize the body mass column.
|>
penguins summarize(
mass_med = median(body_mass_g, na.rm = TRUE),
mass_mean = mean(body_mass_g, na.rm = TRUE),
mass_std = sd(body_mass_g, na.rm = TRUE),
.groups = "drop"
)## # A tibble: 1 × 3
## mass_med mass_mean mass_std
## <dbl> <dbl> <dbl>
## 1 4050 4202. 802.
Okay, I also want to compute the same stats, grouped by sex
. Notice I have to write the summarize
step again.
|>
penguins group_by(sex) |>
summarize(
mass_med = median(body_mass_g, na.rm = TRUE),
mass_mean = mean(body_mass_g, na.rm = TRUE),
mass_std = sd(body_mass_g, na.rm = TRUE),
.groups = "drop"
)## # A tibble: 3 × 4
## sex mass_med mass_mean mass_std
## <fct> <dbl> <dbl> <dbl>
## 1 female 3650 3862. 666.
## 2 male 4300 4546. 788.
## 3 <NA> 4100 4006. 679.
And if I also wanted it grouped by sex and island, summarize yet again.
|>
penguins group_by(sex, island) |>
summarize(
mass_med = median(body_mass_g, na.rm = TRUE),
mass_mean = mean(body_mass_g, na.rm = TRUE),
mass_std = sd(body_mass_g, na.rm = TRUE),
.groups = "drop"
)## # A tibble: 9 × 5
## sex island mass_med mass_mean mass_std
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 female Biscoe 4588. 4319. 660.
## 2 female Dream 3450 3446. 270.
## 3 female Torgersen 3400 3396. 259.
## 4 male Biscoe 5350 5105. 714.
## 5 male Dream 3950 3987. 350.
## 6 male Torgersen 4000 4035. 372.
## 7 <NA> Biscoe 4688. 4588. 338.
## 8 <NA> Dream 2975 2975 NA
## 9 <NA> Torgersen 3588. 3681. 413.
Now I have three instances in my code where I need to write the same summarize
code, because I wanted to see it three different ways. And if I want to change that code (say, add a mean abs. deviation statistic), I have to change it in multiple places. This is a good scenario to write a function! currr
gives us a way to write that function conveniently and composably. We write the summarize
step one time and re-use whenever we want later.
= summarizing(
smz_mass mass_med = median(body_mass_g, na.rm = TRUE),
mass_mean = mean(body_mass_g, na.rm = TRUE),
mass_std = sd(body_mass_g, na.rm = TRUE),
.groups = "drop"
)
smz_mass(penguins)
## # A tibble: 1 × 3
## mass_med mass_mean mass_std
## <dbl> <dbl> <dbl>
## 1 4050 4202. 802.
|> group_by(sex) |> smz_mass()
penguins ## # A tibble: 3 × 4
## sex mass_med mass_mean mass_std
## <fct> <dbl> <dbl> <dbl>
## 1 female 3650 3862. 666.
## 2 male 4300 4546. 788.
## 3 <NA> 4100 4006. 679.
|> group_by(sex, island) |> smz_mass()
penguins ## # A tibble: 9 × 5
## sex island mass_med mass_mean mass_std
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 female Biscoe 4588. 4319. 660.
## 2 female Dream 3450 3446. 270.
## 3 female Torgersen 3400 3396. 259.
## 4 male Biscoe 5350 5105. 714.
## 5 male Dream 3950 3987. 350.
## 6 male Torgersen 4000 4035. 372.
## 7 <NA> Biscoe 4688. 4588. 338.
## 8 <NA> Dream 2975 2975 NA
## 9 <NA> Torgersen 3588. 3681. 413.
This is the benefit we get by separating the data from the functionality.
- When functions are pre-defined, we can pass whatever data whenever we want.
- It is easier to abstract over the data because we wrote modular functions instead of a pipeline that forces you to provide the data up front.
- We expend a little up-front effort to define these little functions, but we amortize the costs when we invoke those functions repeatedly. Lots of little functions may look silly in isolation, but they aren’t silly when you compose them to create bigger functionality.
- We greatly reduce the cost of changing the definitions of these functions; change the function definition in one place, and inherit the change everywhere the function is called.
How currr
works
All of currr
fits in one small file. Although in fairness, it probably isn’t finished yet.
currr
works by creating curryable dplyr
verbs. For example:
<- currify_verb(dplyr::select) selecting
The implementation of currify_verb
is, in turn…
<- function(verb) {
currify_verb function(...) {
<- function(.data) {
intention return(purrr::partial(verb)(.data, ...))
}::memoise(intention)
memoise
} }
The outermost function currify_verb
takes a verb
and returns a new function of args ...
. The args are then used to return another function of .data
, enclosing both the verb and the args in the environment of the innermost function. I call the innermost function an intention
, because it declares an intent to evaluate a verb without actually evaluating it (yet).
Another notable detail is that we return not the intention
object but a memoized intention
using the memoise package. Memoization caches function values in (morally) a hash table keyed by the function arguments. If I evaluate a function for the first time, I hash the value by the function arguments. If I evaluate a function a second time or more, I can lookup the value in the lookup table instead of recomputing the value from scratch. This gives us the same efficiency as, say, storing a copy of data in some intermediate state, without the need to pollute our environment with intermediately-stateful data that we don’t care about unto itself. Now that’s functional programming!
The only other thing to note is how we implement function composition:
`%.%` <- function(g, f) function(...) g(f(...))
In R, we can define custom binary operations as functions of two arguments. Function composition is associative, which lets me compose multiple functions like f %.% g %.% h %.% i
without parentheses. Associativity tells us the grouping doesn’t matter. This is similar to the way +
is a function of two variables, yet we can still write a + b + c + d
and so on. Now that’s also functional programming!
Footnotes
If
add(1, 2)
returns3
, thencurry(add, 1)
creates a new function that adds an argument to1
. Technically speaking currying has a stricter definition than that: currying is turning a multi-argument function into a series of lambda functions that each take one argument. What we are doing is “merely” partial application of arguments. Read about the differences if you are so inclined.↩︎You can read more about function composition in this other post as well.↩︎
There also exists
purrr::compose
, which takes an arbitrary number of functions and an optional.dir
argument to switch the direction of composition. Take your pick.↩︎