Obnoxious functional programming in R

Functional programming is much weirder than lapply

Categories

code, r
Author

Michael DeCrescenzo

Published

August 17, 2022

When people talk about “functional programming” in R, they usually mean one of two things.

  1. Not doing for-loops. Many intermediate R users have a thing against for-loops because for-loops have a (not exactly fair1) reputation for being slow in R. Instead, these users advocate for *apply functions or purrr::map. This is called “functional programming” because we are applying a function over some container of data instead of imperatively manipulating the data’s state directly.
  2. More advanced R discussions refer to the fact that functions are first-class objects in R, so you can examine their properties and do operations on functions themselves, just like you could with other objects.

[1] is definitely the more popular understanding than [2], which is a shame, because functional programming outside of R goes way deeper, and gets way weirder, than apply(function, data). This post is an attempt to explain a little of that weirdness and implement it in R with some obnoxious examples.

Readers with more FP experience will recognize that this post isn’t the most rigorous. They may also realize that R is backed by a lot of functional ideas even if many users may not recognize or articulate those ideas readily. That’s fine if you know that stuff, but this post is meant to be light on the technicals.

What is functional programming?

Let’s do the Wikipedia thing. From the functional programming article:

In computer science, functional programming is a programming paradigm where programs are constructed by applying and composing functions.

You may be thinking, “That sounds pretty unremarkable. I already know that I write functions and apply them. Why do we need a name for this?”

Well… read on:

It is a declarative programming paradigm in which function definitions are trees of expressions that map values to other values, rather than a sequence of imperative statements which update the running state of the program.

There are many information-dense pieces in that sentence that will be difficult to appreciate without a little more theory. But let’s try to break it apart, starting in the middle.

“Functions map values to other values”

You have probably seen a function definition written like this, \[ y = a + bx \] and we say that \(y\) is a function of \(x\). We can label the transformation of \(x\) as \(f\) and say \(y = f(x)\). Easy!

But there is another way to write statements like this: \[ f : X \to Y \tag{1}\] which reads, “\(f\) is a map from \(X\) to \(Y\)”, where capital-\(X\) and capital-\(Y\) are sets of values from which little-\(x\) and little-\(y\) are drawn. Which is to say: \(f\) is like a lookup table that outputs a value from \(Y\) when it sees an input from \(X\). In this example, the value of \(y\) that it returns is equal to \(a + (b \times x)\).2
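
If it helps to see this in code, here is a minimal sketch of the map in R, with \(a\) and \(b\) fixed ahead of time:

# f : numeric -> numeric
a = 1
b = 2
f = function(x) a + b * x
f(3)
## [1] 7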

“Function definitions are trees of expressions …”

Referring to functions as “trees” won’t make sense until we talk about function composition and associativity.

First, composition. If the notation in Equation 1 is new to you, your mental picture of function composition probably looks like this. We can take two functions \(f\) and \(g\)

\[\begin{align} y &= f(x) \\ z &= g(y) \end{align}\]

and put them together… \[ \begin{align} z &= g(f(x)) \end{align} \tag{2}\] and then we can represent that composition with a different symbol \(h\), so that \(h(x)\) is equivalent to \(g(f(x))\).

There’s nothing incorrect about that approach, but it is verbose. We actually don’t have to refer to any function arguments or values (\(x\), \(y\), or \(z\)) to introduce \(h\). Instead we can define \(h\) in a “point-free” style, \[ h = g \circ f \tag{3}\] where the symbol \(\circ\) refers to function composition (in a leftward direction). The expression \(g \circ f\) is a new function, which we can say aloud as “\(g\) compose \(f\)” or “\(g\) after \(f\)”. Because the composition is itself a function, we can pass it an argument: \((g \circ f)(x)\) would spit out the value \(z\), just like in Equation 2.

Okay, now associativity. Function composition is associative, which means compositions can be “grouped” in whatever way you want as long as you don’t change the ordering of the functions.

To demonstrate, let’s say we have three functions, \(f\), \(g\), and \(j\), and we want to compose these functions. There are a few ways to do it:

  • As one single chained composition: \(j \circ g \circ f\), which if applied to \(x\) would be equivalent to \(j(g(f(x)))\).
  • Introduce \(h = g \circ f\), and rewrite as \(j \circ h\). This is the same as \(j \circ (g \circ f)\) or \(j(h(x))\).
  • Compose \(j\) and \(g\) into some \(d\) and say \(d \circ f\), which is the same as \((j \circ g) \circ f\) or \(d(f(x))\).

All of these expressions are equivalent.

Okay, recap. As long as the output of one map is the same “type” as the input to some other map, we can compose those two functions. And we can take a series of compositions and group them in whatever way we want, as long as the input and output types conform.

But how is this a “tree”? I will make a little graph with nodes and edges that shows the basic idea. Caveat up front: this will be pretty informal. Let’s say the functions are nodes, and the edges point in the direction of composition. Let’s say function \(f\) takes two arguments, \(a\) and \(b\).

[Diagram: nodes \(a\) and \(b\), each with an edge pointing into \(f\).]

But maybe the value of \(a\) is the result of a function that itself takes two arguments, \(p\) and \(q\), and \(b\) is the result of a function of \(r\).

[Diagram: the same graph, now with edges \(p \to a\), \(q \to a\), and \(r \to b\) feeding the edges \(a \to f\) and \(b \to f\).]

The tree lets us express composition and associativity in a different way: \(f\) doesn’t really care how you compose or group the operations ahead of it, as long as the values that you pass are \(a\)-like and \(b\)-like.
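
As an aside, R can show you such an expression tree directly, because (recall point [2] from the top) expressions and functions are objects we can inspect. A quick sketch with quote:

# an unevaluated expression is a tree of calls
e = quote(f(a(p, q), b(r)))
as.list(e)
## [[1]]
## f
## 
## [[2]]
## a(p, q)
## 
## [[3]]
## b(r)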

“…Rather than a sequence of imperative statements that update the running state.”

This is the really important stuff.

Most of the time when we do data analysis, we write code that updates the state of some data. Say I start with some \(x\), and then I do functions \(f\), \(g\), and \(j\) to it.

y = f(x)
z = g(y)
w = j(z)

There is some data x that we alter, in steps, by doing things to it, each time saving the result in some intermediate state.

The functional approach would be different. Instead of spending most of our time, energy, and keystrokes passing data around, we spend these resources writing functions.

h = compose(g, f)
a = compose(j, h)

And only later (say, in some main() routine) do we apply our ultimately-composed functions to data. The data have to be acted on eventually, but we do it only after we compose a map from the data to some endpoint.

Stated a different way, when we do more imperative programming, the present state of x is always our concern. Our path from beginning to end requires us to leave x in some intermediate state at many different steps, and if the state of x is exposed along the way, there can be problems that mutate x into something you don’t want it to be. When we do functional programming, however, we write the recipe for what we will do to x if we ever encounter such an x. The roadmap is set ahead of time, so the intermediate state of x is obscured behind the functions that we compose. We can’t really get to the intermediate state because we aren’t supposed to be able to.

This might be hard to envision because we haven’t talked tangibly about functional programming interfaces in R yet, so let’s do that.

Functional fundamentals, with R

We saw a notational approach to function composition above that looked like this: \[\begin{align} a = (j \circ g \circ f) \end{align}\] If we could do that in a computer, it might look like this:

a = compose(j, g, f)

There are two important things to remember about this.

  1. Function composition creates a new function. It does not immediately evaluate the function on any data. In the above example, a is the composition, and it is entirely ignorant of any data you may pass to it.
  2. Function compositions are themselves composable, so we want a framework to compose a lot of functions all at once with whatever associative groupings we want. We saw above (with math) that we could compose an arbitrary sequence of functions and the result should be well-behaved (as long as their input and output types conform—more on that in a bit). We would like to achieve that behavior in the code as well.

We start with a primitive operation to compose just two functions. We call it compose_once—it implements only one composition of a left-hand and a right-hand function.

# returns the new function: f . g
compose_once = function(f, g) {
    function(...) f(g(...))
}

Read this closely. compose_once takes two arguments, and those arguments are functions. We do not know or care what those functions are. We also return a new function, rather than some data value. That new function defines a recipe for evaluating \(g\) and then \(f\) on some unknown inputs ..., whenever it happens to come across those inputs—we didn’t pass any ... yet. This is also intentional: composition does not care what the data are. It only knows how to evaluate functions in order.

This might feel weird at first, but it will give us legible behavior right from the start. For example, we often want to know how many unique elements are in some data object. In base R, we might ask length(unique(x)). The dplyr package provides a combined function n_distinct(x), but we haven’t invented dplyr yet, so we have to make our own n_unique:

# we save the function returned by compose_once
# no computation on data is done yet.
n_unique = compose_once(length, unique)

# examine the object to convince yourself of this
n_unique
## function(...) f(g(...))
## <environment: 0x112c0b890>

# apply the function on some data
x = c(0, 0, 0, 1, 2)
n_unique(x)
## [1] 3

compose_once lets us compose two functions, but we want to compose an arbitrary sequence of functions into one composition. So we extend this interface by performing a reduction across an array of functions.

# pass vector fns = c(f1, f2, ..., fn)
compose = function(fns) {
    Reduce(compose_once, fns, init=identity)
}

If you aren’t familiar with reductions, they are an efficient trick to collapse an array (loosely speaking) of input values into one output value, accumulating at each step the results of some binary operation. If you want to learn more about reduction, follow this footnote.3 We also supply an argument to init, which isn’t strictly necessary in this R example but is interesting (IMO) in the mathematical context of function composition, which I explain in this other footnote.4
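
If reductions are brand new to you, here is the same mechanic on numbers instead of functions, as a minimal sketch:

# reduce by addition, starting from the initial value 0:
# ((((0 + 1) + 2) + 3) + 4)
Reduce(`+`, c(1, 2, 3, 4), init = 0)
## [1] 10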

Let’s see it in action by adding a step that converts the result of n_unique into an English word.

# same as n_unique but convert to a word
english_n_unique = compose(c(english::english, length, unique))

# apply to our vector x
english_n_unique(x)
## [1] three

And just to underscore your confidence in the associativity of composition, we can exhaustively regroup these compositions without affecting the results.

compose(c(english::english, compose(c(length, unique))))(x)
## [1] three
compose(c(compose(c(english::english, length)), unique))(x)
## [1] three

Types are our guide.

You may have seen online discussions about “strong typing”. Types describe how data are represented in the computer. We have basic data types like short integers, long integers, characters, and strings. Types encode abstract features of data without caring about the particular values themselves. Languages implement other data structures like lists, dictionaries, arrays, tuples, etc. that we can also call types for our purposes.

You may have seen some people discuss the “type flexibility” of R or Python as an advantage. Well…sorry! Functional programming likes strong types, because strong types provide structure to function composition at scale. More specifically, two functions can be composed if the output type of one function matches the input type of the next function. If we can trust this fact, we can build really big, abstract creations out of function composition, maybe even creations that are so big and complicated that we struggle to keep track of them all in our brains. But these creations are virtually guaranteed to work as long as they are composed of functions with conforming input and output types.
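
To see the type discipline bite, we can deliberately build a composition with a type mismatch: sqrt wants a numeric input, but as.character hands it a string. The composition itself is created without complaint; the mismatch only surfaces when we apply it. A quick sketch with the compose we wrote above:

# sqrt : numeric -> numeric, but as.character : many -> character
bad = compose(c(sqrt, as.character))
bad(4)
## Error: non-numeric argument to mathematical function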

Not coincidentally, the associativity of function composition is additionally helpful for API design. If we have to build a big, abstract structure, we always have the option to group some of the operations together if they perform a coherent and useful operation. Intermediate functions are more useful than intermediate state because at least an intermediate function is legible and potentially reusable. Chances are your intermediate data are not that legible or useful.

Let’s look at the “type roadmap” for the english_n_unique example. We started with a vector type and mapped to an english type. How did we know it would work? We can diagram the types. \[ \begin{align} \mathtt{unique} &: \mathtt{many} \to \mathtt{vector} \\ \mathtt{length} &: \mathtt{many} \to \mathtt{integer} \\ \mathtt{english::english} &: \mathtt{many} \to \mathtt{english} \end{align} \] We know that unique takes objects of various types (which I represent as many to say “this function has many methods for different data types”) and returns a vector of unique elements. We know that length takes many and returns its integer length, and that should work for a vector input. And english::english turns many objects into english type, which ought to work for an integer input. So we know these operations should compose: unique hands a vector to length, and length hands an integer to english::english. Stated differently, if I know the type behavior of each function, I know the type behavior of the composition. That is an extremely useful and powerful foundation for building big things.
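
We can spot-check this roadmap on our vector x from above. A quick sketch (the final class comes from the english package):

class(unique(x))
## [1] "numeric"
class(length(unique(x)))
## [1] "integer"
class(english::english(length(unique(x))))
## [1] "english"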

Something bigger: means within groups

This is where we really start to see the difference between typical “imperative” programming and functional programming “trees”. Let’s take something we commonly do in R, calculate means within groups of data, and do it with our functional tools. We will write this in a functional setup that, I admit, is sort of an ugly interface. But the point isn’t to write a perfect interface, it’s to demonstrate what it’s like to be functional. So let’s roll with some ugly code for a while.

Let’s take the mtcars data frame…

head(mtcars, 5)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

…and calculate the mean of each variable as a new data frame with one row per group. We want to be able to specify the group on the fly.

We want to chart a course through this problem that composes small, re-usable steps on well-defined types. Here is my plan:

  1. define a function that maps a data frame of raw data to a data frame of column means.
  2. define a function that applies any other function to grouped data.
  3. compose a function that implements a split-apply-combine which is the composition of the above steps.

1. Means for a data frame.

We create a function by composing functions with the following type maps:

  • colMeans: data frame \(\to\) vector
  • as.list: vector \(\to\) list
  • tibble::as_tibble: list \(\to\) data frame

You can see how these functions take us incrementally from data frame, to vector, to list, back to data frame. So we know that this composition should work before we ever test it on data.

means = compose(c(tibble::as_tibble, as.list, colMeans))

Testing it on the full data:

means(mtcars)
## # A tibble: 1 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  20.1  6.19  231.  147.  3.60  3.22  17.8 0.438 0.406  3.69  2.81

2. Apply a function over groups

Remember, functional programming hasn’t been invented yet, so we can’t simply do this with lapply. We can, however, reimplement lapply knowing what we know about functional programming.

Let’s create a function that takes any function f and an iterable object l, and returns a list.

apply_on_elements = function(l, f) {
    # initialize an empty list
    v = vector(mode = "list", length = length(l))
    # assign f(l[[i]]) for each element of l
    for (i in seq_along(l)) {
        v[[i]] = f(l[[i]])
    }
    return(v)
}

Notice, this function takes two objects as arguments and returns one object. In order to nicely compose it with other functions that take and return only one object, I want a way to reduce the arguments required when I call it. This will look a little weird, but I am going to create a common object in functional programming called a partial function (closely related to a curried function). A partial function is a function that has some of its arguments fixed ahead of time. Let’s define it before I explain further:

partial_apply = function(f) {
    function(x) apply_on_elements(x, f)
}

So partial_apply takes a function f and returns a new function. That new function applies f to some iterable object x to return a list. But the function isn’t evaluated; it is only created, because I never provide an x in the outer function. The user has to pass x to evaluate the partial at a later time.

Here we use this tool to create a partial length, which we apply to some x.

lens = partial_apply(length)

# examine, it's a function
lens
## function(x) apply_on_elements(x, f)
## <environment: 0x112cdb238>

lens(x)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 1
## 
## [[5]]
## [1] 1

We see more of that delayed evaluation behavior. This lets us create a function that applies our earlier means to groups of data.

partial_apply(means)
## function(x) apply_on_elements(x, f)
## <bytecode: 0x1328ce4f0>
## <environment: 0x132c05138>
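
As an aside, partial_apply hard-codes apply_on_elements, but the trick generalizes to any function. Here is a hedged sketch of a generic partial helper (my own toy; purrr ships a similar partial()):

# fix some leading arguments now, supply the rest later
partial = function(f, ...) {
    fixed = list(...)
    function(...) do.call(f, c(fixed, list(...)))
}

# for example, a paste that always prefixes "group":
label = partial(paste, "group")
label("a")
## [1] "group a"

Note that this generic partial fixes leading arguments, while partial_apply above fixes the second argument of apply_on_elements, but the spirit is the same.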

3. Apply our partial function to data.

Rather, create a function that would do that, if it were evaluated.

This function should turn a data frame into an iterable collection of groups (like a list), apply means to each element of that collection, and return a final data frame that re-merges the collection. Here’s how we map these steps from type to type.

  • split: pair of (data frame, vector) \(\to\) list
  • partial_apply(means): list \(\to\) list
  • dplyr::bind_rows: list \(\to\) data frame

The composition of all these steps creates a function that eats a data frame and returns a data frame.

means_by = compose(c(dplyr::bind_rows, partial_apply(means), split))

Put it all together.

The code below retraces our steps. Step (1) creates our means function. Step (2) creates some functional infrastructure to apply functions over iterable objects, which isn’t the kind of thing we would ordinarily have to mess with as end-users. And step (3) composes our means function with the iteration tools to make our eventual result.

# 1. small function to be applied
means = compose(c(tibble::as_tibble, as.list, colMeans))

# 2. infrastructure layer:
# you can see why these would be generally useful for many problems
# 2a. reimplement *-apply()
apply_on_elements = function(l, f) {
    v = vector(mode = "list", length = length(l))
    for (i in seq_along(l)) {
        v[[i]] = f(l[[i]])
    }
    return(v)
}
# 2b. partial apply
partial_apply = function(f) {
    function(x) apply_on_elements(x, f)
}


# 3. interface layer
means_by = compose(c(dplyr::bind_rows, partial_apply(means), split))

Again, given that step (2) is reinventing the wheel, it’s pretty impressive how little code goes into steps (1) and (3) to achieve the end result. The interface currently isn’t as succinct as dplyr grouping and summarizing, but remember, functional programming hasn’t been invented yet. Moreover, you can start to imagine how tools like partials can be stacked to create tools that are essentially as powerful as dplyr, with nice interfaces too, even if those interfaces are different from the imperative steps you are used to.
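
For instance, one more partial layer gives the caller a friendlier handle. A toy sketch of my own, assuming we only ever group by a column of the same data frame:

# name the grouping column instead of passing a vector
means_by_col = function(col) {
    function(df) means_by(df, df[[col]])
}

# same result as means_by(mtcars, mtcars$cyl)
means_by_col("cyl")(mtcars)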

Let’s apply the ultimate function, means_by, to various groups in mtcars.

means_by(mtcars, mtcars$cyl)
## # A tibble: 3 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  26.7     4  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
## 2  19.7     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
## 3  15.1     8  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5
means_by(mtcars, mtcars$vs)
## # A tibble: 2 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  16.6  7.44  307. 190.   3.39  3.69  16.7     0 0.333  3.56  3.61
## 2  24.6  4.57  132.  91.4  3.86  2.61  19.3     1 0.5    3.86  1.79
means_by(mtcars, mtcars$am)
## # A tibble: 2 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  17.1  6.95  290.  160.  3.29  3.77  18.2 0.368     0  3.21  2.74
## 2  24.4  5.08  144.  127.  4.05  2.41  17.4 0.538     1  4.38  2.92
means_by(mtcars, mtcars$gear)
## # A tibble: 3 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  16.1  7.47  326. 176.   3.13  3.89  17.7 0.2   0         3  2.67
## 2  24.5  4.67  123.  89.5  4.04  2.62  19.0 0.833 0.667     4  2.33
## 3  21.4  6     202. 196.   3.92  2.63  15.6 0.2   1         5  4.4
means_by(mtcars, mtcars$carb)
## # A tibble: 6 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  25.3  4.57  134.   86   3.68  2.49  19.5   1   0.571  3.57     1
## 2  22.4  5.6   208.  117.  3.70  2.86  18.2   0.5 0.4    3.8      2
## 3  16.3  8     276.  180   3.07  3.86  17.7   0   0      3        3
## 4  15.8  7.2   309.  187   3.60  3.90  17.0   0.2 0.3    3.6      4
## 5  19.7  6     145   175   3.62  2.77  15.5   0   1      5        6
## 6  15    8     301   335   3.54  3.57  14.6   0   1      5        8

Ending notes on the tidyverse

Now that we have invented functional programming, we can better appreciate how tidyverse tools leverage functional infrastructure to make nice APIs, even if those APIs feel way less hardcore-functional than the example we just created.

The pipe operator. You may have thought to yourself that composing “leftward” sure is harder to read than piping “rightward” with the pipe operator. First, functional programming hadn’t been invented yet, so you can’t blame me for not knowing about the pipe operator. Second, creating a composition function that reads more “linearly” is easy with one extra step: reversing the direction of the Reduce.

pipe = function(fns) {
    # reverse the order of fns before composing
    compose(rev(fns))
}

Now we can write our means_by function with a more linear recipe that reminds us more of tidyverse code.

means_by = pipe(c(split, partial_apply(means), dplyr::bind_rows))
means_by(mtcars, mtcars$cyl)
## # A tibble: 3 × 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  26.7     4  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
## 2  19.7     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
## 3  15.1     8  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5

Partial functions. The partial_apply function might have been the weirdest-feeling step in the earlier example. But if you squint at it, you realize that this is sort of what the tidyverse does with tidy evaluation. Whenever you pipe a data frame to mutate or filter and so on, and you write expressions on unquoted variables, those arguments are (in a way) creating new partial functions. There is also delayed evaluation of that function: the unquoted expressions are not evaluated on the spot, but instead are translated to create the partial function you actually want. It is that translated/partial function that is actually evaluated when you pass it your data frame.
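
A rough sketch of that idea (a toy of my own, not how the tidyverse actually implements tidy evaluation):

# capture an unquoted expression now;
# evaluate it against a data frame later
filter_with = function(expr) {
    expr = substitute(expr)
    function(df) df[eval(expr, envir = df), ]
}

# a partial function awaiting a data frame
heavy = filter_with(wt > 3.5)
head(heavy(mtcars), 3)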

Footnotes

  1. Most of the time, loops in R are slow because people are doing unwise things with them.↩︎

  2. You may have assumed that \(x\) and \(y\) are numbers and that the operations \(+\) and \(\times\) refer to the addition and multiplication of numbers. It’s fine if you made those assumptions, but it wasn’t necessary. \(X\) and \(Y\) could be other sets containing god-knows-what, and there are many kinds of operations that follow “algebraic” patterns akin to addition and multiplication. But we leave algebraic data types for another day.↩︎

  3. You may have seen a reduction in an operation like max()—for an array of numbers, recursively check to see if left is greater than right, passing the winner to the next iteration as left. In this case, however, the array we are collapsing is full of functions rather than, say, numbers. And the operation we apply as we accumulate is compose_once(left, right) rather than left > right. The result of the reduction is a function that encodes the composition of all functions in fns from left to right.↩︎

  4. The role of an initialization in a reduction is to handle an identity condition: how do we handle “empty” values along the reduction without affecting the output? Stated differently, what value can we use as the initial left so that the first step of the reduction is a no-op? For example, if we reduce an array of numbers by addition, the initialization would be the value 0, because adding 0 to any number gives you the original number. Same with multiplication and the value 1. When we are reducing functions by composing them, we use the identity function. Not surprisingly, the identity function simply returns its arguments unaffected: identity(x) gives you x. For function composition, compose_once(f, identity) is equal to compose_once(identity, f), which is equal to f. Now you’re programming with abstractions!↩︎