R is a functional language, although not a pure one like Haskell; still, R is fond of the functional paradigm. As John Chambers has recently reminded us, in R *everything that happens is a function call* [John Chambers, *Extending R*, CRC Press]. R functions are first-class citizens and the functional programming style is prominent.

`filter`, `map`, and `reduce` (a.k.a. `fold`) are among the most wanted functions in the toolset of every functional programmer. R provides them too, under the names `Filter`, `Map`, and `Reduce`. Even though the *apply* family and vectorization are preferred, R is kind enough to give these functions to seasoned functional programmers. It couldn't have been otherwise.
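A quick taste of the three at the R prompt (illustrative one-liners):

```r
# Filter keeps the elements for which the predicate holds
Filter(function(n) n %% 2 != 0, 1:5)   # 1 3 5

# Map applies a function elementwise and returns a list
Map(function(n) n * 2, 1:3)            # list(2, 4, 6)

# Reduce folds a binary function over a vector
Reduce(`+`, 1:5)                       # 15
```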

The source code of base R, currently at `/src/library/base/R`, contains a file with the implementation of these functions, among others: `funprog.R`.

In my exploration of the R source code I have to begin with something. `Filter` may be a good candidate. It is brief, easy to read, and a sensible choice for a fan of functional programming like me.

### What is filter?

`filter`, as its name suggests, serves the purpose of filtering the elements of a list according to a given function, called a predicate: the elements for which the predicate holds are selected and the rest are discarded. *Predicate* is the technical term for any function that produces true or false (a boolean value).

Some examples will make this clear.

Let's suppose we have the list of numbers `[1, 2, 3, 4, 5]` and a function `odd` that takes a number and determines whether it is odd or not. Selecting all odd numbers in `[1, 2, 3, 4, 5]` is a matter of filtering them by their oddity:

`filter odd [1, 2, 3, 4, 5]`

will produce `[1, 3, 5]`.

Another example. We want to select the words in `["hi", "world", "bye", "hell"]` whose first letter is 'h':

`filter starts_with_h ["hi", "world", "bye", "hell"]`

will produce the desired list `["hi", "hell"]`.

In general, `filter` takes a predicate and a list of elements of some type, say type `a`, and produces a list of elements of the same type. Formally expressed:

`filter :: (a -> Bool) [a] -> [a]`

where `[a]` stands for a list of elements of type `a`, and the arrow stands for a function whose arguments are to the left and whose result is to the right of the arrow. Note that the predicate, the first argument, is also a function: one that consumes values of type `a` and produces a boolean.

This type description is called the *signature* of the function. In some functional programming languages it is customary to apply the so-called *currying* operation, which translates the evaluation of a function taking multiple arguments into the evaluation of a sequence of functions, each taking a single argument. Under currying, the signature of filter would be:

`filter :: (a -> Bool) -> [a] -> [a]`

Although this is a mathematically more appealing description, I'll stick, here and in what follows, to the first, uncurried form.
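For the curious, currying can be sketched in R itself. The helper `curried_filter` below is a hypothetical name introduced just for illustration, not part of base R:

```r
# Curried form: a function that takes the predicate and returns
# a new function that takes the vector: f -> (x -> result)
curried_filter <- function(f) {
    function(x) Filter(f, x)
}

keep_odd <- curried_filter(function(n) n %% 2 != 0)
keep_odd(1:5)   # 1 3 5
```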

What about implementing `filter`? A natural implementation of `filter` would take the form of a mathematical definition.

Mathematics is full of inductive or recursive definitions. Do you recall the factorial? In math, it is defined as a two-part function:

```
factorial 0 = 1
factorial n = factorial (n - 1) * n
```

In words, the factorial of 0 is equal to 1, the factorial of *n* is equal to the factorial of *n* - 1 times *n*.
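The same two-part definition translates directly into a recursive R function (a sketch for illustration; base R's own `factorial` is actually computed via the gamma function):

```r
fact <- function(n) {
    if (n == 0) 1          # factorial 0 = 1
    else fact(n - 1) * n   # factorial n = factorial (n - 1) * n
}

fact(5)   # 120
```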

Similarly, a math-like definition of `filter` could be expressed as a two-part function, with two branches for the second part:

- The filter of a predicate and an empty list is the empty list.
- The filter of a predicate and a non-empty list is
  - either (if the predicate holds for the first element of the given list) the list consisting of that first element followed by the filter of the predicate and the rest of the given list,
  - or (otherwise) the filter of the predicate and the rest of the given list.

This wording is exact but verbose. Formalizing it a bit makes it easier to read. What follows is the actual Haskell implementation of `filter`, which hopefully is almost self-explanatory. Actually, the signature given above is also the real Haskell signature.

```
filter p [] = []
filter p (x:xs) | p x       = x : filter p xs
                | otherwise = filter p xs
```

Another possible syntax, now written in friendly Racket, which you may find even more readable, looks as follows [signature included as an initial comment]:

```
;; (X -> Boolean) (listof X) -> (listof X)
(define (filter p lox)
  (cond [(empty? lox) empty]
        [else
         (if (p (first lox))
             (cons (first lox) (filter p (rest lox)))
             (filter p (rest lox)))]))
```

### Filter in R

What about the R implementation? Here it is:

```
Filter <-
    function(f, x)
{
    ind <- as.logical(unlist(lapply(x, f)))
    x[which(ind)]
}
```

It looks quite different. Obviously, it is not recursive. This is not surprising: recursion is the main technique in pure functional languages, which are well prepared to handle recursive implementations without performance penalties, and R is not one of them.
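For contrast, the recursive definition from the previous section can be mimicked in R. The function `rec_filter` below is an illustrative sketch for atomic vectors, not how base R implements `Filter`:

```r
rec_filter <- function(p, x) {
    if (length(x) == 0) return(x)   # empty vector: nothing to keep
    first <- x[[1]]
    rest  <- x[-1]
    if (p(first))
        c(first, rec_filter(p, rest))   # keep the first element
    else
        rec_filter(p, rest)             # discard it
}

rec_filter(function(n) n %% 2 != 0, 1:5)   # 1 3 5
```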

Having noted that, let's explore this code in detail.

The very first thing we always have to do when studying a function definition is to be sure about the value(s) that the function consumes and the value that it produces: the signature of the function. By the way, as you may know, some languages, like Haskell, check signatures at compile time; others, like R or the Racket version above, don't [though Racket has typed variants]. Whatever the case, the signature is critical for programmers and for users, since it tells what kinds of values are involved.

The documentation for `Filter` reveals the assumed signature. Omitting for the moment some nuances and summarizing the 'Arguments' and 'Details' sections, `?Filter` says this:

‘Filter’ applies the unary predicate function ‘f’ to each element of ‘x’ [which is a vector] ..., and returns the subset of ‘x’ for which this gives true.

Recall the header of `Filter`:

`function(f, x)`

The doc says that `f`, the first argument, is a *unary predicate function*, meaning a function that takes a single argument (unary) [of any type, say `a` as before] and produces a boolean (`TRUE` or `FALSE`). The signature of the predicate is therefore:

`(a -> Bool)`

The second argument, `x`, is in turn a vector, instead of a list. A vector in R can be an atomic vector, a list, or an expression. The vector is the main R data type for representing arbitrarily large collections of objects, possibly nested, and in this regard it is a natural choice for what in functional languages is typically represented by lists.

Let us symbolize, just by convention, a vector of some type with `[a]`; that's the type of the second argument of `Filter`.
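Since all three kinds of vectors are accepted, `Filter` works uniformly across them; a quick check at the prompt:

```r
# An atomic vector, a list, and an expression are all vectors in R
Filter(is.numeric, c(1, 2, 3))             # atomic vector in, atomic vector out
Filter(is.numeric, list(1, "a", 2))        # list in, list out
Filter(is.numeric, expression(1, "a", 2))  # expression in, expression out
```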

The result of `Filter` is in turn a subset of `x`, hence again a vector of type `[a]`.

Therefore the whole signature could be specified as follows:

`Filter :: (a -> Bool) [a] -> [a]`

As is easy to see, this is the same signature as the `filter` we figured out before, except that `[a]` stands for a list in the former and a vector (atomic or not) in the latter.

Let's go on with the implementation. In order to understand an implementation, I find it really helpful to first try to design our own version.

Before starting off with the implementation itself we should write examples of the function's usage, at least as many as the function signature requires, and wrap them as test cases. Examples will guide the implementation and, at the same time, wrapped as test cases, provide the indispensable testing workhorse. For the latter I will use `testthat`, as explained in the final part of the first post in this series: https://los-pajaros-de-hogano.blogspot.com.es/2016/07/reading-r-code-introduction.html

So save these examples into the file `test_my_filter.R`:

```
source("my_filter.R")

# Some predicates for testing filter
odd <- function(x) { x %% 2 != 0 }
starts_with_h <- function(x) { any(grepl(pattern = "^h.*", x = x)) }
numeric_expression <- function(x) { is.numeric(eval(x)) }

test_that("test my_filter", {
    expect_equal(my_filter(odd, integer(0)), integer(0))
    expect_equal(my_filter(is.atomic, list()), list())
    expect_equal(my_filter(is.expression, expression()), expression())
    expect_equal(my_filter(odd, 1:5), c(1, 3, 5))
    expect_equal(my_filter(starts_with_h, c("hi", "world", "bye", "hell")),
                 c("hi", "hell"))
    expect_equal(my_filter(is.atomic, list(1, list(3, 4), 5)), list(1, 5))
    expect_equal(my_filter(numeric_expression, expression(c, "a", 1, 3)),
                 expression(1, 3))
})
```

A quick glance at `Filter` gives us a first hint as to the route we could follow in the implementation. The last line is a typical subsetting operation. We could try to use this idea to delineate our code.

You may figure it out after a bit of thinking. Let's begin with one of our examples:

`my_filter(odd, 1:5)`

Well, I have something like the vector `c(1, 2, 3, 4, 5)` and I want to filter the odd elements in it. I also have a predicate, `odd`, that allows me to determine whether each element is odd or not.

How should `my_filter` be implemented to produce the expected `c(1, 3, 5)`?

It could apply `odd` to each element in `c(1, 2, 3, 4, 5)` to get this: `c(TRUE, FALSE, TRUE, FALSE, TRUE)`, and then, in a typical R way, use the logical vector for indexing `c(1, 2, 3, 4, 5)`:

`c(1, 2, 3, 4, 5)[c(TRUE, FALSE, TRUE, FALSE, TRUE)]`

I can sketch a general version of this idea. Instead of `odd` and `1:5` I generalize to whatever predicate, `f`, and whatever vector, `x`. Also, just for readability, I create a local variable, `ind`, to name the logical vector.

```
my_filter <-
    function(f, x) {
        ind <- # apply f to each element in x
        x[ind]
    }
```

Only one thing remains to be filled in here: the code that applies `f` to each element in `x`.

If you have an imperative programming background you will come up with some kind of loop that processes each element of the vector and checks whether the predicate holds for it. Something along these lines:

```
apply_f <-
    function(f, x) {
        fx <- vector("logical")
        for (i in seq_along(x)) {
            fx <- c(fx, f(x[[i]]))
        }
        fx
    }
```

Let's add this helper function to `my_filter` as a nested (local) function:

```
my_filter <-
    function(f, x) {
        apply_f <-
            function() {
                fx <- vector("logical")
                for (i in seq_along(x)) {
                    fx <- c(fx, f(x[[i]]))
                }
                fx
            }
        ind <- apply_f()
        x[ind]
    }
```

Wait a minute! Shouldn't `apply_f` be tested before doing anything else? And definitely before converting it into a local (and as such untestable) function? This is a great question. Let's claim that testing is not needed because the function is so easy that we trust it will work without further checks. In a moment we will return to this and see the consequences of that assumption.

It's time to run tests:

```
> library(testthat)
> test_file("test_my_filter.R")
```

Great! All tests passed.

However, instead of explicit loops you may know that idiomatic R usually favors the *apply* family of functions.

The immediate candidate for this task seems to be `sapply`, which is typically used when we want a vector as the result of applying a function to each element of a given vector `x`. `apply_f` could be written with `sapply` as follows:

```
apply_f <-
    function(f, x) {
        sapply(x, f)
    }
```

Being so simple, it doesn't make sense to wrap the `sapply` call in a function. So our more idiomatic and concise filter implementation would be:

```
my_filter <-
    function(f, x) {
        ind <- sapply(x, f)
        x[ind]
    }
```

Run the tests ... and, oops! One test fails: the case of the empty vector as input.

Looking into `?sapply` more carefully, we see that our initial expectation was based on a partial understanding of `sapply`:

Simplification in ‘sapply’ is only attempted if ‘X’ has length greater than zero ...

It turns out that `sapply` does not always produce a vector when a vector is given. In other words, our current implementation breaks the signature of `my_filter`.
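We can see the problem directly at the prompt; with zero-length input, `sapply` attempts no simplification and falls back to returning a list:

```r
sapply(1:5, function(n) n %% 2 != 0)          # TRUE FALSE TRUE FALSE TRUE
sapply(integer(0), function(n) n %% 2 != 0)   # list(), not logical(0)
```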

We have different alternatives for fixing the bug. One would be to use a conditional with one branch for the empty vector and another for the remaining cases. Another option is to use `lapply`, which produces a list instead of a vector, and unlist the result to get the vector we need. This second strategy has an advantage: core R code shouldn't use user-friendly wrappers like `sapply` but primitive functions. It is not only a matter of style; it is usually a matter of performance (core functions are more efficient), and it is above all a matter of reliability. User wrappers are there to help users write quickly, mainly in an interactive setting, but they may change or even become defunct in future releases. In general, we shouldn't trust user wrapper functions when we want to write enduring code.

With `lapply` and `unlist` we get this solution:

```
my_filter <-
    function(f, x) {
        ind <- unlist(lapply(x, f))
        x[ind]
    }
```

and it passes all tests.

If we compare our last version with the official implementation we'll observe that there are still some differences.

Starting from the end, the official implementation prefers `which(ind)` to just `ind` for indexing. This is a subtle divergence. One thing it handles differently from our implementation is `NA` values. In our implementation, if an `NA` appears in `x`, it will also appear as `NA` in `ind`, and if we subset `x` directly via `ind`, the `NA` ends up in the result too. However, `?Filter` has something to say about this (we just omitted it in the first reading to keep things simple):

Note that possible ‘NA’ values are currently always taken as false.

This implies that `NA` values in `x` will be discarded from the result, whatever `f` and `x` are, and this is exactly what `which` can nicely achieve. `which` takes a logical object and produces the indices, as integers, of the `TRUE` values. In other words, for our case, where a vector is given as input, it has this signature:

`which :: [Bool] -> [Integer]`

For instance,

```
> which(c(TRUE, FALSE, FALSE, TRUE))
[1] 1 4
```

Additionally, `which` omits `NA` values in the logical vector; in other words, it treats them as if they were `FALSE`. For example:

```
> which(c(TRUE, FALSE, FALSE, TRUE, NA))
[1] 1 4
```

and that's precisely the documented behavior that `Filter` should show.

The second difference lies in the first line: the explicit conversion to logical, absent from our solution:

`as.logical(unlist(lapply(x, f)))`

Someone could argue that `as.logical` is redundant. After all, the predicate `f` always produces a boolean value, so the composition of `lapply` and `unlist` should produce a logical vector. It should, but it actually doesn't always. Passing an empty vector to `unlist(lapply(...))` produces `NULL`:

```
> unlist(lapply(integer(0), is.integer))
NULL
```

Inadvertently, our previous implementation worked, and it worked because when `NULL` is used as an index it is treated as `integer(0)`, and indexing an empty vector with `integer(0)` produces an empty vector, logical by default:

```
> vector()[integer(0)]
logical(0)
```

Once we substitute `which(ind)` for direct indexing, things stop working: `which(NULL)` raises an error because `which` requires a logical object as its argument, and `NULL` is neither a logical value nor converted by `which` to logical. That's the reason for introducing `as.logical`. In general, `as.logical` makes a lot of sense by making sure that the signature of `which` will never be broken.
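The empty-vector corner case shows the fix at work: `as.logical` turns the `NULL` into `logical(0)`, which `which` then handles gracefully:

```r
as.logical(NULL)          # logical(0)
which(as.logical(NULL))   # integer(0), a valid (empty) index
integer(0)[integer(0)]    # integer(0): filtering an empty vector
```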

Let's go back to the moment when we hesitated about whether to test the helper function `apply_f`. In this new light, the question appears pretty reasonable. Had we tested `apply_f`, in other words, had we left it as a non-local function and verified it independently, we would have caught this bug sooner.

For instance, we could have written `apply_f` as:

```
apply_f <-
    function(f, x) {
        unlist(lapply(x, f))
    }
```

and added tests for it:

```
test_that("test apply_f", {
    expect_equal(apply_f(is.integer, integer(0)), logical(0))
    expect_equal(apply_f(odd, 1:5), c(TRUE, FALSE, TRUE, FALSE, TRUE))
    # ... (more tests here)
})
```

Running the tests would have uncovered the bug at the very beginning. Once `apply_f` passes all the tests, we can convert it into a local function, in this case just the call we have seen in the final definition. The moral is that untested code, which we lazily assume to be too easy to fail, is often the source of unexpected bugs and countless wasted hours of debugging. Better to always test everything that is not absolutely trivial.

To finish this exploration, and for completeness, just add this new test case, which verifies the `NA`-discarding behavior, to the `test_my_filter.R` file:

`expect_equal(my_filter(odd, c(1:4, NA)), c(1, 3))`

and confirm that the final implementation with `as.logical`, which is exactly the R version of `Filter`, passes all tests.
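A short session contrasting direct logical indexing with the `which`-based subsetting shows the `NA` behavior one last time:

```r
odd <- function(n) n %% 2 != 0
x <- c(1:4, NA)
ind <- as.logical(unlist(lapply(x, odd)))   # TRUE FALSE TRUE FALSE NA

x[ind]          # 1 3 NA -- direct indexing lets the NA leak through
x[which(ind)]   # 1 3    -- which() drops the NA, as Filter does
```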
