Tuesday, July 26, 2016

Reading R code. The function Filter

R is a functional language. It is not a pure one like Haskell, but it is fond of the functional paradigm. As John Chambers has recently reminded us, in R everything that happens is a function call [John Chambers, Extending R, CRC Press]. R functions are first-class citizens and the functional programming style is prominent.

filter, map, and reduce (aka fold) are among the most cherished functions in the toolset of every functional programmer. R provides them too, under the names Filter, Map, and Reduce. Even though the apply family and vectorization are usually preferred, R is kind enough to give these functions to seasoned functional programmers. It couldn't have been otherwise.
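Just to get a feel for them before diving in, here is a tiny interactive session with the three of them (nothing more than a taste; Filter itself is the subject of the rest of this post):

> Filter(function(x) x %% 2 != 0, 1:5)        # keep the odd numbers
[1] 1 3 5
> unlist(Map(function(x, y) x + y, 1:3, 4:6)) # pairwise sums
[1] 5 7 9
> Reduce(`+`, 1:5)                            # fold 1:5 with +
[1] 15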

The source code of base R, currently at /src/library/base/R, contains a file, funprog.R, with the implementation of these functions among others.

In my exploration of the R source code I have to begin with something, and Filter seems a good candidate. It is brief, easy to read, and a sensible choice for a fan of functional programming like me.

What is filter?

filter, as its name suggests, serves the purpose of filtering the elements of a list according to a given function, called a predicate: the elements for which the predicate holds are selected and the rest are discarded. Predicate is the technical term for any function that produces true or false (a boolean value).

Some examples will make this clear.

Let's suppose we have this list of numbers [1, 2, 3, 4, 5], and a function odd that takes a number and determines whether it is odd or not. Selecting all odd numbers in [1, 2, 3, 4, 5] is a matter of filtering them by their oddity

filter odd [1, 2, 3, 4, 5]

will produce [1, 3, 5]

Another example. We want to select words in ["hi", "world", "bye", "hell"] whose first letter is 'h'.

filter start_with_h ["hi", "world", "bye", "hell"]

will produce the desired list ["hi", "hell"]

In general, filter takes a predicate and a list of elements of some type, say type a, and produces a list of elements of the same type. Formally expressed:

filter :: (a -> Bool) [a] -> [a]

where [a] stands for list of elements of type a, and the arrow stands for a function whose arguments are to the left and the result to the right of the arrow. Note that the predicate, the first argument, is also a function that consumes values of type a and produces a boolean.

This type description is called the signature of the function. In some functional programming languages it is customary to apply the so-called currying operation, which translates the evaluation of a function taking multiple arguments into the evaluation of a sequence of functions, each taking a single argument. Under currying, the signature of filter would be:

filter :: (a -> Bool) -> [a] -> [a]

Although this is a mathematically more appealing description, I'll stick here and in what follows to the first uncurried form.
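As an aside, currying is easy to mimic in R with closures. This is only a toy sketch to make the idea of "one argument at a time" concrete, not something the R implementation we'll read does:

# A curried addition: add takes its two arguments one at a time
add <- function(x) function(y) x + y
add(2)(3)        # 5
plus_two <- add(2)
plus_two(3)      # 5 as well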

What about implementing filter? A natural implementation of filter would take the form of a mathematical definition.

Mathematics is full of inductive or recursive definitions. Do you recall the factorial? In math, it is defined as a two-part function:

factorial 0 = 1
factorial n = factorial (n - 1) * n

In words: the factorial of 0 is equal to 1; the factorial of n is equal to the factorial of n - 1, times n.
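Translated almost literally into R, the definition becomes a recursive function (factorial_rec is just a name chosen here to avoid clashing with base R's factorial):

factorial_rec <- function(n) {
  if (n == 0) 1                    # factorial 0 = 1
  else factorial_rec(n - 1) * n    # factorial n = factorial (n - 1) * n
}
factorial_rec(5)  # 120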

Similarly, a math-like definition for filter could be expressed as a two-part function with two branches for the second part:

  • The filter of a predicate and an empty list is the empty list.
  • The filter of a predicate and a non-empty list is
    • either (if the predicate holds for the first element of the given list) the list consisting of that first element and the result of the filter of the predicate and the rest of the given list,
    • or (otherwise) the filter of the predicate and the rest of the given list.

This wording is exact but verbose. Formalizing it a bit makes it easier to read. What follows is the actual Haskell implementation of filter, which hopefully is almost self-explanatory. In fact, the curried signature above is also the real Haskell signature.

filter p [] = []
filter p (x:xs) | p x       = x : filter p xs
                | otherwise = filter p xs

Another possible syntax, now written in friendly Racket, which you may find even more readable, looks as follows [signature included as an initial comment]:

;; (X -> Boolean) (listof X) -> (listof X)
(define (filter p lox)
  (cond [(empty? lox) empty]
        [else
          (if (p (first lox))
              (cons (first lox) (filter p (rest lox)))
              (filter p (rest lox)))))

Filter in R

What about the R implementation? Here it is:

Filter <- 
function(f, x)
{
  ind <- as.logical(unlist(lapply(x, f)))
  x[which(ind)]
}

It looks quite different. Obviously, it is not recursive. This is not surprising: recursion is the main technique in pure functional languages, which are well prepared to handle recursive implementations without performance penalties, while R is not.
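For the sake of contrast, a literal recursive transcription into R could look like this sketch (my_filter_rec is a made-up name; it works on small inputs, but deep recursion in R is slow and limited by the call stack):

my_filter_rec <- function(f, x) {
  if (length(x) == 0) return(x)        # empty vector: nothing to keep
  rest <- my_filter_rec(f, x[-1])      # filter the rest of the vector
  if (isTRUE(f(x[[1]]))) c(x[1], rest) else rest
}
my_filter_rec(function(n) n %% 2 != 0, 1:5)  # 1 3 5

Base R, instead, takes the iterative, vectorized route of the official implementation above.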

Having noted that, let's explore this code in detail.

The very first thing we always have to do when studying a function definition is to be sure about the value(s) that the function consumes and the value that the function produces, that is, the signature of the function. By the way, as you may know, some languages, like Haskell, check signatures at compile time; others, like R or the Racket used above, don't [though Racket has typed variants]. Whatever the case, the signature is critical for programmers and for users, since it tells what kind of values are involved.

The documentation for Filter reveals the assumed signature. Omitting for the moment some nuances and summarizing the 'Arguments' and 'Details' sections, ?Filter says this:

‘Filter’ applies the unary predicate function ‘f’ to each element of ‘x’ [which is a vector] ..., and returns the subset of ‘x’ for which this gives true.

Recall the header of Filter

function(f, x)

The doc says that f, the first argument, is a unary predicate function, meaning a function that takes a single argument (unary) [of any type, say a as before], and produces a boolean (TRUE or FALSE). The signature of the predicate is therefore:

(a -> Bool)

The second argument, x, is in turn a vector rather than a list. A vector in R can be an atomic vector, a list, or an expression. The vector is the main R data type for representing arbitrarily large, possibly nested, collections of objects, and in this regard it is a natural choice to represent what in functional languages is typically represented by lists.
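A quick check in a session of the three kinds of vectors mentioned:

> typeof(1:3)                  # an atomic vector
[1] "integer"
> typeof(list(1, "a"))
[1] "list"
> typeof(expression(x + 1))
[1] "expression"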

Let us symbolize, just by convention, a vector of some type as [a]; that's the type of the second argument of Filter.

The result of Filter is in turn a subset of x, hence again a vector of type [a].

Therefore the whole signature could be specified as follows:

Filter :: (a -> Bool) [a] -> [a]

As is easy to see, this is the same signature as the one we figured out before for filter, except that [a] stands for a list in the former and for a vector (atomic or not) in the latter.

Let's go on with the implementation. In order to understand an implementation I find it really helpful to first try to design our own version.

Before starting off with the implementation itself we should write examples of the function usage, at least as many as the function signature requires, and wrap them as test cases. Examples will guide the implementation and, wrapped as test cases, provide the indispensable testing workhorse. For the latter I will use testthat, as explained in the final part of the first post in this series: https://los-pajaros-de-hogano.blogspot.com.es/2016/07/reading-r-code-introduction.html

So save these examples into the file test_my_filter.R:

source("my_filter.R")

# Some predicates for testing filter
odd <- function(x) { x %% 2 != 0 }
starts_with_h <- function(x) { any(grepl(pattern = "^h.*", x = x)) }
numeric_expression <- function(x) { is.numeric(eval(x)) }

test_that("test my_filter", {
  expect_equal(my_filter(odd, integer(0)), integer(0))
  expect_equal(my_filter(is.atomic, list()), list())
  expect_equal(my_filter(is.expression, expression()), expression())
  expect_equal(my_filter(odd, 1:5), c(1, 3, 5))
  expect_equal(my_filter(starts_with_h, c("hi", "world", "bye", "hell")),
               c("hi", "hell"))
  expect_equal(my_filter(is.atomic, list(1, list(3, 4), 5)), list(1, 5))
  expect_equal(my_filter(numeric_expression, expression(c, "a", 1, 3)),
               expression(1, 3))
})
      

A quick glance at Filter gives us a first hint about the approach we could follow in our own implementation. The last line is a typical subsetting operation. We could try to use this idea to shape our code.

You may figure it out after a bit of thinking. Let's begin with one of our examples:

my_filter(odd, 1:5)

Well, I have something like the vector c(1, 2, 3, 4, 5) and I want to filter odd elements in it. I also have a predicate odd that allows me to determine whether each element there is odd or not.

How should my_filter be implemented to produce the expected c(1, 3, 5)?

It could apply odd to each element in c(1, 2, 3, 4, 5) to get this: c(TRUE, FALSE, TRUE, FALSE, TRUE), and then, in a typical R way, use the logical vector for indexing c(1, 2, 3, 4, 5):

c(1, 2, 3, 4, 5)[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
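Evaluated in a session, this gives exactly the expected result:

> c(1, 2, 3, 4, 5)[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 1 3 5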

I can sketch a general version of this idea. Instead of odd and 1:5 I generalize to whatever predicate, f, and whatever vector, x. Also, and just for readability, I create a local variable, ind, to name the logical vector.

my_filter <-
function(f, x) {
  ind <- # apply f to each element in x
  x[ind]
}

Only one thing remains to be filled in here: the code that applies f to each element in x.

If you have an imperative programming background you will come up with some kind of loop to process each element in the vector and check whether the predicate holds for it. Something along these lines:

apply_f <-
function(f, x) {
  fx <- vector("logical")
  for (i in seq_along(x)) {
    fx <- c(fx, f(x[[i]]))
  }
  fx
}
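With odd defined as in the test file, a quick interactive check shows that the helper produces the logical vector we were after:

> apply_f(odd, 1:5)
[1]  TRUE FALSE  TRUE FALSE  TRUE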

Let's add this helper function to my_filter as a nested (local) function

my_filter <-
function(f, x) {
  apply_f <- 
  function() {
    fx <- vector("logical")
    for (i in seq_along(x)) {
      fx <- c(fx, f(x[[i]]))
    }
    fx
  }

  ind <- apply_f() 
  x[ind]
}

Wait a minute! Shouldn't apply_f be tested before doing anything else, and definitely before converting it into a local (and as such untestable) function? This is a great question. Let's claim that it is not needed because it is too simple a function and we trust it will work without further testing. In a moment we will return to this and see the consequences of that assumption.

It's time to run tests:

> library(testthat)
> test_file("test_my_filter.R")

Great! All tests passed.

However, instead of explicit loops you may know that idiomatic R usually favors the apply family of functions.

The immediate candidate for this task seems to be sapply, which is typically used when we want a vector as the result of applying a function to each element of a given vector x. apply_f could be written with sapply as follows:

apply_f <- 
function(f, x) {
    sapply(x, f)
}

Being so simple, it doesn't make sense to wrap the sapply call in a function. So our more idiomatic and concise filter implementation would be:

my_filter <-
function(f, x) {
    ind <- sapply(x, f) 
    x[ind]
}

Run the tests ... and, oops! One test fails: the case of the empty vector as input.

Looking into ?sapply more carefully, we see that our initial expectation was based on a partial understanding of sapply:

Simplification in ‘sapply’ is only attempted if ‘X’ has length greater than zero ...

It turns out that sapply does not always produce a vector when a vector is given. In other words, our current implementation breaks the signature of my_filter.
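A short session makes the difference plain (odd as defined in the test file):

> sapply(1:5, odd)
[1]  TRUE FALSE  TRUE FALSE  TRUE
> sapply(integer(0), odd)
list()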

We have several alternatives for fixing the bug. One would be to use a conditional with one branch for the empty vector and another for the remaining cases. Another option is to use lapply, which produces a list, and unlist the result to get the vector we need. This second strategy has an advantage: core R code shouldn't rely on user-friendly wrappers like sapply but on more primitive functions. It is not only a matter of style; it is usually a matter of performance (core functions are more efficient), and it is above all a matter of reliability. User-level wrappers exist to help users write code quickly, mainly in interactive settings, but they may change or even become defunct in future releases. In general, we shouldn't depend on such wrappers when we want to write enduring code.

With lapply and unlist we get this solution:

my_filter <-
function(f, x) {
  ind <- unlist(lapply(x, f))
  x[ind]
}

and it passes all tests.

If we compare our last version with the official implementation we'll observe that there are still some differences.

Starting from the end, the official implementation prefers which(ind) to just ind for indexing. This is a subtle divergence. One thing it handles differently from our implementation is NA values. In our implementation, if an NA appears in x, the predicate will typically return NA for it, so an NA appears in ind as well, and if we subset x directly via ind the NA ends up in the result too. However, ?Filter has something to say about this (we just omitted it in the first reading to keep things simple):

Note that possible ‘NA’ values are currently always taken as false.

This implies that elements of x for which the predicate gives NA will be discarded from the result, and this is exactly what which can nicely achieve. which takes a logical object and produces the indices, as integers, of its TRUE values. In other words, for our case, where a logical vector is given as input, it has this signature:

which :: [Bool] -> [Integer]

For instance,

> which(c(TRUE, FALSE, FALSE, TRUE))
[1] 1 4

Additionally, which omits NA values in the logical vector; in other words, it treats them as if they were FALSE. For example:

> which(c(TRUE, FALSE, FALSE, TRUE, NA))
[1] 1 4

and that's precisely the documented behavior that Filter should show.
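Putting the two indexing styles side by side makes the divergence tangible:

> ind <- c(TRUE, FALSE, TRUE, FALSE, NA)
> c(1:4, NA)[ind]          # direct indexing keeps the NA
[1]  1  3 NA
> c(1:4, NA)[which(ind)]   # which() treats the NA as FALSE and drops it
[1] 1 3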

The second difference lies in the first line, the explicit conversion to logical absent from our solution:

as.logical(unlist(lapply(x, f)))

Someone could argue that as.logical is redundant. After all, the predicate f always produces a boolean value, and composing lapply and unlist should produce a logical vector. It should, but it actually doesn't always. Passing an empty vector to unlist(lapply(...)) produces NULL:

> unlist(lapply(integer(0), is.integer))
NULL

Inadvertently, our previous implementation worked. It worked because NULL as an index is treated as integer(0), and indexing an empty vector with integer(0) produces an empty vector of the same type (for vector(), logical by default):

> vector()[integer(0)]
logical(0)

Once we substitute which(ind) for direct indexing, things stop working: which(NULL) raises an error because which requires a logical object as its argument, and NULL is neither a logical value nor coerced by which to logical. That's the reason for introducing as.logical. In general, as.logical makes a lot of sense: it makes sure that the signature of which is never broken.
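A short session shows how as.logical smooths over the empty case:

> as.logical(NULL)
logical(0)
> which(logical(0))
integer(0)
> integer(0)[integer(0)]
integer(0)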

Let's go back to the moment when we hesitated about whether to test the helper function apply_f. In this new light the question appears pretty reasonable. Had we tested apply_f, in other words, had we kept it as a non-local function and verified it independently, we would have caught this bug early on.

For instance, we could have written apply_f as:

apply_f <- 
function(f, x) {
  unlist(lapply(x, f))
}

and added tests for it:

test_that("test apply_f", {
  expect_equal(apply_f(is.integer, integer(0)), logical(0))
  expect_equal(apply_f(odd, 1:5), c(TRUE, FALSE, TRUE, FALSE, TRUE))
  # ... (more tests here)
})

Running the tests would have uncovered the bug at the very beginning. Once apply_f passes all tests we can convert it into a local function, in this case just the single call we have seen in the final definition. The moral is that untested code, which we lazily assume is too easy to need testing, is often the source of unexpected bugs and countless wasted hours of debugging. Better to always test everything that is not absolutely trivial.

To finish this exploration, and for completeness, add this new test case, which verifies the NA-discarding behavior, to the test_my_filter.R file:

expect_equal(my_filter(odd, c(1:4, NA)), c(1, 3))

and confirm that the final implementation with as.logical, which is exactly the R version of Filter, passes all tests.
