Sunday, July 31, 2016

Reading R Code. The function Reduce. Folds

Among the essential higher-order functions in functional programming (we have already seen map and filter), reduce (a.k.a. fold) is probably the most amazing and powerful. It is a sort of Swiss Army knife in the functional programmer's hands. I highly recommend this excellent exposition on the power and flexibility of fold:

http://www.cs.nott.ac.uk/~pszgmh/fold.pdf

What is fold?

Although fold deserves, and probably requires, several blog posts, I'll try an elementary presentation suitable for our purposes.

fold is essentially the functional abstraction that corresponds to the very structure of lists.

A list of values of whatever type, say X, is either the empty list or a list constructed from an X and a list of Xs, where by construction I mean the operation of consing an X onto a list of Xs. For instance (in Lisp):

[1]> (cons 1 (list 2 3))
(1 2 3)

or in Haskell, where : is the Haskell equivalent to the Lispier cons:

Prelude> 1 : [2, 3]
[1,2,3]

Note the self-referential nature of the type description for lists. Any function over lists relies on this self-referential structure and has a natural form in which the recursive call emanates naturally from the type description. So the sketch or template of a function over lists of X looks like this [written now in Racket]:

(define (f-for-lox lox)
  (cond [(empty? lox) (...)]
        [else
          (... (first lox)
               (f-for-lox (rest lox)))]))
       

On the other hand, for reasons of stack-space efficiency, functions over lists can be written in such a way that the recursive call is in tail position (a position that ensures it is the last call in the procedure). Observe that the recursive call to f-for-lox in the template above is not in tail position; instead, whatever function ends up in place of the last three dots in that template will be in tail position.

For a better understanding, I heartily recommend watching these excellent videos, on which this exposition is mostly based:

  1. https://www.youtube.com/watch?v=Z7qXdkt7yOE
  2. https://www.youtube.com/watch?v=nx8BALSH3nY
  3. https://www.youtube.com/watch?v=fMqbMiGuQ9c
  4. https://www.youtube.com/watch?v=6NyPb8jQROs

Such functions, where the recursive call is in tail position, are known as tail-recursive functions, and they are normally written with the aid of a local function that carries an accumulator. A tail-recursive template for lists has this form:

(define (f-for-lox lox0)
  (local [(define (f-for-lox-acc acc lox)
            (cond [(empty? lox) (... acc)]
                  [else
                    (f-for-lox-acc (... acc (first lox))
                                   (rest lox))]))]
    (f-for-lox-acc ... lox0)))

Or this one, if we change the order of components in the recursive call:

(define (f-for-lox lox0)
  (local [(define (f-for-lox-acc acc lox)
            (cond [(empty? lox) (... acc)]
                  [else
                    (f-for-lox-acc (... (first lox) acc)
                                   (rest lox))]))]
    (f-for-lox-acc ... lox0)))

fold is the function that captures these two abstractions. Accordingly, there are two possible fold functions: fold-right, which captures the structural recursive procedure, and fold-left, which captures the tail-recursive one and comes in two variants.

Let's see this. If we fill those templates with more meaningful placeholders, i standing for the initial value and f standing for a function that combines the contribution of the first element with the contribution of the recursion, and we add them as parameters to the functions, we get this function for the natural recursive template:

(define (fold-right f i lox)
  (cond [(empty? lox) i]
        [else
          (f (first lox)
             (fold-right f i (rest lox)))]))

and these two variants for the two tail-recursive templates:

(define (fold-left-1 f i lox0)
  (local [(define (f-for-lox-acc acc lox)
            (cond [(empty? lox) acc]
                  [else
                    (f-for-lox-acc (f acc (first lox))
                                   (rest lox))]))]
    (f-for-lox-acc i lox0)))


(define (fold-left-2 f i lox0)
  (local [(define (f-for-lox-acc acc lox)
            (cond [(empty? lox) acc]
                  [else
                    (f-for-lox-acc (f (first lox) acc)
                                   (rest lox))]))]
    (f-for-lox-acc i lox0)))

which are precisely fold-right and the two variants of fold-left, respectively.

A careful analysis of these functions enables us to infer their respective signatures.

;; fold-right                :: (X Y -> Y) Y (listof X) -> Y
;; fold-left-1 (1st variant) :: (Y X -> Y) Y (listof X) -> Y
;; fold-left-2 (2nd variant) :: (X Y -> Y) Y (listof X) -> Y

Note, in particular, that fold-right and the second variant of fold-left share the same signature, while the signature for the first variant of fold-left differs due to the different order of arguments in the call to f.

fold-right is basically the same in all functional programming languages, whereas for fold-left different languages pick either the first variant (Lisp, Haskell, OCaml, ...) or the second one (SML, Racket, ...) for their implementation of this function.

The difference between the two variants of fold-left has an impact in many cases. Just choose for f a function for which the order of operands matters (a non-commutative function). For instance, if f is + the order doesn't matter, but if it is - it absolutely does. Let's try - in all the languages we have mentioned:

# Lisp [reduce is fold-left by default]
[1]> (reduce #'- '(1 2 3 4) :initial-value 0)
-10

# Haskell
Prelude> foldl (-) 0 [1, 2, 3, 4]
-10

# OCaml
# List.fold_left (-) 0 [1; 2; 3; 4] ;;
- : int = -10

While:

# Racket
> (foldl - 0 '(1 2 3 4))
2

# SML
- List.foldl op- 0 [1, 2, 3, 4] ;
val it = 2 : int
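
As a quick preview of the next post, R's Reduce, given an initial value, behaves here like the first group of languages (a minimal check; try it on your console):

# R
> Reduce(`-`, 1:4, 0)
[1] -10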

[R's Reduce will be the subject of the next post.]

Thursday, July 28, 2016

Reading R code. The functions Map and mapply

In this post I'm going to consider the implementation of the function Map in R. As in the previous post in the series, I first introduce the typical map of functional languages. Then I go into R's Map implementation. Along the way I explore some other R functions to the extent needed for a basic understanding of the implementation.

What is map?

map applies a function to each element in a given list to get a new list with the result of the application. For instance:

map sqrt [1, 2, 3]

produces a list of the square roots of each number:

[1.0,1.4142135623730951,1.7320508075688772]

In general, map takes a function and a list of elements of some type, say a, and produces a list of elements of some other type, say b, where a and b are the types of, respectively, the input and output values of the function that map takes. The signature for the example above could be written as:

(Num -> Double) [Num] -> [Double]

meaning that the first argument, sqrt, takes a number (in general) and produces a double-precision floating point number; the second argument is a list of numbers, and the result a list of Doubles.

So in general map has this signature:

(a -> b) [a] -> [b]

As with filter, a natural implementation of map is a math-like recursive definition:

map f []     = []
map f (x:xs) = f x : map f xs

Or, in Racket:

(define (map f lox)
  (cond [(empty? lox) empty]
        [else
          (cons (f (first lox))
                (map f (rest lox)))]))

This is the basic usage. The function map can usually handle multiple lists too. One of the variants works as follows:

It applies the function to the first elements of the given lists, then to the second ones, and so on, until the last element of the shortest list is processed; the remaining elements of the longer lists are ignored, and the result is the list produced by the successive applications. Here is a possible example with two lists as input. I use the name zipWith to refer to this kind of map that takes two lists:

zipWith (+) [1, 2, 3] [4, 5, 6, 7]

produces:

[5, 7, 9] -- so [1 + 4, 2 + 5, 3 + 6]

The signature would be:

zipWith :: (a b -> c) [a] [b] -> [c]

Generalizations to cope with an arbitrary number of lists are also available in functional languages.

Map in R

The R Map function provides, as ?Map states, a generalization of the sort described:

‘Map’ applies a function to the corresponding elements of given vectors.

(Note that, like Filter, the consumed objects are R vectors.)

However, unlike the generalized map mentioned, Map doesn't discard remaining elements of longer lists. Instead, it uses recycling:

‘Map’ ... is similar to Common Lisp's ‘mapcar’ (with arguments being recycled, however)

Common Lisp's mapcar is a function that behaves as explained above (the Haskell zipWith example). Indeed, executing the corresponding Lisp code on the clisp interpreter gives the same result:

[1]> (mapcar #'+ '(1 2 3) '(4 5 6 7))
(5 7 9)

while in R recycling is at work:

> Map(`+`, 1:3, 4:7)
[[1]]
[1] 5

[[2]]
[1] 7

[[3]]
[1] 9

[[4]]
[1] 8

As you can see, when the shortest list is exhausted, R recycles it as needed: the last element is the result of 1 + 7, where 1 comes from recycling the shorter list. Note also that the result is not an atomic vector, as one might expect, but a list. This is also documented:

‘Map’ ... does not attempt to simplify the result ... Future versions may allow some control of the result type.

As for the implementation ?Map also tells us what it is:

‘Map’ is a simple wrapper to ‘mapply’.

It is not uncommon in R to find traces of implementation details in its docs.

The actual implementation expresses in code all of these points:

function (f, ...) 
{
  f <- match.fun(f)
  mapply(FUN = f, ..., SIMPLIFY = FALSE)
}

Let's have a closer look into the definition.

The function takes a first argument f, the mapping function to be applied, and an undetermined number of extra arguments. The three-dots construct allows Map to capture those extra arguments and pass them on later to mapply. The vectors to which f will be applied are among the arguments that the caller will pass; the caller might also pass further arguments, which mapply can receive.

In the first line, the function passed as the first argument is extracted and saved into a local variable f; if it cannot be interpreted in any way as a function, an error is raised. That's basically what match.fun does.
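
A quick interactive check of match.fun's behavior (a minimal sketch; the error case is only indicated, not shown verbatim):

> f <- match.fun("sum")   # a character name is resolved to the actual function
> f(1:3)
[1] 6
> match.fun(42)           # neither a function nor a name: this raises an error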

The function mapply

Once this first argument has been checked, it is passed as the FUN argument to mapply, which in turn calls the underlying C code to carry out the computation. Additionally, Map sets the SIMPLIFY argument of mapply to FALSE to avoid the simplification, as documented.

We can see this better if we take a look at the implementation of mapply:

function(FUN,..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
{
  FUN <- match.fun(FUN)
  dots <- list(...)

  answer <- .Internal(mapply(FUN, dots, MoreArgs))

  if (USE.NAMES && length(dots)) {
    if (is.null(names1 <- names(dots[[1L]])) && is.character(dots[[1L]]))
      names(answer) <- dots[[1L]]
    else if (!is.null(names1))
      names(answer) <- names1
  }
  if(!identical(SIMPLIFY, FALSE) && length(answer))
    simplify2array(answer, higher = (SIMPLIFY == "array"))
  else answer
}

Note that answer, the returned value, is the result produced by the internal function, written in C, which is invoked via the call to .Internal. We will come across this construct over and over while reading R base functions. Many R base functions just wrap the call to an underlying internal function, possibly preparing things for it, and/or adapting its result according to the arguments passed in the first place.

Map sets mapply's SIMPLIFY to FALSE so that the simplification that mapply could do otherwise will never be executed.
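
The effect of SIMPLIFY is easy to observe at the console (a minimal check):

> mapply(`+`, 1:3, 4:6)                     # SIMPLIFY = TRUE, the default: an atomic vector
[1] 5 7 9
> mapply(`+`, 1:3, 4:6, SIMPLIFY = FALSE)   # what Map requests: always a list
[[1]]
[1] 5

[[2]]
[1] 7

[[3]]
[1] 9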

However, although the intended arguments for the three dots are just the vectors, as ?Map documents, nothing prevents us from passing the other possible mapply arguments, USE.NAMES and MoreArgs.

MoreArgs allows for passing extra arguments to the f function. For instance, to get a list of vectors where 42 is repeated from 1 to 4 times, we can use MoreArgs in this way [example borrowed from the mapply doc]:

> Map(rep, times = 1:4, MoreArgs = list(x = 42))
[[1]]
[1] 42

[[2]]
[1] 42 42

[[3]]
[1] 42 42 42

[[4]]
[1] 42 42 42 42

As R allows us to pass extra arguments simply by naming them in the call,

> Map(rep, times = 1:4, x = 42)

MoreArgs seems to be of little use in this regard.

Furthermore, one can always resort to a commonly used idiom in functional programming: supplying an anonymous function in place of f:

> Map(function(x) rep(42, times = x), 1:4)

The other extra option that we may pass to Map is USE.NAMES. mapply sets it to TRUE by default. In such a setting, and if more arguments apart from the initial function f are passed (length(dots) is not 0), this code in mapply will be executed:

if (USE.NAMES && length(dots)) {
  if (is.null(names1 <- names(dots[[1L]])) && is.character(dots[[1L]])) 
    names(answer) <- dots[[1L]]
  else if (!is.null(names1)) 
    names(answer) <- names1
}

Two cases are especially handled:

  • The second argument passed to mapply (dots[[1L]]) doesn't have names but it is of character type. (Recall that the first argument is always the function f).
  • The second argument passed to mapply has names.

In the first case, the character object is used to set the names of the result of mapply, as in this example taken from ?mapply:

> mapply(function(C, k) paste0(rep.int(C, k)), LETTERS[1:6], 6:1)

In the second case, the names of the argument become the names of the result. For instance [again from ?mapply]:

> mapply(function(x, y) seq_len(x) + y,
+        c(a =  1, b = 2, c = 3),  # names from first
+        c(A = 10, B = 0, C = -10))

This is put simply in the doc:

use names if the first ... argument has names, or if it is a character vector, use that character vector as the names.

If we want no names handling in Map, we can just set USE.NAMES to FALSE, overriding the default behavior of mapply or, viewed from the implementation, falling back to the default behavior of the internal C code.
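
For instance, a minimal check of the effect (mapply applies toupper to each element):

> mapply(toupper, c("a", "b"))                      # names taken from the character vector
  a   b 
"A" "B" 
> mapply(toupper, c("a", "b"), USE.NAMES = FALSE)   # no names handling
[1] "A" "B"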

A couple of points about R idioms that we see in mapply:

Assignments can be sub-expressions. Here:

is.null(names1 <- names(dots[[1L]]))

The assignment expression names1 <- names(dots[[1L]]) evaluates to the value of names1 once it has been assigned. That value is passed to is.null as part of the if condition, and the value of names1 is re-used later in the last branch.
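
A tiny illustration of the idiom, with made-up values:

> x <- c(a = 1, b = 2)
> if (!is.null(nms <- names(x))) nms   # nms is assigned and tested in one expression
[1] "a" "b"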

Secondly, in a logical context the number 0 evaluates to FALSE, and any other number evaluates to TRUE. This allows for concise conditions like the previous one:

if (USE.NAMES && length(dots))

The second part of the condition checks whether dots is empty or not; hence, there is no need to write length(dots) != 0. This is a common idiom in many languages that support boolean evaluation of different classes of objects.
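
Again, a quick console check (a minimal sketch):

> if (length(list())) "non-empty" else "empty"
[1] "empty"
> if (length(list(1))) "non-empty" else "empty"
[1] "non-empty"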

To complete a basic understanding of these functions I should inspect match.fun and simplify2array (the function responsible for simplifying the result) more closely. match.fun appears frequently in R code, so I pin it to the top of my task list; simplify2array is a helper function currently used only by mapply and sapply, so we can live without studying it for the time being.

Tuesday, July 26, 2016

Reading R code. The function Filter

R is a functional language. Not a pure one like Haskell, but R is certainly fond of the functional paradigm. As John Chambers has recently reminded us, in R everything that happens is a function call [John Chambers, Extending R, CRC Press]. R functions are first-class citizens and the functional programming style is prominent.

filter, map, and reduce (a.k.a. fold) are among the most wanted functions in the toolset of every functional programmer. R provides them too, under the names Filter, Map, and Reduce. Even though the apply family and vectorization are usually preferred, R is kind enough to give these functions to seasoned functional programmers. It couldn't have been otherwise.

The source code of base R, currently at /src/library/base/R, contains a file, funprog.R, with the implementation of these functions, among others.

In my exploration of the R source code I have to begin with something, and Filter may be a good candidate. It is brief, easy to read, and a sensible choice for a fan of functional programming like me.

What is filter?

filter, as its name suggests, serves the purpose of filtering the elements of a list according to a certain given function, called a predicate: the elements of the list for which the predicate holds are selected, and the rest are discarded. (Predicate is the technical term for any function that produces true or false, a boolean value.)

Some examples will make this clear.

Let's suppose we have this list of numbers, [1, 2, 3, 4, 5], and a function odd that takes a number and determines whether it is odd or not. Selecting all odd numbers in [1, 2, 3, 4, 5] is a matter of filtering them by their oddness:

filter odd [1, 2, 3, 4, 5]

will produce [1, 3, 5]

Another example. We want to select words in ["hi", "world", "bye", "hell"] whose first letter is 'h'.

filter start_with_h ["hi", "world", "bye", "hell"]

will produce the desired list ["hi", "hell"]

In general, filter takes a predicate and a list of elements of some type, say type a, and produces a list of elements of the same type. Formally expressed:

filter :: (a -> Bool) [a] -> [a]

where [a] stands for list of elements of type a, and the arrow stands for a function whose arguments are to the left and the result to the right of the arrow. Note that the predicate, the first argument, is also a function that consumes values of type a and produces a boolean.

This type description is called the signature of the function. In some functional programming languages it is customary to apply the so-called currying operation, which translates the evaluation of a function taking multiple arguments into the evaluation of a sequence of functions, each taking a single argument. Under currying, the signature of filter would be:

filter :: (a -> Bool) -> [a] -> [a]

Although this is a mathematically more appealing description, I'll stick here and in what follows to the first, uncurried form.
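
Incidentally, currying is easy to emulate in R with closures. A hypothetical sketch (my_curried_filter is made up for illustration, using R's Filter, which we will meet below):

# A function of one argument that returns a function of one argument
my_curried_filter <- function(p) {
  function(xs) Filter(p, xs)
}

> odd_only <- my_curried_filter(function(x) x %% 2 != 0)
> odd_only(1:5)
[1] 1 3 5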

What about implementing filter? A natural implementation would take the form of a mathematical definition.

Mathematics is full of inductive or recursive definitions. Do you recall factorial? In Math, it is defined as a two-part function:

factorial 0 = 1
factorial n = factorial (n - 1) * n

In words: the factorial of 0 is equal to 1; the factorial of n is equal to the factorial of n - 1, times n.

Similarly, a math-like definition for filter could be expressed as a two-part function with two branches for the second part:

  • The filter of a predicate and an empty list is the empty list.
  • The filter of a predicate and a non-empty list is
    • either (if the predicate holds for the first element of the given list) the list consisting of that first element and the result of the filter of the predicate and the rest of the given list,
    • or (otherwise) the filter of the predicate and the rest of the given list.

This wording is exact but verbose. Formalizing it a bit makes it easier to read. What follows is the actual Haskell implementation of filter, which hopefully is almost self-explanatory. (The curried signature given before is also the real Haskell signature.)

filter p [] = []
filter p (x:xs) | p x       = x : filter p xs
                | otherwise = filter p xs

Another possible syntax, now written in friendly Racket, which you may find even more readable, looks as follows [signature included as an initial comment]:

;; (X -> Boolean) (listof X) -> (listof X)
(define (filter p lox)
  (cond [(empty? lox) empty]
        [else
          (if (p (first lox))
              (cons (first lox) (filter p (rest lox)))
              (filter p (rest lox)))]))

Filter in R

What about the R implementation? Here it is:

Filter <- 
function(f, x)
{
  ind <- as.logical(unlist(lapply(x, f)))
  x[which(ind)]
}

It looks quite different. Obviously, it is not recursive. This is not surprising: recursion is the main technique in pure functional languages, which are well prepared to handle recursive implementations without performance penalties; R is not so well prepared.

Having noted that, let's explore this code in detail.

The very first thing we always have to do when studying a function definition is to be sure about the value(s) that the function consumes and the value that it produces: the signature of the function. By the way, as you may know, some languages check signatures at compile time, like Haskell; others don't, like R or the Racket version above [though Racket has typed variants]. Whatever the case, the signature is critical for programmers and for users, since it tells what kinds of values are involved.

The documentation for Filter reveals the assumed signature. Omitting for the moment some nuances and summarizing the 'Arguments' and 'Details' sections, ?Filter says this:

‘Filter’ applies the unary predicate function ‘f’ to each element of ‘x’ [which is a vector] ..., and returns the subset of ‘x’ for which this gives true.

Recall the header of Filter

function(f, x)

The doc says that f, the first argument, is a unary predicate function, meaning a function that takes a single argument (unary) [of any type, say a as before] and produces a boolean (TRUE or FALSE). The signature of the predicate is therefore:

(a -> Bool)

The second argument, x, is in turn a vector, instead of a list. A vector in R can be an atomic vector, a list, or an expression. Vector is the main R data type for representing arbitrarily large collections of objects, possibly nested, and in this regard it is a natural choice to represent what in functional languages is typically represented by lists.

Let us symbolize, just by convention, a vector of some type a with [a]; that's the type of the second argument of Filter.

The result of Filter is in turn a subset of x, hence again a vector of type [a].

Therefore the whole signature could be specified as follows:

Filter :: (a -> Bool) [a] -> [a]

As is easy to see, this is the same signature as the one we figured out for filter before, except that [a] stands for a list in the former and for a vector (atomic or not) in the latter.

Let's go on with the implementation. In order to understand an implementation, I find it really helpful to first try to design our own version.

Before starting off with the implementation itself we should write examples of the function's usage, at least as many as the function signature requires, and wrap them as test cases. Examples will guide the implementation and, at the same time, wrapped as test cases, provide the indispensable testing workhorse. For the latter I will use testthat, as explained in the final part of the first post in this series: https://los-pajaros-de-hogano.blogspot.com.es/2016/07/reading-r-code-introduction.html

So save these examples into the file test_my_filter.R:

source("my_filter.R")

# Some predicates for testing filter
odd <- function(x) { x %% 2 != 0 }
starts_with_h <- function(x) { any(grepl(pattern = "^h.*", x = x)) }
numeric_expression <- function(x) { is.numeric(eval(x)) }

test_that("test my_filter", {
  expect_equal(my_filter(odd, integer(0)), integer(0))
  expect_equal(my_filter(is.atomic, list()), list())
  expect_equal(my_filter(is.expression, expression()), expression())
  expect_equal(my_filter(odd, 1:5), c(1, 3, 5))
  expect_equal(my_filter(starts_with_h, c("hi", "world", "bye", "hell")),
               c("hi", "hell"))
  expect_equal(my_filter(is.atomic, list(1, list(3, 4), 5)), list(1, 5))
  expect_equal(my_filter(numeric_expression, expression(c, "a", 1, 3)),
               expression(1, 3))
})
      

A quick glance at Filter gives us a first hint as to the path we could follow in the implementation. The last line is a typical subsetting operation. We could try to use this idea to delineate our code.

You may figure it out after a bit of thinking. Let's begin with one of our examples:

my_filter(odd, 1:5)

Well, I have something like the vector c(1, 2, 3, 4, 5) and I want to filter odd elements in it. I also have a predicate odd that allows me to determine whether each element there is odd or not.

How should my_filter be implemented to produce the expected c(1, 3, 5)?

It could apply odd to each element in c(1, 2, 3, 4, 5) to get this: c(TRUE, FALSE, TRUE, FALSE, TRUE), and then, in a typical R way, use the logical vector for indexing c(1, 2, 3, 4, 5):

c(1, 2, 3, 4, 5)[c(TRUE, FALSE, TRUE, FALSE, TRUE)]

I can sketch a general version of this idea. Instead of odd and 1:5 I generalize to whatever predicate, f, and whatever vector, x. Also, and just for readability, I create a local variable, ind, to name the logical vector.

my_filter <-
function(f, x) {
  ind <- # apply f to each element in x
  x[ind]
}

Only one thing remains to be filled here, the code that applies f to each element in x.

If you have an imperative programming background you will come up with some kind of loop to process each element in the vector and check whether the predicate holds for it. Something along these lines:

apply_f <-
function(f, x) {
  fx <- vector("logical")
  for (i in seq_along(x)) {
    fx <- c(fx, f(x[[i]]))
  }
  fx
}

Let's add this helper function to my_filter as a nested (local) function:

my_filter <-
function(f, x) {
  apply_f <- 
  function() {
    fx <- vector("logical")
    for (i in seq_along(x)) {
      fx <- c(fx, f(x[[i]]))
    }
    fx
  }

  ind <- apply_f() 
  x[ind]
}

Wait a minute! Shouldn't apply_f be tested before doing anything else, and definitely before converting it into a local (and as such untestable) function? This is a great question. Let's claim that testing is not needed because the function is so easy that we trust it will work without further testing. In a moment we will return to this and see the consequences of this assumption.

It's time to run tests:

> library(testthat)
> test_file("test_my_filter.R")

Great! All tests passed.

However, you may know that, instead of explicit loops, idiomatic R usually favors the apply family of functions.

The immediate candidate for this task seems to be sapply, which is typically used when we want a vector as the result of applying the function to each element in a given vector x. apply_f could be written with sapply as follows:

apply_f <- 
function(f, x) {
    sapply(x, f)
}

Being so simple, it doesn't make sense to wrap the sapply call in a function. So our more idiomatic and concise filter implementation would be:

my_filter <-
function(f, x) {
    ind <- sapply(x, f) 
    x[ind]
}

Run the tests ... and, oops, one test fails! The case of the empty vector as input.

Looking into ?sapply more carefully, we see that our initial expectation was based on a partial understanding of sapply:

Simplification in ‘sapply’ is only attempted if ‘X’ has length greater than zero ...

It turns out that sapply does not always produce a vector when a vector is given. In other words, our current implementation breaks the signature of my_filter.

We have different alternatives to fix the bug. One would be to use a conditional with one branch for the empty vector and another for the rest of the cases. Another option is to use lapply, which produces a list instead of a vector, and unlist the result to get the vector we need. This second strategy has an advantage: core R code shouldn't rely on user-friendly wrappers like sapply, but on more primitive functions. It is not only a matter of style; it is usually a matter of performance (core functions are more efficient) and, above all, a matter of reliability. User wrappers are there to help users write quickly, mainly in an interactive setting, but they may change or even become defunct in future releases. In general, we shouldn't trust user wrapper functions when we want to write enduring code.

With lapply and unlist we get this solution:

my_filter <-
function(f, x) {
  ind <- unlist(lapply(x, f))
  x[ind]
}

and it passes all tests.

If we compare our last version with the official implementation we'll observe that there are still some differences.

Starting from the end, the official implementation prefers which(ind) to plain ind for indexing. This is a subtle divergence. One thing it handles differently from our implementation is NA values. In our implementation, if an NA appears in x it will also appear as an NA in ind, and if we subset x directly via ind the NA ends up in the result too. However, ?Filter has something to say about this (we just omitted it in the first reading to keep things simple):

Note that possible ‘NA’ values are currently always taken as false.

This implies that NA values in x will be discarded from the result, whatever f and x, and this is exactly what which nicely achieves. which takes a logical object and produces the indices, as integers, of its TRUE values. In other words, for our case, where a vector is given as input, it has this signature:

which :: [Bool] -> [Integer]

For instance,

> which(c(TRUE, FALSE, FALSE, TRUE))
[1] 1 4

Additionally, which omits NA values in the logical vector; in other words, it treats them as if they were FALSE. For example:

> which(c(TRUE, FALSE, FALSE, TRUE, NA))
[1] 1 4

and that's precisely the documented behavior that Filter should show.

The second difference lies in the first line: the explicit conversion to logical, absent from our solution:

as.logical(unlist(lapply(x, f)))

Someone could argue that as.logical is redundant. After all, the predicate f always produces a boolean value, so the composition of lapply and unlist should produce a logical vector. It should, but it actually doesn't always. Passing an empty vector to unlist(lapply(...)) produces NULL:

> unlist(lapply(integer(0), is.integer))
NULL

Inadvertently, our previous implementation worked, and it worked because when NULL is used as an index it is treated as integer(0), and indexing an empty vector with integer(0) produces an empty vector, logical by default.

> vector()[integer(0)]
logical(0)

Once we substitute which(ind) for direct indexing, things stop working: which(NULL) raises an error, because which requires a logical object as its argument, and NULL is neither a logical value nor converted by which to logical. That's the reason for introducing as.logical. In general, as.logical makes a lot of sense by making sure that the signature of which will never be broken.
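
We can verify the whole chain at the console (a minimal check):

> as.logical(NULL)
logical(0)
> which(logical(0))
integer(0)
> integer(0)[integer(0)]
integer(0)

With as.logical in place, the empty-vector case flows through which and the final subsetting without errors.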

Let's go back to the moment when we hesitated about whether to test the helper function apply_f. In this new light the question appears pretty reasonable. Had we tested apply_f, that is, had we left it as a non-local function and verified it independently, we would have caught this bug early on.

For instance, we could have written apply_f as:

apply_f <- 
function(f, x) {
  unlist(lapply(x, f))
}

and add tests for it:

test_that("test apply_f", {
  expect_equal(apply_f(is.integer, integer(0)), logical(0))
  expect_equal(apply_f(odd, 1:5), c(TRUE, FALSE, TRUE, FALSE, TRUE))
  # ... (more tests here)
})

Running the tests would have uncovered the bug at the very beginning. Once apply_f passes all tests, we can convert it into a local function, in this case just the single call we have seen in the final definition. The moral is that untested code, which we lazily assume is too easy to fail, is often the source of unexpected bugs and countless wasted hours of debugging. Better to always test everything that is not absolutely trivial.

To finish this exploration, and for completeness, just add this new test case, which verifies the NA-discarding behavior, to the test_my_filter.R file:

expect_equal(my_filter(odd, c(1:4, NA)), c(1, 3))

and confirm that the final implementation with as.logical, which is exactly the R version of Filter, passes all tests.

Monday, July 25, 2016

Reading R code. Introduction

One of the most important things I learned when I started my higher education was that in order to really know deeply about a subject one has to dive into the original sources as soon as possible rather than solely sticking to secondary literature. Better to read Plato's and Kant's works in their original language than countless books about Critiques and Dialogues; better to analyze Mozart's and Beethoven's Sonatas (as Charles Rosen does) than skimming over a lot of general articles on Classical music.

When I was put in charge of developing a web site for my institution, I didn't have any programming background at all. After some discouraging attempts at learning from intro books and tutorials, I decided to go to the sources, and I studied the entire HTML and CSS specifications directly. It was hard, but everything was clear at last. [Remark: nowadays it would be overwhelming; the HTML and CSS specs are currently huge.]

In learning programming languages, the next step after the initial exposure to syntax, concepts, and techniques might well be studying core libraries written by the main developers.

How could one understand a programming language better than by trying to read the code that its core developers have created?

I'm always surprised by the scarcity of commentaries on exemplary code in any language. There is something of a gap between introductory expositions and the source code itself. The latter is silently digested by the people who develop it or who create libraries on top of it, while beginners always live on the other, much narrower side: their friendly little universe of tutorials and simple recipes, kind enough, for sure, but maybe a bit tasteless [though great exceptions can always be found] once their time, the absolute beginners' time, has passed.

One reason for this lack might be that core libraries are very complex and abstract beasts, and an understanding of parts of them is just hopeless without a firm grasp of the whole architecture and design, something unreachable to anyone but experts.

This is not always the case, though. At least it is not the case for a good bunch of functions in the R base code. Many are written in R itself, and they are almost self-contained, in the sense that a preliminary comprehension doesn't depend on a complete acquaintance with the underlying abstract architecture.

So I've thought I would give this idea a try, picking some R functions and reading them with the aim of understanding R better. The goal is educational, mainly self-educational. Along the way I'll try to make things easier for others with a bit less programming background than mine, hoping that in doing so I not only reinforce my own understanding but also help others deepen theirs.

I'm not an R expert. This means that while reading and trying to make sense of official implementations I will probably make some more or less educated guesses, and I could (and surely will) make mistakes. So if someone more knowledgeable than I am (there should be many) reads this, please let me know, so I can fix any error, misleading step, or gratuitous deduction.

The intended audience is people with a basic working understanding of R data structures and programming constructs, including conditionals, loops, and function definitions. The only required tool is the R console. For the moment only one extension package will be used: testthat, for unit testing support. It can be installed as usual via:

> install.packages("testthat")

To use testthat in a simple way (not the best one for real projects but enough for our purposes now) proceed as follows:

  1. Save the function you want to test in a file.
  2. Create a new file in the same directory for testing that function. The name of this file should start with 'test'.
  3. Source the code of the function to be tested.
  4. Write tests.
  5. Run tests from the console.

An example of this workflow.

Save the following function in foo.R:

foo <- 
function(x) {
    ifelse(identical(x, 1), "I'm 1", "I'm not 1")
}

Create test_foo.R with this content:

source("foo.R")

test_that("foo is 1 or not 1", {
  expect_equal(foo(1), "I'm 1")
  expect_equal(foo(0), "I'm not 1")
  expect_equal(foo("hi"), "I'm not 1")
})

Load testthat on your session and run tests from the console:

> library(testthat)
> test_file("test_foo.R")

Introducing unit testing from the very beginning may seem unnecessary. On the contrary, testing is of paramount importance, and writing a minimal set of tests first guides the implementation. Since we will probably write our own code here and there, it is crucial to be equipped with a tool that enables us to always write those tests. This is just commonplace, fundamental practice, whatever the programming language.

By the way, for a perfect introduction to programming fundamentals, where programming stands for "programming well" rather than "just coding", read this book:

http://www.ccs.neu.edu/home/matthias/HtDP2e/

and/or take this course:

https://www.edx.org/xseries/how-code-systematic-program-design

They both are gems that no one should miss.

One last point about the R source code. Apart from interactively getting the code, as usual, by typing the function name, for instance:

> which

and its documentation:

> ?which

you can, if you like, download the complete source from

https://cran.r-project.org/sources.html

Also, you can access the current official snapshot on Subversion if you know how to do so, or browse (or clone) a non-official GitHub mirror. I'm aware of these two:

Wednesday, July 20, 2016

How to read the R documentation. An example with plot().

From my experience as a teaching assistant on several R intro MOOCs, I'm getting the impression that beginners, and even intermediate users, assume that the R documentation is only for experts and, as a consequence, don't read the doc in the first place.

This is an unfortunate prejudice, since R's built-in documentation is one of its most remarkable features and it is there precisely to help users. Even though (I admit) some docs are pretty technical, many others are perfectly readable, even for beginners.

In this post I'll try to show the strategy I typically follow when dealing with R doc pages.

Let's take as an excuse the following question, posted in one of those MOOCs:

What does the function plot do when its input is a data frame?

Getting the right doc

The first thing we need to do is call up the help for plot [I show only the first few lines of the result]:

> ?plot

# -------------------

plot                 package:graphics                  R Documentation

Generic X-Y Plotting

Description:

     Generic function for plotting of R objects.  For more details
     about the graphical parameter arguments, see ‘par’.

...

# -------------------

Note the first sentence in the 'Description' section. It tells us that plot is a generic function for plotting R objects.

Not much, is it? But still something. For our purposes, generic function means that we need to search for a method (= another function) that will actually be called depending on the class of the object we pass to plot. [Since my intent is to guide beginners, I omit all discussions regarding technicalities, such as the exact meaning of method and function in R, as well as the difference between non-generic and generic functions. For the moment, take those terms and others of this sort just as jargon we are liberally using to talk about these things.]
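
A minimal way to see what this means in practice (a quick sketch):

> class(OrchardSprays)   # the class of the object drives the dispatch
[1] "data.frame"

so plot(OrchardSprays) will end up calling a method named plot.data.frame, provided such a method exists.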

To know more about plot methods we just type this:

> methods(plot)

This is what I get on my installation:

 [1] plot.acf*           plot.data.frame*    plot.decomposed.ts*
 [4] plot.default        plot.dendrogram*    plot.density*      
 [7] plot.ecdf           plot.factor*        plot.formula*      
[10] plot.function       plot.hclust*        plot.histogram*    
[13] plot.HoltWinters*   plot.isoreg*        plot.lm*           
[16] plot.medpolish*     plot.mlm*           plot.ppr*          
[19] plot.prcomp*        plot.princomp*      plot.profile.nls*  
[22] plot.raster*        plot.spec*          plot.stepfun       
[25] plot.stl*           plot.table*         plot.ts            
[28] plot.tskernel*      plot.TukeyHSD*

Among these methods, plot.data.frame is naturally the one we are interested in. And R helps here too:

> ?plot.data.frame

# -------------------

plot.data.frame            package:graphics            R Documentation

Plot Method for Data Frames

Description:

     ‘plot.data.frame’, a method for the ‘plot’ generic.  It is
     designed for a quick look at numeric data frames.

Usage:

     ## S3 method for class 'data.frame'
     plot(x, ...)
     
Arguments:

       x: object of class ‘data.frame’.

     ...: further arguments to ‘stripchart’, ‘plot.default’ or ‘pairs’.

Details:

     This is intended for data frames with _numeric_ columns. For more
     than two columns it first calls ‘data.matrix’ to convert the data
     frame to a numeric matrix and then calls ‘pairs’ to produce a
     scatterplot matrix).  This can fail and may well be inappropriate:
     for example numerical conversion of dates will lose their special
     meaning and a warning will be given.

     For a two-column data frame it plots the second column against the
     first by the most appropriate method for the first column.

     For a single numeric column it uses ‘stripchart’, and for other
     single-column data frames tries to find a plot method for the
     single column.

See Also:

     ‘data.frame’

Examples:

     plot(OrchardSprays[1], method = "jitter")
     plot(OrchardSprays[c(4,1)])
     plot(OrchardSprays)
     
     plot(iris)
     plot(iris[5:4])
     plot(women)

# -------------------

Reading Details

Most R help pages share the same structure, consisting of different sections (Title, Description, Usage, etc.).

Some sections may be more important than others depending on what we want to know. For instance, if we only want to know what the function is about in general terms, it might be enough to read the terse 'Description'. Or if we already know what the function does but have forgotten some particular use of a certain argument, we can look into the 'Arguments' section. If we really want to know what the function exactly does, we will need to read 'Details' and 'Examples'.

A superficial reading just doesn't work; we have to take the time to read thoroughly. Let's do it now. Even without fully understanding everything, a careful reading allows us to outline what this function does:

  • If the given data frame consists of more than two columns, it ideally displays a scatterplot matrix.
  • If the data frame has two columns, it displays a suitable y versus x plot, where x is the first column and y the second.
  • If the data frame has a single numeric column it generates a stripchart; if that column is not numeric a suitable plot [not specified] is displayed.

Reading examples

So far so good, but an image, an example, is worth a thousand words. And R usually excels at providing ready-made examples for us. This is critically important when reading the R documentation. If examples are given, please don't skim them, work them out. Even more: rather than just running them via example(function_name), explore them on the console, one by one, at least until you reach a good enough understanding. Let's go on to the 'Examples' section.

The first three examples apply plot to the data set OrchardSprays, which a basic R installation provides by default. This data set has the following structure:

> str(OrchardSprays)
'data.frame': 64 obs. of  4 variables:
 $ decrease : num  57 95 8 69 92 90 15 2 84 6 ...
 $ rowpos   : num  1 2 3 4 5 6 7 8 1 2 ...
 $ colpos   : num  1 1 1 1 1 1 1 1 2 2 ...
 $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1 3 2 ...

So four columns, the first three numeric, and the last one categorical (a factor in R parlance).

The first example passes a data frame with a single column, a one-column subset drawn from the OrchardSprays data frame. Therefore, it's an example of the case where the data frame is made of a single numeric variable. We should expect a stripchart, as documented, and that's what we get:

> plot(OrchardSprays[1], method = "jitter")

The second example passes this data frame:

> str(OrchardSprays[c(4, 1)])
'data.frame': 64 obs. of  2 variables:
 $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1 3 2 ...
 $ decrease : num  57 95 8 69 92 90 15 2 84 6 ...

So this is an example of the second case described above, where the first column (x) is categorical and the second (y) is numerical. A suitable plot would be a series of boxplots, one per level of the categorical variable. And that's exactly what we obtain:

> plot(OrchardSprays[c(4, 1)])

The third example passes the whole data frame, an illustration of the first case mentioned in the doc. And we get the corresponding scatterplot matrix:

> plot(OrchardSprays)

I leave it to the reader to investigate the last three examples. The only new case is plot(women), where the input is a data frame with two columns, both numeric in this case.
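
As a head start for that last case, this is what to expect (a quick sketch; run it on your console):

> str(women)
'data.frame': 15 obs. of  2 variables:
 $ height: num  58 59 60 61 62 63 64 65 66 67 ...
 $ weight: num  115 117 120 123 126 129 132 135 139 142 ...
> plot(women)   # two numeric columns: weight plotted against height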

I honestly believe that reading this doc (like many other R docs) gives more reliable information about the function at hand than hours of googling or skimming over dozens of books.

Getting and reading the source code

One more thing is still available to users: the source code itself, which obviously is the definitive answer to all questions.

Many R functions are implemented in R itself, and an intermediate R user should be able to read and understand the implementation, to some extent at least. Yes, many core functions are written in C, and those are beyond the level of expertise of a non-professional programmer without sufficient time to invest in navigating the entire C code base and making sense of it. But we can always give it a try, just in case.

As for the function plot.data.frame we are lucky.

There is an initial difficulty, though: locating the source. Methods listed by the above-mentioned methods function that come suffixed with * are functions whose code cannot be reached by just typing the function name as usual. The source code is still accessible, though. In particular, when the function is an S3 method, as plot.data.frame is [I omit commenting on S3 vs S4 methods; see the R manuals at cran.r-project.org for more info], we have, among maybe others, these two instructions:

> getS3method("plot", "data.frame")

or

> getAnywhere("plot.data.frame")

which displays this code [line numbers added for the commentary below]:

1 function (x, ...) 
2 {
3      plot2 <- function(x, xlab = names(x)[1L], ylab = names(x)[2L], 
           ...) plot(x[[1L]], x[[2L]], xlab = xlab, ylab = ylab, 
           ...)
4      if (!is.data.frame(x)) 
5          stop("'plot.data.frame' applied to non data frame")
6      if (ncol(x) == 1) {
7          x1 <- x[[1L]]
8          cl <- class(x1)
9          if (cl %in% c("integer", "numeric")) 
10             stripchart(x1, ...)
11         else plot(x1, ...)
12     }
13     else if (ncol(x) == 2) {
14         plot2(x, ...)
15     }
16     else {
17         pairs(data.matrix(x), ...)
18     }
19 }

A concise commentary:

The function takes a mandatory argument, a data frame by assumption, and an arbitrary number of optional arguments that possible inner function calls pass on to other graphics functions [line 1].

It defines an inner function, plot2, which in turn calls plot with the first and second columns of the given data frame, setting appropriate labels for the resulting plot. This function will be used later for one of the possible cases mentioned in the documentation, where the data frame has two columns [line 3].

The function exits with an error message if the argument passed is not a data frame [lines 4-5]. This is just the usual defensive line to handle arguments that don't follow the assumed type of input.

Next comes the main part [lines 6 ff.], which conditionally selects a kind of graph depending on the number and, if required, the class of the columns in the data frame.

If the data frame has one column and it is "numeric" or "integer", a stripchart is displayed; otherwise (the column is of another class) it relies on the generic plot again to generate the appropriate plot [lines 6-11].

If the data frame has two columns the previously defined plot2 function is called, so that a suitable y vs. x plot is obtained [lines 13-14].

Finally, if the number of columns is greater than 2, the remaining possible case, the data frame is coerced to a matrix via data.matrix and the function pairs is applied to the result of the coercion [lines 16-17].

To get a grasp of this last part, here is my challenge: read now ?data.matrix and ?pairs. Have fun and happy R reading!