## jueves, 28 de julio de 2016

### Reading R code. The functions Map and mapply

In this post I'm going to consider the implementation of the function `Map` in R. As in the previous post in the series, I first introduce the typical `map` in functional languages. Then I go into the R's `Map` implementation. Along the way I explore some other R functions to the extent needed for a basic understanding of the implementation.

### What is `map`

`map` applies a function to each element in a given list to get a new list with the result of the application. For instance:

``map sqrt [1, 2, 3]``

produces a list of the square roots of each number:

``[1.0,1.4142135623730951,1.7320508075688772]``

In general, `map` takes a function and a list of elements of some type, say `a`, and produces a list of elements of some other type, say `b`, where `a` and `b` are the types of, respectively, the input and output values of the function that `map` takes. The signature for the example above could be written as:

``(Num -> Double) [Num] -> [Double]``

meaning that the first argument, `sqrt`, takes a number (in general) and produces a double-precision floating point number; the second argument is a list of numbers, and the result a list of `Double`s.

So in general `map` has this signature:

``(a -> b) [a] -> [b]``

As with `filter` a natural implementation of `map` is a math-like recursive definition:

``````map f []     = []
map f (x:xs) = f x : map f xs``````

Or, in Racket:

``````(define (map f lox)
(cond [(empty? lox) empty]
[else
(cons (f (first lox))
(map f (rest lox)))))``````

This is the basic usage. The function `map` can usually handle multiple lists too. One of the variants works as follows:

It applies the function to the first element of given lists, then to the second, and so on till the last element of the shortest list is processed, the rest of elements of longer lists are ignored, and the result is the list produced by the successive application. This is a possible example with two lists as input. I use the name `zipWith` to refer to this kind of `map` that can take in two lists:

``zipWith (+) [1, 2, 3] [4, 5, 6, 7]``

produces:

``[5, 7, 9] -- so [1 + 4, 2 + 5, 3 + 6]``

The signature would be:

``zipWith :: (a b -> c) [a] [b] -> [c]``

Generalizations to cope with an arbitrary number of lists are also available in functional languages.

### Map in R

The R `Map` function provides, as `?Map` states, a generalization of the sort described:

‘Map’ applies a function to the corresponding elements of given vectors.

(Note that like `Filter` the consumed objects are R vectors.)

However, unlike the generalized `map` mentioned, `Map` doesn't discard remaining elements of longer lists. Instead, it uses recycling:

‘Map’ ... is similar to Common Lisp's ‘mapcar’ (with arguments being recycled, however)

Common-Lisp `mapcar` is a function that behaves as explained above (the Haskell's `zipWith` example). Indeed, executing the corresponding Lisp code on the `clisp` interpreter gives the same result:

``````[1]> (mapcar #'+ '(1 2 3) '(4 5 6 7))
(5 7 9)``````

while in R recycling is at work, though:

``````> Map(`+`, 1:3, 4:7)
[[1]]
[1] 5

[[2]]
[1] 7

[[3]]
[1] 9

[[4]]
[1] 8``````

As you can see, when the shortest list is exhausted, R recycles it as needed: the last element is the result of 1 + 7, where 1 comes here from recycling the shorter list. Note also that the result is not an atomic vector as one might expect but a list. This is also documented:

‘Map’ ... does not attempt to simplify the result ... Future versions may allow some control of the result type.

As for the implementation `?Map` also tells us what it is:

‘Map’ is a simple wrapper to ‘mapply’.

It is not uncommon in R to find traces of implementation details in its docs.

The actual implementation expresses in code all of these points:

``````function (f, ...)
{
f <- match.fun(f)
mapply(FUN = f, ..., SIMPLIFY = FALSE)
}``````

Let's have a closer look into the definition.

The function takes a first argument `f`, the mapping function to be applied, and an undetermined number of extra arguments. The three dots construct allows to catch those extra arguments and pass them on later to `mapply`. Vectors on which `f` will be applied are among the arguments that the caller will pass; eventually, more arguments might be passed by the caller, arguments that `mapply` can receive.

In the first line, the function passed as first argument is extracted and saved into a local variable `f`, or discarded if it cannot be interpreted in any way as a function, that's what `match.fun` basically does.

### The function `mapply`

Once this first argument has been checked, it is passed as the `FUN` argument to `mapply` that in turn calls the underlying C code to carry out the computation. Additionally, `Map` sets the `SIMPLIFY` argument of `mapply` to `FALSE` to avoid the simplification, as documented.

We can see this better if we take a look at the implementation of `mapply`:

``````function(FUN,..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
{
FUN <- match.fun(FUN)
dots <- list(...)

if (USE.NAMES && length(dots)) {
if (is.null(names1 <- names(dots[[1L]])) && is.character(dots[[1L]]))
else if (!is.null(names1))
}
simplify2array(answer, higher = (SIMPLIFY == "array"))
}``````

Note that `answer`, the returned value, is the result produced by the internal function, written in C, which is invoked via the call to `.Internal`. We come to this construct over and over while reading R base functions. Many R base functions just wrap the call to an underlying function, eventually preparing things for it, and/or adapting the result of it according to the arguments passed in the first place.

`Map` sets `mapply`'s `SIMPLIFY` to `FALSE` so that the simplification that `mapply` could do otherwise will never be executed.

However, despite the fact that the intended arguments for the three dots are just the vectors, as `?Map` documents, nothing prevents us from passing other possible `mapply` arguments, `USE.NAMES` and `MoreArgs`.

`MoreArgs` allows for passing extra arguments to the `f` function. For instance, to get a list of vectors where 42 is repeated from 1 to 4 times, we can use `MoreArgs` in this way [example borrowed from the `mapply` doc]:

``````> Map(rep, times = 1:4, MoreArgs = list(x = 42))
[[1]]
[1] 42

[[2]]
[1] 42 42

[[3]]
[1] 42 42 42

[[4]]
[1] 42 42 42 42``````

As R allows us to pass extra arguments just by naming them after the function,

``> Map(rep, times = 1:4, x = 42)``

`MoreArgs` seems to be of little use in this regard.

Furthermore, one can always resort to the commonly-used idiom in functional programming: supplying an anonymous function in place of `f`:

``> Map(function(x) rep(42, times = x), 1:4)``

The other extra option that we may pass to `Map` is `USE.NAMES`. `mapply` sets it by default to `TRUE`. In such a setting, and if more arguments apart form the initial function `f` are passed (`length(dots)` is not 0) this code in `mapply` will be executed:

``````if (USE.NAMES && length(dots)) {
if (is.null(names1 <- names(dots[[1L]])) && is.character(dots[[1L]]))
else if (!is.null(names1))
}``````

Two cases are especially handled:

• The second argument passed to `mapply` (`dots[[1L]]`) doesn't have names but it is of character type. (Recall that the first argument is always the function `f`).
• The second argument passed to `mapply` has names.

In the first case, the character object is used to set the names of the result of `mapply`, as in this example taken from `?mapply`:

``> mapply(function(C, k) paste0(rep.int(C, k)), LETTERS[1:6], 6:1)``

In the second case, the names of the argument become the names of the result. For instance [again from `?mapply`]:

``````> mapply(function(x, y) seq_len(x) + y,
> +             c(a =  1, b = 2, c = 3),  # names from first
> +             c(A = 10, B = 0, C = -10))``````

This is simply put on the doc:

use names if the first ... argument has names, or if it is a character vector, use that character vector as the names.

If we want no names handling in `Map` we can just set `USE.NAMES` to `FALSE` overriding the default behavior of `mapply`, or viewed from the implementation, falling back to the default behavior of the internal C code.

A couple of points about R idioms that we see in the `mapply`:

Assignments can be sub-expressions. Here:

``is.null(names1 <- names(dots[[1L]]))``

The assignment expression `names1 <- names(dots[[1L]])` evaluates to the value of `names1`, once assigned to. The value is passed to `is.null` as a part of the `if` condition. The value of `names1` is later re-used in the last branch.

Secondly, in a logical context, the number 0 evaluates to `FALSE`, any other number evaluates to `TRUE`. This allows for concise conditions as the previous one:

``if (USE.NAMES && length(dots))``

The second part of the relational expression check whether `dots` is empty or not. Hence, there is no need to write `length(dots) != 0`. This is a common idiom in many languages supporting boolean evaluation of different classes of objects.

To complete the basic understanding of these functions I should inspect more closely `match.fun` and `simplify2array` (the function responsible of simplifying the result). `match.fun` appears frequently in R code and I pin it on top of my task list, `simplify2array` is a helper function currently used only by `mapply` and `sapply`. We can live without studying it for the time being.