jueves, 28 de julio de 2016

Reading R code. The functions Map and mapply

In this post I'm going to consider the implementation of the function Map in R. As in the previous post in the series, I first introduce the typical map in functional languages. Then I go into the R's Map implementation. Along the way I explore some other R functions to the extent needed for a basic understanding of the implementation.

What is map

map applies a function to each element in a given list to get a new list with the result of the application. For instance:

map sqrt [1, 2, 3]

produces a list of the square roots of each number:

[1.0,1.4142135623730951,1.7320508075688772]

In general, map takes a function and a list of elements of some type, say a, and produces a list of elements of some other type, say b, where a and b are the types of, respectively, the input and output values of the function that map takes. The signature for the example above could be written as:

(Num -> Double) [Num] -> [Double]

meaning that the first argument, sqrt, takes a number (in general) and produces a double-precision floating point number; the second argument is a list of numbers, and the result a list of Doubles.

So in general map has this signature:

(a -> b) [a] -> [b]

As with filter a natural implementation of map is a math-like recursive definition:

map f []     = []
map f (x:xs) = f x : map f xs

Or, in Racket:

(define (map f lox)
  (cond [(empty? lox) empty]
        [else
          (cons (f (first lox))
                (map f (rest lox)))))

This is the basic usage. The function map can usually handle multiple lists too. One of the variants works as follows:

It applies the function to the first element of given lists, then to the second, and so on till the last element of the shortest list is processed, the rest of elements of longer lists are ignored, and the result is the list produced by the successive application. This is a possible example with two lists as input. I use the name zipWith to refer to this kind of map that can take in two lists:

zipWith (+) [1, 2, 3] [4, 5, 6, 7]

produces:

[5, 7, 9] -- so [1 + 4, 2 + 5, 3 + 6]

The signature would be:

zipWith :: (a b -> c) [a] [b] -> [c]

Generalizations to cope with an arbitrary number of lists are also available in functional languages.

Map in R

The R Map function provides, as ?Map states, a generalization of the sort described:

‘Map’ applies a function to the corresponding elements of given vectors.

(Note that like Filter the consumed objects are R vectors.)

However, unlike the generalized map mentioned, Map doesn't discard remaining elements of longer lists. Instead, it uses recycling:

‘Map’ ... is similar to Common Lisp's ‘mapcar’ (with arguments being recycled, however)

Common-Lisp mapcar is a function that behaves as explained above (the Haskell's zipWith example). Indeed, executing the corresponding Lisp code on the clisp interpreter gives the same result:

[1]> (mapcar #'+ '(1 2 3) '(4 5 6 7))
(5 7 9)

while in R recycling is at work, though:

> Map(`+`, 1:3, 4:7)
[[1]]
[1] 5

[[2]]
[1] 7

[[3]]
[1] 9

[[4]]
[1] 8

As you can see, when the shortest list is exhausted, R recycles it as needed: the last element is the result of 1 + 7, where 1 comes here from recycling the shorter list. Note also that the result is not an atomic vector as one might expect but a list. This is also documented:

‘Map’ ... does not attempt to simplify the result ... Future versions may allow some control of the result type.

As for the implementation ?Map also tells us what it is:

‘Map’ is a simple wrapper to ‘mapply’.

It is not uncommon in R to find traces of implementation details in its docs.

The actual implementation expresses in code all of these points:

function (f, ...) 
{
  f <- match.fun(f)
  mapply(FUN = f, ..., SIMPLIFY = FALSE)
}

Let's have a closer look into the definition.

The function takes a first argument f, the mapping function to be applied, and an undetermined number of extra arguments. The three dots construct allows to catch those extra arguments and pass them on later to mapply. Vectors on which f will be applied are among the arguments that the caller will pass; eventually, more arguments might be passed by the caller, arguments that mapply can receive.

In the first line, the function passed as first argument is extracted and saved into a local variable f, or discarded if it cannot be interpreted in any way as a function, that's what match.fun basically does.

The function mapply

Once this first argument has been checked, it is passed as the FUN argument to mapply that in turn calls the underlying C code to carry out the computation. Additionally, Map sets the SIMPLIFY argument of mapply to FALSE to avoid the simplification, as documented.

We can see this better if we take a look at the implementation of mapply:

function(FUN,..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
{
  FUN <- match.fun(FUN)
  dots <- list(...)

  answer <- .Internal(mapply(FUN, dots, MoreArgs))

  if (USE.NAMES && length(dots)) {
    if (is.null(names1 <- names(dots[[1L]])) && is.character(dots[[1L]]))
      names(answer) <- dots[[1L]]
    else if (!is.null(names1))
      names(answer) <- names1
  }
  if(!identical(SIMPLIFY, FALSE) && length(answer))
    simplify2array(answer, higher = (SIMPLIFY == "array"))
  else answer
}

Note that answer, the returned value, is the result produced by the internal function, written in C, which is invoked via the call to .Internal. We come to this construct over and over while reading R base functions. Many R base functions just wrap the call to an underlying function, eventually preparing things for it, and/or adapting the result of it according to the arguments passed in the first place.

Map sets mapply's SIMPLIFY to FALSE so that the simplification that mapply could do otherwise will never be executed.

However, despite the fact that the intended arguments for the three dots are just the vectors, as ?Map documents, nothing prevents us from passing other possible mapply arguments, USE.NAMES and MoreArgs.

MoreArgs allows for passing extra arguments to the f function. For instance, to get a list of vectors where 42 is repeated from 1 to 4 times, we can use MoreArgs in this way [example borrowed from the mapply doc]:

> Map(rep, times = 1:4, MoreArgs = list(x = 42))
[[1]]
[1] 42

[[2]]
[1] 42 42

[[3]]
[1] 42 42 42

[[4]]
[1] 42 42 42 42

As R allows us to pass extra arguments just by naming them after the function,

> Map(rep, times = 1:4, x = 42)

MoreArgs seems to be of little use in this regard.

Furthermore, one can always resort to the commonly-used idiom in functional programming: supplying an anonymous function in place of f:

> Map(function(x) rep(42, times = x), 1:4)

The other extra option that we may pass to Map is USE.NAMES. mapply sets it by default to TRUE. In such a setting, and if more arguments apart form the initial function f are passed (length(dots) is not 0) this code in mapply will be executed:

if (USE.NAMES && length(dots)) {
  if (is.null(names1 <- names(dots[[1L]])) && is.character(dots[[1L]])) 
    names(answer) <- dots[[1L]]
  else if (!is.null(names1)) 
    names(answer) <- names1
}

Two cases are especially handled:

  • The second argument passed to mapply (dots[[1L]]) doesn't have names but it is of character type. (Recall that the first argument is always the function f).
  • The second argument passed to mapply has names.

In the first case, the character object is used to set the names of the result of mapply, as in this example taken from ?mapply:

> mapply(function(C, k) paste0(rep.int(C, k)), LETTERS[1:6], 6:1)

In the second case, the names of the argument become the names of the result. For instance [again from ?mapply]:

> mapply(function(x, y) seq_len(x) + y,
> +             c(a =  1, b = 2, c = 3),  # names from first
> +             c(A = 10, B = 0, C = -10))

This is simply put on the doc:

use names if the first ... argument has names, or if it is a character vector, use that character vector as the names.

If we want no names handling in Map we can just set USE.NAMES to FALSE overriding the default behavior of mapply, or viewed from the implementation, falling back to the default behavior of the internal C code.

A couple of points about R idioms that we see in the mapply:

Assignments can be sub-expressions. Here:

is.null(names1 <- names(dots[[1L]]))

The assignment expression names1 <- names(dots[[1L]]) evaluates to the value of names1, once assigned to. The value is passed to is.null as a part of the if condition. The value of names1 is later re-used in the last branch.

Secondly, in a logical context, the number 0 evaluates to FALSE, any other number evaluates to TRUE. This allows for concise conditions as the previous one:

if (USE.NAMES && length(dots))

The second part of the relational expression check whether dots is empty or not. Hence, there is no need to write length(dots) != 0. This is a common idiom in many languages supporting boolean evaluation of different classes of objects.

To complete the basic understanding of these functions I should inspect more closely match.fun and simplify2array (the function responsible of simplifying the result). match.fun appears frequently in R code and I pin it on top of my task list, simplify2array is a helper function currently used only by mapply and sapply. We can live without studying it for the time being.

No hay comentarios:

Publicar un comentario