## martes, 23 de agosto de 2016

### Reading R Code. Factors in R

R factors are a common source of confusion for R beginners, even in their most basic use. In this post I try to shed some light on R factors. But rather than only rely on external sources of information or on the R documentation, I will make some experiments and have a look at the implementation to understand what a factor is and how it behaves in the most simple use cases.

First of all, R factors are intended for representing categorical variables. A categorical variable is a variable that can take values of a limited set of so-called categories. For instance, to encode the gender of a population we would use a categorical variable with two possible categories, male and female. Or to encode the educational level of a population under 18 we can use a categorical variable with three categories: elementary, middle and high.

In addition, categories of a categorical variable may or may not have some internal ordering. We cannot ascribe any sound ordering to the first example of categorical variable, but we should in the second case (elementary `<` middle `<` high)

In R, the term `levels` stands for categories of the categorical variable (the factor). `labels` in turn suggests something like custom names for those categories.

This distinction might be blurry, and it is. Think of some data you are going to analyze, say, a record of the gender of 6 persons:

``> x <- c("F", "M", "F", "F", "F", "M")``

If we want to convert this into a factor, which are the levels? Maybe, "male" and "female"?

``````> factor(x, c("male", "female"))
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: male female``````

Yikes! This doesn't work. Interestingly, no error message has been triggered, and levels are stored as we intended. But what are those `<NA>`s there?

Let's try without supplying levels and see what happens:

``````> factor(x)
> [1] F M F F F M
> Levels: F M``````

This looks better, at least nasty `<NA>`s have disappeared. So, when no levels are supplied, levels end up being the distinct elements of the original `x`. It seems that we could pass them instead of our `"male"` and `"female"` before:

``````> factor(x, levels = c("M", "F"))
> [1] F M F F F M
> Levels: M F``````

Fine! This works. Only that levels appear now in a different order, the one we have supplied.

Still we would prefer the more human-friendly "male" and "female". It might be that these words are rather labels. Let's try:

``````> factor(x, labels = c("male", "female"))
> [1] male   female male   male   male   female
> Levels: male female``````

Much better, but wait, our initial vector was another one. This totally wrecks havoc with `x`!

Looks like the order in which we pass the labels makes a serious effect. And it certainly does:

``````> factor(x, labels = c("female", "male"))
> [1] female male   female female female male
> Levels: female male``````

Great! If for some hidden reason (maybe, say, because we expect that in a plot of this factor the legend shows male scores before female ones) we still want the other ordering of levels, we may try to combine both parameters:

``````> factor(x, levels = c("M", "F"), labels = c("male", "female"))
> [1] female male   female female female male
> Levels: male female``````

That's it! Particularly, note how each label corresponds to each level in the values supplied. To verify, something like the following (where there is a mismatch in the ordering of levels and labels)

``````> factor(x, levels = c("F", "M"), labels = c("male", "female"))
> [1] male   female male   male   male   female
> Levels: male female``````

ruins `x` again.

From these experiments we can draw some intuitive conclusions:

1. levels are the distinct elements in the object to be converted to factor.
2. If levels are not supplied they are constructed from those unique elements sorted in alphabetical order.
3. labels are the "names" we want to give to our levels.
4. If no supplied, labels are the same as the levels.
5. The order of labels passed must match the order of the levels, either of the default levels or of the levels we supply.
6. The less risky way to create factors with custom labels is to supply both parameters.

Looking at the implementation makes crystal clear what we have found out.

``````  1 function (x = character(), levels, labels = levels, exclude = NA,
2     ordered = is.ordered(x), nmax = NA)
3 {
4     if (is.null(x))
5         x <- character()
6     nx <- names(x)
7     if (missing(levels)) {
8         y <- unique(x, nmax = nmax)
9         ind <- sort.list(y)
10         y <- as.character(y)
11         levels <- unique(y[ind])
12     }
13     force(ordered)
14     exclude <- as.vector(exclude, typeof(x))
15     x <- as.character(x)
16     levels <- levels[is.na(match(levels, exclude))]
17     f <- match(x, levels)
18     if (!is.null(nx))
19         names(f) <- nx
20     nl <- length(labels)
21     nL <- length(levels)
22     if (!any(nl == c(1L, nL)))
23         stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
24             nl, nL), domain = NA)
25     levels(f) <- if (nl == nL)
26         as.character(labels)
27     else paste0(labels, seq_along(levels))
28     class(f) <- c(if (ordered) "ordered", "factor")
29     f
30 }``````

Let us consider the second point above. If levels are not supplied they will be the distinct elements of the given vector:

``````if (missing(levels)) {
y <- unique(x, nmax = nmax)
ind <- sort.list(y)
y <- as.character(y)
levels <- unique(y[ind])
}``````

If levels are missing, we create a vector `y` of unique elements of `x`:

``````> y <- unique(x)
> y
[1] "F" "M"``````

Note that `nmax` is set by default to `NA` in the function header:

``````function (x = character(), levels, labels = levels, exclude = NA,
ordered = is.ordered(x), nmax = NA) ``````

and `unique(x, nmax = NA)` is equivalent to the call just above.

Also, note that if the order of elements in `x` were different, say, `c("M", "F", "M")`, `unique` would produce the distinct elements just by removing duplicates in the given vector:

``````> unique(c("M", "F", "M"))
[1] "M" "F"``````

So `unique` doesn't sort the result.

To sort the result we need a sorting function, here `sort.list`. This function produces the indices suitable for subsetting `y`:

``````> sort.list(y)
[1] 1 2

> y[sort.list(y)]
[1] "F" "M"``````

Note also that `y` holds a character vector always. So levels produced are always a character vector sorted in alphabetical order.

If levels are supplied it is expected that they match the unique elements of `x`. Pay attention to this crucial line:

``f <- match(x, levels)``

`match` is a nice but maybe tricky function to understand, it produces the positions (= indices) of matches of `x` in `levels`.

Let's try some examples. Recall that `x` is:

``````> x
[1] "F" "M" "F" "F" "F" "M"

> match(x, c("M", "F"))
[1] 2 1 2 2 2 1``````

Indeed, the first element of `x` matches the `2`nd element of `levels`; the second element of `x` matches the `1`st element of `levels`, and so on.

If there is no match, a `NA` is produced by default:

``````> match(x, c("M", "O"))
[1] NA  1 NA NA NA  1``````

Now we see where the `NA`s in our initial wrong attempt with "male" and "female" as levels came from.

If labels are not supplied they are just the levels as the function header states (`levels` is the default value for `labels`).

If they are supplied, these lines give labels to levels:

``````levels(f) <- if (nl == nL)
as.character(labels)
else paste0(labels, seq_along(levels))``````

Actually, the line that is executed for our previous examples with `labels` is just:

``levels(f) <- as.character(labels)``

since in our examples the number of levels are labels are the same (`nl == nL`, where `nl <- length(labels)` and `nL <- length(levels)`)

Note again that levels with those new labels are still always a character object.

The last line in the snippet above deals with the case where a single label is supplied. We haven't tried an example for that, it is time to do it now:

``````> factor(x, labels = "gender")
[1] gender1 gender2 gender1 gender1 gender1 gender2
Levels: gender1 gender2``````

This is a valid way to assign labels to levels, though not so good for this example. The code responsible of producing this vector is:

``paste0(labels, seq_along(levels))``

`seq_along(levels)` produces a sequence `1, 2, ... n` where `n` is the length of `levels`. `paste0` just concatenates the single label with those digits in the sequence via recycling.

However, the most remarkable fact we may notice from looking at the implementation is that a `factor` in R is just an integer vector whose attributes "levels" (as we have seen), "names" (as names of `x` [see line 6 and lines 18-19 in the implementation]) and, above all, "class" are set appropriately.

As for the latter the last but one line in the implementation sets the `"class"` attribute to `"factor"` always. Besides, if `x` is an ordered vector (the default behavior), or if we pass `TRUE` as value to the `ordered` argument, `"class"` is set to `c("ordered", "factor")`:

``class(f) <- c(if (ordered) "ordered", "factor") ``

Note the order of elements in the last vector for "class". The order here matters, and means that in such a case `f` is like an instance of the class `"ordered"`, that in turn is somewhat like a subclass of "`factor`". [See `?class` for more details about this kind of inheritance mechanism.]

A nice way of finishing this post is to demonstrate what we have discovered, that a factor is, from the R's internal point of view, just an integer vector with certain attributes set appropriately.

``````> f <- c(1L, 2L, 1L, 1L, 1L, 2L)
> levels(f) <- c("female", "male")
> class(f) <- "factor"
> str(f)
Factor w/ 2 levels "female","male": 1 2 1 1 1 2
> f
[1] female male   female female female male
Levels: female male

> f2 <- c(1L, 3L, 3L, 2L, 1L, 2L)
> levels(f2) <- c("elem", "middle", "high")
> class(f2) <- c("ordered", "factor")
> str(f2)
Ord.factor w/ 3 levels "elem"<"middle"<..: 1 3 3 2 1 2
> f2
[1] elem   high   high   middle elem   middle
Levels: elem < middle < high``````

## lunes, 8 de agosto de 2016

### Reading R Code. The function Reduce in R (part II)

[ This post is the third one in a series about `Reduce`. To understand it you also need to read the first and second posts in this series.]

### Loops

Our implementation of `Reduce` directly translates the recursive procedures capturing the `fold-left` and `fold-right` computations. I think that this kind of recursive thinking is not only more high-level but also easier to understand, and that was the reason for beginning with them. However, in languages that don't provide mechanisms for optimizing recursion, iteration is always expressed via loops or another non-recursive procedure.

We have to do so in R if we hope to handle sequences of certain length, otherwise we'll get something like this:

``````> source("my_reduce.R")
> my_reduce("+", 1:100000)
Error: evaluation nested too deeply ...``````

Let's get started with converting the higher-level recursion used so far into a lower-level kind of loop for our first simple implementation in order to see how this can be easily done for the rest of the function.

That first implementation was for `fold-left` with a given initial value:

``````my_reduce <-
function(f, x, init) {
f <- match.fun(f)

iter <- function(acc, x) {
len <- length(x)
if (!len) {
acc
}
else {
first <- x[[1L]]
rest <- tail(x, -1L)

iter(f(acc, first), rest)
}
}

iter(init, x)
}``````

Using a `for` loop instead of recursion we get this self-explanatory version:

``````my_reduce <-
function(f, x, init) {
f <- match.fun(f)
acc <- init

for (e in x) {
acc <- f(acc, e)
}
acc
}``````

Accessing elements by its index is another way to get the same thing. The refactored function based on indices would be:

``````my_reduce <-
function(f, x, init) {
f <- match.fun(f)
acc <- init
len <- length(x)
ind <- seq_len(len)

for (i in ind) {
acc <- f(acc, x[[i]])
}
acc
}``````

`seq_len` is an R function that takes a number and produces an integer vector from 1 to that number (included). So, `seq_len(4)` produces the sequence `1, 2, 3, 4`. `ind` in the code above just refers to the numerical indices of `x` to be used in the loop.

A cosmetic and optional detail. Since the local variable `acc` takes the value passed as `init`, and the initial `init` value won't be used any more, we can stick with `init` as the name for the accumulated value so far, and remove `acc`:

``````my_reduce <-
function(f, x, init) {
f <- match.fun(f)
len <- length(x)
ind <- seq_len(len)

for (i in ind) {
init <- f(init, x[[i]])
}
init
}``````

### `Reduce`, at last!

That's actually the R implementation for left folding with `init` given:

``````function(f, x, init) {
len <- length(x)
f <- match.fun(f)
ind <- seq_len(len)

for (i in ind) init <- forceAndCall(2, f, init, x[[i]])
init
}``````

The only subtle difference lies in how `f` is called. Instead of a direct application, it uses the `forceAndCall` R function. The purpose of this is to avoid delayed evaluation of arguments. That means that we force R to evaluate the first 2 arguments (note the `2` as first argument) of `f` immediately before executing the `f` body, rather than leaving R to evaluate them after. This is rather technical at this point and we can go ahead without fully understanding. Just note that this construct is like calling the function but in a better way when that function is to be the value passed to a higher-order function like `Reduce`. We might go deeper into this in the future.

Once we have transformed our initial version into a loop-based version the first half of `Reduce` is now easy to understand.

`````` 1 Reduce <-
2 function(f, x, init, right = FALSE, accumulate = FALSE)
3 {
4   mis <- missing(init)
5   len <- length(x)
6
7   if(len == 0L) return(if(mis) NULL else init)
8
9   f <- match.fun(f)
10
11
12   if(!is.vector(x) || is.object(x))
13     x <- as.list(x)
14
15   ind <- seq_len(len)
16
17   if(mis) {
18     if(right) {
19       init <- x[[len]]
20       ind <- ind[-len]
21     }
22     else {
23       init <- x[[1L]]
24       ind <- ind[-1L]
25     }
26   }
27
28   if(!accumulate) {
29     if(right) {
30       for(i in rev(ind))
31         init <- forceAndCall(2, f, x[[i]], init)
33     }
34     else {
35       for(i in ind)
36         init <- forceAndCall(2, f, init, x[[i]])
37     }
38     init
39   }
40   else {
..     # to be explained later
..   }
.. }``````

Apart from pretty minor differences, it is the same code as in our implementation but using indexing and looping instead of recursion.

Let's see:

• Line 7 is just the code that handles what happens if the given object is empty, the empty list or an empty vector, typically. It returns `NULL` if no initial value is given; otherwise, returns the initial value.

• Lines 12-13 introduce one new thing only present in the official R version. Our version assumes a list as input. In R, a vector is expected, These lines ensure that if the value is not a vector or is an object in general it is converted to a type that the following code can deal with, a `list`.

• Lines 17-26 initialize variables `init` and `ind` accordingly to the kind of fold (left or right) requested if no initial value were provided. `init` is set exactly as in our version. Regarding `ind`, instead of working on the rest of the list directly as in our recursive implementation, indices are initialized appropriately, `ind[-1L]` produces the next indices to the index of the first element in order to process the rest of list for left folding. On the other hand `ind[-len]` produces the indices for right folding: those of all elements save the last one (that has been used as `init`). These indices will be reverted in line 30 to process the remaining list from right to left as explained in the previous post.

• Finally [lines 29-37], the function `f` is iteratively called with the proper
order of parameters for each kind of fold, and the result of folding returned in line 38.

The rest of `Reduce` is the code handling the case were the caller asks for a sequence of partial results rather than the final reduction. We'll look at it in the next post.

## sábado, 6 de agosto de 2016

### Reading R Code. The function Reduce in R (part I)

[This post is the second one about the function `Reduce`. It assumes everything discussed in the first post about Reduce]

The function `Reduce` in R is based on the Lisp function `reduce`. It also adds an option borrowed from Haskell's `scan` functions.

Let's see this by exploring a few simple examples and observing their behavior in R as well as in Lisp and Haskell.

### What does `Reduce` do?

``````## Example 1.
# R
> Reduce(`+`, 1:4)
[1] 10

# Lisp
[1]> (reduce #'+ '(1 2 3 4))
10

Prelude> foldl (+) 1 [2, 3, 4]
10``````

Look at the Haskell function. I have used `foldl`, that is the `fold-left` function in Haskell, because R and Lisp do `fold-left` by default. Note also that while in Haskell the initial value has to be specified, in R and in Lisp we don't need to do so. When not specified the initial value for left-folding is the first element in the given list.

The evaluation of all these functions (`fold-left`) expands as follows for the given example:

``````(((1 + 2) + 3) + 4)

## Example 2.
# R
> Reduce(`-`, 1:4)
[1] -8

# Lisp
[1]> (reduce #'- '(1 2 3 4))
-8

Prelude> foldl (-) 1 [2, 3, 4]
-8``````

The expansion is the same:

``````(((1 - 2) - 3) - 4)

## Example 3.
# R
> Reduce(`-`, 1:4, right = TRUE)
[1] -2

# Lisp
[1]> (reduce #'- '(1 2 3 4) :from-end t)
-2

Prelude> foldr (-) 4 [1, 2, 3]
-2``````

The evaluation expands to this now:

``(1 - (2 - (3 - 4)))``

This is `fold-right`. In R and Lisp we have to set an option for this fold, and at this point it should be clear why these two functions are described as "left" and "right" folds. Note also that, for `fold-right` the initial value, if given (as in Haskell) become the right-most item in the expansion. In fact, in R and Lisp if the initial value is not given, it is the last element of the given list.

R provides an extra feature borrowed from Haskell for producing a sequence of results of accumulated values rather than the final reduction.

In Haskell, there is a special set of functions for this purpose, `scan` functions, `scanr` corresponds to `foldr` and `scanl` to `foldl`:

``````Prelude> scanl (-) 1 [2, 3, 4]
[1,-1,-4,-8]``````

Observe the partial results in the computations above:

``````initial value:        =  1
1st result   :  1 - 2 = -1
2nd result   : -1 - 3 = -4
3rd result   : -4 - 4 = -8``````

Or, for a right fold:

``````Prelude> scanr (-) 4 [1, 2, 3]
[-2,3,-1,4]``````

where partial results from right to left and in reverse order are:

``````initial value:          =  4
1st result   : 3 - 4    = -1
2nd result   : 2 - (-1) =  3
3rd result   : 1 - 3    = -2``````

The R equivalents need the `accumulate` parameter set to `TRUE`:

``````Reduce("-", 1:4, accumulate = TRUE)
> [1]  1 -1 -4 -8``````

and

``````Reduce("-", 1:4, right = TRUE, accumulate = TRUE)
> [1] -2  3 -1  4``````

These functions are just variants of the tail-recursive pattern where a second accumulator is added to keep track of computed values so far.

So for instance `scanl` can be implemented as follows:

`````` (define (scan-left f i lox0)
(local [(define (f-for-lox-acc acc rsf lox)
(cond [(empty? lox) (reverse (cons acc rsf))]
[else
(f-for-lox-acc (f acc (first lox))
(cons acc rsf)
(rest lox))]))]
(f-for-lox-acc i empty lox0)))``````

Only and extra accumulator `rsf` has been added. It maintains a list of resulting values so far. Each time a new computed value is produced is inserted into the `rsf` accumulator. The result is finally reversed. It is possible to avoid the final `reverse` if computed values are appended to the end of `rsf`. Anyway it is clear the idea behind this.

### R implementation of `Reduce`

As we have seen, `Reduce` in R is a multipurpose function. Other languages as Haskell prefer independent elementary functions for each task (right/left fold and returning accumulated values or not). I find the latter more elegant, easier to read and easier to test. R, though, is fond of functions with lots of parameters, and full of conditionals to select the code to execute for a specific purpose.

Even so, and with the goal of firstly implementing our own version for better understanding the R official version, it is reasonable to begin with basic use cases and add more after those have been implemented.

One of the simplest use case is to execute `Reduce` for left fold with a given initial value. As usual, we need a very basic set of tests. (more should be added for good coverage). The following are hopefully enough at this stage:

``````test_that("test reduce: fold-left, init given", {
expect_equal(my_reduce("+", integer(0), init = 1), 1)
expect_equal(my_reduce("-", 2:4, init = 1), -8)
expect_equal(my_reduce("/", c(2, 9, 13), init = 7), 7/234)
expect_equal(my_reduce(list, 2:4, init = 1), (list(list(list(1, 2), 3), 4)))
})``````

The most natural implementation at this point is the translation into R of the `fold-left` tail recursive procedure (first variant), as this:

``````my_reduce <-
function(f, x, init) {
f <- match.fun(f)

iter <- function(acc, x) {
len <- length(x)
if (!len) {
acc
}
else {
first <- x[[1L]]
rest <- tail(x, -1L)

iter(f(acc, first), rest)
}
}

iter(init, x)
}``````

Assuming that test cases are saved into the file `test_my_reduce.R`, running tests with:

``````> library(testthat)
> test_file("test_my_reduce.R")``````

confirms that this implementation works.

A few points about this code. The first line applies `match.fun` as customary in R code to check the validity of the argument passed as `f`. I have used `1L` instead of just `1` for indexing. This is a common practice in R base code: when an integer is required the number as a literal integer (the number followed by `L`) is passed. Finally, I have intentionally named the auxiliary helper as `iter` for reasons that will be clear. Any name would have worked though.

The next step is handling the case where no initial value is passed. Some test cases for this:

``````test_that("test reduce: fold-left, init missing", {
expect_equal(my_reduce("+", integer(0)), NULL)
expect_equal(my_reduce("-", 1:4), -8)
expect_equal(my_reduce("/", c(7, 2, 9, 13)), 7/234)
expect_equal(my_reduce(list, 1:4), (list(list(list(1, 2), 3), 4)))
})``````

The first test case is for the empty vector. As no initial value is given, we cannot return it as before. By convention we return `NULL` in this case.

And this is an obvious implementation that passes all tests above [only the first lines are added, the `iter` function remains the same]:

``````my_reduce <-
function(f, x, init) {
f <- match.fun(f)
mis <- missing(init)
len <- length(x)

if (mis && !len) {
return(NULL)
}

if (mis) {
init <- x[[1L]]
x <- tail(x, -1L)
}

iter <- function(acc, x) {
len <- length(x)
if (!len) {
acc
}
else {
first <- x[[1L]]
rest <- tail(x, -1L)

iter(f(acc, first), rest)
}
}

iter(init, x)
}``````

What about the `fold-right` operation?

Let's write some tests for fold-right with and without the initial value given:

``````test_that("test reduce: fold-right, init given", {
expect_equal(my_reduce("+", integer(0), init = 1, right = TRUE), 1)
expect_equal(my_reduce("-", 1:3, init = 4, right = TRUE), -2)
expect_equal(my_reduce("/", c(7, 2, 9), init = 13, right = TRUE), 63/26)
expect_equal(my_reduce(list, 1:3, init = 4, right = TRUE),
list(1, list(2, list(3, 4))))
})

test_that("test reduce: fold-right, init missing", {
expect_equal(my_reduce("+", integer(0), right = TRUE), NULL)
expect_equal(my_reduce("-", 1:4, right = TRUE), -2)
expect_equal(my_reduce("/", c(7, 2, 9, 13), right = TRUE), 63/26)
expect_equal(my_reduce(list, 1:4, right = TRUE),
list(1, list(2, list(3, 4))))
})``````

As for the implementation it looks like it would be a very different piece of code if we were to translate the natural recursive procedure examined in the post referred above. However, when designing functions that handle several possibilities we should seek after maximizing the amount of code shared by each of them. The best possible scenario for the function at hand would be one in which the tail-recursive pattern employed for `fold-left` is shared by the `fold-right` operation. It turns out that we already have a promising candidate, the second variant of `fold-left`. Remember that it has the same signature as `fold-right`, and it shouldn't be difficult to adapt it so that it serves as a tail-recursive version for `fold-right`. A closer examination into the expansions of both procedures gives the answer. Let's recall them once again:

``````fold-right   (-) 1 '(2 3 4) : 2 - (3 - (4 - 1))
fold-left-2  (-) 1 '(2 3 4) : 4 - (3 - (2 - 1)) ``````

The only difference lies in the order of elements in the given list. The function we are searching for (a tail-recursive `fold-right`) should have the same expansion as `fold-right`:

``fold-right-2 (-) ??         : 2 - (3 - (4 - 1))``

Which arguments should it receive? Obviously:

``fold-right-2 (-) 1 '(4 3 2)``

Hence, `fold-right-2` is just `fold-left-2` with the input list reversed, and, as you surely remember, the only remaining difference between our `fold-left`, the one employed in R, `fold-left-1` and `fold-left-2` is the order of arguments passed to `f`.

All of these points are captured by the following implementation [number of lines included]:

``````1  my_reduce <-
2  function(f, x, init, right = FALSE) {
3    iter <- function(acc, x) {
4      len <- length(x)
5      if (!len) {
6        acc
7      }
8      else {
9        first <- x[[1L]]
10       rest <- tail(x, -1L)
11
12       if (right) iter(f(first, acc), rest) else iter(f(acc, first), rest)
13     }
14   }
15
16   f <- match.fun(f)
17   mis <- missing(init)
18   len <- length(x)
19
20   if (mis && !len) {
21     return(NULL)
22   }
23
24   if (mis) {
25     if (right) {
26       init <- x[[len]]
28     }
29     else {
30       init <- x[[1L]]
31       x <- tail(x, -1L)
32     }
33   }
34   else {
35     if (right) {
36       x <- rev(x)
37     }
38   }
39
40   iter(init, x)
41 }``````

Changes with respect to the previous implementation are:

1. The nested function `iter` has been modified so that the recursive call passes parameters to `f` in the order suitable to the requested kind of fold [line 12].
2. The handling of options for the arguments `init` and `right` has been appropriately extended [lines 24-38]. Note, in particular, that when right is `TRUE` we need to reverse the sequence, as explained. The R function `rev` does the job [lines 27, 36].

This implementation passes our current test set. Nice!

Do you think that this function is getting long? If so you are with me. As I said before I prefer shorter functions with as few parameters as possible and independent helpers. But R base code is plenty of this kind of monolithic functions rich in parameters and conditional branches.

[To be continued]

## domingo, 31 de julio de 2016

### Reading R Code. The function Reduce. Folds

Among the essential higher-order functions in functional programming (we have already seen `map` and `filter`) `reduce` (aka. `fold`) is probably the most amazing and powerful. It is the Swiss-Army-knife in functional programmers' hands, sort of. I highly recommend this excellent exposition, about the power and flexibility of `fold`:

http://www.cs.nott.ac.uk/~pszgmh/fold.pdf

### What is `fold`?

Although `fold` deserves and probably requires several blog posts, I'll try an elementary presentation suitable to our purposes.

`fold` is essentially the functional abstraction that corresponds to the very structure of lists.

A list of whatever kind of values, say, type `X`, is either the empty list or a list constructed from an `X` and a list of `X`s, where by construction I mean the operation of `cons`ing `X` and a list of `X`s. For instance (in Lisp):

``````[1]> (cons 1 (list 2 3))
(1 2 3)``````

or in Haskell, where `:` is the Haskell equivalent to the Lispier `cons`:

``````Prelude> 1 : [2, 3]
[1,2,3]``````

Note the self-referential nature of the type description for lists. Any function over lists relies on this self-referential structure, and it will have a natural form, where the recursive call emanates naturally from the type description. So the sketch or template of a function for list of `X` looks like this [written now in Racket]:

``````(define (f-for-lox lox)
(cond [(empty? lox) (...)]
[else
(... (first lox)
(f-for-lox (rest lox)))]))
``````

On the other hand, and by stack space efficiency reasons, functions over lists can be written in such a way that the recursive call is in tail position (in a position that ensures that is the last call in the procedure). Observe that `f-for-lox`, the recursive call in the template above, is not in tail position; instead, the function ending up in place of the last three dots in that template will be in tail position.

I heartily recommend to watch these excellent videos, on which this exposition is mostly based, for a better understanding:

Such functions where the recursive call is in tail positions are known as tail-recursive functions and normally written with the aid of a local function that supplies an accumulator. A tail-recursive template for lists has this form:

``````(define (f-for-lox lox0)
(local [(define (f-for-lox-acc acc lox)
(cond [(empty? lox) (... acc)]
[else
(f-for-lox-acc (... acc (first lox))
(rest lox))]))]
(f-for-lox-acc ... lox0)))``````

Or this one, if we change the order of components in the recursive call:

``````(define (f-for-lox lox0)
(local [(define (f-for-lox-acc acc lox)
(cond [(empty? lox) (... acc)]
[else
(f-for-lox-acc (... (first lox) acc)
(rest lox))]))]
(f-for-lox-acc ... lox0)))``````

`fold` is the function that capture these two abstractions. Accordingly, there are two possible `fold` functions (the last one with two variants), `fold-right` that captures the structural recursive procedure, and `fold-left` that captures the tail-recursive one.

Let's see this. If we fill those templates with more meaningful placeholders, `i` standing for the initial value, and `f` standing for a function to combine the contribution of the first element and the contribution of the recursion, and we add them as parameters to the functions, we have this function for the natural recursive template:

``````(define (fold-right f i lox)
(cond [(empty? lox) i]
[else
(f (first lox)
(f-for-lox (rest lox)))]))``````

and this two variants for the two tail-recursive templates:

``````(define (fold-left-1 f i lox0)
(local [(define (f-for-lox-acc acc lox)
(cond [(empty? lox) acc]
[else
(f-for-lox-acc (f acc (first lox))
(rest lox))]))]
(f-for-lox-acc i lox0)))

(define (fold-left-2 f i lox0)
(local [(define (f-for-lox-acc acc lox)
(cond [(empty? lox) acc]
[else
(f-for-lox-acc (f (first lox) acc)
(rest lox))]))]
(f-for-lox-acc i lox0)))``````

that are precisely `fold-right` and `fold-left`, with two variants, respectively.

A careful analysis of those functions enable us to infer their respective signatures.

``````;; fold-right                :: (X Y -> Y) Y (listof X) -> Y
;; fold-left-1 (1st variant) :: (Y X -> Y) Y (listof X) -> Y
;; fold-left-2 (2nd variant) :: (X Y -> Y) Y (listof X) -> Y``````

Note, in particular, that `fold-right` and the second variant of `fold-left` share the same signature, while the signature for the first variant of `fold-left` differs due to the different order of arguments in the call to `f`.

`fold-right` is basically the same in all functional programming languages, while different languages pick the first version of `fold-left` (Lisp, Haskell, OCaml, ...) or the second one (SML, Racket, ...) for its implementation of this function.

The difference between the two variants of `fold-left` have an impact in many cases. Just choose as `f` a function for which the order of operands matters (a non-commutative function). For instance if `f` is `+` the order doesn't matter but if it is `-` it absolutely matters. Let's consider this for `-` in all the languages we have mentioned:

``````# Lisp [reduce is fold-left by default]
[1]> (reduce #'- '(1 2 3 4) :initial-value 0)
-10

Prelude> foldl (-) 0 [1, 2, 3, 4]
-10

# OCaml
# List.fold_left (-) 0 [1; 2; 3; 4] ;;
- : int = -10``````

While:

``````# Racket
> (foldl - 0 '(1 2 3 4))
2

# SML
- List.foldl op- 0 [1, 2, 3, 4] ;
val it = 2 : int``````

[The R's `Reduce` will be the subject of the next post.]

## jueves, 28 de julio de 2016

### Reading R code. The functions Map and mapply

In this post I'm going to consider the implementation of the function `Map` in R. As in the previous post in the series, I first introduce the typical `map` in functional languages. Then I go into the R's `Map` implementation. Along the way I explore some other R functions to the extent needed for a basic understanding of the implementation.

### What is `map`

`map` applies a function to each element in a given list to get a new list with the result of the application. For instance:

``map sqrt [1, 2, 3]``

produces a list of the square roots of each number:

``[1.0,1.4142135623730951,1.7320508075688772]``

In general, `map` takes a function and a list of elements of some type, say `a`, and produces a list of elements of some other type, say `b`, where `a` and `b` are the types of, respectively, the input and output values of the function that `map` takes. The signature for the example above could be written as:

``(Num -> Double) [Num] -> [Double]``

meaning that the first argument, `sqrt`, takes a number (in general) and produces a double-precision floating point number; the second argument is a list of numbers, and the result a list of `Double`s.

So in general `map` has this signature:

``(a -> b) [a] -> [b]``

As with `filter` a natural implementation of `map` is a math-like recursive definition:

``````map f []     = []
map f (x:xs) = f x : map f xs``````

Or, in Racket:

``````(define (map f lox)
(cond [(empty? lox) empty]
[else
(cons (f (first lox))
(map f (rest lox)))))``````

This is the basic usage. The function `map` can usually handle multiple lists too. One of the variants works as follows:

It applies the function to the first element of given lists, then to the second, and so on till the last element of the shortest list is processed, the rest of elements of longer lists are ignored, and the result is the list produced by the successive application. This is a possible example with two lists as input. I use the name `zipWith` to refer to this kind of `map` that can take in two lists:

``zipWith (+) [1, 2, 3] [4, 5, 6, 7]``

produces:

``[5, 7, 9] -- so [1 + 4, 2 + 5, 3 + 6]``

The signature would be:

``zipWith :: (a b -> c) [a] [b] -> [c]``

Generalizations to cope with an arbitrary number of lists are also available in functional languages.

### Map in R

The R `Map` function provides, as `?Map` states, a generalization of the sort described:

‘Map’ applies a function to the corresponding elements of given vectors.

(Note that like `Filter` the consumed objects are R vectors.)

However, unlike the generalized `map` mentioned, `Map` doesn't discard remaining elements of longer lists. Instead, it uses recycling:

‘Map’ ... is similar to Common Lisp's ‘mapcar’ (with arguments being recycled, however)

Common-Lisp `mapcar` is a function that behaves as explained above (the Haskell's `zipWith` example). Indeed, executing the corresponding Lisp code on the `clisp` interpreter gives the same result:

``````[1]> (mapcar #'+ '(1 2 3) '(4 5 6 7))
(5 7 9)``````

while in R recycling is at work, though:

``````> Map(`+`, 1:3, 4:7)
[[1]]
[1] 5

[[2]]
[1] 7

[[3]]
[1] 9

[[4]]
[1] 8``````

As you can see, when the shortest list is exhausted, R recycles it as needed: the last element is the result of 1 + 7, where 1 comes here from recycling the shorter list. Note also that the result is not an atomic vector as one might expect but a list. This is also documented:

‘Map’ ... does not attempt to simplify the result ... Future versions may allow some control of the result type.

As for the implementation `?Map` also tells us what it is:

‘Map’ is a simple wrapper to ‘mapply’.

It is not uncommon in R to find traces of implementation details in its docs.

The actual implementation expresses in code all of these points:

``````function (f, ...)
{
f <- match.fun(f)
mapply(FUN = f, ..., SIMPLIFY = FALSE)
}``````

Let's have a closer look into the definition.

The function takes a first argument `f`, the mapping function to be applied, and an undetermined number of extra arguments. The three dots construct allows to catch those extra arguments and pass them on later to `mapply`. Vectors on which `f` will be applied are among the arguments that the caller will pass; eventually, more arguments might be passed by the caller, arguments that `mapply` can receive.

In the first line, the function passed as first argument is extracted and saved into a local variable `f`, or discarded if it cannot be interpreted in any way as a function, that's what `match.fun` basically does.

### The function `mapply`

Once this first argument has been checked, it is passed as the `FUN` argument to `mapply` that in turn calls the underlying C code to carry out the computation. Additionally, `Map` sets the `SIMPLIFY` argument of `mapply` to `FALSE` to avoid the simplification, as documented.

We can see this better if we take a look at the implementation of `mapply`:

``````function(FUN,..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
{
FUN <- match.fun(FUN)
dots <- list(...)

if (USE.NAMES && length(dots)) {
if (is.null(names1 <- names(dots[[1L]])) && is.character(dots[[1L]]))
else if (!is.null(names1))
}
simplify2array(answer, higher = (SIMPLIFY == "array"))
}``````

Note that `answer`, the returned value, is the result produced by the internal function, written in C, which is invoked via the call to `.Internal`. We come to this construct over and over while reading R base functions. Many R base functions just wrap the call to an underlying function, eventually preparing things for it, and/or adapting the result of it according to the arguments passed in the first place.

`Map` sets `mapply`'s `SIMPLIFY` to `FALSE` so that the simplification that `mapply` could do otherwise will never be executed.

However, despite the fact that the intended arguments for the three dots are just the vectors, as `?Map` documents, nothing prevents us from passing other possible `mapply` arguments, `USE.NAMES` and `MoreArgs`.

`MoreArgs` allows for passing extra arguments to the `f` function. For instance, to get a list of vectors where 42 is repeated from 1 to 4 times, we can use `MoreArgs` in this way [example borrowed from the `mapply` doc]:

``````> Map(rep, times = 1:4, MoreArgs = list(x = 42))
[[1]]
[1] 42

[[2]]
[1] 42 42

[[3]]
[1] 42 42 42

[[4]]
[1] 42 42 42 42``````

As R allows us to pass extra arguments just by naming them after the function,

``> Map(rep, times = 1:4, x = 42)``

`MoreArgs` seems to be of little use in this regard.

Furthermore, one can always resort to the commonly-used idiom in functional programming: supplying an anonymous function in place of `f`:

``> Map(function(x) rep(42, times = x), 1:4)``

The other extra option that we may pass to `Map` is `USE.NAMES`. `mapply` sets it by default to `TRUE`. In such a setting, and if more arguments apart form the initial function `f` are passed (`length(dots)` is not 0) this code in `mapply` will be executed:

``````if (USE.NAMES && length(dots)) {
if (is.null(names1 <- names(dots[[1L]])) && is.character(dots[[1L]]))
else if (!is.null(names1))
}``````

Two cases are especially handled:

• The second argument passed to `mapply` (`dots[[1L]]`) doesn't have names but it is of character type. (Recall that the first argument is always the function `f`).
• The second argument passed to `mapply` has names.

In the first case, the character object is used to set the names of the result of `mapply`, as in this example taken from `?mapply`:

``> mapply(function(C, k) paste0(rep.int(C, k)), LETTERS[1:6], 6:1)``

In the second case, the names of the argument become the names of the result. For instance [again from `?mapply`]:

``````> mapply(function(x, y) seq_len(x) + y,
> +             c(a =  1, b = 2, c = 3),  # names from first
> +             c(A = 10, B = 0, C = -10))``````

This is simply put on the doc:

use names if the first ... argument has names, or if it is a character vector, use that character vector as the names.

If we want no names handling in `Map` we can just set `USE.NAMES` to `FALSE` overriding the default behavior of `mapply`, or viewed from the implementation, falling back to the default behavior of the internal C code.

A couple of points about R idioms that we see in the `mapply`:

Assignments can be sub-expressions. Here:

``is.null(names1 <- names(dots[[1L]]))``

The assignment expression `names1 <- names(dots[[1L]])` evaluates to the value of `names1`, once assigned to. The value is passed to `is.null` as a part of the `if` condition. The value of `names1` is later re-used in the last branch.

Secondly, in a logical context, the number 0 evaluates to `FALSE`, any other number evaluates to `TRUE`. This allows for concise conditions as the previous one:

``if (USE.NAMES && length(dots))``

The second part of the relational expression check whether `dots` is empty or not. Hence, there is no need to write `length(dots) != 0`. This is a common idiom in many languages supporting boolean evaluation of different classes of objects.

To complete the basic understanding of these functions I should inspect more closely `match.fun` and `simplify2array` (the function responsible of simplifying the result). `match.fun` appears frequently in R code and I pin it on top of my task list, `simplify2array` is a helper function currently used only by `mapply` and `sapply`. We can live without studying it for the time being.

## martes, 26 de julio de 2016

### Reading R code. The function Filter

R is functional language, however not a pure one like Haskell, R is fond of the functional paradigm. As John Chambers has recently remind us, in R everything that happens is a function call [John Chambers, Extending R, CRC Press]. R functions are first-class citizens and the functional programming style is prominent.

`filter`, `map`, and `reduce` (aka. `fold`) are among the most wanted functions in the toolset of every functional programmer. R provides them too under the names `Filter`, `Map`, `Reduce`. Even though the apply family and vectorization are preferred, R is kind enough to give these functions to seasoned functional programmers. It couldn't have been otherwise.

The source code of base R, currently at `/src/library/base/R` contains a file with the implementation of these functions among others, `funprog.R`.

In my exploration of R source code I have to begin with something. `Filter` may be a good candidate. It is brief, easier to read, and a sensible choice for a fan of functional programming as I am.

### What is filter?

`filter`, as its name suggests, serves the purpose of filtering elements in a list according to certain given function, called predicate, so that elements in this list for which the predicate holds are selected and the rest discarded, where predicate is the technical term for any function that produces true or false (a boolean value).

Some examples will make this clear.

Let's suppose we have this list of numbers `[1, 2, 3, 4, 5]`, and a function `odd` that takes a number and determines whether it is odd or not. Selecting all odd numbers in `[1, 2, 3, 4, 5]` is a matter of filtering them by their oddity

``filter odd [1, 2, 3, 4, 5]``

will produce `[1, 3, 5]`

Another example. We want to select words in `["hi", "world", "bye", "hell"]` whose first letter is 'h'.

``filter start_with_h ["hi", "world", "bye", "hell"]``

will produce the desired list `["hi", "hell"]`

In general, `filter` takes a predicate and a list of elements of some type, say type `a`, and produces a list of elements of the same type. Formally expressed:

``filter :: (a -> Bool) [a] -> [a]``

where `[a]` stands for list of elements of type `a`, and the arrow stands for a function whose arguments are to the left and the result to the right of the arrow. Note that the predicate, the first argument, is also a function that consumes values of type `a` and produces a boolean.

This type description is called the signature of the function. In some functional programming languages it is customary to apply the so called currying operation that translates the evaluation of a function taking multiple arguments into the evaluation of a sequence of functions, with a single argument each. Under currying, the signature of filter would be:

``filter :: (a -> Bool) -> [a] -> [a]``

Although this is a mathematically more appealing description, I'll stick here and in what follows to the first uncurried form.

What about implementing `filter`? A natural implementation of `filter` would take the form of a mathematical definition.

Mathematics is plenty of inductive or recursive definitions. Do you recall factorial? In Math, it is defined as a two-part function:

``````factorial 0 = 1
factorial n = (factorial n - 1) * n``````

In words, the factorial of 0 is equal to 1, the factorial of n is equal to the factorial of n - 1 times n.

Similarly, a math-like definition for `filter` could be expressed as a two-part function with two branches for the second part:

• The filter of a predicate and an empty list is the empty list.
• The filter of a predicate and a non-empty list is
• either (if the predicate holds for the first element of the given list) the list consisting of that first element and the result of the filter of the predicate and the rest of the given list,
• or (otherwise) the filter of the predicate and the rest of the given list.

This wording is exact but verbose. Formalizing it a bit turns out to be easier to read. What follows is the actual Haskell implementation of `filter`, that hopefully is almost self-explanatory. Actually the signature before is also the real Haskell signature.

``````filter p [] = []
filter p (x:xs) | p x       = x : filter p xs
| otherwise = filter p xs``````

Another possible syntax, now written in friendly Racket, that you may find even more readable, looks as follows [signature included as an initial comment]:

``````;; (X -> Boolean) (listof X) -> (listof X)
(define (filter p lox)
(cond [(empty? lox) empty]
[else
(if (p (first lox))
(cons p (filter p (rest lox)))
(filter p (rest lox)))))``````

### Filter in R

What about the R implementation? Here it is:

``````Filter <-
function(f, x)
{
ind <- as.logical(unlist(lapply(x, f)))
x[which(ind)]
}``````

It looks quite different. Obviously, it is not recursive. This is not surprising, recursion is the main technique in pure functional languages that are well prepared to handle recursive implementations without performance penalties.

Having noted that, let's explore this code in detail.

The very first thing we always have to do when studying a function definition is to be sure about the value(s) that the function consumes and the value that the function produces, the signature of the function. By the way, as you may know, some languages check signatures at compile time, like Haskell, others don't, like R or the Racket version above [though Racket has variants for type checking]. Whatever the case the signature is critical for programmers and for users, since it tells what kind of values are involved.

The documentation for `Filter` reveals the assumed signature. Omitting for the moment some nuances and summarizing the 'Arguments' and 'Details' section , `?Filter` says this:

‘Filter’ applies the unary predicate function ‘f’ to each element of ‘x’ [which is a vector] ..., and returns the subset of ‘x’ for which this gives true.

Recall the header of `Filter`

``function(f, x)``

The doc says that `f`, the first argument, is a unary predicate function, meaning a function that takes a single argument (unary) [of any type, say `a` as before], and produces a boolean (`TRUE` or `FALSE`). The signature of the predicate is therefore:

``(a -> Bool)``

The second argument, `x`, is in turn a vector, instead of a list. A vector in R can be an atomic vector, a list, or an expression. The main R data type to represent arbitrarily large collections of objects, possibly nested, is vector and in this regard is a natural choice to represent what in functional languages is typically represented by lists.

Let us symbolize, just by convention, vector of some type with `[a]`, that's the type of the second argument of `Filter`.

The result of `Filter` is in turn a subset of `x`, hence again a vector of type `[a]`.

Therefore the whole signature could be specified as follows:

``Filter :: (a -> Bool) [a] -> [a]``

As it's easy to see, this is the same signature of `filter` we figure out before, just that `[a]` stands for list in the former and for vector (atomic or not) in the latter.

Let's go on with the implementation. In order to understand an implementation I find really helpful trying first to design our own version.

Before starting off with the implementation itself we should write examples of the function usage, at least as many as the function signature requires, and wrap them as test cases. Examples will guide the implementation, and at the same time, wrapped as test cases, provide for the indispensable testing workhorse. For the latter I will use `testthat` as explained in the final part of the first post in this series: https://los-pajaros-de-hogano.blogspot.com.es/2016/07/reading-r-code-introduction.html

So save these examples into the file `test_my_filter.R`:

``````source("my_filter.R")

# Some predicates for testing filter
odd <- function(x) { x %% 2 != 0 }
starts_with_h <- function(x) { any(grepl(pattern = "^h.*", x = x)) }
numeric_expression <- function(x) { is.numeric(eval(x)) }

test_that("test my_filter", {
expect_equal(my_filter(odd, integer(0)), integer(0))
expect_equal(my_filter(is.atomic, list()), list())
expect_equal(my_filter(is.expression, expression()), expression())
expect_equal(my_filter(odd, 1:5), c(1, 3, 5))
expect_equal(my_filter(starts_with_h, c("hi", "world", "bye", "hell")),
c("hi", "hell"))
expect_equal(my_filter(is.atomic, list(1, list(3, 4), 5)), list(1, 5))
expect_equal(my_filter(numeric_expression, expression(c, "a", 1, 3)),
expression(1, 3))
})
``````

A quick glance at `Filter` gives us a first hint as to the way we could follow in the implementation. The last line is a typical subsetting operation. We could try to use this idea to delineate our code.

You may figure it out after a bit of thinking. Let's begin with one of our examples:

``my_filter(odd, 1:5)``

Well, I have something like the vector `c(1, 2, 3, 4, 5)` and I want to filter odd elements in it. I also have a predicate `odd` that allows me to determine whether each element there is odd or not.

How should `my_filter` be implemented to produce the expected `c(1, 3, 5)`?

It could apply `odd` to each element in `c(1, 2, 3, 4, 5)` to get this: `c(TRUE, FALSE, TRUE, FALSE, TRUE)`, and then, in a typical R way, use the logical vector for indexing `c(1, 2, 3, 4, 5)`:

``c(1, 2, 3, 4, 5)[c(TRUE, FALSE, TRUE, FALSE, TRUE)]``

I can sketch a general version of this idea. Instead of `odd` and `1:5` I generalize to whatever predicate, `f`, and whatever vector, `x`. Also, and just for readability, I create a local variable, `ind`, to name the logical vector.

``````my_filter <-
function(f, x) {
ind <- # apply f to each element in x
x[ind]
}``````

Only one thing remains to be filled here, the code that applies `f` to each element in `x`.

If you have an imperative programming background you will come up with some kind of loop to process each element in the vector and check whether the predicate holds for it. Something along these lines:

``````apply_f <-
function(f, x) {
fx <- vector("logical")
for (i in seq_along(x)) {
fx <- c(fx, f(x[[i]]))
}
fx
}``````

Let's add this helper function to `my_filter` as a nested (local) function

``````my_filter <-
function(f, x) {
apply_f <-
function() {
fx <- vector("logical")
for (i in seq_along(x)) {
fx <- c(fx, f(x[[i]]))
}
fx
}

ind <- apply_f()
x[ind]
}``````

Wait a minute! Shouldn't `apply_f` be tested before doing anything else? and definitely before converting it into a local (and as such untestable) function? This is a great question. Let's claim that it is not needed because it is a too easy function and we trust it will work without further testing. In a moment we will return to this and see the consequences of this assumption.

It's time to run tests:

``````> library(testthat)
> test_file("test_my_filter.R")``````

Great! All tests passed.

However, instead of explicit loops you may know that idiomatic R usually favors the apply family of functions.

The immediate candidate for this task seems to be `sapply`, that is typically used when we want a vector as a result of applying the function to each element in a given vector `x`. `apply_f` could be written with `sapply` as follows:

``````apply_f <-
function(f, x) {
sapply(x, f)
}``````

Being so simple, it doesn't make sense to wrap the `sapply` call in a function. So our more idiomatic and concise filter implementation would be:

``````my_filter <-
function(f, x) {
ind <- sapply(x, f)
x[ind]
}``````

Run tests ... and, oops! one test fails! The case for the empty vector as input.

Looking into `?sapply` more carefully, we see that our initial expectation was based in a partial understanding of `sapply`:

Simplification in ‘sapply’ is only attempted if ‘X’ has length greater than zero ...

It turns out that `sapply` does not always produce a vector when a vector is given. In other words, our current implementation breaks the signature of `my_filter`.

We have different alternatives to fix the bug. One would be to use a conditional with one branch for the empty vector and another one for the rest of cases. Another option is to use `lapply`, that produces a list instead of a vector, and unlist the result to get the vector we need. This second strategy has an advantage. Core R code shouldn't use user friendly wrappers like `sapply` but primitive functions. It is not only a matter of style, it is usually a matter of performance (core functions are more efficient), and it is above all a matter of reliability. User wrappers are for helping users write quickly, mainly in an interactive setting, but they may change or even become defunct in future releases. In general, we shouldn't trust user wrapper functions when we want to write enduring code.

With `lapply` and `unlist` we get this solution:

``````my_filter <-
function(f, x) {
ind <- unlist(lapply(x, f))
x[ind]``````

}

and it passes all tests.

If we compare our last version with the official implementation we'll observe that there are still some differences.

Starting from the end, the official implementation prefers `which(ind)` instead of just `ind` for indexing. This is a subtle divergence. One thing it handles differently than our implementation is `NA` values. In our implementation, if a `NA` appears in `x` it will appear also as `NA` in `ind`, and if we subset `x` directly via `ind` the `NA` ends up in the result too. However, `?Filter` has to say something about this (just we omitted it in the first reading to keep things simple):

Note that possible ‘NA’ values are currently always taken as false.

This implies that `NA` values in `x` will be discarded from the result, whatever `f` and `x`, and this is exactly what `which` can nicely achieve. `which` takes a logical object and produces the indices, as integers, of `TRUE` values. In other words, for our case, where a vector is given as input, it has this signature:

``which :: [Bool] -> [Integer]``

For instance,

``````> which(c(TRUE, FALSE, FALSE, TRUE))
[1] 1 4``````

Additionally, `which` omits `NA` values in the logical vector, in other words, it treats them as if they were `FALSE`. For example:

``````> which(c(TRUE, FALSE, FALSE, TRUE, NA))
[1] 1 4``````

and that's precisely the documented behavior that `Filter` should show.

The second difference lies in the first line, the explicit conversion to logical absent from our solution:

``as.logical(unlist(lapply(x, f)))``

Someone could argue that `as.logical` is redundant. After all, the predicate `f` always produces a boolean value and the result of the composition of `lapply` and `unlist` should produce a logical vector. It should do, but it actually doesn't. Passing an empty vector to `unlist(lapply(...))` produces `NULL`:

``````> unlist(lapply(integer(0), is.integer))
NULL``````

Inadvertently our previous implementation worked, and it worked because when `NULL` is an index it is treated as `integer(0)` and indexing an empty vector with `integer(0)` produces an empty vector, logical by default.

``````> vector()[integer(0)]
logical(0)``````

Once we substitute direct indexing by `which(ind)` things stop working, `which(NULL)` would raise error because `which` requires a logical object as argument, and `NULL` is neither a logical value nor converted by `which` to logical. That's the reason for introducing `as.logical`. In general, `as.logical` makes a lot of sense by making sure that the signature of `which` will never be broken.

Let's go back to the moment when we hesitated about why not to test the helper function `apply_f`. Under the new light the question appears pretty reasonable. Had we tested `apply_f`, in other words, had we left it as a non-local function and verified it independently, we would have caught this bug soon.

For instance, we could have written `apply_f` as:

``````apply_f <-
function(f, x) {
unlist(lapply(x, f))
}``````

``````test_that("test apply_f", {
expect_equal(apply_f(is.integer, integer(0)), logical(0))
expect_equal(apply_f(odd, 1:5), c(TRUE, FALSE, TRUE, FALSE, TRUE))
# ... (more tests here)
})``````

Running tests would have uncovered the bug in the very beginning. Once `apply_f` passes all tests we can convert it into a local function, in this case just the call we have seen in the final definition. The morale is that untested code, which we lazily assume being too easy to test is often the source of unexpected bugs and countless wasted hours of debugging. Better to always test everything that is not absolutely trivial.

To finish this exploration and for completeness just add this new test case, which verifies the `NA` discarding behavior, to the `test_my_filter.R` file:

``expect_equal(my_filter(odd, c(1:4, NA)), c(1, 3))``

and confirm that the final implementation with `as.logical`, that is exactly the R version of `Filter`, passes all tests.

## lunes, 25 de julio de 2016

One of the most important things I learned when I started my higher education was that in order to really know deeply about a subject one has to dive into the original sources as soon as possible rather than solely sticking to secondary literature. Better to read Plato's and Kant's works in their original language than countless books about Critiques and Dialogues; better to analyze Mozart's and Beethoven's Sonatas (as Charles Rosen does) than skimming over a lot of general articles on Classical music.

When I was put in charge of developing a web site for my institution I hadn't any programming background at all. After some discouraging attempts to learning from intro books and tutorials, I decided to go to the sources and I studied the entire HTML and CSS specifications directly. It was hard, but everything was clear at last [Remark: nowadays it would be overwhelming. HTML and CSS specs are currently huge].

In learning programming languages the next step to the initial exposure to syntax, concepts and techniques might be well studying core libraries written by main developers.

How could one understand better a programming language than by trying to read the code that its core developers have created?

I'm always surprised by the scarcity of commentaries about exemplary code in any language. There is somewhat like a gap between introductory expositions and the source code itself. The latter is silently digested by the people who develop it or who create libraries on top of it, while beginners live always in the other much narrower side, their little friendly universe of tutorials and simple recipes, kind enough, for sure, but maybe a bit tasteless [though great exceptions can always be found out] when their time, the absolute beginners' time, has passed.

One reason for this lack might be that core libraries are very complex and abstract beasts, and an understanding of parts of it is just hopeless without a firm grasp on the whole architecture and design, something unreachable to anyone else but experts.

This is not always the case, though. At least it is not the case for a good bunch of functions in R base code. Many are written in R itself and they are almost self-contained in the sense that a preliminary comprehension doesn't depend on a complete acquaintance with the abstract underlying architecture.

So I've thought that I can give this idea a try, picking some R functions, and reading them with the aim of understanding R better. The goal is educational, self-educational mainly. Along the way I'll try to make things perhaps easier to others with a bit less programming background than mine hoping that in doing so I not only reinforce my understanding but also help others deepen their own.

I'm not an R expert. This means that while reading and trying to make sense of official implementations I probably will make some more or less educated guesses and I could (and surely will) make mistakes. So if someone more knowledgeable than I am (there should be many) read this, please let me know to fix any error, misleading step, or gratuitous deduction.

The intended audience is people with a basic working understanding of R data structures and programming constructs, including conditionals, loops and function definitions. The only required tool is the R console. And for the moment only one extension package will be used, the package `testthat` for unit testing support. It can be installed as usual via:

``> install.packages(testthat)``

To use `testthat` in a simple way (not the best one for real projects but enough for our purposes now) proceed as follows:

1. Save the function you want to test in a file.
2. Create a new file on the same directory for testing that function. The name of this file should start with 'test'.
3. Source the code of the function to be tested.
4. Write tests.
5. Run tests from the console.

An example of this workflow.

Save the following function in `foo.R`:

``````foo <-
function(x) {
ifelse((identical(x, 1)), "I'm 1", "I'm not 1")
}``````

Create `test_foo.R` with this content:

``````source("foo.R")

test_that("foo is 1 or not 1", {
expect_equal(foo(1), "I'm 1")
expect_equal(foo(0), "I'm not 1")
expect_equal(foo("hi"), "I'm not 1")
})``````

Load `testthat` on your session and run tests from the console:

``````> library(testthat)
> test_file("test_foo.R")``````

Introducing unit testing from the very beginning may seem unnecessary. On the contrary, testing is of paramount importance, and writing first a minimal set of tests guides the implementation. Since we probably will write our own code here and there, it is crucial to be equipped with a tool that enable us to always write those tests. This is just common-place and fundamental practice whatever the programming language.

By the way, for a perfect introduction to programming fundamentals, where programming stand here for "programming well" rather than "just coding", read this book:

http://www.ccs.neu.edu/home/matthias/HtDP2e/

and/or take this course:

https://www.edx.org/xseries/how-code-systematic-program-design

They both are gems that no one should miss.

One last point about the R source code. Apart from interactively getting the code as usual by typing the function name, for instance:

``> which``

and its documentation:

``> ?which``