miércoles, 26 de abril de 2017

The Game of Life implemented in functional style (Racket) and with complex numbers

If you're reading this you probably know what The Game of Life is. If not, visit Wikipedia for more information:

https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life

You may just want to recall the rules governing those aggregates of live cells called polyominoes. They are very simple:

  • a live cell with more than 3 live neighbors (where neighbors of a cell are those cells adjacent to it in every direction) dies (over-population).

  • a live cell with fewer than 2 live neighbors dies (under-population).

  • a live cell with 2 or 3 live neighbors survives to the next generation.

  • a dead cell with exactly 3 live neighbors becomes live (reproduction).

The following diagrams clarify the rules and serve as the preliminary analysis of the problem domain.

And now the code. It is written in a teaching language based on Racket called "Intermediate Student Language with lambda" (ISL+). It can be run in DrRacket (https://download.racket-lang.org/) by setting "Language" accordingly. ISL+ is a purely functional language that additionally allows the use of anonymous functions via lambda expressions. I have also included the Racket module for working with sets (require racket/set). You can, of course, define your own implementation of the set functions used instead; they are easy to define. To support graphics and interactivity two more modules are required: 2htdp/image and 2htdp/universe.

As for the kind of data used to represent the domain information, I have chosen complex numbers, which are well supported by Racket, to represent a cell. The main advantage of complex numbers over, say, a structure of x and y coordinates is performance. A functional implementation without any kind of mutation will typically be slow; resorting to a plain data type should at least improve things in that regard. I don't think that choice compromises readability, though. Besides, the implementation of some functions, in particular neighbors, is really neat with such a data type.
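A quick sketch of why this pays off (not part of the program below): moving a cell one step in any direction is plain complex addition, with no coordinate bookkeeping at all. For example, in the Interactions window:

```racket
; moving the cell 2+3i one step north-east, where north-east
; is represented by the complex number 1+1i
(+ (make-rectangular 2 3) (make-rectangular 1 1))
; evaluates to 3+4i
```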

On the other hand, I have represented polyominoes as sets of cells (actually lists of cells, but treated as immutable sets). That, I hope, produces an elegant and quite understandable implementation. It also allows an arbitrary number of cells in the grid. I'm aware that cleverer algorithms could improve efficiency, but I preferred to write the most natural and readable version first.

One last thing: the animation tends to become sluggish after a certain point, depending on the clock rate specified and the number of cells alive at that moment. To avoid that annoyance from the very beginning, freezing the image of the grid is absolutely indispensable.

If you want to change the number of rows and columns of the grid, modify COLS# and ROWS# as you wish.

To play the game, run the program with (main START) from the Interactions window in DrRacket. Select cells to construct a polyomino by clicking on them on the initial grid. To clear the grid, press 'Escape'; to (re)start the animation, press 'Enter'; to stop the animation, press any key other than 'Enter' or 'Escape'.

(require racket/set)
(require 2htdp/image)
(require 2htdp/universe)

;; ===============================================
;; CONSTANTS

;; -----------------------------------------------
;; Game rules
(define OVER-POPULATION-THRESHOLD 3)
(define UNDER-POPULATION-THRESHOLD 2)
(define PROCREATION-FACTOR 3)

;; -----------------------------------------------
;; Graphics - Dimensions
; number of rows and columns in the grid
(define ROWS# 81)
(define COLS# 81)

; length of each cell side (in pixels)
(define SIDE 10)

; x and y positions of the middle of, roughly, the grid's center tile
(define CTR-X
  (if (odd? COLS#)
      (quotient (* COLS# SIDE) 2)
      (- (quotient (* COLS# SIDE) 2) (quotient SIDE 2))))

(define CTR-Y
  (if (odd? ROWS#)
      (quotient (* ROWS# SIDE) 2)
      (- (quotient (* ROWS# SIDE) 2) (quotient SIDE 2))))

;; -----------------------------------------------
;; Graphics - Images
(define LIVE-CELL (color-frame "transparent"
                               (square (sub1 SIDE) "solid" "red")))
(define DEAD-CELL (square SIDE "outline" "black"))
(define GRID
  (local [(define col-img
            (foldr above
                   empty-image
                   (build-list ROWS# (lambda (_) DEAD-CELL))))]
    (freeze
     (foldr beside
            empty-image
            (build-list COLS# (lambda (_) col-img))))))

;; ===============================================
;; DATA DEFINITIONS

;; -----------------------------------------------
;; Cell is Complex
;; interp. a cell in a game of life, where its real and imaginary
;;         parts represent, respectively, its positions on the x and
;;         y axis of the complex plane of the game.
;; remark: actual values for those parts are chosen in such a way
;;         that the 0+0i cell in cell patterns is, roughly, at the
;;         center of the pattern, which makes easier to design
;;         drawing functions
(define C0 (make-rectangular 0 0)) ;0+0i

;; -----------------------------------------------
;; Life is (listof Cell)
;; interp. a set of living cells in a game of life
(define I-TROMINO
  (list (make-rectangular 0 1)
        (make-rectangular 0 0)
        (make-rectangular 0 -1)))

(define P-PENTOMINO
  (list (make-rectangular 0 1)
        (make-rectangular 1 1)
        (make-rectangular 0 0)
        (make-rectangular 1 0)
        (make-rectangular 0 -1)))

(define R-PENTOMINO
  (list (make-rectangular 0 1)
        (make-rectangular 1 1)
        (make-rectangular -1 0)
        (make-rectangular 0 0)
        (make-rectangular 0 -1)))

;; -----------------------------------------------
;; Direction is one of:
(define E (make-rectangular 1 0))
(define NE (make-rectangular 1 1))
(define N (make-rectangular 0 1))
(define NW (make-rectangular -1 1))
(define W (make-rectangular -1 0))
(define SW (make-rectangular -1 -1))
(define S (make-rectangular 0 -1))
(define SE (make-rectangular 1 -1))

;; interp. the eight cardinal points

(define DIRECTIONS (list E NE N NW W SW S SE))

;; -----------------------------------------------
(define-struct gol (life running?))
;; GoL is (make-gol Life Boolean)
;; interp. a game of life with its life and a
;;         state, running or stopped
(define STOPPED #f)
(define RUNNING #t)
(define START (make-gol '() STOPPED))

;; ===============================================
;; FUNCTIONS

;; -----------------------------------------------
;; GoL -> GoL
;; start main with (main START)
(define (main gol)
  (big-bang gol
            (on-tick next .5)
            (to-draw render)
            (on-mouse add-cell)
            (on-key handle-key)))

;; -----------------------------------------------
;; GoL -> GoL
;; produce the next gol from the given one
(define (next gol)
  (cond [(gol-running? gol)
         (make-gol (evolve (gol-life gol)) RUNNING)]
        [else gol]))

;; -----------------------------------------------
;; Life -> Life
;; produce the next evolutive step of the given life
(define (evolve l)
  (local [;; Cell -> (listof Cell)
          ;; produce the neighbors of the given cell
          (define (neighbors c)
            (map (lambda (d) (+ c d)) DIRECTIONS))

          ;; Cell -> Natural
          ;; produce the number of living neighbors of the given cell
          (define (alive-neighbors# c)
            (set-count (set-intersect (neighbors c) l)))

          ;; Cell -> Boolean
          ;; determine whether the given cell is going to survive
          (define (survive? c)
            (<= UNDER-POPULATION-THRESHOLD
                (alive-neighbors# c)
                OVER-POPULATION-THRESHOLD))

          ;; Cell -> Boolean
          ;; determine whether the given cell is going to rise
          (define (rise? c)
            (= (alive-neighbors# c) PROCREATION-FACTOR))

          ;; surviving cells in the given life
          (define surviving
            (filter survive? l))

          ;; surrounding cells around the given life
          (define environ
            (set-subtract (foldr set-union '() (map neighbors l)) l))

          ;; cells going to live in the environment of the given life
          (define rising
            (filter rise? environ))]

    (set-union surviving rising)))
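As a quick sanity check (not part of the original program), the classic blinker behaves as expected: the vertical I-TROMINO flips to a horizontal tromino after one step. Since the set operations don't guarantee any particular order, the result is sorted by real part before comparing:

```racket
(check-expect
 (sort (evolve I-TROMINO)
       (lambda (c1 c2) (< (real-part c1) (real-part c2))))
 (list (make-rectangular -1 0)
       (make-rectangular 0 0)
       (make-rectangular 1 0)))
```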

;; -----------------------------------------------
;; GoL -> Image
;; render the given gol life on the grid
(define (render gol)
  (local [;; Cell -> Posn
          ;; produce the position on the grid corresponding
          ;; to the given cell
          (define (cell->pos c)
            (make-posn (+ CTR-X (* (real-part c) SIDE))
                       (- CTR-Y (* (imag-part c) SIDE))))

          ;; Cell Image -> Image
          ;; draw the given cell onto the given image
          (define (draw-cell c img)
            (local [(define p (cell->pos c))]
              (place-image LIVE-CELL (posn-x p) (posn-y p) img)))]
    
    (foldr draw-cell GRID (gol-life gol))))

;; -----------------------------------------------
;; GoL Integer Integer MouseEvent -> GoL
;; add the cell clicked on to the given gol's life
(define (add-cell gol x y me)
  (local [(define (xy->cell x y)
            (local [(define ctr-delta
                      (make-posn (ceiling (sub1 (/ COLS# 2)))
                                 (ceiling (sub1 (/ ROWS# 2)))))]
              (make-rectangular (- (quotient x SIDE)
                                   (posn-x ctr-delta))
                                (+ (- (quotient y SIDE))
                                   (posn-y ctr-delta)))))]
    (cond [(mouse=? me "button-down")
           (make-gol (cons (xy->cell x y) (gol-life gol))
                     (gol-running? gol))]
          [else gol])))

;; -----------------------------------------------
;; GoL KeyEvent -> GoL
;; handle keys as follows:
;; - 'Enter'  - start the animation
;; - 'Escape' - clear all cells and stop the animation
;; - any other key - stop the animation
(define (handle-key gol ke)
  (cond [(key=? ke "\r")
         (make-gol (gol-life gol) RUNNING)]
        [(key=? ke "escape")
         (make-gol '() STOPPED)]
        [else
         (make-gol (gol-life gol) STOPPED)]))

domingo, 5 de febrero de 2017

The unfold function. Gentle introduction from examples

This post is about a higher-order function known to functional programmers under the name unfold or ana (from the word anamorphism). It is less famous than fold but still worthy of consideration and pretty useful.

Most of the expositions I know introducing unfold are based on Haskell or on a Haskell-like language, including some seminal but readable papers.

There is also a brief and illuminating introduction in Hutton, G., Programming in Haskell, Cambridge University Press, 2016.

My intention is to present unfold with a Scheme-like syntax and at a slower pace than in the papers referred to above. In particular, I'll use a domain-specific language from Racket tailored for teaching purposes. It is the language employed in the excellent book Felleisen, M. et al., How to Design Programs. From this book, as well as from the amazing MOOC How To Code: Systematic Program Design, I have also borrowed the same systematic way of presentation.

All the code below can be run on DrRacket, the IDE for our language that is available here:

https://download.racket-lang.org/

Specifically, if you want to run the code, set the DrRacket language to 'Advanced Student Language'.

Additionally, write these two lines at the beginning of the code. They load a couple of libraries I will rely upon:

(require racket/function)
(require racket/list)

Let us consider functions that produce a list from a (seed) value. So functions with this signature:

;; Y -> (listof X)

that is, a function that takes one argument of type Y and produces a list of elements of type X.

[Hopefully, the meaning of those signatures is self-explanatory. For more information look into the first chapter of HtDP/2e or take the How To Code MOOC.

Also, signatures here are just code comments that help us design functions; they are not checked by the compiler.]

A function that resembles that signature is, say, copier, which creates a list of n copies of a string [see HtDP/2e section 9.3]. Note that this function takes an extra parameter, the string to be copied. So it doesn't follow the signature exactly, but I begin with it because it is very easy to understand. We will see later what we can do with the extra parameter.

The signature of copier is

;; Natural String -> (listof String)

The implementation based on the template for Natural:

;; <template for Natural>
(define (fn-for-natural n)
  (cond [(zero? n) (...)]
        [else
         (... (fn-for-natural (sub1 n)))]))

is straightforward:

;; Natural String -> (listof String)
;; produce a list with n copies of s
(check-expect (copier 0 "hi") '())
(check-expect (copier 3 "hi") '("hi" "hi" "hi"))

(define (copier n s)
  (cond [(zero? n) '()]
        [else
         (cons s
               (copier (sub1 n) s))]))

Another typical function that constructs a list from some initial value is string-split, which produces the list of words in a given string.

For convenience we represent a String as a (listof 1String), where a 1String is a String consisting of a single character. [In the definition I'll use the list functions takef and dropf provided by racket/list. These are known as take-while and drop-while in other languages.]

To design this function we rely on the template for lists:

#;
(define (fn-for-lox lox)
  (cond [(empty? lox) (...)]
        [else
         (... (first lox)
              (fn-for-lox (rest lox)))]))

;; (listof 1String) -> (listof (listof 1String))
;; produce the list of words in los
(check-expect (string-split '()) '())
(check-expect (string-split '(" ")) '(()))
(check-expect (string-split '("a")) '(("a")))
(check-expect (string-split '("a" "b" " " "c" " " "d" "e" "f"))
              '(("a" "b") ("c") ("d" "e" "f")))

(define (string-split los)
  (cond [(empty? los) '()]
        [else
         (cons (first-word los)
               (string-split (remove-first-word los)))]))

;; (listof 1String) -> (listof 1String)
;; produce the first word in los
(check-expect (first-word '()) '())
(check-expect (first-word '("a" "b" " ")) '("a" "b"))

(define (first-word los)
  (takef los not-whitespace?))

;; (listof 1String) -> (listof 1String)
;; remove from los its first word together with the whitespace that follows it
(check-expect (remove-first-word '()) '())
(check-expect (remove-first-word '("a" "b" " " "c" "d" " " "e"))
              '("c" "d" " " "e"))

(define (remove-first-word los)
  (trim-leading-whitespaces (dropf los not-whitespace?)))

;; (listof 1String) -> (listof 1String)
;; remove from los its leading whitespaces
(check-expect (trim-leading-whitespaces '()) '())
(check-expect (trim-leading-whitespaces '("a")) '("a"))
(check-expect (trim-leading-whitespaces '(" " " " "a")) '("a"))

(define (trim-leading-whitespaces los)
  (dropf los string-whitespace?))

;; 1String -> Boolean
;; determine whether given s is not a whitespace
(check-expect (not-whitespace? " ") #f)
(check-expect (not-whitespace? "a") #t)

(define not-whitespace? (compose not string-whitespace?))

What about the higher-order function map? It actually produces a list too, this time from a given list and some function over the type of its elements:

;; (X -> Y) (listof X) -> (listof Y)
;; produce the list (list (f x1) (f x2) ...) by applying
;; f to each element of lox
(check-expect (my-map sqr '()) '())
(check-expect (my-map sqr '(1 2 3 4)) '(1 4 9 16))

Again a function over a list that we can define easily:

(define (my-map f lox)
  (cond [(empty? lox) '()]
        [else
         (cons (f (first lox))
               (my-map f (rest lox)))]))

As a final example over a more intricate input type consider the function zip that takes a pair of lists and produces a list of pairs.

A natural recursive implementation of zip is as follows:

;; (listof X) (listof Y) -> (listof (list X Y))
;; produce (list (x1 y1) (x2 y2) ...) from the given lists; if the
;; lists have different lengths, excess elements of the longer list
;; are discarded.
(check-expect (zip0 '() '()) '())
(check-expect (zip0 '(1 2 3) '(3 4 5)) '((1 3) (2 4) (3 5)))
(check-expect (zip0 '(1 2) '(3)) '((1 3)))
(check-expect (zip0 '(1 2) '(3 4 5)) '((1 3) (2 4)))

(define (zip0 lox loy)
  (cond [(or (empty? lox) (empty? loy)) '()]
        [else
         (cons (list (first lox) (first loy))
               (zip0 (rest lox) (rest loy)))]))

For the argument's sake it is more suitable to pass both lists wrapped in a single pair of type (list (listof X) (listof Y)):

;; (list (listof X) (listof Y)) -> (listof (list X Y))
;; produce (list (x1 y1) (x2 y2) ...) from given list pair ...
(check-expect (zip '(() ())) '())
(check-expect (zip '((1 2 3) (3 4 5))) '((1 3) (2 4) (3 5)))
(check-expect (zip '((1 2) (3))) '((1 3)))
(check-expect (zip '((1 2) (3 4 5))) '((1 3) (2 4)))

(define (zip lp)
  (cond [(ormap empty? lp) '()]
        [else
         (cons (map first lp)
               (zip (map rest lp)))]))

;; ---------------------------------

All of those functions share the same pattern.

They all have a predicate that produces the empty list if it is satisfied by the seed value, and a list constructor that applies some functions to the first and the rest of the constructed list.

Therefore we can abstract the pattern from the examples. [More about abstraction from examples in HtDP/2e Sect. 15].

#;
(define (copier ... ... ... n s)
  (cond [(...? n) '()]
        [else
         (cons (... s) ;-> s, what about n?
               (copier (... n) s))]))

#;
(define (string-split ... ... ... los)
  (cond [(...? los) '()]
        [else
         (cons (... los)
               (string-split (... los)))]))

#;
(define (my-map ... ... ... lox)
  (cond [(...? lox) '()]
        [else
         (cons (... lox) ;-> (f ...)
               (my-map (... lox)))]))

#;
(define (zip ... ... ... lp)
  (cond [(...? lp) '()]
        [else
         (cons (... lp)
               (zip (... lp)))]))

Note that in order to preserve the same pattern across all the templates I have made a few tweaks to the templates of a couple of the above functions.

First, the seed value is missing in copier: there is an s where n should appear. Also, there is no function call at the place where one appears in the rest of the templates.

Secondly, I have temporarily removed the f argument from my-map. Instead I have put a comment to remind me that f will, for sure, play some role in the final design.

After filling the gaps we get this:

;; (Y -> Boolean) (Y -> X) (Y -> Y) Y -> (listof X)
(define (unfold p? f g b)
  (cond [(p? b) '()]
        [else
         (cons (f b)
               (unfold p? f g (g b)))]))

This abstract function encapsulates a recursive pattern that is dual to fold. While fold de-structs a list (hence the funny name by which it is also known, cata-morphism), unfold cons-tructs a list (and, likewise, is called ana-morphism).
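For instance, producing a countdown from a given number is a one-liner with unfold. A quick check (not in the original text; identity comes from racket/function, loaded above):

```racket
(check-expect (unfold zero? identity sub1 5)
              (list 5 4 3 2 1))
```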

Now we can re-define the examples with unfold, amending the tweaks made above as required.

The easiest one is string-split:

;; (listof 1String) -> (listof (listof 1String))
;; produce the list of words in los
(check-expect (string-split2 '()) '())
(check-expect (string-split2 '(" ")) '(()))
(check-expect (string-split2 '("a")) '(("a")))
(check-expect (string-split2 '("a" "b" " " "c" " " "d" "e" "f"))
              '(("a" "b") ("c") ("d" "e" "f")))

(define (string-split2 los)
  (unfold empty? first-word remove-first-word los))

As for my-map, the original f argument can be reinstated at the place occupied by the f of unfold, as the initial member of a function composition with first:

;; (X -> Y) (listof X) -> (listof Y)
;; produce the list (list (f x1) (f x2) ...) by applying
;; f to each element of lox
(check-expect (my-map2 sqr '()) '())
(check-expect (my-map2 sqr '(1 2 3 4)) '(1 4 9 16))

(define (my-map2 f lox)
  (unfold empty? (compose f first) rest lox))

Regarding copier, the only real divergence is the extra parameter, s. In other words, the unfold pattern takes a single seed parameter while copier takes two. One way to cope with this is to implement copier as a closure. Besides, there is no actual function call on the first term of the constructed list. In such cases the pattern obliges us to supply some function, typically identity or, as here, const. [(const s) produces a function that returns s whatever argument is passed to it; const is provided by racket/function.]

;; Natural String -> (listof String)
;; produce a list with n copies of s
(check-expect (copier2 0 "hi") '())
(check-expect (copier2 3 "hi") '("hi" "hi" "hi"))

(define (copier2 n0 s)
  (local [(define (recr n)
            (unfold zero? (const s) sub1 n))]
    (recr n0)))

zip doesn't entail anything special; we just pass the appropriate functions in the due places. We could give them names, though anonymous functions seem simpler here.

;; (list (listof X) (listof Y)) -> (listof (list X Y))
;; produce (list (x1 y1) (x2 y2) ...) from given list pair ...
(check-expect (zip2 '(() ())) '())
(check-expect (zip2 '((1 2 3) (3 4 5))) '((1 3) (2 4) (3 5)))
(check-expect (zip2 '((1 2) (3))) '((1 3)))
(check-expect (zip2 '((1 2) (3 4 5))) '((1 3) (2 4)))

(define (zip2 lp)
  (unfold (lambda (xs) (ormap empty? xs))
          (lambda (xs) (map first xs))
          (lambda (xs) (map rest xs))
          lp))

Or, more concisely, taking advantage of curry, a library function from racket/function [for more about currying: https://en.wikipedia.org/wiki/Currying]:

(define (zip3 lp)
  (unfold (curry ormap empty?)
          (curry map first)
          (curry map rest)
          lp))

In summary, whenever you find a function that constructs a list from some value, unfold might be a very useful and elegant abstraction.

martes, 23 de agosto de 2016

Reading R Code. Factors in R

R factors are a common source of confusion for R beginners, even in their most basic use. In this post I try to shed some light on R factors. But rather than only rely on external sources of information or on the R documentation, I will make some experiments and have a look at the implementation to understand what a factor is and how it behaves in the most simple use cases.

First of all, R factors are intended for representing categorical variables. A categorical variable is a variable that can take values of a limited set of so-called categories. For instance, to encode the gender of a population we would use a categorical variable with two possible categories, male and female. Or to encode the educational level of a population under 18 we can use a categorical variable with three categories: elementary, middle and high.

In addition, the categories of a categorical variable may or may not have some internal ordering. We cannot ascribe any sound ordering to the first example of a categorical variable, but we should in the second case (elementary < middle < high).

In R, the term levels stands for categories of the categorical variable (the factor). labels in turn suggests something like custom names for those categories.

This distinction might be blurry, and it is. Think of some data you are going to analyze, say, a record of the gender of 6 persons:

> x <- c("F", "M", "F", "F", "F", "M")

If we want to convert this into a factor, what are the levels? Maybe "male" and "female"?

> factor(x, c("male", "female"))
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: male female

Yikes! This doesn't work. Interestingly, no error message has been triggered, and the levels are stored as we intended. But what are those <NA>s doing there?

Let's try without supplying levels and see what happens:

> factor(x)
[1] F M F F F M
Levels: F M

This looks better; at least the nasty <NA>s have disappeared. So, when no levels are supplied, the levels end up being the distinct elements of the original x. It seems that we could pass those instead of our "male" and "female" from before:

> factor(x, levels = c("M", "F"))
[1] F M F F F M
Levels: M F

Fine! This works. Only that the levels now appear in a different order, the one we supplied.

Still we would prefer the more human-friendly "male" and "female". It might be that these words are rather labels. Let's try:

> factor(x, labels = c("male", "female"))
[1] male   female male   male   male   female
Levels: male female

Much better, but wait: our initial vector was a different one. This totally wreaks havoc with x!

It looks like the order in which we pass the labels has a serious effect. And it certainly does:

> factor(x, labels = c("female", "male"))
[1] female male   female female female male
Levels: female male

Great! If for some hidden reason (maybe, say, because we expect that in a plot of this factor the legend shows male scores before female ones) we still want the other ordering of levels, we may try to combine both parameters:

> factor(x, levels = c("M", "F"), labels = c("male", "female"))
[1] female male   female female female male
Levels: male female

That's it! In particular, note how each label corresponds to each level in the order supplied. To verify, something like the following (where there is a mismatch between the ordering of levels and labels)

> factor(x, levels = c("F", "M"), labels = c("male", "female"))
[1] male   female male   male   male   female
Levels: male female

ruins x again.

From these experiments we can draw some intuitive conclusions:

  1. levels are the distinct elements in the object to be converted to factor.
  2. If levels are not supplied they are constructed from those unique elements sorted in alphabetical order.
  3. labels are the "names" we want to give to our levels.
  4. If not supplied, labels are the same as the levels.
  5. The order of the labels passed must match the order of the levels, either the default levels or the levels we supply.
  6. The least risky way to create factors with custom labels is to supply both parameters.
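These conclusions are easy to double-check. For instance, for conclusion 2, the default levels come out alphabetically sorted whatever the order of elements in the input vector. A quick experiment, not part of the original session:

```r
x2 <- c("b", "c", "a", "b")  # deliberately unsorted
levels(factor(x2))
# [1] "a" "b" "c"
```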

Looking at the implementation makes what we have found out crystal clear.

  1 function (x = character(), levels, labels = levels, exclude = NA,               
  2     ordered = is.ordered(x), nmax = NA)                                         
  3 {       
  4     if (is.null(x)) 
  5         x <- character()
  6     nx <- names(x)                                                              
  7     if (missing(levels)) {                                                      
  8         y <- unique(x, nmax = nmax)
  9         ind <- sort.list(y)
 10         y <- as.character(y)
 11         levels <- unique(y[ind])
 12     }
 13     force(ordered)
 14     exclude <- as.vector(exclude, typeof(x))
 15     x <- as.character(x)
 16     levels <- levels[is.na(match(levels, exclude))]
 17     f <- match(x, levels)
 18     if (!is.null(nx)) 
 19         names(f) <- nx
 20     nl <- length(labels)
 21     nL <- length(levels)
 22     if (!any(nl == c(1L, nL)))
 23         stop(gettextf("invalid 'labels'; length %d should be 1 or %d",
 24             nl, nL), domain = NA)
 25     levels(f) <- if (nl == nL)
 26         as.character(labels)
 27     else paste0(labels, seq_along(levels))
 28     class(f) <- c(if (ordered) "ordered", "factor")
 29     f
 30 }

Let us consider the second point above. If levels are not supplied they will be the distinct elements of the given vector:

if (missing(levels)) {
    y <- unique(x, nmax = nmax)
    ind <- sort.list(y)
    y <- as.character(y)
    levels <- unique(y[ind])
}

If levels are missing, we create a vector y of unique elements of x:

> y <- unique(x)
> y 
[1] "F" "M"

Note that nmax is set by default to NA in the function header:

function (x = character(), levels, labels = levels, exclude = NA, 
    ordered = is.ordered(x), nmax = NA) 

and unique(x, nmax = NA) is equivalent to the call just above.

Also, note that if the order of elements in x were different, say, c("M", "F", "M"), unique would produce the distinct elements just by removing duplicates in the given vector:

> unique(c("M", "F", "M"))
[1] "M" "F"

So unique doesn't sort the result.

To sort the result we need a sorting function, here sort.list. This function produces the indices suitable for subsetting y:

> sort.list(y)
[1] 1 2

> y[sort.list(y)]
[1] "F" "M"

Note also that y always holds a character vector. So the levels produced are always a character vector sorted in alphabetical order.

If levels are supplied, they are expected to match the unique elements of x. Pay attention to this crucial line:

f <- match(x, levels)

match is a nice but perhaps tricky function to understand: it produces the positions (= indices) of the matches of x in levels.

Let's try some examples. Recall that x is:

> x
[1] "F" "M" "F" "F" "F" "M"


> match(x, c("M", "F"))
[1] 2 1 2 2 2 1

Indeed, the first element of x matches the 2nd element of levels; the second element of x matches the 1st element of levels, and so on.

If there is no match, a NA is produced by default:

> match(x, c("M", "O"))
[1] NA  1 NA NA NA  1

Now we see where the NAs in our initial wrong attempt with "male" and "female" as levels came from.

What about labels?

If labels are not supplied they are just the levels, as the function header states (levels is the default value for labels).

If they are supplied, these lines give labels to levels:

levels(f) <- if (nl == nL) 
    as.character(labels)
else paste0(labels, seq_along(levels))

Actually, the line that is executed for our previous examples with labels is just:

levels(f) <- as.character(labels)

since in our examples the number of levels and labels is the same (nl == nL, where nl <- length(labels) and nL <- length(levels)).

Note again that levels with those new labels are still always a character object.

The last line in the snippet above deals with the case where a single label is supplied. We haven't tried an example of that; it is time to do it now:

> factor(x, labels = "gender")
[1] gender1 gender2 gender1 gender1 gender1 gender2
Levels: gender1 gender2

This is a valid way to assign labels to levels, though not a good fit for this example. The code responsible for producing this vector is:

paste0(labels, seq_along(levels))

seq_along(levels) produces the sequence 1, 2, ..., n, where n is the length of levels. paste0 just concatenates the single label with the numbers in that sequence via recycling.
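The recycling can be seen in isolation. A tiny illustration along the same lines, not in the original post:

```r
# one label recycled against a length-2 sequence of level indices
paste0("gender", seq_along(c("F", "M")))
# [1] "gender1" "gender2"
```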

However, the most remarkable fact we may notice from looking at the implementation is that a factor in R is just an integer vector whose attributes "levels" (as we have seen), "names" (taken from the names of x [see line 6 and lines 18-19 in the implementation]) and, above all, "class" are set appropriately.

As for the latter, the second-to-last line in the implementation always sets the "class" attribute to "factor". Besides, if x is already ordered (the default, ordered = is.ordered(x)), or if we pass TRUE as the value of the ordered argument, "class" is set to c("ordered", "factor"):

class(f) <- c(if (ordered) "ordered", "factor") 

Note the order of the elements in that last vector for "class". The order matters here: it means that in such a case f is like an instance of the class "ordered", which in turn is somewhat like a subclass of "factor". [See ?class for more details about this kind of inheritance mechanism.]

A nice way of finishing this post is to demonstrate what we have discovered: that a factor is, from R's internal point of view, just an integer vector with certain attributes set appropriately.

> f <- c(1L, 2L, 1L, 1L, 1L, 2L)
> levels(f) <- c("female", "male")
> class(f) <- "factor"
> str(f)
 Factor w/ 2 levels "female","male": 1 2 1 1 1 2
> f
[1] female male   female female female male  
Levels: female male

> f2 <- c(1L, 3L, 3L, 2L, 1L, 2L)
> levels(f2) <- c("elem", "middle", "high")
> class(f2) <- c("ordered", "factor")
> str(f2)
 Ord.factor w/ 3 levels "elem"<"middle"<..: 1 3 3 2 1 2
> f2
[1] elem   high   high   middle elem   middle
Levels: elem < middle < high

lunes, 8 de agosto de 2016

Reading R Code. The function Reduce in R (part II)

[ This post is the third one in a series about Reduce. To understand it you also need to read the first and second posts in this series.]

Loops

Our implementation of Reduce directly translates the recursive procedures capturing the fold-left and fold-right computations. I think that this kind of recursive thinking is not only more high-level but also easier to understand, and that was the reason for beginning with them. However, in languages that don't provide mechanisms for optimizing recursion, iteration is always expressed via loops or other non-recursive constructs.

We have to do so in R if we hope to handle sequences beyond a certain length; otherwise we'll get something like this:

> source("my_reduce.R")
> my_reduce("+", 1:100000)
Error: evaluation nested too deeply ...

Let's start by converting the higher-level recursion of our first simple implementation into a lower-level loop, in order to see how easily this can be done for the rest of the function.

That first implementation was for fold-left with a given initial value:

my_reduce <-
function(f, x, init) {
  f <- match.fun(f)

  iter <- function(acc, x) {
    len <- length(x)
    if (!len) {
      acc
    }
    else {
      first <- x[[1L]]
      rest <- tail(x, -1L)

      iter(f(acc, first), rest)
    }
  }

  iter(init, x)
}

Using a for loop instead of recursion we get this self-explanatory version:

my_reduce <-
function(f, x, init) {
  f <- match.fun(f)
  acc <- init

  for (e in x) {
    acc <- f(acc, e)
  }
  acc
}
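With recursion gone, the input that previously blew the evaluation stack is no longer a problem. Repeating the definition so the snippet is self-contained:

```r
my_reduce <-
function(f, x, init) {
  f <- match.fun(f)
  acc <- init

  # Plain iteration: no call stack growth, whatever the length of x
  for (e in x) {
    acc <- f(acc, e)
  }
  acc
}

my_reduce("+", 1:100000, 0)
# → 5000050000
```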

Accessing elements by their indices is another way to achieve the same thing. The refactored function based on indices would be:

my_reduce <-
function(f, x, init) {
  f <- match.fun(f)
  acc <- init
  len <- length(x)
  ind <- seq_len(len)

  for (i in ind) {
    acc <- f(acc, x[[i]])
  }
  acc
}

seq_len is an R function that takes a number and produces an integer vector from 1 to that number (inclusive). So seq_len(4) produces the sequence 1, 2, 3, 4. ind in the code above just holds the numerical indices of x to be used in the loop.
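Using seq_len instead of the seemingly equivalent 1:len also makes the loop robust for empty input:

```r
seq_len(4)   # → 1 2 3 4
seq_len(0)   # → integer(0): the for loop runs zero times
1:0          # → 1 0: a two-element sequence that would wrongly run the loop twice
```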

A cosmetic and optional detail: since the local variable acc takes the value passed as init, and the initial init value won't be used any more, we can stick with init as the name for the value accumulated so far, and remove acc:

my_reduce <-
function(f, x, init) {
  f <- match.fun(f)
  len <- length(x)
  ind <- seq_len(len)

  for (i in ind) {
    init <- f(init, x[[i]])
  }
  init
}

Reduce, at last!

That's actually the R implementation for left folding with init given:

function(f, x, init) {
  len <- length(x)
  f <- match.fun(f)
  ind <- seq_len(len)

  for (i in ind) init <- forceAndCall(2, f, init, x[[i]])
  init
}

The only subtle difference lies in how f is called. Instead of a direct application, it uses the forceAndCall R function. Its purpose is to avoid delayed (lazy) evaluation of arguments: we force R to evaluate the first 2 arguments of f (note the 2 as first argument) immediately before executing the body of f, rather than leaving R to evaluate them later. This is rather technical at this point and we can go ahead without fully understanding it. Just note that this construct is a more careful way of calling a function when that function is the value passed to a higher-order function like Reduce. We might go deeper into this in the future.
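A toy illustration (the function add is made up for the example): forceAndCall(2, f, a, b) returns the same value as f(a, b), except that the promises for the first two arguments are forced before the body of f runs:

```r
add <- function(x, y) x + y

# Behaves like add(1 + 1, 2 * 3), but both argument expressions
# are evaluated up front rather than lazily inside the body
forceAndCall(2, add, 1 + 1, 2 * 3)
# → 8
```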

Once we have transformed our initial version into a loop-based version the first half of Reduce is now easy to understand.

 1 Reduce <-
 2 function(f, x, init, right = FALSE, accumulate = FALSE)
 3 {
 4   mis <- missing(init)
 5   len <- length(x)
 6 
 7   if(len == 0L) return(if(mis) NULL else init)
 8 
 9   f <- match.fun(f)
10 
11 
12   if(!is.vector(x) || is.object(x))
13     x <- as.list(x)
14 
15   ind <- seq_len(len)
16 
17   if(mis) {
18     if(right) {
19       init <- x[[len]]
20       ind <- ind[-len]
21     }
22     else {
23       init <- x[[1L]]
24       ind <- ind[-1L]
25     }
26   }
27 
28   if(!accumulate) {
29     if(right) {
30       for(i in rev(ind))
31         init <- forceAndCall(2, f, x[[i]], init)
33     }
34     else {
35       for(i in ind)
36         init <- forceAndCall(2, f, init, x[[i]])
37     }
38     init
39   }
40   else {
..     # to be explained later
..   }
.. }

Apart from pretty minor differences, it is the same code as in our implementation but using indexing and looping instead of recursion.

Let's see:

  • Line 7 handles the case where the given object is empty (typically the empty list or an empty vector). It returns NULL if no initial value is given; otherwise, it returns the initial value.

  • Lines 12-13 introduce one new thing only present in the official R version. Our version assumes a list as input; in R, a vector is expected. These lines ensure that if the value is not a vector, or is an object, it is converted to a type that the following code can deal with: a list.

  • Lines 17-26 initialize the variables init and ind according to the kind of fold (left or right) requested, if no initial value was provided. init is set exactly as in our version. Regarding ind, instead of working on the rest of the list directly as in our recursive implementation, the indices are initialized appropriately: ind[-1L] drops the index of the first element, yielding the indices of the rest of the list for left folding. Correspondingly, ind[-len] produces the indices for right folding: those of all elements except the last one (which has been used as init). These indices will be reversed in line 30 to process the remaining list from right to left, as explained in the previous post.

  • Finally [lines 29-37], the function f is iteratively called with the proper
    order of arguments for each kind of fold, and the result of the folding is returned in line 38.
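The index manipulation described above is easy to check at the prompt:

```r
ind <- seq_len(5L)
ind[-1L]       # → 2 3 4 5: indices of the rest, for left folding
ind[-5L]       # → 1 2 3 4: all but the last, for right folding
rev(ind[-5L])  # → 4 3 2 1: the order in which the right fold visits them
```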

The rest of Reduce is the code handling the case where the caller asks for a sequence of partial results rather than the final reduction. We'll look at it in the next post.

Saturday, August 6, 2016

Reading R Code. The function Reduce in R (part I)

[This post is the second one about the function Reduce. It assumes everything discussed in the first post about Reduce]


The function Reduce in R is based on the Lisp function reduce. It also adds an option borrowed from Haskell's scan functions.

Let's see this by exploring a few simple examples and observing their behavior in R as well as in Lisp and Haskell.

What does Reduce do?

## Example 1.
# R
> Reduce(`+`, 1:4)
[1] 10

# Lisp
[1]> (reduce #'+ '(1 2 3 4))
10

# Haskell
Prelude> foldl (+) 1 [2, 3, 4]
10

Look at the Haskell function. I have used foldl, which is the fold-left function in Haskell, because R and Lisp do fold-left by default. Note also that while in Haskell the initial value has to be specified, in R and Lisp we don't need to do so. When not specified, the initial value for left-folding is the first element of the given list.

The evaluation of all these functions (fold-left) expands as follows for the given example:

(((1 + 2) + 3) + 4)


## Example 2.
# R
> Reduce(`-`, 1:4)
[1] -8

# Lisp
[1]> (reduce #'- '(1 2 3 4))
-8

# Haskell
Prelude> foldl (-) 1 [2, 3, 4]
-8

The expansion is the same:

(((1 - 2) - 3) - 4)


## Example 3.
# R
> Reduce(`-`, 1:4, right = TRUE)
[1] -2

# Lisp
[1]> (reduce #'- '(1 2 3 4) :from-end t)
-2

# Haskell
Prelude> foldr (-) 4 [1, 2, 3]
-2

The evaluation expands to this now:

(1 - (2 - (3 - 4)))

This is fold-right. In R and Lisp we have to set an option for this fold, and at this point it should be clear why these two functions are described as "left" and "right" folds. Note also that, for fold-right, the initial value, if given (as in Haskell), becomes the right-most item in the expansion. In fact, in R and Lisp, if the initial value is not given, it is the last element of the given list.
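We can verify the right-fold expansion directly in R:

```r
Reduce(`-`, 1:4, right = TRUE)
# → -2

1 - (2 - (3 - 4))  # the hand-written expansion gives the same value
# → -2
```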

R provides an extra feature, borrowed from Haskell, for producing a sequence of accumulated partial results rather than just the final reduction.

In Haskell, there is a special set of functions for this purpose, the scan functions: scanr corresponds to foldr, and scanl to foldl:

Prelude> scanl (-) 1 [2, 3, 4]
[1,-1,-4,-8]

Observe the partial results in the computations above:

initial value:        =  1
1st result   :  1 - 2 = -1
2nd result   : -1 - 3 = -4
3rd result   : -4 - 4 = -8

Or, for a right fold:

Prelude> scanr (-) 4 [1, 2, 3]
[-2,3,-1,4]

where partial results from right to left and in reverse order are:

initial value:          =  4
1st result   : 3 - 4    = -1
2nd result   : 2 - (-1) =  3
3rd result   : 1 - 3    = -2

The R equivalents need the accumulate parameter set to TRUE:

> Reduce("-", 1:4, accumulate = TRUE)
[1]  1 -1 -4 -8

and

> Reduce("-", 1:4, right = TRUE, accumulate = TRUE)
[1] -2  3 -1  4

These functions are just variants of the tail-recursive pattern where a second accumulator is added to keep track of computed values so far.

So for instance scanl can be implemented as follows:

 (define (scan-left f i lox0)
  (local [(define (f-for-lox-acc acc rsf lox)
            (cond [(empty? lox) (reverse (cons acc rsf))]
                  [else
                    (f-for-lox-acc (f acc (first lox))
                                   (cons acc rsf)
                                   (rest lox))]))]
    (f-for-lox-acc i empty lox0)))

Only an extra accumulator, rsf, has been added. It maintains the list of values computed so far. Each time a new value is produced, it is inserted into the rsf accumulator. The result is finally reversed. It would be possible to avoid the final reverse by appending computed values to the end of rsf. In any case, the idea behind it is clear.
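For comparison, here is a loop-based sketch of the same idea in R (my_scan_left is a made-up name for this illustration, not a base R function):

```r
# A made-up, loop-based transcription of scan-left. Collecting each
# partial result by index into out avoids the final reverse.
my_scan_left <- function(f, init, x) {
  f <- match.fun(f)
  acc <- init
  out <- vector("list", length(x) + 1L)
  out[[1L]] <- init
  for (i in seq_along(x)) {
    acc <- f(acc, x[[i]])
    out[[i + 1L]] <- acc
  }
  out
}

unlist(my_scan_left(`-`, 1, 2:4))
# → 1 -1 -4 -8, matching Haskell's scanl (-) 1 [2, 3, 4]
```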

R implementation of Reduce

As we have seen, Reduce in R is a multipurpose function. Other languages, such as Haskell, prefer independent elementary functions for each task (right/left fold, returning accumulated values or not). I find the latter more elegant, easier to read and easier to test. R, though, is fond of functions with lots of parameters, full of conditionals to select the code to execute for each specific purpose.

Even so, and with the goal of first implementing our own version in order to better understand the official R version, it is reasonable to begin with basic use cases and add more once those have been implemented.

One of the simplest use cases is to execute Reduce as a left fold with a given initial value. As usual, we need a very basic set of tests (more should be added for good coverage). The following are hopefully enough at this stage:

test_that("test reduce: fold-left, init given", {
  expect_equal(my_reduce("+", integer(0), init = 1), 1)
  expect_equal(my_reduce("-", 2:4, init = 1), -8)
  expect_equal(my_reduce("/", c(2, 9, 13), init = 7), 7/234)
  expect_equal(my_reduce(list, 2:4, init = 1), (list(list(list(1, 2), 3), 4)))
})

The most natural implementation at this point is the translation into R of the tail-recursive fold-left procedure (first variant):

my_reduce <-
function(f, x, init) {
  f <- match.fun(f)

  iter <- function(acc, x) {
    len <- length(x)
    if (!len) {
      acc
    }
    else {
      first <- x[[1L]]
      rest <- tail(x, -1L)

      iter(f(acc, first), rest)
    }
  }

  iter(init, x)
}

Assuming that test cases are saved into the file test_my_reduce.R, running tests with:

> library(testthat)
> test_file("test_my_reduce.R")

confirms that this implementation works.

A few points about this code. The first line applies match.fun, as customary in R code, to check the validity of the argument passed as f. I have used 1L instead of just 1 for indexing. This is a common practice in R base code: when an integer is required, the number is passed as a literal integer (the number followed by L). Finally, I have intentionally named the auxiliary helper iter for reasons that will become clear. Any name would have worked, though.

The next step is handling the case where no initial value is passed. Some test cases for this:

test_that("test reduce: fold-left, init missing", {
  expect_equal(my_reduce("+", integer(0)), NULL) 
  expect_equal(my_reduce("-", 1:4), -8)
  expect_equal(my_reduce("/", c(7, 2, 9, 13)), 7/234)
  expect_equal(my_reduce(list, 1:4), (list(list(list(1, 2), 3), 4)))
})

The first test case is for the empty vector. As no initial value is given, we cannot return it as before. By convention we return NULL in this case.

And this is an obvious implementation that passes all tests above [only the first lines are added, the iter function remains the same]:

my_reduce <-
function(f, x, init) {
  f <- match.fun(f)
  mis <- missing(init)
  len <- length(x)

  if (mis && !len) {
    return(NULL)
  }

  if (mis) {
    init <- x[[1L]]
    x <- tail(x, -1L)
  }

  iter <- function(acc, x) {
    len <- length(x)
    if (!len) {
      acc
    }
    else {
      first <- x[[1L]]
      rest <- tail(x, -1L)

      iter(f(acc, first), rest)
    }
  }

  iter(init, x)
}

What about the fold-right operation?

Let's write some tests for fold-right with and without the initial value given:

test_that("test reduce: fold-right, init given", {
  expect_equal(my_reduce("+", integer(0), init = 1, right = TRUE), 1)
  expect_equal(my_reduce("-", 1:3, init = 4, right = TRUE), -2)
  expect_equal(my_reduce("/", c(7, 2, 9), init = 13, right = TRUE), 63/26)
  expect_equal(my_reduce(list, 1:3, init = 4, right = TRUE), 
           list(1, list(2, list(3, 4))))
})

test_that("test reduce: fold-right, init missing", {
  expect_equal(my_reduce("+", integer(0), right = TRUE), NULL)
  expect_equal(my_reduce("-", 1:4, right = TRUE), -2)
  expect_equal(my_reduce("/", c(7, 2, 9, 13), right = TRUE), 63/26)
  expect_equal(my_reduce(list, 1:4, right = TRUE), 
           list(1, list(2, list(3, 4))))
})

As for the implementation, it looks like it would be a very different piece of code if we were to translate the natural recursive procedure examined in the post referred to above. However, when designing functions that handle several possibilities we should seek to maximize the amount of code shared between them. The best possible scenario for the function at hand would be one in which the tail-recursive pattern employed for fold-left is shared by the fold-right operation. It turns out that we already have a promising candidate: the second variant of fold-left. Remember that it has the same signature as fold-right, so it shouldn't be difficult to adapt it to serve as a tail-recursive version of fold-right. A closer examination of the expansions of both procedures gives the answer. Let's recall them once again:

fold-right   (-) 1 '(2 3 4) : 2 - (3 - (4 - 1))
fold-left-2  (-) 1 '(2 3 4) : 4 - (3 - (2 - 1)) 

The only difference lies in the order of elements in the given list. The function we are searching for (a tail-recursive fold-right) should have the same expansion as fold-right:

fold-right-2 (-) ??         : 2 - (3 - (4 - 1))

Which arguments should it receive? Obviously:

fold-right-2 (-) 1 '(4 3 2)

Hence, fold-right-2 is just fold-left-2 with the input list reversed. And, as you surely remember, the only remaining difference between our fold-left (fold-left-1, the variant employed in R) and fold-left-2 is the order of the arguments passed to f.

All of these points are captured by the following implementation [line numbers included]:

1  my_reduce <-
2  function(f, x, init, right = FALSE) {
3    iter <- function(acc, x) {
4      len <- length(x)
5      if (!len) {
6        acc
7      }
8      else {
9        first <- x[[1L]]
10       rest <- tail(x, -1L)
11
12       if (right) iter(f(first, acc), rest) else iter(f(acc, first), rest)
13     }
14   }
15  
16   f <- match.fun(f)
17   mis <- missing(init)
18   len <- length(x)
19 
20   if (mis && !len) {
21     return(NULL)
22   }
23
24   if (mis) {
25     if (right) {
26       init <- x[[len]]
27       x <- rev(head(x, -1L))
28     }
29     else {
30       init <- x[[1L]]
31       x <- tail(x, -1L)
32     }
33   }
34   else {
35     if (right) {
36       x <- rev(x)
37     }
38   }
39 
40   iter(init, x)
41 }

Changes with respect to the previous implementation are:

  1. The nested function iter has been modified so that the recursive call passes arguments to f in the order suitable for the requested kind of fold [line 12].
  2. The handling of the arguments init and right has been extended appropriately [lines 24-38]. Note, in particular, that when right is TRUE we need to reverse the sequence, as explained. The R function rev does the job [lines 27, 36].

This implementation passes our current test set. Nice!

Do you think that this function is getting long? If so, you are with me. As I said before, I prefer shorter functions with as few parameters as possible and independent helpers. But R base code is full of this kind of monolithic function, rich in parameters and conditional branches.

[To be continued]

Sunday, July 31, 2016

Reading R Code. The function Reduce. Folds

Among the essential higher-order functions in functional programming (we have already seen map and filter), reduce (aka fold) is probably the most amazing and powerful. It is sort of the Swiss Army knife in functional programmers' hands. I highly recommend this excellent exposition about the power and flexibility of fold:

http://www.cs.nott.ac.uk/~pszgmh/fold.pdf

What is fold?

Although fold deserves and probably requires several blog posts, I'll try an elementary presentation suitable to our purposes.

fold is essentially the functional abstraction that corresponds to the very structure of lists.

A list of values of whatever type, say, type X, is either the empty list or a list constructed from an X and a list of Xs, where by construction I mean the operation of consing an X onto a list of Xs. For instance (in Lisp):

[1]> (cons 1 (list 2 3))
(1 2 3)

or in Haskell, where : is the Haskell equivalent to the Lispier cons:

Prelude> 1 : [2, 3]
[1,2,3]

Note the self-referential nature of the type description for lists. Any function over lists relies on this self-referential structure, and it will have a natural form, where the recursive call emanates naturally from the type description. So the sketch or template of a function for lists of X looks like this [written now in Racket]:

(define (f-for-lox lox)
  (cond [(empty? lox) (...)]
        [else
          (... (first lox)
               (f-for-lox (rest lox)))]))
       

On the other hand, for stack-space efficiency reasons, functions over lists can be written in such a way that the recursive call is in tail position (a position that ensures it is the last call in the procedure). Observe that the recursive call to f-for-lox in the template above is not in tail position; instead, the function ending up in place of the last three dots in that template will be in tail position.

I heartily recommend watching these excellent videos, on which this exposition is mostly based, for a better understanding:

  1. https://www.youtube.com/watch?v=Z7qXdkt7yOE
  2. https://www.youtube.com/watch?v=nx8BALSH3nY
  3. https://www.youtube.com/watch?v=fMqbMiGuQ9c
  4. https://www.youtube.com/watch?v=6NyPb8jQROs

Such functions, where the recursive call is in tail position, are known as tail-recursive functions, and they are normally written with the aid of a local function that supplies an accumulator. A tail-recursive template for lists has this form:

(define (f-for-lox lox0)
  (local [(define (f-for-lox-acc acc lox)
            (cond [(empty? lox) (... acc)]
                  [else
                    (f-for-lox-acc (... acc (first lox))
                                   (rest lox))]))]
    (f-for-lox-acc ... lox0)))

Or this one, if we change the order of components in the recursive call:

(define (f-for-lox lox0)
  (local [(define (f-for-lox-acc acc lox)
            (cond [(empty? lox) (... acc)]
                  [else
                    (f-for-lox-acc (... (first lox) acc)
                                   (rest lox))]))]
    (f-for-lox-acc ... lox0)))

fold is the function that captures these two abstractions. Accordingly, there are two possible fold functions (the second one with two variants): fold-right, which captures the structural recursive procedure, and fold-left, which captures the tail-recursive one.

Let's see this. If we fill those templates with more meaningful placeholders, i standing for the initial value and f standing for a function that combines the contribution of the first element with the contribution of the recursion, and we add them as parameters to the functions, we get this function for the natural recursive template:

(define (fold-right f i lox)
  (cond [(empty? lox) i]
        [else
          (f (first lox)
             (fold-right f i (rest lox)))]))

and these two variants for the two tail-recursive templates:

(define (fold-left-1 f i lox0)
  (local [(define (f-for-lox-acc acc lox)
            (cond [(empty? lox) acc]
                  [else
                    (f-for-lox-acc (f acc (first lox))
                                   (rest lox))]))]
    (f-for-lox-acc i lox0)))


(define (fold-left-2 f i lox0)
  (local [(define (f-for-lox-acc acc lox)
            (cond [(empty? lox) acc]
                  [else
                    (f-for-lox-acc (f (first lox) acc)
                                   (rest lox))]))]
    (f-for-lox-acc i lox0)))

These functions are precisely fold-right and fold-left (the latter in two variants), respectively.

A careful analysis of these functions enables us to infer their respective signatures.

;; fold-right                :: (X Y -> Y) Y (listof X) -> Y
;; fold-left-1 (1st variant) :: (Y X -> Y) Y (listof X) -> Y
;; fold-left-2 (2nd variant) :: (X Y -> Y) Y (listof X) -> Y

Note, in particular, that fold-right and the second variant of fold-left share the same signature, while the signature for the first variant of fold-left differs due to the different order of arguments in the call to f.

fold-right is basically the same in all functional programming languages, while for fold-left different languages pick either the first variant (Lisp, Haskell, OCaml, ...) or the second one (SML, Racket, ...) for their implementation of this function.

The difference between the two variants of fold-left has an impact in many cases. Just choose as f a function for which the order of operands matters (a non-commutative function). For instance, if f is + the order doesn't matter, but if it is - it absolutely matters. Let's consider this for - in all the languages we have mentioned:

# Lisp [reduce is fold-left by default]
[1]> (reduce #'- '(1 2 3 4) :initial-value 0)
-10

# Haskell
Prelude> foldl (-) 0 [1, 2, 3, 4]
-10

# OCaml
# List.fold_left (-) 0 [1; 2; 3; 4] ;;
- : int = -10

While:

# Racket
> (foldl - 0 '(1 2 3 4))
2

# SML
- List.foldl op- 0 [1, 2, 3, 4] ;
val it = 2 : int

[The R's Reduce will be the subject of the next post.]

Thursday, July 28, 2016

Reading R code. The functions Map and mapply

In this post I'm going to consider the implementation of the function Map in R. As in the previous post in the series, I first introduce the typical map in functional languages. Then I go into the R's Map implementation. Along the way I explore some other R functions to the extent needed for a basic understanding of the implementation.

What is map

map applies a function to each element in a given list to get a new list with the result of the application. For instance:

map sqrt [1, 2, 3]

produces a list of the square roots of each number:

[1.0,1.4142135623730951,1.7320508075688772]

In general, map takes a function and a list of elements of some type, say a, and produces a list of elements of some other type, say b, where a and b are the types of, respectively, the input and output values of the function that map takes. The signature for the example above could be written as:

(Num -> Double) [Num] -> [Double]

meaning that the first argument, sqrt, takes a number (in general) and produces a double-precision floating point number; the second argument is a list of numbers, and the result a list of Doubles.

So in general map has this signature:

(a -> b) [a] -> [b]

As with filter, a natural implementation of map is a math-like recursive definition:

map f []     = []
map f (x:xs) = f x : map f xs

Or, in Racket:

(define (map f lox)
  (cond [(empty? lox) empty]
        [else
          (cons (f (first lox))
                (map f (rest lox)))]))

This is the basic usage. The function map can usually handle multiple lists too. One of the variants works as follows:

It applies the function to the first elements of the given lists, then to the second ones, and so on, until the last element of the shortest list is processed; the remaining elements of the longer lists are ignored. The result is the list produced by the successive applications. Here is a possible example with two lists as input. I use the name zipWith to refer to this kind of map that takes in two lists:

zipWith (+) [1, 2, 3] [4, 5, 6, 7]

produces:

[5, 7, 9] -- so [1 + 4, 2 + 5, 3 + 6]

The signature would be:

zipWith :: (a b -> c) [a] [b] -> [c]

Generalizations to cope with an arbitrary number of lists are also available in functional languages.

Map in R

The R Map function provides, as ?Map states, a generalization of the sort described:

‘Map’ applies a function to the corresponding elements of given vectors.

(Note that like Filter the consumed objects are R vectors.)

However, unlike the generalized map mentioned, Map doesn't discard remaining elements of longer lists. Instead, it uses recycling:

‘Map’ ... is similar to Common Lisp's ‘mapcar’ (with arguments being recycled, however)

Common Lisp's mapcar is a function that behaves as explained above (the Haskell zipWith example). Indeed, executing the corresponding Lisp code in the clisp interpreter gives the same result:

[1]> (mapcar #'+ '(1 2 3) '(4 5 6 7))
(5 7 9)

while in R recycling is at work:

> Map(`+`, 1:3, 4:7)
[[1]]
[1] 5

[[2]]
[1] 7

[[3]]
[1] 9

[[4]]
[1] 8

As you can see, when the shortest list is exhausted, R recycles it as needed: the last element is the result of 1 + 7, where 1 comes from recycling the shorter list. Note also that the result is not an atomic vector, as one might expect, but a list. This is also documented:

‘Map’ ... does not attempt to simplify the result ... Future versions may allow some control of the result type.

As for the implementation ?Map also tells us what it is:

‘Map’ is a simple wrapper to ‘mapply’.

It is not uncommon in R to find traces of implementation details in its docs.

The actual implementation expresses in code all of these points:

function (f, ...) 
{
  f <- match.fun(f)
  mapply(FUN = f, ..., SIMPLIFY = FALSE)
}

Let's have a closer look into the definition.

The function takes a first argument f, the mapping function to be applied, and an undetermined number of extra arguments. The three-dots construct catches those extra arguments and passes them on later to mapply. The vectors on which f will be applied are among the arguments that the caller will pass; possibly, more arguments that mapply can receive might be passed by the caller as well.

In the first line, the function passed as first argument is resolved and saved into a local variable f; an error is signaled if it cannot be interpreted in any way as a function. That's basically what match.fun does.
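A quick look at match.fun's behavior at the prompt:

```r
f <- match.fun("+")   # a function name given as a string is resolved
f(2, 3)
# → 5

try(match.fun(42))    # anything that can't be taken as a function signals an error
```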

The function mapply

Once this first argument has been checked, it is passed as the FUN argument to mapply, which in turn calls the underlying C code to carry out the computation. Additionally, Map sets the SIMPLIFY argument of mapply to FALSE to avoid the simplification, as documented.

We can see this better if we take a look at the implementation of mapply:

function(FUN,..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
{
  FUN <- match.fun(FUN)
  dots <- list(...)

  answer <- .Internal(mapply(FUN, dots, MoreArgs))

  if (USE.NAMES && length(dots)) {
    if (is.null(names1 <- names(dots[[1L]])) && is.character(dots[[1L]]))
      names(answer) <- dots[[1L]]
    else if (!is.null(names1))
      names(answer) <- names1
  }
  if(!identical(SIMPLIFY, FALSE) && length(answer))
    simplify2array(answer, higher = (SIMPLIFY == "array"))
  else answer
}

Note that answer, the returned value, is the result produced by the internal function, written in C, which is invoked via the call to .Internal. We will come across this construct over and over while reading R base functions. Many R base functions just wrap the call to an underlying internal function, possibly preparing things for it and/or adapting its result according to the arguments passed in the first place.

Map sets mapply's SIMPLIFY to FALSE so that the simplification that mapply could do otherwise will never be executed.

However, despite the fact that the intended arguments for the three dots are just the vectors, as ?Map documents, nothing prevents us from passing the other possible mapply arguments, USE.NAMES and MoreArgs.

MoreArgs allows for passing extra arguments to the f function. For instance, to get a list of vectors where 42 is repeated from 1 to 4 times, we can use MoreArgs in this way [example borrowed from the mapply doc]:

> Map(rep, times = 1:4, MoreArgs = list(x = 42))
[[1]]
[1] 42

[[2]]
[1] 42 42

[[3]]
[1] 42 42 42

[[4]]
[1] 42 42 42 42

As R allows us to pass extra arguments just by naming them after the function,

> Map(rep, times = 1:4, x = 42)

MoreArgs seems to be of little use in this regard.

Furthermore, one can always resort to the commonly-used idiom in functional programming: supplying an anonymous function in place of f:

> Map(function(x) rep(42, times = x), 1:4)

The other extra option that we may pass to Map is USE.NAMES. mapply sets it by default to TRUE. In such a setting, and if more arguments apart from the initial function f are passed (length(dots) is not 0), this code in mapply will be executed:

if (USE.NAMES && length(dots)) {
  if (is.null(names1 <- names(dots[[1L]])) && is.character(dots[[1L]])) 
    names(answer) <- dots[[1L]]
  else if (!is.null(names1)) 
    names(answer) <- names1
}

Two cases are especially handled:

  • The second argument passed to mapply (dots[[1L]]) doesn't have names but it is of character type. (Recall that the first argument is always the function f).
  • The second argument passed to mapply has names.

In the first case, the character object is used to set the names of the result of mapply, as in this example taken from ?mapply:

> mapply(function(C, k) paste0(rep.int(C, k)), LETTERS[1:6], 6:1)

In the second case, the names of the argument become the names of the result. For instance [again from ?mapply]:

> mapply(function(x, y) seq_len(x) + y,
+        c(a =  1, b = 2, c = 3),  # names from first
+        c(A = 10, B = 0, C = -10))

This is put simply in the doc:

use names if the first ... argument has names, or if it is a character vector, use that character vector as the names.

If we want no names handling in Map we can just set USE.NAMES to FALSE overriding the default behavior of mapply, or viewed from the implementation, falling back to the default behavior of the internal C code.

A couple of points about R idioms that we see in mapply:

Assignments can be sub-expressions. Here:

is.null(names1 <- names(dots[[1L]]))

The assignment expression names1 <- names(dots[[1L]]) evaluates to the value assigned to names1. That value is passed to is.null as part of the if condition, and names1 is re-used later in the else-if branch.
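The same idiom in isolation, with made-up data:

```r
x <- c(a = 1, b = 2)

# Assign and test in one expression; nm is available afterwards
if (!is.null(nm <- names(x))) print(nm)
# → "a" "b"
```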

Secondly, in a logical context the number 0 evaluates to FALSE, and any other number evaluates to TRUE. This allows for concise conditions like the previous one:

if (USE.NAMES && length(dots))

The second part of the condition checks whether dots is empty or not. Hence, there is no need to write length(dots) != 0. This is a common idiom in many languages supporting boolean evaluation of different classes of objects.
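For instance:

```r
dots <- list()
if (length(dots)) "non-empty" else "empty"
# → "empty": length 0 counts as FALSE in the condition
```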

To complete a basic understanding of these functions we should inspect match.fun and simplify2array (the function responsible for simplifying the result) more closely. match.fun appears frequently in R code, so I pin it to the top of my task list; simplify2array is a helper function currently used only by mapply and sapply, so we can live without studying it for the time being.