Wednesday, July 20, 2016

How to read the R documentation. An example with plot().

From my experience as a teaching assistant on several introductory R MOOCs, I get the impression that beginners, and even intermediate users, assume that the R documentation is only for experts and, as a consequence, don't read it in the first place.

This is an unfortunate prejudice, since R's built-in documentation is one of its most remarkable features and exists precisely to help users. Even though (I admit) some docs are pretty technical, many others are perfectly readable, even for beginners.

In this post I'll try to show the strategy I typically follow when dealing with R doc pages.

As a pretext, let's take the following question posted in one of those MOOCs:

What does the function plot do when its input is a data frame?

Getting the right doc

The first thing we need to do is to call the help for plot [I put the first few lines of the result]:

> ?plot

# -------------------

plot                 package:graphics                  R Documentation

Generic X-Y Plotting

Description:

     Generic function for plotting of R objects.  For more details
     about the graphical parameter arguments, see ‘par’.

...

# -------------------

Note the first sentence in the 'Description' section. It tells us that plot is a generic function for plotting R objects.

Not much, is it? But it is still something. For our purposes, generic function means that we need to look for a method (= another function), which is the one actually called, depending on the class of the object we pass to plot. [Since my intent is to guide beginners, I omit all technical discussion, such as the exact meaning of method and function in R, or the difference between generic and non-generic functions. For the moment, take these and similar terms simply as jargon we are using liberally to talk about these things.]
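To make the idea of a generic and its methods a bit more tangible, here is a minimal sketch with a toy generic. The names describe, describe.default and describe.data.frame are made up for this illustration and are not part of R; only UseMethod and the women data set come with R.

```r
# A toy S3 generic (hypothetical, for illustration only).
# UseMethod() picks the method according to the class of `x`.
describe <- function(x, ...) UseMethod("describe")

# Fallback method, used when no more specific method exists.
describe.default <- function(x, ...) "some R object"

# Method chosen when `x` is a data frame.
describe.data.frame <- function(x, ...) {
  paste("a data frame with", ncol(x), "columns")
}

describe(1:3)    # falls back to describe.default
describe(women)  # dispatches to describe.data.frame
```

plot works along the same lines: we call plot, and R looks for a plot.<class> function that matches the object.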

To know more about plot methods we just type this:

> methods(plot)

This is what I get on my installation:

 [1] plot.acf*           plot.data.frame*    plot.decomposed.ts*
 [4] plot.default        plot.dendrogram*    plot.density*      
 [7] plot.ecdf           plot.factor*        plot.formula*      
[10] plot.function       plot.hclust*        plot.histogram*    
[13] plot.HoltWinters*   plot.isoreg*        plot.lm*           
[16] plot.medpolish*     plot.mlm*           plot.ppr*          
[19] plot.prcomp*        plot.princomp*      plot.profile.nls*  
[22] plot.raster*        plot.spec*          plot.stepfun       
[25] plot.stl*           plot.table*         plot.ts            
[28] plot.tskernel*      plot.TukeyHSD*
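The lookup also works the other way round. As a small sketch, methods can be asked which generics have a method for a given class; the variable name df_methods is just my choice, and the result depends on the packages attached in your session, so I don't show any output here.

```r
# Reverse lookup: which generics have a method for class "data.frame"?
df_methods <- methods(class = "data.frame")

length(df_methods)  # how many methods were found in this session
df_methods          # prints the list; it should include a plot method
```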

Among these methods plot.data.frame is naturally the one which we are interested in. And R helps here too:

> ?plot.data.frame

# -------------------

plot.data.frame            package:graphics            R Documentation

Plot Method for Data Frames

Description:

     ‘plot.data.frame’, a method for the ‘plot’ generic.  It is
     designed for a quick look at numeric data frames.

Usage:

     ## S3 method for class 'data.frame'
     plot(x, ...)
     
Arguments:

       x: object of class ‘data.frame’.

     ...: further arguments to ‘stripchart’, ‘plot.default’ or ‘pairs’.

Details:

     This is intended for data frames with _numeric_ columns. For more
     than two columns it first calls ‘data.matrix’ to convert the data
     frame to a numeric matrix and then calls ‘pairs’ to produce a
     scatterplot matrix).  This can fail and may well be inappropriate:
     for example numerical conversion of dates will lose their special
     meaning and a warning will be given.

     For a two-column data frame it plots the second column against the
     first by the most appropriate method for the first column.

     For a single numeric column it uses ‘stripchart’, and for other
     single-column data frames tries to find a plot method for the
     single column.

See Also:

     ‘data.frame’

Examples:

     plot(OrchardSprays[1], method = "jitter")
     plot(OrchardSprays[c(4,1)])
     plot(OrchardSprays)
     
     plot(iris)
     plot(iris[5:4])
     plot(women)

# -------------------

Reading Details

Most R help pages share the same structure, consisting of different sections (Title, Description, Usage, etc.).

Some sections may be more important than others, depending on what we want to know. For instance, if we only want to know what the function is about in general terms, it might be enough to read the terse 'Description'. If we already know what the function does but have forgotten how a certain argument is used, we can look into the 'Arguments' section. If we really want to know exactly what the function does, we will need to read 'Details' and 'Examples'.

A superficial reading just doesn't work; we have to take the time to read thoroughly. Let's do it now. Even without fully understanding everything, a careful reading allows us to outline what this function does:

  • If the given data frame consists of more than two columns, it ideally displays a scatterplot matrix.
  • If the data frame has two columns, it displays a suitable y versus x plot, where x is the first column and y the second.
  • If the data frame has a single numeric column it generates a stripchart; if that column is not numeric a suitable plot [not specified] is displayed.

Reading examples

So far so good, but an image, or an example, is worth a thousand words. And R usually excels at providing ready-made examples for us. This is critically important when reading the R documentation. If examples are given, please don't skim them; work through them. Even better, rather than just running them via example(function_name), explore them on the console, one by one, at least until you reach a good enough understanding. Let's move on to the 'Examples' section.

The first three examples apply plot to the data set OrchardSprays, which a basic R installation provides by default. This data set has the following structure:

> str(OrchardSprays)
'data.frame': 64 obs. of  4 variables:
 $ decrease : num  57 95 8 69 92 90 15 2 84 6 ...
 $ rowpos   : num  1 2 3 4 5 6 7 8 1 2 ...
 $ colpos   : num  1 1 1 1 1 1 1 1 2 2 ...
 $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1 3 2 ...

So four columns, the first three numeric, and the last one categorical (a factor in R parlance).
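If the str output feels too dense, a shorter check of the column classes is possible. A minimal sketch using sapply and class (the variable name col_classes is mine):

```r
# Class of each column of OrchardSprays, as a named character vector.
col_classes <- sapply(OrchardSprays, class)
col_classes
# decrease, rowpos and colpos are "numeric"; treatment is "factor"
```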

The first example passes a data frame with a single column, a one-column subset drawn from the OrchardSprays data frame. Therefore, it's an example for the case where the data frame is made of a single numeric variable. We should expect a stripchart, as documented, and we get that:

> plot(OrchardSprays[1], method = "jitter")

The second example passes this data frame:

> str(OrchardSprays[c(4, 1)])
'data.frame': 64 obs. of  2 variables:
 $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1 3 2 ...
 $ decrease : num  57 95 8 69 92 90 15 2 84 6 ...

So this is an example of the second case described above, where the first column (x) is categorical and the second (y) is numerical. A suitable plot would be a series of boxplots, one per level of the categorical variable. And that's exactly what we obtain:

> plot(OrchardSprays[c(4, 1)])
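As far as I can tell, plotting a numeric variable against a factor ends up in a call to boxplot, so the following direct call should give an equivalent picture. Take it as a sketch of what happens behind the scenes, not as the exact internal call; bp is just a name I chose for the returned summary.

```r
# Presumed equivalent of plot(OrchardSprays[c(4, 1)]):
# one box of 'decrease' values per 'treatment' level.
bp <- boxplot(decrease ~ treatment, data = OrchardSprays)

bp$names  # the eight treatment levels, "A" to "H"
```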

The third example passes the whole data frame, which illustrates the first case mentioned in the doc. And we get the corresponding scatterplot matrix:

> plot(OrchardSprays)

I leave it to the reader to investigate the last three examples. The only new case is plot(women), where the input is a two-column data frame, but this time both columns are numeric.

I honestly believe that reading this doc (like many other R docs) gives more reliable information about the function at hand than googling for hours or skimming through dozens of books.
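As a hint for the women case, here is what I would expect to be an equivalent call, based on the two-column rule in 'Details' (second column against the first). It is a sketch under that assumption:

```r
# Assumed equivalent of plot(women): second column (weight) against
# the first (height), with the column names as axis labels.
plot(women[[1]], women[[2]],
     xlab = names(women)[1], ylab = names(women)[2])
```

With two numeric columns the result is a plain scatterplot.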

Getting and reading the source code

One more resource is still available to users: the source code itself, which is obviously the definitive answer to all questions.

Many R functions are implemented in R itself, and an intermediate R user should be able to read and understand the implementation, at least to some extent. Yes, many core functions are written in C, and those are beyond the reach of a non-professional programmer who lacks the time to navigate the entire C code base and make sense of it. But we can always give it a try, just in case.

As for the function plot.data.frame we are lucky.

There is an initial difficulty, though: locating the source. Methods that the methods function mentioned above lists with a * suffix are functions whose code cannot be reached by just typing the function name, as we usually do. The source code is still accessible, however. In particular, when the function is an S3 method, as plot.data.frame is [I won't go into S3 vs. S4 methods here; see the R manuals at cran.r-project.org for more info], we can use, among others, either of these two instructions:

> getS3method("plot", "data.frame")

or

> getAnywhere("plot.data.frame")

either of which displays this code [line numbers added for commenting below]:

1 function (x, ...) 
2 {
3      plot2 <- function(x, xlab = names(x)[1L], ylab = names(x)[2L], 
           ...) plot(x[[1L]], x[[2L]], xlab = xlab, ylab = ylab, 
           ...)
4      if (!is.data.frame(x)) 
5          stop("'plot.data.frame' applied to non data frame")
6      if (ncol(x) == 1) {
7          x1 <- x[[1L]]
8          cl <- class(x1)
9          if (cl %in% c("integer", "numeric")) 
10             stripchart(x1, ...)
11         else plot(x1, ...)
12     }
13     else if (ncol(x) == 2) {
14         plot2(x, ...)
15     }
16     else {
17         pairs(data.matrix(x), ...)
18     }
19 }

A concise commentary:

The function takes one mandatory argument, assumed to be a data frame, and an arbitrary number of optional arguments, which are passed on to the other graphics functions it calls [line 1].
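The argument list can also be checked programmatically rather than by eye. A small sketch, where pdf_fun is just a name I picked for the retrieved method:

```r
# Retrieve the method as an ordinary function object and inspect it.
pdf_fun <- getS3method("plot", "data.frame")

is.function(pdf_fun)     # TRUE
names(formals(pdf_fun))  # "x" and "..."
```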

It defines an inner function, plot2, which in turn calls plot with the first and second columns of the given data frame and uses the column names as default axis labels. This function is used later for one of the cases mentioned in the documentation, the one where the data frame has two columns [line 3].

The function exits with an error message if the argument passed is not a data frame [lines 4-5]. This is just the usual defensive check for arguments that are not of the assumed type.

Next comes the main part [lines 6ff.], which conditionally selects a kind of graph depending on the number and, if required, the class of the columns in the data frame.

If the data frame has one column and its class is "numeric" or "integer", a stripchart is displayed; otherwise (the column is of another class) it relies on the generic plot again to generate the appropriate plot [lines 6-11].

If the data frame has two columns the previously defined plot2 function is called, so that a suitable y vs. x plot is obtained [lines 13-14].

Finally, in the remaining case, where the number of columns is greater than 2, the data frame is coerced to a numeric matrix with data.matrix and the function pairs is applied to the result of the coercion [lines 16-17].
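To see what that coercion does before anything is plotted, we can call data.matrix ourselves. A minimal sketch with OrchardSprays (the variable name m is mine); note what happens to the factor column:

```r
# Coerce the whole data frame to a numeric matrix, as line 17 does.
m <- data.matrix(OrchardSprays)

is.matrix(m)         # TRUE
is.numeric(m)        # TRUE
m[1:3, "treatment"]  # levels D, E, B become their integer codes 4, 5, 2
```

This is why the documentation warns that the result may be inappropriate: the codes 4, 5, 2 are plotted as if they were measurements.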

To get a grasp of this last step, here is my challenge: read ?data.matrix and ?pairs now. Have fun and happy R reading!
