From my experience as a teaching assistant on several introductory R MOOCs, I get the impression that beginners, and even intermediate users, assume the R documentation is only for experts and, as a consequence, never read it in the first place.

This is an unfortunate prejudice, since R's built-in documentation is one of its most remarkable features and exists precisely to help users. Even though (I admit) some doc pages are pretty technical, many others are perfectly readable, even for beginners.

In this post I'll try to show the strategy I typically follow when dealing with R doc pages.

As a starting point, let's take the following question, posted in one of those MOOCs:

What does the function plot do when its input is a data frame?

### Getting the right doc

The first thing we need to do is call the help for `plot` [I show only the first few lines of the result]:

`> ?plot`

    # -------------------
    plot                 package:graphics                 R Documentation

    Generic X-Y Plotting

    Description:

         Generic function for plotting of R objects.  For more details
         about the graphical parameter arguments, see ‘par’.
    ...
    # -------------------

Note the first sentence in the 'Description' section. It tells us that `plot` is a *generic function for plotting R objects*.

Not much, is it? But still something. For our purposes, *generic* function means that we need to look for a *method* (another function), which is the one actually called, depending on the class of the object we pass to `plot`. [Since my intent is to guide beginners, I omit all discussion of technicalities, such as the exact meaning of *method* and *function* in R, or the difference between generic and non-generic functions. For the moment, take these and similar terms simply as jargon we are using liberally to talk about these things.]
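To make the idea of a generic and its methods a little more concrete, here is a minimal sketch. The generic `describe` and its two methods are hypothetical names invented for this illustration; they are not part of R. The sketch only shows the mechanism: the generic hands the call over to whichever method matches the class of its argument.

```r
# A toy generic: UseMethod() picks a method based on the class of x.
describe <- function(x, ...) UseMethod("describe")

# Method called when x is a data frame.
describe.data.frame <- function(x, ...) {
  paste("data frame with", ncol(x), "column(s)")
}

# Fallback called when no more specific method exists.
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1L])
}

describe(women)  # dispatches to describe.data.frame
describe(1:3)    # dispatches to describe.default
```

`plot` works along the same lines, which is why the next step is to find out which methods it has.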

To learn more about the `plot` methods, we just type this:

` > methods(plot)`

This is what I get on my installation:

     [1] plot.acf*           plot.data.frame*    plot.decomposed.ts*
     [4] plot.default        plot.dendrogram*    plot.density*
     [7] plot.ecdf           plot.factor*        plot.formula*
    [10] plot.function       plot.hclust*        plot.histogram*
    [13] plot.HoltWinters*   plot.isoreg*        plot.lm*
    [16] plot.medpolish*     plot.mlm*           plot.ppr*
    [19] plot.prcomp*        plot.princomp*      plot.profile.nls*
    [22] plot.raster*        plot.spec*          plot.stepfun
    [25] plot.stl*           plot.table*         plot.ts
    [28] plot.tskernel*      plot.TukeyHSD*

Among these methods, `plot.data.frame` is naturally the one we are interested in. And R helps here too:

` > ?plot.data.frame `

    # -------------------
    plot.data.frame          package:graphics          R Documentation

    Plot Method for Data Frames

    Description:

         ‘plot.data.frame’, a method for the ‘plot’ generic.  It is
         designed for a quick look at numeric data frames.

    Usage:

         ## S3 method for class 'data.frame'
         plot(x, ...)

    Arguments:

           x: object of class ‘data.frame’.

         ...: further arguments to ‘stripchart’, ‘plot.default’ or
              ‘pairs’.

    Details:

         This is intended for data frames with _numeric_ columns.  For
         more than two columns it first calls ‘data.matrix’ to convert
         the data frame to a numeric matrix and then calls ‘pairs’ to
         produce a scatterplot matrix).  This can fail and may well be
         inappropriate: for example numerical conversion of dates will
         lose their special meaning and a warning will be given.

         For a two-column data frame it plots the second column against
         the first by the most appropriate method for the first column.

         For a single numeric column it uses ‘stripchart’, and for other
         single-column data frames tries to find a plot method for the
         single column.

    See Also:

         ‘data.frame’

    Examples:

         plot(OrchardSprays[1], method = "jitter")
         plot(OrchardSprays[c(4,1)])
         plot(OrchardSprays)

         plot(iris)
         plot(iris[5:4])
         plot(women)
    # -------------------

### Reading Details

Most R help pages share the same structure, made up of sections (Title, Description, Usage, etc.).

Some sections may be more important than others, depending on what we want to know. For instance, if we only want to know in general terms what the function is about, reading the terse 'Description' might be enough. If we already know what the function does but have forgotten how a particular argument is used, we can look at the 'Arguments' section. If we really want to know exactly what the function does, we need to read 'Details' and 'Examples'.

A superficial reading just doesn't work; we have to take the time to read thoroughly. Let's do it now. Even without fully understanding everything, a careful reading lets us outline what this function does:

- If the given data frame consists of more than two columns, it ideally displays a *scatterplot matrix*.
- If the data frame has two columns, it displays a suitable *y* versus *x* plot, where *x* is the first column and *y* the second.
- If the data frame has a single numeric column, it generates a *stripchart*; if that column is not numeric, a suitable plot [not specified] is displayed.
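The outline above can be written down as a small decision function. This is only a sketch of my reading of the help page: `expected_plot` is a hypothetical helper invented here, it draws nothing, and it assumes a data frame with at least one column.

```r
# Hypothetical helper (not part of R): names the kind of plot the help
# page leads us to expect for a given data frame. It does not plot.
expected_plot <- function(df) {
  stopifnot(is.data.frame(df), ncol(df) >= 1)
  if (ncol(df) > 2) {
    "scatterplot matrix"
  } else if (ncol(df) == 2) {
    "y versus x plot"
  } else if (is.numeric(df[[1L]])) {
    "stripchart"
  } else {
    "plot chosen by the column's own method"
  }
}

expected_plot(iris)     # five columns
expected_plot(women)    # two columns
expected_plot(iris[1])  # a single numeric column
expected_plot(iris[5])  # a single factor column
```

Both `iris` and `women` ship with a basic R installation and appear in the 'Examples' section of the help page.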

### Reading examples

So far so good, but a picture, or an example, is worth a thousand words. And R usually excels at providing ready-made examples for us. This is critically important when reading the R documentation. If examples are given, please don't skim them; work through them. Even more, rather than just running them via `example(function_name)`, explore them on the console, one by one, at least until you reach a good enough understanding. Let's move on to the 'Examples' section.

The first three examples apply `plot` to the data set `OrchardSprays`, which a basic R installation provides by default. This data set has the following structure:

    > str(OrchardSprays)
    'data.frame':   64 obs. of  4 variables:
     $ decrease : num  57 95 8 69 92 90 15 2 84 6 ...
     $ rowpos   : num  1 2 3 4 5 6 7 8 1 2 ...
     $ colpos   : num  1 1 1 1 1 1 1 1 2 2 ...
     $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1 3 2 ...

So four columns, the first three numeric, and the last one categorical (a *factor* in R parlance).

The first example passes a data frame with a single column, a one-column subset drawn from the `OrchardSprays` data frame. It is therefore an example of the case where the data frame is made of a single numeric variable. We should expect a *stripchart*, as documented, and that is what we get:

` > plot(OrchardSprays[1], method = "jitter")`

The second example passes this data frame:

    > str(OrchardSprays[c(4, 1)])
    'data.frame':   64 obs. of  2 variables:
     $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1 3 2 ...
     $ decrease : num  57 95 8 69 92 90 15 2 84 6 ...

So this is an example of the second case described above, where the first column (*x*) is categorical and the second (*y*) is numerical. A suitable plot would be a series of *boxplots*, one for each level of the categorical variable. And that's exactly what we obtain:

` > plot(OrchardSprays[c(4, 1)])`

The third example passes the whole data frame. An illustration of the first case mentioned in the doc. And we get the corresponding *scatterplot matrix*:

` > plot(OrchardSprays)`
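As a cross-check, each of the three examples can be compared with a direct call to the function that the 'Details' section names for that case. This is a sketch based on my reading of the help page, not a statement about how the method is implemented internally; the axis labels in the second call are set by hand.

```r
# Case 1, a single numeric column: the help page says 'stripchart' is used.
stripchart(OrchardSprays$decrease, method = "jitter")

# Case 2, factor x and numeric y: plot() on a factor draws one boxplot
# per level of the factor.
plot(OrchardSprays$treatment, OrchardSprays$decrease,
     xlab = "treatment", ylab = "decrease")

# Case 3, more than two columns: convert with data.matrix(), then pairs().
pairs(data.matrix(OrchardSprays))
```

Running each call next to its `plot(...)` counterpart and comparing the two pictures is a quick way to confirm that the documentation has been understood.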

I leave it to the reader to investigate the last three examples. The only new case is `plot(women)`, where the input is a data frame with two columns, both numeric this time.
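A reasonable first step with those remaining examples is to look at what each call actually passes to `plot` before drawing anything. A minimal sketch, using only the built-in data sets named in the help page:

```r
# Column classes of the two-column data frame used in plot(women).
women_classes <- sapply(women, class)
women_classes

# iris[5:4] puts the factor column first and a numeric column second.
str(iris[5:4])

# The full iris data frame has more than two columns.
ncol(iris)
```

With that information in hand, the outline from the 'Details' section tells us which kind of plot to expect in each case.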

I honestly believe that reading this doc (like many other R docs) gives more reliable information about the function at hand than googling for hours or skimming through dozens of books.

### Getting and reading the source code

One more resource is available to users: the source code itself, which is obviously the definitive answer to all questions.

Many R functions are implemented in R itself, and an intermediate R user should be able to read and understand the implementation, at least to some extent. Yes, many core functions are written in C, and those are out of reach unless you are a professional programmer with enough time to navigate the whole C code base and make sense of it. But we can always give it a try, just in case.

As for the function `plot.data.frame`, we are lucky.

There is an initial difficulty, though: locating the source. Methods that the `methods` function mentioned above lists with a * suffix are functions whose code cannot be reached by just typing the function name, as usual. The source code is still accessible. In particular, when the function is an S3 method, as `plot.data.frame` is [I omit any discussion of S3 vs. S4 methods; see the R manuals at cran.r-project.org for more info], we can use, among others, either of these two instructions:

` > getS3method("plot", "data.frame")`

or

` > getAnywhere("plot.data.frame")`

Either one displays this code [line numbers added for the commentary below]:

     1 function (x, ...)
     2 {
     3     plot2 <- function(x, xlab = names(x)[1L], ylab = names(x)[2L], ...) plot(x[[1L]], x[[2L]], xlab = xlab, ylab = ylab, ...)
     4     if (!is.data.frame(x))
     5         stop("'plot.data.frame' applied to non data frame")
     6     if (ncol(x) == 1) {
     7         x1 <- x[[1L]]
     8         cl <- class(x1)
     9         if (cl %in% c("integer", "numeric"))
    10             stripchart(x1, ...)
    11         else plot(x1, ...)
    12     }
    13     else if (ncol(x) == 2) {
    14         plot2(x, ...)
    15     }
    16     else {
    17         pairs(data.matrix(x), ...)
    18     }
    19 }
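The retrieved method is an ordinary R function, so it can also be stored and inspected rather than only printed. A minimal sketch, assuming a standard R session where the `utils` package is attached (the variable names `m` and `found` are mine):

```r
# Ask the S3 registry for the data.frame method of the plot generic.
m <- getS3method("plot", "data.frame")

is.function(m)     # TRUE
names(formals(m))  # "x" "..."

# getAnywhere() additionally records where it found the object.
found <- getAnywhere("plot.data.frame")
found$where
```

The body printed by a newer R version may differ in detail from the listing in this post, so it is worth comparing against your own installation.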

A concise commentary:

The function takes one mandatory argument, assumed to be a data frame, and an arbitrary number of optional arguments that are passed on, through inner function calls, to other graphics functions [line 1].

It defines an inner function, `plot2`, that in turn calls `plot` with the first and second columns of the given data frame and sets the axis labels from the column names. This function is used later for one of the cases mentioned in the documentation, the one where the data frame has two columns [line 3].

The function exits with an error message if the argument passed is not a data frame [lines 4-5]. This is just the usual defensive check for arguments that are not of the assumed type.

Next comes the main part [lines 6ff.], which conditionally selects a kind of graph depending on the number and, if required, the class of the columns in the data frame.

If the data frame has one column and its class is "numeric" or "integer", a *stripchart* is displayed; otherwise (the column is of another class) it relies on the generic `plot` again to generate the appropriate plot [lines 6-11].

If the data frame has two columns, the previously defined `plot2` function is called, so that a suitable *y* vs. *x* plot is obtained [lines 13-14].

Finally, if the number of columns is greater than 2, the remaining case, the data frame is coerced to a numeric matrix with `data.matrix` and the function `pairs` is applied to the result of the coercion [lines 16-17].
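Before plotting, it can help to look at what that coercion produces. A minimal sketch using the `OrchardSprays` data set from the examples (the variable name `m` is mine):

```r
# Coerce the whole data frame to a numeric matrix, as the method does
# before calling pairs().
m <- data.matrix(OrchardSprays)

is.matrix(m)   # TRUE
is.numeric(m)  # TRUE
dim(m)         # 64 rows, 4 columns

# The factor column 'treatment' is now stored as integer codes.
head(m, 3)
```

The levels of `treatment` ("A", "B", ...) have been replaced by their integer codes, which is why the help page warns that this conversion may well be inappropriate for non-numeric columns.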

To get a grasp of this last step, here is my challenge: read `?data.matrix` and `?pairs` now.

Have fun and happy R reading!
