From my experience as a teaching assistant on several introductory R MOOCs, I get the impression that beginners, and even intermediate users, assume that the R documentation is only for experts and, as a consequence, don't read it in the first place.
This is an unfortunate prejudice, since R's built-in documentation is one of its most remarkable features, and it is there precisely to help users. Even though (I admit) some help pages are pretty technical, many others are perfectly readable, even for beginners.
In this post I'll try to show the strategy I typically follow when dealing with R doc pages.
Let's take as a pretext the following question, posted in one of those MOOCs:
What does the function plot do when its input is a data frame?
Getting the right doc
The first thing we need to do is to call the help for plot [I show only the first few lines of the result]:
> ?plot
# -------------------
plot package:graphics R Documentation
Generic X-Y Plotting
Description:
Generic function for plotting of R objects. For more details about
the graphical parameter arguments, see ‘par’.
...
# -------------------
Note the first sentence in the 'Description' section. It tells us that plot is a generic function for plotting R objects. Not too much, is it? But still something. For our purposes, generic function means that we need to search for a method (= another function), which is what will actually be called depending on the class of the object we pass to plot. [Since my intent is to guide beginners, I omit all discussion of technicalities, such as the exact meaning of method and function in R, or the difference between generic and non-generic functions. For the moment, take those terms and others of this sort just as jargon we are liberally using to talk about these things.]
To know more about plot methods we just type the following. This is what I get on my installation:
> methods(plot)
[1] plot.acf* plot.data.frame* plot.decomposed.ts*
[4] plot.default plot.dendrogram* plot.density*
[7] plot.ecdf plot.factor* plot.formula*
[10] plot.function plot.hclust* plot.histogram*
[13] plot.HoltWinters* plot.isoreg* plot.lm*
[16] plot.medpolish* plot.mlm* plot.ppr*
[19] plot.prcomp* plot.princomp* plot.profile.nls*
[22] plot.raster* plot.spec* plot.stepfun
[25] plot.stl* plot.table* plot.ts
[28] plot.tskernel* plot.TukeyHSD*
Among these methods, plot.data.frame is naturally the one we are interested in. And R helps here too:
> ?plot.data.frame
# -------------------
plot.data.frame package:graphics R Documentation
Plot Method for Data Frames
Description:
‘plot.data.frame’, a method for the ‘plot’ generic. It is
designed for a quick look at numeric data frames.
Usage:
## S3 method for class 'data.frame'
plot(x, ...)
Arguments:
x: object of class ‘data.frame’.
...: further arguments to ‘stripchart’, ‘plot.default’ or ‘pairs’.
Details:
This is intended for data frames with _numeric_ columns. For more
than two columns it first calls ‘data.matrix’ to convert the data
frame to a numeric matrix and then calls ‘pairs’ to produce a
scatterplot matrix). This can fail and may well be inappropriate:
for example numerical conversion of dates will lose their special
meaning and a warning will be given.
For a two-column data frame it plots the second column against the
first by the most appropriate method for the first column.
For a single numeric column it uses ‘stripchart’, and for other
single-column data frames tries to find a plot method for the
single column.
See Also:
‘data.frame’
Examples:
plot(OrchardSprays[1], method = "jitter")
plot(OrchardSprays[c(4,1)])
plot(OrchardSprays)
plot(iris)
plot(iris[5:4])
plot(women)
# -------------------
Reading Details
Most R help pages share the same structure, consisting of different sections (Title, Description, Usage, etc.).
Some sections may be more important than others, depending on what we want to know. For instance, if we only want to know what this function is about in general terms, it might be enough to read the terse 'Description'. Or, if we already know what the function does but have forgotten some particular use of a certain argument, we can look into the 'Arguments' section. If we really want to know exactly what the function does, we will need to read 'Details' and 'Examples'.
A superficial reading just doesn't work; we have to take the time to read thoroughly. Let's do it now.
Even without fully understanding everything, a careful reading allows us to outline what this function does:
With more than two columns, the data frame is converted to a numeric matrix and a scatterplot matrix is drawn with pairs.
With two columns, the second column is plotted against the first, by the method most appropriate for the first column.
With a single column, a stripchart is drawn if the column is numeric; otherwise R looks for a plot method suited to that column.
Reading examples
So far so good, but an image, an example, is worth a thousand words. And R usually excels in providing ready-made examples for us. This is critically important when reading the R documentation. If examples are given, please don't skim them; work them out. Even more, rather than just running them via example(function_name), explore them on the console, one by one, at least until you reach a good enough understanding. Let's go on to the 'Examples' section.
The first three examples apply plot to the data set OrchardSprays, which a basic R installation provides by default. This data set has the following structure:
> str(OrchardSprays)
'data.frame': 64 obs. of 4 variables:
$ decrease : num 57 95 8 69 92 90 15 2 84 6 ...
$ rowpos : num 1 2 3 4 5 6 7 8 1 2 ...
$ colpos : num 1 1 1 1 1 1 1 1 2 2 ...
$ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1 3 2 ...
So four columns, the first three numeric, and the last one categorical (a factor in R parlance).
The first example passes a data frame with a single column, a one-column subset drawn from the OrchardSprays data frame. It is therefore an example of the case where the data frame is made of a single numeric variable. We should expect a stripchart, as documented, and that is what we get:
> plot(OrchardSprays[1], method = "jitter")
The second example passes this data frame:
> str(OrchardSprays[c(4, 1)])
'data.frame': 64 obs. of 2 variables:
$ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1 3 2 ...
$ decrease : num 57 95 8 69 92 90 15 2 84 6 ...
So it is an example of the second case described above, where the first column (x) is categorical and the second (y) is numerical. A suitable plot would be a series of boxplots, one for each level of the categorical variable. And that's exactly what we obtain:
> plot(OrchardSprays[c(4, 1)])
The third example passes the whole data frame, an illustration of the first case mentioned in the doc. And we get the corresponding scatterplot matrix:
> plot(OrchardSprays)
I leave the reader to investigate the last three examples. The only new case is plot(women), where the input is a data frame with two columns, but both numeric in this case.
I honestly believe that reading this doc (like many other R docs) gives more reliable information about the function at hand than googling for hours or skimming through dozens of books.
Getting and reading the source code
One more thing is still available to users: the source code itself, which obviously is the definitive answer to all questions.
Many R functions are implemented in R itself, and an intermediate R user should be able to read and understand the implementation, to some extent at least. Yes, many core functions are written in C, and those are beyond the reach of anyone who is not a professional programmer with sufficient time to invest in navigating the entire C code base and making sense of it. But we can always give it a try, just in case.
As for the function plot.data.frame, we are lucky. There is an initial difficulty, though: locating the source. Methods listed by the above-mentioned methods function that come suffixed with * are functions whose code cannot be reached by just typing the function name, as usual. The source code is still accessible. In particular, when the function is an S3 method, as plot.data.frame is [I omit commenting on S3 vs. S4 methods; see the R manuals at cran.r-project.org for more info], we have, among maybe others, either of these two instructions:
> getS3method("plot", "data.frame")
or
> getAnywhere("plot.data.frame")
that displays this code [line numbers added for commenting below]:
1 function (x, ...)
2 {
3 plot2 <- function(x, xlab = names(x)[1L], ylab = names(x)[2L],
...) plot(x[[1L]], x[[2L]], xlab = xlab, ylab = ylab,
...)
4 if (!is.data.frame(x))
5 stop("'plot.data.frame' applied to non data frame")
6 if (ncol(x) == 1) {
7 x1 <- x[[1L]]
8 cl <- class(x1)
9 if (cl %in% c("integer", "numeric"))
10 stripchart(x1, ...)
11 else plot(x1, ...)
12 }
13 else if (ncol(x) == 2) {
14 plot2(x, ...)
15 }
16 else {
17 pairs(data.matrix(x), ...)
18 }
19 }
A concise commentary:
The function takes a mandatory argument, a data frame by assumption, and an arbitrary number of optional arguments, possibly passed on by inner function calls to other graphics functions [line 1].
It defines an inner function, plot2, that in turn calls plot with the first and second columns of the given data frame and sets appropriate axis labels for the resulting plot. This function will be used later for one of the possible cases mentioned in the documentation, where the data frame has two columns [line 3].
The function exits with an error message if the argument passed is not a data frame [lines 4-5]. This is just the usual defensive check to handle arguments that don't match the assumed type of input.
Next comes the main part [lines 6ff.], which conditionally selects a kind of graph depending on the number and, if required, the class of the columns in the data frame.
If the data frame has one column and it is "numeric" or "integer", a stripchart is displayed; otherwise (the column is of another class) it relies on the generic plot again to generate the appropriate plot [lines 6-11].
If the data frame has two columns, the previously defined plot2 function is called, so that a suitable y vs. x plot is obtained [lines 13-14].
Finally, in the remaining case, where the number of columns is greater than 2, the data frame is coerced to a matrix with data.matrix and the function pairs is applied to the result of the coercion [lines 16-17].
To get a grasp on this last thing, here is my challenge: read now ?data.matrix and ?pairs.
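As a starting point for that challenge, here is a minimal sketch that mirrors the branching of plot.data.frame without drawing anything. The helper which_branch is hypothetical (it is not part of R) and simplifies the class test by looking only at the first element of class(). The last lines show what data.matrix does to the factor column of OrchardSprays before pairs gets to see it.

```r
# Hypothetical helper, for illustration only (not part of R):
# returns the name of the branch that plot.data.frame would follow.
which_branch <- function(x) {
  stopifnot(is.data.frame(x))
  if (ncol(x) == 1L) {
    # Simplification: test only the first element of the class vector
    if (class(x[[1L]])[1L] %in% c("integer", "numeric")) "stripchart" else "plot"
  } else if (ncol(x) == 2L) {
    "plot2"
  } else {
    "pairs"
  }
}

which_branch(OrchardSprays[1])        # a single numeric column
which_branch(OrchardSprays[4])        # a single factor column
which_branch(OrchardSprays[c(4, 1)])  # two columns
which_branch(OrchardSprays)           # four columns

# data.matrix() replaces the factor 'treatment' by its integer codes,
# so every column of the result is numeric and pairs() can plot it.
m <- data.matrix(OrchardSprays)
head(m, 3)
```

Comparing head(m, 3) with the str(OrchardSprays) output shown earlier makes the coercion visible: the treatment levels are replaced by their internal codes.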
Have fun and happy R reading!