The package boot
has elegant and powerful support for
bootstrapping. In order to use it, you have to repackage your
estimation function as follows.
R has very elegant and abstract notation in array indexes. Suppose
there is an integer vector OBS
containing the elements 2,
3, 7, i.e. that OBS <- c(2,3,7);
. Suppose x is a
vector. Then the notation x[OBS]
is a vector containing
elements x[2], x[3] and x[7]. This beautiful notation works for x as a
dataset (data frame) also. Here are demos:
# For vectors -- > x = c(10,20,30,40,50) > d = c(3,2,2) > x[d] [1] 30 20 20 # For data frames -- > D = data.frame(x=seq(10,50,10), y=seq(500,100,-100)) > t(D) 1 2 3 4 5 x 10 20 30 40 50 y 500 400 300 200 100 > D[d,] x y 3 30 300 2 20 400 2.1 20 400
Now for the key point: how does the R boot package work? The R
package boot
repeatedly calls your estimation function,
and each time, the bootstrap sample is supplied using an integer
vector of indexes like above. Let me show you two examples of how you
would write estimation functions which are compatible with the
package:
samplemean <- function(x, d) { return(mean(x[d])) } samplemedian <- function(x, d) { return(median(x[d])) }
The estimation function (that you write) consumes data
x
and a vector of indices d
. This function
will be called many times, one for each bootstrap replication. Every
time, the data `x' will be the same, and the bootstrap sample `d' will
be different.
At each call, the boot package will supply a fresh set of indices d. The notation x[d] allows us to make a brand-new vector (the bootstrap sample), which is given to mean() or median(). This reflects sampling with replacement from the original data vector.
Once you have written a function like this, here is how you would obtain bootstrap estimates of the standard deviation of the distribution of the median:
b = boot(x, samplemedian, R=1000) # 1000 replications
The object `b' that is returned by boot() is interesting and
useful. Say ?boot to learn about it. For example, after making
b
as shown above, you can say:
print(sd(b$t[,1]))
Here, I'm using the fact that b$t is a matrix containing 1000 rows which holds all the results of estimation. The 1st column in it is the only thing being estimated by samplemedian(), which is the sample median.
The default plot() operator does nice things when fed with this
object. Try it: say plot(b)
Here is an example, which uses the bootstrap to report the ratio of two standard deviations:
library(boot) sdratio <- function(D, d) { E=D[d,] return(sd(E$x)/sd(E$y)) } x = runif(100) y = 2*runif(100) D = data.frame(x, y) b = boot(D, sdratio, R=1000) cat("Standard deviation of sdratio = ", sd(b$t[,1]), "\n") ci = boot.ci(b, type="basic") cat("95% CI from ", ci$basic[1,4], " - ", ci$basic[1,5], "\n")
Note the beautiful syntax E = D[d,]
which gives you a
data frame E using the rows out of data frame D that are specified by
the integer vector d.
Many times, you want to send additional things to your estimation function. You're allowed to say whatever you want to boot(), after you have supplied the two mandatory things that he wants. Here's an example: the trimmed mean.
The R function mean() is general, and will also do a trimmed mean. If you say mean(x, 0.1), then it will remove the most extreme 10% of the data at both the top and the bottom, and report the mean of the middle 80%. Suppose you want to explore the sampling characteristics of the trimmed mean using boot(). You would write this:
trimmedmean <- function(x, d, trim=0) { return(mean(x[d], trim/length(x))) }
Here, I'm defaulting trim to 0. And, I'll allowing the caller to talk in the units of observations, not fractions of the data. So the user would say "5" to trim off the most extreme 5 observations at the top and the bottom. I convert that into fractions before feeding this to mean().
Here's how you would call boot() using this:
b = boot(x, trimmedmean, R=1000, trim=5)
This sends the extra argument trim=5 to boot, which sends it on to our trimmedmean() function.
The boot() function is very powerful. The above examples only scratch the surface. Among other things, it does things like the block bootstrap for time-series data, randomly censored data, etc. The manual can be accessed by saying:
library(boot) ?boot
but what you really need is the article Resampling Methods in R:
The boot
package by Angelo J. Canty, which appeared
in the December 2002 issue of R News.
Also see the web appendix to An R and S-PLUS Companion to Applied Regression by John Fox [pdf], and a tutorial by Patrick Burns [html].
Return to R by example
Ajay Shah
ajayshah at mayin dot org