Brian Ripley created the following tutorial, which was posted to s-news on 7 Aug 92 and is archived in digest81 on statlib.
This is a series of hints on memory usage in S, from a real teaching example. We are running S-Plus 3.0 (August 1991 S) on Sun Sparcs; the examples were computed on an IPC with 12Mb RAM and 32Mb swap on a local disc.
Brian Ripley ripley@stats.ox.ac.uk
Consider a shoe experiment with 10 boys, reported in Box, Hunter & Hunter (1978), Statistics for Experimenters. Two materials (A and B) were randomly assigned to the left or right shoe of each boy:
shoes <- scan(,list(L=0, R=0))
13.2 14.0 8.2 8.8 11.2 10.9 14.3 14.2 11.8 10.7
6.6 6.4 9.5 9.8 10.8 11.3 9.3 8.8 13.3 13.6

attach(shoes)
t.test(L, R, paired=T)
The sample size is rather small, and one might wonder about the validity of the t-distribution. An alternative for a randomized experiment such as this is to base inference on the permutation distribution of the differences d = L - R. Computation shows that the agreement is very good, but that computation causes problems in S.
The most obvious way to explore the permutation distribution of the t-test of d = L-R is to select random permutations. The supplied S-Plus function t.test computes much more than we need, and so is rather slow (about 0.7 secs on a Sun SparcStation SLC). It is simple to write a replacement function that does exactly what we need.
attach(shoes)
d <- L-R
ttest <- function(x) mean(x)/sqrt(var(x)/length(x))
n <- 100
res <- numeric(n)
for(i in 1:n) {
  x <- d*sign(runif(10)-0.5)
  res[i] <- ttest(x)
  print(c(i, memory.size()))
}
This took about 70 secs and used an additional 1Mb of memory! Increasing the sample size to 1000 causes seriously antisocial paging activity, and takes about 3 hours.
As the permutation distribution has only 2^10 = 1024 points we can explore it directly:
n <- 1024
perm.res <- numeric(n)
for(i in 1:n) {
  j <- i; x <- d
  for(k in 1:10) { x[k] <- x[k]*(2*(j%%2)-1); j <- j%/%2 }
  perm.res[i] <- ttest(x)
  print(c(i, memory.size()))
}
par(mfrow=c(1,2))
hist(perm.res, 25, probability=T, xlab="diff")
x <- seq(-4, 4, 0.1)
lines(x, dt(x,9))
sres <- c(sort(perm.res), 4)
yres <- (0:1024)/1024
plot(sres, yres, type="S", xlab="diff", ylab="")
lines(x, pt(x,9), lty=3)
legend(-5, 1.05, c("Permutation dsn","t_9 cdf"), lty=c(1,3))
which took about 5 hours and 17Mb of memory.
The problem is that S does not release any memory until the loop is completed, so the memory usage is linear in the size of the loop. It may help to encapsulate the contents of the loop in a function:
n <- 100
res <- numeric(n)
test.t <- function(x) {
  res <- ttest(d*sign(runif(10)-0.5))
  print(c(i, memory.size()))
  res
}
for(i in 1:n) res[i] <- test.t(x)
but the resources used in this instance are virtually unchanged. We can of course run 10 loops of length 100, provided these are done sequentially and not in a loop:
attach(shoes)
d <- L-R
ttest <- function(x) mean(x)/sqrt(var(x)/length(x))
test.t <- function(x) {
  res <- ttest(d*sign(runif(10)-0.5))
  print(c(i, memory.size()))
  res
}
res <- numeric(1000)
for(i in 1:100) res[i] <- test.t(x)
for(i in 101:200) res[i] <- test.t(x)
for(i in 201:300) res[i] <- test.t(x)
for(i in 301:400) res[i] <- test.t(x)
for(i in 401:500) res[i] <- test.t(x)
for(i in 501:600) res[i] <- test.t(x)
for(i in 601:700) res[i] <- test.t(x)
for(i in 701:800) res[i] <- test.t(x)
for(i in 801:900) res[i] <- test.t(x)
for(i in 901:1000) res[i] <- test.t(x)
but there is yet another catch. If this is run from a file with source() the memory is not released after each loop. It is necessary to put these instructions in a file t.test.s and use
Splus < t.test.s
or to cut-and-paste the instructions into a window running S. With an input file, it can be helpful to include
options(echo=T)
to echo the input. This approach can be taken as far as needed, even to listing each step of the loop in the file to minimize memory usage.
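As a sketch of such an input file (the file name and the chunk size of 250 are illustrative choices, not from the original post), t.test.s might contain:

options(echo=T)                 # echo each command to the output
attach(shoes)                   # assumes shoes already exists in the working .Data
d <- L-R
ttest <- function(x) mean(x)/sqrt(var(x)/length(x))
res <- numeric(1000)
# with Splus < t.test.s (not source()), memory is released after each of these loops
for(i in 1:250) res[i] <- ttest(d*sign(runif(10)-0.5))
for(i in 251:500) res[i] <- ttest(d*sign(runif(10)-0.5))
for(i in 501:750) res[i] <- ttest(d*sign(runif(10)-0.5))
for(i in 751:1000) res[i] <- ttest(d*sign(runif(10)-0.5))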
For extensive runs you may want to use the BATCH mode of S, run from the Unix command line as
Splus BATCH infile outfile
which runs a background job (at reduced priority 5) taking input from infile and writing output to outfile. (Note that options(echo=T) is set automatically in this mode.)
The corresponding command under Windows is
Splus /BATCH infile outfile
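For example (the file names here are only illustrative), the chunked loops shown above could be placed in a file boot.in and run with

Splus BATCH boot.in boot.out

(or Splus /BATCH boot.in boot.out under Windows). Since S assignments are permanent, the vector res built up in the batch run can be examined later in an interactive session.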
Generating a bootstrap sample in S is very easy:
sample(x, replace=T)
samples length(x) items from x with replacement. The difficulty comes from the memory problems described above. We can avoid explicit loops by the following device:
n <- 100
A <- matrix(rep(d,n), 10, n)
res <- apply(A, 2, function(x) ttest(sample(x, replace=T)))
print(memory.size())
but this is once again subject to memory build-up, both to store A (which needs 8 x 10 x n bytes) and for the internal for loop inside apply. However, we can do the computation in chunks, from the command line or an input file:
n <- 250
A <- matrix(rep(d,n), 10, n)
res <- apply(A, 2, function(x) ttest(sample(x, replace=T)))
res <- c(res, apply(A, 2, function(x) ttest(sample(x, replace=T))))
res <- c(res, apply(A, 2, function(x) ttest(sample(x, replace=T))))
res <- c(res, apply(A, 2, function(x) ttest(sample(x, replace=T))))
which used about 3Mb and took 7 minutes. The alternative approach of a series of four explicit for() loops used essentially the same memory but took 40 seconds longer.
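For reference, that explicit-loop alternative might be written as follows (a sketch in the style of the earlier chunked loops, entered as four separate top-level commands so that memory is released between them):

res <- numeric(1000)
for(i in 1:250) res[i] <- ttest(sample(d, replace=T))
for(i in 251:500) res[i] <- ttest(sample(d, replace=T))
for(i in 501:750) res[i] <- ttest(sample(d, replace=T))
for(i in 751:1000) res[i] <- ttest(sample(d, replace=T))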
However they are done, S is not very suitable for long series of simulations of small problems. These are better done by an external program, or by calling a compiled Fortran or C routine from within S.
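As a rough sketch of the compiled-code route (this is not from the original post; the routine name boot_ttest, its argument list and the object file name are all hypothetical): write the simulation loop in C, compile it, load it into S with dyn.load(), and call it through the .C() interface, so that only the final vector of statistics is returned to S.

dyn.load("boot_ttest.o")          # hypothetical compiled C routine
nboot <- 1000
out <- .C("boot_ttest",
          as.double(d),           # the 10 differences
          as.integer(length(d)),
          as.integer(nboot),
          res = double(nboot))    # filled in by the C code with bootstrap t-statistics
res <- out$res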
If S must be used, the ultimate approach is to split the problem up into small pieces and to launch a child S process for each part. It helps that assignments are permanent. The basic style is
unix("Splus < infile >> outfile", output.to.S=F)
and as an example, create a file named t.test.s containing the commands

for(i in 1:100) res1[i] <- ttest(sample(d, replace=T))
res <- c(res, res1)
then within Splus
attach(shoes)
d <- L-R
res1 <- numeric(100)
res <- numeric(0)
ttest <- function(x) mean(x)/sqrt(var(x)/length(x))
for(i in 1:10) unix("Splus < t.test.s >> junk", output.to.S=F)
This took about 4Mb (as 2 S processes run) and 9 minutes. Note that loops can safely be used here, as the memory build-up occurs in the child process. However, the overhead of launching the S child process is large, so the parts should be fairly large.