
Nick Stokes Distance code, now with Big Memory

In my last post I was struggling to get a big-memory version of the distance matrix to run fast. Nick and other readers had some suggestions, and after puttering around with Nick's code I've adapted it to bigmemory without hurting the run time much. For comparison, writing a 10K by 10K matrix was taking minutes, and I projected about 24 hours for the Berkeley data. Nick came to the rescue with two really cool ideas. The first is a different method for calculating distance on the sphere. He expanded on my idea of precomputing sin() and cos() of the latitudes and longitudes, and developed an approach that computes those values once and then solves the distance equation with super-fast matrix math. On top of that he came up with a "block" approach. At first this looks like it will be slower, because we calculate the entire matrix as opposed to just the upper half. Nick's version wrote out a file for every block; in my version I write the blocks to a file-backed big.matrix. That's a bit slower, but I get one file that I can then access randomly. Here's the code:
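To see why the matrix trick works: each station at (lat, lon) becomes a 3D unit vector, the dot product of two unit vectors is the cosine of the central angle between them, and (1 - cos θ)/2 is exactly the haversine term, so distance = 2R·asin(√x) where x = 0.5 - z2·z2ᵀ after the √2 pre-scaling. Here is a minimal sketch of that identity in Python (illustration only; the post's actual code is R, and the function names here are mine):

```python
import math

R = 6371.0  # Earth radius in km; the R code uses D = 2*R

def to_unit_vector(lat_deg, lon_deg):
    """Map a lat/lon pair (degrees) to a 3D unit vector, as the R code does."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def chord_distance(p1, p2):
    """Great-circle distance via the unit-vector dot product (Nick's trick)."""
    z1, z2 = to_unit_vector(*p1), to_unit_vector(*p2)
    dot = sum(a * b for a, b in zip(z1, z2))   # cos(central angle)
    x = (1.0 - dot) / 2.0                      # what 0.5 - z2 %*% t(z2) computes
    return 2.0 * R * math.asin(math.sqrt(max(x, 0.0)))

def haversine(p1, p2):
    """Classic haversine formula, for comparison."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2.0 * R * math.asin(math.sqrt(h))
```

The two functions agree to rounding error; the payoff of the dot-product form is that all pairwise cosines drop out of one matrix multiply.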

createDistanceMatrix <- function(Inventory, blockSize = 20, Directory = getwd(),
                                 filename = "distance.bin") {
  # based on code by Nick Stokes; requires library(bigmemory)

  if (!grepl("\\.bin$", filename)) stop("please use a .bin extension")
  columnsPresent <- intersect(colnames(Inventory), c("Id", "Lon", "Lat"))
  if (length(columnsPresent) != 3) stop("missing the correct column names")
  descName <- sub("bin$", "desc", filename)
  if (is.matrix(Inventory)) Inventory <- as.data.frame(Inventory)

  L <- cbind(Inventory$Lat, Inventory$Lon) * (pi / 180)
  s <- cbind(cos(L), sin(L))
  # z is the array of 3D coords of stations. All unit vectors.
  z <- cbind(s[, 1] * s[, 2], s[, 1] * s[, 4], s[, 3])

  z2 <- z / sqrt(2)  # better to multiply now than after the outer product
  D <- 6371 * 2      # diameter of the Earth (km)

  n <- nrow(L)
  if (n <= blockSize) stop("blockSize must be less than the rows in Inventory")
  blocks <- n %/% blockSize
  if ((n %% blockSize) > 0) blocks <- blocks + 1
  dex <- 1:blockSize

  BM <- filebacked.big.matrix(nrow = n, ncol = n,
                              dimnames = list(as.list(Inventory$Id), NULL),
                              init = NA,
                              backingpath = Directory,
                              backingfile = filename,
                              descriptorfile = descName,
                              type = "double")
  startTime <- Sys.time()

  for (i in 1:blocks) {
    p <- dex + (i - 1) * blockSize
    p <- p[p <= n]
    for (j in 1:blocks) {
      q <- dex + (j - 1) * blockSize
      q <- q[q <= n]
      # the small offset above 0.5 guards sqrt() against tiny negative rounding
      x <- 0.5000000001 - z2[p, ] %*% t(z2[q, ])
      x <- D * asin(sqrt(x))
      BM[p, q] <- x
    }
  }
  print(Sys.time() - startTime)
  return(BM)
}
That takes 46 seconds: an orders-of-magnitude improvement.
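The blocking scheme itself isn't R-specific: fill an n×n file-backed matrix one blockSize×blockSize tile at a time, so only one tile lives in RAM at once. A rough Python/numpy analogue, with numpy.memmap standing in for filebacked.big.matrix (a sketch under my own naming, not the post's code):

```python
import numpy as np

def fill_blocked(path, coords, block=256):
    """Fill an n x n great-circle distance matrix tile by tile into a
    memory-mapped file. coords is an (n, 3) array of unit vectors,
    one row per station, as built in the R code above."""
    n = coords.shape[0]
    z2 = coords / np.sqrt(2.0)  # pre-scale once, as in the R code
    bm = np.memmap(path, dtype="float64", mode="w+", shape=(n, n))
    for i in range(0, n, block):
        p = slice(i, min(i + block, n))
        for j in range(0, n, block):
            q = slice(j, min(j + block, n))
            # small offset above 0.5 guards sqrt() against negative rounding
            x = 0.5000000001 - z2[p, :] @ z2[q, :].T
            bm[p, q] = 2 * 6371.0 * np.arcsin(np.sqrt(x))
    bm.flush()  # make sure the tiles reach disk
    return bm
```

The tile size trades memory for fewer, larger matrix multiplies; the file on disk can then be re-opened read-only and randomly accessed, which is the same property the big.matrix descriptor file gives you in R.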

  1. April 15, 2012 at 6:41 PM | #1

    Genuine statement through current NASA employee …

    “My son is a nuclear physicist with NASA and knows GHG theory is bogus and NASA distorts AGW data. BHO won’t allow Civil Servants to express skepticism. James Hanson is actual spokesman for NASA on GHG theory, picked by Al Gore and BHO”

    This was said by a father to a contact I have who spoke personally with that father on April 13th.

  2. BillC
    April 26, 2012 at 1:55 AM | #2


    If you get a chance please check out my comment at Lucia’s:


  3. wayne
    May 9, 2012 at 6:12 AM | #3

    “… I projected about 24 hours for berkeley data …”

    Steven, I haven’t visited your site in a while, but I took a look at what you are doing in ‘R’ and I just have to ask: why? I have done some analysis on Berkeley’s data and the vast majority of the runs execute in seconds, not hours, but I did something you did not do. The first thing was to collapse the 18+ gigabytes of unpacked data they supply into an intelligent equivalent file format, shrinking those 18 gigabytes to right at 600 megabytes. That is 30:1, and my files are not compressed, though they are binary.

    Now my routines all run lightning fast, and with all longitudes and latitudes already in radians, using the spherical distance formula to create a subsidiary file of all distances between each pair (~15500^2/2) should run in a couple of hours at most, start to finish. That is about 120 million distance calculations at 4 bytes each. Only half of the results actually needs to be stored if you use a (low location, high location) pair lookup into a table; the distance between two stations is of course the same in either direction.

    I’m just wondering why you are spinning your wheels so? Of course that type of work is exactly where my expertise lies, so most may never even consider taking that efficiency tack. If you would be interested in seeing how I collapsed their data to that 30:1 degree (and it is not ‘compressed’), comment back. To be honest, it was in the flag files where the vast majority of the savings came from, and you can probably imagine why. They collapsed at a ratio of 192+:1, and they are, by far, the largest of the files.

    Of course the collapsing routines did take many hours to run themselves, but that is a one-time cost, and you only have to pay it once! To me one CD of data is better than 30 CDs of data if it is, in fact, exactly the same data! I just timed loading the entire BEST dataset: 3 seconds. Averaging all temperature members with certain flags in a given latitude band: 6 seconds.
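wayne's "(low location, high location) pair lookup" is the standard packed-triangle indexing: since distance is symmetric, the n(n-1)/2 distinct pairs can live in one flat array. A small Python sketch of that index arithmetic (the function name and layout are my own illustration, not wayne's code):

```python
def pair_index(i, j, n):
    """Index of pair (i, j), i != j, in a flat array of length n*(n-1)//2
    that stores only the strict upper triangle of a symmetric matrix."""
    if i > j:
        i, j = j, i  # distance is symmetric, so canonicalize to (low, high)
    if i == j:
        raise ValueError("no self-distance stored")
    # rows 0..i-1 contribute (n-1) + (n-2) + ... + (n-i) entries before row i
    return i * n - i * (i + 1) // 2 + (j - i - 1)
```

At 4 bytes per entry, 15500 stations pack into about 15500*15499/2*4 bytes, roughly 480 MB, without storing the mirrored half.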

    • Steven Mosher
      May 9, 2012 at 6:49 AM | #4

      my goal is not to make something that saves data space or processing time. disk space is cheap. for processing the code executes in seconds. if i want speed ups ill use c under R. my goal is to create open tools for scientists. so i work to make apis that they can use. its also important for my work to keep things in formats that are easily understood and useable by others.

  4. MrPete
    July 21, 2012 at 5:17 PM | #5

    Hi Steven,
    First time visiting, at least in a long time.
    Your nearest-neighbor problem is one of a class of challenges faced by every GIS designer. In the previous thread there’s a comment that points to an efficient solution. The nice part is that the solution is now available in open source software. The pgSQL implementation was noted earlier (Descriptive link here), and includes a nice capability for retrieving N records based on distance. MySQL has spatial indexing and a whole set of useful geometry functions that provide the fundamentals. The latter is easily sufficient in most cases, but harder to use for problems like “what are the 10 nearest points” when you don’t know how far away those points might be.

    Bottom line: instead of pre-calculating all inter-point distances, it’s a ton faster to spatially index geo-objects using their known lat-lon bounding boxes (MBR == Minimum Bounding Rectangle), and use that index to retrieve a set of candidate objects.

    If you really need exact inter-point distances for the remaining set, this can be done on the fly.

    Under the hood the only magic trick involved, for those who care, is creating a good index in two dimensions. The state of the art when I was active was to alternate each “level” of the index tree in X and Y.

    Anyway, looks like you are having some fun! I’ll probably be back… I will soon have a home office computer with serious multi-core and big memory capabilities.

    (BTW — if you’ve not done so already, RUN out and get an SSD drive for your Big Memory caching. Zero seek time and 200-400 MB/sec transfer rates make a big difference.)
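MrPete's MBR prefilter can be sketched in a few lines: instead of computing every pairwise distance, keep only stations whose lat/lon fall inside a bounding box guaranteed to contain the search radius, then compute exact distances for that short candidate list. A minimal Python illustration (my own names; real GIS stacks like PostGIS use R-tree indexes for the same step):

```python
import math

def candidates_in_box(stations, lat0, lon0, radius_km, earth_r=6371.0):
    """Cheap rectangular prefilter: keep (lat, lon) stations inside a
    bounding box that is guaranteed to contain the radius_km circle.
    Note: does not handle antimeridian wrap-around."""
    dlat = math.degrees(radius_km / earth_r)
    # a degree of longitude shrinks with cos(latitude); guard near the poles
    coslat = max(math.cos(math.radians(lat0)), 1e-9)
    dlon = math.degrees(radius_km / (earth_r * coslat))
    return [s for s in stations
            if abs(s[0] - lat0) <= dlat and abs(s[1] - lon0) <= dlon]
```

The box may admit a few false positives in its corners, which the exact distance pass then discards; it never rejects a true neighbor, since the circle lies inside the box.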

    • Steven Mosher
      July 22, 2012 at 4:39 AM | #6

      Thanks Mr Pete. I wrote an MBR approach for the poster that Zeke, Nick and I did for AGU 2011. Very fast. Even that, however, hit the memory limits of R. I may go back and retool that algorithm for big memory. As it stands I just solved it by writing it out.

  5. MrPete
    July 21, 2012 at 6:00 PM | #7

    (SSD hint: get an SSD that does NOT use SandForce controllers. One good type: Vertex 4.)

  6. Tobias
    January 10, 2013 at 8:17 PM | #8

    Hi Steven,
    Do I get you right that computing a 10K by 10K matrix took you 46 seconds? Or is this the running time for the whole Berkeley data set?

    • Steven Mosher
      January 14, 2013 at 7:00 AM | #9

      We were just testing a 10K by 10K subset,
