
Using bigmemory for a distance matrix

UPDATE: the code is in the drop box for anybody who wants to play around. Change the extension from txt to R.

The process of working on metadata and temperature series gives rise to several situations where I need to calculate the distance from every station to every other station. With a small number of stations this can be done easily on the fly, with the result stored in a matrix. The matrix has rows and columns equal to the number of stations, and it is filled by looping through the various station combinations; of course, one need only calculate the distance for every pair once. When the number of stations increases to tens of thousands, as with the Berkeley Earth project, there are two challenges: first, the time the process takes, and second, the memory it consumes. Handling the space problem is relatively straightforward, and I decided to use bigmemory to store the data. The speed problem is an entirely different matter, but I'm making a little progress. In addition to storing the data, I also want a way to quickly retrieve the neighbors of a given station, and I want my solution to be relatively bulletproof and easy to understand.

To solve the problem I define two functions, createDistanceMatrix() and getNeighbors(). The code for each follows.

Let's start with createDistanceMatrix():

createDistanceMatrix <- function(Inventory, Directory = getwd(), filename = "distance.bin"){

    require(bigmemory)

    deg2rad <- function(deg) return(deg*pi/180)

    if(!grepl("\\.bin", filename)) stop("please use a bin extension")
    columnsPresent <- intersect(colnames(Inventory), c("Id","Lon","Lat"))
    if(length(columnsPresent) != 3) stop("missing the correct column names")
    descName <- sub("bin", "desc", filename)
    if(class(Inventory) == "matrix") Inventory <- as.data.frame(Inventory)

The function is declared with a variable for the Inventory of stations, a Directory where the file will be written, and a default filename. Next we indicate that the bigmemory library is required, define a local function deg2rad(), and do some simple input checking. I enforce the use of a specific extension (".bin"), which I use for all my file-backed big matrices. Then we check the inventory to make sure that it has the required columns, named "Id", "Lon" and "Lat". In my coding, Inventories are usually data.frames, but I allow for a matrix; if you pass a matrix, I coerce it locally to a data.frame. Finally, I create a new file name that is critical for future "reads" of the data: in bigmemory, when you write a file-backed matrix, a "descriptor file" is created to document the structure of the file. We will use this to attach the file in the future.
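As a quick check of that derivation, sub() simply swaps the extension (note that it replaces the first match of "bin", so keep "bin" out of the rest of the filename):

sub("bin", "desc", "distance.bin")   # returns "distance.desc"

Proceeding: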

Inventory[, "sinLat"] <- sin(deg2rad(Inventory$Lat))
Inventory[, "cosLat"] <- cos(deg2rad(Inventory$Lat))
Inventory$Lon <- deg2rad(Inventory$Lon)
ER <- 6371

These lines require a bit of explanation. To calculate the distance between stations I am going to use a great circle calculation. That calculation repeatedly takes the sin() and cos() of the latitude, so for speed I do that once for every station in the inventory. Also, R's trig functions take radians, so I do that transformation once as well. ER is the Earth's radius in kilometers.
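As a sanity check on the formula, here is a hand calculation for two hypothetical points on the equator, 90 degrees of longitude apart; the answer should be a quarter of the Earth's circumference:

ER <- 6371
lat1 <- 0; lon1 <- 0              # radians
lat2 <- 0; lon2 <- 90 * pi/180
acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(lon2 - lon1)) * ER
# [1] 10007.54    ( = 6371 * pi/2 )

Next we define some parameters for the file we want to create: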

nr <- nrow(Inventory)
nc <- nrow(Inventory)
options(bigmemory.allow.dimnames=TRUE)

The matrix we create and store as a file will be square. In truth I could make the matrix one row or one column smaller, but I've left it like this for now. Note as well that I set the option that allows dimnames, so the file-backed matrix can carry row and column names. This will let people use character Ids for their station Id in an inventory; that will be clearer as we progress. With those preliminaries we are ready to initialize the file-backed big matrix:

D <- filebacked.big.matrix(nrow = nr, ncol = nc,
                           dimnames = list(as.list(Inventory$Id), NULL),
                           init = NA,
                           backingpath = Directory,
                           backingfile = filename,
                           descriptorfile = descName,
                           type = "double")

Some things to note: I set the rows and columns to be equal. I initialize the data with NA; we only write into the upper triangle of the matrix, and I want the other values to be NA. I set the dimnames so that the row names are equal to the station Ids, which can be numbers or characters. When you seek on the file you'll use the station Id, so you could use characters ("USC26541328") for station Ids and the functions will still work. I also set the directory where the file is written, the file name, and the descriptor name. Note that the descriptor name is DERIVED from the filename by simply changing the extension: "distance.bin" has a descriptor file of "distance.desc". Type is set as "double". Next I do a couple of things that might look confusing. Sometimes a station has no latitude or longitude. I handle that by allowing those stations to remain in the Inventory, but I detect the problem and write a negative distance into the appropriate rows and columns. Our matrix will thus hold three kinds of values: distances, a negative filler, and NA for elements that don't have to be calculated (the lower triangle). I also determine which rows have valid Lat and Lon; that set will be used for loop control. Below we set the rows and columns of stations that have missing Lat/Lon to -9999.

missingLonLat <- which(is.na(Inventory$Lat) | is.na(Inventory$Lon))
validLonLat <- which(!is.na(Inventory$Lat) & !is.na(Inventory$Lon))
D[missingLonLat,] <- (-9999)
D[,missingLonLat] <- (-9999)
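Incidentally, the descriptor file written above is all a later R session needs to reopen the matrix. A minimal sketch, assuming the files sit in the working directory:

D2 <- attach.big.matrix(dget("distance.desc"))
dim(D2)   # same dimensions as the matrix we are building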

Time to loop over the data. The approach is simple. Imagine we had a 4x4 matrix: we want to generate the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). Our first loop control goes over 1-3 and the second goes over 2-4. Because some stations have missing Lat/Lon we will skip over them; that is the purpose of validLonLat, which contains the rows that are valid.
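For a concrete picture of that ordering, combn() enumerates exactly the pairs the loops below will visit:

combn(4, 2)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4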

for(i in validLonLat[-length(validLonLat)]){
    for(j in validLonLat[validLonLat > i]){

        D[i,j] <- acos(Inventory$sinLat[i] * Inventory$sinLat[j]
                       + Inventory$cosLat[i] * Inventory$cosLat[j]
                       * cos(Inventory$Lon[j] - Inventory$Lon[i])) * ER
        if(is.nan(D[i,j])) D[i,j] <- 0

    }
}

So, we start looping over the first element in validLonLat and loop to N-1. In the second loop we skip lower row numbers. The distance is calculated using the precomputed sinLat and cosLat columns, so I am basically reducing the number of trig calls by orders of magnitude. Finally, there is a nasty little warning we have to care about. Sometimes, when two stations have the same location, acos() will blow up. This happens when acos() is fed a number slightly greater than 1 or less than -1. It happens rarely, and the result is a NaN instead of the desired result of zero. Rather than prevent the warning, I allow the calculation and then catch and fix the error. I suppose I could trim sinLat or cosLat and prevent it.
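If you would rather prevent the NaN than repair it afterwards, clamping the argument into [-1, 1] before calling acos() would also work; a sketch of that alternative inner statement:

cosD <- Inventory$sinLat[i] * Inventory$sinLat[j] +
        Inventory$cosLat[i] * Inventory$cosLat[j] * cos(Inventory$Lon[j] - Inventory$Lon[i])
D[i,j] <- acos(pmin(pmax(cosD, -1), 1)) * ER   # the clamp absorbs the rounding error

In any case, when this finishes we have created a big matrix on file. Let's test it: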

Inventory <- data.frame(Id = seq(1,1000), Lat = runif(1000,-90,90), Lon = runif(1000,-180,180))

Inventory$Lat[unique(sample.int(1000,20))] <- NA
Inventory$Lon[unique(sample.int(1000,20))] <- NA
startTime <- Sys.time()
BD <- createDistanceMatrix(Inventory)
print(Sys.time() - startTime)

I create a random inventory of 1000 stations. I introduce some random NAs and time the function:

Time difference of 6.371714 mins

Creating a matrix of 1 million elements takes a bit over 6 minutes; for this we are calculating roughly 500K distances. I'm sure part of the slowness is the repeated writing I am doing to the file, so I'll have to investigate ways to buffer this, but for now it works. I could also see ways of using apply().
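One buffering idea I have not timed yet: compute all the distances for station i in memory and write them into the file-backed matrix in one shot, so there is one write per station instead of one per pair. A sketch, assuming bigmemory accepts a vector column index on assignment:

for(i in validLonLat[-length(validLonLat)]){
    j <- validLonLat[validLonLat > i]
    d <- acos(pmin(pmax(Inventory$sinLat[i] * Inventory$sinLat[j] +
                        Inventory$cosLat[i] * Inventory$cosLat[j] *
                        cos(Inventory$Lon[j] - Inventory$Lon[i]), -1), 1)) * ER
    D[i, j] <- d   # one buffered write per station
}

Now, let's look at how we get data from this file.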

test <- getNeighbors("distance.desc", Id = 333)

Warning message:
In getNeighbors("distance.desc", Id = 333) : coercing Id to character

To grab a station's neighbors I call getNeighbors() and pass the filename of the descriptor file. I may change this, but you'll get the idea. I also pass the Id. Id is supposed to be passed as a character; if you screw up and pass an integer, I coerce it under the hood and proceed. The code for the function looks like this:

getNeighbors <- function(descriptFile, Id){
    require(bigmemory)

    D <- attach.big.matrix(dget(descriptFile))
    if(class(Id) != "character"){
        warning("coercing Id to character")
        Id <- as.character(Id)
    }
    dex <- which(rownames(D) == Id)
    if(length(dex) != 1){
        if(length(dex) == 0) stop("index not found")
        if(length(dex) > 1) stop("index found multiple times")
    }
    neighbors <- c(D[1:dex, dex], D[dex, (dex+1):ncol(D)])
    neighbors[is.na(neighbors)] <- 0
    neighbors[neighbors < 0] <- NA
    names(neighbors) <- rownames(D)
    return(neighbors)
}

Pretty straightforward. First we "attach" the file for reading, using the descriptor file. If Id is not a character I coerce it. Then I do some checking to ensure that the Id can be found in the rownames of the file, and found only once. I will probably add a check to the creating function to make sure that Ids are unique; for now I check it here, but that will change. Then I pull the neighbors for the Id. The neighbors of any given station consist of the cells in a column plus the cells in a row. The best way to see this is to draw a diagram using a 4x4 matrix as an example. Presume you want the 3rd station: you take rows 1-3 of the 3rd column; cell [3,3] will be NA because we do not calculate the distance from a site to itself. The balance of the neighbors sit in row 3, from column 4 onward (here, just column 4): neighbors <- c(D[1:dex,dex], D[dex,(dex+1):ncol(D)])

That line collects all the neighbors for an Id, including the Id itself. Next we take the cell that is NA, D[dex,dex], and set it to 0 in neighbors. Finally, we take the negative distances and set them to NA. I suppose I could just as easily get rid of the whole -9999 ploy, and probably will in a final version.
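The 4x4 picture, worked with an ordinary matrix standing in for the big.matrix:

M <- matrix(NA, 4, 4)
M[upper.tri(M)] <- 1:6     # pretend these are distances
dex <- 3
c(M[1:dex, dex], M[dex, (dex+1):ncol(M)])
# [1]  2  3 NA  6    (the NA at position 3 is M[3,3], the station itself)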

Let's look at test:

test[330:340]
      330       331       332       333       334       335       336       337       338 
 7722.563        NA  7564.581     0.000 16180.166  8846.114 15410.572  5890.057 17409.563 
      339       340 
 7262.202 10958.382

So, station 331 had a missing Lat/Lon and station 333 is zero km from itself.

Neat. OK, after taking a few minutes to add some cleanup, here are the revised versions.

createDistanceMatrix <- function(Inventory, Directory = getwd(), filename = "distance.bin"){
    require(bigmemory)

    deg2rad <- function(deg) return(deg*pi/180)

    if(!grepl("\\.bin", filename)) stop("please use a bin extension")
    columnsPresent <- intersect(colnames(Inventory), c("Id","Lon","Lat"))
    if(length(columnsPresent) != 3) stop("missing the correct column names")

    descName <- sub("bin", "desc", filename)
    if(class(Inventory) == "matrix") Inventory <- as.data.frame(Inventory)
    if(length(unique(Inventory$Id)) != length(Inventory$Id)) stop("Ids must be unique")
    # some temporary variables to speed up the distance calculation;
    # these are computed once and reused inside the loop
    Inventory[, "sinLat"] <- sin(deg2rad(Inventory$Lat))
    Inventory[, "cosLat"] <- cos(deg2rad(Inventory$Lat))
    Inventory$Lon <- deg2rad(Inventory$Lon)
    ER <- 6371   # Earth's radius in km

    nr <- nrow(Inventory)
    nc <- nrow(Inventory)

    options(bigmemory.allow.dimnames = TRUE)

    D <- filebacked.big.matrix(nrow = nr, ncol = nc,
                               dimnames = list(as.list(Inventory$Id), NULL),
                               init = NA,
                               backingpath = Directory,
                               backingfile = filename,
                               descriptorfile = descName,
                               type = "double")

    validLonLat <- which(!is.na(Inventory$Lat) & !is.na(Inventory$Lon))

    for(i in validLonLat[-length(validLonLat)]){
        for(j in validLonLat[validLonLat > i]){

            D[i,j] <- acos(Inventory$sinLat[i] * Inventory$sinLat[j]
                           + Inventory$cosLat[i] * Inventory$cosLat[j]
                           * cos(Inventory$Lon[j] - Inventory$Lon[i])) * ER
            if(is.nan(D[i,j])) D[i,j] <- 0

        }
    }

    return(D)

}

And

getNeighbors <- function(descriptFile = "distance.desc", Id){
    require(bigmemory)

    DATA <- attach.big.matrix(dget(descriptFile))

    if(class(Id) != "character"){
        warning("coercing Id to character")
        Id <- as.character(Id)
    }
    dex <- which(rownames(DATA) == Id)

    if(length(dex) == 0) stop("index not found")
    if(length(dex) > 1) stop("index found multiple times")

    neighbors <- DATA[1:dex, dex]
    # guard the row read: the last station has nothing to the right of dex
    if(dex < ncol(DATA)) neighbors <- c(neighbors, DATA[dex, (dex+1):ncol(DATA)])
    neighbors[dex] <- 0
    names(neighbors) <- rownames(DATA)
    return(neighbors)
}
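A quick end-to-end run with the revised pair, using the same random inventory as before (sort() drops the NAs, and position 1 is the station's zero distance to itself):

BD <- createDistanceMatrix(Inventory)
nn <- getNeighbors("distance.desc", Id = "333")
sort(nn)[2:6]   # the five nearest neighbors of station 333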

  1. Nick Stokes
    April 8, 2012 at 6:05 PM

    Steven,
    Some speedup thoughts:
    1. You could use the trig sum formula for longitude:
    cos(L_i-L_j)=cos(L_i)*cos(L_j)+sin(L_i)*sin(L_j)
    Looks worse, but now you can use the same trick you did for lats. Just precompute the vector of coslon and sinlon. Then for the bulk you are calculating products rather than trig.

    2. Do you need to take the acos? If you mainly want neighbors, why not just look for cos(D) ~1. To take that further, if it’s mainly neighbors you want, why not just use ordinary distance squared. Once you’ve sorted the close ones, then you could go to spherical if you really need to.

    3. With the sum formula, you could then use outer (%o%) instead of loops. Just
    cos D = sinlat %o% sinlat + (coslat*coslon) %o% (coslat*coslon) + (coslat*sinlon) %o% (coslat*sinlon)
    Each of those () is just a vector over stations.
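    In code, assuming lat and lon are already in radians, the whole cos-distance matrix would be something like (just a sketch):

    sinlat <- sin(lat); coslat <- cos(lat)
    sinlon <- sin(lon); coslon <- cos(lon)
    cosD <- sinlat %o% sinlat +
            (coslat*coslon) %o% (coslat*coslon) +
            (coslat*sinlon) %o% (coslat*sinlon)
    # distances only when you need them: acos(pmin(pmax(cosD, -1), 1)) * ER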

    • Steven Mosher
      April 9, 2012 at 1:23 AM

      Thanks Nick. I'll try a couple of those things. With the current timing, 50K stations would take a year.
      1. I'll give that a try right away.
      2. For a variety of other applications I need real distance, although I may put off the acos()
      until I actually fetch the neighbors.
      3. I'll look at that. I think the biggest time crunch (I need to profile) comes from the
      way I am writing to file. I should profile a bit before I thrash away.

    • Steven Mosher
      April 9, 2012 at 3:52 AM

      Well, that did not speed things up. The big PIG in this calculation is the assignment
      of values to D[] and then checking for NaN in D[]; everything else is peanuts.

  2. Rhead Enion
    April 9, 2012 at 12:05 AM

    You mention using apply; I think you could see a significant speedup using some form of apply for the distance calculation. That double “for” loop is really only calculating one matrix cell at a time without any dependencies on preceding calculations, so the “for” loop is not strictly necessary.

    Instead, pre-determine all the matrix indices that will require a distance calculation:
    indexLonLat <- combn(validLonLat, 2)
    i <- indexLonLat[1,]
    j <- indexLonLat[2,]

    Now you can pass those indices to a distance calculation using mapply (or with R 2.15, mcmapply to use multiple cores in parallel or change this code to use mclapply with earlier versions of R):

    DistanceCalculation <- function(i, j, Inventory, ER) {
        res <- acos(Inventory$sinLat[i] * Inventory$sinLat[j]
                    + Inventory$cosLat[i] * Inventory$cosLat[j]
                    * cos(Inventory$Lon[j] - Inventory$Lon[i])) * ER
        if(is.nan(res)) return(0) else return(res)   # test res, not D[i,j], which is not in scope here
    }

    mapply will return a vector, which complicates things slightly:
    D.vec <- as.vector(D)
    D.vec[(j-1)*nrow(D) + i] <- mapply(DistanceCalculation, i, j, MoreArgs=list(Inventory=Inventory, ER=ER))

    D.new <- matrix(D.vec, nrow=nrow(D))

    You may want to watch memory usage, particularly if using mcmapply on a lot of cores. If memory is a problem, you could run the code in chunks, using only a portion of the i,j vectors at a time.

    I don't have your code, but I often see a significant speedup when switching from a double for loop to using the apply family of functions.

    • Steven Mosher
      April 9, 2012 at 1:26 AM

      Nice. I will post my code in the drop box if you want to play. I need a 1000-fold speedup.

    • Steven Mosher
      April 9, 2012 at 7:04 AM

      Wow. mapply worked really well. I've faced a similar problem many times and could never figure out how to do the indexing on the LH side of the assignment. Thanks!

      Doing it in chunks will also help the overhead on the disk access.

    • Steven Mosher
      April 9, 2012 at 10:46 PM

      One issue is that combn() is not memory safe. If I feed it very large numbers it crashes the stack. I recall using this before, and the underlying code is recursive. So looping, while slower, is memory safe. I think I can handle this by generating combn() in blocks.

      I'll try to do something general purpose.

  3. Nick Stokes
    April 9, 2012 at 4:19 AM

    Steven,
    Here’s one thought:
    1. Calc position of stations z_i in 3D space (a quick calc, about 50000)
    2. Calc ordinary dist^2 using outer product:
    D_ij = (z_i - z_j).(z_i - z_j)
         = z_i.z_i + z_j.z_j - 2 z_i.z_j
    The first two are just vectors, and the last is efficiently calculated as a matrix product. If z is nx3, then it is z %*% t(z).
    3. Spherical distance s is related to euclidean distance d by
    s = DE*asin(d/DE), where DE = the diameter of the Earth.

    The code for 2 would be something like:
    z2 = z*sqrt(2)      # makes the outer prod quicker
    zz = rowSums(z*z)   # z_i.z_i
    zm = matrix(zz,n,n)
    D = zm + t(zm) - z2 %o% z2

    This would in calc terms be similar to what you'd have with precomputed trig functions; the main gain would be clarity (maybe). There is a plus in that asin is well-behaved for close stations, though the difficulty would show up as possibly negative D. That can be fixed with abs().
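    Step 3 in code, with the abs() guard (D here holding the squared chord distances from step 2):

    DE <- 2 * 6371                      # Earth's diameter in km
    s <- DE * asin(sqrt(abs(D)) / DE)   # spherical distance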

    • Steven Mosher
      April 9, 2012 at 4:39 AM

      Thanks Nick. Looking at the profR profiling, it looks like I have a huge problem in how
      I am writing to and reading from the big matrix. The reads and writes are taking 85% of the time. If they are blocking on read/write then no amount of fiddling with the math will get me closer. I need to figure that out, because I also want to be able to go through stations and look for duplicates or merge candidates. Calculating the distance between time series is
      a hotbed of research (google SAX time series to see some of the cool stuff being done,
      and even some Hilbert space stuff; argh, way over my head).

      I'm also looking at the package "ff", which has a different interface to disk.

      Ideally, people who write these packages would not block on read/write when they go to disk; that way the math can continue while data is being dispatched/fetched. I will try to contact the package maintainer to see if he has some hints. I may have to chunk writes manually.

  4. Nick Stokes
    April 9, 2012 at 4:21 AM

    That last term: not z2 %o% z2 but z2 %*% t(z2)

  5. Steven Mosher
    April 9, 2012 at 5:33 AM

    Well, dropping the line if(is.nan(D[i,j])) D[i,j] <- 0 improved the speed from 7 minutes to 5.
    That confirms the timing, which showed about 15% of the time spent on this line.

    I changed the inner loop and moved the assignment to D[] outside that loop, and the time
    went to 20 seconds. Figuring out the best way to batch writes is the biggest issue in time savings.

  6. Steven Mosher
    April 9, 2012 at 6:24 AM

    Well, 10000 stations took 1 hour. That's a start in terms of improvements.

    Moving up to 50K stations is now feasible. I'll run 50K overnight and see how it scales.

    Then on to seeing how multicore can play, and some of Nick's ideas.

  7. Nick Stokes
    April 9, 2012 at 6:37 AM

    Steven,
    I wrote a program (zipfile here) to do the cartesian idea. It breaks the vectors into 10 parts, so the result comes out in 55 RDATA files. For the 7279 stations of GHCN v3, it takes 2.3 minutes to get the spherical distances.

    • Steven Mosher
      April 9, 2012 at 6:39 AM

      OK, I'll take a look. Another guy on the R-help list is looking at a similar issue.

      • Nick Stokes
        April 9, 2012 at 7:26 AM

        I had a ref to a variable zm which wasn’t declared. The line can just be deleted. I’ve updated the zipfile.

      • Nick Stokes
        April 9, 2012 at 1:53 PM

        Fixed an error that caused duplicate calculation. It now takes about 1 min for GHCN. I ran the original BEST inventory, with 39026 stations. It took 32 min, and the matrix in 55 files occupied 5.6 GB in RDATA format.

      • Steven Mosher
        April 9, 2012 at 10:48 PM

        Ok Nick. I’ve downloaded it and will take a look. Since I have a bunch of these to do I’ll see if I can generalize a solution

      • Steven Mosher
        April 10, 2012 at 1:10 AM

        Hmm.

        z0 = z2[p,]: I see this computed but not used.

      • Nick Stokes
        April 10, 2012 at 7:24 AM

        Steve,
        z0 was part of the duplicating error that I mentioned. The latest version doesn’t have it.

      • Steven Mosher
        April 10, 2012 at 8:21 AM

        Argg. Hmm, there was an indexing error as well. I'll get home later and sort it out.

  8. Etienne B. Racine
    April 13, 2012 at 12:30 AM

    This is not R, but maybe PostgreSQL's KNN GiST could be helpful? It seems you can use it with PostGIS as well.

    • Steven Mosher
      April 13, 2012 at 12:55 AM

      Funny you should mention that. I was looking into porting the data into MySQL or equivalent to make some of the data management easier. PostgreSQL was interesting because of the support for time series databases. I'm also looking at FAME.

  9. Graham Thompson
    May 7, 2012 at 5:33 AM

    This is interesting and relevant …

    An internal study by the U.S. EPA completed by Dr. Alan Carlin and John Davidson concluded the IPCC was wrong about global warming. One statement in the executive summary noted that a 2009 paper found that the crucial assumption in the Greenhouse Climate Models (GCM) used by the IPCC, concerning a strong positive feedback from water vapor, is not supported by empirical evidence and that the feedback is actually negative. Water vapor in the atmosphere causes a cooling effect, not a warming one. Carbon dioxide also causes a slight cooling effect, but it is so small it could never be measured by man's instrumentation.

    EPA tried to bury the report. An email from Al McGartland, Office Director of EPA's National Center for Environmental Economics (NCEE), to Dr. Alan Carlin, Senior Operations Research Analyst at NCEE, forbade him from speaking to anyone outside NCEE on endangerment issues. A March 17 email from McGartland to Carlin stated that he would not forward Carlin's study: "The time for such discussion of fundamental issues has passed for this round. The administrator (Lisa Jackson) and the administration have decided to move forward on endangerment, and your comments do not help the legal or policy case for this decision. …. I can only see one impact of your comments given where we are in the process, and that would be a very negative impact on our office." A second email from McGartland stated, "I don't want you to spend any additional EPA time on climate change."

    McGartland’s emails demonstrate that he was rejecting Dr. Carlin’s study because its conclusions ran counter to the EPA’s current position. Yet this study had its basis in three prior reports by Carlin (two in 2007 and one in 2008) that were accepted. Another government cover-up, just what the United States does not need.

    Eliminate this regulation immediately. This is a scientific tragedy.
