All material here is licensed under CC BY-SA.

City Size and SUHI

October 2, 2012 17 comments

In the course of putting together data for my kriging project with the CRN stations, I got another idea related to a small but potentially important corner of the concerns over UHI in the global temperature index. For clarity, my position is that the UHI bias is probably low, less than 0.1C per decade. That's basically what Zeke and I found in our AGU poster work (with help from Nick Stokes, Matt Menne and Claude Williams). Nevertheless, some people persist in claiming that the bias is bigger and, further, that the bias can be found even in small cities and towns. There isn't much peer-reviewed science to back up this claim, but there are a few "studies" here and there looking at the problem. Here is a short, incomplete bibliography:

Torok 2001: http://www.warwickhughes.com/climate/seozuhi.htm
Spencer: http://www.drroyspencer.com/2012/04/regional-u-s-population-adjustments-to-surface-temperatures-since-1973-still-little-warming/
Warwick Hughes: http://www.warwickhughes.com/blog/?p=575
http://solberg.snr.missouri.edu/gcc/heatisland09.pdf
https://ams.confex.com/ams/Annual2006/techprogram/paper_100595.htm

Not exactly what I would call compelling arguments, but rather some interesting observations. One reason this matters relates to the definitions of "urban" and "rural" that we use for studying the possibility of bias in the global temperature record. The argument, I have found, usually proceeds like this: the records of UHI for a large city such as Tokyo, Phoenix, or New York are shown, and one might see figures for UHI of 2, 3 or even 9C. The inference drawn is that biases of this size may be in the global temperature record. For a bit of background have a look at this: http://www.nasa.gov/topics/earth/features/heat-island-sprawl.html

Dr. Imhoff and his team do some great work, and much of it has inspired me to dig a little deeper into the issue. Results like Imhoff's on large cities create some concern that this UHI bias may be in the record. There are a few things we need to note. The UHI measured in these studies is typically SUHI, the surface urban heat island, while thermometers measure the air temperature 2 meters above that surface. If SUHI were simply related to the air temperature above the surface, this would not be an issue; however, air temperature is not simply related to the surface temperature, although surface temperature can explain a good portion of the variance. The second issue with satellite measures of SUHI is that they are biased toward capturing the worst case. SUHI and UHI are modulated by synoptic weather conditions: on cloudy or rainy days the heat islands that develop tend to be smaller than those on sunny days. Sensing the land surface temperature requires a cloud-free day, so the satellite measures will be biased high, or rather they will tend to capture the worst SUHI days. This approach is fine for understanding the human health concerns related to UHI and heat waves (another project I am working on), but one should not be surprised to find that even large cities don't see 9C heat islands every day of the year. SUHI and UHI vary by weather, by season and by location. The other thing to note about Imhoff's study is that it tended to focus on large cities. A comprehensive look at large cities was conducted by Peng, who looked at 419 cities around the world: http://cybele.bu.edu/download/manuscripts/peng-uhi-est-2012.pdf
Some interesting findings tend to confirm things I've thought. First, if you look at annual figures for SUHI (or UHI) you will come up with numbers typically less than 3C, and second, there appears to be a threshold to the bias at the high end. Peng argues, for example, that for these large cities SUHI is insensitive to population density. Since these are all cities over 1 million, it might be more precise to say that there is a threshold effect above certain densities. That's consistent with what Oke already understood in 1973 when he fit maximum UHI with a log curve. Imhoff appears to have found something similar in his studies, focusing on impervious surface area rather than population density. In his study of 38 large US cities he found a similar log-type relationship with the impervious area of a city; that single variable explains 71% of the variance in SUHI. There are some interesting contrasts between Peng's and Imhoff's work that will take this post in a different direction, centered on the different drivers of daytime and nighttime SUHI, but I'm going to focus on the rather small number of data points for cities less than 10 sq km in size.

The goal for this project is to focus on small-city SUHI using Imhoff's approach and then perhaps Peng's approach. My goal at the outset was to get something on the order of 100 to 500 small cities less than 10 sq km in size. Along the way we will also see the limitations of some datasets for these types of studies. In due course we will see the limitations of using Grump population data (like Spencer uses) for small areas and low-population areas, and we will see some difficulties in using Modis Urban. The Modis limitation is something I've struggled with before, specifically in our AGU poster and later in the Berkeley Earth UHI paper. There were two basic issues. First, the Modis calibration was done on cities of 100K people or more. Given the size of the Modis pixel, it is likely (as one can see plowing through thousands of small cities) that Modis will miss small urban/town areas and identify areas as rural when in fact they are "small towns" or villages. The other issue with Modis, which no one has pointed out, is that I think some northern latitudes have snow cover where there are cities, so that Alaska, for example, is misrepresented as more rural than it actually is. Checking the dates when the Modis images were taken provides some confirmation of this. It appears this wasn't caught in the calibration testing, since all the cities used for calibration are at latitudes below 50 (as I recall; I need to recheck all of this!).

There are also some software goals in this project. There is a new Modis package in R that I am eager to use and some landscape-statistics packages that I've always wanted to play with, so it's a small bit of new science to learn and a lot of software fun. Let's get started! Here is a brief synopsis of Imhoff's approach (Peng's is a bit different, but let's save that for later):

A) identify the urban areas you want to study
B) identify the biome
C) identify the rural area for comparison
D) eliminate confounding variables
E) process the Modis LST data for various time periods and seasons
F) regressions

Along the way, we will learn something about the limitations of various datasets.

A) Finding small urban areas. To find small urban/small town areas I employed US census data.
After banging my head against the wall with census block shapefiles (the collections by state are huge), I took a different path. I downloaded the gazetteer files: http://www.census.gov/geo/www/gazetteer/gazetteer2010.html. Here I took the "Places" file, which is basically a flat file of places in the US: cities, towns and CDPs, or census designated places. For now I'm working with 2010 data, which is read in nicely by R. Later I'll update with 2000 data, which is not so easily read in.

placesFile <- "G:\\USCensus\\Gaz_places_national.txt"
Places <- read.delim(placesFile, stringsAsFactors = FALSE)

The file contains all the information I need to conduct a quick search for areas that are small and have people. I also know that when I return to the census shapefiles for these areas I'll have an easier job of finding what I want. Next I impose some limits on the areas I will be looking at.

colnames(Places)[13] <- "Lat"
colnames(Places)[14] <- "Lon"
Places <- Places[Places$Lat < 50 & Places$Lat > 22 & Places$Lon > -130, ]
Places <- Places[Places$ALAND <= 10000000, ]
Places <- Places[Places$POP10 < 2500, ]
Places <- Places[Places$AWATER < 25000, ]

After changing the names of the Lat and Lon columns, I limit my area of interest to CONUS. I do that for several reasons. I plan on working with NLCD 2006 landcover, and Alaska and Hawaii are missing from that. Second, I have some concerns about Modis Landcover (urban) in Alaska. I may return to this selection criterion. Next I use the ALAND metric provided by the US census to select only those areas with less than 10 million sq meters of area, or 10 sq km. This is the area of the polygon that defines that place for the census bureau. Again, I'm only using this to narrow a search I will be doing in the satellite data. Next, I limit the population size I am looking at. The census data will contain many small physical areas that are located within huge urban sprawls, and what we are looking for is small cities surrounded by rural or unpopulated areas. Lastly, I limit the amount of water that can be found within the bounding polygon.

With these criteria we get a huge number of sites to look at, on the order of 12,000. Plotting these and looking at them on Google Earth is my typical approach, just to get an idea of potential problems any automated selection process will face. The site statistics are as follows: the average population is 600, the average number of houses is 275, the amount of land in the average area is 2.5 million sq meters, and 75% of the sites have less than 100 sq meters of water. After reviewing plots of some of these areas it was clear that I was still not finding what I was looking for, so to winnow this collection more I turned to another dataset of impervious surface, the global 1km Isa dataset. Moving from a 30 meter dataset to a kilometer dataset makes the search faster. For the final look at things we will revert back to the 30 meter data, but first we need to find a good selection of small cities that are surrounded by rural environment. I used two versions of the data: the standard dataset, which contains Isa values from 0 to 100 (with -1 for water), and a reformatted version of that file where 0 represents no Isa and 1-100 is recoded as 1. This allows me to simply sum to find the number of 1km cells that have any Isa.
Places <- cbind(Places, Isa = extract(Isa, cbind(Places$Lon, Places$Lat)))
Places <- Places[Places$Isa >= 0, ]
Places <- cbind(Places, Isa10sqkm = extract(IsaB, cbind(Places$Lon, Places$Lat), buffer = 3000, fun = sum, na.rm = TRUE))
Places <- Places[Places$Isa10sqkm > 0, ]
Places <- Places[Places$Isa10sqkm <= 36, ]

The processing above removes any location where water is at the center of the site and removes those sites that are embedded in areas of other cities. At this stage we are left with 7445 sites, which I then review by looking at plots while counting the total number of 1km Isa cells in a 1/2 degree zone about the site location.

for(i in 1:nrow(Places)){
    e <- getExtent(IsaB, .5, c(Places$Lon[i], Places$Lat[i]))
    patch <- crop(IsaB, e)
    icount <- count(patch, 1)
    Places$IsaCount[i] <- icount
    fname <- paste(i, "Patch.png", sep = "_")
    fname <- file.path("Patches", fname, fsep = .Platform$file.sep)
    png(fname, 785, 785)
    plot(patch, main = icount)
    dev.off()
    print(i/nrow(Places))
}

That code creates 7445 plots of all the potential locations and counts the number of cells in a 1/2 degree zone. I then review those plots. Here is a small sample. The review of those 7000-odd plots suggested that if I wanted examples with a small amount of Isa in the center of the plot and "non-Isa" surrounding it, a limit of 1500 total Isa cells in the 1/2 degree zone would go a considerable way toward that goal. In addition, at this stage I winnowed the sites by Imhoff's biome criteria. Since the biome that the urban site is embedded in drives the differential in temperature, we want to make sure a city is not in an area where the biome changes, so the biome within 1/2 a degree of the site is checked. These two cuts through the data reduce the number of sites to examine to roughly 2400. At this stage the underlying statistics of the set have not changed, with the average population still around 600 and the average size 2.5 sq km. In addition, we have now tagged every site with its biome.

The next step was to plot all 2400 sites using 30 meter NLCD impervious surface data. The following gives you an idea of the detail available. With these 2400 sites I figure I have a good chance of finding a fairly large sample of small cities on which I can perform the "Imhoff" style analysis. I will say that looking through thousands of maps of small urban areas gives me a good idea of some of the difficulties ahead. It always helps to get really down and dirty with the data. Reviewing the Imhoff approach, one should note that he defines urban areas by NLCD Isa with various zones, such as an urban core at 75-100% Isa and then surrounding zones at various levels. His cutoff is 25% Isa. One should also note that he checks this against Modis Urban. That check is the check I turn to next.

Why Modis Urban? The answer lies in the bowels of MODIS LST. Land surface temperature is estimated using a variety of MODIS products, and one critical unknown is emissivity, which is estimated from Modis land cover. Simply put, if Modis land cover doesn't think a pixel is urban, it will not apply the correct emissivity and the temperature will not be estimated correctly. My next step then was to use Modis land cover to winnow the 2400 sites down to those where Modis thought the small urban area was "Modis Urban". In this next step I add a few more constraints, basically steps to ensure that the small urban center is surrounded only by non-urban Modis pixels.
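To give a flavor of that check, here is a rough sketch of the kind of test I mean, assuming a MODIS IGBP land cover raster loaded as modisLC (a hypothetical object name; in the IGBP scheme class 13 is urban/built-up). This is the shape of the test, not the exact code:

# sketch only: modisLC is a hypothetical MODIS land cover raster (IGBP classes, 13 = urban)
ctrClass <- extract(modisLC, cbind(Places$Lon, Places$Lat))
ringVals <- extract(modisLC, cbind(Places$Lon, Places$Lat), buffer = 5000)   # list of class values around each site
isUrbanCore <- ctrClass == 13
isRuralRing <- sapply(ringVals, function(v) sum(v == 13, na.rm = TRUE) <= 1) # only the core pixel is allowed to be urban
Places <- Places[which(isUrbanCore & isRuralRing), ]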
That step reduces the dataset down to 149 stations, mapped below. A few things to note. There are some sites located near water, which should make for an interesting bit of analysis; as I recall, Imhoff has special handling for water pixels. There are a few more restrictions that Imhoff employs for "rural" pixels that will probably reduce this set even more, and then there are pixels that need to be removed because of altitude differences, but with 150 sites to work with I think I have a manageable number. At this stage I also wanted to check out the land cover surrounding a station. Imhoff used biomes, which don't really give a current-day view of the land cover; rather, they give a historical view of things. I may tweak that part of the analysis. Here are some samples of land cover around the sites. Sites are coded in red. I'm still playing around with 'rasterVis' to get the land cover legends to work out.

The site stats now look like this: population ranges from 141 to 2417, with a mean of 999 and a median of 879. The size of the community according to cadastral boundaries runs from roughly 0.5 sq km to 10 sq km, with a mean of around 3 sq km, and the population density according to census figures runs from 36 people per sq km to 1200.
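Once the MODIS LST processing is done, the end game is an Imhoff-style regression of SUHI against impervious surface. As a placeholder, here is a sketch of that fit on made-up numbers; the SUHI and Isa columns below are invented, and the real ones come from the LST and NLCD work described above:

# made-up data standing in for the eventual per-site SUHI (urban minus rural LST) and mean %Isa
sites <- data.frame(SUHI = runif(149, 0, 3), Isa = runif(149, 1, 60))
fit   <- lm(SUHI ~ log(Isa), data = sites)   # the log-type relationship Oke and Imhoff report
summary(fit)$r.squared
plot(sites$Isa, sites$SUHI, xlab = "% impervious surface", ylab = "SUHI (C)")
curve(predict(fit, data.frame(Isa = x)), add = TRUE)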

Categories: Uncategorized

rasterVis to the rescue

September 26, 2012 8 comments

Programmers like Oscar Perpiñán Lamigueiro are the reason I love R! Oscar is the maintainer of the rasterVis package, and in this post I'll explain why it is a must-have package for anyone working with the raster package in R. My latest project is focused on NOAA's Climate Reference Network; the details can be found here. Over a year ago I wrote a package called CRN, designed to download and organize the data from the Climate Reference Network, and I wanted to update and put some finishing touches on that package. In the course of that I decided it would be a cool idea to build a detailed metadata package for the stations.

The CRN is rather like a gold standard of climate data collection: triple redundant sensors, 5 minute data, hourly data, daily and monthly. The sites have all been selected using rigorous standards, and as the system has matured the data availability rates are close to 100%. Here are just a few of the projects I have in mind.

1. Doing a proper estimate of the US temperature using just the CRN stations. One person has already done this incorrectly (or suboptimally), so I'm planning on approaching the problem with kriging. For this I won't use the Berkeley Earth code because I want to do something in R, using gstat perhaps (a minimal gstat sketch follows this list).

2. Looking at LST around the CRN sites. There is also a really cool new package in R called Modis, and I really want to do an interesting project to test that package out. The goal here would be to fetch the correct Modis tiles for land surface temperature and then look at the air temperatures and soil temperatures (CRN collects those).

3. A landcover study. I've been reading a bunch of papers by Marc Imhoff on SUHI and the relationship between SUHI and impervious surfaces, tree cover, etc., and Marc's team has done some really cool work on the relationship between SUHI and the particular biome that a city is embedded in. In some forthcoming work they also look at the effects of city size and city shape on SUHI. At a few points in their papers they have hinted at SUHI effects for even small urban developments with areas around 1 sq km. I'd like to understand that better.
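For project 1, here is a minimal sketch of the kriging idea in gstat, run on a made-up inventory rather than the real CRN data; the column names, values and grid below are placeholders:

library(sp)
library(gstat)

# fake station data standing in for the CRN inventory and temperatures
crn <- data.frame(Lon = runif(100, -120, -70), Lat = runif(100, 25, 49),
                  Temp = rnorm(100, 12, 5))
coordinates(crn) <- c("Lon", "Lat")
proj4string(crn) <- CRS("+proj=longlat +datum=WGS84")

v    <- variogram(Temp ~ 1, crn)
vfit <- fit.variogram(v, vgm("Exp"))              # fit an exponential variogram model

# a crude 1 degree prediction grid over CONUS
grd <- expand.grid(Lon = seq(-120, -70, by = 1), Lat = seq(25, 49, by = 1))
coordinates(grd) <- c("Lon", "Lat")
gridded(grd)     <- TRUE
proj4string(grd) <- CRS("+proj=longlat +datum=WGS84")

ok <- krige(Temp ~ 1, crn, grd, model = vfit)     # ordinary kriging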

To pull off all these projects I will need some rather detailed metadata, more detailed than I've ever worked with before. For starters I needed a station list for the CRN and USRCRN stations. There are any number of ways of getting this data, but I dropped a note to Matt Menne of NOAA and he directed me to Michael Palecki, who was kind enough to send me a csv file of stations. The thing I was looking for was the most accurate latitude/longitude data I could find, and Michael supplied it.

The next task was collecting metadata for the stations and assembling a variety of tables. I won't document that work here, but just to give you an idea of what I'm putting together, for every station I have the following: 1) elevation, slope and aspect from a 30 meter DEM; 2) distance from the sea coast, distance from the nearest body of water, and the percentage of the "land" that is covered in water over a broad range of radii from the site (sketched below); 3) various landcover datasets; 4) impervious area; 5) biome; 6) anthrome; 7) populations from various sources and at various levels of accuracy; 8) housing; 9) nightlights (stable and radiance calibrated); 10) economic activity. I also plan to get LST and probably Modis albedo.
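For item 2, the water-percentage metric is the sort of thing raster's extract() with a buffer handles directly. A sketch, assuming a 0/1 water mask raster called waterMask (hypothetical) and the station inventory Inv with Lon/Lat columns:

library(raster)
radii <- c(500, 1000, 5000, 10000, 50000)   # buffer radii in meters
pts   <- cbind(Inv$Lon, Inv$Lat)
# mean of a 0/1 mask inside each buffer is the water fraction; scale to percent
waterPct <- sapply(radii, function(r)
                   extract(waterMask, pts, buffer = r, fun = mean, na.rm = TRUE) * 100)
colnames(waterPct) <- paste("waterPct", radii, sep = "_")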

So where do Oscar and rasterVis come in? Specifically, in this case Oscar came to the rescue for a plotting problem I was having with landcover data.

Using the 2006 NLCD 30 meter land cover dataset, the first thing I do is save off "patches" of landcover roughly 1 degree around each site. The 30 meter dataset is huge, and things go much easier if I just create small datasets for each station.

In code that looks like this

library(raster)
library(rgdal)

LcName <- "G:\\NLCD\\nlcd2006_landcover_4-20-11_se5.img"
LC     <- raster(LcName)

Inv <- read.table("Core.tab", stringsAsFactors = FALSE)

coordinates(Inv) <- c("Lon", "Lat")
proj4string(Inv) <- CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
Inv <- spTransform(Inv, CRS(projection(LC)))

dir.create("NLCD2006")

for(station in 1:nrow(Inv)){
    Lon <- coordinates(Inv)[station, "Lon"]
    Lat <- coordinates(Inv)[station, "Lat"]
    if(Lon >= xmin(LC) & Lon <= xmax(LC) & Lat >= ymin(LC) & Lat <= ymax(LC)){
        e     <- extent(Lon - 110000, Lon + 110000, Lat - 110000, Lat + 110000)
        patch <- crop(LC, e)
        fname <- file.path("NLCD2006", paste(Inv$Id[station], "NLCD06.grd", sep = ""), fsep = .Platform$file.sep)
        writeRaster(patch, filename = fname)
        print(station)
    } else {
        print("Alaska or HI")
    }
}

Pretty simple. For 2006 Alaska and Hawaii are not in the dataset so we just skip them. The output of this is a small grid with the station in the center and 30 meter landcover data for 110,000 meters in all directions. That’s the data I will be working with.

The next step is to calculate a variety of statistics on this "patch" and on smaller and smaller versions of it, so we get the land cover stats at various radii from the site, from 110 km down to 30 meters. Working with the 1 degree "patch" makes this work go very quickly. raster has a function called "freq" which counts the frequency of values in the entire raster. [As I write this I wonder if "freq" can be passed as a function to the raster "extract" function along with a buffer option.]
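The crop-and-count idea looks roughly like this, assuming patch is one of the saved station grids and x0/y0 are the station's projected coordinates (placeholder names, not the project code):

radii <- c(110000, 50000, 10000, 1000, 30)   # meters, since the NLCD projection is in meters
stats <- lapply(radii, function(r) {
    e <- extent(x0 - r, x0 + r, y0 - r, y0 + r)
    freq(crop(patch, e), useNA = "no")       # value/count table of land cover classes in that window
})
names(stats) <- paste(radii, "m", sep = "")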

In addition to creating 1 degree patches for every site I also create 1km patches which makes a nice size for display. That’s where rasterVis comes in.

Before Oscar solved my problem here is an example of what my plot looked like. Note this is at 500 meters

Image

 

There are a bunch of problems here. First and foremost, I could not get consistent coloring for these plots. The issue is that plot() in raster was treating my data as integers rather than factors, and if land cover types were missing my color scheme changed from plot to plot.

Oscar fixed all that by suggesting that I use rasterVis and the levelplot function. His code dropped right in, and after finding the rgb color scheme for NLCD the same plot now looks like this

 

Image

 

 

I've got some work to do to switch back to the official legend descriptions, but the maps now use the rgb values from the standard and every map is colored consistently. In the next installment I'll post up Oscar's code.
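Until then, here is a minimal sketch of the general idea (my own stripped-down version, not Oscar's code): make the land cover raster a factor with ratify() and hand levelplot() a color vector keyed to the class codes, so missing classes can't reshuffle the colors. The file name, labels and colors below are placeholders.

library(raster)
library(rasterVis)

patch <- raster("NLCD2006/SITE1_NLCD06.grd")   # hypothetical saved patch

patch <- ratify(patch)                         # treat the land cover codes as factor levels
rat   <- levels(patch)[[1]]
rat$class <- as.character(rat$ID)              # the official NLCD class names belong here
levels(patch) <- rat

# placeholder colors keyed by NLCD 2006 code; substitute the rgb values from the standard
nlcdCol <- c("11" = "steelblue", "12" = "white", "21" = "mistyrose", "22" = "salmon",
             "23" = "red", "24" = "darkred", "31" = "tan", "41" = "darkgreen",
             "42" = "forestgreen", "43" = "olivedrab", "52" = "khaki", "71" = "wheat",
             "81" = "yellow", "82" = "sienna", "90" = "lightblue", "95" = "cadetblue")

levelplot(patch, att = "class", col.regions = nlcdCol[as.character(rat$ID)])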

 

Here is what the Impervious surface data looks like ( Needs a grey scale of 0-100% )

 

Image

 

 

To the naked eye it looks like this

 

Image

 

 

Categories: Uncategorized

In search of large ice floes

September 18, 2012 8 comments
Categories: Uncategorized

A gWidgets GUI for climate data

July 10, 2012 19 comments

If you haven't worked with the gWidgets package, it's worth some time exploring it, which is what I've been doing for a little paleo project I've been working on. After struggling with the few demos and tutorials I could find, I went ahead and bought the book Programming Graphical User Interfaces in R. Luckily the book was a little better than the tutorials, which just scratch the surface. GUI programming, in my mind, is unlike other programming, and I had many false starts, primarily struggling with lexical scoping issues and getting widgets to update based on changes made to other widgets and the underlying data. The only reliable method I could find was <<-. Yikes. I'm sure over time I'd figure out a way around making these types of "globals"; the result ended up looking pretty but being ugly under the hood.
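To show the kind of thing I mean, here is a toy example (not the project code) of a handler that has to update another widget: the shared count lives outside the handler, and <<- was my route to it.

library(gWidgets)
options(guiToolkit = "RGtk2")   # assuming the RGtk2 toolkit is installed

stationCount <- 0               # shared state the handler needs to modify
win <- gwindow("scoping demo")
lbl <- glabel("stations: 0", container = win)
btn <- gbutton("crop", container = win, handler = function(h, ...) {
    stationCount <<- stationCount + 1          # the "global" assignment I'd rather avoid
    svalue(lbl) <- paste("stations:", stationCount)
})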

Let's start with the top window.

Pretty basic, but it took a while to get this to work. Part of the problem was mixing the window on the left, a ggraphics() widget, with the panel on the right and having it operate correctly with resizing. I tried ggroup(), gframe(), and combinations of the two, and in all cases the windows would not draw properly upon resizing; text got left all over the place. That problem was solved with a gpanedgroup(). So the left pane is a gframe() that contains a gcheckboxgroup() and then a ggraphics() widget, and the right pane is also a gframe() containing other widgets. The purpose of the GUI is pretty simple: take a huge climate data file and station inventory and "crop" it with respect to a climate proxy file, in this case pollen data. The idea is to create some custom climate files without requiring programming. The process starts by loading a climate file (from the Berkeley Earth datasets), which uses my BerkeleyEarth package. Then you load a pollen database and plot the two.
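A skeletal version of that layout looks something like this; the labels and widget names are placeholders, not the actual tool's code:

library(gWidgets)
options(guiToolkit = "RGtk2")

win   <- gwindow("Climate cropper")
paned <- gpanedgroup(container = win)           # left/right split that resizes cleanly

leftFrame <- gframe("Map", container = paned)
layers    <- gcheckboxgroup(c("Stations", "Pollen"), container = leftFrame)
plotArea  <- ggraphics(container = leftFrame)   # the plotting surface

rightFrame <- gframe("Controls", container = paned)
nStations  <- glabel("Stations loaded: 0", container = rightFrame)
updateBtn  <- gbutton("update plot", container = rightFrame,
                      handler = function(h, ...) svalue(nStations) <- "Stations loaded: 1234")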

The glabel() widgets on the right show you how many stations are loaded and the min and max time. They also show the lat/lon, which defaults to the whole world. Next we select our region of interest using the graphics crosshairs. Selecting that "extent" will automatically crop the climate data to the extent selected:

 

We see the station count adjust and the lat/lon adjust, and the times at which we have temperatures are displayed. Then hit update plot:

 

And the plot updates. Here I've selected an even smaller region. Unfortunately, if you make a region too small, the only way to undo that mistake is to hit "reset data", which brings you back to square one with all the data present. The last step is to adjust the time window: here I want to build a dataset from 1961 to 1990, so I move the gslider() widgets and hit update plot.

 

The last step I have to write is the "save file". I also had plans to add more plots, like stations versus time, but at one point I hit a nasty gWidgets error about hitting the limits on the stack. Probably my programming. For now, this will do. For future projects I think I'm going downstream to do toolkit programming. The layout restrictions in gWidgets did cramp my style, and I really haven't mastered the whole signaling/handler approach to updating widgets. Perhaps I should use an observer model. More reading.

I usually post code. In this case it's so ugly that it would probably teach you bad things. When I get better at this I'll do some tutorials.

 

Categories: Uncategorized

Updates to Old code

I notice from time to time that people download code from the drop box.

The drop box lags CRAN. I've uploaded the newest versions to the drop box.

In the future I will post new releases to the drop box before I submit to CRAN.

Categories: Uncategorized

Nick Stokes Distance code, now with Big Memory

April 12, 2012 10 comments

In my last post I was struggling with getting a big-memory version of the distance matrix to work fast. Nick and other readers had some suggestions, and after puttering around with Nick's code I've adapted it to bigmemory without impacting the run time very much. For comparison, writing a 10K by 10K matrix was taking minutes, and I projected about 24 hours for Berkeley data. Nick came to the rescue with two really cool ideas. The first is a different method for calculating distance on the sphere. Here he expanded on my idea of pre-computing sin() and cos() for latitudes and longitudes, and he developed an approach that allows us to compute these values once and then solve a distance equation using super fast matrix maths. On top of that he came up with a "block" approach. At first this looks like it will be slower, because we calculate the entire matrix as opposed to just the upper half. In his approach Nick wrote out files for every block; in my approach I write blocks to a file-backed big matrix. That's a bit slower, but I get one file that I can then access randomly. Here's the code:

createDistanceMatrix <- function(Inventory, blockSize = 20, Directory = getwd(), filename = "distance.bin"){
    # based on code by Nick Stokes

    require(bigmemory)
    options(bigmemory.allow.dimnames = TRUE)
    if(!grepl("\\.bin", filename)) stop("please use a bin extension")
    columnsPresent <- intersect(colnames(Inventory), c("Id", "Lon", "Lat"))
    if(length(columnsPresent) != 3) stop("missing the correct column names")
    descName <- sub("bin", "desc", filename)
    if(class(Inventory) == "matrix") Inventory <- as.data.frame(Inventory)

    L <- cbind(Inventory$Lat, Inventory$Lon) * (pi/180)
    s <- cbind(cos(L), sin(L))
    # z is the array of 3D coords of stations. All unit vecs
    z <- cbind(s[,1]*s[,2], s[,1]*s[,4], s[,3])

    z2 <- z/sqrt(2)  # Better to mult now than after outer prod
    D  <- 6371*2     # Diam of Earth

    n <- nrow(L)
    if(n <= blockSize) stop("BlockSize must be less than rows in Inventory")
    blocks <- n %/% blockSize
    if((n %% blockSize) > 0) blocks <- blocks + 1
    dex <- 1:blockSize

    BM <- filebacked.big.matrix(nrow = n, ncol = n,
                                dimnames = list(as.list(Inventory$Id), NULL),
                                init = NA,
                                backingpath = Directory,
                                backingfile = filename,
                                descriptorfile = descName,
                                type = "double")
    startTime <- Sys.time()

    for(i in 1:blocks){
        p <- dex + (i-1)*blockSize
        p <- p[p <= n]
        for(j in 1:blocks){
            q <- dex + (j-1)*blockSize
            q <- q[q <= n]
            x <- 0.5000000001 - z2[p,] %*% t(z2[q,])
            x <- D * asin(sqrt(x))
            if(identical(p,q)) diag(x) <- 0
            BM[p,q] <- x
        }
    }

    print(Sys.time() - startTime)
    return(BM)
}

That takes 46 seconds. Orders of magnitude improvement.
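For reference, a quick usage sketch with a toy inventory (the sizes and file names here are arbitrary):

Inventory <- data.frame(Id  = seq_len(5000),
                        Lat = runif(5000, -90, 90),
                        Lon = runif(5000, -180, 180))
BM <- createDistanceMatrix(Inventory, blockSize = 20, filename = "distance.bin")
# later sessions can re-attach the file-backed matrix via its descriptor file
BM2 <- attach.big.matrix(dget("distance.desc"))
BM2[1:5, 1:5]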

Categories: Uncategorized

Using bigmemory for a distance matrix

April 8, 2012 24 comments

UPDATE: the code is in the drop box for anybody who wants to play around. Change the extension from txt to R.

The process of working on metadata and temperature series gives rise to several situations where I need to calculate the distance from every station to every other station. With a small number of stations this can be done easily on the fly, with the result stored in a matrix. The matrix has rows and columns equal to the number of stations, and it is filled by looping through the various station combinations. Of course, one need only calculate the distance for every pair once. When the number of stations increases to tens of thousands, such as with the Berkeley Earth project, there are two challenges. First is the time the process takes, and second is the memory it consumes. Handling the space problem is relatively straightforward, and I decided to use bigmemory to store the data. The speed problem is an entirely different matter, but I'm making a little progress. In addition to storing the data I also want a way to quickly retrieve the neighbors of a given station, and I want my solution to be relatively bulletproof and easy to understand.
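To put a rough number on the memory problem: a dense in-RAM matrix of doubles grows with the square of the station count, which is what pushes this to a file-backed matrix.

# rough arithmetic, doubles are 8 bytes; 40,000 stations is a made-up Berkeley-scale count
n <- 40000
n^2 * 8 / 2^30     # about 11.9 GiB for the full matrix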

To solve the problem I define two functions, createDistanceMatrix() and getNeighbors(). The code for each follows.

Let's start with createDistanceMatrix:

createDistanceMatrix <- function(Inventory, Directory = getwd(), filename = "distance.bin"){

    require(bigmemory)

    deg2rad <- function(deg) return(deg*pi/180)

    if(!grepl("\\.bin", filename)) stop("please use a bin extension")
    columnsPresent <- intersect(colnames(Inventory), c("Id", "Lon", "Lat"))
    if(length(columnsPresent) != 3) stop("missing the correct column names")
    descName <- sub("bin", "desc", filename)
    if(class(Inventory) == "matrix") Inventory <- as.data.frame(Inventory)

The function is declared with a variable for the Inventory of stations, a Directory where the file will be written, and a default filename. Next we indicate that the bigmemory library is required, then define a local function, deg2rad(), and do some simple input checking. I enforce the use of a specific extension (".bin"), which I use for all my file-backed big matrices. Then we check the inventory to make sure that it has the required columns: columns named "Id", "Lon" and "Lat". In my coding, inventories are usually data.frames, but I allow for using a matrix; if you use a matrix I coerce it locally to a data.frame. Finally, I create a new file name that is critical for future "reads" of the data. In bigmemory, when you write a file a "description file" is created to document the structure of the file. We will use this to attach the file in the future. Proceeding:

Inventory[, "sinLat"] <- sin(deg2rad(Inventory$Lat))
Inventory[, "cosLat"] <- cos(deg2rad(Inventory$Lat))
Inventory$Lon <- deg2rad(Inventory$Lon)
ER <- 6371

These lines require a bit of explanation. To calculate the distance between stations I am going to use a great circle calculation. That calculation repeatedly takes the sin() and cos() of the latitude; for speed I do that once for every station in the inventory. Also, R's trig functions take radians, so I do that transformation once as well. ER is the earth's radius in kilometers. Next we define some parameters for the file we want to create:

nr <- nrow(Inventory)
nc <- nrow(Inventory)
options(bigmemory.allow.dimnames=TRUE)

The matrix we create and store as a file will be square. In truth I could make the matrix one row and one column smaller, but I've left it like this for now. Note as well that I set the option for using row names and column names; this will allow people to use character Ids for their station Id in an inventory. That will become clearer as we progress. With those preliminaries we are ready to initialize the file-backed big matrix:

D <- filebacked.big.matrix(nrow = nr, ncol = nc,
                           dimnames = list(as.list(Inventory$Id), NULL),
                           init = NA,
                           backingpath = Directory,
                           backingfile = filename,
                           descriptorfile = descName,
                           type = "double")

Some things to note: I set the rows and columns to be equal. I initialize the data with NA; we only write into the upper diagonal of the matrix and I want the other values to be NA. I set the dimnames so that the rownames are equal to the station Ids. They can be numbers or characters. When you seek on the file you'll use the station Id, so you could use characters ("USC26541328") for station Ids and the functions will still work. I also set the directory where the file is written, the file name, and the descriptor name. Note that the descriptor name is derived from the filename by simply changing the extension: "distance.bin" has a file descriptor of "distance.desc". Type is set as "double". Next I do a couple of things that might look confusing. Sometimes a station has no latitude or longitude. I handle that here by allowing those stations to be included in the Inventory, but I detect the problem and write a negative distance in the appropriate rows and columns. Our matrix will thus have three kinds of values: distances, a negative filler, and NA for elements of the matrix that don't have to be calculated (the lower half of the diagonal). I also determine which rows have valid Lat and Lon; that set will be used for loop control. Below we set the rows and columns of stations that have missing Lat/Lon to -9999.

missingLonLat <- which(is.na(Inventory$Lat) | is.na(Inventory$Lon))
validLonLat <- which(!is.na(Inventory$Lat) & !is.na(Inventory$Lon))
D[missingLonLat,] <- (-9999)
D[,missingLonLat] <- (-9999)

Time to loop over the data. The approach is simple: imagine we had a 4x4 matrix; we want to generate 1:2, 1:3, 1:4, 2:3, 2:4, 3:4. Our first loop control goes over 1-3 and the second goes over 2-4. Because some stations have missing Lat/Lon we will skip over them; that is the purpose of validLonLat, which contains those rows that are valid.

for(i in validLonLat[-length(validLonLat)]){
              for(j in validLonLat[validLonLat > i] ){

                      D[i,j] <- acos(Inventory$sinLat[i] * Inventory$sinLat[j]
                                                + Inventory$cosLat[i] * Inventory$cosLat[j]
                                                  * cos(Inventory$Lon[j]-Inventory$Lon[i])) * ER
                       if(is.nan(D[i,j])) D[i,j] <- 0

                  }
  }

So, we start looping over the first element in validLonLat and we loop to N-1. In the second loop, we skip lower row numbers. The distance is calculated using the precomputed sinLat and cosLat columns, so I am basically reducing the number of trig calls by orders of magnitude. Finally, there is a nasty little warning we have to care about. Sometimes, when two stations have the same location, acos() will blow up. This happens when acos() is fed a number slightly greater than 1 or less than -1. It happens rarely. The result is a NaN instead of the desired result of zero. Rather than prevent the warning, I decided to allow the calculation and then catch and fix the error; I suppose I could trim sinLat or cosLat and prevent it. In any case, when this finishes we have created a big matrix on file. Let's test it:

Inventory<-data.frame(Id=seq(1,1000),Lat=runif(1000,-90,90),Lon= runif(1000,-180,180))

Inventory$Lat[unique(sample.int(1000,20))]<-NA
Inventory$Lon[unique(sample.int(1000,20))]<-NA
startTime <- Sys.time()
BD <- createDistanceMatrix(Inventory)
print( Sys.time()-startTime)

I create a random inventory of 1000 stations. I introduce some random NAs and time the function:

Time difference of 6.371714 mins

Creating a matrix of 1 million elements takes a bit over 6 minutes. For this we are calculating roughly 500K distances. I'm sure part of the slowness of this approach is the repeated writing I am doing to the file, so I'll have to investigate ways to buffer this, but for now it works. I could also see ways of using apply(). Now, let's look at how we get data from this file.

test <- getNeighbors(“distance.desc”, Id = 333)

Warning message:
In getNeighbors("distance.desc", Id = 333) : coercing Id to character

To grab a station's neighbors I call getNeighbors() and pass the filename of the descriptor file. I may change this, but you'll get the idea. I also pass the Id. Id is supposed to be passed as a character; if you screw up and pass an integer, I coerce it under the hood and proceed. The code for the function looks like this:

getNeighbors <- function(descriptFile, Id){
    require(bigmemory)

    D <- attach.big.matrix(dget(descriptFile))
    if(class(Id) != "character"){
        warning("coercing Id to character")
        Id <- as.character(Id)
    }
    dex <- which(rownames(D) == Id)
    if(length(dex) != 1) {
        if(length(dex) == 0) stop("index not found")
        if(length(dex) > 1) stop("index found multiple times")
    }
    neighbors <- c(D[1:dex, dex], D[dex, (dex+1):ncol(D)])
    neighbors[is.na(neighbors)] <- 0
    neighbors[neighbors < 0] <- NA
    names(neighbors) <- rownames(D)
    return(neighbors)
}

Pretty straightforward. First we "attach" the file, reading it via the descriptor file. If Id is not a character I coerce it. Then I do some checking to ensure that the Id can be found in the rownames of the file and to make sure it is found only once. I will probably add a check to the creating function to make sure that Ids are unique; for now I check it here, but that will change. Then I pull the neighbors for the Id. The neighbors of any given station consist of the cells in a column plus the cells in a row. The best way to see this is to draw a diagram using a 4x4 matrix as an example. Suppose you want the neighbors of station 3: you take rows 1-3 of the 3rd column (cell 3,3 will be NA because we do not calculate the distance from a site to itself), and the balance of the neighbors sit in row 3 from column 4 onward. That is what this line does: neighbors <- c(D[1:dex,dex],D[dex,(dex+1):ncol(D)])

That line collects all the neighbors for an Id, including the Id itself. Next, we take the cell that is NA, D[dex,dex], and set it to 0 in neighbors. Finally, we take the negative distances and set them to NA. I suppose I could just as easily get rid of the whole -9999 ploy, and probably will in a final version.

Let's look at test:

test[330:340]
      330       331       332       333       334       335       336       337       338 
 7722.563        NA  7564.581     0.000 16180.166  8846.114 15410.572  5890.057 17409.563 
      339       340 
 7262.202 10958.382

So, station 331 had a missing Lat/Lon and station 333 is zero km from itself.

Neat. OK. I took a few minutes to add some cleanup. Here are the revised versions:

createDistanceMatrix <- function(Inventory, Directory = getwd(), filename = "distance.bin"){
    require(bigmemory)

    deg2rad <- function(deg) return(deg*pi/180)

    if(!grepl("\\.bin", filename)) stop("please use a bin extension")
    columnsPresent <- intersect(colnames(Inventory), c("Id", "Lon", "Lat"))
    if(length(columnsPresent) != 3) stop("missing the correct column names")

    descName <- sub("bin", "desc", filename)
    if(class(Inventory) == "matrix") Inventory <- as.data.frame(Inventory)
    if(length(unique(Inventory$Id)) != length(Inventory$Id)) stop("Ids must be unique")

    # some temporary variables to speed up processing of the distance calculation
    # these are done once and reused inside the loop
    Inventory[, "sinLat"] <- sin(deg2rad(Inventory$Lat))
    Inventory[, "cosLat"] <- cos(deg2rad(Inventory$Lat))
    Inventory$Lon <- deg2rad(Inventory$Lon)
    ER <- 6371

    nr <- nrow(Inventory)
    nc <- nrow(Inventory)

    options(bigmemory.allow.dimnames = TRUE)

    D <- filebacked.big.matrix(nrow = nr, ncol = nc,
                               dimnames = list(as.list(Inventory$Id), NULL),
                               init = NA,
                               backingpath = Directory,
                               backingfile = filename,
                               descriptorfile = descName,
                               type = "double")

    validLonLat <- which(!is.na(Inventory$Lat) & !is.na(Inventory$Lon))

    for(i in validLonLat[-length(validLonLat)]){
        for(j in validLonLat[validLonLat > i]){

            D[i,j] <- acos(Inventory$sinLat[i] * Inventory$sinLat[j]
                           + Inventory$cosLat[i] * Inventory$cosLat[j]
                           * cos(Inventory$Lon[j] - Inventory$Lon[i])) * ER
            if(is.nan(D[i,j])) D[i,j] <- 0
        }
    }

    return(D)
}

And

getNeighbors <- function(descriptFile = "distance.desc", Id){
    require(bigmemory)

    DATA <- attach.big.matrix(dget(descriptFile))

    if(class(Id) != "character"){
        warning("coercing Id to character")
        Id <- as.character(Id)
    }
    dex <- which(rownames(DATA) == Id)

    if(length(dex) == 0) stop("index not found")
    if(length(dex) > 1) stop("index found multiple times")

    neighbors <- c(DATA[1:dex, dex], DATA[dex, (dex+1):ncol(DATA)])
    neighbors[dex] <- 0
    names(neighbors) <- rownames(DATA)
    return(neighbors)
}

Categories: Uncategorized