## Tamino’s Method: Regional Temperatures

UPDATE: I pushed ahead and did a quick package build of 1.2 and put Tamino’s code in along with a demo. Should hit CRAN in a day or so

Tamino over at Open Mind has a new post detailing his approach for calculating temperature averages. See his post here.

His method is based on the Berkeley method as he notes and he uses it primarily for calculating regional or local temperature averages. Read his post for the math details behind the approach. I got my hands on the code last night and set about trying to figure out how to integrate his approach in the RghcnV3 package. That, of course, meant I stopped work on version 1.2 which is ready to hit CRAN. Tamino, posted his code in a pdf. That caused a little trouble with copying and pasting, but in the end it was easy to clean up.

His code would be a good example for other folks ( who shall go unmentioned) to follow. It was clean and understandable. I might complain a bit about variable names and ask for less terse names, but since I got the general idea of the algorithm it was easy to follow.

Here is his code. I’ve taken the liberty of tossing in a few comments, my initials signify my additions

# Here's the code: ################################## ################################## # align # combine station data # t is time # time data is a vector of times (smm) # x is data # data is a vector of data (smm) # part is station ID # part is a vector of station ids aligned with time and data (smm) # offres is resolution for offsets # tround is rounding digits for t # xround is rounding digits for x ################################## ################################## align = function(t,x,part,full=F,tround=NULL,xround=4,offres=.001){ ########################## # make "part" a factor # and use 1st as reference ########################## part = as.factor(part) nparts = length(levels(part)) # part is cast as a factor so he can loop over it and use it # as names in the dataframe. If there is only one station we bail (smm) if (nparts < 2){ zout = data.frame(t,x) zout = zout[order(t),] zout = list(data=zout,offsets=0) } else{ ####################### # form giant data frame ####################### # the giant data frame will end up having one station PER COLUMN (smm) # the merge( all =TRUE) function is used to create time series (smm) # that are complete with every station having every time. NAs are (smm) # filled into those years or months that are missing (smm) # loop over every individual station (smm) for (jpart in 1:nparts){ tt = t[part == levels(part)[jpart]] xx = x[part == levels(part)[jpart]] xx = xx[order(tt)] tt = tt[order(tt)] if (length(tround) > 0){ tt = round(tt,tround) } z1 = data.frame(tt,xx) names(z1) = c("t",levels(part)[jpart]) if (jpart == 1){ zz = z1 } else{ # for stations 2-n we merge using time. This aligns the stations (smm) # that may have different or missing times. NAs are filled in. Nice work tamino (smm) zz = merge(zz,z1,by=1,all=T) } } ### (smm) at this stage we have a data structure zz where everu column is a station ### (smm) every row in the data.frame is a time slice. ### (smm)column names are station ids ### (smm) the number of rows is the number of time slices. All time series are complete. ### (smm) NOTE the algorithm below will fail if stations have all NA values. That is, ### (smm) if you have a station where all times have no data, then you'll throw errors on the while loop. ### (smm) checks will be added for this condition ( perhaps dmu = sum(abs()) needs na.rm =T) ### (smm) basically I fed it some data where some stations had all NA and it crashed. two ways to fix that ### (smm) remove stations with all NA or perhaps add na.rm to the dmu =sum(abs()) line at the end of the while ############################### # iteratively determine offsets # and merged values ############################### # (smm) grab just the temperature data in columns 2:n zdat = zz[,2:(nparts+1)] zdat = as.matrix(zdat) ntimes = dim(zz)[1] mu = rep(0,nparts) gam = rep(NA,ntimes) dmu = 9999 while (dmu > offres){ for (jj in 1:ntimes){ xx = zdat[jj,] xx = xx-mu gam[jj] = mean(xx,na.rm=T) } oldmu = mu for (jj in 1:nparts){ xx = zdat[,jj] xx = xx-gam # (smm)xx will be NAs if the station has no data for all times ## (smm) mean of xx when xx is all NA will return Nan into mu[jj] ## (smm)that will cause an error in the while loop from the sum function dmu=sum() ## (smm) as dmu gets set to NA. mu[jj] = mean(xx,na.rm=T) } ################################## # shift offsets so 1st offset is 0 ################################## mu = mu-mu[1] # (smm) as noted above if a station has all NAs the the sum function below will return NA into dmu. crash dmu = sum(abs(mu-oldmu)) } ############ # final mean ############ num = rep(NA,ntimes) se = rep(NA,ntimes) std.dev = rep(NA,ntimes) for (jj in 1:ntimes){ xx = zdat[jj,] xx = xx-mu xx = xx[is.finite(xx)] num[jj] = length(xx) gam[jj] = mean(xx) std.dev[jj] = sd(xx) } se = std.dev/sqrt(num) zout = data.frame(x=gam,num,se,std.dev) if (length(xround) > 0){ zout = round(zout,xround) } zout = data.frame(t=zz[,1],zout) if (full){ zd = data.frame(zdat) names(zd) = levels(part) zout = data.frame(zout,zd) } zout = list(data=zout,offsets=mu) } zout }

Well, That is some easy code to work with. You’ll note my caveats about the NA station deal. I’ll explain how I found that one.

After determining what was in the data.frame (column 1 is time and 2:n columns are station data, I was stoked because that is the basic data structure of the zoo objects in RghcnV3. in a zoo object every column is station data. TIME in a zoo object is not in a separate column, its in a data slot. My first step thus was dividing taminos code into two functions: One function to build the data.frame and the other function to process the data. As much as possible I want to keep functions that BUILD data structures separate from functions that PROCESS and output data. I find that makes me focus on building good data structures that then can have various methods applied to them. In anycase I separated out the core processing function and changed its interface to accept a data.frame like tamino generates. Here is that code, actually the first few lines only:

taminoCore = function(zz,full=F,tround=NULL,xround=4,offres=.001){ # taminos algorithm fed a data frame with # add code in to output if there is only one station # time in col one. # (smm) changed to use ncol nparts = ncol(zz)-1 # moved code from the data.frame creation part of align to the # core math function. if (nparts == 1){ # only have one station zout = data.frame(zz[,1],zz[,2]) zout = zout[order(zz[,1]),] zout = list(data=zout,offsets=0) }else{

Wow. Like no change of course. Except I add the code for handling the single station handling to the core math function where it makes more sense. After I did this I threw some GHCN data at the algorithm and all hell broke loose. WTF! My function createTemperature() takes a GHCN dataset and creates a zoo object that mirror the dataframe tamino uses. So I took that zoo object, cast it as a dataframe, added a first column for time and threw it at this code. fail. EPIC fail with a nasty bug in the while loop about a true false value being missing. Oh ya, I forgot to mention that I ‘windowed’ the data. That is I took a cut from 1950 to 2010. Big mistake. It turned out that when I windowed the data I ended up with a dozen or so stations ( using a usa subset) that had all NAs for that time period. I guess they had data prior to 1950. Anyway, that took some figuring out ( debug by print() always fun ! ) That means I’ll have to add some checks for that in code and perhaps put some of those checks into Rghcnv3 package. Once I fixed that by removing the empty stations, the code just worked! and its fast.

The next step in my mind was to build a function to generate data. I needed to test that splitting the code into two functions did NOT change the answer. I’d miodified taminos code, had I changed the answer? The change was trivial, but It’s also good to check. So, what we do is create two more functions. A function to generate data and a function to generate the dataframe. Lets look at the second one first. Its just tamino’s code for making a dataframe:

taminoDF = function(t,x,part,full=F,tround=NULL,xround=4,offres=.001){ ########################## # make "part" a factor # and use 1st as reference ########################## part = as.factor(part) nparts = length(levels(part)) ####################### # form giant data frame ####################### for (jpart in 1:nparts){ tt = t[part == levels(part)[jpart]] xx = x[part == levels(part)[jpart]] xx = xx[order(tt)] tt = tt[order(tt)] if (length(tround) > 0){ tt = round(tt,tround) } z1 = data.frame(tt,xx) names(z1) = c("t",levels(part)[jpart]) if (jpart == 1){ zz = z1 } else{ zz = merge(zz,z1,by=1,all=T) } } return(zz) }

Pretty basic. I take tamino’s “align()” function and I split it into two functions: one to build the data.frame and a second to do the math. That means I can play around with the math function in isolation. Going forward I’ll probably break that math function down into sub functions. just to make the testing an optimization trials go more quickly. Lastly I create a program to generate data that gets feed into taminoDF. I did a simple one.

createTaminoData <-function(stations){ set.seed(42) startDates <- c(1900, 1910, 1920, 1950) endDates <- c(1970, 1980, 1990, 2010) stationStart <- 42598765000 myStations <- stationStart:(stationStart+stations - 1) L <- length(myStations) Time <- vector() Data <- vector() Id <- vector() for( i in 1:L){ stationStart <- sample(startDates, 1) stationEnd <- sample(endDates, 1) stationMonths <- ((stationEnd - stationStart) +1 ) * 12 stationTime <- outer(stationStart,seq(0,stationMonths-1)/12,"+") stationData <- rnorm(length(stationTime), 10 , sd = 5 ) stationId <- rep(myStations[i], length(stationTime)) Time <- c(Time, stationTime) Data <- c(Data, stationData) Id <- c(Id, stationId) } dataOut <- list(t = Time,data = Data, part = Id) return(dataOut) }</pre> <pre>

That function will create N series of time data for stations. The start dates and end dates are randomly selected. Time data is filled in and random temperature data is filled in. I suppose one could throw random NAs into the mix, but that will come as we test it with real data. for now I just want to test that splitting the function into two pieces didnt screw up the math ( how could it?)

To do that I write this

Test <- createTaminoData(50) AlignTest <-align(t = Test$t, x = Test$data, part = Test$part) DF <- taminoDF(t = Test$t, x = Test$data, part = Test$part) CoreTest <- taminoCore(zz=DF) identical(AlignTest,CoreTest) [1] TRUE

So there you have it. We’ve split tamino’s code into two pieces ( not that hard) and taminoCore() can now be modified to accept a zoo object instead of a data.frame. When that is done the algorithm can be added to the package and you’ll be able to download GHCN V3 data, select a region ( cropInv()) create the zoo object for that regions data, and feed the data directly to taminos function.

Verification code is in the drop box. enjoy

Steve,

Good stuff. Would you mind if I cross posted with links?

Sure go ahead. I’m making some other changes to his code ( basically taking some stuff

out ) and I should have a build tonight.

I’ve talked to Roman about getting your stuff in. Looking at your code I think it might be a bit more of a challenge, but in the input side I can output the kind of data structure that Roman

(mts) that Roman is expecting. I’ve added support for mts and it will go up in version 1.2.

Looking at his stuff I’m pretty sure I can speed it up or at least try. Anyway it works

like a charm

Thanks Steve. I don’t think my gridding is particularly elegant and would recommend redoing with your own methodology. Roman’s method allows input of a vector for station weighting too if I recall. That way you might try some unusual yet more accurate grid/station weighting.

I really think Roman’s method is the best out there, but it needs to be combined with a unique higher flexibility gridding. I would love to have time to do it with CRN 1 and 2 stations in the US.

Sorry should have asked by email. Snip it if you like.

Hi Steven,

I’m not a big user of R but I would like to get the hang of it. My goal is to be able to download from ghcn, use tamino’s method on a specific set of lat/longs and then be able to computer confidence intervals based on the number of stations. What would be the best way about doing that?

Well, It would go something like this:

1. you should get familiar with R,

2. download the package RghcnV3. Look at the demo examples: I did one called texas for Tamino’s method.

3. Then you have to write a short program to resample: use the function sample()

Since the station Id are integers you just sample() from that vector.

you then call the function regionalAverage() N number of times. You collect

that series of time. If I get some time later this week I will do a toy example for you

Basically:

v3Data <-readV3Data()

inv <-readInventory()

myArea <-extent(lonMin,lonMax,latMin,latMax)

inv <-cropInv(inv)

v3Data <-windowV3(v3Data,start=1900,end=2010)

TS <-v3ToZoo()

DATA <-intersectInvZoo(inv,TS)

## there you have all the stations and time series data for the region in myArea

stations <- getStations(DATA$Inventory)

Then it depends how you want to sample the space. if you have 50 stations

what do want to.. look at 5 station sets, 10 station sets etc etc. Its just

brute force

Then loop

thisSample <- sample(stations, 10, replace=FALSE)

…

..S <- regionalAverage(…..)

end loop

You R a Beautiful man.

U see it:

Too.

He’s not in the sky, afteR all.

He’s inside.

The answer is yes.

End game:

Propeller heads bashing into nowhere walls.

That also defines you.

we R the hollow men,

headpiece filed with straw.

I do code. It’s like knitting or playing bingo.