## MoshTemp302

I’ll probably update the code dump this weekend unless somebody screams. For now I’ll just go through some of the code that I’ve just added in and retested.

Recall where we are. We have V2Anomalies: rows of time, columns of stations. And we have a gridded inventory. The two have to work together, and here is the issue. When we process v2mean we lose stations: we apply a time window and we have CAM criteria. So V2Anomalies has around 5100 stations, while the inventory has 7280. We should trim the inventory. That's easy. However, we have another issue: the land mask. Do we want to weight cells by the area of the cell, OR weight by the land area? If the land mask is TRUE we want to weight by land area, which means we might have to trim both v2mean and the gridded inventory. Also, we want to be able to handle situations where we take the 5100 stations in V2Anomalies and trim them further, for example selecting only the airport stations. There are a bunch of solutions to this, none pretty. I could just assume that the user will do this trimming on his own. It's easy. BUT if you forget, you get bad answers. So I want to enforce a reconciliation.

R functions don't modify their arguments. The scoping rules are such that a function can't change a parameter you pass to it. Well, you CAN with a trick (the `<<-` operator), but I will avoid that. Instead I handle the problem this way:
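A tiny illustration of that copy-on-modify behavior, and of the "return a list" idiom used below (toy names, not part of MoshTemp):

```r
# Modifying an argument inside a function only touches the local copy;
# the caller's object is untouched. To get changes back, return them.
trim <- function(inv) {
  inv$Id <- inv$Id[1:2]   # changes the local copy only
  inv                     # return the modified copy
}

inventory <- list(Id = c(101, 102, 103))
trimmed   <- trim(inventory)

length(inventory$Id)  # still 3 -- the caller's object is unchanged
length(trimmed$Id)    # 2 -- the change lives in the returned value
```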

>temp<-intersect.InvAnomalies(gridInv,V2Anomalies,landmask)

>V2Anomalies<-temp$Anomalies

>gridInv<-temp$GridInventory

Basically, call a function that intersects these three pieces of information in one place and returns TWO values, held in a list. That's a basic R paradigm. Let's look inside that code:

```r
intersect.InvAnomalies <- function(gridInv, anomalies, mask = LAND.MASK) {
  if (is.null(gridInv$Cell)) stop("Pass a valid grid inventory")
  stations <- gridInv$Id
  # with the land mask on, keep only stations in cells with some land
  if (mask) stations <- gridInv$Id[which(gridInv$Land > 0)]

  anomaly.stations <- getAnomalyStations(anomalies)
  common <- intersect(stations, anomaly.stations)
  if (length(common) == 0) stop("No stations in intersection")

  # trim BOTH objects down to the common stations
  InvOut     <- gridInv[which(gridInv$Id %in% common), ]
  anomalyOut <- anomalies[, which(anomaly.stations %in% common)]
  list(GridInventory = InvOut, Anomalies = anomalyOut)
}
```

Straightforward. We check to make sure that the caller passed in a gridded inventory. Then we select stations that have a value for Land greater than ZERO, iff the land mask is true. This drops a few stations on small islands and ships. Then we 'get' the stations from the Anomalies; the function that does this just gets the column names. Then we do some set logic to find the intersection: **common<-intersect(stations,anomaly.stations)**. Then we trim BOTH the anomalies and the inventory so they are justified with each other. If you don't do this (trust me) you will have trouble. Lastly we return the two trimmed objects in a list.
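The function above can be exercised on toy data. This sketch is self-contained: the inventory and anomaly matrix are made up, and `getAnomalyStations` is stubbed as "take the column names", as described in the text.

```r
# Stub: in MoshTemp this reads the station ids off the anomaly matrix.
getAnomalyStations <- function(anomalies) colnames(anomalies)

intersect.InvAnomalies <- function(gridInv, anomalies, mask = TRUE) {
  if (is.null(gridInv$Cell)) stop("Pass a valid grid inventory")
  stations <- gridInv$Id
  if (mask) stations <- gridInv$Id[which(gridInv$Land > 0)]
  anomaly.stations <- getAnomalyStations(anomalies)
  common <- intersect(stations, anomaly.stations)
  if (length(common) == 0) stop("No stations in intersection")
  InvOut     <- gridInv[which(gridInv$Id %in% common), ]
  # drop = FALSE keeps a matrix even when one station survives
  anomalyOut <- anomalies[, which(anomaly.stations %in% common), drop = FALSE]
  list(GridInventory = InvOut, Anomalies = anomalyOut)
}

# Three stations in the inventory: S3 sits in an all-ocean cell,
# and the anomaly matrix only has columns for S1 and S3.
gridInv <- data.frame(Id   = c("S1", "S2", "S3"),
                      Cell = c(10, 11, 12),
                      Land = c(1, 1, 0),
                      stringsAsFactors = FALSE)
anoms <- matrix(rnorm(6), nrow = 3, dimnames = list(NULL, c("S1", "S3")))

out <- intersect.InvAnomalies(gridInv, anoms, mask = TRUE)
out$GridInventory$Id   # "S1" -- S2 has no anomalies, S3 is masked out
ncol(out$Anomalies)    # 1
```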

The process flow looks like this: read the stored v2Anomalies from disk. Read in the entire inventory. Slice and dice the inventory however you like: select airports, rural, high altitude, whatever. Build as many different inventory options as you like. Then intersect each one with the BASELINE Anomalies.

The next piece of code takes us to the next level of geographical data.

```r
as.GridCells <- function(anomalies, gridinv) {
  if (is.null(gridinv$Cell)) stop("Pass a gridded inventory")
  nstations        <- length(gridinv$Id)
  anomaly.stations <- ncol(anomalies)
  if (!(nstations == anomaly.stations)) stop("Inventory mismatch")

  # sort BOTH objects into cell order so stations stay lined up
  o         <- order(gridinv$Cell)
  gridinv   <- gridinv[o, ]
  anomalies <- anomalies[, o]

  # transpose so stations are rows, then average the stations in each cell
  anomalies  <- t(anomalies)
  idx        <- list(unlist(gridinv$Cell))
  cellmatrix <- aggregate(anomalies, by = idx, mean, na.rm = TRUE)
  cellmatrix[is.na(cellmatrix)] <- NA   # turn NaN (all-NA cells) into NA

  # the first column holds the cell numbers: make them the row names
  row.names(cellmatrix) <- cellmatrix[, 1]
  cellmatrix <- cellmatrix[, -1]
  return(cellmatrix)
}
```

What is going on here? In a nutshell: we are averaging stations WITHIN a grid cell. First I do yet ANOTHER check on the justification of the inventories; I cannot prevent the user from passing in the WRONG inventory, size-wise. Then we sort both the inventory and the anomalies into cell order and apply the function aggregate, which collects all the stations in a cell and averages them. Note also that I transpose anomalies so rows become columns and columns become rows. We fix the NaN issues that arise when a cell has no data. Lastly we take the first column (which contains cell numbers) and make row names out of it. We return a cell matrix that is grid cells by row and time by column.
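To make the aggregate step concrete, here is a tiny self-contained toy (my own made-up cell numbers and values, not MoshTemp data): four stations, two cells, three time steps.

```r
# Rows are time steps, columns are stations (the V2Anomalies layout).
anomalies <- matrix(c( 1,  3, 10, 20,
                       2,  4, 10, NA,
                      NA, NA, 10, 30), nrow = 3, byrow = TRUE)
cells <- c(100, 100, 200, 200)   # station -> cell assignment

# Transpose so stations are rows; aggregate() then averages the
# rows that share a cell number, time step by time step.
m <- aggregate(t(anomalies), by = list(cells), mean, na.rm = TRUE)
m[is.na(m)] <- NA                # all-NA groups come back as NaN
row.names(m) <- m[, 1]
m <- m[, -1]
print(m)
# cell 100: 2 3 NA ; cell 200: 15 10 20
```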

That's it. A couple more routines and we are done.

Steven,

I have a somewhat unrelated question for you. Which method do you view as superior: the RSM method or the first-difference method? I have tried to do my own variant of the RSM method in which the MAAT average of all the stations over the time period is used as a reference. I then compute the offsets between each station's data and the MAAT average, adjust each station by the average offset over the overlapping period, and then compute the MAAT average of the offset-adjusted stations. Do you think this introduces any flaws? It just seemed easier than the RSM method, without the data requirements of the anomaly method.

Robert, good question. We are discussing it and others like it on CA.

I'm having a hard time following your description. A math formula or toy example will help.

Okay, we have say 30 weather stations.

We have yearly values.

We make an average of all weather stations for each year. Some have missing years and discontinuities and such, so the average of all the weather stations is obviously faulted by this. In order to try to fix it, I calculate the offsets (by subtraction) between each individual station and the average of all the weather stations where they overlap. I do this for each station. With the offsets computed, I adjust each series (add the offset), so all the stations are lined up. Then I take an average per year of all the stations (after adjustment) to make my final composite. I'm asking if there are any large errors in this method, as it is slightly different from the RSM method. Thanks for your time; I read the CA discussion and am learning a good bit about the first-difference method.

Station 1 10 10 10 10 10

Station 2 5 NA NA NA 5

Station 3 10 10 NA NA 10

We make an average of all weather stations for each year.

Work through your approach with a toy example

Raw Data (with the simple-averaging MAAT in the last column):

Year    A     B     C     D     MAAT
2001    5     7     16    6     8.5
2002    10    NA    15    12    12.33333
2003    23    12    NA    8     14.33333
2004    16    20    12    NA    16
2005    12    19    NA    10    13.66667
2006    NA    18    12    11    13.66667

Avg of Series:                  13.2      15.2      13.75     9.4
MAAT Avg During Overlap Period: 12.96667  13.23333  12.625    12.5
Offsets:                        -0.23333  -1.96667  -1.125    3.1

Adjusted Data:

Year    A         B         C         D         Final Product
2001    4.76667   5.03333   14.875    9.1       8.44375
2002    9.76667   NA        13.875    15.1      12.91389
2003    22.76667  10.03333  NA        11.1      14.63333
2004    15.76667  18.03333  10.875    NA        14.89167
2005    11.76667  17.03333  NA        13.1      13.96667
2006    NA        16.03333  10.875    14.1      13.66944

I'm not quite sure if this is what you meant by a toy example; I'm semi-new to this stuff. That being said, I can't provide code because I use Excel for these calculations, though I'm trying to learn Matlab (not so easy). It is of course not feasible to do manual calculations for global reconstructions, but for small areas it is workable.

If there are any errors with my method I would love to find out a better way which can be done without needing too much code.

Thanks for your time.

MAAT?

how is that calculated in a simple example. Spell out all the detail using an example with missing data

where does 12.96667 come from?

Avg of Series

13.2 15.2 13.7 9.4

average of each or average between? spell it all out

To clarify now,

Raw Data is the first matrix, it goes:

Year (Station A) (Station B) (Station C) (Station D) MAAT

Okay,

We have our station data from stations A, B, C, and D

We have holes in the data for some years but we still want to calculate the average regional temperature for 2001 through 2006 for example.

We begin by taking our single-column matrices for each station (6-by-1 matrices) and averaging them all together for each year to get a MAAT (mean annual air temperature) averaged across the entire region (call it RegMAAT). This is the simple averaging method and it is obviously flawed, both because there are holes in the data and because certain stations might be warmer or cooler than others: if a cold station is included for 2006 but not 2005, the simple average will dip even if the real anomalies rose.

In order to deal with this problem we don't throw away the RegMAAT; instead we keep it to use as our reference station for the RSM method. Instead of using the station with the longest record, merging it with the 2nd longest record (and so on) and calculating offsets between the two like you would in the RSM method, you use the RegMAAT as the reference station and calculate the offsets of each individual station from the RegMAAT reference at once.

So you don’t have to use a weighted average like you do in the RSM method but rather you can calculate all the offsets between the RegMAAT and each station (A,B,C, and D) at once.

The purpose of the offsets is to try to bump all the stations up to the same level. To do so for station A, for example, we take the average of the RegMAAT data where it overlaps with Station A and then subtract from it the average of all the values in Station A's column. This step is repeated for each station.

The result is you end up with offsets for each station compared to the RegMAAT.

In order to adjust each station so that they are all at the same level, we add the offsets to the corresponding data for each station. This essentially bumps the data up so it's all in line.

Now we have a set of Adjusted values for each year at each station.

We take the average of these and that is your final RSM (comparable) result.

Therefore you end up with an Adjusted RegMAAT which avoids the perturbation of cold or warm stations and missing years.

I don't know if this method really works well or is optimal, but it is something that I've been working on and I think it makes sense in some ways. Let me know what you think.

Station 1 15 15 15 15 15

Station 2 10 10 10 10 10

Station 3 20 20 20 20 NA # guess what it is

MAAT 15 15 15 15 8.3

Now what? You need to do a TOY example of the EXACT calculation or write a FORMULA.

Not words. Or a formula AND words AND an example. But I'll try to follow your words.

MATH is best done in symbols. Symbols have no ambiguity.

Offset

Station 1 0 0 0 0 7.7

Station 2 5 5 5 5 1.7

Station 3 -5 -5 -5 -5 NA

“In order to do so for station A for example we take the average of the RegMAAT data where it overlaps with Station A and then subtract it by the average of all the values in Station A’s column.”

MAAT <- (15+15+15+8.3)/4 = 13.325

and then subtract it by the average of all the values in Station A’s column."

IT? subtract WHAT? 15-13.325? OK: 1.675

10-13.325 = -3.325

5-15= -10

"The result is you end up with offsets for each station compared to the RegMAAT."??

I already have offsets. Are you using one word to mean two different things?

Mathematically, you are handling missing data by merely smoothing averages. Unless you write the math out you cannot tell what you are doing, and I cannot help.

Anomaly method

Station 1 0 0 0 0 0

Station 2 0 0 0 0 0

Station 3 0 0 0 0 NA

HOT or COLD is NOT the issue. NOT THE ISSUE. Anomaly is a STANDARDIZING process.

That is why we use it.

First, just to point out: cold does matter. I have a 100-year temperature composite, and in the last 30 years there is data from a station at the northernmost location which is extremely cold, so if I just averaged, the last 30 years would look colder.

Anyways I’ll try to explain and show better. Below is the raw data in (1/10th of a °C)

NA means not available

2001 2002 2003 2004 2005 2006

Station A 5 10 23 16 12 NA

Station B 7 NA 12 20 19 18

Station C 16 15 NA 12 NA 12

Station D 6 12 8 NA 10 11

Therefore SimMAAT= the average of all raw data available from all stations for each year

We calculate this below and this will be used as the reference station in the RSM method

2001 2002 2003 2004 2005 2006

SimMAAT 8.5 12.33 14.33 16 13.67 13.67

In order to determine how much we need to bump each station to make it line up with the reference station (SimMAAT) we need to calculate how much to adjust each station.

Ultimately what we do is we try and make the average of the station data (StAVG) equal the average of the reference station (SimMAAT) where they overlap

Thus we need StAVG=SimMAAT where overlap between the two occurs. Thus we will not include SimMAAT data for years in which the station we are looking at does not have data

St AVG (Station A)= 13.2 SimMAAT (Overlap period: 2001-2005)= 12.967

StAVG (Station B)= 15.2 SimMAAT (Overlap period: 2001,2003-2006)= 13.23

StAVG (Station C)= 13.75 SimMAAT (Overlap period: 2001,2002,2004,2006)= 12.625

StAVG (Station D)= 9.4 SimMAAT (Overlap period: 2001-2003,2005,2006)= 12.5

In order to calculate my offsets I subtract each station's StAVG from that station's overlap SimMAAT

Therefore, for station A it would be: SimMAAT (Overlap period: 2001-2005) – StAVG (Station A)

or 12.967-13.2

Therefore, Offset (Station A) = -0.233

I follow the same step for each station and end up with the following offsets between the reference station and the raw station data

Offset (Station A)= -0.233

Offset (Station B)= -1.967

Offset (Station C)= -1.125

Offset (Station D)= 3.1

In order to get the adjusted Data series which is lined up with the reference station, the following method is employed

Raw Data (Station (X)) + Offset (Station (X)) =Adj Data (Station (X))

Therefore for Station A Data

2001 2002 2003 2004 2005 2006

5 + (-0.2333) 10 + (-0.2333) 23 + (-0.2333) 16 + (-0.2333) 12 + (-0.2333) NA

4.767 9.7667 22.76667 15.7667 11.7667

This step is repeated for each station to get the adjusted data for each.

Finally to end up with the temperature composite, all one has to do is average all the stations for each year because they are lined up.

FComp (2001) = average Adj Data (2001, Station A:Station D)

Therefore the final product is :

2001 2002 2003 2004 2005 2006

8.44 12.91 14.63 14.89 13.97 13.7
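Robert's recipe above can be written out in a few lines of R (my sketch; the variable names SimMAAT, StAVG, and Offset follow his description):

```r
# Robert's offset method on the 4-station toy data above.
# Rows are stations A-D, columns are years 2001-2006.
S <- rbind(A = c( 5, 10, 23, 16, 12, NA),
           B = c( 7, NA, 12, 20, 19, 18),
           C = c(16, 15, NA, 12, NA, 12),
           D = c( 6, 12,  8, NA, 10, 11))

SimMAAT <- colMeans(S, na.rm = TRUE)   # simple average per year
StAVG   <- rowMeans(S, na.rm = TRUE)   # average of each series

# offset = SimMAAT averaged over the station's overlap, minus StAVG
Offset <- sapply(1:nrow(S),
                 function(i) mean(SimMAAT[!is.na(S[i, ])])) - StAVG

# R recycles the length-4 Offset down each column, so row i gets Offset[i]
Adjusted <- S + Offset
Final    <- colMeans(Adjusted, na.rm = TRUE)
round(Final, 5)
# [1]  8.44375 12.91389 14.63333 14.89167 13.96667 13.66944
```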

I will see if I can unravel this later tonight. The process of averaging and offsetting and re-averaging is not very attractive, especially since you have no idea of the SD of the means you are subtracting. If you think about what you are doing, you are merely estimating the missing data by averaging: sometimes averaging over short periods (short overlap), sometimes over long periods (long overlap). One way to test this is with synthetic data. Programming-wise it's a snap, but you have to learn how to describe things in math and get away from using Excel.

[1] 8.44375 12.91389 14.63333 14.89167 13.96667 13.66944

I implemented your method and put in regular synthetic data with a known answer. Your method gives a very wrong answer:

A<-rep(10,100)

B<-rep(20,100)

C<-rep(0,100)

D<-rep(-50,100)

C[20:30]<-NA

Just run that example thru your spreadsheet.

Compare SiMAAT with the truth, and your estimate with the truth. Use anomalies and you won't have that issue. Also, put through synthetic data with warming trends: every station warming at the same rate, some cold (0 warming to 10), others warm warming to hot (10 to 20). Then cut sections out. You will see that your estimate distorts the truth depending on when you take the data out and how much you take out.

A<-rep(10,10)

B<-rep(20,10)

C<-rep(0,10)

D<-rep(-50,10)

C[3:5]<-NA

S<-rbind(A,B,C,D)

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

A 10 10 10 10 10 10 10 10 10 10

B 20 20 20 20 20 20 20 20 20 20

C 0 0 NA NA NA 0 0 0 0 0

D -50 -50 -50 -50 -50 -50 -50 -50 -50 -50

SiMAAT<-colMeans(S,na.rm=T)

print(SiMAAT)

[1] -5.000000 -5.000000 -6.666667 -6.666667 -6.666667 -5.000000 -5.000000 -5.000000 -5.000000 -5.000000

O<-rowMeans(S,na.rm=T)

print(O)

A B C D

10 20 0 -50

A2<-mean(SiMAAT[!is.na(A)])

B2<-mean(SiMAAT[!is.na(B)])

C2<-mean(SiMAAT[!is.na(C)])

D2<-mean(SiMAAT[!is.na(D)])

X<-c(A2,B2,C2,D2)

print(X)

[1] -5.5 -5.5 -5.0 -5.5

Offset<-O-X

print(Offset)

Offset2<-matrix(rep(Offset,10),ncol=10,byrow=F)

print(Offset2)

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

[1,] 15.5 15.5 15.5 15.5 15.5 15.5 15.5 15.5 15.5 15.5

[2,] 25.5 25.5 25.5 25.5 25.5 25.5 25.5 25.5 25.5 25.5

[3,] 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0

[4,] -44.5 -44.5 -44.5 -44.5 -44.5 -44.5 -44.5 -44.5 -44.5 -44.5

Adjust<-S-Offset2

Answer<-colMeans(Adjust,na.rm=T)

print(Answer-SiMAAT)

[1] -0.375000 -0.375000 1.166667 1.166667 1.166667 -0.375000 -0.375000 -0.375000 -0.375000 -0.375000

You merely spread the error over time, distorting the trend information. Not a good property when it is the trend we are interested in. The anomaly method solves this problem easily and preserves the trend regardless of whether stations are HOT or COLD. So, no need to invent a new method when the old ones work well, especially if you don't bother to test the method with synthetic data.
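A quick synthetic check of that claim (my own toy, using a common-baseline anomaly in the spirit of CAM): two stations with the same 0.1-per-step trend, one warm, one very cold with a gap.

```r
# Identical warming trends, very different levels; the cold record has a hole.
years <- 1:10
A <- 10  + 0.1 * years           # warm station, complete
B <- -50 + 0.1 * years           # cold station
B[3:5] <- NA                     # knock a hole in the cold record

# Simple averaging: the mean jumps when the cold station drops out.
simple <- colMeans(rbind(A, B), na.rm = TRUE)

# Anomaly method: subtract each station's mean over a COMMON baseline
# period (years 6-10, where both have data), then average.
base  <- 6:10
anomA <- A - mean(A[base])
anomB <- B - mean(B[base])
anom  <- colMeans(rbind(anomA, anomB), na.rm = TRUE)

diff(simple)   # huge jumps (~30) at the gap boundaries
diff(anom)     # a steady 0.1 per step: the trend survives the gap
```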

Well, at the end of the process, after I have the adjusted MAAT, I compute anomalies for it.

Thank you for helping with this. Do you have a suggestion for which method I should use instead? My real data is from 31 stations with yearly data spanning 1885 to 2009.

I have many series with holes in them and little effective overlap in many important stations, rendering me unable to use the CAM method. I am unfortunately still using Excel but, like I said, I hope to be good with Matlab within the new year. For now my data is all in Excel; I have a spreadsheet with all my raw data for the 31 stations.

I am not going to do any gridding or distance weighting. The message I take from you is that my results are incorrect.

Is another method optimal in this case?

you only have yearly data?

Well, your "RSM" approach is basically ESTIMATING missing data from known data, so just use a standard method to do that. There are several ways; which one depends on the gap in the station. You could also do it by correlation with other stations.

For all the stations, is there ANY period of overlap?
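Infilling by correlation with a neighbor can be as simple as a regression. This is a hedged sketch with made-up data; real infilling methods (RegEM and friends) are considerably more careful about uncertainty.

```r
set.seed(42)

# A neighbor with a full record, and a target correlated with it but gappy.
neighbor <- 10 + 0.02 * (1:100) + rnorm(100, sd = 0.5)
target   <- 5 + 0.9 * neighbor + rnorm(100, sd = 0.3)
target[40:55] <- NA                     # the gap to fill

# Regress target on neighbor over the overlap (lm drops NA rows),
# then predict the missing values from the neighbor.
fit    <- lm(target ~ neighbor)
holes  <- is.na(target)
filled <- target
filled[holes] <- predict(fit, newdata = data.frame(neighbor = neighbor[holes]))

sum(is.na(filled))   # 0 -- the series is complete again
```

Note the catch Steven raises below: the estimation error of the infilled values is hard to carry through into the final analysis.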

Point me at the source.

My data comes from the Canadian Climate Data Archive at

http://www.climate.weatheroffice.gc.ca/advanceSearch/searchHistoricData_e.html?timeframe=1&Prov=XX&StationID=9999&Year=2010&Month=8&Day=21

I used this data because I wanted to use raw data that has not been adjusted in any way. This might be of interest to you for your UHI work because you could theoretically use completely unadjusted data to analyze the difference in trends between rural and urban. Plus Canada has a lot of truly rural areas and a few large urban centers.

My data would be hard to find because I downloaded the 31 series individually and then processed the data with an excel macro to take it from a straight line monthly data into a gridded monthly data format. My final table includes yearly data for the 31 series that I use and if you would like the table I can email it to you (my email is rway024@uottawa.ca) So send me a msg if you want it.

So the data does come in monthly but I have already changed it to yearly because i am comparing it with climate indices such as TSI which comes as yearly values.

For all stations there is not necessarily a period of overlap with one station but there is with my method. The thing about doing the RSM method is that eventually you do get a period of overlap as you get through the longest few records.

Use the monthly data to infill and estimate missing months. Comparing to TSI is not a path I would suggest. You will find one of two things: good correlation or bad correlation. If you find good correlation, then what? One might conclude what we already know: the sun influences the temperature. But we knew that, and we also know that GHGs increase the warming. So the question is which aspect is the most important, and that question can't be answered by looking at the data you have; it can only be APPROACHED by simulation. Trying to tease out causality from data analysis is fraught with danger. Best to start from first principles of physics.

Hmm. Looking at those data sources I would start with the daily data. I would write them and ask them for access to the files behind the interface.

Failing that I would write a program to scrape their damn data. However, you would need to QC that data. It’s a lot of work.

Just because data is "raw" does not mean it is better. Some of that data may already be online in different databases held by NOAA. GHCN is not the only database; it was compiled in the 90s, a 1990s look backward at what the history was at that time, collecting and collating many historical records.

Well considering what I have right now in terms of data, is there a preference of a method to use?

The method will depend on the data. For filling in missing annual data? Wow. I'd say ask RyanO; maybe regM. I tend to work on methods where you DON'T estimate the missing data, because the error of estimation is hard to account for in the final analysis.

Well if you were in my situation with respect to the data what would you do? If you were given yearly data which spans 1880 to 2009 from 31 stations and were told to turn this into a regional temperature composite, what would be the method you would use?