MoshTemp 102
In the last installment we started by downloading the v2.mean.Z file unzipping it and processing the raw data into an R object. The code for that is Moshtemp101.R, located here. If that code is executed we end up with an R object containing all the data of v2.mean. Data<-readV2Mean(), gets us the Data object. At your console execute the following command on Data:
>str(Data)
We can also access the data using a column number. The duplicates variable is in the second position and we can summarize duplicates by doing the following
>summary(Data[,2])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.5781 1.0000 9.0000
This tells us that some station Id’s come with many duplicate records, as many as 9, while most have no duplicates. Visually we can see this by plotting a histogram. I’ll spare you the graphics.
>hist(Data$Duplicates,breaks=10)
And we can see that there are a fair number of temperature values as well as missing values. First, we will sum up the temperature values. The function, is.na(), simply takes every value that has a value of NA (missing) and assigns it a value of 1 for TRUE. !is.na(DATA), gets us the opposite. A data structure with TRUE for every place where there is a temperature.
sum(!is.na(Data))
[1] 8579259
> sum(is.na(Data))
[1] 381336
And we can take a mean of all temperatures by month if we like, we issue the mean command over all the columns we want. And we have to remove NA’s or the mean will be NA.
mean(Data[,4:15],na.rm=T)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
26.13510 39.13747 73.96073 118.90887 158.05500 189.59006 208.01058 203.03812 174.36174 131.29355 80.00377 39.81402
That’s enough basics for now. Lets proceed to the problem of combining duplicates. First an example of what a duplicate looks like: In the example below a duplicate is actually just the same values. But that is not always the case.
> Data[74:75,1:15]
Id Duplicate Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
74 10160360000 0 1963 NA 104 121 149 160 210 250 260 226 172 167 139
75 10160360000 0 1964 104 122 141 148 189 223 243 250 231 184 152 111
> Data[102:103,1:15]
Id Duplicate Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
102 10160360000 1 1963 NA 104 121 149 160 210 250 260 226 172 167 139
103 10160360000 1 1964 104 122 141 148 189 223 243 250 231 214 152 111
It doesnt take long to find one. Lets start by pulling up all those Ids that have 9 duplicates
Data$Id[which(Data$Duplicate==9)]
[1] 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000……
I just pick one.
> Data[Data$Id==22230710000,]
And if you scan that display you’ll spot one. Pull these row numbers.
> Data[151606,]
Id Duplicate Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
151606 22230710000 6 1971 -168 -205 -103 25 99 153 164 153 78 26 -48 -182
> Data[151626,]
Id Duplicate Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
151626 22230710000 7 1971 -168 -205 -104 25 100 153 164 153 78 25 -48 -182
As you can see duplicate record 6 and 7 have the same value for most months but on certain months they differ. In this case by 1/100 of a degree C. I spent a considerable time comparing duplicate records. It’s a lot of grunt work. In the end, you have to decide how to process these records. There are several analytical choices and for my work I decided to average the duplicates. This will entail finding all the duplicate records within a ID that have overlapping years and then averaging them by months. Handling this in a looping structure is fairly straightforward. Also, since most of the stations don’t have duplicates it might be attractive to segregate the stations into those that have duplicates versus those that don’t. I’ve coded up several versions of this process using various built in functions of R. In the end, I settled on a solution offered up by the R help list. This approach just computes the mean even if there are no duplicates. Since averaging duplicates only needs to happen when you refresh the base data, I decided that an elegant one line wonder was preferable to more complicated code. The averaging code:
averageDuplicates<-function(v2mean){
data<- aggregate(v2mean[c('Jan','Feb','Mar','Apr','May','Jun',
'Jul','Aug','Sep','Oct','Nov','Dec')], v2mean[c('Id', 'Year')], ‘mean’, na.rm=T)
data[is.na(data)]<-NA
return(data[order(data$Id),])
}
That’s it! The aggregate function works on a data structure, in this case those columns of the incoming data that are labeled with monthly names. Aggregate, needs to be fed a set of indices, in this case we feed it an indices for Id and an index for Year. v2mean[c('Id','Year')]. The function mean is then applied to each of the columns. NA,s are removed and the result is output. The last couple of details. I’ve included a line that looks like it makes no sense data[is.na(data)]<-NA. which looks like that just assigns an NA to a element that is already NA!. In reality when we aggregate over the dateframe there will be cases where we are averaging NA’s. But, when we call mean(…na.rm) if the lines only have NAs then we get a result that is NaN. In previous versions, I had written code to detect this and apply the mean() function conditionally, that is only if the values were not all NA. Since is.na(NaN) tests true, I just choose to replace the NaN with NA. Finally, I do a quick sort on the IDs, just to make sure that everything has remained in order. This isn’t strictly necessary.
At the end of Moshtemp101 we had read in the raw v2.mean and created an R object Data. Now we average all the duplicates in that file:
Data<-averageDuplicates(Data)
Or if we want to do it all in one line, we just do this:
Data<-averageDuplicates(readV2Mean())
Next up, Selecting subsets of the data, like spans of years, and writing out data objects. Since, we won’t be processing duplicates very often, we will be saving the results of these processes as R objects. More on that in Moshtemp103. And yes I hate my color scheme already. Find MoshTemp102 at the drop
Yes, the aggregate() function is very neat. That’s a very helpful mailing list.
Your color scheme is very arty, and the blog looks great, but the blue on black, especially, is very hard to read.
Thanks Nick,
I dont’ know what came over me when I picked the template, Artisty more than techy has allows been the dilemma.
Aggregate is very slick although I’ve had some issues with it of late when using it to sum months into years.
I’ll see what I can do about the blue on black
Combining duplicates when the monthly values differ by 1/10 degree C seems safe because those differences are due to rounding during the conversion from degrees F to degrees C. However, here’s an example of when NOT to combine duplicates:
Know why the numbers are so different? Because duplicate 0 is airport data and duplicate 1 is city data. They are totally different stations, even though they are both “in” Denver.
Some where my stack of stuff I have the total statistics on all duplicates the maximum seen difference, the median, the average etc.
duplicates are the same location. Location is given by the ghcnid 42572469000. The duplicate flag means that NOAA has two officila reports for the same station at the same location. the 0 or 1 does not refer to a different location. regardless, if the were different locations you would still average them
Duplicates are SUPPOSED to be the same location. In the case of Denver (and others), they are not. Would you average New York and Boston? They are different locations.
JR. You are confusing IMOD with duplicate:
4257246900001948 -31 -30 -3 109 148 191 226 227 193 107 24 -13
4257246900011948 -14 -13 7 118 155 198 236 233 200 114 33 -1
42572469000=GHCNID 425 =country,72469=WMO,000=IMOD
425724690000=GHCNID 425 =country,72469=WMO,000=IMOD,0=Duplicate
DONT confuse that with IMOD differences
42572469000,DENVER STAPLE,39.78,-104.87,1626,1637,U,1980,FL,NA,no,NA,TRUE,1,COOL IRRIGATED,C
42572469001,DILLON 1E,39.63,-106.03,2763,2937,R,NA,MV,NA,LA,NA,FALSE,NA,COOL CONIFER,C
Other people have made the same mistake.
4257246900001948 -31 -30 -3 109 148 191 226 227 193 107 24 -13
4257246900011948 -14 -13 7 118 155 198 236 233 200 114 33 -1
That is the same GHCNID. BOTH DENVER. two records for Denver, record number 0, and record number 1.
When NOAA built the GHCN they went to get records. In this case they got TWO records from the same location.
Now, Location,
42572469001 is Different, That is Dillion.
Some methods, like Hansen, will COMBINE at the IMOD level using a reference station method. I dont.
The difference between IMOD and duplicate is tough because you actually have to read the NOAA documentation which is
long and boring. But you are confusing the two. 11 digit number is the GHCNID, 12 digit is the ID PLUS the duplicate flag.
Duplicate mean, SAME ID, multiple scribal records for the same location. It arises because the same lcations often have TWO or more instruments.
So we average over the Duplicates. Same location.
WRT averaging Boston and new york. Yes I would average them. IF, you ask me what the average temperature
is for that AREA.
Today it is 50F in san francisco. Suppose its 70 in LA. If that is ALL the data I have for california and you ask me
Whats the average for california, I will say “my BEST estimate is 60″ If, you tell me that San Jose is 60F then I have
some Analytical choices to make.
A: 70+50+60=180/3 =60.
B: 70+(50+60)/2= 125; 125/2= 62.5
In the first case they are not adjusted for Area. the two close stations ( SJ and SF) are given equal weight
In the second case I average by AREA. averaging WITHIN a certain distance first, THEN averaging over the blocks.
Good question though and a good lead into our next step. The question is WHAT AREA can one average over? Which turns on the spatial auto corelation
in TREND. So You can test that by doing the averages over different sized areas
I completely understand imod vs. duplicate. There are two different records in 1948 for Denver 42572469000. I did not show you any data for Dillon 42572469001. Dillon has nothing to do with this discussion. The two records I showed you are for Denver with different duplicate numbers. So GHCN is saying that they are supposed to be from the same station – same imod. But in actuality, they are not.
here is the record you showed:
4257246900001948 -31 -30 -3 109 148 191 226 227 193 107 24 -13
4257246900011948 -14 -13 7 118 155 198 236 233 200 114 33 -1
Then you claimed “Because duplicate 0 is airport data and duplicate 1 is city data. They are totally different stations, even though they are both “in” Denver.” I guess I’m asking this:
1. do you have the documentation that shows this. In looking through all the GHCN documentation I see no metadata for duplicates.
2. if they ARE different stations within Denver, yes I WOULD average them. Always.
Its now 55F in SOMA ( south of market) across town in the Marina its 62F.
Whats the temperature in the tenderloin?
if you DONT average them, How do you get an average for denver? add them?
Good – now we are making progress.
1.) The documentation is not in GHCN – that is one of the problems – not information information to use the data correctly. You have to look at other data sources, like World Weather Records to detangle the GHCN mess.
2.) Denver city reported from 1873-1970. Denver Stapleton reported from 1940-2001. The reason you can’t average them is because they don’t have the same period of record. Simple example: every year from 1873-1970, the city temperature is 10 degrees. Every year from 1941-2001, the airport temperature is 8 degrees. Average them, you get 10 degrees for 1873-1940, 9 degrees for 1941-1970, and 8 degrees for 1971-2001. Now use 1941-1970 as your anomaly base period. Was the temperature in Denver really 1 degree above normal until 1940 and then 1 degree below normal from 1971-2000? No!
1. link to the WWR so I can cross check the records
2. In the case of stitched records
a: the sensitivity to stitching methods is nil. The reference station method shows this.
b. the number of stitched records is relatively low
c. monte carlo testing of the duplicate records shows the same results.
d. The impact to trend is nil
The bottomline is that it’s getting warmer. One can for example, show this by selecting only continuous records. When I’m done posting the codeyou
are welcomed to do your own testing of that. All the tools will be provided.
The bottom line i
What do you think your New York/Boston average would look like if the time series for New York was 1900-1950 and the time series for Boston was 1935-2010?
if I had a HIGH degree of stations like that I would use Nick Stokes or jeff Ids or Chad hermans(mcKitrick) Least squares approach. The least squares approach doesnt depend upon the overlap. My personal preference is Roman/Ids approach for Just this reason. However, since our answers are highly correlated the concern you raise is interesting, discussed before, and OBE. If I get time I may play with the stitching issue. If you have anymore questions then read up on the reference station method, have a look at Nicks code or Ids code. Its one of those things where I thought “it will make a huge difference” in the end, not important. The issues lie in metadata. The issues lie in adjustments. the issues lie in UHI. not in handling duplicates. but you have the data, go make a case of the global importance of it.