
MoshTemp 102

In the last installment we started by downloading the v2.mean.Z file, unzipping it, and processing the raw data into an R object. The code for that is Moshtemp101.R, located here. If that code is executed we end up with an R object containing all the data of v2.mean: Data <- readV2Mean() gets us the Data object. At your console, execute the following command on Data:

> str(Data)
'data.frame': 597373 obs. of  15 variables:
$ Id       : num  1.02e+10 1.02e+10 1.02e+10 1.02e+10 1.02e+10 …
$ Duplicate: int  0 0 0 0 0 0 0 0 0 0 …
$ Year     : int  1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 …
$ Jan      : int  NA 100 109 117 129 116 101 112 122 115 …
$ Feb      : int  NA 112 123 114 110 104 128 101 118 109 …
$ Mar      : int  NA 129 130 134 122 105 141 106 137 119 …
$ Apr      : int  NA 143 160 149 138 155 NA 136 134 136 …
$ May      : int  NA 181 181 182 165 183 160 183 181 167 …
$ Jun      : int  NA 193 207 194 211 202 204 216 216 192 …
$ Jul      : int  NA 239 247 223 232 238 220 242 NA 240 …
$ Aug      : int  NA 255 241 241 249 271 228 246 240 247 …
$ Sep      : int  NA 225 223 220 233 226 210 235 223 235 …
$ Oct      : int  NA 201 191 185 NA 186 199 187 170 NA …
$ Nov      : int  133 166 162 161 148 133 154 140 141 139 …
$ Dec      : int  110 107 128 108 114 112 119 121 NA 118 …

The result, shown above, tells us that we have read the file in as a "data.frame", one of the workhorse data structures in R. The columns of Data all have names, and we can access the data in many ways. For example, we can look at the span of years in the data:

> max(Data$Year)
[1] 2010
> min(Data$Year)
[1] 1701

We can also access the data using a column number. The Duplicate variable is in the second position, and we can summarize it by doing the following:

> summary(Data[ ,2])
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

0.0000  0.0000  0.0000  0.5781  1.0000  9.0000

This tells us that some station Id’s come with many duplicate records, as many as 9, while most have no duplicates. Visually we can see this by plotting a histogram. I’ll spare you the graphics.


And we can see that there are a fair number of temperature values as well as missing values. First, we will count the temperature values. The function is.na() returns TRUE for every value that is NA (missing), and each TRUE counts as 1 when summed. !is.na(Data) gets us the opposite: a data structure with TRUE for every place where there is a temperature.


> sum(!is.na(Data))

[1] 8579259

>  sum(is.na(Data))

[1] 381336
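The counting logic is easy to check on a toy data frame (made-up values, not the GHCN data):

```r
# Toy frame: 9 cells, 2 of them missing.
d <- data.frame(Year = c(1966, 1967, 1968),
                Jan  = c(NA, 100, 109),
                Feb  = c(NA, 112, 123))

sum(!is.na(d))   # TRUEs count as 1, so this counts the present values: 7
sum(is.na(d))    # and this counts the missing cells: 2
```

The two counts always add up to rows times columns, which is why 8579259 + 381336 = 597373 × 15 for the full Data object.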

And we can take a mean of all temperatures by month if we like; we just apply the mean over the columns we want. We have to remove NAs or the mean will be NA.

> colMeans(Data[ ,4:15], na.rm=TRUE)
Jan       Feb       Mar       Apr       May       Jun       Jul       Aug       Sep       Oct       Nov       Dec

26.13510  39.13747  73.96073 118.90887 158.05500 189.59006 208.01058 203.03812 174.36174 131.29355  80.00377  39.81402

That's enough basics for now. Let's proceed to the problem of combining duplicates. First, an example of what a duplicate looks like. In the example below the duplicate is almost just the same values (spot the difference in Oct 1964), but that is not always the case.

> Data[74:75,1:15]

Id Duplicate Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

74 10160360000         0 1963  NA 104 121 149 160 210 250 260 226 172 167 139

75 10160360000         0 1964 104 122 141 148 189 223 243 250 231 184 152 111

> Data[102:103,1:15]

Id Duplicate Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

102 10160360000         1 1963  NA 104 121 149 160 210 250 260 226 172 167 139

103 10160360000         1 1964 104 122 141 148 189 223 243 250 231 214 152 111

It doesn't take long to find one. Let's start by pulling up all those Ids that have 9 duplicates:

> Data$Id[Data$Duplicate==9]
[1] 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000 22230710000……

I just pick one.

> Data[Data$Id==22230710000,]

If you scan that display you'll spot a pair that differ. Pull those row numbers:

> Data[151606,]

Id Duplicate Year  Jan  Feb  Mar Apr May Jun Jul Aug Sep Oct Nov  Dec

151606 22230710000         6 1971 -168 -205 -103  25  99 153 164 153  78  26 -48 -182

> Data[151626,]

Id Duplicate Year  Jan  Feb  Mar Apr May Jun Jul Aug Sep Oct Nov  Dec

151626 22230710000         7 1971 -168 -205 -104  25 100 153 164 153  78  25 -48 -182

As you can see, duplicate records 6 and 7 have the same values for most months, but in certain months they differ, in this case by 1/10 of a degree C. I spent considerable time comparing duplicate records. It's a lot of grunt work. In the end, you have to decide how to process these records. There are several analytical choices, and for my work I decided to average the duplicates. That entails finding all the duplicate records within an Id that have overlapping years and then averaging them by month. Handling this in a looping structure is fairly straightforward. Also, since most of the stations don't have duplicates, it might be attractive to segregate the stations into those that have duplicates and those that don't.

I've coded up several versions of this process using various built-in functions of R. In the end, I settled on a solution offered up by the R help list. This approach just computes the mean even if there are no duplicates. Since averaging duplicates only needs to happen when you refresh the base data, I decided that an elegant one-line wonder was preferable to more complicated code. The averaging code:


data <- aggregate(v2mean[c('Jan','Feb','Mar','Apr','May','Jun',
                           'Jul','Aug','Sep','Oct','Nov','Dec')],
                  v2mean[c('Id','Year')], mean, na.rm=TRUE)




That's it! The aggregate function works on a data structure, in this case those columns of the incoming data that are labeled with month names. Aggregate needs to be fed a set of indices; here we feed it an index for Id and an index for Year: v2mean[c('Id','Year')]. The function mean is then applied to each of the columns, NAs are removed, and the result is output.

The last couple of details. I've included a line that looks like it makes no sense, data[is.na(data)] <- NA, which appears to just assign an NA to an element that is already NA! In reality, when we aggregate over the data frame there will be cases where we are averaging nothing but NAs. When we call mean(..., na.rm=TRUE) on a line that has only NAs, the result is NaN. In previous versions I had written code to detect this and apply the mean() function conditionally, that is, only if the values were not all NA. Since is.na(NaN) tests TRUE, I just chose to replace the NaN with NA. Finally, I do a quick sort on the Ids, just to make sure that everything has remained in order. This isn't strictly necessary.
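To see the NaN detail in action, here is a minimal sketch on made-up data: two duplicate records for one station-year, with January missing in both.

```r
# Two duplicate records for the same Id and Year (toy values, in tenths of a degree C).
v2mean <- data.frame(
  Id        = c(10160360000, 10160360000),
  Duplicate = c(0, 1),
  Year      = c(1963, 1963),
  Jan       = c(NA, NA),    # missing in both duplicates
  Feb       = c(104, 106))  # duplicates differ; the average is 105

data <- aggregate(v2mean[c('Jan','Feb')], v2mean[c('Id','Year')],
                  mean, na.rm=TRUE)

# mean(c(NA, NA), na.rm=TRUE) is NaN, and is.na(NaN) is TRUE,
# so this one line folds the NaNs back to plain NA.
data[is.na(data)] <- NA
```

After this, data has a single row per Id-Year, with Feb averaged to 105 and Jan back to NA.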

At the end of Moshtemp101 we had read in the raw v2.mean and created an R object, Data. Now we average all the duplicates in that file:

Data <- aggregate(Data[c('Jan','Feb','Mar','Apr','May','Jun',
                         'Jul','Aug','Sep','Oct','Nov','Dec')],
                  Data[c('Id','Year')], mean, na.rm=TRUE)
Or if we want to do it all in one line, we just do this:


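One way to get a one-liner is to wrap the aggregate() call in a helper; averageDuplicates() is a hypothetical name for illustration, not a function from the posted code:

```r
months <- c('Jan','Feb','Mar','Apr','May','Jun',
            'Jul','Aug','Sep','Oct','Nov','Dec')

# Hypothetical helper: average duplicate records by Id and Year,
# restore NA for all-NA months, and keep the rows sorted by Id.
averageDuplicates <- function(v2mean) {
  data <- aggregate(v2mean[months], v2mean[c('Id','Year')],
                    mean, na.rm=TRUE)
  data[is.na(data)] <- NA
  data[order(data$Id, data$Year), ]
}

# The whole refresh then fits on one line:
# Data <- averageDuplicates(readV2Mean())
```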
Next up: selecting subsets of the data, like spans of years, and writing out data objects. Since we won't be processing duplicates very often, we will save the results of these processes as R objects. More on that in Moshtemp103. And yes, I hate my color scheme already. Find MoshTemp102 at the drop.

  1. August 17, 2010 at 3:15 PM | #1

    Yes, the aggregate() function is very neat. That’s a very helpful mailing list.

    Your color scheme is very arty, and the blog looks great, but the blue on black, especially, is very hard to read.

    • Steven Mosher
      August 17, 2010 at 8:26 PM | #2

      Thanks Nick,

      I don't know what came over me when I picked the template. Artistry more than tech has always been the dilemma.
      Aggregate is very slick although I’ve had some issues with it of late when using it to sum months into years.

      I’ll see what I can do about the blue on black

  2. JR
    August 19, 2010 at 5:18 AM | #3

    Combining duplicates when the monthly values differ by 1/10 degree C seems safe because those differences are due to rounding during the conversion from degrees F to degrees C. However, here’s an example of when NOT to combine duplicates:

    4257246900001948  -31  -30   -3  109  148  191  226  227  193  107   24  -13
    4257246900011948  -14  -13    7  118  155  198  236  233  200  114   33   -1

    Know why the numbers are so different? Because duplicate 0 is airport data and duplicate 1 is city data. They are totally different stations, even though they are both “in” Denver.

    • Steven Mosher
      August 19, 2010 at 12:24 PM | #4

      Somewhere in my stack of stuff I have the total statistics on all duplicates: the maximum seen difference, the median, the average, etc.

      Duplicates are the same location. Location is given by the ghcnid 42572469000. The duplicate flag means that NOAA has two official reports for the same station at the same location. The 0 or 1 does not refer to a different location. Regardless, if they were different locations you would still average them.

      • JR
        August 19, 2010 at 5:28 PM | #5

        Duplicates are SUPPOSED to be the same location. In the case of Denver (and others), they are not. Would you average New York and Boston? They are different locations.

      • Steven Mosher
        August 19, 2010 at 8:44 PM | #6

        JR. You are confusing IMOD with duplicate:

        4257246900001948 -31 -30 -3 109 148 191 226 227 193 107 24 -13
        4257246900011948 -14 -13 7 118 155 198 236 233 200 114 33 -1

        42572469000 = GHCNID: 425 = country, 72469 = WMO, 000 = IMOD
        425724690000 = GHCNID + duplicate: 425 = country, 72469 = WMO, 000 = IMOD, 0 = Duplicate

        DONT confuse that with IMOD differences

        42572469000,DENVER STAPLE,39.78,-104.87,1626,1637,U,1980,FL,NA,no,NA,TRUE,1,COOL IRRIGATED,C
        42572469001,DILLON 1E,39.63,-106.03,2763,2937,R,NA,MV,NA,LA,NA,FALSE,NA,COOL CONIFER,C

        Other people have made the same mistake.

        4257246900001948 -31 -30 -3 109 148 191 226 227 193 107 24 -13
        4257246900011948 -14 -13 7 118 155 198 236 233 200 114 33 -1

        That is the same GHCNID. BOTH DENVER. two records for Denver, record number 0, and record number 1.
        When NOAA built the GHCN they went to get records. In this case they got TWO records from the same location.

        Now, Location,

        42572469001 is different. That is Dillon.

        Some methods, like Hansen's, will COMBINE at the IMOD level using a reference station method. I don't.

        The difference between IMOD and duplicate is tough because you actually have to read the NOAA documentation, which is long and boring. But you are confusing the two. The 11 digit number is the GHCNID; the 12 digit number is the ID PLUS the duplicate flag.
        Duplicate means SAME ID: multiple scribal records for the same location. It arises because the same locations often have TWO or more instruments.
        So we average over the duplicates. Same location.

      • Steven Mosher
        August 19, 2010 at 8:58 PM | #7

        WRT averaging Boston and new york. Yes I would average them. IF, you ask me what the average temperature
        is for that AREA.

        Today it is 50F in San Francisco. Suppose it's 70 in LA. If that is ALL the data I have for California and you ask me what's the average for California, I will say "my BEST estimate is 60". If you tell me that San Jose is 60F then I have some analytical choices to make.

        A: (70+50+60)/3 = 60
        B: (50+60)/2 = 55; (70+55)/2 = 62.5

        In the first case they are not adjusted for area; the two close stations (SJ and SF) are given equal weight.
        In the second case I average by AREA: averaging WITHIN a certain distance first, THEN averaging over the blocks.

        Good question though, and a good lead into our next step. The question is WHAT AREA can one average over? Which turns on the spatial autocorrelation
        in TREND. So you can test that by doing the averages over different sized areas.
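The two choices, written out with the toy numbers above (a sketch):

```r
sf <- 50; la <- 70; sj <- 60   # toy temperatures from the comment, degrees F

# A: simple average, every station gets equal weight
mean(c(la, sf, sj))             # 60

# B: average the two nearby stations (SF and SJ) first,
#    then average the two area blocks
mean(c(la, mean(c(sf, sj))))    # 62.5
```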

  3. JR
    August 19, 2010 at 8:53 PM | #8

    I completely understand imod vs. duplicate. There are two different records in 1948 for Denver 42572469000. I did not show you any data for Dillon 42572469001. Dillon has nothing to do with this discussion. The two records I showed you are for Denver with different duplicate numbers. So GHCN is saying that they are supposed to be from the same station – same imod. But in actuality, they are not.

    • Steven Mosher
      August 19, 2010 at 10:52 PM | #9

      here is the record you showed:
      4257246900001948 -31 -30 -3 109 148 191 226 227 193 107 24 -13
      4257246900011948 -14 -13 7 118 155 198 236 233 200 114 33 -1

      Then you claimed “Because duplicate 0 is airport data and duplicate 1 is city data. They are totally different stations, even though they are both “in” Denver.” I guess I’m asking this:

      1. Do you have the documentation that shows this? In looking through all the GHCN documentation I see no metadata for duplicates.
      2. if they ARE different stations within Denver, yes I WOULD average them. Always.

      It's now 55F in SOMA (south of Market); across town in the Marina it's 62F.
      What's the temperature in the Tenderloin?

      If you DONT average them, how do you get an average for Denver? Add them?

      • JR
        August 19, 2010 at 11:20 PM | #10

        Good – now we are making progress.

        1.) The documentation is not in GHCN – that is one of the problems – not enough information to use the data correctly. You have to look at other data sources, like World Weather Records, to detangle the GHCN mess.

        2.) Denver city reported from 1873-1970. Denver Stapleton reported from 1940-2001. The reason you can’t average them is because they don’t have the same period of record. Simple example: every year from 1873-1970, the city temperature is 10 degrees. Every year from 1941-2001, the airport temperature is 8 degrees. Average them, you get 10 degrees for 1873-1940, 9 degrees for 1941-1970, and 8 degrees for 1971-2001. Now use 1941-1970 as your anomaly base period. Was the temperature in Denver really 1 degree above normal until 1940 and then 1 degree below normal from 1971-2000? No!
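JR's toy example can be checked directly (a sketch using the numbers in the comment above):

```r
# City reports 1873-1970 at a constant 10 degrees; airport reports
# 1941-2001 at a constant 8 degrees (toy numbers from the comment).
years   <- 1873:2001
city    <- ifelse(years <= 1970, 10, NA)
airport <- ifelse(years >= 1941, 8, NA)

blended <- rowMeans(cbind(city, airport), na.rm = TRUE)
base    <- mean(blended[years >= 1941 & years <= 1970])  # the 1941-1970 normal: 9

anom <- blended - base
anom[years == 1900]   # +1: a spurious warm anomaly before the airport record starts
anom[years == 1990]   # -1: a spurious cool anomaly after the city record ends
```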

      • Steven Mosher
        August 20, 2010 at 12:23 AM | #11

        1. link to the WWR so I can cross check the records

        2. In the case of stitched records

        a: the sensitivity to stitching methods is nil. The reference station method shows this.
        b. the number of stitched records is relatively low
        c. monte carlo testing of the duplicate records shows the same results.
        d. The impact to trend is nil

        The bottom line is that it's getting warmer. One can, for example, show this by selecting only continuous records. When I'm done posting the code you
        are welcome to do your own testing of that. All the tools will be provided.

  4. JR
    August 19, 2010 at 9:05 PM | #12

    What do you think your New York/Boston average would look like if the time series for New York was 1900-1950 and the time series for Boston was 1935-2010?

    • Steven Mosher
      August 20, 2010 at 4:41 AM | #13

      If I had a HIGH number of stations like that I would use Nick Stokes', Jeff Id's, or Chad Herman's (McKitrick) least squares approach. The least squares approach doesn't depend upon the overlap. My personal preference is Roman/Id's approach for just this reason. However, since our answers are highly correlated, the concern you raise is interesting, discussed before, and OBE. If I get time I may play with the stitching issue. If you have any more questions then read up on the reference station method, have a look at Nick's code or Id's code. It's one of those things where I thought "it will make a huge difference"; in the end, not important. The issues lie in metadata. The issues lie in adjustments. The issues lie in UHI. Not in handling duplicates. But you have the data, go make a case for the global importance of it.

