In the last installation we added a window function to our list of utility functions, windowYears() and we passed that function a v2mean dataframe and a start year and end year. The code for moshtemp103 is here. Before we move onto the next step, I’ll draw you attention to a simple change I’m going to make. I defined the range of years like so. yearRange<-c(1900,2009). That’s not very literate. It can lead to errors and it doesn’t document itself. yearRange$Start is far more literate than yearRange. So I’ll redefine that going forward as yearRange<-list(Start=1900,End=2009). Ideally, when doing an OOP design then I would define a class of objects defined as a “yearRange” upon Construction, then we can check for things like yearRange$End being greater than yearRange$Start. And we can enforce rules about datatypes.
To date we have loaded the v2mean file, averaged the duplicates, and windowed the years. You realize that you can now window the years before combining the duplicates! That is, if you are going to work exclusively with data from 1900 to present, you can just do this:
Simply, read v2mean into the year windowing function, dump the years before 1900 and the average the duplicates.
Now, we are ready for the last step before we have useable data. First, a couple notes. As we look at the v2mean data.frame we can note that column 1 has Id’s and column two has Years. The rest of the columns, 3:14, contain temperature data. That’s ugly. It’s ugly because anytime I want to work on just the temperature data I have to index the right columns. And the temperature data is timeline which suggests a single dimension for all time. In the next steps we will get rid of that ugliness. In the end, we will have a data structure where the row names and column names get used for these variables.
Before that we have some more processing to do. In the common anomaly method of temperature construction we have to do the following. First, we pick a period called the baseline period. It’s typically 30 years. For CRU, that period is 1961-90. For GISS its 1951-1980. These periods are selected because they have the highest count of station reports. A while back I analyzed the issue in a bit of mania and figured out the optimal window was actually 1953-1982. During that period you get the maximum of stations reporting. However, if you only looked at the southern hemisphere, then 1961-1990 was better. Weird. So we will select 1961-1990 as our baseline period. A quick look at the data before we progress gives us this:
I’ll spare you the output, but you’ll see that the years around 1950-1990 have the highest counts.
Another thing. When we window the data and when we select a base period we will lose certain stations. We window out any stations that only report before 1900 and we will delete stations that do not report in our baseline period, 1961-1990. If you’re interested you can just check how this winnowing process progresses by looking at this value length(unique(v2Mean$Id)) throughout the process. The last thing to note before we get to the code is this. We have two more criteria for selecting stations” Full years in the period and a rule for what “counts” as a full year. our Common Anomaly Method is thus described by the following
That is what I selected. You can change that. Won’t make much of a difference. What that criteria says is this. Stations must have at least 15 years of data in the window 1961-1990, where “Full year” means a year with ALL 12 months reporting. CRU for example uses this criteria
camCriteria<-list(Start=1961,End=1990, Years=20,Threshold=10). That is, they keep a station if it has 20 years, where at least 10 of the months for those years report. You could also spend time fiddling with these numbers and making more complex rules. The end is a selection of stations that meet your criteria.
The code. The selection of stations is a true/false test. A station is in or out. It’s either in the baseperiod and satisfies the criteria or not. So, we are doing a logic test. Think of it like the function “is.na” . On to the function:
mask<-which(v2mean$Id %in% as.numeric(ids))
I’ve named the function “in.Base” logically what we would want to do is create a class of object called v2mean. It would have 14 columns, the first being of type id, the second of type year, and the rest of type temperature. Base would also be a sub class of that an we would overload the operator %in% to function accordingly, and so V2mean<-V2mean %in% Base would get us the subset of v2mean that satisfies the matching criteria, a metaphoric extension of the %in% function to some degree.
Ok. what’s that code do. Well, the first thing we do is extract the baseline window of data. For all the stations we select the period 1961-1990. Next, we have to count the months in every year. Recall that our data is still in 14 columns where column 3 through 14 is jan through dec. We apply an NA mask making the temperature value TRUE or FALSE. we sum by rows and now the resulting data would look like this:
Ghcnid Year count
x 1961 12
So for every station and every year we have a count that represents the number of months reporting for that year, 0-12. We want to apply our threshold next. Our threshold is 12. If you dont have 12 months of data, toss that year. Again, you could make this 11 or 10 or 6. In the end however, you will be average all the Januaries in this period with all the other Januaries, and lowering the threshold could have some deleterious side effects.
Next, for each station we have to SUM the number years that have passed the threshold test. the function “rowsum” does this and thanks again to the R help list. So more house keeping with datastructures and then we select those stations that pass the Year criteria:
Pretty slick. We have a list of those Ids, that meet our criteria. 15 (12 month)years of data in the window. Then we create and return a MASK to apply to our dataset. Remember, each station ID is reported multiple times in the data. So we ask Which Ids in the incoming data (v2mean) match those Ids in the base period. that vector of true/ false is returned to the calling function.
gets us a version of the v2data that we need.
Clear your workspace of objects, Run the moshtemp104 code and then check the number of Ids
So we started this process with 7280 stations in the data. We windowed the years, and we selected only those stations with 15 years in the CAM period.
Code at the drop: moshtemp104. Remember you need a copy of the “v2.mean” raw data for this to work. See moshtemp101 for questions about that. next step. ZOO, a very cool package in Cran. And we finally get to a datastructure where the dimensions ( row and column) have meaning explicitly represented.