
RGhcnV3 A new package

It’s been a long journey and there are some people to thank for helping me along the way: Steve McIntyre, Ron Broberg, Jeff Id, Ryan O’Donnell, RomanM, Nick Stokes, Robert Hijmans, Gabor Grothendieck, Hadley Wickham, David Winsemius, and countless others on the R Help list.

The Package is done.

Package: RghcnV3
Type: Package
Title: Global Historical Climate Network Version 3
Version: 1.0
Date: 2011-06-16
Author: Steven Mosher
Maintainer: Steven Mosher <moshersteven@gmail.com>
Depends: R (>= 2.13.0), R.utils, R.oo, R.methodsS3, zoo, raster, sp, rgdal
Suggests:
Description: The Rghcn package provides the core functions required to
  download the GHCN V3 data and process it into temperature anomalies.
  In addition, there are a few core functions required to download and
  create land masks.
License: GPL (>= 2)
URL: https://stevemosher.wordpress.com/
LazyLoad: yes
LazyData: no

Over the course of the last year or so I have been learning R and aiming at building a package for working with GHCN data. I have probably written close to 10 different versions of the package trying to get it down to something that was clear and clean. More importantly, I wanted to leverage the existing open source resources: the existing packages in R. Over the course of that time I’ve played around with S4 classes and R.oo for object oriented design. At some future date I’ll probably switch the design over to OOP, but for now, it’s plain old vanilla R.

Currently, I’m in the final testing of the package which should probably take a day or two and then I’ll be uploading it to CRAN. I may also decide to write a vignette for the package, but I haven’t decided on that yet. Maybe a blog post first.

The key to getting this package down to a few minimal calls is the leveraging of existing R packages. Let me take a minute to talk about them and how I use them. The first package is “zoo”, maintained by Gabor. When you get down to the bottom of GHCN data it is nothing more than a collection of time series, regular time series. That means regularly spaced, time-based data. For that type of data “zoo” is the correct package. If you know anything about GHCN data you know that the storage format is very dense. It basically goes like this:

ID YEAR JAN,FEB,MAR, etc

So that for a given station ID you have entries by year. And every year has monthly data.  The big problem?  Missing years.

4251234500  1900 12 12 13 14 15 15 16 15 13 13 12 14

4251234500  1901 12 12 13 14 15 15 16 15 13 13 12 14

4251234500  1908 12 12 13 14 15 15 16 15 13 13 12 14

4251234500  1909 12 12 13 14 15 15 16 15 13 13 12 14

Were it not for these missing years (1902-1907), it would be an easy matter to turn this N*12 array into a vector with every month represented. That is, if GHCN data had NAs in all 12 months of missing years, one could merely reshape the N*12 matrix into a long vector or long time series. However, that’s not the case. So one has to unfold that matrix and insert NA years into the spaces where they are required. It turns out that’s really easy in zoo, and no loops are required. What we end up with is a data frame or matrix type object where every column is a station and the rows contain the temperature data. That data structure, a data frame of zoo time series, can be manipulated by all sorts of zoo functions, which are targeted at time series analysis. So I can do “windowing”, filtering, and so on. Transforming my GHCN data into a zoo object then gets me huge leverage: I can use all the tools of time series analysis from zoo.
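To make the NA-filling step concrete, here is a minimal sketch of one way to do it in zoo, using the four sample years shown earlier. This is not the package's own code: the variable names (`years`, `monthly`, `z_full`) are made up for illustration, and the choice of a `yearmon` index plus the `merge(z, zoo(, index))` idiom is an assumption about how one might go about it.

```r
library(zoo)

# Toy GHCN-style input: one row per station-year, 12 monthly values per row.
# Years 1902-1907 are absent, as in the sample records above.
years   <- c(1900, 1901, 1908, 1909)
monthly <- matrix(rep(c(12, 12, 13, 14, 15, 15, 16, 15, 13, 13, 12, 14), 4),
                  nrow = 4, byrow = TRUE)

# Unfold the year x month matrix into one long vector in time order
# (transpose first, because R stores matrices column by column).
values <- as.vector(t(monthly))

# Build a monthly index for the years that are actually present.
idx <- as.yearmon(rep(years, each = 12) + (0:11) / 12)
z   <- zoo(values, order.by = idx)

# Merge with an empty series on the complete monthly index.
# Months with no record come back as NA, and no loop is needed.
full_idx <- as.yearmon(rep(1900:1909, each = 12) + (0:11) / 12)
z_full   <- merge(z, zoo(, full_idx))
```

Once the series is regular, the usual zoo tools apply. For example, `window(z_full, start = as.yearmon(1902), end = as.yearmon(1902 + 11/12))` pulls out the twelve months of 1902, which are all NA in this toy series.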

The next package is raster, maintained by Robert Hijmans. raster is a package devoted to spatial analysis. Think of a raster as a giant spatial grid. That’s what they are. Now extend that grid into the 3rd dimension (time) and you have an idea of the final data structure that temperature records go into. They are time series located in space.

The raster object for that is called a brick. So at a very high level of data abstraction the RGhcnV3 package consists of nothing more than mashing a 2 dimensional zoo object into a 3 dimensional raster object. The result is stunningly simple because once the spatio-temporal temperature records are in a raster brick (lat/lon/time), all our processing can happen through calls on raster objects. So the RGhcnV3 package will consist of a very limited number of routines to get data down from the internet, then on to your local disk, then into a zoo object, then into a raster brick. From that point on all your programming happens in the raster package.
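As a rough sketch of that zoo-to-brick step, assuming three made-up stations, a 5 degree global grid, and a plain per-cell average of the stations that fall in the same cell, one could write something like the following. None of this is taken from the package itself: the station coordinates, the temperatures, the grid resolution, and the averaging rule are all illustrative assumptions.

```r
library(zoo)
library(raster)

# Hypothetical input: 3 months x 3 stations, one zoo column per station.
months <- as.yearmon(2000 + (0:2) / 12)
temps  <- zoo(cbind(st1 = c(10, 11, 12),
                    st2 = c(12, NA, 14),
                    st3 = c(5, 6, 7)),
              order.by = months)

# Hypothetical station coordinates, in the same order as the columns.
lon <- c(-122.4, -121.1, 2.3)
lat <- c(37.8, 38.5, 48.9)

# A global 5 degree grid: 36 rows x 72 columns.
grid  <- raster(nrows = 36, ncols = 72,
                xmn = -180, xmx = 180, ymn = -90, ymx = 90)

# Which grid cell does each station fall in?
cells <- cellFromXY(grid, cbind(lon, lat))

# One row per grid cell, one column per time step.
vals <- matrix(NA_real_, nrow = ncell(grid), ncol = nrow(temps))
obs  <- t(coredata(temps))  # stations x time
for (cell in unique(cells)) {
  # Average the stations sharing this cell, ignoring missing months.
  vals[cell, ] <- colMeans(obs[cells == cell, , drop = FALSE], na.rm = TRUE)
}

# Pour the cell x time matrix into a brick with one layer per month.
b <- setValues(brick(grid, nl = nrow(temps), values = FALSE), vals)
```

In this toy case the first two stations land in the same cell and are averaged, while every cell without a station stays NA. From here on the work is all raster calls, for instance `cellStats(b, mean)` for a simple unweighted mean of each layer.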

As I sit here I know there are a couple little fiddles I would like to do to make the process even slicker, but I’m going to resist that urge for now.

So a couple of days of testing and then CRAN. The package builds and passes CRAN checks. The manuals are done. I’ll post more in a couple of days.

  1. June 21, 2011 at 1:10 PM

    This may be overkill, but I was wondering if we could get all this work into a VMWare appliance for easy distribution.

  2. Steven Mosher
    June 21, 2011 at 1:22 PM

    Basically, I’ve built a Windows package, which is typically the hardest package to build. It’s a native R package with no C, Fortran or C++ (yet), so it should just run on Mac OS and Linux. I’ll do a Mac build in a couple of days, and if it builds on the Mac (which it should, 100%) then it will build on Linux. My Linux partition, however, is toast on my Mac… since it’s pure R it should just work.. famous last words..

    There is some nastiness with some of the system functions ( various versions of tar ) that may cause hiccups, but that’s not that hard to fix

    I haven’t worked with VMWare for years…

    Before I do that I have a really really really cool idea for blogs. I mean really cool.. like super geeky cool..
    like holy effin batman geeky cool. And no, I won’t tell you.

  3. June 21, 2011 at 2:51 PM

    Steven,
    Congratulations – I’ll look forward to downloading it…

    and especially reading the manuals 🙂

    • steven mosher
      June 21, 2011 at 9:55 PM

      The manuals are a PITA. It’s a LaTeX-type markup language and I haven’t spent enough time working with it to write freely.

      In any case, Nick, when you are ready I will help you turn your work into a package.

  4. June 21, 2011 at 6:34 PM

    Hi,
    Interesting proposition 🙂
    Have you considered VirtualBox http://www.virtualbox.org – I noticed Sage Math http://www.sagemath.org has started using it in its latest release instead of VMWare!

    Anthony

    • Steven Mosher
      June 21, 2011 at 11:04 PM

      That’s really not required. For R we build packages for the 3 main OS, linux, windows and Mac.

      It’s all open source, so if the Windows package is missing I just get the source and compile locally.
      I suppose guys could run in a virtualized environment.

  5. sbmalev
    June 22, 2011 at 5:19 AM

    Looking forward to this!!!

    • steven mosher
      June 22, 2011 at 8:03 AM

      Cool. Well, the last tests have passed. I should really have played around with test-driven development,
      because writing tests after the fact is rather stupid. I’ve tested almost all the code, but have not tested
      failure modes. That is, as an API this package should fail gracefully with good explanations, like
      if you pass bad parameters.

      Anyway. I looked at your blog.. zoo is a great package, have a talk with Gabor on the R help list.

      I have two things left to do.

      1. Clean up some warnings in the manual. PITA PITA PITA.. these are stupid warnings and I cannot
      submit to CRAN with warnings. Well, you can, but you have to explain why the warnings don’t matter.

      2. Finish the demo code. That’s almost done..

      Then I’ll submit and start to work on 1.1. I’ve got a ton of code to add back into the core
      package but I want to keep it lean..
