
CHCN: Canadian Historical Climate Network

A reader asked a question about data from Environment Canada. He wanted to know if that data could somehow be integrated into the RghcnV3 package. That turned out to be a bit more challenging than I expected. In short order I'd found a couple of other people who had done something similar. DrJ, of course, was in the house with his scraper. That scraper relies on another scraper found here. That let me know it was possible, and an email from Environment Canada let me know it was acceptable.

So, it's acceptable to write a scraper, and possible to write one. My goal was to do this in R. I really enjoy DrJ's code and his other work, but I had to try this on my own. The folks at Environment Canada suggested that scraping was the best option, as their SOAP and REST mechanisms weren't entirely supported. That was fine with me, since SOAP in R wasn't a good option; I won't go into the reasons. So my plan of attack was to leverage a small piece of DrJ's work and do the rest in R.

The key master file is created by one of DrJ's scrapers and can be downloaded here as a csv. At some point I will duplicate that in R, but for now I rely on this file. It is the master list of all stations, and it contains the two bits of information we need to scrape data: the station webId and the FIRST YEAR of monthly data. So there is a function to get that csv file, cleverly named downloadMaster(). The next step is to read that file and select only those stations that report monthly data: writeMonthlyStations(). The next step is to scrape the data. I looked at several ways of doing this. First I tried "on the fly" scraping: making a request and then parsing the response into a friendly R object. This was some nasty code, since the csv file is in an unfriendly format: metadata in the first 17 lines, and then 25 columns of climate data. It took some fiddling, but I was able to manage it by doing two reads on the connection: the first read to get just the metadata, and the second to skip the metadata and read the data (see the sketch below). That function worked if the server cooperated. Alas, the server had a habit of crapping out and dumping the connection. Doing error trapping in R with tryCatch() wasn't in the cards, though I suppose a future version will do that. So I opted for brute force: download every file. As it turns out, that is 7676 csv files. The good news is that with a downloaded csv file I can work at my leisure, and not have to debug while hoping the connection times out so I can test the code.
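For the curious, here is a minimal sketch of that two-read parse. The 17-line metadata block is as described above; the function name readEcCsv and the exact read.csv arguments are my illustration, not the package's actual code.

    # Sketch of the two-read parse: open one connection, read the metadata
    # block first, then let read.csv continue from where the connection sits.
    readEcCsv <- function(file) {
      con <- file(file, open = "r")
      on.exit(close(con))
      meta <- readLines(con, n = 17)                   # first read: 17 metadata lines
      data <- read.csv(con, stringsAsFactors = FALSE)  # second read: header + climate data
      list(metadata = meta, data = data)
    }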

Consequently, we have a function scrapeToCsv() which takes the list of stations and makes an http request for each one. That works pretty slick, and takes a long time. Of course, that process is also prone to server timeouts. When it balks, there is a function to clean up after it: getMissingScrape() looks at the files in your download directory, compares them with the list of stations you wanted to scrape, and figures out what is missing. Calling scrapeToCsv(get = getMissingScrape()) will restart the scrape and chug along. When the scrapes are finished you have to do one last check: getEmptyCsv(). There are times when the connection is made and the local file name is written, but no data is transmitted, so you get zero-sized files. No problem; we detect that and rescrape: scrapeToCsv(get = getEmptyCsv()). Clever folks can just write a while loop that exits once all files are present and non-empty, as sketched below.
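Something like this, assuming both helpers return a zero-length result once the download directory is complete (that return behavior is my assumption):

    # Keep rescraping until every station file is present and non-empty.
    repeat {
      missing <- getMissingScrape()
      if (length(missing) > 0) scrapeToCsv(get = missing)
      empty <- getEmptyCsv()
      if (length(empty) > 0) scrapeToCsv(get = empty)
      if (length(getMissingScrape()) == 0 && length(getEmptyCsv()) == 0) break
    }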

After downloading 7676 files you then create an inventory with metadata: createInventory(). This includes the station name, lat, lon, province, and various identifiers (WMO, etc.):

      "Id" "Lat" "Lon" "Altitude" "Name" "Province" "ClimateId" "WMO" "TCid"
  99111111 "49.91" "-99.95" "409.40" "BRANDON.A" "MANITOBA" "5010480" "71140" "YBR"
  99111112 "51.10" "-100.05" "304.50" "DAUPHIN.A" "MANITOBA" "5040680" "" "PDH"

Then you create a huge master datafile with createDataset(). This has all the data (temperatures, rain, etc.). Next you can extract just the mean temperature with asChcn(), which creates a GHCN-like data structure of temperature data.
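Putting it all together, the whole workflow looks roughly like this. The function names are the package's own, as described above; the arguments, return values, and assignments are my assumptions, so check the package documentation for the real signatures.

    library(CHCN)

    downloadMaster()                    # fetch DrJ's master station csv
    stations  <- writeMonthlyStations() # keep only stations with monthly data
    scrapeToCsv(stations)               # one http request per station: 7676 files
    inventory <- createInventory()      # station name, lat, lon, province, ids
    alldata   <- createDataset()        # everything: temperature, rain, etc.
    temps     <- asChcn()               # GHCN-like mean temperature structure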

Version 1.1 is done and is being tested. It should hit CRAN when some outside users report back.

  1. August 7, 2011 at 3:14 PM

    Well, Steve, I’m impressed by your tenacity, and I’m glad you’ve done it, but I’m not in a hurry to emulate it. I think even with the help from the library, I’ll wait for reports on how useful it is.

    • Steven Mosher
      August 7, 2011 at 10:20 PM

      Thanks Nick. The request came from a working scientist who is doing some Arctic research, and I thought it would be a one-day job, especially since DrJ had already done the heavy lifting. Ideally, if I get back to this, I will just have a script that updates things once a month. Downloading all 350MB and then processing it is a PITA. The data, however, now goes into RghcnV3 (with the bug fix to sortdata that I just mailed you). I should probably host on R-Forge so we can work together more effectively. Lemme investigate that. We need to divide up our work in an effective manner.

      Version 2.0 of RghcnV3 is in the final stages. I have incorporated all your code. I do need to test your solver (I didn't change it, I just made it pretty). I should probably run your code through formatR before touching it, so I touch it less.

  2. September 11, 2013 at 1:21 AM

    Hi Steve,
    The scraper does not seem to work any more since the renewed EC website:
    > scrapeToCsv(Stations)
    trying URL 'http://climate.weatheroffice.gc.ca/climateData/bulkdata_e.html?timeframe=3&Prov=XX&StationID=5231&Year=1973&Month=1&Day=1&format=csv'
    Error in download.file(stationurl, Destination, mode = "wb") :
    cannot open URL 'http://climate.weatheroffice.gc.ca/climateData/bulkdata_e.html?timeframe=3&Prov=XX&StationID=5231&Year=1973&Month=1&Day=1&format=csv'
    In addition: Warning message:
    In download.file(stationurl, Destination, mode = "wb") :
    cannot open: HTTP status was '500 Internal Server Error'

    Too bad, the functions would streamline the data acquisition process…

