
BerkeleyEarth Version 1.6

Version 1.6 has just been submitted to CRAN, and I will post the source here in the dropbox. Version 1.6 will probably be stable for a long time unless I find a bug. The main changes are a few new functions that make some things easier, and the conversion of two files, flags and sources, to file-backed big matrices. As a result, the read() routines for those files take different input parameters. The flags.txt and sources.txt files are so large that they cause even my 4GB Windows system to choke and slow down. The read functions now create a *.bin file the first time they are called; after that first call, access is immediate. In addition, I added a function that creates *.bin files for all the versions of data.txt you may have on your system. makeBinFiles() is called automagically after you download all the files using downloadBerkeley(), or you can call it separately from your main working directory. Converting the files takes about 10 minutes per file, but it is worth the time to do it once. After that, the file is attached instantly.
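To illustrate the convert-once, reuse-afterwards idea, here is a minimal sketch in base R. It is not the package's implementation: read_cached() is a hypothetical helper, and saveRDS()/readRDS() stand in for the file-backed matrix that the package actually creates. The tiny demo file is synthetic.

```r
# Hypothetical helper: parse a large text table once, cache it as a binary
# file, and reuse the cache on later calls. saveRDS()/readRDS() are a
# stand-in for the file-backed matrix the package builds.
read_cached <- function(txt_path, bin_path = sub("\\.txt$", ".rds", txt_path)) {
  if (!file.exists(bin_path)) {
    # Slow path, taken only on the first call: parse the text file.
    m <- as.matrix(read.table(txt_path, header = TRUE))
    saveRDS(m, bin_path)
  }
  # Fast path on every call: load the binary cache.
  readRDS(bin_path)
}

# Tiny synthetic file in place of a real flags.txt or sources.txt.
txt <- file.path(tempdir(), "demo.txt")
write.table(data.frame(Id = c(1, 1, 2), Flag = c(0, 3, 0)), txt, row.names = FALSE)

first  <- read_cached(txt)  # parses the text and writes demo.rds
second <- read_cached(txt)  # skips parsing and reads demo.rds directly
```

The cost of parsing is paid once per file; every later call only pays for reading the binary copy.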

With 1.6 installed, let’s do a short example to illustrate something about the Berkeley datasets. I’ve installed version 1.6 of the package and created a working directory called BestDownloadTest. I make that my working directory and run downloadBerkeley(). After that function completes I have a workspace that looks like this:


I’ll run getFileInformation() on the TAVG directory. That function creates separate readmes for all the files and collects some information. Next, we can select a directory to work with. On Windows I just use choose.dir() and point at a directory like the “Single-Value” folder. I can do this like so: Data <- readBerkeleyData(Directory=choose.dir()), and that will instantly attach “data.bin” to the variable Data.

Then we can get some simple statistics on the file. length(unique(Data[,"Id"])) gives us the number of unique station Ids: 36853. The dates range from min(Data[,"Date"]), which is 1701.042, to max(Data[,"Date"]), which is 2011.875. And then I can do histograms of the dates and the temperatures.
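The minimum and maximum dates above look like decimal years: 1701.042 and 2011.875 are consistent with year + (month - 0.5)/12, that is, mid-January 1701 and mid-November 2011. The sketch below assumes that encoding and uses a small synthetic matrix in place of the real one. The "Temperature" column name is an assumption; only "Id" and "Date" appear in the text.

```r
# Synthetic stand-in for the attached matrix; the real one comes from
# readBerkeleyData(). The "Temperature" column name is an assumption.
Data <- cbind(Id          = c(1, 1, 2, 2, 3),
              Date        = c(1701 + 0.5 / 12, 1950 + 6.5 / 12, 1950 + 6.5 / 12,
                              2011 + 10.5 / 12, 2011 + 10.5 / 12),
              Temperature = c(-3.2, 18.4, 21.0, 4.1, 6.3))

# Same summary calls as in the text.
n_stations <- length(unique(Data[, "Id"]))
date_range <- range(Data[, "Date"])

# Split decimal years into year and month, assuming mid-month encoding.
yr  <- floor(Data[, "Date"])
mon <- floor((Data[, "Date"] - yr) * 12) + 1
```

With the month recovered this way, the records can be grouped by calendar month for the kind of monthly comparison discussed in the comments.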




  1. DocMartyn
    March 6, 2012 at 6:09 AM

    Steve, I know you have all sorts of stuff going on, but I have found something very odd with GISS.
    I look at which month per year had the highest or lowest temperature anomaly, then plotted the distribution.
    To make sure I was not insane I then examined the 1881-1946 and the 1946-2011 distributions.
    Again, very odd.

    Could you have a quick shuftie at the monthly distribution in BEST? Trust me, it is worth doing.

    • Steven Mosher
      March 6, 2012 at 6:22 AM

      Sure. I would expect Dec or Jan to be the wackiest; missing data is not normally distributed. That’s my guess going in. Plus winter has higher variance, adjusted for hemisphere.

  2. DocMartyn
    March 6, 2012 at 5:57 PM

    I expected a normal distribution for both max and min, centered around DJF.
    July in the GISS is very odd indeed.

    • steven mosher
      March 6, 2012 at 10:57 PM

      Yes, that is what I would expect: (1) from the variance in the months and (2) because they
      are generally sampled less.

      Now, if July is weird, that gets interesting

    • steven mosher
      March 7, 2012 at 5:19 AM


      I’m not seeing anything too odd.

      I collected all Jan, Dec and Jul. I calculated a mean Jan, mean Dec, mean July and June,
      created anomalies, and here is what you see. These are anomalies: every monthly value minus the mean of all values for that month (hence the zero for the mean).


      Min. 1st Qu. Median Mean 3rd Qu. Max.
      -60.7600 -5.0640 -0.1711 0.0000 5.4010 24.8000
      1.760 -11.140 -6.215 11.680
      > summary(Jul)
      Min. 1st Qu. Median Mean 3rd Qu. Max.
      -70.7500 -4.8510 -0.6219 0.0000 4.9210 22.8100
      > summary(Jun)
      Min. 1st Qu. Median Mean 3rd Qu. Max.
      -70.8400 -5.0170 -0.6457 0.0000 5.0670 23.2400
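The anomaly step described in this comment can be sketched in base R as follows. The data are synthetic, and the variable names are illustrative rather than taken from the comment's actual script. Each value has the mean of its own calendar month subtracted, which is why every per-month summary shows a mean of zero.

```r
set.seed(1)

# Synthetic monthly temperatures: 50 years with a seasonal cycle plus noise.
mon  <- rep(1:12, times = 50)
temp <- 10 - 12 * cos(2 * pi * (mon - 1) / 12) + rnorm(length(mon), sd = 3)

# ave() returns, for each value, the mean of its own calendar month,
# so the difference is the anomaly relative to that month's mean.
anom <- temp - ave(temp, mon)

Jan <- anom[mon == 1]
Jul <- anom[mon == 7]
summary(Jul)  # the Mean entry is zero up to floating-point error
```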

  3. DocMartyn
    March 7, 2012 at 7:15 PM

    I will post the plots of GISS this evening. I wanted to see which month, each year, produced the lowest or highest temperature anomaly.
    The early, pre-1946, years have a huge highest temp anomaly centered in July. It is really odd.

  4. DocMartyn
    March 8, 2012 at 3:55 AM

    Here you go Steve; like I say odd.

    The frequency for July having the max anomaly pre 1946 was very high. There is no Jun/Jul or Jul/Aug coupling. Odd.

    • Steven Mosher
      March 8, 2012 at 4:05 AM


      Ok I see what you did. You looked at the month with the greatest anomaly by year.

      That’s a bit different than what I did. Anyways.

      1881 – 1946 is 65 Years:

      So about 18% of those, ~10 or so, had a really hot July.

      Constrain your sample to 1911 to 1946 and see what you get.

      Or just print out the years in which July had the highest anomaly. You can probably guess which years it will be.
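The "month with the greatest anomaly by year" tabulation discussed in this thread can be sketched like this. The anomalies are random synthetic numbers, not GISS or BEST data, so the resulting counts carry no meaning; the layout (one row per year, one column per month) is an assumption made for the example.

```r
set.seed(42)
years <- 1881:1946

# Synthetic anomalies: one row per year, one column per month.
anom <- matrix(rnorm(length(years) * 12), nrow = length(years),
               dimnames = list(years, month.abb))

# For each year, find the month holding the largest anomaly.
max_month <- factor(month.abb[apply(anom, 1, which.max)], levels = month.abb)

# Distribution of the max-anomaly month across all years.
counts <- table(max_month)

# The years in which July had the highest anomaly.
july_years <- years[max_month == "Jul"]
```

Restricting years to a sub-period such as 1911:1946 and rerunning gives the constrained sample suggested above; replacing which.max with which.min gives the lowest-anomaly version.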

  5. DocMartyn
    March 8, 2012 at 6:54 AM


    But no Junes or Aug’s in the 30’s
