Home > Uncategorized > Berkeley Earth Surface Temperature: V1.5

Berkeley Earth Surface Temperature: V1.5

My R package designed to import all of the Berkeley Earth Surface temperature data is officially on CRAN, as BerkeleyEarth.  The version there is 1.3 and I’ve completed some testing with the help of David Vavra. The result of that is version 1.5 which is available here at the drop box. I’ll be posting that to CRAN in a bit.  For anyone who has worked with temperature data from the various sources  Berkeley Earth is a godsend. For the first time we have a dataset that brings together all the open available temperature datasets into one consistent format. The following sources are merged and reconciled.

  1. Global Historical Climatology Network – Monthly
  2. Global Historical Climatology Network – Daily
  3. US Historical Climatology Network – Monthly
  4.  World Monthly Surface Station Climatology
  5.  Hadley Centre / Climate Research Unit Data Collection
  6.  US Cooperative Summary of the Month
  7.  US Cooperative Summary of the Day
  8.  US First Order Summary of the Day
  9.  Scientific Committee on Antarctic Research
  10.  GSN Monthly Summaries from NOAA
  11.  Monthly Climatic Data of the World
  12.  GCOS Monthly Summaries from DWD
  13.  World Weather Records (only those published since 1961)
  14.  Colonial Era Weather Archives.

The data files are availabe here: http://berkeleyearth.org/data/

Let’s start with a top level description of the dataflow through the system. All the source data is collected and turned into a common format: http://berkeleyearth.org/source-files/.   Those files are then merged into a single file called the “multi value” file. In this file every series for every station is present. The data format for all the temperature data files is common: there are 7 columns: Station Id, Series Number, Date, Temperature, Uncertainty, Observations, and Time of Observation. So, in the “multi-value” file a single station will have multiple series numbers.  In the next step of the process “single value” files are created. There are four versions of these files depending upon the Quality control applied and whether or not seasonality is removed.  Thus there are 5 versions of the  data: Multi value, single value with no QC and no removal of seasonality, single value with QC and no removal… you get the idea. In addition, the final files are delivered as TMAX, TMIN and TAVG. In other words there are 15 datasets.

The 15 datasets can all be downloaded with version 1.5 of the package using the function downloadBerkeley(). The function is passed a data.frame of Urls to the files and those selected are downloaded and unzipped. In this process the package will create Three top level directories: TAVG, TMIN, and TMAX.  Files are then downloaded to sub directories under the correct directory. It’s vitally important to keep all directories and file names intact for this package to function. The file named “data.txt”, has the same name across all 15 versions, so keeping things organized via directory structure will prevent obvious mistakes. There is a safeguard of sorts in the files themselves. Every file starts with comments that indicate the type of file that it is ( Tmax, multi value ). I’ve included a function getFileInformation()  that will iterate through a directory and write this information to a local file. That function also  reads all the files and extracts the “readme” headers, writing them to a separate “readme” subdirectory.

The download takes a while and I suggest you fire it up and leave your system alone while it downloads and unpacks the data. Should you get any warnings  or errors you can always patch things up by hand ( download manually) or call downloadBerkeley() again subsetting the  url data.frame to target those files that get corrupted. That is, on occasion you MAY get a warning that the file size downloaded doesnt match the file description. I suggest patching these up by hand downloading. I could, of course add some code to check and verify the whole downloading process, so ask if you would like that.

Once you have all the files downloaded you are ready to use the rest of the package. Every folder has the same set of files: data related to the stations and the core file “data.txt” The station metadata comes in several versions from the bare minimum ( station, lat,lon, altitude) to the complete station descrition. Functions are provided to read every file: They are all named to let you know exactly what file they read   readSiteComplete()  readSiteDetail().  The filenames are all defaulted to the Berkeley defined name. You merely set the Directory name and call the function. All functions return a data.frame with standard R NA’s used in place of the Berkeley values for NA.  In addition, I’ve rearranged some of the data columns so that the station inventories can be used with the package RghcnV3.

The big task in all of this is reading the file “data.txt”. On the surface it’s easy. Its a 7 column file that can be read as a matrix, using read.delim(). There are two challenges. The first challenge is the “sparse” time format. There are over 44K stations. Some of those stations have a couple months of data, others have data from 1701 on. Some stations have complete records with reports for every month; other stations have gaps in there reporting. Berkeley common format only reports the  months that have data. Like So:

Station    Series     date  Temperature

1               1           1806.042    23

1               1           1925.125     16

If all the dates between 1806 and 1925 have no records ( either absent or dropped because of QC)  then the months are simply missing. There are no NA. This gives us a very compact data storage solution, however, if you want to do any real work you have to put that data into a structure like a time series or a matrix where all times of all stations are aligned. In short, you need to fill in NAs. And you have to do this every time you  read the data in. At some point I expect that people will get that storage is cheap and they will just store NAs where they are needed. Reading in sparse data and filling in NAs is simple, time consuming, and prone to bone headed mistakes. Our second challenge is memory. Once we’ve expanded the data to include NA we run the risk of blowing thru RAM. Then if we want to calculate on the data we might make intermediate versions. More memory. There isn’t a simple solution to this, but here is what version 1.5 has. It has three different routines for reading in the data:

readBerkeleyData():  This routine  reads all 7 data columns and does no infilling. Its primary purpose is to create a memory backed file of the data. However, if you want to analyze things like time of observation or number of observations you have to use this function. Also, if you have your own method of “infilling NA”   you can use this to grab all the data in its time sparse format. On FIRST read the function will take about 10 minutes to create a file backed version of the matrix using the package bigmemory. Every subsequent use of the call gets you immediate access to the data.bin file it creates.

readBerkeleyTemp(): this routine also creates a file backed matrix. On the very first call it sees if the temperature.bin file exists. Since that file doesnt exists,  it is created. It is created from “data.txt” OR “data.bin”.  Data.bin is created by readBerkeleyData(). So basically, readBerkeleyTemp() on first pass calls readBerkeleyData(). If readBerkeleyData() hasn’t been called before, it is called and data.bin is created and returned to readBerkeleyTemp(). The function then proceeds to create a file called temperature.bin.  That file has a column for every station and a row for every time. NAs are put in place. The column names are also used to represent the station Id. Row names are used for time. The Berkeley “date” format is changed as well.  This process can take over 2 hours. A buffer variable is provided to control how much data is read in before it is flushed to disk. It is set to 500K.  At some stage This buffer will be optimized to the local RAM actually available. If you have more than 4GB you can play with this number to see if that speeds things up.

Lastly the function  readAsArray() is provided. This function does not create a file backed matrix. It reads in “data.txt” and creates a 3D array of temperature only. The first dimension is stations, the second is months and the third  is years. dimnames are provided. This data structure is used by the analytical functions in RghcnV3.




About these ads
Categories: Uncategorized
  1. Wayne2
    February 22, 2012 at 1:56 AM | #1

    Excellent, and thanks!

  2. Willis Eschenbach
    February 22, 2012 at 2:26 AM | #2

    Many thanks as always, Mosh. Good work.


  3. Bruce
    February 22, 2012 at 10:16 PM | #3

    Data seems a bit off for the one station I checked – Malahat BEST # 7973

    BEST single value compared to Environment Canada

    EC data scraped from:


    Sorted by biggest ABS difference

    17.89 2002 12
    8.55 2003 2
    7 1995 11
    2.7 2011 12
    1.6 2000 6
    1.183 2008 4
    0.867 2001 6
    0.793 2007 4
    0.789 2008 6
    0.782 2000 7
    0.759 2002 3
    0.746 2001 4
    0.722 2008 5
    0.667 2003 9
    0.659 1991 12
    0.633 2010 4
    0.633 2004 9
    0.605 1996 1
    0.604 2008 8
    0.602 1997 6
    0.602 2006 3
    0.587 2008 2
    0.572 2000 4
    0.555 2008 7
    0.528 2003 5
    0.465 2010 6
    0.461 2006 2
    0.43 2006 9
    0.406 2000 10
    0.391 2000 9
    0.358 1998 11
    0.341 2000 1
    0.318 1996 11
    0.314 1999 12
    0.276 1999 11
    0.263 2007 3
    0.244 1994 3
    0.243 2004 12
    0.242 1995 12
    0.233 2006 11
    0.232 1994 12
    0.231 2006 12
    0.231 1997 9
    0.206 2007 10
    0.204 2000 5
    0.195 1993 5
    0.195 1999 3
    0.188 1999 5
    0.187 2005 12
    0.175 1996 4
    0.172 1995 6
    0.171 1998 6
    0.165 1992 5
    0.154 1999 10
    0.15 1995 3
    0.148 1997 5
    0.146 2001 12
    0.144 2007 11
    0.132 1999 7
    0.129 2000 12
    0.125 1994 1
    0.123 1992 1
    0.123 1993 3
    0.122 1996 9
    0.122 2006 1
    0.118 1993 12
    0.117 1997 3

    …. etc

    • Steven Mosher
      February 22, 2012 at 10:47 PM | #4

      1. You have to specify which BEST dataset you used.
      2. EC is not used as a source.

  4. Bruce
    February 22, 2012 at 11:08 PM | #5

    LATEST – Single-valued

    There are 234 BEST records. Only two records match EC exactly because of rounding.

    72 records difference by more than .1C

    It was not -13.79C In December 2002 according to many records around Malahat.

    EC’s value of 4.1 looks correct.

    • Steven Mosher
      February 22, 2012 at 11:31 PM | #6

      EC data on the website, as they describe, is not necessarily from stations that pass quality requirements. So, you really cannot accomplish anything by comparing it with anything else.
      Been there done that. The most you can say is that EC sometimes agrees with itself and sometimes disagrees with itself.

      You will also find many differences positive and negative with individual records. The key of course is finding a consistent and widespread bias. That would be interesting. You will also find outliers. See the file you used? Single Valued? there is no QC on that file.

  5. Bruce
    February 22, 2012 at 11:52 PM | #7

    “See the file you used? Single Valued? there is no QC on that file.”

    From BEST

    Quality Controlled

    Same as “Single-valued” except that all values flagged as bad via the quality control processes have been removed.

    Were the values that I list above “flagged as bad”?

    “The key of course is finding a consistent and widespread bias.”

    If the data is crap is does matter.

    • Steven Mosher
      February 23, 2012 at 12:18 AM | #8

      you need to use the right files. but if ec doesn’t agree then ask yourself why you trust it more than they do. the trusted data is passed on to other agencies. basically you cannot assess records in isolation and can’t make a case with poor unrepeatable assertions.

  6. Bruce
    February 23, 2012 at 12:37 AM | #9

    Steve, I live 60 miles from Malahat.

    If it was -13.79C for December 2002 I would have noticed.

    Duncan was 4.2C (A little north of Malahat)
    Victoria was 5.7C (A little south)

    • Steven Mosher
      February 23, 2012 at 1:51 AM | #10

      Again, Bruce.

      1. You need to use the Quality controlled files 2. You would need to inspect the output after scalping.

      When you do that you need to document every step in a way that can be reproduced.

      That will tell you what value goes into the algorithm.

      Second, you may think that your word is evidence. It’s not. Your word is not evidence. Write up documentation in a way that others can REPEAT your analysis. Then submit a email to Berkeley. Like the documents from Heartland I take nothing at face value. Not them, not you. not Gleick. Everyone gets the same treatment. Dont take it personally.


  7. Bruce
    February 23, 2012 at 12:45 AM | #11

    Ok. I looked in Quality Controlled.

    Record dumped.

    7973 1 2002.792 9.663 0.0064 31 -99
    7973 1 2002.875 7.527 0.0065 30 -99
    7973 1 2003.042 6.065 0.0064 31 -99

    • Steven Mosher
      February 23, 2012 at 3:26 AM | #12

      So, bascially, you looked at single value files for decmeber and found a temperature
      out of wack. You then look at QC data and find that its been removed.

      Your lesson here is what?

      1. If you want to help, document what you do in a way that is helpful to others.

  8. Bruce
    February 23, 2012 at 1:17 AM | #13

    Feb 2003 also dumped

    7973 1 2003.042 6.065 0.0064 31 -99
    7973 1 2003.208 5.692 0.0064 31 -99

    And Nov 1995

    7973 1 1995.792 9.070 0.0067 28 -99
    7973 1 1995.958 2.858 0.0072 -99 -99

    Dec 2011 probably dumped.

    7973 1 2011.875 3.588 0.0071 25 -99
    7974 1 1970.208 6.198 0.0064 31 -99

  9. Bruce
    February 23, 2012 at 1:45 AM | #14

    Oh, and hundreds of station don’t have BC as the province code even though their name ends with ,BC

    7992 – CARMANAH,BC
    8289 – CLAYOQUOT,BC
    8342 – KEREMEOS,BC
    8375 – STAVE FALLS,BC
    8388 – VANCOUVER UBC
    8473 – FRENCH CREEK,BC
    8482 – HEDLEY,BC
    8602 – FERNIE,BC
    8862 – POWELL RIVER,BC
    8900 – KASLO,BC
    8909 – KELOWNA AWOS, BC
    9066 – NAKUSP,BC
    9148 – THURSTON BAY,BC
    9194 – INVERMERE,BC
    9204 – QUATSINO,BC
    9307 – SHALALTH,BC
    9319 – SIDMOUTH,BC
    9325 – CAPE SCOTT,BC
    9389 – REVELSTOKE A, BC
    9470 – BARRIERE,BC
    9493 – EGG ISLAND,BC
    9495 – SEYMOUR ARM,BC
    9575 – DONALD,BC
    9606 – VAVENBY,BC
    9619 – DOG CREEK A,BC
    9628 – 100 MILE HOUSE,BC
    9639 – RIVERS INLET,BC
    9652 – BIG CREEK,BC
    9719 – CAPE ST JAMES,BC
    9762 – ALEXIS CREEK,BC
    9837 – IKEDA BAY,BC
    9861 – OCEAN FALLS,BC
    9954 – TASU SOUND,BC
    9986 – SEWELL INLET,BC
    10013 – BARKERVILLE,BC
    10076 – MCBRIDE 4SE,BC
    10141 – TLELL,BC
    10170 – KEMANO,BC
    10200 – PORT CLEMENTS,BC
    10210 – DOME CREEK,BC
    10228 – WISTARIA,BC
    10229 – KILDALA,BC
    10242 – FORT FRASER 13S,BC
    10291 – MASSET CFS,BC
    10312 – ALEZA LAKE,BC
    10350 – SALVUS CAMP,BC
    10384 – FORT ST JAMES,BC
    10398 – PORT SIMPSON,BC
    10411 – TELKWA,BC
    10495 – NASS CAMP,BC
    10505 – FORT BABINE,BC
    10512 – PINE PASS,BC
    10555 – POUCE COUPE,BC
    10590 – HUDSON HOPE,BC
    10593 – PREMIER,BC
    10610 – BALDONNEL,BC
    10617 – FORT ST JOHN,BC
    10662 – WONOWON,BC
    10705 – BEATTON RIVER A,BC
    10706 – WARE,BC
    10711 – TODAGIN RANCH,BC
    10797 – CASSIAR,BC
    10804 – PLEASANT CAMP,BC
    10808 – HAINES APPS NO 2,BC
    10813 – GRAHAM INLET,BC
    10823 – LOWER POST,BC

    • Steven Mosher
      February 23, 2012 at 1:56 AM | #15

      The Province code comes from those metadata records that provide a FIELD named provide/state code. Some agencies, such as GHCN, conflate the province into the name.

      The Best metadata does not ADD to the record. It reconciles those records that exist. So, if a source BURIES the province in the name, that is not pulled out.

      You can, if you like, get the maps package in R and use lat/lon to figure out the province. Here is another clue: DONT trust names. What matters is lat/lon. In short the algorithm doesnt care about names: it cares about lat/lon

      Next: dont use comments to post random data: Post code that can be run. Otherwise, I have no way of verifying what you claim.

  10. Bruce
    February 23, 2012 at 2:44 AM | #16

    I used a text editor …

    But if you insist.

    dirDATA <- "???"

    fname = paste(dirDATA,"/data.txt",sep = "")
    BESTdata <- read.delim(fname,sep="\t",comment.char = "%",quote="",strip.white=TRUE)
    attr(BESTdata, "names") <- c("Station_ID", "Series Number", "Date", "Tm", "Uncertainty", "Observations", "Time of Observation")

    BESTdata[4:5, "Year"] <- 0.0
    BESTdata$Year <- trunc(BESTdata$Date,0)

    BESTdata[4:5, "Month"] <- 0.0
    BESTdata$Month <- ((round(((BESTdata$Date – BESTdata$Year) * 24),0) + 1 ) / 2)

    Malahat <- subset(BESTdata,Station_ID == 7973,select = c(Year,Month,Tm,Date))

    Too bad the data for even one station is so bad.

    • Steven Mosher
      February 23, 2012 at 3:19 AM | #17

      dirDATA <- "???"

      As I said Bruce You have to identify WHICH dataset you are using.

      Then, you have to look at the data post scalpel.

      Then you have to post code that actually works.

      Then you have to check your EC data.

      In the end, if you are right, then you show that the factoid matters.

      I will give you an example: In the non QC files you will find temperatures of 10000C

      But, they are so few that the final answer ( which uses billions of records ) are not effected.

      So, no data will be perfect. It is an on going process to find and correct bad raw data.
      You can help by documenting these errors in a way that other people can verify your work

      • Bruce
        February 23, 2012 at 4:25 AM | #18

        Learn to read.

        February 23, 2012 at 12:45 AM | #11
        Reply | Quote

        Ok. I looked in Quality Controlled.

      • Steven Mosher
        February 23, 2012 at 4:37 AM | #19

        and you found that bad raw data was removed.


  11. February 23, 2012 at 2:49 AM | #20

    FWIW, GSOD gives -12.1C for Dec 2002 in Malahat. And that is way out of line with other years (Dec), which do cluster around 4-6C. Victoria Harbor was 7.9C (GSOD) in that month. Looks like there may be an error somewhere which has got into GSOD and BEST.

    Congratulations, Steven, it’s a big dataset to be coping with. I hope I’ll get a chance to try out your package soon.

    • Steven Mosher
      February 23, 2012 at 3:31 AM | #21

      Thanks Nick. I could definitely use some of your wizardry with speed..

  12. Bruce
    February 23, 2012 at 5:54 AM | #22

    Yes it was removed. Therefore infilling algorithms will supply the missing data.

    What a perfect opportunity to manipulate the results.

    And this was one station where 99% of the records do not match Environment Canada and 30% are different by more than .1C.

    • Steven Mosher
      February 23, 2012 at 7:14 AM | #23

      there is no infilling. get the code

  13. Steve Hempell
    February 23, 2012 at 5:58 AM | #24

    Hi Steven,

    I was interested in downloading the BerkekleyEarth_1.5.tar.gz file . How large is it? I am on a restricted satellite connection – 250 Mb /day. Is the above file just the R script?

    I tried the link but only got the file name in the folder created for it with a size of 0 KB. Just checked again. Now have size of 13Kb – is that correct?

    I understand the data files are huge. Is there a way of downloading them piecemeal maybe just the one that gives the overall global temperature graph that you see to get a start.

    By the way I get this invitation from Linkedln from you. I rejected it as there is nothing I could add to the excellent work that you do. I have no expertise in what you do.

  14. Bruce
    February 23, 2012 at 10:02 AM | #25

    I just checked a list of 165 stations with BC as the State/Prov code.

    Just looking at 1998 to 2011 data.

    1099 records have been dropped from SV to QC.


    Station_ID Stn_Name Year Month QCTm SVTm
    1 7890 SHERINGHAM POINT, BC 2007 5 No Row 9.144
    2 7890 SHERINGHAM POINT, BC 2007 7 No Row 14.865
    3 7890 SHERI

    dffilteredStationsQC = list of stations you want to check
    BESTQCdata = BEST QC data
    BESTSVdata = BEST SV data


    StartYear <- 1998
    EndYear <- 2011

    months <- c("January","February","March","April","May","June","July","August","September","October","November","December")

    SF <- NULL

    for (s in 1:length(dffilteredStationsQC$Station_ID)) {
    QCstations <- subset(dffilteredStationsQC,Station_ID == dffilteredStationsQC$Station_ID[s])
    QC.s = StartYear & Year <= EndYear,select = c(Year,Tm,Month))
    SV.s = StartYear & Year <= EndYear,select = c(Year,Tm,Month))
    for (myYear in StartYear:EndYear) {
    for(myMonth in 1:12){
    QCTm <- "No Row"
    SVTm <- "No Row"

    QC.m 0) {QCTm <- paste(QC.m$Tm) }

    SV.m 0) {SVTm SF


    • Steven Mosher
      February 23, 2012 at 11:52 AM | #26


      That is why I pay no mind to people who blindly clamor for “raw data” Raw temperature data is filled with bad numbers. a very small percentage, but you can find certain places where they are abundant. In any case, the process of quality control is used to call out these bad raw data.

      Still, if you do the analysis on raw data and the analysis on QCed data… gues what? The trend doesnt change. You can of course look at the isolated places where the bad data came from, but BC isnt the whole world, however much you may think it is.

      1. we can only work with the data that was collected 2. a small fraction of the raw data is bad. 3. if you remove it the answer doesnt change much.

      We’ve known this for a long time. Good to see you catching up.

      • Bruce
        February 23, 2012 at 12:24 PM | #27

        Well, Environment Canada had the numbers right. So why is it just the people who believe in the CO2 fairy think the data is “bad” and want to drop large chunks of it.

        “if you remove it the answer doesnt change much.”

        BC is still cooling, but I’ll stick with Environment Canada.

      • Steven Mosher
        February 23, 2012 at 12:42 PM | #28

        It seems you forgot your own argument.

        1. you argued that the temperature BEST had was wrong. 2. you forgot to mention you were using the raw data that best gets from its sources ( check the source flag to see where that data comes from) 3. When you looked at the QC data, guess what? you found that the bad data you found had been found by an algorithm and removed.

        I suppose you might be pissed that a piece of code removed your talking point.

        In any case. One of the reasons that EC is not used by BEST is that they dont provide their data in a easy to use, easy to download, public dataset. You have to scrape it.

        You can find my scraper for EC on CRAN its named CHCN. That program will allow you to scrape all of EC data.

        I suppose in the future the EC data may be added to BEST. The answer wont change. It cant, because the physics of C02 will still be true.

  15. Bruce
    February 23, 2012 at 10:11 AM | #29

    StartYear <- 1998
    EndYear <- 2011

    months <- c("January","February","March","April","May","June","July","August","September","October","November","December")

    SF <- NULL

    for (s in 1: 12) { ## {length(dffilteredStationsQC$Station_ID)) {
    QCstations <- subset(dffilteredStationsQC,Station_ID == dffilteredStationsQC$Station_ID[s])
    QC.s = StartYear & Year <= EndYear,select = c(Year,Tm,Month))
    SV.s = StartYear & Year <= EndYear,select = c(Year,Tm,Month))
    for (myYear in StartYear:EndYear) {
    for(myMonth in 1:12){
    QCTm <- "No Row"
    SVTm <- "No Row"

    QC.m 0) {QCTm <- paste(QC.m$Tm) }

    SV.m 0) {SVTm SF

  16. Bruce
    February 23, 2012 at 10:15 AM | #30

    Ok … can’t post code.


  17. Bruce
    February 23, 2012 at 8:23 PM | #31

    You claim it is “bad data”. Yet EC has good data for the same stations and year and month.

    Aren’t you interested why?

    • Steven Mosher
      February 23, 2012 at 9:08 PM | #32

      bruce I think you first need to acknowledge the mistakes you made and the conclusions you jumped to. that common decency is lacking in ur case

      • Bruce
        February 24, 2012 at 2:57 AM | #33

        I jumped to the conclusion that the data BEST had was out of whack for BC compared to Environment Canada.

        I was correct.

        And I am appalled that BEST dropped 1000s of data points as a way of correcting their crap data.

        They didn’t fix the data. They deleted it.

        I would NOT call that Quality Control.

        I would call it hiding the decline in quality of data.

      • Steven Mosher
        February 24, 2012 at 4:38 AM | #34

        you dont understand the first thing about QC.


  18. nat
    February 29, 2012 at 12:22 PM | #35


  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Get every new post delivered to your Inbox.

Join 31 other followers

%d bloggers like this: