"You can find it online" A review of open climate data

The need to address climate change is becoming increasingly urgent.

It is clear that we need access to as much information as possible if we are to act effectively: to rapidly wind down fossil fuel consumption, to enact climate justice by helping those already affected by anthropogenic climate change, and to bring society with us on the journey. To do this, we need to be both bold and informed - and to be informed we need data.

Efforts to improve access to data are becoming more and more pervasive, with a myriad of organisations recognising the value of data for their own operations and, sometimes, for the community at large. A few years ago there was a huge push from the IT community to collect as much data as possible - usually in the context of creating centralised data stores - on the promise, ultimately unrealised for most, that data was the new oil. New data-focussed tools and techniques were developed, including new database architectures, machine learning and cloud services. However, as with many journeys along the technology hype cycle, not everything that was promised came to pass. Organisations have been left with data lakes and gateways that they no longer know what to do with.

In this blog I delve into the state of play with open climate data, examining some of the major themes and challenges, and why simply making data open is not sufficient. Finally, I will cover what Subak is doing to help with climate data.

Can I speak to a librarian, please?

As a user, my number one priority is being able to access the data I am searching for. However, information management seems to have been an early casualty of the massive expansion of data across the web. The decentralised nature of many organisations publishing their own information on web portals, and the proliferation of search and information management standards, have created a jumble of data sources across a huge number of websites.

Example: Where is the Botswana transmission network?

I started off by googling “Botswana transmission network”, and I found two datasets on energydata.info. Option 1 is a GeoJSON and Option 2 is a shapefile, so I may have a preference already.

Option 1

The GeoJSON file was last updated in 2020, but in the metadata I can see a reference to the Bonneville Power Administration, which is a US federal agency. Whether this is a typo or a genuine mistake, there’s no way to be sure. The metadata also states that the data are from 2006, but without a source there’s no way to check whether a more up-to-date version exists. We could email the author, assuming they are still maintaining the dataset.

Option 2

This one was created later, so could be more up to date. There is a reference to a source, the AICD study, another source which seems to be the same as Option 1’s, and a paid map in PDF format (which I do not want).

I now google AICD - the African Infrastructure Country Diagnostic, completed in 2010 as a one-off study. I open the document and do an in-page search for “transmission”, which surfaces a link to the website infrastructureafrica.org along with the methodology.

The link does indeed have ArcGIS files (public domain only) and interactive PDFs. I have verified the source of the data, but will need to examine the files manually to see what suits my needs.
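Before trusting either download, it is worth opening the file and checking what is actually inside it. A minimal sketch of that manual examination in Python, using a synthetic GeoJSON as a stand-in for the Botswana file (the real file’s property names, such as `voltage_kv` and `status` below, will differ):

```python
import json

# A tiny synthetic GeoJSON FeatureCollection, standing in for a downloaded
# transmission-network file. Real portal downloads have many more features
# and different property names.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {"type": "LineString",
                   "coordinates": [[25.9, -24.6], [26.1, -24.4]]},
      "properties": {"voltage_kv": 220, "status": "existing"}
    }
  ]
}
"""

data = json.loads(geojson_text)

# Basic sanity checks before relying on a downloaded dataset:
assert data["type"] == "FeatureCollection"
for feature in data["features"]:
    props = feature["properties"]
    print(feature["geometry"]["type"], props.get("voltage_kv"), props.get("status"))
```

For real shapefiles or larger GeoJSON files, a library such as geopandas would do this more conveniently, but even the standard-library inspection above reveals whether the attributes a project needs are present at all.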

Finally, a review of the Botswana Power Corporation website doesn’t show any maps or data readily available without a request.

Floating the most popular (and usually most useful) data to the top of results is rarely achieved, so users are left sifting through multiple pages with little recourse to help. This wouldn’t be such a problem, except that in-portal search often relies on poor-quality metadata and rudimentary search methods. If a source isn’t surfaced by Google, the gold standard of search, then finding data is laborious and validating it is difficult.

It’s fair to say that organisations with a specific remit to collect and share data do this much better than others, particularly where the organisation’s product is its data. Data stewardship takes a considerable amount of time and energy. Understanding the characteristics of a dataset through descriptions stored in well-maintained metadata allows you to quickly assess whether the data are suitable, and allows the provider to understand what their users’ needs are.

This is where the role of the “librarian” still has value: understanding and reflecting on what users are looking for in order to tailor their platforms. Gone are the days of looking up data in the tables of academic tomes, but care and attention still need to be paid to the discoverability and searchability of data.

Platforms with no product

Open data portals come in different formats, but a consistent theme is that websites publish data “as-is”, with little to no thought about how users will investigate and consume it. The quality of a data platform aligns closely with the extent to which the organisation treats its data as a product.

Furthermore, if the data product generates revenue, it tends to come with additional benefits such as account managers and technical support, which enhance the user experience by making it easier to ask questions and drawing users into the user base. Data providers that treat data as a product tend to be the best in terms of access types (APIs, download variety), quality and documentation. However, they also tend to focus on maximising revenue from a wide audience rather than catering to a specific need such as climate impact.

Example: Nordpool

An excellent example of a well developed data product is the Nordpool exchange platform.

Nordpool is the market operator for the day-ahead and intraday electricity markets in the UK and the Nordic region. On the website there is a preview of the historic market data, so the user can instantly see what the full dataset will look like.

The types of historic data a user can access are clearly articulated on the product page, and the delineation between real-time data for energy market traders and historic data is clearly indicated. The costs for historic data and for the data service are also clearly stated, as are contact details for customer support and the terms of use.

Revenue isn’t a necessity for data to be treated as a product. In fact, the low cost of publishing and sharing static datasets creates a catch-22: it is easy to put data online, but hard to justify the ongoing cost of treating it as a product. Consider the business model for a data product: the expected cost recovery is based on maintaining the data and providing access to knowledge about its characteristics. This means the price, if the cost cannot be absorbed by business-as-usual, should ideally be based on recovering labour time rather than on some value assumed by the data owners, as is often done. Indeed, successful providers of climate data such as the World Resources Institute and the World Bank have their own sources of funding and still treat their data as a product.

For static data produced as part of a one-off project or study, it is difficult to make a compelling case for spending time on maintenance, and such data are better off archived. Data platforms often run into issues here: stale datasets, poor or missing documentation, and a lack of transparency about the contents of the data - particularly when it sits behind a paywall.

Connecting datasets together

Data standardisation across an entire sector would be a huge undertaking. Engineering standards, for example, are often formed along safety lines, best practice, or lived experience, but they need to be managed and maintained by one organisation and adopted by a wide range of organisations and people. This may not be time-effective in the case of climate data, and there is not necessarily a central body to do it, but connectivity and translation (both in language and in format or standard) are certainly desirable.

The ability to connect datasets together increases their value to a user by a multiplicative factor. Data are like roads: their value is limited unless they connect to the other sources a user needs. This reinforces the importance of being able to identify which terms in data are similar (or the same) despite different terminology, allowing the user to compare like with like and draw conclusions from those comparisons. It doesn’t matter whether it’s a motorway, an autobahn or a barabara kuu; what matters is that elements can be linked together informally, without the imposition of standards. This is even more important for climate impact, where these connections are key to delivering the right data.

Example: Entso-e and US EIA

Entso-e tracks generation, transmission and trading volumes for the European market. The US EIA publishes similar statistics on a monthly basis, but not separated by capacity units. Entso-e’s definitions are separated by generation type - for example “Nuclear”, “Biomass”, “Fossil Gas” - while the US EIA uses some differing definitions: “Biomass”, for example, is separated into “Wood and wood-derived” and “Other”. Comparisons are possible at a high level, but directly comparing energy generation across these markets requires more investigation, and some nuances between fuel types may call for specialist knowledge about energy generation.
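The kind of informal linking described above often amounts to nothing more than an explicit mapping from each provider’s taxonomy to a shared one. A hedged sketch in Python: the category names are simplified from the two real taxonomies, the figures are invented, and the mapping choices (for example rolling both EIA wood categories up into biomass) are my own assumptions, exactly the sort of judgement call that may need specialist review:

```python
# Map each provider's generation categories onto a shared, simplified taxonomy.
ENTSOE_TO_COMMON = {
    "Nuclear": "nuclear",
    "Biomass": "biomass",
    "Fossil Gas": "gas",
}

EIA_TO_COMMON = {
    "Nuclear": "nuclear",
    "Wood and wood-derived": "biomass",  # assumption: roll up into biomass
    "Other": "biomass",                  # assumption: EIA's other biomass bucket
    "Natural gas": "gas",
}

def harmonise(records, mapping):
    """Aggregate (category, GWh) records into the common taxonomy."""
    totals = {}
    for category, gwh in records:
        common = mapping.get(category)
        if common is None:
            continue  # unmapped categories would be flagged for manual review
        totals[common] = totals.get(common, 0.0) + gwh
    return totals

# Invented illustrative figures, not real market data.
entsoe = [("Nuclear", 100.0), ("Biomass", 20.0), ("Fossil Gas", 50.0)]
eia = [("Nuclear", 120.0), ("Wood and wood-derived", 8.0),
       ("Other", 5.0), ("Natural gas", 70.0)]

print(harmonise(entsoe, ENTSOE_TO_COMMON))  # {'nuclear': 100.0, 'biomass': 20.0, 'gas': 50.0}
print(harmonise(eia, EIA_TO_COMMON))        # {'nuclear': 120.0, 'biomass': 13.0, 'gas': 70.0}
```

Once both sides are expressed in the common taxonomy, like-with-like comparison becomes a simple join rather than a research project - which is precisely the value that connected data delivers.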

Concluding thoughts

To address the difficulties outlined in this blog, Subak is launching the Data Catalogue on Monday 23rd May. The Data Catalogue is an online platform that forms part of our core mission at Subak: connecting people and organisations to the data they need through our Data Cooperative. The Data Cooperative starts with a community of data users and producers - initially our Member organisations supported directly through our accelerator programme, but open to everyone. The second part of the Data Cooperative is knowledge, training and tools to effectively use, create and steward open data; currently this is delivered through our accelerator. The third part is the Data Catalogue.

The aim of the Data Catalogue is to make climate data more searchable, trusted and connected, to accelerate climate impact and address some of the issues mentioned above. The Catalogue is a web portal where anyone can search for, request, or add climate data. We’re looking for people to join our community as users, providers and members of the knowledge community.

You can register for our launch event here to find out more about the work Subak is doing to make data searchable, trusted and connected.
