What Van Halen can teach us about data

By Sym Roe

on 15th March 2021

Data,
Ideas

On the day that the GDS registers are shut down, we reflect on the Government’s National Data Strategy and offer 5 suggestions for whatever comes next.

Data as M&Ms

There’s an old story about Van Halen’s contract. Apparently they used to ask venues to provide a bowl of M&Ms with all the brown ones removed.

This wasn’t because the band were divas, had a rare form of kastanophobia (yes I googled for it) or wanted to flex their power over small town venues. The “M&M clause” was a tiny, tiny part of a huge contract that covered everything from the weight the stage could take, to the power requirements for the perm machines.

If the band walked into their dressing room to see some brown M&Ms, they knew the venue hadn’t been through the contract carefully, so everything else needed to be double checked. They used a trivial thing that they could see as soon as they arrived to prove the most complex parts of their contract had been completed.

Ain’t Talkin’ ‘Bout Innovation

This story – once again – shows that the open data world can learn a lot from glam rock, and naturally makes us think about the UK’s National Data Strategy (NDS).

Early last year we wrote a response to the consultation. We decided to take the David Lee Roth approach to consultations by focusing on the small things that might signal wider problems.

The focus of the NDS, rightly, is on the pyrotechnics and double-denim that is economic growth and innovation. There’s a decent set list including hits such as “more standards” and “interoperable data” and only the occasional “Geospatial Commission” B-side.

However, a look into the dressing room of government data will find a whole load of brown M&Ms.

At the time of writing the response, the GDS registers had been mostly abandoned. A local authority had been missing from one for almost two years, with no obvious error reporting mechanism, support agreement or understanding of what could be expected from the service.

It did have “data you can trust” printed in large friendly letters on the home page, though.

Today, GDS have given up on the registers product, and there are vague rumours that a replacement is in the works.

We think the metric that the NDS should be judged by is the ability to deliver simple, reliable lists over the long term.

It should not be judged by new “innovative projects” that will spend half their money on trying to fix typos and inconsistencies in shoddy government data.

Publishing simple lists of data is the planking exercise of digital services. Most people can plank for a short time without any problem, but sticking it out in the long term is really hard.

With that in mind, we have some suggestions for future iterations of registers – or reliable simple lists – and we will be using them as a proxy – or M&M bowl – for the rest of data policy in government.

We hope these points will be taken into consideration for future data publishing efforts. We used the local authority registers in a lot of our projects and they almost worked excellently. We were sad that they contained errors for over two years before being scrapped, and we would be very keen on contributing to future attempts. Not least because it means we wouldn’t have to do it on our own.

1. Don’t approach data as a technology problem

Technology to data publishing should be what Euclidean geometry is to Photoshop - integral but not the end goal. The really hard bit about publishing reliable lists of data is making them reliable, not publishing them.

In most cases publishing small lists is a commodity technology. For versioning, use Git. To combine publishing and versioning, use GitHub or another hosted Git service. There is no point in reinventing the wheel here.

Rather, it’s better to spend time building tooling for maintainers. Any work done to bring data maintainers closer to the downstream data users is worthwhile. The initial registers offering introduced named custodians. This is an excellent idea in principle, as it offers an appearance of accountability, but it has a couple of problems.

First, there might not be a single custodian in place - it might be a good idea to ask why not, but there will be cases where a single maintainer doesn’t exist.

Second, naming someone as responsible for a list without giving them the tools to maintain it won’t help anyone. They need to be able to do their job, and they’re less likely to take the job seriously if they can’t do it.
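As a rough idea of what "tooling for maintainers" could look like, here is a minimal sketch of a check a custodian might run before publishing a small CSV list. Everything in it is an assumption made for illustration: the `check_list` function, the `code` and `name` columns and the sample rows are all made up, and a real list would need its own rules.

```python
import csv
import io


def check_list(csv_text, key_field):
    """Return a list of human-readable problems found in a small CSV list."""
    problems = []
    seen = {}  # key value -> line number where it first appeared
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        line_number = reader.line_num  # line in the file, header is line 1
        key = (row.get(key_field) or "").strip()
        if not key:
            problems.append(f"line {line_number}: missing {key_field}")
        elif key in seen:
            problems.append(
                f"line {line_number}: duplicate {key_field} {key!r} "
                f"(first seen on line {seen[key]})"
            )
        else:
            seen[key] = line_number
        for field, value in row.items():
            if field is None:
                # DictReader stores surplus cells under the key None
                problems.append(f"line {line_number}: too many columns")
            elif field != key_field and not (value or "").strip():
                problems.append(f"line {line_number}: empty {field}")
    return problems


# Hypothetical sample data, not taken from any real register
SAMPLE = """code,name
ABC,Alpha Council
DEF,
ABC,Another Council
,Nameless Council
"""

for problem in check_list(SAMPLE, "code"):
    print(problem)
```

The point is not the specific checks, but that the person named as responsible gets a quick answer to "is this list in a fit state to publish?" without needing a platform team.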

2. Only publish small lists

As Paul Clarke notes:

A register with 100 records behaves differently to one with 64.1 million. You won’t be able to maintain them both using the same tools and techniques.

We suggest sticking to small, easy to manage lists - the sort of thing that one person could realistically be expected to maintain. There are no rules here, but can-edit-in-Excel is about the size.

This point assumes that “registers 2.0” is a product or platform of some sort. Of course, it could express itself as more of a quality assurance or data SLA that is given to data all over the government estate. If this is the case, then it’s still a good idea to start small - it will be much harder to change the publishing processes of, e.g., AddressBase than it would be to change the process of publishing the bank holiday JSON endpoint on GOV.UK.

3. Only publish lists that already exist

Publishing a new reliable list should be thought of like creating a new position in an organisation or creating a new service. It’s not good enough to just make a list once, publish it and assume that the processes to maintain that list will spring into existence somehow.

Rather than forcing data into existence by publishing new data, focus on existing processes that could create open data. Work with the teams involved in the process and see if maintained data can fit. This is a per-domain bit of research and could form the majority of the work of a “registers team”.

A good starting point is legislation. If primary or secondary legislation is required to make a change, then there is a good chance the organisational pipeline already exists to maintain data.

4. Support collaborative maintenance

If a list of data is a machine readable representation of, for example, legislation, then it will always follow reality. That is, adding an item to the published data can’t change legislation, so the data will always end up behind to some extent.

Contrast “follow reality” lists with, e.g., Companies House or the Land Registry, where the list is the canonical source (or close enough).

In the same way, a “list of things that just happened” will need constant maintenance and will be wrong a lot. A “list of things that happened before this date” doesn’t have the same maintenance problem, but could also be wrong at the point of publishing.

Opening up the maintenance of the list to everyone will help catch errors faster. To adapt Linus’s law, if data maintenance is collaborative, all errors can be fixed quickly.

Collaborative maintenance doesn’t mean that the list needs to be publicly editable - this model is unlikely to work for government data - but it does mean that third parties can report problems based on an actual source: legislation says X, the list says Y.
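A report of this kind, where the source says one thing and the list says another, is easy to generate mechanically once someone has both in hand. The sketch below is only an illustration: `find_discrepancies` is a made-up helper, and the two dictionaries stand in for whatever a third party has transcribed from the source and downloaded from the published list.

```python
def find_discrepancies(source, published):
    """Compare what the source says with what the published list says.

    Both arguments map an identifier to a value. Returns one line of text
    per difference, suitable for pasting into an issue tracker.
    """
    reports = []
    for key in sorted(source.keys() | published.keys()):
        if key not in published:
            reports.append(f"{key}: in the source but missing from the list")
        elif key not in source:
            reports.append(f"{key}: in the list but not in the source")
        elif source[key] != published[key]:
            reports.append(
                f"{key}: source says {source[key]!r}, "
                f"the list says {published[key]!r}"
            )
    return reports


# Hypothetical example data
legislation = {"A1": "Alpha Council", "B2": "Beta Council", "C3": "Gamma Council"}
register = {"A1": "Alpha Council", "B2": "Beta Borough", "D4": "Delta Council"}

for line in find_discrepancies(legislation, register):
    print(line)
```

Each line of output is a self-contained, checkable claim, which is exactly what a maintainer needs in order to act on a report from outside the organisation.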

Imagine a varying scale of collaboration from Wikipedia or OpenStreetMap at one end, GitHub pull requests in the middle and a PDF of a scanned letter emailed to you from someone at the other end.

This is important because organisations that publish data are often bad at making sure it’s correct. For a case study, look at how OpenStreetMap reacts to bulk adding data from companies.

Anyone, in or outside of the organisation, team, department, or government should be able to help maintain the data with clear, well explained processes.

Or translated: just use an off-the-shelf issue tracker that’s open to the public.

5. Don’t start with user needs

Or rather, don’t define users as people who consume the data. There will be so many cases of reuse that it will be impossible to identify a useful set of needs. Don’t be tempted to limit the scope to a smaller set of reusers, as inevitably this will result in the data publishing layer manipulating the data to meet reusers’ needs. This will break over time as it’s another layer of abstraction over the data.

For example - a list might contain some information about dates. This might be a text field that contains a range of actual dates, it might contain “10 days before” another event, or might indicate something like sitting days in parliament - a date that will change in hard to predict ways.

Any attempt to convert this to an ISO standard date will cause information loss. That might help some users, but it would corrupt the initial data. If this must be done, always include the initial data alongside the converted data.
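To make "include the initial data alongside the converted data" concrete, here is a minimal sketch. It assumes, purely for illustration, that clean dates in the list are written like "15 March 2021"; the `with_iso_date` function and the `when` and `when_iso` field names are made up. Anything that doesn't parse is left without a converted value rather than being guessed at.

```python
from datetime import datetime


def with_iso_date(record, field):
    """Return a copy of record with an extra '<field>_iso' key.

    The original text is never modified. The ISO value is None when the
    text isn't a single date in the assumed 'day month year' format.
    """
    raw = record[field]
    try:
        iso = datetime.strptime(raw.strip(), "%d %B %Y").date().isoformat()
    except ValueError:
        iso = None  # e.g. a range, or a date relative to another event
    return {**record, f"{field}_iso": iso}


# Hypothetical records
rows = [
    {"event": "Notice published", "when": "15 March 2021"},
    {"event": "Nominations close", "when": "10 days before polling day"},
]

for row in rows:
    print(with_iso_date(row, "when"))
```

Reusers who only want machine-readable dates can read the converted field, while everyone else can still see what the list actually said.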

Thinking about the needs of lists is like thinking about the needs of electricity or a tin of blue paint. The users want things like reliability and consistency, and they want those things in turn because they want a cup of tea, or blue walls.

Photo credit Patrick Fore