One of the things we’ve done a lot of at Democracy Club is to gather data from all over local government into a single place. Either we’re trawling through 450-odd PDFs looking for candidates, or we’re sending freedom of information requests to councils to get their polling station data.
Let’s be honest, this is boring. Really boring. And it’s not something any of us actually wants to be doing, but there isn’t really any other way.
We’re not alone, though: we know of a few other organisations that have to do this sort of work, such as the UK Parliament and the LGA for election results data.
The work we’re all doing falls into two areas:
- Finding what you’re looking for on every council website (about 450 of them for general elections).
- Creating a standard format from each source.
For this post, I’m going to ignore point two. Standards are hard to agree on, and many people have spent many hours arguing about them. We’ll save that for another day.
Even before we have agreed standards, we can cut down the time it takes to harvest data from all these web sites by solving point one.
Our proposal is this: the website at the root of each government domain should publish a sort of ‘index’ of all the services and data it offers, with URLs to those services.
For example, Trumpton Council’s website at https://trumpton.gov.uk/ would serve a file at /data.json (we can argue about the exact name and format later).
That file might look something like this:
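A minimal sketch, assuming hypothetical key names, services, and URLs (nothing here is an agreed standard):

```json
{
  "contact-information": "https://trumpton.gov.uk/contact",
  "polling-stations": "https://trumpton.gov.uk/elections/polling-stations.csv",
  "planning-applications": "https://trumpton.gov.uk/planning/applications.json"
}
```

Each key names a service or dataset from an agreed vocabulary; each value is the URL where this particular council publishes it.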
The reason the council itself should host this file is that it then becomes part of the digital infrastructure the council already maintains, rather than something extra that has to be kept up to date somewhere else.
With this system, we only need to agree (or enforce):
- The URL of this index file
- What the ‘keys’ are (e.g., agreeing that “contact-information” is the term used for contact pages).
There is some excellent existing work on the second point. First, there is a taxonomy of local services on local.direct.gov.uk, and a central list of URLs to pages (not data) on council websites.
Unfortunately, this is fairly out of date, and it might not be maintained anymore (the domain name is a clue).
The other bit of positive work in this area is from the Local Government Association, who have “inventories” that point to data that the council publishes.
The final piece in this puzzle is the list of all the domain names. This fits nicely within the scope of GOV.UK, who should do two things:
- Provide a register of all domains in government (this is harder than it might seem at first)
- Provide a central validator and cache of all indexes.
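To make the validator idea concrete, here is a minimal sketch in Python. The key vocabulary (`KNOWN_KEYS`) and the two checks it performs are assumptions for illustration, not a real specification:

```python
from urllib.parse import urlparse

# Hypothetical agreed vocabulary of keys -- the real list would need
# to be standardised (see the taxonomy and inventory work above).
KNOWN_KEYS = {
    "contact-information",
    "polling-stations",
    "planning-applications",
}

def validate_index(index: dict) -> list[str]:
    """Return a list of problems found in a parsed /data.json index.

    Checks that every key comes from the agreed vocabulary and that
    every value is an absolute https URL.
    """
    problems = []
    for key, url in index.items():
        if key not in KNOWN_KEYS:
            problems.append(f"unknown key: {key}")
        parsed = urlparse(str(url))
        if parsed.scheme != "https" or not parsed.netloc:
            problems.append(f"{key}: not an absolute https URL ({url})")
    return problems
```

A central service at GOV.UK could run something like this over every domain in the register, cache the valid indexes, and flag the broken ones back to councils.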
Once this is done, our job will become a lot easier.
Now, about those standards…
P.S. This is not a new idea – it’s been inspired by a few existing implementations:
For years, websites have used ‘robots.txt’ to communicate to web crawlers how they should interact with their site. The idea in this post is more complex than a robots.txt, but it works on the same basic principle.
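That principle — a machine-readable file at a well-known path — is simple enough that Python’s standard library ships a parser for it. A quick illustration, with made-up rules for our fictional council:

```python
from urllib.robotparser import RobotFileParser

# Parse a made-up robots.txt, as a crawler would after fetching it
# from the well-known path /robots.txt on the council's domain.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rules.can_fetch("*", "https://trumpton.gov.uk/private/minutes"))  # False
print(rules.can_fetch("*", "https://trumpton.gov.uk/contact"))          # True
```

The proposed /data.json works the same way: one well-known path per domain, so harvesters know where to look without being told.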