I’m resisting making a Zelda pun here
As noted yesterday, one of the first problems with creating a single list of all UK representatives is finding the links to each local list spread across local government websites.
In this technical blog post, I’ll talk about that process, as well as some initial thoughts on how easy it is to scrape the data from those local lists once they’ve been found.
Ultimately, we want to cover all representatives, but in this post I’m just talking about councillors and council websites.
Finding some links
Here are some of our assumptions.
- Every council has a website that contains a list of councillors somewhere on it.
- This means that there is a fairly small set of URLs we need to find.
- We can reasonably spend some time manually making lists of URLs. We’re not trying to spider the whole web.
I started by manually collecting the lists, going to each authority’s website and trying to find the list of councillors.
As an aside, it’s amazing how obscure anything about deliberation or representation is on some sites. Democracy is pushed to the footer links, when transactional services are front and centre. Of course, there is a good reason for this, but maybe there’s a need for a bit more balance.
Back to URLs. Manually collecting them is a slow process, so the next step was to look at the Content Management System (CMS) market. One of the major suppliers to local government is ModernGov, and helpfully it has quite distinctive URLs containing things like mgFindMember.aspx. A search for inurl:gov.uk inurl:mgFindMember.aspx found a lot of them quickly.
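Filtering a pile of candidate URLs down to likely ModernGov installs is then trivial. A minimal sketch (the helper name is mine, and only the one page name mentioned above is checked – real installs expose other mg*.aspx pages too):

```python
def looks_like_moderngov(url: str) -> bool:
    """True if the URL contains the distinctive ModernGov page name
    mgFindMember.aspx. Other mg*.aspx page names exist on real installs,
    but this is the marker used in the search above."""
    return "mgfindmember.aspx" in url.lower()
```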
Then, a breakthrough, possibly. The Government Digital Service (GDS) maintains a service called “Local Links Manager” that local authorities can log in to and add URLs against a taxonomy of services. It’s not very well publicised, but reading through GitHub produced this URL:
This is a CSV with a lot of links in it, recorded against categories including “Find out about your local councillors” (taxonomy code
Filtering by this category left me with 581 rows – about the right number.
This was a great start, but it’s not quite right.
First, at a simple level, about 16% of the links in the CSV are broken. That is, they return a 404, a 500 or, in one worrying case, a 410 Gone. A military coup by status code, maybe?
Obviously some councils aren’t maintaining the links as they should.
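A link audit like this is easy to automate with the standard library. A hedged sketch (the user-agent string and timeout are arbitrary choices of mine, and some servers answer HEAD requests differently from GET, so re-checking failures with a GET is worth considering):

```python
from typing import Optional
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check_link(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for a URL, or None if no HTTP
    response came back at all (DNS failure, refused connection,
    malformed URL, and so on)."""
    try:
        req = Request(url, headers={"User-Agent": "link-checker/0.1"},
                      method="HEAD")
        with urlopen(req, timeout=timeout) as resp:
            return resp.status       # 2xx/3xx responses land here
    except HTTPError as e:
        return e.code                # 404, 500, 410 Gone, ...
    except (URLError, ValueError):
        return None                  # request never produced a status code
```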
Second, the working URLs often don’t actually point to a list of councillors, rather a page explaining a little about how many councillors there are, what they do, etc. This is a useful page to have (and for GOV.UK to link to – one of the reasons this CSV exists) but not for what we’re trying to do.
So it was back to manually editing the list using this CSV as a base – until a chat with Jon Lawson from OpenCouncilData, who had already done this work in this CSV. All the links in his file actually point to the list of councillors, and better still, each one has a GSS code too.
For our scrapers we want an identifier for the organisation, not its current boundaries (a GSS code is a boundary ID, so it’s best not to use one to identify a council).
Thankfully, Alex Parsons has made a CSV that maps between GSS codes and organisation IDs. The following magic commands, with the help of the Python tool csvkit, give us what we want:
$ wget https://raw.githubusercontent.com/ajparsons/uk_local_authority_names_and_codes/master/uk_local_authorities.csv
$ wget http://opencouncildata.co.uk/csv1.php
$ csvjoin csv1.php uk_local_authorities.csv --left -c 18,16 | csvcut -c 25,21
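Under the hood that join is just a GSS-code lookup. Here is a minimal pure-Python equivalent; the column names are passed in explicitly because the real header names should be checked against the two files rather than taken from this sketch:

```python
import csv

def join_links_to_org_ids(links_path, authorities_path,
                          links_gss_col, url_col,
                          auth_gss_col, org_id_col):
    """Left-join councillor-list links onto organisation IDs via GSS code.
    A link whose GSS code has no match is kept, with org ID None."""
    with open(authorities_path, newline="") as f:
        lookup = {row[auth_gss_col]: row[org_id_col]
                  for row in csv.DictReader(f)}
    with open(links_path, newline="") as f:
        return [(lookup.get(row[links_gss_col]), row[url_col])
                for row in csv.DictReader(f)]
```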
At last we have a list of links to lists of councillors against a standard ID.
What to do with them?
Now for some scraping. Let’s start with the easy ones, ModGov and CMIS.
ModGov is probably the best CMS for us, because it has an API with a URL that’s predictable.
For example, the link in the CSV we made above is:
This tells us that ModGov is installed at http://democracy.york.gov.uk, and that means we can change the URL to
Bingo, structured XML for councillors in York!
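Once fetched, that XML is straightforward to pull apart. A sketch of the parsing step – note that the element names used here (councillor, fullusername, politicalpartytitle) are my assumptions about the response shape, not the documented ModGov schema, so check them against a live response:

```python
import xml.etree.ElementTree as ET

# Illustrative response shape only -- element names are assumptions,
# not the documented ModGov schema.
SAMPLE = """<councillors>
  <councillor>
    <fullusername>Cllr Jane Smith</fullusername>
    <politicalpartytitle>Labour</politicalpartytitle>
  </councillor>
</councillors>"""

def extract_councillors(xml_text):
    """Pull (name, party) pairs from a ModGov-style XML response."""
    root = ET.fromstring(xml_text)
    return [(c.findtext("fullusername"), c.findtext("politicalpartytitle"))
            for c in root.iter("councillor")]
```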
The links file we have gives us 195 councils that use ModernGov. With a little Python, we can get structured data on councillors for about a third of the councils we’re aiming for.
It’s a little harder to spot CMIS from the URLs alone, but we can look for cmis in the URL. There’s no API here, but the HTML is standard enough that we can look for the class PE_People_PersonBlock and pull councillors from there.
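A minimal standard-library sketch of that CMIS scrape, assuming well-formed markup – a real scraper would more likely use a tolerant parser such as BeautifulSoup, and the simple depth counter below assumes no unclosed void tags (like <br>) inside a block:

```python
from html.parser import HTMLParser

class PersonBlockParser(HTMLParser):
    """Collect the text inside every element whose class list includes
    PE_People_PersonBlock -- the class CMIS uses for each councillor."""
    def __init__(self):
        super().__init__()
        self._depth = 0    # >0 while inside a person block
        self.blocks = []   # one list of text fragments per councillor

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1
        elif "PE_People_PersonBlock" in dict(attrs).get("class", "").split():
            self._depth = 1
            self.blocks.append([])

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.blocks[-1].append(data.strip())
```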
In tomorrow’s post I’ll dive into more detail about what I did with the links next. Do get in touch if you’re interested in any of this, or have ideas as to how best to find more common patterns in council CMSs.
Next post: Scrapers for councillor data
Photo credit: cogdog