Council scrapers are much shorter than skyscrapers, because they’re in a pothole. Or something.
Yesterday I talked about how I found a load of links to lists of councillors. The next set of questions includes: how might that list be used by scrapers; who might write them; how might they be maintained; and how much work would all that involve (and could it pay for itself)?
I can’t answer all those questions, but I can share the initial work I’ve done in thinking about how they might be maintained.
There are enough people who can write a little code and care enough to maintain one or more councillor scrapers.
If that statement is true, then the project of making a list of all councillors could look like making good tooling for maintaining scrapers.
I already knew that there were some common patterns in how the data is published, so this might nicely lend itself to some helpful abstractions.
To test this, I started writing some tools that:
- Had some standard data models
- Could turn a web page into those models
- Could be run easily
- Could be extended and work with each URL I knew about
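As a rough illustration of the first two points, here is a minimal sketch of a standard model plus a parser that turns a page into instances of it. It is not the framework's actual code: the `Councillor` fields, the `CouncillorLinkParser` class, the `class="councillor"` markup and the sample HTML are all assumptions made up for this example, and it uses only the Python standard library.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Councillor:
    """Hypothetical standard model; the field names are assumptions."""

    name: str
    url: str


class CouncillorLinkParser(HTMLParser):
    """Collect <a class="councillor"> links from a made-up page layout."""

    def __init__(self):
        super().__init__()
        self.councillors = []
        self._href = None  # set only while inside a matching <a> tag
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "councillor":
            self._href = attrs.get("href", "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            self.councillors.append(Councillor(name=name, url=self._href))
            self._href = None


def parse_councillors(html):
    """Turn a page of HTML into a list of Councillor models."""
    parser = CouncillorLinkParser()
    parser.feed(html)
    return parser.councillors


# Invented sample page: two councillor links and one unrelated link.
SAMPLE_HTML = """
<ul>
  <li><a class="councillor" href="/councillor/1">Cllr Jane Example</a></li>
  <li><a class="councillor" href="/councillor/2">Cllr John Sample</a></li>
  <li><a href="/about">About the council</a></li>
</ul>
"""

councillors = parse_councillors(SAMPLE_HTML)
```

The useful property is that everything downstream only ever sees `Councillor` objects, whatever the council's page looked like.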
The Local Government Scraper Framework
After some train journeys and a few more evenings hacking, I ended up with a command line interface that could run one or more scrapers (one per council). The scrapers would write data to a local folder.
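The "run the scrapers, write to a local folder" step could look something like the sketch below. This is an assumption about the general shape rather than the framework's real implementation: `save_councillors`, `run_scrapers`, the numbered file names and the fake scraper function are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_councillors(council_id, councillors, data_dir):
    """Write one JSON file per councillor under data_dir/<council_id>/."""
    out_dir = Path(data_dir) / council_id
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, councillor in enumerate(councillors):
        # Hypothetical naming scheme: 0.json, 1.json, ...
        path = out_dir / f"{index}.json"
        path.write_text(json.dumps(councillor, indent=2), encoding="utf-8")
        paths.append(path)
    return paths


def run_scrapers(scrapers, data_dir):
    """Run each council's scraper function and save what it returns."""
    written = {}
    for council_id, scrape in scrapers.items():
        written[council_id] = save_councillors(council_id, scrape(), data_dir)
    return written


def fake_scraper():
    """Stand-in for a real scraper; returns invented data, no network access."""
    return [{"name": "Cllr Jane Example"}, {"name": "Cllr John Sample"}]


data_dir = tempfile.mkdtemp()
written = run_scrapers({"YOR": fake_scraper}, data_dir)
```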
None of this is very interesting, but the point is that it should be easy for new developers to meaningfully contribute to.
For example, we found yesterday that York uses ModernGov.
Here’s what the councillor scraper for York looks like:
    from lgsf.scrapers.councillors import ModGovCouncillorScraper


    class Scraper(ModGovCouncillorScraper):
        base_url = "http://democracy.york.gov.uk"
Save this in scrapers/YOR/councillors.py, run it with python manage.py councillors --council YOR, and we have 47 JSON files, each containing information on a councillor.
If a council has a more complex case, methods on the class can be overridden, as I did for Birmingham by subclassing the CMIS scraper and adding methods where needed.
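The override pattern itself is ordinary subclassing. The sketch below uses stand-in classes rather than the real CMIS scraper, so every class and method name here (`BaseCouncillorScraper`, `get_raw_names`, `clean_name`, `UnusualCouncilScraper`) is hypothetical; it only shows how a shared base class can do most of the work while one council overrides a single step.

```python
class BaseCouncillorScraper:
    """Stand-in for a shared scraper class; NOT the real lgsf class."""

    def get_raw_names(self):
        # A real scraper would fetch and parse a page here.
        raise NotImplementedError

    def clean_name(self, raw_name):
        # Default behaviour that suits most councils.
        return raw_name.strip()

    def get_councillors(self):
        return [self.clean_name(raw) for raw in self.get_raw_names()]


class UnusualCouncilScraper(BaseCouncillorScraper):
    """A made-up council whose pages prefix every name with a title."""

    def get_raw_names(self):
        # Invented data standing in for a scraped page.
        return ["Councillor Jane Example ", " Councillor John Sample"]

    def clean_name(self, raw_name):
        # Reuse the default cleaning, then handle this council's quirk.
        name = super().clean_name(raw_name)
        prefix = "Councillor "
        return name[len(prefix):] if name.startswith(prefix) else name


names = UnusualCouncilScraper().get_councillors()
```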
There are also helpers for custom HTML sites, like the one Stroud has.
Want to improve the ModernGov scraper? Run only scrapers that use that class with python manage.py councillors --tags modgov.
Have some scrapers you want to work on (because they’re all near where you live)? Tag them in the class with tags = ["symroe"] and run python manage.py councillors --tags symroe.
The point is that the interface is the useful part, not the scraper code (although some of that is handy too).
In future it might be possible to create scrapers for different sorts of data from councils, like meetings, planning notices and so on.
In the spirit of publishing while still embarrassed about the quality, I put all of this on GitHub yesterday.
I’m really interested in feedback in any form: it’s absolutely possible that this isn’t the right approach. In being open to anyone who can code a little, I might be shutting too many people out. Maybe tooling like we have for candidates would be better.
Maybe a hybrid system where scrapers seed the data and humans verify it later would be best.
Try it out, submit bugs or get in touch with other ideas.
Photo credit la_bretagne_a_paris