
What might scrapers for councillors look like?

Council scrapers are much shorter than skyscrapers, because they’re in a pothole. Or something.

Yesterday I talked about how I found a load of links to lists of councillors. The next set of questions includes: how might that list be used by scrapers; who might write them; how might they be maintained; and how much work would all that involve (and could it pay for itself)?

I can’t answer all those questions, but I can share the initial work I’ve done in thinking about how they might be maintained.

After chatting to Lucy at #NotWestminster earlier this year, I came up with a hypothesis:

There are enough people who can write a little code and care enough to maintain one or more councillor scrapers.

If that statement is true, then the project of making a list of all councillors could look like making good tooling for maintaining scrapers.

I already knew that there were some common patterns in how the data is published, so this might nicely lend itself to some helpful abstractions.

To test this, I started writing some tools that:

  1. Had some standard data models (see the sketch after this list)
  2. Could turn a web page into those models
  3. Could be run easily
  4. Could be extended and work with each URL I knew about
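
To make point 1 concrete, here is a minimal sketch of the kind of standard councillor model the tooling could be built around. The dataclass and its field names are illustrative assumptions, not the actual lgsf models:

    from dataclasses import asdict, dataclass
    import json


    @dataclass
    class CouncillorRecord:
        # Illustrative fields only; the real models may differ
        council_id: str   # e.g. "YOR"
        identifier: str   # the councillor's ID in the source system
        name: str
        party: str = ""
        ward: str = ""
        email: str = ""

        def as_json(self) -> str:
            # One JSON file per councillor
            return json.dumps(asdict(self), indent=2)

Anything that can turn a web page into a list of records like this can plug into the same pipeline.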

The Local Government Scraper Framework

After some train journeys and a few more evenings hacking, I ended up with a command line interface that could run one or more scrapers (one per council). The scrapers would write data to a local folder.

None of this is very interesting, but the point is that it should be easy for new developers to meaningfully contribute to.

For example, we found yesterday that York uses ModernGov.

Here’s what the councillor scraper for York looks like:

    from lgsf.scrapers.councillors import ModGovCouncillorScraper
    class Scraper(ModGovCouncillorScraper):
        base_url = "http://democracy.york.gov.uk"

Save this in scrapers/YOR/councillors.py, run python manage.py councillors --council YOR, and 47 JSON files are created, each containing information on a councillor.

If a council has a more complex site, methods on the class can be overridden, as I did for Birmingham: subclass the CMIS scraper and add methods where needed.
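
As a rough sketch of that pattern (the CMISCouncillorScraper import, the URL and the method name here are assumptions for illustration, not the real Birmingham scraper):

    from lgsf.scrapers.councillors import CMISCouncillorScraper


    class Scraper(CMISCouncillorScraper):
        # Placeholder URL; the real scraper points at Birmingham's CMIS site
        base_url = "https://birmingham.example.org"

        def get_single_councillor(self, councillor_html):
            # Hypothetical override: start from the default parsing,
            # then patch up whatever this council does differently
            councillor = super().get_single_councillor(councillor_html)
            return councillor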

There are also helpers for custom HTML sites, like the one Stroud has.
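
The framework's actual helpers will look different, but the job they wrap is roughly this kind of parsing, shown here as a standalone sketch with a made-up URL and CSS selectors:

    import requests
    from bs4 import BeautifulSoup


    def scrape_custom_html(base_url):
        # Fetch a plain HTML councillor listing and yield one dict per councillor
        soup = BeautifulSoup(requests.get(base_url).text, "html.parser")
        for row in soup.select(".councillor"):
            yield {
                "name": row.select_one(".name").get_text(strip=True),
                "party": row.select_one(".party").get_text(strip=True),
                "ward": row.select_one(".ward").get_text(strip=True),
            }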

Want to improve the ModernGov scraper? Run only scrapers that use that class with python manage.py councillors --tags modgov.

Have some scrapers you want to work on (because they’re all near where you live)? Tag them in the class with tags = ["symroe"] and run python manage.py councillors --tags symroe.
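
Put together, a tagged scraper is still only a few lines; the tag values are whatever groupings make sense to you:

    from lgsf.scrapers.councillors import ModGovCouncillorScraper


    class Scraper(ModGovCouncillorScraper):
        base_url = "http://democracy.york.gov.uk"
        tags = ["modgov", "symroe"]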

The point is that the interface is the useful part, not the scraper code (although some of that is handy too).

In future it might be possible to create scrapers for different sorts of data from councils, like meetings, planning notices and so on.

In the spirit of publishing while still embarrassed about the quality, I put all of this on GitHub yesterday.

I’m really interested in feedback in any form – it’s absolutely possible that this isn’t the right approach. By aiming this at anyone who can code a little, I might be shutting too many people out – maybe tooling like we have for candidates would be better.

Maybe a hybrid system where scrapers seed the data and humans verify it later would be best.

Try it out, submit bugs or get in touch with other ideas.

Next post: How useful is scraped councillor data?

Photo credit: la_bretagne_a_paris

Get in touch:

Jump into the online chat in Slack, tweet us, or email hello@democracyclub.org.uk.