Charles Babbage stood for election. Twice actually. He wouldn’t have liked PDFs but he would have made a machine that did.
Know about machine learning? Want to help democracy in the UK? We have a challenge for you!
We want to know whether it’s possible to turn PDFs into data on election candidates. Can you help?
We collect data on every candidate standing for election in the UK. The official source of candidate information comes from a “Statement of Persons Nominated” (SOPN) published by each local authority that is managing an election. That’s around 400 for a general election or around 130 this year.
The statement is normally published as a PDF on each local authority’s website.
At the moment, with the help of thousands of crowdsourcers, we collect the links to all the PDFs on the day they’re published. Then we manually enter the information into a structured format using our crowdsourcing website.
The candidates data is open and free for anyone to reuse.
We think we can make some improvements to the second stage of this process: turning PDFs into data.
Processing this information is time sensitive: the nomination papers are published a few weeks before the election and we manage to get through them all in a few days manually. Ideally we’d speed the process up significantly by changing the work to manually verifying the data that a program had created.
The training data
The good news is that, because we’ve been doing this for a while, we have a load of training data in PDF form and the corresponding structured data.
The other good news is that the content of the PDF is normally produced by one of four “Electoral Management System” software vendors. Some councils change the default template, but the edge cases should be more limited than just parsing random PDFs.
Because each document is a statutory notice, we should know more or less what it will look like in advance.
The per candidate fields in the SOPNs are:
- Candidate name in LAST, First Middle format.
- The home address of the candidate (can be blank)
- A registered description of the political party (can be blank, see below)
- The nominees of the candidate, normally a list of one or more names
- “Reason why no longer nominated”, blank unless the nomination is invalid (this is important – a candidate can be listed on the SOPN but noted as not standing for various reasons. This is the most common cause of manual data input errors)
You can see an example here: https://candidates.democracyclub.org.uk/upload_document/7798/
The fields change a little depending on the election type (police commissioners vs devolved governments vs local elections vs general elections and so on) but those are the common basics. It’s possible we’ll need a parser per election type, but we really only want name and party to be extracted.
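As a rough sketch, the per-candidate fields listed above could be held in a structure like the one below. The class and attribute names are made up for illustration and are not taken from our codebase; the only logic is the rule that a non-blank “Reason why no longer nominated” means the candidate is listed but not standing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SopnCandidateRow:
    """One candidate row as printed on a SOPN (hypothetical field names)."""

    name: str                              # as printed, e.g. "LAST, First Middle"
    home_address: str = ""                 # can be blank
    description: str = ""                  # registered party description, can be blank
    nominees: List[str] = field(default_factory=list)  # normally one or more names
    reason_no_longer_nominated: str = ""   # blank unless the nomination is invalid

    @property
    def is_standing(self) -> bool:
        # A non-blank reason means the candidate appears on the SOPN
        # but is not actually standing.
        return not self.reason_no_longer_nominated.strip()
```

Keeping the “reason” text on the row, rather than dropping such rows early, would let a human checker see why a listed candidate was excluded.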
The document itself gets a little more complex. In some cases there is a single document per division. In others all the divisions are in one document.
Here’s an example of more than one division per document:
In that case the PDF could be split on the text “The following is a statement of the persons nominated for election as a District Councillor for”, but of course that isn’t always the case. One of the first useful bits of work would be to detect if a SOPN is for one or multiple divisions.
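A minimal sketch of that split, assuming the PDF text has already been extracted into a plain string by some other tool. The function names are hypothetical, and the marker is only the phrase quoted above, so a SOPN with different wording would be treated as a single division.

```python
import re
from typing import List

# Marker text quoted in the post; other SOPNs may word this differently.
DIVISION_MARKER = (
    "The following is a statement of the persons nominated "
    "for election as a District Councillor for"
)


def split_divisions(text: str, marker: str = DIVISION_MARKER) -> List[str]:
    """Split extracted SOPN text into one chunk per division."""
    # Allow any run of whitespace between the marker's words, because
    # text extracted from a PDF is often wrapped across lines.
    pattern = r"\s+".join(re.escape(word) for word in marker.split())
    starts = [m.start() for m in re.finditer(pattern, text)]
    if not starts:
        return [text]  # marker not found: treat as a single division
    # Each chunk runs from one marker to the next; any text before the
    # first marker is dropped.
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


def has_multiple_divisions(text: str, marker: str = DIVISION_MARKER) -> bool:
    """Return True if the marker appears more than once in the text."""
    return len(split_divisions(text, marker)) > 1
```

A document where the marker is missing comes back as one chunk, so it would still need a human to check whether it really covers a single division.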
At nomination time a candidate or party can pick one of the pre-registered “descriptions” of that party. We’ve included a list of these against the party IDs in the downloads section below. Each description is unique and doesn’t have to include the party name.
Parties can also be “joint”, meaning the candidate is standing for two parties under a joint description. We include these descriptions with pseudo-identifiers in our data.
Candidates don’t have to have a party, and we have a pseudo-identifier for independents too. Sometimes independents have “independent” in the field, other times it’s blank. There are some parties with “independent” in their name.
In data modelling terms, party can be thought of as a list that most often contains a single item, but can contain 0 or 2 items.
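To make the “list of 0, 1 or 2 parties” idea concrete, here is a small sketch. The lookup table, the descriptions in it and the `party:A` style identifiers are all invented placeholders; the real descriptions and IDs are the ones in the downloads. Matching is exact (after normalising case and whitespace) because a description doesn’t have to contain the party name, and some parties have “independent” in theirs.

```python
from typing import Dict, List

# Hypothetical lookup for illustration only: description text -> party IDs.
EXAMPLE_DESCRIPTIONS: Dict[str, List[str]] = {
    "example party": ["party:A"],
    "example party candidate for a better town": ["party:A"],
    "example joint description": ["party:A", "party:B"],
}


def parties_for_description(
    description: str, lookup: Dict[str, List[str]]
) -> List[str]:
    """Return 0, 1 or 2 party IDs for a description printed on a SOPN.

    Raises KeyError for an unrecognised description so that the row can
    be flagged for manual checking rather than silently guessed.
    """
    key = " ".join(description.lower().split())  # normalise case and spacing
    if key in ("", "independent"):
        return []  # independent candidate: no party
    return list(lookup[key])
```

Raising on unknown descriptions is a deliberate choice in this sketch: a wrong party is worse than a row that someone has to look at.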
Names are hard. The required format in legislation is LAST, first middle [middle2, …]. We store the data in First Middle Last format, but also store other names against a person (Edward Miliband vs Ed Miliband).
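A naive sketch of the conversion from the SOPN format to the stored format follows. The function name is hypothetical, and title-casing the surname is only a guess: it will get names such as “McDONALD” wrong, which is one reason the output would still need checking by a person.

```python
def sopn_name_to_display(name: str) -> str:
    """Convert "LAST, first middle" to "First Middle Last"."""
    last, sep, given = name.partition(",")
    if not sep:
        # No comma: not in the expected format, so leave it for manual review.
        return name.strip()
    last = " ".join(last.split())    # collapse stray whitespace
    given = " ".join(given.split())  # given names are kept as printed
    # Surnames are usually printed in capitals; title-casing is a naive
    # guess and breaks on names like "McDONALD".
    return f"{given} {last.title()}".strip()
```

For example, `sopn_name_to_display("MILIBAND, Edward")` returns `"Edward Miliband"`. Matching that against a known alternative such as “Ed Miliband” is a separate problem this sketch does not attempt.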
What we’re after is to turn each PDF into a CSV containing:
For election and division IDs please see the documentation for EveryElection.
It would also be nice to have the exact coordinates in the PDF where the candidate row is located, to make checking the document easier – a nice interface would be to show an image from the PDF against the data we have and ask people to confirm they match.
This output is far from a hard requirement – any ideas of how we might improve this process are welcome, and something is better than nothing.
For example, someone has suggested putting out all the rows from a SOPN and grouping them by page. This might help our data inputting process a bit more than opening up the whole PDF.
If you’re keen, start off by downloading the data:
We would want to be able to run code and we do everything openly – so ideally the project would be on GitHub – we can give you admin on a repo on our organisation account if you like.
Image credit: Mirko Tobias Schäfer