PDF parsing: as reliable as a train
At Democracy Club, we collect data on every candidate standing for election in the UK. The official source of candidate information is the “Statement of Persons Nominated” (SoPN) published by each local authority that is managing an election. That’s around 650 for a general election or around 4,000 this year for local elections. These are typically published in PDF format on each council’s website, which makes it difficult for both humans and machines to read information from them.
So if you wanted a list of everyone standing for election, you would have to go and look through each PDF and extract the data manually. Luckily, we do that for you, and make the data available for everyone to use at https://candidates.democracyclub.org.uk/.
We “parse” data (elections, wards, candidate names, parties, etc.) from each SoPN in a multi-step process we internally refer to as ‘SoPN parsing’. Currently, there is no standard format or file type required for this official document, which makes extracting data from it…complicated.
Back in late October, I started exploring an error with parsing an older SoPN for Strensall ward, in York, that was causing some pages to be missed. This bug seemed to affect one-page SoPNs, due to how we were defining the top page of the document in our code. After digging into the parsing code and making various improvements, we decided to get a jump on 2022 SoPN day and tackle as many SoPN-related bugs as possible.
First, Michael untangled and refactored key steps of the parsing process - page extraction, table extraction and parsing tables. Then we dug into each step and made various improvements along the way:
Converting to PDF
Each year, a handful of SoPNs reach us in a non-PDF format.
Because our tooling deals with the majority case, we don’t actually support SoPNs that aren’t a PDF. This is ironic, we know.
Rather than letting those uploads fail silently, we’ve added PDF validation to the document upload form, as well as the ability to convert HTML and DOCX files to PDF in the bulk SoPN import process.
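As a rough illustration of what upload validation can look like, here is a minimal sketch that checks whether a file starts like a PDF and decides whether it should be accepted, converted or rejected. The function names and the list of convertible extensions are assumptions made for this example, not our actual code.

```python
from pathlib import Path

# Hypothetical list of formats we would try to convert rather than reject.
CONVERTIBLE_EXTENSIONS = {".html", ".htm", ".docx"}


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes contain the PDF header marker near the start."""
    # PDF files begin with "%PDF-"; many readers tolerate a little junk
    # before it, so we search the first 1024 bytes rather than byte 0 only.
    return b"%PDF-" in data[:1024]


def classify_upload(filename: str, data: bytes) -> str:
    """Return "pdf", "convert" or "reject" for an uploaded document."""
    if looks_like_pdf(data):
        return "pdf"
    if Path(filename).suffix.lower() in CONVERTIBLE_EXTENSIONS:
        return "convert"
    return "reject"
```

Checking the file contents rather than trusting the extension catches the case where a Word document has simply been renamed to end in .pdf.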
Sometimes a SoPN contains information for a single ward, other times we get a single document for every ward in the council area. We need to match the pages of a PDF to the wards we know have elections. We start by detecting title pages, checking for blank pages, and defining page headers. In the case of multi-page SoPNs, we match pages where a candidate list for a ward spans more than one page. If this step is successful, at the end of the parsing process, only the page that matches the ballot will appear.
To give an example, the SoPN for Buckinghamshire in May 2021 contained nominations for 16 wards. We managed to match the pages that relate to each ward and show them to our users.
However, we were unable to match pages to wards for the differently formatted SoPN for Bolton.
Notice how all the wards/pages for Bolton show here rather than just the page that contains Smithills.
We refactored the way we match pages and saw a big increase in the number of SoPNs we can match after the changes.
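To make the page-matching idea more concrete, here is a simplified sketch. It assumes the text of each page has already been extracted, looks for a known ward name in the first few lines of each page, skips blank pages, and treats a page without a ward name as a continuation of the previous ward. The function name and the five-line header window are assumptions for illustration; the real matching has to cope with far messier documents.

```python
def match_pages_to_wards(pages: list[str], wards: list[str]) -> dict[str, list[int]]:
    """Map each ward name to the 1-based page numbers that belong to it."""
    matches: dict[str, list[int]] = {ward: [] for ward in wards}
    # Try longer names first so "Heaton and Lostock" wins over "Heaton".
    by_length = sorted(wards, key=len, reverse=True)
    current = None  # ward of the most recent page that named one

    for number, text in enumerate(pages, start=1):
        if not text.strip():
            continue  # blank page
        # Only look at the top of the page, where the header usually is.
        header = " ".join(text.splitlines()[:5]).lower()
        found = next((w for w in by_length if w.lower() in header), None)
        if found is not None:
            current = found
        # Pages before the first named ward (e.g. a title page) are ignored.
        if current is not None:
            matches[current].append(number)
    return matches
```

For example, a four-page document where page 2 is blank and page 3 continues the candidate list from page 1:

```python
pages = [
    "Statement of Persons Nominated\nSmithills ward\nCandidate A",
    "",
    "Candidate B\nCandidate C",
    "Statement of Persons Nominated\nAstley Bridge ward\nCandidate D",
]
print(match_pages_to_wards(pages, ["Smithills", "Astley Bridge"]))
```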
In this step, we use Camelot to determine the shape and size of a table in a PDF and extract it for further analysis. When a table spans several pages, Camelot treats each page’s portion as a separate table, so we join them back together. Camelot works in most cases, but very occasionally it cannot determine the table layout and the extraction fails.
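Joining a table that has been split across pages can be sketched as follows. To keep the example self-contained, it works on plain lists of rows rather than on Camelot’s own table objects, and it assumes a continuation page may repeat the header row. The function name is hypothetical.

```python
Row = list[str]


def join_tables(tables: list[list[Row]]) -> list[Row]:
    """Concatenate per-page tables into one, keeping a single header row."""
    joined: list[Row] = []
    for table in tables:
        if not table:
            continue  # nothing was extracted from this page
        rows = table
        # If this page repeats the header we already have, drop the repeat.
        if joined and rows[0] == joined[0]:
            rows = rows[1:]
        joined.extend(rows)
    return joined
```

This only handles the simple case where the repeated header is identical on every page; headers that differ in spacing or wording would need a looser comparison.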
Here, we define table headers, fields for names, parties, and other column titles that might appear in a SoPN and clean up the data inside each cell. We compare and match parties and party descriptions with the Electoral Commission register and our own party database. The Electoral Commission does not keep historical party names and descriptions, so we compare with our own data to improve the chance of a match. We also introduced Levenshtein distance to help when party matching fails due to a spacing discrepancy. For example:
ReformUK - London Deserves Better → Reform UK - London Deserves Better
The Conservative Partycandidate → The Conservative Party Candidate
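For readers unfamiliar with Levenshtein distance, it counts the single-character insertions, deletions and substitutions needed to turn one string into another, so a missing space costs just one edit. Below is a minimal, dependency-free sketch of using it for party matching. The `closest_party` helper and its `max_distance` threshold are assumptions for illustration, not a description of our production code.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between two strings."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def closest_party(raw: str, known: list[str], max_distance: int = 2) -> Optional[str]:
    """Return the known party name nearest to raw, or None if none is close."""
    if not known:
        return None
    # Compare case-insensitively so "Partycandidate" is one edit away.
    best = min(known, key=lambda name: levenshtein(raw.lower(), name.lower()))
    if levenshtein(raw.lower(), best.lower()) <= max_distance:
        return best
    return None
```

A small threshold keeps genuinely different descriptions from being merged while still tolerating a dropped space.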
Related to the party parsing improvements, we also changed a performance feature that limited the selection of parties until the user clicked to expand the selection and search again. Now, all parties are pre-selected in the bulk-adding form.
We also improved error handling and messaging throughout to better pinpoint where exactly bugs occur. With this change, we can now be more responsive to fixing any unknown SoPN bugs that may pop up - which will be especially useful when councils publish SoPNs for local elections beginning March 30th in Scotland, April 6th in England and Wales and April 8th in Northern Ireland. This is the period of time when the code is most used and therefore bugs are quickly discovered and documented.
SoPN Test Tools
To ensure we didn’t create any new bugs while solving old ones, Michael developed an MVP for new testing commands to create a baseline for measuring increases and decreases in the number of people and parties parsed. Each time we fix a bug or add a feature, we run these tests against the code to check for any decrease. We iterated on this MVP as we worked on other issues, adding new reports and print statements where helpful.
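The baseline idea can be sketched in a few lines: record how many people and parties were parsed before a change, run the parser again afterwards, and flag any count that went down. The function names and the shape of the counts dictionary here are hypothetical, chosen only to illustrate the approach.

```python
def compare_to_baseline(baseline: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Return the change in each baseline count (positive means more parsed)."""
    # A count missing from the current run is treated as zero parsed.
    return {key: current.get(key, 0) - value for key, value in baseline.items()}


def regressions(baseline: dict[str, int], current: dict[str, int]) -> list[str]:
    """Return the names of counts that decreased relative to the baseline."""
    changes = compare_to_baseline(baseline, current)
    return sorted(key for key, change in changes.items() if change < 0)
```

For example, a change that parses more people but loses some party matches would be flagged:

```python
baseline = {"people_parsed": 1200, "parties_parsed": 1100}
current = {"people_parsed": 1250, "parties_parsed": 1080}
print(regressions(baseline, current))
```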
For the eager readers, all of our SoPN related work over the past couple of months can be found here. For those of you who want to participate in upcoming SoPN days, we are now tracking issues here:
Will there always be some SoPNs we can’t parse? Yes. For example, image-based SoPNs can be saved but not parsed. We want to continue to improve parsing on SoPNs where there is more than one ward per page. We have some work to do to ensure names appear in order, especially when there is a middle initial.
We want to enable URL uploads and publish SoPN test tool reports. But first, we want to hear from you! What do you think our SoPN-related work should focus on next? Get in touch if you’re interested in being a part of this discussion.