A Request to CSV users

By Sym Roe

on 1st February 2019

There’s no such thing as a good photo of a CSV file, so have a photo of Sym’s dog Zita in the snow instead.

If you use our candidate CSV downloads, or CSV downloads from anyone else, we want your help and feedback!

Our candidates site stores loads of data on candidates for elections in the UK. We want as many people as possible to benefit from this data, either in our own apps and widgets, in 3rd party sites or by downloading it.

We have an API for developers to use, but by far the quickest and easiest way of getting the data is to download a CSV file and open it as a spreadsheet.

A lot of people know how to use spreadsheets, and it’s an easy entry point in to creating things like graphs or aggregate stats without knowing how to program.

Because of this we want to make the process of getting hold of the CSV files and using them as good as it can be.

At the moment we present a long list of a file per election, or a single file for all candidates on the site. We need to improve this for sure (feedback or design ideas very welcome!).

But here’s an area we really want feedback on. It’s a bit tricky.

At the moment we have a set of columns that are defined once and are stable across all the files we produce.

There are currently 38 columns and about 60,000 rows. A lot of columns only contain data for a tiny percent of the rows. For example we have a “TheyWorkForYou URL”, but only 892 rows have a value, because that’s all the elected MPs we have profiles for (over 3 general elections).

We’re moving to a model where we can add any number of links to a person. The links all have a “value type” assigned (email, twitter_username, etc), but we want to be able to add more type, and even allow users to make up the types themselves.

This leaves us in a tricky position for CSV outputs:

The number of columns will change as soon as a single new value type is added. If someone adds an obscure link to a single candidate then a new column is added to each of the 60k rows in the CSV file.
The headings might change. If we merge a value type in to one, or split values in to different types, the heading values would change. This shouldn’t matter to anyone who’s just looking at the data, but it does make the whole output less stable. Scripts and spreadsheets would have to use the column names, not numbers in order to use the file.
The file could get very wide. Related to 1., it’s possible we end up with loads of different values over time. For example, we might have Companies House IDs for the 2017 parliamentary candidates or WikiData IDs for all elected MPs. The records with values in these columns will make up a tiny subset of the data. We could easily have 20 or 30 different value types over time, each adding a column to the CSV.
We might need an option for having more than one identifier for a value type, for example if we have more than one “news article” link for a person. It’s not obvious how this would work in the CSV format.

All of this makes the CSV less predictable, larger and generally harder to use.

So, we’re looking for ideas and feedback. One solution could be to create a sort of “builder” where users can select things like the election(s) and fields they are interested in and be given a URL to download. That way, if we add a new column the custom export won’t change. If we removed a column it would, but hopefully that’s rare.

This sounds nice, but without feedback from users, we might end up building a super complex thing only to find that the CSV downloads are only ever used for finding candidates favourite biscuits. Useful as this is, if we knew that in advance we could just make a CSV file limited to rows with information on biscuits.

There are some CSV specification like the Frictionless Data CSV Dialect that we should start using so we’re compatible with 3rd party tooling, but that doesn’t address the need to have a easy to use basic CSV download.

So the questions are:

What do you want from our CSV downloads? Even if you’ve not used them, telling us what you’d expect from them or how you would use them would be super helpful
Have you dealt with CSV files that change columns about often? Are there good ways of dealing with this, or communicating to users what the changes are?
Are there other formats we should be exploring? Maybe we should automatically update a Google Sheet with the candidate data? Or publish to some other website that makes exploring dat easier. All ideas welcome!

As ever, comment here, on GitHub, email us or join the Democracy Club Slack for feedback

Get in touch: