CollegeMobile: What is Open Data?

CollegeMobile is proud to host a forum in Saskatoon entitled: An Open Discussion About Open Data on May 23. We’ve got a number of Saskatoon city managers coming, and are hoping that the development community will respond with insightful and constructive ideas.

I’m writing this from the perspective of a developer, and I’m operating from the standpoint of Open Data being a Good Thing. With that being said, there are questions that should be asked before a municipality embarks on the Open Data journey.

What is Open Data?

Here’s the Wikipedia definition of Open Data: Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.

I’m going to start at the front of that statement and work my way to the back. First of all, the use of the words “certain data” implies that we’re not talking about ALL data. That’s important, because not all data is meant to be shared with the public. We wouldn’t necessarily want to publish everyone’s water usage bills, or how many parking tickets a person has received. We make a distinction between private and public data. We only want the public data. The question of what constitutes public and private data is definitely out of the scope of this article.

The next part of the sentence states: “… should be freely available to everyone to use and republish as they wish …”. This bit is self-explanatory, but is also pretty central to the concept. The public data that we mentioned above should be made accessible to anyone who wants it.

The last section: “… without restrictions from copyright, patents or other mechanisms of control.” makes the ‘openness’ of the data more explicit. Note that it doesn’t say: “without any form of control”. It’s ok to have a licence attached to the data, especially if you want to protect yourself from liability. The licence should allow the user to use the data for any purpose, and not restrict the usage in any way.

So, what is Open Data? It’s making public information available to anyone who wants it, for any purpose they might want it for. Simple, right?

Which data should be Open Sourced?

The obvious answer that would come to most developers’ minds would be, “All of it, of course.” While this would be great, it’s probably not practical. For starters, municipalities need to spend money getting their datasets in order and converting them into useful formats. The datasets also need to be maintained, lest the data become stale.

The Open Data Handbook has some good ideas when it comes to figuring out which data should be opened up first. The main point is that community involvement is the best way to go. In addition, you should start small, and move quickly. Get a couple of datasets up as fast as you can, and build from there. Gather feedback from the community as to how the datasets are working, and what people would like to see next.

Which format should the data be in?

Once again, the Open Data Handbook has good advice. The data should be available in bulk, and it should be machine-readable.

Useful formats:

* XML, JSON

* KML, GeoJSON

* CSV, Tab-separated

* API

Most of these formats are similar, in that they follow open, well-established standards for containing data, and generally don’t include any information about how the data should be presented. APIs are a different animal, and are possibly the gold standard for developers. APIs are great for ever-changing data that needs to be accessed quickly, but they have their downsides too. They have to be maintained, and the bandwidth costs can be considerable. For instance, if an API becomes very popular, who pays for the data to be accessed? Google had this problem when they created their maps and translation APIs. Everything was fine at first, but eventually Google had to start charging for access, as it was very expensive to keep up with the bandwidth. If API maintenance is prohibitively expensive, it might be useful to start with static datasets that can be massaged and combined by the community.
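To give a feel for why these formats are developer-friendly, here is a minimal sketch of converting a CSV file into GeoJSON using only the Python standard library. The column names (stop_id, name, lat, lon) and the sample rows are invented for illustration; a real dataset would have its own layout.

```python
import csv
import io
import json

# Hypothetical sample data. The columns and coordinates are assumptions
# made up for this example, not taken from any real catalogue.
SAMPLE_CSV = """stop_id,name,lat,lon
1001,Example Terminal,52.1290,-106.6650
1002,Example Place,52.1310,-106.6340
"""


def csv_to_geojson(csv_text):
    """Convert CSV rows with lat/lon columns into a GeoJSON FeatureCollection."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON orders coordinates as [longitude, latitude].
                "coordinates": [float(row["lon"]), float(row["lat"])],
            },
            "properties": {"stop_id": row["stop_id"], "name": row["name"]},
        })
    return {"type": "FeatureCollection", "features": features}


collection = csv_to_geojson(SAMPLE_CSV)
print(json.dumps(collection, indent=2))
```

The whole conversion is a loop and a dictionary. Nothing comparable exists for pulling the same rows out of a PDF with a letterhead on top.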

Less useful formats:

* PDF

* Excel, Word

The main problem with these formats is that they are designed to be human-readable, and as such, are less easy for computers to parse and read. Documents in .pdf and .docx format will often have a great deal of formatting applied. Margins, logos, letterheads, headers and footers may be great on paper, but are detrimental in the context of data analysis.

The bottom line is: whatever data is in the catalogue, it needs to be consumable by a computer. Any format that includes formatting (think of a PDF with a letterhead on top) is going to be less useful. While PDF data is ok for someone just trying to learn one or two facts, it’s very inconvenient for developers trying to mash datasets together.
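As a rough illustration of what “mashing datasets together” can look like, the sketch below joins a JSON dataset with a CSV dataset on a shared key. Both datasets, their field names, and all of the values are hypothetical; the point is only that two machine-readable files can be combined in a handful of lines.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical datasets, invented for this example.
BOUNDARIES_JSON = """[
  {"neighbourhood": "Neighbourhood A", "area_km2": 2.0},
  {"neighbourhood": "Neighbourhood B", "area_km2": 4.0}
]"""

PERMITS_CSV = """permit_id,neighbourhood
P-1,Neighbourhood A
P-2,Neighbourhood B
P-3,Neighbourhood A
"""


def permits_per_km2(boundaries_json, permits_csv):
    """Join the two datasets on the neighbourhood name and compute a density."""
    areas = {
        record["neighbourhood"]: record["area_km2"]
        for record in json.loads(boundaries_json)
    }
    counts = Counter(
        row["neighbourhood"] for row in csv.DictReader(io.StringIO(permits_csv))
    )
    # A neighbourhood with no permits gets a count of zero from the Counter.
    return {name: counts[name] / area for name, area in areas.items()}


density = permits_per_km2(BOUNDARIES_JSON, PERMITS_CSV)
print(density)
```

The join only works because both files expose the neighbourhood name as a plain field. That is the kind of detail formatted documents tend to bury.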

How do we create a solid Open Data Catalogue?

1) Start small. Release a small number of datasets. Connect with the development community in order to choose which ones, because the developers are the ones who are going to connect your data with the public at large, by building databases, apps, and websites that take advantage of the data you’re releasing.

2) Use analytics to measure which datasets are popular, but don’t make the mistake of thinking that low-usage datasets are low-usefulness. You can’t be assured that every person who accesses the data will be getting it directly from you. A developer might download your map of neighbourhood boundary data, then write an API that allows others to access it. Your data catalogue might be the original source of the data, but it’s not necessarily going to be the only one, or even the most popular one. Accept the fact that once the data is out there, it’s going to be used in ways you might not expect.

3) To get a better picture of which data is useful, continue engagement with the community. Read their blogs, retweet their tweets, and host hackathons. Feel free to copy liberally from other Open Data Catalogues. If other cities have got a good mix of datasets, consider that these might be good candidates for the next phase of your data release.
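The scenario in step 2, where a developer downloads a dataset and re-serves it through an API of their own, might look something like the sketch below. It uses only Python’s standard library. The dataset contents, the name query parameter, and the port are all assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical records, standing in for a file downloaded from a catalogue.
DATASET = [
    {"name": "Neighbourhood A", "ward": 1},
    {"name": "Neighbourhood B", "ward": 2},
]


def find_neighbourhood(dataset, name):
    """Return the record whose name matches (case-insensitive), or None."""
    for record in dataset:
        if record["name"].lower() == name.lower():
            return record
    return None


class NeighbourhoodHandler(BaseHTTPRequestHandler):
    """Answers GET /?name=... with the matching record as JSON."""

    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        name = query.get("name", [""])[0]
        record = find_neighbourhood(DATASET, name)
        status = 200 if record else 404
        payload = record if record else {"error": "not found"}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), NeighbourhoodHandler).serve_forever()
```

Calling serve() starts the server. Every request it answers is one that never shows up in the original catalogue’s analytics, which is why download counts alone can understate how useful a dataset is.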

Be Flexible, Be Bold.

Governments and developers are heading into uncharted territory with Open Data. Nobody knows exactly what the possibilities are, nor where the pitfalls lie. So communication is going to be essential.

To learn more, feel free to attend the Open Discussion About Open Data on May 23.

If you can’t attend, be sure to follow the conversation on Twitter, using the hashtag: #skdev

Francis Chary