Open Data Hackathon Day: ScraperWiki views

Open data is really only as interesting as what we can do with it!

One sweet thing about ScraperWiki is that it enables quick creation of visualizations called ‘views’ from inside the wiki. They’ve got templates that use Google Visualization to help the process along.
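To give a feel for how one of these views hangs together, here's a rough sketch in Python: it pulls yearly registration counts out of the scraper's SQLite table and prints an HTML page that draws a Google Visualization column chart. The table and column names (swdata, registry_date) are guesses at the schema, and it assumes the classic scraperwiki library's sqlite.select helper, so treat it as an illustration rather than the actual view.

```python
import json
import scraperwiki

# Yearly counts of new registrations. 'swdata' and 'registry_date' are guesses
# at the scraper's table and column names; the classic select() helper
# prepends the SELECT keyword itself.
rows = scraperwiki.sqlite.select(
    "strftime('%Y', registry_date) AS year, count(*) AS n "
    "FROM swdata GROUP BY year ORDER BY year"
)

table = [["Year", "New registrations"]] + [[r["year"], r["n"]] for r in rows]

# Print an HTML page that hands the data to Google Visualization's
# ColumnChart and draws it into a placeholder div.
print("""<html><head>
<script src="https://www.google.com/jsapi"></script>
<script>
  google.load('visualization', '1', {packages: ['corechart']});
  google.setOnLoadCallback(function () {
    var data = google.visualization.arrayToDataTable(%s);
    var chart = new google.visualization.ColumnChart(document.getElementById('chart'));
    chart.draw(data, {title: 'New Oregon business registrations by year'});
  });
</script>
</head><body><div id="chart" style="width: 700px; height: 400px;"></div></body></html>""" % json.dumps(table))
```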

I made the following today (from this datasource):

I don’t have the entire data set, but this graph indicates that the recession had a significant negative impact on the creation rate of new businesses in Oregon.

I just started a new scraper job to pull more information about people and the places where the businesses are located. When that job is done, I hope to create a few more fun visualizations with this data.

UPDATE: I’m playing around more, and here’s the embedded version of the graph if you click through (takes a while to load!).

Open Data Hackathon Day: Oregon Business License Registry

At the Portland Software Summit on Thursday, a couple of people mentioned that it was hard to keep track of new businesses that pop up, and that merger and acquisition activity wasn’t being sufficiently publicized.

I thought – maybe we could get this information in an automated way!

I started with the state of Oregon’s business registry search site. Unfortunately, it caps business search results at 1,000 and doesn’t paginate them. So we kicked ScraperWiki into gear, and @maxogden and I wrote a very simple scraper: http://scraperwiki.com/scrapers/oregon_business_registry/
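The scraper itself doesn't need to be much more than fetch, parse, save. Here's a bare-bones sketch of that shape; the URL and the result-table layout are placeholders, not the registry's real form or markup:

```python
import scraperwiki
import lxml.html

# Placeholder URL: the real registry search takes form parameters and caps
# results at 1,000, so in practice you'd loop over narrower queries.
SEARCH_URL = "http://example.oregon.gov/business_registry/search?name=A"

html = scraperwiki.scrape(SEARCH_URL)
root = lxml.html.fromstring(html)

for row in root.xpath("//table//tr")[1:]:  # skip the header row
    cells = [cell.text_content().strip() for cell in row.xpath("./td")]
    if len(cells) < 3:
        continue
    record = {
        "registry_number": cells[0],
        "business_name": cells[1],
        "registry_date": cells[2],
    }
    # Using the registry number as the unique key means re-runs update
    # existing rows instead of duplicating them.
    scraperwiki.sqlite.save(unique_keys=["registry_number"], data=record)
```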

Next, I wanted to find information specifically about businesses in Portland. The City releases this information, but in PDF form: http://www.portlandonline.com/omf/index.cfm?c=32192

I wrote a quick and dirty Python script to extract the information, and it’s currently catching roughly 250 of the 300+ businesses in the November release. Next, I want to cross-reference this data with what’s in the Oregon registry. I’ll be publishing the Python scripts over the weekend. Hopefully ScraperWiki will add pyPDF to the libraries they support, so I can publish the transform there and link it easily to the Oregon data.
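For the curious, the PDF pass looks roughly like this. It's a sketch rather than the actual script: the filename and the regex are stand-ins for whatever the November release actually contains.

```python
import re
from pyPdf import PdfFileReader

# Stand-in filename; grab the November release from portlandonline.com first.
reader = PdfFileReader(open("portland_business_licenses_nov.pdf", "rb"))

businesses = []
for i in range(reader.getNumPages()):
    text = reader.getPage(i).extractText()
    # extractText() flattens the PDF's layout into one string, so a loose
    # pattern like this will miss some rows.
    for match in re.finditer(r"([A-Z][A-Z0-9&'. -]{3,})\s+PORTLAND,?\s+OR", text):
        businesses.append(match.group(1).strip())

print("%d businesses found" % len(businesses))
```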

Two lessons today:

  • Governments: Please don’t publish data in PDFs. YUCK.
  • Governments: Please paginate results from your site! Hard limits are just kinda lame.

The alternative to scraping the state of Oregon’s site is to order a CD-ROM for $50. I think this is such a stupid profit center for the state. I’d be interested to know how much money they’re really making off of it, and whether they could take a page out of Metro’s book and find a way to share the data through a different, more useful service.