Aug 01 2009
 

These days, a bonanza of political information is freely available on the internet. Sometimes this information comes in the guise of Excel sheets, comma-separated data or other formats that are more or less readily machine-readable. But more often than not, information is presented as tables designed to be read by humans. This is where the gentle art of screen scraping, web scraping or spidering comes in. In the past, I have used kludgy Perl scripts to get district-level electoral results off sites maintained by the French ministry of the interior or by universities (very interesting if you do not really speak/read French). A slightly more elegant approach might be to use R’s built-in Perl-like capabilities for doing the job, as demonstrated by Simon Jackman. Finally, Python, which has some very decent libraries for screen/web scraping, is gaining ground in the political science community; see this elaborate post on Drew Conway’s Zero Intelligence Agents blog. But, let’s face it: I am lazy. I want to spend time analysing the data, not scraping them. And so I was very pleased when I came across outwit, a massive plugin for the Firefox browser (Linux, Mac and Windows versions available) that acts as a point-and-click scraper.

French Départements (from Wikipedia)

Say you need a dataset with the names and Insee numbers for all the French Départements. The (hopefully trustworthy) Wikipedia page has a neat table, complete with information on the Prefecture and many tiny coats of arms which are of absolutely no use at all. We could either key in the relevant data (doable, but a nuisance), or we could try to copy and paste the table into a word processor, hoping that we do not lose accents and other funny characters, and that WinWord or whatever we use converts the HTML table into something that we can edit to extract the information we really need.

Or we could use outwit. One push of the button loads the page

Scraping a table with outwit

into a sub-window, a second push (data->tables) extracts the HTML tables on the page. Now, we can either mark the lines we are interested in by hand (often the quickest option) or use a filter to select them. One final click, and they are exported as a CSV file that can be read into R, OpenOffice, or Stata for post-processing and analysis.
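For comparison, here is roughly what the same three steps (extract the tables, keep the columns we care about, export as CSV) look like when scripted. This is only a minimal sketch in Python using the standard library, not a description of how outwit works internally. The class `TableExtractor`, the function `table_to_csv` and the inline `SAMPLE_HTML` are all made up for illustration, and the column layout is an assumption rather than a copy of the real Wikipedia table. Nested tables are not handled.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table found
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_csv(html, columns):
    """Return the chosen columns of the first table in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in parser.tables[0]:
        writer.writerow([row[i] for i in columns])
    return out.getvalue()


# Hypothetical stand-in for the page; the column order is an assumption.
SAMPLE_HTML = """
<table>
  <tr><th>Insee</th><th>Arms</th><th>Department</th><th>Prefecture</th></tr>
  <tr><td>01</td><td><img src="ain.png"></td><td>Ain</td><td>Bourg-en-Bresse</td></tr>
  <tr><td>22</td><td><img src="cda.png"></td><td>C&ocirc;tes-d'Armor</td><td>Saint-Brieuc</td></tr>
</table>
"""

# Keep only the Insee number and the name; drop the arms and the prefecture.
csv_text = table_to_csv(SAMPLE_HTML, columns=(0, 2))
print(csv_text)
```

Even this toy version needs some forty lines and ignores most of the mess found in real pages, which is exactly why the point-and-click route is so tempting for a one-off job.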

While I’m all in favour of scriptable and open-source tools like Perl, Python and R, outwit has a lot going for it if all you need is a quick hack. Outwit also has functions to mass-download files (say PDFs) from a page and give them unique names. If the job is complex, there is even more functionality under the hood, and you can use the point-and-click interface to program your own scraper, though I would tend to use a real programming language for these cases. At any rate, outwit is a useful and free tool for the lazy data analyst.
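For those who do prefer a real programming language, the mass-download idea (find all the PDF links on a page and give each file a unique name) can be sketched in Python as follows. Again, this is an illustration under stated assumptions, not outwit's own method: `LinkCollector`, `plan_downloads`, the sample page and the example.org address are all hypothetical, and the naming scheme (appending `_1`, `_2`, ... to repeated file names) is just one simple choice. The sketch only plans the downloads and does not touch the network.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def plan_downloads(html, base_url, suffix=".pdf"):
    """Return (absolute_url, unique_filename) pairs for links ending in `suffix`."""
    collector = LinkCollector()
    collector.feed(html)
    seen = {}  # file name -> number of times it has occurred so far
    plan = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)
        name = posixpath.basename(urlsplit(url).path)
        if not name.lower().endswith(suffix):
            continue
        count = seen.get(name, 0)
        seen[name] = count + 1
        if count:
            # Repeated name: results.pdf becomes results_1.pdf, results_2.pdf, ...
            stem, ext = posixpath.splitext(name)
            name = f"{stem}_{count}{ext}"
        plan.append((url, name))
    return plan


# Hypothetical page on which two different files share the same name.
SAMPLE_PAGE = """
<a href="/2007/results.pdf">2007</a>
<a href="/2002/results.pdf">2002</a>
<a href="about.html">About</a>
"""

plan = plan_downloads(SAMPLE_PAGE, "http://example.org/elections/")
for url, name in plan:
    print(url, "->", name)
    # To actually fetch the file: urllib.request.urlretrieve(url, name)
```

A more careful version would also check that a generated name such as `results_1.pdf` does not clash with a file that really is called that.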

  One Response to “Web-scraping made easy: outwit”

  1. This post will undoubtedly be the highlight of my week. Thank you.
