Aug 012009

These days, a bonanza of political information is freely available on the internet.  Sometimes this information comes in the guise of excel sheets, comma separated data or other formats which are more or less readily machine readable. But more often than not, information is presented as tables designed to be read by humans. This is where the gentle art of screen scraping, web scraping or spidering comes in. In the past, I have used kludgy Perl scripts to get electoral results at the district level off sites maintained by the French ministry of the interior or by universities (very interesting if you do not really speak/read French). A slightly more elegant approach might be to use R’s builtin Perl-like capabilities for doing the job, as demonstrated by Simon Jackman. Finally, Python is gaining ground in the political science community,  which has some very decent libraries for screen/web scraping – see this elaborate post on Drew Conway’s Zero Intelligence Agents blog. But, let’s face it: I am lazy. I want to spend time analysing the data, not scraping them. And so I was very pleased when I came across outwit, a massive plugin for the firefox browser (Linux, Mac and Windows versions available) that acts as a point-and-click scraper.

French Départements (from Wikipedia)

French Départements (from Wikipedia)

Say you need a dataset with the names and Insee numbers for all the French Départements. The (hopefully trustworthy) Wikipedia page has a neat table, complete with information on the Prefecture and many tiny coats of arms which are of absolutely no use at all. We could either key in the relevant data (doable, but a nuisance), or we could try to copy and paste the table into a word processor, hoping that we do not lose accents and other funny characters, and that WinWord or whatever we use converts the HTML table into something that we can edit to extract the information we really need.

Or you we could use outwit. One push of the button loads the page

Scraping a table with outwit

Scraping a table with outwit

into a sub-window, a second push (data->tables) extracts the HTML tables on the page. Now, we can either mark the lines we are interested in by hand (often the quickest option) or use a filter to selfect them. One final click, and they are exported as a CSV file that can be read into R, OpenOffice, or Stata for post processing and analysis.

While I’m all in favour of scriptable and open-source tools like Perl, Python and R, outwit has a lot to go for it if all you need is a quick hack. Outwit also has functions to mass-download files (say PDFs) from a page and give the unique names. If the job is complex, there is even more functionality under the hood, and you can use the point-and-click interface to program you own scraper, though I would tend use a real programming language for these cases. At any rate, outwit is a useful and free tool for the lazy data analyst.

Jul 082008

Our project on social (citation and collaboration) networks in British and German political science involves networks with hundreds and thousands of nodes (scientists and articles). At the moment, our data come from the Social Science Citation Index (part of the ISI web of knowledge), and we use a bundle of rather eclectic (erratic?) scripts written in Perl to convert the ISI records into something that programs like Pajek or Stata can read. Some canned solutions (Wos2pajek, network workbench, bibexcel) are available for free, but I was not aware of them when I started this project, did not manage to install them properly, or was not happy with the results. Perl is the Swiss Army Chainsaw (TM) for data pre-processing, incredibly powerful (my scripts are typically less than 50 lines, and I am not an efficient programmer), and every time I want to do something in a slightly different way (i.e. I spot a bug), all I have to do is to change a few lines in the scripts.
After trying a lot of other programs available on the internet, we have chosen Pajek for doing the analyses and producing those intriguing graphs of cliques and inner circles in Political Science. Pajek is closed source but free for non-commercial use and runs on Windows or (via wine) Linux. It is very fast, can (unlike many other programs) easily handle very large networks, produces decent graphs and does many standard analyses. Its user interface may be slightly less than straightforward but I got used to it rather quickly, and it even has basic scripting capacities.

The only thing that is missing is a proper manual, but even this is not really a problem since Pajek’s creators have written a very accessible introduction to social network analysis that doubles up as documentation for the program (order from,, However, Pajek has been under constant development since the 1990s (!) and has acquired a lot of new features since the book was published. Some of them are documented in an appendix, others are simply listed in the very short document that is the official manual for Pajek. You will want to go through the many presentations which are available via the Pajek wiki.

Of course, there is much more software available, often at no cost. If you do program Java or Python (I don’t), there are several libraries available that look very promising. Amongst the stand-alone programs, visone stands out because it can easily produce very attractive-looking graphs of small networks. Even more software has been developed in the context of other sciences that have an interest in networks (chemistry, biology, engineering etc.)
Here is a rather messy collection of links to sna software. Generally, you will want something that is more systematic and informative. Ines Mergel has recently launched a bid for creating a comprehensive software list on wikipedia. The resulting page on social network analysis software is obviously work in progress but provides very valuable guidance.

Technorati-Tags: , , , , , , , , , ,

Mar 192008

If you are interested in subnational politics, France is an interesting case for many reasons. On the one hand, the country is highly centralised and divided into 96 (European) Departements (administrative units) with equal legal rights (though Corsica is a bit of an exception to this). In fact, Departements were created after the revolution in an attempt to replace the provinces of the Ancien Regime with something rational and neat. On the other hand, the Departements are vastly different in terms of their size, population, economic, political and social structure, which gives you a lot of variance that can be modelled. Electoral data is often made available at the level of the Departement (see e.g. the useful book by Caramani for historical results and the CDSP and government websites for recent elections) or can be aggregated to that level since electoral districts are nested in Departements. The French National Insitute for Statistics and Economic Studies (INSEE) has a wealth of data from the 1999 census and other sources, and even more is available from Eurostat. One thing that is incredibly annoying, however, is that many sources like Caramani, INSEE and the Wikipedia use the traditional French system. This system (which is part of the ISO standard ISO 3166-1) assigns numbers from 1 to 95 that once reflected the alphabetical order of the Departments’ names, though this initial order was a bit scrambled by territorial changes. The most obvious result of these are the odd 2A/2B codes for Corsica (after 1975, see this article on the French Official Geographic Code for the details). Rather unsurprisingly, Eurostat (and a few others) prefer the European NUTS-3 codes, which have a hierarchical structure that consists of a country (FR), region, and subregion (=Departement) code. If you want to merge Departmental data from various sources you obviously have to map one system to the other, which is cumbersome and prone to error. That’s why I wrote a little script in Perl that reads a table of Departmental Codes and creates a do-File for Stata, which does the actual mapping. From within Stata, you can simply type net from to get the whole package. It should be fairly easy to adopt this to your own needs – enjoy!

Technorati Tags: , , , , ,

Social Bookmarks: