Apr 262012
 

For our piece on distance effects in English elections we geocoded the addresses of hundreds of candidates. For the un-initiated: Geocoding is the fine art of converting addresses into geographical coordinates (longitude and latitude). Thanks to Google and some other providers like OpenStreeMap, this is now a relatively painless process. But when one needs more than a few addresses geocoded, one does not rely on pointing-and-clicking. One needs an API, i.e. a software library that makes the service accessible through R, Python or some other programming language.

The upside is that I learned a bit about the wonders of Python in general and the charms of geopy in particular. The downside is that writing a simple script that takes a number of strings from a Stata file, converts them into coordinates and gets them back into Stata took longer than I ever thought possible. Just now, I’ve learned about a possible shortcut (via the excellent data monkey blog): geocode is a user-written Stata command that takes a variable containing address strings and returns two new variables containing the latitude/longitude information. Now that would have been a bit of a time-saver. You can install geocode by typing

net from http://www.stata-journal.com/software/sj11-1
net install dm0053

There is, however, one potential drawback: Google limits the number of free queries per day (and possibly per minute). Via Python, you can easily stagger your requests, and you can also use an API key that is supposed to give you a bigger quota. Geocoding a large number of addresses from Stata in one go, on the other hand, will probably result in an equally large number of parsing errors.

Apr 212012
 

I’m more and more intrigued by the potential spatial data hold for political science. Once you begin to think about it, concepts like proximity and clustering are basic building blocks  for explaining social phenomena. Even better, since the idea of open data has gone mainstream, more and more spatially referenced information becomes available, and when it comes to free, open source software, we are spoilt for choice or, at least in my case, up and beyond the point of utter confusion.

For our paper on the effect of spatial distance between candidates and their prospective voters,  we needed  a choropleth map of English Westminster constituencies that shows how many of the mainstream candidates live within the constituency’s boundaries. Basically, we had three options (not counting the rather few user-contributed packages for Stata): GRASS, a motley collection of Python packages, and a host of libraries for R.

GRASS is a full-blown open source GIS, whose user interface is perfect for keyboard aficionados and brings back happy memories of the 1980s. While GRASS can do amazing things with raster and vector maps, it is suboptimal for dealing with rectangular data. In the end, we used only its underrated cartographic ps.map module, which reliably creates high-resolution postscript maps.

Python has huge potential for social scientists, both in its own right and as a kind of glue that binds various programs together. In principle, a lot of GIS-related tasks could be done with Python alone. We used the very useful geopy toolboxfor converting UK postcodes to LatLong co-ordinates, with a few lines of code and a little help from Google.

candidates col 212x300 Putting candidates in their place with R

Candidate locations by constituency

The real treasure trove, however, is R. The quality of packages for spatial analysis is amazing, and their scope is a little overwhelming. Applied Spatial Data Analysis with R by Roger Bivand, who wrote much of the relevant code, provides much-needed guidance.

Counting the number of mainstream candidates living in a constituency is a point-in-polygon problem: each candidate is a co-ordinate enclosed by a constituency boundary. Function overlay from package sp carries out the relevant operation. Once I had it located, I was seriously tempted to loop over constituencies and candidates. Just in time, I remembered the R mantra of vectorisation. Provided that points (candidates) and polygons (constituencies) have been transformed to the same projection, all that is needed is this:

mymap@data$homeconst1 <-overlay(candpos1,mymap)
mymap@data$homeconst2 <-overlay(candpos2,mymap)
mymap@data$homeconst3 <-overlay(candpos3,mymap)

This works because candpos1 is a vector of points that represent the spatial positions of all Labour candidates. These are tested against all constituency boundaries. The result is another vector of indices, i.e. sequence numbers of the constituencies the candidates are living in. Put differently, overlay takes a list of points and a bunch of polygons and returns a list that maps the former to the latter. With a bit of boolean logic, a vector of zeros (candidate outside constituency) and ones (candidate living in their constituency) ensues. Summing up the respective vectors for Labour, Tories, and LibDems then gives the required count that can be mapped. Result!

 Putting candidates in their place with R
Apr 092011
 

Sometimes, a man’s gotta do what a man’s gotta do. Which, in my case, might be a little simulation of a random process involving an unordered categorical variable. In R, sampling from a multinomial distribution is trivial.

rmultinom(1,1000,c(.1,.7,.2,.1))

gives me a vector of random numbers from a multinomial distribution with outcomes 1, 2, 3, and 4, where the probability of observing a ’1′ is 10 percent, the probability of observing a ’2′ is 70 per cent, and so on. But I could not find an equivalent function in Stata. Generating artificial data in R is not very elegant, so I kept digging and found a solution in section M-5 of the Mata handbook. Hidden in the entry on runiform is a reference to rdiscrete(r,c,p), a Mata function which generates a r*c matrix of draws from a multinomial distribution defined by a vector p of probabilities.

That leaves but one question: Is wrapping a handful of lines around a Mata call to replace a non-existent Stata function more elegant than calling an external program?

Jan 102010
 

Statistics and Data links roundup for November 23rd through December 29th:

  • The Data and Story Library – DASL (pronounced “dazzle”) is an online library of datafiles and stories that illustrate the use of basic statistics methods. We hope to provide data from a wide variety of topics so that statistics teachers can find real-world examples that will be interesting to their students. Use DASL’s powerful search engine to locate the story or datafile of interest.
  • Drawing graphs using tikz/pgf & gnuplot | politicaldata.org -
Nov 232009
 

Statistics and Data links roundup for November 14th through November 23rd:

It’s surprisingly difficult to find suitable datasets for a sna workshop that are relevant for political scientists.

Nov 142009
 
 Statistics and Data links roundup
Aug 012009
 

These days, a bonanza of political information is freely available on the internet.  Sometimes this information comes in the guise of excel sheets, comma separated data or other formats which are more or less readily machine readable. But more often than not, information is presented as tables designed to be read by humans. This is where the gentle art of screen scraping, web scraping or spidering comes in. In the past, I have used kludgy Perl scripts to get electoral results at the district level off sites maintained by the French ministry of the interior or by universities (very interesting if you do not really speak/read French). A slightly more elegant approach might be to use R’s builtin Perl-like capabilities for doing the job, as demonstrated by Simon Jackman. Finally, Python is gaining ground in the political science community,  which has some very decent libraries for screen/web scraping – see this elaborate post on Drew Conway’s Zero Intelligence Agents blog. But, let’s face it: I am lazy. I want to spend time analysing the data, not scraping them. And so I was very pleased when I came across outwit, a massive plugin for the firefox browser (Linux, Mac and Windows versions available) that acts as a point-and-click scraper.

outwit 1 300x178 Web scraping made easy: outwit

French Départements (from Wikipedia)

Say you need a dataset with the names and Insee numbers for all the French Départements. The (hopefully trustworthy) Wikipedia page has a neat table, complete with information on the Prefecture and many tiny coats of arms which are of absolutely no use at all. We could either key in the relevant data (doable, but a nuisance), or we could try to copy and paste the table into a word processor, hoping that we do not lose accents and other funny characters, and that WinWord or whatever we use converts the HTML table into something that we can edit to extract the information we really need.

Or you we could use outwit. One push of the button loads the page

outwit 22 300x178 Web scraping made easy: outwit

Scraping a table with outwit

into a sub-window, a second push (data->tables) extracts the HTML tables on the page. Now, we can either mark the lines we are interested in by hand (often the quickest option) or use a filter to selfect them. One final click, and they are exported as a CSV file that can be read into R, OpenOffice, or Stata for post processing and analysis.

While I’m all in favour of scriptable and open-source tools like Perl, Python and R, outwit has a lot to go for it if all you need is a quick hack. Outwit also has functions to mass-download files (say PDFs) from a page and give the unique names. If the job is complex, there is even more functionality under the hood, and you can use the point-and-click interface to program you own scraper, though I would tend use a real programming language for these cases. At any rate, outwit is a useful and free tool for the lazy data analyst.

 Web scraping made easy: outwit
Jul 082008
 

Our project on social (citation and collaboration) networks in British and German political science involves networks with hundreds and thousands of nodes (scientists and articles). At the moment, our data come from the Social Science Citation Index (part of the ISI web of knowledge), and we use a bundle of rather eclectic (erratic?) scripts written in Perl to convert the ISI records into something that programs like Pajek or Stata can read. Some canned solutions (Wos2pajek, network workbench, bibexcel) are available for free, but I was not aware of them when I started this project, did not manage to install them properly, or was not happy with the results. Perl is the Swiss Army Chainsaw (TM) for data pre-processing, incredibly powerful (my scripts are typically less than 50 lines, and I am not an efficient programmer), and every time I want to do something in a slightly different way (i.e. I spot a bug), all I have to do is to change a few lines in the scripts.
After trying a lot of other programs available on the internet, we have chosen Pajek for doing the analyses and producing those intriguing graphs of cliques and inner circles in Political Science. Pajek is closed source but free for non-commercial use and runs on Windows or (via wine) Linux. It is very fast, can (unlike many other programs) easily handle very large networks, produces decent graphs and does many standard analyses. Its user interface may be slightly less than straightforward but I got used to it rather quickly, and it even has basic scripting capacities.

The only thing that is missing is a proper manual, but even this is not really a problem since Pajek’s creators have written a very accessible introduction to social network analysis that doubles up as documentation for the program (order from amazon.co.uk, amazon.com, amazon.de. However, Pajek has been under constant development since the 1990s (!) and has acquired a lot of new features since the book was published. Some of them are documented in an appendix, others are simply listed in the very short document that is the official manual for Pajek. You will want to go through the many presentations which are available via the Pajek wiki.

Of course, there is much more software available, often at no cost. If you do program Java or Python (I don’t), there are several libraries available that look very promising. Amongst the stand-alone programs, visone stands out because it can easily produce very attractive-looking graphs of small networks. Even more software has been developed in the context of other sciences that have an interest in networks (chemistry, biology, engineering etc.)
Here is a rather messy collection of links to sna software. Generally, you will want something that is more systematic and informative. Ines Mergel has recently launched a bid for creating a comprehensive software list on wikipedia. The resulting page on social network analysis software is obviously work in progress but provides very valuable guidance.

Technorati-Tags: , , , , , , , , , ,

 Software for Social Network Analysis: Pajek and Friends