I’m still collecting references for the next iteration of the Extreme Right Bibliography (but I am almost there. Honest to God. Really). Meanwhile, when I should probably have been doing other things, I’ve brushed up my fairly rudimentary R skills and taught myself how to write a similarly rudimentary twitterbot.
If you are reading this, the chances that you are interested in the Radical/Extreme/Etc Right are high. If you also happen to be on twitter, you will want to follow the Radical Right Research Robot for all sorts of serendipitous insights, e.g. that reference to the article you always suspected existed but were too shy to ask about.
And if that does not appeal, it has a cutesy profile pic. So follow it (him? her?). Resistance is futile.
Currently, most of the items in The Eclectic, Erratic Bibliography on the Extreme Right in Western Europe (TM) are journal articles. The books/chapters/articles ratios have shifted somewhat over the years, reflecting both general trends in publishing and my changing reading habits, and by now the dominance of journal articles is rather striking.
The most important journals for Radical Right research (add pinch of salt as required)
One in three of these articles has been published in one of the four apparent top journals for Radical Right research: the European Journal of Political Research, West European Politics, Party Politics, and Acta Politica. I say ’apparent’ here, because this result may be a function of my (Western) Eurocentrism and my primary interest in Political Science and Sociology. Other Social Sciences are underrepresented, and literature from national journals that publish in languages other than English is virtually absent.
But hey: Laying all scruples aside, here is a table of the ten most important journals for Radical Right research:
No. of articles
European Journal of Political Research
West European Politics
Patterns of Prejudice
Comparative European Politics
Comparative Political Studies
Government and Opposition
Neat, isn’t it?
I did a similar analysis nearly two years ago. Government and Opposition as well as Comparative European Politics are new additions to the top ten (replacing Österreichische Zeitschrift für Politikwissenschaft and Osteuropa), but otherwise, the picture is much the same. So if you publish on the Radical Right and want your research to be noticed, you should probably aim for these journals.
Because BibTeX is so venerable, it’s unsurprising that there is even an R package (RefManageR) that can read and write BibTeX files, effectively turning bibliographic data into a dataset that can be analysed, graphed and otherwise mangled to one’s heart’s desire. And so my totally unscientific analysis of the Radical Right literature (as reflected in my personal preferences and interests) is just three lines of code away:
library(RefManageR)
ex <- ReadBib("/home/kai/Work/bibliography/xr-bibliography/extreme-right-western-europe-bibliography.bib")
So 2012, 2014 and 2015(!) saw a lot of publications that ended up on my list, but 2000 and particularly 2002 (the year Jean-Marie Le Pen made it into the second round of the French presidential election) were not bad either. 2013 and 2003 (not listed) were also relatively strong years, with 33 publications each.
To get a more complete overview, it’s best to plot the whole time series (ignoring some very old titles):
There is a distinct upwards trend all through the 1990s, a post-millennial decline in the mid-noughties (perhaps due to the fact that I completed a book manuscript then and became temporarily negligent in my collector’s duties, but I don’t think so), and then a new peak during the last five years, undoubtedly driven by recent political events and countless eager postdocs and PhD students. I’m just beginning to understand the structure of data objects that RefManageR creates from my bibliography, but I think it’s time for some league tables next.
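For readers without R at hand, the same kind of tally (publications per year, articles per journal) can be roughed out in Python. This is only a sketch and not what RefManageR does internally: the three entries are made-up placeholders rather than items from the bibliography, field_counts is a hypothetical helper, and the regular expression is a crude stand-in for a real BibTeX parser (it assumes one field per line and no nested braces).

```python
import re
from collections import Counter

# Placeholder entries standing in for the real .bib file (not real references).
SAMPLE_BIB = """
@article{placeholder1,
  author = {Author, A.},
  title = {First placeholder title},
  journal = {Journal A},
  year = {2012}
}
@article{placeholder2,
  author = {Author, B.},
  title = {Second placeholder title},
  journal = {Journal B},
  year = 2012
}
@book{placeholder3,
  author = {Author, C.},
  title = {Third placeholder title},
  year = "2002"
}
"""


def field_counts(bibtex, name):
    """Count the values of one BibTeX field (e.g. 'year' or 'journal').

    Assumes each field sits on its own line as `name = {value}`,
    `name = "value"` or `name = value`, without nested braces.
    """
    pattern = re.compile(
        r'^\s*' + re.escape(name) + r'\s*=\s*[{"]?([^{}",\n]+)[}"]?\s*,?\s*$',
        re.IGNORECASE | re.MULTILINE,
    )
    return Counter(value.strip() for value in pattern.findall(bibtex))


if __name__ == "__main__":
    # Years sorted by frequency, then the journal league table.
    print(field_counts(SAMPLE_BIB, "year").most_common())
    print(field_counts(SAMPLE_BIB, "journal").most_common())
```

To run this against a real file, one would read its contents into a string first; anything with multi-line or nested fields needs a proper parser.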
I’ve recently discovered Rfacebook, which lets you access public information on Facebook from R. In terms of convenience, no package for R or Python that I have seen so far comes near. Get yourself a long-lived token, store it as a variable, and put all posts on a fanpage you are interested in into one R object with a single function call. Check it out here.
R Package Parallel: How Not to Solve a Problem That Does Not Exist
Somewhat foolishly, my university has granted me access to Mogon: not the god, not the death metal band but rather their supercomputer, which currently holds the 182nd spot in the TOP500 list of the fastest computers on the planet. It has some 34,000+ cores and more than 80 TB of RAM, but basically it’s just a very large bunch of Linux boxes. That means that I have a rough idea how to handle it, and that it happily runs my native Linux Stata and MPlus (and hopefully Jags) binaries for me. It also has R installed, and this is where my misery began.
I have a lengthy R job that deals with census data. Basically, it looks up the absolute number of minority residents in some 25,000 output areas and their immediate neighbours and calculates a series of percentages from these figures. I think this could in principle be done in Stata, but R provides convenient libraries for dealing with geo-coded data (sp and friends), non-rectangular data structures and all the trappings of a full-featured programming language, so it would be stupid not to make use of it. The only problem is that R is relatively slow and single-threaded, and that my script is what they call embarrassingly parallel: The same trivial function is applied to 33 vectors with 25,000 elements each. Each calculation on a vector takes about eight seconds to complete, which amounts to roughly five minutes in total. Add the time it takes to read in the data and some fairly large lookup-tables (it would be very time-consuming to repeatedly calculate which output area is close enough to each other output area to be considered a neighbour), and we are looking at eight to ten minutes for one run.
Mogon. Image Credit: ZDV JGU Mainz
While I do not plan to run this script very often – once the calculations are done and saved, the results can be used in the analysis proper over and over again – I fully expect that I might change some operationalisations, include different variables etc., and so I began toying with the parallel package for R to make use of the many cores suddenly at my disposal.
Twelve hours later, I had learned the basics of the scheduling system (LSF), solved the problem of synching my data between home, office, central, and super-computer, gained some understanding of the way parallel works and otherwise achieved basically nothing: Even the best attempt at running a parallelised version of the script on the supercomputer was a little slower than the serialised version on my very capable office machine, and that is without the 15 to 90 seconds the script spends waiting to be transferred to a suitable node of the cluster. I tried different things: replacing lapply with mclapply, which was slower regardless of the number of cores; using clusterApply instead of lapply (same result); and forking the 33 serial jobs into the background, which was even worse, presumably because storing the returned values resulted in changes to rather large data structures that were propagated to all cores involved.
So yes, to save a few minutes in a script that I will presumably run not more than four or five times over the next couple of weeks, I spent 12 hours, with zilch results. But at least I learned a few things (apart from the obvious re-iteration of the old ‘never change a half-way running system’ mantra). First, even if it takes eight seconds to do the sums, a vector of 25,000 elements is probably too short to really benefit from shifting the calculations to more cores. While forking should be cheap, the overhead of setting up the additional processes and shipping the results back dominates any savings. Second, running jobs in parallel without really understanding what overhead this creates is a stupid idea, and knowing what overhead this creates and how to avoid it is probably not worth the candle (see the above). Third, I can always re-use the infrastructure I’ve created (for more pointless experiments). Fourth, my next go at Mogon shall avoid half-baked middle-level parallelisation altogether. Instead I shall combine fine-grained implicit parallelism (built into Stata and Mplus) and very coarse explicit parallelism (by breaking up lengthy scripts into small chunks that can be run independently). Further research is definitely needed.
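The serial-versus-forked comparison is easy to reproduce in miniature. The following Python sketch is not my R script: share_of_total is a made-up stand-in for the real per-vector calculation, the worker count is arbitrary, and the "fork" start method (roughly what mclapply relies on) is only available on Unix-like systems. Only the 33 vectors of 25,000 elements are taken from the description above.

```python
import multiprocessing as mp
import time


def share_of_total(vector):
    """Trivial per-vector job: turn absolute counts into percentages."""
    total = sum(vector)
    return [100.0 * x / total for x in vector]


def run_serial(vectors):
    """Apply the job to each vector in turn (the lapply analogue)."""
    return [share_of_total(v) for v in vectors]


def run_forked(vectors, workers=4):
    """Apply the job in forked worker processes (the mclapply analogue).

    Every input vector is pickled and sent to a worker, and every result
    is pickled and sent back; that traffic is pure overhead.
    """
    ctx = mp.get_context("fork")  # assumption: Unix-like system
    with ctx.Pool(processes=workers) as pool:
        return pool.map(share_of_total, vectors)


def timed(fn, *args):
    """Return the function's result together with the elapsed wall time."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # 33 vectors with 25,000 elements each, as in the census job.
    vectors = [[float(i + j + 1) for i in range(25_000)] for j in range(33)]
    serial, t_serial = timed(run_serial, vectors)
    forked, t_forked = timed(run_forked, vectors)
    print(f"serial: {t_serial:.3f}s  forked: {t_forked:.3f}s  "
          f"identical results: {serial == forked}")
```

Which variant wins depends on the machine and on how expensive the per-vector job is relative to the cost of moving the data around, which is exactly the trade-off I underestimated.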
For our piece on distance effects in English elections we geocoded the addresses of hundreds of candidates. For the un-initiated: Geocoding is the fine art of converting addresses into geographical coordinates (longitude and latitude). Thanks to Google and some other providers like OpenStreetMap, this is now a relatively painless process. But when one needs more than a few addresses geocoded, one does not rely on pointing-and-clicking. One needs an API, i.e. a programmatic interface, usually wrapped in a software library, that makes the service accessible through R, Python or some other programming language.
The upside is that I learned a bit about the wonders of Python in general and the charms of geopy in particular. The downside is that writing a simple script that takes a number of strings from a Stata file, converts them into coordinates and gets them back into Stata took longer than I ever thought possible. Just now, I’ve learned about a possible shortcut (via the excellent data monkey blog): geocode is a user-written Stata command that takes a variable containing address strings and returns two new variables containing the latitude/longitude information. Now that would have been a bit of a time-saver. You can install geocode by typing
net from http://www.stata-journal.com/software/sj11-1
net install dm0053
There is, however, one potential drawback: Google limits the number of free queries per day (and possibly per minute). Via Python, you can easily stagger your requests, and you can also use an API key that is supposed to give you a bigger quota. Geocoding a large number of addresses from Stata in one go, on the other hand, will probably result in an equally large number of parsing errors.
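Staggering requests from Python needs little more than a loop and a pause. The sketch below is generic rather than the script we actually used: geocode_staggered and fake_geocode are hypothetical names, the one-second default interval is an arbitrary assumption (check your provider's terms), and in real use one would pass in something like the geocode method of a geopy geocoder object instead of the fake.

```python
import time


def geocode_staggered(addresses, geocode, min_interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Geocode addresses one by one, at least `min_interval` seconds apart.

    `geocode` is any callable that takes an address string and returns a
    result. A failed lookup is stored as None instead of aborting the batch.
    `sleep` and `clock` are injectable so the pacing can be tested.
    """
    results = {}
    last_call = None
    for address in addresses:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        try:
            results[address] = geocode(address)
        except Exception:
            results[address] = None
    return results


def fake_geocode(address):
    """Hypothetical stand-in for a real geocoder; returns dummy coordinates."""
    if not address:
        raise ValueError("empty address")
    return (float(len(address)), 0.0)


if __name__ == "__main__":
    found = geocode_staggered(["1 Example Street", "", "2 Sample Road"],
                              fake_geocode, min_interval=0.2)
    for address, position in found.items():
        print(repr(address), position)
```

Storing None for failures means the problem cases can be inspected and retried later without repeating the queries that already succeeded.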
I’m more and more intrigued by the potential spatial data holds for political science. Once you begin to think about it, concepts like proximity and clustering are basic building blocks for explaining social phenomena. Even better, since the idea of open data has gone mainstream, more and more spatially referenced information becomes available, and when it comes to free, open source software, we are spoilt for choice or, at least in my case, up to and beyond the point of utter confusion.
GRASS is a full-blown open source GIS, whose user interface is perfect for keyboard aficionados and brings back happy memories of the 1980s. While GRASS can do amazing things with raster and vector maps, it is suboptimal for dealing with rectangular data. In the end, we used only its underrated cartographic ps.map module, which reliably creates high-resolution postscript maps.
The real treasure trove, however, is R. The quality of packages for spatial analysis is amazing, and their scope is a little overwhelming. Applied Spatial Data Analysis with R by Roger Bivand (who wrote much of the relevant code), Edzer Pebesma and Virgilio Gómez-Rubio provides much-needed guidance.
Counting the number of mainstream candidates living in a constituency is a point-in-polygon problem: each candidate is a co-ordinate enclosed by a constituency boundary. Function overlay from package sp carries out the relevant operation. Once I had it located, I was seriously tempted to loop over constituencies and candidates. Just in time, I remembered the R mantra of vectorisation. Provided that points (candidates) and polygons (constituencies) have been transformed to the same projection, all that is needed is this:
mymap@data$homeconst1 <- overlay(candpos1, mymap)
mymap@data$homeconst2 <- overlay(candpos2, mymap)
mymap@data$homeconst3 <- overlay(candpos3, mymap)
This works because candpos1 is a vector of points that represent the spatial positions of all Labour candidates. These are tested against all constituency boundaries. The result is another vector of indices, i.e. sequence numbers of the constituencies the candidates are living in. Put differently, overlay takes a list of points and a bunch of polygons and returns a list that maps the former to the latter. With a bit of boolean logic, a vector of zeros (candidate outside constituency) and ones (candidate living in their constituency) ensues. Summing up the respective vectors for Labour, Tories, and LibDems then gives the required count that can be mapped. Result!
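To make the logic explicit, here is a toy version in plain Python. It is not how sp implements overlay: the ray-casting test, the function names and the two unit-square "constituencies" are all illustrative assumptions, and I further assume that each party's candidate list is ordered so that candidate i stands in constituency i.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how often a ray going right from (x, y) crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def locate(points, polygons):
    """Index of the first polygon containing each point, or None (overlay analogue)."""
    return [
        next((i for i, poly in enumerate(polygons)
              if point_in_polygon(x, y, poly)), None)
        for x, y in points
    ]


def home_flags(points, polygons):
    """1 if candidate i lives in constituency i (where they stand), else 0."""
    return [int(found == i) for i, found in enumerate(locate(points, polygons))]


# Two adjacent unit squares as toy constituencies.
CONSTITUENCIES = [
    [(0, 0), (1, 0), (1, 1), (0, 1)],
    [(1, 0), (2, 0), (2, 1), (1, 1)],
]
# One candidate per party and constituency; coordinates are made up.
LABOUR = [(0.5, 0.5), (0.2, 0.3)]
TORIES = [(1.5, 0.5), (1.5, 0.2)]
LIBDEMS = [(3.0, 3.0), (1.2, 0.9)]

if __name__ == "__main__":
    flags = [home_flags(party, CONSTITUENCIES)
             for party in (LABOUR, TORIES, LIBDEMS)]
    # Sum the 0/1 vectors across parties to get the count per constituency.
    print([sum(column) for column in zip(*flags)])
```

A candidate living outside every polygon is located as None and therefore counted as not living in their constituency.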
Sometimes, a man’s gotta do what a man’s gotta do. Which, in my case, might be a little simulation of a random process involving an unordered categorical variable. In R, sampling from a multinomial distribution is trivial.
A single function call gives me a vector of random numbers from a multinomial distribution with outcomes 1, 2, 3, and 4, where the probability of observing a ‘1’ is 10 per cent, the probability of observing a ‘2’ is 70 per cent, and so on. But I could not find an equivalent function in Stata. Generating the artificial data in R and then importing them into Stata is not very elegant, so I kept digging and found a solution in section M-5 of the Mata handbook. Hidden in the entry on runiform is a reference to rdiscrete(r,c,p), a Mata function which generates an r*c matrix of draws from a multinomial distribution defined by a vector p of probabilities.
That leaves but one question: Is wrapping a handful of lines around a Mata call to replace a non-existent Stata function more elegant than calling an external program?
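For what it is worth, the idea behind such a function fits into a few lines in any language. The Python sketch below is an analogue, not Mata's implementation: draw_discrete is a hypothetical name, and while the first two probabilities (10 and 70 per cent) are the ones mentioned above, the last two (10 per cent each) are assumptions made up to complete the example.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def draw_discrete(rows, cols, p, rng=random):
    """Return a rows x cols list of draws from outcomes 1..len(p).

    Inverse-CDF sampling: draw u from [0, 1) and pick the first outcome
    whose cumulative probability exceeds u.
    """
    if any(prob < 0 for prob in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be non-negative and sum to 1")
    cutoffs = list(accumulate(p))
    cutoffs[-1] = 1.0  # guard against floating-point rounding in the sum
    return [
        [bisect_right(cutoffs, rng.random()) + 1 for _ in range(cols)]
        for _ in range(rows)
    ]


if __name__ == "__main__":
    # 0.1 and 0.7 come from the text; the remaining 0.1 and 0.1 are assumed.
    probabilities = [0.1, 0.7, 0.1, 0.1]
    draws = draw_discrete(1000, 1, probabilities, random.Random(2024))
    column = [row[0] for row in draws]
    for outcome in (1, 2, 3, 4):
        print(outcome, column.count(outcome) / len(column))
```

With 1,000 draws the observed shares should land near the specified probabilities, though not exactly on them.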
These days, a bonanza of political information is freely available on the internet. Sometimes this information comes in the guise of Excel sheets, comma-separated data or other formats which are more or less readily machine readable. But more often than not, information is presented as tables designed to be read by humans. This is where the gentle art of screen scraping, web scraping or spidering comes in. In the past, I have used kludgy Perl scripts to get electoral results at the district level off sites maintained by the French ministry of the interior or by universities (very interesting if you do not really speak/read French). A slightly more elegant approach might be to use R’s built-in Perl-like capabilities for doing the job, as demonstrated by Simon Jackman. Finally, Python, which has some very decent libraries for screen/web scraping, is gaining ground in the political science community (see this elaborate post on Drew Conway’s Zero Intelligence Agents blog). But, let’s face it: I am lazy. I want to spend time analysing the data, not scraping them. And so I was very pleased when I came across outwit, a massive plugin for the Firefox browser (Linux, Mac and Windows versions available) that acts as a point-and-click scraper.
French Départements (from Wikipedia)
Say you need a dataset with the names and Insee numbers for all the French Départements. The (hopefully trustworthy) Wikipedia page has a neat table, complete with information on the Prefecture and many tiny coats of arms which are of absolutely no use at all. We could either key in the relevant data (doable, but a nuisance), or we could try to copy and paste the table into a word processor, hoping that we do not lose accents and other funny characters, and that WinWord or whatever we use converts the HTML table into something that we can edit to extract the information we really need.
Or we could use outwit.
Scraping a table with outwit
One push of the button loads the page into a sub-window, a second push (data->tables) extracts the HTML tables on the page. Now, we can either mark the lines we are interested in by hand (often the quickest option) or use a filter to select them. One final click, and they are exported as a CSV file that can be read into R, OpenOffice, or Stata for post processing and analysis.
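For comparison, the scripted route for the same job might look roughly like the Python sketch below, using only the standard library. The HTML snippet is a hand-made miniature and not the actual Wikipedia markup, TableExtractor and table_to_csv are hypothetical names, and the parser assumes a single simple table without nested tables or cells spanning several rows or columns.

```python
import csv
import io
from html.parser import HTMLParser

# Hand-made miniature of a départements table (not the real page source).
SAMPLE_HTML = """
<table>
  <tr><th>INSEE</th><th>Département</th><th>Prefecture</th><th>Arms</th></tr>
  <tr><td>01</td><td><a href="#">Ain</a></td><td>Bourg-en-Bresse</td><td><img src="ain.png"></td></tr>
  <tr><td>07</td><td><a href="#">Ardèche</a></td><td>Privas</td><td><img src="ardeche.png"></td></tr>
</table>
"""


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html, keep=(0, 1)):
    """Return the chosen columns of the table as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in parser.rows:
        writer.writerow([row[i] for i in keep if i < len(row)])
    return out.getvalue()


if __name__ == "__main__":
    print(table_to_csv(SAMPLE_HTML), end="")
```

The coats of arms disappear on their own because image tags carry no text, and accented characters survive as long as the page is read with the right encoding.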
While I’m all in favour of scriptable and open-source tools like Perl, Python and R, outwit has a lot going for it if all you need is a quick hack. Outwit also has functions to mass-download files (say PDFs) from a page and give them unique names. If the job is complex, there is even more functionality under the hood, and you can use the point-and-click interface to program your own scraper, though I would tend to use a real programming language for these cases. At any rate, outwit is a useful and free tool for the lazy data analyst.