This is me, about once per year, when I bemoan my lack of R-coolness whilst simultaneously enjoying my Stata-efficiency.
- From the Monkey Cage: Italy just voted for two very different kinds of populism
- The botrnot package for the R language: Which world leaders are actually bots? (Use your own judgment)
- Science community blogs: recognising value and measuring reach
- Germany being Germany (or Bavaria?): German minister under fire for no women in leadership team
- 11 Brexit promises the government quietly dropped. Well, “promise” was probably too strong a word. More like intentions. Out-of-the-box thinking etc.
I’m still collecting references for the next iteration of the Extreme Right Bibliography (but I am almost there. Honest to God. Really). Meanwhile, while I should probably have been doing other things, I’ve brushed up my fairly rudimentary R skills and taught myself how to write a similarly rudimentary Twitter bot.
If you are reading this, chances are that you are interested in the Radical/Extreme/Etc Right. If you also happen to be on Twitter, you will want to follow the Radical Right Research Robot for all sorts of serendipitous insights, e.g. that reference to the article you always suspected existed but were too shy to ask about.
And if that does not appeal, it has a cutesy profile pic. So follow it (him? her?). Resistance is futile.
- One of the best (and most depressing) articles on Trump I’ve read so far: “Inside Trump’s Hour-by-Hour Battle for Self-Preservation”
- If you learnt R as a grad student and if that was some time ago (cough), here is help to get you started on the new ways of doing things in R
- To further drive this home, here is a quick and only slightly dirty analysis of the Weinstein effect in newspaper reporting using tidytext
- What remains of the traditional French centre-right now that Macron is poaching on the Front National’s territory? Art Goldhammer nails it.
- Meanwhile, the Front National is once more in hot water over the misuse of EU funds.
Which publishers are the most relevant for Radical Right research? Good question.
@kai_arzheimer Kai can you also export that data by publisher? Asking for a friend 😉
— Fascism & Far Right (@FFRBookSeries) October 21, 2016
Radical Right research by type of publication
Currently, most of the items in The Eclectic, Erratic Bibliography on the Extreme Right in Western Europe (TM) are journal articles. The books/chapters/articles ratios have shifted somewhat over the years, reflecting both general trends in publishing and my changing reading habits, and by now the dominance of journal articles is rather striking.
The most important journals for Radical Right research (add pinch of salt as required)
One in three of these articles has been published in one of the four apparent top journals for Radical Right research: the European Journal of Political Research, West European Politics, Party Politics, and Acta Politica. I say ’apparent’ here because this result may be a function of my (Western) Eurocentrism and my primary interest in Political Science and Sociology. Other Social Sciences are underrepresented, and literature from national journals that publish in languages other than English is virtually absent.
But hey: Laying all scruples aside, here is a table of the ten most important journals for Radical Right research:
| Journal | No. of articles |
|---|---|
| European Journal of Political Research | 38 |
| West European Politics | 35 |
| Patterns of Prejudice | 12 |
| Comparative European Politics | 10 |
| Comparative Political Studies | 10 |
| Government and Opposition | 9 |
Neat, isn’t it?
I did a similar analysis nearly two years ago. Government and Opposition as well as Comparative European Politics are new additions to the top ten (replacing Österreichische Zeitschrift für Politikwissenschaft and Osteuropa), but otherwise, the picture is much the same. So if you publish on the Radical Right and want your research to be noticed, you should probably aim for these journals.
For the past 15 years or so, I have maintained an extensive collection of references on the Radical/Extreme/Populist/New/Whatever Right in Western Europe. Because I love TeX and other command line tools of destruction, these references live in a large BibTeX file. BibTeX is a well-documented format for bibliographic text files that has been around for decades and can be written and read by a large number of reference managers.
Because BibTeX is so venerable, it’s unsurprising that there is even an R package (RefManageR) that can read and write BibTeX files, effectively turning bibliographic data into a dataset that can be analysed, graphed and otherwise mangled to one’s heart’s desire. And so my totally unscientific analysis of the Radical Right literature (as reflected in my personal preferences and interests) is just three lines of code away:
```r
library("RefManageR")
# read the BibTeX file into a BibEntry object
ex <- ReadBib("/home/kai/Work/bibliography/xr-bibliography/extreme-right-western-europe-bibliography.bib")
# tabulate publications per year and show the five biggest years
tail(sort(table(unlist(ex$year))), 5)
```
So 2012, 2014 and 2015(!) saw a lot of publications that ended up on my list, but 2000 and particularly 2002 (the year Jean-Marie Le Pen made it into the second round of the French presidential election) were not bad either. 2013 and 2003 (not listed) were also relatively strong years, with 33 publications each.
To get a more complete overview, it’s best to plot the whole time series (ignoring some very old titles):
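A base-R sketch of such a plot, assuming `ex` is the BibEntry object read in the snippet above (the 1980 cut-off for ‘very old titles’ is my own choice, not the one used for the original figure):

```r
library("RefManageR")  # provides ReadBib() and the BibEntry object

# assumption: `ex` holds the bibliography read in the earlier snippet
years <- as.numeric(unlist(ex$year))
counts <- table(years[!is.na(years) & years >= 1980])  # drop very old titles

plot(as.numeric(names(counts)), as.vector(counts), type = "h",
     xlab = "Year", ylab = "Publications per year")
```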
There is a distinct upwards trend all through the 1990s, a post-millennial decline in the mid-noughties (perhaps due to the fact that I completed a book manuscript then and became temporarily negligent in my collector’s duties, but I don’t think so), and then a new peak during the last five years, undoubtedly driven by recent political events and countless eager postdocs and PhD students. I’m just beginning to understand the structure of the data objects that RefManageR creates from my bibliography, but I think it’s time for some league tables next.
I’ve recently discovered Rfacebook, which lets you access public information on Facebook from R. In terms of convenience, no package for R or Python that I have seen so far comes near. Get yourself a long-lived token, store it as a variable, and put all posts on a fanpage you are interested in into one R object with a single function call. Check it out here.
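A minimal sketch of that workflow with Rfacebook; the app credentials and the page name below are placeholders you would replace with your own:

```r
library("Rfacebook")

# assumption: APP_ID/APP_SECRET come from a Facebook app you registered;
# fbOAuth() returns a token that can be saved and reused across sessions
token <- fbOAuth(app_id = "APP_ID", app_secret = "APP_SECRET")
saveRDS(token, "fb_token.rds")

# a single call pulls the posts of a public fan page into a data frame
posts <- getPage("someFanPage", token = token, n = 500)
head(posts)
```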
R Package Parallel: How Not to Solve a Problem That Does Not Exist
Somewhat foolishly, my university has granted me access to Mogon: not the god, not the death metal band, but rather their supercomputer, which currently holds the 182nd spot in the TOP500 list of the fastest computers on the planet. It has some 34,000+ cores and more than 80 TB of RAM, but basically it’s just a very large bunch of Linux boxes. That means that I have a rough idea how to handle it, and that it happily runs my native Linux Stata and Mplus (and hopefully JAGS) binaries for me. It also has R installed, and this is where my misery began.
I have a lengthy R job that deals with census data. Basically, it looks up the absolute number of minority residents in some 25,000 output areas and their immediate neighbours and calculates a series of percentages from these figures. I think this could in principle be done in Stata, but R provides convenient libraries for dealing with geo-coded data (sp and friends), non-rectangular data structures and all the trappings of a full-featured programming language, so it would be stupid not to make use of it. The only problem is that R is relatively slow and single-threaded, and that my script is what they call embarrassingly parallel: The same trivial function is applied to 33 vectors with 25,000 elements each. Each calculation on a vector takes about eight seconds to complete, which amounts to roughly five minutes in total. Add the time it takes to read in the data and some fairly large lookup-tables (it would be very time-consuming to repeatedly calculate which output area is close enough to each other output area to be considered a neighbour), and we are looking at eight to ten minutes for one run.
While I do not plan to run this script very often – once the calculations are done and saved, the results can be used in the analysis proper over and over again – I fully expect that I might change some operationalisations, include different variables etc., and so I began toying with the parallel package for R to make use of the many cores suddenly at my disposal.
Twelve hours later, I had learned the basics of the scheduling system (LSF), solved the problem of syncing my data between home, office, central, and super-computer, gained some understanding of the way parallel works, and otherwise achieved basically nothing: even the best attempt at running a parallelised version of the script on the supercomputer was a little slower than the serial version on my very capable office machine (and that is without the time, between 15 and 90 seconds, the script spends waiting to be transferred to a suitable node of the cluster). I tried different things: replacing lapply with mclapply, which was slower regardless of the number of cores; using clusterApply instead of lapply (same result); and forking the 33 serial jobs into the background, which was even worse, presumably because storing the returned values resulted in changes to rather large data structures that were propagated to all cores involved.
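For the record, the variants can be sketched as follows; the data and the function are stand-ins, not the actual census code:

```r
library(parallel)

# stand-ins for the real data and function: 33 vectors of 25,000 values,
# with one cheap summary computed per vector
set.seed(42)
vecs <- replicate(33, rnorm(25000), simplify = FALSE)
f <- function(v) mean(abs(v))

# baseline: plain serial lapply
res_serial <- lapply(vecs, f)

# fork-based parallelism; forking is unavailable on Windows, hence the guard
cores <- if (.Platform$OS.type == "unix") 2L else 1L
res_fork <- mclapply(vecs, f, mc.cores = cores)

# socket-cluster parallelism, as used by clusterApply()
cl <- makeCluster(2L)
res_sock <- clusterApply(cl, vecs, f)
stopCluster(cl)
```

All three return the same results; the differences are purely in the overhead of setting up and feeding the workers.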
So yes, to save a few minutes in a script that I will presumably run no more than four or five times over the next couple of weeks, I spent 12 hours, with zilch results. But at least I learned a few things (apart from the obvious re-iteration of the old ‘never change a half-way running system’ mantra). First, even if it takes eight seconds to do the sums, a vector of 25,000 elements is probably too short to really benefit from shifting the calculations to more cores: while forking should be cheap, the overhead of setting up the additional threads dominates any savings. Second, running jobs in parallel without really understanding what overhead this creates is a stupid idea, and knowing what overhead it creates and how to avoid it is probably not worth the candle (see above). Third, I can always re-use the infrastructure I’ve created (for more pointless experiments). Fourth, my next go at Mogon shall avoid half-baked middle-level parallelisation altogether. Instead I shall combine fine-grained implicit parallelism (built into Stata and Mplus) with very coarse explicit parallelism (by breaking up lengthy scripts into small chunks that can be run independently). Further research is definitely needed.
For our piece on distance effects in English elections we geocoded the addresses of hundreds of candidates. For the uninitiated: geocoding is the fine art of converting addresses into geographical coordinates (longitude and latitude). Thanks to Google and some other providers like OpenStreetMap, this is now a relatively painless process. But when one needs more than a few addresses geocoded, one does not rely on pointing and clicking. One needs an API, i.e. a programmatic interface that makes the service accessible through R, Python or some other programming language.
The upside is that I learned a bit about the wonders of Python in general and the charms of geopy in particular. The downside is that writing a simple script that takes a number of strings from a Stata file, converts them into coordinates and gets them back into Stata took longer than I ever thought possible. Just now, I’ve learned about a possible shortcut (via the excellent data monkey blog): geocode is a user-written Stata command that takes a variable containing address strings and returns two new variables containing the latitude/longitude information. Now that would have been a bit of a time-saver. You can install geocode by typing
net from http://www.stata-journal.com/software/sj11-1
net install dm0053
There is, however, one potential drawback: Google limits the number of free queries per day (and possibly per minute). Via Python, you can easily stagger your requests, and you can also use an API key that is supposed to give you a bigger quota. Geocoding a large number of addresses from Stata in one go, on the other hand, will probably result in an equally large number of parsing errors.
I’m more and more intrigued by the potential spatial data hold for political science. Once you begin to think about it, concepts like proximity and clustering are basic building blocks for explaining social phenomena. Even better, since the idea of open data has gone mainstream, more and more spatially referenced information has become available, and when it comes to free, open source software, we are spoilt for choice or, at least in my case, up to and beyond the point of utter confusion.
For our paper on the effect of spatial distance between candidates and their prospective voters, we needed a choropleth map of English Westminster constituencies that shows how many of the mainstream candidates live within the constituency’s boundaries. Basically, we had three options (not counting the rather few user-contributed packages for Stata): GRASS, a motley collection of Python packages, and a host of libraries for R.
GRASS is a full-blown open source GIS, whose user interface is perfect for keyboard aficionados and brings back happy memories of the 1980s. While GRASS can do amazing things with raster and vector maps, it is suboptimal for dealing with rectangular data. In the end, we used only its underrated cartographic ps.map module, which reliably creates high-resolution postscript maps.
Python has huge potential for social scientists, both in its own right and as a kind of glue that binds various programs together. In principle, a lot of GIS-related tasks could be done with Python alone. We used the very useful geopy toolbox for converting UK postcodes to latitude/longitude co-ordinates, with a few lines of code and a little help from Google.
The real treasure trove, however, is R. The quality of packages for spatial analysis is amazing, and their scope is a little overwhelming. Applied Spatial Data Analysis with R by Roger Bivand, who wrote much of the relevant code, provides much-needed guidance.
Counting the number of mainstream candidates living in a constituency is a point-in-polygon problem: each candidate is a co-ordinate enclosed by a constituency boundary. Function overlay from package sp carries out the relevant operation. Once I had located it, I was seriously tempted to loop over constituencies and candidates. Just in time, I remembered the R mantra of vectorisation. Provided that points (candidates) and polygons (constituencies) have been transformed to the same projection, all that is needed is this:
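A minimal sketch of that call, in which `candpos1` (the Labour candidates’ positions) is taken from the text, while `constituencies` (the constituency polygons) and `lab_const_idx` (each candidate’s own constituency number) are hypothetical names standing in for the real objects:

```r
library(sp)

# overlay() (since superseded by over()) returns, for each point,
# the index of the polygon that contains it
idx <- overlay(candpos1, constituencies)

# 1 if a candidate lives inside the constituency they stand in, 0 otherwise
lab_inside <- as.numeric(idx == lab_const_idx)
sum(lab_inside)  # number of Labour candidates living in their constituency
```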
This works because candpos1 is a vector of points that represent the spatial positions of all Labour candidates. These are tested against all constituency boundaries. The result is another vector of indices, i.e. the sequence numbers of the constituencies the candidates live in. Put differently, overlay takes a list of points and a bunch of polygons and returns a list that maps the former to the latter. With a bit of Boolean logic, a vector of zeros (candidate outside constituency) and ones (candidate living in their constituency) ensues. Summing up the respective vectors for Labour, the Tories, and the LibDems then gives the required count that can be mapped. Result!