Nov 29 2020

Why yes, of course nothing says "memify" like a series of online lectures that everybody wants to fast-forward. And I have the tweets to prove it.

So I'm teaching a mandatory stats/methods class (always popular). Online. Following the advice from my own kids, I have memified the outline. For your own syllabus needs, here is the week-by-week programme (one meme per week):

Week 1: Introduction [meme]
Week 2: Causality & Designs [meme]
Week 3: [meme]
Week 4: More SNA [meme]
Week 5: Still more SNA [meme]
Week 6: Missing Data [meme]
Week 7: [meme]
Week 8: [meme]
Week 9: Factor Analysis & some SEM [meme]
Week 10: More covariances [meme]
Week 11: Cross-level inference [meme]
Week 12: Event data [meme]
Week 13: Multi-level structures [meme]
Week 14: Wrap-up & back to basics [meme]

That's the transition to digital sorted, yes?

Originally tweeted by Kai Arzheimer 🇪🇺 (@kai_arzheimer) on November 20, 2020.

Apr 24 2019

How the tidyverse changed my view of #rstats

Back in the mists of time, when I should have been working on my PhD, I found a blue book on the shelf that a previous occupant of the office had left there. As I learned later, it was The Blue Book that introduced the S language, the predecessor of R. I got sidetracked (as you do) and taught myself how to produce beautiful graphs in what is now known as base R, and how to run poorly understood time series analyses (impossible in SPSS at that point).

A little later, I got hooked on Stata, and to the present day, I refuse to be Stata-shamed, as Ben Stanley put it. 95 per cent of the time, it does the job, and quickly so. Also, the documentation is simply excellent.

But every now and then, I came back to R because I needed something specific. And it was mostly fun. Having access to all these APIs (in fact, concurrently having more than one data set in memory) was exciting. Having a real, reasonably straightforward scripting/programming language at my disposal instead of Stata’s hodgepodge of three (four if you count the graph language) half-baked syntaxes was exhilarating. Having a go at the latest methods on the basis of nothing more than skimming a working paper (skipping every non-trivial equation) was… I guess a little bit like trimming your hair with a chainsaw.

But finding, installing, updating and then loading three packages, just to make recoding a little more intuitive? Seriously, R? Not so cool. In fact, finding a variable (whose name and data set must be given in full) was usually enough to reduce me to tears. attach() somehow never does what I think it should do. And so, I would return to Stata once more, like <insert awkward metaphor>.

Then, during one of my last forays, I began playing with the tidyverse. And as the young ones are prone to say: my mind was blown. Tibbles! Pipelines! Lots of yummy helper functions! Going from long to wide format and back (in various different ways)! Grouping, summarising, and even some Pythonesque list traversing. This was no longer the fascinating but slightly stroppy R I used to know.

Compared to the handful of letters and abbreviations that I use in Stata to get things done, recoding-wise, this is still quite verbose, and I have to look up just about everything. But I really like it. Like, really like it. And so doing more stuff in R is firmly on the endless List Of Things I Want To Look Into. To end on the most positive note possible, here is a gratuitous picture of a cat.


Mar 04 2019
Wakelet as a tool for archiving online debates on (academic) events

Wakelet – what is it, and why should academics care to “curate” tweets about events? Bear with me for a second.

The sad state of curating and social storytelling

Until about a year ago, there was a service called storify. Its business idea was that people would “curate” tweets, Facebook posts and other stuff found on social media to narrate stories on the interwebs.

It is a truth universally acknowledged that the idea of “curating” stuff as a mass phenomenon is industrial-grade bullshit. No one wants hordes of people linking half-read stuff together in a bid to be completely ignored by even more people. And so storify was acquired by Livefyre, which was in turn purchased by Adobe, and the whole “curating” business moved away from the masses into the realm of enterprise customers.

Why would a researcher ever think about social storytelling?

My scepticism aside, there was at least one use case for storify in academia. When Prof Jane Ordinary is organising any sort of event these days, it is in her and other people’s interest to create a bit of a social media buzz. It is not just outreach and stuff: Jane wants to project at least a vague sense of awareness of her project into the wider world, and journalists and other researchers who would never read a four-page press release may well want to follow parts of the debate in an informal setting.

The problem here: by its nature, social media is ephemeral. After the event, any buzz will be buried under billions and billions of newer posts. And even during the event, the silo-like structure of the current social mediascape as well as the frequent failure to agree on a single hashtag for smaller events make it very difficult to get an overview of what people are saying online. Here, storify was useful, because one could link every (presentable) post into a story. Then, one (or one’s capable RA) could share the whole shebang or embed it into a more durable web page, either during or after the event.

Clearly a wake, not a wakelet

Photo by MadeByMark

From storify to wakelet

Looking for a replacement for storify to archive (curate??? seriously???) the online/offline story of the policy dialogue that we organised last week, I came across wakelet (apparently, giving your product a dorky name is still a thing in Silicon Valley). Wakelet does everything that storify did, and then a bit more. Basically, everything that has a URL can be linked into a “collection” (also called a wakelet). Tweets and videos get special treatment: they appear in a “native” format, i.e. as a tweetbox or within a video player, respectively. It is possible to add images and texts, too.

Wakelet is sometimes a bit rough around the edges: I had to press reload a couple of times after re-ordering elements for everything to reappear, and wakelets could load a bit quicker. But nonetheless, wakelet very elegantly plugs this particular gap.

What I don’t see, however, is a sustainable mass-market business model. Currently, the service is free for anyone who wants to showcase something. Interleaving collections with adverts would defeat the showcasing aspect. But I don’t see that casual users would be willing to pay for a subscription. And so, in the medium term, it’s turning into another enterprise service or going bust, I presume. But for the time being, wakelet is a useful, if highly specialised, addition to the academic toolbox.

Policy Dialogue: immigration, local decline, the Radical Right & wakelet

Within our ORA project SCoRE, we look into the relationships between local decline, local levels of immigration, immigrant sentiment, and (radical right) voting. Obviously, our findings have (or should have) implications for public policy. And so we organised an event at the European Policy Centre in Brussels. We had a great panel, a sizable crowd of interested folks, and distributed about 100 copies of our policy brief. And then it was over.

But if you are interested in what the speakers said, how people reacted, and what it was like, simply browse the wakelet that I embed below this post. At least until some other, more profitable company buys them.

Mar 02 2019

Every remotely relevant reference I came across during the last 15 years or so resides in a single bibtex file. That is not a problem. The problem is that I’m moving into a shiny, new but somewhat smaller office, together with hundreds of copies of journal articles and hundreds of PDFs. Wouldn’t it be good to know which physical copies are effectively redundant (unreadable comments in the margins aside) and can therefore stay behind?

The trouble is that bibtex files have a rather flexible, human-readable format. Each entry begins with the @ sign, followed by a type (book, article etc.), a reference name, lots of key/value pairs (fields) in arbitrary order, and even more curly braces.

grep @ full.bib|wc -l tells me that I have 2914 references in total. grep binder full.bib|wc -l (binder is a custom field that I use to keep track of the location of my copies) shows that I have printed out/copied 712 texts over the years, and grep file full.bib|wc -l indicates that there are 504 PDFs residing on my filesystem. But what is the magnitude of the intersection?

My first inclination was to look for a suitable Python parser/library. Pybtex looked good in principle but is under-documented and had trouble reading full.bib, because that file is encoded in Latin-1. So endless hours of amateurish coding and procrastination lay ahead. Then I remembered the “do one thing, and do it really well” mantra of old. Enter bibtool, which is a fast and reasonably stable bibtex file filter and pretty-printer. Bibtool reads “resource files”, which are really just short scripts containing filtering/formatting directives. select = {binder ".+"} keeps those references whose “binder” field contains at least one character (.+ is a regular expression that matches any non-empty string). select = {file ".+"} selects all references for which I have a PDF. But bibtool applies a logical OR to these conditions, while I’m interested in finding those references that meet both criteria.

The quick solution is to store each statement in a file of its own and apply bibtool twice, using a pipeline for extra efficiency: bibtool -r find-binder.rsc full.bib|bibtool -r find-pdf >intersection.bib does the trick and solves my problem in under a minute, without any coding.
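For what it's worth, the same binder-AND-file intersection can be sketched in a few lines of library-free Python. This is a rough illustration only, using naive regexes over a toy .bib string (the sample entries are made up), not a replacement for bibtool's proper parser:

```python
import re

def split_entries(bibtex_text):
    """Crudely split a .bib file into entries at each top-level '@'."""
    # In a typical .bib file, a '@' at the start of a line marks a new entry
    parts = re.split(r"(?m)^@", bibtex_text)
    return ["@" + p for p in parts if p.strip()]

def has_field(entry, field):
    """Check whether an entry contains a given field name (rough heuristic)."""
    return re.search(rf"(?mi)^\s*{field}\s*=", entry) is not None

def intersection(bibtex_text):
    """Entries that carry both a 'binder' and a 'file' field."""
    return [e for e in split_entries(bibtex_text)
            if has_field(e, "binder") and has_field(e, "file")]

# Hypothetical two-entry sample: only the first has both fields
sample = """@article{a,
  title = {One},
  binder = {3},
  file = {a.pdf},
}
@article{b,
  title = {Two},
  binder = {1},
}
"""
print(len(intersection(sample)))  # 1
```

This ignores all the edge cases (strings, comments, @ inside field values) that bibtool handles properly, which is exactly why the pipeline above remains the better tool for the real file.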

As it turns out, there were just 65 references in both groups. Apparently, I stopped printing (or at least filing away) some time ago. Eventually, I binned two copies, but it is the principle that matters.

2019 Update

I still use bibtool for quick filtering/reformatting tasks at the command line, but for more complex jobs involving programmatic access to bibtex files from R, RefManageR is a wonderful package. I have used it here in a bibliometric study of the Radical/Extreme Right literature. And my nifty RRResRobot also relies heavily on RefManageR. If you are interested at all in RefManageR, here is a short and sweet introduction.

Jan 14 2019

Terminology matters for science. If people use different words for the same thing, or even worse, the same word for different things, scientific communication turns into a dialogue of the deaf. European Radical Right Studies are a field where this is potentially a big problem: we use labels like “New”, “Populist”, “Radical”, “Extreme” or even “Extremist” with abandon. 

But how bad is it really? In a recent chapter (author’s version, not paywalled), I argue that communication in Radical Right studies still works. Texts using all 50 shades of “Right” are still cited together, indicating that later scholars realised they were all talking about (more or less) the same thing.

I have written a number of short blog posts about the change in terminology over time, the extraction of the co-citation network, and the interpretation of the findings. But sometimes, all this reading is getting a bit much, and so I tried something different: using some newfangled software for noobs, I turned my findings into a short video. Have a look for yourself and tell me what you think.

The Extreme / Radical Right network of co-citations

Watch this video on YouTube.
Jan 10 2019
Identifying topics in research papers with the newsmap package for R (or: how the Radical Right Research Robot became slightly less dumb)

Topic modelling does not work well for (my) research paper abstracts

The Radical Right Research Robot is a fun side project whose life began exactly one year ago. The Robot exists to promote the very large body of knowledge on Radical Right parties and their voters that social scientists have accumulated over decades. At its core is a loop that randomly selects one of the more than 800 titles on my online bibliography on the Extreme/Radical Right every few hours and spits it out on twitter.
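That core loop is tiny. Here is a minimal sketch of the idea (not the robot's actual code: the three-title list and the send_tweet stub are hypothetical stand-ins for the real bibliography and the Twitter API):

```python
import random

# Hypothetical stand-in for the ~800-title bibliography
titles = [
    "The Extreme Right in Western Europe",
    "Radical Right Parties and Their Voters",
    "Why Do People Vote for the Radical Right?",
]

def send_tweet(text):
    """Stub: a real bot would call the Twitter API here."""
    print(text)

def pick_and_post(rng=random):
    """Randomly select one title and post it."""
    title = rng.choice(titles)
    send_tweet(title)
    return title

# A real deployment would trigger this from a scheduler
# (e.g. cron) every few hours rather than just once
posted = pick_and_post()
```

Everything interesting about the robot happens in what it posts alongside the title, which is where the abstracts come in.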

Yet the little android’s aim was always for some sort of serendipity, and so it tries to extract meaning from the abstracts (where available), sometimes with rather funny consequences. The robot’s first idea was to make use of (structural) topic modelling. There are some implementations available in R, and the first results looked promising, but in the end, topic modelling did not find meaningful clusters of papers that could easily be labelled with a common theme. One possible reason is that the abstracts are short, and that there are relatively few (fewer than 400) of them. And so the Robot reverted to using a small and fairly arbitrary set of keywords for identifying topics.

This approach produced some embarrassing howlers like this one:

Or this one (clearly the robot has a thing for media studies – who doesn’t?):

There are two problems here: first, even a single instance of a keyword in a given abstract is enough to trigger a classification, and second, the bot’s pedestrian implementation would classify an abstract using the last keyword that it detected, even if it was the most peripheral of several hits. Not good enough for world domination, obviously.

Newsmap works reasonably well for classifying topics in research paper abstracts

Looking for an alternative solution, the robot came across newsmap (now also available within quanteda), a geographical news classifier developed by Kohei Watanabe. Newsmap is semi-supervised: it starts with a dictionary of proper nouns and adjectives that all refer to geographical entities, say

'France': [Paris, France, French*] 
'Germany': [German*, Berlin]

But newsmap is able to pick up additional words that also help to identify the respective country with high probability, e.g. “Macron”, “Merkel”, “Marseille”, “Hamburg”, or even “Lederhosen”. In a (limited) sense, it learns to identify geographical context even when the country in question is not mentioned explicitly.

But the algorithm is not restricted to geographical entities. It can also identify topics from a list. And so these days, the robot starts with a dictionary of seed words that is work in progress but looks mostly like this at the moment:

'religion & culture': [muslim*, islam*, relig*, cultur*]
'media': [TV, newspaper*, journalis*]
'group conflict': [group*, contact, prejudice, stereotyp*, competition]
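The dictionary-matching starting point can be sketched as follows. To be clear, this is not newsmap's actual semi-supervised estimator (which goes on to learn additional features from the data); it only illustrates scoring an abstract against every topic's seeds and picking the best match, rather than letting the last keyword win:

```python
import re

# Seed dictionary: topic -> glob-style seed patterns ('*' matches any suffix)
SEEDS = {
    "religion & culture": ["muslim*", "islam*", "relig*", "cultur*"],
    "media": ["TV", "newspaper*", "journalis*"],
    "group conflict": ["group*", "contact", "prejudice", "stereotyp*", "competition"],
}

def pattern_to_regex(pattern):
    """Turn a glob-style seed ('relig*') into a word-boundary regex."""
    return re.compile(r"\b" + re.escape(pattern).replace(r"\*", r"\w*") + r"\b",
                      re.IGNORECASE)

COMPILED = {topic: [pattern_to_regex(p) for p in pats]
            for topic, pats in SEEDS.items()}

def classify(abstract):
    """Return the topic with the most seed hits, or None if nothing matches."""
    scores = {topic: sum(len(rx.findall(abstract)) for rx in pats)
              for topic, pats in COMPILED.items()}
    best_topic, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_topic if best_score > 0 else None

# Hypothetical abstract: two media hits beat one hit each for the other topics
text = "Newspaper coverage and journalists shape perceptions of religious groups."
print(classify(text))  # media
```

Counting all hits instead of keeping only the last one already avoids the "most peripheral keyword wins" howlers described above.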

Results are not perfect, but at least they are less embarrassing than those from the simple keyword approach. One remaining problem is that newsmap tags each abstract with (at most) one topic. In reality, any given article will refer to two or more themes in the literature. Topic models are much more attractive in this respect, because they treat each text as a mixture of topics, and so the robot may have to revisit them in the future.

Dec 13 2018
Does use of Extreme Right / Radical Right terminology predict co-citations? (Part 2)

Reprise: The co-citation network in European Radical Right studies

In the last post, I tried to reconstruct the co-citation network in European Radical Right studies and ended up with this neat graph.

Co-citations within top 20 titles in Extreme / Radical Right studies


The titles are arranged in groups, with the “Extreme Right” camp on the right, the “Radical Right” group in the lower-left corner, and a small number of publications committed to neither in the upper-left corner. The width of the lines represents the number of co-citations connecting the titles.

What does the pattern look like? The articles by Knigge (1998) and Bale et al. (2010) are both in the “nothing in particular” group, but are never cited together, at least not in the data that I extracted. One potential reason is that they are twelve years apart and address quite different research questions.

Want to watch a video of this blog?

The Extreme / Radical Right network of co-citations

Watch this video on YouTube.

Apart from this gap, the network is complete, i.e. every title is co-cited with every other title in the top 20. This is already rather compelling evidence against the idea of a split into two incompatible strands. Intriguingly, there are even some strong ties that bridge alleged intellectual cleavages, e.g. between Kitschelt’s monograph and the article by Golder, or between Lubbers, Gijsberts and Scheepers on the one hand and Norris and Kitschelt on the other.

While the use of identical terminology seems to play a minor role, the picture also suggests that co-citations are chiefly driven by the general prominence of the titles involved. However, network graphs can be notoriously misleading.

Modelling the number of co-citations in European Radical Right studies

Modelling the number of co-citations provides a more formal test for this intuition. There are \frac{20\times 19}{2}=190 counts of co-citations amongst the top 20 titles, ranging from 0 to 5476, with a mean count of 695 and a variance of 651,143. Because the variance is so much bigger than the mean, a regression model that assumes a negative binomial distribution, which can accommodate such overdispersion, is more appropriate than one built around a Poisson distribution. “General prominence” is operationalised as the sum of external co-citations of the two titles involved. Here are the results.

Predictor               Coefficient   Std. Error   p
external co-citations   0.0004        0.00002      <0.05
same terminology        0.424         0.120        <0.05


The findings show that, controlling for general prominence (operationalised as the sum of co-citations outside the top 20), using the same terminology (coded as “extreme” / “radical” / “unspecific or other”) does have a positive effect on the expected number of co-citations. But what do the numbers mean?

The model is additive in the logs. To recover the counts (and transform the model into its multiplicative form), one needs to exponentiate the coefficients. Accordingly, the effect of using the same terminology translates into a factor of exp(0.424) = 1.53.
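A quick back-of-the-envelope check of the arithmetic in this section (all figures copied from the text above):

```python
import math

# 190 unordered pairs among the top 20 titles
n_dyads = 20 * 19 // 2

# Overdispersion check: the variance dwarfs the mean,
# which motivates the negative binomial over the Poisson
mean_count, variance = 695, 651_143
dispersion_ratio = variance / mean_count  # far above 1

# The model is additive in the logs, so exponentiating the
# same-terminology coefficient yields a multiplicative effect on counts
rate_ratio = math.exp(0.424)

print(n_dyads, round(dispersion_ratio), round(rate_ratio, 2))
```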

What do these numbers mean?

But how relevant is this in practical terms? Because the model is non-linear, it’s best to plot the expected counts for equal/unequal terminology, together with their confidence bands, against a plausible range of external co-citations.

Effect of external co-citations and use of terminology on predicted number of co-citations within top 20


As it turns out, terminology has only a small effect on the expected number of co-citations for works that have between 6,000 and 8,000 external co-citations. From this point on, the expected number of co-citations grows somewhat more quickly for dyads that share the same terminology. However, over the whole range of 6,000 to 12,000 external co-citations, the confidence intervals overlap and so this difference is not statistically significant.

Unless two titles have a very high number of external co-citations, the probability of them being both cited in a third work does not depend on the terminology they use. Even for the (few) heavily cited works, the evidence is insufficient to reject the null hypothesis that terminology makes no difference.

While the analysis is confined to the relationships between just 20 titles, these are the titles that matter most, because they form the core of European Radical Right studies. If we cannot find separation here, that does not necessarily mean that it does not happen elsewhere, but if it happens elsewhere, it is much less relevant. So: no two schools. Everyone is citing the same prominent stuff, whether the respective authors prefer “Radical” or “Extreme”. Communication happens, which seems good to me.

Are you surprised?

Go to the first part of this mini-series, or read the full article on concepts in European Radical Right research here:

  • Arzheimer, Kai. “Conceptual Confusion is not Always a Bad Thing: The Curious Case of European Radical Right Studies.” Demokratie und Entscheidung. Eds. Marker, Karl, Michael Roseneck, Annette Schmitt, and Jürgen Sirsch. Wiesbaden: Springer, 2018. 23-40. doi:10.1007/978-3-658-24529-0_3