Jan 08 2022
The Stata idiom capture quietly suppresses any output from the subsequent command and silently ignores even critical failures. Your script soldiers on, and you are none the wiser. I have always thought this a wonderful metaphor for organisational behaviour.
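For non-Stata readers, the effect is easy to mimic in the shell (an analogue for illustration, not Stata itself): throw away all output and override the exit status.

```shell
# Rough shell analogue of Stata's "capture quietly": discard stdout and
# stderr, swallow the failure, and keep going as if nothing happened.
grep pattern /no/such/file >/dev/null 2>&1 || true
echo "script soldiers on"
```

The grep fails loudly on its own; wrapped like this, neither the error message nor the non-zero exit status ever reaches you.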

In unrelated news, every other summer, StataCorp comes up with a new version of its product. Every other summer, I succumb to some Pavlovian reflex and decide to spend some institutional money on upgrading my unit’s licences for some interesting but usually quite marginal benefits.

It is the same story in other units and departments, and by coordinating and pooling our orders, we can get substantial discounts. And so, come autumn, the university’s IT centre is collating expressions of interest and communicating tentative prices, going back and forth until some equilibrium is reached. From then on, it can still take months until the new licences arrive, in spite of shipments being just codes and downloads now. Yesterday, I realised that Stata 17 came out in April, i.e. nine months ago, and so decided to find out what had happened to our order. As it turned out, the IT centre required our charge codes to proceed, but had never bothered to ask for them.

Apr 27 2021

Working with repeated comparative survey data – almost a how-to

There is now a bonanza of studies that rely on surveys which are replicated across countries and time, often with fairly short intervals, with the ESS arguably one of the most prominent examples (but also see the “barometer” studies in various regions). Multi-level analysis is now the weapon of choice to tackle these data, but the appropriate structure of such models is not immediately obvious: are we looking at waves nested in countries? Countries nested in waves? Or rather at surveys cross-classified by year and country? What’s the role of the small-n problem when we are talking about countries? And does the notion of sampling even make sense when we are talking about what is effectively the whole population of countries that could be studied?

  • Schmidt-Catran, A. W., & Fairbrother, M. (2016). The random effects in multilevel models: getting them wrong and getting them right. European Sociological Review, 32(1), 23–38. http://dx.doi.org/10.1093/esr/jcv090
  • Schmidt-Catran, A. W., Fairbrother, M., & Andreß, H.-J. (2019). Multilevel models for the analysis of comparative survey data: common problems and some solutions. KZfSS Kölner Zeitschrift für Soziologie und Sozialpsychologie, 71(1), 99–128. http://dx.doi.org/10.1007/s11577-019-00607-9

What we liked

It’s difficult to have a discussion about a text that provides a lot of factual information about methodological bits and bobs, especially when you have little prior knowledge. Having said that, students found both texts (which are related but complementary) remarkably accessible and helpful.

Schmidt-Catran, Fairbrother, Andreß 2019: 112

Sad but true: comparative analysis is hard, and multi-level models are no panacea. Nothing ever is. Bugger.

What we did not like so much

Nothing. Students liked these two. So did I. Period.

Mar 19 2021

I began putting stuff on the internet for fun and non-profit at some point in the previous millennium. In 2008, almost exactly 13 years ago, I registered this domain. After eight years of mostly uneventful but very slow shared hosting with a tiny company somewhere in Germany’s Wild East, I upgraded to a cheapo virtual private server (VPS) hosted by OVH, a French company that seemed less evil than The Big Americans. A VPS is a modern wonder: it is a very modest server, complete with processor, memory, disk etc., that is simulated alongside many others of its kind by a truly powerful server. It is the ghost in the machine, or rather the ghost of a machine, in the machine.

Running this site on the VPS was faster, and clearly more fun. I’ve failed at some things over the decades, including a stint as a sysadmin. Put differently, having a (virtual) server at my disposal led to some unproductive but interesting distractions, including the robot. From previous lives full of painful experience, I was well aware that it is crucial to have multiple off-site backups. And so I cobbled together some script that seemed to work reasonably well.

Everything was ticking along nicely (unless I messed things up because I had the urge to fiddle) until last week. On March 10, the virtual and the real hardware and everything it contained were spontaneously uploaded to the upper left corner of this image.

Cloud computing and backups

OVH lost half a data centre in a blast that left more than four million sites offline, many of them unrecoverable. Including mine, natch. So I remembered that half-forgotten script. Somewhat obviously, it had stopped working when I changed a password without thinking of the ripple effects that this might have. Thankfully, I had lost only eight months of changes or so. And even those, I got (mostly) back thanks to the wonder that is the WayBack Machine.

All in all, I was stupid but lucky. Right now, I check that the backups work every couple of days. Knowing myself, I will have completely forgotten about this disaster in approximately 189 days.

Nov 29 2020

Why yes, of course nothing says “memefy me” like a series of online lectures that everybody wants to fast-forward. And I have the tweets to prove it.

So I’m teaching a mandatory stats/methods class (always popular). Online. Following the advice from my own kids, I have memified the outline. For your own syllabus needs, here is the week-by-week program.

Apr 24 2019
How the tidyverse changed my view of #rstats

Back in the mists of time, when I should have been working on my PhD, I found a blue book on the shelf that a previous occupant of the office had left there. As I learned later, it was The Blue Book that introduced the S language, the predecessor of R. I got sidetracked (as you do) and taught myself how to produce beautiful graphs in what is now known as base R, and how to run poorly understood time series analyses (impossible in SPSS at this point).

A little later, I got hooked on Stata, and to the present day, I refuse to be Stata-shamed, as Ben Stanley put it. 95 per cent of the time, it does the job, and quickly so. Also, the documentation is simply excellent.

But every now and then, I came back to R because I needed something specific. And it was mostly fun. Having access to all these APIs (in fact, concurrently having more than one data set in memory) was exciting. Having a real, reasonably straightforward scripting/programming language at my disposal instead of Stata’s hodgepodge of three (four if you count the graph language) half-baked syntaxes was exhilarating. Having a go at the latest methods on the basis of nothing more than skimming a working paper (skipping every non-trivial equation) was… I guess a little bit like trimming your hair with a chainsaw.

But finding, installing, updating and then loading three packages, just to make recoding a little more intuitive? Seriously, R? Not so cool. In fact, finding a variable (whose name and data set must be given in full) was usually enough to reduce me to tears. attach() somehow never does what I think it should do. And so, I would return to Stata once more, like <insert awkward metaphor>.

Then, during one of my last forays, I began playing with the tidyverse. And as the young ones are prone to say: my mind was blown. Tibbles! Pipelines! Lots of yummy helper functions! Going from long to wide format and back (in various different ways)! Grouping, summarising, and even some pythonesque list traversing. This was no longer the fascinating but slightly stroppy R I used to know.

Compared to the handful of letters and abbreviations that I use in Stata to get things done, recoding-wise, this is still quite verbose, and I have to look up just about everything. But I really like it. Like, really like it. And so doing more stuff in R is firmly on the endless List Of Things I Want To Look Into. To end on the most positive note possible, here is a gratuitous picture of a cat.


Mar 04 2019
Wakelet as a tool for archiving online debates on (academic) events

Wakelet – what is it, and why should academics care to “curate” tweets about events? Bear with me for a second.

The sad state of curating and social storytelling

Until about a year ago, there was storify.com. Their business idea was that people would “curate” tweets, Facebook posts and other stuff found on social media to narrate stories on the interwebs.

It is a truth universally acknowledged that the idea of “curating” stuff as a mass phenomenon is industrial-grade bullshit. No one wants hordes of people linking half-read stuff together in a bid to be completely ignored by even more people. And so storify was acquired by Livefyre, which was in turn purchased by Adobe, and the whole “curating” business moved away from the masses into the realm of enterprise customers.

Why would a researcher ever think about social storytelling?

My scepticism aside, there was at least one use case for storify in academia. When Prof Jane Ordinary is organising any sort of event these days, it is in her and other people’s interest to create a bit of a social media buzz. It is not just outreach and stuff: Jane wants to project at least a vague sense of awareness of her project into the wider world, and journalists and other researchers who would never read a four-page press release may well want to follow parts of the debate in an informal setting.

The problem here: by its nature, social media is ephemeral. After the event, any buzz will be buried under billions and billions of newer posts. And even during the event, the silo-like structure of the current social mediascape as well as the frequent failure to agree on a single hashtag for smaller events makes it very difficult to get an overview of what people are saying online. Here, storify was useful, because one could link every (presentable) post into a story. Then, one (or one’s capable RA) could share the whole shebang or embed it into a more durable web page, either after or during the event.

Clearly a wake, not a wakelet

Photo by MadeByMark

From storify to wakelet

Looking for a replacement for storify to archive (curate??? seriously???) the online/offline story of the policy dialogue that we organised last week, I came across wakelet (apparently, giving your product a dorky name is still a thing in Silicon Valley). Wakelet does everything that storify did, and then a bit more. Basically, everything that has a URL can be linked into a “collection” (also called a wakelet). Tweets and videos get a special treatment: they appear in a “native” format, i.e. as a tweetbox or within a video player, respectively. It is possible to add images and texts, too.

Wakelet is sometimes a bit rough around the edges: I had to press reload a couple of times after re-ordering elements for everything to reappear, and wakelets could load a bit quicker. But nonetheless, wakelet very elegantly plugs this particular gap.

What I don’t see, however, is a sustainable mass-market business model. Currently, the service is free for anyone who wants to showcase something. Interleaving collections with adverts would defy the showcasing aspect. But I don’t see that casual users would be willing to pay for a subscription. And so, in the medium term, it’s turning into another enterprise service or going bust, I presume. But for the time being, wakelet is a useful, if highly specialised, addition to the academic toolbox.

Policy Dialogue: immigration, local decline, the Radical Right & wakelet

Within our ORA project SCoRE, we look into the relationships between local decline, local levels of immigration, immigrant sentiment, and (radical right) voting. Obviously, our findings have (or should have) implications for public policy. And so we organised an event at the European Policy Centre in Brussels. We had a great panel, a sizable crowd of interested folks, and distributed about 100 copies of our policy brief. And then it was over.

But if you are interested in what the speakers said, how people reacted, and what it was like, simply browse the wakelet that I embed below this post. At least until some other, more profitable company buys them.

Mar 02 2019
Every remotely relevant reference I came across during the last 15 years or so resides in a single bibtex file. That is not a problem. The problem is that I’m moving into a shiny, new but somewhat smaller office, together with hundreds of copies of journal articles and hundreds of PDFs. Wouldn’t it be good to know which physical copies are effectively redundant (unreadable comments in the margins aside) and can therefore stay behind?

The trouble is that bibtex files have a rather flexible, human-readable format. Each entry begins with the @ sign, followed by a type (book, article etc.), a reference name, lots of key/value pairs (fields) in arbitrary order, and even more curly braces.
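For illustration, a made-up entry in this shape (all values invented), including the custom binder field that tracks where a physical copy lives:

```bibtex
@article{smith2001,
  author  = {Smith, Jane},
  title   = {A Made-Up Example},
  journal = {Journal of Illustrative Results},
  year    = {2001},
  binder  = {12},
  file    = {smith2001.pdf}
}
```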

grep @ full.bib|wc -l tells me that I have 2914 references in total. grep binder|wc -l (binder is a custom field that I use to keep track of the location of my copies) shows that I have printed out/copied 712 texts over the years, and grep file|wc -l indicates that there are 504 PDFs residing on my filesystem. But what is the magnitude of the intersection?
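The counting logic, demonstrated on an invented three-entry toy file (the real numbers above come from full.bib); grep -c is shorthand for piping grep into wc -l:

```shell
# Three made-up entries standing in for full.bib
cat > toy.bib <<'EOF'
@article{a1, author={A}, binder={3}, file={a1.pdf}}
@article{a2, author={B}, binder={1}}
@book{b1, author={C}, file={b1.pdf}}
EOF

grep -c @ toy.bib       # total entries: 3
grep -c binder toy.bib  # entries with a physical copy: 2
grep -c file toy.bib    # entries with a PDF: 2
```

In a real, multi-line bibtex file each field sits on a line of its own, so the per-field greps still count entries correctly; only values that happen to contain the field name would throw them off.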

My first inclination was to look for a suitable Python parser/library. Pybtex looked good in principle but is underdocumented and had trouble reading full.bib, because that is encoded in Latin-1. So endless hours of amateurish coding and procrastination lay ahead. Then I remembered the “do one thing, and do it really well” mantra of old. Enter bibtool, which is a fast and reasonably stable bibtex file filter and pretty printer. Bibtool reads “resource files”, which are really just short scripts containing filtering/formatting directives. select = {binder ".+"} keeps those references whose “binder” field contains at least one character (.+ is a regular expression that matches any non-empty string). select = {file ".+"} selects all references for which I have a PDF. But bibtool applies a logical OR to these conditions, while I’m interested in finding those references that meet both criteria.
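Spelled out (filenames as used in the post), find-binder.rsc would contain nothing but the single line

```
select = {binder ".+"}
```

and find-pdf.rsc the analogous select = {file ".+"} line.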

The quick solution is to store each statement in a file of its own and apply bibtool twice, using a pipeline for extra efficiency: bibtool -r find-binder.rsc full.bib|bibtool -r find-pdf >intersection.bib does the trick and solves my problem in under a minute, without any coding.
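If bibtool is not at hand, the AND-intersection can be approximated with awk by treating each @-prefixed entry as one record. This is a rough sketch on an invented file, and it naively assumes the strings “binder” and “file” only ever occur as field names:

```shell
cat > sample.bib <<'EOF'
@article{a1, author={A}, binder={3}, file={a1.pdf}}
@article{a2, author={B}, binder={1}}
@book{b1, author={C}, file={b1.pdf}}
EOF

# Split records on "@" and count those mentioning both fields
awk 'BEGIN{RS="@"} /binder/ && /file/ {n++} END{print n+0}' sample.bib  # prints 1
```

Only the first entry has both fields, so the count is 1; on full.bib the same one-liner would report the 65 overlapping references.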

As it turns out, there were just 65 references in both groups. Apparently, I stopped printing (or at least filing away) some time ago. Eventually, I binned two copies, but it is the principle that matters.

2019 Update

I still use bibtool for quick filtering/reformatting tasks at the command line, but for more complex jobs involving programmatic access to bibtex files from R, RefManageR is a wonderful package. I have used it here in a bibliometric study of the Radical/Extreme Right literature. And my nifty RRResRobot also relies heavily on RefManageR. If you are interested at all in RefManageR, here is a short and sweet introduction.