Stata-related posts

Stata is my favourite general-purpose stats package. Sadly, it is also one of my favourite pastimes, but there you are. Here is my collection of Stata-related blog posts. If this is relevant for you, you might also be interested in a series of slides for a Stata course I taught some years ago (in German).

Jan 02, 2018

Personal blogs are so 1990s, yes?

This is not the late 1990s. Hey, it’s not even the early Naughties, and has not been for a while. I have had my own tiny corner of the Internet (then hosted on university Web space, as was the norm in those days) since Mosaic came under pressure from Netscape and the NYT experimented with releasing content as (I kid you not) PostScript files, because PDF had not been invented yet. I did this mostly because I liked computers, because it was new, and because it provided an excellent distraction from the things I should have been doing. By and large, not much changes over 25 years.


Later (that was before German universities had repositories or policies for such things), my webspace became a useful resource for teaching-related material. Reluctantly and with a certain resentment, I have copied slides and handouts from one site to the next, adding layers of disclaimers instead of leaving them behind, because some of this stuff carries hundreds of decade-old backlinks and gets downloaded / viewed dozens of times each day.

And of course, I started posting pre-publication versions of my papers, boldly ignoring / blissfully ignorant of the legal muddle surrounding the issue back in the day. Call me old-fashioned, but making research visible and accessible is what the Web was invented for.

In summer 2008, I set up my own domain on a woefully underpowered shared webspace (since replaced by an underpowered virtual server). A bit earlier in the same year, already late to the party, I had started my own “Weblog” on, writing and ranting about science, politics, methods, and all that. A year down the road, I converted to wordpress, moved my blog over there, and have never looked back (read: have continuously wondered why I kept doing this).

Why keep blogging?

In those days of old, we had trackbacks and pingbacks & stuff (now a distant memory), and social media was the idea of having a network of interlinking personal blogs, whose authors would comment on each other’s posts. Even back in 2008 on wordpress, my blog was not terribly popular, but for a couple of years, there was a bunch of people who had similar interests, with whom I would interact occasionally.

Then, academically minded multi-author blogs came along, which greatly reduced fragmentation and aimed at making social science accessible for a much bigger audience whilst removing the need to set up and maintain a site. For similar reasons, Facebook and particularly Twitter became perfect outlets for ranting (“microblogging”), while Medium bypasses the fragmentation issue for longer texts and is far more aesthetically pleasing and faster than anything any of us could run by ourselves.


It is therefore only rational that many personal academic blogs died a slow death. People I used to read left academia completely, gave up blogging, or moved on to the newer platforms. Do you remember blogrolls? No, you wouldn’t. Because I’m a dinosaur, I still get my news through an RSS reader (and you should, too). While there are a few exceptions (Chris Blattman and Andrew Gelman spring to mind), most of the sources in my “blog” drawer are run by collectives / institutions (the many LSE blogs, the Monkey Cage, the Duck etc.). I recently learned that I made it into an only slightly dubious-looking list of the top 100 political science blogs, but that is surely because there are not many individual political science bloggers left.
So why am I still rambling in this empty Platonic man-cave? Off the top of my head, I can think of about five reasons:

  1. Total editorial control. I have written for the Monkey Cage, The Conversation, the LSE, and many other outlets. Working with their editors has made my texts much better, but sometimes I am not in the mood for clarity and accessibility. I want to rant, and be quick about it.
  2. Pre-prints. I like to have pre-publication versions of my work on my site, although again, institutional hosting makes much more sense. Once I upload them, I’m usually so happy that I want to say something about it.
  3. For me, my blog is still a bit like an open journal. If I need to remember some sequence of events in German or European politics for the day job, it’s helpful if I have blogged about it as it happened. Similarly, sometimes I work out the solution to some software issue but quickly forget the details. Five months later, a blog post is a handy reference and may help others.
  4. Irrelevance. Often, something annoys or interests me so much that I need to write a short piece about it, although few other people will care. I would have a better chance of finding an audience at Medium, but then again on my own wordpress-powered site, I have a perfectly serviceable CMS which happens to have blogging functionality built in.
  5. Ease of use. I do almost all of my writing in Emacs and keep (almost) all my notes in orgmode code. Thanks to org2blog, turning a few paragraphs into a post is just some hard-to-remember keystrokes away.

Bonus track: the five most popular posts in 2017

As everyone knows, I’m not obsessed with numbers, thank you very much. I keep switching between various types of analytic software and have no idea how much (or rather little) of an audience I actually have. Right now I’m back to the basic wordpress statistics and have been for over a year, so here is the list of the five posts that were the most popular in 2017.


Oct 31, 2015

Here is an update on our work on surveybias.

How can we usefully summarise the accuracy of an election opinion poll compared to the real result of an election? In this blog, we describe a score we have devised to allow people to see how different polls compare in their reflection of the final election result, no matter how many parties or candidates are standing. This index, B, can be compared across time, polling company and even election to provide a simple demonstration of how the polls depicted public opinion in the run-up to polling day.

Source: How to measure opinion poll inaccuracy in elections – The Plot

Oct 08, 2015

Just how badly biased is your pre-election survey? Once the election results are in, our scalar measures B and B_w provide convenient, single-number summaries. Our surveybias add-on for Stata will calculate these and other measures from either raw data or from published margins. Its latest iteration (version 1.4) has just appeared on SSC. Surveybias 1.4 improves on the previous version by ditching the last remnants of the old numerical approximation code for calculating standard errors and is hence much faster in many applications. Install it now from within Stata by typing:

ssc install surveybias


Sep 15, 2015

Last week, my introduction to structural equation modelling was published by Springer/VS. The book shows how the most common models (including simple and multi-group confirmatory factor analyses (CFA/MGCFA)) can be implemented in Stata, Lisrel, and MPlus. The examples come from research on political attitudes (xenophobia, political alienation, political interest …).

All example files can be downloaded here. The book costs €12.99 (ebook) or €17.99 (paperback). Pre-prints of the introduction and the glossary are available here; further sample pages are available directly from Springer.

The following topics are covered in detail:

1 Introduction
  1.1 Why, what for, how come? Structural equation models in political science
  1.2 Structure of the book
  1.3 Conventions
  1.4 Software and internet resources
2 Foundations
  2.1 Matrix algebra
    2.1.1 Dimensions, elements, vectors, submatrices, partitions
    2.1.2 Special matrices
    2.1.3 Simple matrix operations
    2.1.4 Rank and inverse
  2.2 Covariance, correlation, regression
    2.2.1 Covariance: a measure of association between metric variables
    2.2.2 Pearson’s correlation coefficient: a standardised measure of the association between metric
    2.2.3 The linear regression model: a building block for structural equation models
  2.3 Measurement error and factor analysis
  2.4 Causality and path diagrams
    2.4.1 The concept of causality
    2.4.2 Path diagrams
  2.5 The general structural equation model
  2.6 Samples, estimates, strategies
    2.6.1 Reality, model, and data
    2.6.2 Estimation methods
    2.6.3 Identification
    2.6.4 Model comparison: fit indices and hypothesis tests
    2.6.5 Standardised estimates and mean structures
3 Examples and applications
  3.1 Data
  3.2 Confirmatory factor analysis: attitudes towards migrants
  3.3 Group comparison and equivalent measurements
  3.4 Recommendations for analysis and presentation
    3.4.1 Theoretical foundations and specification
    3.4.2 Data selection and preparation
    3.4.3 Model estimation and respecification
    3.4.4 Presentation
4 Advanced topics
  4.1 Categorical variables
    4.1.1 Categorical indicators
    4.1.2 An example: political efficacy
  4.2 Latent growth models
    4.2.1 Growth in interest in the election campaign
    4.2.2 Excursus: latent growth models as multi-level models
  4.3 Outlook and further reading
    4.3.1 Missing data
    4.3.2 Categorical latent variables
    4.3.3 Multi-level structural equation models
5 Outlook and further reading
  5.1 Foundations
  5.2 Introductions
  5.3 Literature on individual programs
  5.4 Journals and handbooks
6 References

Mar 31, 2015

Worried about survey bias?

We have updated our add-on (or ado) surveybias, which calculates our multinomial generalisation of the old Martin, Traugott, and Kennedy (2005) measure for survey bias. If you have any dichotomous or multinomial variable in your survey whose true distribution is known (e.g. from the census, electoral counts, or other official data), surveybias can tell you just how badly damaged your sample really is with respect to that variable. Our software makes it trivially easy to assess bias in any survey.

Within Stata, you can install/update surveybias by entering ssc install surveybias. We’ve also created a separate page with more information on how to use surveybias, including a number of worked examples.

The new version is called 1.3b (please don’t ask). New features and improvements include:

  • Support for (some) complex variance estimators including Stata’s survey estimator (sample points, strata, survey weights etc.)
  • Improvements to the numerical approximation. surveybias is roughly seven times faster now
  • A new analytical method for simple random samples that is even faster
  • Convenience options for naming variables created by surveybiasseries
  • Lots of bug fixes and improvements to the code

If you need to quantify survey bias, give it a spin.

Mar 12, 2015

Contrary to popular belief, it’s not always the third reviewer that gives you grief. In our case, it was the one and only reviewer who shot down a manuscript, because at the very least, s/he would have expected (and I quote) an “analytical derivation of the estimator”. For some odd reason of his own, the editor, instead of simply rejecting us, dared us to do just that, and against all odds, we succeeded after some months of gently banging various heads against assorted walls.

Needless to say, on second thought, the reviewer found the derivation “interesting but unnecessarily complicated” and now recommends relegating the material to a footnote. To make up for this, s/he delved into the code of our software, spotted some glaring mistakes and recommended a few changes (actually sending us a dozen lines of code) that result in a speed gain of some 600 per cent. This is very cool, very good news for end users, very embarrassing for us, and generally wrong on so many levels.

Bonus track: The third reviewer.


May 20, 2014

The Problem: Assessing Bias without the Data Set

While the interwebs are awash with headline findings from countless surveys, commercial companies (and even some academics) are reluctant to make their raw data available for secondary analysis. But fear not: Quite often, media outlets and aggregator sites publish survey margins, and that is all the information you need. It’s as easy as \pi.

The Solution: surveybiasi

After installing our surveybias add-on for Stata, you will have access to surveybiasi. surveybiasi is an “immediate command” (Stata parlance) that compares the distribution of a categorical variable in a survey to its true distribution in the population. Both distributions need to be specified via the popvalues() and samplevalues() options, respectively. The elements of these two lists may be specified in terms of counts, of percentages, or of relative frequencies, as the list is internally rescaled so that its elements sum up to unity. surveybiasi will happily report k A^{\prime}_{i}s, B and B_{w} (check out our paper for more information on these multinomial measures of bias) for variables with 2 to 12 discrete categories.
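The rescaling step is easy to picture with a few lines of Python. This is only an illustrative sketch, not the ado code: the helper name `rescale` and the three-category numbers are made up for the example. It shows that counts, percentages, and relative frequencies describing the same distribution all end up as the same vector of shares.

```python
def rescale(values):
    """Rescale a list of non-negative numbers so that they sum to one."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return [v / total for v in values]


# The same (hypothetical) three-category distribution, specified three ways
counts = [230, 240, 30]
percentages = [46.0, 48.0, 6.0]
shares = [0.46, 0.48, 0.06]

for spec in (counts, percentages, shares):
    # All three specifications yield the same shares after rescaling
    print([round(s, 4) for s in rescale(spec)])
```

This is also why a list such as 46, 48, 5, which only adds up to 99 because of rounding, is unproblematic as input.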

Bias in a 2012 CBS/NYT Poll

A week before the 2012 election for the US House of Representatives, 563 likely voters were polled for CBS/The New York Times. 46 per cent said they would vote for the Republican candidate in their district, 48 per cent said they would vote for the Democratic candidate. Three per cent said it would depend, and another two per cent said they were unsure, or refused to answer the question. In the example these five per cent are treated as “other”. Due to rounding error, the numbers do not exactly add up to 100, but surveybiasi takes care of the necessary rescaling.

In the actual election, the Republicans won 47.6 and the Democrats 48.8 per cent of the popular vote, with the rest going to third-party candidates. To see if these differences are significant, run surveybiasi like this:

. surveybiasi , popvalues(47.6 48.8 3.6) samplevalues(46 48 5) n(563)
      catvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
A'           |
           1 |  -.0426919   .0844929    -0.51   0.613     -.208295    .1229111
           2 |  -.0123999   .0843284    -0.15   0.883    -.1776805    .1528807
           3 |   .3375101   .1938645     1.74   0.082    -.0424573    .7174776
B            |
           B |   .1308673   .0768722     1.70   0.089    -.0197994    .2815341
         B_w |   .0385229   .0247117     1.56   0.119    -.0099112    .0869569
    Ho: no bias
    Degrees of freedom: 2
    Chi-square (Pearson) = 3.0945337
    Pr (Pearson) = .21282887
    Chi-square (LR) = 2.7789278
    Pr (LR) = .24920887

Given the small sample size and the close match between survey and electoral counts, it is not surprising that there is no evidence for statistically or substantively significant bias in this poll.
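For readers who want to see where the point estimates come from, here is a rough Python sketch that recomputes them by hand. It is a reading of the output above rather than the package's code, and it rests on three assumptions: each A′ is the log of the sample odds over the population odds for that category, B is the unweighted mean of the absolute A′ values while B_w weights them by the population shares, and the sample shares are converted into whole numbers of respondents before anything else happens. The function name `bias_measures` is hypothetical.

```python
import math


def bias_measures(popvalues, samplevalues, n):
    """Return (list of A' values, B, B_w) for one categorical variable."""
    pop = [v / sum(popvalues) for v in popvalues]
    # Assumption: shares are turned into whole-number respondent counts
    counts = [round(n * s / sum(samplevalues)) for s in samplevalues]
    total = sum(counts)
    samp = [c / total for c in counts]
    # A'_i: log of (sample odds / population odds) for category i
    a_prime = [math.log((p / (1 - p)) / (v / (1 - v)))
               for p, v in zip(samp, pop)]
    # B: plain average of the absolute values
    b = sum(abs(a) for a in a_prime) / len(a_prime)
    # B_w: absolute values weighted by the population shares
    b_w = sum(v * abs(a) for v, a in zip(pop, a_prime))
    return a_prime, b, b_w


a_prime, b, b_w = bias_measures([47.6, 48.8, 3.6], [46, 48, 5], 563)
print([round(a, 7) for a in a_prime], round(b, 7), round(b_w, 7))
```

Under these assumptions the numbers agree with the Coef. column printed by surveybiasi. Standard errors and tests are a different matter and are not attempted here.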

An alternative approach is to follow Martin, Traugott and Kennedy (2005) and ignore third-party voters, undecided respondents, and refusals. This requires minimal adjustments: n is now 535 as the analytical sample size is reduced by five per cent, while the figures representing the “other” category can simply be dropped. Again, surveybiasi internally rescales the values accordingly:

. surveybiasi , popvalues(47.6 48.8) samplevalues(46 48) n(535)
      catvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
A'           |
           1 |  -.0162297   .0864858    -0.19   0.851    -.1857388    .1532794
           2 |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
B            |
           B |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
         B_w |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
    Ho: no bias
    Degrees of freedom: 1
    Chi-square (Pearson) = .03521623
    Pr (Pearson) = .85114329
    Chi-square (LR) = .03521898
    Pr (LR) = .85113753

Under this two-party scenario, A^{\prime}_{1} is identical to Martin, Traugott, and Kennedy’s original A (and all other estimates are identical to A’s absolute value). Its negative sign points to the (tiny) anti-Republican bias in this poll, which is of course even less significant than in the previous example.
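The two-party identity can be checked with a short calculation. The Python sketch below is illustrative only: `mtk_a` is a made-up helper that takes A as the log of the Republican/Democrat ratio in the sample divided by the same ratio in the electorate, and the respondent counts are an assumption (46:48 applied to n = 535 and rounded to whole persons).

```python
import math


def mtk_a(rep_sample, dem_sample, rep_pop, dem_pop):
    """Log of the sample Republican/Democrat ratio over the population ratio."""
    return math.log((rep_sample / dem_sample) / (rep_pop / dem_pop))


# Whole-number counts implied by 46:48 among n = 535 respondents (assumption)
rep_n = round(535 * 46 / 94)
dem_n = 535 - rep_n

a = mtk_a(rep_n, dem_n, 47.6, 48.8)
# Swapping the roles of the two parties only flips the sign
a_flipped = mtk_a(dem_n, rep_n, 48.8, 47.6)
print(round(a, 7), round(a_flipped, 7))
```

With only two categories, the odds p/(1-p) of one party are simply the ratio of the two parties' shares, which is why A′ for the first category, the mirrored A′ for the second, B, and B_w all collapse onto the same absolute value.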

Apr 03, 2014

Survey Accuracy

The accuracy of pre-election surveys is a matter of considerable debate. Obviously, any rigorous discussion of bias in opinion polls requires a scalar measure of survey accuracy. Martin, Traugott, and Kennedy (2005) propose such a measure A for the two-party case, and in our own work (Arzheimer/Evans 2014), Jocelyn Evans and I demonstrate how A can be generalised to the multi-party case, giving rise to a new measure B (seriously) and some friends A^{\prime}_{i} and B_w:

    Arzheimer, Kai and Jocelyn Evans. “A New Multinomial Accuracy Measure for Polling Bias.” Political Analysis 22.1 (2014): 31–44. doi:10.1093/pan/mpt012

    In this article, we propose a polling accuracy measure for multi-party elections based on a generalization of Martin, Traugott, and Kennedy’s two-party predictive accuracy index. Treating polls as random samples of a voting population, we first estimate an intercept-only multinomial logit model to provide proportionate odds measures of each party’s share of the vote, and thereby both unweighted and weighted averages of these values as a summary index for poll accuracy. We then propose measures for significance testing, and run a series of simulations to assess possible bias from the resulting folded normal distribution across different sample sizes, finding that bias is small even for polls with small samples. We apply our measure to the 2012 French presidential election polls to demonstrate its applicability in tracking overall polling performance across time and polling organizations. Finally, we demonstrate the practical value of our measure by using it as a dependent variable in an explanatory model of polling accuracy, testing the different possible sources of bias in the French data.

The Surveybias Software 1.1

Calculating the accuracy measures is a matter of some algebra. Estimating standard errors is a bit trickier but could be done manually by making use of the relationship between A^{\prime}_{i} and the multinomial logistic model on the one hand and Stata’s very powerful implementation of the Delta method on the other. But these calculations are error-prone and become tedious rather quickly. This is why we created a suite of user-written programs (surveybias, surveybiasi, and surveybiasseries). They do all the necessary legwork and return the estimates of accuracy, complete with standard errors and statistical tests.
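To give a flavour of the manual route for a single category, here is a small Python sketch of the Delta method under simple random sampling. It is not what the ado files do internally, and the helper name `a_prime_with_se` is invented for the example. The assumptions are that A′ is the sample log odds minus the (fixed) population log odds, and that the sample share p has variance p(1-p)/n. The input figures come from the CBS/NYT poll discussed elsewhere on this page: 262 of 563 respondents in the first category (a count inferred from the published percentages), against a population share of 47.6 per cent.

```python
import math


def a_prime_with_se(count, n, pop_share):
    """Return (A', standard error, z) for one category of a simple random sample."""
    p = count / n
    a_prime = math.log(p / (1 - p)) - math.log(pop_share / (1 - pop_share))
    # Delta method: Var(p) = p(1-p)/n and d logit(p)/dp = 1/(p(1-p)),
    # so Var(logit(p)) is approximately 1 / (n * p * (1-p)).
    se = math.sqrt(1.0 / (n * p * (1 - p)))
    return a_prime, se, a_prime / se


a1, se1, z1 = a_prime_with_se(262, 563, 0.476)
print(round(a1, 7), round(se1, 7), round(z1, 2))
```

The standard errors for B and B_w are harder to obtain because these are averages of absolute values, which is part of the reason for leaving the work to the software.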


We have just updated our software. The new version 1.1 of surveybias features some bug fixes, a better mechanism for automagically dealing with convergence problems, better documentation, and a new example data set that compiles information on 152 German pre-election polls conducted between January and September 2013.

Examples, Please?

surveybias comes with example data from the French presidential election 2012 and the German parliamentary election 2013. From within Stata, type help surveybias, help surveybiasi, and help surveybiasseries to see how you can make use of our software. If I can find the time, I will illustrate the use of surveybias in a mini-series of blog posts over the next week.

Updating Surveybias

The new version 1.1 is now on SSC. You can also get it directly: in your internet-aware copy of Stata (version 11 or later), type

net from

net install surveybias, replace

Or use SSC: ssc install surveybias, replace


Jan 26, 2014

R Package Parallel: How Not to Solve a Problem That Does Not Exist

Somewhat foolishly, my university has granted me access to Mogon: not the god, not the death metal band but rather their supercomputer, which currently holds the 182nd spot in the top 500 list of the fastest computers on the planet. It has some 34,000+ cores and more than 80 TB of RAM, but basically it’s just a very large bunch of Linux boxes. That means that I have a rough idea how to handle it, and that it happily runs my native Linux Stata and MPlus (and hopefully JAGS) binaries for me. It also has R installed, and this is where my misery began.

I have a lengthy R job that deals with census data. Basically, it looks up the absolute number of minority residents in some 25,000 output areas and their immediate neighbours and calculates a series of percentages from these figures. I think this could in principle be done in Stata, but R provides convenient libraries for dealing with geo-coded data (sp and friends), non-rectangular data structures and all the trappings of a full-featured programming language, so it would be stupid not to make use of it. The only problem is that R is relatively slow and single-threaded, and that my script is what they call embarrassingly parallel: The same trivial function is applied to 33 vectors with 25,000 elements each. Each calculation on a vector takes about eight seconds to complete, which amounts to a little under five minutes in total (33 × 8 = 264 seconds). Add the time it takes to read in the data and some fairly large lookup-tables (it would be very time-consuming to repeatedly calculate which output area is close enough to each other output area to be considered a neighbour), and we are looking at eight to ten minutes for one run.


Mogon. Image Credit: ZDV JGU Mainz

While I do not plan to run this script very often – once the calculations are done and saved, the results can be used in the analysis proper over and over again – I fully expect that I might change some operationalisations, include different variables etc., and so I began toying with the parallel package for R to make use of the many cores suddenly at my disposal.

Twelve hours later, I had learned the basics of the scheduling system (LSF), solved the problem of synching my data between home, office, central, and super-computer, gained some understanding of the way parallel works and otherwise achieved basically nothing: Even the best attempt at running a parallelised version of the script on the supercomputer was a little slower than the serialised version on my very capable office machine (and that is without the time (between 15 and 90 seconds) the script spends waiting to be transferred to a suitable node of the cluster). I tried different things: replacing lapply with mclapply, which was slower, regardless of the number of cores; using clusterApply instead of lapply (same result), and forking the 33 serial jobs into the background, which was even worse, presumably because storing the returned values resulted in changes to rather large data structures that were propagated to all cores involved.

Lessons Learned?

So yes, to save a few minutes in a script that I will presumably run not more than four or five times over the next couple of weeks, I spent 12 hours, with zilch results. But at least I learned a few things (apart from the obvious re-iteration of the old ‘never change a half-way running system’ mantra). First, even if it takes eight seconds to do the sums, a vector of 25,000 elements is probably too short to really benefit from shifting the calculations to more cores. While forking should be cheap, the overhead of setting up the additional threads dominates any savings. Second, running jobs in parallel without really understanding what overhead this creates is a stupid idea, and knowing what overhead this creates and how to avoid this is probably not worth the candle (see the above). Third, I can always re-use the infrastructure I’ve created (for more pointless experiments). Fourth, my next go at Mogon shall avoid half-baked middle-level parallelisation altogether. Instead I shall combine fine-grained implicit parallelism (built into Stata and Mplus) and very coarse explicit parallelism (by breaking up lengthy scripts into small chunks that can be run independently). Further research is definitely needed.
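With hindsight, a back-of-the-envelope calculation would have shown how little there was to gain. The Python sketch below is purely illustrative (the function `best_case_runtime` is made up) and deliberately generous: it assumes that reading the data and lookup tables cannot be parallelised, that the 33 eight-second tasks split perfectly across workers, and that parallelisation itself costs nothing at all.

```python
import math


def best_case_runtime(fixed_s, n_tasks, task_s, workers):
    """Serial part plus the number of 'rounds' needed to get through all tasks."""
    rounds = math.ceil(n_tasks / workers)
    return fixed_s + rounds * task_s


n_tasks, task_s = 33, 8
serial_compute = n_tasks * task_s           # seconds spent on the 33 vectors

for total_s in (480, 600):                  # eight and ten minutes per run
    fixed_s = total_s - serial_compute      # assumed not to parallelise
    for workers in (1, 4, 33):
        t = best_case_runtime(fixed_s, n_tasks, task_s, workers)
        # total run time, workers, best-case time, best-case speed-up
        print(total_s, workers, t, round(total_s / t, 2))
```

Even with one core per vector and zero overhead, the run would still take roughly four to six minutes, i.e. a speed-up of about a factor of two at best. Once real overhead enters the picture, ending up slower than the serial version is not that surprising.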