Mar 312015

Worried about survey bias?

We have updated our add-on (or ado) surveybias, which calculates our multinomial generalisation of the old Martin, Traugott, and Kennedy (2005) measure for survey bias. If you have any dichotomous or multinomial variable in your survey whose true distribution is known (e.g. from the census, electoral counts, or other official data), surveybias can tell you just how badly damaged your sample really is with respect to that variable. Our software makes it trivially easy to asses bias in any survey.

Within Stata, you can install/update surveybias by entering ssc install surveybias. We’ve also created a separate page with more information on how to use surveybias, including a number of worked examples.survey bias

The new version is called 1.3b (please don’t ask). New features and improvements include:

  • Support for (some) complex variance estimators including Stata’s survey estimator (sample points, strata, survey weights etc.)
  • Improvements to the numerical approximation. survebias is roughly seven times faster now
  • A new analytical method for simple random samples that is even faster
  • Convenience options for naming variables created by survebiasseries
  • Lots of bug fixes and improvements to the code

If you need to quantify survey bias, give it a spin.

Mar 222014
One of my very able PhD students is working on a better instrument for measuring the interaction of national and European identities. Thanks to the generosity of the Fritz Thyssen Stiftung, we can now road-test some of his ideas in a three-wave telephone survey. Fieldwork for the first wave will commence on Monday, and we are rather excited, not least because we are running this survey in our own “studio”, with a large number of student research assistants working as interviewers.

Worldclouds 2009
NASA Earth Observatory / Foter / Public domain

In the past, the university had installed the voxco software in a PC lab that was equipped with headsets and landlines. But the program never worked well and became de facto unusable once the service contract waterminated. Looking for alternatives when we moved into a new building, we came across queXS, an open source CATI software that is based on limesurvey. Limesurvey had worked well for us in the past, so we gave queXS a spin and rather liked it. The only remaining problem was that our IT support could not setup the necessary servers and patch them into the university’s voice over ip infrastructure in time (we want to be in the field well before the Euro 2014 campaign takes off in two weeks or so). So we got in touch with ACSPRI, the Australian Consortium for Social and Political Research Incorporated, which offers access to a Amazon cloud-based installation of queXS that can be rented on a monthly basis for a reasonable fee. ACSPRI also helped us to find a German VOIP provider whose network we will use to place the calls.

Now our “studio” is still based in a university PC lab. But this is mostly an issue of convenience, and of easy supervision. In fact, it could be run on laptops or even tablet computers anywhere on the planet. The software is browser-based and hosted in some unknown, unmarked data centre somewhere. Connectivity to German landlines is provided through software in another data centre, and this whole virtualised infrastructure is supported and maintained from the other end of the world. Apart from the headsets, the only tangible part of the studio is a bunch of pen-drives that hold the interviewers’ access codes. Eerie, isn’t it?

The tests went well, but will it work in practice? I’ll keep you posted.

Jul 102013
Used Punchcard
BinaryApe / Foter / CC BY

In a recent paper, we derive various multinomial measures of bias in public opinion surveys (e.g. pre-election polls). Put differently, with our methodology, you may calculate a scalar measure of survey bias in multi-party elections.

Thanks to Kit Baum over at Boston College, our Stata add-on surveybias.ado is now available from Statistical Software Components (SSC).  The add-on takes as its argument the name of a categorical variable and said variable’s true distribution in the population. For what it’s worth, the program tries to be smart: surveybias vote, popvalues(900000 1200000 1800000), surveybias vote, popvalues(0.2307692 0.3076923 0.4615385), and surveybias vote, popvalues(23.07692 30.76923 46.15385) should all give the same result.

If you don’t have access to the raw data but want to assess survey bias evident in published figures, there is surveybiasi, an “immediate” command that lets you do stuff like this:  surveybiasi , popvalues(30 40 30) samplevalues(40 40 20) n(1000). Again, you may specify absolute values, relative frequencies, or percentages.

If you want to go ahead and measure survey bias, install surveybias.ado and surveybiasi.ado on your computer by typing ssc install surveybias in your net-aware copy of Stata. And if you use and like our software, please cite our forthcoming Political Analysis paper on the New Multinomial Accuracy Measure for Polling Bias.

Update April 2014: New version 1.1 available

Apr 262012
For our piece on distance effects in English elections we geocoded the addresses of hundreds of candidates. For the un-initiated: Geocoding is the fine art of converting addresses into geographical coordinates (longitude and latitude). Thanks to Google and some other providers like OpenStreeMap, this is now a relatively painless process. But when one needs more than a few addresses geocoded, one does not rely on pointing-and-clicking. One needs an API, i.e. a software library that makes the service accessible through R, Python or some other programming language.

The upside is that I learned a bit about the wonders of Python in general and the charms of geopy in particular. The downside is that writing a simple script that takes a number of strings from a Stata file, converts them into coordinates and gets them back into Stata took longer than I ever thought possible. Just now, I’ve learned about a possible shortcut (via the excellent data monkey blog): geocode is a user-written Stata command that takes a variable containing address strings and returns two new variables containing the latitude/longitude information. Now that would have been a bit of a time-saver. You can install geocode by typing

net from
net install dm0053

There is, however, one potential drawback: Google limits the number of free queries per day (and possibly per minute). Via Python, you can easily stagger your requests, and you can also use an API key that is supposed to give you a bigger quota. Geocoding a large number of addresses from Stata in one go, on the other hand, will probably result in an equally large number of parsing errors.

Jan 212012
In the past, I did a lot of multi-level modelling with MLwiN 2.02, which I quickly learned to loath. Back in the late 1990s, MLwiN was perhaps the first ML software that had a somewhat intuitive interface, i.e. it allowed one to build a model by pointing and clicking. Moreover, it printed updated estimates on the screen while cycling merrily through the parameter space. That was sort of cool, as it could take minutes to reach convergence, and without the updating, one would never have been sure that the program had not crashed yet. Which it did quite often, even for simple models.

Worse than the bugs was the lack of proper scriptability. Pointing and clicking  loses its appeal when you need to run the same model on 12 different datasets, or when you are looking at three variants of the same model and 10 recodes of the same variable. Throw in the desire to semi-automatically re-compile the findings from these exercises into two nice tables for inclusion in LaTeX again and again after finding yet another problem with a model, and you will agree that any  piece of software that is not scriptable is pretty useless for scientists.

MLwiN’s command language was unreliable and woefully underdocumented, and everything was a pain. So I embraced xtmixed when it came along with Stata 9/10, which solved all of these problems.

Running MLwiN from within Stata 1

runmlwin presentation (pdf)

But xtmixed is slow with large datsets/complex models. It relies on quadrature, which is exact but computationally intensive. MLwiN works with approximations of the likelihood function (quick and dirty) or MCMC (strictly speaking a Bayesian approach, but people don’t ask to many questions because it tends to be faster than quadrature). Moreover, MLwiN can run a lot of fancy models that xtmixed cannot, because it is a highly specialised program that has been around for a very long time.

Enter the good people over at the Centre for Multilevel Modelling at Bristol, who have come up with runmlwin, an ado that essentially makes the functionality of MLwiN available as a Stata command, postestimation analysis and all. Can’t wait to see if this works with Linux, wine and my ancient binaries, too.

Mar 212009
MLwiN is one of the granddaddies of multi-level modelling software (the other being HLM).  Essentially, it is a 1990s-ish looking and sometimes quirky GUI wrapped around  an old DOS program (MLn). The one feature that set MLwiN apart in the late 1990s is point-and-click interface that allows you to build the equations for a multi-level in a stepwise fashion. The underlying command language is still slightly confusing and less than well documented, and some of the modern features (such as modelling categorical dependent variables) are implemented as external macros, which does not need to concern you unless something goes horribly wrong, which happens occassionally.

That said, MLwiN is reasonably fast, does now incorporate modern MCMC estimators, has an interface with WINBUGS and can be convinced to do most things you would possibly want to do with it.  I bought version 1.10 ca. 1998, received free upgrades to 2.02 and good support well until 2004/2005 or so.  These days, Stata, R and MPlus can all estimate multi-level models, but working with MLwiN may still be worthwhile for you (by the way, you can download the free stata2mlwin addon from UCLA academic technology to export your variables from Stata to MLwiN).

Rather amazingly, MLwiN is now freely available for anyone working in UK universities: just enter your details including your, and few days later, they will send you a download link.

Jul 082008
Our project on social (citation and collaboration) networks in British and German political science involves networks with hundreds and thousands of nodes (scientists and articles). At the moment, our data come from the Social Science Citation Index (part of the ISI web of knowledge), and we use a bundle of rather eclectic (erratic?) scripts written in Perl to convert the ISI records into something that programs like Pajek or Stata can read. Some canned solutions (Wos2pajek, network workbench, bibexcel) are available for free, but I was not aware of them when I started this project, did not manage to install them properly, or was not happy with the results. Perl is the Swiss Army Chainsaw (TM) for data pre-processing, incredibly powerful (my scripts are typically less than 50 lines, and I am not an efficient programmer), and every time I want to do something in a slightly different way (i.e. I spot a bug), all I have to do is to change a few lines in the scripts.
After trying a lot of other programs available on the internet, we have chosen Pajek for doing the analyses and producing those intriguing graphs of cliques and inner circles in Political Science. Pajek is closed source but free for non-commercial use and runs on Windows or (via wine) Linux. It is very fast, can (unlike many other programs) easily handle very large networks, produces decent graphs and does many standard analyses. Its user interface may be slightly less than straightforward but I got used to it rather quickly, and it even has basic scripting capacities.

The only thing that is missing is a proper manual, but even this is not really a problem since Pajek’s creators have written a very accessible introduction to social network analysis that doubles up as documentation for the program (order from,, However, Pajek has been under constant development since the 1990s (!) and has acquired a lot of new features since the book was published. Some of them are documented in an appendix, others are simply listed in the very short document that is the official manual for Pajek. You will want to go through the many presentations which are available via the Pajek wiki.

Of course, there is much more software available, often at no cost. If you do program Java or Python (I don’t), there are several libraries available that look very promising. Amongst the stand-alone programs, visone stands out because it can easily produce very attractive-looking graphs of small networks. Even more software has been developed in the context of other sciences that have an interest in networks (chemistry, biology, engineering etc.)
Here is a rather messy collection of links to sna software. Generally, you will want something that is more systematic and informative. Ines Mergel has recently launched a bid for creating a comprehensive software list on wikipedia. The resulting page on social network analysis software is obviously work in progress but provides very valuable guidance.

Technorati-Tags: sna, software, political science, network, analysis, perl, citation, bibliometrics, networks, social, social networks

Software for Social Network Analysis: Pajek and Friends 2