Stata-related posts

Stata is my favourite general-purpose stats package. Sadly, it is also one of my favourite pasttimes, but there you are. Here is my collection of Stata-related blog posts. If this is relevant for you, you might also be interested in a series of slides for a Stata course I taught some years ago (in German)

Mar 312015
 

We have updated our add-on (or ado) surveybias, which calculates our multinomial generalisation of the old Martin, Traugott, and Kennedy (2005) measure for bias. If you have any dichotomous or multinomial variable whose true distribution is known (e.g. from the census, electoral counts, or other official data), surveybias can tell you just how badly damaged your sample really is with respect to that variable. Within Stata, you can install/update surveybias by entering ssc install surveybias. We’ve also created a separate page with more information on how to use surveybias, including a number of worked examples.

The new version is called 1.3b (please don’t ask). New features and improvements include:

  • Support for (some) complex variance estimators including Stata’s survey estimator (sample points, strata, survey weights etc.)
  • Improvements to the numerical approximation. survebias is roughly seven times faster now
  • A new analytical method for simple random samples that is even faster
  • Convenience options for naming variables created by survebiasseries
  • Lots of bug fixes and improvements to the code

If you have a survey, give it a spin.

Mar 122015
 

Contrary to popular belief, it’s not always the third reviewer that gives you grief. In our case, it is the one and only reviewer that shot down a manuscript, because at the very least, s/he would have expected (and I quote) an “analytical derivation of the estimator”. For some odd reason of his own, the editor, instead of simply rejecting us, dared us to do just that, and against all odds, we succeeded after some months of gently banging various heads against assorted walls.

Needless to say that on second thought, the reviewer found the derivation “interesting but unnecessarily complicated” and now recommends relegating the material to a footnote. To make up for this, s/he delved into the code of our software, spotted some glaring mistakes and recommended a few changes (actually sending us a dozen lines of code) that result in a speed gain of some 600 per cent. This is very cool, very good news for end users, very embarrassing for us, and generally wrong on so many levels.

Bonus track: The third reviewer.

 

May 202014
 

The Problem: Assessing Bias without the Data Set

While the interwebs are awash with headline findings from countless surveys, commercial companies (and even some academics) are reluctant to make their raw data available for secondary analysis. But fear not: Quite often, media outlets and aggregator sites publish survey margins, and that is all the information you need. It’s as easy as \pi.

The Solution: surveybiasi

After installing our surveybias add-on for Stata, you will have access to surveybiasi. surveybiasi is an “immediate command” (Stata parlance) that compares the distribution of a categorical variable in a survey to its true distribution in the population. Both distributions need to be specified via the popvalues() and samplevalues() options, respectively. The elements of these two lists may be specified in terms of counts, of percentages, or of relative frequencies, as the list is internally rescaled so that its elements sum up to unity. surveybiasi will happily report k A^{\prime}_{i}s, B and B_{w} (check out our paper for more information on these multinomial measures of bias) for variables with 2 to 12 discrete categories.

Bias in a 2012 CBS/NYT Poll

A week before the 2012 election for the US House of Representatives, 563 likely voters were polled for CBS/The New York Times. 46 per cent said they would vote for the Republican candidate in their district, 48 per cent said they would vote for the Democratic candidate. Three per cent said it would depend, and another two per cent said they were unsure, or refused to answer the question. In the example these five per cent are treated as “other”. Due to rounding error, the numbers do not exactly add up to 100, but surveybiasi takes care of the necessary rescaling.

In the actual election, the Republicans won 47.6 and the Democrats 48.8 per cent of the popular vote, with the rest going to third-party candidates. To see if these differences are significant, run surveybiasi like this:


. surveybiasi , popvalues(47.6 48.8 3.6) samplevalues(46 48 5) n(563)
------------------------------------------------------------------------------
      catvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
A'           |
           1 |  -.0426919   .0844929    -0.51   0.613     -.208295    .1229111
           2 |  -.0123999   .0843284    -0.15   0.883    -.1776805    .1528807
           3 |   .3375101   .1938645     1.74   0.082    -.0424573    .7174776
-------------+----------------------------------------------------------------
B            |
           B |   .1308673   .0768722     1.70   0.089    -.0197994    .2815341
         B_w |   .0385229   .0247117     1.56   0.119    -.0099112    .0869569
------------------------------------------------------------------------------
 
    Ho: no bias
    Degrees of freedom: 2
    Chi-square (Pearson) = 3.0945337
    Pr (Pearson) = .21282887
    Chi-square (LR) = 2.7789278
    Pr (LR) = .24920887


Given the small sample size and the close match between survey and electoral counts, it is not surprising that there is no evidence for statistically or substantively significant bias in this poll.

An alternative approach is to follow Martin, Traugott and Kennedy (2005) and ignore third-party voters, undecided respondents, and refusals. This requires minimal adjustments: n is now 535 as the analytical sample size is reduced by five per cent, while the figures representing the “other” category can simply be dropped. Again, surveybiasiinternally rescales the values accordingly:


. surveybiasi , popvalues(47.6 48.8) samplevalues(46 48) n(535)
------------------------------------------------------------------------------
      catvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
A'           |
           1 |  -.0162297   .0864858    -0.19   0.851    -.1857388    .1532794
           2 |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
-------------+----------------------------------------------------------------
B            |
           B |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
         B_w |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
------------------------------------------------------------------------------
 
    Ho: no bias
    Degrees of freedom: 1
    Chi-square (Pearson) = .03521623
    Pr (Pearson) = .85114329
    Chi-square (LR) = .03521898
    Pr (LR) = .85113753

Under this two-party scenario, A^{\prime}_{1} is identical to Martin, Traugott, and Kennedy’s original A (and all other estimates are identical to A‘s absolute value). Its negative sign points to the (tiny) anti-Republican bias in this poll, which is of course even less significant than in the previous example.

Apr 032014
 

Survey Accuracy

The accuracy of pre-election surveys is a matter of considerable debate. Obviously, any rigorous discussion of bias in opinion polls requires a scalar measure of survey accuracy. Martin, Traugott, and Kennedy (2005) propose such a measure A for the two-party case, and in our own work (Arzheimer/Evans 2014), Jocelyn Evans and I demonstrate how A can be generalised to the multi-party case, giving rise to a new measure B (seriously) and some friends A^{\prime}_{i} and B_w:

    Arzheimer, Kai and Jocelyn Evans. “A New Multinomial Accuracy Measure for Polling Bias.” Political Analysis 22.1 (2014): 31-44. doi:10.1093/pan/mpt012
    [BibTeX] [Abstract] [Download PDF] [HTML] [DATA]
    In this article, we propose a polling accuracy measure for multi-party elections based on a generalization of Martin, Traugott, and Kennedy s two-party predictive accuracy index. Treating polls as random samples of a voting population, we first estimate an intercept only multinomial logit model to provide proportionate odds measures of each party s share of the vote, and thereby both unweighted and weighted averages of these values as a summary index for poll accuracy. We then propose measures for significance testing, and run a series of simulations to assess possible bias from the resulting folded normal distribution across different sample sizes, finding that bias is small even for polls with small samples. We apply our measure to the 2012 French presidential election polls to demonstrate its applicability in tracking overall polling performance across time and polling organizations. Finally, we demonstrate the practical value of our measure by using it as a dependent variable in an explanatory model of polling accuracy, testing the different possible sources of bias in the French data.

    @Article{arzheimer-evans-2013,
    author = {Arzheimer, Kai and Evans, Jocelyn},
    title = {A New Multinomial Accuracy Measure for Polling Bias },
    journal = {Political Analysis},
    year = 2014,
    abstract = {In this article, we propose a polling accuracy measure for
    multi-party elections based on a generalization of Martin,
    Traugott, and Kennedy s two-party predictive accuracy index.
    Treating polls as random samples of a voting population, we first
    estimate an intercept only multinomial logit model to provide
    proportionate odds measures of each party s share of the vote, and
    thereby both unweighted and weighted averages of these values as a
    summary index for poll accuracy. We then propose measures for
    significance testing, and run a series of simulations to assess
    possible bias from the resulting folded normal distribution across
    different sample sizes, finding that bias is small even for polls
    with small samples. We apply our measure to the 2012 French
    presidential election polls to demonstrate its applicability in
    tracking overall polling performance across time and polling
    organizations. Finally, we demonstrate the practical value of our
    measure by using it as a dependent variable in an explanatory model
    of polling accuracy, testing the different possible sources of bias
    in the French data.},
    keywords = {meth-e},
    volume = {22},
    number = {1},
    pages = {31--44},
    url =
    {http://pan.oxfordjournals.org/cgi/reprint/mpt012?ijkey=z9z740VU1fZp331&keytype=ref},
    doi = {10.1093/pan/mpt012},
    data = {http://hdl.handle.net/1902.1/21603},
    html =
    {http://www.kai-arzheimer.com/new-multinomial-accuracy-measure-for-polling-bias}
    }

The Surveybias Software 1.1

Calculating the accuracy measures is a matter of some algebra. Estimating standard errors is a bit trickier but could be done manually by making use of the relationship between A^{\prime}_{i} and the multinomial logistic model on the one hand and Stata’s very powerful implementation of the Delta method on the other. But these calculations are error-prone and become tedious rather quickly. This is why we created a suite of user written programs (surveybias, surveybiasi, and surveybiasseries). They do all the necessary legwork and return the estimates of accuracy, complete with standard errors and statistical tests.

Voter poll
Those Were the DaysFoter.com / CC BY-SA

We have just updated our software. The new version 1.1 of surveybias features some bug fixes, a better mechanism for automagically dealing with convergence problems, better documentation, and a new example data set that compiles information on 152 German pre-election polls conducted between January and September 2013.

Examples, Please?

surveybias comes with example data from the French presidential election 2012 and the German parliamentary election 2013. From within Stata, type help surveybias, help surveybiasi, and help surveybiasseries to see how you can make use of our software. If I can find the time, I will illustrate the use of surveybias in a mini series of blogs over the next week.

Updating Surveybias

The new version 1.1 should appear is now on SSC within the next couple of days or so, but the truly impatient can get it now. In your internet-aware copy of Stata (version 11 or later), type

net from http://www.kai-arzheimer.com/stata

net install surveybias, replace

Or use SSC: ssc install surveybias, replace

Enjoy!

Jan 262014
 

R Package Parallel: How Not to Solve a Problem That Does Not Exist

Somewhat foolishly, my university has granted me access to Mogon: not the god, not the death metal band but rather their supercomputer, which currently holds the 182th spot in the top 500 list of the fastest computers on the planet. It has some 34,000+ cores and more than 80 TB of RAM, but basically it’s just a very large bunch of Linux boxes. That means that I have a rough idea how to handle it, and that it happily runs my native Linux Stata and MPlus (and hopefully Jags) binaries for me. It also has R installed, and this is where my misery began.

I have a lengthy R job that deals with census data. Basically, it looks up the absolute number of minority residents in some 25,000 output areas and their immediate neighbours and calculates a series of percentages from these figures. I think this could in principle be done in Stata, but R provides convenient libraries for dealing with geo-coded data (sp and friends), non-rectangular data structures and all the trappings of a full-featured programming language, so it would be stupid not to make use of it. The only problem is that R is relatively slow and single-threaded, and that my script is what they call embarrassingly parallel: The same trivial function is applied to 33 vectors with 25,000 elements each. Each calculation on a vector takes about eight seconds to complete, which amounts to roughly five minutes in total. Add the time it takes to read in the data and some fairly large lookup-tables (it would be very time-consuming to repeatedly calculate which output area is close enough to each other output area to be considered a neighbour), and we are looking at eight to ten minutes for one run.

Mogon

Mogon. Image Credit: ZDV JGU Mainz

While I do not plan to run this script very often – once the calculations are done and saved, the results can be used in the analysis proper over and over again – I fully expect that I might change some operationalisations, include different variables etc., and so I began toying with the parallel package for R to make use of the many cores suddenly at my disposal.

Twelve hours later, I had learned the basics of the scheduling system (LSF), solved the problem of synching my data between home, office, central, and super-computer, gained some understanding of the way parallel works and otherwise achieved basically nothing: Even the best attempt at running a parallelised version of the script on the supercomputer was a little slower than the serialised version on my very capable office machine (and that is without the time (between 15 and 90 seconds) the scripts spends waiting to be transferred to a suitable node of the cluster). I tried different things: replacing lapply with mclapply, which was slower, regardless of the number of cores; using clusterApply instead of lapply (same result), and forking the 33 serial jobs into the background, which was even worse, presumably because storing the returned values resulted in changes to rather large data structures that were propagated to all cores involved.

Lessons Learned?

So yes, to save a few minutes in a script that I will presumably run not more than four or five times over the next couple of weeks, I spent 12 hours, with zilch results. But at least I learned a few things (apart from the obvious re-iteration of the old ‘never change a half-way running system’ mantra). First, even if it takes eight seconds to do the sums, a vector of 25,000 elements is probably to short to really benefit from shifting the calculations to more cores. While forking should be cheap, the overhead of setting up the additional threads dominates any savings. Second, running jobs in parallel without really understanding what overhead this creates is a stupid idea, and knowing what overhead this creates and how to avoid this is probably not worth the candle (see the above). Third, I can always re-use the infrastructure I’ve created (for more pointless experiments). Forth, my next go at Mogon shall avoid half-baked middle-level parallelisation altogether. Instead I shall combine fine-grained implicit parallelism (built into Stata and Mplus) and very coarse explicit parallelism (by breaking up lengthy scripts into small chunks that can be run independently). Further research is definitively needed.

Nov 222013
 

Measuring Survey Bias

In our recent Political Analysis paper (ungated authors’ version), Jocelyn Evans and I show how Martin, Traugott, and Kennedy’s two-party measure of survey accuracy can be extended to the multi-party case (which is slightly more relevant for comparativists and other people interested in the world outside the US). This extension leads to a series of party-specific measures of bias as well as to two scalar measures of overall survey bias.

Moreover, we demonstrate that our new measures are closely linked to the familiar multinomial logit model (just as the MTK measure is linked to the binomial logit). This demonstration is NOT an exercise in Excruciatingly Boring Algebra. Rather, it leads to a straightforward derivation of standard errors and facilitates the implementation of our methodology in standard statistical packages.

Voter poll
Those Were the DaysFoter.com / CC BY-SA

An Update to Our Free Software

We have programmed such an implementation in Stata, and it should not be too difficult to implement our methodology in R (any volunteers?). Our Stata code has been on SSC for a couple of months now but has recently been significantly updated. The new version 1.0 includes various bug fixes to the existing commands surveybias.ado and surveybiasi.ado, slightly better documentation, two toy data sets that should help you getting started with the methodology, and a new command surveybiasseries.ado.

surveybiasseries facilitates comparisons across a series of (pre-election) polls. It expects a data set in which each row corresponds to margins (predicted vote shares) from a survey. Such a dataset can quickly be constructed from published sources. Access to the original data is not required. surveybiasseries calculates the accuracy measures for each poll and stores them in a set of new variables, which can then be used as depended variable(s) in a model of poll accuracy.

Getting Started with Estimating Survey Bias

The new version of surveybias for Stata should appear be on SSC over the next couple of weeks or so (double check the version number (was 0.65, should now be 1.0) and the release date), but you can install it right now from this website:

net from http://www.kai-arzheimer.com/stata 
net install surveybias

To see the new command in action, try this

use fivefrenchsurveys, replace

will load information from five pre-election polls taken during the French presidential campaign (2012) into memory. The vote shares refer to eight candidates that competed in the first round.

surveybiasseries in 1/3 , popvaria(*true) samplev(fh-other) nvar(N) gen(frenchsurveys)

will calculate our accuracy measures and their standard errors for the first three surveys over the full set of candidates.

surveybiasseries in 4/5, popvariables(fhtrue-mptrue) samplevariables(fh-mp) nvar(N) gen(threeparty)

will calculate bias with respect to the three-party vote (i.e. Hollande, Sarkozy, Le Pen) for surveys no. 4 and 5 (vote shares a automatically rescaled to unity, no recoding required). The new variable names start with “frenchsurveys” and “threeparty” and should be otherwise self-explanatory (i.e. threepartybw is $B_w$ for the three party case, and threepartysebw the corresponding standard error). Feel free to plot and model to your heart’s content.

Jul 172013
 

Replication data for our forthcoming Political Analysis paper on our new, multinomial accuracy measure for bias in opinion surveys (e.g. pre-election polls) has just gone online at the PA dataverse. So if you want to gauge the performance of French pollsters over the 2012 presidential campaign (or cannot wait to re-run our simulations of B_w‘s sampling distribution), download our data and start playing.

A current version of our Stata add-on surveybias is included in the bundle. Alternatively, you can install the software into you personal ado dir by typing ssc install surveybias.

Jul 102013
 
Used Punchcard
BinaryApe / Foter / CC BY

In a recent paper, we derive various multinomial measures of bias in public opinion surveys (e.g. pre-election polls). Put differently, with our methodology, you may calculate a scalar measure of survey bias in multi-party elections.

Thanks to Kit Baum over at Boston College, our Stata add-on surveybias.ado is now available from Statistical Software Components (SSC).  The add-on takes as its argument the name of a categorical variable and said variable’s true distribution in the population. For what it’s worth, the program tries to be smart: surveybias vote, popvalues(900000 1200000 1800000), surveybias vote, popvalues(0.2307692 0.3076923 0.4615385), and surveybias vote, popvalues(23.07692 30.76923 46.15385) should all give the same result.

If you don’t have access to the raw data but want to assess survey bias evident in published figures, there is surveybiasi, an “immediate” command that lets you do stuff like this:  surveybiasi , popvalues(30 40 30) samplevalues(40 40 20) n(1000). Again, you may specify absolute values, relative frequencies, or percentages.

If you want to go ahead and measure survey bias, install surveybias.ado and surveybiasi.ado on your computer by typing ssc install surveybias in your net-aware copy of Stata. And if you use and like our software, please cite our forthcoming Political Analysis paper on the New Multinomial Accuracy Measure for Polling Bias.

Update April 2014: New version 1.1 available

Jun 232013
 

All surveys deviate from the true distributions of the variables, but some more so than others. This is particularly relevant in the context of election studies, where the true distribution of the vote is revealed on election night. Wouldn’t it be nice if one could quantify the bias exhibited by pollster X in their pre-election survey(s), with one single number? Heck, you could even model bias in polls, using RHS variables such as time to election, sample size or sponsor of the survey, coming up with an estimate of the infamous “house effect”,.

Jocelyn Evans and I have developed a method for calculating such a figure by extending Martin, Kennedy and Traugott’s measure A to the multi-party case. Being the very creative chaps we are, we call this new statistic [drumroll] B. We also derive a weighted version of this measure B_w, and statistics to measure bias in favour/against any single party (A'). Of course, our measures can be applied to the sampling of any categorical variable whose distribution is known.

We fully develop all these goodies (and illustrate their usefulness by analysing bias in French pre-election polls) in a paper that
(to our immense satisfaction) has just been accepted for publication in Political Analysis (replication files to follow).

Our module survebias is a Stata ado file that implements these methods. It should become available from SSC over the summer, giving you convenient access to the new methods. I’ll keep you posted.