Stata-related posts

Stata is my favourite general-purpose stats package. Sadly, it is also one of my favourite pastimes, but there you are. Here is my collection of Stata-related blog posts. If this is relevant for you, you might also be interested in a series of slides for a Stata course I taught some years ago (in German).

May 20, 2014
 

The Problem: Assessing Bias without the Data Set

While the interwebs are awash with headline findings from countless surveys, commercial companies (and even some academics) are reluctant to make their raw data available for secondary analysis. But fear not: Quite often, media outlets and aggregator sites publish survey margins, and that is all the information you need. It’s as easy as $latex \pi$.

The Solution: surveybiasi

After installing our surveybias add-on for Stata, you will have access to surveybiasi. surveybiasi is an “immediate command” (Stata parlance) that compares the distribution of a categorical variable in a survey to its true distribution in the population. The two distributions are specified via the samplevalues() and popvalues() options, respectively. The elements of these two lists may be given as counts, percentages, or relative frequencies, because each list is internally rescaled so that its elements sum to unity. For variables with 2 to 12 discrete categories, surveybiasi will happily report k $latex A^{\prime}_{i}$s, $latex B$ and $latex B_{w}$ (check out our paper for more information on these multinomial measures of bias).
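
In compact form, the three measures can be summarised as follows. This is a sketch that is consistent with the example output further down rather than a substitute for the derivation in the paper; the symbols are chosen here for illustration, with $latex p_{i}$ the share of category i in the sample, $latex v_{i}$ its share in the population, and $latex k$ the number of categories:

$latex A^{\prime}_{i} = \ln\left(\frac{p_{i}/(1-p_{i})}{v_{i}/(1-v_{i})}\right)$

$latex B = \frac{1}{k}\sum_{i=1}^{k}\left|A^{\prime}_{i}\right|$

$latex B_{w} = \sum_{i=1}^{k} v_{i}\left|A^{\prime}_{i}\right|$

A positive $latex A^{\prime}_{i}$ means that category i is over-represented in the sample, a negative one that it is under-represented.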

Bias in a 2012 CBS/NYT Poll

A week before the 2012 election for the US House of Representatives, 563 likely voters were polled for CBS/The New York Times. 46 per cent said they would vote for the Republican candidate in their district, 48 per cent said they would vote for the Democratic candidate. Three per cent said it would depend, and another two per cent said they were unsure, or refused to answer the question. In the example these five per cent are treated as “other”. Due to rounding error, the numbers do not exactly add up to 100, but surveybiasi takes care of the necessary rescaling.

In the actual election, the Republicans won 47.6 and the Democrats 48.8 per cent of the popular vote, with the rest going to third-party candidates. To see if these differences are significant, run surveybiasi like this:


. surveybiasi , popvalues(47.6 48.8 3.6) samplevalues(46 48 5) n(563)
------------------------------------------------------------------------------
      catvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
A'           |
           1 |  -.0426919   .0844929    -0.51   0.613     -.208295    .1229111
           2 |  -.0123999   .0843284    -0.15   0.883    -.1776805    .1528807
           3 |   .3375101   .1938645     1.74   0.082    -.0424573    .7174776
-------------+----------------------------------------------------------------
B            |
           B |   .1308673   .0768722     1.70   0.089    -.0197994    .2815341
         B_w |   .0385229   .0247117     1.56   0.119    -.0099112    .0869569
------------------------------------------------------------------------------
 
    Ho: no bias
    Degrees of freedom: 2
    Chi-square (Pearson) = 3.0945337
    Pr (Pearson) = .21282887
    Chi-square (LR) = 2.7789278
    Pr (LR) = .24920887


Given the small sample size and the close match between survey and electoral counts, it is not surprising that there is no evidence for statistically or substantively significant bias in this poll.
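
The point estimates in this table can be checked by hand. The sketch below is in Python rather than Stata and is not how surveybiasi is implemented. It assumes that each $latex A^{\prime}_{i}$ is the log of the ratio between the sample odds and the population odds of category i, that $latex B$ is the unweighted mean of their absolute values, and that $latex B_{w}$ weights them by population share. It also assumes that the rescaled sample shares are turned into whole respondents before the odds are formed, which is a guess that happens to reproduce the digits printed above. The function names are made up for this illustration.

```python
import math


def rescale(values):
    """Rescale counts, percentages, or proportions so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def bias_measures(popvalues, samplevalues, n):
    """Return (list of A'_i, B, B_w) for one poll, given margins and sample size."""
    pop = rescale(popvalues)
    # Assumption: implied category counts are rounded to whole respondents.
    counts = [round(share * n) for share in rescale(samplevalues)]
    total = sum(counts)
    a_prime = []
    for v, c in zip(pop, counts):
        sample_odds = c / (total - c)   # category i versus all other categories
        pop_odds = v / (1 - v)
        a_prime.append(math.log(sample_odds / pop_odds))
    b = sum(abs(a) for a in a_prime) / len(a_prime)        # unweighted mean
    b_w = sum(v * abs(a) for v, a in zip(pop, a_prime))    # weighted by population share
    return a_prime, b, b_w


a_prime, b, b_w = bias_measures([47.6, 48.8, 3.6], [46, 48, 5], 563)
print([round(a, 7) for a in a_prime], round(b, 7), round(b_w, 7))
```

Standard errors and the chi-square tests are deliberately left out of this sketch.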

An alternative approach is to follow Martin, Traugott and Kennedy (2005) and ignore third-party voters, undecided respondents, and refusals. This requires minimal adjustments: $latex n$ is now 535 as the analytical sample size is reduced by five per cent, while the figures representing the “other” category can simply be dropped. Again, surveybiasi internally rescales the values accordingly:


. surveybiasi , popvalues(47.6 48.8) samplevalues(46 48) n(535)
------------------------------------------------------------------------------
      catvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
A'           |
           1 |  -.0162297   .0864858    -0.19   0.851    -.1857388    .1532794
           2 |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
-------------+----------------------------------------------------------------
B            |
           B |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
         B_w |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
------------------------------------------------------------------------------
 
    Ho: no bias
    Degrees of freedom: 1
    Chi-square (Pearson) = .03521623
    Pr (Pearson) = .85114329
    Chi-square (LR) = .03521898
    Pr (LR) = .85113753

Under this two-party scenario, $latex A^{\prime}_{1}$ is identical to Martin, Traugott, and Kennedy’s original $latex A$ (and all other estimates are identical to $latex A$’s absolute value). Its negative sign points to the (tiny) anti-Republican bias in this poll, which is of course even less significant than in the previous example.
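
The equivalence is easy to verify, because the original $latex A$ is just the log of the ratio between the Republican-to-Democrat odds in the poll and in the election result. The Python sketch below computes it for the two-party figures. As an assumption on my part, the rescaled shares are converted into whole respondents first (262 and 273 out of 535), and mtk_a is a name invented for this example.

```python
import math


def mtk_a(rep_poll, dem_poll, rep_vote, dem_vote):
    """Log odds ratio of poll versus result; negative values mean the poll understates the Republicans."""
    return math.log((rep_poll / dem_poll) / (rep_vote / dem_vote))


n = 535
shares = [46 / (46 + 48), 48 / (46 + 48)]
# Assumption: shares are turned into whole respondents before the odds are formed.
rep, dem = (round(s * n) for s in shares)
a = mtk_a(rep, dem, 47.6, 48.8)
print(rep, dem, round(a, 7))
```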

Apr 3, 2014
 

Survey Accuracy

The accuracy of pre-election surveys is a matter of considerable debate. Obviously, any rigorous discussion of bias in opinion polls requires a scalar measure of survey accuracy. Martin, Traugott, and Kennedy (2005) propose such a measure $latex A$ for the two-party case, and in our own work (Arzheimer/Evans 2014), Jocelyn Evans and I demonstrate how $latex A$ can be generalised to the multi-party case, giving rise to a new measure $latex B$ (seriously) and some friends $latex A^{\prime}_{i}$ and $latex B_w$:

    Arzheimer, Kai and Jocelyn Evans. “A New Multinomial Accuracy Measure for Polling Bias.” Political Analysis 22.1 (2014): 31-44. doi:10.1093/pan/mpt012
    In this article, we propose a polling accuracy measure for multi-party elections based on a generalization of Martin, Traugott, and Kennedy’s two-party predictive accuracy index. Treating polls as random samples of a voting population, we first estimate an intercept-only multinomial logit model to provide proportionate odds measures of each party’s share of the vote, and thereby both unweighted and weighted averages of these values as a summary index for poll accuracy. We then propose measures for significance testing, and run a series of simulations to assess possible bias from the resulting folded normal distribution across different sample sizes, finding that bias is small even for polls with small samples. We apply our measure to the 2012 French presidential election polls to demonstrate its applicability in tracking overall polling performance across time and polling organizations. Finally, we demonstrate the practical value of our measure by using it as a dependent variable in an explanatory model of polling accuracy, testing the different possible sources of bias in the French data.

    @Article{arzheimer-evans-2013,
    author = {Arzheimer, Kai and Evans, Jocelyn},
    title = {A New Multinomial Accuracy Measure for Polling Bias },
    journal = {Political Analysis},
    year = 2014,
    keywords = {meth-e},
    volume = {22},
    number = {1},
    pages = {31--44},
    url =
    {http://pan.oxfordjournals.org/cgi/reprint/mpt012?ijkey=z9z740VU1fZp331&keytype=ref},
    doi = {10.1093/pan/mpt012},
    data = {http://hdl.handle.net/1902.1/21603},
    html =
    {http://www.kai-arzheimer.com/new-multinomial-accuracy-measure-for-polling-bias}
    }

The Surveybias Software 1.1

Calculating the accuracy measures is a matter of some algebra. Estimating standard errors is a bit trickier but could be done manually by making use of the relationship between $latex A^{\prime}_{i}$ and the multinomial logistic model on the one hand and Stata’s very powerful implementation of the delta method on the other. But these calculations are error-prone and become tedious rather quickly. This is why we created a suite of user-written programs (surveybias, surveybiasi, and surveybiasseries). They do all the necessary legwork and return the estimates of accuracy, complete with standard errors and statistical tests.
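
For a single $latex A^{\prime}_{i}$, the manual route is compact enough to show. A sample proportion p based on n respondents has variance p(1-p)/n, the derivative of the log odds ln(p/(1-p)) with respect to p is 1/(p(1-p)), and so the delta method gives an approximate variance of 1/(np(1-p)); the population odds are a known constant and contribute nothing. The Python sketch below applies this to the 2012 CBS/NYT poll from the surveybiasi post. The counts of 262, 273, and 28 respondents are my assumption about how the rounded percentages translate into people, and the code is a hand calculation for illustration, not what runs inside surveybias. It covers only the party-specific measures, not $latex B$ or $latex B_w$.

```python
import math


def log_odds_se(count, n):
    """Delta-method standard error of ln(p / (1 - p)) with p = count / n."""
    p = count / n
    return math.sqrt(1.0 / (n * p * (1.0 - p)))


n = 563
counts = [262, 273, 28]          # Republican, Democrat, other (assumed counts)
pop = [0.476, 0.488, 0.036]      # election result as proportions

for c, v in zip(counts, pop):
    a_prime = math.log((c / (n - c)) / (v / (1 - v)))
    se = log_odds_se(c, n)
    z = a_prime / se
    # Normal-approximation 95 per cent confidence interval.
    lower, upper = a_prime - 1.959964 * se, a_prime + 1.959964 * se
    print(f"{a_prime: .7f} {se:.7f} {z: .2f} [{lower: .7f}, {upper: .7f}]")
```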

Voter poll (“Surveybias Version 1.1 for Stata is out”). Image Credit: Those Were the Days / Foter.com / CC BY-SA

We have just updated our software. The new version 1.1 of surveybias features some bug fixes, a better mechanism for automagically dealing with convergence problems, better documentation, and a new example data set that compiles information on 152 German pre-election polls conducted between January and September 2013.

Examples, Please?

surveybias comes with example data from the French presidential election 2012 and the German parliamentary election 2013. From within Stata, type help surveybias, help surveybiasi, and help surveybiasseries to see how you can make use of our software. If I can find the time, I will illustrate the use of surveybias in a mini-series of blog posts over the next week.

Updating Surveybias

The new version 1.1 is now on SSC, and it can also be installed directly from this website. In your internet-aware copy of Stata (version 11 or later), type

net from http://www.kai-arzheimer.com/stata

net install surveybias, replace

Or use SSC: ssc install surveybias, replace

Enjoy!

Jan 26, 2014
 

R Package Parallel: How Not to Solve a Problem That Does Not Exist

Somewhat foolishly, my university has granted me access to Mogon: not the god, not the death metal band but rather their supercomputer, which currently holds the 182nd spot in the top 500 list of the fastest computers on the planet. It has some 34,000+ cores and more than 80 TB of RAM, but basically it’s just a very large bunch of Linux boxes. That means that I have a rough idea how to handle it, and that it happily runs my native Linux Stata and Mplus (and hopefully JAGS) binaries for me. It also has R installed, and this is where my misery began.

I have a lengthy R job that deals with census data. Basically, it looks up the absolute number of minority residents in some 25,000 output areas and their immediate neighbours and calculates a series of percentages from these figures. I think this could in principle be done in Stata, but R provides convenient libraries for dealing with geo-coded data (sp and friends), non-rectangular data structures and all the trappings of a full-featured programming language, so it would be stupid not to make use of it. The only problem is that R is relatively slow and single-threaded, and that my script is what they call embarrassingly parallel: The same trivial function is applied to 33 vectors with 25,000 elements each. Each calculation on a vector takes about eight seconds to complete, which amounts to roughly five minutes in total. Add the time it takes to read in the data and some fairly large lookup-tables (it would be very time-consuming to repeatedly calculate which output area is close enough to each other output area to be considered a neighbour), and we are looking at eight to ten minutes for one run.

Mogon (“Embarrassing Parallelism: I Got 99 Problems, but a Core ain’t One”). Image Credit: ZDV JGU Mainz

While I do not plan to run this script very often – once the calculations are done and saved, the results can be used in the analysis proper over and over again – I fully expect that I might change some operationalisations, include different variables etc., and so I began toying with the parallel package for R to make use of the many cores suddenly at my disposal.

Twelve hours later, I had learned the basics of the scheduling system (LSF), solved the problem of syncing my data between home, office, central, and super-computer, gained some understanding of the way parallel works and otherwise achieved basically nothing: Even the best attempt at running a parallelised version of the script on the supercomputer was a little slower than the serialised version on my very capable office machine (and that is without the 15 to 90 seconds the script spends waiting to be transferred to a suitable node of the cluster). I tried different things: replacing lapply with mclapply, which was slower, regardless of the number of cores; using clusterApply instead of lapply (same result), and forking the 33 serial jobs into the background, which was even worse, presumably because storing the returned values resulted in changes to rather large data structures that were propagated to all cores involved.

Lessons Learned?

So yes, to save a few minutes in a script that I will presumably run not more than four or five times over the next couple of weeks, I spent 12 hours, with zilch results. But at least I learned a few things (apart from the obvious re-iteration of the old ‘never change a half-way running system’ mantra). First, even if it takes eight seconds to do the sums, a vector of 25,000 elements is probably too short to really benefit from shifting the calculations to more cores. While forking should be cheap, the overhead of setting up the additional threads dominates any savings. Second, running jobs in parallel without really understanding what overhead this creates is a stupid idea, and knowing what overhead this creates and how to avoid it is probably not worth the candle (see the above). Third, I can always re-use the infrastructure I’ve created (for more pointless experiments). Fourth, my next go at Mogon shall avoid half-baked middle-level parallelisation altogether. Instead I shall combine fine-grained implicit parallelism (built into Stata and Mplus) and very coarse explicit parallelism (by breaking up lengthy scripts into small chunks that can be run independently). Further research is definitely needed.

Nov 22, 2013
 

Measuring Survey Bias

In our recent Political Analysis paper (ungated authors’ version), Jocelyn Evans and I show how Martin, Traugott, and Kennedy’s two-party measure of survey accuracy can be extended to the multi-party case (which is slightly more relevant for comparativists and other people interested in the world outside the US). This extension leads to a series of party-specific measures of bias as well as to two scalar measures of overall survey bias.

Moreover, we demonstrate that our new measures are closely linked to the familiar multinomial logit model (just as the MTK measure is linked to the binomial logit). This demonstration is NOT an exercise in Excruciatingly Boring Algebra. Rather, it leads to a straightforward derivation of standard errors and facilitates the implementation of our methodology in standard statistical packages.

Voter poll (“Just How Biased is Your Survey? Ask our Stata Add On (Update)”). Image Credit: Those Were the Days / Foter.com / CC BY-SA

An Update to Our Free Software

We have programmed such an implementation in Stata, and it should not be too difficult to implement our methodology in R (any volunteers?). Our Stata code has been on SSC for a couple of months now but has recently been significantly updated. The new version 1.0 includes various bug fixes to the existing commands surveybias.ado and surveybiasi.ado, slightly better documentation, two toy data sets that should help you get started with the methodology, and a new command surveybiasseries.ado.

surveybiasseries facilitates comparisons across a series of (pre-election) polls. It expects a data set in which each row corresponds to margins (predicted vote shares) from a survey. Such a data set can quickly be constructed from published sources. Access to the original data is not required. surveybiasseries calculates the accuracy measures for each poll and stores them in a set of new variables, which can then be used as dependent variable(s) in a model of poll accuracy.
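
The row-per-poll layout is easy to picture outside Stata, too. The following Python sketch loops over a small, entirely made-up list of polls: the pollster labels, vote shares, sample sizes, and the “true” result are all invented. For each row it computes $latex B$ and $latex B_w$ from the shares as given and stores them under two new keys. It is an illustration of the idea only; it does not reproduce the variable names, standard errors, or internals of surveybiasseries.

```python
import math


def rescale(values):
    """Rescale a list of shares so that it sums to one."""
    total = sum(values)
    return [v / total for v in values]


def b_and_bw(popvalues, samplevalues):
    """Return (B, B_w), assuming A'_i is the log ratio of sample odds to population odds."""
    pop, sample = rescale(popvalues), rescale(samplevalues)
    a_abs = [abs(math.log((s / (1 - s)) / (v / (1 - v)))) for v, s in zip(pop, sample)]
    return sum(a_abs) / len(a_abs), sum(v * a for v, a in zip(pop, a_abs))


true_result = [45, 35, 20]  # hypothetical election result in per cent
polls = [                   # hypothetical polls, one row per survey
    {"pollster": "A", "n": 1000, "shares": [47, 33, 20]},
    {"pollster": "B", "n": 800, "shares": [40, 40, 20]},
]

for poll in polls:
    # New "variables" that could serve as dependent variables later on.
    poll["b"], poll["bw"] = b_and_bw(true_result, poll["shares"])
    print(poll["pollster"], round(poll["b"], 4), round(poll["bw"], 4))
```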

Getting Started with Estimating Survey Bias

The new version of surveybias for Stata should be on SSC over the next couple of weeks or so (double-check the version number, which was 0.65 and should now be 1.0, and the release date), but you can install it right now from this website:

net from http://www.kai-arzheimer.com/stata 
net install surveybias

To see the new command in action, try the following commands in turn.

use fivefrenchsurveys, clear

will load information from five pre-election polls taken during the French presidential campaign (2012) into memory. The vote shares refer to eight candidates that competed in the first round.

surveybiasseries in 1/3 , popvaria(*true) samplev(fh-other) nvar(N) gen(frenchsurveys)

will calculate our accuracy measures and their standard errors for the first three surveys over the full set of candidates.

surveybiasseries in 4/5, popvariables(fhtrue-mptrue) samplevariables(fh-mp) nvar(N) gen(threeparty)

will calculate bias with respect to the three-party vote (i.e. Hollande, Sarkozy, Le Pen) for surveys no. 4 and 5 (vote shares are automatically rescaled to unity, no recoding required). The new variable names start with “frenchsurveys” and “threeparty” and should be otherwise self-explanatory (i.e. threepartybw is $latex B_w$ for the three-party case, and threepartysebw the corresponding standard error). Feel free to plot and model to your heart’s content.

Jul 17, 2013
 

Replication data for our forthcoming Political Analysis paper on our new, multinomial accuracy measure for bias in opinion surveys (e.g. pre-election polls) has just gone online at the PA dataverse. So if you want to gauge the performance of French pollsters over the 2012 presidential campaign (or cannot wait to re-run our simulations of $latex B_w$’s sampling distribution), download our data and start playing.

A current version of our Stata add-on surveybias is included in the bundle. Alternatively, you can install the software into your personal ado directory by typing ssc install surveybias.

Jul 10, 2013
 
Used punchcard (“Stata Software for Assessing Survey Bias”). Image Credit: BinaryApe / Foter / CC BY

In a recent paper, we derive various multinomial measures of bias in public opinion surveys (e.g. pre-election polls). Put differently, with our methodology, you may calculate a scalar measure of survey bias in multi-party elections.

Thanks to Kit Baum over at Boston College, our Stata add-on surveybias.ado is now available from Statistical Software Components (SSC).  The add-on takes as its argument the name of a categorical variable and said variable’s true distribution in the population. For what it’s worth, the program tries to be smart: surveybias vote, popvalues(900000 1200000 1800000), surveybias vote, popvalues(0.2307692 0.3076923 0.4615385), and surveybias vote, popvalues(23.07692 30.76923 46.15385) should all give the same result.
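
The reason these three calls agree is simply that a list of counts, proportions, or percentages describes the same distribution once each element is divided by the list total. A few lines of Python make the point; this is a sketch of the arithmetic, not of the ado file, and rescale is a name chosen for this example.

```python
def rescale(values):
    """Divide each element by the list total so that the result sums to one."""
    total = sum(values)
    return [v / total for v in values]


counts = rescale([900000, 1200000, 1800000])
proportions = rescale([0.2307692, 0.3076923, 0.4615385])
percentages = rescale([23.07692, 30.76923, 46.15385])

for c, p, pct in zip(counts, proportions, percentages):
    # Any remaining differences come from the rounding in the typed-in figures.
    print(f"{c:.6f} {p:.6f} {pct:.6f}")
```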

If you don’t have access to the raw data but want to assess survey bias evident in published figures, there is surveybiasi, an “immediate” command that lets you do stuff like this:  surveybiasi , popvalues(30 40 30) samplevalues(40 40 20) n(1000). Again, you may specify absolute values, relative frequencies, or percentages.

If you want to go ahead and measure survey bias, install surveybias.ado and surveybiasi.ado on your computer by typing ssc install surveybias in your net-aware copy of Stata. And if you use and like our software, please cite our forthcoming Political Analysis paper on the New Multinomial Accuracy Measure for Polling Bias.

Update April 2014: New version 1.1 available

Jun 23, 2013
 

All surveys deviate from the true distributions of the variables, but some more so than others. This is particularly relevant in the context of election studies, where the true distribution of the vote is revealed on election night. Wouldn’t it be nice if one could quantify the bias exhibited by pollster X in their pre-election survey(s), with one single number? Heck, you could even model bias in polls, using RHS variables such as time to election, sample size or sponsor of the survey, coming up with an estimate of the infamous “house effect”.

Jocelyn Evans and I have developed a method for calculating such a figure by extending Martin, Traugott, and Kennedy’s measure $latex A$ to the multi-party case. Being the very creative chaps we are, we call this new statistic [drumroll] $latex B$. We also derive a weighted version of this measure, $latex B_w$, and statistics to measure bias in favour of or against any single party ($latex A^{\prime}$). Of course, our measures can be applied to the sampling of any categorical variable whose distribution is known.

We fully develop all these goodies (and illustrate their usefulness by analysing bias in French pre-election polls) in a paper that (to our immense satisfaction) has just been accepted for publication in Political Analysis (replication files to follow).

Our module surveybias is a Stata ado file that implements these methods. It should become available from SSC over the summer, giving you convenient access to the new methods. I’ll keep you posted.

Jun 1, 2013
 

What is the Delta Method?

I have used the delta method occasionally for years without really understanding what is going on under the hood. A recent encounter with an inquisitive reviewer has changed that. As it turned out, the delta method is even more useful than sliced bread, and much healthier.

The delta method, whose foundations were laid in the 1940s by Cramér (Oehlert 1992), approximates the expectation (or higher moments) of some function $latex g(\cdot)$ of a random variable $latex x$ by relying on a (truncated) Taylor series expansion. More specifically, Agresti (2002: 578) shows that (under weak conditions) for some parameter $latex \theta$ that has an approximately normal sampling distribution with variance $latex \sigma^{2}/n$, the sampling distribution of $latex g(\theta)$ is also approximately normal with variance $latex [g'(\theta)]^{2}\sigma^2/n$, since $latex g(\cdot)$ is approximately linear in the neighbourhood of $latex \theta$. The delta method can be generalised to the case of a multivariate normal random vector (Agresti 2002: 579) such as the joint sampling distribution of some set of parameter estimates.

In plain words, that means that one can use the delta method to calculate confidence intervals and perform hypothesis tests on just about every linear or nonlinear transformation of a vector of parameter estimates. If you are interested in the ratio of two coefficients and need a confidence interval, or if, for some reason, you need to know whether $latex e^{\beta} >c$ with some probability, the delta method is your friend.
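
For the ratio of two coefficients, the calculation can be sketched in a few lines of Python. Everything here is illustrative: the two coefficients and their covariance matrix are invented numbers, delta_method_se is a made-up helper, and the gradient is approximated numerically rather than derived symbolically. The multivariate delta method approximates the variance of g(b) by the quadratic form of the gradient of g, evaluated at the estimates, with the covariance matrix of the estimates.

```python
import math


def delta_method_se(g, estimates, cov, step=1e-6):
    """Approximate standard error of g(estimates) via a first-order Taylor expansion."""
    # Numerical gradient of g by central differences.
    grad = []
    for i in range(len(estimates)):
        up = list(estimates)
        down = list(estimates)
        up[i] += step
        down[i] -= step
        grad.append((g(up) - g(down)) / (2 * step))
    # Quadratic form: gradient' * covariance matrix * gradient.
    variance = sum(
        grad[i] * cov[i][j] * grad[j]
        for i in range(len(grad))
        for j in range(len(grad))
    )
    return math.sqrt(variance)


b = [2.0, 4.0]                        # hypothetical coefficient estimates
cov = [[0.04, 0.01], [0.01, 0.09]]    # hypothetical covariance matrix
ratio = b[0] / b[1]
se = delta_method_se(lambda x: x[0] / x[1], b, cov)
print(ratio, se, ratio - 1.96 * se, ratio + 1.96 * se)
```

With the analytical gradient (1/b2, -b1/b2^2) = (0.25, -0.125), the variance works out to 0.0025 - 0.000625 + 0.00140625 = 0.00328125, which the numerical version should match closely.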

The Delta Method and nlcom

Stata’s procedure nlcom is a particularly versatile and powerful implementation of the delta method. As a post-estimation command, nlcom accepts symbolic references to model parameters and computes sampling variances for their linear and non-linear combinations and transformations. If you can write down the formula of the transformation, nlcom will spit out the result, standard error and confidence interval, and will even store the full variance-covariance matrix of the estimates. That, in turn, means that amongst other things, you can abuse Stata’s built-in procedures to implement your own estimators.

What’s not to like? Well, for one thing, Stata gives no indication of how well the approximation works. It’s always worth checking that the results look reasonable, and in particularly complex circumstances, one should use simulation/bootstrapping for double checking. But basically, nlcom is great fun.

References

Agresti, Alan. 2002. Categorical Data Analysis. 2nd ed. Hoboken: John Wiley.

Oehlert, Gary W. 1992. “A Note on the Delta Method.” The American Statistician 46(1):27–29.

Dec 19, 2012
 

As a follow-up to my recent post on the relationship between gun ownership and gun homicide in OECD countries, I have rolled my dataset (compiled from information published by gunpolicy.org) and my analysis script into a neat Stata package. If you want to recreate the tables and graphs, or otherwise want to play with the data, just enter

net get http://www.kai-arzheimer.com/stata/guns

do guns-analysis

in your net-aware copy of Stata.

If you don’t like Stata, you can get the raw data (ASCII) from http://www.kai-arzheimer.com/stata/oecd-gun-deaths.txt . Enjoy!