Stata-related posts

Stata is my favourite general-purpose stats package. Sadly, it is also one of my favourite pastimes, but there you are. Here is my collection of Stata-related blog posts. If this is relevant for you, you might also be interested in a series of slides for a Stata course I taught some years ago (in German).

Apr 03 2014
 

Survey Accuracy

The accuracy of pre-election surveys is a matter of considerable debate. Obviously, any rigorous discussion of bias in opinion polls requires a scalar measure of survey accuracy. Martin, Traugott, and Kennedy (2005) propose such a measure ($A$) for the two-party case, and in our own work (Arzheimer/Evans 2014), Jocelyn Evans and I demonstrate how $A$ can be generalised to the multi-party case, giving rise to a new measure $B$ (seriously) and some friends ($B_w$ and $A'_i$):

    Arzheimer, Kai and Jocelyn Evans. “A New Multinomial Accuracy Measure for Polling Bias.” Political Analysis 22.1 (2014): 31-44. doi:10.1093/pan/mpt012
    In this article, we propose a polling accuracy measure for multi-party elections based on a generalization of Martin, Traugott, and Kennedy's two-party predictive accuracy index. Treating polls as random samples of a voting population, we first estimate an intercept only multinomial logit model to provide proportionate odds measures of each party's share of the vote, and thereby both unweighted and weighted averages of these values as a summary index for poll accuracy. We then propose measures for significance testing, and run a series of simulations to assess possible bias from the resulting folded normal distribution across different sample sizes, finding that bias is small even for polls with small samples. We apply our measure to the 2012 French presidential election polls to demonstrate its applicability in tracking overall polling performance across time and polling organizations. Finally, we demonstrate the practical value of our measure by using it as a dependent variable in an explanatory model of polling accuracy, testing the different possible sources of bias in the French data.

    @Article{arzheimer-evans-2013,
    author = {Arzheimer, Kai and Evans, Jocelyn},
    title = {A New Multinomial Accuracy Measure for Polling Bias},
    journal = {Political Analysis},
    year = 2014,
    abstract = {In this article, we propose a polling accuracy measure for
    multi-party elections based on a generalization of Martin,
    Traugott, and Kennedy's two-party predictive accuracy index.
    Treating polls as random samples of a voting population, we first
    estimate an intercept only multinomial logit model to provide
    proportionate odds measures of each party's share of the vote, and
    thereby both unweighted and weighted averages of these values as a
    summary index for poll accuracy. We then propose measures for
    significance testing, and run a series of simulations to assess
    possible bias from the resulting folded normal distribution across
    different sample sizes, finding that bias is small even for polls
    with small samples. We apply our measure to the 2012 French
    presidential election polls to demonstrate its applicability in
    tracking overall polling performance across time and polling
    organizations. Finally, we demonstrate the practical value of our
    measure by using it as a dependent variable in an explanatory model
    of polling accuracy, testing the different possible sources of bias
    in the French data.},
    keywords = {meth-e},
    volume = {22},
    number = {1},
    pages = {31--44},
    url =
    {http://pan.oxfordjournals.org/cgi/reprint/mpt012?ijkey=z9z740VU1fZp331&keytype=ref},
    doi = {10.1093/pan/mpt012},
    data = {http://hdl.handle.net/1902.1/21603},
    html =
    {http://www.kai-arzheimer.com/new-multinomial-accuracy-measure-for-polling-bias}
    }

The Surveybias Software 1.1

Calculating the accuracy measures is a matter of some algebra. Estimating standard errors is a bit trickier but could be done manually by making use of the relationship between $B$ and the multinomial logistic model on the one hand and Stata's very powerful implementation of the Delta method on the other. But these calculations are error-prone and become tedious rather quickly. This is why we created a suite of user-written programs (surveybias, surveybiasi, and surveybiasseries). They do all the necessary legwork and return the estimates of accuracy, complete with standard errors and statistical tests.
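Just to give a flavour of the manual route, here is a bare-bones sketch (not our actual implementation, and simplified: it assumes a vote variable coded 1, 2, 3 without value labels and a known true vote share of 46 per cent for party 3; see the paper for the precise definitions):

* intercept-only multinomial logit: the constants encode the survey
* shares of parties 2 and 3 relative to party 1
mlogit vote, baseoutcome(1)
* delta-method standard error for party 3's log odds ratio against its
* true share: A'_3 = ln[(p3/(1-p3)) / (.46/.54)], where the survey odds
* of party 3 simplify to exp(_b[3:_cons])/(1 + exp(_b[2:_cons]))
nlcom (a3: _b[3:_cons] - ln(1 + exp(_b[2:_cons])) - ln(.46/.54))

This is essentially what the ado files automate, plus scaling, labels, and the overall measures.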

[Image: voters at the polls. Credit: Those Were the Days / Foter.com / CC BY-SA]

We have just updated our software. The new version 1.1 of surveybias features some bug fixes, a better mechanism for automagically dealing with convergence problems, better documentation, and a new example data set that compiles information on 152 German pre-election polls conducted between January and September 2013.

Examples, Please?

surveybias comes with example data from the 2012 French presidential election and the 2013 German parliamentary election. From within Stata, type help surveybias, help surveybiasi, and help surveybiasseries to see how you can make use of our software. If I can find the time, I will illustrate the use of surveybias in a mini-series of blog posts over the next week.
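Until then, here is the general flavour: surveybiasi works off published margins, while surveybias proper works off the raw data (all figures below are made up).

* immediate form: published vote shares vs the true result, n = 1,000
surveybiasi, popvalues(30 40 30) samplevalues(40 40 20) n(1000)

* full-data form: a categorical variable checked against the true
* population counts
surveybias vote, popvalues(900000 1200000 1800000)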

Updating Surveybias

The new version 1.1 should appear on SSC within the next couple of days or so, but the truly impatient can get it now. In your internet-aware copy of Stata (version 11 or later), type

net from http://www.kai-arzheimer.com/stata

net install surveybias, replace

Or use SSC: ssc install surveybias, replace

Enjoy!

Jan 26 2014
 

R Package Parallel: How Not to Solve a Problem That Does Not Exist

Somewhat foolishly, my university has granted me access to Mogon: not the god, not the death metal band, but rather their supercomputer, which currently holds the 182nd spot in the top 500 list of the fastest computers on the planet. It has some 34,000+ cores and more than 80 TB of RAM, but basically it's just a very large bunch of Linux boxes. That means that I have a rough idea of how to handle it, and that it happily runs my native Linux Stata and Mplus (and hopefully JAGS) binaries for me. It also has R installed, and this is where my misery began.

I have a lengthy R job that deals with census data. Basically, it looks up the absolute number of minority residents in some 25,000 output areas and their immediate neighbours and calculates a series of percentages from these figures. I think this could in principle be done in Stata, but R provides convenient libraries for dealing with geo-coded data (sp and friends), non-rectangular data structures and all the trappings of a full-featured programming language, so it would be stupid not to make use of it. The only problem is that R is relatively slow and single-threaded, and that my script is what they call embarrassingly parallel: The same trivial function is applied to 33 vectors with 25,000 elements each. Each calculation on a vector takes about eight seconds to complete, which amounts to roughly five minutes in total. Add the time it takes to read in the data and some fairly large lookup-tables (it would be very time-consuming to repeatedly calculate which output area is close enough to each other output area to be considered a neighbour), and we are looking at eight to ten minutes for one run.

[Image: Mogon. Image credit: ZDV JGU Mainz]

While I do not plan to run this script very often – once the calculations are done and saved, the results can be used in the analysis proper over and over again – I fully expect that I might change some operationalisations, include different variables etc., and so I began toying with the parallel package for R to make use of the many cores suddenly at my disposal.

Twelve hours later, I had learned the basics of the scheduling system (LSF), solved the problem of synching my data between home, office, central, and super-computer, gained some understanding of the way parallel works, and otherwise achieved basically nothing: even the best attempt at running a parallelised version of the script on the supercomputer was a little slower than the serialised version on my very capable office machine (and that is without the time, between 15 and 90 seconds, the script spends waiting to be transferred to a suitable node of the cluster). I tried different things: replacing lapply with mclapply, which was slower regardless of the number of cores; using clusterApply instead of lapply (same result); and forking the 33 serial jobs into the background, which was even worse, presumably because storing the returned values resulted in changes to rather large data structures that were propagated to all cores involved.

Lessons Learned?

So yes, to save a few minutes in a script that I will presumably run not more than four or five times over the next couple of weeks, I spent 12 hours, with zilch results. But at least I learned a few things (apart from the obvious re-iteration of the old 'never change a half-way running system' mantra). First, even if it takes eight seconds to do the sums, a vector of 25,000 elements is probably too short to really benefit from shifting the calculations to more cores. While forking should be cheap, the overhead of setting up the additional threads dominates any savings. Second, running jobs in parallel without really understanding what overhead this creates is a stupid idea, and knowing what overhead this creates and how to avoid it is probably not worth the candle (see above). Third, I can always re-use the infrastructure I've created (for more pointless experiments). Fourth, my next go at Mogon shall avoid half-baked middle-level parallelisation altogether. Instead, I shall combine fine-grained implicit parallelism (built into Stata and Mplus) and very coarse explicit parallelism (by breaking up lengthy scripts into small chunks that can be run independently). Further research is definitely needed.

Nov 22 2013
 

Measuring Survey Bias

In our recent Political Analysis paper (ungated authors’ version), Jocelyn Evans and I show how Martin, Traugott, and Kennedy’s two-party measure of survey accuracy can be extended to the multi-party case (which is slightly more relevant for comparativists and other people interested in the world outside the US). This extension leads to a series of party-specific measures of bias as well as to two scalar measures of overall survey bias.

Moreover, we demonstrate that our new measures are closely linked to the familiar multinomial logit model (just as the MTK measure is linked to the binomial logit). This demonstration is NOT an exercise in Excruciatingly Boring Algebra. Rather, it leads to a straightforward derivation of standard errors and facilitates the implementation of our methodology in standard statistical packages.

[Image: voters at the polls. Credit: Those Were the Days / Foter.com / CC BY-SA]

An Update to Our Free Software

We have programmed such an implementation in Stata, and it should not be too difficult to implement our methodology in R (any volunteers?). Our Stata code has been on SSC for a couple of months now but has recently been significantly updated. The new version 1.0 includes various bug fixes to the existing commands surveybias.ado and surveybiasi.ado, slightly better documentation, two toy data sets that should help you get started with the methodology, and a new command surveybiasseries.ado.

surveybiasseries facilitates comparisons across a series of (pre-election) polls. It expects a data set in which each row corresponds to the margins (predicted vote shares) from one survey. Such a data set can quickly be constructed from published sources; access to the original data is not required. surveybiasseries calculates the accuracy measures for each poll and stores them in a set of new variables, which can then be used as dependent variable(s) in a model of poll accuracy.
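A minimal sketch of how such a data set could be typed in by hand (pollster names and figures are invented; the variable layout loosely mirrors our French example data):

* one row per poll: predicted vote shares (in per cent) plus sample size
clear
input str10 pollster fh ns mp other N
"PollsterA" 28 27 17 28 1000
"PollsterB" 30 26 18 26 950
end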

Getting Started with Estimating Survey Bias

The new version of surveybias for Stata should be on SSC within the next couple of weeks or so (double-check the version number (was 0.65, should now be 1.0) and the release date), but you can install it right now from this website:

net from http://www.kai-arzheimer.com/stata 
net install surveybias

To see the new command in action, try this:

use fivefrenchsurveys, clear

will load information from five pre-election polls taken during the French presidential campaign (2012) into memory. The vote shares refer to eight candidates that competed in the first round.

surveybiasseries in 1/3, popvariables(*true) samplevariables(fh-other) nvar(N) gen(frenchsurveys)

will calculate our accuracy measures and their standard errors for the first three surveys over the full set of candidates.

surveybiasseries in 4/5, popvariables(fhtrue-mptrue) samplevariables(fh-mp) nvar(N) gen(threeparty)

will calculate bias with respect to the three-party vote (i.e. Hollande, Sarkozy, Le Pen) for surveys no. 4 and 5 (vote shares are automatically rescaled to unity, no recoding required). The new variable names start with "frenchsurveys" and "threeparty" and should otherwise be self-explanatory (e.g. threepartybw is $B_w$ for the three-party case, and threepartysebw the corresponding standard error). Feel free to plot and model to your heart's content.
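For instance, something along these lines plots the weighted measure over time (the fieldwork date variable is hypothetical; frenchsurveysbw is generated by the first call above):

* B_w for the full-candidate analysis, by (hypothetical) fieldwork date
twoway connected frenchsurveysbw fieldworkdate in 1/3, yline(0)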

Jul 17 2013
 

Replication data for our forthcoming Political Analysis paper on our new multinomial accuracy measure for bias in opinion surveys (e.g. pre-election polls) has just gone online at the PA dataverse. So if you want to gauge the performance of French pollsters over the 2012 presidential campaign (or cannot wait to re-run our simulations of $B$'s sampling distribution), download our data and start playing.

A current version of our Stata add-on surveybias is included in the bundle. Alternatively, you can install the software into your personal ado directory by typing ssc install surveybias.

Jul 10 2013
 
[Image: used punch card. Credit: BinaryApe / Foter / CC BY]

In a recent paper, we derive various multinomial measures of bias in public opinion surveys (e.g. pre-election polls). Put differently, with our methodology, you may calculate a scalar measure of survey bias in multi-party elections.

Thanks to Kit Baum over at Boston College, our Stata add-on surveybias.ado is now available from Statistical Software Components (SSC). The add-on takes as its arguments the name of a categorical variable and said variable's true distribution in the population. For what it's worth, the program tries to be smart about scaling: absolute frequencies, proportions, and percentages should all give the same result:

surveybias vote, popvalues(900000 1200000 1800000)
surveybias vote, popvalues(0.2307692 0.3076923 0.4615385)
surveybias vote, popvalues(23.07692 30.76923 46.15385)

If you don't have access to the raw data but want to assess survey bias evident in published figures, there is surveybiasi, an "immediate" command that lets you do stuff like this:

surveybiasi, popvalues(30 40 30) samplevalues(40 40 20) n(1000)

Again, you may specify absolute values, relative frequencies, or percentages.

If you want to go ahead and measure survey bias, install surveybias.ado and surveybiasi.ado on your computer by typing ssc install surveybias in your net-aware copy of Stata. And if you use and like our software, please cite our forthcoming Political Analysis paper on the New Multinomial Accuracy Measure for Polling Bias.

Update April 2014: New version 1.1 available

Jun 23 2013
 

All surveys deviate from the true distributions of the variables they measure, but some more so than others. This is particularly relevant in the context of election studies, where the true distribution of the vote is revealed on election night. Wouldn't it be nice if one could quantify the bias exhibited by pollster X in their pre-election survey(s) with one single number? Heck, you could even model bias in polls, using RHS variables such as time to election, sample size, or sponsor of the survey, coming up with an estimate of the infamous "house effect".
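In Stata terms, the eventual model could be as simple as the following sketch (all variable names are hypothetical; bw would hold one bias score per poll):

* hypothetical house-effect regression: per-poll bias scores on
* poll characteristics
regress bw c.daystoelection c.samplesize i.pollster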

Jocelyn Evans and I have developed a method for calculating such a figure by extending Martin, Traugott, and Kennedy's measure $A$ to the multi-party case. Being the very creative chaps we are, we call this new statistic [drumroll] $B$. We also derive a weighted version of this measure, $B_w$, and statistics to measure bias in favour of/against any single party ($A'_i$). Of course, our measures can be applied to the sampling of any categorical variable whose distribution is known.

We fully develop all these goodies (and illustrate their usefulness by analysing bias in French pre-election polls) in a paper that (to our immense satisfaction) has just been accepted for publication in Political Analysis (replication files to follow).

Our module surveybias is a Stata ado file that implements these methods. It should become available from SSC over the summer, giving you convenient access to the new measures. I'll keep you posted.

Jun 01 2013
 

What is the Delta Method?

I have used the delta method occasionally for years without really understanding what is going on under the hood. A recent encounter with an inquisitive reviewer has changed that. As it turned out, the delta method is even more useful than sliced bread, and much healthier.

The delta method, whose foundations were laid in the 1940s by Cramér (Oehlert 1992), approximates the expectation (or higher moments) of some function $f(x)$ of a random variable $x$ by relying on a (truncated) Taylor series expansion. More specifically, Agresti (2002: 578) shows that (under weak conditions) for some parameter $\hat\beta$ that has an approximately normal sampling distribution with variance $\sigma^2$, the sampling distribution of $f(\hat\beta)$ is also approximately normal with variance $[f'(\beta)]^2\sigma^2$, since $f$ is approximately linear in the neighbourhood of $\beta$. The delta method can be generalised to the case of a multivariate normal random vector (Agresti 2002: 579) such as the joint sampling distribution of some set of parameter estimates.
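In symbols, the univariate first-order approximation boils down to $\operatorname{Var}[f(\hat\beta)] \approx [f'(\beta)]^2 \operatorname{Var}(\hat\beta)$.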

In plain words, that means that one can use the delta method to calculate confidence intervals and perform hypothesis tests on just about any linear or non-linear transformation of a vector of parameter estimates. If you are interested in the ratio of two coefficients and need a confidence interval, or if, for some reason, you need to know whether some non-linear function of your estimates exceeds a given value with some probability, the delta method is your friend.

The Delta Method and nlcom

Stata's procedure nlcom is a particularly versatile and powerful implementation of the delta method. As a post-estimation command, nlcom accepts symbolic references to model parameters and computes sampling variances for their linear and non-linear combinations and transformations. If you can write down the formula of the transformation, nlcom will spit out the result, standard error, and confidence interval, and will even store the full variance-covariance matrix of the estimates. That, in turn, means that, amongst other things, you can abuse Stata's built-in procedures to implement your own estimators.
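A minimal example, using Stata's bundled auto data: a delta-method standard error and confidence interval for the ratio of two regression coefficients:

sysuse auto, clear
regress price mpg weight
* ratio of the two slope coefficients, with delta-method SE and CI
nlcom ratio: _b[mpg]/_b[weight]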

What's not to like? Well, for one thing, Stata gives no indication of how well the approximation works. It's always worth checking that the results look reasonable, and in particularly complex circumstances, one should use simulation/bootstrapping for double-checking. But basically, nlcom is great fun.

References

Agresti, Alan. 2002. Categorical Data Analysis. 2nd ed. Hoboken: John Wiley.

Oehlert, Gary W. 1992. "A Note on the Delta Method." The American Statistician 46(1): 27–29.

Dec 19 2012
 

As a follow-up to my recent post on the relationship between gun ownership and gun homicide in OECD countries, I have rolled my dataset (compiled from information published by gunpolicy.org) and my analysis script into a neat Stata package. If you want to recreate the tables and graphs, or otherwise want to play with the data, just enter

net from http://www.kai-arzheimer.com/stata
net get guns

do guns-analysis

in your net-aware copy of Stata.

If you don't like Stata, you can get the raw data (ASCII) from http://www.kai-arzheimer.com/stata/oecd-gun-deaths.txt. Enjoy!

Oct 07 2012
 
Matrix Graph in Stata

Do you like this graph? I don't think it is particularly attractive, and that is after spending hours and hours creating it. What I really wanted was a matrix-like representation of 18 simulations I ran. More specifically, I simulated the sampling distribution of a statistic under six different conditions for three different sample sizes. Doing the simulations was a breeze, courtesy of Stata's simulate command, which created 18 corresponding data sets. Graphing them with kdensity also posed no problem, but combining these graphs did, because I could find no canned command that produces what I wanted: a table-like arrangement, with labels for the columns (i.e. sample sizes) and rows (experimental conditions). What I could have done was set up and label a variable with 18 categories (one for each data set) and use the ,by() option to create a trellis plot, but that would waste a lot of ink/space by replicating redundant information. At the end of the day, I created nine graphs that were completely empty save for the text that I wanted as row/column labels, which I then combined into two separate figures that were in turn combined (using a distorted aspect ratio) with my 18 separate plots. That boils down to a lot of dumb code. E.g., this creates the labels for the six conditions. Note the fxsize option that makes the combined graph narrow, and the necessity to create an empty scatter plot.

* create two empty variables so that we can draw a completely empty scatter plot
capture drop x
capture drop y
capture set obs 5
gen x = .
gen y = .

local allgraphs = ""

* one empty graph per experimental condition, labelled "(1)" to "(6)"
forvalues c = 1/6 {
    graph twoway scatter x y, xtitle("") ytitle("") xscale(off) yscale(off) subtitle("(`c')", position(0) nobox) graphregion(margin(zero)) plotregion(style(none))
    local allgraphs "`allgraphs' condition`c'"
    graph rename condition`c', replace
}

* stack the six label graphs in a single narrow (fxsize) column
graph combine `allgraphs', cols(1) colfirst imargin(0 0 0 0) fxsize(10) b1title(" ")

The column labels were created by similar code. Finally, I combined my 18 graphs (their names are in the local macro) and combined the results with the label graphs.

graph combine `graphs', colfirst cols(3) ycommon xcommon imargin(3 3 3 3) b1title("B_w") l1title("Density")
graph rename simulations, replace
graph combine sizelabels.gph condlabels.gph simulations, imargin(0 0 0 0) cols(2) holes(1)

Can anyone of you think of a more elegant way to achieve this result?

matrix graph 300x200 Creating Matrix Like Plots in Stata