Stata-related posts

Stata is my favourite general-purpose stats package. Sadly, it is also one of my favourite pastimes, but there you are. Here is my collection of Stata-related blog posts. If this is relevant to you, you might also be interested in a series of slides for a Stata course I taught some years ago (in German).

Jan 26, 2014
 

R Package Parallel: How Not to Solve a Problem That Does Not Exist

Somewhat foolishly, my university has granted me access to Mogon: not the god, not the death metal band, but rather their supercomputer, which currently holds the 182nd spot in the top 500 list of the fastest computers on the planet. It has some 34,000 cores and more than 80 TB of RAM, but basically it’s just a very large bunch of Linux boxes. That means that I have a rough idea how to handle it, and that it happily runs my native Linux Stata and Mplus (and hopefully JAGS) binaries for me. It also has R installed, and this is where my misery began.

I have a lengthy R job that deals with census data. Basically, it looks up the absolute number of minority residents in some 25,000 output areas and their immediate neighbours and calculates a series of percentages from these figures. I think this could in principle be done in Stata, but R provides convenient libraries for dealing with geo-coded data (sp and friends), non-rectangular data structures and all the trappings of a full-featured programming language, so it would be stupid not to make use of it. The only problem is that R is relatively slow and single-threaded, and that my script is what they call embarrassingly parallel: the same trivial function is applied to 33 vectors with 25,000 elements each. Each calculation on a vector takes about eight seconds to complete, which amounts to roughly five minutes in total. Add the time it takes to read in the data and some fairly large lookup tables (it would be very time-consuming to repeatedly work out which output areas are close enough to each other to count as neighbours), and we are looking at eight to ten minutes for one run.

[Image: Mogon. Image credit: ZDV JGU Mainz]

While I do not plan to run this script very often – once the calculations are done and saved, the results can be used in the analysis proper over and over again – I fully expect that I might change some operationalisations, include different variables etc., and so I began toying with the parallel package for R to make use of the many cores suddenly at my disposal.

Twelve hours later, I had learned the basics of the scheduling system (LSF), solved the problem of syncing my data between home, office, central, and super-computer, gained some understanding of the way parallel works, and otherwise achieved basically nothing: even the best attempt at running a parallelised version of the script on the supercomputer was a little slower than the serial version on my very capable office machine, and that is without counting the 15 to 90 seconds each script spends waiting to be transferred to a suitable node of the cluster. I tried different things: replacing lapply with mclapply, which was slower regardless of the number of cores; using clusterApply instead of lapply (same result); and forking the 33 serial jobs into the background, which was even worse, presumably because storing the returned values resulted in changes to rather large data structures that were propagated to all cores involved.
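For the record, the kind of swap I am talking about is trivial to write down. Here is a stripped-down sketch of the three variants; the object and function names are made up for illustration, and the toy data obviously stand in for my real census vectors:

library(parallel)

# toy stand-ins: 33 vectors with 25,000 elements each
vecs <- replicate(33, runif(25000), simplify = FALSE)
slow_stat <- function(v) mean(v)  # placeholder for the real calculation

# serial baseline
res_serial <- lapply(vecs, slow_stat)

# drop-in, fork-based replacement (Unix only)
res_mc <- mclapply(vecs, slow_stat, mc.cores = 16)

# explicit cluster of worker processes
cl <- makeCluster(16)
res_cl <- clusterApply(cl, vecs, slow_stat)
stopCluster(cl)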

Lessons Learned?

So yes, to save a few minutes in a script that I will presumably run not more than four or five times over the next couple of weeks, I spent 12 hours, with zilch results. But at least I learned a few things (apart from the obvious re-iteration of the old ‘never change a half-way running system’ mantra). First, even if it takes eight seconds to do the sums, a vector of 25,000 elements is probably too short to really benefit from shifting the calculations to more cores: while forking should be cheap, the overhead of setting up the additional threads dominates any savings. Second, running jobs in parallel without really understanding what overhead this creates is a stupid idea, and learning exactly what overhead this creates and how to avoid it is probably not worth the candle (see above). Third, I can always re-use the infrastructure I’ve created (for more pointless experiments). Fourth, my next go at Mogon shall avoid half-baked mid-level parallelisation altogether. Instead I shall combine fine-grained implicit parallelism (built into Stata and Mplus) with very coarse explicit parallelism (by breaking up lengthy scripts into small chunks that can be run independently). Further research is definitely needed.

Nov 22, 2013
 

Measuring Survey Bias

In our recent Political Analysis paper (ungated authors’ version), Jocelyn Evans and I show how Martin, Traugott, and Kennedy’s two-party measure of survey accuracy can be extended to the multi-party case (which is slightly more relevant for comparativists and other people interested in the world outside the US). This extension leads to a series of party-specific measures of bias as well as to two scalar measures of overall survey bias.

Moreover, we demonstrate that our new measures are closely linked to the familiar multinomial logit model (just as the MTK measure is linked to the binomial logit). This demonstration is NOT an exercise in Excruciatingly Boring Algebra. Rather, it leads to a straightforward derivation of standard errors and facilitates the implementation of our methodology in standard statistical packages.

[Image: voter poll. Credit: Those Were the Days / Foter.com / CC BY-SA]

An Update to Our Free Software

We have programmed such an implementation in Stata, and it should not be too difficult to implement our methodology in R (any volunteers?). Our Stata code has been on SSC for a couple of months now but has recently been significantly updated. The new version 1.0 includes various bug fixes to the existing commands surveybias.ado and surveybiasi.ado, slightly better documentation, two toy data sets that should help you get started with the methodology, and a new command surveybiasseries.ado.

surveybiasseries facilitates comparisons across a series of (pre-election) polls. It expects a data set in which each row corresponds to the margins (predicted vote shares) from one survey. Such a data set can quickly be constructed from published sources; access to the original data is not required. surveybiasseries calculates the accuracy measures for each poll and stores them in a set of new variables, which can then be used as dependent variable(s) in a model of poll accuracy.

Getting Started with Estimating Survey Bias

The new version of surveybias for Stata should appear on SSC over the next couple of weeks or so (double-check the version number, which should now read 1.0 rather than 0.65, and the release date), but you can install it right now from this website:

net from https://www.kai-arzheimer.com/stata 
net install surveybias

To see the new command in action, try this:

use fivefrenchsurveys, clear

will load information from five pre-election polls taken during the French presidential campaign (2012) into memory. The vote shares refer to eight candidates that competed in the first round.

surveybiasseries in 1/3 , popvaria(*true) samplev(fh-other) nvar(N) gen(frenchsurveys)

will calculate our accuracy measures and their standard errors for the first three surveys over the full set of candidates.

surveybiasseries in 4/5, popvariables(fhtrue-mptrue) samplevariables(fh-mp) nvar(N) gen(threeparty)

will calculate bias with respect to the three-party vote (i.e. Hollande, Sarkozy, Le Pen) for surveys no. 4 and 5 (vote shares are automatically rescaled to unity, no recoding required). The new variable names start with “frenchsurveys” and “threeparty” and should be otherwise self-explanatory (i.e. threepartybw is $B_w$ for the three-party case, and threepartysebw the corresponding standard error). Feel free to plot and model to your heart’s content.
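For instance, here is a quick (and purely illustrative) way of inspecting the three-party estimates and attaching a rough confidence band to them; the variable names simply follow the gen() prefix logic described above:

* B_w and its standard error for surveys no. 4 and 5
list threepartybw threepartysebw in 4/5

* a rough 95 per cent interval around B_w
generate bw_lo = threepartybw - 1.96*threepartysebw
generate bw_hi = threepartybw + 1.96*threepartysebw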

Jul 17, 2013
 

Replication data for our forthcoming Political Analysis paper on our new, multinomial accuracy measure for bias in opinion surveys (e.g. pre-election polls) has just gone online at the PA dataverse. So if you want to gauge the performance of French pollsters over the 2012 presidential campaign (or cannot wait to re-run our simulations of B_w's sampling distribution), download our data and start playing.

A current version of our Stata add-on surveybias is included in the bundle. Alternatively, you can install the software into your personal ado directory by typing ssc install surveybias.

Jul 10, 2013
 
[Image: used punchcard. Credit: BinaryApe / Foter / CC BY]

In a recent paper, we derive various multinomial measures of bias in public opinion surveys (e.g. pre-election polls). Put differently, with our methodology, you may calculate a scalar measure of survey bias in multi-party elections.

Thanks to Kit Baum over at Boston College, our Stata add-on surveybias.ado is now available from Statistical Software Components (SSC).  The add-on takes as its argument the name of a categorical variable and said variable’s true distribution in the population. For what it’s worth, the program tries to be smart: surveybias vote, popvalues(900000 1200000 1800000), surveybias vote, popvalues(0.2307692 0.3076923 0.4615385), and surveybias vote, popvalues(23.07692 30.76923 46.15385) should all give the same result.

If you don’t have access to the raw data but want to assess survey bias evident in published figures, there is surveybiasi, an “immediate” command that lets you do stuff like this:  surveybiasi , popvalues(30 40 30) samplevalues(40 40 20) n(1000). Again, you may specify absolute values, relative frequencies, or percentages.

If you want to go ahead and measure survey bias, install surveybias.ado and surveybiasi.ado on your computer by typing ssc install surveybias in your net-aware copy of Stata. And if you use and like our software, please cite our forthcoming Political Analysis paper on the New Multinomial Accuracy Measure for Polling Bias.

Update April 2014: New version 1.1 available

Jun 23, 2013
 

All surveys deviate from the true distributions of the variables, but some more so than others. This is particularly relevant in the context of election studies, where the true distribution of the vote is revealed on election night. Wouldn’t it be nice if one could quantify the bias exhibited by pollster X in their pre-election survey(s) with one single number? Heck, you could even model bias in polls, using RHS variables such as time to election, sample size or sponsor of the survey, coming up with an estimate of the infamous “house effect”.

Jocelyn Evans and I have developed a method for calculating such a figure by extending Martin, Traugott, and Kennedy’s measure A to the multi-party case. Being the very creative chaps we are, we call this new statistic [drumroll] B. We also derive a weighted version of this measure, B_w, as well as statistics that measure bias in favour of/against any single party (A'). Of course, our measures can be applied to the sampling of any categorical variable whose distribution is known.
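For readers who have never come across the original measure: if I reproduce Martin, Traugott, and Kennedy’s proposal correctly here, A for the two-party (US) case is the log of the ratio of the Republican/Democratic odds in the poll to the same odds in the official result,

A = \ln\left(\frac{p_{rep}/p_{dem}}{v_{rep}/v_{dem}}\right)

where p denotes poll shares and v the certified vote shares, so that an unbiased poll has an expected A of zero and the sign shows which party is favoured. B generalises this log-odds-ratio logic to more than two parties; the exact definitions are in the paper.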

We fully develop all these goodies (and illustrate their usefulness by analysing bias in French pre-election polls) in a paper that (to our immense satisfaction) has just been accepted for publication in Political Analysis (replication files to follow).

Our module surveybias is a Stata ado file that implements these methods. It should become available from SSC over the summer, giving you convenient access to the new methods. I’ll keep you posted.

Jun 01, 2013
 

What is the Delta Method?

I have used the delta method occasionally for years without really understanding what is going on under the hood. A recent encounter with an inquisitive reviewer has changed that. As it turned out, the delta method is even more useful than sliced bread, and much healthier.

The delta method, whose foundations were laid in the 1940s by Cramér (Oehlert 1992), approximates the expectation (or higher moments) of some function g(\cdot) of a random variable x by relying on a (truncated) Taylor series expansion. More specifically, Agresti (2002: 578) shows that (under weak conditions) for some parameter \theta that has an approximately normal sampling distribution with variance \sigma^{2}/n, the sampling distribution of g(\theta) is also approximately normal with variance [g'(\theta)]^{2}\sigma^2/n, since g(\cdot) is approximately linear in the neighbourhood of \theta. The delta method can be generalised to the case of a multivariate normal random vector (Agresti 2002: 579) such as the joint sampling distribution of some set of parameter estimates.
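The whole argument fits into two lines. Writing \hat{\theta} for the estimator of \theta, the first-order Taylor expansion of g around \theta gives

g(\hat{\theta}) \approx g(\theta) + g'(\theta)(\hat{\theta} - \theta)

and taking variances on both sides (the first term on the right is a constant) yields

\mathrm{Var}[g(\hat{\theta})] \approx [g'(\theta)]^{2}\,\mathrm{Var}(\hat{\theta}) = [g'(\theta)]^{2}\sigma^{2}/n.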

In plain words, that means that one can use the delta method to calculate confidence intervals and perform hypothesis tests on just about every linear or nonlinear transformation of a vector of parameter estimates. If you are interested in the ratio of two coefficients and need a confidence interval, or if, for some reason, you need to know whether e^{\beta} > c with some probability, the delta method is your friend.

The Delta Method and nlcom

Stata’s procedure nlcom is a particularly versatile and powerful implementation of the delta method. As a post-estimation command, nlcom accepts symbolic references to model parameters and computes sampling variances for their linear and non-linear combinations and transformations. If you can write down the formula of the transformation, nlcom will spit out the result, standard error and confidence interval, and will even store the full variance-covariance matrix of the estimates. That, in turn, means that, amongst other things, you can abuse Stata’s built-in procedures to implement your own estimators.
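A minimal illustration with Stata’s bundled auto data (nothing to do with the survey examples elsewhere on this page; the names after the colons are just labels for the output):

sysuse auto, clear
regress price mpg weight

* delta-method standard error and confidence interval for the ratio of two coefficients
nlcom ratio: _b[mpg]/_b[weight]

* the same machinery for a nonlinear transformation of a single coefficient
nlcom expmpg: exp(_b[mpg])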

What’s not to like? Well, for one thing, Stata gives no indication of how well the approximation works. It’s always worth checking that the results look reasonable, and in particularly complex circumstances, one should use simulation/bootstrapping for double-checking. But basically, nlcom is great fun.

References

Agresti, Alan. 2002. Categorical Data Analysis. 2nd ed. Hoboken: John Wiley.

Oehlert, Gary W. 1992. “A Note on the Delta Method.” The American Statistician 46(1): 27–29.

Dec 19, 2012
 

As a follow-up to my recent post on the relationship between gun ownership and gun homicide in OECD countries, I have rolled my dataset (compiled from information published by gunpolicy.org) and my analysis script into a neat Stata package. If you want to recreate the tables and graphs, or otherwise want to play with the data, just enter

net from https://www.kai-arzheimer.com/stata
net get guns

do guns-analysis

in your net-aware copy of Stata.

If you don’t like Stata, you can get the raw data (ASCII) from https://www.kai-arzheimer.com/stata/oecd-gun-deaths.txt. Enjoy!
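If you would rather pull that raw file straight into Stata, something along these lines should do; I am assuming the file is plain delimited text with a header row, so check the result and adjust the import options if necessary (Stata versions older than 13 will need insheet instead):

* read the ASCII data (download the file first if your copy of Stata cannot read directly from https URLs)
import delimited using "https://www.kai-arzheimer.com/stata/oecd-gun-deaths.txt", clear
describe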

Oct 07, 2012
 
Matrix Graph in Stata

Do you like this graph? I don’t think it is particularly attractive, and that is after spending hours and hours creating it. What I really wanted was a matrix-like representation of 18 simulations I ran. More specifically, I simulated the sampling distribution of a statistic under six different conditions for three different sample sizes. Doing the simulations was a breeze, courtesy of Stata’s simulate command, which created 18 corresponding data sets. Graphing them with kdensity also posed no problem, but combining these graphs did, because I could find no canned command that produces what I wanted: a table-like arrangement, with labels for the columns (i.e. sample sizes) and rows (experimental conditions). What I could do was set up and label a variable with 18 categories (one for each data set) and use the by() option to create a trellis plot, but that would waste a lot of ink/space by replicating redundant information.

At the end of the day, I created nine graphs that were completely empty save for the text that I wanted as row/column labels, which I then combined into two separate figures, which were in turn combined (using a distorted aspect ratio) with my 18 separate plots. That boils down to a lot of dumb code. For example, this creates the labels for the six conditions. Note the fxsize option that makes the combined graph narrow, and the necessity to create an empty scatter plot.

capture drop x
capture drop y
capture set obs 5
gen x= .
gen y= .

local allgraphs = ""

forvalues c = 1/6 {
    graph twoway scatter x y, xtitle("") ytitle("") xscale(off) yscale(off) subtitle("(`c')", position(0) nobox) graphregion(margin(zero)) plotregion(style(none))
    local allgraphs "`allgraphs' condition`c'"
    graph rename condition`c' , replace
}

graph combine `allgraphs' , cols(1) colfirst imargin(0 0 0 0) fxsize(10) b1title(" ")

The column labels were created by similar code. Finally, I combined my 18 graphs (their names are in the local macro graphs) and combined the result with the label graphs.

graph combine `graphs' ,colfirst cols(3) ycommon xcommon imargin(3 3 3 3) b1title("\$ B_w \$") l1title("Density")
graph rename simulations, replace
graph combine sizelabels.gph condlabels.gph simulations, imargin(0 0 0 0) cols(2) holes(1)

Can any of you think of a more elegant way to achieve this result?

[Figure: Matrix Graph in Stata]

 

Apr 26, 2012
 

For our piece on distance effects in English elections we geocoded the addresses of hundreds of candidates. For the un-initiated: geocoding is the fine art of converting addresses into geographical coordinates (longitude and latitude). Thanks to Google and some other providers like OpenStreetMap, this is now a relatively painless process. But when one needs more than a few addresses geocoded, one does not rely on pointing and clicking. One needs an API, i.e. a programmatic interface that makes the service accessible through R, Python or some other programming language.

The upside is that I learned a bit about the wonders of Python in general and the charms of geopy in particular. The downside is that writing a simple script that takes a number of strings from a Stata file, converts them into coordinates and gets them back into Stata took longer than I ever thought possible. Just now, I’ve learned about a possible shortcut (via the excellent data monkey blog): geocode is a user-written Stata command that takes a variable containing address strings and returns two new variables containing the latitude/longitude information. Now that would have been a bit of a time-saver. You can install geocode by typing

net from http://www.stata-journal.com/software/sj11-1
net install dm0053

There is, however, one potential drawback: Google limits the number of free queries per day (and possibly per minute). Via Python, you can easily stagger your requests, and you can also use an API key that is supposed to give you a bigger quota. Geocoding a large number of addresses from Stata in one go, on the other hand, will probably result in an equally large number of parsing errors.