# Stata-related posts

Stata is my favourite general-purpose stats package. Sadly, it is also one of my favourite pastimes, but there you are. Here is my collection of Stata-related blog posts. If this is relevant for you, you might also be interested in a series of slides for a Stata course I taught some years ago (in German).

The new version of our Stata module surveybias has now appeared on SSC. From within Stata, simply run

ssc install surveybias, all

to install it (no surprise here).

For the time being, the article on which the module is based has been ungated. Get it while it is free.

## Measuring Survey Bias

In our recent Political Analysis paper (ungated authors’ version), Jocelyn Evans and I show how Martin, Traugott, and Kennedy’s two-party measure of survey accuracy can be extended to the multi-party case (which is slightly more relevant for comparativists and other people interested in the world outside the US). This extension leads to a series of party-specific measures of bias as well as to two scalar measures of overall survey bias.

Moreover, we demonstrate that our new measures are closely linked to the familiar multinomial logit model (just as the MTK measure is linked to the binomial logit). This demonstration is NOT an exercise in Excruciatingly Boring Algebra. Rather, it leads to a straightforward derivation of standard errors and facilitates the implementation of our methodology in standard statistical packages.
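In rough notation (simplified here; the paper gives the exact definitions, including the weighting used for $B_w$), the building blocks look like this:

```latex
% MTK's two-party measure: survey shares r, d vs. official results R, D
A = \ln\left(\frac{r/d}{R/D}\right)

% Party-specific extension (sketch): survey share p_i vs. true share \pi_i
A^{\prime}_{i} = \ln\left(\frac{p_i/(1-p_i)}{\pi_i/(1-\pi_i)}\right)

% A scalar summary averages the absolute party-specific terms over k parties
B = \frac{1}{k}\sum_{i=1}^{k}\left|A^{\prime}_{i}\right|
```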


## An Update to Our Free Software

We have programmed such an implementation in Stata, and it should not be too difficult to implement our methodology in R (any volunteers?). Our Stata code has been on SSC for a couple of months now but has recently been significantly updated. The new version 1.0 includes various bug fixes to the existing commands surveybias.ado and surveybiasi.ado, slightly better documentation, two toy data sets that should help you get started with the methodology, and a new command surveybiasseries.ado.

surveybiasseries facilitates comparisons across a series of (pre-election) polls. It expects a data set in which each row corresponds to the margins (predicted vote shares) from one survey. Such a dataset can quickly be constructed from published sources. Access to the original data is not required. surveybiasseries calculates the accuracy measures for each poll and stores them in a set of new variables, which can then be used as dependent variable(s) in a model of poll accuracy.
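A minimal sketch of such a data set (the pollster, party, and variable names are made up; the option names are those documented for the command):

```
* One row per poll: published margins, the matching true result, sample size
clear
input str12 pollster partya partyb partyc partyatrue partybtrue partyctrue N
"PollsterX" 30 40 30 28 42 30 1000
"PollsterY" 32 38 30 28 42 30  950
end
surveybiasseries, popvariables(partyatrue-partyctrue) ///
    samplevariables(partya-partyc) nvar(N) gen(acc)
```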

## Getting Started with Estimating Survey Bias

The new version of surveybias for Stata should appear on SSC over the next couple of weeks (double-check the version number (was 0.65, should now be 1.0) and the release date), but you can install it right now from this website:

net from http://www.kai-arzheimer.com/stata
net install surveybias

To see the new command in action, try this:

use fivefrenchsurveys, clear

will load information from five pre-election polls taken during the French presidential campaign (2012) into memory. The vote shares refer to eight candidates that competed in the first round.

surveybiasseries in 1/3 , popvaria(*true) samplev(fh-other) nvar(N) gen(frenchsurveys)

will calculate our accuracy measures and their standard errors for the first three surveys over the full set of candidates.

surveybiasseries in 4/5, popvariables(fhtrue-mptrue) samplevariables(fh-mp) nvar(N) gen(threeparty)

will calculate bias with respect to the three-party vote (i.e. Hollande, Sarkozy, Le Pen) for surveys no. 4 and 5 (vote shares are automatically rescaled to unity, no recoding required). The new variable names start with “frenchsurveys” and “threeparty” and should otherwise be self-explanatory (i.e. threepartybw is $B_w$ for the three-party case, and threepartysebw the corresponding standard error). Feel free to plot and model to your heart’s content.
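To get the plotting started, a sketch (variable names as generated in the example above; the normal-approximation band is my assumption, not part of the package):

```
* Rough 95% band around B_w for each poll
gen poll = _n
gen lo = threepartybw - 1.96*threepartysebw
gen hi = threepartybw + 1.96*threepartysebw
twoway (rcap lo hi poll) (scatter threepartybw poll), ///
    xtitle("Poll") ytitle("B_w") legend(off)
```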

Replication data for our forthcoming Political Analysis paper on our new, multinomial accuracy measure for bias in opinion surveys (e.g. pre-election polls) has just gone online at the PA dataverse. So if you want to gauge the performance of French pollsters over the 2012 presidential campaign (or cannot wait to re-run our simulations of the measure’s sampling distribution), download our data and start playing.

A current version of our Stata add-on surveybias is included in the bundle. Alternatively, you can install the software into your personal ado dir by typing ssc install surveybias.

In a recent paper, we derive various multinomial measures of bias in public opinion surveys (e.g. pre-election polls). Put differently, with our methodology, you may calculate a scalar measure of survey bias in multi-party elections.

Thanks to Kit Baum over at Boston College, our Stata add-on surveybias.ado is now available from Statistical Software Components (SSC).  The add-on takes as its argument the name of a categorical variable and said variable’s true distribution in the population. For what it’s worth, the program tries to be smart: surveybias vote, popvalues(900000 1200000 1800000), surveybias vote, popvalues(0.2307692 0.3076923 0.4615385), and surveybias vote, popvalues(23.07692 30.76923 46.15385) should all give the same result.

If you don’t have access to the raw data but want to assess survey bias evident in published figures, there is surveybiasi, an “immediate” command that lets you do stuff like this:  surveybiasi , popvalues(30 40 30) samplevalues(40 40 20) n(1000). Again, you may specify absolute values, relative frequencies, or percentages.

If you want to go ahead and measure survey bias, install surveybias.ado and surveybiasi.ado on your computer by typing ssc install surveybias in your net-aware copy of Stata. And if you use and like our software, please cite our forthcoming Political Analysis paper on the New Multinomial Accuracy Measure for Polling Bias.

All surveys deviate from the true distributions of the variables, but some more so than others. This is particularly relevant in the context of election studies, where the true distribution of the vote is revealed on election night. Wouldn’t it be nice if one could quantify the bias exhibited by pollster X in their pre-election survey(s), with one single number? Heck, you could even model bias in polls, using RHS variables such as time to election, sample size or sponsor of the survey, coming up with an estimate of the infamous “house effect”.
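Such a model could be as simple as an OLS regression of the accuracy measure on poll characteristics (all variable names here are hypothetical):

```
* Hypothetical model of overall bias; bw holds the accuracy measure per poll
regress bw daystoelection samplesize i.sponsor
```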

Jocelyn Evans and I have developed a method for calculating such a figure by extending Martin, Traugott, and Kennedy’s measure $A$ to the multi-party case. Being the very creative chaps we are, we call this new statistic [drumroll] $B$. We also derive a weighted version of this measure, $B_w$, and statistics to measure bias in favour of/against any single party ($A^{\prime}_i$). Of course, our measures can be applied to the sampling of any categorical variable whose distribution is known.

We fully develop all these goodies (and illustrate their usefulness by analysing bias in French pre-election polls) in a paper that (to our immense satisfaction) has just been accepted for publication in Political Analysis (replication files to follow).

Our module surveybias is a Stata ado file that implements these methods. It should become available from SSC over the summer, giving you convenient access to the new methods. I’ll keep you posted.

# What is the Delta Method?

I have used the delta method occasionally for years without really understanding what is going on under the hood. A recent encounter with an inquisitive reviewer has changed that. As it turned out, the delta method is even more useful than sliced bread, and much healthier.

The delta method, whose foundations were laid in the 1940s by Cramér (Oehlert 1992), approximates the expectation (or higher moments) of some function $f(X)$ of a random variable $X$ by relying on a (truncated) Taylor series expansion. More specifically, Agresti (2002: 578) shows that (under weak conditions) for some parameter estimate $\hat\beta$ that has an approximately normal sampling distribution with variance $\sigma^2$, the sampling distribution of $f(\hat\beta)$ is also approximately normal with variance $[f^{\prime}(\beta)]^2\sigma^2$, since $f$ is approximately linear in the neighbourhood of $\beta$. The delta method can be generalised to the case of a multivariate normal random vector (Agresti 2002: 579) such as the joint sampling distribution of some set of parameter estimates.

In plain words, that means that one can use the delta method to calculate confidence intervals and perform hypothesis tests on just about any linear or nonlinear transformation of a vector of parameter estimates. If you are interested in the ratio of two coefficients and need a confidence interval, or if, for some reason, you need to know whether some nonlinear combination of your estimates exceeds a given threshold with some probability, the delta method is your friend.
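For the ratio example, writing $g(\beta_1, \beta_2) = \beta_1/\beta_2$ and letting $\sigma_1^2$, $\sigma_2^2$ and $\sigma_{12}$ denote the sampling variances and covariance of the two estimates, the multivariate delta method gives:

```latex
\nabla g = \begin{pmatrix} 1/\beta_2 \\ -\beta_1/\beta_2^2 \end{pmatrix}, \qquad
\operatorname{Var}\left(\frac{\hat\beta_1}{\hat\beta_2}\right)
\approx \nabla g^{\top} \Sigma \, \nabla g
= \frac{\sigma_1^2}{\beta_2^2}
- \frac{2\beta_1\sigma_{12}}{\beta_2^3}
+ \frac{\beta_1^2\sigma_2^2}{\beta_2^4}
```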

# The Delta Method and nlcom

Stata’s procedure nlcom is a particularly versatile and powerful implementation of the delta method. As a post-estimation command, nlcom accepts symbolic references to model parameters and computes sampling variances for their linear and non-linear combinations and transformations. If you can write down the formula of the transformation, nlcom will spit out the result, standard error and confidence interval, and will even store the full variance-covariance matrix of the estimates. That, in turn, means that, amongst other things, you can abuse Stata’s built-in procedures to implement your own estimators.
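A minimal example, using a data set that ships with Stata:

```
* Delta-method SE and CI for the ratio of two regression coefficients
sysuse auto, clear
regress price mpg weight
nlcom ratio: _b[mpg]/_b[weight]
```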

What’s not to like? Well, for one thing, Stata gives no indication of how well the approximation works. It’s always worth checking that the results look reasonable, and in particularly complex circumstances, one should use simulation/bootstrapping for double-checking. But basically, nlcom is great fun.
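Such a double-check might look like this, with a coefficient ratio as the statistic (200 replications are just for illustration):

```
* Bootstrap a coefficient ratio and compare with the delta-method CI
sysuse auto, clear
bootstrap ratio=(_b[mpg]/_b[weight]), reps(200) seed(42): regress price mpg weight
estat bootstrap, percentile
```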

# References

Agresti, Alan. 2002. Categorical Data Analysis. 2 ed. Hoboken: John Wiley.

Oehlert, Gary W. 1992. “A Note on the Delta Method.” The American Statistician 46(1): 27–29.

As a follow-up to my recent post on the relationship between gun ownership and gun homicide in OECD countries, I have rolled my dataset (compiled from information published by gunpolicy.org) and my analysis script into a neat Stata package. If you want to recreate the tables and graphs, or otherwise want to play with the data just enter
net get http://www.kai-arzheimer.com/stata/guns

do guns-analysis

in your net-aware copy of Stata.

If you don’t like Stata, you can get the raw data (ASCII) from http://www.kai-arzheimer.com/stata/oecd-gun-deaths.txt . Enjoy!

Do you like this graph? I don’t think it is particularly attractive, and that is after spending hours and hours creating it. What I really wanted was a matrix-like representation of 18 simulations I ran. More specifically, I simulated the sampling distribution of a statistic under six different conditions for three different sample sizes. Doing the simulations was a breeze, courtesy of Stata’s simulate command, which created 18 corresponding data sets. Graphing them with kdensity also posed no problem, but combining these graphs did, because I could find no canned command that produces what I wanted: a table-like arrangement, with labels for the columns (i.e. sample sizes) and rows (experimental conditions).

What I could have done was set up and label a variable with 18 categories (one for each data set) and use the by() option to create a trellis plot. But that would waste a lot of ink/space by replicating redundant information.

At the end of the day, I created nine graphs that were completely empty save for the text that I wanted as row/column labels, which I then combined into two separate figures, which were in turn combined (using a distorted aspect ratio) with my 18 separate plots. That boils down to a lot of dumb code. E.g., this creates the labels for the six conditions. Note the fxsize() option that makes the combined graph narrow, and the necessity to create an empty scatter plot.
capture drop x
capture drop y
capture set obs 5
gen x = .
gen y = .

local allgraphs = ""

forvalues c = 1/6 {
    graph twoway scatter x y, xtitle("") ytitle("") xscale(off) yscale(off) ///
        subtitle("(`c')", position(0) nobox) graphregion(margin(zero)) plotregion(style(none))
    local allgraphs "`allgraphs' condition`c'"
    graph rename condition`c' , replace
}

graph combine `allgraphs' , cols(1) colfirst imargin(0 0 0 0) fxsize(10) b1title(" ")

The column labels were created by similar code. Finally, I combined my 18 graphs (their names are in the local macro) and combined the results with the label graphs.
graph combine `graphs' , colfirst cols(3) ycommon xcommon imargin(3 3 3 3) b1title("B_w") l1title("Density")
graph rename simulations, replace
graph combine sizelabels.gph condlabels.gph simulations, imargin(0 0 0 0) cols(2) holes(1)
Can anyone of you think of a more elegant way to achieve this result?

For our piece on distance effects in English elections we geocoded the addresses of hundreds of candidates. For the un-initiated: geocoding is the fine art of converting addresses into geographical coordinates (longitude and latitude). Thanks to Google and some other providers like OpenStreetMap, this is now a relatively painless process. But when one needs more than a few addresses geocoded, one does not rely on pointing and clicking. One needs an API, i.e. an interface that makes the service accessible through R, Python or some other programming language.

The upside is that I learned a bit about the wonders of Python in general and the charms of geopy in particular. The downside is that writing a simple script that takes a number of strings from a Stata file, converts them into coordinates and gets them back into Stata took longer than I ever thought possible. Just now, I’ve learned about a possible shortcut (via the excellent data monkey blog): geocode is a user-written Stata command that takes a variable containing address strings and returns two new variables containing the latitude/longitude information. Now that would have been a bit of a time-saver. You can install geocode by typing

net from http://www.stata-journal.com/software/sj11-1
net install dm0053
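Usage is then roughly as follows; the option name and the generated variables are from memory, so double-check help geocode before relying on this sketch:

```
* Hypothetical: build a full address string and geocode it
gen address = street + ", " + town + ", United Kingdom"
geocode, fulladdr(address)
```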

There is, however, one potential drawback: Google limits the number of free queries per day (and possibly per minute). Via Python, you can easily stagger your requests, and you can also use an API key that is supposed to give you a bigger quota. Geocoding a large number of addresses from Stata in one go, on the other hand, will probably result in an equally large number of parsing errors.

In the past, I did a lot of multi-level modelling with MLwiN 2.02, which I quickly learned to loathe. Back in the late 1990s, MLwiN was perhaps the first multi-level software that had a somewhat intuitive interface, i.e. it allowed one to build a model by pointing and clicking. Moreover, it printed updated estimates on the screen while cycling merrily through the parameter space. That was sort of cool, as it could take minutes to reach convergence, and without the updating, one would never have been sure that the program had not crashed yet. Which it did quite often, even for simple models.

Worse than the bugs was the lack of proper scriptability. Pointing and clicking loses its appeal when you need to run the same model on 12 different datasets, or when you are looking at three variants of the same model and 10 recodes of the same variable. Throw in the desire to semi-automatically re-compile the findings from these exercises into two nice tables for inclusion in the manuscript, again and again after finding yet another problem with a model, and you will agree that any piece of software that is not scriptable is pretty useless for scientists.

MLwiN’s command language was unreliable and woefully underdocumented, and everything was a pain. So I embraced xtmixed when it came along with Stata 9/10, which solved all of these problems.

runmlwin presentation (pdf)

But xtmixed is slow with large datasets/complex models. It relies on quadrature, which is accurate but computationally intensive. MLwiN works with approximations of the likelihood function (quick and dirty) or MCMC (strictly speaking a Bayesian approach, but people don’t ask too many questions because it tends to be faster than quadrature). Moreover, MLwiN can run a lot of fancy models that xtmixed cannot, because it is a highly specialised program that has been around for a very long time.

Enter the good people over at the Centre for Multilevel Modelling at Bristol, who have come up with runmlwin, an ado that essentially makes the functionality of MLwiN available as a Stata command, postestimation analysis and all. Can’t wait to see if this works with Linux, wine and my ancient binaries, too.
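For the record, the canonical runmlwin example from the CMM materials looks roughly like this (the URL and option names may have changed, so check the current documentation):

```
* Two-level random-intercept model, estimated by MLwiN from within Stata
use http://www.bristol.ac.uk/cmm/media/runmlwin/tutorial, clear
runmlwin normexam cons standlrt, level2(school: cons) level1(student: cons) nopause
```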