# Stata-related posts

Stata is my favourite general-purpose stats package. Sadly, it is also one of my favourite pastimes, but there you are. Here is my collection of Stata-related blog posts. If this is relevant for you, you might also be interested in a series of slides for a Stata course I taught some years ago (in German).

As a follow-up to my recent post on the relationship between gun ownership and gun homicide in OECD countries, I have rolled my dataset (compiled from information published by gunpolicy.org) and my analysis script into a neat Stata package. If you want to recreate the tables and graphs, or otherwise want to play with the data, just enter
```
net get http://www.kai-arzheimer.com/stata/guns
do guns-analysis
```

in your net-aware copy of Stata.

If you don’t like Stata, you can get the raw data (ASCII) from http://www.kai-arzheimer.com/stata/oecd-gun-deaths.txt. Enjoy!
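Incidentally, even the raw file can be pulled straight into Stata; something like this should work (I have not double-checked the file’s delimiter, so treat this as a starting point):

```
* pull the raw ASCII file straight from the web (delimiter assumed;
* use -infile- instead if the file turns out to be space-separated)
insheet using "http://www.kai-arzheimer.com/stata/oecd-gun-deaths.txt", clear
```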

Do you like this graph? I don’t think it is particularly attractive, and that is after spending hours and hours creating it. What I really wanted was a matrix-like representation of 18 simulations I ran. More specifically, I simulated the sampling distribution of a statistic under six different conditions for three different sample sizes. Doing the simulations was a breeze, courtesy of Stata’s `simulate` command, which created 18 corresponding data sets. Graphing them with `kdensity` posed no problem either, but combining these graphs did, because I could find no canned command that produces what I wanted: a table-like arrangement, with labels for the columns (i.e. sample sizes) and rows (experimental conditions). What I could have done was to set up and label a variable with 18 categories (one for each data set) and use the `,by()` option to create a trellis plot. But that would waste a lot of ink/space by replicating redundant information.

At the end of the day, I created nine graphs that were completely empty save for the text that I wanted as row/column labels, which I then combined into two separate figures, which were in turn combined (using a distorted aspect ratio) with my 18 separate plots. That boils down to a lot of dumb code. E.g., this creates the labels for the six conditions. Note the `fxsize` option that makes the combined graph narrow, and the necessity to create an empty scatter plot.
```
* empty placeholder variables for the dummy scatter plots
capture drop x
capture drop y
capture set obs 5
gen x = .
gen y = .

* one empty graph per experimental condition, carrying only the row label
local allgraphs = ""
forvalues c = 1/6 {
    graph twoway scatter x y, xtitle("") ytitle("") xscale(off) yscale(off) ///
        subtitle("(`c')", position(0) nobox) graphregion(margin(zero)) ///
        plotregion(style(none))
    local allgraphs "`allgraphs' condition`c'"
    graph rename condition`c', replace
}

* stack the six labels into one narrow column
graph combine `allgraphs', cols(1) colfirst imargin(0 0 0 0) fxsize(10) b1title(" ")
```

The column labels were created by similar code (a sketch follows below). Finally, I combined my 18 graphs (their names are in the local macro) and combined the result with the label graphs.
```
graph combine `graphs', colfirst cols(3) ycommon xcommon imargin(3 3 3 3) ///
    b1title("\$ B_w \$") l1title("Density")
graph rename simulations, replace
graph combine sizelabels.gph condlabels.gph simulations, imargin(0 0 0 0) cols(2) holes(1)
```
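For completeness, here is roughly what the “similar code” for the three column labels might have looked like. This is a reconstruction, not the original: the sample sizes are placeholders, and I am assuming the result was saved as sizelabels.gph to match the final combine above.

```
* reconstruction of the "similar code" for the column (sample size) labels;
* the sample sizes used here are placeholders, not those of the actual simulations
local colgraphs ""
local i = 1
foreach n in 50 500 5000 {
    graph twoway scatter x y, xtitle("") ytitle("") xscale(off) yscale(off) ///
        subtitle("n = `n'", position(0) nobox) graphregion(margin(zero)) ///
        plotregion(style(none))
    local colgraphs "`colgraphs' size`i'"
    graph rename size`i', replace
    local ++i
}
* one flat row of labels, shrunk vertically, then saved for the final combine
graph combine `colgraphs', rows(1) imargin(0 0 0 0) fysize(10)
graph save sizelabels, replace
```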
Can any of you think of a more elegant way to achieve this result?

For our piece on distance effects in English elections we geocoded the addresses of hundreds of candidates. For the un-initiated: Geocoding is the fine art of converting addresses into geographical coordinates (longitude and latitude). Thanks to Google and some other providers like OpenStreetMap, this is now a relatively painless process. But when one needs more than a few addresses geocoded, one does not rely on pointing and clicking. One needs an API, i.e. a programmatic interface that makes the service accessible through R, Python, or some other programming language.

The upside is that I learned a bit about the wonders of Python in general and the charms of geopy in particular. The downside is that writing a simple script that takes a number of strings from a Stata file, converts them into coordinates and gets them back into Stata took longer than I ever thought possible. Just now, I’ve learned about a possible shortcut (via the excellent data monkey blog): geocode is a user-written Stata command that takes a variable containing address strings and returns two new variables containing the latitude/longitude information. Now that would have been a bit of a time-saver. You can install geocode by typing

```
net from http://www.stata-journal.com/software/sj11-1
net install dm0053
```
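I have not tried it myself yet, so take the following as a sketch of what usage would presumably look like, mirroring the description above (one string variable in, latitude/longitude variables out); the option and variable names are my assumption and should be checked against `help geocode`:

```
* hypothetical usage sketch -- fulladdr() and the name of the address
* variable are my assumptions; verify with: help geocode
geocode, fulladdr(address)
* the command should add latitude/longitude variables to the data set
```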

There is, however, one potential drawback: Google limits the number of free queries per day (and possibly per minute). Via Python, you can easily stagger your requests, and you can also use an API key that is supposed to give you a bigger quota. Geocoding a large number of addresses from Stata in one go, on the other hand, will probably result in an equally large number of parsing errors.

In the past, I did a lot of multi-level modelling with MLwiN 2.02, which I quickly learned to loathe. Back in the late 1990s, MLwiN was perhaps the first ML software that had a somewhat intuitive interface, i.e. it allowed one to build a model by pointing and clicking. Moreover, it printed updated estimates on the screen while cycling merrily through the parameter space. That was sort of cool, as it could take minutes to reach convergence, and without the updating, one would never have been sure that the program had not crashed yet. Which it did quite often, even for simple models.

Worse than the bugs was the lack of proper scriptability. Pointing and clicking loses its appeal when you need to run the same model on 12 different datasets, or when you are looking at three variants of the same model and 10 recodes of the same variable. Throw in the desire to semi-automatically re-compile the findings from these exercises into two nice tables for inclusion in a paper, and the prospect of doing all this again and again after finding yet another problem with a model, and you will agree that any piece of software that is not scriptable is pretty useless for scientists.

MLwiN’s command language was unreliable and woefully underdocumented, and everything was a pain. So I embraced xtmixed when it came along with Stata 9/10, which solved all of these problems.

But xtmixed is slow with large datasets/complex models. It relies on numerical quadrature, which is accurate but computationally intensive. MLwiN works with approximations of the likelihood function (quick and dirty) or with MCMC (strictly speaking a Bayesian approach, but people don’t ask too many questions because it tends to be faster than quadrature). Moreover, MLwiN can run a lot of fancy models that xtmixed cannot, because it is a highly specialised program that has been around for a very long time.

Enter the good people over at the Centre for Multilevel Modelling at Bristol, who have come up with runmlwin, an ado that essentially makes the functionality of MLwiN available as a Stata command, postestimation analysis and all; see their runmlwin presentation (pdf) for the details. Can’t wait to see if this works with Linux, wine and my ancient binaries, too.
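Judging from the examples in the presentation, the basic pattern looks like this (a sketch with illustrative variable names; note that, following MLwiN conventions, the constant has to be supplied explicitly):

```
* point runmlwin at the local MLwiN executable (this path is just an example)
global MLwiN_path "C:\Program Files\MLwiN v2.24\mlwin.exe"

* MLwiN has no automatic constant, so supply one explicitly
generate cons = 1

* random-intercept model, pupils nested in schools (variable names illustrative);
* the MCMC engine mentioned above can be requested via the mcmc() option
runmlwin score cons ses, level2(school: cons) level1(pupil: cons)
```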

I’m currently working on an analysis of the latest state election in Rhineland-Palatinate using aggregate data alone, i.e. electoral returns and structural information, which is available at the level of the state’s roughly 2,300 municipalities. The state’s Green party (historically very weak) has roughly tripled its share of the vote since the last election in 2006, and I want to know where all these additional votes came from. And yes, I’m treading very carefully around the very large potential ecological fallacy that lurks at the centre of my analysis, regressing Green gains on factors such as tax receipts and distance from the next university town, but never claiming that the rich or the students or both turned to the Greens.

One common problem with this type of analysis is that not all municipalities are created equal. There is a surprisingly large number of flyspeck villages with only a few dozen voters, whereas the state’s capital boasts more than 140,000 registered voters. Most places are somewhere in between. Having many small municipalities in the regression feels wrong for at least two reasons. First, small-scale changes of political preferences in tiny electorates will result in relatively large percentage changes. Second, the behaviour of a relatively large number of voters who happen to live in a small number of relatively large municipalities will be grossly underrepresented, i.e. the countryside will drive the results.

My PhD supervisor, who did a lot of this stuff in his time, used to weight municipalities by the size of their electorates to deal with these problems. But this would lead to pretty extreme weights in my case. Moreover, while voters bring about electoral results, I really don’t want to introduce claims about individual behaviour through the back door.

My next idea was to weight municipalities by the square root of the size of their electorates. Why? In a sense, the observed behaviour is like a sample from the underlying distribution of preferences, and the reliability of this estimate is proportional to the square root of the number of people in a given community. But even taking the square root left me with weights that were quite extreme, and the concern regarding the level of analysis still applied.
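In Stata terms, that weighting scheme is just an analytic weight; a minimal sketch, with made-up variable names alongside two regressors from the table below:

```
* weight municipalities by the square root of their electorate
* (greengain and electorate are illustrative names; hekst and lnunidist
* are the tax-receipt and university-distance variables used below)
generate sqrtelect = sqrt(electorate)
regress greengain hekst lnunidist [aweight = sqrtelect]
```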

Then I realised that instead of weighting by size, I could simply include the size of the electorate as an additional independent variable to correct for potential bias. But this still left me exposed to the danger of extreme outliers (think small, poor, rural communities where the number of Green voters goes up from one to four, a whopping 300 per cent increase) playing havoc with my analysis. So I began reading up on robust regression and its various implementations in Stata.

The basic idea of robust regression is that real data are more likely than not a mixture of (at least) two mechanisms: the “true model” whose coefficients we want to estimate on the one hand, and some other process(es) that contaminate the data on the other. If these contaminating data points are far away from the multivariate mean of the x-variables (outliers) and deviate substantially from the true regression line, they will bias the estimates.

Robust regression estimators are able to deal with a high degree of contamination, i.e. they can recover the true parameters even if there are many outliers amongst the data points. The downside is that the older generation of robust estimators also has low efficiency: the estimates are unbiased but have a much higher variance than regular OLS estimates.

A number of newer (post-1980) estimators, however, are less affected by this problem. One particularly promising approach is the MM estimator, which has been implemented in Stata ados by Verardi/Croux (mmregress) and by Ben Jann (robreg mm). Jann’s ado seems to be faster and plays nicely with his esttab/estout package, so I went with that.

The MM estimator basically works by identifying outliers and weighting them down, so it amounts to a particularly sophisticated case of weighted least squares. Using the defaults, MM claims to retain 85 per cent of the efficiency of OLS while being able to deal with up to 50 per cent contamination. As you can see in the table below, the MM estimates deviate somewhat from their OLS counterparts. The difference is most pronounced for the effect of tax receipts (hekst).

robreg mm has an option to store the optimal weights. I ran OLS again using these weights (column 3), thereby recovering the MM estimates and demonstrating that MM is really just weighted least squares (the standard errors, which are not very relevant here, differ because robreg uses the robust variance estimator). This is fascinating stuff, and I’m looking forward to a forthcoming book by Jann and Verardi on robust regression in Stata (to be published by Stata Press in 2012).
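For the record, the three columns of the table boil down to something like the following; the dependent variable name is my invention, the regressors are those from the table, and the exact syntax for saving the MM weights should be checked against `help robreg`:

```
* Ben Jann's robreg, available from SSC
ssc install robreg

* column 1: plain OLS
regress greengain greenpct2006 hekst senioren kregvoters11 kbevdichte uni lnunidist

* column 2: MM estimation (default target: 85% Gaussian efficiency)
robreg mm greengain greenpct2006 hekst senioren kregvoters11 kbevdichte uni lnunidist

* column 3: OLS re-run with the stored MM weights -- here I simply assume
* they ended up in a variable called w; see -help robreg- for the exact syntax
regress greengain greenpct2006 hekst senioren kregvoters11 kbevdichte uni lnunidist [aweight = w]
```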

```
                     OLS              MM             WLS
greenpct2006        0.193***        0.329***        0.329***
                  (0.0349)        (0.0592)        (0.0278)
hekst               0.311***        0.634***        0.634***
                  (0.0894)         (0.124)        (0.0688)
senioren          -0.0744***       -0.100***       -0.100***
                  (0.0131)        (0.0149)       (0.00994)
kregvoters11      -0.0125        -0.00844        -0.00844
                  (0.0146)       (0.00669)       (0.00982)
kbevdichte         -0.433        -0.00750        -0.00750
                   (0.464)         (0.330)         (0.326)
uni                 1.258           0.816           0.816
                   (1.695)         (0.765)         (1.137)
lnunidist          -0.418**        -0.372**        -0.372***
                   (0.127)         (0.113)        (0.0918)
_cons               8.232***        7.078***        7.078***
                   (0.627)         (0.663)         (0.461)

Standard errors in parentheses
```