Stata is my favourite general-purpose stats package. Sadly, it is also one of my favourite pasttimes, but there you are. Here is my collection of Stata-related blog posts. If this is relevant for you, you might also be interested in a series of slides for a Stata course I taught some years ago (in German)

For our piece on distance effects in English elections we geocoded the addresses of hundreds of candidates. For the un-initiated: Geocoding is the fine art of converting addresses into geographical coordinates (longitude and latitude). Thanks to Google and some other providers like OpenStreeMap, this is now a relatively painless process. But when one needs more than a few addresses geocoded, one does not rely on pointing-and-clicking. One needs an API, i.e. a software library that makes the service accessible through R, Python or some other programming language.

The upside is that I learned a bit about the wonders of Python in general and the charms of geopy in particular. The downside is that writing a simple script that takes a number of strings from a Stata file, converts them into coordinates and gets them back into Stata took longer than I ever thought possible. Just now, I’ve learned about a possible shortcut (via the excellent data monkey blog): geocode is a user-written Stata command that takes a variable containing address strings and returns two new variables containing the latitude/longitude information. Now that would have been a bit of a time-saver. You can install geocode by typing

net from http://www.stata-journal.com/software/sj11-1 net install dm0053

There is, however, one potential drawback: Google limits the number of free queries per day (and possibly per minute). Via Python, you can easily stagger your requests, and you can also use an API key that is supposed to give you a bigger quota. Geocoding a large number of addresses from Stata in one go, on the other hand, will probably result in an equally large number of parsing errors.

In the past, I did a lot of multi-level modelling with MLwiN 2.02, which I quickly learned to loath. Back in the late 1990s, MLwiN was perhaps the first ML software that had a somewhat intuitive interface, i.e. it allowed one to build a model by pointing and clicking. Moreover, it printed updated estimates on the screen while cycling merrily through the parameter space. That was sort of cool, as it could take minutes to reach convergence, and without the updating, one would never have been sure that the program had not crashed yet. Which it did quite often, even for simple models.

Worse than the bugs was the lack of proper scriptability. Pointing and clicking loses its appeal when you need to run the same model on 12 different datasets, or when you are looking at three variants of the same model and 10 recodes of the same variable. Throw in the desire to semi-automatically re-compile the findings from these exercises into two nice tables for inclusion in again and again after finding yet another problem with a model, and you will agree that any piece of software that is not scriptable is pretty useless for scientists.

MLwiN’s command language was unreliable and woefully underdocumented, and everything was a pain. So I embraced xtmixed when it came along with Stata 9/10, which solved all of these problems.

runmlwin presentation (pdf)

But xtmixed is slow with large datsets/complex models. It relies on quadrature, which is exact but computationally intensive. MLwiN works with approximations of the likelihood function (quick and dirty) or MCMC (strictly speaking a Bayesian approach, but people don’t ask to many questions because it tends to be faster than quadrature). Moreover, MLwiN can run a lot of fancy models that xtmixed cannot, because it is a highly specialised program that has been around for a very long time.

Enter the good people over at the Centre for Multilevel Modelling at Bristol, who have come up with runmlwin, an ado that essentially makes the functionality of MLwiN available as a Stata command, postestimation analysis and all. Can’t wait to see if this works with Linux, wine and my ancient binaries, too.

I’m currently working on an analysis of the latest state election in Rhineland-Palatinate using aggregate data alone, i.e. electoral returns and structural information, which is available at the level of the state’s roughly 2300 municipalities. The state’s Green party (historically very weak) has roughly tripled their share of the vote since the last election in 2006, and I want to know were all these additional votes come from. And yes, I’m treading very careful around the very large potential ecological fallacy that lurks at the centre of my analysis, regressing Green gains on factors such as tax receipts and distance from next university town, but never claiming that the rich or the students or both turned to the Greens.

One common problem with this type of analysis is that not all municipalities are created equal. There is a surprisingly large number of flyspeck villages with only a few dozen voters on, whereas the state’s capital boasts more than 140,000 registered voters. Most places are somewhere in between. Having many small municipalities in the regression feels wrong for at least two reasons. First, small-scale changes of political preferences in tiny electorates will result in relatively large percentage changes. Second, the behaviour of a relatively large number of voters who happen to live in a small number of relatively large municipalities will be grossly underrepresented, i.e. the countryside will drive the results.

My PhD supervisor, who did a lot of this stuff in his time, used to weigh municipalities by the size of their electorates to deal with these problems. But this would lead to pretty extreme weights in my case. Moreover, while voters bring about electoral results, I really don’t want to introduce claims about individual behaviour through the back door.

My next idea was to weigh municipalities by the square root of the size their electorates. Why? In a sense, the observed behaviour is like a sample from the underlying distribution of preferences, and the reliability of this estimate is proportional to the square root of the number of people in a given community. But even taking the square root left me with weights that were quite extreme, and the concern regarding the level of analysis still applied.

Then I realised that instead of weighing by size, I could simply include the size of the electorate as an additional independent variable to correct for potential bias. But this still left me exposed to the danger of extreme outliers (think small, poor, rural communities where the number of Green voters goes up from one to four, a whopping 300 per cent increase) playing havoc with my analysis. So I began reading up on robust regression and its various implementations in Stata.

The basic idea of robust regression is that real data are more likely than not a mixture of (at least) two mechanisms: the “true model” whose coefficients we want to estimate one the one hand, and some other process(es) that contaminate the data on the other. If these contaminating data points are far away from the multivariate mean of the x-Variables (outliers) and deviate substantially from the true regression line, they will bias the estimates.

Robust regression estimators are able to deal with a high degree of contamination, i.e. they can recover the true parameters even if there are many outliers amongst the data points. The downside is that the older generation of robust estimators also have a low efficiency (the estimates are unbiased but have a much higher variance than regular OLS-estimates).

A number of newer (post-1980) estimators, however, are less affected by this problem. One particular promising approach is the MM estimator, that has been implemented in Stata ados by Veradi/Croux (MMregress) and by Ben Jann (robreg mm). Jann’s ado seems to be faster and plays nicely with his esttab/estout package, so I went with that.

The MM estimator works basically by identifying outliers and weighing them down, so it amounts to a particularly sophisticated case of weighted least squares. Using the defaults, MM claims to have 85 per cent of the efficiency of OLS while being able to deal with up to 50 per cent contamination. As you can see in the table, the MM estimates deviate somewhat from their OLS counterparts. The difference is most pronounced for the effect of tax receipts (hekst).

robreg mm has an option to store the optimal weights. I ran OLS again using these weights (column 3), thereby recovering the MM estimates and demonstrating that MM is really just weighted least squares (standard errors (which are not very relevant here) differ, because robreg uses the robust variance estimator). This is fascinating stuff, and I’m looking forward to a forthcoming book by Jann and Veradi on robust regression in Stata (to be published by Stata Press in 2012).

The liberal German weekly Zeit has commissioned a YouGov poll which demonstrates that Germans are more afraid of right-wing terrorists than of Islamist terrorists. The question read “What is, in your opinion, the biggest terrorist threat in Germany?” On offer were right-wingers (41 per cent), Islamists (36.6 per cent), left-wingers (5.6 per cent), other groups (3.8 per cent), or (my favourite) “no threat” (13 per cent). This is a pretty daft question anyway. Given the news coverage of the Neo-Nazi gang that has killed at least ten people more or less under the eyes of the authorities, and given that the authorities have so far managed to stop would-be terrorists in their tracks, the result is hardly surprising.

Nonetheless, the difference of just under five percentage points made the headlines, because there is a subtext for Zeit readers: Germans are worried about right-wing terrorism (a few weeks ago many people would have denied that there are right-wing terrorists operating in Germany), which must be a good thing, and they are less concerned about Islamist terrorists, which is possibly a progressive thing. Or something along those lines.

But is the five-point difference real?

YouGov has interviewed 1043 members of its online access panel. If we assume (and this is a heroic assumption) that these respondents can be treated like a simple random sample, what are the confidence intervals?

Binomial Confidence Intervals

First, we could treat the two categories as if they were distributed as binomial and ask Stata for exact confidence intervals.

The confidence intervals overlap, so we’re lead to think that the proportions in the population are not necessarily different. But the two categories are not independent, because the “not right-wingers” answers include the “Islamists” answers and vice versa, so the multinomial is a better choice.

Multinomial Model

It is easy to re-create the univariate distribution of answers in Stata:

set obs 5
gen threat = _n
lab def threat 1 "right-wingers" 2 "islamists" 3 "left-wingers" 4 "other" 5 "no threat"
lab val threat threat
gen number = round(1043* 0.41) in 1
replace number = round(1043* 0.366) in 2
replace number = round(1043* 0.056) in 3
replace number = round(1043* 0.038) in 4
replace number = round(1043* 0.13) in 5
expand number

Next, run an empty multinomial logit model

mlogit threat,base(5)

The parameters of the model reproduce the observed distribution exactly and are therefore not very interesting, but the estimates of their standard errors are available for testing hypotheses:

test [right_wingers]_cons = [islamists]_cons

At the conventional level of 0.05, we cannot reject the null hypothesis that both proportions are equal in the population, i.e. we cannot tell if Germans are really more worried about one of the two groups.

Simulation

Just for the fun of it, we can carry out one additional test and ask a rather specific question: If both proportions are 0.388 in the population and the other three are identical to their values in the sample, what is the probability of observing a difference of at least 4.4 points in favour of right-wingers?

The idea is to sample repeatedly from a multinomial with known probabilities. This could be done more elegantly by defining a program and using Stata’s simulate command, but if your machine has enough memory, it is just as easy and possibly faster to use two loops to generate/analyse the required number of variables (one per simulation) and to fill them all in one go with three lines of mata code. Depending on the number of trials, you may have to adjust maxvars

local trials = 10000
foreach v of newlist s1-s`trials' {
qui gen `v' = .
}
mata:
probs =(.388,.388,.056,.038,.13)
st_view(X.,.,"s1-s`trials'",)
X[.,.] = rdiscrete(1043,`trials',probs)
end
local excess = 0
forvalues sample = 1/`trials' {
qui tab s`sample' if s`sample' == 1
local rw = r(N)
qui tab s`sample' if s`sample' == 2
local isl = r(N)
if (`rw' / 1043 * 100) - (`isl' / 1043 * 100) >=4.4 local excess = `excess' +1
}
display "Difference >=4.4 in `excess' of `trials' samples"

Seems the chance of a 4.4 point difference is between 5 and 6 per cent. This probability is somewhat smaller than the one from the multinomial model because the null hypothesis is more specific, but still not statistically significant. And the Zeit does not even have a proper random sample, so there is no scientific evidence for the claim that Germans are more afraid of right-wing extremists than of Islamists, what ever that would have been worth. Bummer.

Sometimes, a man’s gotta do what a man’s gotta do. Which, in my case, might be a little simulation of a random process involving an unordered categorical variable. In R, sampling from a multinomial distribution is trivial.

rmultinom(1,1000,c(.1,.7,.2,.1))

gives me a vector of random numbers from a multinomial distribution with outcomes 1, 2, 3, and 4, where the probability of observing a ‘1’ is 10 percent, the probability of observing a ‘2’ is 70 per cent, and so on. But I could not find an equivalent function in Stata. Generating artificial data in R is not very elegant, so I kept digging and found a solution in section M-5 of the Mata handbook. Hidden in the entry on runiform is a reference to rdiscrete(r,c,p), a Mata function which generates a r*c matrix of draws from a multinomial distribution defined by a vector p of probabilities.

That leaves but one question: Is wrapping a handful of lines around a Mata call to replace a non-existent Stata function more elegant than calling an external program?

Seems that I am not the only one who is startled by Stata 11’s margins command, which does all sorts of amazing things. At a mere 50 pages (not counting the remarks on margins postestimation), the documentation is a little overwhelming, and there are just too many options. There are two separate issue that seem to confuse a lot of people (see this discussion on statalist on the then new margins command).

Marginal Effects at the Mean vs Average Marginal Effects

The first is that in the past when studying the implications from nonlinear (i.e. logit) models, many people including me used to analyse “marginal effects at the margin”. In short, this boils down to holding most independent vars constant at their grand means/modes while plugging a range of hopefully relevant values for one or two focal variables into the equation. This approach, which is known as analysing marginal effects at the mean, is easier to understand than to explain but can result in highly unrealistic scenarios if your independent variables are highly correlated (think of holding age constant while varying pensioner/non-pensioner status).

But one problem remains: margins uses a normal approximation for calculating confidence intervals. As a result, after estimating a model for categorical dependent variables, you might end up with a CI for your margins that includes zero, which obviously does not make much sense. Roger Newson seems to know how to get around this issue, but I haven’t tested this approach yet.

I’m teaching a lecture course on Political Sociology at the moment, and because everyone is so excited about social capital and social network analysis these days, I decided to run a little online experiment with and on my students. The audience is large (at the beginning of this term, about 220 students had registered for this lecture series) and quite diverse, with some students still in their first year, others in their second, third or fourth and even a bunch of veterans who have spent most of their adult lives in university education.

Who knows whom in a large group of learners?

Fortunately, I had a list of full names plus email addresses for everyone who had signalled interest in the lecture before the beginning of term, so I created a short questionnaire in limesurvey and asked them a very simple question: whom do you know in this group? Given the significant overcoverage of my list – in reality, there are probably not more than 120 students who regularly turn up for the lecture – the response rate was somewhere in the high 70s. If you want to collect network data with limesurvey, the “array with flexible labels” question type is your friend, but keying in 220 names plus unique ids would have been a major pain. Thankfully, one can program the question with a single placeholder name, then export it as a CSV file. Next, simply load the file into Emacs and insert the complete list, then re-import it in limesurvey.

Getting a data matrix from Stata into Pajek is not necessarily a fun exercise, so I decided to give the networkx module for Python a go, which is simply superb. Networkx has data types for representing social networks, so you can read in a rectangular data matrix (again as CSV), construct the network in Python and export the whole lot to Pajek with a few lines of code:

#Some boring stuff omitted #create network Lecture=nx.DiGraph() #Initialise for i in range(1,221): Lecture.add_node(i, stdg="0") for line in netreader: sender = int(line[-1]) #Sender-ID at the very end edges=line[6:216]
#Degree-scheme Lecture.node[sender]['stdg']=line[-8]
#Edges for index in range(len(edges)): if edges[index] == '2': Lecture.add_edge(sender,int(filter(str.isdigit,repr(knoten[index]))),weight=2) elif edges[index] == '3': Lecture.add_edge(sender,int(filter(str.isdigit,repr(knoten[index]))),weight=3) nx.write_pajek(Lecture,'file.net')

As it turns out, a lecture hall rebellion seems not very likely. About one third of all relationships are not reciprocated, and about a quarter of my students do not know a single other person in the room (at least not by name), so levels of social capital are pretty low. There is, however, a small group of 10 mostly older students who are form a tightly-knit core, and who know many of the suckers in the periphery. I need to keep an eye on these guys.

260 reciprocated ties within the same group

Finally, the second graph also shows that those relatively few students who are enrolled in our new BA programs (red, dark blue) are pretty much isolated within the larger group, which is still dominated by students enrolled in the old five year programs (MA yellow, State Examination green) that are phased out. Divide et impera.

I’m teaching an introductory SNA class this year. Following a time-honoured tradition, I conducted a small network survey at the beginning of the class using Limesurvey. Getting the data from Limesurvey to Stata via CSV was easy enough. Here is the data set. But how does one get the data from Stata to Pajek for analysis? Actually, it’s quite easy.

First, we need to change the layout of the data. In the data set, there is one record for each of the 13 respondent. Each record has 13 variables, one for each (potential) arc connecting the respondent to other students in the class. This is equivalent to Stata’s “wide” form. Stata’s reshape command will happily re-arrange the data to the “long” form, with one record for each arc. This is what Pajek requires.

Second, we need to save the data as an ASCII file that can be read into Pajek. This is most easily done using Roger Newson’s listtex, which can be tweaked to write the main chunks of a Pajek file. Here is the code, which should be readily adapted to your own problems.

If you are interested, you can get the whole package from within Stata: net from https://www.kai-arzheimer.com/stata/

Our project on social (citation and collaboration) networks in British and German political science involves networks with hundreds and thousands of nodes (scientists and articles). At the moment, our data come from the Social Science Citation Index (part of the ISI web of knowledge), and we use a bundle of rather eclectic (erratic?) scripts written in Perl to convert the ISI records into something that programs like Pajek or Stata can read. Some canned solutions (Wos2pajek, network workbench, bibexcel) are available for free, but I was not aware of them when I started this project, did not manage to install them properly, or was not happy with the results. Perl is the Swiss Army Chainsaw (TM) for data pre-processing, incredibly powerful (my scripts are typically less than 50 lines, and I am not an efficient programmer), and every time I want to do something in a slightly different way (i.e. I spot a bug), all I have to do is to change a few lines in the scripts. After trying a lot of other programs available on the internet, we have chosen Pajek for doing the analyses and producing those intriguing graphs of cliques and inner circles in Political Science. Pajek is closed source but free for non-commercial use and runs on Windows or (via wine) Linux. It is very fast, can (unlike many other programs) easily handle very large networks, produces decent graphs and does many standard analyses. Its user interface may be slightly less than straightforward but I got used to it rather quickly, and it even has basic scripting capacities.

The only thing that is missing is a proper manual, but even this is not really a problem since Pajek’s creators have written a very accessible introduction to social network analysis that doubles up as documentation for the program (order from amazon.co.uk, amazon.com, amazon.de. However, Pajek has been under constant development since the 1990s (!) and has acquired a lot of new features since the book was published. Some of them are documented in an appendix, others are simply listed in the very short document that is the official manual for Pajek. You will want to go through the many presentations which are available via the Pajek wiki.

Of course, there is much more software available, often at no cost. If you do program Java or Python (I don’t), there are several libraries available that look very promising. Amongst the stand-alone programs, visone stands out because it can easily produce very attractive-looking graphs of small networks. Even more software has been developed in the context of other sciences that have an interest in networks (chemistry, biology, engineering etc.) Here is a rather messy collection of links to sna software. Generally, you will want something that is more systematic and informative. Ines Mergel has recently launched a bid for creating a comprehensive software list on wikipedia. The resulting page on social network analysis software is obviously work in progress but provides very valuable guidance.

A couple of weeks ago, I posted an article on how make and Makefiles can help you to organise your Stata projects. If you are working in a unix environnment, you’ll already have make installed. If you work under Windows, install GNU make – it’s free, and it can make your Stata day. Rather unsurprisingly, make is also extremely useful if you have large or medium-sized latex project (or if you want to include tables and/or graphs produced by Stata) in a latex document. For instance, this comes handy if you have eps-Figures and use pdflatex. pdflatex produces pdf files instead of dvi files. If you produces slides with, this can save you a lot of time because you don’t have to go through the latex – dvips – ps2pdf cycle. However, pdflatex cannot read eps files: you have to convert your eps files with pstoedit to the meta post format, then use meta post to convert them to mps (which can be read by pdflatex). With this Makefile snippet, everything happens automagically:

#New implicit rules for conversion of eps->mp->mps #Change path if you have installed pstoedit in some other place %.mp : %.eps c:pstoedit/pstoedit.exe -f mpost $*.eps $*.mp

#Optional: if you want to create dataset x.eps, run x.do #Stata must be in your path %.eps : %.do tab wstata -e do $<

Now type make presentation.pdf, and make will call Stata, pstoedit, metapost and pdflatex as required. If you need more figures, just write the do-file and add a dependency.