While you were busy watching the mostly shambolic performance that pass for the Euro 2012, I was unboxing my set of complementary “Quantitative Applications in the Social Sciences” volumes (you know, the series of flimsy paperbacks whose sickly green sleeves conjure memories of psychiatric wards and man in lab coats). The latest addition to my collection of these is Paul Allison’s volume on Fixed Effects Regression Models. There seems to be a good deal of confusion about what constitutes a “fixed” (as opposed to a “random”) effect, and Allison does a great job in clarifying the issue. In a panel design, assuming that effects are fixed basically implies that you have to estimate a separate, time-invariant effect for each object (e.g. person, country, party …) from the data. If you run a random effects model, you simply estimate the variance of the distribution of these object-specific effects, which is much more efficient.

Obviously, random effects models are all the rage at the moment, but Allison (who also gave us a very accessible introduction to missing data problems) has much to say in favour of fixed effects. Crucially, he demonstrates how fixed effects can be included in linear, logistic and even survival-time regression models, and how fixed and random effects can be combined to create a hybrid model. This is by far the most clear and concise introduction to this topic, and I like the book a lot. And now, back to the other four books in that case.

I’m currently working on an analysis of the latest state election in Rhineland-Palatinate using aggregate data alone, i.e. electoral returns and structural information, which is available at the level of the state’s roughly 2300 municipalities. The state’s Green party (historically very weak) has roughly tripled their share of the vote since the last election in 2006, and I want to know were all these additional votes come from. And yes, I’m treading very careful around the very large potential ecological fallacy that lurks at the centre of my analysis, regressing Green gains on factors such as tax receipts and distance from next university town, but never claiming that the rich or the students or both turned to the Greens.

One common problem with this type of analysis is that not all municipalities are created equal. There is a surprisingly large number of flyspeck villages with only a few dozen voters on, whereas the state’s capital boasts more than 140,000 registered voters. Most places are somewhere in between. Having many small municipalities in the regression feels wrong for at least two reasons. First, small-scale changes of political preferences in tiny electorates will result in relatively large percentage changes. Second, the behaviour of a relatively large number of voters who happen to live in a small number of relatively large municipalities will be grossly underrepresented, i.e. the countryside will drive the results.

My PhD supervisor, who did a lot of this stuff in his time, used to weigh municipalities by the size of their electorates to deal with these problems. But this would lead to pretty extreme weights in my case. Moreover, while voters bring about electoral results, I really don’t want to introduce claims about individual behaviour through the back door.

My next idea was to weigh municipalities by the square root of the size their electorates. Why? In a sense, the observed behaviour is like a sample from the underlying distribution of preferences, and the reliability of this estimate is proportional to the square root of the number of people in a given community. But even taking the square root left me with weights that were quite extreme, and the concern regarding the level of analysis still applied.

Then I realised that instead of weighing by size, I could simply include the size of the electorate as an additional independent variable to correct for potential bias. But this still left me exposed to the danger of extreme outliers (think small, poor, rural communities where the number of Green voters goes up from one to four, a whopping 300 per cent increase) playing havoc with my analysis. So I began reading up on robust regression and its various implementations in Stata.

The basic idea of robust regression is that real data are more likely than not a mixture of (at least) two mechanisms: the “true model” whose coefficients we want to estimate one the one hand, and some other process(es) that contaminate the data on the other. If these contaminating data points are far away from the multivariate mean of the x-Variables (outliers) and deviate substantially from the true regression line, they will bias the estimates.

Robust regression estimators are able to deal with a high degree of contamination, i.e. they can recover the true parameters even if there are many outliers amongst the data points. The downside is that the older generation of robust estimators also have a low efficiency (the estimates are unbiased but have a much higher variance than regular OLS-estimates).

A number of newer (post-1980) estimators, however, are less affected by this problem. One particular promising approach is the MM estimator, that has been implemented in Stata ados by Veradi/Croux (MMregress) and by Ben Jann (robreg mm). Jann’s ado seems to be faster and plays nicely with his esttab/estout package, so I went with that.

The MM estimator works basically by identifying outliers and weighing them down, so it amounts to a particularly sophisticated case of weighted least squares. Using the defaults, MM claims to have 85 per cent of the efficiency of OLS while being able to deal with up to 50 per cent contamination. As you can see in the table, the MM estimates deviate somewhat from their OLS counterparts. The difference is most pronounced for the effect of tax receipts (hekst).

robreg mm has an option to store the optimal weights. I ran OLS again using these weights (column 3), thereby recovering the MM estimates and demonstrating that MM is really just weighted least squares (standard errors (which are not very relevant here) differ, because robreg uses the robust variance estimator). This is fascinating stuff, and I’m looking forward to a forthcoming book by Jann and Veradi on robust regression in Stata (to be published by Stata Press in 2012).

Should one weight their survey data? Is it worth the effort? The short answer must be ‘maybe’ or ‘it depends’. A slightly longer and much more useful answer was given by Leslie Kish in his enormously helpful paper ‘Weighting: Why, when and how’. Today (well, actually I submitted the final manuscript 2.5 years ago – that’s scientific progress for you!), I have added my own two cent with a short chapter that looks at the effects and non-effects of common weighting procedures (in German). The bottom line is that if you employ the usual weighting variables (age, gender, education and maybe class or region) as controls in your regression, weighting will make next to no difference but might mess with your standard errors.