Jan 14 2012
 
I’m currently working on an analysis of the latest state election in Rhineland-Palatinate using aggregate data alone, i.e. electoral returns and structural information, which are available at the level of the state’s roughly 2300 municipalities. The state’s Green party (historically very weak) has roughly tripled its share of the vote since the last election in 2006, and I want to know where all these additional votes come from. And yes, I’m treading very carefully around the very large potential ecological fallacy that lurks at the centre of my analysis, regressing Green gains on factors such as tax receipts and distance from the nearest university town, but never claiming that the rich or the students or both turned to the Greens.

One common problem with this type of analysis is that not all municipalities are created equal. There is a surprisingly large number of flyspeck villages with only a few dozen voters, whereas the state’s capital boasts more than 140,000 registered voters. Most places are somewhere in between. Having many small municipalities in the regression feels wrong for at least two reasons. First, small-scale changes of political preferences in tiny electorates will result in relatively large percentage changes. Second, the behaviour of a relatively large number of voters who happen to live in a small number of relatively large municipalities will be grossly underrepresented, i.e. the countryside will drive the results.

My PhD supervisor, who did a lot of this stuff in his time, used to weight municipalities by the size of their electorates to deal with these problems. But this would lead to pretty extreme weights in my case. Moreover, while voters bring about electoral results, I really don’t want to introduce claims about individual behaviour through the back door.

My next idea was to weight municipalities by the square root of the size of their electorates. Why? In a sense, the observed behaviour is like a sample from the underlying distribution of preferences, and the precision of this estimate grows with the square root of the number of people in a given community. But even taking the square root left me with weights that were quite extreme, and the concern regarding the level of analysis still applied.
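To get a feel for how extreme the two weighting schemes are, here is a small Python sketch. The electorate sizes are made up for illustration (only the 140,000 for the capital and the "few dozen" for the smallest villages echo the figures above), and the function name is hypothetical. Weights are scaled so that the smallest unit gets a weight of 1.

```python
import math

# Hypothetical electorate sizes: a flyspeck village, a mid-sized town, the capital.
electorates = {"village": 50, "town": 5_000, "capital": 140_000}


def relative_weights(sizes, transform):
    """Apply `transform` to each size and rescale so the smallest unit has weight 1."""
    raw = {name: transform(n) for name, n in sizes.items()}
    smallest = min(raw.values())
    return {name: value / smallest for name, value in raw.items()}


by_size = relative_weights(electorates, float)      # weight = electorate size
by_sqrt = relative_weights(electorates, math.sqrt)  # weight = sqrt(electorate size)

for name in electorates:
    print(f"{name:8s} size weight {by_size[name]:8.1f}   sqrt weight {by_sqrt[name]:6.1f}")
```

With these numbers, weighting by size gives the capital 2,800 times the weight of the village; the square root cuts that to about 53, which is still a lot.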

Then I realised that instead of weighting by size, I could simply include the size of the electorate as an additional independent variable to correct for potential bias. But this still left me exposed to the danger of extreme outliers (think small, poor, rural communities where the number of Green voters goes up from one to four, a whopping 300 per cent increase) playing havoc with my analysis. So I began reading up on robust regression and its various implementations in Stata.

The basic idea of robust regression is that real data are more likely than not a mixture of (at least) two mechanisms: the “true model” whose coefficients we want to estimate on the one hand, and some other process(es) that contaminate the data on the other. If these contaminating data points are far away from the multivariate mean of the x variables (outliers) and deviate substantially from the true regression line, they will bias the estimates.

Robust regression estimators are able to deal with a high degree of contamination, i.e. they can recover the true parameters even if there are many outliers amongst the data points. The downside is that the older generation of robust estimators also have a low efficiency (the estimates are unbiased but have a much higher variance than regular OLS-estimates).

A number of newer (post-1980) estimators, however, are less affected by this problem. One particularly promising approach is the MM estimator, which has been implemented in Stata ados by Verardi/Croux (MMregress) and by Ben Jann (robreg mm). Jann’s ado seems to be faster and plays nicely with his esttab/estout package, so I went with that.

The MM estimator works basically by identifying outliers and down-weighting them, so it amounts to a particularly sophisticated case of weighted least squares. Using the defaults, MM claims to have 85 per cent of the efficiency of OLS while being able to deal with up to 50 per cent contamination. As you can see in the table, the MM estimates deviate somewhat from their OLS counterparts. The difference is most pronounced for the effect of tax receipts (hekst).

robreg mm has an option to store the optimal weights. I ran OLS again using these weights (column 3), thereby recovering the MM estimates and demonstrating that MM is really just weighted least squares. The standard errors, which are not very relevant here, differ because robreg uses the robust variance estimator. This is fascinating stuff, and I’m looking forward to a forthcoming book by Jann and Verardi on robust regression in Stata (to be published by Stata Press in 2012).

                     OLS              MM            WLS

greenpct2006        0.193***        0.329***        0.329***
                 (0.0349)        (0.0592)        (0.0278)

hekst               0.311***        0.634***        0.634***
                 (0.0894)         (0.124)        (0.0688)

senioren          -0.0744***       -0.100***       -0.100***
                 (0.0131)        (0.0149)       (0.00994)

kregvoters11      -0.0125        -0.00844        -0.00844
                 (0.0146)       (0.00669)       (0.00982)

kbevdichte         -0.433        -0.00750        -0.00750
                  (0.464)         (0.330)         (0.326)

uni                 1.258           0.816           0.816
                  (1.695)         (0.765)         (1.137)

lnunidist          -0.418**        -0.372**        -0.372***
                  (0.127)         (0.113)        (0.0918)

_cons               8.232***        7.078***        7.078***
                  (0.627)         (0.663)         (0.461)
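The logic behind the table (find outliers, down-weight them, then note that plain weighted least squares with the stored weights gives the same coefficients) can be sketched in a few lines of Python. This is not robreg and not a full MM estimator: a median-of-pairwise-slopes (Theil-Sen) fit stands in for the S-estimate that MM starts from, the data are invented, and all function names are hypothetical. The tuning constant 3.44 is the value usually quoted for the bisquare function at 85 per cent efficiency; treat it as an assumption here.

```python
import statistics
from itertools import combinations


def wls(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def robust_start(x, y):
    """Median of pairwise slopes (Theil-Sen); a stand-in for MM's S-estimate."""
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[i] != x[j]]
    b = statistics.median(slopes)
    a = statistics.median(yi - b * xi for xi, yi in zip(x, y))
    return a, b


def bisquare(u):
    """Tukey bisquare weight: smooth down-weighting, exactly zero beyond |u| >= 1."""
    return (1 - u * u) ** 2 if abs(u) < 1 else 0.0


def mm_like(x, y, c=3.44, steps=50):
    """Robust start, fixed residual scale, then iteratively reweighted least squares."""
    a, b = robust_start(x, y)
    # Scale from the initial fit (normalised median absolute residual), then held fixed.
    scale = statistics.median(abs(yi - a - b * xi) for xi, yi in zip(x, y)) / 0.6745
    for _ in range(steps):
        w = [bisquare((yi - a - b * xi) / (c * scale)) for xi, yi in zip(x, y)]
        a, b = wls(x, y, w)
    return a, b, w  # w holds the weights used in the final fit


# Invented data: true line y = 2 + 0.5x with small alternating noise ...
x = [float(i) for i in range(20)]
y = [2 + 0.5 * xi + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]
# ... plus three contaminated points at the upper end of x.
for i in (17, 18, 19):
    y[i] -= 10

a_ols, b_ols = wls(x, y, [1.0] * len(x))  # ordinary least squares
a_mm, b_mm, w = mm_like(x, y)             # robust fit and its final weights
a_wls, b_wls = wls(x, y, w)               # "OLS again using these weights"

print(f"OLS slope {b_ols:.3f}  robust slope {b_mm:.3f}  WLS slope {b_wls:.3f}")
print("weights of the contaminated points:", [w[i] for i in (17, 18, 19)])
```

In this toy example the three contaminated points pull the OLS slope well away from 0.5, while the reweighted fit gives them a weight of zero and stays close to the true line. Re-running weighted least squares with the stored weights reproduces the robust coefficients, which is the same check as in column 3 of the table.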
Mar 02 2011
 
In the olden days, the world was simple. The average extreme right party was strictly socially conservative, to say the least. Abortion and homosexuality were considered sinful, mostly so because both practices deprived the fatherland of future soldiers and potential mothers of even more soldiers. So sex was supposed to be intramarital and had one purpose only: to procreate for the fatherland. Then came Pim Fortuyn and somewhat confused the message, but this was of little concern to members of the German NPD, who sometimes seem to live blissfully in a parallel universe where the 1930s never came to an end.

[Image: NPD campaign poster, 2011: “more miniskirts, fewer minarets”]

Or so I thought until this morning. It’s election time in Rhineland-Palatinate, which means great fun, because campaigns at the state level often have their own disarming and rather amateurish charm. On my way to work, I drove past at least a dozen very conventional NPD posters showcasing the party’s “Müttergehalt” (salary for mothers) policy that is supposed to stop the “Volkstod” (genocide – they really hate foreign words). But then I nearly crashed my car laughing out loud when I spotted this little gem, campaigning, as you would have guessed, for “miniskirts instead of minarets”. Ah, the demand for more miniskirts: always foremost in the mind of every self-respecting, socially conservative nationalist movement. About time that someone dared to speak out.

 

The untrained, illiterate observer might of course mistakenly believe that the NPD is finally defending the unalienable right of the Aryan hooker to strut her stuff while eying a collection of strangely shaped dildos. As always, it is all in the eye of the beholder.