Why Would I Want to Pool the Polls?
Pre-election polls are noisy for a number of reasons. First, there is sampling error. For n=1000, the 95 per cent confidence interval for a party whose true support is 40 per cent ranges from 37 to 43 per cent, which is more than most people would think. And this is assuming simple random sampling. For multi-stage sampling, you could end up with one to two extra points at each end. Then there are house effects: Pollsters dress up their raw figures in different ways, use different sampling frames and slightly different modes and questions. And finally, political events and media coverage on the day of the poll will have effects, especially early on when many voters are undecided.
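To make the sampling-error arithmetic concrete, here is a minimal Python sketch of the usual normal-approximation interval for a proportion. The function name and the `design_effect` parameter are my own illustrative choices, and the design effect of 2.0 is just an assumed value for multi-stage sampling, not a figure from any pollster.

```python
import math

def margin_of_error(p, n, z=1.96, design_effect=1.0):
    """Half-width of the normal-approximation interval for a share p.

    design_effect inflates the variance relative to simple random
    sampling; 1.0 means simple random sampling (hypothetical parameter).
    """
    return z * math.sqrt(design_effect * p * (1.0 - p) / n)

# Simple random sample, n=1000, true support 40 per cent
moe = margin_of_error(0.40, 1000)
print(f"95% interval: {100 * (0.40 - moe):.1f} to {100 * (0.40 + moe):.1f} per cent")
# prints "95% interval: 37.0 to 43.0 per cent"

# Same poll with an assumed design effect of 2.0 (illustrative only)
moe_clustered = margin_of_error(0.40, 1000, design_effect=2.0)
```

With the assumed design effect of 2.0, the half-width grows from about 3 to a bit more than 4 points, which is in line with the "one to two extra points" mentioned above.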
Combining results from different polls is one obvious strategy to deal with these problems: The combined sample size is bigger, and there is hope that the various sources of bias might offset each other. Hopefully.
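The sample-size argument can be sketched in a few lines of Python. The polls below are made-up numbers for a single party, and the pooling is a plain sample-size-weighted average, which is the naive version of the idea and not the model described further down. It assumes simple random sampling and no house effects.

```python
import math

# Hypothetical polls: (sample size, reported share for one party)
polls = [(1000, 0.41), (1500, 0.39), (2000, 0.40)]

def pool(polls):
    """Sample-size-weighted average share and its standard error."""
    total_n = sum(n for n, _ in polls)
    pooled = sum(n * p for n, p in polls) / total_n
    # Standard error as if all respondents came from one simple random sample
    se = math.sqrt(pooled * (1.0 - pooled) / total_n)
    return pooled, total_n, se

pooled_share, total_n, se = pool(polls)
print(f"pooled share {pooled_share:.3f}, n={total_n}, +/- {1.96 * se:.3f}")
```

The interval shrinks because the combined n is larger. Note that this only helps with sampling error: a bias shared by all pollsters is averaged, not removed.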
Where Do the Data Come From?
The very useful site wahlrecht.de publishes margins from seven large German pollsters. I check this site regularly and generate a data set from it that you can download here. I exclude INSA because they use an online-access panel. Two companies post (relatively) raw data, which I find preferable. What the others do to their figures, we cannot know.
How Does It Work?
The most straightforward idea in poll-pooling is calculating a moving (and possibly weighted) average. A more principled approach is model-based. My model borrows heavily from Simon Jackman’s (2005) paper and from Chris Hanretty’s application of a similar model to Italy, but differs in some respects. First, I treat the polls as draws from a multinomial distribution to account for Germany’s moderate multi-partyism. The parameters of this distribution depend on the relative strength of latent support for each party. Modelling the results as multinomial implies the constraint that the estimated shares must sum to unity, which is useful. Second, like Jackman and Hanretty, I assume that latent support for each party follows a random walk (today’s support is yesterday’s support plus a random quantity), but I allow for a drift: a linear trend in latent support over the course of the campaign. Third, I assign each poll to a week, because there are relatively few polls, and field-times are relatively long. Put differently, I assume that public opinion moves from week to week (but not from day to day).
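To illustrate the three assumptions, here is a small forward simulation in Python. It is not the estimation code (that is in R & Bugs) and only a sketch of the data-generating story: latent support per party moves weekly as a random walk with drift, shares are derived from the relative strength of latent support, and each poll is a multinomial draw. The softmax link, the function names and all parameter values are my own assumptions for illustration.

```python
import math
import random

def softmax(values):
    """Map latent support values to shares that sum to one."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def simulate(n_weeks, start, drift, sd, poll_n, seed=1):
    """Simulate weekly latent support and one poll per week.

    start, drift: one value per party (hypothetical latent scale)
    sd: standard deviation of the weekly random-walk innovation
    poll_n: number of respondents in each simulated poll
    """
    rng = random.Random(seed)
    latent = list(start)
    weeks = []
    for _ in range(n_weeks):
        # Random walk with drift: last week's value + trend + noise
        latent = [a + d + rng.gauss(0.0, sd) for a, d in zip(latent, drift)]
        shares = softmax(latent)
        # Multinomial poll: each respondent picks a party with prob. = share
        picks = rng.choices(range(len(shares)), weights=shares, k=poll_n)
        counts = [picks.count(j) for j in range(len(shares))]
        weeks.append((shares, counts))
    return weeks

# Three hypothetical parties, 30 weeks, polls of 1000 respondents
weeks = simulate(
    n_weeks=30,
    start=[0.5, 0.0, -1.0],
    drift=[0.0, 0.01, -0.01],
    sd=0.03,
    poll_n=1000,
)
final_shares, final_counts = weeks[-1]
```

Because the shares come out of a softmax and the counts out of a multinomial draw, both add up by construction: shares to one, counts to the poll's sample size. Estimation runs this logic in reverse, inferring the latent path and the drift from the observed counts.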
The model estimates latent party support since January 2013 and makes predictions for the outcome of the election. The code (R & Bugs) is here.
Does It Work?
Honestly, I have no idea. This is work in progress, so take the findings with a pinch of salt.
What Could Possibly Go Wrong?
Everything? Anything? The most obviously dubious assumption of the model is that the polls are unbiased on average. Latent linear trends are a close second.