## Any rigorous discussion of **survey bias** requires a scalar measure of survey accuracy

For metric variables, this is trivial (e.g. deviation from the known mean in the population). But for categorical variables such as vote choices, there is no obvious solution.

Martin, Traugott, and Kennedy (2005) propose a measure for quantifying survey bias in the dichotomous (two-party) case. Arzheimer and Evans (2014) show how measures for the more common multi-party case can be derived by highlighting the relationship between the original Martin, Traugott, and Kennedy measure and the familiar logit model, and then moving from the binomial to the multinomial logit. The $A'_i$ are category-specific measures of survey bias, whereas $B$ neatly summarises survey bias (with respect to one categorical variable) in a single figure. $B_w$ is a weighted variant of $B$ that takes the (true) size of the respective groups into account.
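As a rough guide, the measures can be written down as follows. The notation is introduced here purely for illustration: $p_i$ is the share of category $i$ in the sample, $v_i$ its true share in the population, and $k$ the number of categories. This is a sketch of our reading of the two articles cited above, which remain the authoritative source for the exact definitions.

$$
\begin{aligned}
A &= \ln\left(\frac{p_1 / p_2}{v_1 / v_2}\right) && \text{(two-party case)}\\
A'_i &= \ln\left(\frac{p_i \big/ \sum_{j \neq i} p_j}{v_i \big/ \sum_{j \neq i} v_j}\right) && \text{(category-specific, multi-party case)}\\
B &= \frac{1}{k}\sum_{i=1}^{k} \left|A'_i\right| && \text{(unweighted summary)}\\
B_w &= \sum_{i=1}^{k} v_i \left|A'_i\right| && \text{(weighted by true group size)}
\end{aligned}
$$

A value of zero indicates that the sample odds for a category match the population odds; positive values indicate over-representation and negative values under-representation of that category.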

The math behind these measures is not very complex, but it is a bit convoluted. You could do the calculations by hand (shudder) or with a bit of programming in some statistics/math package, but that would be tiresome. So we have created an add-on (or rather an ado) for Stata that does the grunt work for you.

The latest version (1.4) of the software has many new features and works much faster than previous releases.

## The surveybias Software for Assessing Bias in Surveys

### Installation

To install the latest version of our software from within Stata, simply type:

`ssc install surveybias`

Equivalently, you may install from this website:

`net from https://www.kai-arzheimer.com/stata/`

`net install surveybias`

### How to Use the surveybias Software

surveybias and the underlying methodology are applicable to any survey which samples a multinomial variable whose distribution in the population is known. The package consists of three separate ados. The main command is **surveybias**. It computes the $A'_i$, $B$, and $B_w$ measures as well as standard errors and statistical tests from a variable held in memory and additional information about the true distribution of the respective variable in the population.

**surveybias** is complemented by **surveybiasi**, an immediate command that makes these calculations based on information typed as arguments on the command line. Using **surveybiasi**, it is possible to produce estimates of polling accuracy from published margins when the raw data are not available.

**surveybiasseries** takes this idea one step further. In the aftermath of an election, researchers will often want to compare polling accuracy across time and firms, but commercial pollsters tend to make their raw data available for secondary analysis only after some cooling-off period, if at all. **surveybiasseries** calculates accuracy measures from a dataset of published margins, where each row represents the headline findings from a single survey. **surveybiasseries** stores the accuracy measures as new variables in the dataset so that it is very easy to model polling accuracy as a function of variables such as duration and timing of field work, sample size, or polling company.

#### surveybias

`[by varlist:] surveybias varname [if] [in] [weight], popvalues(#..#) verbose numerical vce() cluster() svy subpop(var) level(#)`

**surveybias** compares the distribution of a categorical variable *varname* in the dataset to its true distribution in the population. This true distribution is submitted to the command as a numlist in *popvalues*. Subsamples can be selected via the *if* and *in* qualifiers so that the accuracy of group-specific predictions may be assessed. Standard errors will be based on the size of the reduced sample. When using the survey estimator, subpopulations should be specified with the *subpop()* option instead.
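A minimal sketch of a call might look like the following. The toy data, the variable names `vote` and `female`, and all population figures are invented for illustration. We also assume here that the values in *popvalues* are listed in the order of the categories of *varname* and may be given as percentages; please check the help file for your installed version.

```stata
* Toy data: 1,000 simulated respondents (purely hypothetical)
clear
set seed 12345
set obs 1000
generate double u = runiform()

* Three-category outcome with expected shares of roughly 30/40/30 per cent
generate byte vote = 1 + (u > 0.3) + (u > 0.7)

* Hypothetical grouping variable for the subsample example
generate byte female = runiform() > 0.5

* Compare the sample distribution of vote to an assumed population distribution
surveybias vote, popvalues(30 40 30)

* Same comparison for a subsample selected via the if qualifier
surveybias vote if female == 1, popvalues(30 40 30)
```

Because the simulated shares are close to the assumed population values, the estimated bias should be small in this toy setting.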

Click here for a worked **example** that shows how **surveybias** can be used to assess bias in a single French election poll.

For a separate worked **example** that demonstrates the use of weights and information on the sampling design with surveybias in assessing bias in educational attainment, click here.

#### surveybiasi

`surveybiasi, popvalues(#..#) samplevalues(#..#) n(#) numerical level(#)`

**surveybiasi** is an immediate command that compares the distribution of a categorical variable in a survey to its true distribution in the population. Both distributions need to be specified via their respective options.
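As a brief sketch, a call based on published margins could look like this. All figures are hypothetical and serve only to illustrate the syntax shown above; we assume that both distributions are listed in the same category order.

```stata
* Hypothetical three-party example: no dataset in memory is required.
* popvalues():    assumed true result (per cent)
* samplevalues(): assumed published poll margins (per cent)
* n():            assumed sample size of the poll
surveybiasi, popvalues(30 40 30) samplevalues(27 44 29) n(1000)
```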

Click here for an **example** that shows how **surveybiasi** can be used to assess bias using published margins from a US pre-election poll.

#### surveybiasseries

```
surveybiasseries [if] [in], popvariables(varlist) samplevariables(varlist) nvar(varname) generate(newvarstub) missasnull popvalues(#..#) numerical descriptivenames
```

**surveybiasseries** estimates accuracy measures from a dataset of survey margins. Each observation represents a single poll. For each survey, the distribution of some categorical variable is given by a series of *samplevariables*. The distribution can be expressed in terms of absolute frequencies, relative frequencies, or percentages. Information on the true distribution can be specified either directly on the command line via the *popvalues* option or as another series of variables, specified through the *popvariables* option. Either *popvalues* or *popvariables* must be given, but not both. Moreover, another variable whose name is passed to the program in the *nvar* option must hold the respective sample sizes. Using the *if* and *in* qualifiers, it is possible to restrict the analysis to a subgroup of surveys.

The command leaves behind a series of new variables, whose names are based on the stub submitted via the *generate* option. *generate* is required. In these variables, **surveybiasseries** stores the complete information that would be generated by the equivalent series of **surveybias** commands.
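The following sketch illustrates the workflow on a tiny, made-up dataset of three polls. The firm labels, the variable names (`p1` to `p3`, `n`), the stub `sb`, and all figures are assumptions for illustration. We also assume that *popvalues* accepts a list of numbers in the same order as *samplevariables*.

```stata
* Hypothetical dataset: one row per poll, margins in per cent
clear
input str8 firm p1 p2 p3 n
"A" 27 44 29 1000
"B" 31 39 30  800
"C" 33 36 31 1200
end

* Assumed true result of 30/40/30 per cent, given on the command line
surveybiasseries, samplevariables(p1 p2 p3) popvalues(30 40 30) nvar(n) generate(sb)

* Inspect the variables that the command has added to the dataset
describe
```

Once the new variables are in place, they can be used like any other variable, for example as the outcome in a regression on sample size or polling company.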

Click here for a more complex **example** that shows how **surveybiasseries** can be used to quickly analyse bias in a large number of polls. Includes estimation of house effects and house-specific bias against a single party.

## Summary

If you are worried about survey bias, or interested in analysing it with Stata, try our software now. It is free and may be quickly installed or updated by entering

`ssc install surveybias`


I have a question which is not entirely relevant to this post – I would like to know the statistical model that might validate or refute the accuracy of the Electoral College, versus the popular vote. Actually, there is no need to consider the popular vote: no scientist or mathematician would ever consider accurate the counting of the voters who happened to show up at the polls. The only way anyone could seriously even call it “popular vote” would be if ALL the eligible voters voted, or at least some mathematically defensible percentage – not 55%! However, I am relatively certain that the Electoral College, or something like it, COULD be considered accurate, if the correct predictive statistics were applied. Have you worked on this?

I’m not sure if I understand your question correctly. One could probably model the votes in the EC as draws from a binomial or even multinomial distribution of preferences amongst electors. But in what sense could we assess the accuracy of this vote? The members of the EC are not a random sample; they are the universe of EC members, at least for any given presidential election year. If we are talking about the true preferences of the electorate, things look even worse than with the popular vote. In this case, the result in each state would constitute a very large but systematically biased (voters vs non-voters) sample, with clustering at the state level. Things are then compounded by some weighting (the number of EC votes is not exactly proportional to the size of the electorates), and by retaining only the modal category within each state (winner takes it all). In short, it is not obvious how the EC, which is loosely based on the popular vote, would make the outcome more similar to the unknown will of the electorate than the popular vote itself.