Jul 10 2013
 
[Image: Used Punchcard. Credit: BinaryApe / Foter / CC BY]

In a recent paper, we derive various multinomial measures of bias in public opinion surveys (e.g. pre-election polls). Put differently, with our methodology, you may calculate a scalar measure of survey bias in multi-party elections.

Thanks to Kit Baum over at Boston College, our Stata add-on surveybias.ado is now available from Statistical Software Components (SSC).  The add-on takes as its argument the name of a categorical variable and said variable’s true distribution in the population. For what it’s worth, the program tries to be smart: surveybias vote, popvalues(900000 1200000 1800000), surveybias vote, popvalues(0.2307692 0.3076923 0.4615385), and surveybias vote, popvalues(23.07692 30.76923 46.15385) should all give the same result.
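To see why the three calls should agree: counts, proportions, and percentages all describe the same shares once they are rescaled to sum to one. The Mata lines below only illustrate that rescaling. How surveybias processes popvalues() internally is not shown here, so treat this as a sketch of the idea, not of the actual implementation.

```stata
mata:
// three ways of writing the same population distribution
counts = (900000, 1200000, 1800000)
shares = (0.2307692, 0.3076923, 0.4615385)
pcts   = (23.07692, 30.76923, 46.15385)

// dividing each vector by its own total puts all three on the same scale
counts :/ sum(counts)
shares :/ sum(shares)
pcts   :/ sum(pcts)
end
```

All three rows should agree up to the rounding in the inputs.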

If you don’t have access to the raw data but want to assess survey bias evident in published figures, there is surveybiasi, an “immediate” command that lets you do stuff like this:  surveybiasi , popvalues(30 40 30) samplevalues(40 40 20) n(1000). Again, you may specify absolute values, relative frequencies, or percentages.

If you want to go ahead and measure survey bias, install surveybias.ado and surveybiasi.ado on your computer by typing ssc install surveybias in your net-aware copy of Stata. And if you use and like our software, please cite our forthcoming Political Analysis paper on the New Multinomial Accuracy Measure for Polling Bias.

Update April 2014: New version 1.1 available

Oct 07 2012
 
Matrix Graph in Stata

Do you like this graph? I don’t think it is particularly attractive, and that is after spending hours and hours creating it. What I really wanted was a matrix-like representation of 18 simulations I ran. More specifically, I simulated the sampling distribution of a statistic under six different conditions for three different sample sizes.

Doing the simulations was a breeze, courtesy of Stata’s simulate command, which created 18 corresponding data sets. Graphing them with kdensity also posed no problem, but combining these graphs did, because I could find no canned command that produces what I wanted: a table-like arrangement, with labels for the columns (i.e. sample sizes) and rows (experimental conditions). What I could do was set up and label a variable with 18 categories (one for each data set) and use the by() option to create a trellis plot. But that would waste a lot of ink and space by repeating redundant information.

At the end of the day, I created nine graphs that were completely empty save for the text that I wanted as row/column labels. I combined these into two separate figures, which I then combined (using a distorted aspect ratio) with my 18 separate plots. That boils down to a lot of dumb code. E.g., this creates the labels for the six conditions. Note the fxsize option that makes the combined graph narrow, and the necessity to create an empty scatter plot.

* placeholder variables for an empty scatter plot
capture drop x
capture drop y
capture set obs 5
gen x = .
gen y = .

local allgraphs ""

* one empty graph per condition, carrying only the row label
forvalues c = 1/6 {
    graph twoway scatter x y, xtitle("") ytitle("") xscale(off) yscale(off) subtitle("(`c')", position(0) nobox) graphregion(margin(zero)) plotregion(style(none))
    local allgraphs "`allgraphs' condition`c'"
    graph rename condition`c', replace
}

* stack the six labels in one narrow column
graph combine `allgraphs', cols(1) colfirst imargin(0 0 0 0) fxsize(10) b1title(" ")

The column labels were created by similar code. Finally, I combined my 18 graphs (their names are in the local macro graphs) and then combined the result with the label graphs.

graph combine `graphs' ,colfirst cols(3) ycommon xcommon imargin(3 3 3 3) b1title("\$ B_w \$") l1title("Density")
graph rename simulations, replace
graph combine sizelabels.gph condlabels.gph simulations, imargin(0 0 0 0) cols(2) holes(1)
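For comparison, the by() route mentioned earlier is much shorter, at the price of a header repeated over every panel. This is a minimal sketch, assuming the 18 result sets have been stacked into one data set. The variable names stat, condition, and size, as well as the fake data, are made up for illustration.

```stata
* fake stacked results: 6 conditions x 3 sample sizes, 100 draws each
clear
set seed 12345
set obs 1800
gen condition = 1 + mod(_n - 1, 6)
gen size      = 1 + mod(ceil(_n / 6) - 1, 3)
gen stat      = rnormal(0, 1 / size)

* one category per data set, then a trellis of kernel densities
* (rows are conditions, columns are sample sizes)
egen cell = group(condition size), label
twoway kdensity stat, by(cell, cols(3) note("")) ytitle("Density")
```

This gives the 6 x 3 layout, but each panel carries its own label instead of sharing row and column labels.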

Can anyone of you think of a more elegant way to achieve this result?


Dec 02 2011
 

Who is afraid of whom?

The liberal German weekly Zeit has commissioned a YouGov poll which demonstrates that Germans are more afraid of right-wing terrorists than of Islamist terrorists. The question read “What is, in your opinion, the biggest terrorist threat in Germany?” On offer were right-wingers (41 per cent), Islamists (36.6 per cent), left-wingers (5.6 per cent), other groups (3.8 per cent), or (my favourite) “no threat” (13 per cent). This is a pretty daft question anyway. Given the news coverage of the Neo-Nazi gang that has killed at least ten people more or less under the eyes of the authorities, and given that the authorities have so far managed to stop would-be terrorists in their tracks, the result is hardly surprising.

Nonetheless, the difference of just under five percentage points made the headlines, because there is a subtext for Zeit readers: Germans are worried about right-wing terrorism (a few weeks ago many people would have denied that there are right-wing terrorists operating in Germany), which must be a good thing, and they are less concerned about Islamist terrorists, which is possibly a progressive thing. Or something along those lines.

But is the five-point difference real?

YouGov has interviewed 1043 members of its online access panel. If we assume (and this is a heroic assumption) that these respondents can be treated like a simple random sample, what are the confidence intervals?

Binomial Confidence Intervals

First, we could treat the two categories as if they were distributed as binomial and ask Stata for exact confidence intervals.

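* `=exp' evaluates the expression before cii sees it
* (Stata 14 and later: cii proportions 1043 ...)
cii 1043 `=round(1043*.41)'
cii 1043 `=round(1043*.366)'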

The confidence intervals overlap, so we’re led to think that the proportions in the population are not necessarily different. But the two categories are not independent, because the “not right-wingers” answers include the “Islamists” answers and vice versa, so the multinomial is a better choice.

Multinomial Model

It is easy to re-create the univariate distribution of answers in Stata:

set obs 5
gen threat = _n
lab def threat 1 "right-wingers" 2 "islamists" 3 "left-wingers" 4 "other" 5 "no threat"
lab val threat threat

gen number = round(1043* 0.41) in 1
replace number = round(1043* 0.366) in 2
replace number = round(1043* 0.056) in 3
replace number = round(1043* 0.038) in 4
replace number = round(1043* 0.13) in 5
expand number

Next, run an empty multinomial logit model:

mlogit threat,base(5)

The parameters of the model reproduce the observed distribution exactly and are therefore not very interesting, but the estimates of their standard errors are available for testing hypotheses:

test [right_wingers]_cons = [islamists]_cons

At the conventional level of 0.05, we cannot reject the null hypothesis that both proportions are equal in the population, i.e. we cannot tell if Germans are really more worried about one of the two groups.
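As a rough cross-check on the Wald test, the difference between two shares from the same multinomial sample has variance [p1 + p2 - (p1 - p2)^2] / n, which allows for the negative covariance between the two categories. The hand calculation below is not what mlogit and test compute (they work on the log-odds scale), so expect a similar answer, not an identical one. The macro names are arbitrary.

```stata
* published figures: sample size and the two shares of interest
local n  = 1043
local p1 = 0.41
local p2 = 0.366

* standard error of p1 - p2 for two categories of one multinomial
local se = sqrt((`p1' + `p2' - (`p1' - `p2')^2) / `n')
local z  = (`p1' - `p2') / `se'
local p  = 2 * normal(-abs(`z'))

display "z = " %5.3f `z' "   two-sided p = " %5.3f `p'
```

With these inputs z comes out at roughly 1.6 and the two-sided p-value at roughly 0.11, in line with the conclusion above.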

Simulation

Just for the fun of it, we can carry out one additional test and ask a rather specific question: If both proportions are 0.388 in the population and the other three are identical to their values in the sample, what is the probability of observing a difference of at least 4.4 points in favour of right-wingers?

The idea is to sample repeatedly from a multinomial with known probabilities. This could be done more elegantly by defining a program and using Stata’s simulate command, but if your machine has enough memory, it is just as easy and possibly faster to use two loops to generate/analyse the required number of variables (one per simulation) and to fill them all in one go with three lines of Mata code. Depending on the number of trials, you may have to raise Stata’s maxvar limit (set maxvar) first.

local trials = 10000
foreach v of newlist s1-s`trials' {
qui gen `v' = .
}

mata:
probs = (.388, .388, .056, .038, .13)
// X=. creates X on the fly; the view spans all simulation variables
st_view(X=., ., "s1-s`trials'")
X[.,.] = rdiscrete(1043, `trials', probs)
end

local excess = 0

forvalues sample = 1/`trials' {
qui tab s`sample' if s`sample' == 1
local rw = r(N)
qui tab s`sample' if s`sample' == 2
local isl = r(N)
if (`rw' / 1043 * 100) - (`isl' / 1043 * 100) >=4.4 local excess = `excess' +1
}

display "Difference >=4.4 in `excess' of `trials' samples"

Seems the chance of a 4.4 point difference is between 5 and 6 per cent. This probability is somewhat smaller than the one from the multinomial model because the null hypothesis is more specific, but still not statistically significant. And the Zeit does not even have a proper random sample, so there is no scientific evidence for the claim that Germans are more afraid of right-wing extremists than of Islamists, whatever that would have been worth. Bummer.
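For completeness, the simulate route mentioned above might look roughly like this. It is a sketch, not the code used for the numbers reported here. The program name drawdiff and the variable name s are made up, and the seed is arbitrary.

```stata
capture program drop drawdiff
program define drawdiff, rclass
    * one sample of 1043 answers drawn under the null hypothesis
    drop _all
    set obs 1043
    mata: st_store(., st_addvar("byte", "s"), rdiscrete(1043, 1, (.388 \ .388 \ .056 \ .038 \ .13)))
    quietly count if s == 1
    local rw = r(N)
    quietly count if s == 2
    local isl = r(N)
    * difference in percentage points, right-wingers minus Islamists
    return scalar diff = (`rw' - `isl') / 1043 * 100
end

* 10,000 replications; the results replace the data in memory
simulate diff = r(diff), reps(10000) seed(2011) nodots: drawdiff
quietly count if diff >= 4.4
display "Difference >=4.4 in " r(N) " of " _N " samples"
```

This keeps only one variable in memory at a time, so maxvar is not an issue, but it may well run slower than the single Mata call.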

Apr 09 2011
 

Sometimes, a man’s gotta do what a man’s gotta do. Which, in my case, might be a little simulation of a random process involving an unordered categorical variable. In R, sampling from a multinomial distribution is trivial.

rmultinom(1,1000,c(.1,.7,.1,.1))

gives me a vector of counts from a multinomial distribution with outcomes 1, 2, 3, and 4, where the probability of observing a ‘1’ is 10 per cent, the probability of observing a ‘2’ is 70 per cent, and so on. But I could not find an equivalent function in Stata. Generating the artificial data in R and then moving them into Stata is not very elegant, so I kept digging and found a solution in section M-5 of the Mata handbook. Hidden in the entry on runiform() is a reference to rdiscrete(r,c,p), a Mata function which generates an r × c matrix of draws (category numbers 1 to k) from a multinomial distribution defined by a vector p of k probabilities.
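Wrapped for use from Stata, the call might look like the sketch below. The variable name outcome and the seed are arbitrary choices for illustration.

```stata
clear
set obs 1000
gen byte outcome = .

mata:
rseed(42)
// probabilities for outcomes 1 to 4; they must sum to one
p = (.1 \ .7 \ .1 \ .1)
// one draw per observation, written straight into the Stata variable
st_store(., "outcome", rdiscrete(st_nobs(), 1, p))
end

tabulate outcome
```

Unlike rmultinom, which returns the four counts, this yields one category per observation. tabulate then recovers the counts.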

That leaves but one question: Is wrapping a handful of lines around a Mata call to replace a non-existent Stata function more elegant than calling an external program?