# Stata-related posts

Stata is my favourite general-purpose stats package. Sadly, it is also one of my favourite pastimes, but there you are. Here is my collection of Stata-related blog posts. If this is relevant for you, you might also be interested in a series of slides for a Stata course I taught some years ago (in German).

A year ago, I wrote a slightly maudlin blog about the good and the not-so-good reasons for solo-blogging in this day and age. Good reasons or not, I kept up the good work with 35 blogs in 2018. That is a bit less than my long-term annual average, but chapeau to my good self nonetheless.

But what were the most popular (used in a strictly relative sense) posts on the blog in 2018? Here is your handy guide:

This is me, about once per year, when I bemoan my lack of R-coolness whilst simultaneously enjoying my Stata-efficiency.

## Personal blogs are so 1990s, yes?

This is not the late 1990s. Hey, it’s not even the early Naughties, and has not been for a while. I have had my own tiny corner of the Internet (then hosted on university Web space, as was the norm in those days) since Mosaic came under pressure from Netscape and the NYT experimented with releasing content as (I kid you not) postscript files, because PDF was not invented yet. I did this mostly because I liked computers, because it was new, and because it provided an excellent distraction from the things I should have been doing. By and large, not much has changed over 25 years.


Later (that was before German universities had repositories or policies for such things), my webspace became a useful resource for teaching-related material. Reluctantly and with a certain resentment, I have copied slides and handouts from one site to the next, adding layers of disclaimers instead of leaving them behind, because some of this stuff carries hundreds of decade-old backlinks and gets downloaded / viewed dozens of times each day.

And of course, I started posting pre-publication versions of my papers, boldly ignoring / blissfully ignorant of the legal muddle surrounding the issue back in the day. Call me old fashioned, but making research visible and accessible is what the Web was invented for.

In summer 2008, I set up my own domain on a woefully underpowered shared webspace (since replaced by an underpowered virtual server). A bit earlier in the same year, already late to the party, I had started my own “Weblog” on wordpress.com, writing and ranting about science, politics, methods, and all that. A year down the road, I converted www.kai-arzheimer.com to wordpress, moved my blog over there, and have ~~never looked back~~ continuously wondered why I kept doing this.

## Why keep blogging?

In those days of old, we had trackbacks and pingbacks & stuff (now a distant memory), and “social media” meant a network of interlinked personal blogs whose authors would comment on each other’s posts. Even back in 2008 on wordpress, my blog was not terribly popular, but for a couple of years, there was a bunch of people who had similar interests, with whom I would interact occasionally.

Then, academically minded multi-author blogs came along, which greatly reduced fragmentation and aimed at making social science accessible to a much bigger audience whilst removing the need to set up and maintain a site. For similar reasons, Facebook and particularly Twitter became perfect outlets for ~~ranting~~ “microblogging”, while Medium bypasses the fragmentation issue for longer texts and is far more aesthetically pleasing and faster than anything any of us could run by ourselves.


It is therefore only rational that many personal academic blogs died a slow death. People I used to read left Academia completely, gave up blogging, or moved on to the newer platforms. Do you remember blogrolls? No, you wouldn’t. Because I’m a dinosaur, I still get my news through an RSS reader (and you should, too). While there are a few exceptions (Chris Blattman and Andrew Gelman spring to mind), most of the sources in my “blog” drawer are run by collectives / institutions (the many LSE blogs, the Monkey Cage, the Duck etc.). I recently learned that I made it into an only slightly dubious-looking list of the top 100 political science blogs, but that is surely because there are not many individual political science bloggers left.
So why am I still rambling in this empty Platonic man-cave? Off the top of my head, I can think of about five reasons:

1. Total editorial control. I have written for the Monkey Cage, The Conversation, the LSE, and many other outlets. Working with their editors has made my texts much better, but sometimes I am not in the mood for clarity and accessibility. I want to rant, and be quick about it.
2. Pre-prints. I like to have pre-publication versions of my work on my site, although again, institutional hosting makes much more sense. Once I upload them, I’m usually so happy that I want to say something about it.
3. For me, my blog is still a bit like an open journal. If I need to remember some sequence of events in German or European politics for the day job, it’s helpful if I have blogged about it as it happened. Similarly, sometimes I work out the solution to some software issue but quickly forget the details. Five months later, a blog post is a handy reference and may help others.
4. Irrelevance. Often, something annoys or interests me so much that I need to write a short piece about it, although few other people will care. I would have a better chance of finding an audience at Medium, but then again, on my own wordpress-powered site I have a perfectly serviceable CMS which happens to have blogging functionality built in.
5. Ease of use. I do almost all of my writing in Emacs and keep (almost) all my notes in org-mode. Thanks to org2blog, turning a few paragraphs into a post is just a few hard-to-remember keystrokes away.

## Bonus track: the five most popular posts in 2017

As everyone knows, I’m not obsessed with numbers, thank you very much. I keep switching between various types of analytic software and have no idea how much (or rather little) of an audience I actually have. Right now I’m back to the basic wordpress statistics and have been for over a year, so here is the list of the five posts that were the most popular in 2017.

Here is an update on our work on surveybias.

How can we usefully summarise the accuracy of an election opinion poll compared to the real result of an election? In this blog, we describe a score we have devised to allow people to see how different polls compare in their reflection of the final election result, no matter how many parties or candidates are standing. This index, B, can be compared across time, polling company and even election to provide a simple demonstration of how the polls depicted public opinion in the run-up to polling day.

Just how badly biased is your pre-election survey? Once the election results are in, our scalar measures B and B_w provide convenient, single number summaries. Our surveybias add-on for Stata will calculate these and other measures from either raw data or from published margins. Its latest iteration (version 1.4) has just appeared on SSC. Surveybias 1.4 improves on the previous version by ditching the last remnants of the old numerical approximation code for calculating standard errors and is hence much faster in many applications. Install it now from within Stata by typing

ssc install surveybias

Last week, my introduction to structural equation modelling (in German) was published by Springer/VS. The book shows how the most common models (including simple and multi-group confirmatory factor analyses (CFA/MGCFA)) can be implemented in Stata, Lisrel and MPlus. The examples come from political attitude research (xenophobia, political alienation, political interest …).

All example files can be downloaded here. The book costs €12.99 (ebook) or €17.99 (paperback). Pre-prints of the introduction and the glossary are available here, and further sample pages directly from Springer.

The book covers the following topics:

1 Introduction
1.1 Why, how, and wherefore? Structural equation models in political science
1.2 Structure of the book
1.3 Conventions
1.4 Software and internet resources
2 Foundations
2.1 Matrix algebra
2.1.1 Dimensions, elements, vectors, submatrices, partitions
2.1.2 Special matrices
2.1.3 Simple matrix operations
2.1.4 Rank and inverse
2.2 Covariance, correlation, regression
2.2.1 The covariance: a measure of association between metric variables
2.2.2 Pearson’s correlation coefficient: a standardised measure of association between metric variables
2.2.3 The linear regression model: building block for structural equation models
2.3 Measurement error and factor analysis
2.4.1 The concept of causality
2.5 The general structural equation model
2.6 Samples, estimation, strategies
2.6.1 Reality, model, and data
2.6.2 Estimation methods
2.6.3 Identification
2.6.4 Model comparison: fit indices and hypothesis tests
2.6.5 Standardised estimates and mean structures
3 Examples and applications
3.1 Data
3.2 Confirmatory factor analysis: attitudes towards migrants
3.3 Group comparison and measurement equivalence
3.4 Recommendations for analysis and presentation
3.4.1 Theoretical foundations and specification
3.4.2 Data selection and preparation
3.4.3 Model estimation and respecification
3.4.4 Presentation
4 Advanced topics
4.1 Categorical variables
4.1.1 Categorical indicators
4.1.2 An example: political efficacy
4.2 Latent growth models
4.2.1 Growing interest in the election campaign
4.2.2 Excursus: latent growth models as multilevel models
4.3 Outlook and further reading
4.3.1 Missing data
4.3.2 Categorical latent variables
4.3.3 Multilevel structural equation models
5 Outlook and further reading
5.1 Foundations
5.2 Introductions
5.3 Literature on specific programs
5.4 Journals and handbooks
6 Bibliography

We have updated our add-on (or ado) surveybias, which calculates our multinomial generalisation of the old Martin, Traugott, and Kennedy (2005) measure for survey bias. If you have any dichotomous or multinomial variable in your survey whose true distribution is known (e.g. from the census, electoral counts, or other official data), surveybias can tell you just how badly damaged your sample really is with respect to that variable. Our software makes it trivially easy to assess bias in any survey.

Within Stata, you can install/update surveybias by entering ssc install surveybias. We’ve also created a separate page with more information on how to use surveybias, including a number of worked examples.

The new version is called 1.3b (please don’t ask). New features and improvements include:

• Support for (some) complex variance estimators, including Stata’s survey estimators (sampling points, strata, survey weights etc.)
• Improvements to the numerical approximation. surveybias is roughly seven times faster now
• A new analytical method for simple random samples that is even faster
• Convenience options for naming the variables created by surveybiasseries
• Lots of bug fixes and improvements to the code

If you need to quantify survey bias, give it a spin.

Contrary to popular belief, it’s not always the third reviewer that gives you grief. In our case, it is the one and only reviewer that shot down a manuscript, because at the very least, s/he would have expected (and I quote) an “analytical derivation of the estimator”. For some odd reason of his own, the editor, instead of simply rejecting us, dared us to do just that, and against all odds, we succeeded after some months of gently banging various heads against assorted walls.

Needless to say, on second thought the reviewer found the derivation “interesting but unnecessarily complicated” and now recommends relegating the material to a footnote. To make up for this, s/he delved into the code of our software, spotted some glaring mistakes, and recommended a few changes (actually sending us a dozen lines of code) that result in a speed gain of some 600 per cent. This is very cool, very good news for end users, very embarrassing for us, and generally wrong on so many levels.

Bonus track: The third reviewer.

Scientific Peer Review, ca. 1945

## The Problem: Assessing Bias without the Data Set

While the interwebs are awash with headline findings from countless surveys, commercial companies (and even some academics) are reluctant to make their raw data available for secondary analysis. But fear not: Quite often, media outlets and aggregator sites publish survey margins, and that is all the information you need. It’s as easy as $\pi$.

## The Solution: surveybiasi

After installing our surveybias add-on for Stata, you will have access to surveybiasi. surveybiasi is an “immediate command” (Stata parlance) that compares the distribution of a categorical variable in a survey to its true distribution in the population. The two distributions are specified via the popvalues() and samplevalues() options, respectively. The elements of these lists may be given as counts, percentages, or relative frequencies, as each list is internally rescaled so that its elements sum to unity. surveybiasi will happily report k $A^{\prime}_{i}$s, $B$, and $B_{w}$ (check out our paper for more information on these multinomial measures of bias) for variables with 2 to 12 discrete categories.

## Bias in a 2012 CBS/NYT Poll

A week before the 2012 election for the US House of Representatives, 563 likely voters were polled for CBS/The New York Times. 46 per cent said they would vote for the Republican candidate in their district, 48 per cent said they would vote for the Democratic candidate. Three per cent said it would depend, and another two per cent said they were unsure or refused to answer the question. In the example, these five per cent are treated as “other”. Due to rounding error, the numbers do not exactly add up to 100, but surveybiasi takes care of the necessary rescaling.

In the actual election, the Republicans won 47.6 and the Democrats 48.8 per cent of the popular vote, with the rest going to third-party candidates. To see if these differences are significant, run surveybiasi like this:


. surveybiasi , popvalues(47.6 48.8 3.6) samplevalues(46 48 5) n(563)
------------------------------------------------------------------------------
catvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
A'           |
1 |  -.0426919   .0844929    -0.51   0.613     -.208295    .1229111
2 |  -.0123999   .0843284    -0.15   0.883    -.1776805    .1528807
3 |   .3375101   .1938645     1.74   0.082    -.0424573    .7174776
-------------+----------------------------------------------------------------
B            |
B |   .1308673   .0768722     1.70   0.089    -.0197994    .2815341
B_w |   .0385229   .0247117     1.56   0.119    -.0099112    .0869569
------------------------------------------------------------------------------

Ho: no bias
Degrees of freedom: 2
Chi-square (Pearson) = 3.0945337
Pr (Pearson) = .21282887
Chi-square (LR) = 2.7789278
Pr (LR) = .24920887




Given the small sample size and the close match between survey and electoral counts, it is not surprising that there is no evidence for statistically or substantively significant bias in this poll.
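For the curious, the point estimates in the output above can be reproduced with a few lines of Python. This is a rough sketch, not the ado’s actual code: the helper name surveybias_point is made up for this illustration, and rounding the implied sample counts to integers is an assumption about surveybiasi’s internals that happens to reproduce the printed figures. Here, $A^{\prime}_{i}$ is the log of the ratio of the sample odds to the population odds for category i versus all others, $B$ is the unweighted mean of the absolute $A^{\prime}_{i}$s, and $B_w$ weights them by population share.

```python
import math

def surveybias_point(popvalues, samplevalues, n):
    """Point estimates of A'_i, B, and B_w from published margins.

    popvalues/samplevalues may be counts, percentages, or relative
    frequencies; both lists are rescaled to sum to one, just like
    surveybiasi does internally. Standard errors are omitted.
    """
    p = [v / sum(popvalues) for v in popvalues]
    s = [v / sum(samplevalues) for v in samplevalues]
    # implied integer counts in the sample (assumption about the
    # ado's internals that reproduces the printed estimates)
    x = [round(share * n) for share in s]
    a_prime = []
    for p_i, x_i in zip(p, x):
        sample_odds = x_i / (n - x_i)   # category i vs. all others
        pop_odds = p_i / (1 - p_i)
        a_prime.append(math.log(sample_odds / pop_odds))
    b = sum(abs(a) for a in a_prime) / len(a_prime)        # unweighted mean
    b_w = sum(p_i * abs(a) for p_i, a in zip(p, a_prime))  # population-weighted
    return a_prime, b, b_w

# the 2012 CBS/NYT example from the output above
a_prime, b, b_w = surveybias_point([47.6, 48.8, 3.6], [46, 48, 5], 563)
```

Running this yields the same point estimates as the surveybiasi output above (e.g. $B \approx 0.131$ and $B_w \approx 0.039$), though of course without the standard errors and tests.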

An alternative approach is to follow Martin, Traugott and Kennedy (2005) and ignore third-party voters, undecided respondents, and refusals. This requires minimal adjustments: $n$ is now 535, as the analytical sample size is reduced by five per cent, while the figures representing the “other” category can simply be dropped. Again, surveybiasi internally rescales the values accordingly:


. surveybiasi , popvalues(47.6 48.8) samplevalues(46 48) n(535)
------------------------------------------------------------------------------
catvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
A'           |
1 |  -.0162297   .0864858    -0.19   0.851    -.1857388    .1532794
2 |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
-------------+----------------------------------------------------------------
B            |
B |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
B_w |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
------------------------------------------------------------------------------

Ho: no bias
Degrees of freedom: 1
Chi-square (Pearson) = .03521623
Pr (Pearson) = .85114329
Chi-square (LR) = .03521898
Pr (LR) = .85113753



Under this two-party scenario, $A^{\prime}_{1}$ is identical to Martin, Traugott, and Kennedy’s original $A$ (and all other estimates are identical to $A$’s absolute value). Its negative sign points to the (tiny) anti-Republican bias in this poll, which is of course even less significant than in the previous example.
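The two-party case is easy to check by hand: $A$ is the log of the ratio of the Republican/Democrat odds in the sample to the same odds in the election result. A minimal Python sketch (the helper name is invented, and rounding the implied sample counts is again an assumption about surveybiasi’s internals):

```python
import math

def mtk_a(rep_sample, dem_sample, rep_pop, dem_pop, n):
    """Martin/Traugott/Kennedy two-party accuracy index A (point estimate).

    A = ln of (sample Rep/Dem odds) over (population Rep/Dem odds);
    negative values indicate anti-Republican bias.
    """
    total = rep_sample + dem_sample
    rep_n = round(rep_sample / total * n)  # implied Republican count
    dem_n = n - rep_n
    return math.log((rep_n / dem_n) / (rep_pop / dem_pop))

# the two-party version of the 2012 CBS/NYT example
a = mtk_a(46, 48, 47.6, 48.8, n=535)
```

The result matches the $A^{\prime}_{1}$ in the output above, down to its negative sign.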

## Survey Accuracy

The accuracy of pre-election surveys is a matter of considerable debate. Obviously, any rigorous discussion of bias in opinion polls requires a scalar measure of survey accuracy. Martin, Traugott, and Kennedy (2005) propose such a measure $A$ for the two-party case, and in our own work (Arzheimer/Evans 2014), Jocelyn Evans and I demonstrate how $A$ can be generalised to the multi-party case, giving rise to a new measure $B$ (seriously) and some friends $A^{\prime}_{i}$ and $B_w$:

Arzheimer, Kai and Jocelyn Evans. “A New Multinomial Accuracy Measure for Polling Bias.” Political Analysis 22.1 (2014): 31–44. doi:10.1093/pan/mpt012

In this article, we propose a polling accuracy measure for multi-party elections based on a generalization of Martin, Traugott, and Kennedy’s two-party predictive accuracy index. Treating polls as random samples of a voting population, we first estimate an intercept-only multinomial logit model to provide proportionate odds measures of each party’s share of the vote, and thereby both unweighted and weighted averages of these values as a summary index for poll accuracy. We then propose measures for significance testing, and run a series of simulations to assess possible bias from the resulting folded normal distribution across different sample sizes, finding that bias is small even for polls with small samples. We apply our measure to the 2012 French presidential election polls to demonstrate its applicability in tracking overall polling performance across time and polling organizations. Finally, we demonstrate the practical value of our measure by using it as a dependent variable in an explanatory model of polling accuracy, testing the different possible sources of bias in the French data.

@Article{arzheimer-evans-2013,
author = {Arzheimer, Kai and Evans, Jocelyn},
title = {A New Multinomial Accuracy Measure for Polling Bias },
journal = {Political Analysis},
year = 2014,
abstract = {In this article, we propose a polling accuracy measure for
multi-party elections based on a generalization of Martin,
Traugott, and Kennedy s two-party predictive accuracy index.
Treating polls as random samples of a voting population, we first
estimate an intercept only multinomial logit model to provide
proportionate odds measures of each party s share of the vote, and
thereby both unweighted and weighted averages of these values as a
summary index for poll accuracy. We then propose measures for
significance testing, and run a series of simulations to assess
possible bias from the resulting folded normal distribution across
different sample sizes, finding that bias is small even for polls
with small samples. We apply our measure to the 2012 French
presidential election polls to demonstrate its applicability in
tracking overall polling performance across time and polling
organizations. Finally, we demonstrate the practical value of our
measure by using it as a dependent variable in an explanatory model
of polling accuracy, testing the different possible sources of bias
in the French data.},
keywords = {meth-e},
volume = {22},
number = {1},
pages = {31--44},
url =
{http://pan.oxfordjournals.org/cgi/reprint/mpt012?ijkey=z9z740VU1fZp331&keytype=ref},
doi = {10.1093/pan/mpt012},
data = {http://hdl.handle.net/1902.1/21603},
html =
{https://www.kai-arzheimer.com/new-multinomial-accuracy-measure-for-polling-bias}
}

## The Surveybias Software 1.1

Calculating the accuracy measures is a matter of some algebra. Estimating standard errors is a bit trickier but could be done manually by exploiting the relationship between $A^{\prime}_{i}$ and the multinomial logistic model on the one hand and Stata’s very powerful implementation of the Delta method on the other. But these calculations are error-prone and become tedious rather quickly. This is why we created a suite of user-written programs (surveybias, surveybiasi, and surveybiasseries). They do all the necessary legwork and return the estimates of accuracy, complete with standard errors and statistical tests.
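The analytical shortcut for the simple-random-sampling case is less scary than it sounds: with the population margins treated as fixed, the variance of each $A^{\prime}_{i}$ reduces to that of a sample log odds. The sketch below is my own paraphrase, not the ado’s code, and the helper name is invented; it does, however, reproduce the standard errors printed in the surveybiasi output further up the page.

```python
import math

def a_prime_se(count_i, n):
    """Analytical standard error of A'_i under simple random sampling.

    With the population distribution fixed, var(A'_i) is simply the
    variance of the sample log odds for category i versus the rest:
    1/x_i + 1/(n - x_i).
    """
    return math.sqrt(1 / count_i + 1 / (n - count_i))

# Republican category in the 2012 CBS/NYT example: 262 of 563 respondents
se = a_prime_se(262, 563)
```

For anything beyond simple random samples (strata, weights, and so on), the numerical machinery in the ado is still needed.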

Those Were the Days

We have just updated our software. The new version 1.1 of surveybias features some bug fixes, a better mechanism for automagically dealing with convergence problems, better documentation, and a new example data set that compiles information on 152 German pre-election polls conducted between January and September 2013.

surveybias comes with example data from the French presidential election 2012 and the German parliamentary election 2013. From within Stata, type help surveybias, help surveybiasi, and help surveybiasseries to see how you can make use of our software. If I can find the time, I will illustrate the use of surveybias in a mini series of blogs over the next week.

## Updating Surveybias

The new version 1.1 ~~should appear on SSC within the next couple of days or so~~ is now on SSC, but the truly impatient can get it now. In your internet-aware copy of Stata (version 11 or later), type

net from https://www.kai-arzheimer.com/stata/

net install surveybias, replace

Or use SSC: ssc install surveybias, replace

Enjoy!