# Stata-related posts

Stata is my favourite general-purpose stats package. Sadly, it is also one of my favourite pastimes, but there you are. Here is my collection of Stata-related blog posts. If this is relevant for you, you might also be interested in a series of slides for a Stata course I taught some years ago (in German).

The Stata idiom capture quietly makes it so that any output from the subsequent command is suppressed, and that even critical failures are happily ignored. Your script soldiers on, and you are none the wiser. I always thought that this is a wonderful metaphor for organisational behaviour.
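For the record, the failure is not lost entirely: capture stores the return code in _rc, so a script can still notice that something went wrong. A minimal sketch, assuming the auto data set that ships with Stata and a deliberately misspelled variable name (no_such_variable is made up for illustration):

```stata
sysuse auto, clear

* The variable does not exist, but capture quietly hides the error completely
capture quietly regress price no_such_variable

* Save the return code straight away: the next capture would overwrite _rc
local rc = _rc
if `rc' != 0 {
    display as error "regress failed with return code `rc'"
}
```

Whether the script should stop at this point or soldier on is, of course, a matter of organisational culture.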

In unrelated news, every other summer, Statacorp comes up with a new version of its product. Every other summer, I succumb to some Pavlovian reflex and decide to spend some institutional money on upgrading my unit’s licences for some interesting but usually quite marginal benefits.

It is the same story in other units and departments, and by coordinating and pooling our orders, we can get substantial discounts. And so, come autumn, the university’s IT centre is collating expressions of interest and communicating tentative prices, going back and forth until some equilibrium is reached. From then on, it can still take months until the new licences arrive, in spite of shipments being just codes and downloads now. Yesterday, I realised that Stata 17 came out in April, i.e. nine months ago, and so decided to find out what had happened to our order. As it turned out, the IT centre required our charge codes to proceed, but had never bothered to ask for them.

A year ago, I wrote a slightly maudlin blog about the good and the not-so-good reasons for solo-blogging in this day and age. Good reasons or not, I kept up the good work with 35 blogs in 2018. That is a bit less than my long-term annual average, but Chapeau to my good self nonetheless.

But what were the most popular (used in a strictly relative sense) posts on the blog in 2018? Here is your handy guide:

This is me, about once per year, when I bemoan my lack of R-coolness whilst simultaneously enjoying my Stata-efficiency.

## Personal blogs are so 1990s, yes?

This is not the late 1990s. Hey, it’s not even the early Noughties, and has not been for a while. I have had my own tiny corner of the Internet (then hosted on university Web space, as was the norm back in the day) since Mosaic came under pressure from Netscape and the NYT experimented with releasing content as (I kid you not) postscript files, because PDF had not been invented yet. I did this mostly because I liked computers, because it was new, and because it provided an excellent distraction from the things I should have been doing. By and large, not much has changed in 25 years.

Later (that was before German universities had repositories or policies for such things), my webspace became a useful resource for teaching-related material. Reluctantly and with a certain resentment, I have copied slides and handouts from one site to the next, adding layers of disclaimers instead of leaving them behind, because some of this stuff carries hundreds of decade-old backlinks and gets downloaded / viewed dozens of times each day.

And of course, I started posting pre-publication versions of my papers, boldly ignoring / blissfully ignorant of the legal muddle surrounding the issue back in the day. Call me old-fashioned, but making research visible and accessible is what the Web was invented for.

In summer 2008, I set up my own domain on a woefully underpowered shared webspace (since replaced by an underpowered virtual server). A bit earlier in the same year, already late to the party, I had started my own “Weblog” on wordpress.com, writing and ranting about science, politics, methods, and all that. A year down the road, I converted www.kai-arzheimer.com to wordpress, moved my blog over there, and have ~~never looked back~~ continuously wondered why I kept doing this.

## Why keep blogging?

In those days of old, we had trackbacks and pingbacks & stuff (now a distant memory), and social media was the idea of having a network of interlinking personal blogs, whose authors would comment on each other’s posts. Even back in 2008 on wordpress, my blog was not terribly popular, but for a couple of years, there was a bunch of people who had similar interests, with whom I would interact occasionally.

Then, academically minded multi-author blogs came along, which greatly reduced fragmentation and aimed at making social science accessible for a much bigger audience whilst removing the need to set up and maintain a site. For similar reasons, Facebook and particularly Twitter became perfect outlets for ~~ranting~~ “microblogging”, while Medium bypasses the fragmentation issue for longer texts and is far more aesthetically pleasing and faster than anything any of us could run by ourselves.

It is therefore only rational that many personal academic blogs died a slow death. People I used to read left Academia completely, gave up blogging, or moved on to the newer platforms. Do you remember blogrolls? No, you wouldn’t. Because I’m a dinosaur, I still get my news through an RSS reader (and you should, too). While there are a few exceptions (Chris Blattman and Andrew Gelman spring to mind), most of the sources in my “blog” drawer are run by collectives / institutions (the many LSE blogs, the Monkey Cage, the Duck etc.). I recently learned that I made it into an only slightly dubious looking list of the top 100 political science blogs, but that is surely because there are not many individual political science bloggers left.

So why am I still rambling in this empty Platonic man-cave? Off the top of my head, I can think of about five reasons:

1. Total editorial control. I have written for the Monkey Cage, The Conversation, the LSE, and many other outlets. Working with their editors has made my texts much better, but sometimes I am not in the mood for clarity and accessibility. I want to rant, and be quick about it.
2. Pre-prints. I like to have pre-publication versions of my work on my site, although again, institutional hosting makes much more sense. Once I upload them, I’m usually so happy that I want to say something about it.
3. For me, my blog is still a bit like an open journal. If I need to remember some sequence of events in German or European politics for the day job, it’s helpful if I have blogged about it as it happened. Similarly, sometimes I work out the solution to some software issue but quickly forget the details. Five months later, a blog post is a handy reference and may help others.
4. Irrelevance. Often, something annoys or interests me so much that I need to write a short piece about it, although few other people will care. I would have a better chance of finding an audience at Medium, but then again, on my own wordpress-powered site I have a perfectly serviceable CMS which happens to have blogging functionality built in.
5. Ease of use. I do almost all of my writing in Emacs and keep (almost) all my notes in orgmode. Thanks to org2blog, turning a few paragraphs into a post is just some hard-to-remember keystrokes away.

## Bonus track: the five most popular posts in 2017

As everyone knows, I’m not obsessed with numbers, thank you very much. I keep switching between various types of analytic software and have no idea how much (or rather little) of an audience I actually have. Right now I’m back to the basic wordpress statistics and have been for over a year, so here is the list of the five posts that were the most popular in 2017.

Here is an update on our work on surveybias.

How can we usefully summarise the accuracy of an election opinion poll compared to the real result of an election? In this blog, we describe a score we have devised to allow people to see how different polls compare in their reflection of the final election result, no matter how many parties or candidates are standing. This index, B, can be compared across time, polling company and even election to provide a simple demonstration of how the polls depicted public opinion in the run-up to polling day.

Just how badly biased is your pre-election survey? Once the election results are in, our scalar measures B and B_w provide convenient, single number summaries. Our surveybias add-on for Stata will calculate these and other measures from either raw data or from published margins. Its latest iteration (version 1.4) has just appeared on SSC. Surveybias 1.4 improves on the previous version by ditching the last remnants of the old numerical approximation code for calculating standard errors and is hence much faster in many applications. Install it now from within Stata by typing

ssc install surveybias

Last week, my introduction to structural equation modelling was published by Springer/VS. The book shows how the most common models (including simple and multi-group confirmatory factor analyses (CFA/MGCFA)) can be implemented in Stata, Lisrel, and MPlus. The examples come from the field of political attitude research (xenophobia, political alienation, political interest …).

All example files can be downloaded here. The book costs €12.99 (ebook) or €17.99 (paperback). Pre-prints of the introduction and the glossary are available here; further sample pages are available directly from Springer.

In detail, the book covers the following topics:

1 Introduction
1.1 Why, what for, how come? Structural equation models in political science
1.2 Structure of the book
1.3 Conventions
1.4 Software and online resources
2 Foundations
2.1 Matrix algebra
2.1.1 Dimensions, elements, vectors, submatrices, partitions
2.1.2 Special matrices
2.1.3 Simple matrix operations
2.1.4 Rank and inverse
2.2 Covariance, correlation, regression
2.2.1 Covariance: a measure of association between metric variables
2.2.2 Pearson’s correlation coefficient: a standardised measure of association between metric …
2.2.3 The linear regression model: a building block for structural equation models
2.3 Measurement error and factor analysis
2.4.1 The concept of causality
2.5 The general structural equation model
2.6 Samples, estimates, strategies
2.6.1 Reality, model, and data
2.6.2 Estimation methods
2.6.3 Identification
2.6.4 Model comparison: fit indices and hypothesis tests
2.6.5 Standardised estimates and mean structures
3 Examples and applications
3.1 Data
3.2 Confirmatory factor analysis: attitudes towards migrants
3.3 Group comparison and equivalent measurements
3.4 Recommendations for analysis and presentation
3.4.1 Theoretical foundations and specification
3.4.2 Data selection and preparation
3.4.3 Model estimation and respecification
3.4.4 Presentation
4 Advanced topics
4.1 Categorical variables
4.1.1 Categorical indicators
4.1.2 An example: political efficacy
4.2 Latent growth models
4.2.1 Growing interest in the election campaign
4.2.2 Excursus: latent growth models as multi-level models
4.3 Outlook and further reading
4.3.1 Missing data
4.3.2 Categorical latent variables
4.3.3 Multi-level structural equation models
5 Outlook and further reading
5.1 Foundations
5.2 Introductions
5.3 Literature on individual programs
5.4 Journals and handbooks
6 References

We have updated our add-on (or ado) surveybias, which calculates our multinomial generalisation of the old Martin, Traugott, and Kennedy (2005) measure for survey bias. If you have any dichotomous or multinomial variable in your survey whose true distribution is known (e.g. from the census, electoral counts, or other official data), surveybias can tell you just how badly damaged your sample really is with respect to that variable. Our software makes it trivially easy to assess bias in any survey.

Within Stata, you can install/update surveybias by entering ssc install surveybias. We’ve also created a separate page with more information on how to use surveybias, including a number of worked examples.
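If an older copy is already on your machine, a plain ssc install will refuse to overwrite it. A short sketch of the update route, using only standard Stata commands (the replace option of ssc install, and which to locate the ado-file):

```stata
* Install the package from SSC, or overwrite an older copy with the current one
ssc install surveybias, replace

* Show where the ado-file lives and print its version line, if it has one
which surveybias
```

Both commands need an internet connection and write access to your ado directory.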

The new version is called 1.3b (please don’t ask). New features and improvements include:

• Support for (some) complex variance estimators including Stata’s survey estimator (sample points, strata, survey weights etc.)
• Improvements to the numerical approximation. surveybias is roughly seven times faster now
• A new analytical method for simple random samples that is even faster
• Convenience options for naming variables created by surveybiasseries
• Lots of bug fixes and improvements to the code

If you need to quantify survey bias, give it a spin.

Contrary to popular belief, it’s not always the third reviewer that gives you grief. In our case, it is the one and only reviewer that shot down a manuscript, because at the very least, s/he would have expected (and I quote) an “analytical derivation of the estimator”. For some odd reason of his own, the editor, instead of simply rejecting us, dared us to do just that, and against all odds, we succeeded after some months of gently banging various heads against assorted walls.

Needless to say, on second thought the reviewer found the derivation “interesting but unnecessarily complicated” and now recommends relegating the material to a footnote. To make up for this, s/he delved into the code of our software, spotted some glaring mistakes and recommended a few changes (actually sending us a dozen lines of code) that result in a speed gain of some 600 per cent. This is very cool, very good news for end users, very embarrassing for us, and generally wrong on so many levels.

Bonus track: The third reviewer.

Scientific Peer Review, ca. 1945

## The Problem: Assessing Bias without the Data Set

While the interwebs are awash with headline findings from countless surveys, commercial companies (and even some academics) are reluctant to make their raw data available for secondary analysis. But fear not: Quite often, media outlets and aggregator sites publish survey margins, and that is all the information you need. It’s as easy as $\pi$.

## The Solution: surveybiasi

After installing our surveybias add-on for Stata, you will have access to surveybiasi. surveybiasi is an “immediate command” (Stata parlance) that compares the distribution of a categorical variable in a survey to its true distribution in the population. Both distributions need to be specified via the popvalues() and samplevalues() options, respectively. The elements of these two lists may be specified in terms of counts, of percentages, or of relative frequencies, as the list is internally rescaled so that its elements sum up to unity. surveybiasi will happily report k $A^{\prime}_{i}$s, $B$ and $B_{w}$ (check out our paper for more information on these multinomial measures of bias) for variables with 2 to 12 discrete categories.

## Bias in a 2012 CBS/NYT Poll

A week before the 2012 election for the US House of Representatives, 563 likely voters were polled for CBS/The New York Times. 46 per cent said they would vote for the Republican candidate in their district, 48 per cent said they would vote for the Democratic candidate. Three per cent said it would depend, and another two per cent said they were unsure, or refused to answer the question. In the example these five per cent are treated as “other”. Due to rounding error, the numbers do not exactly add up to 100, but surveybiasi takes care of the necessary rescaling.
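The rescaling itself is nothing more than dividing each figure by the sum of the list. A back-of-the-envelope sketch in plain Stata arithmetic (this is not surveybiasi’s internal code, and the scalar name total is made up for illustration):

```stata
* Published sample margins: Republican, Democratic, other (sum to 99, not 100)
scalar total = 46 + 48 + 5

* Each margin divided by the total gives the rescaled share
display "Republican: " %6.4f 46/total
display "Democratic: " %6.4f 48/total
display "Other:      " %6.4f 5/total

* The rescaled shares sum to one by construction
display (46 + 48 + 5)/total
```

The same logic applies whether the list holds counts, percentages, or relative frequencies.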

In the actual election, the Republicans won 47.6 and the Democrats 48.8 per cent of the popular vote, with the rest going to third-party candidates. To see if these differences are significant, run surveybiasi like this:


    . surveybiasi , popvalues(47.6 48.8 3.6) samplevalues(46 48 5) n(563)
    ------------------------------------------------------------------------------
          catvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    A'           |
               1 |  -.0426919   .0844929    -0.51   0.613     -.208295    .1229111
               2 |  -.0123999   .0843284    -0.15   0.883    -.1776805    .1528807
               3 |   .3375101   .1938645     1.74   0.082    -.0424573    .7174776
    -------------+----------------------------------------------------------------
    B            |
               B |   .1308673   .0768722     1.70   0.089    -.0197994    .2815341
             B_w |   .0385229   .0247117     1.56   0.119    -.0099112    .0869569
    ------------------------------------------------------------------------------

    Ho: no bias
    Degrees of freedom: 2
    Chi-square (Pearson) = 3.0945337
    Pr (Pearson) = .21282887
    Chi-square (LR) = 2.7789278
    Pr (LR) = .24920887

Given the small sample size and the close match between survey and electoral counts, it is not surprising that there is no evidence for statistically or substantively significant bias in this poll.
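If you want to see where the point estimates come from, they can be retraced by hand. The sketch below rests on two assumptions of mine rather than on surveybiasi’s actual code: that the margins are first converted to whole respondents (262, 273 and 28 out of 563), and that each $A^{\prime}_{i}$ is the log of the sample odds for category $i$ over the corresponding population odds, with $B$ the plain mean of the absolute $A^{\prime}_{i}$ and $B_{w}$ the mean weighted by population shares. All scalar names are made up for illustration.

```stata
* Implied whole-respondent counts for n = 563 (an assumption, see the text)
scalar n_all = 563
scalar n_rep = round(n_all*46/99)
scalar n_dem = round(n_all*48/99)
scalar n_oth = round(n_all*5/99)

* A'_i: log of sample odds over population odds for category i
scalar a1 = ln((n_rep/(n_all - n_rep)) / (47.6/(100 - 47.6)))
scalar a2 = ln((n_dem/(n_all - n_dem)) / (48.8/(100 - 48.8)))
scalar a3 = ln((n_oth/(n_all - n_oth)) / (3.6/(100 - 3.6)))

* B: unweighted mean of the absolute A'_i; B_w: weighted by population shares
scalar B  = (abs(a1) + abs(a2) + abs(a3))/3
scalar Bw = (47.6*abs(a1) + 48.8*abs(a2) + 3.6*abs(a3))/100

display %9.7f a1 "  " %9.7f a2 "  " %9.7f a3
display %9.7f B "  " %9.7f Bw

* p-value for the Pearson chi-square reported under the table (2 d.f.)
display %9.7f chi2tail(2, 3.0945337)
```

Under these assumptions, the figures agree with the Coef. column and with Pr (Pearson) in the output above. The standard errors are a different matter and are best left to surveybiasi.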

An alternative approach is to follow Martin, Traugott and Kennedy (2005) and ignore third-party voters, undecided respondents, and refusals. This requires minimal adjustments: $n$ is now 535 as the analytical sample size is reduced by five per cent, while the figures representing the “other” category can simply be dropped. Again, surveybiasi internally rescales the values accordingly:


    . surveybiasi , popvalues(47.6 48.8) samplevalues(46 48) n(535)
    ------------------------------------------------------------------------------
          catvar |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    A'           |
               1 |  -.0162297   .0864858    -0.19   0.851    -.1857388    .1532794
               2 |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
    -------------+----------------------------------------------------------------
    B            |
               B |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
             B_w |   .0162297   .0864858     0.19   0.851    -.1532794    .1857388
    ------------------------------------------------------------------------------

    Ho: no bias
    Degrees of freedom: 1
    Chi-square (Pearson) = .03521623
    Pr (Pearson) = .85114329
    Chi-square (LR) = .03521898
    Pr (LR) = .85113753

Under this two-party scenario, $A^{\prime}_{1}$ is identical to Martin, Traugott, and Kennedy’s original $A$ (and all other estimates are identical to $A$’s absolute value). Its negative sign points to the (tiny) anti-Republican bias in this poll, which is of course even less significant than in the previous example.
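The reason the two coincide is simple: with only two categories, the share of the second party is one minus the share of the first, so the odds for party 1 are just the ratio of the two parties’ shares. A quick check in Stata, assuming (as the printed estimates suggest) that the margins are converted to whole respondents, i.e. 262 Republicans and 273 Democrats out of 535; the scalar names are made up for illustration:

```stata
* Martin, Traugott, and Kennedy's A: log of the sample ratio over the population ratio
scalar A = ln((262/273) / (47.6/48.8))

* Two-category A'_1, computed from the rescaled shares
scalar p1 = 262/535
scalar v1 = 47.6/(47.6 + 48.8)
scalar A1 = ln((p1/(1 - p1)) / (v1/(1 - v1)))

* Both lines should show the same number
display %9.7f A
display %9.7f A1
```

Swapping the roles of the two parties flips the sign, which is why the second row of the table is the mirror image of the first.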