Stata-related posts

Stata is my favourite general-purpose stats package. Sadly, it is also one of my favourite pastimes, but there you are. Here is my collection of Stata-related blog posts. If this is relevant for you, you might also be interested in a series of slides for a Stata course I taught some years ago (in German).

Mar 28 2011
 

Seems that I am not the only one who is startled by Stata 11's margins command, which does all sorts of amazing things. At a mere 50 pages (not counting the remarks on margins postestimation), the documentation is a little overwhelming, and there are just too many options. There are two separate issues that seem to confuse a lot of people (see this discussion on Statalist on the then-new margins command).

Marginal Effects at the Mean vs Average Marginal Effects

The first is that in the past, when studying the implications of nonlinear (e.g. logit) models, many people including me used to analyse "marginal effects at the mean". In short, this boils down to holding most independent variables constant at their grand means/modes while plugging a range of hopefully relevant values for one or two focal variables into the equation. This approach is easier to understand than to explain, but it can result in highly unrealistic scenarios if your independent variables are highly correlated (think of holding age constant while varying pensioner/non-pensioner status).

Therefore, looking at average marginal effects might make more sense. These are calculated by varying the focal variable while leaving everything else at its observed values for each case, and then averaging over all cases. This is what the margins command does by default. Michael Norman Mitchell has a post that clearly illustrates the differences between the two approaches to the estimation of margins. Moreover, there is an older article by Tamás Bartus on his margeff command that is also quite instructive.
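The difference between the two approaches is easy to see with a few lines of code. The following is only a sketch in Python, not Stata, and not what margins does internally: the logit coefficients and the tiny data set are made up for illustration, and the marginal effect is taken as the derivative of the predicted probability with respect to a continuous focal variable.

```python
import math


def logistic(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def marginal_effect(b_focal, z):
    """Derivative of P(y=1) w.r.t. the focal variable at linear predictor z."""
    p = logistic(z)
    return b_focal * p * (1.0 - p)


# Hypothetical logit coefficients: intercept, focal variable x, covariate w
b0, bx, bw = -1.0, 0.8, 1.5

# Made-up data set of (x, w) pairs
data = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (3.0, 1.0), (4.0, 1.0)]

# Marginal effect at the mean: evaluate once, at the means of x and w
mean_x = sum(x for x, _ in data) / len(data)
mean_w = sum(w for _, w in data) / len(data)
mem = marginal_effect(bx, b0 + bx * mean_x + bw * mean_w)

# Average marginal effect: evaluate for every case as observed, then average
ame = sum(marginal_effect(bx, b0 + bx * x + bw * w) for x, w in data) / len(data)

print(f"Marginal effect at the mean: {mem:.4f}")
print(f"Average marginal effect:     {ame:.4f}")
```

Note that the "mean" case has w = 0.6, a value that no actual observation takes, which is exactly the kind of unrealistic scenario described above.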

Dubious Confidence Intervals

But one problem remains: margins uses a normal approximation for calculating confidence intervals. As a result, after estimating a model for a categorical dependent variable, you might end up with a CI for a predicted probability that includes zero (or even negative values), which obviously does not make much sense. Roger Newson seems to know how to get around this issue, but I haven't tested this approach yet.
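To see how this can happen, here is a small sketch in Python. The estimate and its standard error are invented, and the alternative shown (building the interval on the logit scale and transforming the end points) is one common workaround, not necessarily the approach Roger Newson suggests.

```python
import math


def logistic(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical linear prediction (logit scale) and its standard error
xb, se_xb = -4.0, 1.2
z_crit = 1.96  # approximate 97.5th percentile of the standard normal

p = logistic(xb)

# Normal approximation on the probability scale, with a delta-method
# standard error: se(p) = p * (1 - p) * se(xb)
se_p = p * (1.0 - p) * se_xb
naive_ci = (p - z_crit * se_p, p + z_crit * se_p)

# Alternative: build the interval on the logit scale, then transform
# both end points, which keeps them strictly between 0 and 1
transformed_ci = (logistic(xb - z_crit * se_xb), logistic(xb + z_crit * se_xb))

print("Predicted probability:", round(p, 4))
print("Normal approximation: ", tuple(round(v, 4) for v in naive_ci))
print("Transformed interval: ", tuple(round(v, 4) for v in transformed_ci))
```

With a small predicted probability and a large standard error, the lower bound of the first interval drops below zero, while the transformed interval cannot leave the unit interval.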

Sep 07 2010
 
National Grid for Great Britain (image via Wikipedia)

Statistics and Data links roundup for January through September 2010

  • Making Working and Publication Tables in Stata – Review of various commands for making tables in Stata
  • OS OpenData Supply – Download or order Ordnance Survey OpenData – open data:
    * Outline of Great Britain
    * Overview of Great Britain
    * MiniScale ®
    * 1:250 000 Scale Colour Raster
    * OS Street View ®
    * Boundary-Line ™
    * OS VectorMap ™ District – New
  • European Network for the Analysis of Political Text – The European Network for the Analysis of Political Texts (ENAPT) is a newly established network of PhD students and early career researchers who share an interest in the qualitative and quantitative analysis of party manifestos and other political text.

    The objectives of ENAPT include:

    * To collect and analyse political text in a systematic fashion using a combination of qualitative and quantitative methodologies,
    * To organise meetings and workshops (primarily aimed at PhD students and early career researchers) regarding the coding of political text,
    * To facilitate the dissemination of the findings of its members through conferences, workshops and the world-wide-web.

    ENAPT operates a mailing list at the National Academic Mailing List Service. To join the network and the mailing list, email Kostas Gemenis.
May 19 2010
 

I’m teaching a lecture course on Political Sociology at the moment, and because everyone is so excited about social capital and social network analysis these days, I decided to run a little online experiment with and on my students. The audience is large (at the beginning of this term, about 220 students had registered for this lecture series) and quite diverse, with some students still in their first year, others in their second, third or fourth and even a bunch of veterans who have spent most of their adult lives in university education.

Which of my students are most likely to gang up against me?

Who knows whom in a large group of learners?

Fortunately, I had a list of full names plus email addresses for everyone who had signalled interest in the lecture before the beginning of term, so I created a short questionnaire in limesurvey and asked them a very simple question: whom do you know in this group? Given the significant overcoverage of my list (in reality, there are probably not more than 120 students who regularly turn up for the lecture), the response rate was somewhere in the high 70s. If you want to collect network data with limesurvey, the "array with flexible labels" question type is your friend, but keying in 220 names plus unique ids would have been a major pain. Thankfully, one can program the question with a single placeholder name, then export it as a CSV file. Next, simply load the file into Emacs and insert the complete list, then re-import it into limesurvey.

Getting a data matrix from Stata into Pajek is not necessarily a fun exercise, so I decided to give the networkx module for Python a go, which is simply superb. Networkx has data types for representing social networks, so you can read in a rectangular data matrix (again as CSV), construct the network in Python and export the whole lot to Pajek with a few lines of code:


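import networkx as nx

# Some boring stuff omitted. It has to define netreader and knoten
# (presumably a csv.reader over the export and the list of column labels).

# Create the directed network
Lecture = nx.DiGraph()
# Initialise one node per student with a placeholder degree scheme
for i in range(1, 221):
    Lecture.add_node(i, stdg="0")
for line in netreader:
    # Sender ID sits at the very end of each row
    sender = int(line[-1])
    edges = line[6:216]
    # Degree scheme
    Lecture.nodes[sender]['stdg'] = line[-8]
    # Edges: answers '2' and '3' become ties with the corresponding weight
    for index, answer in enumerate(edges):
        if answer in ('2', '3'):
            # The receiver's ID is the numeric part of the column label
            receiver = int(''.join(filter(str.isdigit, repr(knoten[index]))))
            Lecture.add_edge(sender, receiver, weight=int(answer))
nx.write_pajek(Lecture, 'file.net')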

As it turns out, a lecture hall rebellion does not seem very likely. About one third of all relationships are not reciprocated, and about a quarter of my students do not know a single other person in the room (at least not by name), so levels of social capital are pretty low. There is, however, a small group of 10 mostly older students who form a tightly-knit core, and who know many of the suckers in the periphery. I need to keep an eye on these guys.
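Figures like these are straightforward to compute from a list of directed ties. The following is a minimal sketch in plain Python with a handful of invented ties, not the actual class data:

```python
# Hypothetical directed ties: (sender, receiver) means
# "sender says that they know receiver"
ties = {(1, 2), (2, 1), (1, 3), (3, 4), (4, 3), (5, 1)}
students = range(1, 8)  # students 6 and 7 take part in no tie at all

# A tie is reciprocated if the reverse tie is present as well
unreciprocated = {(a, b) for (a, b) in ties if (b, a) not in ties}
share_unreciprocated = len(unreciprocated) / len(ties)

# Students who do not name a single other person
senders = {a for a, _ in ties}
knows_nobody = [s for s in students if s not in senders]

print(f"Share of unreciprocated ties: {share_unreciprocated:.2f}")
print("Students who know nobody:", knows_nobody)
```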


260 reciprocated ties within the same group

Finally, the second graph also shows that those relatively few students who are enrolled in our new BA programs (red, dark blue) are pretty much isolated within the larger group, which is still dominated by students enrolled in the old five-year programs (MA yellow, State Examination green) that are being phased out. Divide et impera.

Jan 10 2010
 

I’m teaching an introductory SNA class this year. Following a time-honoured tradition, I conducted a small network survey at the beginning of the class using Limesurvey. Getting the data from Limesurvey to Stata via CSV was easy enough. Here is the data set. But how does one get the data from Stata to Pajek for analysis? Actually, it’s quite easy.

First, we need to change the layout of the data. In the data set, there is one record for each of the 13 respondents. Each record has 13 variables, one for each (potential) arc connecting the respondent to other students in the class. This is equivalent to Stata's "wide" form. Stata's reshape command will happily re-arrange the data to the "long" form, with one record for each arc. This is what Pajek requires.
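For readers who do not use Stata, the idea behind the wide-to-long step can be sketched in a few lines of Python. This is an illustration with three invented respondents, not the class data, and unlike reshape it drops the cells without a tie straight away:

```python
# Hypothetical wide-form data: one row per respondent, one entry per
# potential arc. 1 means "respondent named this classmate", 0 means no tie.
wide = {
    1: [0, 1, 1],
    2: [1, 0, 0],
    3: [0, 1, 0],
}

# Long form: one (sender, receiver) record per arc that is actually present
long_form = [
    (sender, receiver)
    for sender, row in sorted(wide.items())
    for receiver, tie in enumerate(row, start=1)
    if tie and sender != receiver
]

for sender, receiver in long_form:
    print(sender, receiver)
```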

Second, we need to save the data as an ASCII file that can be read into Pajek. This is most easily done using Roger Newson’s listtex, which can be tweaked to write the main chunks of a Pajek file. Here is the code, which should be readily adapted to your own problems.
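To give an idea of what the target file looks like, here is a minimal sketch in Python rather than Stata. It does not use listtex; the function name pajek_lines and the example names are made up, and only the most basic parts of the Pajek format (a *Vertices block followed by an *Arcs block) are written:

```python
def pajek_lines(labels, arcs):
    """Return the lines of a minimal Pajek .net file."""
    lines = [f"*Vertices {len(labels)}"]
    # Vertices are numbered from 1 and carry a quoted label
    for number, label in enumerate(labels, start=1):
        lines.append(f'{number} "{label}"')
    lines.append("*Arcs")
    # Each arc: sender, receiver, value (here always 1)
    for sender, receiver in arcs:
        lines.append(f"{sender} {receiver} 1")
    return lines


labels = ["Anna", "Ben", "Carla"]  # invented names
arcs = [(1, 2), (1, 3), (2, 1), (3, 2)]

print("\n".join(pajek_lines(labels, arcs)))
```

Saving the printed lines in a plain text file with the extension .net should give something that Pajek can read as a network.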

If you are interested, you can get the whole package from within Stata: net from http://www.kai-arzheimer.com/stata/

 How to get from Stata to Pajek
Jul 08 2008
 

Our project on social (citation and collaboration) networks in British and German political science involves networks with hundreds and thousands of nodes (scientists and articles). At the moment, our data come from the Social Science Citation Index (part of the ISI web of knowledge), and we use a bundle of rather eclectic (erratic?) scripts written in Perl to convert the ISI records into something that programs like Pajek or Stata can read. Some canned solutions (Wos2pajek, network workbench, bibexcel) are available for free, but I was not aware of them when I started this project, did not manage to install them properly, or was not happy with the results. Perl is the Swiss Army Chainsaw (TM) for data pre-processing, incredibly powerful (my scripts are typically less than 50 lines, and I am not an efficient programmer), and every time I want to do something in a slightly different way (i.e. I spot a bug), all I have to do is to change a few lines in the scripts.
After trying a lot of other programs available on the internet, we have chosen Pajek for doing the analyses and producing those intriguing graphs of cliques and inner circles in Political Science. Pajek is closed source but free for non-commercial use and runs on Windows or (via wine) Linux. It is very fast, can (unlike many other programs) easily handle very large networks, produces decent graphs and does many standard analyses. Its user interface may be slightly less than straightforward but I got used to it rather quickly, and it even has basic scripting capacities.

 Software for Social Network Analysis: Pajek and Friends

The Missing Manual

The only thing that is missing is a proper manual, but even this is not really a problem since Pajek's creators have written a very accessible introduction to social network analysis that doubles up as documentation for the program (order from amazon.co.uk, amazon.com, amazon.de). However, Pajek has been under constant development since the 1990s (!) and has acquired a lot of new features since the book was published. Some of them are documented in an appendix, others are simply listed in the very short document that is the official manual for Pajek. You will want to go through the many presentations which are available via the Pajek wiki.

Of course, there is much more software available, often at no cost. If you program in Java or Python (I don't), there are several libraries available that look very promising. Amongst the stand-alone programs, visone stands out because it can easily produce very attractive-looking graphs of small networks. Even more software has been developed in the context of other sciences that have an interest in networks (chemistry, biology, engineering etc.).
Here is a rather messy collection of links to SNA software. Generally, you will want something that is more systematic and informative. Ines Mergel has recently launched a bid for creating a comprehensive software list on Wikipedia. The resulting page on social network analysis software is obviously work in progress but provides very valuable guidance.
