Sampling from a Multinomial Distribution in Stata

Sometimes, a man’s gotta do what a man’s gotta do. Which, in my case, might be a little simulation of a random process involving an unordered categorical variable. In R, sampling from a multinomial distribution is trivial.

rmultinom(1,1000,c(.1,.7,.2,.1))
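For readers without R at hand, the same draw can be sketched in Python using only the standard library. This is a rough illustration rather than a port of R's implementation: the function name `multinomial_draw` and the seed are made up for the example, and the weights are rescaled to sum to one, since the probabilities above add up to 1.1 and `rmultinom` normalizes them internally.

```python
import random
from collections import Counter


def multinomial_draw(size, prob, seed=None):
    """Return one vector of category counts for `size` independent trials.

    `prob` holds non-negative weights; they are rescaled to sum to 1.
    (Hypothetical helper written for this illustration.)
    """
    if size < 0 or not prob or any(p < 0 for p in prob) or sum(prob) <= 0:
        raise ValueError("need size >= 0 and non-negative weights with a positive sum")
    rng = random.Random(seed)
    total = sum(prob)
    weights = [p / total for p in prob]
    # Draw one category index per trial, then tally how often each index came up.
    draws = rng.choices(range(len(prob)), weights=weights, k=size)
    tally = Counter(draws)
    return [tally[k] for k in range(len(prob))]


# One sample of 1000 trials over four unordered categories.
counts = multinomial_draw(1000, [.1, .7, .2, .1], seed=42)
print(counts)
```

Drawing each trial separately and tallying the results is slow for very large samples, but it makes the logic explicit: a multinomial count vector is simply a frequency table of categorical draws.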


Statistics and Data links roundup for November 23rd through December 29th


  • The Data and Story Library – DASL (pronounced “dazzle”) is an online library of datafiles and stories that illustrate the use of basic statistics methods. We hope to provide data from a wide variety of topics so that statistics teachers can find real-world examples that will be interesting to their students. Use DASL’s powerful search engine to locate the story or datafile of interest.
  • Drawing graphs using tikz/pgf & gnuplot | politicaldata.org

Statistics and Data links roundup for November 14th through November 23rd


It’s surprisingly difficult to find suitable datasets for an SNA (social network analysis) workshop that are relevant for political scientists.

Statistics and Data links roundup



Web-scraping made easy: outwit


These days, a bonanza of political information is freely available on the internet. Sometimes this information comes in the guise of Excel sheets, comma-separated data or other formats which are more or less readily machine-readable. But more often than not, information is presented as tables designed to be read by humans. This is where the gentle art of screen scraping, web scraping or spidering comes in. In the past, I have used kludgy Perl scripts to get electoral results at the district level off sites maintained by the French ministry of the interior or by universities (very interesting if you do not really speak/read French). A slightly more elegant approach might be to use R’s built-in Perl-like capabilities for doing the job, as demonstrated by Simon Jackman. Finally, Python, which has some very decent libraries for screen/web scraping, is gaining ground in the political science community; see this elaborate post on Drew Conway’s Zero Intelligence Agents blog. But, let’s face it: I am lazy. I want to spend time analysing the data, not scraping them. And so I was very pleased when I came across outwit, a massive plugin for the Firefox browser (Linux, Mac and Windows versions available) that acts as a point-and-click scraper.
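For comparison, the scripted route amounts to walking a page's markup and collecting the table cells. The following is a minimal sketch in Python using only the standard library's `html.parser`. The class name `TableScraper` and the sample table (districts, parties, vote counts) are invented for illustration and do not come from any real site.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical results table of the kind "designed to be read by humans".
SAMPLE_HTML = """
<table>
  <tr><th>District</th><th>Party</th><th>Votes</th></tr>
  <tr><td>District 1</td><td>Party A</td><td>12345</td></tr>
  <tr><td>District 2</td><td>Party B</td><td>9876</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
header, *records = scraper.rows
for record in records:
    print(dict(zip(header, record)))
```

Real pages are rarely this tidy (nested tables, merged cells, broken markup), which is a large part of why a point-and-click tool is attractive.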


Software for Social Network Analysis: Pajek and Friends

Our project on social (citation and collaboration) networks in British and German political science involves networks with hundreds and thousands of nodes (scientists and articles). At the moment, our data come from the Social Sciences Citation Index (part of the ISI Web of Knowledge), and we use a bundle of rather eclectic (erratic?) scripts written in Perl to convert the ISI records into something that programs like Pajek or Stata can read. Some canned solutions (WoS2Pajek, Network Workbench, BibExcel) are available for free, but I was not aware of them when I started this project, did not manage to install them properly, or was not happy with the results. Perl is the Swiss Army Chainsaw (TM) for data pre-processing: it is incredibly powerful (my scripts are typically less than 50 lines, and I am not an efficient programmer), and every time I want to do something in a slightly different way (i.e. I spot a bug), all I have to do is change a few lines in the scripts.

After trying a lot of other programs available on the internet, we have chosen Pajek for doing the analyses and producing those intriguing graphs of cliques and inner circles in Political Science. Pajek is closed source but free for non-commercial use and runs on Windows or (via Wine) Linux. It is very fast, can (unlike many other programs) easily handle very large networks, produces decent graphs and does many standard analyses. Its user interface may be slightly less than straightforward, but I got used to it rather quickly, and it even has basic scripting capabilities.
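To give an idea of what the conversion step has to produce, here is a minimal Python sketch that writes a collaboration edge list in the plain `*Vertices` / `*Edges` layout of Pajek's .net files. It is not the Perl code used in the project and it does not parse ISI records. The function name `to_pajek` and the author names are hypothetical.

```python
def to_pajek(edges):
    """Render an undirected edge list as the text of a Pajek .net file.

    `edges` is a list of (label, label) pairs; vertices are numbered
    from 1 in alphabetical order of their labels.
    """
    labels = sorted({name for edge in edges for name in edge})
    index = {name: i for i, name in enumerate(labels, start=1)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{index[name]} "{name}"' for name in labels]
    lines.append("*Edges")
    lines += [f"{index[a]} {index[b]}" for a, b in edges]
    return "\n".join(lines) + "\n"


# Hypothetical co-authorship ties; a real pipeline would extract them from ISI records.
coauthors = [("Author A", "Author B"), ("Author B", "Author C")]
net_text = to_pajek(coauthors)
print(net_text)
```

A citation network would use directed ties instead, which Pajek expects under an `*Arcs` section rather than `*Edges`.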

