<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" ><channel><title>Kai Arzheimer &#187; python</title> <atom:link href="http://www.kai-arzheimer.com/blog/tag/python/feed/" rel="self" type="application/rss+xml" /><link>http://www.kai-arzheimer.com/blog</link> <description>A political science blog</description> <lastBuildDate>Sat, 21 Jan 2012 19:06:37 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.1</generator> <item><title>Which of my students are most likely to gang up against me?</title><link>http://www.kai-arzheimer.com/blog/which-of-my-students-are-most-likely-to-gang-up-against-me/</link> <comments>http://www.kai-arzheimer.com/blog/which-of-my-students-are-most-likely-to-gang-up-against-me/#comments</comments> <pubDate>Wed, 19 May 2010 20:30:53 +0000</pubDate> <dc:creator>kai</dc:creator> <category><![CDATA[Data and Methods]]></category> <category><![CDATA[My Stuff]]></category> <category><![CDATA[Political Science]]></category> <category><![CDATA[limesurvey]]></category> <category><![CDATA[networkx]]></category> <category><![CDATA[pajek]]></category> <category><![CDATA[political sociology]]></category> <category><![CDATA[python]]></category> <category><![CDATA[sna]]></category> <category><![CDATA[social capital]]></category> <category><![CDATA[social network analysis]]></category> <category><![CDATA[social networks]]></category> <category><![CDATA[stata]]></category> <category><![CDATA[survey data]]></category><guid isPermaLink="false">http://www.kai-arzheimer.com/blog/?p=405</guid> <description><![CDATA[I&#8217;m teaching a lecture 
course on Political Sociology at the moment, and because everyone is so excited about social capital and social network analysis these days, I decided to run a little online experiment with and on my students. The audience is large (at the beginning of this term, about 220 students had registered for [...]]]></description> <content:encoded><![CDATA[<p>I&#8217;m teaching a lecture course on Political Sociology at the moment, and because everyone is so excited about social capital and social network analysis these days, I decided to run a little online experiment with and on my students. The audience is large (at the beginning of this term, about 220 students had registered for this lecture series) and quite diverse, with some students still in their first year, others in their second, third or fourth and even a bunch of veterans who have spent most of their adult lives in university education.</p><div id="attachment_406" class="wp-caption alignright" style="width: 160px"><a href="http://www.kai-arzheimer.com/blog/wp-content/uploads/2010/05/glorreiche-10.jpg"><img class="size-thumbnail wp-image-406  " title="All ties in a group of students, well-connected students in yellow" src="http://www.kai-arzheimer.com/blog/wp-content/uploads/2010/05/glorreiche-10-150x150.jpg" alt="glorreiche 10 150x150 Which of my students are most likely to gang up against me?" width="150" height="150" /></a><p class="wp-caption-text">Who knows whom in a large group of learners?</p></div><p><span id="more-405"></span></p><p>Fortunately, I had a list of full names plus email addresses for everyone who had signalled interest in the lecture before the beginning of term, so I created a short questionnaire in <a href="http://www.limesurvey.org/" target="_blank">limesurvey</a> and asked them a very simple question: whom do you know in this group? 
Given the significant overcoverage of my list &#8211; in reality, there are probably no more than 120 students who regularly turn up for the lecture &#8211; the response rate was somewhere in the high 70s. If you want to collect network data with limesurvey, the &#8220;array with flexible labels&#8221; question type is your friend, but keying in 220 names plus unique IDs would have been a major pain. Thankfully, one can program the question with a single placeholder name, then export it as a <a class="zem_slink" title="Comma-separated values" rel="wikipedia" href="http://en.wikipedia.org/wiki/Comma-separated_values">CSV</a> file. Next, simply load the file into Emacs and insert the complete list, then re-import it into limesurvey.</p><p><a href="http://www.kai-arzheimer.com/blog/2010/01/10/how-to-get-from-stata-to-pajek/" target="_blank">Getting a data matrix from Stata into Pajek is not necessarily a fun exercise,</a> so I decided to give the <a href="http://networkx.lanl.gov/" target="_blank">networkx</a> module for Python a go. It is simply superb. 
Networkx has data types for representing social networks, so you can read in a rectangular data matrix (again as CSV), construct the network in Python and export the whole lot to Pajek with a few lines of code:</p><p><code><br /> #Some boring stuff omitted<br /> #create network<br /> Lecture=nx.DiGraph()<br /> #Initialise<br /> for i in range(1,221):<br /> &#160;&#160;&#160;&#160;Lecture.add_node(i, stdg="0")<br /> for line in netreader:<br /> &#160;&#160;&#160;&#160;#Sender-ID at the very end<br /> &#160;&#160;&#160;&#160;sender = int(line[-1])<br /> &#160;&#160;&#160;&#160;edges=line[6:216]<br /> &#160;&#160;&#160;&#160;#Degree-scheme (networkx 2.x: Lecture.nodes; 1.x used Lecture.node)<br /> &#160;&#160;&#160;&#160;Lecture.nodes[sender]['stdg']=line[-8]<br /> &#160;&#160;&#160;&#160;#Edges<br /> &#160;&#160;&#160;&#160;for index in range(len(edges)):<br /> &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;#Answers coded 2 or 3 create a tie; the code doubles as the weight<br /> &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;if edges[index] in ('2','3'):<br /> &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;#Python 3: filter returns an iterator, so join the digits first<br /> &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;target=int("".join(filter(str.isdigit,repr(knoten[index]))))<br /> &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;Lecture.add_edge(sender,target,weight=int(edges[index]))<br /> nx.write_pajek(Lecture,'file.net')<br /> </code></p><p>As it turns out, a lecture hall rebellion does not seem very likely. About one third of all relationships are not reciprocated, and about a quarter of my students do not know a single other person in the room (at least not by name), so levels of social capital are pretty low. There is, however, a small group of 10 mostly older students who form a tightly-knit core, and who know many of the suckers in the periphery. I need to keep an eye on these guys.</p><div id="attachment_407" class="wp-caption alignright" style="width: 160px"><a href="http://www.kai-arzheimer.com/blog/wp-content/uploads/2010/05/nur-reziprok.jpg"><img class="size-thumbnail wp-image-407 " title="Reciprocated ties in a group of students, information on degree scheme overlaid" src="http://www.kai-arzheimer.com/blog/wp-content/uploads/2010/05/nur-reziprok-150x150.jpg" alt="nur reziprok 150x150 Which of my students are most likely to gang up against me?" 
width="150" height="150" /></a><p class="wp-caption-text">260 reciprocated ties within the same group</p></div><p>Finally, the second graph also shows that those relatively few students who are enrolled in our new BA programs (red, dark blue) are pretty much isolated within the larger group, which is still dominated by students enrolled in the old five-year programs (MA yellow, State Examination green) that are being phased out. Divide et impera.</p>]]></content:encoded> <wfw:commentRss>http://www.kai-arzheimer.com/blog/which-of-my-students-are-most-likely-to-gang-up-against-me/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Web-scraping made easy: outwit</title><link>http://www.kai-arzheimer.com/blog/screenscraping-made-easy-outwit/</link> <comments>http://www.kai-arzheimer.com/blog/screenscraping-made-easy-outwit/#comments</comments> 
<pubDate>Sat, 01 Aug 2009 21:46:53 +0000</pubDate> <dc:creator>kai</dc:creator> <category><![CDATA[Data and Methods]]></category> <category><![CDATA[Political Science]]></category> <category><![CDATA[departements]]></category> <category><![CDATA[france]]></category> <category><![CDATA[outwit]]></category> <category><![CDATA[perl]]></category> <category><![CDATA[python]]></category> <category><![CDATA[R]]></category> <category><![CDATA[scraping]]></category> <category><![CDATA[screen]]></category> <category><![CDATA[web scraper]]></category><guid isPermaLink="false">http://www.kai-arzheimer.com/blog/?p=287</guid> <description><![CDATA[These days, a bonanza of political information is freely available on the internet.  Sometimes this information comes in the guise of excel sheets, comma separated data or other formats which are more or less readily machine readable. But more often than not, information is presented as tables designed to be read by humans. This is [...]]]></description> <content:encoded><![CDATA[<div class="zemanta-img" style="margin: 1em; display: block;"><div class="wp-caption alignright" style="width: 205px"><a href="http://commons.wikipedia.org/wiki/Image:EAN-13-ISBN-13.svg"><img title="EAN-13 bar code of ISBN-13 in compliance with ..." src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/28/EAN-13-ISBN-13.svg/195px-EAN-13-ISBN-13.svg.png" alt="195px EAN 13 ISBN 13.svg Web scraping made easy: outwit" width="195" height="124" /></a><p class="wp-caption-text">Image via Wikipedia</p></div></div><p>These days, a bonanza of political information is freely available on the internet.  Sometimes this information comes in the guise of excel sheets, comma separated data or other formats which are more or less readily machine readable. But more often than not, information is presented as tables designed to be read by humans. 
This is where the gentle art of screen scraping, <a class="zem_slink" title="Web scraping" rel="wikipedia" href="http://en.wikipedia.org/wiki/Web_scraping">web scraping</a> or spidering comes in. In the past, I have used kludgy <a class="zem_slink" title="Perl" rel="homepage" href="http://www.perl.org/">Perl</a> scripts to get electoral results at the district level off sites maintained by the French ministry of the interior or by universities (very interesting if you do not really speak/read French). A slightly more elegant approach might be to <a href="http://polmeth.wustl.edu/tpm/tpm_v14_n2.pdf" target="_blank">use R&#8217;s built-in Perl-like capabilities for doing the job, as demonstrated by Simon Jackman</a>. Finally, <a href="http://kops.ub.uni-konstanz.de/volltexte/2009/7652/pdf/doering_2008.pdf" target="_blank">Python is gaining ground in the political science community</a>, and it has some very decent libraries for screen/web scraping &#8211; see this <a href="http://www.drewconway.com/zia/?p=585" target="_blank">elaborate post on Drew Conway&#8217;s Zero Intelligence Agents blog</a>. But, let&#8217;s face it: I am lazy. I want to spend time analysing the data, not scraping them. 
And so I was very pleased when I came across outwit, a massive plugin for the Firefox browser (Linux, Mac and Windows versions available) that acts as a <a href="http://www.outwit.com/" target="_blank">point-and-click scraper</a>.</p><p><span id="more-287"></span></p><div id="attachment_288" class="wp-caption alignright" style="width: 310px"><a href="http://www.kai-arzheimer.com/blog/wp-content/uploads/2009/08/outwit-1.png"><img class="size-medium wp-image-288" title="outwit-1" src="http://www.kai-arzheimer.com/blog/wp-content/uploads/2009/08/outwit-1-300x178.png" alt="outwit 1 300x178 Web scraping made easy: outwit" width="300" height="178" /></a><p class="wp-caption-text">French Départements (from Wikipedia)</p></div><p>Say you need a dataset with the names and Insee numbers for all the French Départements. The (hopefully trustworthy) <a href="http://en.wikipedia.org/wiki/Departments_of_France" target="_blank">Wikipedia page</a> has a neat table, complete with information on the Prefecture and many tiny coats of arms which are of absolutely no use at all. We could either key in the relevant data (doable, but a nuisance), or we could try to copy and paste the table into a word processor, hoping that we do not lose accents and other funny characters, and that WinWord or whatever we use converts the <a class="zem_slink" title="HTML element" rel="wikipedia" href="http://en.wikipedia.org/wiki/HTML_element">HTML table</a> into something that we can edit to extract the information we really need.</p><p>Or we could use outwit. 
One push of the button loads the page into a sub-window, a second push (data-&gt;tables) extracts the HTML tables on the page. Now, we can either mark the lines we are interested in by hand (often the quickest option) or use a filter to select them. One final click, and they are exported as a <a href="http://www.kai-arzheimer.com/blog/wp-content/uploads/2009/08/Departments_Of_France.csv">CSV</a> file that can be read into R, OpenOffice, or Stata for post-processing and analysis.</p><div id="attachment_291" class="wp-caption alignright" style="width: 310px"><a href="http://www.kai-arzheimer.com/blog/wp-content/uploads/2009/08/outwit-22.png"><img class="size-medium wp-image-291" title="outwit-2" src="http://www.kai-arzheimer.com/blog/wp-content/uploads/2009/08/outwit-22-300x178.png" alt="outwit 22 300x178 Web scraping made easy: outwit" width="300" height="178" /></a><p class="wp-caption-text">Scraping a table with outwit</p></div><p>While I&#8217;m all in favour of scriptable and open-source tools like Perl, Python and R, outwit has a lot going for it if all you need is a quick hack. Outwit also has functions to mass-download files (say PDFs) from a page and give them unique names. If the job is complex, there is even more functionality under the hood, and you can use the point-and-click interface to program your own scraper, though I would tend to use a real programming language for these cases. 
At any rate, outwit is a useful and free tool for the lazy data analyst.</p>]]></content:encoded> <wfw:commentRss>http://www.kai-arzheimer.com/blog/screenscraping-made-easy-outwit/feed/</wfw:commentRss> <slash:comments>1</slash:comments> </item> </channel> </rss>
