Topic modelling does not work well for (my) research paper abstracts
The Radical Right Research Robot is a fun side project whose life began exactly one year ago. The Robot exists to promote the very large body of knowledge on Radical Right parties and their voters that social scientists have accumulated over decades. At its core is a loop that, every few hours, randomly selects one of the more than 800 titles in my online bibliography on the Extreme/Radical Right and spits it out on Twitter.
Yet the little android’s aim was always for some sort of serendipity, and so it tries to extract meaning from the abstracts (where available), sometimes with rather funny consequences. The robot’s first idea was to make use of (structural) topic modelling. There are some implementations available in R, and the first results looked promising, but in the end, topic modelling did not find meaningful clusters of papers that could easily be labelled with a common theme. One possible reason is that the abstracts are short, and that there are relatively few of them (fewer than 400). And so the Robot reverted to using a small and fairly arbitrary set of keywords for identifying topics.
This approach produced some embarrassing howlers like this one:
the media and the #RadicalRight D. Halikiopoulou and T. Vlandas. “Risks, Costs and Labour Markets: Explaining Cross-national Patterns of Far Right Party Success in European Parliament Elections”. In: Jcms-Journal of Common Market Studies 54.3 (2016), pp. 636-655. https:// pic.twitter.com/K4YHAoZvNL
— Radical Right Research Robot (@RRResRobot) December 23, 2018
Or this one (clearly the robot has a thing for media studies – who doesn’t?):
the media: K. Loxbo. “The Impact of the Radical Right: Lessons from the Local Level in Sweden, 2002-2006”. In: Scandinavian Political Studies 33.3 (2010), pp. 295-315. https://t.co/p8JmK4ovXF. <URL: https://t.co/GEDZ3p4MnI>. pic.twitter.com/QhSF914y4x
— Radical Right Research Robot (@RRResRobot) December 18, 2018
There are two problems here: first, even a single instance of a keyword in a given abstract is enough to trigger a classification, and second, the bot’s pedestrian implementation would classify an abstract using the last keyword that it detected, even if it was the most peripheral of several hits. Not good enough for world domination, obviously.
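The bot's actual code is not shown here, but the bug it describes can be reconstructed in a few lines. In this toy sketch (keyword list and function name are made up for illustration), every keyword hit overwrites the label, so an abstract ends up tagged with whichever keyword happens to be matched last, however peripheral it is:

```python
# Toy reconstruction of the naive keyword classifier (hypothetical
# keyword-to-topic map, not the bot's real one). A single occurrence
# triggers a classification, and each later hit overwrites the label.

KEYWORDS = {
    "election": "elections",
    "vote": "voting behaviour",
    "media": "the media",
}

def classify_naive(abstract):
    label = None
    for keyword, topic in KEYWORDS.items():
        if keyword in abstract.lower():
            label = topic  # last hit wins -- this is the bug
    return label

# An abstract that is mostly about elections but mentions the media
# in passing still gets filed under "the media":
abstract = ("Explaining cross-national patterns of far right party "
            "success in European Parliament elections, with a short "
            "aside on media coverage.")
print(classify_naive(abstract))  # -> the media
```

A single-pass scoring of all matches (rather than keeping only the last one) would already avoid the worst of these howlers.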
Newsmap works reasonably well for classifying topics in research paper abstracts
Looking for an alternative solution, the robot came across newsmap (now also available within quanteda), a geographical news classifier developed by Kohei Watanabe. Newsmap is semi-supervised: it starts with a dictionary of proper nouns and adjectives that all refer to geographical entities, say
```
'France':  [Paris, France, French*]
'Germany': [German*, Berlin]
...
```
But newsmap is able to pick up additional words that also help to identify the respective country with high probability, e.g. “Macron”, “Merkel”, “Marseille”, “Hamburg”, or even “Lederhosen”. In a (limited) sense, it learns to identify geographical context even when the country in question is not mentioned explicitly.
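Newsmap's actual estimator is more involved than this, but the general semi-supervised idea can be sketched with a toy frequency-based version (all names and example texts below are made up): label training texts by their seed-word matches, then treat words that occur almost exclusively under one label as new indicators for it.

```python
from collections import Counter, defaultdict

# Toy sketch of the semi-supervised idea (NOT newsmap's actual model):
# 1. label training texts by seed-word matches,
# 2. count how often every non-seed word occurs under each label,
# 3. words seen repeatedly under exactly one label become new
#    indicators for it.

SEEDS = {"France": {"paris", "france", "french"},
         "Germany": {"german", "germany", "berlin"}}

docs = [
    "Macron spoke in Paris about French pension reform",
    "Merkel addressed the Bundestag in Berlin",
    "German unions reacted to Merkel",
    "Macron visited Marseille in southern France",
]

counts = defaultdict(Counter)
for doc in docs:
    words = doc.lower().split()
    for label, seeds in SEEDS.items():
        if seeds & set(words):                  # a seed match labels the doc
            counts[label].update(w for w in words if w not in seeds)

def learned_words(label, min_count=2):
    """Words seen at least min_count times under this label only."""
    others = set().union(*(counts[l] for l in counts if l != label))
    return {w for w, c in counts[label].items()
            if c >= min_count and w not in others}

print(learned_words("France"))   # picks up "macron"
print(learned_words("Germany"))  # picks up "merkel"
```

In this miniature corpus, "Macron" and "Merkel" never appear in the seed dictionary, yet they end up associated with the right country, which is exactly the kind of vocabulary expansion the post describes.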
But the algorithm is not restricted to geographical entities: it can also identify topics from a list. And so these days, the robot starts with a dictionary of seed words that is still a work in progress, but currently looks mostly like this:
```
'religion & culture': [muslim*, islam*, relig*, cultur*]
'media':              [TV, newspaper*, journalis*]
'group conflict':     [group*, contact, prejudice, stereotyp*, competition]
...
```
Results are not perfect, but at least they are less embarrassing than those from the simple keyword approach. One remaining problem is that newsmap tags each abstract with (at most) one topic. In reality, any given article will refer to two or more themes in the literature. Topic models are much more attractive in this respect, because they treat each text as a mixture of topics, and so the robot may have to revisit them in the future.