Jan 10, 2019

Topic modelling does not work well for (my) research paper abstracts

The Radical Right Research Robot is a fun side project whose life began exactly one year ago. The Robot exists to promote the very large body of knowledge on Radical Right parties and their voters that social scientists have accumulated over decades. At its core is a loop that randomly selects one of the more than 800 titles on my online bibliography on the Extreme/Radical Right every few hours and spits it out on Twitter.

Yet the little android’s aim was always for some sort of serendipity, and so it tries to extract meaning from the abstracts (where available), sometimes with rather funny consequences. The robot’s first idea was to make use of (structural) topic modelling. There are some implementations available in R, and the first results looked promising, but in the end, topic modelling did not find meaningful clusters of papers that could easily be labelled with a common theme. One possible reason is that the abstracts are short, and that there are relatively few of them (fewer than 400). And so the Robot reverted to using a small and fairly arbitrary set of keywords for identifying topics.
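For the record, such a topic-model attempt can be sketched with the stm package (a minimal sketch, not the robot's actual setup; `abstracts` and K = 10 are made-up stand-ins):

```r
library(stm)

# Stand-in for the real character vector of paper abstracts
abstracts <- c("Radical right parties and their voters in Western Europe ...",
               "Media coverage of immigration and support for the radical right ...")

# Standard stm preprocessing: tokenise, stem, remove stopwords, build the vocab
processed <- textProcessor(abstracts)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Fit a model with, say, K = 10 topics
fit <- stm(documents = out$documents, vocab = out$vocab, K = 10)

# Highest-probability words per topic, as a starting point for labelling
labelTopics(fit)
```

With only a few hundred short texts, the word lists that `labelTopics()` returns tend to be exactly the kind of unlabellable mixtures described above.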

This keyword approach produced some embarrassing howlers (clearly, the robot has a thing for media studies – who doesn’t?).

There are two problems here: first, even a single instance of a keyword in a given abstract is enough to trigger a classification, and second, the bot’s pedestrian implementation would classify an abstract using the last keyword that it detected, even if it was the most peripheral of several hits. Not good enough for world domination, obviously.
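The flawed logic can be reconstructed in a few lines (a hypothetical sketch, not the bot's actual code; the topic names and keywords follow the dictionary shown further down):

```r
# Hypothetical reconstruction of the naive keyword tagger
topics <- list(
  "media"          = c("tv", "newspaper", "journalis"),
  "group conflict" = c("prejudice", "stereotyp", "competition")
)

classify <- function(abstract) {
  text <- tolower(abstract)
  label <- NA_character_
  for (topic in names(topics)) {
    for (kw in topics[[topic]]) {
      # Problem 1: a single substring match is enough to trigger the topic
      # Problem 2: later matches silently overwrite earlier ones
      if (grepl(kw, text, fixed = TRUE)) label <- topic
    }
  }
  label
}

classify("A study of TV debates")          # "media"
classify("Newspaper reports on prejudice") # "group conflict" wins, "media" is lost
```

The second call shows the bug: the abstract matches both topics, but only the last keyword hit survives.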

Newsmap works reasonably well for classifying topics in research paper abstracts

Looking for an alternative solution, the robot came across newsmap (now also available within quanteda), a geographical news classifier developed by Kohei Watanabe. Newsmap is semi-supervised: it starts with a dictionary of proper nouns and adjectives that all refer to geographical entities, say

'France': [Paris, France, French*] 
'Germany': [German*, Berlin]

But newsmap is able to pick up additional words that also help to identify the respective country with high probability, e.g. “Macron”, “Merkel”, “Marseille”, “Hamburg”, or even “Lederhosen”. In a (limited) sense, it learns to identify geographical context even when the country in question is not mentioned explicitly.

But the algorithm is not restricted to geographical entities. It can also identify topics from a list. And so these days, the robot starts with a dictionary of seed words that is a work in progress but looks mostly like this at the moment:

'religion & culture': [muslim*, islam*, relig*, cultur*]
'media': [TV, newspaper*, journalis*]
'group conflict': [group*, contact, prejudice, stereotyp*, competition]

Results are not perfect, but at least they are less embarrassing than those from the simple keyword approach. One remaining problem is that newsmap tags each abstract with (at most) one topic. In reality, any given article will refer to two or more themes in the literature. Topic models are much more attractive in this respect, because they treat each text as a mixture of topics, and so the robot may have to revisit them in the future.
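Put together, the workflow looks roughly like this (a minimal sketch, assuming the standard quanteda/newsmap API; the example abstracts and the exact dictionary entries are made-up stand-ins):

```r
library(quanteda)
library(newsmap)

# Toy stand-ins for the real abstracts
abstracts <- c("Newspaper coverage and journalists framing the radical right.",
               "Religious identity, Islam, and cultural grievances among voters.")

# Seed dictionary mirroring the one above (glob-style wildcards)
dict <- dictionary(list(
  media            = c("tv", "newspaper*", "journalis*"),
  religion_culture = c("muslim*", "islam*", "relig*", "cultur*")
))

toks <- tokens(abstracts, remove_punct = TRUE)

# Labels: documents tagged by their seed-word matches
label_dfm <- dfm(tokens_lookup(toks, dict))
# Features: all words, so newsmap can pick up additional associated terms
feat_dfm <- dfm(toks)

model <- textmodel_newsmap(feat_dfm, label_dfm)
predict(model)  # at most one topic label per abstract
coef(model)     # words most strongly associated with each topic
```

The `predict()` step is where the one-label-per-abstract limitation mentioned above comes from.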

  2 Responses to “Identifying topics in research papers with the newsmap package for R (or: how the Radical Right Research Robot became slightly less dumb)”
