Preserving data from social media is crucial for many scientific disciplines. Publicly available social media archives facilitate research in the social sciences and provide corpora for training and testing a wide range of machine learning and natural language processing methods. To reduce reliance on commercial gatekeepers, we decided in 2013 to create a large-scale longitudinal archive of tweets from X (then Twitter) for research purposes. We collected the data from the random 1% sample of all tweets that Twitter’s streaming API made freely available at the time.

In this talk, we will introduce TweetsKB – a knowledge base of tweets that has been enriched with named entities and sentiments. We will also show how TweetsKB can be used to create topic-specific sub-corpora, focusing on important societal events such as the COVID-19 pandemic. Understanding the COVID-19 discourse, how it differs from the general Twitter discourse, and its interdependencies with real-world events or (mis)information can yield valuable insights.

Presenters:

Sebastian Schellhammer
