Twitter and Word Cloud

A Twitter bot, called @wordnuvola, attracted my curiosity: it can generate a word cloud of your tweets (nice!). As a user, you simply have to follow the account and send back a tweet. In return, you have a nice image after some minutes. The service seems quite successful (viral?) since many people I follow use it for fun. My word cloud result was:

Extraction process

Two questions came to my mind:

  • Is the word cloud accurate? I was expecting some keywords to be present and, for instance, I was surprised to have “splc18”: SPLC is an important conference in my research area, I’ve tweeted about the 2018 edition, but not that much.
  • How the Twitter bot can know about (all) my tweets? After all I did not give special access to my tweets and I have written thousands (most of the tweets are actually retweets).

To answer these two questions, I’ve decided to replicate the generation of the word cloud. It toke less than 1 hour, including the following steps:

  • request and download the full archive of your tweets (in the Twitter preferences)
  • understand how data/tweets are organized (it was a basic CSV file, where each row corresponds to a tweet)
  • process the tweets and generate the word cloud. I used Python, pandas, and wordcloud (actually created and maintained by one of the core contributor of scikit-learn, Andreas Mueller!)
  • fine-tune and test the solution (see below)

To obtain an interesting word cloud, an important step is the pre-processing of data:

  • removal of URLs or twitter’s @usernames (otherwise your most frequent words are followers you exchange with)
  • stop words: wordcloud library has a default set of words (english) to ignore, but as I tweet sometimes in French I need to add others…
  • consider or not tweets that are retweets (I prefer to not include them)

The final result is:

Extraction process

Clearly, the word cloud differs. So why?

I experimented a bit: with or without RT, tweets after 2018, etc. My conclusion was that the bot only considers recent tweets and RTs. Am I right?

Actually yes ;-) In fact the bot is open source and is publicly available here. The analysis of the source code shows that:

  • ‘include_rts’: ‘true’
  • 3200 tweets are collected, for a maximum page of 16 (thanks to Twitter API)

So the general conclusion is:

  • Is the word cloud accurate? No since it considers only recent tweets.
  • How the Twitter bot can know about (all) my tweets? It doesn’t know all tweets, only a sample.

Anyway, this service is fun and has the merit of being easy to use. But you know the secrets now ;) If you want a more comprehensive solution, I have shared my code.

Written on February 14, 2019
[ data | twitter | python ]