Twitter and Word Cloud
A Twitter bot, called @wordnuvola, attracted my curiosity: it can generate a word cloud of your tweets (nice!). As a user, you simply have to follow the account and send back a tweet. In return, you have a nice image after some minutes. The service seems quite successful (viral?) since many people I follow use it for fun. My word cloud result was:
Two questions came to my mind:
- Is the word cloud accurate? I was expecting some keywords to be present and, for instance, I was surprised to have “splc18”: SPLC is an important conference in my research area, I’ve tweeted about the 2018 edition, but not that much.
- How the Twitter bot can know about (all) my tweets? After all I did not give special access to my tweets and I have written thousands (most of the tweets are actually retweets).
To answer these two questions, I’ve decided to replicate the generation of the word cloud. It toke less than 1 hour, including the following steps:
- request and download the full archive of your tweets (in the Twitter preferences)
- understand how data/tweets are organized (it was a basic CSV file, where each row corresponds to a tweet)
- process the tweets and generate the word cloud. I used Python, pandas, and wordcloud (actually created and maintained by one of the core contributor of scikit-learn, Andreas Mueller!)
- fine-tune and test the solution (see below)
To obtain an interesting word cloud, an important step is the pre-processing of data:
- removal of URLs or twitter’s @usernames (otherwise your most frequent words are followers you exchange with)
- stop words: wordcloud library has a default set of words (english) to ignore, but as I tweet sometimes in French I need to add others…
- consider or not tweets that are retweets (I prefer to not include them)
The final result is:
Clearly, the word cloud differs. So why?
I experimented a bit: with or without RT, tweets after 2018, etc. My conclusion was that the bot only considers recent tweets and RTs. Am I right?
Actually yes ;-) In fact the bot is open source and is publicly available here. The analysis of the source code shows that:
- ‘include_rts’: ‘true’
- 3200 tweets are collected, for a maximum page of 16 (thanks to Twitter API)
So the general conclusion is:
- Is the word cloud accurate? No since it considers only recent tweets.
- How the Twitter bot can know about (all) my tweets? It doesn’t know all tweets, only a sample.
Anyway, this service is fun and has the merit of being easy to use. But you know the secrets now ;) If you want a more comprehensive solution, I have shared my code.