The value of Twitter has been in the news a lot recently. Antics related to Elon Musk’s possible Twitter buyout have led to a legal battle that may result in Twitter releasing a snapshot of data on its users and tweets. If Musk actually gets his hands on this data, what kinds of things might he be looking for?
Data profiling is a great way to analyze and explore a dataset. A data profile helps you understand the structure of a dataset, and its statistical metrics are also great for exploratory analysis.
In this article, I take some publicly available Twitter datasets and run them through a data profiler to see what interesting metrics I can find.
I used two datasets for this analysis:
- A portion of the Archive Team Twitter Stream dataset from June 2021. This dataset comes in JSON form, so you’ll need to use a tool such as twitter_to_csv to convert the raw Twitter API JSON to CSV.
- A pre-prepared dataset from Kaggle. I found the Gender classification set, with data from 2015, was the most useful as it retained much of the user data, such as profile image URL, retweet count, and profile description.
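If you would rather not depend on a conversion tool for the Archive Team stream, the JSON-to-CSV step can be sketched with the Python standard library. This is a minimal sketch, not how twitter_to_csv works: it assumes one JSON object per line, field names that follow the Twitter API v1.1 tweet object, and a column selection (`COLUMNS`) that I picked purely for illustration.

```python
import csv
import io
import json

# Columns to keep. The selection is an assumption made for this sketch;
# the underlying field names follow the Twitter API v1.1 tweet object.
COLUMNS = [
    "id_str", "created_at", "text", "retweet_count",
    "user_screen_name", "user_created_at",
    "user_followers_count", "user_profile_image_url",
]


def flatten(tweet):
    """Turn one nested tweet object into a flat dict keyed by COLUMNS."""
    user = tweet.get("user") or {}
    return {
        "id_str": tweet.get("id_str"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "retweet_count": tweet.get("retweet_count"),
        "user_screen_name": user.get("screen_name"),
        "user_created_at": user.get("created_at"),
        "user_followers_count": user.get("followers_count"),
        "user_profile_image_url": user.get("profile_image_url"),
    }


def json_lines_to_csv(lines, out):
    """Write tweets from an iterable of JSON lines to the file object `out`.

    Returns (tweets_written, non_tweets_skipped). Blank lines are ignored;
    records without a "text" key (for example deletion notices) are
    counted as skipped.
    """
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    written = skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "text" not in record:
            skipped += 1
            continue
        writer.writerow(flatten(record))
        written += 1
    return written, skipped


# Tiny in-memory demo with made-up records.
demo_lines = [
    '{"id_str": "1", "text": "hello", "retweet_count": 0, '
    '"user": {"screen_name": "a", "followers_count": 10}}',
    '{"delete": {"status": {"id_str": "2"}}}',
]
buffer = io.StringIO()
print(json_lines_to_csv(demo_lines, buffer))
```

For a real file you would pass an open file handle as `lines` and a file opened with `newline=""` as `out`. The skipped count is worth keeping an eye on, since it tells you how many records in the stream were not tweets at all.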
For both datasets I used PipeRider, an open-source CLI tool for profiling datasets. It outputs an HTML report that makes it easy to see the metrics. It also supports data assertions if you need to add testing to your dataset.
So, what might Elon Musk, or anyone analyzing the value of a social network’s data, need to look out for? What interesting metrics can be found, and what are the implications for determining the ‘value’ of Twitter users and tweets?
1. Account creation date
Looking at the account creation date in the gender dataset, there are active accounts that date all the way back to 2006.
This suggests that Twitter retains at least some of its users well, as these are users who have been coming back for years.
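A single earliest date says little about how many long-lived accounts there are, so it can help to bucket accounts by creation year. The sketch below is a stdlib-only illustration, not PipeRider's implementation. The default date format is the one used by the Twitter API v1.1 `created_at` field; a CSV export such as the Kaggle set may store dates differently, in which case you would pass a matching `fmt`.

```python
from collections import Counter
from datetime import datetime

# Format of `created_at` in the Twitter API v1.1, for example
# "Wed Oct 10 20:19:24 +0000 2018". Other exports may differ.
TWITTER_API_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def accounts_by_creation_year(created_values, fmt=TWITTER_API_FORMAT):
    """Count accounts per creation year, skipping empty values.

    Returns a dict ordered from the earliest year to the latest.
    """
    years = Counter()
    for value in created_values:
        if not value:
            continue
        years[datetime.strptime(value, fmt).year] += 1
    return dict(sorted(years.items()))


# Made-up sample values, only to show the call.
sample = [
    "Tue Mar 21 20:50:14 +0000 2006",
    "Wed Oct 10 20:19:24 +0000 2018",
    "",
]
print(accounts_by_creation_year(sample))
```

A long tail of accounts from the early years would support the retention reading, while a handful of old accounts in a sea of new ones would weaken it.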
Twitter account creation date metric via PipeRider:
2. Profile image
You might not think that a profile image could indicate much, but the notorious Twitter egg fiasco proves otherwise. 18.3% of profile images in the Kaggle dataset are duplicates, and it looks like this is because they are using the default image.
The Archive Team dataset showed a similar figure, at 17.1% in my sample.
Twitter profile image metrics via PipeRider:
This could be an indication of throw-away accounts, bots, trolls, or generally users who aren’t invested enough to update their profile image.
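To make the duplicate percentage concrete, here is a rough way to compute it by hand. It assumes that a "duplicate" is any non-empty value that occurs more than once in the column; PipeRider's exact definition may differ, so treat this as a sanity check rather than a reimplementation. The URLs are placeholders.

```python
from collections import Counter


def duplicate_share(values):
    """Fraction of non-empty values that occur more than once.

    Assumption for this sketch: every occurrence of a repeated value
    counts as a duplicate, including the first one.
    """
    values = [v for v in values if v]
    if not values:
        return 0.0
    counts = Counter(values)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(values)


# Placeholder URLs: two users share the same (default) image.
image_urls = [
    "http://example.com/default.png",
    "http://example.com/default.png",
    "http://example.com/alice.png",
    "http://example.com/bob.png",
]
print(duplicate_share(image_urls))  # 0.5
```

Checking which URL is repeated most often is the quickest way to confirm whether the duplicates really are the default image.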
3. Duplicate tweet content
This doesn’t refer to retweets or quote tweets, but to tweets with identical content. Why would separate tweets contain exactly the same content? In the Kaggle dataset, 8.7% of tweets had duplicated content.
This could be an indication of a bot farm spreading propaganda, or users could simply be clicking the “share this article” button on news websites.
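One way to start telling these cases apart is to look at how many distinct accounts posted each repeated text. The sketch below takes `(user, text)` pairs, which is an input shape I chose for illustration, and reports each repeated text with its total count and number of distinct authors. It is a heuristic only: one account repeating itself and many accounts posting the same text both need a closer look before calling anything a bot.

```python
from collections import defaultdict


def repeated_texts(rows, min_count=2):
    """Summarize tweet texts that appear at least `min_count` times.

    `rows` is an iterable of (user, text) pairs. Returns a list of
    (text, total_count, distinct_users), most frequent first.
    """
    counts = defaultdict(int)
    users = defaultdict(set)
    for user, text in rows:
        counts[text] += 1
        users[text].add(user)
    result = [
        (text, count, len(users[text]))
        for text, count in counts.items()
        if count >= min_count
    ]
    result.sort(key=lambda item: item[1], reverse=True)
    return result


# Made-up rows: one account repeating itself, and three accounts
# posting the same headline.
rows = [
    ("weatherbot", "Get Weather Updates"),
    ("weatherbot", "Get Weather Updates"),
    ("ann", "Big news story"),
    ("ben", "Big news story"),
    ("cat", "Big news story"),
    ("dan", "Something original"),
]
for text, count, authors in repeated_texts(rows):
    print(count, authors, text)
```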
Twitter metrics indicate possible bot messages via PipeRider:
The Kaggle dataset shows a lot of bizarre Weather Channel tweets.
Interestingly, these tweets cannot be found by searching Twitter now. All I can find is someone asking why there are so many bots spamming the Weather Channel.
The Archive Team dataset shows many retweets of an account for a Korean band (I think). The account has 13 million followers, so I suppose a few thousand fans retweeting isn’t out of the ordinary.
Tweet content metrics via PipeRider:
4. Retweet count
Retweet count was extremely low in both datasets: 99.9% of the tweets in the Kaggle set were not retweeted.
It could mean low interaction rates on Twitter, but it is probably just an indication that the dataset’s time frame is too narrow (only an hour) and that it takes a while for a tweet to gain momentum. This could be one to analyze over a period of a few hours or days.
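The "not retweeted" figure is simply the share of rows whose retweet count is zero. A minimal way to reproduce it from a CSV column is sketched below; it assumes the counts arrive as strings (as they do from `csv.DictReader`) and that empty cells should be left out of the denominator, which is my choice rather than anything PipeRider documents.

```python
def zero_share(values):
    """Fraction of non-empty numeric values that are exactly zero.

    Accepts ints, floats, or numeric strings such as "0" or "3.0".
    Empty strings and None are ignored.
    """
    numbers = [
        int(float(v))
        for v in values
        if v is not None and str(v).strip() != ""
    ]
    if not numbers:
        return 0.0
    return sum(1 for n in numbers if n == 0) / len(numbers)


# Made-up column: three of four tweets were never retweeted.
retweet_counts = ["0", "0", "0", "12", ""]
print(zero_share(retweet_counts))  # 0.75
```

Running the same check on snapshots taken hours or days apart would show whether the figure drops as tweets gain momentum.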
Retweet metrics via PipeRider:
5. Follower count
Follower count not only indicates how popular you are, but also whether users are interacting with and ‘listening’ to each other on Twitter. Are most users shouting into the wind? According to the Archive Team dataset, 4,800 users had zero followers and were effectively talking to themselves during that time frame.
It’s difficult to gauge this one without more analysis of the accounts with zero followers. They could be the bots we mentioned earlier, or just angry loners shouting into the wind.
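A first step in that extra analysis could be to look at the whole distribution rather than only the zero bucket. Follower counts are heavily skewed, so the sketch below groups them into rough order-of-magnitude buckets. The bucket boundaries are arbitrary choices of mine, not something taken from the PipeRider report.

```python
from collections import Counter


def follower_bucket(count):
    """Map a follower count to a coarse, order-of-magnitude bucket."""
    if count == 0:
        return "0"
    if count < 10:
        return "1-9"
    if count < 100:
        return "10-99"
    if count < 1000:
        return "100-999"
    return "1000+"


def follower_histogram(counts):
    """Count how many accounts fall into each follower bucket."""
    return Counter(follower_bucket(c) for c in counts)


# Made-up follower counts, only to show the call.
histogram = follower_histogram([0, 0, 5, 42, 13_000_000])
for bucket in ["0", "1-9", "10-99", "100-999", "1000+"]:
    print(bucket, histogram[bucket])
```

If the zero bucket is large while the "1-9" bucket is comparatively small, that points toward freshly created or throw-away accounts rather than ordinary quiet users.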
Follower count metrics via PipeRider:
Data exploration is fun, even if you’re not a billionaire mulling over the potential purchase of a $44 billion social network.
Data profiling can provide interesting insights into a dataset that you may not have previously thought of. Even the process of cleaning and transforming data in preparation for analysis can yield interesting results. For instance, while importing the Archive Team dataset, I found that there were many ‘deleted’ entries listed. I presume these are deleted tweets, a figure that could also offer some valuable insight.
Add data profiling to your data reliability strategy
If you’re interested in the tool I used here, check out PipeRider — it’s a CLI tool that profiles all the popular data warehouses and provides an HTML data profile report.
Also, before you finish reading the contracts for your Twitter purchase, please follow us on Twitter. We’re @infuseai ❤
If you found this article interesting, please retweet to support more content like this.
InfuseAI is solving data quality issues
InfuseAI makes PipeRider, the open-source data reliability CLI tool that adds data profiling and assertions to data warehouses such as BigQuery, Snowflake, Redshift and more. Data profile and data assertion results are provided in an HTML report each time you run PipeRider.