There were many unique aspects of the 2016 U.S. presidential election, with one of the most interesting being Donald Trump’s use of Twitter. As luck would have it, tweets happen to be very easy to extract and put into a relational database format. Being the data junkies that we are, we couldn’t resist downloading President Trump’s entire tweet history and loading it into Sisense for Cloud Data Teams. It’s an excellent opportunity to slice, analyze, and quantify presidential communications while also exploring methods for analyzing the text in SQL at the same time.
So let’s get started.
The Data Set
First, we formatted Trump’s tweets into 5 columns of data:
- Tweet Text
- Tweet Date
- Tweet ID
With these columns, it was easy to do some quick exploration. @realDonaldTrump tweeted for the first time on May 4, 2009, with a tweet he almost certainly didn’t write.
select tweet_text as first_tweet from trump_tweets order by tweet_date limit 1
Around July of 2011, Trump’s tweet volume quickly escalated. It’s not clear what triggered this increase in activity, but based on his first tweet a reasonable guess is that Trump began tweeting himself, either in addition to or in lieu of staff. In total, Donald Trump has tweeted over 34,000 times. At his peak he was tweeting over 30 times per day. A remarkable number by any measure.
A look at Trump’s tweet frequency over time looks like this:
select month(tweet_date) as mnth , count(*) from trump_tweets group by 1
Keyword Frequency: A History of MAGA
Trump first used the phrase ‘Make America Great Again’ long before his announcement to run for President in the 2016 election as a Republican candidate. He filed papers in Iowa in 2012 for the “Make America Great Again” Party to begin a run as an independent candidate. His first use of the phrase on Twitter:
His use of the phrase has obviously increased since. If we sum the number of tweets containing the words ‘Make America Great Again’ or ‘MAGA’, and divide that number by the count of all tweets, we can easily define Trump’s MAGA rate. With a simple query we can look at the MAGA rate over time. Trump’s MAGA rate has a slightly bimodal distribution centered first around the announcement of his candidacy in 2015, then more strongly after his victory in the presidential election. Trump held an astounding peak monthly MAGA rate of 14% in September 2016.
-- Note the use of ilike and like. Ilike compares all strings -- as lowercase, while like is case-sensitive. I don’t want to -- count magazines. -- Here we will use a float conversion to avoid integer division. -- This is Redshift specific. Adjust accordingly. select month(tweet_date) mnth , sum( case when tweet_text ilike '%make america great again%' or tweet_text like '%MAGA%' then 1 else 0 end ) / count(*)::float as maga_rate from trump_tweets group by 1
Sentiment: The Anger and Joy of Donald J. Trump
Given Trump’s prolific use of Twitter leading up to and throughout the campaign, we can apply a quantitative approach to measuring the tone and sentiment of his campaign with text analysis.
SQL is not the first tool people think of when doing text analytics, but it has some interesting advantages. It’s fast, well-integrated with existing ETL processes, and easy to build into regular reports. We will use a simple but effective approach, by relating individual words to sentiment scores to get a quick read on the emotional state of DJT.
This technique we’re using was explored in the paper “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.” Using Twitter as a foundation, Finn Årup Nielsen from the Technical University of Denmark (DTU) scored 2,477 words from -5 to 5 based on their perceived sentiment, with -5 being a word linked to a strongly negative emotional content. By joining the words in Trump’s tweets to this list, we can quickly get a per-tweet sentiment score. The data set for this topic is published by the Department of Informatics and Mathematical Modelling at DTU.
The data is in the form:
In order to begin this analysis, the tweets must be broken down into individuals words to join to our sentiment table. The method to do this is outlined in our blog post about NPS and survey analytics.
Once we have a resulting table of individual words, we can begin joining our tweets to our sentiment scores.
With a few simple queries, we can find Trump’s most celebratory and antagonistic tweets. First, we sum the sentiment scores by tweet.
His most positive:
with sentiment as ( select tweet_text , tweet_date , sum(sentiment_score) as sentiment_sum , count(*)::float as words from trump_words group by 1,2 ) select tweet_text , tweet_date , sentiment_sum from sentiment where sentiment_sum is not null order by sentiment_sum desc Limit 10
And his most negative:
These are Trump’s most emotionally charged tweets, but since we have summed the sentiment scores we are biased towards longer phrases. Simple declarations such as “Great!” or “Sad!” convey a lot of emotional content without summing to large numbers. As we dig a little deeper we will divide the summed scores by the number of words in the Tweet to gauge the net sentiment.
Let’s look at Trump’s sentiment over time. This has been normalized to his average sentiment over the data history so that it is centered on 0. This is done because the sentiment data has little absolute meaning without a reference point.
with monthly_sentiment as ( select month(tweet_date) as mnth , tweet_text , sum(sentiment_score) / max(num_words)::float as sentiment_score from trump_words group by 1, 2 ) , avg_sentiment as ( select avg(sentiment_score) as sentiment_avg from trump_words ) select mnth , avg(sentiment_score) - sentiment_avg as sentiment from monthly_sentiment -- I want the average sentiment available for every row. -- You could use a window function as well. left join avg_sentiment on true
As can be seen, Trump had some serious highs and lows before 2013. The highs are often associated with the Miss USA and Miss Universe pageants or were tweets written by people (we suspect) other than Trump. The very low scores in 2011 – 2012 are mostly political in nature. Let’s take a closer look at some of these:
Most worrying though, Trump has only become more negative over the past 4 years, especially from 2015 and onwards. Zooming in on the period of the election we can see that Trump generally trended downwards, exhibiting increasing negativity over the course of his Presidential campaign. There were some high points, such as the week of his victory in November, but the overall sentiment of Trump’s tweets continues to decline. We can easily summarize Trump’s tweets with a single word: Sad!
Sentiment by Time of Day: Angry, Early Morning Tweets
Another interesting perspective is Trump’s sentiment as a function of the hour of the day. Much has been made of his proclivity for early morning tweeting and what the ramifications of this might be. With a bit of SQL, we can look at this quickly and easily. We use the same query as above, continuing to limit our timeframe to the presidential campaign, but we use a 3-hour rolling average and use an extract, like so:
with monthly_sentiment as ( select extract(hour from (tweet_date)::timestamp) as hr , tweet_text , sum(sentiment_score) / max(num_words)::float as sentiment_score from Trump_words where date >= '2015-06-16' group by 1, 2 ) , avg_sentiment as ( select avg(sentiment_score) as sentiment_avg from trump_words ) select hr , count(*) as tweets , avg(avg(sentiment_score)) over(order by hr rows between 1 preceding and 1 following) - sentiment_avg as sentiment from monthly_sentiment left join avg_sentiment on true group by
He’s at his most negative in the morning. On average, between the hours of 4 am and 7 am, Trump unleashes a high volume of negative tweets. Our data does reveal a counterbalance to this finding. Trump’s tweets become more positive throughout the day and this positive sentiment peaks at night, along with his overall tweet volume.
Many companies have an abundance of text data but don’t have a strategy for fully making use of it. After a few simple joins in SQL, keyword frequency and sentiment analysis are two excellent methods for gaining context and insight into text data. Both are broadly applicable beyond tweets. Survey data, comments, competitive information, or web copy can all be explored with the methods discussed above.