Twitter Sentiment Analysis


Several controversial political events took place in Hungary recently, and the code below tries to capture the public’s reaction by applying sentiment analysis to tweets matching 2 search terms:

‘Viktor Orbán’ - the current prime minister of Hungary
#istandwithCEU - the most used supporting hashtag of the Central European University

The tweets used in this project were posted between 1st and 14th April 2017.

NOTE: This project does not take a stand for or against any political side; it only shows an example of how to use R as a data science tool for sentiment analysis. The results are therefore not discussed.

Huge credit goes to the authors of the following 2 blog articles, because a large part of the code used in this project was borrowed from them:

How to Use R to Scrape Tweets: Super Tuesday 2016 by Kris Eberwein
Joy to the World, and also Anticipation, Disgust, Surprise… by Julia Silge

Okay, now let’s get started!

First, read in the 2 CSV files that were scraped from Twitter using the twitteR package.

t_orban <- read.csv('/Users/mac/Desktop/Data Science/Pet Projects/Twitter/feed_orban.csv')
t_ceu <- read.csv('/Users/mac/Desktop/Data Science/Pet Projects/Twitter/feed_ceu.csv')

Next, read in the sentiment dictionary, a list of about 6,800 positive and negative words compiled by Bing Liu and Minqing Hu of the University of Illinois at Chicago. You can download it here.

good_text = scan('/Users/mac/Desktop/Data Science/Pet Projects/Twitter/positive-words.txt',
           what='character', comment.char=';')
bad_text = scan('/Users/mac/Desktop/Data Science/Pet Projects/Twitter/negative-words.txt',
           what='character', comment.char=';')

The next step is to define a fairly long function that scores each tweet by counting how many of its words appear in the positive and in the negative word list.

score.sentiment = function(sentences, good_text, bad_text, .progress='none')
{
        library(plyr)
        library(stringr)
        # we got a vector of sentences. plyr will handle a list
        # or a vector as an "l" for us
        # we want a simple array of scores back, so we use
        # "l" + "a" + "ply" = "laply":
        scores = laply(sentences, function(sentence, good_text, bad_text) {

                # clean up sentences with R's regex-driven global substitute, gsub():
                sentence = gsub('[[:punct:]]', '', sentence)
                sentence = gsub('[[:cntrl:]]', '', sentence)
                sentence = gsub('\\d+', '', sentence)
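                # remove emojis and other non-ASCII characters; sub = '' drops them
                # instead of turning the whole tweet into NA
                sentence <- iconv(sentence, 'UTF-8', 'ASCII', sub = '')
                sentence = tolower(sentence)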
                # split into words. str_split is in the stringr package
                word.list = str_split(sentence, '\\s+')
                # sometimes a list() is one level of hierarchy too much
                words = unlist(word.list)

                # compare our words to the dictionaries of positive & negative terms
                pos.matches = match(words, good_text)
                neg.matches = match(words, bad_text)

                # match() returns the position of the matched term or NA
                # we just want a TRUE/FALSE:
                pos.matches = !is.na(pos.matches)
                neg.matches = !is.na(neg.matches)

                # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
                score = sum(pos.matches) - sum(neg.matches)

                return(score)
        }, good_text, bad_text, .progress=.progress )

        scores.df = data.frame(score=scores, text=sentences)
        return(scores.df)
}
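
To make the scoring rule easier to follow, here is a minimal sketch of the same idea in base R. The word lists `toy_good` and `toy_bad` and the helper `toy_score` are made up for illustration; they are not the real dictionaries, and `%in%` is used as a shorthand for the `!is.na(match(...))` step in the function above.

```r
# Tiny hypothetical dictionaries, standing in for the real word lists
toy_good <- c("good", "great", "love")
toy_bad  <- c("bad", "awful", "hate")

# Score one sentence: +1 for every positive word, -1 for every negative word
toy_score <- function(sentence, good, bad) {
  cleaned <- tolower(gsub("[[:punct:]]", "", sentence))  # drop punctuation
  words <- unlist(strsplit(cleaned, "\\s+"))             # split on whitespace
  sum(words %in% good) - sum(words %in% bad)
}

s1 <- toy_score("I love this, it is great!", toy_good, toy_bad)  # two positive words
s2 <- toy_score("Awful idea. I hate it.", toy_good, toy_bad)     # two negative words
s3 <- toy_score("Good start but bad ending", toy_good, toy_bad)  # one of each
print(c(s1, s2, s3))
```

A tweet with as many positive as negative words ends up with a score of 0, the same as a tweet with no dictionary words at all, so the score cannot tell "mixed" apart from "neutral".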

The next step is to apply this function to the 2 loaded data frames, using the lists of good and bad words from the sentiment dictionary. The function returns a new data frame holding each tweet’s text and its sentiment score, and the tweet’s date is then attached as a third column.

sent_orban <- score.sentiment(t_orban$tweets, good_text, bad_text, .progress='none')
sent_orban <- cbind(sent_orban, date=as.factor(t_orban$date))
sent_ceu <- score.sentiment(t_ceu$tweets, good_text, bad_text, .progress='none')
sent_ceu <- cbind(sent_ceu, date=as.factor(t_ceu$date))

Loading a couple of libraries that will be required:

library(ggplot2)
library(ggthemes)
library(tidyr)
library(dplyr)
library(stringr)

The upcoming bar chart is filled with a 3-color gradient that runs through shades of green, yellow and red. Here is how the palette is created:

colpal <- colorRampPalette(c("green3", "lemonchiffon2", "firebrick"))
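
As a quick check of what `colorRampPalette` returns, the sketch below builds the same palette and asks it for 8 shades, which is the number of score levels assumed by the bar chart further down. It only uses base R (the grDevices package that ships with it).

```r
# Same three anchor colors as in the post
colpal <- colorRampPalette(c("green3", "lemonchiffon2", "firebrick"))

# colpal is a function: give it a number and it returns that many
# interpolated colors as hex strings
shades <- colpal(8)
print(shades)

# The first and last shades are the outer anchor colors themselves
first_is_green <- all(col2rgb(shades[1]) == col2rgb("green3"))
last_is_red    <- all(col2rgb(shades[8]) == col2rgb("firebrick"))
print(c(first_is_green, last_is_red))
```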

I used a cheatsheet provided by Tian Zheng (associate professor at Columbia University) to pick my colors. It’s a very handy tool: you can quickly select your favourite colors and copy-paste their R names into your script.

The following code will count the total number of tweets per day:

sum_count_orban <-
        sent_orban %>%
        count(date)
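
If the per-day count is unclear, the following base R sketch does the same tally without dplyr. The data frame `toy_tweets` is invented for illustration and only mimics the shape of the real data (one row per tweet).

```r
# Hypothetical mini data set: one row per tweet, with the day it was posted
toy_tweets <- data.frame(
  date  = c("2017-04-01", "2017-04-01", "2017-04-02", "2017-04-02", "2017-04-02"),
  score = c(1, -2, 0, 0, 3)
)

# table() tallies the rows for each distinct date;
# as.data.frame() turns the tally into a date / n data frame
per_day <- as.data.frame(table(date = toy_tweets$date), responseName = "n")
print(per_day)
```

The resulting column `n` plays the same role as the `n` that the plot below uses for its labels.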

Sentiment scores for the search term ‘Viktor Orbán’

Let’s draw the first plot, which shows the distribution of sentiment scores per day for the first search term.

p_o1 <- ggplot(data = sent_orban, aes(x = date, y = 1)) +
        geom_bar(stat = "identity",
                 aes(fill = factor(score, levels=rev(levels(as.factor(score)))))) +
        geom_text(data = sum_count_orban,
                  aes(y = n, label = n), size = 3,
                  vjust = -0.5, color = 'white') +
        scale_fill_manual(labels = c("3    Positive", "2", "1", "0    Neutral", "-1", "-2", "-3", "-4    Negative"), values = colpal(8)) +
        theme(panel.grid.major.x = element_blank(), panel.background = element_rect(fill = 'turquoise4')) +
        ggtitle("'Viktor Orbán' sentiment scores") +
        labs(y="Number of Tweets") +
        theme(legend.title=element_blank(), axis.title.x = element_blank()) +
        theme(plot.title = element_text(family = "Trebuchet MS", color="#666666", face="bold", size=16, hjust=0)) +
        theme(axis.text.x = element_text(size = 12, angle = 90, hjust = 1),
              axis.title = element_text(family = "Trebuchet MS", color="#666666", face="bold", size=12))
p_o1

Next, the scores are collapsed into 3 classes (positive, neutral and negative) and the tweets in each class are counted per day.

stat <- mutate(sent_orban, tweet=ifelse(sent_orban$score > 0, 'positive', ifelse(sent_orban$score < 0, 'negative', 'neutral')))
by.tweet <- group_by(stat, tweet, date)
by.tweet <- summarise(by.tweet, number=n())
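
Here is a small base R sketch of the same classify-then-count step, using a made-up data frame `toy` instead of the real tweets. Note one difference, which is handled explicitly below: `table()` also reports class/day combinations that never occur, whereas the grouped summary above simply omits them.

```r
# Hypothetical scores for six tweets posted on two days
toy <- data.frame(
  date  = c("04-01", "04-01", "04-01", "04-02", "04-02", "04-02"),
  score = c(2, 1, -1, 0, -3, -2)
)

# Same nested ifelse() rule as in the post
toy$tweet <- ifelse(toy$score > 0, "positive",
                    ifelse(toy$score < 0, "negative", "neutral"))

# Count tweets per class and day, then drop the empty combinations
counts <- as.data.frame(table(tweet = toy$tweet, date = toy$date),
                        responseName = "number")
counts <- counts[counts$number > 0, ]
print(counts)
```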

Plotting the overall sentiment scores:

p_o2 <- ggplot(by.tweet, aes(date, number)) + geom_line(aes(group=tweet, color=tweet), size=1) +
                geom_point(aes(group=tweet, color=tweet), size=2) +
                scale_color_manual(values=c("#fc8d59", "#999999", "#91cf60")) +
        ggtitle("'Viktor Orbán' overall sentiment scores") +
        labs(y="Number of Tweets") +
        theme(legend.title=element_blank(), axis.title.x = element_blank()) +
        theme(plot.title = element_text(family = "Trebuchet MS", color="#666666", face="bold", size=16, hjust=0)) +
        theme(axis.text.x = element_text(size = 12, angle = 90, hjust = 1),
              axis.title = element_text(family = "Trebuchet MS", color="#666666", face="bold", size=12))
p_o2

Next stop is emotions!

Now let’s dig deeper and break the positive and negative feelings down into individual emotions. To do this, I used the get_nrc_sentiment function from the syuzhet package, which is based on the NRC Word-Emotion Association Lexicon compiled by Saif Mohammad and Peter Turney. The lexicon is a dictionary in which each word is associated with one or more of 8 different emotions, along with a general positive or negative label.
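
To give a feel for how a word-emotion lexicon can be turned into scores, here is a toy sketch in base R. Everything in it is an assumption made for illustration: the three-word `toy_lexicon`, its emotion flags and the helper `toy_emotions` are invented, and the real lexicon and the syuzhet implementation are far larger and may differ in detail.

```r
# Hypothetical three-word lexicon: a 1 means "this word is associated
# with this emotion"; only four emotions are shown to keep it short
toy_lexicon <- rbind(
  threat = c(anger = 1, fear = 1, joy = 0, trust = 0),
  happy  = c(anger = 0, fear = 0, joy = 1, trust = 1),
  ally   = c(anger = 0, fear = 0, joy = 0, trust = 1)
)

# Add up the emotion flags of every lexicon word found in the text
toy_emotions <- function(text, lexicon) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))  # split on non-letters
  hits <- words[words %in% rownames(lexicon)]          # keep known words only
  colSums(lexicon[hits, , drop = FALSE])
}

emotions <- toy_emotions("A happy ally is no threat", toy_lexicon)
print(emotions)
```

The example sentence also shows a limitation of plain word counting: "no threat" still adds to anger and fear, because negation is ignored.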

Loading the necessary libraries:

library(syuzhet)
library(lubridate)
library(scales)
library(reshape2)

First we have to replace the non-graphical characters (anything that is not a visible, printable character) with spaces in order to run the NRC algorithm on the tweets:

usableText <- str_replace_all(sent_orban$text,"[^[:graph:]]", " ")
mySentiment <- get_nrc_sentiment(usableText)
tweets <- cbind(text=sent_orban$text, date=sent_orban$date, mySentiment)
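
The pattern `[^[:graph:]]` matches every character that is not in the `[:graph:]` class, so tabs, line breaks and ordinary spaces are all replaced by a single space. The sketch below shows this with base R’s `gsub` as a stand-in for `str_replace_all`; the sample string is made up and kept to plain ASCII, because how accented letters are classified can depend on the locale.

```r
# Hypothetical tweet text containing a tab and a line break
raw <- "Breaking\tnews:\nCEU vote today"

# Replace everything that is not a visible character with a space
cleaned <- gsub("[^[:graph:]]", " ", raw)
print(cleaned)
```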

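After extracting the emotion scores, we can aggregate the numbers for the visualization. Columns 3 to 11 hold the 8 emotions plus the overall negative score; the positive score in column 12 is left out: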

sentimentTotals <- data.frame(colSums(tweets[,c(3:11)]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals), sentimentTotals)
rownames(sentimentTotals) <- NULL
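
The aggregation boils down to a `colSums` call followed by some reshaping. The sketch below repeats it on an invented data frame, `toy_emotions`, and selects the columns by name rather than by position, which is one way to make sure only the intended columns are summed.

```r
# Hypothetical per-tweet emotion scores
toy_emotions <- data.frame(
  text  = c("tweet a", "tweet b", "tweet c"),
  anger = c(1, 0, 2),
  fear  = c(0, 3, 1),
  joy   = c(1, 1, 0)
)

# Sum each emotion over all tweets, picking the columns by name
totals <- colSums(toy_emotions[, c("anger", "fear", "joy")])

# Turn the named vector into a two-column data frame for plotting
totals_df <- data.frame(sentiment = names(totals), count = unname(totals))
print(totals_df)
```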

Let’s look into the data frame that we got:

sentimentTotals

##      sentiment count
## 1        anger  8175
## 2 anticipation  5178
## 3      disgust  3416
## 4         fear  9328
## 5          joy  1461
## 6      sadness  6146
## 7     surprise  3520
## 8        trust  5772
## 9     negative 12830

Now we are ready to plot the aggregated scores by emotions and check their distribution:

p_o3 <- ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
        geom_bar(aes(fill = sentiment), stat = "identity") +
        geom_text(data = sentimentTotals,
                  aes(y = count, label = count), size = 4,
                  vjust = -0.5, color = 'black') +
        theme(legend.position = "none") +
        theme(axis.text.x = element_text(size = 8)) +
        xlab("Sentiment") + ylab("Total Count") + ggtitle("'Viktor Orbán' : Total Sentiment Scores")
p_o3

For the last plot, we summarize the average score of the emotions by day. This will give a nice overview of the different feelings changing over time.

To prepare the data for visualization, the mean of the sentiments should be grouped by date:

tweets$day <- date(tweets$date)
dailysentiment <- tweets %>% group_by(date) %>%
        summarise(anger = mean(anger),
                  anticipation = mean(anticipation),
                  disgust = mean(disgust),
                  fear = mean(fear),
                  joy = mean(joy),
                  sadness = mean(sadness),
                  surprise = mean(surprise),
                  trust = mean(trust)) %>% melt(id.vars = "date")
names(dailysentiment) <- c("date", "sentiment", "meanvalue")
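
The two steps here are averaging per day and then reshaping from wide (one column per emotion) to long (one row per day and emotion). This base R sketch walks through both on a made-up data frame, `toy`, with only two emotions; the wide-to-long step is written out by hand instead of using `melt`.

```r
# Hypothetical per-tweet scores for two emotions on two days
toy <- data.frame(
  date  = c("2017-04-01", "2017-04-01", "2017-04-02"),
  anger = c(1, 3, 0),
  joy   = c(0, 2, 4)
)

# Step 1: mean of each emotion per day (wide format)
daily <- aggregate(cbind(anger, joy) ~ date, data = toy, FUN = mean)

# Step 2: stack the emotion columns into one long data frame
long <- data.frame(
  date      = rep(daily$date, times = 2),
  sentiment = rep(c("anger", "joy"), each = nrow(daily)),
  meanvalue = c(daily$anger, daily$joy)
)
print(long)
```

The long format is what ggplot2 needs to draw one line per emotion from a single `sentiment` column.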

The average sentiment scores are ready to be plotted now:

p_o4 <- ggplot(data = dailysentiment, aes(x = as.factor(day(date)), y = meanvalue, group = sentiment)) +
        geom_line(size = 1.5, alpha = 0.7, aes(color = sentiment)) +
        geom_point(size = 0.5) +
        ylim(0, NA) +
        theme(legend.title=element_blank(), axis.title.x = element_blank()) +
        ylab("Average sentiment score") +
        ggtitle("'Viktor Orbán' Twitter sentiment, 1st - 14th April 2017") +
        theme(axis.text.x = element_text(size = 8, hjust = 1))
p_o4