You have probably heard about Donald Trump’s Muslim ban from last year. Officially, this executive order was called Protecting the Nation from Foreign Terrorist Entry into the United States. It cut the number of refugees accepted into the USA from 110 000 to 50 000 and banned the entry of all refugees and immigrants from seven predominantly Muslim countries: Syria, Iran, Iraq, Libya, Somalia, Sudan and Yemen. The order prompted broad international criticism. Justin Trudeau, Angela Merkel, François Hollande, 40 Nobel laureates and many other officials, academics and religious leaders condemned the ban. Numerous public protests and demonstrations were organized.
The overall response to the ban, at least the public one, was negative. But is there a way to confirm and quantify that analytically?
Well, there is, thanks to Twitter. Over 16 million tweets about the travel ban, containing hashtags such as #MuslimBan, #NoBanNoWall, #NoMuslimBan, #JFKTerminal4, #RefugeesWelcome, #ImmigrationBan and #TravelBan, were published between January 30, 2017 and April 20, 2017. The majority of them are retweets, reposts and duplicates, but even when those are removed, over 500 000 original tweets remain, and over 400 000 of them are in English.
This is how many tweets were published daily between January 30 and March 31:
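For readers wondering how retweets and duplicates might be filtered out, here is a minimal Python sketch. It is not the exact procedure used for this analysis: the "RT @" prefix check, the `normalize` helper and the `original_tweets` function are all assumptions made for illustration, and real Twitter data usually carries a dedicated retweet flag that would be more reliable.

```python
import re

def normalize(text):
    """Lowercase, drop URLs and collapse whitespace so near-identical reposts compare equal."""
    text = re.sub(r"https?://\S+", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def original_tweets(tweets):
    """Keep the first occurrence of each tweet and skip anything that looks like a retweet.

    Assumption: retweets are recognized by a leading "RT @", which is only a heuristic.
    """
    seen = set()
    kept = []
    for tweet in tweets:
        if tweet.lstrip().lower().startswith("rt @"):
            continue  # looks like a retweet
        key = normalize(tweet)
        if key in seen:
            continue  # duplicate or repost of a tweet already kept
        seen.add(key)
        kept.append(tweet)
    return kept

sample = [
    "No ban, no wall #NoBanNoWall",
    "RT @someone: No ban, no wall #NoBanNoWall",
    "no ban,  no wall #nobannowall https://t.co/abc",
    "Refugees welcome",
]
print(original_tweets(sample))
```

In this made-up sample only the first and the last tweet survive: the second is treated as a retweet, and the third is the first one again with different casing, spacing and a link.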
Thanks to natural language processing tools, there is a way to extract sentiment from the data and classify it: sentiment analysis. In the majority of cases, sentiment analysis is used for marketing purposes, e.g. to improve customer experience. I was really curious whether I could get valid results from analyzing tweets about politics, because there aren’t many papers on this topic. I analyzed the tweets with two models, a dictionary-based algorithm and a penalized logistic regression (machine learning) model, and classified them as positive, negative or neutral.
The dictionary-based model is rather straightforward: the algorithm relies on lists of positive and negative words and scores each post according to how many such words it contains. That is, it adds one point for each “good” word and deducts one for each “bad” word, and the sum is the final sentiment score. The machine learning model is much more complex and time-consuming, as it requires an already classified training data set to learn from, but the final result is more accurate (accuracy of 76% versus 69% for the dictionary-based model).
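To make the dictionary-based scoring described above concrete, here is a minimal Python sketch. The word lists are tiny and invented for illustration (a real analysis would use a published sentiment lexicon), and the rule that maps the score to a class by its sign is my assumption here, as are the names `sentiment_score` and `classify`.

```python
import re

# Hypothetical, deliberately tiny word lists; not the lexicon used in the analysis.
POSITIVE_WORDS = {"welcome", "support", "love", "safe", "proud"}
NEGATIVE_WORDS = {"hate", "fear", "wrong", "shame", "cruel"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Add one point per positive word and subtract one per negative word."""
    tokens = tokenize(text)
    positives = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negatives = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return positives - negatives

def classify(text):
    """Assumed rule: score above zero is positive, below zero negative, zero neutral."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for tweet in [
    "Refugees are welcome here, we love and support you",
    "This is wrong and cruel",
    "The order was signed on Friday",
]:
    print(sentiment_score(tweet), classify(tweet))
```

The sketch also shows why this approach is cheap but limited: it needs no training data, yet it cannot see negation, sarcasm or context, which is one plausible reason a trained model does better.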
The aggregated result of the logistic regression classification looks as follows:
It confirms that the reaction to the ban was, for the most part, negative.
Sentiment analysis also allows for deeper text mining, e.g. extracting the most popular words in the dataset:
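A word count of this kind can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the stop-word list is a small made-up sample, the `top_words` function is hypothetical, and dropping URLs, @mentions and words shorter than three letters is one possible cleaning choice among many.

```python
import re
from collections import Counter

# Hypothetical, very short stop-word list; real analyses use much longer ones.
STOPWORDS = {"the", "and", "are", "for", "this", "that", "with", "you", "here"}

def top_words(tweets, n=10):
    """Return the n most frequent words across all tweets with their counts."""
    counts = Counter()
    for tweet in tweets:
        text = tweet.lower()
        text = re.sub(r"https?://\S+", " ", text)  # drop links
        text = re.sub(r"@\w+", " ", text)          # drop @mentions
        tokens = re.findall(r"[a-z']+", text)      # hashtags lose their '#'
        counts.update(
            token for token in tokens
            if len(token) > 2 and token not in STOPWORDS
        )
    return counts.most_common(n)

sample = [
    "Refugees are welcome here",
    "Welcome refugees! #RefugeesWelcome",
    "No ban, no wall https://example.com/x",
]
print(top_words(sample, 2))
```

Note that with this tokenizer a hashtag such as #RefugeesWelcome is counted as a single word, "refugeeswelcome", rather than being split into its parts.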
This is just a short summary of my results, but it may give you some background for thinking about the following:
Do you think officials and politicians should analyze Twitter data? Should they draw conclusions from how many negative or positive tweets are published on a topic and adjust their strategy accordingly? Or should they disregard this data and focus on other information sources (e.g. newspapers, polls) instead?