This digitalized world is ideal for user-generated content, which is basically any content created and shared by users of online platforms. Examples are images, tweets, blog posts, videos, reviews or comments. Although this content varies in length and quality, it would be a waste not to use this free content to get to know your customers better. Conducting sentiment analysis can be very useful when monitoring social media to gain a better understanding of the wider public opinion on certain topics. This ability to extract insights from social data is very practical, and the practice is widely adopted.
What is sentiment analysis?
Sentiment analysis, also known as opinion mining, is an automated process of extracting an opinion about a given subject from a text.
How to conduct sentiment analysis?
One possible method is lexicon-based sentiment analysis. First, you need to determine the dimensions. These are aspects of a domain that are relevant to consumers. For instance, if the domain is ‘movies’, then relevant dimensions could be ‘storyline’ or ‘acting’. Second, you need a training text. This is a body of text that is specific to a single dimension of the focal domain, such as a textbook on acting. By removing stopwords from the training text, you get a list of unique words and their occurrence counts, from which you can then compute the likelihood of each particular word being about acting.
This gives the likelihood of each word occurring given each hypothesis, computed with add-one (Laplace) smoothing:

P(wk | h) = (nk + 1) / (n + |Vocabulary|)

Here wk stands for each word in the vocabulary, nk for the number of times that word occurs in the training text, n for the total number of word occurrences in the training text, and |Vocabulary| for the number of unique words across your training texts.
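As a minimal sketch of this likelihood step in Python (assuming a simple whitespace tokenizer and a toy stopword list; a real analysis would use a proper tokenizer and a fuller stopword list):

```python
from collections import Counter

# Toy stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "is", "in", "to", "it"}

def word_likelihoods(training_text, vocabulary_size):
    """Return P(wk | h) for every word in one dimension's training text,
    using the add-one smoothed formula (nk + 1) / (n + |Vocabulary|)."""
    words = [w for w in training_text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)   # nk: occurrences of each word wk
    n = len(words)            # n: total word occurrences after stopword removal
    return {w: (nk + 1) / (n + vocabulary_size) for w, nk in counts.items()}

# Tiny made-up "training text" for the acting dimension.
acting_text = "the acting in the film is wooden and the acting lacks depth"
likelihoods = word_likelihoods(acting_text, vocabulary_size=10)
print(likelihoods["acting"])  # → 0.1875, i.e. (2 + 1) / (6 + 10)
```

The `vocabulary_size` parameter is passed in rather than computed, because the vocabulary spans all training texts, not just this one.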
Then, the posterior probability is computed with Naïve Bayes through the following steps:
P(h): the prior probability of h; the probability that some hypothesis h is true. In this case, h is the dimension.
P(h | D): the posterior probability of h; the probability of h given some data D (the lexicon words found in the text).
P(D | h): the probability that some data value holds given the hypothesis.
P(D): the probability that some training data will be observed (Zacharski et al., 2015).
Bayes’ theorem combines these as P(h | D) = P(D | h) · P(h) / P(D).
(For an example see https://www.youtube.com/watch?v=doznOnG81xY&t=82s)
So, when analysing a user-generated text, P(h | D) is updated each time a word is found that occurs in your lexicon.
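That updating step can be sketched as follows. The lexicons, priors, and fallback probability below are made-up values for illustration, not real data, and the scores are accumulated in log-space to avoid numerical underflow:

```python
import math

STOPWORDS = {"the", "a", "an", "and", "of", "is", "in", "to", "from"}
DEFAULT_P = 1e-4  # assumed smoothed fallback for a word missing from one lexicon

def classify(text, lexicons, priors):
    """Return the dimension h maximising log P(h) + sum of log P(wk | h)
    over the lexicon words found in the text."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    # Only words found in at least one lexicon update the posterior.
    words = [w for w in words if any(w in lex for lex in lexicons.values())]
    scores = {}
    for h, lexicon in lexicons.items():
        score = math.log(priors[h])
        for w in words:
            score += math.log(lexicon.get(w, DEFAULT_P))
        scores[h] = score
    return max(scores, key=scores.get)

# Hypothetical likelihoods, hand-set for illustration only.
lexicons = {
    "acting":    {"performance": 0.05, "wooden": 0.03, "cast": 0.04},
    "storyline": {"plot": 0.06, "twist": 0.04, "development": 0.05},
}
priors = {"acting": 0.5, "storyline": 0.5}

print(classify("a wooden performance from the whole cast", lexicons, priors))
# → acting
```

Note that a word appearing in only one lexicon still contributes a small `DEFAULT_P` likelihood to the other dimensions, so every dimension multiplies the same number of factors and the scores stay comparable.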
The same method is applied to measure sentiment. You need lexicons of positive and negative words, remove the stopwords, compute the likelihoods, and compute the posterior probability.
Limitations
This method comes with some challenges, such as sarcasm, which is difficult to detect in user-generated texts. Also, those texts can contain slang or dialect, which makes them harder to analyse. I am curious to find out more about this. So, what other methods do you know of, and which one do you think is most accurate?
Sources
Zacharski et al. (2015). Chapter 6 Naïve Bayes. Retrieved from http://guidetodatamining.com/chapter6/
Zacharski et al. (2015). Chapter 7 Classifying unstructured text. Retrieved from http://guidetodatamining.com/chapter7/
Hi Cindy,
Great read!
I work with sentiment analysis often, so this was an interesting read for me. I was wondering, how do you determine the dimensions of a post? Say there are 100,000 Twitter posts, would you need to give each of them an individual dimension?
Most sentiment analyses are just done with an individual word analysis, using the sentiment of each word. Say a post contains “awesome”, “great” or “nice” a lot, then it is perceived as positive. Accuracy is difficult, because this really depends on the company/topic you’re analysing. For example, when making a report on a company selling weed control, seemingly negative posts are actually positive for that company.
Thanks!
Hi Cas,
You would use the training text of each chosen dimension to calculate the probability of a word appearing in a certain dimension. For example, a training text about storylines may contain the word “development” 310 times, while a training text about acting may contain the word “development” 40 times. You can calculate the probability of the word “development” being used in the context of storylines by dividing 310 by (310 + 40). You do this for each word in your training text. You can then use some code to automatically go through each word in your Twitter post and calculate the probability of the post being about a certain dimension using Naive Bayes. This way you can differentiate between the overall sentiment and the sentiment per dimension.
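That calculation in code, using the hypothetical counts from the example:

```python
# Occurrences of "development" in each dimension's training text (example counts).
count_storyline = 310
count_acting = 40

# P("development" is used in the context of storylines)
p_storyline = count_storyline / (count_storyline + count_acting)
print(round(p_storyline, 3))  # → 0.886
```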
I’m not the best at explaining things, but I hope this somewhat answers your question.