Today's digital world is full of user-generated content: anything created and shared by users of online platforms, such as images, tweets, blog posts, videos, reviews, and comments. Although this content varies in length and quality, it would be a waste not to use it to get to know your customers better. Sentiment analysis is especially useful when monitoring social media, because it helps you understand the wider public opinion on a topic. Extracting such insights from social data is highly practical, and the practice is widely adopted.
What is sentiment analysis?
Sentiment analysis, also known as opinion mining, is the automated process of extracting an opinion about a given subject from text.
How to conduct sentiment analysis?
One possible method is lexicon-based sentiment analysis. First, you need to determine the dimensions. These are aspects of a domain that are relevant to consumers. For instance, if the domain is ‘movies’, relevant dimensions could be ‘storyline’ or ‘acting’. Second, you need a training text: a body of text that is specific to a single dimension of the focal domain, such as a textbook on acting. By removing stopwords from the training text, you obtain a list of unique words and their occurrence counts, from which you can compute the likelihood that a particular word is about acting.
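The step above can be sketched in a few lines of Python. The stopword list and the tiny ‘acting’ training text are made up for illustration; a real lexicon would be built from a much larger corpus. The likelihoods use the Laplace-smoothed formula discussed below.

```python
from collections import Counter

# Small illustrative stopword list; real projects use a fuller one (e.g. NLTK's).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def word_likelihoods(training_text):
    """Count each non-stopword and convert counts to smoothed likelihoods.

    P(w_k | dimension) = (n_k + 1) / (n + |Vocabulary|), where n_k is the
    count of word w_k, n the total word count after stopword removal, and
    |Vocabulary| the number of unique words in the training text.
    """
    words = [w for w in training_text.lower().split() if w not in STOPWORDS]
    counts = Counter(words)
    n = sum(counts.values())
    vocab_size = len(counts)
    return {w: (c + 1) / (n + vocab_size) for w, c in counts.items()}

# Toy training text for the 'acting' dimension (invented for this example).
acting_text = "the actor delivered a powerful performance the acting was subtle"
likelihoods = word_likelihoods(acting_text)
```

Each surviving word here occurs once, so every likelihood works out to (1 + 1) / (7 + 7); with a realistic training text the frequent, dimension-typical words would dominate.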
This gives the likelihood of each word occurring given the hypothesis:

P(w_k | h) = (n_k + 1) / (n + |Vocabulary|)

Here w_k stands for each word in the vocabulary, n_k is the number of times w_k occurs in the training text, n is the total number of words in the training text, and |Vocabulary| is the number of unique words in it. The added 1 (Laplace smoothing) prevents a word that never appears in the training text from receiving a probability of zero.
Then, the posterior probability is computed with Naïve Bayes from the following terms:
P(h): the prior probability of h; the probability that some hypothesis h is true. Here, h is the dimension.
P(h | D): the posterior probability of h; the probability of h given some data D (the lexicon words).
P(D | h): the probability that some data value holds given the hypothesis.
P(D): the probability that some training data will be observed (Zacharski et al., 2015).
These combine via Bayes’ rule: P(h | D) = P(D | h) · P(h) / P(D).
(For an example see https://www.youtube.com/watch?v=doznOnG81xY&t=82s)
So, when analysing a user-generated text, P(h | D) is updated each time a word is found that occurs in your lexicon.
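That update can be sketched as follows. The per-dimension likelihoods and priors below are invented toy numbers; the scores use log probabilities, a standard trick to avoid numeric underflow when many small probabilities are multiplied. (A real implementation would also smooth words missing from a lexicon; here both lexicons cover the same words.)

```python
import math

# Hypothetical word likelihoods P(word | dimension), e.g. produced from
# per-dimension training texts; the numbers are made up for illustration.
likelihoods = {
    "storyline": {"plot": 0.05, "twist": 0.04, "acting": 0.01},
    "acting":    {"plot": 0.01, "twist": 0.01, "acting": 0.06},
}
priors = {"storyline": 0.5, "acting": 0.5}  # P(h): dimensions assumed equally likely

def classify(text):
    """Update log P(h | D) for each word that occurs in a dimension's lexicon,
    then return the dimension with the highest posterior score."""
    scores = {h: math.log(p) for h, p in priors.items()}
    for word in text.lower().split():
        for h, lexicon in likelihoods.items():
            if word in lexicon:
                scores[h] += math.log(lexicon[word])
    return max(scores, key=scores.get)

print(classify("a surprising plot twist"))  # 'storyline' under these toy numbers
```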
The same method is applied to measure sentiment: build a lexicon of positive words and one of negative words, remove the stopwords, compute the likelihoods, and compute the posterior probability.
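For sentiment, the hypotheses simply become ‘positive’ and ‘negative’. A minimal sketch, again with invented likelihood values standing in for ones learned from real positive and negative lexicons:

```python
import math

# Toy likelihoods P(word | sentiment); values are invented for illustration.
sentiment_likelihoods = {
    "positive": {"great": 0.07, "boring": 0.01, "loved": 0.05},
    "negative": {"great": 0.01, "boring": 0.06, "loved": 0.01},
}

def sentiment(review, prior=0.5):
    """Return 'positive' or 'negative' via the same Naive Bayes update;
    words absent from both lexicons are simply skipped."""
    scores = {s: math.log(prior) for s in sentiment_likelihoods}
    for word in review.lower().split():
        for s, lexicon in sentiment_likelihoods.items():
            if word in lexicon:
                scores[s] += math.log(lexicon[word])
    return max(scores, key=scores.get)
```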
Limitations
This method comes with some challenges. Sarcasm, for example, is difficult to detect in user-generated texts. These texts can also contain slang or dialect, which makes them harder to analyse. I am curious to find out more about this. So, what other methods do you know of, and which one do you think is most accurate?
Sources
Zacharski et al. (2015). Chapter 6 Naïve Bayes. Retrieved from http://guidetodatamining.com/chapter6/
Zacharski et al. (2015). Chapter 7 Classifying unstructured text. Retrieved from http://guidetodatamining.com/chapter7/