Onderzoeksmethoden 2/het werk/2011-12/Group 4
Introduction
Many people use Twitter to communicate about all kinds of things, and tweets are usually grouped by subject. Because Twitter is a social medium, people tend to express their feelings on it, often through characteristic words or emoticons. We would like to know whether the content of tweets on a given subject can be analyzed automatically to determine the emotional load of that subject. If so, the findings could be presented in a graph that visualizes the results over a certain time frame.
Research
Research question
Is it possible to determine the emotion value of a given subject over time using Twitter?
Subquestion
Is it possible to determine the emotion value of a certain tweet using a predefined list of weighted emotion words?
Relevancy
The goal is to investigate these questions through content analysis of mass social media. To do so, we compiled a list of emotion words without distinguishing between positive and negative words. These words are used to search the text of the tweets. Not only the words but also the analyzed content must be relevant to the research; that is, the subject must carry a certain emotional load for the analysis to deliver a significant result. We did not distinguish between the people who submitted the tweets, so the results say nothing about their backgrounds.
Method
To determine the emotional value of a tweet, we need a list of emotion words in which each word has a weight.
Emotion word list
We first reviewed over 300 Dutch tweets and extracted all words describing some sort of emotion, which resulted in a list of 175 words. To determine the emotional value of each word, each of the three group members rated all 175 words on a scale from 1 to 5 (where 1 is a mildly emotional word and 5 a highly emotional word). The average of those three ratings is the word's final emotional value. (For more accurate values, more people should be asked to rate the words.)
We realize that we cannot cover all emotional words, but this is a good starting point that covers most words, and more words can easily be added later.
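The averaging step can be sketched in a few lines of Python. This is only an illustration: the two words and their ratings are invented placeholders rather than entries from our list, and the name `average_weight` is hypothetical.

```python
from statistics import mean

# Invented example ratings (scale 1-5) from the three group members.
# These are placeholders, not values from the actual word list.
ratings = {
    "blij": [2, 3, 2],
    "woedend": [5, 5, 4],
}

def average_weight(scores):
    """Return the final emotion value of a word: the mean of its ratings."""
    return mean(scores)

# Map every word to its averaged weight.
weights = {word: average_weight(scores) for word, scores in ratings.items()}
```

Asking more people to rate the words would only lengthen each list of ratings; the averaging itself stays the same.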
Analyzing the data
Determine emotion value of a tweet
To determine the emotion value of a tweet, we split the tweet into words and match those words against our list of emotion words. The emotion value of a tweet is the sum of the weights of all emotion words in the tweet.
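A minimal Python sketch of this matching step is given below. The weights are invented placeholders, the tokenization (lowercasing and splitting on non-word characters) is an assumption, and so is the choice to count a word every time it occurs; the name `tweet_emotion_value` is hypothetical.

```python
import re

# Invented excerpt of a weighted word list (word -> weight on the 1-5 scale).
EMOTION_WEIGHTS = {"blij": 2.3, "woedend": 4.7, "jammer": 2.0}

def tweet_emotion_value(text, weights=EMOTION_WEIGHTS):
    """Sum the weights of all emotion words that occur in the tweet."""
    # Assumed tokenization: lowercase, then split on non-word characters.
    words = re.findall(r"\w+", text.lower())
    # Words that are not in the list contribute 0; repeats count each time.
    return sum(weights.get(word, 0) for word in words)
```

For example, `tweet_emotion_value("Woedend! Echt jammer van PSV")` adds the weights of "woedend" and "jammer", while a tweet without any listed word gets the value 0.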
Determine the emotion value over a set of tweets in time
To determine the emotion value over a set of tweets in time, we determine the emotion value of each tweet in the set and take the average over all tweets.
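As a small Python sketch, assuming the per-tweet emotion values have already been computed, the average over a set could look like this. The function name `set_emotion_value` and the handling of an empty set are assumptions.

```python
def set_emotion_value(tweet_values):
    """Average emotion value over a set of tweets.

    tweet_values: list of per-tweet emotion values, including zeros.
    Returns 0.0 for an empty set (an assumed convention).
    """
    if not tweet_values:
        return 0.0
    return sum(tweet_values) / len(tweet_values)
```

Tweets without emotion words count with value 0, so they pull the average down.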
Subjects
The subjects are drawn from Twitter and need to carry some emotional load to be significant to our research. First we tried to analyze a few trending topics, but the results were not significant enough and did not satisfy us. Next we analyzed a football match. The results of that investigation were striking and are shown below.
ORM Model
Constraints still need to be added!
Testing our method
Before describing a precise method, we first tried out the list of words we had already gathered on a specific subject. We tried a football match and filtered on the word "PSV". This is what we found:
Rain: Cumulative
When it rains, people express their emotions on Twitter! The chart shows the cumulative emotion value for every 2 minutes. This method does not divide by the total number of tweets; it simply adds up the emotion values. We will use this data to try our other methods. (Filtered word: regen, Dutch for "rain")
Rain: Divided by total
This chart shows the emotion value for every 2 minutes (if there was any emotion), divided by the total number of tweets in those 2 minutes. It uses the same rain data as above. (Filtered word: regen)
Twitter Luik
The difference between the cumulative and the divided chart is much more subtle than before. This is probably because we examined the tweets over a shorter time period.
Cumulative
To get this diagram we used the following approach: first we scanned each tweet for emotion words, summed the weights of all emotion words in the tweet, and stored that sum with the tweet in our database. We then added up those values per minute and plotted the result in our diagram.
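The per-minute summation can be sketched in Python as follows. This is not our actual database code: the `(timestamp, value)` pairs stand in for the stored tweets, the sample data is invented, and the name `cumulative_per_minute` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def cumulative_per_minute(scored_tweets):
    """Sum tweet emotion values per minute.

    scored_tweets: iterable of (timestamp, emotion_value) pairs.
    Returns a dict mapping each minute to the summed emotion value.
    """
    totals = defaultdict(float)
    for timestamp, value in scored_tweets:
        # Truncate the timestamp to the start of its minute.
        minute = timestamp.replace(second=0, microsecond=0)
        totals[minute] += value
    return dict(sorted(totals.items()))

# Invented sample: two tweets in one minute, one in the next.
sample = [
    (datetime(2012, 1, 1, 20, 0, 5), 4.0),
    (datetime(2012, 1, 1, 20, 0, 40), 0.0),
    (datetime(2012, 1, 1, 20, 1, 10), 2.5),
]
per_minute = cumulative_per_minute(sample)
```

Because nothing is divided by the number of tweets, a busy minute scores high even when each individual tweet is only mildly emotional.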
Divided by total number of tweets
To get this diagram we used the following approach:
First we scanned each tweet for emotion words, calculated the average weight of all emotion words in the tweet, and stored that average with the tweet in our database. After that we took the average value of all tweets for every minute (including those with an emotion value of 0) and plotted that value in this diagram. The result is a view of the degree of emotion for this subject at each moment, on the 1-to-5 weight scale (lower when many tweets in that minute contain no emotion words).
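A Python sketch of this two-stage averaging is shown below. The weights and tweets are invented placeholders, the tokenization is an assumption, and the names `tweet_average_weight` and `average_per_minute` are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime
from statistics import mean

def tweet_average_weight(text, weights):
    """Mean weight of the emotion words in a tweet, or 0.0 if it has none."""
    words = re.findall(r"\w+", text.lower())  # assumed tokenization
    found = [weights[word] for word in words if word in weights]
    return mean(found) if found else 0.0

def average_per_minute(tweets, weights):
    """Mean per-tweet value for each minute, counting zero-valued tweets."""
    buckets = defaultdict(list)
    for timestamp, text in tweets:
        minute = timestamp.replace(second=0, microsecond=0)
        buckets[minute].append(tweet_average_weight(text, weights))
    return {minute: mean(values) for minute, values in sorted(buckets.items())}

# Invented weights and tweets, for illustration only.
example_weights = {"blij": 2.0, "woedend": 5.0}
example_tweets = [
    (datetime(2012, 1, 1, 20, 0, 5), "zo blij en woedend tegelijk"),
    (datetime(2012, 1, 1, 20, 0, 30), "PSV speelt vanavond"),
    (datetime(2012, 1, 1, 20, 1, 0), "woedend"),
]
averages = average_per_minute(example_tweets, example_weights)
```

In the first minute of the sample, the tweet without emotion words halves the average, which is the effect of including zero-valued tweets.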
Future research
Since Twitter is an excellent source of large amounts of data, it can be used for all kinds of statistical data processing. It would be interesting to see whether positive and negative tweets can be distinguished, although this may be difficult because it requires looking at the context. For instance, the word "not" in front of an emotion word should invert its value. So there is a lot to consider when implementing such a technique. The main challenge would be to analyze combinations of nouns and adverbs, since these can have different meanings.
Furthermore, it would be a good idea to try the current algorithms on another topic. We looked, for instance, at the 'Luik' case: the shootings that occurred in the Belgian city of Luik (Liège). Everybody was shocked because innocent people were killed there, so most tweets were emotional messages. Hence you will not see a clear distinction between the number of tweets per minute and the cumulative number of tweets per minute, since almost every tweet carries some emotion.
List with emotion words and individual scores
The emotion words in the list are taken from tweets. The team members scored each word on a scale from 1 to 5, with 1 the lowest and 5 the highest emotional value. The list is sorted by the average score of the three members. The two columns on the right show the frequency of each weighted emotion level.