How we Built TVFMO #4 – Calculating Sentiment

Last time we talked about how we calculated the top countries measure in TVFMO. This time, we’re going to look at how we calculate sentiment, more specifically how we rate a post as positive or negative.

There are two ways of doing this: deterministically or heuristically. For technical reasons I have used the former, and that is the method we are going to examine in this post. However, we may look at the heuristic method later, just for the sake of completeness.

The algorithm I use for deterministically scoring a Tweet as negative or positive is straightforward:

  1. Take a sample of the tweets from the domain under analysis
  2. Extract the adjectives
  3. Weight each adjective as positive (+1) or negative (-1)
  4. For each tweet sent
    1. Tokenize the words
    2. For each word
      1. Does it appear in the weighted dictionary?
        1. Yes – retrieve that word’s weighting and add to accumulator
        2. No – Continue.
    3. If the accumulator for the Tweet is > 0 then increment positive total
    4. If the accumulator for the Tweet is < 0 then increment the negative total
  5. Cache positive and negative totals.
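
The steps above can be sketched in Python as follows. This is a minimal illustration rather than the TVFMO code itself: the function names `score_tweet` and `count_sentiment`, the simple whitespace tokenizer, and the sample dictionary and tweets are all assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple


def score_tweet(text: str, weights: Dict[str, int]) -> int:
    """Sum the weights of every word in the tweet that is in the dictionary."""
    accumulator = 0
    # Assumption: a lowercase whitespace split stands in for a real tokenizer.
    for token in text.lower().split():
        word = token.strip(".,!?;:\"'()")
        accumulator += weights.get(word, 0)  # unknown words contribute nothing
    return accumulator


def count_sentiment(tweets: Iterable[str], weights: Dict[str, int]) -> Tuple[int, int]:
    """Return (positive_total, negative_total) for a batch of tweets."""
    positive = negative = 0
    for text in tweets:
        score = score_tweet(text, weights)
        if score > 0:
            positive += 1
        elif score < 0:
            negative += 1
        # A score of exactly 0 is neutral and is not counted.
    return positive, negative


# Hypothetical weighted dictionary and tweets, for illustration only.
SAMPLE_WEIGHTS = {"great": 1, "brilliant": 1, "awful": -1, "boring": -1}
SAMPLE_TWEETS = [
    "What a great show, brilliant cast!",
    "That episode was awful.",
    "Watching telly tonight",
    "Great start but boring and awful ending",
]

if __name__ == "__main__":
    print(count_sentiment(SAMPLE_TWEETS, SAMPLE_WEIGHTS))
```

The two totals returned by `count_sentiment` are the values that step 5 would write to the cache.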

As you can see, in this particular analysis we are not interested in neutral tweets (those whose accumulator ends up at exactly 0), so they are ignored. Note also that the algorithm can be extended: instead of weighting each word simply as positive or negative, we can weight it by how positive or how negative it is, like so:

-3 = Very negative
-2 = Negative
-1 = Somewhat negative
0 = Neither negative nor positive
1 = Somewhat positive
2 = Positive
3 = Very positive

This would allow us to state not only that the sentiment of a tweet was negative or positive, but also how negative or positive it was.
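
With graded weights, the same accumulator also gives a strength. The sketch below assumes the -3 to +3 scale above and a hypothetical dictionary. It clamps a tweet's total to that scale so it can be mapped back to a label, which is one possible design choice and not necessarily how TVFMO would do it.

```python
from typing import Dict

# Labels for the -3..+3 scale described above.
LABELS = {
    -3: "Very negative",
    -2: "Negative",
    -1: "Somewhat negative",
    0: "Neither negative nor positive",
    1: "Somewhat positive",
    2: "Positive",
    3: "Very positive",
}

# Hypothetical graded dictionary, for illustration only.
GRADED_WEIGHTS = {
    "superb": 3, "good": 2, "okay": 1,
    "dull": -1, "bad": -2, "dreadful": -3,
}


def graded_score(text: str, weights: Dict[str, int]) -> int:
    """Sum the graded weights of the words in the tweet."""
    return sum(weights.get(word, 0) for word in text.lower().split())


def describe(score: int) -> str:
    """Clamp the score to the -3..+3 scale and return its label."""
    clamped = max(-3, min(3, score))
    return LABELS[clamped]


if __name__ == "__main__":
    total = graded_score("a superb but dull finale", GRADED_WEIGHTS)
    print(total, describe(total))
```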

So the first thing we have to do is gather a sample of tweets, say all those from the last hour, extract the adjectives and write them to a file to be weighted:

image

The function getTweetsForLastHour() is a helper function…

image

This, in turn, calls two more helper functions, lastHourInTwitterDateFormat() and getTweetText(), which are self-explanatory…

image

image

The rest of the harvester code tokenizes the words and tags their parts of speech, then selects those that are tagged as adjectives and don’t start with // (this is how we exclude the shortened URLs, which the default POS tagger in the nltk library also tags as adjectives).
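
As a rough sketch of that filtering step (not the original harvester code), the function below takes (word, tag) pairs of the kind a POS tagger produces and keeps the adjectives. The name `select_adjectives` and the sample pairs are assumptions for illustration. The tag test relies on the Penn Treebank convention that adjective tags begin with JJ.

```python
from typing import Iterable, List, Set, Tuple


def select_adjectives(tagged_words: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the unique adjectives from (word, tag) pairs, sorted alphabetically.

    Penn Treebank adjective tags are JJ, JJR and JJS, so a prefix test covers them.
    """
    adjectives: Set[str] = set()
    for word, tag in tagged_words:
        # Skip URL fragments such as //t.co/..., which can be mis-tagged as adjectives.
        if tag.startswith("JJ") and not word.startswith("//"):
            adjectives.add(word.lower())
    return sorted(adjectives)


# With NLTK installed and its tokenizer and tagger data downloaded, the pairs
# could come from: nltk.pos_tag(nltk.word_tokenize(tweet_text))
# The pairs below are hand-written for illustration, not real tagger output.
SAMPLE_TAGGED = [
    ("The", "DT"), ("new", "JJ"), ("series", "NN"), ("is", "VBZ"),
    ("Brilliant", "JJ"), ("http", "NN"), (":", ":"), ("//t.co/abc123", "JJ"),
]

if __name__ == "__main__":
    print(select_adjectives(SAMPLE_TAGGED))
```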

The created file can then be opened in Excel and a weighting added to the adjectives.

image

After the adjectives are weighted, we can score each day’s tweets.

image

Here we create a datetime object from the day and month passed in and get the day range in Twitter date format, before retrieving the tweets for that day. Having done that, we score all the tweets and cache the positive and negative totals, by day, in Redis.

The code to get the day range, in Twitter format, is as follows.

image
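
For readers without the listing to hand, here is a minimal sketch of how such a day range might be computed. It assumes that "Twitter date format" means the `created_at` style used by the classic Twitter API (for example `Wed Aug 27 13:08:45 +0000 2008`). The function name `day_range_in_twitter_format` and the explicit `year` parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Assumed format, matching timestamps such as "Wed Aug 27 13:08:45 +0000 2008".
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def day_range_in_twitter_format(year: int, month: int, day: int) -> Tuple[str, str]:
    """Return (start, end) strings for one UTC day: midnight to the next midnight."""
    start = datetime(year, month, day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)  # timedelta handles month and year rollover
    return start.strftime(TWITTER_DATE_FORMAT), end.strftime(TWITTER_DATE_FORMAT)


if __name__ == "__main__":
    print(day_range_in_twitter_format(2013, 3, 5))
```

Note that `%a` and `%b` follow the process locale, so the English day and month names shown here assume the default C locale.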

Scoring the Tweets is done using the following function.

image

The word scores are retrieved from the cache, like so.

image

And, if required, the cache is built using this code.

image
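
The build-on-demand pattern described here could look something like the sketch below. It is an illustration under several assumptions: the weighted file is a two-column CSV (adjective, weighting), a module-level dictionary stands in for the real cache, and the names `build_word_scores` and `get_word_scores` are hypothetical.

```python
import csv
import io
from typing import Callable, Dict, Optional, TextIO

# In-memory stand-in for the real cache (an assumption for this sketch).
_word_scores: Optional[Dict[str, int]] = None


def build_word_scores(csv_file: TextIO) -> Dict[str, int]:
    """Parse adjective,weighting rows into a word -> score dictionary."""
    scores: Dict[str, int] = {}
    for row in csv.reader(csv_file):
        if len(row) < 2 or not row[1].strip():
            continue  # skip blank lines and adjectives left unweighted
        try:
            scores[row[0].strip().lower()] = int(row[1])
        except ValueError:
            continue  # skip a header row or a non-numeric weighting
    return scores


def get_word_scores(open_file: Callable[[], TextIO]) -> Dict[str, int]:
    """Return the cached scores, building the cache first if it is empty."""
    global _word_scores
    if _word_scores is None:
        with open_file() as csv_file:
            _word_scores = build_word_scores(csv_file)
    return _word_scores


# Hypothetical file contents, for illustration only.
SAMPLE_CSV = "adjective,weight\ngreat,1\nawful,-1\nodd,\n"

if __name__ == "__main__":
    print(get_word_scores(lambda: io.StringIO(SAMPLE_CSV)))
```

Passing in a callable that opens the file keeps the sketch testable; in practice it might be something like `lambda: open(path, newline="")`.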

In this post I’ve demonstrated how we deterministically calculate the sentiment for each Tweet in a given domain. If you have any questions please add them to the comments section or email me at [email protected].

Until next time, happy coding!

One thought on “How we Built TVFMO #4 – Calculating Sentiment”

  1. David McCallum

    Excellent as usual, gimme more 🙂
