Last time we chatted about how we access the data we create on TVFMO, via NodeJS. This time I thought we’d start to look at how we calculate the data for some of the graphs that we show.
Let’s start with the Top 5 Countries by Posting Volume. Here we wish to display the five countries whose Tweeters have posted the most. Obviously, this only counts the Tweets where the place.country attribute is present. That attribute is not guaranteed to be there; in fact, Twitter say:
Place Attributes are metadata about places and allow any user or application to add arbitrary metadata to a place.
So not only does the user’s application have to support this, but the user themselves must choose to make this information available. This measure, then, is really a function both of how many people within a country are tweeting and of how many of those are open to sharing their location data. It is interesting nonetheless.
The first thing we need to do is to acquire all the tweets, with the place attribute set:
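The original snippet isn’t reproduced here, but the shape of it is roughly this minimal sketch — the helper name `tweets_with_place` is my assumption, standing in for whatever the real code calls it:

```python
from datetime import datetime, timedelta

def yesterday():
    """A datetime object representing yesterday, as passed to the helper."""
    return datetime.utcnow() - timedelta(days=1)

# Hypothetical helper name; it answers a cursor over yesterday's
# tweets whose place attribute is set, as described below.
# tweets = tweets_with_place(yesterday())
```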
As you can see, I use a helper function, into which I pass a datetime object representing yesterday.
Inside this helper function I call another helper function, which returns a tuple: the first element is a string representation of the start of the day, and the second a string representation of the end of the day. We use string representations because that is how the timestamps are stored in the Tweet. As you can see, this second helper function simply uses a string format function to transform the datetime into the required strings:
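A sketch of that second helper might look like the following. The function name is mine, and the timestamp format is an assumption — it has to match however created_at is actually stored in the tweet documents:

```python
from datetime import datetime

# Hypothetical format string: literal times are substituted in, the
# rest is expanded by strftime. Must match the stored timestamp format.
DAY_FORMAT = "%a %b %d {} +0000 %Y"

def day_range(day):
    """Answer a (start, stop) tuple of string timestamps spanning the
    given day: one second after midnight to one second before the
    following midnight."""
    start = day.strftime(DAY_FORMAT.format("00:00:01"))
    stop = day.strftime(DAY_FORMAT.format("23:59:59"))
    return start, stop
```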
Having calculated the start and stop values for our range, we execute a MongoDB query which basically says “give me the place attribute from all of the tweets that were made between 1 second after midnight yesterday morning and 1 second before midnight last night, where the place attribute is set”. Having executed that query, we close the connection and return the cursor.
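That query might be sketched like this, assuming pymongo. The function names, database handle, collection name, and field names are all assumptions — only the query’s intent comes from the post:

```python
def place_query(start, stop):
    """Build the query: the place attribute of every tweet made between
    start and stop, where the place attribute is set."""
    spec = {"created_at": {"$gte": start, "$lte": stop},
            "place": {"$ne": None}}
    projection = {"place": 1, "_id": 0}   # only pull back place
    return spec, projection

def tweets_with_place(db, start, stop):
    """Execute the query against a pymongo database handle, close the
    connection, and answer the cursor. Collection name is assumed."""
    spec, projection = place_query(start, stop)
    cursor = db.tweets.find(spec, projection)
    db.client.close()
    return cursor
```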
Next we pull back the country/count dictionary from the Redis cache; if this is our first time around, we answer a new dictionary:
As you can see, we use the json module to dumps() and loads() our dictionary object, which stops Redis from changing it from a dictionary into a list of key/value pairs as it dehydrates and rehydrates it.
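A sketch of that round trip, assuming a redis-py client — the cache key and function names are my inventions:

```python
import json

CACHE_KEY = "top_countries"  # hypothetical key name

def load_counts(cache):
    """Answer the cached country/count dictionary, or a new empty one
    on the first run, when nothing has been cached yet."""
    raw = cache.get(CACHE_KEY)
    return json.loads(raw) if raw else {}

def save_counts(cache, counts):
    # dumps() flattens the dict to a JSON string for storage;
    # loads() restores it to a dict on the way back out.
    cache.set(CACHE_KEY, json.dumps(counts))
```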
Having done that, we increment the country count, as appropriate:
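The counting step is the straightforward part; something along these lines, with the function name assumed and the document shape following the Tweet JSON (place.country):

```python
def count_countries(tweets, counts):
    """Increment the tally for each tweet's place.country attribute.
    `tweets` is an iterable of documents shaped like the Tweet JSON."""
    for tweet in tweets:
        country = tweet["place"]["country"]
        counts[country] = counts.get(country, 0) + 1
    return counts
```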
Before sorting the dictionary by value and returning the top 5. This is then cached in the Redis server:
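The sort itself can be sketched as below; the function name is mine, and the exact shape the real code returns (pairs versus a dict) is an assumption:

```python
def top_five(counts):
    """Sort the country/count dictionary by value, descending, and
    answer the top 5 as (country, count) pairs."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:5]

# The result would then be dumps()'d to JSON and set() into the Redis
# cache, in the same way as the running counts.
```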
This code is run as a cron job, as soon after midnight as the server is able.
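The actual schedule isn’t shown; a crontab entry along these lines would do it, though the minute chosen, the interpreter path, and the script path are all placeholders:

```
# min hour dom mon dow  command  -- run one minute after midnight
1 0 * * * /usr/bin/python /opt/tvfmo/top_countries.py
```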
The other half of this story is serving the cached data, in graph form, when the client requests it. The code to do that is straightforward. First, we check to see whether the data is cached and, if it is, we load it up.
Then it’s just a case of creating the graph, adding the data, and returning a JSON string of the graph to the client. As you can see, we show the number of Tweets per country as a percentage of the total Tweets that contain the place attribute, rather than as a percentage of all Tweets made.
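The post doesn’t show which charting library builds the graph, so this sketch stops at computing the percentage series and serialising it. The function names and payload shape are assumptions, and the percentage base here is the sum of the cached counts — the live code presumably totals all place-bearing tweets:

```python
import json

def as_percentages(top5):
    """Express each country's count as a percentage of the place-bearing
    tweets -- approximated here by the sum of the cached counts."""
    total = sum(count for _, count in top5)
    return [(country, round(100.0 * count / total, 1))
            for country, count in top5]

def graph_json(top5):
    # The real code adds this series to a chart object; the payload
    # shape is an assumption, serialised for the client with dumps().
    return json.dumps({"series": as_percentages(top5)})
```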
Well that about wraps it up for this post. Join us next time when we’ll be looking at how we created another of the measures that you see on the http://bit.ly/olympicStats page.