If you want to analyse tweets, the first thing you have to do is capture and store them. To do that we chose MongoDB. The main reason we chose MongoDB was its use of BSON: data is binary both in storage and over the wire, which makes it fast and efficient.
Having done that, the next step was to create a script that would consume the Twitter Streaming API and store each tweet in the database. We wrote a Python script to do that. Step 1 is to open a connection to the database, using PyMongo:
We want to do this outside of the main processing loop so that we don’t suffer the overhead of connecting to the database every time we receive a tweet.
The next step is to create a listener and attach it to the Twitter stream, specifying some form of filter. In our case, we're only interested in the official hashtags. We used the Tweepy module for this as it has support for the Streaming API. The consumer_key, consumer_secret, access_token and access_token_secret variables all hold values that will be specific to you and are obtainable via your Twitter app's page. Pembroke is just the code name we gave this tailored instance of Socialyze.
Following on from that, we have to define our subclass of StreamListener and specify what should happen on the various events that will occur. In our case we are really only interested in two events.
The first is on_data, which fires when data arrives. Twitter guarantees that this will be one complete tweet, if a tweet is sent; it may also be various "keeping the connection open" characters. When we receive data via the event, we check that it's not a digit and that it's not an empty string. If all's well, we re-hydrate it into an object using the json module (this prevents the tweet being stored as a string representation of a string representation of an object) and then insert it into the database.
Twitter will punish you if you do not consume the stream fast enough (more on this later), so it's wise not to do any processing in the script that reads the stream. Instead, our plan was for the reading script simply to place each tweet on a RabbitMQ queue, from where it could be processed by one or more queue consumers as required. However, our script had no problem keeping up with the velocity of the stream (~13 tweets per second), so we never implemented that solution.
The second event that we are interested in is the on_error event. Upon receiving that event, we’ll just record it in a log and carry on.
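Putting the two handlers together, the listener might look like the sketch below. It is shown as a plain class so the logic is self-contained; in the real script it subclasses tweepy.StreamListener, and collection is the PyMongo collection opened earlier:

```python
import json
import logging

class TweetListener:  # in the real script: class TweetListener(tweepy.StreamListener)
    def __init__(self, collection):
        self.collection = collection

    def on_data(self, data):
        # Twitter sends keep-alive characters as well as complete tweets,
        # so skip anything that is empty or purely a digit.
        if data and data.strip() and not data.strip().isdigit():
            # Re-hydrate the JSON string into a dict before inserting, so the
            # tweet is stored as a document rather than as a string.
            self.collection.insert_one(json.loads(data))
        return True  # keep the stream open

    def on_error(self, status_code):
        # Record the error in a log and carry on.
        logging.error("Twitter stream error: %s", status_code)
        return True
```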
So, having created our script to capture the tweets, we have to run it somehow. Not only that, but we have to deal with the occasions where Twitter might close the connection from their end, such as when you are not consuming the feed fast enough, or when the shard you are connected to dies. So not only do we have to kick off our script, we also have to restart it if it gets killed because the stream was closed out from underneath it.
Happily this is fairly straightforward using a Linux application called Upstart. Upstart is available out of the box with the Ubuntu distro, and maybe some others, but I don't know for sure.
To use Upstart we create a .conf file, like this one:
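A sketch of such a .conf file; the script path is a placeholder, and the job name matches the ReadTwitterStream commands mentioned below:

```
# /etc/init/ReadTwitterStream.conf
description "Read the Twitter Streaming API and store tweets in MongoDB"

# Start when the machine comes up, stop when it shuts down.
start on runlevel [2345]
stop on runlevel [016]

# Start the script again if it stops for any reason.
respawn

exec /usr/bin/python /path/to/ReadTwitterStream.py
```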
This says: "when the machine starts up, execute this script, start it again if it stops for any reason, and stop running it when the machine shuts down". The .conf file then gets placed in /etc/init.
As well as being started, respawned and stopped automatically, the script can be started, restarted and stopped manually from the command line by issuing start ReadTwitterStream, stop ReadTwitterStream or restart ReadTwitterStream.
That’s really about it for how we capture and store our data. Stay tuned for more information on how we built The View From Mount Olympus.