
How we Built TVFMO #1 – Storing the Tweets

If you want to analyse tweets, the first thing you have to do is capture and store them. To do that we chose MongoDB. The main reason we chose MongoDB was its use of BSON: the data is binary both in storage and over the wire, which makes it fast and efficient.

Having done that, the next step was to create a script that would consume the Twitter Streaming API and store each tweet in the database. We wrote a Python script to do that. Step 1 is to open a connection to the database, using PyMongo:

image

We want to do this outside of the main processing loop so that we don’t suffer the overhead of connecting to the database every time we receive a tweet.
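As a minimal sketch of that idea, the connection can be opened once and the resulting collection handle reused for every tweet. The URI, the database name `pembroke` and the collection name `tweets` are assumptions for illustration, and `MongoClient` and `insert_one` are the calls found in current PyMongo releases, which may differ from what the original script used.

```python
def open_collection(uri="mongodb://localhost:27017",
                    db_name="pembroke", coll_name="tweets"):
    """Open one long-lived connection and return the tweets collection.

    The URI, database name and collection name are hypothetical defaults.
    """
    # Imported here so this sketch can be loaded without PyMongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    return client[db_name][coll_name]


def store_all(tweets, collection):
    """Insert each tweet through the single handle opened before the loop."""
    count = 0
    for tweet in tweets:
        collection.insert_one(tweet)
        count += 1
    return count
```

Calling `open_collection()` once at start-up and passing the result into the processing loop keeps the connection cost out of the per-tweet path.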

The next step is to create a listener and attach it to the Twitter stream, specifying some form of filter. In our case, we’re only interested in the official hashtags. We used the Tweepy module for this as it has support for the Streaming API. The variables consumer_key, consumer_secret, access_token and access_token_secret all hold values that are specific to you and are obtainable via your Twitter app’s page. Pembroke is just the code name we gave this tailored instance of Socialyze.
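The wiring described above might look roughly like the following. This is a sketch against the legacy Tweepy 3.x interface (`OAuthHandler`, `Stream`, `filter(track=...)`), which later Tweepy releases replaced, so treat it as an assumption about the version in use. The helper `track_terms` and the function name `start_stream` are hypothetical, and the hashtags passed in are placeholders because the post does not list the real ones.

```python
def track_terms(hashtags):
    """Normalise hashtag names so each one starts with '#'."""
    return [tag if tag.startswith("#") else "#" + tag for tag in hashtags]


def start_stream(listener, consumer_key, consumer_secret,
                 access_token, access_token_secret, hashtags):
    """Attach the listener to the filtered stream (legacy Tweepy 3.x API)."""
    # Imported here so this sketch can be loaded without Tweepy installed.
    import tweepy

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    stream = tweepy.Stream(auth, listener)
    # filter() blocks and feeds matching tweets to the listener's callbacks.
    stream.filter(track=track_terms(hashtags))
```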

image

Following on from that, we have to define our subclass of StreamListener and state what we wish to happen on the various events that will occur. In our case we are really only interested in two events.

The first is on_data, which is fired when data arrives. Twitter guarantees that, if a tweet is sent, it will be one complete tweet; the data may also be one of the various “connection keeping open” characters. When we receive data via the event, we check that it’s not a digit and that it’s not an empty string. If all’s well we re-hydrate it into an object using the json module (this prevents the tweet being stored as a string representation of a string representation of an object), then insert it into the database.
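The checks described above can be sketched as a plain function that an on_data handler would call. The function names are hypothetical, and `collection` is assumed to be any object with a PyMongo-style `insert_one` method.

```python
import json


def is_tweet_payload(raw):
    """Return True unless the data is empty/whitespace or just digits."""
    text = raw.strip()
    return bool(text) and not text.isdigit()


def store_raw(raw, collection):
    """Parse one raw stream message and insert it; skip keep-alive data."""
    if not is_tweet_payload(raw):
        return False
    # Re-hydrate the JSON text so a document is stored, not a string.
    tweet = json.loads(raw)
    collection.insert_one(tweet)
    return True
```

In Tweepy 3.x, as far as I know, an on_data handler that returns False disconnects the stream, so a listener wrapping `store_raw` would normally return True whether or not anything was stored.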

Twitter will punish you if you do not consume the stream fast enough (more on this later), and so it’s wise not to do any processing in the script that reads the stream. Instead we had a plan for the reading script to simply place the tweet on a RabbitMQ queue, from where it could be processed by one or more queue consumers as required. However, our script had no problem keeping up with the velocity of the stream (~13 tweets per second) so we never implemented that solution.

The second event that we are interested in is the on_error event. Upon receiving that event, we’ll just record it in a log and carry on.
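A minimal sketch of that error handling, using the standard logging module. The logger name and message format are assumptions. In Tweepy 3.x the handler receives the HTTP status code, and to my understanding returning anything other than False keeps the stream running.

```python
import logging

# Hypothetical logger name, chosen to match the job name used with Upstart.
logger = logging.getLogger("ReadTwitterStream")


def log_stream_error(status_code):
    """Record the error and signal that the stream should carry on."""
    logger.error("Twitter stream error: HTTP %s", status_code)
    return True
```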

image

So having created our script to capture the Tweets we have to run it some way. Not only that, but we have to deal with the occasions where Twitter might close the connection from their end. Such circumstances include if you are not consuming the feed fast enough, or if the shard you are connected to dies. So not only do we have to kick off our script, but we have to restart it again if it gets killed due to the stream being closed out from underneath it.

Happily this is fairly straightforward using a Linux application called Upstart. Upstart is available “out of the box” with the Ubuntu distro, and maybe some others, but I don’t know for sure.

To use Upstart we create a .conf file, like this one:

image

This says, “when the machine starts up, execute this script, start it again if it stops for any reason, and stop running it when the machine shuts down”. This .conf file then gets placed in /etc/init.
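For readers who cannot see the image, a job file with that behaviour might look something like the sketch below. The stanzas (`start on`, `stop on`, `respawn`, `exec`) are standard Upstart, but the description text and the script path are placeholders, not the values from our actual file.

```
# /etc/init/ReadTwitterStream.conf
description "Read the Twitter stream and store tweets in MongoDB"

# Start at normal boot runlevels, stop on halt, single-user and reboot.
start on runlevel [2345]
stop on runlevel [016]

# Restart the job if the script exits for any reason.
respawn

# Placeholder path; point this at wherever the script lives.
exec python /path/to/read_twitter_stream.py
```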

As well as being started, respawned and stopped automatically, the script can be started, stopped and restarted manually from the command line by issuing start ReadTwitterStream, stop ReadTwitterStream or restart ReadTwitterStream.

That’s really about it for how we capture and store our data. Stay tuned for more information on how we built The View From Mount Olympus.

6 thoughts on “How we Built TVFMO #1 – Storing the Tweets”

  1. That’s great and all, but what does any of this have to do with your company’s existing products? It’s not even built on your own database engine. It would be one thing if it were demonstrating how to use Gibraltar on Python, but it doesn’t even do that.

    I think customers would rather see a Gibraltar Agent for WinRT, a Gibraltar Analyzer for WinRT, tools to put Gibraltar analysis information on web dashboards to show system health, ways to minimize log transfer costs on Windows Azure, and an ASP.NET client that lets you pivot on logged-in user sessions instead of AppDomain sessions.

  2. Thanks for your feedback Robert, we always welcome constructive comments.

    You asked what this has to do with our company’s products and I’m happy to answer that for you. The View From Mount Olympus page demonstrates Socialyze, a new product from Gibraltar Software. You can find out more about it here: http://www.gibraltarsoftware.com/Labs/Socialyze/Default.aspx

    Once again, thanks for your insight on what it is that our customers want.

  3. Does not work on Windows 8 in IE10, you have to put IE in “Compatibility Mode”. There is a header you can put on that page that will force IE 10 into IE9 compatibility mode and fix the problem.

    So then, the question becomes, what does Socialyze have to do with building tools to let developers build rock solid .NET software? Especially if it is built on MongoDB and Python?

    Not sure if you saw it yet, but this partnership was just announced, and is exactly the kind of thing I am talking about: http://newrelic.com/azure. I can’t get web page load times and DOM ready times in Gibraltar. Why not? Something like that should be entirely possible with a Gibraltar.js script powered by SignalR. Instead, now I’m using your tool plus your competitor’s tool, when I’d rather just be using yours.

  4. “Does not work on Windows 8 in IE10”. Thank you for this information. In line with most other software vendors we don’t support pre-release software. I’m sure that when Windows 8 and IE10 are released they will be fully supported.

    “So then, the question becomes, what does Socialyze have to do with building tools to let developers build rock solid .NET software?” Why does that become the question? Perhaps it only becomes the question in your mind. In my mind, Socialyze is an example of Gibraltar Software looking to the future, seeing which platforms and problem domains are going to be of interest to developers and creating the tools to help them secure that future.

    Thanks again for your further feedback on Analyst; however, it does seem entirely off topic for this post on Socialyze.

  5. David McCallum

    Nice article Gary, I know nothing about Python, but the examples are easy to follow.

    Hope you’ll be following up with articles on data analysis.

    One very small thing (it may be my browser), but the hyperlinks in your article are very hard to distinguish amongst the rest of the text. I only came across them by accident when I moved my mouse over one of them.

  6. Hi David, glad you like the post. We’ll certainly be following up with articles on the analysis so stay tuned. Sorry about the link text, I agree it’s hard to notice, it’s a function of this particular theme. We are planning some work on our website in the near future and I’ll try to ensure that this problem is looked at then.
