June 28, 2011
Counters Everywhere
Last week, coming off hernia surgery number two of the year (and hopefully the last for a while), I eased back into development by working on Gaug.es.
In three days, I cranked out tracking of three new features. This was only possible because I have tried, failed, and succeeded, over and over, at storing various stats efficiently in Mongo.
While I use Mongo for the examples in this article, most of it could very easily be applied to any data store that supports incrementing numbers.
How are you going to use the data?
The great thing about the boom in new data stores is the flexibility that most provide regarding storage models. Whereas SQL is about normalizing the storage of data and then flexibly querying it, NoSQL is about thinking how you will query data and then flexibly storing it.
This flexibility is great, but it means if you do not fully understand how you will be accessing data, you can really muck things up. If, on the other hand, you do understand your data and how it is accessed, you can do some really fun stuff.
So how do we access data on Gaug.es? Depends on the feature (views, browsers, platforms, screen resolutions, content, referrers, etc.), but it can mostly be broken down into these points:
- Time frame resolution. What resolution is needed? To the month? Day? Hour? Which piece of content was viewed the most matters on a per day basis, but which browser is winning the war only matters per month, or maybe even over several months.
- Number of variations. Browsers have a finite number of variations (Chrome, Firefox, Safari, IE, Opera, Other). Content is completely the opposite, as it varies drastically from website to website.
Knowing that resolution and variation drive how we need to present data is really important.
One document to rule them all
Due to the amount of data a hosted stats service has to deal with, most store each hit and then process them into reports on intervals. This leads to delays between something happening on your site and you finding out, as reports can be hours or even a day behind. This always bothered me and is why I am working really hard at making Gaug.es completely live.
Ideally, you should be able to check stats anytime and know exactly what just happened. Email newsletter? Watch the traffic pour in a few minutes after you hit send. Post to your blog? See how quickly people pick it up on Twitter and in feed readers.
In order to provide access to data in real-time, we have to store and retrieve our data differently. Instead of storing every hit and all the details and then processing those hits, we make decisions and build reports as each hit comes in.
Resolution and Variations
What kind of decisions? Exactly what I mentioned above.
First, we determine what resolution a feature needs. Top content and referrers need to be stored per day for at least a month. After that, per month is probably a good enough resolution.
Browsers and screen sizes are far less interesting on a per day basis. Typically, these are only used a few times a year to make decisions such as dropping IE 6 support or deciding to target 1024×768 instead of 800×600 (remember that back in the day?).
Second, we determine the variations. Content and referrers vary greatly on a per site basis, but we can choose the browsers and screen dimensions to track. For example, with browsers, we picked Chrome, Safari, Firefox, Opera, and IE, and then we lump the rest of the browsers into Other. Do I really care how many people visit RailsTips in Konqueror? Nope, so why even show it?
The same goes for platforms. We track Mac, Windows, Linux, iPhone, iPad, iPod, Android, Blackberry, and Other.
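The lumping step could be sketched roughly like this. The constant and method names (TRACKED_BROWSERS, TRACKED_PLATFORMS, bucket_for) are made up for illustration and are not necessarily what Gaug.es runs:

```ruby
# Hypothetical whitelists; the names are assumptions for this sketch.
TRACKED_BROWSERS  = %w[chrome safari firefox opera ie].freeze
TRACKED_PLATFORMS = %w[macintosh windows linux iphone ipad ipod android blackberry].freeze

# Returns the name itself when it is tracked, otherwise lumps it into "other".
def bucket_for(name, tracked)
  key = name.to_s.downcase
  tracked.include?(key) ? key : 'other'
end

bucket_for('Firefox', TRACKED_BROWSERS)    # "firefox"
bucket_for('Konqueror', TRACKED_BROWSERS)  # "other"
```

Keeping the list fixed is what keeps the number of counters per site predictable.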
Document Model
Knowing that we only have 6 variations of browsers and 9 variations of platforms to track, and that the list is not likely to grow much, I store all of them in one document per month per site. This means showing someone browser and/or platform data for an entire month is one query for a very tiny document that looks like this:
{
  '_id' => 'site_id:month',
  'browsers' => {
    'safari' => {
      '5-0' => 5,
      '4-1' => 2,
    },
    'ie' => {
      '9-0' => 5,
      '8-0' => 2,
      '7-0' => 1,
      '6-0' => 1,
    }
  },
  'platforms' => {
    'macintosh' => 10,
    'windows' => 5,
    'linux' => 2,
  },
}
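Turning that document into a report is plain hash work. A minimal sketch, assuming the document has already been loaded into a Ruby hash shaped like the one above (browser_totals is a hypothetical helper, not part of any driver):

```ruby
# Sample document shaped like the example above.
doc = {
  '_id' => 'site_id:month',
  'browsers' => {
    'safari' => {'5-0' => 5, '4-1' => 2},
    'ie'     => {'9-0' => 5, '8-0' => 2, '7-0' => 1, '6-0' => 1},
  },
  'platforms' => {'macintosh' => 10, 'windows' => 5, 'linux' => 2},
}

# Sums the per-version counters into one total per browser.
def browser_totals(doc)
  totals = {}
  (doc['browsers'] || {}).each do |name, versions|
    totals[name] = versions.values.inject(0) { |sum, count| sum + count }
  end
  totals
end

browser_totals(doc)  # safari totals 7, ie totals 9
```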
When a track request comes in, I parse the user agent to get the browser, version, and platform. We only store the major and minor parts of the version. Who cares about 12.0.1.2? What matters is 12.0. This means we end up with 5-10 versions per month per browser instead of 50 or 100. Also, note that Mongo does not allow dots in key names, so I store the dot as a hyphen, thus 12-0.
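The version trimming might look something like the following. version_key is a hypothetical name, and padding a missing minor part with 0 is my assumption, not something described above:

```ruby
# Hypothetical helper: keep only major.minor and swap the dot for a hyphen,
# since the result is used as a key name.
def version_key(version)
  major, minor = version.to_s.split('.')
  # Assumption: missing parts are treated as 0.
  "#{major || 0}-#{minor || 0}"
end

version_key('12.0.1.2')  # "12-0"
version_key('9')         # "9-0"
```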
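I then issue a single update against that document to increment the platform and browser/version.
# One document per site per month; the _id doubles as the lookup key.
query = {'_id' => "#{hit.site_id}:#{hit.month}"}
# $inc with dotted paths bumps the nested counters in place.
update = {'$inc' => {
  "b.#{browser_name}.#{browser_version}" => 1,
  "p.#{platform}" => 1,
}}
# Upsert so the first hit of the month creates the document.
collection(hit.created_on).update(query, update, :upsert => true)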
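b and p are short for browsers and platforms; the sample document above spells the keys out for readability, but what the update writes is the short form. No need to waste space. The dot syntax in the strings in the update hash tells Mongo to reach into the document and increment a value for a key inside of a hash.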
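To make the dotted paths concrete, here is a small in-memory imitation of that behavior. It is only a model for illustration (apply_inc is a made-up method, not driver code), but it mirrors how $inc creates missing nested keys and starts absent counters at the increment amount:

```ruby
# In-memory imitation of $inc with a dotted path; not Mongo driver code.
def apply_inc(doc, path, amount)
  *parents, leaf = path.split('.')
  # Walk down the path, creating empty hashes where a level is missing.
  target = parents.inject(doc) { |hash, key| hash[key] ||= {} }
  target[leaf] = (target[leaf] || 0) + amount
  doc
end

doc = {'_id' => 'site_id:month'}
apply_inc(doc, 'b.safari.5-0', 1)
apply_inc(doc, 'b.safari.5-0', 1)
apply_inc(doc, 'p.macintosh', 1)
# doc now holds 2 under b -> safari -> 5-0 and 1 under p -> macintosh
```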
Also, the _id (or primary key) of the document is the site id and the month since the two together are always unique. There is no need to store a BSON ObjectId or incrementing number, as the data is always accessed for a given site and month. _id is automatically indexed in Mongo and it is the only thing that we query on, so there is no need for secondary indexes.
Range based partitioning
I also do a bit of range based partitioning at the collection level (e.g., technology.2011, technology.2012). That is why I pass the date of the hit to the collection method. The collection that stores the browser and platform information is split by year. Maybe unnecessary looking back at it, but it hurts nothing. It means that a given collection stores at most (number of sites × 12) documents.
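Picking the collection from the hit date can be as small as this. collection_name is a hypothetical helper; the real collection method is not shown in this article:

```ruby
require 'date'

# Hypothetical helper: derive the per-year collection name from a hit's date.
def collection_name(prefix, date)
  "#{prefix}.#{date.year}"
end

collection_name('technology', Date.new(2011, 6, 28))  # "technology.2011"
collection_name('technology', Date.new(2012, 1, 1))   # "technology.2012"
```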
Mongo creates collections on the fly, so when a new year comes along, the new collection will be created automatically. As years go by, we can create smaller summary documents and drop the old collections or move them to another physical server (which is often easier and more performant than removing old data from an active collection).
Because I know that the number of variations is small (< 100-ish), I know that the overall document size is not really going to grow and that it will always efficiently fly across the wire. When you have relatively controllable data like browsers/platforms, storing it all in one document works great.
Closing Thoughts
As I said before, this article is using Mongo as an example. If you wanted to use Redis, Membase or something else with atomic incrementing, you could just have one key per month per site per browser.
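A rough sketch of that key-per-counter layout follows. The key format and counter_key are assumptions for illustration, and a plain Hash stands in for the store so the example runs on its own; against a real store you would call its atomic increment on the same key:

```ruby
# Hypothetical key layout: one counter per site, per month, per browser.
def counter_key(site_id, month, browser)
  ['browsers', site_id, month, browser].join(':')
end

# Stand-in for the key/value store; missing keys start at 0.
store = Hash.new(0)
store[counter_key('abc123', '2011-06', 'safari')] += 1
store[counter_key('abc123', '2011-06', 'safari')] += 1
store[counter_key('abc123', '2011-06', 'ie')] += 1

store['browsers:abc123:2011-06:safari']  # 2
```

If you also want versions, the version could be appended to the key as one more segment.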
Building reports on the fly through incrementing counters means:
- less storage, as you do not need the raw data
- less RAM, as there are fewer secondary indexes
- real-time querying is no problem, as you do not need to generate reports; the data is the report
It definitely involves more thought up front, but several areas of Gaug.es use this pattern and it is working great. I should also note that it increases the number of writes. Creating the reports on the fly means 7 or 8 writes for each “view” instead of 1.
The trade-off is that reading the data is faster and avoids the lag caused by having to post-process it. I can see a day in the future where having all these writes will force me to find a different solution, but that is a ways off.
What do you do when you cannot limit the number of variations? I’ll leave that for next time.
Oh, and if you have not signed up for Gaug.es yet, what are you waiting on? Do it!
6 Comments
Jun 28, 2011
Good read! Looking forward to “next time” when you discuss the unlimited number of variations case.
Jun 28, 2011
yeah, me too :) thanks for sharing this!
Jul 01, 2011
Does gauges handle unique visitors in different time zones? If so I’m very interested to know your approach on storing and querying the data in MongoDB.
Jul 01, 2011
@Steve – Uniqueness is determined by cookies. Each site can have a time zone as well. Because we build all the reports on the fly, we just get the current time in the site’s zone and ensure that all reports being updated are based on that.
Jul 05, 2011
What happens if the server saves hits in UTC, for example, and the user is at +2?
Wouldn’t something coming in on 23:01 on the last day of month 1 get tracked in the “wrong” month seeing as it is 01:01 in month 2 for that user?
Perhaps I’m missing something obvious here :)
Jul 06, 2011
@Mark – You aren’t missing anything. Each site picks a time zone and we perform all time related operations in that time zone. Thus, changing the time zone leads to gaps or overlaps, but changing the time zone should never or rarely happen so it is an ok compromise.