February 04, 2012
Keep 'Em Separated
Note: If you end up enjoying this post, you should do two things: sign up for Pusher and then subscribe to Destroy All Software screencasts. I'm not telling you to do this because I get referrals; I just really like both services.
For those who do not know, Gauges currently uses Pusher.com for flinging all the traffic around live.
Every track request to Gauges sends a request to Pusher. We do this using EventMachine in a thread, as I have previously written about.
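For context, the rough shape of that setup looks something like the sketch below (the channel and payload names are made up for illustration, not the actual Gauges code): boot an EventMachine reactor in a background thread, then schedule each Pusher trigger onto the reactor so the HTTP call never blocks the request thread.

require 'eventmachine'
require 'pusher'

Pusher.app_id = 'your-app-id'
Pusher.key    = 'your-key'
Pusher.secret = 'your-secret'

# Boot the reactor once, in its own thread, so the web process can
# hand work to it without blocking requests.
Thread.new { EM.run } unless EM.reactor_running?

# Called from the track request path (illustrative name).
def notify(channel, payload)
  EM.schedule do
    # trigger_async uses em-http-request, so the HTTP call happens
    # on the reactor thread instead of the request thread.
    Pusher[channel].trigger_async('track', payload)
  end
end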
The Problem
The downside of this is that when you get to the point we were at (thousands of requests a minute), there are so many Pusher notifications to send (also thousands a minute) that the EM thread starts stealing a lot of time from the main request thread. You end up with random slow requests that have one to five seconds of “uninstrumented” time. Definitely not a happy scaler does this make.
In the past, we had talked about keeping track of which gauges were actually being watched and only sending a notification for those, but never actually did anything about it.
The Solution
Recently, Pusher added web hooks on channel occupy and channel vacate. This, combined with a growing number of slow requests, was just the motivation I needed to come up with a solution.
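Each hook is a signed POST whose JSON body bundles one or more events with a timestamp. Through the Pusher gem it surfaces roughly like this (channel name and time are made up for illustration):

webhook.time   # => 2012-02-04 12:00:00 UTC, built from the payload's time_ms
webhook.events # => [{ 'name' => 'channel_occupied', 'channel' => 'gauge-1' }]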
We (@bkeepers and I) started by mapping a simple route to a class.
class PusherApp < BaseApp
  post '/pusher/ping' do
    webhook = Pusher::WebHook.new(request)

    if webhook.valid?
      PusherPing.receive(webhook)
      'ok'
    else
      status 401
      'invalid'
    end
  end
end
Using a simple class method like this moves all logic out of the route and into a place that is easier to test. The receive method iterates the events and runs each ping individually.
class PusherPing
  def self.receive(webhook)
    webhook.events.each do |event|
      new(event, webhook.time).run
    end
  end
end
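Because receive only needs an object that responds to events and time, a test can hand it a fake webhook and never touch Sinatra or Pusher. A rough sketch, assuming a recent minitest (not our actual suite), with a recording subclass standing in so nothing hits the database:

require 'minitest/autorun'
require 'ostruct'

# Records runs instead of doing real work.
class FakePusherPing < PusherPing
  def self.runs
    @runs ||= []
  end

  def run
    self.class.runs << self
  end
end

class PusherPingReceiveTest < Minitest::Test
  def test_runs_one_ping_per_event
    events  = [{ 'name' => 'channel_occupied', 'channel' => 'gauge-1' },
               { 'name' => 'channel_vacated',  'channel' => 'gauge-2' }]
    webhook = OpenStruct.new(:events => events, :time => Time.now.utc)

    FakePusherPing.receive(webhook)

    assert_equal 2, FakePusherPing.runs.size
  end
end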
At first, we had something like this for each PusherPing instance.
class PusherPing
  def initialize(event, time)
    @event         = event || {}
    @time          = time
    @event_name    = @event['name']
    @event_channel = @event['channel']
  end

  def run
    case @event_name
    when 'channel_occupied'
      occupied
    when 'channel_vacated'
      vacated
    end
  end

  def occupied
    update(@time)
  end

  def vacated
    update(nil)
  end

  def update(value)
    # update the gauge in the
    # db with the value
  end
end
We pushed out the change so we could start marking gauges as occupied. We then forced a browser refresh, which effectively vacated and re-occupied all gauges people were watching.
Once we knew the occupied state of each gauge was correct, we added the code to only send the request to Pusher on track if a gauge was occupied.
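The track-side check itself is tiny: bail out before touching Pusher unless the gauge has been marked occupied. Roughly, and only as a sketch (gauge, hit, and the channel naming are illustrative; occupied_at is the column the PusherPing writes):

# Sketch only: skip the Pusher work entirely when nobody is watching.
def notify_pusher(gauge, hit)
  return if gauge.occupied_at.blank?

  EM.schedule do
    Pusher["gauge-#{gauge.id}"].trigger_async('track', hit)
  end
end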
Deploy. Celebrate. Booyeah.
The New Problem
Then, less than a day later, we realized that Pusher doesn’t guarantee the order of events. Imagine someone vacating and then occupying a gauge, but us receiving the occupy first and then the vacate.
This situation would mean that live tracking would never turn on for the gauge. Indeed, it started happening to a few people, who quickly let us know.
The New Solution
We figured it was better to send a few extra notifications than never send any, so we decided to “occupy” gauges on our own when people loaded up the Gauges dashboard.
We started in and quickly realized the error of our ways in the PusherPing. Having the database calls directly tied to the PusherPing class meant that we had two options:
1. Use the PusherPing class to occupy a gauge when the dashboard loads, which just felt wrong.
2. Re-write it to separate the occupying and vacating of a gauge from the PusherPing class.
Since we are good little developers, we went with 2. We created a GaugeOccupier class that looks like this:
class GaugeOccupier
  attr_reader :ids

  def initialize(*ids)
    @ids = ids.flatten.compact.uniq
  end

  def occupy(time=Time.now.utc)
    update(time)
  end

  def vacate
    update(nil)
  end

  private

  def update(value)
    return if @ids.blank?
    # do the db updates
  end
end
We tested that class on its own quite quickly and refactored the PusherPing to use it.
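For flavor, a standalone test of the id normalization needs nothing but the class itself (a sketch assuming a recent minitest, not our actual suite):

require 'minitest/autorun'

class GaugeOccupierTest < Minitest::Test
  def test_flattens_compacts_and_uniques_ids
    occupier = GaugeOccupier.new('1', nil, ['2', '2'])
    assert_equal ['1', '2'], occupier.ids
  end
end

With that covered, the refactored PusherPing just delegates to GaugeOccupier: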
class PusherPing
  def run
    case @event_name
    when 'channel_occupied'
      # gauge_id (not shown) is derived from @event_channel
      GaugeOccupier.new(gauge_id).occupy(@time)
    when 'channel_vacated'
      GaugeOccupier.new(gauge_id).vacate
    end
  end
end
Boom. PusherPing now worked the same and we had a way to “occupy” gauges separate from the PusherPing. We added the occupy logic to the correct point in our app like so:
ids = gauges.map { |gauge| gauge.id }
GaugeOccupier.new(ids).occupy
At this point, we were “occupied” more than “vacated”, which is good. However, you may have noticed that we still had the issue where someone loads the dashboard, we occupy the gauge, but then receive a delayed, or what I will now refer to as “stale”, hook.
To fix the stale hook issue, we added a bit of logic to the PusherPing class to detect staleness and simply ignore the ping if it is stale.
class PusherPing
  def run
    return if stale?
    # do occupy/vacate
  end

  def stale?
    # gauge (not shown) is the gauge this ping's channel belongs to
    return false if gauge.occupied_at.blank?

    # stale means we already marked the gauge occupied after this
    # hook was generated, so the hook's information is out of date
    gauge.occupied_at > @time
  end
end
Closing Thoughts
This is by no means a perfect solution. There are still other holes. For example, a gauge could be occupied by us after we receive a vacate hook from pusher and stay in an “occupied” state, sending notifications that no one is looking for.
To fix that issue, we can add a cleanup cron or something that occasionally gets all occupied channels from pusher and vacates gauges that are not in the list.
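If we ever do, the shape is simple: ask Pusher which channels are currently occupied and vacate every gauge that is not in the list. A sketch, assuming the gem's Pusher.channels wrapper around the HTTP API's GET /channels endpoint, a "gauge-<id>" channel naming scheme, and a Gauge.occupied scope (all assumptions, not the real Gauges code):

# Hypothetical cleanup task.
response     = Pusher.channels # wraps GET /channels on the HTTP API
occupied     = (response[:channels] || response['channels']).keys
occupied_ids = occupied.map { |name| name.sub(/\Agauge-/, '') }

# Anything we think is occupied but Pusher does not list gets vacated.
stale_ids = Gauge.occupied.map { |gauge| gauge.id.to_s } - occupied_ids
GaugeOccupier.new(stale_ids).vacate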
We decided it wasn’t worth the time. We pushed out the occupy fix and are now reaping the benefits of sending about 1/6th of the Pusher requests we were sending before. This means our EventMachine thread is doing less work, which gives our main thread more time to process requests.
You might think us crazy for sending hundreds of HTTP requests in a thread that shares time with the main request thread, but it is actually working quite well.
We know that some day we will have to move this to a queue and an external process that processes the queue, but that day is not today. Instead, we can focus on the next round of features that will blow people’s socks off.