January 22, 2010
Multiple Domain Page Caching
The other day Brandon Wright emailed me about the following tweet:
Just deployed full page caching on Harmony. Our log file stopped spinning by which made me happy and sad.
Routing
It might seem like black magic, but it isn’t all that hard. The front side of Harmony differs from a typical Rails app: multiple domains point at Harmony, and the paths are not known up front, so they can’t go in the routes file. To get everything headed to a controller, the last route in our file is this:
map.dispatch '*path', :controller => 'the', :action => 'dispatch'
This uses Rails route globbing to send every path to an action named dispatch in a controller dubiously named “the” (because it made us laugh). From there, we determine if we can find the site and if the site has an item (page, link, blog, post, etc.) that matches the path.
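The lookup itself is not shown here, so the following is only a rough sketch of the idea, not Harmony's actual code. SITES and find_item are hypothetical names, a hash stands in for the database, and the sketch assumes the globbed path arrives as an array of segments.

```ruby
# Hypothetical in-memory stand-in for the site and item lookups.
SITES = {
  'railstips.org' => { '/dude/' => 'page', '/feed.atom' => 'blog feed' }
}.freeze

# Returns the matching item, or nil when the host or the path is unknown
# (the point at which a real app would render a 404).
def find_item(host, path_segments)
  site = SITES[host]
  return nil unless site

  path = '/' + Array(path_segments).join('/')
  # Try the path as given first, then with a trailing slash.
  site[path] || site[path + '/']
end

puts find_item('railstips.org', ['dude']).inspect
puts find_item('unknown.example', ['dude']).inspect
```

The point is simply that the host picks the site and the path picks the item within that site, so the same path can mean different things on different domains.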
Caching
Somewhere down the rabbit hole we render that item based on its Liquid template, immediately after which we call something like this:
cache_item(@item, contents)
# which looks kind of like this
def cache_item(item, contents)
  # gone for brevity
  FileUtils.mkdir_p(File.dirname(item.page_cache_path))
  File.open(item.page_cache_path, 'w+') { |f| f.puts(contents) }
end
*We could have used caches_page in Rails, but we are already using that without including the http host for asset and theme file caching, so it was easier to just roll our own.
All cache_item does is ensure that the directory exists and then write the contents of what we are about to send back to the browser into a file. Really nothing fancy. So what does item.page_cache_path look like? For a site like railstips.org and a path of /dude/, we end up with the following cache path:
#{RAILS_ROOT}/public/cache/railstips.org/dude/index.html
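The body of page_cache_path is not shown above, so here is a minimal sketch of how such a path could be built. The standalone method and its root argument (standing in for RAILS_ROOT) are assumptions for illustration, not the real implementation.

```ruby
# Hypothetical helper: build the cache file path for a host and request path.
# `root` stands in for RAILS_ROOT.
def page_cache_path(root, host, path)
  # Paths ending in a slash are treated as directories and get an index.html;
  # anything else (e.g. /feed.atom) is written under its own name.
  relative = path.end_with?('/') ? File.join(path, 'index.html') : path
  File.join(root, 'public', 'cache', host, relative)
end

puts page_cache_path('/app', 'railstips.org', '/dude/')
puts page_cache_path('/app', 'railstips.org', '/feed.atom')
```

A real version would also want to reject paths containing `..` before writing anything to disk.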
Note the use of the domain in the cache path. Since we have that, we can use Apache rewrites along with conditions to tell Apache to check if a cached file exists based on the host. If it does, we serve that file, and if it doesn’t, we just hit Rails, cache the file, and return the response. We use Moonshine for our deployments, so all we need to do is set the Passenger page cache directory like this:
:passenger:
  :page_cache_directory: '/cache/%{HTTP_HOST}'
When we deploy, this sets up the following Apache rewrite rules:
# Rewrite to check for Rails non-html cached pages (i.e. xml, json, atom, etc)
RewriteCond %{THE_REQUEST} ^(GET|HEAD)
RewriteCond %{DOCUMENT_ROOT}/cache/%{HTTP_HOST}%{REQUEST_URI} -f
RewriteRule ^(.*)$ /cache/%{HTTP_HOST}$1 [QSA,L]
# Rewrite to check for Rails cached html page
RewriteCond %{THE_REQUEST} ^(GET|HEAD)
RewriteCond %{DOCUMENT_ROOT}/cache/%{HTTP_HOST}%{REQUEST_URI}index.html -f
RewriteRule ^(.*)$ /cache/%{HTTP_HOST}$1index.html [QSA,L]
Note that in the RewriteRule, we include the HTTP_HOST, which when visiting railstips.org, would be railstips.org.
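If the rewrite syntax is hard to read, the same decision can be expressed in Ruby. This is only an illustration of what the two rules check, not something we run; cached_file_for is a hypothetical name, and request_uri is assumed to be the path without the query string.

```ruby
require 'fileutils'
require 'tmpdir'

# Mirrors the two rewrite conditions: return the cached file that would be
# served for this request, or nil when the request should fall through to Rails.
def cached_file_for(document_root, http_method, host, request_uri)
  return nil unless %w[GET HEAD].include?(http_method)

  base = File.join(document_root, 'cache', host)
  # First rule: an exact file (xml, json, atom, ...); second rule: index.html.
  [request_uri, "#{request_uri}index.html"].each do |suffix|
    candidate = base + suffix
    return candidate if File.file?(candidate)
  end
  nil
end

# Usage: build a throwaway document root holding one cached page.
Dir.mktmpdir do |root|
  page = File.join(root, 'cache', 'railstips.org', 'dude', 'index.html')
  FileUtils.mkdir_p(File.dirname(page))
  File.write(page, '<html>dude</html>')

  puts cached_file_for(root, 'GET', 'railstips.org', '/dude/').inspect   # cache hit
  puts cached_file_for(root, 'POST', 'railstips.org', '/dude/').inspect  # goes to Rails
  puts cached_file_for(root, 'GET', 'other.example', '/dude/').inspect   # other host, goes to Rails
end
```

Because the host is part of the directory, two sites can cache the same path without stepping on each other.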
One URL to Rule Them All
The key to this being effective is having only one true URL for each page. We do this right now by redirecting www to no-www and ensuring that each page has a trailing slash. First, no-www.
# no www
RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]
RewriteRule ^(.*)$ http://%1$1 [R=301,L]
Next, we ensure that there is always a trailing slash when needed. This means that /foo redirects to /foo/ and foo.json just stays as foo.json.
RewriteCond %{THE_REQUEST} ^(GET|HEAD)
RewriteCond %{REQUEST_URI} !^/admin/
RewriteRule ^(.*/[^/\.]+)$ $1/ [R]
Ensuring that each page has one URL is better for search engines and analytics. You don’t end up with split page rank for the same page (with and without slash) and the same thing is true for pageviews.
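To make the two redirect rules concrete, here is a small Ruby sketch of the same normalization. canonical_location is a hypothetical helper, not part of Harmony. It folds both steps into one and ignores the request method, whereas Apache issues the www redirect first and applies the trailing-slash rule only to GET and HEAD.

```ruby
# Returns the canonical [host, path] for a request, or nil when the request
# is already canonical and no redirect is needed.
def canonical_location(host, path)
  new_host = host.sub(/\Awww\./i, '')

  # Same pattern as the trailing-slash rule: the last segment has no dot
  # and no slash, so /foo gets a slash while /foo.json is left alone.
  new_path = path
  if path !~ %r{\A/admin/} && path =~ %r{\A(.*/[^/.]+)\z}
    new_path = "#{path}/"
  end

  [new_host, new_path] == [host, path] ? nil : [new_host, new_path]
end

puts canonical_location('www.railstips.org', '/foo').inspect
puts canonical_location('railstips.org', '/foo.json').inspect
```

Anything that returns nil here is already the one true URL, which is exactly the set of requests the cache rewrite rules can serve from disk.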
Cache Clearing
Now that I’ve explained a bit how we do the caching, I’ll mention quickly how we clear it. As they say, cache expiration and naming are the two hardest things to do in programming. We opted for the simplest solution that would work for now.
I made a simple site cache clearer module that I include in any model that can affect a site on the front side. It looks something like this.
module SiteCacheClearer
  def self.included(model)
    model.after_save :clear_item_cache
    model.after_destroy :clear_item_cache
  end

  def clear_item_cache
    site.clear_item_cache if site.present?
  end
end
# To use
class Item
  include MongoMapper::Document
  include SiteCacheClearer
end
All it does is remove the entire site’s cache whenever the model is saved or destroyed. Like I said, nothing fancy. It doesn’t check if the thing is published. It doesn’t check which pages the item is actually shown on and remove only those. It just blows away the cache when things change.
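The site side of clear_item_cache is not shown above. A minimal sketch, assuming the cache lives in one directory per host as described earlier, might look like this. The Site class, its constructor, and cache_directory are stand-ins for illustration, not the real model.

```ruby
require 'fileutils'

# Hypothetical stand-in for the site model; only the cache-clearing part.
class Site
  attr_reader :host

  def initialize(host, cache_root)
    @host = host
    @cache_root = cache_root
  end

  def cache_directory
    File.join(@cache_root, host)
  end

  # Remove every cached page for this site; other sites are untouched.
  def clear_item_cache
    # Guard against an empty host, which would point at the shared cache root.
    return if host.to_s.empty?

    FileUtils.rm_rf(cache_directory)
  end
end
```

Since each site has its own directory, clearing a site is a single recursive delete, and the next request for any page just falls through to Rails and gets cached again.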
Someday we’ll definitely do something more advanced like a reference-based cache where only the pages that need to be blown away are, but this is working great for now. Hope this is helpful to someone.
The main thing to remember is to use the host and make sure there is only one way to get to the resource.
So what does this all mean to our read-heavy application? Well, we end up with Scout graphs like this:
The blue is Apache requests and the orange is Rails requests. Notice that as our Apache requests go up, our Rails requests stay pretty steady.
12 Comments
Jan 22, 2010
So well explained, thanks! I can’t wait to give Harmony a try!
Jan 22, 2010
@Brandon Wright: You’re welcome!
Jan 22, 2010
Great post John, tks. Don’t you think that NGINX is a better alternative to Apache? Specially on a rails backend and to serve static files.
Are you using Apache for a reason?
Jan 24, 2010
@PabloC: I have never had any problems with Apache or any need for anything different. I have used Nginx before on other projects, but I have no feelings either for or against it. Apache is in the default Railsmachine stack and they manage our hosting, so we just went with it. I’m sure we could switch to Nginx if we wanted/needed to.
Jan 24, 2010
Hah, technoweenie solved this in 2006 for mephisto. How’s that shiny new wheel design? :)
http://agilewebdevelopment.com/plugins/referenced_page_caching
Jan 24, 2010
Thanks for your feedback!
Jan 24, 2010
@courtenay: Oh, trust me. I don’t intend on reinventing the wheel when I do the reference-based cache. :) I’ll probably start with something similar to mephisto and go from there.
Jan 25, 2010
We have a similar situation. One thing we discovered along the way is that the ActionController::Request object has 2 methods: #host and #domain. The first returns whatever was in the HTTP_HOST header, but #domain() will return the top level domain (you can specify the tld length as an argument to the method).
This came in quite handy when we wanted to map *.domain.dom to just domain.dom.
Jan 25, 2010
@Ken Mayer: Nice. Good to know if I run into a situation where I need that.
Feb 03, 2010
You explained this so insanely simple that even I feel like I could do this and I have been learning Ruby on Rails for a couple of months now. Great job!
Apr 13, 2010
Great info. Thank you for SiteCacheClearer, I thought about using something of this kind sometime ago. I liked the simplicity of your implementation
May 10, 2010
Hi,
I’d like to do some page caching the same way, but I can’t get to store the page caches in another directory than the public path. Do you have any pointer on how to do this ?
Thanks