Have you ever been frustrated by having to query data that you don’t control? If the data you want isn’t accessible in the format you’d like, you have probably “locally cached” that information. But what happens when the data changes at the source? There are a few approaches:
- Rebuild the cache at a certain time – This approach lets your code function without caring much about the data being cached. You do your thing, a cron/scheduled job does its thing, and everyone is happy. Well, mostly. The problem with this approach is the rebuild frequency: the shorter the interval, the more accurate but more resource-intensive the application becomes; the longer the interval, the less intensive but less accurate your data becomes. In either scenario, you will probably also need a mechanism to rebuild the cache manually.
- Rebuild the cache on-the-fly – This approach keeps your code as up-to-date as possible while preserving the local cache and not hurting performance too much. A typical scenario is to insert records into your local cache the first time you pull them from the native source (see the sketch after this list). This removes the need to pre-cache objects, since caching happens at request time, but it comes with a performance penalty: the first request is the longest, and subsequent requests are quick. You still have to decide when to refresh the cache and how to allow manual refreshes. It also complicates your code; in addition to your real logic, you now have relatively meaningless cache logic sitting side-by-side with it.
- Don’t cache – Just take the performance hit, optimize as much as possible, and hope that no one cares that the operation takes some extra time to complete. The problem with this approach is efficiency. Computers are fast, and people expect them to be. People may stop using your code altogether if the performance impact is severe enough to outweigh its usefulness.
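To make the on-the-fly approach concrete, here is a minimal sketch of fetch-on-miss caching, assuming a local User table and a remote OtherDatabase::User model like the one used later in this post; the cached_find helper name is my own, not part of Rails:

# A hypothetical class method: serve from the local cache if we already
# have the record, otherwise pull it from the original source and cache it.
class User < ActiveRecord::Base
  def self.cached_find(username)
    user = find_by_username(username)
    return user if user

    other = OtherDatabase::User.find_by_username(username)
    create(:username   => other.username,
           :first_name => other.first_name,
           :last_name  => other.last_name)
  end
end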
So what is a programmer to do? Of the approaches above, I have opted to cache on-the-fly, with a twist: the twist takes advantage of ActiveRecord’s callbacks. What is a callback? Think of callbacks as “in-between” steps you can hook into as ActiveRecord does its thing. They are an API that lets you do this without any ugly hacks or baseline modifications. Callbacks are also known as hooks. From the official Ruby on Rails documentation:
“Callbacks are methods that get called at certain moments of an object’s lifecycle. With callbacks it’s possible to write code that will run whenever an Active Record object is created, saved, updated, deleted, validated, or loaded from the database.”
Simply put, you can create methods with certain names in an ActiveRecord::Base-derived model and define your cache logic there. For example, if we had a User model, we could query a user like the following:
# app/models/user.rb
class User < ActiveRecord::Base
end

User.first
=> #<User ...>
This code sample returns the first user, with their attributes loaded. Now, if this information was pulled from our local cache, it may differ from the original source. For instance, perhaps since the cache was built this person got married and changed their name. Your cache now differs from your original source, and that needs to be resolved. So let’s implement some cache refreshing via ActiveRecord’s after_find callback:
# app/models/user.rb
class User < ActiveRecord::Base
  # ActiveRecord callback
  def after_find
    puts 'refreshing cache'
    other = OtherDatabase::User.find_by_username(self.username)
    self.first_name = other.first_name
    self.last_name  = other.last_name
    self.save
  end
end
A few things to note. The name “after_find” means this is executed immediately after an ActiveRecord find operation completes, which includes first, last, find_by_xxx, all, etc. The method then updates the User instance (the local cache) with the data from the other database. ActiveRecord is smart enough not to actually issue a save unless the data has changed, so don’t worry too much about efficiency here. You can write this without the “self” prefix, but it helps me keep track of what is what. Also note that I use “puts” just to show when this runs. You can see that after I call find_by_username, this code executes. If there are any changes, they are reflected in the result, transparent to the rest of your application’s logic. This keeps the cache logic out of your “real” logic.
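If you would rather be explicit about skipping the write when nothing changed (instead of relying on ActiveRecord’s behavior), you could guard the save yourself. This is just a sketch, assuming Rails 2.1 or later where dirty tracking provides changed?:

# Variation on the callback above: only save when the source data differs.
def after_find
  other = OtherDatabase::User.find_by_username(self.username)
  self.first_name = other.first_name
  self.last_name  = other.last_name
  self.save if self.changed?   # changed? comes from ActiveRecord dirty tracking
end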
This callback will execute every time we issue a find on the User class, so it isn’t really efficient yet; basically, the cache is always immediately expired. For performance reasons, let’s only check the other database for a user every 10 minutes:
# app/models/user.rb
class User < ActiveRecord::Base
  # ActiveRecord callback
  def after_find
    if self.updated_at.blank? or self.updated_at < 10.minutes.ago
      puts 'refreshing cache'
      other = OtherDatabase::User.find_by_username(self.username)
      self.first_name = other.first_name
      self.last_name  = other.last_name
      self.save
    end
  end
end

User.find_by_username('kristin')
=> #<User ...>

# 10 minutes elapse... (use your imagination)
User.find_by_username('kristin')
refreshing cache
=> #<User ...>
Now we can see the cache working. Every 10 minutes, the local cache is checked against the original source; for all other requests, it simply skips the conditional and exits. You can obviously change the 10-minute expiration to anything you desire. Better still, put this value in a YAML config file and reference it so the setting can be customized.
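For example, the expiration could live in a small YAML file; the file name config/cache.yml and the cache_expiry_minutes key below are just assumptions for illustration:

# config/cache.yml (hypothetical file and key)
#   cache_expiry_minutes: 10

# app/models/user.rb
class User < ActiveRecord::Base
  # loaded once when the class is loaded; RAILS_ROOT is the Rails 2-era root constant
  CACHE_EXPIRY = YAML.load_file("#{RAILS_ROOT}/config/cache.yml")['cache_expiry_minutes'].minutes

  def after_find
    if self.updated_at.blank? or self.updated_at < CACHE_EXPIRY.ago
      # ... refresh from the original source as shown above ...
    end
  end
end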
There are many other callbacks you can use, and they can work together to become a very powerful tool. Check out the following code:
# app/models/user.rb
class User < ActiveRecord::Base
  # ActiveRecord callback: refresh stale records, as before
  def after_find
    if self.updated_at.blank? or self.updated_at < 10.minutes.ago
      puts 'refreshing cache'
      other = OtherDatabase::User.find_by_username(self.username)
      self.first_name = other.first_name
      self.last_name  = other.last_name
      self.save
    end
  end

  # ActiveRecord callback: fill in any missing attributes at creation time
  def before_save
    if self.first_name.blank? or self.last_name.blank?
      other = OtherDatabase::User.find_by_username(self.username)
      self.first_name = other.first_name
      self.last_name  = other.last_name
    end
  end
end

User.find_or_create_by_username('kristin')
=> #<User ...>
This allows me to use “find_or_create_by” to generate records with incomplete information; the missing information is filled in at creation time thanks to the before_save method. Just a note: do NOT call “save” from within some of these callbacks, as it would create an infinite loop – think about it: before_save calls save, which calls before_save again, and so on. Be careful.
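To make the warning concrete, here is a sketch of the loop to avoid; inside before_save you only need to assign attributes, because the save that is already in progress will persist them:

# Do NOT do this:
def before_save
  self.first_name = OtherDatabase::User.find_by_username(self.username).first_name
  save  # save runs the callbacks again, which calls before_save, which calls save...
end

# Assigning the attribute (without calling save) is all before_save needs to do.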
There is a performance penalty in creating a record with incomplete information like this, and it would be much better if I fetched all of the information in one query. For example:
# pull the first user from the original source
other_user = OtherDatabase::User.find_by_username('ksimpson')

# create the local record with everything in one go
user = User.find_or_create_by_username(:username   => other_user.username,
                                       :last_name  => other_user.last_name,
                                       :first_name => other_user.first_name)
user
=> #<User ...>
The before_save callback would have taken care of any missing information (as we saw above); however, that comes at the cost of a second query, and can quickly mean you have unnecessarily doubled your queries.