Rails, and other programming frameworks that include an Object Relational Mapper (ORM), make it really easy not to worry about your database. That is, until your application slows way down and you are left scratching your head trying to figure out why. Recently, I have been tasked with optimizing our database code in order to speed up some slow controller actions.
The truth about databases is that you can’t beat them at their own game. Doing so would be like making an FPS for consoles and releasing it the same day as Modern Warfare 2. A database stores data in a manner that can be easily and efficiently queried. ActiveRecord makes it easy to write code without worrying about the database underneath, but this can create an issue: if you never know what the database is doing, how can you know whether you are doing it efficiently? I am of the firm belief (after witnessing Oracle chomp through 1000-line SQL queries like they weren’t there) that if the database can do something your code can also do, it is probably best to defer to the database. Pick the best tool for the job.
“If all you have is a hammer, everything looks like a nail” – Abraham Maslow
Let’s look at some examples of where a little database magic can improve our lives. Take the following models:
# == Schema Information
#
# Table name: assignments
#
# id :integer not null, primary key
# post_id :integer
# keyword_id :integer
# created_at :datetime
# updated_at :datetime
#
class Assignment < ActiveRecord::Base
  belongs_to :post
  belongs_to :keyword
end
# == Schema Information
#
# Table name: keywords
#
# id :integer not null, primary key
# name :string(255)
# created_at :datetime
# updated_at :datetime
#
class Keyword < ActiveRecord::Base
  has_many :assignments
end
# == Schema Information
#
# Table name: posts
#
# id :integer not null, primary key
# title :string(255)
# created_at :datetime
# updated_at :datetime
#
class Post < ActiveRecord::Base
  has_many :assignments
end
I have created a “Post” model to represent a blog post, and I can assign keywords to each post. To do this, I have another model, “Keyword”, which is associated with posts through an “Assignment” model. The schema information comments show the database structure.
Now, I recently came across some code that aimed to collect all the keywords of a post with a few requirements: the keywords were to be found, their names listed, sorted case-insensitively, and any duplicates excluded. A programmer’s mind might gravitate towards a solution such as:
Post.find(1).assignments.map {|x| x.keyword.name }.sort {|a,b| a.upcase <=> b.upcase}.uniq
Let’s walk through this bit of code before I discuss better alternatives. To start with, we find our Post with an id of “1”, so that we can look at just its keywords. Next we map each assignment to its keyword’s name, and then run a sort on the returned array. The sort uppercases the strings so that the comparison is case-insensitive. Finally, “uniq” is run to exclude any duplicate keywords. While this is a working solution, it doesn’t take advantage of the power and flexibility of the database. A few issues:
- This code generates multiple SQL select statements. (N+1)
- Sorting can be done through “ORDER” in SQL
- Unique records can be generated through DISTINCT
Generating multiple select statements like this is an example of the “N+1” query problem. One query (the “1”) is run to load the assignments for the post. After that, “N” additional queries are run, one for each keyword referenced. In total, you have “N + 1” queries executed. If you have 5 keywords, the inefficiency will largely go unnoticed, but you will still have 6 select statements:
Assignment Load (6.3ms) SELECT * FROM "assignments" WHERE ("assignments".post_id = 1)
Keyword Load (0.8ms) SELECT * FROM "keywords" WHERE ("keywords"."id" = 1)
Keyword Load (0.7ms) SELECT * FROM "keywords" WHERE ("keywords"."id" = 2)
Keyword Load (0.7ms) SELECT * FROM "keywords" WHERE ("keywords"."id" = 3)
Keyword Load (0.7ms) SELECT * FROM "keywords" WHERE ("keywords"."id" = 4)
Keyword Load (0.8ms) SELECT * FROM "keywords" WHERE ("keywords"."id" = 5)
What happens if you have 500 keywords? 5,000,000 keywords? Your performance bottleneck will quickly shift to the database as multiple queries have to be generated, executed, and returned for each page request.
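As an aside, even without restructuring the associations, ActiveRecord can collapse those N+1 queries with eager loading. Here is a minimal sketch using the classic :include finder option (we will still end up in a better place with has_many :through below):
post = Post.find(1, :include => { :assignments => :keyword })
# One pass loads the assignments and their keywords, instead of one SELECT per keyword.
post.assignments.map { |a| a.keyword.name }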
What is the big deal about sorting in your application instead of the database? In order for Rails (ActiveRecord specifically) to “sort” these items, the items first have to be returned by a generated select statement from the database. They come back as elements in an array, and then the array is sorted by calling the “sort” method on the Array class. Further, this is done in Ruby – a dynamic language that is several orders of magnitude slower at this kind of work than the database (which is most likely written in C). Simply put, the database is made for this kind of work, and outsourcing it to Ruby is just not efficient.
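To make that concrete, here are the two approaches side by side (a sketch against our keywords table):
# Ruby does the work: every keyword is instantiated, then Array#sort runs in Ruby.
Keyword.all.sort { |a, b| a.name.upcase <=> b.name.upcase }
# The database does the work: the rows arrive already ordered.
Keyword.all(:order => 'UPPER(name)')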
Finally, why not make records unique with Ruby’s “uniq”? Again, it has to do with meddling in database territory. It has the same problems as the sorting issue above, plus an additional one. Let’s say that you return 500 records – and 499 of them are duplicates. Before the “uniq” method is run, Ruby is holding in memory the attributes of 500 ActiveRecord instances. Only when you call “uniq” can it drop them. Why go through this memory spike and plummet when the database can be instructed to return just one record? Memory is a valuable resource in a server environment – be frugal with its consumption.
So let’s address these issues, refactoring our code to take advantage of the database. Starting with the multiple SQL statements, we can combine them into a single statement by joining the tables together. I would encourage you to run “script/dbconsole” and get familiar with SQL. Let me demonstrate what we will be building the SQL way before we implement it the “Rails” way.
SELECT a.*
FROM keywords a,
assignments b
WHERE a.id = b.keyword_id
AND b.post_id = 1;
Another variant of this is to join the tables together using an explicit “join” syntax. An inner join is the same type of join we replicated with the WHERE clause above. We can write the same SQL as follows:
SELECT a.*
FROM keywords a
INNER JOIN assignments b
ON a.id = b.keyword_id
WHERE (( b.post_id = 1 ))
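We could run this exact join by hand from ActiveRecord using “find_by_sql”; a sketch, with the post id bound as a parameter:
# Hand-rolled version of the join above; the trailing 1 is just our example post id.
Keyword.find_by_sql(["SELECT keywords.* FROM keywords " \
                     "INNER JOIN assignments ON keywords.id = assignments.keyword_id " \
                     "WHERE assignments.post_id = ?", 1])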
However, this case is trivial enough that ActiveRecord can build the join for us using “has_many :through”. I can add the following to my models:
# == Schema Information
#
# Table name: keywords
#
# id :integer not null, primary key
# name :string(255)
# created_at :datetime
# updated_at :datetime
#
class Keyword < ActiveRecord::Base
  has_many :assignments
  has_many :posts, :through => :assignments
end
# == Schema Information
#
# Table name: posts
#
# id :integer not null, primary key
# title :string(255)
# created_at :datetime
# updated_at :datetime
#
class Post < ActiveRecord::Base
  has_many :assignments
  has_many :keywords, :through => :assignments
end
Now, I can gather all the keywords for a post by executing the following:
Post.find(1).keywords
Next, let’s address the sorting issue by specifying an ORDER clause. We can tackle another related problem at the same time. Recall that we want to sort in a case-insensitive fashion. If I just order by the name, then “Pabst” would beat “guinness” simply because of the capitalization (and we all know Guinness beer is better). The easy solution is to call “UPPER” so that the casing is the same when the comparison is made. This also saves computation on the Rails side, since we no longer do string conversions in an array sort. In SQL, we could append the following to our SELECT statement:
SELECT a.*
FROM keywords a
INNER JOIN assignments b
ON a.id = b.keyword_id
WHERE (( b.post_id = 1 ))
ORDER BY UPPER(a.name)
The “Rails” way would be to include this on the association as follows. (Notice that Rails table alias names are probably not as reliable as spelling out the name of the table itself, so I have written “keywords.name” here.)
# app/models/post.rb
...
has_many :keywords, :through => :assignments, :order => 'UPPER(keywords.name)'
...
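With that in place, the keywords already come back sorted case-insensitively. A quick check from script/console might look like this (hypothetical output for our example data):
Post.find(1).keywords.map(&:name)
# => ["guinness", "Pabst"]   # "g" sorts before "P" once case is ignored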
Finally, let’s address the uniqueness problem. If I have duplicate keywords, I can return only the unique ones by using the SQL DISTINCT modifier. In SQL, this would look like:
SELECT DISTINCT(a.name)
FROM keywords a
INNER JOIN assignments b
ON a.id = b.keyword_id
WHERE (( b.post_id = 1 ))
ORDER BY UPPER(a.name)
In Rails, we can modify the SELECT clause by passing the :select option to ActiveRecord’s “find” and “all” methods. This has another benefit, depending on the application. For each column in each record returned, ActiveRecord has to store information in memory. By selecting only the specific columns we want, we can reduce the memory footprint. This could look something like this:
Post.find(1).keywords.all(:select => 'DISTINCT(keywords.name)')
# 0.16 seconds to complete w/ 200 keywords - that's 3 times faster!
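Putting the pieces together, the final model and call look like this (just the fragments from above assembled in one place):
# app/models/post.rb
class Post < ActiveRecord::Base
  has_many :assignments
  has_many :keywords, :through => :assignments, :order => 'UPPER(keywords.name)'
end

# One query; the database handles the join, the ordering, and the de-duplication.
Post.find(1).keywords.all(:select => 'DISTINCT(keywords.name)')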
So in summary, we have reduced the number of SQL select statements, pushed the computationally expensive sorting and de-duplication out of Ruby and into the database, and managed to do all this without any fancy tricks. A sharp developer may point out that embedding SQL functions is bad form because it isn’t database agnostic. The truth is that most databases conform to a base set of ANSI SQL standards, and DISTINCT and UPPER are accepted almost across the board.
A little database magic can make a crawling Rails action snappy again. I am a firm believer that Rails, like any framework, should not be an excuse to stop caring about your SQL. Databases are at the heart of most applications and are usually one of the first performance bottlenecks. Good SQL can be married with good Ruby code, and happiness will ensue. I hope this post is informative for Rails folks who want to get started with SQL optimization.