Rails, and other programming frameworks that include an Object Relational Mapper (ORM) make is really easy to not worry about your database. That is until your application slows way down and you are left scratching your head trying to figure out why. Recently, I have been tasked with doing some things to optimize our database code in order to speed up the controller views.
The truth about databases is that you can’t beat them at their own game. Doing so would be like making a FPS on the consoles and releasing it the same day as Modern Warfare 2. A database stores data in a manner that can be easily and efficiently queried. ActiveRecord makes it easy to write code without worrying about the database underneath but this can create a issue. If you never know what the database is doing, how can you know if you are doing it efficiently, or not? I am of the firm belief (after witnessing Oracle chomp through 1000 line SQL queries like they weren’t there) that if a database can do something that your code can do as well, it is probably best to defer to the database. Pick the best tool for the job.
“If all you have is a hammer, everything looks like a nail” – Abraham Maslow
Lets look at some examples of where some database magic can improve our lives. Take the following code base:
# == Schema Information # # Table name: assignments # # id :integer not null, primary key # post_id :integer # keyword_id :integer # created_at :datetime # updated_at :datetime # class Assignment < ActiveRecord::Base belongs_to :topic belongs_to :keyword end # == Schema Information # # Table name: keywords # # id :integer not null, primary key # name :string(255) # created_at :datetime # updated_at :datetime # class Keyword < ActiveRecord::Base has_many :assignments end # == Schema Information # # Table name: posts # # id :integer not null, primary key # title :string(255) # created_at :datetime # updated_at :datetime # class Post < ActiveRecord::Base has_many :assignments end
I have created a “Post” object to represent a blog post. On this blog post, I can assign keywords to each post. To do this, I have another model, “Keyword”, which is associated to posts through an “Assignment” model. The schema information shows the Database structure.
Now, recently I have come across some code that aimed to collect all the keywords of a post with some special options. These keywords were to be found, their names listed, and sorted without case-sensitivity. Finally, any duplicate keys would be excluded. A programmers mind might gravitate towards a solution such as:
Post.find(1).assignments.map {|x| x.keyword.name }.sort {|a,b| a.upcase b.upcase}.uniq
Lets walk through this bit of code before I discuss better alternatives. To start with, we find our Post with an id of “1”, so that we can look at just those keywords. Next we iterate through these keywords (using the symbol to proc shorthand), and then run a sort on the returned array. This array is sorted by uppercasing the strings so that the sort is case-insensitive. Finally, “uniq” is run, to exclude any duplicate keywords. While this is a working solution, it doesn’t take any advantage of the power, and flexibility of what a database can do. A few issues:
- This code generates multiple SQL select statements. (N+1)
- Sorting can be done through “ORDER” in SQL
- Unique records can be generated through DISTINCT
The problem with generating multiple select statements is that this problem is an example of the “N+1” problem. This means that one query (“1”) will be run to determine what the attributes of the assignment is. After this is known, “N” queries are run for each keyword reference. In total, you have “N + 1” queries executed. If you have 5 keywords, the results will be unoptimized, but largely unnoticed. You will have 6 select statements:
Assignment Load (6.3ms) SELECT * FROM "assignments" WHERE ("assignments".post_id = 1) Keyword Load (0.8ms) SELECT * FROM "keywords" WHERE ("keywords"."id" = 1) Keyword Load (0.7ms) SELECT * FROM "keywords" WHERE ("keywords"."id" = 2) Keyword Load (0.7ms) SELECT * FROM "keywords" WHERE ("keywords"."id" = 3) Keyword Load (0.7ms) SELECT * FROM "keywords" WHERE ("keywords"."id" = 4) Keyword Load (0.8ms) SELECT * FROM "keywords" WHERE ("keywords"."id" = 5)
What happens if you have 500 keywords? 5,000,000 keywords? Your performance bottleneck will quickly shift to the database as multiple queries have to be generated, executed, and returned for each page request.
What is the big deal about sorting in your application instead of the database? In order for Rails (ActiveRecord specifically) to “sort” these items, the items will be be returned from a generated select statement from the database. These items will be returned as elements in an array, and then the array is sorted by calling the “” method on the Array class. Further, this is done in Ruby – a dynamic language, which is several orders of magnitude slower than in the database (which is most likely written in C). Simply, the database is made for this kind of work, and outsourcing it to Ruby is just not efficient.
Finally, why not make records unique with Ruby’s “uniq”? Again, it has to do with meddling in database territory. It has the problems inherent to the sorting problem above, but with an additional problem. Lets say that you return 500 records – and 499 of them are duplicates. Before the “uniq” method is run, Ruby is holding in memory references to the attributes of 500 ActiveRecord instances. When you call “uniq”, it can drop them. Why go through this memory spike and plummet, when the database can be instructed to just return one record? Memory is a valuable resource in a server environment – be frugal with its consumption.
So, lets address these issues, refactoring our code to take these issues into account. Starting with the multiple SQL statements, we can combine these into one statement by joining to multiple tables. I would encourage you to run “script/dbconsole” and get familiar with SQL. Let me demonstrate what we will be building the SQL way, before we implement this the “Rails” way.
SELECT a.* FROM keywords a, assignments b WHERE a.id = b.keyword_id AND b.post_id = 1;
Another variant of this is to join the tables together using a “join” syntax. An inner join is the same type of join that we have replicated in the WHERE clause here. We can right the same SQL as follows:
SELECT a.* FROM keywords a INNER JOIN assignments b ON a.id = b.keyword_id WHERE (( b.post_id = 1 ))
We can specify this join using ActiveRecord’s “find_by_sql” if we wanted to specify the association by hand. However this case is trivial enough that ActiveRecord can build this for us using “has_many :through”. I can add the following to my models:
# == Schema Information # # Table name: keywords # # id :integer not null, primary key # name :string(255) # created_at :datetime # updated_at :datetime # class Keyword :assignments end # == Schema Information # # Table name: posts # # id :integer not null, primary key # title :string(255) # created_at :datetime # updated_at :datetime # class Post :assignments end
Now, I can gather all the keywords for a post by executing the following:
Post.find(1).keywords
Next, lets address the sorting issue by specifying an ORDER clause. We can tackle another related problem at the same time we do the sort. Recall that we want to sort in a case-insensitive fashion. If I just call order, then “Pabst” would beat “guinness” simply because of the capitalization (and we all know Guinness beer is better). The easy solution is to call “UPPER” to make the casing the same when the comparison is made. This actually saves even more computation on the Rails side by not having to do string conversions with our array sort. In SQL, we could append the following to our SELECT statement:
SELECT a.* FROM keywords a INNER JOIN assignments b ON a.id = b.keyword_id WHERE (( b.post_id = 1 )) ORDER BY UPPER(b.name)
The “Rails” way would be to include this on the association as follows: (Notice that Rails table alias names are probably not as reliable as listing out the name of the table itself. In this case, I have included “keywords.name”
# app/models/post.rb ... has_many :keywords, :through => :assignments, :order => 'UPPER(keywords.name)' ...
Finally, lets address the unique problem. If I have duplicate keywords, I can return only the unique keywords by using the SQL DISTINCT modifier. In SQL, this would look like:
SELECT DISTINCT(a.name) FROM keywords a INNER JOIN assignments b ON a.id = b.keyword_id WHERE (( b.post_id = 1 )) ORDER BY UPPER(b.name)
In Rails, we can specify modifications to our SELECT clause, by passing the :select key to ActiveRecord’s “find”, and “all” methods. This has another benefit, depending on the application. For each column in each record, ActiveRecord has to store information in memory. By choosing only the specific columns that we want returned in the SQL clause, we can reduce the memory footprint. This could look something like this:
Post.find(1).keywords.all(:select => 'DISTINCT(keywords.name)') # 0.16 seconds to complete w/ 200 keywords - thats 3 times faster!
So in summary, we have reduced SQL select statements, computationally expensive sorting and unique method calls from our results, and have managed to do all this without any fancy tricks. A sharp developer may point out that embedding SQL functions is bad form because it isn’t database agnostic. The truth is most databases conform to a base set of ANSI SQL standards and DISTINCT, and UPPER are acceptable almost across the board.
A little database magic can make a crawling Rails action become snappy again. I am a firm believer that Rails, like any framework should not be a reason to be uncaring about your SQL. Database are the heart of most applications, and usually one of the first bottlenecks for performance. Good SQL can be married with good Ruby code, and happiness will ensue. I hope this post was informative for Rails folks that want to get started with SQL optimization.
I still don’t see how any of this is better than just writing the SQL statement and parsing the results.
LikeLike
For starters, I don’t remember making as much writing SQL! 🙂 I agree with you almost fully. ORMs are a dirty business to me, and these kinds of issues are inherent. We are discussing some of these issues at work right now. The proponents of ActiveRecord is that it keeps you in the code, and not thinking about the database, which should increase productivity. Its also database agnostic which means you can easily change (I think this is a fringe case) The opponents to ActiveRecord say that you should be worried about the database first and foremost. I think a co-worker said it best – if you chose a framework, why would you not use the tools that framework provides? The answer to why someone would choose Rails is speed and a mature framework stack. I would challenge someone to find another web development framework that is as comprehensive, documented and popular as Rails is right now. Django? ASP.NET? CakePHP? Any one of the 5 million Java “frameworks”?
Also, you will note that not once did I leave Ruby to do a database task.
LikeLike
Posts like these make me want to kill myself. I expect a post about your actual life soon (you know, the stuff that matters?).
Also, I am working this Saturday and next, but after that we need to get our PS3 on. I’ve been itching to play that second mission in RE5.
LikeLike
No, you didn’t leave Ruby, but you did re-write the (purposefully simple) SQL into something five times as long and not immediately clear.
I don’t believe you have to be “worried about the database first and foremost.” I believe you should have a clear understanding of the data you are working with. From that, a database schema flows naturally.
It seems to me that this post is highlighting the problem with relying on the framework to do everything for you. It worked great in the beginning, and allowed you to get something up and running quickly. But now you’re suffering performance issues because of that shortcut and are having to re-visit something that was already “completed” to figure out why it runs like shit. Whereas if you spend the extra little bit of time upfront to do it the “long way” you would more like not have seen the performance issue.
I don’t care what framework you use, the same sort of problem will crop up in everything. For example, Cocoa provides an API called “Core Data.” This API basically does all the dirty work of saving data for you. It can save to XML files or an SQLite database. If you do some quick searching, you will find that plenty of people hit the limits of relying on Core Data pretty quickly.
I guess what I am saying is this: if your application is just some silly little thing (e.g. a personal address book), then by all means take the shortcut. If your application is actually going to be relied upon, invest some time in it.
LikeLike
One way of writing SQL is to do it by hand (hopefully with care and efficiency). The advantage is full control, and clarity. However you have written a single SQL query outside of any program. Your program doesn’t know about the data from your query – you have to wire it all up yourself. The SQL isn’t abstracted, or encapsulated. It doesn’t represent a class, or an object. You can’t inherit from it, or any of the other things that Object Oriented Programming preaches.
A second way is to use an object relational mapper (http://en.wikipedia.org/wiki/List_of_object-relational_mapping_software) and take all that logic and integrate it into your application. When I type “Post.all”, I am doing “select * from posts”. It isn’t as immediately clear as SQL (assuming you know SQL), but it does adhere to Object Oriented Programming practices. My database has been abstracted in my application logic. I can manipulate any part of the SQL by method chaining, and argument passing. The results are a collection of Class instances. The program understands what they are and I have methods I can execute on the results. This is the reason to use an ORM.
There are other methods too. When your database structure makes no sense (think SWAN), and your ORM has no hope of “figuring out” how the tables are related, you can hand write SQL, and abstract this into Classes and Methods. This generates a lot more work, and code. I don’t think this approach is necessary with a normalized database with good structure – ORMs can figure them out without many issues.
The argument of performance issues I believe is inherent to Object Orientated Design. As you increase abstraction, you add layers to your code, and how components interoperate isn’t always the most efficient as “bare metal” programming. Think of a gas explosion – the release of energy. A gas explosion in a car engine harnesses only about 20% of the energy – far less efficient, but it accomplishes a task through a set of interoperating components. When the explosion is made, it drives the cylinder down, turning the crankshaft, rotating the tires, and moving the car forward.
I can promise you that our application is not a “silly little thing”. It is a multi-million dollar government contract that is making operations more efficient. We have invested a lot of time in the design, implementation, and choices that have gone into this operation. At the level of complexity we are at Object Oriented Programming is a necessity.
LikeLike
I know it isn’t a silly project you are working on. I was speaking generally. As you said yourself, it is totally doable to write classes to objectify the database. I happen to think that is the clearer, more efficient, way to do it. Especially considering what you wrote in this article. It sounds, and looks like, the ORM framework makes it really easy to use the database in a poor manner. Whereas, when you write the SQL yourself, you are forced to think at least a little bit about the data and what you are going to be doing with it. *shrug*
LikeLike
ANDDDD IIIIIIIIIIIIIII TOUCH MYSELF TO RUBAAAAAY
LikeLike