Open-source, Software

Updating database rows with a default position

I was recently tasked with making a table in a MySQL database sortable by the user. There are a few steps involved to take care of new records, while providing existing records with a default value.

The new records would be created with a ifnull(max(order), 0) + 1 to assign them the next position. This was sufficient since I was positioning based on created_at and a new record was guaranteed to have the highest created_at timestamp.

For the existing records, I wanted to:

  • group by user_id
  • set the order (int) column to an incrementing integer
  • with an initial ordering based on a created_at timestamp

The first approach was to select groups of users, and for each group sort the records then update each record. This approach would work, but it wouldn’t have good performance. I really wanted to do this in a single update statement.

I ended up going with an approach based on this article :

set @user_id = 0, @order = 1

update table1 as dest,
(
  select x.id, x.user_id, x.created_at,
  @order := if(@user_id = x.user_id, @order + 1, 1) as `order`,
  @user_id := x.user_id as dummy
  from (
    select id, user_id, created_at
    from table1
    order by user_id, created_at
  ) as x
) as src
set dest.`order` = src.`order`
where dest.id = src.id;

Let’s break this down:

  1. We are using database session variables to keep track of some state. We default to @user_id = 0 and @order = 1
  2. Working from from the inner-most select, we select the id (primary key), user_id (int)  and created_at (timestamp) as these are the columns needed to do our calculations for the new order value. We order these by user_id and created_at to get the sort ready. This dataset is aliased as x (yes I know I’m terrible at naming)
  3. The next select up pulls the same id, user_id and created_at from the inner-most select and does some calculations persisting the state in the session variables we setup in step 1. The actual work here is:
    1. Calculating the order: @order := if(@user_id = x.user_id, @order + 1, 1) . The := is an assignment in SQL. We set the @order to either a default of 1, or the @order + 1 .
    2. Comparing user to determine what to set the @order to. This is done in a conditional expression:  if(@user = x.user_id). Because the @user_id is set to the x.user_id value on each row, the comparison should return true until we get to the next user, at which point the @order will reset to 1.
  4. Finally the outer update takes the resulting src dataset and does some updating. We constrain to where dest.id = src.id to get the correct record, and then we set dest.order = src.order which is our calculated value.

Performance is looking great. On a dataset of 300k records, this runs in 3 seconds on a 4 CPU Docker container.

Bonus: If you have additional criteria (like only sorting favorite records for example) this can be easily applied to the inner-most SELECT in the where clause!

Happy ordering!

Advertisements
Computers, Open-source, Personal, Ruby, Software, Thoughts, Web

Taking the Magic Out of ActiveRecord – Speed up that Query!

Rails, and other programming frameworks that include an Object Relational Mapper (ORM) make is really easy to not worry about your database. That is until your application slows way down and you are left scratching your head trying to figure out why. Recently, I have been tasked with doing some things to optimize our database code in order to speed up the controller views.

The truth about databases is that you can’t beat them at their own game. Doing so would be like making a FPS on the consoles and releasing it the same day as Modern Warfare 2. A database stores data in a manner that can be easily and efficiently queried. ActiveRecord makes it easy to write code without worrying about the database underneath but this can create a issue. If you never know what the database is doing, how can you know if you are doing it efficiently, or not? I am of the firm belief (after witnessing Oracle chomp through 1000 line SQL queries like they weren’t there) that if a database can do something that your code can do as well, it is probably best to defer to the database. Pick the best tool for the job.

“If all you have is a hammer, everything looks like a nail” – Abraham Maslow

Lets look at some examples of where some database magic can improve our lives. Take the following code base:

# == Schema Information
#
# Table name: assignments
#
#  id         :integer         not null, primary key
#  post_id    :integer
#  keyword_id :integer
#  created_at :datetime
#  updated_at :datetime
#

class Assignment < ActiveRecord::Base
  belongs_to :topic
  belongs_to :keyword
end

# == Schema Information
#
# Table name: keywords
#
#  id         :integer         not null, primary key
#  name       :string(255)
#  created_at :datetime
#  updated_at :datetime
#

class Keyword < ActiveRecord::Base
  has_many :assignments
end

# == Schema Information
#
# Table name: posts
#
#  id         :integer         not null, primary key
#  title      :string(255)
#  created_at :datetime
#  updated_at :datetime
#

class Post < ActiveRecord::Base
  has_many :assignments
end

I have created a “Post” object to represent a blog post. On this blog post, I can assign keywords to each post. To do this, I have another model, “Keyword”, which is associated to posts through an “Assignment” model. The schema information shows the Database structure.

Now, recently I have come across some code that aimed to collect all the keywords of a post with some special options. These keywords were to be found, their names listed, and sorted without case-sensitivity. Finally, any duplicate keys would be excluded. A programmers mind might gravitate towards a solution such as:

Post.find(1).assignments.map {|x| x.keyword.name }.sort {|a,b| a.upcase  b.upcase}.uniq

Lets walk through this bit of code before I discuss better alternatives. To start with, we find our Post with an id of “1”, so that we can look at just those keywords. Next we iterate through these keywords (using the symbol to proc shorthand), and then run a sort on the returned array. This array is sorted by uppercasing the strings so that the sort is case-insensitive. Finally, “uniq” is run, to exclude any duplicate keywords. While this is a working solution, it doesn’t take any advantage of the power, and flexibility of what a database can do. A few issues:

  1. This code generates multiple SQL select statements. (N+1)
  2. Sorting can be done through “ORDER” in SQL
  3. Unique records can be generated through DISTINCT

The problem with generating multiple select statements is that this problem is an example of the “N+1” problem. This means that one query (“1”) will be run to determine what the attributes of the assignment is. After this is known, “N” queries are run for each keyword reference. In total, you have “N + 1” queries executed. If you have 5 keywords, the results will be unoptimized, but largely unnoticed. You will have 6 select statements:

  Assignment Load (6.3ms)   SELECT * FROM "assignments" WHERE ("assignments".post_id = 1) 
  Keyword Load (0.8ms)   SELECT * FROM "keywords" WHERE ("keywords"."id" = 1) 
  Keyword Load (0.7ms)   SELECT * FROM "keywords" WHERE ("keywords"."id" = 2) 
  Keyword Load (0.7ms)   SELECT * FROM "keywords" WHERE ("keywords"."id" = 3) 
  Keyword Load (0.7ms)   SELECT * FROM "keywords" WHERE ("keywords"."id" = 4) 
  Keyword Load (0.8ms)   SELECT * FROM "keywords" WHERE ("keywords"."id" = 5)

What happens if you have 500 keywords? 5,000,000 keywords? Your performance bottleneck will quickly shift to the database as multiple queries have to be generated, executed, and returned for each page request.

What is the big deal about sorting in your application instead of the database? In order for Rails (ActiveRecord specifically) to “sort” these items, the items will be be returned from a generated select statement from the database. These items will be returned as elements in an array, and then the array is sorted by calling the “” method on the Array class. Further, this is done in Ruby – a dynamic language, which is several orders of magnitude slower than in the database (which is most likely written in C). Simply, the database is made for this kind of work, and outsourcing it to Ruby is just not efficient.

Finally, why not make records unique with Ruby’s “uniq”? Again, it has to do with meddling in database territory. It has the problems inherent to the sorting problem above, but with an additional problem. Lets say that you return 500 records – and 499 of them are duplicates. Before the “uniq” method is run, Ruby is holding in memory references to the attributes of 500 ActiveRecord instances. When you call “uniq”, it can drop them. Why go through this memory spike and plummet, when the database can be instructed to just return one record? Memory is a valuable resource in a server environment – be frugal with its consumption.

So, lets address these issues, refactoring our code to take these issues into account. Starting with the multiple SQL statements, we can combine these into one statement by joining to multiple tables. I would encourage you to run “script/dbconsole” and get familiar with SQL. Let me demonstrate what we will be building the SQL way, before we implement this the “Rails” way.

SELECT a.* 
FROM   keywords a, 
       assignments b 
WHERE  a.id = b.keyword_id 
       AND b.post_id = 1; 

Another variant of this is to join the tables together using a “join” syntax. An inner join is the same type of join that we have replicated in the WHERE clause here. We can right the same SQL as follows:

SELECT a.* 
FROM   keywords a 
       INNER JOIN assignments b 
         ON a.id = b.keyword_id 
WHERE  (( b.post_id = 1 )) 

We can specify this join using ActiveRecord’s “find_by_sql” if we wanted to specify the association by hand. However this case is trivial enough that ActiveRecord can build this for us using “has_many :through”. I can add the following to my models:

# == Schema Information
#
# Table name: keywords
#
#  id         :integer         not null, primary key
#  name       :string(255)
#  created_at :datetime
#  updated_at :datetime
#

class Keyword  :assignments
end

# == Schema Information
#
# Table name: posts
#
#  id         :integer         not null, primary key
#  title      :string(255)
#  created_at :datetime
#  updated_at :datetime
#

class Post  :assignments
end

Now, I can gather all the keywords for a post by executing the following:

Post.find(1).keywords

Next, lets address the sorting issue by specifying an ORDER clause. We can tackle another related problem at the same time we do the sort. Recall that we want to sort in a case-insensitive fashion. If I just call order, then “Pabst” would beat “guinness” simply because of the capitalization (and we all know Guinness beer is better). The easy solution is to call “UPPER” to make the casing the same when the comparison is made. This actually saves even more computation on the Rails side by not having to do string conversions with our array sort. In SQL, we could append the following to our SELECT statement:

SELECT a.* 
FROM   keywords a 
       INNER JOIN assignments b 
         ON a.id = b.keyword_id 
WHERE  (( b.post_id = 1 )) 
ORDER BY UPPER(b.name)

The “Rails” way would be to include this on the association as follows: (Notice that Rails table alias names are probably not as reliable as listing out the name of the table itself. In this case, I have included “keywords.name”

# app/models/post.rb
...
has_many :keywords, :through => :assignments, :order => 'UPPER(keywords.name)'
...

Finally, lets address the unique problem. If I have duplicate keywords, I can return only the unique keywords by using the SQL DISTINCT modifier. In SQL, this would look like:

SELECT DISTINCT(a.name) 
FROM   keywords a 
       INNER JOIN assignments b 
         ON a.id = b.keyword_id 
WHERE  (( b.post_id = 1 )) 
ORDER BY UPPER(b.name)

In Rails, we can specify modifications to our SELECT clause, by passing the :select key to ActiveRecord’s “find”, and “all” methods. This has another benefit, depending on the application. For each column in each record, ActiveRecord has to store information in memory. By choosing only the specific columns that we want returned in the SQL clause, we can reduce the memory footprint. This could look something like this:

Post.find(1).keywords.all(:select => 'DISTINCT(keywords.name)')
# 0.16 seconds to complete w/ 200 keywords - thats 3 times faster!

So in summary, we have reduced SQL select statements, computationally expensive sorting and unique method calls from our results, and have managed to do all this without any fancy tricks. A sharp developer may point out that embedding SQL functions is bad form because it isn’t database agnostic. The truth is most databases conform to a base set of ANSI SQL standards and DISTINCT, and UPPER are acceptable almost across the board.

A little database magic can make a crawling Rails action become snappy again. I am a firm believer that Rails, like any framework should not be a reason to be uncaring about your SQL. Database are the heart of most applications, and usually one of the first bottlenecks for performance. Good SQL can be married with good Ruby code, and happiness will ensue. I hope this post was informative for Rails folks that want to get started with SQL optimization.

Computers

ActiveRecord’s Secret find_by_sql Results

Ruby on Rails logo Well, its not exactly a secret. It sure isn’t well documented however. Recently, I wanted to return a query that spanned multiple database tables. I decided to go with find_by_sql because of the mind-blowing idiocy with which this legacy database was structured. I will take a watered down version of what I was attempting to do to demonstrate how we can expose some “hidden” functionality of ActiveRecord’s find_by_sql method.

 
Channel table:
------------------------------------
id | title  | description        | user_id
------------------------------------
1  | first | the first channel | 1

User table:
-----------
id | name
-----------
1  | ben

After I constructed my find_by_sql query, it looked something like this:

Channel.find_by_sql("SELECT a.*, b.name
FROM channel a, user b
WHERE a.user_id = b.id")

This query selects all columns from table a (channel), and a single column from table b (user). This is pretty standard, as many queries need to gather values from multiple table columns in a single SELECT operation.

Running this query, you will receive an array of Channel instances with all the attributes filled in for the channel model. Missing however, will be the attributes from any table other than “Channel”:

Channel.find_by_sql("SELECT a.*, b.name
FROM channel a, user b
WHERE a.user_id = b.id")
=> "[#]"

Notice how the “name” column from table b (user) is not present in the display? You can even query this attribute directly:

c = Channel.find_by_sql("SELECT a.*, b.name
FROM channel a, user b
WHERE a.user_id = b.id")
=> "[#]"
c[0].name
=> NoMethodError: undefined method 'name' for #

We could create an attr_accessor for the Channel class, and this would resolve the NoMethodError, but it still won’t be populated for our Channel instance after a find_by_sql.

After some digging around in the source code, and online, I came across this posting, which made the brilliant suggestion of looking in channel.attributes. This method will list an array of attributes that ActiveRecord knows about. Take a look at channel.attributes.keys:

c.attributes.keys
=> ["id", "title", "description", "user_id", "name"]

There it is! Our “missing” name attribute from the SELECT query has been located. Accessing the value for this attribute is trivial:

c.attributes["name"]
=> "ben"

We can do this with as many “extra” columns as we want. If two column names conflict (say channels had a column “name”, and users also had a column “name”), the database will return “name”, and “name_1” respectively. This is a really powerful feature of ActiveRecord that will encourage people to stick with the ORM, since they can still write SQL in a pinch.

Bonus: Customizing .to_json to include find_by_sql attributes

In the preceding example, the attribute “name” would not be included in the output of a “.to_json” call, as in the following example:

c.attributes.keys
=> ["id", "title", "description", "user_id", "name"]
c.to_json
=> "{"channel":{"id":1,"title":"first","description":"The first channel"}}"

This is where we can customize what is included in our JSON output. This article showed me that you can use the :methods argument with to_json to explicitly include any custom attributes, such as those that are attr_accessor objects in your class. When passing in the :methods argument, I must specify which attributes to include:

c.attributes.keys
c.to_json
=> "{"channel":{"id":1,"title":"first","description":"The first channel"}}"
c.to_json :methods => :name
=> "{"channel":{"id":1,"title":"first","description":"The first channel","name":"ben"}}"

Good job Rails team! No ugly hacks, or overrides needed today.

Computers, Personal, Ruby, Software, Thoughts

The Woes of Work

This week it seems that I have developed programmer’s block. I don’t know if there is such a thing, but I have felt totally unable to do anything involving code. I can’t write SQL, I can’t write Ruby, and I can’t even think through concepts that my co-workers are proposing.

It probably doesn’t help that my co-worker wanted to share the same development database instance “since there are only two of us, what could possibly happen?”, and he has been migrating my ass to death. Also, with constant foundation changes, my code literally breaks every night. It also probably doesn’t help that no one likes my solutions before I even have time to flesh them out into a prototype. And the final nail in the coffin is the constant interruptions in my day from supporting everything else that we do.

When I code, I have a tendency to go with the simplest possible solution to a problem. Generally, this means less code, less bugs, and the least surprise to the end user. This also generally requires less effort. This week I have felt like someone running beside a train in which my co-workers are comfortably sitting, discussing some pretty “out-there” solutions, and I am gasping for air, and reaching up trying to grab onto anything to pull myself onboard.

I would love to develop this project in an agile way. This isn’t just some methodology written on paper – it is a mindset that you have to embrace. In part- if you build your foundation in an sensible and orthogonal way (from The Pragmatic Programmer),  then the problems of tomorrow don’t account for your design today. However, my co-developers want to build the beginning code with this grand vision of the end baked into the first lines of code. I am not ready to field questions about how ‘x’ is going to fit into ‘y’ in some scenario yet. I haven’t even written ‘x’ and ‘y’.

Our boss finally said on Thursday that he is going to remove the deadline on our project, so that we can take our time. Initially I was pretty upset, because when projects get put on the back burner, we rarely move them back to the front again. However, in my current state maybe it is for the best.

Maybe I just need a vacation…