Open-source, Software, Web

Inserting Large Data Sets in MySQL

It's always interesting for me to work with large data sets. Solutions that work at lower orders of magnitude don't always scale, and I am left with unusable solutions in production. Often the fix is a clever refactoring that at a cursory glance appears identical, but somehow skirts around some expensive operation.

I had a requirement to tune a script responsible for inserting 300k records into a database table. The implemented solution of iterating through a collection and calling 'INSERT' for each record was not scaling well, and the operation was taking long enough to time out on some runs. This gave me the opportunity to learn about a few things in MySQL, including the profiler and (spoiler!) the multiple-row INSERT syntax.

I needed some real numbers to compare the changes I would be making. My plan was to change one thing at a time and run a benchmark to tell whether performance was 1) better, 2) worse, or 3) not impacted. MySQL has an easy-to-use profiler for getting this information. Inside the MySQL CLI, you can issue the command:

SET profiling=1;

Any subsequent queries you run will now be profiled. You can see a listing of the profiled queries by typing:

SHOW profiles;

This command will show an index of queries that have run, along with their associated Query_ID. To view more information about a particular query, issue the following command, replacing x with the Query_ID:

SHOW profile FOR QUERY x;

Here is an example output:

+------------------------------+----------+
| Status                       | Duration |
+------------------------------+----------+
| starting                     | 0.000094 |
| checking permissions         | 0.000003 |
| checking permissions         | 0.000002 |
| checking permissions         | 0.000001 |
| checking permissions         | 0.000003 |
| Opening tables               | 0.000021 |
| System lock                  | 0.000008 |
| init                         | 0.000039 |
| optimizing                   | 0.000012 |
| statistics                   | 0.000717 |
| preparing                    | 0.000014 |
| Creating tmp table           | 0.000023 |
| executing                    | 0.000002 |
| Copying to tmp table         | 0.016192 |
| converting HEAP to MyISAM    | 0.026860 |
| Copying to tmp table on disk | 2.491668 |
| Sorting result               | 0.269554 |
| Sending data                 | 0.001139 |
| end                          | 0.000003 |
| removing tmp table           | 0.066401 |
| end                          | 0.000009 |
| query end                    | 0.000005 |
| closing tables               | 0.000011 |
| freeing items                | 0.000040 |
| logging slow query           | 0.000002 |
| cleaning up                  | 0.000015 |
+------------------------------+----------+

In one iteration of my SQL query, I was spending an excessive amount of time in “Copying to tmp table”. After reading the article http://www.dbtuna.com/article/55/Copying_to_tmp_table_-_MySQL_thread_states, I was able to isolate the cause to an ORDER clause in my query that wasn't strictly necessary. Beyond that, not much else exciting is going on in the profile, which is a Good Thing.
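For illustration, here is the shape of that change. This is a hypothetical stand-in for my actual query; the table and columns are the foo example used later in this post:

-- Before: an unneeded ORDER BY forces an expensive sort via a tmp table
SELECT col1, col2 FROM foo WHERE col1 > 100 ORDER BY col2;

-- After: if no caller depends on the ordering, the clause can simply go
SELECT col1, col2 FROM foo WHERE col1 > 100;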

For a comprehensive listing of the thread states shown in the Status column, see: http://dev.mysql.com/doc/refman/5.0/en/general-thread-states.html

Now that I know my query is as optimized as it can be, it's time to pull out the bigger guns. On to plan B – consolidating those INSERT statements!

An INSERT statement, though it seems to execute instantaneously under small loads, is composed of many smaller operations, each with its own cost. The relative expense of these operations is roughly the following (per http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html):

  • Connecting: (3)
  • Sending query to server: (2)
  • Parsing query: (2)
  • Inserting row: (1 × size of row)
  • Inserting indexes: (1 × number of indexes)
  • Closing: (1)

As you can see, connecting to the server, sending the query, and parsing it are relatively expensive fixed costs, paid once per statement. In the script I was modifying, 300k INSERT statements were generating 300k records. Fortunately, MySQL doesn't force our records to be 1:1 with our INSERT statements, because a single INSERT can insert multiple rows. So instead of three separate INSERT statements:

INSERT INTO foo (col1, col2) VALUES (1, 1);
INSERT INTO foo (col1, col2) VALUES (2, 2);
INSERT INTO foo (col1, col2) VALUES (3, 3);

We can instead coalesce them into a single INSERT statement:

INSERT INTO foo (col1, col2) VALUES (1, 1), (2, 2), (3, 3);

How many values can we coalesce into the same INSERT statement? This isn't driven by a maximum number of records, but by size in bytes. The relevant server system variable is bulk_insert_buffer_size (http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_bulk_insert_buffer_size), which can be modified and defaults to 8388608 bytes; the total size of a single statement is also capped by max_allowed_packet. The exact number of records will vary with the number of columns and the amount of data being inserted into them. I conservatively chose to coalesce 5k records at a time. When I tried to bump this to 10k, I exceeded a server maximum and the statement failed with an exception.
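For illustration, a minimal Ruby sketch of the batching approach; records, connection, and the column names are hypothetical stand-ins for whatever your script actually uses:

BATCH_SIZE = 5_000  # conservative; stays under the server's size limits

records.each_slice(BATCH_SIZE) do |batch|
  # Build one multi-row INSERT per batch of 5k records
  values = batch.map { |r| "(#{r[:col1]}, #{r[:col2]})" }.join(", ")
  # NOTE: real code must quote/escape any string values
  connection.execute("INSERT INTO foo (col1, col2) VALUES #{values}")
end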

With my INSERTs coalesced, I reduced the total number of INSERT statements to 60 (300k / 5k). This yielded a massive performance boost: the operation went from taking over an hour to completing in just 2 minutes. Quite a nice trick, considering the data is unchanged.

Is there room for improvement? Absolutely. A statement executed 60 times may be worth preparing, or wrapping inside a transaction (see the sketch below). My real-world tests didn't yield a significant enough performance boost to make these complexities worth implementing, though that may not hold for data at higher orders of magnitude, or for different schema layouts. MySQL also understands index hints, which let you suggest indexes the query planner may have missed, or force the inclusion or exclusion of particular indexes despite what the planner thinks! (http://dev.mysql.com/doc/refman/5.0/en/index-hints.html)
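Here is a hedged sketch of what those two techniques look like in plain MySQL, reusing the example table from above:

-- Wrapping the batched INSERTs in a single transaction
START TRANSACTION;
INSERT INTO foo (col1, col2) VALUES (1, 1), (2, 2), (3, 3);
INSERT INTO foo (col1, col2) VALUES (4, 4), (5, 5), (6, 6);
COMMIT;

-- Preparing the statement once, then executing it repeatedly
PREPARE ins FROM 'INSERT INTO foo (col1, col2) VALUES (?, ?)';
SET @a = 1, @b = 1;
EXECUTE ins USING @a, @b;
DEALLOCATE PREPARE ins;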

Speaking of indexes, if the table has any (UNIQUE, BTREE, or otherwise), they can be dropped while the mass INSERT is running and added back afterwards, side-stepping the per-row index maintenance cost (the “1 × number of indexes” item above).
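A sketch of that pattern, assuming a hypothetical unique index named idx_col1 on our example table:

ALTER TABLE foo DROP INDEX idx_col1;
-- ... run the mass INSERT here ...
ALTER TABLE foo ADD UNIQUE INDEX idx_col1 (col1);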

In the next order of magnitude, I will probably have to rethink my approach of using INSERT statements to load data. According to the MySQL documentation, LOAD DATA INFILE is “roughly 20 times faster” than a MySQL INSERT statement. My script would no longer generate statements, but rather output to a file in a comma-delimited format. This could then be loaded, assuming appropriate permissions are in place.
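A sketch of what that would look like, assuming the script wrote its output to a hypothetical /tmp/foo.csv:

LOAD DATA INFILE '/tmp/foo.csv'
INTO TABLE foo
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(col1, col2);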

Happy profiling!

Computers, Linux, Open-source, Ruby, Software, Web

Setup PostgreSQL with Rails on Linux

Today, I found myself needing to set up a Rails application to work with the PostgreSQL database. The documentation on the PostgreSQL website was like drinking from a fire hose. Worse, every community response to an error message takes a slightly different approach to the solution. Let's run through a basic Rails PostgreSQL configuration, assuming Rails 3, Postgres 8.x, and Ubuntu 11.04:

Step 1: Installing PostgreSQL and Libraries

Install the PostgreSQL server, the client package for connecting (psql), and the libpq development files needed to compile the Ruby PostgreSQL driver:

$ sudo apt-get install postgresql postgresql-client libpq-dev

After this finishes installing, you can turn to your OS X co-worker and laugh at him while he is still downloading the first tarball file. PostgreSQL will start automatically, under the user postgres. You can verify that the installation is a success by using the psql command line utility to connect as the user postgres. This can be accomplished using the following command:

$ sudo -u postgres psql

This uses sudo to elevate your basic user privileges, and the “-u” switch executes the command that follows as an alternate user; in this case, it runs psql as postgres. If you connect successfully, you should be at the psql interactive prompt. If not, ensure PostgreSQL is running, and that psql is in the postgres user's path.

Note: From the psql interactive prompt, type “\q” to exit.

Step 2: Configure a New PostgreSQL database

From the psql prompt, you can run SQL to view the current PostgreSQL users:

select * from pg_user;

You should see a table of database users returned:

 usename  | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
----------+----------+-------------+----------+-----------+----------+----------+-----------
 postgres |       10 | t           | t        | t         | ******** |          |
(1 row)

We can see the postgres user that was created automatically during the installation of PostgreSQL. Let's add another user to be the owner of our Rails database. The path of least resistance is to use your shell account username, since that keeps us from having to change options in the database configuration file.

$ sudo -u postgres createuser <username>
# Shall the new role be a superuser? (y/n) n
# Shall the new role be allowed to create databases? (y/n) y
# Shall the new role be allowed to create more new roles? (y/n) n

This creates a new database user (named after your shell account) and grants that user access to log in to the database. The command asks a few questions about the new account. It is important for Rails that you answer “y” to whether the user should be able to create databases; if you say no, you will not be able to run any rake tasks that create or drop the database.
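If you prefer to skip the prompts, the same result can be achieved non-interactively with createuser's flags (shown here with the <username> placeholder):

$ sudo -u postgres createuser --createdb --no-superuser --no-createrole <username>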

We can confirm by selecting from the pg_user table again.

$ sudo -u postgres psql
select * from pg_user;
 usename    | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
------------+----------+-------------+----------+-----------+----------+----------+-----------
 postgres   |       10 | t           | t        | t         | ******** |          |
 <username> |    16391 | f           | f        | f         | ******** |          |
(2 rows)

Step 3: Configure Rails

Switching to the Rails side, let's configure our application for Postgres. This requires the pg gem. Open your Gemfile and append:

# Gemfile
gem "pg"

Now run bundle install to update your project gems.

$ bundle install

This should compile the Ruby pg database driver, allowing Ruby to talk to Postgres. Now, let's tell our Rails application how to access the database. Open config/database.yml and change the adapter line to read “postgresql”. The database name is, by convention, the name of your project plus “_development”. Finally, your shell username is needed. Because PostgreSQL authenticates this account locally, you do not need to supply a password option; delete that line.

# config/database.yml
development:
  adapter: postgresql
  encoding: unicode
  database: <appname>_development
  pool: 5
  username: <username>

To test, run the rake task to create your database:

$ rake db:create

If everything works, you should have a newly created database owned by your shell account. You can log in using psql by passing the name of the database as an option:

$ psql -d <appname>_development

Happy migrating!

Troubleshooting

If you get the error: “FATAL: Ident authentication failed for user "<username>"”, ensure that you can see your newly created account in the pg_user table of the postgres database. (See Step 2 above.)

If you get the error: “PGError: ERROR: permission denied to create database”, then ensure that your database user account has the CREATEDB privilege. This can be granted during the “createuser” command line account creation by answering “y” to the corresponding question about this permission.
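The privilege can also be granted after the fact, by running the following as postgres at the psql prompt:

ALTER ROLE <username> CREATEDB;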

If you get the error: “FATAL: role "<username>" is not permitted to log in”, try manually granting the LOGIN privilege to your database user account. This can be done by executing the following as postgres at the psql prompt:

ALTER ROLE <username> LOGIN;

Notes on Alternative Authentications

PostgreSQL integrates very deeply into the Linux authentication world, allowing for quite an array of connection options. By default, passwords are not accepted for local connections; instead, PostgreSQL is configured to use the “ident sameuser” method of authentication. See more at http://www.postgresql.org/docs/8.4/static/auth-pg-hba-conf.html.
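For reference, the default local entry in pg_hba.conf (typically under /etc/postgresql/<version>/main/ on Ubuntu) looks something like this in the 8.x format:

# TYPE  DATABASE  USER  METHOD
local   all       all   ident sameuser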

Computers, Open-source, Personal, Ruby, Software, Thoughts, Web

Taking the Magic Out of ActiveRecord – Speed up that Query!

Rails, and other programming frameworks that include an Object Relational Mapper (ORM), make it really easy not to worry about your database. That is, until your application slows way down and you are left scratching your head trying to figure out why. Recently, I was tasked with optimizing our database code in order to speed up the controller views.

The truth about databases is that you can't beat them at their own game. Doing so would be like making an FPS for consoles and releasing it the same day as Modern Warfare 2. A database stores data in a manner that can be easily and efficiently queried. ActiveRecord makes it easy to write code without worrying about the database underneath, but this can create an issue: if you never know what the database is doing, how can you know whether you are doing it efficiently? I am of the firm belief (after witnessing Oracle chomp through 1000-line SQL queries like they weren't there) that if the database can do something your code can also do, it is probably best to defer to the database. Pick the best tool for the job.

“If all you have is a hammer, everything looks like a nail” – Abraham Maslow

Let's look at some examples of where a little database magic can improve our lives. Take the following code base:

# == Schema Information
#
# Table name: assignments
#
#  id         :integer         not null, primary key
#  post_id    :integer
#  keyword_id :integer
#  created_at :datetime
#  updated_at :datetime
#

class Assignment < ActiveRecord::Base
  belongs_to :post
  belongs_to :keyword
end

# == Schema Information
#
# Table name: keywords
#
#  id         :integer         not null, primary key
#  name       :string(255)
#  created_at :datetime
#  updated_at :datetime
#

class Keyword < ActiveRecord::Base
  has_many :assignments
end

# == Schema Information
#
# Table name: posts
#
#  id         :integer         not null, primary key
#  title      :string(255)
#  created_at :datetime
#  updated_at :datetime
#

class Post < ActiveRecord::Base
  has_many :assignments
end

I have created a “Post” model to represent a blog post, and I can assign keywords to each post. To do this, I have another model, “Keyword”, which is associated with posts through an “Assignment” join model. The schema comments show the database structure.

Recently, I came across some code that aimed to collect all the keywords of a post, with some special options: the keywords were to be found, their names listed and sorted case-insensitively, and any duplicates excluded. A programmer's mind might gravitate towards a solution such as:

Post.find(1).assignments.map {|x| x.keyword.name }.sort {|a, b| a.upcase <=> b.upcase }.uniq

Let's walk through this bit of code before I discuss better alternatives. To start, we find our Post with an id of 1, so that we can look at just its keywords. Next we map over its assignments to collect the keyword names, and then run a sort on the returned array, uppercasing the strings during comparison so that the sort is case-insensitive. Finally, “uniq” is run to exclude any duplicate keywords. While this is a working solution, it doesn't take advantage of the power and flexibility of the database. A few issues:

  1. This code generates multiple SQL select statements. (N+1)
  2. Sorting can be done through “ORDER” in SQL
  3. Unique records can be generated through DISTINCT

The multiple select statements are an example of the “N+1” problem: one query (the “1”) runs to load the assignments for the post, and then “N” additional queries run, one for each keyword referenced. In total, N + 1 queries are executed. If you have 5 keywords, the result is unoptimized but goes largely unnoticed. You will have 6 select statements:

  Assignment Load (6.3ms)   SELECT * FROM "assignments" WHERE ("assignments".post_id = 1) 
  Keyword Load (0.8ms)   SELECT * FROM "keywords" WHERE ("keywords"."id" = 1) 
  Keyword Load (0.7ms)   SELECT * FROM "keywords" WHERE ("keywords"."id" = 2) 
  Keyword Load (0.7ms)   SELECT * FROM "keywords" WHERE ("keywords"."id" = 3) 
  Keyword Load (0.7ms)   SELECT * FROM "keywords" WHERE ("keywords"."id" = 4) 
  Keyword Load (0.8ms)   SELECT * FROM "keywords" WHERE ("keywords"."id" = 5)

What happens if you have 500 keywords? 5,000,000 keywords? Your performance bottleneck will quickly shift to the database as multiple queries have to be generated, executed, and returned for each page request.

What is the big deal about sorting in your application instead of the database? In order for Rails (ActiveRecord specifically) to “sort” these items, they must first be returned from a generated select statement as elements of an array, and then the array is sorted by calling Array's “sort” method. Further, this happens in Ruby, a dynamic language that is orders of magnitude slower at this than the database (which is most likely written in C). Simply put, the database is made for this kind of work, and outsourcing it to Ruby is just not efficient.

Finally, why not make records unique with Ruby's “uniq”? Again, it has to do with meddling in database territory. It has the problems inherent to the sorting case above, plus one more. Let's say that you return 500 records – and 499 of them are duplicates. Before the “uniq” method is run, Ruby is holding in memory references to the attributes of 500 ActiveRecord instances. When you call “uniq”, it can drop them. Why go through this memory spike and plummet, when the database can be instructed to return just one record? Memory is a valuable resource in a server environment – be frugal with its consumption.

So, let's refactor our code to address these issues. Starting with the multiple SQL statements, we can combine them into one by joining across tables. I would encourage you to run “script/dbconsole” and get familiar with SQL. Let me demonstrate what we will be building the SQL way, before implementing it the “Rails” way.

SELECT a.* 
FROM   keywords a, 
       assignments b 
WHERE  a.id = b.keyword_id 
       AND b.post_id = 1; 

Another variant is to join the tables using explicit “JOIN” syntax. An inner join is the same type of join we replicated in the WHERE clause above. We can write the same SQL as follows:

SELECT a.* 
FROM   keywords a 
       INNER JOIN assignments b 
         ON a.id = b.keyword_id 
WHERE  (( b.post_id = 1 )) 

We could specify this join by hand using ActiveRecord's “find_by_sql”. A minimal sketch of that approach might look like the following (the bind value 1 is our post's id):
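Keyword.find_by_sql(["SELECT keywords.* FROM keywords INNER JOIN assignments ON keywords.id = assignments.keyword_id WHERE assignments.post_id = ?", 1])

However, this case is trivial enough that ActiveRecord can build the join for us using “has_many :through”. I can add the following to my models: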

# == Schema Information
#
# Table name: keywords
#
#  id         :integer         not null, primary key
#  name       :string(255)
#  created_at :datetime
#  updated_at :datetime
#

class Keyword < ActiveRecord::Base
  has_many :assignments
  has_many :posts, :through => :assignments
end

# == Schema Information
#
# Table name: posts
#
#  id         :integer         not null, primary key
#  title      :string(255)
#  created_at :datetime
#  updated_at :datetime
#

class Post < ActiveRecord::Base
  has_many :assignments
  has_many :keywords, :through => :assignments
end

Now, I can gather all the keywords for a post by executing the following:

Post.find(1).keywords

Next, let's address the sorting issue by specifying an ORDER clause, and tackle a related problem at the same time. Recall that we want the sort to be case-insensitive. If I just order on the raw column, then “Pabst” would beat “guinness” simply because of the capitalization (and we all know Guinness beer is better). The easy solution is to apply “UPPER” so the casing is the same when the comparison is made. This also saves computation on the Rails side by eliminating the string conversions in our array sort. In SQL, we could append the following to our SELECT statement:

SELECT a.* 
FROM   keywords a 
       INNER JOIN assignments b 
         ON a.id = b.keyword_id 
WHERE  (( b.post_id = 1 )) 
ORDER BY UPPER(a.name)

The “Rails” way is to include this on the association as follows. (Note that the table aliases Rails generates are not as reliable as spelling out the table name itself, so in this case I have written “keywords.name”.)

# app/models/post.rb
...
has_many :keywords, :through => :assignments, :order => 'UPPER(keywords.name)'
...

Finally, let's address the uniqueness problem. If I have duplicate keywords, I can return only the unique keywords by using the SQL DISTINCT modifier. In SQL, this would look like:

SELECT DISTINCT(a.name) 
FROM   keywords a 
       INNER JOIN assignments b 
         ON a.id = b.keyword_id 
WHERE  (( b.post_id = 1 )) 
ORDER BY UPPER(a.name)

In Rails, we can modify the SELECT clause by passing the :select key to ActiveRecord's “find” and “all” methods. This has another benefit, depending on the application: for each column in each record, ActiveRecord has to store information in memory, so selecting only the columns we actually want reduces the memory footprint. It could look something like this:

Post.find(1).keywords.all(:select => 'DISTINCT(keywords.name)')
# 0.16 seconds to complete w/ 200 keywords - that's 3 times faster!

In summary, we have cut the number of SQL select statements, moved computationally expensive sorting and uniqueness work out of Ruby, and managed it all without any fancy tricks. A sharp developer may point out that embedding SQL functions is bad form because it isn't database-agnostic. The truth is that most databases conform to a base set of ANSI SQL standards, and DISTINCT and UPPER are accepted almost across the board.

A little database magic can make a crawling Rails action snappy again. I am a firm believer that Rails, like any framework, should not be an excuse to stop caring about your SQL. Databases are the heart of most applications, and usually one of the first performance bottlenecks. Good SQL can be married to good Ruby code, and happiness will ensue. I hope this post was informative for Rails folks who want to get started with SQL optimization.