Inserting Large Data Sets in MySQL

Its always interesting for me to work with large data sets. The solutions that work in lower orders of magnitude don’t always scale, and I am left with unusable solutions in production. Often the problems require clever refactoring that at a cursory glance appear identical, but somehow skirt around some expensive operation.

I had a requirement to tune a script that was responsible for inserting 300k records in a database table. The implemented solution of iterating through a collection and calling ‘INSERT’ was not scaling very well and the operation was taking long enough to time out in some runs. This gave me the opportunity to learn about a few things in MySQL including the profiler, and (spoiler!) the INSERT multiple records syntax.

I needed some real numbers to compare the changes I would be making. My plan was to change one thing at a time and run a benchmark to tell if the performance was 1) better 2) worse, or 3) not impacted. MySQL has an easy to use profiler for getting this information. Inside of the MySQL CLI, you can issue the command:

SET profiling=1;

Any subsequent queries you run will now be profiled. You can see a listing of queries you want to know more about by typing:

SHOW profiles;

This command will show an index of queries that have run, along with their associated Query_ID. To view more information about a particular query, issue the following command replacing x with the Query_ID:

SHOW profile FOR QUERY x

Here is an example output:

+------------------------------+----------+
| Status                       | Duration |
+------------------------------+----------+
| starting                     | 0.000094 |
| checking permissions         | 0.000003 |
| checking permissions         | 0.000002 |
| checking permissions         | 0.000001 |
| checking permissions         | 0.000003 |
| Opening tables               | 0.000021 |
| System lock                  | 0.000008 |
| init                         | 0.000039 |
| optimizing                   | 0.000012 |
| statistics                   | 0.000717 |
| preparing                    | 0.000014 |
| Creating tmp table           | 0.000023 |
| executing                    | 0.000002 |
| Copying to tmp table         | 0.016192 |
| converting HEAP to MyISAM    | 0.026860 |
| Copying to tmp table on disk | 2.491668 |
| Sorting result               | 0.269554 |
| Sending data                 | 0.001139 |
| end                          | 0.000003 |
| removing tmp table           | 0.066401 |
| end                          | 0.000009 |
| query end                    | 0.000005 |
| closing tables               | 0.000011 |
| freeing items                | 0.000040 |
| logging slow query           | 0.000002 |
| cleaning up                  | 0.000015 |
+------------------------------+----------+

In one iteration of my SQL query, I was spending an excessive amount of time “copying to tmp table”. After reading the article http://www.dbtuna.com/article/55/Copying_to_tmp_table_-_MySQL_thread_states, I was able to isolate the cause of this to an ORDER clause in my query that wasn’t strictly necessary. In this example, Not too much exciting is going on, which is a Good Thing.

For a comprehensive listing of thread states listed in the Status column, view: http://dev.mysql.com/doc/refman/5.0/en/general-thread-states.html

Now that I know my query is as optimized as it can be, its time to pull out the bigger guns. On to plan B – consolidating those INSERT statements!

An INSERT statement, though executing seemingly instantaneously under small loads is comprised of many smaller operations, each with its own cost. The expense of these operations is roughly the following: (http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html)

  • Connecting: (3)
  • Sending query to server: (2)
  • Parsing query: (2)
  • Inserting row: (1 × size of row)
  • Inserting indexes: (1 × number of indexes)
  • Closing: (1)

As you can see, connecting to the server, sending the query, and parsing are relatively expensive operations. In the script I was modifying, 300k INSERT statements were generating 300k records. Fortunately for us, MySQL doesn’t force our records to be 1:1 with our INSERT statements thanks to allowing multiple insertions per INSERT. To use this feature instead of having 3 INSERT statements:

INSERT INTO foo (col1, col2) VALUES (1, 1);
INSERT INTO foo (col1, col2) VALUES (2, 2);
INSERT INTO foo (col1, col2) VALUES (3, 3);

We can instead coalesce them into a single INSERT statement

INSERT INTO foo (col1, col2) VALUES (1, 1), (2, 2), (3, 3);

How many values can we coalesce into the same INSERT statement? This isn’t driven by a max number of records, but rather a server system variable sysvar_bulk_insert_buffer_size: http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_bulk_insert_buffer_size This can be modified, but the default is 8388608 bytes. The exact number of records will vary depending on the number of columns, and the amount of data being inserted into those columns. I conservatively chose to coalesce 5k records at a time. I tried to bump this to 10k, but I encountered an exception when I exceeded this server system variable maximum.

With my INSERTS coalesced, I was able to reduce my total number of INSERT statements to 60 (300k / 5k). This yielded massive performance boosts. I was able to take the query from over an hour to run to completing in just 2 minutes. Quite a nice trick, considering the data is unchanged.

Is there room for improvement? Absolutely. A statement executed 60 times may be worth preparing, or wrapping inside of a transactional block. My real world tests didn’t yield a significant enough performance boost to make these complexities worth implementing. This may not be true with data in higher orders of magnitude, or different schema layouts. MySQL also understands INDEX hints, which allow you to suggest INDEXES that may be missed by the query planner, or force the inclusion or exclusion of beneficial, or detrimental INDEXES despite what the query planner thinks! (http://dev.mysql.com/doc/refman/5.0/en/index-hints.html)

Speaking of INDEX, if any are using UNIQUE, BTREE type, these can be dropped while the mass INSERT is running, then added back later to side-step the 1n INDEX operational hit.

In the next order of magnitude, I will probably have to rethink my approach of using INSERT statements to load data. According to the MySQL documentation, LOAD DATA INFILE is “roughly 20 times faster” than a MySQL INSERT statement. My script would no longer generate statements, but rather output to a file in a comma delimited format. This could then be loaded assuming appropriate permissions are in place.

Happy profiling!

Advertisement

OS X and Project Wide Find and Replace

Have you ever needed to perform a find and replace operation that spans across numerous files in your project? Doing this operation file by file can be tedious. Searching for an answer to this issue can be frustrating due to the involvement of many different Unix tools, each with their own flags. Some of these flags are incompatible between Linux and OS X, so it can quickly devolve into plugging in a solution, and moving on when it fails.

https://coderwall.com/p/7ol_ja

Upgrading Ruby with Rbenv+Homebrew

Heroku has defaulted to Ruby 2.0 for all applications, so its probably time you updated that crufty old version you have been running. Unfortunately the process is less than straightforward, especially when using a version manager. Assuming you are running rbenv with a Homebrew version of ruby-build, this guide will get you running the latest version of Ruby:

To begin, check which versions of Ruby Rbenv knows about. Rbenv delegates this work to ruby-build:

rbenv install --list

Best case scenario you have a recent version of ruby-build and you see the version of Ruby you want in this listing. At the time of this writing, version 2.0.0-p247 is the most current. If your desired version is present, skip the following steps and just install the Ruby version via:

rbenv install 2.0.0-p247

If your version is not present in the list, you will need to upgrade ruby-build so rbenv knows about the more recent versions of Ruby. Assuming you installed ruby-build via Homebrew, you can update it by issuing:

sudo brew update

Issuing this command complained about having untracked files within the Homebrew directory (which is actually just a git clone of the homebrew project). This may not be the correct way to fix this problem, but I issued the following command to stash these untracked files, and uncommited changes so they don’t interfere with the upgrade process:

cd /usr/local && git add . && git stash

This should now be a clean directory, and you can issue the brew update command again.

Now that brew is updated, you should have the latest “formula” for ruby-build. You can then issue the command to update ruby-build itself:

brew upgrade ruby-build

Once this completes, we can list versions of Ruby via rbenv again to ensure our desired Ruby version is now in the list. Once you see this, you can issue the following command to install a known version of Ruby:

rbenv install 2.0.0-p247

To use your shiny new version of Ruby, you can set this to be the default version:

rbenv global 2.0.0-p247

You can also set this per project, or by setting an environmental variable to override, so don’t worry if not all your projects are Ruby 2.0 ready. You can easily switch between versions – and that’s the point of version management right?

You can confirm you are running the latest Ruby version by issuing:

ruby -v

Note that you will need to re-bundle any gems from your Gemfile against the new Ruby version, as well as rehash any rbenv shims for these gem executables:

gem install bundler && bundle install && rbenv rehash

For more information, check out the rbenv, and ruby-build documentation. To discover the latest stable version of Ruby you can peek at the official Ruby download page to find out the latest version and patch number. Finally, check out the Homebrew docs if you are still stuck.

Making Rails Routes Fantastic

I took a week off from work to move into our new house. It was a time of rest, and relaxation despite the chaos around what moving can bring. I’ve had a few personal projects on the back burner, but never seemed to have the time or the energy to make much progress. Recently a talk with a friend reminded me how important completing those pet projects can be for your personal happiness. I’m proud to present the completion of an idea I’ve had a for a while: Routastic.

What is Routastic? It serves as an interactive Rails routes editor. Simply, I got tired of the pattern of modifying config/routes.rb, then running rake routes and grepping for some result. This is completely inefficient. My inspiration came from the beautiful Rubular.com and its interactive regular expression building. Its quick. Its painless. Its a valuable tool for everyday programming.

Please check out http://routastic.herokuapp.com/ and let me know how I can improve it.

Special thanks to Avand Amiri for suggesting the name (despite the name screaming Web 2.0, it actually is quite memorable!)

Unthinking Fullscreen Mode

Apple’s fullscreen windows are driving me nuts! We were co-existing until I “lost” a very important terminal session. Turns out I didn’t lose it, but it was in fullscreen mode. This omitted the window from cycling via Command+~, as well as being absent from the app-specific Expose (though it was present in Mission Control – something I seldom use.)

I hope they change this behavior soon. Here are some observed pain points:

  • The two second animation when switching from your desktop to Fullscreen applications via Control+left|right arrow is not only unnecessary, but frustrating. This could be much faster, and without the easing decelerating at the end of the animation. Same holds true for Command+Tab (though the animation is mercifully quicker)
  • The hotkey for taking an application into Fullscreen mode should be dictated by the window manager, not left up to the developers. Why not F11? Currently its Control+Command+F in Google Chrome, Command+Enter in iTerm, etc.
  • Why are these windows omitted from the task switcher? These programs are still running, so they should be reachable by the switcher. Is there another switcher hotkey to toggle between windows of the same application? Command+~ doesn’t work if the other window is in fullscreen mode.
  • Command+Tabbing between applications is problematic if you have multiple windows open of the same application when one is in fullscreen mode. From my fullscreen window, I switch to another application, then switching back defaults to the non-fullscreen window instance on the desktop instead of toggling back to where I was.
  • Why are they moved into their own workspace? Just remove the window chrome and leave it where it is. I was working in a fullscreen window, and I needed to use the calculator program. This created absolute insanity switching between applications due to the unnecessary animation each time. I ended up having to take the application out of fullscreen mode just to use the calculator in an efficient manner.
  • Why is there no transparency available to fullscreen applications? I like a transparency to my black backgrounded terminal. No such luck if you also use fullscreen mode.

So here is my suggestion: Get rid of the butt ugly arrows for fullscreen in the window chrome, and make that useless green button in the upper left hand (you know the one that sometimes makes windows resize _smaller_) and make that the fullscreen button. If it isn’t broken, don’t fix it.

In the interm, I’ve almost abandoned Apple’s fullscreen mode in favor of Cinch for positioning fullscreen windows. It works like the Windows 7 window manager (and Ubuntu) and allows applications to fullscreen by dragging the chrome to the top of the window.

Kanzen in the Science of Product Packaging

Unpacking the Canon MF4770n laserjet printer is a sight to behold. The machine is matte black, and the safety cards are bright orange. But its more than just removing tape and packing. The cards are elaborately and thoughtfully installed, forcing you to step through the installation process in order to remove them.

For example, take the installation of the toner cartridge. Trying to remove the safety card without opening the top of the laserprinter is a futile effort. After you disengage the lock and open the top the card continues and wraps around the toner cartridge. Pulling now gets you further. It releases the cartridge which has its own safety components. You are shown that you should shake it then pull the activation strip to expose the toner to the heating element. It only slides back in only one way. After the safety card finally allows you to disengage it, you see that it is about two feet in length.

This ingenuity continues with the paper holding tray. It is configured for shipping and storage and must be reconfigured to hold paper and operate. Rather than tedious instructions you are presented with another safety card. Pulling the tab this time unlocks, unfolds and drops the printing tray into position. It accomplishes mechanically what an instruction booklet may have taken three or four diagrams to illustrate.

The entire time I was setting up the printer I felt like I was working with an extremely polished product. I didn’t even crack the instructions (rarely do I), but this time there was no room for doubt. It just goes to show that the OOBE an have just as much of an impact on a consumer as the product itself. Installation is a time when anxiety (or frustration) is high, and knowledge is low. This is the perfect time to comfort your customer and generate repeat business. The packing and installation really sets the stage for what is to come. I’m looking at you IKEA!

The Regular Expression Behind Currency Formatting

I came across an interesting requirement today to format a floating number as currency. In the US and many other countries, currency greater than or equal to 1000 units ($1,000.00) is typically broken up into triads via a comma delimiter. My initial reaction was to use a recursive function, but I glanced at the Rails’ ActiveSupport gem to see how their number_with_delimiter method performed this same task. I was surprised to learn that this was handled via a clever regular expression.

View at https://coderwall.com/p/uccfpq

The California of Texas

Austin is a counter-culture in the heart of Texas. I’m not sure what I expected, but vegan friendly restaurants, health food stores, hundreds of miles of bike paths, a city wide ban on plastic bags, and a highly affluent populous was not what I was expecting. So beings my journey.

The Sights

We drove thirteen hours overnight to accomodate our newest passenger, Morrigan and her sleep schedule. As bad as driving all night was, at least we didn’t end up with a screaming toddler in the backseat. We visited some of Kristin’s kin in Houston, which served as our base for the weekend before pushing further West to Austin.

We negotiated a great monthly rate on a condo we found on Airbnb off South Lamar on the South side of the city. (Affiliate link) I’ve been really impressed from the moment we pulled into our new home away from home. The unit is small, but renovated, and well furnished. What we lost in space, we more than made up for in location. Right up the street is a lifetime supply of changing restaurants, and only a 10 minute drive from Zilker park.

Our first weekend we visited a local farmer’s market and bought some Texas grown produce. I never thought I would do what we did next – attend a vegan festival. Sure enough. In the land of BBQ, thousands of health nuts live in this city. Despite being vegan the food samples they had were all delicious. If I hadn’t known, I wouldn’t have tasted any difference. Vegans have masterfully replicated an omnivore’s menu.

The next week brought with it a trip to Fredericksburg, another hour West of Austin (but still not halfway through the state!). It is a German town, and had plenty of beers, wursts, and gut busting treats. The main strip is really quite a marvel, as it goes for over a mile and encompases lots of original architecture. Sunday we visited the Austin Zoo which put a spin on the traditional approach, and opted for a non-profit animal rescue shelter and animal sanctuary. Morrigan loved spotting all the critters, but the Texas heat won out, even in April. I was never so glad to pay $2 for a bottle of Coke.

This week I rented a bike and road around the Lady Bird Johnson trail which follows the river that bisects the city into North and South. It is wonderfully engineered, running past coffeeshops, restaurants, playgrounds, fishing areas, boat ramps, and dog parks. I wish Atlanta had something like this to offer. Maybe when the greenbelt is completed? (And the bikeways protected with police?!)

We have plans to go to San Antonio next weekend to visit Fiesta de los Niños, and plan on visiting my cousins who are lucky enough to live here. Following that will be our last week before the long migration home. We will be spending the last weekend back in Houston where we will visit Kristin’s folks for her mom’s birthday celebration.

Remote Working

Working remotely during this time has been interesting. I’ve learned an important lesson: don’t go anywhere during the launch date for your product! Too much was up in the air to leave beforehand to stay clear, and I was concerned that afterwards we would be scrambling. Turns out ramping up to the last minute before launch can be just as stressful.

For those of you following along who already, or wish to do remote work, ensure that you rent a place that you can have a quiet place to work. I find that mornings are my most productive time, and I try and guard this time for the important tasks. But with a smaller rental unit, and a one year old, I found thata nice pair of headphones is essential. I’ve taken advantage of ROWE several times during the week when we visited the capitol, and visiting parks, and going out for walks with the stroller. It pulls me out of my comfort zone to not be sitting at my desk at 9am, but then I realized the world doesn’t end. I simply make up the time later. Exercising this perk has really boosted my confidence in working for my company. They genuinely care about results and not office hours.

Also, have a backup plan if (when!) the Internet goes out at your rental unit. The maintenance people decided to test the waterproof claim of the rental’s outdoor wireless AP with a pressure washer. I’ll let you guess as to the winner, but I spent the rest of the day at a coffee shop. Also – scope out your coffee shops. I would love to see a website that lists the rating of different places based on workability. Some places don’t give you the vibe (despite having wireless) that they want you hanging around with that laptop of yours. Other places look promising, but offer no connectivity. Grab a cup of coffee or tea beforehand and scope things out.

A few notes on extended vacations: Don’t binge vacation! You are there for a while, so relax. Buy some groceries; every meal doesn’t have to be an experience. We have been aiming for eating out once every few days. It helps with the finances as well.

Don’t wait until the weekends to go exploring – especially when you have ROWE. The weekdays are the best times because the crowds are small, and the traffic is light. Check out your local attractions. One of the most relaxing things has been for me to join Kristin and Morrigan in the mornings for a walk around the neighborhood.

I wasn’t at home with my millions of distractions, so I wasn’t prepared for a “boredom adjustment”. Once I realized its not a pure vacation, and that I had to work, I struggled to find ways to wind down in the evenings. I ended up finding a used bookstore and stocking up on reading materials. I’ve enjoyed reading again instead of falling asleep in front of the TV, or playing video games until the early hours.

Being in a place for a month is certainly better than the binge vacation, but it is just a taste of what living somewhere would be like. Its just enough time in my opinion to really begin to absorb the culture. I find it thrilling to wonder what each day will bring, as everything is new. Maybe you spot a new restaurant (or food truck if you are in Austin), or drive past a park you want to checkout, or see a bike rental shop. I’m really making up my itinerary as I go. I’m glad I strived to find a job that affords me the opportunity to do this, and I wish that everyone who wants the same can achieve their goals. I can tell you its totally worth it!

Welcome to 2010

I broke down. I compromised my moral integrity. I did what I laughed at others for doing. I bought a tablet, and I couldn’t be happier.

What changed? Did my opinion change? Not drastically. I still don’t see them as the future of computing. They are a consumption device, and it would be difficult to do much more with them than that. But that is what I wanted.

Pricing has also changed drastically. When the iPad first came out, it was a 10″ behemoth, and it costed around $500, putting it well outside of my interests. (A high end laptop could be found starting at ~$800.) However, within the last year, some solid contenders have entered the 7″ ~$200 arena. The nVidia Tegra 3 chipset, quad core processing, and the latest Android experience on sale in the Nexus 7 for $155 shipped was too good for me to pass.

I have long said that I could’t justify a tablet when I have a desktop, two laptops, and a smartphone all within reach. My circumstances changed however during our latest vacation, and I found myself draining the battery on my smartphone daily trying to stay connected. I have discovered a few good areas that a tablet excels over other devices

  • Entertaining when space is limited (car, airplane, bed, etc)
  • Reading eBooks
  • Reading technical posts with code examples
  • Games developed for the touchscreen
  • Quick reference during certain table-top activities…

I find my discovery process similar to getting my first smartphone. I remember a few days after I had my smartphone I had the realization that I could get from my location to any other location without ever missing a turn again. I drove to a retail store to make a purchase and realized that I could mitigate buyer’s remorse by price checking while standing in the store! Information is power, and I had the Internet in my pocket. I could check reviews, prices, availability, stock nearby – all without carefully planning my trip beforehand at home.

While that smartphone does some things really well, it is a small form factor. For anyone that has ever upgraded their monitor to a larger size, or their computer to a faster model, you will know the feeling when you migrate from a smartphone to a tablet.

Despite my fears, I don’t think it will quickly become a device that collects dust. I’ve heavily used my smartphone for close to four years now, and there is no sign that this will change in the near future. The tablet is the extension of the smartphone.

I’m not saying to go blindly buy one – you should still have good reasons, and stick to a budget. But if you find yourself running your battery down on your smartphone from overuse, let me recommend a tablet to you.

And welcome to the year 2010!