
Delayed Job Performance Tuning

We found a bug. A bug that affected a lot of historical records, which we now have the pleasure of reprocessing. Fortunately, we already had an async job infrastructure in place with the delayed_job gem. Unfortunately, this gem is intended for fairly small batches of records and isn’t tuned to handle 1M+ records in the delayed_jobs table.

After reading some best practices, we decided on a queue-based approach. To keep our day-to-day async jobs running, we would use a “default” queue, and to reprocess our old records we created a new queue. Starting the workers up with the “--queue” flag did the trick. We had hardware dedicated to day-to-day operations, and new hardware dedicated to the new queue. Now it was simply a matter of filling the queue with the jobs to reprocess the records.
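
To make that concrete, here is a rough sketch of how the split might look. Record, needs_reprocessing, and ReprocessRecordJob are hypothetical names; the :queue option and the --queue flag are standard delayed_job features:

# Hypothetical enqueue loop: push each affected record onto a dedicated
# queue so the reprocessing work never competes with day-to-day jobs.
Record.where(needs_reprocessing: true).find_each do |record|
  Delayed::Job.enqueue(ReprocessRecordJob.new(record.id), queue: 'new_queue')
end

# Day-to-day hardware keeps draining the default queue:
#   bin/delayed_job --queue=default start
# The new hardware runs workers pinned to the reprocessing queue:
#   bin/delayed_job --queue=new_queue start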

Our initial approach maxed out the CPU on our database server, largely because we had never tuned the SQL in our async jobs. Because the volume we processed had always been low, this was never a noticeable problem. But when we threw lots of new jobs into the queues, it became very noticeable. The workers would start up, then mysteriously die. After some digging in /var/log/kern.log we discovered the workers were being killed by the kernel’s out-of-memory (OOM) killer. Attaching a small swap partition helped, but once you’ve hit swap, things become horribly slow. What’s the point in keeping the worker alive if it’s running a thousand times slower? Clearly we needed to profile and tune. So we did. (The specifics are out of scope for this article, but they involved consolidating N+1 queries and limiting the columns returned by the SELECT.)

With our newly tuned SQL, our spirits were high as we cranked up the workers again, only to hit the next bottleneck. And this is where databases get frustrating. Each time a delayed_job worker looks for work, it runs a query to find out which job to pick up, reserving the record by setting locked_at and locked_by. The query looks like this:

UPDATE delayed_jobs
SET `delayed_jobs`.`locked_at` = '2016-06-05 11:48:28',
    `delayed_jobs`.`locked_by` = 'delayed_job.2 host:ip-10-203-174-216 pid:3226'
WHERE ((run_at <= '2016-06-05 11:48:28'
        AND (locked_at IS NULL OR locked_at < '2016-06-05 07:48:28')
        OR locked_by = 'delayed_job.2 host:ip-10-203-174-216 pid:3226')
       AND failed_at IS NULL)
ORDER BY priority ASC, run_at ASC
LIMIT 1;

The UPDATE does an ORDER BY, which results in a filesort. Filesorts are typically something an index can resolve, so I optimistically added the following:

CREATE INDEX delayed_job_priority
ON delayed_jobs(priority,run_at);

Sadly, this index was completely ignored when I ran an EXPLAIN on my UPDATE. The reason is that MySQL doesn’t execute an UPDATE query the same way it would a SELECT with the same conditions. The index probably made things worse: with each record update we now also incur an index update. I could fork the code and use some type of isolation level within a transaction to get the best of both worlds: an index-based SELECT, then a quick UPDATE on a single record by id. But there are easier solutions to try first.
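
For the curious, the shape of that hypothetical two-step reservation might look something like the following. This is only a sketch of the idea; delayed_job does not do this out of the box:

-- Hypothetical two-step reservation: an index-friendly SELECT picks the
-- job and locks just that row, then a primary-key UPDATE reserves it.
BEGIN;

SELECT id FROM delayed_jobs
WHERE run_at <= NOW()
  AND (locked_at IS NULL OR locked_at < NOW() - INTERVAL 4 HOUR)
  AND failed_at IS NULL
ORDER BY priority ASC, run_at ASC
LIMIT 1
FOR UPDATE;

-- Supposing the SELECT returned id = 42:
UPDATE delayed_jobs
SET locked_at = NOW(),
    locked_by = 'delayed_job.2 host:ip-10-203-174-216 pid:3226'
WHERE id = 42;

COMMIT;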

My UPDATE statements were pushing 40 seconds in some cases according to MySQL. Eventually the lock wait timeout is exceeded and you see an error in the delayed_jobs.log:

Error while reserving job: Mysql2::Error: Lock wait timeout exceeded; 
try restarting transaction

Jobs were moving very slowly, and throwing more workers at it didn’t help. This is because each time a worker picked up a job, it waited 40+ seconds. The UPDATE was doing a filesort, and any index was being ignored (and MySQL doesn’t support UPDATE hints). It was pretty clear that all of the jobs in the reprocessing queue needed to find a new home, one that didn’t feed my filesort. I settled on the following solution:

CREATE TABLE delayed_jobs_backup LIKE delayed_jobs;

INSERT INTO delayed_jobs_backup
SELECT * FROM delayed_jobs WHERE queue='new_queue';

DELETE FROM delayed_jobs WHERE queue='new_queue';

This creates a new table with the structure of the existing delayed_jobs table, populates it with the jobs that needed a new home (all 1M+ of them), and finally deletes those jobs from the original delayed_jobs table. Be careful doing this, and run some SELECT/EXPLAIN queries in between to ensure you are doing what you think you are doing. (Deleting 1M+ records from a production database makes me sit up in my chair a bit.)
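
For example, a couple of cheap sanity checks before and after the move:

-- These two counts should match before the DELETE is run:
SELECT COUNT(*) FROM delayed_jobs WHERE queue = 'new_queue';
SELECT COUNT(*) FROM delayed_jobs_backup;

-- And verify the DELETE's WHERE clause touches only the intended rows:
EXPLAIN SELECT * FROM delayed_jobs WHERE queue = 'new_queue';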

Looking at MySQL’s process list, my UPDATE statements no longer sit in the System lock state for long (presumably because the table is now small enough that the filesort is mostly painless):

mysql> SHOW FULL PROCESSLIST;
# Id, User, Host, db, Command, Time, State, Info
1, user, 0.0.0.0:34346, localhost, Query, 0, System lock, UPDATE ...

The important columns here are Time (in seconds), State, and Info. This proves that job locking was happening quickly; I was seeing Time values of 40+ seconds before. I kept referring back to this process list to verify that the UPDATEs remained snappy while I varied the number of workers running and the number of jobs in the queue. My goal was to keep the UPDATE system lock times under 2 seconds. Adding more workers pushed the times up. Adding more jobs to the queue pushed the times up. It’s a balance that probably depends very much on what you are processing, how much your database can handle, and what the memory constraints are on your worker servers.

To conclude: my job over the next few days will be to run the following statements to move jobs back into the delayed_jobs table 10,000 at a time:

INSERT INTO delayed_jobs
SELECT * FROM delayed_jobs_backup ORDER BY id LIMIT 10000;

-- The matching ORDER BY ensures the same 10,000 rows are removed:
DELETE FROM delayed_jobs_backup ORDER BY id LIMIT 10000;

You can of course automate this, but my objective was met. The jobs can reprocess old records without impacting day-to-day jobs in the default queue. When the delayed_jobs table is almost empty, I move over another batch of jobs from the delayed_jobs_backup table. Rinse and repeat until there are no more jobs left to process. It’s a bit more manual, but day-to-day operations continue to function, and I can cross the reprocessing task off my list of things to do. All without any code changes!
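
If you do want to automate the batching, a rough sketch might look like the following. It assumes Rails with ActiveRecord; the batch size, drain threshold, and sleep interval are guesses to be tuned against your own lock times:

# Hypothetical top-up loop: move a batch over from the backup table
# whenever the reprocessing queue has nearly drained.
conn = ActiveRecord::Base.connection

loop do
  break if conn.select_value("SELECT COUNT(*) FROM delayed_jobs_backup").to_i.zero?

  pending = conn.select_value(
    "SELECT COUNT(*) FROM delayed_jobs WHERE queue = 'new_queue' AND failed_at IS NULL"
  ).to_i

  if pending < 1_000
    conn.transaction do
      conn.execute("INSERT INTO delayed_jobs SELECT * FROM delayed_jobs_backup ORDER BY id LIMIT 10000")
      conn.execute("DELETE FROM delayed_jobs_backup ORDER BY id LIMIT 10000")
    end
  end

  sleep 60
end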

I’ve been reading up on transaction isolation levels, thinking something like a SELECT FOR UPDATE lock might be a worthy contribution to the delayed_job codebase: http://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html


Testing and External Calls

Why Are External Calls Bad?

When writing unit tests, the principle is to test each component in isolation from other components. If your tests fail because of concerns outside of your subject, that is unintentional coupling. A call to an external network resource (S3, Pusher, Facebook, Google, etc.) also introduces brittleness and will slow down your test. Network requests are very expensive compared to memory and local disk access, and they quickly add up to a slow-running test suite.

VCR

Add the VCR gem to the test group of your Rails project’s Gemfile. It is responsible for recording request/response traffic (complete with headers, params, etc.) and serializing it as a test fixture. That fixture can then be replayed by the test instead of making an external network call.
VCR documentation: https://relishapp.com/vcr/vcr/docs

Webmock

The Webmock gem disables external network calls in tests across a number of avenues, including Net::HTTP, Curl::Easy, and more. If you can think it up, it’s probably blocked. This is intentional, for the reasons listed earlier; exceptions should not be made in tests.
Webmock documentation: https://github.com/bblimke/webmock
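
Getting both gems in place is a two-line change to the Gemfile; a minimal sketch:

# Gemfile
group :test do
  gem 'vcr'      # records and replays HTTP interactions as fixtures
  gem 'webmock'  # blocks real HTTP so nothing slips through
end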

How to Write a Test Using VCR

Take, for example, an Image class. Pretend it is responsible (in part) for uploading a file to Amazon’s S3 service; that upload is an external network call. Currently, creating an image in a test will fail with an exception similar to the following:

it 'uploads an image' do
  FactoryGirl.create(:image) # this fails
end
VCR::Errors::UnhandledHTTPRequestError:
================================================================================
An HTTP request has been made that VCR does not know how to handle:
  PUT https://bwtesting.s3.amazonaws.com/uploads/image/file/1/logo.png

The error explains that there is currently no cassette to play. We can create a cassette to be used going forward by temporarily enabling external network requests in spec/vcr_helper.rb:

# spec/vcr_helper.rb
require 'vcr'

VCR.configure do |config|
  config.cassette_library_dir = "spec/fixtures"
  config.hook_into :webmock
  config.allow_http_connections_when_no_cassette = true
  config.configure_rspec_metadata!
end

The line doing the heavy lifting here is

config.allow_http_connections_when_no_cassette = true

Now instruct your test to use VCR by adding an RSpec metadata tag of :vcr:

it 'uploads an image', :vcr do
  FactoryGirl.create(:image)
end

Now you can run your test again with a passing result:
1 example, 0 failures, 1 passed

You will also note a new file has been created under spec/fixtures. The specific path and file name are based on the location of your test. (Note that if you rename your test, you must rename the fixture accordingly.) Look at the fixture and you will see something like:

---
http_interactions:
...
- request:
    method: put
    uri: https://bwtesting.s3.amazonaws.com/uploads/image/file/1/logo.png
...

One fixture can contain multiple network call representations.
If we comment that line back out in spec/vcr_helper.rb, we should now be able to run our spec without a network connection:

# spec/vcr_helper.rb
require 'vcr'

VCR.configure do |config|
  config.cassette_library_dir = "spec/fixtures"
  config.hook_into :webmock
  # config.allow_http_connections_when_no_cassette = true
  config.configure_rspec_metadata!
end

Matching a Cassette

How does VCR know which cassette recording to use when mocking an external network request? The default behavior is to match on method (GET, POST, PUT, etc.) and URI (the full URL of the resource).
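
These defaults can also be made explicit (or changed suite-wide) in the VCR configuration; a minimal sketch building on the spec/vcr_helper.rb from earlier:

# spec/vcr_helper.rb
VCR.configure do |config|
  # This is already VCR's default; shown here to make it explicit.
  config.default_cassette_options = { match_requests_on: [:method, :uri] }
end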

You might have guessed that this can cause an issue with RESTful resources, or anything with a unique identifier in the URL. Take this example:

it 'uploads an image', :vcr do
  FactoryGirl.create(:image)
  FactoryGirl.create(:image)
end

We create two images. This will now cause our test to fail with an error from VCR:

VCR::Errors::UnhandledHTTPRequestError:
================================================================================
An HTTP request has been made that VCR does not know how to handle:
  PUT https://bwtesting.s3.amazonaws.com/uploads/image/file/2/logo.png
VCR is currently using the following cassette:
  - spec/fixtures/Image/uploads_an_image.yml
  - :record => :once
  - :match_requests_on => [:method, :uri]

This failed because the URI we previously recorded was for
https://bwtesting.s3.amazonaws.com/uploads/image/file/1/logo.png
and now we have the URI
https://bwtesting.s3.amazonaws.com/uploads/image/file/2/logo.png

Changing Matching behaviors

VCR fortunately provides flexibility in how to match an HTTP request. As part of the metadata, you can specify what to match on. For our test, we don’t really care about the specific URL, just that we simulate a good response from Amazon. This means it’s sufficient to match on method and host. In your test, you can specify this like so:

it 'uploads an image' do
  VCR.use_cassette('Image/uploads_an_image', match_requests_on: [:method, :host]) do
    FactoryGirl.create(:image)
    FactoryGirl.create(:image)
  end
end

Note that :vcr is gone from our RSpec metadata. We have wrapped the code responsible for issuing the external network call in an explicit VCR.use_cassette block that allows us to match on different criteria of the HTTP request. This test should now pass.

Custom Matching

Maybe the built-in match_requests_on attributes aren’t fine-grained enough for your test. In that case, you can write your own matcher. See the example below for how to match on part of the path of a URI:

it 'uploads an image' do
  VCR.use_cassette('Image/uploads_an_image', match_requests_on: [:method, :host, method(:s3_matcher)]) do
    FactoryGirl.create(:image)
    FactoryGirl.create(:image)
  end
end

def s3_matcher(request_1, request_2)
  # Both the live request and the recorded one must hit the S3 upload
  # path (with any numeric id) to count as a match.
  pattern = %r{/uploads/image/file/\d+}
  !!(URI(request_1.uri).path[pattern] && URI(request_2.uri).path[pattern])
end

I’ve added a new method to the array of values passed to match_requests_on. The method takes two arguments: VCR compares the current request against each request in the fixture until it finds a match. The return value must be a boolean. If it’s true, the request matches that fixture entry; if false, VCR tries the next recorded request until it runs out of options and fails.
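
If a matcher is useful across many specs, VCR also lets you register it once under a symbol via register_request_matcher; a sketch, using a hypothetical :s3_upload_path name:

# spec/vcr_helper.rb
VCR.configure do |config|
  config.register_request_matcher :s3_upload_path do |request_1, request_2|
    # Match when both requests target the S3 upload path, whatever the id.
    pattern = %r{/uploads/image/file/\d+}
    !!(URI(request_1.uri).path[pattern] && URI(request_2.uri).path[pattern])
  end
end

Any cassette can then opt in with match_requests_on: [:method, :host, :s3_upload_path].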


Implementing a Configuration

Have you ever seen a configuration in Ruby that yields to a block where properties are set on the yielded object? Rails does this with its environment files:

Your::Application.configure do
  config.cache_classes                     = true
  config.consider_all_requests_local       = false
  config.action_controller.perform_caching = true
  ...
end

Devise is another great example of this pattern:

Devise.setup do |config|
  config.secret_key = '38c8e4958385982971f'
  config.mailer_sender = "noreply@email.com"
  config.mailer = "AuthenticationMailer"
  ...
end

This is an established way to pass in configuration options in the Ruby world. How does this work? What are the alternatives?

Under the Hood

Let’s start with our own implementation from scratch. We know that we are calling a method (Application.configure, or Devise.setup) and that the method yields to a block:

module Example
  class << self
    def configure
      yield Config
    end
  end
end

You can see that this will yield the class Config when we call Example.configure and pass in a block. Config will need to define our properties:

module Example
  class Config
    class << self
      attr_accessor :api_url, :consumer_key, :consumer_secret
    end
  end
end

We can now call our configure method and pass in a block:

Example.configure do |config|
  config.api_url = "http://api.com/v1/"
end

If we try to set a property that does not exist in our Example::Config class, then we get a helpful NoMethodError:

Example.configure do |config|
  config.not_a_real_config_setting = "http://api.com/v1/"
end

# => NoMethodError: undefined method `not_a_real_config_setting=' for Example::Config:Class

This helps to define your configuration interface. Options are explicitly declared. They cannot be mistyped, or dynamically added without an explicit runtime error.

In practice, the definitions of Example and Example::Config would live in a gem. The Example.configure invocation would live within the code that wants to configure the properties of that gem: credentials, URLs, etc. This separation of concerns makes sense. A gem should not know anything about the business logic of your application; it should be configured with options passed in from the project. The project, in turn, should only be aware of the gem’s public interface. We are agreeing on where these pieces of information cross from one domain into another. The project doesn’t know where the information will be used, and the gem doesn’t know what the information is until we pass it. So far so good!

Using this Configuration

Now that we’ve passed our information inside the configuration block, we can reference these class level (static) properties in our gem:

module Example
  class Request
    def self.get(path)
      uri = URI.parse(Config.api_url + path)
      request = Net::HTTP::Get.new(uri)
      request['Authorization'] = auth_header(:get, uri)
      Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
        http.request(request)
      end
    end

    def self.auth_header(request_type, uri)
      SimpleOAuth::Header.new(request_type, uri.to_s, {},
                              consumer_key: Config.consumer_key,
                              consumer_secret: Config.consumer_secret).to_s
    end
    private_class_method :auth_header
  end
end

This does a simple GET request, attaching SimpleOAuth headers. Inside the get method we call Config.api_url to know where the API lives; this was set earlier through the configure block. The SimpleOAuth credentials are likewise pulled from Config. You would invoke it like so:

Example.configure do |config|
  config.api_url = "http://api.com/v1/"
  config.consumer_key = "1895-1192-1234"
  config.consumer_secret = '76asdfh3heasd8f6akj3hea0a9s76df'
end

Example::Request.get('/products') # => "{products: [product1, product2, etc]...}"
Example::Request.get('/users')    # => "{users: [user1, user2, etc]...}"

Example::Config becomes the holding location for your configuration information, and by making the properties static, you don’t have to worry about passing around a reference to an instance.
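
A common variation on this pattern yields a memoized instance rather than the class itself; the caller’s interface is identical, but the state lives on an object you can reset or swap out in tests. A minimal sketch:

module Example
  class Config
    attr_accessor :api_url, :consumer_key, :consumer_secret
  end

  class << self
    # Lazily build a single shared Config instance.
    def config
      @config ||= Config.new
    end

    def configure
      yield config
    end
  end
end

Example.configure { |config| config.api_url = "http://api.com/v1/" }
Example.config.api_url # => "http://api.com/v1/"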

Alternatives

If yielding to a block is a little too clever for you, you can always pass the configuration directly as arguments to a class method:

class Example
  class << self
    attr_accessor :api_url, :consumer_key, :consumer_secret

    def config(api_url:, consumer_key:, consumer_secret:)
      self.api_url = api_url
      self.consumer_key = consumer_key
      self.consumer_secret = consumer_secret
    end
  end
end

You call it like so:

Example.config(
  api_url: "http://api.com/v1/",
  consumer_key: "1895-1192-1234",
  consumer_secret: '76asdfh3heasd8f6akj3hea0a9s76df'
)

Example.api_url # => "http://api.com/v1/"

This feels less encapsulated to me. Instead of having an interface for our configuration settings, we are just setting properties directly on a class.

What are your thoughts? What advantages does the block style configure offer over the alternative above?

For more information on the Ruby gem configuration pattern see this excellent blog post:


Agile Over Waterfall

The government’s method of running software turned on a sequential design strategy known as “the waterfall”: a central calendar, the Gantt chart to end all Gantt charts, that promulgated when every task would finish. The government tried running software development as a bureaucratic process, with project managers managing project managers, and the whole thing broke.

The team instead worked in an “agile” way, which favors small, cross-disciplinary teams that stay in close communication with each other while quickly finishing iterative improvements to a product (often software).

Read more: http://www.theatlantic.com/technology/archive/2015/07/the-secret-startup-saved-healthcare-gov-the-worst-website-in-america/397784/


The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. ― Frederick P. Brooks Jr.


The Egg By: Andy Weir

You were on your way home when you died.
It was a car accident. Nothing particularly remarkable, but fatal nonetheless. You left behind a wife and two children. It was a painless death. The EMTs tried their best to save you, but to no avail. Your body was so utterly shattered you were better off, trust me.
And that’s when you met me.
“What… what happened?” You asked. “Where am I?”
“You died,” I said, matter-of-factly. No point in mincing words.
“There was a… a truck and it was skidding…”
“Yup,” I said.
“I… I died?”
“Yup. But don’t feel bad about it. Everyone dies,” I said.

http://www.galactanet.com/oneoff/theegg_mod.html