Ruby, Software

Refactoring With the Observer Pattern

What is the Observer Pattern?

Have you written a system that has the concept of “events” that need to notify users? Perhaps you send an email on a system error to an administrator. Or send a “password changed” event notification to a user? In a CRM perhaps clients can subscribe to new customers as they come into the system as the result of marketing efforts?

This can be represented using the Observer pattern. Your users have subscribed to one or more events, and they are “observing” new events. The events themselves are “observable” via these event subscriptions.

This blog post will walk through refactoring an existing event notification system that did not use the observer pattern, and instead relied on a multiple passes to calculate an intersection of users and events that ran periodically via cron.

Old pattern

We have a CRM that is written in Ruby on Rails. When a database record is generated for something we want to create an event for, we use the ActiveRecord callback to create a second “event” record after save. This event record is associated to the source record and includes some contextual information like a type “NewLead”, or “SystemError”, and whether or not it has been processed (defaulted to false). Event also includes some hierarchical attributes so we can enforce authorizations in a multi-tenant application.

Once the event records are created, we had a cron entry that would fire every minute and check for new events. This would do a database query looking up unswept events, and looking up all active subscriptions. Doing some intersection logic, we would then generate an “event subscription” record for each event that can belong to an intersection. This did not scale particularly well because a generic event could potentially go to hundreds of subscribers, which would mean a database table entry per subscriber. This information is also not valuable after the notification is sent.

Once the events are swept, and “event notifications” records are generated, we could then send out an email. For more usability we have a digest feature where emails are not sent immediately when events come in. Rather we collect for a period of time, then send a summary of events that have occurred since the last email was sent.

This system did not scale due to the volume of events and subscriptions. We had issues where long inserts, and large datasets caused deadlocks in the database and out of memory errors on the server. The goals were to keep the events, and the subscriptions, but optimize the wofklow so that we could scale further, and have less transient data stored in the database. It goes without saying, but the replacement needed to be robust, and easy to reason about with good test coverage as well.

New pattern

The Observer pattern I mentioned a few minutes ago was the right fit for this particular problem. Instead of a cron initiated loop with lots of state being persisted we would move to an in-memory event loop and stream the events to the subscriptions.

One wrinkle with the observer pattern was the email digest requirement. Observers are great at taking events and then processing them immediately. Building a collection required some investigation, but using Event Machine and periodic timer I was able to preserve this functionality as we will see.

The event system

There are two primary areas – the observable, and the observer. In Ruby, we can compose the Observable pattern by including the module into our class. This sets up a pseudo interface in Ruby that we can implement in our observable and observers. Of course Ruby doesn’t have true interfaces so this is more of a guideline than a compile time check.

The observable class:

require 'observer'
module Events
  class Observable
    include ::Observable
    attr_accessor :last_tick_at

    def initialize
      self.last_tick_at = Time.zone.now
      observers.each { |observer| observer.register(self) }
    end

    def run!
      changed_subscriptions.each do |subscription|
        subscription.reregister(self)
      end

      new_events do |event|
        changed
        notify_observers(event)
      end
    end

    private

    def new_events(&block)
      Event.yield_and_sweep(&block)
    end

    def changed_subscriptions
      Subscription.where("updated_at >= ?", last_tick_at).tap do
        self.last_tick_at = Time.zone.now
      end
    end

    def observers
      Subscription.only_active.not_paused
    end
  end
end

The observable class will be instantiated by our task runner. During the initialization it grabs the current time, and registers its observers (subscriptions). Its method run! will be invoked periodically and is responsible for finding and sweeping new events, and sending out to the observers. It also checks if any subscriptions have been added, changed, or deleted via the updated_at timestamp which is compared against the last run. These subscription observers have to re-register with the observable to reflect their changes.

Next up is the runner::

require 'eventmachine'

module Events
  class Runner
    def start
      EM.run do
        observable = Events::Observable.new
        run_timer = EventMachine::PeriodicTimer.new(0) { observable.run! }
        run_timer.interval = 1.minute.to_i
      end
    end
  end
end

This runner instantiates the observable class and starts an Event Machine loop. Within this loop we add a periodic timer every minute to run the observable. As we saw in the observable, the run! method will register its observables, and send them events.

Finally, we have the observer class:

module Events
  module Observer
    attr_accessor :observed_events, :timer

    def register(observable)
      observable.add_observer(self)
      self.observed_events ||= []
      self.timer = EventMachine::PeriodicTimer.new(self.wait_interval) { process_events }
    end

    def unregister(observable)
      observable.delete_observer(self)
      self.timer.try(:cancel)
    end

    def reregister(observable)
      unregister(observable)
      register(observable)
    end

    def update(observed_event)
      return unless authorized_for_context(observed_event)
      self.observed_events.push(observed_event)
    end

    def process_events
      return unless observed_events.present?
      EventMessageMailer.delay(queue: :email).event_message(observed_events.map(&:id), self.user)
      observed_events.clear
    end

    private

    def authorized_for_context(event)
      return false unless self.event_type.blank?            || self.event_type == event.type
      return false unless self.agency_id.blank?             || self.agency_id == event.agency_id
      return false unless self.client_id.blank?             || self.client_id == event.client_id
      return false unless self.business_unit_id.blank?      || self.business_unit_id == event.business_unit_id
      true
    end
  end
end

This is a module that can be mixed in to an ActiveRecord “Subscription” model. There are some assumptions that the model will have certain attributes pertaining to the hierarchy that the subscription is for (agency_id, client_id, business_unit_id) and an attribute for what type the subscription is. In order for an event to be collected the event that is received must pass authorization checks, and must be for the appropriate event type.

Of note is the wait_for_interval. This is logic around the digest functionality. Each subscription has an attribute for how long to wait. A value could be 5 minutes, 1 hour, 1 day, etc. The process_events call will be fired when the PeriodicTimer fires on the specified interval. This clears out events when they are processed.

Lets see a diagram of how all this will work:

Screen Shot 2017-10-20 at 10.26.13 AM

Not shown is the EventMailer. This is just a standard ActiveRecord mailer however that takes an array of events, and shows them as a list. These are then sent to the user’s email address via a background job.

Observable Benefits

There are a few benefits of using the observable pattern that we replaced our existing code with:

  • Efficiency – Subscriptions are not recalculated every time we sweep for events.
  • Storage size – Because the event runner continues to run in the foreground, we don’t have to worry about persisting a lot of in between state. The concept of event subscriptions lives in memory. Messages are constructed on demand and emailed asynchronously.
  • Codebase size – Almost all of the cron entries, task runners, sweepers, and calculations on the event and subscription intersections have been removed because they aren’t necessary. We now have zero cron entries, and one task runner that is responsible for picking up new events and sending to the observers.
  • Agreed upon pattern – While only a benefit if you know the pattern, there is a much greater chance someone will know what an observer is instead of a custom solution
  • Interface checking – Because we are using the Ruby Observable module, we get some safety in adding new observers, and when changing our observable. If the observer doesn’t implement the update method, an exception will be raised when registered with the observable.

Task runner

Now that we have implemented the observer pattern, we need some scripting to have this run in the foreground and recover from any crashes. For this I used the Daemon gem:

require 'daemons'
require File.expand_path('../../config/environment',  __FILE__)
require Rails.root.join('app/models/events/runner')

Daemons.run_proc('events', {
  dir:       Rails.root.join('tmp/pids/'),
  dir_mode:  :normal,
  log_dir:   Rails.root.join('log/'),
  backtrace: true,
  monitor:   true
}) do
  logger = Logger.new(Rails.root.join('log/events.log'))
  Events::Runner.new(logger: logger).start
end

The observable can now be instantiated with a script/events.rb start command. In the event of a crash, I get log output of the stacktrace, and an automatic restart.

What is next

Currently, there isn’t a mechanism to serialize events that are in memory for subscriptions. This was due to the fact that most are minutely, and because the service so rarely gets restarted, and the risk of not receiving a notification was low this problem was left for later. Likely the approach would be trapping a SIGTERM, or SIGKILL signal, and persisting the unprocessed events for each given observable in a way that is recoverable on restart. A text file is probably sufficient, as this data isn’t valuable long term.

Learn More

If you want to learn more about the observer pattern, I recommend Head First: Design Patterns. The content is good, but the 1990’s stock photos will make you groan.

Additionally, SourceMaking has great information on many design patterns including the observer.

 

Advertisements
Apple, Computers, Hardware, Linux, Software, Windows

Using Synergy Over VPN

I’ve been watching a lot of Linus on Tech and one of their sponsors sells a product called Synergy. More information on their product page: https://symless.com/synergy . To summarize, this is a software KVM (no video, so I guess it is “KM”?) solution. The use case was perfect for me. On my desk I have an Apple work laptop, a Windows desktop, and a Linux personal laptop. Anyone that has done extensive work on a laptop keyboard and touchpad know that it isn’t optimal. I didn’t want multiple keyboards and mice on top of my desk because of the clutter. I dropped some money for Synergy, and it just works!

20170731_110826_HDR.jpg

That is until I had to connect to our company VPN the next week. They use a full tunneling solution. When I connect, I lose everything. I can’t print, I can’t access my NAS, but most importantly I can’t access my keyboard and mouse. (The video is fine because it is a hard wire to an external monitor). What to do?

SSH to the rescue! What is SSH? This is a protocol that will allow one computer to securely interface with another computer. Secure SHell. However, we will just be using it for port forwarding, and not for an interactive session. The goal is to take the OS X machine (Synergy client), and SSH into the Windows server (Synergy server). Using this SSH connection, we can forward ports within it. It is a tunnel running inside the SSH connection. This will expose a local port on OS X for 24800 that is actually pointing to the remote server port 24800. This is the port that Synergy uses for its connections.

You will need a few tools, and a little patience. Having just gone through this, I’m sharing for posterity, or maybe for anyone that has thrown in the towel with how crippled VPN makes accessing home devices.

I have the following Synergy setup:

  • Windows 10 Synergy server (keyboard and mouse are physically connected to the desktop)
  • OS X Synergy Client
  • Linux Synergy Client
  • Router with a local area network all these devices share
  • Admin access to the router for port forwarding
  • Autossh package for OS X (available via brew)

First step, get Windows 10 up to speed with SSH. How this isn’t built in as a service in the year 2017 I have no idea. Grab the OpenSSH server package for Windows from https://www.mls-software.com/opensshd.html . After downloading, extract and run the setup file. This will create a new Windows service for OpenSSH that will run on port 22. It prompts you to generate an SSH key for the server.

Once this server is running, you will need to add your user to the list of SSH users. Open up PowerShell as an administrator and change into the C:\Program Files\OpenSSH\bin directory. Run the following commands:

mkgroup -l >> ..\etc\group
mkpasswd -l >> ..\etc\passwd

Try and connect to your SSH server from the OS X client:

ssh <user>@<server IP> # e.g. ssh Ben@192.168.1.95

You should be prompted for your Windows password. Once you can successfully login to the server, we can setup public key authentication. This removes the need for you to type in your password because you identify yourself with an SSH public key. From your OS X machine get your public key:

cat ~/.ssh/id_rsa.pub

Put the contents of this file on your SSH server in the file C:\Program Files\OpenSSH\home\<user>.ssh . This is actually a symlink to C:\Users\<user>.ssh . If the directory .ssh doesn’t exist, you will need to create it first. Now we need to configure the server to allow public key authentication. Edit the C:\Program Files\OpenSSH\etc\sshd_config file and change the following lines:

StrictModes no
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys

Restart the OpenSSH server for the changes to take effect:

net stop opensshd
net start opensshd

You should now be able to SSH into the server same as before but without being prompted for a password.

Now we are ready to create an SSH tunnel. Before we incorporate AutoSSH (which handles retries and monitoring) we will do a naive attempt to SSH. In the following command:

  • -f backgrounds the process
  • -L does port tunneling in the format of <local port>:<remote host>:<remote port>
  • -N do not run a command – just tunnel the port
ssh -f <user>@<remote public IP> -L 24800:<remote public IP>:24800 -N

If this works, you should see a [LISTEN] entry for port 24800 when you list open files:

lsof -n -i | grep 24800

You may need to set your server as the DMZ on your network. Or to be safer you can simply setup port forwarding. We will need port 22 and port 24800 to resolve to the Windows server. The instructions for how to do this on a router widely vary by vendor. Typically it is under a WAN section. It typically prompts for a port, a destination IP, and destination port, and protocol. You want ports 22 and 24800 to route to your server IP for TCP and UDP.

Configure your Synergy client to use localhost instead of the remote IP. You should now be able to operate your client from the server’s peripherals via Synergy.

Everything works great until the VPN connection is made. The reason is that the SSH connection is severed. In order to recover automatically, I have added autossh to persist this tunnel. On the OS X client instead of running SSH do the following:

AUTOSSH_POLL=10 autossh -M 20000 -f -N <user>@<remote public IP> -L 24800:<remote public IP>:24800

Now when a VPN connection is made, or a disconnection happens, the autossh package will detect that it is no longer alive and retry. Because Synergy’s software also retries, after a few seconds your connectivity should begin working again.

Thanks to Synergy for making a solid product, and for having first class Linux support.

Computers, Software

Moving Data in AWS

I have been a long time user of Amazon Web Services, but I only recently started using the Data Pipeline service they offer to handle ETL needs. The service provides a cloud ready, low cost, turn key (in some cases) solution to moving data within your services. I had a difficult time getting up and running, partly due to the lack of discussion online about this service, so I want to share my experience, offer some best practices, walk through how I developed our pipelines.

Anyone with an AWS account can use Data Pipelines. But be careful there there is no BAA agreement if you are in the healthcare industry and are passing around PHI data. Fortunately, our needs do not yet require us to move around PHI information.

The first step in my ETL journey was first formalizing what data needed to go where. Before I even opened up the Data Pipeline service, I had to understand our specific needs. I identified two use cases.

Archiving Old Data

RDS instances have a max allowable size for the database and ours was getting full. The approach I took was to look at our largest tables, and focus on reducing those first. I ran some queries to understand what is using the most space: http://stackoverflow.com/a/9620273/802407

Once I had a list of my largest tables, I could classify them and assign retention rules. Some tables I decided to leave in place, and others I decided were transient data, and could be archived. (HIPAA mandates a 7 year data retention policy, so no luck just deleting). We decided as a business that different classifications could live within our application for differing time frames. Once timeframes were established, I could then write a data pipeline, and move any data older than our cut off date for that table to a storage solution outside of the database. We chose to house MySQL backups on S3 in encrypted buckets.

Fortunately the Data Pipeline service provides a number of templates to help you get started. Navigate to https://console.aws.amazon.com/datapipeline .  I found the templates good starting point, but there are some frustrations that I will mention below in the “Quirks” section. Click “Create Pipeline”. I used the template “Full Copy of RDS MySQL table to S3”. I filled in the parameters, and edited the pipeline in “Edit in Architect” mode.

Since I wanted to archive old data, I modified the SqlDataNode’s Select Query to be only records older than my retention policy:

select * from #{table} 
where #{myRDSTableLastModifiedCol} <= 
date_sub('#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' , 
  interval 6 month)

This will select records only older than 6 months from the pipeline scheduled start time. The template then moves these to S3. There are two parameters, defined in the parameters section for “#{myRDSTableName}”, and “#{myRDSTableLastModifiedCol}”. I supplied my table name, and the updated_at datetime column for my records.

I added a new SqlActivity dependent on the CopyActivity named “Delete old records”. Once they move to S3, I want to delete them from the database table. This activity “Depends on: RDSToS3CopyActivity” so if saving to S3 fails, the records are left untouched. I added the following script to mirror my select query above, but deleting the records instead:

delete from #{myRDSTableName} 
where #{myRDSTableLastModifiedCol} <= 
date_sub('#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' , 
  interval 6 month)

I would recommend doing this on a test database first before you delete production records while testing your script. Because the timestamp is the same as from the select, this will be the same record set, provided you have an updated_at column that shows when a record was last touched.

Expose Data for Reporting

My other use case was getting data to the reporting server which is in Redshift. Again, there is a nice template to get started. Click “Create Pipeline” and then select “Incremental copy of RDS MySQL table to Redshift”, then “Edit in Architect”.

The first run will need to be a full copy if I want all the data in the table. After that, I can use delta copies to only move over the new data fitting your criteria. This is driven by a SQL select query, so it is easy to modify. In the SqlDataNode I can edit the Select Query to my liking. Note that I removed the conditions from the query to get all records on the first run. I changed the instance type to something more substantial (t1.micro to m1.small), and upped the timeout from 2 hours (to 8 hours). I then went in before the next run and modified the query to put back the conditions that selected the delta data, then downgraded the instance type, and timeout values to their smaller defaults.

I then ran into an infrastructure quirk where our Redshift instance was inside a VPC, and our RDS database was inside a classic instance (non-VPC). This meant that the same EC2 instance would not be able to talk to both databases since it had to be on one side or the other. Because of this limitation, I had to modify parts of the pipeline that assumed a single EC2 instance would be talking to both databases. Note that I had to edit the JSON as the Data Pipeline UI does not allow changing the resources that activities run on from the template. I created two EC2 instances – one for talking to RDS and S3, and one for talking to S3 and Redshift.

In an attempt to make things easier, Amazon provides some Python scripts that get called under the hood to reflect on your MySQL table structure and convert it to a PostgreSQL CREATE TABLE command. This didn’t work for me because of my VPC permissions issues, so I provided my own CREATE TABLE Sql in the S3StagingDataNode. This was generated using the Python script by Amazon, but I supplied the inputs manually:

#!/bin/bash
aws s3 cp s3://datapipeline-us-east-1/sample-scripts/mysql_to_redshift.py \
/tmp/mysql_to_redshift.py

for target in table1 table2 table3
do
  mysqldump -h <host> --port=3306 -u <username> --password=<password>  \
--compatible=postgresql --default-character-set=utf8 -n -d \
-r "/tmp/${target}_structure.sql" <database> $target

  python /tmp/mysql_to_redshift.py --input_file=/tmp/${target}_structure.sql \
--output_file=/tmp/${target}_create.psql --table_name=$target \
--insert_mode=KEEP_EXISTING
done

This Bash script will pull down the public file mysql_to_redshift.py. Then it loops over the target tables you want to setup pipelines for. For each table (table1, table2, table3, etc) it does a mysqldump of the table structure. It then feeds this table structure file into the python conversion utility to produce the PostgreSQL version of the table structure. The contents of the table1_create.psql file is what I copied into my “Create Table Sql” field in the Data Pipeline UI.

Note that the “Create Table SQL” is interpreted literally, and has no schema context in Redshift. Therefor if I want to create the database table in another schema, the CREATE TABLE contents need to be modified to prepend this table name with a schema qualifier. e.g. “table1” would become “staging.table” (without the quotes). The Python utility will double quote the table name if given a table name with a period. This will incorrectly create a table in the public schema: public.”staging.table”, which is probably not what desired. Check the contents of the CREATE TABLE for accuracy.

I also changed the “Insert Mode” in the S3ToRedshiftCopyActivity” to be OVERWRITE_EXISTING . This uses the primary key of the table to detect duplicate rows. Since we might modify existing records, I wanted to replace those records in Redshift when they are modified in the application.

Quirks

The data pipeline services has a number of quirks that I stumbled upon. I hope Amazon works to refine the service, and that one day these are no longer issues. But for now I observed the following:

  • I cannot add everything via the UI. Things like parameters, and EC2 resources can only be added via editing the JSON. Don’t be afraid to open it up – it is reasonably easy to follow. Hopefully support to add these objects will come to the UI in the future.
  • The default templates are a great place to understand the pipelines, but are very messy. The JSON definitions are painful to manage. Some information are stored in parameters, others are done inline. Some activities cannot have parameterized variables. Sometimes the parameter names MUST start with “my”, e.g. “myDatabaseUsername”. I found this arbitrary and frustrating. Also some parameters have a “watermark” value, a “help” value, others don’t. At least one variable started with a “*” character. No explanation why.
  • When testing a pipeline I cannot change between a scheduled pipeline, and an on demand pipeline. I have to export the JSON definition and create a new pipeline as the other type. This makes testing painful.
  • The “Execution Details” screen is hard to interpret. The filter defaults to “Activities” only, but all of my pipeline definitions start with an EC2 resource being instantiated which is filtered out. The timeframes are also strange. I needed to change the “Schedule Interval” to be “Executed start” while testing an on demand pipeline. Finally the dates need changing from the default. It will default to 2 days ago, and will include a lot of previous test runs if you are developing. They don’t seem to be sorted in any logical way either, making tracing the pipeline execution difficult at a glance.
  • While debugging, check S3 for the logs. I found log data was contained in S3 that was not referenced at all in the Data Pipeline UI. This was critical for understanding failures.
  • The visualization in the Architect mode is not particularly helpful. The only thing I can do is click on a node and see the properties on the right. I cannot view two node’s properties at once. Worse is the parameters are in a completely different section so I can only see the variable name, or the variable value at any time. I found it only useful to see the flow of execution. For editing, I would export the data pipeline to JSON and modify in a text editor outside of the UI.

Conclusion

The Data Pipeline service offers an affordable way to automate ETLs in a cloud friendly way. The proof of concept that I did under represented the amount of work it would take to get these running in production. I frequently battled with permissions, UI quirks, and timeouts. I was persistent, and eventually got these pipelines running. One running, I stripped out the passwords and committed the JSON exports into version control for peace of mind.

I want to look into the SNS failure reporting next to be alerted when pipelines fail. I am most concerned with upstream changes to the table schemas. If these don’t match the downstream schema, the pipelines will fail. To guard against this I wrote unit tests that inspect the tables and ensure no changes have been introduced upstream. If a change is introduced, the test fails with the message that it was expecting something else. This should trigger the developer to alert the data services team to modify the pipeline downstream to prepare for upstream changes. This policy has not yet been tested.

Computers, Open-source, Software

Assign Pull Requests with Hubot

ChatOps was a term our DevOps person shared with me. Its the concept of automating everything you can through your company chat client. This could include monitoring, deploying, and just having fun (/pugbomb anyone?)

We implemented a pull request policy for all new code going to master. A pull request is opened by a developer on Github for the feature branch they want to merge into master. One of the biggest benefits to opening a pull request is having a peer review happen on your code where at least one other developer provides a review before it is committed to master. Fixing issues in the code at this point in the development cycle is relatively inexpensive compared to having this code fail QA, or BA sign off, or worst case, having a production incident.

Who reviews the pull requests? In the interest of being fair, it shouldn’t be a single developer that carries that burden. Pull requests can be a fantastic tool for transferring technical and domain knowledge so its best if everyone can participate. A developer might have never had occasion to work on the part of the codebase that contains the changes. For these reasons, we decided to do a round robin assignment of PRs to the developers. And yes, we fixed the bug where you get assigned your own PR 🙂

To accomplish this we utilized an open source project called Hubot that is maintained by Github and has integrations with Slack, as well as other chat clients. Hubot scripts are written in CoffeeScript, and the platform runs on Node.js. We enabled the integration between Github and Slack. A chat bot will publish a message on Slack when a pull request is opened. Hubot listens for these messages and then runs the code that you specify in response to the event. It sounds complex, but most of the heavy lifting is already done for you. You just listen for keywords, and take certain actions.

In our case, mentioning to a developer that he or she is assigned (via Slack) to a PR was sufficient for our use case. We created a new Slack channel, setup a webhook, and write a few lines of CoffeeScript to do the round robin assignment. It looks like this in Slack:

Screen Shot 2016-08-11 at 6.30.50 PM

HubotSlack = require 'hubot-slack'

module.exports = (robot) ->

Require the hubot-slack module. Inside this function is where we will do our work

# { <github username>: <slack username> }
 assignees =
 'bsimpson': 'ben.simpson'

getAssignee = ->
 index = robot.brain.get('assignee_index') || 0
 assignee_key = Object.keys(assignees)[index]
 assignee = assignees[assignee_key]

 if index >= Object.keys(assignees).length-1
  index = 0
 else
  index++

 robot.brain.set('assignee_index', index)
 assignee

We list our assignees in an hash and then use robot.brain to store where we left off. This allows us to restart our client without losing our place in the round robin assignment. You will notice that the assignment hash has two usernames to allow mapping from Slack to Github user accounts in the event they are different.

regex = /Pull request submitted:.* by (.*)/i

robot.listeners.push new HubotSlack.SlackBotListener robot, regex, (msg) ->
 author = msg.match[1]
 assignee = getAssignee()
 until (assignee != assignees[author])
 assignee = getAssignee()

 msg.send "<@#{assignee}> you've been assigned to #{author}'s pull request."

This is the binding we make to listen to Github bot messages. We can’t use robot.listen since this only works for human messages. Since we are talking bot to bot, we have to utilize the SlackBotListener.

Now that we have our code in place, we can create a new Slack webhook. Inside your Slack settings click “Apps and Integrations” > “Manage” > “Incoming Webhooks”. Add a new one for the channel you want Hubot to post to and get the token. This will be needed when we start Hubot.

HUBOT_SLACK_EXIT_ON_DISCONNECT=true HUBOT_SLACK_TOKEN=xxxxxxx./bin/hubot --adapter slack & echo $! > /path/to/hubot/run.pid

This will install any dependencies and run Hubot. Substitute your token from earlier for the HUBOT_SLACK_TOKEN. The pid tracking is important if you are using Monitrc to monitor your process.

Happy ChatOps-ing!

 

Hardware, Linux, Open-source, Ruby, Software, Uncategorized

Delayed Job Performance Tuning

We found a bug. A bug that affected a lot of historical records that we now have the pleasure of reprocessing. Fortunately we already had an async job infrastructure in place with the delayed job gem. Unfortunately, this gem is intended for fairly small batches of records and isn’t tuned to handle 1M+ records in the delayed_jobs table.

After reading some best practices we decided on a queue based approach. To keep our day to day async jobs running we would use a “default” queue. And to reprocess our old records we used a new queue. Starting the workers up with a “–queue” flag did the trick. We had hardware dedicated for day-to-day operations, and new hardware dedicated to our new queue operations. Now it was simply a matter of filling up the queue with the jobs to reprocess the records.

Our initial approach maxed out the CPU on our database server. This was largely due to us not tuning our SQL in our async jobs. Because the volume we processed was always low, this was never really a noticeable problem. But when we threw lots of new jobs into the queues, it became very noticeable. The workers would start up, then mysteriously die. After some digging in /var/log/kern.log we discovered the workers were being killed due to an out of memory manager. Attaching a small swap partition helped, but once you’ve hit swap, things become horribly slow. Whats the point in keeping the worker alive if its running a thousand times slower? Clearly we needed to profile and tune. So we did. (The specifics are out of scope for this article, but it involved consolidating N+1 queries and limiting columns returned by the SELECT).

With our newly tuned SQL our spirits were high as we cranked up the workers again. Only to reach the next process bottleneck. And this is where databases gets frustrating. Delayed job workers run a query each time they are looking for a new job to find out which job to pick up. It puts a mutex lock on the record by setting locked_at and locked_by. The query looks like this:

UPDATE delayed_jobs
SET `delayed_jobs`.`locked_at` = '2016-06-05 11:48:28', 
 `delayed_jobs`.`locked_by` = 'delayed_job.2 host:ip-10-203-174-216 pid:3226' 
WHERE ((run_at <= '2016-06-05 11:48:28' 
AND (locked_at IS NULL OR locked_at < '2016-06-05 07:48:28') OR locked_by = 'delayed_job.2 host:ip-10-203-174-216 pid:3226') 
AND failed_at IS NULL) 
ORDER BY priority ASC, run_at ASC 
LIMIT 1;

The UPDATE does an ORDER which results in a filesort. Filesorts are typically something an index can resolve. So I optimistically added the following:

CREATE INDEX delayed_job_priority
ON delayed_jobs(priority,run_at);

Sadly, this index was completely ignored when I ran an EXPLAIN on my UPDATE. And the reason is that MySQL doesn’t execute an UPDATE query the same way as if you did a SELECT with the same conditions. The index probably made things worse, because now with each record update, we now also have an index update as well. I could probably fork the code and probably use some type of isolation level in a transaction to get the best of both worlds with an index based SELECT, and a quick UPDATE on a single record by id. But there are easier solutions to try first.

My UPDATE statements were pushing 40 seconds in some cases according to MySQL. Eventually the lock wait timeout is exceeded and you see an error in the delayed_jobs.log:

Error while reserving job: Mysql2::Error: Lock wait timeout exceeded; 
try restarting transaction

Jobs were moving very slowly, and throwing more workers at it didn’t make an improvement. This is because each time a worker picks up a job, it was waiting 40+ seconds. The UPDATE was doing a filesort, and any index was being ignored. (And MySQL doesn’t support UPDATE hints). It was pretty clear that all of the jobs from the reprocessing queue needed to find a new home that didn’t negatively impact my filesort. I settled on the following solution:

CREATE TABLE delayed_jobs_backup LIKE delayed_jobs;

INSERT INTO delayed_jobs_backup
SELECT * FROM delayed_jobs WHERE queue='new_queue';

DELETE FROM delayed_jobs WHERE queue='new_queue';

This creates a new database table with the structure of the existing delayed_jobs table. The table is then populated with the jobs that needed to find a new home (All 1M+ of them). And finally, deleted from the original delayed_jobs table. Be careful doing this, and do some SELECT/EXPLAIN queries in between to ensure you are doing what you think you are doing. (Deleting 1M+ records from a production database makes me sit up in my chair a bit).

Looking at MySQL’s process list I no longer have System locks on my UPDATE statements (presumably because the table size is small enough that the filesort is mostly painless):

mysql> SHOW FULL PROCESSLIST;
# Id, User, Host, db, Command, Time, State, Info
1, user, 0.0.0.0:34346, localhost, Query, 0, System lock, UPDATE ...

The important columns here are the Time (in seconds), State, and Info. This proves that my job locking was happening quickly. I was seeing Time values of 40+ seconds before. I kept referring back to this process list to verify that the UPDATES were remaining snappy while I modified the number of workers running, and the number of jobs in the queue. I had a goal of keeping the UPDATE system lock times under 2 seconds. Adding more workers pushed the times up. Adding more jobs to the queue pushed the times up. Its a balance that probably depends very much on what you are processing, how much your database can handle, and what your memory constraints are on your worker servers.

To conclude – my job over the next few days will be to run the following command to put some jobs into the delayed_jobs table 10,000 at a time:

INSERT INTO delayed_jobs
SELECT * FROM delayed_jobs_backup LIMIT 10000;

DELETE FROM delayed_jobs_backup LIMIT 10000;

You can of course automate this. But my objective was met. The jobs can reprocess old records without impacting day to day jobs in the default queue. When the delayed_jobs table is almost empty, I move over another batch of jobs from the delayed_jobs_backup table. Rinse and repeat until there are no more jobs left to process. Its a bit more painful, but day to day operations continue to function, and I can cross of the reprocessing task from my list of things to do. All without any code changes!

I’ve been reading up on transaction isolation levels thinking something like a SELECT FOR UPDATE lock might be worthy contribution to the delayed_job codebase: http://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html

Computers, Open-source, Ruby, Software

Upgrading to Rails 4

Recently, I started a new job and the first big assignment was to upgrade their software stack to a more recent version. They are running Rails 3.2 and wanted to upgrade as far forward as they can. With Rails 3.2 support gone for all but severe bug fixes, and Rails 5 due any month now, this is something they wisely didn’t want to put off.

Its a smaller company, and they have been open to a lot of my feedback and suggestions. I was basically given the reins and told to do what needed to be done to get us upgraded.

So the first task was some research, and I stumbled upon the official Rails upgrade guide pretty quickly. It nicely outlines the breaking changes. Fortunately, the big change was to strong parameters, but this can be deferred by including protected_attributes and kicking this can down the road. We will be logging what controller actions receive which parameters, instead of raising so we will have some time to collect some data before we switch over in one painful release.

The guides stressed that the test suite is critical during the upgrade. I was fortunate enough to have a project with adequate testing coverage. It wasn’t the 80% sweet spot, but it was certainly valuable at ~40%. However, the suite had fallen into disuse, so the first task was to get them back to green.

Once the test suite was green, it became a matter of KEEPING it green. Luck smiled a second time and they had an old CI server that had fallen into disuse. It was powered by CruiseControl.rb and it was little fuss to get it back up and running again. The migrations could no longer be played from the projects inception to the current time.

This is where luck stopped smiling upon me. The project did not track db/schema.rb and the migrations were not playable. The only way to get an instance of the database was to download the schema from production. Not the best practice, so I went about tracking the schema, and getting adoption of this new practice. Further complicating the schema approach was the decision to move all older migrations into subfolders in db/migrate by year (e.g. 2011, 2012, etc). This was done I found out because Textmate doesn’t like big directories. The issue is that db:schema:load isn’t recursive in its retrieval of migration versions. It took me a bit to understand what was happening, and how it was happening. After a failed monkey patch to the migrator logic in ActiveRecord, I decided to just move the migrations back into db/migrate and eliminate the subdirectories. Sorry Textmate!

Now the database could be rapidly provisioned, and I got a seed working with a minimal set of data. Back in CI I reconfigured the build script to use db:schema:load instead of db:migrate and with the green test suite, we got builds working again.

We used a utility called CC Menu to show the build status in the notification bar in OS X: http://ccmenu.org/

To make the builds even more visible, I discovered an integration with Slack to report the build status in our chat. https://github.com/epinault/cruisecontrolrb-to-slack . I made my own fork and added some famous movie quotes for successes and failures since I found the default messages lacking: https://github.com/bsimpson/cruisecontrolrb-to-slack . I didn’t think our female developers would appreciate the “you’re a stud!” message.

Back to the Rails 4 upgrade. The tests are passing in master, so I made a feature branch that will be long lived called “rails-upgrade”. I merge master in daily. The “rails-upgrade” branch will serve as an integration point for other features branches that will merge into it. The plan is to keep any upgrade related changes out of master until its time to deploy. That means separate branches, separate CI builds, and separate staging servers for manual QA.

One lesson I’ve learned is that a deprecation warning may not always be just informational. In particular, Rails 4 requires all scopes to be callable (lambdas, or procs). This was breaking the way that associations with scopes would be built: users.roles.admin.find_or_create! would previously find an associated admin record, or create it. However, in Rails 4, it fails creation because the role’s reference to user is nil. I’m not sure why, but its reproducable, and changing the admin scope on Role to a callable restores this reference back to user.

Ideally, I’d have wanted to get the test suite green before tackling deprecation warnings because I want to change as little as possible before I get back to a known good status. However, not fixing this deprecation warning was actually causing tests to fail.

Now we are down to a handful of failings tests on Rails 4. Most deal with the ActiveRecord syntax changes. Hopeful I can get these knocked out quickly. Then its on to manual QA.

In summary – get your test suite green. Keep it green. Do the upgrade and get it back to green. Then make any changes to remove deprecation warnings, keeping the suite green. The test suite is your guide during an upgrade like this.

Open-source, Ruby, Software, Thoughts, Uncategorized

Implementing a Configuration

Have you ever seen a configuration in Ruby that yields to a block where properties are set on the yielded object? Rails does this with its environments files:

Your::Application.configure do
  config.cache_classes                     = true
  config.consider_all_requests_local       = false
  config.action_controller.perform_caching = true
  ...
end

Devise is another great example of this pattern:

Devise.setup do |config|
  config.secret_key = '38c8e4958385982971f'
  config.mailer_sender = "noreply@email.com"
  config.mailer = "AuthenticationMailer"
  ...
end

This is an established way to pass in configuration options in the Ruby world. How does this work? What are the alternatives?

Under the Hood

Lets start with our own implementation from scratch. We know that we are calling a method (Application.configure, or Devise.setup) and that method yields to a block:

module Example
  class << self
    def configure
      yield Config
    end
  end
end

You can see that this will yield the class Config when we call Example.configure and pass in block. Config will need to define our properties:

module Example
  class Config
    class << self
      attr_accessor :api_url, :consumer_key, :consumer_secret
    end
  end
end

We can now call our configure method and pass in a block:

Example.configure do |config|
  config.api_url = "http://api.com/v1/"
end

If we try to set a property that does not exist in our Example::Config class, then we get a helpful NoMethodError:

Example.configure do |config|
  config.not_a_real_config_setting = "http://api.com/v1/"
end

# => NoMethodError: undefined method `not_a_real_config_setting=' for Example::Config:Class

This helps to define your configuration interface. Options are explicitly declared. They cannot be mistyped, or dynamically added without an explicit runtime error.

In practice, The definitions of Example, and Example::Class would live in a gem. The Example.configure invokation would live within the code that wants to configure the properties of that gem with things such as credentials, URLs, etc. This seperation of concerns makes sense. Gems should not know anything about the business logic of your application. It should be instantiated with configuration options passed in from the project. However the project should only be aware of the public interface of the gem. We are agreeing upon where these pieces of information are moving from one domain to another. The project doesn’t know where this information will be used, and the gem doesn’t know what the information is until we pass it. So far so good!

Using this Configuration

Now that we’ve passed our information inside the configuration block, we can reference these class level (static) properties in our gem:

module Example
  class Request
    def self.get(path)
      request = Net::HTTP.get(URI.parse(ExampleConfig.api_url + path))
      request['Authorization'] = auth_header(:get, ExampleConfig.api_url)
    end

    private

    def self.auth_header(request_type, uri)
      SimpleOAuth::Header.new(request_type, uri.to_s, {},
                              {consumer_key: Config.consumer_key,
                              consumer_secret: Config.consumer_secret}).to_s
    end
  end
end

This will do a simple GET request passing in SimpleOAuth headers. Inside the get method we call Config.api_url to know where the API lives. This was set by us earlier using the Config object. SimpleOAuth headers are supplied by again calling the Config. You would invoke it like so:

Example.configure do |config|
  config.api_url = "http://api.com/v1/"
  consumer_key = "1895-1192-1234"
  consumer_secret = '76asdfh3heasd8f6akj3hea0a9s76df'
end

Example::Request.get('/products') # => {products: [product1, product2, etc]...}"
Example::Request.get('/users') # => "{users: [user1, user2, etc]...}"

Example::Config becomes the holding location for your configuration information. And by making the properties static, you don’t have to worry about passing around a reference to the instance.

Alternatives

If the yielding to a block is a little too clever for you, you can always instantiate a class and pass in the configuration as part of the constructor:

class Example
  class << self
    attr_accessor :api_url, :consumer_key, :consumer_secret

    def config(api_url:, consumer_key:, consumer_secret:)
      self.api_url = api_url
      self.consumer_key = consumer_key
      self.consumer_secret = consumer_secret
    end
  end
end

This can be instantiated like so:

Example.config(
  api_url: "http://api.com/v1/"
  consumer_key: "1895-1192-1234"
  consumer_secret: '76asdfh3heasd8f6akj3hea0a9s76df'
)

Example.api_url # => "http://api.com/v1/"

This feels less encapsulated to me. Instead of having an interface for our configuration settings, we are just settings properties directly onto a class.

What are your thoughts? What advantages does the block style configure offer over the alternative above?

For more information on the Ruby gem configuration pattern see this excellent blog post: