Ruby, Software

Refactoring With the Observer Pattern

What is the Observer Pattern?

Have you written a system that has the concept of “events” that need to notify users? Perhaps you send an email on a system error to an administrator. Or send a “password changed” event notification to a user? In a CRM perhaps clients can subscribe to new customers as they come into the system as the result of marketing efforts?

This can be represented using the Observer pattern. Your users have subscribed to one or more events, and they are “observing” new events. The events themselves are “observable” via these event subscriptions.

This blog post will walk through refactoring an existing event notification system that did not use the observer pattern, and instead relied on a multiple passes to calculate an intersection of users and events that ran periodically via cron.

Old pattern

We have a CRM that is written in Ruby on Rails. When a database record is generated for something we want to create an event for, we use the ActiveRecord callback to create a second “event” record after save. This event record is associated to the source record and includes some contextual information like a type “NewLead”, or “SystemError”, and whether or not it has been processed (defaulted to false). Event also includes some hierarchical attributes so we can enforce authorizations in a multi-tenant application.

Once the event records are created, we had a cron entry that would fire every minute and check for new events. This would do a database query looking up unswept events, and looking up all active subscriptions. Doing some intersection logic, we would then generate an “event subscription” record for each event that can belong to an intersection. This did not scale particularly well because a generic event could potentially go to hundreds of subscribers, which would mean a database table entry per subscriber. This information is also not valuable after the notification is sent.

Once the events are swept, and “event notifications” records are generated, we could then send out an email. For more usability we have a digest feature where emails are not sent immediately when events come in. Rather we collect for a period of time, then send a summary of events that have occurred since the last email was sent.

This system did not scale due to the volume of events and subscriptions. We had issues where long inserts, and large datasets caused deadlocks in the database and out of memory errors on the server. The goals were to keep the events, and the subscriptions, but optimize the wofklow so that we could scale further, and have less transient data stored in the database. It goes without saying, but the replacement needed to be robust, and easy to reason about with good test coverage as well.

New pattern

The Observer pattern I mentioned a few minutes ago was the right fit for this particular problem. Instead of a cron initiated loop with lots of state being persisted we would move to an in-memory event loop and stream the events to the subscriptions.

One wrinkle with the observer pattern was the email digest requirement. Observers are great at taking events and then processing them immediately. Building a collection required some investigation, but using Event Machine and periodic timer I was able to preserve this functionality as we will see.

The event system

There are two primary areas – the observable, and the observer. In Ruby, we can compose the Observable pattern by including the module into our class. This sets up a pseudo interface in Ruby that we can implement in our observable and observers. Of course Ruby doesn’t have true interfaces so this is more of a guideline than a compile time check.

The observable class:

require 'observer'
module Events
  class Observable
    include ::Observable
    attr_accessor :last_tick_at

    def initialize
      self.last_tick_at = Time.zone.now
      observers.each { |observer| observer.register(self) }
    end

    def run!
      changed_subscriptions.each do |subscription|
        subscription.reregister(self)
      end

      new_events do |event|
        changed
        notify_observers(event)
      end
    end

    private

    def new_events(&block)
      Event.yield_and_sweep(&block)
    end

    def changed_subscriptions
      Subscription.where("updated_at >= ?", last_tick_at).tap do
        self.last_tick_at = Time.zone.now
      end
    end

    def observers
      Subscription.only_active.not_paused
    end
  end
end

The observable class will be instantiated by our task runner. During the initialization it grabs the current time, and registers its observers (subscriptions). Its method run! will be invoked periodically and is responsible for finding and sweeping new events, and sending out to the observers. It also checks if any subscriptions have been added, changed, or deleted via the updated_at timestamp which is compared against the last run. These subscription observers have to re-register with the observable to reflect their changes.

Next up is the runner::

require 'eventmachine'

module Events
  class Runner
    def start
      EM.run do
        observable = Events::Observable.new
        run_timer = EventMachine::PeriodicTimer.new(0) { observable.run! }
        run_timer.interval = 1.minute.to_i
      end
    end
  end
end

This runner instantiates the observable class and starts an Event Machine loop. Within this loop we add a periodic timer every minute to run the observable. As we saw in the observable, the run! method will register its observables, and send them events.

Finally, we have the observer class:

module Events
  module Observer
    attr_accessor :observed_events, :timer

    def register(observable)
      observable.add_observer(self)
      self.observed_events ||= []
      self.timer = EventMachine::PeriodicTimer.new(self.wait_interval) { process_events }
    end

    def unregister(observable)
      observable.delete_observer(self)
      self.timer.try(:cancel)
    end

    def reregister(observable)
      unregister(observable)
      register(observable)
    end

    def update(observed_event)
      return unless authorized_for_context(observed_event)
      self.observed_events.push(observed_event)
    end

    def process_events
      return unless observed_events.present?
      EventMessageMailer.delay(queue: :email).event_message(observed_events.map(&:id), self.user)
      observed_events.clear
    end

    private

    def authorized_for_context(event)
      return false unless self.event_type.blank?            || self.event_type == event.type
      return false unless self.agency_id.blank?             || self.agency_id == event.agency_id
      return false unless self.client_id.blank?             || self.client_id == event.client_id
      return false unless self.business_unit_id.blank?      || self.business_unit_id == event.business_unit_id
      true
    end
  end
end

This is a module that can be mixed in to an ActiveRecord “Subscription” model. There are some assumptions that the model will have certain attributes pertaining to the hierarchy that the subscription is for (agency_id, client_id, business_unit_id) and an attribute for what type the subscription is. In order for an event to be collected the event that is received must pass authorization checks, and must be for the appropriate event type.

Of note is the wait_for_interval. This is logic around the digest functionality. Each subscription has an attribute for how long to wait. A value could be 5 minutes, 1 hour, 1 day, etc. The process_events call will be fired when the PeriodicTimer fires on the specified interval. This clears out events when they are processed.

Lets see a diagram of how all this will work:

Screen Shot 2017-10-20 at 10.26.13 AM

Not shown is the EventMailer. This is just a standard ActiveRecord mailer however that takes an array of events, and shows them as a list. These are then sent to the user’s email address via a background job.

Observable Benefits

There are a few benefits of using the observable pattern that we replaced our existing code with:

  • Efficiency – Subscriptions are not recalculated every time we sweep for events.
  • Storage size – Because the event runner continues to run in the foreground, we don’t have to worry about persisting a lot of in between state. The concept of event subscriptions lives in memory. Messages are constructed on demand and emailed asynchronously.
  • Codebase size – Almost all of the cron entries, task runners, sweepers, and calculations on the event and subscription intersections have been removed because they aren’t necessary. We now have zero cron entries, and one task runner that is responsible for picking up new events and sending to the observers.
  • Agreed upon pattern – While only a benefit if you know the pattern, there is a much greater chance someone will know what an observer is instead of a custom solution
  • Interface checking – Because we are using the Ruby Observable module, we get some safety in adding new observers, and when changing our observable. If the observer doesn’t implement the update method, an exception will be raised when registered with the observable.

Task runner

Now that we have implemented the observer pattern, we need some scripting to have this run in the foreground and recover from any crashes. For this I used the Daemon gem:

require 'daemons'
require File.expand_path('../../config/environment',  __FILE__)
require Rails.root.join('app/models/events/runner')

Daemons.run_proc('events', {
  dir:       Rails.root.join('tmp/pids/'),
  dir_mode:  :normal,
  log_dir:   Rails.root.join('log/'),
  backtrace: true,
  monitor:   true
}) do
  logger = Logger.new(Rails.root.join('log/events.log'))
  Events::Runner.new(logger: logger).start
end

The observable can now be instantiated with a script/events.rb start command. In the event of a crash, I get log output of the stacktrace, and an automatic restart.

What is next

Currently, there isn’t a mechanism to serialize events that are in memory for subscriptions. This was due to the fact that most are minutely, and because the service so rarely gets restarted, and the risk of not receiving a notification was low this problem was left for later. Likely the approach would be trapping a SIGTERM, or SIGKILL signal, and persisting the unprocessed events for each given observable in a way that is recoverable on restart. A text file is probably sufficient, as this data isn’t valuable long term.

Learn More

If you want to learn more about the observer pattern, I recommend Head First: Design Patterns. The content is good, but the 1990’s stock photos will make you groan.

Additionally, SourceMaking has great information on many design patterns including the observer.

 

Advertisements
Apple, Computers, Hardware, Linux, Software, Windows

Using Synergy Over VPN

I’ve been watching a lot of Linus on Tech and one of their sponsors sells a product called Synergy. More information on their product page: https://symless.com/synergy . To summarize, this is a software KVM (no video, so I guess it is “KM”?) solution. The use case was perfect for me. On my desk I have an Apple work laptop, a Windows desktop, and a Linux personal laptop. Anyone that has done extensive work on a laptop keyboard and touchpad know that it isn’t optimal. I didn’t want multiple keyboards and mice on top of my desk because of the clutter. I dropped some money for Synergy, and it just works!

20170731_110826_HDR.jpg

That is until I had to connect to our company VPN the next week. They use a full tunneling solution. When I connect, I lose everything. I can’t print, I can’t access my NAS, but most importantly I can’t access my keyboard and mouse. (The video is fine because it is a hard wire to an external monitor). What to do?

SSH to the rescue! What is SSH? This is a protocol that will allow one computer to securely interface with another computer. Secure SHell. However, we will just be using it for port forwarding, and not for an interactive session. The goal is to take the OS X machine (Synergy client), and SSH into the Windows server (Synergy server). Using this SSH connection, we can forward ports within it. It is a tunnel running inside the SSH connection. This will expose a local port on OS X for 24800 that is actually pointing to the remote server port 24800. This is the port that Synergy uses for its connections.

You will need a few tools, and a little patience. Having just gone through this, I’m sharing for posterity, or maybe for anyone that has thrown in the towel with how crippled VPN makes accessing home devices.

I have the following Synergy setup:

  • Windows 10 Synergy server (keyboard and mouse are physically connected to the desktop)
  • OS X Synergy Client
  • Linux Synergy Client
  • Router with a local area network all these devices share
  • Admin access to the router for port forwarding
  • Autossh package for OS X (available via brew)

First step, get Windows 10 up to speed with SSH. How this isn’t built in as a service in the year 2017 I have no idea. Grab the OpenSSH server package for Windows from https://www.mls-software.com/opensshd.html . After downloading, extract and run the setup file. This will create a new Windows service for OpenSSH that will run on port 22. It prompts you to generate an SSH key for the server.

Once this server is running, you will need to add your user to the list of SSH users. Open up PowerShell as an administrator and change into the C:\Program Files\OpenSSH\bin directory. Run the following commands:

mkgroup -l >> ..\etc\group
mkpasswd -l >> ..\etc\passwd

Try and connect to your SSH server from the OS X client:

ssh <user>@<server IP> # e.g. ssh Ben@192.168.1.95

You should be prompted for your Windows password. Once you can successfully login to the server, we can setup public key authentication. This removes the need for you to type in your password because you identify yourself with an SSH public key. From your OS X machine get your public key:

cat ~/.ssh/id_rsa.pub

Put the contents of this file on your SSH server in the file C:\Program Files\OpenSSH\home\<user>.ssh . This is actually a symlink to C:\Users\<user>.ssh . If the directory .ssh doesn’t exist, you will need to create it first. Now we need to configure the server to allow public key authentication. Edit the C:\Program Files\OpenSSH\etc\sshd_config file and change the following lines:

StrictModes no
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys

Restart the OpenSSH server for the changes to take effect:

net stop opensshd
net start opensshd

You should now be able to SSH into the server same as before but without being prompted for a password.

Now we are ready to create an SSH tunnel. Before we incorporate AutoSSH (which handles retries and monitoring) we will do a naive attempt to SSH. In the following command:

  • -f backgrounds the process
  • -L does port tunneling in the format of <local port>:<remote host>:<remote port>
  • -N do not run a command – just tunnel the port
ssh -f <user>@<remote public IP> -L 24800:<remote public IP>:24800 -N

If this works, you should see a [LISTEN] entry for port 24800 when you list open files:

lsof -n -i | grep 24800

You may need to set your server as the DMZ on your network. Or to be safer you can simply setup port forwarding. We will need port 22 and port 24800 to resolve to the Windows server. The instructions for how to do this on a router widely vary by vendor. Typically it is under a WAN section. It typically prompts for a port, a destination IP, and destination port, and protocol. You want ports 22 and 24800 to route to your server IP for TCP and UDP.

Configure your Synergy client to use localhost instead of the remote IP. You should now be able to operate your client from the server’s peripherals via Synergy.

Everything works great until the VPN connection is made. The reason is that the SSH connection is severed. In order to recover automatically, I have added autossh to persist this tunnel. On the OS X client instead of running SSH do the following:

AUTOSSH_POLL=10 autossh -M 20000 -f -N <user>@<remote public IP> -L 24800:<remote public IP>:24800

Now when a VPN connection is made, or a disconnection happens, the autossh package will detect that it is no longer alive and retry. Because Synergy’s software also retries, after a few seconds your connectivity should begin working again.

Thanks to Synergy for making a solid product, and for having first class Linux support.

Computers, Software

Moving Data in AWS

I have been a long time user of Amazon Web Services, but I only recently started using the Data Pipeline service they offer to handle ETL needs. The service provides a cloud ready, low cost, turn key (in some cases) solution to moving data within your services. I had a difficult time getting up and running, partly due to the lack of discussion online about this service, so I want to share my experience, offer some best practices, walk through how I developed our pipelines.

Anyone with an AWS account can use Data Pipelines. But be careful there there is no BAA agreement if you are in the healthcare industry and are passing around PHI data. Fortunately, our needs do not yet require us to move around PHI information.

The first step in my ETL journey was first formalizing what data needed to go where. Before I even opened up the Data Pipeline service, I had to understand our specific needs. I identified two use cases.

Archiving Old Data

RDS instances have a max allowable size for the database and ours was getting full. The approach I took was to look at our largest tables, and focus on reducing those first. I ran some queries to understand what is using the most space: http://stackoverflow.com/a/9620273/802407

Once I had a list of my largest tables, I could classify them and assign retention rules. Some tables I decided to leave in place, and others I decided were transient data, and could be archived. (HIPAA mandates a 7 year data retention policy, so no luck just deleting). We decided as a business that different classifications could live within our application for differing time frames. Once timeframes were established, I could then write a data pipeline, and move any data older than our cut off date for that table to a storage solution outside of the database. We chose to house MySQL backups on S3 in encrypted buckets.

Fortunately the Data Pipeline service provides a number of templates to help you get started. Navigate to https://console.aws.amazon.com/datapipeline .  I found the templates good starting point, but there are some frustrations that I will mention below in the “Quirks” section. Click “Create Pipeline”. I used the template “Full Copy of RDS MySQL table to S3”. I filled in the parameters, and edited the pipeline in “Edit in Architect” mode.

Since I wanted to archive old data, I modified the SqlDataNode’s Select Query to be only records older than my retention policy:

select * from #{table} 
where #{myRDSTableLastModifiedCol} <= 
date_sub('#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' , 
  interval 6 month)

This will select records only older than 6 months from the pipeline scheduled start time. The template then moves these to S3. There are two parameters, defined in the parameters section for “#{myRDSTableName}”, and “#{myRDSTableLastModifiedCol}”. I supplied my table name, and the updated_at datetime column for my records.

I added a new SqlActivity dependent on the CopyActivity named “Delete old records”. Once they move to S3, I want to delete them from the database table. This activity “Depends on: RDSToS3CopyActivity” so if saving to S3 fails, the records are left untouched. I added the following script to mirror my select query above, but deleting the records instead:

delete from #{myRDSTableName} 
where #{myRDSTableLastModifiedCol} <= 
date_sub('#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' , 
  interval 6 month)

I would recommend doing this on a test database first before you delete production records while testing your script. Because the timestamp is the same as from the select, this will be the same record set, provided you have an updated_at column that shows when a record was last touched.

Expose Data for Reporting

My other use case was getting data to the reporting server which is in Redshift. Again, there is a nice template to get started. Click “Create Pipeline” and then select “Incremental copy of RDS MySQL table to Redshift”, then “Edit in Architect”.

The first run will need to be a full copy if I want all the data in the table. After that, I can use delta copies to only move over the new data fitting your criteria. This is driven by a SQL select query, so it is easy to modify. In the SqlDataNode I can edit the Select Query to my liking. Note that I removed the conditions from the query to get all records on the first run. I changed the instance type to something more substantial (t1.micro to m1.small), and upped the timeout from 2 hours (to 8 hours). I then went in before the next run and modified the query to put back the conditions that selected the delta data, then downgraded the instance type, and timeout values to their smaller defaults.

I then ran into an infrastructure quirk where our Redshift instance was inside a VPC, and our RDS database was inside a classic instance (non-VPC). This meant that the same EC2 instance would not be able to talk to both databases since it had to be on one side or the other. Because of this limitation, I had to modify parts of the pipeline that assumed a single EC2 instance would be talking to both databases. Note that I had to edit the JSON as the Data Pipeline UI does not allow changing the resources that activities run on from the template. I created two EC2 instances – one for talking to RDS and S3, and one for talking to S3 and Redshift.

In an attempt to make things easier, Amazon provides some Python scripts that get called under the hood to reflect on your MySQL table structure and convert it to a PostgreSQL CREATE TABLE command. This didn’t work for me because of my VPC permissions issues, so I provided my own CREATE TABLE Sql in the S3StagingDataNode. This was generated using the Python script by Amazon, but I supplied the inputs manually:

#!/bin/bash
aws s3 cp s3://datapipeline-us-east-1/sample-scripts/mysql_to_redshift.py \
/tmp/mysql_to_redshift.py

for target in table1 table2 table3
do
  mysqldump -h <host> --port=3306 -u <username> --password=<password>  \
--compatible=postgresql --default-character-set=utf8 -n -d \
-r "/tmp/${target}_structure.sql" <database> $target

  python /tmp/mysql_to_redshift.py --input_file=/tmp/${target}_structure.sql \
--output_file=/tmp/${target}_create.psql --table_name=$target \
--insert_mode=KEEP_EXISTING
done

This Bash script will pull down the public file mysql_to_redshift.py. Then it loops over the target tables you want to setup pipelines for. For each table (table1, table2, table3, etc) it does a mysqldump of the table structure. It then feeds this table structure file into the python conversion utility to produce the PostgreSQL version of the table structure. The contents of the table1_create.psql file is what I copied into my “Create Table Sql” field in the Data Pipeline UI.

Note that the “Create Table SQL” is interpreted literally, and has no schema context in Redshift. Therefor if I want to create the database table in another schema, the CREATE TABLE contents need to be modified to prepend this table name with a schema qualifier. e.g. “table1” would become “staging.table” (without the quotes). The Python utility will double quote the table name if given a table name with a period. This will incorrectly create a table in the public schema: public.”staging.table”, which is probably not what desired. Check the contents of the CREATE TABLE for accuracy.

I also changed the “Insert Mode” in the S3ToRedshiftCopyActivity” to be OVERWRITE_EXISTING . This uses the primary key of the table to detect duplicate rows. Since we might modify existing records, I wanted to replace those records in Redshift when they are modified in the application.

Quirks

The data pipeline services has a number of quirks that I stumbled upon. I hope Amazon works to refine the service, and that one day these are no longer issues. But for now I observed the following:

  • I cannot add everything via the UI. Things like parameters, and EC2 resources can only be added via editing the JSON. Don’t be afraid to open it up – it is reasonably easy to follow. Hopefully support to add these objects will come to the UI in the future.
  • The default templates are a great place to understand the pipelines, but are very messy. The JSON definitions are painful to manage. Some information are stored in parameters, others are done inline. Some activities cannot have parameterized variables. Sometimes the parameter names MUST start with “my”, e.g. “myDatabaseUsername”. I found this arbitrary and frustrating. Also some parameters have a “watermark” value, a “help” value, others don’t. At least one variable started with a “*” character. No explanation why.
  • When testing a pipeline I cannot change between a scheduled pipeline, and an on demand pipeline. I have to export the JSON definition and create a new pipeline as the other type. This makes testing painful.
  • The “Execution Details” screen is hard to interpret. The filter defaults to “Activities” only, but all of my pipeline definitions start with an EC2 resource being instantiated which is filtered out. The timeframes are also strange. I needed to change the “Schedule Interval” to be “Executed start” while testing an on demand pipeline. Finally the dates need changing from the default. It will default to 2 days ago, and will include a lot of previous test runs if you are developing. They don’t seem to be sorted in any logical way either, making tracing the pipeline execution difficult at a glance.
  • While debugging, check S3 for the logs. I found log data was contained in S3 that was not referenced at all in the Data Pipeline UI. This was critical for understanding failures.
  • The visualization in the Architect mode is not particularly helpful. The only thing I can do is click on a node and see the properties on the right. I cannot view two node’s properties at once. Worse is the parameters are in a completely different section so I can only see the variable name, or the variable value at any time. I found it only useful to see the flow of execution. For editing, I would export the data pipeline to JSON and modify in a text editor outside of the UI.

Conclusion

The Data Pipeline service offers an affordable way to automate ETLs in a cloud friendly way. The proof of concept that I did under represented the amount of work it would take to get these running in production. I frequently battled with permissions, UI quirks, and timeouts. I was persistent, and eventually got these pipelines running. One running, I stripped out the passwords and committed the JSON exports into version control for peace of mind.

I want to look into the SNS failure reporting next to be alerted when pipelines fail. I am most concerned with upstream changes to the table schemas. If these don’t match the downstream schema, the pipelines will fail. To guard against this I wrote unit tests that inspect the tables and ensure no changes have been introduced upstream. If a change is introduced, the test fails with the message that it was expecting something else. This should trigger the developer to alert the data services team to modify the pipeline downstream to prepare for upstream changes. This policy has not yet been tested.

Personal, Thoughts

Pulling Away

I have felt a particular way for a long time. It has been a nebulous, vague feeling for most of it, but recently its started to coalesce in my head. Its a realization that I am pulling away.

I think it started with our second daughter, Adeline. Raising two kids is without a doubt, one of the hardest things I’ve ever done. Its incredibly rewarding, but it has taken its toll on my physically, emotionally, and mentally. At the hospital when Adeline was born my mom said to me that raising two kids is four times as hard. Extrapolating, I expect three kids is eight times as hard. That is how kids work I suppose. The chaos magnifies. I simply don’t think that I could have a third, despite strong wants some days.

I’m now 32, and the world makes less and less sense to me. Trends don’t interest me. I look at communities like Instagram, and Snapchat and I just don’t get it. Facebook has changed from my college days. What was a way to write to your friends is now a trash media outlet and everyone on it is angry. Maybe I boxed myself into some political bubble in their news feed algorithm. I’ve stopped visiting them, along with Twitter, Untapped, Yelp, Strava, and more. Its all needless.

I don’t really get the new movie trends, like the hundreds of superhero comic to movie adaptations. The first few were neat, but we keep remaking them.  I don’t follow sports. I have a hard time listening to new music on the radio. New fashions like vaping and skinny jeans just irritate me. Video games have gone to shit now that publishers are wrapped up in multiplayer and add-ons. Programming just feels like an endless reinvention of the wheel with diminishing progress.

Maybe it’s just what getting older feels like. Maybe it’s wisdom. Not falling prey to the fear of missing out. Focusing on what matters to me as an individual. Quit making everything a competition. Quit trying to impress others with checkins, and pictures showing how much fun you are having. Quit seeking peer approval.

Then last night the presidential election happened. It hit me in a way I didn’t expect. I educated myself on the issues, while staying away from mainstream media, talk shows, and largely out of debates with peers. I figured most of the drama was just for viewership by the media outlets. I believed that some people felt differently from me, but more were like me than not like me. But when the polling results came in, I found myself angry, confused, fighting back tears. It meant more than past elections. I’ve voted for losing candidates, and went on with life. This one stung. To me it was a signal of the end of what I understood our country to be.

I’ve been in this mindset for months now, and the election pushed me to write about it. I needed to get the way I’m feeling out of my head and down on paper. Is this just aging? Is this stress? Diet and exercise? Is this objectively where we are as a civilization, and everyone is waiting for it to change just like me? Have I just become bored, or complacent with my life?

I certainly feel old, tired, slow. Unnaturally. And I’m not sure what to do about it other than writing a cathartic blog post.

Computers, Open-source, Software

Assign Pull Requests with Hubot

ChatOps was a term our DevOps person shared with me. Its the concept of automating everything you can through your company chat client. This could include monitoring, deploying, and just having fun (/pugbomb anyone?)

We implemented a pull request policy for all new code going to master. A pull request is opened by a developer on Github for the feature branch they want to merge into master. One of the biggest benefits to opening a pull request is having a peer review happen on your code where at least one other developer provides a review before it is committed to master. Fixing issues in the code at this point in the development cycle is relatively inexpensive compared to having this code fail QA, or BA sign off, or worst case, having a production incident.

Who reviews the pull requests? In the interest of being fair, it shouldn’t be a single developer that carries that burden. Pull requests can be a fantastic tool for transferring technical and domain knowledge so its best if everyone can participate. A developer might have never had occasion to work on the part of the codebase that contains the changes. For these reasons, we decided to do a round robin assignment of PRs to the developers. And yes, we fixed the bug where you get assigned your own PR 🙂

To accomplish this we utilized an open source project called Hubot that is maintained by Github and has integrations with Slack, as well as other chat clients. Hubot scripts are written in CoffeeScript, and the platform runs on Node.js. We enabled the integration between Github and Slack. A chat bot will publish a message on Slack when a pull request is opened. Hubot listens for these messages and then runs the code that you specify in response to the event. It sounds complex, but most of the heavy lifting is already done for you. You just listen for keywords, and take certain actions.

In our case, mentioning to a developer that he or she is assigned (via Slack) to a PR was sufficient for our use case. We created a new Slack channel, setup a webhook, and write a few lines of CoffeeScript to do the round robin assignment. It looks like this in Slack:

Screen Shot 2016-08-11 at 6.30.50 PM

HubotSlack = require 'hubot-slack'

module.exports = (robot) ->

Require the hubot-slack module. Inside this function is where we will do our work

# { <github username>: <slack username> }
 assignees =
 'bsimpson': 'ben.simpson'

getAssignee = ->
 index = robot.brain.get('assignee_index') || 0
 assignee_key = Object.keys(assignees)[index]
 assignee = assignees[assignee_key]

 if index >= Object.keys(assignees).length-1
  index = 0
 else
  index++

 robot.brain.set('assignee_index', index)
 assignee

We list our assignees in an hash and then use robot.brain to store where we left off. This allows us to restart our client without losing our place in the round robin assignment. You will notice that the assignment hash has two usernames to allow mapping from Slack to Github user accounts in the event they are different.

regex = /Pull request submitted:.* by (.*)/i

robot.listeners.push new HubotSlack.SlackBotListener robot, regex, (msg) ->
 author = msg.match[1]
 assignee = getAssignee()
 until (assignee != assignees[author])
 assignee = getAssignee()

 msg.send "<@#{assignee}> you've been assigned to #{author}'s pull request."

This is the binding we make to listen to Github bot messages. We can’t use robot.listen since this only works for human messages. Since we are talking bot to bot, we have to utilize the SlackBotListener.

Now that we have our code in place, we can create a new Slack webhook. Inside your Slack settings click “Apps and Integrations” > “Manage” > “Incoming Webhooks”. Add a new one for the channel you want Hubot to post to and get the token. This will be needed when we start Hubot.

HUBOT_SLACK_EXIT_ON_DISCONNECT=true HUBOT_SLACK_TOKEN=xxxxxxx./bin/hubot --adapter slack & echo $! > /path/to/hubot/run.pid

This will install any dependencies and run Hubot. Substitute your token from earlier for the HUBOT_SLACK_TOKEN. The pid tracking is important if you are using Monitrc to monitor your process.

Happy ChatOps-ing!

 

Hardware, Linux, Open-source, Ruby, Software, Uncategorized

Delayed Job Performance Tuning

We found a bug. A bug that affected a lot of historical records that we now have the pleasure of reprocessing. Fortunately we already had an async job infrastructure in place with the delayed job gem. Unfortunately, this gem is intended for fairly small batches of records and isn’t tuned to handle 1M+ records in the delayed_jobs table.

After reading some best practices we decided on a queue based approach. To keep our day to day async jobs running we would use a “default” queue. And to reprocess our old records we used a new queue. Starting the workers up with a “–queue” flag did the trick. We had hardware dedicated for day-to-day operations, and new hardware dedicated to our new queue operations. Now it was simply a matter of filling up the queue with the jobs to reprocess the records.

Our initial approach maxed out the CPU on our database server. This was largely due to us not tuning our SQL in our async jobs. Because the volume we processed was always low, this was never really a noticeable problem. But when we threw lots of new jobs into the queues, it became very noticeable. The workers would start up, then mysteriously die. After some digging in /var/log/kern.log we discovered the workers were being killed due to an out of memory manager. Attaching a small swap partition helped, but once you’ve hit swap, things become horribly slow. Whats the point in keeping the worker alive if its running a thousand times slower? Clearly we needed to profile and tune. So we did. (The specifics are out of scope for this article, but it involved consolidating N+1 queries and limiting columns returned by the SELECT).

With our newly tuned SQL our spirits were high as we cranked up the workers again. Only to reach the next process bottleneck. And this is where databases gets frustrating. Delayed job workers run a query each time they are looking for a new job to find out which job to pick up. It puts a mutex lock on the record by setting locked_at and locked_by. The query looks like this:

UPDATE delayed_jobs
SET `delayed_jobs`.`locked_at` = '2016-06-05 11:48:28', 
 `delayed_jobs`.`locked_by` = 'delayed_job.2 host:ip-10-203-174-216 pid:3226' 
WHERE ((run_at <= '2016-06-05 11:48:28' 
AND (locked_at IS NULL OR locked_at < '2016-06-05 07:48:28') OR locked_by = 'delayed_job.2 host:ip-10-203-174-216 pid:3226') 
AND failed_at IS NULL) 
ORDER BY priority ASC, run_at ASC 
LIMIT 1;

The UPDATE does an ORDER which results in a filesort. Filesorts are typically something an index can resolve. So I optimistically added the following:

CREATE INDEX delayed_job_priority
ON delayed_jobs(priority,run_at);

Sadly, this index was completely ignored when I ran an EXPLAIN on my UPDATE. And the reason is that MySQL doesn’t execute an UPDATE query the same way as if you did a SELECT with the same conditions. The index probably made things worse, because now with each record update, we now also have an index update as well. I could probably fork the code and probably use some type of isolation level in a transaction to get the best of both worlds with an index based SELECT, and a quick UPDATE on a single record by id. But there are easier solutions to try first.

My UPDATE statements were pushing 40 seconds in some cases according to MySQL. Eventually the lock wait timeout is exceeded and you see an error in the delayed_jobs.log:

Error while reserving job: Mysql2::Error: Lock wait timeout exceeded; 
try restarting transaction

Jobs were moving very slowly, and throwing more workers at it didn’t make an improvement. This is because each time a worker picks up a job, it was waiting 40+ seconds. The UPDATE was doing a filesort, and any index was being ignored. (And MySQL doesn’t support UPDATE hints). It was pretty clear that all of the jobs from the reprocessing queue needed to find a new home that didn’t negatively impact my filesort. I settled on the following solution:

CREATE TABLE delayed_jobs_backup LIKE delayed_jobs;

INSERT INTO delayed_jobs_backup
SELECT * FROM delayed_jobs WHERE queue='new_queue';

DELETE FROM delayed_jobs WHERE queue='new_queue';

This creates a new database table with the structure of the existing delayed_jobs table. The table is then populated with the jobs that needed to find a new home (All 1M+ of them). And finally, deleted from the original delayed_jobs table. Be careful doing this, and do some SELECT/EXPLAIN queries in between to ensure you are doing what you think you are doing. (Deleting 1M+ records from a production database makes me sit up in my chair a bit).

Looking at MySQL’s process list I no longer have System locks on my UPDATE statements (presumably because the table size is small enough that the filesort is mostly painless):

mysql> SHOW FULL PROCESSLIST;
# Id, User, Host, db, Command, Time, State, Info
1, user, 0.0.0.0:34346, localhost, Query, 0, System lock, UPDATE ...

The important columns here are the Time (in seconds), State, and Info. This proves that my job locking was happening quickly. I was seeing Time values of 40+ seconds before. I kept referring back to this process list to verify that the UPDATES were remaining snappy while I modified the number of workers running, and the number of jobs in the queue. I had a goal of keeping the UPDATE system lock times under 2 seconds. Adding more workers pushed the times up. Adding more jobs to the queue pushed the times up. Its a balance that probably depends very much on what you are processing, how much your database can handle, and what your memory constraints are on your worker servers.

To conclude – my job over the next few days will be to run the following command to put some jobs into the delayed_jobs table 10,000 at a time:

INSERT INTO delayed_jobs
SELECT * FROM delayed_jobs_backup LIMIT 10000;

DELETE FROM delayed_jobs_backup LIMIT 10000;

You can of course automate this. But my objective was met. The jobs can reprocess old records without impacting day to day jobs in the default queue. When the delayed_jobs table is almost empty, I move over another batch of jobs from the delayed_jobs_backup table. Rinse and repeat until there are no more jobs left to process. Its a bit more painful, but day to day operations continue to function, and I can cross of the reprocessing task from my list of things to do. All without any code changes!

I’ve been reading up on transaction isolation levels thinking something like a SELECT FOR UPDATE lock might be worthy contribution to the delayed_job codebase: http://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html

Ruby, Uncategorized

Testing and External Calls

Why Are External Calls Bad?

When writing unit tests, the principle is to test each component in isolation from other components. If your tests fail because of concerns outside of your subject, then this is unintentional coupling. A call to an external network resource also (S3, Pusher, Facebook, Google, etc) introduces brittleness, and will slow down your test. Network requests are very expensive compared to memory and local disk access. These quickly add up and cause a slow running test suite.

VCR

Reference a Ruby gem in the testing Gemfile group of your Rails project for VCR. It is responsible for recording request/response traffic (complete with headers, params, etc) and serializing it as a testing fixture. This fixture can then be used by the test instead of making an external network call.
VCR documentation at: https://relishapp.com/vcr/vcr/docs

Webmock

External network calls via a number of avenues are disabled in testing via the Webmock gem. These methods include Net:HTTP, Curl::Easy, etc. If you can think it up, its probably blocked. This is intentional for the reasons listed earlier. Exceptions should not be made in tests.
Webmock documentation at: https://github.com/bblimke/webmock

How to Write a Test Using VCR

Take for example an Image class. Pretend it is responsible (in part) for uploading a file to Amazon’s S3 service. This is hosted on AWS and is an external network call. Currently, creating an image in a test will fail with an exception similar to the following:

it 'uploads an image' do
  FactoryGirl.create(:image) # this fails
end
VCR::Errors::UnhandledHTTPRequestError:
================================================================================
An HTTP request has been made that VCR does not know how to handle:
  PUT https://bwtesting.s3.amazonaws.com/uploads/image/file/1/logo.png

The error explains that there is currently no cassette to play. We can create a cassette to be used going forward by temporarily enabling external network requests:

# spec/vcr_helper.rb
require 'vcr'

VCR.configure do |config|
  config.cassette_library_dir = &quot;spec/fixtures&quot;
  config.hook_into :webmock
  config.allow_http_connections_when_no_cassette = true
  config.configure_rspec_metadata!
end

Uncomment the line

config.allow_http_connections_when_no_cassette = true

Now instruct your test to use VCR by adding an Rspec metadata tag of :vcr:

it 'uploads an image', :vcr do
  FactoryGirl.create(:image)
end

Now you can run your test again with a passing result:
1 example, 0 failures, 1 passed

You will also note a new file has been created under spec/fixtures. The specific path and file name are based on the location of your test. (Note if you rename your test, be sure to rename the fixture accordingly). Look at the fixture and you will see something like:

---
http_interactions:
...
- request:
    method: put
    uri: https://bwtesting.s3.amazonaws.com/uploads/image/file/1/logo.png
...

One fixture can contain multiple network call representations.
If we comment back out in spec/vcr_helper.rb we should now be able to run our spec without a network connection:

# spec/vcr_helper.rb
require 'vcr'

VCR.configure do |config|
  config.cassette_library_dir = &quot;spec/fixtures&quot;
  config.hook_into :webmock
  # config.allow_http_connections_when_no_cassette = true
  config.configure_rspec_metadata!
end

Matching a Cassette

How does VCR know which cassette recording to use when mocking an external network request? The default behavior is to match on method (GET, POST, PUT, etc) and the URI (the full URL of the resource)

You might have guessed that this can cause an issue with RESTful resources, or anything with a unique identifier in the URL. Take this example:

it 'uploads an image', :vcr do
  FactoryGirl.create(:image)
  FactoryGirl.create(:image)
end

We create two images. This will now cause our test to fail with an error from VCR:

VCR::Errors::UnhandledHTTPRequestError:
================================================================================
An HTTP request has been made that VCR does not know how to handle:
  PUT https://bwtesting.s3.amazonaws.com/uploads/image/file/2/logo.png
VCR is currently using the following cassette:
  - spec/fixtures/Image/uploads_an_image.yml
  - :record =&gt; :once
  - :match_requests_on =&gt; [:method, :uri]

This failed because the URI we previously recorded was for
https://bwtesting.s3.amazonaws.com/uploads/image/file/1/logo.png
and now we have the URI
https://bwtesting.s3.amazonaws.com/uploads/image/file/2/logo.png

Changing Matching behaviors

VCR fortunately provides flexibility in how to match an HTTP request. As part of the metadata, you can specify what to match on. For our test, we don’t really care about the specific URL, just that we simulate a good response from Amazon. This means its sufficient to match on method and host. In your test you can specify by doing:

it 'uploads an image' do
  VCR.use_cassette('spec/fixtures/Image/uploads_an_image', match_requests_on: [:method, :host]) do
    FactoryGirl.create(:image)
    FactoryGirl.create(:image)
  end
end

Note that :vcr is gone from our Rspec metadata. We have wrapped the code responsible for issuing the external network call in an explicit VCR block that allows us to match on different criteria of the HTTP request. This test should now pass.

Custom Matching

Maybe the built in match_requests_on attributes aren’t fine grained enough for your test. In that case, you can write your own matcher. See the example below for how to match on part of the path of a URI:

it 'uploads an image' do
  VCR.use_cassette('spec/fixtures/Image/uploads_an_image', match_requests_on: [:method, :host, method(:s3_matcher)]) do
    FactoryGirl.create(:image)
    FactoryGirl.create(:image)
  end
end

def s3_matcher(request_1, request_2)
  !!URI(request_1.uri).path[/\/uploads\/image\/file\/\d+/]
end

I’ve added a new method into the array of values passed to
match_requests_on
This method takes two arguments (it iterates over each request in the fixture and compares it to this specific request until it finds a match). The return value of this method must be a boolean. If its true, then the match is made against one of the requests. If the match is false, it tries the next HTTP request until it runs out of options, and fails.