Computers, Games, Linux, Software

Gaming On Linux – The Metagame

I’ve always been a big PC gamer. But my latest hobby combines that gaming passion with Linux. I’ve been watching Linux gaming from the sidelines for years, but my assessment was always that Windows was for gaming, and Linux was for Getting Shit Done. The two worlds didn’t overlap, and I was resigned to dual booting, or using two entirely separate machines.

This has all changed in a relatively short period of time, and the catalysts are the DXVK project and Proton, a compatibility layer for running Windows games on Linux. Proton is built on a fork of WINE but really focuses on gaming compatibility. A really cool trick of DXVK is converting DirectX calls to Vulkan (DirectX to Vulkan) – a lower-level graphics API. This allows games to run very close to their Windows performance (and on some titles, like RDR2, even slightly better).

In addition to Proton (with DXVK), two other initiatives have really propelled Linux gaming into what I would call the realm of approachability: a native Steam client, and the Lutris project.

Steam

Steam, in case you have been living under a rock, has a huge market share and influence on how games are distributed. There have been many imitators, but Steam still does it best – at least as far as Linux is concerned. A few titles have native Linux clients, but most don’t. What Steam on Linux does is seamlessly integrate its compatibility tool, Proton, with its UI. You can open a game’s properties and check “Force the use of a specific Steam Play Compatibility Tool”. (Incidentally, Deus Ex: Mankind Divided, Pillars of Eternity, and Age of Wonders III are all Linux native!)

Lutris

Lutris takes a more inclusive approach. While it can leverage DXVK WINE builds (and even Proton builds), it lets you access games from all of the other game clients, including the Epic Games Launcher, Uplay, and Origin. The software operates as a database of user-maintained scripts that configure WINE environments for maximum compatibility. It creates a bottle per game, and when you uninstall, it removes the entire bottle. No leftover DLLs or files – they exist in complete encapsulation. Even better, because they are all encapsulated you can run one game as Windows XP and another as Windows 10 – each one contains its own environment. Below you can see that I have Far Cry 5 (Uplay), Titanfall 2 (Origin) and Control (Epic Games Store), alongside standalone games (Return of the Obra Dinn is a standalone install).

I’ve been so thrilled with Lutris that I support them on Patreon.

Gaming Considerations

Caveat emptor! With all things Linux you need to be prepared to tinker. If you want to crack open a beer, boot a game, and jump right in then you’d best look elsewhere. That being said, I’d take a wild-ass guess and say 80% just work, 10% work with minor tinkering, and the remainder either don’t work at all, or don’t work well enough to be enjoyable because of performance issues (looking at you, Control and Anno 1800). Most games run smooth as butter and it’s easy to forget they’re not running on Windows.

The biggest performance issue seems to be what I call microstutter. It is a jarring experience where ~5 or so frames drop, usually while assets are loading. I can’t be sure it isn’t CPU overhead from my encrypted partition. Other people talk about caching shaders – this uses a bit of your disk to store precompiled shaders so they are ready to use instead of being translated in real time.

Before purchasing a game I typically check these three things:

  • Is there a native Linux client? (The answer is usually no, but I’ve been surprised.)

  • Are there compatibility issues listed on ProtonDB?

  • (Non-Steam titles only) Is there an installer on Lutris?

Another big consideration is that aside from Steam, most of the gaming clients are bad. I don’t think this is a problem specific to Linux – they are just bad. Often, to fix a crash or a performance problem you will need to disable all of the UI overlay garbage that these launchers try to throw on top of your game. Often they tell you to put the client into offline mode. This likely means that multiplayer support is going to be limited on these titles. If you need multiplayer, Steam is probably your best bet: Linux is officially supported, the build is very stable, and I’ve played many hours of multiplayer games (Divinity 2, Tabletop Simulator, and others).

Utilities

A few other thoughts and tools to get you started – MangoHUD is a great FPS overlay. It not only shows FPS, but also frametimes and CPU/GPU load.

Gamemode is a utility that is supposed to improve game performance. It changes the nice/ionice levels of your games and manages your CPU governor (if on a laptop) to maximize performance. I’m not sure how much of a difference it makes – and these are changes you can make yourself – but this tool does it automatically when a game is launched. Lutris and a few titles natively look for this executable and use it if available.

Of course you’ll want the latest nVidia or AMD GPU drivers. There are some useful tools in the nVidia control panel, like forcing a composition pipeline, which is a rather interesting way to deal with frame tearing without relying on V-sync.

Conclusions

Gaming on Linux has been something I never thought I’d see happen. I’m sure that cloud gaming (aka game streaming) has been propelling development in this area. After all, companies like Valve aren’t working on these projects out of the goodness of their hearts. And if you can get a Windows game to run in a DXVK environment you are making it cloud ready and can scale up Linux servers better than you probably could on Windows servers. That is my theory anyway.

It is also wonderful not having to switch operating systems when I want to work or game. I can do both. In fact – my workspace 3 is now the gaming workspace. I’m back to one machine, and one environment that is comprehensive for my needs.

If you’ve been on the sidelines looking on, now is a great time to dive in. With each release of Proton and Lutris, support gets better and better. And you are helping with the chicken-and-egg problem – developers are more likely to support Linux if there are more Linux gamers. So try the metagame of gaming on Linux – it’s quite rewarding!

Computers, Open-source, Ruby, Software

Fetching CircleCI Artifacts

Do you use CircleCI for your continuous integration workflows? Do you use their scheduled jobs? Recently we had a need to retrieve some performance benchmarks from a Lighthouse service that records page load times to a text file. This job is scheduled to run daily.

Unfortunately it can be difficult to find a build in the CircleCI UI for a given project, since there is no search and only 20 builds are shown at a time, ordered by most recently run. Fortunately CircleCI has an API that lets us automate the task of finding a build and viewing the artifacts from the run: https://circleci.com/docs/api/#recent-builds-across-all-projects

We can fetch up to 100 builds for a given project at a time. Some scripting allows us to narrow the results down to just the build types we are interested in. From here we now have the build number from the most recent build of a given type. In our case the type is the page load times from Lighthouse.

Once we have found a specific build for a given project, we can use the API again to ask about its artifacts: https://circleci.com/docs/api/#artifacts-of-a-build . This allows us to get the container information and paths of any artifacts produced by the job, including our page load times.

We now have the URL for a given artifact, and it is just a matter of downloading the file by appending the CircleCI token to the URL: https://circleci.com/docs/api/#download-an-artifact-file

We now have the output from our artifact. From here we can put this information into our company Slack, or even push it to a collaborative spreadsheet that the team routinely reviews. The specifics of how to automate this script, and what to do with its output, are outside the scope of this post, but I will share our Ruby script for interacting with CircleCI. It should be easily adaptable to other languages. It can be viewed below:

# Finds a CircleCI job of a given name, and retrieves an artifact from a given CircleCI project build
# Usage:
# $ API_TOKEN=xxx GITHUB_USERNAME=bsimpson GITHUB_PROJECT=some-project TARGET_JOB=lighthouse ARTIFACT=averages_pageload ruby ./circleci.rb

require "net/http"
require "json"

API_TOKEN = ENV["API_TOKEN"]
LIMIT = 100
GITHUB_USERNAME = ENV["GITHUB_USERNAME"]
GITHUB_PROJECT = ENV["GITHUB_PROJECT"]
TARGET_JOB = ENV["TARGET_JOB"]
ARTIFACT = ENV["ARTIFACT"]

# Recent builds for a single project
# curl https://circleci.com/api/v1.1/project/:vcs-type/:username/:project?circle-token=:token&limit=20&offset=5&filter=completed
def find_job(job=TARGET_JOB)
  offset = 0
  while offset < LIMIT * 10 do
    url = "https://circleci.com/api/v1.1/project/github/%{github_username}/%{github_project}?circle-token=%{token}&limit=%{limit}&offset=%{offset}"
    uri = URI.parse url % {
      github_username: GITHUB_USERNAME,
      github_project: GITHUB_PROJECT,
      token: API_TOKEN,
      limit: LIMIT,
      offset: offset
    }
    response = Net::HTTP.get(uri)
    jobs = JSON.parse(response)
    matching_job = jobs.detect { |job| job["build_parameters"]["CIRCLE_JOB"].match(TARGET_JOB) }
    if matching_job
      return matching_job
    end
    puts "Trying offset #{offset}…"
    offset += LIMIT
  end
  puts "Exhausted pages"
end

# Return artifacts of a build
# curl https://circleci.com/api/v1.1/project/:vcs-type/:username/:project/:build_num/artifacts?circle-token=:token
def find_artifacts(job, artifact=ARTIFACT)
  build_num = job["build_num"]
  url = "https://circleci.com/api/v1.1/project/github/%{github_username}/%{github_project}/%{build_num}/artifacts?circle-token=%{token}"
  uri = URI.parse url % {
    github_username: GITHUB_USERNAME,
    github_project: GITHUB_PROJECT,
    build_num: build_num,
    token: API_TOKEN
  }
  response = Net::HTTP.get(uri)
  artifacts = JSON.parse(response)
  matching_artifact = artifacts.detect { |artifact| artifact["path"].match(ARTIFACT) }
  return matching_artifact
end

# Download an artifact
# https://132-55688803-gh.circle-artifacts.com/0//tmp/circle-artifacts.7wgAaIU/file.txt?circle-token=:token
def download_artifact(artifact)
  url = "#{artifact["url"]}?circle-token=%{token}"
  uri = URI.parse url % {
    token: API_TOKEN
  }
  response = Net::HTTP.get(uri)
  puts response
  return response
end

job = find_job
artifact = find_artifacts(job)
download_artifact(artifact)


Computers, Hardware

Move over Router – Mesh is Here

It’s been a while since I’ve had my quality of life dramatically improved by a device upgrade. I recently moved my home office upstairs, and with it came a shuffling around of wireless equipment from the first floor to the second. The new office space is on the opposite side and floor from the living room with our smart TV. No matter how I positioned the wifi router, one side suffered. And putting the router in the middle meant it would be in the kids’ rooms.

Adding a wireless range extender practically made the problem worse as the devices tried to connect to the wrong network for their location, and the speeds while connected to the range extender were terrible.

Fed up, I started doing some research into routers with the furthest range, highest speeds, etc. That is when I came across a new category of “mesh” networks. These devices offer multiple access points called nodes that promise to seamlessly shuffle clients around based on the optimal connection path. After some research I decided on the TP-Link Deco M4 3-pack. I had a promo code and the total price came out to ~$150 shipped.

After using it for a few weeks, I’m ready to review. Spoiler alert – I’m bursting with happiness. I’ll address a few main categories of these devices:

Range

I have a 2,500 sq ft house on 2 floors plus a deck. The 3 nodes cover this easily. Nowhere in the house or yard do I have less than 3 bars. I have configured these in a triangular arrangement, with two nodes on opposite sides of the house on the 2nd floor (as diagonal as I could get them). The other node is on the 1st floor, halfway between the nodes on the top floor.

I haven’t tried a two node setup, which might be more representative of what the old router + range extender were delivering, but why would I? The whole point of a mesh network is that you can keep adding nodes until you have great coverage.

As an experiment I walked around the cul-de-sac and stayed connected an impressive way out. Whatever is inside these nodes (or maybe it is them working in aggregate) has great transmitting power, all without looking garish with external antennas everywhere.

Speed

On the TP-Link Deco M4 network, I get 300+Mbps anywhere inside the house. Outside on the deck this drops to 200Mbps. For comparison, with the old ASUS RT-AC68U + Linksys RE6500 range extender I would get ~200Mbps in the same room as the router. The range extender never got above 100Mbps, and the deck (when I could get a signal) would be around 20Mbps. The mesh network link speed blows away the traditional router + extender setup.

One more technical note here – the nodes are tri-band which means that you get the full bandwidth through each node instead of it being halved.

Setup

The TP-Link (and many of the other commercial mesh kits) comes with a smartphone app to set up the devices. I was initially turned off by this. After all – everything today claims it needs you to install an app when a basic mobile website is probably sufficient.

The app, however, is clean, and it aided in setup versus the traditional laptop approach of potentially having to plug in to the router with an ethernet cable to initially configure the network.

The nodes are all identical, so it doesn’t matter which one you connect to the modem. It correctly figured out that was the Internet connection, and even circumvented the silly tendency for modems to only bind to one MAC address. The physical setup involves nothing more than plugging each node into an AC outlet, and plugging the initial node into the modem. The app detects the node you are next to, and walks you through setting up the wireless network.

Flashing lights on each of the nodes informs you if they are online and working properly or experiencing an issue.

The nodes all share the same wifi network name, and devices switch between them automatically. Setup options are pretty standard (maybe even somewhat limited). You choose an SSID, create a password, and choose whether to broadcast the network. You don’t even pick between 2.4GHz and 5GHz networks – this is all managed for you. The device will use the best network available. My old laptop can’t see 5GHz networks and connected just fine. The Deco offers a few other features like QoS settings, reserved IP addresses, blacklisting, reports, etc.

Price

This looks to have been a historical weak point for mesh networks. New technologies typically come with premium price tags. I think enough time has passed that mesh network kits are about on par with a new router and range extender. I paid $140, having caught a sale that took $40 off the price.

Conclusion

I would absolutely recommend a mesh network to just about anyone, possibly with the exception of someone that has advanced needs for their network setup. This feels like an evolution of the wireless router. It offers superior range and speeds relative to my previous router + range extender setup for about the same price. Setup is painless, and this has fixed all of my wireless issues throughout the entire house. I’ve retired my router, and range extender.

I’ve also retired my USB wireless adapter for my desktop, since I have a mesh node sitting on the desk, and have opted instead to connect with an ethernet cable. I’ve also managed to retire a wifi bridge for a NAS device, because again, with 3 nodes I can easily place the NAS next to a node and connect with an ethernet cable.

All said and done, I threw out more equipment than I set up. This was an absolutely great decision in my opinion, and at the risk of sounding like a sponsored post – I couldn’t be happier.

Open-source, Software

Updating database rows with a default position

I was recently tasked with making a table in a MySQL database sortable by the user. There are a few steps involved to take care of new records, while providing existing records with a default value.

The new records would be created with an ifnull(max(`order`), 0) + 1 to assign them the next position. This was sufficient since I was positioning based on created_at, and a new record was guaranteed to have the highest created_at timestamp.
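As a rough sketch of that (the table and column names match the example below; the user_id value and per-user scoping are illustrative), the insert can compute the position inline:

insert into table1 (user_id, created_at, `order`)
select 42, now(), ifnull(max(`order`), 0) + 1
from table1
where user_id = 42;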

For the existing records, I wanted to:

  • group by user_id
  • set the order (int) column to an incrementing integer
  • with an initial ordering based on a created_at timestamp

The first approach was to select groups of users, and for each group sort the records then update each record. This approach would work, but it wouldn’t have good performance. I really wanted to do this in a single update statement.

I ended up going with an approach based on this article:

set @user_id = 0, @order = 1

update table1 as dest,
(
  select x.id, x.user_id, x.created_at,
  @order := if(@user_id = x.user_id, @order + 1, 1) as `order`,
  @user_id := x.user_id as dummy
  from (
    select id, user_id, created_at
    from table1
    order by user_id, created_at
  ) as x
) as src
set dest.`order` = src.`order`
where dest.id = src.id;

Let’s break this down:

  1. We are using database session variables to keep track of some state. We default to @user_id = 0 and @order = 1
  2. Working from the inner-most select, we select the id (primary key), user_id (int) and created_at (timestamp), as these are the columns needed to do our calculations for the new order value. We order these by user_id and created_at to get the sort ready. This dataset is aliased as x (yes, I know I’m terrible at naming)
  3. The next select up pulls the same id, user_id and created_at from the inner-most select and does some calculations, persisting the state in the session variables we set up in step 1. The actual work here is:
    1. Calculating the order: @order := if(@user_id = x.user_id, @order + 1, 1). The := is an assignment in SQL. We set the @order to either @order + 1, or reset it to 1.
    2. Comparing the user to determine what to set the @order to. This is done in the conditional expression if(@user_id = x.user_id). Because @user_id is set to the x.user_id value on each row, the comparison returns true until we get to the next user, at which point the @order resets to 1.
  4. Finally, the outer update takes the resulting src dataset and applies the updates. We constrain to where dest.id = src.id to get the correct record, and then we set dest.order = src.order, which is our calculated value.

Performance is looking great. On a dataset of 300k records, this runs in 3 seconds on a 4 CPU Docker container.

Bonus: If you have additional criteria (only sorting favorite records, for example) this can easily be applied to the inner-most SELECT in the where clause!
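Something like this, assuming a hypothetical boolean favorite column – it is the same statement as above with one extra condition in the inner-most select:

set @user_id = 0, @order = 1;

update table1 as dest,
(
  select x.id, x.user_id, x.created_at,
  @order := if(@user_id = x.user_id, @order + 1, 1) as `order`,
  @user_id := x.user_id as dummy
  from (
    select id, user_id, created_at
    from table1
    where favorite = 1
    order by user_id, created_at
  ) as x
) as src
set dest.`order` = src.`order`
where dest.id = src.id;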

Happy ordering!

Computers, Open-source, Ruby, Software, Thoughts, Web

new Job().beginTraining()

It has been an exciting month for me. I put in my notice with Influence Health at the beginning of June, and served through the end of the month. After that I took 2 weeks off before my training for Doximity begins on July 16th. The training is in San Francisco, and is immediately followed by a July 30th co-location in Boulder, Colorado. This position is remote, aside from the several co-locations done each year. I am excited to start with a new team, on a new project, with a mix of old and new technologies.

Over my career at Influence Health I don’t feel that I got much deeper in my knowledge of Rails. What I feel I gained instead was a breadth of general programming skills. I configured Github repos, set up Jenkins, scripted a Hubot instance to assign pull requests, and made a continuous integration system. I created new policies following best practices: open a pull request with a change, write unit tests, and have one other person on the team review your changes before merging. I implemented linting, and worked with one of my coworkers to bring Webpack into Rails to rethink how we manage Javascript. I also went very deep into AWS services, touching S3, Lambda, RDS, Redshift, Data Pipelines, Batch, Step, ELB, EC2, ECS, Cloudfront, and into other technologies like PostgreSQL, Docker, ElasticSearch, Capistrano, EventMachine, and Daemons. Being exposed to all of these new services has made me consider different approaches to traditional problems, and I feel it has made me a better developer.

The new job at Doximity sheds the managerial role that I was voluntold into at Influence Health. I thought I might have enjoyed it (maybe I would have under better circumstances). At the end of the day it wasn’t being a manager that killed the deal for me. It was being a manager, while still being a tech lead, while being an architect, a core contributor, and many other things. To manage well is a full time job. Tacking it onto an existing role made me feel inadequate as a manager, and I don’t like that feeling. So with the managerial role off my title, the new role is back to software developer, and I’m ok with that. The compensation is right, and I felt like I was getting further away from the code than I wanted to be. At the end of the day, developing something and seeing it work is what drives me. There is a technical lead track that I might pursue in several months if I feel like I am ready.

The technology stack is a mixture of Ruby and Javascript. After working with Javascript more heavily the last 6 months I have mixed feelings. I’m definitely excited, because I do think the future of web development has coalesced on Javascript. And Javascript has risen to the challenge and gotten a lot better. Gone are the imposter “rich internet applications” like Silverlight and Flex. Gone are the browser plugins for languages like Java and Flash. Javascript just works. And the browsers are really blazing trails, even Microsoft, so I believe that learning Javascript is a solid career investment. There is an excitement in the ecosystem (a little too excited imo, but I’ll take that over being dead).

Popularity aside, Javascript has less magic than Ruby, which is, again, both good and bad. I appreciate seeing require statements, and knowing with absolute certainty that private methods are private. In Ruby, for everything you can do to protect something, someone else can (and I find frequently does) find a way to circumvent it. I especially appreciate the strong linting culture that mitigates entire debates on code style.

I find the syntax of Javascript to be unattractive coming from Ruby, but it is more consistent. All of the parentheses, semicolons, etc. are just noisy. The surface area of Javascript is also much smaller, which leads everyone to jump to utility libraries like Lodash, Underscore, etc. The language just needs to keep maturing and building in common methods. Date manipulation in particular is atrocious. With async/await it seems like we finally have a clean syntax for managing asynchronous code.

I do still feel like we are fitting a square peg into a round hole by having client side, single page applications. This isn’t the way the web was designed (it was made for request/response), and the fat client pattern still feels immature. Having a client side framework like Angular (or even a library like React) does take care of managing some of the complexities. GraphQL takes the sting out of fetching all the combinations of data you might want from the server. Auth is taken care of with JWT and services like Auth0.

On the server side, using Node has been a mixed bag. I would like to see a few big frameworks that the community standardizes on; instead, the mentality seems to be to build your own framework from a collection of your favorite libraries. As a result you can’t just jump into a project and immediately know where things are. I do however really enjoy asynchronous execution of code. It is a little harder to write and understand, but wow can it be fast. I have already seen very positive results from batch jobs that took hours in their Ruby implementations converted to Javascript and taking minutes. You simply don’t wait around, and it makes you think of your dependency tree in a completely new way.

At the end of the day I am excited but a little cautious. Ruby + Javascript sounds like a killer combination if you use each tool for what it does best. I don’t see the Ruby community lasting another decade so this is the perfect transition point to jump into Javascript. And I’m glad that it was Javascript that won out over Flex, Silverlight, JSP, etc. At least for the next 5 years until the new shiny technology comes out and people jump ship.

Ruby, Software

Refactoring With the Observer Pattern

What is the Observer Pattern?

Have you written a system that has the concept of “events” that need to notify users? Perhaps you send an email on a system error to an administrator. Or send a “password changed” event notification to a user? In a CRM perhaps clients can subscribe to new customers as they come into the system as the result of marketing efforts?

This can be represented using the Observer pattern. Your users have subscribed to one or more events, and they are “observing” new events. The events themselves are “observable” via these event subscriptions.

This blog post will walk through refactoring an existing event notification system that did not use the observer pattern, and instead relied on multiple passes to calculate an intersection of users and events, run periodically via cron.

Old pattern

We have a CRM that is written in Ruby on Rails. When a database record is generated for something we want to create an event for, we use an ActiveRecord callback to create a second “event” record after save. This event record is associated with the source record and includes some contextual information like a type (“NewLead”, or “SystemError”), and whether or not it has been processed (defaulted to false). Event also includes some hierarchical attributes so we can enforce authorizations in a multi-tenant application.

Once the event records are created, we had a cron entry that would fire every minute and check for new events. This would do a database query looking up unswept events, and looking up all active subscriptions. Doing some intersection logic, we would then generate an “event subscription” record for each event and each subscriber in the intersection. This did not scale particularly well, because a generic event could potentially go to hundreds of subscribers, which would mean a database table entry per subscriber. This information is also not valuable after the notification is sent.

Once the events are swept and “event notification” records are generated, we could then send out an email. For usability we have a digest feature where emails are not sent immediately when events come in. Rather, we collect events for a period of time, then send a summary of what has occurred since the last email was sent.

This system did not scale due to the volume of events and subscriptions. We had issues where long inserts and large datasets caused deadlocks in the database and out of memory errors on the server. The goals were to keep the events and the subscriptions, but optimize the workflow so that we could scale further, and have less transient data stored in the database. It goes without saying, but the replacement needed to be robust, easy to reason about, and have good test coverage as well.

New pattern

The Observer pattern I mentioned a few minutes ago was the right fit for this particular problem. Instead of a cron initiated loop with lots of state being persisted we would move to an in-memory event loop and stream the events to the subscriptions.

One wrinkle with the observer pattern was the email digest requirement. Observers are great at taking events and then processing them immediately. Building up a collection required some investigation, but using EventMachine and a periodic timer I was able to preserve this functionality, as we will see.

The event system

There are two primary areas – the observable, and the observer. In Ruby, we can compose the Observable pattern by including the module into our class. This sets up a pseudo interface in Ruby that we can implement in our observable and observers. Of course Ruby doesn’t have true interfaces so this is more of a guideline than a compile time check.

The observable class:

require 'observer'
module Events
  class Observable
    include ::Observable
    attr_accessor :last_tick_at

    def initialize
      self.last_tick_at = Time.zone.now
      observers.each { |observer| observer.register(self) }
    end

    def run!
      changed_subscriptions.each do |subscription|
        subscription.reregister(self)
      end

      new_events do |event|
        changed
        notify_observers(event)
      end
    end

    private

    def new_events(&block)
      Event.yield_and_sweep(&block)
    end

    def changed_subscriptions
      Subscription.where("updated_at >= ?", last_tick_at).tap do
        self.last_tick_at = Time.zone.now
      end
    end

    def observers
      Subscription.only_active.not_paused
    end
  end
end

The observable class will be instantiated by our task runner. During initialization it grabs the current time, and registers its observers (subscriptions). Its run! method will be invoked periodically and is responsible for finding and sweeping new events, and sending them out to the observers. It also checks if any subscriptions have been added, changed, or deleted via the updated_at timestamp, which is compared against the last run. These subscription observers have to re-register with the observable to reflect their changes.

Next up is the runner:

require 'eventmachine'

module Events
  class Runner
    # The logger is passed in by the daemon script shown further down
    def initialize(logger: nil)
      @logger = logger
    end

    def start
      EM.run do
        observable = Events::Observable.new
        run_timer = EventMachine::PeriodicTimer.new(0) { observable.run! }
        run_timer.interval = 1.minute.to_i
      end
    end
  end
end

This runner instantiates the observable class and starts an EventMachine loop. Within this loop we add a periodic timer to run the observable every minute. As we saw in the observable, the run! method will re-register any changed observers, and send them new events.

Finally, we have the observer class:

module Events
  module Observer
    attr_accessor :observed_events, :timer

    def register(observable)
      observable.add_observer(self)
      self.observed_events ||= []
      self.timer = EventMachine::PeriodicTimer.new(self.wait_interval) { process_events }
    end

    def unregister(observable)
      observable.delete_observer(self)
      self.timer.try(:cancel)
    end

    def reregister(observable)
      unregister(observable)
      register(observable)
    end

    def update(observed_event)
      return unless authorized_for_context(observed_event)
      self.observed_events.push(observed_event)
    end

    def process_events
      return unless observed_events.present?
      EventMessageMailer.delay(queue: :email).event_message(observed_events.map(&:id), self.user)
      observed_events.clear
    end

    private

    def authorized_for_context(event)
      return false unless self.event_type.blank?            || self.event_type == event.type
      return false unless self.agency_id.blank?             || self.agency_id == event.agency_id
      return false unless self.client_id.blank?             || self.client_id == event.client_id
      return false unless self.business_unit_id.blank?      || self.business_unit_id == event.business_unit_id
      true
    end
  end
end

This is a module that can be mixed in to an ActiveRecord “Subscription” model. There are some assumptions that the model will have certain attributes pertaining to the hierarchy that the subscription is for (agency_id, client_id, business_unit_id) and an attribute for what type the subscription is. In order for an event to be collected, the event that is received must pass the authorization checks, and must be for the appropriate event type.

Of note is the wait_interval. This is the logic around the digest functionality. Each subscription has an attribute for how long to wait – a value could be 5 minutes, 1 hour, 1 day, etc. The process_events call will be fired when the PeriodicTimer fires on the specified interval. This clears out events once they are processed.

Let’s see a diagram of how all this will work:

[Diagram: event flow from the observable to the subscription observers]

Not shown is the EventMessageMailer. This is just a standard ActionMailer mailer that takes an array of events and shows them as a list. These are then sent to the user’s email address via a background job.

Observable Benefits

There are a few benefits of using the observable pattern that we replaced our existing code with:

  • Efficiency – Subscriptions are not recalculated every time we sweep for events.
  • Storage size – Because the event runner continues to run in the foreground, we don’t have to worry about persisting a lot of in between state. The concept of event subscriptions lives in memory. Messages are constructed on demand and emailed asynchronously.
  • Codebase size – Almost all of the cron entries, task runners, sweepers, and calculations on the event and subscription intersections have been removed because they aren’t necessary. We now have zero cron entries, and one task runner that is responsible for picking up new events and sending to the observers.
  • Agreed upon pattern – While only a benefit if you know the pattern, there is a much greater chance someone will recognize an observer than a custom solution
  • Interface checking – Because we are using the Ruby Observable module, we get some safety in adding new observers, and when changing our observable. If the observer doesn’t implement the update method, an exception will be raised when registered with the observable.

Task runner

Now that we have implemented the observer pattern, we need some scripting to have this run in the foreground and recover from any crashes. For this I used the Daemons gem:

require 'daemons'
require File.expand_path('../../config/environment',  __FILE__)
require Rails.root.join('app/models/events/runner')

Daemons.run_proc('events', {
  dir:       Rails.root.join('tmp/pids/'),
  dir_mode:  :normal,
  log_dir:   Rails.root.join('log/'),
  backtrace: true,
  monitor:   true
}) do
  logger = Logger.new(Rails.root.join('log/events.log'))
  Events::Runner.new(logger: logger).start
end

The observable can now be instantiated with a script/events.rb start command. In the event of a crash, I get log output of the stacktrace, and an automatic restart.

What is next

Currently, there isn’t a mechanism to serialize events that are in memory for subscriptions. Because most subscriptions are minutely, the service rarely gets restarted, and the risk of not receiving a notification was low, this problem was left for later. Likely the approach would be trapping a SIGTERM signal, and persisting the unprocessed events for each observable in a way that is recoverable on restart. A text file is probably sufficient, as this data isn’t valuable long term.

Learn More

If you want to learn more about the observer pattern, I recommend Head First: Design Patterns. The content is good, but the 1990s stock photos will make you groan.

Additionally, SourceMaking has great information on many design patterns including the observer.

 

Apple, Computers, Hardware, Linux, Software, Windows

Using Synergy Over VPN

I’ve been watching a lot of Linus Tech Tips, and one of their sponsors sells a product called Synergy. More information is on their product page: https://symless.com/synergy . To summarize, this is a software KVM (no video, so I guess it is “KM”?) solution. The use case was perfect for me. On my desk I have an Apple work laptop, a Windows desktop, and a Linux personal laptop. Anyone that has done extensive work on a laptop keyboard and touchpad knows that it isn’t optimal. I didn’t want multiple keyboards and mice on top of my desk because of the clutter. I dropped some money for Synergy, and it just works!


That is, until I had to connect to our company VPN the next week. They use a full tunneling solution. When I connect, I lose everything. I can’t print, I can’t access my NAS, but most importantly I can’t access my keyboard and mouse. (The video is fine because it is hard wired to an external monitor.) What to do?

SSH to the rescue! What is SSH? It is a protocol that allows one computer to securely interface with another computer – Secure SHell. However, we will just be using it for port forwarding, and not for an interactive session. The goal is to take the OS X machine (Synergy client), and SSH into the Windows server (Synergy server). Using this SSH connection, we can forward ports within it – a tunnel running inside the SSH connection. This will expose local port 24800 on OS X that actually points to port 24800 on the remote server. This is the port that Synergy uses for its connections.

You will need a few tools, and a little patience. Having just gone through this, I’m sharing for posterity, or maybe for anyone that has thrown in the towel with how crippled VPN makes accessing home devices.

I have the following Synergy setup:

  • Windows 10 Synergy server (keyboard and mouse are physically connected to the desktop)
  • OS X Synergy Client
  • Linux Synergy Client
  • Router with a local area network all these devices share
  • Admin access to the router for port forwarding
  • Autossh package for OS X (available via brew)

First step, get Windows 10 up to speed with SSH. How this isn’t built in as a service in the year 2017 I have no idea. Grab the OpenSSH server package for Windows from https://www.mls-software.com/opensshd.html . After downloading, extract and run the setup file. This will create a new Windows service for OpenSSH that will run on port 22. It prompts you to generate an SSH key for the server.

Once this server is running, you will need to add your user to the list of SSH users. Open up PowerShell as an administrator and change into the C:\Program Files\OpenSSH\bin directory. Run the following commands:

mkgroup -l >> ..\etc\group
mkpasswd -l >> ..\etc\passwd

Try and connect to your SSH server from the OS X client:

ssh <user>@<server IP> # e.g. ssh Ben@192.168.1.95

You should be prompted for your Windows password. Once you can successfully log in to the server, we can set up public key authentication. This removes the need for you to type in your password because you identify yourself with an SSH public key. From your OS X machine get your public key:

cat ~/.ssh/id_rsa.pub

Put the contents of this file on your SSH server in the file C:\Program Files\OpenSSH\home\<user>\.ssh\authorized_keys (the home directory here is actually a symlink to C:\Users\<user>). If the .ssh directory doesn’t exist, you will need to create it first. Now we need to configure the server to allow public key authentication. Edit the C:\Program Files\OpenSSH\etc\sshd_config file and change the following lines:

StrictModes no
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys

Restart the OpenSSH server for the changes to take effect:

net stop opensshd
net start opensshd

You should now be able to SSH into the server same as before but without being prompted for a password.

Now we are ready to create an SSH tunnel. Before we incorporate AutoSSH (which handles retries and monitoring) we will do a naive attempt to SSH. In the following command:

  • -f backgrounds the process
  • -L does port tunneling in the format of <local port>:<remote host>:<remote port>
  • -N do not run a command – just tunnel the port
ssh -f <user>@<remote public IP> -L 24800:<remote public IP>:24800 -N

If this works, you should see a [LISTEN] entry for port 24800 when you list open files:

lsof -n -i | grep 24800

You may need to set your server as the DMZ on your network. Or, to be safer, you can simply set up port forwarding. We will need port 22 and port 24800 to resolve to the Windows server. The instructions for how to do this on a router vary widely by vendor. Typically it is under a WAN section. It usually prompts for a port, a destination IP, a destination port, and a protocol. You want ports 22 and 24800 to route to your server IP for TCP and UDP.

Configure your Synergy client to use localhost instead of the remote IP. You should now be able to operate your client from the server’s peripherals via Synergy.

Everything works great until the VPN connection is made, because the SSH connection is severed. In order to recover automatically, I have added autossh to persist this tunnel. On the OS X client, instead of running SSH, do the following:

AUTOSSH_POLL=10 autossh -M 20000 -f -N <user>@<remote public IP> -L 24800:<remote public IP>:24800

Now when a VPN connection is made, or a disconnection happens, the autossh package will detect that it is no longer alive and retry. Because Synergy’s software also retries, after a few seconds your connectivity should begin working again.

Thanks to Synergy for making a solid product, and for having first class Linux support.

Computers, Software

Moving Data in AWS

I have been a long time user of Amazon Web Services, but I only recently started using the Data Pipeline service they offer to handle ETL needs. The service provides a cloud ready, low cost, turn key (in some cases) solution to moving data between your services. I had a difficult time getting up and running, partly due to the lack of discussion online about this service, so I want to share my experience, offer some best practices, and walk through how I developed our pipelines.

Anyone with an AWS account can use Data Pipelines. But be careful: there is no BAA if you are in the healthcare industry and are passing around PHI data. Fortunately, our needs do not yet require us to move around PHI information.

The first step in my ETL journey was formalizing what data needed to go where. Before I even opened up the Data Pipeline service, I had to understand our specific needs. I identified two use cases.

Archiving Old Data

RDS instances have a max allowable size for the database and ours was getting full. The approach I took was to look at our largest tables, and focus on reducing those first. I ran some queries to understand what is using the most space: http://stackoverflow.com/a/9620273/802407
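The linked answer boils down to querying information_schema for data and index sizes; a version of it looks roughly like this:

select table_schema, table_name,
  round((data_length + index_length) / 1024 / 1024, 2) as size_mb
from information_schema.tables
order by (data_length + index_length) desc
limit 20;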

Once I had a list of my largest tables, I could classify them and assign retention rules. Some tables I decided to leave in place, and others I decided were transient data, and could be archived. (HIPAA mandates a 7 year data retention policy, so no luck just deleting). We decided as a business that different classifications could live within our application for differing time frames. Once timeframes were established, I could then write a data pipeline, and move any data older than our cut off date for that table to a storage solution outside of the database. We chose to house MySQL backups on S3 in encrypted buckets.

Fortunately the Data Pipeline service provides a number of templates to help you get started. Navigate to https://console.aws.amazon.com/datapipeline . I found the templates a good starting point, but there are some frustrations that I will mention below in the “Quirks” section. Click “Create Pipeline”. I used the template “Full Copy of RDS MySQL table to S3”. I filled in the parameters, and edited the pipeline in “Edit in Architect” mode.

Since I wanted to archive old data, I modified the SqlDataNode’s Select Query to select only records older than my retention policy:

select * from #{table} 
where #{myRDSTableLastModifiedCol} <= 
date_sub('#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' , 
  interval 6 month)

This will select only records older than 6 months before the pipeline’s scheduled start time. The template then moves these to S3. There are two parameters defined in the parameters section: “#{myRDSTableName}” and “#{myRDSTableLastModifiedCol}”. I supplied my table name, and the updated_at datetime column for my records.

I added a new SqlActivity named “Delete old records”, dependent on the CopyActivity. Once the records move to S3, I want to delete them from the database table. This activity has “Depends on: RDSToS3CopyActivity” set, so if saving to S3 fails, the records are left untouched. I added the following script to mirror my select query above, but deleting the records instead:

delete from #{myRDSTableName} 
where #{myRDSTableLastModifiedCol} <= 
date_sub('#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' , 
  interval 6 month)

I would recommend doing this on a test database first before you delete production records while testing your script. Because the timestamp is the same as from the select, this will be the same record set, provided you have an updated_at column that shows when a record was last touched.

Expose Data for Reporting

My other use case was getting data to the reporting server which is in Redshift. Again, there is a nice template to get started. Click “Create Pipeline” and then select “Incremental copy of RDS MySQL table to Redshift”, then “Edit in Architect”.

The first run will need to be a full copy if I want all the data in the table. After that, I can use delta copies to only move over the new data fitting my criteria. This is driven by a SQL select query, so it is easy to modify. In the SqlDataNode I can edit the Select Query to my liking. Note that I removed the conditions from the query to get all records on the first run. I changed the instance type to something more substantial (t1.micro to m1.small), and upped the timeout from 2 hours to 8 hours. I then went in before the next run and modified the query to put back the conditions that select the delta data, then downgraded the instance type and timeout values to their smaller defaults.
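To give a sense of the shape of that delta condition (this is a paraphrase reusing the parameters from the archive pipeline above – not the template’s exact text – and it assumes a daily schedule):

select * from #{table}
where #{myRDSTableLastModifiedCol} >=
date_sub('#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' ,
  interval 1 day)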

I then ran into an infrastructure quirk where our Redshift instance was inside a VPC, and our RDS database was a classic instance (non-VPC). This meant that the same EC2 instance would not be able to talk to both databases, since it had to be on one side or the other. Because of this limitation, I had to modify parts of the pipeline that assumed a single EC2 instance would be talking to both databases. Note that I had to edit the JSON, as the Data Pipeline UI does not allow changing the resources that activities run on from the template. I created two EC2 instances – one for talking to RDS and S3, and one for talking to S3 and Redshift.

In an attempt to make things easier, Amazon provides some Python scripts that get called under the hood to reflect on your MySQL table structure and convert it to a PostgreSQL CREATE TABLE command. This didn’t work for me because of my VPC permissions issues, so I provided my own CREATE TABLE Sql in the S3StagingDataNode. This was generated using the Python script by Amazon, but I supplied the inputs manually:

#!/bin/bash
aws s3 cp s3://datapipeline-us-east-1/sample-scripts/mysql_to_redshift.py \
/tmp/mysql_to_redshift.py

for target in table1 table2 table3
do
  mysqldump -h <host> --port=3306 -u <username> --password=<password>  \
--compatible=postgresql --default-character-set=utf8 -n -d \
-r "/tmp/${target}_structure.sql" <database> $target

  python /tmp/mysql_to_redshift.py --input_file=/tmp/${target}_structure.sql \
--output_file=/tmp/${target}_create.psql --table_name=$target \
--insert_mode=KEEP_EXISTING
done

This Bash script will pull down the public file mysql_to_redshift.py. Then it loops over the target tables you want to set up pipelines for. For each table (table1, table2, table3, etc.) it does a mysqldump of the table structure. It then feeds this table structure file into the Python conversion utility to produce the PostgreSQL version of the table structure. The contents of the table1_create.psql file are what I copied into my “Create Table Sql” field in the Data Pipeline UI.

Note that the “Create Table Sql” is interpreted literally, and has no schema context in Redshift. Therefore, if I want to create the database table in another schema, the CREATE TABLE contents need to be modified to prepend the table name with a schema qualifier, e.g. table1 would become staging.table1 (without quotes). The Python utility will double quote the table name if given a table name with a period. This will incorrectly create a table in the public schema: public."staging.table1", which is probably not what is desired. Check the contents of the CREATE TABLE for accuracy.
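As an illustration (the column list here is made up), the hand edit is just the schema prefix – and note the quoting pitfall:

-- Unqualified: the table lands in the public schema
CREATE TABLE IF NOT EXISTS table1 (id INTEGER, updated_at TIMESTAMP);

-- Schema qualified: the table lands in the staging schema
CREATE TABLE IF NOT EXISTS staging.table1 (id INTEGER, updated_at TIMESTAMP);

-- Wrong: the quotes make "staging.table1" a single table name in the public schema
CREATE TABLE IF NOT EXISTS "staging.table1" (id INTEGER, updated_at TIMESTAMP);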

I also changed the “Insert Mode” in the S3ToRedshiftCopyActivity to OVERWRITE_EXISTING. This uses the primary key of the table to detect duplicate rows. Since we might modify existing records, I wanted those records replaced in Redshift when they are modified in the application.

Quirks

The Data Pipeline service has a number of quirks that I stumbled upon. I hope Amazon works to refine the service, and that one day these are no longer issues. But for now I observed the following:

  • I cannot add everything via the UI. Things like parameters and EC2 resources can only be added by editing the JSON. Don’t be afraid to open it up – it is reasonably easy to follow. Hopefully support for adding these objects will come to the UI in the future.
  • The default templates are a great place to understand the pipelines, but are very messy. The JSON definitions are painful to manage. Some information is stored in parameters, other information is inline. Some activities cannot have parameterized variables. Sometimes the parameter names MUST start with “my”, e.g. “myDatabaseUsername”. I found this arbitrary and frustrating. Also some parameters have a “watermark” value, or a “help” value, and others don’t. At least one variable started with a “*” character. No explanation why.
  • When testing a pipeline I cannot change between a scheduled pipeline and an on demand pipeline. I have to export the JSON definition and create a new pipeline as the other type. This makes testing painful.
  • The “Execution Details” screen is hard to interpret. The filter defaults to “Activities” only, but all of my pipeline definitions start with an EC2 resource being instantiated, which is filtered out. The timeframes are also strange. I needed to change the “Schedule Interval” to “Executed start” while testing an on demand pipeline. Finally, the dates need changing from the default. It will default to 2 days ago, and will include a lot of previous test runs if you are developing. They don’t seem to be sorted in any logical way either, making tracing the pipeline execution difficult at a glance.
  • While debugging, check S3 for the logs. I found log data in S3 that was not referenced at all in the Data Pipeline UI. This was critical for understanding failures.
  • The visualization in the Architect mode is not particularly helpful. The only thing I can do is click on a node and see its properties on the right. I cannot view two nodes’ properties at once. Worse, the parameters are in a completely different section, so I can only see the variable name or the variable value at any one time. I found it only useful for seeing the flow of execution. For editing, I would export the data pipeline to JSON and modify it in a text editor outside of the UI.

Conclusion

The Data Pipeline service offers an affordable way to automate ETLs in a cloud friendly way. The proof of concept that I did underrepresented the amount of work it would take to get these running in production. I frequently battled with permissions, UI quirks, and timeouts. I was persistent, and eventually got these pipelines running. Once running, I stripped out the passwords and committed the JSON exports into version control for peace of mind.

I want to look into SNS failure reporting next, to be alerted when pipelines fail. I am most concerned with upstream changes to the table schemas. If these don’t match the downstream schema, the pipelines will fail. To guard against this I wrote unit tests that inspect the tables and ensure no changes have been introduced upstream. If a change is introduced, the test fails with a message about what it was expecting. This should prompt the developer to alert the data services team to modify the pipeline downstream to prepare for the upstream changes. This policy has not yet been tested.

Computers, Open-source, Software

Assign Pull Requests with Hubot

ChatOps was a term our DevOps person shared with me. It’s the concept of automating everything you can through your company chat client. This could include monitoring, deploying, and just having fun (/pugbomb anyone?).

We implemented a pull request policy for all new code going to master. A pull request is opened by a developer on Github for the feature branch they want to merge into master. One of the biggest benefits of opening a pull request is having a peer review happen on your code, where at least one other developer provides a review before it is committed to master. Fixing issues in the code at this point in the development cycle is relatively inexpensive compared to having the code fail QA or BA sign off, or worst case, having a production incident.

Who reviews the pull requests? In the interest of being fair, it shouldn’t be a single developer that carries that burden. Pull requests can be a fantastic tool for transferring technical and domain knowledge, so it’s best if everyone can participate. A developer might have never had occasion to work on the part of the codebase that contains the changes. For these reasons, we decided to do a round robin assignment of PRs to the developers. And yes, we fixed the bug where you get assigned your own PR 🙂

To accomplish this we utilized an open source project called Hubot that is maintained by Github and has integrations with Slack, as well as other chat clients. Hubot scripts are written in CoffeeScript, and the platform runs on Node.js. We enabled the integration between Github and Slack. A chat bot will publish a message on Slack when a pull request is opened. Hubot listens for these messages and then runs the code that you specify in response to the event. It sounds complex, but most of the heavy lifting is already done for you. You just listen for keywords, and take certain actions.

In our case, mentioning to a developer on Slack that he or she is assigned to a PR was sufficient. We created a new Slack channel, set up a webhook, and wrote a few lines of CoffeeScript to do the round robin assignment. It looks like this in Slack:

[Screenshot: Hubot announcing a pull request assignment in Slack]

HubotSlack = require 'hubot-slack'

module.exports = (robot) ->

Require the hubot-slack module. Inside this function is where we will do our work

# { <github username>: <slack username> }
assignees =
  'bsimpson': 'ben.simpson'

getAssignee = ->
  index = robot.brain.get('assignee_index') || 0
  assignee_key = Object.keys(assignees)[index]
  assignee = assignees[assignee_key]

  if index >= Object.keys(assignees).length - 1
    index = 0
  else
    index++

  robot.brain.set('assignee_index', index)
  assignee

We list our assignees in a hash and then use robot.brain to store where we left off. This allows us to restart our client without losing our place in the round robin assignment. You will notice that the hash maps Github usernames to Slack usernames, to allow for the event that they are different.

regex = /Pull request submitted:.* by (.*)/i

robot.listeners.push new HubotSlack.SlackBotListener robot, regex, (msg) ->
  author = msg.match[1]
  assignee = getAssignee()
  until (assignee != assignees[author])
    assignee = getAssignee()

  msg.send "<@#{assignee}> you've been assigned to #{author}'s pull request."

This is the binding we make to listen to Github bot messages. We can’t use robot.listen since this only works for human messages. Since we are talking bot to bot, we have to utilize the SlackBotListener.

Now that we have our code in place, we can create a new Slack webhook. Inside your Slack settings click “Apps and Integrations” > “Manage” > “Incoming Webhooks”. Add a new one for the channel you want Hubot to post to and get the token. This will be needed when we start Hubot.

HUBOT_SLACK_EXIT_ON_DISCONNECT=true HUBOT_SLACK_TOKEN=xxxxxxx ./bin/hubot --adapter slack & echo $! > /path/to/hubot/run.pid

This will install any dependencies and run Hubot. Substitute your token from earlier for the HUBOT_SLACK_TOKEN. The pid tracking is important if you are using Monit to monitor your process.

Happy ChatOps-ing!

 

Hardware, Linux, Open-source, Ruby, Software, Uncategorized

Delayed Job Performance Tuning

We found a bug. A bug that affected a lot of historical records that we now have the pleasure of reprocessing. Fortunately we already had an async job infrastructure in place with the delayed_job gem. Unfortunately, this gem is intended for fairly small batches of records and isn’t tuned to handle 1M+ records in the delayed_jobs table.

After reading some best practices we decided on a queue based approach. To keep our day to day async jobs running we would use a “default” queue. And to reprocess our old records we used a new queue. Starting the workers up with a “--queue” flag did the trick. We had hardware dedicated to day-to-day operations, and new hardware dedicated to our new queue operations. Now it was simply a matter of filling up the queue with the jobs to reprocess the records.

Our initial approach maxed out the CPU on our database server. This was largely due to us not tuning the SQL in our async jobs. Because the volume we processed was always low, this was never really a noticeable problem. But when we threw lots of new jobs into the queues, it became very noticeable. The workers would start up, then mysteriously die. After some digging in /var/log/kern.log we discovered the workers were being killed by the kernel’s out of memory (OOM) killer. Attaching a small swap partition helped, but once you’ve hit swap, things become horribly slow. What’s the point in keeping the worker alive if it’s running a thousand times slower? Clearly we needed to profile and tune. So we did. (The specifics are out of scope for this article, but it involved consolidating N+1 queries and limiting the columns returned by the SELECT.)

With our newly tuned SQL our spirits were high as we cranked up the workers again – only to reach the next bottleneck. And this is where databases get frustrating. Each time a delayed job worker looks for a new job, it runs a query to find out which job to pick up, and puts a mutex lock on the record by setting locked_at and locked_by. The query looks like this:

UPDATE delayed_jobs
SET `delayed_jobs`.`locked_at` = '2016-06-05 11:48:28', 
 `delayed_jobs`.`locked_by` = 'delayed_job.2 host:ip-10-203-174-216 pid:3226' 
WHERE ((run_at <= '2016-06-05 11:48:28' 
AND (locked_at IS NULL OR locked_at < '2016-06-05 07:48:28') OR locked_by = 'delayed_job.2 host:ip-10-203-174-216 pid:3226') 
AND failed_at IS NULL) 
ORDER BY priority ASC, run_at ASC 
LIMIT 1;

The UPDATE does an ORDER BY, which results in a filesort. Filesorts are typically something an index can resolve. So I optimistically added the following:

CREATE INDEX delayed_job_priority
ON delayed_jobs(priority,run_at);

Sadly, this index was completely ignored when I ran an EXPLAIN on my UPDATE. The reason is that MySQL doesn’t execute an UPDATE query the same way it would a SELECT with the same conditions. The index probably made things worse, because now with each record update we also have an index update. I could fork the code and probably use some type of isolation level in a transaction to get the best of both worlds: an index-based SELECT, and a quick UPDATE on a single record by id. But there are easier solutions to try first.

My UPDATE statements were pushing 40 seconds in some cases according to MySQL. Eventually the lock wait timeout is exceeded and you see an error in the delayed_jobs.log:

Error while reserving job: Mysql2::Error: Lock wait timeout exceeded; 
try restarting transaction

Jobs were moving very slowly, and throwing more workers at it didn’t make an improvement. This is because each time a worker picked up a job, it was waiting 40+ seconds. The UPDATE was doing a filesort, and any index was being ignored. (And MySQL doesn’t support UPDATE hints.) It was pretty clear that all of the jobs from the reprocessing queue needed to find a new home that didn’t negatively impact my filesort. I settled on the following solution:

CREATE TABLE delayed_jobs_backup LIKE delayed_jobs;

INSERT INTO delayed_jobs_backup
SELECT * FROM delayed_jobs WHERE queue='new_queue';

DELETE FROM delayed_jobs WHERE queue='new_queue';

This creates a new database table with the structure of the existing delayed_jobs table. The table is then populated with the jobs that needed to find a new home (All 1M+ of them). And finally, deleted from the original delayed_jobs table. Be careful doing this, and do some SELECT/EXPLAIN queries in between to ensure you are doing what you think you are doing. (Deleting 1M+ records from a production database makes me sit up in my chair a bit).
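For example, something as simple as comparing counts before running the DELETE gives some confidence that the copy grabbed everything:

select count(*) from delayed_jobs where queue = 'new_queue';
select count(*) from delayed_jobs_backup;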

Looking at MySQL’s process list I no longer have System locks on my UPDATE statements (presumably because the table size is small enough that the filesort is mostly painless):

mysql> SHOW FULL PROCESSLIST;
# Id, User, Host, db, Command, Time, State, Info
1, user, 0.0.0.0:34346, localhost, Query, 0, System lock, UPDATE ...

The important columns here are the Time (in seconds), State, and Info. This proves that my job locking was happening quickly. I was seeing Time values of 40+ seconds before. I kept referring back to this process list to verify that the UPDATEs remained snappy while I modified the number of workers running and the number of jobs in the queue. I had a goal of keeping the UPDATE system lock times under 2 seconds. Adding more workers pushed the times up. Adding more jobs to the queue pushed the times up. It’s a balance that probably depends very much on what you are processing, how much your database can handle, and what your memory constraints are on your worker servers.

To conclude – my job over the next few days will be to run the following commands to put jobs into the delayed_jobs table 10,000 at a time:

INSERT INTO delayed_jobs
SELECT * FROM delayed_jobs_backup LIMIT 10000;

DELETE FROM delayed_jobs_backup LIMIT 10000;

You can of course automate this. But my objective was met. The jobs can reprocess old records without impacting day to day jobs in the default queue. When the delayed_jobs table is almost empty, I move over another batch of jobs from the delayed_jobs_backup table. Rinse and repeat until there are no more jobs left to process. It’s a bit more painful, but day to day operations continue to function, and I can cross the reprocessing task off my list of things to do. All without any code changes!

I’ve been reading up on transaction isolation levels, thinking something like a SELECT FOR UPDATE lock might be a worthy contribution to the delayed_job codebase: http://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html
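A rough sketch of that idea (this is just the concept, not how the gem currently works – the locked_by value and the 4 hour max run time mirror the query above):

begin;

select id into @job_id
from delayed_jobs
where run_at <= now()
  and (locked_at is null or locked_at < now() - interval 4 hour)
  and failed_at is null
order by priority asc, run_at asc
limit 1
for update;

update delayed_jobs
set locked_at = now(),
    locked_by = 'delayed_job.2 host:ip-10-203-174-216 pid:3226'
where id = @job_id;

commit;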