Apple, Computers, Hardware, Linux, Software, Windows

Using Synergy Over VPN

I’ve been watching a lot of Linus on Tech and one of their sponsors sells a product called Synergy. More information on their product page: . To summarize, this is a software KVM (no video, so I guess it is “KM”?) solution. The use case was perfect for me. On my desk I have an Apple work laptop, a Windows desktop, and a Linux personal laptop. Anyone that has done extensive work on a laptop keyboard and touchpad know that it isn’t optimal. I didn’t want multiple keyboards and mice on top of my desk because of the clutter. I dropped some money for Synergy, and it just works!


That is until I had to connect to our company VPN the next week. They use a full tunneling solution. When I connect, I lose everything. I can’t print, I can’t access my NAS, but most importantly I can’t access my keyboard and mouse. (The video is fine because it is a hard wire to an external monitor). What to do?

SSH to the rescue! What is SSH? This is a protocol that will allow one computer to securely interface with another computer. Secure SHell. However, we will just be using it for port forwarding, and not for an interactive session. The goal is to take the OS X machine (Synergy client), and SSH into the Windows server (Synergy server). Using this SSH connection, we can forward ports within it. It is a tunnel running inside the SSH connection. This will expose a local port on OS X for 24800 that is actually pointing to the remote server port 24800. This is the port that Synergy uses for its connections.

You will need a few tools, and a little patience. Having just gone through this, I’m sharing for posterity, or maybe for anyone that has thrown in the towel with how crippled VPN makes accessing home devices.

I have the following Synergy setup:

  • Windows 10 Synergy server (keyboard and mouse are physically connected to the desktop)
  • OS X Synergy Client
  • Linux Synergy Client
  • Router with a local area network all these devices share
  • Admin access to the router for port forwarding
  • Autossh package for OS X (available via brew)

First step, get Windows 10 up to speed with SSH. How this isn’t built in as a service in the year 2017 I have no idea. Grab the OpenSSH server package for Windows from . After downloading, extract and run the setup file. This will create a new Windows service for OpenSSH that will run on port 22. It prompts you to generate an SSH key for the server.

Once this server is running, you will need to add your user to the list of SSH users. Open up PowerShell as an administrator and change into the C:\Program Files\OpenSSH\bin directory. Run the following commands:

mkgroup -l >> ..\etc\group
mkpasswd -l >> ..\etc\passwd

Try and connect to your SSH server from the OS X client:

ssh <user>@<server IP> # e.g. ssh Ben@

You should be prompted for your Windows password. Once you can successfully login to the server, we can setup public key authentication. This removes the need for you to type in your password because you identify yourself with an SSH public key. From your OS X machine get your public key:

cat ~/.ssh/

Put the contents of this file on your SSH server in the file C:\Program Files\OpenSSH\home\<user>.ssh . This is actually a symlink to C:\Users\<user>.ssh . If the directory .ssh doesn’t exist, you will need to create it first. Now we need to configure the server to allow public key authentication. Edit the C:\Program Files\OpenSSH\etc\sshd_config file and change the following lines:

StrictModes no
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys

Restart the OpenSSH server for the changes to take effect:

net stop opensshd
net start opensshd

You should now be able to SSH into the server same as before but without being prompted for a password.

Now we are ready to create an SSH tunnel. Before we incorporate AutoSSH (which handles retries and monitoring) we will do a naive attempt to SSH. In the following command:

  • -f backgrounds the process
  • -L does port tunneling in the format of <local port>:<remote host>:<remote port>
  • -N do not run a command – just tunnel the port
ssh -f <user>@<remote public IP> -L 24800:<remote public IP>:24800 -N

If this works, you should see a [LISTEN] entry for port 24800 when you list open files:

lsof -n -i | grep 24800

You may need to set your server as the DMZ on your network. Or to be safer you can simply setup port forwarding. We will need port 22 and port 24800 to resolve to the Windows server. The instructions for how to do this on a router widely vary by vendor. Typically it is under a WAN section. It typically prompts for a port, a destination IP, and destination port, and protocol. You want ports 22 and 24800 to route to your server IP for TCP and UDP.

Configure your Synergy client to use localhost instead of the remote IP. You should now be able to operate your client from the server’s peripherals via Synergy.

Everything works great until the VPN connection is made. The reason is that the SSH connection is severed. In order to recover automatically, I have added autossh to persist this tunnel. On the OS X client instead of running SSH do the following:

AUTOSSH_POLL=10 autossh -M 20000 -f -N <user>@<remote public IP> -L 24800:<remote public IP>:24800

Now when a VPN connection is made, or a disconnection happens, the autossh package will detect that it is no longer alive and retry. Because Synergy’s software also retries, after a few seconds your connectivity should begin working again.

Thanks to Synergy for making a solid product, and for having first class Linux support.

Computers, Software

Moving Data in AWS

I have been a long time user of Amazon Web Services, but I only recently started using the Data Pipeline service they offer to handle ETL needs. The service provides a cloud ready, low cost, turn key (in some cases) solution to moving data within your services. I had a difficult time getting up and running, partly due to the lack of discussion online about this service, so I want to share my experience, offer some best practices, walk through how I developed our pipelines.

Anyone with an AWS account can use Data Pipelines. But be careful there there is no BAA agreement if you are in the healthcare industry and are passing around PHI data. Fortunately, our needs do not yet require us to move around PHI information.

The first step in my ETL journey was first formalizing what data needed to go where. Before I even opened up the Data Pipeline service, I had to understand our specific needs. I identified two use cases.

Archiving Old Data

RDS instances have a max allowable size for the database and ours was getting full. The approach I took was to look at our largest tables, and focus on reducing those first. I ran some queries to understand what is using the most space:

Once I had a list of my largest tables, I could classify them and assign retention rules. Some tables I decided to leave in place, and others I decided were transient data, and could be archived. (HIPAA mandates a 7 year data retention policy, so no luck just deleting). We decided as a business that different classifications could live within our application for differing time frames. Once timeframes were established, I could then write a data pipeline, and move any data older than our cut off date for that table to a storage solution outside of the database. We chose to house MySQL backups on S3 in encrypted buckets.

Fortunately the Data Pipeline service provides a number of templates to help you get started. Navigate to .  I found the templates good starting point, but there are some frustrations that I will mention below in the “Quirks” section. Click “Create Pipeline”. I used the template “Full Copy of RDS MySQL table to S3”. I filled in the parameters, and edited the pipeline in “Edit in Architect” mode.

Since I wanted to archive old data, I modified the SqlDataNode’s Select Query to be only records older than my retention policy:

select * from #{table} 
where #{myRDSTableLastModifiedCol} <= 
date_sub('#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' , 
  interval 6 month)

This will select records only older than 6 months from the pipeline scheduled start time. The template then moves these to S3. There are two parameters, defined in the parameters section for “#{myRDSTableName}”, and “#{myRDSTableLastModifiedCol}”. I supplied my table name, and the updated_at datetime column for my records.

I added a new SqlActivity dependent on the CopyActivity named “Delete old records”. Once they move to S3, I want to delete them from the database table. This activity “Depends on: RDSToS3CopyActivity” so if saving to S3 fails, the records are left untouched. I added the following script to mirror my select query above, but deleting the records instead:

delete from #{myRDSTableName} 
where #{myRDSTableLastModifiedCol} <= 
date_sub('#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' , 
  interval 6 month)

I would recommend doing this on a test database first before you delete production records while testing your script. Because the timestamp is the same as from the select, this will be the same record set, provided you have an updated_at column that shows when a record was last touched.

Expose Data for Reporting

My other use case was getting data to the reporting server which is in Redshift. Again, there is a nice template to get started. Click “Create Pipeline” and then select “Incremental copy of RDS MySQL table to Redshift”, then “Edit in Architect”.

The first run will need to be a full copy if I want all the data in the table. After that, I can use delta copies to only move over the new data fitting your criteria. This is driven by a SQL select query, so it is easy to modify. In the SqlDataNode I can edit the Select Query to my liking. Note that I removed the conditions from the query to get all records on the first run. I changed the instance type to something more substantial (t1.micro to m1.small), and upped the timeout from 2 hours (to 8 hours). I then went in before the next run and modified the query to put back the conditions that selected the delta data, then downgraded the instance type, and timeout values to their smaller defaults.

I then ran into an infrastructure quirk where our Redshift instance was inside a VPC, and our RDS database was inside a classic instance (non-VPC). This meant that the same EC2 instance would not be able to talk to both databases since it had to be on one side or the other. Because of this limitation, I had to modify parts of the pipeline that assumed a single EC2 instance would be talking to both databases. Note that I had to edit the JSON as the Data Pipeline UI does not allow changing the resources that activities run on from the template. I created two EC2 instances – one for talking to RDS and S3, and one for talking to S3 and Redshift.

In an attempt to make things easier, Amazon provides some Python scripts that get called under the hood to reflect on your MySQL table structure and convert it to a PostgreSQL CREATE TABLE command. This didn’t work for me because of my VPC permissions issues, so I provided my own CREATE TABLE Sql in the S3StagingDataNode. This was generated using the Python script by Amazon, but I supplied the inputs manually:

aws s3 cp s3://datapipeline-us-east-1/sample-scripts/ \

for target in table1 table2 table3
  mysqldump -h <host> --port=3306 -u <username> --password=<password>  \
--compatible=postgresql --default-character-set=utf8 -n -d \
-r "/tmp/${target}_structure.sql" <database> $target

  python /tmp/ --input_file=/tmp/${target}_structure.sql \
--output_file=/tmp/${target}_create.psql --table_name=$target \

This Bash script will pull down the public file Then it loops over the target tables you want to setup pipelines for. For each table (table1, table2, table3, etc) it does a mysqldump of the table structure. It then feeds this table structure file into the python conversion utility to produce the PostgreSQL version of the table structure. The contents of the table1_create.psql file is what I copied into my “Create Table Sql” field in the Data Pipeline UI.

Note that the “Create Table SQL” is interpreted literally, and has no schema context in Redshift. Therefor if I want to create the database table in another schema, the CREATE TABLE contents need to be modified to prepend this table name with a schema qualifier. e.g. “table1” would become “staging.table” (without the quotes). The Python utility will double quote the table name if given a table name with a period. This will incorrectly create a table in the public schema: public.”staging.table”, which is probably not what desired. Check the contents of the CREATE TABLE for accuracy.

I also changed the “Insert Mode” in the S3ToRedshiftCopyActivity” to be OVERWRITE_EXISTING . This uses the primary key of the table to detect duplicate rows. Since we might modify existing records, I wanted to replace those records in Redshift when they are modified in the application.


The data pipeline services has a number of quirks that I stumbled upon. I hope Amazon works to refine the service, and that one day these are no longer issues. But for now I observed the following:

  • I cannot add everything via the UI. Things like parameters, and EC2 resources can only be added via editing the JSON. Don’t be afraid to open it up – it is reasonably easy to follow. Hopefully support to add these objects will come to the UI in the future.
  • The default templates are a great place to understand the pipelines, but are very messy. The JSON definitions are painful to manage. Some information are stored in parameters, others are done inline. Some activities cannot have parameterized variables. Sometimes the parameter names MUST start with “my”, e.g. “myDatabaseUsername”. I found this arbitrary and frustrating. Also some parameters have a “watermark” value, a “help” value, others don’t. At least one variable started with a “*” character. No explanation why.
  • When testing a pipeline I cannot change between a scheduled pipeline, and an on demand pipeline. I have to export the JSON definition and create a new pipeline as the other type. This makes testing painful.
  • The “Execution Details” screen is hard to interpret. The filter defaults to “Activities” only, but all of my pipeline definitions start with an EC2 resource being instantiated which is filtered out. The timeframes are also strange. I needed to change the “Schedule Interval” to be “Executed start” while testing an on demand pipeline. Finally the dates need changing from the default. It will default to 2 days ago, and will include a lot of previous test runs if you are developing. They don’t seem to be sorted in any logical way either, making tracing the pipeline execution difficult at a glance.
  • While debugging, check S3 for the logs. I found log data was contained in S3 that was not referenced at all in the Data Pipeline UI. This was critical for understanding failures.
  • The visualization in the Architect mode is not particularly helpful. The only thing I can do is click on a node and see the properties on the right. I cannot view two node’s properties at once. Worse is the parameters are in a completely different section so I can only see the variable name, or the variable value at any time. I found it only useful to see the flow of execution. For editing, I would export the data pipeline to JSON and modify in a text editor outside of the UI.


The Data Pipeline service offers an affordable way to automate ETLs in a cloud friendly way. The proof of concept that I did under represented the amount of work it would take to get these running in production. I frequently battled with permissions, UI quirks, and timeouts. I was persistent, and eventually got these pipelines running. One running, I stripped out the passwords and committed the JSON exports into version control for peace of mind.

I want to look into the SNS failure reporting next to be alerted when pipelines fail. I am most concerned with upstream changes to the table schemas. If these don’t match the downstream schema, the pipelines will fail. To guard against this I wrote unit tests that inspect the tables and ensure no changes have been introduced upstream. If a change is introduced, the test fails with the message that it was expecting something else. This should trigger the developer to alert the data services team to modify the pipeline downstream to prepare for upstream changes. This policy has not yet been tested.

Computers, Open-source, Software

Assign Pull Requests with Hubot

ChatOps was a term our DevOps person shared with me. Its the concept of automating everything you can through your company chat client. This could include monitoring, deploying, and just having fun (/pugbomb anyone?)

We implemented a pull request policy for all new code going to master. A pull request is opened by a developer on Github for the feature branch they want to merge into master. One of the biggest benefits to opening a pull request is having a peer review happen on your code where at least one other developer provides a review before it is committed to master. Fixing issues in the code at this point in the development cycle is relatively inexpensive compared to having this code fail QA, or BA sign off, or worst case, having a production incident.

Who reviews the pull requests? In the interest of being fair, it shouldn’t be a single developer that carries that burden. Pull requests can be a fantastic tool for transferring technical and domain knowledge so its best if everyone can participate. A developer might have never had occasion to work on the part of the codebase that contains the changes. For these reasons, we decided to do a round robin assignment of PRs to the developers. And yes, we fixed the bug where you get assigned your own PR 🙂

To accomplish this we utilized an open source project called Hubot that is maintained by Github and has integrations with Slack, as well as other chat clients. Hubot scripts are written in CoffeeScript, and the platform runs on Node.js. We enabled the integration between Github and Slack. A chat bot will publish a message on Slack when a pull request is opened. Hubot listens for these messages and then runs the code that you specify in response to the event. It sounds complex, but most of the heavy lifting is already done for you. You just listen for keywords, and take certain actions.

In our case, mentioning to a developer that he or she is assigned (via Slack) to a PR was sufficient for our use case. We created a new Slack channel, setup a webhook, and write a few lines of CoffeeScript to do the round robin assignment. It looks like this in Slack:

Screen Shot 2016-08-11 at 6.30.50 PM

HubotSlack = require 'hubot-slack'

module.exports = (robot) ->

Require the hubot-slack module. Inside this function is where we will do our work

# { <github username>: <slack username> }
 assignees =
 'bsimpson': 'ben.simpson'

getAssignee = ->
 index = robot.brain.get('assignee_index') || 0
 assignee_key = Object.keys(assignees)[index]
 assignee = assignees[assignee_key]

 if index >= Object.keys(assignees).length-1
  index = 0

 robot.brain.set('assignee_index', index)

We list our assignees in an hash and then use robot.brain to store where we left off. This allows us to restart our client without losing our place in the round robin assignment. You will notice that the assignment hash has two usernames to allow mapping from Slack to Github user accounts in the event they are different.

regex = /Pull request submitted:.* by (.*)/i

robot.listeners.push new HubotSlack.SlackBotListener robot, regex, (msg) ->
 author = msg.match[1]
 assignee = getAssignee()
 until (assignee != assignees[author])
 assignee = getAssignee()

 msg.send "<@#{assignee}> you've been assigned to #{author}'s pull request."

This is the binding we make to listen to Github bot messages. We can’t use robot.listen since this only works for human messages. Since we are talking bot to bot, we have to utilize the SlackBotListener.

Now that we have our code in place, we can create a new Slack webhook. Inside your Slack settings click “Apps and Integrations” > “Manage” > “Incoming Webhooks”. Add a new one for the channel you want Hubot to post to and get the token. This will be needed when we start Hubot.

HUBOT_SLACK_EXIT_ON_DISCONNECT=true HUBOT_SLACK_TOKEN=xxxxxxx./bin/hubot --adapter slack & echo $! > /path/to/hubot/

This will install any dependencies and run Hubot. Substitute your token from earlier for the HUBOT_SLACK_TOKEN. The pid tracking is important if you are using Monitrc to monitor your process.

Happy ChatOps-ing!


Hardware, Linux, Open-source, Ruby, Software, Uncategorized

Delayed Job Performance Tuning

We found a bug. A bug that affected a lot of historical records that we now have the pleasure of reprocessing. Fortunately we already had an async job infrastructure in place with the delayed job gem. Unfortunately, this gem is intended for fairly small batches of records and isn’t tuned to handle 1M+ records in the delayed_jobs table.

After reading some best practices we decided on a queue based approach. To keep our day to day async jobs running we would use a “default” queue. And to reprocess our old records we used a new queue. Starting the workers up with a “–queue” flag did the trick. We had hardware dedicated for day-to-day operations, and new hardware dedicated to our new queue operations. Now it was simply a matter of filling up the queue with the jobs to reprocess the records.

Our initial approach maxed out the CPU on our database server. This was largely due to us not tuning our SQL in our async jobs. Because the volume we processed was always low, this was never really a noticeable problem. But when we threw lots of new jobs into the queues, it became very noticeable. The workers would start up, then mysteriously die. After some digging in /var/log/kern.log we discovered the workers were being killed due to an out of memory manager. Attaching a small swap partition helped, but once you’ve hit swap, things become horribly slow. Whats the point in keeping the worker alive if its running a thousand times slower? Clearly we needed to profile and tune. So we did. (The specifics are out of scope for this article, but it involved consolidating N+1 queries and limiting columns returned by the SELECT).

With our newly tuned SQL our spirits were high as we cranked up the workers again. Only to reach the next process bottleneck. And this is where databases gets frustrating. Delayed job workers run a query each time they are looking for a new job to find out which job to pick up. It puts a mutex lock on the record by setting locked_at and locked_by. The query looks like this:

UPDATE delayed_jobs
SET `delayed_jobs`.`locked_at` = '2016-06-05 11:48:28', 
 `delayed_jobs`.`locked_by` = 'delayed_job.2 host:ip-10-203-174-216 pid:3226' 
WHERE ((run_at <= '2016-06-05 11:48:28' 
AND (locked_at IS NULL OR locked_at < '2016-06-05 07:48:28') OR locked_by = 'delayed_job.2 host:ip-10-203-174-216 pid:3226') 
AND failed_at IS NULL) 
ORDER BY priority ASC, run_at ASC 

The UPDATE does an ORDER which results in a filesort. Filesorts are typically something an index can resolve. So I optimistically added the following:

CREATE INDEX delayed_job_priority
ON delayed_jobs(priority,run_at);

Sadly, this index was completely ignored when I ran an EXPLAIN on my UPDATE. And the reason is that MySQL doesn’t execute an UPDATE query the same way as if you did a SELECT with the same conditions. The index probably made things worse, because now with each record update, we now also have an index update as well. I could probably fork the code and probably use some type of isolation level in a transaction to get the best of both worlds with an index based SELECT, and a quick UPDATE on a single record by id. But there are easier solutions to try first.

My UPDATE statements were pushing 40 seconds in some cases according to MySQL. Eventually the lock wait timeout is exceeded and you see an error in the delayed_jobs.log:

Error while reserving job: Mysql2::Error: Lock wait timeout exceeded; 
try restarting transaction

Jobs were moving very slowly, and throwing more workers at it didn’t make an improvement. This is because each time a worker picks up a job, it was waiting 40+ seconds. The UPDATE was doing a filesort, and any index was being ignored. (And MySQL doesn’t support UPDATE hints). It was pretty clear that all of the jobs from the reprocessing queue needed to find a new home that didn’t negatively impact my filesort. I settled on the following solution:

CREATE TABLE delayed_jobs_backup LIKE delayed_jobs;

INSERT INTO delayed_jobs_backup
SELECT * FROM delayed_jobs WHERE queue='new_queue';

DELETE FROM delayed_jobs WHERE queue='new_queue';

This creates a new database table with the structure of the existing delayed_jobs table. The table is then populated with the jobs that needed to find a new home (All 1M+ of them). And finally, deleted from the original delayed_jobs table. Be careful doing this, and do some SELECT/EXPLAIN queries in between to ensure you are doing what you think you are doing. (Deleting 1M+ records from a production database makes me sit up in my chair a bit).

Looking at MySQL’s process list I no longer have System locks on my UPDATE statements (presumably because the table size is small enough that the filesort is mostly painless):

# Id, User, Host, db, Command, Time, State, Info
1, user,, localhost, Query, 0, System lock, UPDATE ...

The important columns here are the Time (in seconds), State, and Info. This proves that my job locking was happening quickly. I was seeing Time values of 40+ seconds before. I kept referring back to this process list to verify that the UPDATES were remaining snappy while I modified the number of workers running, and the number of jobs in the queue. I had a goal of keeping the UPDATE system lock times under 2 seconds. Adding more workers pushed the times up. Adding more jobs to the queue pushed the times up. Its a balance that probably depends very much on what you are processing, how much your database can handle, and what your memory constraints are on your worker servers.

To conclude – my job over the next few days will be to run the following command to put some jobs into the delayed_jobs table 10,000 at a time:

INSERT INTO delayed_jobs
SELECT * FROM delayed_jobs_backup LIMIT 10000;

DELETE FROM delayed_jobs_backup LIMIT 10000;

You can of course automate this. But my objective was met. The jobs can reprocess old records without impacting day to day jobs in the default queue. When the delayed_jobs table is almost empty, I move over another batch of jobs from the delayed_jobs_backup table. Rinse and repeat until there are no more jobs left to process. Its a bit more painful, but day to day operations continue to function, and I can cross of the reprocessing task from my list of things to do. All without any code changes!

I’ve been reading up on transaction isolation levels thinking something like a SELECT FOR UPDATE lock might be worthy contribution to the delayed_job codebase:

Computers, Open-source, Ruby

Rspec Bisect

During a journey to upgrade to Rails 4, the test suite was crucial for success. I ran the test suite and recorded all of the test failures. Then I went through test by test fixing them until I was all done. Barely containing my excitement I pushed to remote and waited. Sadly, our CI server reported a build failure. What the heck?! The tests all passed in isolation. Why do they not pass when run together? Well…

The mark of a good test is that they can be run in any order and still pass. In order to accomplish this, we need each test (or the suite at large) to be responsible for setting up its own data, and tearing down that data when its finished. You want to leave the state of everything exactly like how you found it before the test was run. This can include clearing caches, truncating database tables, or rolling back transactional changes. Many test suites including Rspec have the capability to do these things for you without much headache.

But what happens when your tests are run in a different order and fail? Or what happens when your tests pass in isolation, but fail when run in part of the larger test suite? Likely something in an unknown test is causing your test failure. And finding which other test is being naughty can be a real pain. Fortunately, this is the case for a new feature in Rspec 3.3 called bisect.

In case you are not familiar with bisect its a simple, but powerful algorithm. Suppose you want to drop an egg off a building at different floors to discover the lowest floor on which the egg will break. There are multiple ways to accomplish this task. The most wasteful way would be to start at floor one, drop the egg, if it doesn’t break, go to floor two and repeat. When you drop the egg and it breaks, then you have your answer. This is slow. In Big O notation, this would be O(n) where n is the number of floors in the building.

We can do better. What if we could cut the floors that we test on in half each time we perform our egg drop test? For argument sake, if the building has 100 floors (and we have an unusually strong egg!) we might bisect the floors and start at floor 50. If we drop the egg and it breaks, we know that the answer is somewhere between 1 and 50. If it doesn’t break, we know the answer is between 51 and 100. Lets assume it breaks. We can then do our next test (with a new egg) on floor 25. If it breaks, we know our answer is between 1 and 25. If it doesn’t break, we know our answer is between floors 26 and 50. Just by dropping two eggs on different floors, we’ve cut our sample size from 100 to 25. In Big O notation, this would be O(n/2). The number of times we have to do this test is the number of powers of 2 we can divide into our number of floors. On a 100 floor building, we would have our answer in ~6 tests (2^6). We are going to save a lot of eggs.

So why are we talking about eggs when we have test failures to fix? Rspec offers a sweet utility called “bisect” that will do something analogous to our egg drop algorithm in our test suite. We have a known test failure in the larger suite, but it passes in isolation. And we want to know which currently unknown test is interfering and causing our known test failure. How would you tackle this? Well, you could remove half your tests (keeping your failing test) and see if you still get a test error when running the test suite. If you do, you remove half again, and repeat your testing. Do this until you get down to two tests – the known failing test, and the previously unknown (now known) passing test. But that is tedious work to do manually.

Lets take a look at Rspec bisect:

$ bundle exec rspec --seed 1234 --bisect
Bisect started using options: "--seed 1234"
Running suite to find failures... (3 minutes 14.1 seconds)
Starting bisect with 3 failing examples and 2475 non-failing examples.
Checking that failure(s) are order-dependent... failure appears to be order-dependent

Round 1: bisecting over non-failing examples 1-2475 . ignoring examples 1-1238 (1 minute 38.37 seconds)
Round 2: bisecting over non-failing examples 1239-2475 . ignoring examples 1239-1857 (57.25 seconds)
Round 3: bisecting over non-failing examples 1858-2475 . ignoring examples 1858-2166 (29.98 seconds)
Round 4: bisecting over non-failing examples 2167-2475 .. ignoring examples 2322-2475 (46.11 seconds)
Round 5: bisecting over non-failing examples 2167-2321 .. ignoring examples 2245-2321 (43.61 seconds)
Round 6: bisecting over non-failing examples 2167-2244 .. ignoring examples 2206-2244 (42.92 seconds)
Round 7: bisecting over non-failing examples 2167-2205 .. ignoring examples 2187-2205 (43.65 seconds)
Round 8: bisecting over non-failing examples 2167-2186 .. ignoring examples 2177-2186 (41.96 seconds)
Round 9: bisecting over non-failing examples 2167-2176 . ignoring examples 2167-2171 (24.15 seconds)
Round 10: bisecting over non-failing examples 2172-2176 .. ignoring examples 2175-2176 (36.95 seconds)
Round 11: bisecting over non-failing examples 2172-2174 .. ignoring example 2174 (37.3 seconds)
Round 12: bisecting over non-failing examples 2172-2173 . ignoring example 2172 (24.68 seconds)
Bisect complete! Reduced necessary non-failing examples from 2475 to 1 in 9 minutes 1 second.

The minimal reproduction command is:
  rspec ./spec/models/file_1.rb[1:4:3] ./spec/models/file_2.rb[1:1:1,1:1:2,1:1:3] --seed 1234

That is a lot of output. Lets break it down:

  • We call rspec with a given --seed. What this seed value does is determine the order of the tests. When --seed is omitted a random number is chosen which makes the tests order random. This is normally a good thing. But when bisecting, its a bad thing because we want to reproduce a certain arrangement of tests to reliably generate our failure.
  • The test suite initially takes 3 minutes 14 seconds to run. This is the longest test run because it is discovering which test failures we have so it must run the full suite.
  • The first round we ignore the first half of our tests. This takes half the time to run because we are running half the tests.
  • The second round we detected the failure from round one, so we know the half we tested includes the failure. We’ve just eliminated 50% of our tests from being the culprit. The next bisect will ignore the first half of the remaining pool and test again. We half the test run time again (tests don’t have exactly the same duration)
  • The third round we repeat what we did in round two, but with the remaining pool, and so on. If the failure is no longer detected, we test on the half we most recently removed.
  • Finally we get a completed message with a command to reproduce the failure. I can copy and paste this into a terminal and see the two files that are causing a failure.

rspec ./spec/models/file_1.rb[1:4:3] ./spec/models/file_2.rb[1:1:1,1:1:2,1:1:3] --seed 1234 What is that identity notation you might ask? ./spec/models/file_1.rb is straightforward. This is just the filename. The next part is a bit more interesting: [1:4:3]. This is telling us the nesting of the assertion that is failing when combined with the other file. Not only does Rspec bisect tell us the two files that are interfering each other, but which tests within those two files are interfering! The notation 1:4:3 is telling us its the first describe context, then the fourth context within that context, and finally the third test within those previous two contexts. In the case of the second notation: ./spec/models/file_2.rb[1:1:1,1:1:2,1:1:3] its just telling us that we have multiple tests that are failing (3 to be precise), separated by commas. Notice that the command also includes the seed value we used when bisecting. This is still important because your failure might be order dependent.

And what if your tests fail in a different order? Same principle. When you get an intermittent test failure when running the test suite that you suspect is related to order, note the seed value that was used. Perform the same rspec bisect algorithm (and the same command) and use this seed value instead of 1234. This will reproduce the test failure you got by running the same tests in the same order as when the failure was recorded.

The Rspec team did an awesome job with the bisect utility. This used to be a dreaded problem to face as someone running tests. To verify your fix you had to run the entire test suite just to know if your change worked. This would have been a 4 minute wait in the example above, but could easily have been much longer. When reproducing failures, you want the shortest interval possible. Running 4 tests is certainly faster than running 2475 tests!

Another shout out goes to the authors of the transpec gem that is used when upgrading versions of Rspec. This automatically changes your deprecated code with the newer style syntax. These two features together show that the Rspec team has a powerful and mature product.

More information:

Computers, Open-source, Ruby, Software

Upgrading to Rails 4

Recently, I started a new job and the first big assignment was to upgrade their software stack to a more recent version. They are running Rails 3.2 and wanted to upgrade as far forward as they can. With Rails 3.2 support gone for all but severe bug fixes, and Rails 5 due any month now, this is something they wisely didn’t want to put off.

Its a smaller company, and they have been open to a lot of my feedback and suggestions. I was basically given the reins and told to do what needed to be done to get us upgraded.

So the first task was some research, and I stumbled upon the official Rails upgrade guide pretty quickly. It nicely outlines the breaking changes. Fortunately, the big change was to strong parameters, but this can be deferred by including protected_attributes and kicking this can down the road. We will be logging what controller actions receive which parameters, instead of raising so we will have some time to collect some data before we switch over in one painful release.

The guides stressed that the test suite is critical during the upgrade. I was fortunate enough to have a project with adequate testing coverage. It wasn’t the 80% sweet spot, but it was certainly valuable at ~40%. However, the suite had fallen into disuse, so the first task was to get them back to green.

Once the test suite was green, it became a matter of KEEPING it green. Luck smiled a second time and they had an old CI server that had fallen into disuse. It was powered by CruiseControl.rb and it was little fuss to get it back up and running again. The migrations could no longer be played from the projects inception to the current time.

This is where luck stopped smiling upon me. The project did not track db/schema.rb and the migrations were not playable. The only way to get an instance of the database was to download the schema from production. Not the best practice, so I went about tracking the schema, and getting adoption of this new practice. Further complicating the schema approach was the decision to move all older migrations into subfolders in db/migrate by year (e.g. 2011, 2012, etc). This was done I found out because Textmate doesn’t like big directories. The issue is that db:schema:load isn’t recursive in its retrieval of migration versions. It took me a bit to understand what was happening, and how it was happening. After a failed monkey patch to the migrator logic in ActiveRecord, I decided to just move the migrations back into db/migrate and eliminate the subdirectories. Sorry Textmate!

Now the database could be rapidly provisioned, and I got a seed working with a minimal set of data. Back in CI I reconfigured the build script to use db:schema:load instead of db:migrate and with the green test suite, we got builds working again.

We used a utility called CC Menu to show the build status in the notification bar in OS X:

To make the builds even more visible, I discovered an integration with Slack to report the build status in our chat. . I made my own fork and added some famous movie quotes for successes and failures since I found the default messages lacking: . I didn’t think our female developers would appreciate the “you’re a stud!” message.

Back to the Rails 4 upgrade. The tests are passing in master, so I made a feature branch that will be long lived called “rails-upgrade”. I merge master in daily. The “rails-upgrade” branch will serve as an integration point for other features branches that will merge into it. The plan is to keep any upgrade related changes out of master until its time to deploy. That means separate branches, separate CI builds, and separate staging servers for manual QA.

One lesson I’ve learned is that a deprecation warning may not always be just informational. In particular, Rails 4 requires all scopes to be callable (lambdas, or procs). This was breaking the way that associations with scopes would be built: users.roles.admin.find_or_create! would previously find an associated admin record, or create it. However, in Rails 4, it fails creation because the role’s reference to user is nil. I’m not sure why, but its reproducable, and changing the admin scope on Role to a callable restores this reference back to user.

Ideally, I’d have wanted to get the test suite green before tackling deprecation warnings because I want to change as little as possible before I get back to a known good status. However, not fixing this deprecation warning was actually causing tests to fail.

Now we are down to a handful of failings tests on Rails 4. Most deal with the ActiveRecord syntax changes. Hopeful I can get these knocked out quickly. Then its on to manual QA.

In summary – get your test suite green. Keep it green. Do the upgrade and get it back to green. Then make any changes to remove deprecation warnings, keeping the suite green. The test suite is your guide during an upgrade like this.

Computers, Events, Family, Ruby

First day at Influence Health

Influence Health
October 21st 2015 it was announced that Barclays would be laying us off, and shutting down our division. I’ve never been laid off before, and the emotions went from shock, disbelief, anger, and even sometimes relief. It was finally over. All of the technical debt, and poor business decisions were something in the past. Over the next few months I would be interviewing, relaxing and getting ready to work at a new place, with new people.

I had a lot of decisions to make. The first big decision was whether I was going to work full time remote again. There are a lot of perks, but there are downsides too. I felt lonely, isolated, disconnected. My wife didn’t understand what I did all day. I had no coworkers I could get drinks with, or swap horror stories with over lunch. And with two kids in the house, it has become quite noisy, and I find myself getting frustrated when our youngest has a bad day and spends it crying. And it can be stressful on a marriage to be around each other 24/7. Nothing ever changes. Little things start to become big things.

I decided that for general health, I would be looking for a job with a flexible work schedule. A few days at home, a few days in the office. Once I narrowed down what type of arrangement I wanted, then it was on to the culture. This time around I have enough experience that I felt comfortable donning the title of a senior software developer. I’ve seen enough good and bad that I can tell the differences between them.

Next was the work culture. Did I want another big company with more people than I could ever memorize? A startup struggling to become solvent? Something in between? The size of the company dictates the culture to some extent. The bank culture was slow, fearful, and inefficient. I wanted to get my hands dirty, but I wasn’t ready for 80 hour work weeks in a startup trying to get off the ground. I wanted something smaller, but established. And with a laid back culture. I didn’t want to wear a suit everyday, and I’d like the freedom to not be stuck in a cubicle.

I settled on a fun looking company called Influence Health (formerly BrightWhistle) in Midtown, Atlanta. They are right next to Piedmont Park, and its quite a drive to get here from Johns Creek. But with flexible hours, I miss the rush hour madness, and with 3 days at home, I’m not sitting in the car that often. Though, it is fun to sit in the Mazda RX-8!

The first day was a 9:30 (worst possible time due to traffic) orientation. Then it was on to unpacking my equipment while I met the team, and then I spent the rest of the day juggling orientation tasks and setting up my new laptop the way that I like it.

I decided to use Git to manage my dot files as I’ve read about a few other developers doing. You can see my progress here:

I like the incremental approach, and I’ve expanded it a bit to make it super easy.

The first difficult task (outside of basic environment setup like Rbenv, Ruby, MySQL, Sphinx, etc) was the bundling of gems. This led to quite a few dependencies that couldn’t install because it couldn’t find the native libraries. This was a hit and miss adventure, but I’m reaching the end and I’ve taken notes. I’ve asked if there is a developer wiki somewhere to keep all of this painful learning. Maybe I will push to implement one if I can get some support behind the idea.

The laptop was imaged, but it provides enough freedom to configure however I would like. They use Microsoft Office, which isn’t great, but the Office365 web client has superior searching than its desktop counterpart Outlook so I guess I will just stick with that. They use Slack for communications, and the Mac app makes juggling all of my accounts very simple.

Overall I’m pleased with the team, the culture, and the equipment they have provided to me. I feel rested, focused, and ready to tackle this new job head on. Its been strange to sit in an office with other people after four years of remote work, but I’m adjusting. My voice is worn out, since I never typically speak at home.

Anyway, stay tuned for the next episodic blog post of my adventures in Midtown!