Open-source, Software

Updating database rows with a default position

I was recently tasked with making a table in a MySQL database sortable by the user. There are a few steps involved to take care of new records, while providing existing records with a default value.

The new records would be created with a ifnull(max(order), 0) + 1 to assign them the next position. This was sufficient since I was positioning based on created_at and a new record was guaranteed to have the highest created_at timestamp.

For the existing records, I wanted to:

  • group by user_id
  • set the order (int) column to an incrementing integer
  • with an initial ordering based on a created_at timestamp

The first approach was to select groups of users, and for each group sort the records then update each record. This approach would work, but it wouldn’t have good performance. I really wanted to do this in a single update statement.

I ended up going with an approach based on this article :

set @user_id = 0, @order = 1

update table1 as dest,
(
  select x.id, x.user_id, x.created_at,
  @order := if(@user_id = x.user_id, @order + 1, 1) as `order`,
  @user_id := x.user_id as dummy
  from (
    select id, user_id, created_at
    from table1
    order by user_id, created_at
  ) as x
) as src
set dest.`order` = src.`order`
where dest.id = src.id;

Let’s break this down:

  1. We are using database session variables to keep track of some state. We default to @user_id = 0 and @order = 1
  2. Working from from the inner-most select, we select the id (primary key), user_id (int)  and created_at (timestamp) as these are the columns needed to do our calculations for the new order value. We order these by user_id and created_at to get the sort ready. This dataset is aliased as x (yes I know I’m terrible at naming)
  3. The next select up pulls the same id, user_id and created_at from the inner-most select and does some calculations persisting the state in the session variables we setup in step 1. The actual work here is:
    1. Calculating the order: @order := if(@user_id = x.user_id, @order + 1, 1) . The := is an assignment in SQL. We set the @order to either a default of 1, or the @order + 1 .
    2. Comparing user to determine what to set the @order to. This is done in a conditional expression:  if(@user = x.user_id). Because the @user_id is set to the x.user_id value on each row, the comparison should return true until we get to the next user, at which point the @order will reset to 1.
  4. Finally the outer update takes the resulting src dataset and does some updating. We constrain to where dest.id = src.id to get the correct record, and then we set dest.order = src.order which is our calculated value.

Performance is looking great. On a dataset of 300k records, this runs in 3 seconds on a 4 CPU Docker container.

Bonus: If you have additional criteria (like only sorting favorite records for example) this can be easily applied to the inner-most SELECT in the where clause!

Happy ordering!

Advertisements
Computers, Open-source, Ruby, Software, Thoughts, Web

new Job().beginTraining()

It has been an exciting month for me. I put in my notice with Influence Health at the beginning of June, and served through the end of the month. After that took 2 weeks off before my training begins for Doximity on July 16th. The training is in San Francisco, and then immediately followed up with a July 30th co-location in Boulder, Colorado. This position is remote, aside from the several co-locations done each year. I am excited to start with a new team, on a new project, with a mix of old and new technologies.

Over the career at Influence Health I don’t feel that I got much deeper in my knowledge of Rails. What I feel I gained instead was the breadth of general programming skills. I configured Github repos, setup Jenkins, scripted a Hubot instance to assign pull requests, and made a continuous integration system. I created new policies following best practices to open a pull request with a change, write unit tests, and have one other person on the team review your changes before merging. I implemented linting, and worked with one of my coworkers to bring Webpack into Rails to rethinking how we manage Javascript. I also went very deep into the AWS services touching S3, Lambda, RDS, Redshift, Data Pipelines, Batch, Step, ELB, EC2, ECS, Cloudfront, and into other technologies like PostgreSQL, Docker, ElasticSearch, Capistrano, EventMachine, and Daemons. Being exposed to all of these new services has made me consider different approaches to traditional problems, and I feel has made me a better developer.

The new job at Doximity sheds my managerial role that I was voluntold to do at Influence Health. I thought I might have enjoyed it (maybe I would have under better circumstances). At the end of the day it wasn’t being a manager that killed the deal for me. It was being a manager, while still being a tech lead, while being an architect, a core contributor, and many other things. To manage well is a full time job. Tacking it onto an existing role made me feel inadequate as a manager, and I don’t like that feeling. So with the managerial role off my title the new role is back to software developer, and I’m ok with that. The compensation is right, and I felt like I was getting further away from the code than I wanted to be. At the end of the day developing something and seeing it work is what drives me. There is a technical lead track that I might pursue in several months if I feel like I am ready.

The technology stack is a mixture of Ruby and Javascript. After working with Javascript more heavily the last 6 months I have mixed feelings. I’m definitely excited because I do think the future of web development has coalesced on Javascript. And Javascript has risen to the challenge and gotten a lot better. Gone are the imposter “rich internet applications” like Silverlight, and Flex. Gone are the browser plugins for languages like Java and Flash. Javascript just works. And the browsers are really blazing trails, even Microsoft, so I believe that learning Javascript is a solid career investment. There is an excitement in the ecosystem (a little too excited imo, but I’ll take that over being dead)

Popularity aside, Javascript has less magic than Ruby which is a again, both good and bad. I appreciate seeing require statements, and knowing with absolute certainty that private methods are private. In Ruby for everything you can do to protect something, someone else can (and I find frequently does) find a way to circumvent it. I especially appreciate the strong linting culture that mitigates entire debates on code style.

I find the syntax of Javascript to be unattractive coming from Ruby, but it is more consistent. All of the parenthesis, semicolons, etc are just noisy. The surface area of Javascript is also much smaller which leads everyone to jump to utility libraries like Lodash, Underscore, etc. The language just needs to keep maturing and building in common methods. Date manipulation in particular is atrocious. async/await seems like we finally have a clean syntax for managing asynchronous code.

I do still feel like we are fitting a square peg into a round hole by having client side, single page applications. This isn’t the way the web was designed (it was made for request/response), and the fat client pattern still feels immature. Having a client side framework like Angular, (or even a library like React) does take care of managing some of the complexities . GraphQL takes the sting out of fetching all the combinations of data you might want from the server. Auth is taken care of with JWT and services like Auth0.

On the server side, using Node has been a mixed bag. I would like to see a few big frameworks that the community standardizes on, however the mentality seems to be build your own framework from a collection of your favorite libraries. As a result you can’t just jump into a project and immediately know where things are. I do however really enjoy asynchronous execution of code. It is a little harder to write and understand but wow can it be fast. I have already seen very positive results from Ruby implementations of batch jobs that took hours converted to Javascript and taking minutes. You simply don’t wait around, and it can think of your dependency tree in a completely new way.

At the end of the day I am excited but a little cautious. Ruby + Javascript sounds like a killer combination if you use each tool for what it does best. I don’t see the Ruby community lasting another decade so this is the perfect transition point to jump into Javascript. And I’m glad that it was Javascript that won out over Flex, Silverlight, JSP, etc. At least for the next 5 years until the new shiny technology comes out and people jump ship.

quotes, Thoughts, Uncategorized

Fahrenheit 451

Everyone must leave something behind when he dies, my grandfather said. A child or a book or a painting or a house or a wall built or a pair of shoes made. Or a garden planted. Something your hand touched some way so your soul has somewhere to go when you die, and when people look at that tree or that flower you planted, you’re there. It doesn’t matter what you do, he said, so long as you change something from the way it was before you touched it into something that’s like you after you take your hands away.

  • Ray Bradbury
Ruby, Software

Refactoring With the Observer Pattern

What is the Observer Pattern?

Have you written a system that has the concept of “events” that need to notify users? Perhaps you send an email on a system error to an administrator. Or send a “password changed” event notification to a user? In a CRM perhaps clients can subscribe to new customers as they come into the system as the result of marketing efforts?

This can be represented using the Observer pattern. Your users have subscribed to one or more events, and they are “observing” new events. The events themselves are “observable” via these event subscriptions.

This blog post will walk through refactoring an existing event notification system that did not use the observer pattern, and instead relied on a multiple passes to calculate an intersection of users and events that ran periodically via cron.

Old pattern

We have a CRM that is written in Ruby on Rails. When a database record is generated for something we want to create an event for, we use the ActiveRecord callback to create a second “event” record after save. This event record is associated to the source record and includes some contextual information like a type “NewLead”, or “SystemError”, and whether or not it has been processed (defaulted to false). Event also includes some hierarchical attributes so we can enforce authorizations in a multi-tenant application.

Once the event records are created, we had a cron entry that would fire every minute and check for new events. This would do a database query looking up unswept events, and looking up all active subscriptions. Doing some intersection logic, we would then generate an “event subscription” record for each event that can belong to an intersection. This did not scale particularly well because a generic event could potentially go to hundreds of subscribers, which would mean a database table entry per subscriber. This information is also not valuable after the notification is sent.

Once the events are swept, and “event notifications” records are generated, we could then send out an email. For more usability we have a digest feature where emails are not sent immediately when events come in. Rather we collect for a period of time, then send a summary of events that have occurred since the last email was sent.

This system did not scale due to the volume of events and subscriptions. We had issues where long inserts, and large datasets caused deadlocks in the database and out of memory errors on the server. The goals were to keep the events, and the subscriptions, but optimize the wofklow so that we could scale further, and have less transient data stored in the database. It goes without saying, but the replacement needed to be robust, and easy to reason about with good test coverage as well.

New pattern

The Observer pattern I mentioned a few minutes ago was the right fit for this particular problem. Instead of a cron initiated loop with lots of state being persisted we would move to an in-memory event loop and stream the events to the subscriptions.

One wrinkle with the observer pattern was the email digest requirement. Observers are great at taking events and then processing them immediately. Building a collection required some investigation, but using Event Machine and periodic timer I was able to preserve this functionality as we will see.

The event system

There are two primary areas – the observable, and the observer. In Ruby, we can compose the Observable pattern by including the module into our class. This sets up a pseudo interface in Ruby that we can implement in our observable and observers. Of course Ruby doesn’t have true interfaces so this is more of a guideline than a compile time check.

The observable class:

require 'observer'
module Events
  class Observable
    include ::Observable
    attr_accessor :last_tick_at

    def initialize
      self.last_tick_at = Time.zone.now
      observers.each { |observer| observer.register(self) }
    end

    def run!
      changed_subscriptions.each do |subscription|
        subscription.reregister(self)
      end

      new_events do |event|
        changed
        notify_observers(event)
      end
    end

    private

    def new_events(&block)
      Event.yield_and_sweep(&block)
    end

    def changed_subscriptions
      Subscription.where("updated_at >= ?", last_tick_at).tap do
        self.last_tick_at = Time.zone.now
      end
    end

    def observers
      Subscription.only_active.not_paused
    end
  end
end

The observable class will be instantiated by our task runner. During the initialization it grabs the current time, and registers its observers (subscriptions). Its method run! will be invoked periodically and is responsible for finding and sweeping new events, and sending out to the observers. It also checks if any subscriptions have been added, changed, or deleted via the updated_at timestamp which is compared against the last run. These subscription observers have to re-register with the observable to reflect their changes.

Next up is the runner::

require 'eventmachine'

module Events
  class Runner
    def start
      EM.run do
        observable = Events::Observable.new
        run_timer = EventMachine::PeriodicTimer.new(0) { observable.run! }
        run_timer.interval = 1.minute.to_i
      end
    end
  end
end

This runner instantiates the observable class and starts an Event Machine loop. Within this loop we add a periodic timer every minute to run the observable. As we saw in the observable, the run! method will register its observables, and send them events.

Finally, we have the observer class:

module Events
  module Observer
    attr_accessor :observed_events, :timer

    def register(observable)
      observable.add_observer(self)
      self.observed_events ||= []
      self.timer = EventMachine::PeriodicTimer.new(self.wait_interval) { process_events }
    end

    def unregister(observable)
      observable.delete_observer(self)
      self.timer.try(:cancel)
    end

    def reregister(observable)
      unregister(observable)
      register(observable)
    end

    def update(observed_event)
      return unless authorized_for_context(observed_event)
      self.observed_events.push(observed_event)
    end

    def process_events
      return unless observed_events.present?
      EventMessageMailer.delay(queue: :email).event_message(observed_events.map(&:id), self.user)
      observed_events.clear
    end

    private

    def authorized_for_context(event)
      return false unless self.event_type.blank?            || self.event_type == event.type
      return false unless self.agency_id.blank?             || self.agency_id == event.agency_id
      return false unless self.client_id.blank?             || self.client_id == event.client_id
      return false unless self.business_unit_id.blank?      || self.business_unit_id == event.business_unit_id
      true
    end
  end
end

This is a module that can be mixed in to an ActiveRecord “Subscription” model. There are some assumptions that the model will have certain attributes pertaining to the hierarchy that the subscription is for (agency_id, client_id, business_unit_id) and an attribute for what type the subscription is. In order for an event to be collected the event that is received must pass authorization checks, and must be for the appropriate event type.

Of note is the wait_for_interval. This is logic around the digest functionality. Each subscription has an attribute for how long to wait. A value could be 5 minutes, 1 hour, 1 day, etc. The process_events call will be fired when the PeriodicTimer fires on the specified interval. This clears out events when they are processed.

Lets see a diagram of how all this will work:

Screen Shot 2017-10-20 at 10.26.13 AM

Not shown is the EventMailer. This is just a standard ActiveRecord mailer however that takes an array of events, and shows them as a list. These are then sent to the user’s email address via a background job.

Observable Benefits

There are a few benefits of using the observable pattern that we replaced our existing code with:

  • Efficiency – Subscriptions are not recalculated every time we sweep for events.
  • Storage size – Because the event runner continues to run in the foreground, we don’t have to worry about persisting a lot of in between state. The concept of event subscriptions lives in memory. Messages are constructed on demand and emailed asynchronously.
  • Codebase size – Almost all of the cron entries, task runners, sweepers, and calculations on the event and subscription intersections have been removed because they aren’t necessary. We now have zero cron entries, and one task runner that is responsible for picking up new events and sending to the observers.
  • Agreed upon pattern – While only a benefit if you know the pattern, there is a much greater chance someone will know what an observer is instead of a custom solution
  • Interface checking – Because we are using the Ruby Observable module, we get some safety in adding new observers, and when changing our observable. If the observer doesn’t implement the update method, an exception will be raised when registered with the observable.

Task runner

Now that we have implemented the observer pattern, we need some scripting to have this run in the foreground and recover from any crashes. For this I used the Daemon gem:

require 'daemons'
require File.expand_path('../../config/environment',  __FILE__)
require Rails.root.join('app/models/events/runner')

Daemons.run_proc('events', {
  dir:       Rails.root.join('tmp/pids/'),
  dir_mode:  :normal,
  log_dir:   Rails.root.join('log/'),
  backtrace: true,
  monitor:   true
}) do
  logger = Logger.new(Rails.root.join('log/events.log'))
  Events::Runner.new(logger: logger).start
end

The observable can now be instantiated with a script/events.rb start command. In the event of a crash, I get log output of the stacktrace, and an automatic restart.

What is next

Currently, there isn’t a mechanism to serialize events that are in memory for subscriptions. This was due to the fact that most are minutely, and because the service so rarely gets restarted, and the risk of not receiving a notification was low this problem was left for later. Likely the approach would be trapping a SIGTERM, or SIGKILL signal, and persisting the unprocessed events for each given observable in a way that is recoverable on restart. A text file is probably sufficient, as this data isn’t valuable long term.

Learn More

If you want to learn more about the observer pattern, I recommend Head First: Design Patterns. The content is good, but the 1990’s stock photos will make you groan.

Additionally, SourceMaking has great information on many design patterns including the observer.

 

Apple, Computers, Hardware, Linux, Software, Windows

Using Synergy Over VPN

I’ve been watching a lot of Linus on Tech and one of their sponsors sells a product called Synergy. More information on their product page: https://symless.com/synergy . To summarize, this is a software KVM (no video, so I guess it is “KM”?) solution. The use case was perfect for me. On my desk I have an Apple work laptop, a Windows desktop, and a Linux personal laptop. Anyone that has done extensive work on a laptop keyboard and touchpad know that it isn’t optimal. I didn’t want multiple keyboards and mice on top of my desk because of the clutter. I dropped some money for Synergy, and it just works!

20170731_110826_HDR.jpg

That is until I had to connect to our company VPN the next week. They use a full tunneling solution. When I connect, I lose everything. I can’t print, I can’t access my NAS, but most importantly I can’t access my keyboard and mouse. (The video is fine because it is a hard wire to an external monitor). What to do?

SSH to the rescue! What is SSH? This is a protocol that will allow one computer to securely interface with another computer. Secure SHell. However, we will just be using it for port forwarding, and not for an interactive session. The goal is to take the OS X machine (Synergy client), and SSH into the Windows server (Synergy server). Using this SSH connection, we can forward ports within it. It is a tunnel running inside the SSH connection. This will expose a local port on OS X for 24800 that is actually pointing to the remote server port 24800. This is the port that Synergy uses for its connections.

You will need a few tools, and a little patience. Having just gone through this, I’m sharing for posterity, or maybe for anyone that has thrown in the towel with how crippled VPN makes accessing home devices.

I have the following Synergy setup:

  • Windows 10 Synergy server (keyboard and mouse are physically connected to the desktop)
  • OS X Synergy Client
  • Linux Synergy Client
  • Router with a local area network all these devices share
  • Admin access to the router for port forwarding
  • Autossh package for OS X (available via brew)

First step, get Windows 10 up to speed with SSH. How this isn’t built in as a service in the year 2017 I have no idea. Grab the OpenSSH server package for Windows from https://www.mls-software.com/opensshd.html . After downloading, extract and run the setup file. This will create a new Windows service for OpenSSH that will run on port 22. It prompts you to generate an SSH key for the server.

Once this server is running, you will need to add your user to the list of SSH users. Open up PowerShell as an administrator and change into the C:\Program Files\OpenSSH\bin directory. Run the following commands:

mkgroup -l >> ..\etc\group
mkpasswd -l >> ..\etc\passwd

Try and connect to your SSH server from the OS X client:

ssh <user>@<server IP> # e.g. ssh Ben@192.168.1.95

You should be prompted for your Windows password. Once you can successfully login to the server, we can setup public key authentication. This removes the need for you to type in your password because you identify yourself with an SSH public key. From your OS X machine get your public key:

cat ~/.ssh/id_rsa.pub

Put the contents of this file on your SSH server in the file C:\Program Files\OpenSSH\home\<user>.ssh . This is actually a symlink to C:\Users\<user>.ssh . If the directory .ssh doesn’t exist, you will need to create it first. Now we need to configure the server to allow public key authentication. Edit the C:\Program Files\OpenSSH\etc\sshd_config file and change the following lines:

StrictModes no
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys

Restart the OpenSSH server for the changes to take effect:

net stop opensshd
net start opensshd

You should now be able to SSH into the server same as before but without being prompted for a password.

Now we are ready to create an SSH tunnel. Before we incorporate AutoSSH (which handles retries and monitoring) we will do a naive attempt to SSH. In the following command:

  • -f backgrounds the process
  • -L does port tunneling in the format of <local port>:<remote host>:<remote port>
  • -N do not run a command – just tunnel the port
ssh -f <user>@<remote public IP> -L 24800:<remote public IP>:24800 -N

If this works, you should see a [LISTEN] entry for port 24800 when you list open files:

lsof -n -i | grep 24800

You may need to set your server as the DMZ on your network. Or to be safer you can simply setup port forwarding. We will need port 22 and port 24800 to resolve to the Windows server. The instructions for how to do this on a router widely vary by vendor. Typically it is under a WAN section. It typically prompts for a port, a destination IP, and destination port, and protocol. You want ports 22 and 24800 to route to your server IP for TCP and UDP.

Configure your Synergy client to use localhost instead of the remote IP. You should now be able to operate your client from the server’s peripherals via Synergy.

Everything works great until the VPN connection is made. The reason is that the SSH connection is severed. In order to recover automatically, I have added autossh to persist this tunnel. On the OS X client instead of running SSH do the following:

AUTOSSH_POLL=10 autossh -M 20000 -f -N <user>@<remote public IP> -L 24800:<remote public IP>:24800

Now when a VPN connection is made, or a disconnection happens, the autossh package will detect that it is no longer alive and retry. Because Synergy’s software also retries, after a few seconds your connectivity should begin working again.

Thanks to Synergy for making a solid product, and for having first class Linux support.

Computers, Software

Moving Data in AWS

I have been a long time user of Amazon Web Services, but I only recently started using the Data Pipeline service they offer to handle ETL needs. The service provides a cloud ready, low cost, turn key (in some cases) solution to moving data within your services. I had a difficult time getting up and running, partly due to the lack of discussion online about this service, so I want to share my experience, offer some best practices, walk through how I developed our pipelines.

Anyone with an AWS account can use Data Pipelines. But be careful there there is no BAA agreement if you are in the healthcare industry and are passing around PHI data. Fortunately, our needs do not yet require us to move around PHI information.

The first step in my ETL journey was first formalizing what data needed to go where. Before I even opened up the Data Pipeline service, I had to understand our specific needs. I identified two use cases.

Archiving Old Data

RDS instances have a max allowable size for the database and ours was getting full. The approach I took was to look at our largest tables, and focus on reducing those first. I ran some queries to understand what is using the most space: http://stackoverflow.com/a/9620273/802407

Once I had a list of my largest tables, I could classify them and assign retention rules. Some tables I decided to leave in place, and others I decided were transient data, and could be archived. (HIPAA mandates a 7 year data retention policy, so no luck just deleting). We decided as a business that different classifications could live within our application for differing time frames. Once timeframes were established, I could then write a data pipeline, and move any data older than our cut off date for that table to a storage solution outside of the database. We chose to house MySQL backups on S3 in encrypted buckets.

Fortunately the Data Pipeline service provides a number of templates to help you get started. Navigate to https://console.aws.amazon.com/datapipeline .  I found the templates good starting point, but there are some frustrations that I will mention below in the “Quirks” section. Click “Create Pipeline”. I used the template “Full Copy of RDS MySQL table to S3”. I filled in the parameters, and edited the pipeline in “Edit in Architect” mode.

Since I wanted to archive old data, I modified the SqlDataNode’s Select Query to be only records older than my retention policy:

select * from #{table} 
where #{myRDSTableLastModifiedCol} <= 
date_sub('#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' , 
  interval 6 month)

This will select records only older than 6 months from the pipeline scheduled start time. The template then moves these to S3. There are two parameters, defined in the parameters section for “#{myRDSTableName}”, and “#{myRDSTableLastModifiedCol}”. I supplied my table name, and the updated_at datetime column for my records.

I added a new SqlActivity dependent on the CopyActivity named “Delete old records”. Once they move to S3, I want to delete them from the database table. This activity “Depends on: RDSToS3CopyActivity” so if saving to S3 fails, the records are left untouched. I added the following script to mirror my select query above, but deleting the records instead:

delete from #{myRDSTableName} 
where #{myRDSTableLastModifiedCol} <= 
date_sub('#{format(@scheduledStartTime, 'YYYY-MM-dd HH-mm-ss')}' , 
  interval 6 month)

I would recommend doing this on a test database first before you delete production records while testing your script. Because the timestamp is the same as from the select, this will be the same record set, provided you have an updated_at column that shows when a record was last touched.

Expose Data for Reporting

My other use case was getting data to the reporting server which is in Redshift. Again, there is a nice template to get started. Click “Create Pipeline” and then select “Incremental copy of RDS MySQL table to Redshift”, then “Edit in Architect”.

The first run will need to be a full copy if I want all the data in the table. After that, I can use delta copies to only move over the new data fitting your criteria. This is driven by a SQL select query, so it is easy to modify. In the SqlDataNode I can edit the Select Query to my liking. Note that I removed the conditions from the query to get all records on the first run. I changed the instance type to something more substantial (t1.micro to m1.small), and upped the timeout from 2 hours (to 8 hours). I then went in before the next run and modified the query to put back the conditions that selected the delta data, then downgraded the instance type, and timeout values to their smaller defaults.

I then ran into an infrastructure quirk where our Redshift instance was inside a VPC, and our RDS database was inside a classic instance (non-VPC). This meant that the same EC2 instance would not be able to talk to both databases since it had to be on one side or the other. Because of this limitation, I had to modify parts of the pipeline that assumed a single EC2 instance would be talking to both databases. Note that I had to edit the JSON as the Data Pipeline UI does not allow changing the resources that activities run on from the template. I created two EC2 instances – one for talking to RDS and S3, and one for talking to S3 and Redshift.

In an attempt to make things easier, Amazon provides some Python scripts that get called under the hood to reflect on your MySQL table structure and convert it to a PostgreSQL CREATE TABLE command. This didn’t work for me because of my VPC permissions issues, so I provided my own CREATE TABLE Sql in the S3StagingDataNode. This was generated using the Python script by Amazon, but I supplied the inputs manually:

#!/bin/bash
aws s3 cp s3://datapipeline-us-east-1/sample-scripts/mysql_to_redshift.py \
/tmp/mysql_to_redshift.py

for target in table1 table2 table3
do
  mysqldump -h <host> --port=3306 -u <username> --password=<password>  \
--compatible=postgresql --default-character-set=utf8 -n -d \
-r "/tmp/${target}_structure.sql" <database> $target

  python /tmp/mysql_to_redshift.py --input_file=/tmp/${target}_structure.sql \
--output_file=/tmp/${target}_create.psql --table_name=$target \
--insert_mode=KEEP_EXISTING
done

This Bash script will pull down the public file mysql_to_redshift.py. Then it loops over the target tables you want to setup pipelines for. For each table (table1, table2, table3, etc) it does a mysqldump of the table structure. It then feeds this table structure file into the python conversion utility to produce the PostgreSQL version of the table structure. The contents of the table1_create.psql file is what I copied into my “Create Table Sql” field in the Data Pipeline UI.

Note that the “Create Table SQL” is interpreted literally, and has no schema context in Redshift. Therefor if I want to create the database table in another schema, the CREATE TABLE contents need to be modified to prepend this table name with a schema qualifier. e.g. “table1” would become “staging.table” (without the quotes). The Python utility will double quote the table name if given a table name with a period. This will incorrectly create a table in the public schema: public.”staging.table”, which is probably not what desired. Check the contents of the CREATE TABLE for accuracy.

I also changed the “Insert Mode” in the S3ToRedshiftCopyActivity” to be OVERWRITE_EXISTING . This uses the primary key of the table to detect duplicate rows. Since we might modify existing records, I wanted to replace those records in Redshift when they are modified in the application.

Quirks

The data pipeline services has a number of quirks that I stumbled upon. I hope Amazon works to refine the service, and that one day these are no longer issues. But for now I observed the following:

  • I cannot add everything via the UI. Things like parameters, and EC2 resources can only be added via editing the JSON. Don’t be afraid to open it up – it is reasonably easy to follow. Hopefully support to add these objects will come to the UI in the future.
  • The default templates are a great place to understand the pipelines, but are very messy. The JSON definitions are painful to manage. Some information are stored in parameters, others are done inline. Some activities cannot have parameterized variables. Sometimes the parameter names MUST start with “my”, e.g. “myDatabaseUsername”. I found this arbitrary and frustrating. Also some parameters have a “watermark” value, a “help” value, others don’t. At least one variable started with a “*” character. No explanation why.
  • When testing a pipeline I cannot change between a scheduled pipeline, and an on demand pipeline. I have to export the JSON definition and create a new pipeline as the other type. This makes testing painful.
  • The “Execution Details” screen is hard to interpret. The filter defaults to “Activities” only, but all of my pipeline definitions start with an EC2 resource being instantiated which is filtered out. The timeframes are also strange. I needed to change the “Schedule Interval” to be “Executed start” while testing an on demand pipeline. Finally the dates need changing from the default. It will default to 2 days ago, and will include a lot of previous test runs if you are developing. They don’t seem to be sorted in any logical way either, making tracing the pipeline execution difficult at a glance.
  • While debugging, check S3 for the logs. I found log data was contained in S3 that was not referenced at all in the Data Pipeline UI. This was critical for understanding failures.
  • The visualization in the Architect mode is not particularly helpful. The only thing I can do is click on a node and see the properties on the right. I cannot view two node’s properties at once. Worse is the parameters are in a completely different section so I can only see the variable name, or the variable value at any time. I found it only useful to see the flow of execution. For editing, I would export the data pipeline to JSON and modify in a text editor outside of the UI.

Conclusion

The Data Pipeline service offers an affordable way to automate ETLs in a cloud friendly way. The proof of concept that I did under represented the amount of work it would take to get these running in production. I frequently battled with permissions, UI quirks, and timeouts. I was persistent, and eventually got these pipelines running. One running, I stripped out the passwords and committed the JSON exports into version control for peace of mind.

I want to look into the SNS failure reporting next to be alerted when pipelines fail. I am most concerned with upstream changes to the table schemas. If these don’t match the downstream schema, the pipelines will fail. To guard against this I wrote unit tests that inspect the tables and ensure no changes have been introduced upstream. If a change is introduced, the test fails with the message that it was expecting something else. This should trigger the developer to alert the data services team to modify the pipeline downstream to prepare for upstream changes. This policy has not yet been tested.

Personal, Thoughts

Pulling Away

I have felt a particular way for a long time. It has been a nebulous, vague feeling for most of it, but recently its started to coalesce in my head. Its a realization that I am pulling away.

I think it started with our second daughter, Adeline. Raising two kids is without a doubt, one of the hardest things I’ve ever done. Its incredibly rewarding, but it has taken its toll on my physically, emotionally, and mentally. At the hospital when Adeline was born my mom said to me that raising two kids is four times as hard. Extrapolating, I expect three kids is eight times as hard. That is how kids work I suppose. The chaos magnifies. I simply don’t think that I could have a third, despite strong wants some days.

I’m now 32, and the world makes less and less sense to me. Trends don’t interest me. I look at communities like Instagram, and Snapchat and I just don’t get it. Facebook has changed from my college days. What was a way to write to your friends is now a trash media outlet and everyone on it is angry. Maybe I boxed myself into some political bubble in their news feed algorithm. I’ve stopped visiting them, along with Twitter, Untapped, Yelp, Strava, and more. Its all needless.

I don’t really get the new movie trends, like the hundreds of superhero comic to movie adaptations. The first few were neat, but we keep remaking them.  I don’t follow sports. I have a hard time listening to new music on the radio. New fashions like vaping and skinny jeans just irritate me. Video games have gone to shit now that publishers are wrapped up in multiplayer and add-ons. Programming just feels like an endless reinvention of the wheel with diminishing progress.

Maybe it’s just what getting older feels like. Maybe it’s wisdom. Not falling prey to the fear of missing out. Focusing on what matters to me as an individual. Quit making everything a competition. Quit trying to impress others with checkins, and pictures showing how much fun you are having. Quit seeking peer approval.

Then last night the presidential election happened. It hit me in a way I didn’t expect. I educated myself on the issues, while staying away from mainstream media, talk shows, and largely out of debates with peers. I figured most of the drama was just for viewership by the media outlets. I believed that some people felt differently from me, but more were like me than not like me. But when the polling results came in, I found myself angry, confused, fighting back tears. It meant more than past elections. I’ve voted for losing candidates, and went on with life. This one stung. To me it was a signal of the end of what I understood our country to be.

I’ve been in this mindset for months now, and the election pushed me to write about it. I needed to get the way I’m feeling out of my head and down on paper. Is this just aging? Is this stress? Diet and exercise? Is this objectively where we are as a civilization, and everyone is waiting for it to change just like me? Have I just become bored, or complacent with my life?

I certainly feel old, tired, slow. Unnaturally. And I’m not sure what to do about it other than writing a cathartic blog post.