Apple, Computers, Linux, Open-source, Ruby, Software, Thoughts, Web

PostgreSQL for Ruby on Rails on Ubuntu

My new desktop came in at work this week, and the installation was painless thanks to the great driver support of Ubuntu 11.10. For anyone setting up a Rails development box based on Linux, I have some tips to get around some pain points when using a PostgresSQL database.

Installation:

Postgres can be quickly and easily installed using apt-get on Debian or Ubuntu based distributions. Issue the command:

apt-get install postgresql

Ruby Driver

In order for Ruby to connect to PostgreSQL databases, you will need to install the pg gem. This gem will need the development package of PostgreSQL to successfully build its native extension. To install the PostgreSQL development package, issue the following command:

apt-get install libpq-dev # EDIT: postgresql-dev was replaced by this package on Ubuntu 11.10

Setup A PostgreSQL Role

You can configure PostgreSQL to allow your account to have superuser access, allowing your Rails tasks to create and drop databases. This is useful for development, but is strongly discouraged for a production. That being said, we can create a PostgreSQL role by logging into psql as postgres as follows:

su postgres -c psql

This will open a PostgreSQL prompt as the database owner postgres. Next, we need to create an account for our user. This should match the response from “whoami”:

create role  superuser login;

We can now exit from psql by issuing “q“. Try to connect to psql directly by issuing the following command from your shell account:

psql postgres

This should allow you to connect to the default database postgres without being prompted for credentials. You should now be able to issue the rake commands for creating, and dropping the database:

rake db:create

Rspec Prompts for Credentials

I was being prompted by Rspec for credentials when running my test suite. If you would like to remove this credential prompt, please read the following:

There are differences in how the PostgreSQL package is configured in Homebrew on OS X, and how it is packaged in the Ubuntu and across other distributions. One difference is in the level of security configured in the pg_hba.conf file. This file is responsible for identifying which sources using which authentication mechanisms should be allowed or denied. By default, Rspec will cause a prompt for a password even if your shell account has trusted permissions. This is because Rspec connects not as a local process, but to localhost. To allow connections to localhost to be trusted, you will need to modify the pg_hba.conf file.

Next, we can modify the pg_hba.conf file located at /etc/postgresql/<version>/main/pg_hba.conf

Comment out the lines any lines at the bottom of the file and append the following:

local   all             all                                      trust
host    all             all              127.0.0.1/32            trust
host    all             all              ::1/128                 trust

This will allow connections from the shell, as well as connections to 127.0.0.1 (localhost) using both IPv4 and IPv6.

You will need to restart PostgreSQL for the changes from this file to take affect:

/etc/init.d/postgresql restart

PostgreSQL Extensions

If you want to make use of any of the additional extensions to Postgres, including fuzzystrmatching, you will need to install the postgresql-contrib package:

apt-get install postgresql-contrib

The extensions will install to /usr/share/postgresql/<version>/extension/

Using the Postgres version 9, you can create these extensions in your database by using the new CREATE EXTENSION syntax. In the case of the fuzzystrmatch extensions, you can issue the following command from inside a PostgresSQL command prompt to load the extensions:

psql ;

Once inside your database:

create extension fuzzystrmatch;
Computers, Open-source, Ruby

Fuzzy Matching in PostgreSQL with Nicknames

Introduction

If you have had the LIKE comparison in SQL leave you wanting, this post is for you. The reason I say this is because LIKE is the tip of the iceberg when it comes to the searching capabilities of a modern database. Signs that you are ready to move on from LIKE:

  • You are attempting to pad both sides of the LIKE query with wildcards
  • You are joining multiple keywords with a wildcard separator
  • You are using multiple LIKE statements on the same column in your query

 

I have been working on a project lately that uses PostgreSQL as its database, and discovered some interesting ways of locating data by using the contributed fuzzy matching functions. When searching for people in a database there are many things to consider:

  • The search query may contain a typo
  • The user may have an uncommon spelling of their name
  • The user may have a nickname

 

How can we address these scenarios?

Fuzzy Matching Algorithms To the Rescue

Levenshtein is a great algorithm to detect typos in a search query. It operates based on how distant one search term is from another term. Starting with the “source” word, it counts the number of operations (additions, subtractions, substitutions) it takes to arrive at the “destination” word. This make the Levenshtein algorithm particularly good at catching seach typos, and uncommon spellings. Here is an example:

INSERT INTO users (first_name) VALUES ('William'); --Notice the double "L"
SELECT * FROM users WHERE LEVENSHTEIN(LOWER(users.first_name), 'wiliam') < 2;
id first_name last_name created_at updated_at
4 William   2011-08-21 14:54:34.513516 2011-08-21 14:54:34.513516

(1 row)

I am retrieving all records where the result has a distance of two or less from the source word to the target. The number of changes required to go from “wiliam” to “william” is one addition of the letter “L”. You can of course modify the query to change the number of changes allowed. Note that the query is case sensitive. Going from an uppercase letter to the same letter in lowercase will count as 1 substitution. For this reason, make sure you are down-casing everything before comparing to operate on letters only, and not case.

Another choice when locating person records is the phoenetic algorithms of Soundex, Metaphone, and Double Metaphone. These have linguistic awareness of the English language (and other languages in the case of the Metaphone, and DMetaphone algorithms). The Soundex is the oldest phonetic algorithm, and has been deprecated in favor of the more complex Metaphone and DMetaphone algorithms. Nevertheless, if you are working with English only names, Soundex may provide performance benefits that are worth the trade-offs.

Soundex breaks down a word, and assigns the pronunciation a value that can be compared against other pronunciations. Every word when passed through the Soundex function returns a four digit value, resulting in each word being the values of 0 and 4 for least similar, to most similar respectively. DIFFERENCE uses SOUNDEX under the hood. “William” and “Willem” have a pronunciation that is mostly similar. Again, you can adjust this setting by changing the integer that you compare the result of DIFFERENCE against. Here is an example of METAPHONE:

SELECT * FROM users WHERE DIFFERENCE(users.first_name, 'willem') > 2;

Metaphone takes an integer which specifies how specific it should be when calculating the similarity. The higher the integer, the more specific the comparison. While the results are the same as Soundex in this scenario, it is important to remember that the power of METAPHONE, and DMETAPHONE are in the pronunciations of these characters in other languages. This is particularly beneficial when working with non-English names:

SELECT * FROM users WHERE METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2);

Double Metaphone expands on the METAPHONE capabilities, but since it has no additional configuration options, it may not be suitable for your needs. You should experiment with the different fuzzy matching options to get the best result for your application:

SELECT * FROM users WHERE DMETAPHONE(first_name) = DMETAPHONE('Willem');

Nickname Matching

These methods are great when comparing data that is similar in distance, or in pronunciation, however it will not be able to address the concern of the use of nicknames, and alternate names within the system.

Nicknames allow us to match the name “William” with its alternate names of “Bill”, “Bud”, “Will”, and “Willie”. Using the algorithms discussed so far, the name “Will” (and possibly “Willie”) would be the only results of this match. Nicknames represent relationships between names that a database will need to understand to address the last criteria of our search.

In order to accomplish this, I have have leveraged the data from the Nickname and Dominuitive Name Lookup project. I have created a gem that can be added to your Ruby on Rails project that will fetch the nickname data, parse it, and insert it into a nicknames table, along with a foreign key that references related names. This gem will allow you to query nicknames using the following query:

SELECT * from nicknames WHERE nickname_id IN (SELECT nickname_id FROM nicknames WHERE name = 'william');

or in Ruby (check out the project here):

Nickname.for('William').map(&:name)

We can do the following to make a comprehensive search using all methods that we have discussed so far:

-- levenshtein
SELECT * FROM users WHERE LEVENSHTEIN(LOWER(users.first_name), 'wiliam') < 2 
--dmetaphone
OR DMETAPHONE(first_name) = DMETAPHONE('Willem')
--nicknames
OR LOWER(users.first_name) IN (SELECT name from nicknames WHERE nickname_id IN (SELECT nickname_id FROM nicknames WHERE name = 'william'));

or in Ruby:

# app/models/user.rb
class User < ActiveRecord::Base
  scope :levenshtein, lambda {|term| where(["LEVENSHTEIN(LOWER(first_name), LOWER(?)) < 2", term])}
  scope :dmetaphone, lambda {|term| where(["DMETAPHONE(first_name) = DMETAPHONE(?)", term])}
  scope :nicknames, lambda {|term| where(["LOWER(first_name) IN (?)", Nickname.for(term.downcase).map(&:name)])}

  def self.fuzzy_match(term)
    levenshtein(term) | dmetaphone(term) | nicknames(term)
  end 
end

User.fuzzy_match('william')

This will generate separate queries when calling fuzzy_match. This has the benefit of encapsulating each algorithm, but at the cost of worse database performance. If performance is desired over encapsulation, restructure the query to include these comparisons in the same WHERE clause, as is shown in the SQL method.

TL;DR

Stop using LIKE if anything more than the bare minimum comparison is needed. The time involved in setting up these extra functions in PostgreSQL is just a few minutes, and it will greatly enhance the search capabilities available to you and your users.

Additional references:

Computers, Open-source, Personal, Ruby, Software, Thoughts, Web

Another Helping of Abstraction, Please

Rails 3.1 is soon to be released, and with it comes two new abstraction libraries – CoffeeScript, and S(ass)CSS. These libraries are used to generate Javascript code, and CSS definitions respectively. While abstraction libraries are nothing new to Rails, the inclusion of two more got me thinking about the direction that Rails stack is heading.

CoffeeScript’s syntax seems to be to make Javascript as Ruby-ish as possible. It describes Javascript’s curly braces and semicolons as embarrassing.

SCSS aims to address some of the repetitive elements of CSS through the use of variables, nesting, and mixins. This feels more acceptable to me than CoffeeScript, but my first encounter left me burned.

A few other abstraction libraries of relevance: Haml aims to generate HTML without the use of HTML tags. Additionally, Squeel‘s (MetaWhere 2.0) aim is to remove all SQL from your projects.

So what am I bitching about? Abstraction is a good thing right? I see two categories of abstraction. The first being the “good” kind, that allow you to be completely ignorant of the underpinnings. For example, Ruby converting down into machine code.

The “bad” kind of abstraction are the substitution of a language with a DSL. This creates a lot of issues starting with development and debugging. In the case of CoffeeScript and SASS, you have to compile the DSL files into Javascript, and CSS files. I feel like this compile step is a step back from what we gain working with dynamic languages like Ruby, and Javascript to begin with.

Development in these libraries also requires that you understand both the DSL of the library, as well as being familiar with the language of the generated code. This additional skill requirement adds time to a project, and raises the entry bar for new developers. Its difficult to be proficient at a language, and a DSL that generates that language at the same time. A Ruby developer told me yesterday that he was surprised at how rusty his knowledge of SQL had gotten. Its shocking to me that a web developer would struggle with SQL, but I think its an accurate sentiment on which many Rails developers would agree.

Another casualty of abstraction is performance. Not only is the generated code sub-optimized relative to coding it by hand, it is also being run through through more system calls to get there. You can either compile up front (CoffeeScript, SASS), or you can incur this penalty on-the-fly (Haml, Squeel).

While none of the libraries are a requirement of web development, when working on a team that uses these technologies you are expected to produce consistent code. Even though these libraries let you execute “native” code, doing so is discouraged because of the availability of the DSL. The syntax for embedding native code (if its even allowed) is often cumbersome, and loses editor functionality such as syntax highlighting and parsing.

Since when did Ruby on Rails web developers stop working with SQL, CSS, HTML, and Javascript? I am beginning to feel like the Ruby camp is becoming the far left extremists of the web development world. The web is built on these core technologies, and the benefits of abstracting them doesn’t seem to outweigh the costs.

Apple, Computers, Events, Hardware, Open-source, Personal, Ruby, Software, Thoughts, Web, Windows

3 Days Down, 40 Years to Go

Yesterday at 5:00pm marked the end of my first week at Beacon Interactive Systems. My coworkers are all really nice, and there is a surprising geographic mix between them. Some folks have lived in Massachusetts their whole lives, while others come from Maryland, and Michigan. The cultural differences between “down South” and here are pretty minimal, unless you just feel like having a good laugh. There have been two big adjustments however: Snow is really not a big deal up here – people hardly notice it outside. The second is restaurants don’t have sweet tea. You would have to drink sweet tea to understand why this is a big deal.

In general:

  • The job is much less stressful. Even during crunch times, you hear Southpark and Big Lebowski quotes (“I’m not your pal, guy!”).
  • The environment is a lot less structured. You come in whenever, you leave whenever. If you want to go outside and toss around the football, go for it. Good team-builder by the way.
  • The skill sets of my coworkers are all very impressive. Its the rifle vs shotgun approach.
  • The job area is nice – its next to Harvard. Getting there is rough – I have to cut across the city. My 20 minute commute takes about an hour.
  • Developing on a Mac is an easier transition than I thought. I won’t say that I’m in love with it yet, but its workable. The biggest pain has been this silly bundled keyboard and mouse. No one else uses them. Also, package management on Mac sucks compared to Linux. I think I would actually prefer to use Linux. Time will tell on this one.
  • The coffee isn’t as good.

An interesting collision of viewpoints occurred my second day at the job, while I was shadowing a coworker on a joint project. He was showing me their (complex) system of bug detection, and correction. They write up a use case, file a ticket, branch the code, create a changset, rebase it, merge it into QA, verify it, then push it back upstream. Not coming from anything near that complex (“Hey Ben – login to the production server and change it!”) I was amazed that they spent so much time on this process. I asked if they ever just ignore a bug that would be too minimal to matter. My coworker asked me to clarify what I meant. I replied with “You know, its good enough for government.” He paused and looked at me funny, then reiterated that they address all bugs that are discovered. A bug is a bug. It will take me a while to harden my resolve to be like theirs, and aim for perfection. Perfection wasn’t possible before because we had the typical scenario of overworked, underpaid, and on a deadline.

We are moving into our new building in a few weeks. When we move, there will be a train station across the street from the new building, and I will probably make the transition to riding into work. Its about the same amount of time, but I would have the ability to sleep, read, surf the Internet, etc all without causing an accident.

Wish me luck for next week – its been a difficult adjustment.

Computers

ActiveRecord’s Secret find_by_sql Results

Ruby on Rails logo Well, its not exactly a secret. It sure isn’t well documented however. Recently, I wanted to return a query that spanned multiple database tables. I decided to go with find_by_sql because of the mind-blowing idiocy with which this legacy database was structured. I will take a watered down version of what I was attempting to do to demonstrate how we can expose some “hidden” functionality of ActiveRecord’s find_by_sql method.

 
Channel table:
------------------------------------
id | title  | description        | user_id
------------------------------------
1  | first | the first channel | 1

User table:
-----------
id | name
-----------
1  | ben

After I constructed my find_by_sql query, it looked something like this:

Channel.find_by_sql("SELECT a.*, b.name
FROM channel a, user b
WHERE a.user_id = b.id")

This query selects all columns from table a (channel), and a single column from table b (user). This is pretty standard, as many queries need to gather values from multiple table columns in a single SELECT operation.

Running this query, you will receive an array of Channel instances with all the attributes filled in for the channel model. Missing however, will be the attributes from any table other than “Channel”:

Channel.find_by_sql("SELECT a.*, b.name
FROM channel a, user b
WHERE a.user_id = b.id")
=> "[#]"

Notice how the “name” column from table b (user) is not present in the display? You can even query this attribute directly:

c = Channel.find_by_sql("SELECT a.*, b.name
FROM channel a, user b
WHERE a.user_id = b.id")
=> "[#]"
c[0].name
=> NoMethodError: undefined method 'name' for #

We could create an attr_accessor for the Channel class, and this would resolve the NoMethodError, but it still won’t be populated for our Channel instance after a find_by_sql.

After some digging around in the source code, and online, I came across this posting, which made the brilliant suggestion of looking in channel.attributes. This method will list an array of attributes that ActiveRecord knows about. Take a look at channel.attributes.keys:

c.attributes.keys
=> ["id", "title", "description", "user_id", "name"]

There it is! Our “missing” name attribute from the SELECT query has been located. Accessing the value for this attribute is trivial:

c.attributes["name"]
=> "ben"

We can do this with as many “extra” columns as we want. If two column names conflict (say channels had a column “name”, and users also had a column “name”), the database will return “name”, and “name_1” respectively. This is a really powerful feature of ActiveRecord that will encourage people to stick with the ORM, since they can still write SQL in a pinch.

Bonus: Customizing .to_json to include find_by_sql attributes

In the preceding example, the attribute “name” would not be included in the output of a “.to_json” call, as in the following example:

c.attributes.keys
=> ["id", "title", "description", "user_id", "name"]
c.to_json
=> "{"channel":{"id":1,"title":"first","description":"The first channel"}}"

This is where we can customize what is included in our JSON output. This article showed me that you can use the :methods argument with to_json to explicitly include any custom attributes, such as those that are attr_accessor objects in your class. When passing in the :methods argument, I must specify which attributes to include:

c.attributes.keys
c.to_json
=> "{"channel":{"id":1,"title":"first","description":"The first channel"}}"
c.to_json :methods => :name
=> "{"channel":{"id":1,"title":"first","description":"The first channel","name":"ben"}}"

Good job Rails team! No ugly hacks, or overrides needed today.

Computers, Open-source, Ruby, Software

Rdoc for Rails Projects

For those that don’t know, RDoc is the infamous substitution Ruby developers throw out there for “getting started”. RDoc will read code from your Ruby files, and build meaningful output in the form of HTML, chm (Windows help files), RI, and XML. The most common appearance of RDoc for me has been the HTML output. And example of this format can be found on RDoc’s page

I was recently trying to generate documentation for a Rails application that I wrote, however I couldn’t find the documentation for the built in “rake doc:app” task. My complaint with the default options is that it documents the entire Rails framework, including any gems you have included, etc. This is typically WAY TOO MUCH INFORMATION. Considering that Rails is a framework with its own documentation, combining its documentation in with your documentation doesn’t seem like a good idea. If I want to view how ActiveRecord works, I would go to http://api.rubyonrails.org/classes/ActiveRecord/Base.html, not to my generated RDoc collection.

I decided not to use the rake task, and want back to RDoc (since they have a man page) to see what my available options are. “doc:app” just calls RDoc anyways. I found that RDoc will allow you to build documentation based on a directory, instead of the whole she-bang. From the man page:

“rdoc [options] [names…]

If a name is a directory, it is traversed. If no names are specified, all Ruby files in the current directory (and subdirectories) are processed. This means that we can issue a command to look at just the “app/” folder of our project (where the majority of our created code resides) with this command:

# (From inside the Rails root directory)
rdoc -o doc/ app/*

This will create our output at doc/. Inside this directory, you will see an index.html file allowing you to build your class.

If you want to exclude files, or directories, you can use the argument “-x”. If you want to include other working directories, you can use the “-i” argument.

Here are a few other interesting things to try with RDoc:

  • –all – This will generate documentation for public AND private methods. By default, RDoc skips the private methods. This can be used to generate internal, and external documentation, where more or less information is exposed.
  • –diagram – This requires some additional libraries, but it promises to generate visual graphics that show your classes and modules. You will need to install graphviz first.
  • –line-numbers – Helpful for internal documentation, and for tasks like debugging
  • –style – You can specify your own CSS file for custom styles

As for formatting your comments, take a look at the RDoc official docs (created with RDoc!). There is a section on Markup syntax. It is very similar to the Wiki markup syntax.

Computers, Open-source, Personal, Ruby, Software, Thoughts, Web

Action-Packed ActionView Helpers

The Rails ActionView Helpers class es seems to be a gift and a curse. While its sub-classes provide a incredible amount of flexibility, its overlapping and inconsistent naming conventions and fragile parameter passing are a constant headache for me when working with forms. The sub-classes seem to have significant overlap, and if I knew enough to propose a code-change, it may just be worth taking a shot at cleaning up some of this confusion.

I think that I have yet to achieve the simple goal of having successfully constructed a form in the “first-go”. The first part (after constructing your controller and model) to constructing a form opening tag is to use one of the following methods:

form(record_name, options)
form_for(record_or_name_or_array, *args, &proc)
remote_form_for(record_or_name_or_array, *args, &proc)
form_tag(url_for_options, options, *parameters_for_url, &block)

The easiest is form where you pass it the name of a model, and it auto-magically generates your form for you. Not too shabby, but it is pretty inflexible – mostly used for prototyping so I tend to avoid it.

Next is form_for – the method I most commonly use. You can optionally pass it a block, which I will *discuss* in a moment. It is important to note that this is the recommend method per the docs if you are working with a form that represents a model in your code.

remote_form_for is the AJAXish version of form_for, however its naming convention is inconsistent with its cousins link_to_remote and button_to_remote. How about form_for_remote so I don’t pull out my hair when I get a method not found error?

form_tag is intended to help create a form when there is no model defined to represent the form data. In other words, this does very little in the way of magic (*sigh*)

Using form_for, I begin coding my form, and the next step for me is to look at the other helper methods. In part:

text_field(object_name, method, options)
text_field_tag(name, value, options)

These two methods may initially look like there is quite a bit of duplication going on inside the Rails helpers. A closer inspection reveals that the non-“tag” method is geared towards representing a model in the form_for context, whereas the other method text_field_tag is our model-less counterpart. More magic occurs here when the text_field helper method automatically includes the current objects value when constructing the tag. This is useful for sharing the same form (in a partial view) for both the create and edit views:


  

This would create a form for the User model, and generate a textbox for the username method. This is weird to me – that the method text_field would be aware of anything passed to form_for.

To make it even more terse, this can be refactored in DRYer way as:


  

Here we are using the block argument of the form_for tag. This block variable then has methods such as text_field (I assume this is an identically named method and not some other block magic). Also, note form_for assumes that our instance of a User class is contained inside an instance variable named @user – so we can omit that.

Its pretty cool to be able to manipulate forms with this amount of flexibility – three ways that are increasingly terse and assumptious. However, this mechanism of choosing your level of “integration” with a model comes at the cost of confusion to users (or at least me). Looking at the problem and trying to come up with a solution, I decided a possible alternate approach would have been to have one text_field , and and one form_for helper method (that does the work of any other similar methods) that could take in a hash of options to modify its behavior. (like ActiveRecord). A possible usage example might be:

# This will not actually work
 :user, :instance => @user. :html => {}, :url => {}, :errors => flash[:error] do |f|  %>
  # As a block method
   :username}) %>
  
  # Outside the block - passing model and attribute
   :user, :attribute=> :username}) %>
  
  # Without being associated with a model at all 
   :username, :value => '', :html => {}}) %>
  
  # As its own block
  
    
  

The hole goes deeper as we get into more advanced form helper methods such as the select:

select(object, method, choices, options, html_options)

This method is beautifully terse – but approaches being unreadable. The choices parameter (which I think would be better passed along with the other options in one argument as a hash) is typically an iteration of an ActiveRecord find, building an array each pass. For example:

select(:user, :role, Role.all.map {|r| [r.name, r.id], {:include_blank => true}})

This seems awkward to me. Perhaps if the select could be a block instead? For the block, you could indicate the population of values intended to be generated as option tags. You could then write something similar to:

 :user, :options => Role.all} do |s| %>
  
  

If this is done in a generic way, you could potentially eliminate the methods of collection_select, grouped_options_for_select, and options_for_select.

Resources:
ActionView::Helpers::FormHelper
ActionView::Helpers::FormTagHelper
ActionView::Helpers::FormOptionsHelper
ActionView::Helpers::ActiveRecordHelper
ActionView::Helpers::PrototypeHelper

Next week, I will tackle the naming discrepancies between javascript_include_tag and stylesheet_link_tag – arrgh!