Fuzzy Matching in PostgreSQL with Nicknames

Introduction

If you have had the LIKE comparison in SQL leave you wanting, this post is for you. The reason I say this is because LIKE is the tip of the iceberg when it comes to the searching capabilities of a modern database. Signs that you are ready to move on from LIKE:

  • You are attempting to pad both sides of the LIKE query with wildcards
  • You are joining multiple keywords with a wildcard separator
  • You are using multiple LIKE statements on the same column in your query

 

I have been working on a project lately that uses PostgreSQL as its database, and discovered some interesting ways of locating data by using the contributed fuzzy matching functions. When searching for people in a database there are many things to consider:

  • The search query may contain a typo
  • The user may have an uncommon spelling of their name
  • The user may have a nickname

 

How can we address these scenarios?

Fuzzy Matching Algorithms To the Rescue

Levenshtein is a great algorithm to detect typos in a search query. It operates based on how distant one search term is from another term. Starting with the “source” word, it counts the number of operations (additions, subtractions, substitutions) it takes to arrive at the “destination” word. This make the Levenshtein algorithm particularly good at catching seach typos, and uncommon spellings. Here is an example:

INSERT INTO users (first_name) VALUES ('William'); --Notice the double "L"
SELECT * FROM users WHERE LEVENSHTEIN(LOWER(users.first_name), 'wiliam') < 2;
id first_name last_name created_at updated_at
4 William   2011-08-21 14:54:34.513516 2011-08-21 14:54:34.513516

(1 row)

I am retrieving all records where the result has a distance of two or less from the source word to the target. The number of changes required to go from “wiliam” to “william” is one addition of the letter “L”. You can of course modify the query to change the number of changes allowed. Note that the query is case sensitive. Going from an uppercase letter to the same letter in lowercase will count as 1 substitution. For this reason, make sure you are down-casing everything before comparing to operate on letters only, and not case.

Another choice when locating person records is the phoenetic algorithms of Soundex, Metaphone, and Double Metaphone. These have linguistic awareness of the English language (and other languages in the case of the Metaphone, and DMetaphone algorithms). The Soundex is the oldest phonetic algorithm, and has been deprecated in favor of the more complex Metaphone and DMetaphone algorithms. Nevertheless, if you are working with English only names, Soundex may provide performance benefits that are worth the trade-offs.

Soundex breaks down a word, and assigns the pronunciation a value that can be compared against other pronunciations. Every word when passed through the Soundex function returns a four digit value, resulting in each word being the values of 0 and 4 for least similar, to most similar respectively. DIFFERENCE uses SOUNDEX under the hood. “William” and “Willem” have a pronunciation that is mostly similar. Again, you can adjust this setting by changing the integer that you compare the result of DIFFERENCE against. Here is an example of METAPHONE:

SELECT * FROM users WHERE DIFFERENCE(users.first_name, 'willem') > 2;

Metaphone takes an integer which specifies how specific it should be when calculating the similarity. The higher the integer, the more specific the comparison. While the results are the same as Soundex in this scenario, it is important to remember that the power of METAPHONE, and DMETAPHONE are in the pronunciations of these characters in other languages. This is particularly beneficial when working with non-English names:

SELECT * FROM users WHERE METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2);

Double Metaphone expands on the METAPHONE capabilities, but since it has no additional configuration options, it may not be suitable for your needs. You should experiment with the different fuzzy matching options to get the best result for your application:

SELECT * FROM users WHERE DMETAPHONE(first_name) = DMETAPHONE('Willem');

Nickname Matching

These methods are great when comparing data that is similar in distance, or in pronunciation, however it will not be able to address the concern of the use of nicknames, and alternate names within the system.

Nicknames allow us to match the name “William” with its alternate names of “Bill”, “Bud”, “Will”, and “Willie”. Using the algorithms discussed so far, the name “Will” (and possibly “Willie”) would be the only results of this match. Nicknames represent relationships between names that a database will need to understand to address the last criteria of our search.

In order to accomplish this, I have have leveraged the data from the Nickname and Dominuitive Name Lookup project. I have created a gem that can be added to your Ruby on Rails project that will fetch the nickname data, parse it, and insert it into a nicknames table, along with a foreign key that references related names. This gem will allow you to query nicknames using the following query:

SELECT * from nicknames WHERE nickname_id IN (SELECT nickname_id FROM nicknames WHERE name = 'william');

or in Ruby (check out the project here):

Nickname.for('William').map(&:name)

We can do the following to make a comprehensive search using all methods that we have discussed so far:

-- levenshtein
SELECT * FROM users WHERE LEVENSHTEIN(LOWER(users.first_name), 'wiliam') < 2 
--dmetaphone
OR DMETAPHONE(first_name) = DMETAPHONE('Willem')
--nicknames
OR LOWER(users.first_name) IN (SELECT name from nicknames WHERE nickname_id IN (SELECT nickname_id FROM nicknames WHERE name = 'william'));

or in Ruby:

# app/models/user.rb
class User < ActiveRecord::Base
  scope :levenshtein, lambda {|term| where(["LEVENSHTEIN(LOWER(first_name), LOWER(?)) < 2", term])}
  scope :dmetaphone, lambda {|term| where(["DMETAPHONE(first_name) = DMETAPHONE(?)", term])}
  scope :nicknames, lambda {|term| where(["LOWER(first_name) IN (?)", Nickname.for(term.downcase).map(&:name)])}

  def self.fuzzy_match(term)
    levenshtein(term) | dmetaphone(term) | nicknames(term)
  end 
end

User.fuzzy_match('william')

This will generate separate queries when calling fuzzy_match. This has the benefit of encapsulating each algorithm, but at the cost of worse database performance. If performance is desired over encapsulation, restructure the query to include these comparisons in the same WHERE clause, as is shown in the SQL method.

TL;DR

Stop using LIKE if anything more than the bare minimum comparison is needed. The time involved in setting up these extra functions in PostgreSQL is just a few minutes, and it will greatly enhance the search capabilities available to you and your users.

Additional references:

Advertisement

5 Comments

  1. Sam Saccone says:

    this is some good stuff

    Like

  2. This is pretty cool. Maybe one day I will have an opportunity to mess with PostgreSQL, but I don’t think it will be soon.

    I want to pick on one thing here, though. I think we can agree that when performing a search, speed is everything. The longer your search takes, the less likely the user is going to wait for it to finish. I haven’t done any research into these algorithms, but based on your description, and my limited understanding of the complexity, I am going to assume that they can be a bit “slow.”

    So, when I see you using an IN() clause, it makes me cringe. It also makes me think you are only creating a single table that contains both the name and the nicknames. This would be “bad.” You could speed things up quite a bit if you use a query like the following to get the list of nicknames for a given name:

    “SELECT b.nickname FROM names a, nicknames b WHERE a.id = b.name_id AND a.name = LOWER(‘William’);”

    P.S.: Your first query uses “Wiliam” and not “William” like you said it should.

    Like

    1. Ben Simpson says:

      Thanks for the catch on the typo.

      You are correct in assuming that these queries are relatively slow compared to like, as there is a lot more computation under the hood. The good news is that these extensions are written in C, and are community reviewed, so they should be fairly optimized. I agree that these types of queries are better for low volume areas of the application, as this will still become a bottleneck, no matter how optimized the extensions are.

      Concerning the nicknames, it is a self-referential table with primary key, name, and foreign key columns. My table of users is separate from the table of nicknames. The subquery retrieves all nicknames, and then queries users that have a name matching something in the nicknames collection. I could probably do this more efficiently with a join.

      Like

  3. Moshe Jacobson says:

    Have you investigated creating a text search dictionary/thesaurus for indexing? It seems like this would be a lot faster and also more versatile for cases such as “al”, which could be albert or allen. I would like to do something like this, but wanted to see if it’s already been done before reinventing the wheel.

    Like

  4. Ben Simpson says:

    @Moshe I looked at creating a search dictionary for nicknames, however if I recall the limitation is that there can be one and only one lexeme for each name. For example, William would have a lexeme of Bill, but could not also have a lexeme of Liam. I decided to go with the table approach because some names can have multiple nicknames. Great idea though, and perhaps it can be done, but a cursory glance suggests this is a hard limitation.

    Like

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.