Introduction
If SQL’s LIKE comparison has ever left you wanting, this post is for you. I say this because LIKE is only the tip of the iceberg when it comes to the searching capabilities of a modern database. Signs that you are ready to move on from LIKE:
- You are attempting to pad both sides of the LIKE query with wildcards
- You are joining multiple keywords with a wildcard separator
- You are using multiple LIKE statements on the same column in your query
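Concretely, those signs tend to look something like the following (a hypothetical users table; the exact terms are only for illustration):

```sql
SELECT * FROM users
WHERE first_name LIKE '%will%iam%'  -- wildcards padding both sides, keywords joined by a wildcard
   OR first_name LIKE '%bill%'      -- multiple LIKE statements on the same column
   OR first_name LIKE '%billy%';
```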
I have been working on a project lately that uses PostgreSQL as its database, and discovered some interesting ways of locating data by using the contributed fuzzy matching functions. When searching for people in a database there are many things to consider:
- The search query may contain a typo
- The user may have an uncommon spelling of their name
- The user may have a nickname
How can we address these scenarios?
Fuzzy Matching Algorithms To the Rescue
Levenshtein is a great algorithm to detect typos in a search query. It operates based on how distant one search term is from another term. Starting with the “source” word, it counts the number of operations (insertions, deletions, and substitutions) it takes to arrive at the “destination” word. This makes the Levenshtein algorithm particularly good at catching search typos and uncommon spellings. Here is an example:
INSERT INTO users (first_name) VALUES ('William'); -- Notice the double "L"

SELECT * FROM users WHERE LEVENSHTEIN(LOWER(users.first_name), 'wiliam') < 2;
 id | first_name | last_name |         created_at         |         updated_at
----+------------+-----------+----------------------------+----------------------------
  4 | William    |           | 2011-08-21 14:54:34.513516 | 2011-08-21 14:54:34.513516
(1 row)
I am retrieving all records whose distance from the source word to the target is less than two — that is, at most one edit. The number of changes required to go from “wiliam” to “william” is one: the addition of the letter “L”. You can of course modify the query to allow a different number of changes. Note that the comparison is case sensitive: going from an uppercase letter to the same letter in lowercase counts as one substitution. For this reason, make sure you down-case everything before comparing, so that you operate on the letters only and not on their case.
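To make the “counting operations” idea concrete, here is a minimal Ruby sketch of the same dynamic-programming computation that PostgreSQL’s LEVENSHTEIN performs in C (illustration only — use the database function in practice):

```ruby
# Classic edit-distance table: dist[i][j] is the number of edits needed
# to turn the first i characters of source into the first j characters of target.
def levenshtein(source, target)
  rows = source.length + 1
  cols = target.length + 1
  dist = Array.new(rows) { |i| Array.new(cols) { |j| i.zero? ? j : (j.zero? ? i : 0) } }

  (1...rows).each do |i|
    (1...cols).each do |j|
      cost = source[i - 1] == target[j - 1] ? 0 : 1
      dist[i][j] = [
        dist[i - 1][j] + 1,        # deletion
        dist[i][j - 1] + 1,        # insertion
        dist[i - 1][j - 1] + cost  # substitution
      ].min
    end
  end
  dist[rows - 1][cols - 1]
end

levenshtein('wiliam', 'william') # => 1
```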
Another choice when locating person records is the family of phonetic algorithms: Soundex, Metaphone, and Double Metaphone. These have linguistic awareness of the English language (and of other languages, in the case of the Metaphone and Double Metaphone algorithms). Soundex is the oldest phonetic algorithm and has been deprecated in favor of the more sophisticated Metaphone and Double Metaphone algorithms. Nevertheless, if you are working with English-only names, Soundex may provide performance benefits that are worth the trade-offs.
Soundex breaks down a word and assigns its pronunciation a code that can be compared against other pronunciations. Every word passed through the SOUNDEX function returns a four-character code. DIFFERENCE, which uses SOUNDEX under the hood, compares the Soundex codes of two words and returns an integer from 0 to 4, for least similar to most similar respectively. “William” and “Willem” have pronunciations that are mostly similar. Again, you can adjust the strictness by changing the integer that you compare the result of DIFFERENCE against. Here is an example using DIFFERENCE:
SELECT * FROM users WHERE DIFFERENCE(users.first_name, 'willem') > 2;
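For illustration, here is a simplified Ruby sketch of the Soundex encoding itself. PostgreSQL’s C implementation also handles edge cases (such as the “h”/“w” separator rule) that are omitted here, so treat this as a rough picture of the idea, not a reference implementation:

```ruby
# Consonants map to digit classes; vowels plus h, w, y are dropped (nil).
SOUNDEX_CODES = {
  'b' => '1', 'f' => '1', 'p' => '1', 'v' => '1',
  'c' => '2', 'g' => '2', 'j' => '2', 'k' => '2',
  'q' => '2', 's' => '2', 'x' => '2', 'z' => '2',
  'd' => '3', 't' => '3', 'l' => '4',
  'm' => '5', 'n' => '5', 'r' => '6'
}.freeze

def soundex(word)
  letters = word.downcase.scan(/[a-z]/)
  digits  = letters.map { |l| SOUNDEX_CODES[l] }                      # vowels, h, w, y => nil
  digits  = digits.chunk_while { |a, b| a == b }.map(&:first).compact # collapse adjacent repeats
  digits.shift if digits.first == SOUNDEX_CODES[letters.first]        # first letter is kept literally
  (letters.first.upcase + digits.join).ljust(4, '0')[0, 4]            # letter + three digits
end

soundex('William') # => "W450"
soundex('Willem')  # => "W450"
```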
Metaphone takes an integer that specifies how specific it should be when calculating the similarity; the higher the integer, the more specific the comparison. While the results are the same as Soundex in this scenario, it is important to remember that the power of METAPHONE and DMETAPHONE lies in their handling of pronunciations beyond English. This is particularly beneficial when working with non-English names:
SELECT * FROM users WHERE METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2);
Double Metaphone expands on Metaphone’s capabilities, but since it has no additional configuration options, it may not be suitable for your needs. Experiment with the different fuzzy matching options to find the best results for your application:
SELECT * FROM users WHERE DMETAPHONE(first_name) = DMETAPHONE('Willem');
Nickname Matching
These methods are great when comparing data that is similar in distance or in pronunciation; however, they cannot address the use of nicknames and alternate names within the system.
Nicknames allow us to match the name “William” with its alternate names of “Bill”, “Bud”, “Will”, and “Willie”. Using the algorithms discussed so far, “Will” (and possibly “Willie”) would be the only matches. Nicknames represent relationships between names that the database needs to understand in order to address the last criterion of our search.
To accomplish this, I have leveraged the data from the Nickname and Diminutive Name Lookup project. I have created a gem that can be added to your Ruby on Rails project; it will fetch the nickname data, parse it, and insert it into a nicknames table, along with a foreign key that links related names. This gem allows you to query nicknames as follows:
SELECT * FROM nicknames WHERE nickname_id IN (SELECT nickname_id FROM nicknames WHERE name = 'william');
or in Ruby (check out the project here):
Nickname.for('William').map(&:name)
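For context, the table layout the gem builds is roughly the following — a sketch only; the gem’s actual migration is the authority, and the column types here are assumptions:

```sql
CREATE TABLE nicknames (
  id          serial PRIMARY KEY,
  name        varchar NOT NULL,                 -- 'william', 'bill', 'will', ...
  nickname_id integer REFERENCES nicknames (id) -- shared key linking related names
);
```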
We can do the following to make a comprehensive search using all methods that we have discussed so far:
-- levenshtein
SELECT * FROM users WHERE LEVENSHTEIN(LOWER(users.first_name), 'wiliam') < 2
-- dmetaphone
OR DMETAPHONE(first_name) = DMETAPHONE('Willem')
-- nicknames
OR LOWER(users.first_name) IN
  (SELECT name FROM nicknames WHERE nickname_id IN
    (SELECT nickname_id FROM nicknames WHERE name = 'william'));
or in Ruby:
# app/models/user.rb
class User < ActiveRecord::Base
  scope :levenshtein, lambda { |term| where(["LEVENSHTEIN(LOWER(first_name), LOWER(?)) < 2", term]) }
  scope :dmetaphone,  lambda { |term| where(["DMETAPHONE(first_name) = DMETAPHONE(?)", term]) }
  scope :nicknames,   lambda { |term| where(["LOWER(first_name) IN (?)", Nickname.for(term.downcase).map(&:name)]) }

  def self.fuzzy_match(term)
    levenshtein(term) | dmetaphone(term) | nicknames(term)
  end
end

User.fuzzy_match('william')
This will issue separate queries when calling fuzzy_match. That has the benefit of encapsulating each algorithm, but at the cost of database performance. If performance matters more than encapsulation, restructure the query so that these comparisons live in the same WHERE clause, as shown in the SQL version above.
TL;DR
Stop using LIKE if you need anything more than the bare minimum of comparison. Setting up these extra functions in PostgreSQL takes just a few minutes, and they greatly enhance the search capabilities available to you and your users.
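For reference, on PostgreSQL 9.1 or newer that setup is a single statement (older versions load the fuzzystrmatch contrib script shipped with the server instead):

```sql
-- Enables LEVENSHTEIN, SOUNDEX, DIFFERENCE, METAPHONE, and DMETAPHONE
CREATE EXTENSION fuzzystrmatch;
```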
this is some good stuff
This is pretty cool. Maybe one day I will have an opportunity to mess with PostgreSQL, but I don’t think it will be soon.
I want to pick on one thing here, though. I think we can agree that when performing a search, speed is everything. The longer your search takes, the less likely the user is going to wait for it to finish. I haven’t done any research into these algorithms, but based on your description, and my limited understanding of the complexity, I am going to assume that they can be a bit “slow.”
So, when I see you using an IN() clause, it makes me cringe. It also makes me think you are only creating a single table that contains both the name and the nicknames. This would be “bad.” You could speed things up quite a bit if you use a query like the following to get the list of nicknames for a given name:
SELECT b.nickname FROM names a, nicknames b WHERE a.id = b.name_id AND a.name = LOWER('William');
P.S.: Your first query uses “Wiliam” and not “William” like you said it should.
Thanks for the catch on the typo.
You are correct in assuming that these queries are relatively slow compared to LIKE, as there is a lot more computation happening under the hood. The good news is that these extensions are written in C and are community reviewed, so they should be fairly well optimized. I agree that these kinds of queries are better suited to low-volume areas of the application, as they will still become a bottleneck no matter how optimized the extensions are.
Concerning the nicknames, it is a self-referential table with primary key, name, and foreign key columns. My table of users is separate from the table of nicknames. The subquery retrieves all nicknames, and then queries users that have a name matching something in the nicknames collection. I could probably do this more efficiently with a join.
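For anyone curious, one way that join might look, sketched against the self-referential layout just described (table and column names assume that layout):

```sql
SELECT DISTINCT u.*
FROM users u
JOIN nicknames n ON n.name = LOWER(u.first_name)
JOIN nicknames w ON w.nickname_id = n.nickname_id
WHERE w.name = 'william';
```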
Have you investigated creating a text search dictionary/thesaurus for indexing? It seems like this would be a lot faster and also more versatile for cases such as “al”, which could be albert or allen. I would like to do something like this, but wanted to see if it’s already been done before reinventing the wheel.
@Moshe I looked at creating a search dictionary for nicknames; however, if I recall correctly, the limitation is that there can be one and only one lexeme for each name. For example, William could have a lexeme of Bill, but could not also have a lexeme of Liam. I decided to go with the table approach because some names can have multiple nicknames. Great idea though, and perhaps it can be done, but a cursory glance suggests this is a hard limitation.