I recently used the emoji search feature in an app. I typed “car” and expected to be greeted with 🚗 as a top result but found things like 🥕 (carrot) to be much higher ranking. This post explores how a search algorithm for searching a small list can be enhanced.
A search experience like this might be called substring match search because it works something like this:
- Start with a list of values
- Filter based on substring matching
- Return results
The solution is simple and for some applications with small lists it may be perfect.
There are some shortcomings, however. Misspellings are not tolerated, and ranking is nonexistent. As a result, the query “carr” might not return the car at all, and “car” might return the carrot as the first result.
I worked on a project similar to the emoji search where a simple solution was desired. The list of items to query was small (hundreds to thousands). The search could be made fast and reasonably good without introducing a new dependency like Elasticsearch.
That project started with a prototype quite similar to the substring match search algorithm. It evolved to use a distance algorithm for ranking.
Computing Distance
Let’s go back to the “car” query. Instead of simply filtering the results for anything that contains “car,” what if we had a way to compare the “car” query to each possible result and compute a similarity? If the two are too dissimilar, the result can be discarded; if not, the result can be kept with the score for later sorting.
Such an algorithm solves the two problems above:
- Small misspellings can be accounted for, as a difference of one or two characters
- The results can be ranked from most relevant to least relevant, because each results has a distance score - and results below a threshold can be excluded
Any distance algorithm is going to be slower than the simple substring search above, but for small datasets it might still be appropriate. As with all problems, knowing your requirements is key!
Jaro-Winkler Algorithm
An algorithm worth considering in these situations is Jaro-Winkler. Jaro-Winkler has two components1:
- Jaro similarity: The foundation of the algorithm is to compute a similarity score based on the number of edits.
- Prefix boost: The algorithm boosts results who share a prefix with the search query, enhancing the results.
There are many other algorithms worth considering, but Jaro-Winkler may be a nice iteration over the substring search approach.
Search Requirements
Clearly there are many different search problems, and whether a solution is good depends on the use case. Search problems of a certain scale couldn’t use an algorithm like Jaro-Winkler; it would be too expensive. Also the desired behavior varies widely in search. For instance, a use case might not tolerate misspellings, but does require ranking with a prefix boost.
What was fascinating to me, in playing with search a little, is the way an algorithm can be written and tweaked for a use case; for instance, I had started with Jaro similarity score and then added a prefix boost and tuned that for my use case. It’s a fun world to tinker with!
If you have never played with search, give it a try! You could grab a list of emoji and see what algorithm makes for a nice search experience for you.
-
See the Wikipedia entry if you want more details. ↩