Saturday, April 20, 2013

Java implementation of Optimal String Alignment

For a while, I've used the Apache Commons lang StringUtils implementation of Levenshtein distance.  It implements a few well known tricks to use less memory by only hanging on to two arrays instead of allocating a huge n x m table for the memoisation table.  It also only checks a "stripe" of width 2 * k +1 where k is the maximum number of edits.  

In most practical usages of levenshtein you just care if a string is within some small number (1, 2, 3) of edits from another string.  This avoid much of the n * m computation that makes levenstein "expensive".  We found that with a k <= 3, levenshtein with these tricks was faster than Jaro-Winkler distance, which is an approximate edit distance calculation that was created to be a faster approximate (well there were many reasons).

Unfortunately, the Apache Commons Lang implementation only calculates Levenshtein and not the possible more useful Damerau-Levenshtein distance.  Levenshtein defines the edit operations insert, delete, and substitute.  The Damerau variant adds *transposition* to the list, which is pretty useful for most of the places I use edit distance.  Unfortunately DL distance is not a true metric in that it doesn't respect the triangle inequality, but there are plenty of applications that are unaffected by this.  As you can see from that wikipedia page, there is often confusion between Optimal String Alignment and DL distance.  In practice OSA is a simpler algorithm and requires less book-keeping so the runtime is probably marginally faster.

I could not find any implementations of OSA or DL that used the memory tricks and "stripe" tricks that I saw in Apache Commons Lang.  So I implemented my own OSA using those tricks.  At some point I'll also implement DL with the tricks and see what the performance differences are:

Here's OSA in Java.  It's public domain; feel free to use as you like. The unit tests are below. Only dependency is on Guava- but its just the preconditions class and an annotation for documentation so easy to remove that dependency if you like: