cross-posted from: https://chrastecky.dev/post/15
PHP has long had a levenshtein() function, but it comes with a significant limitation: it doesn’t support UTF-8.
If you’re not familiar with the Levenshtein distance, it’s a way to measure how different two strings are — by counting the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
For example, the following code returns 2 instead of the correct result, 1:
var_dump(levenshtein('göthe', 'gothe'));
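The result of 2 comes from levenshtein() comparing bytes rather than characters. A small sketch, assuming UTF-8 strings and the mbstring extension, makes the mismatch visible:

```php
<?php
// "ö" (U+00F6) is encoded in UTF-8 as two bytes: 0xC3 0xB6.
$accented = "g\u{00f6}the";
$plain = 'gothe';

var_dump(strlen($accented));             // int(6): length in bytes
var_dump(mb_strlen($accented, 'UTF-8')); // int(5): length in code points
var_dump(bin2hex($accented));            // string(12) "67c3b6746865"

// Byte-wise, turning "gothe" into "göthe" takes one substitution
// (o -> 0xC3) plus one insertion (0xB6), hence a distance of 2.
var_dump(levenshtein($accented, $plain)); // int(2)
```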
There are workarounds — such as using a pure PHP implementation or converting strings to a custom single-byte encoding — but they come with downsides, like slower performance or non-standard behavior.
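As an illustration of the pure PHP workaround, here is a minimal sketch that counts edits in Unicode code points instead of bytes. The function name codepoint_levenshtein() is hypothetical (it is not part of PHP), and the sketch assumes valid UTF-8 input, PHP 7.4 or later, and the mbstring extension:

```php
<?php
/**
 * Hypothetical helper (not part of PHP): Levenshtein distance counted in
 * Unicode code points instead of bytes. Assumes valid UTF-8 input.
 */
function codepoint_levenshtein(string $a, string $b): int
{
    // Split into code points; mb_str_split() requires PHP 7.4+ and mbstring.
    $s = mb_str_split($a, 1, 'UTF-8');
    $t = mb_str_split($b, 1, 'UTF-8');
    $m = count($s);
    $n = count($t);

    // $prev[$j] holds the distance between the already processed prefix
    // of $a (initially empty) and the first $j code points of $b.
    $prev = range(0, $n);

    for ($i = 1; $i <= $m; $i++) {
        $curr = [$i];
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($s[$i - 1] === $t[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,        // deletion
                $curr[$j - 1] + 1,    // insertion
                $prev[$j - 1] + $cost // substitution (or match)
            );
        }
        $prev = $curr;
    }

    return $prev[$n];
}

var_dump(codepoint_levenshtein("g\u{00f6}the", 'gothe')); // int(1)
```

Because the loops run in userland, a function like this is slower than the built-in. It also compares code points, not graphemes, so an "e" followed by a combining accent still counts as two units.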
With the new grapheme_levenshtein() function in PHP 8.5, comparing 'göthe' and 'gothe' now correctly returns 1.
Grapheme-Based Comparison
What makes this new function especially powerful is that it operates on graphemes, not bytes or code points. For instance, the character é (accented 'e') can be represented in two ways: as a single code point (U+00E9) or as a combination of the letter e (U+0065) and a combining acute accent (U+0301). In PHP, you can write these as:
$string1 = "\u{00e9}";         // precomposed é (one code point)
$string2 = "\u{0065}\u{0301}"; // e followed by a combining acute accent
Even though these strings differ at both the byte and the code point level, they represent the same grapheme. The new grapheme_levenshtein() function correctly recognizes this and returns 0, meaning no difference.
This is particularly useful when working with complex scripts such as Japanese, Chinese, or Korean, where grapheme clusters play a bigger role than in Latin or Cyrillic alphabets.
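To see the two representations of é side by side, the following sketch compares them at the byte level and at the grapheme level. The variable names are illustrative. The grapheme functions need the intl extension, and grapheme_levenshtein() additionally needs PHP 8.5, so both calls are guarded:

```php
<?php
$precomposed = "\u{00e9}";         // é as one code point
$decomposed  = "\u{0065}\u{0301}"; // e followed by a combining acute accent

var_dump($precomposed === $decomposed); // bool(false): the bytes differ
var_dump(strlen($precomposed));         // int(2)
var_dump(strlen($decomposed));          // int(3)

// grapheme_strlen() ships with the intl extension.
if (function_exists('grapheme_strlen')) {
    var_dump(grapheme_strlen($precomposed)); // int(1)
    var_dump(grapheme_strlen($decomposed));  // int(1)
}

// Only available from PHP 8.5 with intl; per the article this yields 0.
if (function_exists('grapheme_levenshtein')) {
    var_dump(grapheme_levenshtein($precomposed, $decomposed));
}
```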
Just for fun: what do you think the original levenshtein() function will return for the example above?
var_dump(levenshtein("\u{0065}\u{0301}", "\u{00e9}"));