LEVDIST( ) function
Returns the Levenshtein distance between two specified strings, which is a measurement of how much the two strings differ.
Syntax
LEVDIST(string1, string2 <,case_sensitive>)
Parameters
Name | Type | Description |
---|---|---|
string1 | character | The first string in the comparison. |
string2 | character | The second string in the comparison. |
case_sensitive optional |
logical |
Specify T for a case-sensitive comparison of strings, or F to ignore case. If omitted, the default value of T is used. |
Output
Numeric. The value is the Levenshtein distance between two strings.
Examples
Basic examples
Returns 3, because two substitutions and one insertion are required to transform "smith" into "Smythe":
LEVDIST("smith","Smythe")
Returns 2, because case is ignored, so only two substitutions are required to transform "smith's" into "Smythes":
LEVDIST("smith's","Smythes",F)
Returns the Levenshtein distance between each value in the Last_Name field and the string "Smith":
LEVDIST(TRIM(Last_Name),"Smith")
Advanced examples
Ranking values against "Smith"
Create the computed field Lev_Dist to display the Levenshtein distance between "Smith" and each value in the Last_Name field:
DEFINE FIELD Lev_Dist COMPUTED LEVDIST(TRIM(Last_Name),"Smith", F)
Add the computed field Lev_Dist to the view, and then quick sort it in ascending order, to rank all values in the Last_Name field by their amount of difference from "Smith".
Isolating fuzzy duplicates for "Smith"
Create a filter that isolates all values in the Last_Name field that are within a specified Levenshtein distance of "Smith":
SET FILTER TO LEVDIST(TRIM(Last_Name),"Smith", F) < 3
Changing the number in the expression allows you to adjust the amount of Levenshtein distance in the filtered values.
Remarks
When to use LEVDIST( )
Use the LEVDIST( ) function to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data. LEVDIST( ) also identifies exact duplicates.
How it works
The LEVDIST( ) function returns the Levenshtein distance between the two evaluated strings, which is a value representing the minimum number of single character edits required to make one string identical to the other string.
Each required edit increments the value of the Levenshtein distance by 1. The greater the Levenshtein distance, the greater the difference between the two strings. A distance of zero (0) means the strings are identical.
Types of edits
The edits can be of three types:
- insertion
- deletion
- substitution
Transpositions (two adjacent letters reversed) are not recognized by the Levenshtein algorithm, and count as two edits – specifically, two substitutions.
Non-alphanumeric characters
Punctuation marks, special characters, and blanks are treated as single characters, just like letters and numbers.
The case of characters
Changing the case of a character counts as one substitution, unless you turn off case sensitivity using the case_sensitive setting.
The position of characters
Levenshtein distance takes the position of characters into account. The same characters ordered differently may result in a different Levenshtein distance.
Returns 2:
LEVDIST("abc", "dec")
Returns 3:
LEVDIST("abc", "cde")
Using TRIM( ) with LEVDIST( )
To ensure accurate results when using LEVDIST( ) to compare a literal string such as "Smith" with a character field, you must use the TRIM( ) function to remove trailing blanks from the field.
If you are comparing two fields, you must use the TRIM( ) function with each field.
The Levenshtein algorithm counts blanks as characters, so any trailing blanks are included in the calculation of the number of edits required to make two strings identical.
Using OMIT( ) with LEVDIST( )
The OMIT( ) function can improve the effectiveness of the LEVDIST( ) function by removing generic elements such as "Corporation" or "Inc." from field values. Removal of generic elements focuses the LEVDIST( ) string comparison on just the portion of the strings where a meaningful difference may occur.
Related functions
- ISFUZZYDUP( ) provides an alternate method for comparing strings based on Levenshtein
distance.
Unlike the default behavior of LEVDIST( ), ISFUZZYDUP( ) is not case-sensitive.
- DICECOEFFICIENT( ) de-emphasizes or completely ignores the relative position of characters or character blocks when comparing strings.
- SOUNDSLIKE( ) and SOUNDEX( ) compare strings based on a phonetic comparison (sound) rather than on an orthographic comparison (spelling).