LEVDIST( ) function

Returns the Levenshtein distance between two specified strings, which is a measurement of how much the two strings differ.

Syntax

LEVDIST(string1, string2 <,case_sensitive>)

Parameters

Name Type Description
string1 character The first string in the comparison.
string2 character The second string in the comparison.

case_sensitive

optional

logical

Specify T for a case-sensitive comparison of strings, or F to ignore case.

If omitted, the default value of T is used.

Output

Numeric. The value is the Levenshtein distance between two strings.

Examples

Basic examples

Returns 3, because two substitutions and one insertion are required to transform "smith" into "Smythe":

LEVDIST("smith","Smythe")

Returns 2, because case is ignored, so only two substitutions are required to transform "smith's" into "Smythes":

LEVDIST("smith's","Smythes",F)

Returns the Levenshtein distance between each value in the Last_Name field and the string "Smith":

LEVDIST(TRIM(Last_Name),"Smith")

Advanced examples

Ranking values against "Smith"

Create the computed field Lev_Dist to display the Levenshtein distance between "Smith" and each value in the Last_Name field:

DEFINE FIELD Lev_Dist COMPUTED LEVDIST(TRIM(Last_Name),"Smith", F)

Add the computed field Lev_Dist to the view, and then quick sort it in ascending order, to rank all values in the Last_Name field by their amount of difference from "Smith".

Isolating fuzzy duplicates for "Smith"

Create a filter that isolates all values in the Last_Name field that are within a specified Levenshtein distance of "Smith":

SET FILTER TO LEVDIST(TRIM(Last_Name),"Smith", F) < 3

Changing the number in the expression allows you to adjust the amount of Levenshtein distance in the filtered values.

Remarks

When to use LEVDIST( )

Use the LEVDIST( ) function to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data. LEVDIST( ) also identifies exact duplicates.

How it works

The LEVDIST( ) function returns the Levenshtein distance between the two evaluated strings, which is a value representing the minimum number of single character edits required to make one string identical to the other string.

Each required edit increments the value of the Levenshtein distance by 1. The greater the Levenshtein distance, the greater the difference between the two strings. A distance of zero (0) means the strings are identical.

Types of edits

The edits can be of three types:

  • insertion
  • deletion
  • substitution

Transpositions (two adjacent letters reversed) are not recognized by the Levenshtein algorithm, and count as two edits – specifically, two substitutions.

Non-alphanumeric characters

Punctuation marks, special characters, and blanks are treated as single characters, just like letters and numbers.

The case of characters

Changing the case of a character counts as one substitution, unless you turn off case sensitivity using the case_sensitive setting.

The position of characters

Levenshtein distance takes the position of characters into account. The same characters ordered differently may result in a different Levenshtein distance.

Returns 2:

LEVDIST("abc", "dec")

Returns 3:

 LEVDIST("abc", "cde") 

Using TRIM( ) with LEVDIST( )

To ensure accurate results when using LEVDIST( ) to compare a literal string such as "Smith" with a character field, you must use the TRIM( ) function to remove trailing blanks from the field.

If you are comparing two fields, you must use the TRIM( ) function with each field.

The Levenshtein algorithm counts blanks as characters, so any trailing blanks are included in the calculation of the number of edits required to make two strings identical.

Using OMIT( ) with LEVDIST( )

The OMIT( ) function can improve the effectiveness of the LEVDIST( ) function by removing generic elements such as "Corporation" or "Inc." from field values. Removal of generic elements focuses the LEVDIST( ) string comparison on just the portion of the strings where a meaningful difference may occur.

Related functions

  • ISFUZZYDUP( ) provides an alternate method for comparing strings based on Levenshtein distance.

    Unlike the default behavior of LEVDIST( ), ISFUZZYDUP( ) is not case-sensitive.

  • DICECOEFFICIENT( ) de-emphasizes or completely ignores the relative position of characters or character blocks when comparing strings.
  • SOUNDSLIKE( ) and SOUNDEX( ) compare strings based on a phonetic comparison (sound) rather than on an orthographic comparison (spelling).
ACL Scripting Guide 14.1