LEVDIST( ) function

Returns the Levenshtein Distance between two specified strings, which is a measurement of how much the two strings differ.

Syntax

LEVDIST(string1, string2 <,case_sensitive>)

Parameters

string1

Character. The first string in the comparison.

string2

Character. The second string in the comparison.

case_sensitive

Optional. Logical. Specify T for a case-sensitive comparison of strings, or F to ignore case. If the case_sensitive parameter is omitted the default value of T is used.

Output

Numeric. The value is the Levenshtein Distance between two strings.

Remarks

You can use this function to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data. The function also identifies exact duplicates.

The function returns the Levenshtein Distance between the two evaluated strings, which is a value representing the minimum number of single character edits required to make one string identical to the other string. The edits can be of three types: insertion, deletion, and substitution. Transpositions (two adjacent letters reversed) are not recognized by the Levenshtein algorithm, and count as two edits – specifically, two substitutions. The greater the Levenshtein Distance, the greater the difference between two strings. A distance of zero (0) means the strings are identical. Punctuation marks, special characters, and blanks are treated as single characters, just like letters. Changing the case of a character counts as one substitution, unless you turn off case sensitivity using the case_sensitive parameter.

Levenshtein Distance takes the position of characters into account. The same characters ordered differently may result in a different Levenshtein Distance. For example, LEVDIST("abc", "dec") = 2, whereas LEVDIST("abc", "cde") = 3.

Note

To ensure accurate results when using LEVDIST( ) to compare a literal string such as “Smith” with a character field, you must use the TRIM( ) function to remove trailing blanks from the field. If you are comparing two fields, you must use the TRIM( ) function with each field. The Levenshtein algorithm counts blanks as characters, so any trailing blanks are included in the calculation of the number of edits required to make two strings identical.

The OMIT( ) function can improve the effectiveness of the LEVDIST( ) function by removing generic elements such as “Corporation” or “Inc.” from field values. Removal of generic elements focuses the LEVDIST( ) string comparison on just the portion of the strings where a meaningful difference may occur.

The ISFUZZYDUP( ) function provides an alternative method for comparing strings based on Levenshtein Distance.

Note

Unlike the default behavior of the LEVDIST( ) function, the ISFUZZYDUP( ) function is not case-sensitive.

The SOUNDSLIKE( ) and SOUNDEX( ) functions provide a method for comparing strings based on a phonetic comparison (sound) rather than on an orthographic comparison (spelling).

Examples

Example Return value

LEVDIST("smith","Smythe")

3

Two substitutions and one insertion are required to transform “smith” into “Smythe”.

LEVDIST("smith's","Smythes",F)

2

Because case is ignored, only two substitutions are required to transform “smith’s” into “Smythes”.

LEVDIST(TRIM(Surname),"Smith")

The Levenshtein Distance between each value in the “Surname” field and the string “Smith”.

You can use the expression LEVDIST(TRIM(Surname),"Smith") to create a computed field, and then sort the computed field in ascending order to rank all values in the “Surname” field by their amount of difference from “Smith”.

To create a filter that isolates all values in the “Surname” field that are within a specified Levenshtein Distance of “Smith”, specify:

LEVDIST(TRIM(Surname),"Smith")<3

Changing the number in the expression allows you to adjust the amount of Levenshtein Distance in the filtered values.

Related reference
FUZZYDUP command
ISFUZZYDUP( ) function
OMIT( ) function
TRIM( ) function
SOUNDSLIKE( ) function
SOUNDEX( ) function


(C) 2013 ACL Services Ltd. All Rights Reserved. | Send feedback