ISFUZZYDUP( ) function

Returns a logical value indicating whether a string is a fuzzy duplicate of a comparison string.

Syntax

ISFUZZYDUP(string1, string2, levdist <,diffpct>)

Parameters

Name Type Description
string1 character The first string in the comparison.
string2 character The second string in the comparison.
levdist numeric

The maximum allowable Levenshtein distance between the two strings for them to be identified as fuzzy duplicates.

The levdist value cannot be less than 1 or greater than 10.

Increasing the levdist value increases the number of results by including values with a greater degree of fuzziness – that is, values that are more different from each other.

diffpct

optional

numeric

The upper threshold for the 'difference percentage'.

Difference percentage is explained in How it works.

The diffpct value cannot be less than 1 or greater than 99.

Increasing the diffpct value increases the number of results by including values with a greater proportion of difference relative to their length.

If omitted, difference percentage is not considered during processing of the ISFUZZYDUP( ) function.

 

Output

Logical. Returns T (true) if string values are fuzzy duplicates, and F (false) otherwise.

Examples

Basic examples

Returns F, because two edits are required to transform "Smith" into "Smythe", but the levdist value is only 1:

ISFUZZYDUP("Smith","Smythe", 1, 99)

Returns T, because two edits are required to transform "Smith" into "Smythe", and the levdist value is 2:

ISFUZZYDUP("Smith","Smythe", 2, 99)

Returns T, because zero edits are required to transform "SMITH" into "smith", and the levdist value is 1 (the ISFUZZYDUP( ) function is not case-sensitive):

ISFUZZYDUP("SMITH","smith", 1, 99)

Returns a logical value (T or F) indicating whether individual values in the Last_Name field are fuzzy duplicates for the string "Smith":

ISFUZZYDUP(Last_Name,"Smith", 3, 99)

Advanced examples

Working with difference percentage

The difference percentage gives you a tool for reducing the number of false positives returned by ISFUZZYDUP( ).

No diffpct specified

Returns T, because five edits are required to transform "abc" into "Smith", and the levdist value is 5:

ISFUZZYDUP("abc", "Smith", 5)

diffpct specified

Returns F, even though "abc" is within the specified Levenshtein distance of "Smith", because 5 edits/a string length of 3 results in a difference percentage of 167%, which exceeds the specified diffpct of 99%:

ISFUZZYDUP("abc", "Smith", 5, 99)

Difference percentage is fully explained in How it works.

Isolating fuzzy duplicates for "Smith"

Create a filter that isolates all values in the Last_Name field that are fuzzy duplicates for "Smith":

SET FILTER TO ISFUZZYDUP(Last_Name, "Smith", 3, 99)

Changing the levdist or diffpct values allows you to adjust the amount of difference in the filtered values.

Isolating fuzzy duplicates for a vendor name

Create a filter that isolates all values in the Vendor_Name field that are fuzzy duplicates for "Pacific Lighting and Electrical Supply, Inc.":

SET FILTER TO ISFUZZYDUP(Vendor_Name, "Pacific Lighting and Electrical Supply, Inc.", 2, 99)

Improve the effectiveness of the filter by using additional functions with the ISFUZZYDUP( ) function.

Using ISFUZZYDUP( ) with OMIT( ) returns:

  • Pacific Lighting and Electrical Supply, Inc.
  • Pacific Lighting and Electrical Supply
  • Pacific Lighting & Electrical Supply, Inc.
SET FILTER TO ISFUZZYDUP(OMIT(Vendor_Name, ".,&,and,Inc,Ltd,"), "Pacific Lighting Electrical Supply", 2, 99)

Using ISFUZZYDUP( ) with SORTWORDS( ) and UPPER( ) returns:

  • Pacific Lighting and Electrical Supply, Inc.
  • Pacific Electrical and Lighting Supply, Inc.

SET FILTER TO ISFUZZYDUP(SORTWORDS(UPPER(Vendor_Name)), SORTWORDS(UPPER("Pacific Lighting and Electrical Supply, Inc.")), 2, 99)

Using ISFUZZYDUP( ) with SORTWORDS( ), UPPER( ), and OMIT( ) returns:

  • Pacific Lighting and Electrical Supply, Inc.
  • Pacific Lighting and Electrical Supply
  • Pacific Lighting & Electrical Supply, Inc.
  • Pacific Electrical and Lighting Supply, Inc.
SET FILTER TO ISFUZZYDUP(SORTWORDS(UPPER(OMIT(Vendor_Name, ".,&,and,Inc,Ltd,"))), SORTWORDS(UPPER("Pacific Lighting Electrical Supply")), 2, 99)

Remarks

When to use ISFUZZYDUP( )

Use the ISFUZZYDUP( ) function to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data.

How it works

The ISFUZZYDUP( ) function calculates the Levenshtein distance between two strings, and calculates the difference percentage.

ISFUZZYDUP( ) evaluates to T (true) if:

  • The Levenshtein distance is less than or equal to the levdist value.
  • The difference percentage is less than or equal to the diffpct value (if specified).

Levenshtein distance

The Levenshtein distance is a value representing the minimum number of single character edits required to make one string identical to the other string.

For more information, see LEVDIST( ) function.

Difference percentage

The difference percentage is the percentage of the shorter of the two evaluated strings that is different.

The difference percentage is the result of the following internal Analytics calculation, which uses the Levenshtein distance between the two strings:

Levenshtein distance / number of characters in the shorter string × 100 = difference percentage

Using the optional difference percentage helps reduce the number of false positives returned by ISFUZZYDUP( ):

  • The upper threshold for diffpct is 99%, which prevents the entire replacement of a string in order to make it identical.
  • Strings that require a large number of edits in relation to their length are excluded.

Usage tips

  • Case-sensitivity The function is not case-sensitive, so "SMITH" is equivalent to "smith."
  • Trailing blanks The function automatically trims trailing blanks in fields, so there is no need to use the TRIM( ) function when specifying a field as a parameter.
  • Sorting elements The SORTWORDS( ) function can improve the effectiveness of the ISFUZZYDUP( ) function by sorting individual elements in field values into a sequential order.

    Sorting elements, such as the components of an address, can make two strings with the same information, but a different format, more closely resemble each other. A closer resemblance improves the chances that a pair of strings are selected as fuzzy duplicates of each other.

  • Removing generic elements The OMIT( ) and EXCLUDE( ) functions can improve the effectiveness of the ISFUZZYDUP( ) function by removing generic elements such as "Corporation" or "Inc.", or characters such as commas, periods, and ampersands (&), from field values.

    Removal of generic elements and punctuation focuses the ISFUZZYDUP( ) string comparison on just the portion of the strings where a meaningful difference may occur.

How the FUZZYDUP command and the ISFUZZYDUP( ) function differ

The FUZZYDUP command identifies all fuzzy duplicates in a field, organizes them in non-exhaustive groups, and outputs results that in total are exhaustive.

The ISFUZZYDUP( ) function generates a single, exhaustive list of fuzzy duplicates for a specific character value.

The command and the function both identify exact duplicates. Unlike the command, you cannot exclude exact duplicates when using the function.

What exhaustive means

Exhaustive means that all values within the specified degree of difference of the test value are returned, regardless of their position in the test field relative to the test value.

The ISFUZZYDUP( ) function is useful if the non-exhaustive groups produced by the FUZZYDUP command are not convenient for the purposes of your analysis, and you need to directly scrutinize every fuzzy duplicate for a specific character value.

Related functions

  • LEVDIST( ) provides an alternate method for comparing strings based on Levenshtein distance.

    Unlike ISFUZZYDUP( ), LEVDIST( ) is case-sensitive by default.

  • DICECOEFFICIENT( ) de-emphasizes or completely ignores the relative position of characters or character blocks when comparing strings.
  • SOUNDSLIKE( ) and SOUNDEX( ) compare strings based on a phonetic comparison (sound) rather than on an orthographic comparison (spelling).