Fuzzy duplicate helper functions

Two Analytics functions help make the fuzzy duplicates feature more effective:

  • SORTWORDS( )
  • OMIT( )

You can use the two functions separately, or in combination.

A third function, ISFUZZYDUP( ), gives you the option to identify fuzzy duplicates for a specific value, rather than for an entire field.

SORTWORDS function

When using the fuzzy duplicates feature, use the SORTWORDS( ) function to create an expression or a computed field that sorts individual elements in test field values into a sequential order.

Sorting elements, such as the components of an address, reduces the importance of the physical position of elements in fuzzy duplicates comparisons. The resulting improvement in effectiveness allows you to use a much lower Difference Threshold and produce a smaller, more focused set of results containing fewer false positives.

For detailed information, see SORTWORDS( ) function. For more information about the Difference Threshold, see How the difference settings work.

For a video providing an overview of SORTWORDS( ), see Fuzzy Matching Using SORTWORDS() (English only).

Example

The following two values would require a Difference Threshold of at least 22 to be included in fuzzy duplicate output results:

  • 125 SW 39TH ST, Suite 100
  • Suite 100, 125 SW 39TH ST

The maximum allowable Difference Threshold is 10, so the fuzzy duplicates feature would never identify the two values as fuzzy duplicates of each other. Although, clearly, they are the same address.

By contrast, if you use SORTWORDS( ) to create an expression or a computed field that sorts the individual address elements, a Difference Threshold of only 2 would return the two addresses as fuzzy duplicates of each other:

  • 100 125 39TH ST, SW Suite
  • 100, 125 39TH ST SW Suite

OMIT function

When using the fuzzy duplicates feature, use the OMIT( ) function to create an expression or a computed field that removes generic elements from test field values.

Removal of elements such as hyphens, commas, and number signs, and words or abbreviations such as "Inc.", "Street", or "St.", focuses the fuzzy duplicates comparisons on just the portion of the test values where a meaningful difference may occur. The resulting improvement in effectiveness allows you to use a much lower Difference Threshold and produce a smaller, more focused set of results containing fewer false positives.

For detailed information, see OMIT( ) function. For more information about the Difference Threshold, see How the difference settings work.

Example

The following two values require a Difference Threshold of at least 8 to be included in fuzzy duplicate output results:

  • Intercity Couriers Corporation
  • Inter-city Couriers Corp.

A Difference Threshold of 8 might produce a large, unfocused set of results containing many false positives. However, a lower Difference Threshold would allow the two values to escape detection as fuzzy duplicates of each other.

By contrast, if you use OMIT( ) to create an expression or a computed field that removes generic elements such as "Corporation" and "Corp.", a Difference Threshold of only 1 would return the two names as fuzzy duplicates of each other:

  • Intercity Couriers
  • Inter-city Couriers

ISFUZZYDUP function

After using the fuzzy duplicates feature and reviewing the results, you can use the ISFUZZYDUP( ) function to output a single, exhaustive list of fuzzy duplicates for a specific value in the results. You can take this additional step for values that appear to be of particular relevance to your analysis goal.

Exhaustive means that all values within the specified degree of difference of the test value are returned, regardless of their position in the test field relative to the test value.

By design, the fuzzy duplicates feature organizes the output results in non-exhaustive groups. The results, in total, are exhaustive, but the individual groups may or may not be. This approach prevents the output results from becoming very large and unmanageable.

The non-exhaustive groups may be sufficient for the purposes of your analysis. If they are not, you can use ISFUZZYDUP( ) to produce exhaustive results for individual values.

For detailed information, see ISFUZZYDUP( ) function. For more information about non-exhaustive groups, see How fuzzy duplicates are grouped.