FUZZYDUP command

Concept Information

Testing for fuzzy duplicates

Detects nearly identical values (fuzzy duplicates) in a character field.

Note

To use fuzzy matching to combine fields from two Analytics tables into a new, single Analytics table, see FUZZYJOIN command.

Syntax

FUZZYDUP ON key_field <OTHER field <...n>|OTHER ALL <EXCLUDE field_name <...n>>> LEVDISTANCE value <DIFFPCT percentage> <RESULTSIZE percentage> <EXACT> <IF test> TO table_name <LOCAL> <OPEN>

Parameters

Name Description
ON key_field The character field or expression to test for fuzzy duplicates.
OTHER field <...n> | OTHER ALL

optional

One or more additional fields to include in the output.

  • OTHER field <...n> include the specified field or fields

    Fields are included in the order that you list them.

  • OTHER ALL include all fields in the table that are not specified as the key field.

    Fields are included in the order that they appear in the table layout.

EXCLUDE field_name

optional

Only valid when using OTHER ALL.

The field or fields to exclude from the command. EXCLUDE allows you to fine-tune OTHER ALL, by excluding the specified fields.

EXCLUDE must immediately follow OTHER ALL. For example:

OTHER ALL EXCLUDE field_1 field_2
LEVDISTANCE value

The maximum allowable Levenshtein distance between two strings for them to be identified as fuzzy duplicates and included in the results.

The LEVDISTANCE value cannot be less than 1 or greater than 10. Increasing the LEVDISTANCE value increases the number of results by including values with a greater degree of fuzziness – that is, values that are more different from one another.

For more information, see FUZZYDUP behavior.

DIFFPCT percentage

optional

A threshold that limits the 'difference percentage' or the proportion of a string that can be different.

The percentage that results from an internal Analytics calculation performed on potential fuzzy duplicate pairs must be less than or equal to the DIFFPCT value for the pair to be included in the results. The DIFFPCT value cannot be less than 1 or greater than 99.

If DIFFPCT is omitted the threshold is turned off and difference percentage is not considered during processing of the FUZZYDUP command.

For more information, see FUZZYDUP behavior.

RESULTSIZE percentage

optional

The maximum size of the set of output results as a percentage of the number of records in the key field.

For example, for a key field with 50,000 records, a RESULTSIZE of 3 would terminate processing if the results exceeded 1500 fuzzy duplicates (50,000 x 0.03). No output table is produced if processing is terminated.

The RESULTSIZE value cannot be less than 1 or greater than 1000 (one thousand) percent. The limit of 1000% is to accommodate the nature of many-to-many matching, which can produce results that are more numerous than the original test data set.

If RESULTSIZE is omitted the threshold is turned off and result size is not considered during processing of the FUZZYDUP command.

Caution

Omitting RESULTSIZE can produce an unduly large set of results that takes a very long time to process, or can cause available memory to be exceeded, which terminates processing. Omit RESULTSIZE only if you are confident that the results will be of a manageable size.

EXACT

optional

Includes exact duplicates as well as fuzzy duplicates in the results.

IF test

optional

A conditional expression that must be true in order to process each record. The command is executed on only those records that satisfy the condition.

Note

The IF parameter is evaluated against only the records remaining in a table after any scope parameters have been applied (WHILE, FIRST, NEXT).

TO table_name

The location to send the results of the command to:

  • table_name saves the results to an Analytics table

    Specify table_name as a quoted string with a .FIL file extension. For example: TO "Output.FIL"

    By default, the table data file (.FIL) is saved to the folder containing the Analytics project.

    Use either an absolute or relative file path to save the data file to a different, existing folder:

    • TO "C:\Output.FIL"
    • TO "Results\Output.FIL"

    Note

    Table names are limited to 64 alphanumeric characters, not including the .FIL extension. The name can include the underscore character ( _ ), but no other special characters, or any spaces. The name cannot start with a number.

LOCAL

optional

Saves the output file in the same location as the Analytics project.

Note

Applicable only when running the command against a server table with an output file that is an Analytics table.

The LOCAL parameter must immediately follow the TO parameter.

OPEN

optional

Opens the table created by the command after the command executes. Only valid if the command creates an output table.

Analytics output variables

Name Contains
GAPDUPn

The total number of gaps, duplicates, or fuzzy duplicate groups identified by the command.

Examples

Test a surname field for fuzzy duplicates

You test a surname field for fuzzy duplicates (the Last_Name field in the Employee_List table in ACL DATA\Sample Data Files\Metaphor_Employee_Data.ACL). The results are output to a new Analytics table.

  • In addition to the test field, other fields are included in the results.
  • The maximum allowable Levenshtein distance is 1.
  • The proportion of a string that can be different is limited to 50%.
  • The size of the results is limited to 20% of the test field size.
  • In addition to fuzzy duplicates, exact duplicates are included.
OPEN Employee_List
FUZZYDUP ON Last_Name OTHER First_Name EmpNo LEVDISTANCE 1 DIFFPCT 50 RESULTSIZE 20 EXACT TO "Fuzzy_Last_Name" OPEN

Remarks

How it works

The FUZZYDUP command finds nearly identical values (fuzzy duplicates), or locates inconsistent spelling in manually entered data.

Unlike the ISFUZZYDUP( ) function, which identifies an exhaustive list of fuzzy duplicates for a single character value, the FUZZYDUP command identifies all fuzzy duplicates in a field, organizes them in non-exhaustive groups, and outputs results.

For detailed information about how this command works, see Fuzzy duplicates analysis.

What non-exhaustive means

Non-exhaustive means that individual fuzzy duplicate groups in the results may not contain all the fuzzy duplicates in a test field that are within the specified degree of difference of the group owner. However, if a group owner is a fuzzy duplicate of another value in the test field, the two values will appear together in a group somewhere in the results. So groups may be non-exhaustive, but the results, in total, are exhaustive.

If producing a single, exhaustive list of fuzzy duplicates for a specific value in the test field is important to your analysis, you can use the ISFUZZYDUP( ) function for this purpose.

FUZZYDUP behavior

The FUZZYDUP command has two parameters that allow you to control the degree of difference between fuzzy duplicates, and the size of the results:

  • LEVDISTANCE
  • DIFFPCT

You may need to try different combinations of settings for these two parameters to find out what works best for a particular data set.

LEVDISTANCE (Levenshtein distance)

When processing data, the FUZZYDUP command calculates the Levenshtein distance between each evaluated pair of strings in the test field, and calculates the difference percentage. The Levenshtein distance is a value representing the minimum number of single character edits required to make one string identical to the other string. For more information, see LEVDIST( ) function.

DIFFPCT (Difference percentage)

The difference percentage is the percentage of the shorter of the two evaluated strings that is different, and is the result of the following internal Analytics calculation, which uses the Levenshtein distance between the two strings:

Levenshtein distance / number of characters in the shorter string × 100 = difference percentage

More information

For detailed information about the fuzzy duplicate difference settings, controlling result size, and fuzzy duplicate groups, see Fuzzy duplicates analysis.

Case-sensitivity

The FUZZYDUP command is not case-sensitive, so "SMITH" is equivalent to "smith."

Trailing blanks automatically trimmed

The FUZZYDUP command automatically trims trailing blanks in key_field, so there is no need to use the TRIM( ) or ALLTRIM( ) function when specifying a single field for key_field.

If you concatenate fields for key_field, you should use ALLTRIM( ), as shown below.

Improving the effectiveness of FUZZYDUP

Three techniques can significantly improve the effectiveness of the FUZZYDUP command:

  • sorting individual elements in test field values
  • removing generic elements from test field values
  • concatenating test fields

These techniques generate more focused sets of results with fewer false positives and more true positives. You can use the techniques separately, or in combination.

Sorting individual elements in test field values

The SORTWORDS( ) function can improve the effectiveness of the FUZZYDUP command by sorting individual elements in test field values into a sequential order.

Sorting elements, such as the components of an address, can make two strings with the same information, but a different format, more closely resemble each other. A closer resemblance improves the chances that a pair of strings are selected as fuzzy duplicates of each other.

For more information, see SORTWORDS( ) function.

For a video providing an overview of SORTWORDS( ), see Fuzzy Matching Using SORTWORDS() (English only).

Removing generic elements from test field values

The OMIT( ) function can improve the effectiveness of the FUZZYDUP command by removing generic elements such as "Corporation" or "Inc.", or characters such as commas, periods, and ampersands (&), from test field values.

Removal of generic elements and punctuation focuses the FUZZYDUP string comparison on just the portion of the strings where a meaningful difference may occur.

For more information, see OMIT( ) function.

Concatenating test fields

Concatenating two or more test fields can improve the effectiveness of the FUZZYDUP command by increase the degree of uniqueness of the test values.

For example, by concatenating an Address field and a City field you avoid fuzzy matches between addresses in different cities:

FUZZYDUP ON ALLTRIM(Address)+ALLTRIM(City) OTHER Address City Vendor_Name LEVDISTANCE 4 DIFFPCT 50 RESULTSIZE 20 EXACT TO "Vendor_Name_Fuzzy_Dupes" OPEN

Other string comparison methods

  • DICECOEFFICIENT( ) function provides a method for comparing strings that de-emphasizes or completely ignores the relative position of characters or character blocks.
  • SOUNDSLIKE( ) and SOUNDEX( ) functions provide a method for comparing strings based on a phonetic comparison (sound) rather than on an orthographic comparison (spelling).