Testing for fuzzy duplicates

Concept Information

FUZZYDUP command

You can test a character field in the active table to detect whether nearly identical values exist (fuzzy duplicates). You can optionally include identical values (exact duplicates) in the output results as well as nearly identical values.

A message appears in the log if one or more fuzzy duplicate groups in the results reach the maximum size. For more information, see Controlling the size of fuzzy duplicate results.

Improving the effectiveness of fuzzy duplicates testing

You can significantly improve the effectiveness of fuzzy duplicates testing by incorporating one or more of the following techniques:

  • sorting individual elements in test field values
  • removing generic elements from test field values
  • concatenating test fields

For more information, see Fuzzy duplicate helper functions and Concatenating fields.
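
The three techniques above can be illustrated outside Analytics with a short Python sketch. It does not reproduce the Analytics helper functions. The names normalize, concatenate, and GENERIC_ELEMENTS are hypothetical, and the list of generic elements is an assumption chosen only for the example.

```python
import re

# Hypothetical list of generic elements; adjust it for your own data.
GENERIC_ELEMENTS = {"inc", "corp", "ltd", "llc", "co"}


def normalize(value, generic=GENERIC_ELEMENTS):
    """Lowercase a value, drop generic elements, and sort what remains."""
    elements = re.findall(r"[a-z0-9]+", value.lower())
    kept = [e for e in elements if e not in generic]
    return " ".join(sorted(kept))


def concatenate(*fields):
    """Join several trimmed field values into a single test value."""
    return " ".join(f.strip() for f in fields)


# Both spellings reduce to the same value: "couriers intercity"
print(normalize("Intercity Couriers Corp."))
print(normalize("Couriers, Intercity"))

# Two fields combined into one test value: "John Smith"
print(concatenate(" John ", "Smith "))
```

After normalization, values that differ only in element order or in generic elements become identical or nearly identical, so a low difference setting is more likely to match them.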

Reducing execution time and the size of the output results

The fuzzy duplicates feature is processor-intensive, because every value in a test field must be compared with every subsequent value in the field.

If your analysis allows it, use methods such as filtering or extracting subsets of records to limit the size of the data set you test. Smaller data sets reduce overall execution time, and also help control the size of the output results.
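
To see why smaller data sets help, the following Python sketch counts the comparisons needed when every value is compared with every subsequent value, as described above. The function name comparison_count is hypothetical, and the figure is an upper bound; Analytics may skip some comparisons internally.

```python
def comparison_count(n):
    """Number of pairs when each of n values is compared with every later value."""
    return n * (n - 1) // 2


for n in (1_000, 10_000, 50_000):
    print(f"{n:>6,} values -> {comparison_count(n):>13,} comparisons")
```

Because the count grows roughly with the square of the number of values, halving the data set cuts the work to about a quarter.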

Steps

  1. Select Analyze > Fuzzy Duplicates.
  2. On the Main tab, do one of the following:
    • Select the field to test from the Fuzzy Duplicates On list.
    • Click Fuzzy Duplicates On to select the field, or to create an expression.

      Tip

      Creating an expression is how you concatenate test fields, remove generic elements from test field values, or sort individual elements in test field values. For more information, see Fuzzy duplicate helper functions and Concatenating fields.

  3. Optional. Select one or more List Fields to include additional fields in the results, or click List Fields to select fields, to Add All fields, or to create an expression.

    Additional fields can provide useful context for the results, and can help verify whether fuzzy duplicates reference the same real-world entity.

    Note

    The field selected for fuzzy duplicates testing is displayed automatically at the beginning of any result records and does not need to be specifically selected under List Fields.

  4. Specify a Difference Threshold to control the amount of difference between fuzzy duplicates.

    The setting is explained under Fuzzy Duplicates dialog box options, below.

  5. Do one of the following:
    • Specify a Difference Percentage to control the percentage of each fuzzy duplicate that can be different.
    • Deselect Difference Percentage to turn it off.

    The setting is explained under Fuzzy Duplicates dialog box options, below.

  6. Do one of the following:
    • Specify a Result Size (%) to set the maximum size of the results relative to the size of the test field.
    • Deselect Result Size (%) to turn it off.

    The setting is explained under Fuzzy Duplicates dialog box options, below.

  7. If you want to include exact duplicates as well as fuzzy duplicates in the results, select Include Exact Duplicates.

    For more information, see How fuzzy duplicates are grouped.

  8. If there are records in the current view that you want to exclude from processing, enter a condition in the If text box, or click If to create an IF statement using the Expression Builder.

    The IF statement considers all records in the view and filters out those that do not meet the specified condition.

  9. If you are connected to a server table, do one of the following:
    • Select Local to save the output table to the same location as the project, or to specify a path or navigate to a different local folder.
    • Leave Local deselected to save the output table to the Prefix folder on a server.
      Note

      For output results produced from analysis or processing of Analytics Exchange server tables, select Local. You cannot deselect the Local setting to import results tables to Analytics Exchange.

  10. Do one of the following:
    • In the To text box, specify the name of the Analytics table that will contain the output results.
    • Click To and select an existing table in the Save or Save File As dialog box to overwrite or append to the table.

    You can also specify an absolute or relative file path, or navigate to a different folder, to save or append the table in a location other than the project location. For example: C:\Results\Output.fil or Results\Output.fil.

    Regardless of where you save or append the table, it is added to the open project if it is not already in the project.

    If Analytics prefills a table name, you can accept the prefilled name, or change it.

  11. Select Use Output Table if you want the output table to open automatically upon completion of the operation.

  12. Click OK.
  13. If the overwrite prompt appears, select the appropriate option.

Fuzzy Duplicates dialog box options

The table below provides detailed information about options in the Fuzzy Duplicates dialog box.

Options – Fuzzy Duplicates dialog box Description
Difference Threshold

The allowable amount of difference between fuzzy duplicates.

Specify a number from 1 to 10. Increasing the Difference Threshold increases the number of characters that can differ between fuzzy duplicate pairs, which increases the size of the results.

For more information, see How the difference settings work.
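
As an illustration of the Difference Threshold, the Python sketch below assumes that the difference between two values is measured as an edit distance (Levenshtein distance): the minimum number of single-character insertions, deletions, or substitutions needed to turn one value into the other. The function names levenshtein and within_threshold are hypothetical, and the sketch is not the Analytics implementation.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def within_threshold(a, b, threshold):
    """True if the pair differs by no more than the threshold."""
    return levenshtein(a, b) <= threshold


print(levenshtein("Smith", "Smythe"))          # 2
print(within_threshold("Smith", "Smythe", 1))  # False
print(within_threshold("Smith", "Smythe", 2))  # True
```

Under this assumption, raising the threshold from 1 to 2 admits the pair "Smith" and "Smythe", which shows how a higher setting enlarges the results.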

Difference Percentage

The percentage of each fuzzy duplicate that can be different.

Specify a percentage from 1 to 99. Increasing the Difference Percentage increases the percentage of a fuzzy duplicate that can be different, which increases the size of the results.

If you turn off Difference Percentage, the results do not take into account the percentage of a fuzzy duplicate that is different. The results are larger than they would be with any Difference Percentage setting.

For more information, see How the difference settings work.
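
The following Python sketch shows one way a difference percentage could be computed. It assumes the percentage is the edit distance between two values divided by the length of the shorter value, multiplied by 100. That formula, and the names levenshtein, difference_percentage, and is_fuzzy_duplicate, are assumptions made for illustration; Analytics may calculate the percentage differently (see How the difference settings work).

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def difference_percentage(a, b):
    """Edit distance as a percentage of the shorter value's length (assumed formula)."""
    shorter = min(len(a), len(b))
    distance = levenshtein(a, b)
    if shorter == 0:
        # Avoid dividing by zero when one value is empty.
        return 0.0 if distance == 0 else 100.0
    return distance / shorter * 100


def is_fuzzy_duplicate(a, b, threshold, max_pct=None):
    """Apply the threshold, then the percentage unless it is turned off (None)."""
    if levenshtein(a, b) > threshold:
        return False
    return max_pct is None or difference_percentage(a, b) <= max_pct


print(difference_percentage("Smith", "Smythe"))      # 40.0
print(is_fuzzy_duplicate("Smith", "Smythe", 2, 50))  # True
print(is_fuzzy_duplicate("Smith", "Smythe", 2, 20))  # False
print(is_fuzzy_duplicate("Smith", "Smythe", 2))      # True
```

In this sketch, passing None for max_pct corresponds to turning Difference Percentage off, so only the threshold limits the matches.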

Result Size (%)

The maximum size of the results relative to the size of the test field.

Specify a percentage from 1 to 1000 (one thousand). This option allows you to automatically terminate the fuzzy duplicates operation if the size of the results grows beyond what you consider useful.

For example, for a test field with 50,000 values, a Result Size (%) of 1 would terminate processing if the results exceeded 500 fuzzy duplicates. No output table is produced if processing is terminated.

If you turn off Result Size (%), Analytics does not impose any limit on the size of the results.

Caution

Turning off Result Size (%) can produce an unduly large set of results that takes a very long time to process, or can cause available memory to be exceeded, which terminates processing. Turn off this option only if you are confident that the results will be of a manageable size.

For more information, see Controlling the size of fuzzy duplicate results.
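
The Result Size (%) arithmetic described above can be sketched in Python as follows. The function names max_result_size and should_terminate are hypothetical, and rounding down to a whole number is an assumption; the sketch only restates the example of 50,000 values with a setting of 1.

```python
def max_result_size(test_field_values, result_size_pct):
    """Largest permitted result size for a given Result Size (%) setting."""
    # Integer division (rounding down) is an assumption of this sketch.
    return test_field_values * result_size_pct // 100


def should_terminate(result_count, test_field_values, result_size_pct=None):
    """True if the results exceed the limit; None means the option is turned off."""
    if result_size_pct is None:
        return False
    return result_count > max_result_size(test_field_values, result_size_pct)


print(max_result_size(50_000, 1))        # 500
print(should_terminate(500, 50_000, 1))  # False
print(should_terminate(501, 50_000, 1))  # True
print(should_terminate(10**9, 50_000))   # False
```

With the option turned off, nothing in this sketch stops the results from growing, which mirrors the caution above.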