Testing for fuzzy duplicates

You can test a character field in the active table to detect whether nearly identical values exist (fuzzy duplicates). You can optionally include identical values (exact duplicates) in the output results as well as nearly identical values.

A message appears in the log if one or more fuzzy duplicate groups in the results reach the maximum size. For more information, see Setting a maximum fuzzy duplicate group size.

Note

Before you test for fuzzy duplicates, you can improve the quality and reduce the size of the results in two ways: use the OMIT( ) function to remove generic elements from the test field, or concatenate fields to increase the uniqueness of the test values.

For more information, see Fuzzy duplicate helper functions and Concatenating fields.
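
The following Python sketch illustrates the idea behind both preparation techniques. It is not ACL code. The helper names omit_terms and concatenate are hypothetical, and omit_terms only approximates the effect of OMIT( ) by removing each listed substring and collapsing the leftover spaces.

```python
def omit_terms(value, terms):
    """Remove each generic term from value, then collapse extra whitespace.

    Hypothetical stand-in for the effect of OMIT( ); not the ACL function itself.
    """
    for term in terms:
        value = value.replace(term, "")
    return " ".join(value.split())


def concatenate(*fields):
    """Join trimmed field values into one, more distinctive, test value."""
    return " ".join(field.strip() for field in fields)


generic_terms = ["Inc.", "Ltd.", "Corporation"]

# Generic company suffixes no longer contribute to the comparison.
print(omit_terms("Acme Supplies Inc.", generic_terms))  # Acme Supplies

# Combining name and address makes accidental matches less likely.
print(concatenate("Acme Supplies ", "125 Main St"))  # Acme Supplies 125 Main St
```

Removing generic elements keeps values from matching only because they share common words, and concatenating fields gives each test value more distinguishing characters.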

To test for fuzzy duplicates:

  1. Select Analyze > Fuzzy Duplicates.
  2. On the Main tab, do one of the following:
    • Select the field to test from the Fuzzy Duplicates On list.

    • Click Fuzzy Duplicates On to select the field, or to create an expression.

  3. Optional. Select one or more List Fields to include additional fields in the results, or click List Fields to select the fields, to Add All fields, or to create an expression.

    Additional fields can provide useful context for the results, and can help verify whether fuzzy duplicates reference the same real-world entity. The field selected for fuzzy duplicates testing is displayed automatically at the beginning of any result records and does not need to be specifically selected under List Fields.

  4. Specify a Difference Threshold to control the amount of difference between fuzzy duplicates.

    You can specify a number from 1 to 10. Increasing the Difference Threshold increases the number of characters that can differ between fuzzy duplicate pairs, which increases the size of the results. For more information, see How the difference settings work.

  5. Do one of the following:
    • Specify a Difference Percentage to control the percentage of each fuzzy duplicate that can be different.

      You can specify a percentage from 1 to 99. Increasing the Difference Percentage increases the percentage of a fuzzy duplicate that can be different, which increases the size of the results.

    • Deselect Difference Percentage to turn it off.

      If you turn off Difference Percentage, the results do not take into account the percentage of a fuzzy duplicate that is different. The results will be larger than when you use Difference Percentage with any setting.

    For more information, see How the difference settings work.

  6. Do one of the following:
    • Specify a Result Size (%) to set the maximum size of the results relative to the size of the test field.

      You can specify a percentage from 1 to 1000 (one thousand). This option automatically terminates the fuzzy duplicates operation if the size of the results grows beyond what you consider useful. For example, for a test field with 50,000 values, a Result Size (%) of 1 would terminate processing if the results exceeded 500 fuzzy duplicates (1% of 50,000). No output table is produced if processing is terminated.

    • Deselect Result Size (%) to turn it off.

      If you turn off Result Size (%), ACL does not impose any limit on the size of the results.

      Note

      Turning off Result Size (%) can produce an unduly large set of results that takes a very long time to process, or can cause available memory to be exceeded, which terminates processing. Turn off this option only if you are confident that the results will be of a manageable size.

  7. If you want to include exact duplicates as well as fuzzy duplicates in the results, select Include Exact Duplicates.

    For more information, see Including exact duplicates in results.

  8. If there are records in the current view that you want to exclude from processing, enter a condition in the If text box, or click If to create an IF statement using the Expression Builder.

    The IF statement considers all records in the view and filters out those that do not meet the specified condition.

  9. If you are connected to a server table, do one of the following:
    • Select Local to save the output table to the same location as the project, or to specify a path or navigate to a different local folder.

    • Leave Local deselected to save the output table to the Prefix folder on the ACL Server.

      Note

      For output results produced from analysis or processing of ACL Analytics Exchange server tables, select Local. You cannot use the Local setting to import results tables to ACL Analytics Exchange Server.

  10. Do one of the following:
    • In the To text box, specify the name of the ACL table that will contain the output results.

    • Click To and specify the ACL table name, or select an existing table in the Save or Save File As dialog box to overwrite the table.

    If ACL prefills a table name, you can accept the prefilled name, or change it.

    You can also specify an absolute or relative file path, or navigate to a different folder, to save the table in a location other than the project location. For example: C:\Results\Output.fil or Results\Output.fil. Regardless of where you save the table, it is added to the open project if it is not already in the project.

  11. Select or deselect Use Output Table depending on whether you want the ACL table containing the output results to open automatically when the operation completes.
  12. Click OK.
  13. If the overwrite prompt appears, select the appropriate option.
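
The way the settings in steps 4 through 7 interact can be sketched in Python. This is an illustration under stated assumptions, not ACL's implementation: it assumes that difference is measured as Levenshtein edit distance, that Difference Percentage is calculated against the length of the shorter value, and that each matching pair counts as one result. See How the difference settings work for the actual rules. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def is_fuzzy_duplicate(a, b, threshold, percentage=None, include_exact=False):
    """Apply the Difference Threshold, then the optional Difference Percentage."""
    distance = levenshtein(a, b)
    if distance == 0:
        return include_exact           # Include Exact Duplicates
    if distance > threshold:
        return False                   # too many differing characters
    if percentage is None:
        return True                    # Difference Percentage turned off
    shorter = min(len(a), len(b))      # assumption: measured against shorter value
    return distance * 100 <= percentage * shorter


def max_results(value_count, result_size_pct):
    """Result Size (%) expressed as a count of results."""
    return value_count * result_size_pct // 100


def fuzzy_pairs(values, threshold, percentage=None,
                result_size_pct=None, include_exact=False):
    """Return matching pairs, or None if the Result Size (%) cap is exceeded."""
    limit = None if result_size_pct is None else max_results(len(values), result_size_pct)
    pairs = []
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            if is_fuzzy_duplicate(a, b, threshold, percentage, include_exact):
                pairs.append((a, b))
                if limit is not None and len(pairs) > limit:
                    return None        # processing terminated, no output
    return pairs


names = ["Smith", "Smyth", "Smithe", "Jones"]
print(max_results(50000, 1))                        # 500
print(fuzzy_pairs(names, threshold=1))              # [('Smith', 'Smyth'), ('Smith', 'Smithe')]
print(fuzzy_pairs(names, threshold=1, percentage=10))       # []
print(fuzzy_pairs(names, threshold=1, result_size_pct=25))  # None
```

In the sketch, a Difference Threshold of 1 matches Smith with both Smyth and Smithe. A Difference Percentage of 10 rejects those pairs, because one differing character in a five-character value is a 20% difference. A Result Size (%) of 25 on four values allows only one result, so the second match terminates processing.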

Related concepts
Fuzzy duplicates overview
About fuzzy duplicates
Controlling the size of fuzzy duplicate results
How the difference settings work
How fuzzy duplicates are grouped
Fuzzy duplicate helper functions

Related tasks
Working with fuzzy duplicate output results
Testing for duplicates


(C) 2013 ACL Services Ltd. All Rights Reserved.