Working with fuzzy duplicate output results

If you want to see the actual difference threshold (Levenshtein Distance) between each group owner and group member in a fuzzy duplicates output table, and the difference percentage that applies to each owner-member pair, you can add computed fields to display these values. Once you have added the computed fields, you can create a nested sort order to rank the output results by their degree of fuzziness.

You need to create three computed fields, and the fields must be created in this order:

To add difference threshold and difference percentage fields:

  1. Open the table containing the fuzzy duplicate output results and select Edit > Table Layout.
  2. Create the group owner computed field by doing the following:
    1. On the Edit Fields/Expressions tab, click Add a New Expression .
    2. In the Name field, type Group_Owner.
    3. In the Default Value field, type the physical name of the fuzzy duplicates test field in the output results (for example, vendor_name), or click f(x) to select it in the Expression Builder.
    4. In the If field, type the following expression: NOT ISBLANK(GROUP_FL)
    5. Select the Static field.
    6. Click Accept Entry .
  3. Create the Levenshtein Distance computed field by doing the following:
    1. Click Add a New Expression .
    2. In the Name field, type Lev_Dist.
    3. In the Default Value field, type the following expression, or click f(x) to build the expression in the Expression Builder:

      LEVDIST(ALLTRIM(Group_Owner),ALLTRIM(fuzzy_dup_test_field),F)

      Replace fuzzy_dup_test_field with the actual name of the fuzzy duplicates test field.

    4. Click Accept Entry .
  4. Create the difference percentage computed field by doing the following:
    1. Click Add a New Expression .
    2. In the Name field, type Diff_Pct.
    3. In the Default Value field, type the following expression, or click f(x) to build the expression in the Expression Builder:

      100*DEC(Lev_Dist,2)/MINIMUM(LENGTH(ALLTRIM(Group_Owner)),

      LENGTH(ALLTRIM(fuzzy_dup_test_field)))

      Replace fuzzy_dup_test_field with the actual name of the fuzzy duplicates test field.

    4. Click Accept Entry and click Close to exit the Table Layout dialog box.
  5. Add the Lev_Dist and Diff_Pct computed fields to the view.

    The difference threshold (Levenshtein Distance) between each group owner and group member, and the difference percentage that applies to each owner-member pair, is now displayed.

    For information about how to add fields to a view, see Adding columns to a view.

  6. If you want to rank the output results by their degree of fuzziness, do the following:
    1. Extract all fields except the Group field to a new table, and filter out records in which the Group field is not blank.

      The ACLScript syntax for the extract operation appears below.

      EXTRACT FIELDS Lev_Dist Diff_Pct GROUP_NUM Group_Owner ORIG_REC_NUM fuzzy_dup_test_field IF ISBLANK(GROUP_FL) TO "Ranked_Fuzzy_Dupes_1" OPEN

      Replace fuzzy_dup_test_field with the actual name of the fuzzy duplicates test field.

    2. Perform a nested sort of the extracted table using Lev_Dist as the first sort field and Diff_Pct as the second sort field.

      The ACLScript syntax for the sort operation appears below.

      SORT ON Lev_Dist Diff_Pct TO "Ranked_Fuzzy_Dupes_2" OPEN

      The fuzziness of the output results increases as you go down the table. The Group Number field is the original record number of the group owner in each fuzzy duplicate pair, and the Original Record Number field is the original record number of the group member in each pair.

      For information about how to create a nested sort, see Sorting records.

Related concepts
Fuzzy duplicates overview
About fuzzy duplicates
Controlling the size of fuzzy duplicate results
How the difference settings work
How fuzzy duplicates are grouped
Fuzzy duplicate helper functions
Related tasks
Testing for fuzzy duplicates
Adding columns to a view
Extracting data
Sorting records


(C) 2013 ACL Services Ltd. All Rights Reserved. | Send feedback