Controlling the size of fuzzy duplicate results

Fuzzy duplicate results have the potential to grow very large because the fuzzy duplicates feature uses an algorithm that performs a many-to-many comparison of values in the test field. If results get too large in relation to the test table, they may no longer be useful or meaningful, and the majority of the results could be false positives.

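You can combine several methods to control the size of fuzzy duplicate results. Two of them, setting a maximum fuzzy duplicate group size and specifying a maximum result size, are described in the sections below.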

Note

Concatenating test fields and using the OMIT( ) function are the methods least likely to exclude valid fuzzy duplicates. The other methods, if set too restrictively, can exclude valid fuzzy duplicates. You may need to try different combinations of settings to find out what works best for a particular data set.

Setting a maximum fuzzy duplicate group size

Using the SET command to specify a maximum fuzzy duplicate group size limits the size of groups that would otherwise contain a large number of false positives. The setting is most useful when it limits the size of only some of the groups in the output results. If all or most of the groups reach the maximum size, the setting may be too small, and you may be excluding valid fuzzy duplicates. Alternatively, the difference settings may not be restrictive enough, which allows the groups to grow too large.
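One way to judge whether a group size setting is too small is to measure how many groups actually reached it. The following Python sketch is illustrative only and is not part of ACL. The function name, the list of member counts, and the maximum of 20 are hypothetical, and the sketch assumes you have already extracted the number of members (excluding the owner) for each group from the output results.

```python
def share_of_capped_groups(member_counts, max_group_size):
    """Return the fraction of groups whose member count reached the maximum.

    member_counts: number of members per group, group owner excluded.
    """
    if not member_counts:
        return 0.0
    capped = sum(1 for n in member_counts if n >= max_group_size)
    return capped / len(member_counts)


# Hypothetical member counts taken from a fuzzy duplicates output table.
counts = [3, 20, 2, 20, 5, 1, 20, 4]
share = share_of_capped_groups(counts, max_group_size=20)
print(f"{share:.1%} of groups reached the maximum size")
```

A small share suggests the setting is trimming only the largest groups. A share close to 100% suggests that either the maximum is too small or the difference settings are too loose.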

The default maximum group size is 20, and does not include the group owner. You can specify a different maximum from 2 to 100. The specified maximum remains in effect for the duration of the ACL session.

If a fuzzy duplicate group reaches the maximum size, any subsequent fuzzy duplicates for the group owner are not detected and do not appear in the group. These excluded fuzzy duplicates may or may not appear in a subsequent group, depending on whether they are part of a subsequent fuzzy duplicate match. If producing an exhaustive list of fuzzy duplicates for an owner of a group that has reached its maximum size is important to your analysis, you can use the ISFUZZYDUP( ) function for this purpose. For more information, see Fuzzy duplicate helper functions.
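The effect of the cap on a single group can be sketched in Python. This is a simplified illustration, not ACL's implementation: it assumes that values are compared by Levenshtein (edit) distance, the function names and sample values are hypothetical, and the `skipped` list exists only to show which matches would be lost. The actual feature does not record them.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def build_group(owner, candidates, max_distance, max_group_size):
    """Collect matches for one owner, stopping at max_group_size members."""
    members, skipped = [], []
    for value in candidates:
        if levenshtein(owner, value) > max_distance:
            continue  # not a fuzzy duplicate of the owner
        if len(members) < max_group_size:
            members.append(value)
        else:
            skipped.append(value)  # a match, but the group is already full
    return members, skipped


members, skipped = build_group(
    "Hanson",
    ["Hansen", "Hansson", "Janson", "Smith", "Hansom"],
    max_distance=1,
    max_group_size=2,
)
print(members)  # matches kept in the group
print(skipped)  # matches lost because the group was full
```

In this sketch the values in `skipped` are genuine matches for the owner that never reach the group, which is the situation the ISFUZZYDUP( ) function is meant to address.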

A message appears in the log if one or more groups reach the maximum size. If the number of groups that reach the maximum size is ten or fewer, the groups are individually identified by group number.

Exact duplicates are included in the group size calculation even if you have chosen not to include exact duplicates in the results. For example, if a group is identified in the log as having reached the maximum group size of 20 (1 group owner and 20 group members), but only 18 group members appear in the results, at least two exact duplicates for the group owner exist in the test field. Groups that are composed entirely of exact duplicates are also referenced in the log if they reach the maximum group size, but the groups do not appear in the results if you have chosen not to include exact duplicates.
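The arithmetic in the example above can be expressed as a short Python sketch. The function name is hypothetical, and the result is a lower bound on the number of exact duplicates of the group owner in the test field.

```python
def hidden_exact_duplicates(max_group_size, visible_members):
    """Minimum number of exact duplicates counted toward a full group
    but omitted from results that exclude exact duplicates."""
    return max(0, max_group_size - visible_members)


# A group is logged as reaching the maximum of 20, but shows 18 members.
print(hidden_exact_duplicates(20, 18))
```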

For more information, see “SET command” in the ACL Language Reference.

Specifying a maximum result size

Using the Result Size (%) field to specify a maximum result size allows you to automatically terminate the fuzzy duplicates operation if the size of the results grows beyond what you consider useful. No output table is produced if the operation is terminated.

By default, the maximum size of the set of results is 10% of the size of the test field. You can specify a different percentage from 1 to 1000 (one thousand). The upper limit of 1000% accommodates the nature of many-to-many matching while still preventing runaway processing. Many-to-many matching can produce results that are more numerous than the original test data set. However, results that exceed the size of the original test data set are likely to be primarily false positives.

The result size calculation uses rounding to produce only positive integers, and rounds up any number less than 2 to 2, the minimum result size (1 group owner and 1 group member).
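The calculation can be approximated with the following Python sketch. It rests on two assumptions that the text above does not confirm: that the size of the test field means the number of records in it, and that rounding is to the nearest integer with halves rounded up. The function name is hypothetical.

```python
def result_size_limit(test_field_records, result_size_pct=10):
    """Approximate the maximum number of result records.

    Assumes round-half-up rounding and a floor of 2
    (1 group owner and 1 group member).
    """
    if not 1 <= result_size_pct <= 1000:
        raise ValueError("result size percentage must be from 1 to 1000")
    # Integer arithmetic for round-half-up of records * pct / 100.
    rounded = (test_field_records * result_size_pct + 50) // 100
    return max(2, rounded)


print(result_size_limit(50000))      # default 10% of 50,000 records
print(result_size_limit(5))          # tiny table: falls back to the minimum
print(result_size_limit(200, 1000))  # 1000% of 200 records
```

Under these assumptions, a 50,000-record test field with the default setting allows up to 5,000 result records, and a very small table always allows at least 2.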

Choosing an appropriate result size limit depends on how you intend to use the results of a fuzzy duplicates operation. If you intend to manually inspect or scan the results, you should impose a limit that makes manual inspection practical. If the fuzzy duplicates operation exceeds the limit, specify more restrictive difference settings, or set a larger limit and consider using other ACL commands for additional processing of the results. If your analysis already includes additional processing of the results, you can set a much larger result size limit.

Do not turn off Result Size (%) unless you are confident that the results will be of a manageable size. Running the fuzzy duplicates operation without any limit on the number of results can cause the operation to run for a very long time, or exceed available memory, which terminates processing.

Related concepts
About fuzzy duplicates
How the difference settings work
How fuzzy duplicates are grouped
Fuzzy duplicate helper functions
Fuzzy duplicates overview
Related tasks
Testing for fuzzy duplicates
Working with fuzzy duplicate output results


(C) 2013 ACL Services Ltd. All Rights Reserved.