Controlling the size of fuzzy duplicate results

Fuzzy duplicate results have the potential to grow very large because the fuzzy duplicates feature uses an algorithm that performs a many-to-many comparison of values in the test field. The comparison, by design, also returns matches more easily than a comparison that requires exact matching.

Depending on the nature of the data and the difference settings you specify, the results can be many times larger than the table being tested. If the results get very large in relation to the test table, they may no longer be useful or meaningful, and the majority of them could be false positives.

Methods for controlling the size of fuzzy duplicate results

You can use one or more of the following methods to control the size of fuzzy duplicate results and reduce the number of false positives that are returned:

Use more than one test field: Concatenate test fields to increase the degree of uniqueness of test values.
Sort elements in test field values: Use the SORTWORDS( ) function to sort individual elements in test field values into a sequential order, which allows you to use a smaller Difference Threshold.
Remove generic elements from test field values: Use the OMIT( ) function to remove generic elements from test field values, which allows you to use a smaller Difference Threshold.
Difference Threshold: Use a small Difference Threshold initially (for example, 3 or less), and only increase it if you feel the results are overly restrictive.
Difference Percentage: Use the default Difference Percentage initially (50), and only increase it if you feel the results are overly restrictive. Do not turn off Difference Percentage unless you have a specific reason for doing so.
Result Size (%): Based on the number of values in the test field, specify a Result Size (%) that prevents the results from growing to an unmanageable size. Result Size (%) sets the maximum size of the results relative to the size of the test field. Do not turn off Result Size (%) unless you have a specific reason for doing so.
Note: This setting has no effect on the inclusion or exclusion of false positives.
Limit fuzzy duplicate group size: Use the SET command to specify a maximum fuzzy duplicate group size smaller than the default size of 20 (for example, SET FUZZYGROUPSIZE TO 10).
Note: This setting has no effect on the inclusion or exclusion of false positives.

Caution

Some of the methods itemized above, if set too restrictively, can exclude valid fuzzy duplicates. You may need to try different combinations of settings to find out what works best for a particular data set.

The methods least likely to exclude valid fuzzy duplicates are concatenating test fields, using the SORTWORDS( ) function, and using the OMIT( ) function.
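
The effect of these three methods can be illustrated outside Analytics with a short Python sketch. It assumes the difference between two values is measured as a Levenshtein-style edit distance. The helpers sort_words and omit_generic are hypothetical stand-ins that only imitate the idea behind SORTWORDS( ) and OMIT( ); they are not the Analytics implementations, and the sample values are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def sort_words(value: str) -> str:
    """Hypothetical stand-in for the idea behind SORTWORDS( ): order the words in a value."""
    return " ".join(sorted(value.split()))


def omit_generic(value: str, generic: set) -> str:
    """Hypothetical stand-in for the idea behind OMIT( ): drop generic words from a value."""
    return " ".join(word for word in value.split() if word not in generic)


# Same words in a different order: far apart as raw strings, identical once sorted.
raw_distance = levenshtein("Smith John", "John Smith")
sorted_distance = levenshtein(sort_words("Smith John"), sort_words("John Smith"))

# Generic suffixes inflate the distance between otherwise identical names.
generic_words = {"Corporation", "Inc"}
omit_distance = levenshtein(omit_generic("Acme Corporation", generic_words),
                            omit_generic("Acme Inc", generic_words))

# Concatenating a second field separates records that share only a name.
name_only = levenshtein("Smith", "Smith")
name_and_address = levenshtein("Smith" + "12 Main St", "Smith" + "98 Oak Ave")
```

In this sketch, sorting and omitting bring true duplicates closer together, so a smaller threshold still catches them, while concatenation pushes unrelated records further apart.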

Specifying a maximum result size

Using the Result Size (%) option to specify a maximum result size allows you to automatically terminate the fuzzy duplicates operation if the size of the results grows beyond what you consider manageable. No output table is produced if the operation is terminated.

The Result Size (%) option is a safety mechanism to prevent extremely long processing times. It has no relation to the validity of the results that are returned. Specifying a large result size limit may simply increase the number of false positives in the results. Conversely, specifying a small result size may cause processing to terminate before all valid fuzzy duplicates are captured.

Choosing an appropriate limit

Choosing an appropriate limit for the result size is a matter of judgment and may require some experimentation. Start with a conservative limit. If the limit is exceeded and processing is terminated, you can increase the limit. Once you have a limit that allows processing to complete, examine the results. If they include a large proportion of false positives, the best approach is to use one or more of the methods described in "Methods for controlling the size of fuzzy duplicate results".
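
The trial-and-error procedure described above can be sketched as a simple loop. This is only an illustration in Python: run_fuzzy_duplicates is a hypothetical callback that stands in for running the operation with a given Result Size (%) and is assumed to return None when processing is terminated. It is not an Analytics API.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_limit_that_completes(
    run_fuzzy_duplicates: Callable[[int], Optional[list]],
    candidate_limits: Iterable[int],
) -> Optional[Tuple[int, list]]:
    """Try limits from most to least conservative; stop at the first that completes.

    run_fuzzy_duplicates is a hypothetical callback: it takes a result size
    percentage and returns the results, or None if the limit was exceeded.
    """
    for limit in candidate_limits:
        results = run_fuzzy_duplicates(limit)
        if results is not None:
            return limit, results
    return None  # every candidate limit was exceeded


# Invented stand-in for a real run: this data set needs a limit of at least 50%.
def fake_run(limit_pct: int) -> Optional[list]:
    return ["group 1", "group 2"] if limit_pct >= 50 else None


outcome = first_limit_that_completes(fake_run, [10, 25, 50, 100])
```

The loop only finds a limit that lets processing complete. Examining the results for false positives remains a manual step.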

An optimal result set includes all valid fuzzy duplicates in the test field (true positives) while also minimizing the number of false positives. Achieving an optimal result set typically requires balancing all the fuzzy duplicates settings and helper methods available to you.

Why you can specify a result size limit greater than one hundred percent

By default, the maximum size of the set of results is 10% of the size of the test field. You can specify a different percentage from 1 to 1000 (one thousand). The limit of 1000% is to accommodate the nature of many-to-many matching, and to prevent runaway processing. Many-to-many matching can produce results that are more numerous than the original test data set. However, results that exceed the size of the original test data set may be primarily false positives.

Rounding of the result size calculation

The result size calculation uses rounding to produce only positive integers, and rounds up any number less than 2 to 2, the minimum result size (1 group owner and 1 group member).
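
As a rough illustration of the calculation, the following Python sketch derives a result size limit from the number of values in the test field and the percentage. The exact rounding mode Analytics uses is not stated here, so rounding to the nearest integer is an assumption, and the function name is hypothetical.

```python
def max_result_size(test_field_count: int, result_size_pct: int = 10) -> int:
    """Approximate result size limit for a given test field size and percentage.

    Assumption: the percentage of the test field size is rounded to the
    nearest integer. Anything below 2 is raised to 2, the minimum result
    size (1 group owner and 1 group member).
    """
    if not 1 <= result_size_pct <= 1000:
        raise ValueError("Result Size (%) must be between 1 and 1000")
    # Integer arithmetic avoids floating-point surprises when rounding.
    rounded = (test_field_count * result_size_pct + 50) // 100
    return max(2, rounded)


default_limit = max_result_size(1000)        # default 10% of 1000 values
tiny_table_limit = max_result_size(5)        # 10% of 5 values is below the minimum
largest_limit = max_result_size(1000, 1000)  # the 1000% ceiling
```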

Turning off the result size limit

Generally, you should not turn off Result Size (%) unless you are confident that the results will be of a manageable size. Running the fuzzy duplicates operation without any limit on the number of results can cause the operation to run for a very long time, or to exceed available memory, which terminates processing.

Setting a maximum fuzzy duplicate group size

Using the SET command to specify a maximum fuzzy duplicate group size can be a way of limiting the size of groups that would otherwise contain a large number of false positives. This feature is most useful if you find a setting that limits the size of only some of the groups in the output results. If all or most of the groups reach their maximum size, the setting may be too small, and you may be excluding valid fuzzy duplicates. The other possibility is that the difference settings are not restrictive enough, which causes the groups to grow larger.

The default maximum group size is 20, and does not include the group owner. You can specify a different maximum from 2 to 100. The specified maximum remains in effect for the duration of the Analytics session.
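
One way to judge whether a maximum group size is limiting only some groups or most of them is to count how many groups are full. The Python sketch below assumes you have already extracted the number of members in each group (excluding the owner) from the output. The function name and the sample counts are hypothetical.

```python
def groups_at_maximum(member_counts: list, max_group_size: int = 20):
    """Return the 1-based positions of full groups and their share of all groups.

    member_counts holds the number of members per group, not counting the
    group owner, in output order. The positions are list positions and are
    only assumed to correspond to group numbers.
    """
    if not 2 <= max_group_size <= 100:
        raise ValueError("maximum group size must be between 2 and 100")
    full = [position for position, count in enumerate(member_counts, start=1)
            if count >= max_group_size]
    share = len(full) / len(member_counts) if member_counts else 0.0
    return full, share


# Invented example: two of four groups reached the default maximum of 20.
full_groups, share_full = groups_at_maximum([20, 3, 20, 5])
```

A high share suggests either that the maximum is too small or that the difference settings are not restrictive enough.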

What happens if a group reaches the maximum size?

If a fuzzy duplicate group reaches the maximum size, any subsequent fuzzy duplicates for the group owner are not detected and do not appear in the group. These excluded fuzzy duplicates may or may not appear in a subsequent group, depending on whether they are part of a subsequent fuzzy duplicate match.

If producing an exhaustive list of fuzzy duplicates for an owner of a group that has reached its maximum size is important to your analysis, you can use the ISFUZZYDUP( ) function for this purpose. For more information, see Fuzzy duplicate helper functions.

A message appears in the log if one or more groups reach the maximum size. If the number of groups that reach the maximum size is ten or fewer, the groups are individually identified by group number.

Exact duplicates are included in the group size calculation

Exact duplicates are included in the group size calculation even if you have chosen not to include exact duplicates in the results. For example, if a group is identified in the log as having reached the maximum group size of 20 (1 group owner and 20 group members), but only 18 group members appear in the results, at least two exact duplicates for the group owner exist in the test field.
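
The arithmetic in this example can be written as a small Python helper. The function name is hypothetical. It simply subtracts the number of members visible in the results from the maximum group size that the log reports as reached.

```python
def min_hidden_exact_duplicates(max_group_size: int, visible_members: int) -> int:
    """Lower bound on exact duplicates counted toward a full group but not shown.

    Applies only to a group the log identifies as having reached the maximum
    size, in a run where exact duplicates were excluded from the results.
    """
    if not 0 <= visible_members <= max_group_size:
        raise ValueError("visible members must be between 0 and the maximum group size")
    return max_group_size - visible_members


# The case described above: maximum of 20 reached, 18 members visible.
hidden = min_hidden_exact_duplicates(20, 18)
```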

Groups that are composed entirely of exact duplicates are also referenced in the log if they reach the maximum group size, but the groups do not appear in the results if you have chosen not to include exact duplicates.

For more information, see SET command.