About fuzzy duplicates

Fuzzy duplicates are nearly identical character values that may refer to the same real-world entity. For example, the following four values may all be the same company:

Common causes of fuzzy duplicates are data entry errors such as typos and misspellings, differing methods of formatting data, and differing data entry conventions. The intentional creation of nearly identical values may indicate fraud. Fuzzy duplicates impede data analysis, which relies upon data referencing real-world entities in a consistent manner.

You can test individual character fields in a table to identify any fuzzy duplicates in a field, and produce output results that group fuzzy duplicates based on a degree of difference that you specify. The fuzzy duplicate groups provide you with a starting point. You will probably need to perform additional analysis, such as a duplicates test of fields associated with the fuzzy duplicates test field, to determine if any of the members of a group do in fact reference the same real-world entity.

The fuzzy duplicates test field can contain letters, numbers, and special characters, which allows you to test values such as names of people and companies, addresses, and social security numbers and product numbers, if the numbers are formatted as character data.

The fuzzy duplicates feature does not support fuzzy joins, or a fuzzy comparison of values in two separate fields or two separate tables.

Fuzzy duplicate output results

Fuzzy duplicate output results are arranged in groups. Figure 1 provides an example of the results produced by testing for fuzzy duplicates in the Last Name field of a table. The Original Record Number of the first fuzzy duplicate in each group is used to identify the group. The Original Record Number is from the original test table, not from the output table. For example, “Janson” is a value in record number 3 in the original table, and because “Janson” is the first value in the group, based on record sequence in the original table, the group is identified as Group 3, and in the Group Number field “3” is assigned to each member of the group. For more information, see How fuzzy duplicates are grouped.

Figure 1. Fuzzy Duplicates output results

Character-based comparison

When comparing two values, the fuzzy duplicates feature performs a character-based comparison, not a word-based comparison. The feature treats blanks or spaces between words as characters, and does not differentiate between individual words. Regardless of the number of individual words in a value, the feature treats the value as a single, unbroken string of characters.

The implication of this approach is that some values that appear to be fuzzy duplicates may not be included in the output results, based on the nature of the data and the difference settings you specify in the Fuzzy Duplicates dialog box. Consider the following examples:

The first example could be two versions of the same name, one using initials and the other using spelled-out first and middle names. The second example could be a short version and a longer version of a company name. However, neither of these pairs will be returned as fuzzy duplicates unless the difference settings are set quite loosely, which would have the adverse effect of also returning large numbers of false positives. The fuzzy duplicates feature processes each pair in the examples as simply two strings of characters. In each case, because the two strings differ significantly in length, the two strings are significantly different from each other when considered at the character level.

For more information, see How the difference settings work.

Additional information about fuzzy duplicates

Testing a field for fuzzy duplicates does not require that the field is sorted, and sorting a field in advance of testing does not assist the fuzzy duplicates operation in any way. However, you may choose to sort a test field in advance because it can make the output results easier to scan, and the Fuzzy Duplicates dialog box does not include the Presort option.

When testing for fuzzy duplicates, you can optionally include exact duplicates in the output results. If you are interested in finding only exact duplicates, use the duplicates feature instead.

Related concepts
Controlling the size of fuzzy duplicate results
How the difference settings work
How fuzzy duplicates are grouped
Fuzzy duplicate helper functions
Fuzzy duplicates overview
Related tasks
Testing for fuzzy duplicates
Working with fuzzy duplicate output results

(C) 2015 ACL Services Ltd. All Rights Reserved.