Fuzzy duplicates overview

The topics in this section explain ACL’s fuzzy duplicates feature, which allows you to test a character field for nearly identical values that may refer to the same real-world entity.

Testing for fuzzy duplicates is a more involved process than identifying exact duplicates. Understanding the settings that control the degree of difference between fuzzy duplicates, and how fuzzy duplicates are grouped in the output results, will help optimize your use of the feature. In addition to the main fuzzy duplicates feature, you may need to use one or both of the fuzzy duplicate helper functions, or concatenate test fields, to achieve your goals.

The fuzzy duplicates feature is processor-intensive, because every value in a test field must be compared with every subsequent record’s value in that field. If your analysis allows it, use methods such as filtering or extracting subsets of records to limit the size of the data set you test. Smaller data sets improve the speed of execution, and also help control the size of the results.

Depending on the nature of the data you are testing, and the goal of your fuzzy duplicate analysis, you may need to perform several tasks to produce useful results. Table 1 provides a sequence for these tasks. With the exception of the fuzzy duplicates feature itself, the tasks are optional, but performing one or more of them may improve the quality of the results.

Table 1. Sequence of tasks in fuzzy duplicate analysis

Task

Optional

ACL feature

Details

1

Limit the size of the test data set

Yes

Filters

Extracting subsets of data

Improve performance by processing only records that are meaningful to your analysis

2

Remove generic elements such as “Corporation” or “Inc.” from field values

Yes

OMIT( ) function

Reduce the size and increase the precision of results by focusing on just the portion of a character value where a meaningful difference may occur

3

Concatenate fields to increase the uniqueness of the test values

Yes

an ACL expression using the Add operator (+)

Reduce the size and increase the precision of results by testing values of greater uniqueness, which are produced by concatenating two or more character fields

4

Identify all fuzzy duplicates in a field and output non-exhaustive results

No

Fuzzy Duplicates feature

The main Fuzzy Duplicates feature

5

Identify an exhaustive list of fuzzy duplicates for a single character value from the non-exhaustive results

Yes

ISFUZZYDUP( ) function

Produce a convenient and exhaustive list of fuzzy duplicates for a character value of particular relevance to your audit goal

Section contents

Related concepts
About duplicates


(C) 2013 ACL Services Ltd. All Rights Reserved. | Send feedback