Fuzzy duplicates overview
You can use Analytics’s fuzzy duplicates feature to test a character field for nearly identical values that may refer to the same real-world entity.
Testing for fuzzy duplicates is a more involved process than identifying exact duplicates. Understanding the settings that control the degree of difference between fuzzy duplicates, and how fuzzy duplicates are grouped in the output results, will help optimize your use of the feature.
In addition to the main fuzzy duplicates feature, you may need to use one or both of the fuzzy duplicate helper functions, or concatenate test fields, to achieve your goals.
Tip
The fuzzy duplicates feature is processor-intensive, because every value in a test field must be compared with every subsequent value in the field. If your analysis allows it, use methods such as filtering or extracting subsets of records to limit the size of the data set you test. Smaller data sets improve the speed of execution, and also help control the size of the results.
Fuzzy duplicates versus fuzzy join
The fuzzy duplicates feature analyzes values in a single field in a single Analytics table. To use fuzzy matching to combine fields from two Analytics tables into a new, third table, see Fuzzy join.
Sequence of tasks in fuzzy duplicate analysis
Depending on the nature of the data you are testing, and the goal of your fuzzy duplicate analysis, you may need to perform several tasks to produce useful results. The table below provides a sequence for these tasks.
Note
With the exception of the fuzzy duplicates feature itself, the tasks are optional, but performing one or more of them may improve the quality of the results.
|
Task |
Optional |
Analytics feature |
Details |
---|---|---|---|---|
1 |
Limit the size of the test data set |
Yes |
Filters Extracting subsets of data |
Improve performance by processing only records that are meaningful to your analysis |
2 |
Remove generic elements such as “Corporation” or “Inc.” from field values |
Yes |
OMIT( ) function |
Reduce the size and increase the precision of results by focusing on just the portion of a character value where a meaningful difference may occur |
3 |
Concatenate fields to increase the uniqueness of the test values |
Yes |
an Analytics expression using the Add operator (+) |
Reduce the size and increase the precision of results by testing values of greater uniqueness, which are produced by concatenating two or more character fields |
4 |
Identify all fuzzy duplicates in a field and output non-exhaustive results |
No |
Fuzzy Duplicates feature |
The main Fuzzy Duplicates feature |
5 |
Identify an exhaustive list of fuzzy duplicates for a single character value from the non-exhaustive results |
Yes |
ISFUZZYDUP( ) function |
Produce a convenient and exhaustive list of fuzzy duplicates for a character value of particular relevance to your audit goal |