FUZZYDUP command
Concept Information
Detects nearly identical values (fuzzy duplicates) in a character field.
Note
To use fuzzy matching to combine fields from two Analytics tables into a new, single Analytics table, see FUZZYJOIN command.
Syntax
FUZZYDUP ON key_field <OTHER field <...n>|OTHER ALL <EXCLUDE field_name <...n>>> LEVDISTANCE value <DIFFPCT percentage> <RESULTSIZE percentage> <EXACT> <IF test> TO table_name <LOCAL> <OPEN>
Parameters
Name | Description |
---|---|
ON key_field | The character field or expression to test for fuzzy duplicates. |
OTHER field <...n> | OTHER ALL optional |
One or more additional fields to include in the output.
|
EXCLUDE field_name optional |
Only valid when using OTHER ALL. The field or fields to exclude from the command. EXCLUDE allows you to fine-tune OTHER ALL, by excluding the specified fields. EXCLUDE must immediately follow OTHER ALL. For example: OTHER ALL EXCLUDE field_1 field_2 |
LEVDISTANCE value |
The maximum allowable Levenshtein distance between two strings for them to be identified as fuzzy duplicates and included in the results. The LEVDISTANCE value cannot be less than 1 or greater than 10. Increasing the LEVDISTANCE value increases the number of results by including values with a greater degree of fuzziness – that is, values that are more different from one another. For more information, see FUZZYDUP behavior. |
DIFFPCT percentage optional |
A threshold that limits the 'difference percentage' or the proportion of a string that can be different. The percentage that results from an internal Analytics calculation performed on potential fuzzy duplicate pairs must be less than or equal to the DIFFPCT value for the pair to be included in the results. The DIFFPCT value cannot be less than 1 or greater than 99. If DIFFPCT is omitted the threshold is turned off and difference percentage is not considered during processing of the FUZZYDUP command. For more information, see FUZZYDUP behavior. |
RESULTSIZE percentage optional |
The maximum size of the set of output results as a percentage of the number of records in the key field. For example, for a key field with 50,000 records, a RESULTSIZE of 3 would terminate processing if the results exceeded 1500 fuzzy duplicates (50,000 x 0.03). No output table is produced if processing is terminated. The RESULTSIZE value cannot be less than 1 or greater than 1000 (one thousand) percent. The limit of 1000% is to accommodate the nature of many-to-many matching, which can produce results that are more numerous than the original test data set. If RESULTSIZE is omitted the threshold is turned off and result size is not considered during processing of the FUZZYDUP command. Caution Omitting RESULTSIZE can produce an unduly large set of results that takes a very long time to process, or can cause available memory to be exceeded, which terminates processing. Omit RESULTSIZE only if you are confident that the results will be of a manageable size. |
EXACT optional |
Includes exact duplicates as well as fuzzy duplicates in the results. |
IF test optional |
A conditional expression that must be true in order to process each record. The command is executed on only those records that satisfy the condition. Note The IF parameter is evaluated against only the records remaining in a table after any scope parameters have been applied (WHILE, FIRST, NEXT). |
TO table_name |
The location to send the results of the command to:
|
LOCAL optional |
Saves the output file in the same location as the Analytics project. Note Applicable only when running the command against a server table with an output file that is an Analytics table. The LOCAL parameter must immediately follow the TO parameter. |
OPEN optional |
Opens the table created by the command after the command executes. Only valid if the command creates an output table. |
Analytics output variables
Name | Contains |
---|---|
GAPDUPn |
The total number of gaps, duplicates, or fuzzy duplicate groups identified by the command. |
Examples
Test a surname field for fuzzy duplicates
You test a surname field for fuzzy duplicates (the Last_Name field in the Employee_List table in ACL DATA\Sample Data Files\Metaphor_Employee_Data.ACL). The results are output to a new Analytics table.
- In addition to the test field, other fields are included in the results.
- The maximum allowable Levenshtein distance is 1.
- The proportion of a string that can be different is limited to 50%.
- The size of the results is limited to 20% of the test field size.
- In addition to fuzzy duplicates, exact duplicates are included.
OPEN Employee_List FUZZYDUP ON Last_Name OTHER First_Name EmpNo LEVDISTANCE 1 DIFFPCT 50 RESULTSIZE 20 EXACT TO "Fuzzy_Last_Name" OPEN
Remarks
How it works
The FUZZYDUP command finds nearly identical values (fuzzy duplicates), or locates inconsistent spelling in manually entered data.
Unlike the ISFUZZYDUP( ) function, which identifies an exhaustive list of fuzzy duplicates for a single character value, the FUZZYDUP command identifies all fuzzy duplicates in a field, organizes them in non-exhaustive groups, and outputs results.
For detailed information about how this command works, see Fuzzy duplicates analysis.
What non-exhaustive means
Non-exhaustive means that individual fuzzy duplicate groups in the results may not contain all the fuzzy duplicates in a test field that are within the specified degree of difference of the group owner. However, if a group owner is a fuzzy duplicate of another value in the test field, the two values will appear together in a group somewhere in the results. So groups may be non-exhaustive, but the results, in total, are exhaustive.
If producing a single, exhaustive list of fuzzy duplicates for a specific value in the test field is important to your analysis, you can use the ISFUZZYDUP( ) function for this purpose.
FUZZYDUP behavior
The FUZZYDUP command has two parameters that allow you to control the degree of difference between fuzzy duplicates, and the size of the results:
- LEVDISTANCE
- DIFFPCT
You may need to try different combinations of settings for these two parameters to find out what works best for a particular data set.
LEVDISTANCE (Levenshtein distance)
When processing data, the FUZZYDUP command calculates the Levenshtein distance between each evaluated pair of strings in the test field, and calculates the difference percentage. The Levenshtein distance is a value representing the minimum number of single character edits required to make one string identical to the other string. For more information, see LEVDIST( ) function.
DIFFPCT (Difference percentage)
The difference percentage is the percentage of the shorter of the two evaluated strings that is different, and is the result of the following internal Analytics calculation, which uses the Levenshtein distance between the two strings:
Levenshtein distance / number of characters in the shorter string × 100 = difference percentage
More information
For detailed information about the fuzzy duplicate difference settings, controlling result size, and fuzzy duplicate groups, see Fuzzy duplicates analysis.
Case-sensitivity
The FUZZYDUP command is not case-sensitive, so "SMITH" is equivalent to "smith."
Trailing blanks automatically trimmed
The FUZZYDUP command automatically trims trailing blanks in key_field, so there is no need to use the TRIM( ) or ALLTRIM( ) function when specifying a single field for key_field.
If you concatenate fields for key_field, you should use ALLTRIM( ), as shown below.
Improving the effectiveness of FUZZYDUP
Three techniques can significantly improve the effectiveness of the FUZZYDUP command:
- sorting individual elements in test field values
- removing generic elements from test field values
- concatenating test fields
These techniques generate more focused sets of results with fewer false positives and more true positives. You can use the techniques separately, or in combination.
Sorting individual elements in test field values
The SORTWORDS( ) function can improve the effectiveness of the FUZZYDUP command by sorting individual elements in test field values into a sequential order.
Sorting elements, such as the components of an address, can make two strings with the same information, but a different format, more closely resemble each other. A closer resemblance improves the chances that a pair of strings are selected as fuzzy duplicates of each other.
For more information, see SORTWORDS( ) function.
For a video providing an overview of SORTWORDS( ), see Fuzzy Matching Using SORTWORDS() (English only).
Removing generic elements from test field values
The OMIT( ) function can improve the effectiveness of the FUZZYDUP command by removing generic elements such as "Corporation" or "Inc.", or characters such as commas, periods, and ampersands (&), from test field values.
Removal of generic elements and punctuation focuses the FUZZYDUP string comparison on just the portion of the strings where a meaningful difference may occur.
For more information, see OMIT( ) function.
Concatenating test fields
Concatenating two or more test fields can improve the effectiveness of the FUZZYDUP command by increase the degree of uniqueness of the test values.
For example, by concatenating an Address field and a City field you avoid fuzzy matches between addresses in different cities:
FUZZYDUP ON ALLTRIM(Address)+ALLTRIM(City) OTHER Address City Vendor_Name LEVDISTANCE 4 DIFFPCT 50 RESULTSIZE 20 EXACT TO "Vendor_Name_Fuzzy_Dupes" OPEN
Other string comparison methods
- DICECOEFFICIENT( ) function provides a method for comparing strings that de-emphasizes or completely ignores the relative position of characters or character blocks.
- SOUNDSLIKE( ) and SOUNDEX( ) functions provide a method for comparing strings based on a phonetic comparison (sound) rather than on an orthographic comparison (spelling).