FUZZYDUP command

Used to detect nearly identical values (fuzzy duplicates) in a character field.

Syntax

FUZZYDUP ON key_field <OTHER fields> LEVDISTANCE value <DIFFPCT value>
<RESULTSIZE value> <EXACT> <IF test> TO table_name <LOCAL> <OPEN>

Parameters

ON key_field

Specifies the character field or expression to test for fuzzy duplicates.

OTHER fields

Specifies a list of fields or expressions to include in the output.

LEVDISTANCE value

Specifies the maximum allowable Levenshtein Distance between two strings for them to be identified as fuzzy duplicates and included in the results. The LEVDISTANCE value cannot be less than 1 or greater than 10.

DIFFPCT value

Optional. Specifies a threshold that limits the ‘difference percentage’ or the proportion of a string that can be different. The percentage that results from an internal ACL calculation performed on potential fuzzy duplicate pairs must be less than or equal to the DIFFPCT value for the pair to be included in the results. The DIFFPCT value cannot be less than 1 or greater than 99. If the parameter is omitted the threshold is turned off and difference percentage is not considered during processing of the FUZZYDUP command.

RESULTSIZE value

Optional. Specifies the maximum size of the set of results when calculated as a percentage of the test field size. The RESULTSIZE value cannot be less than 1 or greater than 1000 (one thousand) percent. If the parameter is omitted the threshold is turned off and result size is not considered during processing of the FUZZYDUP command.

Note

Omitting the RESULTSIZE parameter can produce an unduly large set of results that takes a very long time to process, or can cause available memory to be exceeded, which terminates processing. Omit the parameter only if you are confident that the results will be of a manageable size.

EXACT

Optional. If specified, includes exact duplicates as well as fuzzy duplicates in the results.

IF test

Optional. Specifies a condition that must be met. The command is only executed on records that pass the test.

TO table_name

Specify TO table_name to write the results to an ACL table. You must specify the table_name value as a quoted string with a .FIL file extension to create an ACL table.

For example: TO "Output.FIL"

You can also specify an absolute or relative file path to an existing folder to write the .FIL file to.

For example: TO "C:\Output.FIL" or TO "Results\Output.FIL"

LOCAL

Optional. Specifies that the output file should be saved in the same location as the ACL project. This parameter only applies when the command is run on an ACL Server table and the output file is an ACL table.

OPEN

Optional. Specifies that the table created by the command should be opened after the command executes. This parameter is only valid if the command creates an output table.

ACL output variables

GAPDUPn

Stores the total number of fuzzy duplicate groups identified by the command.

Remarks

You can use the FUZZYDUP command to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data. Unlike the ISFUZZYDUP( ) function, which identifies an exhaustive list of fuzzy duplicates for a single character value, the FUZZYDUP command identifies all fuzzy duplicates in a field, organizes them in groups, and outputs non-exhaustive results.

Non-exhaustive means that individual fuzzy duplicate groups in the results may not contain all the fuzzy duplicates in a test field that are within the specified degree of difference of the group owner. However, if a group owner is a fuzzy duplicate of another value in the test field, the two values will appear together in a group somewhere in the results. If producing an exhaustive list of fuzzy duplicates for a specific value in the test field is important to your analysis, you can use the ISFUZZYDUP( ) function for this purpose.

The FUZZYDUP command has a number of parameters that you can specify to control the degree of difference between fuzzy duplicates, and the size of the results. You may need to try different combinations of parameters and settings to find out what works best for a particular data set.

When processing data, the FUZZYDUP command calculates the Levenshtein Distance between each evaluated pair of strings in the test field, and calculates the difference percentage. The Levenshtein Distance is a value representing the minimum number of single character edits required to make one string identical to the other string. For more information, see LEVDIST( ) function. The difference percentage is the percentage of the shorter of the two evaluated strings that is different, and is the result of the following internal ACL calculation, which uses the Levenshtein Distance between the two strings:

Levenshtein Distance / number of characters in the shorter string × 100 = difference percentage

For detailed information about the fuzzy duplicate difference settings, controlling result size, and fuzzy duplicate groups, and see the ACL Analytics User Guide.

The command is not case-sensitive, so “SMITH” is equivalent to “smith.” The command also automatically trims trailing blanks in fields, so there is no need to use the TRIM( ) function when specifying a field as a parameter.

Concatenating two or more test fields can improve the effectiveness of the FUZZYDUP command by increase the degree of uniqueness of the test values. The OMIT( ) function can also improve the effectiveness of the command by removing generic elements such as “Corporation” or “Inc.” from field values. Removal of generic elements focuses the FUZZYDUP string comparison on just the portion of the strings where a meaningful difference may occur.

The SOUNDSLIKE( ) and SOUNDEX( ) functions provide a method for comparing strings based on a phonetic comparison (sound) rather than on an orthographic comparison (spelling).

Example

The following example tests a surname field for fuzzy duplicates (the Last_Name field in the “Employee_List” table in ACL DATA\Sample Data Files\Metaphor_Employee_Data.ACL), and outputs the results to a new ACL table.

FUZZYDUP ON Last_Name OTHER First_Name EmpNo LEVDISTANCE 1 DIFFPCT 50 RESULTSIZE 20 EXACT TO "Fuzzy_Last_Name" OPEN

Related reference
ISFUZZYDUP( ) function
LEVDIST( ) function
OMIT( ) function
SOUNDSLIKE( ) function
SOUNDEX( ) function
DUPLICATES command


(C) 2013 ACL Services Ltd. All Rights Reserved. | Send feedback