ISFUZZYDUP(string1, string2, levdist <,diffpct>)
Character. The first string in the comparison.
Character. The second string in the comparison.
Numeric. The maximum allowable Levenshtein Distance between the two strings for them to be identified as fuzzy duplicates. The levdist value cannot be less than 1 or greater than 10. Increasing the levdist value increases the number of results by including values with a greater degree of fuzziness – that is, values that are more different from one another.
Optional. Numeric. A threshold that limits the ‘difference percentage’ or the proportion of a string that can be different. The percentage that results from an internal ACL calculation performed on potential fuzzy duplicate pairs must be less than or equal to the diffpct value for the ISFUZZYDUP( ) function to evaluate to T (true). The diffpct value cannot be less than 1 or greater than 99. If the parameter is omitted the threshold is turned off and difference percentage is not considered during processing of the ISFUZZYDUP( ) function.
Logical. Returns T (true) if the string parameter values are fuzzy duplicates, and F (false) otherwise.
You can use the ISFUZZYDUP( ) function to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data. Unlike the FUZZYDUP command, which identifies all fuzzy duplicates in a field, organizes them in groups, and outputs non-exhaustive results, the ISFUZZYDUP( ) function can identify an exhaustive list of fuzzy duplicates for a single character value.
Exhaustive means that all values within the specified degree of difference of the test value are returned, regardless of their position in the test field relative to the test value. The function is useful if the non-exhaustive results produced by FUZZYDUP are not sufficient for the purposes of your analysis, and you need to directly scrutinize every fuzzy duplicate for a specific character value.
The function also identifies exact duplicates. Unlike the FUZZYDUP command, you cannot exclude exact duplicates when using the function.
When processing data, the ISFUZZYDUP( ) function calculates the Levenshtein Distance between the two evaluated strings, and calculates the difference percentage. The Levenshtein Distance is a value representing the minimum number of single character edits required to make one string identical to the other string. For more information, see LEVDIST( ) function. The difference percentage is the percentage of the shorter of the two evaluated strings that is different, and is the result of the following internal ACL calculation, which uses the Levenshtein Distance between the two strings:
Levenshtein Distance / number of characters in the shorter string × 100 = difference percentage
For the ISFUZZYDUP( ) function to evaluate to T (true), the Levenshtein Distance must be less than or equal to the levdist parameter value, and the difference percentage must be less than or equal to the diffpct parameter value (if specified).
The function is not case-sensitive, so “SMITH” is equivalent to “smith.” The function also automatically trims trailing blanks in fields, so there is no need to use the TRIM( ) function when specifying a field as a parameter.
The OMIT( ) function can improve the effectiveness of the ISFUZZYDUP( ) function by removing generic elements such as “Corporation” or “Inc.” from field values. Removal of generic elements focuses the ISFUZZYDUP( ) string comparison on just the portion of the strings where a meaningful difference may occur.
The LEVDIST( ) function provides an alternative method for comparing strings based on Levenshtein Distance.
Unlike the ISFUZZYDUP( ) function, the LEVDIST( ) function is case-sensitive by default.
Other string comparison functions
The DICECOEFFICIENT( ) function de-emphasizes or completely ignores the relative position of characters or character blocks when comparing strings.
The SOUNDSLIKE( ) and SOUNDEX( ) functions compare strings based on a phonetic comparison (sound) rather than on an orthographic comparison (spelling).
Example | Return value |
---|---|
ISFUZZYDUP("Smith","Smythe",1,99) |
F Two edits are required to transform “Smith” into “Smythe”, but the levdist parameter value is only 1. |
ISFUZZYDUP("Smith","Smythe",2,99) |
T Two edits are required to transform “Smith” into “Smythe”, and the levdist parameter value is 2. |
ISFUZZYDUP("SMITH","smith",1,1) |
T Zero edits are required to transform “SMITH” into “smith”, and the levdist parameter value is 1. The ISFUZZYDUP( ) function is not case-sensitive. |
ISFUZZYDUP("Smith","Smythe",2,35) |
F Two edits are required to transform “Smith” into “Smythe”, the shorter string is 5 characters, yielding a difference percentage of 40 (2/5 × 100), which is not less than or equal to the diffpct parameter value of 35. |
ISFUZZYDUP(Surname,"Smith",3,99) |
A logical value (T or F) indicating whether individual values in the “Surname” field are fuzzy duplicates for the string “Smith”. |
To create a filter that isolates all values in the “Surname” field that are fuzzy duplicates for “Smith”, specify:
ISFUZZYDUP(Surname,"Smith",3,99)
Changing the levdist or diffpct parameter values allows you to adjust the amount of difference in the filtered values.