ISFUZZYDUP( ) function
Returns a logical value indicating whether a string is a fuzzy duplicate of a comparison string.
Syntax
ISFUZZYDUP(string1, string2, levdist <,diffpct>)
Parameters
Name | Type | Description |
---|---|---|
string1 | character | The first string in the comparison. |
string2 | character | The second string in the comparison. |
levdist | numeric |
The maximum allowable Levenshtein distance between the two strings for them to be identified as fuzzy duplicates. The levdist value cannot be less than 1 or greater than 10. Increasing the levdist value increases the number of results by including values with a greater degree of fuzziness – that is, values that are more different from each another. |
diffpct
optional |
numeric |
The upper threshold for the 'difference percentage'. Difference percentage is explained in How it works. The diffpct value cannot be less than 1 or greater than 99. Increasing the diffpct value increases the number of results by including values with a greater proportion of difference relative to their length. If omitted, difference percentage is not considered during processing of the ISFUZZYDUP( ) function.
|
Output
Logical. Returns T (true) if string values are fuzzy duplicates, and F (false) otherwise.
Examples
Basic examples
Returns F, because two edits are required to transform "Smith" into "Smythe", but the levdist value is only 1:
ISFUZZYDUP("Smith","Smythe", 1, 99)
Returns T, because two edits are required to transform "Smith" into "Smythe", and the levdist value is 2:
ISFUZZYDUP("Smith","Smythe", 2, 99)
Returns T, because zero edits are required to transform "SMITH" into "smith", and the levdist value is 1 (the ISFUZZYDUP( ) function is not case-sensitive):
ISFUZZYDUP("SMITH","smith", 1, 99)
Returns a logical value (T or F) indicating whether individual values in the Last_Name field are fuzzy duplicates for the string "Smith":
ISFUZZYDUP(Last_Name,"Smith", 3, 99)
Advanced examples
Working with difference percentage
The difference percentage gives you a tool for reducing the number of false positives returned by ISFUZZYDUP( ).
No diffpct specified
Returns T, because five edits are required to transform "abc" into "Smith", and the levdist value is 5:
ISFUZZYDUP("abc", "Smith", 5)
diffpct specified
Returns F, even though "abc" is within the specified Levenshtein distance of "Smith", because 5 edits/a string length of 3 results in a difference percentage of 167%, which exceeds the specified diffpct of 99%:
ISFUZZYDUP("abc", "Smith", 5, 99)
Difference percentage is fully explained in How it works.
Isolating fuzzy duplicates for "Smith"
Create a filter that isolates all values in the Last_Name field that are fuzzy duplicates for "Smith":
SET FILTER TO ISFUZZYDUP(Last_Name, "Smith", 3, 99)
Changing the levdist or diffpct values allows you to adjust the amount of difference in the filtered values.
Remarks
When to use ISFUZZYDUP( )
Use the ISFUZZYDUP( ) function to find nearly identical values (fuzzy duplicates) or locate inconsistent spelling in manually entered data.
How it works
The ISFUZZYDUP( ) function calculates the Levenshtein distance between two strings, and calculates the difference percentage.
ISFUZZYDUP( ) evaluates to T (true) if:
- The Levenshtein distance is less than or equal to the levdist value.
- The difference percentage is less than or equal to the diffpct value (if specified).
Levenshtein distance
The Levenshtein distance is a value representing the minimum number of single character edits required to make one string identical to the other string.
For more information, see LEVDIST( ) function.
Difference percentage
The difference percentage is the percentage of the shorter of the two evaluated strings that is different.
The difference percentage is the result of the following internal Analytics calculation, which uses the Levenshtein distance between the two strings:
Levenshtein distance / number of characters in the shorter string × 100 = difference percentage
Using the optional difference percentage helps reduce the number of false positives returned by ISFUZZYDUP( ):
- The upper threshold for diffpct is 99%, which prevents the entire replacement of a string in order to make it identical.
- Strings that require a large number of edits in relation to their length are excluded.
Usage tips
- Case-sensitivity The function is not case-sensitive, so "SMITH" is equivalent to "smith."
- Trailing blanks The function automatically trims trailing blanks in fields, so there is no need to use the TRIM( ) function when specifying a field as a parameter.
- Removing generic
elements The
OMIT( ) function can improve the effectiveness of the ISFUZZYDUP( )
function by removing generic elements such as "Corporation" or "Inc."
from field values.
Removal of generic elements focuses the ISFUZZYDUP( ) string comparison on just the portion of the strings where a meaningful difference may occur.
How the FUZZYDUP command and the ISFUZZYDUP( ) function differ
The FUZZYDUP command identifies all fuzzy duplicates in a field, organizes them in groups, and outputs non-exhaustive results.
The ISFUZZYDUP( ) function identifies an exhaustive list of fuzzy duplicates for a single character value.
The command and the function both identify exact duplicates. Unlike the command, you cannot exclude exact duplicates when using the function.
What exhaustive means
Exhaustive means that all values within the specified degree of difference of the test value are returned, regardless of their position in the test field relative to the test value.
The ISFUZZYDUP( ) function is useful if the non-exhaustive results produced by the FUZZYDUP command are not sufficient for the purposes of your analysis, and you need to directly scrutinize every fuzzy duplicate for a specific character value.
Related functions
- LEVDIST( ) provides
an alternate method for comparing strings based on Levenshtein
distance.
Unlike ISFUZZYDUP( ), LEVDIST( ) is case-sensitive by default.
- DICECOEFFICIENT( ) de-emphasizes or completely ignores the relative position of characters or character blocks when comparing strings.
- SOUNDSLIKE( ) and SOUNDEX( ) compare strings based on a phonetic comparison (sound) rather than on an orthographic comparison (spelling).