How the difference settings work

The results of a fuzzy duplicates operation consist of one or more groups of fuzzy duplicates. The groups are formed based on the degree of difference you specify when you perform the operation. The degree of difference is a combination of two settings in the Fuzzy Duplicates dialog box:

Difference Threshold allows you to control how much two fuzzy duplicates can differ.

Difference Percentage allows you to control what proportion of an individual value can be different.

The two settings act as two separate thresholds. Values in the field you are testing must be within the bounds of both thresholds to be included in a group of fuzzy duplicates in the results. By adjusting the two settings you can maximize the precision and usefulness of the results.

You can turn off Difference Percentage, in which case values only need to be within the bounds of the Difference Threshold. You cannot turn off Difference Threshold.

Difference Threshold

The Difference Threshold is the maximum allowable Levenshtein Distance between two values for them to be identified as fuzzy duplicates. Levenshtein Distance is a numeric value resulting from a computing science algorithm that calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to make one value identical to another. For example, the Levenshtein Distance between “Smith” and “Smythe” is 2: ‘y’ must be substituted for ‘i’, and ‘e’ must be inserted. The greater the Levenshtein Distance, the greater the difference between two values. A distance of 0 (zero) means two values are identical. Table 1 provides examples of various Levenshtein Distances. Note that the Levenshtein algorithm treats blanks or spaces between words as characters.
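As a rough illustration of the calculation described above, the following Python sketch computes a Levenshtein Distance using the standard dynamic-programming approach. It is not ACL's implementation. The function name levenshtein is chosen only for this example, and the sketch assumes characters are compared exactly as given, including case and spaces.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] = distance between the processed prefix of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to an empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute ch_a with ch_b, or keep a match
            ))
        previous = current
    return previous[-1]


# Pairs taken from Table 1; spaces count as characters.
smith_smythe = levenshtein("Smith", "Smythe")
hanssen_jansn = levenshtein("Hanssen", "Jansn")
couriers = levenshtein("Intercity Couriers", "Intercity Couriers Inc.")
```

The sketch keeps only two rows of the comparison grid in memory at a time, which is sufficient because each row depends only on the row before it.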

Increasing the Difference Threshold increases the maximum allowable Levenshtein Distance, which increases the size of the results by including values that are more different from one another. You can specify a Difference Threshold from 1 to 10. The upper limit is imposed because increasing the maximum Levenshtein Distance beyond a certain point creates a very large set of results that contains primarily false positives. The lower limit is imposed because entering 0 (zero) would include only exact duplicates. If you are interested in finding only exact duplicates, use the duplicates feature instead.

For more information about Levenshtein Distance, see “LEVDIST( ) function” in the ACL Language Reference.

Table 1. Examples of Levenshtein Distances

| Value 1 | Value 2 | Levenshtein Distance | Included in results if Difference Threshold set to 3 |
|---|---|---|---|
| Smith | Smith | 0 | Yes (if Include Exact Duplicates is checked) |
| Smith | Smithe | 1 | Yes |
| Smith | Smythe | 2 | Yes |
| Hanssen | Jansn | 3 | Yes |
| Smith | Brown | 5 | No |
| Intercity Couriers | Intercity Couriers Inc. | 5 | No |
| Diamond Tire | Diamond Tire & Auto | 7 | No |
| JW Smith | John William Smith | 10 | No |

Difference Percentage

The Difference Percentage is the maximum allowable percentage of the shorter of two compared values that can be different for the two values to be identified as fuzzy duplicates. Using the Levenshtein Distance between each pair of values it compares in the test field, ACL performs the following internal calculation:

Levenshtein Distance / number of characters in the shorter value × 100 = difference percentage

For example, the Levenshtein Distance between “Smith” and “Smythe” is 2, and the shorter of the two values is 5 characters long, producing a difference percentage of 40 (2/5 × 100). If the difference percentage is less than or equal to the specified Difference Percentage, the two values are eligible to be included in the results, assuming they are also within the maximum allowable Levenshtein Distance of each other (the Difference Threshold). Table 2 provides examples of various difference percentages.
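The calculation above can be sketched in a few lines of Python. This is only an illustration of the formula, not ACL's internal code. The function name difference_percentage is hypothetical, and the Levenshtein Distance is assumed to have been calculated already and is passed in as a number.

```python
def difference_percentage(distance: int, value1: str, value2: str) -> float:
    """Levenshtein Distance as a percentage of the length of the shorter value."""
    shorter = min(len(value1), len(value2))
    if shorter == 0:
        # Assumption for this sketch: an empty value has no meaningful percentage.
        raise ValueError("both values must contain at least one character")
    # Multiply before dividing so that simple cases such as 2/5 give exactly 40.0.
    return distance * 100 / shorter


# "Smith" / "Smythe": distance 2, shorter value is 5 characters long
smith_pct = difference_percentage(2, "Smith", "Smythe")

# "JW Smith" / "John William Smith": distance 10, shorter value is 8 characters long,
# so the percentage can exceed 100
jw_pct = difference_percentage(10, "JW Smith", "John William Smith")

# A pair is eligible under this setting when its percentage is less than or equal
# to the specified Difference Percentage (50 is used here, as in Table 2).
smith_eligible = smith_pct <= 50
jw_eligible = jw_pct <= 50
```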

Increasing the Difference Percentage increases the size of the results by including values that contain a greater percentage of difference. You can specify a Difference Percentage from 1 to 99. The upper limit is imposed because allowing difference percentages of 100 or greater could include pairs of values that are completely different from each other in the same fuzzy duplicates group in the results. For example, “ABC” and “XYZ” have a Levenshtein Distance of 3, and a shorter value length of 3, producing a difference percentage of 100. The lower limit is imposed because entering 0 (zero) would include only exact duplicates. If you are interested in finding only exact duplicates, use the duplicates feature instead.

You can optionally turn off Difference Percentage. If you turn off Difference Percentage the results do not take into account the percentage of a value that is different. You may capture some additional valid fuzzy duplicates, such as “JW Smith” and “John William Smith”. However, fuzzy duplicate groups could also include values that are completely different from each other, such as “Smith” and “Brown”. The results will also be larger than when you use Difference Percentage with any setting.
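The effect of turning off Difference Percentage can be sketched as follows. This is a simplified illustration, not ACL's implementation. The function name within_difference_percentage is hypothetical, and passing None for the setting is this sketch's own way of representing the turned-off state. The distances and lengths are taken from Table 2.

```python
from typing import Optional


def within_difference_percentage(distance: int, shorter_length: int,
                                 max_percentage: Optional[int]) -> bool:
    """Return True if a pair passes the Difference Percentage check.

    max_percentage=None stands for Difference Percentage being turned off
    (an assumption of this sketch).
    """
    if max_percentage is None:
        return True  # turned off: the percentage of difference is not considered
    return distance * 100 / shorter_length <= max_percentage


# "JW Smith" / "John William Smith": distance 10, shorter length 8 (125%)
jw_on = within_difference_percentage(10, 8, 50)
jw_off = within_difference_percentage(10, 8, None)

# "Smith" / "Brown": distance 5, shorter length 5 (100%)
brown_on = within_difference_percentage(5, 5, 99)
brown_off = within_difference_percentage(5, 5, None)
```

With the setting on, both pairs are rejected even at the maximum value of 99. With the setting off, both pass this check, so whether they appear in the results depends only on the Difference Threshold.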

Table 2. Examples of difference percentages

| Value 1 (length) | Value 2 (length) | Levenshtein Distance | Difference percentage | Included in results if Difference Percentage set to 50 |
|---|---|---|---|---|
| Smith (5) | Smith (5) | 0 | 0% (0/5) | Yes (if Include Exact Duplicates is checked) |
| Smith (5) | Smithe (6) | 1 | 20% (1/5) | Yes |
| Smith (5) | Smythe (6) | 2 | 40% (2/5) | Yes |
| Hanssen (7) | Jansn (5) | 3 | 60% (3/5) | No |
| Smith (5) | Brown (5) | 5 | 100% (5/5) | No |
| Intercity Couriers (18) | Intercity Couriers Inc. (23) | 5 | 27.77% (5/18) | Yes |
| Diamond Tire (12) | Diamond Tire & Auto (19) | 7 | 58.33% (7/12) | No |
| JW Smith (8) | John William Smith (18) | 10 | 125% (10/8) | No |

How Difference Threshold and Difference Percentage work together

The examples in Table 3 illustrate how Difference Threshold and Difference Percentage work together. The compared values that appear in Table 1 and Table 2 must now be within the bounds of both thresholds to be included in the results. “Hanssen/Jansn”, which is included when only the Difference Threshold is considered (Table 1), and “Intercity Couriers/Intercity Couriers Inc.”, which is included when only the Difference Percentage is considered (Table 2), are no longer included because neither pair falls within the bounds of both thresholds.

Table 3. How Difference Threshold and Difference Percentage work together

| Value 1 (length) | Value 2 (length) | Levenshtein Distance | Difference percentage | Included in results if Difference Threshold set to 3 and Difference Percentage set to 50 |
|---|---|---|---|---|
| Smith (5) | Smith (5) | 0 | 0% (0/5) | Yes (if Include Exact Duplicates is checked) |
| Smith (5) | Smithe (6) | 1 | 20% (1/5) | Yes |
| Smith (5) | Smythe (6) | 2 | 40% (2/5) | Yes |
| Hanssen (7) | Jansn (5) | 3 | 60% (3/5) | No |
| Smith (5) | Brown (5) | 5 | 100% (5/5) | No |
| Intercity Couriers (18) | Intercity Couriers Inc. (23) | 5 | 27.77% (5/18) | No |
| Diamond Tire (12) | Diamond Tire & Auto (19) | 7 | 58.33% (7/12) | No |
| JW Smith (8) | John William Smith (18) | 10 | 125% (10/8) | No |
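The combined test illustrated in Table 3 can be sketched in Python as shown below. This is an approximation for illustration only, not ACL's implementation. The function name is_fuzzy_duplicate and its parameter names are hypothetical, include_exact stands in for the Include Exact Duplicates option, and the distances and lengths are copied from Table 3 rather than calculated.

```python
from typing import Optional


def is_fuzzy_duplicate(distance: int, shorter_length: int, threshold: int,
                       percentage: Optional[int] = None,
                       include_exact: bool = True) -> bool:
    """Apply both thresholds to one pair of values.

    percentage=None represents Difference Percentage being turned off
    (an assumption of this sketch).
    """
    if distance == 0:
        return include_exact  # identical values are included only on request
    if distance > threshold:
        return False          # fails the Difference Threshold
    if percentage is None:
        return True           # only the Difference Threshold applies
    return distance * 100 / shorter_length <= percentage


# (value 1, value 2, Levenshtein Distance, length of shorter value) from Table 3
TABLE_3 = [
    ("Smith", "Smithe", 1, 5),
    ("Smith", "Smythe", 2, 5),
    ("Hanssen", "Jansn", 3, 5),
    ("Smith", "Brown", 5, 5),
    ("Intercity Couriers", "Intercity Couriers Inc.", 5, 18),
    ("Diamond Tire", "Diamond Tire & Auto", 7, 12),
    ("JW Smith", "John William Smith", 10, 8),
]

# Difference Threshold 3 and Difference Percentage 50, as in Table 3
results = {
    (v1, v2): is_fuzzy_duplicate(dist, shorter, threshold=3, percentage=50)
    for v1, v2, dist, shorter in TABLE_3
}
included = [pair for pair, ok in results.items() if ok]
```

In this sketch, “Hanssen/Jansn” passes the threshold check but fails the percentage check, while “Intercity Couriers/Intercity Couriers Inc.” is rejected at the threshold check before its percentage is examined.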

Related concepts
About fuzzy duplicates
Controlling the size of fuzzy duplicate results
How fuzzy duplicates are grouped
Fuzzy duplicate helper functions
Fuzzy duplicates overview
Related tasks
Testing for fuzzy duplicates
Working with fuzzy duplicate output results


(C) 2013 ACL Services Ltd. All Rights Reserved.