How the difference settings work

Fuzzy duplicates are selected based on the degree of difference you specify, and then grouped in the output results. The degree of difference is a combination of two settings in the Fuzzy Duplicates dialog box:

  • Difference Threshold controls how much two fuzzy duplicates can differ
  • Difference Percentage controls what proportion of an individual value can be different

The two settings act as two separate thresholds. Values in the field you are testing must be within the bounds of both thresholds to be included in a group of fuzzy duplicates in the results. By adjusting the two settings you can maximize the precision and usefulness of the results.

You can turn off Difference Percentage, in which case values only need to be within the bounds of the Difference Threshold. You cannot turn off Difference Threshold.

Difference Threshold in detail

The Difference Threshold is the maximum allowable Levenshtein distance between two values for them to be identified as fuzzy duplicates.

What is Levenshtein distance?

Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to make one value identical to another. Analytics calculates the number of required edits using a standard algorithm.

Example of Levenshtein distance

The Levenshtein distance between “Smith” and “Smythe” is 2:

  • edit 1: ‘y’ must be substituted for ‘i’
  • edit 2: ‘e’ must be inserted

The greater the Levenshtein distance, the greater the difference between two values. A distance of 0 (zero) means two values are identical.
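As an illustration, the calculation can be sketched in Python using the standard dynamic-programming approach. This is a generic sketch, not the implementation Analytics uses; the function name `levenshtein` is an assumption made for this example, and the sketch compares characters exactly as given, so case differences and spaces count as differences.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    # prev[j] is the distance between the characters of a processed
    # so far (initially none) and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into an empty string
        # takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("Smith", "Smythe"))  # 2
```

The sketch keeps only two rows of the comparison grid in memory at a time, which is enough to produce the final distance.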

The table below provides examples of various Levenshtein distances. For more information about Levenshtein distance, see LEVDIST( ).

Note

The Levenshtein algorithm treats blanks or spaces between words as characters.

| Value 1 | Value 2 | Levenshtein distance | Included in results if Difference Threshold set to 3 |
| --- | --- | --- | --- |
| Smith | Smith | 0 | Yes (if Include Exact Duplicates is checked) |
| Smith | Smithe | 1 | Yes |
| Smith | Smythe | 2 | Yes |
| Hanssen | Jansn | 3 | Yes |
| Smith | Brown | 5 | No |
| Intercity Couriers | Intercity Couriers Inc. | 5 | No |
| Diamond Tire | Diamond Tire & Auto | 7 | No |
| JW Smith | John William Smith | 10 | No |

Changing the Difference Threshold

Increasing the Difference Threshold increases the maximum allowable Levenshtein distance, which increases the size of the results by including values that are more different from one another. You can specify a Difference Threshold from 1 to 10.

The upper limit is imposed because increasing the maximum Levenshtein distance beyond a certain point creates a very large set of results that contains primarily false positives.

The lower limit is imposed because entering 0 (zero) would include only exact duplicates. If you are interested in finding only exact duplicates, use the duplicates feature instead.

Difference Percentage in detail

The Difference Percentage is the maximum allowable percentage of the shorter of two compared values that can be different for the two values to be identified as fuzzy duplicates.

How is the difference percentage calculated?

Using the Levenshtein distance between each pair of values it compares in the test field, Analytics performs the following internal calculation:

Levenshtein distance / number of characters in the shorter value × 100 = difference percentage

Example of difference percentage

The Levenshtein distance between “Smith” and “Smythe” is 2, and the shorter of the two values is 5 characters long, producing a difference percentage of 40 (2/5 × 100).

If the difference percentage is less than or equal to the specified Difference Percentage, the two values are eligible to be included in the results, assuming they are also within the maximum allowable Levenshtein distance of each other (the Difference Threshold).
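The calculation above can be sketched in Python as follows. The function names `levenshtein` and `difference_percentage` are assumptions made for this illustration and are not Analytics functions. Raising an error when a value is empty is also an assumption, because the formula has no defined result when the shorter value has zero characters.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def difference_percentage(a: str, b: str) -> float:
    """Levenshtein distance / length of the shorter value x 100."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        # Assumption: an empty value is rejected rather than compared.
        raise ValueError("cannot compute a percentage for an empty value")
    return 100 * levenshtein(a, b) / shorter


print(difference_percentage("Smith", "Smythe"))  # 40.0
```

Because the divisor is the length of the shorter value, the result can exceed 100 when the two values differ greatly in length.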

The table below provides examples of various difference percentages.

| Value 1 (length) | Value 2 (length) | Levenshtein distance, and difference percentage | Included in results if Difference Percentage set to 50 |
| --- | --- | --- | --- |
| Smith (5) | Smith (5) | 0, 0% (0/5) | Yes (if Include Exact Duplicates is checked) |
| Smith (5) | Smithe (6) | 1, 20% (1/5) | Yes |
| Smith (5) | Smythe (6) | 2, 40% (2/5) | Yes |
| Hanssen (7) | Jansn (5) | 3, 60% (3/5) | No |
| Smith (5) | Brown (5) | 5, 100% (5/5) | No |
| Intercity Couriers (18) | Intercity Couriers Inc. (23) | 5, 27.77% (5/18) | Yes |
| Diamond Tire (12) | Diamond Tire & Auto (19) | 7, 58.33% (7/12) | No |
| JW Smith (8) | John William Smith (18) | 10, 125% (10/8) | No |

Changing the Difference Percentage

Increasing the Difference Percentage increases the size of the results by including values that contain a greater percentage of difference. You can specify a Difference Percentage from 1 to 99.

The upper limit is imposed because allowing difference percentages of 100 or greater could include pairs of values that are completely different from each other in the same fuzzy duplicates group in the results. For example, “ABC” and “XYZ” have a Levenshtein distance of 3, and a shorter value length of 3, producing a difference percentage of 100.

The lower limit is imposed because entering 0 (zero) would include only exact duplicates. If you are interested in finding only exact duplicates, use the duplicates feature instead.

Turning off Difference Percentage

You can optionally turn off Difference Percentage. If you do, the results do not take into account the percentage of a value that is different, and values only need to be within the bounds of the Difference Threshold. You may capture some additional valid fuzzy duplicates, such as “JW Smith” and “John William Smith” (a Levenshtein distance of 10, which requires a Difference Threshold of 10). However, fuzzy duplicate groups could also include values that are completely different from each other, such as “Smith” and “Brown”. The results will also be larger than when you use Difference Percentage with any setting.
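The effect of turning off Difference Percentage can be sketched in Python as follows. This is an illustration only, not how Analytics is implemented: the function `is_fuzzy_duplicate` and its parameters are hypothetical, passing `percentage=None` stands in for turning the setting off, and a distance of 0 is accepted as if Include Exact Duplicates were checked. The range checks mirror the documented limits of 1 to 10 and 1 to 99.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_fuzzy_duplicate(a: str, b: str, threshold: int,
                       percentage: Optional[int] = None) -> bool:
    """Hypothetical pairwise check; percentage=None means 'turned off'."""
    if not 1 <= threshold <= 10:
        raise ValueError("Difference Threshold must be from 1 to 10")
    if percentage is not None and not 1 <= percentage <= 99:
        raise ValueError("Difference Percentage must be from 1 to 99")
    distance = levenshtein(a, b)
    if distance > threshold:
        return False
    if percentage is None:
        return True  # only the Difference Threshold applies
    shorter = min(len(a), len(b))
    # Assumption: a pair with an empty value never passes the percentage test.
    return shorter > 0 and 100 * distance / shorter <= percentage


# Percentage turned off: a valid pair and a false positive are both accepted.
print(is_fuzzy_duplicate("JW Smith", "John William Smith", threshold=10))  # True
print(is_fuzzy_duplicate("Smith", "Brown", threshold=10))                  # True

# Percentage at its maximum of 99: both pairs are rejected (125% and 100%).
print(is_fuzzy_duplicate("JW Smith", "John William Smith", 10, 99))        # False
print(is_fuzzy_duplicate("Smith", "Brown", 10, 99))                        # False
```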

How Difference Threshold and Difference Percentage work together

The table below shows how Difference Threshold and Difference Percentage work together. The compared values from the “Difference Threshold in detail” and “Difference Percentage in detail” sections must now be within the bounds of both thresholds to be included in the results.

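“Hanssen/Jansn” and “Intercity Couriers/Intercity Couriers Inc.” are each included when only one of the two settings is considered. “Hanssen/Jansn” is within the Difference Threshold (a distance of 3) but exceeds the Difference Percentage (60%), and “Intercity Couriers/Intercity Couriers Inc.” is within the Difference Percentage (27.77%) but exceeds the Difference Threshold (a distance of 5). Neither pair is included when the two settings are considered together, because neither falls within the bounds of both thresholds.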

| Value 1 (length) | Value 2 (length) | Levenshtein distance, and difference percentage | Included in results if Difference Threshold set to 3 and Difference Percentage set to 50 |
| --- | --- | --- | --- |
| Smith (5) | Smith (5) | 0, 0% (0/5) | Yes (if Include Exact Duplicates is checked) |
| Smith (5) | Smithe (6) | 1, 20% (1/5) | Yes |
| Smith (5) | Smythe (6) | 2, 40% (2/5) | Yes |
| Hanssen (7) | Jansn (5) | 3, 60% (3/5) | No |
| Smith (5) | Brown (5) | 5, 100% (5/5) | No |
| Intercity Couriers (18) | Intercity Couriers Inc. (23) | 5, 27.77% (5/18) | No |
| Diamond Tire (12) | Diamond Tire & Auto (19) | 7, 58.33% (7/12) | No |
| JW Smith (8) | John William Smith (18) | 10, 125% (10/8) | No |
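The combined test can be sketched in Python by checking each pair against both settings and reporting each verdict separately. The names `THRESHOLD`, `PERCENTAGE`, `PAIRS`, and `classify` are assumptions made for this illustration, the code is not the Analytics implementation, and exact duplicates are treated as if Include Exact Duplicates were checked.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


THRESHOLD = 3    # Difference Threshold
PERCENTAGE = 50  # Difference Percentage

PAIRS = [
    ("Smith", "Smith"),
    ("Smith", "Smithe"),
    ("Smith", "Smythe"),
    ("Hanssen", "Jansn"),
    ("Smith", "Brown"),
    ("Intercity Couriers", "Intercity Couriers Inc."),
    ("Diamond Tire", "Diamond Tire & Auto"),
    ("JW Smith", "John William Smith"),
]


def classify(a: str, b: str):
    """Return (distance, percentage, within threshold, within percentage)."""
    distance = levenshtein(a, b)
    percent = 100 * distance / min(len(a), len(b))
    return distance, percent, distance <= THRESHOLD, percent <= PERCENTAGE


for a, b in PAIRS:
    distance, percent, ok_threshold, ok_percent = classify(a, b)
    # A pair is included only if it passes both tests.
    included = ok_threshold and ok_percent
    print(f"{a} / {b}: distance={distance}, percentage={percent:.2f}, "
          f"threshold ok={ok_threshold}, percentage ok={ok_percent}, "
          f"included={included}")
```

Reporting the two verdicts separately makes it easy to see which setting excludes a given pair: “Hanssen/Jansn” fails only the percentage test, and “Intercity Couriers/Intercity Couriers Inc.” fails only the threshold test.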