DICECOEFFICIENT( ) function

Returns the Dice’s Coefficient of two specified strings, which is a measurement of how similar the two strings are.

Syntax

DICECOEFFICIENT(string1, string2 <,ngram>)

Parameters

string1

Character. The first string in the comparison.

string2

Character. The second string in the comparison.

ngram

Optional. Numeric. Specifies the n-gram length to use. N-grams are overlapping substrings (character blocks) into which the comparison strings are divided as part of the Dice’s Coefficient calculation. Specify a whole number, 1 or greater. Increasing the ngram length makes the criterion for similarity between two strings stricter. If you do not specify a length, the default length of 2 is used.

Output

Numeric. The value is the Dice’s Coefficient of the two strings. The range is 0.0000 to 1.0000, inclusive.

Remarks

You can use this function to find nearly identical values (fuzzy duplicates). You can also use it to find values with identical or near-identical content but transposed elements. For example:

The function returns the Dice’s Coefficient of the two evaluated strings, which is a measurement of the degree of similarity between the strings, on a scale from 0.0000 to 1.0000. The greater the returned value the more similar the two strings:

Filtering or sorting the values in a field based on their Dice’s Coefficient identifies those values that are most similar to the comparison string.

The function is not case-sensitive, so “SMITH” is equivalent to “smith.” The function automatically trims leading and trailing blanks in fields, so there is no need to use the TRIM( ) or ALLTRIM( ) functions when specifying a field as a parameter.

The OMIT( ) and EXCLUDE( ) functions can improve the effectiveness of the DICECOEFFICIENT( ) function by removing generic elements such as “Corporation” or “Inc.”, or characters such as commas, periods, and ampersands (&), from field values. Removal of generic elements and punctuation focuses the DICECOEFFICIENT( ) string comparison on just the portion of the strings where a meaningful difference may occur.

How Dice’s Coefficient is calculated

Dice’s Coefficient is calculated by first dividing the strings being compared into n-grams. N-grams (also referred to as q-grams) are overlapping substrings, or overlapping character blocks, with a length of n. You can specify the length of n using the ngram parameter, or accept the default length of 2. For example, here are the names “John Smith” and “Smith, John D.” divided into n-grams with a length of 2, and n-grams with a length of 3. Underscores indicate spaces. Internal spaces and punctuation are counted as characters.

Table 1. Two names divided into n-grams

n-gram length

“John Smith” n-grams

“Smith, John D.” n-grams

2

Jo | oh | hn | n_ | _S | Sm | mi | it | th

Sm | mi | it | th | h, | ,_ | _J | Jo | oh | hn | n_ | _D | D.

3

Joh | ohn | hn_ | n_S | _Sm | Smi | mit | ith

Smi | mit | ith | th, | h,_ | ,_J | _Jo | Joh | ohn | hn_ | n_D | _D.

Once the n-grams have been established for two strings being compared, the calculation is completed using the following formula:

Shared n-grams are n-grams that appear in both strings. For example, “ABC” and “BCD” share the n-gram “BC”, assuming an n-gram length of 2 (AB | BC and BC | CD).

Table 2 illustrates calculating the Dice’s Coefficient for the two strings, “John Smith” and “Smith, John D.”, using different n-gram lengths. Note that as the n-gram length increases for the same pair of strings, the Dice’s Coefficient value decreases, indicating less similarity. Although the strings remain the same, the criterion for similarity becomes stricter because dividing the strings into longer n-grams means that longer sequences of characters must match for an n-gram to qualify as shared.

Another way of thinking about this point is that the relative position of characters is weighted more heavily as you increase the n-gram length. By contrast, the relative position of characters is not considered when using an n-gram length of 1. Relative position refers to the position of characters in relation to one another, rather than to their absolute position within a string. If you are specifically looking for transposition, use an n-gram length of 1.

Table 2. Calculating the Dice’s Coefficient for two similar strings

n-gram length

“John Smith” n-grams

“Smith, John D.” n-grams

Shared n-grams

Dice’s Coefficient

1

J | o | h | n | _ | S | m | i | t | h

(10 n-grams)

S | m | i | t | h | , | _ | J | o | h | n | _ | D | .

(14 n-grams)

10

2x10 / (10+14) = 0.8333

2

(default)

Jo | oh | hn | n_ | _S | Sm | mi | it | th

(9 n-grams)

Sm | mi | it | th | h, | ,_ | _J | Jo | oh | hn | n_ | _D | D.

(13 n-grams)

8

2x8 / (9+13) = 0.7273

3

Joh | ohn | hn_ | n_S | _Sm | Smi | mit | ith

(8 n-grams)

Smi | mit | ith | th, | h,_ | ,_J | _Jo | Joh | ohn | hn_ | n_D | _D.

(12 n-grams)

6

2x6 / (8+12) = 0.6000

4

John | ohn_ | hn_S | n_Sm | _Smi | Smit | mith

(7 n-grams)

Smit | mith | ith, | th,_ | h,_J | ,_Jo | _Joh | John | ohn_ | hn_D | n_D.

(11 n-grams)

4

2x4 / (7+11) = 0.4444

DICECOEFFICIENT( ) and LEVDIST( ) compared

One of the key differences between the DICECOEFFICIENT( ) function and the LEVDIST( ) and ISFUZZYDUP( ) functions, which use Levenshtein Distance, is that DICECOEFFICIENT( ) de-emphasizes or completely ignores the relative position of characters or character blocks in the two strings being compared, whereas relative position is significant in the functions based on Levenshtein Distance. If you are comparing strings such as addresses, in which entire elements might be transposed, DICECOEFFICIENT( ) might be a better choice. For example, the same address with the “Suite” element transposed is identified as highly similar by DICECOEFFICIENT( ), but highly different by LEVDIST( ):

Table 3. DICECOEFFICIENT( ) and LEVDIST( ) compared, transposed elements

Address pair

Dice’s Coefficient

(default n-gram of 2)

Levenshtein Distance

  • 125 SW 39TH ST, Suite 100

  • Suite 100, 125 SW 39TH ST

0.9167

22

(the greater the Levenshtein Distance, the more two strings differ)

If transposition is not an issue, LEVDIST( ) may give more useful results. For example, the same corporation name with different punctuation is identified as highly different by DICECOEFFICIENT( ), but highly similar by LEVDIST( ):

Table 4. DICECOEFFICIENT( ) and LEVDIST( ) compared, no transposition

Corporation name pair

Dice’s Coefficient

(default n-gram of 2)

Levenshtein Distance

  • AVS, Inc

  • A.V.S. Inc

0.3750

3

Examples

Example Return value

DICECOEFFICIENT("125 SW 39TH ST, Suite 100","Suite 100, 125 SW 39TH ST")

0.9167

Using the default n-gram length (2), the two strings are 92% similar.

DICECOEFFICIENT("125 SW 39TH ST, Suite 100","Suite 100, 125 SW 39TH ST", 1)

1.0000

Using an n-gram length of 1, the two strings are 100% similar.

DICECOEFFICIENT("125 SW 39TH ST, Suite 100","Suite 100, 125 SW 39TH ST", 3)

0.8261

Using an n-gram length of 3, the two strings are 83% similar.

DICECOEFFICIENT(Address,"125 SW 39TH ST, Suite 100")

The Dice’s Coefficient of each value in the “Address” field when compared to the string “125 SW 39TH ST, Suite 100” (based on the default n-gram length of 2).

DICECOEFFICIENT("John Smith","Smith, John")

0.7368

Using the default n-gram length (2), the two strings are 74% similar.

DICECOEFFICIENT("John Smith", EXCLUDE("Smith, John", ","), 1)

1.0000

By excluding the comma between last name and first name, and by using an n-gram length of 1, the two strings are 100% similar.

You can use the expression DICECOEFFICIENT(Address,"125 SW 39TH ST, Suite 100") to create a computed field, and then sort the computed field in descending order to rank all values in the “Address” field based on their similarity to “125 SW 39TH ST, Suite 100”.

To create a filter that isolates all values in the “Address” field based on degree of similarity to “125 SW 39TH ST, Suite 100”, specify:

DICECOEFFICIENT(Address,"125 SW 39TH ST, Suite 100")>.5

Changing the number in the expression allows you to adjust the degree of similarity in the filtered values.

Related reference
FUZZYDUP command
LEVDIST( ) function
ISFUZZYDUP( ) function
SOUNDSLIKE( ) function
SOUNDEX( ) function
OMIT( ) function
EXCLUDE( ) function


(C) 2015 ACL Services Ltd. All Rights Reserved.