DICECOEFFICIENT( ) function
Returns the Dice's coefficient of two specified strings, which is a measurement of how similar the two strings are.
Syntax
DICECOEFFICIENT(string1, string2 <,ngram>)
Parameters
Name | Type | Description |
---|---|---|
string1 | character | The first string in the comparison. |
string2 | character | The second string in the comparison. |
ngram optional |
numeric |
The n-gram length to use. Specify a whole number, 1 or greater. Increasing the ngram length makes the criterion for similarity between two strings stricter. If you do not specify a length, the default length of 2 is used. N-grams are overlapping substrings (character blocks) into which the comparison strings are divided as part of the Dice's coefficient calculation. For detailed information, see Remarks. |
Output
Numeric. The value is the Dice's coefficient of the two strings, which represents the percentage of the total number of n-grams in the two strings that are identical. The range is 0.0000 to 1.0000, inclusive.
Examples
Basic examples
How the n-gram length affects the result
The three examples below compare the same two strings. The degree of similarity returned varies depending on the specified n-gram length.
Returns 0.9167 (using the default n-gram length (2), the n-grams in the two strings are 92% identical):
DICECOEFFICIENT("125 SW 39TH ST, Suite 100","Suite 100, 125 SW 39TH ST")
Returns 1.0000 (using an n-gram length of 1, the n-grams in the two strings are 100% identical):
DICECOEFFICIENT("125 SW 39TH ST, Suite 100","Suite 100, 125 SW 39TH ST", 1)
Returns 0.8261 (using an n-gram length of 3, the n-grams in the two strings are 83% identical):
DICECOEFFICIENT("125 SW 39TH ST, Suite 100","Suite 100, 125 SW 39TH ST", 3)
Field input
Returns the Dice's coefficient of each value in the Address field when compared to the string "125 SW 39TH ST, Suite 100" (based on the default n-gram length of 2):
DICECOEFFICIENT(Address,"125 SW 39TH ST, Suite 100")
Advanced examples
Working with transposed elements
By reducing the n-gram length, and removing non-essential characters, you can optimize DICECOEFFICIENT( ) when searching for transposed elements.
Returns 0.7368 (using the default n-gram length (2), the n-grams in the two strings are 74% identical):
DICECOEFFICIENT("John Smith","Smith, John")
Returns 1.0000 (by excluding the comma between last name and first name, and by using an n-gram length of 1, the n-grams in the two strings are 100% identical):
DICECOEFFICIENT("John Smith", EXCLUDE("Smith, John", ","), 1)
Ranking values against "125 SW 39TH ST, Suite 100"
Create the computed field Dice_Co to display the Dice's coefficient between "125 SW 39TH ST, Suite 100" and each value in the Address field:
DEFINE FIELD Dice_Co COMPUTED DICECOEFFICIENT(Address,"125 SW 39TH ST, Suite 100")
Add the computed field Dice_Co to the view, and then quick sort it in descending order, to rank all values in the Address field based on their similarity to "125 SW 39TH ST, Suite 100".
Isolating fuzzy duplicates for "125 SW 39TH ST, Suite 100"
Create a filter that isolates all values in the Address field that are within a specified degree of similarity to "125 SW 39TH ST, Suite 100":
SET FILTER TO DICECOEFFICIENT(Address,"125 SW 39TH ST, Suite 100") > 0.5
Changing the number in the expression allows you to adjust the degree of similarity in the filtered values.
Remarks
When to use DICECOEFFICIENT( )
Use the DICECOEFFICIENT( ) function to find nearly identical values (fuzzy duplicates). You can also use DICECOEFFICIENT( ) to find values with identical or near-identical content, but transposed elements. For example:
- telephone numbers, or social security numbers, with transposed digits
- versions of the same address, formatted differently
How it works
DICECOEFFICIENT( ) returns the Dice's coefficient of the two evaluated strings, which is a measurement of the degree of similarity between the strings, on a scale from 0.0000 to 1.0000. The greater the returned value the more similar the two strings:
-
1.0000 – means that each string is composed of an identical set of characters, although the characters may be in a different order, and may use different case.
-
0.7500 – means the n-grams in the two strings are 75% identical.
-
0.0000 – means the two strings have no shared n-grams (explained below), or the specified length of the n-gram used in the calculation is longer than the shorter of the two strings being compared.
Usage tips
- Filtering or sorting Filtering or sorting the values in a field based on their Dice's coefficient identifies those values that are most similar to the comparison string.
- Case-sensitivity The function is not case-sensitive, so "SMITH" is equivalent to "smith."
- Leading and trailing blanks The function automatically trims leading and trailing blanks in fields, so there is no need to use the TRIM( ) or ALLTRIM( ) functions when specifying a field as a parameter.
- Removing generic
elements The OMIT( )
and EXCLUDE( ) functions can improve the effectiveness
of the DICECOEFFICIENT( ) function by removing generic
elements such as "Corporation" or "Inc.", or characters such as
commas, periods, and ampersands (&), from field values.
Removal of generic elements and punctuation focuses the DICECOEFFICIENT( ) string comparison on just the portion of the strings where a meaningful difference may occur.
How Dice's coefficient is calculated
Dice's coefficient represents the percentage of the total number of n-grams in two strings that are identical.
Dice's coefficient is calculated by first dividing the strings being compared into n-grams. N-grams (also referred to as q-grams) are overlapping substrings, or overlapping character blocks, with a length of n. You can specify the length of n using the ngram parameter, or accept the default length of 2.
Two names divided into n-grams
Here are the names "John Smith" and "Smith, John D." divided into n-grams with a length of 2, and n-grams with a length of 3. Underscores indicate spaces. Internal spaces and punctuation are counted as characters.
n-gram length |
"John Smith" n-grams |
"Smith, John D." n-grams |
---|---|---|
2 |
Jo | oh | hn | n_ | _S | Sm | mi | it | th |
Sm | mi | it | th | h, | ,_ | _J | Jo | oh | hn | n_ | _D | D. |
3 |
Joh | ohn | hn_ | n_S | _Sm | Smi | mit | ith |
Smi | mit | ith | th, | h,_ | ,_J | _Jo | Joh | ohn | hn_ | n_D | _D. |
The Dice's coefficient formula
Once the n-grams have been established for two strings being compared, the calculation is completed using the following formula:
-
2 x the number of shared n-grams / the total number of n-grams in both strings
Shared n-grams are n-grams that appear in both strings. For example, "ABC" and "BCD" share the n-gram "BC", assuming an n-gram length of 2 (AB | BC and BC | CD).
Examples of calculating the Dice's coefficient
The table below illustrates calculating the Dice's coefficient for the two strings, "John Smith" and "Smith, John D.", using different n-gram lengths.
Note that as the n-gram length increases for the same pair of strings, the Dice's coefficient value decreases, indicating less similarity. Although the strings remain the same, the criterion for similarity becomes stricter because dividing the strings into longer n-grams means that longer sequences of characters must match for an n-gram to qualify as shared.
Another way of thinking about this point is that the relative position of characters is weighted more heavily as you increase the n-gram length. By contrast, the relative position of characters is not considered when using an n-gram length of 1. Relative position refers to the position of characters in relation to one another, rather than to their absolute position within a string.
Tip
If you are specifically looking for transposition, use an n-gram length of 1.
n-gram length |
"John Smith" n-grams |
"Smith, John D." n-grams |
Shared n-grams |
Dice's coefficient |
---|---|---|---|---|
1 |
J | o | h | n | _ | S | m | i | t | h (10 n-grams) |
S | m | i | t | h | , | _ | J | o | h | n | _ | D | . (14 n-grams) |
10 |
2x10 / (10+14) = 0.8333 |
2 (default) |
Jo | oh | hn | n_ | _S | Sm | mi | it | th (9 n-grams) |
Sm | mi | it | th | h, | ,_ | _J | Jo | oh | hn | n_ | _D | D. (13 n-grams) |
8 |
2x8 / (9+13) = 0.7273 |
3 |
Joh | ohn | hn_ | n_S | _Sm | Smi | mit | ith (8 n-grams) |
Smi | mit | ith | th, | h,_ | ,_J | _Jo | Joh | ohn | hn_ | n_D | _D. (12 n-grams) |
6 |
2x6 / (8+12) = 0.6000 |
4 |
John | ohn_ | hn_S | n_Sm | _Smi | Smit | mith (7 n-grams) |
Smit | mith | ith, | th,_ | h,_J | ,_Jo | _Joh | John | ohn_ | hn_D | n_D. (11 n-grams) |
4 |
2x4 / (7+11) = 0.4444 |
DICECOEFFICIENT( ) compared to ISFUZZYDUP( ) and LEVDIST( )
One of the key differences between the DICECOEFFICIENT( ) function and the ISFUZZYDUP( ) and LEVDIST( ) functions, which use Levenshtein distance, is that DICECOEFFICIENT( ) de-emphasizes or completely ignores the relative position of characters or character blocks in the two strings being compared. Relative position is significant in the functions based on Levenshtein distance.
Comparison values with transposition
If you are comparing strings such as addresses, in which entire elements might be transposed, DICECOEFFICIENT( ) might be a better choice. For example, the same address with the "Suite" element transposed is identified as highly similar by DICECOEFFICIENT( ), but highly different by LEVDIST( ):
Address pair |
Dice's coefficient (default n-gram of 2) |
Levenshtein distance |
---|---|---|
|
0.9167 |
22 (the greater the Levenshtein distance, the more two strings differ) |
Comparison values without transposition
If transposition is not an issue, LEVDIST( ) may give more useful results. For example, the same corporation name with different punctuation is identified as highly different by DICECOEFFICIENT( ), but highly similar by LEVDIST( ):
Corporation name pair |
Dice's coefficient (default n-gram of 2) |
Levenshtein distance |
---|---|---|
|
0.3750 |
3 |