Fuzzy join

Concept Information

An Analytics fuzzy join use fuzzy matching of key field values to combine two Analytics tables into a new third table. In most respects, a fuzzy join is like a regular Analytics join (see Joining tables). The main difference is that in addition to joining records based on exact matching of key field values, a fuzzy join can join records based on approximate matching.

Fuzzy joining is useful when primary and secondary keys contain the same kind of data, but in slightly different form. Or the data in the keys has slight irregularities, such as typos, that might prevent an exact match.

Example

Scenario

You want to identify any vendors who are also employees as one way of analyzing data for possible improper payments.

Approach

You join the Vendor master table with the Employee table, using the address field in each table as a common key (Vendor_Street, and Emp_Address). However, the form of the address data in the key fields varies slightly, so you use a fuzzy join instead of a regular join.

A look at some of the data

Without significant data cleansing and harmonization work, the primary and secondary key values shown below would not be joined by a regular Analytics join, even though they are very likely matching addresses.

Primary key values	Secondary key values
605 3rd Avenue	605 Third Avenue
400 High St SE	400 High Street S.E.
2203 Rowan Street	2203 Rowen St

Even with data cleansing and harmonization, key values with minor differences in spelling, such as "Rowan" and "Rowen", would probably not be matched.

The key values could be joined by a fuzzy join, depending on the fuzzy join settings.

Output results

In the example of the joined table below, exact key field matches are highlighted purple, and fuzzy key field matches are highlighted green.

Fuzzy join versus fuzzy duplicates

A fuzzy join analyzes values in key fields in two tables. To test a single field in a single Analytics table for nearly identical values, see Fuzzy duplicates analysis.

Improving the effectiveness of fuzzy joining

You can significantly improve the effectiveness of fuzzy joining by incorporating one or more of the following techniques:

sorting individual elements in primary and secondary key field values
removing generic elements from primary and secondary key field values
harmonizing primary and secondary key field values

These techniques allow you to use tighter fuzzy settings and still get the same fuzzy matches, while reducing the number of false positive matches. You can use the techniques separately, or in combination.

Create an expression or a computed field

To use any of the techniques, you need to create an expression or a computed field using the appropriate Analytics function and one or both key fields.

For more information about expressions, see Using expressions.

For more information about computed fields, see Computed fields.

Note

The Fuzzy Join dialog box does not allow the creation of an expression on a secondary key field. However, you can manually create a secondary key field expression in the Analytics command line or in a script. Another option is to create a computed field for use as the secondary key field.

Sorting individual elements in key field values

The SORTWORDS( ) function can improve the effectiveness of fuzzy joining by sorting individual elements in primary and secondary key field values into a sequential order.

Sorting elements, such as the components of an address, can make key field values with the same information, but a different format, more closely resemble each other. A closer resemblance improves the chances that key field values are selected as fuzzy matches for each other.

For more information, see SORTWORDS( ) function.

For a video providing an overview of SORTWORDS( ), see Fuzzy Matching Using SORTWORDS() (English only).

Note

Sorting elements in key field values is best suited for fuzzy joining using the Levenshtein distance algorithm.

Sorting elements when fuzzy joining using the Dice coefficient algorithm may or may not be beneficial. Test a set of sample data before deciding whether to use SORTWORDS( ) in conjunction with the Dice coefficient algorithm in a production setting.

Caution

If you use SORTWORDS( ) in conjunction with fuzzy joining you must apply SORTWORDS( ) to both strings or both fields being compared.

Removing generic elements from key field values

The OMIT( ) function can improve the effectiveness of fuzzy joining by removing generic elements such as "Corporation" or "Inc.", or characters such as commas, periods, and ampersands (&), from primary and secondary key field values.

Removal of generic elements and punctuation focuses fuzzy matching on just the portion of the key field values where a meaningful difference may occur.

For more information, see OMIT( ) function.

Harmonizing key field values

The REPLACE( ) or REGEXREPLACE( ) functions can improve the effectiveness of fuzzy joining by harmonizing variant forms of the same element in primary and secondary key field values. For example, you could harmonize "Street", "St.", and "St" to use the single value "St".

Harmonizing elements can make key field values with the same information, but a different format, more closely resemble each other. A closer resemblance improves the chances that key field values are selected as fuzzy matches for each other.

For more information, see REPLACE( ) function for straightforward replacements, and REGEXREPLACE( ) function for more complex replacements.

Output table size and command performance

Output table size

The fuzzy join is similar to the Analytics many-to-many join. All primary key values can potentially be matched to all secondary key values. The size of the output table can be many times greater than the size of either the primary or the secondary input tables.

Command performance

The fuzzy matching algorithms ensure that only key values within a specified degree of fuzziness, or exactly matching values, are actually joined. However, every possible primary-secondary match must be tested, which means that the fuzzy joining process can be time-consuming. The number of individual tests that must be performed is equal to the number of records in the primary table times the number of records in the secondary table.

Limit matching to the first secondary match

You can significantly reduce processing time, and reduce the size of the output results, by selecting Join only the first occurrence of secondary key matches. Enabling this option specifies that each primary key value is joined to only the first occurrence of any matching secondary key values.

Enabling the option is appropriate in either of these situations:

Any matches? you only want to know if any matches, exact or fuzzy, exist between two tables, and you want to avoid the processing time required to identify all matches
At most one match you are certain that at most only one match exists in the secondary table for each primary key value

Enabling the option is not appropriate if you need output results that contain all possible joins between primary and secondary key values.

Note

If you select Join only the first occurrence of secondary key matches and the first occurrence of a match happens to be an exact match, any subsequent fuzzy matches for the primary key value are not included in the joined output table.

Best practices

Keep output table size and command performance in mind when you prepare primary and secondary input tables, and specify the degree of fuzziness.

Tailor the data Ensure that only relevant records are included in the primary and secondary tables. If some records have no chance of being matched, filter them out before performing fuzzy matching.
Test runs For large data sets, do test runs on a small portion of the data as a more efficient way of arriving at suitable settings for the fuzzy matching algorithms. Start with more conservative fuzzy settings, and if required, progressively loosen them.

Fuzzy matching algorithms

When you perform a fuzzy join, you choose between two different fuzzy matching algorithms:

Dice coefficient
Levenshtein distance

The algorithms operate completely independently of each other, and can produce somewhat different results. One approach is to perform a fuzzy join twice, once with each algorithm, and compare results. Typically, a number of the fuzzy matches in each result set overlap, but some matches can be unique to each result set.

Degree of fuzziness

You specify the degree of fuzziness for each algorithm, which can dramatically change the size and makeup of the result set. "Degree of fuzziness" refers to how closely two values match.

Depending on the algorithm you select, you use the following settings to control the degree of fuzziness:

Algorithm	Setting
Dice coefficient	N-gram Percent
Levenshtein distance	Distance

Try experimenting with different degrees of fuzziness. Begin conservatively and produce smaller result sets, and then progressively loosen the settings until you start getting too many joined values that are obviously not matches (false positives).

Dice coefficient

The Dice coefficient algorithm works by measuring the degree of similarity between a primary and a secondary key value, on a scale from 0.0000 to 1.0000. The greater the Dice's coefficient of the two values, the more similar they are.

Show me more

Dice's coefficient	Meaning
1.0000	Each value is composed of an identical set of characters, although the characters may be in a different order, and may use different case. The n-grams in the two values are 100% identical. N-grams are explained below.
0.7500	The n-grams in the two values are 75% identical.
0.0000	The two values have no identical n-grams, or the length specified in the N-gram setting is longer than the shorter of the two values being compared.

Dice's coefficient

Meaning

1.0000

Each value is composed of an identical set of characters, although the characters may be in a different order, and may use different case.

The n-grams in the two values are 100% identical.

N-grams are explained below.

0.7500

The n-grams in the two values are 75% identical.

0.0000

The two values have no identical n-grams, or the length specified in the N-gram setting is longer than the shorter of the two values being compared.

N-grams

Dice's coefficient is calculated by first dividing the values being compared into n-grams. N-grams are overlapping blocks of characters, with a length of n, which is whatever length you specify in the N-gram setting.

Here are two of the values from the example above, divided into n-grams with a length of 2 characters (n = 2).

2203 Rowan Street	22 \| 20 \| 03 \| 3_ \| _R \| Ro \| ow \| wa \| an \| n_ \| _S \| St \| tr \| re \| ee \| et
2203 Rowen St	22 \| 20 \| 03 \| 3_ \| _R \| Ro \| ow \| we \| en \| n_ \| _S \| St

The Dice's coefficient is the percentage of n-grams in the two values that are identical. In this case, 20 of 28 n-grams are identical, which is 71.43%, or 0.7143 expressed as a decimal fraction.

Note

Increasing the length in the N-gram setting makes the criterion for similarity between two values stricter.

Percent

When you specify a Percent setting, you are specifying the minimum allowable Dice's coefficient of two values for them to qualify as a fuzzy match. For example, if you specify 0.7500 , at least 75% of the n-grams in two values must be identical to create a match.

Percent setting	Meaning	2203 Rowan Street / 2203 Rowen St
0.7500	To qualify as a fuzzy match, at least 75% of the n-grams in two values must be identical.	Not matched, not included in the joined table (Dice's coefficient = 0.7143)
0.7000	To qualify as a fuzzy match, at least 70% of the n-grams in two values must be identical.	Matched, included in the joined table (Dice's coefficient = 0.7143)

Percent setting

Meaning

2203 Rowan Street / 2203 Rowen St

0.7500

To qualify as a fuzzy match, at least 75% of the n-grams in two values must be identical.

Not matched, not included in the joined table

(Dice's coefficient = 0.7143)

0.7000

To qualify as a fuzzy match, at least 70% of the n-grams in two values must be identical.

Matched, included in the joined table

(Dice's coefficient = 0.7143)

For detailed information about how Dice's coefficient works, see DICECOEFFICIENT( ) function.

Levenshtein distance

The Levenshtein distance algorithm works by measuring the degree of difference between a primary and a secondary key value, on a whole number scale starting at 0. The scale represents the number of single-character edits required to make one value identical to the other value. The greater the Levenshtein distance between the two values, the more different they are.

Show me more

Levenshtein distance	Meaning
0	Each value is composed of an identical set of characters, in an identical order. Case may differ.
2	Two single-character edits are required to make the two values identical. For example: "Smith" and "Smythe" edit 1 substitute 'y' for 'i' edit 2 insert 'e'
3	Three single-character edits are required to make the two values identical. For example: "Hanssen" and "Jansn" edit 1 substitute 'J' for 'H' edit 2 delete 's' edit 3 delete 'e'

Levenshtein distance

Meaning

Each value is composed of an identical set of characters, in an identical order. Case may differ.

Two single-character edits are required to make the two values identical.

For example: "Smith" and "Smythe"

edit 1 substitute 'y' for 'i'
edit 2 insert 'e'

Three single-character edits are required to make the two values identical.

For example: "Hanssen" and "Jansn"

edit 1 substitute 'J' for 'H'
edit 2 delete 's'
edit 3 delete 'e'

Distance

When you specify a Distance setting, you are specifying the maximum allowable Levenshtein distance between two values for them to qualify as a fuzzy match. For example, if you specify 2 , no more than two edits can be required to make two values identical.

Distance setting	Meaning	Hanssen / Jansn
2	To qualify as a fuzzy match, no more than 2 character edits can be required to make two values identical.	Not matched, not included in the joined table (Levenshtein distance = 3)
3	To qualify as a fuzzy match, no more than 3 character edits can be required to make two values identical.	Matched, included in the joined table (Levenshtein distance = 3)

Distance setting

Meaning

Hanssen / Jansn

To qualify as a fuzzy match, no more than 2 character edits can be required to make two values identical.

Not matched, not included in the joined table

(Levenshtein distance = 3)

To qualify as a fuzzy match, no more than 3 character edits can be required to make two values identical.

Matched, included in the joined table

(Levenshtein distance = 3)

For detailed information about how Levenshtein distance works, see LEVDIST( ) function. Unlike the function, the Levenshtein distance algorithm used in the fuzzy join automatically trims leading and trailing blanks, and is not case-sensitive.

Steps

You can use fuzzy matching of key field values to combine two Analytics tables into a new third table.

Show me how

In the Navigator, open the primary table, and right-click the secondary table and select Open as Secondary.
The primary and secondary table icons update with the numbers 1 and 2 to indicate their relation to each other .
Select Data > Fuzzy Join.
On the Main tab select the fuzzy matching algorithm you want to use:
- Dice coefficient
- Levenshtein
Depending on the algorithm you selected, provide settings to control the degree of fuzziness.
Dice coefficient
- N-gram
- Percent
Levenshtein
- Distance
The settings are explained below.
(Optional) Select Join only the first occurrence of secondary key matches to specify that each primary key value is joined to only the first occurrence of any matching secondary key values.
Select the primary key field from the Primary Keys list.
You can select only one primary key field, and it must be a character field.
Select the secondary key field from the Secondary Keys list.
You can select only one secondary key field, and it must be a character field.
Select the fields to include in the joined table from the Primary Fields and Secondary Fields lists.
Note
You must explicitly select the primary and secondary key fields if you want to include them in the joined table.
Tip
You can Ctrl+click to select multiple non-adjacent fields, and Shift+click to select multiple adjacent fields.
In the To text box, specify the name of the new, joined table.
(Optional) On the More tab:
1. If you want to process only a subset of records, select one of the options in the Scope panel.
2. If you want to append (add) the output results to the end of an existing Analytics table, select Append To Existing File.
Click OK.
The new, joined table is output.

Fuzzy Join dialog box options

The tables below provide detailed information about the options in the Fuzzy Join dialog box.

Main tab

Options – Fuzzy Join dialog box	Description
Dice coefficient	Use Dice's coefficient for fuzzy matching between primary and secondary key values. N-gram the n-gram length to use Specify a whole number, 1 or greater. Increasing the n-gram length makes the criterion for similarity between two values stricter. Percent the minimum allowable Dice's coefficient of two values for them to qualify as a fuzzy match Specify a decimal fraction, from 0.0000 to 1.0000 (for example, 0.7500). Use a maximum of four decimal places. Decreasing the value increases the number of matches by including matches with a greater degree of fuzziness – that is, values that are more different from each other.
Levenshtein	Use Levenshtein distance for fuzzy matching between primary and secondary key values. Distance the maximum allowable Levenshtein distance between two values for them to qualify as a fuzzy match Specify a whole number, 1 or greater. Increasing the value increases the number of matches by including matches with a greater degree of fuzziness – that is, values that are more different from each other.
Join only the first occurrence of secondary key matches	Specifies that each primary key value is joined to only the first occurrence of any secondary key matches. If you leave the option unchecked, the default behavior is to join each primary key value to all occurrences of any secondary key matches.
Secondary Table	An alternate method for selecting the secondary table.
Primary Keys Secondary Keys	Specifies the common key field to use to join the two tables. You can select the common key field directly in the Primary Keys and Secondary Keys lists. You can also click Primary Keys or Secondary Keys to open the Selected Fields dialog box where you can select the common key field, or create an expression on the primary key. Key field guidelines: Data type The key fields must be the character data type. Length If key fields are not the same length, they are automatically harmonized. Names and start positions Key field names and start positions can be different, but they must describe the same data element. Multiple key fields Only one key field per table is supported.
Primary Fields Secondary Fields	Specifies the fields to include in the joined table. You can select fields directly in the Primary Fields and Secondary Fields lists. You can also click Primary Fields or Secondary Fields to open the Selected Fields dialog box where you can select the fields, or create an expression on one or more primary fields. The order in which you select primary and secondary fields dictates the field order in the resulting joined table. As a group, the primary fields appear before the secondary fields in the joined table.
Use Output Table	Specifies whether the Analytics table containing the output results opens automatically upon completion of the operation.
If	(Optional) Allows you to create a condition to exclude records from processing. You can enter a condition in the If text box, or click If to create an IF statement using the Expression Builder. The condition can reference either the primary or the secondary table, or both. Note To access the secondary table fields in the Expression Builder, select the secondary table in the From Table drop-down list. The If condition is evaluated against only the records remaining in a table after any scope options have been applied (First, Next, While).
To	Specifies the name and location of the output table. To save the output table to the Analytics project folder enter only the table name. To save the output table in a location other than the project folder specify an absolute or relative file path, or click To and navigate to a different folder. For example: C:\Results\Output.fil or Results\Output.fil. Regardless of where you save the output table, it is added to the open project if it is not already in the project. If Analytics prefills a table name, you can accept the prefilled name, or change it.

More tab

Options – Fuzzy Join dialog box	Description
Scope panel	Specifies which records in the primary table are processed: All (default) all records in the primary table are processed. First select this option and enter a number in the text box to start processing at the first record in the primary table and include only the specified number of records. Next select this option and enter a number in the text box to start processing at the currently selected record in the primary table view and include only the specified number of records. The actual record number in the leftmost column must be selected, not data in the row. While select this option to use a WHILE statement to limit the processing of records in the primary table based on criteria. You can enter a condition in the While text box, or click While to create a WHILE statement using the Expression Builder. A WHILE statement allows records to be processed only while the specified condition evaluates to true. You can use the While option in conjunction with the All, First, or Next options. Record processing stops as soon as one limit is reached. Note The number of records specified in the First or Next options references either the physical or the indexed order of records in a table, and disregards any filtering or quick sorting applied to the view. However, results of analytical operations respect any filtering. If a view is quick sorted, Next behaves like First.
Append To Existing File	Specifies that the output results are appended (added) to the end of an existing Analytics table. Note Leaving Append To Existing File deselected is recommended if you are uncertain whether the output results and the existing table have an identical data structure. For more information about appending and data structure, see Appending output results to an existing table.
OK	Executes the operation. If you are joining character key fields of different lengths, a message appears stating that Analytics will attempt to harmonize the fields. If the overwrite prompt appears, select the appropriate option. If you are expecting the Append option to appear and it does not, click No to cancel the operation and see Appending output results to an existing table.

Fuzzy join

Example

Scenario

Approach

A look at some of the data

Output results

Fuzzy join versus fuzzy duplicates

Improving the effectiveness of fuzzy joining

Create an expression or a computed field

Sorting individual elements in key field values

Removing generic elements from key field values

Harmonizing key field values

Output table size and command performance

Output table size

Command performance

Limit matching to the first secondary match

Best practices

Fuzzy matching algorithms

Degree of fuzziness

Dice coefficient

Levenshtein distance

Steps

Page options

Is this page helpful?

Is this page helpful?