Fuzzy join

Concept Information

FUZZYJOIN command

An Analytics fuzzy join use fuzzy matching of key field values to combine two Analytics tables into a new third table. In most respects, a fuzzy join is like a regular Analytics join (see Joining tables). The main difference is that in addition to joining records based on exact matching of key field values, a fuzzy join can join records based on approximate matching.

Fuzzy joining is useful when primary and secondary keys contain the same kind of data, but in slightly different form. Or the data in the keys has slight irregularities, such as typos, that might prevent an exact match.

Example

Scenario

You want to identify any vendors who are also employees as one way of analyzing data for possible improper payments.

Approach

You join the Vendor master table with the Employee table, using the address field in each table as a common key (Vendor_Street, and Emp_Address). However, the form of the address data in the key fields varies slightly, so you use a fuzzy join instead of a regular join.

A look at some of the data

Without significant data cleansing and harmonization work, the primary and secondary key values shown below would not be joined by a regular Analytics join, even though they are very likely matching addresses.

Primary key values Secondary key values
605 3rd Avenue 605 Third Avenue
400 High St SE 400 High Street S.E.
2203 Rowan Street 2203 Rowen St

Even with data cleansing and harmonization, key values with minor differences in spelling, such as "Rowan" and "Rowen", would probably not be matched.

The key values could be joined by a fuzzy join, depending on the fuzzy join settings.

Output results

In the example of the joined table below, exact key field matches are highlighted purple, and fuzzy key field matches are highlighted green.

Fuzzy join versus fuzzy duplicates

A fuzzy join analyzes values in key fields in two tables. To test a single field in a single Analytics table for nearly identical values, see Fuzzy duplicates analysis.

Improving the effectiveness of fuzzy joining

You can significantly improve the effectiveness of fuzzy joining by incorporating one or more of the following techniques:

  • sorting individual elements in primary and secondary key field values
  • removing generic elements from primary and secondary key field values
  • harmonizing primary and secondary key field values

These techniques allow you to use tighter fuzzy settings and still get the same fuzzy matches, while reducing the number of false positive matches. You can use the techniques separately, or in combination.

Create an expression or a computed field

To use any of the techniques, you need to create an expression or a computed field using the appropriate Analytics function and one or both key fields.

For more information about expressions, see Using expressions.

For more information about computed fields, see Computed fields.

Note

The Fuzzy Join dialog box does not allow the creation of an expression on a secondary key field. However, you can manually create a secondary key field expression in the Analytics command line or in a script. Another option is to create a computed field for use as the secondary key field.

Sorting individual elements in key field values

The SORTWORDS( ) function can improve the effectiveness of fuzzy joining by sorting individual elements in primary and secondary key field values into a sequential order.

Sorting elements, such as the components of an address, can make key field values with the same information, but a different format, more closely resemble each other. A closer resemblance improves the chances that key field values are selected as fuzzy matches for each other.

For more information, see SORTWORDS( ) function.

For a video providing an overview of SORTWORDS( ), see Fuzzy Matching Using SORTWORDS() (English only).

Note

Sorting elements in key field values is best suited for fuzzy joining using the Levenshtein distance algorithm.

Sorting elements when fuzzy joining using the Dice coefficient algorithm may or may not be beneficial. Test a set of sample data before deciding whether to use SORTWORDS( ) in conjunction with the Dice coefficient algorithm in a production setting.

Caution

If you use SORTWORDS( ) in conjunction with fuzzy joining you must apply SORTWORDS( ) to both strings or both fields being compared.

Removing generic elements from key field values

The OMIT( ) function can improve the effectiveness of fuzzy joining by removing generic elements such as "Corporation" or "Inc.", or characters such as commas, periods, and ampersands (&), from primary and secondary key field values.

Removal of generic elements and punctuation focuses fuzzy matching on just the portion of the key field values where a meaningful difference may occur.

For more information, see OMIT( ) function.

Harmonizing key field values

The REPLACE( ) or REGEXREPLACE( ) functions can improve the effectiveness of fuzzy joining by harmonizing variant forms of the same element in primary and secondary key field values. For example, you could harmonize "Street", "St.", and "St" to use the single value "St".

Harmonizing elements can make key field values with the same information, but a different format, more closely resemble each other. A closer resemblance improves the chances that key field values are selected as fuzzy matches for each other.

For more information, see REPLACE( ) function for straightforward replacements, and REGEXREPLACE( ) function for more complex replacements.

Output table size and command performance

Output table size

The fuzzy join is similar to the Analytics many-to-many join. All primary key values can potentially be matched to all secondary key values. The size of the output table can be many times greater than the size of either the primary or the secondary input tables.

Command performance

The fuzzy matching algorithms ensure that only key values within a specified degree of fuzziness, or exactly matching values, are actually joined. However, every possible primary-secondary match must be tested, which means that the fuzzy joining process can be time-consuming. The number of individual tests that must be performed is equal to the number of records in the primary table times the number of records in the secondary table.

Limit matching to the first secondary match

You can significantly reduce processing time, and reduce the size of the output results, by selecting Join only the first occurrence of secondary key matches. Enabling this option specifies that each primary key value is joined to only the first occurrence of any matching secondary key values.

Enabling the option is appropriate in either of these situations:

  • Any matches? you only want to know if any matches, exact or fuzzy, exist between two tables, and you want to avoid the processing time required to identify all matches
  • At most one match you are certain that at most only one match exists in the secondary table for each primary key value

Enabling the option is not appropriate if you need output results that contain all possible joins between primary and secondary key values.

Note

If you select Join only the first occurrence of secondary key matches and the first occurrence of a match happens to be an exact match, any subsequent fuzzy matches for the primary key value are not included in the joined output table.

Best practices

Keep output table size and command performance in mind when you prepare primary and secondary input tables, and specify the degree of fuzziness.

  • Tailor the data Ensure that only relevant records are included in the primary and secondary tables. If some records have no chance of being matched, filter them out before performing fuzzy matching.
  • Test runs For large data sets, do test runs on a small portion of the data as a more efficient way of arriving at suitable settings for the fuzzy matching algorithms. Start with more conservative fuzzy settings, and if required, progressively loosen them.

Fuzzy matching algorithms

When you perform a fuzzy join, you choose between two different fuzzy matching algorithms:

  • Dice coefficient
  • Levenshtein distance

The algorithms operate completely independently of each other, and can produce somewhat different results. One approach is to perform a fuzzy join twice, once with each algorithm, and compare results. Typically, a number of the fuzzy matches in each result set overlap, but some matches can be unique to each result set.

Degree of fuzziness

You specify the degree of fuzziness for each algorithm, which can dramatically change the size and makeup of the result set. "Degree of fuzziness" refers to how closely two values match.

Depending on the algorithm you select, you use the following settings to control the degree of fuzziness:

Algorithm Setting

Dice coefficient

  • N-gram
  • Percent

Levenshtein distance

  • Distance

Try experimenting with different degrees of fuzziness. Begin conservatively and produce smaller result sets, and then progressively loosen the settings until you start getting too many joined values that are obviously not matches (false positives).

Dice coefficient

The Dice coefficient algorithm works by measuring the degree of similarity between a primary and a secondary key value, on a scale from 0.0000 to 1.0000. The greater the Dice's coefficient of the two values, the more similar they are.

Levenshtein distance

The Levenshtein distance algorithm works by measuring the degree of difference between a primary and a secondary key value, on a whole number scale starting at 0. The scale represents the number of single-character edits required to make one value identical to the other value. The greater the Levenshtein distance between the two values, the more different they are.

Steps

You can use fuzzy matching of key field values to combine two Analytics tables into a new third table.