Fuzzy join

An Analytics fuzzy join use fuzzy matching of key field values to combine two Analytics tables into a new third table. In most respects, a fuzzy join is like a regular Analytics join (see Joining tables). The main difference is that in addition to joining records based on exact matching of key field values, a fuzzy join can join records based on approximate matching.

Fuzzy joining is useful when primary and secondary keys contain the same kind of data, but in slightly different form. Or the data in the keys has slight irregularities, such as typos, that might prevent an exact match.

Example

Scenario

You want to identify any vendors who are also employees as one way of analyzing data for possible improper payments.

Approach

You join the Vendor master table with the Employee table, using the address field in each table as a common key (Vendor_Street, and Emp_Address). However, the form of the address data in the key fields varies slightly, so you use a fuzzy join instead of a regular join.

A look at some of the data

Without significant data cleansing and harmonization work, the primary and secondary key values shown below would not be joined by a regular Analytics join, even though they are very likely matching addresses.

Primary key values Secondary key values
605 3rd Avenue 605 Third Avenue
400 High St SE 400 High Street S.E.
2203 Rowan Street 2203 Rowen St

Even with data cleansing and harmonization, key values with minor differences in spelling, such as "Rowan" and "Rowen", would probably not be matched.

The key values could be joined by a fuzzy join, depending on the fuzzy join settings.

Output results

In the example of the joined table below, exact key field matches are highlighted purple, and fuzzy key field matches are highlighted green.

Fuzzy join versus fuzzy duplicates

A fuzzy join analyzes values in key fields in two tables. To test a single field in a single Analytics table for nearly identical values, see Fuzzy duplicates overview.

Output table size and command performance

Output table size

The fuzzy join is similar to the Analytics many-to-many join. All primary key values can potentially be matched to all secondary key values. The size of the output table can be many times greater than the size of either the primary or the secondary input tables.

Command performance

The fuzzy matching algorithms ensure that only key values within a specified degree of fuzziness, or exactly matching values, are actually joined. However, every possible primary-secondary match must be tested, which means that the fuzzy joining process can be time-consuming. The number of individual tests that must be performed is equal to the number of records in the primary table times the number of records in the secondary table.

Best practices

Keep output table size and command performance in mind when you prepare primary and secondary input tables, and specify the degree of fuzziness.

  • Tailor the data Ensure that only relevant records are included in the primary and secondary tables. If some records have no chance of being matched, filter them out before performing fuzzy matching.
  • Test runs For large data sets, do test runs on a small portion of the data as a more efficient way of arriving at suitable settings for the fuzzy matching algorithms. Start with more conservative fuzzy settings, and if required, progressively loosen them.

Fuzzy matching algorithms

When you perform a fuzzy join, you choose between two different fuzzy matching algorithms:

  • Dice coefficient
  • Levenshtein distance

The algorithms operate completely independently of each other, and can produce somewhat different results. One approach is to perform a fuzzy join twice, once with each algorithm, and compare results. Typically, a number of the fuzzy matches in each result set overlap, but some matches can be unique to each result set.

Degree of fuzziness

You specify the degree of fuzziness for each algorithm, which can dramatically change the size and makeup of the result set. "Degree of fuzziness" refers to how closely two values match.

Depending on the algorithm you select, you use the following settings to control the degree of fuzziness:

Algorithm Setting

Dice coefficient

  • N-gram
  • Percent

Levenshtein distance

  • Distance

Try experimenting with different degrees of fuzziness. Begin conservatively and produce smaller result sets, and then progressively loosen the settings until you start getting too many joined values that are obviously not matches (false positives).

Dice coefficient

The Dice coefficient algorithm works by measuring the degree of similarity between a primary and a secondary key value, on a scale from 0.0000 to 1.0000. The greater the Dice's coefficient of the two values, the more similar they are.

Levenshtein distance

The Levenshtein distance algorithm works by measuring the degree of difference between a primary and a secondary key value, on a whole number scale starting at 0. The scale represents the number of single-character edits required to make one value identical to the other value. The greater the Levenshtein distance between the two values, the more different they are.

Getting better results

Using Analytics functions to perform data cleansing and harmonization of the primary and secondary key fields can improve the effectiveness of the fuzzy join. For example, if you harmonize values such as "Street", "St.", and "St", or remove them altogether, you can use tighter fuzzy settings and still get the same fuzzy matches, while reducing the number of false positive matches.

Removing generic elements

You can use the OMIT( ) and EXCLUDE( ) functions to remove generic elements such as "Corporation" or "Inc.", or characters such as commas, periods, and ampersands (&), from field values.

Removal of generic elements and punctuation focuses the fuzzy join comparison of key field values on just the portion of the values where a meaningful difference may occur.

Steps

You can use fuzzy matching of key field values to combine two Analytics tables into a new third table.

Analytics 14.1 Help