FUZZYJOIN command

Concept Information

Fuzzy join

Uses fuzzy matching to combine fields from two Analytics tables into a new, single Analytics table.

Note

To detect nearly identical values in a single character field (fuzzy duplicates), see FUZZYDUP command.

To join tables using exactly matching key field values, see JOIN command.

Syntax

FUZZYJOIN {DICE PERCENT percentage NGRAM n-gram_length|LEVDISTANCE DISTANCE value} PKEY primary_key_field SKEY secondary_key_field {FIELDS primary_fields|FIELDS ALL <EXCLUDE primary_fields <...n>>} <WITH secondary_fields|WITH ALL <EXCLUDE secondary_fields <...n>>> <IF test> <OPEN> TO table_name <FIRSTMATCH> <WHILE test> <FIRST range|NEXT range> <APPEND>

Note

You cannot run the FUZZYJOIN command locally against a server table.

You must specify the FUZZYJOIN command name in full. You cannot abbreviate it.

Parameters

Name Description
DICE PERCENT percentage NGRAM n-gram_length | LEVDISTANCE DISTANCE value

The fuzzy matching algorithm to use.

DICE use the Dice coefficient algorithm

  • PERCENT percentage the minimum allowable Dice's coefficient of two strings for them to qualify as a fuzzy match

    Specify a decimal fraction, from 0.0000 to 1.0000 (for example, 0.7500). Use a maximum of four decimal places.

    Decreasing the value increases the number of matches by including matches with a greater degree of fuzziness – that is, strings that are more different from each other.

  • NGRAM n-gram_length the n-gram length to use

    Specify a whole number, 1 or greater.

    Increasing the n-gram length makes the criterion for similarity between two strings stricter.

    N-grams are overlapping substrings (character blocks) into which the comparison strings are divided as part of the Dice's coefficient calculation.

    Note

    When you specify DICE, the FUZZYJOIN command uses the DICECOEFFICIENT( ) function in an IF statement to conditionally join key field values. For detailed information about the function, see DICECOEFFICIENT( ) function.

LEVDISTANCE use the Levenshtein distance algorithm

  • DISTANCE value the maximum allowable Levenshtein distance between two strings for them to qualify as a fuzzy match

    Specify a whole number, 1 or greater.

    Increasing the value increases the number of matches by including matches with a greater degree of fuzziness – that is, strings that are more different from each other.

    Note

    When you specify LEVDISTANCE, the FUZZYJOIN command uses the LEVDIST( ) function in an IF statement to conditionally join key field values. For detailed information about the function, see LEVDIST( ) function.

    Unlike the function, the Levenshtein distance algorithm in the FUZZYJOIN command automatically trims leading and trailing blanks, and is not case-sensitive.

PKEY primary_key_field

The character key field, or expression, in the primary table.

You can specify only one primary key field.

SKEY secondary_key_field

The character key field, or expression, in the secondary table.

You can specify only one secondary key field.

FIELDS primary_fields | FIELDS ALL

The fields or expressions from the primary table to include in the joined output table.

  • FIELDS primary_fields include the specified field or fields

    Fields are included in the order that you list them.

  • FIELDS ALL include all fields from the table

    Fields are included in the order that they appear in the table layout.

Note

You must explicitly specify the primary key field if you want to include it in the joined table. Specifying FIELDS ALL also includes it.

EXCLUDE primary_fields

optional

Only valid when performing a fuzzy join using FIELDS ALL.

The field or fields to exclude from the command. EXCLUDE allows you to fine-tune FIELDS ALL, by excluding the specified fields.

EXCLUDE must immediately follow FIELDS ALL. For example:

FIELDS ALL EXCLUDE field_1 field_2 

WITH secondary_fields | WITH ALL

optional

The fields or expressions from the secondary table to include in the joined output table.

  • WITH secondary_fields include the specified field or fields

    Fields are included in the order that you list them.

  • WITH ALL include all fields from the table

    Fields are included in the order that they appear in the table layout.

Note

You must explicitly specify the secondary key field if you want to include it in the joined table. Specifying WITH ALL also includes it.

EXCLUDE secondary_fields

optional

Only valid when performing a fuzzy join using WITH ALL.

The field or fields to exclude from the command. EXCLUDE allows you to fine-tune WITH ALL, by excluding the specified fields.

EXCLUDE must immediately follow WITH ALL. For example:

WITH ALL EXCLUDE field_1 field_2

IF test

optional

A conditional expression that must be true in order to process each record. The command is executed on only those records that satisfy the condition.

Note

The IF parameter is evaluated against only the records remaining in a table after any scope parameters have been applied (WHILE, FIRST, NEXT).

Note

The IF condition can reference the primary table, the secondary table, or both.

OPEN

optional

Opens the table created by the command after the command executes. Only valid if the command creates an output table.

TO table_name

The location to send the results of the command to:

  • table_name saves the results to an Analytics table

    Specify table_name as a quoted string with a .FIL file extension. For example: TO "Output.FIL"

    By default, the table data file (.FIL) is saved to the folder containing the Analytics project.

    Use either an absolute or relative file path to save the data file to a different, existing folder:

    • TO "C:\Output.FIL"
    • TO "Results\Output.FIL"

    Note

    Table names are limited to 64 alphanumeric characters, not including the .FIL extension. The name can include the underscore character ( _ ), but no other special characters, or any spaces. The name cannot start with a number.

FIRSTMATCH

optional

Specifies that each primary key value is joined to only the first occurrence of any secondary key matches.

If the first occurrence happens to be an exact match, any subsequent fuzzy matches for the primary key value are not included in the joined output table.

If you omit FIRSTMATCH, the default behavior of FUZZYJOIN is to join each primary key value to all occurrences of any secondary key matches.

FIRSTMATCH is useful if you only want to know if any matches, exact or fuzzy, exist between two tables, and you want to avoid the processing time required to identify all matches.

You can also use FIRSTMATCH if you are certain that at most only one match exists in the secondary table for each primary key value.

WHILE test

optional

A conditional expression that must be true in order to process each record. The command is executed until the condition evaluates as false, or the end of the table is reached.

Note

If you use WHILE in conjunction with FIRST or NEXT, record processing stops as soon as one limit is reached.

FIRST range | NEXT range

optional

The number of records to process:

  • FIRST start processing from the first record until the specified number of records is reached
  • NEXT start processing from the currently selected record until the specified number of records is reached

Use range to specify the number of records to process.

If you omit FIRST and NEXT, all records are processed by default.

APPEND

optional

Appends the command output to the end of an existing file instead of overwriting it.

Note

You must ensure that the structure of the command output and the existing file are identical:

  • the same fields
  • the same field order
  • matching fields are the same length
  • matching fields are the same data type

Analytics appends output to an existing file regardless of its structure. If the structure of the output and the existing file do not match, jumbled, missing, or inaccurate data can result.

ISOLOCALE locale_code

optional

Note

Applicable in the Unicode edition of Analytics only.

The system locale in the format language_country. For example, to use Canadian French, enter fr_ca.

Use the following codes:

  • language ISO 639 standard language code
  • country ISO 3166 standard country code

    If you do not specify a country code, the default country for the language is used.

If you do not use ISOLOCALE, the default system locale is used.

Examples

Use fuzzy matching to join two tables as a way of discovering employees who may also be vendors

The example below joins the Empmast and Vendor tables using address as the common key field (the Address and Vendor_Street fields).

The FUZZYJOIN command creates a new table with either exactly matched or fuzzy matched primary and secondary records. The result is a list of any employees and vendors with either an identical address, or a similar address.

FUZZYJOIN with the Dice coefficient algorithm

OPEN Empmast PRIMARY
OPEN Vendor SECONDARY
FUZZYJOIN DICE PERCENT 0.8000 NGRAM 2 PKEY Address SKEY Vendor_Street FIELDS Employee_Number First_Name Last_Name Address WITH Vendor_Number Vendor_Name Vendor_Street OPEN TO "Employee_Vendor_Match"

FUZZYJOIN with the Levenshtein distance algorithm

OPEN Empmast PRIMARY
OPEN Vendor SECONDARY
FUZZYJOIN LEVDISTANCE DISTANCE 5 PKEY Address SKEY Vendor_Street FIELDS Employee_Number First_Name Last_Name Address WITH Vendor_Number Vendor_Name Vendor_Street OPEN TO "Employee_Vendor_Match"

Include all fields

This version of the FUZZYJOIN command includes all fields from the primary and secondary tables in the joined output table.

OPEN Empmast PRIMARY
OPEN Vendor SECONDARY
FUZZYJOIN LEVDISTANCE DISTANCE 5 PKEY Address SKEY Vendor_Street FIELDS ALL WITH ALL OPEN TO "Employee_Vendor_Match"

Improve the effectiveness of fuzzy matching

The example below uses the SORTWORDS( ) function to improve the effectiveness of fuzzy matching between the Address and Vendor_Street fields. Use of the UPPER( ) function ensures that case does not affect the sorting of elements in key field values.

OPEN Empmast PRIMARY
OPEN Vendor SECONDARY
FUZZYJOIN LEVDISTANCE DISTANCE 5 PKEY SORTWORDS(UPPER(Address)) SKEY SORTWORDS(UPPER(Vendor_Street)) FIELDS Employee_Number First_Name Last_Name Address WITH Vendor_Number Vendor_Name Vendor_Street OPEN TO "Employee_Vendor_Match"

Remarks

For more information about how this command works, see Fuzzy join.

Case sensitivity

The FUZZYJOIN command is not case-sensitive, regardless of which fuzzy matching algorithm you use. So "SMITH" is equivalent to "smith."

Leading and trailing blanks

The FUZZYJOIN command automatically trims leading and trailing blanks in fields, regardless of which fuzzy matching algorithm you use. There is no need to use the TRIM( ) or ALLTRIM( ) functions when specifying the primary and secondary key fields.

Improving the effectiveness of FUZZYJOIN

Three techniques can significantly improve the effectiveness of the FUZZYJOIN command:

  • sorting individual elements in primary and secondary key field values
  • removing generic elements from primary and secondary key field values
  • harmonizing primary and secondary key field values

These techniques allow you to use tighter fuzzy settings and still get the same fuzzy matches, while reducing the number of false positive matches. You can use the techniques separately, or in combination.

Sorting individual elements in key field values

The SORTWORDS( ) function can improve the effectiveness of the FUZZYJOIN command by sorting individual elements in primary and secondary key field values into a sequential order.

Sorting elements, such as the components of an address, can make key field values with the same information, but a different format, more closely resemble each other. A closer resemblance improves the chances that key field values are selected as fuzzy matches for each other.

For more information, see SORTWORDS( ) function.

For a video providing an overview of SORTWORDS( ), see Fuzzy Matching Using SORTWORDS() (English only).

Note

Sorting elements in key field values is best suited for fuzzy joining using the Levenshtein distance algorithm.

Sorting elements when fuzzy joining using the Dice coefficient algorithm may or may not be beneficial. Test a set of sample data before deciding whether to use SORTWORDS( ) in conjunction with the Dice coefficient algorithm in a production setting.

Caution

If you use SORTWORDS( ) in conjunction with the FUZZYJOIN command you must apply SORTWORDS( ) to both strings or both fields being compared.

Removing generic elements from key field values

The OMIT( ) function can improve the effectiveness of the FUZZYJOIN command by removing generic elements such as "Corporation" or "Inc.", or characters such as commas, periods, and ampersands (&), from primary and secondary key field values.

Removal of generic elements and punctuation focuses fuzzy matching on just the portion of the key field values where a meaningful difference may occur.

For more information, see OMIT( ) function.

Harmonizing key field values

The REPLACE( ) or REGEXREPLACE( ) functions can improve the effectiveness of the FUZZYJOIN command by harmonizing variant forms of the same element in primary and secondary key field values. For example, you could harmonize "Street", "St.", and "St" to use the single value "St".

Harmonizing elements can make key field values with the same information, but a different format, more closely resemble each other. A closer resemblance improves the chances that key field values are selected as fuzzy matches for each other.

For more information, see REPLACE( ) function for straightforward replacements, and REGEXREPLACE( ) function for more complex replacements.