ACL Scripting Guide 14.1

FUZZYJOIN command

Uses fuzzy matching to combine fields from two Analytics tables into a new, single Analytics table.

Note

To detect nearly identical values (fuzzy duplicates) in a single character field, see FUZZYDUP command.

For various options when joining tables using exactly matching key field values, see JOIN command.

FUZZYJOIN {DICE PERCENT percentage NGRAM n-gram_length|LEVDISTANCE DISTANCE value} PKEY primary_key_field SKEY secondary_key_field {FIELDS primary_fields|FIELDS ALL} <WITH secondary_fields|WITH ALL> <IF test> <OPEN> <FIRSTMATCH> TO table_name <WHILE test> <FIRST range|NEXT range> <APPEND>

Note

You cannot run the FUZZYJOIN command locally against a server table.

You must specify the FUZZYJOIN command name in full. You cannot abbreviate it.

Name Description
DICE PERCENT percentage NGRAM n-gram_length | LEVDISTANCE DISTANCE value

The fuzzy matching algorithm to use.

DICE use the Dice coefficient algorithm

  • PERCENT percentage the minimum allowable Dice's coefficient of two strings for them to qualify as a fuzzy match

    Specify a decimal fraction, from 0.0000 to 1.0000 (for example, 0.7500). Use a maximum of four decimal places.

    Decreasing the value increases the number of matches by including matches with a greater degree of fuzziness – that is, strings that are more different from each another.

  • NGRAM n-gram_length the n-gram length to use

    Specify a whole number, 1 or greater.

    Increasing the n-gram length makes the criterion for similarity between two strings stricter.

    N-grams are overlapping substrings (character blocks) into which the comparison strings are divided as part of the Dice's coefficient calculation.

    Note

    When you specify DICE, the FUZZYJOIN command uses the DICECOEFFICIENT( ) function in an IF statement to conditionally join key field values. For detailed information about the function, see DICECOEFFICIENT( ) function.

LEVDISTANCE use the Levenshtein distance algorithm

  • DISTANCE value the maximum allowable Levenshtein distance between two strings for them to qualify as a fuzzy match

    Specify a whole number, 1 or greater.

    Increasing the value increases the number of matches by including matches with a greater degree of fuzziness – that is, strings that are more different from each another.

    Note

    When you specify LEVDISTANCE, the FUZZYJOIN command uses the LEVDIST( ) function in an IF statement to conditionally join key field values. For detailed information about the function, see LEVDIST( ) function.

    Unlike the function, the Levenshtein distance algorithm in the FUZZYJOIN command automatically trims leading and trailing blanks, and is not case-sensitive.

PKEY primary_key_field

The character key field, or expression, in the primary table.

You can specify only one primary key field.

SKEY secondary_key_field

The character key field, or expression, in the secondary table.

You can specify only one secondary key field.

FIELDS primary_fields | FIELDS ALL

The fields or expressions from the primary table to include in the joined output table.

  • primary_fields include the specified field or fields
  • ALL include all fields from the table

Note

You must explicitly specify the primary key field if you want to include it in the joined table. Specifying ALL also includes it.

WITH secondary_fields | WITH ALL

optional

The fields or expressions from the secondary table to include in the joined output table.

  • secondary_fields include the specified field or fields
  • ALL include all fields from the table

Note

You must explicitly specify the secondary key field if you want to include it in the joined table. Specifying ALL also includes it.

IF test

optional

A conditional expression that must be true in order to process each record. The command is executed on only those records that satisfy the condition.

Note

The IF parameter is evaluated against only the records remaining in a table after any scope parameters have been applied (WHILE, FIRST, NEXT).

Note

The IF condition can reference the primary table, the secondary table, or both.

OPEN

optional

Opens the table created by the command after the command executes. Only valid if the command creates an output table.

FIRSTMATCH

optional

Specifies that each primary key value is joined to only the first occurrence of any secondary key matches.

If the first occurrence happens to be an exact match, any subsequent fuzzy matches for the primary key value are not included in the joined output table.

If you omit FIRSTMATCH, the default behavior of FUZZYJOIN is to join each primary key value to all occurrences of any secondary key matches.

FIRSTMATCH is useful if you only want to know if any matches, exact or fuzzy, exist between two tables, and you want to reduce the processing time required to identify all matches.

You can also use FIRSTMATCH if you are certain that at most only one match exists in the secondary table for each primary key value.

Note

FIRSTMATCH is available only as an ACLScript parameter. The option is not available in the Analytics user interface.

TO table_name

The location to send the results of the command to:

  • table_name saves the results to an Analytics table

    Specify table_name as a quoted string with a .FIL file extension. For example: TO "Output.FIL"

    By default, the table data file (.FIL) is saved to the folder containing the Analytics project.

    Use either an absolute or relative file path to save the data file to a different, existing folder:

    • TO "C:\Output.FIL"
    • TO "Results\Output.FIL"

    Note

    Table names are limited to 64 alphanumeric characters, not including the .FIL extension. The name can include the underscore character ( _ ), but no other special characters, or any spaces. The name cannot start with a number.

WHILE test

optional

A conditional expression that must be true in order to process each record. The command is executed until the condition evaluates as false, or the end of the table is reached.

Note

If you use WHILE in conjunction with FIRST or NEXT, record processing stops as soon as one limit is reached.

FIRST range | NEXT range

optional

The number of records to process:

  • FIRST start processing from the first record until the specified number of records is reached
  • NEXT start processing from the currently selected record until the specified number of records is reached

Use range to specify the number of records to process.

If you omit FIRST and NEXT, all records are processed by default.

APPEND

optional

Appends the command output to the end of an existing file instead of overwriting it.

Note

You must ensure that the structure of the command output and the existing file are identical:

  • the same fields
  • the same field order
  • matching fields are the same length
  • matching fields are the same data type

Analytics appends output to an existing file regardless of its structure. If the structure of the output and the existing file do not match, jumbled, missing, or inaccurate data can result.

ISOLOCALE locale_code

optional

Note

Applicable in the Unicode edition of Analytics only.

The system locale in the format language_country. For example, to use Canadian French, enter fr_ca.

Use the following codes:

  • language ISO 639 standard language code
  • country ISO 3166 standard country code

    If you do not specify a country code, the default country for the language is used.

If you do not use ISOLOCALE, the default system locale is used.

Use fuzzy matching to join two tables as a way of discovering employees who may also be vendors

The example below joins the Empmast and Vendor tables using address as the common key field (the Address and Vendor_Street fields).

The FUZZYJOIN command creates a new table with either exactly matched or fuzzy matched primary and secondary records. The result is a list of any employees and vendors with either an identical address, or a similar address.

FUZZYJOIN with the Dice coefficient algorithm

OPEN Empmast PRIMARY
OPEN Vendor SECONDARY
FUZZYJOIN DICE PERCENT 0.8000 NGRAM 2 PKEY Address SKEY Vendor_Street FIELDS Employee_Number First_Name Last_Name Address WITH Vendor_Number Vendor_Name Vendor_Street OPEN TO "Employee_Vendor_Match"

FUZZYJOIN with the Levenshtein distance algorithm

OPEN Empmast PRIMARY
OPEN Vendor SECONDARY
FUZZYJOIN LEVDISTANCE DISTANCE 5 PKEY Address SKEY Vendor_Street FIELDS Employee_Number First_Name Last_Name Address WITH Vendor_Number Vendor_Name Vendor_Street OPEN TO "Employee_Vendor_Match"

Include all fields

This version of the FUZZYJOIN command includes all fields from the primary and secondary tables in the joined output table.

OPEN Empmast PRIMARY
OPEN Vendor SECONDARY
FUZZYJOIN LEVDISTANCE DISTANCE 5 PKEY Address SKEY Vendor_Street FIELDS ALL WITH ALL OPEN TO "Employee_Vendor_Match"

Note

For more information about how this command works, see the Analytics Help.

Case sensitivity

The FUZZYJOIN command is not case-sensitive, regardless of which fuzzy matching algorithm you use. So "SMITH" is equivalent to "smith."

Leading and trailing blanks

The FUZZYJOIN command automatically trims leading and trailing blanks in fields, regardless of which fuzzy matching algorithm you use. There is no need to use the TRIM( ) or ALLTRIM( ) functions when specifying the primary and secondary key fields.