FUZZYJOIN command
Uses fuzzy matching to combine fields from two Analytics tables into a new, single Analytics table.
Note
To detect nearly identical values (fuzzy duplicates) in a single character field, see FUZZYDUP command.
For various options when joining tables using exactly matching key field values, see JOIN command.
Syntax
FUZZYJOIN {DICE PERCENT percentage NGRAM n-gram_length|LEVDISTANCE DISTANCE value} PKEY primary_key_field SKEY secondary_key_field {FIELDS primary_fields|FIELDS ALL} <WITH secondary_fields|WITH ALL> <IF test> <OPEN> <FIRSTMATCH> TO table_name <WHILE test> <FIRST range|NEXT range> <APPEND>
Note
You cannot run the FUZZYJOIN command locally against a server table.
You must specify the FUZZYJOIN command name in full. You cannot abbreviate it.
Parameters
Name | Description |
---|---|
DICE PERCENT percentage NGRAM n-gram_length | LEVDISTANCE DISTANCE value |
The fuzzy matching algorithm to use. DICE use the Dice coefficient algorithm
LEVDISTANCE use the Levenshtein distance algorithm
|
PKEY primary_key_field |
The character key field, or expression, in the primary table. You can specify only one primary key field. |
SKEY secondary_key_field |
The character key field, or expression, in the secondary table. You can specify only one secondary key field. |
FIELDS primary_fields | FIELDS ALL |
The fields or expressions from the primary table to include in the joined output table.
Note You must explicitly specify the primary key field if you want to include it in the joined table. Specifying ALL also includes it. |
WITH secondary_fields | WITH ALL optional |
The fields or expressions from the secondary table to include in the joined output table.
Note You must explicitly specify the secondary key field if you want to include it in the joined table. Specifying ALL also includes it. |
IF test optional |
A conditional expression that must be true in order to process each record. The command is executed on only those records that satisfy the condition. Note The IF parameter is evaluated against only the records remaining in a table after any scope parameters have been applied (WHILE, FIRST, NEXT). Note The IF condition can reference the primary table, the secondary table, or both. |
OPEN optional |
Opens the table created by the command after the command executes. Only valid if the command creates an output table. |
FIRSTMATCH optional |
Specifies that each primary key value is joined to only the first occurrence of any secondary key matches. If the first occurrence happens to be an exact match, any subsequent fuzzy matches for the primary key value are not included in the joined output table. If you omit FIRSTMATCH, the default behavior of FUZZYJOIN is to join each primary key value to all occurrences of any secondary key matches. FIRSTMATCH is useful if you only want to know if any matches, exact or fuzzy, exist between two tables, and you want to reduce the processing time required to identify all matches. You can also use FIRSTMATCH if you are certain that at most only one match exists in the secondary table for each primary key value. Note FIRSTMATCH is available only as an ACLScript parameter. The option is not available in the Analytics user interface. |
TO table_name |
The location to send the results of the command to:
|
WHILE test optional |
A conditional expression that must be true in order to process each record. The command is executed until the condition evaluates as false, or the end of the table is reached. Note If you use WHILE in conjunction with FIRST or NEXT, record processing stops as soon as one limit is reached. |
FIRST range | NEXT range optional |
The number of records to process:
Use range to specify the number of records to process. If you omit FIRST and NEXT, all records are processed by default. |
APPEND optional |
Appends the command output to the end of an existing file instead of overwriting it. Note You must ensure that the structure of the command output and the existing file are identical:
Analytics appends output to an existing file regardless of its structure. If the structure of the output and the existing file do not match, jumbled, missing, or inaccurate data can result. |
ISOLOCALE locale_code optional |
Note Applicable in the Unicode edition of Analytics only. The system locale in the format language_country. For example, to use Canadian French, enter fr_ca. Use the following codes:
If you do not use ISOLOCALE, the default system locale is used. |
Examples
Use fuzzy matching to join two tables as a way of discovering employees who may also be vendors
The example below joins the Empmast and Vendor tables using address as the common key field (the Address and Vendor_Street fields).
The FUZZYJOIN command creates a new table with either exactly matched or fuzzy matched primary and secondary records. The result is a list of any employees and vendors with either an identical address, or a similar address.
FUZZYJOIN with the Dice coefficient algorithm
OPEN Empmast PRIMARY OPEN Vendor SECONDARY FUZZYJOIN DICE PERCENT 0.8000 NGRAM 2 PKEY Address SKEY Vendor_Street FIELDS Employee_Number First_Name Last_Name Address WITH Vendor_Number Vendor_Name Vendor_Street OPEN TO "Employee_Vendor_Match"
FUZZYJOIN with the Levenshtein distance algorithm
OPEN Empmast PRIMARY OPEN Vendor SECONDARY FUZZYJOIN LEVDISTANCE DISTANCE 5 PKEY Address SKEY Vendor_Street FIELDS Employee_Number First_Name Last_Name Address WITH Vendor_Number Vendor_Name Vendor_Street OPEN TO "Employee_Vendor_Match"
Include all fields
This version of the FUZZYJOIN command includes all fields from the primary and secondary tables in the joined output table.
OPEN Empmast PRIMARY OPEN Vendor SECONDARY FUZZYJOIN LEVDISTANCE DISTANCE 5 PKEY Address SKEY Vendor_Street FIELDS ALL WITH ALL OPEN TO "Employee_Vendor_Match"
Remarks
Note
For more information about how this command works, see the Analytics Help.
Case sensitivity
The FUZZYJOIN command is not case-sensitive, regardless of which fuzzy matching algorithm you use. So "SMITH" is equivalent to "smith."
Leading and trailing blanks
The FUZZYJOIN command automatically trims leading and trailing blanks in fields, regardless of which fuzzy matching algorithm you use. There is no need to use the TRIM( ) or ALLTRIM( ) functions when specifying the primary and secondary key fields.