FUZZYJOIN command
Concept Information
Uses fuzzy matching to combine fields from two Analytics tables into a new, single Analytics table.
Note
To detect nearly identical values in a single character field (fuzzy duplicates), see FUZZYDUP command.
To join tables using exactly matching key field values, see JOIN command.
Syntax
FUZZYJOIN {DICE PERCENT percentage NGRAM n-gram_length|LEVDISTANCE DISTANCE value} PKEY primary_key_field SKEY secondary_key_field {FIELDS primary_fields|FIELDS ALL <EXCLUDE primary_fields <...n>>} <WITH secondary_fields|WITH ALL <EXCLUDE secondary_fields <...n>>> <IF test> <OPEN> TO table_name <FIRSTMATCH> <WHILE test> <FIRST range|NEXT range> <APPEND>
Note
You cannot run the FUZZYJOIN command locally against a server table.
You must specify the FUZZYJOIN command name in full. You cannot abbreviate it.
Parameters
Name | Description |
---|---|
DICE PERCENT percentage NGRAM n-gram_length | LEVDISTANCE DISTANCE value |
The fuzzy matching algorithm to use. DICE use the Dice coefficient algorithm
LEVDISTANCE use the Levenshtein distance algorithm
|
PKEY primary_key_field |
The character key field, or expression, in the primary table. You can specify only one primary key field. |
SKEY secondary_key_field |
The character key field, or expression, in the secondary table. You can specify only one secondary key field. |
FIELDS primary_fields | FIELDS ALL |
The fields or expressions from the primary table to include in the joined output table.
Note You must explicitly specify the primary key field if you want to include it in the joined table. Specifying FIELDS ALL also includes it. |
EXCLUDE primary_fields optional |
Only valid when performing a fuzzy join using FIELDS ALL. The field or fields to exclude from the command. EXCLUDE allows you to fine-tune FIELDS ALL, by excluding the specified fields. EXCLUDE must immediately follow FIELDS ALL. For example: FIELDS ALL EXCLUDE field_1 field_2 |
WITH secondary_fields | WITH ALL optional |
The fields or expressions from the secondary table to include in the joined output table.
Note You must explicitly specify the secondary key field if you want to include it in the joined table. Specifying WITH ALL also includes it. |
EXCLUDE secondary_fields optional |
Only valid when performing a fuzzy join using WITH ALL. The field or fields to exclude from the command. EXCLUDE allows you to fine-tune WITH ALL, by excluding the specified fields. EXCLUDE must immediately follow WITH ALL. For example: WITH ALL EXCLUDE field_1 field_2 |
IF test optional |
A conditional expression that must be true in order to process each record. The command is executed on only those records that satisfy the condition. Note The IF parameter is evaluated against only the records remaining in a table after any scope parameters have been applied (WHILE, FIRST, NEXT). Note The IF condition can reference the primary table, the secondary table, or both. |
OPEN optional |
Opens the table created by the command after the command executes. Only valid if the command creates an output table. |
TO table_name |
The location to send the results of the command to:
|
FIRSTMATCH optional |
Specifies that each primary key value is joined to only the first occurrence of any secondary key matches. If the first occurrence happens to be an exact match, any subsequent fuzzy matches for the primary key value are not included in the joined output table. If you omit FIRSTMATCH, the default behavior of FUZZYJOIN is to join each primary key value to all occurrences of any secondary key matches. FIRSTMATCH is useful if you only want to know if any matches, exact or fuzzy, exist between two tables, and you want to avoid the processing time required to identify all matches. You can also use FIRSTMATCH if you are certain that at most only one match exists in the secondary table for each primary key value. |
WHILE test optional |
A conditional expression that must be true in order to process each record. The command is executed until the condition evaluates as false, or the end of the table is reached. Note If you use WHILE in conjunction with FIRST or NEXT, record processing stops as soon as one limit is reached. |
FIRST range | NEXT range optional |
The number of records to process:
Use range to specify the number of records to process. If you omit FIRST and NEXT, all records are processed by default. |
APPEND optional |
Appends the command output to the end of an existing file instead of overwriting it. Note You must ensure that the structure of the command output and the existing file are identical:
Analytics appends output to an existing file regardless of its structure. If the structure of the output and the existing file do not match, jumbled, missing, or inaccurate data can result. |
ISOLOCALE locale_code optional |
Note Applicable in the Unicode edition of Analytics only. The system locale in the format language_country. For example, to use Canadian French, enter fr_ca. Use the following codes:
If you do not use ISOLOCALE, the default system locale is used. |
Examples
Use fuzzy matching to join two tables as a way of discovering employees who may also be vendors
The example below joins the Empmast and Vendor tables using address as the common key field (the Address and Vendor_Street fields).
The FUZZYJOIN command creates a new table with either exactly matched or fuzzy matched primary and secondary records. The result is a list of any employees and vendors with either an identical address, or a similar address.
FUZZYJOIN with the Dice coefficient algorithm
OPEN Empmast PRIMARY
OPEN Vendor SECONDARY
FUZZYJOIN DICE PERCENT 0.8000 NGRAM 2 PKEY Address SKEY Vendor_Street FIELDS Employee_Number First_Name Last_Name Address WITH Vendor_Number Vendor_Name Vendor_Street OPEN TO "Employee_Vendor_Match"
FUZZYJOIN with the Levenshtein distance algorithm
OPEN Empmast PRIMARY
OPEN Vendor SECONDARY
FUZZYJOIN LEVDISTANCE DISTANCE 5 PKEY Address SKEY Vendor_Street FIELDS Employee_Number First_Name Last_Name Address WITH Vendor_Number Vendor_Name Vendor_Street OPEN TO "Employee_Vendor_Match"
Include all fields
This version of the FUZZYJOIN command includes all fields from the primary and secondary tables in the joined output table.
OPEN Empmast PRIMARY
OPEN Vendor SECONDARY
FUZZYJOIN LEVDISTANCE DISTANCE 5 PKEY Address SKEY Vendor_Street FIELDS ALL WITH ALL OPEN TO "Employee_Vendor_Match"
Improve the effectiveness of fuzzy matching
The example below uses the SORTWORDS( ) function to improve the effectiveness of fuzzy matching between the Address and Vendor_Street fields. Use of the UPPER( ) function ensures that case does not affect the sorting of elements in key field values.
OPEN Empmast PRIMARY
OPEN Vendor SECONDARY
FUZZYJOIN LEVDISTANCE DISTANCE 5 PKEY SORTWORDS(UPPER(Address)) SKEY SORTWORDS(UPPER(Vendor_Street)) FIELDS Employee_Number First_Name Last_Name Address WITH Vendor_Number Vendor_Name Vendor_Street OPEN TO "Employee_Vendor_Match"
Remarks
For more information about how this command works, see Fuzzy join.
Case sensitivity
The FUZZYJOIN command is not case-sensitive, regardless of which fuzzy matching algorithm you use. So "SMITH" is equivalent to "smith."
Leading and trailing blanks
The FUZZYJOIN command automatically trims leading and trailing blanks in fields, regardless of which fuzzy matching algorithm you use. There is no need to use the TRIM( ) or ALLTRIM( ) functions when specifying the primary and secondary key fields.
Improving the effectiveness of FUZZYJOIN
Three techniques can significantly improve the effectiveness of the FUZZYJOIN command:
- sorting individual elements in primary and secondary key field values
- removing generic elements from primary and secondary key field values
- harmonizing primary and secondary key field values
These techniques allow you to use tighter fuzzy settings and still get the same fuzzy matches, while reducing the number of false positive matches. You can use the techniques separately, or in combination.
Sorting individual elements in key field values
The SORTWORDS( ) function can improve the effectiveness of the FUZZYJOIN command by sorting individual elements in primary and secondary key field values into a sequential order.
Sorting elements, such as the components of an address, can make key field values with the same information, but a different format, more closely resemble each other. A closer resemblance improves the chances that key field values are selected as fuzzy matches for each other.
For more information, see SORTWORDS( ) function.
For a video providing an overview of SORTWORDS( ), see Fuzzy Matching Using SORTWORDS() (English only).
Note
Sorting elements in key field values is best suited for fuzzy joining using the Levenshtein distance algorithm.
Sorting elements when fuzzy joining using the Dice coefficient algorithm may or may not be beneficial. Test a set of sample data before deciding whether to use SORTWORDS( ) in conjunction with the Dice coefficient algorithm in a production setting.
Caution
If you use SORTWORDS( ) in conjunction with the FUZZYJOIN command you must apply SORTWORDS( ) to both strings or both fields being compared.
Removing generic elements from key field values
The OMIT( ) function can improve the effectiveness of the FUZZYJOIN command by removing generic elements such as "Corporation" or "Inc.", or characters such as commas, periods, and ampersands (&), from primary and secondary key field values.
Removal of generic elements and punctuation focuses fuzzy matching on just the portion of the key field values where a meaningful difference may occur.
For more information, see OMIT( ) function.
Harmonizing key field values
The REPLACE( ) or REGEXREPLACE( ) functions can improve the effectiveness of the FUZZYJOIN command by harmonizing variant forms of the same element in primary and secondary key field values. For example, you could harmonize "Street", "St.", and "St" to use the single value "St".
Harmonizing elements can make key field values with the same information, but a different format, more closely resemble each other. A closer resemblance improves the chances that key field values are selected as fuzzy matches for each other.
For more information, see REPLACE( ) function for straightforward replacements, and REGEXREPLACE( ) function for more complex replacements.