Returns a string with individual words sorted in sequential order.
The string, or the expression, containing words to be sorted.
A "word" is any character or group of characters separated by spaces:
Multiple spaces between words are automatically converted to a single space. Leading or trailing spaces are automatically trimmed.
Literal character input
Returns "1 2 A Z a z" (non-Unicode Analytics):
SORTWORDS("Z a 2 z A 1")
Returns "1 2 a A z Z" (Unicode Analytics):
SORTWORDS("Z a 2 z A 1")
Returns "1 2 A A Z Z":
SORTWORDS(UPPER("Z a 2 z A 1"))
Returns "CA, FL NY, TX,":
SORTWORDS("CA, TX, NY, FL")
Returns "CA FL NY TX":
SORTWORDS(OMIT("CA, TX, NY, FL", ","))
Returns all the values in the Vendor_Address field with address elements sorted into sequential order:
Sort address elements to improve fuzzy duplicate matching
You can use SORTWORDS( ) as a helper function when performing various kinds of fuzzy matching in Analytics.
Effect of SORTWORDS( ) on Levenshtein distance
First, let's look at the Levenshtein distance between two occurrences of the same address, differently formatted.
Without the SORTWORDS( ) function, the Levenshtein distance returned is 22. A Levenshtein distance that large suggests the two strings are not the same address:
LEVDIST("125 SW 39TH ST, Suite 100", "Suite 100, 125 SW 39TH ST")
Now, let's add the SORTWORDS( ) function. The Levenshtein distance returned is 2 – dramatically lower – which suggests the two strings are the same address.
LEVDIST(SORTWORDS("125 SW 39TH ST, Suite 100"), SORTWORDS("Suite 100, 125 SW 39TH ST"))
Isolating fuzzy duplicates for "125 SW 39TH ST, Suite 100"
You create a filter that isolates all values in the Vendor_Address field that are within a specified Levenshtein distance of "125 SW 39TH ST, Suite 100":
SET FILTER TO LEVDIST(SORTWORDS(Vendor_Address), SORTWORDS("125 SW 39TH ST, Suite 100"), F) < 3
SET FILTER TO ISFUZZYDUP(SORTWORDS(Vendor_Address), SORTWORDS("125 SW 39TH ST, Suite 100"), 3, 99)
Increasing or decreasing the Levenshtein distance in the expressions ( 3 ) allows you to adjust the degree of difference in the filtered values.
For more information about Levenshtein distance, see LEVDIST( ) function.
For a video providing an overview of the function, see Fuzzy Matching Using SORTWORDS() (English only).
The sort sequence used by SORTWORDS( )
The SORTWORDS( ) function uses whatever sort sequence is specified in the Sort Order option (Tools > Options > Table). The default sort sequences are shown below.
For detailed information, see The Sort Order option and sort sequences .
Sort Order default
Associated sort sequence
Numbers, then uppercase, then lowercase:
0, 1, 2... A, B, C... a, b, c...
For example, “Z” sorts before “a”.
Mix Languages (UCA)
(Unicode collation algorithm)
Numbers, then lowercase and uppercase intermixed:
0, 1, 2... a, A, b, B, c, C...
For example, “a” sorts before “Z”.
SORTWORDS( ) is case sensitive. Depending on which edition of Analytics you are using (non-Unicode or Unicode), casing in strings may affect sorting.
You can use the UPPER( ) function in conjunction with SORTWORDS( ) if you do not want case to affect sorting:
SORTWORDS( ) can improve fuzzy matching
SORTWORDS( ) can improve the effectiveness of fuzzy matching commands, and filters or conditions that use fuzzy matching functions:
- FUZZYJOIN command
- FUZZYDUP command
- ISFUZZYDUP( ) function
- LEVDIST( ) function
- DICECOEFFICIENT( ) function
If you use SORTWORDS( ) in conjunction with any of the fuzzy matching commands or functions you must apply SORTWORDS( ) to both strings or both fields being compared. Applying the function to only one of the two strings or fields can seriously degrade the results of fuzzy matching.
Levenshtein distance algorithm
SORTWORDS( ) is especially helpful when used with commands and functions based on the Levenshtein distance algorithm, which returns greater difference values when identical or similar elements in two strings are in different positions. By sorting the elements first, you can significantly reduce the difference values returned by the Levenshtein algorithm. Result sets have fewer false positives and more true positives.
Dice coefficient algorithm
SORTWORDS( ) may be helpful when used with commands and functions based on the Dice coefficient algorithm. However, an improvement in effectiveness is not always the case. By design, the Dice coefficient algorithm minimizes the importance of the position of elements, so sorting elements has less impact than it does with the Levenshtein distance algorithm.
Depending on the nature of the data, SORTWORDS( ) may actually degrade effectiveness by causing the Dice coefficient algorithm to return lower scores. Test a set of sample data before deciding whether to use SORTWORDS( ) in conjunction with the Dice coefficient algorithm in a production setting.
A second consideration is that the benefit of using SORTWORDS( ) is more modest when the Dice coefficient n-gram length is shorter. As you reduce the n-gram length, the Dice coefficient algorithm increasingly minimizes the importance of the position of elements.